<a href="https://colab.research.google.com/github/amruthaduvvuri/ML---practise-projects-/blob/main/Keyword_Extractor.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Keywords can also be extracted with RAKE (Rapid Automatic Keyword Extraction) alongside NLTK (Natural Language Toolkit), but here we use a combination of tokenization, stopword filtering, and frequency counting to extract meaningful keywords.
Other methods we could use include:
1. spaCy - used to identify nouns and named entities
2. Term Frequency-Inverse Document Frequency (TF-IDF) - a statistical method for evaluating how important a word is to a document relative to a collection of documents
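
To make the TF-IDF idea concrete, here is a minimal standard-library sketch. It is not the method used in this notebook. The helper names `tokenize` and `tf_idf`, the regex tokenizer, the three sample sentences, and the unsmoothed `log(N / df)` weighting are all assumptions chosen for illustration; libraries such as scikit-learn use smoothed variants.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Simplified tokenizer (assumption): keep lowercase alphabetic runs only
    return re.findall(r"[a-z]+", text.lower())

def tf_idf(documents):
    """Return one {term: score} dict per document."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty documents
        scores.append({
            # term frequency * inverse document frequency (unsmoothed)
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores

# Hypothetical sample documents
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the bird sang on the branch",
]
scores = tf_idf(docs)
# Two highest-scoring terms of the first document
top = sorted(scores[0], key=scores[0].get, reverse=True)[:2]
print(top)
```

In this sketch "the" appears in every document, so its inverse document frequency is log(1) = 0 and it drops out without any stopword list, while terms unique to one document score highest.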

Importing Libraries

In [None]:
import nltk
from nltk.corpus import stopwords # Corpus reader that provides common English stopwords
from nltk.tokenize import word_tokenize # Tokenizer to split text into individual words
from collections import Counter # Efficient way to count word frequencies

Download the NLTK resources needed for tokenization (`punkt_tab`) and stopword removal (`stopwords`)

In [None]:
nltk.download('punkt_tab')
nltk.download('stopwords')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

 Extracts the top `num_keywords` keywords from the given text by:
    1. Tokenizing the text into individual words
    2. Removing stopwords and non-alphabetic characters (like punctuation)
    3. Counting the frequency of remaining words
    4. Returning the most common keywords based on frequency

    Parameters:
    - text (str): The input text from which keywords need to be extracted.
    - num_keywords (int): Number of top keywords to return (default is 5).

    Returns:
    - list: List of extracted keywords in descending order of frequency.

In [None]:
def extract_keywords(text, num_keywords=5):
    # Step 1: Tokenization - split the lowercased text into individual words
    words = word_tokenize(text.lower())  # Lowercase so "The" and "the" count as the same word

    # Step 2: Removing stopwords and non-alphabetic tokens (like punctuation)
    # Stopwords are common words like "the", "is", "and", etc., which usually don't add value
    # Build the stopword set once; calling stopwords.words() per token reloads the list each time
    stop_words = set(stopwords.words('english'))
    filtered_words = [word for word in words if word.isalpha() and word not in stop_words]

    # Step 3: Frequency count using Counter
    word_freq = Counter(filtered_words)

    # Step 4: Extract the most common keywords
    # `.most_common(n)` returns the `n` most frequently occurring items as (word, frequency) pairs
    keywords = word_freq.most_common(num_keywords)

    # Extract only the keywords (ignoring their counts)
    return [word for word, freq in keywords]

# Example usage text
text = "The impact of artificial intelligence on modern technology is growing rapidly."

# Display the extracted keywords
print(extract_keywords(text))


['impact', 'artificial', 'intelligence', 'modern', 'technology']
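
Every word in the example sentence occurs exactly once, so the output above is simply the first five non-stopwords in reading order ("growing" and "rapidly" are cut off). The sketch below illustrates this tie behavior of `Counter.most_common` using hand-written token lists. The lists stand in for the filtered tokens so the snippet runs without NLTK, and the repeated words are an assumption added for illustration.

```python
from collections import Counter

# Tokens as they would look after stopword filtering (hand-written here,
# an assumption so the example does not depend on NLTK)
single = ["impact", "artificial", "intelligence", "modern",
          "technology", "growing", "rapidly"]

# Every count is 1, so most_common(5) keeps the first five tokens encountered
first_five = [word for word, _ in Counter(single).most_common(5)]
print(first_five)

# Hypothetical longer text in which some words repeat
repeated = single + ["intelligence", "technology", "intelligence"]

# Now frequency decides the order; remaining ties fall back to first occurrence
ranked = [word for word, _ in Counter(repeated).most_common(3)]
print(ranked)
```

On short inputs the ranking therefore reflects word position more than importance; frequency-based extraction becomes more informative as the text gets longer.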
