- **Key Phrase Extraction** is an NLP technique used to identify the most important words or phrases in a text.
- It helps in quickly understanding the **core topics** of a document.
- Key phrases are usually **noun phrases** that represent meaningful concepts.
- Common approaches include **rule-based**, **statistical**, and **machine learning** methods.
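- Rule-based methods rely on **part-of-speech (POS) tagging and grammar patterns** to pick out candidate phrases.
- Statistical methods score terms with measures such as **TF-IDF**, which favors terms that are frequent in one document but rare across the corpus.
- ML-based methods learn to recognize key phrases from labeled examples.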
- Key phrase extraction is widely used in **search, summarization, and document indexing**.
- Extracted phrases act as compact descriptors of a document, which supports information retrieval and content analysis.
- Key phrase extraction is a building block for **modern NLP and retrieval-augmented generation (RAG) systems**.
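
The statistical approach from the list above can be sketched in a few lines of plain Python. This is a minimal illustration and not part of the original notebook: the toy corpus `DOCS`, the small `STOP_WORDS` set, and the helpers `tokenize`, `tfidf_scores`, and `top_terms` are all hypothetical names, and the weighting is the plain `tf * log(N / df)` variant without smoothing or normalization.

```python
import math
import re
from collections import Counter

# Hypothetical toy corpus and stop-word list, for illustration only.
DOCS = [
    "Natural language processing enables computers to understand human language.",
    "Search engines rank documents by relevance to a query.",
    "Voice assistants convert speech to text before processing a query.",
]
STOP_WORDS = {"a", "by", "to", "before", "the", "and", "of", "in"}


def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def tfidf_scores(docs):
    """Return one {term: score} dict per document, using tf * log(N / df)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores


def top_terms(score, k=3):
    """Return the k highest-scoring terms, breaking ties alphabetically."""
    return sorted(score, key=lambda t: (-score[t], t))[:k]


scores = tfidf_scores(DOCS)
for doc, score in zip(DOCS, scores):
    print(top_terms(score), "<-", doc)
```

Terms shared by several documents, such as `query` here, receive a lower weight than terms unique to one document. For real corpora, a library implementation such as scikit-learn's `TfidfVectorizer` is usually a better choice than this sketch.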


In [1]:
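import nltk

# Download the resources used below: tokenizer models, stop-word lists, and the POS tagger.
# Newer NLTK releases may instead look for 'punkt_tab' and 'averaged_perceptron_tagger_eng'.
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')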


[nltk_data] Downloading package punkt to C:\Users\Ishan
[nltk_data]     Pande\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to C:\Users\Ishan
[nltk_data]     Pande\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Ishan Pande\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [2]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk import pos_tag, RegexpParser

text = """
Natural Language Processing enables computers to understand human language.
It is widely used in chatbots, search engines, and voice assistants.
"""

# Step 1: Tokenize
tokens = word_tokenize(text)
print("Tokens:", tokens)

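# Step 2: Remove stopwords and non-alphabetic tokens (punctuation, numbers)
stop_words = set(stopwords.words('english'))
print("Stop Words:", stop_words)
filtered_tokens = [word for word in tokens if word.isalpha() and word.lower() not in stop_words]
print("Filtered Tokens:", filtered_tokens)

# Step 3: POS tagging
# Note: the tagger sees only the filtered tokens, so it loses the context that
# function words and punctuation provide. Nouns from separate list items
# (e.g. 'chatbots', 'search engines') can end up adjacent and merge into one chunk.
pos_tags = pos_tag(filtered_tokens)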

# Step 4: Define grammar for noun phrases
# NP = zero or more adjectives (JJ) followed by one or more nouns (NN, NNS, NNP, NNPS)
grammar = "NP: {<JJ>*<NN.*>+}"

chunk_parser = RegexpParser(grammar)
tree = chunk_parser.parse(pos_tags)

# Step 5: Extract key phrases
key_phrases = []

for subtree in tree.subtrees(filter=lambda t: t.label() == 'NP'):
    phrase = " ".join(word for word, tag in subtree.leaves())
    key_phrases.append(phrase)

print("Key Phrases:")
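# dict.fromkeys removes duplicates while keeping first-seen order (a set does not guarantee order)
for phrase in dict.fromkeys(key_phrases):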
    print("-", phrase)


Tokens: ['Natural', 'Language', 'Processing', 'enables', 'computers', 'to', 'understand', 'human', 'language', '.', 'It', 'is', 'widely', 'used', 'in', 'chatbots', ',', 'search', 'engines', ',', 'and', 'voice', 'assistants', '.']
Stop Words: {'haven', 'are', 'here', 'more', 're', 'his', "they'll", "wasn't", 'than', 'there', 'if', 'most', 'only', 'did', 'mustn', 'as', 'all', 'what', 'just', 'in', 'below', 'so', 'some', "isn't", 'won', "he'll", 'can', "it'd", 'wasn', 'shouldn', 'because', 'does', "i've", 'don', 'about', "you're", 'd', 'why', 'further', "it's", 'aren', 'out', "wouldn't", 'has', 'we', 'very', "hadn't", 'didn', 'such', "weren't", 'were', 'again', 'that', 'any', 'which', 'against', "it'll", "mustn't", 'himself', "needn't", 'the', 'on', 'with', 'a', 'few', 'above', 'under', "they'd", 'yourselves', 'an', 'nor', 'should', 'this', 'yourself', 'each', 'own', 'by', "aren't", 'for', 'isn', 'am', 'm', 'where', 'me', 'after', 'was', 'how', 've', "we'll", 'before', 'll', 'my', 'from',