
Stemming and lemmatisation are both techniques used in natural language processing (NLP) to reduce words to their base or root form. However, they differ in how they achieve this:

**Stemming:**
- Definition: Stemming involves chopping off the end of words in the hope of getting the root form, which may not always be a valid word. It uses heuristic rules to strip suffixes.
- Example: A stemmer such as Porter reduces "running" and "runs" to "run," but it typically leaves "runner" and the irregular form "ran" unchanged, because no suffix rule maps them to "run."
- Accuracy: Stemming is generally faster but less accurate because it strips common suffixes by rule, without considering the word's context or part of speech.
- Output: Often produces stems that are not real words, such as "leav" for "leaves" or "fli" for "flying" with the Porter stemmer.
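
To make the idea of heuristic suffix stripping concrete, here is a minimal sketch in plain Python. It is a toy, not the Porter algorithm: the rule list `SUFFIX_RULES`, the function name `toy_stem`, and the `min_stem_len` threshold are all assumptions made for illustration.

```python
# Toy suffix-stripping stemmer: an illustration only, NOT the Porter algorithm.
# The rule list and the minimum stem length are hypothetical choices.
SUFFIX_RULES = ["ing", "es", "ly", "s"]  # checked in order, so "es" is tried before "s"


def toy_stem(word: str, min_stem_len: int = 3) -> str:
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    word = word.lower()
    for suffix in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word  # no rule matched, so the word is returned unchanged


for w in ["running", "leaves", "quickly", "ran"]:
    print(w, "->", toy_stem(w))
# running -> runn
# leaves -> leav
# quickly -> quick
# ran -> ran
```

The stems "runn" and "leav" show how blind suffix removal yields non-words, and "ran" passes through untouched because no suffix rule can recover the base of an irregular form.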


**Lemmatisation:**
- Definition: Lemmatisation reduces words to their base form, called a "lemma," using morphological analysis and, ideally, the word's part of speech. It relies on a dictionary (such as WordNet) so that the result is a valid word.
- Example: Given the adjective "better," a lemmatiser returns "good," because the dictionary records the relationship between the two forms. This requires telling the lemmatiser that the word is an adjective.
- Accuracy: More accurate than stemming when the part of speech is known, but typically slower since it requires a lookup in lexical resources.
- Output: Normally a valid word in the language. Words the dictionary cannot resolve are returned unchanged, and a wrong part of speech can still produce odd results.
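
The dictionary lookup behind lemmatisation can be sketched with a small table keyed by word and part of speech. This is only an illustration under stated assumptions: `LEMMA_TABLE` and `toy_lemmatize` are hypothetical names, and the handful of entries stands in for a real lexical resource such as WordNet.

```python
# Toy dictionary-based lemmatiser: the lookup table is a hypothetical stand-in
# for a real lexical resource such as WordNet.
LEMMA_TABLE = {
    ("better", "a"): "good",
    ("running", "v"): "run",
    ("ran", "v"): "run",
    ("leaves", "n"): "leaf",
    ("leaves", "v"): "leave",
}


def toy_lemmatize(word: str, pos: str = "n") -> str:
    """Return the lemma for (word, pos), or the word unchanged if it is not listed."""
    return LEMMA_TABLE.get((word.lower(), pos), word)


print(toy_lemmatize("better", pos="a"))  # good
print(toy_lemmatize("leaves", pos="n"))  # leaf
print(toy_lemmatize("leaves", pos="v"))  # leave
print(toy_lemmatize("better"))           # better (no noun entry, returned unchanged)
```

Keying the table on both the word and its part of speech is what lets "leaves" resolve to "leaf" as a noun and "leave" as a verb, which a suffix rule alone cannot do.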

**Summary:**
Stemming is faster but less precise, often producing truncated stems that are not real words.
Lemmatisation is slower but more accurate and normally returns valid dictionary words, provided the part of speech is correct.

In [None]:
!pip install nltk



In [None]:
import nltk
nltk.download('wordnet')
nltk.download('omw-1.4')  # For wordnet's multilingual support
nltk.download('punkt')  # For tokenization (newer NLTK releases may require 'punkt_tab' instead)


[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
import nltk
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer

# Sample text
text = "The leaves on the trees are falling. The birds are flying. He ran quickly, but running was hard."

# Tokenize the text
words = nltk.word_tokenize(text)

# Stemming
stemmer = PorterStemmer()
stemmed_words = [stemmer.stem(word) for word in words]

# Lemmatisation
lemmatizer = WordNetLemmatizer()
lemmatized_words = [lemmatizer.lemmatize(word) for word in words]

# Output the results
print("Original Words:", words)
print("Stemmed Words:", stemmed_words)
print("Lemmatized Words:", lemmatized_words)


Original Words: ['The', 'leaves', 'on', 'the', 'trees', 'are', 'falling', '.', 'The', 'birds', 'are', 'flying', '.', 'He', 'ran', 'quickly', ',', 'but', 'running', 'was', 'hard', '.']
Stemmed Words: ['the', 'leav', 'on', 'the', 'tree', 'are', 'fall', '.', 'the', 'bird', 'are', 'fli', '.', 'he', 'ran', 'quickli', ',', 'but', 'run', 'wa', 'hard', '.']
Lemmatized Words: ['The', 'leaf', 'on', 'the', 'tree', 'are', 'falling', '.', 'The', 'bird', 'are', 'flying', '.', 'He', 'ran', 'quickly', ',', 'but', 'running', 'wa', 'hard', '.']
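
To see exactly where the two outputs diverge, the printed lists can be compared token by token. The sketch below copies the three lists from the output above into plain Python (so it needs no NLTK data) and lowercases the lemma before comparing, since the Porter stemmer lowercases its input. The variable names are assumptions chosen for this illustration.

```python
# Lists copied verbatim from the output above.
words = ['The', 'leaves', 'on', 'the', 'trees', 'are', 'falling', '.', 'The', 'birds', 'are',
         'flying', '.', 'He', 'ran', 'quickly', ',', 'but', 'running', 'was', 'hard', '.']
stemmed = ['the', 'leav', 'on', 'the', 'tree', 'are', 'fall', '.', 'the', 'bird', 'are',
           'fli', '.', 'he', 'ran', 'quickli', ',', 'but', 'run', 'wa', 'hard', '.']
lemmatized = ['The', 'leaf', 'on', 'the', 'tree', 'are', 'falling', '.', 'The', 'bird', 'are',
              'flying', '.', 'He', 'ran', 'quickly', ',', 'but', 'running', 'wa', 'hard', '.']

# Collect the original tokens whose stem and (lowercased) lemma disagree.
differing = []
for original, stem, lemma in zip(words, stemmed, lemmatized):
    if stem != lemma.lower():
        differing.append(original)
        print(f"{original:10} stem={stem:10} lemma={lemma}")

print(differing)  # ['leaves', 'falling', 'flying', 'quickly', 'running']
```

Two things stand out. The lemmatiser leaves "falling," "flying," and "running" untouched because, without a POS tag, it treats every word as a noun. For the same reason both tools turn "was" into "wa": the lemmatiser strips what it takes to be a plural "s."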


In **lemmatisation**, specifying the correct **Part of Speech (POS)** tag improves accuracy by providing context to the lemmatizer. Each word can play different grammatical roles (noun, verb, adjective, etc.), and depending on this role, its lemma (base form) may differ.

**Common POS Tags (as accepted by the WordNet lemmatizer):**
**Noun ('n')**: Names of things, places, or people (e.g., "dog," "car").

**Verb ('v')**: Action words (e.g., "run," "jump").

**Adjective ('a')**: Describes qualities (e.g., "big," "red").

**Adverb ('r')**: Describes how something is done (e.g., "quickly").

For example:

The word "running" can be either a verb ("He is running fast") or a noun ("Running is fun").

**Verb lemma:** "run"

**Noun lemma:** "running"

If the lemmatizer isn't given a POS tag, it assumes the word is a noun, which can lead to incorrect results in some cases. By specifying the POS tag, you ensure the word is lemmatised according to its correct grammatical role.

In [None]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# Without POS tagging
print(lemmatizer.lemmatize("running"))  # Assumes it's a noun, so output: 'running'

# With POS tagging
print(lemmatizer.lemmatize("running", pos='v'))  # Correctly recognizes it as a verb, so output: 'run'


running
run


In [None]:
#ASHIS KUMAR SAHU

## Applying POS Tagging Automatically:
You can use **nltk.pos_tag** to automatically tag words in a sentence with their POS tags and then use this information in lemmatisation.

In [None]:
# Newer NLTK releases may instead require 'averaged_perceptron_tagger_eng'
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

In [None]:
import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer  # needed below; makes this cell runnable on its own

# Function to map NLTK POS tags to WordNet POS tags
def get_wordnet_pos(nltk_tag):
    if nltk_tag.startswith('J'):
        return wordnet.ADJ
    elif nltk_tag.startswith('V'):
        return wordnet.VERB
    elif nltk_tag.startswith('N'):
        return wordnet.NOUN
    elif nltk_tag.startswith('R'):
        return wordnet.ADV
    else:
        return None

# Sample sentence
sentence = "The leaves on the trees are falling. The birds are flying."

# Tokenize the sentence
words = nltk.word_tokenize(sentence)

# Get POS tags
pos_tags = nltk.pos_tag(words)

# Lemmatize using POS tags
lemmatizer = WordNetLemmatizer()
lemmatized_sentence = []
for word, tag in pos_tags:
    wordnet_pos = get_wordnet_pos(tag) or wordnet.NOUN  # Default to noun if no tag
    lemmatized_sentence.append(lemmatizer.lemmatize(word, pos=wordnet_pos))

print("Original:", words)
print("POS Tags:", pos_tags)
print("Lemmatized:", lemmatized_sentence)


Original: ['The', 'leaves', 'on', 'the', 'trees', 'are', 'falling', '.', 'The', 'birds', 'are', 'flying', '.']
POS Tags: [('The', 'DT'), ('leaves', 'NNS'), ('on', 'IN'), ('the', 'DT'), ('trees', 'NNS'), ('are', 'VBP'), ('falling', 'VBG'), ('.', '.'), ('The', 'DT'), ('birds', 'NNS'), ('are', 'VBP'), ('flying', 'VBG'), ('.', '.')]
Lemmatized: ['The', 'leaf', 'on', 'the', 'tree', 'be', 'fall', '.', 'The', 'bird', 'be', 'fly', '.']


The Penn Treebank tags that appear in the output are:

- DT (Determiner): Introduces a noun (e.g., "The").
- NNS (Noun, Plural): Plural nouns (e.g., "leaves," "trees," "birds").
- IN (Preposition or Subordinating Conjunction): Shows relationships between words (e.g., "on").
- VBP (Verb, Present Tense, Non-3rd Person Singular): Present-tense verb used with any subject other than third-person singular (e.g., "are").
- VBG (Verb, Gerund or Present Participle): Verbs in the "-ing" form (e.g., "falling," "flying").
- . (Punctuation): Sentence-final punctuation mark.
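
As a compact view of how these tags collapse into WordNet's four classes, the first-letter mapping can also be written as a dictionary lookup. This is a sketch of the same idea as `get_wordnet_pos` above, written without NLTK so it runs anywhere. The names `FIRST_LETTER_TO_WORDNET` and `coarse_pos` are assumptions for illustration, and the single-letter values are the strings behind `wordnet.ADJ`, `wordnet.VERB`, `wordnet.NOUN`, and `wordnet.ADV`.

```python
# Coarse POS letters used by WordNet: 'a' adjective, 'v' verb, 'n' noun, 'r' adverb.
FIRST_LETTER_TO_WORDNET = {"J": "a", "V": "v", "N": "n", "R": "r"}


def coarse_pos(penn_tag: str, default: str = "n") -> str:
    """Map a Penn Treebank tag to a WordNet POS letter, falling back to noun."""
    return FIRST_LETTER_TO_WORDNET.get(penn_tag[:1], default)


for tag in ["DT", "NNS", "IN", "VBP", "VBG", "."]:
    print(tag, "->", coarse_pos(tag))
# DT -> n
# NNS -> n
# IN -> n
# VBP -> v
# VBG -> v
# . -> n
```

Only the two verb tags change the lemmatiser's behaviour here. Every other tag falls back to the noun default, which is why "are" becomes "be" and "falling" becomes "fall" while determiners and punctuation pass through unchanged.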