In [1]:
! pip install nltk



# Lemmatization

We prefer lemmatization to stemming because it preserves word meaning and context. Lemmatization is a text preprocessing technique that reduces
words to their base or dictionary form, known as the lemma, which gives a more accurate semantic representation. It takes a word’s grammatical context
and part of speech into account, so the result is always a real word. For instance, both “jumps” and “jumping” are lemmatized to “jump,” preserving
semantic integrity. The trade-off is that lemmatization can be more computationally intensive than stemming. We can perform
lemmatization using the NLTK and spaCy libraries. As we’ll see, NLTK’s WordNet lemmatizer may not handle every case as well as spaCy’s: it looks at each
word in isolation and treats it as a noun unless a part-of-speech tag is supplied, whereas spaCy uses the tags it predicts from the surrounding sentence.
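
To make the contrast with stemming concrete, here is a toy sketch in plain Python. The suffix list, the `TOY_LEMMAS` table, and both function names are made up for illustration; this is not how NLTK or spaCy implement either technique.

```python
# Toy illustration only: a hand-written suffix stripper and a tiny lookup table,
# not the algorithms NLTK or spaCy actually use.

def toy_stem(word):
    """Strip a few common suffixes without checking that the result is a word."""
    for suffix in ("ing", "es", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical mini-dictionary mapping inflected forms to their lemmas.
TOY_LEMMAS = {
    "jumps": "jump",
    "jumping": "jump",
    "studies": "study",
    "better": "good",
    "ran": "run",
}

def toy_lemmatize(word):
    """Look the word up; fall back to the word itself when it is unknown."""
    return TOY_LEMMAS.get(word, word)

for w in ["jumps", "jumping", "studies", "better", "ran"]:
    print(f"{w:10} stem={toy_stem(w):10} lemma={toy_lemmatize(w)}")
```

The stemmer turns "studies" into the non-word "studi" and cannot relate "better" to "good", while the dictionary lookup returns a valid base form for every word it knows.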


In [None]:
### 1. WordNet Lemmatizer

In [11]:
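import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet', quiet=True)  # the lemmatizer needs the WordNet data

# Create the lemmatizer once and reuse it
wnl = WordNetLemmatizer()

# example 1: without a pos argument, words are treated as nouns
print(wnl.lemmatize('dogs'))
print(wnl.lemmatize('churches'))
print(wnl.lemmatize('aardwolves'))

print("="*100)

# example 2: pass pos='v' to lemmatize the words as verbs
words = ["eating", "eats", "eaten", "writing", "writes", "programming", "programs", "history", "finally", "finalized"]
for word in words:
    print(word + "---->" + wnl.lemmatize(word, pos='v'))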
       

dog
church
aardwolf
eating---->eat
eats---->eat
eaten---->eat
writing---->write
writes---->write
programming---->program
programs---->program
history---->history
finally---->finally
finalized---->finalize


In [14]:
# example 3: tokenization, POS tagging and lemmatization of a paragraph

import nltk

# Download necessary NLTK data
# (recent NLTK releases may instead ask for 'punkt_tab' and 'averaged_perceptron_tagger_eng')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

# Define the paragraph
paragraph = "The quick brown fox jumps over the lazy dog."

# Tokenize the paragraph and get part-of-speech tags
tokens = nltk.word_tokenize(paragraph)
print(tokens)
pos_tags = nltk.pos_tag(tokens)
print(pos_tags)

# Create a WordNetLemmatizer object
lemmatizer = nltk.WordNetLemmatizer()

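# Lemmatize each word based on its part-of-speech tag.
# The first letter of the Penn Treebank tag is mapped to a WordNet POS code.
lemmatized_words = []
for word, pos_tag in pos_tags:
  tag = {
      'N': 'n',  # nouns (NN, NNS, ...)
      'V': 'v',  # verbs (VB, VBZ, ...)
      'J': 'a',  # adjectives (JJ, JJR, JJS)
      'R': 'r'   # adverbs (RB, RBR, RBS)
  }.get(pos_tag[0], 'n')  # Default to noun if tag is unknown
  lemma = lemmatizer.lemmatize(word, pos=tag)
  lemmatized_words.append(lemma)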

# Print the lemmatized words
print(lemmatized_words)

['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.']
[('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'), ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN'), ('.', '.')]
['The', 'quick', 'brown', 'fox', 'jump', 'over', 'the', 'lazy', 'dog', '.']


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ariji\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\ariji\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\ariji\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
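
The tag lookup in the loop above can be factored into a small reusable function. The sketch below is one possible version; `penn_to_wordnet` and its `default` parameter are hypothetical names, not part of NLTK. It relies on the fact that Penn Treebank adjective tags start with `J` (`JJ`, `JJR`, `JJS`) and adverb tags with `RB`.

```python
def penn_to_wordnet(tag, default="n"):
    """Map a Penn Treebank tag (e.g. 'VBZ') to a WordNet POS code.

    Hypothetical helper for illustration; `default` is returned for tags
    outside the four WordNet categories (determiners, prepositions, ...).
    """
    if tag.startswith("J"):    # JJ, JJR, JJS -> adjective
        return "a"
    if tag.startswith("V"):    # VB, VBD, VBG, VBN, VBP, VBZ -> verb
        return "v"
    if tag.startswith("N"):    # NN, NNS, NNP, NNPS -> noun
        return "n"
    if tag.startswith("RB"):   # RB, RBR, RBS -> adverb (excludes RP, the particle tag)
        return "r"
    return default

# A few (word, tag) pairs in the same shape nltk.pos_tag returns
tagged = [("The", "DT"), ("quick", "JJ"), ("fox", "NN"), ("jumps", "VBZ")]
print([(word, penn_to_wordnet(tag)) for word, tag in tagged])
```

With such a helper, the loop body would reduce to `lemmatizer.lemmatize(word, pos=penn_to_wordnet(pos_tag))`.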


In [1]:
# detailed example

import pandas as pd
import nltk
import spacy
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
nltk.download('punkt', quiet=True)
nltk.download('wordnet', quiet=True)
nltk.download('stopwords', quiet=True)

True

In [2]:
# Read the necessary dataset

df = pd.read_csv("C:/Users/ariji/OneDrive/Desktop/Data/reviews.csv")
df.head()

Unnamed: 0,review_id,text
0,txt145,The software had a steep learning curve at fir...
1,txt327,I'm really impressed with the user interface o...
2,txt209,The latest update to the software fixed severa...
3,txt825,I encountered a few glitches while using the s...
4,txt878,I was skeptical about trying the software init...


In [3]:
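nlp = spacy.load('en_core_web_sm')
lemmatizer = WordNetLemmatizer()
nltk_stop_words = set(stopwords.words('english'))

def lemmatize_with_nltk(text):
    """Tokenize with NLTK, drop stop words and non-alphabetic tokens, then lemmatize."""
    tokens = word_tokenize(text)
    filtered_tokens = [token for token in tokens if token.isalpha() and token not in nltk_stop_words]
    # No POS tag is passed, so WordNet treats every token as a noun
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in filtered_tokens]
    lemmatized_text = ' '.join(lemmatized_tokens)
    return lemmatized_text

def lemmatize_with_spacy(text):
    """Run the spaCy pipeline, drop stop words and non-alphabetic tokens, then join the lemmas."""
    doc = nlp(text)
    # token.lemma_ is computed by the pipeline from the token's sentence context
    lemmatized_tokens = [token.lemma_ for token in doc if not token.is_stop and token.text.isalpha()]
    lemmatized_text = ' '.join(lemmatized_tokens)
    return lemmatized_text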
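
One detail of `lemmatize_with_nltk` is that the test `token not in nltk_stop_words` is case-sensitive, while NLTK's English stop-word list is lowercase, so capitalized tokens such as "The" or "I" pass through the filter. A minimal sketch of a case-insensitive variant is shown below; `TOY_STOP_WORDS` and `filter_tokens` are hypothetical names, and the tiny stop-word set only stands in for the real list.

```python
# Hypothetical mini stop-word set; NLTK's English list is much longer.
TOY_STOP_WORDS = {"the", "i", "a", "was", "with", "at"}

def filter_tokens(tokens, stop_words=TOY_STOP_WORDS):
    """Keep alphabetic tokens whose lowercase form is not a stop word."""
    return [t for t in tokens if t.isalpha() and t.lower() not in stop_words]

tokens = ["The", "software", "had", "a", "steep", "learning", "curve", "."]
print(filter_tokens(tokens))
```

Applying the same `token.lower()` comparison inside `lemmatize_with_nltk` would change its results, so it is kept separate here as an option rather than a replacement.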

In [4]:
df['lemmatized_text_nltk'] = df['text'].apply(lemmatize_with_nltk)
df['lemmatized_text_spacy'] = df['text'].apply(lemmatize_with_spacy)
print("NLTK Lemmatized Text:")
print(df['lemmatized_text_nltk'])
print("\nSpaCy Lemmatized Text:")
print(df['lemmatized_text_spacy'])

NLTK Lemmatized Text:
0     The software steep learning curve first I star...
1     I really impressed user interface software It ...
2     The latest update software fixed several bug i...
3     I encountered glitch using software customer s...
4     I skeptical trying software initially turned p...
5     The analytics feature provided u valuable insi...
6     I appreciate regular update software receives ...
7     I attended training session software greatly i...
8     The software documentation could comprehensive...
9     I recommended software colleague due excellent...
10    The software integration plugins expanded func...
11    I looking forward upcoming release software pr...
12    The user community active supportive making ea...
13    I using software I consistently impressed stab...
14    The user interface could use modernization fee...
15           I went run software good job mapping route
Name: lemmatized_text_nltk, dtype: object

SpaCy Lemmatized Text:
0     software s