## Stemming and Lemmatization

**Author: Abhishek Dey**

**Date: June 16, 2024**

## Definition:

Stemming and lemmatization are two important pre-processing techniques in NLP that reduce a word to its base or root form. Their purpose is to reduce the dimensionality of text data: inflected variants of the same word collapse into a single token, which shrinks the vocabulary.
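To make the dimensionality claim concrete, here is a minimal sketch that counts distinct tokens before and after normalization. The lemma mapping is hard-coded for illustration (the names `sample_words` and `lemma_of` are assumptions, not part of any library); a real pipeline would obtain the lemmas from a stemmer or lemmatizer.

```python
# Five surface forms that belong to only two underlying words.
sample_words = ["going", "goes", "eating", "eats", "eaten"]

# Hard-coded lemmas for illustration only (an assumption of this sketch).
lemma_of = {
    "going": "go",
    "goes": "go",
    "eating": "eat",
    "eats": "eat",
    "eaten": "eat",
}

vocab_before = set(sample_words)                  # distinct raw tokens
vocab_after = {lemma_of[w] for w in sample_words}  # distinct normalized tokens

print("Vocabulary size before:", len(vocab_before))
print("Vocabulary size after :", len(vocab_after))
```

In a bag-of-words representation each distinct token is one feature, so a smaller vocabulary means fewer feature columns.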



## Stemming:

- In stemming, a word is reduced to its base form (stem) by stripping affixes, mostly suffixes, according to a predefined set of rules. **The resulting stem may not be a valid word**. It is simpler and **faster** than lemmatization.

- Popular stemming algorithms: **Porter Stemmer, Snowball Stemmer**
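The rule-based idea can be sketched with a toy suffix stripper. This is not the Porter or Snowball algorithm, which use many more rules and conditions; `SUFFIX_RULES` and `toy_stem` are hypothetical names chosen for this illustration.

```python
# Toy rule-based stemmer: strips the first matching suffix.
# NOT the Porter or Snowball algorithm; the rule list is an assumption.
SUFFIX_RULES = ["ing", "ly", "ed", "s"]


def toy_stem(word: str) -> str:
    """Return the word with the first matching suffix removed."""
    for suffix in SUFFIX_RULES:
        # Keep at least two characters so short words are not destroyed.
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: len(word) - len(suffix)]
    return word  # no rule applied


for w in ["going", "eats", "fairly", "finalized", "history"]:
    print(w, "->", toy_stem(w))
```

Because the rules only look at characters, a result such as `finaliz` is not a dictionary word, which is the trade-off for speed noted above.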

## Lemmatization:

- In Lemmatization, the word is reduced to its base form (lemma) considering the word's meaning and contextual information. **The resulting lemma is a valid word that can be found in a dictionary**. It's more **accurate** than stemming but relatively **slower**.

- Popular lemmatization algorithm: **WordNet Lemmatizer**
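In contrast to suffix stripping, lemmatization can be thought of as a dictionary lookup keyed by the word and its part of speech. The sketch below uses a tiny hand-written lexicon; `TOY_LEXICON` and `toy_lemmatize` are hypothetical and far simpler than what the WordNet Lemmatizer actually does.

```python
# Hypothetical miniature lexicon: (surface form, POS) -> lemma.
# A real lemmatizer consults a full dictionary such as WordNet.
TOY_LEXICON = {
    ("goes", "v"): "go",
    ("going", "v"): "go",
    ("eaten", "v"): "eat",
    ("better", "a"): "good",
}


def toy_lemmatize(word: str, pos: str = "n") -> str:
    """Look up the lemma for (word, pos); return the word unchanged if unknown."""
    return TOY_LEXICON.get((word.lower(), pos), word)


print(toy_lemmatize("eaten", pos="v"))   # irregular form resolved by lookup
print(toy_lemmatize("better", pos="a"))  # adjective reading
print(toy_lemmatize("better", pos="n"))  # no noun entry, returned as-is
```

The lookup returns real words and handles irregular forms, but only when the right part of speech is supplied.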

### Import library

In [1]:
import nltk
nltk.download('punkt')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /home/abhishek/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /home/abhishek/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [2]:
from nltk.stem import PorterStemmer, SnowballStemmer, WordNetLemmatizer

### Words

In [3]:
words=['going', 'goes', 'history', 'congratulations', 'fairly', 'eating', 'eats', 'eaten', 'finalized', 'finally', 'programming', 'sportingly']

In [4]:
print(words)

['going', 'goes', 'history', 'congratulations', 'fairly', 'eating', 'eats', 'eaten', 'finalized', 'finally', 'programming', 'sportingly']


### Initialize objects

In [5]:
porter_stemmer=PorterStemmer()
snowball_stemmer=SnowballStemmer('english')
wordnet_lemmatizer=WordNetLemmatizer()

### NOTE: The WordNet Lemmatizer accepts a part-of-speech (POS) tag as an optional argument (default: noun)
 
- Noun (n)
- Verb (v)
- Adjective (a)
- Adverb (r)
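In practice the POS tag usually comes from a tagger rather than being typed by hand. Taggers such as `nltk.pos_tag` emit Penn Treebank tags (for example `VBG`, `NNS`, `JJ`), so a small mapping to the four letters above is needed. The helper below is a minimal sketch; the name `penn_to_wordnet_pos` is an assumption, not an NLTK function.

```python
def penn_to_wordnet_pos(tag: str) -> str:
    """Map a Penn Treebank tag to a WordNet POS letter (n, v, a, r)."""
    if tag.startswith("J"):   # JJ, JJR, JJS -> adjective
        return "a"
    if tag.startswith("V"):   # VB, VBD, VBG, VBN, VBP, VBZ -> verb
        return "v"
    if tag.startswith("R"):   # RB, RBR, RBS -> adverb
        return "r"
    return "n"                # everything else falls back to noun


for tag in ["VBG", "NNS", "JJ", "RB", "DT"]:
    print(tag, "->", penn_to_wordnet_pos(tag))
```

Falling back to `'n'` mirrors the lemmatizer's own default when no `pos` is given.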


### Comparing Porter Stemmer, Snowball Stemmer and WordNet Lemmatizer

In [6]:
for word in words:
    # pos='v' treats every word as a verb, so words that are not verbs
    # (e.g. 'history', 'fairly') are returned unchanged by the lemmatizer.
    print("======================================")
    print("Input word : ", word)
    print("stem (porter_stemmer) : ", porter_stemmer.stem(word))
    print("stem (snowball_stemmer) : ", snowball_stemmer.stem(word))
    print("Lemma (Wordnet Lemmatizer) : ", wordnet_lemmatizer.lemmatize(word, pos='v'))

Input word :  going
stem (porter_stemmer) :  go
stem (snowball_stemmer) :  go
Lemma (Wordnet Lemmatizer) :  go
Input word :  goes
stem (porter_stemmer) :  goe
stem (snowball_stemmer) :  goe
Lemma (Wordnet Lemmatizer) :  go
Input word :  history
stem (porter_stemmer) :  histori
stem (snowball_stemmer) :  histori
Lemma (Wordnet Lemmatizer) :  history
Input word :  congratulations
stem (porter_stemmer) :  congratul
stem (snowball_stemmer) :  congratul
Lemma (Wordnet Lemmatizer) :  congratulations
Input word :  fairly
stem (porter_stemmer) :  fairli
stem (snowball_stemmer) :  fair
Lemma (Wordnet Lemmatizer) :  fairly
Input word :  eating
stem (porter_stemmer) :  eat
stem (snowball_stemmer) :  eat
Lemma (Wordnet Lemmatizer) :  eat
Input word :  eats
stem (porter_stemmer) :  eat
stem (snowball_stemmer) :  eat
Lemma (Wordnet Lemmatizer) :  eat
Input word :  eaten
stem (porter_stemmer) :  eaten
stem (snowball_stemmer) :  eaten
Lemma (Wordnet Lemmatizer) :  eat
Input word :  finalized
stem (por