## Stemming and Lemmatization

**Author: Abhishek**

### Definition :

Stemming and lemmatization are two important pre-processing techniques in NLP for reducing a word to its base or root form. Their purpose is to reduce the dimensionality of text data by mapping the different inflected forms of a word to a single token.

### Stemming :

* In stemming, a word is reduced to its base form **(stem)** by stripping affixes, typically suffixes, according to a pre-defined set of rules. **The resulting stem may not be a valid word.** It is simpler and **faster** than lemmatization.

* Popular stemming algorithms: **Porter Stemmer, Snowball Stemmer**
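To make the idea of rule-based suffix stripping concrete, here is a minimal sketch of a toy stemmer. It is not the Porter or Snowball algorithm: the rule list `SUFFIX_RULES` and the function `toy_stem` are hypothetical names made up for this illustration, and real stemmers apply many more rules and conditions.

```python
# Toy suffix-stripping stemmer (illustration only, NOT Porter or Snowball).
# SUFFIX_RULES and toy_stem are hypothetical names for this sketch.
SUFFIX_RULES = ["ing", "ed", "ly", "s"]


def toy_stem(word, min_stem_len=2):
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    for suffix in SUFFIX_RULES:
        if word.endswith(suffix):
            stem = word[: len(word) - len(suffix)]
            if len(stem) >= min_stem_len:
                return stem
    # No rule applied: return the word unchanged
    return word


sample = ["eating", "eats", "eaten", "going", "goes", "finalized"]
stems = [toy_stem(w) for w in sample]
print(stems)

# Fewer distinct tokens after stemming: this is the dimensionality reduction
print(len(set(sample)), "unique words ->", len(set(stems)), "unique stems")
```

Note that stems such as `goe` and `finaliz` are not dictionary words, which is the trade-off a purely rule-based approach makes for speed.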

### Lemmatization :

* In lemmatization, a word is reduced to its base form **(lemma)** using the word's meaning and contextual information, such as its part of speech. The resulting lemma is a **valid word** that can be found in a dictionary. It is more **accurate** than stemming but relatively **slower**.

* Popular Lemmatization algorithm: **WordNet Lemmatizer**
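The dictionary-based nature of lemmatization can be sketched with a tiny lookup table. This is only an illustration under stated assumptions: `TOY_LEMMAS` and `toy_lemmatize` are hypothetical names, and the handful of entries stands in for a full lexical resource such as WordNet.

```python
# Toy dictionary-based lemmatizer (illustration only, not WordNet).
# TOY_LEMMAS and toy_lemmatize are hypothetical names for this sketch.
TOY_LEMMAS = {
    ("goes", "v"): "go",
    ("going", "v"): "go",
    ("eaten", "v"): "eat",
    ("meeting", "v"): "meet",
    ("meeting", "n"): "meeting",
}


def toy_lemmatize(word, pos="n"):
    """Look up (word, pos) in the dictionary; fall back to the word itself."""
    return TOY_LEMMAS.get((word.lower(), pos), word)


# The same surface form can map to different lemmas depending on its POS tag
print(toy_lemmatize("meeting", "v"))
print(toy_lemmatize("meeting", "n"))

# Without the right POS tag the lookup misses and the word is returned as-is
print(toy_lemmatize("eaten"))
print(toy_lemmatize("eaten", "v"))
```

Because the result always comes from a dictionary entry (or is the unchanged input), the output is a valid word whenever the input is one, at the cost of needing the lookup resource and a POS tag.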

### Import Library

In [1]:
import nltk
nltk.download('punkt')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /home/abhishek/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /home/abhishek/nltk_data...


True

In [2]:
from nltk.stem import PorterStemmer, SnowballStemmer, WordNetLemmatizer

### Text data

In [11]:
words = ['going', 'goes', 'history', 'congratulations', 'fairly', 'eating', 'eats', 'eaten', 'finalized', 'finally', 'programming', 'sportingly']

In [None]:
words = [
    {'going': 'v'},
    {'goes': 'v'},
    {'history': 'n'},
    {'congratulations': 'n'},
    {'fairly': 'r'},
    {'eating': 'v'},
    {'eats': 'v'},
    {'eaten': 'v'},
    {'finalized': 'v'},
    {'finally': 'r'},
    {'programming': 'n'},  
    {'sportingly': 'r'},
    {'best': 'a'}
]

In [17]:
words[0]

{'going': 'v'}

### Initialize objects

In [12]:
porter_stem = PorterStemmer()
snowball_stem = SnowballStemmer('english')
wordnet_lemma = WordNetLemmatizer()

### NOTE: The WordNet Lemmatizer takes a part-of-speech (POS) tag as an optional second argument; it defaults to noun (`'n'`), so pass the correct tag to get the right lemma

* Noun (n)

* Verb (v)

* Adjective (a)

* Adverb (r)
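In practice, POS tags often come from a tagger that emits Penn Treebank tags (for example `VBG` or `NNS`) rather than the single letters listed above. The helper below is a minimal sketch of one common way to convert them; the function name `penn_to_wordnet` is hypothetical, and the input is assumed to be Penn Treebank tags such as those returned by `nltk.pos_tag`.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag (assumed input format) to a WordNet POS letter."""
    if tag.startswith("J"):   # JJ, JJR, JJS: adjectives
        return "a"
    if tag.startswith("V"):   # VB, VBD, VBG, VBN, VBP, VBZ: verbs
        return "v"
    if tag.startswith("RB"):  # RB, RBR, RBS: adverbs
        return "r"
    # Everything else falls back to noun, mirroring the lemmatizer's default
    return "n"


for tag in ["VBG", "NNS", "JJ", "RB", "DT"]:
    print(tag, "->", penn_to_wordnet(tag))
```

The returned letter can then be passed as the second argument to `wordnet_lemma.lemmatize(word, pos)`.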

### Comparing Porter stemmer, Snowball stemmer and WordNet lemmatizer

In [34]:
# Each entry in `words` is a single-item dict of the form {word: pos_tag}
for word_dict in words:
    print("====================================================")
    for word, pos in word_dict.items():
        print("Input word : ", word)
        print("Pos tag : ", pos)
        print("Porter Stemmer : ", porter_stem.stem(word))
        print("SnowballStemmer : ", snowball_stem.stem(word))
        print("Wordnet Lemmetizer : ", wordnet_lemma.lemmatize(word, pos))


Input word :  going
Pos tag :  v
Porter Stemmer :  go
SnowballStemmer :  go
Wordnet Lemmetizer :  go
Input word :  goes
Pos tag :  v
Porter Stemmer :  goe
SnowballStemmer :  goe
Wordnet Lemmetizer :  go
Input word :  history
Pos tag :  n
Porter Stemmer :  histori
SnowballStemmer :  histori
Wordnet Lemmetizer :  history
Input word :  congratulations
Pos tag :  n
Porter Stemmer :  congratul
SnowballStemmer :  congratul
Wordnet Lemmetizer :  congratulation
Input word :  fairly
Pos tag :  r
Porter Stemmer :  fairli
SnowballStemmer :  fair
Wordnet Lemmetizer :  fairly
Input word :  eating
Pos tag :  v
Porter Stemmer :  eat
SnowballStemmer :  eat
Wordnet Lemmetizer :  eat
Input word :  eats
Pos tag :  v
Porter Stemmer :  eat
SnowballStemmer :  eat
Wordnet Lemmetizer :  eat
Input word :  eaten
Pos tag :  v
Porter Stemmer :  eaten
SnowballStemmer :  eaten
Wordnet Lemmetizer :  eat
Input word :  finalized
Pos tag :  v
Porter Stemmer :  final
SnowballStemmer :  final
Wordnet Lemmetizer :  finali