# Stemming

Often when searching text for a certain keyword, it helps if the search returns variations of the word. For instance, searching for "boat" might also return "boats" and "boating". Here, "boat" would be the stem for [boat, boater, boating, boats].
Stemming is a somewhat crude method for cataloging related words; it essentially chops off letters from the end until the stem is reached. This works fairly well in most cases, but unfortunately English has many exceptions where a more sophisticated process is required.
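
To make the "chopping" idea concrete, here is a minimal sketch of a suffix stripper. The suffix list, the `naive_stem` name, and the `min_stem_len` parameter are assumptions made up for this illustration; no real stemmer is this simple.

```python
# Hypothetical suffix list for illustration; real stemmers use far richer rules.
SUFFIXES = ("ing", "er", "s")

def naive_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

for w in ["boat", "boater", "boating", "boats", "running", "ran"]:
    print(w, "-->", naive_stem(w))
```

The "boat" family collapses to a single stem, but "running" becomes "runn" and "ran" is left untouched, which is the kind of exception that calls for a more sophisticated process.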


Stemming is a kind of normalization for words. Normalization converts the words in a text to a standard form, so that a lookup has fewer distinct entries to match. Words that share a meaning but vary in form with the context or sentence are mapped to the same entry.

In other words, there is one root word, but it can appear in many variations.

For example, the root word "run" has the variations "runs", "running", "ran", and so on.

With the help of stemming, we can reduce such variations to a shared root, although irregular forms such as "ran" are often missed, as the outputs below show.


Stemming is a data-preprocessing step. English has many variations of a single word, and these variations create ambiguity during machine learning training and prediction. To build a successful model, it is important to map such words to a common form using stemming. Stemming is also a way to remove redundant data from the raw text of a set of sentences, which is another form of normalization.
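
As a rough illustration of why this matters for features, the sketch below counts tokens with and without normalization. The `ROOTS` mapping and the `count_features` helper are hypothetical stand-ins for a real stemmer, written only for this example.

```python
from collections import Counter

# Hypothetical hand-written mapping standing in for a real stemmer.
ROOTS = {"runs": "run", "running": "run", "waits": "wait", "waited": "wait"}

def count_features(tokens, normalize=lambda t: t):
    """Count tokens after passing each one through the normalize function."""
    return Counter(normalize(t) for t in tokens)

tokens = "run runs running wait waits waited".split()

raw = count_features(tokens)
normalized = count_features(tokens, normalize=lambda t: ROOTS.get(t, t))

print(len(raw), dict(raw))                # six distinct features
print(len(normalized), dict(normalized))  # two distinct features
```

Six separate features shrink to two, so a model sees "run" and "running" as evidence for the same thing rather than as unrelated tokens.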



In [0]:
# Install NLTK if not already installed: uncomment the next line and run this cell.
#! pip install nltk

In [0]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

## Porter Stemmer

One of the most common - and effective - stemming tools is Porter's Algorithm developed by Martin Porter in 1980. The algorithm employs five phases of word reduction, each with its own set of mapping rules.
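
To give a feel for what one set of mapping rules looks like, here is a sketch of step 1a from Porter's published algorithm, which handles plural endings. The function name and the table layout are assumptions for this illustration; this is not NLTK's code, and NLTK's implementation may differ in details.

```python
# Step 1a of Porter's original algorithm as (suffix, replacement) pairs.
# Rules are tried in order, longest suffix first, and the first match wins.
STEP_1A_RULES = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]

def porter_step_1a(word):
    """Apply only step 1a; the full algorithm has several more steps."""
    for suffix, replacement in STEP_1A_RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word

for w in ["caresses", "ponies", "caress", "cats", "studies"]:
    print(w, "-->", porter_step_1a(w))
```

The `ies` rule is why stems such as "studi" and "cri" appear in the outputs later in this notebook: the stem only has to be consistent across variations, not a dictionary word.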



In [3]:
from nltk.stem import PorterStemmer

list_of_words = ["wait", "waiting", "waited", "waits",
                 "run", "runner", "running", "ran", "runs", "easily", "fairly"]
ps = PorterStemmer()
for w in list_of_words:
    root_word = ps.stem(w)
    print(root_word)

wait
wait
wait
wait
run
runner
run
ran
run
easili
fairli


## Snowball Stemmer

This is somewhat of a misnomer, as Snowball is the name of a stemming language developed by Martin Porter. The algorithm used here is more accurately called the "English Stemmer" or "Porter2 Stemmer". It offers a slight improvement over the original Porter stemmer, both in logic and speed. Since NLTK uses the name SnowballStemmer, we'll use it here.



In [1]:
from nltk.stem.snowball import SnowballStemmer

# The Snowball Stemmer requires that you pass a language parameter
s_stemmer = SnowballStemmer(language='english')
words = ['run','runner','running','ran','runs','easily','fairly']

for word in words:
    print(word+' --> '+s_stemmer.stem(word))

run --> run
runner --> runner
running --> run
ran --> ran
runs --> run
easily --> easili
fairly --> fair


## Lemmatization

Lemmatization is the process of finding the lemma of a word based on its meaning.

Lemmatization usually refers to the morphological analysis of words, which aims to remove inflectional endings.
It returns the base or dictionary form of a word, which is known as the lemma.

The NLTK lemmatization method is based on WordNet's built-in morphy function.
Text preprocessing can include both stemming and lemmatization.
Many people find the two terms confusing. Some treat them as the same, but there is a difference between the two.

Lemmatization is generally preferred over stemming for the reason below.

Why is Lemmatization better than Stemming?

A stemming algorithm works by cutting the suffix from the word. More broadly, it cuts characters from either the beginning or the end of the word.

Lemmatization, by contrast, is a more powerful operation that takes the morphological analysis of words into account. It returns the lemma, which is the base form shared by all of a word's inflectional forms.
In-depth linguistic knowledge is required to create the dictionaries and to look up the proper form of a word.

Stemming is a general, rule-based operation, while lemmatization is a more informed operation in which the proper form is looked up in a dictionary. Hence, lemmatization helps in forming better machine learning features.


In [0]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [0]:
from nltk.stem import WordNetLemmatizer

wordnet_lemmatizer = WordNetLemmatizer()

text = "buying buys studies studying cries cry"
tokenization = nltk.word_tokenize(text)

for w in tokenization:
    print("Lemma for {} is {}".format(w, wordnet_lemmatizer.lemmatize(w)))

Lemma for buying is buying
Lemma for buys is buy
Lemma for studies is study
Lemma for studying is studying
Lemma for cries is cry
Lemma for cry is cry
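
Note that "buying" and "studying" come back unchanged. `WordNetLemmatizer.lemmatize` accepts a `pos` (part of speech) argument that defaults to noun, and treated as nouns these words are already in base form. The sketch below imitates that behavior with a tiny hand-made dictionary; the `LEMMAS` table and the `toy_lemmatize` function are hypothetical and exist only to show why the part of speech matters.

```python
# Hypothetical miniature dictionary keyed by (word, part of speech).
# A real lemmatizer consults a full lexical database such as WordNet.
LEMMAS = {
    ("buying", "v"): "buy",
    ("studying", "v"): "study",
    ("ran", "v"): "run",
    ("studies", "n"): "study",
    ("cries", "n"): "cry",
}

def toy_lemmatize(word, pos="n"):
    """Return the dictionary form for (word, pos), or the word unchanged."""
    return LEMMAS.get((word, pos), word)

for w in ["buying", "studying", "ran"]:
    print(w, "| as noun:", toy_lemmatize(w), "| as verb:", toy_lemmatize(w, pos="v"))
```

With NLTK itself, the corresponding step would be to pass `pos='v'` to `lemmatize` for words known to be verbs.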


## Difference between Stemming and Lemmatization

In [0]:

porter_stemmer = PorterStemmer()
wordnet_lemmatizer = WordNetLemmatizer()

text = "studies studying cries cry"
tokenization = nltk.word_tokenize(text)
for w in tokenization:
    print("\n Stemming for {} is {}".format(w, porter_stemmer.stem(w)))
    print(" Lemma for {} is {} \n".format(w, wordnet_lemmatizer.lemmatize(w)))


 Stemming for studies is studi
 Lemma for studies is study 


 Stemming for studying is studi
 Lemma for studying is studying 


 Stemming for cries is cri
 Lemma for cries is cry 


 Stemming for cry is cri
 Lemma for cry is cry 

