# Stemming

Stemming is a rule-based technique that chops the suffix off a word to reduce it to a root form, called the ‘stem’. The stem does not have to be a valid dictionary word.

For example, consider the string "The driver is racing in his boss’ car". A stemmer that simply chops off the suffixes ‘er’ and ‘ing’ converts ‘driver’ to ‘driv’ and ‘racing’ to ‘rac’. Practical stemmers such as Porter attach conditions to each rule, so their actual output for these words can differ.
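As a minimal sketch of the suffix-chopping idea, the hypothetical `naive_stem` helper below strips a small, assumed list of suffixes. It is not the Porter or Snowball algorithm; the suffix list and the minimum stem length of three characters are illustrative assumptions only.

```python
# Hypothetical, deliberately naive stemmer: NOT the Porter or Snowball algorithm.
SUFFIXES = ("ing", "er")  # assumed, illustrative suffix list


def naive_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


sentence = "The driver is racing in his boss' car"
stems = [naive_stem(word) for word in sentence.lower().split()]
print(stems)
```

Only ‘driver’ and ‘racing’ are changed by this sketch, because no other word in the sentence ends in one of the two assumed suffixes.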

There are two popular stemmers:

<b>Porter stemmer:</b> This stemmer was developed in 1980 and works only on English words. The detailed rules of this stemmer are listed <a href="http://snowball.tartarus.org/algorithms/porter/stemmer.html">here</a>.

<b>Snowball stemmer:</b> This is a more versatile stemmer that works not only on English words but also on words of many other languages, such as French, German, Italian, Finnish, and Russian. You can learn more about this stemmer <a href="http://snowball.tartarus.org/">here</a>.

In [1]:
# import required libraries
import pandas as pd
from nltk.tokenize import word_tokenize
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.porter import PorterStemmer

#### Stemming using a sample text corpus

In [2]:
text = "Very orderly and methodical he looked, with a hand on each knee, and a loud watch ticking a sonorous sermon under his flapped newly bought waist-coat, as though it pitted its gravity and longevity against the levity and evanescence of the brisk fire."
print(text)

Very orderly and methodical he looked, with a hand on each knee, and a loud watch ticking a sonorous sermon under his flapped newly bought waist-coat, as though it pitted its gravity and longevity against the levity and evanescence of the brisk fire.


In [4]:
# lowercase the text and split it into word tokens
tokens = word_tokenize(text.lower())
print(tokens)

['very', 'orderly', 'and', 'methodical', 'he', 'looked', ',', 'with', 'a', 'hand', 'on', 'each', 'knee', ',', 'and', 'a', 'loud', 'watch', 'ticking', 'a', 'sonorous', 'sermon', 'under', 'his', 'flapped', 'newly', 'bought', 'waist-coat', ',', 'as', 'though', 'it', 'pitted', 'its', 'gravity', 'and', 'longevity', 'against', 'the', 'levity', 'and', 'evanescence', 'of', 'the', 'brisk', 'fire', '.']


In [6]:
# Porter Stemmer
stemmer = PorterStemmer()
porter_stemmed = [stemmer.stem(token) for token in tokens]
print('Porter Stemmed words - \n',porter_stemmed)
print('\nLength of Porter Stemmed - ',len(porter_stemmed))

Porter Stemmed words - 
 ['veri', 'orderli', 'and', 'method', 'he', 'look', ',', 'with', 'a', 'hand', 'on', 'each', 'knee', ',', 'and', 'a', 'loud', 'watch', 'tick', 'a', 'sonor', 'sermon', 'under', 'hi', 'flap', 'newli', 'bought', 'waist-coat', ',', 'as', 'though', 'it', 'pit', 'it', 'graviti', 'and', 'longev', 'against', 'the', 'leviti', 'and', 'evanesc', 'of', 'the', 'brisk', 'fire', '.']

Length of Porter Stemmed -  47


In [10]:
# Snowball Stemmer
print('Languages supported by Snowball - \n',SnowballStemmer.languages)
stemmer = SnowballStemmer('english')
snowball_stemmed = [stemmer.stem(token) for token in tokens]
print('\nSnowball Stemmed words - \n',snowball_stemmed)
print('\nLength of Snowball Stemmed - ',len(snowball_stemmed))

Languages supported by Snowball - 
 ('arabic', 'danish', 'dutch', 'english', 'finnish', 'french', 'german', 'hungarian', 'italian', 'norwegian', 'porter', 'portuguese', 'romanian', 'russian', 'spanish', 'swedish')

Snowball Stemmed words - 
 ['veri', 'order', 'and', 'method', 'he', 'look', ',', 'with', 'a', 'hand', 'on', 'each', 'knee', ',', 'and', 'a', 'loud', 'watch', 'tick', 'a', 'sonor', 'sermon', 'under', 'his', 'flap', 'newli', 'bought', 'waist-coat', ',', 'as', 'though', 'it', 'pit', 'it', 'graviti', 'and', 'longev', 'against', 'the', 'leviti', 'and', 'evanesc', 'of', 'the', 'brisk', 'fire', '.']

Length of Snowball Stemmed -  47


#### Let's create a DataFrame to compare each token with the stems produced by the Porter and Snowball stemmers

In [17]:
token_df = pd.DataFrame({'token': tokens, 'porter_stemmed': porter_stemmed, 'snowball_stemmed':snowball_stemmed})

In [18]:
token_df.head()

| | token | porter_stemmed | snowball_stemmed |
|---|---|---|---|
| 0 | very | veri | veri |
| 1 | orderly | orderli | order |
| 2 | and | and | and |
| 3 | methodical | method | method |
| 4 | he | he | he |


In [19]:
token_df.shape

(47, 3)

In [21]:
# Keep only the tokens that at least one of the two stemmers changed
df_diff_stem = token_df[(token_df.token != token_df.porter_stemmed) | (token_df.token != token_df.snowball_stemmed)]

In [22]:
df_diff_stem.head()

| | token | porter_stemmed | snowball_stemmed |
|---|---|---|---|
| 0 | very | veri | veri |
| 1 | orderly | orderli | order |
| 3 | methodical | method | method |
| 5 | looked | look | look |
| 18 | ticking | tick | tick |


In [23]:
df_diff_stem.shape

(15, 3)

In [25]:
# Let's compare the stems produced by both stemmers
df_diff_stem.head(15)

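| | token | porter_stemmed | snowball_stemmed |
|---|---|---|---|
| 0 | very | veri | veri |
| 1 | orderly | orderli | order |
| 3 | methodical | method | method |
| 5 | looked | look | look |
| 18 | ticking | tick | tick |
| 20 | sonorous | sonor | sonor |
| 23 | his | hi | his |
| 24 | flapped | flap | flap |
| 25 | newly | newli | newli |
| 32 | pitted | pit | pit |
| 33 | its | it | it |
| 34 | gravity | graviti | graviti |
| 36 | longevity | longev | longev |
| 39 | levity | leviti | leviti |
| 41 | evanescence | evanesc | evanesc |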
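To isolate only the tokens on which the two stemmers disagree, the stems can be compared pairwise. The sketch below is a plain-Python illustration that reuses ten rows copied from the comparison table above; the names `rows` and `disagreements` are assumptions introduced for this example and are not part of the notebook.

```python
# Rows copied from the comparison table above: (token, porter_stem, snowball_stem).
rows = [
    ("very", "veri", "veri"),
    ("orderly", "orderli", "order"),
    ("methodical", "method", "method"),
    ("looked", "look", "look"),
    ("ticking", "tick", "tick"),
    ("sonorous", "sonor", "sonor"),
    ("his", "hi", "his"),
    ("flapped", "flap", "flap"),
    ("newly", "newli", "newli"),
    ("pitted", "pit", "pit"),
]

# Keep only the rows where the Porter and Snowball stems differ.
disagreements = [row for row in rows if row[1] != row[2]]
for token, porter, snowball in disagreements:
    print(f"{token}: porter={porter!r}, snowball={snowball!r}")
```

Against the notebook's DataFrame, the same filter could be written as `token_df[token_df.porter_stemmed != token_df.snowball_stemmed]`.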


#### The Snowball stemmer works a little better here, but the two usually differ very little because both are rule-based. Snowball has some updated rules, which is why it stems a few words differently: ‘orderly’ becomes ‘order’ instead of ‘orderli’, and ‘his’ is left unchanged instead of being cut to ‘hi’.