# Stemming and Lemmatization in Python

Stemming and lemmatization are text normalization techniques in Natural Language Processing (NLP) that are used to prepare text, words, and documents for further processing. In this blog, you will study stemming and lemmatization in a practical way, covering the background, the applications of each, and how to stem and lemmatize words, sentences and documents using NLTK (the Natural Language Toolkit), a Python package for natural language processing.

In natural language processing, you may want your program to recognize that the words “kick” and “kicked” are just different tenses of the same verb. This is the idea of reducing different forms of a word to a core root.

## Stemming

Stemming is the process of reducing inflected or derived words to their root/base form (the stem). Programs that do this are commonly referred to as stemming algorithms or stemmers.

Often when searching text for a certain keyword, it helps if the search returns variations of the word. For instance, searching for “boat” might also return “boats” and “boating”. Here, “boat” would be the stem for (boat, boater, boating, boats).

Stemming is a somewhat crude method for cataloging related words; it essentially chops off letters from the end until the stem is reached. This works fairly well in most cases, but unfortunately English has many exceptions where a more sophisticated process is required. In fact, spaCy doesn’t include a stemmer, opting instead to rely entirely on lemmatization.
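
To make the "chop off letters" idea concrete, here is a minimal sketch of a suffix-stripping stemmer in plain Python. The rule list `SUFFIXES`, the helper name `naive_stem` and the `min_stem_len` threshold are made up for illustration; real stemmers such as Porter use far more elaborate rules.

```python
# Toy suffix-stripping stemmer (illustrative only; this is NOT the Porter algorithm).
SUFFIXES = ("ing", "ers", "er", "s")  # hypothetical rule list, longest suffixes first


def naive_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


for w in ["boat", "boats", "boater", "boating", "bus", "caring"]:
    print(w, "->", naive_stem(w))
```

The last word shows the kind of exception mentioned above: “caring” is cut down to “car”, which is a different word altogether.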

![image.png](attachment:image.png)

## Lemmatization with NLTK

Lemmatization is the process of grouping together the different inflected forms of a word so they can be analyzed as a single item. It is similar to stemming, but it takes the word's context (such as its part of speech) into account, so it maps words with the same meaning to a single dictionary form, the lemma.

Text preprocessing often involves stemming, lemmatization, or both, and the two terms are easy to confuse; some people treat them as the same thing. Lemmatization is generally preferred when accuracy matters, because it performs a morphological analysis of each word and returns a dictionary form, whereas stemming only strips affixes.

Applications of lemmatization are:

* Used in comprehensive retrieval systems like search engines.
* Used in compact indexing.

Examples of lemmatization:

-> rocks : rock

-> corpora : corpus

-> better : good

One major difference from stemming is that lemmatize() takes a part-of-speech parameter, “pos”. If it is not supplied, the default is “n” (noun).
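
Before turning to NLTK, the effect of the “pos” parameter can be pictured with a toy lookup table. This is only a sketch: the `LEMMAS` table and the `toy_lemmatize` helper are hypothetical and hold a handful of hand-written entries, whereas NLTK consults WordNet.

```python
# Toy lemma table keyed by (word, pos); the entries are illustrative, not WordNet data.
LEMMAS = {
    ("rocks", "n"): "rock",
    ("corpora", "n"): "corpus",
    ("better", "a"): "good",
}


def toy_lemmatize(word, pos="n"):
    """Return the lemma stored for (word, pos), or the word unchanged if unknown."""
    return LEMMAS.get((word.lower(), pos), word)


print(toy_lemmatize("better"))           # looked up as a noun: no entry, word unchanged
print(toy_lemmatize("better", pos="a"))  # looked up as an adjective: maps to "good"
```

The same word gives different results depending on the part of speech it is looked up under, which is why the default of “noun” matters.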


In [1]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\divak\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [2]:
# import these modules
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

print("rocks :", lemmatizer.lemmatize("rocks"))
print("corpora :", lemmatizer.lemmatize("corpora"))

# "a" denotes adjective in "pos"
print("better :", lemmatizer.lemmatize("better", pos="a"))


rocks : rock
corpora : corpus
better : good


In [3]:
from nltk.stem import WordNetLemmatizer
# WordNetLemmatizer looks words up in WordNet; it is not based on the Porter stemming algorithm

In [4]:
wordnet_lem = WordNetLemmatizer()

In [5]:
word_to_lemm = ["drive", "drives", "driver", "drivers", "driven", "driving"]

In [6]:
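# No pos is passed, so every word is treated as a noun;
# that is why "driven" and "driving" come back unchanged below
for word in word_to_lemm:
    print(word + ":" + wordnet_lem.lemmatize(word))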

drive:drive
drives:drive
driver:driver
drivers:driver
driven:driven
driving:driving


## Stemming words with NLTK
As described earlier, a stemmer reduces inflected or derived words to a common root. For example, a stemming algorithm aims to reduce “chocolates” and “chocolatey” to the root word “chocolate”, and “retrieval”, “retrieved”, “retrieves” to the stem “retrieve”. In practice the stem is not always a dictionary word: the Porter stemmer used below returns “languag” for “languages”.

As another example, the following words all stem to the root word "like":

* "likes"
* "liked"
* "likely"
* "liking"

Errors in Stemming:
There are mainly two errors in stemming: overstemming and understemming. Overstemming occurs when two words that have different stems are reduced to the same root (a false positive). Understemming occurs when two words that should share a root are reduced to different roots (a false negative).
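
One way to picture the two error types is to compare a stemmer's output against small hand-made groups of words that should (or should not) share a stem. The sketch below assumes a deliberately crude stand-in stemmer, `crude_stem`, that just truncates words; both it and the `stemming_errors` helper are hypothetical names, not part of NLTK.

```python
from itertools import combinations


def crude_stem(word):
    """Hypothetical stand-in stemmer: keep at most the first 7 letters."""
    return word.lower()[:7]


def stemming_errors(stem, groups):
    """Compare a stemmer against hand-made groups of related words.

    Words inside one group should share a stem; words in different
    groups should not. Returns (overstemmed_pairs, understemmed_pairs).
    """
    labelled = [(w, i) for i, group in enumerate(groups) for w in group]
    over, under = [], []
    for (w1, g1), (w2, g2) in combinations(labelled, 2):
        same_stem = stem(w1) == stem(w2)
        if same_stem and g1 != g2:
            over.append((w1, w2))    # different words merged: overstemming
        elif not same_stem and g1 == g2:
            under.append((w1, w2))   # related words kept apart: understemming
    return over, under


groups = [["universe"], ["university"], ["alumnus", "alumni"]]
over, under = stemming_errors(crude_stem, groups)
print("overstemmed:", over)
print("understemmed:", under)
```

Any callable that maps a word to a stem can be passed as `stem`, so the same check could be pointed at a real stemmer's `stem` method.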

Applications of stemming are:  

* Stemming is used in information retrieval systems like search engines.
* It is used to determine domain vocabularies in domain analysis.

Stemming is desirable because it reduces redundancy: most of the time, a word stem and its inflected or derived forms mean the same thing.
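
The search-engine use case can be sketched with a tiny index keyed by stem rather than by exact word. This is a simplified illustration: `naive_stem` is a hypothetical stand-in for a real stemmer, and `build_index` and the sample `docs` are invented for the example.

```python
from collections import defaultdict


def naive_stem(word):
    """Hypothetical stand-in for a real stemmer: strip a trailing "ing" or "s"."""
    word = word.lower()
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_index(docs, stem=naive_stem):
    """Map each stem to the set of document ids containing a word with that stem."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[stem(token)].add(doc_id)
    return index


docs = {
    1: "boats for sale",
    2: "boating on the lake",
    3: "used cars",
}
index = build_index(docs)

# Stem the query the same way as the documents before looking it up
print(sorted(index[naive_stem("boat")]))
```

Because “boats” and “boating” are stored under the same key, a query for “boat” finds both documents while the index holds one entry instead of three.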


### Code #1: Stemming individual words

In [7]:
# import the stemmer
from nltk.stem import PorterStemmer

ps = PorterStemmer()

# choose some words to be stemmed
words = ["program", "programs", "programmer", "programming", "programmers"]

for w in words:
    print(w, " : ", ps.stem(w))


program  :  program
programs  :  program
programmer  :  programm
programming  :  program
programmers  :  programm


### Code #2: Stemming words from sentences

In [8]:
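# importing modules
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

# word_tokenize needs the Punkt tokenizer data; if it is missing, download it
# once with nltk.download('punkt') (newer NLTK releases may ask for 'punkt_tab')
ps = PorterStemmer()

sentence = "Programmers program with programming languages"
# split the sentence into individual word tokens
words = word_tokenize(sentence)

for w in words:
    print(w, " : ", ps.stem(w))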

Programmers  :  programm
program  :  program
with  :  with
programming  :  program
languages  :  languag


--------------------------------------

In [10]:
import nltk

### Porter Stemmer

In [11]:
from nltk.stem.porter import PorterStemmer
porstem = PorterStemmer()
# Try some stems
print('drive: {}'.format(porstem.stem('drive')))
print('drives: {}'.format(porstem.stem('drives')))
print('driver: {}'.format(porstem.stem('driver')))
print('drivers: {}'.format(porstem.stem('drivers')))
print('driven: {}'.format(porstem.stem('driven')))
print('driving: {}'.format(porstem.stem('driving')))

drive: drive
drives: drive
driver: driver
drivers: driver
driven: driven
driving: drive


In [12]:
word_to_stem = ["drive", "drives", "driver", "drivers", "driven", "driving"]
for word in word_to_stem:
    print(word + ":" + porstem.stem(word))

drive:drive
drives:drive
driver:driver
drivers:driver
driven:driven
driving:drive


### Lancaster Stemmer

In [13]:
from nltk.stem.lancaster import LancasterStemmer
lanstem = LancasterStemmer()
# Try some stems
print('drive: {}'.format(lanstem.stem('drive')))
print('drives: {}'.format(lanstem.stem('drives')))
print('driver: {}'.format(lanstem.stem('driver')))
print('drivers: {}'.format(lanstem.stem('drivers')))
print('driven: {}'.format(lanstem.stem('driven')))
print('driving: {}'.format(lanstem.stem('driving')))

drive: driv
drives: driv
driver: driv
drivers: driv
driven: driv
driving: driv


In [14]:
word_to_stem = ["drive", "drives", "driver", "drivers", "driven", "driving"]
for word in word_to_stem:
    print(word + ":" + lanstem.stem(word))

drive:driv
drives:driv
driver:driv
drivers:driv
driven:driv
driving:driv


### Snowball Stemmer

In [15]:
from nltk.stem.snowball import SnowballStemmer
snostem = SnowballStemmer('english')
# Try some stems
print('drive: {}'.format(snostem.stem('drive')))
print('drives: {}'.format(snostem.stem('drives')))
print('driver: {}'.format(snostem.stem('driver')))
print('drivers: {}'.format(snostem.stem('drivers')))
print('driven: {}'.format(snostem.stem('driven')))
print('driving: {}'.format(snostem.stem('driving')))

drive: drive
drives: drive
driver: driver
drivers: driver
driven: driven
driving: drive


### Compare Porter Stemmer, Lancaster Stemmer and Snowball Stemmer

In [16]:
def stemm(word):
    """Print the stem that each of the three stemmers produces for word."""
    print("Porter Stemmer" + ":" + porstem.stem(word))
    print("LancasterStemmer" + ":" + lanstem.stem(word))
    print("Snowball Stemmer" + ":" + snostem.stem(word))

stemm('data')
print("-----------------------")
stemm('driving')

Porter Stemmer:data
LancasterStemmer:dat
Snowball Stemmer:data
-----------------------
Porter Stemmer:drive
LancasterStemmer:driv
Snowball Stemmer:drive
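
For comparing more than a couple of words, a side-by-side table can be easier to read than repeated blocks of output. The helper below is a sketch under assumptions: `compare_stemmers` is a hypothetical name, and the two lambda "stemmers" are trivial stand-ins so the snippet runs on its own. To compare the real stemmers, pass `porstem.stem`, `lanstem.stem` and `snostem.stem` from the cells above instead.

```python
def compare_stemmers(words, stemmers):
    """Return one row per word: the word followed by each stemmer's output."""
    return [[w] + [stem(w) for stem in stemmers.values()] for w in words]


# Stand-in callables (NOT real stemming algorithms), used only so the sketch
# runs without NLTK; replace them with the stem methods of real stemmers.
stemmers = {
    "strip-s": lambda w: w[:-1] if w.endswith("s") else w,
    "first-4": lambda w: w[:4],
}

header = ["word"] + list(stemmers)
rows = compare_stemmers(["drive", "drives", "driving"], stemmers)
for row in [header] + rows:
    print("{:<10}{:<10}{:<10}".format(*row))
```

Each column comes from one entry in the `stemmers` dictionary, so adding another stemmer only means adding another key (and one more `{:<10}` field to the format string).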
