# Lemmatisation

Lemmatisation is a more sophisticated (and arguably more 'intelligent') technique than stemming, in the sense that it doesn’t just chop off the suffix of a word. Instead, it takes an input word and searches a dictionary for its base form, working through the word's possible variations until it finds a match. This base form is called the lemma. Irregular words such as ‘feet’, ‘drove’, ‘arose’ and ‘bought’ can’t be reduced to their correct base form by a stemmer, but a lemmatizer can handle them. The most popular lemmatizer is the WordNet lemmatizer, which relies on WordNet, a lexical database created by a team of researchers at Princeton University. You can read more about it <a href="https://wordnet.princeton.edu/">here</a>.
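To make the contrast concrete, here is a minimal sketch. It is not how the WordNet lemmatizer is implemented: the names `TOY_LEMMAS`, `SUFFIXES`, `toy_stem` and `toy_lemmatize` are made up for illustration, a four-entry lookup table stands in for a real dictionary, and a crude suffix stripper stands in for a real stemmer.

```python
# Toy illustration only: a tiny hand-written table stands in for a real
# dictionary such as WordNet, and the suffix rules are far cruder than
# those of a real stemmer.
TOY_LEMMAS = {
    "feet": "foot",
    "drove": "drive",
    "arose": "arise",
    "bought": "buy",
}

SUFFIXES = ("ing", "ed", "es", "s")


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word: str) -> str:
    """Look the word up in the table; fall back to the word itself."""
    return TOY_LEMMAS.get(word, word)


for word in ["feet", "drove", "arose", "bought", "looked"]:
    print(f"{word:8} stem={toy_stem(word):8} lemma={toy_lemmatize(word)}")
```

The suffix rules leave the irregular forms untouched, while the lookup resolves them. The last word shows the other side of the trade-off: a lookup can only return what its dictionary contains, so 'looked' comes back unchanged here.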

In [1]:
# import the required libraries

from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

#### Let us lemmatise a sample of corpus text

In [2]:
text = "Very orderly and methodical he looked, with a hand on each knee, and a loud watch ticking a sonorous sermon under his flapped newly bought waist-coat, as though it pitted its gravity and longevity against the levity and evanescence of the brisk fire."
print(text)

Very orderly and methodical he looked, with a hand on each knee, and a loud watch ticking a sonorous sermon under his flapped newly bought waist-coat, as though it pitted its gravity and longevity against the levity and evanescence of the brisk fire.


In [3]:
tokens = word_tokenize(text.lower())

In [4]:
print(tokens)

['very', 'orderly', 'and', 'methodical', 'he', 'looked', ',', 'with', 'a', 'hand', 'on', 'each', 'knee', ',', 'and', 'a', 'loud', 'watch', 'ticking', 'a', 'sonorous', 'sermon', 'under', 'his', 'flapped', 'newly', 'bought', 'waist-coat', ',', 'as', 'though', 'it', 'pitted', 'its', 'gravity', 'and', 'longevity', 'against', 'the', 'levity', 'and', 'evanescence', 'of', 'the', 'brisk', 'fire', '.']


In [5]:
# download the WordNet data used by the lemmatizer
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/nehaverma/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [7]:
# lemmatize the tokens
wordnet_lemmatizer = WordNetLemmatizer()
lemmatized = [wordnet_lemmatizer.lemmatize(token) for token in tokens]
print(lemmatized)

['very', 'orderly', 'and', 'methodical', 'he', 'looked', ',', 'with', 'a', 'hand', 'on', 'each', 'knee', ',', 'and', 'a', 'loud', 'watch', 'ticking', 'a', 'sonorous', 'sermon', 'under', 'his', 'flapped', 'newly', 'bought', 'waist-coat', ',', 'a', 'though', 'it', 'pitted', 'it', 'gravity', 'and', 'longevity', 'against', 'the', 'levity', 'and', 'evanescence', 'of', 'the', 'brisk', 'fire', '.']


### Lemmatisation vs. Stemming

- A stemmer is a rule-based technique, and hence it is much faster than a lemmatizer (which searches the dictionary for the lemma of a word). On the other hand, a stemmer typically gives less accurate results than a lemmatizer.

- A lemmatizer is slower because of the dictionary lookup but gives better results than a stemmer. For a lemmatizer to perform accurately, however, we need to provide the part-of-speech (POS) tag of the input word (noun, verb, adjective, etc.). There are often cases where the POS tagger itself is quite inaccurate on our text, and that worsens the performance of the lemmatiser as well. In short, we may want to consider a stemmer rather than a lemmatiser if we notice that POS tagging is inaccurate.

- Also, lemmatization works only on correctly spelt words, because a misspelt word will not be found in the dictionary. Stemming, being rule-based, can still be applied to such words.

In [9]:
from nltk.stem import SnowballStemmer
stemmer = SnowballStemmer('english')
stemmed = [stemmer.stem(token) for token in tokens]
print(stemmed)

['veri', 'order', 'and', 'method', 'he', 'look', ',', 'with', 'a', 'hand', 'on', 'each', 'knee', ',', 'and', 'a', 'loud', 'watch', 'tick', 'a', 'sonor', 'sermon', 'under', 'his', 'flap', 'newli', 'bought', 'waist-coat', ',', 'as', 'though', 'it', 'pit', 'it', 'graviti', 'and', 'longev', 'against', 'the', 'leviti', 'and', 'evanesc', 'of', 'the', 'brisk', 'fire', '.']


In [11]:
import pandas as pd
token_df = pd.DataFrame({'token': tokens, 'lemmatized': lemmatized, 'stemmed':stemmed})

In [12]:
token_df.head()

|   | token | lemmatized | stemmed |
|---|---|---|---|
| 0 | very | very | veri |
| 1 | orderly | orderly | order |
| 2 | and | and | and |
| 3 | methodical | methodical | method |
| 4 | he | he | he |


In [13]:
diff_token_df = token_df[(token_df.lemmatized != token_df.token) | (token_df.stemmed != token_df.token)]

In [14]:
diff_token_df.shape

(15, 3)

In [15]:
diff_token_df.head(15)

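|   | token | lemmatized | stemmed |
|---|---|---|---|
| 0 | very | very | veri |
| 1 | orderly | orderly | order |
| 3 | methodical | methodical | method |
| 5 | looked | looked | look |
| 18 | ticking | ticking | tick |
| 20 | sonorous | sonorous | sonor |
| 24 | flapped | flapped | flap |
| 25 | newly | newly | newli |
| 29 | as | a | as |
| 32 | pitted | pitted | pit |
| 33 | its | it | it |
| 34 | gravity | gravity | graviti |
| 36 | longevity | longevity | longev |
| 39 | levity | levity | leviti |
| 41 | evanescence | evanescence | evanesc |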


#### As we can see, some words are lemmatized incorrectly or left unchanged. That's because the lemmatizer treats every word as a noun unless we also pass its POS (part of speech)

In [19]:
# For example, let us lemmatize the word 'working' without passing a POS
wordnet_lemmatizer.lemmatize('working')

'working'

In [20]:
# Now when we pass the POS as verb (v), it lemmatizes it correctly
wordnet_lemmatizer.lemmatize('working', pos='v')

'work'
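Passing the POS by hand does not scale to a whole token list, so one possible approach is to take the tags from a POS tagger. The sketch below assumes NLTK's `pos_tag` and its Penn Treebank tags. The helper names `penn_to_wordnet` and `lemmatize_with_pos` are hypothetical and not part of NLTK. It also assumes the tagger model has been fetched with `nltk.download`; the resource name differs between NLTK releases.

```python
def penn_to_wordnet(tag: str) -> str:
    """Map a Penn Treebank tag to the single-letter POS that WordNet expects."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # noun, which is also the lemmatizer's default


def lemmatize_with_pos(tokens):
    """Tag the tokens, then lemmatize each one with its mapped POS."""
    # Imported here so that penn_to_wordnet can be used on its own.
    import nltk
    from nltk.stem import WordNetLemmatizer

    lemmatizer = WordNetLemmatizer()
    tagged = nltk.pos_tag(tokens)  # list of (word, Penn Treebank tag) pairs
    return [
        lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))
        for word, tag in tagged
    ]
```

Calling `lemmatize_with_pos(tokens)` on the token list built earlier returns a list of the same length. How good the lemmas are depends on how well the tagger handles the text, which is the caveat raised in the comparison with stemming.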