# Lemmatization

As we saw in the previous chapter, we can explain to the machine which words are similar, and also how different they are.

However, some "different" words are only variations of the same word and should not be treated as separate entries.

Let's take an example:

Imagine that you are asked to build a model that classifies books into two categories: _cooking_ and _cars_. You will use the most frequent words of each book to build your algorithm.

In that case you don't really want to make a distinction between `apple` and `apples` or between `wheel` and `wheels`. You prefer to consider `apple` and `apples` as being variations of `apple`.

To fix that, we will apply **lemmatization**. This approach reduces each word to its base form, called its **lemma**. The lemma corresponds to the headword of an entry in a language dictionary:


**apple** (noun) : `a round fruit (usually with a green or red skin) which can be eaten (plural: apples)`
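
A minimal sketch of the book example, assuming a tiny hand-written lemma table. `TOY_LEMMAS`, `toy_lemma`, and `book_words` are hypothetical names made up for this illustration; a real lemmatizer relies on a full dictionary instead. It shows how counting lemmas merges `apple` and `apples` into one entry:

```python
from collections import Counter

# Hypothetical toy lemma table, written by hand for this example only.
TOY_LEMMAS = {"apples": "apple", "wheels": "wheel"}


def toy_lemma(word):
    """Return the lemma from the toy table, or the lowercased word if unknown."""
    word = word.lower()
    return TOY_LEMMAS.get(word, word)


# Made-up word list standing in for the words of a book.
book_words = ["apple", "apples", "wheel", "Apples", "sugar", "apple"]

raw_counts = Counter(w.lower() for w in book_words)
lemma_counts = Counter(toy_lemma(w) for w in book_words)

print(raw_counts.most_common())    # 'apple' and 'apples' are counted separately
print(lemma_counts.most_common())  # both forms are counted under 'apple'
```

With lemmas, the four occurrences of the fruit count as a single, stronger feature for the _cooking_ category.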

 


## Still confused?
Let's see how it works in a practical case.

First, read [this article](https://www.machinelearningplus.com/nlp/lemmatization-examples-python/).

Then, try to apply what you have learned by using spaCy or NLTK.

**Pro tip:** Most lemmatizers only work on a single word, not on whole sentences. Think about tokenizing your sentence first.

**Pro tip:** If you experience SSL issues when downloading `nltk` data, [check this](https://stackoverflow.com/questions/38916452/nltk-download-ssl-certificate-verify-failed).
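
The tokenizing tip can be sketched without any library. The regex tokenizer and the lookup table below are assumptions made for illustration (`simple_tokenize`, `toy_lemma`, and `TOY_LEMMAS` are hypothetical names, not spaCy or NLTK functions); the point is only that a word-level lemmatizer does nothing useful when it receives a whole sentence:

```python
import re

# Hypothetical toy lemma table, written by hand for this example only.
TOY_LEMMAS = {"children": "child", "games": "game", "plays": "play"}


def toy_lemma(word):
    """Word-level lookup: returns the input unchanged when it is not in the table."""
    return TOY_LEMMAS.get(word, word)


def simple_tokenize(text):
    """Split text into word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


sentence = "Those children are playing. this game, those games, I play he plays"

# Passing the whole sentence: the lookup finds no entry and returns it as-is.
whole = toy_lemma(sentence)

# Tokenizing first: each word is looked up on its own.
tokens = simple_tokenize(sentence)
per_token = " ".join(toy_lemma(t) for t in tokens)

print(whole)
print(per_token)
```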

## spaCy

In [None]:
# Can you lemmatize this sentence with Spacy and / or NLTK?

my_sentence = "Those children are playing. this game, those games, I play he plays"


In [1]:
#Step 1 - Import Spacy
import spacy

#Step 2 - Initialize the Spacy en model.
load_model = spacy.load('en_core_web_sm')

#Step 3 - Take a simple text for sample
my_sentence = "Those children are playing. this game, those games, I play he plays"

#Step 4 - Parse the text
doc = load_model(my_sentence)

#Step 5 - Extract the lemma for each token
lemmatized_output = " ".join([token.lemma_ for token in doc])

print(lemmatized_output)

those child be play . this game , those game , I play he play


## NLTK

In [1]:
import nltk
nltk.download('omw-1.4')
nltk.download('wordnet')
nltk.download('punkt')

[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\yurit\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\yurit\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\yurit\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [5]:
from nltk.stem import WordNetLemmatizer

# Init the Wordnet Lemmatizer
lemmatizer = WordNetLemmatizer()

# Lemmatize Single Word
print(lemmatizer.lemmatize("children"))
print(lemmatizer.lemmatize("plays"))
print(lemmatizer.lemmatize("playing"))



child
play
playing
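
The last result, `playing`, comes back unchanged because `lemmatize` treats its input as a noun unless a `pos` argument says otherwise. A minimal sketch of that behaviour, using a hand-written toy table instead of WordNet (`TOY_LEMMAS` and `toy_lemmatize` are hypothetical names for this illustration only):

```python
# Hypothetical toy table keyed by (word, part of speech); not WordNet data.
TOY_LEMMAS = {
    ("children", "n"): "child",
    ("plays", "n"): "play",
    ("plays", "v"): "play",
    ("playing", "v"): "play",
    ("are", "v"): "be",
}


def toy_lemmatize(word, pos="n"):
    """Look the word up under the given part of speech; noun is the default."""
    return TOY_LEMMAS.get((word, pos), word)


as_noun = toy_lemmatize("playing")           # no noun entry, so unchanged
as_verb = toy_lemmatize("playing", pos="v")  # verb entry found
be_verb = toy_lemmatize("are", pos="v")

print(as_noun, as_verb, be_verb)
```

With the real `WordNetLemmatizer`, the equivalent call would be `lemmatizer.lemmatize("playing", pos="v")`.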


In [6]:
import nltk
from nltk.stem import WordNetLemmatizer

# Define the sentence to be lemmatized
sentence = "Those children are playing. this game, those games, I play he plays"

# Tokenize: Split the sentence into words
word_list = nltk.word_tokenize(sentence)
print(word_list)

# Lemmatize list of words and join
lemmatized_output = ' '.join([lemmatizer.lemmatize(w) for w in word_list])
print(lemmatized_output)

['Those', 'children', 'are', 'playing', '.', 'this', 'game', ',', 'those', 'games', ',', 'I', 'play', 'he', 'plays']
Those child are playing . this game , those game , I play he play


In [9]:
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import sent_tokenize


my_sentence = "Those children are playing. this game, those games, I play he plays"

# Tokenize: Split the text into sentences
sentence_list = sent_tokenize(my_sentence)
print(sentence_list)

# Tokenize: Split the sentence into words
word_list = nltk.word_tokenize(my_sentence)
print(word_list)

# Lemmatize list of words and join
lemmatized_output = ' '.join([lemmatizer.lemmatize(w) for w in word_list])
print(lemmatized_output)

['Those children are playing.', 'this game, those games, I play he plays']
['Those', 'children', 'are', 'playing', '.', 'this', 'game', ',', 'those', 'games', ',', 'I', 'play', 'he', 'plays']
Those child are playing . this game , those game , I play he play


In [2]:
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from pprint import pprint
 
lemmatizer = WordNetLemmatizer()

def lemmatize_print(text):
    """Print a {lemma: token} mapping for every token in the text."""
    tokens = word_tokenize(text)
    lemmas = [lemmatizer.lemmatize(token) for token in tokens]
    # Keys are lemmas, so a later token with the same lemma overwrites an
    # earlier one (for example 'games' replaces 'game' under the key 'game').
    pprint(dict(zip(lemmas, tokens)), indent=1, depth=5)

lemmatize_print("Those children are playing. this game, those games, I play he plays")


{',': ',',
 '.': '.',
 'I': 'I',
 'Those': 'Those',
 'are': 'are',
 'child': 'children',
 'game': 'games',
 'he': 'he',
 'play': 'plays',
 'playing': 'playing',
 'this': 'this',
 'those': 'those'}


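What are the differences between the two tools?

yuri: spaCy is object-oriented: one call to the loaded pipeline returns a `Doc` whose tokens already carry a `lemma_` attribute. NLTK instead provides separate building blocks (`word_tokenize`, `WordNetLemmatizer`) that you chain yourself. The results differ as well: spaCy turns `are` into `be` and `playing` into `play` (and lowercases `Those`), whereas NLTK's `WordNetLemmatizer` leaves `are` and `playing` unchanged unless it is given a POS tag.
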
## Conclusion
There are multiple libraries that allow you to do lemmatization. Each has its own particularities.
There are also other techniques to "simplify" words, like [Stemming](https://medium.com/swlh/introduction-to-stemming-vs-lemmatization-nlp-8c69eb43ecfe). Feel free to explore those that seem relevant to your use case.

![stemming vs lemmatization](https://miro.medium.com/max/2050/1*ES5bt7IoInIq2YioQp2zcQ.png)
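
To make the contrast concrete, here is a rough sketch that compares a crude suffix-stripping stemmer with a dictionary lookup. Both are toy assumptions written for this illustration: `crude_stem` is not the Porter algorithm or any library stemmer, and `TOY_LEMMAS` is a hand-written table (it treats `playing` as a verb).

```python
# Hypothetical toy lemma table, written by hand for this example only.
TOY_LEMMAS = {
    "apples": "apple",
    "games": "game",
    "plays": "play",
    "playing": "play",
    "children": "child",
}


def crude_stem(word):
    """Strip a few common suffixes; a toy rule set, not a real stemming algorithm."""
    for suffix in ("ing", "es", "s"):
        # Keep at least three characters so very short words stay intact.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemma(word):
    """Dictionary lookup: returns the input unchanged when it is not in the table."""
    return TOY_LEMMAS.get(word, word)


words = ["apples", "games", "plays", "playing", "children"]
results = {w: (crude_stem(w), toy_lemma(w)) for w in words}

for word, (stem, lemma) in results.items():
    print(f"{word:10} stem={stem:10} lemma={lemma}")
```

The stemmer only chops characters, so it can produce non-words such as `appl` or `gam` and it misses irregular forms such as `children`, while the lookup always returns a dictionary form.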


In [12]:
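# Lemmatize with POS Tag
# Note: nltk.pos_tag needs a tagger resource that the download cell above does not fetch,
# e.g. nltk.download('averaged_perceptron_tagger') (the resource name can differ between NLTK versions).
import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer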

def get_wordnet_pos(word):
    """Map POS tag to first character lemmatize() accepts"""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}

    return tag_dict.get(tag, wordnet.NOUN)
    
# 1. Init Lemmatizer
lemmatizer = WordNetLemmatizer()

# 2. Lemmatize Single Word with the appropriate POS tag
word = 'feet'
print(lemmatizer.lemmatize(word, get_wordnet_pos(word)))

# 3. Lemmatize a Sentence with the appropriate POS tag
sentence = "Those children are playing. this game, those games, I play he plays"
print([lemmatizer.lemmatize(w, get_wordnet_pos(w)) for w in nltk.word_tokenize(sentence)])

foot
['Those', 'child', 'be', 'play', '.', 'this', 'game', ',', 'those', 'game', ',', 'I', 'play', 'he', 'play']
