# Wordnet Lemmatizer with NLTK

WordNet is a large, freely and publicly available lexical database for the English language that aims to establish structured semantic relationships between words. It also offers lemmatization capabilities and is one of the earliest and most commonly used lemmatizers. NLTK offers an interface to it, but you have to download it first in order to use it. Follow the instructions below to install NLTK and download WordNet.

In [None]:
# How to install and import NLTK
# In terminal or prompt:
# pip install nltk

# Download WordNet through NLTK in a Python console:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

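Lemmatization vs. stemming: a lemmatizer returns a dictionary form such as `history`, whereas a stemmer may return a truncated stem such as `histori` that is not a real word.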
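
The contrast between stemming and lemmatization can be sketched with toy code. The suffix rules and the tiny lookup table below are illustrative assumptions, not the Porter algorithm or the WordNet data; they only show why a stemmer can emit a non-word while a lemmatizer returns a dictionary form.

```python
# Toy illustration only: neither function is part of NLTK.

# Hypothetical mini-dictionary standing in for a real lexical database.
TOY_LEMMAS = {"feet": "foot", "bats": "bat", "histories": "history"}

# Hypothetical suffix rewrite rules, tried in order (longest first).
TOY_SUFFIX_RULES = [("ies", "i"), ("y", "i"), ("s", "")]


def toy_stem(word):
    """Rewrite the first matching suffix, even if the result is not a real word."""
    for suffix, replacement in TOY_SUFFIX_RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word


def toy_lemmatize(word):
    """Look the word up in the dictionary; return it unchanged if it is unknown."""
    return TOY_LEMMAS.get(word, word)


for w in ["history", "histories", "feet"]:
    print(w, "-> stem:", toy_stem(w), "| lemma:", toy_lemmatize(w))
```

The rule-based function never consults a vocabulary, which is why it can produce `histori`, while the lookup-based function can only return forms that exist in its table.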


In [None]:
import nltk
from nltk.stem import WordNetLemmatizer 

# Init the Wordnet Lemmatizer
lemmatizer = WordNetLemmatizer()

# Lemmatize Single Word
print(lemmatizer.lemmatize("bats"))


print(lemmatizer.lemmatize("are"))


print(lemmatizer.lemmatize("feet"))


bat
are
foot


In [None]:
nltk.download('punkt')
# Define the sentence to be lemmatized
sentence = "The striped bats are hanging on their feet for best"

# Tokenize: Split the sentence into words
word_list = nltk.word_tokenize(sentence)
print(word_list)


# Lemmatize list of words and join
lemmatized_output = ' '.join([lemmatizer.lemmatize(w) for w in word_list])
print(lemmatized_output)


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
['The', 'striped', 'bats', 'are', 'hanging', 'on', 'their', 'feet', 'for', 'best']
The striped bat are hanging on their foot for best


# Part of speech

In [None]:
print(lemmatizer.lemmatize("stripes", 'v'))  


print(lemmatizer.lemmatize("stripes", 'n'))  


strip
stripe


# Wordnet Lemmatizer with appropriate POS tag

In [None]:
nltk.download('averaged_perceptron_tagger')


[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [None]:
print(nltk.pos_tag(['feet']))
#> [('feet', 'NNS')]

print(nltk.pos_tag(nltk.word_tokenize(sentence)))
#> [('The', 'DT'), ('striped', 'JJ'), ('bats', 'NNS'), ('are', 'VBP'), ('hanging', 'VBG'), ('on', 'IN'), ('their', 'PRP$'), ('feet', 'NNS'), ('for', 'IN'), ('best', 'JJS')]

[('feet', 'NNS')]
[('The', 'DT'), ('striped', 'JJ'), ('bats', 'NNS'), ('are', 'VBP'), ('hanging', 'VBG'), ('on', 'IN'), ('their', 'PRP$'), ('feet', 'NNS'), ('for', 'IN'), ('best', 'JJS')]


In [None]:
from nltk.corpus import wordnet

def get_wordnet_pos(word):
    """Map POS tag to first character lemmatize() accepts"""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}

    return tag_dict.get(tag, wordnet.NOUN)


# 1. Init Lemmatizer
lemmatizer = WordNetLemmatizer()

# 2. Lemmatize Single Word with the appropriate POS tag
word = 'feet'
print(lemmatizer.lemmatize(word, get_wordnet_pos(word)))

# 3. Lemmatize a Sentence with the appropriate POS tag
sentence = "The striped bats are hanging on their feet for best"
print([lemmatizer.lemmatize(w, get_wordnet_pos(w)) for w in nltk.word_tokenize(sentence)])

 

foot
['The', 'strip', 'bat', 'be', 'hang', 'on', 'their', 'foot', 'for', 'best']
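
Note that `get_wordnet_pos` tags each word in isolation, so `striped` loses its sentence context: the result `strip` suggests it was treated as a verb, although `nltk.pos_tag` labelled it `JJ` in the full sentence. One possible alternative, sketched below, is to tag the whole sentence once and map each Penn Treebank tag to a WordNet POS letter. The helper name `penn_to_wordnet` is an assumption, and the tagged pairs are copied from the `nltk.pos_tag` output shown earlier so that the sketch runs without NLTK.

```python
# Sentence-level tags copied from the nltk.pos_tag output shown earlier.
TAGGED_SENTENCE = [
    ('The', 'DT'), ('striped', 'JJ'), ('bats', 'NNS'), ('are', 'VBP'),
    ('hanging', 'VBG'), ('on', 'IN'), ('their', 'PRP$'), ('feet', 'NNS'),
    ('for', 'IN'), ('best', 'JJS'),
]

# First letter of a Penn Treebank tag -> WordNet POS letter
# (the same values as wordnet.ADJ, wordnet.NOUN, wordnet.VERB, wordnet.ADV).
PENN_TO_WORDNET = {"J": "a", "N": "n", "V": "v", "R": "r"}


def penn_to_wordnet(tag):
    """Hypothetical helper: map a Penn Treebank tag to a WordNet POS, defaulting to noun."""
    return PENN_TO_WORDNET.get(tag[:1].upper(), "n")


word_pos_pairs = [(word, penn_to_wordnet(tag)) for word, tag in TAGGED_SENTENCE]
print(word_pos_pairs)
```

Each `(word, pos)` pair could then be passed to `lemmatizer.lemmatize(word, pos)`; with this mapping `striped` is treated as an adjective rather than a verb.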


# Using lemminflect

LemmInflect uses a dictionary approach to lemmatize English words and inflect them into forms specified by a user-supplied Universal Dependencies or Penn Treebank tag. The library also handles out-of-vocabulary (OOV) words by applying neural network techniques to classify word forms and choose the appropriate morphing rules.
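
The dictionary-first strategy with a fallback for unseen words can be sketched roughly as follows. This is a toy model under stated assumptions: the lookup table, the suffix rules, and the function name `toy_get_lemma` are all hypothetical, and plain suffix heuristics stand in for the neural classifier that LemmInflect uses to pick a morphing rule.

```python
# Toy sketch; none of these names come from LemmInflect.

# Hypothetical dictionary keyed by (word, universal POS tag).
TOY_DICTIONARY = {
    ("watches", "VERB"): "watch",
    ("watched", "VERB"): "watch",
    ("feet", "NOUN"): "foot",
}

# Hypothetical fallback rules per POS: (suffix to strip, text to append).
TOY_OOV_RULES = {
    "VERB": [("ing", ""), ("ed", ""), ("es", ""), ("s", "")],
    "NOUN": [("ies", "y"), ("s", "")],
}


def toy_get_lemma(word, upos):
    """Return the dictionary lemma if known, else apply the first matching rule."""
    key = (word.lower(), upos)
    if key in TOY_DICTIONARY:
        return TOY_DICTIONARY[key]
    for suffix, replacement in TOY_OOV_RULES.get(upos, []):
        # Require a stem of at least two characters to avoid mangling short words.
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: len(word) - len(suffix)] + replacement
    return word


print(toy_get_lemma("watches", "VERB"))   # found in the dictionary
print(toy_get_lemma("blorped", "VERB"))   # out of vocabulary, rule-based fallback
```

The made-up word `blorped` is not in the table, so the sketch falls through to the verb rules; a real library would choose the rule with a trained model rather than a fixed list.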

In [None]:
pip install lemminflect

Collecting lemminflect
  Downloading lemminflect-0.2.2-py3-none-any.whl (769 kB)

In [None]:
from lemminflect import getLemma
getLemma('watches', upos='VERB')


('watch',)

In [None]:
getLemma('watched', upos='VERB')

('watch',)
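
As the outputs above show, `getLemma` returns a tuple rather than a string. A small helper can unwrap it; `first_lemma` is a hypothetical name, and the sample tuple is copied from the outputs above so the sketch runs without LemmInflect installed.

```python
def first_lemma(lemmas, fallback):
    """Hypothetical helper: return the first lemma in the tuple, or the fallback if empty."""
    return lemmas[0] if lemmas else fallback


# Tuple copied from the getLemma outputs shown above.
print(first_lemma(('watch',), 'watches'))

# An empty tuple is assumed here only to exercise the fallback branch.
print(first_lemma((), 'watches'))
```

With the library installed, the call would look like `first_lemma(getLemma('watches', upos='VERB'), 'watches')`.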

# spaCy

spaCy is a relatively new entrant in the space and is billed as an industrial-strength NLP engine. It comes with pre-built models that can parse text and compute various NLP-related features through a single function call. Of course, it provides the lemma of each word too. Before we begin, let’s install spaCy and download the ‘en’ model.

In [None]:
import spacy

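# Initialize the spaCy 'en' model, disabling the parser and NER components that lemmatization does not need
# Note: the 'en' shortcut works in spaCy 2.x; spaCy 3.x dropped shortcut names, so load 'en_core_web_sm' there
nlp = spacy.load('en', disable=['parser', 'ner'])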

sentence = "The striped bats are hanging on their feet for best"

# Parse the sentence using the loaded 'en' model object `nlp`
doc = nlp(sentence)

# Extract the lemma for each token and join
" ".join([token.lemma_ for token in doc])


'the stripe bat be hang on -PRON- foot for good'

It performed all the lemmatizations that the WordNet Lemmatizer did when supplied with the correct POS tag, and it also lemmatized ‘best’ to ‘good’. Nice! The -PRON- placeholder appears wherever spaCy detects a pronoun (this is spaCy 2.x behaviour).
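
If the `-PRON-` marker is unwanted downstream, one option is to put the original token back in its place. The sketch below assumes parallel lists of token texts and lemmas; `restore_pronouns` is a hypothetical helper, and the lemma list is copied from the output above so the example runs without spaCy.

```python
def restore_pronouns(tokens, lemmas, placeholder="-PRON-"):
    """Replace the pronoun placeholder with the lowercased original token."""
    return [
        token.lower() if lemma == placeholder else lemma
        for token, lemma in zip(tokens, lemmas)
    ]


tokens = "The striped bats are hanging on their feet for best".split()
# Lemmas copied from the spaCy output shown above.
lemmas = "the stripe bat be hang on -PRON- foot for good".split()

print(" ".join(restore_pronouns(tokens, lemmas)))
```

With a parsed `doc`, the two lists could be built as `[t.text for t in doc]` and `[t.lemma_ for t in doc]`.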

# TextBlob Lemmatizer

TextBlob is a powerful, fast and convenient NLP package as well. Using the Word and TextBlob objects, it's quite straightforward to parse and lemmatize words and sentences respectively.

In [None]:
import nltk
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

In [None]:
pip install textblob



In [None]:
# pip install textblob
from textblob import TextBlob, Word

# Lemmatize a sentence
sentence = "The striped bats are hanging on their feet for best"
sent = TextBlob(sentence)
" ".join([w.lemmatize() for w in sent.words])
#> 'The striped bat are hanging on their foot for best'

'The striped bat are hanging on their foot for best'

# TextBlob Lemmatizer with appropriate POS tag

In [None]:
# Define function to lemmatize each word with its POS tag
def lemmatize_with_postag(sentence):
    sent = TextBlob(sentence)
    tag_dict = {"J": 'a', 
                "N": 'n', 
                "V": 'v', 
                "R": 'r'}
    words_and_tags = [(w, tag_dict.get(pos[0], 'n')) for w, pos in sent.tags]    
    lemmatized_list = [wd.lemmatize(tag) for wd, tag in words_and_tags]
    return " ".join(lemmatized_list)

# Lemmatize
sentence = "The striped bats are hanging on their feet for best"
lemmatize_with_postag(sentence)


'The striped bat be hang on their foot for best'

Thanks 