In [51]:
import nltk
from nltk.stem import PorterStemmer, SnowballStemmer, WordNetLemmatizer
from nltk.corpus import stopwords
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

In [34]:
stopwords.words('english')[:10]

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

**Over-stemming**

Over-stemming is when two words with different stems are stemmed to the same root. This is also known as a false positive.

universal

university

universe

All three words above are stemmed to univers, which is incorrect.
Although the three words are etymologically related, their modern meanings lie in widely different domains, so treating them as synonyms in NLP/NLU will likely reduce the relevance of search results.
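
The collapse described above can be reproduced with a deliberately crude suffix stripper. This is only a toy sketch, not the Porter or Snowball algorithm; `toy_stem` and its suffix list are made up for illustration.

```python
# Toy suffix-stripping stemmer (hypothetical; NOT the Porter algorithm).
SUFFIXES = ("ity", "al", "e")  # tried in this order


def toy_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[:-len(suffix)]
    return word


words = ["universal", "university", "universe"]
stems = {word: toy_stem(word) for word in words}
print(stems)
# {'universal': 'univers', 'university': 'univers', 'universe': 'univers'}
```

Three words with different meanings end up under one stem, which is exactly the false positive that over-stemming refers to.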

**Under-stemming**

Under-stemming is when two words that should be stemmed to the same root are not. This is also known as a false negative. For example, a stemmer may leave the following three forms of the same word with different stems:

alumnus

alumni

alumnae
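
A stemmer that only knows common English suffixes has no rule for Latin plurals, so these forms never meet. The sketch below illustrates this with a toy rule list; `toy_stem`, `group_by_stem` and `ENGLISH_SUFFIXES` are hypothetical names, and a real stemmer may produce different stems.

```python
from collections import defaultdict

# Toy stemmer that only knows a few English suffixes (hypothetical rule list).
ENGLISH_SUFFIXES = ("ing", "ed", "es", "s")


def toy_stem(word):
    """Strip the first matching English suffix; leave everything else alone."""
    for suffix in ENGLISH_SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


def group_by_stem(words):
    """Map each stem to the list of words that produced it."""
    groups = defaultdict(list)
    for word in words:
        groups[toy_stem(word)].append(word)
    return dict(groups)


groups = group_by_stem(["alumnus", "alumni", "alumnae"])
print(groups)
# Each word lands in its own group: the related forms were never merged.
```

Because every word ends up alone in its group, a search for one form would miss the other two, which is the false negative that under-stemming refers to.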

**1. WordNet Lemmatizer with NLTK**

In [0]:
paragraph = """Thank you all so very much. Thank you to the Academy. 
               Thank you to all of you in this room. I have to congratulate 
               the other incredible nominees this year. The Revenant was 
               the product of the tireless efforts of an unbelievable cast
               and crew. First off, to my brother in this endeavor, Mr. Tom 
               Hardy. Tom, your talent on screen can only be surpassed by 
               your friendship off screen … thank you for creating a t
               ranscendent cinematic experience. Thank you to everybody at 
               Fox and New Regency … my entire team. I have to thank 
               everyone from the very onset of my career … To my parents; 
               none of this would be possible without you. And to my 
               friends, I love you dearly; you know who you are. And lastly,
               I just want to say this: Making The Revenant was about
               man's relationship to the natural world. A world that we
               collectively felt in 2015 as the hottest year in recorded
               history. Our production needed to move to the southern
               tip of this planet just to be able to find snow. Climate
               change is real, it is happening right now. It is the most
               urgent threat facing our entire species, and we need to work
               collectively together and stop procrastinating. We need to
               support leaders around the world who do not speak for the 
               big polluters, but who speak for all of humanity, for the
               indigenous people of the world, for the billions and 
               billions of underprivileged people out there who would be
               most affected by this. For our children’s children, and 
               for those people out there whose voices have been drowned
               out by the politics of greed. I thank you all for this 
               amazing award tonight. Let us not take this planet for 
               granted. I do not take tonight for granted. Thank you so very much."""

In [0]:
sentences = nltk.sent_tokenize(paragraph)
lemmatizer = WordNetLemmatizer()
# Build the stop-word set once instead of once per word
stop_words = set(stopwords.words('english'))

# Lemmatize each sentence after dropping stop words
for i in range(len(sentences)):
    words = nltk.word_tokenize(sentences[i])
    words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]
    sentences[i] = ' '.join(words)

In [40]:
sentences

['Thank much .',
 'Thank Academy .',
 'Thank room .',
 'I congratulate incredible nominee year .',
 'The Revenant product tireless effort unbelievable cast crew .',
 'First , brother endeavor , Mr. Tom Hardy .',
 'Tom , talent screen surpassed friendship screen … thank creating ranscendent cinematic experience .',
 'Thank everybody Fox New Regency … entire team .',
 'I thank everyone onset career … To parent ; none would possible without .',
 'And friend , I love dearly ; know .',
 "And lastly , I want say : Making The Revenant man 's relationship natural world .",
 'A world collectively felt 2015 hottest year recorded history .',
 'Our production needed move southern tip planet able find snow .',
 'Climate change real , happening right .',
 'It urgent threat facing entire specie , need work collectively together stop procrastinating .',
 'We need support leader around world speak big polluter , speak humanity , indigenous people world , billion billion underprivileged people would aff
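
Note that capitalized stop words such as 'I', 'The' and 'And' survive in the output above: the NLTK English stop list is lowercase and the membership test is case-sensitive. A minimal sketch of a case-insensitive filter follows; the small `STOP_WORDS` set and the `remove_stop_words` helper are stand-ins for illustration, not part of NLTK.

```python
# Small hand-picked stop list for illustration only; in the notebook this
# would be set(stopwords.words('english')).
STOP_WORDS = {"i", "the", "and", "to", "of", "have", "was"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens whose lowercase form is in the stop list."""
    return [tok for tok in tokens if tok.lower() not in stop_words]


tokens = ["I", "have", "to", "congratulate", "the", "other", "incredible", "nominees"]
filtered = remove_stop_words(tokens)
print(filtered)
# ['congratulate', 'other', 'incredible', 'nominees']
```

Whether to lowercase before filtering depends on the task; it is usually harmless for stop words, but it can matter for proper nouns if the whole text is lowercased.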

In [0]:
s = '"The striped bats are hanging on their feet for best"'

In [46]:
# Tokenize: Split the sentence into words
word_list = nltk.word_tokenize(s)
print(word_list)

['``', 'The', 'striped', 'bats', 'are', 'hanging', 'on', 'their', 'feet', 'for', 'best', "''"]


In [0]:
# Lemmatize list of words and join
lemmatized_output = ' '.join([lemmatizer.lemmatize(w) for w in word_list])

In [48]:
print(lemmatized_output)

`` The striped bat are hanging on their foot for best ''


The result is not good: ‘are’ is not converted to ‘be’ and ‘hanging’ is not converted to ‘hang’, as one would expect. This happens because lemmatize() treats every word as a noun by default. It can be corrected by passing the correct part-of-speech tag (POS tag) as the second argument to lemmatize().


**2. WordNet Lemmatizer with appropriate POS tag**

Reference: https://www.nltk.org/book/ch05.html

A part-of-speech tagger, or POS tagger, processes a sequence of words and attaches a part-of-speech tag to each word.

In [52]:
text = nltk.word_tokenize("And now for something completely different")
nltk.pos_tag(text)

[('And', 'CC'),
 ('now', 'RB'),
 ('for', 'IN'),
 ('something', 'NN'),
 ('completely', 'RB'),
 ('different', 'JJ')]

Here we see that ‘and’ is CC, a coordinating conjunction; ‘now’ and ‘completely’ are RB, or adverbs; ‘for’ is IN, a preposition; ‘something’ is NN, a noun; and ‘different’ is JJ, an adjective.

In [0]:
word='feet'

In [60]:
tag = nltk.pos_tag([word])
tag

[('feet', 'NNS')]

In [65]:
print("rocks :", lemmatizer.lemmatize("rocks")) 
print("corpora :", lemmatizer.lemmatize("corpora"))  
# a denotes adjective in "pos" 
print("better :", lemmatizer.lemmatize("better", pos ="a")) 

rocks : rock
corpora : corpus
better : good


In [0]:
# Lemmatize with POS Tag
from nltk.corpus import wordnet

def get_wordnet_pos(word):
    """Map POS tag to first character lemmatize() accepts"""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}

    return tag_dict.get(tag, wordnet.NOUN)

In [0]:
# 1. Init Lemmatizer
lemmatizer = WordNetLemmatizer()

In [63]:
# 2. Lemmatize Single Word with the appropriate POS tag
word = 'feet'
print(lemmatizer.lemmatize(word, get_wordnet_pos(word)))

foot


In [64]:
# 3. Lemmatize a Sentence with the appropriate POS tag
sentence = "The striped bats are hanging on their feet for best"
print([lemmatizer.lemmatize(w, get_wordnet_pos(w)) for w in nltk.word_tokenize(sentence)])

['The', 'strip', 'bat', 'be', 'hang', 'on', 'their', 'foot', 'for', 'best']
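
`get_wordnet_pos` calls `nltk.pos_tag` on a one-word list, so the tagger never sees the surrounding words; this is likely why ‘striped’ came out as ‘strip’. One alternative is to tag the whole sentence once and map each Penn Treebank tag afterwards. The sketch below shows only the mapping step in plain Python: `to_wordnet_pos` and `with_wordnet_pos` are hypothetical helpers, and the tags in `tagged` are written by hand for illustration rather than produced by the tagger.

```python
# WordNet POS codes are single characters: 'a' (adjective), 'n' (noun),
# 'v' (verb), 'r' (adverb). These are the values of wordnet.ADJ, wordnet.NOUN,
# wordnet.VERB and wordnet.ADV.
PENN_TO_WORDNET = {"J": "a", "N": "n", "V": "v", "R": "r"}


def to_wordnet_pos(penn_tag):
    """Map a Penn Treebank tag to a WordNet POS code, defaulting to noun."""
    return PENN_TO_WORDNET.get(penn_tag[:1].upper(), "n")


def with_wordnet_pos(tagged_tokens):
    """Turn (word, penn_tag) pairs into (word, wordnet_pos) pairs."""
    return [(word, to_wordnet_pos(tag)) for word, tag in tagged_tokens]


# Hand-written tags for illustration; in practice they would come from
# nltk.pos_tag(nltk.word_tokenize(sentence)), called once per sentence.
tagged = [("The", "DT"), ("striped", "JJ"), ("bats", "NNS"),
          ("are", "VBP"), ("hanging", "VBG")]
pairs = with_wordnet_pos(tagged)
print(pairs)
# [('The', 'n'), ('striped', 'a'), ('bats', 'n'), ('are', 'v'), ('hanging', 'v')]
```

Each pair can then be passed on as `lemmatizer.lemmatize(word, pos)`.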


**3. spaCy Lemmatization**

In [0]:
import spacy

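# Initialize spaCy's English model, keeping only the tagger component needed for lemmatization.
# The 'en' shortcut is spaCy 2.x syntax; spaCy 3.x removed shortcuts, so load 'en_core_web_sm' there.
nlp = spacy.load('en', disable=['parser', 'ner'])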

In [67]:
sentence = "The striped bats are hanging on their feet for best"
# Parse the sentence using the loaded 'en' model object `nlp`
doc = nlp(sentence)
doc

The striped bats are hanging on their feet for best

In [68]:
# Extract the lemma for each token and join
" ".join([token.lemma_ for token in doc])

'the stripe bat be hang on -PRON- foot for good'

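spaCy handled ‘bats’, ‘are’, ‘hanging’ and ‘feet’ just as the WordNet lemmatizer did when given the correct POS tag, and it also lemmatized ‘best’ to ‘good’. Nice! Note that it additionally lowercased ‘The’ and replaced the pronoun ‘their’ with the placeholder -PRON-.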
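
If the -PRON- placeholder emitted by this spaCy version is unwanted downstream, one option is to fall back to the original token for those positions. A minimal sketch, using the tokens and lemmas shown in the cells above as plain strings; `restore_pronouns` is a hypothetical helper, and with a real `doc` the two lists would come from `token.text` and `token.lemma_`.

```python
def restore_pronouns(tokens, lemmas, placeholder="-PRON-"):
    """Replace the pronoun placeholder with the lowercased original token."""
    return [tok.lower() if lem == placeholder else lem
            for tok, lem in zip(tokens, lemmas)]


# Tokens and lemmas copied from the cells above.
tokens = "The striped bats are hanging on their feet for best".split()
lemmas = "the stripe bat be hang on -PRON- foot for good".split()
restored = " ".join(restore_pronouns(tokens, lemmas))
print(restored)
# the stripe bat be hang on their foot for good
```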

**4. TextBlob Lemmatizer**

In [0]:
from textblob import TextBlob, Word

In [71]:
# Lemmatize a word
word = 'stripes'
w = Word(word)
w

'stripes'

In [72]:
w.lemmatize()

'stripe'

In [73]:
# Lemmatize a sentence
sentence = "The striped bats are hanging on their feet for best"
sent = TextBlob(sentence)
" ".join([w.lemmatize() for w in sent.words])

'The striped bat are hanging on their foot for best'

**5. TextBlob Lemmatizer with appropriate POS tag**

In [74]:
# Define function to lemmatize each word with its POS tag
def lemmatize_with_postag(sentence):
    sent = TextBlob(sentence)
    tag_dict = {"J": 'a', 
                "N": 'n', 
                "V": 'v', 
                "R": 'r'}
    words_and_tags = [(w, tag_dict.get(pos[0], 'n')) for w, pos in sent.tags]    
    lemmatized_list = [wd.lemmatize(tag) for wd, tag in words_and_tags]
    return " ".join(lemmatized_list)

# Lemmatize
sentence = "The striped bats are hanging on their feet for best"
lemmatize_with_postag(sentence)

'The striped bat be hang on their foot for best'

**Comparing NLTK, TextBlob, spaCy, Pattern and Stanford CoreNLP**

In [0]:
sentence = """Following mice attacks, caring farmers were marching to Delhi for better living conditions. 
Delhi police on Tuesday fired water cannons and teargas shells at protesting farmers as they tried to 
break barricades with their cars, automobiles and tractors."""

In [87]:
import string
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [86]:
# NLTK
from nltk.stem import WordNetLemmatizer
import string
lemmatizer = WordNetLemmatizer()
print(" ".join([lemmatizer.lemmatize(w, get_wordnet_pos(w)) for w in nltk.word_tokenize(sentence) if w not in string.punctuation]))

Following mouse attack care farmer be march to Delhi for well living condition Delhi police on Tuesday fire water cannon and teargas shell at protest farmer a they try to break barricade with their car automobile and tractor


In [89]:
# Spacy
import spacy
nlp = spacy.load('en', disable=['parser', 'ner'])
doc = nlp(sentence)
print(" ".join([token.lemma_ for token in doc]))

follow mice attack , care farmer be march to Delhi for well living condition . 
 Delhi police on Tuesday fire water cannon and teargas shell at protest farmer as -PRON- try to 
 break barricade with -PRON- car , automobile and tractor .


In [90]:
# TextBlob
print(lemmatize_with_postag(sentence))

Following mouse attack care farmer be march to Delhi for good living condition Delhi police on Tuesday fire water cannon and teargas shell at protest farmer a they try to break barricade with their car automobile and tractor
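
Long outputs like these are hard to compare by eye. The sketch below lines up two outputs token by token and reports where they disagree; `diff_lemmas` is a hypothetical helper, and the two strings are the opening tokens of the NLTK and TextBlob outputs shown above. The spaCy output keeps punctuation, so it would need the punctuation stripped before it could be compared this way.

```python
def diff_lemmas(first, second):
    """Return (index, first_token, second_token) wherever two outputs disagree."""
    a, b = first.split(), second.split()
    if len(a) != len(b):
        raise ValueError("outputs must have the same number of tokens")
    return [(i, x, y) for i, (x, y) in enumerate(zip(a, b)) if x != y]


# Opening tokens of the NLTK and TextBlob outputs shown above.
nltk_out = "Following mouse attack care farmer be march to Delhi for well living condition"
textblob_out = "Following mouse attack care farmer be march to Delhi for good living condition"
diffs = diff_lemmas(nltk_out, textblob_out)
print(diffs)
# [(10, 'well', 'good')]
```

On this sample the only disagreement is how ‘better’ is lemmatized: ‘well’ by NLTK and ‘good’ by TextBlob.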
