Stemming and Lemmatization
--

A> Stemming
--
In this code example, we will discuss stemming. Stemming is the process of
reducing a word to its root form by stripping affixes. For example, “fish,”
“fishes,” and “fishing” are all stemmed to fish.

Problem
--
You want to do stemming.

Solution
--
The simplest way to do this is by using NLTK or the TextBlob library.

In [1]:
# Read the text data
text=['I like fishing','I eat fish','There are lots of fishes in pound']

#convert list to dataframe
import pandas as pd
df = pd.DataFrame({'tweet':text})
print(df)

                               tweet
0                     I like fishing
1                         I eat fish
2  There are lots of fishes in pound


In [2]:
# Stemming the text

#Import library
from nltk.stem import PorterStemmer

## use PorterStemmer to find stem of the words 
st = PorterStemmer()

df['tweet'][:5].apply(lambda x: " ".join([st.stem(word) for word in x.split()]))

# If you observe this, you will notice that fish, fishing, and fishes 
# have been stemmed to fish.

0                       I like fish
1                        I eat fish
2    there are lot of fish in pound
Name: tweet, dtype: object
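
To see what the stemmer does to individual words, a minimal sketch (assuming NLTK is installed; the word list is only illustrative) can call PorterStemmer.stem() directly on lowercase words. It suggests that a stem is not always a dictionary word.

```python
from nltk.stem import PorterStemmer

st = PorterStemmer()

# Illustrative lowercase words; a stem need not be a valid English word
words = ["fish", "fishes", "fishing", "leafs", "leaves"]

# Map each word to its Porter stem
stems = {word: st.stem(word) for word in words}

for word, stem in stems.items():
    print(f"{word} -> {stem}")
```

The three forms of fish share one stem, while leaves is cut down to leav, which is not a valid word.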

B> Lemmatizing
--
In this section, we will discuss lemmatization. 

Lemmatization is the process of reducing a word to its dictionary form (its lemma) by consulting a vocabulary. For example, “good,” “better,” and “best” are lemmatized to good.
Lemmatization takes the part of speech of a word into account. It
returns the dictionary form of the word, which must be a valid word, whereas
stemming just strips affixes and may return something that is not a word.

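> Lemmatization handles matching “car” to “cars.” Matching “car” to “automobile” is a synonym lookup, which a vocabulary such as WordNet supports but which the lemmatizer itself does not perform.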

> Stemming handles matching “car” to “cars.”

Lemmatization can give better results than stemming, for example:

> The stemmed form of leafs is leaf.

> The stemmed form of leaves is leav.

> The lemmatized form of leafs is leaf.

> The lemmatized form of leaves is leaf.
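
The difference can be sketched in plain Python. This is a toy illustration only: the two-word vocabulary, the suffix rules, and the function names naive_stem and toy_lemmatize are all made up for this example and are far simpler than the Porter stemmer or WordNet. It shows the idea that a lemmatizer only accepts a candidate that exists in its vocabulary, while a stemmer returns whatever is left after stripping a suffix.

```python
# Toy data, assumed for illustration; not WordNet's vocabulary or rules
VOCABULARY = {"leaf", "fish"}

# (suffix, replacement) pairs tried in order by the toy lemmatizer
DETACHMENT_RULES = [("ves", "f"), ("es", ""), ("s", "")]


def naive_stem(word):
    """Strip the first matching suffix without checking the result."""
    for suffix in ("es", "s"):
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word):
    """Return the first rule-generated candidate found in the vocabulary."""
    if word in VOCABULARY:
        return word
    for suffix, replacement in DETACHMENT_RULES:
        if word.endswith(suffix):
            candidate = word[: -len(suffix)] + replacement
            if candidate in VOCABULARY:
                return candidate
    # No valid candidate: leave the word unchanged
    return word


for w in ["leafs", "leaves", "fishes"]:
    print(w, "| stem:", naive_stem(w), "| lemma:", toy_lemmatize(w))
```

Because every candidate is checked against the vocabulary, the toy lemmatizer never returns a non-word such as leav.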

Problem
--
You want to perform lemmatization.

Solution
--
The simplest way to do this is by using NLTK or the TextBlob library.

In [5]:
text=['I like fishing','I eat fish','There are many fishes in pound', 'leaves and leaf']

#convert list to dataframe
import pandas as pd
df = pd.DataFrame({'tweet':text})
print(df)

                            tweet
0                  I like fishing
1                      I eat fish
2  There are many fishes in pound
3                 leaves and leaf


In [7]:
# Lemmatizing the data

#Import library
from textblob import Word

#Code for lemmatize
df['tweet'].apply(lambda x: " ".join([Word(word).lemmatize() for word in x.split()]) )





0                  I like fishing
1                      I eat fish
2    There are many fish in pound
3                   leaf and leaf
Name: tweet, dtype: object

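You can observe that fish and fishes are lemmatized to fish and, as
explained, leaves and leaf are lemmatized to leaf. The word fishing is left
unchanged because, without a part-of-speech hint, the lemmatizer treats it
as a noun.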

In [8]:
import nltk 
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Zuhrah\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [9]:
import nltk
from nltk.stem import WordNetLemmatizer 

# Init the Wordnet Lemmatizer
lemmatizer = WordNetLemmatizer()

# Lemmatize Single Word
print(lemmatizer.lemmatize("bats"))
#> bat

print(lemmatizer.lemmatize("are"))
#> are

print(lemmatizer.lemmatize("feet"))
#> foot

bat
are
foot


"""
Let’s lemmatize a simple sentence. We first tokenize the sentence into words using nltk.word_tokenize and then we will call lemmatizer.lemmatize() on each word. 
"""


In [13]:
# Define the sentence to be lemmatized
sentence = "The striped bats are hanging on their feet for best"

# Tokenize: Split the sentence into words
word_list = nltk.word_tokenize(sentence)
print(word_list)
#> ['The', 'striped', 'bats', 'are', 'hanging', 'on', 'their', 'feet', 'for', 'best']

# Lemmatize list of words and join
lemmatized_output = ' '.join([lemmatizer.lemmatize(w) for w in word_list])
print(lemmatized_output)
#> The striped bat are hanging on their foot for best


['The', 'striped', 'bats', 'are', 'hanging', 'on', 'their', 'feet', 'for', 'best']
The striped bat are hanging on their foot for best


"""
The above code is a simple example of how to use the wordnet lemmatizer on words and sentences.

Notice that it didn’t do a good job: ‘are’ is not converted to ‘be’ and ‘hanging’ is not converted to ‘hang’ as expected. This can be corrected if we provide the correct part-of-speech tag (POS tag) as the second argument to lemmatize().

Sometimes the same word can have multiple lemmas, depending on its meaning or context.
"""

In [11]:
print(lemmatizer.lemmatize("stripes", 'v'))  
#> strip

print(lemmatizer.lemmatize("stripes", 'n'))  
#> stripe


strip
stripe



Wordnet Lemmatizer with appropriate POS tag
-------------------------------------------

For large texts, it may not be possible to manually provide the correct POS tag for every word. So, instead, we will find the correct POS tag for each word, map it to the right input character that the WordNetLemmatizer accepts, and pass it as the second argument to lemmatize().

So how do we get the POS tag for a given word?

In nltk, it is available through the nltk.pos_tag() function. It accepts only a list of words, even if it is a single word.


In [14]:
print(nltk.pos_tag(['feet']))
#> [('feet', 'NNS')]

print(nltk.pos_tag(nltk.word_tokenize(sentence)))
#> [('The', 'DT'), ('striped', 'JJ'), ('bats', 'NNS'), ('are', 'VBP'), ('hanging', 'VBG'), ('on', 'IN'), ('their', 'PRP$'), ('feet', 'NNS'), ('for', 'IN'), ('best', 'JJS')]


[('feet', 'NNS')]
[('The', 'DT'), ('striped', 'JJ'), ('bats', 'NNS'), ('are', 'VBP'), ('hanging', 'VBG'), ('on', 'IN'), ('their', 'PRP$'), ('feet', 'NNS'), ('for', 'IN'), ('best', 'JJS')]



nltk.pos_tag() returns a list of (word, POS tag) tuples.

The key here is to map NLTK’s POS tags to the format the WordNet lemmatizer accepts.

The get_wordnet_pos() function defined below does this mapping job.


In [16]:
# Lemmatize with POS Tag
from nltk.corpus import wordnet

def get_wordnet_pos(word):
    """Map POS tag to first character lemmatize() accepts"""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}

    return tag_dict.get(tag, wordnet.NOUN)

# How the indexing works, e.g. nltk.pos_tag(['feet']) gives [('feet', 'NNS')]:
# [0]       -> ('feet', 'NNS')
# [0][1]    -> 'NNS'
# [0][1][0] -> 'N'

# If no tag matches, the default is wordnet.NOUN.
# For example, [0][1][0] extracts 'D' from [('The', 'DT')],
# but 'D' is not in tag_dict, so the default wordnet.NOUN is returned.

# 1. Init Lemmatizer
lemmatizer = WordNetLemmatizer()

# 2. Lemmatize Single Word with the appropriate POS tag
word = 'feet'
print(lemmatizer.lemmatize(word, get_wordnet_pos(word)))

# 3. Lemmatize a Sentence with the appropriate POS tag
sentence = "The striped bats are hanging on their feet for best"
print([lemmatizer.lemmatize(w, get_wordnet_pos(w)) for w in nltk.word_tokenize(sentence)])
#> ['The', 'strip', 'bat', 'be', 'hang', 'on', 'their', 'foot', 'for', 'best']



foot
['The', 'strip', 'bat', 'be', 'hang', 'on', 'their', 'foot', 'for', 'best']
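
One limitation of get_wordnet_pos() is that it tags each word in isolation, so the tagger cannot use the surrounding words. In the output above, 'striped' comes back as 'strip' even though the sentence-level tagger labelled it JJ. A possible alternative, sketched here under assumptions, is to tag the whole sentence once and convert each Penn Treebank tag. The helper name penn_to_wordnet is hypothetical, the tagged list is copied from the nltk.pos_tag() output shown earlier rather than recomputed, and the letters 'a', 'n', 'v' and 'r' are the values of wordnet.ADJ, wordnet.NOUN, wordnet.VERB and wordnet.ADV.

```python
# Values of wordnet.ADJ, wordnet.NOUN, wordnet.VERB and wordnet.ADV in NLTK
TAG_TO_WORDNET = {"J": "a", "N": "n", "V": "v", "R": "r"}


def penn_to_wordnet(penn_tag):
    """Map a Penn Treebank tag (e.g. 'VBG') to a WordNet POS letter.

    Hypothetical helper: falls back to noun when the tag has no mapping.
    """
    return TAG_TO_WORDNET.get(penn_tag[:1].upper(), "n")


# Sentence-level tags copied from the nltk.pos_tag() output shown earlier
tagged = [('The', 'DT'), ('striped', 'JJ'), ('bats', 'NNS'), ('are', 'VBP'),
          ('hanging', 'VBG'), ('on', 'IN'), ('their', 'PRP$'),
          ('feet', 'NNS'), ('for', 'IN'), ('best', 'JJS')]

# Pair every word with the POS letter the lemmatizer expects
word_pos_pairs = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(word_pos_pairs)
```

Each (word, pos) pair could then be passed to lemmatizer.lemmatize(word, pos). With this mapping, 'striped' is treated as an adjective ('a') and 'hanging' as a verb ('v').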


#------------------------------------------------

# Extra Reading for Data Science lovers :
https://www.datacamp.com/community/tutorials/stemming-lemmatization-python

https://www.machinelearningplus.com/nlp/lemmatization-examples-python/

#-------------------------------------------------------------------