<h1> <center> Stemming and Lemmatization </center> </h1>

In [1]:
"""
Why do we preprocess text with steps such as tokenization and lemmatization? Because a model represents a sentence by
assigning a number to each distinct word. So what is the issue?

Ex: if the sentence contains jump, JUMPING, Jump, jUmped, they all mean the same thing, but without preprocessing the model
assigns a different number to each form, which inflates the vocabulary and causes dimensionality problems.

So, to reduce dimensionality and process text efficiently, we need to normalize the raw sentence first.



****************************************************STEMMING VS LEMMATIZATION**************************************************

eg: jump, jumping, jumped -----------------> what is the root word? jump, right? Stemming and lemmatization both reduce
a word to its base (root) form.

What is the difference?

Stemming strips suffixes by rule, so it can produce roots that are not real words.
Lemmatization, on the other hand, produces roots that are valid dictionary words.

eg:  COMPUTING-----------> STEMMING------------> ******** COMPUT *******
        |
        |
        |
        |
        ------------------> LEMMATIZATION-------> *******COMPUTE******
        

Which one to use?

Lemmatization might sound like the preferred choice, but it is slower, since it has to find a valid root word rather than just strip a suffix.
If the exact words matter, as in a chatbot for a website, lemmatization is preferred.

Otherwise, for tasks such as basic sentiment analysis, stemming can be used.

*******************************************************************************************************************************
"""

import nltk 
nltk.download("wordnet")

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\angsh\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True
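
The numbering problem described above can be made concrete with a small sketch. It assumes a toy `build_vocab` helper (hypothetical, not part of NLTK) that assigns an integer ID to every distinct token, and it borrows NLTK's `PorterStemmer`, which the next section introduces, as the normalizer.

```python
from nltk.stem import PorterStemmer


def build_vocab(tokens, normalize=None):
    """Assign an integer ID to every distinct (optionally normalized) token.

    Hypothetical helper for illustration only.
    """
    vocab = {}
    for tok in tokens:
        key = normalize(tok) if normalize else tok
        # len(vocab) is the next free ID; existing keys keep their ID
        vocab.setdefault(key, len(vocab))
    return vocab


tokens = ["jump", "JUMPING", "Jump", "jUmped"]
porter = PorterStemmer()

raw_vocab = build_vocab(tokens)                          # no preprocessing
lower_vocab = build_vocab(tokens, normalize=str.lower)   # lowercasing only
stem_vocab = build_vocab(tokens, normalize=porter.stem)  # stemming

print(len(raw_vocab), len(lower_vocab), len(stem_vocab))  # → 4 3 1
```

Lowercasing alone merges `jump` and `Jump`, but only stemming collapses all four forms into a single entry.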

<h1> <center> STEMMING </center> </h1>

In [2]:
"""
                                            Types of Stemming Packages in nltk
                                                1. Snowball Stemmer
                                                2. Lancaster Stemmer
                                                3. Porter Stemmer
Each of them applies a slightly different set of suffix-stripping rules, so their outputs can differ.
                                                    
"""
from nltk.stem import SnowballStemmer,LancasterStemmer,PorterStemmer
lancaster=LancasterStemmer()
porter=PorterStemmer()
snowball=SnowballStemmer('english')

In [3]:
import pandas as pd

# let us create some words
words = ["hobbies", "computer", "running", "SUPERB", "gangster"]
# stem() takes a single string, not a list, so apply it word by word
stem_dict = {"Words": words,
             "Snowball": [snowball.stem(x) for x in words],
             "Lancaster": [lancaster.stem(x) for x in words],
             "Porter": [porter.stem(x) for x in words]}
pd.DataFrame(stem_dict)

      Words  Snowball  Lancaster    Porter
0   hobbies     hobbi      hobby     hobbi
1  computer    comput     comput    comput
2   running       run        run       run
3    SUPERB    superb     superb    superb
4  gangster  gangster     gangst  gangster
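
The `SUPERB` row above comes back in lower case from all three stemmers. A minimal check of that behaviour, assuming the same three NLTK stemmers with their default settings, compares the stem of the original word with the stem of its lowercased form:

```python
from nltk.stem import LancasterStemmer, PorterStemmer, SnowballStemmer

stemmers = {
    "Snowball": SnowballStemmer("english"),
    "Lancaster": LancasterStemmer(),
    "Porter": PorterStemmer(),
}

word = "SUPERB"
# True for a stemmer if the input's case makes no difference to the stem
case_insensitive = {
    name: stemmer.stem(word) == stemmer.stem(word.lower())
    for name, stemmer in stemmers.items()
}
print(case_insensitive)
```

Each entry should be `True`, which is why no explicit `.lower()` call is needed before stemming.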


In [4]:
# let's take a sentence
sentence="JUmping going cooling heating computer data science shit rubbish what the fucking job is it go to hell"
# let's tokenize the sentence (word_tokenize needs NLTK's Punkt tokenizer data, downloaded separately from wordnet)
token=nltk.word_tokenize(sentence)
print([snowball.stem(x) for x in token])

['jump', 'go', 'cool', 'heat', 'comput', 'data', 'scienc', 'shit', 'rubbish', 'what', 'the', 'fuck', 'job', 'is', 'it', 'go', 'to', 'hell']


In [5]:
print([lancaster.stem(x) for x in token])

['jump', 'going', 'cool', 'heat', 'comput', 'dat', 'sci', 'shit', 'rub', 'what', 'the', 'fuck', 'job', 'is', 'it', 'go', 'to', 'hel']


In [6]:
print([porter.stem(x) for x in token])

['jump', 'go', 'cool', 'heat', 'comput', 'data', 'scienc', 'shit', 'rubbish', 'what', 'the', 'fuck', 'job', 'is', 'it', 'go', 'to', 'hell']
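
The three outputs above are easier to compare side by side. The sketch below keeps only the tokens on which the stemmers disagree; it assumes a shortened version of the sentence and uses `str.split` instead of `word_tokenize` so that no tokenizer data is required.

```python
from nltk.stem import LancasterStemmer, PorterStemmer, SnowballStemmer

snowball = SnowballStemmer("english")
lancaster = LancasterStemmer()
porter = PorterStemmer()

# Assumption: a shortened sentence, split on whitespace for simplicity
tokens = "going cooling heating computer data science rubbish hell".split()

disagreements = {}
for tok in tokens:
    stems = (snowball.stem(tok), lancaster.stem(tok), porter.stem(tok))
    if len(set(stems)) > 1:  # at least one stemmer produced a different stem
        disagreements[tok] = stems

for tok, (snow, lanc, port) in disagreements.items():
    print(f"{tok}: snowball={snow} lancaster={lanc} porter={port}")
```

In the outputs recorded above, Snowball and Porter agree on every token, while Lancaster cuts more aggressively (`data` becomes `dat`, `science` becomes `sci`).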


<h1> <center> LEMMATIZATION </center> </h1>

In [7]:
from nltk.stem import WordNetLemmatizer
lemma=WordNetLemmatizer()
print([lemma.lemmatize(x) for x in token])

['JUmping', 'going', 'cooling', 'heating', 'computer', 'data', 'science', 'shit', 'rubbish', 'what', 'the', 'fucking', 'job', 'is', 'it', 'go', 'to', 'hell']


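But wait! The words haven't changed. That's because the lemmatizer needs each word's part of speech as context; without it, every word is treated as a noun (the default is <code>pos='n'</code>). The part of speech is passed through the <code>pos</code> parameter, and assigning parts of speech to words is called <b>tagging</b>.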

In [8]:
print(lemma.lemmatize('jumping', pos='v'))  # 'v' = verb
print(lemma.lemmatize('eating', pos='v'))
print(lemma.lemmatize('pizzas', pos='n'))   # 'n' = noun
print(lemma.lemmatize('Jumping', pos='v'))  # lemmatize is case-sensitive, so lowercase the word first

jump
eat
pizza
Jumping


<h1>Can you spot the disadvantage? We have to manually tag each word with its part of speech, which is cumbersome for large text documents. There are approaches for automatic tagging that will be discussed later. </h1>
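
One common approach to the automatic tagging mentioned above is to run NLTK's `pos_tag` and translate its Penn Treebank tags into the letters that `lemmatize` expects. The sketch below is one possible way to do that, not necessarily the approach this notebook covers later; `penn_to_wordnet` and `lemmatize_tokens` are hypothetical helper names, and `pos_tag` needs NLTK's perceptron tagger data to be downloaded in addition to wordnet.

```python
import nltk
from nltk.stem import WordNetLemmatizer


def penn_to_wordnet(tag):
    """Map a Penn Treebank tag (e.g. 'VBG') to a WordNet pos letter."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # nouns and everything else; 'n' is also lemmatize's default


def lemmatize_tokens(tokens):
    """Tag each token automatically, then lemmatize its lowercased form."""
    lemmatizer = WordNetLemmatizer()
    tagged = nltk.pos_tag(tokens)  # list of (token, Penn Treebank tag) pairs
    return [
        lemmatizer.lemmatize(word.lower(), pos=penn_to_wordnet(tag))
        for word, tag in tagged
    ]
```

Calling `lemmatize_tokens(token)` on the tokenized sentence from the stemming section would then lemmatize every word without any manual `pos` arguments. The quality of the result depends on how accurately the tagger labels each word.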

<h2> <i> Always remember to convert strings to lower case before lemmatization. The stemmers do that on their own