
In [1]:
import nltk
nltk.download('punkt')
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\USER\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\USER\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [2]:
ps = PorterStemmer()
ps

<PorterStemmer>

In [3]:
# Let's choose some words that share a stem
example_words = ["python","pythoner","pythoning","pythoned","pythonly"]
print('Stemming word Example')
print(example_words)
print('\n')
print('After Stemming, Words are')
# Next, we can easily stem by doing something like:
for w in example_words:
    print(ps.stem(w))

Stemming word Example
['python', 'pythoner', 'pythoning', 'pythoned', 'pythonly']


After Stemming, Words are
python
python
python
python
pythonli


In [4]:
# Now let's try stemming a typical sentence, rather than some words:

new_text = "It is important to be very pythonly while you are pythoning with python. All pythoners have pythoned poorly at least once."
print('Another Example')
print(new_text)
print('\n')

print('For each word, stemming done as follows')
# Word Tokenizer
words = word_tokenize(new_text)
# For each word, stemming done
for w in words:
    print(ps.stem(w))

Another Example
It is important to be very pythonly while you are pythoning with python. All pythoners have pythoned poorly at least once.


For each word, stemming done as follows
it
is
import
to
be
veri
pythonli
while
you
are
python
with
python
.
all
python
have
python
poorli
at
least
onc
.


In [5]:
# Lemmatizing

# Lemmatizing is closely related to stemming.
# The main difference, as the output above shows, is that stemming can
# produce non-existent words (for example, "pythonli"), whereas a lemma is always a real word.
# You generally cannot look up a stem in a dictionary, but you can look up a lemma.
# Sometimes the lemma is very similar to the original word,
# and sometimes it is a completely different word.
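
The dictionary-lookup idea can be sketched without NLTK. The snippet below is a toy illustration only: `TOY_LEMMAS` and `toy_lemmatize` are hypothetical names, and the small hand-written table stands in for a real lexicon such as WordNet. It assumes a lemma is found by looking up the word together with its part of speech, which is why the result is always a real word and can differ completely from the input.

```python
# Toy lemma table: (word, part of speech) -> dictionary form.
# Hand-written for illustration; a real lemmatizer consults a full lexicon.
TOY_LEMMAS = {
    ("cats", "n"): "cat",
    ("geese", "n"): "goose",
    ("cacti", "n"): "cactus",
    ("better", "a"): "good",
    ("ran", "v"): "run",
}


def toy_lemmatize(word, pos="n"):
    """Return the table's lemma for (word, pos), or the word unchanged if absent."""
    return TOY_LEMMAS.get((word.lower(), pos), word)


# The same surface form can map to different lemmas depending on the tag,
# and words missing from the table pass through untouched.
for word, pos in [("geese", "n"), ("better", "a"), ("better", "n"), ("iphones", "n")]:
    print(word, pos, "->", toy_lemmatize(word, pos))
```

In this sketch the default tag is a noun, so "better" only becomes "good" when it is explicitly tagged as an adjective.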

In [6]:
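# Note: on NLTK versions that raise the LookupError shown below,
# run nltk.download('omw-1.4') once before this cell.
lemmatizer = WordNetLemmatizer()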

print(lemmatizer.lemmatize("cats"))
print(lemmatizer.lemmatize("cacti"))
print(lemmatizer.lemmatize("geese"))
print(lemmatizer.lemmatize("rocks"))
print(lemmatizer.lemmatize("python"))
print(lemmatizer.lemmatize("better", pos="a"))
print(lemmatizer.lemmatize("best", pos="a"))
print(lemmatizer.lemmatize("run"))
print(lemmatizer.lemmatize("run", pos="v"))

# Inference:
# Lemmatization has limits.
# For example, Porter stems both "happiness" and "happy" to "happi",
# while WordNet lemmatizes each of the two words to itself.
# The WordNet lemmatizer also needs the word's part of speech;
# if none is given, it assumes the word is a noun.
# Finally, lemmatization cannot handle unknown words: for example,
# Porter stems both "iphone" and "iphones" to "iphon", while WordNet leaves both words unchanged.
# Normalization can also go too far by merging words that should stay distinct.
# For "shoe" and "shoes", we probably want to treat the two forms identically.
# But we would not want to do the same for "logistic" (a mathematical term) and "logistics" (a supply-chain term),
# which mean very different things despite their apparent similarity.
# Nor would we want to equate "universe" and "university",
# even though both words derive from the same Latin root.

LookupError: 
**********************************************************************
  Resource omw-1.4 not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('omw-1.4')

  For more information see: https://www.nltk.org/data.html

  Attempted to load corpora/omw-1.4

  Searched in:
    - 'C:\\Users\\USER/nltk_data'
    - 'C:\\Python37\\nltk_data'
    - 'C:\\Python37\\share\\nltk_data'
    - 'C:\\Python37\\lib\\nltk_data'
    - 'C:\\Users\\USER\\AppData\\Roaming\\nltk_data'
    - 'C:\\nltk_data'
    - 'D:\\nltk_data'
    - 'E:\\nltk_data'
**********************************************************************
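
The trade-off described in the inference notes can be sketched with a deliberately crude suffix stripper. This is a toy, not the Porter algorithm: `SUFFIXES`, `toy_stem`, and `group_by_stem` are hypothetical names chosen for illustration. The point is only that a rule-based stemmer cannot tell a merge we want ("shoe"/"shoes") from ones we do not ("logistic"/"logistics", "universe"/"university").

```python
from collections import defaultdict

# Hypothetical suffix list for a toy stemmer; this is NOT the Porter algorithm.
SUFFIXES = ("ity", "es", "s", "e")


def toy_stem(word, min_stem=3):
    """Strip the longest matching suffix once, keeping at least min_stem letters."""
    word = word.lower()
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


def group_by_stem(words):
    """Map each stem to the list of input words that reduce to it."""
    groups = defaultdict(list)
    for w in words:
        groups[toy_stem(w)].append(w)
    return dict(groups)


groups = group_by_stem(
    ["shoe", "shoes", "logistic", "logistics", "universe", "university"]
)
for stem, members in groups.items():
    print(stem, "<-", members)
```

All three pairs collapse to a single key each, which is helpful for the first pair and lossy for the other two.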


In [None]:
# Stemming

# Stemming is a form of normalization.
# Many variations of a word carry the same meaning, apart from differences such as tense.
# We stem to shorten the lookup and to normalize sentences.

# Consider:
# I was taking a ride in the car.
# I was riding in the car.
# These two sentences mean the same thing. "in the car" is the same,
# and "I was" is the same. The "was ... -ing" construction marks the past tense in both cases,
# so is it truly necessary to differentiate between "ride" and "riding"
# when we are just trying to work out what this past-tense activity was?
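
The two example sentences can be compared after a rough normalization step. The sketch below uses a hypothetical two-rule `toy_stem` (not Porter) and a simple regular-expression tokenizer in place of `word_tokenize`, so it runs without NLTK data. It assumes that reducing "ride" and "riding" to one key is enough to show the sentences sharing their main activity.

```python
import re


def toy_stem(word):
    """Toy rule (not Porter): drop a trailing 'ing' or 'e' if 3+ letters remain."""
    for suffix in ("ing", "e"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def stem_set(sentence):
    """Lowercase, split into alphabetic tokens, and return the set of stems."""
    return {toy_stem(tok) for tok in re.findall(r"[a-z]+", sentence.lower())}


a = stem_set("I was taking a ride in the car.")
b = stem_set("I was riding in the car.")

print("shared stems:", sorted(a & b))
print("only in the first sentence:", sorted(a - b))
```

After stemming, every stem of the second sentence also appears in the first, so a lookup keyed on stems treats "ride" and "riding" as the same entry.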