## Stemming

In [42]:
from nltk.stem import PorterStemmer
from nltk.stem import LancasterStemmer
from nltk.tokenize import TreebankWordTokenizer

In [67]:
doc1 = "learning learns learned learnt caresses pennies"

In [68]:
tokenizer = TreebankWordTokenizer()
ps = PorterStemmer()

In [69]:
tokens = tokenizer.tokenize(doc1)

In [70]:
# Stem each token with the Porter stemmer
stemmed_tokens = [ps.stem(token) for token in tokens]

In [71]:
stemmed_tokens

['learn', 'learn', 'learn', 'learnt', 'caress', 'penni']
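The import cell also loads `LancasterStemmer`, which the cells above never use. A minimal sketch for running both stemmers on the same tokens is shown here; the variable names (`porter_stems`, `lancaster_stems`) are illustrative and not from the original notebook, and no Lancaster output is listed because it was not part of the original run.

```python
from nltk.stem import LancasterStemmer, PorterStemmer
from nltk.tokenize import TreebankWordTokenizer

doc = "learning learns learned learnt caresses pennies"
tokens = TreebankWordTokenizer().tokenize(doc)

porter = PorterStemmer()
lancaster = LancasterStemmer()

# Stem every token with each algorithm so the results can be compared.
porter_stems = [porter.stem(token) for token in tokens]
lancaster_stems = [lancaster.stem(token) for token in tokens]

# Print the original token next to both stems.
for token, p_stem, l_stem in zip(tokens, porter_stems, lancaster_stems):
    print(f"{token:10} porter={p_stem:10} lancaster={l_stem}")
```

The Lancaster algorithm is generally more aggressive than Porter, so it is worth inspecting both outputs before choosing one.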

In [72]:
# Let's look at an example to see how stemming reduces the vocabulary size

In [49]:
corpus = ['they are watching movies', 'they all watched movies', 'he always watches that movie']

In [50]:
tokenizer2 = TreebankWordTokenizer()
ps2 = PorterStemmer()

In [57]:
vocabulary = []          # unique stems, in order of first appearance
all_stemmed_tokens = []  # one list of stems per document
for doc in corpus:
    tokens = tokenizer2.tokenize(doc)
    stemmed_tokens = []
    for token in tokens:
        st = ps2.stem(token)
        stemmed_tokens.append(st)
        # Add each stem to the vocabulary only once
        if st not in vocabulary:
            vocabulary.append(st)
    all_stemmed_tokens.append(stemmed_tokens)

In [58]:
all_stemmed_tokens

[['they', 'are', 'watch', 'movi'],
 ['they', 'all', 'watch', 'movi'],
 ['he', 'alway', 'watch', 'that', 'movi']]

In [59]:
vocabulary

['they', 'are', 'watch', 'movi', 'all', 'he', 'alway', 'that']

In [60]:
# Without stemming, this corpus has 11 unique tokens (out of 13 tokens in total); after stemming, the vocabulary size drops to 8.
# On a large corpus, this reduction can be substantial.
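As a quick check, the sketch below counts the unique tokens before and after stemming for the same three-document corpus. It uses sets instead of the list-based loop above, which is a choice made here for brevity; the names `raw_vocab` and `stemmed_vocab` are illustrative.

```python
from nltk.stem import PorterStemmer
from nltk.tokenize import TreebankWordTokenizer

corpus = [
    'they are watching movies',
    'they all watched movies',
    'he always watches that movie',
]

tokenizer = TreebankWordTokenizer()
stemmer = PorterStemmer()

# Collect unique raw tokens and unique stems across all documents.
raw_vocab = set()
stemmed_vocab = set()
for doc in corpus:
    for token in tokenizer.tokenize(doc):
        raw_vocab.add(token)
        stemmed_vocab.add(stemmer.stem(token))

print(len(raw_vocab))      # 11
print(len(stemmed_vocab))  # 8
```

Sets do not keep insertion order, so the list-based version is still the one to use when the order of the vocabulary matters.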

### Lemmatization

Lemmatization is the process of reducing inflected or derived words to their base form, known as the lemma, using morphological analysis. The lemma is the form under which the word is listed in a dictionary, and it is usually identical to the word's morphological root.

If we apply lemmatization to a document corpus, the inflected and derived forms of a word are mapped to a single lemma, so they are treated as one word when we build the vocabulary. This can considerably reduce the vocabulary size, especially for a large corpus.

In [78]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\prasa\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\wordnet.zip.


True

In [79]:
from nltk.stem import WordNetLemmatizer

In [80]:
doc2 = "learning learns learned learnt caresses pennies"

In [81]:
tokenizer3 = TreebankWordTokenizer()
lm = WordNetLemmatizer()

In [82]:
# Lemmatize each token (no part-of-speech tag is passed here)
lemm_tokens = []
for token in tokenizer3.tokenize(doc2):
    l_token = lm.lemmatize(token)
    lemm_tokens.append(l_token)

In [83]:
print(lemm_tokens)

['learning', 'learns', 'learned', 'learnt', 'caress', 'penny']


### Note:
#### If we stem the word pennies, it is converted to penni, whereas if we lemmatize the same word, it is converted to penny, which is its dictionary form.
#### Whether to use stemming or lemmatization depends on the problem. Try both and see which one works better for the task at hand.