# NLP Text Preprocessing Techniques

---

This module covers:

- Tokenization
- Stemming
- Lemmatization

---

## Import Modules

In [57]:
# Regex
import re

# NLTK
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer
from nltk.stem import WordNetLemmatizer
# POS tagging with NLTK (needs the averaged perceptron tagger model)
nltk.download('averaged_perceptron_tagger')
from nltk.tag import pos_tag

# spaCy
import spacy

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\sharmaa1\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping taggers\averaged_perceptron_tagger.zip.


---

## Data
- Sample text for preprocessing

In [2]:
text1 = """
This is not the best text, but it is better than good. 
It includes some words including good. 
I could include more but let's not. 
Live and let live.  
The cat jumped over the other cats.Not just another cat.
June, the girl, was born in June, the month.
Some of the worst months are worse than some of the bad years.
Happy New Year, Ashish.
Employees who employ another emplopyee are often employed in New York. This is not new.
The legislation of the legislature is a weird set of words.
"""
print(text1)

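# Naive split on single spaces: punctuation and newlines stay attached to the words
list_of_text = text1.strip().split(" ")

print('-'*30)
print('# of words in text : ', len(list_of_text))

print('-'*30)
print('# of unique words  : ', len(set(list_of_text)))

print('-'*30)
# Same count as the word count: at this stage a "token" is just a space-separated chunk
print('# of tokens        : ', len(list_of_text))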


This is not the best text, but it is better than good. 
It includes some words including good. 
I could include more but let's not. 
Live and let live.  
The cat jumped over the other cats.Not just another cat.
June, the girl, was born in June, the month.
Some of the worst months are worse than some of the bad years.
Happy New Year, Ashish.
Employees who employ another emplopyee are often employed in New York. This is not new.
The legislation of the legislature is a weird set of words.

------------------------------
# of words in text :  87
------------------------------
# of unique words  :  65
------------------------------
# of tokens        :  87
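
The counts above come from `split(" ")`, which splits on single spaces only. As a minimal sketch (the `sample` string is a shortened stand-in for `text1`, not the original data), the cell below contrasts it with the argument-free `split()`, which splits on any run of whitespace, including newlines:

```python
# Two lines taken from the sample text, including the double space before the newline
sample = "Live and let live.  \nThe cat jumped."

# split(" ") splits on every single space: it yields an empty string for the
# double space and leaves the newline glued to the next word
naive = sample.split(" ")

# split() with no argument splits on any run of whitespace (spaces, tabs, newlines)
robust = sample.split()

print(naive)
print(robust)
```

Neither variant separates punctuation from words, which is what the tokenizers in the following sections address.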


### Get all words from a text using Regex

In [3]:
tmp_txt1 = "Hi New York! What's new and happening and rocking your world?"

In [4]:
regex_list_of_words = re.findall(r'\w+', tmp_txt1)

In [5]:
print(len(regex_list_of_words))

12


In [6]:
regex_list_of_words

['Hi',
 'New',
 'York',
 'What',
 's',
 'new',
 'and',
 'happening',
 'and',
 'rocking',
 'your',
 'world']

**`Insights:`**
- Punctuation (exclamation mark, apostrophe, and question mark) is dropped when we extract only word characters with `\w+`
- What's --> What and s
- Duplicate words (e.g. "and") are kept, since `re.findall` returns every match
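
The issues listed above can be partly addressed with a slightly richer pattern. The cell below is a sketch, not a replacement for a real tokenizer: `TOKEN_PATTERN` and `regex_tokenize` are hypothetical names, and the pattern only handles simple apostrophe contractions.

```python
import re

# A word, optionally followed by an apostrophe and more word characters
# (keeps "What's" together), OR a single non-space, non-word character
# (keeps punctuation as its own token)
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)?|[^\w\s]")

def regex_tokenize(text):
    """Return words (contractions kept whole) and punctuation marks."""
    return TOKEN_PATTERN.findall(text)

sample = "Hi New York! What's new and happening and rocking your world?"
tokens = regex_tokenize(sample)
print(tokens)

# Drop punctuation and lowercase the rest to get the case-insensitive vocabulary
unique_words = {t.lower() for t in tokens if t[0].isalnum()}
print(len(tokens), len(unique_words))
```

Unlike NLTK's `word_tokenize`, this sketch keeps `What's` as a single token instead of splitting off the clitic.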

---

## Tokenization - NLTK

- Get all tokens using NLTK's `word_tokenize`

In [7]:
nltk_word_tkn_list_of_words = word_tokenize(tmp_txt1)
print(type(nltk_word_tkn_list_of_words))

<class 'list'>


In [8]:
print(len(nltk_word_tkn_list_of_words))

14


In [9]:
nltk_word_tkn_list_of_words
# print(nltk_word_tkn_list_of_words, end='|')

['Hi',
 'New',
 'York',
 '!',
 'What',
 "'s",
 'new',
 'and',
 'happening',
 'and',
 'rocking',
 'your',
 'world',
 '?']

**`Insights:`**
- Punctuation (exclamation mark, apostrophe, and question mark) is preserved
- What's --> What and 's
- Duplicate words are still included
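
Since duplicates remain in the token list, a common next step is to count them. The cell below is a minimal sketch that hard-codes the token list printed above, so it runs without NLTK; the names `words` and `counts` are illustrative.

```python
from collections import Counter

# Token list copied from the word_tokenize output above
tokens = ['Hi', 'New', 'York', '!', 'What', "'s", 'new', 'and',
          'happening', 'and', 'rocking', 'your', 'world', '?']

# Keep tokens containing at least one letter or digit, lowercased so that
# 'New' and 'new' are counted together
words = [t.lower() for t in tokens if any(ch.isalnum() for ch in t)]

counts = Counter(words)
print(counts.most_common(3))
```

Lowercasing merges `New` (part of the proper noun "New York") with the adjective `new`, which may or may not be desirable depending on the task.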

---

## Stemming - NLTK

1. Porter Stemmer
2. Snowball Stemmer (English Stemmer, Porter2 Stemmer)

In [10]:
# 1. Porter Stemmer

p_stemmer = PorterStemmer()
p_stemmer

<PorterStemmer>

In [11]:
words = ['run', 'runner', 'ran', 'runs', 'easily', 'fairly', 'fairness']

In [12]:
print(f'{"WORD":{17}} {"STEMMED WORD (Porter)":{15}}')

for word in words:
    print(f'{word:{12}} ---> {p_stemmer.stem(word):{15}}')

WORD              STEMMED WORD (Porter)
run          ---> run            
runner       ---> runner         
ran          ---> ran            
runs         ---> run            
easily       ---> easili         
fairly       ---> fairli         
fairness     ---> fair           


In [13]:
# 2. Snowball Stemmer

s_stemmer = SnowballStemmer(language='english')
s_stemmer

<nltk.stem.snowball.SnowballStemmer at 0x1ef89f55988>

In [14]:
print(f'{"WORD":{17}} {"STEMMED WORD (Snowball)":{15}}')

for word in words:
    print(f'{word:{12}} ---> {s_stemmer.stem(word):{15}}')

WORD              STEMMED WORD (Snowball)
run          ---> run            
runner       ---> runner         
ran          ---> ran            
runs         ---> run            
easily       ---> easili         
fairly       ---> fair           
fairness     ---> fair           


In [15]:
words2 = ['generous', 'generously', 'generation', 'generate']

print(f'{"WORD":{17}} {"STEMMED WORD (Snowball)":{15}}')

for word2 in words2:
    print(f'{word2:{12}} ---> {s_stemmer.stem(word2):{15}}')

WORD              STEMMED WORD (Snowball)
generous     ---> generous       
generously   ---> generous       
generation   ---> generat        
generate     ---> generat        


In [89]:
words3 = ['employ', 'employs', 'employing', 'employment', 'employee', 'employees', 'employer', 'employers']

print(f'{"WORD":{17}} {"STEMMED WORD (Snowball)":{15}}')

for word3 in words3:
    print(f'{word3:{12}} ---> {s_stemmer.stem(word3):{15}}')

WORD              STEMMED WORD (Snowball)
employ       ---> employ         
employs      ---> employ         
employing    ---> employ         
employment   ---> employ         
employee     ---> employe        
employees    ---> employe        
employer     ---> employ         
employers    ---> employ         
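
One way to read the table above is to group the input words by the stem they collapse to. The cell below is a sketch under an explicit assumption: the (word, stem) pairs are copied from the Snowball output above rather than recomputed, and `group_by_stem` is a hypothetical helper.

```python
from collections import defaultdict

# (word, stem) pairs copied from the Snowball stemmer output above
snowball_pairs = [
    ("employ", "employ"), ("employs", "employ"), ("employing", "employ"),
    ("employment", "employ"), ("employee", "employe"), ("employees", "employe"),
    ("employer", "employ"), ("employers", "employ"),
]

def group_by_stem(pairs):
    """Map each stem to the list of words that reduce to it."""
    groups = defaultdict(list)
    for word, stem in pairs:
        groups[stem].append(word)
    return dict(groups)

groups = group_by_stem(snowball_pairs)
for stem, members in groups.items():
    print(f"{stem:10} <- {', '.join(members)}")
```

The grouping shows that `employer` and `employment` end up in the same bucket as `employ`, while `employee` lands in a separate one, and that `employe` is not a dictionary word.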


---

## Lemmatization - NLTK


- The general process is to tokenize the text first (a token is usually a word)
- Then create a `WordNetLemmatizer` instance and call its `lemmatize` method on each token

In [16]:
lem = WordNetLemmatizer()

In [17]:
lem

<WordNetLemmatizer>

In [20]:
print(f'{"WORD":{17}} {"LEMMATIZED WORD (WordNetLemmatizer)":{15}}')

for word in words:
    print(f'{word:{12}} ---> {lem.lemmatize(word):{15}}')

WORD              LEMMATIZED WORD (WordNetLemmatizer)
run          ---> run            
runner       ---> runner         
ran          ---> ran            
runs         ---> run            
easily       ---> easily         
fairly       ---> fairly         
fairness     ---> fairness       


In [22]:
print(f'{"WORD":{17}} {"LEMMATIZED WORD (WordNetLemmatizer)":{15}}')

for word2 in words2:
    print(f'{word2:{12}} ---> {lem.lemmatize(word2):{15}}')

WORD              LEMMATIZED WORD (WordNetLemmatizer)
generous     ---> generous       
generously   ---> generously     
generation   ---> generation     
generate     ---> generate       


In [90]:
print(f'{"WORD":{17}} {"LEMMATIZED WORD (WordNetLemmatizer)":{15}}')

for word3 in words3:
    print(f'{word3:{12}} ---> {lem.lemmatize(word3):{15}}')

WORD              LEMMATIZED WORD (WordNetLemmatizer)
employ       ---> employ         
employs      ---> employ         
employing    ---> employing      
employment   ---> employment     
employee     ---> employee       
employees    ---> employee       
employer     ---> employer       
employers    ---> employer       


**Lemmatization depends on the POS of the word**
- The example below shows why it is important to provide the POS tag of the word being lemmatized (`lemmatize` treats the word as a noun by default)

In [44]:
# Other words worth trying: 'content', 'strips', 'lead'
wrd = 'leaves'

In [45]:
lem.lemmatize(wrd)

'leaf'

In [46]:
lem.lemmatize(wrd, pos='n')

'leaf'

In [47]:
lem.lemmatize(wrd, pos='v')

'leave'

In [58]:
# Tag the word in isolation, without any sentence context
pos_tag(['leaves'])

[('leaves', 'NNS')]

In [84]:
leaves1 = 'No one leaves him like this'
leaves2 = 'There are leaves all over the lawn'
leaves3 = 'Dried leaves on trees'
emps1 = 'Employers can employ employees or employee for employment or employing'

print(pos_tag(word_tokenize(leaves1)))
print('-'*30)
print(pos_tag(word_tokenize(leaves2)))
print('-'*30)
print(pos_tag(word_tokenize(leaves3)))
print('-'*30)
print(pos_tag(word_tokenize(emps1)))

[('No', 'DT'), ('one', 'NN'), ('leaves', 'VBZ'), ('him', 'PRP'), ('like', 'IN'), ('this', 'DT')]
------------------------------
[('There', 'EX'), ('are', 'VBP'), ('leaves', 'VBZ'), ('all', 'DT'), ('over', 'IN'), ('the', 'DT'), ('lawn', 'NN')]
------------------------------
[('Dried', 'NNP'), ('leaves', 'VBZ'), ('on', 'IN'), ('trees', 'NNS')]
------------------------------
[('Employers', 'NNS'), ('can', 'MD'), ('employ', 'VB'), ('employees', 'NNS'), ('or', 'CC'), ('employee', 'NN'), ('for', 'IN'), ('employment', 'NN'), ('or', 'CC'), ('employing', 'VBG')]
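
`pos_tag` returns Penn Treebank tags (`NNS`, `VBZ`, ...), while `WordNetLemmatizer.lemmatize` expects the single-letter WordNet codes `'n'`, `'v'`, `'a'`, `'r'`. The cell below is a minimal sketch of the usual bridge between the two. `penn_to_wordnet` is a hypothetical helper, and falling back to `'n'` for every other tag is an assumption that mirrors the lemmatizer's default.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS code ('n', 'v', 'a' or 'r')."""
    if tag.startswith('J'):    # JJ, JJR, JJS -> adjective
        return 'a'
    if tag.startswith('V'):    # VB, VBD, VBG, VBN, VBP, VBZ -> verb
        return 'v'
    if tag.startswith('RB'):   # RB, RBR, RBS -> adverb
        return 'r'
    return 'n'                 # assumption: treat everything else as a noun

# Tagged tokens copied from the pos_tag output above
tagged = [('No', 'DT'), ('one', 'NN'), ('leaves', 'VBZ'),
          ('him', 'PRP'), ('like', 'IN'), ('this', 'DT')]

with_wn_pos = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(with_wn_pos)

# With the lemmatizer from this notebook, each pair could then be passed as:
#   lem.lemmatize(word, pos=wn_pos)
```

The mapping can only be as good as the tags it receives: in the second and third sentences above the tagger labels the noun "leaves" as `VBZ`, so a lemmatizer driven by those tags would treat it as a verb.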


---

## Lemmatization - spaCy

In [76]:
nlp = spacy.load('en_core_web_lg')
nlp

<spacy.lang.en.English at 0x1efd2e07188>

In [77]:
doc1 = nlp(leaves1)
doc1

No one leaves him like this

In [78]:
doc2 = nlp(leaves2)
doc2

There are leaves all over the lawn

In [79]:
doc3 = nlp(leaves3)
doc3

Dried leaves on trees

In [85]:
doc4 = nlp(emps1)
doc4

Employers can employ employees or employee for employment or employing

In [80]:
# Helper function to show the lemmas of any spaCy document

def show_lemmas(text):
    """
    Print the text, lemma, POS tag, and lemma hash of each token in a document.
    I/P:
        - text (spaCy Doc object): processed document, e.g. nlp("some text")
    """

    print(f'{"Token Text":{15}} {"Token Lemma_":{15}} {"Token POS":{12}} {"Token Lemma Hash Ref":{23}} ')
    print('-'*65)
    
    for token in text:
        print(f'{token.text:{15}} {token.lemma_:{15}} {token.pos_:{12}} {token.lemma:<{23}}')

In [81]:
show_lemmas(doc1)

Token Text      Token Lemma_    Token POS    Token Lemma Hash Ref    
-----------------------------------------------------------------
No              no              DET          13055779130471031426   
one             one             NOUN         17454115351911680600   
leaves          leave           VERB         9707179535890930240    
him             he              PRON         1655312771067108281    
like            like            ADP          18194338103975822726   
this            this            DET          1995909169258310477    


In [82]:
show_lemmas(doc2)

Token Text      Token Lemma_    Token POS    Token Lemma Hash Ref    
-----------------------------------------------------------------
There           there           PRON         2112642640949226496    
are             be              AUX          10382539506755952630   
leaves          leave           NOUN         9707179535890930240    
all             all             ADV          13409319323822384369   
over            over            ADP          5456543204961066030    
the             the             DET          7425985699627899538    
lawn            lawn            NOUN         8580092763855978974    


In [83]:
show_lemmas(doc3)

Token Text      Token Lemma_    Token POS    Token Lemma Hash Ref    
-----------------------------------------------------------------
Dried           dry             VERB         4116088610979501248    
leaves          leave           VERB         9707179535890930240    
on              on              ADP          5640369432778651323    
trees           tree            NOUN         5236966400857015965    


In [86]:
show_lemmas(doc4)

Token Text      Token Lemma_    Token POS    Token Lemma Hash Ref    
-----------------------------------------------------------------
Employers       employer        NOUN         10831503532707336449   
can             can             AUX          6635067063807956629    
employ          employ          VERB         12763792191920418315   
employees       employee        NOUN         8285577505045524338    
or              or              CCONJ        3740602843040177340    
employee        employee        NOUN         8285577505045524338    
for             for             ADP          16037325823156266367   
employment      employment      NOUN         10954873364127974648   
or              or              CCONJ        3740602843040177340    
employing       employ          VERB         12763792191920418315   


In [88]:
nltk_op = [('Employers', 'NNS'), ('can', 'MD'), ('employ', 'VB'), ('employees', 'NNS'), ('or', 'CC'), ('employee', 'NN'), ('for', 'IN'), ('employment', 'NN'), ('or', 'CC'), ('employing', 'VBG')]
nltk_op

[('Employers', 'NNS'),
 ('can', 'MD'),
 ('employ', 'VB'),
 ('employees', 'NNS'),
 ('or', 'CC'),
 ('employee', 'NN'),
 ('for', 'IN'),
 ('employment', 'NN'),
 ('or', 'CC'),
 ('employing', 'VBG')]

---