# NLP Text Preprocessing Techniques

---

This pipeline covers:

- Tokenization
- Stop-word removal (including additional, custom stop words)
- Lemmatization
- Generating TF-IDF numerical vectors (including bi-grams)

---

## Import Modules

In [150]:
# Regex
import re

# NLTK
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer

# spaCy
import spacy

# spaCy visualization
from spacy import displacy

---

## Data
- Sample text for preprocessing

In [1]:
text1 = """
This is not the best text, but it is better than good. 
It includes some words including good. 
I could include more but let's not. 
Live and let live.  
The cat jumped over the other cats.Not just another cat.
June, the girl, was born in June, the month.
Some of the worst months are worse than some of the bad years.
Happy New Year, Ashish.
Employees who employ another emplopyee are often employed in New York. This is not new.
The legislation of the legislature is a weird set of words.
"""
print(text1)

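# Naive split on single spaces only: newlines stay attached to the
# neighbouring words and consecutive spaces yield empty strings.
list_of_text = text1.strip().split(" ")

print('-'*30)
print('# of words in text : ', len(list_of_text))

print('-'*30)
print('# of unique words  : ', len(set(list_of_text)))

# Same naive split as above, so this count equals the word count.
print('-'*30)
print('# of tokens        : ', len(list_of_text))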


This is not the best text, but it is better than good. 
It includes some words including good. 
I could include more but let's not. 
Live and let live.  
The cat jumped over the other cats.Not just another cat.
June, the girl, was born in June, the month.
Some of the worst months are worse than some of the bad years.
Happy New Year, Ashish.
Employees who employ another emplopyee are often employed in New York. This is not new.
The legislation of the legislature is a weird set of words.

------------------------------
# of words in text :  87
------------------------------
# of unique words  :  65
------------------------------
# of tokens        :  87
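
The counts above come from `split(" ")`, which splits on single space characters only. The sketch below uses a shortened, two-line excerpt of the sample text (the variable names are illustrative, not part of the notebook) to show how that differs from `split()` with no argument, which splits on any run of whitespace.

```python
# Two lines taken from the sample text, including the double space
# and the newline that separate them.
sample = "Live and let live.  \nThe cat jumped over the other cats.Not just another cat."

# Splitting on a single space keeps the newline glued to "The"
# and produces an empty string for the doubled space.
space_split = sample.split(" ")

# Splitting on any whitespace run avoids both artifacts.
whitespace_split = sample.split()

print(len(space_split), len(whitespace_split))
print([tok for tok in space_split if tok == "" or "\n" in tok])
```

Neither variant separates `cats.Not`, which is one reason to use a real tokenizer rather than a plain split.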


### Get all words from a text using Regex

In [7]:
tmp_txt1 = "Hi New York! What's new and happening and rocking your world?"

In [8]:
regex_list_of_words = re.findall(r'\w+', tmp_txt1)

In [13]:
print(len(regex_list_of_words))

12


In [9]:
regex_list_of_words

['Hi',
 'New',
 'York',
 'What',
 's',
 'new',
 'and',
 'happening',
 'and',
 'rocking',
 'your',
 'world']

**`Insights:`**
- Punctuation (exclamation mark, apostrophe, and question mark) is lost when we extract only word characters with `\w+`.
- `What's` is split into `What` and `s`.
- Duplicate words (`and`) are kept, because `re.findall` returns every match.
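
As a minimal sketch of how the regex could be adjusted for these points, the patterns below keep contractions together, optionally emit punctuation as separate tokens, and de-duplicate case-insensitively. The patterns and variable names are assumptions made for illustration; they are not what NLTK or spaCy do internally.

```python
import re

tmp_txt1 = "Hi New York! What's new and happening and rocking your world?"

# Keep an apostrophe plus trailing word characters attached to the word.
words = re.findall(r"\w+(?:'\w+)?", tmp_txt1)

# Same pattern, but also capture each punctuation mark as its own token.
tokens = re.findall(r"\w+(?:'\w+)?|[^\w\s]", tmp_txt1)

# Case-insensitive de-duplication that preserves first-seen order.
unique_words = list(dict.fromkeys(w.lower() for w in words))

print(words)
print(tokens)
print(unique_words)
```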

### Get all words using NLTK word tokens

In [11]:
nltk_word_tkn_list_of_words = word_tokenize(tmp_txt1)
print(type(nltk_word_tkn_list_of_words))

<class 'list'>


In [14]:
print(len(nltk_word_tkn_list_of_words))

14


In [215]:
nltk_word_tkn_list_of_words
# print(nltk_word_tkn_list_of_words, end='|')

['Hi',
 'New',
 'York',
 '!',
 'What',
 "'s",
 'new',
 'and',
 'happening',
 'and',
 'rocking',
 'your',
 'world',
 '?']

**`Insights:`**
- Punctuation (exclamation mark, apostrophe, and question mark) is kept: `!` and `?` become separate tokens.
- `What's` is split into `What` and `'s`.
- Duplicate words are still included.
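
If the punctuation tokens are not wanted downstream, they can be filtered out after tokenization. The sketch below starts from the token list printed above (copied by hand so that it runs without NLTK data) and uses only the standard library; the filtering rule is one possible choice, not a standard.

```python
from collections import Counter

# Tokens copied from the word_tokenize output shown above.
tokens = ['Hi', 'New', 'York', '!', 'What', "'s", 'new', 'and',
          'happening', 'and', 'rocking', 'your', 'world', '?']

# Keep tokens with at least one alphanumeric character;
# this drops '!' and '?' but keeps the clitic "'s".
word_tokens = [tok for tok in tokens if any(ch.isalnum() for ch in tok)]

# Count lowercased tokens to make the duplicates visible.
counts = Counter(tok.lower() for tok in word_tokens)
duplicates = {tok: n for tok, n in counts.items() if n > 1}

print(word_tokens)
print(duplicates)
```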

## Stemming - NLTK

1. Porter Stemmer
2. Snowball Stemmer (English Stemmer, Porter2 Stemmer)

In [153]:
# 1. Porter Stemmer

p_stemmer = PorterStemmer()
p_stemmer

<PorterStemmer>

In [154]:
words = ['run', 'runner', 'ran', 'runs', 'easily', 'fairly', 'fairness']

In [184]:
print(f'{"WORD":17} {"STEMMED WORD (Porter)":15}')

for word in words:
    print(f'{word:12} ---> {p_stemmer.stem(word):15}')

WORD              STEMMED WORD (Porter)
run          ---> run            
runner       ---> runner         
ran          ---> ran            
runs         ---> run            
easily       ---> easili         
fairly       ---> fairli         
fairness     ---> fair           


In [156]:
# 2. Snowball Stemmer

s_stemmer = SnowballStemmer(language='english')
s_stemmer

<nltk.stem.snowball.SnowballStemmer at 0x15ae04e6c48>

In [187]:
print(f'{"WORD":17} {"STEMMED WORD (Snowball)":15}')

for word in words:
    print(f'{word:12} ---> {s_stemmer.stem(word):15}')

WORD              STEMMED WORD (Snowball)
run          ---> run            
runner       ---> runner         
ran          ---> ran            
runs         ---> run            
easily       ---> easili         
fairly       ---> fair           
fairness     ---> fair           
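
To make the comparison between the two stemmers explicit, the sketch below lists the words whose stems differ and the words that neither stemmer changed. The stems are copied by hand from the two outputs above rather than recomputed, so the snippet runs without NLTK.

```python
# Stems copied from the Porter and Snowball outputs above (not recomputed).
porter_stems = {
    "run": "run", "runner": "runner", "ran": "ran", "runs": "run",
    "easily": "easili", "fairly": "fairli", "fairness": "fair",
}
snowball_stems = {
    "run": "run", "runner": "runner", "ran": "ran", "runs": "run",
    "easily": "easili", "fairly": "fair", "fairness": "fair",
}

# Words on which the two stemmers disagree: word -> (porter, snowball).
differences = {
    word: (porter_stems[word], snowball_stems[word])
    for word in porter_stems
    if porter_stems[word] != snowball_stems[word]
}

# Words left untouched by both stemmers, including the irregular form "ran".
unchanged = [w for w in porter_stems
             if porter_stems[w] == w and snowball_stems[w] == w]

print(differences)
print(unchanged)
```

In this word list the two stemmers disagree only on `fairly`, and neither maps `ran` or `runner` back to `run`.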


In [186]:
words2 = ['generous', 'generously', 'generation', 'generate']

print(f'{"WORD":17} {"STEMMED WORD (Snowball)":15}')

for word2 in words2:
    print(f'{word2:12} ---> {s_stemmer.stem(word2):15}')

WORD              STEMMED WORD (Snowball)
generous     ---> generous       
generously   ---> generous       
generation   ---> generat        
generate     ---> generat        
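
Grouping the words by their stem shows which ones a stemmer treats as the same term. The sketch below uses the Snowball stems printed above, copied by hand so that it runs without NLTK; the variable names are illustrative.

```python
from collections import defaultdict

# Stems copied from the Snowball output above (not recomputed here).
snowball_stems = {
    "generous": "generous",
    "generously": "generous",
    "generation": "generat",
    "generate": "generat",
}

# Collect the words that share each stem.
groups = defaultdict(list)
for word, stem in snowball_stems.items():
    groups[stem].append(word)

for stem, members in groups.items():
    print(f"{stem:10} <- {members}")
```

Here `generous` and `generously` collapse to a real word, while `generation` and `generate` collapse to `generat`, which is not a dictionary word. Stems are lookup keys, not necessarily readable words.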


---

### Get all words using spaCy

## TO DO
1. Tokenize a text containing emails, URLs, etc., and compare the tokens produced by NLTK and spaCy (spaCy should keep emails, URLs, "U.S.", etc. intact).
    - Another text: "A 5km cab ride in NYC costs $10.43"
    - "Let's visit St. Louis in the U.S. next summer"
2. Both the small and the large language libraries cover only a limited set of words.
