# Natural Language Text Preprocessing

In [1]:
"""
cd .\00text-preprocess\
jupyter nbconvert --to markdown pre-process.ipynb --output README.md
"""
import nltk
# nltk.download()

## Introduction

Data preprocessing is an essential step in building a machine learning model, and the quality of the results depends on how well the data has been preprocessed.

In NLP, text preprocessing is the first step in the process of building a model.
The various text preprocessing steps are:

- Tokenization
- Lower casing
- Stop words removal
- Stemming
- Lemmatization

## Bag of Words Pipeline

- Get the Data/Corpus 
- Tokenization, Stop words removal
- Stemming, Lemmatization
- Building a Vocab
- Vectorization
- Classification

## Load Data/Corpus 

In [3]:
from nltk.corpus import brown

In [6]:
print(brown.categories())
print(len(brown.categories()))


['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies', 'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance', 'science_fiction']
15


Most corpora consist of a set of files, each containing a document (or other pieces of text). A list of identifiers for these files is accessed via the `fileids()` method of the corpus reader:

In [11]:
print(brown.fileids()[:20])


['ca01', 'ca02', 'ca03', 'ca04', 'ca05', 'ca06', 'ca07', 'ca08', 'ca09', 'ca10', 'ca11', 'ca12', 'ca13', 'ca14', 'ca15', 'ca16', 'ca17', 'ca18', 'ca19', 'ca20']


In [14]:
data = brown.sents(categories="fiction")
data


[['Thirty-three'], ['Scotty', 'did', 'not', 'go', 'back', 'to', 'school', '.'], ...]

In [16]:
" ".join(data[1])

'Scotty did not go back to school .'

## Tokenization

Tokenization: splitting text into sentences, and sentences into words.
A string can be split into sentences with `nltk.sent_tokenize` and into word tokens with `nltk.word_tokenize`.


In [19]:
from nltk.tokenize import sent_tokenize, word_tokenize
# prerequisite: nltk.download('punkt')

In [15]:
sample_text = "Does this thing really work? Lets see."

In [14]:
sent_tokenize(sample_text)

['Does this thing really work?', 'Lets see.']

In [26]:
words = word_tokenize(sample_text)
words


['Does', 'this', 'thing', 'really', 'work', '?', 'Lets', 'see', '.']
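
To see why a tokenizer is used instead of `str.split`, here is a rough standard-library sketch. The regular expression is an assumption made for illustration; it is far simpler than what `word_tokenize` does and only happens to agree with it on this sentence.

```python
import re

sample_text = "Does this thing really work? Lets see."

# Whitespace splitting leaves punctuation glued to the neighbouring word.
whitespace_tokens = sample_text.split()
print(whitespace_tokens)

# Illustrative pattern: runs of word characters, or single punctuation marks.
simple_tokens = re.findall(r"\w+|[^\w\s]", sample_text)
print(simple_tokens)
```

The first list contains `'work?'` and `'see.'` as single tokens, while the second separates the punctuation.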

## Stopwords

Stop words removal: Stop words are very common words (**a, an, the, etc.**) that carry little information because they do not help distinguish one document from another. We can use `nltk.corpus.stopwords.words('english')` to fetch a list of English stop words, and then remove the tokens that appear in that list.


In [17]:
from nltk.corpus import stopwords

In [25]:
stop = stopwords.words('english')

print(stop[:20])

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his']


In [29]:
clean_words = [w for w in words if w not in stop ]
print(words)
print(clean_words)


['Does', 'this', 'thing', 'really', 'work', '?', 'Lets', 'see', '.']
['Does', 'thing', 'really', 'work', '?', 'Lets', 'see', '.']


> !! **Watch out for uppercase**: in the example above, `this` was removed because it is a stop word. Had we written `This`, it would not have been removed, because the stop word list contains only lowercase entries.

In [30]:
sample_text = "Does This thing really work? Lets see."
words = word_tokenize(sample_text)
words

['Does', 'This', 'thing', 'really', 'work', '?', 'Lets', 'see', '.']

In [31]:
clean_words = [w for w in words if w not in stop]
print(words)
print(clean_words)


['Does', 'This', 'thing', 'really', 'work', '?', 'Lets', 'see', '.']
['Does', 'This', 'thing', 'really', 'work', '?', 'Lets', 'see', '.']


> Solution: lowercase the text before tokenizing.

In [32]:
sample_text = "Does This thing really work? Lets see."
sample_text = sample_text.lower()
words = word_tokenize(sample_text)
words

['does', 'this', 'thing', 'really', 'work', '?', 'lets', 'see', '.']

In [33]:
clean_words = [w for w in words if w not in stop]
print(words)
print(clean_words)


['does', 'this', 'thing', 'really', 'work', '?', 'lets', 'see', '.']
['thing', 'really', 'work', '?', 'lets', 'see', '.']
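
If the original casing should survive in the output, an alternative is to lowercase only for the comparison. The sketch below uses a tiny hypothetical stop list (`stop_set`) so that it runs without NLTK data; in the notebook it would be built from `stopwords.words('english')`.

```python
# Hypothetical mini stop list, for illustration only.
stop_set = {"does", "this", "is", "a", "the"}

words = ["Does", "This", "thing", "really", "work", "?", "Lets", "see", "."]

# Compare the lowercased token, but keep the token as it was written.
clean_words = [w for w in words if w.lower() not in stop_set]
print(clean_words)
```

Using a `set` also makes each membership test faster than scanning a list.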


### Including punctuation

In [35]:
import string
punctuations = list(string.punctuation)
stop = stop + punctuations

In [36]:
clean_words = [w for w in words if w not in stop]
print(words)
print(clean_words)

['does', 'this', 'thing', 'really', 'work', '?', 'lets', 'see', '.']
['thing', 'really', 'work', 'lets', 'see']
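
`list(string.punctuation)` contains single characters only, so a multi-character token such as `'...'` would not match a plain membership test. A possible workaround, sketched with a hypothetical helper `is_punctuation`, is to check every character of the token:

```python
import string

punct_chars = set(string.punctuation)

def is_punctuation(token):
    """True if the token is non-empty and made only of ASCII punctuation."""
    return bool(token) and all(ch in punct_chars for ch in token)

tokens = ["wait", "...", "really", "?!", "e-mail", "."]
kept = [t for t in tokens if not is_punctuation(t)]
print(kept)
```

Tokens that merely contain punctuation, such as `'e-mail'`, are kept.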


### Complete example

In [17]:
from nltk.corpus import stopwords
import string
punctuations = list(string.punctuation)
stops = stopwords.words('english')
stops = stops + punctuations


def remove_stopwords(text):
	useful_words = [w for w in text if w not in stops]
	return useful_words

In [24]:
sample_text = "Does This thing really work? Lets see."
sample_text = sample_text.lower()
words = word_tokenize(sample_text)
useful_text = remove_stopwords(words)
useful_text

['thing', 'really', 'work', 'lets', 'see']

In [25]:
words = "Does this thing really work? Lets see.".split()
useful_text = remove_stopwords(words)
useful_text

['Does', 'thing', 'really', 'work?', 'Lets', 'see.']

### Tokenization using Regular Expression

In [26]:
from nltk.tokenize import RegexpTokenizer

In [29]:
sentence = "Send all the 50 documents related to chapters 1,2,3 at prateek@cb.com"
# keep alphabetic words (and the @ character); drop numbers and other punctuation
tokenizer = RegexpTokenizer('[a-zA-Z@]+') 
useful_text = tokenizer.tokenize(sentence)
" ".join(useful_text)


'Send all the documents related to chapters at prateek@cb com'
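
Note that the pattern above splits the address into `prateek@cb` and `com`, because `.` is not in the character class. If the address should stay in one piece, one option is an alternation that tries a simplified e-mail shape first. The pattern below is an assumption for illustration, not a complete e-mail matcher, and is shown with the standard `re` module.

```python
import re

sentence = "Send all the 50 documents related to chapters 1,2,3 at prateek@cb.com"

# First alternative: a simplified e-mail shape (letters@letters.letters...).
# Second alternative: plain alphabetic words. Digits match neither.
pattern = r"[A-Za-z]+@[A-Za-z]+(?:\.[A-Za-z]+)+|[A-Za-z]+"
tokens = re.findall(pattern, sentence)
print(tokens)
```

The group is non-capturing, so `re.findall` returns whole matches; the same pattern string could be tried with `RegexpTokenizer` as well.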

## Stemming



Stemming: the process of reducing a word to its root form (stem).

- Transforms inflected words (verb forms, plurals) into a common base form
- Preserves most of the meaning of the sentence while reducing the number of unique tokens
- Example: `jumps, jumping, jumped, jump` ==> `jump`

We stem the tokens using `nltk.stem.porter.PorterStemmer`.

In [10]:
from nltk.stem import PorterStemmer
porter = PorterStemmer()
porter.stem("having")

'have'

In [8]:
words = ["play", "playing", "player", "played"]

stemmed_words = [porter.stem(w) for w in words]
stemmed_words


['play', 'play', 'player', 'play']

In [13]:
words = ["machine","happying"]

stemmed_words = [porter.stem(w) for w in words]
stemmed_words

['machin', 'happi']

Explanation: The word `'machine'` has its suffix `'e'` chopped off. The stem does not make sense as it is not a word in English. This is a disadvantage of stemming.
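
To make the idea of rule-based suffix stripping concrete, here is a deliberately naive sketch. `naive_stem` and its suffix list are hypothetical and much cruder than the Porter algorithm; the point is only that chopping suffixes by rule can produce stems with a different meaning, or stems that are not words at all.

```python
def naive_stem(word, suffixes=("ing", "ed", "s")):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["jumps", "jumping", "jumped", "jump", "sing", "caring"]:
    print(w, "->", naive_stem(w))
```

The `jump` family collapses to one stem as intended, but `caring` becomes `car`, which shows how such rules can over-stem.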


Other stemmers: `LancasterStemmer` and `SnowballStemmer`. First, Porter compared with Lancaster:

In [2]:
from nltk.stem import PorterStemmer
from nltk.stem import LancasterStemmer


In [3]:
porter = PorterStemmer()
lancaster = LancasterStemmer()
word_list = ["friend", "friendship", "friends", "friendships", "stabil",
             "destabilize", "misunderstanding", "railroad", "moonlight", "football"]
print("{0:20}{1:20}{2:20}".format(
    "Word", "Porter Stemmer", "lancaster Stemmer"))
for word in word_list:
    print("{0:20}{1:20}{2:20}".format(
        word, porter.stem(word), lancaster.stem(word)))


Word                Porter Stemmer      lancaster Stemmer   
friend              friend              friend              
friendship          friendship          friend              
friends             friend              friend              
friendships         friendship          friend              
stabil              stabil              stabl               
destabilize         destabil            dest                
misunderstanding    misunderstand       misunderstand       
railroad            railroad            railroad            
moonlight           moonlight           moonlight           
football            footbal             footbal             


`nltk` also provides `SnowballStemmer`, which can be used to create stemmers for languages other than English as well:

In [11]:
from nltk.stem.snowball import SnowballStemmer

englishStemmer=SnowballStemmer("english")
englishStemmer.stem("generously")

'generous'

The Snowball `'english'` stemmer is generally considered an improvement over the original `'porter'` stemmer. Compare Porter's output for the same word:



In [12]:
porter.stem("generously")


'gener'

## POS (part-of-speech) tagger

We can use `nltk.pos_tag` to retrieve the `part of speech` of each token in a list.

pos_tag **abbreviations**:

- NNP: proper noun, singular (Sarah)
- NNS: common noun, plural
- NNPS: proper noun, plural (Indians, Americans)
- PDT: predeterminer (all, both, half)
- POS: possessive ending (parent's)
- DT: determiner
- ...

[https://stackoverflow.com/questions/15388831/what-are-all-possible-pos-tags-of-nltk](https://stackoverflow.com/questions/15388831/what-are-all-possible-pos-tags-of-nltk)

In [61]:
from nltk import pos_tag
pos_tag(['any'])

[('any', 'DT')]

> Watch out: `pos_tag` takes a `list` of tokens, not a `string`
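
The reason is that iterating over a Python string yields single characters, so a function that expects a list of tokens would treat every character as a token. A small guard can avoid this mistake; `as_token_list` is a hypothetical helper, and whitespace splitting stands in for a real tokenizer.

```python
def as_token_list(text_or_tokens):
    """Return a list of tokens, splitting on whitespace if given a plain string."""
    if isinstance(text_or_tokens, str):
        return text_or_tokens.split()
    return list(text_or_tokens)

# What a function iterating over a raw string would see:
print(list("any"))

# What we actually want to pass on:
print(as_token_list("any"))
print(as_token_list(["any", "word"]))
```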

load text:

In [14]:
from nltk.corpus import state_union
# Prerequisite: 
# nltk.download('state_union')

In [22]:
documents = nltk.corpus.state_union.fileids()
print(documents[:20])

['1945-Truman.txt', '1946-Truman.txt', '1947-Truman.txt', '1948-Truman.txt', '1949-Truman.txt', '1950-Truman.txt', '1951-Truman.txt', '1953-Eisenhower.txt', '1954-Eisenhower.txt', '1955-Eisenhower.txt', '1956-Eisenhower.txt', '1957-Eisenhower.txt', '1958-Eisenhower.txt', '1959-Eisenhower.txt', '1960-Eisenhower.txt', '1961-Kennedy.txt', '1962-Kennedy.txt', '1963-Johnson.txt', '1963-Kennedy.txt', '1964-Johnson.txt']


In [27]:
speech  = state_union.raw('2006-GWBush.txt')
speech[:500]

"PRESIDENT GEORGE W. BUSH'S ADDRESS BEFORE A JOINT SESSION OF THE CONGRESS ON THE STATE OF THE UNION\n \nJanuary 31, 2006\n\nTHE PRESIDENT: Thank you all. Mr. Speaker, Vice President Cheney, members of Congress, members of the Supreme Court and diplomatic corps, distinguished guests, and fellow citizens: Today our nation lost a beloved, graceful, courageous woman who called America to its founding ideals and carried on a noble dream. Tonight we are comforted by the hope of a glad reunion with the hus"

In [38]:
speech_in_words = word_tokenize(speech)
pos = pos_tag(speech_in_words)
pos[:10]

[('PRESIDENT', 'NNP'),
 ('GEORGE', 'NNP'),
 ('W.', 'NNP'),
 ('BUSH', 'NNP'),
 ("'S", 'POS'),
 ('ADDRESS', 'NNP'),
 ('BEFORE', 'IN'),
 ('A', 'NNP'),
 ('JOINT', 'NNP'),
 ('SESSION', 'NNP')]

## Lemmatization

**Lemmatization**: Unlike stemming, lemmatization reduces a word to a form that actually exists in the language (its `lemma`).

For lemmatization to resolve a word to its `lemma`, the **`part of speech` of the word is required**. This helps in transforming the word into a proper root form. It also means that extra linguistic processing is needed, such as a **part-of-speech tagger**.

[what-is-the-difference-between-stemming-and-lemmatization/](https://blog.bitext.com/what-is-the-difference-between-stemming-and-lemmatization/)

Lemmatization is often preferred over stemming because it performs a morphological analysis of the words rather than just stripping suffixes.


In [43]:
from nltk.stem import WordNetLemmatizer 
lemmatizer = WordNetLemmatizer()

In [47]:
lemmatizer.lemmatize("bats")

'bat'

In [48]:
sentence = "The striped bats are hanging on their feet for best"
# Tokenize: Split the sentence into words
word_list = nltk.word_tokenize(sentence)
print(word_list)
# Lemmatize list of words and join
lemmatized_output = ' '.join([lemmatizer.lemmatize(w) for w in word_list])
print(lemmatized_output)

['The', 'striped', 'bats', 'are', 'hanging', 'on', 'their', 'feet', 'for', 'best']
The striped bat are hanging on their foot for best


Notice that it didn't do a good job: `'are'` is not converted to `'be'` and `'hanging'` is not converted to `'hang'`, as one might expect. By default, `lemmatize()` treats every word as a noun. This can be corrected by providing the correct **part-of-speech tag (`POS` tag)** as the second argument to `lemmatize()`. The same word can also have multiple lemmas depending on its meaning or context.

In [49]:
lemmatizer.lemmatize("painting", pos='n')

'painting'

In [50]:
lemmatizer.lemmatize("painting", pos='v')

'paint'

In [74]:
lemmatizer.lemmatize("hanging", pos='v')


'hang'

In [80]:
lemmatizer.lemmatize("are", pos='v')

'be'

In [81]:
lemmatizer.lemmatize("is", pos='v')

'be'

In [87]:
w = "hanging"
postag = pos_tag([w])
postag


[('hanging', 'VBG')]

A simple function to convert `pos_tag` abbreviations into the simple form that `lemmatize()` accepts, for example `NN`, `NNS`, etc. to `n` and `VBG` etc. to `v`:

In [53]:
from nltk.corpus import wordnet
def get_simple_pos(tag):
	
	if tag.startswith("J"):
		return wordnet.ADJ
	elif tag.startswith("V"):
		return wordnet.VERB
	elif tag.startswith("N"):
		return wordnet.NOUN
	elif tag.startswith("R"):
		return wordnet.ADV
	else:
		return wordnet.NOUN


In [85]:
w="hanging"
postag = pos_tag([w])
print(postag)
print(postag[0][1]+" --> ",end="" )
pos = get_simple_pos(postag[0][1])
print(pos)


[('hanging', 'VBG')]
VBG --> v
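
The same mapping can be written as a lookup table, which avoids the `if`/`elif` chain. This sketch uses WordNet's one-letter codes directly (`'a'`, `'v'`, `'n'`, `'r'`, the values of `wordnet.ADJ`, `wordnet.VERB`, `wordnet.NOUN` and `wordnet.ADV`) so that it runs without NLTK; the names `PENN_PREFIX_TO_WORDNET` and `simple_pos` are made up for this example.

```python
# WordNet's one-letter POS codes: 'a' adjective, 'v' verb, 'n' noun, 'r' adverb.
PENN_PREFIX_TO_WORDNET = {"J": "a", "V": "v", "N": "n", "R": "r"}

def simple_pos(tag, default="n"):
    """Map a Penn Treebank tag to a WordNet POS code by its first letter."""
    return PENN_PREFIX_TO_WORDNET.get(tag[:1], default)

for tag in ["JJ", "VBG", "NNS", "RB", "DT"]:
    print(tag, "-->", simple_pos(tag))
```

Tags without a WordNet counterpart, such as `DT`, fall back to noun, as in the function above.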


In [88]:
sentence = "The striped bats are hanging on their feet for best"
word_list = nltk.word_tokenize(sentence)
o = []
for w in word_list:
	postag = pos_tag([w])
	pos = get_simple_pos(postag[0][1])
	clean_word = lemmatizer.lemmatize(w, pos=pos)
	o.append(clean_word)

print(o)



['The', 'strip', 'bat', 'be', 'hang', 'on', 'their', 'foot', 'for', 'best']


[https://www.machinelearningplus.com/nlp/lemmatization-examples-python/](https://www.machinelearningplus.com/nlp/lemmatization-examples-python/)

## Constructing Vocab & Vectorization

### Count Vectorization (AKA One-Hot Encoding)


One of the most basic ways to represent words numerically is one-hot encoding. Count vectorizing is closely related: the count vector of a document is the sum of the one-hot vectors of its words.

The idea is simple:
- **Create a vector that has as many dimensions as your corpus has unique words**.
- Each unique word gets its own dimension and is represented by a `1` in that dimension with `0`s everywhere else. For a whole document, each dimension then holds the number of times that word occurs.

The result? Very large, **sparse vectors** that capture no relational information between words. This can still be useful if you have no other option, but other options exist when semantic relationships matter.

<div align="center">
<img src="img/count-v.jpg" alt="count-v.jpg" width="800px">
</div>
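
The relationship between one-hot vectors and count vectors can be sketched in plain Python. Lowercasing and whitespace splitting stand in for real tokenization here, and the helper names `one_hot` and `count_vector` are assumptions for this example, not part of any library.

```python
corpus = ["This is good", "This is bad", "awesome This is awesome"]

# Lowercase and split on whitespace (a simplification of real tokenization).
docs = [doc.lower().split() for doc in corpus]
vocab = sorted({word for doc in docs for word in doc})
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Vector with a single 1 in the word's dimension."""
    vec = [0] * len(vocab)
    vec[index[word]] = 1
    return vec

def count_vector(doc):
    """Per-word counts; equivalent to summing the one-hot vectors of the words."""
    vec = [0] * len(vocab)
    for word in doc:
        vec[index[word]] += 1
    return vec

print(vocab)
print(one_hot("good"))
print([count_vector(doc) for doc in docs])
```

The third document contains `awesome` twice, so its first dimension is `2` rather than `1`.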

In [3]:
corpus = [
	"This is good",
	"This is bad",
	"awesome This is awesome"
]


In [1]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()

In [4]:
vectorizer.fit(corpus)


CountVectorizer()

In [19]:
# Now, we can inspect how our vectorizer vectorized the text
# This will print out a list of words used, and their index in the vectors
print('Vocabulary: ')
print(vectorizer.vocabulary_)
print(vectorizer.vocabulary_.keys())

Vocabulary: 
{'this': 4, 'is': 3, 'good': 2, 'bad': 1, 'awesome': 0}
dict_keys(['this', 'is', 'good', 'bad', 'awesome'])


In [7]:
# If we would like to actually create a vector, we can do so by passing the
# text into the vectorizer to get back counts
vector = vectorizer.transform(corpus)


In [8]:
# Our final vector:
print('Full vector: ')
print(vector.toarray())

Full vector: 
[[0 0 1 1 1]
 [0 1 0 1 1]
 [2 0 0 1 1]]


> Fit + Transform:

In [15]:
vectorized_corpus = vectorizer.fit_transform(corpus).toarray()
vectorized_corpus


array([[0, 0, 1, 1, 1],
       [0, 1, 0, 1, 1],
       [2, 0, 0, 1, 1]], dtype=int64)

> Reverse Mapping

In [18]:
vectorizer.inverse_transform(vectorized_corpus)

[array(['good', 'is', 'this'], dtype='<U7'),
 array(['bad', 'is', 'this'], dtype='<U7'),
 array(['awesome', 'is', 'this'], dtype='<U7')]

### N-Gram

An `N-gram` is a **contiguous sequence of n items** from a given sample of text or speech. The items can be letters, words, or base pairs, depending on the application. N-grams are typically collected from a text or speech corpus (a long text dataset). An `N-gram` of size `1` is referred to as a `"unigram"`, size `2` is a `"bigram"`, and size `3` is a `"trigram"`.

<div align="center">
<img src="img/n_gram_ex.png" alt="n_gram_ex.png" width="500px">
</div>
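
A minimal way to extract word-level N-grams is to slide a window of size `n` over the token list. The function name `ngrams` below is chosen for this sketch, and whitespace splitting is used instead of a real tokenizer.

```python
def ngrams(tokens, n):
    """Return all contiguous n-token windows as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "this is not good movie".split()
unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(unigrams)
print(bigrams)
print(trigrams)
```

A sentence of `k` tokens yields `k - n + 1` N-grams, and none at all when `n` exceeds `k`.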

> Why N-gram?


N-grams play an important role in text analysis in machine learning. Sometimes a **single word alone** isn't sufficient to capture the *context* of a text. Let's see how N-grams help with an example.

Suppose we need to predict whether the sentiment of a text is positive or negative.

`text = "The Margherita pizza is not bad taste"`

If we consider only `unigrams` (single words), the negative word `"bad"` leads to a wrong prediction. With `bigrams`, the bigram `"not bad"` helps to predict the text as positive.


<div align="center">
<img src="img/n-gram_ex1.png" alt="n_gram_ex1.png" width="700px">
</div>
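
The effect can be imitated with a toy scorer. Everything in this sketch is hypothetical: the two lexicons are invented for the illustration, and a real model would learn such weights from data rather than look them up.

```python
# Hypothetical toy lexicons, invented for this illustration only.
UNIGRAM_SCORES = {"bad": -1, "good": 1}
BIGRAM_SCORES = {("not", "bad"): 1, ("not", "good"): -1}

def unigram_score(tokens):
    """Sum the scores of individual words."""
    return sum(UNIGRAM_SCORES.get(t, 0) for t in tokens)

def bigram_score(tokens):
    """Score known bigrams first; fall back to unigram scores otherwise."""
    score, i = 0, 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in BIGRAM_SCORES:
            score += BIGRAM_SCORES[pair]
            i += 2  # both tokens are consumed by the bigram
        else:
            score += UNIGRAM_SCORES.get(tokens[i], 0)
            i += 1
    return score

tokens = "the margherita pizza is not bad taste".split()
print(unigram_score(tokens))
print(bigram_score(tokens))
```

With unigrams alone the sentence scores negative because of `bad`; once `("not", "bad")` is treated as a unit, the score turns positive.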


In [None]:
corpus = [
	"This is good movie",
	"This is good movie but actor is not present",
	"this is not good movie"
]