# Natural Language Text Preprocessing

In [1]:
"""
cd .\00text-preprocess\
jupyter nbconvert --to markdown pre-process.ipynb --output README.md
"""
import nltk
# nltk.download()

## Introduction

NLP Workflow: `Text -> Numbers -> Classification`

Data preprocessing is an essential step in building a machine learning model: the quality of the results depends on how well the data has been preprocessed.

In NLP, text preprocessing is the first step in the process of building a model.
The various text preprocessing steps are:

- Tokenization
- Lower casing
- Stop words removal
- Stemming
- Lemmatization




## Bag of Words Pipeline

- Get the Data/Corpus
- Tokenization, Stop words removal
- Stemming, Lemmatization
- Building a Vocab
- Vectorization
- Classification
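
The steps above can be sketched end to end with only the standard library. This is a minimal illustration, not the NLTK workflow used in the rest of this notebook: the toy corpus, the tiny stop list, and the helper names (`tokenize`, `preprocess`, `vectorize`) are all assumptions made up for the example, and stemming/lemmatization is left out.

```python
import re
from collections import Counter

# Hypothetical toy corpus and stop list, for illustration only.
corpus = [
    "The cat sat on the mat.",
    "The dog chased the cat!",
]
stop_words = {"the", "on", "a", "an"}


def tokenize(text):
    """Lowercase the text and keep runs of letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())


def preprocess(text):
    """Tokenize and drop stop words (stemming/lemmatization omitted here)."""
    return [tok for tok in tokenize(text) if tok not in stop_words]


# Build the vocabulary: one column per distinct token, in sorted order.
docs = [preprocess(doc) for doc in corpus]
vocab = sorted({tok for doc in docs for tok in doc})


def vectorize(tokens):
    """Count how often each vocabulary word occurs in one document."""
    counts = Counter(tokens)
    return [counts[word] for word in vocab]


vectors = [vectorize(doc) for doc in docs]
print(vocab)
print(vectors)
```

Each document becomes a fixed-length list of counts, which is the numeric form a classifier can consume.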

## Load Data/Corpus 

In [3]:
from nltk.corpus import brown

In [6]:
print(brown.categories())
print(len(brown.categories()))


['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies', 'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance', 'science_fiction']
15


Most corpora consist of a set of files, each containing a document (or other pieces of text). A list of identifiers for these files is accessed via the `fileids()` method of the corpus reader:

In [11]:
print(brown.fileids()[:20])


['ca01', 'ca02', 'ca03', 'ca04', 'ca05', 'ca06', 'ca07', 'ca08', 'ca09', 'ca10', 'ca11', 'ca12', 'ca13', 'ca14', 'ca15', 'ca16', 'ca17', 'ca18', 'ca19', 'ca20']


In [14]:
data = brown.sents(categories="fiction")
data


[['Thirty-three'], ['Scotty', 'did', 'not', 'go', 'back', 'to', 'school', '.'], ...]

In [16]:
" ".join(data[1])

'Scotty did not go back to school .'

## Tokenization

Tokenization: splitting text into smaller units, such as sentences or words.
A string can be split into word tokens with `nltk.word_tokenize` and into sentences with `nltk.sent_tokenize`.


In [29]:
from nltk.tokenize import sent_tokenize, word_tokenize
# Prerequisite: nltk.download('punkt')
# (newer NLTK releases may ask for 'punkt_tab' instead)

In [15]:
sample_text = "Does this thing really work? Lets see."

In [14]:
sent_tokenize(sample_text)

['Does this thing really work?', 'Lets see.']

In [26]:
words = word_tokenize(sample_text)
words


['Does', 'this', 'thing', 'really', 'work', '?', 'Lets', 'see', '.']
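
To get a feel for what a tokenizer has to do, here is a rough regex-based approximation. It is not how `word_tokenize` or `sent_tokenize` work internally; the function names `simple_word_tokenize` and `simple_sent_tokenize` are hypothetical, and the rules are deliberately naive.

```python
import re


def simple_word_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def simple_sent_tokenize(text):
    """Split after '.', '?' or '!' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.?!])\s+", text.strip()) if s]


sample_text = "Does this thing really work? Lets see."
print(simple_sent_tokenize(sample_text))
print(simple_word_tokenize(sample_text))
```

On this sample the result matches the NLTK output above, but the sketch would break on abbreviations such as `Mr.` and on contractions such as `don't`, which is one reason to prefer a trained tokenizer.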

## Stopwords

Stop words removal: Stop words are very common words (**a, an, the, etc.**) that carry little information because they do not help distinguish one document from another. `nltk.corpus.stopwords.words('english')` returns a list of English stop words; we then remove every token that appears in that list.


In [17]:
from nltk.corpus import stopwords
# Prerequisite: nltk.download('stopwords')

In [25]:
stop = stopwords.words('english')

print(stop[:20])

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his']


In [29]:
clean_words = [w for w in words if w not in stop]
print(words)
print(clean_words)


['Does', 'this', 'thing', 'really', 'work', '?', 'Lets', 'see', '.']
['Does', 'thing', 'really', 'work', '?', 'Lets', 'see', '.']


> !! **Watch out for uppercase**: `this` was removed above because it is a stop word, but the comparison is case-sensitive (which is also why `Does` survived). If the text had contained `This`, it would not have been removed.

In [30]:
sample_text = "Does This thing really work? Lets see."
words = word_tokenize(sample_text)
words

['Does', 'This', 'thing', 'really', 'work', '?', 'Lets', 'see', '.']

In [31]:
clean_words = [w for w in words if w not in stop]
print(words)
print(clean_words)


['Does', 'This', 'thing', 'really', 'work', '?', 'Lets', 'see', '.']
['Does', 'This', 'thing', 'really', 'work', '?', 'Lets', 'see', '.']


> Solution: lowercase the text before tokenizing.

In [32]:
sample_text = "Does This thing really work? Lets see."
sample_text = sample_text.lower()
words = word_tokenize(sample_text)
words

['does', 'this', 'thing', 'really', 'work', '?', 'lets', 'see', '.']

In [33]:
clean_words = [w for w in words if w not in stop]
print(words)
print(clean_words)


['does', 'this', 'thing', 'really', 'work', '?', 'lets', 'see', '.']
['thing', 'really', 'work', '?', 'lets', 'see', '.']


### Including punctuation

In [35]:
import string
punctuations = list(string.punctuation)
stop = stop + punctuations

In [36]:
clean_words = [w for w in words if w not in stop]
print(words)
print(clean_words)

['does', 'this', 'thing', 'really', 'work', '?', 'lets', 'see', '.']
['thing', 'really', 'work', 'lets', 'see']
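
The lowercasing, stop word, and punctuation steps can be folded into one helper. This is a sketch under assumptions: `remove_stopwords` and `mini_stop` are hypothetical names, and the short stop list stands in for `stopwords.words('english')` so that the snippet runs without any downloads.

```python
import string


def remove_stopwords(tokens, stop_words):
    """Lowercase each token and drop stop words and punctuation tokens."""
    # A set gives constant-time membership tests; a list is scanned each time.
    blocked = set(stop_words) | set(string.punctuation)
    lowered = (tok.lower() for tok in tokens)
    return [tok for tok in lowered if tok not in blocked]


# Hypothetical mini stop list; the notebook uses stopwords.words('english').
mini_stop = ["does", "this", "the", "a", "an"]
tokens = ["Does", "This", "thing", "really", "work", "?", "Lets", "see", "."]
print(remove_stopwords(tokens, mini_stop))
```

Converting the stop list to a `set` matters mostly on large corpora, where each membership test against a list would scan the whole list.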


## Stemming



Stemming: the process of reducing a word to its root form (stem), typically by stripping suffixes.
We stem the tokens using `nltk.stem.porter.PorterStemmer` to get the stemmed tokens.

In [2]:
from nltk.stem import PorterStemmer
ps = PorterStemmer()

words = ["play","playing","player","played"]

stemmed_words = [ ps.stem(w) for w in words]
stemmed_words

['play', 'play', 'player', 'play']

In [13]:
words = ["machine","happying"]

stemmed_words = [ps.stem(w) for w in words]
stemmed_words

['machin', 'happi']

Explanation: The stemmer chops the suffix `'e'` off `'machine'`, leaving `'machin'`, which is not an English word. Producing stems that are not real words is a disadvantage of stemming.
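
The toy function below shows why rule-based suffix stripping can yield non-words. It is far cruder than the Porter algorithm and is not what `PorterStemmer` does; the name `toy_stem` and its suffix list are assumptions chosen only to illustrate the idea.

```python
def toy_stem(word, suffixes=("ing", "ed", "es", "s", "e")):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


for w in ["play", "playing", "played", "plays", "machine"]:
    print(w, "->", toy_stem(w))
```

The rules know nothing about the vocabulary, so `machine` loses its final `e` just as readily as `playing` loses `ing`.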


## POS (part of speech) tagger

We can use `nltk.pos_tag` to retrieve the `part of speech` of each token in a list.

`pos_tag` **abbreviations** (a few examples):

- NNP: proper noun, singular (Sarah)
- NNS: common noun, plural
- NNPS: proper noun, plural (Indians or Americans)
- PDT: predeterminer (all, both, half)
- POS: possessive ending (parent's)
- DT: determiner
- ...

[https://stackoverflow.com/questions/15388831/what-are-all-possible-pos-tags-of-nltk](https://stackoverflow.com/questions/15388831/what-are-all-possible-pos-tags-of-nltk)

In [61]:
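from nltk import pos_tag
# Prerequisite: nltk.download('averaged_perceptron_tagger')
# (newer NLTK releases may need 'averaged_perceptron_tagger_eng')
pos_tag(['any'])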

[('any', 'DT')]

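> Watch out: `pos_tag` takes a `list` of tokens, not a `string`. Passing a plain string does not tag words: depending on the NLTK version, it either raises an error or tags each character separately.

Load text: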

In [14]:
from nltk.corpus import state_union
# Prerequisite: 
# nltk.download('state_union')

In [22]:
documents = state_union.fileids()
print(documents[:20])

['1945-Truman.txt', '1946-Truman.txt', '1947-Truman.txt', '1948-Truman.txt', '1949-Truman.txt', '1950-Truman.txt', '1951-Truman.txt', '1953-Eisenhower.txt', '1954-Eisenhower.txt', '1955-Eisenhower.txt', '1956-Eisenhower.txt', '1957-Eisenhower.txt', '1958-Eisenhower.txt', '1959-Eisenhower.txt', '1960-Eisenhower.txt', '1961-Kennedy.txt', '1962-Kennedy.txt', '1963-Johnson.txt', '1963-Kennedy.txt', '1964-Johnson.txt']


In [27]:
speech = state_union.raw('2006-GWBush.txt')
speech[:500]

"PRESIDENT GEORGE W. BUSH'S ADDRESS BEFORE A JOINT SESSION OF THE CONGRESS ON THE STATE OF THE UNION\n \nJanuary 31, 2006\n\nTHE PRESIDENT: Thank you all. Mr. Speaker, Vice President Cheney, members of Congress, members of the Supreme Court and diplomatic corps, distinguished guests, and fellow citizens: Today our nation lost a beloved, graceful, courageous woman who called America to its founding ideals and carried on a noble dream. Tonight we are comforted by the hope of a glad reunion with the hus"

In [38]:
speech_in_words = word_tokenize(speech)
pos = pos_tag(speech_in_words)
pos[:10]

[('PRESIDENT', 'NNP'),
 ('GEORGE', 'NNP'),
 ('W.', 'NNP'),
 ('BUSH', 'NNP'),
 ("'S", 'POS'),
 ('ADDRESS', 'NNP'),
 ('BEFORE', 'IN'),
 ('A', 'NNP'),
 ('JOINT', 'NNP'),
 ('SESSION', 'NNP')]
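
Because the tagger returns plain `(token, tag)` tuples, the result can be summarized with ordinary Python. The sketch below hard-codes the ten pairs from the output above so that it runs without NLTK; the variable names are arbitrary.

```python
from collections import Counter

# First ten (token, tag) pairs copied from the tagger output above.
tagged = [
    ("PRESIDENT", "NNP"), ("GEORGE", "NNP"), ("W.", "NNP"), ("BUSH", "NNP"),
    ("'S", "POS"), ("ADDRESS", "NNP"), ("BEFORE", "IN"), ("A", "NNP"),
    ("JOINT", "NNP"), ("SESSION", "NNP"),
]

# Count how often each tag occurs.
tag_counts = Counter(tag for _, tag in tagged)
print(tag_counts.most_common())

# Keep only the tokens tagged as singular proper nouns.
proper_nouns = [word for word, tag in tagged if tag == "NNP"]
print(proper_nouns)
```

Even the article `A` is tagged `NNP` here, which suggests that the all-caps heading confuses the tagger; lowercasing before tagging may change the result.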

## Lemmatization

**Lemmatization**: Unlike stemming, lemmatization reduces a word to a form that actually exists in the language (its lemma).

To resolve a word to its `lemma`, the **`part of speech` of the word is required**. This helps transform the word into a proper root form. However, it requires extra linguistic tooling, such as a **part of speech tagger**.

[what-is-the-difference-between-stemming-and-lemmatization/](https://blog.bitext.com/what-is-the-difference-between-stemming-and-lemmatization/)

Lemmatization is often preferred over stemming because it performs a morphological analysis of each word, at the cost of more computation.


In [43]:
from nltk.stem import WordNetLemmatizer
# Prerequisite: nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()

In [47]:
lemmatizer.lemmatize("bats")

'bat'

In [48]:
sentence = "The striped bats are hanging on their feet for best"
# Tokenize: Split the sentence into words
word_list = nltk.word_tokenize(sentence)
print(word_list)
# Lemmatize list of words and join
lemmatized_output = ' '.join([lemmatizer.lemmatize(w) for w in word_list])
print(lemmatized_output)

['The', 'striped', 'bats', 'are', 'hanging', 'on', 'their', 'feet', 'for', 'best']
The striped bat are hanging on their foot for best


Notice that it didn't do a good job: `'are'` is not converted to `'be'` and `'hanging'` is not converted to `'hang'`. This is because `lemmatize()` treats every word as a noun by default. It can be corrected by passing the correct **part-of-speech tag (`POS` tag)** as the second argument to `lemmatize()`. The same word can have multiple lemmas depending on its meaning and context.

In [49]:
lemmatizer.lemmatize("painting", pos='n')

'painting'

In [50]:
lemmatizer.lemmatize("painting", pos='v')

'paint'

In [74]:
lemmatizer.lemmatize("hanging", pos='v')


'hang'

In [80]:
lemmatizer.lemmatize("are", pos='v')

'be'

In [81]:
lemmatizer.lemmatize("is", pos='v')

'be'

In [87]:
w = "hanging"
postag = pos_tag([w])
postag


[('hanging', 'VBG')]

A simple function to convert `pos_tag` abbreviations into the simple form that `lemmatize()` accepts: for example, `NN`, `NNS`, etc. map to `n`, and `VBG` etc. map to `v`.

In [53]:
from nltk.corpus import wordnet
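def get_simple_pos(tag):
    """Map a Penn Treebank tag (e.g. 'VBG') to a WordNet POS constant."""
    if tag.startswith("J"):
        return wordnet.ADJ
    elif tag.startswith("V"):
        return wordnet.VERB
    elif tag.startswith("N"):
        return wordnet.NOUN
    elif tag.startswith("R"):
        return wordnet.ADV
    else:
        # Fall back to noun, which is also lemmatize()'s default.
        return wordnet.NOUN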


In [85]:
w="hanging"
postag = pos_tag([w])
print(postag)
print(postag[0][1]+" --> ",end="" )
pos = get_simple_pos(postag[0][1])
print(pos)


[('hanging', 'VBG')]
VBG --> v
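
The same mapping can also be written as a lookup table. The sketch below avoids importing NLTK by using WordNet's single-letter codes directly (`'a'`, `'v'`, `'n'`, `'r'`, the values behind `wordnet.ADJ`, `wordnet.VERB`, `wordnet.NOUN`, and `wordnet.ADV`); the names `PENN_PREFIX_TO_WORDNET` and `penn_to_wordnet` are hypothetical.

```python
# WordNet's single-letter POS codes (the values of wordnet.ADJ, wordnet.VERB,
# wordnet.NOUN and wordnet.ADV in NLTK).
PENN_PREFIX_TO_WORDNET = {"J": "a", "V": "v", "N": "n", "R": "r"}


def penn_to_wordnet(tag, default="n"):
    """Map a Penn Treebank tag to a WordNet POS code via its first letter."""
    return PENN_PREFIX_TO_WORDNET.get(tag[:1], default)


for tag in ["JJ", "VBG", "NNS", "RB", "DT"]:
    print(tag, "-->", penn_to_wordnet(tag))
```

A dictionary keeps the mapping in one place, so adding or changing a rule does not require touching the control flow.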


In [88]:
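sentence = "The striped bats are hanging on their feet for best"
word_list = nltk.word_tokenize(sentence)
o = []
for w in word_list:
    # Note: each word is tagged in isolation, so the tagger sees no context.
    postag = pos_tag([w])
    # Convert the Penn Treebank tag to the form lemmatize() expects.
    pos = get_simple_pos(postag[0][1])
    clean_word = lemmatizer.lemmatize(w, pos=pos)
    o.append(clean_word)

print(o)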



['The', 'strip', 'bat', 'be', 'hang', 'on', 'their', 'foot', 'for', 'best']


[https://www.machinelearningplus.com/nlp/lemmatization-examples-python/](https://www.machinelearningplus.com/nlp/lemmatization-examples-python/)