## Text Preprocessing Using NLTK

Text preprocessing is the process of cleaning our data to obtain meaningful information by removing unnecessary words, acronyms, smileys, etc.

This notebook demonstrates, with code, the important things to consider while performing text preprocessing:
    
    1. Converting Text to same Case
    2. Tokenization
    3. Stemming and Lemmatization
    4. Stopwords
    5. Normalization
    6. Noise Word Removal 

## Importing Libraries

In [1]:
#!pip install nltk
#nltk.download()

In [None]:
import nltk

## Converting to Same Case

When dealing with text, we usually have a mix of upper- and lower-case words, which needs to be handled. Assume a text contains "India", "india" and "INDIA". The number of features will then be 3 when it should be 1. This adds unnecessary computation and makes for a poor dataset.
Thus, we convert the text to a single case.

Normally in NLP we convert the text to lowercase, but converting everything to uppercase gives the same result.

In [None]:
text = "His name is Xyz. XYZ is from ABC. xyz studied in abc and has job in Abc."
text.lower()

Without lowercasing, just two words ("xyz" and "abc") show up as 6 distinct words, i.e. 6 features. The count can grow very quickly if case sensitivity is not handled.
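As a rough illustration of the feature count, the sketch below builds the vocabulary of the sample sentence before and after lowercasing. It uses only the standard library; the regex that extracts runs of letters is an assumption made here so that trailing periods do not create extra entries.

```python
import re

text = "His name is Xyz. XYZ is from ABC. xyz studied in abc and has job in Abc."

# Keep only runs of letters so that "Xyz." and "Xyz" are not counted separately
tokens = re.findall(r"[A-Za-z]+", text)

# Vocabulary with the original casing versus a lowercased vocabulary
raw_vocab = set(tokens)
lower_vocab = {token.lower() for token in tokens}

print(len(raw_vocab), sorted(raw_vocab))
print(len(lower_vocab), sorted(lower_vocab))
```

For this sentence the vocabulary shrinks from 15 entries to 11 once the three spellings of each name collapse into one.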

## Tokenization

Tokenization is the process of splitting text into words or sentences, depending on the requirement.

In [None]:
text = """I have three visions for India. In 3000 years of our history, people from all over 
               the world have come and invaded us, captured our lands, conquered our minds. 
               From Alexander onwards, the Greeks, the Turks, the Moguls, the Portuguese, the British,
               the French, the Dutch, all of them came and looted us, took over what was ours. 
               Yet we have not done this to any other nation. We have not conquered anyone. 
               We have not grabbed their land, their culture, 
               their history and tried to enforce our way of life on them. 
"""               

In [None]:
from nltk.tokenize import sent_tokenize, word_tokenize

In [None]:
sentences = sent_tokenize(text)
sentences[:5]

In [None]:
words = word_tokenize(text)
print('|'.join(words))

### Few points about tokenization 

We can see that each sentence ends with ".", so we could also have used a regular expression.

However, when "." appears inside a sentence, as in "Mr." or "Dr.", the sentence will be split at that point as well.

We would therefore need to add multiple conditions. Instead, we prefer nltk's sent_tokenize, which already has this pattern checking built in.

In [None]:
import re
msg = "Mr. XYS is the chairman of the company. He has two other companies also."

In [None]:
msg.split('.')

In [None]:
sent_tokenize("Mr. XYS is the chairman of the company. He has two other companies also.")
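To give a feel for the "multiple conditions" a hand-written splitter would need, here is a minimal standard-library sketch. The function name `naive_sent_split` and the abbreviation list are hypothetical and deliberately incomplete; this is not how `sent_tokenize` works internally.

```python
import re

# Hypothetical, incomplete list of abbreviations that must not end a sentence
ABBREVIATIONS = {"mr.", "mrs.", "ms.", "dr."}

def naive_sent_split(text):
    """Split after ., ! or ? and re-join pieces that ended on an abbreviation."""
    pieces = re.split(r"(?<=[.!?])\s+", text.strip())
    sentences = []
    for piece in pieces:
        # If the previous piece ended with a known abbreviation, glue this one on
        if sentences and sentences[-1].split()[-1].lower() in ABBREVIATIONS:
            sentences[-1] += " " + piece
        else:
            sentences.append(piece)
    return sentences

msg = "Mr. XYS is the chairman of the company. He has two other companies also."
print(naive_sent_split(msg))
```

Every new abbreviation, decimal number or initial would need another rule, which is why a ready-made tokenizer is usually the more practical choice.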

Although we can use the split function to obtain words, word_tokenize also separates punctuation into its own tokens, which a plain split on whitespace does not do.
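To see what a plain split misses, the sketch below compares `str.split` with a small regex tokenizer. The regex is a stand-in written for illustration only; it is not the algorithm that `word_tokenize` uses.

```python
import re

sentence = "We have not conquered anyone."

# Whitespace split keeps the final period glued to the last word
split_tokens = sentence.split()

# Illustrative regex: runs of word characters, or any single punctuation mark
regex_tokens = re.findall(r"\w+|[^\w\s]", sentence)

print(split_tokens)
print(regex_tokens)
```

With the whitespace split, "anyone." and "anyone" would count as two different features, whereas the regex version yields "anyone" and "." as separate tokens.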

## Stemming and Lemmatization

Stemming reduces a word to its stem, which may or may not be a meaningful word.

history --> histori | 
historical --> histor

going --> go | 
goes --> goe | 
gone --> gone

In [None]:
from nltk.stem import SnowballStemmer
stemmer = SnowballStemmer('english')
print(stemmer.stem("history"))
print(stemmer.stem("historical"))
print()
print(stemmer.stem("going"))
print(stemmer.stem("goes"))
print(stemmer.stem("gone"))

Lemmatization reduces a word to its root word (lemma), which is always a meaningful word. For lemmatization to work properly, it requires the part of speech as an input. If this input is not provided, the word is assumed to be a noun. Because of this extra work, lemmatization is slower than stemming, but it gives better results.

history --> history |
historical --> historical

going --> go |
went --> go |
gone --> go

In [None]:
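from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
# Without a pos argument, each word is treated as a noun
print(lemmatizer.lemmatize("history"))
print(lemmatizer.lemmatize("historical"))
print()
print(lemmatizer.lemmatize("going", pos='v'))
print(lemmatizer.lemmatize("went"))           # treated as a noun, so it stays "went"
print(lemmatizer.lemmatize("went", pos='v'))  # treated as a verb, so it becomes "go"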
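The effect of the part-of-speech argument can be pictured as a dictionary lookup keyed by both the word and its tag. The toy sketch below is only an analogy: `LEMMA_TABLE` and `toy_lemmatize` are hypothetical names, and the real lemmatizer relies on the much larger WordNet lexicon plus morphological rules.

```python
# Hypothetical mini-lexicon keyed by (word, part of speech)
LEMMA_TABLE = {
    ("going", "v"): "go",
    ("went", "v"): "go",
    ("gone", "v"): "go",
    ("meeting", "v"): "meet",
}

def toy_lemmatize(word, pos="n"):
    """Return the lemma for (word, pos), or the word unchanged if unknown."""
    return LEMMA_TABLE.get((word, pos), word)

print(toy_lemmatize("went"))               # looked up as a noun: unchanged
print(toy_lemmatize("went", pos="v"))      # looked up as a verb
print(toy_lemmatize("meeting"))            # the noun "meeting" is already a lemma
print(toy_lemmatize("meeting", pos="v"))   # the verb form reduces to "meet"
```

The default of `pos="n"` mirrors the noun assumption described above: the same surface form can map to different lemmas depending on the tag supplied.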

Stemming is faster than lemmatization.

Many stems produced by stemming are not meaningful words.

Stemming may reduce two words with different roots to the same stem.

Stemming may reduce two words with the same root to different stems.
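The last two points can be reproduced with a deliberately naive suffix stripper. The function `toy_stem` and its suffix list are assumptions made for this sketch; it is far cruder than the Snowball stemmer used above and only serves to show how rule-based stripping goes wrong in both directions.

```python
def toy_stem(word):
    """Strip the first matching suffix, keeping at least two characters."""
    for suffix in ("ing", "ity", "es", "e", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: -len(suffix)]
    return word

# Different roots, same stem
print(toy_stem("university"), toy_stem("universe"))

# Same root, different stems
print(toy_stem("going"), toy_stem("goes"), toy_stem("gone"))
```

Here "university" and "universe" collapse to the same stem although they mean different things, while "gone" ends up with a different stem from "going" and "goes" although all three come from "go".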

## Stopwords

Stopwords are words that do not convey meaningful information.

    Eg: the, is, am, are.

So we can remove the stopwords and still be left with a meaningful sentence. While nltk provides a list of stopwords, sometimes we have to add new words to it or remove a few words from it.

In [None]:
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
print("|".join(stop_words))

We can see that the list contains negation words, which are very important when trying to understand negative or sarcastic sentences.

    Eg. I am not hungry --> I am hungry.
    He is not dead --> He is dead
    
The meaning is very different here. So, sometimes we have to create our own stopword set.

In [None]:
text1 = "He worked very hard but was not able to succeed"
# Lowercase each word before the lookup: the NLTK stopword list is all lowercase
text2 = [word for word in text1.split(' ') if word.lower() not in stop_words]
text2 = ' '.join(text2)
text2

Looking at the result, we would infer that hard work led to success. But this is clearly not the case. So, we have to be careful when removing stopwords.
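One way to build a custom stopword set is to subtract the negation words from the base list before filtering. The sketch below is self-contained, so it uses a small hand-picked `base_stop_words` set as a stand-in; in the notebook one would start from `stopwords.words('english')` instead. The contents of `negations` are likewise an assumption.

```python
# Stand-in for the NLTK list: a small hand-picked subset for illustration
base_stop_words = {"he", "very", "but", "was", "not", "to", "i", "am", "is"}

# Negation words we want to keep in the text (assumed list, extend as needed)
negations = {"not", "no", "nor"}

custom_stop_words = base_stop_words - negations

text1 = "He worked very hard but was not able to succeed"
kept = [word for word in text1.split() if word.lower() not in custom_stop_words]
text2 = " ".join(kept)
print(text2)
```

With "not" retained, the filtered sentence reads "worked hard not able succeed", which preserves the negative outcome.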

## Normalization

When abbreviations such as Dr. or Mr. are used, they generally highlight the importance of the word that follows them.
But in data from informal conversations, words are often written like this:
    
    2morrow --> tomorrow
    2moro --> tomorrow
    
    :) --> smile
    :-) --> smile
    
These variants carry the same meaning and express the emotion of the person. The words "tomorrow" and "smile" may be important, and although the abbreviations and symbols look different, they convey the same meaning. Depending on the task, for example analyzing customer reviews, this may be of huge significance.

These words are neither stopwords nor part of the English language, so we convert them into English. This can be done by collecting all the tokens that do not consist only of letters and then dealing with them.

In [None]:
text = "We will meet 2morrow. Evr1 must come b4 8. :)"
# Collect tokens that are not purely alphabetic; these are candidates for normalization
words = [word for word in text.split(" ") if not word.isalpha()]
words

This is a generally overlooked but very important part of text preprocessing when working on sentiment analysis. Although it is time-consuming, significant improvements are obtained when normalization is done.
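Once the non-alphabetic tokens have been collected, a simple way to deal with them is a lookup table. The sketch below is one possible approach, not a complete solution: `NORMALIZATION_MAP` and `normalize` are hypothetical names, and the entries for "evr1" and "b4" are assumed meanings that would normally be curated by hand for the dataset.

```python
# Hypothetical hand-built mapping from informal tokens to standard words
NORMALIZATION_MAP = {
    "2morrow": "tomorrow",
    "2moro": "tomorrow",
    "evr1": "everyone",   # assumed meaning
    "b4": "before",       # assumed meaning
    ":)": "smile",
    ":-)": "smile",
}

def normalize(text):
    """Replace known informal tokens, keeping any trailing punctuation."""
    normalized = []
    for token in text.split():
        key = token.lower()
        if key in NORMALIZATION_MAP:
            normalized.append(NORMALIZATION_MAP[key])
            continue
        # Retry without trailing sentence punctuation, e.g. "2morrow."
        stripped = key.rstrip(".,!?")
        trailing = token[len(stripped):]
        if stripped in NORMALIZATION_MAP:
            normalized.append(NORMALIZATION_MAP[stripped] + trailing)
        else:
            normalized.append(token)
    return " ".join(normalized)

text = "We will meet 2morrow. Evr1 must come b4 8. :)"
print(normalize(text))
```

Tokens that are not in the table, such as "8.", pass through unchanged, so the table can be grown incrementally as new variants are found.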

## Noise Word Removal

When dealing with informal text like tweets, we come across scenarios like:

    ..trouble..
    trouble<
    1.trouble
    
Neither stemming nor lemmatization reduces such tokens to the underlying word. So, in such cases we have to remove the unnecessary "." and other stray characters, i.e. perform cleaning.

In [None]:
import re
text = "..trouble. trouble< 1.trouble"
message = re.sub("[^a-zA-Z]", " ", text)
message

Here, we replaced every character other than the English letters with a space, which removes the noise.
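Because each removed character becomes a space, the cleaned string contains runs of consecutive spaces. A possible follow-up step, sketched here with the standard `re` module, collapses them and trims the ends; the variable names are chosen for this example.

```python
import re

text = "..trouble. trouble< 1.trouble"

# Step 1: replace every non-letter with a space (same pattern as above)
letters_only = re.sub(r"[^a-zA-Z]", " ", text)

# Step 2: collapse runs of whitespace into one space and trim the ends
cleaned = re.sub(r"\s+", " ", letters_only).strip()

print(repr(letters_only))
print(repr(cleaned))
```

After the second step all three noisy variants reduce to the same token, "trouble".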