# Natural Language Processing 

Can we make computers understand words and sentences? As mentioned in the previous chapter, one of the goals of artificial intelligence is to match or surpass important human capabilities. One of those capabilities is language (communicating, knowing the meaning of something, and arriving at conclusions based on words and sentences).

This is where Natural Language Processing, or NLP, comes in. It’s a branch of artificial intelligence that focuses on understanding and interpreting human language, in both text and speech.

Have you ever done a voice search in Google? Are you familiar with chatbots (they automatically respond based on your inquiries and words)? What about Google Translate?

These are all examples of Natural Language Processing (NLP) at work. In fact, within the next several years the NLP market might become a multi-billion dollar industry. That’s because it could be widely used in customer service, the creation of virtual assistants (similar to Iron Man’s JARVIS), healthcare documentation, and other fields.

Natural Language Processing is even used to understand the content and gauge the sentiment of social media posts, blog comments, product reviews, news, and other online sources. NLP is very useful in these areas because online activity generates massive amounts of data. Remember that we can vastly improve our data analysis and machine learning models if we have sufficient amounts of quality data to work with.

## Analyzing Words & Sentiments 


One of the most common uses of NLP is understanding the sentiment in a piece of text (e.g. Is it a positive or negative product review? What does the tweet say overall?). If we only have a dozen comments and reviews to read, we don’t need any technology to do the task. But what if we have to deal with hundreds or thousands of sentences?

Technology is very useful in this large-scale task. Implementing NLP can make our lives a bit easier and even make the results a bit more consistent and reproducible. 
To get started, let’s take a peek at Restaurant_Reviews.tsv:

| Review | Liked |
| --- | --- |
| Wow... Loved this place. | 1 |
| Crust is not good. | 0 |
| Not tasty and the texture was just nasty. | 0 |
| Stopped by during the late May bank holiday off Rick Steve recommendation and loved it. | 1 |
| The selection on the menu was great and so were the prices. | 1 |
| Now I am getting angry and I want my damn pho. | 0 |
| Honeslty it didn't taste THAT fresh.) | 0 |
| The potatoes were like rubber and you could tell they had been made up ahead of time being kept under a warmer. | 0 |
| The fries were great too. | 1 |

The first part is the statement in which a person shares his/her impression of or experience with the restaurant. The second part says whether that statement is negative or positive (0 if negative, 1 if positive or Liked). Notice that this is very similar to Supervised Learning, where the labels are available from the start.

However, NLP is different because we’re dealing mainly with text and language instead of numerical data. Also, understanding text (e.g. finding patterns and inferring rules) can be a huge challenge. That’s because language is often inconsistent and has no explicit rules. For instance, the meaning of a sentence can change dramatically when a few words are rearranged, omitted, or added. There’s also the matter of context: how the words are used greatly affects the meaning. We also have to deal with “filler” words that are only there to complete the sentence but are not important when it comes to meaning.

Understanding statements, getting the meaning and determining the emotional state of the writer could be a huge challenge. That’s why it’s really difficult even for experienced programmers to come up with a solution on how to deal with words and language. 

## Using NLTK 


One of the most popular suites for this kind of work is the Natural Language Toolkit (NLTK).
With NLTK (developed by Steven Bird and Edward Loper in the Department of Computer and Information Science at the University of Pennsylvania), text processing becomes a bit more straightforward because you’ll be using pre-built code instead of writing everything from scratch. In fact, universities in many countries incorporate NLTK in their courses.


### Step 1: Import the dataset with ‘\t’ as the delimiter, because the columns are separated by tabs
The reviews and their category (0 or 1) are separated by a tab rather than by any other symbol. Most other symbols (such as $ for a price, an ellipsis, or an exclamation mark) appear inside the reviews themselves, so using one of them as the delimiter would split the reviews in the wrong places and lead to errors or strange output.

In [2]:
# Importing Libraries 
import numpy as np   
import pandas as pd  
  
# Import dataset 
dataset = pd.read_csv('Restaurant_Reviews.tsv', delimiter = '\t')
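
To see why the tab delimiter matters, here is a minimal sketch that uses only Python's standard library instead of pandas. The two sample rows and the `sample_tsv` name are made up for illustration; the point is that commas and dollar signs inside a review stay intact when the parser splits on tabs only.

```python
import csv
import io

# Hypothetical two-row sample in the same layout as Restaurant_Reviews.tsv:
# a header row, then the review text, a tab, and the 0/1 label.
sample_tsv = (
    "Review\tLiked\n"
    "Great food, fair prices, only $5 for lunch!\t1\n"
    "Crust is not good.\t0\n"
)

# Split on tabs only; QUOTE_NONE treats quote characters as ordinary text.
reader = csv.reader(io.StringIO(sample_tsv), delimiter="\t", quoting=csv.QUOTE_NONE)
header, *rows = list(reader)

for review, liked in rows:
    print(repr(review), "->", liked)

# Splitting the same line on commas would cut the first review into pieces.
comma_split = "Great food, fair prices, only $5 for lunch!\t1".split(",")
print(len(comma_split))  # 3
```

With the tab as the delimiter, each row yields exactly two fields, which is the shape the rest of the steps rely on.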

### Step 2: Text Cleaning or Preprocessing
Remove punctuation and numbers: Punctuation and numbers don’t help much in processing the given text. If included, they just increase the size of the bag of words that we will create in a later step and decrease the efficiency of the algorithm.


Remove stopwords: Common “filler” words (such as “the” or “this”) carry little meaning on their own, so they are dropped.


Stemming: Reduce each word to its root (for example, “loved” becomes “love”).

Convert each word to lowercase: It is useless to keep the same word in different cases (e.g. ‘good’ and ‘GOOD’).

In [3]:
# library to clean data 
import re    
# Natural Language Toolkit 
import nltk    
nltk.download('stopwords')  
# to remove stopwords 
from nltk.corpus import stopwords  
# for stemming purposes  
from nltk.stem.porter import PorterStemmer   
# build the stopword set and the stemmer once, 
# instead of rebuilding them for every word 
stop_words = set(stopwords.words('english'))  
ps = PorterStemmer()  
# initialize an empty list 
# to collect the cleaned text  
corpus = []    
# clean every review in the dataset (1000 rows) 
for i in range(len(dataset)):        
    # column "Review", row i: replace every 
    # character that is not a letter with a space 
    review = re.sub('[^a-zA-Z]', ' ', dataset['Review'][i])        
    # convert everything to lowercase 
    review = review.lower()        
    # split into a list of words (on whitespace) 
    review = review.split()        
    # drop stopwords and stem each remaining word 
    review = [ps.stem(word) for word in review 
                if word not in stop_words]                    
    # rejoin the words into a single string 
    review = ' '.join(review)         
    # append the cleaned review to the corpus  
    corpus.append(review)  

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\bradl\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
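
To make the effect of the cleaning step concrete, the sketch below applies the same regular expression, lowercasing, and stopword filtering to two reviews using only the standard library. `SAMPLE_STOPWORDS` is a tiny hypothetical list chosen for illustration (NLTK's English list is much longer), and stemming is left out, so "loved" stays as it is here whereas the Porter stemmer would reduce it to "love".

```python
import re

# Tiny hypothetical stopword list for illustration only.
SAMPLE_STOPWORDS = {"this", "the", "is", "and", "was"}

def clean_review(text, stop_words=SAMPLE_STOPWORDS):
    """Keep letters only, lowercase, split into words, and drop stopwords."""
    letters_only = re.sub("[^a-zA-Z]", " ", text)
    words = letters_only.lower().split()
    return [word for word in words if word not in stop_words]

print(clean_review("Wow... Loved this place."))   # ['wow', 'loved', 'place']
print(clean_review("The fries were GREAT too!"))  # ['fries', 'were', 'great', 'too']
```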


### Step 3: Tokenization
Tokenization involves splitting the body of the text into sentences and words.
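
As a rough illustration of tokenization, the sketch below splits a short text into sentences and then into words with regular expressions. The helper names `split_sentences` and `split_words` are hypothetical, and the rules are deliberately naive (abbreviations such as "e.g." would be split incorrectly); NLTK ships more careful tokenizers in `nltk.tokenize`.

```python
import re

def split_sentences(text):
    """Split after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]

def split_words(sentence):
    """Return runs of letters (and apostrophes) as word tokens."""
    return re.findall(r"[A-Za-z']+", sentence)

text = "Crust is not good. The fries were great too!"
sentences = split_sentences(text)
tokens = [split_words(sentence) for sentence in sentences]

print(sentences)  # ['Crust is not good.', 'The fries were great too!']
print(tokens)     # [['Crust', 'is', 'not', 'good'], ['The', 'fries', 'were', 'great', 'too']]
```

In the cleaning code of Step 2, the word-level split is done by calling `.split()` on each review.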

### Step 4: Making the bag of words via sparse matrix
Take all the distinct words that appear in the reviews, with no word repeated.


One column is created for each word, so there are going to be many columns.


Each row represents one review.


If a word appears in a review, its count is stored in that review’s row, under the column for that word.

For this purpose we need the CountVectorizer class from sklearn.feature_extraction.text.
We can also cap the number of features with the “max_features” parameter, which keeps only the most frequent words. Calling “.fit_transform(corpus)” learns the vocabulary from the corpus and applies the transformation to that same corpus in one step; the result is then converted into an array. Whether a review is positive or negative is stored in the second column of the dataset, dataset.iloc[:, 1]: all rows, column index 1 (indexing starts from zero).

In [5]:
# Creating the Bag of Words model 
from sklearn.feature_extraction.text import CountVectorizer 
  
# Keep at most 1500 features (the most frequent words). 
# "max_features" is a parameter to 
# experiment with to get better results 
cv = CountVectorizer(max_features = 1500)  
  
# X contains the bag of words built from 
# the corpus (features, independent variables) 
X = cv.fit_transform(corpus).toarray()  
  
# y contains the labels: whether each 
# review is positive (1) or negative (0) 
y = dataset.iloc[:, 1].values  
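
The idea behind the bag of words can be sketched in a few lines of plain Python. This is a simplified illustration, not how CountVectorizer is implemented: it has no `max_features` cap, it builds a dense list of lists rather than a sparse matrix, and `toy_corpus` is a made-up set of already cleaned reviews.

```python
from collections import Counter

def bag_of_words(documents):
    """Return (vocabulary, count matrix) for whitespace-separated documents."""
    # One column per distinct word, in alphabetical order.
    vocabulary = sorted({word for doc in documents for word in doc.split()})
    matrix = []
    for doc in documents:
        counts = Counter(doc.split())
        # One row per document: how often each vocabulary word occurs in it.
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix

# Hypothetical cleaned reviews, in the style of the corpus built in Step 2.
toy_corpus = ["wow love place", "crust good", "love crust love"]
vocabulary, matrix = bag_of_words(toy_corpus)

print(vocabulary)  # ['crust', 'good', 'love', 'place', 'wow']
for row in matrix:
    print(row)
# [0, 0, 1, 1, 1]
# [1, 1, 0, 0, 0]
# [1, 0, 2, 0, 0]
```

Most entries are zero because each review uses only a few of the words in the vocabulary, which is why the real matrix is stored in sparse form before `.toarray()` is called.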

#### Description of the dataset to be used:
*Columns are separated by \t (tab)*


*The first column contains the reviews*


*In the second column, 0 marks a negative review and 1 marks a positive review*

### Step 5 : Splitting Corpus into Training and Test set. 
For this we need the train_test_split function from sklearn.model_selection (older scikit-learn releases provided it in sklearn.cross_validation, which has since been removed). The split can be 70/30, 80/20, 85/15, or 75/25; here I choose 75/25 via “test_size”.
X is the bag of words, y is 0 or 1 (negative or positive).

In [7]:
# Splitting the dataset into 
# the Training set and Test set 
from sklearn.model_selection import train_test_split   
# experiment with "test_size" 
# to get better results 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25) 
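
What the split does can be approximated with the standard library alone. The function below is a hypothetical stand-in, not scikit-learn's implementation: it shuffles the row indices with a fixed seed and cuts them into a test part and a training part, keeping each row of X paired with its label in y. In scikit-learn itself, passing `random_state` to `train_test_split` plays the role of the seed and makes the split repeatable.

```python
import math
import random

def simple_train_test_split(X, y, test_size=0.25, seed=0):
    """Shuffle row indices, then cut them into a training part and a test part."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(len(X) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    X_train = [X[i] for i in train_idx]
    X_test = [X[i] for i in test_idx]
    y_train = [y[i] for i in train_idx]
    y_test = [y[i] for i in test_idx]
    return X_train, X_test, y_train, y_test

# Hypothetical stand-in data: 1000 rows, like the review dataset.
X_demo = [[i] for i in range(1000)]
y_demo = [i % 2 for i in range(1000)]
X_tr, X_te, y_tr, y_te = simple_train_test_split(X_demo, y_demo, test_size=0.25, seed=42)

print(len(X_tr), len(X_te))  # 750 250
```

With 1000 reviews and a test size of 0.25, the test set holds 250 reviews, which matches the total of the confusion matrix in Step 8.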

### Step 6: Fitting a Predictive Model (here, random forest)
Since random forest is an ensemble model (made of many trees), import the RandomForestClassifier class from sklearn.ensemble.


Use 501 trees (“n_estimators”) and ‘entropy’ as the criterion.


Fit the model with the .fit() method, passing X_train and y_train as arguments.

In [8]:
# Fitting Random Forest Classification 
# to the Training set 
from sklearn.ensemble import RandomForestClassifier 
  
# n_estimators is the number of trees 
# in the forest; experiment with it 
# to get better results  
model = RandomForestClassifier(n_estimators = 501, criterion = 'entropy')                               
model.fit(X_train, y_train)  

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='entropy',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=501, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
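
The ‘entropy’ criterion tells each tree to prefer splits that make the resulting groups of labels as pure as possible. The sketch below is a simplified illustration of that measure, not scikit-learn's implementation; the function names and the toy label lists are assumptions made for the example.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of the two children."""
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(parent)
    return entropy(parent) - weighted

# Four negative (0) and four positive (1) reviews: maximally mixed.
parent = [0, 0, 0, 0, 1, 1, 1, 1]
print(entropy(parent))

# A split that separates the classes perfectly gains the most.
print(information_gain(parent, [0, 0, 0, 0], [1, 1, 1, 1]))

# A split that leaves both halves as mixed as before gains nothing.
print(information_gain(parent, [0, 0, 1, 1], [0, 0, 1, 1]))
```

An evenly mixed set of two classes has an entropy of 1 bit and a pure set has 0, so a useful split is one that moves the children toward 0.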

### Step 7: Predicting Final Results 
Call the .predict() method with X_test as its argument.


In [9]:
# Predicting the Test set results 
y_pred = model.predict(X_test) 
  
y_pred 

array([1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0,
       0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1,
       0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0,
       0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0,
       0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0,
       0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1,
       0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1,
       0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1,
       1, 0, 0, 0, 0, 1, 1, 1], dtype=int64)

### Step 8: Build the confusion matrix to measure accuracy

In [11]:
# Making the Confusion Matrix 
from sklearn.metrics import confusion_matrix   
cm = confusion_matrix(y_test, y_pred)   
print(cm) 

[[114  21]
 [ 41  74]]
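
The accuracy can be read off the confusion matrix printed above. In scikit-learn's convention the rows are the true labels and the columns are the predicted labels, so for labels 0 and 1 the four cells are true negatives, false positives, false negatives, and true positives. The short calculation below uses those printed numbers; precision and recall are added as related measures that the same matrix provides.

```python
# Confusion matrix from the output above:
# rows are true labels (0, 1), columns are predicted labels (0, 1).
cm = [[114, 21],
      [41, 74]]

tn, fp = cm[0]  # true negatives, false positives
fn, tp = cm[1]  # false negatives, true positives
total = tn + fp + fn + tp

accuracy = (tp + tn) / total   # share of all reviews classified correctly
precision = tp / (tp + fp)     # share of predicted positives that are positive
recall = tp / (tp + fn)        # share of actual positives that were found

print(f"accuracy  = {accuracy:.3f}")   # accuracy  = 0.752
print(f"precision = {precision:.3f}")  # precision = 0.779
print(f"recall    = {recall:.3f}")     # recall    = 0.643
```

In this run the model classified 188 of the 250 test reviews correctly, or about 75%. Because the train/test split and the forest are both random, the exact numbers will differ from run to run.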
