# Natural Language Processing

In [1]:
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

#### Display the full content of each column

In [2]:
pd.set_option('display.max_colwidth', None)  ## None = no truncation (-1 is deprecated in recent pandas versions)

#### Importing the dataset

In [3]:
## Review text may contain commas, which a CSV reader would treat as column separators, so the data is stored in a TSV file
## TSV: tab-separated values
dataset = pd.read_csv('Restaurant_Reviews.tsv', delimiter = '\t', quoting = 3)

# quoting = 3 corresponds to csv.QUOTE_NONE: double quotes in the reviews are treated as ordinary characters
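
As a small illustration of the `quoting` parameter, the sketch below reads a hypothetical two-row sample (not the real dataset) from memory. It assumes the same `Review<TAB>Liked` layout and shows that, with `csv.QUOTE_NONE`, double quotes and commas inside a review stay in the text.

```python
import csv
import io

import pandas as pd

# Hypothetical sample in the same tab-separated layout as Restaurant_Reviews.tsv.
sample = 'Review\tLiked\n"Great" food, really.\t1\nCrust is not good.\t0\n'

# csv.QUOTE_NONE is the named constant behind the numeric value 3.
print(csv.QUOTE_NONE)

df = pd.read_csv(io.StringIO(sample), delimiter='\t', quoting=csv.QUOTE_NONE)

# The quotes and the comma are kept as ordinary characters of the review.
print(df['Review'][0])
print(df.shape)
```

Writing `quoting=csv.QUOTE_NONE` instead of `quoting=3` is equivalent and makes the intent easier to read.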

In [4]:
dataset.head()

Unnamed: 0,Review,Liked
0,Wow... Loved this place.,1
1,Crust is not good.,0
2,Not tasty and the texture was just nasty.,0
3,Stopped by during the late May bank holiday off Rick Steve recommendation and loved it.,1
4,The selection on the menu was great and so were the prices.,1


In [5]:
len(dataset)

1000

### Cleaning the dataset

###### We need to get rid of everything that won't help the machine learning algorithm predict whether a review is good or bad (words such as "or" and "and", punctuation...). We will also apply stemming, which consists of reducing the different forms of the same word to their root (loved --> love).

###### We will also do some tokenization, which splits a review into words. Thanks to the text preprocessing, only relevant words will remain.

### First clean the first review, then apply the cleaning steps to all reviews using a loop

In [6]:
import re ## regular expressions

In [7]:
dataset['Review'][0]

'Wow... Loved this place.'

### 1st step: only keeping letters in the review

In [8]:
review = re.sub('[^a-zA-Z]', ' ', dataset['Review'][0] ) ## ' ' replaces each removed character with a space
review

'Wow    Loved this place '

### 2nd step: putting all letters in lowercase

In [9]:
review = review.lower()
review

'wow    loved this place '

### 3rd step: Removing non-significant words (stop words, which won't help the model predict whether a review is positive or negative)

In [10]:
import nltk
## nltk.download('stopwords')  ## run once if the stop-word list is not installed yet
from nltk.corpus import stopwords

The idea is to loop over the words in the review and check whether each word is in the stop-word list.

In [11]:
## convert review to a list
review = review.split()
review

['wow', 'loved', 'this', 'place']

In [12]:
## review is now a list of words
review = [word for word in review if word not in set(stopwords.words('english'))] ## a set gives faster membership tests than a list, which matters for longer reviews
review

['wow', 'loved', 'place']

### 4th step: Stemming: keep the root of each word (loved --> love)

In [13]:
from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()

In [14]:
review = [ps.stem(word) for word in review]  ## stop words were already removed in the previous step
review

['wow', 'love', 'place']

loved becomes love

### 5th step: join back the words of the list review, so it becomes one string

In [15]:
review = ' '.join(review)
review

'wow love place'

### Now apply the same steps to all reviews

In [16]:
corpus_list = []  ### a corpus is a collection of texts
stop_words = set(stopwords.words('english'))  ### build the set once instead of once per word
for i in range(len(dataset)):
    review = re.sub('[^a-zA-Z]', ' ', dataset['Review'][i])
    review = review.lower()
    review = review.split()
    review = [ps.stem(word) for word in review if word not in stop_words]
    review = ' '.join(review)
    corpus_list.append(review)
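
The five cleaning steps can also be wrapped in one reusable function. This is only a sketch: `clean_review`, `toy_stop_words` and `identity_stem` are hypothetical names, and the toy stop-word set and no-op stemmer are stand-ins so the example runs without NLTK.

```python
import re


def clean_review(text, stop_words, stem):
    """Keep letters only, lowercase, split, drop stop words, stem, and re-join."""
    words = re.sub('[^a-zA-Z]', ' ', text).lower().split()
    return ' '.join(stem(word) for word in words if word not in stop_words)


# Toy stand-ins (assumptions) so the sketch runs without NLTK data.
toy_stop_words = {'this', 'is', 'the'}


def identity_stem(word):
    """No-op stemmer used only for this illustration."""
    return word


cleaned = clean_review('Wow... Loved this place.', toy_stop_words, identity_stem)
print(cleaned)  # wow loved place
```

In the notebook itself, the same function could be called with `set(stopwords.words('english'))` (built once) and `ps.stem` as the last two arguments.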

In [17]:
corpus_list[0:5]

['wow love place',
 'crust good',
 'tasti textur nasti',
 'stop late may bank holiday rick steve recommend love',
 'select menu great price']

In [18]:
dataset['Review'][0:5]

0    Wow... Loved this place.                                                               
1    Crust is not good.                                                                     
2    Not tasty and the texture was just nasty.                                              
3    Stopped by during the late May bank holiday off Rick Steve recommendation and loved it.
4    The selection on the menu was great and so were the prices.                            
Name: Review, dtype: object

# Creating the Bag of Words model

###### To create the bag-of-words model, we take all the distinct words of the 1000 cleaned reviews ("wow", "love", "place", "crust", "good", "tasti", "textur", "nasti", and so on down to the last review), keeping each word only once however often it occurs, and create one column per word.
###### These columns are put in a table whose rows are the 1000 reviews. Each cell therefore corresponds to one specific review and one specific word of the corpus, and holds the number of times that word appears in that review.
###### For example, suppose the first column corresponds to the word "wow". "wow" appears once in the first review, so the cell in the first row of that column is 1. "wow" does not appear in the second review, so the cell in the second row is 0. Most cells will be 0: among the five reviews shown above, for instance, "wow" appears only in the first one.
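
To make the table concrete, here is a minimal pure-Python sketch that builds it by hand for the first three cleaned reviews. It only illustrates the idea; it is not how `CountVectorizer` is implemented, and the names `vocabulary` and `matrix` are chosen for this example.

```python
# First three cleaned reviews from corpus_list.
corpus = ['wow love place', 'crust good', 'tasti textur nasti']

# One column per distinct word, in alphabetical order.
vocabulary = sorted({word for review in corpus for word in review.split()})

# One row per review; each cell counts how often the column's word occurs in that review.
matrix = [[review.split().count(word) for word in vocabulary] for review in corpus]

print(vocabulary)
for row in matrix:
    print(row)
```

Each row contains mostly zeros, which is the sparsity discussed for the full corpus.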

#### Why do we need such a representation?

###### In the end we want to predict whether a review is positive or negative. To do that, a machine learning model must be trained on reviews whose true result is known, and here we know for each of the 1000 reviews whether it is positive or negative. During training, the model learns correlations between the words of a review and its true result.
###### To be trained, the model needs independent variables and one dependent variable. The independent variables are the word columns of the table; the dependent variable is the binary outcome: 1 if the review is positive, 0 if it is negative. Because this outcome is categorical, what we are doing is classification.

In [19]:
from sklearn.feature_extraction.text import CountVectorizer

https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

##### X: matrix of features (independent variables)

In [20]:
cv = CountVectorizer()
X = cv.fit_transform(corpus_list).toarray()

In [21]:
X.shape

(1000, 1565)

X has 1000 rows (the reviews) and 1565 columns (one per distinct word used in the 1000 reviews). With far more reviews to analyze, say 1 million, the vocabulary, and therefore the number of columns, would grow much larger, and the matrix would become extremely sparse. To limit this, we set the max_features parameter of CountVectorizer, which keeps only the most frequent words (the most relevant ones); let's choose 1500.
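
The effect of `max_features` can be sketched in pure Python as "count every word across the corpus and keep the k most frequent". This is a rough illustration with a hypothetical helper `top_k_vocabulary` and a made-up corpus; tie-breaking between equally frequent words may differ from scikit-learn.

```python
from collections import Counter


def top_k_vocabulary(corpus, k):
    """Return the k most frequent words of the corpus, sorted alphabetically."""
    counts = Counter(word for review in corpus for word in review.split())
    return sorted(word for word, _ in counts.most_common(k))


# Made-up corpus: good x3, food x2, place x2, bad x1, nasti x1.
toy_corpus = ['good food good place', 'good food', 'bad place', 'nasti']

kept = top_k_vocabulary(toy_corpus, 3)
print(kept)  # ['food', 'good', 'place']
```

The rare words ("bad", "nasti") are the ones dropped, which is why the column count falls from 1565 to 1500 in the notebook.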

In [22]:
cv = CountVectorizer(max_features=1500)
X = cv.fit_transform(corpus_list).toarray()

In [23]:
X.shape

(1000, 1500)

#### The dependent variable (the Liked column: positive or negative review)

In [24]:
y = dataset.iloc[:, 1].values

## Splitting the dataset into the Training set and Test set

In [25]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)

In [26]:
print('length of train set :' + str(len(X_train))+ '\nlength of test set :' + str(len(X_test)))

length of train set :800
length of test set :200


## Fitting Naive Bayes to the Training set

In [27]:
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_train, y_train)

GaussianNB(priors=None)

## Predicting the Test set results

In [28]:
y_pred = classifier.predict(X_test)

## Making the Confusion Matrix

In [29]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
## scikit-learn puts the true labels on the rows and the predicted labels on the columns,
## with the labels sorted (0 = negative, 1 = positive), so:
## cm = [TN FP]
##      [FN TP]
TN, FP, FN, TP = cm.ravel()

### metrics
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1_Score = 2 * Precision * Recall / (Precision + Recall)

In [30]:
print(cm)
print( 
      "model's ACCURACY is : " + str("%.3f" %  Accuracy) +    ## "%.3f" % keeps 3 digits after the decimal point
      "\nmodel's PRECISION is : " + str("%.3f" % Precision) + ## we could also use format(x, '.3f')
      "\nmodel's RECALL is : " + str("%.3f" % Recall) +
      "\nmodel's F1 SCORE is : " + str("%.3f" % F1_Score)
)

[[55 42]
 [12 91]]
model's ACCURACY is : 0.730
model's PRECISION is : 0.684
model's RECALL is : 0.883
model's F1 SCORE is : 0.771


###### Out of 200 test reviews, the model correctly predicted 55 negative reviews and 91 positive reviews.
###### With many more reviews (say 1 million) the model would probably be more accurate, but for a training set of 800 reviews this is not bad.
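
Since the same metric arithmetic is repeated for every classifier, it can be factored into a small helper. The sketch below uses a hypothetical function `binary_metrics` and assumes scikit-learn's layout for 0/1 labels, where rows are true labels and columns are predicted labels, i.e. `[[TN, FP], [FN, TP]]`.

```python
def binary_metrics(cm):
    """Compute accuracy, precision, recall and F1 from a 2x2 confusion matrix.

    Assumes the layout [[TN, FP], [FN, TP]] (rows = true, columns = predicted).
    """
    (tn, fp), (fn, tp) = cm
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {'accuracy': accuracy, 'precision': precision, 'recall': recall, 'f1': f1}


# Confusion matrix printed above for the Naive Bayes classifier.
nb_metrics = binary_metrics([[55, 42], [12, 91]])
for name, value in nb_metrics.items():
    print(name + ': ' + format(value, '.3f'))
```

The helper accepts a nested list or a NumPy array, so it could be called directly on the `cm` returned by `confusion_matrix`.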

## Fitting Random Forest to the Training set

In [31]:
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators=10, criterion= "entropy", random_state=0)

classifier.fit(X_train, y_train)
# Predicting the Test set results
y_pred = classifier.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
cm
## rows = true labels, columns = predicted labels (0 = negative, 1 = positive):
## cm = [TN FP]
##      [FN TP]

array([[87, 10],
       [46, 57]], dtype=int64)

In [32]:
TN, FP, FN, TP = cm.ravel()

### metrics
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1_Score = 2 * Precision * Recall / (Precision + Recall)

In [33]:
print(cm)
print( 
      "model's ACCURACY is : " + str("%.3f" %  Accuracy) +    ## "%.3f" % keeps 3 digits after the decimal point
      "\nmodel's PRECISION is : " + str("%.3f" % Precision) + ## we could also use format(x, '.3f')
      "\nmodel's RECALL is : " + str("%.3f" % Recall) +
      "\nmodel's F1 SCORE is : " + str("%.3f" % F1_Score)
)

[[87 10]
 [46 57]]
model's ACCURACY is : 0.720
model's PRECISION is : 0.851
model's RECALL is : 0.553
model's F1 SCORE is : 0.671


Slightly less accurate than Naive Bayes (0.720 versus 0.730).

## Fitting SVM to the Training set

In [34]:
from sklearn.svm import SVC
classifier =SVC(kernel='linear', random_state=0)
classifier.fit(X_train, y_train)
## predict
y_pred = classifier.predict(X_test)
cm = confusion_matrix(y_test, y_pred)

## rows = true labels, columns = predicted labels (0 = negative, 1 = positive)
TN, FP, FN, TP = cm.ravel()  ## true negative, false positive, false negative, true positive

### metrics
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1_Score = 2 * Precision * Recall / (Precision + Recall)

## print
print(cm)
print( 
      "model's ACCURACY is : " + str("%.3f" %  Accuracy) +    ## "%.3f" % keeps 3 digits after the decimal point
      "\nmodel's PRECISION is : " + str("%.3f" % Precision) + ## we could also use format(x, '.3f')
      "\nmodel's RECALL is : " + str("%.3f" % Recall) +
      "\nmodel's F1 SCORE is : " + str("%.3f" % F1_Score)
)

[[74 23]
 [33 70]]
model's ACCURACY is : 0.720
model's PRECISION is : 0.753
model's RECALL is : 0.680
model's F1 SCORE is : 0.714


If we use a Gaussian (RBF) kernel here instead of the linear one, the results are not good.