<a href="https://colab.research.google.com/github/anavchug/nlp-restaurant-reviews/blob/main/NLP%20On%20Restaurant%20Reviews.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Natural Language Processing

## Importing the libraries

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

## Importing the dataset

In [None]:
dataset = pd.read_csv('Restaurant_Reviews.tsv', delimiter = '\t', quoting = 3)
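
The `quoting = 3` argument is easy to misread. As a small illustration that uses only the standard `csv` module (the sample line is made up, not taken from the dataset), the value 3 corresponds to `csv.QUOTE_NONE`, which tells the parser to treat double quotes as ordinary characters rather than as field delimiters:

```python
import csv
import io

# Hypothetical tab-separated sample; not taken from Restaurant_Reviews.tsv.
sample = 'Review\tLiked\n"Great" food, really\t1\n'

# csv.QUOTE_NONE is the integer 3, the value passed as quoting = 3 above.
quote_none_value = int(csv.QUOTE_NONE)

# With QUOTE_NONE the double quotes stay in the text as ordinary characters.
rows = list(csv.reader(io.StringIO(sample), delimiter='\t', quoting=csv.QUOTE_NONE))
print(quote_none_value)
print(rows[1])
```

Reviews often contain quotation marks, so leaving them unparsed avoids fields being merged or split unexpectedly.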

## Cleaning the texts

In [None]:
import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
corpus = []  # will hold the cleaned version of every review in the dataset

# Build the stemmer and the stopword set once, outside the loop.
ps = PorterStemmer()
all_stopwords = stopwords.words('english')
all_stopwords.remove('not')  # keep 'not': it reverses the meaning of a review ('not good')
all_stopwords = set(all_stopwords)  # membership tests on a set are faster than on a list

for i in range(len(dataset)):  # the dataset holds 1000 reviews
  # Step 1: remove punctuation. re.sub replaces every match of a pattern with the given
  # string. Inside [...], a leading ^ negates the class, so '[^a-zA-Z]' matches any
  # character that is not a letter; each match is replaced by a space.
  # dataset['Review'][i] is the text of review i in the 'Review' column.
  review = re.sub('[^a-zA-Z]', ' ', dataset['Review'][i])

  # Step 2: convert all letters to lowercase.
  review = review.lower()

  # Prep for step 3: split the review into words so that each one can be stemmed.
  review = review.split()

  # Step 3: stemming, e.g. 'loved' becomes 'love'. Stopwords (like 'a', 'the') are
  # skipped, so they are never stemmed and never reach the future sparse matrix.
  review = [ps.stem(word) for word in review if word not in all_stopwords]

  # Join the words back into a single space-separated string.
  review = ' '.join(review)
  corpus.append(review)  # add the cleaned review to the corpus


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [None]:
print(corpus)
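
To see what each cleaning step does, here is a minimal sketch on one made-up sentence. It uses only the standard library, so the stopword set below is a small hand-written stand-in for NLTK's English list and stemming is left out; both are assumptions made for illustration.

```python
import re

# Hypothetical stand-in for NLTK's English stopword list, with 'not' deliberately absent.
demo_stopwords = {'i', 'this', 'the', 'a', 'was', 'so'}

def clean_without_stemming(text):
    """Apply steps 1 and 2 plus stopword removal; stemming is omitted."""
    letters_only = re.sub('[^a-zA-Z]', ' ', text)  # step 1: non-letters become spaces
    words = letters_only.lower().split()           # step 2, then split on whitespace
    kept = [w for w in words if w not in demo_stopwords]
    return ' '.join(kept)

cleaned = clean_without_stemming("Wow... This place was NOT good!")
print(cleaned)  # wow place not good
```

Because 'not' is kept, the cleaned text still carries the negation that separates "not good" from "good".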

## Creating the Bag of Words model

In [None]:
# The bag-of-words model is a sparse matrix: one row per review and one column per
# distinct word found across all reviews. Each cell holds the number of times the
# column's word occurs in the row's review (0 if the word is absent).
# Creating these word columns from the reviews is called tokenisation.
from sklearn.feature_extraction.text import CountVectorizer
# max_features caps the number of columns, i.e. the number of words kept in the matrix.
# After cleaning, the corpus still contains words that are not stopwords but do not help
# predict whether a review is positive or negative (a name like 'steve', for example).
# Keeping only the most frequent words drops the rarest of them.
cv = CountVectorizer(max_features = 1500)  # keep the 1500 most frequent words
# fit learns the vocabulary, transform counts each vocabulary word in each review, and
# toarray converts the sparse result into a dense 2D NumPy array.
X = cv.fit_transform(corpus).toarray()
y = dataset.iloc[:, -1].values  # labels: the last column of the dataset

In [None]:
len(X[0])
# this will give us the no. of elements in the first row , therefore the number of columns of X

1500
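
The counting behind the bag-of-words matrix can be sketched in plain Python. This is an illustration of the idea, not `CountVectorizer`'s implementation; the three-review corpus and the helper name `bag_of_words` are made up, and breaking ties between equally frequent words alphabetically is an assumption of this sketch.

```python
from collections import Counter

def bag_of_words(corpus, max_features):
    """Return (vocabulary, matrix) keeping the max_features most frequent words."""
    totals = Counter(word for doc in corpus for word in doc.split())
    # Rank by descending frequency, then alphabetically to break ties.
    ranked = sorted(totals, key=lambda w: (-totals[w], w))
    vocabulary = sorted(ranked[:max_features])  # columns in alphabetical order
    matrix = []
    for doc in corpus:
        counts = Counter(doc.split())
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix

demo_corpus = ['love place love food', 'not love place', 'food bad steve']
vocabulary, matrix = bag_of_words(demo_corpus, max_features=3)
print(vocabulary)  # ['food', 'love', 'place']
print(matrix)      # [[1, 2, 1], [0, 1, 1], [1, 0, 0]]
```

The first row contains a 2 because 'love' occurs twice in that review, and the rare word 'steve' gets no column at all.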

## Splitting the dataset into the Training set and Test set

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

## Training the Naive Bayes model on the Training set

In [None]:
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_train, y_train)

GaussianNB()

## Training the Logistic Regression model on the Training set

In [None]:
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state = 0)
classifier.fit(X_train, y_train)

LogisticRegression(random_state=0)

## Training the K-NN model on the Training set

In [None]:
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2)
classifier.fit(X_train, y_train)

KNeighborsClassifier()

## Training the SVM model on the Training set

In [None]:
from sklearn.svm import SVC
classifier = SVC(kernel = 'linear', random_state = 0)
classifier.fit(X_train, y_train)

SVC(kernel='linear', random_state=0)

## Training the Kernel SVM model on the Training set

In [None]:
from sklearn.svm import SVC
classifier = SVC(kernel = 'rbf', random_state = 0)
classifier.fit(X_train, y_train)

SVC(random_state=0)

## Training the Decision Tree Classification model on the Training set

In [None]:
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier(criterion = 'entropy', random_state = 0)
classifier.fit(X_train, y_train)

DecisionTreeClassifier(criterion='entropy', random_state=0)

## Training the Random Forest Classification model on the Training set

In [None]:
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators = 10, criterion = 'entropy', random_state = 0)
classifier.fit(X_train, y_train)

RandomForestClassifier(criterion='entropy', n_estimators=10, random_state=0)
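
Each training cell above assigns its model to the same variable, `classifier`, so the later cells evaluate whichever model was trained last. One way to compare several models in a single run is to keep them in a dictionary. The sketch below is hedged: it uses small random count data as a stand-in for `X` and `y` so that it runs on its own, and the names `models` and `scores` are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the bag-of-words matrix and labels (not the real data).
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 3, size=(200, 20))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 2).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

models = {
    'Naive Bayes': GaussianNB(),
    'Logistic Regression': LogisticRegression(random_state=0),
    'Decision Tree': DecisionTreeClassifier(criterion='entropy', random_state=0),
}

# Fit every model on the same split and record its test accuracy.
scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = accuracy_score(y_te, model.predict(X_te))

for name, score in scores.items():
    print(f'{name}: {score:.3f}')
```

In the notebook itself, the same loop could run over `X_train`, `X_test`, `y_train` and `y_test` with any of the classifiers trained above.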

## Predicting the Test set results

In [None]:
y_pred = classifier.predict(X_test)
print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1))
# this displays the predictions (left column) next to the real results (right column)

## Making the Confusion Matrix

In [None]:
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score
cm = confusion_matrix(y_test, y_pred)
print(cm)
print(accuracy_score(y_test, y_pred))
print(precision_score(y_test, y_pred))
print(recall_score(y_test, y_pred))
print(f1_score(y_test, y_pred))
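# Results with Naive Bayes as `classifier` (each training cell above reassigns
# `classifier`, so these numbers apply only when the Naive Bayes cell ran last):
# 55 negative reviews correctly predicted as negative (true negatives),
# 91 positive reviews correctly predicted as positive (true positives),
# 42 negative reviews incorrectly predicted as positive (false positives),
# 12 positive reviews incorrectly predicted as negative (false negatives).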
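
The four scores printed above follow directly from the confusion-matrix counts. As a worked check with the Naive Bayes counts from the comments, reading the 42 as false positives and the 12 as false negatives (the variable names are illustrative):

```python
# Counts from the Naive Bayes confusion matrix described in the comments above.
tn, fp, fn, tp = 55, 42, 12, 91

accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all predictions that are correct
precision = tp / (tp + fp)                  # share of predicted positives that are right
recall = tp / (tp + fn)                     # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 3), round(precision, 3), round(recall, 3), round(f1, 3))
# 0.73 0.684 0.883 0.771
```

Recall is noticeably higher than precision here: the model finds most positive reviews but also labels many negative ones as positive.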


## Predicting if a single review is positive or negative

### Positive review
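Use our model to predict whether the following review is positive or negative:

"I love this restaurant so much"

Solution: Apply the same text preprocessing as before, this time to a single review. Then convert it with the already-fitted `cv` using `transform` (not `fit_transform`), so that its columns match those of the training matrix.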

In [None]:
new_review = 'I love this restaurant so much'
# Same cleaning steps as for the training corpus
new_review = re.sub('[^a-zA-Z]', ' ', new_review)
new_review = new_review.lower()
new_review = new_review.split()
ps = PorterStemmer()
all_stopwords = stopwords.words('english')
all_stopwords.remove('not')
new_review = [ps.stem(word) for word in new_review if word not in set(all_stopwords)]
new_review = ' '.join(new_review)
new_corpus = [new_review]
# transform (not fit_transform) reuses the vocabulary learned from the training corpus
new_X_test = cv.transform(new_corpus).toarray()
new_y_pred = classifier.predict(new_X_test)
print(new_y_pred)  # 1 = positive, 0 = negative

The review was correctly predicted as positive by our model.

### Negative review
Use our model to predict whether the following review is positive or negative:

"I hate this restaurant so much"

Solution: Apply the same text preprocessing and the same `cv.transform` step to this review.

In [None]:
new_review = 'I hate this restaurant so much'
# Same cleaning steps as for the training corpus
new_review = re.sub('[^a-zA-Z]', ' ', new_review)
new_review = new_review.lower()
new_review = new_review.split()
ps = PorterStemmer()
all_stopwords = stopwords.words('english')
all_stopwords.remove('not')
new_review = [ps.stem(word) for word in new_review if word not in set(all_stopwords)]
new_review = ' '.join(new_review)
new_corpus = [new_review]
# transform (not fit_transform) reuses the vocabulary learned from the training corpus
new_X_test = cv.transform(new_corpus).toarray()
new_y_pred = classifier.predict(new_X_test)
print(new_y_pred)  # 1 = positive, 0 = negative

[0]
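
The two cells above repeat the same steps, so they can be folded into one helper. The following is a minimal, self-contained sketch: it trains a tiny `MultinomialNB` model on four made-up reviews, and it uses a hand-written stopword set without stemming in place of the NLTK pipeline. The function names, the toy data, and the choice of `MultinomialNB` are all assumptions for illustration; in the notebook you would pass the fitted `cv` and `classifier` instead.

```python
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical stand-in for NLTK's stopword list ('not' is deliberately left out).
demo_stopwords = {'i', 'this', 'so', 'the', 'a'}

def preprocess(text):
    """Keep letters only, lowercase, drop stopwords (no stemming in this sketch)."""
    words = re.sub('[^a-zA-Z]', ' ', text).lower().split()
    return ' '.join(w for w in words if w not in demo_stopwords)

def predict_sentiment(text, vectorizer, model):
    """Return the predicted label for one raw review (1 = positive, 0 = negative)."""
    features = vectorizer.transform([preprocess(text)]).toarray()
    return int(model.predict(features)[0])

# Toy training data, not the restaurant dataset.
toy_reviews = ['love place', 'great food love', 'hate place', 'bad food hate']
toy_labels = [1, 1, 0, 0]

toy_cv = CountVectorizer()
toy_model = MultinomialNB().fit(toy_cv.fit_transform(toy_reviews).toarray(), toy_labels)

positive = predict_sentiment('I love this restaurant so much', toy_cv, toy_model)
negative = predict_sentiment('I hate this restaurant so much', toy_cv, toy_model)
print(positive, negative)
```

Words the vectorizer never saw during fitting, such as 'restaurant' here, are simply ignored by `transform`, so the prediction rests on the known words only.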
