# Restaurant Reviews Analyser

<h2>Natural Language Processing</h2>

<h3>Problem Statement and Data Description</h3>
<p>Have fun with this dataset and try to predict whether a review is positive or negative. The dataset has the following fields:</p>
<ol>
<li>Review</li>
<li>Liked</li>
</ol>

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline

In [5]:
# importing the dataset
Data=pd.read_csv('Restaurant_Reviews.tsv',delimiter='\t', quoting=3)

In [70]:
Data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 2 columns):
Review    1000 non-null object
Liked     1000 non-null int64
dtypes: int64(1), object(1)
memory usage: 15.7+ KB


In [55]:
Data.head()

                                              Review  Liked
0                           Wow... Loved this place.      1
1                                 Crust is not good.      0
2          Not tasty and the texture was just nasty.      0
3  Stopped by during the late May bank holiday of...      1
4  The selection on the menu was great and so wer...      1


In [69]:
# Now let's see how many positive and negative reviews there are
# Remember: 1 stands for a positive review and 0 for a negative one
Data['Liked'].value_counts()

1    500
0    500
Name: Liked, dtype: int64

<h3>Now let's process the textual data</h3>

In [None]:
# We use the re (regular expressions) module and the nltk library to preprocess the textual data
import re
import nltk
# download the stopword lists from nltk
nltk.download('stopwords')

In [10]:
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer # reduces each word to its stem

In [33]:
corpus = []
ps = PorterStemmer()                          # create the stemmer once
stop_words = set(stopwords.words('english'))  # build the stopword set once, not once per word
# loop through each review
for i in range(len(Data)):
    review = re.sub('[^a-zA-Z]', ' ', Data['Review'][i])  # replace every non-letter character with a space
    review = review.lower().split()                       # lowercase, then split into words
    review = [ps.stem(word) for word in review if word not in stop_words]  # drop stopwords, stem the rest
    corpus.append(' '.join(review))                       # join the words back into one string
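
One caveat with the cleaning loop above: NLTK's English stopword list contains negations such as "not", so a review like "Crust is not good." loses the word that carries its sentiment. Below is a minimal sketch of one way to keep them. It assumes a small hand-written stopword set as a stand-in for `stopwords.words('english')`; `STOPWORDS`, `NEGATIONS` and `clean_review` are illustrative names that do not appear in the notebook, and stemming is left out for brevity.

```python
import re

# Hypothetical, abbreviated stand-in for stopwords.words('english')
STOPWORDS = {"is", "the", "and", "was", "this", "not", "no", "just"}
# Negation words we choose to keep (an assumption, tune as needed)
NEGATIONS = {"not", "no"}

def clean_review(text, stop_words):
    """Keep letters only, lowercase, and drop the given stopwords."""
    words = re.sub("[^a-zA-Z]", " ", text).lower().split()
    return " ".join(w for w in words if w not in stop_words)

# Same review cleaned with the full set and with negations preserved
with_default = clean_review("Crust is not good.", STOPWORDS)
with_negation = clean_review("Crust is not good.", STOPWORDS - NEGATIONS)
```

With the full set the review collapses to `crust good`; with negations removed from the stopword set it stays `crust not good`. Whether this helps the classifier here would need to be measured.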

Creating the Bag of Words model

In [38]:
from sklearn.feature_extraction.text import CountVectorizer
cv=CountVectorizer(max_features=1500)
X=cv.fit_transform(corpus).toarray()
Y=Data.iloc[:,1].values
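
To make the bag-of-words step above concrete, here is a hand-rolled sketch of the idea on three toy reviews. It is not scikit-learn's implementation: the toy `docs` list is made up for illustration, and the `max_features=1500` cap (which keeps only the most frequent terms across the corpus) is omitted.

```python
from collections import Counter

# Toy, already-cleaned reviews (hypothetical stand-ins for `corpus`)
docs = ["love place", "crust not good", "good place good food"]

# Vocabulary: every distinct token, in sorted order
vocabulary = sorted({token for doc in docs for token in doc.split()})

# One row per review, one column per vocabulary term, each cell a word count
matrix = []
for doc in docs:
    counts = Counter(doc.split())
    matrix.append([counts[term] for term in vocabulary])
```

Each row of `matrix` plays the role of a row of `X`: the third review has a 2 in the `good` column because the word occurs twice, and word order is discarded entirely.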

Splitting the dataset into the Training set and Test set

In [43]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(X, Y, test_size=0.20, random_state=0)

Building a model

In [47]:
#Fitting Naive Bayes to the Training set
from sklearn.naive_bayes import GaussianNB
gb=GaussianNB()
gb.fit(x_train,y_train)

# Predicting the Test set results
y_pred=gb.predict(x_test)

Evaluating the model

In [53]:

from sklearn.metrics import confusion_matrix ,accuracy_score
#Making the Confusion Matrix
cm=confusion_matrix(y_test,y_pred)

#getting accuracy score
acc=accuracy_score(y_test,y_pred)

In [49]:
cm #confusion matrix

array([[55, 42],
       [12, 91]])

In [54]:
acc #accuracy score

0.72999999999999998
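
The accuracy can be checked by hand from the confusion matrix, which also shows where the errors fall. The sketch below relies on scikit-learn's convention that rows are true labels and columns are predicted labels; the variable names are illustrative.

```python
# Confusion matrix from the output above: rows = true label, columns = predicted label
cm = [[55, 42],
      [12, 91]]

tn, fp = cm[0]  # true negatives, false positives
fn, tp = cm[1]  # false negatives, true positives

accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all reviews classified correctly
precision = tp / (tp + fp)                  # share of predicted positives that are positive
recall = tp / (tp + fn)                     # share of actual positives that were found
```

This gives an accuracy of 0.73, a precision of about 0.684 and a recall of about 0.883. Most of the errors are the 42 negative reviews predicted as positive.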

Storing the trained object

In [72]:
import pickle

In [73]:
with open('review_cls.pkl','wb') as f: # the context manager closes the file after writing
    pickle.dump(gb,f)
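
Note that the pickle above holds only the classifier. To score a new review, the fitted `CountVectorizer` is needed as well, because it defines the columns the model was trained on. Below is a minimal sketch that assumes both objects are saved together in one dictionary. The dictionary keys and the toy data are illustrative, and the round trip goes through `pickle.dumps` instead of a file to keep the example self-contained.

```python
import pickle
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import GaussianNB

# Toy, already-cleaned reviews and labels (hypothetical stand-ins for `corpus` and `Y`)
toy_corpus = ["love place", "crust not good", "great food", "not tasti"]
toy_labels = [1, 0, 1, 0]

cv = CountVectorizer()
X = cv.fit_transform(toy_corpus).toarray()  # GaussianNB needs a dense array
gb = GaussianNB().fit(X, toy_labels)

# Serialize the vectorizer and the model together, then restore them
blob = pickle.dumps({"vectorizer": cv, "model": gb})
bundle = pickle.loads(blob)

# A new review must go through the same fitted vectorizer before prediction
new_X = bundle["vectorizer"].transform(["great place"]).toarray()
prediction = bundle["model"].predict(new_X)
```

A new review should also go through the same cleaning steps (letters only, lowercase, stopword removal, stemming) before it reaches the vectorizer.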