

    The purpose of this analysis is to build a prediction model to predict whether a review on the restaurant is positive or negative. To do so, we will work on Restaurant Review dataset. !

    Dataset: Restaurant_Reviews.tsv is a dataset from Kaggle datasets which consists of 1000 reviews on a restaurant. !
    
    To build a model to predict if review is positive or negative, following steps are performed. !

    Importing Dataset !
 
    Preprocessing Dataset !

    Vectorization !

    Training and Classification !

    Analysis Conclusion (train and test score) !



In [1]:
import pandas as pd
import numpy as np
import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

[nltk_data] Error loading stopwords: <urlopen error [Errno 11001]
[nltk_data]     getaddrinfo failed>


In [2]:
# Reading the data from tsv file !

dataset = pd.read_csv('Restaurant_Reviews.tsv', delimiter = '\t')

In [3]:
# Let's do some investigation !
dataset.shape

(1000, 2)

In [4]:
dataset.columns.tolist()

['Review', 'Liked']

In [5]:
# Let's check first 10 reciews !

dataset['Review'].head(10)

0                             Wow... Loved this place.
1                                   Crust is not good.
2            Not tasty and the texture was just nasty.
3    Stopped by during the late May bank holiday of...
4    The selection on the menu was great and so wer...
5       Now I am getting angry and I want my damn pho.
6                Honeslty it didn't taste THAT fresh.)
7    The potatoes were like rubber and you could te...
8                            The fries were great too.
9                                       A great touch.
Name: Review, dtype: object

In [6]:
# Checks missing value !

dataset.isnull().any(axis = 0)

Review    False
Liked     False
dtype: bool

In [7]:
corpus = []

In [8]:
for i in range(0 , 1000):
    
    # Column : "Review" , row ith !
    review = re.sub('[a-zA-z]' , ' ' , dataset['Review'][i])
    
    # Convert all case to lower case !
    review = review.lower()
    
    # Split to array(default delimeter is " ") !
    review = review.split()
    
    # Creating a PorterStemmer object to take main stem of each word !
    ps = PorterStemmer()
    
    # Loop for stemming each word in string array at ith raw !
    review = [ps.stem(word) for word in review
                            if not word in set(stopwords.words('english'))]
    
    # Rejoin all string array elements to create back into a string !
    review = ' '.join(review)
    
    # Append all string array elements !
    # Array of clean text !
    corpus.append(review)
    
    

In [9]:
# Creting the Bag of Words Model !

from sklearn.feature_extraction.text import CountVectorizer

In [10]:
# Create object !

vect = CountVectorizer(max_features = 1500)

In [11]:
# features contain corpus(Dependent variable) !

features = vect.fit_transform(corpus).toarray()

In [12]:
# labels contains answer if review is positive or negative !

labels = dataset.iloc[: , 1].values

In [13]:
# Splitting the dataset into the training set and test set !

from sklearn.model_selection import train_test_split

In [14]:
# Experiment to "test_size" to get better result ! 
features_train , features_test , labels_train , labels_test = train_test_split(features , labels , test_size = 0.4 , 
                                                                               random_state = 0)

In [15]:
# Creating the Model !

from sklearn.linear_model import LogisticRegression

In [16]:
# Create object !

model = LogisticRegression()

In [17]:
# Fitting the model !

model.fit(features_train , labels_train)

LogisticRegression()

In [18]:
# Predicting the Test set results !

prediction = model.predict(features_test)

In [19]:
# Making the Confusion Matrix !

from sklearn.metrics import confusion_matrix

In [20]:
cm = confusion_matrix(labels_test , prediction)

In [21]:
#Calculate the Accuracy Score of the Model !

from sklearn.metrics import accuracy_score

In [22]:
accuracy_score = accuracy_score(labels_test , prediction)

In [23]:
print('Prediction is: {}'.format(prediction))
print('Confusion Matrix is: {}'.format(cm))
print('Accuracy Score is: {}%'.format(accuracy_score * 100))

Prediction is: [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 0 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
Confusion Matrix is: [[  4 196]
 [  3 197]]
Accuracy Score is: 50.24999999999999%
