The purpose of this analysis is to build a prediction model to predict whether a review on the restaurant is positive or negative. To do so, we will work on Restaurant Review dataset. !

    Dataset: Restaurant_Reviews.tsv is a dataset from Kaggle datasets which consists of 1000 reviews on a restaurant. !

    To build a model to predict if review is positive or negative, following steps are performed. !

    Importing Dataset !

    Preprocessing Dataset !

    Vectorization !

    Training and Classification !

    Analysis Conclusion (train and test score) !



In [1]:
import pandas as pd
import numpy as np
import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

[nltk_data] Error loading stopwords: <urlopen error [Errno 11001]
[nltk_data]     getaddrinfo failed>


In [2]:
# Reading the data from tsc file !
dataset = pd.read_csv('Restaurant_Reviews.tsv' , delimiter = '\t')

In [3]:
# Let's do some investigation !

dataset.shape

(1000, 2)

In [4]:
dataset.columns.tolist()

['Review', 'Liked']

In [5]:
# Checks for missing values !

dataset.isnull().any(axis = 0)

Review    False
Liked     False
dtype: bool

In [6]:
corpus = [] # Empty list !

In [7]:
for i in range(0 , 1000) :
    
    # Column : "Review" , row ith !
    review = re.sub('[a-zA-Z]' , ' ' , dataset['Review'][i])
    
    # Convert all case ro lower case !
    review = review.lower()
    
    # Split to array(default delimiter is " ") !
    review = review.split()
    
    # Creating a PorterStemmer object to take main stem of each word. !
    ps = PorterStemmer()
    
    # Loop for dtemming each word in string array at the ith row. !
    review = [ps.stem(word) for word in review
                            if not word in set(stopwords.words('english'))]
    
    # Rejoin all string array elements to create back into a string !
    review = ''.join(review)
    
    # Append all string array elements to array of clear text. !
    corpus.append(review)

In [8]:
# Creating the Bag of Words Model !

from sklearn.feature_extraction.text import TfidfVectorizer

In [9]:
# Create object !

vect = TfidfVectorizer()

In [10]:
# features contain coepus(Dependent variable) !

features = vect.fit_transform(corpus).toarray()

In [11]:
# labels contains answer if review is positive or negative !

labels = dataset.iloc[: , 1].values

In [12]:
# Splitting the dataset into the training set and test set. !
from sklearn.model_selection import train_test_split

In [13]:
# Experiment to "test_size" to get better result !

features_train , features_test , labels_train , labels_test = train_test_split(features , labels , test_size = 0.4 ,
                                                                               random_state = 0)

In [14]:
# Creating the Model !
from sklearn.ensemble import RandomForestClassifier

In [15]:
# Creat object !

model = RandomForestClassifier()

In [16]:
# Fitting the Model !

model.fit(features_train , labels_train)

RandomForestClassifier()

In [17]:
# Predicting the Test set results !

prediction = model.predict(features_test)

In [18]:
# Making the Confusion matrix !
from sklearn.metrics import confusion_matrix

In [19]:
cm = confusion_matrix(labels_test , prediction)

In [20]:
# Calculate the Accuracy Score of the model. !

from sklearn.metrics import accuracy_score

In [21]:
accuracy_score = accuracy_score(labels_test , prediction)

In [22]:
print('The Prediction of our Model is: {}'.format(prediction))
print('Confusion Matrix of our Model is: {}'.format(cm))
print('The Accuracy Score of our Model is: {}%'.format(accuracy_score * 100))

The Prediction of our Model is: [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 0 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
Confusion Matrix of our Model is: [[  4 196]
 [  3 197]]
The Accuracy Score of our Model is: 50.24999999999999%
