# Customer Sentiment Analysis for Restaurant Customer Reviews

### Importing Data

#### First, we load the dataset, a tab-separated (TSV) file, into a pandas DataFrame. Passing `quoting = 3` (the value of `csv.QUOTE_NONE`) tells pandas to treat quotation marks inside the reviews as ordinary characters.

In [4]:
import pandas as pd

df = pd.read_csv('/Users/anaghabhole/Documents/Projects/Sentiment Analysis/Restaurant_Reviews.tsv', delimiter = '\t', quoting = 3)

df


| | Review | Liked |
| --- | --- | --- |
| 0 | Wow... Loved this place. | 1 |
| 1 | Crust is not good. | 0 |
| 2 | Not tasty and the texture was just nasty. | 0 |
| 3 | Stopped by during the late May bank holiday of... | 1 |
| 4 | The selection on the menu was great and so wer... | 1 |
| ... | ... | ... |
| 995 | I think food should have flavor and texture an... | 0 |
| 996 | Appetite instantly gone. | 0 |
| 997 | Overall I was not impressed and would not go b... | 0 |
| 998 | The whole experience was underwhelming, and I ... | 0 |
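
To illustrate what `quoting = 3` changes, here is a minimal sketch that uses Python's standard `csv` module as a stand-in for the pandas parser (pandas documents its `quoting` parameter in terms of the same `csv.QUOTE_*` constants). The sample text and the `parse` helper are hypothetical and are not part of the dataset.

```python
import csv
import io

# Hypothetical TSV content: the second line's review starts with a double quote.
sample = 'Review\tLiked\n"Great" food but slow service\t1\nCrust is not good.\t0\n'

def parse(text, quoting):
    """Parse tab-separated text with the given csv quoting mode."""
    return list(csv.reader(io.StringIO(text), delimiter='\t', quoting=quoting))

rows_default = parse(sample, csv.QUOTE_MINIMAL)  # quoting = 0, the default
rows_literal = parse(sample, csv.QUOTE_NONE)     # quoting = 3

print(csv.QUOTE_NONE)       # 3
print(rows_default[1][0])   # quote characters are interpreted by the parser
print(rows_literal[1][0])   # "Great" food but slow service
```

With `csv.QUOTE_NONE`, the review text reaches the DataFrame exactly as written, which is usually what free-text reviews need.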


### Preprocessing Dataset:
#### Next, we preprocess the text by removing information that carries little signal, such as punctuation, numbers, and stopwords. We also lowercase each review and reduce every remaining word to its stem. We can use the NLTK library for this task.



In [7]:
import nltk
import re
nltk.download('stopwords')

from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

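# Build the stopword set and the stemmer once, not on every iteration
stop_words = set(stopwords.words('english'))
ps = PorterStemmer()

corpus = []
for text in df['Review']:
    review = re.sub('[^a-zA-Z]', ' ', text)  # keep letters only
    review = review.lower().split()           # lowercase, then split on whitespace
    # Drop stopwords and stem the remaining words
    review = [ps.stem(word) for word in review if word not in stop_words]
    corpus.append(' '.join(review))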

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/anaghabhole/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
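
As a rough illustration of the cleaning steps, the sketch below applies the same regex, lowercasing, and stopword filtering to one review from the dataset. It is a simplified stand-in that assumes a tiny hand-written stopword set (`STOP_WORDS`, hypothetical) instead of NLTK's full English list, and it skips stemming so that it runs without NLTK.

```python
import re

# Hypothetical stand-in for a small subset of NLTK's English stopword list;
# the real list is much longer, and it also contains "not".
STOP_WORDS = {"is", "the", "and", "was", "just", "not"}

def clean(review, stop_words=STOP_WORDS):
    """Keep letters only, lowercase, split on whitespace, drop stopwords."""
    letters_only = re.sub('[^a-zA-Z]', ' ', review)
    tokens = letters_only.lower().split()
    return ' '.join(t for t in tokens if t not in stop_words)

print(clean("Crust is not good."))                        # crust good
print(clean("Crust is not good.", STOP_WORDS - {"not"}))  # crust not good
```

The first call shows a side effect worth knowing about: because "not" is a stopword, a negative review can end up looking positive. Keeping negation words, as in the second call, is one possible refinement.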


### Vectorization:

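#### To convert the preprocessed text into numerical data, we vectorize it: each review becomes a row of word counts (a bag-of-words representation). We can use the CountVectorizer class from the scikit-learn library for this task; `max_features = 1500` keeps only the 1,500 most frequent terms in the corpus.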

In [8]:
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(max_features = 1500)
X = cv.fit_transform(corpus).toarray()
y = df['Liked'].values  # sentiment labels: 1 = liked, 0 = not liked
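
To make the bag-of-words idea concrete, here is a minimal pure-Python sketch of the counting step. It is not scikit-learn's implementation: `CountVectorizer` also applies its own tokenization (by default it ignores single-character tokens) and the `max_features` cap. The three example documents are hypothetical.

```python
from collections import Counter

# Hypothetical, already-preprocessed reviews (stemmed, stopwords removed)
docs = ["wow love place", "crust good", "love crust love place"]

# Vocabulary: every distinct token, sorted alphabetically
vocabulary = sorted({token for doc in docs for token in doc.split()})

# One row per review, one column per vocabulary term
matrix = []
for doc in docs:
    counts = Counter(doc.split())
    matrix.append([counts[term] for term in vocabulary])

print(vocabulary)  # ['crust', 'good', 'love', 'place', 'wow']
for row in matrix:
    print(row)
# [0, 0, 1, 1, 1]
# [1, 1, 0, 0, 0]
# [1, 0, 2, 1, 0]
```

Each row plays the role of one row of `X`: word order is discarded and only the per-term counts remain.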

### Training and Classification:

#### We can now train our predictive models on the preprocessed and vectorized dataset. We hold out 20% of the reviews as a test set, then fit Multinomial Naive Bayes, Bernoulli Naive Bayes, and Logistic Regression classifiers on the remaining 80%.

In [9]:
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

# Multinomial Naive Bayes
classifier_mnb = MultinomialNB()
classifier_mnb.fit(X_train, y_train)

# Bernoulli Naive Bayes
classifier_bnb = BernoulliNB()
classifier_bnb.fit(X_train, y_train)

# Logistic Regression
classifier_lr = LogisticRegression(random_state = 0)
classifier_lr.fit(X_train, y_train)

# Predicting Test Set Results
y_pred_mnb = classifier_mnb.predict(X_test)
y_pred_bnb = classifier_bnb.predict(X_test)
y_pred_lr = classifier_lr.predict(X_test)

# Creating Confusion Matrix
cm_mnb = confusion_matrix(y_test, y_pred_mnb)
cm_bnb = confusion_matrix(y_test, y_pred_bnb)
cm_lr = confusion_matrix(y_test, y_pred_lr)

### Analysis Conclusion:
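#### We compare the three models using their confusion matrices on the 200 held-out reviews, computing accuracy as the share of correct predictions (the sum of the matrix diagonal divided by the total). By that measure, Bernoulli Naive Bayes (0.77) is the best of the three on this split, although Multinomial Naive Bayes (0.765) differs from it by a single review; Logistic Regression trails at 0.71.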

In [10]:
# Print each model's confusion matrix and accuracy
results = [
    ("Multinomial Naive Bayes", cm_mnb),
    ("Bernoulli Naive Bayes", cm_bnb),
    ("Logistic Regression", cm_lr),
]
for name, cm in results:
    print(name + " Confusion Matrix:")
    print(cm)
    accuracy = cm.trace() / cm.sum()  # correct predictions (diagonal) / all predictions
    print("Accuracy:", accuracy)

Multinomial Naive Bayes Confusion Matrix:
[[72 25]
 [22 81]]
Accuracy: 0.765
Bernoulli Naive Bayes Confusion Matrix:
[[73 24]
 [22 81]]
Accuracy: 0.77
Logistic Regression Confusion Matrix:
[[76 21]
 [37 66]]
Accuracy: 0.71
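
Accuracy is only one way to read a confusion matrix. As a small sketch, the snippet below recomputes accuracy and also derives precision and recall for the positive class (`Liked = 1`) from the three matrices printed above. The `summarize` helper is a hypothetical name introduced here for illustration.

```python
# Confusion matrices copied from the output above.
# scikit-learn layout: rows are true labels, columns are predicted labels,
# so each matrix reads [[TN, FP], [FN, TP]] for labels 0 and 1.
matrices = {
    "Multinomial Naive Bayes": [[72, 25], [22, 81]],
    "Bernoulli Naive Bayes": [[73, 24], [22, 81]],
    "Logistic Regression": [[76, 21], [37, 66]],
}

def summarize(cm):
    """Return accuracy, precision, and recall for the positive class."""
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }

for name, cm in matrices.items():
    scores = summarize(cm)
    print(name, {k: round(v, 3) for k, v in scores.items()})
```

Read this way, Logistic Regression's lower accuracy comes mostly from recall: it labels 37 of the 103 positive test reviews as negative, compared with 22 for either Naive Bayes model.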
