### Demonstration: Sentiment Analysis

This notebook is for training purposes. It has been simplified to highlight the core steps in training an ML model: preparing the data, encoding it, splitting it, fitting a model, and measuring the result.

### Step 1: Load, clean, and prepare the data

In [1]:
import pandas as pd

df = pd.read_csv('http://neueda.conygre.com/pydata/ml_fc/Restaurant_Reviews.tsv', sep='\t')

print(df.shape)
df.head()

(1000, 2)


| | Review | Liked |
|---|---|---|
| 0 | Wow... Loved this place. | 1 |
| 1 | Crust is not good. | 0 |
| 2 | Not tasty and the texture was just nasty. | 0 |
| 3 | Stopped by during the late May bank holiday of... | 1 |
| 4 | The selection on the menu was great and so wer... | 1 |


Note: the preparation function below is deliberately minimal. It only replaces non-letter characters with spaces and lower-cases the text. We could include further steps such as:
* stemming / lemmatization
* removing stop words
* identification of the most semantically valuable words
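
As a rough illustration of one of these extra steps, the sketch below removes stop words from an already prepared review. The `STOP_WORDS` set and the `remove_stop_words` helper are hypothetical and deliberately tiny; a real project would more likely use a curated list from an NLP library.

```python
# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch).
# "not" is intentionally left out: dropping it would erase negation,
# which matters a great deal for sentiment.
STOP_WORDS = {"a", "an", "and", "the", "is", "was", "were", "this", "just", "so", "on"}

def remove_stop_words(text):
    """Drop stop words from an already lower-cased, space-separated string."""
    kept = [word for word in text.split() if word not in STOP_WORDS]
    return " ".join(kept)

cleaned = remove_stop_words("not tasty and the texture was just nasty")
print(cleaned)  # not tasty texture nasty
```

Keeping "not" in the text is a design choice worth noting: generic stop-word lists often include it, which can turn "not good" into "good".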

In [2]:
import re

def prepare_review(review):
    
    review = re.sub('[^a-zA-Z]', ' ', review)    
    return review.lower()


In [3]:
prepare_review('Not tasty and the texture was just nasty.')

'not tasty and the texture was just nasty '

In [4]:
corpus = df['Review'].apply(prepare_review).tolist()

corpus[:5]

['wow    loved this place ',
 'crust is not good ',
 'not tasty and the texture was just nasty ',
 'stopped by during the late may bank holiday off rick steve recommendation and loved it ',
 'the selection on the menu was great and so were the prices ']
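
The prepared strings above keep runs of spaces and a trailing space wherever punctuation was removed. That is harmless for a word-count encoding, but if tidier strings are wanted, a variant along these lines could be used. The name `prepare_review_compact` is hypothetical and not part of the original notebook.

```python
import re

def prepare_review_compact(review):
    """Like prepare_review, but collapses each run of non-letters into one space."""
    # '+' matches a whole run of non-letter characters at once.
    review = re.sub('[^a-zA-Z]+', ' ', review)
    # strip() removes the leading/trailing space left by punctuation at the ends.
    return review.strip().lower()

print(prepare_review_compact('Wow... Loved this place.'))  # wow loved this place
```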

### Encode the reviews as a "Bag of Words" (the CountVectorizer utility does this for us)

In [5]:
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(max_features=1500)
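
To make the "Bag of Words" idea concrete, here is a minimal pure-Python sketch of roughly what the encoding produces: one column per vocabulary word, one row per document, and each cell holding a word count. The `bag_of_words` helper is hypothetical. It imitates CountVectorizer's default behaviour of lower-casing and keeping tokens of two or more characters, but it ignores options such as `max_features` and is not the library's implementation.

```python
import re
from collections import Counter

def bag_of_words(docs):
    """Return (vocabulary, count_matrix) for a list of strings."""
    # Tokens of two or more word characters, similar to CountVectorizer's default.
    tokenised = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in docs]
    # Sorted vocabulary: one column per distinct word across all documents.
    vocabulary = sorted({token for doc in tokenised for token in doc})
    matrix = []
    for doc in tokenised:
        counts = Counter(doc)
        # Counter returns 0 for words that do not occur in this document.
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix

vocab, counts = bag_of_words(["crust is not good", "good food good service"])
print(vocab)   # ['crust', 'food', 'good', 'is', 'not', 'service']
print(counts)  # [[1, 0, 1, 1, 1, 0], [0, 1, 2, 0, 0, 1]]
```

Note that word order is lost: the encoding only records how often each word appears, which is why it is called a "bag".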

### Step 2: Split into independent and dependent variables

In [6]:
X = cv.fit_transform(corpus).toarray()  # Independent variables

y = df['Liked'].values  # Dependent variable (1 = liked, 0 = not liked)

### Step 3: Split our data into training and test sets 

In [7]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
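
For intuition, the idea behind a seeded hold-out split can be sketched in plain Python as below. The `simple_train_test_split` helper is hypothetical; it is not scikit-learn's algorithm and will not reproduce the same partition as `train_test_split`. It only shows the principle: shuffle the row indices reproducibly, then reserve a fraction for testing.

```python
import random

def simple_train_test_split(rows, labels, test_size=0.2, seed=0):
    """Shuffle row indices with a fixed seed, then hold out a fraction for testing."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # fixed seed: same split on every run
    n_test = round(len(rows) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    # Rows and labels are selected with the same indices so they stay aligned.
    rows_train = [rows[i] for i in train_idx]
    rows_test = [rows[i] for i in test_idx]
    labels_train = [labels[i] for i in train_idx]
    labels_test = [labels[i] for i in test_idx]
    return rows_train, rows_test, labels_train, labels_test

reviews = [f"review {i}" for i in range(10)]
liked = [i % 2 for i in range(10)]
r_train, r_test, l_train, l_test = simple_train_test_split(reviews, liked)
print(len(r_train), len(r_test))  # 8 2
```

With the notebook's 1000 reviews and `test_size=0.2`, the same principle leaves 800 rows for training and 200 for testing.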

### Step 4: Choose and train the model

In [8]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()

model.fit(X_train, y_train)

LogisticRegression()

#### Inference / prediction on a new review

In [9]:
#new_review = "Food was Great!"
#new_review = "Service was awful, tasted revolting!"
#new_review = "Could be better"
new_review = "Food was not great"

prepared_review = prepare_review(new_review)

feature_set = cv.transform( [prepared_review] )

result = model.predict(feature_set.toarray())

print(result)
print()

if result[0] == 1:
    print('Glad you enjoyed your visit!')
else:
    print('Apologies')

[0]

Apologies


### Step 5: Validate / Measure the model

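Because this is a classification problem, we use a confusion matrix. Its rows are the true labels and its columns the predicted labels, so the diagonal holds the correct predictions and accuracy is the diagonal sum divided by the total. Further metrics, such as precision and recall, can be derived from the same matrix.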


In [10]:
from sklearn.metrics import confusion_matrix

cmatrix = confusion_matrix(y_test, model.predict(X_test))

print(cmatrix)

print('Accuracy: ', (cmatrix[0][0]  + cmatrix[1][1]) / cmatrix.sum())

[[82 15]
 [20 83]]
Accuracy:  0.825
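
As an example of the further metrics mentioned above, the sketch below derives precision, recall, and F1 for the "liked" class directly from the printed matrix. The values are copied by hand from the output above, and the layout assumption is scikit-learn's convention of true labels in rows and predicted labels in columns.

```python
# Confusion matrix copied from the output above.
# Rows are true labels (0, 1); columns are predicted labels (0, 1).
cmatrix = [[82, 15],
           [20, 83]]

(tn, fp), (fn, tp) = cmatrix

accuracy = (tn + tp) / (tn + fp + fn + tp)
precision = tp / (tp + fp)  # of the reviews predicted "liked", how many were
recall = tp / (tp + fn)     # of the reviews actually "liked", how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
# accuracy=0.825 precision=0.847 recall=0.806 f1=0.826
```

Here the model misses 20 positive reviews but raises only 15 false positives, so precision is slightly higher than recall.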


### For reference, how do some other sklearn classification models perform?

In [11]:
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import LinearSVC  # SVC and NuSVC are not used below

In [12]:
MNB_classifier = MultinomialNB()
MNB_classifier.fit(X_train, y_train)
cmatrix = confusion_matrix(y_test, MNB_classifier.predict(X_test))
print("MNB_classifier accuracy:", (cmatrix[0][0]  + cmatrix[1][1]) / cmatrix.sum())


MNB_classifier accuracy: 0.815


In [13]:
BNB_classifier = BernoulliNB()
BNB_classifier.fit(X_train, y_train)
cmatrix = confusion_matrix(y_test, BNB_classifier.predict(X_test))
print("BNB_classifier accuracy:", (cmatrix[0][0]  + cmatrix[1][1]) / cmatrix.sum())


BNB_classifier accuracy: 0.8


In [14]:
LR_classifier = LogisticRegression()
LR_classifier.fit(X_train, y_train)
cmatrix = confusion_matrix(y_test, LR_classifier.predict(X_test))
print("LR_classifier accuracy:", (cmatrix[0][0]  + cmatrix[1][1]) / cmatrix.sum())


LR_classifier accuracy: 0.825


In [15]:
SGD_classifier = SGDClassifier()  # no random_state is set, so the score can vary between runs
SGD_classifier.fit(X_train, y_train)
cmatrix = confusion_matrix(y_test, SGD_classifier.predict(X_test))
print("SGD_classifier accuracy:", (cmatrix[0][0]  + cmatrix[1][1]) / cmatrix.sum())


SGD_classifier accuracy: 0.81


In [16]:
SVC_classifier = LinearSVC()
SVC_classifier.fit(X_train, y_train)
cmatrix = confusion_matrix(y_test, SVC_classifier.predict(X_test))
print("SVC_classifier accuracy:", (cmatrix[0][0]  + cmatrix[1][1]) / cmatrix.sum())


SVC_classifier accuracy: 0.81


In [17]:
from sklearn.neural_network import MLPClassifier

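# (20,) is a one-element tuple: a single hidden layer of 20 units.
# No random_state is set, so the score can vary between runs.
MLP_classifier = MLPClassifier(hidden_layer_sizes=(20,), max_iter=1000)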
MLP_classifier.fit(X_train, y_train)

cmatrix = confusion_matrix(y_test, MLP_classifier.predict(X_test))
print("MLP_classifier accuracy:", (cmatrix[0][0]  + cmatrix[1][1]) / cmatrix.sum())

MLP_classifier accuracy: 0.8
