
* Load the Yelp data set of 10,000 reviews
    * Prepare a binary outcome variable: recode ratings below 4 as 0 and >= 4 as 1
* Use sklearn’s TF-IDF vectorizer function to convert the corpus of reviews into text features
    * Pay attention to key choices about stop words, minimum document frequency, and maximum document frequency
    * You may also want to stem data prior to using the vectorizer function
    * Develop an ML model to predict the binary ratings outcome and assess the model’s performance
* Repeat this process using word-embedding values
    * After removing stop words, calculate the average of the word vectors in each review
    * Use these average vector scores as the features in your ML model
    * Compare the performance with the previous model
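
The stop-word and document-frequency choices in the list above decide which terms become features. The following is a minimal pure-Python sketch of that filtering, assuming whitespace tokenization and the same meaning that sklearn gives to an integer `min_df` (absolute document count) and a float `max_df` (proportion of documents). The function name `filter_vocabulary` and the toy corpus are hypothetical.

```python
from collections import Counter

def filter_vocabulary(docs, stop_words=frozenset(), min_df=1, max_df=1.0):
    """Return the sorted vocabulary kept after stop-word and document-frequency filtering.

    min_df is an absolute document count; max_df is a proportion of documents.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        # count each term at most once per document
        doc_freq.update(set(doc.lower().split()))
    return sorted(
        term for term, df in doc_freq.items()
        if term not in stop_words and min_df <= df <= max_df * n_docs
    )

docs = [
    "the food was great",
    "the service was slow",
    "great food and great service",
    "the sushi was bland",
]
vocab = filter_vocabulary(docs, stop_words={"the", "and"}, min_df=2, max_df=0.5)
print(vocab)  # ['food', 'great', 'service']
```

Raising `min_df` removes rare terms such as typos and names, while lowering `max_df` removes terms so common that they carry little signal (here, `was`).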

In [133]:
# bring in some texts
import pandas as pd
import numpy as np

df = pd.read_csv('yelp_10k.csv')
print(df.shape, df.columns.to_list(),'\n')
#df.head()

(10000, 5) ['REST_ID', 'UserId', 'Rating', 'Date', 'Description'] 



In [134]:
df.fillna(0, inplace=True)

In [135]:
#df.describe()

In [136]:
# Prepare a binary outcome variable: ratings below 4 become 0, ratings of 4 or more become 1

df.loc[df.Rating < 4, 'BinaryRating'] = 0
df.loc[df.Rating >= 4, 'BinaryRating'] = 1
df['BinaryRating'] = df.BinaryRating.astype(int)

In [137]:
df.head()

Unnamed: 0,REST_ID,UserId,Rating,Date,Description,BinaryRating
0,37185,m07sy7eLtOjVdZ8oN9JKag,5.0,2010-02-18,"Hey, wait a minute...didn't Henry VIII always ...",1
1,18248,ev_mrEIDJauxugj1r8Z3qg,4.0,2011-01-08,I've only eaten at a few rotating sushi bars i...,1
2,2625,ZY3GMhyJTAS4ml39whHUQg,2.0,2016-08-20,Overall disappointed and still hungry for some...,0
3,38,q7hkNbl7DEW8tKu7F0ATiw,5.0,2017-09-19,Spent 4 days in DC and this was my favorite pl...,1
4,17895,focs8rivLPE-b8W6dsQe6Q,4.0,2014-06-10,I was stopping by the DC area on a leisure wee...,1


In [138]:
# use a regex to replace each run of digits and non-word characters (punctuation, whitespace) with one space
df['Description'] = df['Description'].astype(str).str.replace(r"(\d|\W)+", " ", regex=True).str.strip()

In [139]:
df.head()

Unnamed: 0,REST_ID,UserId,Rating,Date,Description,BinaryRating
0,37185,m07sy7eLtOjVdZ8oN9JKag,5.0,2010-02-18,Hey wait a minute didn t Henry VIII always eat...,1
1,18248,ev_mrEIDJauxugj1r8Z3qg,4.0,2011-01-08,I ve only eaten at a few rotating sushi bars i...,1
2,2625,ZY3GMhyJTAS4ml39whHUQg,2.0,2016-08-20,Overall disappointed and still hungry for some...,0
3,38,q7hkNbl7DEW8tKu7F0ATiw,5.0,2017-09-19,Spent days in DC and this was my favorite plac...,1
4,17895,focs8rivLPE-b8W6dsQe6Q,4.0,2014-06-10,I was stopping by the DC area on a leisure wee...,1
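
The effect of the cleaning pattern can be checked on single strings with the standard `re` module. This is a small sketch; the helper name `clean` is hypothetical, and the first two inputs are shortened versions of the reviews shown above.

```python
import re

def clean(text):
    # collapse every run of digits and non-word characters into one space
    return re.sub(r"(\d|\W)+", " ", text).strip()

print(clean("Hey, wait a minute...didn't Henry VIII"))  # Hey wait a minute didn t Henry VIII
print(clean("Spent 4 days in DC"))                      # Spent days in DC
print(clean("5_star place!!"))                          # _star place
```

The pattern splits contractions (`didn't` becomes `didn t`) and keeps underscores, because `_` counts as a word character.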


### Tokenize, lemmatize, and stem the data in the Description column

In [140]:
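import spacy
import nltk

# nltk.download('wordnet') may be needed once before the lemmatizer can be used
w_tokenizer = nltk.tokenize.WhitespaceTokenizer()
lemmatizer = nltk.stem.WordNetLemmatizer()
# `stemmer` was used without being defined; PorterStemmer is assumed here
stemmer = nltk.stem.PorterStemmer()

def lemmatize_text(text):
    return [lemmatizer.lemmatize(w) for w in w_tokenizer.tokenize(text)]

df['Lemmatised_D'] = df.Description.apply(lemmatize_text)
df['Stemmed_D'] = df['Lemmatised_D'].apply(lambda x: [stemmer.stem(y) for y in x])  # stem every word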


In [141]:
df.head()

Unnamed: 0,REST_ID,UserId,Rating,Date,Description,BinaryRating,Lemmatised_D,Stemmed_D
0,37185,m07sy7eLtOjVdZ8oN9JKag,5.0,2010-02-18,Hey wait a minute didn t Henry VIII always eat...,1,"[Hey, wait, a, minute, didn, t, Henry, VIII, a...","[hey, wait, a, minut, didn, t, henri, viii, al..."
1,18248,ev_mrEIDJauxugj1r8Z3qg,4.0,2011-01-08,I ve only eaten at a few rotating sushi bars i...,1,"[I, ve, only, eaten, at, a, few, rotating, sus...","[I, ve, onli, eaten, at, a, few, rotat, sushi,..."
2,2625,ZY3GMhyJTAS4ml39whHUQg,2.0,2016-08-20,Overall disappointed and still hungry for some...,0,"[Overall, disappointed, and, still, hungry, fo...","[overal, disappoint, and, still, hungri, for, ..."
3,38,q7hkNbl7DEW8tKu7F0ATiw,5.0,2017-09-19,Spent days in DC and this was my favorite plac...,1,"[Spent, day, in, DC, and, this, wa, my, favori...","[spent, day, in, DC, and, thi, wa, my, favorit..."
4,17895,focs8rivLPE-b8W6dsQe6Q,4.0,2014-06-10,I was stopping by the DC area on a leisure wee...,1,"[I, wa, stopping, by, the, DC, area, on, a, le...","[I, wa, stop, by, the, DC, area, on, a, leisur..."


### Remove stop words

In [142]:
from nltk.corpus import stopwords
stop_words = stopwords.words('english')

df['Normalized'] = df['Stemmed_D'].apply(lambda x: ' '.join([word for word in x if word not in (stop_words)]))

In [143]:
df.head()

Unnamed: 0,REST_ID,UserId,Rating,Date,Description,BinaryRating,Lemmatised_D,Stemmed_D,Normalized
0,37185,m07sy7eLtOjVdZ8oN9JKag,5.0,2010-02-18,Hey wait a minute didn t Henry VIII always eat...,1,"[Hey, wait, a, minute, didn, t, Henry, VIII, a...","[hey, wait, a, minut, didn, t, henri, viii, al...",hey wait minut henri viii alway eat turkey leg...
1,18248,ev_mrEIDJauxugj1r8Z3qg,4.0,2011-01-08,I ve only eaten at a few rotating sushi bars i...,1,"[I, ve, only, eaten, at, a, few, rotating, sus...","[I, ve, onli, eaten, at, a, few, rotat, sushi,...",I onli eaten rotat sushi bar time fantast eat ...
2,2625,ZY3GMhyJTAS4ml39whHUQg,2.0,2016-08-20,Overall disappointed and still hungry for some...,0,"[Overall, disappointed, and, still, hungry, fo...","[overal, disappoint, and, still, hungri, for, ...",overal disappoint still hungri someth better b...
3,38,q7hkNbl7DEW8tKu7F0ATiw,5.0,2017-09-19,Spent days in DC and this was my favorite plac...,1,"[Spent, day, in, DC, and, this, wa, my, favori...","[spent, day, in, DC, and, thi, wa, my, favorit...",spent day DC thi wa favorit place great happi ...
4,17895,focs8rivLPE-b8W6dsQe6Q,4.0,2014-06-10,I was stopping by the DC area on a leisure wee...,1,"[I, wa, stopping, by, the, DC, area, on, a, le...","[I, wa, stop, by, the, DC, area, on, a, leisur...",I wa stop DC area leisur weekend redston wa re...
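
In the Normalized column above, capitalized tokens such as `I` and `It` survive the filter, because the membership test is case-sensitive and NLTK's English stop-word list is lowercase. A minimal sketch of the difference, using a small stand-in stop list (an assumption; NLTK's real list is much longer):

```python
# small stand-in stop list (assumption: not NLTK's full list, but lowercase like it)
stop_words = {"i", "it", "a", "the", "was", "only"}

tokens = ["I", "only", "ate", "a", "roll", "It", "was", "great"]

# comparison as written in the cell above: capitalized stop words slip through
case_sensitive = [t for t in tokens if t not in stop_words]

# lowercasing each token before the lookup removes them as well
case_insensitive = [t for t in tokens if t.lower() not in stop_words]

print(case_sensitive)    # ['I', 'ate', 'roll', 'It', 'great']
print(case_insensitive)  # ['ate', 'roll', 'great']
```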


### Keep only the normalized text

In [144]:
df2 = df.drop(columns=['Lemmatised_D','Stemmed_D'])
df2

Unnamed: 0,REST_ID,UserId,Rating,Date,Description,BinaryRating,Normalized
0,37185,m07sy7eLtOjVdZ8oN9JKag,5.0,2010-02-18,Hey wait a minute didn t Henry VIII always eat...,1,hey wait minut henri viii alway eat turkey leg...
1,18248,ev_mrEIDJauxugj1r8Z3qg,4.0,2011-01-08,I ve only eaten at a few rotating sushi bars i...,1,I onli eaten rotat sushi bar time fantast eat ...
2,2625,ZY3GMhyJTAS4ml39whHUQg,2.0,2016-08-20,Overall disappointed and still hungry for some...,0,overal disappoint still hungri someth better b...
3,38,q7hkNbl7DEW8tKu7F0ATiw,5.0,2017-09-19,Spent days in DC and this was my favorite plac...,1,spent day DC thi wa favorit place great happi ...
4,17895,focs8rivLPE-b8W6dsQe6Q,4.0,2014-06-10,I was stopping by the DC area on a leisure wee...,1,I wa stop DC area leisur weekend redston wa re...
...,...,...,...,...,...,...,...
9995,22070,SBFIcLkzx3bBY2Q-Hsbvsw,3.0,2014-07-13,I would definitely try this place again br br ...,0,I would definit tri thi place br br F W ha ver...
9996,31006,5EQnJI0p5Jnu0Bjhj4aaQw,5.0,2012-06-10,The Wooly Pig is definitely one of the best sa...,1,wooli pig definit one best sandwich I everyth ...
9997,5104,DEL1SlWIT3SwSbX67Vmf4w,5.0,2017-08-02,Nothing bUt wonderful here The buffalo chicken...,1,noth wonder buffalo chicken wrap wa great sinc...
9998,23634,tG2IxZc52u8Dd_lk0wE_ow,5.0,2017-10-09,It s an adorable restaurant Went on a random d...,1,It ador restaur went random date wife plan ret...


### Converting the corpus of reviews into text features using sklearn's CountVectorizer

In [145]:
#Converting the corpus of reviews into text features using sklearn vectorizer

# sklearn vectorizer
from sklearn.feature_extraction.text import CountVectorizer

# create vectorizer object with attributes to pre-process and filter words
vectorizer = CountVectorizer(stop_words='english', min_df=2,
                            ngram_range=(1,1))

X = vectorizer.fit_transform(df2.Normalized.to_list())

print('shape of word vector array:', X.shape, '\n')
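# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
feature_names = vectorizer.get_feature_names_out().tolist()
print(len(feature_names), 'words:')
print(feature_names)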

shape of word vector array: (10000, 9303) 

9303 words:
['__', 'aah', 'ab', 'aback', 'abalon', 'abandon', 'aberdeen', 'abhor', 'abil', 'abita', 'abl', 'abnorm', 'abomin', 'abound', 'abov', 'abraham', 'abroad', 'abrupt', 'abruptli', 'absenc', 'absent', 'absinth', 'absolut', 'absorb', 'absurd', 'absurdli', 'abt', 'abund', 'abus', 'abysm', 'ac', 'academi', 'acai', 'accent', 'accentu', 'accept', 'access', 'accessori', 'accid', 'accident', 'acclaim', 'accomad', 'accommod', 'accomod', 'accompani', 'accomplish', 'accord', 'accordingli', 'accordion', 'account', 'accouter', 'accoutr', 'accumul', 'accur', 'accuraci', 'accus', 'accustom', 'ace', 'ach', 'achiev', 'acid', 'acknowledg', 'acm', 'acoust', 'acquaint', 'acquiesc', 'acquir', 'act', 'action', 'activ', 'actual', 'acut', 'ad', 'adam', 'adana', 'adapt', 'add', 'addict', 'addit', 'address', 'adequ', 'adjac', 'adjoin', 'adjust', 'admir', 'admiss', 'admit', 'admittedli', 'admo', 'adobo', 'adopt', 'ador', 'adorn', 'adult', 'advanc', 'advantag', 

### Converting the corpus of reviews into text features using sklearn’s TF-IDF

In [158]:
#Converting the corpus of reviews into text features using sklearn’s TF-IDF

from sklearn.feature_extraction.text import TfidfVectorizer

# same process w/ tfidf vectorizer
tfidf_vectorizer = TfidfVectorizer(stop_words='english', min_df=3)

X2 = tfidf_vectorizer.fit_transform(df2.Normalized.to_list())

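# Top TF-IDF keywords for the review at row 1
# (get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2)
tmp = pd.DataFrame(X2[1].T.todense(), index=tfidf_vectorizer.get_feature_names_out(), columns=["tfidf"])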
tmp.sort_values(by=["tfidf"],ascending=False).head(15)

Unnamed: 0,tfidf
sushi,0.332497
roll,0.305148
crab,0.21425
mental,0.192687
milkshak,0.175553
pace,0.171679
rotat,0.16895
regard,0.167076
bar,0.16488
discov,0.157181
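
The weights above come from term counts scaled by inverse document frequency and then normalized per review. The sketch below reproduces the formula that `TfidfVectorizer` uses with its default settings (smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 row normalization) in pure Python. It assumes whitespace tokenization; the function name `tfidf_rows` and the toy corpus are hypothetical.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Smoothed TF-IDF with L2 row normalization for whitespace-tokenized documents."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # document frequency: in how many documents each term appears
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {t: c * idf[t] for t, c in counts.items()}
        # scale the row to unit Euclidean length
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()})
    return rows

docs = ["sushi roll sushi bar", "burger bar", "burger fries"]
rows = tfidf_rows(docs)
for term, weight in sorted(rows[0].items(), key=lambda kv: -kv[1]):
    print(f"{term:6s} {weight:.3f}")
```

In the first toy document, `sushi` ranks highest because it occurs twice and in no other document, while `bar` ranks lowest because it also appears in the second document.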


In [159]:
# Append the CountVectorizer counts (X, not the TF-IDF matrix X2) to the original dataframe

tmp = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
df2 = pd.concat([df, tmp], axis=1)
df2.head()

Unnamed: 0,REST_ID,UserId,Rating,Date,Description,BinaryRating,Lemmatised_D,Stemmed_D,Normalized,__,...,zipcar,ziti,zoe,zombi,zone,zoo,zorba,zucchini,zuni,zushi
0,37185,m07sy7eLtOjVdZ8oN9JKag,5.0,2010-02-18,Hey wait a minute didn t Henry VIII always eat...,1,"[Hey, wait, a, minute, didn, t, Henry, VIII, a...","[hey, wait, a, minut, didn, t, henri, viii, al...",hey wait minut henri viii alway eat turkey leg...,0,...,0,0,0,0,0,0,0,0,0,0
1,18248,ev_mrEIDJauxugj1r8Z3qg,4.0,2011-01-08,I ve only eaten at a few rotating sushi bars i...,1,"[I, ve, only, eaten, at, a, few, rotating, sus...","[I, ve, onli, eaten, at, a, few, rotat, sushi,...",I onli eaten rotat sushi bar time fantast eat ...,0,...,0,0,0,0,0,0,0,0,0,0
2,2625,ZY3GMhyJTAS4ml39whHUQg,2.0,2016-08-20,Overall disappointed and still hungry for some...,0,"[Overall, disappointed, and, still, hungry, fo...","[overal, disappoint, and, still, hungri, for, ...",overal disappoint still hungri someth better b...,0,...,0,0,0,0,0,0,0,0,0,0
3,38,q7hkNbl7DEW8tKu7F0ATiw,5.0,2017-09-19,Spent days in DC and this was my favorite plac...,1,"[Spent, day, in, DC, and, this, wa, my, favori...","[spent, day, in, DC, and, thi, wa, my, favorit...",spent day DC thi wa favorit place great happi ...,0,...,0,0,0,0,0,0,0,0,0,0
4,17895,focs8rivLPE-b8W6dsQe6Q,4.0,2014-06-10,I was stopping by the DC area on a leisure wee...,1,"[I, wa, stopping, by, the, DC, area, on, a, le...","[I, wa, stop, by, the, DC, area, on, a, leisur...",I wa stop DC area leisur weekend redston wa re...,0,...,0,0,0,0,0,0,0,0,0,0


In [160]:
#Rearranging columns for modelling purposes

cols = df2.columns.tolist()
a, b = cols.index('Normalized'), cols.index('BinaryRating')
cols[b], cols[a] = cols[a], cols[b]
df2 = df2[cols]

df2

Unnamed: 0,REST_ID,UserId,Rating,Date,Description,Normalized,Lemmatised_D,Stemmed_D,BinaryRating,__,...,zipcar,ziti,zoe,zombi,zone,zoo,zorba,zucchini,zuni,zushi
0,37185,m07sy7eLtOjVdZ8oN9JKag,5.0,2010-02-18,Hey wait a minute didn t Henry VIII always eat...,hey wait minut henri viii alway eat turkey leg...,"[Hey, wait, a, minute, didn, t, Henry, VIII, a...","[hey, wait, a, minut, didn, t, henri, viii, al...",1,0,...,0,0,0,0,0,0,0,0,0,0
1,18248,ev_mrEIDJauxugj1r8Z3qg,4.0,2011-01-08,I ve only eaten at a few rotating sushi bars i...,I onli eaten rotat sushi bar time fantast eat ...,"[I, ve, only, eaten, at, a, few, rotating, sus...","[I, ve, onli, eaten, at, a, few, rotat, sushi,...",1,0,...,0,0,0,0,0,0,0,0,0,0
2,2625,ZY3GMhyJTAS4ml39whHUQg,2.0,2016-08-20,Overall disappointed and still hungry for some...,overal disappoint still hungri someth better b...,"[Overall, disappointed, and, still, hungry, fo...","[overal, disappoint, and, still, hungri, for, ...",0,0,...,0,0,0,0,0,0,0,0,0,0
3,38,q7hkNbl7DEW8tKu7F0ATiw,5.0,2017-09-19,Spent days in DC and this was my favorite plac...,spent day DC thi wa favorit place great happi ...,"[Spent, day, in, DC, and, this, wa, my, favori...","[spent, day, in, DC, and, thi, wa, my, favorit...",1,0,...,0,0,0,0,0,0,0,0,0,0
4,17895,focs8rivLPE-b8W6dsQe6Q,4.0,2014-06-10,I was stopping by the DC area on a leisure wee...,I wa stop DC area leisur weekend redston wa re...,"[I, wa, stopping, by, the, DC, area, on, a, le...","[I, wa, stop, by, the, DC, area, on, a, leisur...",1,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,22070,SBFIcLkzx3bBY2Q-Hsbvsw,3.0,2014-07-13,I would definitely try this place again br br ...,I would definit tri thi place br br F W ha ver...,"[I, would, definitely, try, this, place, again...","[I, would, definit, tri, thi, place, again, br...",0,0,...,0,0,0,0,0,0,0,0,0,0
9996,31006,5EQnJI0p5Jnu0Bjhj4aaQw,5.0,2012-06-10,The Wooly Pig is definitely one of the best sa...,wooli pig definit one best sandwich I everyth ...,"[The, Wooly, Pig, is, definitely, one, of, the...","[the, wooli, pig, is, definit, one, of, the, b...",1,0,...,0,0,0,0,0,0,0,0,0,0
9997,5104,DEL1SlWIT3SwSbX67Vmf4w,5.0,2017-08-02,Nothing bUt wonderful here The buffalo chicken...,noth wonder buffalo chicken wrap wa great sinc...,"[Nothing, bUt, wonderful, here, The, buffalo, ...","[noth, but, wonder, here, the, buffalo, chicke...",1,0,...,0,0,0,0,0,0,0,0,0,0
9998,23634,tG2IxZc52u8Dd_lk0wE_ow,5.0,2017-10-09,It s an adorable restaurant Went on a random d...,It ador restaur went random date wife plan ret...,"[It, s, an, adorable, restaurant, Went, on, a,...","[It, s, an, ador, restaur, went, on, a, random...",1,0,...,0,0,0,0,0,0,0,0,0,0


In [161]:
#Develop a ML model to predict the binary ratings outcome and assess the model’s performance

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score

### Logistic regression model to predict the binary ratings outcome and assess the model’s performance

In [166]:
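# Logistic regression: split the data

# the word-count features start at column 9, after the metadata columns and BinaryRating
xcols = df2.columns[9:].to_list()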

X_train, X_test, y_train, y_test = train_test_split(df2[xcols], df2['BinaryRating'], 
                                                    train_size=0.8, random_state=1)
print('training data:', X_train.shape)
print('test data:', X_test.shape)

training data: (8000, 9303)
test data: (2000, 9303)


In [167]:
#Logistic regression

log_reg = LogisticRegression(solver='lbfgs', max_iter=3000)

# fit the model to the training data
clf = log_reg.fit(X_train, y_train)

# Examine the features with the largest positive and negative coefficients (6 in total)

coef = pd.concat([pd.DataFrame(xcols),pd.DataFrame(np.transpose(clf.coef_))], axis = 1)
coef.columns = ['feature','coefficient']
coef.sort_values(by=['coefficient'], ascending=False, inplace=True)

print('Three largest positive features:\n', coef.head(3), '\n')
print('Three largest negative features:\n', coef.tail(3))

print('\ntraining accuracy: {}'.format(clf.score(X_train, y_train).round(3)))
print('test accuracy: {}'.format(clf.score(X_test, y_test).round(3)))

Three largest positive features:
       feature  coefficient
2910  fantast     1.989076
9146   wonder     1.732525
510    awesom     1.725501 

Three largest negative features:
       feature  coefficient
3920  horribl    -2.114509
9182    worst    -2.402082
812     bland    -2.430427

training accuracy: 0.98
test accuracy: 0.842


In [168]:
# Logistic regression: effect of the C parameter
# (C is the inverse of the regularization strength, so smaller C means stronger regularization)

cset = [.001, .01, .1, 1, 10]
for i in cset:
    print('C =', i)
    log_reg = LogisticRegression(solver='lbfgs', max_iter=1000, C=i)
    clf = log_reg.fit(X_train, y_train)
    print('training accuracy: {}'.format(clf.score(X_train, y_train).round(3)))
    print('test accuracy: {}'.format(clf.score(X_test, y_test).round(3)), '\n')

C = 0.001
training accuracy: 0.756
test accuracy: 0.749 

C = 0.01
training accuracy: 0.868
test accuracy: 0.832 

C = 0.1
training accuracy: 0.93
test accuracy: 0.856 

C = 1
training accuracy: 0.98
test accuracy: 0.842 

C = 10
training accuracy: 0.998
test accuracy: 0.828 
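
The pattern in these results is easier to see as the gap between training and test accuracy. The short calculation below uses the figures copied from the output above; nothing is refit.

```python
# (C, training accuracy, test accuracy) copied from the output above
results = [
    (0.001, 0.756, 0.749),
    (0.01, 0.868, 0.832),
    (0.1, 0.93, 0.856),
    (1, 0.98, 0.842),
    (10, 0.998, 0.828),
]

# a growing gap means the model fits the training data ever more closely
gaps = [round(train_acc - test_acc, 3) for _, train_acc, test_acc in results]
for (c, _, _), gap in zip(results, gaps):
    print(f"C={c:<6} train-test gap: {gap:.3f}")

best_c = max(results, key=lambda r: r[2])[0]
print("C with the highest test accuracy:", best_c)  # 0.1
```

The gap widens steadily as C grows, which is the usual sign of overfitting under weaker regularization. Picking C from test accuracy would leak the test set into model selection, so cross-validation on the training data is the safer way to choose it.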



In [169]:
scores = cross_val_score(LogisticRegression(solver='lbfgs', max_iter=3000), X_train, y_train, cv=5) 
print("Mean cross-validation accuracy: {:.2f}".format(np.mean(scores)))

Mean cross-validation accuracy: 0.85


We obtain a mean cross-validation accuracy of 85%. The classes are not balanced (roughly 65% of the reviews are rated 4 or higher), so this figure should be read against the majority-class rate rather than against 50%.
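
To put the accuracy in context, the share of the majority class gives the accuracy of a classifier that always predicts a rating of 4 or higher. A short calculation, using the class counts printed by `value_counts()` later in this notebook:

```python
# class counts as reported by value_counts() in this notebook (after one row is dropped)
n_positive = 6531
n_negative = 3468

majority_baseline = max(n_positive, n_negative) / (n_positive + n_negative)
print(f"majority-class baseline accuracy: {majority_baseline:.3f}")  # 0.653
```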

In [170]:
from sklearn.model_selection import GridSearchCV 
param_grid = {'C': [0.001, 0.01, 0.1, 1, 10]} 
grid = GridSearchCV(LogisticRegression(solver='lbfgs', max_iter=3000), param_grid, cv=5) 
grid.fit(X_train, y_train) 
print("Best cross-validation score: {:.2f}".format(grid.best_score_)) 
print("Best parameters: ", grid.best_params_)

Best cross-validation score: 0.86
Best parameters:  {'C': 0.1}


We obtain a cross-validation score of 86% using C=0.1. 
We can now assess the generalization performance of this parameter setting on the test set:

In [171]:
print("Test score: {:.2f}".format(grid.score(X_test, y_test)))

Test score: 0.86


In [172]:
# F1 score and confusion matrix on the test set

from sklearn.metrics import confusion_matrix
from sklearn.metrics import f1_score

# clf is the last model fit in the C loop above (C=10), not the grid-search winner (C=0.1)
y_pred_test = clf.predict(X_test)
print('F1 score: {:.3f}'.format(f1_score(y_test, y_pred_test)), '\n')
cm = confusion_matrix(y_test, y_pred_test)
sample = np.array([['TN', 'FP'], ['FN', 'TP']])
print('CM key:\n', sample, '\n')
print('CM for test:\n', cm)

F1 score: 0.870 

CM key:
 [['TN' 'FP']
 ['FN' 'TP']] 

CM for test:
 [[ 507  175]
 [ 169 1149]]
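
The F1 score can be recomputed by hand from the four cells of the confusion matrix, which also yields precision, recall, and accuracy. A short worked calculation with the counts printed above:

```python
# confusion matrix from the output above, laid out as [[TN, FP], [FN, TP]]
tn, fp = 507, 175
fn, tp = 169, 1149

precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision: {precision:.3f}")  # 0.868
print(f"recall:    {recall:.3f}")     # 0.872
print(f"F1:        {f1:.3f}")         # 0.870
print(f"accuracy:  {accuracy:.3f}")   # 0.828
```

The accuracy of 0.828 equals the test accuracy printed for C=10 in the regularization loop.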


## Repeating the process with word embeddings

In [175]:
import pandas as pd
import numpy as np
from gensim.models import Word2Vec, KeyedVectors
from sklearn.metrics.pairwise import cosine_similarity

In [184]:
import gzip
import shutil
with gzip.open('GoogleNews-vectors-negative300-SLIM.bin.gz', 'rb') as f_in:
    with open('GoogleNews-vectors-negative300-SLIM.bin', 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)

In [190]:
# Import the os module
import os
os.getcwd()

'/Users/akshathabyranaumesh/Documents/MSBA_FOX/Spring2021/STAT 5603/Week10'

In [191]:
# Word-embedding model

# Load the pretrained vectors extracted above (path relative to the working directory)
model = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300-SLIM.bin", binary=True)
print(len(model), 'words in model')

299567 words in model


In [192]:
# process text
from nltk import word_tokenize as tokenize
from nltk.corpus import stopwords
stops = stopwords.words('english')
import string

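# Normalized holds stemmed tokens (e.g. 'minut', 'henri'); stems that are not in the
# word2vec vocabulary are skipped when the vectors are averaged below
texts = df2.Normalized.to_list()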
texts2 = []
for text in texts:
    text = str(text).lower()
    text = text.translate(str.maketrans('', '', string.punctuation))
    tokens = tokenize(text)
    tokens = [i for i in tokens if i.isalpha()]
    tokens = [i for i in tokens if i not in stops]
    texts2.append(tokens)

print('original text 0:', texts[0])
print('processed text 0:', texts2[0])

original text 0: hey wait minut henri viii alway eat turkey leg would san francisco eat red meat It would tmz wa ye old internet back mediev age king could behead hi wife bear heir Ah good ole day well day back moment enter hous prime rib br br OK quit joust contest knight shine armor ladi wait janean garofalo ask lord cabl guy place throwback least middl last centuri slab meat bake potato consid money meal well still even though thi wa first visit sinc dad took famili I wa six ye six flash forward year ahem I jami W readi gnaw whatev land dinner plate br br anthoni bourdain visit hi No reserv show must smack subconsci becaus everyth wa familiar unctuou waiter spin salad bowl big silver cart prime rib readi carv We seat one darker dine room least semi privat booth saw marri coupl eat prime rib birthday wife wa petit thing I well imagin boister husband must behind restaur select br br menu simpl five cut prime rib fish vegan need appli natur jami I went king henri viii cut sinc wa tailo

In [193]:
# Get average vectors for each text

avg_vectors = []; drops = []
for i, text in enumerate(texts2):
    vectors = []
    
    for token in text:
        if token in model:
            v = model.get_vector(token)
            vectors.append(v)
    
    # keep reviews with at least one known token; record the index of the rest so they can be dropped
    if vectors:
        avg_vector = np.mean(vectors, axis=0)
        avg_vectors.append(avg_vector)
    else:
        drops.append(i)
    
# print out example avg_vector
print('avg_vector:', avg_vectors[0])

avg_vector: [ 1.22524425e-03  2.39473358e-02  5.94263105e-03  3.83460484e-02
 -7.94497877e-03  7.23298406e-03  1.89751554e-02 -3.89018543e-02
  1.47806844e-02  4.44152504e-02 -1.25671914e-02 -4.25503291e-02
 -1.43128047e-02  4.99436166e-03 -4.38332148e-02  3.54755111e-02
  6.13987725e-03  3.45766693e-02 -7.65680242e-03 -1.81281045e-02
 -9.23362933e-03  9.45788622e-03  3.13412659e-02 -6.18531660e-04
 -1.26057805e-03 -1.16026299e-02 -3.65019254e-02  2.70060357e-02
  8.78584664e-03 -1.96496956e-03 -1.40894596e-02  1.65069290e-02
  1.77377486e-03 -1.12468386e-02 -2.08163578e-02 -4.94827749e-03
  5.69078606e-03  6.11359579e-03  1.96793061e-02  2.36510653e-02
  1.94793176e-02 -4.55833077e-02  4.85706553e-02  1.06921922e-02
  2.63318443e-03 -1.48866400e-02 -1.99303795e-02 -4.19220654e-03
  5.50156133e-03  1.35639431e-02 -2.07287688e-02  2.59547830e-02
  3.42011754e-03 -3.60190915e-03 -6.38564583e-03 -2.81686103e-03
 -1.19062066e-02 -1.34483082e-02  6.16570003e-03 -3.54163647e-02
 -2.25372314e
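
The averaging step can be illustrated without the pretrained model. The sketch below uses plain lists and a toy 3-dimensional embedding table with hypothetical values; the function name `average_vector` is also hypothetical. Tokens missing from the table are skipped, and a review with no known tokens yields `None`, which corresponds to an entry in `drops` above.

```python
def average_vector(tokens, embeddings):
    """Mean of the vectors of in-vocabulary tokens, or None if no token is known."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return None
    dims = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dims)]

# toy 3-dimensional embeddings (hypothetical values, not taken from the word2vec model)
toy_embeddings = {
    "great": [1.0, 0.0, 2.0],
    "food": [3.0, 2.0, 0.0],
}

print(average_vector(["great", "food", "minut"], toy_embeddings))  # [2.0, 1.0, 1.0]
print(average_vector(["friendli", "servic"], toy_embeddings))      # None
```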

In [194]:
# Put average vectors into a dataframe
tmp = pd.DataFrame(list(map(np.ravel, avg_vectors)))
print(tmp.shape)
tmp.head()

(9999, 300)


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,290,291,292,293,294,295,296,297,298,299
0,0.001225,0.023947,0.005943,0.038346,-0.007945,0.007233,0.018975,-0.038902,0.014781,0.044415,...,-0.00895,0.009827,-0.031803,0.012137,-0.001471,-0.023986,-0.010995,-0.028859,-0.001836,0.005996
1,-0.007487,0.018119,-0.005769,0.063815,-0.026951,-0.001522,0.022371,-0.029931,0.018833,0.041283,...,0.002374,0.001885,-0.034577,0.018469,0.001508,-0.020043,-0.006698,-0.013732,0.015434,-0.010177
2,-4.5e-05,0.040102,0.004401,0.032413,-0.018996,0.006179,0.021792,-0.033027,0.014204,0.045303,...,-0.027731,0.015088,-0.02081,0.031761,-0.003702,-0.021437,-0.008673,-0.018782,0.004128,-0.000453
3,-0.027629,0.033283,0.004934,0.052229,-0.016362,0.00403,0.013848,-0.03357,0.014172,0.033833,...,0.008199,0.011628,-0.047325,0.015193,-0.008598,-0.029359,0.001043,-0.008568,0.013322,0.003974
4,-0.014171,0.019145,0.005588,0.029606,-0.009208,-0.000747,0.012348,-0.038308,0.017312,0.028928,...,-0.009373,0.012766,-0.02657,0.026305,0.003702,-0.020486,-0.010701,-0.027191,0.007774,-0.004004


In [195]:
# Remove dropped rows from df2
print(len(drops), ':', drops)
print(df2.iloc[drops])
dfs = df2.drop(drops)
print('\ndata shapes:', dfs.shape, dfs.shape)

# synchronize indices
dfs.reset_index(drop=True, inplace=True)
tmp.reset_index(drop=True, inplace=True)

# Merge horizontally into the dataset
dfm = pd.concat([dfs, tmp], axis=1)
#dfm.dropna(inplace=True)
print(dfm.shape)
dfm.head()

1 : [1823]
      REST_ID                  UserId  Rating        Date  \
1823     8934  lWRva8Rz5LxgJZ1dk0dkRQ     4.0  2013-02-02   

                            Description                    Normalized  \
1823  Friendly service delicious coffee  friendli servic delici coffe   

                                Lemmatised_D  \
1823  [Friendly, service, delicious, coffee]   

                              Stemmed_D  BinaryRating  __  ...  zipcar  ziti  \
1823  [friendli, servic, delici, coffe]             1   0  ...       0     0   

      zoe  zombi  zone  zoo  zorba  zucchini  zuni  zushi  
1823    0      0     0    0      0         0     0      0  

[1 rows x 9312 columns]

data shapes: (9999, 9312) (9999, 9312)
(9999, 9612)


Unnamed: 0,REST_ID,UserId,Rating,Date,Description,Normalized,Lemmatised_D,Stemmed_D,BinaryRating,__,...,290,291,292,293,294,295,296,297,298,299
0,37185,m07sy7eLtOjVdZ8oN9JKag,5.0,2010-02-18,Hey wait a minute didn t Henry VIII always eat...,hey wait minut henri viii alway eat turkey leg...,"[Hey, wait, a, minute, didn, t, Henry, VIII, a...","[hey, wait, a, minut, didn, t, henri, viii, al...",1,0,...,-0.00895,0.009827,-0.031803,0.012137,-0.001471,-0.023986,-0.010995,-0.028859,-0.001836,0.005996
1,18248,ev_mrEIDJauxugj1r8Z3qg,4.0,2011-01-08,I ve only eaten at a few rotating sushi bars i...,I onli eaten rotat sushi bar time fantast eat ...,"[I, ve, only, eaten, at, a, few, rotating, sus...","[I, ve, onli, eaten, at, a, few, rotat, sushi,...",1,0,...,0.002374,0.001885,-0.034577,0.018469,0.001508,-0.020043,-0.006698,-0.013732,0.015434,-0.010177
2,2625,ZY3GMhyJTAS4ml39whHUQg,2.0,2016-08-20,Overall disappointed and still hungry for some...,overal disappoint still hungri someth better b...,"[Overall, disappointed, and, still, hungry, fo...","[overal, disappoint, and, still, hungri, for, ...",0,0,...,-0.027731,0.015088,-0.02081,0.031761,-0.003702,-0.021437,-0.008673,-0.018782,0.004128,-0.000453
3,38,q7hkNbl7DEW8tKu7F0ATiw,5.0,2017-09-19,Spent days in DC and this was my favorite plac...,spent day DC thi wa favorit place great happi ...,"[Spent, day, in, DC, and, this, wa, my, favori...","[spent, day, in, DC, and, thi, wa, my, favorit...",1,0,...,0.008199,0.011628,-0.047325,0.015193,-0.008598,-0.029359,0.001043,-0.008568,0.013322,0.003974
4,17895,focs8rivLPE-b8W6dsQe6Q,4.0,2014-06-10,I was stopping by the DC area on a leisure wee...,I wa stop DC area leisur weekend redston wa re...,"[I, wa, stopping, by, the, DC, area, on, a, le...","[I, wa, stop, by, the, DC, area, on, a, leisur...",1,0,...,-0.009373,0.012766,-0.02657,0.026305,0.003702,-0.020486,-0.010701,-0.027191,0.007774,-0.004004


In [199]:
# Send to ML algorithm: Logistic Regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.metrics import f1_score

print('distribution of outcome:')
print(dfm['BinaryRating'].value_counts(), '\n')

# columns[9:] holds the 9,303 word-count columns as well as the 300 embedding dimensions
xcols = dfm.columns[9:].to_list()
X_train, X_test, y_train, y_test = train_test_split(dfm[xcols], dfm['BinaryRating'], 
                                                    random_state=0)
print('train shape:', X_train.shape)
print('test shape:', X_test.shape, '\n')

log_reg = LogisticRegression(solver='lbfgs', max_iter=10000, class_weight='balanced')
clf = log_reg.fit(X_train, y_train)

print('training accuracy: {}'.format(clf.score(X_train, y_train).round(3)))
print('test accuracy: {}'.format(clf.score(X_test, y_test).round(3)))
y_pred_test = clf.predict(X_test)
print('F1 score: {:.3f}'.format(f1_score(y_test, y_pred_test)), '\n')
cm = confusion_matrix(y_test, y_pred_test)
sample = np.array([['TN', 'FP'], ['FN', 'TP']])
print('CM key:\n', sample, '\n')
print('CM for test:\n', cm)

distribution of outcome:
1    6531
0    3468
Name: BinaryRating, dtype: int64 

train shape: (7499, 9603)
test shape: (2500, 9603) 

training accuracy: 0.982
test accuracy: 0.848
F1 score: 0.882 

CM key:
 [['TN' 'FP']
 ['FN' 'TP']] 

CM for test:
 [[ 690  177]
 [ 204 1429]]


In [200]:
# Logistic regression: effect of the C parameter
# (smaller C means stronger regularization; class_weight='balanced' is not used in this loop)

cset = [.001, .01, .1, 1, 10]
for i in cset:
    print('C =', i)
    log_reg = LogisticRegression(solver='lbfgs', max_iter=1000, C=i)
    clf = log_reg.fit(X_train, y_train)
    print('training accuracy: {}'.format(clf.score(X_train, y_train).round(3)))
    print('test accuracy: {}'.format(clf.score(X_test, y_test).round(3)), '\n')

C = 0.001
training accuracy: 0.754
test accuracy: 0.744 

C = 0.01
training accuracy: 0.864
test accuracy: 0.84 

C = 0.1
training accuracy: 0.929
test accuracy: 0.86 

C = 1
training accuracy: 0.979
test accuracy: 0.851 

C = 10
training accuracy: 0.997
test accuracy: 0.834 



In [203]:
from sklearn.model_selection import GridSearchCV 
param_grid = {'C': [0.001, 0.01, 0.1, 1, 10]} 
grid = GridSearchCV(LogisticRegression(solver='lbfgs', max_iter=3000), param_grid, cv=5) 
grid.fit(X_train, y_train) 
print("Best cross-validation score: {:.3f}".format(grid.best_score_)) 
print("Best parameters: ", grid.best_params_)

Best cross-validation score: 0.851
Best parameters:  {'C': 0.1}


In [202]:
# F1 score and confusion matrix on the test set
# (clf is the last model fit in the C loop above, C=10)

from sklearn.metrics import confusion_matrix
from sklearn.metrics import f1_score

y_pred_test = clf.predict(X_test)
print('F1 score: {:.3f}'.format(f1_score(y_test, y_pred_test)), '\n')
cm = confusion_matrix(y_test, y_pred_test)
sample = np.array([['TN', 'FP'], ['FN', 'TP']])
print('CM key:\n', sample, '\n')
print('CM for test:\n', cm)

F1 score: 0.874 

CM key:
 [['TN' 'FP']
 ['FN' 'TP']] 

CM for test:
 [[ 645  222]
 [ 194 1439]]


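## Comparing the performance of both models

The second model has a slightly better F1 score (0.874 versus 0.870 for the first model). Both values come from the last model fit in each regularization loop (C=10). With C=0.1, the value selected by grid search in both cases, test accuracy is 0.856 for the first model and 0.86 for the second. The difference is small, and the two runs are not strictly comparable: they use different train/test splits (80/20 with random_state=1 versus the default 75/25 with random_state=0), and the second model's 9,603 features include the 9,303 word-count columns in addition to the 300 averaged embedding dimensions.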