# Predicting User Satisfaction with Amazon Alexa
## Random Forest with TF-IDF vs. BERT Encoding for Star Rating Prediction
### By Elena Korshakova and Diedre Brown

This notebook compares TF-IDF encoding with BERT encoding when each is combined with an ensemble model, here a random forest. We compare the two encoding methods on both accuracy and F1 score, because the ratings are unbalanced (skewed towards the 5-star score).

# Import Libraries

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, accuracy_score
from sklearn.model_selection import GridSearchCV
from bert_embedding import BertEmbedding

## Load preprocessed data

In [2]:
train = pd.read_pickle("data/df_train.pickle")
test = pd.read_pickle("data/df_test.pickle")

In [3]:
train

| | review | rating |
|---|---|---|
| 0 | great speaker | 3 |
| 1 | great little | 4 |
| 2 | awesome | 5 |
| 3 | love | 5 |
| 4 | great device | 5 |
| ... | ... | ... |
| 6850 | fun love | 5 |
| 6851 | lot fun | 5 |
| 6852 | buy gift husband problem set want return past ... | 3 |
| 6853 | set control light home thermostat love able se... | 5 |


# 1. Random Forest + TF-IDF features

## Transform reviews into features (TF-IDF encoding)

We first transform the reviews into features using TF-IDF, a term-weighting method intended to reflect how important a word is to a review. The TF-IDF value increases proportionally to the number of times a word appears in the review and is offset by the number of reviews in the corpus that contain the word, which adjusts for the fact that some words appear more frequently in general.
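The weighting can be seen on a small example. The sketch below uses a made-up three-review corpus (the names `toy_train`, `toy_test`, and `toy_vectoriser` are illustrative and not part of the notebook) to show that a word present in every review receives the lowest IDF weight, and that words unseen during fitting are ignored when new text is transformed.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (made up for illustration, not taken from the review data)
toy_train = ["love alexa", "love music", "love speaker sound"]
toy_test = ["love playlist"]  # "playlist" never appears in toy_train

toy_vectoriser = TfidfVectorizer()
toy_X = toy_vectoriser.fit_transform(toy_train)  # learn vocabulary and IDF weights
toy_X_test = toy_vectoriser.transform(toy_test)  # reuse them; unseen words are dropped

vocab = toy_vectoriser.vocabulary_  # word -> column index
idf = toy_vectoriser.idf_           # one IDF weight per column

# "love" occurs in every review, so it gets the smallest IDF weight
print("idf(love)  =", round(idf[vocab["love"]], 3))
print("idf(alexa) =", round(idf[vocab["alexa"]], 3))
print(toy_X.shape, toy_X_test.shape)
```

Because the vocabulary is fixed at fit time, the training and test matrices always have the same number of columns, which is why the vectoriser is fitted on the training reviews only.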

In [4]:
vectoriser = TfidfVectorizer()

In [5]:
# Transform training data
X = vectoriser.fit_transform(train['review'])
y = train['rating']

In [6]:
X.shape

(6765, 3625)

In [7]:
# Transform test data
X_test = vectoriser.transform(test['review'])
y_test = test['rating']

In [8]:
X_test.shape

(3039, 3625)

## Hyperparameter tuning (Random Forest)

To find the best parameters for our data, we start with hyperparameter tuning using GridSearchCV, which evaluates every combination of the parameter values in the grid with cross-validation and retains the best one.

In [9]:
# Make validation split
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size = 0.15, random_state = 100)

In [12]:
# Create a parameter grid
param_grid = {
    'n_estimators': [50, 100],
    'max_depth': [5, 15, None],
    'min_samples_split': [2, 4, 6],
    'min_samples_leaf': [1, 2, 5],
    'class_weight': ['balanced'],
    'max_features': ['auto', 'sqrt', 'log2']
}
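The number of candidates that the grid search has to evaluate is the product of the list lengths in the grid. As a quick check, the sketch below counts them with scikit-learn's `ParameterGrid`; the grid is repeated under the illustrative name `grid_copy` only so that the snippet runs on its own.

```python
from sklearn.model_selection import ParameterGrid

# Copy of the grid above, repeated so that this snippet is self-contained
grid_copy = {
    'n_estimators': [50, 100],
    'max_depth': [5, 15, None],
    'min_samples_split': [2, 4, 6],
    'min_samples_leaf': [1, 2, 5],
    'class_weight': ['balanced'],
    'max_features': ['auto', 'sqrt', 'log2'],
}

n_candidates = len(ParameterGrid(grid_copy))  # 2 * 3 * 3 * 3 * 1 * 3
n_folds = 5
print(n_candidates, "candidates,", n_candidates * n_folds, "fits with 5-fold CV")
```

As far as we know, in scikit-learn versions that accept `'auto'` for a random forest classifier it behaves like `'sqrt'`, and recent releases no longer accept it, so the grid may need adjusting when rerunning the notebook on a newer installation.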

In [13]:
# Create grid search object
clf = GridSearchCV(RandomForestClassifier(), param_grid = param_grid, cv = 5, verbose=True, n_jobs=-1)

# Fit on data
best_clf = clf.fit(X_train, y_train)

Fitting 5 folds for each of 162 candidates, totalling 810 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:   39.0s
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:   49.2s
[Parallel(n_jobs=-1)]: Done 434 tasks      | elapsed:  1.3min
[Parallel(n_jobs=-1)]: Done 784 tasks      | elapsed:  3.6min
[Parallel(n_jobs=-1)]: Done 810 out of 810 | elapsed:  3.7min finished
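When no `scoring` argument is passed, `GridSearchCV` ranks candidates by the estimator's default score, which for a classifier is accuracy. Since the ratings are unbalanced, one possible variation is to rank candidates by weighted F1 instead. The sketch below illustrates this on synthetic data with a deliberately tiny grid; the data, the grid, and the names `X_toy`, `y_toy`, `small_grid`, and `search` are assumptions made for illustration, not what the notebook ran.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic, unbalanced 3-class data standing in for the review features
X_toy, y_toy = make_classification(
    n_samples=200, n_features=10, n_informative=4, n_classes=3,
    weights=[0.7, 0.2, 0.1], random_state=0,
)

# Deliberately tiny grid so the sketch runs in seconds
small_grid = {'n_estimators': [10], 'max_depth': [3, None]}

search = GridSearchCV(
    RandomForestClassifier(class_weight='balanced', random_state=0),
    param_grid=small_grid,
    scoring='f1_weighted',  # rank candidates by weighted F1 instead of accuracy
    cv=3,
)
search.fit(X_toy, y_toy)
print(search.best_params_, round(search.best_score_, 4))
```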


In [14]:
best_clf.best_params_

{'class_weight': 'balanced',
 'max_depth': None,
 'max_features': 'log2',
 'min_samples_leaf': 1,
 'min_samples_split': 6,
 'n_estimators': 100}

In [15]:
model = best_clf.best_estimator_

In [16]:
preds_val = model.predict(X_val)
preds_train = model.predict(X_train)

In [17]:
print("Training accuracy score: ", np.round(accuracy_score(y_train, preds_train), 4))
print("Validation accuracy score: ", np.round(accuracy_score(y_val, preds_val), 4))

Training accuracy score:  0.9642
Validation accuracy score:  0.7232


In [18]:
print("Training F1 score: ", np.round(f1_score(y_train, preds_train, average='weighted'), 4))
print("Validation F1 score: ", np.round(f1_score(y_val, preds_val, average='weighted'), 4))

Training F1 score:  0.9646
Validation F1 score:  0.6912


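As a result, we got 96% training accuracy (F1 = 96%) and 72% validation accuracy (F1 = 69%). The gap between training and validation scores suggests that the model overfits the training data.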
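To see why we report F1 alongside accuracy on unbalanced ratings, consider a made-up example in which a model always predicts 5 stars. The labels below are invented for illustration; the macro average is included only for contrast and is not used elsewhere in this notebook.

```python
from sklearn.metrics import accuracy_score, f1_score

# Made-up labels: eight 5-star reviews and two 1-star reviews
toy_true = [5] * 8 + [1] * 2
toy_pred = [5] * 10  # a model that always predicts 5 stars

acc = accuracy_score(toy_true, toy_pred)
# zero_division=0 scores the never-predicted class as 0 without a warning
f1_w = f1_score(toy_true, toy_pred, average='weighted', zero_division=0)
f1_m = f1_score(toy_true, toy_pred, average='macro', zero_division=0)
print(acc, round(f1_w, 4), round(f1_m, 4))
```

Accuracy rewards the majority-class guess, while the weighted F1 is lower because the minority class contributes a score of zero; the macro average, which weights all classes equally, is lower still.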

## Refit the best model and predict on test dataset

Based on the hyperparameter tuning results, we refit the model on the 2017 dataset and predict on the 2018 dataset.

In [19]:
model = best_clf.best_estimator_

In [20]:
# Refit the model on the full training set
model.fit(X, y)

RandomForestClassifier(class_weight='balanced', max_features='log2',
                       min_samples_split=6)

In [21]:
preds_test = model.predict(X_test)

In [22]:
print("Accuracy score on the test set: ", np.round(accuracy_score(y_test, preds_test), 4))

Accuracy score on the test set:  0.7206


In [23]:
print("F1 score on the test set: ", np.round(f1_score(y_test, preds_test, average='weighted'), 4))

F1 score on the test set:  0.6827


We got 72% accuracy (F1 = 68%) using TF-IDF encoding and the random forest model.

# 2. Random Forest + BERT embeddings

The next step aims to increase accuracy using the BERT encoding method. BERT considers all the words of an input review simultaneously and uses an attention mechanism to build a contextual representation of each word within the review.
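BERT produces one vector per token, whereas the random forest needs one fixed-length vector per review. This notebook loads precomputed embeddings and does not show how they were pooled, so the sketch below is only an assumption: it averages stand-in token vectors (random numbers, not real BERT output) into a single 768-dimensional review vector, the same width as the embedding table loaded in the next cells.

```python
import numpy as np

rng = np.random.default_rng(0)
EMB_DIM = 768  # matches the 768 feature columns of the embedding table

# Stand-in for per-token BERT vectors of one review with 4 tokens;
# a real encoder would supply these values.
token_vectors = rng.normal(size=(4, EMB_DIM))

# Hypothetical pooling step: average the token vectors into one review vector
review_vector = token_vectors.mean(axis=0)
print(review_vector.shape)
```

Mean pooling is one common choice; other options, such as taking the vector of a special classification token, would yield a vector of the same length.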

## Load data transformed to embeddings

In [46]:
train_emb = pd.read_pickle("data/df_train_emb.pickle")
test_emb = pd.read_pickle('data/df_test_emb.pickle')

In [47]:
train_emb

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,759,760,761,762,763,764,765,766,767,y
0,-0.196070,-0.166101,0.088162,-0.387476,-0.075338,0.145149,-0.100961,0.332107,-0.399594,-0.577685,...,-0.069991,-0.139178,0.020440,-0.012614,0.167302,-0.074398,-0.048064,0.139034,-0.761326,3.0
1,0.375143,0.252748,-0.009002,-0.047845,0.280493,0.355130,-0.615671,0.173091,-0.417215,-0.464209,...,0.238013,-0.144601,-0.116475,0.188066,-0.773876,-0.500712,0.129086,0.544737,-0.042861,4.0
2,0.501979,-0.266838,-0.096103,-0.082397,0.593921,-0.378008,-0.344594,0.807677,-0.599734,-0.235689,...,-0.477558,0.225232,-0.362823,-0.148275,-0.017346,0.071473,0.342333,0.486430,-0.301303,5.0
3,0.386490,0.361879,0.234233,-0.395798,0.935691,-0.320418,0.204268,0.338452,-0.052004,-0.810699,...,-0.915966,1.166752,-0.439389,0.048832,-0.294889,0.536690,-0.957577,-0.063262,-0.469560,5.0
4,-0.209383,0.256669,0.263394,0.202036,0.686160,-0.000283,-0.446947,0.150865,-0.399861,-0.683067,...,-0.089460,-0.091926,-0.183906,0.479365,-0.073143,0.375636,-0.353099,0.043606,-0.477852,5.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6760,0.365414,-0.410754,0.485132,0.269432,0.485766,-0.270238,0.140317,0.333541,-0.844339,-0.108403,...,-0.300737,0.121451,-0.393203,0.184932,-0.273029,0.031431,-0.113239,0.127719,-0.287914,5.0
6761,0.707097,-0.112192,0.695348,0.372509,-0.000643,-0.696345,0.262589,0.827819,-0.913947,-0.246767,...,0.006230,-0.335931,-0.042181,-0.164755,-0.105138,0.244592,-0.068610,-0.113216,0.114406,4.0
6762,0.008660,-0.460067,0.715946,-0.247391,0.500307,-0.152738,0.155582,0.342058,0.291900,-0.137198,...,-0.652499,0.539859,0.035297,-0.174948,0.150594,0.155599,-0.643417,-0.163756,-0.368232,3.0
6763,0.348846,0.031631,0.840641,0.022863,0.702807,-0.170580,0.127256,-0.044578,0.135992,-0.475031,...,-0.768864,0.175243,0.024505,0.225648,0.357531,-0.172759,-0.444898,-0.230627,-0.244909,


In [49]:
train_emb = train_emb.dropna(subset = ['y'])
X_emb = train_emb.drop(columns = ['y'])
y_emb = train_emb['y']

test_emb = test_emb.dropna(subset = ['y'])
X_emb_test = test_emb.drop(columns = ['y'])
y_emb_test = test_emb['y']

## Hyperparameter tuning (Random Forest)

We use exactly the same tuning scheme for the random forest as before, so that we can compare encoding performance.

In [53]:
# Make validation split
X_train_emb, X_val_emb, y_train_emb, y_val_emb = train_test_split(X_emb, y_emb, test_size = 0.15, random_state = 100)

In [54]:
# Create grid search object
clf = GridSearchCV(RandomForestClassifier(), param_grid = param_grid, cv = 5, verbose=True, n_jobs=-1)

# Fit on data
best_clf = clf.fit(X_train_emb, y_train_emb)

Fitting 5 folds for each of 162 candidates, totalling 810 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:   28.1s
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:  2.2min
[Parallel(n_jobs=-1)]: Done 434 tasks      | elapsed:  7.0min
[Parallel(n_jobs=-1)]: Done 784 tasks      | elapsed: 14.7min
[Parallel(n_jobs=-1)]: Done 810 out of 810 | elapsed: 14.9min finished


In [55]:
best_clf.best_params_

{'class_weight': 'balanced',
 'max_depth': None,
 'max_features': 'sqrt',
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'n_estimators': 50}

In [56]:
model = best_clf.best_estimator_

In [57]:
preds_val = model.predict(X_val_emb)
preds_train = model.predict(X_train_emb)

In [58]:
print("Training accuracy score: ", np.round(accuracy_score(y_train_emb, preds_train), 4))
print("Validation accuracy score: ", np.round(accuracy_score(y_val_emb, preds_val), 4))

Training accuracy score:  0.8369
Validation accuracy score:  0.5399


In [59]:
print("Training F1 score: ", np.round(f1_score(y_train_emb, preds_train, average='weighted'), 4))
print("Validation F1 score: ", np.round(f1_score(y_val_emb, preds_val, average='weighted'), 4))

Training F1 score:  0.8492
Validation F1 score:  0.4818


## Refit the best model and predict on test dataset

In [60]:
model = best_clf.best_estimator_

In [61]:
# Refit the model on the full training set
model.fit(X_emb, y_emb)

RandomForestClassifier(class_weight='balanced', max_features='sqrt',
                       n_estimators=50)

In [62]:
preds_test = model.predict(X_emb_test)

In [63]:
print("Accuracy score on the test set: ", np.round(accuracy_score(y_emb_test, preds_test), 4))

Accuracy score on the test set:  0.6472


In [64]:
print("F1 score on the test set: ", np.round(f1_score(y_emb_test, preds_test, average='weighted'), 4))

F1 score on the test set:  0.5867


As a result, BERT embeddings did not improve performance: we got only 65% accuracy (F1 = 59%) on the test set, compared with 72% accuracy (F1 = 68%) for TF-IDF.
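For a side-by-side view, the test-set scores printed in this notebook can be gathered in a small table. The sketch below copies the four reported numbers by hand; the names `results` and `gap` are illustrative.

```python
import pandas as pd

# Test-set scores copied from the cell outputs of this notebook
results = pd.DataFrame(
    {"accuracy": [0.7206, 0.6472], "f1_weighted": [0.6827, 0.5867]},
    index=["TF-IDF", "BERT"],
)

# Difference between the two encodings (TF-IDF minus BERT)
gap = (results.loc["TF-IDF"] - results.loc["BERT"]).round(4)
print(results)
print(gap)
```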