# Predicting User Satisfaction with Amazon Alexa
## Logistic Regression + TF-IDF and BERT encoding for the Star Rating Prediction 
### By Elena Korshakova and Diedre Brown

This notebook compares TF-IDF encoding with BERT encoding as input features for a generalized linear model, logistic regression. We evaluate each encoding by accuracy and F1 score; we report F1 in addition to accuracy because the ratings are unbalanced (skewed towards 5 stars).
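To illustrate why accuracy alone can mislead on skewed ratings, here is a minimal sketch on made-up labels (the `y_true` and `y_pred` values are hypothetical, not taken from the review data). A trivial predictor that always outputs 5 stars scores well on accuracy, while the F1 scores expose the classes it never predicts.

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical ratings: 8 of 10 reviews are 5-star
y_true = [5] * 8 + [1, 3]
# A trivial "model" that always predicts the majority class
y_pred = [5] * 10

acc = accuracy_score(y_true, y_pred)
# zero_division=0 scores classes that are never predicted as 0 without warning
f1_weighted = f1_score(y_true, y_pred, average="weighted", zero_division=0)
f1_macro = f1_score(y_true, y_pred, average="macro", zero_division=0)

print(f"accuracy:    {acc:.4f}")
print(f"weighted F1: {f1_weighted:.4f}")
print(f"macro F1:    {f1_macro:.4f}")
```

The weighted average used later in this notebook weights each class by its support, so it is still dominated by the 5-star class; the macro average treats every class equally and drops much further.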

## Import Libraries

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, accuracy_score
from sklearn.model_selection import GridSearchCV
from bert_embedding import BertEmbedding

In [2]:
import warnings
warnings.filterwarnings('ignore')

## Load preprocessed data

In [3]:
train = pd.read_pickle("data/df_train.pickle")
test = pd.read_pickle("data/df_test.pickle")

In [4]:
train

| | review | rating |
|---|---|---|
| 0 | great speaker | 3 |
| 1 | great little | 4 |
| 2 | awesome | 5 |
| 3 | love | 5 |
| 4 | great device | 5 |
| ... | ... | ... |
| 6850 | fun love | 5 |
| 6851 | lot fun | 5 |
| 6852 | buy gift husband problem set want return past ... | 3 |
| 6853 | set control light home thermostat love able se... | 5 |


# 1. Logistic Regression + TF-IDF features

## Transform reviews into features (TF-IDF encoding)

We first transform the reviews into features using TF-IDF, a term-weighting method intended to reflect how important a word is to a review. The TF-IDF value increases proportionally to the number of times a word appears in the review and is offset by the number of reviews in the corpus that contain the word, which adjusts for the fact that some words appear more frequently in general.
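The weighting described above can be seen on a tiny corpus. This is a sketch with three made-up reviews (`toy_reviews` and the other `toy_` names are hypothetical), using the default `TfidfVectorizer` settings: a word that appears in every review ("great") gets the lowest inverse document frequency and a smaller weight than a rarer word in the same review.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus: "great" occurs in every review
toy_reviews = ["great speaker", "great sound", "great device love"]

toy_vectoriser = TfidfVectorizer()
toy_X = toy_vectoriser.fit_transform(toy_reviews)

vocab = toy_vectoriser.vocabulary_   # word -> column index
idf = toy_vectoriser.idf_            # learned inverse document frequencies

# Weights of the two words in the first review ("great speaker")
first_review = toy_X[0].toarray().ravel()
great_weight = first_review[vocab["great"]]
speaker_weight = first_review[vocab["speaker"]]

print("idf(great)   =", round(idf[vocab["great"]], 4))
print("idf(speaker) =", round(idf[vocab["speaker"]], 4))
print("weights in review 0:", round(great_weight, 4), round(speaker_weight, 4))
```

Both words occur once in the first review, so the difference in their weights comes entirely from how many reviews contain each word.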

In [5]:
vectoriser = TfidfVectorizer()

In [6]:
# Transform training data
X = vectoriser.fit_transform(train['review'])
y = train['rating']

In [7]:
X.shape

(6765, 3625)

In [8]:
# Transform test data
X_test = vectoriser.transform(test['review'])
y_test = test['rating']

In [9]:
X_test.shape

(3039, 3625)

## Hyperparameter tuning (Logistic Regression)

To find the best parameters for our data, we start with hyperparameter tuning: GridSearchCV evaluates every combination of parameter values in the grid and retains the best one.
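As a minimal sketch of this search, the example below runs `GridSearchCV` on synthetic binary data (the `toy_` names and the data are assumptions for illustration, not the review features). It passes the grid as a list of dictionaries so that each solver is only paired with penalties it supports, which is one way to avoid spending fits on unsupported combinations.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (binary labels), not the Alexa reviews
toy_X, toy_y = make_classification(n_samples=200, n_features=10, random_state=0)

# One dictionary per solver, listing only the penalties that solver supports
toy_grid = [
    {"solver": ["lbfgs"], "penalty": ["l2"], "C": [0.1, 1.0]},
    {"solver": ["liblinear"], "penalty": ["l1", "l2"], "C": [0.1, 1.0]},
]

toy_search = GridSearchCV(LogisticRegression(max_iter=500), param_grid=toy_grid, cv=3)
toy_search.fit(toy_X, toy_y)

# Every candidate is cross-validated; the best one is kept
n_candidates = len(toy_search.cv_results_["params"])
print("candidates evaluated:", n_candidates)
print("best parameters:", toy_search.best_params_)
```

With the default `refit=True`, `best_estimator_` is already refit on all the data passed to `fit` using the best parameters.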

In [10]:
# Make validation split
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size = 0.15, random_state = 100)

In [11]:
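# Create a parameter grid
# Note: not every penalty/solver pair is supported by scikit-learn
# (lbfgs: l2 only; liblinear: l1 and l2). With the default error_score,
# unsupported pairs get a NaN score and are never selected as best.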
param_grid = {
    'penalty': ['l1', 'l2', 'elasticnet'],
    'C': np.arange(0.1, 5, 0.2),
    'solver': ['lbfgs', 'liblinear'],
    'class_weight': ['balanced']   
}

In [12]:
# Create grid search object
clf = GridSearchCV(LogisticRegression(max_iter = 500), param_grid = param_grid, cv = 5, n_jobs=-1)

# Fit on data
best_clf = clf.fit(X_train, y_train)

In [13]:
best_clf.best_params_

{'C': 0.30000000000000004,
 'class_weight': 'balanced',
 'penalty': 'l2',
 'solver': 'liblinear'}
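The selected `class_weight='balanced'` setting reweights each class inversely to its frequency, as `n_samples / (n_classes * class_count)`. The sketch below checks this on made-up ratings (`toy_ratings` is hypothetical) with scikit-learn's `compute_class_weight` helper.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical skewed ratings: six 5-star, two 4-star, two 1-star
toy_ratings = np.array([5] * 6 + [4] * 2 + [1] * 2)
classes = np.unique(toy_ratings)

# Weights scikit-learn derives for class_weight='balanced'
weights = compute_class_weight(class_weight="balanced", classes=classes, y=toy_ratings)

# The same weights computed by hand: n_samples / (n_classes * class_count)
counts = np.array([(toy_ratings == c).sum() for c in classes])
manual = len(toy_ratings) / (len(classes) * counts)

for c, w in zip(classes, weights):
    print(f"rating {c}: weight {w:.4f}")
```

The frequent 5-star class receives the smallest weight, so errors on rarer ratings count for more during fitting.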

In [14]:
model = best_clf.best_estimator_

In [15]:
preds_val = model.predict(X_val)
preds_train = model.predict(X_train)

In [16]:
print("Training accuracy score: ", np.round(accuracy_score(y_train, preds_train), 4))
print("Validation accuracy score: ", np.round(accuracy_score(y_val, preds_val), 4))

Training accuracy score:  0.7689
Validation accuracy score:  0.7133


In [17]:
print("Training F1 score: ", np.round(f1_score(y_train, preds_train, average='weighted'), 4))
print("Validation F1 score: ", np.round(f1_score(y_val, preds_val, average='weighted'), 4))

Training F1 score:  0.7566
Validation F1 score:  0.7001


As a result, we got 77% training accuracy (F1 = 76%) and 71% validation accuracy (F1 = 70%).

## Refit the best model and predict on test dataset

Based on the hyperparameter tuning results, we refit the model on the 2017 dataset and predict on the 2018 dataset.

In [18]:
model = best_clf.best_estimator_

In [19]:
# Refit the model on the full training set
model.fit(X, y)

LogisticRegression(C=0.30000000000000004, class_weight='balanced', max_iter=500,
                   solver='liblinear')

In [20]:
preds_test = model.predict(X_test)

In [21]:
print("Accuracy score on the test set: ", np.round(accuracy_score(y_test, preds_test), 4))

Accuracy score on the test set:  0.6779


In [22]:
print("F1 score on the test set: ", np.round(f1_score(y_test, preds_test, average='weighted'), 4))

F1 score on the test set:  0.6834


We got 68% accuracy (F1 = 68%) on the test set using TF-IDF encoding and a logistic regression model.

# 2. Logistic Regression + BERT embeddings

The next step aims to increase accuracy using the BERT encoding method. BERT considers all the words of an input review simultaneously and uses an attention mechanism to derive a contextual meaning for each word within the review.

## Transform reviews into features (embeddings)

In [23]:
train

| | review | rating |
|---|---|---|
| 0 | great speaker | 3 |
| 1 | great little | 4 |
| 2 | awesome | 5 |
| 3 | love | 5 |
| 4 | great device | 5 |
| ... | ... | ... |
| 6850 | fun love | 5 |
| 6851 | lot fun | 5 |
| 6852 | buy gift husband problem set want return past ... | 3 |
| 6853 | set control light home thermostat love able se... | 5 |


In [24]:
bert_embedding = BertEmbedding()

In [25]:
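def get_embedding(review):
    """Return the mean of the word embeddings for a review"""
    # bert_embedding expects a list of sentences; [0] selects the first one
    # and [1:] keeps its token embeddings (dropping the token strings)
    row_embeddings = bert_embedding(review.split('\n'))[0][1:]
    # Average over the token axis to get one vector per review
    avg_embedding = np.mean(row_embeddings, axis=1)[0].tolist()
    return avg_embedding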
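The averaging step in `get_embedding` is a mean over token vectors. The sketch below reproduces it with random vectors standing in for BERT output (the 3-token, 4-dimensional `token_vectors` are hypothetical; the real embeddings have 768 dimensions) and shows that it matches a plain mean over the token axis.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-in for BERT output: 3 tokens, 4-dimensional vectors
token_vectors = [rng.normal(size=4) for _ in range(3)]

# Same shape as in get_embedding: a 1-tuple holding the list of token vectors,
# which NumPy treats as an array of shape (1, n_tokens, dim)
row_embeddings = (token_vectors,)
pooled = np.mean(row_embeddings, axis=1)[0]

# Equivalent, more direct form: stack the tokens and average over axis 0
direct = np.mean(np.stack(token_vectors), axis=0)

print(pooled.shape)
print(np.allclose(pooled, direct))
```

Mean pooling gives every review a fixed-length vector regardless of its word count, which is what lets the embeddings be fed to logistic regression.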

In [26]:
%%time
# Get embeddings for training set
X_emb = np.array(list(train['review'].apply(get_embedding)))
y_emb = train['rating']

CPU times: user 1h 18min 39s, sys: 4min 16s, total: 1h 22min 56s
Wall time: 19min 21s


In [27]:
X_emb.shape

(6765, 768)

In [28]:
# Save train features to pickle file
train_emb = pd.DataFrame(X_emb)
train_emb['y'] = y_emb.values  # assign by position, not by index label
train_emb.to_pickle("data/df_train_emb.pickle")

In [29]:
%%time
# Get embeddings for test set
X_emb_test = np.array(list(test['review'].apply(get_embedding)))
y_emb_test = test['rating']

CPU times: user 32min 11s, sys: 1min 39s, total: 33min 50s
Wall time: 7min 54s


In [30]:
X_emb_test.shape

(3039, 768)

In [31]:
# Save test features to pickle file
test_emb = pd.DataFrame(X_emb_test)
test_emb['y'] = y_emb_test.values  # assign by position, not by index label
test_emb.to_pickle("data/df_test_emb.pickle")
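Because computing the embeddings takes several minutes, the saved pickle files can be reloaded in a later session instead of rerunning BERT. The following is a minimal round-trip sketch on a small made-up matrix (the `toy_` names and the temporary file path are assumptions); the same pattern would apply to `data/df_train_emb.pickle` and `data/df_test_emb.pickle`.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical small feature matrix and labels standing in for X_emb / y_emb
toy_X = np.arange(12, dtype=float).reshape(4, 3)
toy_y = pd.Series([5, 4, 5, 1])

# Same layout as the saved files: feature columns plus a 'y' label column
toy_emb = pd.DataFrame(toy_X)
toy_emb["y"] = toy_y.to_numpy()  # assign labels by position

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_emb.pickle")
    toy_emb.to_pickle(path)
    loaded = pd.read_pickle(path)

# Split the reloaded frame back into features and labels
X_loaded = loaded.drop(columns="y").to_numpy()
y_loaded = loaded["y"]

print(X_loaded.shape, list(y_loaded))
```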

## Hyperparameter tuning (Logistic Regression)

We use the same tuning procedure for the logistic regression so that the two encodings can be compared. The setup differs in using the saga solver, a coarser step for C, and max_iter = 100.

In [32]:
# Make validation split
X_train_emb, X_val_emb, y_train_emb, y_val_emb = train_test_split(X_emb, y_emb, test_size = 0.15, random_state = 100)

In [33]:
X_train_emb.shape

(5750, 768)

In [34]:
# Create a parameter grid
param_grid = {
    'penalty': ['l1', 'l2', 'elasticnet'],
    'C': np.arange(0.1, 5, 0.5),
    'solver': ['saga'],
    'class_weight': ['balanced']   
}

In [35]:
# Create grid search object
clf = GridSearchCV(LogisticRegression(max_iter = 100), param_grid = param_grid, cv = 5, n_jobs=-1)

# Fit on data
best_clf = clf.fit(X_train_emb, y_train_emb)

In [36]:
best_clf.best_params_

{'C': 1.1, 'class_weight': 'balanced', 'penalty': 'l2', 'solver': 'saga'}

In [37]:
model = best_clf.best_estimator_

In [38]:
preds_val = model.predict(X_val_emb)
preds_train = model.predict(X_train_emb)

In [39]:
print("Training accuracy score: ", np.round(accuracy_score(y_train_emb, preds_train), 4))
print("Validation accuracy score: ", np.round(accuracy_score(y_val_emb, preds_val), 4))

Training accuracy score:  0.7303
Validation accuracy score:  0.5675


In [40]:
print("Training F1 score: ", np.round(f1_score(y_train_emb, preds_train, average='weighted'), 4))
print("Validation F1 score: ", np.round(f1_score(y_val_emb, preds_val, average='weighted'), 4))

Training F1 score:  0.744
Validation F1 score:  0.6094


## Refit the best model and predict on test dataset

In [41]:
model = best_clf.best_estimator_

In [42]:
# Refit the model on the full training set
model.fit(X_emb, y_emb)

LogisticRegression(C=1.1, class_weight='balanced', solver='saga')

In [43]:
preds_test = model.predict(X_emb_test)

In [44]:
print("Accuracy score on the test set: ", np.round(accuracy_score(y_emb_test, preds_test), 4))

Accuracy score on the test set:  0.4962


In [45]:
print("F1 score on the test set: ", np.round(f1_score(y_emb_test, preds_test, average='weighted'), 4))

F1 score on the test set:  0.5635


As a result, BERT embeddings did not improve performance: we got only 50% accuracy (F1 = 56%) on the test set.