# Score Prediction
This script trials two modelling approaches, Random Forest and Naive Bayes classifiers, to predict review scores from the review text.

This script is almost identical to the main script, with some extra code that lets us run the notebook on GCP's Workbench (local run times were too long). The additional code imports the raw data from a GCP Bucket, saves processed data back to it, and selects only a subset of the data for demonstration purposes, since run times are long even on GCP.

In [1]:
import numpy as np
import os
import pandas as pd
import pickle
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit
from sklearn.naive_bayes import MultinomialNB
from sklearn.utils.class_weight import compute_class_weight
from google.cloud import storage
import io

First we import data from our GCS Bucket.

In [2]:
bucket_name = "bf-review-nlp"
blob_name = "processed_data/processed_df.pickle"
storage_client = storage.Client()
bucket = storage_client.bucket(bucket_name)
blob = bucket.blob(blob_name)
pickle_data = blob.download_as_bytes()
pickle_file = io.BytesIO(pickle_data)
df_raw = pickle.load(pickle_file)

Even on GCP's Workbench, run times were extremely long using the full ~1.5m-record dataset. For demonstration purposes, this code takes a stratified sample comprising 5% of the total rows (~79,000 reviews).

We use a stratified split because the classes (review scores 1*-5*) are very unbalanced, as shown in our EDA script.

In [3]:
# Trim dataset for runtime

X_raw = df_raw["stemmedText"]
y_raw = df_raw["overall"]
stratified_split = StratifiedShuffleSplit(n_splits=1, test_size=0.05, random_state=42)
# Keep the 5% "test" side of the split as our stratified sample
_, idx = next(stratified_split.split(X_raw, y_raw))

df = df_raw.iloc[idx]
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 79122 entries, 256187 to 892717
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   overall      79122 non-null  category
 1   verified     79122 non-null  bool    
 2   style        65261 non-null  object  
 3   reviewText   79122 non-null  object  
 4   stemmedText  79122 non-null  object  
dtypes: bool(1), category(1), object(3)
memory usage: 2.6+ MB
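
As a rough illustration of why the stratified split matters, the sketch below draws a 5% sample from synthetic, unbalanced labels and compares the class shares before and after. The label proportions and the names `y_demo`, `X_demo` and `sample_idx` are made up for this example and are not taken from the review data.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic, deliberately unbalanced labels (illustrative only, not the review data)
rng = np.random.default_rng(42)
y_demo = rng.choice([1, 2, 3, 4, 5], size=10_000, p=[0.05, 0.05, 0.10, 0.20, 0.60])
X_demo = np.zeros((len(y_demo), 1))  # the features play no part in the split itself

splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.05, random_state=42)
_, sample_idx = next(splitter.split(X_demo, y_demo))

# Share of each class (1-5) in the full data and in the 5% sample
full_share = np.bincount(y_demo, minlength=6)[1:] / len(y_demo)
sample_share = np.bincount(y_demo[sample_idx], minlength=6)[1:] / len(sample_idx)
print(np.round(full_share, 3))
print(np.round(sample_share, 3))
```

The two printed rows should be close to each other, which is the property we rely on when shrinking the dataset.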


This function saves the per-class metrics of each model to a .csv file after the model has been run, so that we can compare them later.

In [1]:
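# Output classification_report() as .csv
def csv_report(report, name):
    # Parse the plain-text report: skip the header and blank line at the top,
    # and the blank line and summary rows (accuracy, macro avg, weighted avg) at the bottom.
    lines = report.split("\n")
    # Drop the first token of each row (the class label), keeping the four metrics
    data = [line.split()[1:] for line in lines[2:-5]]
    columns = ["precision", "recall", "f1-score", "support"]
    df = pd.DataFrame(data, columns=columns)

    output_csv_path = name+".csv"
    df.to_csv(output_csv_path)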
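
Parsing the text report by line position is sensitive to the report's layout. A possible alternative, sketched here on toy labels rather than the notebook's data, is to ask `classification_report` for a dictionary via `output_dict=True` and build the table from that. The names `y_true`, `y_hat`, `report_df` and `per_class` are illustrative.

```python
import pandas as pd
from sklearn.metrics import classification_report

# Toy labels, purely for illustration
y_true = [1, 1, 5, 5, 5, 5]
y_hat = [1, 5, 5, 5, 5, 5]

# output_dict=True returns nested dicts instead of a formatted string
report_dict = classification_report(y_true, y_hat, output_dict=True, zero_division=0)
report_df = pd.DataFrame(report_dict).transpose()

# Keep only the per-class rows (drop the accuracy / macro avg / weighted avg summaries)
per_class = report_df.drop(index=["accuracy", "macro avg", "weighted avg"])
print(per_class)
```

This version also keeps the class labels as the index, which the text-parsing approach discards.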

Here we transform our review text into an n-gram count matrix (considering only unigrams and bigrams). We also limit the matrix to contain the 1000 most common n-grams just for simplicity and run time. A more advanced model could consider a much larger dictionary.

Once again we use a stratified train-test split due to the severely unbalanced classes we saw in our EDA.

In [5]:
# Unigrams and bigrams, limited to the 1000 most common n-grams
vectorizer = CountVectorizer(ngram_range=(1, 2), max_features=1000)
X = vectorizer.fit_transform(df["stemmedText"])
y = df["overall"]

stratified_split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(stratified_split.split(X, y))
X_train = X[train_idx]
X_test = X[test_idx]
y_train = y.iloc[train_idx]
y_test = y.iloc[test_idx]
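
To make the n-gram count matrix more concrete, here is a minimal sketch on a three-document toy corpus (invented for this example) with the vocabulary capped at 5 terms instead of 1000.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus standing in for the stemmed review text
docs = [
    "great product works great",
    "bad product broke fast",
    "great value great product",
]

# Unigrams and bigrams, capped at the 5 most frequent terms across the corpus
toy_vectorizer = CountVectorizer(ngram_range=(1, 2), max_features=5)
X_toy = toy_vectorizer.fit_transform(docs)

print(X_toy.shape)                         # one row per document, one column per kept term
print(sorted(toy_vectorizer.vocabulary_))  # the terms that survived the cap
```

The most frequent terms, such as "great", "product" and the bigram "great product", are kept, and rarer terms are dropped.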

Now we train a number of classification models, and output some evaluation metrics using the function we defined above. We will try using the dataset as-is, as well as using class weights to improve performance on highly unbalanced classes.

In all cases we will use f1-score as our primary metric, again because we want to assess performance on unbalanced classes.
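
The following sketch, using made-up labels, shows why the averaging method matters on unbalanced classes: a classifier that always predicts the majority class still gets a respectable weighted-average f1-score, while the macro average exposes the ignored minority class.

```python
from sklearn.metrics import f1_score

# 8 five-star reviews and 2 one-star reviews; the "model" always predicts 5
y_true_demo = [5] * 8 + [1] * 2
y_pred_demo = [5] * 10

# zero_division=0 scores the never-predicted class as 0 without raising a warning
macro = f1_score(y_true_demo, y_pred_demo, average="macro", zero_division=0)
weighted = f1_score(y_true_demo, y_pred_demo, average="weighted", zero_division=0)

print(round(macro, 3), round(weighted, 3))
```

Here the macro average is roughly 0.44 and the weighted average roughly 0.71, even though the one-star class is never predicted.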

The first model we trial is a Random Forest classifier, using a grid search to tune hyperparameters. The GridSearchCV function has k-fold cross-validation built in for more reliable evaluation.

In [6]:
# We could do a much more comprehensive parameter search, but due to computation time, we'll keep it small.
rf = RandomForestClassifier(random_state=42)
rf_params = {"max_depth": [None, 10],
             "min_samples_split": [10, 20],
             "min_samples_leaf": [4, 8]}

rf_grid = GridSearchCV(rf, rf_params, cv=5, scoring="f1_macro")
rf_grid.fit(X_train, y_train)
rf_best_params = rf_grid.best_params_

best_rf = RandomForestClassifier(**rf_best_params, random_state=42)
best_rf.fit(X_train, y_train)
rf_pred = best_rf.predict(X_test)
rf_report = classification_report(y_test, rf_pred)
csv_report(rf_report, "rf")

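
As a side note on the grid search, `GridSearchCV` refits the best parameter combination on the full training set by default (`refit=True`), so the fitted model is also available as `best_estimator_`. The sketch below shows this on synthetic data from `make_classification`. The data, the smaller grid and `n_estimators=20` are assumptions chosen to keep the example quick, not the notebook's settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Small synthetic 3-class problem (illustrative only)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=10, n_informative=5, n_classes=3, random_state=42
)

grid = GridSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=42),
    {"max_depth": [None, 5], "min_samples_leaf": [1, 4]},
    cv=3,
    scoring="f1_macro",
)
grid.fit(X_demo, y_demo)

# With refit=True (the default), the winning configuration is already fitted on all of X_demo
best_model = grid.best_estimator_
print(grid.best_params_, round(grid.best_score_, 3))
```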


We will re-use the hyperparameters chosen above to train another Random Forest classifier using class weights for comparison. Ideally we would re-tune hyperparameters, but this is simpler for demonstration purposes.

An alternative approach could be oversampling, but class weighting is less computationally expensive and run time is a major concern here.

In [7]:
# Random Forest classifier with class weights.
# Key each balanced weight by its own class label (labels in sorted order).
classes = np.unique(y_train)
class_weights = compute_class_weight("balanced", classes=classes, y=y_train)
class_weight_dict = dict(zip(classes, class_weights))

rf_weighted = RandomForestClassifier(**rf_best_params, class_weight=class_weight_dict, random_state=42)
rf_weighted.fit(X_train, y_train)
rf_weighted_pred = rf_weighted.predict(X_test)
rf_weighted_report = classification_report(y_test, rf_weighted_pred)
csv_report(rf_weighted_report, "rf_weighted")
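
For reference, the "balanced" weights follow the formula `n_samples / (n_classes * count_of_class)`, so rarer classes receive larger weights. The sketch below checks this by hand on a toy label vector (invented for this example) and shows one way to map each weight to its label.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels: class 1 is three times as common as classes 3 and 5
y_demo = np.array([1] * 6 + [3] * 2 + [5] * 2)
classes = np.unique(y_demo)  # sorted labels: [1, 3, 5]

weights = compute_class_weight("balanced", classes=classes, y=y_demo)

# The same calculation by hand: n_samples / (n_classes * per-class count)
counts = np.array([np.sum(y_demo == c) for c in classes])
manual = len(y_demo) / (len(classes) * counts)

# Pair each weight with the label it belongs to
weight_by_label = dict(zip(classes, weights))
print(weight_by_label)
```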

We repeat the above steps using a Multinomial Naive Bayes classifier. First we train a model using a grid search for hyperparameter tuning, then we re-use the best performing parameters to train a second NB classifier using class weights.

In [8]:
nb = MultinomialNB()
nb_params = {"alpha": [0.1, 1, 10],
             "fit_prior": [True, False]}

nb_grid = GridSearchCV(nb, nb_params, cv=5, scoring="f1_macro")
nb_grid.fit(X_train, y_train)
nb_best_params = nb_grid.best_params_

best_nb = MultinomialNB(**nb_best_params)
best_nb.fit(X_train, y_train)
nb_pred = best_nb.predict(X_test)
nb_report = classification_report(y_test, nb_pred)
csv_report(nb_report, "nb")

In [9]:
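# Manually set the class prior probabilities from the class weights, normalised to sum to 1.
# class_prior must follow the sorted order of the class labels (nb_weighted.classes_).
nb_weighted = MultinomialNB(**nb_best_params, class_prior=class_weights / class_weights.sum())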
nb_weighted.fit(X_train, y_train)
nb_weighted_pred = nb_weighted.predict(X_test)
nb_weighted_report = classification_report(y_test, nb_weighted_pred)
csv_report(nb_weighted_report, "nb_weighted")
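
The sketch below, on a tiny invented count matrix, illustrates how `MultinomialNB` uses `class_prior`: the values are matched to the sorted class labels in `classes_` and stored as log-probabilities in `class_log_prior_`. The prior values here are arbitrary.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Four toy documents with two term counts each (illustrative only)
X_counts = np.array([[3, 0], [2, 1], [0, 4], [1, 3]])
y_demo = np.array([5, 5, 1, 1])

# Priors are given in the order of the sorted labels, i.e. for classes [1, 5]
priors = np.array([0.7, 0.3])

nb_demo = MultinomialNB(class_prior=priors).fit(X_counts, y_demo)
print(nb_demo.classes_)                   # the label order the priors are matched against
print(np.exp(nb_demo.class_log_prior_))   # recovers the priors passed in
```

Because the priors are matched by position, passing them in any order other than that of `classes_` would silently attach them to the wrong classes.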

Now we can compare model performance using the output .csv files.

Let's look at macro and weighted-average f1-scores.

In [None]:
rf_report = pd.read_csv("rf.csv")
rf_weighted_report = pd.read_csv("rf_weighted.csv")
nb_report = pd.read_csv("nb.csv")
nb_weighted_report = pd.read_csv("nb_weighted.csv")

In [None]:
def weighted_f1(df, name):
    # Support-weighted mean of the per-class f1-scores
    weighted_average_f1 = (df["f1-score"] * df["support"]).sum() / df["support"].sum()
    print(name, ": ", weighted_average_f1)

def macro_f1(df, name):
    # Unweighted mean of the per-class f1-scores
    score = df["f1-score"].mean()
    print(name, ": ", score)
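
As a quick check of the arithmetic in these two helpers, here is a worked example on a two-class toy report (the numbers are invented).

```python
import pandas as pd

# Toy per-class report: a rare class scored poorly, a common class scored well
toy_report = pd.DataFrame({"f1-score": [0.2, 0.8], "support": [10, 90]})

# Macro: plain mean of the per-class scores -> (0.2 + 0.8) / 2
macro = toy_report["f1-score"].mean()

# Weighted: each score weighted by its support -> (0.2 * 10 + 0.8 * 90) / 100
weighted = (toy_report["f1-score"] * toy_report["support"]).sum() / toy_report["support"].sum()

print(macro, weighted)
```

The macro average (0.5) treats both classes equally, while the weighted average (0.74) is pulled towards the larger class.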

In [None]:
# Weighted averages
weighted_f1(rf_report, "rf")
weighted_f1(rf_weighted_report, "rf_weighted")
weighted_f1(nb_report, "nb")
weighted_f1(nb_weighted_report, "nb_weighted")

# Macro scores
macro_f1(rf_report, "rf")
macro_f1(rf_weighted_report, "rf_weighted")
macro_f1(nb_report, "nb")
macro_f1(nb_weighted_report, "nb_weighted")

The unweighted Naive Bayes classifier has the best weighted- and macro-average f1-scores, so we could call it our "overall" best performer.

However, this depends on our use for the model. For example, if we specifically wanted to distinguish between 4* and 5* reviews (see the "Additional Insights.py" script!), we might prefer the weighted Random Forest model, which performs better on those two classes.