# Sentiment Analysis Model

change current directory and import packages

In [1]:
%load_ext lab_black
import os
import numpy as np

In [2]:
if not os.path.exists("/sentiment_analysis"):
    os.chdir("..")

import packages

In [83]:
from sentiment_analysis.utils.train_test_split import TrainTestSplit
from sentiment_analysis.models.model import StreamlinedModel
from sentiment_analysis.features.word_frequencies import WordFrequencyVectorizer
from sentiment_analysis.data.review_processor import ReviewProcessor
from sklearn.metrics import roc_auc_score
import lightgbm as lgb
import shap

Perform train test split on the reviews data

In [4]:
X_train, y_train, X_test, y_test = TrainTestSplit().get_split_data()

Build Word Frequency Vectors and lightGBM model and train using the training data

In [5]:
lightgbm = StreamlinedModel(
    transformer_description="word frequency vector",
    transformer=WordFrequencyVectorizer,
    model_description="LightGBM model",
    model=lgb.LGBMClassifier,
    model_params={
        "application": "binary",
        "objective": "binary",
        "metric": "auc",
        "is_unbalance": "false",
        "boosting": "gbdt",
        "num_leaves": 31,
        "feature_fraction": 0.06,
        "bagging_fraction": 0.67,
        "bagging_freq": 1,
        "learning_rate": 0.05,
        "verbose_eval": 0,
        "n_estimators": 2000,
        "n_jobs": 6,
    },
)

train the streamlined model

In [6]:
lightgbm.train(X_train, y_train)

get scores

In [7]:
print("Train accuracy:", lightgbm.score(X_train, y_train))
print("Test accuracy:", lightgbm.score(X_test, y_test))

Train accuracy: 0.9993055555555556
Test accuracy: 0.81125


get predictions and predicted probabilities

In [8]:
y_pred = lightgbm.predict(X_test)
y_prob = lightgbm.predict_proba(X_test)

In [13]:
print("Test ROC:", roc_auc_score(y_test, y_prob[:, 1]))

Test ROC: 0.8921375


We were able to achieve some quite amazing results with LightGBM model. The next two steps we can have a look are

1. Check the most wrong positive and negative reviews
2. use SHAP to understand which word are more likely to lead to lightGBM 

**Most wrong positive/negative reviews**

get the raw review texts and rank by smallest distance of 0/1

In [72]:
wrong_positive_inds = np.where((y_test == 1) & (y_pred != y_test))[0]
wrong_negative_inds = np.where((y_test == 0) & (y_pred != y_test))[0]

In [53]:
most_wrong_positive_index = y_prob[:, 1][wrong_positive_inds].argmin()
most_wrong_negative_index = y_prob[:, 1][wrong_negative_inds].argmax()

In [81]:
print("Most wrong positive review: \n")

print(np.array(X_test)[most_wrong_positive_index])

Most wrong positive review: 


Before I saw this sequel, I had heard and read that it was a terrible film. However, Be Cool is still an enjoyable comedy even if it's not as good as the original. It's full of so much self-deprecating humor that you can't help but cut it a break. I laughed quite a few times, and there are a lot of fun moments.

However, the film is a bit uneven. The movie tries hard to find the humor in hit men and quirky characters, but it often seems to be trying a bit too hard. John Travolta reprises his role as Chili Palmer, Hollywood gangster, who now turns his eye to the music business. Using his "negotiation skills," he tries to run an independent record label with the wife of a murdered friend, played by Uma Thurman, and try to get his young singer (Christina Milian) a hit record.

An impressive cast is what saves the movie from sinking. The best performance is given by The Rock and it's nice to see him in a different type of role. Vince Vaughn also gives a funny

In [82]:
print("Most wrong negative review: \n")

print(np.array(X_test)[most_wrong_negative_index])

Most wrong negative review: 


Theres not alot I can say about measuring spoons really. They are made to a high standard, durable, easy to clean and dishwasher safe. At this price I dont see why you would need to pay more for a set of measuring spoons unless it was for decorative reasons



As we can see, the most wrong positive/negative reviews would be pretty tough to get right. Even with human