# Sentiment Analysis with XGBoost on Algorithmia

In this notebook, we train an XGBoost model on Amazon's Musical Instrument Reviews dataset and use it to predict the sentiment of a given text. If you would like to see the final product first, you can check out this algorithm in action at https://algorithmia.com/algorithms/asli/xgboost_basic_sentiment_analysis

## Overview

In this notebook, we will go through the following steps:

1. Load the training data

2. Preprocess the data

3. Set up an XGBoost model and do a mini hyperparameter search

4. Fit our model to the data

5. Get the predictions

6. Check the accuracy

7. Pickle the final model

## Building the XGBoost model

In [None]:
!pip3 install -r notebook_requirements.txt

In [6]:
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.metrics import accuracy_score
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline

from string import punctuation
from nltk.corpus import stopwords

from xgboost import XGBClassifier
import pandas as pd
import numpy as np
import joblib

### Load the training data
Let's load our training data, take a look at a few rows and one of the review texts in detail.

In [7]:
data = pd.read_csv("./data/amazon_musical_reviews/Musical_instruments_reviews.csv")
data.head()

Unnamed: 0,reviewerID,asin,reviewerName,helpful,reviewText,overall,summary,unixReviewTime,reviewTime
0,A2IBPI20UZIR0U,1384719342,"cassandra tu ""Yeah, well, that's just like, u...","[0, 0]","Not much to write about here, but it does exac...",5.0,good,1393545600,"02 28, 2014"
1,A14VAT5EAX3D9S,1384719342,Jake,"[13, 14]",The product does exactly as it should and is q...,5.0,Jake,1363392000,"03 16, 2013"
2,A195EZSQDW3E21,1384719342,"Rick Bennette ""Rick Bennette""","[1, 1]",The primary job of this device is to block the...,5.0,It Does The Job Well,1377648000,"08 28, 2013"
3,A2C00NNG1ZQQG2,1384719342,"RustyBill ""Sunday Rocker""","[0, 0]",Nice windscreen protects my MXL mic and preven...,5.0,GOOD WINDSCREEN FOR THE MONEY,1392336000,"02 14, 2014"
4,A94QU4C90B1AX,1384719342,SEAN MASLANKA,"[0, 0]",This pop filter is great. It looks and perform...,5.0,No more pops when I record my vocals.,1392940800,"02 21, 2014"


In [8]:
data["reviewText"].iloc[1]

"The product does exactly as it should and is quite affordable.I did not realized it was double screened until it arrived, so it was even better than I had expected.As an added bonus, one of the screens carries a small hint of the smell of an old grape candy I used to buy, so for reminiscent's sake, I cannot stop putting the pop filter next to my nose and smelling it after recording. :DIf you needed a pop filter, this will work just as well as the expensive ones, and it may even come with a pleasing aroma like mine did!Buy this product! :]"

### Preprocessing
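Time to process our texts! We'll:
- Combine each review's text and summary into a single review column
- Remove the English stopwords
- Remove punctuation
- Convert the overall star rating into a binary label (0 for ratings of 3 or below, 1 otherwise)
- Drop unused columns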

In [9]:
import nltk
nltk.download('stopwords')

def threshold_ratings(data):
    def threshold_overall_rating(rating):
        return 0 if int(rating)<=3 else 1
    data["overall"] = data["overall"].apply(threshold_overall_rating)

def remove_stopwords_punctuation(data):
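    # Join with a space so the last word of the review text does not fuse with the first word of the summary.
    data["review"] = data["reviewText"] + " " + data["summary"]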

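    # Sets give constant-time membership checks.
    puncs = set(punctuation)
    stops = set(stopwords.words("english"))

    def remove_stopwords_in_str(input_str):
        # Split on whitespace and keep only the tokens that are not stopwords.
        filtered = [word for word in str(input_str).split() if word not in stops]
        return ' '.join(filtered)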

    def remove_punc_in_str(input_str):
        filtered = [char for char in input_str if char not in puncs]
        return ''.join(filtered)

    # Lowercase once, then drop stopwords first and punctuation second.
    # Series.apply avoids label-based indexing, so it works with any index.
    lowered = data["review"].str.lower()
    without_stops = lowered.apply(remove_stopwords_in_str)
    data["review"] = without_stops.apply(remove_punc_in_str)

def drop_unused_columns(data):
    data.drop(['reviewerID', 'asin', 'reviewerName', 'helpful', 'unixReviewTime', 'reviewTime', "reviewText", "summary"], axis=1, inplace=True)

def preprocess_reviews(data):
    remove_stopwords_punctuation(data)
    threshold_ratings(data)
    drop_unused_columns(data)

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/aslisabanci/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [10]:
preprocess_reviews(data)
data.head()

Unnamed: 0,overall,review
0,1,much write here exactly supposed to filters po...
1,1,product exactly quite affordablei realized dou...
2,1,primary job device block breath would otherwis...
3,1,nice windscreen protects mxl mic prevents pops...
4,1,pop filter great looks performs like studio fi...
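
The first preview row above still contains "here" and "to", even though both are common English stopwords. That follows from the order of the steps: stopwords are filtered on whitespace-separated tokens before punctuation is stripped, so tokens such as "here," and "to." do not match the stopword list. The sketch below illustrates this on a single sentence. It uses a small hand-picked stopword set (an assumption for illustration; the notebook uses nltk's full English list), and the helper names are hypothetical.

```python
from string import punctuation

# Hypothetical mini stopword set, for illustration only.
STOPS = {"not", "to", "about", "here", "but", "it", "does", "what", "it's", "its"}

def drop_stopwords(text):
    # Whitespace tokenization, as in the notebook: "here," is a single token.
    return " ".join(w for w in text.split() if w not in STOPS)

def drop_punctuation(text):
    return "".join(ch for ch in text if ch not in punctuation)

raw = "not much to write about here, but it does exactly what it's supposed to."

# Order used in this notebook: stopwords first, then punctuation.
stopwords_first = drop_punctuation(drop_stopwords(raw))
# Alternative order: punctuation first, then stopwords.
punctuation_first = drop_stopwords(drop_punctuation(raw))

print(stopwords_first)    # much write here exactly supposed to
print(punctuation_first)  # much write exactly supposed
```

Either order is workable for this demo. Stripping punctuation first removes more stopwords, but it also changes contractions such as "it's" into "its", so the stopword list needs to cover those forms.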


### Split our training and test sets

In [11]:
rand_seed = 42
X = data["review"]
y = data["overall"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=rand_seed)

### Mini randomized search
Let's set up a very basic cross-validated randomized search over parameter settings.

In [12]:
params = {"max_depth": range(9,12), "min_child_weight": range(5,8)}
rand_search_cv = RandomizedSearchCV(XGBClassifier(), param_distributions=params, n_iter=1)
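
To get a feel for how small this search is, here is a simplified sketch of what a randomized search does when every parameter is a finite range: it enumerates the possible combinations and draws n_iter of them. This is an illustration in plain Python, not scikit-learn's implementation, and the fixed seed is an assumption added so the draw is repeatable.

```python
import itertools
import random

# Same search space as the notebook's params dict.
params = {"max_depth": range(9, 12), "min_child_weight": range(5, 8)}

# Enumerate every combination in the grid: 3 x 3 = 9 candidates.
names = sorted(params)
grid = [dict(zip(names, values))
        for values in itertools.product(*(params[n] for n in names))]

# Draw n_iter candidates without replacement.
# The seed is an assumption for repeatability, not part of the notebook.
n_iter = 1
rng = random.Random(42)
candidates = rng.sample(grid, n_iter)

print(len(grid), "possible settings;", len(candidates), "evaluated:", candidates)
```

With n_iter=1, only one of the nine settings is cross-validated, which keeps the demo fast. Raising n_iter, and passing random_state to RandomizedSearchCV for reproducibility, would make the search more thorough.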

### Pipeline to vectorize, transform and fit
Time to vectorize our data, transform it, and then fit our model to it.
To feed the text data to our model as numeric values, we first convert our texts into a matrix of token counts using a CountVectorizer. We then convert the count matrix to a normalized tf-idf (term frequency times inverse document frequency) representation using a TfidfTransformer. This transformer scales down the impact of tokens that occur in many reviews, because they convey less information to us. Conversely, it scales up the impact of tokens that occur in only a small fraction of the training data, because they are more informative to us.
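
As a rough illustration of that weighting, the sketch below computes tf-idf by hand for a toy corpus. It assumes the smoothed idf formula documented for scikit-learn's TfidfTransformer defaults, ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each row. The corpus and the helper name are made up for this example.

```python
import math
from collections import Counter

# Toy corpus (hypothetical), already lowercased and stripped of punctuation.
docs = ["great pop filter", "great price", "filter broke"]

tokenized = [d.split() for d in docs]
vocab = sorted({t for doc in tokenized for t in doc})
n_docs = len(docs)

# Document frequency: number of documents that contain each token.
df = {t: sum(t in doc for doc in tokenized) for t in vocab}

# Smoothed inverse document frequency: rarer tokens get a larger weight.
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

def tfidf_row(tokens):
    counts = Counter(tokens)
    weights = {t: counts[t] * idf[t] for t in counts}
    norm = math.sqrt(sum(w * w for w in weights.values()))  # L2 norm of the row
    return {t: w / norm for t, w in weights.items()}

row = tfidf_row(tokenized[0])
print(row)
```

In the first document, "pop" appears in only one of the three texts, so it ends up with a larger weight than "great" and "filter", which each appear in two.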

In [13]:
model = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('model', rand_search_cv)
])
model.fit(X_train, y_train)

Pipeline(steps=[('vect', CountVectorizer()), ('tfidf', TfidfTransformer()),
                ('model',
                 RandomizedSearchCV(estimator=XGBClassifier(base_score=None,
                                                            booster=None,
                                                            colsample_bylevel=None,
                                                            colsample_bynode=None,
                                                            colsample_bytree=None,
                                                            gamma=None,
                                                            gpu_id=None,
                                                            importance_type='gain',
                                                            interaction_constraints=None,
                                                            learning_rate=None,
                                                            max_delta_step=None,
                 

### Predict and calculate accuracy

In [14]:
predictions = model.predict(X_test)
acc = accuracy_score(y_test, predictions)
print(f"Model Accuracy: {round(acc * 100, 2)}")

Model Accuracy: 89.14
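
Accuracy is the share of predictions that match the true labels. When one class is much more common than the other, it helps to compare that number with the accuracy of always predicting the most common class. The sketch below shows both calculations on made-up labels. The values are hypothetical and are not taken from this dataset, whose class balance is not shown in this notebook.

```python
from collections import Counter

# Hypothetical labels and predictions, for illustration only.
y_true = [1, 1, 1, 1, 0, 1, 0, 1, 1, 1]
y_pred = [1, 1, 1, 1, 1, 1, 0, 1, 1, 1]

# Accuracy: share of predictions that match the true label.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Baseline: always predict the most common label.
majority_label, majority_count = Counter(y_true).most_common(1)[0]
baseline = majority_count / len(y_true)

print(f"accuracy={accuracy:.2f}, majority baseline={baseline:.2f}")
```

On the real data, y_test.value_counts(normalize=True) would give the corresponding class shares.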


### Automated Deployment via GitHub Actions
TODO: Update the instructions
- Import the Algorithmia GitHub Action config utility script
- Get the path for saving the model file from the YAML file, so there is no need to enter it manually here.
- Get the algorithm name from the YAML file too.
- If desired, overwrite the algorithm script and the dependency file from within your notebook too.

In [18]:
import sys
"""
When using a locally embedded GitHub Action, this should point to the src path of the GitHub Action itself. The suggested best practice is to place the GitHub Action folder under a .github/actions directory at the root level of your model development repository.
An example usage would be:
sys.path.append(".github/actions/algorithmia_ci_modeldeployment/src")

If you're using a GitHub Action from another repository, this should point to /src, as this is where our Dockerized GitHub Action copies its source files. So our path append snippet should be:
sys.path.append("/src")
"""
# Our example uses a GitHub Action from another repo, so we append the /src dir to our path.
sys.path.append("/src")

from action_config_utils import ActionConfigUtils
config_utils = ActionConfigUtils()


In [16]:
# Get the Algorithmia algorithm name and the to-be-deployed model file name from the GitHub Action config file, to reduce duplication. This notebook will:
# 1. Save the created model object to a local path, so we extract the configured file path from our GitHub Action YAML. The GitHub Action then takes the model object from this path and uploads it to Algorithmia.
modelfile_relativepath = config_utils.get_model_relativepath(default_path="./autodeployed_model.pkl")
print(modelfile_relativepath)

# 2. Update the algorithm and dependency files on Algorithmia with the custom code written in this notebook, so we extract the configured algorithm name from our GitHub Action YAML. The GitHub Action will do a git add + commit + push on these files and trigger a new build of the algorithm on Algorithmia.
algo_name = config_utils.get_algoname(default_name="xgboost_automated")
print(algo_name)

./autodeployed_model.pkl
xgboost_automated


In [19]:
joblib.dump(model, modelfile_relativepath, compress=True)

['./autodeployed_model.pkl']

In [20]:
algo_script_path, algo_requirements_path = config_utils.get_algorithmia_filepaths(algo_name)

In [5]:
%%writefile $algo_script_path
import Algorithmia
import json
import os.path
import joblib
import xgboost
import pandas as pd
import hashlib

# I'm generated via a notebook and pushed via GitHub Actions!

client = Algorithmia.client()

def load_model_config(config_rel_path="../model_config.json"):
    """Loads the model manifest file as a dict. 
    A manifest file has the following structure:
    {
      "model_filepath": Uploaded model path on Algorithmia data collection
      "model_md5_hash": MD5 hash of the uploaded model file
      "model_origin_repo": Model development repository having the Github CI workflow
      "model_origin_commit_SHA": Commit SHA related to the trigger of the CI workflow
      "model_origin_commit_msg": Commit message related to the trigger of the CI workflow
      "model_uploaded_utc": UTC timestamp of the automated model upload
    }
    """
    # Default to an empty dict so the return type matches the docstring.
    config = {}
    config_path = os.path.join(os.path.dirname(__file__), config_rel_path)
    if os.path.exists(config_path):
        with open(config_path) as json_file:
            config = json.load(json_file)
    return config


def load_model(config):
    """Loads the model object from the file at model_filepath key in config dict"""
    model_path = config["model_filepath"]
    model_file = client.file(model_path).getFile().name
    model_obj = joblib.load(model_file)
    return model_file, model_obj

def assert_model_md5(model_file):
    """
    Calculates the loaded model file's MD5 and compares the actual file hash with the hash on the model manifest
    """
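    md5_hash = None
    DIGEST_BLOCK_SIZE = 128 * 64
    with open(model_file, "rb") as f:
        md5_hasher = hashlib.md5()
        # Read the file block by block so large models are not loaded into memory at once.
        buf = f.read(DIGEST_BLOCK_SIZE)
        while len(buf) > 0:
            md5_hasher.update(buf)
            buf = f.read(DIGEST_BLOCK_SIZE)
        md5_hash = md5_hasher.hexdigest()
    # config is the module-level manifest dict, loaded before this function is called.
    assert config["model_md5_hash"] == md5_hash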
    
def assert_model_pipeline_steps(model_obj):
    """For demonstration purposes, asserts that the XGBoost model has the expected pipeline steps.
    """
    assert model_obj.steps[0][0] == "vect"
    assert model_obj.steps[1][0] == "tfidf"
    assert model_obj.steps[2][0] == "model"
    print("All assertions are okay, we have a perfectly uploaded model!")


config = load_model_config()
xgb_path, xgb_obj = load_model(config)
assert_model_md5(xgb_path)
assert_model_pipeline_steps(xgb_obj)

# API calls will begin at the apply() method, with the request body passed as 'input'
# For more details, see algorithmia.com/developers/algorithm-development/languages
def apply(input):
    series_input = pd.Series([input])
    # Use the pipeline object loaded at module level (xgb_obj).
    result = xgb_obj.predict(series_input)
    return {
        "sentiment": result.tolist()[0], 
        "predicting_model_metadata": {
            "model_file": config["model_filepath"],
            "origin_repo": config["model_origin_repo"], 
            "origin_commit_SHA": config["model_origin_commit_SHA"], 
            "origin_commit_msg": config["model_origin_commit_msg"]
        }
    }

Writing $algo_script_path
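
The MD5 check in the algorithm script reads the model file in fixed-size blocks and does not load it into memory in one go. The standalone sketch below shows the same idea on a temporary file and confirms that block-wise hashing gives the same digest as hashing all the bytes at once. The function name and the payload are made up for this example.

```python
import hashlib
import os
import tempfile

def md5_of_file(path, block_size=128 * 64):
    """Return the hex MD5 digest of a file, reading it block by block."""
    hasher = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel stops once read() returns b"" at end of file.
        for block in iter(lambda: f.read(block_size), b""):
            hasher.update(block)
    return hasher.hexdigest()

# Stand-in payload for a model file (hypothetical; any bytes work).
payload = b"not a real model" * 10_000

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
    tmp_path = tmp.name

try:
    chunked_digest = md5_of_file(tmp_path)
finally:
    os.remove(tmp_path)

# Block-wise hashing gives the same digest as hashing all bytes at once.
print(chunked_digest == hashlib.md5(payload).hexdigest())
```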


In [None]:
%%writefile $algo_requirements_path
# I'm generated via a notebook and pushed via GitHub Actions!
algorithmia>=1.0.0,<2.0
scikit-learn
pandas
numpy
joblib
xgboost