# Sentiment Analysis with XGBoost on Algorithmia

In this notebook, we will train an XGBoost model on Amazon's Musical Instrument Reviews dataset and use it to predict the sentiment of a given text. If you would like to see the final product first, you can check out this algorithm in action at https://algorithmia.com/algorithms/asli/xgboost_basic_sentiment_analysis

## Overview
Let's first go over the steps we will cover in this notebook. We will start with the end in mind and then slowly build up to that point. At the end of this demo, you will have an algorithm up and running on Algorithmia, ready to serve predictions on request!

Step by step, we will: 

1. Create an algorithm on Algorithmia 
2. Clone the algorithm's repository on our local machine, so that we develop it locally 
3. Create the basic algorithm script and the dependencies file. We will code our script in advance, assuming that our model will be sitting on a remote path on Algorithmia and our script will load the model from there. We will then make these assumptions true!
4. Commit and push these files to Algorithmia and get our Algorithm's container built
5. Load the training data
6. Preprocess the data
7. Set up an XGBoost model and do a mini hyperparameter search
8. Fit the data on our model
9. Get the predictions
10. Check the accuracy
11. Repeat steps 6 through 10 until we are happy with our model :)
12. Once we are happy, upload the model to Algorithmia and have it up and ready to serve our upcoming prediction requests!
13. Test our published algorithm with sample requests

## Getting up and ready on Algorithmia
Let's first create an algorithm on Algorithmia and then build on it slowly.
After importing the necessary packages, we'll define the variables to use across many of our calls to Algorithmia, through the Python API client.

In [1]:
import Algorithmia
from Algorithmia.errors import AlgorithmException
import urllib.parse
from git import Git, Repo, remote

In [2]:
api_key = "YOUR_API_KEY"
algo_client = Algorithmia.client(api_key)
username = "YOUR_USERNAME"
algo_name = "xgboost_basic_sentiment_analysis"
algo_namespace = f"{username}/{algo_name}"

local_dir = "../algorithmia_repo"
algo_script_path = "{}/{}/src/{}.py".format(local_dir, algo_name, algo_name)
dependency_file_path = "{}/{}/{}".format(local_dir, algo_name, "requirements.txt")
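Hard-coding the API key in a notebook makes it easy to leak by accident. As a minimal sketch, the key could be read from an environment variable instead. The helper name `load_api_key` and the variable name `ALGORITHMIA_API_KEY` are assumptions made for this illustration and are not part of the original notebook.

```python
import os

def load_api_key(env_var="ALGORITHMIA_API_KEY"):
    """Return the API key stored in the given environment variable.

    Both this helper and the default variable name are illustrative assumptions.
    """
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return key

# Usage (assumes the variable has been exported in your shell):
# api_key = load_api_key()
```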

### Creating the algorithm and cloning its repo
You only need to do this step once: one algorithm is enough, and so is cloning it once to your local environment.

Let's first define functions to do these two things and then call them in the next step. Let's also define a utility class named Progress so that we can see Git's progress while GitPython operates on our repository.

In [3]:
class Progress(remote.RemoteProgress):
    def line_dropped(self, line):
        print(line)
    def update(self, *args):
        print(self._cur_line)

p = Progress()

def create_algorithm(algo_name):    
    details = {
        "summary": algo_name,
        "label": algo_name,
        "tagline": algo_name
    }
    settings = {
        "source_visibility": "closed",
        "package_set": "python37",
        "license": "apl",
        "network_access": "full",
        "pipeline_enabled": True
    }
    algo_client.algo(algo_namespace).create(details, settings)
    
def clone_algorithm_repo():
    # Encode API key, so we can use it in the git URL
    encoded_api_key= urllib.parse.quote_plus(api_key)

    algo_repo = "https://{}:{}@git.algorithmia.com/git/{}/{}.git".format(username, encoded_api_key, username, algo_name)
    _ = Repo.clone_from(algo_repo, "{}/{}".format(local_dir, algo_name), progress=p)

    cloned_repo = Repo("{}/{}".format(local_dir, algo_name))
    return cloned_repo
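The API key is URL-encoded in `clone_algorithm_repo` because characters such as `/` or `+` would otherwise be misread as part of the URL. The sketch below isolates that step. `build_clone_url` is a hypothetical helper and the key shown is a made-up placeholder.

```python
import urllib.parse

def build_clone_url(username, api_key, algo_name):
    # quote_plus percent-encodes characters that are unsafe in the credentials part of a URL
    encoded_key = urllib.parse.quote_plus(api_key)
    return "https://{}:{}@git.algorithmia.com/git/{}/{}.git".format(
        username, encoded_key, username, algo_name
    )

# Placeholder values, for illustration only
url = build_clone_url("YOUR_USERNAME", "sim/abc+123", "xgboost_basic_sentiment_analysis")
print(url)
```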

In [12]:
create_algorithm(algo_name)
cloned_repo = clone_algorithm_repo()

Cloning into '../algorithmia_repo/xgboost_basic_sentiment_analysis'...
POST git-upload-pack (157 bytes)
remote: Counting objects: 1
remote: Counting objects: 15, done
remote: Finding sources:   6% (1/15)
remote: Finding sources:  13% (2/15)
remote: Finding sources:  20% (3/15)
remote: Finding sources:  26% (4/15)
remote: Finding sources:  33% (5/15)
remote: Finding sources:  40% (6/15)
remote: Finding sources:  46% (7/15)
remote: Finding sources:  53% (8/15)
remote: Finding sources:  60% (9/15)
remote: Finding sources:  66% (10/15)
remote: Finding sources:  73% (11/15)
remote: Finding sources:  80% (12/15)
remote: Finding sources:  86% (13/15)
remote: Finding sources:  93% (14/15)
remote: Finding sources: 100% (15/15)
remote: Finding sources: 100% (15/15)
remote: Getting sizes:   9% (1/11)
remote: Getting sizes:  18% (2/11)
remote: Getting sizes:  27% (3/11)
remote: Getting sizes:  36% (4/11)
remote: Getting sizes:  45% (5/11)
remote: Getting sizes:  54% (6/11)
remote: Getting sizes:  

### Adding the algorithm script and the dependencies
Let's create the algorithm script that will run when we make our requests and the dependency file that will be used when building the container for our algorithm on the Algorithmia environment.

We will create these two files programmatically with the %%writefile cell magic, but you can always edit and save them later in another editor.

In [13]:
%%writefile $algo_script_path
import Algorithmia
import joblib
import numpy as np
import pandas as pd
import xgboost

model_path = "data://asli/xgboost_demo/musicalreviews_xgb_model.pkl"
client = Algorithmia.client()
model_file = client.file(model_path).getFile().name
loaded_xgb = joblib.load(model_file)

# API calls will begin at the apply() method, with the request body passed as 'input'
# For more details, see algorithmia.com/developers/algorithm-development/languages
def apply(input):
    series_input = pd.Series([input])
    result = loaded_xgb.predict(series_input)
    # Returning the first element of the list, as we'll be taking a single input for our demo purposes
    # As you'll see while building the model: 0->negative, 1->positive
    return {"sentiment": result.tolist()[0]}

Overwriting ../algorithmia_repo/xgboost_basic_sentiment_analysis/src/xgboost_basic_sentiment_analysis.py
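The script above scores one string per request. If batched requests were ever needed, the entry point could accept either a single string or a list of strings. This is only a sketch under that assumption: `StubModel` is a stand-in for the loaded pipeline so that the example runs without the real model, and `apply_batch` is a hypothetical name.

```python
class StubModel:
    """Stand-in for the trained pipeline; flags any text containing 'great' as positive."""

    def predict(self, texts):
        return [1 if "great" in text.lower() else 0 for text in texts]

loaded_model = StubModel()

def apply_batch(input):
    # Normalize the request body to a list of strings
    texts = [input] if isinstance(input, str) else list(input)
    if not all(isinstance(text, str) for text in texts):
        raise ValueError("Input must be a string or a list of strings")
    predictions = loaded_model.predict(texts)
    # int() turns NumPy integers into plain ints so the result serializes to JSON
    return {"sentiment": [int(p) for p in predictions]}

print(apply_batch("It works great!"))
print(apply_batch(["It works great!", "Not worth your money"]))
```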


In [14]:
%%writefile $dependency_file_path
algorithmia>=1.0.0,<2.0
scikit-learn
pandas
numpy
joblib
xgboost

Overwriting ../algorithmia_repo/xgboost_basic_sentiment_analysis/requirements.txt


### Adding these files to git, committing and pushing
Now we're ready to push our changes to our remote repo on Algorithmia. Our algorithm will then be built on the Algorithmia servers and get ready to accept our requests.

In [15]:
files = ["src/{}.py".format(algo_name), "requirements.txt"]
cloned_repo.index.add(files)
cloned_repo.index.commit("Add algorithm files")

origin = cloned_repo.remote(name='origin')
_ = origin.push(progress=p)

Enumerating objects: 9, done.
Counting objects:  11% (1/9)
Counting objects:  22% (2/9)
Counting objects:  33% (3/9)
Counting objects:  44% (4/9)
Counting objects:  55% (5/9)
Counting objects:  66% (6/9)
Counting objects:  77% (7/9)
Counting objects:  88% (8/9)
Counting objects: 100% (9/9)
Counting objects: 100% (9/9), done.
Delta compression using up to 8 threads
Compressing objects:  20% (1/5)
Compressing objects:  40% (2/5)
Compressing objects:  60% (3/5)
Compressing objects:  80% (4/5)
Compressing objects: 100% (5/5)
Compressing objects: 100% (5/5), done.
Writing objects:  20% (1/5)
Writing objects:  40% (2/5)
Writing objects:  60% (3/5)
Writing objects:  80% (4/5)
Writing objects: 100% (5/5)
Writing objects: 100% (5/5), 466 bytes | 466.00 KiB/s, done.
Total 5 (delta 3), reused 0 (delta 0)
remote: Resolving deltas:  33% (1/3)
remote: Resolving deltas:  66% (2/3)
remote: Resolving deltas: 100% (3/3)
remote: Resolving deltas: 100% (3/3)
remote: Updating references: 100% (1/1)
remote:

### Uploading the model to Algorithmia
Let's also write a function that takes our saved model from its local path and puts it in a data container on Algorithmia. As you'll remember, our algorithm script will be looking for the model to load at this data path.

We will call this function once we're happy with the model that we'll develop soon.

In [16]:
def upload_model_to_algorithmia(local_path, model_name):
    algorithmia_data_path = "data://asli/xgboost_demo"
    if not algo_client.dir(algorithmia_data_path).exists():
        algo_client.dir(algorithmia_data_path).create()
    algorithmia_path = "{}/{}".format(algorithmia_data_path, model_name)
    result = algo_client.file(algorithmia_path).putFile(local_path)


### Calling the algorithm
Finally, let's write the function to call our algorithm. Again, we will use this function once our model is uploaded and we're ready to make the requests. 

In [17]:
from retry import retry
# Retry the call until the algo hash endpoint becomes available: up to 10 tries, 1 second apart
@retry(AlgorithmException, tries=10, delay=1)
def get_review_sentiment(input):
    latest_algo_hash = algo_client.algo(algo_namespace).info().version_info.git_hash
    algo = algo_client.algo("{}/{}".format(algo_namespace, latest_algo_hash))
    algo.set_options(timeout=60, stdout=False)
    algo_pipe_result = algo.pipe(input)
    print(algo_pipe_result.metadata)
    return algo_pipe_result.result["sentiment"]
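To make the behaviour of `@retry` concrete, here is a simplified, hand-written decorator built around the same `tries` and `delay` idea. It is an illustration only, not the `retry` package's actual implementation, and `simple_retry` and `flaky_call` are hypothetical names.

```python
import functools
import time

def simple_retry(exception_type, tries=10, delay=1):
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, tries + 1):
                try:
                    return func(*args, **kwargs)
                except exception_type:
                    # Give up and re-raise once the last attempt has failed
                    if attempt == tries:
                        raise
                    time.sleep(delay)
        return wrapper
    return decorator

calls = {"count": 0}

@simple_retry(RuntimeError, tries=5, delay=0)
def flaky_call():
    # Fails twice, then succeeds, to imitate an endpoint that is not ready yet
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("endpoint not ready yet")
    return "ok"

result = flaky_call()
print(result, calls["count"])
```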

## Building the XGBoost model
Now it's time to build our model!

In [20]:
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.preprocessing import StandardScaler  # for scaling
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline

from string import punctuation
from nltk.corpus import stopwords

from scipy.stats import uniform

from xgboost import XGBClassifier
import pandas as pd
import numpy as np
import joblib

### Load the training data
Let's load our training data, take a look at a few rows and one of the review texts in detail.

In [28]:
data = pd.read_csv("./data/amazon_musical_reviews/Musical_instruments_reviews.csv")
data.head()

Unnamed: 0,reviewerID,asin,reviewerName,helpful,reviewText,overall,summary,unixReviewTime,reviewTime
0,A2IBPI20UZIR0U,1384719342,"cassandra tu ""Yeah, well, that's just like, u...","[0, 0]","Not much to write about here, but it does exac...",5.0,good,1393545600,"02 28, 2014"
1,A14VAT5EAX3D9S,1384719342,Jake,"[13, 14]",The product does exactly as it should and is q...,5.0,Jake,1363392000,"03 16, 2013"
2,A195EZSQDW3E21,1384719342,"Rick Bennette ""Rick Bennette""","[1, 1]",The primary job of this device is to block the...,5.0,It Does The Job Well,1377648000,"08 28, 2013"
3,A2C00NNG1ZQQG2,1384719342,"RustyBill ""Sunday Rocker""","[0, 0]",Nice windscreen protects my MXL mic and preven...,5.0,GOOD WINDSCREEN FOR THE MONEY,1392336000,"02 14, 2014"
4,A94QU4C90B1AX,1384719342,SEAN MASLANKA,"[0, 0]",This pop filter is great. It looks and perform...,5.0,No more pops when I record my vocals.,1392940800,"02 21, 2014"


In [29]:
data["reviewText"].iloc[1]

"The product does exactly as it should and is quite affordable.I did not realized it was double screened until it arrived, so it was even better than I had expected.As an added bonus, one of the screens carries a small hint of the smell of an old grape candy I used to buy, so for reminiscent's sake, I cannot stop putting the pop filter next to my nose and smelling it after recording. :DIf you needed a pop filter, this will work just as well as the expensive ones, and it may even come with a pleasing aroma like mine did!Buy this product! :]"

### Preprocessing
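Time to process our texts! Basically, we'll:
- Combine the review text and the summary into a single review column
- Remove the English stopwords
- Remove punctuation
- Convert the ratings into binary labels: 0 (negative) for ratings of 3 or lower, 1 (positive) otherwise
- Drop unused columns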

In [31]:
def threshold_ratings(data):
    def threshold_overall_rating(rating):
        return 0 if int(rating)<=3 else 1
    data["overall"] = data["overall"].apply(threshold_overall_rating)

def remove_stopwords_punctuation(data):
    data["review"] = data["reviewText"] + data["summary"]

    puncs = list(punctuation)
    stops = stopwords.words("english")

    def remove_stopwords_in_str(input_str):
        # Split on whitespace and keep only the words that are not stopwords
        filtered = [word for word in str(input_str).split() if word not in stops]
        return ' '.join(filtered)

    def remove_punc_in_str(input_str):
        filtered = [char for char in input_str if char not in puncs]
        return ''.join(filtered)

    def remove_stopwords_in_series(input_series):
        text_clean = []
        for i in range(len(input_series)):
            text_clean.append(remove_stopwords_in_str(input_series[i]))
        return text_clean

    def remove_punc_in_series(input_series):
        text_clean = []
        for i in range(len(input_series)):
            text_clean.append(remove_punc_in_str(input_series[i]))
        return text_clean

    data["review"] = remove_stopwords_in_series(data["review"].str.lower())
    data["review"] = remove_punc_in_series(data["review"].str.lower())

def drop_unused_columns(data):
    data.drop(['reviewerID', 'asin', 'reviewerName', 'helpful', 'unixReviewTime', 'reviewTime', "reviewText", "summary"], axis=1, inplace=True)

def preprocess_reviews(data):
    remove_stopwords_punctuation(data)
    threshold_ratings(data)
    drop_unused_columns(data)

In [32]:
preprocess_reviews(data)
data.head()

Unnamed: 0,overall,review
0,1,much write here exactly supposed to filters po...
1,1,product exactly quite affordablei realized dou...
2,1,primary job device block breath would otherwis...
3,1,nice windscreen protects mxl mic prevents pops...
4,1,pop filter great looks performs like studio fi...
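To see what the cleaning does to a single review, the sketch below applies the same two steps to the opening of the review printed earlier. It uses a small hand-picked stopword set as a stand-in for NLTK's English list, so that it runs without downloading NLTK data. `STOPS` and `clean_text` are hypothetical names.

```python
from string import punctuation

# Small stand-in for NLTK's English stopword list (an assumption for this sketch)
STOPS = {"the", "does", "as", "it", "should", "and", "is", "did", "not", "was"}

def clean_text(text):
    # Step 1: lowercase, split on whitespace, drop stopwords
    words = [word for word in text.lower().split() if word not in STOPS]
    without_stops = " ".join(words)
    # Step 2: delete punctuation characters; they are removed, not replaced by a space,
    # which is why "affordable.i" ends up as "affordablei"
    return "".join(ch for ch in without_stops if ch not in punctuation)

sample = ("The product does exactly as it should and is quite affordable.I "
          "did not realized it was double screened")
cleaned = clean_text(sample)
print(cleaned)
```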


### Split our training and test sets

In [None]:
rand_seed = 42
X = data["review"]
y = data["overall"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=rand_seed)

### Mini randomized search
Let's set up a very basic cross-validated randomized search over parameter settings.

In [13]:
params = {"max_depth": range(9,12), "min_child_weight": range(5,8)}
rand_search_cv = RandomizedSearchCV(XGBClassifier(), param_distributions=params, n_iter=5)
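With three values for each of the two parameters, there are nine possible combinations, and `n_iter=5` means only five of them are tried. The sketch below mimics that sampling with the standard library. It illustrates the idea and is not scikit-learn's internal code; the seed value is an arbitrary assumption.

```python
import itertools
import random

params = {"max_depth": range(9, 12), "min_child_weight": range(5, 8)}

# Every possible combination of the two parameters (3 x 3 = 9)
grid = [dict(zip(params, combo)) for combo in itertools.product(*params.values())]

# Draw 5 distinct combinations, similar in spirit to a randomized search with n_iter=5
rng = random.Random(42)
candidates = rng.sample(grid, 5)

print(len(grid))
for candidate in candidates:
    print(candidate)
```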

### Pipeline to vectorize, transform and fit
Time to vectorize our data, transform it and then fit our model to it.
To feed the text data to our model as numeric values, we will first convert our texts into a matrix of token counts using a CountVectorizer. Then we will convert the count matrix to a normalized tf-idf (term frequency times inverse document frequency) representation using a TfidfTransformer. This scales down the impact of tokens that occur very frequently, because they carry less information, and scales up the impact of tokens that occur in only a small fraction of the training data, because they are more informative.
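As a worked illustration of the tf-idf weighting, the sketch below computes it by hand for three tiny made-up documents. It follows the smoothed formula documented as scikit-learn's default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization. The documents and helper names are hypothetical.

```python
import math
from collections import Counter

docs = ["great pop filter", "great price", "filter broke"]
tokenized = [doc.split() for doc in docs]
vocab = sorted({token for doc in tokenized for token in doc})
n_docs = len(docs)

# Document frequency: in how many documents does each token appear?
df = {token: sum(token in doc for doc in tokenized) for token in vocab}
# Smoothed inverse document frequency
idf = {token: math.log((1 + n_docs) / (1 + df[token])) + 1 for token in vocab}

def tfidf(doc_tokens):
    counts = Counter(doc_tokens)
    raw = {token: counts[token] * idf[token] for token in counts}
    # L2-normalize so that every document vector has unit length
    norm = math.sqrt(sum(value * value for value in raw.values()))
    return {token: value / norm for token, value in raw.items()}

weights = tfidf(tokenized[0])
for token, weight in weights.items():
    print(f"{token}: {weight:.3f}")
```

In the first document, "pop" appears in only one of the three documents, so it receives a higher weight than "great" and "filter", which each appear in two.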

In [14]:
model  = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('model', rand_search_cv)
])
model.fit(X_train, y_train)

Pipeline(steps=[('vect', CountVectorizer()), ('tfidf', TfidfTransformer()),
                ('model',
                 RandomizedSearchCV(estimator=XGBClassifier(base_score=None,
                                                            booster=None,
                                                            colsample_bylevel=None,
                                                            colsample_bynode=None,
                                                            colsample_bytree=None,
                                                            gamma=None,
                                                            gpu_id=None,
                                                            importance_type='gain',
                                                            interaction_constraints=None,
                                                            learning_rate=None,
                                                            max_delta_step=None,
                 

### Predict and calculate accuracy

In [15]:
predictions = model.predict(X_test)
acc = accuracy_score(y_test, predictions)
print(f"Model Accuracy: {round(acc * 100, 2)}")

Model Accuracy: 89.14
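Accuracy is easier to interpret next to a baseline. If the labels are imbalanced, always predicting the most common class can already score highly. The sketch below compares the two on made-up labels. The lists are hypothetical and are not taken from the dataset.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def majority_baseline_accuracy(y_true):
    # Accuracy obtained by always predicting the most frequent label
    most_common_count = Counter(y_true).most_common(1)[0][1]
    return most_common_count / len(y_true)

# Hypothetical labels, for illustration only
y_true = [1, 1, 1, 1, 0, 1, 1, 0, 1, 1]
y_pred = [1, 1, 1, 1, 0, 1, 1, 1, 1, 1]

print(f"Model accuracy: {accuracy(y_true, y_pred):.2f}")
print(f"Majority-class baseline: {majority_baseline_accuracy(y_true):.2f}")
```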


### Save the model
Once we're happy with our model's accuracy, let's save it locally first and then upload it to Algorithmia from there.
For the Algorithmia upload, we will use our previously defined function.

In [21]:
model_name = "musicalreviews_xgb_model.pkl"
local_path = f"model/{model_name}"

In [None]:
joblib.dump(model, local_path, compress=True)

In [22]:
# The upload function builds the remote data path itself, so it only needs the local path and the model name
upload_model_to_algorithmia(local_path, model_name)

### Time to test end to end!
Now we are up and ready, with our algorithm on Algorithmia waiting for its visitors! Let's test it with one negative and one positive text and see how well it does.
To send the request to our algorithm, we will use our previously defined function and give it a string input.

In [18]:
neg_test_input = "It doesn't work quite as expected. Not worth your money!"
sentiment = get_review_sentiment(neg_test_input)
print("Sentiment for the given text is: {}".format(sentiment))

Metadata(content_type='json',duration=0.020603229,stdout=None)
Sentiment for the given text is: 0


In [19]:
pos_test_input = "I am glad that I bought this. It works great!"
sentiment = get_review_sentiment(pos_test_input)
print("Sentiment for the given text is: {}".format(sentiment))

Metadata(content_type='json',duration=0.018659969,stdout=None)
Sentiment for the given text is: 1
