# HuggingFace Transformers Pipeline: Sentiment Analysis

In this notebook, we use HuggingFace's pre-trained [Transformers sentiment-analysis pipeline](https://huggingface.co/docs/transformers/task_summary#sequence-classification) to predict the sentiment of tweets.

We will compare this pre-trained local model to the baseline model from [main.ipynb](main.ipynb).

## Load project modules and data

We use standard Python packages and the [HuggingFace Transformers](https://huggingface.co/docs/transformers/quicktour) library to predict text sentiment.


In [6]:
# Standard library modules
import os
import sys

# Data manipulation
import pandas as pd


# Make the project's src/ directory importable
src_path = os.path.abspath(os.path.join("../src"))
if src_path not in sys.path:
    sys.path.append(src_path)

# Project helper modules
import data.helpers as data_helpers
import visualization.helpers as viz_helpers



In [7]:

# Download and unzip CSV files
!cd .. && make dataset && cd notebooks


>>> Downloading and extracting data files...
Data files already downloaded.
>>> OK.



In [8]:
# Load data from CSV
df = pd.read_csv(
    os.path.join(
        "..", "data", "raw", "training.1600000.processed.noemoticon.csv"
    ),
    names=["target", "id", "date", "flag", "user", "text"],
)

# Reduce memory usage
df = data_helpers.reduce_dataframe_memory_usage(df)

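# Drop unused columns
df.drop(columns=["id", "date", "flag", "user"], inplace=True)

# Replace target values with labels
# (assigning the result avoids relying on an in-place replacement through
# attribute access, which recent pandas versions may not propagate)
df["target"] = df["target"].replace(
    {
        0: "NEGATIVE",
        2: "NEUTRAL",
        4: "POSITIVE",
    }
)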

df.describe()


| | target | text |
|---|---|---|
| count | 1600000 | 1600000 |
| unique | 2 | 1581466 |
| top | NEGATIVE | isPlayer Has Died! Sorry |
| freq | 800000 | 210 |


## Classification Model

We can now measure the performance of the model defined in [custom_huggingface_sentiment_analysis_classifier.py](../src/models/custom_huggingface_sentiment_analysis_classifier.py), using the same metrics as for the baseline model in [main.ipynb](main.ipynb).
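The wrapper's source file is not reproduced in this notebook. Purely as an illustration, a classifier of this kind could be sketched as below, assuming it memoises per-text scores and persists them as JSON. The class name `CachedSentimentClassifier`, the `score_fn` parameter, and the internals are hypothetical; only the method names `fit`, `load_cache_json` and `save_cache_json` come from the cell that follows.

```python
import json
import os
import tempfile
from typing import Callable, Dict, List, Sequence


class CachedSentimentClassifier:
    """Hypothetical wrapper: memoises per-text scores and persists them as JSON."""

    def __init__(self, score_fn: Callable[[Sequence[str]], List[float]]):
        # score_fn (an assumption) maps a batch of texts to P(POSITIVE) scores.
        # With transformers installed, it could wrap a sentiment-analysis pipeline.
        self.score_fn = score_fn
        self.cache: Dict[str, float] = {}

    def fit(self, X: Sequence[str], y=None) -> "CachedSentimentClassifier":
        # Nothing is trained: "fitting" only scores texts that are not cached yet.
        missing = [text for text in dict.fromkeys(X) if text not in self.cache]
        if missing:
            self.cache.update(zip(missing, self.score_fn(missing)))
        return self

    def predict_proba(self, X: Sequence[str]) -> List[float]:
        self.fit(X)
        return [self.cache[text] for text in X]

    def predict(self, X: Sequence[str], threshold: float = 0.5) -> List[str]:
        return [
            "POSITIVE" if p >= threshold else "NEGATIVE"
            for p in self.predict_proba(X)
        ]

    def save_cache_json(self, filename: str) -> None:
        with open(filename, "w", encoding="utf-8") as f:
            json.dump(self.cache, f)

    def load_cache_json(self, filename: str) -> None:
        with open(filename, encoding="utf-8") as f:
            self.cache = json.load(f)
```

Caching matters here because scoring 1.6 million tweets with a transformer model is slow; with a cache file in place, re-running the notebook only needs to read the JSON back.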

### HuggingFace's Transformer Pipeline : Sentiment Analysis model

This model uses HuggingFace's pre-trained Transformers sentiment-analysis pipeline to predict the sentiment of the tweets.
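Ranking metrics such as ROC AUC need one score per tweet, whereas the pipeline returns a predicted label together with the confidence for that label. A minimal sketch of the conversion follows, assuming a binary model with `POSITIVE` and `NEGATIVE` labels and softmax scores; the helper names `positive_score` and `positive_scores` are illustrative and are not taken from the project code.

```python
from typing import Dict, List, Union

Result = Dict[str, Union[str, float]]


def positive_score(result: Result) -> float:
    """Return P(POSITIVE) from one pipeline result,
    e.g. {"label": "NEGATIVE", "score": 0.98}."""
    score = float(result["score"])
    # For a two-class softmax output, the other class gets 1 - score.
    return score if result["label"] == "POSITIVE" else 1.0 - score


def positive_scores(results: List[Result]) -> List[float]:
    return [positive_score(r) for r in results]


# With transformers installed, the results would come from something like:
#   from transformers import pipeline
#   classifier = pipeline("sentiment-analysis")
#   results = classifier(["I love this!", "This is awful."])
# Hand-written results are used here so the snippet runs offline.
results = [
    {"label": "POSITIVE", "score": 0.99},
    {"label": "NEGATIVE", "score": 0.75},
]
print(positive_scores(results))  # [0.99, 0.25]
```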

In [9]:
from models.custom_huggingface_sentiment_analysis_classifier import (
    CustomHuggingfaceSentimentAnalysisClassifier,
)


# Initialize the HuggingFace sentiment analysis classifier
cls = CustomHuggingfaceSentimentAnalysisClassifier()

cache_json_path = os.path.join("..", "results", "huggingface_cache.json")
if os.path.exists(cache_json_path):
    # Load cached results
    cls.load_cache_json(filename=cache_json_path)
else:
    # Compute sentiment scores
    cls.fit(X=df.text.values, y=df.target.values)
    # Save results to cache
    cls.save_cache_json(filename=cache_json_path)

# Plot classification performances
viz_helpers.plot_classifier_results(
    cls,
    df.text.values,
    df.target.values,
    title="Classification results",
)


No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)
Some layers from the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english were not used when initializing TFDistilBertForSequenceClassification: ['dropout_19']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english and are 

On this dataset, performance is slightly better than the baseline model's:
- Average Precision = 0.75 (baseline = 0.73, +2.7%)
- ROC AUC = 0.78 (baseline = 0.74, +5.4%)
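
The percentages in parentheses are relative changes with respect to the baseline. As a quick check, they can be recomputed from the rounded scores quoted above (the function name `relative_change` is illustrative):

```python
def relative_change(new: float, baseline: float) -> float:
    """Relative change of `new` with respect to `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


# (model score, baseline score), as quoted in the text
metrics = {
    "Average Precision": (0.75, 0.73),
    "ROC AUC": (0.78, 0.74),
}

for name, (model_score, baseline_score) in metrics.items():
    print(f"{name}: {relative_change(model_score, baseline_score):+.1f}%")
# Average Precision: +2.7%
# ROC AUC: +5.4%
```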



Like the baseline, this model is biased towards the positive class, but less so: it predicted 26% more _POSITIVE_ (1115) than _NEGATIVE_ (885) messages, compared with 35% more for the baseline (a relative reduction of 26%).
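
The bias figures can be reproduced from the prediction counts quoted above. This is a small arithmetic sketch; the helper name `positive_excess` is illustrative.

```python
def positive_excess(n_positive: int, n_negative: int) -> float:
    """Percentage by which POSITIVE predictions outnumber NEGATIVE ones."""
    return (n_positive / n_negative - 1.0) * 100.0


excess = positive_excess(1115, 885)  # counts quoted in the text
baseline_excess = 35.0               # baseline figure quoted in the text
change = (excess - baseline_excess) / baseline_excess * 100.0

print(f"excess: {excess:.0f}%, change vs baseline: {change:.0f}%")
# excess: 26%, change vs baseline: -26%
```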


In [None]:
# Compute predictions
# (assumes the classifier exposes a scikit-learn style `predict` method)
y_pred = cls.predict(df.text.values)
df["prediction"] = y_pred


In [None]:
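import shap

# Load the JavaScript needed to render SHAP plots in the notebook
shap.initjs()

# Explain the model wrapped by the classifier
# (`cls.classifier` is assumed to be the underlying HuggingFace pipeline)
explainer = shap.Explainer(cls.classifier)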


In [None]:
# False positive example: true label NEGATIVE, predicted POSITIVE
fp_index = df[(df.target == "NEGATIVE") & (df.prediction == "POSITIVE")].index[0]
# Look the text up by index label rather than by position
fp_text = df.loc[fp_index, "text"]

shap_values = explainer([fp_text])

shap.plots.text(shap_values[0, :, "POSITIVE"])

In [None]:
# False negative example: true label POSITIVE, predicted NEGATIVE
fn_index = df[(df.target == "POSITIVE") & (df.prediction == "NEGATIVE")].index[0]
# Look the text up by index label rather than by position
fn_text = df.loc[fn_index, "text"]

shap_values = explainer([fn_text])

shap.plots.text(shap_values[0, :, "POSITIVE"])