## Amazon Sentiment Analysis

This project experiments with machine learning text classification models on [Amazon reviews](https://huggingface.co/datasets/amazon_polarity). The goal of this sentiment analysis is to identify whether a review is positive or negative based on the text alone.

References:
- [Hugging Face: "Getting Started with Sentiment Analysis using Python"](https://huggingface.co/blog/sentiment-analysis-python)

In [5]:
from numpy import argsort

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn import metrics
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.model_selection import GridSearchCV

import spacy
from spacy.lang.en.stop_words import STOP_WORDS

from datasets import load_dataset


In [6]:
# Download dataset from Hugging Face
dataset = load_dataset("amazon_polarity")


Found cached dataset amazon_polarity (/Users/user/.cache/huggingface/datasets/amazon_polarity/amazon_polarity/3.0.0/a27b32b7e7b88eb274a8fa8ba0f654f1fe998a87c22547557317793b5d2772dc)



In [7]:
train = dataset['train'].to_pandas()
test = dataset['test'].to_pandas()


In [8]:
print(f'Train set: {train.shape}')
print(f'Test set: {test.shape}')


Train set: (3600000, 3)
Test set: (400000, 3)


In [9]:
train.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3600000 entries, 0 to 3599999
Data columns (total 3 columns):
 #   Column   Dtype 
---  ------   ----- 
 0   label    int64 
 1   title    object
 2   content  object
dtypes: int64(1), object(2)
memory usage: 82.4+ MB


In [10]:
train.head()


| | label | title | content |
|---|---|---|---|
| 0 | 1 | Stuning even for the non-gamer | This sound track was beautiful! It paints the ... |
| 1 | 1 | The best soundtrack ever to anything. | I'm reading a lot of reviews saying that this ... |
| 2 | 1 | Amazing! | This soundtrack is my favorite music of all ti... |
| 3 | 1 | Excellent Soundtrack | I truly like this soundtrack and I enjoy video... |
| 4 | 1 | Remember, Pull Your Jaw Off The Floor After He... | If you've played the game, you know how divine... |


### Bag of Words Models

Let's start with a simple Naive Bayes classifier as the baseline. Because of the size of the dataset, we'll also create smaller sets for faster training and testing.

In [11]:
# Preprocess the data
X_train = train['content']
y_train = train['label']

X_test = test['content']
y_test = test['label']

# Smaller datasets for faster training and testing initially
X_train_small = X_train[:len(X_train)//20]
y_train_small = y_train[:len(X_train)//20]
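
The slice above keeps the first 5% of rows, which is representative only if the rows are not ordered by label. As a hedged alternative, the sketch below draws a stratified random 5% sample with `train_test_split`. The `toy` DataFrame and the variable names are placeholders invented for illustration, not part of the notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the real training frame: 200 rows, balanced labels
toy = pd.DataFrame({
    "content": [f"review {i}" for i in range(200)],
    "label": [0, 1] * 100,
})

# Keep 5% of the rows while preserving the label proportions
X_small, _, y_small, _ = train_test_split(
    toy["content"],
    toy["label"],
    train_size=0.05,
    stratify=toy["label"],
    random_state=0,
)

print(len(X_small), y_small.value_counts().to_dict())
```

Fixing `random_state` keeps the subset reproducible between runs.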


We'll use `TfidfVectorizer` to turn each review into a vector of word counts re-weighted by inverse document frequency, which down-weights words that appear in many reviews, such as "the", "a", and "is".

In [9]:
nb_pipe = Pipeline([('vectorizer', TfidfVectorizer()),
                    ('classifier', MultinomialNB())])

nb_pipe.fit(X_train_small, y_train_small)
print(classification_report(y_test, nb_pipe.predict(X_test)))


              precision    recall  f1-score   support

           0       0.80      0.85      0.83    200000
           1       0.84      0.78      0.81    200000

    accuracy                           0.82    400000
   macro avg       0.82      0.82      0.82    400000
weighted avg       0.82      0.82      0.82    400000



Next we'll add a layer of complexity by including bigrams and filtering out stop words.

In [46]:
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner', 'textcat'])

STOP_WORDS = STOP_WORDS.union({'ll', 've'})

nb_pipe = Pipeline([('vectorizer', TfidfVectorizer(ngram_range=(1, 2),
                                                   stop_words=list(STOP_WORDS))),
                    ('classifier', MultinomialNB())])

nb_pipe.fit(X_train_small, y_train_small)
print(classification_report(y_test, nb_pipe.predict(X_test)))


              precision    recall  f1-score   support

           0       0.84      0.86      0.85    200000
           1       0.85      0.84      0.84    200000

    accuracy                           0.85    400000
   macro avg       0.85      0.85      0.85    400000
weighted avg       0.85      0.85      0.85    400000



Adding bigrams and removing stop words improved accuracy from 0.82 to 0.85. Let's try a more complex machine learning algorithm next.

In [47]:
sgd_pipe = Pipeline([('vectorizer', TfidfVectorizer(ngram_range=(1, 2),
                                                    stop_words=list(STOP_WORDS))),
                     ('classifier', SGDClassifier(max_iter=50))])

sgd_pipe.fit(X_train_small, y_train_small)
print(classification_report(y_test, sgd_pipe.predict(X_test)))


              precision    recall  f1-score   support

           0       0.84      0.85      0.84    200000
           1       0.85      0.84      0.84    200000

    accuracy                           0.84    400000
   macro avg       0.84      0.84      0.84    400000
weighted avg       0.84      0.84      0.84    400000



With these settings, `SGDClassifier` performs slightly worse than the `MultinomialNB` model (0.84 vs. 0.85 accuracy). Let's use `GridSearchCV` to look for better hyperparameters.

In [48]:
params = {'classifier__alpha': (0.001, 0.0001, 0.00001),
          # TODO 'classifier__loss': ('log_loss', 'hinge'), # log_loss = Logistic Regression, hinge = Linear SVM
          }

sgd_grid = GridSearchCV(sgd_pipe, params, cv=3, verbose=True)
sgd_grid.fit(X_train_small, y_train_small)


Fitting 3 folds for each of 3 candidates, totalling 9 fits


In [49]:
print(classification_report(y_test, sgd_grid.predict(X_test)))


              precision    recall  f1-score   support

           0       0.88      0.86      0.87    200000
           1       0.87      0.88      0.87    200000

    accuracy                           0.87    400000
   macro avg       0.87      0.87      0.87    400000
weighted avg       0.87      0.87      0.87    400000



In [51]:
sgd_grid.best_params_


{'classifier__alpha': 1e-05}

This quick grid search improved accuracy from 0.84 to 0.87. The best `alpha` (1e-05) is the smallest value tried, so future work should extend the range toward smaller alphas and compare logistic regression against a linear SVM within `SGDClassifier`. For now, we'll continue with fast exploratory iterations.

### Pre-trained Models

Let's see how pre-trained models from Hugging Face work. We'll try `distilbert-base-uncased-finetuned-sst-2-english` which is one of the most popular text classification models on Hugging Face.

In [3]:
# Example from https://huggingface.co/blog/sentiment-analysis-python
from transformers import pipeline

sentiment_pipeline = pipeline(
    model="distilbert-base-uncased-finetuned-sst-2-english")
data = ["I love you", "I hate you"]
sentiment_pipeline(data)


Some layers from the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english were not used when initializing TFDistilBertForSequenceClassification: ['dropout_19']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english and are newly initialized: ['dropout_39']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


[{'label': 'POSITIVE', 'score': 0.9998656511306763},
 {'label': 'NEGATIVE', 'score': 0.9991129040718079}]

In [49]:
# Run on test data subset for speed
X_test_small = X_test[:1000]
y_test_small = y_test[:len(X_test_small)]

bert = sentiment_pipeline(X_test_small.to_list())


In [50]:
bert[:5]


[{'label': 'NEGATIVE', 'score': 0.5490386486053467},
 {'label': 'POSITIVE', 'score': 0.9994763731956482},
 {'label': 'NEGATIVE', 'score': 0.9965258240699768},
 {'label': 'NEGATIVE', 'score': 0.9797807931900024},
 {'label': 'POSITIVE', 'score': 0.9715625643730164}]

In [51]:
pred = [1 if key['label'] == 'POSITIVE' else 0 for key in bert]
pred[:5]


[0, 1, 0, 0, 1]

In [52]:
print(classification_report(y_test_small, pred))


              precision    recall  f1-score   support

           0       0.87      0.89      0.88       498
           1       0.89      0.86      0.87       502

    accuracy                           0.88      1000
   macro avg       0.88      0.88      0.88      1000
weighted avg       0.88      0.88      0.88      1000



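The `distilbert-base-uncased-finetuned-sst-2-english` model reaches 0.88 accuracy on the 1,000-review subset without any training on Amazon data, slightly above the 0.87 that the tuned `SGDClassifier` reaches on the full test set. The two scores are measured on different samples, so the comparison is only indicative, and the pre-trained model is slower to run.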
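
Because the DistilBERT score comes from only 1,000 reviews, it helps to estimate how much sampling noise that score carries. The sketch below uses a normal-approximation 95% confidence interval. This is a simplifying assumption, and the helper name `accuracy_interval` is invented for illustration.

```python
import math

def accuracy_interval(accuracy, n, z=1.96):
    """Normal-approximation confidence interval for an accuracy measured on n samples."""
    half_width = z * math.sqrt(accuracy * (1 - accuracy) / n)
    return accuracy - half_width, accuracy + half_width

# Accuracies and sample sizes reported in the classification reports above
low, high = accuracy_interval(0.88, 1000)
sgd_low, sgd_high = accuracy_interval(0.87, 400000)

print(f"DistilBERT (n=1,000):      {low:.3f} to {high:.3f}")
print(f"SGDClassifier (n=400,000): {sgd_low:.3f} to {sgd_high:.3f}")
```

Under this approximation the DistilBERT interval spans roughly 0.86 to 0.90, which includes 0.87, so the two models cannot be clearly separated on this evidence alone.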

An interesting next step would be to fine-tune this pre-trained model on the Amazon reviews dataset used here.

### Model Analysis

The bag-of-words models developed here are highly interpretable. Let's investigate this by looking at the words that the Naive Bayes model identifies as the strongest signals of positive and negative sentiment.

In [24]:
# vocabulary_ maps each term to its column index in the TF-IDF matrix
vocab = nb_pipe.named_steps['vectorizer'].vocabulary_


In [29]:
# feature_log_prob_[c] holds log P(word | class c) for every vocabulary term
pos = nb_pipe.named_steps['classifier'].feature_log_prob_[1]
neg = nb_pipe.named_steps['classifier'].feature_log_prob_[0]


During training, the Naive Bayes model estimates probabilities such as $P(\textrm{heartwarming}\ |\ \textrm{positive}),$ the probability of the word "heartwarming" occurring in a review's text given that the review is positive. From these we can calculate a **polarity score** for each word, where a positive score marks a word as more typical of positive reviews and a negative score as more typical of negative ones:

$$\textrm{polarity}(word) = \log\left(\frac{P(word\ |\ \textrm{positive})}{P(word\ |\ \textrm{negative})}\right).$$


In [31]:
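# log P(word | positive) - log P(word | negative), by the logarithm quotient rule
polarity = pos - neg

# feature indices sorted by polarity, from most negative to most positive
indices = argsort(polarity)
top_positive = set(indices[-10:])
top_negative = set(indices[:10])

# words are printed in vocabulary order, not ranked by polarity
print("Most Positive Words:")
for word, index in vocab.items():
    if index in top_positive:
        print(word)

print("\nMost Negative Words:")
for word, index in vocab.items():
    if index in top_negative:
        print(word)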


Most Positive Words:
timeless
pleasantly
invaluable
downside
refreshing
heartwarming
captures
unforgettable
frankl
wiesel

Most Negative Words:
worst
waste
refund
poorly
redeeming
worthless
unwatchable
uninspiring
unreadable
disgrace
