## Amazon Sentiment Analysis

This project experiments with machine learning text classification models on the [Amazon reviews](https://huggingface.co/datasets/amazon_polarity) dataset. The goal is to classify each review as positive or negative based on its text alone.

References:
- [Hugging Face: "Getting Started with Sentiment Analysis using Python"](https://huggingface.co/blog/sentiment-analysis-python)

In [41]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn import metrics
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.model_selection import GridSearchCV

import spacy
from spacy.lang.en.stop_words import STOP_WORDS

from datasets import load_dataset


In [1]:
# Download dataset from Hugging Face
dataset = load_dataset("amazon_polarity")


Found cached dataset amazon_polarity (/Users/user/.cache/huggingface/datasets/amazon_polarity/amazon_polarity/3.0.0/a27b32b7e7b88eb274a8fa8ba0f654f1fe998a87c22547557317793b5d2772dc)



In [2]:
train = dataset['train'].to_pandas()
test = dataset['test'].to_pandas()


In [10]:
print(f'Train set: {train.shape}')
print(f'Test set: {test.shape}')


Train set: (3600000, 3)
Test set: (400000, 3)


In [4]:
train.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3600000 entries, 0 to 3599999
Data columns (total 3 columns):
 #   Column   Dtype 
---  ------   ----- 
 0   label    int64 
 1   title    object
 2   content  object
dtypes: int64(1), object(2)
memory usage: 82.4+ MB


In [5]:
train.head()


| | label | title | content |
|---|---|---|---|
| 0 | 1 | Stuning even for the non-gamer | This sound track was beautiful! It paints the ... |
| 1 | 1 | The best soundtrack ever to anything. | I'm reading a lot of reviews saying that this ... |
| 2 | 1 | Amazing! | This soundtrack is my favorite music of all ti... |
| 3 | 1 | Excellent Soundtrack | I truly like this soundtrack and I enjoy video... |
| 4 | 1 | Remember, Pull Your Jaw Off The Floor After He... | If you've played the game, you know how divine... |


Let's start with a simple Naive Bayes classifier as the baseline. Because the training set has 3.6 million rows, we'll also create a smaller training subset (the first 5% of rows) for faster iteration. Evaluation still uses the full test set.

In [44]:
# Separate the review text (features) from the sentiment label (target)
X_train = train['content']
y_train = train['label']

X_test = test['content']
y_test = test['label']

# First 5% of the training rows, for faster training during early experiments
X_train_small = X_train[:len(X_train)//20]
y_train_small = y_train[:len(X_train)//20]
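
Slicing the first rows is the quickest way to get a subset, but it only gives a representative sample if the rows are already shuffled. As a hedged alternative, the sketch below draws a shuffled, label-stratified 5% sample with `train_test_split`. The `toy` DataFrame is a hypothetical stand-in for `train`, not the real data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the real `train` DataFrame
toy = pd.DataFrame({
    "content": [f"review {i}" for i in range(200)],
    "label": [0] * 100 + [1] * 100,
})

# Keep 1/20 of the rows, shuffled, with the same label ratio as the full set
X_small, _, y_small, _ = train_test_split(
    toy["content"], toy["label"],
    train_size=len(toy) // 20,
    stratify=toy["label"],
    random_state=42,
)

print(len(X_small))
print(y_small.value_counts().to_dict())
```

On the real data, the same call could be applied to `train['content']` and `train['label']`. Whether it changes the results compared with plain slicing would need to be checked.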


We'll use `TfidfVectorizer` to turn each review into a vector of word counts re-weighted by inverse document frequency, which down-weights words that appear in most documents (such as "the", "and", "is").
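
To make the re-weighting concrete, here is a minimal sketch on a made-up three-document corpus (the sentences are illustrative, not taken from the dataset). It prints the inverse document frequency that `TfidfVectorizer` learns for each term, using the vectorizer's default settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up mini corpus for illustration only
docs = [
    "the battery is great",
    "the screen is dull",
    "the battery died fast",
]

vectorizer = TfidfVectorizer()
vectorizer.fit(docs)

# Map each vocabulary term to its learned inverse document frequency
idf = {term: vectorizer.idf_[idx] for term, idx in vectorizer.vocabulary_.items()}

# Terms that occur in more documents receive a lower weight
for term, weight in sorted(idf.items(), key=lambda kv: kv[1]):
    print(f"{term:10s} {weight:.3f}")
```

"the" appears in every document and gets the lowest weight, while a term that appears in only one document, such as "great", gets the highest.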

In [45]:
nb_pipe = Pipeline([('vectorizer', TfidfVectorizer()),
                    ('classifier', MultinomialNB())])

nb_pipe.fit(X_train_small, y_train_small)
print(classification_report(y_test, nb_pipe.predict(X_test)))


              precision    recall  f1-score   support

           0       0.80      0.85      0.83    200000
           1       0.84      0.78      0.81    200000

    accuracy                           0.82    400000
   macro avg       0.82      0.82      0.82    400000
weighted avg       0.82      0.82      0.82    400000



Next we'll add a layer of complexity by including bigrams as features and removing stop words.

In [46]:
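# Not used by the pipelines below, which only need spaCy's stop-word list
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner', 'textcat'])

# Add the fragments left when contractions such as "we'll" and "I've" are tokenized
STOP_WORDS = STOP_WORDS.union({'ll', 've'})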

nb_pipe = Pipeline([('vectorizer', TfidfVectorizer(ngram_range=(1, 2),
                                                   stop_words=list(STOP_WORDS))),
                    ('classifier', MultinomialNB())])

nb_pipe.fit(X_train_small, y_train_small)
print(classification_report(y_test, nb_pipe.predict(X_test)))


              precision    recall  f1-score   support

           0       0.84      0.86      0.85    200000
           1       0.85      0.84      0.84    200000

    accuracy                           0.85    400000
   macro avg       0.85      0.85      0.85    400000
weighted avg       0.85      0.85      0.85    400000
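
To see what `ngram_range=(1, 2)` and `stop_words` do to the feature set, the sketch below fits a vectorizer on a single made-up sentence. The three-word stop list is hypothetical and much shorter than spaCy's `STOP_WORDS`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up sentence and a tiny hypothetical stop-word list, for illustration only
docs = ["this is not a good product"]
vectorizer = TfidfVectorizer(ngram_range=(1, 2), stop_words=["this", "is", "a"])
vectorizer.fit(docs)

# Unigrams and bigrams that survive stop-word removal
features = sorted(vectorizer.vocabulary_)
print(features)
```

Bigrams are built from the tokens that remain after stop-word removal, so "not good" becomes a feature of its own. This only happens because "not" is absent from the hypothetical list above, so it may be worth checking whether the stop-word list in use contains negations.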



Adding bigrams and removing stop words improved accuracy from 0.82 to 0.85. Next, let's try a linear classifier trained with stochastic gradient descent.

In [47]:
sgd_pipe = Pipeline([('vectorizer', TfidfVectorizer(ngram_range=(1, 2),
                                                    stop_words=list(STOP_WORDS))),
                     ('classifier', SGDClassifier(max_iter=50))])

sgd_pipe.fit(X_train_small, y_train_small)
print(classification_report(y_test, sgd_pipe.predict(X_test)))


              precision    recall  f1-score   support

           0       0.84      0.85      0.84    200000
           1       0.85      0.84      0.84    200000

    accuracy                           0.84    400000
   macro avg       0.84      0.84      0.84    400000
weighted avg       0.84      0.84      0.84    400000



With its default hyperparameters, `SGDClassifier` performs slightly worse than the `MultinomialNB` model (0.84 versus 0.85 accuracy). Let's use `GridSearchCV` to tune its regularization strength, `alpha`.

In [48]:
params = {'classifier__alpha': (0.001, 0.0001, 0.00001),
          # TODO 'classifier__loss': ('log_loss', 'hinge'), # log_loss = Logistic Regression, hinge = Linear SVM
          }

sgd_grid = GridSearchCV(sgd_pipe, params, cv=3, verbose=True)
sgd_grid.fit(X_train_small, y_train_small)


Fitting 3 folds for each of 3 candidates, totalling 9 fits



In [49]:
print(classification_report(y_test, sgd_grid.predict(X_test)))


              precision    recall  f1-score   support

           0       0.88      0.86      0.87    200000
           1       0.87      0.88      0.87    200000

    accuracy                           0.87    400000
   macro avg       0.87      0.87      0.87    400000
weighted avg       0.87      0.87      0.87    400000



In [51]:
sgd_grid.best_params_


{'classifier__alpha': 1e-05}

This quick grid search raised accuracy to 0.87. The best `alpha` (1e-05) sits at the lower edge of the grid, which suggests that even smaller values are worth trying. Future work is to widen the range of alphas and compare logistic regression against a linear SVM within `SGDClassifier`, but for now we'll continue with fast exploratory iterations.