# Amazon Reviews Classification Modelling using Logistic Regression, XGBoost, and Naive Bayes
```
@inproceedings{marc_reviews,
    title={The Multilingual Amazon Reviews Corpus},
    author={Keung, Phillip and Lu, Yichao and Szarvas, György and Smith, Noah A.},
    booktitle={Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing},
    year={2020}
}
```

In [1]:
import sys
from typing import Callable

In [96]:
sys.path.append("/Users/dqmis/github/nlp-classification/")

from src.review import utils
from src.review.data.dataset import load_dataset, split_dataset
from src.review.models import baseline, ml_classifier

In [14]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [51]:
dataset_df = load_dataset(return_pandas=True, languages=["en"], use_stars=False)

Found cached dataset amazon_reviews_multi (/Users/dqmis/.cache/huggingface/datasets/amazon_reviews_multi/default-18df3f9c3df27db5/1.0.0/724e94f4b0c6c405ce7e476a6c5ef4f87db30799ad49f765094cf9770e0f7609)


In [52]:
dataset_df

index,review_body,language,label
0,Arrived broken. Manufacturer defect. Two of th...,en,0.0
1,the cabinet dot were all detached from backing...,en,0.0
2,I received my first order of this product and ...,en,0.0
3,This product is a piece of shit. Do not buy. D...,en,0.0
4,went through 3 in one day doesn't fit correct ...,en,0.0
...,...,...,...
199995,"Cute slippers, my MIL loved them.",en,1.0
199996,My 6 year old likes this and keeps him engaged...,en,1.0
199997,Replaced my battery with it. Works like new.,en,1.0
199998,"I like them, holding up well.",en,1.0


In [53]:
# Split into train, validation, and test sets
train_df, val_df, test_df = split_dataset(dataset_df)

In [55]:
train_df

index,review_body,language,label
17047,I can’t go a song without my phone saying the ...,en,0.0
159950,"I love that the box sorts 4 cash values, sorts...",en,1.0
23294,Had to return for refund minus shipping cost. ...,en,0.0
135704,I love the product material is strong and dura...,en,1.0
29929,Don't buy this product!!! After using these ca...,en,0.0
...,...,...,...
174028,I purchased this from Atoz merchant. At first ...,en,1.0
1885,Extremely cheap- worth $1-2 max. I will not pi...,en,0.0
60746,I love this blanket. Only problem is I was shi...,en,0.0
44421,These lights are really great in terms of outp...,en,0.0


Before training any model, we need to fix an evaluation metric and a baseline. As the metric, we use the F1 score. As the baseline, we classify each review using the 10 most frequent words in each category.
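
The baseline itself lives in `baseline.classify` and is not shown in this notebook. As a rough illustration of the idea only, the sketch below builds a frequent-word classifier from labelled texts. The names `tokenize`, `top_words`, and `build_baseline` are hypothetical, and both dropping words shared between classes and breaking ties toward the lowest label are assumptions, not the project's actual behaviour.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def top_words(texts, k=10):
    """Return the set of the k most frequent tokens across the texts."""
    counts = Counter(tok for text in texts for tok in tokenize(text))
    return {word for word, _ in counts.most_common(k)}


def build_baseline(texts, labels, k=10):
    """Build a classifier that scores a text by its overlap with each
    class's top-k word set."""
    by_label = {}
    for text, label in zip(texts, labels):
        by_label.setdefault(label, []).append(text)
    vocab = {label: top_words(docs, k) for label, docs in by_label.items()}

    # Assumption: a word that is frequent in more than one class carries
    # no signal (e.g. "the"), so it is removed from every class.
    membership = Counter(w for words in vocab.values() for w in words)
    vocab = {
        label: {w for w in words if membership[w] == 1}
        for label, words in vocab.items()
    }

    def classify(text):
        tokens = set(tokenize(text))
        scores = {label: len(tokens & words) for label, words in vocab.items()}
        # Assumption: ties go to the lowest label.
        return max(sorted(scores), key=scores.get)

    return classify


toy_texts = [
    "the item arrived broken",
    "broken and bad the",
    "the item is great",
    "great love the",
]
toy_labels = [0, 0, 1, 1]
toy_classify = build_baseline(toy_texts, toy_labels)
```

Here `toy_classify` plays the role that `baseline.classify` plays in the next cell: a function from one review text to one label.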

In [56]:
# Get predictions from the baseline model

true_values = test_df["label"].values
predictions = test_df["review_body"].apply(baseline.classify).values

In [61]:
test_df.groupby("label").count()

label,review_body,language
0.0,15874,15874
1.0,16126,16126


In [60]:
# Calculate metrics

for metric, value in utils.evaluate_model(true_values, predictions).items():
    print(f"{metric}: {value}")

f1: 0.6114161360403259
accuracy: 0.63315625
classification_report:               precision    recall  f1-score   support

         0.0       0.59      0.88      0.70     15874
         1.0       0.76      0.39      0.52     16126

    accuracy                           0.63     32000
   macro avg       0.68      0.64      0.61     32000
weighted avg       0.68      0.63      0.61     32000
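
`utils.evaluate_model` is defined elsewhere in the repository. To show how the per-class rows of the report relate to a single F1 number, here is a minimal pure-Python sketch of macro-averaged F1 and accuracy. The notebook does not state which averaging mode `evaluate_model` uses, so macro averaging is an assumption, and `f1_for_class`, `macro_f1`, and `accuracy` are hypothetical helper names.

```python
def f1_for_class(y_true, y_pred, cls):
    """F1 for one class: 2*TP / (2*TP + FP + FN)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == cls and p == cls)
    fp = sum(1 for t, p in pairs if t != cls and p == cls)
    fn = sum(1 for t, p in pairs if t == cls and p != cls)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores (assumed averaging mode)."""
    classes = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


toy_true = [0, 0, 1, 1]
toy_pred = [0, 1, 1, 1]
toy_macro_f1 = macro_f1(toy_true, toy_pred)
toy_accuracy = accuracy(toy_true, toy_pred)
```

Applied to the rounded per-class values in the report above, (0.70 + 0.52) / 2 = 0.61, which agrees with the `macro avg` row.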



Next, we train and evaluate a logistic regression model, followed by XGBoost and Naive Bayes models on the same split.

In [97]:
# Logistic regression model
model = ml_classifier.MlClassifier()
model.fit(train_df["review_body"].values, train_df["label"].values)

# Calculate metrics
true_values = test_df["label"].values
predictions = model.predict(test_df["review_body"].values)

for metric, value in utils.evaluate_model(true_values, predictions).items():
    print(f"{metric}: {value}")

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


f1: 0.8575891938277138
accuracy: 0.85759375
classification_report:               precision    recall  f1-score   support

         0.0       0.85      0.86      0.86     15874
         1.0       0.86      0.86      0.86     16126

    accuracy                           0.86     32000
   macro avg       0.86      0.86      0.86     32000
weighted avg       0.86      0.86      0.86     32000
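
The convergence warning above comes from scikit-learn's logistic regression solver stopping at its iteration limit. The internals of `MlClassifier` are not shown in this notebook, so the following is only a sketch of one common setup: a TF-IDF vectorizer feeding `LogisticRegression`, with `max_iter` raised as the warning suggests. The function name `build_logreg_pipeline` and the value 1000 are illustrative assumptions, not the project's configuration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def build_logreg_pipeline(max_iter=1000):
    """TF-IDF features followed by logistic regression.

    Assumption: a higher max_iter than the scikit-learn default gives the
    solver room to converge on sparse text features.
    """
    return make_pipeline(
        TfidfVectorizer(),
        LogisticRegression(max_iter=max_iter),
    )


# Toy data, only to show the fit/predict interface.
toy_texts = [
    "broken bad refund",
    "bad broken junk",
    "great love it",
    "love great works",
] * 5
toy_labels = [0, 0, 1, 1] * 5

toy_model = build_logreg_pipeline()
toy_model.fit(toy_texts, toy_labels)
toy_predictions = list(toy_model.predict(["broken junk", "love it"]))
```

Whether a higher iteration limit changes the scores reported above is untested here; it addresses the warning only.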



In [98]:
# XGBoost model
model = ml_classifier.MlClassifier(classifier="xgboost")
model.fit(train_df["review_body"].values, train_df["label"].values)

# Calculate metrics
true_values = test_df["label"].values
predictions = model.predict(test_df["review_body"].values)

for metric, value in utils.evaluate_model(true_values, predictions).items():
    print(f"{metric}: {value}")

f1: 0.8281298637526301
accuracy: 0.8281875
classification_report:               precision    recall  f1-score   support

         0.0       0.81      0.85      0.83     15874
         1.0       0.85      0.80      0.82     16126

    accuracy                           0.83     32000
   macro avg       0.83      0.83      0.83     32000
weighted avg       0.83      0.83      0.83     32000



In [99]:
# Naive Bayes model
model = ml_classifier.MlClassifier(classifier="nb")
model.fit(train_df["review_body"].values, train_df["label"].values)

# Calculate metrics
true_values = test_df["label"].values
predictions = model.predict(test_df["review_body"].values)

for metric, value in utils.evaluate_model(true_values, predictions).items():
    print(f"{metric}: {value}")

f1: 0.8407200747630316
accuracy: 0.840875
classification_report:               precision    recall  f1-score   support

         0.0       0.86      0.82      0.84     15874
         1.0       0.83      0.87      0.85     16126

    accuracy                           0.84     32000
   macro avg       0.84      0.84      0.84     32000
weighted avg       0.84      0.84      0.84     32000

