# Amazon Reviews Classification Modelling using Logistic Regression and XGBoost
```
@inproceedings{marc_reviews,
    title={The Multilingual Amazon Reviews Corpus},
    author={Keung, Phillip and Lu, Yichao and Szarvas, György and Smith, Noah A.},
    booktitle={Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing},
    year={2020}
}
```

In [1]:
import sys
from scipy.stats import uniform, randint

In [2]:
sys.path.append("../../")

from src.data.dataset import load_amazon_dataset
from src.models import review_baseline, ml_classifier
from src.data.dataset import split_dataset
from src import utils



In [3]:
%load_ext autoreload
%autoreload 2

In [4]:
dataset_df = load_amazon_dataset(return_pandas=True, languages=["en"], use_stars=False, n_sample=5000)

Found cached dataset amazon_reviews_multi (/home/dqmis/.cache/huggingface/datasets/amazon_reviews_multi/default-18df3f9c3df27db5/1.0.0/724e94f4b0c6c405ce7e476a6c5ef4f87db30799ad49f765094cf9770e0f7609)


In [5]:
dataset_df

Unnamed: 0,review_body,language,label
160476,i like it a lot because it's very soft and com...,en,1
32693,The socks are cute but not usable for barre cl...,en,0
79958,I will pay much more attention to calendars or...,en,0
76366,"Mounts seem to be a good decent quality, but m...",en,0
122343,I found some old floppy disks with photos on t...,en,1
...,...,...,...
44148,Cute door mat but sheds like crazy.,en,0
142321,"Very pleased, great taste!!",en,1
148247,The trimmer work well with a bit noise but it'...,en,1
173244,My little granddaughter loved it.,en,1


In [6]:
# Split into train, validation, and test sets
train_df, val_df, test_df = split_dataset(dataset_df)

In [7]:
train_df

Unnamed: 0,review_body,language,label
16314,will not charge battery!,en,0
47745,The grounding strap that came with this item w...,en,0
5354,I followed the size chart and it was way too s...,en,0
29141,I wanted a DVD of this,en,0
77788,"Does not come with a unifying receiver, and wi...",en,0
...,...,...,...
3899,The top was cracked and leaking and the oil ha...,en,0
199631,"We Love Our Itzy Ritzy Mini, I purchased the m...",en,1
156551,A great idea to put a handle on a pumice stick...,en,1
35910,"My fault. Apparently, I didn't read it correct...",en,0


Next, we choose an evaluation metric and a baseline. We evaluate every model with the F1 score. As the baseline (`review_baseline`), we classify each review using the 10 most frequent words in each category.
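The implementation of `review_baseline.classify` is not shown in this notebook. As a rough illustration of the idea, the sketch below builds a keyword baseline from the top-k words per label. The names `tokenize` and `build_baseline`, the removal of words shared between labels, and the tie-breaking rule are all assumptions made for this example, not the project's code.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def build_baseline(texts, labels, k=10):
    """Return a classify(text) function based on the top-k words per label."""
    keywords = {}
    for label in sorted(set(labels)):
        counts = Counter(
            token
            for text, text_label in zip(texts, labels)
            if text_label == label
            for token in tokenize(text)
        )
        keywords[label] = {word for word, _ in counts.most_common(k)}

    # Assumption: words that are frequent in every label carry no signal, so drop them.
    shared = set.intersection(*keywords.values())
    keywords = {label: words - shared for label, words in keywords.items()}

    def classify(text):
        tokens = tokenize(text)
        # Score each label by how many tokens hit its keyword set.
        scores = {
            label: sum(token in words for token in tokens)
            for label, words in keywords.items()
        }
        # Ties (including "no keyword found") go to the smallest label.
        return max(sorted(scores), key=scores.get)

    return classify


# Toy data, invented for this example only.
toy_texts = [
    "love it great fit",
    "great price love the color",
    "broke after a week",
    "the zipper broke fast",
]
toy_labels = [1, 1, 0, 0]

classify = build_baseline(toy_texts, toy_labels)
print(classify("love the color"), classify("zipper broke"))
```

A baseline of this kind needs no model fitting beyond word counts, which makes it a useful lower bound for the learned classifiers.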

In [8]:
# Get predictions of a baseline model

true_values = test_df["label"].values
predictions = test_df["review_body"].apply(review_baseline.classify).values

In [9]:
test_df.groupby("label").count()

label,review_body,language
0,501,501
1,499,499


In [10]:
x_train, y_train = train_df["review_body"].values, train_df["label"].values
x_val, y_val = val_df["review_body"].values, val_df["label"].values
x_test, y_test = test_df["review_body"].values, test_df["label"].values

In [11]:
# Calculate metrics

for metric, value in utils.evaluate_model(true_values, predictions).items():
    print(f"{metric}: {value}")

f1: 0.6008869179600888
accuracy: 0.622
classification_report:               precision    recall  f1-score   support

           0       0.58      0.85      0.69       501
           1       0.72      0.39      0.51       499

    accuracy                           0.62      1000
   macro avg       0.65      0.62      0.60      1000
weighted avg       0.65      0.62      0.60      1000
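The reported `f1` value can be cross-checked against the per-class rows of the baseline report. The sketch below recomputes F1 from the rounded precision and recall values printed above and averages over both classes. The result is roughly 0.60, which matches the averaged rows rather than the positive-class F1 of 0.51. This suggests, but does not confirm, that `utils.evaluate_model` reports an F1 averaged over classes; the helper name `f1` is introduced here for illustration.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Rounded (precision, recall) pairs copied from the baseline classification report.
per_class = {0: (0.58, 0.85), 1: (0.72, 0.39)}

f1_scores = {label: f1(p, r) for label, (p, r) in per_class.items()}
macro_f1 = sum(f1_scores.values()) / len(f1_scores)

for label, score in f1_scores.items():
    print(f"class {label}: f1 = {score:.2f}")
print(f"macro f1 = {macro_f1:.2f}")
```

Because the inputs are already rounded to two decimals, the recomputed average differs slightly from the unrounded value printed by the notebook.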



Let's train and evaluate a logistic regression model.

In [12]:
# Logreg model
model = ml_classifier.MlClassifier(classifier_name="logreg")
model.fit(x_train, y_train)

# Calculate metrics
y_pred = model.predict(x_test)

for metric, value in utils.evaluate_model(y_test, y_pred).items():
    print(f"{metric}: {value}")

model.get_feature_importance()

f1: 0.820999820999821
accuracy: 0.821
classification_report:               precision    recall  f1-score   support

           0       0.82      0.82      0.82       501
           1       0.82      0.82      0.82       499

    accuracy                           0.82      1000
   macro avg       0.82      0.82      0.82      1000
weighted avg       0.82      0.82      0.82      1000



Weight?,Feature
+2.324,easy
+2.200,perfect
+1.952,love
+1.866,great
+1.762,loves
+1.633,excellent
+1.523,pleased
+1.513,highly
+1.394,loved
+1.375,satisfied
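The internals of `ml_classifier.MlClassifier` are not shown here. A table of per-word weights like the one above is what a bag-of-words model with a linear classifier would produce, so the sketch below assumes a TF-IDF plus logistic regression pipeline built with scikit-learn. The pipeline, the toy reviews, and the variable names are illustrative assumptions, not the project's implementation.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy reviews, invented for this example only (1 = positive, 0 = negative).
train_texts = [
    "great product love it",
    "perfect and easy to use",
    "love it excellent quality",
    "great value highly recommend",
    "broken on arrival want refund",
    "terrible waste of money",
    "did not work returned it",
    "poor quality broke quickly",
]
train_labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Vectorize the text, then fit a linear classifier on the word features.
pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression())
pipeline.fit(train_texts, train_labels)

vectorizer = pipeline.named_steps["tfidfvectorizer"]
classifier = pipeline.named_steps["logisticregression"]

# Map each vocabulary word to its learned coefficient.
weights = {
    word: classifier.coef_[0][index]
    for word, index in vectorizer.vocabulary_.items()
}
top_positive = sorted(weights, key=weights.get, reverse=True)[:5]
print(top_positive)

predictions = pipeline.predict(["love it great", "broken waste of money"])
print(predictions)
```

In a linear model of this kind, a positive coefficient pushes a review toward label 1 and a negative one toward label 0, which is why words such as "easy" and "perfect" rank highest in the table above.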


In [13]:
# XGBoost model
model = ml_classifier.MlClassifier(classifier_name="xgboost")
search_params = {
    "n_estimators": range(8, 20),
    "max_depth": range(3, 15),
    "learning_rate": [.4, .45, .5, .55, .6],
    "colsample_bytree": [.6, .7, .8, .9, 1]
}

# Hyperparameter search
model.hyperparam_search(x_train, y_train, search_params, n_iter=10)
model.fit(x_train, y_train)

# Calculate metrics
y_pred = model.predict(x_test)

for metric, value in utils.evaluate_model(y_test, y_pred).items():
    print(f"{metric}: {value}")

model.get_feature_importance()

f1: 0.8309511448808706
accuracy: 0.831
classification_report:               precision    recall  f1-score   support

           0       0.82      0.85      0.83       501
           1       0.84      0.82      0.83       499

    accuracy                           0.83      1000
   macro avg       0.83      0.83      0.83      1000
weighted avg       0.83      0.83      0.83      1000



Weight,Feature
0.0309,easy
0.0213,love
0.0197,received
0.0190,return
0.0159,money
0.0144,loves
0.0137,perfect
0.0130,nice
0.0124,exactly
0.0116,did
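The body of `hyperparam_search` is not shown. The `n_iter=10` argument suggests a randomized search that evaluates 10 sampled configurations instead of the full grid. The sketch below shows only the sampling step over the same search space, using the standard library. The function `sample_candidates` is hypothetical, and it samples with replacement, which may differ from what the project or scikit-learn does.

```python
import math
import random

# Search space copied from the XGBoost cell above.
search_params = {
    "n_estimators": range(8, 20),
    "max_depth": range(3, 15),
    "learning_rate": [0.4, 0.45, 0.5, 0.55, 0.6],
    "colsample_bytree": [0.6, 0.7, 0.8, 0.9, 1],
}


def sample_candidates(space, n_iter, seed=0):
    """Draw n_iter random configurations (with replacement) from the space."""
    rng = random.Random(seed)
    return [
        {name: rng.choice(list(values)) for name, values in space.items()}
        for _ in range(n_iter)
    ]


candidates = sample_candidates(search_params, n_iter=10)

# Size of the full grid that an exhaustive search would have to cover.
grid_size = math.prod(len(list(values)) for values in search_params.values())

print(f"full grid: {grid_size} configurations, sampled: {len(candidates)}")
```

With 12 × 12 × 5 × 5 = 3600 possible configurations, 10 samples cover well under 1% of the grid, so the reported XGBoost score depends on which configurations happen to be drawn.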


In [14]:
# Naive Bayes model
model = ml_classifier.MlClassifier(classifier_name="nb")
model.fit(x_train, y_train)

# Calculate metrics
y_pred = model.predict(x_test)

for metric, value in utils.evaluate_model(y_test, y_pred).items():
    print(f"{metric}: {value}")

model.get_feature_importance()

f1: 0.8158556308145586
accuracy: 0.816
classification_report:               precision    recall  f1-score   support

           0       0.84      0.79      0.81       501
           1       0.80      0.85      0.82       499

    accuracy                           0.82      1000
   macro avg       0.82      0.82      0.82      1000
weighted avg       0.82      0.82      0.82      1000



## Test on multiple languages

In [15]:
dataset_df = load_amazon_dataset(return_pandas=True, languages=["en", "de", "es"], use_stars=False, n_sample=5000)

Downloading and preparing dataset amazon_reviews_multi/default to /home/dqmis/.cache/huggingface/datasets/amazon_reviews_multi/default-900fce4a1c2f2d48/1.0.0/724e94f4b0c6c405ce7e476a6c5ef4f87db30799ad49f765094cf9770e0f7609...



Dataset amazon_reviews_multi downloaded and prepared to /home/dqmis/.cache/huggingface/datasets/amazon_reviews_multi/default-900fce4a1c2f2d48/1.0.0/724e94f4b0c6c405ce7e476a6c5ef4f87db30799ad49f765094cf9770e0f7609. Subsequent calls will reuse this data.


In [16]:
# Split into train, validation, and test sets
train_df, val_df, test_df = split_dataset(dataset_df)

In [17]:
x_train, y_train = train_df["review_body"].values, train_df["label"].values
x_val, y_val = val_df["review_body"].values, val_df["label"].values
x_test, y_test = test_df["review_body"].values, test_df["label"].values