# NLP Amazon Review Classifier

This project uses three NLP techniques to build classifiers that assign Amazon reviews to their product category (e.g. toys or beauty products). The first two techniques feed text features into a Support Vector Machine; the third fine-tunes a DistilBERT model. It is a personal project for learning and practicing NLP techniques, machine learning, and large language models. The three techniques are:

1. Bag of Words
2. Word Vectors
3. DistilBERT

In [2]:
import os
import numpy as np
import random
from sklearn import svm
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.preprocessing import LabelEncoder

import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from textblob import TextBlob
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

import torch
from torch.utils.data import Dataset, DataLoader
from transformers import Trainer, TrainingArguments
from transformers import EvalPrediction

from datasets import load_metric

import accelerate
import json

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\jiali\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\jiali\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


### Load In Data

In [3]:
def load_reviews(directory, suffix):
    reviews = []
    categories = []

    for file in os.listdir(directory):
        if not file.endswith(suffix):
            continue
        
        category = file[:-len(suffix)]
        categories.append(category)
        with open(f'{directory}/{file}') as f:
            for line in f:
                review_json = json.loads(line)
                review = {
                    "review_text": review_json.get('reviewText', ''),
                    "category": category                    
                }

                reviews.append(review)
        
    return reviews, categories

In [4]:
train_reviews, categories = load_reviews('./data/train_data', '_train.json')
test_reviews, _ = load_reviews('./data/test_data', '_test.json')

In [5]:
train_reviews[0]

{'review_text': 'It is beautiful', 'category': 'all_beauty'}

#### Train Model (Bag of words)
Bag of words converts each document into a vector with one entry per vocabulary term. Specifically, I used CountVectorizer over stemmed unigrams and bigrams; with binary=True, each entry records whether a term occurs in the document rather than how often. This captures which words appear, but not their meaning or their relationship to surrounding words beyond adjacent pairs. It serves as a baseline for the more advanced classifiers that follow.
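To make the representation concrete, here is a minimal pure-Python sketch of a binary bag of words over unigrams and bigrams. It only illustrates the idea behind `CountVectorizer(binary=True, ngram_range=(1,2))`; it is not scikit-learn's implementation. The helper names (`ngrams`, `fit_vocabulary`, `transform`) and the toy corpus are made up for this example, and whitespace splitting stands in for the stemming tokenizer.

```python
def ngrams(tokens, n_min=1, n_max=2):
    """Return all n-grams (joined by spaces) for n in [n_min, n_max]."""
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def fit_vocabulary(corpus):
    """Map every n-gram seen in the corpus to a column index (sorted for stability)."""
    vocab = sorted({g for doc in corpus for g in ngrams(doc.lower().split())})
    return {gram: idx for idx, gram in enumerate(vocab)}


def transform(corpus, vocabulary):
    """Binary presence vectors: 1 if the n-gram occurs in the document, else 0."""
    rows = []
    for doc in corpus:
        present = set(ngrams(doc.lower().split()))
        rows.append([1 if gram in present else 0 for gram in vocabulary])
    return rows


# Hypothetical two-review corpus, for illustration only
toy_corpus = ["great phone case", "great dog toy"]
vocabulary = fit_vocabulary(toy_corpus)
vectors = transform(toy_corpus, vocabulary)

print(list(vocabulary))
for row in vectors:
    print(row)
```

The two toy reviews share only the column for "great"; every other column separates them, which is the kind of signal a linear SVM can pick up.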

In [7]:
# Define custom function for stemming words
def stemmed_words(doc):
    stemmer = PorterStemmer()
    return [stemmer.stem(w) for w in word_tokenize(doc)]

# Define Part of Speech (POS) tag function
def pos_tag(text):
    blob = TextBlob(text)
    return ' '.join([word + '_' + tag for word, tag in blob.tags])

In [8]:
# Train corpus
train_corpus = [review.get('review_text') for review in train_reviews]
train_label = [review.get('category') for review in train_reviews]

# Test corpus
test_corpus = [review.get('review_text') for review in test_reviews]
test_label = [review.get('category') for review in test_reviews]

In [53]:
from sklearn import svm

vectorizer = CountVectorizer(binary=True, ngram_range=(1,2), tokenizer=stemmed_words)
train_x = vectorizer.fit_transform(train_corpus) # training text converted to vector

clf_svm_bow = svm.SVC(kernel='linear')
clf_svm_bow.fit(train_x, train_label)

#### Evaluate Performance (Bag of words)

In [54]:
# make sure to convert test text to vector form
test_x = vectorizer.transform(test_corpus)

In [55]:
print("Overall Accuracy:", clf_svm_bow.score(test_x, test_label))

y_pred = clf_svm_bow.predict(test_x)

print("F1 Scores by category")
# `categories` comes from load_reviews(); `test_categories` was never defined
print(categories)
print(f1_score(test_label, y_pred, average=None, labels=categories))

Overall Accuracy: 0.5441
f1 scores by category
['all_beauty', 'automotive', 'cell_phones_and_accessories', 'office_products', 'pet_supplies', 'sports_and_outdoors', 'tools_and_home_improvement', 'toys_and_games']
[0.8260789  0.41585253 0.55591665 0.47572634 0.67700987 0.37486218
 0.36693548 0.62478485]
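The per-category scores above come from `f1_score(..., average=None)`, which reports one F1 value per label. As a rough sketch of what that number means, the snippet below computes per-class F1 by hand from true-positive, false-positive, and false-negative counts. The function name `per_class_f1` and the toy labels are hypothetical and only serve the illustration.

```python
def per_class_f1(y_true, y_pred, labels):
    """Compute F1 = 2*TP / (2*TP + FP + FN) separately for each label."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # Define F1 as 0.0 when the label never appears in truth or predictions
        scores.append(2 * tp / denom if denom else 0.0)
    return scores


# Hypothetical labels and predictions, for illustration only
toy_true = ["toys", "toys", "beauty", "beauty", "pets"]
toy_pred = ["toys", "beauty", "beauty", "beauty", "toys"]
toy_scores = per_class_f1(toy_true, toy_pred, ["beauty", "pets", "toys"])
print(toy_scores)
```

A category can score low either because its reviews are mislabeled as something else (false negatives) or because other reviews are pulled into it (false positives); the per-class view shows which categories suffer most.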


### Training Model 2 (Word Vector)
In this second model, I embed the Amazon reviews in the pre-trained vector space of spaCy's en_core_web_md, a medium-sized model. Unlike bag of words, which treats each word as an independent entity, word vectors capture relationships between words: words with similar meanings lie close to each other in the trained vector space. Each review is represented by the average of its token vectors (spaCy's Doc.vector).
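The idea of "close in vector space" can be sketched without spaCy. The example below averages word vectors into a document vector and compares documents by cosine similarity. The 3-dimensional embeddings are invented for illustration (the real en_core_web_md vectors have 300 dimensions), and unlike spaCy this sketch simply skips unknown words.

```python
import math

# Hypothetical 3-dimensional embeddings, made up for this example
toy_embeddings = {
    "dog":   [0.9, 0.1, 0.0],
    "puppy": [0.8, 0.2, 0.0],
    "phone": [0.0, 0.1, 0.9],
    "toy":   [0.5, 0.5, 0.1],
}


def document_vector(text):
    """Average the vectors of the known words in the text."""
    vectors = [toy_embeddings[w] for w in text.lower().split() if w in toy_embeddings]
    if not vectors:
        return [0.0, 0.0, 0.0]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


sim_related = cosine_similarity(document_vector("dog toy"), document_vector("puppy toy"))
sim_unrelated = cosine_similarity(document_vector("dog toy"), document_vector("phone"))
print(sim_related, sim_unrelated)
```

Although "dog" and "puppy" are different strings, their documents end up close together, which a bag-of-words model cannot express.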

In [2]:
import spacy

nlp = spacy.load("en_core_web_md")

In [14]:
train_corpus[:3]

['It is beautiful',
 "Wonderful product and quick delivery!  I couldn't be happier",
 "Smells great with out the allergy issues.  I wish I didn't have to have required number of words.  What more can I say about soap?"]

In [36]:
# POS-tag both the training and test corpora to try adding more sentiment information.
# NOT needed for this classification task; uncomment the two lines below to run the next cell.
# train_corpus_POS = [pos_tag(review.get('review_text')) for review in train_reviews]
# test_corpus_POS = [pos_tag(review.get('review_text')) for review in test_reviews]

In [37]:
test_corpus_POS[0]

'Great_NNP body_NN wash_NN for_IN sensitive_JJ skin_NN Definitely_RB works_VBZ for_IN me_PRP Leaves_NNS skin_VBP moisturized_VBN and_CC clean_JJ Will_MD repurchase_VB for_IN sure_JJ PS_NN the_DT candy_NN scent_NN is_VBZ very_RB pleasant_JJ as_IN well_RB'

In [15]:
docs = [nlp(text) for text in train_corpus]
train_x_word_vectors = [x.vector for x in docs]

In [18]:
# Support Vector Classifier to classify amazon reviews by merchant categories
clf_svm_wv = svm.SVC(kernel='linear')
clf_svm_wv.fit(train_x_word_vectors, train_label)

In [19]:
# Convert the test corpus to vectors in the same embedding space
test_docs_wv = [nlp(text) for text in test_corpus]
test_x_wv = [x.vector for x in test_docs_wv]

In [22]:
print("Overall Accuracy:", clf_svm_wv.score(test_x_wv, test_label))

y_pred = clf_svm_wv.predict(test_x_wv)

print("F1 Scores by category")
print(categories)
print(f1_score(test_label, y_pred, average=None, labels=categories))

Overall Accuracy: 0.61255
f1 scores by category
['all_beauty', 'automotive', 'cell_phones_and_accessories', 'office_products', 'pet_supplies', 'sports_and_outdoors', 'tools_and_home_improvement', 'toys_and_games']
[0.79251701 0.50392749 0.63906498 0.55746078 0.75036557 0.48649763
 0.49846782 0.66923077]


### Training Model 3 (DistilBERT Transformer Model)

Better contextual understanding: with models like BERT or DistilBERT, the embeddings are contextually informed, meaning the vector representation of a word changes with the surrounding words. This is a key advantage over earlier models like Word2Vec, where a word always has the same vector regardless of context. I tried both BERT and DistilBERT when building this classifier; the code below uses DistilBERT, which runs faster on my local machine and consumes fewer resources while reportedly retaining about 97% of BERT's language-understanding performance.

In [50]:
def load_smaller_reviews(directory, suffix):
    # Sample fewer reviews from each Amazon category to reduce computation time.
    # The sample size (250 per category) can be raised to expand the training set.
    reviews = []
    categories = []

    for file in os.listdir(directory):
        if not file.endswith(suffix):
            continue
        
        category = file[:-len(suffix)]
        categories.append(category)

        temp_reviews = []
        with open(f'{directory}/{file}') as f:
            for line in f:
                review_json = json.loads(line)
                review = {
                    "review_text": review_json.get('reviewText', ''),
                    "category": category                    
                }

                temp_reviews.append(review)
        
        temp_reviews = random.sample(temp_reviews, 250)
        reviews.extend(temp_reviews)
        
    return reviews, categories

In [52]:
train_x_dbert, categories = load_smaller_reviews('./data/train_data', '_train.json')
test_x_debert, _ = load_smaller_reviews('./data/test_data', '_test.json')
len(test_x_debert)

2000

In [53]:
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification

tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=8)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [76]:
def tokenize(data):
    # Initialize lists to hold tokenization outputs
    input_ids = []
    attention_masks = []

    for item in data:
        # Tokenize the review text
        tokenized_output = tokenizer(item['review_text'], padding='max_length', truncation=True, max_length=256)

        # Append the tokenized outputs to the lists
        input_ids.append(tokenized_output['input_ids'])
        attention_masks.append(tokenized_output['attention_mask'])

    # Return a dictionary with the concatenated results
    return {
        'input_ids': input_ids,
        'attention_mask': attention_masks
    }

tokenized_review = tokenize(train_x_dbert)
tokenized_test_review = tokenize(test_x_debert)
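The `padding='max_length'` and `truncation=True` arguments make every review exactly `max_length` token ids long, and the attention mask marks which positions hold real tokens. The sketch below mimics that behavior in plain Python. It is a simplification under stated assumptions: `pad_or_truncate` is a hypothetical helper, the ids other than 101, 102, and 0 are arbitrary, and the real tokenizer keeps the closing special token when it truncates, which this sketch does not.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = token_ids[:max_length]                 # truncation: drop ids past max_length
    padding = [pad_id] * (max_length - len(kept)) # padding: fill the remainder
    input_ids = kept + padding
    attention_mask = [1] * len(kept) + [0] * len(padding)  # 1 = real token, 0 = padding
    return input_ids, attention_mask


# A short sequence gets padded (101 and 102 stand in for the start/end special tokens)
short_ids, short_mask = pad_or_truncate([101, 2307, 102], max_length=6)
print(short_ids, short_mask)

# A long sequence gets cut; note the simplification: the trailing 102 is lost here
long_ids, long_mask = pad_or_truncate([101, 5, 6, 7, 8, 102], max_length=4)
print(long_ids, long_mask)
```

With `max_length=256`, short reviews are mostly padding and long reviews lose their tail, so this setting trades memory and speed against how much of each review the model sees.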

In [77]:
class AmazonReviewsDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx], dtype=torch.long)  
        return item

    def __len__(self):
        return len(self.labels)


# Convert labels to numeric values
train_labels = [review['category'] for review in train_x_dbert]
label_encoder = LabelEncoder()
train_numeric_labels = label_encoder.fit_transform(train_labels)

test_labels = [review['category'] for review in test_x_debert]
test_numeric_labels = label_encoder.transform(test_labels)


# Create train and test dataset
train_x_dbert_final = AmazonReviewsDataset(tokenized_review, train_numeric_labels)
test_x_dbert_final = AmazonReviewsDataset(tokenized_test_review, test_numeric_labels)

# DataLoader
loader = DataLoader(train_x_dbert_final, batch_size=8, shuffle=True)

In [58]:
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=2,
    per_device_train_batch_size=8,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_x_dbert_final
)

trainer.train()

  0%|          | 0/500 [00:00<?, ?it/s]

{'loss': 1.5263, 'grad_norm': 10.600296974182129, 'learning_rate': 0.0, 'epoch': 2.0}
{'train_runtime': 2184.0831, 'train_samples_per_second': 1.831, 'train_steps_per_second': 0.229, 'train_loss': 1.5263165283203124, 'epoch': 2.0}


TrainOutput(global_step=500, training_loss=1.5263165283203124, metrics={'train_runtime': 2184.0831, 'train_samples_per_second': 1.831, 'train_steps_per_second': 0.229, 'train_loss': 1.5263165283203124, 'epoch': 2.0})
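The log above reports `learning_rate: 0.0` at step 500. One possible reading, assuming the Trainer's default linear schedule with warmup and its default base learning rate of 5e-5 (the notebook overrides neither), is that `warmup_steps=500` covers all 500 training steps, so the learning rate was still ramping up for the whole run. The function below is a simplified sketch of such a schedule, not the library's code; `linear_schedule_lr` is a hypothetical name.

```python
def linear_schedule_lr(step, base_lr, warmup_steps, total_steps):
    """Linear warmup from 0 to base_lr, then linear decay back to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


BASE_LR = 5e-5  # assumed default; not set explicitly in the notebook

# Settings from the notebook: 500 warmup steps out of 500 total steps
lr_mid = linear_schedule_lr(250, BASE_LR, warmup_steps=500, total_steps=500)
lr_end = linear_schedule_lr(500, BASE_LR, warmup_steps=500, total_steps=500)

# A hypothetical alternative: warm up for only 10% of the run
lr_peak_short_warmup = linear_schedule_lr(50, BASE_LR, warmup_steps=50, total_steps=500)

print(lr_mid, lr_end, lr_peak_short_warmup)
```

Under these assumptions the model never trains at the full learning rate, so a shorter warmup (or more epochs) is one hyperparameter change that may be worth trying.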

In [70]:
model.save_pretrained('./amazon_review_classifier_300')
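# Save the label classes so predictions can be decoded later.
# np.save returns None, so its result must not be assigned back to classes_.
np.save('label_classes.npy', label_encoder.classes_)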

In [78]:
trainer.evaluate(eval_dataset=test_x_dbert_final)

  0%|          | 0/250 [00:00<?, ?it/s]

{'eval_loss': 1.118065595626831,
 'eval_runtime': 457.4179,
 'eval_samples_per_second': 4.372,
 'eval_steps_per_second': 0.547,
 'epoch': 2.0}

In [79]:
def compute_metrics(p: EvalPrediction):
    preds = np.argmax(p.predictions, axis=1)
    accuracy = load_metric("accuracy")
    f1 = load_metric("f1")

    return {
        "accuracy": accuracy.compute(predictions=preds, references=p.label_ids)["accuracy"],
        "f1": f1.compute(predictions=preds, references=p.label_ids, average="weighted")["f1"],
    }


training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=2,
    per_device_train_batch_size=8,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_x_dbert_final,  # Assumes train_x_dbert_final is already prepared
    compute_metrics=compute_metrics
)


evaluation_results = trainer.evaluate(eval_dataset=test_x_dbert_final)
print(evaluation_results)

  0%|          | 0/250 [00:00<?, ?it/s]

{'eval_loss': 1.118065595626831, 'eval_accuracy': 0.6235, 'eval_f1': 0.6316970832337284, 'eval_runtime': 434.6111, 'eval_samples_per_second': 4.602, 'eval_steps_per_second': 0.575}


### Final Remarks
The DistilBERT model yielded a 62.35% accuracy rate, outperforming both the Bag of Words (54.41%) and Word Vectors (about 61.3%) approaches. It is not surprising that DistilBERT has the highest accuracy in this classification task. 62.35% isn't all that impressive, but it was obtained after shrinking the training sample tenfold, from 2,500 reviews per category to 250, because of computation and time limits.

The advantage of transformer models is that they are not tied to a fixed vector per word, nor limited to a fixed vocabulary of pre-trained words. A transformer can adjust each word's vector representation based on its surrounding context and capture more accurate relationships across longer texts.

At this stage, there is still a lot of room for improvement, such as tuning the hyperparameters, trying other state-of-the-art models, and of course upgrading the hardware. Overall, it is exciting to see what LLMs can do! I will revisit this project when equipped with more in-depth knowledge of NLP and LLMs.