# BC3415 Individual Assignment: Real-World Applications of Text and Image Classification

Requirements:

A major online retail company receives thousands of product reviews daily. These reviews often contain both text comments and customer-uploaded product images. The company wants to:
- Classify text reviews into sentiment categories (positive, negative, neutral)
- Detect product defects or misdeliveries from customer-uploaded images
- Combine both types of information to automatically flag problematic orders (primarily a prediction task)

## 1 Imports & Configuration

In [1]:
!pip install -U transformers datasets accelerate evaluate scikit-learn pandas numpy gradio --quiet

In [None]:
import pandas as pd
import numpy as np
import gradio as gr

# SciPy
from scipy.sparse import hstack, csr_matrix

# Scikit-learn
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.utils.class_weight import compute_class_weight
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix

# Hugging Face Transformers & Datasets
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    DataCollatorWithPadding,
    TrainingArguments,
    Trainer,
    pipeline
)
from datasets import Dataset
import evaluate

## 2 Data Loading

In [6]:
# Ingest all the Amazon Beauty Product Reviews as a dataframe
amazon_reviews_path = "./amazon2023/All_Beauty.jsonl.gz"
amazon_df = pd.read_json(amazon_reviews_path, lines=True)

# Print to see
amazon_df.head(n=10)

Unnamed: 0,rating,title,text,images,asin,parent_asin,user_id,timestamp,helpful_vote,verified_purchase
0,5,Such a lovely scent but not overpowering.,This spray is really nice. It smells really go...,[],B00YQ6X8EO,B00YQ6X8EO,AGKHLEW2SOWHNMFQIJGBECAF7INQ,2020-05-05 14:08:48.923,0,True
1,4,Works great but smells a little weird.,"This product does what I need it to do, I just...",[],B081TJ8YS3,B081TJ8YS3,AGKHLEW2SOWHNMFQIJGBECAF7INQ,2020-05-04 18:10:55.070,1,True
2,5,Yes!,"Smells good, feels great!",[],B07PNNCSP9,B097R46CSY,AE74DYR3QUGVPZJ3P7RFWBGIX7XQ,2020-05-16 21:41:06.052,2,True
3,1,Synthetic feeling,Felt synthetic,[],B09JS339BZ,B09JS339BZ,AFQLNQNQYFWQZPJQZS6V3NZU4QBQ,2022-01-28 18:13:50.220,0,True
4,5,A+,Love it,[],B08BZ63GMJ,B08BZ63GMJ,AFQLNQNQYFWQZPJQZS6V3NZU4QBQ,2020-12-30 10:02:43.534,0,True
5,4,Pretty Color,The polish was quiet thick and did not apply s...,[{'small_image_url': 'https://images-na.ssl-im...,B00R8DXL44,B00R8DXL44,AGMJ3EMDVL6OWBJF7CA5RGJLXN5A,2020-08-27 22:30:08.138,0,True
6,5,Handy,Great for many tasks. I purchased these for m...,[],B099DRHW5V,B099DRHW5V,AHREXOGQPZDA6354MHH4ETSF3MCQ,2021-09-17 13:31:59.443,0,True
7,3,Meh,These were lightweight and soft but much too s...,[{'small_image_url': 'https://m.media-amazon.c...,B088SZDGXG,B08BBQ29N5,AEYORY2AVPMCPDV57CE337YU5LXA,2021-10-15 05:20:59.292,0,True
8,5,Great for at home use and so easy to use!,This is perfect for my between salon visits. I...,[],B08P2DZB4X,B08P2DZB4X,AFSKPY37N3C43SOI5IEXEK5JSIYA,2021-07-27 13:04:04.559,0,False
9,5,Nice shampoo for the money,I get Keratin treatments at the salon at least...,[],B086QY6T7N,B086QY6T7N,AFSKPY37N3C43SOI5IEXEK5JSIYA,2021-07-18 13:21:51.145,0,False


## 3 Data Cleaning & Pre-Processing

### 3.1 Static Analysis

In [7]:
# Create an additional column where review title and body are concatenated together
amazon_df["concatenated_review"] = (
    amazon_df["title"].fillna("").astype(str).str.strip() + "\n" +
    amazon_df["text"].fillna("").astype(str).str.strip()
).str.strip()

In [8]:
# Check for duplicate rows, excluding the images column (lists are not hashable)
cols_to_exclude = ["images"]
no_images_df = amazon_df.drop(columns=[c for c in cols_to_exclude if c in amazon_df.columns]).copy()

# View the shape of the Dataset
num_rows, num_cols = amazon_df.shape
num_cells = amazon_df.size
amazon_df_cols = list(amazon_df.columns)
print("######## Raw Dataset Results ########")
print(f"Number of Rows: {num_rows}")
print(f"Number of Columns: {num_cols}")
print(f"Number of Cells: {num_cells}")
print("Dataset Columns: ", end="")
for col in amazon_df_cols:
    print(col, end=" ")
print()
print()

print("######## Duplicate Check for No Image Dataset Results ########")
row_dup_count = no_images_df.duplicated(keep="first").sum()
print("Exact duplicate rows:", row_dup_count)

######## Raw Dataset Results ########
Number of Rows: 701528
Number of Columns: 11
Number of Cells: 7716808
Dataset Columns: rating title text images asin parent_asin user_id timestamp helpful_vote verified_purchase concatenated_review 

######## Duplicate Check for No Image Dataset Results ########
Exact duplicate rows: 7275


### 3.2 Data Cleaning

In [9]:
# Helper function to check for null and missing values
def null_summary(df: pd.DataFrame, sort_by="Missing", ascending=False, pct_round=2):
    counts = df.isnull().sum()
    pct = (counts / len(df) * 100).round(pct_round)
    out = pd.DataFrame({
        "Column": counts.index,
        "Missing": counts.values,
        "Missing_%": pct.values,
        "Dtype": df.dtypes.astype(str).values
    }).sort_values(sort_by, ascending=ascending, kind="mergesort").reset_index(drop=True)
    return out

# Print to see the current state of the dataset
summary = null_summary(amazon_df)
summary


Unnamed: 0,Column,Missing,Missing_%,Dtype
0,rating,0,0.0,int64
1,title,0,0.0,object
2,text,0,0.0,object
3,images,0,0.0,object
4,asin,0,0.0,object
5,parent_asin,0,0.0,object
6,user_id,0,0.0,object
7,timestamp,0,0.0,datetime64[ns]
8,helpful_vote,0,0.0,int64
9,verified_purchase,0,0.0,bool


From the data types above, the following fields need additional formatting and cleaning:
- title, text, asin, parent_asin, user_id should be cast to strings
- verified_purchase should be converted from a boolean to a 0/1 integer column so it can be used as a numeric feature
- images appears to be a list, so it can be left as an object for now

Additionally, the duplicate check found 7,275 exact duplicate rows (about 1% of the dataset). We do not remove them, on the reasoning that the same product can give different customers a similar, if not identical, experience. Note, however, that the check compares every column except images, including user_id and timestamp, so these rows are likely repeated records rather than similar reviews from different customers.
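To see which kind of duplicate is present, one option is to compare exact row duplicates against reviews that merely share the same text. The sketch below uses a small made-up DataFrame (the rows are illustrative, not taken from the dataset) with column names borrowed from `amazon_df`.

```python
import pandas as pd

# Toy data (hypothetical rows) mirroring a few amazon_df columns
toy = pd.DataFrame({
    "user_id": ["U1", "U1", "U2", "U3"],
    "asin": ["A", "A", "A", "B"],
    "text": ["Love it", "Love it", "Love it", "Too small"],
    "timestamp": pd.to_datetime(
        ["2020-01-01", "2020-01-01", "2020-02-01", "2020-03-01"]
    ),
})

# Exact duplicates: every column matches, so the same record appears twice
exact_dups = toy.duplicated(keep="first").sum()

# Same review text for the same product, regardless of who wrote it or when
same_text = toy.duplicated(subset=["asin", "text"], keep="first").sum()

print("Exact duplicate rows:", exact_dups)
print("Repeated (asin, text) pairs:", same_text)
```

Rows caught only by the second check come from different users or times and fit the "similar experience" reasoning. Rows caught by the first check are repeated records.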

In [10]:
# Helper function to help encode the boolean verified purchase into integers
def encode_column(df, target_col):
    df = df.copy()
    if target_col not in df.columns:
        return df
    
    # Normalize to boolean/nullable boolean first
    df[target_col] = df[target_col].astype("boolean")
    # single 0/1 column where True->1, False/NA->0
    df[target_col + "_int"] = df[target_col].fillna(False).astype(int)

    return df

# Encode Verified Purchase from Bool to Int
amazon_df = encode_column(amazon_df, "verified_purchase")


In [11]:
# Type cast the columns to the correct dtype
string_cols = ['title', 'text', 'asin', 'parent_asin', 'user_id', 'concatenated_review']
for field in string_cols:
    amazon_df[field] = amazon_df[field].astype("string")

# Verify results
summary = null_summary(amazon_df)
summary

Unnamed: 0,Column,Missing,Missing_%,Dtype
0,rating,0,0.0,int64
1,title,0,0.0,string
2,text,0,0.0,string
3,images,0,0.0,object
4,asin,0,0.0,string
5,parent_asin,0,0.0,string
6,user_id,0,0.0,string
7,timestamp,0,0.0,datetime64[ns]
8,helpful_vote,0,0.0,int64
9,verified_purchase,0,0.0,boolean


In [12]:
# Static Analysis of Numeric Cols
amazon_df.describe()

Unnamed: 0,rating,timestamp,helpful_vote,verified_purchase_int
count,701528.0,701528,701528.0,701528.0
mean,3.960245,2019-04-09 03:31:48.115045888,0.923588,0.905123
min,1.0,2000-11-01 04:24:18,0.0,0.0
25%,3.0,2017-08-01 19:39:25.777499904,0.0,1.0
50%,5.0,2019-10-20 18:11:28.616499968,0.0,1.0
75%,5.0,2021-03-02 01:05:05.557999872,1.0,1.0
max,5.0,2023-09-09 00:39:36.666000,646.0,1.0
std,1.494452,,5.471391,0.293045


### 3.3 Creating Labels for Sentiment Analysis

In [13]:
# Helper function to use ratings to rate the sentiment of reviews as our true values
def create_review_sentiment(r):
    # Treat non-numeric or missing as None
    r_num = pd.to_numeric(r, errors="coerce")
    if pd.isna(r_num):
        return None
    if r_num <= 2:
        return "negative"
    if r_num == 3:
        return "neutral"
    return "positive"

# Create the labels for sentiments based on the ratings 
amazon_df["sentiment"] = amazon_df["rating"].apply(create_review_sentiment)
amazon_df["sentiment"] = amazon_df["sentiment"].astype("string")

# One-hot encode the sentiment into 3 int columns and add them to the original dataframe
sentiment_one_hot = pd.get_dummies(amazon_df["sentiment"], prefix="sentiment", dtype=int)
amazon_df = pd.concat([amazon_df, sentiment_one_hot], axis=1)
amazon_df.head(n=5)

Unnamed: 0,rating,title,text,images,asin,parent_asin,user_id,timestamp,helpful_vote,verified_purchase,concatenated_review,verified_purchase_int,sentiment,sentiment_negative,sentiment_neutral,sentiment_positive
0,5,Such a lovely scent but not overpowering.,This spray is really nice. It smells really go...,[],B00YQ6X8EO,B00YQ6X8EO,AGKHLEW2SOWHNMFQIJGBECAF7INQ,2020-05-05 14:08:48.923,0,True,Such a lovely scent but not overpowering.\nThi...,1,positive,0,0,1
1,4,Works great but smells a little weird.,"This product does what I need it to do, I just...",[],B081TJ8YS3,B081TJ8YS3,AGKHLEW2SOWHNMFQIJGBECAF7INQ,2020-05-04 18:10:55.070,1,True,Works great but smells a little weird.\nThis p...,1,positive,0,0,1
2,5,Yes!,"Smells good, feels great!",[],B07PNNCSP9,B097R46CSY,AE74DYR3QUGVPZJ3P7RFWBGIX7XQ,2020-05-16 21:41:06.052,2,True,"Yes!\nSmells good, feels great!",1,positive,0,0,1
3,1,Synthetic feeling,Felt synthetic,[],B09JS339BZ,B09JS339BZ,AFQLNQNQYFWQZPJQZS6V3NZU4QBQ,2022-01-28 18:13:50.220,0,True,Synthetic feeling\nFelt synthetic,1,negative,1,0,0
4,5,A+,Love it,[],B08BZ63GMJ,B08BZ63GMJ,AFQLNQNQYFWQZPJQZS6V3NZU4QBQ,2020-12-30 10:02:43.534,0,True,A+\nLove it,1,positive,0,0,1
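The row-wise `apply` above can also be written as a vectorized binning step. The following is a minimal sketch using `pd.cut`, assuming ratings are whole numbers from 1 to 5 as in this dataset. Fractional ratings would be binned differently from `create_review_sentiment`.

```python
import pandas as pd

# Example ratings, including one missing value
ratings = pd.Series([1, 2, 3, 4, 5, None])

# Bins are right-inclusive: (0, 2] -> negative, (2, 3] -> neutral, (3, 5] -> positive
sentiment = pd.cut(
    ratings,
    bins=[0, 2, 3, 5],
    labels=["negative", "neutral", "positive"],
)

print(sentiment.tolist())
```

Missing ratings stay missing, which matches the `None` branch of the helper function.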


## 4 Text Sentiment Analysis

Classify the reviews into 3 sentiment categories:
- positive
- negative
- neutral

### 4.1 Naive Bayes
#### 4.1.1 Naive Bayes (Concatenated review title and body)

In [14]:
# Version 1 (Concatenation) - Declare the dependent and independent variables
x = amazon_df["concatenated_review"]
y = amazon_df["sentiment"]

# 1. Build a TF-IDF vectorizer: unigrams and bigrams, dropping terms in >95% of documents or in fewer than 3
tfidf_vec = TfidfVectorizer(ngram_range=(1, 2), max_df=0.95, min_df=3)

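# 2. Fit the vectorizer on the full corpus and transform to a sparse matrix
#    (note: fitting before the split lets test-set vocabulary and IDF statistics leak into training)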
x_tfid = tfidf_vec.fit_transform(x)

# 3. Extract the target labels as a NumPy array of strings (negative, neutral, positive)
y = amazon_df["sentiment"].to_numpy()

# 4. Split the dataset to train a Naive Bayes Model
x_train, x_test, y_train, y_test = train_test_split(x_tfid, y, test_size=0.2, random_state=100, stratify=y)
nb_model_1 = MultinomialNB(alpha=0.5)
nb_model_1.fit(x_train, y_train)

y_pred = nb_model_1.predict(x_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
print("Confusion:\n", confusion_matrix(y_test, y_pred))

Accuracy: 0.878266075577666
              precision    recall  f1-score   support

    negative       0.83      0.82      0.82     29023
     neutral       0.86      0.07      0.14     11261
    positive       0.89      0.99      0.94    100022

    accuracy                           0.88    140306
   macro avg       0.86      0.63      0.63    140306
weighted avg       0.88      0.88      0.85    140306

Confusion:
 [[23803    65  5155]
 [ 3520   839  6902]
 [ 1361    77 98584]]
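In the cell above the vectorizer is fitted before the train/test split, so the test reviews contribute to the vocabulary and IDF weights. A possible alternative, sketched here on a made-up toy corpus, is to split the raw text first and wrap the vectorizer and classifier in a `Pipeline`. The reported numbers above were not produced this way.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy corpus (hypothetical reviews); the notebook would pass concatenated_review and sentiment
reviews = [
    "love it smells great", "great product works well", "really nice love the scent",
    "terrible broke quickly", "awful smell do not buy", "bad quality waste of money",
    "it is okay nothing special", "average product just fine", "okay for the price",
    "great value love it", "bad broke after a week", "just okay average",
]
labels = (["positive"] * 3 + ["negative"] * 3 + ["neutral"] * 3
          + ["positive", "negative", "neutral"])

# Split the raw text first so the test reviews stay unseen
x_train, x_test, y_train, y_test = train_test_split(
    reviews, labels, test_size=0.25, random_state=100, stratify=labels
)

# min_df=1 only because the toy corpus is tiny; the notebook uses min_df=3
nb_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), max_df=0.95, min_df=1)),
    ("nb", MultinomialNB(alpha=0.5)),
])

# fit() learns the vocabulary and IDF weights from the training split only
nb_pipeline.fit(x_train, y_train)
y_pred = nb_pipeline.predict(x_test)
print(len(y_pred), "test predictions")
```

Because the pipeline holds both steps, `predict` applies the training-time vocabulary to new text without refitting.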


#### 4.1.2 Naive Bayes (Without concatenating review title and body)

In [15]:
# Version 2 (Without Concatenation)
x = amazon_df[["title", "text"]]
y = amazon_df["sentiment"]

# 1. Prepare raw text inputs from the DataFrame
titles = amazon_df["title"].fillna("").astype(str).tolist()   # list of title strings
texts  = amazon_df["text"].fillna("").astype(str).tolist()    # list of body strings

# 2. Build a TF-IDF vectorizer for the title and body, based on their characteristics
title_tfidf_vec = TfidfVectorizer(ngram_range=(1, 2), max_df=0.90, min_df=2)
text_tfidf_vec = TfidfVectorizer(ngram_range=(1, 2), max_df=0.95, min_df=3)

# 3. Fit the vectorizers on the full corpus and transform to sparse matrices (fitted before the train/test split)
x_title_tfid = title_tfidf_vec.fit_transform(titles) # shape: (n_samples, n_title_features)
x_text_tfid = text_tfidf_vec.fit_transform(texts)    # shape: (n_samples, n_text_features)

# Concatenate the two feature blocks horizontally to form one design matrix
# hstack([]) stacks two already-vectorized feature matrices side by side, keeping separate vocabularies and IDF statistics per field, then forms one big matrix for the model.
x = hstack([x_title_tfid, x_text_tfid], format="csr")  # shape: (n_samples, n_title + n_text)

# 4. Extract the target labels as a NumPy array of strings (negative, neutral, positive)
y = amazon_df["sentiment"].to_numpy()

In [16]:
# 5. Split the dataset to train a Naive Bayes Model
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=100, stratify=y)
nb_model_2 = MultinomialNB(alpha=0.5)
nb_model_2.fit(x_train, y_train)

y_pred = nb_model_2.predict(x_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
print("Confusion:\n", confusion_matrix(y_test, y_pred))

Accuracy: 0.89790885635682
              precision    recall  f1-score   support

    negative       0.80      0.90      0.85     29023
     neutral       0.69      0.22      0.34     11261
    positive       0.93      0.97      0.95    100022

    accuracy                           0.90    140306
   macro avg       0.81      0.70      0.71    140306
weighted avg       0.89      0.90      0.88    140306

Confusion:
 [[26158   438  2427]
 [ 4381  2510  4370]
 [ 1994   714 97314]]
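The manual `hstack` of two TF-IDF matrices can also be expressed with `ColumnTransformer` and `Pipeline`, both of which are imported in Section 1. This is a sketch on a hypothetical toy frame, not the code that produced the results above.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy frame (hypothetical rows) with the same two text columns as amazon_df
toy = pd.DataFrame({
    "title": ["Love it", "Broken", "Meh", "Great", "Awful", "Okay"],
    "text": [
        "smells great and works well", "arrived broken and leaking",
        "nothing special but fine", "really nice product",
        "terrible quality waste of money", "average for the price",
    ],
    "sentiment": ["positive", "negative", "neutral", "positive", "negative", "neutral"],
})

# A column named by a plain string is handed to the vectorizer as 1-D text
features = ColumnTransformer([
    ("title", TfidfVectorizer(ngram_range=(1, 2)), "title"),
    ("text", TfidfVectorizer(ngram_range=(1, 2)), "text"),
])

model = Pipeline([("features", features), ("nb", MultinomialNB(alpha=0.5))])
model.fit(toy[["title", "text"]], toy["sentiment"])

# Each field keeps its own vocabulary and IDF statistics
fitted = model.named_steps["features"].named_transformers_
n_title = len(fitted["title"].vocabulary_)
n_text = len(fitted["text"].vocabulary_)
print("title features:", n_title, "| text features:", n_text)

predictions = model.predict(toy[["title", "text"]].head(2))
print(predictions)
```

Fitting this pipeline on a training split only would also keep test-set statistics out of both vectorizers.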


Results:

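- Naive Bayes is reasonably accurate for text sentiment analysis, reaching close to 90% accuracy (0.878 with concatenated text, 0.898 with separate title and body features)
- Vectorizing the title and body separately improves accuracy and raises neutral recall from 0.07 to 0.22, although neutral remains by far the weakest class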
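Accuracy hides how unevenly the classes are handled. As a worked check, per-class recall can be recomputed directly from the two confusion matrices printed above (rows are true classes, columns are predictions).

```python
import numpy as np

classes = ["negative", "neutral", "positive"]

# Confusion matrices copied from the two Naive Bayes outputs above
cm_concat = np.array([[23803, 65, 5155], [3520, 839, 6902], [1361, 77, 98584]])
cm_separate = np.array([[26158, 438, 2427], [4381, 2510, 4370], [1994, 714, 97314]])

def per_class_recall(cm):
    # Recall = correct predictions for a class / all true members of that class
    return np.diag(cm) / cm.sum(axis=1)

for name, cm in [("concatenated", cm_concat), ("separate", cm_separate)]:
    recall = per_class_recall(cm)
    accuracy = np.trace(cm) / cm.sum()
    print(name, dict(zip(classes, recall.round(2))),
          "accuracy:", round(accuracy, 3),
          "macro recall:", round(recall.mean(), 2))
```

Macro recall weights each class equally, so it reflects the weak neutral class far more than overall accuracy does.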

### 4.2 BERT

#### 4.2.1 BERT (Concatenated review title and body)

In [17]:
# The dataset is too large at 700k reviews, hence we will only make use of a small subset of it
# Helper function to pick subsets easily
def get_subsets(train_tokens, test_tokens, size="small"):
    if size == "small":
        train = train_tokens.shuffle(seed=100).select(range(5000))
        test = test_tokens.shuffle(seed=100).select(range(500))
    elif size == "medium":
        train = train_tokens.shuffle(seed=100).select(range(20000))
        test = test_tokens.shuffle(seed=100).select(range(2000))
    elif size == "large":
        train = train_tokens.shuffle(seed=100).select(range(50000))
        test = test_tokens.shuffle(seed=100).select(range(5000))
    else:
        raise ValueError("size must be one of: small, medium, large")
    return train, test

In [18]:
# Declare the Dependent Y and Independent X
x = amazon_df["concatenated_review"]
y = amazon_df["sentiment"]

# Since sentiment (y) is a string, we need to cast it to ints to work with BERT
sentiment_classes = sorted(y.unique())
# Create a mapping so we can decode what each int means in both ways
label_to_id_idx = {c: i for i, c in enumerate(sentiment_classes)}
id_idx_to_label = {i: c for c, i in label_to_id_idx.items()}

# Map the string labels to ints; sorted order gives negative=0, neutral=1, positive=2
y = y.map(label_to_id_idx).to_numpy()
x = x.to_numpy()

# Split once
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=100, stratify=y)
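The label encoding in this cell depends on alphabetical ordering of the class names. A small standalone illustration with a hypothetical list of labels:

```python
labels = ["positive", "negative", "neutral", "positive"]

# sorted() orders the class names alphabetically before numbering them
sentiment_classes = sorted(set(labels))
label_to_id_idx = {c: i for i, c in enumerate(sentiment_classes)}
id_idx_to_label = {i: c for c, i in label_to_id_idx.items()}

# Encode to ints for the model, then decode back to check the round trip
encoded = [label_to_id_idx[label] for label in labels]
decoded = [id_idx_to_label[i] for i in encoded]

print(label_to_id_idx)
print(encoded)
```

The rows and columns of the later confusion matrices follow this same 0, 1, 2 order.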

In [19]:
# Build train and test datasets for tuning the BERT model
train_df = pd.DataFrame({"text": x_train, "labels": y_train})
test_df = pd.DataFrame({"text": x_test, "labels": y_test})
train_dataset = Dataset.from_pandas(train_df.reset_index(drop=True))
test_dataset = Dataset.from_pandas(test_df.reset_index(drop=True))

In [20]:
# Use the DistilBERT Model which is smaller first
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Tokenise the Independent X
def tokenize(batch):
    # Truncate to at most 128 tokens to keep sequences short and training fast
    return tokenizer(batch["text"], truncation=True, max_length=128)

train_tokens = train_dataset.map(tokenize, batched=True, remove_columns=["text"])
test_tokens = test_dataset.map(tokenize,  batched=True, remove_columns=["text"])

Map: 100%|██████████| 561222/561222 [00:26<00:00, 21382.52 examples/s]
Map: 100%|██████████| 140306/140306 [00:07<00:00, 19164.62 examples/s]


In [21]:
# Set up DistilBERT fine-tuning: data collator, metric, classification head and training arguments
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
metric_acc = evaluate.load("accuracy")

# Load a Classification Head
num_review_labels = len(sentiment_classes)
distil_bert_model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=num_review_labels,
    id2label=id_idx_to_label,
    label2id=label_to_id_idx,
)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"accuracy": metric_acc.compute(predictions=preds, references=labels)["accuracy"]}

args = TrainingArguments(
    output_dir="bert_cls_runs",
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=4,
    num_train_epochs=3,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    fp16=False,
    bf16=False,
    logging_steps=50,
)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


##### 4.2.1.1 Small Dataset + 3 Epoch for Train & Test 

In [22]:
train_subset, test_subset = get_subsets(train_tokens, test_tokens, size="small")

# Small Token Dataset
trainer_1 = Trainer(
    model=distil_bert_model,
    args=args,
    train_dataset=train_subset,
    eval_dataset=test_subset,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)
trainer_1.train()



Epoch,Training Loss,Validation Loss,Accuracy
1,0.3718,0.321312,0.886
2,0.2413,0.307102,0.886
3,0.1733,0.326431,0.89




TrainOutput(global_step=939, training_loss=0.29260404371478943, metrics={'train_runtime': 425.8999, 'train_samples_per_second': 35.22, 'train_steps_per_second': 2.205, 'total_flos': 321948090688680.0, 'train_loss': 0.29260404371478943, 'epoch': 3.0})
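The `global_step=939` in the output follows from the batch settings. A short arithmetic sketch, assuming a single device and that a final partial accumulation group still counts as one optimizer step (which is consistent with the logged value):

```python
import math

def optimizer_steps(n_examples, per_device_batch, grad_accum, epochs):
    # Mini-batches per epoch, then groups of grad_accum mini-batches per optimizer step
    batches_per_epoch = math.ceil(n_examples / per_device_batch)
    steps_per_epoch = math.ceil(batches_per_epoch / grad_accum)
    return steps_per_epoch * epochs

# per_device_train_batch_size * gradient_accumulation_steps
effective_batch = 4 * 4
small_run_steps = optimizer_steps(5000, 4, 4, 3)

print("Effective batch size:", effective_batch)
print("Small subset, 3 epochs:", small_run_steps)
```

So each weight update averages gradients over 16 reviews, even though only 4 are held in memory at a time.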

In [23]:
# Predictions on the same subset
pred = trainer_1.predict(test_subset)
y_pred = pred.predictions.argmax(-1)
y_true = pred.label_ids

print("Accuracy:", accuracy_score(y_true, y_pred))
print(classification_report(y_true, y_pred, target_names=[id_idx_to_label[i] for i in sorted(id_idx_to_label)]))
print("Confusion:\n", confusion_matrix(y_true, y_pred))



Accuracy: 0.89
              precision    recall  f1-score   support

    negative       0.83      0.86      0.84       104
     neutral       0.49      0.40      0.44        45
    positive       0.95      0.96      0.96       351

    accuracy                           0.89       500
   macro avg       0.76      0.74      0.75       500
weighted avg       0.88      0.89      0.89       500

Confusion:
 [[ 89  10   5]
 [ 14  18  13]
 [  4   9 338]]


##### 4.2.1.2 Small Dataset + 6 Epoch for Train & Test

In [24]:
args = TrainingArguments(
    output_dir="bert_cls_runs",
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=4,
    num_train_epochs=6,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    fp16=False,
    bf16=False,
    logging_steps=50,
)

train_subset, test_subset = get_subsets(train_tokens, test_tokens, size="small")

# Small Token Dataset
trainer_2 = Trainer(
    model=distil_bert_model,
    args=args,
    train_dataset=train_subset,
    eval_dataset=test_subset,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)
trainer_2.train()



Epoch,Training Loss,Validation Loss,Accuracy
1,0.1922,0.425382,0.884
2,0.1172,0.519413,0.87
3,0.1119,0.533946,0.878
4,0.0624,0.582045,0.888
5,0.0371,0.625233,0.884
6,0.0109,0.632011,0.882




TrainOutput(global_step=1878, training_loss=0.08259289231671058, metrics={'train_runtime': 772.138, 'train_samples_per_second': 38.853, 'train_steps_per_second': 2.432, 'total_flos': 642797096328864.0, 'train_loss': 0.08259289231671058, 'epoch': 6.0})
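Each `Trainer` in this section is given the same `distil_bert_model` object. One way to make the experiments independent is the `model_init` argument of `Trainer`, which takes a zero-argument function and builds a new model at the start of `train()`. The sketch below is an assumption about how this could be wired in, not what the notebook ran. `make_fresh_model`, `MODEL_NAME` and `NUM_LABELS` are hypothetical names.

```python
MODEL_NAME = "distilbert-base-uncased"
NUM_LABELS = 3  # negative, neutral, positive

def make_fresh_model():
    # Imported lazily so nothing is loaded until a model is actually requested
    from transformers import AutoModelForSequenceClassification

    # Each call loads the pretrained weights again with an untrained classification head
    return AutoModelForSequenceClassification.from_pretrained(
        MODEL_NAME, num_labels=NUM_LABELS
    )

# Hypothetical usage inside the notebook (not run here):
# trainer = Trainer(model_init=make_fresh_model, args=args,
#                   train_dataset=train_subset, eval_dataset=test_subset,
#                   data_collator=data_collator, compute_metrics=compute_metrics)
```

With this in place, differences between runs would reflect the epoch count or subset size rather than leftover training from an earlier run.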

**Conclusions about the Number of Epochs:**
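- For the small subset, 3 epochs appears sufficient: accuracy in the 6-epoch run never exceeds the 0.89 reached after 3 epochs and fluctuates between 0.87 and 0.888.
- Validation loss rises at every epoch of the 6-epoch run (0.425 to 0.632) while training loss keeps falling (0.192 to 0.011). This is a sign of overfitting, which suggests the training subset may be too small.
- Caveat: `trainer_2` reuses the `distil_bert_model` object that `trainer_1` already fine-tuned, so these 6 epochs continue from that model rather than from the pretrained checkpoint.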
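Which epoch counts as "best" depends on the metric. With `load_best_model_at_end=True` and `metric_for_best_model="accuracy"`, the checkpoint with the highest accuracy is restored. The worked check below uses the logged values from the 6-epoch run to show that validation loss would pick a different epoch.

```python
# (epoch, validation loss, accuracy) copied from the 6-epoch log above
log = [
    (1, 0.425382, 0.884), (2, 0.519413, 0.870), (3, 0.533946, 0.878),
    (4, 0.582045, 0.888), (5, 0.625233, 0.884), (6, 0.632011, 0.882),
]

# Lowest validation loss versus highest accuracy
best_by_loss = min(log, key=lambda row: row[1])[0]
best_by_accuracy = max(log, key=lambda row: row[2])[0]

print("Best epoch by validation loss:", best_by_loss)
print("Best epoch by accuracy:", best_by_accuracy)
```

On a 500-review evaluation set, the accuracy gap between epochs 1 and 4 is only two reviews, so the loss-based choice is arguably the more stable signal.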

##### 4.2.1.3 Medium Dataset + 4 Epoch for Train & Test

In [25]:
args = TrainingArguments(
    output_dir="bert_cls_runs",
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=4,
    num_train_epochs=4,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    fp16=False,
    bf16=False,
    logging_steps=50,
)

# Medium Token Dataset
train_subset, test_subset = get_subsets(train_tokens, test_tokens, size="medium")

trainer_3 = Trainer(
    model=distil_bert_model,
    args=args,
    train_dataset=train_subset,
    eval_dataset=test_subset,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)
trainer_3.train()



Epoch,Training Loss,Validation Loss,Accuracy
1,0.2445,0.261083,0.9015
2,0.1588,0.359224,0.9035
3,0.1454,0.392411,0.9065
4,0.0834,0.457662,0.9025




TrainOutput(global_step=5000, training_loss=0.1611403627872467, metrics={'train_runtime': 2247.3735, 'train_samples_per_second': 35.597, 'train_steps_per_second': 2.225, 'total_flos': 1701684816794136.0, 'train_loss': 0.1611403627872467, 'epoch': 4.0})

In [26]:
# Predictions on the same subset
pred = trainer_3.predict(test_subset)
y_pred = pred.predictions.argmax(-1)
y_true = pred.label_ids

print("Accuracy:", accuracy_score(y_true, y_pred))
print(classification_report(y_true, y_pred, target_names=[id_idx_to_label[i] for i in sorted(id_idx_to_label)]))
print("Confusion:\n", confusion_matrix(y_true, y_pred))



Accuracy: 0.9065
              precision    recall  f1-score   support

    negative       0.84      0.88      0.86       407
     neutral       0.58      0.46      0.51       169
    positive       0.96      0.97      0.96      1424

    accuracy                           0.91      2000
   macro avg       0.79      0.77      0.78      2000
weighted avg       0.90      0.91      0.90      2000

Confusion:
 [[ 358   28   21]
 [  50   77   42]
 [  19   27 1378]]


##### 4.2.1.4 Large Dataset + 3 Epoch for Train & Test

In [27]:
args = TrainingArguments(
    output_dir="bert_cls_runs",
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=4,
    num_train_epochs=3,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    fp16=False,
    bf16=False,
    logging_steps=50,
)

# Large Token Dataset
train_subset, test_subset = get_subsets(train_tokens, test_tokens, size="large")

trainer_4 = Trainer(
    model=distil_bert_model,
    args=args,
    train_dataset=train_subset,
    eval_dataset=test_subset,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)
trainer_4.train()



Epoch,Training Loss,Validation Loss,Accuracy
1,0.2219,0.265905,0.9066
2,0.1606,0.325083,0.9096
3,0.0718,0.394497,0.907




TrainOutput(global_step=9375, training_loss=0.17127845222473145, metrics={'train_runtime': 4709.2914, 'train_samples_per_second': 31.852, 'train_steps_per_second': 1.991, 'total_flos': 3180078397422216.0, 'train_loss': 0.17127845222473145, 'epoch': 3.0})

In [28]:
# Predictions on the same subset
pred = trainer_4.predict(test_subset)
y_pred = pred.predictions.argmax(-1)
y_true = pred.label_ids

print("Accuracy:", accuracy_score(y_true, y_pred))
print(classification_report(y_true, y_pred, target_names=[id_idx_to_label[i] for i in sorted(id_idx_to_label)]))
print("Confusion:\n", confusion_matrix(y_true, y_pred))



Accuracy: 0.9096
              precision    recall  f1-score   support

    negative       0.85      0.89      0.87      1047
     neutral       0.56      0.49      0.52       403
    positive       0.96      0.96      0.96      3550

    accuracy                           0.91      5000
   macro avg       0.79      0.78      0.78      5000
weighted avg       0.91      0.91      0.91      5000

Confusion:
 [[ 929   75   43]
 [ 113  197   93]
 [  50   78 3422]]
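Neutral remains the weakest class even on the large subset, and it is also the rarest. One commonly used option, not applied in this notebook, is to weight classes inversely to their frequency. `compute_class_weight` (imported in Section 1 but unused) computes such weights. The sketch below rebuilds the label array from the support counts in the report above.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Class counts from the support column above (ids 0, 1, 2)
supports = {0: 1047, 1: 403, 2: 3550}
y = np.concatenate([np.full(n, label) for label, n in supports.items()])

# "balanced" weight = n_samples / (n_classes * class_count)
weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1, 2]), y=y
)
print(dict(zip(["negative", "neutral", "positive"], weights.round(3))))
```

How such weights would be fed into training (for example through a weighted cross-entropy loss in a custom `Trainer` subclass) is left open here and would need its own evaluation.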


Next step: train for more epochs to see whether the results improve.

#### 4.2.2 BERT without Concatenation
##### 4.2.2.1 Small Dataset + 3 Epoch for Train & Test

In [29]:
x = amazon_df[["title", "text"]]
y = amazon_df["sentiment"]

sentiment_classes = sorted(y.unique())
label_to_id_idx = {c: i for i, c in enumerate(sentiment_classes)}
id_idx_to_label = {i: c for c, i in label_to_id_idx.items()}

y = y.map(label_to_id_idx).to_numpy()

# Extract arrays; ensure strings and no NaNs
titles = x["title"].fillna("").astype(str).to_numpy()
texts  = x["text"].fillna("").astype(str).to_numpy()

# Single stratified split
xtr_title, xte_title, xtr_text, xte_text, y_train, y_test = train_test_split(
    titles, texts, y, test_size=0.2, random_state=100, stratify=y
)

In [30]:
# Build HF datasets with two text columns and 'labels'
train_df = pd.DataFrame({"title": xtr_title, "text": xtr_text, "labels": y_train})
test_df  = pd.DataFrame({"title": xte_title, "text": xte_text, "labels": y_test})

train_dataset = Dataset.from_pandas(train_df.reset_index(drop=True))
test_dataset  = Dataset.from_pandas(test_df.reset_index(drop=True))

In [31]:
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)

def tokenize(batch):
    return tokenizer(
        batch["title"],
        batch["text"],
        truncation=True,
        max_length=128
    )

train_tokens = train_dataset.map(tokenize, batched=True, remove_columns=["title","text"])
test_tokens  = test_dataset.map(tokenize,  batched=True, remove_columns=["title","text"])

Map: 100%|██████████| 561222/561222 [00:20<00:00, 27377.08 examples/s]
Map: 100%|██████████| 140306/140306 [00:06<00:00, 20701.92 examples/s]
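Passing `title` and `text` as two arguments makes the tokenizer encode them as a pair, laid out as `[CLS] title [SEP] text [SEP]`, instead of one string joined by a newline. The sketch below imitates that layout with whitespace tokens. It is an illustration only: the real tokenizer uses WordPiece, and the trimming loop is an approximation of the longest-first truncation that `truncation=True` selects.

```python
def layout_pair(title_tokens, text_tokens, max_length):
    # Three positions are reserved for [CLS] and the two [SEP] markers
    budget = max_length - 3
    a, b = list(title_tokens), list(text_tokens)
    # Approximate longest-first truncation: trim the longer segment one token at a time
    while len(a) + len(b) > budget:
        if len(a) > len(b):
            a.pop()
        else:
            b.pop()
    return ["[CLS]"] + a + ["[SEP]"] + b + ["[SEP]"]

# Hypothetical review, split on whitespace instead of WordPiece
title = "Works great but smells".split()
text = "This product does what I need it to do but the smell is odd".split()

sequence = layout_pair(title, text, max_length=12)
print(sequence)
```

Because the longer segment is trimmed first, a short title tends to survive intact while a long body loses its tail.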


In [32]:
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
metric_acc = evaluate.load("accuracy")

num_review_labels = len(sentiment_classes)
distil_bert_model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=num_review_labels,
    id2label=id_idx_to_label,
    label2id=label_to_id_idx,
)

def compute_metrics(eval_pred):
    # Works for both tuple and EvalPrediction
    logits = eval_pred[0] if isinstance(eval_pred, tuple) else eval_pred.predictions
    labels = eval_pred[1] if isinstance(eval_pred, tuple) else eval_pred.label_ids
    preds = np.argmax(logits, axis=-1)
    return {"accuracy": metric_acc.compute(predictions=preds, references=labels)["accuracy"]}

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
args = TrainingArguments(
    output_dir="bert_cls_runs",
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=4,
    num_train_epochs=3,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    fp16=False,
    bf16=False,
    logging_steps=50,
)

train_subset, test_subset = get_subsets(train_tokens, test_tokens, size="small")

trainer_5 = Trainer(
    model=distil_bert_model,
    args=args,
    train_dataset=train_subset,
    eval_dataset=test_subset,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer_5.train()



Epoch,Training Loss,Validation Loss,Accuracy
1,0.354,0.306957,0.88
2,0.2321,0.293226,0.896
3,0.1575,0.324147,0.898




TrainOutput(global_step=939, training_loss=0.27718083012980016, metrics={'train_runtime': 461.2906, 'train_samples_per_second': 32.517, 'train_steps_per_second': 2.036, 'total_flos': 324881053991352.0, 'train_loss': 0.27718083012980016, 'epoch': 3.0})

**Analysis:**

- With the number of epochs and the base model held constant, passing the review title and body as separate segments gives slightly higher accuracy on the small subset (0.898 vs 0.890), which is consistent with the Naive Bayes results.
- 3 epochs is sufficient here: validation loss reached its minimum at epoch 2 (0.293) and rose again at epoch 3 (0.324).
- Accuracy has also plateaued (0.896 at epoch 2, 0.898 at epoch 3).
- The final solution should use the model trained on separate review title and body.


In [34]:
pred = trainer_5.predict(test_subset)
y_pred = pred.predictions.argmax(-1)
y_true = pred.label_ids

print("Accuracy:", accuracy_score(y_true, y_pred))
print(classification_report(y_true, y_pred, target_names=[id_idx_to_label[i] for i in sorted(id_idx_to_label)]))
print("Confusion:\n", confusion_matrix(y_true, y_pred))



Accuracy: 0.898
              precision    recall  f1-score   support

    negative       0.87      0.88      0.87       104
     neutral       0.50      0.44      0.47        45
    positive       0.95      0.96      0.96       351

    accuracy                           0.90       500
   macro avg       0.77      0.76      0.77       500
weighted avg       0.89      0.90      0.90       500

Confusion:
 [[ 91   9   4]
 [ 12  20  13]
 [  2  11 338]]


##### 4.2.2.2 Medium Dataset + 3 Epoch for Train & Test

In [35]:
args = TrainingArguments(
    output_dir="bert_cls_runs",
    eval_strategy="epoch",    # if the installed transformers wants evaluation_strategy, switch to that
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=4,
    num_train_epochs=3,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    fp16=False,
    bf16=False,
    logging_steps=50,
)

train_subset, test_subset = get_subsets(train_tokens, test_tokens, size="medium")

trainer_6 = Trainer(
    model=distil_bert_model,
    args=args,
    train_dataset=train_subset,
    eval_dataset=test_subset,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer_6.train()



Epoch,Training Loss,Validation Loss,Accuracy
1,0.2533,0.248621,0.908
2,0.2,0.311493,0.9045
3,0.1642,0.353564,0.9015




TrainOutput(global_step=3750, training_loss=0.2000052101135254, metrics={'train_runtime': 1860.0737, 'train_samples_per_second': 32.257, 'train_steps_per_second': 2.016, 'total_flos': 1287696115193976.0, 'train_loss': 0.2000052101135254, 'epoch': 3.0})

In [36]:
pred = trainer_6.predict(test_subset)
y_pred = pred.predictions.argmax(-1)
y_true = pred.label_ids

print("Accuracy:", accuracy_score(y_true, y_pred))
print(classification_report(y_true, y_pred, target_names=[id_idx_to_label[i] for i in sorted(id_idx_to_label)]))
print("Confusion:\n", confusion_matrix(y_true, y_pred))



Accuracy: 0.908
              precision    recall  f1-score   support

    negative       0.84      0.90      0.87       407
     neutral       0.56      0.49      0.52       169
    positive       0.96      0.96      0.96      1424

    accuracy                           0.91      2000
   macro avg       0.79      0.78      0.79      2000
weighted avg       0.91      0.91      0.91      2000

Confusion:
 [[ 368   22   17]
 [  52   82   35]
 [  16   42 1366]]


##### 4.2.2.3 Large Dataset + 3 Epoch for Train & Test

In [37]:
args = TrainingArguments(
    output_dir="bert_cls_runs",
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=4,
    num_train_epochs=3,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    fp16=False,
    bf16=False,
    logging_steps=50,
)

train_subset, test_subset = get_subsets(train_tokens, test_tokens, size="large")

trainer_7 = Trainer(
    model=distil_bert_model,  # note: unless re-created first, this object already carries the weights fine-tuned by trainer_6
    args=args,
    train_dataset=train_subset,
    eval_dataset=test_subset,
    tokenizer=tokenizer,      # newer transformers releases deprecate this argument in favour of processing_class=tokenizer
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer_7.train()



Epoch,Training Loss,Validation Loss,Accuracy
1,0.2328,0.24246,0.911
2,0.2096,0.264407,0.9124
3,0.1086,0.331888,0.9094




TrainOutput(global_step=9375, training_loss=0.18876554985046387, metrics={'train_runtime': 5178.2739, 'train_samples_per_second': 28.967, 'train_steps_per_second': 1.81, 'total_flos': 3209993795173464.0, 'train_loss': 0.18876554985046387, 'epoch': 3.0})
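In both the medium and large runs, validation loss rises after the first epoch while training loss keeps falling. Which checkpoint `load_best_model_at_end=True` restores therefore depends on `metric_for_best_model`. The sketch below replays the logged values in plain Python. The helper `best_epoch` is hypothetical and only mimics the selection rule; it does not call the Trainer.

```python
# Per-epoch evaluation logs copied from the two training runs in this section.
medium_log = [
    {"epoch": 1, "train_loss": 0.2533, "eval_loss": 0.248621, "accuracy": 0.9080},
    {"epoch": 2, "train_loss": 0.2000, "eval_loss": 0.311493, "accuracy": 0.9045},
    {"epoch": 3, "train_loss": 0.1642, "eval_loss": 0.353564, "accuracy": 0.9015},
]
large_log = [
    {"epoch": 1, "train_loss": 0.2328, "eval_loss": 0.242460, "accuracy": 0.9110},
    {"epoch": 2, "train_loss": 0.2096, "eval_loss": 0.264407, "accuracy": 0.9124},
    {"epoch": 3, "train_loss": 0.1086, "eval_loss": 0.331888, "accuracy": 0.9094},
]


def best_epoch(log, metric, greater_is_better):
    """Return the epoch whose logged metric is best."""
    chooser = max if greater_is_better else min
    return chooser(log, key=lambda row: row[metric])["epoch"]


for name, log in [("medium", medium_log), ("large", large_log)]:
    by_acc = best_epoch(log, "accuracy", greater_is_better=True)
    by_loss = best_epoch(log, "eval_loss", greater_is_better=False)
    print(f"{name}: best by accuracy = epoch {by_acc}, best by eval_loss = epoch {by_loss}")
```

Selecting on accuracy picks epoch 1 for the medium run and epoch 2 for the large run, which matches the accuracies printed by `predict` (0.908 and 0.9124). Because `eval_dataset` is the same `test_subset` that is later passed to `predict`, the checkpoint is chosen on the data it is scored on, so the reported accuracy is likely somewhat optimistic. A separate validation split would avoid this.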

In [38]:
pred = trainer_7.predict(test_subset)
y_pred = pred.predictions.argmax(-1)
y_true = pred.label_ids

print("Accuracy:", accuracy_score(y_true, y_pred))
print(classification_report(y_true, y_pred, target_names=[id_idx_to_label[i] for i in sorted(id_idx_to_label)]))
print("Confusion:\n", confusion_matrix(y_true, y_pred))



Accuracy: 0.9124
              precision    recall  f1-score   support

    negative       0.84      0.90      0.87      1047
     neutral       0.57      0.44      0.50       403
    positive       0.96      0.97      0.97      3550

    accuracy                           0.91      5000
   macro avg       0.79      0.77      0.78      5000
weighted avg       0.91      0.91      0.91      5000

Confusion:
 [[ 942   72   33]
 [ 131  179   93]
 [  45   64 3441]]
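The per-class figures in the report can be recomputed directly from the confusion matrix, which makes it easier to see where the neutral class loses its recall. This is a minimal sketch in plain Python. It assumes that rows are true labels and columns are predicted labels (the scikit-learn convention) and that the class order is negative, neutral, positive, as in the report.

```python
labels = ["negative", "neutral", "positive"]
# Confusion matrix of the large run: rows = true class, columns = predicted class.
cm = [
    [942, 72, 33],
    [131, 179, 93],
    [45, 64, 3441],
]


def per_class_metrics(cm):
    """Compute precision, recall and F1 for each class of a square confusion matrix."""
    n = len(cm)
    metrics = []
    for k in range(n):
        true_pos = cm[k][k]
        support = sum(cm[k])                           # all examples whose true class is k
        predicted = sum(cm[i][k] for i in range(n))    # all examples predicted as k
        precision = true_pos / predicted
        recall = true_pos / support
        f1 = 2 * precision * recall / (precision + recall)
        metrics.append({"precision": precision, "recall": recall, "f1": f1, "support": support})
    return metrics


metrics = per_class_metrics(cm)
accuracy = sum(cm[k][k] for k in range(len(cm))) / sum(map(sum, cm))

for name, m in zip(labels, metrics):
    print(f"{name:>8}  P={m['precision']:.2f}  R={m['recall']:.2f}  F1={m['f1']:.2f}  n={m['support']}")
print(f"accuracy = {accuracy:.4f}")
```

Of the 403 neutral reviews, only 179 are recovered: 131 are predicted as negative and 93 as positive. This is why the overall accuracy stays above 91% while neutral recall is about 0.44.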


**Final Conclusions:**

1. The BERT model trained on the largest dataset achieved the highest accuracy of all the models we trained, at 91.2%. Hence, we use it in our final review sentiment analysis solution.
2. The Naive Bayes model, being a simple classical machine learning model, could be trained far more quickly than a deep learning model like BERT, and on the entire dataset, using only a laptop. It also reached a similar level of accuracy, peaking at approximately 89%.
3. Hence, our final solution uses the best model's text sentiment predictions to determine whether a review is negative.
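The third conclusion can be sketched as a small decision rule applied to one classifier prediction. The function name `is_flagged`, the `threshold` parameter, and its default value are hypothetical choices for illustration. The sketch also assumes that the prediction is a dict with `label` and `score` keys, as returned by a Hugging Face text-classification pipeline, and that the label strings are `negative`, `neutral`, and `positive`.

```python
def is_flagged(prediction, threshold=0.5):
    """Flag a review when the top predicted label is 'negative' with enough confidence.

    `prediction` is expected to look like {"label": "negative", "score": 0.93}.
    `threshold` is an illustrative cut-off, not a tuned value.
    """
    return prediction["label"] == "negative" and prediction["score"] >= threshold


# Hand-written example predictions (not real model output).
examples = [
    {"label": "negative", "score": 0.93},
    {"label": "negative", "score": 0.41},
    {"label": "positive", "score": 0.99},
]
flags = [is_flagged(p) for p in examples]
print(flags)
```

Raising the threshold trades recall on negative reviews for fewer false alarms, so the value would need tuning against whatever a missed problematic order costs the business.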

# Storing the Model for Deployment

In [41]:
save_dir = "bert_sentiment_best"

trainer_7.save_model(save_dir)  # writes the model weights and config, plus the tokenizer passed to the Trainer

In [42]:
tokenizer = AutoTokenizer.from_pretrained(save_dir)                  # reload tokenizer
model = AutoModelForSequenceClassification.from_pretrained(save_dir) # reload model
clf = pipeline("text-classification", model=model, tokenizer=tokenizer)  # top_k left at its default, so only the top label is returned

def predict_sentiment(text: str):
    res = clf(text)[0]
    return f"{res['label']} ({res['score']:.3f})"

demo = gr.Interface(
    fn=predict_sentiment,
    inputs=gr.Textbox(lines=3, placeholder="Enter a review..."),
    outputs="text",
    title="Retail Review Sentiment",
    description="Enter a product review to see the predicted sentiment."
)

if __name__ == "__main__":
    demo.launch()


Device set to use mps:0


* Running on local URL:  http://127.0.0.1:7861
* To create a public link, set `share=True` in `launch()`.
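If the saved config has no `id2label` mapping, the pipeline reports generic names such as `LABEL_0` rather than `negative`. The sketch below shows one way to make the Gradio output readable in that case. `ID_TO_LABEL` is an assumption that mirrors the class order of the classification reports above (0 = negative, 1 = neutral, 2 = positive), and `readable_label` and `format_prediction` are hypothetical helpers.

```python
# Assumed to mirror id_idx_to_label from the training cells.
ID_TO_LABEL = {0: "negative", 1: "neutral", 2: "positive"}


def readable_label(raw_label):
    """Translate generic 'LABEL_<k>' names to sentiment names; pass other names through."""
    if raw_label.startswith("LABEL_"):
        return ID_TO_LABEL[int(raw_label.split("_", 1)[1])]
    return raw_label


def format_prediction(res):
    """Format one pipeline result dict the same way predict_sentiment does."""
    return f"{readable_label(res['label'])} ({res['score']:.3f})"


print(format_prediction({"label": "LABEL_0", "score": 0.9871}))
print(format_prediction({"label": "positive", "score": 0.75}))
```

Inside `predict_sentiment`, the return line would then become `return format_prediction(res)`. Alternatively, passing `id2label` and `label2id` when the model is first loaded stores the names in the saved config, so no post-processing is needed.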
