# Assignment 3: Sentiment Analysis with Transformer Models

**Objective:**
This assignment builds and compares transformer-based models for sentiment analysis on the IMDB movie reviews dataset. The goal is to classify each review as positive or negative by fine-tuning pretrained transformer encoders.

**Approach:**
- Load and preprocess the IMDB dataset (50,000 reviews), mapping sentiments to binary labels (negative = 0, positive = 1), and split it 70/10/20 into training, validation, and test sets.
- Compare five pretrained transformer models (BERT, RoBERTa, DeBERTa, ELECTRA, DistilBERT) using the Hugging Face Transformers library.
- Fine-tune each model for 2 epochs on a 10,000-review subset of the training set and select the model with the highest F1 score on a 2,000-review validation subset.
- Retrain the best model for 3 epochs on the full training set (35,000 reviews) and evaluate it on the validation and test sets.
- Report metrics (F1, precision, recall) and show sample predictions.

**Summary:**
This notebook demonstrates a practical workflow for benchmarking transformer models on a real-world sentiment classification task, highlighting model selection, training, and evaluation steps using modern NLP tools.

In [1]:
import os
import random
import numpy as np
import pandas as pd
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
import torch
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
)
from datasets import Dataset, DatasetDict


In [2]:
DATA_CSV = "/kaggle/input/imdb-dataset-of-50k-movie-reviews/IMDB Dataset.csv"
OUTPUT_DIR = "/kaggle/working/outputs_imdb"
os.makedirs(OUTPUT_DIR, exist_ok=True)

In [3]:
RANDOM_SEED = 42
torch.manual_seed(RANDOM_SEED)
np.random.seed(RANDOM_SEED)
random.seed(RANDOM_SEED)

# Models to compare
model_names = [
    "bert-base-uncased",
    "roberta-base",
    "microsoft/deberta-base",
    "google/electra-base-discriminator",
    "distilbert-base-uncased",
]

In [4]:
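subset_train_size = 10000   # training examples used for the model comparison
subset_val_size = 2000      # validation examples used for the model comparison
max_length = 256            # maximum tokens per review; longer reviews are truncated
batch_size = 16             # per-device batch size for the subset runs
num_train_epochs_small = 2  # epochs per model in the subset comparison
batch_size_full = 16        # per-device batch size for the full run
num_train_epochs_full = 3   # epochs for the final full-data run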
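
To get a feel for how much training these settings imply, the number of optimizer steps can be estimated from the dataset size and the batch size. The sketch below is a rough estimate only: `steps_per_epoch` and its `n_devices` parameter are illustrative names that do not appear in the notebook, and it assumes no gradient accumulation and that the last partial batch is kept. The 35,000 figure is the size of the full training split created in the next cell.

```python
import math

def steps_per_epoch(n_examples, per_device_batch, n_devices=1):
    """Estimate optimizer steps per epoch (hypothetical helper, not from the notebook)."""
    # The effective batch is the per-device batch times the number of devices.
    effective_batch = per_device_batch * n_devices
    return math.ceil(n_examples / effective_batch)

# Subset comparison: 10,000 examples, batch size 16, 2 epochs.
subset_steps = steps_per_epoch(10_000, 16) * 2
# Full run: 35,000 examples, batch size 16, 3 epochs.
full_steps = steps_per_epoch(35_000, 16) * 3
print(subset_steps, full_steps)
```

On a multi-GPU machine the effective batch is larger, so pass `n_devices` accordingly. The evaluation logs further down report about 116 samples per second at about 3.7 steps per second, which is roughly 32 samples per step. That would be consistent with two devices, although the notebook does not state the hardware.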

In [5]:
df = pd.read_csv(DATA_CSV)
df['label'] = df['sentiment'].map({'negative': 0, 'positive': 1})
df = df.sample(frac=1, random_state=RANDOM_SEED).reset_index(drop=True)

# Hold out 20% for testing, then 12.5% of the remainder (10% of the total) for validation.
train_df, test_df = train_test_split(df, test_size=0.2, stratify=df['label'], random_state=RANDOM_SEED)
train_df, val_df = train_test_split(train_df, test_size=0.125, stratify=train_df['label'], random_state=RANDOM_SEED)

# preserve_index=False stops the pandas index from being stored as an extra "__index_level_0__" column.
train_ds = Dataset.from_pandas(train_df[['review', 'label']], preserve_index=False)
val_ds   = Dataset.from_pandas(val_df[['review', 'label']], preserve_index=False)
test_ds  = Dataset.from_pandas(test_df[['review', 'label']], preserve_index=False)
dataset_dict = DatasetDict({"train": train_ds, "validation": val_ds, "test": test_ds})
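
The two-stage split is easy to misread because the second fraction applies to what is left after the first. The arithmetic can be checked with a small helper. This is a sketch, assuming the CSV holds 50,000 rows as the dataset name suggests; `split_sizes` is an illustrative function, not part of the notebook.

```python
def split_sizes(n_total, test_frac, val_frac_of_rest):
    """Return (train, val, test) sizes for a two-stage split (illustrative helper)."""
    n_test = round(n_total * test_frac)
    n_rest = n_total - n_test
    # The validation fraction is taken from the remainder, not from the total.
    n_val = round(n_rest * val_frac_of_rest)
    return n_rest - n_val, n_val, n_test

n_train, n_val, n_test = split_sizes(50_000, 0.2, 0.125)
print(n_train, n_val, n_test)  # 35000 5000 10000
```

So `test_size=0.125` in the second call yields 10% of the whole dataset, giving a 70/10/20 split overall.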

In [6]:
def get_tokenizer(model_name):
    return AutoTokenizer.from_pretrained(model_name, use_fast=True)

def preprocess_function(examples, tokenizer):
    return tokenizer(examples["review"], padding="max_length", truncation=True, max_length=max_length)
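
`padding="max_length"` together with `truncation=True` makes every encoded review exactly `max_length` token IDs long. The toy function below mimics only that length handling on plain lists of integers. It is not the tokenizer's implementation (a real tokenizer also keeps its special tokens when truncating), and `pad_or_truncate`, the pad ID of 0, and the sample IDs are made up for illustration.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Clip or right-pad a list of token IDs to exactly max_length (toy model)."""
    clipped = token_ids[:max_length]
    n_pad = max_length - len(clipped)
    # The attention mask marks real tokens with 1 and padding with 0.
    attention_mask = [1] * len(clipped) + [0] * n_pad
    return clipped + [pad_id] * n_pad, attention_mask

ids, mask = pad_or_truncate([101, 2023, 3185, 102], max_length=6)
print(ids)   # [101, 2023, 3185, 102, 0, 0]
print(mask)  # [1, 1, 1, 1, 0, 0]

long_ids, long_mask = pad_or_truncate(list(range(10)), max_length=6)
print(long_ids)  # [0, 1, 2, 3, 4, 5]
```

Padding every review to 256 tokens is simple but spends compute on padding for short reviews. Padding each batch to its longest member, for example with `DataCollatorWithPadding`, is a common alternative.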


In [7]:
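def compute_metrics_binary(pred):
    # pred.predictions holds the logits, shape (n_examples, 2); pred.label_ids holds the true labels.
    labels = pred.label_ids
    preds = np.argmax(pred.predictions, axis=1)
    # All three metrics treat label 1 (positive) as the target class.
    f1 = f1_score(labels, preds, average="binary", zero_division=0)
    precision = precision_score(labels, preds, zero_division=0)
    recall = recall_score(labels, preds, zero_division=0)
    return {"f1": f1, "precision": precision, "recall": recall}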
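
To make the metric computation concrete, here is a dependency-free sketch of the same steps: take the argmax of each logit row, count true positives, false positives, and false negatives for the positive class, then derive precision, recall, and F1. The function name `binary_metrics` and the four example logit rows are invented for illustration. The notebook itself relies on scikit-learn.

```python
def binary_metrics(logits, labels):
    """Precision, recall and F1 for class 1, computed from raw logits (illustrative)."""
    # Argmax over each row gives the predicted class index.
    preds = [max(range(len(row)), key=row.__getitem__) for row in logits]
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"f1": f1, "precision": precision, "recall": recall}

toy_logits = [[2.0, -1.0], [0.1, 0.9], [-0.5, 1.5], [1.2, 0.3]]
toy_labels = [0, 1, 0, 1]
toy_result = binary_metrics(toy_logits, toy_labels)
print(toy_result)  # {'f1': 0.5, 'precision': 0.5, 'recall': 0.5}
```

The zero-denominator guards play the same role as `zero_division=0` in the scikit-learn calls.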

In [8]:
def train_and_eval(model_name, train_dataset, val_dataset, output_dir, small_run=True):
    tokenizer = get_tokenizer(model_name)
    tokenized_train = train_dataset.map(lambda x: preprocess_function(x, tokenizer), batched=True)
    tokenized_val = val_dataset.map(lambda x: preprocess_function(x, tokenizer), batched=True)
    tokenized_train = tokenized_train.remove_columns(["review"])
    tokenized_val = tokenized_val.remove_columns(["review"])
    tokenized_train.set_format("torch")
    tokenized_val.set_format("torch")

    model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

    per_device = batch_size if small_run else batch_size_full
    epochs = num_train_epochs_small if small_run else num_train_epochs_full

    args = TrainingArguments(
        output_dir=output_dir,
        eval_strategy="epoch",
        save_strategy="epoch",
        logging_strategy="epoch",
        per_device_train_batch_size=per_device,
        per_device_eval_batch_size=per_device,
        num_train_epochs=epochs,
        load_best_model_at_end=True,
        metric_for_best_model="f1",
        greater_is_better=True,
        fp16=torch.cuda.is_available(),
        save_total_limit=1,
        seed=RANDOM_SEED,
        report_to="none"
    )

    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=tokenized_train,
        eval_dataset=tokenized_val,
        processing_class=tokenizer,  # replaces the deprecated `tokenizer=` argument (transformers >= 4.46)
        compute_metrics=compute_metrics_binary,
    )

    trainer.train()
    eval_metrics = trainer.evaluate()
    trainer.save_model(output_dir)
    return eval_metrics, trainer

In [9]:
small_train = dataset_dict["train"].shuffle(seed=RANDOM_SEED).select(range(min(subset_train_size, len(dataset_dict["train"]))))
small_val = dataset_dict["validation"].shuffle(seed=RANDOM_SEED).select(range(min(subset_val_size, len(dataset_dict["validation"]))))

results = {}
for model_name in model_names:
    print(f"\n=== Training on subset: {model_name} ===")
    outdir = os.path.join(OUTPUT_DIR, "subset_compare", model_name.replace("/", "_"))
    os.makedirs(outdir, exist_ok=True)
    metrics, _ = train_and_eval(model_name, small_train, small_val, outdir, small_run=True)
    results[model_name] = metrics
    print(model_name, metrics)


=== Training on subset: bert-base-uncased ===


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

| Epoch | Training Loss | Validation Loss | F1 | Precision | Recall |
|---|---|---|---|---|---|
| 1 | 0.3158 | 0.253628 | 0.895055 | 0.896866 | 0.893253 |
| 2 | 0.1377 | 0.329352 | 0.899901 | 0.885854 | 0.914401 |




bert-base-uncased {'eval_loss': 0.3293517827987671, 'eval_f1': 0.8999008919722497, 'eval_precision': 0.8858536585365854, 'eval_recall': 0.9144008056394763, 'eval_runtime': 17.2368, 'eval_samples_per_second': 116.031, 'eval_steps_per_second': 3.655, 'epoch': 2.0}

=== Training on subset: roberta-base ===


Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

| Epoch | Training Loss | Validation Loss | F1 | Precision | Recall |
|---|---|---|---|---|---|
| 1 | 0.2983 | 0.241242 | 0.912863 | 0.941176 | 0.886203 |
| 2 | 0.1488 | 0.265899 | 0.914659 | 0.911912 | 0.917422 |




roberta-base {'eval_loss': 0.265899121761322, 'eval_f1': 0.9146586345381525, 'eval_precision': 0.9119119119119119, 'eval_recall': 0.9174219536757301, 'eval_runtime': 16.6882, 'eval_samples_per_second': 119.846, 'eval_steps_per_second': 3.775, 'epoch': 2.0}

=== Training on subset: microsoft/deberta-base ===


Some weights of DebertaForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-base and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

| Epoch | Training Loss | Validation Loss | F1 | Precision | Recall |
|---|---|---|---|---|---|
| 1 | 0.2903 | 0.207332 | 0.918753 | 0.932573 | 0.905337 |
| 2 | 0.1317 | 0.254964 | 0.925163 | 0.922846 | 0.927492 |




microsoft/deberta-base {'eval_loss': 0.2549639940261841, 'eval_f1': 0.9251632345554998, 'eval_precision': 0.9228456913827655, 'eval_recall': 0.9274924471299094, 'eval_runtime': 25.0467, 'eval_samples_per_second': 79.851, 'eval_steps_per_second': 2.515, 'epoch': 2.0}

=== Training on subset: google/electra-base-discriminator ===


Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at google/electra-base-discriminator and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

| Epoch | Training Loss | Validation Loss | F1 | Precision | Recall |
|---|---|---|---|---|---|
| 1 | 0.2672 | 0.211452 | 0.921964 | 0.946921 | 0.898288 |
| 2 | 0.1215 | 0.283953 | 0.935208 | 0.932866 | 0.937563 |




google/electra-base-discriminator {'eval_loss': 0.2839530408382416, 'eval_f1': 0.9352084379708689, 'eval_precision': 0.9328657314629258, 'eval_recall': 0.9375629405840886, 'eval_runtime': 18.0925, 'eval_samples_per_second': 110.543, 'eval_steps_per_second': 3.482, 'epoch': 2.0}

=== Training on subset: distilbert-base-uncased ===


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

| Epoch | Training Loss | Validation Loss | F1 | Precision | Recall |
|---|---|---|---|---|---|
| 1 | 0.3354 | 0.273139 | 0.884417 | 0.906878 | 0.863041 |
| 2 | 0.1579 | 0.297456 | 0.894236 | 0.89022 | 0.898288 |




distilbert-base-uncased {'eval_loss': 0.29745587706565857, 'eval_f1': 0.8942355889724312, 'eval_precision': 0.8902195608782435, 'eval_recall': 0.8982880161127895, 'eval_runtime': 8.4092, 'eval_samples_per_second': 237.836, 'eval_steps_per_second': 7.492, 'epoch': 2.0}


In [10]:
best_model_name = max(results, key=lambda name: results[name]["eval_f1"])
print("Best model from subset:", best_model_name)


Best model from subset: google/electra-base-discriminator
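
The selection can be reproduced from the logged numbers alone. The sketch below hard-codes the final validation F1 of each subset run, rounded to six decimals as in the epoch tables above, and ranks the models. The names `subset_f1`, `ranking`, and `margin` are introduced here for illustration only.

```python
# Final validation F1 per model, copied from the subset-run logs (rounded).
subset_f1 = {
    "bert-base-uncased": 0.899901,
    "roberta-base": 0.914659,
    "microsoft/deberta-base": 0.925163,
    "google/electra-base-discriminator": 0.935208,
    "distilbert-base-uncased": 0.894236,
}

# Rank models from highest to lowest F1.
ranking = sorted(subset_f1, key=subset_f1.get, reverse=True)
for rank, name in enumerate(ranking, start=1):
    print(f"{rank}. {name}: {subset_f1[name]:.4f}")

best = ranking[0]
# Gap between the winner and the runner-up.
margin = subset_f1[ranking[0]] - subset_f1[ranking[1]]
print(f"best={best}, margin over runner-up={margin:.4f}")
```

ELECTRA leads DeBERTa by about one F1 point. With a single seed and a 2,000-review validation subset, a gap of that size may not be stable across reruns, so the ranking is best read as indicative.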


In [11]:
full_outdir = os.path.join(OUTPUT_DIR, "final_best", best_model_name.replace("/", "_"))
full_metrics, full_trainer = train_and_eval(best_model_name, dataset_dict["train"], dataset_dict["validation"], full_outdir, small_run=False)
print("Validation metrics (full training):", full_metrics)


Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at google/electra-base-discriminator and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

| Epoch | Training Loss | Validation Loss | F1 | Precision | Recall |
|---|---|---|---|---|---|
| 1 | 0.2175 | 0.209777 | 0.930411 | 0.895633 | 0.968 |
| 2 | 0.1302 | 0.348567 | 0.940455 | 0.930333 | 0.9508 |
| 3 | 0.1698 | 0.880689 | 0.941247 | 0.937326 | 0.9452 |




Validation metrics (full training): {'eval_loss': 0.8806890845298767, 'eval_f1': 0.9412467635929098, 'eval_precision': 0.9373264577548592, 'eval_recall': 0.9452, 'eval_runtime': 45.684, 'eval_samples_per_second': 109.448, 'eval_steps_per_second': 3.437, 'epoch': 3.0}
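
Because the training arguments set `load_best_model_at_end=True` with `metric_for_best_model="f1"`, the checkpoint that is kept is the one with the highest validation F1, not the lowest validation loss. The sketch below applies both criteria to the three logged epochs to show that they disagree here. The list `epoch_log` is a hand-copied illustration of the log, not an object the Trainer exposes under that name.

```python
# Per-epoch validation results copied from the full-training log.
epoch_log = [
    {"epoch": 1, "val_loss": 0.209777, "f1": 0.930411},
    {"epoch": 2, "val_loss": 0.348567, "f1": 0.940455},
    {"epoch": 3, "val_loss": 0.880689, "f1": 0.941247},
]

best_by_f1 = max(epoch_log, key=lambda e: e["f1"])
best_by_loss = min(epoch_log, key=lambda e: e["val_loss"])
print("best epoch by F1:", best_by_f1["epoch"])      # 3
print("best epoch by loss:", best_by_loss["epoch"])  # 1
```

Validation loss roughly quadruples between epochs 1 and 3 while F1 improves by about one point. A plausible reading is that the model becomes increasingly overconfident on the examples it gets wrong, which is a common sign of overfitting. Selecting on loss instead would have kept the epoch 1 checkpoint.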


In [12]:
tokenizer = get_tokenizer(best_model_name)
tokenized_test = dataset_dict["test"].map(lambda x: preprocess_function(x, tokenizer), batched=True).remove_columns(["review"])
tokenized_test.set_format("torch")
test_metrics = full_trainer.evaluate(eval_dataset=tokenized_test)
print("Test set metrics:", test_metrics)



Test set metrics: {'eval_loss': 0.845896303653717, 'eval_f1': 0.9422655783396376, 'eval_precision': 0.937970669837495, 'eval_recall': 0.9466, 'eval_runtime': 90.5239, 'eval_samples_per_second': 110.468, 'eval_steps_per_second': 3.458, 'epoch': 3.0}
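
The reported precision and recall can be turned back into approximate error counts, which are often easier to interpret. This is a sketch under one assumption that the notebook does not state explicitly: that the stratified 10,000-review test set contains exactly 5,000 positive reviews, as it would if the source data is evenly balanced. All variable names below are illustrative.

```python
n_total = 10_000
n_pos = 5_000  # assumption: balanced, stratified test set
precision = 0.937970669837495  # from the test-set log
recall = 0.9466                # from the test-set log

tp = round(recall * n_pos)       # positives correctly identified
fn = n_pos - tp                  # positives missed
fp = round(tp / precision) - tp  # negatives flagged as positive
tn = (n_total - n_pos) - fp      # negatives correctly identified

f1 = 2 * tp / (2 * tp + fp + fn)
accuracy = (tp + tn) / n_total
print(tp, fp, fn, tn)  # 4733 313 267 4687
print(round(f1, 6), accuracy)
```

Under that assumption the recomputed F1 agrees with the logged `eval_f1`, and the model misclassifies 580 of the 10,000 test reviews, with slightly more false positives than false negatives.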


In [13]:
# Draw 10 random test reviews and tokenize them as one batch (padded to the longest review in the batch).
rand_idxs = random.sample(range(len(dataset_dict["test"])), 10)
samples = [dataset_dict["test"][i] for i in rand_idxs]
pipe_tokenized = tokenizer([s["review"] for s in samples], truncation=True, padding=True, max_length=max_length, return_tensors="pt")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Reload the fine-tuned model saved by train_and_eval.
model = AutoModelForSequenceClassification.from_pretrained(full_outdir).to(device)
model.eval()  # make inference mode explicit (disables dropout)
inputs = {k: v.to(device) for k, v in pipe_tokenized.items()}
with torch.no_grad():
    logits = model(**inputs).logits
    preds = logits.argmax(dim=-1).cpu().numpy()

# Labels: 0 = negative, 1 = positive.
print("\n--- 10 random test predictions ---\n")
for i, s in enumerate(samples):
    print(f"Review: {s['review'][:200]}...")
    print(f"True label: {s['label']}, Predicted: {int(preds[i])}")
    print("-" * 50)


--- 10 random test predictions ---

Review: This made-for-TV film is a brilliant one. This is probably the best and favourite role by BAFTA winning John Thaw (Kavanagh Q.C. and Inspector Morse). Tom Oakley (Thaw) widowed man has lived in a vill...
True label: 1, Predicted: 1
--------------------------------------------------
Review: !!! Spoiler alert!!!<br /><br />The point is, though, that I didn't think this film had an ending TO spoil... I only started watching it in the middle, after Matt had gotten into Sarah's body, but the...
True label: 0, Predicted: 1
--------------------------------------------------
Review: First off, let me start with a quote a friend of mine said while watching this movie: "This entire movie had to have been a dare. You know, like, 'DUDE, I BET YOU COULDN'T MAKE THE WORST MOVIE EVER'"....
True label: 0, Predicted: 0
--------------------------------------------------
Review: This is a candidate for worst films I've ever seen. It wanted to be as shocking as