## IMDb Sentiment Classification with Transformers

This notebook demonstrates how to fine-tune **DistilBERT** on the **IMDb movie review dataset** 
for binary sentiment classification (**Positive vs. Negative**).  

### Overview
- Load and preprocess the IMDb dataset using Hugging Face `datasets`
- Tokenize text with the DistilBERT (`distilbert-base-uncased`) tokenizer
- Fine-tune DistilBERT using the `Trainer` API
- Evaluate performance with accuracy and F1-score
- Save the trained model for later use

In [1]:
# import necessary libraries

import numpy as np
import evaluate
import torch
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer
import gradio as gr

## 1. Load Data

In [2]:
dataset = load_dataset("imdb")

## 2. Tokenization

In [3]:
# load tokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

In [4]:
# Tokenization function
def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=256)

# apply tokenizer
tokenized_datasets = dataset.map(tokenize, batched=True)

# data collator: pads each batch dynamically to its longest sequence
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)


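Tip: for more efficient tokenization, pass `remove_columns=["text"]` to `.map()` so the raw text is dropped once it has been tokenized. `batched=True` is already set above; `batch_size=1000` is the default and can be passed explicitly to tune preprocessing throughput.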
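The tokenizer call above truncates reviews to 256 tokens but does not pad them; padding is left to `DataCollatorWithPadding`, which pads each batch to the length of its longest sequence. The sketch below illustrates that idea in plain Python. The helper `pad_batch` and the token ids are made up for illustration, and this is not the library's implementation.

```python
def pad_batch(sequences, pad_id=0, max_length=256):
    """Truncate each sequence to max_length, then pad to the longest in the batch.

    Hypothetical helper for illustration only. pad_id=0 is the [PAD] id of
    distilbert-base-uncased; treat it as an assumption for other tokenizers.
    """
    # Naive truncation: a real tokenizer keeps the closing special token.
    truncated = [seq[:max_length] for seq in sequences]
    longest = max(len(seq) for seq in truncated)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in truncated]
    # 1 marks a real token, 0 marks padding that the model should ignore.
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([[101, 2023, 102], [101, 2023, 3185, 2001, 2307, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Padding per batch rather than to a fixed 256 tokens keeps batches of short reviews small, which is the main reason to defer padding to the collator.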

## 3. Load Model

In [5]:
# load model and set number of classes to 2
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## 4. Define Evaluation Metric

In [6]:
# load accuracy and f1 score

accuracy = evaluate.load("accuracy")
f1 = evaluate.load("f1")

def compute_metrics(eval_pred):
    """Compute accuracy and weighted F1 for the model predictions."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    acc = accuracy.compute(predictions=preds, references=labels)
    f1_score = f1.compute(predictions=preds, references=labels, average="weighted")
    return {"accuracy": acc["accuracy"], "f1": f1_score["f1"]}
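To make the two metrics concrete, here is a small worked example in NumPy. The logits and labels are made up, and `weighted_f1` is a hypothetical helper that spells out what `average="weighted"` means (per-class F1 averaged by class support); it is a sketch, not the `evaluate` implementation.

```python
import numpy as np


def weighted_f1(preds, labels):
    """Support-weighted mean of the per-class F1 scores."""
    total = 0.0
    for cls in np.unique(labels):
        tp = np.sum((preds == cls) & (labels == cls))
        fp = np.sum((preds == cls) & (labels != cls))
        fn = np.sum((preds != cls) & (labels == cls))
        denom = 2 * tp + fp + fn
        f1_cls = 2 * tp / denom if denom else 0.0
        # Weight each class by how many true examples it has.
        total += f1_cls * np.sum(labels == cls)
    return float(total / len(labels))


# Made-up logits with shape (num_examples, num_labels).
toy_logits = np.array([[2.0, -1.0], [0.1, 0.3], [1.5, 0.2], [-0.5, 0.9]])
toy_labels = np.array([0, 1, 1, 1])

toy_preds = np.argmax(toy_logits, axis=-1)           # [0, 1, 0, 1]
toy_accuracy = float(np.mean(toy_preds == toy_labels))  # 3 of 4 correct
toy_f1 = weighted_f1(toy_preds, toy_labels)
print(toy_accuracy, toy_f1)
```

In this toy case class 0 has F1 = 2/3 and class 1 has F1 = 0.8, so the weighted score is (1 · 2/3 + 3 · 0.8) / 4 = 23/30. On a balanced set such as the IMDb test split, weighted F1 tends to sit very close to accuracy.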

## 5. Training

In [None]:
# Training arguments
training_args = TrainingArguments(
    output_dir="./results",             # saves model checkpoints
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir="./logs",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    processing_class=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

In [8]:
# train the model
trainer.train()

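| Epoch | Training Loss | Validation Loss | Accuracy | F1 |
|-------|---------------|-----------------|----------|----|
| 1 | 0.2558 | 0.251691 | 0.899 | 0.898742 |
| 2 | 0.1758 | 0.283047 | 0.9122 | 0.912164 |
| 3 | 0.1092 | 0.336033 | 0.91416 | 0.91416 |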


TrainOutput(global_step=4689, training_loss=0.19062555459141095, metrics={'train_runtime': 1294.0521, 'train_samples_per_second': 57.957, 'train_steps_per_second': 3.624, 'total_flos': 4967510891175168.0, 'train_loss': 0.19062555459141095, 'epoch': 3.0})
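The `global_step=4689` in the output can be checked with a little arithmetic. The sketch below assumes a single device, no gradient accumulation, and that the final partial batch of each epoch is kept (none of these are stated in the notebook, but they are consistent with the reported numbers).

```python
import math

train_examples = 25_000  # size of the IMDb train split
batch_size = 16          # per_device_train_batch_size
epochs = 3               # num_train_epochs

# The last batch of each epoch is partial, hence the ceiling.
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)

# Cross-check with the reported throughput: runtime * samples/s should be
# close to the number of examples seen over three epochs (3 * 25_000).
runtime_s = 1294.0521
samples_per_s = 57.957
examples_seen = runtime_s * samples_per_s
print(round(examples_seen))
```

This gives 1563 steps per epoch and 4689 steps in total, matching the `TrainOutput` above.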

## 6. Evaluate Model

In [9]:
# evaluation
results = trainer.evaluate()
print(results)

{'eval_loss': 0.33603256940841675, 'eval_accuracy': 0.91416, 'eval_f1': 0.9141598016748058, 'eval_runtime': 103.0502, 'eval_samples_per_second': 242.6, 'eval_steps_per_second': 15.167, 'epoch': 3.0}


In [None]:
# Test on new text
sample = "The movie was absolutely wonderful, loved it!"
# tokenize the sample and return PyTorch tensors
inputs = tokenizer(sample, return_tensors="pt")
# move inputs to the same device as the model (GPU if available)
inputs = {k: v.to(model.device) for k, v in inputs.items()}

# switch to eval mode and disable gradient tracking for inference
model.eval()
with torch.no_grad():
    outputs = model(**inputs)
# pick the highest-scoring class for the single input
pred = outputs.logits.argmax(dim=-1).item()
print("Prediction:", "Positive" if pred == 1 else "Negative")

Prediction: Positive
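The prediction above reports only the winning class. If a confidence score is also wanted, the logits can be passed through a softmax. The sketch below uses NumPy and made-up logits (not values produced by the trained model) to show the computation.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)


# Made-up logits with shape (batch, num_labels); index 1 is the positive class.
example_logits = np.array([[-2.1, 2.4]])
probs = softmax(example_logits)
pred_id = int(np.argmax(probs, axis=-1)[0])
label = "Positive" if pred_id == 1 else "Negative"
print(label, float(probs[0, pred_id]))
```

With the real model, the same result could be obtained from `outputs.logits` by applying a softmax along the last dimension before taking the argmax.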


## 7. Save Model

In [None]:
# Save model + tokenizer locally
trainer.save_model("./imdb_model")
tokenizer.save_pretrained("./imdb_model")

('./imdb_model\\tokenizer_config.json',
 './imdb_model\\special_tokens_map.json',
 './imdb_model\\vocab.txt',
 './imdb_model\\added_tokens.json',
 './imdb_model\\tokenizer.json')
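Once the model and tokenizer are saved to the same directory, they can be reloaded for inference. One possible way, sketched below, is the `pipeline` helper from `transformers`. The function name `load_sentiment_classifier` is hypothetical, and the directory must be the one used in the save step.

```python
def load_sentiment_classifier(model_dir):
    """Build a text-classification pipeline from a saved model directory.

    Hypothetical helper: model_dir is assumed to contain both the model
    weights and the tokenizer files written by the save step.
    """
    # Imported lazily so that defining the helper does not load transformers.
    from transformers import pipeline

    return pipeline("text-classification", model=model_dir)


# Usage (requires the saved directory to exist):
# classifier = load_sentiment_classifier("<directory used in the save step>")
# classifier("The movie was absolutely wonderful, loved it!")
```

Because no label names were configured when the model was created, the pipeline would report `LABEL_0` and `LABEL_1`; in the IMDb dataset, label 0 is negative and label 1 is positive.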