

# Problem Statement: Sentiment Classification on IMDb Movie Reviews Using BERT

## Objective

Build and fine-tune a Transformer-based model (BERT) to classify IMDb movie reviews as positive or negative. The task covers data cleaning, tokenization, model training, evaluation, and error analysis, following the same pipeline as the Transformer tweet sentiment example.

## Dataset

**IMDb Movie Reviews**

- Publicly accessible via the Hugging Face Datasets library (no manual download or sign-in required).
- Loading code snippet:

```python
from datasets import load_dataset
dataset = load_dataset("imdb")
train = dataset["train"]
test = dataset["test"]
```

- Dataset size: 25,000 training samples and 25,000 testing samples.
- Structure: Each example contains a `text` field (the movie review) and a `label` field (0 = negative, 1 = positive).


## Learning Objectives

- Clean and preprocess natural language movie reviews (remove HTML tags, special characters, unwanted whitespace).
- Tokenize and encode text using `BertTokenizer`.
- Fine-tune `BertForSequenceClassification` for binary sentiment classification.
- Evaluate model with classification metrics (precision, recall, F1-score).
- Analyze model predictions, including inspection of correctly and incorrectly classified samples.


## Tasks

1. **Data Loading & Exploration**
    - Load the IMDb dataset directly using Hugging Face’s `load_dataset("imdb")` function.
    - Analyze dataset distribution and sample texts to understand the data.
2. **Data Cleaning**
    - Clean the review texts to remove noise such as HTML tags and punctuation.
    - Prepare the cleaned text for tokenization.
3. **Dataset Preparation**
    - Implement a PyTorch Dataset class similar to `TweetDataset`, which performs tokenization, padding, and truncation using `BertTokenizer`.
    - Ensure token sequences have a max length (e.g., 128) for efficient batching.
4. **Model Setup and Training**
    - Load the pretrained BERT base uncased model configured for sequence classification with two output labels.
    - Define training parameters such as batch size, epochs, and logging setup.
    - Use the Hugging Face `Trainer` API to train and validate the model on the IMDb data.
5. **Evaluation and Reporting**
    - Generate a detailed classification report with precision, recall, and F1-score.
    - Create a DataFrame comparing review texts, actual labels, and predicted labels for sample inspection.
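
The precision, recall, and F1-score asked for in task 5 all come from three counts: true positives, false positives, and false negatives. The sketch below computes them by hand on toy labels so the numbers in a classification report are easier to read. `binary_metrics` and the toy lists are illustrative names, not part of the notebook, and the positive class is assumed to be label 1.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    # Guard every division so empty or one-sided inputs return 0.0
    accuracy = correct / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy labels: 1 = positive, 0 = negative
toy_true = [1, 1, 0, 0, 1]
toy_pred = [1, 0, 0, 1, 1]
toy_metrics = binary_metrics(toy_true, toy_pred)
print(toy_metrics)
```

On the toy data there are 2 true positives, 1 false positive, and 1 false negative, so precision, recall, and F1 all equal 2/3 while accuracy is 3/5.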

## Deliverables

- Python notebook or script containing fully documented code for the entire pipeline.
- Classification report and insights into model performance and errors.
- Examples of correct and incorrect predictions with analysis.
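
For the error-analysis deliverable, one possible approach is to rank misclassified reviews by how confident the model was, since confidently wrong predictions are often the most informative. The sketch below assumes you have the raw logits (one row of two values per review) and the true labels as plain lists. `softmax` and `most_confident_errors` are hypothetical helpers, not functions from the notebook or from any library.

```python
import math


def softmax(logits):
    """Convert one row of logits into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def most_confident_errors(logit_rows, y_true, k=5):
    """Return (index, predicted_label, confidence) for the k wrong
    predictions the model was most sure about."""
    errors = []
    for idx, (row, true_label) in enumerate(zip(logit_rows, y_true)):
        probs = softmax(row)
        pred = max(range(len(probs)), key=probs.__getitem__)
        if pred != true_label:
            errors.append((idx, pred, probs[pred]))
    # Highest confidence first
    return sorted(errors, key=lambda e: e[2], reverse=True)[:k]


# Toy logits standing in for model output; columns are (negative, positive)
toy_logits = [[2.0, -1.0], [0.1, 0.3], [-3.0, 3.0], [1.5, 1.0]]
toy_labels = [0, 0, 0, 1]
worst = most_confident_errors(toy_logits, toy_labels, k=2)
print(worst)
```

The returned indices can be used to look up the corresponding review texts for manual inspection.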


## Getting Started Example

```python
from datasets import load_dataset

# Load IMDb dataset
dataset = load_dataset("imdb")
train = dataset["train"]
test = dataset["test"]

print(f"Number of training samples: {len(train)}")
print(f"Number of test samples: {len(test)}")

# Sample review and label
print("Sample text:", train[0]["text"][:200])
print("Sample label:", train[0]["label"])
```


***

The IMDb dataset needs no sign-in or manual download, so it fits directly into a Transformer fine-tuning workflow.


In [1]:
import re
import os
import random
from dataclasses import dataclass
from typing import Dict, List

import numpy as np
import pandas as pd
import torch
from datasets import load_dataset
from sklearn.metrics import classification_report, accuracy_score, precision_recall_fscore_support
from torch.utils.data import Dataset
from transformers import (
    BertTokenizerFast,
    BertForSequenceClassification,
    TrainingArguments,
    Trainer,
    DataCollatorWithPadding,
    logging,
)

# Reduce transformers logging noise
logging.set_verbosity_error()


In [2]:
# 1. Data Loading & Exploration

def load_and_inspect():
    dataset = load_dataset("imdb")
    train = dataset["train"]
    test = dataset["test"]
    print(f"Train samples: {len(train)} | Test samples: {len(test)}")
    print("Sample text (first 300 chars):\n", train[0]["text"][:300])
    print("Sample label:", train[0]["label"])
    # Note: despite the message, this prints the class names, not per-class counts
    print("Train label distribution:\n", train.features["label"].names if hasattr(train, 'features') else None)
    return train, test
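
The inspection function above prints the class names rather than how many reviews fall into each class. A minimal sketch of an actual distribution count follows. It works on a plain list of integer labels, such as `train["label"]`, and `label_distribution` is a hypothetical helper, not part of the notebook.

```python
from collections import Counter


def label_distribution(labels, names):
    """Map each class name to (count, share) for a list of integer labels."""
    counts = Counter(labels)
    total = len(labels)
    return {
        name: (counts.get(i, 0), counts.get(i, 0) / total if total else 0.0)
        for i, name in enumerate(names)
    }


# Toy labels standing in for train["label"]; the names match the
# ['neg', 'pos'] list printed by the notebook.
toy_labels = [0, 0, 1, 1, 1, 0]
dist = label_distribution(toy_labels, ["neg", "pos"])
print(dist)
```

Running the same check on a subsample is a quick way to confirm that both classes are still present before training.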


In [3]:
# 2. Data Cleaning

def clean_text(text: str) -> str:
    """Basic cleaning: remove HTML tags, turn newlines/tabs into spaces, collapse repeated spaces."""
    if not isinstance(text, str):
        return ""
    # Remove HTML tags
    text = re.sub(r"<[^>]+>", " ", text)
    # Replace newlines / tabs with space
    text = re.sub(r"[\r\n\t]+", " ", text)
    # Remove repeated spaces
    text = re.sub(r" +", " ", text)
    # Strip
    text = text.strip()
    return text
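
If the reviews also contain HTML entities, the cleaning step could decode them as well. The variant below is a sketch under that assumption. `clean_text_with_entities` is a hypothetical helper, and whether entity decoding is needed for IMDb specifically is not established by the notebook. Tags are stripped before decoding so that an escaped tag is not mistaken for real markup.

```python
import html
import re


def clean_text_with_entities(text: str) -> str:
    """Variant of clean_text that also decodes HTML entities."""
    if not isinstance(text, str):
        return ""
    text = re.sub(r"<[^>]+>", " ", text)  # drop tags such as <br />
    text = html.unescape(text)            # decode entities such as &amp;
    text = re.sub(r"\s+", " ", text)      # collapse all whitespace runs
    return text.strip()


sample = "Great movie!<br /><br />I loved it &amp; so did my kids.\n\tTruly  fun."
cleaned = clean_text_with_entities(sample)
print(cleaned)
```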


In [4]:
# 3. Dataset Preparation

class IMDBDataset(Dataset):
    def __init__(self, texts: List[str], labels: List[int], tokenizer: BertTokenizerFast, max_length: int = 128):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = self.texts[idx]
        label = int(self.labels[idx])
        # Tokenize to plain Python lists (return_tensors=None); they are converted to tensors below
        encoding = self.tokenizer(
            text,
            truncation=True,
            max_length=self.max_length,
            padding=False,  # padding will be handled by the data collator
            return_tensors=None,
        )
        item = {k: torch.tensor(v) for k, v in encoding.items()}
        item["labels"] = torch.tensor(label, dtype=torch.long)
        return item
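
Because the dataset returns unpadded sequences, padding happens per batch in the data collator. The sketch below illustrates the idea in plain Python: each batch is padded only to its own longest sequence, and the attention mask marks real tokens with 1 and padding with 0. It is a conceptual illustration, not the library's implementation. `pad_batch` is a hypothetical helper, and `pad_id=0` assumes the `[PAD]` id used by `bert-base-uncased`.

```python
def pad_batch(sequences, pad_id=0):
    """Pad token-id lists to the longest sequence in this batch."""
    longest = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Illustrative ids: 101 = [CLS], 102 = [SEP]; the others are placeholders.
batch = pad_batch([[101, 2023, 3185, 102], [101, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Padding to the batch maximum instead of a fixed `max_length` means batches of short reviews carry fewer padding tokens through the model.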



In [10]:
# 4. Model Setup and Training

def compute_metrics(pred):
    preds = pred.predictions
    if isinstance(preds, tuple):
        preds = preds[0]
    y_pred = np.argmax(preds, axis=1)
    y_true = pred.label_ids
    acc = accuracy_score(y_true, y_pred)
    precision, recall, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="binary")
    return {"accuracy": acc, "precision": precision, "recall": recall, "f1": f1}


def train_and_evaluate(train_dataset, eval_dataset, tokenizer, output_dir="./imdb-bert-output", num_train_epochs=2, batch_size=16):
    model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

    args = TrainingArguments(
        output_dir=output_dir,
        save_steps=500,
        learning_rate=2e-5,
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size,
        num_train_epochs=num_train_epochs,
        weight_decay=0.01,
        logging_steps=100,
        save_total_limit=2,
        fp16=torch.cuda.is_available(),
    )


    data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        tokenizer=tokenizer,  # recent transformers releases deprecate this in favor of processing_class
        data_collator=data_collator,
        compute_metrics=compute_metrics,
    )

    trainer.train()
    # Evaluate
    metrics = trainer.evaluate(eval_dataset=eval_dataset)
    print("Evaluation metrics:\n", metrics)

    # Return trainer and model for further analysis
    return trainer, trainer.model
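
To see how `save_steps=500` and `logging_steps=100` relate to the length of a run, the arithmetic can be sketched as follows. It assumes a single device and no gradient accumulation, and it covers only the interval checkpoints implied by `save_steps`. `training_schedule` is a hypothetical helper, not part of the `Trainer` API.

```python
import math


def training_schedule(num_samples, batch_size, num_epochs, save_steps):
    """Steps per epoch, total steps, and the steps at which interval
    checkpoints fall.

    Assumes one device and gradient_accumulation_steps=1.
    """
    steps_per_epoch = math.ceil(num_samples / batch_size)
    total_steps = steps_per_epoch * num_epochs
    checkpoints = list(range(save_steps, total_steps + 1, save_steps))
    return steps_per_epoch, total_steps, checkpoints


# Values from the notebook's final call: 5000 samples, batch size 8, 2 epochs.
schedule = training_schedule(5000, 8, 2, save_steps=500)
print(schedule)
```

With these values the run has 625 steps per epoch and 1250 steps in total, so interval checkpoints fall at steps 500 and 1000, which stays within `save_total_limit=2`.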

In [6]:
!pip install --upgrade transformers




In [None]:
# 5. Full pipeline orchestration

def run_pipeline(sample_limit=None, max_length=128, num_train_epochs=2, batch_size=16):
    # Load
    train_raw, test_raw = load_and_inspect()

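    # Optionally subsample for quick experiments.
    # Shuffle first: the IMDb splits are grouped by label, so a plain prefix
    # would contain reviews from only one class.
    if sample_limit is not None:
        train_raw = train_raw.shuffle(seed=42).select(range(min(sample_limit, len(train_raw))))
        test_raw = test_raw.shuffle(seed=42).select(range(min(sample_limit, len(test_raw))))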

    # Clean texts
    print("Cleaning texts...")
    train_texts = [clean_text(x) for x in train_raw["text"]]
    train_labels = train_raw["label"]
    test_texts = [clean_text(x) for x in test_raw["text"]]
    test_labels = test_raw["label"]

    # Tokenizer
    tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

    # Create Datasets
    train_dataset = IMDBDataset(train_texts, train_labels, tokenizer, max_length=max_length)
    test_dataset = IMDBDataset(test_texts, test_labels, tokenizer, max_length=max_length)

    # Train
    trainer, model = train_and_evaluate(
        train_dataset,
        test_dataset,
        tokenizer,
        output_dir="./imdb-bert-output",
        num_train_epochs=num_train_epochs,
        batch_size=batch_size,
    )

    # Predictions on test set
    print("Generating predictions on test set...")
    preds_output = trainer.predict(test_dataset)
    logits = preds_output.predictions
    if isinstance(logits, tuple):
        logits = logits[0]
    y_pred = np.argmax(logits, axis=1)
    y_true = preds_output.label_ids

    # Classification report
    report = classification_report(y_true, y_pred, target_names=["negative", "positive"], digits=4)
    print("\nClassification Report:\n", report)

    # Create a DataFrame of all test predictions to inspect correct/incorrect cases
    df = pd.DataFrame({"text": test_texts, "true_label": y_true, "pred_label": y_pred})
    df["correct"] = df["true_label"] == df["pred_label"]

    # Save full predictions to CSV
    os.makedirs("outputs", exist_ok=True)
    df.to_csv("outputs/test_predictions.csv", index=False)
    print("Saved test predictions to outputs/test_predictions.csv")

    # Show some correctly and incorrectly classified samples
    incorrect = df[~df["correct"]].sample(n=min(10, df[~df["correct"]].shape[0]), random_state=42)
    correct = df[df["correct"]].sample(n=min(10, df[df["correct"]].shape[0]), random_state=42)

    print("\nExamples of incorrect predictions:\n")
    for i, row in incorrect.iterrows():
        print(f"True: {row['true_label']} Pred: {row['pred_label']} Text: {row['text'][:300]}\n---\n")

    print("\nExamples of correct predictions:\n")
    for i, row in correct.iterrows():
        print(f"True: {row['true_label']} Pred: {row['pred_label']} Text: {row['text'][:300]}\n---\n")

    return trainer, model, df, report


if __name__ == "__main__":
    # For a quick test run, set sample_limit=2000.
    # For the final run, set sample_limit=None to use the full dataset (25k per split).
    trainer, model, df, report = run_pipeline(sample_limit=5000, max_length=128, num_train_epochs=2, batch_size=8)
    print("Done.\nCheck outputs/test_predictions.csv for detailed predictions and ./imdb-bert-output for saved checkpoints.")


Train samples: 25000 | Test samples: 25000
Sample text (first 300 chars):
 I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really h
Sample label: 0
Train label distribution:
 ['neg', 'pos']
Cleaning texts...


