
Problem Statement: Sentiment Classification on IMDb Movie Reviews Using BERT
Objective

Build and fine-tune a Transformer-based model (BERT) to classify IMDb movie reviews as positive or negative. The task covers data cleaning, tokenization, model training, evaluation, and error analysis, following the same pipeline as the transformer tweet sentiment example.
Dataset

IMDb Movie Reviews

    Publicly accessible via the Hugging Face Datasets library (no manual download or sign-in required).
    Loading code snippet:

from datasets import load_dataset
dataset = load_dataset("imdb")
train = dataset["train"]
test = dataset["test"]

    Dataset size: 25,000 training samples and 25,000 testing samples.
    Structure: Each example contains a text field (the movie review) and a label field (0 = negative, 1 = positive).
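Given this structure, a quick class-balance check is a natural first exploration step. The sketch below counts labels in a few hand-made records shaped like IMDb examples; the sample records and the helper name `label_distribution` are illustrative assumptions, not part of the dataset or the Datasets library.

```python
from collections import Counter

# Map the dataset's integer labels to readable names (0 = negative, 1 = positive).
LABEL_NAMES = {0: "negative", 1: "positive"}

def label_distribution(examples):
    """Count how many examples carry each sentiment label."""
    counts = Counter(example["label"] for example in examples)
    return {LABEL_NAMES[label]: n for label, n in sorted(counts.items())}

# Hand-made records with the same fields as an IMDb example (illustrative only).
sample_examples = [
    {"text": "A wonderful, moving film.", "label": 1},
    {"text": "Two hours I will never get back.", "label": 0},
    {"text": "Great cast and a sharp script.", "label": 1},
]

print(label_distribution(sample_examples))
```

Iterating over a Hugging Face split yields one dictionary per example, so the same helper should also accept the real `train` or `test` split.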

Learning Objectives

    Clean and preprocess natural language movie reviews (remove HTML tags, special characters, unwanted whitespace).
    Tokenize and encode text using BertTokenizer.
    Fine-tune BertForSequenceClassification for binary sentiment classification.
    Evaluate the model with classification metrics (precision, recall, F1-score).
    Analyze model predictions, including inspection of correctly and incorrectly classified samples.

Tasks

    Data Loading & Exploration
        Load the IMDb dataset directly using Hugging Face’s load_dataset("imdb") function.
        Analyze dataset distribution and sample texts to understand the data.
    Data Cleaning
        Clean the review texts to remove noise such as HTML tags and punctuation.
        Prepare the cleaned text for tokenization.
    Dataset Preparation
        Implement a PyTorch Dataset class similar to TweetDataset, which performs tokenization, padding, and truncation using BertTokenizer.
        Ensure token sequences have a max length (e.g., 128) for efficient batching.
    Model Setup and Training
        Load the pretrained BERT base uncased model configured for sequence classification with two output labels.
        Define training parameters such as batch size, epochs, and logging setup.
        Use the Hugging Face Trainer API to train and validate the model on the IMDb data.
    Evaluation and Reporting
        Generate a detailed classification report with precision, recall, and F1-score.
        Create a DataFrame comparing review texts, actual labels, and predicted labels for sample inspection.

Deliverables

    Python notebook or script containing fully documented code for the entire pipeline.
    Classification report and insights into model performance and errors.
    Examples of correct and incorrect predictions with analysis.

Getting Started Example



In [None]:
'''
from datasets import load_dataset

# Load IMDb dataset
dataset = load_dataset("imdb")
train = dataset["train"]
test = dataset["test"]

print(f"Number of training samples: {len(train)}")
print(f"Number of test samples: {len(test)}")

# Sample review and label
print("Sample text:", train[0]["text"][:200])
print("Sample label:", train[0]["label"])
'''



The IMDb dataset needs no external sign-in or manual download, so it fits directly into a Transformer fine-tuning workflow.


In [None]:
# Import the necessary libraries
import re
import string
import warnings

import numpy as np
import pandas as pd
import torch
from sklearn.metrics import classification_report
from torch.utils.data import Dataset
from transformers import (
    BertForSequenceClassification,
    BertTokenizer,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
    logging,
)

warnings.filterwarnings("ignore", category=UserWarning)
logging.set_verbosity_error()

In [None]:
# Load train and test data
from datasets import load_dataset

dataset = load_dataset("imdb")
print(dataset)

# Access train and test splits
train = dataset["train"]
test = dataset["test"]

# Check train data
print(train[0])           # First training example
print(train.features)     # Schema (text + label)
print(train.num_rows)     # Number of examples


In [None]:
# Check for GPU availability
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
print(f"Using device: {device}")

In [None]:
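# Clean text
def clean_text(text):
    """Strip HTML tags, punctuation, and extra whitespace from one review."""
    # Replace HTML tags (e.g. <br />) with a space so adjacent words stay separated
    text = re.sub(r"<.*?>", " ", text)

    # Delete ASCII punctuation outright; contractions collapse ("don't" -> "dont")
    text = text.translate(str.maketrans("", "", string.punctuation))

    # Collapse runs of whitespace and trim the ends
    text = re.sub(r"\s+", " ", text).strip()

    return text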

# Clean dataset
def preprocess_dataset(dataset):
    return dataset.map(lambda example: {"clean_text": clean_text(example["text"])})

# Cleaned datasets
train_clean = preprocess_dataset(train)
test_clean = preprocess_dataset(test)

# Check cleaned dataset
print(train_clean[0]["clean_text"])
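To see what the cleaning step does to a single review, here is a self-contained sketch that repeats the same three substitutions on a made-up review string. The sample text is an assumption, chosen to include the `<br />` tags that appear in many IMDb reviews.

```python
import re
import string

def clean_text(text):
    # Same three steps as the cell above, repeated so this snippet runs on its own.
    text = re.sub(r"<.*?>", " ", text)                                 # HTML tags -> space
    text = text.translate(str.maketrans("", "", string.punctuation))  # drop punctuation
    text = re.sub(r"\s+", " ", text).strip()                           # tidy whitespace
    return text

# Made-up review text (illustrative only).
sample = "This movie was great!<br /><br />Don't miss it..."
cleaned = clean_text(sample)
print(cleaned)
```

The tags become spaces, so "great" and "Don't" stay separate words, while the apostrophe is simply deleted. If that matters for a given experiment, replacing punctuation with spaces instead of deleting it is one alternative to consider.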

In [None]:
# Tokenization
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

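def tokenize_fn(batch):
    # With batched=True, `batch` is a dict mapping each column name to a list of values.
    # Truncate or pad every review to exactly 128 tokens.
    return tokenizer(batch["clean_text"], truncation=True, padding="max_length", max_length=128)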

train_tokenized = train_clean.map(tokenize_fn, batched=True)
test_tokenized = test_clean.map(tokenize_fn, batched=True)
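The effect of `truncation=True` with `padding="max_length"` can be pictured with a small pure-Python sketch. This is a simplified illustration, not the tokenizer's actual implementation: the helper `pad_or_truncate` is hypothetical, the input ids are placeholder integers rather than real word pieces, and only the three special-token ids are taken from the `bert-base-uncased` vocabulary.

```python
# Special-token ids in the bert-base-uncased vocabulary: [CLS], [SEP], [PAD].
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0

def pad_or_truncate(token_ids, max_length):
    """Return fixed-length input_ids and attention_mask for one sequence."""
    # Reserve two positions for [CLS] and [SEP], then cut the review to fit.
    body = token_ids[: max_length - 2]
    input_ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(input_ids)
    # Fill the remainder with [PAD]; the mask marks padded positions with 0.
    pad_count = max_length - len(input_ids)
    return {
        "input_ids": input_ids + [PAD_ID] * pad_count,
        "attention_mask": attention_mask + [0] * pad_count,
    }

# Placeholder ids: one short sequence (padded) and one long sequence (truncated).
short = pad_or_truncate([2023, 3185, 2001], max_length=8)
long = pad_or_truncate(list(range(1000, 1020)), max_length=8)
print(short)
print(long)
```

Every sequence comes out with the same length, which is what lets the reviews be stacked into batches. With `max_length=128`, a long review keeps at most 126 of its own tokens.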

In [None]:
# Load model
num_labels = len(set(train["label"]) | set(test["label"]))
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=num_labels)


In [None]:
# Training arguments
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=1,
    per_device_train_batch_size=16,    # Increase batch size if memory allows
    per_device_eval_batch_size=64,
    logging_dir="./logs",
    logging_steps=10,
    report_to="none"
)
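As a rough sanity check on these settings, the number of optimizer steps and log entries can be worked out by hand. The sketch below assumes a single device and no gradient accumulation; both are assumptions about the runtime, not something the arguments above state.

```python
import math

train_examples = 25_000   # size of the IMDb training split
batch_size = 16           # per_device_train_batch_size
num_epochs = 1            # num_train_epochs
logging_steps = 10        # logging_steps
num_devices = 1           # assumption: one GPU (or CPU only)

# The last partial batch still counts as a step, hence the ceiling.
steps_per_epoch = math.ceil(train_examples / (batch_size * num_devices))
total_steps = steps_per_epoch * num_epochs
log_entries = total_steps // logging_steps

print(f"steps per epoch: {steps_per_epoch}")
print(f"total steps:     {total_steps}")
print(f"log entries:     {log_entries}")
```

Under these assumptions, one epoch is 1,563 steps and the training loss is logged about 156 times.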

In [None]:
# Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_tokenized,
    eval_dataset=test_tokenized,
    tokenizer=tokenizer,
    data_collator=DataCollatorWithPadding(tokenizer),
)

In [None]:
# Train
trainer.train()

In [None]:
# Evaluate: predict on the test set and take the highest-scoring class per review
predictions = trainer.predict(test_tokenized)
preds = torch.argmax(torch.tensor(predictions.predictions), dim=1)
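For a sequence classification model, `predictions.predictions` holds raw logits, one score per class for each review. The sketch below shows, on made-up logits, how the argmax picks the label and how a softmax would turn the scores into a confidence value. The helper names and the toy numbers are illustrative assumptions.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities (max subtracted for numerical stability)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_label(logits):
    """Return (predicted class index, probability assigned to that class)."""
    probs = softmax(logits)
    label = max(range(len(probs)), key=probs.__getitem__)
    return label, probs[label]

# Made-up logits for two reviews: [score for negative, score for positive].
toy_logits = [[2.0, -1.0], [-0.5, 0.5]]
for row in toy_logits:
    label, confidence = predict_label(row)
    print(label, round(confidence, 3))
```

Keeping the confidence alongside the label can help later when looking for misclassified reviews that the model was very sure about.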

In [None]:
# Report
print("\nClassification Report:\n")
print(classification_report(
    list(test_tokenized["label"]),
    preds.tolist(),
    target_names=["negative", "positive"],  # label 0 = negative, 1 = positive
))
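To make the numbers in the report easier to read, the sketch below computes precision, recall, and F1 by hand for the positive class on a tiny made-up set of labels. The helper `binary_metrics` and the toy labels are assumptions for illustration; in the notebook itself, `classification_report` does this work.

```python
def binary_metrics(actual, predicted, positive=1):
    """Precision, recall, and F1 for one class, computed from raw counts."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)  # true positives
    fp = sum(1 for a, p in pairs if a != positive and p == positive)  # false positives
    fn = sum(1 for a, p in pairs if a == positive and p != positive)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Made-up labels: 1 = positive review, 0 = negative review.
toy_actual = [1, 1, 1, 0, 0, 0, 1, 0]
toy_predicted = [1, 0, 1, 0, 1, 0, 1, 0]

print(binary_metrics(toy_actual, toy_predicted))
```

Here three positives are found, one is missed, and one negative is wrongly flagged, so precision, recall, and F1 all come out at 0.75.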

In [None]:
# Convert predictions and actual labels to plain Python lists
predicted_labels = preds.tolist()
actual_labels = list(test_tokenized["label"])

In [None]:
# Compare in a DataFrame
comparison_df = pd.DataFrame({
    "text": list(test_tokenized["clean_text"]),
    "actual": actual_labels,
    "predicted": predicted_labels
})

# Display the first 20 rows of the comparison
comparison_df.head(20)
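For the error analysis, it helps to separate correct from incorrect predictions and to name the kind of mistake. The sketch below does this on a few made-up rows that use the same three fields as the comparison table; the helper `split_by_correctness` and the sample rows are illustrative assumptions.

```python
def split_by_correctness(rows):
    """Split rows into correct and incorrect predictions, tagging each error type."""
    correct, incorrect = [], []
    for row in rows:
        if row["actual"] == row["predicted"]:
            correct.append(row)
        else:
            # Predicted positive but actually negative -> false positive, and vice versa.
            kind = "false positive" if row["predicted"] == 1 else "false negative"
            incorrect.append({**row, "error": kind})
    return correct, incorrect

# Made-up rows with the same fields as the comparison table (illustrative only).
toy_rows = [
    {"text": "loved every minute", "actual": 1, "predicted": 1},
    {"text": "not bad at all", "actual": 1, "predicted": 0},
    {"text": "great if you enjoy boredom", "actual": 0, "predicted": 1},
    {"text": "dull and far too long", "actual": 0, "predicted": 0},
]

correct, incorrect = split_by_correctness(toy_rows)
print(len(correct), "correct")
for row in incorrect:
    print(row["error"], "|", row["text"])
```

With the DataFrame built above, `comparison_df.to_dict("records")` produces rows in this shape, so the same helper could be applied to the real predictions.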
