# This notebook fine-tunes BERT on the IMDB positive/negative reviews dataset

### This code runs faster on a GPU; if you don't have one, try Google Colab's free T4 GPU
### For extra speed, you can also freeze the backbone model and train only the new head, i.e. plain transfer learning (see Step 6)

In [None]:

# ✅ Step 0: Install required libraries (only run once)
!pip install -U datasets transformers
# Add more packages here (e.g. scikit-learn, torch) if you get a ModuleNotFoundError


We import the main tools for this project:  

- `load_dataset` from **datasets** to load IMDB reviews.  
- `BertTokenizer` and `BertForSequenceClassification` from **transformers** to preprocess text and load the model.  
- `Trainer` and `TrainingArguments` to manage the training loop.  
- **torch** for tensor operations.  
- **numpy** for numerical utilities.  
- **accuracy_score** from scikit-learn for evaluation.  


In [None]:

# ✅ Step 1: Imports
from datasets import load_dataset
from transformers import BertTokenizer, BertForSequenceClassification, Trainer, TrainingArguments
import torch
import numpy as np
from sklearn.metrics import accuracy_score


We load the **IMDB dataset**, which is already split into train and test sets.  

- Training: we take a random subset of 2,000 reviews.  
- Testing: we take a random subset of 1,000 reviews.  

Why subsets?  
- The full IMDB dataset has 25,000 reviews per split.  
- Subsets make training **much faster** while still showing the fine-tuning process.  

We shuffle before selecting so that our subset is a fair mix of positive and negative reviews.  
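To see why the shuffle matters, here is a small sketch on a toy stand-in for a label-sorted split. The `toy_labels` list and the `take_subset` helper are hypothetical and not part of the notebook; the sketch only assumes that the split is ordered by label. Taking the first rows of a sorted split yields a single class, while shuffling first should give a mix.

```python
import random

# Toy stand-in for a split sorted by label: 50 negatives (0), then 50 positives (1).
# Illustrative only; this is not the real IMDB data.
toy_labels = [0] * 50 + [1] * 50

def take_subset(labels, n, shuffle, seed=42):
    """Return the first n labels, optionally after a seeded shuffle."""
    labels = list(labels)  # copy so the input list stays untouched
    if shuffle:
        random.Random(seed).shuffle(labels)
    return labels[:n]

unshuffled = take_subset(toy_labels, 20, shuffle=False)
shuffled = take_subset(toy_labels, 20, shuffle=True)

print("classes without shuffle:", sorted(set(unshuffled)))
print("classes with shuffle:   ", sorted(set(shuffled)))
```

Fixing the seed, as the notebook does with `seed=42`, keeps the chosen subset the same from run to run.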


In [None]:

# ✅ Step 2: Load the IMDB dataset and take a small subset for quick training
dataset = load_dataset("imdb")
small_train = dataset["train"].shuffle(seed=42).select(range(2000))  # Random 2000 samples
small_test = dataset["test"].shuffle(seed=42).select(range(1000))  # Random 1000 samples
# Note: the dataset already comes split into train and test
# Both splits are ordered, so we shuffle before selecting to get a mix of both classes


We load the **pre-trained BERT tokenizer** (`bert-base-uncased`).  

- Converts raw text into token IDs that BERT understands.  
- Uses a subword vocabulary learned during BERT's pre-training.  
- Lowercases text before tokenising (this is the *uncased* model).  

The tokenizer is essential because BERT cannot process raw strings directly.  
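As a rough illustration of what a subword vocabulary does, the sketch below applies greedy longest-match-first splitting, the matching scheme used by WordPiece-style tokenizers, to a tiny made-up vocabulary. The `toy_vocab` set and the `wordpiece_split` helper are hypothetical; the real tokenizer has a far larger vocabulary and extra normalisation steps, so treat this as a simplified picture rather than its actual implementation.

```python
# Hypothetical toy vocabulary; "##" marks a piece that continues a word.
toy_vocab = {"un", "##believ", "##able", "movie", "##s"}

def wordpiece_split(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single lowercase word."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits: the whole word maps to the unknown token.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces

for word in ["unbelievable", "movies", "xyz"]:
    print(word, "->", wordpiece_split(word, toy_vocab))
```

Each piece would then be looked up in the vocabulary to get the integer token ID that the model consumes.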


In [None]:

# ✅ Step 3: Load tokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# Loads the pre-trained BERT tokenizer; will convert raw text to token IDs that BERT understands


We define a tokenisation function that:  
- **Pads** sequences to the maximum length (so all inputs are the same size).  
- **Truncates** longer reviews (BERT can only handle up to 512 tokens).  
- Returns token IDs and attention masks.  

We then apply this function to both the training and test sets using `map(batched=True)`, which processes multiple examples at once for speed.  
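The effect of padding and truncation on a single example can be sketched in plain Python. The `pad_and_truncate` helper and the token IDs below are hypothetical, and the padding ID of 0 is an assumption made for this sketch. The real tokenizer also keeps the special end-of-sequence token when it truncates, which this simplified version does not.

```python
PAD_ID = 0  # assumed padding token ID for this sketch

def pad_and_truncate(token_ids, max_length, pad_id=PAD_ID):
    """Cut a sequence to max_length, then pad it and build the attention mask."""
    ids = token_ids[:max_length]            # truncation: drop anything past max_length
    n_pad = max_length - len(ids)           # how many padding slots are needed
    return {
        "input_ids": ids + [pad_id] * n_pad,
        "attention_mask": [1] * len(ids) + [0] * n_pad,  # 1 = real token, 0 = padding
    }

short_example = pad_and_truncate([101, 2023, 102], max_length=6)
long_example = pad_and_truncate(list(range(1, 11)), max_length=6)

print(short_example)
print(long_example)
```

Every output has exactly `max_length` entries, which is what lets the examples be stacked into one fixed-size tensor.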


In [None]:

# ✅ Step 4: Tokenise text
def tokenize(batch):
    return tokenizer(batch["text"], padding="max_length", truncation=True, max_length=512)
# Defines a function to tokenize a batch of text: pads to max length, truncates if too long

train_enc = small_train.map(tokenize, batched=True)
test_enc = small_test.map(tokenize, batched=True)
# Apply the tokenizer to the train/test datasets; batched=True for speed


We set the dataset format to `"torch"`, keeping only the columns that BERT needs:  
- `input_ids` (the token IDs for each word/subword).  
- `attention_mask` (marks which tokens are real vs. padding).  
- `label` (the sentiment: 0 = negative, 1 = positive).  

This ensures that the dataset can be fed directly into the Hugging Face **Trainer API**.  


In [None]:

# ✅ Step 5: Set PyTorch format
train_enc.set_format("torch", columns=["input_ids", "attention_mask", "label"])
test_enc.set_format("torch", columns=["input_ids", "attention_mask", "label"])
# Converts datasets to PyTorch tensors for use with Trainer


# Step 6 is where we bring in BERT as a pretrained model, with our additional 2-label classification head on top ✅✅✅

We load the **BERT model for sequence classification** (`bert-base-uncased`):  

- It starts with the pretrained BERT backbone (which has learned language representations from large text corpora).  
- We add a **classification head** (a small fully connected layer) for binary sentiment classification.  
- `num_labels=2` because IMDB reviews are labelled as positive or negative.  

This setup is now ready for training.  
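To make "a small fully connected layer" concrete, here is a minimal sketch of a linear head in plain Python. It assumes the head is a single linear map from the backbone's 768-dimensional sentence vector to 2 logits; the `linear_head` helper and the toy numbers are hypothetical and use a 2-dimensional input to stay readable.

```python
HIDDEN_SIZE = 768  # width of the sentence vector produced by a BERT-base backbone
NUM_LABELS = 2     # negative / positive

def linear_head(pooled, weight, bias):
    """Compute logits[j] = sum_i weight[j][i] * pooled[i] + bias[j]."""
    return [
        sum(w * x for w, x in zip(row, pooled)) + b
        for row, b in zip(weight, bias)
    ]

# Toy example with a 2-dimensional "sentence vector" instead of 768 dimensions.
toy_pooled = [1.0, 2.0]
toy_weight = [[1.0, 0.0], [0.0, 1.0]]  # one row of weights per label
toy_bias = [0.5, -0.5]
toy_logits = linear_head(toy_pooled, toy_weight, toy_bias)

# Parameter count of a full-size head: one weight per (input, label) pair plus one bias per label.
head_params = HIDDEN_SIZE * NUM_LABELS + NUM_LABELS

print("toy logits:", toy_logits)
print("parameters in a 768 -> 2 head:", head_params)
```

Under this assumption the head holds only a few thousand parameters at most, against roughly 110 million in the backbone, which is why training the head alone is so much cheaper.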


In [None]:

# ✅ Step 6: Load model
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
# Loads pre-trained BERT **with a classification head**
# num_labels=2 means binary sentiment classification


In [None]:

### Optional - if you want to freeze the backbone model and just train the appended head:
# for param in model.bert.parameters():
#     param.requires_grad = False
### This will make the model train much faster


We define a function to calculate evaluation metrics during training.  

- `eval_pred` contains model **logits** (raw outputs before softmax) and true labels.  
- We take the **argmax** of the logits to get predicted class indices.  
- Compute **accuracy** by comparing predictions to true labels.  

This function is passed to the Hugging Face `Trainer` so it automatically calculates metrics at the end of each evaluation.  
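The argmax-then-compare logic can be traced by hand on a few made-up logits. The sketch below uses plain Python lists instead of NumPy arrays, and the `toy_logits`, `toy_labels`, and helper names are hypothetical.

```python
def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)

def accuracy_from_logits(logits, labels):
    """Turn each row of logits into a class index, then score against the labels."""
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return predictions, correct / len(labels)

# Four made-up examples: column 0 = negative score, column 1 = positive score.
toy_logits = [[2.1, -0.3], [-1.0, 0.4], [0.2, 0.1], [-0.5, 1.5]]
toy_labels = [0, 1, 1, 1]

predictions, acc = accuracy_from_logits(toy_logits, toy_labels)
print("predictions:", predictions)
print("accuracy:", acc)
```

Softmax is not needed here because it preserves the ordering of the logits, so the argmax is the same before and after it.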


In [None]:

# ✅ Step 7: Define metrics
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {"accuracy": accuracy_score(labels, predictions)}
# Defines a metric function for Trainer to compute accuracy


We set up **TrainingArguments**, which control how the Trainer fine-tunes BERT:  

- `output_dir`: where model checkpoints and logs are stored.  
- `per_device_train_batch_size` / `per_device_eval_batch_size`: batch size per device (GPU/CPU).  
- `num_train_epochs`: how many times the model will see the training dataset.  
- `eval_strategy="epoch"`: evaluate after each epoch.  
- `save_strategy="no"`: we skip saving intermediate checkpoints (saves time and space).  
- `logging_steps`: print training metrics every 10 steps.  
- `logging_dir`: directory for TensorBoard logs if needed.  
- `report_to="none"`: disables reporting to external tools like WandB.  

These settings allow fast, clear, and reproducible fine-tuning on a small dataset.  
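A quick back-of-the-envelope calculation shows what these settings imply for the length of a run. It assumes a single device and no gradient accumulation, so one batch equals one optimisation step; the variable names are only for illustration.

```python
import math

train_examples = 2000   # size of the training subset
eval_examples = 1000    # size of the test subset
batch_size = 8          # per-device batch size for training and evaluation
epochs = 2
logging_steps = 10

steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs
log_events = total_steps // logging_steps
eval_batches = math.ceil(eval_examples / batch_size)

print("training steps per epoch:", steps_per_epoch)
print("total training steps:", total_steps)
print("logged training entries:", log_events)
print("batches per evaluation pass:", eval_batches)
```

Doubling the batch size would halve the number of steps, at the cost of more memory per step.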


In [None]:

# ✅ Step 8: Define training configuration
training_args = TrainingArguments(
    output_dir="./bert-imdb",            # Where to save model checkpoints and logs
    per_device_train_batch_size=8,       # Number of samples per GPU/CPU batch during training
    per_device_eval_batch_size=8,        # Number of samples per GPU/CPU batch during evaluation
    num_train_epochs=2,                  # Total number of training epochs over the dataset
    eval_strategy="epoch",               # Evaluate model at the end of each epoch
    save_strategy="no",                  # Do not save intermediate model checkpoints
    logging_steps=10,                    # Log training metrics every 10 steps
    logging_dir="./logs",                # Directory to store logs for TensorBoard if needed
    load_best_model_at_end=False,        # Do not automatically reload the best model after training
    report_to="none",                    # Disable reporting to external tools like WandB
)


### If the following code runs very slowly, try a GPU. Google Colab has a free T4 GPU
### Note that even with Colab's GPU, it may still take around 5 minutes to run
### If you want it to run faster, go back and freeze the backbone model (Step 6)

We initialise the **Hugging Face Trainer** with:  

- `model`: the BERT model with classification head.  
- `args`: the training arguments defined above.  
- `train_dataset` / `eval_dataset`: tokenised train and test subsets.  
- `compute_metrics`: the function to compute accuracy.  

Then we call `trainer.train()`, which:  
1. Loops over the dataset for the specified number of epochs.  
2. Performs forward and backward passes.  
3. Updates the model parameters (all of them by default; only the classification head if the backbone was frozen in Step 6).  

After training, we define `device` (**GPU if available**, otherwise CPU) and make sure the model is on it, ready for fast inference.  
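The forward pass, backward pass, and parameter update can be seen in miniature in the sketch below. It is not what `Trainer` runs internally: a two-feature logistic-regression model stands in for BERT, and plain gradient descent stands in for the optimiser. The `train_one_epoch` helper and `toy_data` are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_one_epoch(weights, bias, data, lr=0.5):
    """One full-batch gradient-descent step; returns new parameters and the mean loss."""
    grad_w = [0.0] * len(weights)
    grad_b = 0.0
    total_loss = 0.0
    for features, label in data:
        # Forward pass: predicted probability of the positive class.
        p = sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)
        total_loss += -(label * math.log(p) + (1 - label) * math.log(1 - p))
        # Backward pass: the gradient of the loss with respect to the logit is (p - label).
        error = p - label
        for i, x in enumerate(features):
            grad_w[i] += error * x
        grad_b += error
    n = len(data)
    # Parameter update: move against the averaged gradient.
    new_weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
    new_bias = bias - lr * grad_b / n
    return new_weights, new_bias, total_loss / n

# Made-up data: (features, label), with label 1 = positive and 0 = negative.
toy_data = [([1.0, 0.2], 1), ([0.9, 0.1], 1), ([0.1, 0.8], 0), ([0.2, 1.0], 0)]

weights, bias = [0.0, 0.0], 0.0
losses = []
for epoch in range(2):
    weights, bias, loss = train_one_epoch(weights, bias, toy_data)
    losses.append(loss)  # loss measured before this epoch's update
    print(f"epoch {epoch + 1}: mean loss = {loss:.4f}")
```

The loss falls from one epoch to the next because each update nudges the parameters in the direction that reduces it, which is the same principle at work during fine-tuning.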


In [None]:

# ✅ Step 9: Train
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_enc,
    eval_dataset=test_enc,
    compute_metrics=compute_metrics,
)
# Initialise the Hugging Face Trainer with model, datasets, metrics, and args

trainer.train()
# Fine-tune the model: all weights are updated by default; only the head is trained if the backbone was frozen in Step 6

# After trainer.train(): choose the device for inference (torch was already imported in Step 1)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
# Makes sure the model is on the GPU if one is available


We evaluate the fine-tuned model on the test set:  

- `trainer.evaluate()` computes the metrics (accuracy in this case) on the evaluation dataset.  
- We print the test accuracy to get a quick sense of model performance.  

This step measures how well our model generalises to unseen data.  


In [None]:

# ✅ Step 10: Evaluate
metrics = trainer.evaluate()
print("✅ Test Accuracy:", metrics["eval_accuracy"])
# Computes evaluation metrics on the test subset passed as eval_dataset


We define a convenient function `predict_sentiment(text)` to classify new reviews:  

1. **Tokenize the input** text (truncating to at most 512 tokens; a single input needs no padding).  
2. Move the inputs to the **same device** as the model.  
3. Set the model to **evaluation mode** (`model.eval()`) to disable dropout.  
4. **Forward pass** through BERT, get logits, apply softmax for probabilities.  
5. Select the **class with highest probability** and report its confidence.  
6. Convert numeric prediction (0/1) to human-readable sentiment:  
   - 👍 Positive  
   - 👎 Negative  

We then test the function on example sentences to see the predictions in action.  
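The softmax and confidence steps can be worked through on made-up logits without loading a model. The `softmax` and `describe` helpers below are hypothetical, plain-Python stand-ins for the tensor operations.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)                          # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def describe(logits):
    """Return the predicted label and the probability assigned to it."""
    probs = softmax(logits)
    pred = max(range(len(probs)), key=probs.__getitem__)  # index of the highest probability
    label = "Positive" if pred == 1 else "Negative"
    return label, probs[pred]

# Made-up logits in the order [negative, positive].
label, confidence = describe([-1.2, 2.3])
print(f"Sentiment: {label} ({confidence:.2%} confidence)")
```

The reported confidence is simply the softmax probability of the winning class, so with two classes it can never fall below 50%.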


In [None]:

# ✅ Step 11: Make a predictive function
def predict_sentiment(text):
    # Tokenise the input text into IDs BERT understands, add padding/truncation
    # Convert to PyTorch tensors and move to the same device as the model (CPU or GPU)
    inputs = tokenizer(
        text,
        return_tensors="pt",   # return PyTorch tensors
        truncation=True,       # cut off if longer than max_length
        padding=True,          # pad shorter sequences
        max_length=512         # maximum token length
    ).to(device)               # move tensors to GPU if available

    # Put model in evaluation mode (disables dropout, etc.)
    model.eval()

    # Disable gradient calculation for faster inference and lower memory usage
    with torch.no_grad():
        # Forward pass through the model
        outputs = model(**inputs)
        # Get the raw logits (pre-softmax scores) from model output
        logits = outputs.logits
        # Convert logits to probabilities using softmax
        probs = torch.nn.functional.softmax(logits, dim=-1)
        # Pick the index with the highest probability as the predicted class
        pred = torch.argmax(probs, dim=1).item()
        # Get the confidence of the predicted class
        confidence = probs[0, pred].item()

    # Convert numeric prediction to human-readable sentiment
    sentiment = "👍 Positive" if pred == 1 else "👎 Negative"
    # Print the sentiment with its confidence percentage
    print(f"Sentiment: {sentiment} ({confidence:.2%} confidence)")


Let's now take a look at our model in action:

In [None]:

# Test predictions ✅✅✅
predict_sentiment("I loved the movie.")
predict_sentiment("It was boring, slow, and way too long. I wouldn't recommend it.")


If we use longer movie reviews as test data, our fine-tuned model can compete with, and may outperform, other off-the-shelf binary sentiment classification models.  

To see what fine-tuning has changed, let's compare our results with the original BERT model. BERT is great at understanding language, but it hasn't been trained for binary classification at all:

In [None]:
# Load BERT with fresh classifier head
from transformers import BertForSequenceClassification
untrained_model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
untrained_model.to(device)

# Quick predict function
def predict_raw(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=128).to(device)
    untrained_model.eval()
    with torch.no_grad():
        logits = untrained_model(**inputs).logits
        probs = torch.nn.functional.softmax(logits, dim=-1)
        pred = logits.argmax(dim=1).item()
        confidence = probs[0, pred].item()
    sentiment = "👍 Positive" if pred == 1 else "👎 Negative"
    print(f"Sentiment: {sentiment} ({confidence:.2%} confidence)")

# Test
predict_raw("I loved the movie.")
predict_raw("It was boring, slow, and way too long. I wouldn't recommend it.")


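Notice how the default BERT model is not good at this task: its classification head is freshly initialised with random weights, so its predictions are essentially guesses until the model has been fine-tuned.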
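For a sense of what an untrained classifier amounts to on balanced data, the sketch below scores two stand-in predictors: one that always answers "positive" and one that flips a coin. Both are hypothetical stand-ins, not the untrained BERT model itself, and the `balanced_labels` list is made up.

```python
import random

def fraction_correct(predictions, labels):
    """Share of predictions that match the labels."""
    return sum(int(p == y) for p, y in zip(predictions, labels)) / len(labels)

# Made-up balanced label set: half negative (0), half positive (1).
balanced_labels = [0, 1] * 500

# Stand-in 1: a model stuck on one class.
always_positive = [1] * len(balanced_labels)

# Stand-in 2: a model whose outputs carry no information about the input.
rng = random.Random(0)
coin_flip = [rng.randint(0, 1) for _ in balanced_labels]

constant_acc = fraction_correct(always_positive, balanced_labels)
coin_acc = fraction_correct(coin_flip, balanced_labels)

print("always-positive accuracy:", constant_acc)
print("coin-flip accuracy:", coin_acc)
```

Both land at or near 50%, which is the baseline any fine-tuned accuracy figure on a balanced test set should be read against.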