# Text classification using Transformers.

<p align="center">
  <a href="https://colab.research.google.com/github/auduvignac/llm-finetuning/blob/main/notebooks/project/draft.ipynb" target="_blank">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Google Colab"/>
  </a>
</p>

This lab focuses on text classification on the IMDb movie review dataset.
We will work with the encoder-based Transformer architecture through its best-known model: **BERT**.

---

# Introduction

## HuggingFace

We have already experimented with some components provided by the HuggingFace library:
- the `datasets` library,
- the `tokenizer`.

The HuggingFace `transformers` library provides a convenient API for working with Transformer models such as BERT, GPT, etc. To quote their website: *Transformers provides APIs and tools to easily download and train state-of-the-art pretrained models. Using pretrained models can reduce your compute costs, carbon footprint, and save you the time and resources required to train a model from scratch. Transformers support framework interoperability between PyTorch, TensorFlow, and JAX.*

## Goal of the lab session

We will experiment with the HuggingFace library. You will have to load a model and run it on your task.

Important things to keep in mind:
- Even if each model is a Transformer, they all have their peculiarities.
- What is the exact input format expected by the model?
- What is its exact output?
- Can you use the available model as is or should you make some modifications for your task?

These questions are part of the daily life of an NLP scientist. We will address some of them in this lab and in the next lessons, labs, and homework.

## Libraries import

In [23]:
%matplotlib inline
%config InlineBackend.figure_formats = ['svg']
import json
import math
import random

import matplotlib.pyplot as plt
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from datasets import (
    DatasetDict,
    load_dataset,
)
from sklearn.metrics import (
    accuracy_score,
    brier_score_loss,
    classification_report,
    confusion_matrix,
    f1_score,
    log_loss,
)
from tabulate import tabulate
from torch.nn.utils import clip_grad_norm_
from torch.utils.data import DataLoader
from tqdm.notebook import tqdm
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DistilBertConfig,
    DistilBertForSequenceClassification,
    DistilBertTokenizer,
    get_linear_schedule_with_warmup,
)

# If the machine you run this on has a GPU available with CUDA installed,
# use it. Using a GPU for learning often leads to huge speedups in training.
# See https://developer.nvidia.com/cuda-downloads for installing CUDA
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
DEVICE

device(type='cuda')

## Device set up & reproducibility

- Ensures results are reproducible across runs (same seed = same shuffling, same weight initialization, etc.).
- Chooses GPU if available, otherwise falls back to CPU.
- Printing confirms where your model will run.

In [2]:
# Reproducibility
seed = 42
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)

# Device setup
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using device:", device)

Using device: cuda


In [3]:
# A data collator assembles a list of dataset examples into one batch.
# Here it pads the tokenized examples to a common length (by default the
# longest sequence in the batch) and returns PyTorch tensors.
class DataCollator:
    def __init__(self, tokenizer):
        self.tokenizer = tokenizer

    def __call__(
        self, batch, max_length=256, padding="longest", return_tensors="pt"
    ):
        return self.tokenizer.pad(
            batch,
            padding=padding,
            max_length=max_length,
            return_tensors=return_tensors,
        )

In [47]:
class LLMFineTuner:
    def __init__(
        self,
        dataset="scikit-learn/imdb",
        model_cls=DistilBertForSequenceClassification,
        num_labels=2,  # "negative" (0) and "positive" (1)
        pretrained_model_name_or_path="distilbert-base-uncased",
        tokenizer_cls=DistilBertTokenizer,
    ):
        self.dataset = dataset
        self.device = torch.device(
            "cuda" if torch.cuda.is_available() else "cpu"
        )
        self.model_cls = model_cls
        self.num_labels = num_labels
        self.pretrained_model_name_or_path = pretrained_model_name_or_path
        self.tokenizer_cls = tokenizer_cls

    def set_dataset(self, verbose=False):
        self.dataset = load_dataset(self.dataset, split="train")
        if verbose:
            print(f"Dataset loaded :\n"
                  f"{self.dataset}\n"
                  f"with {len(self.dataset)} examples."
            )

    def set_tokenizer(
        self,
        do_lower_case=True,
        verbose=False,
    ):
        self.tokenizer = self.tokenizer_cls.from_pretrained(
            self.pretrained_model_name_or_path, do_lower_case=do_lower_case
        )
        if verbose:
            print(
                f"Tokenizer {self.tokenizer_cls.__name__} loaded from "
                f"{self.pretrained_model_name_or_path}"
            )

    def set_data_collator(self):
        self.data_collator = DataCollator(self.tokenizer)

    def split_dataset(
        self, max_length=256, n_samples=2000, seed=42, test_size=0.2
    ):
        """
        Prepares the dataset for training:

        - Shuffles and selects a subset
        - Tokenizes the reviews and generates input_ids and labels
        - Removes unnecessary columns
        - Splits into train/validation sets

        Args:
            n_samples (int): number of examples to select
            test_size (float): proportion of the data to use for validation
            max_length (int): maximum sequence length (truncate if longer)
        """

        def preprocessing_fn(x, tokenizer):
            # Convert the text into token IDs
            x["input_ids"] = tokenizer.encode(
                x["review"],
                add_special_tokens=True,
                truncation=True,
                max_length=max_length,
                padding=False,
                return_attention_mask=False,
            )
            # Encode the label
            x["labels"] = 0 if x["sentiment"] == "negative" else 1
            return x

        # Shuffle and subsample
        dataset = self.dataset.shuffle(seed).select(range(n_samples))

        # Apply the preprocessing
        dataset = dataset.map(
            preprocessing_fn, fn_kwargs={"tokenizer": self.tokenizer}
        )

        # Keep only the useful columns
        dataset = dataset.select_columns(["input_ids", "labels"])

        # Split train / validation
        splitted = dataset.train_test_split(test_size=test_size)

        self.train_set = splitted["train"]
        self.valid_set = splitted["test"]

    def set_loaders(self, train_batch_size=4, eval_batch_size=4):
        self.train_loader = DataLoader(
            batch_size=train_batch_size,
            collate_fn=self.data_collator,
            dataset=self.train_set,
            shuffle=True,
        )
        self.valid_loader = DataLoader(
            batch_size=eval_batch_size,
            collate_fn=self.data_collator,
            dataset=self.valid_set,
            shuffle=False,
        )
        self.n_valid = len(self.valid_set)
        self.n_train = len(self.train_set)

    def set_model(
        self,
        verbose=False,
    ):
        model = self.model_cls.from_pretrained(
            pretrained_model_name_or_path=self.pretrained_model_name_or_path,
            num_labels=self.num_labels,
        )
        self.model = model.to(self.device)
        if verbose:
            print(
                f"Model {self.model_cls.__name__} loaded with "
                f"{self.model.num_labels} labels."
            )

    def set_optimizer(
        self,
        learning_rate=5e-5,
        weight_decay=0.01,
    ):
        # AdamW is a variant of Adam that includes weight decay
        self.optimizer = optim.AdamW(
            self.model.parameters(),
            lr=learning_rate,
            weight_decay=weight_decay,
        )

    def set_scheduler(
        self,
        num_epochs=3,
    ):
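        # getattr avoids an AttributeError when the setter was never called
        if getattr(self, "optimizer", None) is None:
            raise ValueError("Optimizer must be set before the scheduler.")
        if getattr(self, "train_loader", None) is None:
            raise ValueError("Data loaders must be set before scheduler.")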
        # Total training steps = number of batches * number of epochs
        self.num_total_steps = len(self.train_loader) * num_epochs
        self.num_warmup_steps = int(0.1 * self.num_total_steps)
        self.scheduler = get_linear_schedule_with_warmup(
            self.optimizer,
            num_warmup_steps=self.num_warmup_steps,
            num_training_steps=self.num_total_steps,
        )

    def set_optimizer_and_scheduler(
        self,
        learning_rate=5e-5,
        num_epochs=3,
        weight_decay=0.01,
        verbose=False,
    ):
        self.set_optimizer(learning_rate, weight_decay)
        self.set_scheduler(num_epochs)
        if verbose:
            print(
                f"Optimizer and scheduler set with {self.num_total_steps} "
                f"training steps and {self.num_warmup_steps} warmup steps."
            )

    def train_and_validate(
        self, epochs=3, max_grad_norm=1.0, save_dir="./distilbert-best"
    ):
        """
        Trains and validates the model for a given number of epochs.

        - Performs forward/backward passes with gradient clipping
        - Updates optimizer and scheduler
        - Evaluates on validation set at the end of each epoch
        - Saves the best model checkpoint based on validation loss

        Args:
            epochs (int): number of epochs to train
            max_grad_norm (float): gradient clipping norm
            save_dir (str): directory to save the best model
        """
        best_val_loss = float("inf")
        for epoch in range(1, epochs + 1):
            print(f"\nEpoch {epoch}/{epochs}")

            # -------- TRAIN --------
            total_train_loss = 0.0
            self.model.train()
            for batch in self.train_loader:
                # move batch tensors to device
                batch = {k: v.to(self.device) for k, v in batch.items()}

                self.optimizer.zero_grad(set_to_none=True)

                # forward pass (returns loss when 'labels' is provided)
                outputs = self.model(**batch)
                loss = outputs.loss
                total_train_loss += loss.item()

                # backward pass
                loss.backward()

                # gradient clipping
                clip_grad_norm_(
                    self.model.parameters(), max_norm=max_grad_norm
                )

                # optimizer + scheduler step
                self.optimizer.step()
                self.scheduler.step()

            avg_train_loss = total_train_loss / len(self.train_loader)
            print(f"  Training loss: {avg_train_loss:.4f}")

            # -------- VALIDATE --------
            self.model.eval()
            total_val_loss = 0.0
            correct, total = 0, 0

            with torch.no_grad():
                for batch in self.valid_loader:
                    batch = {k: v.to(self.device) for k, v in batch.items()}

                    outputs = self.model(**batch)
                    loss = outputs.loss
                    logits = outputs.logits

                    total_val_loss += loss.item()

                    preds = logits.argmax(dim=-1)
                    correct += (preds == batch["labels"]).sum().item()
                    total += batch["labels"].size(0)

            avg_val_loss = total_val_loss / len(self.valid_loader)
            val_acc = correct / total if total > 0 else 0.0
            print(
                f"  Validation loss: {avg_val_loss:.4f} | Accuracy: {val_acc:.4f}"
            )

            # save best checkpoint
            if avg_val_loss < best_val_loss:
                best_val_loss = avg_val_loss
                self.model.save_pretrained(save_dir)
                self.tokenizer.save_pretrained(save_dir)
                print(f"Saved new best model to {save_dir}")

    def predict_sentiment(self, text):
        # Tokenize the input
        inputs = self.tokenizer(
            text,
            add_special_tokens=True,
            truncation=True,
            max_length=256,
            padding="max_length",  # pad single example
            return_tensors="pt",
        )
        inputs = {k: v.to(self.device) for k, v in inputs.items()}

        # Forward pass
        with torch.no_grad():
            outputs = self.model(**inputs)
            logits = outputs.logits
            probs = torch.softmax(logits, dim=-1).cpu().numpy()[0]

        # Get predicted label
        pred_label = int(logits.argmax(dim=-1).cpu().item())
        label_str = "positive" if pred_label == 1 else "negative"

        return {
            "text": text,
            "pred_label": label_str,
            "probabilities": {
                "negative": float(probs[0]),
                "positive": float(probs[1]),
            },
        }

    def predict_batch(self, texts, max_length=256):
        """
        Predict sentiment for a batch of texts (list of strings).
        Returns a list of dicts with labels and probabilities.
        """
        # Tokenize the whole batch at once
        inputs = self.tokenizer(
            texts,
            add_special_tokens=True,
            truncation=True,
            max_length=max_length,
            padding=True,  # pad to longest in batch
            return_tensors="pt",
        )
        inputs = {k: v.to(self.device) for k, v in inputs.items()}

        # Forward pass
        with torch.no_grad():
            outputs = self.model(**inputs)
            logits = outputs.logits
            probs = torch.softmax(logits, dim=-1).cpu().numpy()

        # Decode predictions
        results = []
        for text, prob in zip(texts, probs):
            pred_label = int(prob.argmax())
            label_str = "positive" if pred_label == 1 else "negative"
            results.append(
                {
                    "text": text,
                    "pred_label": label_str,
                    "probabilities": {
                        "negative": float(prob[0]),
                        "positive": float(prob[1]),
                    },
                }
            )
        return results

    def count_parameters(self):
        total = sum(p.numel() for p in self.model.parameters())
        trainable = sum(
            p.numel() for p in self.model.parameters() if p.requires_grad
        )
        return total, trainable

## Initialization

In [48]:
LLMFineTuner_demo = LLMFineTuner()

## Download the training data

In [49]:
LLMFineTuner_demo.set_dataset(verbose=True)

Dataset loaded :
Dataset({
    features: ['review', 'sentiment'],
    num_rows: 50000
})
with 50000 examples.


## Prepare model inputs

The input format expected by BERT can look over-specified, especially if you focus on a single type of task (sequence classification, word tagging, paraphrase detection, ...). The format requires you to:
- Add special tokens to the start and end of each sentence.
- Pad and truncate all sentences to a single constant length.
- Explicitly differentiate real tokens from padding tokens with the "attention mask".

It looks like this:

<img src="https://drive.google.com/uc?export=view&id=1cb5xeqLu_5vPOgs3eRnail2Y00Fl2pCo" width="600">

If you don't want to build these inputs by hand, you can use the pre-trained tokenizer associated with BERT. Its `encode_plus` method will:
- Tokenize the sentence.
- Prepend the `[CLS]` token to the start.
- Append the `[SEP]` token to the end.
- Map tokens to their IDs.
- Pad or truncate the sentence to `max_length`
- Create attention masks for `[PAD]` tokens.
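
The steps in the list above can be sketched without the library. This is a toy illustration only, not the tokenizer's actual implementation: the tiny vocabulary reuses the IDs printed in this notebook's example table, `[PAD]` is assumed to have ID 0 (as in `bert-base-uncased`), and `toy_encode` is a hypothetical helper that splits on whitespace instead of running WordPiece.

```python
# Toy vocabulary: IDs copied from the notebook's example output.
# [PAD] = 0 is an assumption.
VOCAB = {"[PAD]": 0, "[CLS]": 101, "[SEP]": 102,
         "hello": 7592, "my": 2026, "name": 2171, "is": 2003, "kevin": 4901}


def toy_encode(text, max_length):
    """Hypothetical stand-in for the tokenizer: whitespace split, no WordPiece."""
    tokens = text.lower().split()
    # Reserve two positions for [CLS] and [SEP], then truncate.
    tokens = tokens[: max_length - 2]
    tokens = ["[CLS]"] + tokens + ["[SEP]"]
    input_ids = [VOCAB[t] for t in tokens]
    # 1 marks a real token, 0 marks padding.
    attention_mask = [1] * len(input_ids)
    n_pad = max_length - len(input_ids)
    input_ids += [VOCAB["[PAD]"]] * n_pad
    attention_mask += [0] * n_pad
    return {"input_ids": input_ids, "attention_mask": attention_mask}


encoded = toy_encode("hello my name is kevin", max_length=10)
print(encoded["input_ids"])
print(encoded["attention_mask"])
```

With `max_length=10`, the five words plus two special tokens leave three padded positions at the end, each with a mask value of 0.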


> 💡 *Note:* For computational reasons, we will use the [DistilBERT](https://huggingface.co/docs/transformers/model_doc/distilbert) model, which is 40% smaller than the original BERT model but still achieves about 95% of its performance.

In [7]:
LLMFineTuner_demo.set_tokenizer(verbose=True)

Tokenizer DistilBertTokenizer loaded from distilbert-base-uncased


In [8]:
message = "hello my name is kevin"
tok = LLMFineTuner_demo.tokenizer.tokenize(message)
print("Tokens in the sequence:", tok)
enc = LLMFineTuner_demo.tokenizer.encode(tok)
table = np.array(
    [
        enc,
        [LLMFineTuner_demo.tokenizer.ids_to_tokens[w] for w in enc],
    ]
).T
print("Encoded inputs:")
print(tabulate(table, headers=["Token IDs", "Tokens"], tablefmt="fancy_grid"))

Tokens in the sequence: ['hello', 'my', 'name', 'is', 'kevin']
Encoded inputs:
╒═════════════╤══════════╕
│   Token IDs │ Tokens   │
╞═════════════╪══════════╡
│         101 │ [CLS]    │
├─────────────┼──────────┤
│        7592 │ hello    │
├─────────────┼──────────┤
│        2026 │ my       │
├─────────────┼──────────┤
│        2171 │ name     │
├─────────────┼──────────┤
│        2003 │ is       │
├─────────────┼──────────┤
│        4901 │ kevin    │
├─────────────┼──────────┤
│         102 │ [SEP]    │
╘═════════════╧══════════╛


🚧 **Question** 🚧

You noticed special tokens like `[CLS]` and `[SEP]` in the sequence. Note how they were added automatically by HuggingFace.

- Why are there such special tokens?

**Answer - Edoardo as of 08 2025**

`[CLS]`: the classification token, prepended to every sequence. After several Transformer layers, its hidden state can be treated as a summary representation of the whole sequence, which is what the classification head reads.

`[SEP]`: the separator token. Here it only marks the end of the single input, so it carries little information. It becomes essential when the input contains several segments.

A typical example is question answering, where `[SEP]` separates the question from the context passage.
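
To make the role of `[SEP]` concrete, here is a toy sketch of how a two-segment input (for example a question and a context passage) is laid out for a BERT-style model. `format_pair` is a hypothetical helper, and tokens are kept as strings for readability.

```python
def format_pair(first_tokens, second_tokens):
    """Hypothetical helper: lay out two segments in the BERT-style pair format."""
    return ["[CLS]"] + first_tokens + ["[SEP]"] + second_tokens + ["[SEP]"]


# Toy question and context, already split into tokens.
pair = format_pair(["is", "it", "good", "?"], ["it", "is", "great"])
print(pair)
```

The first `[SEP]` marks where the question ends and the context begins; the second one closes the sequence.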

## Data pre-processing

Standard data pre-processing for PyTorch.

**Explanation Edoardo 08 2025**

The `split_dataset` method called below converts raw review text and sentiment labels into token ID sequences and numerical labels that a Hugging Face model can consume.
It prepares the dataset for training: 2,000 randomly selected examples are tokenized and then split into training and validation sets for DistilBERT.

In [9]:
LLMFineTuner_demo.split_dataset()

**Explanation Edoardo 08 2025**

Because the preprocessing function did not pad the sequences, padding is applied now, at batch time. The data collator:

1. Takes the `input_ids` of a batch (which have different lengths at this point).
2. Pads them to the length of the longest sequence in that batch (efficient, avoids over-padding).
3. Adds an attention mask automatically (1 for real tokens, 0 for padding).
4. Converts everything into PyTorch tensors (`torch.LongTensor`) so the model can use them.

In short, the collator pads each batch dynamically, builds the attention mask, and returns tensors in the format the model expects.
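
A minimal pure-Python sketch of what dynamic padding does to one batch. `toy_collate` is a hypothetical stand-in for the `DataCollator` class defined earlier (which delegates to `tokenizer.pad` and returns tensors rather than lists); the pad ID of 0 is an assumption matching `bert-base-uncased`.

```python
def toy_collate(batch, pad_id=0):
    """Pad every example to the longest sequence in this batch only."""
    longest = max(len(example["input_ids"]) for example in batch)
    input_ids, attention_mask, labels = [], [], []
    for example in batch:
        ids = example["input_ids"]
        n_pad = longest - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 for real tokens, 0 for padding
        attention_mask.append([1] * len(ids) + [0] * n_pad)
        labels.append(example["labels"])
    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "labels": labels,
    }


toy_batch = toy_collate([
    {"input_ids": [101, 7592, 102], "labels": 1},
    {"input_ids": [101, 7592, 2026, 2171, 102], "labels": 0},
])
print(toy_batch["input_ids"])
print(toy_batch["attention_mask"])
```

The batch width is 5 here because that is the longest example, not a fixed `max_length`; a batch of short reviews would be narrower.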

In [10]:
LLMFineTuner_demo.set_data_collator()

In [11]:
LLMFineTuner_demo.set_loaders()

**What is done above**

| Step                   | Status | Code you already have                                               | Purpose                                                           |
| ---------------------- | ------ | ------------------------------------------------------------------- | ----------------------------------------------------------------- |
| Dataset loaded         | ✅      | `dataset` already exists                                            | Get raw reviews + sentiments                                      |
| Shuffle                | ✅      | `dataset = dataset.shuffle()`                                       | Remove order bias                                                 |
| Subsample              | ✅      | `dataset.select(range(n_samples))`                                  | Work on 2,000 examples only                                       |
| Tokenization           | ✅      | `dataset.map(preprocessing_fn, fn_kwargs={"tokenizer": tokenizer})` | Convert reviews → token IDs, sentiment → labels                   |
| Column pruning         | ✅      | `select_columns(["input_ids", "labels"])`                           | Keep only relevant fields                                         |
| Train/validation split | ✅      | `train_test_split(test_size=0.2)`                                   | Split into train/valid sets                                       |
| Data collator          | ✅      | `DataCollator(tokenizer)`                                           | Handles **dynamic padding**, adds attention mask, returns tensors |
| Dataloaders            | ✅      | `DataLoader(train_set, ... collate_fn=data_collator)`               | Feed data in mini-batches                                         |

**What is missing to run a classification using the above**

| Step                                     | What to do                                                                                                                                                            |
| ---------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Device setup & reproducibility**       | Select GPU if available (`device = torch.device("cuda" if torch.cuda.is_available() else "cpu")`), set random seeds for reproducibility (`random`, `numpy`, `torch`). |
| **Load model**                           | Use `DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)`, then move it to `device`.                                         |
| **Optimizer**                            | Define optimizer, usually `AdamW` with small LR (`2e-5` to `5e-5`) and weight decay (e.g., `0.01`).                                                                   |
| **Scheduler (optional but recommended)** | Learning rate warmup + decay, via `get_linear_schedule_with_warmup`.                                                                                                  |
| **Training loop**                        | Iterate over `train_dataloader`: forward pass, compute loss, backward pass, gradient clipping, optimizer + scheduler step.                                            |
| **Validation loop**                      | After each epoch, run `model.eval()` on `valid_dataloader`: compute val loss, accuracy, F1 score, etc.                                                                |
| **Metrics**                              | Track training loss, validation loss, and at least **accuracy** (better also F1 if dataset is imbalanced).                                                            |
| **Logging / monitoring**                 | Print metrics per epoch, optionally add progress bars (`tqdm`).                                                                                                       |
| **Checkpointing**                        | Save best model & tokenizer: `model.save_pretrained("./distilbert-best")`, `tokenizer.save_pretrained("./distilbert-best")`.                                          |
| **Sanity checks**                        | Verify `[CLS]` (id=101) at start, `[SEP]` (id=102) present, `attention_mask` correct; inspect one batch to confirm.                                                   |

## Load Model

- Loads DistilBERT, which was pretrained with masked language modeling.
- Adds a classification head (a small feed-forward block on top of the `[CLS]` hidden state).
- Sets `num_labels=2` → binary classification.
- Moves everything to the device you set (`cuda` if available). The model and the data need to be on the same device to interact.

In [12]:
LLMFineTuner_demo.set_model(verbose=True)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Model DistilBertForSequenceClassification loaded with 2 labels.


**Interpretation Output**

* DistilBERT itself is pretrained with **masked language modeling** (predict missing words) ;
* `DistilBertForSequenceClassification` adds a **randomly initialized classification head** on top ;
* The encoder starts from pretrained weights, but the classifier must be **fine-tuned on our sentiment dataset**  ;
* During fine-tuning, the head learns to map the `[CLS]` embedding → **positive / negative** ;
* Next step: **Train**, then **Use**.

Workflow:

1. **Fine-tune**
2. **Use for inference**
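
The warning above lists four newly initialized tensors. As a back-of-the-envelope check, their sizes can be computed by hand. The shapes used here are assumptions based on the standard `distilbert-base-uncased` configuration, not values printed in this notebook: a hidden size of 768, `pre_classifier` as a 768 → 768 linear layer, and `classifier` as a 768 → 2 linear layer.

```python
hidden_size = 768  # assumed hidden size of distilbert-base-uncased
num_labels = 2     # from the notebook


def linear_params(in_features, out_features):
    """Number of weights plus biases in a fully connected layer."""
    return in_features * out_features + out_features


pre_classifier = linear_params(hidden_size, hidden_size)
classifier = linear_params(hidden_size, num_labels)
head_total = pre_classifier + classifier
print(pre_classifier, classifier, head_total)
```

Under these assumptions the randomly initialized head is small compared with the total returned by `count_parameters()`; almost all weights start from the pretrained encoder.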


## Optimizer and Scheduler

**Recap Optimizer**

The optimizer is the algorithm that updates the model weights during training, using the gradients of the latest computed loss.

Adam adapts the step size for each weight individually. Weight decay helps limit overfitting by keeping weights from growing too large.

**Recap Scheduler**

Learning rate = how big the optimizer steps are.

We use a scheduler because the classification head starts from random weights: the learning rate is warmed up gradually so that early, noisy updates do not disrupt the pretrained encoder, and is then decayed towards zero.

From an **intuitive** perspective:

- Optimizer = *how do we adjust the weights based on the loss?*
- Scheduler = *how big should the step be at each point of training?*

In [13]:
LLMFineTuner_demo.set_optimizer_and_scheduler(verbose=True)

Optimizer and scheduler set with 1200 training steps and 120 warmup steps.
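
The two step counts printed above can be re-derived from the default arguments used so far. This is only an arithmetic check; the variable names mirror those in `LLMFineTuner`.

```python
import math

n_samples, test_size = 2000, 0.2     # defaults of split_dataset
train_batch_size, num_epochs = 4, 3  # defaults of set_loaders / set_scheduler

n_valid = round(n_samples * test_size)
n_train = n_samples - n_valid
steps_per_epoch = math.ceil(n_train / train_batch_size)  # batches per epoch
num_total_steps = steps_per_epoch * num_epochs
num_warmup_steps = int(0.1 * num_total_steps)            # 10% warmup
print(n_train, steps_per_epoch, num_total_steps, num_warmup_steps)
```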


## Training & Validation

**Explanation of the theory**

*Forward pass*

Input: a batch of reviews → tokenized into input_ids and attention_mask.

The model:

- Looks up embeddings for each token.
- Passes them through the DistilBERT encoder (stack of Transformer layers).
- Uses the [CLS] token’s hidden state as a representation of the whole sentence.
- Feeds that into the classification head (a small linear layer).
- Output: logits → raw, unnormalized scores for each class (positive/negative).

Mathematically:

$$
\text{logits} = W \cdot h_{\text{[CLS]}} + b
$$

*Loss computation*

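We compare the logits with the true labels (0 or 1) using the cross-entropy loss. For a single example with label $y \in \{0, 1\}$:

$$
L = - \Big( y \cdot \log(\hat{y}) + (1-y) \cdot \log(1-\hat{y}) \Big)
$$

where $\hat{y}$ is the predicted probability of the positive class (index 1), obtained by applying softmax to the logits:

$$
\hat{y} = \frac{e^{\text{logits}_1}}{\sum_j e^{\text{logits}_j}}
$$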
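
A small numeric sketch of the loss, written in pure Python rather than PyTorch. With two classes, taking the negative log of the softmax probability of the true class gives the same value as the binary formula above. The logits are hypothetical values chosen for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Negative log-probability assigned to the true class."""
    return -math.log(softmax(logits)[label])


# Hypothetical logits in the order [negative, positive]
confident_right = cross_entropy([-2.0, 3.0], label=1)
confident_wrong = cross_entropy([-2.0, 3.0], label=0)
undecided = cross_entropy([0.0, 0.0], label=1)
print(confident_right, confident_wrong, undecided)
```

A confident correct prediction costs almost nothing, an undecided one costs $\log 2$, and a confident wrong one is penalized heavily. This is why validation loss can rise even when accuracy does not fall.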


*Backward pass*

PyTorch automatically computes the gradient of the loss with respect to every model parameter. The gradient (or slope) tells us how to nudge each weight to reduce the loss.

- If the gradient is positive, we should decrease the weight, and vice versa.

*Gradient clipping*

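If the gradients are too large, we can clip them to limit their size and keep updates stable. `clip_grad_norm_` rescales all gradients together so that their global norm does not exceed the threshold:

$$
g \leftarrow g \cdot \min\left(1, \frac{c}{\lVert g \rVert_2}\right)
$$

where $c$ is the clipping threshold (`max_grad_norm`) and $\lVert g \rVert_2$ is the norm of all gradients taken together.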
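
The training loop calls `clip_grad_norm_`, which clips by the global norm of all gradients rather than element by element. A minimal pure-Python sketch of that idea follows; `clip_by_global_norm` is a hypothetical helper, and implementation details of the PyTorch function (such as a small constant added to the denominator) are left out.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Rescale all gradients together so their global L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / total_norm) if total_norm > 0 else 1.0
    return [g * scale for g in grads], total_norm


# Norm 5.0 exceeds the threshold: every gradient is scaled by 1/5.
clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
print(norm_before, clipped)

# Norm 0.5 is below the threshold: gradients are left unchanged.
small, _ = clip_by_global_norm([0.3, 0.4], max_norm=1.0)
print(small)
```

Because every gradient is multiplied by the same factor, the direction of the update is preserved and only its length is capped.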

*Optimizer*

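The optimizer then applies the gradients to update the weights, including weight decay in our case.

In its simplest form (plain gradient descent), the update is:

$$
w \leftarrow w - \eta \cdot \nabla_w L
$$

where $\eta$ is the learning rate. AdamW follows the same idea but additionally rescales each gradient using running estimates of its mean and variance.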
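
A deliberately simplified sketch of one update step: a plain gradient step plus weight decay applied directly to the weights. This is not AdamW itself, since the adaptive moment estimates are omitted; `sgd_step_with_decoupled_decay` is a hypothetical helper and the numbers are toy values.

```python
def sgd_step_with_decoupled_decay(weights, grads, lr, weight_decay):
    """One simplified update: gradient step plus decay applied to the weights."""
    return [w - lr * g - lr * weight_decay * w for w, g in zip(weights, grads)]


weights = [1.0, -2.0]
grads = [0.5, -0.5]  # positive gradient -> weight decreases, negative -> increases
updated = sgd_step_with_decoupled_decay(weights, grads, lr=0.1, weight_decay=0.01)
print(updated)
```

The decay term pulls each weight slightly towards zero regardless of its gradient, which is what keeps weights from growing too large.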

*Scheduler*

Instead of keeping the learning rate constant, we start small, increase it linearly during warmup, and then decrease it linearly to zero.
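
The shape of this schedule can be sketched as a multiplier on the base learning rate. The function below is a hypothetical re-implementation meant to mirror the behaviour of `get_linear_schedule_with_warmup`, evaluated with this notebook's step counts (1200 total, 120 warmup) and the default learning rate of `5e-5`.

```python
def linear_warmup_decay(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Linear increase from 0 to 1 during warmup
        return step / max(1, num_warmup_steps)
    # Linear decrease from 1 to 0 over the remaining steps
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


base_lr = 5e-5
for step in (0, 60, 120, 660, 1200):
    factor = linear_warmup_decay(step, num_warmup_steps=120, num_training_steps=1200)
    print(step, base_lr * factor)
```

The learning rate peaks at the end of warmup (step 120) and reaches zero at the last training step.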

*Validation*

- Switch to evaluation mode with `model.eval()` (disables dropout) and turn off gradient tracking with `torch.no_grad()`.
- Run the model on validation data.
- Compute validation loss + metrics (accuracy, F1).

This checks if the model is learning general patterns, not just memorizing training data.
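
As a toy illustration of the accuracy computed during validation, the sketch below takes the argmax of each row of logits and compares it with the label. The logits and labels are hypothetical, and plain lists stand in for tensors.

```python
def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])


def accuracy(batch_logits, labels):
    """Fraction of examples whose highest-scoring class matches the label."""
    preds = [argmax(logits) for logits in batch_logits]
    correct = sum(int(p == y) for p, y in zip(preds, labels))
    return correct / len(labels)


# Hypothetical logits for four reviews, in the order [negative, positive]
toy_logits = [[2.1, -1.0], [-0.3, 0.8], [0.2, 0.1], [-1.5, 1.7]]
toy_labels = [0, 1, 1, 1]
val_accuracy = accuracy(toy_logits, toy_labels)
print(val_accuracy)
```

Accuracy only looks at which class wins, not by how much, so it can stay stable while the loss changes.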

*Checkpoint*

Save the weights whenever the validation loss improves, so that a good model is not lost if later epochs get worse.

In [14]:
LLMFineTuner_demo.train_and_validate()


Epoch 1/3
  Training loss: 0.5504
  Validation loss: 0.7127 | Accuracy: 0.8150
  ✅ Saved new best model to ./distilbert-best

Epoch 2/3
  Training loss: 0.3403
  Validation loss: 0.8591 | Accuracy: 0.8050

Epoch 3/3
  Training loss: 0.1178
  Validation loss: 0.7688 | Accuracy: 0.8450


**Interpretation**

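Classic pattern: training loss keeps falling (0.55 → 0.12) while validation loss never improves on its epoch 1 value, which indicates mild overfitting. Accuracy is highest at epoch 3 (0.8450) even though the validation loss there is higher than at epoch 1: the model becomes more confident over epochs, and confident wrong predictions are penalized heavily by the loss (miscalibration).

Because checkpoints are saved on validation loss, the saved checkpoint is the one from epoch 1 (0.7127), the only epoch that triggered a save. Epoch 3 has better accuracy, so the choice of selection metric matters here.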
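
Replaying the checkpoint rule from `train_and_validate` (save only when validation loss strictly improves) on the values in the training log shows which epoch ends up in `save_dir`, and how that differs from selecting on accuracy.

```python
# Validation losses and accuracies copied from the training log above.
val_losses = [0.7127, 0.8591, 0.7688]
val_accs = [0.8150, 0.8050, 0.8450]

best_val_loss = float("inf")
saved_epochs = []
for epoch, loss in enumerate(val_losses, start=1):
    if loss < best_val_loss:  # same rule as in train_and_validate
        best_val_loss = loss
        saved_epochs.append(epoch)

best_acc_epoch = max(range(len(val_accs)), key=lambda i: val_accs[i]) + 1
print("epochs that triggered a save:", saved_epochs)
print("epoch with highest accuracy:", best_acc_epoch)
```

Only epoch 1 triggers a save, while accuracy peaks at epoch 3, so the two selection criteria disagree on this run.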

## Sanity Check 1

In [15]:
# Grab a single batch from the training dataloader
batch = next(iter(LLMFineTuner_demo.train_loader))

# Print shapes
print("Input IDs shape:", batch["input_ids"].shape)
print("Attention mask shape:", batch["attention_mask"].shape)
print("Labels shape:", batch["labels"].shape)

# Print the first example
first_input_ids = batch["input_ids"][0]
first_mask = batch["attention_mask"][0]
first_label = batch["labels"][0]

print("\nFirst input_ids:", first_input_ids.tolist())
print("First attention_mask:", first_mask.tolist())
print("First label:", first_label.item())

# Convert IDs back to tokens to inspect
tokens = LLMFineTuner_demo.tokenizer.convert_ids_to_tokens(first_input_ids)
print("\nDecoded tokens:", tokens)

Input IDs shape: torch.Size([4, 256])
Attention mask shape: torch.Size([4, 256])
Labels shape: torch.Size([4])

First input_ids: [101, 1045, 2113, 8750, 1012, 1045, 1005, 2310, 2042, 2046, 2009, 2146, 2077, 2009, 2150, 1037, 2120, 9575, 1025, 1045, 3866, 2743, 2863, 2077, 2087, 2111, 2354, 2054, 5202, 7384, 1062, 2130, 2001, 1012, 1998, 2074, 2061, 2017, 2113, 1045, 1005, 1049, 2025, 23678, 2075, 2055, 2026, 1010, 2292, 2033, 2360, 2023, 1024, 2041, 1997, 2035, 1996, 8750, 2015, 1045, 1005, 2310, 2464, 1010, 3317, 1999, 1996, 3712, 2003, 2011, 2521, 2028, 1997, 1996, 2190, 1012, 2009, 1005, 1055, 5793, 2111, 2360, 24462, 2185, 2003, 1996, 2190, 1010, 2021, 1045, 2428, 21090, 1012, 2087, 2111, 2069, 2113, 2008, 3185, 2138, 2009, 2028, 2019, 9078, 22117, 2100, 2400, 1025, 2023, 3475, 1005, 1056, 2019, 4654, 27609, 3370, 1011, 1045, 1005, 2310, 3491, 4615, 18847, 3630, 3489, 1998, 3317, 1999, 1996, 3712, 2000, 2111, 2040, 1005, 1040, 2069, 2412, 2464, 24462, 2185, 1010, 1998, 2027, 5993, 

- Each batch has 4 samples, as expected.
- Sequences in this batch are padded to length 256, the truncation limit, as expected.
- The attention mask has the same shape as the input IDs.
- Special tokens are present: the first ID is 101 (`[CLS]`).

## Sanity Check 2

Validate that special tokens are present in a random batch

In [16]:
batch = next(iter(LLMFineTuner_demo.train_loader))
iid = batch["input_ids"][0]
assert (
    iid[0].item() == LLMFineTuner_demo.tokenizer.cls_token_id
), "Missing [CLS]"
assert (
    (iid == LLMFineTuner_demo.tokenizer.sep_token_id).any().item()
), "Missing [SEP]"
print("Special tokens OK ✔️")

Special tokens OK ✔️


## Sanity Check 3

Check truncation and padding rates

In [17]:
def trunc_pad_stats(dataloader, max_len=256):
    n, n_trunc, n_pad = 0, 0, 0
    for b in dataloader:
        input_ids = b["input_ids"]
        attn = b["attention_mask"]
        # padded examples have any 0 in mask
        n_pad += (attn.sum(dim=1) < input_ids.size(1)).sum().item()
        # truncated examples exactly hit max_len AND have no padding
        n_trunc += ((attn.sum(dim=1) == max_len)).sum().item()
        n += input_ids.size(0)
    return {
        "total_examples": n,
        "padded_frac": n_pad / n,
        "truncated_frac": n_trunc / n,
        "exact_len_frac": (
            n_trunc / n
        ),  # same as truncated_frac with this logic
    }

In [24]:
stats = trunc_pad_stats(LLMFineTuner_demo.train_loader, max_len=256)
print(json.dumps(stats, indent=4))

{
    "total_examples": 1600,
    "padded_frac": 0.528125,
    "truncated_frac": 0.4425,
    "exact_len_frac": 0.4425
}


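*Conclusion sanity check 3*

About 44% of the training reviews reach the 256-token limit and are therefore almost certainly truncated. If long-range context matters, consider raising `max_length`, up to the 512 tokens DistilBERT supports, at the cost of more memory and compute.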

## Sanity Check 4

Spot check a padded example

In [19]:
def show_padded_example(dataloader, tokenizer):
    for b in dataloader:
        for i in range(b["input_ids"].size(0)):
            attn = b["attention_mask"][i]
            if attn[-1].item() == 0:  # ends with padding
                ids = b["input_ids"][i]
                toks = tokenizer.convert_ids_to_tokens(ids)
                print("...tokens tail:", toks[-30:])
                print("...mask tail:", attn[-30:].tolist())
                return
    print("No padded example found in this pass")

In [20]:
show_padded_example(
    LLMFineTuner_demo.train_loader, LLMFineTuner_demo.tokenizer
)

...tokens tail: ['[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]']
...mask tail: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]


*Conclusion sanity check 4*

Short reviews are padded up to the longest sequence in the batch, and the attention mask is 0 on every `[PAD]` position. The model therefore does not attend to the pad tokens, so padding does not pollute the context.
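
The effect of the mask can be illustrated with a framework-free masked softmax. This is a simplified sketch of the idea, not DistilBERT's actual attention code; `masked_softmax` and the scores are hypothetical, and at least one position is assumed to be unmasked.

```python
import math


def masked_softmax(scores, mask):
    """Softmax over scores where positions with mask == 0 get zero weight."""
    # Masked positions are set to -inf so that exp(-inf) == 0.
    masked = [s if m == 1 else float("-inf") for s, m in zip(scores, mask)]
    top = max(masked)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical attention scores of one query over 5 key positions;
# the last two positions are [PAD].
scores = [2.0, 1.0, 0.5, 3.0, 3.0]
mask = [1, 1, 1, 0, 0]
weights = masked_softmax(scores, mask)
print([round(w, 3) for w in weights])
```

Even though the two padded positions have the highest raw scores, they receive zero weight and the remaining weights still sum to 1.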

## Inference helper

With this helper, we can feed raw text and get sentiment predictions from our fine-tuned DistilBERT.

In [27]:
print(
    json.dumps(
        LLMFineTuner_demo.predict_sentiment(
            "I absolutely loved this movie, it was fantastic!"
        ),
        indent=4,
    )
)
print(
    json.dumps(
        LLMFineTuner_demo.predict_sentiment(
            "This was the worst film I have ever seen."
        ),
        indent=4,
    )
)

{
    "text": "I absolutely loved this movie, it was fantastic!",
    "pred_label": "positive",
    "probabilities": {
        "negative": 0.0015521567547693849,
        "positive": 0.9984477758407593
    }
}
{
    "text": "This was the worst film I have ever seen.",
    "pred_label": "negative",
    "probabilities": {
        "negative": 0.9980460405349731,
        "positive": 0.001953961793333292
    }
}
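
The definition of `predict_sentiment` is not shown in this section. The sketch below only illustrates how two class logits could be turned into a result with the same shape as the output above; `format_prediction`, the label order, and the logit values are assumptions made for illustration.

```python
import json
import math


def format_prediction(text, logits, labels=("negative", "positive")):
    """Turn raw class logits into a label and per-class probabilities."""
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return {
        "text": text,
        "pred_label": labels[best],
        "probabilities": dict(zip(labels, probs)),
    }


# Hypothetical logits, chosen for illustration only.
result = format_prediction("I loved it!", [-3.2, 3.3])
print(json.dumps(result, indent=4))
```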


In [28]:
model_path = "./distilbert-best"
LLMFineTuner_1 = LLMFineTuner(pretrained_model_name_or_path=model_path)
LLMFineTuner_1.set_tokenizer()
LLMFineTuner_1.set_model()
LLMFineTuner_1.model.eval()

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): DistilBertSdpaAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)


In [31]:
print(
    json.dumps(
        LLMFineTuner_1.predict_sentiment(
            "I absolutely loved this movie, it was fantastic!"
        ),
        indent=4,
    )
)
print(
    json.dumps(
        LLMFineTuner_1.predict_sentiment(
            "This was the worst film I have ever seen."
        ),
        indent=4,
    )
)

{
    "text": "I absolutely loved this movie, it was fantastic!",
    "pred_label": "positive",
    "probabilities": {
        "negative": 0.00686316704377532,
        "positive": 0.9931368827819824
    }
}
{
    "text": "This was the worst film I have ever seen.",
    "pred_label": "negative",
    "probabilities": {
        "negative": 0.9953239560127258,
        "positive": 0.004676037933677435
    }
}


**Summary**

We now have:

1. Preprocessing & batching
2. Training and validation
3. Saving of the best model
4. Inference on new text (one text at a time in the examples above, not in batches)

## Batch Inference Helper - Example Usage

In [33]:
reviews = [
    "I absolutely loved this movie, it was fantastic!",
    "This was the worst film I have ever seen.",
    "The acting was decent but the story was too slow.",
    "What a masterpiece - I'd watch it again and again!",
]

batch_results = LLMFineTuner_1.predict_batch(reviews)
for res in batch_results:
    print(json.dumps(res, indent=4))

{
    "text": "I absolutely loved this movie, it was fantastic!",
    "pred_label": "positive",
    "probabilities": {
        "negative": 0.006863163318485022,
        "positive": 0.9931368827819824
    }
}
{
    "text": "This was the worst film I have ever seen.",
    "pred_label": "negative",
    "probabilities": {
        "negative": 0.9953239560127258,
        "positive": 0.0046760402619838715
    }
}
{
    "text": "The acting was decent but the story was too slow.",
    "pred_label": "negative",
    "probabilities": {
        "negative": 0.9909650087356567,
        "positive": 0.009034992195665836
    }
}
{
    "text": "What a masterpiece - I'd watch it again and again!",
    "pred_label": "positive",
    "probabilities": {
        "negative": 0.010801234282553196,
        "positive": 0.9891987442970276
    }
}
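
Batching texts of different lengths requires padding them to a common length. How `predict_batch` does this is not shown here; the sketch below illustrates the general idea with a hypothetical `pad_batch` helper and made-up token ids. The pad id 0 matches `padding_idx=0` in the DistilBERT embedding printed earlier.

```python
def pad_batch(sequences, pad_id=0):
    """Pad token-id lists to the longest one and build attention masks."""
    longest = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = longest - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Made-up token ids: 101 and 102 stand for [CLS] and [SEP].
batch = pad_batch([[101, 2023, 102], [101, 2023, 2001, 2307, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

In practice the tokenizer can do this itself when called on a list of texts with padding enabled; the helper above only makes the mechanics explicit.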


## Pick New Model

We now switch to another model, RoBERTa, which:

- Is larger
- Has a different tokenizer
- Uses a different pretraining recipe

In [34]:
# 1) We pick the comparison model : roberta-base
# other options that could be tested: "bert-base-uncased",
# "microsoft/MiniLM-L6-H384-uncased"
model_name = "roberta-base"
LLMFineTuner_2 = LLMFineTuner(
    model_cls=AutoModelForSequenceClassification,
    pretrained_model_name_or_path=model_name,
    tokenizer_cls=AutoTokenizer,
)
# 2) We load tokenizer + model (for binary classification)
LLMFineTuner_2.set_tokenizer(verbose=True)
LLMFineTuner_2.set_model(verbose=True)

Tokenizer AutoTokenizer loaded from roberta-base


Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Model AutoModelForSequenceClassification loaded with 2 labels.


In [36]:
total_params, trainable_params = LLMFineTuner_2.count_parameters()
print(
    f"[{model_name}] Total params: {total_params/1e6:.1f}M | "
    f"Trainable: {trainable_params/1e6:.1f}M | "
    f"Device: {next(LLMFineTuner_2.model.parameters()).device}"
)

[roberta-base] Total params: 124.6M | Trainable: 124.6M | Device: cuda:0


| Property                            | **DistilBERT (base-uncased)**                                                     | **RoBERTa-base**                                       |
| ----------------------------------- | --------------------------------------------------------------------------------- | ------------------------------------------------------ |
| Layers (Transformer encoder blocks) | **6**                                                                             | **12**                                                 |
| Hidden size $d_\text{model}$        | **768**                                                                           | **768**                                                |
| Attention heads                     | **12**                                                                            | **12**                                                 |
| Dim per head                        | 64                                                                                | 64                                                     |
| FFN/intermediate size               | **3072**                                                                          | **3072**                                               |
| Parameters (approx.)                | **\~66M**                                                                         | **\~125M**                                             |
| Max sequence length                 | 512                                                                               | 512 usable (514-row position table, offset by padding) |
| Positional embeddings               | Learned absolute                                                                  | Learned absolute                                       |
| Tokenizer & vocab                   | **WordPiece**, 30,522, *uncased*                                                  | **Byte-level BPE**, 50,265, *cased*                    |
| Segment (token type) embeddings     | **Not used** (no token-type embeddings in the model)                              | **Not used** (single token type)                       |
| Special tokens                      | `[CLS] [SEP] [PAD] [MASK]`                                                        | `<s> </s> <pad> <mask>`                                |
| Pretraining objective               | **Masked LM** + **distillation** from BERT-base (adds KL + cosine losses; no NSP) | **Masked LM only**, **dynamic masking**; **no NSP**    |
| Pretraining corpora (high-level)    | Wikipedia + BookCorpus (same corpus as BERT)                                      | Larger mix (BookCorpus, Wikipedia, CC-News, OpenWebText, Stories) |
| Typical inference speed             | **Faster** (half the depth)                                                       | Slower vs DistilBERT (deeper)                          |
| Typical memory/VRAM                 | **Lower**                                                                         | Higher                                                 |
| Practical trade-off                 | Efficiency with \~95–97% of BERT-base accuracy                                    | Strong baseline accuracy; heavier & costlier           |
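
The parameter counts in the table can be cross-checked from the architecture rows. The sketch below assumes a standard BERT-style layout (biases on every linear layer, one LayerNorm after the embeddings and two per block); `encoder_params` is a hypothetical helper, and the head sizes follow the `classifier.dense` and `classifier.out_proj` layers named in the loading warning above.

```python
def encoder_params(vocab, positions, token_types, hidden, ffn, layers):
    """Count weights and biases of a BERT-style encoder (no task head)."""
    layer_norm = 2 * hidden  # scale and shift
    embeddings = (vocab + positions + token_types) * hidden + layer_norm
    attention = 4 * (hidden * hidden + hidden)  # Q, K, V, output projections
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)
    per_layer = attention + feed_forward + 2 * layer_norm
    return embeddings + layers * per_layer


# DistilBERT has no token-type embeddings; RoBERTa has a single token type
# and a 514-row position table.
distilbert = encoder_params(30522, 512, 0, 768, 3072, 6)
roberta = encoder_params(50265, 514, 1, 768, 3072, 12)

# Two-label RoBERTa head: dense (768 -> 768) + out_proj (768 -> 2).
roberta_head = (768 * 768 + 768) + (768 * 2 + 2)

print(f"DistilBERT encoder: {distilbert / 1e6:.1f}M")
print(f"RoBERTa encoder:    {roberta / 1e6:.1f}M")
print(f"RoBERTa + head:     {(roberta + roberta_head) / 1e6:.1f}M")
```

Under these assumptions the last line gives 124.6M, which agrees with the count reported by `count_parameters()` above. Most of the gap between the two models comes from the six extra Transformer blocks and the larger vocabulary.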