# Text classification using Transformers.

<p align="center">
  <a href="https://raw.githubusercontent.com/auduvignac/llm-finetuning/refs/heads/main/notebooks/corrections/distilbert-finetuning.ipynb" target="_blank">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Google Colab"/>
  </a>
</p>

This lab continues our work on text classification with the IMDb dataset.
In this session, we focus on the encoder-based transformer architecture, through the lens of its most famous model: **BERT**.

---

# Introduction

## HuggingFace

We have already experimented with some components provided by the HuggingFace library:
- the `datasets` library,
- the `tokenizer`.

The HuggingFace `transformers` library also provides a convenient API for working with transformer models such as BERT and GPT. To quote their website: *Transformers provides APIs and tools to easily download and train state-of-the-art pretrained models. Using pretrained models can reduce your compute costs, carbon footprint, and save you the time and resources required to train a model from scratch. Transformers support framework interoperability between PyTorch, TensorFlow, and JAX.*

## Goal of the lab session

We will experiment with the HuggingFace library. You will load a model and run it on your task.

Important things to keep in mind:
- Even if each model is a Transformer, they all have their peculiarities.
- What is the exact input format expected by the model?
- What is its exact output?
- Can you use the available model as is or should you make some modifications for your task?

These questions are part of the daily life of an NLP scientist. We will address some of them in this lab and in the next lessons, labs and homework.

In [None]:
%%capture
!pip install transformers datasets

In [None]:
%matplotlib inline
%config InlineBackend.figure_formats = ['svg']
import math

import matplotlib.pyplot as plt
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from datasets import load_dataset
from tabulate import tabulate
from torch.utils.data import DataLoader

from tqdm.notebook import tqdm
from transformers import DistilBertTokenizer

# If the machine you run this on has a GPU available with CUDA installed,
# use it. Using a GPU for learning often leads to huge speedups in training.
# See https://developer.nvidia.com/cuda-downloads for installing CUDA
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
DEVICE

## Download the training data

In [None]:
dataset = load_dataset("scikit-learn/imdb", split="train")
print(dataset)

## Prepare model inputs

BERT's input format can look "over-specified", especially if you focus on a single type of task (sequence classification, word tagging, paraphrase detection, ...). The format requires us to:
- Add special tokens to the start and end of each sentence.
- Pad & truncate all sentences to a single constant length.
- Explicitly differentiate real tokens from padding tokens with the "attention mask".

It looks like this:

<img src="https://drive.google.com/uc?export=view&id=1cb5xeqLu_5vPOgs3eRnail2Y00Fl2pCo" width="600">

If you don't want to build these inputs by hand, you can use the pre-trained tokenizer associated with BERT. Its `encode_plus` method will:
- Tokenize the sentence.
- Prepend the `[CLS]` token to the start.
- Append the `[SEP]` token to the end.
- Map tokens to their IDs.
- Pad or truncate the sentence to `max_length`.
- Create the attention mask that separates real tokens from `[PAD]` tokens.
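
To make these steps concrete, here is a minimal sketch that reproduces them by hand. It assumes a toy vocabulary with made-up IDs (`VOCAB` and `build_inputs` are hypothetical names, not part of HuggingFace), so the numbers will not match the real DistilBERT tokenizer.

```python
# Toy vocabulary: hypothetical IDs for illustration only.
VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
         "a": 4, "great": 5, "movie": 6}


def build_inputs(tokens, max_length):
    """Build BERT-style inputs from a list of already tokenized words."""
    # Truncate, keeping two positions for [CLS] and [SEP].
    tokens = tokens[: max_length - 2]
    # Add the special tokens at the start and at the end.
    tokens = ["[CLS]"] + tokens + ["[SEP]"]
    # Map tokens to IDs (unknown tokens fall back to [UNK]).
    input_ids = [VOCAB.get(t, VOCAB["[UNK]"]) for t in tokens]
    # Real tokens get 1 in the attention mask, padding gets 0.
    attention_mask = [1] * len(input_ids)
    n_pad = max_length - len(input_ids)
    input_ids += [VOCAB["[PAD]"]] * n_pad
    attention_mask += [0] * n_pad
    return {"input_ids": input_ids, "attention_mask": attention_mask}


encoded = build_inputs(["a", "great", "movie"], max_length=8)
print(encoded["input_ids"])       # [2, 4, 5, 6, 3, 0, 0, 0]
print(encoded["attention_mask"])  # [1, 1, 1, 1, 1, 0, 0, 0]
```

The attention mask is what lets the model ignore the padded positions when computing attention.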


> 💡 *Note:* For computational reasons, we will use the [DistilBERT](https://huggingface.co/docs/transformers/model_doc/distilbert) model, which is 40% smaller than the original BERT model but still achieves about 95% of its performance.

In [None]:
tokenizer = DistilBertTokenizer.from_pretrained(
    "distilbert-base-uncased", do_lower_case=True
)

Let's see how the tokenizer actually processes the sequence:

In [None]:
# Some useful steps:
message = "hello my name is kevin"
tok = tokenizer.tokenize(message)
print("Tokens in the sequence:", tok)
enc = tokenizer.encode(tok)
table = np.array(
    [
        enc,
        [tokenizer.ids_to_tokens[w] for w in enc],
    ]
).T
print("Encoded inputs:")
print(tabulate(table, headers=["Token IDs", "Tokens"], tablefmt="fancy_grid"))
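
The tokenizer follows the WordPiece scheme: a word that is not in the vocabulary is split into the longest known pieces, and continuation pieces are marked with `##`. The sketch below illustrates that greedy longest-match-first idea on a toy vocabulary. `TOY_VOCAB` and `wordpiece` are hypothetical names, and the real tokenizer also handles lower-casing, punctuation and a much larger vocabulary, so its output may differ.

```python
# Toy vocabulary, assumed for illustration only.
TOY_VOCAB = {"hello", "my", "name", "is", "kev", "##in", "play", "##ing"}


def wordpiece(word, vocab, unk="[UNK]"):
    """Split one word into sub-word pieces, longest match first."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation of a word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No known piece at this position: the whole word is unknown.
            return [unk]
        pieces.append(piece)
        start = end
    return pieces


tokens = [p for w in "hello my name is kevin".split()
          for p in wordpiece(w, TOY_VOCAB)]
print(tokens)  # ['hello', 'my', 'name', 'is', 'kev', '##in']
```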

🚧 **Question** 🚧

You noticed special tokens like `[CLS]` and `[SEP]` in the sequence. Note how they were added automatically by HuggingFace.

- Why are there such special tokens?

**Answer**

Special tokens like `[CLS]` and `[SEP]` are used by BERT and similar transformer models to structure input sequences for different tasks:

- `[CLS]` (classification): Added at the start of every input sequence. Its final hidden state is used as the aggregate representation for classification tasks.
- `[SEP]` (separator): Used to separate different segments (e.g., two sentences in sentence-pair tasks) and to mark the end of a sequence.

These tokens help the model understand the boundaries and roles of different parts of the input, enabling it to perform tasks like classification, question answering, or sentence pair comparison.

## Data pre-processing

Usual data processing for PyTorch, same as in the previous lab.

In [None]:
def preprocessing_fn(x, tokenizer):
    """
    Preprocesses a single example for BERT/DistilBERT input.

    Args:
        x (dict): A dictionary containing the keys "review" (text) and
                  "sentiment" (label).
        tokenizer (transformers.PreTrainedTokenizer): Tokenizer to encode the
                  text.

    Returns:
        dict: Dictionary with "input_ids" (token ids for the review) and
              "labels" (0 for negative, 1 for positive).
    """
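    x["input_ids"] = tokenizer.encode(
        x["review"],
        # Keep [CLS] and [SEP]: the classifier built later reads the vector at
        # position 0, which is the [CLS] token only if special tokens are added.
        add_special_tokens=True,
        truncation=True,
        max_length=256,
        padding=False,
        return_attention_mask=False,
    )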
    x["labels"] = 0 if x["sentiment"] == "negative" else 1
    return x

In [None]:
n_samples = 2000  # number of examples kept (training + validation)

# First, shuffle the data.
dataset = dataset.shuffle()

# Select the first n_samples examples
splitted_dataset = dataset.select(range(n_samples))

# Tokenize the dataset
splitted_dataset = splitted_dataset.map(
    preprocessing_fn, fn_kwargs={"tokenizer": tokenizer}
)


# Remove useless columns
splitted_dataset = splitted_dataset.select_columns(["input_ids", "labels"])

# Split the train and validation
splitted_dataset = splitted_dataset.train_test_split(test_size=0.2)

train_set = splitted_dataset["train"]
valid_set = splitted_dataset["test"]

In [None]:
class DataCollator:
    """
    Data collator for batching and padding inputs for BERT/DistilBERT models.

    This class pads a batch of tokenized inputs to the same length and returns
    tensors suitable for model input. It uses the provided tokenizer's pad
    method.

    Args:
        tokenizer (transformers.PreTrainedTokenizer): Tokenizer used for
        padding and tensor conversion.

    Methods:
        __call__(batch): Pads and converts a batch of tokenized inputs to
                          tensors.
    """

    def __init__(self, tokenizer):
        """
        Initializes the DataCollator with a tokenizer.

        Args:
            tokenizer (transformers.PreTrainedTokenizer): Tokenizer for padding
              and tensor conversion.
        """
        self.tokenizer = tokenizer

    def __call__(self, batch):
        """
        Pads and converts a batch of tokenized inputs to tensors.

        Args:
            batch (list of dict): List of tokenized input dictionaries.

        Returns:
            dict: Dictionary of padded tensors ready for model input.
        """
        return self.tokenizer.pad(
            batch, padding="longest", max_length=256, return_tensors="pt"
        )

In [None]:
data_collator = DataCollator(tokenizer)
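
To see what `padding="longest"` does, here is a minimal pure-Python sketch of dynamic padding: each batch is padded only up to its own longest sequence rather than to a global maximum. `pad_batch` is a hypothetical helper, and the pad ID of 0 is an assumption. The real collator delegates this work to `tokenizer.pad` and returns tensors.

```python
def pad_batch(batch, pad_id=0):
    """Pad every example of a batch to the length of its longest sequence."""
    longest = max(len(example["input_ids"]) for example in batch)
    input_ids, attention_mask = [], []
    for example in batch:
        ids = example["input_ids"]
        n_pad = longest - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 for real tokens, 0 for padding.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "labels": [example["labels"] for example in batch],
    }


toy_batch = [
    {"input_ids": [101, 7, 8, 102], "labels": 1},
    {"input_ids": [101, 9, 102], "labels": 0},
]
padded = pad_batch(toy_batch)
print(padded["input_ids"])       # [[101, 7, 8, 102], [101, 9, 102, 0]]
print(padded["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```

Padding per batch keeps short batches short, which saves computation compared with padding everything to 256 tokens.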

In [None]:
batch_size = 4

train_dataloader = DataLoader(
    train_set, batch_size=batch_size, collate_fn=data_collator
)
valid_dataloader = DataLoader(
    valid_set, batch_size=batch_size, collate_fn=data_collator
)
n_valid = len(valid_set)
n_train = len(train_set)

# Model from scratch

For this task, we will start from a randomly initialized model.

## Retrieve the architecture configuration

In HuggingFace, a model's hyperparameters are specified through a `config` object, which behaves like a JSON dictionary.

We can retrieve the one from the official model with the following code:

In [None]:
from transformers import DistilBertConfig

model_config = DistilBertConfig.from_pretrained("distilbert-base-uncased")
print(model_config)

🚧 **Question** 🚧

Make sure you understand the parameters of the configuration.
- Which ones are task-agnostic parameters?
- Which ones are not?
- Why are there different parameters for different tasks?

**Answer**

- **Task-agnostic parameters** are those that define the core architecture and behavior of the transformer model itself, regardless of the downstream task. Examples include:
  - `vocab_size`: Size of the vocabulary.
  - `max_position_embeddings`: Maximum sequence length.
  - `n_layers`: Number of transformer layers.
  - `n_heads`: Number of attention heads.
  - `dim`: Hidden size of the model.
  - `dropout`, `attention_dropout`: Dropout rates.
  - `activation`: Activation function.
  - `initializer_range`: Range for weight initialization.

- **Task-specific parameters** are those added or modified for a particular downstream task (e.g., classification, token classification, question answering). Examples include:
  - `num_labels`: Number of output classes (for classification tasks).
  - `id2label`, `label2id`: Mappings for label names.
  - `problem_type`: Specifies the type of classification (single-label, multi-label, regression).
  - `seq_classif_dropout`, `qa_dropout`: Dropout rates applied in the sequence-classification and question-answering heads.

- **Why different parameters for different tasks?**
  - The base transformer architecture is shared across tasks, but each NLP task requires a different output format or head (e.g., a classification layer for sentiment analysis, a span prediction head for question answering). Task-specific parameters tell the library how to size that head and which loss function to use, so the same backbone can be reused for each task.



Several architectures are available for DistilBERT on HuggingFace, designed for a variety of NLP tasks. These classes all share the same DistilBERT backbone, but each adds different top layers and output types to accommodate its specific NLP task.

Here is the current list of classes provided for fine-tuning:
* DistilBertModel
* DistilBertForMaskedLM
* DistilBertForSequenceClassification
* DistilBertForMultipleChoice
* DistilBertForTokenClassification
* DistilBertForQuestionAnswering

The documentation for these can be found [here](https://huggingface.co/docs/transformers/model_doc/distilbert).




🚧 **TODO** 🚧

For our first experiment, we want to build from a standard stack of transformer layers, without any additional task-specific head.

Which architecture corresponds to this?

Choose the right one and initialize the model below, with the config.

We need `DistilBertModel` because it provides only the core transformer stack (encoder layers) of DistilBERT, **without any task-specific head** (such as classification, token classification, or question answering layers). 

This is ideal when we want to use the model as a feature extractor or build our own custom head (e.g., for classification) on top of the transformer outputs. Other classes like `DistilBertForSequenceClassification` include additional layers for specific tasks, but `DistilBertModel` gives us just the backbone.

In [None]:
from transformers import DistilBertModel

bert = DistilBertModel(model_config)

In [None]:
print(bert)

Just for curiosity's sake, we can browse all of the model's parameters by name.

The cells below print the names and dimensions of the weights for:

- the embedding layer,
- the first of the six transformer layers.



In [None]:
# Get all of the model's parameters as a list of tuples.
params = list(bert.named_parameters())

In [None]:
print(
    "The BERT model has {:} different named parameters.\n".format(len(params))
)

print("==== Embedding Layer ====\n")

for p in params[:4]:
    print("{:<55} {:>12}".format(p[0], str(tuple(p[1].size()))))

print("\n==== First Transformer Layer ====\n")

for p in params[4:20]:
    print("{:<55} {:>12}".format(p[0], str(tuple(p[1].size()))))
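
As a sanity check on the listing above, the parameter count can be recomputed by hand. This is a sketch under stated assumptions: the values below are those of the `distilbert-base-uncased` configuration, and the layer layout (four attention projections, a two-layer feed-forward block and two layer norms per transformer layer) is the one shown by `print(bert)`. The helper `linear` is a hypothetical name.

```python
# Values assumed from the distilbert-base-uncased configuration.
vocab_size, max_positions = 30522, 512
dim, hidden_dim, n_layers = 768, 3072, 6


def linear(n_in, n_out):
    """Number of parameters of a linear layer: weights plus biases."""
    return n_in * n_out + n_out


layer_norm = 2 * dim  # one scale and one shift per dimension

# Word embeddings + position embeddings + embedding layer norm.
embeddings = vocab_size * dim + max_positions * dim + layer_norm
# Query, key, value and output projections.
attention = 4 * linear(dim, dim)
# Two-layer feed-forward network.
feed_forward = linear(dim, hidden_dim) + linear(hidden_dim, dim)
one_layer = attention + layer_norm + feed_forward + layer_norm

total = embeddings + n_layers * one_layer
# 4 embedding tensors, then 16 tensors per transformer layer.
n_named_parameters = 4 + n_layers * 16

print(f"{total:,} parameters in {n_named_parameters} named tensors")
```

If these assumptions hold, `total` should agree with `sum(p.numel() for p in bert.parameters())`, which is roughly 66 million.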

🚧 **TODO** 🚧

Test your `bert` model.
We could already run the model on the validation set. Before doing so, look at the output of the model on a single batch.
- Interpret the output.
- Do you understand everything?


In [None]:
batch = next(iter(train_dataloader))

input_ids = batch["input_ids"]
attention_mask = batch["attention_mask"]

output = bert(input_ids=input_ids, attention_mask=attention_mask)
print(
    output["last_hidden_state"].shape
)  # (4 texts, longest sequence in the batch (<= 256), 768-dim embeddings)

## Building a classifier

Our `bert` model is simply a stack of transformer layers. We would like to use it as a backbone for text classification.

🚧 **TODO** 🚧

Wrap the model into a classifier.

> 💡 *Hint*: Use the last hidden [CLS] vector representation to perform classification.

In [None]:
class DistilBertClassifier(nn.Module):
    """
    A simple text classifier built on top of a DistilBERT backbone.

    This classifier uses the output of the [CLS] token from the last hidden
    state of DistilBERT and applies a dropout followed by a linear layer for
    binary classification.

    Args:
        bert (DistilBertModel): The DistilBERT model providing contextualized
          embeddings.

    Methods:
        forward(input_ids, attention_mask, **kwargs): Returns logits for each
          input sequence.
    """

    def __init__(self, bert):
        super().__init__()
        self.bert = bert
        self.drop = nn.Dropout(0.3)
        self.linear = nn.Linear(768, 1)

    def forward(self, input_ids, attention_mask, **kwargs):
        """
        Performs a forward pass through the classifier.

        Args:
            input_ids (torch.Tensor): Token IDs for each input sequence.
            attention_mask (torch.Tensor): Attention mask to differentiate real
              tokens from padding.
            **kwargs: Additional arguments (not used).

        Returns:
            torch.Tensor: Logits for each input sequence
              (shape: [batch_size, 1]).
        """
        output = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        vectors = output["last_hidden_state"]  # (b, l, d)
        cls_vector = vectors[:, 0]  # (b, d)
        # Apply the dropout declared in __init__ before the linear layer.
        return self.linear(self.drop(cls_vector))  # (b, 1)
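
The shape bookkeeping in `forward` can be checked on a tiny example. The sketch below redoes the pooling and the linear layer with plain Python lists (dropout is left out, as it would be in evaluation mode). `cls_logits` and the numbers are hypothetical. Note that position 0 holds the `[CLS]` token only when the tokenizer adds special tokens.

```python
def cls_logits(hidden_states, weight, bias):
    """hidden_states has shape (batch, length, dim); returns (batch, 1)."""
    logits = []
    for sequence in hidden_states:
        cls_vector = sequence[0]  # vector at position 0, shape (dim,)
        score = sum(w * h for w, h in zip(weight, cls_vector)) + bias
        logits.append([score])  # one logit per sequence
    return logits


# Batch of 2 sequences, 2 positions each, hidden size 3.
hidden_states = [
    [[1.0, 0.0, 2.0], [9.0, 9.0, 9.0]],
    [[0.0, 1.0, 0.0], [9.0, 9.0, 9.0]],
]
logits = cls_logits(hidden_states, weight=[0.5, -1.0, 0.25], bias=0.1)
print(len(logits), len(logits[0]))  # 2 1
```

Only the first position of each sequence contributes to the logit. The other positions influence it indirectly, through the attention layers of the backbone.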

In [None]:
bert = DistilBertModel(model_config)
model = DistilBertClassifier(bert)
model.to(DEVICE)

🚧 **TODO** 🚧

Test your model on the batch.
Make sure it has the right shape.

In [None]:
out = model(
    input_ids=batch["input_ids"].to(DEVICE),
    attention_mask=batch["attention_mask"].to(DEVICE),
)
print(out.shape)

### Training

🚧 **TODO** 🚧

Train your model.
Make sure you track the following quantities per epoch:
- training loss
- training accuracy
- validation loss
- validation accuracy

In [None]:
def validation(model, valid_dataloader):
    """
    Evaluates the model on the validation set.

    Args:
        model (nn.Module): The classifier model to evaluate.
        valid_dataloader (DataLoader): DataLoader for the validation set.

    Returns:
        tuple: Average validation loss and accuracy.
    """
    total_size = 0
    acc_total = 0
    loss_total = 0
    criterion = nn.BCEWithLogitsLoss()
    model.eval()
    with torch.no_grad():
        for batch in tqdm(valid_dataloader):
            batch = {k: v.to(DEVICE) for k, v in batch.items()}
            input_ids = batch["input_ids"]
            labels = batch["labels"]
            attention_mask = batch["attention_mask"]
            preds = model(input_ids=input_ids, attention_mask=attention_mask)
            # squeeze(-1) keeps the batch dimension even for a batch of size 1
            logits = preds.squeeze(-1)
            loss = criterion(logits, labels.float())
            acc = (logits > 0) == labels
            total_size += acc.shape[0]
            acc_total += acc.sum().item()
            loss_total += loss.item()
    model.train()
    return loss_total / len(valid_dataloader), acc_total / total_size


validation(model, valid_dataloader)
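
Two details of `validation` deserve a closer look: the loss works directly on logits, and a prediction counts as positive when the logit is above 0. The sketch below checks both in plain Python. It uses the standard numerically stable form of binary cross-entropy on logits, and the function names are hypothetical. It is not PyTorch's implementation, which also averages over the batch by default.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def bce_with_logits(logit, target):
    """Stable binary cross-entropy: max(x, 0) - x * y + log(1 + exp(-|x|))."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def naive_bce(logit, target):
    """Same loss, computed from the probability (less stable for large |x|)."""
    p = sigmoid(logit)
    return -(target * math.log(p) + (1 - target) * math.log(1 - p))


examples = [(-2.0, 0), (0.5, 1), (3.0, 0), (-0.1, 1)]
for logit, target in examples:
    # Both formulas give the same loss value.
    assert abs(bce_with_logits(logit, target) - naive_bce(logit, target)) < 1e-9
    # Thresholding the logit at 0 is thresholding the probability at 0.5.
    assert (logit > 0) == (sigmoid(logit) > 0.5)

print(bce_with_logits(0.0, 1))  # log(2): the loss of an undecided model
```

A randomly initialized classifier typically starts close to this value of log(2) ≈ 0.693, which gives a useful baseline when reading the loss curves.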

In [None]:
def training(model, n_epochs, train_dataloader, valid_dataloader, lr=5e-5):
    optimizer = torch.optim.AdamW(
        model.parameters(),
        lr=lr,
        eps=1e-08,
    )
    list_val_acc = []
    list_train_acc = []
    list_train_loss = []
    list_val_loss = []
    criterion = nn.BCEWithLogitsLoss()
    for e in range(n_epochs):
        # ========== Training ==========

        # Set model to training mode
        model.train()
        model.to(DEVICE)

        # Tracking variables
        train_loss = 0
        epoch_train_acc = 0
        for batch in tqdm(train_dataloader):
            batch = {k: v.to(DEVICE) for k, v in batch.items()}
            input_ids, attention_mask, labels = (
                batch["input_ids"],
                batch["attention_mask"],
                batch["labels"],
            )
            optimizer.zero_grad()
            # Forward pass
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
            # squeeze(-1) keeps the batch dimension even for a batch of size 1
            loss = criterion(outputs.squeeze(-1), labels.float())

            # Backward pass
            loss.backward()

            # Optimization
            optimizer.step()

            train_loss += loss.detach().cpu().item()
            acc = (outputs.squeeze(-1) > 0) == labels
            epoch_train_acc += acc.float().mean().item()
        list_train_acc.append(100 * epoch_train_acc / len(train_dataloader))
        list_train_loss.append(train_loss / len(train_dataloader))

        # ========== Validation ==========

        l, a = validation(model, valid_dataloader)
        list_val_loss.append(l)
        list_val_acc.append(a * 100)
        print(
            e,
            "\n\t - Train loss: {:.4f}".format(list_train_loss[-1]),
            "Train acc: {:.4f}".format(list_train_acc[-1]),
            "Val loss: {:.4f}".format(l),
            "Val acc:{:.4f}".format(a * 100),
        )
    return list_train_loss, list_train_acc, list_val_loss, list_val_acc
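
Note that `training` averages the per-batch accuracies, whereas `validation` divides the number of correct predictions by the number of examples. The two agree when every batch has the same size (which is the case here with 1600 training examples and a batch size of 16), but they can differ when the last batch is smaller. A small sketch with made-up correctness flags (1 = correct prediction) and hypothetical helper names:

```python
def mean_of_batch_means(batches):
    """Average of per-batch accuracies (what the training loop tracks)."""
    return sum(sum(b) / len(b) for b in batches) / len(batches)


def sample_weighted_accuracy(batches):
    """Correct predictions over total examples (what validation tracks)."""
    return sum(sum(b) for b in batches) / sum(len(b) for b in batches)


# Two full batches of 4 and one leftover batch with a single example.
uneven = [[1, 1, 1, 1], [1, 1, 1, 1], [0]]
print(mean_of_batch_means(uneven))       # 2/3: the small batch weighs too much
print(sample_weighted_accuracy(uneven))  # 8/9

# With equal batch sizes, both definitions coincide.
even = [[1, 0, 1, 1], [0, 0, 1, 1]]
assert mean_of_batch_means(even) == sample_weighted_accuracy(even)
```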

In [None]:
bert = DistilBertModel(model_config)
model = DistilBertClassifier(bert)
model.to(DEVICE)

In [None]:
batch_size = 16

train_dataloader = DataLoader(
    train_set, batch_size=batch_size, collate_fn=data_collator
)
valid_dataloader = DataLoader(
    valid_set, batch_size=batch_size, collate_fn=data_collator
)
n_valid = len(valid_set)
n_train = len(train_set)

In [None]:
list_train_loss, list_train_acc, list_val_loss, list_val_acc = training(
    model, 3, train_dataloader, valid_dataloader
)

In [None]:
import matplotlib.pyplot as plt

fig, axs = plt.subplots(1, 2, figsize=(10, 5))
axs[1].plot(list_train_acc, label="Train accuracy")
axs[1].plot(list_val_acc, label="Validation accuracy")
axs[0].plot(list_train_loss, label="Train loss")
axs[0].plot(list_val_loss, label="Validation loss")
axs[0].set_title("Loss")
axs[1].set_title("Accuracy")
axs[0].legend()
axs[1].legend()
plt.show()

🚧 **Question** 🚧

How does it compare with your convolutional model from the previous lab?


## Pre-trained model

Now we are going to compare with a pre-trained model.

First, we are going to load the model's weights from the HuggingFace hub.

In [None]:
bert = DistilBertModel.from_pretrained("distilbert-base-uncased")
model = DistilBertClassifier(bert)
model.to(DEVICE)

## Fine-Tuning

With our model loaded and ready, we need to choose the training hyperparameters.

For the purposes of fine-tuning, the authors recommend choosing from the following values (from Appendix A.3 of the [BERT paper](https://arxiv.org/pdf/1810.04805.pdf)):

- **Batch size:** 16, 32  
- **Learning rate (Adam):** 5e-5, 3e-5, 2e-5  
- **Number of epochs:** 2, 3, 4

We chose:
* Batch size: 16 (set when creating our DataLoaders)
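* Learning rate: 5e-6 (the value passed to `training` in the cell below; note that it is lower than the recommended values above)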
* Epochs: 3 (we'll see that this is probably too many...)
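
To get a feel for these numbers, the sketch below enumerates the recommended grid and counts the optimizer steps for our setting. It assumes the 1600 training examples obtained earlier (2000 samples with a 20% validation split), and `n_optimizer_steps` is a hypothetical helper.

```python
import math
from itertools import product

# Recommended values from the list above.
batch_sizes = [16, 32]
learning_rates = [5e-5, 3e-5, 2e-5]
epoch_counts = [2, 3, 4]

# Every combination of the recommended hyperparameters.
grid = list(product(batch_sizes, learning_rates, epoch_counts))
print(len(grid), "configurations")  # 18 configurations


def n_optimizer_steps(n_train, batch_size, n_epochs):
    """One optimizer step per batch, including a smaller last batch."""
    return math.ceil(n_train / batch_size) * n_epochs


print(n_optimizer_steps(1600, 16, 3))  # 300
```

With only a few hundred updates in total, each hyperparameter choice has a visible effect, which is why the recommended ranges are so narrow.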

The epsilon parameter `eps = 1e-8` is "a very small number to prevent any division by zero in the implementation" (from [here](https://machinelearningmastery.com/adam-optimization-algorithm-for-deep-learning/)).
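
The role of `eps` is easiest to see in the update rule itself. Below is a minimal sketch of one Adam step on a single scalar parameter, assuming the usual defaults `beta1=0.9` and `beta2=0.999`. `adam_step` is a hypothetical helper, and AdamW additionally applies a decoupled weight decay that is omitted here.

```python
import math


def adam_step(param, grad, m, v, t, lr=5e-5, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter at step t (starting at 1)."""
    m = beta1 * m + (1 - beta1) * grad         # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad  # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)               # bias correction
    v_hat = v / (1 - beta2 ** t)
    # eps keeps the denominator strictly positive, even when v_hat is 0.
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


# With a zero gradient, v_hat is 0: without eps this would compute 0 / 0.
p_zero, _, _ = adam_step(1.0, 0.0, m=0.0, v=0.0, t=1)
print(p_zero)  # 1.0

# On the first step, the update size is close to lr whatever the gradient scale.
p_small, _, _ = adam_step(1.0, 0.002, m=0.0, v=0.0, t=1)
p_large, _, _ = adam_step(1.0, 2000.0, m=0.0, v=0.0, t=1)
print(1.0 - p_small, 1.0 - p_large)
```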

You can find the creation of the AdamW optimizer in `run_glue.py` [here](https://github.com/huggingface/transformers/blob/5bfcd0485ece086ebcbed2d008813037968a9e58/examples/run_glue.py#L109).

🚧 **TODO** 🚧

Build the classifier and train it with the pre-trained checkpoint.

In [None]:
list_train_loss, list_train_acc, list_val_loss, list_val_acc = training(
    model, 3, train_dataloader, valid_dataloader, lr=5e-6
)

In [None]:
fig, axs = plt.subplots(1, 2, figsize=(10, 5))
axs[1].plot(list_train_acc, label="Train accuracy")
axs[1].plot(list_val_acc, label="Validation accuracy")
axs[0].plot(list_train_loss, label="Train loss")
axs[0].plot(list_val_loss, label="Validation loss")
axs[0].set_title("Loss")
axs[1].set_title("Accuracy")
axs[0].legend()
axs[1].legend()
plt.show()

🚧 **Question** 🚧

What do you think of the results?

**Answer**

TODO