# Code Tutorial: Fine-Tuning BERT & XLM-RoBERTa Using datasets

### Big Picture

Fine-tuning a pre-trained Transformer (like BERT or XLM-RoBERTa) allows us to leverage massive language understanding already learned by these models and adapt them to a specific task—in this case, sentiment analysis on movie reviews. Rather than training a huge neural network from scratch (which requires enormous data and compute), we start from a strong foundation and teach it to distinguish positive from negative reviews.

**Why follow this process?**

1. **Reproducibility & Control**: Fixing seeds makes runs repeatable, and explicitly configuring the device ensures we use a GPU when one is available and fall back to the CPU otherwise.
2. **Data Handling**: Loading and pre-processing (tokenizing) the raw text into the exact format the model expects is crucial—it converts words into numerical IDs, pads or truncates to a uniform length, and batches them efficiently.
3. **Model & Optimizer Setup**: We load a model already trained on vast text corpora, add a simple classifier head, and choose an optimizer (AdamW) with sensible defaults for fine-tuning.
4. **Training & Evaluation Loop**: By iterating over the data, computing loss and gradients, and stepping the optimizer, we teach the model to improve on our task. Regular evaluation tells us how well it’s truly learning and helps detect overfitting.

This structured workflow—from environment setup through evaluation—forms the backbone of most modern NLP fine-tuning pipelines.



## 1. Environment Setup

Before training, we need to ensure our environment is consistent and that results are reproducible across runs.


In [5]:
%pip install --quiet datasets transformers

In [6]:
import random
import numpy as np
import torch

# 1. Fix random seeds for reproducibility
SEED = 555
random.seed(SEED)                # Python’s built-in random module
np.random.seed(SEED)             # NumPy’s random number generator
torch.manual_seed(SEED)          # PyTorch CPU random operations
torch.cuda.manual_seed_all(SEED) # PyTorch GPU random operations

# 2. Device configuration
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

Using device: cuda


* `random.seed(SEED)`: Ensures calls like `random.random()` always follow the same sequence.
* `np.random.seed(SEED)`: Makes NumPy’s random functions (e.g., shuffling, sampling) reproducible.
* `torch.manual_seed(SEED)`: Fixes randomness for PyTorch operations on the CPU (weight init, dropout).
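* `torch.cuda.manual_seed_all(SEED)`: Applies the same seed to the random number generators of all available GPUs. This makes GPU-side random draws repeatable, although some CUDA kernels are nondeterministic regardless of the seed.

Fixing these seeds makes reruns on the same hardware and library versions produce the same or very similar results, which simplifies debugging and makes comparisons between runs fair. For stricter determinism, PyTorch offers `torch.use_deterministic_algorithms(True)`, usually at some cost in speed.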
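
As a small illustration of what seeding buys you, the sketch below uses only Python's `random` module. `draw_samples` is a hypothetical helper written for this example, not part of the tutorial's pipeline; the same idea carries over to the NumPy and PyTorch generators seeded above.

```python
import random

def draw_samples(seed, n=5):
    """Return n pseudo-random floats from a generator seeded with `seed`."""
    rng = random.Random(seed)  # independent generator; global state is untouched
    return [rng.random() for _ in range(n)]

run_a = draw_samples(555)
run_b = draw_samples(555)
run_c = draw_samples(556)

print(run_a == run_b)  # same seed, same sequence
print(run_a == run_c)  # different seed, different sequence
```

Using a dedicated `random.Random` instance instead of the global generator is a design choice for the example: it keeps the demonstration from disturbing any seeding done elsewhere in the notebook.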


## 2. Loading a Dataset with datasets

Next, we need real data to train on. Hugging Face’s `datasets` library makes it simple to download and prepare popular benchmarks with one function call.


In [4]:
!pip install -U datasets

In [1]:
from datasets import load_dataset

# Load IMDB
# This downloads the dataset and returns a DatasetDict with three keys:
# 'train', 'test', and 'unsupervised' (unlabeled reviews, not used in this tutorial).
raw_ds = load_dataset("imdb")
print(raw_ds)
# => DatasetDict({
#      train:        Dataset(25000 examples),  # 25k labeled training reviews
#      test:         Dataset(25000 examples),  # 25k labeled test reviews
#      unsupervised: Dataset(50000 examples)   # 50k unlabeled reviews
#    })

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})


* `load_dataset("imdb")` looks up the IMDB movie-review dataset on the Hugging Face Hub, downloads it, and caches it locally as Arrow files that are memory-mapped rather than read fully into RAM.
* The result, `raw_ds`, is a **DatasetDict**: a simple mapping of split names (`'train'`, `'test'`, `'unsupervised'`) to **Dataset** objects.
* Each **Dataset** behaves like a list of examples; here, each example is a dictionary with keys `'text'` (the review) and `'label'` (0 for negative, 1 for positive sentiment).
* Printing `raw_ds` shows the number of examples in each split: 25,000 for training, 25,000 for testing, and 50,000 unlabeled reviews in `unsupervised`, which this tutorial does not use.

With the raw text and labels loaded, we can move on to tokenizing these examples into the numerical format our model needs.
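
Before tokenizing, it can help to sanity-check the label balance and typical review length. The sketch below does this on a toy stand-in: `toy_split` and `summarize` are hypothetical names, and the four reviews are made up. Because iterating a `Dataset` yields one dictionary per example, the same function should also work on a real split such as `raw_ds["train"]`.

```python
from collections import Counter

# Toy stand-in for a Dataset split: a list of {"text", "label"} dictionaries.
toy_split = [
    {"text": "A moving, beautifully acted film.", "label": 1},
    {"text": "Two hours I will never get back.", "label": 0},
    {"text": "Sharp script and a great cast.", "label": 1},
    {"text": "Dull, predictable, and far too long.", "label": 0},
]

def summarize(split):
    """Return (label counts, average review length in whitespace-separated words)."""
    label_counts = Counter(example["label"] for example in split)
    avg_words = sum(len(example["text"].split()) for example in split) / len(split)
    return label_counts, avg_words

label_counts, avg_words = summarize(toy_split)
print(dict(label_counts), avg_words)
```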

## 3. Tokenization via map

Transformers expect numerical token IDs and attention masks rather than raw text. We use the Hugging Face tokenizer to convert each review into a fixed-length sequence of integers, handling wordpiece splitting, padding, and truncation automatically.

In [2]:
from transformers import BertTokenizer

# 1. Initialize the tokenizer for our chosen model
MODEL_NAME = "bert-base-uncased"
tokenizer = BertTokenizer.from_pretrained(MODEL_NAME)
MAX_LEN = 128  # maximum sequence length for each example

# 2. Define a batched tokenization function
#    - `batch["text"]` is a list of strings
#    - We pad to `max_length`, truncate longer reviews, and return attention masks

def tokenize_batch(batch):
    return tokenizer(
        batch["text"],
        padding="max_length",    # pad all sequences to MAX_LEN
        truncation=True,          # cut off sequences longer than MAX_LEN
        max_length=MAX_LEN        # enforce this maximum length
    )

# 3. Apply tokenization to the entire DatasetDict
#    - `batched=True` processes multiple examples at once for speed
encoded_ds = raw_ds.map(tokenize_batch, batched=True)

# 4. Convert to PyTorch tensors for use with DataLoader
#    - Only keep the columns our model needs
encoded_ds.set_format(
    type="torch",
    columns=["input_ids", "attention_mask", "label"]
)

**Key points:**

* **`input_ids`**: integer token indices representing subwords or words.
* **`attention_mask`**: tells the model which tokens are real (1) vs. padding (0).
* **`batched=True`**: passes examples to `tokenize_batch` in batches (lists of reviews) instead of one at a time, which reduces per-call overhead and speeds up tokenization.
* **`set_format`**: instructs the `datasets` library to return PyTorch tensors for the specified columns when indexing the dataset.

After this step, `encoded_ds` has the same splits as `raw_ds` (`train`, `test`, and the unused `unsupervised`), but each example now yields `input_ids`, `attention_mask`, and `label` in tensor form, ready for batching.
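
To make the relationship between padding, truncation, and the attention mask concrete, here is a minimal pure-Python sketch. `pad_or_truncate` is a hypothetical helper and the token IDs are illustrative. It also simplifies what the real tokenizer does: when truncating, the tokenizer keeps the special end-of-sequence token, which this sketch ignores.

```python
def pad_or_truncate(token_ids, max_len, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_len long."""
    kept = token_ids[:max_len]                      # truncation
    n_pad = max_len - len(kept)
    input_ids = kept + [pad_id] * n_pad             # right-padding
    attention_mask = [1] * len(kept) + [0] * n_pad  # 1 = real token, 0 = padding
    return input_ids, attention_mask

# A short sequence gets padded; a long one gets cut off.
short_ids, short_mask = pad_or_truncate([101, 2023, 3185, 102], max_len=6)
long_ids, long_mask = pad_or_truncate(list(range(1, 11)), max_len=6)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

With `MAX_LEN = 128`, any review longer than 128 tokens loses its tail, so the model only ever sees the beginning of long reviews.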


## 4. Creating DataLoaders

To feed data into our model efficiently, we use PyTorch’s `DataLoader`, which handles batching, shuffling, and parallel data loading:

In [3]:
from torch.utils.data import DataLoader

# Define how many examples per batch
BATCH_SIZE = 16

# Create a DataLoader for training data
train_loader = DataLoader(
    encoded_ds["train"],  # our tokenized training dataset
    batch_size=BATCH_SIZE, # number of samples per batch
    shuffle=True           # shuffle at each epoch for better generalization
)

# Create a DataLoader for test data
test_loader = DataLoader(
    encoded_ds["test"],   # our tokenized test dataset
    batch_size=BATCH_SIZE, # same batch size for evaluation
    shuffle=False          # no need to shuffle test data
)

**Why use DataLoaders?**

* **Batching**: Processes multiple examples at once, improving GPU utilization and training speed.
* **Shuffling**: Randomizes the order of training samples each epoch, which helps prevent the model from overfitting to data order.
* **Iterable Interface**: You can loop over `train_loader` or `test_loader` in your training loop; each iteration yields a batch dictionary with `input_ids`, `attention_mask`, and `label` tensors.

With these loaders, your training and evaluation loops become clean and efficient, abstracting away low‑level data handling.
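
Conceptually, the batching and shuffling described above amount to something like the following sketch. `iter_batches` is a hypothetical stand-in written for illustration; it is not how `DataLoader` is implemented and it omits collation into tensors and parallel workers.

```python
import math
import random

def iter_batches(examples, batch_size, shuffle=False, seed=None):
    """Yield lists of examples, optionally visiting them in a shuffled order."""
    indices = list(range(len(examples)))
    if shuffle:
        random.Random(seed).shuffle(indices)  # new order, same set of examples
    for start in range(0, len(indices), batch_size):
        yield [examples[i] for i in indices[start:start + batch_size]]

examples = list(range(10))
batches = list(iter_batches(examples, batch_size=4, shuffle=True, seed=555))

print(len(batches) == math.ceil(len(examples) / 4))
print([len(b) for b in batches])  # the last batch is smaller because 10 % 4 != 0
```

The real `DataLoader` behaves the same way with respect to the final, smaller batch unless you pass `drop_last=True`.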


## 5. Model & Optimizer Setup

In this step, we load our pre-trained Transformer as a sequence-classification model and configure an optimizer suitable for fine-tuning.


In [9]:
from transformers import BertForSequenceClassification
from torch.optim import AdamW

# Number of labels for our classification task (2: negative, positive)
NUM_LABELS = 2

# 1. Load a BERT model pre-trained on general text,
#    adding a classification head on top with `num_labels` outputs.
model = BertForSequenceClassification.from_pretrained(
    MODEL_NAME,        # e.g., "bert-base-uncased"
    num_labels=NUM_LABELS
).to(device)

# 2. Set up the AdamW optimizer:
#    - A variant of Adam with decoupled weight decay,
#      which helps regularize large models.
#    - We use a small learning rate because the model is already trained;
#      we only need subtle adjustments (2e-5 is common).
optimizer = AdamW(
    model.parameters(),
    lr=2e-5,          # fine-tuning learning rate
    weight_decay=1e-2 # decoupled weight decay strength
)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



**Why these choices?**

* **`BertForSequenceClassification`**: includes the full BERT encoder plus a linear layer on top for classification.
* **`from_pretrained`**: downloads the model weights pre-trained on large corpora (BookCorpus and English Wikipedia), giving us strong initial language understanding.
* **`AdamW`**: the go-to optimizer for Transformers; it applies weight decay directly to the weights instead of folding it into the gradient, which tends to improve generalization.
* **Learning rate (`lr=2e-5`)**: significantly lower than what is typical when training from scratch; prevents large weight updates that could disrupt the pre-trained representations.
* **Weight decay (`1e-2`)**: shrinks the weights slightly at every step, acting as regularization to reduce overfitting on our comparatively small dataset.
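
What "decoupled" weight decay means is easiest to see in a single update step. The sketch below applies one AdamW update to a single scalar parameter in pure Python. `adamw_step` is a hypothetical helper following the commonly described AdamW update rule; the `beta1`, `beta2`, and `eps` values are assumed typical defaults, not settings taken from this tutorial.

```python
import math

def adamw_step(p, grad, m, v, t, lr=2e-5, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=1e-2):
    """One AdamW update for a scalar parameter p at step t (t starts at 1)."""
    p = p * (1 - lr * weight_decay)          # decoupled decay: shrink p directly
    m = beta1 * m + (1 - beta1) * grad       # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2  # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)             # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
    return p, m, v

# With a zero gradient, only the decay term changes the parameter.
decayed_only, _, _ = adamw_step(p=1.0, grad=0.0, m=0.0, v=0.0, t=1)
# With a nonzero gradient, the decay and the Adam step are applied separately.
with_gradient, _, _ = adamw_step(p=1.0, grad=1.0, m=0.0, v=0.0, t=1)

print(decayed_only, with_gradient)
```

Because the decay multiplies the parameter directly, it is not rescaled by the adaptive denominator, which is the key difference from adding an L2 penalty to the loss.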


## 6. Training & Evaluation Functions

We define two core routines for a full training cycle:

In [10]:
import torch.nn.functional as F
from sklearn.metrics import accuracy_score, roc_auc_score

# 1. One epoch of training
def train_epoch(model, loader, optimizer):
    model.train()  # set model to training mode (enables dropout, etc.)
    total_loss = 0
    preds, trues = [], []  # to accumulate predicted labels and true labels

    for batch in loader:
        # Move inputs to the configured device
        input_ids      = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        labels         = batch["label"].to(device)

        optimizer.zero_grad()  # clear previous gradients
        # Forward pass: compute loss and logits
        outputs = model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            labels=labels
        )
        loss, logits = outputs.loss, outputs.logits

        loss.backward()  # backpropagate gradients
        optimizer.step()  # update model parameters

        total_loss += loss.item()  # accumulate training loss
        # Convert logits to predicted class indices
        preds.extend(torch.argmax(logits, dim=1).cpu().numpy())
        trues.extend(labels.cpu().numpy())

    avg_loss = total_loss / len(loader)  # average loss per batch
    acc = accuracy_score(trues, preds)
    return avg_loss, acc

# 2. One epoch of evaluation
def eval_epoch(model, loader):
    model.eval()  # set model to evaluation mode (disables dropout)
    total_loss = 0
    preds, trues, probs = [], [], []  # track labels and probabilities

    with torch.no_grad():  # no gradient computation during evaluation
        for batch in loader:
            input_ids      = batch["input_ids"].to(device)
            attention_mask = batch["attention_mask"].to(device)
            labels         = batch["label"].to(device)

            outputs = model(
                input_ids=input_ids,
                attention_mask=attention_mask,
                labels=labels
            )
            loss, logits = outputs.loss, outputs.logits

            total_loss += loss.item()
            # Softmax to get probabilities, take probability of positive class
            prob = F.softmax(logits, dim=1)[:, 1]
            probs.extend(prob.cpu().numpy())
            preds.extend(torch.argmax(logits, dim=1).cpu().numpy())
            trues.extend(labels.cpu().numpy())

    avg_loss = total_loss / len(loader)
    acc = accuracy_score(trues, preds)
    auc = roc_auc_score(trues, probs)  # measure of ranking quality
    return avg_loss, acc, auc

**What’s happening here?**

* **`model.train()` vs. `model.eval()`**: toggles layers such as dropout between training and evaluation behavior (the same switch controls batch norm in models that use it).
* **Gradient flow**: during training, we zero gradients, do a forward pass to compute loss, backpropagate (`loss.backward()`), and update weights (`optimizer.step()`). During evaluation, the loop runs under `torch.no_grad()`, so no gradients are computed or stored, which saves memory and time.
* **Metrics**:

  * **Loss**: average cross-entropy loss over batches, indicating how well predictions match labels.
  * **Accuracy**: fraction of correctly predicted labels.
  * **AUC (Area Under ROC Curve)**: captures the model’s ability to rank positive examples higher than negative ones, robust to class imbalance.

By separating training and evaluation logic, we ensure correct behavior (e.g., no dropout at test time) and gather meaningful metrics to track progress and detect overfitting.
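
To show how the metrics relate to the raw logits, the following sketch recomputes them in pure Python on a made-up batch. The helper functions are hypothetical stand-ins for `F.softmax`, `accuracy_score`, and `roc_auc_score`; the AUC version uses the pairwise-ranking definition and is only practical for small inputs.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def accuracy(trues, preds):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(trues, preds)) / len(trues)

def roc_auc(trues, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for t, s in zip(trues, scores) if t == 1]
    neg = [s for t, s in zip(trues, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up logits for four reviews: one row per review, columns = [negative, positive].
batch_logits = [[2.0, -1.0], [0.3, 0.9], [-0.5, 1.5], [1.2, 0.4]]
trues = [0, 0, 1, 1]

probs = [softmax(row)[1] for row in batch_logits]      # P(positive)
preds = [row.index(max(row)) for row in batch_logits]  # argmax over classes

print(accuracy(trues, preds), roc_auc(trues, probs))
```

Accuracy and AUC differ here because accuracy depends on which side of the decision boundary each example falls, while AUC only looks at how positives are ordered relative to negatives.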


## 7. Run Training Loop

We now tie everything together in a simple loop over epochs:

In [13]:
EPOCHS = 1
for epoch in range(1, EPOCHS + 1):
    # 1. Train for one full pass through the training data
    train_loss, train_acc = train_epoch(model, train_loader, optimizer)
    # 2. Evaluate on test data
    val_loss, val_acc, val_auc = eval_epoch(model, test_loader)

    # 3. Log the results for monitoring
    print(f"Epoch {epoch} → "
          f"Train loss: {train_loss:.3f}, acc: {train_acc:.3f} | "
          f"Test  loss: {val_loss:.3f}, acc: {val_acc:.3f}, AUC: {val_auc:.3f}")

Epoch 1 → Train loss: 0.311, acc: 0.866 | Test  loss: 0.263, acc: 0.886, AUC: 0.959


* **Epochs**: Each iteration (`epoch`) represents one complete pass through the training set.
* **`train_epoch`**: returns the average loss (how well the model fits the training data) and accuracy on that pass. Both are accumulated while the weights are still changing and dropout is active, so they tend to understate the end-of-epoch model.
* **`eval_epoch`**: returns the loss, accuracy, and AUC on unseen data, indicating generalization. Here the test split plays that role, which is why the variables are named `val_*`.
* **Logging**: Printing after each epoch helps you track training dynamics: watch for when held-out accuracy plateaus or starts to decrease (a sign of overfitting).

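By examining these metrics epoch by epoch, you can decide when to stop training, adjust the learning rate, or add early stopping. If you use them to make such decisions, carve a validation split out of `train` and keep `test` for a single final measurement, so the reported score stays unbiased.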
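
One possible form of early stopping is sketched below. `EarlyStopper`, its `patience` and `min_delta` parameters, and the loss values are all hypothetical and chosen for illustration. In a real run, the monitored loss would ideally come from a held-out validation split rather than the test split.

```python
class EarlyStopper:
    """Signal a stop once the loss fails to improve `patience` epochs in a row."""

    def __init__(self, patience=2, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's loss; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss   # new best: reset the counter
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1   # no improvement this epoch
        return self.bad_epochs >= self.patience

stopper = EarlyStopper(patience=2)
losses = [0.50, 0.40, 0.41, 0.42, 0.39]  # made-up per-epoch validation losses
stopped_at = None
for epoch, loss in enumerate(losses, start=1):
    if stopper.step(loss):
        stopped_at = epoch
        break

print(stopped_at, stopper.best)
```

In the training loop above, the call to `stopper.step(val_loss)` would go right after `eval_epoch`, with a `break` when it returns `True`. Note that with a patience of 2 the loop never reaches the fifth epoch, which would have improved on the best loss; choosing the patience is a trade-off between wasted compute and stopping too soon.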