# CSC 8614 - Language Models
## CI2 - Fine-tuning a language model for text classification

In this TP, you will fine-tune a language model to move from text generation to text classification, specifically spam detection.

The exercise (and code) has been adapted from the book _Build a Large Language Model (From Scratch)_, by Sebastian Raschka, and its [official GitHub repository](https://github.com/rasbt/LLMs-from-scratch).

This TP will be done in this notebook, and requires some additional files (available from the course website). You will have to fill in the missing portions of code, and perform some additional experiments by testing different parameters.

Working on this TP:
- The easiest way is probably to work directly on the notebook, using Jupyter Notebook or Visual Studio Code. An alternative is to use Google Colab.
- You should be able to run everything on your machine, but you can connect to the GPUs if needed.

Some files are required, and are available on the course website:
- `requirements.txt`
- `gpt_utils.py`


## About the report
You will have to return this notebook (completed), as well as a mini-report (`TP2/rapport.md`).

The notebook and report shall be submitted via a GitHub repository, similarly to what you did for the first session (remember to use a different folder: `TP2`).
For the notebook, it is sufficient to complete the code and submit the final version.

For the mini-report, you have to answer the questions asked in this notebook, and discuss some of your findings as requested.
As for the first session:
- "You must include: short answers, observed results (copied outputs), the requested screenshots, and a brief interpretation."
- "Do not paste entire pages: be concise and select the relevant elements."

Reproducibility:
- Fix a random seed and state it in the report.
- Indicate in the report the Python version, the OS, and the library versions.
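
One possible way to collect these details in a single place is sketched below. The helper name `environment_summary` is hypothetical (it is not part of `gpt_utils.py`), and the package list is only an example; it relies on the standard-library modules `platform`, `sys` and `importlib.metadata`.

```python
import platform
import sys
from importlib import metadata


def environment_summary(packages):
    """Return the Python version, the OS and package versions as strings."""
    summary = {
        "python": sys.version.split()[0],
        "os": f"{platform.system()} {platform.release()}",
    }
    for name in packages:
        try:
            summary[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Report missing packages instead of failing.
            summary[name] = "not installed"
    return summary


# Example package list; adapt it to what requirements.txt installs.
for key, value in environment_summary(["torch", "tiktoken", "pandas"]).items():
    print(f"{key}: {value}")
```

The printed lines can be pasted directly into the report header, next to the seed value.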

In [2]:
import random
import numpy as np
import torch
import gpt_utils

SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(SEED)


**Question 1**: In `TP2/rapport.md`, immediately add a short header (a few lines) containing: (i) your first and last name, (ii) the command you used to install/activate the environment, (iii) the versions (Python + main libraries).

Then, as you progress through the TP, add sections/headings as you see fit, as long as your answers and your evidence of execution can be found easily.

In [5]:
# [Instructor code: install requirements]
!pip install -r requirements.txt

Collecting torch==2.9.1 (from -r requirements.txt (line 1))
  Downloading torch-2.9.1-cp312-cp312-manylinux_2_28_x86_64.whl.metadata (30 kB)
Collecting pandas==2.3.3 (from -r requirements.txt (line 4))
  Downloading pandas-2.3.3-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.metadata (91 kB)
Collecting matplotlib==3.10.8 (from -r requirements.txt (line 5))
  Downloading matplotlib-3.10.8-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (52 kB)
Collecting tensorflow==2.20.0 (from -r requirements.txt (line 6))
  Downloading tensorflow-2.20.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.5 kB)
Collecting jupyterlab==4.5.1 (from -r requirements.txt (line 7))
  Downloading jupyterlab-4.5

In [4]:

# Print the versions (e.g. in Colab)
import sys, torch, tensorflow as tf, tiktoken, pandas, matplotlib
print("Python:", sys.version.split()[0])
print("torch:", torch.__version__)
print("tensorflow:", tf.__version__)
print("tiktoken:", tiktoken.__version__)
print("pandas:", pandas.__version__)
print("matplotlib:", matplotlib.__version__)


Python: 3.12.12
torch: 2.9.1+cu128
tensorflow: 2.20.0
tiktoken: 0.12.0
pandas: 2.3.3
matplotlib: 3.10.8


## Preparing the model

In [5]:
# --- [INSTRUCTOR CODE: load the model weights into memory] ---
import torch
import tiktoken
from gpt_utils import GPTModel, download_and_load_gpt2, load_weights_into_gpt

# Download the model weights (124M param version) / This function (which we put in gpt_utils) handles the downloading
settings, params = download_and_load_gpt2(model_size="124M", models_dir="gpt2_weights")
print("Weights downloaded and loaded into memory.")

checkpoint: 100%|██████████| 77.0/77.0 [00:00<00:00, 111kiB/s]
encoder.json: 100%|██████████| 1.04M/1.04M [00:00<00:00, 2.77MiB/s]
hparams.json: 100%|██████████| 90.0/90.0 [00:00<00:00, 120kiB/s]
model.ckpt.data-00000-of-00001: 100%|██████████| 498M/498M [00:32<00:00, 15.3MiB/s]
model.ckpt.index: 100%|██████████| 5.21k/5.21k [00:00<00:00, 5.91MiB/s]
model.ckpt.meta: 100%|██████████| 471k/471k [00:00<00:00, 2.08MiB/s]
vocab.bpe: 100%|██████████| 456k/456k [00:00<00:00, 1.74MiB/s]


Weights downloaded and loaded into memory.


The `settings` and `params` returned by `download_and_load_gpt2` come from the GPT-2 checkpoint made publicly available by OpenAI.

**Question 2**: What type is the object `settings`, and what is its structure (e.g. if it is a list, its length; if a dictionary, its keys, etc.)?

**Question 3**: What type is the object `params`, and what is its structure?

In [6]:
# Inspect `settings`

print("Type(settings):", type(settings))

if isinstance(settings, dict):
    print("Number of keys:", len(settings))
    print("Keys:", list(settings.keys()))
    # Show a few useful fields
    for k in ["n_vocab", "n_ctx", "n_embd", "n_head", "n_layer"]:
        if k in settings:
            print(f"{k}:", settings[k])
else:
    print("Structure of settings:", settings)


Type(settings): <class 'dict'>
Number of keys: 5
Keys: ['n_vocab', 'n_ctx', 'n_embd', 'n_head', 'n_layer']
n_vocab: 50257
n_ctx: 1024
n_embd: 768
n_head: 12
n_layer: 12


In [7]:
# Inspect `params`
print("Type(params):", type(params))

if isinstance(params, dict):
    print("Top-level keys:", list(params.keys()))
    for k in params.keys():
        v = params[k]
        # Show the type, and the shape if the value is a tensor/array
        shape = getattr(v, "shape", None)
        print(f"- {k}: type={type(v)}, shape={shape}")

    # Look inside the blocks, if present
    if "blocks" in params:
        print("\nNumber of blocks:", len(params["blocks"]))
        print("Keys of blocks[0]:", list(params["blocks"][0].keys()))
else:
    print("Structure of params:", params)



Type(params): <class 'dict'>
Top-level keys: ['blocks', 'b', 'g', 'wpe', 'wte']
- blocks: type=<class 'list'>, shape=None
- b: type=<class 'numpy.ndarray'>, shape=(768,)
- g: type=<class 'numpy.ndarray'>, shape=(768,)
- wpe: type=<class 'numpy.ndarray'>, shape=(1024, 768)
- wte: type=<class 'numpy.ndarray'>, shape=(50257, 768)

Number of blocks: 12
Keys of blocks[0]: ['attn', 'ln_1', 'ln_2', 'mlp']


Look at the `GPTModel` in the file `gpt_utils.py`. In the `__init__` method, we have to pass a config (parameter `cfg`).

**Question 4:**
Analyse the `__init__` method, and check what structure the `cfg` parameter requires. Is the `settings` variable we obtained in the right format? If not, perform the mapping to convert the variable `settings` into a variable `model_config` with the right structure.

In [8]:
# Configure the model, mapping OpenAI specific keys to our model's keys (if needed)
model_config = {
    "vocab_size": settings["n_vocab"],
    "context_length": settings["n_ctx"],
    "emb_dim": settings["n_embd"],
    "n_heads": settings["n_head"],
    "n_layers": settings["n_layer"],
    # keep these two lines, as requested
    "drop_rate": 0.1,
    "qkv_bias": True,
}


In [9]:
model = GPTModel(model_config)

# Load the pre-trained weights
load_weights_into_gpt(model, params)
model.eval()

print("GPT-2 Model Loaded and Configured successfully!")

GPT-2 Model Loaded and Configured successfully!


## Preparing the data

Context from the lecture: The raw data is just text messages.

The model needs numbers (token IDs). We also need to pad the messages so they are all the same length in a batch.

We will use a `SpamDataset` class (provided below) to tokenize the text.
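
Before reading the class itself, the truncate-then-pad step it applies to each message can be sketched in isolation. The function name `pad_or_truncate` is hypothetical and the token IDs are made up; only the default `pad_token_id=50256` is taken from the class defined in the next cell.

```python
def pad_or_truncate(token_ids, max_length, pad_token_id=50256):
    """Cut a token list to max_length, then right-pad it with pad_token_id."""
    clipped = list(token_ids[:max_length])
    return clipped + [pad_token_id] * (max_length - len(clipped))


short_message = pad_or_truncate([10, 11, 12], max_length=5)
long_message = pad_or_truncate([1, 2, 3, 4, 5, 6, 7], max_length=5)

print(short_message)  # [10, 11, 12, 50256, 50256]
print(long_message)   # [1, 2, 3, 4, 5]
```

Because every sequence ends up with the same length, a batch of messages can be stacked into a single tensor of shape `[batch_size, max_length]`.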

In [10]:
# --- [INSTRUCTOR CODE: Run this cell to define the Dataset Class] ---
from torch.utils.data import Dataset
import pandas as pd
import urllib.request
import zipfile
import os

class SpamDataset(Dataset):
    def __init__(self, csv_file, tokenizer, max_length=120, pad_token_id=50256):
        self.data = pd.read_csv(csv_file)
        self.tokenizer = tokenizer
        self.max_length = max_length
        self.pad_token_id = pad_token_id
        # Encode labels: "spam" -> 1, "ham" -> 0
        self.data["label_encoded"] = self.data["Label"].map({"spam": 1, "ham": 0})

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        text = self.data.iloc[idx]["Text"]
        label = self.data.iloc[idx]["label_encoded"]
        # Tokenize
        encoded = self.tokenizer.encode(text, allowed_special={'<|endoftext|>'})
        # Truncate if too long
        encoded = encoded[:self.max_length]
        # Pad if too short
        pad_len = self.max_length - len(encoded)
        encoded += [self.pad_token_id] * pad_len
        return torch.tensor(encoded, dtype=torch.long), torch.tensor(label, dtype=torch.long)

# Download the dataset zip file
url = "https://archive.ics.uci.edu/static/public/228/sms+spam+collection.zip"
zip_path = "sms_spam_collection.zip"
extract_path = "sms_spam_collection"
data_file_path = os.path.join(extract_path, "SMSSpamCollection")
if not os.path.exists(zip_path):
    print("Downloading dataset...")
    urllib.request.urlretrieve(url, zip_path)
    print("Download complete.")
# Unzip
if not os.path.exists(extract_path):
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        zip_ref.extractall(extract_path)
# Read the TSV file
df = pd.read_csv(
    data_file_path,
    sep="\t",
    header=None,
    names=["Label", "Text"]
)
print(f"Total samples loaded: {len(df)}")

# Create the train/test split (80% train / 20% test)
df = df.sample(frac=1, random_state=123).reset_index(drop=True)
# Split index
split_idx = int(0.8 * len(df))

# TODO: if needed (for performance reasons), you can come back here and reduce the size of the training set.
train_df = df.iloc[:split_idx]  # [:2000]  # Re-add this slice to only consider 2000 training samples
test_df = df.iloc[split_idx:]

# Save as CSVs, so the SpamDataset class can read them.
train_df.to_csv("train.csv", index=False)
test_df.to_csv("test.csv", index=False)
print("Created 'train.csv' and 'test.csv' successfully!")
print(f"Train size: {len(train_df)}")
print(f"Test size: {len(test_df)}")

Downloading dataset...
Download complete.
Total samples loaded: 5572
Created 'train.csv' and 'test.csv' successfully!
Train size: 4457
Test size: 1115


**Question 5.1**: In the cell above, why did we do `df = df.sample(frac=1, random_state=123)` when creating the train/test split?

**Question 5.2**: Analyse the datasets: what is the distribution of the two classes in the train set? Are they balanced or unbalanced? If they are unbalanced, might this cause issues when fine-tuning the model?

In [11]:
# Q5.2: class distribution in the train set
train_counts = train_df["Label"].value_counts()
train_props = train_df["Label"].value_counts(normalize=True)

print("Train class counts:\n", train_counts)
print("\nTrain class proportions:\n", train_props)

# Optional: same for the test set
test_counts = test_df["Label"].value_counts()
test_props = test_df["Label"].value_counts(normalize=True)
print("\nTest class counts:\n", test_counts)
print("\nTest class proportions:\n", test_props)


Train class counts:
 Label
ham     3860
spam     597
Name: count, dtype: int64

Train class proportions:
 Label
ham     0.866053
spam    0.133947
Name: proportion, dtype: float64

Test class counts:
 Label
ham     965
spam    150
Name: count, dtype: int64

Test class proportions:
 Label
ham     0.865471
spam    0.134529
Name: proportion, dtype: float64


**Question 6**: Create the dataloaders for training and test.

In [12]:
from torch.utils.data import DataLoader

# Create the Tokenizer
tokenizer = tiktoken.get_encoding("gpt2")

# Instantiate the Dataset
train_dataset = SpamDataset("train.csv", tokenizer)
test_dataset = SpamDataset("test.csv", tokenizer)

# Create DataLoaders
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
test_loader  = DataLoader(test_dataset,  batch_size=16, shuffle=False)

# Check your work
for input_batch, target_batch in train_loader:
    print("Input batch shape:", input_batch.shape)   # [16, 120]
    print("Target batch shape:", target_batch.shape) # [16]
    break


Input batch shape: torch.Size([16, 120])
Target batch shape: torch.Size([16])


**Question 7**: Looking at the batch size and the training size, how many batches will you have in total? If you reduced the training set because of performance constraints, please also report the size of the subsampled training data.

In [13]:
batch_size = 16
train_size = len(train_dataset)
num_batches = len(train_loader)

print("Train size:", train_size)
print("Batch size:", batch_size)
print("Number of batches:", num_batches)

# (info) To compute it by hand:
import math
print("Ceil(train_size / batch_size):", math.ceil(train_size / batch_size))


Train size: 4457
Batch size: 16
Number of batches: 279
Ceil(train_size / batch_size): 279


## Fine-tuning

**Context**: GPT-2 was trained to predict the next token (output size 50,257, the vocabulary size). We want to predict binary classes (output size 2), so we must replace the final layer.

**Question 8**:

**8.1**: In the cell below, define the number of output classes (`num_classes`) for the new spam detection task.

**8.2**: Also, print the original and updated output heads (hint: `out_head` from `GPTModel`).

**8.3**: Why do we freeze the internal layers with `param.requires_grad = False`?

In [15]:
import torch.nn as nn

# Freeze the internal layers
for param in model.parameters():
    param.requires_grad = False

print(f"Original output head: {model.out_head}")  # 8.2

num_classes = 2  # 8.1 (spam vs ham)

# Replace output head: from emb_dim (768) -> num_classes (2)
model.out_head = nn.Linear(768, num_classes)  # 8.1 + 8.2

# Enable gradient calculation ONLY for the new head and the second LayerNorm (`norm2`) of the last transformer block
for param in model.out_head.parameters():
    param.requires_grad = True
for param in model.trf_blocks[-1].norm2.parameters():
    param.requires_grad = True

print(f"New output head: {model.out_head}")  # 8.2


Original output head: Linear(in_features=768, out_features=2, bias=True)
New output head: Linear(in_features=768, out_features=2, bias=True)
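
For a sense of scale, the parameter counts involved can be worked out by hand. This is a rough sketch under two labelled assumptions: that a LayerNorm holds one scale and one shift vector of size `emb_dim` (consistent with the `g` and `b` arrays of shape `(768,)` in `params`), and that the pretrained vocabulary head maps 768 to 50257 without a bias term. The helper names are hypothetical.

```python
emb_dim = 768        # settings["n_embd"]
vocab_size = 50257   # settings["n_vocab"]
num_classes = 2      # spam vs ham


def linear_params(in_features, out_features, bias=True):
    """Number of weights (and optional biases) in a linear layer."""
    return in_features * out_features + (out_features if bias else 0)


def layernorm_params(dim):
    """Assumption: one scale and one shift vector of size dim."""
    return 2 * dim


new_head = linear_params(emb_dim, num_classes)                 # 1538
vocab_head = linear_params(emb_dim, vocab_size, bias=False)    # 38597376 (assumed bias-free)
trainable = new_head + layernorm_params(emb_dim)               # 3074

print(f"New classification head: {new_head:,} parameters")
print(f"Vocabulary head (assumed): {vocab_head:,} parameters")
print(f"Trainable after unfreezing head + one LayerNorm: {trainable:,}")
```

Under these assumptions, only a few thousand parameters receive gradients, which is worth keeping in mind when interpreting how quickly the loss moves.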


You now have to **finalise the code for the training loop** (see individual steps below).

In the first cell below you can find the code to move the model to GPU (if available), define the optimizer, and calculate the accuracy. The following cell contains the code for the training (fine-tuning) loop.

You will have to complete the code of the training loop, by answering the following questions:

**Question 9.1**: Reset the gradients of the `optimizer` ([hint](https://docs.pytorch.org/docs/stable/generated/torch.optim.Optimizer.zero_grad.html)).

**Question 9.2**: Compute cross-entropy loss ([hint](https://docs.pytorch.org/docs/stable/generated/torch.nn.functional.cross_entropy.html)).

**Question 9.3**: Add code for the backward pass, to compute the gradient ([hint](https://docs.pytorch.org/docs/stable/generated/torch.Tensor.backward.html)).

**Question 9.4**: Add code for the optimizer step, to update the weights ([hint](https://docs.pytorch.org/docs/stable/generated/torch.optim.Optimizer.step.html))

**Question 9.5**: Add code to calculate the accuracy on train and test (hint: you can use the `calc_accuracy` method).

**Note about the speed**: On my laptop's CPU, 1 epoch with the full training dataset (~4400 samples, batch_size=16) took ~20 minutes; 1 epoch with a train set of 2000 samples (batch_size=16) took ~12 minutes.

To iterate more quickly, you could:
- i) set `num_epochs = 1` (but only at the beginning), just to make sure that the code is working;
- ii) increase `batch_size` to 32 or 64 (but be careful with possible memory issues);
- iii) reduce the size of the training dataset, by going back to the *Preparing the data* section and changing the line `train_df = df.iloc[:split_idx]` to `train_df = df.iloc[:split_idx][:2000]` or similar. Be careful: if you reduce the training data too much, the model will not have enough data for fine-tuning;
- iv) use a GPU; it would be much quicker (a few minutes on the whole training data).
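
To see how these options change the number of optimizer steps per epoch, the batch count can be computed for a few configurations. This is a small sketch assuming the `DataLoader` keeps the last, incomplete batch (its default behaviour); the helper name `batches_per_epoch` is hypothetical.

```python
import math


def batches_per_epoch(num_samples, batch_size):
    """Batches per epoch when the last, partial batch is kept."""
    return math.ceil(num_samples / batch_size)


configurations = [(4457, 16), (4457, 32), (4457, 64), (2000, 16)]
for num_samples, batch_size in configurations:
    steps = batches_per_epoch(num_samples, batch_size)
    print(f"{num_samples} samples, batch_size={batch_size}: {steps} batches")
```

Fewer batches means fewer weight updates per epoch, so a larger batch size or a smaller training set may need more epochs to reach the same loss.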


In [16]:
# [--- INSTRUCTOR CODE ---]

# Move model to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Measure imbalance
count_ham = len(train_df[train_df['Label']=='ham'])
count_spam = len(train_df[train_df['Label']=='spam'])

# Calculate weight: penalize missing the minority class (Spam) more
# Weight = Count(Majority) / Count(Minority)
pos_weight = count_ham / count_spam  # approx 6.47 (for full training dataset)
class_weights = torch.tensor([1.0, pos_weight]).to(device)
print(f"Using class weights: {class_weights}")

# Define Optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5, weight_decay=0.1)

# Calculate Accuracy Helper Function
def calc_accuracy(loader, model, device):
    correct, total = 0, 0
    # Track spam specifically
    spam_correct, spam_total = 0, 0
    model.eval()
    with torch.no_grad():
        for inputs, labels in loader:
            inputs, labels = inputs.to(device), labels.to(device)
            logits = model(inputs)[:, -1, :]
            predicted = torch.argmax(logits, dim=-1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
            # Filter for Spam (Label 1)
            spam_mask = (labels == 1)
            spam_total += spam_mask.sum().item()
            spam_correct += (predicted[spam_mask] == labels[spam_mask]).sum().item()
    # Avoid division by zero
    spam_acc = spam_correct / spam_total if spam_total > 0 else 0.0
    global_acc = correct / total
    return global_acc, spam_acc


Using class weights: tensor([1.0000, 6.4657])
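
To see what these class weights do to the loss, here is a minimal pure-Python sketch of weighted cross-entropy. It follows the weighted-mean reduction that PyTorch documents for `F.cross_entropy` when `weight` is given (the sum of weighted per-sample losses divided by the sum of the weights of the targets). The logits are made up and the function names are hypothetical.

```python
import math


def cross_entropy(logits, target):
    """Negative log-softmax of the target class for one sample."""
    largest = max(logits)
    log_sum_exp = largest + math.log(sum(math.exp(z - largest) for z in logits))
    return log_sum_exp - logits[target]


def weighted_mean_loss(batch_logits, targets, class_weights):
    """Weighted mean: sum(w[y_i] * loss_i) / sum(w[y_i])."""
    weighted = [class_weights[t] * cross_entropy(z, t)
                for z, t in zip(batch_logits, targets)]
    total_weight = sum(class_weights[t] for t in targets)
    return sum(weighted) / total_weight


# Made-up logits: the model leans towards "ham" (class 0) for both messages,
# but the second message is actually spam (class 1).
logits = [[2.0, 0.0], [2.0, 0.0]]
targets = [0, 1]

unweighted = weighted_mean_loss(logits, targets, [1.0, 1.0])
weighted = weighted_mean_loss(logits, targets, [1.0, 3860 / 597])

print(f"Unweighted loss: {unweighted:.4f}")
print(f"Weighted loss:   {weighted:.4f}")
```

With the weights, the missed spam message dominates the batch loss, so its gradient contribution is larger than that of the correctly classified ham message.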


In [18]:
import torch.nn.functional as F

num_epochs = 1

for epoch in range(num_epochs):
    model.train()
    for batch_idx, (inputs, targets) in enumerate(train_loader):
        inputs, targets = inputs.to(device), targets.to(device)

        # 9.1 Reset gradients
        optimizer.zero_grad()

        # Forward pass: output is (batch, seq_len, num_classes) once out_head has been replaced
        logits = model(inputs)[:, -1, :]  # (batch, 2)

        # 9.2 Cross-entropy loss (with class weights)
        loss = F.cross_entropy(logits, targets, weight=class_weights)

        # 9.3 Backward pass
        loss.backward()

        # 9.4 Optimizer step
        optimizer.step()

        if batch_idx % 10 == 0:
            print(f"Epoch {epoch+1}, Batch {batch_idx}, Loss: {loss.item():.4f}")

    # 9.5 Accuracy on train and test
    train_acc, train_spam_acc = calc_accuracy(train_loader, model, device)
    test_acc, test_spam_acc = calc_accuracy(test_loader, model, device)

    print(
        f"Epoch {epoch+1}: "
        f"Train Acc: {train_acc*100:.2f}% (Spam: {train_spam_acc*100:.2f}%) | "
        f"Test Acc: {test_acc*100:.2f}% (Spam: {test_spam_acc*100:.2f}%)"
    )


Epoch 1, Batch 0, Loss: 0.8989
Epoch 1, Batch 10, Loss: 0.8218
Epoch 1, Batch 20, Loss: 0.5782
Epoch 1, Batch 30, Loss: 0.6060
Epoch 1, Batch 40, Loss: 0.7215
Epoch 1, Batch 50, Loss: 0.7454
Epoch 1, Batch 60, Loss: 0.7438
Epoch 1, Batch 70, Loss: 0.6006
Epoch 1, Batch 80, Loss: 0.7691
Epoch 1, Batch 90, Loss: 0.8046
Epoch 1, Batch 100, Loss: 0.5789
Epoch 1, Batch 110, Loss: 0.7873
Epoch 1, Batch 120, Loss: 0.7424
Epoch 1, Batch 130, Loss: 0.5289
Epoch 1, Batch 140, Loss: 0.5946
Epoch 1, Batch 150, Loss: 0.6474
Epoch 1, Batch 160, Loss: 0.8342
Epoch 1, Batch 170, Loss: 0.7316
Epoch 1, Batch 180, Loss: 0.6982
Epoch 1, Batch 190, Loss: 0.7166
Epoch 1, Batch 200, Loss: 0.6161
Epoch 1, Batch 210, Loss: 0.8634
Epoch 1, Batch 220, Loss: 0.8996
Epoch 1, Batch 230, Loss: 0.5195
Epoch 1, Batch 240, Loss: 0.6679
Epoch 1, Batch 250, Loss: 0.7114
Epoch 1, Batch 260, Loss: 0.8556
Epoch 1, Batch 270, Loss: 0.5098
Epoch 1: Train Acc: 86.34% (Spam: 0.00%) | Test Acc: 86.10% (Spam: 0.00%)
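
When reading an epoch summary like this one, it can help to compare it against a trivial baseline that always predicts the most frequent class. The sketch below uses the test-set class counts reported earlier (965 ham, 150 spam); the function name `majority_baseline` is hypothetical.

```python
def majority_baseline(counts):
    """Accuracy and per-class recall of always predicting the most frequent class."""
    majority = max(counts, key=counts.get)
    total = sum(counts.values())
    accuracy = counts[majority] / total
    recall = {label: (1.0 if label == majority else 0.0) for label in counts}
    return majority, accuracy, recall


test_class_counts = {"ham": 965, "spam": 150}
majority, accuracy, recall = majority_baseline(test_class_counts)

print(f"Always predicting {majority!r}: accuracy {accuracy:.2%}")
print(f"Per-class recall: {recall}")
```

On imbalanced data, a global accuracy close to this baseline says little on its own, which is why `calc_accuracy` also reports the accuracy on spam messages separately.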


**Question 10**:

Now run the cell above. You should see how the training loss changes after each batch (and epoch).
Describe this trend: what do you see? Is the model learning?

**Question 11 (optional)**: Change the number of epochs and/or the learning rate and/or the size of the training data, and investigate how the loss/accuracy of the model changes. You can do this by editing and re-running the cells above, or by creating new cells below.

In [None]:
# TODO if needed, your code for the additional analysis can go here.

**Question 12 (optional)**: Now test the model *on your own text*.

In [None]:
def classify_text(text, model, tokenizer, device, max_length=120, pad_token_id=50256):
    model.eval()

    # Encode the text
    encoded = tokenizer.encode(text, allowed_special={'<|endoftext|>'})

    # Pad/Truncate
    # (Matches the logic in SpamDataset so the model sees familiar input structures)
    encoded = encoded[:max_length]
    pad_len = max_length - len(encoded)
    encoded += [pad_token_id] * pad_len

    # Create tensor and add batch dimension
    encoded_tensor = torch.tensor(encoded).unsqueeze(0).to(device) # Shape: [1, max_length]

    # Get prediction
    with torch.no_grad():
        logits = model(encoded_tensor)[:, -1, :] # Logits for the last token
        predicted_label = torch.argmax(logits, dim=-1).item()

    return "SPAM" if predicted_label == 1 else "NOT SPAM"

# --- TODO: Test the model ---
# Create 2 strings: one clearly spam, one normal.
text_1 = "<YOUR TEXT>"  # YOUR TEXT HERE
text_2 = "<YOUR TEXT>"  # YOUR TEXT HERE

print(f"Text 1: {text_1} -> {classify_text(text_1, model, tokenizer, device)}")
print(f"Text 2: {text_2} -> {classify_text(text_2, model, tokenizer, device)}")
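
If a confidence score is wanted alongside the label, the two logits can be passed through a softmax. The following is a pure-Python sketch with made-up logit values; in the notebook, the same result could be obtained with `torch.softmax(logits, dim=-1)` on the last-token logits.

```python
import math


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    largest = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - largest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


example_logits = [1.2, -0.4]  # made-up values, ordered as [ham, spam]
probs = softmax(example_logits)

print(f"P(ham)  = {probs[0]:.3f}")
print(f"P(spam) = {probs[1]:.3f}")
```

Reporting the probability makes it easier to tell a borderline prediction from a confident one when testing your own messages.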

---