# Chapter 6 - Exercises

> Author : Badr TAJINI - Large Language model (LLMs) - ESIEE 2024-2025

---


## Exercise 6.1: Increasing the context length

**Padding Input Sequences in Neural Language Models**

**Key Research Question: How does padding inputs to the maximum `token` length affect model predictive performance?**

*Methodological Approach:*
- Implement systematic `token` padding
- Analyze padding's impact on model performance
- Explore input representation interactions

*Critical Parameters:*
- Input `padding` strategy
- Maximum `token` length
- Predictive performance metrics

*Recommended Investigation:*
1. Implement maximum-length input `padding`
2. Measure performance variations
3. Compare padded versus non-padded inputs
4. Assess computational implications

In [5]:
from google.colab import drive
drive.mount('/content/drive')
import sys
sys.path.append('/content/drive/MyDrive/Colab Notebooks/badr/lab6')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [6]:
from gpt_class_finetune import *

In [7]:
import urllib.request
import zipfile
import os
from pathlib import Path
import time

import matplotlib.pyplot as plt
import pandas as pd
import tiktoken
import torch
from torch.utils.data import Dataset, DataLoader

from gpt_download import download_and_load_gpt2
from previous_labs import GPTModel, load_weights_into_gpt

import argparse

We can pad the inputs by setting the max_length to 1024

In [8]:
max_length = 1024

## Exercise 6.2: Finetuning the whole model

**Model-Wide Fine-Tuning Performance Assessment**

**Key Research Question: What is the impact of `fine-tuning` the entire transformer model versus a single final block on predictive performance?**


*Methodological Approach:*
- Implement comprehensive model `fine-tuning`
- Compare performance against single block tuning
- Assess computational and representational changes

*Critical Parameters:*
- Full model `fine-tuning` strategy
- Performance evaluation metrics
- Comparative analysis methodology

*Recommended Investigation:*
1. `Fine-tune` entire transformer model
2. Measure predictive performance metrics
3. Compare with previous single-block tuning results
4. Analyze performance variation mechanisms

We need to remove the lines to finetune the whole model :
for param in model.parameters():
    param.requires_grad = False



In [9]:

    ########################################
    # Download and prepare dataset
    ########################################

url = "https://archive.ics.uci.edu/static/public/228/sms+spam+collection.zip"
zip_path = "sms_spam_collection.zip"
extracted_path = "sms_spam_collection"
data_file_path = Path(extracted_path) / "SMSSpamCollection.tsv"

download_and_unzip_spam_data(url, zip_path, extracted_path, data_file_path)
df = pd.read_csv(data_file_path, sep="\t", header=None, names=["Label", "Text"])
balanced_df = create_balanced_dataset(df)
balanced_df["Label"] = balanced_df["Label"].map({"ham": 0, "spam": 1})

train_df, validation_df, test_df = random_split(balanced_df, 0.7, 0.1)
train_df.to_csv("train.csv", index=None)
validation_df.to_csv("validation.csv", index=None)
test_df.to_csv("test.csv", index=None)

    ########################################
    # Create data loaders
    ########################################
tokenizer = tiktoken.get_encoding("gpt2")

train_dataset = SpamDataset(
        csv_file="train.csv",
        max_length=None,
        tokenizer=tokenizer
    )

val_dataset = SpamDataset(
        csv_file="validation.csv",
        max_length=train_dataset.max_length,
        tokenizer=tokenizer
    )

test_dataset = SpamDataset(
        csv_file="test.csv",
        max_length=train_dataset.max_length,
        tokenizer=tokenizer
    )

num_workers = 0
batch_size = 8

torch.manual_seed(123)

train_loader = DataLoader(
        dataset=train_dataset,
        batch_size=batch_size,
        shuffle=True,
        num_workers=num_workers,
        drop_last=True,
    )

val_loader = DataLoader(
        dataset=val_dataset,
        batch_size=batch_size,
        num_workers=num_workers,
        drop_last=False,
    )

test_loader = DataLoader(
        dataset=test_dataset,
        batch_size=batch_size,
        num_workers=num_workers,
        drop_last=False,
    )

    ########################################
    # Load pretrained model
    ########################################

    # Small GPT model for testing purposes


CHOOSE_MODEL = "gpt2-small (124M)"
INPUT_PROMPT = "Every effort moves"

BASE_CONFIG = {
            "vocab_size": 50257,     # Vocabulary size
            "context_length": 1024,  # Context length
            "drop_rate": 0.0,        # Dropout rate
            "qkv_bias": True         # Query-key-value bias
        }

model_configs = {
            "gpt2-small (124M)": {"emb_dim": 768, "n_layers": 12, "n_heads": 12},
            "gpt2-medium (355M)": {"emb_dim": 1024, "n_layers": 24, "n_heads": 16},
            "gpt2-large (774M)": {"emb_dim": 1280, "n_layers": 36, "n_heads": 20},
            "gpt2-xl (1558M)": {"emb_dim": 1600, "n_layers": 48, "n_heads": 25},
        }

BASE_CONFIG.update(model_configs[CHOOSE_MODEL])

assert train_dataset.max_length <= BASE_CONFIG["context_length"], (
            f"Dataset length {train_dataset.max_length} exceeds model's context "
            f"length {BASE_CONFIG['context_length']}. Reinitialize data sets with "
            f"`max_length={BASE_CONFIG['context_length']}`"
        )

model_size = CHOOSE_MODEL.split(" ")[-1].lstrip("(").rstrip(")")
settings, params = download_and_load_gpt2(model_size=model_size, models_dir="gpt2")

model = GPTModel(BASE_CONFIG)
load_weights_into_gpt(model, params)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    ########################################
    # Modify and pretrained model
    ########################################


torch.manual_seed(123)

num_classes = 2
model.out_head = torch.nn.Linear(in_features=BASE_CONFIG["emb_dim"], out_features=num_classes)
model.to(device)

for param in model.trf_blocks[-1].parameters():
        param.requires_grad = True

for param in model.final_norm.parameters():
        param.requires_grad = True

File downloaded and saved as sms_spam_collection/SMSSpamCollection.tsv


checkpoint: 100%|██████████| 77.0/77.0 [00:00<00:00, 160kiB/s]
encoder.json: 100%|██████████| 1.04M/1.04M [00:00<00:00, 6.56MiB/s]
hparams.json: 100%|██████████| 90.0/90.0 [00:00<00:00, 69.4kiB/s]
model.ckpt.data-00000-of-00001: 100%|██████████| 498M/498M [00:11<00:00, 44.5MiB/s]
model.ckpt.index: 100%|██████████| 5.21k/5.21k [00:00<00:00, 9.64MiB/s]
model.ckpt.meta: 100%|██████████| 471k/471k [00:00<00:00, 2.47MiB/s]
vocab.bpe: 100%|██████████| 456k/456k [00:00<00:00, 2.73MiB/s]


In [10]:
with torch.no_grad(): # Disable gradient tracking for efficiency because we are not training, yet
    train_loss = calc_loss_loader(train_loader, model, device, num_batches=5)
    val_loss = calc_loss_loader(val_loader, model, device, num_batches=5)
    test_loss = calc_loss_loader(test_loader, model, device, num_batches=5)

print(f"Training loss: {train_loss:.3f}")
print(f"Validation loss: {val_loss:.3f}")
print(f"Test loss: {test_loss:.3f}")

Training loss: 2.183
Validation loss: 2.583
Test loss: 2.322


In [11]:
import time

start_time = time.time()

torch.manual_seed(123)

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5, weight_decay=0.1)

num_epochs = 5
train_losses, val_losses, train_accs, val_accs, examples_seen = train_classifier_simple(
    model, train_loader, val_loader, optimizer, device,
    num_epochs=num_epochs, eval_freq=50, eval_iter=5,tokenizer=tokenizer
)

end_time = time.time()
execution_time_minutes = (end_time - start_time) / 60
print(f"Training completed in {execution_time_minutes:.2f} minutes.")

Ep 1 (Step 000000): Train loss 2.884, Val loss 2.596
Ep 1 (Step 000050): Train loss 0.293, Val loss 0.190
Ep 1 (Step 000100): Train loss 0.148, Val loss 0.501
Training accuracy: 97.50% | Validation accuracy: 95.00%
Ep 2 (Step 000150): Train loss 0.162, Val loss 0.073
Ep 2 (Step 000200): Train loss 0.004, Val loss 0.029
Ep 2 (Step 000250): Train loss 0.029, Val loss 0.106
Training accuracy: 97.50% | Validation accuracy: 95.00%
Ep 3 (Step 000300): Train loss 0.010, Val loss 0.127
Ep 3 (Step 000350): Train loss 0.001, Val loss 0.009
Training accuracy: 100.00% | Validation accuracy: 100.00%
Ep 4 (Step 000400): Train loss 0.005, Val loss 0.006
Ep 4 (Step 000450): Train loss 0.013, Val loss 0.026
Ep 4 (Step 000500): Train loss 0.003, Val loss 0.112
Training accuracy: 100.00% | Validation accuracy: 97.50%
Ep 5 (Step 000550): Train loss 0.025, Val loss 0.024
Ep 5 (Step 000600): Train loss 0.001, Val loss 0.044
Training accuracy: 100.00% | Validation accuracy: 100.00%
Training completed in 141.

In [12]:
with torch.no_grad(): # Disable gradient tracking for efficiency because we are not training, yet
    train_loss = calc_loss_loader(train_loader, model, device, num_batches=5)
    val_loss = calc_loss_loader(val_loader, model, device, num_batches=5)
    test_loss = calc_loss_loader(test_loader, model, device, num_batches=5)

print(f"Training loss: {train_loss:.3f}")
print(f"Validation loss: {val_loss:.3f}")
print(f"Test loss: {test_loss:.3f}")

Training loss: 0.003
Validation loss: 0.004
Test loss: 0.605


## Exercise 6.3: Finetuning the first versus last token

**First Token Fine-Tuning: Predictive Performance Analysis**

**Key Research Question: How do predictive performance characteristics change when fine-tuning the first output `token` compared to the last output `token`?**

*Methodological Approach:*
- Fine-tune first output `token`
- Compare performance against last `token` fine-tuning
- Assess representational learning variations

*Critical Parameters:*
- Initial `token` fine-tuning strategy
- Performance evaluation metrics
- Comparative analysis methodology

*Recommended Investigation:*
1. Implement first `token` fine-tuning
2. Measure predictive performance
3. Compare with last `token` fine-tuning results
4. Analyze performance variation mechanisms

In [15]:
def calc_loss_batch_first_token(input_batch, target_batch, model, device):
    """
    Compute the loss for a batch, focusing on the first token in the sequence.

    Args:
        input_batch: Tensor of input sequences.
        target_batch: Tensor of target labels.
        model: The language model.
        device: The computation device (CPU or GPU).

    Returns:
        The computed loss for the batch.
    """
    input_batch = input_batch.to(device)
    target_batch = target_batch.to(device)

    # Forward pass
    outputs = model(input_batch)

    # Extract logits for the first token
    first_token_logits = outputs[:, 0, :]  # Logits corresponding to the first token

    # Define the loss function (e.g., CrossEntropyLoss)
    criterion = torch.nn.CrossEntropyLoss()

    # Compute the loss between predicted logits and target labels
    loss = criterion(first_token_logits, target_batch)

    return loss

def calc_accuracy_loader_first_token(loader, model, device, num_batches=None):
    model.eval()
    correct = 0
    total = 0

    with torch.no_grad():
        for input_batch, target_batch in loader:
            if num_batches and total >= num_batches:
                break

            input_batch = input_batch.to(device)
            target_batch = target_batch.to(device)

            outputs = model(input_batch)
            first_token_logits = outputs[:, 0, :]  # Using index 0 for the first token

            first_token_labels = target_batch

            # Calculate the predictions and compare with the labels
            _, predicted = torch.max(first_token_logits, 1)
            correct += (predicted == first_token_labels).sum().item()
            total += target_batch.shape[0]

    return correct / total if total > 0 else 0



def evaluate_model_first_token(model, train_loader, val_loader, device, num_batches=None):
    model.eval()
    train_loss = calc_loss_loader_first_token(train_loader, model, device, num_batches)
    val_loss = calc_loss_loader_first_token(val_loader, model, device, num_batches)
    return train_loss, val_loss

def calc_loss_loader_first_token(loader, model, device, num_batches=None):
    model.eval()
    total_loss = 0
    batch_count = 0

    with torch.no_grad():
        for input_batch, target_batch in loader:
            if num_batches and batch_count >= num_batches:
                break

            loss = calc_loss_batch_first_token(input_batch, target_batch, model, device)
            total_loss += loss.item()
            batch_count += 1

    return total_loss / batch_count if batch_count > 0 else float('inf')

def calc_accuracy_loader_first_token(loader, model, device, num_batches=None):
    model.eval()
    correct = 0
    total = 0

    with torch.no_grad():
        for input_batch, target_batch in loader:
            if num_batches and total >= num_batches:
                break

            input_batch = input_batch.to(device)
            target_batch = target_batch.to(device)

            outputs = model(input_batch)
            first_token_logits = outputs[:, 0, :]  # Using index 0 for the first token

            first_token_labels = target_batch

            # Calculate the predictions and compare with the labels
            _, predicted = torch.max(first_token_logits, 1)
            correct += (predicted == first_token_labels).sum().item()
            total += target_batch.shape[0]

    return correct / total if total > 0 else 0


def train_classifier_first_token(model, train_loader, val_loader, optimizer, device, num_epochs, eval_freq, eval_iter, tokenizer):
    # Initialize lists to track losses and tokens seen
    train_losses, val_losses, train_accs, val_accs = [], [], [], []
    examples_seen, global_step = 0, -1

    # Main training loop
    for epoch in range(num_epochs):
        model.train()

        for input_batch, target_batch in train_loader:
            optimizer.zero_grad()


            loss = calc_loss_batch_first_token(input_batch, target_batch, model, device)
            loss.backward()  # Calculate loss gradients
            optimizer.step()

            examples_seen += input_batch.shape[0]
            global_step += 1

            # Optional evaluation step
            if global_step % eval_freq == 0:
                train_loss, val_loss = evaluate_model_first_token(
                    model, train_loader, val_loader, device, eval_iter)
                train_losses.append(train_loss)
                val_losses.append(val_loss)

                print(f"Ep {epoch+1} (Step {global_step:06d}): "
                      f"Train loss {train_loss:.3f}, Val loss {val_loss:.3f}")

                # Calculate accuracy after each evaluation step
                train_accuracy = calc_accuracy_loader_first_token(
                    train_loader, model, device, num_batches=eval_iter)
                val_accuracy = calc_accuracy_loader_first_token(
                    val_loader, model, device, num_batches=eval_iter)

                print(f"Training accuracy: {train_accuracy*100:.2f}% | ", end="")
                print(f"Validation accuracy: {val_accuracy*100:.2f}%")

                train_accs.append(train_accuracy)
                val_accs.append(val_accuracy)

    return train_losses, val_losses, train_accs, val_accs, examples_seen

In [16]:
start_time = time.time()

torch.manual_seed(123)

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5, weight_decay=0.1)

num_epochs = 5
train_losses, val_losses, train_accs, val_accs, examples_seen = train_classifier_first_token(
    model, train_loader, val_loader, optimizer, device,
    num_epochs=num_epochs, eval_freq=50, eval_iter=5,tokenizer=tokenizer
)

end_time = time.time()
execution_time_minutes = (end_time - start_time) / 60
print(f"Training completed in {execution_time_minutes:.2f} minutes.")

Ep 1 (Step 000000): Train loss 0.931, Val loss 0.942
Training accuracy: 62.50% | Validation accuracy: 50.00%
Ep 1 (Step 000050): Train loss 0.584, Val loss 0.654
Training accuracy: 100.00% | Validation accuracy: 62.50%
Ep 1 (Step 000100): Train loss 0.589, Val loss 0.762
Training accuracy: 75.00% | Validation accuracy: 62.50%
Ep 2 (Step 000150): Train loss 0.436, Val loss 0.677
Training accuracy: 87.50% | Validation accuracy: 50.00%
Ep 2 (Step 000200): Train loss 0.435, Val loss 0.679
Training accuracy: 87.50% | Validation accuracy: 62.50%
Ep 2 (Step 000250): Train loss 0.354, Val loss 0.685
Training accuracy: 87.50% | Validation accuracy: 62.50%
Ep 3 (Step 000300): Train loss 0.303, Val loss 0.887
Training accuracy: 87.50% | Validation accuracy: 50.00%
Ep 3 (Step 000350): Train loss 0.344, Val loss 0.822
Training accuracy: 100.00% | Validation accuracy: 62.50%
Ep 4 (Step 000400): Train loss 0.220, Val loss 0.867
Training accuracy: 75.00% | Validation accuracy: 37.50%
Ep 4 (Step 000450

En ayant changé model(input_batch)[:, -1, :] en model(input_batch)[:, 0, :], on a pu avoir le premier token au lieu du dernier. On voit que le fine tunning token forward est moins efficace que le back.