# Task 2 — Fine-tuning **T5-base** for **Question Answering** on **SQuAD** (Google Drive)

This notebook fine-tunes a **seq2seq** model (**T5-base**) for **generative QA**, using:
- Dataset: `rajpurkar/squad`
- Model: `t5-base`

It is designed for **Google Colab** and saves outputs under your Google Drive folder.

**Recommended repo:** `finetuning-t5-question-answering`  
**Notebook path:** `notebooks/01_t5_squad_qa.ipynb`


In [1]:
from google.colab import drive
drive.mount("/content/drive")


Mounted at /content/drive


In [2]:
from pathlib import Path

# Project directory on your Google Drive
PROJECT_DIR = "/content/drive/MyDrive/finetuning-t5-question-answering"

PROJECT_PATH = Path(PROJECT_DIR)
REPORTS_DIR  = PROJECT_PATH / "reports"
NOTEBOOKS_DIR = PROJECT_PATH / "notebooks"
MODELS_DIR   = PROJECT_PATH / "models"
OUTPUTS_DIR  = PROJECT_PATH / "outputs"

for d in [REPORTS_DIR, NOTEBOOKS_DIR, MODELS_DIR, OUTPUTS_DIR]:
    d.mkdir(parents=True, exist_ok=True)

print("PROJECT_PATH:", PROJECT_PATH)
print("REPORTS_DIR:", REPORTS_DIR)
print("MODELS_DIR:", MODELS_DIR)
print("OUTPUTS_DIR:", OUTPUTS_DIR)


PROJECT_PATH: /content/drive/MyDrive/finetuning-t5-question-answering
REPORTS_DIR: /content/drive/MyDrive/finetuning-t5-question-answering/reports
MODELS_DIR: /content/drive/MyDrive/finetuning-t5-question-answering/models
OUTPUTS_DIR: /content/drive/MyDrive/finetuning-t5-question-answering/outputs


## 1) Install dependencies

In [3]:
!pip -q install "pandas==2.2.2"


In [4]:
# If running on Colab, install dependencies first.
!pip -q install -U transformers datasets evaluate accelerate sentencepiece huggingface_hub pandas


ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
google-colab 1.0.0 requires pandas==2.2.2, but you have pandas 2.3.3 which is incompatible.

In [5]:
# eval_strategy / processing_class (used in the trainer setup) need a recent transformers release
!pip -q install -U "transformers>=4.46.0" "accelerate>=0.21.0" datasets evaluate sentencepiece "pandas==2.2.2"



## 2) (Optional) Hugging Face login

If you get authentication or gated-access errors when loading the dataset, log in first.

In [6]:
# Option A: Notebook login (recommended)
from huggingface_hub import notebook_login

# Uncomment the next line if you need to authenticate:
# notebook_login()

# Option B: CLI (alternative)
# !huggingface-cli login


## 3) Imports + Seed

In [7]:
import os, random, json
import numpy as np
import pandas as pd

from datasets import load_dataset
import evaluate

import torch
from transformers import (
    T5TokenizerFast,
    T5ForConditionalGeneration,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainingArguments,
    Seq2SeqTrainer,
    set_seed,
)

SEED = 42
set_seed(SEED)
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)

print("CUDA available:", torch.cuda.is_available())
print("Torch version:", torch.__version__)


CUDA available: True
Torch version: 2.9.0+cu126


## 4) Load dataset: `rajpurkar/squad`

In [8]:
# SQuAD v1.1 from the Hugging Face Hub
ds = load_dataset("rajpurkar/squad")

train_ds = ds["train"]
valid_ds = ds["validation"]

print(ds)
print("Train size:", len(train_ds))
print("Validation size:", len(valid_ds))
print("Columns:", train_ds.column_names)
print("Example:", {k: train_ds[0][k] for k in ["id", "title", "question"]})


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 10570
    })
})
Train size: 87599
Validation size: 10570
Columns: ['id', 'title', 'context', 'question', 'answers']
Example: {'id': '5733be284776f41900661182', 'title': 'University_of_Notre_Dame', 'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?'}


## 5) Configuration

In [9]:
MODEL_CHECKPOINT = "t5-base"

# Sequence lengths (reduce if you hit OOM)
MAX_SOURCE_LENGTH = 384
MAX_TARGET_LENGTH = 64

# Training hyperparameters (adjust as needed)
NUM_EPOCHS = 1
LEARNING_RATE = 3e-4
TRAIN_BATCH_SIZE = 4
EVAL_BATCH_SIZE = 4
GRAD_ACCUM_STEPS = 1
WEIGHT_DECAY = 0.0

# Faster dev run (set True to test pipeline quickly)
USE_SMALL_SUBSET = False
SMALL_TRAIN_SIZE = 2000
SMALL_VALID_SIZE = 500

print("MODEL_CHECKPOINT:", MODEL_CHECKPOINT)
print("MAX_SOURCE_LENGTH:", MAX_SOURCE_LENGTH)
print("MAX_TARGET_LENGTH:", MAX_TARGET_LENGTH)
print("NUM_EPOCHS:", NUM_EPOCHS)
print("TRAIN_BATCH_SIZE:", TRAIN_BATCH_SIZE)
print("EVAL_BATCH_SIZE:", EVAL_BATCH_SIZE)
print("USE_SMALL_SUBSET:", USE_SMALL_SUBSET)


MODEL_CHECKPOINT: t5-base
MAX_SOURCE_LENGTH: 384
MAX_TARGET_LENGTH: 64
NUM_EPOCHS: 1
TRAIN_BATCH_SIZE: 4
EVAL_BATCH_SIZE: 4
USE_SMALL_SUBSET: False
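
To get a feel for what these settings imply, the step counts can be estimated up front. This is a rough sketch, assuming a single device and the 1000-step evaluation interval used in the trainer setup; `training_schedule` is a hypothetical helper, not part of any library, and the exact rounding inside `Trainer` may differ slightly between versions.

```python
import math

def training_schedule(num_examples, batch_size, grad_accum_steps, epochs, eval_every):
    """Approximate optimizer-step counts for a single-device run (hypothetical helper)."""
    batches_per_epoch = math.ceil(num_examples / batch_size)
    steps_per_epoch = math.ceil(batches_per_epoch / grad_accum_steps)
    total_steps = steps_per_epoch * epochs
    return {
        "effective_batch_size": batch_size * grad_accum_steps,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "num_evaluations": total_steps // eval_every,
    }

# Values taken from this notebook: full SQuAD train split, batch size 4,
# no gradient accumulation, 1 epoch, evaluation every 1000 steps.
schedule = training_schedule(87599, 4, 1, 1, 1000)
print(schedule)
```

With the values above this gives 21,900 optimizer steps and 21 evaluation passes. Each pass generates answers for all 10,570 validation examples, so evaluation can take a large share of the runtime; a larger evaluation interval or a validation subset may be worth considering.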


## 6) Load model + tokenizer

In [10]:
tokenizer = T5TokenizerFast.from_pretrained(MODEL_CHECKPOINT)
model = T5ForConditionalGeneration.from_pretrained(MODEL_CHECKPOINT)

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

print("Device:", device)


Device: cuda


## 7) Preprocess: format input for T5 + tokenize

In [11]:
def build_input(question: str, context: str) -> str:
    # Standard T5 QA prompting
    return f"question: {question} context: {context}"

def pick_first_answer(answers) -> str:
    # SQuAD schema: answers = {"text": [...], "answer_start": [...]}
    if answers is None:
        return ""
    texts = answers.get("text", [])
    if isinstance(texts, list) and len(texts) > 0:
        return texts[0]
    return ""

def preprocess_function(batch):
    inputs = [build_input(q, c) for q, c in zip(batch["question"], batch["context"])]
    targets = [pick_first_answer(a) for a in batch["answers"]]

    model_inputs = tokenizer(
        inputs,
        max_length=MAX_SOURCE_LENGTH,
        truncation=True,
    )

    labels = tokenizer(
        text_target=targets,
        max_length=MAX_TARGET_LENGTH,
        truncation=True,
    )

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# Optional small subset for quick runs
if USE_SMALL_SUBSET:
    train_raw = train_ds.select(range(min(SMALL_TRAIN_SIZE, len(train_ds))))
    valid_raw = valid_ds.select(range(min(SMALL_VALID_SIZE, len(valid_ds))))
else:
    train_raw = train_ds
    valid_raw = valid_ds

train_tok = train_raw.map(preprocess_function, batched=True, remove_columns=train_raw.column_names)
valid_tok = valid_raw.map(preprocess_function, batched=True, remove_columns=valid_raw.column_names)

print("Tokenized train size:", len(train_tok))
print("Tokenized valid size:", len(valid_tok))
print("Tokenized example keys:", train_tok[0].keys())


Tokenized train size: 87599
Tokenized valid size: 10570
Tokenized example keys: dict_keys(['input_ids', 'attention_mask', 'labels'])
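
Note that the preprocessing above truncates but does not pad. Padding happens later, per batch, in `DataCollatorForSeq2Seq`, which by default fills label positions with `-100` so that the loss skips them. The following is a pure-Python sketch of that idea with made-up token ids; `pad_labels` and `restore_pad_ids` are hypothetical names used only for illustration.

```python
IGNORE_INDEX = -100  # PyTorch's cross-entropy loss ignores this target value by default

def pad_labels(label_batch, pad_value=IGNORE_INDEX):
    """Right-pad every label sequence to the longest one in the batch."""
    max_len = max(len(ids) for ids in label_batch)
    return [ids + [pad_value] * (max_len - len(ids)) for ids in label_batch]

def restore_pad_ids(padded, pad_token_id, ignore_index=IGNORE_INDEX):
    """Swap the ignore value back to a real pad id so the ids can be decoded."""
    return [[pad_token_id if t == ignore_index else t for t in row] for row in padded]

batch = [[71, 1], [27, 43, 8, 1]]  # made-up token ids, not real tokenizer output
padded = pad_labels(batch)
print(padded)                                   # [[71, 1, -100, -100], [27, 43, 8, 1]]
print(restore_pad_ids(padded, pad_token_id=0))  # [[71, 1, 0, 0], [27, 43, 8, 1]]
```

The second helper mirrors the `np.where(labels != -100, ...)` step in the metrics cell: `-100` is not a valid vocabulary id, so it has to be replaced before calling `batch_decode`.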


## 8) Metrics: Exact Match (EM) & F1 via `evaluate`

In [12]:
squad_metric = evaluate.load("squad")

def postprocess_text(preds, labels):
    preds = [p.strip() for p in preds]
    labels = [l.strip() for l in labels]
    return preds, labels

def compute_metrics(eval_pred):
    preds, labels = eval_pred

    if isinstance(preds, tuple):
        preds = preds[0]

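    # Generated ids can be padded with -100 in some transformers versions; -100 cannot be decoded
    preds = np.where(preds != -100, preds, tokenizer.pad_token_id)
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)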

    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)

    # Build inputs for the SQuAD metric
    predictions = [{"id": str(i), "prediction_text": p} for i, p in enumerate(decoded_preds)]
    references  = [{"id": str(i), "answers": {"text": [a], "answer_start": [0]}} for i, a in enumerate(decoded_labels)]

    result = squad_metric.compute(predictions=predictions, references=references)
    return {"exact_match": result["exact_match"], "f1": result["f1"]}
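
For intuition about what the `squad` metric measures, here is a pure-Python sketch of SQuAD v1.1-style scoring: answers are lowercased, stripped of punctuation and English articles, then compared exactly (EM) or by token overlap (F1). It follows the normalization of the official evaluation script as I understand it, but it is an illustration rather than the code `evaluate` runs. The function names and example strings are made up.

```python
import re
import string
from collections import Counter

PUNCTUATION = set(string.punctuation)

def normalize_answer(text):
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, gold):
    return float(normalize_answer(prediction) == normalize_answer(gold))

def token_f1(prediction, gold):
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(gold).split()
    num_same = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

def best_over_references(metric_fn, prediction, golds):
    """Score against every reference answer and keep the best value."""
    return max(metric_fn(prediction, g) for g in golds)

# Illustrative strings only; the second reference is invented for the example.
golds = ["Saint Bernadette Soubirous", "Bernadette Soubirous"]
print(exact_match("the Saint Bernadette Soubirous.", golds[0]))  # 1.0
print(token_f1("Bernadette", golds[0]))
print(best_over_references(token_f1, "Bernadette", golds))
```

The `squad` metric reports both scores as percentages averaged over all examples. `compute_metrics` above passes a single reference per example (the decoded label), so its scores can come out lower than a run that scores against every reference answer in the validation split, as `best_over_references` does.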


## 9) Trainer setup

In [13]:
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)

# Mixed precision: T5 checkpoints can be numerically unstable under fp16.
# If the loss becomes NaN, set fp16=False (or try bf16=True on GPUs that support it).
use_fp16 = bool(torch.cuda.is_available())

training_args = Seq2SeqTrainingArguments(
    output_dir=str(OUTPUTS_DIR),
    eval_strategy="steps",  # named `evaluation_strategy` in older transformers releases
    eval_steps=1000,
    save_steps=1000,
    logging_steps=200,
    save_total_limit=2,

    per_device_train_batch_size=TRAIN_BATCH_SIZE,
    per_device_eval_batch_size=EVAL_BATCH_SIZE,
    gradient_accumulation_steps=GRAD_ACCUM_STEPS,

    learning_rate=LEARNING_RATE,
    num_train_epochs=NUM_EPOCHS,
    weight_decay=WEIGHT_DECAY,

    predict_with_generate=True,
    generation_max_length=MAX_TARGET_LENGTH,

    fp16=use_fp16,

    report_to="none",
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_tok,
    eval_dataset=valid_tok,
    processing_class=tokenizer,  # replaces the deprecated `tokenizer=` argument
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

print("Trainer ready.")


## 10) Train

In [None]:
train_result = trainer.train()
print(train_result)


## 11) Evaluate

In [None]:
eval_result = trainer.evaluate()
print("Eval:", eval_result)

# Save metrics to reports
metrics_path = REPORTS_DIR / "metrics_squad.json"
with open(metrics_path, "w", encoding="utf-8") as f:
    json.dump(eval_result, f, indent=2)

print("Saved:", metrics_path)


## 12) Save model + tokenizer

In [None]:
save_path = MODELS_DIR / "t5_squad_best"
save_path.mkdir(parents=True, exist_ok=True)

trainer.save_model(str(save_path))
tokenizer.save_pretrained(str(save_path))

print("Saved to:", save_path)
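
To reuse the saved checkpoint in a later session, something along these lines should work. It is a minimal sketch: `load_for_inference` and `format_qa_prompt` are hypothetical helper names, and the directory argument is assumed to be the folder written by the cell above.

```python
def load_for_inference(model_dir):
    """Reload a fine-tuned T5 checkpoint and its tokenizer from a local folder."""
    # Imported lazily so that defining this helper does not require the libraries.
    import torch
    from transformers import T5ForConditionalGeneration, T5TokenizerFast

    tokenizer = T5TokenizerFast.from_pretrained(model_dir)
    model = T5ForConditionalGeneration.from_pretrained(model_dir)
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model.to(device)
    model.eval()  # disable dropout for inference
    return tokenizer, model

def format_qa_prompt(question, context):
    """Build the same `question: ... context: ...` prompt used during training."""
    return f"question: {question} context: {context}"

# Example (not executed here, needs the saved folder on disk):
# tokenizer, model = load_for_inference(str(save_path))
```

The prompt format at inference time should match the one used for training, otherwise answer quality is likely to drop.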


## 13) Qualitative examples (save to CSV)

In [None]:
def generate_answer(question, context):
    inp = build_input(question, context)
    inputs = tokenizer([inp], return_tensors="pt", truncation=True, max_length=MAX_SOURCE_LENGTH)
    inputs = {k: v.to(model.device) for k, v in inputs.items()}

    with torch.no_grad():
        out_ids = model.generate(
            **inputs,
            max_new_tokens=MAX_TARGET_LENGTH,
            num_beams=4,
        )
    return tokenizer.decode(out_ids[0], skip_special_tokens=True)

# Take a small random sample from validation
rng = np.random.default_rng(SEED)
k = 10
idxs = rng.choice(len(valid_raw), size=min(k, len(valid_raw)), replace=False).tolist()

rows = []
for i in idxs:
    ex = valid_raw[i]
    pred = generate_answer(ex["question"], ex["context"])
    gold = pick_first_answer(ex["answers"])
    rows.append({
        "id": ex.get("id", str(i)),
        "question": ex["question"],
        "prediction": pred,
        "gold": gold,
    })

df = pd.DataFrame(rows)
csv_path = REPORTS_DIR / "qualitative_examples_squad.csv"
df.to_csv(csv_path, index=False)

print("Saved:", csv_path)
df.head()


## 14) Write report to `reports/summary_squad.md`

In [None]:
from datetime import datetime

report_path = REPORTS_DIR / "summary_squad.md"
now = datetime.now().strftime("%Y-%m-%d %H:%M:%S")

exact_match = eval_result.get("eval_exact_match", 0.0)
f1 = eval_result.get("eval_f1", 0.0)
eval_loss = eval_result.get("eval_loss", 0.0)

report_text = f"""# Task 2 — Fine-tuning T5-base for SQuAD (Generative QA)

Generated: {now}

## Setup
- Dataset: rajpurkar/squad
- Model: {MODEL_CHECKPOINT}
- Max source length: {MAX_SOURCE_LENGTH}
- Max target length: {MAX_TARGET_LENGTH}
- Epochs: {NUM_EPOCHS}
- Learning rate: {LEARNING_RATE}
- Train batch size: {TRAIN_BATCH_SIZE}
- Eval batch size: {EVAL_BATCH_SIZE}
- Gradient accumulation: {GRAD_ACCUM_STEPS}
- Weight decay: {WEIGHT_DECAY}

## Data sizes used
- Train: {len(train_tok)}
- Validation: {len(valid_tok)}

## Results (Validation)
| Metric | Value |
|---|---:|
| Eval loss | {eval_loss:.4f} |
| Exact Match (EM) | {exact_match:.2f} |
| F1 | {f1:.2f} |

## Notes
- This is a **generative QA** setup using a T5-style input: `question: ... context: ...`
- For the training target, we used the **first** answer span in the SQuAD `answers["text"]` list.
- Qualitative examples saved to: `reports/qualitative_examples_squad.csv`
"""

report_path.write_text(report_text, encoding="utf-8")
print("Wrote:", report_path)
