# **Fine-tuning T5 for Question Answering**
## **Group 16:**
1. **Alvito Kiflan Hawari (1103220235)**
2. **Nafal Rifky Atsilah Maulana (1103223106)**

This cell imports the libraries used throughout the notebook: `torch` and `torch.nn` for the model, `DataLoader` (from `torch.utils.data`) and `AdamW` (from `torch.optim`) for batching and optimization, `T5ForConditionalGeneration`, `T5TokenizerFast`, and `get_linear_schedule_with_warmup` from `transformers`, `load_dataset` from `datasets`, and `numpy`, `tqdm` (progress bars), and `os` (file system paths).

In [None]:
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torch.optim import AdamW
from transformers import T5ForConditionalGeneration, T5TokenizerFast, get_linear_schedule_with_warmup
from datasets import load_dataset
import numpy as np
from tqdm import tqdm
import os

This cell sets up the device for computation, automatically detecting if a CUDA-enabled GPU is available. If so, it uses the GPU; otherwise, it defaults to the CPU. It then prints the device being used.

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

Using device: cuda


This cell loads the SQuAD dataset (`rajpurkar/squad`) from the Hugging Face Hub and prints the number of samples in the training and validation splits.

In [None]:
dataset = load_dataset("rajpurkar/squad")
print(f"Train samples: {len(dataset['train'])}")
print(f"Validation samples: {len(dataset['validation'])}")


Train samples: 87599
Validation samples: 10570


This cell initializes the `t5-base` tokenizer and defines `preprocess_function`, which prepares SQuAD examples for T5. Each input is formatted as `question: [question] context: [context]`, and the target is the first reference answer (or an empty string if none is listed). Inputs are truncated or padded to 512 tokens and targets to 128 tokens. The cell first selects the first 10,000 training and 1,000 validation examples, then applies the function to these subsets and sets their format to PyTorch tensors.

In [None]:
tokenizer = T5TokenizerFast.from_pretrained("t5-base")

def preprocess_function(examples):
    inputs = []
    targets = []
    for question, context, answers in zip(examples['question'], examples['context'], examples['answers']):
        input_text = f"question: {question} context: {context}"
        inputs.append(input_text)
        target_text = answers['text'][0] if answers['text'] else ""
        targets.append(target_text)

    model_inputs = tokenizer(inputs, max_length=512, truncation=True, padding="max_length")
    labels = tokenizer(targets, max_length=128, truncation=True, padding="max_length")
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

train_dataset = dataset["train"].select(range(10000))
val_dataset = dataset["validation"].select(range(1000))

train_dataset = train_dataset.map(preprocess_function, batched=True, remove_columns=dataset["train"].column_names)
val_dataset = val_dataset.map(preprocess_function, batched=True, remove_columns=dataset["validation"].column_names)

train_dataset.set_format(type="torch")
val_dataset.set_format(type="torch")
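
To make the input format concrete, the sketch below builds one source/target pair the same way `preprocess_function` does. It also shows a common companion step that the cell above does not perform: replacing padded label positions with `-100` so that a loss using `ignore_index=-100` skips them. The helper names `build_pair` and `mask_label_padding` and the toy token ids are illustrative assumptions, not part of the notebook. Only the loss targets should be masked this way; ids fed to the decoder must remain valid token ids.

```python
def build_pair(question, context, answers):
    """Format one SQuAD record as a T5 source/target text pair."""
    source = f"question: {question} context: {context}"
    # SQuAD may list several reference answers; like the notebook, keep the first.
    target = answers["text"][0] if answers["text"] else ""
    return source, target


def mask_label_padding(label_ids, pad_token_id=0, ignore_index=-100):
    """Replace padding ids with ignore_index so the loss can skip them."""
    return [tok if tok != pad_token_id else ignore_index for tok in label_ids]


record = {
    "question": "What is the capital of France?",
    "context": "Paris is the capital and most populous city of France.",
    "answers": {"text": ["Paris"], "answer_start": [0]},
}
source, target = build_pair(record["question"], record["context"], record["answers"])
print(source)
print(target)

# Toy label ids padded with 0 (T5's pad token id); the non-pad values are made up.
print(mask_label_padding([1919, 1, 0, 0]))
```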


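This cell defines the custom `PrefixTuningT5` class, which wraps a T5 model and adds a trainable `prefix_tokens` parameter of shape `(prefix_length, d_model)`. The prefix is prepended to the encoder's input embeddings, and the attention mask is extended with ones to cover the extra positions. Because the prefix is injected only at the input-embedding level, this is closer to prompt tuning than to prefix-tuning in the per-layer key/value sense. All base T5 parameters are frozen, so only `prefix_tokens` is trained. The `forward` method runs the encoder on the prefixed sequence, feeds the encoder states to the decoder, and computes a cross-entropy loss against the labels.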

In [None]:
class PrefixTuningT5(nn.Module):
    def __init__(self, base_model, prefix_length=20):
        super().__init__()
        self.base_model = base_model
        self.prefix_length = prefix_length
        self.hidden_size = base_model.config.d_model

        self.prefix_tokens = nn.Parameter(torch.randn(prefix_length, self.hidden_size))

        for name, param in self.base_model.named_parameters():
            param.requires_grad = False

    def forward(self, input_ids, attention_mask, labels=None):
        batch_size = input_ids.shape[0]

        inputs_embeds = self.base_model.encoder.embed_tokens(input_ids)
        prefix_embeds = self.prefix_tokens.unsqueeze(0).expand(batch_size, -1, -1)
        inputs_embeds = torch.cat([prefix_embeds, inputs_embeds], dim=1)

        prefix_attention_mask = torch.ones(batch_size, self.prefix_length, device=attention_mask.device, dtype=attention_mask.dtype)
        extended_attention_mask = torch.cat([prefix_attention_mask, attention_mask], dim=1)

        encoder_outputs = self.base_model.encoder(
            inputs_embeds=inputs_embeds,
            attention_mask=extended_attention_mask,
            return_dict=True
        )

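        decoder_input_ids = None
        if labels is not None:
            # Teacher forcing: the decoder must see the labels shifted one step right,
            # otherwise each position can simply copy its own target token.
            decoder_input_ids = self.base_model._shift_right(labels)

        decoder_outputs = self.base_model.decoder(
            input_ids=decoder_input_ids,
            encoder_hidden_states=encoder_outputs.last_hidden_state,
            encoder_attention_mask=extended_attention_mask,
            return_dict=True
        )

        sequence_output = decoder_outputs.last_hidden_state
        if self.base_model.config.tie_word_embeddings:
            # T5 rescales decoder states before the tied LM head.
            sequence_output = sequence_output * (self.hidden_size ** -0.5)
        lm_logits = self.base_model.lm_head(sequence_output)

        loss = None
        if labels is not None:
            # Exclude padded label positions from the loss.
            pad_id = self.base_model.config.pad_token_id
            loss_labels = labels.masked_fill(labels == pad_id, -100)
            loss_fct = nn.CrossEntropyLoss(ignore_index=-100)
            loss = loss_fct(lm_logits.view(-1, lm_logits.size(-1)), loss_labels.view(-1))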

        from transformers.modeling_outputs import Seq2SeqLMOutput
        return Seq2SeqLMOutput(
            loss=loss,
            logits=lm_logits,
            past_key_values=None,
            decoder_hidden_states=decoder_outputs.hidden_states,
            decoder_attentions=decoder_outputs.attentions,
            cross_attentions=decoder_outputs.cross_attentions,
            encoder_last_hidden_state=encoder_outputs.last_hidden_state,
            encoder_hidden_states=encoder_outputs.hidden_states,
            encoder_attentions=encoder_outputs.attentions,
        )
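
The shape bookkeeping in `forward` is easy to get wrong, so here is a small standalone sketch of the prepend step using random tensors. The toy sizes and the variable names `token_embeds` and `extended_mask` are assumptions chosen for illustration; in the notebook the embeddings come from `embed_tokens(input_ids)` and the sizes are 20 and 768.

```python
import torch

batch_size, seq_len, hidden_size, prefix_length = 2, 5, 8, 3  # toy sizes

prefix_tokens = torch.randn(prefix_length, hidden_size)       # stand-in for the learned prefix
token_embeds = torch.randn(batch_size, seq_len, hidden_size)  # stand-in for embed_tokens(input_ids)
attention_mask = torch.tensor([[1, 1, 1, 1, 1],
                               [1, 1, 1, 0, 0]])              # second row has 2 padded positions

# Share the same prefix across the batch, then prepend it along the sequence axis.
prefix_embeds = prefix_tokens.unsqueeze(0).expand(batch_size, -1, -1)
inputs_embeds = torch.cat([prefix_embeds, token_embeds], dim=1)

# Prefix positions are always attended to, so their mask entries are 1.
prefix_mask = torch.ones(batch_size, prefix_length, dtype=attention_mask.dtype)
extended_mask = torch.cat([prefix_mask, attention_mask], dim=1)

print(inputs_embeds.shape)
print(extended_mask.shape)
print(extended_mask[1].tolist())
```

Both tensors grow by `prefix_length` along the sequence axis, which is why the same extended mask must also be passed to the decoder as `encoder_attention_mask`.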

This cell loads the pre-trained `t5-base` model, wraps it in the `PrefixTuningT5` class, and moves it to the selected device (GPU or CPU). It then counts trainable and total parameters. With a prefix length of 20 and `d_model` = 768, only 20 × 768 = 15,360 parameters are trainable, about 0.007% of the roughly 223 million total (the output below rounds this to 0.01%).

In [None]:
base_model = T5ForConditionalGeneration.from_pretrained("t5-base")
model = PrefixTuningT5(base_model, prefix_length=20)
model = model.to(device)

trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
total_params = sum(p.numel() for p in model.parameters())
print(f"Trainable parameters: {trainable_params:,} ({100 * trainable_params / total_params:.2f}%)")
print(f"Total parameters: {total_params:,}")


Trainable parameters: 15,360 (0.01%)
Total parameters: 222,918,912
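
The trainable count can be checked by hand. The short calculation below is a sketch that assumes the `t5-base` hidden size of 768 and reuses the total reported in the output above; it prints the share with more decimal places than the notebook's two-decimal format.

```python
prefix_length = 20
d_model = 768                # hidden size of t5-base
total_params = 222_918_912   # total reported by the notebook (base model + prefix)

trainable = prefix_length * d_model
share = 100 * trainable / total_params

print(f"{trainable:,}")
print(f"{share:.4f}%")
```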


This cell creates `DataLoader` instances for the training and validation datasets. They handle batching (and shuffling, for the training data only) so the training loop can iterate over the data directly. With a batch size of 8, this gives 1,250 training batches and 125 validation batches per epoch.

In [None]:
train_dataloader = DataLoader(train_dataset, batch_size=8, shuffle=True)
val_dataloader = DataLoader(val_dataset, batch_size=8)

This cell sets up the training configuration: 5 epochs, an AdamW optimizer with a learning rate of 5e-4 that receives only the trainable parameters (the prefix), a linear learning-rate schedule with 100 warmup steps over the total number of training steps, and a maximum gradient norm of 1.0 for clipping.

In [None]:
num_epochs = 5
optimizer = AdamW(filter(lambda p: p.requires_grad, model.parameters()), lr=5e-4)
num_training_steps = len(train_dataloader) * num_epochs
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=100, num_training_steps=num_training_steps)
max_grad_norm = 1.0
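
For intuition about the schedule, the sketch below reimplements the multiplier that a linear warmup and linear decay schedule applies to the base learning rate, following the documented behavior of `get_linear_schedule_with_warmup` (rise from 0 during warmup, then fall linearly to 0). The function name `linear_warmup_factor` is an assumption for illustration; the step counts follow from the notebook's settings (10,000 samples, batch size 8, 5 epochs).

```python
import math


def linear_warmup_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Warmup: grow linearly from 0 to 1.
        return step / max(1, num_warmup_steps)
    # Decay: shrink linearly from 1 to 0 over the remaining steps.
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


steps_per_epoch = math.ceil(10_000 / 8)
num_training_steps = steps_per_epoch * 5
base_lr = 5e-4

for step in (0, 50, 100, 3175, num_training_steps):
    lr = base_lr * linear_warmup_factor(step, 100, num_training_steps)
    print(step, f"{lr:.2e}")
```

The learning rate peaks at 5e-4 after 100 steps, which is less than a tenth of the first epoch, and reaches zero exactly at the final step.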

This cell contains the main training and validation loop. It runs for `num_epochs` epochs; each training step performs a forward pass, backpropagates the loss, clips gradients to `max_grad_norm`, and steps the optimizer and scheduler. After each epoch it computes the average validation loss and saves a checkpoint to `./finetune-t5-SQuAD/best_model.pt` whenever that loss improves on the best value so far.

In [None]:
best_val_loss = float('inf')
checkpoint_dir = "./finetune-t5-SQuAD"
os.makedirs(checkpoint_dir, exist_ok=True)

for epoch in range(num_epochs):
    model.train()
    train_loss = 0
    train_steps = 0

    progress_bar = tqdm(train_dataloader, desc=f"Epoch {epoch+1}/{num_epochs}")
    for batch in progress_bar:
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        labels = batch["labels"].to(device)

        outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss

        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)

        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()

        train_loss += loss.item()
        train_steps += 1
        progress_bar.set_postfix({"train_loss": train_loss / train_steps})

    avg_train_loss = train_loss / train_steps

    model.eval()
    val_loss = 0
    val_steps = 0

    with torch.no_grad():
        for batch in tqdm(val_dataloader, desc="Validation"):
            input_ids = batch["input_ids"].to(device)
            attention_mask = batch["attention_mask"].to(device)
            labels = batch["labels"].to(device)

            outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
            loss = outputs.loss

            val_loss += loss.item()
            val_steps += 1

    avg_val_loss = val_loss / val_steps

    print(f"Epoch {epoch+1}/{num_epochs}")
    print(f"  Train Loss: {avg_train_loss:.4f}")
    print(f"  Val Loss: {avg_val_loss:.4f}")

    if avg_val_loss < best_val_loss:
        best_val_loss = avg_val_loss
        torch.save({
            'epoch': epoch,
            'model_state_dict': model.state_dict(),
            'optimizer_state_dict': optimizer.state_dict(),
            'val_loss': avg_val_loss,
        }, os.path.join(checkpoint_dir, "best_model.pt"))
        print(f" Saved best model with val_loss: {avg_val_loss:.4f}")

Epoch 1/5: 100%|██████████| 1250/1250 [18:40<00:00,  1.12it/s, train_loss=413]
Validation: 100%|██████████| 125/125 [00:54<00:00,  2.31it/s]


Epoch 1/5
  Train Loss: 413.1548
  Val Loss: 210.8367
 Saved best model with val_loss: 210.8367


Epoch 2/5: 100%|██████████| 1250/1250 [18:41<00:00,  1.11it/s, train_loss=178]
Validation: 100%|██████████| 125/125 [00:54<00:00,  2.31it/s]


Epoch 2/5
  Train Loss: 177.9266
  Val Loss: 24.7917
 Saved best model with val_loss: 24.7917


Epoch 3/5: 100%|██████████| 1250/1250 [18:41<00:00,  1.11it/s, train_loss=85.3]
Validation: 100%|██████████| 125/125 [00:54<00:00,  2.31it/s]


Epoch 3/5
  Train Loss: 85.2720
  Val Loss: 5.9205
 Saved best model with val_loss: 5.9205


Epoch 4/5: 100%|██████████| 1250/1250 [18:42<00:00,  1.11it/s, train_loss=48.1]
Validation: 100%|██████████| 125/125 [00:54<00:00,  2.31it/s]


Epoch 4/5
  Train Loss: 48.0873
  Val Loss: 5.3896
 Saved best model with val_loss: 5.3896


Epoch 5/5: 100%|██████████| 1250/1250 [18:41<00:00,  1.11it/s, train_loss=37.6]
Validation: 100%|██████████| 125/125 [00:54<00:00,  2.30it/s]


Epoch 5/5
  Train Loss: 37.5603
  Val Loss: 5.3101
 Saved best model with val_loss: 5.3101
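
The checkpoint above stores `model.state_dict()`, which includes every frozen T5 weight alongside the prefix. Since only the prefix is trained, a lighter alternative is to save just the parameters with `requires_grad=True`. The sketch below illustrates the idea on a hypothetical stand-in module (`TinyPrefixModel` is not part of the notebook); with the real model, the frozen weights would be restored separately via `from_pretrained` before loading the prefix.

```python
import torch
import torch.nn as nn


class TinyPrefixModel(nn.Module):
    """Hypothetical stand-in: a frozen 'base' layer plus a trainable prefix."""

    def __init__(self):
        super().__init__()
        self.base_model = nn.Linear(4, 4)
        self.prefix_tokens = nn.Parameter(torch.randn(2, 4))
        for param in self.base_model.parameters():
            param.requires_grad = False


model = TinyPrefixModel()

# Keep only the parameters that were actually trained.
trainable_state = {
    name: param.detach().cpu().clone()
    for name, param in model.named_parameters()
    if param.requires_grad
}
print(list(trainable_state))

# Restoring: strict=False tolerates the frozen weights being absent from the dict.
restored = TinyPrefixModel()
result = restored.load_state_dict(trainable_state, strict=False)
print(result.unexpected_keys)
```

The resulting dictionary can be passed to `torch.save` in place of the full state dict.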


This cell loads the best performing model checkpoint saved during training. It loads the `model_state_dict` into the model and sets the model to evaluation mode (`model.eval()`). It then prints the epoch and validation loss of the loaded best model.

In [None]:
checkpoint = torch.load(os.path.join(checkpoint_dir, "best_model.pt"), map_location=device, weights_only=True)
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()
print(f"Loaded best model from epoch {checkpoint['epoch']+1} with val_loss: {checkpoint['val_loss']:.4f}")

Loaded best model from epoch 5 with val_loss: 5.3101


This cell runs inference on a few example questions and contexts. It tokenizes each input, generates an answer with beam search (`num_beams=4`), and then decodes and prints the answer along with its length in characters. Note that the cell calls `model.base_model.generate`, which runs the frozen base T5 directly, so the learned `prefix_tokens` are not applied to these outputs.

In [None]:
test_samples = [
    {"question": "What is the capital of France?", "context": "Paris is the capital and most populous city of France."},
    {"question": "Who invented the telephone?", "context": "Alexander Graham Bell was awarded the first U.S. patent for the invention of the telephone in 1876."},
    {"question": "What is photosynthesis?", "context": "Photosynthesis is a process used by plants and other organisms to convert light energy into chemical energy."}
]

print("Inference Results")

with torch.no_grad():
    for idx, sample in enumerate(test_samples):
        input_text = f"question: {sample['question']} context: {sample['context']}"
        inputs = tokenizer(input_text, return_tensors="pt", max_length=512, truncation=True).to(device)

        outputs = model.base_model.generate(
            input_ids=inputs["input_ids"],
            attention_mask=inputs["attention_mask"],
            max_length=128,
            num_beams=4,
            early_stopping=True
        )

        answer = tokenizer.decode(outputs[0], skip_special_tokens=True)

        print(f"Sample {idx+1}:")
        print(f"  Question: {sample['question']}")
        print(f"  Context: {sample['context'][:100]}...")
        print(f"  Generated Answer: {answer}")
        print(f"  Answer Length: {len(answer)} characters")
        print()

print("Inference Completed - All outputs are non-empty strings")

Inference Results
Sample 1:
  Question: What is the capital of France?
  Context: Paris is the capital and most populous city of France....
  Generated Answer: Paris
  Answer Length: 5 characters

Sample 2:
  Question: Who invented the telephone?
  Context: Alexander Graham Bell was awarded the first U.S. patent for the invention of the telephone in 1876....
  Generated Answer: Alexander Graham Bell
  Answer Length: 21 characters

Sample 3:
  Question: What is photosynthesis?
  Context: Photosynthesis is a process used by plants and other organisms to convert light energy into chemical...
  Generated Answer: convert light energy into chemical energy
  Answer Length: 41 characters

Inference Completed - All outputs are non-empty strings
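
To have generation actually use a learned prefix, one option is to run the encoder on the prefixed embeddings and hand the result to `generate` through its `encoder_outputs` argument, together with the extended attention mask. The sketch below assumes this approach and demonstrates it on a tiny, randomly initialized T5 (built from a hypothetical `T5Config` so nothing is downloaded) with a random stand-in prefix, so the generated ids are meaningless; only the wiring is the point. It uses greedy decoding for brevity; the notebook's beam-search arguments should carry over, though that is not verified here.

```python
import torch
from transformers import T5Config, T5ForConditionalGeneration

torch.manual_seed(0)

# Tiny randomly initialized T5 so the sketch runs without downloading weights.
config = T5Config(
    vocab_size=32, d_model=16, d_kv=8, d_ff=32,
    num_layers=1, num_decoder_layers=1, num_heads=2,
    pad_token_id=0, eos_token_id=1, decoder_start_token_id=0,
)
base_model = T5ForConditionalGeneration(config).eval()
prefix_tokens = torch.randn(3, config.d_model)  # stand-in for a trained prefix

input_ids = torch.tensor([[5, 6, 7, 1]])        # made-up token ids ending in EOS
attention_mask = torch.ones_like(input_ids)

with torch.no_grad():
    # Same prepend step as in the training-time forward pass.
    token_embeds = base_model.encoder.embed_tokens(input_ids)
    prefix_embeds = prefix_tokens.unsqueeze(0).expand(input_ids.size(0), -1, -1)
    inputs_embeds = torch.cat([prefix_embeds, token_embeds], dim=1)
    prefix_mask = torch.ones(
        input_ids.size(0), prefix_tokens.size(0), dtype=attention_mask.dtype
    )
    extended_mask = torch.cat([prefix_mask, attention_mask], dim=1)

    # Encode once with the prefix, then let generate() run only the decoder.
    encoder_outputs = base_model.encoder(
        inputs_embeds=inputs_embeds,
        attention_mask=extended_mask,
        return_dict=True,
    )
    generated = base_model.generate(
        encoder_outputs=encoder_outputs,
        attention_mask=extended_mask,
        max_new_tokens=5,
    )

print(generated.shape)
```

With the notebook's objects, `base_model` would be `model.base_model` and `prefix_tokens` would be `model.prefix_tokens`, with the tokenizer output supplying `input_ids` and `attention_mask`.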


This cell mounts your Google Drive to the Colab environment, allowing you to access files stored in your Drive from within the notebook.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive
