
# Fine-tuning a pre-trained Pegasus model for improved summarisation

This exercise fine-tunes a pre-trained Pegasus model on the CNN/DailyMail dataset, a collection of news articles paired with reference summaries. It then compares the ROUGE-2 scores of the pre-trained and fine-tuned models to check whether fine-tuning on domain-specific data yields more accurate summaries for the company's clients.

Remember to change your Runtime to GPU before running the code.

## Section 1: Installing and importing required libraries


In [None]:
# Install required libraries
!pip install transformers datasets rouge_score evaluate

# Import necessary libraries for loading datasets, model training, and
# evaluation
import random                      # Python’s built-in random module (for reproducibility of random ops)
import numpy as np                 # Numerical library (arrays, math, seeding random numbers)
import torch                       # PyTorch (deep learning framework, used for model training/evaluation)

from datasets import load_dataset  # Used to download and load benchmark datasets (e.g., CNN/DailyMail)
import evaluate                    # Used to load evaluation metrics (e.g., ROUGE)

from transformers import (
    AutoTokenizer,                 # Generic class to download/load the correct tokenizer from model name
    AutoModelForSeq2SeqLM,         # Generic class to download/load encoder-decoder (seq2seq) models
    DataCollatorForSeq2Seq,        # Dynamically pads inputs/labels to the same length during batching
    Trainer,                       # High-level API to train/evaluate models (handles loop, logging, etc.)
    TrainingArguments,             # Holds all hyperparameters/config for training (batch size, lr, etc.)
    set_seed,                      # Utility to set global random seed (ensures reproducibility)
    PegasusTokenizer,              # Specific tokenizer class for PEGASUS models (optional if using AutoTokenizer)
    PegasusForConditionalGeneration # Specific PEGASUS seq2seq model class (optional if using AutoModelForSeq2SeqLM)
)

Collecting rouge_score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... done
Collecting evaluate
  Downloading evaluate-0.4.5-py3-none-any.whl.metadata (9.5 kB)
Downloading evaluate-0.4.5-py3-none-any.whl (84 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 84.1/84.1 kB 5.6 MB/s eta 0:00:00
Building wheels for collected packages: rouge_score
  Building wheel for rouge_score (setup.py) ... done
  Created wheel for rouge_score: filename=rouge_score-0.1.2-py3-none-any.whl size=24934 sha256=11975dc42761c654a538b2fb792be05addd36f92171a73fcec9684962630b9ac
  Stored in directory: /root/.cache/pip/wheels/85/9d/af/01feefbe7d55ef5468796f0c68225b6788e85d9d0a281e7a70
Successfully built rouge_score
Installing collected packages: rouge_score, evaluate
Successfully installed evaluate-0.4.5 rouge_score-0.1.2



## Section 2: Loading and preprocessing the dataset

In [None]:
# -------------------------------------------------------------
# Load the CNN/DailyMail dataset from Hugging Face Datasets.
# "cnn_dailymail" is a common benchmark dataset for summarization.
# Version "3.0.0" refers to the latest processed variant (article/highlights pairs).
# -------------------------------------------------------------
dataset = load_dataset("cnn_dailymail", "3.0.0")

# -------------------------------------------------------------
# Reduce dataset size for faster experimentation/demonstration.
# - Select the first 5,000 samples from the training split.
# - Select the first 2,000 samples from the test split.
# NOTE: In real training you would use the full dataset,
# but here we subset it to save time and memory.
# -------------------------------------------------------------
train_dataset = dataset['train'].select(range(5000))
test_dataset  = dataset['test'].select(range(2000))

# -------------------------------------------------------------
# Define the model checkpoint to use.
# 'google/pegasus-xsum' is a pre-trained PEGASUS model fine-tuned
# on the XSum dataset (extreme summarization).
# We’ll reuse it here to test transfer to CNN/DailyMail.
# -------------------------------------------------------------
MODEL_ID = "google/pegasus-xsum"

# -------------------------------------------------------------
# Load the PEGASUS tokenizer (responsible for converting text into
# token IDs that the model can understand).
# -------------------------------------------------------------
tokenizer = PegasusTokenizer.from_pretrained(MODEL_ID)

# -------------------------------------------------------------
# Load the PEGASUS model for conditional generation.
# This is an encoder-decoder transformer trained for summarization tasks.
# -------------------------------------------------------------
model = PegasusForConditionalGeneration.from_pretrained(MODEL_ID)

# -------------------------------------------------------------
# Enable gradient checkpointing.
# This saves memory during training by re-computing intermediate
# activations in the backward pass instead of storing them.
# Trade-off: reduced memory usage but slower training.
# -------------------------------------------------------------
model.gradient_checkpointing_enable()

# -------------------------------------------------------------
# Define a preprocessing function that prepares raw dataset
# examples into model-ready inputs.
# -------------------------------------------------------------
def preprocess_function(examples):
    # Extract the "article" field from the dataset as input text.
    inputs = examples["article"]

    # Tokenize the input article:
    # - max_length=512 (truncate longer articles to fit the model input window)
    # - truncation=True (cut off if text > max_length)
    # - padding="max_length" (pad shorter texts to exactly 512 tokens)
    model_inputs = tokenizer(
        inputs,
        max_length=512,
        truncation=True,
        padding="max_length"
    )

    # Tokenize the "highlights" field (the gold summaries).
    # Passing them as `text_target` tokenizes them as decoder targets;
    # this replaces the deprecated `as_target_tokenizer()` context manager.
    labels = tokenizer(
        text_target=examples["highlights"],
        max_length=128,
        truncation=True,
        padding="max_length"
    )

    # Add the tokenized summaries as labels so the Trainer
    # knows the correct target output for each example.
    model_inputs["labels"] = labels["input_ids"]

    return model_inputs

# -------------------------------------------------------------
# Preprocess the test dataset by applying the tokenizer function.
# The map() function applies `preprocess_function` to each example
# (in mini-batches for efficiency if batched=True).
# -------------------------------------------------------------
test_data = test_dataset.map(preprocess_function, batched=True)


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


README.md: 0.00B [00:00, ?B/s]

train-00000-of-00003.parquet:   0%|          | 0.00/257M [00:00<?, ?B/s]

train-00001-of-00003.parquet:   0%|          | 0.00/257M [00:00<?, ?B/s]

train-00002-of-00003.parquet:   0%|          | 0.00/259M [00:00<?, ?B/s]

validation-00000-of-00001.parquet:   0%|          | 0.00/34.7M [00:00<?, ?B/s]

test-00000-of-00001.parquet:   0%|          | 0.00/30.0M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/287113 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/13368 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/11490 [00:00<?, ? examples/s]

tokenizer_config.json:   0%|          | 0.00/87.0 [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/1.91M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/65.0 [00:00<?, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

config.json: 0.00B [00:00, ?B/s]

pytorch_model.bin:   0%|          | 0.00/2.28G [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/2.28G [00:00<?, ?B/s]

Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-xsum and are newly initialized: ['model.decoder.embed_positions.weight', 'model.encoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


generation_config.json:   0%|          | 0.00/259 [00:00<?, ?B/s]

Map:   0%|          | 0/2000 [00:00<?, ? examples/s]
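The fixed-length padding and truncation used in `preprocess_function` can be sketched in plain Python. This is a simplified illustration, not the tokenizer's actual implementation: `pad_or_truncate` and `mask_padding` are hypothetical helpers, the pad id of 0 is an assumption, and end-of-sequence handling is omitted. The second helper shows a common optional step that the notebook does not perform: replacing pad positions in the labels with -100, the index that PyTorch's cross-entropy loss ignores by default. As written, the notebook keeps the pad ids in the labels, so they count toward the loss; adding the mask would change the loss values reported later.

```python
# Illustrative sketch only: plain-Python stand-ins for tokenizer padding
# and label masking. PAD_ID = 0 is an assumption made for this example.
PAD_ID = 0            # assumed pad token id
IGNORE_INDEX = -100   # index skipped by PyTorch's cross-entropy loss by default


def pad_or_truncate(token_ids, max_length, pad_id=PAD_ID):
    """Cut a sequence to max_length, then right-pad it to exactly max_length."""
    clipped = token_ids[:max_length]
    return clipped + [pad_id] * (max_length - len(clipped))


def mask_padding(label_ids, pad_id=PAD_ID):
    """Replace pad positions with IGNORE_INDEX so they add nothing to the loss."""
    return [IGNORE_INDEX if tok == pad_id else tok for tok in label_ids]


short = pad_or_truncate([11, 12, 13], max_length=5)
long = pad_or_truncate([11, 12, 13, 14, 15, 16, 17], max_length=5)
masked = mask_padding(short)

print(short)   # [11, 12, 13, 0, 0]
print(long)    # [11, 12, 13, 14, 15]
print(masked)  # [11, 12, 13, -100, -100]
```

Padding every example to `max_length` keeps tensor shapes uniform at the cost of extra computation on pad tokens; padding per batch with a data collator is a frequent alternative.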



## Section 3: Evaluating the pre-trained model before fine-tuning

In [None]:
# -------------------------------------------------------------
# Load the ROUGE metric implementation from Hugging Face Evaluate.
# ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is the
# standard metric for automatic summarization evaluation.
# It compares overlap of n-grams between generated summaries and references.
# -------------------------------------------------------------
rouge = evaluate.load("rouge")


# -------------------------------------------------------------
# Define a helper function to generate summaries for a given batch
# and attach them back to the dataset row.
# -------------------------------------------------------------
def generate_summary(batch, model, tokenizer):
    # Tokenize the article text into model input tensors.
    # - padding="max_length": pad all to exactly 256 tokens
    # - truncation=True: cut off articles longer than 256 tokens
    # - return_tensors="pt": return PyTorch tensors
    inputs = tokenizer(
        batch["article"],
        padding="max_length",
        truncation=True,
        max_length=256,
        return_tensors="pt"
    )

    # Move tokenized input IDs to the target device (GPU or CPU).
    input_ids = inputs.input_ids.to(DEVICE)

    # Move the attention mask (marks which tokens are real vs padding).
    attention_mask = inputs.attention_mask.to(DEVICE)

    # Generate summaries with the model.
    # - input_ids + attention_mask: define the source sequence
    # - max_length=64: constrain summaries to ≤64 tokens
    outputs = model.generate(
        input_ids=input_ids,
        attention_mask=attention_mask,
        max_length=64
    )

    # Decode generated token IDs back into text strings.
    # - skip_special_tokens=True: remove tokens like <pad>, <eos>
    batch["predicted_summary"] = tokenizer.batch_decode(
        outputs,
        skip_special_tokens=True
    )

    # Free up GPU memory after each batch to avoid OOM errors.
    torch.cuda.empty_cache()

    # Return the batch with an extra field "predicted_summary".
    return batch


# -------------------------------------------------------------
# Set up hardware configuration.
# -------------------------------------------------------------
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"  # Prefer GPU if available
model.to(DEVICE)                                        # Move model weights to the device
USE_FP16 = torch.cuda.is_available()                    # Later used to enable mixed-precision training
set_seed(42)                                            # Fix random seed for reproducibility


# -------------------------------------------------------------
# Evaluate the pre-trained model *before* fine-tuning.
# -------------------------------------------------------------
print("Evaluating pre-trained model...")

# Reload the pre-trained PEGASUS model fresh, and move it to the device.
# This ensures we are evaluating the untouched, original model.
pretrained_model = PegasusForConditionalGeneration.from_pretrained(MODEL_ID).to(DEVICE)

# Take a 100-example slice of the preprocessed test set.
# Apply `generate_summary` to each example one by one (batched=False).
# This adds a "predicted_summary" column alongside the gold "highlights".
test_data_sample_pretrain = test_data.select(range(100)).map(
    lambda batch: generate_summary(batch, pretrained_model, tokenizer),
    batched=False
)


# -------------------------------------------------------------
# Define a small utility to sanitize predictions/references so
# they match the format expected by evaluate.load('rouge').
# -------------------------------------------------------------
def _coerce_for_rouge(preds, refs):
    """
    - Ensures both predictions and references are lists of equal length.
    - Flattens if nested lists, strips whitespace, replaces None with "".
    - If single strings are given, wraps them into lists.
    """
    if isinstance(preds, str): preds = [preds]
    if isinstance(refs, str):  refs  = [refs]

    preds = [p[0] if isinstance(p, list) and len(p) == 1 else p for p in preds]
    refs  = [r[0] if isinstance(r, list) and len(r) == 1 else r for r in refs]

    preds = [("" if p is None else str(p)).strip() for p in preds]
    refs  = [("" if r is None else str(r)).strip() for r in refs]

    if len(preds) != len(refs):
        raise ValueError(f"Predictions and references must have same length: {len(preds)} vs {len(refs)}")

    return preds, refs


# -------------------------------------------------------------
# Compute ROUGE scores for the pre-trained model.
# -------------------------------------------------------------
# Extract generated predictions and gold references from dataset.
preds_pre = test_data_sample_pretrain["predicted_summary"]
refs_pre  = test_data_sample_pretrain["highlights"]

# Coerce them into proper format (lists of equal length).
preds_pre, refs_pre = _coerce_for_rouge(preds_pre, refs_pre)

# Compute ROUGE scores (F1 by default).
# - use_stemmer=True: normalize words by stemming (better match quality)
# - rouge_types: request ROUGE-1 (unigrams), ROUGE-2 (bigrams), ROUGE-L (longest common subsequence)
rouge_scores_pretrain = rouge.compute(
    predictions=preds_pre,
    references=refs_pre,
    use_stemmer=True,
    rouge_types=["rouge1", "rouge2", "rougeL"]
)

# Print the ROUGE-2 F1 score for the pre-trained model (rounded to 4 decimals).
print(f"ROUGE-2 F1 (Pre-trained): {rouge_scores_pretrain['rouge2']:.4f}")


Downloading builder script: 0.00B [00:00, ?B/s]

Evaluating pre-trained model...


Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-xsum and are newly initialized: ['model.decoder.embed_positions.weight', 'model.encoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Map:   0%|          | 0/100 [00:00<?, ? examples/s]

ROUGE-2 F1 (Pre-trained): 0.0668
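To make the ROUGE-2 number above more concrete, here is a minimal sketch of bigram-overlap F1 in plain Python. It is a simplification for illustration only: it splits on whitespace, lowercases, applies no stemming, and scores a single prediction/reference pair, so it will not reproduce the values returned by `evaluate.load("rouge")`. The helper names `bigrams` and `rouge2_f1` are hypothetical.

```python
from collections import Counter


def bigrams(text):
    """Count adjacent word pairs after lowercasing and whitespace splitting."""
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))


def rouge2_f1(prediction, reference):
    """Simplified ROUGE-2: F1 over bigrams shared by prediction and reference."""
    pred, ref = bigrams(prediction), bigrams(reference)
    overlap = sum((pred & ref).values())  # clipped count of matching bigrams
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


score = rouge2_f1("the cat lay on the mat", "the cat sat on the mat")
print(round(score, 4))  # 0.6 (3 of 5 bigrams match on each side)
```

A score of 0.0668 therefore means that only a small fraction of the bigrams in the generated and reference summaries coincide.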



## Section 4: Defining the arguments and fine-tuning the model

In [None]:
# -------------------------------------------------------------
# Preprocess the training dataset by applying the same tokenizer
# function used for the test set. This converts raw text articles
# and highlights into token IDs padded/truncated to fixed lengths.
# batched=True: process multiple examples at once for efficiency.
# -------------------------------------------------------------
train_data = train_dataset.map(preprocess_function, batched=True)


# -------------------------------------------------------------
# Define the training configuration with TrainingArguments.
# These control all aspects of training (output paths, epochs,
# batch size, logging, etc.).
# -------------------------------------------------------------
training_args = TrainingArguments(
    output_dir="./results",          # directory where checkpoints & logs will be saved
    eval_strategy="epoch",           # evaluate at the end of each epoch (older transformers releases name this `evaluation_strategy`)
    learning_rate=2e-5,              # small LR for fine-tuning transformer models
    per_device_train_batch_size=8,   # batch size per GPU/CPU for training
    per_device_eval_batch_size=8,    # batch size per GPU/CPU for evaluation
    num_train_epochs=1,              # number of epochs (set higher in real training)
    weight_decay=0.01,               # weight decay coefficient (regularization applied by the optimizer)
    logging_dir="./logs",            # directory for TensorBoard logs
    logging_steps=10,                # log training metrics every 10 steps
    save_total_limit=2,              # keep only the 2 most recent checkpoints
    fp16=USE_FP16,                   # enable mixed-precision training if GPU supports it
    report_to=[],                    # disable default reporting (W&B, TensorBoard)
)


# -------------------------------------------------------------
# Create a Trainer object that wraps model, data, and arguments.
# Trainer abstracts away training loop, evaluation loop, saving,
# logging, gradient accumulation, etc.
# -------------------------------------------------------------
trainer = Trainer(
    model=model,                     # the Pegasus model we loaded earlier
    args=training_args,              # training hyperparameters
    train_dataset=train_data,        # preprocessed training dataset
    eval_dataset=test_data,          # preprocessed test subset, reused here as the evaluation set
    tokenizer=tokenizer,             # tokenizer for data collation & decoding
)


# -------------------------------------------------------------
# Start fine-tuning! This will:
# - Iterate over the training dataset
# - Compute loss, backprop, update weights
# - Save checkpoints/logs according to TrainingArguments
# -------------------------------------------------------------
trainer.train()


Map:   0%|          | 0/5000 [00:00<?, ? examples/s]



| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | 6.2412        | 6.169848        |




TrainOutput(global_step=625, training_loss=6.585660501098633, metrics={'train_runtime': 1218.8342, 'train_samples_per_second': 4.102, 'train_steps_per_second': 0.513, 'total_flos': 7223661035520000.0, 'train_loss': 6.585660501098633, 'epoch': 1.0})
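The figures in the `TrainOutput` can be cross-checked against the training configuration with a little arithmetic. The sketch below assumes a single device and no gradient accumulation, neither of which the notebook states explicitly; the input values are copied from the code and output above.

```python
import math

train_examples = 5000        # size of the training subset
batch_size = 8               # per_device_train_batch_size
epochs = 1                   # num_train_epochs
train_runtime = 1218.8342    # seconds, from the TrainOutput above

# Assumption: one device and no gradient accumulation,
# so each optimizer step consumes exactly one batch.
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs
samples_per_second = train_examples * epochs / train_runtime
steps_per_second = total_steps / train_runtime

print(total_steps)                   # 625
print(round(samples_per_second, 3))  # 4.102
print(round(steps_per_second, 3))    # 0.513
```

These values match `global_step=625`, `train_samples_per_second` and `train_steps_per_second` in the output, which is consistent with the single-device assumption.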


## Section 5: Evaluating the fine-tuned model

In [None]:
# -------------------------------------------------------------
# Evaluate the fine-tuned model on a subset of the test set
# -------------------------------------------------------------
print("Evaluating fine-tuned model...")

# Take 100 examples from the preprocessed test set.
# For each example, call generate_summary() with the *fine-tuned* model.
# This adds a "predicted_summary" field containing model-generated text.
# - batched=False: map processes each row individually (not in mini-batches).
test_data_sample_finetuned = test_data.select(range(100)).map(
    lambda batch: generate_summary(batch, model, tokenizer),
    batched=False
)

# -------------------------------------------------------------
# Extract predicted summaries and gold reference highlights
# -------------------------------------------------------------
preds_ft  = test_data_sample_finetuned["predicted_summary"]  # model outputs
refs_ft   = test_data_sample_finetuned["highlights"]         # human-written summaries

# Clean and coerce them into lists of equal length using our helper.
# This avoids type mismatches (string vs list, nested lists, None values, etc.).
preds_ft, refs_ft = _coerce_for_rouge(preds_ft, refs_ft)


# -------------------------------------------------------------
# Compute ROUGE metrics for the fine-tuned model
# -------------------------------------------------------------
rouge_scores_finetuned = rouge.compute(
    predictions=preds_ft,     # list of generated summaries
    references=refs_ft,       # list of gold summaries
    use_stemmer=True,         # apply stemming for better matches
    rouge_types=["rouge2"]    # request only ROUGE-2 (bigram overlap)
)

# -------------------------------------------------------------
# Print the ROUGE-2 F1 score (default output of evaluate library).
# The value is a float (numpy.float64), so we print it to 4 decimal places.
# -------------------------------------------------------------
print(f"ROUGE-2 F1 Score (Fine-tuned): {rouge_scores_finetuned['rouge2']:.4f}")


Evaluating fine-tuned model...




Map:   0%|          | 0/100 [00:00<?, ? examples/s]

ROUGE-2 F1 Score (Fine-tuned): 0.1447
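The two ROUGE-2 scores printed in this notebook can be compared directly. The sketch below uses the values from the outputs above; `compare_scores` is a hypothetical helper introduced only for this illustration. Both scores come from the same 100 test examples after a single training epoch, so the gain should be read as indicative rather than as a precise estimate.

```python
rouge2_pretrained = 0.0668   # from the pre-trained evaluation output
rouge2_finetuned = 0.1447    # from the fine-tuned evaluation output


def compare_scores(before, after):
    """Return the absolute gain and the relative gain in percent."""
    absolute_gain = after - before
    relative_gain_pct = 100 * absolute_gain / before
    return absolute_gain, relative_gain_pct


gain, gain_pct = compare_scores(rouge2_pretrained, rouge2_finetuned)
print(f"Absolute gain: {gain:.4f}")       # Absolute gain: 0.0779
print(f"Relative gain: {gain_pct:.1f}%")  # Relative gain: 116.6%
```

In other words, ROUGE-2 on this sample roughly doubles after fine-tuning.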
