# 1. Load the Model

- Use `distilbart-xsum-12-6`, a distilled version of BART with 6 encoder and 6 decoder layers (half the size of `bart-large`), making it faster and lighter for inference.
- The model is pre-trained and fine-tuned on the **XSum (Extreme Summarization)** dataset.
- **XSum Dataset**: Contains BBC articles paired with single-sentence summaries. Each summary is a highly abstractive paraphrase rather than a simple extract.
  - ~226k article-summary pairs.
  - Average article length: ~431 words.
  - Average summary length: ~23 words.
  - Designed to train models that **generate concise and novel summaries**.
- `distilbart-xsum-12-6` is well-suited for single-document, one-sentence summarization tasks.

In [2]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

In [4]:
model_name = "sshleifer/distilbart-xsum-12-6"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/1.59k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/611M [00:00<?, ?B/s]

### Simple Example

In [5]:
ARTICLE = """
The UN has announced a new global climate agreement after two weeks of negotiations in Geneva.
Countries have agreed on a framework to reduce greenhouse gas emissions by 40% by 2030.
The agreement will be reviewed every 5 years and aims to keep global warming below 1.5°C.
"""

# Tokenize the input
inputs = tokenizer([ARTICLE], max_length=512, return_tensors="pt", truncation=True)

# Generate summary
summary_ids = model.generate(
    inputs["input_ids"],
    max_length=128,
    num_beams=4,
    early_stopping=True
)

# Decode summary
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print("Generated Summary:")
print(summary)

model.safetensors:   0%|          | 0.00/611M [00:00<?, ?B/s]

Generated Summary:
 All photographs courtesy of the United Nations and EPA.


# 2. Evaluate the Model on CNN/DailyMail Dataset

- Load the CNN/DailyMail dataset (version 3.0.0) which consists of news articles paired with multi-sentence summaries.
  - ~287k training samples.
  - Articles are longer (~750 words), and summaries are multi-sentence (3 to 4 sentences).
  - Designed for **multi-sentence summarization** with moderate abstraction.
- Generate summaries using the `distilbart-xsum-12-6` model on a subset of 500 samples from the test set.
- Evaluate the summaries using ROUGE scores.
- Model achieves around **25% ROUGE score**, which is relatively low due to the mismatch between model training (XSum: short summaries) and CNN/DailyMail (longer summaries).

In [6]:
import torch
import evaluate
from tqdm.notebook import tqdm
from datasets import load_dataset , load_from_disk

In [7]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Load Evaluation metric
rouge = evaluate.load("rouge")

dataset = load_dataset("cnn_dailymail", "3.0.0")

subset_test_dataset = dataset['test'].select(range(50))

Downloading builder script:   0%|          | 0.00/6.27k [00:00<?, ?B/s]

README.md:   0%|          | 0.00/15.6k [00:00<?, ?B/s]

train-00000-of-00003.parquet:   0%|          | 0.00/257M [00:00<?, ?B/s]

train-00001-of-00003.parquet:   0%|          | 0.00/257M [00:00<?, ?B/s]

train-00002-of-00003.parquet:   0%|          | 0.00/259M [00:00<?, ?B/s]

validation-00000-of-00001.parquet:   0%|          | 0.00/34.7M [00:00<?, ?B/s]

test-00000-of-00001.parquet:   0%|          | 0.00/30.0M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/287113 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/13368 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/11490 [00:00<?, ? examples/s]

In [8]:
# Collect predictions and references
predictions = []
references = []

# Generate summaries
for sample in tqdm(subset_test_dataset):
    article = sample["article"]
    reference = sample["highlights"]

    # Tokenize input and move to correct device
    inputs = tokenizer(article, return_tensors="pt", max_length=512, truncation=True).to(device)

    # Generate summary
    with torch.no_grad():
        summary_ids = model.generate(
            inputs["input_ids"],
            attention_mask=inputs["attention_mask"],
            num_beams=4,
            max_length=128,
            early_stopping=True
        )
    decoded_summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

    predictions.append(decoded_summary)
    references.append(reference)

  0%|          | 0/50 [00:00<?, ?it/s]

In [9]:
# Evaluate ROUGE
results = rouge.compute(predictions=predictions, references=references, use_stemmer=True)

print({k: f"{v*100:.2f}" for k, v in results.items()})

{'rouge1': '25.93', 'rouge2': '8.08', 'rougeL': '19.07', 'rougeLsum': '23.20'}


# 2. Load dataset

In [10]:
from datasets import DatasetDict

In [11]:
subset_train = dataset["train"].select(range(len(dataset["train"]) // 4))
subset_val = dataset["validation"].select(range(len(dataset["validation"]) // 4))
subset_test = dataset["test"].select(range(len(dataset["test"]) // 4))

subset_dataset = DatasetDict({
    "train": subset_train,
    "validation": subset_val,
    "test": subset_test
})

# 3. Fine-Tune on CNN/DailyMail

- Tokenize the dataset for both input articles and target summaries.
- Remove unnecessary columns and format dataset for PyTorch.
- Use `DataCollatorForSeq2Seq` for dynamic padding.
- Prepare data loaders for training and validation.
- Set up optimizer (`AdamW`) and linear learning rate scheduler.
- Use `Accelerator` to handle multi-GPU or mixed precision training.
- Train for 1 epoch:
  - Print loss every 50 steps.
  - Save model at quarter epoch and after full epoch.
- Save the final model and tokenizer in Hugging Face format.


In [12]:
from transformers import DataCollatorForSeq2Seq, get_scheduler
from torch.optim import AdamW
from torch.utils.data import DataLoader
from accelerate import Accelerator

#### 3.1 Tokenize the dataset for both input articles and target summaries.

In [13]:
def tokenize_function(example):
    model_inputs = tokenizer(
        example["article"],
        max_length=512,
        padding="max_length",
        truncation=True
    )

    with tokenizer.as_target_tokenizer():
        labels = tokenizer(
            example["highlights"],
            max_length=128,
            padding="max_length",
            truncation=True
        )

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

#### 3.2 Remove unnecessary columns and format dataset for PyTorch.

In [14]:
tokenized_dataset = subset_dataset.map(tokenize_function, batched=True)
tokenized_dataset = tokenized_dataset.remove_columns(['article', 'highlights', 'id'])
tokenized_dataset.set_format("torch")

Map:   0%|          | 0/71778 [00:00<?, ? examples/s]



Map:   0%|          | 0/3342 [00:00<?, ? examples/s]

Map:   0%|          | 0/2872 [00:00<?, ? examples/s]

### 3.3 Use `DataCollatorForSeq2Seq` for dynamic padding and Prepare data loaders for training and validation.

In [15]:
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)

train_dataloader = DataLoader(
    tokenized_dataset["train"],
    batch_size=16,
    shuffle=True,
    collate_fn=data_collator
)

eval_dataloader = DataLoader(
    tokenized_dataset["validation"],
    batch_size=16,
    shuffle=True,
    collate_fn=data_collator
)

#### 3.4 Set up optimizer (`AdamW`) and linear learning rate scheduler.

In [16]:
num_epochs = 1

optimizer = AdamW(model.parameters(), lr=2.2e-05)
lr_scheduler = get_scheduler(
    name="linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=len(train_dataloader) * num_epochs,
)

### 3.5 Use `Accelerator` to handle multi-GPU or mixed precision training.

In [17]:
accelerator = Accelerator()
model, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
    model, optimizer, train_dataloader, lr_scheduler
)

### 3.6 Train and Save model
> **Note**
> I save also the learning rate so I can restart the Training if it stop

In [18]:
from pathlib import Path

fine_tune_path = "/content/drive/MyDrive/Colab Notebooks/summarization_checkpoints"
save_dir = Path(fine_tune_path)
save_dir.mkdir(parents=True, exist_ok=True)

In [19]:
def save_learning_rate(optimizer, save_path):
    lr = optimizer.param_groups[0]['lr']
    with open(f"{save_path}/learning_rate.txt", "w") as f:
        f.write(str(lr))

In [None]:
from accelerate import Accelerator

# Initialize Accelerator
accelerator = Accelerator()

# Your model training loop...
for epoch in range(num_epochs):
    model.train()
    total_loss = 0.0

    for step, batch in enumerate(tqdm(train_dataloader)):
        outputs = model(**batch)
        loss = outputs.loss
        total_loss += loss.item()
        accelerator.backward(loss)

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()

        if step % 50 == 0:
            accelerator.print(f"Epoch {epoch+1} | step {step}/{len(train_dataloader)} | loss = {loss.item():.4f}")

        # Save model at half epoch
        if step % (len(train_dataloader) // 8) == 0:
            quarter_path = f"{save_dir}/model_epoch_quarter"
            accelerator.print(f"Saving at quarter epoch to: {quarter_path}")
            model.save_pretrained(quarter_path)  # Save model in HF format
            tokenizer.save_pretrained(quarter_path)  # Save tokenizer in HF format
            save_learning_rate(optimizer, quarter_path)

    # Save model at end of epoch
    if num_epochs > 1:
        avg_train_loss = total_loss / len(train_dataloader)
        epoch_path = f"{save_dir}/model_epoch"
        accelerator.print(f"Saving after epoch to: {epoch_path}")
        model.save_pretrained(epoch_path)  # Save model in HF format
        tokenizer.save_pretrained(epoch_path)  # Save tokenizer in HF format
        save_learning_rate(optimizer, epoch_path)

# Final save if not only one epoch
final_path = f"{save_dir}/model_final"
accelerator.print(f"Saving final model to: {final_path}")
model.save_pretrained(final_path)  # Save model in HF format
tokenizer.save_pretrained(final_path)  # Save tokenizer in HF format
save_learning_rate(optimizer, final_path)

# 4. Evaluate the Fine-Tuned Model

- Load the fine-tuned model and tokenizer from saved path.  
- Prepare them using `Accelerator` for consistent evaluation.  
- Calculate the evaluation loss on the validation set.  
- Compute ROUGE score to assess summarization performance after fine-tuning.  

In [22]:
model_fine_tune_path = '/content/drive/MyDrive/Colab Notebooks/summarization_checkpoints/model_final'

fine_tune_model = AutoModelForSeq2SeqLM.from_pretrained(model_fine_tune_path)
fine_tune_tokenizer = AutoTokenizer.from_pretrained(model_fine_tune_path)

In [23]:
fine_tune_model, fine_tune_tokenizer = accelerator.prepare(fine_tune_model, fine_tune_tokenizer)

### 4.1 Calculate the evaluation loss

In [24]:
eval_dataloader = accelerator.prepare(eval_dataloader)

fine_tune_model.eval()
total_loss = 0.0

with torch.no_grad():
    for step, batch in enumerate(tqdm(eval_dataloader)):
        outputs = fine_tune_model(**batch)
        loss = outputs.loss

        total_loss += loss.item()

        if step % (len(eval_dataloader)//4)== 0:
            accelerator.print(f"Step {step}/{len(eval_dataloader)} | eval loss = {loss.item():.4f}")

avg_eval_loss = total_loss / len(eval_dataloader)
accelerator.print(f"Average evaluation loss: {avg_eval_loss:.4f}")

  0%|          | 0/209 [00:00<?, ?it/s]

  batch["labels"] = torch.tensor(batch["labels"], dtype=torch.int64)


Step 0/209 | eval loss = 1.0829
Step 52/209 | eval loss = 0.9891
Step 104/209 | eval loss = 0.8493
Step 156/209 | eval loss = 0.9624
Step 208/209 | eval loss = 0.6826
Average evaluation loss: 0.9053


### 4.3 Compute ROUGE score after fine-tuning

In [25]:
test_dataloader = DataLoader(
    tokenized_dataset["test"],
    batch_size=32,
    shuffle=False,           # no need to shuffle test set
    collate_fn=data_collator
)
test_dataloader = accelerator.prepare(test_dataloader)

fine_tune_model.eval()

predictions = []
references  = []
current_index = 0

with torch.no_grad():
    for step, batch in enumerate(tqdm(test_dataloader)):
        summary_ids = fine_tune_model.generate(
            batch["input_ids"],
            attention_mask=batch["attention_mask"],
            num_beams=4,
            max_length=128,
            early_stopping=True,
            length_penalty=1.0,
        )

        decoded_summaries = tokenizer.batch_decode(
            summary_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
        )

        batch_size = len(decoded_summaries)
        batch_labels = subset_test["highlights"][current_index:current_index + batch_size]
        current_index += batch_size

        predictions.extend(decoded_summaries)
        references.extend(batch_labels)

  0%|          | 0/90 [00:00<?, ?it/s]

In [26]:
results = rouge.compute(predictions=predictions, references=references, use_stemmer=True)
print({k: f"{v*100:.2f}" for k, v in results.items()})

{'rouge1': '38.49', 'rouge2': '17.02', 'rougeL': '27.18', 'rougeLsum': '35.91'}


### Simple Example

In [27]:
# Ensure you're using the correct device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

ARTICLE = """
The UN has announced a new global climate agreement after two weeks of negotiations in Geneva.
Countries have agreed on a framework to reduce greenhouse gas emissions by 40% by 2030.
The agreement will be reviewed every 5 years and aims to keep global warming below 1.5°C.
"""

# Tokenize the input and move it to the correct device
inputs = tokenizer([ARTICLE], max_length=512, return_tensors="pt", truncation=True).to(device)

# Put the model into evaluation mode and move it to the correct device
model.eval().to(device)

# Generate summary (remove the 'device' argument here)
summary_ids = model.generate(
    inputs["input_ids"],
    max_length=128,
    num_beams=4,
    early_stopping=True,
    length_penalty=1.0
)

# Decode summary
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print("Generated Summary:")
print(summary)


Generated Summary:
 All photographs courtesy of UN Environment Agency, EPA and Reuters
