<a href="https://colab.research.google.com/github/arquansa/PSTB-exercises/blob/main/Week08/Day3/EX3/W8D3EX.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Exercises XP - Evaluation, Benchmarking, and Integration
Last Updated: July 22nd, 2025

👩‍🏫 👩🏿‍🏫 What You’ll learn
Practical LLM Evaluation: Gain hands-on experience evaluating LLMs for summarization.
Metric Deep Dive: Understand the strengths and weaknesses of various evaluation metrics (accuracy, ROUGE).
Model Comparison: Learn to systematically compare different LLMs and model sizes.
Hugging Face Proficiency: Enhance your skills in using Hugging Face’s transformers and evaluate libraries.
Customization: Implement and analyze the effects of modifying evaluation metrics and model parameters.
Data Handling: Learn how to load, process, and sample text datasets using pandas.
Text Preprocessing: Understand the importance of text preprocessing for NLP tasks.
Debugging and Analysis: Develop skills in debugging and analyzing LLM outputs.


🛠️ What you will create
Evaluation Scripts: Python scripts to calculate and compare summarization metrics.
Comparative Reports: DataFrames and visualizations summarizing the performance of different LLMs.
Modified Evaluation Metrics: Custom accuracy metrics tailored for summarization.
Summarization Outputs: Generated summaries from various LLMs for comparative analysis.
Analytical Reports: Documentation of your findings, including discussions on metric behavior and model performance.
Custom Functions: Functions to load datasets, generate summaries, and compute ROUGE scores.
Model Comparison Tables: Tables comparing the performance of different LLMs based on various metrics.




All of today’s exercises are part of a single, hands-on tutorial designed to teach you how to evaluate LLMs on summarization tasks. Together, you’ll:

Measure accuracy on summary outputs
Compute ROUGE-N scores
Build a consistent framework for comparing different model sizes and architectures
Each part builds on the last, giving you a cohesive workflow for assessing and contrasting summarization performance.



Learning Objectives
Metric Understanding: Learn to compute ROUGE-N and understand its nuances.
Intuition Building: Develop an intuitive understanding of ROUGE-N and its application to summarization.
Comparative Analysis: Test and compare various LLMs and model sizes on a consistent dataset.


Download the dataset here.


Part I. Setup
Install Libraries:
pip install rouge_score==0.1.2
pip install evaluate
pip install -U accelerate --quiet
pip install datasets
pip install nltk
Download NLTK Resources:
nltk.download("punkt")
nltk.download("punkt_tab")


🌟 Part II : Dataset Loading and Exploration
Dataset Loading: Load the train.csv and test.csv datasets using pandas.
Sampling: Take a smaller sample of the datasets (e.g., 100 samples from train, 50 from test) to reduce computational load.
Exploration: Display the first example from the training sample, showing the article (prompt_text) and its reference summary (prompt_title).
Data Inspection: Print the sampled train and test DataFrames to understand the dataset structure.


🌟 Part III : Summarization with T5
Function Implementation: Implement the summarize_with_t5 function:
Use T5ForConditionalGeneration and AutoTokenizer from transformers.
Handle CUDA availability for GPU acceleration.
Implement batch processing using the batch_generator function.
Tokenize input articles with a “summarize: ” prefix.
Generate summaries using model.generate().
Decode generated token IDs back to text.
Clear CUDA cache (torch.cuda.empty_cache()) and garbage collect (gc.collect()) after each batch and at the end of the function.
Summary Generation: Generate summaries for the training sample using t5-small.
Result Display: Display the generated summaries alongside the reference summaries in a pandas DataFrame.


🌟 Part IV : Accuracy Evaluation
Accuracy Calculation: Calculate the accuracy of the t5-small summaries by comparing them to the reference summaries.
Result Interpretation: Print the calculated accuracy. Discuss why the accuracy is likely to be very low or zero, reinforcing the limitations of this metric.


🌟 Part V : ROUGE Metric Implementation
Metric Introduction: Introduce ROUGE (Recall-Oriented Understudy for Gisting Evaluation) as a standard metric for summarization.
Library Usage: Load the rouge evaluation metric using evaluate.load("rouge").
Preprocessing: Explain the need to format the input summaries with newlines between sentences, and the use of the nltk sentence tokenizer.
Function Definition: Create the compute_rouge_score function to calculate ROUGE scores, handling the necessary preprocessing.


🌟 Part VI : Understanding ROUGE Scores
Exact Match Test: Calculate ROUGE scores when the generated summaries are identical to the reference summaries.
Null Prediction Test: Calculate ROUGE scores when the generated summaries are empty.
Stemming Effect: Demonstrate the impact of stemming on ROUGE scores using simple examples.
N-gram Analysis: Explore how ROUGE-1 and ROUGE-2 scores change with varying degrees of overlap between generated and reference summaries.
Symmetry: Show the symmetry of rouge score with respect to predictions and references.


🌟 Part VII : Comparing Small and Large Models
Model Selection: Choose t5-small, t5-base, and gpt2 models.
Summary Generation: Generate summaries for the training sample using each model.
ROUGE Calculation: Calculate ROUGE scores for each model’s summaries using compute_rouge_score.
Per-Row ROUGE: Create the compute_rouge_per_row function to calculate and store ROUGE scores for each individual article in a DataFrame.
Result Display: Display the per-row ROUGE scores for each model.
GPT2 Specifics: implement the summarize_with_gpt2 function, handling the “TL;DR:” prompt, and the token length limitations.


🌟 Part VIII : Comparing All Models
Aggregation Function: Create the compare_models function to aggregate ROUGE scores for all models into a single DataFrame, showing average scores.
Summary Comparison Function: Create the compare_models_summaries function to display the generated summaries from all models side-by-side in a DataFrame.
Result Display: Display the aggregated ROUGE scores and the side-by-side summary comparisons.


#Part I. Setup

- Install Libraries:
- Download NLTK Resources:

In [None]:
import nltk

!pip install rouge_score==0.1.2
!pip install evaluate
!pip install -U accelerate --quiet
!pip install datasets
!pip install nltk
nltk.download("punkt")
nltk.download("punkt_tab")



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

#Part II : Dataset Loading and Dataset Exploration
- Loading: Load the train.csv and test.csv datasets using pandas.
-Sampling: Take a smaller sample of the datasets (e.g., 100 samples from train, 50 from test) to reduce computational load.
- Exploration: Display the first example from the training sample, showing the article (prompt_text) and its reference summary (prompt_title).
- Data Inspection: Print the sampled train and test DataFrames to understand the dataset structure.

In [None]:
from datasets import load_dataset

ds = load_dataset("abisee/cnn_dailymail", "1.0.0")

README.md: 0.00B [00:00, ?B/s]

train-00000-of-00003.parquet:   0%|          | 0.00/256M [00:00<?, ?B/s]

train-00001-of-00003.parquet:   0%|          | 0.00/257M [00:00<?, ?B/s]

train-00002-of-00003.parquet:   0%|          | 0.00/259M [00:00<?, ?B/s]

validation-00000-of-00001.parquet:   0%|          | 0.00/34.7M [00:00<?, ?B/s]

test-00000-of-00001.parquet:   0%|          | 0.00/30.0M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/287113 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/13368 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/11490 [00:00<?, ? examples/s]

In [None]:
import pandas as pd

# Load the dataset splits
train_dataset = ds['train']
test_dataset = ds['test']

# Convert to pandas DataFrames
train_df = train_dataset.to_pandas()
test_df = test_dataset.to_pandas()

# Sample the datasets
train_sample = train_df.sample(n=100, random_state=42)
test_sample = test_df.sample(n=50, random_state=42)

# Display the first example from the training sample
print("First example from training sample:")
display(train_sample.iloc[0])

# Print the sampled train and test DataFrames
print("\nSampled Training DataFrame:")
display(train_sample.head())

print("\nSampled Test DataFrame:")
display(test_sample.head())

First example from training sample:


Unnamed: 0,272581
article,Nasa has warned of an impending asteroid pass ...
highlights,2004 BL86 will pass about three times the dist...
id,6ccb7278e86893ad3609d30ecb5c9ea902fb9527



Sampled Training DataFrame:


Unnamed: 0,article,highlights,id
272581,Nasa has warned of an impending asteroid pass ...,2004 BL86 will pass about three times the dist...,6ccb7278e86893ad3609d30ecb5c9ea902fb9527
772,"BAGHDAD, Iraq (CNN) -- Iraq's most powerful Su...","Iraqi Islamic Party calls Quran incident ""blat...",d4f57e3c18c38696345fb7a3d76a151bb9c5123b
171868,By . David Kent . Andy Carroll has taken an un...,Carroll takes to Instagram to post selfie ahea...,c9ae9fc314adcc92d3835b0437a1c44e9e233e1c
63167,Los Angeles (CNN) -- Los Angeles has long been...,Pop stars from all over Europe are setting the...,5b5a383dc8f9487857787ced5426154394dd99db
68522,London (CNN) -- Few shows can claim such an au...,NEW: Young athletes light the Olympic cauldron...,2813505a990ad24071496c0d0936e40847eb6194



Sampled Test DataFrame:


Unnamed: 0,article,highlights,id
1516,Down Augusta way they say the azaleas are in f...,Justin Rose bounced back from Florida misery b...,58aefdc7ca85968aa11e16ea4099506cb474f759
1393,There was no special treatment for Lewis Fergu...,Lewis Ferguson fell from Merrion Square at Win...,8c2e48d24a3e2cf1be5d242f09ae34bf68ccbd6e
10560,When emergency crews received a call saying 's...,"Woman reported 'someone' had been run over, bu...",16269bfc102681f55a7fbfb6e26c7a52d982e09c
11457,A loving boyfriend has granted his girlfriend ...,"Guo Kai and girlfriend Dong Hui, 22, had plann...",18514a002a1a244a68a560c63c4471af98f72a73
647,(CNN)Sunday's announcement that Corinthian Col...,"David Wheeler: Corinthian, considered a ""preda...",9efbe27504b041e7f5e846a3c6898702c0e82427


# Part III : Summarization with T5
- Function Implementation:
- Implement the summarize_with_t5 function:
- Use T5ForConditionalGeneration and AutoTokenizer from transformers.
- Handle CUDA availability for GPU acceleration.
- Implement batch processing using the batch_generator function.
- Tokenize input articles with a “summarize: ” prefix.
- Generate summaries using model.generate().
- Decode generated token IDs back to text.
- Clear CUDA cache (torch.cuda.empty_cache()) and garbage collect (gc.collect()) after each batch and at the end of the function.
- Summary Generation: Generate summaries for the training sample using t5-small.
- Result Display: Display the generated summaries alongside the reference summaries in a pandas DataFrame.

In [1]:
import torch
from transformers import T5ForConditionalGeneration, AutoTokenizer
import gc
import math

def batch_generator(data, batch_size):
    for i in range(0, len(data), batch_size):
        yield data[i : i + batch_size]

def summarize_with_t5(articles, model_name="t5-small", batch_size=8):
    device = "cuda" if torch.cuda.is_available() else "cpu"
    print(f"Using device: {device}")

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = T5ForConditionalGeneration.from_pretrained(model_name).to(device)

    summaries = []
    for i, batch in enumerate(batch_generator(articles, batch_size)):
        print(f"Processing batch {i+1}/{math.ceil(len(articles)/batch_size)}")
        inputs = tokenizer(batch, return_tensors="pt", max_length=512, truncation=True, padding="max_length").to(device)
        with torch.no_grad():
            summary_ids = model.generate(inputs["input_ids"], num_beams=4, max_length=150, early_stopping=True)
        decoded_summaries = tokenizer.batch_decode(summary_ids, skip_special_tokens=True)
        summaries.extend(decoded_summaries)

        # Clear CUDA cache and collect garbage after each batch
        del inputs, summary_ids
        if device == "cuda":
            torch.cuda.empty_cache()
        gc.collect()

    # Clear CUDA cache and collect garbage at the end
    if device == "cuda":
        torch.cuda.empty_cache()
    gc.collect()

    return summaries

Now that the `summarize_with_t5` function is defined, we can use it to generate summaries for the training sample and display them alongside the reference summaries.

In [None]:
# Generate summaries for the training sample using t5-small
t5_small_summaries = summarize_with_t5(train_sample['article'].tolist(), model_name="t5-small")

# Display the generated summaries alongside the reference summaries
results_df = pd.DataFrame({
    'article': train_sample['article'].tolist(),
    'reference_summary': train_sample['highlights'].tolist(),
    'generated_summary_t5_small': t5_small_summaries
})

display(results_df[['reference_summary', 'generated_summary_t5_small']].head())

Using device: cpu


tokenizer_config.json:   0%|          | 0.00/2.32k [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.39M [00:00<?, ?B/s]

config.json:   0%|          | 0.00/1.21k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/242M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

Processing batch 1/13
Processing batch 2/13
Processing batch 3/13
Processing batch 4/13
Processing batch 5/13
Processing batch 6/13
Processing batch 7/13
Processing batch 8/13
Processing batch 9/13
Processing batch 10/13
Processing batch 11/13
Processing batch 12/13
Processing batch 13/13


Unnamed: 0,reference_summary,generated_summary_t5_small
0,2004 BL86 will pass about three times the dist...,asteroid 2004 BL86 will pass about three times...
1,"Iraqi Islamic Party calls Quran incident ""blat...",sniper section leader used a Quran for target ...
2,Carroll takes to Instagram to post selfie ahea...,Andy Carroll has taken an understandably glum-...
3,Pop stars from all over Europe are setting the...,a destination for artistic dreamers from Europ...
4,NEW: Young athletes light the Olympic cauldron...,"the opening ceremony in east London, organizer..."


# Part IV : Accuracy Evaluation
Accuracy Calculation:
- Calculate the accuracy of the t5-small summaries by comparing them to the reference summaries.
Result Interpretation:
- Print the calculated accuracy.
- Discuss why the accuracy is likely to be very low or zero, reinforcing the limitations of this metric.

In [None]:
# Calculate accuracy
# Accuracy for summarization is typically defined as the percentage of generated summaries that exactly match the reference summaries.
# Due to the nature of summarization, this is expected to be very low or zero.

exact_matches = (results_df['generated_summary_t5_small'] == results_df['reference_summary']).sum()
total_summaries = len(results_df)
accuracy = exact_matches / total_summaries

print(f"Exact Match Accuracy: {accuracy:.4f}")

Exact Match Accuracy: 0.0000


**Result Interpretation:**

As expected, the exact match accuracy is very low (likely 0%). This is because summarization is a generative task, and even good summaries will rarely be character-for-character identical to the reference summaries. Different models can produce valid summaries that convey the same information using different wording or sentence structures. Therefore, exact match accuracy is not a suitable metric for evaluating the quality of generated summaries. We will explore more appropriate metrics like ROUGE in the next steps.

# Part V: ROUGE Metric Implementation

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a standard set of metrics for evaluating automatic summarization and machine translation. It works by comparing an automatically produced summary or translation against a set of reference summaries or translations (typically human-produced).

For ROUGE to work correctly with summarization, the input summaries (both generated and reference) need to be preprocessed. This usually involves tokenizing the text into sentences and joining the sentences with newline characters. This is because ROUGE primarily compares overlapping units (like n-grams) at the sentence level. We will use the `nltk` library for sentence tokenization.

In [None]:
import evaluate
import nltk

# Load the ROUGE evaluation metric
rouge = evaluate.load("rouge")

def compute_rouge_score(predictions, references):
    """
    Computes ROUGE scores for a list of predictions and references.

    Args:
        predictions (list): A list of generated summaries (strings).
        references (list): A list of reference summaries (strings).

    Returns:
        dict: A dictionary containing the ROUGE scores.
    """
    # Preprocess summaries by tokenizing sentences and joining with newlines
    processed_predictions = ["\n".join(nltk.sent_tokenize(p)) for p in predictions]
    processed_references = ["\n".join(nltk.sent_tokenize(r)) for r in references]

    # Compute ROUGE scores
    rouge_scores = rouge.compute(predictions=processed_predictions, references=processed_references)

    return rouge_scores

Downloading builder script: 0.00B [00:00, ?B/s]

# Part VI : Understanding ROUGE Scores
Exact Match Test: Calculate ROUGE scores when the generated summaries are identical to the reference summaries.
- Null Prediction Test: Calculate ROUGE scores when the generated summaries are empty.
- Stemming Effect: Demonstrate the impact of stemming on ROUGE scores using simple examples.
- N-gram Analysis: Explore how ROUGE-1 and ROUGE-2 scores change with varying degrees of overlap between generated and reference summaries.
- Symmetry: Show the symmetry of rouge score with respect to predictions and references.

In [None]:
# Exact Match Test
print("--- Exact Match Test ---")
predictions_exact = ["This is a test summary."]
references_exact = ["This is a test summary."]
rouge_exact = compute_rouge_score(predictions_exact, references_exact)
print(f"ROUGE scores for exact match: {rouge_exact}")

--- Exact Match Test ---
ROUGE scores for exact match: {'rouge1': np.float64(1.0), 'rouge2': np.float64(1.0), 'rougeL': np.float64(1.0), 'rougeLsum': np.float64(1.0)}


In [None]:
# Null Prediction Test
print("\n--- Null Prediction Test ---")
predictions_null = [""]
references_null = ["This is a reference summary."]
rouge_null = compute_rouge_score(predictions_null, references_null)
print(f"ROUGE scores for null prediction: {rouge_null}")


--- Null Prediction Test ---
ROUGE scores for null prediction: {'rouge1': np.float64(0.0), 'rouge2': np.float64(0.0), 'rougeL': np.float64(0.0), 'rougeLsum': np.float64(0.0)}


In [None]:
# Stemming Effect (Demonstration using simple examples)
# ROUGE uses stemming by default, which can affect scores.
print("\n--- Stemming Effect Demonstration ---")
predictions_stemming = ["The cat is jumping."]
references_stemming = ["The cat is jump."] # 'jumping' and 'jump' might stem to the same root
rouge_stemming = compute_rouge_score(predictions_stemming, references_stemming)
print(f"ROUGE scores with potential stemming effect: {rouge_stemming}")

predictions_stemming_no = ["The cats are jumping."]
references_stemming_no = ["The cat is jumping."] # Different words, less likely to stem the same
rouge_stemming_no = compute_rouge_score(predictions_stemming_no, references_stemming_no)
print(f"ROUGE scores without strong stemming effect: {rouge_stemming_no}")


--- Stemming Effect Demonstration ---
ROUGE scores with potential stemming effect: {'rouge1': np.float64(0.75), 'rouge2': np.float64(0.6666666666666666), 'rougeL': np.float64(0.75), 'rougeLsum': np.float64(0.75)}
ROUGE scores without strong stemming effect: {'rouge1': np.float64(0.5), 'rouge2': np.float64(0.0), 'rougeL': np.float64(0.5), 'rougeLsum': np.float64(0.5)}


In [None]:
# N-gram Analysis (ROUGE-1 and ROUGE-2)
print("\n--- N-gram Analysis ---")
predictions_ngram = ["The quick brown fox jumps over the lazy dog."]
references_ngram = ["The quick brown fox."] # Partial overlap
rouge_ngram_partial = compute_rouge_score(predictions_ngram, references_ngram)
print(f"ROUGE scores for partial overlap (ROUGE-1 and ROUGE-2): {rouge_ngram_partial}")

predictions_ngram_more = ["The quick brown fox jumps."]
references_ngram_more = ["The quick brown fox jumps over the lazy dog."] # More overlap
rouge_ngram_more = compute_rouge_score(predictions_ngram_more, references_ngram_more)
print(f"ROUGE scores for more overlap (ROUGE-1 and ROUGE-2): {rouge_ngram_more}")


--- N-gram Analysis ---
ROUGE scores for partial overlap (ROUGE-1 and ROUGE-2): {'rouge1': np.float64(0.6153846153846153), 'rouge2': np.float64(0.5454545454545454), 'rougeL': np.float64(0.6153846153846153), 'rougeLsum': np.float64(0.6153846153846153)}
ROUGE scores for more overlap (ROUGE-1 and ROUGE-2): {'rouge1': np.float64(0.7142857142857143), 'rouge2': np.float64(0.6666666666666666), 'rougeL': np.float64(0.7142857142857143), 'rougeLsum': np.float64(0.7142857142857143)}


In [None]:
# Symmetry Test
print("\n--- Symmetry Test ---")
predictions_sym = ["Summary A"]
references_sym = ["Summary B"]
rouge_ab = compute_rouge_score(predictions_sym, references_sym)
print(f"ROUGE scores (A vs B): {rouge_ab}")

predictions_sym_rev = ["Summary B"]
references_sym_rev = ["Summary A"]
rouge_ba = compute_rouge_score(predictions_sym_rev, references_sym_rev)
print(f"ROUGE scores (B vs A): {rouge_ba}")


--- Symmetry Test ---
ROUGE scores (A vs B): {'rouge1': np.float64(0.5), 'rouge2': np.float64(0.0), 'rougeL': np.float64(0.5), 'rougeLsum': np.float64(0.5)}
ROUGE scores (B vs A): {'rouge1': np.float64(0.5), 'rouge2': np.float64(0.0), 'rougeL': np.float64(0.5), 'rougeLsum': np.float64(0.5)}


In [None]:
# Generate summaries for the training sample using t5-base
t5_base_summaries = summarize_with_t5(train_sample['article'].tolist(), model_name="t5-base")

# Add t5-base summaries to the results DataFrame
results_df['generated_summary_t5_base'] = t5_base_summaries

display(results_df[['reference_summary', 'generated_summary_t5_small', 'generated_summary_t5_base']].head())

Using device: cpu


config.json:   0%|          | 0.00/1.21k [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.39M [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/892M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

Processing batch 1/13
Processing batch 2/13
Processing batch 3/13
Processing batch 4/13
Processing batch 5/13
Processing batch 6/13
Processing batch 7/13
Processing batch 8/13
Processing batch 9/13
Processing batch 10/13
Processing batch 11/13
Processing batch 12/13
Processing batch 13/13


Unnamed: 0,reference_summary,generated_summary_t5_small,generated_summary_t5_base
0,2004 BL86 will pass about three times the dist...,asteroid 2004 BL86 will pass about three times...,it will pass about three times the distance of...
1,"Iraqi Islamic Party calls Quran incident ""blat...",sniper section leader used a Quran for target ...,a sniper section leader used a Quran for targe...
2,Carroll takes to Instagram to post selfie ahea...,Andy Carroll has taken an understandably glum-...,england striker takes glum-looking selfie in h...
3,Pop stars from all over Europe are setting the...,a destination for artistic dreamers from Europ...,"""Los Angeles is my second home now,"" says t.a...."
4,NEW: Young athletes light the Olympic cauldron...,"the opening ceremony in east London, organizer...","few shows can claim such an audience. ""Isles o..."


#Part VII : Comparing Small and Large Models
Model Selection:

- Choose t5-small, t5-base, and gpt2 models. Summary Generation:
- Generate summaries for the training sample using each model. ROUGE Calculation:
- Calculate ROUGE scores for each model’s summaries using compute_rouge_score. Per-Row ROUGE:
- Create the compute_rouge_per_row function to calculate and store ROUGE scores for each individual article in a DataFrame.
- Result Display:
- Display the per-row ROUGE scores for each model.
- GPT2 Specifics: implement the summarize_with_gpt2 function, handling the “TL;DR:” prompt, and the token length limitations.

In [2]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer

def summarize_with_gpt2(articles, model_name="gpt2", batch_size=8):
    device = "cuda" if torch.cuda.is_available() else "cpu"
    print(f"Using device: {device}")

    tokenizer = GPT2Tokenizer.from_pretrained(model_name)
    # Add a padding token to the tokenizer and resize the model embedding layer
    tokenizer.pad_token = tokenizer.eos_token
    model = GPT2LMHeadModel.from_pretrained(model_name).to(device)
    model.resize_token_embeddings(len(tokenizer))

    summaries = []
    for i, batch in enumerate(batch_generator(articles, batch_size)):
        print(f"Processing batch {i+1}/{math.ceil(len(articles)/batch_size)}")
        # Prepend a prompt for summarization, e.g., "article: [article] TL;DR:"
        inputs = tokenizer(["article: " + art + " TL;DR:" for art in batch], return_tensors="pt", max_length=512, truncation=True, padding="max_length").to(device)
        with torch.no_grad():
            # Generate summaries, limiting the length to avoid overly long outputs
            summary_ids = model.generate(
                inputs["input_ids"],
                attention_mask=inputs["attention_mask"],
                num_beams=4,
                max_length=inputs["input_ids"].shape[-1] + 100, # Generate up to 100 new tokens
                early_stopping=True,
                pad_token_id=tokenizer.pad_token_id
            )
        # Decode generated tokens, skipping the input prompt
        decoded_summaries = [
            tokenizer.decode(g, skip_special_tokens=True).split(" TL;DR:")[-1].strip()
            for g in summary_ids
        ]
        summaries.extend(decoded_summaries)

        # Clear CUDA cache and collect garbage after each batch
        del inputs, summary_ids
        if device == "cuda":
            torch.cuda.empty_cache()
        gc.collect()

    # Clear CUDA cache and collect garbage at the end
    if device == "cuda":
        torch.cuda.empty_cache()
    gc.collect()

    return summaries

In [None]:
# Generate summaries for the training sample using gpt2
gpt2_summaries = summarize_with_gpt2(train_sample['article'].tolist(), model_name="gpt2")

# Add gpt2 summaries to the results DataFrame
results_df['generated_summary_gpt2'] = gpt2_summaries

display(results_df[['reference_summary', 'generated_summary_t5_small', 'generated_summary_t5_base', 'generated_summary_gpt2']].head())

Using device: cpu


tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

config.json:   0%|          | 0.00/665 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/548M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Processing batch 1/13


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Processing batch 2/13


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Processing batch 3/13


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Processing batch 4/13
Processing batch 5/13


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Processing batch 6/13


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Processing batch 7/13


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Processing batch 8/13


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Processing batch 9/13


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Processing batch 10/13
Processing batch 11/13


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Processing batch 12/13


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Processing batch 13/13


Unnamed: 0,reference_summary,generated_summary_t5_small,generated_summary_t5_base,generated_summary_gpt2
0,2004 BL86 will pass about three times the dist...,asteroid 2004 BL86 will pass about three times...,it will pass about three times the distance of...,article: Nasa has warned of an impending aster...
1,"Iraqi Islamic Party calls Quran incident ""blat...",sniper section leader used a Quran for target ...,a sniper section leader used a Quran for targe...,"article: BAGHDAD, Iraq (CNN) -- Iraq's most po..."
2,Carroll takes to Instagram to post selfie ahea...,Andy Carroll has taken an understandably glum-...,england striker takes glum-looking selfie in h...,article: By . David Kent . Andy Carroll has ta...
3,Pop stars from all over Europe are setting the...,a destination for artistic dreamers from Europ...,"""Los Angeles is my second home now,"" says t.a....",article: Los Angeles (CNN) -- Los Angeles has ...
4,NEW: Young athletes light the Olympic cauldron...,"the opening ceremony in east London, organizer...","few shows can claim such an audience. ""Isles o...",article: London (CNN) -- Few shows can claim s...


In [None]:
def compute_rouge_per_row(df, prediction_col, reference_col):
    """
    Computes ROUGE scores for each row in a DataFrame.

    Args:
        df (pd.DataFrame): The input DataFrame.
        prediction_col (str): The name of the column containing generated summaries.
        reference_col (str): The name of the column containing reference summaries.

    Returns:
        pd.DataFrame: A DataFrame with per-row ROUGE scores.
    """
    rouge_scores_list = []
    for index, row in df.iterrows():
        scores = compute_rouge_score([row[prediction_col]], [row[reference_col]])
        rouge_scores_list.append(scores)

    # Convert list of dictionaries to a DataFrame
    rouge_df = pd.DataFrame(rouge_scores_list)
    # Rename columns for clarity
    rouge_df.columns = [f'{col}_{prediction_col}' for col in rouge_df.columns]
    return rouge_df

# Compute per-row ROUGE scores for each model
rouge_scores_t5_small_df = compute_rouge_per_row(results_df, 'generated_summary_t5_small', 'reference_summary')
rouge_scores_t5_base_df = compute_rouge_per_row(results_df, 'generated_summary_t5_base', 'reference_summary')
rouge_scores_gpt2_df = compute_rouge_per_row(results_df, 'generated_summary_gpt2', 'reference_summary')

# Concatenate the per-row ROUGE scores to the results DataFrame
results_df = pd.concat([results_df, rouge_scores_t5_small_df, rouge_scores_t5_base_df, rouge_scores_gpt2_df], axis=1)

# Display the per-row ROUGE scores for each model (first few rows)
print("\nPer-row ROUGE scores:")
display(results_df[[
    'rouge1_generated_summary_t5_small', 'rouge2_generated_summary_t5_small', 'rougeL_generated_summary_t5_small', 'rougeLsum_generated_summary_t5_small',
    'rouge1_generated_summary_t5_base', 'rouge2_generated_summary_t5_base', 'rougeL_generated_summary_t5_base', 'rougeLsum_generated_summary_t5_base',
    'rouge1_generated_summary_gpt2', 'rouge2_generated_summary_gpt2', 'rougeL_generated_summary_gpt2', 'rougeLsum_generated_summary_gpt2'
]].head())


Per-row ROUGE scores:


Unnamed: 0,rouge1_generated_summary_t5_small,rouge2_generated_summary_t5_small,rougeL_generated_summary_t5_small,rougeLsum_generated_summary_t5_small,rouge1_generated_summary_t5_base,rouge2_generated_summary_t5_base,rougeL_generated_summary_t5_base,rougeLsum_generated_summary_t5_base,rouge1_generated_summary_gpt2,rouge2_generated_summary_gpt2,rougeL_generated_summary_gpt2,rougeLsum_generated_summary_gpt2
0,0.468085,0.304348,0.425532,0.468085,0.533333,0.318182,0.4,0.511111,0.153846,0.139706,0.153846,0.153846
1,0.213592,0.09901,0.174757,0.213592,0.277228,0.121212,0.19802,0.257426,0.151163,0.077821,0.108527,0.147287
2,0.425,0.153846,0.3,0.425,0.289855,0.149254,0.231884,0.289855,0.107209,0.055659,0.085028,0.103512
3,0.410714,0.163636,0.25,0.357143,0.372549,0.18,0.215686,0.352941,0.151571,0.115028,0.121996,0.147874
4,0.166667,0.0,0.111111,0.148148,0.350515,0.084211,0.206186,0.329897,0.089219,0.022388,0.066914,0.081784


 # Part VIII : Comparing All Models Aggregation
 Function:
 - Create the compare_models function to aggregate ROUGE scores for all models into a single DataFrame, showing average scores.

Summary Comparison

Function:
 - Create the compare_models_summaries function to display the generated summaries from all models side-by-side in a DataFrame.

 Result Display:
 - Display the aggregated ROUGE scores and the side-by-side summary comparisons.

In [None]:
def compare_models(results_df):
    """
    Aggregates ROUGE scores for all models into a single DataFrame.

    Args:
        results_df (pd.DataFrame): DataFrame containing per-row ROUGE scores for all models.

    Returns:
        pd.DataFrame: DataFrame with average ROUGE scores for each model.
    """
    # Select only the ROUGE score columns
    rouge_cols = [col for col in results_df.columns if col.startswith('rouge')]

    # Calculate the mean of the ROUGE scores for each model
    # Group columns by model name (e.g., 'generated_summary_t5_small')
    model_rouge_scores = {}
    for col in rouge_cols:
        # Extract model name from column name (e.g., 'rouge1_generated_summary_t5_small' -> 'generated_summary_t5_small')
        parts = col.split('_')
        # Assuming the format is rougeX_generated_summary_model_name
        if len(parts) > 2:
            model_name = '_'.join(parts[1:])
            if model_name not in model_rouge_scores:
                model_rouge_scores[model_name] = []
            model_rouge_scores[model_name].append(col) # Store column name, not the Series

    aggregated_scores = {}
    for model, cols_names in model_rouge_scores.items():
         # Calculate mean for each type of ROUGE score (rouge1, rouge2, rougeL, rougeLsum)
        avg_scores = {}
        for rouge_type in ['rouge1', 'rouge2', 'rougeL', 'rougeLsum']:
            # Filter column names for the current ROUGE type
            cols_for_type = [col_name for col_name in cols_names if col_name.startswith(f'{rouge_type}_')]
            if cols_for_type:
                # Calculate the mean of the selected columns from the original DataFrame
                avg_scores[f'{rouge_type}_{model}'] = results_df[cols_for_type].mean().mean() # Calculate mean across rows and then across columns if multiple for same type

        aggregated_scores[model] = avg_scores

    # Convert aggregated scores to a DataFrame and transpose for better readability
    aggregated_df = pd.DataFrame(aggregated_scores).T
    return aggregated_df


def compare_models_summaries(results_df):
    """
    Displays the generated summaries from all models side-by-side.

    Args:
        results_df (pd.DataFrame): DataFrame containing generated summaries for all models.

    Returns:
        pd.DataFrame: DataFrame with selected summary columns.
    """
    summary_cols = [col for col in results_df.columns if col.startswith('generated_summary')]
    # Include the reference summary for comparison
    display_cols = ['reference_summary'] + summary_cols
    return results_df[display_cols]

# Display the aggregated ROUGE scores
print("Aggregated ROUGE Scores:")
aggregated_rouge_df = compare_models(results_df)
display(aggregated_rouge_df)

# Display the side-by-side summary comparisons (first few rows)
print("\nSide-by-Side Summary Comparisons:")
display(compare_models_summaries(results_df).head())

Aggregated ROUGE Scores:


Unnamed: 0,rouge1_generated_summary_t5_small,rouge2_generated_summary_t5_small,rougeL_generated_summary_t5_small,rougeLsum_generated_summary_t5_small,rouge1_generated_summary_t5_base,rouge2_generated_summary_t5_base,rougeL_generated_summary_t5_base,rougeLsum_generated_summary_t5_base,rouge1_generated_summary_gpt2,rouge2_generated_summary_gpt2,rougeL_generated_summary_gpt2,rougeLsum_generated_summary_gpt2
generated_summary_t5_small,0.340585,0.15014,0.239392,0.315226,,,,,,,,
generated_summary_t5_base,,,,,0.357172,0.16769,0.260669,0.333105,,,,
generated_summary_gpt2,,,,,,,,,0.165005,0.079034,0.114599,0.157517



Side-by-Side Summary Comparisons:


Unnamed: 0,reference_summary,generated_summary_t5_small,generated_summary_t5_base,generated_summary_gpt2
0,2004 BL86 will pass about three times the dist...,asteroid 2004 BL86 will pass about three times...,it will pass about three times the distance of...,article: Nasa has warned of an impending aster...
1,"Iraqi Islamic Party calls Quran incident ""blat...",sniper section leader used a Quran for target ...,a sniper section leader used a Quran for targe...,"article: BAGHDAD, Iraq (CNN) -- Iraq's most po..."
2,Carroll takes to Instagram to post selfie ahea...,Andy Carroll has taken an understandably glum-...,england striker takes glum-looking selfie in h...,article: By . David Kent . Andy Carroll has ta...
3,Pop stars from all over Europe are setting the...,a destination for artistic dreamers from Europ...,"""Los Angeles is my second home now,"" says t.a....",article: Los Angeles (CNN) -- Los Angeles has ...
4,NEW: Young athletes light the Olympic cauldron...,"the opening ceremony in east London, organizer...","few shows can claim such an audience. ""Isles o...",article: London (CNN) -- Few shows can claim s...


# **Conclusion**

Based on the aggregated ROUGE scores and the side-by-side summary comparisons, here's a conclusion about the performance of the models:

*   **T5-base generally outperforms T5-small and GPT-2** across all ROUGE metrics (ROUGE-1, ROUGE-2, ROUGE-L, and ROUGE-Lsum). This suggests that the larger T5 model is better at capturing both unigram and bigram overlap with the reference summaries, as well as sentence-level and overall longest common subsequence matches.
*   **T5 models perform significantly better than GPT-2** for this summarization task. This is expected because T5 is a text-to-text model specifically designed for tasks like summarization, while GPT-2 is a language model primarily used for text generation. The side-by-side comparisons also visually confirm that the T5 models produce more coherent and relevant summaries compared to GPT-2, which often includes parts of the original article or generates less focused text.
*   **Exact match accuracy is not a useful metric** for evaluating generative summarization models, as demonstrated earlier. ROUGE scores provide a more nuanced evaluation by measuring the overlap of n-grams and subsequences.

In summary, for this dataset and summarization task, the T5 models, particularly T5-base, are more effective than GPT-2, and ROUGE is a more appropriate evaluation metric than exact match accuracy.