# Assignment 3: Summarization with LLMs

**Description:** This assignment covers summarization, the process of generating an abridged version of the input. With the ascendance of LLMs, we have a new way of generating summaries: rather than fine-tuning a model to generate summaries, we can simply provide explicit instructions for the summary we want the model to generate. By finishing this assignment you should also be able to develop an intuition for:


* How well summarization systems work
* The effects of hyperparameters on outcomes
* The effects of prompts on the output of an LLM
* Evaluation of output using ROUGE



This notebook must be run on Google Colab because it requires a GPU. By default, Colab will configure a GPU when you open the notebook. Summarization commands can take up to five minutes to run, depending on the hyperparameters you use.


[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/datasci-w266/2025-fall-main/blob/master/assignment/a3/SummarizationLLM_test.ipynb)

The overall assignment structure is as follows:

 Setup

1. Gemma 2 for abstractive summarization




**INSTRUCTIONS:**

* Questions are always indicated as **QUESTION:**, so you can search for this string to make sure you answered all of the questions. You are expected to fill out, run, and submit this notebook, as well as to answer the questions in the **answers** file as you did in a1 and a2.

* **### YOUR CODE HERE** indicates that you are supposed to write code.

* In order to complete the assignment with the Gemma model you will need an account on [Hugging Face](https://huggingface.co), which is free. Once you have the account, create an access token: go to Access Tokens under your profile and generate a token with write permissions for Colab. Copy that token and add it to the secrets in your Colab account with the name `HF_TOKEN` and your access token string as the value.

* In addition, you will need to visit the [Model Card for the Gemma 2 model](https://huggingface.co/google/gemma-2-9b-it). At the top of the page you will see a notice saying you need to request permission to use the model. While logged in to your Hugging Face account, click the button to request permission. Approval can sometimes take 10 to 15 minutes. Once you are approved, the message on the Model Card will change to indicate that you have been granted access to the model.
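
Once the secret is saved, the token can be read from inside the notebook. The sketch below is one possible way to do that, assuming `google.colab.userdata` is available (it is only importable inside a Colab runtime). The helper name `get_hf_token` and the environment-variable fallback are illustrative and not part of the assignment.

```python
import os

def get_hf_token(name="HF_TOKEN"):
    """Return the token from Colab secrets if possible, else from the environment.

    `get_hf_token` is a hypothetical helper written for this example.
    """
    try:
        # google.colab can only be imported inside a Colab runtime.
        from google.colab import userdata
        return userdata.get(name)
    except Exception:
        # Outside Colab, or if the secret is missing or notebook access is off,
        # fall back to an environment variable of the same name.
        return os.environ.get(name)

token = get_hf_token()
print("Token found" if token else "No token found")
```

If this prints `No token found` in Colab, it is worth checking that the secret is named exactly `HF_TOKEN` and that notebook access is enabled for it in the secrets panel.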


## Setup

In [1]:
!pip install -q -U transformers
!pip install -q -U accelerate
!pip install -q -U bitsandbytes
!pip install -q -U flash_attn
!pip install -q -U datasets==3.6.0

  Preparing metadata (setup.py) ... done
  Building wheel for flash_attn (setup.py) ... done

In [2]:
#help track which versions of libraries we're using
!pip list | grep transformers
!pip list | grep accelerate
!pip list | grep bitsandbytes
!pip list | grep datasets

sentence-transformers                    5.1.1
transformers                             4.57.1
accelerate                               1.10.1
bitsandbytes                             0.48.1
datasets                                 3.6.0
tensorflow-datasets                      4.9.9
vega-datasets                            0.9.0


In [3]:
import datasets
from transformers import pipeline, BitsAndBytesConfig
import bitsandbytes as bnb
import torch
import random
import pandas as pd
from tqdm import tqdm


In [4]:
!pip install -q evaluate
import evaluate

In [5]:
!pip install -q rouge_score

  Preparing metadata (setup.py) ... done
  Building wheel for rouge_score (setup.py) ... done


In [6]:
#let's make longer output readable without horizontal scrolling
from pprint import pprint

Now let's get the data we're going to use.

In [7]:
import random

from datasets import load_dataset

def load_and_sample_dataset(num_samples=11):
    """
    Load the X-Sum test split and return `num_samples` randomly chosen records.
    """
    dataset = load_dataset("EdinburghNLP/xsum", split="test")
    # Draw distinct row indices; seed `random` beforehand for a reproducible sample
    selected_indices = random.sample(range(len(dataset)), num_samples)
    selected_samples = dataset.select(selected_indices)
    return selected_samples

In [8]:
# Set random seed for reproducibility
random.seed(42)
torch.manual_seed(42)

# Load dataset
print("Loading dataset...")
dataset = load_and_sample_dataset()

Loading dataset...


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


README.md: 0.00B [00:00, ?B/s]

xsum.py: 0.00B [00:00, ?B/s]

default/train/0000.parquet:   0%|          | 0.00/304M [00:00<?, ?B/s]

default/validation/0000.parquet:   0%|          | 0.00/16.7M [00:00<?, ?B/s]

default/test/0000.parquet:   0%|          | 0.00/17.0M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/204045 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/11332 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/11334 [00:00<?, ? examples/s]

In [9]:
display(dataset)

Dataset({
    features: ['document', 'summary', 'id'],
    num_rows: 11
})
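
The sampling in `load_and_sample_dataset` relies on Python's `random` module, so seeding it before the call fixes which records are drawn. The sketch below illustrates that behavior in isolation; `sample_indices` is a hypothetical helper, and 11334 is the test split size shown in the download log above.

```python
import random

def sample_indices(population_size, num_samples, seed):
    """Seed Python's RNG, then draw `num_samples` distinct indices."""
    random.seed(seed)
    return random.sample(range(population_size), num_samples)

first = sample_indices(11334, 11, seed=42)
second = sample_indices(11334, 11, seed=42)

# The same seed reproduces the same selection of records.
print(first == second)
print(len(set(first)))
```

Note that `torch.manual_seed` does not influence `random.sample`; it matters later, when the model samples tokens during generation.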

What do our input documents look like?  Let's see the first of them.

In [10]:
dataset[0]['document']

'But Ms Atwell has promised to donate the money from the Global Teacher Prize to the school that she founded.\nThere was one UK representative in the top 10, Richard Spencer, who teaches science in Middlesbrough.\nThe prize was created to raise the status of teaching.\nThe winner of the inaugural Global Teacher Prize, who received her award at the Global Education and Skills Forum in Dubai on Sunday, was recognised for her work in teaching reading and writing.\nOn receiving the award, she said it was a "privilege" to work as a teacher and to help young people.\nGiving away the prize money was "not being selfless, but being committed to public service", she said.\nFormer US president, Bill Clinton, told the audience that he could still remember almost all the names of his teachers and that the prize would help to remind the public of the importance of the profession.\nIt was "critically important" to "attract the best people into teaching" and to hold them in "high regard", said Mr Clin

And what does the corresponding summary look like?  This is our target.

In [11]:
dataset[0]['summary']

"Nancie Atwell, an English teacher from Maine in the United States, has been named as the winner of a competition to find the world's best teacher, with a prize of $1m (Â£680,000)."

We'll also take advantage of a Hugging Face abstraction called a pipeline.  It is an easy way of experimenting with a model in inference mode.  We'll use that here to experiment with prompts (and possibly some hyperparameters) to improve the quality of our results.

It takes a while to load this model -- on the order of ten minutes -- but once it is loaded you can keep reusing the loaded model and improve your prompt.



In [13]:
from huggingface_hub import login

# Login to Hugging Face
login()


In [14]:
"""
Initialize the pipeline with bitsandbytes quantization
"""
# Configure bitsandbytes for 4-bit quantization
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

# Initialize pipeline
model_id = "google/gemma-2-9b-it"

summarizer = pipeline(
   "text-generation",
   model=model_id,
   model_kwargs={"dtype": torch.bfloat16, "quantization_config": quantization_config},
   device_map="auto",
   trust_remote_code=True,
)

config.json:   0%|          | 0.00/857 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/39.1k [00:00<?, ?B/s]

Fetching 4 files:   0%|          | 0/4 [00:00<?, ?it/s]

model-00001-of-00004.safetensors:   0%|          | 0.00/4.90G [00:00<?, ?B/s]

model-00003-of-00004.safetensors:   0%|          | 0.00/4.96G [00:00<?, ?B/s]

model-00002-of-00004.safetensors:   0%|          | 0.00/4.95G [00:00<?, ?B/s]

model-00004-of-00004.safetensors:   0%|          | 0.00/3.67G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/173 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/47.0k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/4.24M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.5M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/636 [00:00<?, ?B/s]

Device set to use cuda:0
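
To see why 4-bit quantization matters on a Colab GPU, a rough back-of-the-envelope estimate helps. The sketch below uses the shard sizes from the download log above and assumes the checkpoint is stored in 16-bit precision (2 bytes per parameter). It ignores quantization constants, activations, and any layers that stay in higher precision, so the real footprint will be somewhat larger.

```python
# Shard sizes in GB, copied from the download log above.
shard_sizes_gb = [4.90, 4.96, 4.95, 3.67]
bf16_total_gb = sum(shard_sizes_gb)

BYTES_PER_PARAM_BF16 = 2    # assumption: checkpoint stored as 16-bit weights
BYTES_PER_PARAM_4BIT = 0.5  # 4 bits per weight, ignoring quantization overhead

# Billions of parameters implied by the checkpoint size.
approx_params_billion = bf16_total_gb / BYTES_PER_PARAM_BF16
approx_4bit_gb = approx_params_billion * BYTES_PER_PARAM_4BIT

print(f"16-bit checkpoint: {bf16_total_gb:.2f} GB")
print(f"Implied parameters: ~{approx_params_billion:.2f} billion")
print(f"Rough 4-bit weight footprint: ~{approx_4bit_gb:.2f} GB")
```

Under these assumptions the weights shrink to roughly a quarter of their 16-bit size.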


As a reminder, here's the record we're dealing with.

In [15]:
dataset[0]

{'document': 'But Ms Atwell has promised to donate the money from the Global Teacher Prize to the school that she founded.\nThere was one UK representative in the top 10, Richard Spencer, who teaches science in Middlesbrough.\nThe prize was created to raise the status of teaching.\nThe winner of the inaugural Global Teacher Prize, who received her award at the Global Education and Skills Forum in Dubai on Sunday, was recognised for her work in teaching reading and writing.\nOn receiving the award, she said it was a "privilege" to work as a teacher and to help young people.\nGiving away the prize money was "not being selfless, but being committed to public service", she said.\nFormer US president, Bill Clinton, told the audience that he could still remember almost all the names of his teachers and that the prize would help to remind the public of the importance of the profession.\nIt was "critically important" to "attract the best people into teaching" and to hold them in "high regard",

Let's generate a single summary so we can see what it looks like.

In [16]:
prompt = [
    {"role": "user", "content": "Generate a summary of this text: " + dataset[0]['document']}
]

outputs = summarizer(
    prompt,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.3,
    top_p=0.95,
)

# The pipeline returns the whole conversation; the last message is the model's reply
summary = outputs[0]["generated_text"][-1]

Let's see what the generated summary looks like.

In [17]:
summary

{'role': 'assistant',
 'content': "The inaugural Global Teacher Prize, awarded at the Global Education and Skills Forum in Dubai, was won by Nancie Atwell, a US teacher known for her work in improving reading and writing instruction. Atwell, founder of the Center for Teaching and Learning in Maine, plans to donate the prize money to her school. \n\nThe prize, created by the Varkey Foundation, aims to raise the status of teaching by highlighting the profession's importance and attracting talented individuals. Former US President Bill Clinton emphasized the need to value teachers and attract the best to the profession. \n\nOther finalists included teachers from diverse backgrounds and countries, recognized for their innovative teaching methods and commitment to their students. The prize attracted global attention, with support from prominent figures like Bill Gates and UN Secretary-General Ban Ki-Moon.  \n\n\nThe Global Teacher Prize serves as a platform to celebrate exceptional educator
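
The output above is a message dictionary, not a plain string, which is why later cells read `summary['content']`. The sketch below mimics that structure with made-up data to show how the reply text is extracted. `mock_outputs` and `extract_reply` are hypothetical names used only for illustration.

```python
# Hypothetical stand-in for the pipeline's return value on a chat-style prompt:
# one dict per input, whose "generated_text" holds the whole conversation.
mock_outputs = [
    {
        "generated_text": [
            {"role": "user", "content": "Generate a summary of this text: ..."},
            {"role": "assistant", "content": "A short summary."},
        ]
    }
]

def extract_reply(outputs):
    """Return the text of the final message in the first conversation."""
    last_message = outputs[0]["generated_text"][-1]
    return last_message["content"]

print(extract_reply(mock_outputs))
```

Passing the full dictionary to a metric or storing it in a results table keeps the `role` key alongside the text, so extracting `content` first is usually what you want.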

How does it compare with the reference? Let's compare your candidate and the reference using the ROUGE metric.

In [18]:
rouge = evaluate.load('rouge')

print("Generating summaries and calculating ROUGE scores...")

# Score the single generated summary against its reference.
# `references` is a list of lists because each prediction may have several references.
predictions = [summary['content']]
references = [[dataset[0]['summary']]]
rouge_scores = rouge.compute(predictions=predictions, references=references)
rouge_scores

Downloading builder script: 0.00B [00:00, ?B/s]

Generating summaries and calculating ROUGE scores...


{'rouge1': np.float64(0.2072538860103627),
 'rouge2': np.float64(0.010471204188481674),
 'rougeL': np.float64(0.12435233160621763),
 'rougeLsum': np.float64(0.15544041450777205)}
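
To build intuition for these numbers, here is a simplified sketch of ROUGE-N as n-gram overlap between a prediction and a reference. It is not the `rouge_score` implementation: it assumes lowercase whitespace tokenization with no stemming, and the helper names and example sentences are made up for illustration.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-grams in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def rouge_n(prediction, reference, n=1):
    """Simplified ROUGE-N: clipped n-gram overlap with precision, recall, F1."""
    pred_counts = Counter(ngrams(prediction.lower().split(), n))
    ref_counts = Counter(ngrams(reference.lower().split(), n))
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((pred_counts & ref_counts).values())
    pred_total = sum(pred_counts.values())
    ref_total = sum(ref_counts.values())
    precision = overlap / pred_total if pred_total else 0.0
    recall = overlap / ref_total if ref_total else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

reference = "the cat sat on the mat"
short_pred = "the cat sat"
long_pred = "the cat sat quietly in the corner of the room all day"

print(rouge_n(short_pred, reference))
print(rouge_n(long_pred, reference))
```

In this toy example the longer prediction has higher recall but lower precision and a lower F1. That is one plausible reason a multi-paragraph summary scores poorly against a one-sentence reference.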

Now it's your turn.  Improve the prompt below so that the ROUGE scores, averaged over the entire data sample of 11 records, exceed these thresholds:
* Rouge-1 > 0.2
* Rouge-2 > 0.03
* Rouge-L > 0.15

You may use sampling with top-k or top-p and temperature if you like, but the prompt will have the greatest effect on your output.  Make your instructions as specific as possible: these LLMs are trained to follow instructions, so be very specific in your request.  Individual words can make a large difference, so take a little time to experiment with synonyms and alternate ways of phrasing things.
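
For intuition about what temperature and top-p do, the sketch below applies both to a made-up set of four token scores. It is a simplified illustration, not the Transformers implementation, and the function names and logits are assumptions chosen for the example.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    # Renormalize so the surviving probabilities sum to 1.
    return {i: probs[i] / mass for i in kept}

logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for four candidate tokens
for t in (0.3, 1.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs], sorted(top_p_filter(probs, 0.95)))
```

With a low temperature the distribution concentrates on the top token, so top-p keeps fewer candidates and the output becomes more deterministic.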

In [19]:
# Store results for aggregate scoring
results = []

Enter your prompt in the space below and then run the code.  

In [20]:
dataset[6]

{'document': 'The concrete and steel arch will eventually cover the remains of the reactor which lost its roof in a catastrophic explosion in 1986.\nThe blast sent a plume of radioactive material into the air, triggering a public health emergency across Europe.\nThe shield is designed to prevent further radioactive material leaking out over the next century.\nIt measures 275m (900ft) wide and 108m (354ft) tall and has cost $1.6bn (Â£1.3bn) to construct.\nThe European Bank for Reconstruction and Development (EBRD), which is leading the project, describes the arch as the largest moveable land-based structure ever built.\nContaining the world\'s worst nuclear accident\nUkraine marks Chernobyl 30th anniversary\nIn pictures: Chernobyl\'s eerie exclusion zone\nIt began moving on Monday using a system of hydraulic jacks and will take about five days to be put in its final position.\nWork will then begin to safely dismantle the reactor, which has been sealed inside a so-called sarcophagus, and

In [27]:
for idx, sample in enumerate(tqdm(dataset)):
    try:
      prompt = [
      ### YOUR CODE HERE
      {
          "role": "user",
          "content": f"""Write a clear summary of this text. Include the main facts and key information.

      {sample['document']}

      Summary:"""
      }

      ### END YOUR CODE
              ]


      # Generate summary via the pipeline
      outputs = summarizer(
                          prompt,
                          max_new_tokens=512,
      )

      summary = outputs[0]["generated_text"][-1]

      # Calculate ROUGE scores
      predictions = [summary['content']]
      references = [[sample['summary']]]
      rouge_scores = rouge.compute(predictions=predictions, references=references)


      # Store results
      results.append({
          'id': idx,
          'original_text': sample['document'][:500],  # Store truncated text for readability
          'reference_summary': sample['summary'],
          'generated_summary': summary,
           **rouge_scores
      })

      # Print progress update every 10 samples
      if (idx + 1) % 10 == 0:
          print(f"\nProcessed {idx + 1} samples")
          print(f"Latest ROUGE-1: {rouge_scores['rouge1']:.4f}")

    except Exception as e:
      print(f"Error processing sample {idx}: {str(e)}")
      continue

 91%|█████████ | 10/11 [04:51<00:28, 28.25s/it]


Processed 10 samples
Latest ROUGE-1: 0.2000


100%|██████████| 11/11 [05:28<00:00, 29.89s/it]


Calculate and print the average scores.

In [28]:
# Convert results to DataFrame
results_df = pd.DataFrame(results)

# Calculate and print average ROUGE scores
avg_scores = results_df[['rouge1', 'rouge2', 'rougeL']].mean()
print("\nAverage ROUGE Scores:")
for metric, score in avg_scores.items():
   print(f"{metric}: {score:.4f}")

# Print some example summaries
print("\nExample Summaries:")
for i in range(min(5, len(results_df))):
   print(f"\nExample {i+1}:")
   print(f"Reference: {results_df.iloc[i]['reference_summary']}")
   print(f"Generated: {results_df.iloc[i]['generated_summary']}")


Average ROUGE Scores:
rouge1: 0.1271
rouge2: 0.0390
rougeL: 0.0853

Example Summaries:

Example 1:
Reference: Nancie Atwell, an English teacher from Maine in the United States, has been named as the winner of a competition to find the world's best teacher, with a prize of $1m (Â£680,000).
Generated: {'role': 'assistant', 'content': "The Global Teacher Prize was created by the Varkey Foundation, the charitable arm of the GEMS education group, to raise the status of teaching. The inaugural prize was awarded to Nancie Atwell, a US teacher who founded the Center for Teaching and Learning in Edgecomb, Maine in 1990.  Atwell will donate the prize money to her school, which has a library in every room and pupils read an average of 40 books a year.  \n\nAtwell was recognized for her work in teaching reading and writing. She is also a prolific author, with nine books published about teaching, including *In The Middle*, which sold half a million copies.\n\nThe award ceremony was held at the Glo
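
A small check like the one below can make it easier to see which thresholds a run has met. It is a sketch using the thresholds stated earlier and the averages printed in the sample run above; `check_thresholds` is a hypothetical helper, not part of the assignment code.

```python
# Thresholds from the assignment text; averages copied from the sample run above.
THRESHOLDS = {"rouge1": 0.2, "rouge2": 0.03, "rougeL": 0.15}
sample_averages = {"rouge1": 0.1271, "rouge2": 0.0390, "rougeL": 0.0853}

def check_thresholds(averages, thresholds=THRESHOLDS):
    """Map each metric to True if its average strictly exceeds the threshold."""
    return {metric: averages[metric] > limit for metric, limit in thresholds.items()}

status = check_thresholds(sample_averages)
for metric, passed in status.items():
    print(f"{metric}: {'pass' if passed else 'below threshold'}")
```

For the sample run shown here, only ROUGE-2 clears its threshold, so the prompt still needs work. In the notebook, `avg_scores.to_dict()` could be passed in place of `sample_averages`.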

**QUESTION:**

1.1 What is the number of words in your prompt once you've met the scoring criteria?

1.2 What is the avg ROUGE-1 score you get once you've met the scoring criteria?

1.3 What is the avg ROUGE-2 score you get once you've met the scoring criteria?

1.4 What is the avg ROUGE-L score you get once you've met the scoring criteria?

1.5 How helpful do you find ROUGE to be in creating better summaries?  How do you think it could be improved? Please write a five-sentence response in the text cell below.

*** YOUR ANSWER TO QUESTION 1.5 HERE ***

*** END YOUR ANSWER ***