# Assignment 3: Summarization with LLMs

**Description:** This assignment covers the task of summarization which is the process of generating an abridged version of the input. With the ascendance of LLMs, we have a new way of generating summaries. Now, rather than fine-tuning. moel to generate summaries, we can simply provide explicit instructios for the summary we want the model to generate.  By finishing this assignment you should also be able to develop an intuition for:


* How well summarization systems work
* The effects of hyperparameters on outcomes
* The effects of prompts on the output of an LLM
* Evaluation of output using ROUGE



This notebook must be run on a Google Colab as it requires a GPU. By default, when you open the notebook in Colab it will configure a GPU.  Summarization commands can take up to five minutes to run depending on the hyperparameters you use.


[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/datasci-w266/2025-fall-main/blob/master/assignment/a3/SummarizationLLM_test.ipynb)

The overall assignment structure is as follows:

 Setup

1. Gemma 2 for abstractive summarization




**INSTRUCTIONS:**:

* Questions are always indicated as **QUESTION:**, so you can search for this string to make sure you answered all of the questions. You are expected to fill out, run, and submit this notebook, as well as to answer the questions in the **answers** file as you did in a1 and a2.

* **### YOUR CODE HERE** indicates that you are supposed to write code.

* In order to complete the assignment with the Gemma model you will need to get an account on [Hugging Face](https://huggingface.co).  It is free.  Once you have the account on Hugging Face you will need to create an Access Token.  Go
to Access Token under your profile and generate a token with write permissions for colab.  You will need to copy that token and add it to the secrets in your Colab account with the name `HF_TOKEN` and the value of the string of your access token.

* In addition, you will need to visit the [Model Card for the Gemma 2 model](https://huggingface.co/google/gemma-2-9b-it).  At the top of the page you will see a notice saying you need to request perrmission to use the model.  While logged in to your Hugging Face account, click the button to request permission.  It can sometimes take up to 10 or 15 minutes to get approved.  Once you are approved the message on the Model Card will change to indicate you have been granted access to the model.


## Setup

In [1]:
!pip install -q -U transformers
!pip install -q -U accelerate
!pip install -q -U bitsandbytes
!pip install -q -U flash_attn
!pip install -q -U datasets==3.6.0

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m60.1/60.1 MB[0m [31m36.6 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m8.4/8.4 MB[0m [31m133.2 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
  Building wheel for flash_attn (setup.py) ... [?25l[?25hdone
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m491.5/491.5 kB[0m [31m35.1 MB/s[0m eta [36m0:00:00[0m
[?25h

In [2]:

#help track which versions of libraries we're using
!pip list | grep transformers
!pip list | grep accelerate
!pip list | grep bitsandbytes
!pip list | grep datasets

sentence-transformers                    5.1.1
transformers                             4.57.1
accelerate                               1.10.1
bitsandbytes                             0.48.1
datasets                                 3.6.0
tensorflow-datasets                      4.9.9
vega-datasets                            0.9.0


In [3]:
import datasets
from transformers import pipeline, BitsAndBytesConfig
import bitsandbytes as bnb
import torch
import random
import pandas as pd
from tqdm import tqdm


In [4]:
!pip install -q evaluate
import evaluate

[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/84.1 kB[0m [31m?[0m eta [36m-:--:--[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m84.1/84.1 kB[0m [31m5.5 MB/s[0m eta [36m0:00:00[0m
[?25h

In [5]:
!pip install -q rouge_score

  Preparing metadata (setup.py) ... [?25l[?25hdone
  Building wheel for rouge_score (setup.py) ... [?25l[?25hdone


In [6]:
#let's make longer output readable without horizontal scrolling
from pprint import pprint

Now let's get the data we're going to use.

In [7]:
# import datasets
#import random
#import torch
from datasets import load_dataset

def load_and_sample_dataset(num_samples=11):
    """
    Load and sample records from the X-Sum dataset
    """
    #dataset = datasets.load_dataset("xsum", split="train", cache_dir=None, trust_remote_code=True)

    dataset = load_dataset("EdinburghNLP/xsum", split="validation")
    selected_indices = random.sample(range(len(dataset)), num_samples)
    selected_samples = dataset.select(selected_indices)
    return selected_samples

In [8]:
from huggingface_hub import login
# Paste your HF token (with read scope) in the input box
login()  # or: login(token="hf_xxx")


VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

In [9]:
# Set random seed for reproducibility
random.seed(42)
torch.manual_seed(42)

# Load dataset
print("Loading dataset...")
dataset = load_and_sample_dataset()

Loading dataset...


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


README.md: 0.00B [00:00, ?B/s]

xsum.py: 0.00B [00:00, ?B/s]

default/train/0000.parquet:   0%|          | 0.00/304M [00:00<?, ?B/s]

default/validation/0000.parquet:   0%|          | 0.00/16.7M [00:00<?, ?B/s]

default/test/0000.parquet:   0%|          | 0.00/17.0M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/204045 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/11332 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/11334 [00:00<?, ? examples/s]

In [10]:
display(dataset)

Dataset({
    features: ['document', 'summary', 'id'],
    num_rows: 11
})

What do our input documents lok like?  Let's see the first of them.

In [11]:
dataset[0]['document']

'The victim was with a friend in McKechnie Street, Govan, when the attack took place at about 11:10 on Wednesday.\nHe was taken to South Glasgow University Hospital and treated for a serious facial injury before being allowed home.\nHis attacker is described as a white male in his 20s, with light brown hair.\nPolice Scotland, who have appealed for witnesses, said the suspect ran off towards Harmony Row.\nDet Con Adam Richardson, of Govan CID, said: "We are currently studying CCTV footage in an attempt to get a clearer description of the person responsible for this vicious attack.\n"At this time there is no apparent motive for the attack and I appeal to anyone who either witnessed the incident or who saw the suspect running off afterwards to contact police immediately."'

And what does the corresponding summmary look like?  This is our target.

In [12]:
dataset[0]['summary']

'A 61-year-old man was slashed in the face after being attacked from behind as he walked along a Glasgow street.'

We'll also take advantage of a Hugging Face abstraction called a pipeline.  It is an easy way of experimenting with a model in inference mode.  We'll use that here to experiment with prompts (and possibly some hyperparameters) to imporve the quality of our results.

It takes a while to load this model -- on the order of ten minutes -- but once it is loaded you can keep reusing the loaded model and improve your prompt.



In [14]:
"""
Initialize the pipeline with bitsandbytes quantization
"""
# Configure bitsandbytes for 4-bit quantization
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

# Initialize pipeline
model_id = "google/gemma-2-9b-it"

summarizer = pipeline(
   "text-generation",
   model=model_id,
   model_kwargs={"dtype": torch.bfloat16, "quantization_config": quantization_config},
   device_map="auto",
   trust_remote_code=True,
)

config.json:   0%|          | 0.00/857 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/39.1k [00:00<?, ?B/s]

Fetching 4 files:   0%|          | 0/4 [00:00<?, ?it/s]

model-00002-of-00004.safetensors:   0%|          | 0.00/4.95G [00:00<?, ?B/s]

model-00001-of-00004.safetensors:   0%|          | 0.00/4.90G [00:00<?, ?B/s]

model-00003-of-00004.safetensors:   0%|          | 0.00/4.96G [00:00<?, ?B/s]

model-00004-of-00004.safetensors:   0%|          | 0.00/3.67G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/173 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/47.0k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/4.24M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.5M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/636 [00:00<?, ?B/s]

Device set to use cuda:0


As a reminder, here's the record we're dealing with.

In [15]:
dataset[0]

{'document': 'The victim was with a friend in McKechnie Street, Govan, when the attack took place at about 11:10 on Wednesday.\nHe was taken to South Glasgow University Hospital and treated for a serious facial injury before being allowed home.\nHis attacker is described as a white male in his 20s, with light brown hair.\nPolice Scotland, who have appealed for witnesses, said the suspect ran off towards Harmony Row.\nDet Con Adam Richardson, of Govan CID, said: "We are currently studying CCTV footage in an attempt to get a clearer description of the person responsible for this vicious attack.\n"At this time there is no apparent motive for the attack and I appeal to anyone who either witnessed the incident or who saw the suspect running off afterwards to contact police immediately."',
 'summary': 'A 61-year-old man was slashed in the face after being attacked from behind as he walked along a Glasgow street.',
 'id': '33096901'}

Let's just generate one summary so we can see what it looks like

In [16]:
prompt = [
            {"role": "user", "content": "Generate a summary of this text: " + dataset[0]['document']}
        ]



outputs = summarizer(
  prompt,
  max_new_tokens=256,
  do_sample = True,
  temperature = 0.3,
  top_p = 0.95
)

summary = outputs[0]["generated_text"][-1]

Let's see what the generated summary looks like.

In [17]:
summary

{'role': 'assistant',
 'content': 'A man in his 20s was seriously injured in a facial attack in Govan, Scotland on Wednesday at around 11:10 pm. The victim was with a friend on McKechnie Street when the attack occurred. The suspect, described as a white male in his 20s with light brown hair, fled towards Harmony Row. Police are reviewing CCTV footage and appealing for witnesses to come forward as the motive for the attack remains unknown. \n'}

How does it compare with the reference? Let's compare your candidate and the reference using the ROUGE metric.

In [18]:
rouge = evaluate.load('rouge')


# Process each sample
print("Generating summaries and calculating ROUGE scores...")



# Calculate ROUGE scores
predictions = [summary['content']]
references = [[dataset[0]['summary']]]
rouge_scores = rouge.compute(predictions=predictions, references=references)
rouge_scores

Downloading builder script: 0.00B [00:00, ?B/s]

Generating summaries and calculating ROUGE scores...


{'rouge1': np.float64(0.16842105263157894),
 'rouge2': np.float64(0.0),
 'rougeL': np.float64(0.14736842105263157),
 'rougeLsum': np.float64(0.14736842105263157)}

Now, it's your turn.  Please improve the prompt below so that you get output that, when scored using ROUGE, the average scores for the entire data sample of 11 records exceeds these thresholds:
* Rouge-1 > 0.2
* Rouge-2 > 0.03
* Rouge-L > 0.15

You may use sampling with Top K or Top P and termperature if you like but the prompt is what will have the greatest effect on your output.  Your prompt should give as specific instructions as possible.  These LLMs are trained to follow instructions so be very specific in your request.  Individual words can make a large difference so take a little time to experiment with synonyms and alternate ways of phrasing things.

Enter your prompt in the space below and then run the code.  

In [31]:
# Store results for aggregate scoring
results = []

for idx, sample in enumerate(tqdm(dataset)):
    try:

      prompt = [
          {
              "role": "user",
              "content": (
                  "You are a precision news summarizer evaluated with ROUGE. "
                  "Write ONE sentence (18–28 words) that states the article’s MAIN event. "
                  "Rules to maximize ROUGE and fidelity:\n"
                  "1) Use active voice and past tense. No headings, quotes, lists, or prefaces.\n"
                  "2) REUSE exact words/phrases and named entities from the source; avoid synonyms and paraphrase unless necessary.\n"
                  "3) Cover WHO did WHAT, and include WHERE or WHEN if the article states them.\n"
                  "4) Prioritize facts from the first paragraph; do not add information not present in the text.\n"
                  "5) Keep proper nouns, numbers, and titles exactly as written in the source.\n"
                  "Output ONLY the single summary sentence.\n\n"
                  "Article:\n<document>\n"
                  f"{sample['document']}\n"
                  "</document>"
              )
          }
      ]

      # Generate summary via the pipeline
      outputs = summarizer(
        prompt,
        max_new_tokens=40,
        do_sample = True,
        temperature = 0.2,
        top_p = 0.95
      )

      # Calculate ROUGE scores
      gen = outputs[0]["generated_text"][-1]["content"].strip()
      predictions = [gen]
      # predictions = [summary['content']]
      references = [[sample['summary']]]
      rouge_scores = rouge.compute(predictions=predictions, references=references)


      # Store results
      results.append({
          'id': idx,
          'original_text': sample['document'][:500],  # Store truncated text for readability
          'reference_summary': sample['summary'],
          'generated_summary': summary,
           **rouge_scores
      })

      # Print progress update every 10 samples
      if (idx + 1) % 10 == 0:
          print(f"\nProcessed {idx + 1} samples")
          print(f"Latest ROUGE-1: {rouge_scores['rouge1']:.4f}")

    except Exception as e:
      print(f"Error processing sample {idx}: {str(e)}")
      continue

 91%|█████████ | 10/11 [00:44<00:04,  4.31s/it]


Processed 10 samples
Latest ROUGE-1: 0.2424


100%|██████████| 11/11 [00:48<00:00,  4.44s/it]


Calculate and print the average scores.

In [32]:
# Convert results to DataFrame
results_df = pd.DataFrame(results)

# Calculate and print average ROUGE scores
avg_scores = results_df[['rouge1', 'rouge2', 'rougeL']].mean()
print("\nAverage ROUGE Scores:")
for metric, score in avg_scores.items():
   print(f"{metric}: {score:.4f}")

# Print some example summaries
print("\nExample Summaries:")
for i in range(min(5, len(results_df))):
   print(f"\nExample {i+1}:")
   print(f"Reference: {results_df.iloc[i]['reference_summary']}")
   print(f"Generated: {results_df.iloc[i]['generated_summary']}")


Average ROUGE Scores:
rouge1: 0.2156
rouge2: 0.0456
rougeL: 0.1682

Example Summaries:

Example 1:
Reference: A 61-year-old man was slashed in the face after being attacked from behind as he walked along a Glasgow street.
Generated: {'role': 'assistant', 'content': 'A white male in his 20s with light brown hair attacked a victim in McKechnie Street, Govan, at about 11:10 on Wednesday, causing a serious facial'}

Example 2:
Reference: A big debate takes place on Sunday night in America between Hillary Clinton and Donald Trump, the two people trying to become the next US president.
Generated: {'role': 'assistant', 'content': 'A 2005 video of Donald Trump making derogatory comments about women was released, causing him to apologize and prompting some Republicans to withdraw their support.  \n'}

Example 3:
Reference: The widow of a policeman allegedly murdered in Indonesia has rejected a "donation" from his accused Australian killer.
Generated: {'role': 'assistant', 'content': "Sara Conn

**QUESTION:**

1.1 What is the number of words in your prompt once you've met the scoring criteria?

1.2 What is the avg ROUGE-1 score you get once you've met the scoring criteria?

1.3 What is the avg ROUGE-2 score you get once you've met the scoring criteria?

1.4 What is the avg ROUGE-L score you get once you've met the scoring criteria?

1.5 How helpful do you find ROUGE to be in creating better summaries?  How do you think it could be improved? Please write a five sentence response in the text cell below.

*** YOUR ANSWER TO QUESTION 1.5 HERE ***

*** END YOUR ANSWER ***