# CS 195: Natural Language Processing
## ROUGE and Summarization

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ericmanley/s26-CS195NLP/blob/main/F1_3_RougeSummarization.ipynb)


## First, let's finish up the group work from last time

I asked you to find a `text-classification` model and the dataset that was used to train it. 

https://huggingface.co/models?pipeline_tag=text-classification&sort=trending

Let's use those instead of `SamLowe/roberta-base-go_emotions` and `go_emotions`.

**Group Discussion:** What needs to change in this code to make it work?

In [1]:
from transformers import pipeline
from datasets import load_dataset
from accelerate import Accelerator
from sklearn.metrics import accuracy_score

device = Accelerator().device

dataset = load_dataset("go_emotions")
classifier = pipeline("text-classification", model="SamLowe/roberta-base-go_emotions", device=device)

results = classifier(dataset["test"]["text"][0:1000])

predicted_labels = []
actual_labels = []

for idx in range(1000):
    predicted_labels.append(results[idx]["label"])
    actual_label_numeric = dataset["test"]["labels"][idx][0]
    actual_labels.append( dataset["test"].features["labels"].feature.int2str( actual_label_numeric ) )

print("Accuracy:",accuracy_score(actual_labels,predicted_labels) )

Device set to use mps


Accuracy: 0.593


**Group Activity:** Spend the next 10 minutes getting this to work with the new model and test data.

## Next Tuesday: First Demo Day!

Reminder: you will present a demo to your group on whatever you've done for the first fortnight
* Show off one **Applied Exploration** that you finished
    - finished outside of class, polished it up, included answers to all requested questions and any other notes of interest
* If you have completed any **Creative Synthesis** items - show those off
    - if you're doing this, spend less/very-little time on your Applied Exploration demo, but you should still have it for your portfolio
    - check the [syllabus](https://github.com/ericmanley/S26-CS195NLP/blob/main/F0_0_Syllabus.ipynb) for options
* If you didn't do any *Creative Synthesis* or *Applied Exploration*, show a **Core Practice**

We'll talk more about portfolio format later

## References

*Two minutes NLP — Learn the ROUGE metric by examples* by Fabio Chiusano: https://medium.com/nlplanet/two-minutes-nlp-learn-the-rouge-metric-by-examples-f179cc285499

Google's implementation of rouge_score: https://github.com/google-research/google-research/tree/master/rouge

Hugging Face's wrapper for Google's implementation: https://huggingface.co/spaces/evaluate-metric/rouge *(NB: this page seems to be down, but it's still linked from many places in the Hugging Face documentation, so I don't know when it will become accessible)*

Hugging Face Task Guide on Summarization: https://huggingface.co/docs/transformers/tasks/summarization


## Installing necessary modules

In [1]:
import sys
!{sys.executable} -m pip install transformers datasets evaluate rouge_score

Defaulting to user installation because normal site-packages is not writeable
Collecting evaluate
  Downloading evaluate-0.4.6-py3-none-any.whl.metadata (9.5 kB)
Collecting rouge_score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting nltk (from rouge_score)
  Downloading nltk-3.9.2-py3-none-any.whl.metadata (3.2 kB)
Downloading evaluate-0.4.6-py3-none-any.whl (84 kB)
Downloading nltk-3.9.2-py3-none-any.whl (1.5 MB)
Building wheels for collected packages: rouge_score
  Building wheel for rouge_score (pyproject.toml) ... done
  Created wheel for rouge_score: filename=rouge_score-0.1.2-py3-none-any.whl size=24986 sha256=cd3d676d118756586

## Sequence-to-Sequence Models

NLP models that take one sequence as input and produce another sequence as output are called **seq2seq**, **text-to-text**, or **text generation** models. Typical tasks include:
* summarization
* translation
* conversation

**A Challenge:** unlike classification, there is usually more than one acceptable output, so there's no way to tell for sure whether the prediction is right!

**Partial Solutions:** 
* Qualitative metrics - humans can describe how closely they match
* ROUGE Metrics: statistics that measure similarities between two sequences.
* Task-specific metrics like BLEU (Bilingual Evaluation Understudy) for translation



## Getting started with ROUGE

**ROUGE:** Recall-Oriented Understudy for Gisting Evaluation

Suppose we have a **reference** sequence, which is one known possible *correct* sequence
* E.g., a translation or a summarization that a trustworthy human has produced

**Example reference:** "A broody hen sat in a nesting box all day."

**Example machine-generated prediction:** "A hen sat in every nesting box that long sunny day."



In [2]:
import evaluate

rouge = evaluate.load("rouge")

predicted_sentence = "A hen sat in every nesting box that long sunny day"
reference_sentence = "A broody hen sat in a nesting box all day"

rouge.compute(predictions=[predicted_sentence],references=[reference_sentence])

Downloading builder script: 0.00B [00:00, ?B/s]

{'rouge1': 0.6666666666666666,
 'rouge2': 0.3157894736842105,
 'rougeL': 0.6666666666666666,
 'rougeLsum': 0.6666666666666666}

## Understanding ROUGE-1 and ROUGE-2

These tell you how often words or sequences of words match in the prediction and reference data.

`rouge1` - overlap of individual words (1-grams) between prediction and reference

`rouge2` - overlap of *bigrams* (2-grams, pairs of consecutive words)

Both of these are given in terms of their F1 score. Remember, F1 is a balance of *precision* and *recall*, specifically $$F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$

### in this context...

**Precision:** Of all the n-grams in the *prediction*, what fraction are also present in the *reference*?

**Recall:** Of all the n-grams in the *reference*, what fraction are also present in the *prediction*?

### ROUGE-1 example

**Reference:** A broody hen sat in a nesting box all day. (10 words)

**Prediction:** A hen sat in every nesting box that long sunny day. (11 words)

**Overlapping words:** a, hen, sat, in, nesting, box, day (7 words)

**Precision:** of the 11 words in the prediction, 7 of them are also in the reference, so $7/11 \approx 0.64$

**Recall:** of the 10 words in the reference, 7 of them are also present in the prediction (first "a" has match, second doesn't), so $7/10 = 0.7$

**F1 score:** $2*(0.64*0.7)/(0.64+0.7) \approx 0.67$
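
The arithmetic above can be checked with a few lines of plain Python. This is only a sketch: the `tokenize` helper is a simplifying assumption (lowercase, keep runs of letters and digits) and is not guaranteed to match the tokenizer inside the `rouge_score` package on every input.

```python
import re
from collections import Counter

def tokenize(text):
    # Assumption: lowercase the text and keep only runs of letters/digits
    return re.findall(r"[a-z0-9]+", text.lower())

reference = "A broody hen sat in a nesting box all day."
prediction = "A hen sat in every nesting box that long sunny day."

ref_counts = Counter(tokenize(reference))
pred_counts = Counter(tokenize(prediction))

# Counter & Counter keeps the minimum count of each word, so the single "a"
# in the prediction can only match one of the two "a"s in the reference
overlap = sum((ref_counts & pred_counts).values())

precision = overlap / sum(pred_counts.values())  # 7 / 11
recall = overlap / sum(ref_counts.values())      # 7 / 10
f1 = 2 * precision * recall / (precision + recall)

print(overlap, round(precision, 2), round(recall, 2), round(f1, 2))
# prints: 7 0.64 0.7 0.67
```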


### ROUGE-2 example

**Reference:** A broody hen sat in a nesting box all day. (9 bigrams)

**Prediction:** A hen sat in every nesting box that long sunny day. (10 bigrams)

**Overlapping bigrams:** (hen sat), (sat in), (nesting box) (3 bigrams)

**Precision:** of the 10 bigrams in the prediction, 3 of them are also in the reference, so $3/10 = 0.3$

**Recall:** of the 9 bigrams in the reference, 3 of them are also present in the prediction, so $3/9 \approx 0.33$

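**F1 score:** $2*(0.3*0.33)/(0.3+0.33) \approx 0.31$ (using the unrounded recall of $3/9$ gives $6/19 \approx 0.32$, which is the `rouge2` value the library reported)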
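
The same counting idea generalizes to any n-gram size. Here is a minimal sketch; the function names `ngrams` and `rouge_n` are made up for illustration and the tokenization is a simplifying assumption, so treat it as a way to check hand calculations rather than a replacement for the library.

```python
import re
from collections import Counter

def ngrams(text, n):
    # Assumption: lowercase and keep only runs of letters/digits as tokens
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # Count every window of n consecutive tokens
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(prediction, reference, n):
    pred, ref = ngrams(prediction, n), ngrams(reference, n)
    overlap = sum((pred & ref).values())  # clipped n-gram matches
    precision = overlap / max(sum(pred.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1

reference = "A broody hen sat in a nesting box all day."
prediction = "A hen sat in every nesting box that long sunny day."

p, r, f = rouge_n(prediction, reference, 2)
print(round(p, 3), round(r, 3), round(f, 3))
# prints: 0.3 0.333 0.316
```

Calling `rouge_n(prediction, reference, 1)` reproduces the ROUGE-1 numbers from the previous example.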

## Understanding ROUGE-L and ROUGE-Lsum

`rougeL` - based on the *longest common subsequence* between the prediction and reference. The subsequence must be in *order* but not necessarily *consecutive*

**Reference:** **A** broody **hen sat in** a **nesting box** all **day**. (10 words)

**Prediction:** **A hen sat in** every **nesting box** that long sunny **day**. (11 words)

**Longest Common Subsequence:** 7 words

**Precision:** 7 words of 11 in the prediction, 0.64

**Recall:** 7 of 10 words in the reference, 0.7

**F1 score:** $2*(0.64*0.7)/(0.64+0.7) \approx 0.67$
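
The longest common subsequence can be found with a standard dynamic-programming table. The sketch below is one common way to do it, not necessarily how the library implements it; `tokenize` and `lcs_length` are illustrative names and the tokenization is a simplifying assumption.

```python
import re

def tokenize(text):
    # Assumption: lowercase and keep only runs of letters/digits
    return re.findall(r"[a-z0-9]+", text.lower())

def lcs_length(a, b):
    # table[i][j] holds the LCS length of a[:i] and b[:j]
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, start=1):
        for j, y in enumerate(b, start=1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[-1][-1]

ref_tokens = tokenize("A broody hen sat in a nesting box all day.")
pred_tokens = tokenize("A hen sat in every nesting box that long sunny day.")

lcs = lcs_length(ref_tokens, pred_tokens)
precision = lcs / len(pred_tokens)  # 7 / 11
recall = lcs / len(ref_tokens)      # 7 / 10
f1 = 2 * precision * recall / (precision + recall)

print(lcs, round(f1, 2))
# prints: 7 0.67
```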

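`rougeLsum` - split the text into sentences at newlines, do the `rougeL` computation sentence by sentence, and aggregate the results. For a single-sentence text like the example above, it gives the same value as `rougeL`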


## Group Activity

Given the following Reference and Prediction, calculate the ROUGE-1, ROUGE-2, and ROUGE-L scores.

**Reference:** the study found that regular exercise improves mental health and reduces stress

**Prediction:** the study shows exercise improves mental health and lowers stress



## Summarization in Hugging Face

Hugging Face hosts many summarization models. Here's one called BART (https://huggingface.co/facebook/bart-large-cnn) that was trained on CNN/Daily Mail news articles (https://huggingface.co/datasets/abisee/cnn_dailymail) which include **reference** summaries written by the authors of the original article. 

We'll try it out on a Times-Delphic article I found here: https://timesdelphic.com/83979/news/drake-events-focus-on-the-link-between-cancer-and-water-quality-in-iowa/

In [6]:
from transformers import pipeline
from accelerate import Accelerator

device = Accelerator().device
summarizer = pipeline("summarization", model="facebook/bart-large-cnn", device=device)

Device set to use mps


In [7]:
times_delphic_story = """Drake University hosted the Iowa Nature Summit in the Olmsted Center on Nov. 19 and 20 to discuss Iowa environmental preservation, issues and policies.

The water conditions in Iowa continued to be a buzzing topic less than a week after Drake hosted the Water Quality Solutions Town Hall on Nov. 15, a follow-up to a Central Iowa Source Water Resource Assessment report. 

The water quality tracking data in the report showed that potentially harmful nitrates were found in the Des Moines and Raccoon Rivers.

Adam Shriver, director of wellness and nutrition policy at the Harkin Institute, was one of five panelists who discussed the connection between public health and the state of Iowa’s nature due to human impact. 

During his time on the podium, Shriver focused on the potential connection between water quality and the growing cancer rates, especially amongst younger age groups in Iowa.

“Iowa having the number two cancer rate in the country is something that I think should be troubling to all of us, both because of how devastating cancer is and also because many young people are choosing to leave the state partly because of concerns about this topic,” Shriver said. 

Although the population decline amongst young people is concerning, Shriver noted a growing concern from Iowan citizens, recalling the Central Iowa Source Water Resource Assessment event hosted by the Harkin Institute that gathered over 600 in-person attendees and thousands online.  

“I’ve got grandchildren [in Iowa] thinking about the future and them seeing the decline of quality in the state, how much it can be hurt,” said Iowa Citizens for Community Improvement member Tim Goldman.

Shriver said that it’s important to be knowledgeable and raise awareness about the issue of water quality. By supporting change-making organizations such as the Harkin Institute, Iowa Environmental Council and Iowa Citizens for Community Improvements, Iowans can aid in affecting state policies and laws. 

“Ultimately, the reason we have high nitrates in our waters and why Iowa applies 53 million pounds of pesticides every year have to do with the policies that are in place … I’m a believer that the way change happens is through organized money or through organized people,” Shriver said. “People advocating for change are often at a disadvantage when it comes to money, so we need to go all out for organizing people.” 

With the upcoming 2026 elections, Shriver said that keeping environmental issues affecting Iowan communities in mind when voting is one of the ways for change and bettering of Iowan health. 

Shriver furthermore expressed the importance of the growing cancer rates to be a topic of discussion among Iowa leaders, and the importance of opening conversations regarding agricultural practices that may be the root of the issue. 

Like Shriver, John Norris, a former Polk County administrator who commissioned the CISWRA report, marked the importance of Iowan values in guiding future policies. Norris advised attendees of the Water Quality Solutions Town Hall to vote in upcoming elections with valuing clean water in mind.

“We have to start anchoring our politics in values,” Norris said. 

Through continued research regarding the correlation between water quality and cancer, the Harkin Institute will continue holding events in collaboration with various organizations to keep not only the community but the state of Iowa informed. 

The presentation “Cancer in Polk County” will be hosted at Sheslow Auditorium along with a virtual Zoom option on Tuesday, Jan. 13, from 5-7 p.m. In the days following, the “Environmental Risk Factors and Cancer” report will be released in conjunction with the Iowa Environmental Council to continue the conversation.
"""

In [11]:
len(times_delphic_story) #let's check how long this string is - some might be too long for the model

3755

In [12]:
print(summarizer(times_delphic_story))

[{'summary_text': 'Drake University hosted the Iowa Nature Summit in the Olmsted Center on Nov. 19 and 20 to discuss Iowa environmental preservation, issues and policies. Adam Shriver, director of wellness and nutrition policy at the Harkin Institute, was one of five panelists who discussed the connection between public health and the state of Iowa’s nature.'}]


## Individual work: Let's try it on a different summarization dataset

The *BillSum* dataset contains the text of legislative bills and their summaries from both the US Federal and California State legislatures.

See more here: https://huggingface.co/datasets/FiscalNote/billsum

This dataset has `train`, `test`, and `ca_test` splits. We can load just one of them - let's try `ca_test`, which is the smaller test set.


In [14]:
from datasets import load_dataset

billsum = load_dataset("billsum", split="ca_test")

README.md: 0.00B [00:00, ?B/s]

data/train-00000-of-00001.parquet:   0%|          | 0.00/91.8M [00:00<?, ?B/s]

data/test-00000-of-00001.parquet:   0%|          | 0.00/15.8M [00:00<?, ?B/s]

data/ca_test-00000-of-00001.parquet:   0%|          | 0.00/6.12M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/18949 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/3269 [00:00<?, ? examples/s]

Generating ca_test split:   0%|          | 0/1237 [00:00<?, ? examples/s]

## Let's explore the dataset

What does it look like when printed/displayed?

In [15]:
print(billsum)

Dataset({
    features: ['text', 'summary', 'title'],
    num_rows: 1237
})


What does one of the items look like?

In [16]:
billsum[0]

{'text': 'The people of the State of California do enact as follows:\n\n\nSECTION 1.\nThe Legislature finds and declares all of the following:\n(a) (1) Since 1899 congressionally chartered veterans’ organizations have provided a valuable service to our nation’s returning service members. These organizations help preserve the memories and incidents of the great hostilities fought by our nation, and preserve and strengthen comradeship among members.\n(2) These veterans’ organizations also own and manage various properties including lodges, posts, and fraternal halls. These properties act as a safe haven where veterans of all ages and their families can gather together to find camaraderie and fellowship, share stories, and seek support from people who understand their unique experiences. This aids in the healing process for these returning veterans, and ensures their health and happiness.\n(b) As a result of congressional chartering of these veterans’ organizations, the United States Inte

Let's get a summary of the first bill using the news-article summarizer.

In [17]:
summarizer(billsum[0]["text"])

[{'summary_text': 'Since 1899 congressionally chartered veterans’ organizations have provided a valuable service to our nation’s returning service members. These properties act as a safe haven where veterans of all ages and their families can gather together to find camaraderie and fellowship, share stories, and seek support from people who understand their unique experiences. The U.S. Internal Revenue Service created a special tax exemption for these organizations.'}]

## Now let's do a batch of 5 articles

First, we need to prepare a list that contains the texts of the first 5 bills, truncated to the first 4000 characters.

In [18]:
truncated_bill_texts = []
for idx in range(5):
    curr_truncated_text = billsum[idx]["text"][:4000]  # keep only the first 4000 characters
    truncated_bill_texts.append( curr_truncated_text )

Now let's get a summary of each of those texts. This might take a while.

In [19]:
prediction_summaries = summarizer(truncated_bill_texts)
actual_references = billsum["summary"][0:5]

print(prediction_summaries)
print(actual_references)


[{'summary_text': 'Since 1899 congressionally chartered veterans’ organizations have provided a valuable service to our nation’s returning service members. These properties act as a safe haven where veterans of all ages and their families can gather together to find camaraderie and fellowship, share stories, and seek support from people who understand their unique experiences. The U.S. Internal Revenue Service created a special tax exemption for these organizations.'}, {'summary_text': 'A prisoner is not eligible for resentence or recall pursuant to subdivision (e) of Section 1170 if he or she was convicted of first-degree murder. A prisoner sentenced to death or life in prison without possibility of parole is prohibited by any initiative statute. If a prisoner is permanently medically incapacitated with a medical condition that renders him or her permanently unable to perform activities of basic daily living, and that incapacitation did not exist at the time of sentencing, the prisone

Notice that the summarizer returns a list of dictionaries with one key each: `'summary_text'`. If we want to evaluate these with ROUGE, we need a flat list of the summary strings themselves, not strings wrapped inside dictionaries.

In [20]:
predictions_flat = []

for result in prediction_summaries:
    predictions_flat.append(result["summary_text"])
    
print(predictions_flat)

['Since 1899 congressionally chartered veterans’ organizations have provided a valuable service to our nation’s returning service members. These properties act as a safe haven where veterans of all ages and their families can gather together to find camaraderie and fellowship, share stories, and seek support from people who understand their unique experiences. The U.S. Internal Revenue Service created a special tax exemption for these organizations.', 'A prisoner is not eligible for resentence or recall pursuant to subdivision (e) of Section 1170 if he or she was convicted of first-degree murder. A prisoner sentenced to death or life in prison without possibility of parole is prohibited by any initiative statute. If a prisoner is permanently medically incapacitated with a medical condition that renders him or her permanently unable to perform activities of basic daily living, and that incapacitation did not exist at the time of sentencing, the prisoner shall be granted medical parole.'
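
The same flattening can also be written as a list comprehension. The sketch below uses a small made-up list, `example_results`, as a stand-in for the summarizer output so that it runs without loading the model.

```python
# Hypothetical stand-in for summarizer output: one dictionary per input text
example_results = [
    {"summary_text": "First summary."},
    {"summary_text": "Second summary."},
]

# Pull the string out of each dictionary, preserving the original order
flat = [result["summary_text"] for result in example_results]

print(flat)
# prints: ['First summary.', 'Second summary.']
```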

And now let's compute the ROUGE metrics.

In [21]:
import evaluate

rouge = evaluate.load("rouge")

rouge.compute(predictions=predictions_flat,references=actual_references)

{'rouge1': 0.2155830374162857,
 'rouge2': 0.09040727616557813,
 'rougeL': 0.15076481729055258,
 'rougeLsum': 0.1819580658833651}

## Debrief

* How good are these numbers?
* Do you think the numbers would be similar if we evaluated using a news dataset, even if it isn't the one it was trained on?

## Applied Exploration

Go to the Hugging Face models page: https://huggingface.co/models
* Use the same model, but find two different news datasets (https://huggingface.co/datasets), and evaluate the model on each of them using ROUGE metrics
* For each dataset, record
    - where did it come from?
    - where did the reference summaries come from?
    - how big is it?
    - how big are the texts? Did you have to truncate them?
* Evaluate the performance 
    - use the ROUGE metrics
    - describe in your own words how it performed
    - how did they compare to each other?
    - how did they compare to the bills dataset?
    - what do you think is the reason for the difference in performance that you noticed?
    

## An Idea for Creative Synthesis

Write some code that lets the user type in a web address (like a Wikipedia article) and generate a summary for the whole page.
* you will have to experiment with different ideas of how to get summaries for longer texts
    - come up with your own ideas
    - research how others handle it and try those
    - you might find that combining more than one kind of model can be helpful

Record your results and discuss them at the demo!