# Chapter 6: Summarization

## The CNN/DailyMail Dataset

* ~300k pairs of news articles and their corresponding summaries
* summaries are _abstractive_
* [`cnn_dailymail` dataset viewer at HF](https://huggingface.co/datasets/viewer/?dataset=cnn_dailymail&config=3.0.0)
* also see the [Dataset card for `cnn_dailymail` at HF](https://huggingface.co/datasets/cnn_dailymail)

In [1]:
from datasets import load_dataset

dataset = load_dataset(
    "cnn_dailymail",
    version="3.0.0"
)

print(f"Features: {dataset['train'].column_names}")

Found cached dataset cnn_dailymail (/home/kashiwapoodle/.cache/huggingface/datasets/cnn_dailymail/default/3.0.0/1b3c71476f6d152c31c1730e83ccb08bcf23e348233f4fcc11e182248e6bf7de)


Features: ['article', 'highlights', 'id']


In [2]:
sample = dataset["train"][1]

print(f"""
Article (excerpt of 500 char, total length: {len(sample['article'])}):""")
print(sample["article"][:500])
print(f"\nSummary (length: {len(sample['highlights'])}):")
print(sample["highlights"])


Article (excerpt of 500 char, total length: 4051):
Editor's note: In our Behind the Scenes series, CNN correspondents share their experiences in covering news and analyze the stories behind the events. Here, Soledad O'Brien takes users inside a jail where many of the inmates are mentally ill. An inmate housed on the "forgotten floor," where many mentally ill inmates are housed in Miami before trial. MIAMI, Florida (CNN) -- The ninth floor of the Miami-Dade pretrial detention facility is dubbed the "forgotten floor." Here, inmates with the most s

Summary (length: 281):
Mentally ill inmates in Miami are housed on the "forgotten floor"
Judge Steven Leifman says most are there as a result of "avoidable felonies"
While CNN tours facility, patient shouts: "I am the son of the president"
Leifman says the system is unjust and he's fighting for change .


## Text Summarization Pipelines

This section requires `nltk`, so be sure to install it before going any further.

In [3]:
sample_text = dataset["train"][1]["article"][:2000]

summaries = {}

In [4]:
import nltk
from nltk.tokenize import sent_tokenize

nltk.download("punkt")

string = "The U.S. are a country. The U.N. is an organization."
sent_tokenize(string)

[nltk_data] Downloading package punkt to
[nltk_data]     /home/kashiwapoodle/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


['The U.S. are a country.', 'The U.N. is an organization.']
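
To see why the cell above reaches for `sent_tokenize` rather than plain string splitting, here is a minimal sketch of a naive splitter applied to the same string. The variable names are illustrative and not part of the original notebook.

```python
# Naive approach: treat every period followed by a space as a sentence boundary.
text = "The U.S. are a country. The U.N. is an organization."

naive_sentences = text.split(". ")
print(naive_sentences)
# ['The U.S', 'are a country', 'The U.N', 'is an organization.']
```

The abbreviations `U.S.` and `U.N.` are cut in two and most sentence-final periods are lost, whereas `sent_tokenize` returns the two intended sentences.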

## Summarization Baseline

In [5]:
def three_sentence_summary(text):
    return "\n".join(sent_tokenize(text)[:3])

summaries["baseline"] = three_sentence_summary(sample_text)

In [6]:
summaries

{'baseline': 'Editor\'s note: In our Behind the Scenes series, CNN correspondents share their experiences in covering news and analyze the stories behind the events.\nHere, Soledad O\'Brien takes users inside a jail where many of the inmates are mentally ill. An inmate housed on the "forgotten floor," where many mentally ill inmates are housed in Miami before trial.\nMIAMI, Florida (CNN) -- The ninth floor of the Miami-Dade pretrial detention facility is dubbed the "forgotten floor."'}

## GPT-2

See the [`gpt2-xl` model details on HF](https://huggingface.co/gpt2-xl#model-details).

In [7]:
from transformers import pipeline, set_seed

set_seed(42)
pipe = pipeline("text-generation", model="gpt2-xl")
gpt2_query = sample_text + "\nTL;DR:\n"
pipe_out = pipe(
    gpt2_query,
    max_length=512,
    clean_up_tokenization_spaces=True
)
summaries["gpt2"] = "\n".join(sent_tokenize(pipe_out[0]['generated_text'][len(gpt2_query):]))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


## T5

See the [`t5-large` model details on HF](https://huggingface.co/t5-large#model-details).

In [8]:
pipe = pipeline("summarization", model="t5-large")
pipe_out = pipe(sample_text)
summaries["t5"] = "\n".join(sent_tokenize(pipe_out[0]['summary_text']))

For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-large automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.


## BART

See the [`facebook/bart-large-cnn` model card on HF](https://huggingface.co/facebook/bart-large-cnn).

In [9]:
pipe = pipeline("summarization", model="facebook/bart-large-cnn")
pipe_out = pipe(sample_text)
summaries["bart"] = "\n".join(sent_tokenize(pipe_out[0]['summary_text']))

## PEGASUS

See the [`google/pegasus-cnn_dailymail` model card on HF](https://huggingface.co/google/pegasus-cnn_dailymail).

<span style="background-color: #9AFEFF">This model has a dependency on the `protobuf` library, so you will need to install that as well!</span>

In [10]:
pipe = pipeline("summarization", model="google/pegasus-cnn_dailymail")
pipe_out = pipe(sample_text)
summaries["pegasus"] = pipe_out[0]["summary_text"].replace("<n>", "\n").replace(" .", ".")
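
The post-processing in the cell above treats `<n>` as a line-break marker and re-attaches periods that are separated from the preceding word by a space. A minimal sketch of that cleanup on its own follows; the helper name `clean_pegasus_summary` and the input string are made up for illustration and are not actual model output.

```python
def clean_pegasus_summary(text: str) -> str:
    """Replace <n> markers with newlines and re-attach detached periods."""
    return text.replace("<n>", "\n").replace(" .", ".")


# Hypothetical raw string, invented for this example (not real PEGASUS output).
raw_summary = "First point about the story .<n>Second point about the story ."

cleaned = clean_pegasus_summary(raw_summary)
print(cleaned)
# First point about the story.
# Second point about the story.
```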

## Comparing Different Summaries

In [11]:
print("GROUND TRUTH")
print(dataset["train"][1]["highlights"])
print("")

for model_name in summaries:
    print(model_name.upper())
    print(summaries[model_name])
    print("")

GROUND TRUTH
Mentally ill inmates in Miami are housed on the "forgotten floor"
Judge Steven Leifman says most are there as a result of "avoidable felonies"
While CNN tours facility, patient shouts: "I am the son of the president"
Leifman says the system is unjust and he's fighting for change .

BASELINE
Editor's note: In our Behind the Scenes series, CNN correspondents share their experiences in covering news and analyze the stories behind the events.
Here, Soledad O'Brien takes users inside a jail where many of the inmates are mentally ill. An inmate housed on the "forgotten floor," where many mentally ill inmates are housed in Miami before trial.
MIAMI, Florida (CNN) -- The ninth floor of the Miami-Dade pretrial detention facility is dubbed the "forgotten floor."

GPT2
- No shoes.
- No bed.
- No mattress.
- Some inmates sleeping on the floor.
- Lacks security.
- Insufficient funds to provide mental health care.

T5
mentally ill inmates are housed on the ninth floor of a florida jail 

## Measuring the Quality of Generated Text

### BLEU

> _The closer a machine translation is to a professional human translation, the better it is._

* The BLEU metric is based primarily on [_precision_](https://en.wikipedia.org/wiki/Positive_and_negative_predictive_values)<p/>
* It only really pays attention to how many n-grams in the translation (generated text) also show up in the _references_ (hopefully human-generated examples of good translations), with each n-gram's count clipped to the number of times it occurs in the reference.<p/>
* It takes the geometric mean of the precisions calculated for several n-gram orders, usually 1-grams through 4-grams. <p/><p/><span style="padding-left:1.5em">e.g., $p_{n} = \frac{\sum_{\text{n-gram } \in \text{ translation}} \text{Count}_{\text{clip}}(\text{n-gram})}{ \sum_{\text{n-gram } \in \text{ translation}} \text{Count}(\text{n-gram})}$</span><br/><br/><span style="padding-left:1.5em">and so $\text{BLEU-4} \sim \sqrt[4]{p_{1} \cdot p_{2} \cdot p_{3} \cdot p_{4}}$</span><p/>
* It also penalizes translations that are shorter than the reference by scaling the above-mentioned geometric mean of the n-gram precisions with a brevity penalty ranging from `0.0` to `1.0`.<br/><span style="padding-left:1.5em">e.g., $\text{BP} = \begin{cases} 1 & \text{ if } c \gt r \\ e^{1 - \frac{r}{c}} & \text{ if } c \leq r \end{cases}$</span><br/>where $r$ is the effective reference corpus length and $c$ is the length of the candidate translation.<p/>
* Putting it all together, we have: <p/><p/><span style="padding-left:1.5em">$\text{BLEU-N} = \text{BP} \times \left( \prod_{n=1}^{N} p_{n} \right)^{\frac{1}{N}} $</span><p/>
* Plain-vanilla BLEU assumes that the translation and reference sentences are already tokenized, with tokens corresponding to single words. Because different tokenization schemes yield different scores, `sacrebleu`, which applies its own standardized tokenization internally, is currently preferred over `bleu`. For the same reason, `bleu`/`sacrebleu` might not work very well for languages where word boundaries are less clear-cut than in English, for example when tokenization happens at the morpheme level.
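
The pieces described in the list above (clipped n-gram precision, geometric mean, brevity penalty) can be sketched in plain Python. This is a simplified illustration, not the `bleu` or `sacrebleu` implementation: it assumes whitespace tokenization, a single reference per candidate, and no smoothing, and the function names are made up for this example.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, reference, n):
    """Clipped n-gram precision p_n for one candidate and one reference."""
    cand_counts = ngrams(candidate, n)
    ref_counts = ngrams(reference, n)
    # Clip each candidate n-gram count by its count in the reference.
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped / total if total else 0.0


def brevity_penalty(c, r):
    """BP = 1 if c > r, else exp(1 - r / c)."""
    return 1.0 if c > r else math.exp(1 - r / c)


def simple_bleu(candidate, reference, max_n=4):
    """Unsmoothed BLEU-N (range 0 to 1) for whitespace-tokenized strings."""
    cand, ref = candidate.split(), reference.split()
    precisions = [modified_precision(cand, ref, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # a single zero precision zeroes the geometric mean
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    return brevity_penalty(len(cand), len(ref)) * geo_mean


reference = "the cat is on the mat"
# Clipping: "the" occurs six times in the candidate but only twice in the reference.
print(modified_precision("the the the the the the".split(), reference.split(), 1))
print(simple_bleu("the cat is on mat", reference))
```

The first call returns 2/6, because the count of `the` is clipped to 2. The second returns roughly 0.579: the geometric mean of the precisions 1, 3/4, 2/3, and 1/2 is about 0.707, and it is scaled by a brevity penalty of $e^{1 - 6/5} \approx 0.819$.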

References:
* Lewis in the [What is the BLEU metric? video on YouTube](https://www.youtube.com/watch?v=M05L1DhFqcw).
* [BLEU: a Method for Automatic Evaluation of Machine Translation](https://aclanthology.org/P02-1040.pdf)... only 8 pages!
* [Rachael Tatman's blogpost _Evaluating Text Output in NLP: BLEU at your own risk_](https://towardsdatascience.com/evaluating-text-output-in-nlp-bleu-at-your-own-risk-e8609665a213)


The `load_metric` API in `datasets` is deprecated, as this warning shows:

> <pre>FutureWarning: load_metric is deprecated and will be removed in the next major version of datasets. Use 'evaluate.load' instead, from the new library 🤗 Evaluate: https://huggingface.co/docs/evaluate</pre>

<span style="background-color: #9AFEFF">This metric implementation has a dependency on the `sacrebleu` library, so you will need to `pip install` that as well!</span>

So we will go ahead and use the [`evaluate`](https://huggingface.co/evaluate-metric) API here.

In [12]:
import evaluate

In [13]:
new_bleu = evaluate.load("bleu")

predictions = ["I have thirty six years"]
references = [["I am thirty six years old", "I am thirty six"]]

new_bleu.compute(
    predictions=predictions,
    references=references
)

{'bleu': 0.0,
 'precisions': [0.8, 0.5, 0.3333333333333333, 0.0],
 'brevity_penalty': 1.0,
 'length_ratio': 1.25,
 'translation_length': 5,
 'reference_length': 4}

In [14]:
new_sacrebleu = evaluate.load("sacrebleu")

predictions = ["I have thirty six years"]
references = [["I am thirty six years old", "I am thirty six"]]

new_sacrebleu.compute(
    predictions=predictions,
    references=references,
    smooth_method="floor",
    smooth_value=0
)

{'score': 0.0,
 'counts': [4, 2, 1, 0],
 'totals': [5, 4, 3, 2],
 'precisions': [80.0, 50.0, 33.333333333333336, 0.0],
 'bp': 1.0,
 'sys_len': 5,
 'ref_len': 4}

### BLEU (`sacrebleu`, actually) via the `evaluate` API 

#### Inputs

* `predictions`: list of translations to score
* `references`: list of lists of references
* `smooth_method`: defaults to `exp` exponential decay; choose from `none`, `floor`, `add-k`, or `exp`
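* `smooth_value`: `float`; the smoothing value, only used when `smooth_method` is `floor` or `add-k`
* `tokenize`: tokenization method to apply to the inputs, e.g. `none`, `13a`, `intl`, or `char`; defaults to `13a` for most languages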
* `lowercase`: enable/disable case-insensitivity; defaults to `False`
* `force`: assume input is actually detokenized; defaults to `False`
* `use_effective_order`: flag to stop inclusion of n-gram orders for which precision is `0`, so use `True` for sentence-level BLEU computations; defaults to `False`

#### Outputs

* `score`: BLEU score, ranging from `0.0` to `100.0`, inclusive
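* `counts`: number of matching (clipped) n-grams, one entry per n-gram order
* `totals`: total number of n-grams in the predictions, one entry per n-gram order
* `precisions`: `counts / totals` expressed as percentages, one entry per n-gram order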
* `bp`: Brevity penalty
* `sys_len`: predictions length
* `ref_len`: reference length
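
To make the relationship between these output fields concrete, here is a small sketch that recombines `counts`, `totals`, `sys_len`, and `ref_len` into `precisions`, `bp`, and `score`. It is an illustration under simplifying assumptions (no smoothing, which corresponds to `smooth_method="floor"` with `smooth_value=0` as used in this notebook, and `use_effective_order=False`), not the `sacrebleu` code itself; the function name `bleu_from_fields` is made up for this example.

```python
import math


def bleu_from_fields(counts, totals, sys_len, ref_len):
    """Recombine sacrebleu-style output fields into (precisions, bp, score)."""
    # Precision per n-gram order, as a percentage.
    precisions = [100.0 * c / t if t else 0.0 for c, t in zip(counts, totals)]
    # Brevity penalty: 1 when the prediction is longer than the reference.
    bp = 1.0 if sys_len > ref_len else math.exp(1 - ref_len / sys_len)
    if min(precisions) == 0.0:
        score = 0.0  # without smoothing, one zero precision zeroes the geometric mean
    else:
        score = bp * math.exp(sum(math.log(p) for p in precisions) / len(precisions))
    return precisions, bp, score


# Field values copied from two of the sacrebleu outputs in this notebook.
print(bleu_from_fields([4, 2, 1, 0], [5, 4, 3, 2], sys_len=5, ref_len=4))
print(bleu_from_fields([5, 3, 2, 1], [5, 4, 3, 2], sys_len=5, ref_len=6))
```

The first call shows why a prediction with no matching 4-gram scores `0.0` even though its lower-order precisions are high. The second gives a score of about 57.89 with a brevity penalty of about 0.819.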


In [15]:
import pandas as pd
import numpy as np

predictions = ["the the the the the the"]
references = [["the cat is on the mat"]]

results = new_sacrebleu.compute(
    predictions=predictions,
    references=references,
    smooth_method="floor",
    smooth_value=0
)
results["precisions"] = [np.round(p,2) for p in results["precisions"]]

pd.DataFrame.from_dict(
    results,
    orient="index",
    columns=["value"]
)

| | value |
|---|---|
| score | 0.0 |
| counts | [2, 0, 0, 0] |
| totals | [6, 5, 4, 3] |
| precisions | [33.33, 0.0, 0.0, 0.0] |
| bp | 1.0 |
| sys_len | 6 |
| ref_len | 6 |


In [16]:
predictions = ["the cat is on mat"]
references = [["the cat is on the mat"]]

results = new_sacrebleu.compute(
    predictions=predictions,
    references=references,
    smooth_method="floor",
    smooth_value=0
)
results["precisions"] = [np.round(p,2) for p in results["precisions"]]

pd.DataFrame.from_dict(
    results,
    orient="index",
    columns=["value"]
)

| | value |
|---|---|
| score | 57.893007 |
| counts | [5, 3, 2, 1] |
| totals | [5, 4, 3, 2] |
| precisions | [100.0, 75.0, 66.67, 50.0] |
| bp | 0.818731 |
| sys_len | 5 |
| ref_len | 6 |


### ROUGE

> The ROUGE score was specifically developed for applications like summarization where high [_recall_](https://en.wikipedia.org/wiki/Sensitivity_and_specificity) is more important than just precision.

References:
* Lewis in the [What is the ROUGE metric? video on YouTube](https://www.youtube.com/watch?v=TMshhnrEXlg)
* [ROUGE: A Package for Automatic Evaluation of Summaries](https://aclanthology.org/W04-1013.pdf)... again, only 8 pages!
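
As a rough illustration of what the ROUGE variants measure, the sketch below computes ROUGE-N from n-gram overlap and ROUGE-L from the longest common subsequence (LCS), combining precision and recall into an F1 score. It assumes lowercased, whitespace-split tokens and a single reference, so its numbers will not exactly match the `rouge_score` package, which tokenizes differently; `rougeLsum` is not covered. All function names are made up for this example.

```python
from collections import Counter


def _ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def _f1(overlap, pred_total, ref_total):
    """Combine precision (overlap / pred) and recall (overlap / ref) into F1."""
    if overlap == 0:
        return 0.0
    precision, recall = overlap / pred_total, overlap / ref_total
    return 2 * precision * recall / (precision + recall)


def rouge_n(prediction, reference, n=1):
    """ROUGE-N F1 on lowercased, whitespace-split tokens."""
    pred = _ngrams(prediction.lower().split(), n)
    ref = _ngrams(reference.lower().split(), n)
    overlap = sum((pred & ref).values())  # multiset intersection of n-grams
    return _f1(overlap, sum(pred.values()), sum(ref.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence, via dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(prediction, reference):
    """ROUGE-L F1 based on the longest common subsequence."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    return _f1(lcs_length(pred, ref), len(pred), len(ref))


prediction = "the cat is on mat"
reference = "the cat is on the mat"
print(rouge_n(prediction, reference, n=1))  # precision 5/5, recall 5/6
print(rouge_n(prediction, reference, n=2))  # precision 3/4, recall 3/5
print(rouge_l(prediction, reference))       # LCS length 5
```

Unlike BLEU, recall enters the score directly: dropping a word that appears in the reference lowers ROUGE through the recall term rather than through a separate brevity penalty.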


<span style="background-color: #9AFEFF">This metric implementation has a dependency on the `absl-py` and `rouge_score` libraries, so you will need to `pip install rouge_score` as well!</span>

In [31]:
# let's use the rouge metric implementation in evaluate!
rouge = evaluate.load("rouge")

In [23]:
reference = dataset["train"][1]["highlights"]
records = []
rouge_names = ["rouge1", "rouge2", "rougeL", "rougeLsum"]

In [26]:
for model_name in summaries:
    results = rouge.compute(
        predictions=[summaries[model_name]],
        references=[reference]
    )
    records.append(results)
    
pd.DataFrame.from_records(records, index=summaries.keys())

| | rouge1 | rouge2 | rougeL | rougeLsum |
|---|---|---|---|---|
| baseline | 0.365079 | 0.145161 | 0.206349 | 0.285714 |
| gpt2 | 0.114286 | 0.029412 | 0.114286 | 0.114286 |
| t5 | 0.382979 | 0.130435 | 0.255319 | 0.382979 |
| bart | 0.475248 | 0.222222 | 0.316832 | 0.415842 |
| pegasus | 0.326531 | 0.208333 | 0.285714 | 0.326531 |


## Evaluating PEGASUS on the CNN/DailyMail Dataset

In [27]:
def evaluate_summaries_baseline(
    dataset,
    metric,
    column_text="article",
    column_summary="highlights"
):
    summaries = [
        three_sentence_summary(text) 
        for text in dataset[column_text]]
    score = metric.compute(
        predictions=summaries,
        references=dataset[column_summary]
    )
    return score

In [38]:
test_sampled = dataset["test"].shuffle(seed=42).select(range(1000))

score = evaluate_summaries_baseline(
    test_sampled,
    rouge
)
#rouge_dict = dict((rn, score[rn].mid.fmeasure) for rn in rouge_names)
pd.DataFrame.from_dict(
    score, 
    orient="index", 
    columns=["baseline"]
).T



| | rouge1 | rouge2 | rougeL | rougeLsum |
|---|---|---|---|---|
| baseline | 0.389276 | 0.171296 | 0.245061 | 0.354239 |
