# Chapter 6: Summarization

## The CNN/DailyMail Dataset

* ~300k pairs of news articles and their corresponding summaries
* summaries are _abstractive_
* [`cnn_dailymail` dataset viewer at HF](https://huggingface.co/datasets/viewer/?dataset=cnn_dailymail&config=3.0.0)
* also see the [Dataset card for `cnn_dailymail` at HF](https://huggingface.co/datasets/cnn_dailymail)

In [None]:
from datasets import load_dataset

dataset = load_dataset(
    "cnn_dailymail",
    "3.0.0"  # config name (the dataset version), passed positionally
)

print(f"Features: {dataset['train'].column_names}")

In [None]:
sample = dataset["train"][1]

print(f"\nArticle (excerpt of 500 characters, total length: {len(sample['article'])}):")
print(sample["article"][:500])
print(f"\nSummary (length: {len(sample['highlights'])}):")
print(sample["highlights"])

## Text Summarization Pipelines

This section requires `nltk`, so be sure to install it before going any further.

In [None]:
sample_text = dataset["train"][1]["article"][:2000]

summaries = {}

In [None]:
import nltk
from nltk.tokenize import sent_tokenize

nltk.download("punkt")

string = "The U.S. are a country. The U.N. is an organization."
sent_tokenize(string)
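
Why reach for a trained sentence tokenizer rather than splitting on periods? The following is a minimal sketch, using only the standard library, of what a naive rule does to the same test string. The helper `naive_split` and its regular expression are illustrative assumptions and are not part of `nltk`.

```python
import re

text = "The U.S. are a country. The U.N. is an organization."

def naive_split(text):
    """Split after every period that is followed by whitespace (deliberately naive)."""
    return [piece for piece in re.split(r"(?<=\.)\s+", text) if piece]

naive_sentences = naive_split(text)
for piece in naive_sentences:
    print(repr(piece))
```

The naive rule also breaks after `U.S.` and `U.N.`, so it yields four fragments, whereas `sent_tokenize` is expected to keep the abbreviations intact and return two sentences.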

## Summarization Baseline

In [None]:
def three_sentence_summary(text):
    return "\n".join(sent_tokenize(text)[:3])

summaries["baseline"] = three_sentence_summary(sample_text)

In [None]:
summaries

## GPT-2

See the [`gpt2-xl` model details on HF](https://huggingface.co/gpt2-xl#model-details).

In [None]:
from transformers import pipeline, set_seed

set_seed(42)
pipe = pipeline("text-generation", model="gpt2-xl")
gpt2_query = sample_text + "\nTL;DR:\n"
pipe_out = pipe(
    gpt2_query,
    max_length=512,
    clean_up_tokenization_spaces=True
)
summaries["gpt2"] = "\n".join(sent_tokenize(pipe_out[0]['generated_text'][len(gpt2_query):]))
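
The slice `[len(gpt2_query):]` in the cell above is there because the text-generation pipeline returns the prompt followed by the continuation. A small sketch of that step, with a hand-written stand-in for the pipeline output (the `fake_pipe_out` list and its strings are invented for illustration; no model is loaded):

```python
# Hypothetical stand-in for the list of dicts a text-generation pipeline returns.
prompt = "Some article text.\nTL;DR:\n"
fake_pipe_out = [{"generated_text": prompt + "A short summary. With two sentences."}]

# Drop the echoed prompt so that only the newly generated text remains.
continuation = fake_pipe_out[0]["generated_text"][len(prompt):]
print(continuation)
```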

## T5

See the [`t5-large` model details on HF](https://huggingface.co/t5-large#model-details).

In [None]:
pipe = pipeline("summarization", model="t5-large")
pipe_out = pipe(sample_text)
summaries["t5"] = "\n".join(sent_tokenize(pipe_out[0]['summary_text']))

## BART

See the [`facebook/bart-large-cnn` model card on HF](https://huggingface.co/facebook/bart-large-cnn).

In [None]:
pipe = pipeline("summarization", model="facebook/bart-large-cnn")
pipe_out = pipe(sample_text)
summaries["bart"] = "\n".join(sent_tokenize(pipe_out[0]['summary_text']))

## PEGASUS

See the [`google/pegasus-cnn_dailymail` model card on HF](https://huggingface.co/google/pegasus-cnn_dailymail).

<span style="background-color: #9AFEFF">This model has a dependency on the `protobuf` library, so you will need to install that as well!</span>

In [None]:
pipe = pipeline("summarization", model="google/pegasus-cnn_dailymail")
pipe_out = pipe(sample_text)
summaries["pegasus"] = pipe_out[0]["summary_text"].replace("<n>", "\n").replace(" .", ".")
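
PEGASUS marks line breaks with a special `<n>` token, which is why the cell above uses string replacement instead of `sent_tokenize`. A tiny sketch of that clean-up on an invented string (the `raw_summary` text is made up for illustration and is not model output):

```python
# Invented example in the shape of a PEGASUS summary: "<n>" separates lines
# and a stray space precedes each period.
raw_summary = "First sentence .<n>Second sentence ."

clean_summary = raw_summary.replace("<n>", "\n").replace(" .", ".")
print(clean_summary)
```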

## Comparing Different Summaries

In [None]:
print("GROUND TRUTH")
print(dataset["train"][1]["highlights"])
print("")

for model_name in summaries:
    print(model_name.upper())
    print(summaries[model_name])
    print("")

## Measuring the Quality of Generated Text

### BLEU

The BLEU metric is based primarily on [_precision_](https://en.wikipedia.org/wiki/Positive_and_negative_predictive_values): it measures how many n-grams in the generated text (the translation) also appear in the references (ideally human-written examples of good translations).

References:
* Lewis in the [What is the BLEU metric? video on YouTube](https://www.youtube.com/watch?v=M05L1DhFqcw).
* [BLEU: a Method for Automatic Evaluation of Machine Translation](https://aclanthology.org/P02-1040.pdf)... only 8 pages!
* [Rachael Tatman's blog post _Evaluating Text Output in NLP: BLEU at your own risk_](https://towardsdatascience.com/evaluating-text-output-in-nlp-bleu-at-your-own-risk-e8609665a213)


Note that `load_metric` from `datasets` is deprecated in favor of `evaluate.load`; calling it emits the following warning:

> <pre>FutureWarning: load_metric is deprecated and will be removed in the next major version of datasets. Use 'evaluate.load' instead, from the new library 🤗 Evaluate: https://huggingface.co/docs/evaluate</pre>

<span style="background-color: #9AFEFF">This metric implementation has a dependency on the `sacrebleu` library, so you will need to install that as well!</span>

In [None]:
# this is how it is done in the book...
from datasets import load_metric

bleu_metric = load_metric("sacrebleu")

In [None]:
import evaluate

# replacement suggested by the FutureWarning above; `foo_metric` is not used in the cells below
foo_metric = evaluate.load("sacrebleu")

In [None]:
import pandas as pd
import numpy as np

bleu_metric.add(
    prediction="the the the the the the",
    reference=["the cat is on the mat"]
)

results = bleu_metric.compute(
    smooth_method="floor",
    smooth_value=0
)
results["precisions"] = [np.round(p,2) for p in results["precisions"]]

pd.DataFrame.from_dict(
    results,
    orient="index",
    columns=["value"]
)
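
To see where the low unigram precision for this example comes from, here is a minimal sketch of BLEU's clipped (modified) n-gram precision in plain Python. It assumes whitespace tokenization, which is simpler than the tokenizer `sacrebleu` applies but gives the same tokens for this sentence pair; the helper names `ngrams` and `clipped_precision` are illustrative, not library functions.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(prediction, reference, n):
    """Count prediction n-grams, crediting each at most as often as it occurs in the reference."""
    pred_counts = Counter(ngrams(prediction.split(), n))
    ref_counts = Counter(ngrams(reference.split(), n))
    matches = sum(min(count, ref_counts[gram]) for gram, count in pred_counts.items())
    total = sum(pred_counts.values())
    return matches, total

matches, total = clipped_precision("the the the the the the", "the cat is on the mat", n=1)
print(f"unigram precision: {matches}/{total} = {100 * matches / total:.2f}")
```

Without clipping, all six occurrences of "the" would count as matches; clipping caps the credit at the two occurrences in the reference.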

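The index of the table holds the keys returned by `compute()`:
* `score`: the BLEU score on a 0 to 100 scale
* `counts`: number of prediction n-grams that match the reference (clipped), for n = 1 to 4
* `totals`: total number of n-grams in the prediction, for n = 1 to 4
* `precisions`: `counts` divided by `totals`, as percentages
* `bp`: the brevity penalty, which drops below 1 when the prediction is shorter than the reference
* `sys_len`: length of the prediction in tokens
* `ref_len`: length of the reference in tokens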

In [None]:
bleu_metric.add(
    prediction="the cat is on mat",
    reference=["the cat is on the mat"]
)

results = bleu_metric.compute(
    smooth_method="floor",
    smooth_value=0
)
results["precisions"] = [np.round(p,2) for p in results["precisions"]]

pd.DataFrame.from_dict(
    results,
    orient="index",
    columns=["value"]
)
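
The score for this second example can be reproduced from the n-gram counts and the two lengths. The sketch below combines the geometric mean of the four precisions with the brevity penalty. The function `bleu_from_counts` is a hypothetical helper, the counts are worked out by hand for this prediction/reference pair under whitespace tokenization, and returning 0 for any zero precision is a simplification of how unsmoothed BLEU behaves.

```python
import math

def bleu_from_counts(counts, totals, sys_len, ref_len):
    """Combine clipped n-gram matches into a BLEU score on a 0-100 scale."""
    precisions = [c / t for c, t in zip(counts, totals)]
    if min(precisions) == 0:
        return 0.0  # simplification: no smoothing, so any zero precision zeroes the score
    # Brevity penalty: penalize predictions shorter than the reference.
    bp = 1.0 if sys_len > ref_len else math.exp(1 - ref_len / sys_len)
    geometric_mean = math.exp(sum(math.log(p) for p in precisions) / len(precisions))
    return 100 * bp * geometric_mean

# "the cat is on mat" vs. "the cat is on the mat":
# matched n-grams [5, 3, 2, 1] out of [5, 4, 3, 2], for n = 1 to 4.
score = bleu_from_counts(counts=[5, 3, 2, 1], totals=[5, 4, 3, 2], sys_len=5, ref_len=6)
print(round(score, 2))
```

Every unigram in the prediction matches, yet the missing "the" costs the prediction both through the higher-order precisions and through the brevity penalty of `exp(1 - 6/5)`.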

### ROUGE

> The ROUGE score was specifically developed for applications like summarization where high [_recall_](https://en.wikipedia.org/wiki/Sensitivity_and_specificity) is more important than just precision.

References:
* Lewis in the [What is the ROUGE metric? video on YouTube](https://www.youtube.com/watch?v=TMshhnrEXlg)
* [ROUGE: A Package for Automatic Evaluation of Summaries](https://aclanthology.org/W04-1013.pdf)... again, only 8 pages!


<span style="background-color: #9AFEFF">This metric implementation has a dependency on the `absl-py` and `rouge_score` libraries, so you will need to install them as well!</span>

In [None]:
rouge_metric = load_metric("rouge")
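
As a rough illustration of what `rouge1` and `rouge2` measure, the sketch below computes n-gram overlap and reports precision, recall and F1. It assumes lowercasing and whitespace tokenization, which is simpler than the preprocessing in `rouge_score`, and `rouge_n` is a hypothetical helper rather than a library function.

```python
from collections import Counter

def rouge_n(prediction, reference, n=1):
    """Return (precision, recall, f1) based on overlapping n-grams."""
    def ngram_counts(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    pred, ref = ngram_counts(prediction), ngram_counts(reference)
    overlap = sum((pred & ref).values())  # Counter intersection keeps the minimum count
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return precision, recall, 2 * precision * recall / (precision + recall)

p, r, f = rouge_n("the cat is on mat", "the cat is on the mat", n=1)
print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

Recall divides the overlap by the number of n-grams in the reference, so a summary that leaves out reference content is penalized even when everything it does say is correct.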

In [None]:
reference = dataset["train"][1]["highlights"]
records = []
rouge_names = ["rouge1", "rouge2", "rougeL", "rougeLsum"]

for model_name in summaries:
    rouge_metric.add(prediction=summaries[model_name], reference=reference)
    score = rouge_metric.compute()
    rouge_dict = dict((rn, score[rn].mid.fmeasure) for rn in rouge_names)
    records.append(rouge_dict)

pd.DataFrame.from_records(records, index=summaries.keys())
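
The `rougeL` column is based on the longest common subsequence (LCS) between prediction and reference rather than on fixed-size n-grams. A minimal sketch of that idea, again assuming lowercasing and whitespace tokenization; `lcs_length` and `rouge_l` are illustrative helpers, not the `rouge_score` implementation:

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l(prediction, reference):
    """Return (precision, recall, f1) derived from the LCS length."""
    pred_tokens, ref_tokens = prediction.lower().split(), reference.lower().split()
    lcs = lcs_length(pred_tokens, ref_tokens)
    if lcs == 0:
        return 0.0, 0.0, 0.0
    precision, recall = lcs / len(pred_tokens), lcs / len(ref_tokens)
    return precision, recall, 2 * precision * recall / (precision + recall)

# Same words as the reference, but in a different order.
lcs = lcs_length("the mat is on the cat".split(), "the cat is on the mat".split())
p, r, f = rouge_l("the mat is on the cat", "the cat is on the mat")
print(lcs, round(f, 3))
```

All six unigrams of this prediction occur in the reference, but only four tokens ("the", "is", "on", "the") appear in the same order, so the LCS-based score is lower than the unigram overlap would suggest.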