# 6 - Summarization

If you think about it, text summarization requires a range of abilities, such as understanding long passages, reasoning about the contents, and producing fluent text that incorporates the main topics from the original document. Moreover, accurately summarizing a news article is very different from summarizing a legal contract, so being able to do so requires a sophisticated degree of domain generalization. For these reasons, text summarization is a difficult task for neural language models, including transformers.

Despite these challenges, text summarization offers the prospect for domain experts to significantly speed up their workflows and is used by enterprises to condense internal knowledge, summarize contracts, automatically generate content for social media releases, and more.

In this chapter we will build our own encoder-decoder model to condense dialogues between several people into a crisp summary. 

## 6.1 - The CNN/DailyMail Dataset

Before we dive into the summarization process, let's begin by taking a look at one of the canonical datasets for summarization: the CNN/DailyMail corpus. This dataset consists of around 300,000 pairs of news articles and their corresponding summaries, composed from the bullet points that CNN and the DailyMail attach to their articles.

<span style="color:blue">An important aspect of the dataset is that the summaries are <b>abstractive</b> and not <b>extractive</b>, which means that they consist of new sentences instead of simple excerpts.</span> [**The dataset is available on the Hub**](https://huggingface.co/datasets/cnn_dailymail); we'll use version 3.0.0, which is a nonanonymized version set up for summarization. 

In [None]:
from datasets import load_dataset

dataset = load_dataset("cnn_dailymail", "3.0.0")  # "3.0.0" selects the dataset configuration
print(f"Features: {dataset['train'].column_names}")

The dataset has three columns: `article`, which contains the news articles, `highlights` with the summaries, and `id` to uniquely identify each article. Let's look at an excerpt from an article:

In [None]:
sample = dataset["train"][1]
print(f"""Article (excerpt of 500 characters, total length: {len(sample["article"])}):""")
print(sample["article"][:500])
print(f'\nSummary (length: {len(sample["highlights"])}):')
print(sample["highlights"])

We see that the articles can be very long compared to the target summary; in this particular case the difference is 17-fold. Long articles pose a challenge to most transformer models since the context size is usually limited to 1000 tokens or so, which is equivalent to a few paragraphs of text. The standard, yet crude way to deal with this for summarization is to simply truncate the texts beyond the model's context size. Obviously there could be important information for the summary toward the end of the text, but for now we need to live with this limitation of the model architectures.
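To make the truncation strategy concrete, here is a minimal sketch. It assumes that whitespace-separated words stand in for the model's subword tokens, which is only a rough approximation; the helper name `truncate_to_context` and the limit of 1,000 are illustrative and not part of any library. In practice the model's own tokenizer would usually handle this through its truncation options.

```python
def truncate_to_context(text: str, max_tokens: int = 1000) -> str:
    """Keep only the first `max_tokens` whitespace-separated tokens."""
    # Assumption: whitespace tokens approximate the model's subword tokens.
    tokens = text.split()
    return " ".join(tokens[:max_tokens])


# A synthetic "article" of 1,500 words: everything after word 1,000 is dropped.
long_text = " ".join(f"word{i}" for i in range(1500))
truncated = truncate_to_context(long_text, max_tokens=1000)
print(len(truncated.split()))
```

Anything after the cut is simply invisible to the model, which is exactly the limitation described above.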



## 6.2 - Text summarization pipelines

Let's see how a few of the most popular transformer models for summarization perform by first looking qualitatively at the outputs for the preceding example. Although the model architectures we will be exploring have varying maximum input sizes, let's restrict the input text to 2000 characters to have the same input for all models and thus make the outputs more comparable:



In [None]:
sample_text = dataset["train"][1]["article"][:2000]
# We'll collect the generated summaries of each model in a dictionary
summaries = {}

A convention in summarization is to separate the summary sentences by a newline. We could add a newline token after each full stop, but this simple heuristic would fail for strings like "U.S." or "U.N.". The Natural Language Toolkit (NLTK) package includes a more sophisticated algorithm that can differentiate the end of a sentence from punctuation that occurs in abbreviations:



In [None]:
import nltk
from nltk.tokenize import sent_tokenize
nltk.download("punkt")
string = "The U.S. are a country. The U.N. is an organization."
sent_tokenize(string)
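
For comparison, the naive heuristic of splitting after every full stop can be sketched in a few lines of plain Python. The helper `naive_sentence_split` is hypothetical and only meant to show why the heuristic breaks on abbreviations:

```python
def naive_sentence_split(text: str) -> list:
    """Split after every full stop that is followed by a space."""
    parts = text.split(". ")
    # Re-attach the full stop that `split` removed from all but the last part.
    return [part + "." for part in parts[:-1]] + [parts[-1]]


string = "The U.S. are a country. The U.N. is an organization."
# The two sentences are wrongly broken into four fragments.
print(naive_sentence_split(string))
```

The abbreviations "U.S." and "U.N." are each treated as the end of a sentence, so the heuristic returns four pieces for a text that contains two sentences.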

## 6.3 - Summarization baseline

A common baseline for summarizing news articles is to simply take the first three sentences of the article. With NLTK's sentence tokenizer, we can easily implement such a baseline:



In [None]:
def three_sentence_summary(text):
    return "\n".join(sent_tokenize(text)[:3])

summaries["baseline"] = three_sentence_summary(sample_text)

### 6.3.1 - GPT-2

We have already seen in Chapter 5 how GPT-2 ([Radford et al., 2019](https://openai.com/blog/better-language-models/)) can generate text given some prompt. One of the model's surprising features is that we can also use it to generate summaries by simply appending "TL;DR" at the end of the input text. This expression is often used on platforms like Reddit to indicate a short version of a long post. 

We will start our summarization experiment by re-creating the procedure of the original paper with the `pipeline()` function from 🤗 Transformers. We create a text generation pipeline and load the GPT-2 model:

In [None]:
from transformers import pipeline, set_seed
set_seed(42)
pipe = pipeline("text-generation", model="gpt2") # 117M parameters
gpt2_query = sample_text + "\nTL;DR:\n"
pipe_out = pipe(gpt2_query, max_length=512, clean_up_tokenization_spaces=True)
summaries["gpt2"] = "\n".join(
    sent_tokenize(pipe_out[0]["generated_text"][len(gpt2_query):]))

Here we just store the summaries of the generated text by slicing off the input query and keep the result in a Python dictionary for later comparison.
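The slicing step can be illustrated in isolation. The sketch below uses a hand-written, hypothetical pipeline output (`mock_pipe_out`) rather than a real model call; it assumes the output contains the prompt followed by the continuation, which is why the first `len(prompt)` characters are dropped.

```python
# Hypothetical output: the prompt followed by the generated continuation.
prompt = "Some long article text.\nTL;DR:\n"
mock_pipe_out = [{"generated_text": prompt + "A short summary. With two sentences."}]

# Drop the prompt so that only the newly generated text remains.
continuation = mock_pipe_out[0]["generated_text"][len(prompt):]
print(continuation)
```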

### 6.3.2 - T5

Next let's try the T5 transformer ([Raffel et al., 2019](https://arxiv.org/abs/1910.10683)). As we saw in Chapter 3, the developers of this model performed a comprehensive study of transfer learning in NLP and found they could create a universal transformer architecture by formulating all tasks as text-to-text tasks. The T5 checkpoints are trained on a mixture of unsupervised data (to reconstruct masked words) and supervised data for several tasks, including summarization. These checkpoints can thus be directly used to perform summarization without fine-tuning by using the same prompts used during pretraining. In this framework, the input format for the model to summarize a document is `"summarize:" <ARTICLE>`, and for translation it looks like `"translate English to German:" <TEXT>`. This makes T5 extremely versatile and allows us to solve many tasks with a single model.
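As a small illustration of the text-to-text idea, the following sketch builds such inputs by hand. The helper `to_text_to_text` is a hypothetical name introduced here, and the exact spacing after the colon is an assumption about the prompt format:

```python
def to_text_to_text(task_prefix: str, text: str) -> str:
    """Prepend a task prefix so that every task becomes plain text in, text out."""
    return f"{task_prefix}: {text}"


print(to_text_to_text("summarize", "The quick brown fox jumped over the lazy dog."))
print(to_text_to_text("translate English to German", "I love dogs!"))
```

Only the prefix changes between tasks; the model and the input/output types stay the same.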

We can directly load T5 for summarization with the `pipeline()` function, which also takes care of formatting the inputs in the text-to-text format, so we don't need to prepend them with `"summarize:"`.

<img src="images/t5_examples.png" title="" alt="" width="700" data-align="center">

In [None]:
pipe = pipeline("summarization", model="t5-small") # We could also try t5-base, with 220M parameters
pipe_out = pipe(sample_text)
summaries["t5"] = "\n".join(sent_tokenize(pipe_out[0]["summary_text"]))

### 6.3.3 - BART

BART ([Lewis et al., 2019](https://arxiv.org/abs/1910.13461)) also uses an encoder-decoder architecture and is trained to reconstruct corrupted inputs. It combines the pretraining schemes of BERT and GPT-2. We'll use the `facebook/bart-large-cnn` checkpoint, which has been specifically fine-tuned on the CNN/DailyMail dataset:

In [None]:
pipe = pipeline("summarization", model="facebook/bart-large-cnn") # roughly 400M parameters
pipe_out = pipe(sample_text)
summaries["bart"] = "\n".join(sent_tokenize(pipe_out[0]["summary_text"]))

### 6.3.4 - PEGASUS

PEGASUS ([Zhang et al., 2019](https://arxiv.org/abs/1912.08777)) is also an encoder-decoder transformer. Its pretraining objective is to predict masked sentences in multisentence texts. The authors argue that the closer the pretraining objective is to the downstream task, the more effective it is. With the aim of finding a pretraining objective that is closer to summarization than general language modeling, they automatically identified, in a very large corpus, sentences containing most of the content of their surrounding paragraphs (using summarization evaluation metrics as a heuristic for content overlap) and pretrained the PEGASUS model to reconstruct these sentences, thereby obtaining a state-of-the-art model for text summarization.

<img src="images/pegasus_architecture.png" title="" alt="" width="700" data-align="center">

In [None]:
# Note on size: Pegasus is 568M parameters, which is considerably larger than BART-base and T5-base
pipe = pipeline("summarization", model="google/pegasus-cnn_dailymail")
pipe_out = pipe(sample_text)
summaries["pegasus"] = pipe_out[0]["summary_text"].replace(" .<n>", ".\n")

## 6.4 - Comparing different summaries

Now that we have generated summaries with four different models, let's compare the results. Keep in mind that one model has not been trained on the dataset at all (GPT-2), one model has been fine-tuned on this task among others (T5), and two models have exclusively been fine-tuned on this task (BART and PEGASUS). Let's have a look at the summaries these models have generated:



In [None]:
print("GROUND TRUTH")
print(dataset["train"][1]["highlights"])
print("")
for model_name in summaries:
    print(model_name.upper())
    print(summaries[model_name])
    print("")

## 6.5 - Measuring the quality of generated text

Good evaluation metrics are important, since we use them to measure the performance of models not only when we train them but also later, in production. If we have bad metrics we might be blind to model degradation, and if they are misaligned with the business goals we might not create any value.

Measuring performance on a text generation task is not as easy as with standard classification tasks such as sentiment analysis or named entity recognition. Take the example of translation: given a sentence like "I love dogs!" in English, there can be multiple valid Spanish translations, like "¡Me encantan los perros!" or "¡Me gustan los perros!"

<span style="color:blue">Simply checking for an exact match to a reference translation is not optimal; even humans would fare badly on such a metric because we all write text slightly differently from each other (and even from ourselves, depending on the time of the day or year!). Fortunately, there are alternatives. </span>

Two of the most common metrics used to evaluate generated text are **BLEU** and **ROUGE**. Let's take a look at how they are defined.

### 6.5.1 - BLEU

The idea of BLEU ([Papineni et al., 2002](https://dl.acm.org/doi/10.3115/1073083.1073135)) is simple: instead of looking at how many of the tokens in the generated texts are perfectly aligned with the reference text tokens, we look at words or $n$-grams. BLEU is a precision-based metric, which means that when we compare two texts we count the number of words in the generation that occur in the reference and divide it by the length of the generation.

However, there is an issue with this vanilla precision. Assume the generated text just repeats the same word over and over again, and this word also appears in the reference. If it is repeated as many times as the length of the reference text, then we get perfect precision! 

For this reason, the authors of the BLEU paper introduced a slight modification: a word is only counted as many times as it occurs in the reference. To illustrate this point, suppose we have the reference text "the cat is on the mat" and the generated text is "the the the the the the". From this simple example, we can calculate the precision values as follows:

$$
p_{vanilla} = \frac{6}{6}
$$

$$
p_{mod} = \frac{2}{6}
$$
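The two values can be reproduced with a few lines of Python. This is a minimal sketch that assumes whitespace tokenization; the function name `unigram_precisions` is introduced here for illustration only.

```python
from collections import Counter


def unigram_precisions(generated: str, reference: str):
    """Return (vanilla, clipped) unigram precision of `generated` against `reference`."""
    gen_tokens = generated.split()
    gen_counts = Counter(gen_tokens)
    ref_counts = Counter(reference.split())
    # Vanilla: every generated word that appears anywhere in the reference counts.
    vanilla = sum(1 for token in gen_tokens if token in ref_counts)
    # Clipped: a word counts at most as often as it occurs in the reference.
    clipped = sum(min(count, ref_counts[token]) for token, count in gen_counts.items())
    return vanilla / len(gen_tokens), clipped / len(gen_tokens)


p_vanilla, p_mod = unigram_precisions("the the the the the the", "the cat is on the mat")
print(p_vanilla, p_mod)
```

Since "the" occurs twice in the reference, only two of the six generated words are credited after clipping.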

As we can see, that simple correction has produced a much more reasonable value. Now let's extend this by not only looking at single words, but $n$-grams as well. Let's assume we have one generated sentence, $snt$, that we want to compare against a reference sentence $snt'$. We extract all possible $n$-grams of degree $n$ and do the accounting to get the precision $p_{n}$:

$$
p_{n} = \frac{\sum_{n\text{-gram} \ \in \ snt} \text{Count}_{clip}(n\text{-gram})}{\sum_{n\text{-gram} \ \in \ snt} \text{Count}(n\text{-gram})}
$$

In order to avoid rewarding repetitive generations, the count in the numerator is clipped. What this means is that the occurrence count of an $n$-gram is capped at how many times it appears in the reference sentence. Also note that the definition of a sentence is not very strict in this equation, and if you had a generated text spanning multiple sentences you would treat it as one sentence.

In general, we have more than one sample in the test set we want to evaluate, so we need to slightly extend the equation by summing over all samples in the corpus $C$ (we are assuming that $C$ contains both the original sentences and the generated ones):

$$
p_{n} = \frac{\sum_{snt \ \in \ C}\sum_{n\text{-gram} \ \in \ snt} \text{Count}_{clip}(n\text{-gram})}{\sum_{snt \ \in \ C} \sum_{n\text{-gram} \ \in \ snt} \text{Count}(n\text{-gram})}
$$
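A corpus-level version of the clipped precision might look like the sketch below. It assumes whitespace tokenization and a single reference per generated text, and the helper names are hypothetical. Because this is a precision, the denominator counts the $n$-grams of the generated texts.

```python
from collections import Counter


def ngram_counts(text: str, n: int) -> Counter:
    """Count all n-grams of order `n` in a whitespace-tokenized text."""
    tokens = text.split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_modified_precision(pairs, n: int) -> float:
    """Clipped n-gram precision over an iterable of (generated, reference) pairs."""
    clipped_total, generated_total = 0, 0
    for generated, reference in pairs:
        gen_counts = ngram_counts(generated, n)
        ref_counts = ngram_counts(reference, n)
        # Cap each n-gram count at its number of occurrences in the reference.
        clipped_total += sum(min(c, ref_counts[g]) for g, c in gen_counts.items())
        generated_total += sum(gen_counts.values())
    return clipped_total / generated_total if generated_total else 0.0


pairs = [
    ("the the the the the the", "the cat is on the mat"),
    ("the cat is on mat", "the cat is on the mat"),
]
for n in range(1, 5):
    print(n, round(corpus_modified_precision(pairs, n), 3))
```

Note that the counts are summed over the whole corpus before dividing, rather than averaging per-sentence precisions.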

We are almost there. Since we are not looking at recall, all generated sequences that are short but precise have a benefit compared to sentences that are longer. Therefore, the precision score favors short generations. To compensate for that, the authors of BLEU introduced an additional term, the brevity ($BR$) penalty:

$$
BR = \text{min} \left( 1, e^{1-l_{ref}/l_{gen}} \right)
$$

By taking the minimum, we ensure that this penalty never exceeds 1 and the exponential term becomes exponentially small when the length of the generated text $l_{gen}$ is smaller than the reference text $l_{ref}$.
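The penalty is easy to evaluate numerically. In this sketch the lengths are plain token counts, and returning 0 for an empty generation is an assumption made here to avoid a division by zero:

```python
import math


def brevity_penalty(len_ref: int, len_gen: int) -> float:
    """min(1, exp(1 - l_ref / l_gen)) for reference and generation lengths."""
    if len_gen == 0:
        return 0.0  # Assumption: an empty generation receives the maximum penalty.
    return min(1.0, math.exp(1 - len_ref / len_gen))


print(brevity_penalty(6, 6))   # same length: no penalty
print(brevity_penalty(6, 5))   # slightly too short: mild penalty
print(brevity_penalty(6, 12))  # longer than the reference: capped at 1
```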

At this point you may ask, why don't we just use something like an F1-score to account for recall as well? The answer is that often in translation datasets there are multiple reference sentences instead of just one, so if we also measured recall we would incentivize translations that used all the words from all the references. Therefore, it is preferable to look for high precision in the translation and make sure the translation and reference have a similar length.

Finally, we can put everything together and get the equation for the BLEU score, where the last term is the geometric mean of the modified precision up to $n$-gram $N$. 

$$
\text{BLEU-}N = BR \times \left( \prod^{N}_{n=1} p_{n}\right)^{1/N}
$$
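Putting the pieces together, a sentence-level BLEU with a single reference and no smoothing could be sketched as follows. This is an illustration under those assumptions (plus whitespace tokenization), not the SacreBLEU implementation, and the function names are made up for this example.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of order `n` in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(generated: str, reference: str, max_n: int = 4) -> float:
    """Brevity penalty times the geometric mean of clipped n-gram precisions."""
    gen, ref = generated.split(), reference.split()
    if not gen:
        return 0.0
    precisions = []
    for n in range(1, max_n + 1):
        gen_counts, ref_counts = ngrams(gen, n), ngrams(ref, n)
        total = sum(gen_counts.values())
        clipped = sum(min(c, ref_counts[g]) for g, c in gen_counts.items())
        precisions.append(clipped / total if total else 0.0)
    if min(precisions) == 0:
        return 0.0  # No smoothing: a single zero precision zeroes the geometric mean.
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    brevity = min(1.0, math.exp(1 - len(ref) / len(gen)))
    return brevity * geo_mean


print(round(sentence_bleu("the cat is on mat", "the cat is on the mat"), 4))
print(sentence_bleu("the the the the the the", "the cat is on the mat"))
```

The geometric mean is computed in log space, which is a common way to avoid multiplying many small numbers.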

In practice, the BLEU-4 score is often reported. However, you can probably already see that this metric has many limitations; for instance, it doesn't take synonyms into account, and many steps in the derivation seem like ad hoc and rather fragile heuristics. You can find a wonderful exposition of BLEU's flaws in Rachael Tatman's blog post ["Evaluating Text Output in NLP: BLEU at Your Own Risk"](https://towardsdatascience.com/evaluating-text-output-in-nlp-bleu-at-your-own-risk-e8609665a213).

In general, the field of text generation is still looking for better evaluation metrics, and finding ways to overcome the limits of metrics like BLEU is an active area of research. Another weakness of the BLEU metric is that it expects the text to already be tokenized. This can lead to varying results if the exact same method for text tokenization is not used. The SacreBLEU metric addresses this issue by internalizing the tokenization step; for this reason, it is the preferred metric for benchmarking.

In [None]:
from datasets import load_metric
bleu_metric = load_metric("sacrebleu")

The `bleu_metric` object is an instance of the `Metric` class, and works like an aggregator: you can add single instances with `add()` or whole batches via `add_batch()`. Once you have added all the samples you need to evaluate, you then call `compute()` and the metric is calculated. This returns a dictionary with several values, such as the precision for each $n$-gram, the length penalty, as well as the final BLEU score. Let's look at the example from before:



In [None]:
import pandas as pd
import numpy as np

bleu_metric.add(prediction="the the the the the the", reference=["the cat is on the mat"])

results = bleu_metric.compute(smooth_method="floor", smooth_value=0)
results["precisions"] = [np.round(p, 2) for p in results["precisions"]]
pd.DataFrame.from_dict(results, orient="index", columns=["Value"])

----

**Note:** The BLEU score also works if there are multiple reference translations. This is why `reference` is passed as a list. To make the metric smoother for zero counts in the $n$-grams, BLEU integrates methods to modify the precision calculation. One method is to add a constant to the numerator. That way, a missing $n$-gram does not cause the score to automatically go to zero. For the purpose of explaining the values, we turn it off by setting `smooth_value=0`.

----

We can see the precision of the 1-gram is indeed 2/6, whereas the precisions for the 2/3/4-grams are all 0 (for more information about the individual metrics, like `counts` and `bp`, see the [SacreBLEU repository](https://github.com/mjpost/sacrebleu)). This means the geometric mean is zero, and thus also the BLEU score. Let's look at another example where the prediction is almost correct:

In [None]:
bleu_metric.add(prediction="the cat is on mat", reference=["the cat is on the mat"])
results = bleu_metric.compute(smooth_method="floor", smooth_value=0)
results["precisions"] = [np.round(p, 2) for p in results["precisions"]]
pd.DataFrame.from_dict(results, orient="index", columns=["Value"])

We observe that the precision scores are much better. The 1-grams in the prediction all match, and only in the higher-order precision scores do we see that something is off. For the 4-gram there are only two candidates in the predicted text, i.e., `["the", "cat", "is", "on"]` and `["cat", "is", "on", "mat"]`, where the last one doesn't match, hence the precision of 0.5.

<span style="color:blue">The BLEU score is widely used for evaluating text, especially in machine translation, since precise translations are usually favored over translations that include all possible and appropriate words. There are other applications, such as summarization, where the situation is different. There we want all the important information in the generated text, so we favor high recall. This is where the ROUGE score is usually used.</span>



### 6.5.2 - ROUGE

The ROUGE score ([C-Y. Lin, 2004](https://aclanthology.org/W04-1013.pdf)) was specifically developed for applications like summarization where high recall is more important than just precision. The approach is very similar to the BLEU score in that we look at different $n$-grams and compare their occurrences in the generated text and the reference texts. 

* With ROUGE, we check how many $n$-grams in the reference text also occur in the generated text.
* With BLEU, we check how many $n$-grams in the generated text appear in the reference.

Given their similar definitions, we can reuse the precision formula with the minor modification that we count the (unclipped) occurrence of reference $n$-grams in the generated text in the numerator:

$$
\text{ROUGE-}N = \frac{\sum_{snt' \ \in \ C} \ \sum_{n\text{-gram} \ \in \ snt'} \text{Count}_{\text{match}}(n\text{-gram})}{\sum_{snt' \ \in \ C} \ \sum_{n\text{-gram} \ \in \ snt'} \text{Count}(n\text{-gram})}
$$

This was the original proposal for ROUGE. Subsequently, researchers have found that fully removing precision can have strong negative effects. Going back to the BLEU formula without the clipped counting, we can measure precision as well, and we can then combine the precision and recall ROUGE scores in the harmonic mean to get an F1-score. This score is the metric that is nowadays commonly reported for ROUGE.
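The recall, precision, and F1 variants described above can be sketched as follows. The code follows the unclipped counting described in the text and assumes whitespace tokenization and a single reference; widely used ROUGE implementations may count matches differently, so the numbers are illustrative only. The helper names are hypothetical.

```python
def ngram_list(text: str, n: int) -> list:
    """List all n-grams of order `n` in a whitespace-tokenized text."""
    tokens = text.split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def rouge_n(generated: str, reference: str, n: int = 1) -> dict:
    """Unclipped n-gram recall, precision, and their harmonic mean (F1)."""
    gen, ref = ngram_list(generated, n), ngram_list(reference, n)
    gen_set, ref_set = set(gen), set(ref)
    # Recall: share of reference n-grams that also occur in the generated text.
    recall = sum(g in gen_set for g in ref) / len(ref) if ref else 0.0
    # Precision: share of generated n-grams that also occur in the reference.
    precision = sum(g in ref_set for g in gen) / len(gen) if gen else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"recall": recall, "precision": precision, "f1": f1}


print(rouge_n("the cat is on mat", "the cat is on the mat", n=2))
```

For this pair, three of the five reference bigrams are recovered, while three of the four generated bigrams are correct.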

There is a separate score in ROUGE to measure the longest common subsequence (LCS), called ROUGE-L. The LCS can be calculated for any pair of strings. For example, the LCS for "abab" and "abc" would be "ab", and its length would be 2. If we want to compare this value between two samples we need to somehow normalize it because otherwise a longer text would be at an advantage. To achieve this, the inventor of ROUGE came up with an F-score-like scheme where the LCS is normalized with the length of the reference and generated text, then the two normalized scores are mixed together:

$$
R_{LCS} = \frac{LCS(X,Y)}{m} 
$$


$$
P_{LCS} = \frac{LCS(X,Y)}{n}
$$

$$
F_{LCS} = \frac{(1 + \beta^{2})R_{LCS}P_{LCS}}{R_{LCS} + \beta^{2} P_{LCS}}, \text{ where } \beta = P_{LCS} / R_{LCS}
$$

**Note:** My assumption is that $m$ is the length of the reference text and $n$ refers to the length of the predicted text.

This way, the LCS score is properly normalized and can be compared across samples. In the 🤗 Datasets implementation, two variations of ROUGE are calculated: one calculates the score per sentence and averages it for the summaries (ROUGE-L), and the other calculates it directly over the whole summary (ROUGE-Lsum).
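To see the LCS-based scores in action, here is a minimal sketch. ROUGE-L is built on the longest common subsequence, so matching tokens need not be contiguous. The code assumes whitespace tokenization, a single reference, and a default of $\beta = 1$ (a plain F1), which is an assumption made for this example; the function names are hypothetical.

```python
def lcs_length(x, y) -> int:
    """Length of the longest common subsequence, via dynamic programming."""
    prev = [0] * (len(y) + 1)
    for xi in x:
        curr = [0]
        for j, yj in enumerate(y, start=1):
            # Extend the diagonal on a match, otherwise carry the best neighbor.
            curr.append(prev[j - 1] + 1 if xi == yj else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(generated: str, reference: str, beta: float = 1.0) -> dict:
    """LCS-based recall, precision, and F-score on whitespace tokens."""
    gen, ref = generated.split(), reference.split()
    lcs = lcs_length(ref, gen)
    if lcs == 0:
        return {"recall": 0.0, "precision": 0.0, "f": 0.0}
    recall, precision = lcs / len(ref), lcs / len(gen)
    f = (1 + beta**2) * recall * precision / (recall + beta**2 * precision)
    return {"recall": recall, "precision": precision, "f": f}


print(lcs_length("abab", "abc"))
print(rouge_l("the cat is on mat", "the cat is on the mat"))
```

The first call works on characters and reproduces the "abab" versus "abc" example; the second works on words.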

We have already generated a set of summaries and now we have a metric to compare the summaries systematically. Let's apply the ROUGE score to all the summaries generated by the models:

In [None]:
# load_metric was imported from datasets in the BLEU section
rouge_metric = load_metric("rouge")

reference = dataset["train"][1]["highlights"]
records = []
rouge_names = ["rouge1", "rouge2", "rougeL", "rougeLsum"]

for model_name in summaries:
    rouge_metric.add(prediction=summaries[model_name], reference=reference)
    score = rouge_metric.compute()
    # Keep the mid-point F-measure for each ROUGE variant
    rouge_dict = dict((rn, score[rn].mid.fmeasure) for rn in rouge_names)
    records.append(rouge_dict)
pd.DataFrame.from_records(records, index=summaries.keys())

----

**Note:** The ROUGE metric in the 🤗 Datasets library also calculates confidence intervals (by default, the 5th and 95th percentiles). The average value is stored in the attribute `mid` and the interval can be retrieved with `low` and `high`.

----

These results are obviously not very reliable as we only looked at a single sample, but we can compare the quality of the summary for that one example. 

<span style="color:red"><b>TODO: Fill in once the notebook has been executed and add conclusions</b></span>

## 6.6 - Evaluating PEGASUS on the CNN/DailyMail Dataset