# Text Summarization Pipelines

This notebook uses Hugging Face pipelines for text summarization. The code is taken from the book "Natural Language Processing with Transformers" by Lewis Tunstall, Leandro von Werra, and Thomas Wolf.

## The CNN/Daily Mail Dataset

The CNN/Daily Mail dataset is a popular benchmark for text summarization. It pairs news articles with short highlights that serve as reference summaries. The dataset is available through the Hugging Face `datasets` library. We will start by loading it and looking at an example.

In [1]:
from datasets import load_dataset

dataset = load_dataset("cnn_dailymail", "3.0.0")
print(f"features: {dataset['train'].column_names}")

features: ['article', 'highlights', 'id']


In [2]:
sample = dataset["train"][0]
print(f"""
    article: {sample['article']}
    highlights: {sample['highlights']}
""")


    article: LONDON, England (Reuters) -- Harry Potter star Daniel Radcliffe gains access to a reported £20 million ($41.1 million) fortune as he turns 18 on Monday, but he insists the money won't cast a spell on him. Daniel Radcliffe as Harry Potter in "Harry Potter and the Order of the Phoenix" To the disappointment of gossip columnists around the world, the young actor says he has no plans to fritter his cash away on fast cars, drink and celebrity parties. "I don't plan to be one of those people who, as soon as they turn 18, suddenly buy themselves a massive sports car collection or something similar," he told an Australian interviewer earlier this month. "I don't think I'll be particularly extravagant. "The things I like buying are things that cost about 10 pounds -- books and CDs and DVDs." At 18, Radcliffe will be able to gamble in a casino, buy a drink in a pub or see the horror film "Hostel: Part II," currently six places below his number one movie on the UK box office chart. 

# Evaluating Popular Models

Let's use the Hugging Face `pipeline` API to compare four popular model architectures on text summarization:
- GPT2
- T5
- BART
- PEGASUS

In [3]:
sample_text = dataset['train'][0]['article'][:2000]
summaries = {}

## GPT2

GPT2 was not trained for summarization, but it can be prompted to produce a summary by appending "TL;DR" to the end of the input text. Let's see how it performs.

In [4]:
# GPT2
from transformers import pipeline, set_seed
import nltk
from nltk.tokenize import sent_tokenize

nltk.download('punkt')

set_seed(42)
pipe = pipeline("text-generation", model="gpt2-xl", framework='pt')
gpt2_query = sample_text + "\nTL;DR:\n"
pipe_out = pipe(gpt2_query, max_length=512, clean_up_tokenization_spaces=True)
summaries["gpt2"] = "\n".join(sent_tokenize(pipe_out[0]["generated_text"][len(gpt2_query) :]))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In [5]:
# T5
pipe = pipeline("summarization", model="t5-large", framework='pt')
pipe_out = pipe(sample_text)
summaries["t5"] = "\n".join(sent_tokenize(pipe_out[0]["summary_text"]))

For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-large automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.


In [6]:
# BART
pipe = pipeline("summarization", model="facebook/bart-large-cnn", framework='pt')
pipe_out = pipe(sample_text)
summaries["bart"] = "\n".join(sent_tokenize(pipe_out[0]["summary_text"]))

In [7]:
# PEGASUS
pipe = pipeline("summarization", model="google/pegasus-cnn_dailymail", framework='pt')
pipe_out = pipe(sample_text)
summaries["pegasus"] = pipe_out[0]["summary_text"].replace(" .<n>", ".\n")

In [8]:
print("GROUND TRUTH")
print(dataset["train"][0]["highlights"])
print("")

for model_name in summaries:
    print(f"{model_name.upper()}")
    print(summaries[model_name])
    print("")

GROUND TRUTH
Harry Potter star Daniel Radcliffe gets £20M fortune as he turns 18 Monday .
Young actor says he has no plans to fritter his cash away .
Radcliffe's earnings from first five Potter films have been held in trust fund .

GPT2
I've always liked a good story first, or as much as anyone can tell a good story first...
And if your story has the word "hero" in it or something which implies heroism, then it's not a story.
...If it's your story, and you want

T5
Harry Potter star Daniel Radcliffe turns 18 on monday .
the young actor says he has no plans to fritter his cash away .
details of how he'll mark his landmark birthday are under wraps .

BART
Harry Potter star Daniel Radcliffe turns 18 on Monday.
He gains access to a reported £20 million ($41.1 million) fortune.
Radcliffe says he has no plans to fritter his cash away on fast cars, drink and parties.
His earnings from the first five Potter films have been held in a trust fund.

PEGASUS
Harry Potter star Daniel Radcliffe gains

GPT2 performs poorly here. It was never trained to summarize, and its continuation after the "TL;DR" prompt has nothing to do with the article.

T5 was fine-tuned on summarization along with several other tasks, so it performs much better than GPT2. Its summary is coherent and mostly correct, although it leaves out the fortune mentioned in the reference.

The BART and PEGASUS checkpoints used here were fine-tuned specifically on CNN/Daily Mail, so they perform best. Their summaries are coherent and mostly correct.

# Evaluating Generated Text

Given a reference summary, how do we score a generated one? Two common metrics are ROUGE and BLEU. ROUGE stands for Recall-Oriented Understudy for Gisting Evaluation. It is a set of metrics for evaluating automatic summarization and, to a lesser extent, machine translation. BLEU stands for Bilingual Evaluation Understudy. It was designed for evaluating the quality of machine translation.

## BLEU

BLEU is a precision-based metric. It counts the words in the generated text that also occur in the reference and divides by the total number of words in the generated text. The same is done for longer $n$-grams (up to 4-grams by default), and the per-order precisions are combined with a geometric mean. The higher the BLEU score, the better the generated text. To avoid rewarding repetition, each word is counted only as many times as it occurs in the reference text. For example, if the reference text contains the word "the" 3 times, but the generated text contains the word "the" 5 times, the word "the" is only counted 3 times.
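The clipping rule can be sketched in a few lines of plain Python. This is a minimal illustration, not SacreBLEU's implementation: the helper names `ngrams` and `clipped_precision` are made up for this example, and whitespace splitting stands in for SacreBLEU's own tokenizer.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(prediction, reference, n=1):
    """Modified n-gram precision (hypothetical helper, whitespace tokenization).

    Each predicted n-gram is credited at most as many times as it
    occurs in the reference.
    """
    pred_counts = Counter(ngrams(prediction.split(), n))
    ref_counts = Counter(ngrams(reference.split(), n))
    matched = sum(min(count, ref_counts[gram]) for gram, count in pred_counts.items())
    total = sum(pred_counts.values())
    return matched / total if total else 0.0


# "the" occurs 3 times in the reference and 5 times in the prediction,
# so only 3 of the 5 predicted words are credited.
precision = clipped_precision("the the the the the", "the cat the dog the bird")
print(precision)
```

Without clipping, the same prediction would score a perfect unigram precision, which is exactly the failure mode the rule prevents.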

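Since more than one translation can be valid, BLEU supports several references per example: each $n$-gram is clipped at the largest count it has in any single reference. The score is also computed over the whole test set rather than per sentence, by summing the matched and total $n$-gram counts across all examples before dividing.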
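As a rough sketch of how several valid references are commonly handled in BLEU, the snippet below clips each word at the largest count it has in any single reference and sums counts over all examples before dividing. The function name `corpus_unigram_precision` and the example sentences are assumptions made for illustration, and only unigrams are covered.

```python
from collections import Counter


def corpus_unigram_precision(predictions, references):
    """Corpus-level clipped unigram precision (hypothetical helper).

    predictions: list of strings, one per example.
    references:  list of lists of strings, one list of references per example.
    """
    matched, total = 0, 0
    for prediction, refs in zip(predictions, references):
        pred_counts = Counter(prediction.split())
        # Clip each word at the largest count it has in any single reference.
        max_ref_counts = Counter()
        for ref in refs:
            for word, count in Counter(ref.split()).items():
                max_ref_counts[word] = max(max_ref_counts[word], count)
        matched += sum(min(c, max_ref_counts[w]) for w, c in pred_counts.items())
        total += sum(pred_counts.values())
    # Counts are summed over the corpus first, then divided once.
    return matched / total if total else 0.0


predictions = ["Naomi went to the shop", "she bought bread"]
references = [
    ["Naomi went to the store", "Naomi walked to the shop"],
    ["she bought some milk"],
]
print(corpus_unigram_precision(predictions, references))
```

Here "shop" is credited because it appears in the second reference of the first example, while "bread" matches no reference, giving 7 matched words out of 8.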

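Precision alone favors shorter translations: a prediction that contains only a few safe words can reach a high precision while missing most of the reference. To account for this, the authors introduce a brevity penalty, $\min(1, e^{1 - \ell_{ref}/\ell_{gen}})$, which falls below 1 whenever the generated text is shorter than the reference.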
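The brevity penalty itself is a one-line formula. The sketch below assumes the standard BLEU definition, with lengths measured in tokens; the function name is illustrative, and `sys_len` and `ref_len` mirror the field names that appear in the SacreBLEU results further down.

```python
import math


def brevity_penalty(sys_len, ref_len):
    """Standard BLEU brevity penalty (illustrative helper).

    Returns 1.0 when the prediction is at least as long as the reference,
    and exp(1 - ref_len / sys_len) when it is shorter.
    """
    if sys_len == 0:
        return 0.0
    if sys_len >= ref_len:
        return 1.0
    return math.exp(1 - ref_len / sys_len)


print(brevity_penalty(6, 5))  # longer than the reference: no penalty
print(brevity_penalty(4, 5))  # shorter than the reference: penalized
```

A 4-token prediction scored against a 5-token reference is scaled by roughly 0.78, while a 6-token prediction is left untouched.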

In [9]:
import evaluate

bleu_metric = evaluate.load("sacrebleu")

In [10]:
import pandas as pd
import numpy as np

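# Score for repeated words: clipping credits "Naomi" only once (counts[0] == 1).
# The non-zero higher-order precisions come from SacreBLEU's default smoothing.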
prediction = ["Naomi Naomi Naomi Naomi Naomi Naomi"]
reference = ["Naomi went to the store"]
results = bleu_metric.compute(predictions=prediction, references=reference)
results["precisions"] = [np.round(p, 2) for p in results["precisions"]]
pd.DataFrame.from_dict(results, orient="index")

| | 0 |
|---|---|
| score | 8.116698 |
| counts | [1, 0, 0, 0] |
| totals | [6, 5, 4, 3] |
| precisions | [16.67, 10.0, 6.25, 4.17] |
| bp | 1.0 |
| sys_len | 6 |
| ref_len | 5 |


In [11]:
# A more sensible comparison
prediction = ["Naomi went to store"]
reference = ["Naomi went to the store"]
results = bleu_metric.compute(predictions=prediction, references=reference)
results["precisions"] = [np.round(p, 2) for p in results["precisions"]]
pd.DataFrame.from_dict(results, orient="index")

| | 0 |
|---|---|
| score | 49.760939 |
| counts | [4, 2, 1, 0] |
| totals | [4, 3, 2, 1] |
| precisions | [100.0, 66.67, 50.0, 50.0] |
| bp | 0.778801 |
| sys_len | 4 |
| ref_len | 5 |
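
The reported score can be reproduced by hand from the other fields in this output: BLEU is the brevity penalty times the geometric mean of the four $n$-gram precisions. The short check below takes the values as printed; the 4-gram precision of 50.0 is used as reported, since it is a smoothed value rather than a raw ratio of `counts` to `totals`.

```python
import math

# Values copied from the results above.
precisions = [100.0, 100.0 * 2 / 3, 50.0, 50.0]  # 1- to 4-gram precisions, in percent
bp = 0.778801                                    # reported brevity penalty

# Geometric mean of the precisions, computed in log space.
geometric_mean = math.exp(sum(math.log(p) for p in precisions) / len(precisions))
score = bp * geometric_mean
print(round(score, 2))
```

The geometric mean is why a single very low precision drags the whole score down, as in the repeated-word example.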


## ROUGE

ROUGE was developed specifically for summarization, where recall matters at least as much as precision: a good summary should cover the important content of the reference. It is similar to BLEU, but it also looks at how many of the $n$-grams in the *reference* text occur in the *generated* text. The scores reported below are F-measures that combine this recall with precision. We will evaluate four variants of ROUGE:
1. ROUGE-1: looks at the number of unigrams in the reference text that occur in the generated text
2. ROUGE-2: looks at the number of bigrams in the reference text that occur in the generated text
3. ROUGE-L: looks at the longest common subsequence between the reference text and the generated text, treating each text as a single sequence
4. ROUGE-Lsum: splits both texts into sentences on newlines, computes the longest common subsequence per sentence, and aggregates over the whole summary
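
To make the first three variants concrete, here is a simplified sketch of ROUGE-N and ROUGE-L as F-measures. It is not the implementation behind the `rouge` metric: the helper names are invented for this example, and tokenization is reduced to lowercasing and whitespace splitting, so the numbers will not match the library exactly on real summaries.

```python
from collections import Counter


def f_measure(matches, pred_total, ref_total):
    """Harmonic mean of precision (matches/pred) and recall (matches/ref)."""
    if matches == 0:
        return 0.0
    precision = matches / pred_total
    recall = matches / ref_total
    return 2 * precision * recall / (precision + recall)


def rouge_n(prediction, reference, n=1):
    """Simplified ROUGE-N: F-measure over overlapping n-grams."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    pred_counts = Counter(tuple(pred[i:i + n]) for i in range(len(pred) - n + 1))
    ref_counts = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    overlap = sum((pred_counts & ref_counts).values())  # clipped overlap
    return f_measure(overlap, sum(pred_counts.values()), sum(ref_counts.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence, via dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(prediction, reference):
    """Simplified ROUGE-L: F-measure based on the LCS of the two token lists."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    return f_measure(lcs_length(pred, ref), len(pred), len(ref))


prediction = "Naomi went to store"
reference = "Naomi went to the store"
print(rouge_n(prediction, reference, n=1))
print(rouge_n(prediction, reference, n=2))
print(rouge_l(prediction, reference))
```

For this pair, every predicted word appears in the reference in the same order, so ROUGE-1 and ROUGE-L coincide, while ROUGE-2 is lower because dropping "the" breaks two of the reference bigrams.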

In [12]:
rouge_metric = evaluate.load("rouge")

reference = dataset["train"][0]["highlights"]
records = []

for model_name in summaries:
    prediction = summaries[model_name]
    score = rouge_metric.compute(predictions=[prediction], references=[reference])
    records.append(score)
pd.DataFrame.from_records(records, index=summaries.keys())

| | rouge1 | rouge2 | rougeL | rougeLsum |
|---|---|---|---|---|
| gpt2 | 0.114943 | 0.0 | 0.068966 | 0.091954 |
| t5 | 0.575342 | 0.450704 | 0.547945 | 0.575342 |
| bart | 0.717391 | 0.511111 | 0.652174 | 0.717391 |
| pegasus | 0.8 | 0.692308 | 0.8 | 0.8 |
