# Abstractive summarization with T5 Transformer model

## Introduction

### Automatic Summarization

__Automatic summarization__ is one of the central problems in Natural Language Processing (NLP). Summarization consists of creating a shorter version of a document or an article that captures all the important information. It poses several challenges relating to language understanding (e.g. identifying important content) and generation (e.g. aggregating and rewording the identified content into a summary).
Like translation, it can be formulated as a sequence-to-sequence task.

Summarization can be: 

- _Extractive_: extract the most relevant information from a document.
- _Abstractive_: generate new text that captures the most relevant information.

In this project we will approach the problem of single-document abstractive summarization. Following prior work, we aim to tackle this problem using a sequence-to-sequence model. 

### The T5 model

[Text-to-Text Transfer Transformer (T5)](https://arxiv.org/abs/1910.10683) is a [Transformer-based](https://arxiv.org/abs/1706.03762) model built on the encoder-decoder architecture, pretrained on a multi-task mixture of unsupervised and supervised tasks where each task is converted into a text-to-text format. T5 shows impressive results in a variety of sequence-to-sequence tasks like summarization, translation, etc.

The T5 model was presented in [Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/pdf/1910.10683) by Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu.

Some notes on the model:

1. T5 is an encoder-decoder model pre-trained on a multi-task mixture of unsupervised and supervised tasks, in which each task is converted into a text-to-text format. T5 works well on a variety of tasks out-of-the-box by prepending a task-specific prefix to the input, e.g. `translate English to German: …` for translation and `summarize: …` for summarization.

1. The pretraining includes both supervised and self-supervised training. Supervised training is conducted on downstream tasks provided by the GLUE and SuperGLUE benchmarks (converting them into text-to-text tasks as explained above).

1. Self-supervised training uses corrupted tokens, by randomly removing 15% of the tokens and replacing them with individual sentinel tokens (if several consecutive tokens are marked for removal, the whole group is replaced with a single sentinel token). The input of the encoder is the corrupted sentence, the input of the decoder is the original sentence and the target is then the dropped out tokens delimited by their sentinel tokens.

1. T5 uses relative scalar embeddings. Encoder input padding can be done on the left and on the right.
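
The self-supervised objective described in the notes above can be illustrated with a minimal sketch. It is not the actual T5 pretraining code: the helper `span_corrupt` is hypothetical, it works on whitespace-separated words rather than SentencePiece tokens, and the positions to drop are fixed by hand instead of being sampled at random (15% in the real setup). The sentinel names `<extra_id_N>` follow the convention of the Hugging Face T5 tokenizer.

```python
def span_corrupt(tokens, drop):
    """Build a (corrupted input, target) pair from a list of tokens.

    `drop` is the set of token positions to remove. Each run of consecutive
    dropped positions is replaced by a single sentinel token.
    """
    inputs, targets = [], []
    sentinel = 0
    prev_dropped = False
    for i, tok in enumerate(tokens):
        if i in drop:
            if not prev_dropped:
                # Start of a new dropped span: emit the same sentinel on both sides
                marker = f"<extra_id_{sentinel}>"
                inputs.append(marker)
                targets.append(marker)
                sentinel += 1
            targets.append(tok)
            prev_dropped = True
        else:
            inputs.append(tok)
            prev_dropped = False
    # A final sentinel closes the target sequence
    targets.append(f"<extra_id_{sentinel}>")
    return " ".join(inputs), " ".join(targets)


tokens = "Thank you for inviting me to your party last week .".split()
corrupted, target = span_corrupt(tokens, drop={2, 3, 8})
print(corrupted)  # Thank you <extra_id_0> me to your party <extra_id_1> week .
print(target)     # <extra_id_0> for inviting <extra_id_1> last <extra_id_2>
```

The two consecutive dropped words ("for inviting") share one sentinel, while the isolated word ("last") gets its own.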

Alternative models that can be used in this notebook instead of T5 include:

[BART](https://huggingface.co/docs/transformers/model_doc/bart), [BigBird-Pegasus](https://huggingface.co/docs/transformers/model_doc/bigbird_pegasus), [Blenderbot](https://huggingface.co/docs/transformers/model_doc/blenderbot), [BlenderbotSmall](https://huggingface.co/docs/transformers/model_doc/blenderbot-small), [Encoder decoder](https://huggingface.co/docs/transformers/model_doc/encoder-decoder), [FairSeq Machine-Translation](https://huggingface.co/docs/transformers/model_doc/fsmt), [LED](https://huggingface.co/docs/transformers/model_doc/led), [LongT5](https://huggingface.co/docs/transformers/model_doc/longt5), [M2M100](https://huggingface.co/docs/transformers/model_doc/m2m_100), [Marian](https://huggingface.co/docs/transformers/model_doc/marian), [mBART](https://huggingface.co/docs/transformers/model_doc/mbart), [MT5](https://huggingface.co/docs/transformers/model_doc/mt5), [MVP](https://huggingface.co/docs/transformers/model_doc/mvp), [NLLB](https://huggingface.co/docs/transformers/model_doc/nllb), [NLLB-MOE](https://huggingface.co/docs/transformers/model_doc/nllb-moe), [Pegasus](https://huggingface.co/docs/transformers/model_doc/pegasus), [PEGASUS-X](https://huggingface.co/docs/transformers/model_doc/pegasus_x), [PLBart](https://huggingface.co/docs/transformers/model_doc/plbart), [ProphetNet](https://huggingface.co/docs/transformers/model_doc/prophetnet), [SwitchTransformers](https://huggingface.co/docs/transformers/model_doc/switch_transformers), [UMT5](https://huggingface.co/docs/transformers/model_doc/umt5), [XLM-ProphetNet](https://huggingface.co/docs/transformers/model_doc/xlm-prophetnet)

### The XSum Dataset

In this notebook, we will fine-tune the pretrained T5 model on the Abstractive Summarization task using Hugging Face Transformers on the [Extreme Summarization (XSum)](https://arxiv.org/abs/1808.08745) dataset loaded from Hugging Face Datasets. We will then use this finetuned model for inference.

## Setup
### Installing the requirements

In [None]:
!pip install transformers datasets evaluate rouge_score

### Define variables

In [None]:
TRAIN_TEST_SPLIT = 0.1 # Fraction of the dataset used for each of our training and test subsets
MAX_INPUT_LENGTH = 1024  # Maximum length of the input to the model
MIN_TARGET_LENGTH = 5  # Minimum length of the output by the model
MAX_TARGET_LENGTH = 128  # Maximum length of the output by the model
BATCH_SIZE = 8  # Batch-size for training our model
LEARNING_RATE = 2e-5  # Learning-rate for training our model
MAX_EPOCHS = 1  # Maximum number of epochs we will train the model for

# This notebook is built on the t5-small checkpoint from the Hugging Face Model Hub
MODEL_CHECKPOINT = "t5-small"

# Local directory where to save the finetuned model
MODEL_PATH = "T5-XSum-base"

# Repository name for saving model to the Hugging Face Hub
REPO_NAME = "alexrodpas/T5-XSum-base"

# File for inference example
INPUT_FILE = "Input/Airlines_Are_Just_Banks_Now.txt"

# For summarization tasks, T5 requires the following prefix
PREFIX = "summarize: "

# Enable parallelized tokenization (the tokenizers library reads this environment variable)
import os
os.environ["TOKENIZERS_PARALLELISM"] = "true"

# Disable W&B logging
os.environ["WANDB_DISABLED"] = "true"

## Load the dataset
We will now download the [Extreme Summarization (XSum)](https://arxiv.org/abs/1808.08745) dataset. This dataset consists of BBC articles and accompanying single-sentence summaries. Specifically, each article is prefaced with an introductory sentence (aka summary) which is professionally written, typically by the author of the article. The dataset has 226,711 articles divided into training (90%, 204,045), validation (5%, 11,332), and test (5%, 11,334) sets.

We will use the Recall-Oriented Understudy for Gisting Evaluation (ROUGE) metric, which is commonly found in the literature, to evaluate our sequence-to-sequence abstractive summarization approach.

We will use the [Hugging Face Datasets library](https://github.com/huggingface/datasets) to download the data we need to use for training and evaluation. This can be easily done with the `load_dataset` function.

In [None]:
from datasets import load_dataset

raw_datasets = load_dataset("xsum", split="train")

The dataset has the following fields:

- __document__: the original BBC article to be summarized
- __summary__: the single sentence summary of the BBC article
- __id__: ID of the document-summary pair

In [None]:
raw_datasets

We can see what the data looks like by retrieving the first item in ``raw_datasets``:

In [None]:
print(raw_datasets[0])

For the sake of demonstrating the workflow, in this notebook we will only take small random subsets (10% each) of the train split as our training and test sets. We can easily split the dataset using the `train_test_split` method, which takes the sizes of the two splits as fractions (or absolute numbers of examples) and shuffles the data before splitting. No stratification is applied here, since XSum has no class-label column to stratify on.

In [None]:
raw_datasets = raw_datasets.train_test_split(train_size=TRAIN_TEST_SPLIT, test_size=TRAIN_TEST_SPLIT)
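
To make the effect of passing both `train_size` and `test_size` concrete, here is a minimal pure-Python sketch of a shuffled split. The helper `toy_train_test_split` is hypothetical and only mimics the idea; the exact rounding and shuffling used by the Datasets library may differ.

```python
import random


def toy_train_test_split(items, train_size, test_size, seed=0):
    """Shuffle `items` and return disjoint train/test subsets of the given fractions."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_size)
    n_test = round(len(shuffled) * test_size)
    # Train examples come from one end of the shuffled list and test examples
    # from the other, so they never overlap while the fractions sum to at most 1
    return {"train": shuffled[:n_train], "test": shuffled[len(shuffled) - n_test:]}


splits = toy_train_test_split(range(100), train_size=0.1, test_size=0.1)
unused = 100 - len(splits["train"]) - len(splits["test"])
print(len(splits["train"]), len(splits["test"]), unused)  # 10 10 80
```

With both fractions set to 0.1, the remaining 80% of the examples are simply left out, which keeps this demonstration fast.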

## Data pre-processing
Before we can feed those texts to our model, we need to pre-process them and get them ready for the task. This is done by a Hugging Face Transformers `Tokenizer`, which tokenizes the inputs (including converting the tokens to their corresponding IDs in the pretrained vocabulary), puts them in the format the model expects, and generates the other inputs that the model requires.

The `from_pretrained()` method expects the name of a model from the Hugging Face Model Hub. This is the `MODEL_CHECKPOINT` that we declared earlier.

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(MODEL_CHECKPOINT)

Next, we will write a simple function that helps us in the pre-processing that is compatible with Hugging Face Datasets. This pre-processing function should:

1. Properly indicate the task that we intend to perform, which is summarization (T5 models can also be used for other tasks such as translation, so if using one of the five original T5 checkpoints we have to prefix the inputs with `summarize: `).
1. Add the prefix to each input document before tokenization.
1. Tokenize the texts (inputs and targets) into their corresponding token IDs, which will be used for embedding look-up in T5.
1. Create the additional inputs the model needs, such as `attention_mask`, and store the tokenized targets as `labels`.


In [None]:
def preprocess_fn(examples, tokenizer):
    # Only the original T5 checkpoints need the task prefix
    if MODEL_CHECKPOINT in ["t5-small", "t5-base", "t5-large", "t5-3b", "t5-11b"]:
        prefix = PREFIX
    else:
        prefix = ""

    # Prepend the prefix and tokenize the documents, truncating long inputs
    inputs = [prefix + doc for doc in examples["document"]]
    model_inputs = tokenizer(inputs, max_length=MAX_INPUT_LENGTH, truncation=True)

    # Tokenize the target summaries; `text_target` replaces the deprecated
    # `as_target_tokenizer()` context manager
    labels = tokenizer(text_target=examples["summary"], max_length=MAX_TARGET_LENGTH, truncation=True)

    model_inputs["labels"] = labels["input_ids"]

    return model_inputs

To apply this function to all the document-summary pairs in our dataset, we just use the `map` method of the `raw_datasets` object we created earlier. This will apply the function to all the elements of all the splits in `raw_datasets`, so our training and testing data will be preprocessed in one single command.

In [None]:
raw_datasets

In [None]:
tokenized_datasets = raw_datasets.map(preprocess_fn, fn_kwargs={"tokenizer":tokenizer}, batched=True)
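
To see the shape of what the pre-processing produces for each batch, the following sketch reproduces the same steps with a toy whitespace tokenizer. Everything here is an illustrative assumption: `toy_tokenize` and `toy_preprocess` are hypothetical helpers, the IDs are assigned on the fly rather than taken from the T5 vocabulary, and the end-of-sequence token that the real tokenizer appends is omitted.

```python
PREFIX = "summarize: "


def toy_tokenize(text, vocab, max_length):
    """Hypothetical whitespace tokenizer: map words to integer IDs, then truncate."""
    ids = [vocab.setdefault(word, len(vocab) + 1) for word in text.split()]
    ids = ids[:max_length]
    return {"input_ids": ids, "attention_mask": [1] * len(ids)}


def toy_preprocess(examples, max_input_length=8, max_target_length=4):
    vocab = {}
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for doc, summary in zip(examples["document"], examples["summary"]):
        # The prefix is added to the raw text, before tokenization
        enc = toy_tokenize(PREFIX + doc, vocab, max_input_length)
        batch["input_ids"].append(enc["input_ids"])
        batch["attention_mask"].append(enc["attention_mask"])
        # Only the token IDs of the summary are kept, under the "labels" key
        target = toy_tokenize(summary, vocab, max_target_length)
        batch["labels"].append(target["input_ids"])
    return batch


examples = {
    "document": ["the cat sat on the mat and then it slept all day"],
    "summary": ["a cat slept on a mat"],
}
out = toy_preprocess(examples)
print(out["input_ids"])       # [[1, 2, 3, 4, 5, 2, 6, 7]]
print(out["attention_mask"])  # [[1, 1, 1, 1, 1, 1, 1, 1]]
print(out["labels"])          # [[13, 3, 10, 5]]
```

Both the input and the target are cut to their maximum lengths, and the batch stays a dictionary of lists, which is the structure `map` expects when `batched=True`.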

## Define the model
Now we can download the pretrained model and fine-tune it. Since our task is sequence-to-sequence (both the input and output are text sequences), we use the `AutoModelForSeq2SeqLM` class from the Hugging Face Transformers library. Like with the tokenizer, the `from_pretrained` method will download and cache the model for us.

The `from_pretrained()` method expects the name of a model from the Hugging Face Model Hub. As mentioned earlier, we will use the `t5-small` model checkpoint.

In [None]:
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_CHECKPOINT)

For training Sequence to Sequence models, we need a special kind of data collator, which will not only pad the inputs to the maximum length in the batch, but also the labels. Thus, we use the `DataCollatorForSeq2Seq` provided by the Hugging Face Transformers library on our dataset.

In [None]:
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)
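
The padding behaviour described above can be sketched in plain Python. This is only an approximation of what the collator does: `pad_batch` is a hypothetical helper, it returns lists instead of tensors, and it ignores the decoder inputs that the real collator can also prepare. It assumes right-side padding, the pad token ID 0 used by T5, and the value -100 for padded label positions, which the loss function ignores.

```python
PAD_TOKEN_ID = 0     # pad token ID of the T5 tokenizer
LABEL_PAD_ID = -100  # label value ignored when computing the loss


def pad_batch(features):
    """Pad inputs and labels to the longest sequence in the batch."""
    max_in = max(len(f["input_ids"]) for f in features)
    max_lab = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_in = max_in - len(f["input_ids"])
        n_lab = max_lab - len(f["labels"])
        batch["input_ids"].append(f["input_ids"] + [PAD_TOKEN_ID] * n_in)
        # The attention mask is 1 for real tokens and 0 for padding
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * n_in)
        batch["labels"].append(f["labels"] + [LABEL_PAD_ID] * n_lab)
    return batch


features = [
    {"input_ids": [5, 6, 7, 1], "labels": [8, 1]},
    {"input_ids": [9, 1], "labels": [10, 11, 12, 1]},
]
batch = pad_batch(features)
print(batch["input_ids"])       # [[5, 6, 7, 1], [9, 1, 0, 0]]
print(batch["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 0, 0]]
print(batch["labels"])          # [[8, 1, -100, -100], [10, 11, 12, 1]]
```

Padding per batch rather than to a global maximum keeps short batches small, and the -100 label padding is the reason the metric function later has to swap -100 back to the pad token before decoding.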

## Define evaluation metric and function
We define a metric to evaluate our model's performance during training. We can quickly load an evaluation method with the `evaluate` library. For our summarization task, we load the [ROUGE](https://huggingface.co/spaces/evaluate-metric/rouge) metric.

In [None]:
import evaluate

rouge = evaluate.load("rouge")

Now we define `metric_fn` which will calculate the ROUGE score between the ground-truth and predictions.

In [None]:
import numpy as np

def metric_fn(eval_pred):
    predictions, labels = eval_pred
    # Decode the generated token IDs back into text
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    # Replace the -100 used to mask label padding, which the tokenizer cannot decode
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Compute the ROUGE scores between predictions and reference summaries
    result = rouge.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)

    # Average length of the generated summaries, in non-padding tokens
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    result["gen_len"] = np.mean(prediction_lens)

    return {k: round(v, 4) for k, v in result.items()}
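
For intuition about what the ROUGE scores measure, here is a minimal sketch of ROUGE-1, the unigram-overlap variant. It is a simplification and not the implementation used by the `evaluate` library: the hypothetical `rouge1` helper splits on whitespace, applies no stemming, and does not cover ROUGE-2 or ROUGE-L.

```python
from collections import Counter


def rouge1(prediction, reference):
    """Unigram-overlap precision, recall and F1 between two texts."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Each word counts as matched at most as often as it appears in both texts
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


scores = rouge1("the cat sat on the mat", "the cat lay on the mat")
print(round(scores["f1"], 4))  # 0.8333
```

Five of the six words match in each direction, so precision, recall and F1 are all 5/6.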

## Train the model

We will use [``Seq2SeqTrainer``](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Seq2SeqTrainer), a ``Trainer`` class from the Transformers library that can be used for sequence-to-sequence tasks such as translation or summarization.

To train the model we:

1. Define our training hyperparameters in `Seq2SeqTrainingArguments`. The only required parameter is `output_dir`, which specifies where to save our model. With `evaluation_strategy="epoch"`, the `Trainer` evaluates the ROUGE metric at the end of each epoch; checkpoints are saved according to `save_strategy`, which defaults to every 500 steps. We set `push_to_hub=True` so we can later upload the model to the Hugging Face Hub.
2. Pass the training arguments to `Seq2SeqTrainer` along with the model, dataset, tokenizer, data collator, and `compute_metrics` function.
3. Call `train()` to finetune our model.

In [None]:
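from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer

# The hyperparameters come from the variables defined in the Setup section
training_args = Seq2SeqTrainingArguments(
    output_dir=MODEL_PATH,
    evaluation_strategy="epoch",
    learning_rate=LEARNING_RATE,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=MAX_EPOCHS,
    predict_with_generate=True,
    fp16=True,  # mixed precision; requires a CUDA GPU
    push_to_hub=True,
    hub_model_id=REPO_NAME,  # target repository on the Hugging Face Hub
)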

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=metric_fn,
)

In [None]:
trainer.train()

## Save the model

First we save the model, locally, for future use:

In [None]:
trainer.save_model(MODEL_PATH)

We can also upload (push) it to the Hugging Face Hub, so we can share it and keep it under version control. Since Hugging Face model repos are just Git repositories, we could use Git to push our model files to the Hub; the guide on [Getting Started with Repositories](https://huggingface.co/docs/hub/repositories-getting-started) explains how to use the git CLI to commit and push models. Here we will instead use the `huggingface_hub` client library, which can manage repositories, including creating repos and uploading models to the Model Hub. We'll need a Hugging Face API token (either stored in the cache or copied and pasted into our notebook); the [Hugging Face documentation](https://huggingface.co/docs/transformers/v4.14.1/model_sharing) explains how to obtain one.

In [None]:
from huggingface_hub import login

login()

In [None]:
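# The repository comes from the training arguments; the first positional argument here is the commit message
trainer.push_to_hub()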

## Inference

Once our model is finetuned, we can use it for inference. We'll show an example here, using as input a text extracted from the [Airlines Are Just Banks Now](https://www.theatlantic.com/ideas/archive/2023/09/airlines-banks-mileage-programs/675374/) article by Ganesh Sitaraman, published in The Atlantic on September 21, 2023.

In [None]:
input_file = INPUT_FILE
with open(input_file, 'r') as file:
    # Join lines with spaces so that words on either side of a line break are not merged
    input_text = file.read().replace('\n', ' ')

We add the prefix required by T5 for summarization:

In [None]:
pref_input = PREFIX + input_text

The simplest way to try out our finetuned model for inference is to use it in a ``pipeline()``. We instantiate a pipeline for summarization with our model, and pass our input text to it:

In [None]:
from transformers import pipeline

summarizer = pipeline("summarization", model=REPO_NAME)
summarizer(pref_input)