# Fine-tune T5 on the California state bill subset of the BillSum dataset for abstractive summarization

## Introduction

Summarization produces a shorter version of a document or article that preserves its most important information. Like translation, it can be formulated as a sequence-to-sequence task. Summarization can be:

- **Extractive:** extract the most relevant information from a document.
- **Abstractive:** generate new text that captures the most relevant information.
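
To make the distinction concrete, here is a minimal sketch of the extractive approach, assuming a naive word-frequency score (the function `extractive_summary` and the sample text are hypothetical and are not part of this notebook's pipeline). It can only copy sentences verbatim, whereas the T5 model fine-tuned below generates new text.

```python
import re
from collections import Counter

def extractive_summary(document, num_sentences=1):
    """Return the highest-scoring sentences, copied verbatim from the document."""
    # Naive sentence split: break after ., ! or ? followed by whitespace
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]
    # Word frequencies over the whole document
    freq = Counter(re.findall(r"[a-z']+", document.lower()))

    def score(sentence):
        # Average document-level frequency of the words in the sentence
        tokens = re.findall(r"[a-z']+", sentence.lower())
        return sum(freq[t] for t in tokens) / max(len(tokens), 1)

    # Rank sentences by score, keep the best ones, then restore document order
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:num_sentences])
    return " ".join(sentences[i] for i in chosen)

# Hypothetical sample text, for illustration only
doc = (
    "The bill amends the Vehicle Code. "
    "The bill requires the department to issue a license. "
    "Opponents raised cost concerns."
)
summary = extractive_summary(doc, num_sentences=1)
print(summary)
```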

## Setup

In [1]:
import torch
torch.cuda.is_available()

True

In [None]:
!pip install transformers datasets evaluate rouge_score

In [4]:
# Log in to Hugging Face
from huggingface_hub import notebook_login

notebook_login()


## Load BillSum dataset

In [6]:
# Load BillSum Dataset
from datasets import load_dataset

billsum = load_dataset("billsum", split="ca_test")
billsum

Dataset({
    features: ['text', 'summary', 'title'],
    num_rows: 1237
})

In [7]:
# Split the dataset into a train and a test set
billsum = billsum.train_test_split(test_size=0.2)
billsum

DatasetDict({
    train: Dataset({
        features: ['text', 'summary', 'title'],
        num_rows: 989
    })
    test: Dataset({
        features: ['text', 'summary', 'title'],
        num_rows: 248
    })
})
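
The row counts above follow from simple arithmetic. The sketch below assumes the test size is rounded up and the remainder goes to the training set (the helper `split_sizes` is hypothetical), which reproduces the 989/248 split of the 1,237 rows.

```python
import math

def split_sizes(num_rows, test_size):
    """Return (train_rows, test_rows) for a fractional test size."""
    # Assumption: the test set size is rounded up to a whole row
    n_test = math.ceil(test_size * num_rows)
    n_train = num_rows - n_test
    return n_train, n_test

train_rows, test_rows = split_sizes(1237, 0.2)
print(train_rows, test_rows)
```

Because the split is shuffled, the rows in each set (and the example shown at index 0) can differ between runs; `train_test_split` accepts a `seed` argument if a reproducible split is needed.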

In [11]:
# Take a look at an example
billsum["train"][0]

{'text': 'The people of the State of California do enact as follows:\n\n\nSECTION 1.\nSection 12811 of the Vehicle Code is amended to read:\n12811.\n(a) (1) (A) When the department determines that the applicant is lawfully entitled to a license, the department shall issue to the person a driver’s license as applied for. The license shall state the class of license for which the licensee has qualified and shall contain the distinguishing number assigned to the applicant, the date of expiration, the true full name, age, and mailing address of the licensee, a brief description and engraved picture or photograph of the licensee for the purpose of identification, and space for the signature of the licensee.\n(B) Each license shall also contain a space for the endorsement of a record of each suspension or revocation of the license.\n(C) The department shall use whatever process or processes, in the issuance of engraved or colored licenses, that prohibit, as near as possible, the ability to a

## Data Preprocessing

Two fields from each record are used here:

- **text:** the text of the bill, which is the input to the model.
- **summary:** a condensed version of the text, which is the model target.

In [12]:
# Load T5 tokenizer to process text and summary
from transformers import AutoTokenizer

checkpoint = "google-t5/t5-small"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)


In [13]:
# Define the preprocessing function
prefix = "summarize: "

def preprocess_function(examples):
    # Prefix the input with a prompt so T5 knows this is a summarization task
    inputs = [prefix + doc for doc in examples["text"]]
    model_inputs = tokenizer(inputs, max_length=1024, truncation=True)

    # Tokenize and truncate the labels
    labels = tokenizer(text_target=examples["summary"], max_length=128, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
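
The effect of the prefix and of truncation can be illustrated with a toy stand-in for the tokenizer. This is only a sketch: `toy_tokenize` and `toy_preprocess` are hypothetical, they split on whitespace instead of using T5's SentencePiece vocabulary, and they return word strings where the real tokenizer returns integer IDs plus an attention mask.

```python
def toy_tokenize(text):
    # Hypothetical stand-in for the T5 tokenizer: one token per whitespace-separated word
    return text.split()

def toy_preprocess(docs, summaries, prefix="summarize: ", max_input=8, max_target=4):
    # Prepend the task prefix, tokenize, and cut each input at max_input tokens
    inputs = [toy_tokenize(prefix + doc)[:max_input] for doc in docs]
    # Targets are tokenized and truncated separately, with their own limit
    labels = [toy_tokenize(s)[:max_target] for s in summaries]
    return {"input_ids": inputs, "labels": labels}

# Hypothetical sample record, for illustration only
batch = toy_preprocess(
    docs=["Section 12811 of the Vehicle Code is amended to read as follows"],
    summaries=["Amends driver's license issuance rules in the Vehicle Code"],
)
print(batch["input_ids"][0])
print(batch["labels"][0])
```

Anything past the length limit is simply dropped, so for long bills the model only sees the beginning of the text.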


In [14]:
# Apply the preprocessing function over the entire dataset
tokenized_billsum = billsum.map(preprocess_function, batched=True)


In [15]:
# Create a data collator that dynamically pads each batch to its longest example
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(
    tokenizer=tokenizer,
    model=checkpoint,
)
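
What the collator does to a batch can be sketched in plain Python. The function `toy_collate` and the token IDs below are hypothetical; the sketch assumes T5's padding token ID of 0 and the collator's default label padding value of -100, which the loss function ignores.

```python
PAD_TOKEN_ID = 0     # T5's padding token ID
LABEL_PAD_ID = -100  # label positions with this value are ignored by the loss

def toy_collate(features):
    """Pad inputs and labels to the longest example in this batch only."""
    max_in = max(len(f["input_ids"]) for f in features)
    max_lab = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        in_pad = max_in - len(f["input_ids"])
        lab_pad = max_lab - len(f["labels"])
        batch["input_ids"].append(f["input_ids"] + [PAD_TOKEN_ID] * in_pad)
        # 1 marks real tokens, 0 marks padding
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * in_pad)
        batch["labels"].append(f["labels"] + [LABEL_PAD_ID] * lab_pad)
    return batch

# Made-up token IDs, for illustration only
features = [
    {"input_ids": [21, 7, 9, 1], "labels": [5, 1]},
    {"input_ids": [21, 3, 1], "labels": [8, 6, 4, 1]},
]
batch = toy_collate(features)
print(batch)
```

Padding per batch rather than to a fixed global length avoids spending compute on padding tokens when a batch happens to contain only short examples.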

## Evaluation Metrics

In [16]:
# Load the ROUGE metric for evaluation
import evaluate

rouge = evaluate.load("rouge")


In [17]:
# Decode the predictions and labels, then compute the ROUGE scores
import numpy as np

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    # Labels use -100 at padded positions; swap in the pad token ID so they can be decoded
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    result = rouge.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)

    # Report the mean number of non-padding tokens in the generated summaries
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    result["gen_len"] = np.mean(prediction_lens)

    return {k: round(v, 4) for k, v in result.items()}
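
For intuition about what the ROUGE-1 number measures, here is a simplified sketch of a unigram-overlap F1 score. The function `rouge1_f1` is hypothetical: it uses lowercased whitespace tokens and no stemming, so its values will not match the `evaluate` implementation exactly.

```python
from collections import Counter

def rouge1_f1(prediction, reference):
    """Simplified ROUGE-1: F1 over unigrams shared by prediction and reference."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Clipped overlap: each word counts at most as often as it appears in both texts
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# Made-up sentences, for illustration only
score = rouge1_f1(
    "the bill amends the vehicle code",
    "this bill amends the vehicle code on licenses",
)
print(round(score, 4))
```

ROUGE-2 applies the same idea to bigrams, and ROUGE-L scores the longest common subsequence instead of fixed-size n-grams.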

## Fine-tuning the Model

In [18]:
# Load the T5 pretrained model
from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer

model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)


In [24]:
# Define the hyperparameters
training_args = Seq2SeqTrainingArguments(
    output_dir="t5-small-finetuned-billsum",
    eval_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=3,
    predict_with_generate=True,
    fp16=True,
    push_to_hub=True,
)

In [25]:
# Create the trainer
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_billsum["train"],
    eval_dataset=tokenized_billsum["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

In [26]:
# Finetune the Model
trainer.train()

| Epoch | Training Loss | Validation Loss | Rouge1 | Rouge2 | Rougel | Rougelsum | Gen Len |
|---|---|---|---|---|---|---|---|
| 1 | No log | 2.671116 | 0.1308 | 0.0445 | 0.1107 | 0.1109 | 19.0 |
| 2 | No log | 2.576067 | 0.1338 | 0.0483 | 0.1137 | 0.1137 | 19.0 |
| 3 | No log | 2.553292 | 0.1356 | 0.0495 | 0.1144 | 0.1144 | 19.0 |

TrainOutput(global_step=186, training_loss=2.896615961546539, metrics={'train_runtime': 223.0654, 'train_samples_per_second': 13.301, 'train_steps_per_second': 0.834, 'total_flos': 803118249934848.0, 'train_loss': 2.896615961546539, 'epoch': 3.0})
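
The `global_step=186` in the output above can be checked by hand. The sketch below assumes a single device and no gradient accumulation (the helper `total_optimizer_steps` is hypothetical). It also suggests why the training loss column reads "No log": the default `logging_steps` is 500, which this short run never reaches.

```python
import math

def total_optimizer_steps(num_examples, batch_size, num_epochs):
    # The last batch of each epoch may be smaller, so round up
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs

steps = total_optimizer_steps(num_examples=989, batch_size=16, num_epochs=3)

DEFAULT_LOGGING_STEPS = 500  # default value of `logging_steps` in the training arguments
reaches_first_log = steps >= DEFAULT_LOGGING_STEPS

print(steps, reaches_first_log)
```

Setting a smaller `logging_steps` value in `Seq2SeqTrainingArguments` should make the training loss appear in the table.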

In [27]:
# Push the model to hub
trainer.push_to_hub(
    tags="summarization",
    commit_message="Training complete",
)


CommitInfo(commit_url='https://huggingface.co/ashaduzzaman/t5-small-finetuned-billsum/commit/31e32c89c1734c41b353d9e01a85359224ff553f', commit_message='Training complete', commit_description='', oid='31e32c89c1734c41b353d9e01a85359224ff553f', pr_url=None, pr_revision=None, pr_num=None)

## Inference

In [29]:
from transformers import pipeline

summarizer = pipeline(
    "summarization",
    model="ashaduzzaman/t5-small-finetuned-billsum"
)


Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


In [33]:
text = "summarize: The Inflation Reduction Act lowers prescription drug costs, health care costs, and energy costs. It's the most aggressive action on tackling the climate crisis in American history, which will lift up American workers and create good-paying, union jobs across the country. It'll lower the deficit and ask the ultra-wealthy and corporations to pay their fair share. And no one making under $400,000 per year will pay a penny more in taxes."

summarizer(text)

Your max_length is set to 200, but your input_length is only 103. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=51)


[{'summary_text': "the Inflation Reduction Act lowers prescription drug costs, health care costs, and energy costs . it's the most aggressive action on tackling the climate crisis in American history . no one making under $400,000 per year will pay a penny more in taxes."}]