<a href="https://colab.research.google.com/github/akashmathur-2212/LLMs-playground/blob/main/finetuned-text-summarizer/summarization-finetuning-evaluation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
! pip install -qqq transformers datasets evaluate rouge_score
! pip install -qqq accelerate==0.21

# Summarization

Summarization creates a shorter version of a document or an article that captures all the important information. It is formulated as a sequence-to-sequence task. Summarization can be:

- **Extractive**: extract the most relevant information from a document.
- **Abstractive**: generate new text that captures the most relevant information.
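
To make the distinction concrete, here is a minimal sketch of an extractive approach. It is not part of this guide's pipeline (T5 is abstractive), and the function `extractive_summary` and its frequency-based scoring are illustrative assumptions rather than a standard algorithm. The point is that every sentence in the output is copied verbatim from the input.

```python
import re
from collections import Counter

def extractive_summary(document, num_sentences=1):
    """Return the highest-scoring sentences, kept in their original order."""
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]
    # Word frequencies over the whole document (lowercased).
    freqs = Counter(re.findall(r"[a-z']+", document.lower()))

    def score(sentence):
        words = re.findall(r"[a-z']+", sentence.lower())
        # Average frequency, so long sentences are not favored automatically.
        return sum(freqs[w] for w in words) / max(len(words), 1)

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    return " ".join(sentences[i] for i in sorted(ranked[:num_sentences]))

doc = (
    "The bill funds housing. "
    "The bill funds housing for workers in expensive areas. "
    "Weather was mild."
)
print(extractive_summary(doc))  # The bill funds housing.
```

An abstractive model, by contrast, may produce wording that appears nowhere in the source.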

This guide will show you how to:

1. Finetune [T5](https://huggingface.co/t5-small) on the BillSum dataset for abstractive summarization.
2. Use your finetuned model for inference.

In [None]:
import pandas as pd
import numpy as np
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer, pipeline, DataCollatorForSeq2Seq
import evaluate
from huggingface_hub import notebook_login

In [3]:
notebook_login()

## Load dataset

In [4]:
billsum = load_dataset("billsum", split="ca_test")

In [5]:
# Split the dataset into a train and test set
billsum = billsum.train_test_split(test_size=0.2)

In [6]:
billsum

DatasetDict({
    train: Dataset({
        features: ['text', 'summary', 'title'],
        num_rows: 989
    })
    test: Dataset({
        features: ['text', 'summary', 'title'],
        num_rows: 248
    })
})
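
The row counts above are consistent with a 20% test share rounded up to a whole example. The short check below only reproduces that arithmetic from the numbers in the output; the rounding rule is an assumption inferred from those numbers, not something the notebook states.

```python
import math

train_rows, test_rows = 989, 248      # from the DatasetDict output above
total_rows = train_rows + test_rows   # examples before splitting
test_size = 0.2

# Assumption: the test share is rounded up, the remainder goes to train.
expected_test = math.ceil(test_size * total_rows)
expected_train = total_rows - expected_test
print(total_rows, expected_train, expected_test)  # 1237 989 248
```

Because the split is random, which rows land in each set can change between runs; passing a fixed `seed` to `train_test_split` should make it reproducible.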

Then take a look at an example:

In [7]:
billsum["train"][0]

{'text': 'The people of the State of California do enact as follows:\n\n\nSECTION 1.\nChapter 17 (commencing with Section 50897) is added to Part 2 of Division 31 of the Health and Safety Code, to read:\nCHAPTER  17. Workforce\nHousingPilot\nHousing Pilot\nProgram\n50897.\nIt is the intent of the Legislature in enacting this chapter to ensure that funds allocated to eligible recipients and administered by the Department of Housing and Community Development be of maximum benefit in meeting the needs of persons and families of low or moderate income. It is the intent of the Legislature to support Californians residing in areas where housing prices have risen to levels that are unaffordable. The Legislature intends that these funds be provided to eligible recipients in areas that are experiencing a rise in home prices and rental prices so that they may assist individuals who are not able to live where they work.\n50897.1.\nAs used in this chapter:\n(a) “Eligible recipient” means any of th

In [8]:
billsum["train"][0].keys()

dict_keys(['text', 'summary', 'title'])

There are two fields that you'll want to use:

- `text`: the text of the bill, which is the input to the model.
- `summary`: a condensed version of `text`, which is the model target.

## Preprocess

The next step is to load a T5 tokenizer to process `text` and `summary`:

In [9]:
checkpoint = "t5-small"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

The preprocessing function needs to:

1. Prefix the input with a prompt so T5 knows this is a summarization task. Some models capable of multiple NLP tasks require prompting for specific tasks.
2. Pass the summaries through the `text_target` keyword argument so that they are tokenized as labels.
3. Truncate sequences to be no longer than the maximum length set by the `max_length` parameter.

In [10]:
prefix = "summarize: "

def preprocess_function(examples):
    inputs = [prefix + doc for doc in examples["text"]]
    model_inputs = tokenizer(inputs, max_length=1024, truncation=True)

    labels = tokenizer(text_target=examples["summary"], max_length=128, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
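
To see the three steps without downloading a tokenizer, here is a toy sketch. `toy_tokenize` and `toy_preprocess` are hypothetical stand-ins that split on whitespace; the real T5 tokenizer produces subword ids and appends an end-of-sequence token, so only the prefixing and truncation logic carries over.

```python
PREFIX = "summarize: "

def toy_tokenize(text, max_length):
    """Hypothetical stand-in for a tokenizer: split on whitespace, then truncate."""
    return text.split()[:max_length]

def toy_preprocess(examples, max_input=8, max_target=4):
    # Step 1: prefix every document with the task prompt.
    inputs = [PREFIX + doc for doc in examples["text"]]
    # Steps 2 and 3: tokenize inputs and targets separately, each with its own length cap.
    return {
        "input_ids": [toy_tokenize(x, max_input) for x in inputs],
        "labels": [toy_tokenize(y, max_target) for y in examples["summary"]],
    }

batch = {
    "text": ["This act adds a chapter on workforce housing to the Health and Safety Code."],
    "summary": ["Creates a workforce housing pilot program."],
}
out = toy_preprocess(batch)
print(out["input_ids"][0])  # ['summarize:', 'This', 'act', 'adds', 'a', 'chapter', 'on', 'workforce']
print(out["labels"][0])     # ['Creates', 'a', 'workforce', 'housing']
```

Note that the prompt counts toward the input length limit, so truncation removes text from the end of the document, never the prefix.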

In [11]:
# Apply the preprocessing function to the entire dataset with the `map` method.
# Setting `batched=True` processes multiple examples at once, which speeds up the mapping.
tokenized_billsum = billsum.map(preprocess_function, batched=True)

Now create a batch of examples using [DataCollatorForSeq2Seq](https://huggingface.co/docs/transformers/main/en/main_classes/data_collator#transformers.DataCollatorForSeq2Seq). It's more efficient to *dynamically pad* the sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximum length.

In [12]:
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=checkpoint)
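
The idea of dynamic padding can be sketched in plain Python. This is a simplified illustration, not the collator's actual implementation: `pad_batch` is a hypothetical helper, it works on lists instead of tensors, and it assumes T5's pad token id of 0 and the collator's default label padding value of -100.

```python
LABEL_PAD = -100  # label positions with this value are ignored by the loss

def pad_batch(features, pad_token_id=0):
    """Pad each field only up to the longest sequence in this batch."""
    max_in = max(len(f["input_ids"]) for f in features)
    max_lab = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_in, n_lab = len(f["input_ids"]), len(f["labels"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * (max_in - n_in))
        batch["attention_mask"].append([1] * n_in + [0] * (max_in - n_in))
        batch["labels"].append(f["labels"] + [LABEL_PAD] * (max_lab - n_lab))
    return batch

features = [
    {"input_ids": [21, 22, 23, 24, 1], "labels": [7, 8, 1]},
    {"input_ids": [31, 32, 1], "labels": [9, 1]},
]
batch = pad_batch(features)
print(batch["input_ids"])  # [[21, 22, 23, 24, 1], [31, 32, 1, 0, 0]]
print(batch["labels"])     # [[7, 8, 1], [9, 1, -100]]
```

Here the batch is padded to length 5 rather than to the 1024-token maximum used during tokenization. The -100 label padding is also why the metric function in this guide swaps -100 back to the pad token id before decoding.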

## Evaluate

Load the ROUGE metric with the `evaluate` library:

In [13]:
rouge = evaluate.load("rouge")

Then create a function that passes your predictions and labels to compute the ROUGE metric:

In [14]:
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    result = rouge.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)

    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    result["gen_len"] = np.mean(prediction_lens)

    return {k: round(v, 4) for k, v in result.items()}
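
For intuition about what the ROUGE scores measure, here is a simplified ROUGE-1 F1 computation. It is a sketch under assumptions: `rouge1_f1` is a hypothetical helper that uses lowercased whitespace tokens, whereas the `rouge` metric loaded above applies its own tokenization and, with `use_stemmer=True`, stemming, and also reports ROUGE-2 and ROUGE-L. Scores from this sketch will therefore not match the library exactly.

```python
from collections import Counter

def rouge1_f1(prediction, reference):
    """Unigram-overlap F1 between two whitespace-tokenized, lowercased strings."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Clipped overlap: a token is matched at most as often as it occurs in both texts.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

score = rouge1_f1(
    "the bill lowers drug costs",
    "the bill lowers prescription drug costs",
)
print(round(score, 4))  # 0.9091
```

All five predicted words appear in the six-word reference, so precision is 1, recall is 5/6, and the F1 score is 10/11.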

## Training

In [15]:
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

In [16]:
notebook_login()

Let's start training:

In [20]:
training_args = Seq2SeqTrainingArguments(
    output_dir="text-summarization-evaluation-model",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=4,
    predict_with_generate=True,
    fp16=True,
    push_to_hub=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_billsum["train"],
    eval_dataset=tokenized_billsum["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

In [21]:
trainer.train()

| Epoch | Training Loss | Validation Loss | Rouge1 | Rouge2 | Rougel | Rougelsum | Gen Len |
|---|---|---|---|---|---|---|---|
| 1 | No log | 2.477531 | 0.1556 | 0.0622 | 0.1297 | 0.1301 | 19.0 |
| 2 | No log | 2.437373 | 0.1822 | 0.0868 | 0.1534 | 0.1537 | 19.0 |
| 3 | No log | 2.416359 | 0.1888 | 0.0922 | 0.16 | 0.1602 | 19.0 |
| 4 | No log | 2.410003 | 0.1909 | 0.0934 | 0.1617 | 0.1619 | 19.0 |

TrainOutput(global_step=248, training_loss=2.6011551887758317, metrics={'train_runtime': 286.1031, 'train_samples_per_second': 13.827, 'train_steps_per_second': 0.867, 'total_flos': 1070824333246464.0, 'train_loss': 2.6011551887758317, 'epoch': 4.0})
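
The `global_step=248` in the output can be checked against the training configuration. The arithmetic below is a sanity check that assumes a single device and no gradient accumulation, neither of which the notebook states explicitly.

```python
import math

train_examples = 989  # num_rows of the train split
batch_size = 16       # per_device_train_batch_size
epochs = 4            # num_train_epochs

# Assumes one device and no gradient accumulation; the last batch of each epoch is partial.
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)  # 62 248
```

This likely also explains the "No log" entries in the Training Loss column: the trainer's default logging interval is 500 steps, which is more than the 248 steps in this run. Setting a smaller `logging_steps` in the training arguments should make the training loss appear.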

In [22]:
trainer.push_to_hub()

'https://huggingface.co/akash2212/text-summarization-evaluation-model/tree/main/'

Model for reference: https://huggingface.co/akash2212/text-summarization-evaluation-model

## Inference

In [23]:
text = "summarize: The Inflation Reduction Act lowers prescription drug costs, health care costs, and energy costs. It's the most aggressive action on tackling the climate crisis in American history, which will lift up American workers and create good-paying, union jobs across the country. It'll lower the deficit and ask the ultra-wealthy and corporations to pay their fair share. And no one making under $400,000 per year will pay a penny more in taxes."

In [24]:
summarizer = pipeline("summarization", model="akash2212/text-summarization-evaluation-model")
summarizer(text)

Your max_length is set to 200, but your input_length is only 103. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=51)


[{'summary_text': "the Inflation Reduction Act lowers prescription drug costs, health care costs, and energy costs. It's the most aggressive action on tackling the climate crisis in American history. It will ask the ultra-wealthy and corporations to pay their fair share."}]

You can also replicate the results of the `pipeline` manually if you'd like.

Tokenize the text and return the `input_ids` as PyTorch tensors:

In [25]:
tokenizer = AutoTokenizer.from_pretrained("akash2212/text-summarization-evaluation-model")
inputs = tokenizer(text, return_tensors="pt").input_ids

In [26]:
model = AutoModelForSeq2SeqLM.from_pretrained("akash2212/text-summarization-evaluation-model")
outputs = model.generate(inputs, max_new_tokens=100, do_sample=False)
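
With `do_sample=False` and no beam search configured, generation is greedy: at each step the highest-scoring token is appended until an end-of-sequence token or the `max_new_tokens` limit is reached. The sketch below illustrates that loop with a toy scoring function. `greedy_decode` and `toy_scores` are hypothetical names for illustration, not part of the `transformers` API, and whether this checkpoint's generation settings use a single beam is an assumption.

```python
def greedy_decode(next_token_scores, eos_id, max_new_tokens):
    """Repeatedly append the highest-scoring next token until EOS or the length cap."""
    tokens = []
    for _ in range(max_new_tokens):
        scores = next_token_scores(tokens)   # dict: token id -> score
        best = max(scores, key=scores.get)   # argmax, no sampling
        if best == eos_id:
            break
        tokens.append(best)
    return tokens

# Hypothetical toy "model": prefers token 5, then token 6, then EOS (id 1).
def toy_scores(prefix):
    table = {
        0: {5: 0.9, 6: 0.1, 1: 0.0},
        1: {5: 0.2, 6: 0.7, 1: 0.1},
    }
    return table.get(len(prefix), {5: 0.1, 6: 0.1, 1: 0.8})

print(greedy_decode(toy_scores, eos_id=1, max_new_tokens=100))  # [5, 6]
print(greedy_decode(toy_scores, eos_id=1, max_new_tokens=1))    # [5]
```

The second call shows the effect of the length cap: decoding stops after one token even though the toy model has not yet emitted EOS.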

Decode the generated token ids back into text:

In [27]:
tokenizer.decode(outputs[0], skip_special_tokens=True)

"The Inflation Reduction Act lowers prescription drug costs, health care costs, and energy costs. It's the most aggressive action on tackling the climate crisis in American history. It'll ask the ultra-wealthy and corporations to pay their fair share."

# END