
# Tutorial: Working with Hugging Face Models and Datasets
## Chapter 3: Text Summarization Using Models in Hugging Face
### Lesson 3.2: Fine-tuning the pre-trained T5-small model in Hugging Face for text summarization

In this lesson, we will fine-tune the [T5-small](https://huggingface.co/t5-small) model on the California state bill subset of the [BillSum](https://huggingface.co/datasets/billsum) dataset. We can also fine-tune other models including Google's PEGASUS model. However, for illustration, we only demonstrate the fine-tuning steps using the smaller model, t5-small, in this tutorial.

## Install Transformers and Datasets from Hugging Face

In [1]:
# Transformers installation
! pip install -q transformers[torch] datasets
# To install from source instead of the latest release, comment out the command above and uncomment the following one.
# ! pip install git+https://github.com/huggingface/transformers.git


## Load BillSum dataset

Let us load the California test split (`ca_test`) of the BillSum dataset with the Hugging Face Datasets library.

In [2]:
from datasets import load_dataset

billsum = load_dataset("billsum", split="ca_test")


The loaded billsum dataset only has one Dataset object:

In [3]:
billsum

Dataset({
    features: ['text', 'summary', 'title'],
    num_rows: 1237
})

For fine-tuning and later evaluation, we split the dataset into a train set and a test set with the [train_test_split](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.train_test_split) method:

In [4]:
billsum = billsum.train_test_split(test_size=0.2)

Check that we have a train and test Dataset:

In [5]:
billsum

DatasetDict({
    train: Dataset({
        features: ['text', 'summary', 'title'],
        num_rows: 989
    })
    test: Dataset({
        features: ['text', 'summary', 'title'],
        num_rows: 248
    })
})
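
The split sizes shown above follow directly from the 0.2 test fraction. The arithmetic can be sketched as below, assuming the test count is rounded up and the remainder goes to the train set (the helper `split_sizes` is hypothetical and is not part of the Datasets library):

```python
import math

def split_sizes(n_rows, test_size):
    """Return (n_train, n_test) for a fractional test_size.

    Assumption: the test count is rounded up and the rest goes to train.
    """
    n_test = math.ceil(test_size * n_rows)
    n_train = n_rows - n_test
    return n_train, n_test

n_train, n_test = split_sizes(1237, 0.2)
print(n_train, n_test)  # 989 248
```

Because the split is random, passing a fixed `seed` to `train_test_split` is one way to make the train and test sets reproducible across runs.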

Take a look at an example:

In [12]:
example = billsum["train"][0]
for key in example:
    print("A key of the example: \"{}\"".format(key))
    print("The value corresponding to the key-\"{}\"\n \"{}\"".format(key, example[key]))

A key of the example: "text"
The value corresponding to the key-"text"
 "The people of the State of California do enact as follows:


SECTION 1.
The Legislature finds and declares all of the following:
(a) The President’s New Freedom Commission on Mental Health (2003) reported that the use of behavioral restraint and seclusion poses significant risks for adults and children, including serious injury or death, retraumatizing people with a history of trauma or abuse, the loss of dignity, and other psychological harm.
(b) Although California currently requires the tracking and public reporting of the use of seclusion and restraint in state developmental centers and collects data regarding the use of restraint through the department’s special incident reporting system, the data concerning the use of restraint in community residential and other long-term care facilities and acute psychiatric hospitals serving individuals with developmental disabilities is not publicly reported.
(c) One of t

There are three fields:

- `text`: the text of the bill.
- `summary`: a given summary of the text.
- `title`: the title of the text.

## Preprocess

We will fine-tune the T5-small model. The Overview page of the [Hugging Face T5 model](https://huggingface.co/docs/transformers/model_doc/t5#overview) documentation provides the following tips:
- T5 is an encoder-decoder model pre-trained on a multi-task mixture of unsupervised and supervised tasks and for which each task is converted into a text-to-text format.
- T5 works well on a variety of tasks out-of-the-box by prepending a different prefix to the input corresponding to each task, e.g., for translation: translate English to German: …, for summarization: summarize: ….

We will load a T5 tokenizer to process `text` and `summary` and prepend a prefix "summarize: " for our text summarization task.

In [7]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")


Test the tokenizer on an example:

In [14]:
tokenized_text = tokenizer(example['text'])
for key in tokenized_text:
    print(key)
    print(tokenized_text[key])

input_ids
[37, 151, 13, 8, 1015, 13, 1826, 103, 3, 35, 2708, 38, 6963, 10, 180, 3073, 9562, 1300, 37, 28204, 12902, 11, 15884, 7, 66, 13, 8, 826, 10, 41, 9, 61, 37, 1661, 22, 7, 368, 14179, 3527, 30, 17054, 1685, 3, 31210, 2196, 24, 8, 169, 13, 17340, 880, 6559, 17, 11, 142, 11593, 15968, 1516, 5217, 21, 3513, 11, 502, 6, 379, 2261, 2871, 42, 1687, 6, 3, 60, 17, 6340, 144, 2610, 151, 28, 3, 9, 892, 13, 11105, 42, 5384, 6, 8, 1453, 13, 21377, 6, 11, 119, 11041, 6263, 5, 41, 115, 61, 1875, 1826, 1083, 2311, 8, 6418, 11, 452, 5099, 13, 8, 169, 13, 142, 11593, 11, 880, 6559, 17, 16, 538, 20697, 6881, 11, 2868, 7, 331, 1918, 8, 169, 13, 880, 6559, 17, 190, 8, 3066, 22, 7, 534, 5415, 5099, 358, 6, 8, 331, 6238, 8, 169, 13, 880, 6559, 17, 16, 573, 4326, 11, 119, 307, 18, 1987, 124, 2465, 11, 12498, 3, 8118, 23, 9, 3929, 9612, 3122, 1742, 28, 20697, 12490, 19, 59, 11652, 2196, 5, 41, 75, 61, 555, 13, 8, 200, 2254, 12, 1984, 8, 1288, 13, 3, 9, 4709, 16, 8, 169, 13, 880, 6559, 17, 19, 12, 766, 4

We will create a function that preprocesses the training and test data in batches. The preprocessing function performs the following actions:
- Prepend the prefix "summarize: " to each text document to indicate to the T5 model that the task at hand is summarization.
- Convert the input texts and summary labels into a tokenized format that can be processed by the T5 model.
- Set the max_length parameter to ensure that the tokenized inputs and labels do not exceed a certain length, truncating any text that is too long.
- Assign the tokenized labels to the labels field of model_inputs, which will be used during training to calculate the loss and optimize the model's parameters.

In [23]:
def preprocess_function(examples):
    # Prepends the string "summarize: " to each document in the 'text' field of the input examples.
    # This is done to instruct the T5 model on the task it needs to perform, which in this case is summarization.
    inputs = ["summarize: " + doc for doc in examples["text"]]

    # Tokenizes the prepended input texts to convert them into a format that can be fed into the T5 model.
    # Sets a maximum token length of 1024, and truncates any text longer than this limit.
    model_inputs = tokenizer(inputs, max_length=1024, truncation=True)

    # Tokenizes the 'summary' field of the input examples to prepare the target labels for the summarization task.
    # Sets a maximum token length of 128, and truncates any text longer than this limit.
    labels = tokenizer(text_target=examples["summary"], max_length=128, truncation=True)

    # Assigns the tokenized labels to the 'labels' field of model_inputs.
    # The 'labels' field is used during training to calculate the loss and guide model learning.
    model_inputs["labels"] = labels["input_ids"]

    # Returns the prepared inputs and labels as a single dictionary, ready for training.
    return model_inputs
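
To make the data flow concrete without downloading a tokenizer, here is a minimal sketch of the same steps on a toy batch. `toy_tokenize` and `toy_preprocess` are hypothetical stand-ins: the fake tokenizer emits one made-up id per whitespace-separated word and, unlike the real T5 tokenizer, does not append an end-of-sequence token. The small length limits are chosen only so that truncation is visible.

```python
def toy_tokenize(text, max_length):
    """Hypothetical stand-in for the T5 tokenizer: one fake id per word."""
    ids = [len(word) for word in text.split()]  # made-up ids, illustration only
    return ids[:max_length]  # mimics truncation=True with a max_length cap

def toy_preprocess(examples, max_input_len=8, max_label_len=4):
    # With batched=True, `examples` is a dict of lists, one entry per row.
    inputs = ["summarize: " + doc for doc in examples["text"]]
    model_inputs = {"input_ids": [toy_tokenize(t, max_input_len) for t in inputs]}
    model_inputs["labels"] = [toy_tokenize(s, max_label_len) for s in examples["summary"]]
    return model_inputs

batch = {
    "text": ["The people of the State of California do enact as follows", "Short bill"],
    "summary": ["This bill would enact new provisions for the state", "A short summary"],
}
out = toy_preprocess(batch)
print([len(ids) for ids in out["input_ids"]])  # [8, 3]
print([len(ids) for ids in out["labels"]])     # [4, 3]
```

The long document and the long summary are cut to the limits, while the short ones keep their natural length. No padding happens at this stage.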

To apply the preprocessing function over the entire dataset, use the Hugging Face Datasets [map](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.map) method. We can speed up the `map` function by setting `batched=True` to process multiple elements of the dataset at once:

In [16]:
tokenized_billsum = billsum.map(preprocess_function, batched=True)


Let us take a look at a test example:

In [17]:
tokenized_billsum['test'][0]['text']

'The people of the State of California do enact as follows:\n\n\nSECTION 1.\n(a) In submitting this act to the electors, the Legislature finds and declares all of the following:\n(1) The theft of firearms and receipt of stolen firearms pose dangers to public safety that are different in kind from other types of theft or the receipt of other types of stolen property.\n(2) Many handguns have a value of less than $950. The threat to public safety in regard to stolen firearms goes above and beyond the monetary value of the firearm.\n(3) Given the significant and particular threat to public safety in regard to stolen firearms, it is appropriate to restore the penalties that existed prior to the passage of the Safe Neighborhoods and Schools Act\nof 2014\nin regard to stolen firearms.\n(b) It is not the intent of the Legislature in submitting this act to the electors to undermine the\nvoter’s\nvoters\n’\ndecision to decrease penalties for low-level theft and receiving stolen property, only to

In [18]:
tokenized_billsum['test'][0]['summary']

'(1)\xa0The existing Safe Neighborhoods and Schools Act, enacted as an initiative statute by Proposition 47, as approved by the electors at the November 4, 2014, statewide general election, makes the theft of property that does not exceed $950 in value petty theft, and makes that crime punishable as a misdemeanor, with certain exceptions.\nThe California Constitution authorizes the Legislature to amend an initiative statute by another statute that becomes effective only when approved by the electors.\nThis bill would amend that initiative statute by making the theft of a firearm grand theft in all cases and punishable by imprisonment in the state prison for 16 months, or 2 or 3 years.\n(2)\xa0Under existing law, a person who buys or receives property that has been stolen, knowing the property to be stolen, or who conceals, sells, withholds, or aids in concealing, selling, or withholding property from the owner, knowing the property to be stolen, is guilty of a misdemeanor or a felony, 

Now create a batch of examples using [DataCollatorForSeq2Seq](https://huggingface.co/docs/transformers/main/en/main_classes/data_collator#transformers.DataCollatorForSeq2Seq). It's more efficient to *dynamically pad* the sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximum length.

In [19]:
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model="t5-small")
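
The idea behind dynamic padding can be sketched in plain Python. This is an illustration of the concept rather than the collator's actual implementation: `pad_batch` is a hypothetical helper and the token ids are made up. It assumes the T5 pad token id of 0 for inputs and -100 for labels, the value the loss ignores.

```python
def pad_batch(sequences, pad_value):
    """Pad every sequence to the length of the longest one in this batch."""
    longest = max(len(seq) for seq in sequences)
    return [seq + [pad_value] * (longest - len(seq)) for seq in sequences]

# Toy token ids (made up for illustration).
input_ids = [[21603, 10, 37, 151, 1], [21603, 10, 1]]
labels = [[100, 200, 1], [300, 1]]

padded_inputs = pad_batch(input_ids, pad_value=0)   # inputs padded with the pad token id
padded_labels = pad_batch(labels, pad_value=-100)   # labels padded with -100

print(padded_inputs)  # [[21603, 10, 37, 151, 1], [21603, 10, 1, 0, 0]]
print(padded_labels)  # [[100, 200, 1], [300, 1, -100]]
```

Each batch is padded only to its own longest sequence, so a batch of short bills does not pay the cost of the 1024-token limit.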

## Evaluation Metrics for Training

We will use the [ROUGE](https://huggingface.co/spaces/evaluate-metric/rouge) metric to evaluate the model during training. We will load the metric from the Hugging Face [Evaluate](https://huggingface.co/docs/evaluate/index) library.

In [21]:
! pip install -q evaluate rouge_score

In [22]:
import evaluate

rouge = evaluate.load("rouge")
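
For intuition about what the metric measures, here is a simplified sketch of ROUGE-1 as an F1 score over overlapping unigrams. It is not the `rouge_score` implementation: the hypothetical `rouge1_f1` helper assumes lowercased whitespace tokenization and no stemming, so its numbers will not match the library exactly.

```python
from collections import Counter

def rouge1_f1(prediction, reference):
    """Simplified ROUGE-1 F1: unigram overlap on lowercased whitespace tokens."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Counter intersection clips each word's matches to its count in both texts.
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

score = rouge1_f1("the bill amends the code", "this bill amends the government code")
print(round(score, 4))  # 0.7273
```

Four of the five predicted words match the reference (precision 0.8) and four of the six reference words are covered (recall about 0.67), which gives the F1 score above.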


Create a function that passes the predictions and labels to `rouge.compute` to calculate the ROUGE metric. It works as follows:
- The `eval_pred` tuple is unpacked into predictions and labels.
- `tokenizer.batch_decode` decodes the tokenized predictions back to text, skipping special tokens such as padding tokens.
- `np.where` replaces every -100 in the labels array with the tokenizer's `pad_token_id`. The value -100 marks label positions that are ignored during loss calculation, and it is not a valid token id, so it must be replaced before decoding. The labels are then decoded in the same way as the predictions.
- `rouge.compute` calculates the ROUGE scores between the decoded predictions and labels. ROUGE is a common metric for evaluating text summarization.
- The length of each prediction is calculated by counting its non-padding tokens, and the mean prediction length is added to the result dictionary under the key "gen_len".
- Finally, the values in the result dictionary are rounded to 4 decimal places for cleaner output, and the result is returned.

In [25]:
import numpy as np

def compute_metrics(eval_pred):
    # Unpacks the evaluation predictions tuple into predictions and labels.
    predictions, labels = eval_pred

    # Decodes the tokenized predictions back to text, skipping any special tokens (e.g., padding tokens).
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)

    # Replaces any -100 values in labels with the tokenizer's pad_token_id.
    # This is done because -100 is often used to ignore certain tokens when calculating the loss during training.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)

    # Decodes the tokenized labels back to text, skipping any special tokens (e.g., padding tokens).
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Computes the ROUGE metric between the decoded predictions and decoded labels.
    # The use_stemmer parameter enables stemming, which reduces words to their root form before comparison.
    result = rouge.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)

    # Calculates the length of each prediction by counting the non-padding tokens.
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]

    # Computes the mean length of the predictions and adds it to the result dictionary under the key "gen_len".
    result["gen_len"] = np.mean(prediction_lens)

    # Rounds each value in the result dictionary to 4 decimal places for cleaner output, and returns the result.
    return {k: round(v, 4) for k, v in result.items()}
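
The two NumPy steps in `compute_metrics` are easy to check on toy arrays. The sketch below uses made-up token ids and assumes the T5 pad token id of 0. It also assumes that each generated sequence starts with the pad token, which T5 uses as its decoder start token, so that leading token is not counted in the length.

```python
import numpy as np

pad_token_id = 0  # assumed pad token id for T5

# Toy label ids as they arrive at evaluation time: -100 marks padded positions.
labels = np.array([[100, 200, 1, -100, -100],
                   [300, 1, -100, -100, -100]])
cleaned = np.where(labels != -100, labels, pad_token_id)
print(cleaned.tolist())  # [[100, 200, 1, 0, 0], [300, 1, 0, 0, 0]]

# Toy generated ids: count the non-padding tokens per row, then average.
predictions = np.array([[0, 100, 200, 1, 0],
                        [0, 300, 400, 500, 1]])
prediction_lens = [int(np.count_nonzero(pred != pad_token_id)) for pred in predictions]
print(prediction_lens, float(np.mean(prediction_lens)))  # [3, 4] 3.5
```

Under the same assumption, a generation that is cut off at 20 tokens including the start token would report a length of 19.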


## Train

Load AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer classes from the Hugging Face transformers library:

In [26]:
from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer

Load the t5-small model:

In [27]:
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")


Define training hyperparameters in Seq2SeqTrainingArguments. Assign a value to the parameter `output_dir` to specify the location to save the model. It is a required parameter.

In [28]:
training_args = Seq2SeqTrainingArguments(
    output_dir="my_fine_tuned_t5_small_model",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=4,
    predict_with_generate=True,
    fp16=True,
)

Pass the training arguments to Seq2SeqTrainer along with the model, dataset, tokenizer, data collator, and the `compute_metrics` function.

In [29]:
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_billsum["train"],
    eval_dataset=tokenized_billsum["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

Call `train()` to fine-tune the model:

In [30]:
trainer.train()

You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Epoch | Training Loss | Validation Loss | Rouge1 | Rouge2 | Rougel | Rougelsum | Gen Len |
|---|---|---|---|---|---|---|---|
| 1 | No log | 2.788651 | 0.1306 | 0.0411 | 0.1076 | 0.1078 | 19.0 |
| 2 | No log | 2.577241 | 0.1339 | 0.0464 | 0.1098 | 0.1097 | 19.0 |
| 3 | No log | 2.512759 | 0.137 | 0.0519 | 0.1149 | 0.1149 | 19.0 |
| 4 | No log | 2.495941 | 0.1397 | 0.0531 | 0.1168 | 0.1168 | 19.0 |




TrainOutput(global_step=248, training_loss=3.0304060905210433, metrics={'train_runtime': 284.5219, 'train_samples_per_second': 13.904, 'train_steps_per_second': 0.872, 'total_flos': 1070824333246464.0, 'train_loss': 3.0304060905210433, 'epoch': 4.0})

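Observations: The function `compute_metrics` ran at the end of every epoch, and the validation loss and ROUGE scores improved slightly each time. At the last epoch, we have a rouge1 value of 0.1397, a rouge2 value of 0.0531, and rougeL and rougeLsum values of 0.1168.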
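
The `global_step=248` in the training output can be reproduced from the training arguments. The sketch below assumes a single device and no gradient accumulation. It also assumes that `logging_steps` was left at the Trainer default of 500, which would explain why the Training Loss column shows "No log": the run ends before the first logging step.

```python
import math

train_examples = 989   # rows in the train split
batch_size = 16        # per_device_train_batch_size
epochs = 4             # num_train_epochs

# Assumes one device and no gradient accumulation.
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)  # 62 248

# Assumption: logging_steps is left at the Trainer default of 500.
logging_steps = 500
print(total_steps < logging_steps)  # True
```

Setting a smaller `logging_steps` value in `Seq2SeqTrainingArguments` is one way to get the training loss reported during a short run like this one.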

Save the model:

In [33]:
trainer.save_model("my_fine_tuned_t5_small_model")

## Use the Fine-Tuned Model to Summarize Text

We have fine-tuned the t5-small model on the billsum dataset. We can use it for inference.

We will use an example from the test dataset.

In [31]:
text = billsum['test'][100]['text']
text = "summarize: " + text
text

'summarize: The people of the State of California do enact as follows:\n\n\nSECTION 1.\nSection 8610.5 is added to the Government Code, to read:\n8610.5.\n(a) For purposes of this section:\n(1) “Office” means the Office of Emergency Services.\n(2) “Previous fiscal year” means the fiscal year immediately prior to the current fiscal year.\n(3) “Utility” means an “electrical corporation” as defined in Section 218 of the Public Utilities Code.\n(b) (1) State and local costs to carry out activities pursuant to this section and Chapter 4 (commencing with Section 114650) of Part 9 of Division 104 of the Health and Safety Code that are not reimbursed by federal funds shall be borne by a utility operating a nuclear powerplant with a generating capacity of 50 megawatts or more.\n(2) The Public Utilities Commission shall develop and transmit to the office an equitable method of assessing a utility operating a powerplant for its reasonable share of state agency costs specified in paragraph (1).\n(

The simplest way to try out your fine-tuned model for inference is to use it in a pipeline(). Create a `pipeline` object for summarization with the fine-tuned model, and pass the text to it:

In [35]:
from transformers import pipeline

summarizer = pipeline("summarization", model="my_fine_tuned_t5_small_model")
pred = summarizer(text)
pred

Token indices sequence length is longer than the specified maximum sequence length for this model (1645 > 512). Running this sequence through the model will result in indexing errors


[{'summary_text': 'The Public Utilities Commission shall develop and transmit to the office an equitable method of assessing a utility operating a nuclear powerplant for its reasonable share of state agency costs specified in paragraph (1), as required, to carry out activities pursuant to this section and Chapter 4 (commencing with Section 114650) of Part 9 of Division 104 of the Health and Safety Code, upon appropriation by the office, from time to time, for allocation by the Controller for deposit in the Nuclear Planning Assessment Special Account, which is continued in'}]

We can also manually replicate the results of the `pipeline`.


Tokenize the text and return the `input_ids` as PyTorch tensors:

In [43]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("my_fine_tuned_t5_small_model")
inputs = tokenizer(text, return_tensors="pt").input_ids
inputs

Token indices sequence length is longer than the specified maximum sequence length for this model (1643 > 512). Running this sequence through the model will result in indexing errors


tensor([[21603,    10,    37,  ...,  2017,     5,     1]])

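Use the `generate()` method to create the summary. With `do_sample=False`, the model decodes deterministically instead of sampling, and `max_new_tokens=100` caps the length of the generated summary: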

In [44]:
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("my_fine_tuned_t5_small_model")
outputs = model.generate(inputs, max_new_tokens=100, do_sample=False)

Decode the generated token ids back into text:

In [45]:
tokenizer.decode(outputs[0], skip_special_tokens=True)

'The Public Utilities Commission shall develop and transmit to the office an equitable method of assessing a utility operating a nuclear powerplant for its reasonable share of state agency costs specified in paragraph (1), as required, for allocation by the Controller, upon appropriation by the Legislature, for allocation by the Controller, upon appropriation by the office, from time to time, of the amount of its share of the actual or anticipated state and local agency costs, as specified, for activities'
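
What `do_sample=False` and `max_new_tokens` mean can be illustrated with a toy greedy decoder. Everything here is hypothetical: the "model" is a hand-written table of next-token scores that looks only at the previous token, whereas the real model conditions on the encoded input and on all tokens generated so far.

```python
# Hypothetical next-token scores keyed by the previous token (a toy "model").
NEXT_TOKEN_SCORES = {
    "<start>": {"the": 0.6, "a": 0.4},
    "the": {"bill": 0.7, "law": 0.3},
    "bill": {"passes": 0.6, "<eos>": 0.4},
    "passes": {"<eos>": 0.9, "the": 0.1},
}

def greedy_generate(max_new_tokens):
    """Pick the highest-scoring next token at each step (no sampling)."""
    tokens = ["<start>"]
    for _ in range(max_new_tokens):
        scores = NEXT_TOKEN_SCORES[tokens[-1]]
        next_token = max(scores, key=scores.get)
        if next_token == "<eos>":
            break  # the end-of-sequence token stops generation early
        tokens.append(next_token)
    return tokens[1:]

print(greedy_generate(max_new_tokens=100))  # ['the', 'bill', 'passes']
print(greedy_generate(max_new_tokens=2))    # ['the', 'bill']
```

Generation stops either at the end-of-sequence token or when the token budget runs out, which is why the summary above ends mid-sentence after 100 new tokens.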

## Evaluate the Result
We can compute the ROUGE scores of the predicted summary against the reference summary from the dataset.

In [None]:
pred[0]['summary_text']

'The Public Utilities Commission shall develop and transmit to the office an equitable method of assessing a utility operating a nuclear powerplant for its reasonable share of state agency costs specified in paragraph (1), as required, to carry out activities pursuant to this section and Chapter 4 (commencing with Section 114650) of Part 9 of Division 104 of the Health and Safety Code, upon appropriation by the office, from time to time, for allocation by the Controller for deposit in the Nuclear Planning Assessment Special Account, which is continued in'

In [None]:
preds = [pred[0]['summary_text']]

In [None]:
labels = [billsum['test'][100]['summary']]

In [None]:
rouge.compute(predictions=preds, references=labels, use_stemmer=True)

{'rouge1': 0.22745098039215686,
 'rouge2': 0.05905511811023622,
 'rougeL': 0.12156862745098039,
 'rougeLsum': 0.1647058823529412}
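
The `rougeL` score is based on the longest common subsequence (LCS) between the prediction and the reference, so it rewards words that appear in the same order even when they are not adjacent. Here is a simplified sketch, assuming lowercased whitespace tokenization and no stemming. The helpers `lcs_length` and `rouge_l_f1` are hypothetical and will not reproduce the library's numbers exactly.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, tok_a in enumerate(a, start=1):
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]

def rouge_l_f1(prediction, reference):
    """Simplified ROUGE-L F1 from LCS-based precision and recall."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(pred), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

rouge_l = rouge_l_f1("the commission shall assess a utility",
                     "the bill requires the commission to assess a utility")
print(round(rouge_l, 4))  # 0.6667
```

Here the common subsequence is "the commission assess a utility" (5 tokens), giving a precision of 5/6 and a recall of 5/9.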

Great!! We have fine-tuned a pre-trained model in Hugging Face for text summarization.