## Colab

Check Colab GPU

In [1]:
# Only relevant when running this notebook on Colab
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
print(gpu_info)

Sun Apr 16 02:56:57 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA A100-SXM...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   35C    P0    46W / 400W |      0MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

Check Colab RAM

In [2]:
# Only relevant when running this notebook on Colab
from psutil import virtual_memory
ram_gb = virtual_memory().total / 1e9
print('Your runtime has {:.1f} gigabytes of available RAM\n'.format(ram_gb))

Your runtime has 89.6 gigabytes of available RAM



## Imports

In [3]:
# Transformers installation
! pip install transformers datasets evaluate rouge_score

# Imports
from huggingface_hub import notebook_login
from datasets import load_dataset
# Only the names used in this notebook are imported; the unused AdamW and
# get_linear_schedule_with_warmup imports were dropped.
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    EarlyStoppingCallback,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
    pipeline,
)
import evaluate
import numpy as np
import unicodedata

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


Log in to Hugging Face so the model can be uploaded:

In [4]:
notebook_login()

Token is valid.
Your token has been saved in your configured git credential helpers (store).
Your token has been saved to /root/.cache/huggingface/token
Login successful


## Load BillSum dataset

Start by loading the BillSum dataset from the Hugging Face Datasets library:

In [5]:
billsum = load_dataset("billsum")




BillSum already ships with separate splits, so select the training and testing subsets:

In [6]:
# Split the dataset into training and testing subsets
train_dataset = billsum["train"]
test_dataset = billsum["test"]

# Print the size of the training and testing subsets
print(f"Size of the training subset: {len(train_dataset)}")
print(f"Size of the testing subset: {len(test_dataset)}")

Size of the training subset: 18949
Size of the testing subset: 3269


Then take a look at an example:

In [7]:
train_dataset[0]

{'text': "SECTION 1. LIABILITY OF BUSINESS ENTITIES PROVIDING USE OF FACILITIES \n              TO NONPROFIT ORGANIZATIONS.\n\n    (a) Definitions.--In this section:\n            (1) Business entity.--The term ``business entity'' means a \n        firm, corporation, association, partnership, consortium, joint \n        venture, or other form of enterprise.\n            (2) Facility.--The term ``facility'' means any real \n        property, including any building, improvement, or appurtenance.\n            (3) Gross negligence.--The term ``gross negligence'' means \n        voluntary and conscious conduct by a person with knowledge (at \n        the time of the conduct) that the conduct is likely to be \n        harmful to the health or well-being of another person.\n            (4) Intentional misconduct.--The term ``intentional \n        misconduct'' means conduct by a person with knowledge (at the \n        time of the conduct) that the conduct is harmful to the health \n        or w

There are two fields that we want to use:

- `text`: the text of the bill, which will be the input to the model.
- `summary`: a condensed version of `text`, which will be the model target.

## Preprocess

The next step is to load the FLAN-T5 tokenizer to process `text` and `summary`:

In [8]:
checkpoint = 'google/flan-t5-base'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

The preprocessing function we want to create needs to:

1. Prefix the input with a prompt so T5 knows this is a summarization task. Some models capable of multiple NLP tasks require prompting for specific tasks.
2. Use the `text_target` keyword argument when tokenizing labels.
3. Truncate sequences to be no longer than the maximum length set by the `max_length` parameter.

In [9]:
prefix = "summarize: "


def preprocess_function(examples):
    inputs = [prefix + doc for doc in examples["text"]]
    model_inputs = tokenizer(inputs, max_length=1024, truncation=True)

    labels = tokenizer(text_target=examples["summary"], max_length=128, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
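
To make the three requirements concrete, here is a minimal sketch that mimics the shape of `preprocess_function` without downloading a tokenizer. `toy_tokenize`, `toy_preprocess`, and the sample sentences are hypothetical and exist only for illustration; the real FLAN-T5 tokenizer works on subword ids and also returns fields such as `attention_mask`.

```python
prefix = "summarize: "


def toy_tokenize(texts, max_length):
    """Hypothetical stand-in for the tokenizer: split on whitespace, then truncate."""
    return {"input_ids": [t.split()[:max_length] for t in texts]}


def toy_preprocess(examples, max_input=6, max_target=3):
    # Requirement 1: prefix every document with the task prompt.
    inputs = [prefix + doc for doc in examples["text"]]
    # Requirement 3: truncate inputs to at most max_input tokens.
    model_inputs = toy_tokenize(inputs, max_length=max_input)
    # Requirement 2: tokenize the summaries separately; they become the labels.
    labels = toy_tokenize(examples["summary"], max_length=max_target)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs


# Made-up batch in the same column layout as BillSum.
batch = {
    "text": ["This Act may be cited as the Example Act of 2023."],
    "summary": ["Gives the Act a short title."],
}
out = toy_preprocess(batch)
print(out["input_ids"][0])
print(out["labels"][0])
```

Both the prefixed input and the summary are cut at their own limit, which is the same pattern the real function applies with `max_length=1024` and `max_length=128`.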

To apply the preprocessing function over the entire dataset, use the Hugging Face Datasets [map](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.map) method. Setting `batched=True` speeds up `map` by processing multiple elements of the dataset at once:

In [10]:
tokenized_billsum = billsum.map(preprocess_function, batched=True)






Now create a batch of examples using [DataCollatorForSeq2Seq](https://huggingface.co/docs/transformers/main/en/main_classes/data_collator#transformers.DataCollatorForSeq2Seq). It's more efficient to *dynamically pad* the sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximum length.

In [11]:
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=checkpoint)
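
The idea behind dynamic padding can be sketched in plain Python. This is not the collator's implementation: `pad_batch` is a hypothetical helper, the token ids are made up, and the sketch assumes a pad token id of 0 for inputs and `-100` as the fill value for labels.

```python
PAD_ID = 0        # assumed pad token id for the inputs
LABEL_PAD = -100  # assumed fill value for labels, ignored by the loss


def pad_batch(sequences, pad_value):
    """Pad every sequence to the length of the longest one in this batch."""
    longest = max(len(seq) for seq in sequences)
    return [seq + [pad_value] * (longest - len(seq)) for seq in sequences]


# Made-up token ids for a batch of three examples.
input_ids = [[13, 7, 42], [5, 9, 11, 2, 8], [3]]
labels = [[21, 4], [6], [17, 30, 1]]

padded_inputs = pad_batch(input_ids, PAD_ID)
padded_labels = pad_batch(labels, LABEL_PAD)

# Compare the number of stored positions against padding all rows to 1024.
dynamic_cells = sum(len(row) for row in padded_inputs)
fixed_cells = len(input_ids) * 1024
print(padded_inputs)
print(padded_labels)
print(dynamic_cells, fixed_cells)
```

Each batch is only as wide as its longest member, so short batches avoid most of the padding that a fixed maximum length would add.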

## Evaluate

Including a metric during training helps you evaluate the model's performance. For this task, load the ROUGE and BLEU metrics from the Hugging Face Evaluate library.

In [12]:
rouge = evaluate.load("rouge")
bleu = evaluate.load("bleu")
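
For intuition about what ROUGE measures, the following is a simplified sketch of ROUGE-1 as an F1 score over clipped unigram overlap. It is not the library's implementation: `rouge1_f1` is a hypothetical helper that tokenizes on whitespace and skips stemming, and the two sentences are made up.

```python
from collections import Counter


def rouge1_f1(prediction, reference):
    """Simplified ROUGE-1: F1 over clipped unigram overlap of whitespace tokens."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


score = rouge1_f1(
    "the bill lowers drug costs",
    "the bill lowers prescription drug costs",
)
print(round(score, 4))
```

Here all five predicted words appear in the six-word reference, so precision is 1 and recall is 5/6. ROUGE-2 applies the same idea to bigrams, while BLEU is built mainly on n-gram precision.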

Then create a function that passes your predictions and labels to [compute](https://huggingface.co/docs/evaluate/main/en/package_reference/main_classes#evaluate.EvaluationModule.compute) to calculate the ROUGE and BLEU metrics:

In [13]:
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    rouge_result = rouge.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)
    bleu_result = bleu.compute(predictions=decoded_preds, references=[[ref] for ref in decoded_labels])

    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    rouge_result["gen_len"] = np.mean(prediction_lens)
    rouge_result["bleu"] = bleu_result["bleu"]

    return {k: round(v, 4) for k, v in rouge_result.items()}
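
Two steps in `compute_metrics` are easy to miss: label positions filled with `-100` must be mapped back to the pad token id before decoding, because `-100` is not a vocabulary id, and `gen_len` counts only non-pad tokens. The sketch below reproduces both steps with plain lists; the helper names and token ids are hypothetical, and a pad token id of 0 is assumed.

```python
PAD_ID = 0           # assumed pad token id
IGNORE_INDEX = -100  # fill value used for padded label positions


def restore_padding(label_rows, pad_id=PAD_ID):
    """Replace -100 with the pad id, like np.where(labels != -100, labels, pad_id)."""
    return [[tok if tok != IGNORE_INDEX else pad_id for tok in row] for row in label_rows]


def generated_lengths(prediction_rows, pad_id=PAD_ID):
    """Count the non-pad tokens in each prediction, as the gen_len metric does."""
    return [sum(1 for tok in row if tok != pad_id) for row in prediction_rows]


# Made-up token ids for two examples.
labels = [[21, 4, 1, -100, -100], [6, 1, -100, -100, -100]]
predictions = [[0, 21, 4, 1, 0], [0, 6, 9, 1, 0]]

restored = restore_padding(labels)
lengths = generated_lengths(predictions)
print(restored)
print(sum(lengths) / len(lengths))
```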

## Train

Load T5 with [AutoModelForSeq2SeqLM](https://huggingface.co/docs/transformers/main/en/model_doc/auto#transformers.AutoModelForSeq2SeqLM):

In [14]:
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

At this point, only three steps remain:

1. Define your training hyperparameters in [Seq2SeqTrainingArguments](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Seq2SeqTrainingArguments). The only required parameter is `output_dir` which specifies where to save your model. You'll push this model to the Hub by setting `push_to_hub=True` (you need to be signed in to Hugging Face to upload your model). At the end of each epoch, the [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer) will evaluate the ROUGE and BLEU metrics and save the training checkpoint.
2. Pass the training arguments to [Seq2SeqTrainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Seq2SeqTrainer) along with the model, dataset, tokenizer, data collator, and `compute_metrics` function.
3. Call [train()](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer.train) to finetune your model.

In [16]:
training_args = Seq2SeqTrainingArguments(
    output_dir="flan-t5-base-billsum_model",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
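    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=40,  # upper bound; EarlyStoppingCallback may stop training sooner
    predict_with_generate=True,
    # Caution: T5-family checkpoints can overflow in fp16, which shows up as a
    # zero training loss and a missing validation loss. bf16=True is a common
    # alternative on GPUs that support it, such as the A100.
    fp16=True,
    push_to_hub=True,
    load_best_model_at_end=True,
    metric_for_best_model="loss",
    gradient_accumulation_steps=8,  # effective batch size per device: 8 * 8 = 64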
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_billsum["train"],
    eval_dataset=tokenized_billsum["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
)

trainer.train()

/content/flan-t5-base-billsum_model is already a clone of https://huggingface.co/alaahussein/flan-t5-base-billsum_model. Make sure you pull the latest changes with `repo.git_pull()`.


Epoch,Training Loss,Validation Loss,Rouge1,Rouge2,Rougel,Rougelsum,Gen Len,Bleu
0,No log,,0.2154,0.1259,0.1843,0.1843,17.3735,0.0011
1,0.000000,,0.2154,0.1259,0.1843,0.1843,17.3735,0.0011
2,0.000000,,0.2154,0.1259,0.1843,0.1843,17.3735,0.0011




TrainOutput(global_step=888, training_loss=0.0, metrics={'train_runtime': 3094.3215, 'train_samples_per_second': 244.952, 'train_steps_per_second': 3.826, 'total_flos': 7.785280242922291e+16, 'train_loss': 0.0, 'epoch': 3.0})
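
The logged `global_step=888` can be cross-checked against the batch settings. The sketch below assumes a single GPU and assumes that an incomplete group of accumulated batches at the end of an epoch does not count as an optimizer step; under those assumptions the arithmetic matches the three epochs shown in the table.

```python
import math

train_examples = 18949   # size of the training subset printed earlier
per_device_batch = 8     # per_device_train_batch_size
accumulation_steps = 8   # gradient_accumulation_steps
num_devices = 1          # assumption: a single GPU
epochs_run = 3           # epochs completed before early stopping

# Examples contributing to one optimizer update.
effective_batch = per_device_batch * accumulation_steps * num_devices
# Forward/backward passes per epoch; the last batch may be smaller.
batches_per_epoch = math.ceil(train_examples / (per_device_batch * num_devices))
# Assumed: only complete accumulation groups count as optimizer steps.
updates_per_epoch = batches_per_epoch // accumulation_steps
total_updates = updates_per_epoch * epochs_run

print(effective_batch, batches_per_epoch, updates_per_epoch, total_updates)
```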

Once training is complete, share the model to the Hub:

In [17]:
trainer.push_to_hub()

Upload file runs/Apr16_02-58-24_610f3e27deb2/events.out.tfevents.1681613908.610f3e27deb2.10121.2: 100%|#######…

To https://huggingface.co/alaahussein/flan-t5-base-billsum_model
   661337c..c1671de  main -> main


To https://huggingface.co/alaahussein/flan-t5-base-billsum_model
   c1671de..9075ab9  main -> main




'https://huggingface.co/alaahussein/flan-t5-base-billsum_model/commit/c1671de262eb860856fb475fb57a62d2b0498737'

## Inference

Create text to summarize. For T5, you need to prefix your input depending on the task you're working on. For summarization you should prefix your input as shown below:

In [18]:
text = "summarize: The Inflation Reduction Act lowers prescription drug costs, health care costs, and energy costs. It's the most aggressive action on tackling the climate crisis in American history, which will lift up American workers and create good-paying, union jobs across the country. It'll lower the deficit and ask the ultra-wealthy and corporations to pay their fair share. And no one making under $400,000 per year will pay a penny more in taxes."

Instantiate a `pipeline` for summarization with your model, and pass your text to it:

In [19]:
summarizer = pipeline("summarization", model="alaahussein/flan-t5-base-billsum_model")
summarizer(text)


Your max_length is set to 200, but you input_length is only 103. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=51)


[{'summary_text': "The Inflation Reduction Act is a major step in the fight against inflation. It's a major step in the fight against the climate crisis."}]

You can also manually replicate the results of the `pipeline` if you'd like:


Tokenize the text and return the `input_ids` as PyTorch tensors:

In [20]:
tokenizer = AutoTokenizer.from_pretrained("alaahussein/flan-t5-base-billsum_model")
inputs = tokenizer(text, return_tensors="pt").input_ids

Use the [generate()](https://huggingface.co/docs/transformers/main/en/main_classes/text_generation#transformers.GenerationMixin.generate) method to generate the summary. For more details about the different text generation strategies and parameters for controlling generation, check out the [Text Generation](https://huggingface.co/docs/transformers/main/en/main_classes/text_generation) API.

In [21]:
model = AutoModelForSeq2SeqLM.from_pretrained("alaahussein/flan-t5-base-billsum_model")
outputs = model.generate(inputs, max_new_tokens=100, do_sample=False)

Decode the generated token ids back into text:

In [22]:
tokenizer.decode(outputs[0], skip_special_tokens=True)

'The Inflation Reduction Act will lower the cost of prescription drugs, health care, and energy costs.'
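
With `do_sample=False` and no beam search, generation is greedy: at every step the highest-scoring next token is chosen, until an end token appears or `max_new_tokens` is reached. The toy sketch below illustrates only that selection rule. The score table and `greedy_decode` are hypothetical, and a real model conditions each step on the encoded input and all previously generated tokens rather than on the previous token alone.

```python
# Hypothetical next-token scores, keyed by the previous token.
NEXT_TOKEN_SCORES = {
    "<start>": {"the": 0.6, "a": 0.4},
    "the": {"act": 0.7, "bill": 0.3},
    "act": {"lowers": 0.8, "<end>": 0.2},
    "lowers": {"costs": 0.9, "<end>": 0.1},
    "costs": {"<end>": 1.0},
}


def greedy_decode(max_new_tokens):
    """Pick the highest-scoring next token until <end> or the token budget is hit."""
    tokens, current = [], "<start>"
    for _ in range(max_new_tokens):
        scores = NEXT_TOKEN_SCORES[current]
        current = max(scores, key=scores.get)
        if current == "<end>":
            break
        tokens.append(current)
    return " ".join(tokens)


full = greedy_decode(max_new_tokens=100)
short = greedy_decode(max_new_tokens=2)
print(full)
print(short)
```

The second call shows how a small token budget cuts the output short, which is the role `max_new_tokens=100` plays in the `generate()` call above.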