# Fine-Tune a Generative AI Model for Bill Text Summarization

In this notebook, we will fine-tune a pre-trained language model from Hugging Face to enhance summarization of legislative documents using the BillSum dataset. The dataset contains texts of US Congressional and California state bills, along with their corresponding summaries. Specifically, you will work with the [FLAN-T5](https://huggingface.co/docs/transformers/model_doc/flan-t5) model, a high-quality, instruction-tuned model capable of generating summaries out of the box.

To optimize performance, you will perform full fine-tuning on the dataset, using the bill text (`text`) and bill summary (`summary`) fields, and evaluate the results using ROUGE metrics. Additionally, you will explore Parameter-Efficient Fine-Tuning (PEFT), compare its performance to full fine-tuning, and analyze how the computational efficiency of PEFT weighs against its typically slightly lower performance.

## 1 - Set up Kernel, Load Required Dependencies, Dataset and LLM

### 1.1 - Enabling the GPU

First, you'll need to enable GPUs for the notebook:

- Navigate to Edit→Notebook Settings
- Select GPU from the Hardware Accelerator drop-down
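
Once the runtime has restarted, it is worth confirming that a GPU is actually visible before loading the model. The following is a minimal sketch, assuming PyTorch may not yet be importable at this point in the notebook; `pick_device` is a hypothetical helper name, not part of any library.

```python
import importlib.util


def pick_device() -> str:
    """Return "cuda" when PyTorch is installed and sees a GPU, else "cpu"."""
    # Guard against PyTorch not being installed in the current environment.
    if importlib.util.find_spec("torch") is None:
        return "cpu"
    import torch

    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"Using device: {device}")
```

If this prints `cpu` in Colab, the hardware accelerator setting has most likely not been applied to the current session.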

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
path='/content/drive/My Drive/data/'

In [None]:
%%capture
# Upgrade pip to the latest version
%pip install --upgrade pip

# Install the latest version of PyTorch and TorchData
%pip install --disable-pip-version-check torch torchdata --quiet

# Install the latest versions of the required libraries
%pip install \
    transformers \
    datasets \
    evaluate \
    rouge_score \
    loralib \
    peft --quiet


In [None]:
from datasets import load_dataset
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, GenerationConfig, TrainingArguments, Trainer
import torch
import time
import evaluate
import pandas as pd
import numpy as np

### 1.2 - Load Dataset and LLM

We are going to experiment with the [billsum](https://huggingface.co/datasets/FiscalNote/billsum) Hugging Face dataset. It contains US Congressional and California state bills together with their summaries, making it particularly useful for natural language processing tasks involving long-form text summarization.

In [None]:
huggingface_dataset_name = "FiscalNote/billsum"

dataset = load_dataset(huggingface_dataset_name)

dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'summary', 'title'],
        num_rows: 18949
    })
    test: Dataset({
        features: ['text', 'summary', 'title'],
        num_rows: 3269
    })
    ca_test: Dataset({
        features: ['text', 'summary', 'title'],
        num_rows: 1237
    })
})

Load the pre-trained [FLAN-T5 model](https://huggingface.co/docs/transformers/model_doc/flan-t5) and its tokenizer directly from Hugging Face. Notice that you will be using the [base version](https://huggingface.co/google/flan-t5-base) of FLAN-T5. Setting `torch_dtype=torch.bfloat16` specifies the data type of the model weights, and therefore their numerical precision and memory footprint.

In [None]:
model_name='google/flan-t5-base'

original_model = AutoModelForSeq2SeqLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_name)

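You can determine the total number of model parameters and identify which of them are trainable. The following function helps accomplish this, and for now, there's no need to dive into its specifics. A freshly loaded model reports every parameter as trainable; the output recorded below (1.41% trainable) appears to come from re-running this cell after the LoRA adapter in section 3.1 had been attached to `original_model`.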

In [None]:
def summarize_model_parameters(model):
    total_params = sum(p.numel() for p in model.parameters())
    trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    percentage_trainable = (trainable_params / total_params) * 100

    return (
        f"Trainable Parameters: {trainable_params}\n"
        f"Total Parameters: {total_params}\n"
        f"Percentage of Trainable Parameters: {percentage_trainable:.2f}%"
    )

# Example usage
print(summarize_model_parameters(original_model))


Trainable Parameters: 3538944
Total Parameters: 251116800
Percentage of Trainable Parameters: 1.41%
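
The counts above also give a rough idea of how much memory the weights occupy. The sketch below is back-of-the-envelope arithmetic only: it copies the two numbers from the printed output, assumes 2 bytes per parameter for `bfloat16` and 4 bytes for `float32`, and ignores activations, gradients, and optimizer state. The helper name `weight_memory_mib` is hypothetical.

```python
# Numbers copied from the output above.
total_params = 251_116_800
trainable_params = 3_538_944

# Assumed bytes per parameter for the two dtypes compared here.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2}


def weight_memory_mib(num_params: int, dtype: str) -> float:
    """Approximate size of the weights alone, in mebibytes."""
    return num_params * BYTES_PER_PARAM[dtype] / (1024 ** 2)


for dtype in BYTES_PER_PARAM:
    print(f"{dtype:>8}: {weight_memory_mib(total_params, dtype):,.0f} MiB")

# Share of parameters that receive gradients, matching the printed 1.41%.
trainable_share = 100 * trainable_params / total_params
print(f"trainable share: {trainable_share:.2f}%")
```

Halving the bytes per parameter halves the weight footprint, which is the main reason for loading the model in `bfloat16` on a memory-constrained GPU.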


<a name='1.3'></a>
### 1.3 - Test the Model with Zero Shot Inferencing

Evaluate the model using zero-shot inference on the BillSum dataset. While the model struggles to generate summaries as effectively as the provided baseline summaries, it does manage to extract some key information from the bill text. This indicates that the model has potential and can be fine-tuned to perform better on legislative summarization tasks.

In [None]:
index = 100

bill_text = dataset['test'][index]['text']
summary = dataset['test'][index]['summary']

prompt = f"""
Summarize the following bill text:

{bill_text}

Summary:
"""

inputs = tokenizer(prompt, return_tensors='pt')
output = tokenizer.decode(
    original_model.generate(
        inputs["input_ids"],
        max_new_tokens=200,
    )[0],
    skip_special_tokens=True
)

dash_line = '-'.join('' for x in range(100))
print(dash_line)
print(f'INPUT PROMPT:\n{prompt}')
print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{summary}\n')
print(dash_line)
print(f'MODEL GENERATION - ZERO SHOT:\n{output}')

---------------------------------------------------------------------------------------------------
INPUT PROMPT:

Summarize the following bill text:

SECTION 1. SHORT TITLE.

    This Act may be cited as the ``Afghanistan and Central Asian 
Republics Sustainable Food Production Act of 2001''.

SEC. 2. FINDINGS.

    Congress finds that--
            (1) abject poverty and the inability to produce food, even 
        at the subsistence level, in the rural, mountainous areas of 
        Afghanistan and the Central Asian Republics have plagued the 
        region for over 20 years;
            (2) extended food shortages in this region have resulted in 
        the consumption of seed supplies and breeding livestock 
        necessary to continue farming and food production;
            (3) ongoing and violent conflict in the region has badly 
        damaged or destroyed the basic irrigation systems necessary for 
        food production;
            (4) despite the delivery of over $18

<a name='2'></a>
## 2 - Perform Full Fine-Tuning

### 2.1 - Preprocess the BillSum Dataset

To prepare the BillSum dataset for training, you need to convert the bill text-summary pairs into explicit instructions for the language model. Prepend the instruction `Summarize the following bill text.` to the bill text and append `Summary:` after it, so that the bill summary becomes the expected response:

Training prompt (bill text):
```
Summarize the following bill text.

    This is the full text of a legislative bill.
    
Summary:
```

Training response (summary):
```
This is a concise summary of the bill.
```

Then preprocess the prompt-response dataset into tokens and pull out their `input_ids` (1 per token).
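
The two preprocessing steps (wrapping each bill in the instruction template, then forcing every sequence to a fixed length) can be illustrated without a real tokenizer. The sketch below is a toy stand-in, not the Hugging Face implementation: `build_prompt` and `pad_or_truncate` are hypothetical helpers, whitespace splitting replaces the real subword tokenizer, and the maximum length of 8 and padding ID of 0 are chosen only to keep the example small.

```python
START_PROMPT = "Summarize the following bill text.\n\n"
END_PROMPT = "\n\nSummary: "


def build_prompt(bill_text: str) -> str:
    """Wrap a bill in the instruction template used for training."""
    return START_PROMPT + bill_text + END_PROMPT


def pad_or_truncate(token_ids: list, max_length: int, pad_id: int = 0) -> list:
    """Mimic padding="max_length" with truncation=True on a list of IDs."""
    clipped = token_ids[:max_length]
    return clipped + [pad_id] * (max_length - len(clipped))


prompt = build_prompt("This Act may be cited as the Example Act.")

# Toy "tokenizer": one ID per whitespace-separated word (IDs start at 1).
vocab = {word: i for i, word in enumerate(sorted(set(prompt.split())), start=1)}
toy_ids = [vocab[word] for word in prompt.split()]

long_example = pad_or_truncate(toy_ids, max_length=8)       # cut down to 8 IDs
short_example = pad_or_truncate(toy_ids[:3], max_length=8)  # padded up to 8 IDs
print(long_example)
print(short_example)
```

Every row ends up with the same length, which is what allows the examples to be stacked into rectangular tensors for training.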

In [None]:
def tokenize_function(example):
    start_prompt = 'Summarize the following bill text.\n\n'
    end_prompt = '\n\nSummary: '
    prompt = [start_prompt + bill_text + end_prompt for bill_text in example["text"]]
    example['input_ids'] = tokenizer(prompt, padding="max_length", truncation=True, return_tensors="pt").input_ids
    example['labels'] = tokenizer(example["summary"], padding="max_length", truncation=True, return_tensors="pt").input_ids

    return example

# The dataset contains three splits: train, test, and ca_test.
# With batched=True, tokenize_function processes every split in batches.
tokenized_datasets = dataset.map(tokenize_function, batched=True)
tokenized_datasets = tokenized_datasets.remove_columns(['text', 'summary', 'title'])

To save some time in the lab, we will subsample the dataset:

In [None]:
def keep_every_50th(example, index):
    # Keep only every 50th example
    return index % 50 == 0

# Filter the dataset to retain every 50th example
tokenized_datasets = tokenized_datasets.filter(keep_every_50th, with_indices=True)


Check the shapes of all three parts of the dataset:

In [None]:
print("Shapes of the datasets:")
print(f"Training: {tokenized_datasets['train'].shape}")
print(f"ca_test: {tokenized_datasets['ca_test'].shape}")
print(f"Test: {tokenized_datasets['test'].shape}")

print(tokenized_datasets)

Shapes of the datasets:
Training: (379, 2)
ca_test: (25, 2)
Test: (66, 2)
DatasetDict({
    train: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 379
    })
    test: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 66
    })
    ca_test: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 25
    })
})


The output dataset is ready for fine-tuning.
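
The row counts above follow directly from the filter: keeping indices 0, 50, 100, and so on leaves the ceiling of `n / 50` rows per split. A quick check, using the original split sizes printed when the dataset was loaded:

```python
import math

# Original split sizes, as printed when the dataset was loaded.
original_rows = {"train": 18949, "test": 3269, "ca_test": 1237}
STRIDE = 50

kept_rows = {}
for split, n in original_rows.items():
    # Indices 0, 50, 100, ... below n are kept.
    kept_rows[split] = len(range(0, n, STRIDE))
    print(f"{split}: {n} -> {kept_rows[split]}")
```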

### 2.2 - Fine-Tune the Model with the Preprocessed Dataset
Leverage the Hugging Face Trainer class (refer to the documentation [here](https://huggingface.co/docs/transformers/main_classes/trainer)) to fine-tune the model. Use the preprocessed dataset and initialize the trainer with the original model. The training parameters have been determined experimentally, so there’s no need to dive into their specifics for now.

In [None]:
output_dir = f'{path}bill_text-summary-training-{str(int(time.time()))}'

training_args = TrainingArguments(
    output_dir=output_dir,
    report_to="none",
    learning_rate=1e-5,
    num_train_epochs=1,
    weight_decay=0.01,
    logging_steps=10,
    max_steps=10  # When positive, max_steps takes precedence over num_train_epochs
)

trainer = Trainer(
    model=original_model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['test']
)

In [None]:
trainer.train()

| Step | Training Loss |
|------|---------------|
| 10   | 29.175        |


TrainOutput(global_step=10, training_loss=29.175, metrics={'train_runtime': 36.3918, 'train_samples_per_second': 2.198, 'train_steps_per_second': 0.275, 'total_flos': 54780588195840.0, 'train_loss': 29.175, 'epoch': 0.20833333333333334})
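
The `epoch` value of roughly 0.208 in the output is consistent with training stopping after `max_steps=10` rather than after a full epoch. A quick check, assuming the `TrainingArguments` default of 8 examples per device per step and a single device (neither is set explicitly in the cell above):

```python
import math

train_rows = 379   # size of the subsampled training split
batch_size = 8     # assumed: default per-device batch size, one device
max_steps = 10     # from the TrainingArguments above

steps_per_epoch = math.ceil(train_rows / batch_size)
fraction_of_epoch = max_steps / steps_per_epoch

print(f"steps per epoch: {steps_per_epoch}")
print(f"fraction of one epoch covered: {fraction_of_epoch:.4f}")
```

In other words, this run sees only about a fifth of the already subsampled training set, so the resulting checkpoint should be treated as a demonstration rather than a converged model.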

Load the fine-tuned model from the saved checkpoint:

In [None]:
fine_tuned_model=AutoModelForSeq2SeqLM.from_pretrained("/content/drive/MyDrive/data/bill_text-summary-training-1736346737/checkpoint-10", torch_dtype=torch.bfloat16)

<a name='2.3'></a>
### 2.3 - Evaluate the Model Qualitatively (Human Evaluation)

As with many GenAI applications, a qualitative approach where you ask yourself the question "Is my model behaving the way it is supposed to?" is usually a good starting point. In the example below (the same one we started this notebook with), you can see how the fine-tuned model produces a more reasonable summary of the bill than the original model, which struggled to understand what was being asked of it.

In [None]:
index = 100
bill_text = dataset['test'][index]['text']
human_baseline_summary = dataset['test'][index]['summary']

prompt = f"""
Summarize the following bill text:

{bill_text}

Summary:
"""


input_ids = tokenizer(prompt, return_tensors="pt").input_ids
input_ids = input_ids.to(original_model.device) # Move input_ids to the same device as the model
original_model_outputs = original_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)

input_ids = tokenizer(prompt, return_tensors="pt").input_ids
input_ids = input_ids.to(fine_tuned_model.device)
fine_tuned_model_outputs = fine_tuned_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
fine_tuned_model_text_output = tokenizer.decode(fine_tuned_model_outputs[0], skip_special_tokens=True)

print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{human_baseline_summary}')
print(dash_line)
print(f'ORIGINAL MODEL:\n{original_model_text_output}')
print(dash_line)
print(f'Fine_Tuned MODEL:\n{fine_tuned_model_text_output}')

---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
Afghanistan and Central Asian Republics Sustainable Food Production Act of 2001 - Directs the Administrator of the United States Agency for International Development to provide financial assistance to nongovernmental organizations carrying out rural developmental activities in Afghanistan, Kyrgyzstan, Pakistan, Tajikistan, Turkmenistan, and Uzbekistan. Stipulates the aid shall be used for: (1) restocking seed; (2) replacing breeding livestock; (3) restoring basic irrigation systems; (4) providing access to credit for food production, processing or marketing enterprises through rural microenterprise loan programs; and (5) technical assistance. Places human rights and other conditions on the government of Afghanistan for projects to be funded in Afghanistan.
---------------------------------------------------------------------------------------------------
ORIGINAL 

### 2.4 - Quantitative Evaluation of the Model (Using the ROUGE Metric)

The [ROUGE metric](https://en.wikipedia.org/wiki/ROUGE_(metric)) is a widely used method for evaluating the quality of generated summaries. It measures how well a model's summarizations align with a "baseline" summary, typically written by a human. While not without limitations, ROUGE provides a useful indication of the improvements in summarization performance achieved through fine-tuning.
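
To make the idea concrete, here is a simplified sketch of ROUGE-1 as a unigram-overlap F1 score. It is an illustration only, not the implementation behind `evaluate.load('rouge')`: it assumes lowercased whitespace tokenization, applies no stemming, and handles a single reference. The function name `rouge1_f1` and the example sentences are made up.

```python
from collections import Counter


def rouge1_f1(prediction: str, reference: str) -> float:
    """Unigram-overlap F1 between a prediction and a single reference."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())

    # Clipped overlap: a word counts at most as often as it appears in both.
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0

    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


reference = "the bill funds rural irrigation systems"
close_score = rouge1_f1("the bill funds irrigation", reference)
far_score = rouge1_f1("a bill to amend the tax code", reference)
print(close_score, far_score)
```

A prediction that reuses most of the reference's words scores high even if it omits details, which is one of the limitations mentioned above: ROUGE rewards word overlap, not factual correctness.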

In [None]:
rouge = evaluate.load('rouge')

Generate outputs for a sample of the test dataset (only 10 bill texts and summaries, to save time), and save the results.

In [None]:
bill_texts = dataset['test'][0:10]['text']
human_baseline_summaries = dataset['test'][0:10]['summary']

original_model_summaries = []
fine_tuned_model_summaries = []

for bill_text in bill_texts:
    prompt = f"""
Summarize the following bill text.

{bill_text}

Summary: """
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    input_ids = input_ids.to(original_model.device)

    original_model_outputs = original_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)
    original_model_summaries.append(original_model_text_output)

    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    input_ids = input_ids.to(fine_tuned_model.device)
    fine_tuned_model_outputs = fine_tuned_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    fine_tuned_model_text_output = tokenizer.decode(fine_tuned_model_outputs[0], skip_special_tokens=True)
    fine_tuned_model_summaries.append(fine_tuned_model_text_output)

zipped_summaries = list(zip(human_baseline_summaries, original_model_summaries, fine_tuned_model_summaries))

df = pd.DataFrame(zipped_summaries, columns = ['human_baseline_summaries', 'original_model_summaries', 'fine_tuned_model_summaries'])
df

|   | human_baseline_summaries | original_model_summaries | fine_tuned_model_summaries |
|---|--------------------------|--------------------------|----------------------------|
| 0 | Amends the Water Resources Development Act of ... | A bill to amend title 106- 69 of the United St... | A bill to amend title I of the Water Resources... |
| 1 | Federal Forage Fee Act of 1993 - Subjects graz... | SECTION 1. SECTION 1. FEE. SECTION 2. SECTION ... | A bill to amend title 10, United States Code, ... |
| 2 | . Merchant Marine of World War II Congression... | Congressional Gold Medal Act of 2015 | A bill to provide for the award of a gold meda... |
| 3 | Small Business Modernization Act of 2004 - Ame... | A bill to amend the Internal Revenue Code of 1... | A bill to amend title 38, United States Code, ... |
| 4 | Fair Access to Investment Research Act of 2016... | SECTION 2. This Act may be cited as the Fair ... | A bill to amend title 17, United States Code, ... |
| 5 | Prescription Drug Monitoring Act of 2016 This ... | This Act shall provide for the establishment o... | A bill to provide for the establishment of a d... |
| 6 | Strategic Gasoline and Fuel Reserve Act of 200... | SEC. | A bill to amend title I of the Energy Policy a... |
| 7 | Special Agent Scott K. Carey Public Safety Off... | A bill to amend title 1217 of the Omnibus Crim... | A bill to amend title 1213 of the Omnibus Crim... |
| 8 | Promoting Financial Literacy and Economic Oppo... | This Act may be cited as the 'Promoting Financ... | A bill to amend chapter 45 of the Internal Rev... |
| 9 | Amends the Tariff Act of 1930 to define "deliv... | ... | A bill to amend section 801 of the Tariff Act ... |


Evaluate the models by computing ROUGE metrics. Notice the improvement in the results!

In [None]:
original_model_results = rouge.compute(
    predictions=original_model_summaries,
    references=human_baseline_summaries[0:len(original_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

fine_tuned_model_results = rouge.compute(
    predictions=fine_tuned_model_summaries,
    references=human_baseline_summaries[0:len(fine_tuned_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

print('ORIGINAL MODEL:')
print(original_model_results)
print('Fine Tuned MODEL:')
print(fine_tuned_model_results)

ORIGINAL MODEL:
{'rouge1': 0.15356489411915925, 'rouge2': 0.06135715778814338, 'rougeL': 0.11357002363528759, 'rougeLsum': 0.11516928774855649}
Fine Tuned MODEL:
{'rouge1': 0.24380716305340505, 'rouge2': 0.0948423534248077, 'rougeL': 0.18826470804907652, 'rougeLsum': 0.19613854731762947}


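The results show an improvement in all four ROUGE metrics. The cell below reports the absolute gain of the fine-tuned model over the original model, expressed in percentage points: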

In [None]:
print("Absolute percentage improvement of FINE TUNED MODEL over ORIGINAL MODEL")

# Calculate the percentage improvement for each metric
improvements = {
    key: (fine_tuned_model_results[key] - original_model_results[key]) * 100
    for key in fine_tuned_model_results.keys()
}

# Print the improvements for each metric
for metric, improvement in improvements.items():
    print(f"{metric}: {improvement:.2f}%")


Absolute percentage improvement of FINE TUNED MODEL over ORIGINAL MODEL
rouge1: 9.02%
rouge2: 3.35%
rougeL: 7.47%
rougeLsum: 8.10%
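
The figures above are absolute differences in percentage points. It can also be useful to look at the relative change, which expresses the gain as a fraction of the original score. The sketch below recomputes both from the scores reported earlier in this section; the variable names are illustrative only.

```python
# Scores copied from the ROUGE output above.
original = {
    "rouge1": 0.15356489411915925,
    "rouge2": 0.06135715778814338,
    "rougeL": 0.11357002363528759,
    "rougeLsum": 0.11516928774855649,
}
fine_tuned = {
    "rouge1": 0.24380716305340505,
    "rouge2": 0.0948423534248077,
    "rougeL": 0.18826470804907652,
    "rougeLsum": 0.19613854731762947,
}

absolute_points = {}
relative_percent = {}
for metric in original:
    # Absolute gain: difference of the two scores, in percentage points.
    absolute_points[metric] = (fine_tuned[metric] - original[metric]) * 100
    # Relative gain: the same difference as a share of the original score.
    relative_percent[metric] = (fine_tuned[metric] / original[metric] - 1) * 100
    print(f"{metric}: +{absolute_points[metric]:.2f} points, "
          f"+{relative_percent[metric]:.1f}% relative")
```

When reporting results, stating which of the two is meant avoids ambiguity, since a gain of a few points on a low baseline corresponds to a much larger relative change.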


## 3 - Implement Parameter-Efficient Fine-Tuning (PEFT)

Next, we'll explore **Parameter-Efficient Fine-Tuning (PEFT)**, an alternative to the "full fine-tuning" approach used earlier. PEFT offers a more resource-efficient way to adapt models while achieving evaluation results that are comparable to full fine-tuning, as you’ll observe shortly.

PEFT is an umbrella term encompassing methods such as **Low-Rank Adaptation (LoRA)** and prompt tuning (which is distinct from prompt engineering). In practice, PEFT most often refers to LoRA. LoRA fine-tunes a model with significantly reduced computational demands, often on a single GPU. Instead of modifying the original large language model (LLM), LoRA fine-tuning produces a compact "LoRA adapter" tailored to a specific task or application. The adapter is only a small fraction of the size of the original LLM (e.g., megabytes compared to gigabytes).

During inference, the LoRA adapter is merged back with the original LLM to process requests. The key advantage here is that a single LLM can support multiple tasks by leveraging different LoRA adapters, minimizing the overall memory footprint when serving diverse use cases. This modularity makes LoRA an efficient and scalable solution for fine-tuning large models.
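
The relationship between the frozen weights and the adapter can be shown with a few lines of arithmetic. The sketch below is a toy, pure-Python illustration and not the `peft` implementation: it assumes the usual LoRA formulation in which a frozen weight matrix `W` is augmented by a low-rank product `B A` scaled by `alpha / r`, and all matrices and values are made up for the example.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    """Element-wise sum of two matrices of equal shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def scale(a, s):
    """Multiply every entry of a matrix by a scalar."""
    return [[x * s for x in row] for row in a]


# Frozen 3x3 weight and a rank-1 adapter (r = 1); values are arbitrary.
W = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0], [3.0, 0.0, 1.0]]
B = [[0.5], [0.0], [-1.0]]   # shape (3, r)
A = [[1.0, 2.0, 0.0]]        # shape (r, 3)
r, alpha = 1, 2
x = [[1.0], [1.0], [1.0]]    # column-vector input

# Adapter kept separate: h = W x + (alpha / r) * B (A x)
h_separate = add(matmul(W, x), scale(matmul(B, matmul(A, x)), alpha / r))

# Adapter merged into the weight: W' = W + (alpha / r) * B A, then h = W' x
W_merged = add(W, scale(matmul(B, A), alpha / r))
h_merged = matmul(W_merged, x)

print(h_separate)
print(h_merged)
```

Both paths give the same output, which is why an adapter can be stored separately from the base model and combined with it only when a particular task is served. Here the adapter holds 6 numbers against 9 in `W`; the saving grows quickly as the matrices get larger and the rank stays small.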

### 3.1 - Configure the PEFT/LoRA Model for Fine-Tuning

To begin fine-tuning using PEFT/LoRA, we’ll configure the model to include a new layer or parameter adapter. With this approach, the underlying LLM remains frozen, and only the adapter is trained. Below is an example of the LoRA configuration, which includes the rank (`r`) hyperparameter. This parameter specifies the dimensionality or rank of the adapter being trained. Adjusting `r` allows you to control the complexity and size of the adapter.

In [None]:
from peft import LoraConfig, get_peft_model, TaskType

lora_config = LoraConfig(
    r=32, # Rank
    lora_alpha=32,
    target_modules=["q", "v"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.SEQ_2_SEQ_LM # FLAN-T5
)

Add LoRA adapter layers/parameters to the original LLM to be trained.

In [None]:
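# Note: get_peft_model injects the LoRA layers into `original_model` in place,
# so `original_model` and `peft_model` share the same underlying weights afterwards.
peft_model = get_peft_model(original_model,
                            lora_config)
print(summarize_model_parameters(peft_model))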

Trainable Parameters: 3538944
Total Parameters: 251116800
Percentage of Trainable Parameters: 1.41%
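
The trainable-parameter count can be reproduced by hand. The sketch below assumes that FLAN-T5 base has query and value projections mapping 768 to 768 dimensions, and 12 encoder plus 12 decoder layers, with each decoder layer holding both a self-attention and a cross-attention block. These architecture figures are assumptions drawn from the published model configuration, not from this notebook.

```python
r = 32                         # LoRA rank from lora_config
d_in = d_out = 768             # assumed size of each q / v projection

encoder_attn_blocks = 12       # assumed: one self-attention per encoder layer
decoder_attn_blocks = 12 * 2   # assumed: self- plus cross-attention per decoder layer
targets_per_block = 2          # target_modules=["q", "v"]

adapted_modules = (encoder_attn_blocks + decoder_attn_blocks) * targets_per_block

# Each adapted module gets two small matrices: A (r x d_in) and B (d_out x r).
params_per_module = r * d_in + d_out * r
lora_params = adapted_modules * params_per_module

total_params = 251_116_800     # total reported in the output above
print(f"LoRA parameters: {lora_params}")
print(f"Share of total: {100 * lora_params / total_params:.2f}%")
```

Because the count scales linearly with `r`, halving the rank would halve the number of trainable parameters.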


<a name='3.2'></a>
### 3.2 - Train PEFT Adapter

Define the training arguments and create a `Trainer` instance.

In [None]:
output_dir = f'{path}peft-bill_text-summary-training-{str(int(time.time()))}'

peft_training_args = TrainingArguments(
    output_dir=output_dir,
    auto_find_batch_size=True,
    report_to="none",
    learning_rate=1e-3, # Higher learning rate than full fine-tuning.
    num_train_epochs=1,
    logging_steps=1,
    max_steps=1
)

peft_trainer = Trainer(
    model=peft_model,
    args=peft_training_args,
    train_dataset=tokenized_datasets["train"],
)

Now everything is ready to train the PEFT adapter and save the model.

In [None]:
peft_trainer.train()

peft_model_path=f"{path}peft-bill_text-dialogue-summary-checkpoint-local"

peft_trainer.model.save_pretrained(peft_model_path)
tokenizer.save_pretrained(peft_model_path)

| Step | Training Loss |
|------|---------------|
| 1    | 28.375        |


('/content/drive/My Drive/data/peft-bill_text-dialogue-summary-checkpoint-local/tokenizer_config.json',
 '/content/drive/My Drive/data/peft-bill_text-dialogue-summary-checkpoint-local/special_tokens_map.json',
 '/content/drive/My Drive/data/peft-bill_text-dialogue-summary-checkpoint-local/spiece.model',
 '/content/drive/My Drive/data/peft-bill_text-dialogue-summary-checkpoint-local/added_tokens.json',
 '/content/drive/My Drive/data/peft-bill_text-dialogue-summary-checkpoint-local/tokenizer.json')

Prepare this model by adding an adapter to the original FLAN-T5 model. You are setting `is_trainable=False` because the plan is only to perform inference with this PEFT model. If you were preparing the model for further training, you would set `is_trainable=True`.

In [None]:
from peft import PeftModel, PeftConfig

peft_model_base = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base", torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")

peft_model = PeftModel.from_pretrained(peft_model_base,
                                       '/content/drive/My Drive/data/peft-bill_text-dialogue-summary-checkpoint-local',
                                       torch_dtype=torch.bfloat16,
                                       is_trainable=False)

The number of trainable parameters will be `0` due to the `is_trainable=False` setting:

In [None]:
print(summarize_model_parameters(peft_model))

Trainable Parameters: 0
Total Parameters: 251116800
Percentage of Trainable Parameters: 0.00%


<a name='3.3'></a>
### 3.3 - Evaluate the Model Qualitatively (Human Evaluation)

Run inference on the same example as in sections [1.3](#1.3) and [2.3](#2.3), using the original model, the fully fine-tuned model, and the PEFT model.

In [None]:
index = 100
bill_text = dataset['test'][index]['text']
human_baseline_summary = dataset['test'][index]['summary']

prompt = f"""
Summarize the following bill text:

{bill_text}

Summary:
"""

input_ids = tokenizer(prompt, return_tensors="pt").input_ids
input_ids = input_ids.to(original_model.device) # Move input_ids to the same device as the model
original_model_outputs = original_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)

input_ids = tokenizer(prompt, return_tensors="pt").input_ids
input_ids = input_ids.to(fine_tuned_model.device)
fine_tuned_model_outputs = fine_tuned_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
fine_tuned_model_text_output = tokenizer.decode(fine_tuned_model_outputs[0], skip_special_tokens=True)

input_ids = tokenizer(prompt, return_tensors="pt").input_ids
input_ids = input_ids.to(peft_model.device)
peft_model_outputs = peft_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
peft_model_text_output = tokenizer.decode(peft_model_outputs[0], skip_special_tokens=True)

print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{human_baseline_summary}')
print(dash_line)
print(f'ORIGINAL MODEL:\n{original_model_text_output}')
print(dash_line)
print(f'FINE TUNED MODEL:\n{fine_tuned_model_text_output}')
print(dash_line)
print(f'PEFT MODEL:\n{peft_model_text_output}')

Token indices sequence length is longer than the specified maximum sequence length for this model (1745 > 512). Running this sequence through the model will result in indexing errors


---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
Afghanistan and Central Asian Republics Sustainable Food Production Act of 2001 - Directs the Administrator of the United States Agency for International Development to provide financial assistance to nongovernmental organizations carrying out rural developmental activities in Afghanistan, Kyrgyzstan, Pakistan, Tajikistan, Turkmenistan, and Uzbekistan. Stipulates the aid shall be used for: (1) restocking seed; (2) replacing breeding livestock; (3) restoring basic irrigation systems; (4) providing access to credit for food production, processing or marketing enterprises through rural microenterprise loan programs; and (5) technical assistance. Places human rights and other conditions on the government of Afghanistan for projects to be funded in Afghanistan.
---------------------------------------------------------------------------------------------------
ORIGINAL 

<a name='3.4'></a>
### 3.4 - Evaluate the Model Quantitatively (with ROUGE Metric)
Perform inference on a sample of the test dataset (only 10 bill texts and summaries, to save time).

In [None]:
bill_texts = dataset['test'][0:10]['text']
human_baseline_summaries = dataset['test'][0:10]['summary']

original_model_summaries = []
fine_tuned_model_summaries = []
peft_model_summaries = []

for bill_text in bill_texts:
    prompt = f"""
Summarize the following bill text.

{bill_text}

Summary: """
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    input_ids = input_ids.to(original_model.device)

    original_model_outputs = original_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)
    original_model_summaries.append(original_model_text_output)


    input_ids = input_ids.to(fine_tuned_model.device)
    fine_tuned_model_outputs = fine_tuned_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    fine_tuned_model_text_output = tokenizer.decode(fine_tuned_model_outputs[0], skip_special_tokens=True)
    fine_tuned_model_summaries.append(fine_tuned_model_text_output)

    input_ids = input_ids.to(peft_model.device)
    peft_model_outputs = peft_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    peft_model_text_output = tokenizer.decode(peft_model_outputs[0], skip_special_tokens=True)
    peft_model_summaries.append(peft_model_text_output)

zipped_summaries = list(zip(human_baseline_summaries, original_model_summaries, fine_tuned_model_summaries, peft_model_summaries))

df2 = pd.DataFrame(zipped_summaries, columns = ['human_baseline_summaries', 'original_model_summaries', 'fine_tuned_model_summaries', 'peft_model_summaries'])
df2

|   | human_baseline_summaries | original_model_summaries | fine_tuned_model_summaries | peft_model_summaries |
|---|--------------------------|--------------------------|----------------------------|----------------------|
| 0 | Amends the Water Resources Development Act of ... | SEC. | A bill to amend title I of the Water Resources... | A bill to amend title I of Public Law 106-69 t... |
| 1 | Federal Forage Fee Act of 1993 - Subjects graz... | A bill to provide for the fee for the Federal ... | A bill to amend title 10, United States Code, ... | A bill to amend title 10, United States Code, ... |
| 2 | . Merchant Marine of World War II Congression... | A bill to provide for the award of a single go... | A bill to provide for the award of a gold meda... | A bill to provide for the award of a gold meda... |
| 3 | Small Business Modernization Act of 2004 - Ame... | SECTION 2. UNIFIED PASS-THRU ENTITY REGIME. | A bill to amend title 38, United States Code, ... | A bill to amend title 38, United States Code, ... |
| 4 | Fair Access to Investment Research Act of 2016... | This Act may be cited as the 'Fair Access to I... | A bill to amend title 17, United States Code, ... | A bill to amend title 17, United States Code, ... |
| 5 | Prescription Drug Monitoring Act of 2016 This ... | This Act provides for the establishment of a d... | A bill to provide for the establishment of a d... | A bill to provide for the establishment of a d... |
| 6 | Strategic Gasoline and Fuel Reserve Act of 200... | The Energy Policy and Conservation Act of 2005... | A bill to amend title I of the Energy Policy a... | A bill to amend title I of the Energy Policy a... |
| 7 | Special Agent Scott K. Carey Public Safety Off... | This Act may be cited as the Special Agent Sco... | A bill to amend title 1213 of the Omnibus Crim... | A bill to amend title 1213 of the Omnibus Crim... |
| 8 | Promoting Financial Literacy and Economic Oppo... | A bill to amend the Internal Revenue Code of 1... | A bill to amend chapter 45 of the Internal Rev... | This Act provides for the establishment of a c... |
| 9 | Amends the Tariff Act of 1930 to define "deliv... | SECTION 1. | A bill to amend section 801 of the Tariff Act ... | A bill to amend section 801 of the Tariff Act ... |


In [None]:
rouge = evaluate.load('rouge')

original_model_results = rouge.compute(
    predictions=original_model_summaries,
    references=human_baseline_summaries[0:len(original_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

fine_tuned_model_results = rouge.compute(
    predictions=fine_tuned_model_summaries,
    references=human_baseline_summaries[0:len(fine_tuned_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

peft_model_results = rouge.compute(
    predictions=peft_model_summaries,
    references=human_baseline_summaries[0:len(peft_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

print('ORIGINAL MODEL:')
print(original_model_results)
print('FINE TUNED MODEL:')
print(fine_tuned_model_results)
print('PEFT MODEL:')
print(peft_model_results)

ORIGINAL MODEL:
{'rouge1': 0.19449921018654298, 'rouge2': 0.0943300669240871, 'rougeL': 0.14341939534465825, 'rougeLsum': 0.15570374754374375}
FINE TUNED MODEL:
{'rouge1': 0.24380716305340505, 'rouge2': 0.0948423534248077, 'rougeL': 0.18826470804907652, 'rougeLsum': 0.19613854731762947}
PEFT MODEL:
{'rouge1': 0.25814830188158255, 'rouge2': 0.11581706462735133, 'rougeL': 0.1948476873453055, 'rougeLsum': 0.19889589027019156}
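
A direct comparison of the fully fine-tuned and PEFT scores printed above makes the gap between the two approaches explicit. The snippet below simply subtracts the reported values; keep in mind that both models were trained for very few steps and evaluated on only 10 examples, so small differences should not be over-interpreted.

```python
# Scores copied from the output above.
fine_tuned = {
    "rouge1": 0.24380716305340505,
    "rouge2": 0.0948423534248077,
    "rougeL": 0.18826470804907652,
    "rougeLsum": 0.19613854731762947,
}
peft = {
    "rouge1": 0.25814830188158255,
    "rouge2": 0.11581706462735133,
    "rougeL": 0.1948476873453055,
    "rougeLsum": 0.19889589027019156,
}

# Positive values mean the PEFT model scored higher than full fine-tuning.
peft_minus_full = {
    metric: (peft[metric] - fine_tuned[metric]) * 100 for metric in fine_tuned
}
for metric, diff in peft_minus_full.items():
    print(f"{metric}: {diff:+.2f} points")
```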


Notice that the PEFT model's results hold up well against full fine-tuning, even though its training process was much cheaper!

In [None]:
print("Absolute percentage improvement of PEFT MODEL over ORIGINAL MODEL")

# Calculate the percentage improvement for each metric
improvements = {
    key: (peft_model_results[key] - original_model_results[key]) * 100
    for key in peft_model_results.keys()
}

# Print the improvements for each metric
for metric, improvement in improvements.items():
    print(f"{metric}: {improvement:.2f}%")

Absolute percentage improvement of PEFT MODEL over ORIGINAL MODEL
rouge1: 6.36%
rouge2: 2.15%
rougeL: 5.14%
rougeLsum: 4.32%


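In this run, the PEFT model's ROUGE scores are on par with those of the fully fine-tuned model, and in fact marginally higher; with only 10 evaluation examples and a handful of training steps, differences of this size are not conclusive. More generally, PEFT tends to show a slight decrease in ROUGE compared to full fine-tuning, while its training process is significantly more resource-efficient, requiring much less computing power and memory, and is often achievable with just a single GPU.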