# Step-by-Step Guide to Apply LoRA Fine-Tuning to T5-Small Using the California State Bill Subset of the BillSum Dataset
This guide walks you through the process of fine-tuning the T5-small model with LoRA (Low-Rank Adaptation) on the California state bill subset of the BillSum dataset. The BillSum dataset provides summaries of U.S. Congressional and California state bills.

## Step 1: Setting Up the Environment

Make sure the required libraries are installed: Hugging Face's transformers, datasets, and peft, plus torch.

In [1]:
pip install transformers datasets torch peft



## Step 2: Selecting the Pretrained Model

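This guide fine-tunes T5-small, the smallest T5 checkpoint. Larger checkpoints such as T5-base or T5-large can be substituted for text summarization if you have the compute for them.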

In [2]:

from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer

In [3]:

model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

config.json:   0%|          | 0.00/1.21k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/242M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

## Step 3: Preparing the Dataset

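### Download the Dataset

Load the BillSum dataset from the Hugging Face Hub; any other legal document dataset that suits your needs would also work. Only the ca_test split, which holds the California state bills, is loaded here.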

In [4]:
from datasets import load_dataset

billsum = load_dataset("billsum", split="ca_test")

Downloading readme:   0%|          | 0.00/7.27k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/91.8M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/15.8M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/6.12M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/18949 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/3269 [00:00<?, ? examples/s]

Generating ca_test split:   0%|          | 0/1237 [00:00<?, ? examples/s]

In [5]:
billsum = billsum.train_test_split(test_size=0.2)

In [6]:
billsum

DatasetDict({
    train: Dataset({
        features: ['text', 'summary', 'title'],
        num_rows: 989
    })
    test: Dataset({
        features: ['text', 'summary', 'title'],
        num_rows: 248
    })
})
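The 989/248 row counts follow from the 1,237 examples in the ca_test split. A quick arithmetic check, assuming the test share is rounded up (which matches the counts shown above); the helper name `split_sizes` is only for illustration:

```python
import math
from typing import Tuple

def split_sizes(num_rows: int, test_size: float) -> Tuple[int, int]:
    """Return (train_rows, test_rows) for a fractional test_size, rounding the test share up."""
    test_rows = math.ceil(num_rows * test_size)
    return num_rows - test_rows, test_rows

train_rows, test_rows = split_sizes(1237, 0.2)
print(train_rows, test_rows)
```

The split is random, so the specific rows in each part change between runs. Passing a `seed` argument to `train_test_split` should make it reproducible.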

In [7]:
example = billsum["train"][0]


### Preprocess the Dataset
Convert the legal documents and their summaries into a format compatible with T5. See https://huggingface.co/docs/transformers/model_doc/t5#overview for details.

Load the T5-small tokenizer:

In [8]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")

tokenizer_config.json:   0%|          | 0.00/2.32k [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.39M [00:00<?, ?B/s]

In [9]:
pref_text = "summarize: " + example['text']

tokenized_text = tokenizer(pref_text)
tokenized_summary = tokenizer(example['summary'])

Token indices sequence length is longer than the specified maximum sequence length for this model (1776 > 512). Running this sequence through the model will result in indexing errors


In [10]:
tokenized_text = tokenizer(example['text'])
for key in tokenized_text:
    print(key)
    print(tokenized_text[key])

input_ids
[37, 151, 13, 8, 1015, 13, 1826, 103, 3, 35, 2708, 38, 6963, 10, 180, 3073, 9562, 1300, 5568, 3, 23714, 13, 8, 4511, 138, 3636, 19, 21012, 12, 608, 10, 3, 23714, 5, 41, 9, 61, 5637, 2372, 13, 8, 6775, 3028, 16, 8986, 7, 5637, 12, 3, 18669, 6, 13066, 6, 13, 27444, 41, 115, 61, 19, 46, 16, 22513, 24584, 179, 57, 3, 9, 1399, 59, 12, 8193, 192, 6189, 18358, 3740, 8785, 11434, 61, 11, 57, 573, 313, 21, 3, 9, 792, 97, 59, 12, 8193, 4678, 716, 147, 3, 9, 1059, 59, 12, 8193, 604, 477, 6, 383, 3, 9, 97, 119, 145, 383, 8, 27857, 127, 22, 7, 716, 13, 496, 11364, 42, 4311, 5, 20537, 38, 937, 16, 27444, 41, 122, 201, 136, 13, 8, 6775, 3028, 16, 8986, 7, 5637, 12, 10153, 6, 13066, 6, 13, 27444, 41, 75, 201, 1286, 3, 9, 166, 42, 511, 12374, 6, 19, 46, 16, 22513, 24584, 179, 57, 3, 9, 1399, 59, 12, 8193, 192, 6189, 18358, 3740, 8785, 11434, 61, 11, 57, 573, 313, 21, 3, 9, 792, 97, 59, 12, 8193, 4678, 716, 147, 3, 9, 1059, 59, 12, 8193, 604, 477, 6, 383, 3, 9, 97, 119, 145, 383, 8, 27857, 127

In [11]:
def preprocess_function(examples):
    # Add "summarize: " at the beginning of each document in the 'text' field.
    # This tells the T5 model that the task is to generate a summary.
    inputs = ["summarize: " + doc for doc in examples["text"]]

    # Tokenize the modified input texts so they can be processed by the T5 model.
    # Limit the token count to 1024, cutting off any extra text.
    model_inputs = tokenizer(inputs, max_length=1024, truncation=True)

    # Tokenize the summaries provided in the 'summary' field to create the target outputs.
    # Limit these to 128 tokens, truncating any overflow.
    labels = tokenizer(text_target=examples["summary"], max_length=128, truncation=True)

    # Attach the tokenized summaries to the 'labels' field of the model inputs.
    # These labels are what the model will learn to generate during training.
    model_inputs["labels"] = labels["input_ids"]

    # Return the processed inputs and labels in a format that the model can use for training.
    return model_inputs
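To see what preprocess_function does to a single record without downloading anything, here is a toy sketch. The whitespace tokenizer `toy_tokenize` is a hypothetical stand-in for the T5 tokenizer (real SentencePiece token IDs look different), and the tiny length limits are chosen only to make truncation visible. Only the prefixing, truncation, and label handling mirror the function above.

```python
from typing import Dict, List

def toy_tokenize(text: str, max_length: int) -> List[str]:
    """Hypothetical stand-in tokenizer: split on whitespace, then truncate."""
    return text.split()[:max_length]

def toy_preprocess(examples: Dict[str, List[str]],
                   max_input: int = 8,
                   max_target: int = 4) -> Dict[str, List[List[str]]]:
    # Prepend the task prefix to every document, as preprocess_function does.
    inputs = ["summarize: " + doc for doc in examples["text"]]
    model_inputs = {"input_ids": [toy_tokenize(doc, max_input) for doc in inputs]}
    # Summaries are tokenized separately and stored under "labels".
    model_inputs["labels"] = [toy_tokenize(s, max_target) for s in examples["summary"]]
    return model_inputs

batch = {
    "text": ["The people of the State of California do enact as follows: Section 1 is amended."],
    "summary": ["This bill would amend Section 1."],
}
out = toy_preprocess(batch)
print(out["input_ids"][0])  # prefix plus the first seven words of the document
print(out["labels"][0])     # first four words of the summary
```

Because truncation keeps only the start of each document, anything beyond the input limit never reaches the model, which matters for bills as long as the 1,776-token example above.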


In [12]:
tokenized_billsum = billsum.map(preprocess_function, batched=True)

Map:   0%|          | 0/989 [00:00<?, ? examples/s]

Map:   0%|          | 0/248 [00:00<?, ? examples/s]

In [13]:
tokenized_billsum

DatasetDict({
    train: Dataset({
        features: ['text', 'summary', 'title', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 989
    })
    test: Dataset({
        features: ['text', 'summary', 'title', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 248
    })
})

In [14]:
tokenized_billsum['train'][0]['text']



In [15]:
tokenized_billsum['test'][0]['summary']

'The Cortese-Knox-Hertzberg Local Government Reorganization Act of 2000 governs the procedures for the formation and change of organization of cities and special districts. Existing law permits a city or district to provide extended services, as defined, outside its jurisdictional boundaries only if it first requests and receives written approval from the local agency formation commission in the affected county. Under existing law, the commission may authorize a city or district to provide new or extended services outside both its jurisdictional boundaries and its sphere of influence under specified circumstances, including when responding to an impending threat to the public health or safety of the residents in the affected territory where specified requirements are met.\nThis bill would revise the circumstances under which the commission may authorize a city or district to provide new or extended services. This bill would additionally establish a pilot program, until January 1, 2021,

We create a **data collator** with DataCollatorForSeq2Seq. It pads the inputs and labels dynamically to the longest sequence in each batch during training and evaluation, rather than padding the whole dataset to one fixed length, so every sequence in a batch has the same length without wasting computation on unnecessary padding. Inputs are padded with the tokenizer's padding token, while labels are padded with -100 so that padded positions are ignored by the loss. Passing the model object (rather than its name) also lets the collator prepare the decoder input IDs from the labels.

In [16]:

from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)
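Dynamic padding can be sketched in a few lines of plain Python. This is a simplified illustration, not the library's implementation: `pad_batch` is a hypothetical helper, and the token IDs in the example are made up. It assumes T5's padding ID of 0 and the conventional -100 label padding that the loss ignores.

```python
from typing import Dict, List

PAD_TOKEN_ID = 0      # T5's padding token ID
LABEL_PAD_ID = -100   # label positions with this value are ignored by the loss

def pad_batch(features: List[Dict[str, List[int]]]) -> Dict[str, List[List[int]]]:
    """Pad inputs and labels to the longest sequence in this batch only."""
    max_in = max(len(f["input_ids"]) for f in features)
    max_lab = max(len(f["labels"]) for f in features)
    batch: Dict[str, List[List[int]]] = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_in = max_in - len(f["input_ids"])
        n_lab = max_lab - len(f["labels"])
        batch["input_ids"].append(f["input_ids"] + [PAD_TOKEN_ID] * n_in)
        # The attention mask marks real tokens with 1 and padding with 0.
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * n_in)
        batch["labels"].append(f["labels"] + [LABEL_PAD_ID] * n_lab)
    return batch

features = [
    {"input_ids": [37, 151, 13, 1], "labels": [100, 1]},
    {"input_ids": [37, 1], "labels": [100, 200, 300, 1]},
]
padded = pad_batch(features)
print(padded["input_ids"])
print(padded["labels"])
```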

## Step 4: Setting Up Evaluation Metrics for Training

Load the Recall-Oriented Understudy for Gisting Evaluation (ROUGE) metric using the evaluate library.
ROUGE is commonly used to evaluate the quality of text summaries by comparing them to reference summaries.
This metric will help you measure how well your model-generated summaries match the expected outputs.
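To make the idea concrete, here is a simplified sketch of ROUGE-1 as a unigram-overlap F1 score. It is an illustration only: the function name `rouge1_f1` and the example sentences are made up, and the real rouge_score package adds its own tokenization, optional stemming, and the ROUGE-2 and ROUGE-L variants.

```python
from collections import Counter

def rouge1_f1(prediction: str, reference: str) -> float:
    """Unigram-overlap F1 between a predicted and a reference summary (no stemming)."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Each shared word counts at most as often as it appears in both texts.
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

score = rouge1_f1("the bill amends the penal code",
                  "this bill would amend the penal code")
print(round(score, 4))
```

In this example "amends" and "amend" do not match because the sketch compares exact words; stemming, as enabled by use_stemmer=True in the guide's metric function, is meant to close that gap.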

In [17]:
! pip install -q evaluate rouge_score



In [18]:

import evaluate

rouge = evaluate.load("rouge")

Downloading builder script:   0%|          | 0.00/6.27k [00:00<?, ?B/s]

In [19]:
import numpy as np

def compute_metrics(eval_pred):
    # Split the evaluation predictions into two parts: predicted tokens and true labels.
    predictions, labels = eval_pred

    # Convert the predicted token IDs back into text, ignoring any special tokens like padding.
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)

    # Replace all instances of -100 in the labels with the pad_token_id.
    # The -100 value is often used to mask out tokens that should be ignored during training.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)

    # Convert the true label token IDs back into text, ignoring any special tokens like padding.
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Calculate the ROUGE score by comparing the predicted texts with the true labels.
    # The use_stemmer option reduces words to their base forms before comparison for more lenient matching.
    result = rouge.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)

    # Determine the length of each prediction by counting the number of non-padding tokens.
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]

    # Compute the average length of the predictions and store it in the results under the key "gen_len".
    result["gen_len"] = np.mean(prediction_lens)

    # Round all the results to 4 decimal places for easier reading, and return the final metrics.
    return {k: round(v, 4) for k, v in result.items()}


## Step 5: Implementing LoRA for Fine-Tuning

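LoRA freezes the pretrained weights and trains small low-rank matrices added to selected layers, so only a small fraction of the parameters is updated. This makes fine-tuning cheaper in memory and produces a compact adapter instead of a full model copy.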

In [20]:
from peft import get_peft_model, LoraConfig, TaskType

# Set up the configuration for LoRA (Low-Rank Adaptation) to fine-tune the model efficiently.
# Specify that the task is sequence-to-sequence language modeling (e.g., text generation).
# Disable inference mode to allow training, and set key parameters:
# - r: The rank of the low-rank adaptation (controls the adaptation capacity).
# - lora_alpha: A scaling factor for the LoRA weights.
# - lora_dropout: Dropout rate applied to LoRA layers to prevent overfitting.
lora_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    inference_mode=False,
    r=8,
    lora_alpha=32,
    lora_dropout=0.1,
)

# Apply the LoRA configuration to the model, adapting it for more efficient fine-tuning.
model = get_peft_model(model, lora_config)
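The effect of r and lora_alpha can be illustrated with a toy forward pass. This is a minimal sketch under stated assumptions, not PEFT's implementation: the helper names and the 3-dimensional matrices are invented for the example, dropout is omitted, and the update is scaled by lora_alpha / r as described in the LoRA paper. The toy values r=2 and lora_alpha=8 give the same scale (4) as the guide's r=8 and lora_alpha=32.

```python
from typing import List

Matrix = List[List[float]]

def matvec(m: Matrix, v: List[float]) -> List[float]:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def lora_forward(W: Matrix, A: Matrix, B: Matrix, x: List[float],
                 r: int, lora_alpha: float) -> List[float]:
    """Compute W x + (lora_alpha / r) * B (A x). W is frozen; only A and B would be trained."""
    scale = lora_alpha / r
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + scale * u for b, u in zip(base, update)]

W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # frozen pretrained weight
A = [[0.5, 0.0, 0.0], [0.0, 0.5, 0.0]]                   # r x d_in
B_zero = [[0.0, 0.0] for _ in range(3)]                  # d_out x r, zero at the start
x = [1.0, 2.0, 3.0]

# With B at zero the adapter contributes nothing, so the output equals W x.
y_init = lora_forward(W, A, B_zero, x, r=2, lora_alpha=8)

# After training, B is non-zero and the low-rank update shifts the output.
B_trained = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
y_trained = lora_forward(W, A, B_trained, x, r=2, lora_alpha=8)
print(y_init, y_trained)
```

Starting B at zero means the adapted model initially behaves exactly like the pretrained one, and training moves it away from that point only through the two small matrices.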


In [24]:
model.print_trainable_parameters()

trainable params: 294,912 || all params: 60,801,536 || trainable%: 0.4850
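The trainable-parameter count can be reproduced by hand. The sketch below assumes that LoRA is applied to the query and value projections of every attention block (which appears to be PEFT's default for T5 when no target modules are given) and uses T5-small's hidden size of 512 with 6 encoder and 6 decoder layers. The arithmetic matches the figure printed above, which supports that assumption.

```python
D_MODEL = 512               # T5-small hidden size (projections map 512 -> 512)
RANK = 8                    # r in LoraConfig
ENCODER_LAYERS = 6
DECODER_LAYERS = 6
TARGETS_PER_ATTENTION = 2   # assumed: LoRA on the q and v projections

# Each encoder layer has self-attention; each decoder layer has self- and cross-attention.
attention_blocks = ENCODER_LAYERS + 2 * DECODER_LAYERS

# One adapted projection adds A (r x d_model) and B (d_model x r).
params_per_projection = 2 * RANK * D_MODEL

trainable = attention_blocks * TARGETS_PER_ATTENTION * params_per_projection
total = 60_801_536          # "all params" from the output above
print(trainable, round(100 * trainable / total, 4))
```

Doubling r would double the number of trainable parameters, since each adapted projection contributes 2 × r × 512 of them.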


In [21]:

training_args = Seq2SeqTrainingArguments(
    # Specify the directory where the fine-tuned model and other outputs will be saved.
    output_dir="fine_tuned_t5_small_model_California_state_bill_lora",

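    # Set the evaluation strategy to run after each training epoch, allowing for periodic checks on model performance.
    # Note: recent transformers releases rename this argument to eval_strategy; use that name if this one is rejected.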
    evaluation_strategy="epoch",

    # Define the learning rate for the optimizer, controlling how much to adjust the model's weights during training.
    learning_rate=2e-5,

    # Set the batch size for training and evaluation, indicating how many samples are processed at a time on each device (e.g., GPU).
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,

    # Apply weight decay to the model's parameters to help prevent overfitting by reducing the magnitude of the weights over time.
    weight_decay=0.01,

    # Limit the number of saved checkpoints to the most recent three, managing disk space by discarding older ones.
    save_total_limit=3,

    # Set the number of complete passes through the training dataset.
    num_train_epochs=4,

    # Enable the generation of predictions during evaluation, useful for tasks like text generation or summarization.
    predict_with_generate=True,

    # Enable 16-bit floating point precision (FP16) to reduce memory usage and speed up training, especially on compatible GPUs.
    fp16=True,
)





In [22]:
trainer = Seq2SeqTrainer(
    # Specify the model to be fine-tuned. This model will learn to perform the sequence-to-sequence task (e.g., summarization).
    model=model,

    # Pass the training arguments that define how the training process should be conducted (e.g., learning rate, batch size).
    args=training_args,

    # Provide the training dataset, which has been tokenized and prepared for training.
    train_dataset=tokenized_billsum["train"],

    # Provide the evaluation dataset, also tokenized, which will be used to assess the model’s performance after each epoch.
    eval_dataset=tokenized_billsum["test"],

    # Use the tokenizer for converting between text and token IDs during both training and evaluation.
    tokenizer=tokenizer,

    # Use the data collator to dynamically pad inputs and labels so that all sequences in a batch have the same length.
    data_collator=data_collator,

    # Define the function to compute evaluation metrics, which will be used to gauge the model's performance.
    compute_metrics=compute_metrics,
)


In [23]:
trainer.train()

| Epoch | Training Loss | Validation Loss | Rouge1 | Rouge2 | Rougel | Rougelsum | Gen Len |
|---|---|---|---|---|---|---|---|
| 1 | No log | 4.620221 | 0.1422 | 0.0464 | 0.1182 | 0.118 | 19.0 |
| 2 | No log | 4.361811 | 0.1409 | 0.0465 | 0.1176 | 0.1174 | 19.0 |
| 3 | No log | 4.146574 | 0.1401 | 0.0466 | 0.1173 | 0.117 | 19.0 |
| 4 | No log | 4.064441 | 0.1403 | 0.0473 | 0.1177 | 0.1177 | 19.0 |




TrainOutput(global_step=248, training_loss=4.5214179254347275, metrics={'train_runtime': 76.9324, 'train_samples_per_second': 51.422, 'train_steps_per_second': 3.224, 'total_flos': 1077992365228032.0, 'train_loss': 4.5214179254347275, 'epoch': 4.0})
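The global_step value in the training output can be checked with simple arithmetic, assuming a single device and no gradient accumulation (neither is set in the training arguments shown earlier):

```python
import math

train_rows = 989        # size of the training split
per_device_batch = 16   # per_device_train_batch_size
epochs = 4              # num_train_epochs

# The last batch of each epoch is smaller, so the step count is rounded up.
steps_per_epoch = math.ceil(train_rows / per_device_batch)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)
```

This also suggests why the Training Loss column reads "No log": if the logging interval is left at its default of 500 steps, a 248-step run finishes before the first training-loss entry is written.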

## Step 6: Evaluate the Fine-Tuned Model

In [25]:
metrics = trainer.evaluate()
print(metrics)


{'eval_loss': 4.064441204071045, 'eval_rouge1': 0.1403, 'eval_rouge2': 0.0473, 'eval_rougeL': 0.1177, 'eval_rougeLsum': 0.1177, 'eval_gen_len': 19.0, 'eval_runtime': 7.9875, 'eval_samples_per_second': 31.048, 'eval_steps_per_second': 2.003, 'epoch': 4.0}
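A constant eval_gen_len of 19.0 is worth a second look. One plausible reading, offered here as an assumption rather than something the logs confirm, is that generation ran with a default maximum length of 20 tokens: the sequence starts with the decoder start token, which for T5 shares ID 0 with padding, so the metric function counts 19 non-padding tokens. The sketch below reproduces that count on a made-up sequence.

```python
from typing import List

PAD_TOKEN_ID = 0  # T5 uses ID 0 for padding and as the decoder start token

def generated_length(token_ids: List[int]) -> int:
    """Count non-padding tokens, as compute_metrics does for gen_len."""
    return sum(1 for token_id in token_ids if token_id != PAD_TOKEN_ID)

# Hypothetical output of length 20: the decoder start token followed by 19 generated IDs.
sequence = [PAD_TOKEN_ID] + list(range(100, 119))
print(len(sequence), generated_length(sequence))
```

If that reading is right, every summary is cut off after about 19 tokens while the references run up to 128, which would hold the ROUGE scores down. Setting generation_max_length in Seq2SeqTrainingArguments (for example to 128, to match the label length) is one way to test this.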


## Step 7: Save the LoRA Adapter and Push It to the Hugging Face Hub

In [31]:
# Save the fine-tuned LoRA adapter and the tokenizer locally.
model.save_pretrained("./fine_tuned_model/tuned_model")
tokenizer.save_pretrained("./fine_tuned_model/tuned_model")

('./fine_tuned_model/tuned_model/tokenizer_config.json',
 './fine_tuned_model/tuned_model/special_tokens_map.json',
 './fine_tuned_model/tuned_model/spiece.model',
 './fine_tuned_model/tuned_model/added_tokens.json',
 './fine_tuned_model/tuned_model/tokenizer.json')

In [29]:
from huggingface_hub import notebook_login

notebook_login()


VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

In [33]:
from transformers import T5ForConditionalGeneration, T5Tokenizer

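# Load the saved adapter and tokenizer.
# The directory holds only the LoRA adapter, not a full model. With peft installed,
# from_pretrained appears to detect the adapter config, load the base model it names,
# and attach the adapter (the upload log below lists adapter_model.safetensors).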
model = T5ForConditionalGeneration.from_pretrained("./fine_tuned_model/tuned_model")
tokenizer = T5Tokenizer.from_pretrained("./fine_tuned_model/tuned_model")

# Push the model and tokenizer to Hugging Face Hub
model.push_to_hub("lora-adapter-t5_small_model_California_state_bill")
tokenizer.push_to_hub("lora-adapter-t5_small_model_California_state_bill")


You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


adapter_model.safetensors:   0%|          | 0.00/1.19M [00:00<?, ?B/s]

README.md:   0%|          | 0.00/5.17k [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/Fafadalilian/lora-adapter-t5_small_model_California_state_bill/commit/b572804520d8bf467f9d255928b3275caf79e4f1', commit_message='Upload tokenizer', commit_description='', oid='b572804520d8bf467f9d255928b3275caf79e4f1', pr_url=None, pr_revision=None, pr_num=None)