# Low-Rank Adaptation

This notebook explores Low-Rank Adaptation (LoRA), a method for fine-tuning LLMs. LoRA is a parameter-efficient fine-tuning (PEFT) method that drastically reduces the number of trainable parameters compared to full fine-tuning. Typically the number of tuned parameters is 10% or less of the total number of parameters in the LLM; in the experiments below it is under 1%. LoRA provides some other advantages:
* LoRA fine-tuned models do not incur additional inference latency once the adapter weights are merged into the base weights.
* The original model weights are left unchanged, so multiple LoRA adapters can be trained for different tasks.
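
For reference, the LoRA update can be written as below. The notation follows the original LoRA paper and is not defined elsewhere in this notebook, so treat the symbols as assumptions: $W_0$ is a frozen pre-trained weight matrix, $A$ and $B$ are the trainable low-rank factors, $r$ is the rank, and $\alpha$ is the scaling factor (`lora_alpha` in the PEFT configuration).

$$
h = W_0 x + \frac{\alpha}{r} B A x, \qquad W_0 \in \mathbb{R}^{d \times k},\; B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times k},\; r \ll \min(d, k)
$$

Only $A$ and $B$ are trained, so each adapted matrix contributes $r(d + k)$ trainable parameters instead of $dk$. After training, the product can be folded into a single matrix, $W = W_0 + \frac{\alpha}{r} B A$, which is why a merged adapter adds no inference latency.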

## Setup

In [1]:
%pip install \
    torch \
    transformers \
    datasets \
    evaluate \
    rouge_score \
    loralib \
    peft --quiet

Note: you may need to restart the kernel to use updated packages.


The next two cells work around issues with the Kaggle environment.

In [2]:
%pip install -U datasets

Collecting datasets
  Obtaining dependency information for datasets from https://files.pythonhosted.org/packages/ec/93/454ada0d1b289a0f4a86ac88dbdeab54921becabac45da3da787d136628f/datasets-2.16.1-py3-none-any.whl.metadata
  Downloading datasets-2.16.1-py3-none-any.whl.metadata (20 kB)
Collecting pyarrow-hotfix (from datasets)
  Obtaining dependency information for pyarrow-hotfix from https://files.pythonhosted.org/packages/e4/f4/9ec2222f5f5f8ea04f66f184caafd991a39c8782e31f5b0266f101cb68ca/pyarrow_hotfix-0.6-py3-none-any.whl.metadata
  Downloading pyarrow_hotfix-0.6-py3-none-any.whl.metadata (3.6 kB)
Collecting fsspec[http]<=2023.10.0,>=2023.1.0 (from datasets)
  Obtaining dependency information for fsspec[http]<=2023.10.0,>=2023.1.0 from https://files.pythonhosted.org/packages/e8/f6/3eccfb530aac90ad1301c582da228e4763f19e719ac8200752a4841b0b2d/fsspec-2023.10.0-py3-none-any.whl.metadata
  Downloading fsspec-2023.10.0-py3-none-any.whl.metadata (6.8 kB)
Downloading datasets-2.16.1-py3-none

In [3]:
!wandb offline

W&B offline. Running your script from this directory will only write metadata locally. Use wandb disabled to completely turn off W&B.


In [4]:
import time

import evaluate
import pandas as pd
import torch

from datasets import load_dataset, DatasetDict
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, GenerationConfig, TrainingArguments, Trainer



## Dataset and problem setup

We will use the [databricks/databricks-dolly-15k](https://huggingface.co/datasets/databricks/databricks-dolly-15k) dataset from Databricks. It is an open source dataset containing records from various categories, including brainstorming, classification, closed QA, generation, information extraction, open QA, and summarization. We are interested in the 'closed QA' portion of the dataset. Briefly, closed QA involves answering a question using information given in a passage of text. Let us look at some sample records.

## Dataset

In [9]:
DATASET_NAME = 'databricks/databricks-dolly-15k'
RNG_SEED=10

original_dataset = load_dataset(DATASET_NAME, split='train')
dataset = original_dataset.shuffle(seed=RNG_SEED).filter(lambda example: example['category']=='closed_qa')

for i in range(3):
    print(''.join(['-'] * 80))
    print('CONTEXT:')
    print(dataset[i]['context'])
    print('\n\n')
    print('INSTRUCTION:')
    print(dataset[i]['instruction'])
    print('\n\n')
    print('RESPONSE:')
    print(dataset[i]['response'])

--------------------------------------------------------------------------------
CONTEXT:
Woodstock Music and Art Fair, commonly referred to as Woodstock, was a music festival held during August 15–18, 1969, on Max Yasgur's dairy farm in Bethel, New York, United States, 40 miles (65 km) southwest of the town of Woodstock. Billed as "an Aquarian Exposition: 3 Days of Peace & Music" and alternatively referred to as the Woodstock Rock Festival, it attracted an audience of more than 400,000 attendees. Thirty-two acts performed outdoors despite sporadic rain. It was one of the largest music festivals held in history.



INSTRUCTION:
Did the Grateful Dead play at the original Woodstock concert?



RESPONSE:
Yes, the Grateful Dead played a 1 hour and 35 minute set on Saturday, August 16 1969, that ended after a fifty-minute version of "Turn On Your Love Light".
--------------------------------------------------------------------------------
CONTEXT:
In the United States, a 401(k) plan is an e

In [8]:
# create a 60-20-20 train/test/validation split of the dataset
train_testvalid = dataset.train_test_split(test_size=0.4)
test_valid = train_testvalid["test"].train_test_split(test_size=0.5)
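
The two-stage split above can be sanity-checked with a little arithmetic. The sketch below is only illustrative: `split_sizes` is a hypothetical helper, the row count of 1000 is made up, and rounding the held-out part up is an assumption about how the library rounds. Later cells use `test_valid['train']` as the validation split and `test_valid['test']` as the test split.

```python
import math

def split_sizes(n_rows, first_test_size=0.4, second_test_size=0.5):
    """Return (train, validation, test) row counts for the two-stage split."""
    # Stage 1: hold out `first_test_size` of the rows (assumed to round up).
    held_out = math.ceil(first_test_size * n_rows)
    n_train = n_rows - held_out
    # Stage 2: split the held-out rows into validation and test halves.
    n_test = math.ceil(second_test_size * held_out)
    n_valid = held_out - n_test
    return n_train, n_valid, n_test

# Hypothetical dataset of 1000 rows: a 60-20-20 split.
print(split_sizes(1000))
```

Note that the split in the cell above is not seeded, so the exact rows in each split change between runs.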

## Model

We will use the [google/flan-t5-small](https://huggingface.co/google/flan-t5-small) model. This is a small model with only 77M parameters which can perform summarization, question answering, sentence completion, word sense disambiguation and other tasks. We will first evaluate the pre-trained model on the closed QA task and then try to improve its performance using LoRA fine-tuning.

In [13]:
MODEL_NAME = 'google/flan-t5-small'
DEVICE = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
DTYPE = torch.bfloat16
pre_trained_model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME, torch_dtype=DTYPE).to(DEVICE)
print(f'Pre trained model tensor datatype: {pre_trained_model.dtype}')
print(f'Pre trained model device: {pre_trained_model.device}')

Pre trained model tensor datatype: torch.bfloat16
Pre trained model device: cuda:0


## Tokenizer

Before we can inspect how the model performs, we need to load the tokenizer that matches the model.

In [20]:
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

prompt_template = """{context}

Given the above passage, answer the following question:

{instruction}
"""

for i in range(3):
    print(''.join(['-'] * 80))
    print('CONTEXT:')
    print(dataset[i]['context'])
    print('\n\n')
    print('INSTRUCTION:')
    print(dataset[i]['instruction'])
    print('\n\n')
    print('RESPONSE:')
    print(dataset[i]['response'])
    print('\n\n')
    print('FLAN-T5-RESPONSE:')
    inputs = tokenizer(
        prompt_template.format(
            context=dataset[i]['context'],
            instruction=dataset[i]['instruction'],
        ),
        return_tensors='pt',
        truncation=True,
        max_length=512
    ).to(DEVICE)
    output = tokenizer.decode(
        pre_trained_model.generate(
            inputs["input_ids"], 
            max_new_tokens=200,
        )[0], 
        skip_special_tokens=True
    )
    print(output)

--------------------------------------------------------------------------------
CONTEXT:
Woodstock Music and Art Fair, commonly referred to as Woodstock, was a music festival held during August 15–18, 1969, on Max Yasgur's dairy farm in Bethel, New York, United States, 40 miles (65 km) southwest of the town of Woodstock. Billed as "an Aquarian Exposition: 3 Days of Peace & Music" and alternatively referred to as the Woodstock Rock Festival, it attracted an audience of more than 400,000 attendees. Thirty-two acts performed outdoors despite sporadic rain. It was one of the largest music festivals held in history.



INSTRUCTION:
Did the Grateful Dead play at the original Woodstock concert?



RESPONSE:
Yes, the Grateful Dead played a 1 hour and 35 minute set on Saturday, August 16 1969, that ended after a fifty-minute version of "Turn On Your Love Light".



FLAN-T5-RESPONSE:
yes
--------------------------------------------------------------------------------
CONTEXT:
In the United Stat

As we can see, the pre-trained model gives generally correct but terse responses. On a metric like ROUGE, such responses score poorly because of low recall.
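
To see why a terse answer is penalized, here is a minimal sketch of a simplified ROUGE-1 score. It is an illustration only: `rouge1_scores` is a hypothetical helper that splits on whitespace, whereas the `rouge` metric used below also normalizes punctuation and can apply stemming. The reference string is a shortened, made-up version of the Woodstock answer.

```python
from collections import Counter

def rouge1_scores(prediction, reference):
    """Unigram-overlap precision, recall and F1 (a simplified ROUGE-1)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Clipped overlap: each reference token can be matched at most once.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    precision = overlap / len(pred_tokens) if pred_tokens else 0.0
    recall = overlap / len(ref_tokens) if ref_tokens else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1

reference = "yes the grateful dead played a set on saturday"  # 9 tokens
terse = "yes"

# Precision is perfect, but recall is only 1/9, which drags F1 down to about 0.2.
print(rouge1_scores(terse, reference))
```

A one-word answer matches the reference exactly on the word it contains, yet covers only a small fraction of the reference, so the F-measure stays low.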

## Performance of pre-trained model

Next we evaluate the pre-trained model on the 'test' portion of the dataset and compute the 'rouge' evaluation metric.

In [35]:
contexts = test_valid['test']['context']
instructions = test_valid['test']['instruction']
responses = test_valid['test']['response']

pretrained_model_responses = []
start_time = time.time()
for context, instruction in zip(contexts, instructions):
    input_ids = tokenizer(
        prompt_template.format(context=context, instruction=instruction),
        return_tensors='pt',
        truncation=True,
        max_length=512,
    ).to(DEVICE).input_ids
    pretrained_model_output = pre_trained_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    pretrained_model_response = tokenizer.decode(pretrained_model_output[0], skip_special_tokens=True)
    pretrained_model_responses.append(pretrained_model_response)
end_time = time.time()
print(f"Time taken for inference on {test_valid['test'].num_rows} examples: {end_time - start_time}")
zipped_responses = list(zip(responses, pretrained_model_responses))
df = pd.DataFrame(zipped_responses, columns=['reference_response', 'pretrained_model_response'])

Time taken for inference on 355 examples: 41.84682536125183


In [36]:
rouge = evaluate.load('rouge')

In [37]:
pretrained_model_results = rouge.compute(
    predictions=pretrained_model_responses,
    references=responses,
    use_aggregator=True,
    use_stemmer=True,
)
print(pretrained_model_results)

{'rouge1': 0.22349003544929158, 'rouge2': 0.11017409587228838, 'rougeL': 0.21701084748381727, 'rougeLsum': 0.2171873334906056}


As discussed earlier, the pre-trained model scores poorly (ROUGE-1 of about 0.22) because it outputs extremely brief responses.

## Fine tuning the model

Next we will use LoRA to fine-tune the model on the training split for a range of ranks, then compare the resulting adapters on the validation split to see whether the ROUGE score improves. First we will tokenize the dataset to speed up training.

In [33]:
def tokenize_function(example):
    prompt = [
        prompt_template.format(context=context, instruction=instruction) 
        for context, instruction in zip(example["context"], example["instruction"])
    ]
    example['input_ids'] = tokenizer(
        prompt, 
        padding="max_length", 
        truncation=True, 
        max_length=512, 
        return_tensors="pt"
    ).to(DEVICE).input_ids
    example['labels'] = tokenizer(
        example["response"], 
        padding="max_length", 
        truncation=True, 
        max_length=512, 
        return_tensors="pt"
    ).to(DEVICE).input_ids
    
    return example

In [34]:
tokenized_train_testvalid = train_testvalid.map(tokenize_function, batched=True)
tokenized_test_valid = test_valid.map(tokenize_function, batched=True)
tokenized_train_testvalid = tokenized_train_testvalid.remove_columns(['instruction', 'context', 'response', 'category',])
tokenized_test_valid = tokenized_test_valid.remove_columns(['instruction', 'context', 'response', 'category',])

Map:   0%|          | 0/710 [00:00<?, ? examples/s]
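
The tokenization above pads the labels to `max_length` and keeps the pad token ids in `labels`, so padded positions contribute to the training loss. A common refinement, not applied in this notebook, is to replace pad ids in the labels with -100, the value that PyTorch's cross-entropy loss ignores by default. The sketch below illustrates the idea on plain lists; `mask_padding` and the sample ids are hypothetical.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default

def mask_padding(label_ids, pad_token_id):
    """Replace padding ids in each label sequence with IGNORE_INDEX."""
    return [
        [IGNORE_INDEX if token_id == pad_token_id else token_id for token_id in seq]
        for seq in label_ids
    ]

# Hypothetical padded label ids; for T5 the pad token id is 0 and end-of-sequence is 1.
labels = [[4273, 1, 0, 0], [150, 27, 1, 0]]
print(mask_padding(labels, pad_token_id=0))
```

Applying such a step would change the reported training losses, so the numbers below would not be directly comparable.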

In [39]:
def print_number_of_trainable_model_parameters(model):
    trainable_model_params = 0
    all_model_params = 0
    for _, param in model.named_parameters():
        all_model_params += param.numel()
        if param.requires_grad:
            trainable_model_params += param.numel()
    return f"trainable model parameters: {trainable_model_params}\nall model parameters: {all_model_params}\npercentage of trainable model parameters: {100 * trainable_model_params / all_model_params:.2f}%"


In [40]:
from peft import LoraConfig, get_peft_model, TaskType

ranks_to_try = [1, 2, 4, 8]
peft_model_paths = []

for rank in ranks_to_try:
    print(''.join(['-'] * 80))
    print(f'Rank: {rank}')
    lora_config = LoraConfig(
        r=rank, # Rank
        lora_alpha=32,
        target_modules=["q", "v"],
        lora_dropout=0.05,
        bias="none",
        task_type=TaskType.SEQ_2_SEQ_LM # FLAN-T5
    )
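    # Load a fresh base model for each rank; get_peft_model injects the adapter into the model it is given.
    base_model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME, torch_dtype=DTYPE).to(DEVICE)
    peft_model = get_peft_model(base_model, lora_config)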
    print(print_number_of_trainable_model_parameters(peft_model))
    peft_training_args = TrainingArguments(
        output_dir=f'./peft-training-{rank}-{str(int(time.time()))}',
        auto_find_batch_size=True,
        learning_rate=1e-3, # Higher learning rate than full fine-tuning.
        num_train_epochs=10,
        logging_steps=250,
    )
    peft_trainer = Trainer(
        model=peft_model,
        args=peft_training_args,
        train_dataset=tokenized_train_testvalid["train"],
    )
    peft_trainer.train()
    peft_model_path=f"./peft-checkpoint-{rank}-{str(int(time.time()))}"
    peft_trainer.model.save_pretrained(peft_model_path)
    tokenizer.save_pretrained(peft_model_path)
    peft_model_paths.append(peft_model_path)

--------------------------------------------------------------------------------
Rank: 1
trainable model parameters: 43008
all model parameters: 77004160
percentage of trainable model parameters: 0.06%


| Step | Training Loss |
|---|---|
| 250 | 5.5616 |
| 500 | 1.9946 |
| 750 | 1.9329 |
| 1000 | 1.9127 |
| 1250 | 1.9161 |


--------------------------------------------------------------------------------
Rank: 2
trainable model parameters: 86016
all model parameters: 77047168
percentage of trainable model parameters: 0.11%


| Step | Training Loss |
|---|---|
| 250 | 5.6588 |
| 500 | 2.0288 |
| 750 | 1.9539 |
| 1000 | 1.9324 |
| 1250 | 1.935 |


--------------------------------------------------------------------------------
Rank: 4
trainable model parameters: 172032
all model parameters: 77133184
percentage of trainable model parameters: 0.22%


| Step | Training Loss |
|---|---|
| 250 | 5.5785 |
| 500 | 2.0419 |
| 750 | 1.9588 |
| 1000 | 1.9252 |
| 1250 | 1.9261 |


--------------------------------------------------------------------------------
Rank: 8
trainable model parameters: 344064
all model parameters: 77305216
percentage of trainable model parameters: 0.45%


| Step | Training Loss |
|---|---|
| 250 | 5.6804 |
| 500 | 2.0664 |
| 750 | 1.9851 |
| 1000 | 1.9566 |
| 1250 | 1.9579 |
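
The trainable-parameter counts printed for each rank can be reproduced by hand. The sketch below rests on assumptions about the flan-t5-small architecture that are not stated in this notebook: the `q` and `v` projections map 512 inputs to 384 outputs, and there are 48 such projections (8 encoder self-attention blocks, 8 decoder self-attention blocks and 8 decoder cross-attention blocks, each with one `q` and one `v`). `lora_trainable_params` is a hypothetical helper.

```python
def lora_trainable_params(rank, d_in=512, d_out=384, n_modules=48):
    """Parameters added by LoRA: A is (rank x d_in) and B is (d_out x rank) per module."""
    return n_modules * rank * (d_in + d_out)

for rank in [1, 2, 4, 8]:
    # Prints 43008, 86016, 172032 and 344064, matching the counts reported above.
    print(rank, lora_trainable_params(rank))
```

The count grows linearly with the rank, which is why even rank 8 stays below 0.5% of the model's parameters.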


In [42]:
test_valid

DatasetDict({
    train: Dataset({
        features: ['instruction', 'context', 'response', 'category'],
        num_rows: 355
    })
    test: Dataset({
        features: ['instruction', 'context', 'response', 'category'],
        num_rows: 355
    })
})

In [44]:
from peft import PeftModel, PeftConfig

# peft_model_base = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16).to(DEVICE)
# tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

contexts = test_valid['train']['context']
instructions = test_valid['train']['instruction']
responses = test_valid['train']['response']

rouge1_results = []
rouge2_results = []
rougeL_results = []
rougeLsum_results = []

for path in peft_model_paths:
    # instantiate peft model
    peft_saved_model = PeftModel.from_pretrained(
        pre_trained_model, 
        path, 
        torch_dtype=torch.bfloat16,
        is_trainable=False
    ).to(DEVICE)

    peft_model_responses = []
    start_time = time.time()
    for context, instruction in zip(contexts, instructions):
        input_ids = tokenizer(
            prompt_template.format(context=context, instruction=instruction),
            return_tensors='pt',
            truncation=True,
            max_length=512,
        ).to(DEVICE).input_ids
        peft_model_output = peft_saved_model.generate(
            input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200)
        )
        peft_model_response = tokenizer.decode(peft_model_output[0], skip_special_tokens=True)
        peft_model_responses.append(peft_model_response)
    end_time = time.time()
    print(f"Time taken for inference on {test_valid['train'].num_rows} examples: {end_time - start_time}")
    peft_model_result = rouge.compute(
        predictions=peft_model_responses,
        references=responses,
        use_aggregator=True,
        use_stemmer=True,
    )
    rouge1_results.append(peft_model_result['rouge1'])
    rouge2_results.append(peft_model_result['rouge2'])
    rougeL_results.append(peft_model_result['rougeL'])
    rougeLsum_results.append(peft_model_result['rougeLsum'])

zipped_results = list(zip(ranks_to_try, rouge1_results, rouge2_results, rougeL_results, rougeLsum_results))
results_df = pd.DataFrame(zipped_results, columns=['rank', 'rouge1', 'rouge2', 'rougeL', 'rougeLsum'])
results_df

Time taken for inference on 355 examples: 201.056640625
Time taken for inference on 355 examples: 176.9531762599945
Time taken for inference on 355 examples: 176.1388807296753
Time taken for inference on 355 examples: 126.36030602455139


| | rank | rouge1 | rouge2 | rougeL | rougeLsum |
|---|---|---|---|---|---|
| 0 | 1 | 0.340621 | 0.204046 | 0.3152 | 0.314422 |
| 1 | 2 | 0.353333 | 0.206454 | 0.324178 | 0.324214 |
| 2 | 4 | 0.358939 | 0.215746 | 0.331947 | 0.331981 |
| 3 | 8 | 0.32743 | 0.191774 | 0.308966 | 0.308135 |


There is no appreciable difference in the performance of the four models, although the model tuned with rank=4 has a slight edge over the others on all four ROUGE scores. We will now evaluate its performance on the test dataset.
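
The next cell selects the rank=4 adapter by hard-coding index 2 of `peft_model_paths`. As a small sketch, the index could instead be derived from the validation scores. Ranking by ROUGE-1 is an assumption made here for illustration; the values are copied from the table above.

```python
ranks = [1, 2, 4, 8]
rouge1 = [0.340621, 0.353333, 0.358939, 0.32743]  # validation ROUGE-1 per rank

# Index of the highest validation score; the same index would select the adapter path.
best_index = max(range(len(ranks)), key=lambda i: rouge1[i])
print(f"best rank: {ranks[best_index]} (index {best_index})")
```

With these scores the sketch selects rank 4 at index 2, the same adapter that is loaded below.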

In [46]:
contexts = test_valid['test']['context']
instructions = test_valid['test']['instruction']
responses = test_valid['test']['response']

peft_saved_model = PeftModel.from_pretrained(
    pre_trained_model, 
    peft_model_paths[2], 
    torch_dtype=torch.bfloat16,
    is_trainable=False
).to(DEVICE)

peft_model_responses = []
start_time = time.time()
for context, instruction in zip(contexts, instructions):
    input_ids = tokenizer(
        prompt_template.format(context=context, instruction=instruction),
        return_tensors='pt',
        truncation=True,
        max_length=512,
    ).to(DEVICE).input_ids
    peft_model_output = peft_saved_model.generate(
        input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200)
    )
    peft_model_response = tokenizer.decode(peft_model_output[0], skip_special_tokens=True)
    peft_model_responses.append(peft_model_response)
end_time = time.time()
print(f"Time taken for inference on {test_valid['test'].num_rows} examples: {end_time - start_time}")
peft_model_result = rouge.compute(
    predictions=peft_model_responses,
    references=responses,
    use_aggregator=True,
    use_stemmer=True,
)
print(peft_model_result)

Time taken for inference on 355 examples: 195.5180356502533
{'rouge1': 0.31707048704688745, 'rouge2': 0.1795858588607784, 'rougeL': 0.2881053060618656, 'rougeLsum': 0.28913189774986503}
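
To put the test scores side by side with the pre-trained baseline, the sketch below computes the relative change of each metric. The numbers are copied from the outputs above; `relative_gain` is a hypothetical helper. Because the split is not seeded and only one run is shown, the gains are indicative rather than definitive.

```python
pretrained = {'rouge1': 0.22349003544929158, 'rouge2': 0.11017409587228838,
              'rougeL': 0.21701084748381727, 'rougeLsum': 0.2171873334906056}
fine_tuned = {'rouge1': 0.31707048704688745, 'rouge2': 0.1795858588607784,
              'rougeL': 0.2881053060618656, 'rougeLsum': 0.28913189774986503}

def relative_gain(before, after):
    """Relative change of each metric, e.g. 0.10 means a 10% improvement."""
    return {name: (after[name] - before[name]) / before[name] for name in before}

gains = relative_gain(pretrained, fine_tuned)
for name, gain in gains.items():
    print(f"{name}: {gain:+.1%}")
```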


Let us finally look at some of the responses from the fine tuned model.

In [47]:
for i in range(3):
    print(''.join(['-'] * 80))
    print('CONTEXT:')
    print(test_valid['test'][i]['context'])
    print('\n\n')
    print('INSTRUCTION:')
    print(test_valid['test'][i]['instruction'])
    print('\n\n')
    print('RESPONSE:')
    print(test_valid['test'][i]['response'])
    print('\n\n')
    print('Fine-tuned-FLAN-T5-RESPONSE:')
    inputs = tokenizer(
        prompt_template.format(
            context=test_valid['test'][i]['context'],
            instruction=test_valid['test'][i]['instruction'],
        ),
        return_tensors='pt',
        truncation=True,
        max_length=512
    ).to(DEVICE)
    output = tokenizer.decode(
        peft_saved_model.generate(
            inputs["input_ids"], 
            max_new_tokens=200,
        )[0], 
        skip_special_tokens=True
    )
    print(output)

--------------------------------------------------------------------------------
CONTEXT:
The Build Back Better Plan or Build Back Better agenda was a legislative framework proposed by U.S. president Joe Biden between 2020 and 2021. Generally viewed as ambitious in size and scope, it sought the largest nationwide public investment in social, infrastructural, and environmental programs since the 1930s Great Depression-era policies of the New Deal.

The Build Back Better plan was divided into three parts:

American Rescue Plan (ARP), a COVID-19 pandemic-relief bill;
American Jobs Plan (AJP), a proposal to address long-neglected infrastructure needs and reduce America's contributions to destructive effects of climate change; and
American Families Plan (AFP), a proposal to fund a variety of social policy initiatives, some of which (e.g., paid family leave) had never before been enacted nationally in the U.S.
The first part was passed as the $1.9 trillion American Rescue Plan Act of 2021, a