 ### Introduction

In this notebook, we will explore the process of fine-tuning a pre-trained language model using the LoRA (Low-Rank Adaptation) technique and quantizing it with bitsandbytes to reduce memory usage. We will use the FLAN-T5 model, a variant of the T5 model designed for sequence-to-sequence tasks, and fine-tune it on the SQuAD v2 dataset to evaluate its performance on real-world question answering tasks.

The notebook is structured as follows:

1. **Quantization**: We will configure the model to use 4-bit quantization to reduce memory usage while maintaining performance.
2. **Model Initialization**: We will initialize the FLAN-T5 model with the quantization configuration.
3. **Model Freezing and Gradient Checkpointing**: We will freeze the model parameters and enable gradient checkpointing to reduce memory usage during training.
4. **Inference Before Fine-Tuning**: We will observe the model's performance on a few examples before fine-tuning.
5. **Helper Functions**: We will define some helper functions for training and evaluation.
6. **Dataset Preparation**: We will load and preprocess the dataset for training.
7. **Fine-Tuning**: We will fine-tune the model using the Seq2SeqTrainer from the Hugging Face Transformers library.
8. **Saving the Model**: We will save the fine-tuned model to disk.
9. **Model Inference**: We will load the saved model and perform inference on new data.
10. **Model Evaluation**: We will evaluate the model's performance using F1 and exact match scores.
11. **Conclusion**: We will summarize the results and discuss the effectiveness of the fine-tuning and quantization techniques.



In [None]:
# Importing the necessary Libraries
import torch
import torch.nn as nn
print(torch.cuda.is_available())
import os
os.environ["CUDA_VISIBLE_DEVICES"]="0"
os.environ['WANDB_DISABLED'] = 'true'

import numpy as np
import pandas as pd
import transformers
import accelerate
import tensorboard
import bitsandbytes as bnb

#### 1. Quantization

In [2]:
#configuring the BitsAndBytesConfig

from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

#### 2. Model lnitialization

In [3]:
%pip install -U bitsandbytes



In [4]:
# Initializing the model and tokenizer

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_id = "google/flan-t5-base"

model = AutoModelForSeq2SeqLM.from_pretrained(
        model_id,
        quantization_config = bnb_config,
        torch_dtype = torch.float16,
        device_map = {"":0}
)

tokenizer = AutoTokenizer.from_pretrained(model_id)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


model.safetensors:   0%|          | 0.00/990M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/2.54k [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.42M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/2.20k [00:00<?, ?B/s]

In [5]:
for param in model.parameters():
    param.requires_grad = False  # freeze the model - train adapters later
    if param.ndim == 1:
        # cast the small parameters (e.g. layernorm) to fp32 for stability
        param.data = param.data.to(torch.float32)

model.gradient_checkpointing_enable()  # reduce number of stored activations
model.enable_input_require_grads()

class CastOutputToFloat(nn.Sequential):
    def forward(self, x):
        return super().forward(x).to(torch.float32)
model.lm_head = CastOutputToFloat(model.lm_head)

#### 3. Observe model output before fine-tuning

In [6]:
from IPython.display import display, Markdown

def make_inference(model, context, question, max_new_tokens=200):
    batch = tokenizer(f"#### CONTEXT\n{context}\n\n#### QUESTION\n{question}\n\n#### ANSWER\n", return_tensors='pt', return_token_type_ids=False).to('cuda')

    with torch.amp.autocast('cuda'):
        output_tokens = model.generate(**batch, max_new_tokens=max_new_tokens)

    display(Markdown((tokenizer.decode(output_tokens[0], skip_special_tokens=True))))

In [7]:
context = "Cheese is the best food."
question = "What is the best food?"

make_inference(model, context, question)

  with torch.cuda.amp.autocast():


Cheese

In [8]:
context = "Cheese is the best food."
question = "How far away is the Moon from the Earth?"

make_inference(model, context, question)

  with torch.cuda.amp.autocast():


The Moon is approximately 1.3 billion light years away.

In [9]:
context = "The Moon orbits Earth at an average distance of 384,400 km (238,900 miles), or about 30 times Earth's diameter. Its gravitational influence is the main driver of Earth's tides and very slowly lengthens Earth's day. The Moon's orbit around Earth has a sidereal period of 27.3 days. During each synodic period of 29.5 days, the amount of visible surface illuminated by the Sun varies from none up to 100%, resulting in lunar phases that form the basis for the months of a lunar calendar. The Moon is tidally locked to Earth, which means that the length of a full rotation of the Moon on its own axis causes its same side (the near side) to always face Earth, and the somewhat longer lunar day is the same as the synodic period. However, 59% of the total lunar surface can be seen from Earth through cyclical shifts in perspective known as libration."
question = "At what distance does the Moon orbit the Earth?"

make_inference(model, context, question)

  with torch.cuda.amp.autocast():


30 times Earth's diameter

#### 4. Helper functions

In [10]:
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

In [11]:
def create_prompt(context, question, answer):
    if len(answer["text"]) < 1:
        answer = "Cannot Find Answer"
    else:
        answer = answer["text"][0]
    prompt_template = f"### CONTEXT\n{context}\n\n### QUESTION\n{question}\n\n### ANSWER\n{answer}</s>"
    return prompt_template

 #### 5. Load preprocessed dataset from disk (skip to step 7)

#### 6a. Load raw dataset from HuggingFace

In [32]:
#Loading the dataset

from datasets import load_dataset, Dataset, load_from_disk

dataset = load_dataset("squad_v2")
dataset = pd.DataFrame(dataset['train'])

# remove rows with empty answers
exclude = []
for i in range(len(dataset)):
    if not dataset.iloc[i]['answers']['text']:
        exclude.append(i)
dataset = dataset.drop(exclude)
print(f'{len(exclude)} rows removed.')

# accept only the first answer in every line of data
answer = []
for i in range(len(dataset)):
    answer.append(dataset.iloc[i]['answers']['text'][0])
dataset['answer'] = answer

dataset = Dataset.from_pandas(dataset)
dataset = dataset.train_test_split(train_size=0.15, test_size=0.02) # smaller dataset

dataset["validation"] = dataset["test"]
del dataset["test"]

print(f"Train dataset size: {len(dataset['train'])}")
print(f"Test dataset size: {len(dataset['validation'])}")


## Save dataset to disk for easy loading later

dataset_path = 'C:\\Users\\chall\\OneDrive\\Desktop\\vs code desktop\\Projects'
dataset.save_to_disk(f'{dataset_path}/raw')

43498 rows removed.
Train dataset size: 13023
Test dataset size: 1737


Saving the dataset (0/1 shards):   0%|          | 0/13023 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/1737 [00:00<?, ? examples/s]

#### 6b. Preprocess training dataset

In [16]:
from datasets import concatenate_datasets

## Determine maximum total input sequence length after tokenization =>
## Sequences beyond this will be truncated, sequences shorter will be padded

tokenized_inputs = concatenate_datasets([dataset["train"], dataset["validation"]]).map(lambda x: tokenizer(x["context"], truncation=True), batched=True, remove_columns=['id', 'title', 'context', 'question', 'answers', 'answer', '__index_level_0__'])
input_lengths = [len(x) for x in tokenized_inputs["input_ids"]]
max_source_length = int(np.percentile(input_lengths, 85))    # 85% of max length for better utilization
print(f"Max source length: {max_source_length}")


## Determine maximum total sequence length for target text after tokenization =>
## Sequences beyond this will be truncated, sequences shorter will be padded
tokenized_targets = concatenate_datasets([dataset["train"], dataset["validation"]]).map(lambda x: tokenizer(x["answer"], truncation=True), batched=True, remove_columns=['id', 'title', 'context', 'question', 'answers', 'answer', '__index_level_0__'])
target_lengths = [len(x) for x in tokenized_targets["input_ids"]]
max_target_length = int(np.percentile(target_lengths, 90))    # 90% of max length for better utilization
print(f"Max target length: {max_target_length}")

Map:   0%|          | 0/14760 [00:00<?, ? examples/s]

Max source length: 243


Map:   0%|          | 0/14760 [00:00<?, ? examples/s]

Max target length: 11


In [18]:
def preprocess_function(sample, padding="max_length"):
    # add prefix to the input for t5
    inputs = [f'context: {i} question: {j}' for i, j in zip(sample["context"], sample["question"])]

    # tokenize inputs
    model_inputs = tokenizer(inputs, max_length=max_source_length, padding=padding, truncation=True)

    # Tokenize targets with the `text_target` keyword argument
    labels = tokenizer(text_target=sample["answer"], max_length=max_target_length, padding=padding, truncation=True)

    # If we are padding here, replace all tokenizer.pad_token_id in the labels by -100 when we want to ignore
    # padding in the loss.
    if padding == "max_length":
        labels["input_ids"] = [
            [(l if l != tokenizer.pad_token_id else -100) for l in label] for label in labels["input_ids"]
        ]

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized_dataset = dataset.map(preprocess_function,
                                batched=True,
                                remove_columns=['id', 'title', 'context', 'question', 'answers', 'answer', '__index_level_0__'],
                                desc="Running tokenizer on dataset")
print(f"Keys of tokenized dataset: {list(tokenized_dataset['train'].features)}")


## Save tokenized_dataset to disk for later easy loading

dataset_path = 'C:\\Users\\chall\\OneDrive\\Desktop\\vs code desktop\\Projects'
tokenized_dataset['train'].save_to_disk(f'{dataset_path}/train')
tokenized_dataset['validation'].save_to_disk(f'{dataset_path}/test')   # used for evaluation

Running tokenizer on dataset:   0%|          | 0/13023 [00:00<?, ? examples/s]

Running tokenizer on dataset:   0%|          | 0/1737 [00:00<?, ? examples/s]

Keys of tokenized dataset: ['input_ids', 'attention_mask', 'labels']


Saving the dataset (0/1 shards):   0%|          | 0/13023 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/1737 [00:00<?, ? examples/s]

#### 7. Fine-Tune T5 with LoRA and bnb int-8

In addition to the LoRA technique, we will use [bitsanbytes LLM.int8()](https://huggingface.co/blog/hf-bitsandbytes-integration) to quantize our frozen LLM to int8. This allows us to reduce the needed memory for FLAN-T5 base ~4x.  

The first step of our training is to load the model. We are going to use [google/flan-t5-base](https://huggingface.co/google/flan-t5-base).

In [19]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_id = "google/flan-t5-base"

model = AutoModelForSeq2SeqLM.from_pretrained(
        model_id,
        quantization_config = bnb_config,
        torch_dtype = torch.float16,
        device_map = {"":0}
)

tokenizer = AutoTokenizer.from_pretrained(model_id)

In [20]:
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training, TaskType

# Define LoRA Config
lora_config = LoraConfig(
 r=16,              # 4
 lora_alpha=32,     # 8
 target_modules=["q", "v"],
 lora_dropout=0.05,
 bias="none",
 task_type=TaskType.SEQ_2_SEQ_LM
)

# prepare int-8 model for training
model = prepare_model_for_kbit_training(model)

# add LoRA adaptor
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

trainable params: 1,769,472 || all params: 249,347,328 || trainable%: 0.7096


In [21]:
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments, DataCollatorForSeq2Seq

training_args = Seq2SeqTrainingArguments(
    output_dir = "outputs",
    save_strategy = "no",
    report_to = "tensorboard",
    auto_find_batch_size = True,
    warmup_steps = 100,
    learning_rate = 1e-3,
    weight_decay = 0.001,
    fp16_full_eval = True,
    fp16 = False,                         # 16 bits precision is sufficient and good
    num_train_epochs = 3,
    logging_strategy = "steps",
    logging_steps = 100,
#     max_steps = 2000,                   # disable if specifying no. of epochs
#     gradient_accumulation_steps = 4,    # no. of updates steps to accumulate gradients, before updating it (higher = more accurate, but takes longer)
#     optim='adamw_bnb_8bit',
#     save_total_limit = 8,               # no. of checkpoints (models) saved in output_dir
#     evaluation_strategy = 'epoch',
#     logging_dir = f"{output_dir}/logs",

)

data_collator = DataCollatorForSeq2Seq(
    tokenizer,
    model = model,
    label_pad_token_id = -100,   # we want to ignore tokenizer pad token in the loss
    pad_to_multiple_of = 8
)

trainer = Seq2SeqTrainer(
    model = model,
    args = training_args,
    data_collator = data_collator,
    train_dataset=tokenized_dataset['train'],  # Make sure this is a valid split
    eval_dataset=tokenized_dataset['validation']       # why when add the eval_dataset argument, training loss becomes 0
    # if tokenized_dataset regenerated in 6b. (not loaded from disk), need to add in ['train'] indices
)
model.config.use_cache = False  # silence the warnings. Please re-enable for inference!


trainer.train()

  return fn(*args, **kwargs)


Step,Training Loss
100,0.8092
200,0.7346
300,0.7117
400,0.7152
500,0.7537
600,0.7056
700,0.735
800,0.7179
900,0.7268
1000,0.7366


TrainOutput(global_step=4884, training_loss=0.62340615402959, metrics={'train_runtime': 3132.8605, 'train_samples_per_second': 12.471, 'train_steps_per_second': 1.559, 'total_flos': 1.3061247910060032e+16, 'train_loss': 0.62340615402959, 'epoch': 3.0})

#### 8. Saving model

In [22]:
peft_model_path = 'C:\\Users\\chall\\OneDrive\\Desktop\\vs code desktop\\Projects'
trainer.model.save_pretrained(peft_model_path)
tokenizer.save_pretrained(peft_model_path)   # not rly necessary unless changes made to tokenizer: add new tokens to its vocab, redefine special symbols such as '[CLS]', '[MASK]', '[SEP]', '[PAD]' etc.

('C:\\Users\\chall\\OneDrive\\Desktop\\vs code desktop\\Projects/tokenizer_config.json',
 'C:\\Users\\chall\\OneDrive\\Desktop\\vs code desktop\\Projects/special_tokens_map.json',
 'C:\\Users\\chall\\OneDrive\\Desktop\\vs code desktop\\Projects/spiece.model',
 'C:\\Users\\chall\\OneDrive\\Desktop\\vs code desktop\\Projects/added_tokens.json',
 'C:\\Users\\chall\\OneDrive\\Desktop\\vs code desktop\\Projects/tokenizer.json')

In [None]:
## To push model to HuggingFace

# trainer.model.push_to_hub("<huggingface directory>",
#                   use_auth_token='<token>',
#                   commit_message="v1",
#                   private=True)

#### 9. Model inference

In [24]:
import torch
from peft import PeftModel, PeftConfig
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load peft config for pre-trained checkpoint etc.
peft_model_path = 'C:\\Users\\chall\\OneDrive\\Desktop\\vs code desktop\\Projects'
config = PeftConfig.from_pretrained(peft_model_path)

# load base LLM model and tokenizer
model = AutoModelForSeq2SeqLM.from_pretrained(config.base_model_name_or_path,
                                              return_dict=True,
                                              load_in_8bit=True,    # True if quantizing
                                              device_map={"":0})
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)

# Load the Lora model
peft_model = PeftModel.from_pretrained(model, peft_model_path, device_map={"":0})
peft_model.eval()

print("Peft model loaded")

The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.


Peft model loaded


In [25]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_id = "google/flan-t5-base"

model = AutoModelForSeq2SeqLM.from_pretrained(
        model_id,
        quantization_config = bnb_config,
        torch_dtype = torch.float16,
        device_map = {"":0}
)

tokenizer = AutoTokenizer.from_pretrained(model_id)

In [26]:
context = "Cheese is the best food."
question = "What is the best food?"

print('model:')
make_inference(model, context, question)
print('peft_model:')
make_inference(peft_model, context, question)

model:


  with torch.cuda.amp.autocast():


Cheese

peft_model:


Cheese

In [27]:
context = "Cheese is the best food."
question = "How far away is the Moon from the Earth?"

print('model:')
make_inference(model, context, question)
print('peft_model:')
make_inference(peft_model, context, question)

model:


  with torch.cuda.amp.autocast():


The Moon is approximately 1.3 billion light years away.

peft_model:


The Moon is located approximately 280 miles from the Earth

In [28]:
context = "The Moon orbits Earth at an average distance of 384,400 km (238,900 miles), or about 30 times Earth's diameter. Its gravitational influence is the main driver of Earth's tides and very slowly lengthens Earth's day. The Moon's orbit around Earth has a sidereal period of 27.3 days. During each synodic period of 29.5 days, the amount of visible surface illuminated by the Sun varies from none up to 100%, resulting in lunar phases that form the basis for the months of a lunar calendar. The Moon is tidally locked to Earth, which means that the length of a full rotation of the Moon on its own axis causes its same side (the near side) to always face Earth, and the somewhat longer lunar day is the same as the synodic period. However, 59% of the total lunar surface can be seen from Earth through cyclical shifts in perspective known as libration."
question = "At what distance does the Moon orbit the Earth?"

print('model:')
make_inference(model, context, question)
print('peft_model:')
make_inference(peft_model, context, question)

model:


  with torch.cuda.amp.autocast():


30 times Earth's diameter

peft_model:


384,400 km (238,900 miles)

In [29]:
## Basic

context = f"""
Another approach to brain function is to examine the consequences of damage to specific brain areas.
Even though it is protected by the skull and meninges, surrounded by cerebrospinal fluid,
and isolated from the bloodstream by the blood–brain barrier,
the delicate nature of the brain makes it vulnerable to numerous diseases and several types of damage.
In humans, the effects of strokes and other types of brain damage have been a key source of information about brain function.
Because there is no ability to experimentally control the nature of the damage, however,
this information is often difficult to interpret. In animal studies, most commonly involving rats,
it is possible to use electrodes or loclly injected chemicals to produce precise patterns of damage
and then examine the consequences for behavior.
"""
question = "Why is it difficult to study the brain?"

print('model:')
make_inference(model, context, question)
print('peft_model:')
make_inference(peft_model, context, question)

model:


  with torch.cuda.amp.autocast():


there is no ability to experimentally control the nature of the damage

peft_model:


Because there is no ability to experimentally control the nature

In [30]:
## Intermediate

context = f"""
Another approach to brain function is to examine the consequences of damage to specific brain areas.
Even though it is protected by the skull and meninges, surrounded by cerebrospinal fluid,
and isolated from the bloodstream by the blood–brain barrier,
the delicate nature of the brain makes it vulnerable to numerous diseases and several types of damage.
In humans, the effects of strokes and other types of brain damage have been a key source of information about brain function.
Because there is no ability to experimentally control the nature of the damage, however,
this information is often difficult to interpret. In animal studies, most commonly involving rats,
it is possible to use electrodes or locally injected chemicals to produce precise patterns of damage
and then examine the consequences for behavior.
"""
question = "How do we check for brain damage?"

print('model:')
make_inference(model, context, question)
print('peft_model:')
make_inference(peft_model, context, question)

model:


  with torch.cuda.amp.autocast():


In animal studies, most commonly involving rats, it is possible to use electrodes or locally injected chemicals to produce precise patterns of damage and then examine the consequences for behavior.

peft_model:


use electrodes or locally injected chemicals to produce precise

#### 10. Model evaluation

In [31]:
## Helper functions

def normalize_text(s):
    """Removing articles and punctuation, and standardizing whitespace are all typical text processing steps."""
    import string, re

    def remove_articles(text):
        regex = re.compile(r"\b(a|an|the)\b", re.UNICODE)
        return re.sub(regex, " ", text)

    def white_space_fix(text):
        return " ".join(text.split())

    def remove_punc(text):
        exclude = set(string.punctuation)
        return "".join(ch for ch in text if ch not in exclude)

    def lower(text):
        return text.lower()

    return white_space_fix(remove_articles(remove_punc(lower(s))))

def compute_exact_match(prediction, truth):
    return int(normalize_text(prediction) == normalize_text(truth))

def compute_f1(prediction, truth):
    pred_tokens = normalize_text(prediction).split()
    truth_tokens = normalize_text(truth).split()

    # if either the prediction or the truth is no-answer then f1 = 1 if they agree, 0 otherwise
    if len(pred_tokens) == 0 or len(truth_tokens) == 0:
        return int(pred_tokens == truth_tokens)

    common_tokens = set(pred_tokens) & set(truth_tokens)

    # if there are no common tokens then f1 = 0
    if len(common_tokens) == 0:
        return 0

    prec = len(common_tokens) / len(pred_tokens)
    rec = len(common_tokens) / len(truth_tokens)

    return 2 * (prec * rec) / (prec + rec)

In [40]:
from datasets import load_from_disk
from tqdm import tqdm

## function to generate predictions


def evaluate_peft_model(sample, max_target_length=200):
    input_ids = torch.tensor(sample["input_ids"]).unsqueeze(0).cuda()  # Convert list to tensor

    # Generate predictions
    outputs = model.generate(input_ids=input_ids, do_sample=True, top_p=0.9, max_new_tokens=max_target_length)

    # Decode the output safely
    prediction = tokenizer.decode(outputs[0].tolist(), skip_special_tokens=True)

    # Process labels safely
    labels = sample['labels']
    labels = [token for token in labels if token != -100]  # Remove ignored tokens
    labels = tokenizer.decode(labels, skip_special_tokens=True)

    return prediction, labels



## load test dataset from distk
test_dataset = tokenized_dataset['validation']

## compute score
f1_scores, exact_scores = [], []
for sample in tqdm(test_dataset, miniters=100, maxinterval=float("inf"), position=0, leave=True):
    p, l = evaluate_peft_model(sample)
    f1_scores.append(compute_f1(p, l))
    exact_scores.append(compute_exact_match(p, l))

print(np.mean(f1_scores))
print(np.mean(exact_scores))

100%|██████████| 1737/1737 [08:35<00:00,  3.37it/s]

0.6839393890408973
0.5497985031663788





### Conclusion

In this notebook, we have successfully fine-tuned the FLAN-T5 model using the LoRA technique and quantized it with bitsandbytes to reduce memory usage. We started by setting up the environment and loading the pre-trained model. We then prepared the dataset, tokenized it, and applied the necessary preprocessing steps. After that, we fine-tuned the model using the Seq2SeqTrainer and saved the trained model.

We also demonstrated how to load the trained model and perform inference on new data. Finally, we evaluated the model's performance using F1 and exact match scores.

The results show that the fine-tuned model performs well on the given tasks, providing accurate and relevant answers to the questions based on the provided context. This approach can be further extended to other datasets and tasks to leverage the power of the FLAN-T5 model with efficient memory usage.