# Lab 3 - Fine-tune Falcon-7B using LoRA

In this lab, you will fine-tune the Falcon 7B model using [Low-Rank Adaptation (LoRA)](https://github.com/microsoft/LoRA) to reduce the number of trainable parameters and storage requirements.

We will use the data generated for the `fae` pronoun using CDA in the previous lab.

We will be performing the following tasks:
* Load a tokenizer and pre-trained model
* Fine-tune Falcon-7B
* Perform inference with the fine-tuned model

<div class="alert alert-block alert-warning">
 <b>Warning:</b> <br/>
    This notebook requires Python >= 3.10.<br/>
    Use kernel `PyTorch 2.0.0 Python 3.10 CPU Optimized` for SageMaker Studio and `pytorch_p310` for SageMaker Notebook instances. <br/>
    Both scenarios require instance type `ml.g4dn.2xlarge` or bigger.<br/>
</div>

First, check that the Python version you are using is >= 3.10, and upgrade pip before installing the required libraries.


In [2]:
!python -V
!pip install -q -U pip --root-user-action=ignore

Python 3.10.8


**Installing requirements**

Run the cell below to create a requirements.txt file and then install. If you already have a requirements.txt this will be overwritten.

In [3]:
%%writefile requirements.txt
torch==2.0.1
bitsandbytes==0.41.1
datasets>=2.10.0,<3
py7zr
einops
tensorboardX
ipython
ipykernel
transformers>4.28.1
git+https://github.com/huggingface/peft
accelerate>=0.20.3,<1
cuda-python
xformers
nvidia-cublas-cu11

Overwriting requirements.txt


Quietly install all requirements from the previously created file.

In [4]:
!pip3 install -q -r requirements.txt --root-user-action=ignore

**Checking memory**

Before running this notebook, let's quickly run the following command to check the GPU memory.

In [5]:
!nvidia-smi

Wed Aug 30 10:52:55 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   37C    P0    26W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [6]:
# Add installed cuda runtime to path for bitsandbytes
import os
import nvidia
import torch

cuda_install_dir = "/".join(nvidia.__file__.split("/")[:-1]) + "/cuda_runtime/lib/"
os.environ["LD_LIBRARY_PATH"] = cuda_install_dir
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:512"
os.environ["TOKENIZERS_PARALLELISM"] = "true"

In [7]:
free_in_GB = int(torch.cuda.mem_get_info()[0] / 1024**3)
max_memory = f"{free_in_GB-2}GB"

n_gpus = torch.cuda.device_count()
max_memory = {i: max_memory for i in range(n_gpus)}
max_memory

{0: '12GB'}

## Step 1. Load a tokenizer and pre-trained model

First, let's initialize a tokenizer and an LLM. In this notebook, we will be using the `Falcon-7B` model from the Hugging Face Transformers library. `Falcon-7B` was trained on 1,500B tokens of RefinedWeb, a high-quality filtered and deduplicated web dataset which was enhanced with curated corpora.

### Loading Tokenizer
The tokenizer converts raw text into tokens, and the base model generates text based on a given prompt. The `AutoTokenizer.from_pretrained()` function is used to instantiate the tokenizer. 
- `padding_side="left"` specifies the side of the sequences where padding tokens will be added. In this case, padding tokens will be added to the left side of each sequence. 
- The `eos_token` is a special token representing the end of a sequence. By assigning it to the `pad_token`, padding positions are filled with the end-of-sequence token, so no new token has to be added to the vocabulary. The attention mask still tells the model which positions are padding.

After execution, the `tokenizer` object will be initialized and ready to use for tokenizing text.
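
To make the effect of `padding_side="left"` concrete, here is a minimal sketch that mimics left padding on plain lists of integers. It does not use the real tokenizer: `pad_left` and `PAD_ID` are hypothetical names and values chosen for illustration only.

```python
# Illustrative only: mimics left padding on plain integer token ids.
# PAD_ID is a made-up value, not Falcon's real pad/eos id.
PAD_ID = 0


def pad_left(sequences, pad_id=PAD_ID):
    """Left-pad every sequence to the length of the longest one."""
    width = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = width - len(seq)
        # Padding goes in front, so the real tokens end at the last position.
        input_ids.append([pad_id] * n_pad + list(seq))
        # 0 marks padding, 1 marks real tokens.
        attention_mask.append([0] * n_pad + [1] * len(seq))
    return input_ids, attention_mask


ids, mask = pad_left([[5, 6, 7], [8, 9]])
print(ids)   # [[5, 6, 7], [0, 8, 9]]
print(mask)  # [[1, 1, 1], [0, 1, 1]]
```

With left padding, the last position of every row holds a real token, which is where a decoder-only model continues generating from.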

In [8]:
from transformers import AutoTokenizer
from tqdm.autonotebook import tqdm as notebook_tqdm

model_id = "tiiuae/falcon-7b"

tokenizer = AutoTokenizer.from_pretrained(
    model_id,
    padding_side="left",
)

  from .autonotebook import tqdm as notebook_tqdm


In [9]:
# checking eos
print(tokenizer.eos_token)

tokenizer.pad_token = tokenizer.eos_token

<|endoftext|>


Suppress warnings.

In [10]:
tokenizer.deprecation_warnings["Asking-to-pad-a-fast-tokenizer"] = True

### Loading Pre-trained model `AutoModelForCausalLM`

In [11]:
from transformers import BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
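
As a rough back-of-the-envelope check of why 4-bit loading matters on a single T4 (about 15 GB, of which the notebook budgets 12 GB), the sketch below estimates the memory needed for the weights alone at different precisions. It takes the "all params" figure printed later in this notebook at face value and ignores quantization constants, activations, and any layers kept in higher precision, so treat the numbers as approximate. `weight_gib` is a hypothetical helper.

```python
# Rough weight-memory estimate; ignores quantization constants, activations,
# and any layers that are kept in higher precision.
N_PARAMS = 6_997_799_808  # "all params" figure printed later in this notebook


def weight_gib(n_params: int, bits_per_param: int) -> float:
    """Memory in GiB needed to store n_params weights at the given bit width."""
    return n_params * bits_per_param / 8 / 1024**3


for name, bits in [("float32", 32), ("bfloat16", 16), ("nf4", 4)]:
    print(f"{name:>8}: {weight_gib(N_PARAMS, bits):.1f} GiB")
```

Under these assumptions the 16-bit weights alone exceed the 12 GB budget, while the 4-bit weights leave room for activations and the LoRA parameters.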

Now we download and initialize the base model using the `AutoModelForCausalLM` class provided by the Transformers library. The base model generates text from a given prompt.

The `AutoModelForCausalLM.from_pretrained()` function is used to instantiate the base model. 
- `device_map="auto"` lets the library place the model weights on the available devices (e.g., GPU or CPU) automatically.
- `trust_remote_code=True` allows the custom modeling code shipped in the model repository to run.
- `torch_dtype=torch.bfloat16` sets the data type for the weights that are not quantized.
- `quantization_config=quant_config` loads the weights in 4-bit precision using the configuration defined above.
- `use_cache` is not passed here; it is set to `False` on `model.config` just before training. It controls whether the model returns cached key/value states during generation and has no effect on how the weights are downloaded.

After execution, the `model_4bit` object will be initialized and ready to use.

In [12]:
from transformers import AutoModelForCausalLM

model_4bit = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    quantization_config=quant_config,
)

Loading checkpoint shards: 100%|██████████| 2/2 [00:14<00:00,  7.20s/it]


Enable gradient checkpointing for more memory-efficient training:

In [13]:
model_4bit.gradient_checkpointing_enable()
model_4bit.enable_input_require_grads()

Resize the token embeddings to match the tokenizer vocabulary, padding the embedding matrix size to a multiple of 128.

In [14]:
model_4bit.resize_token_embeddings(len(tokenizer), pad_to_multiple_of=128)

Embedding(65152, 4544)

## Step 2. Fine-Tuning

To fine-tune a model efficiently, we're going to use [LoRA: Low-Rank Adaptation](https://arxiv.org/abs/2106.09685). LoRA freezes the pre-trained model weights and injects trainable rank decomposition matrices into each layer of the Transformer architecture, greatly reducing the number of trainable parameters for downstream tasks. According to the LoRA paper, compared to GPT-3 175B fine-tuned with Adam, LoRA can reduce the number of trainable parameters by 10,000 times and the GPU memory requirement by 3 times. 
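
The idea can be sketched with a toy example in plain Python. The adapted layer computes `W x + (alpha / r) * B (A x)`, where `W` is frozen and only the small matrices `A` and `B` are trained. All shapes and values below are made up for illustration, and `matvec` and `lora_forward` are hypothetical helpers, not part of the PEFT library.

```python
# Toy LoRA forward pass with plain Python lists (no framework needed).
# Shapes and values are made up for illustration.


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha, r):
    """Return W x + (alpha / r) * B (A x); only A and B would be trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


W = [[1.0, 2.0, 0.0], [0.0, 1.0, 3.0]]  # frozen weight, shape (2, 3)
A = [[0.5, -0.5, 1.0]]                  # trainable, shape (r=1, 3)
B_zero = [[0.0], [0.0]]                 # trainable, shape (2, r=1), zero-initialized
x = [1.0, 1.0, 1.0]

# With B initialized to zero, the adapted layer reproduces the frozen layer.
out_init = lora_forward(W, A, B_zero, x, alpha=2, r=1)
print(out_init)     # [3.0, 4.0]

# Once B is non-zero (as after training), the low-rank update shifts the output.
out_trained = lora_forward(W, A, [[1.0], [-1.0]], x, alpha=2, r=1)
print(out_trained)  # [5.0, 2.0]
```

Because `B` starts at zero, training begins from exactly the pre-trained model's behavior, and the number of trainable values grows with `r` rather than with the full size of `W`.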


### Load dataset for fine-tuning

In [15]:
from datasets import load_dataset

bold_prompts_dataset = load_dataset(
    "csv", data_files="bold_gender_prompts_cda.csv"
)["train"]

bold_prompts_dataset

Downloading data files: 100%|██████████| 1/1 [00:00<00:00, 7695.97it/s]
Extracting data files: 100%|██████████| 1/1 [00:00<00:00, 28.66it/s]
Generating train split: 4000 examples [00:00, 24258.48 examples/s]


Dataset({
    features: ['intro', 'response'],
    num_rows: 4000
})

In [16]:
from IPython.display import Markdown

In [17]:
prompt_template = """Below is a sentence that you need to complete. Write a response that appropriately completes the request. Sentence: {instruction}\n Response:"""
answer_template = """{response}"""

Markdown(prompt_template + answer_template)

Below is a sentence that you need to complete. Write a response that appropriately completes the request. Sentence: {instruction}
 Response:{response}

In [18]:
def _add_text(rec):
    instruction = rec["intro"]
    response = rec["response"]

    if not instruction:
        raise ValueError(f"Expected a prompt in: {rec}")

    if not response:
        raise ValueError(f"Expected a response in: {rec}")

    rec["prompt"] = prompt_template.format(instruction=instruction)
    rec["answer"] = answer_template.format(response=response)
    rec["text"] = rec["prompt"] + rec["answer"]

    return rec

In [19]:
bold_prompts_dataset = bold_prompts_dataset.map(_add_text)

Map: 100%|██████████| 4000/4000 [00:00<00:00, 9483.72 examples/s] 
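
To see what one training example looks like after this step, the sketch below applies the same two templates to a single made-up record. The record contents are hypothetical and do not come from the dataset.

```python
# Stand-alone copy of the notebook's templates applied to a made-up record.
prompt_template = """Below is a sentence that you need to complete. Write a response that appropriately completes the request. Sentence: {instruction}\n Response:"""
answer_template = """{response}"""

# Hypothetical record with the same two columns as the dataset.
record = {"intro": "Casey is a painter and fae", "response": "lives in Lisbon."}

prompt = prompt_template.format(instruction=record["intro"])
text = prompt + answer_template.format(response=record["response"])
print(text)
```

The prompt always ends with `Response:`, which is the marker the inference helper later splits on to separate the completion from the instruction.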


We use the `_preprocess_batch` function to tokenize the "text" field of each batch, truncating and padding every example to `MAX_LENGTH` tokens and copying the input IDs as labels. It takes a batch of data as input and uses the `tokenizer` and `MAX_LENGTH` defined in the notebook. 

In [20]:
from functools import partial
import copy
from typing import Any, Dict, List, Tuple, Union

MAX_LENGTH = 256


# Tokenize a batch of texts and copy the input IDs as labels
def _preprocess_batch(batch: Dict[str, List]):
    model_inputs = tokenizer(
        batch["text"], max_length=MAX_LENGTH, truncation=True, padding="max_length"
    )

    model_inputs["labels"] = copy.deepcopy(model_inputs["input_ids"])
    return model_inputs


_preprocessing_function = partial(_preprocess_batch)

Next, we apply the preprocessing function to each batch in the dataset, tokenizing the "text" field. The map operation is performed in a batched manner and the "intro", "response", "prompt", and "answer" columns are removed from the dataset. Finally, `processed_dataset` is created by filtering the encoded dataset on the length of the "input_ids" field, keeping only records with at most `MAX_LENGTH` tokens.

In [21]:
encoded_bold_prompts_dataset = bold_prompts_dataset.map(
    _preprocessing_function,
    batched=True,
    remove_columns=["intro", "response", "prompt", "answer"],
)

processed_dataset = encoded_bold_prompts_dataset.filter(
    lambda rec: len(rec["input_ids"]) <= MAX_LENGTH
)

Map: 100%|██████████| 4000/4000 [00:02<00:00, 1598.83 examples/s]
Filter: 100%|██████████| 4000/4000 [00:01<00:00, 2289.35 examples/s]
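
Because the tokenizer is called with `truncation=True` and `padding="max_length"`, every encoded example ends up with exactly `MAX_LENGTH` IDs, so the length filter keeps all rows. The sketch below mimics that behavior on plain integer lists; `pad_or_truncate` and `PAD_ID` are hypothetical stand-ins for what the tokenizer does, not its actual implementation.

```python
MAX_LENGTH = 256
PAD_ID = 0  # placeholder id for illustration


def pad_or_truncate(token_ids, max_length=MAX_LENGTH, pad_id=PAD_ID):
    """Mimic truncation=True with padding="max_length" and left padding."""
    token_ids = list(token_ids)[:max_length]  # keep at most max_length ids
    return [pad_id] * (max_length - len(token_ids)) + token_ids


short = pad_or_truncate(range(1, 11))    # 10 real tokens, padded up
long = pad_or_truncate(range(1, 1001))   # 1000 real tokens, truncated down
print(len(short), len(long))             # 256 256

# The same condition as the notebook's filter: nothing is dropped.
kept = [seq for seq in (short, long) if len(seq) <= MAX_LENGTH]
print(len(kept))                         # 2
```

This is consistent with the row counts in this notebook: 4000 examples go in, and the later split contains 3900 + 100 examples.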


Let's split the dataset into `train` and `test` for evaluation.

In [22]:
split_dataset = processed_dataset.train_test_split(test_size=100, seed=0)
split_dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 3900
    })
    test: Dataset({
        features: ['text', 'input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 100
    })
})

### Define the `LoraConfig` and load LoRA model

We will use `LoraConfig` from [huggingface 🤗 PEFT: State-of-the-art Parameter-Efficient Fine-Tuning](https://github.com/huggingface/peft). Within `LoraConfig`, let's specify the following parameters:

- `r`, the dimension of the low-rank matrices
- `lora_alpha`, the scaling factor for the low-rank matrices
- `lora_dropout`, the dropout probability of the LoRA layers
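- `bias`, which bias parameters are trained (`"none"` here, so no bias terms are updated)
- `task_type`, the type of task the model is trained for (causal language modeling here)
- `target_modules`, the names of the modules to which the LoRA matrices are applied (the fused `query_key_value` projection here)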

In [23]:
from peft import LoraConfig, TaskType

MICRO_BATCH_SIZE = 8
BATCH_SIZE = 64
GRADIENT_ACCUMULATION_STEPS = BATCH_SIZE // MICRO_BATCH_SIZE
LORA_R = 256
LORA_ALPHA = 512
LORA_DROPOUT = 0.05

# Define LoRA Config
lora_config = LoraConfig(
    r=LORA_R,
    lora_alpha=LORA_ALPHA,
    lora_dropout=LORA_DROPOUT,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
    target_modules=["query_key_value"],
)

### Prepare model for training with `peft`

Some preprocessing is needed before training a quantized (here 4-bit) model with PEFT, so let's import the utility function `prepare_model_for_kbit_training`, which will:

- wrap the entire protocol for preparing a model before running a training
- add a forward_hook to the input embedding layer to enable gradient computation of the input hidden states

In [24]:
from peft import prepare_model_for_kbit_training, get_peft_model

# prepare model for training
model = prepare_model_for_kbit_training(model_4bit)
model = get_peft_model(model, lora_config)

# let's look at how many trainable parameters we have
model.print_trainable_parameters()

trainable params: 75,497,472 || all params: 6,997,799,808 || trainable%: 1.0788744186950026


As we can see above, trainable parameters for LoRA make up only ~1% of the full weights.
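
The trainable parameter count can be reproduced by hand. For a linear layer with input size `d_in` and output size `d_out`, LoRA adds `r * (d_in + d_out)` parameters. The sketch below assumes Falcon-7B has 32 decoder layers and a fused `query_key_value` projection of size 4544 to 4672 (multi-query attention with a head dimension of 64). These architecture details are assumptions that are not stated in this notebook, apart from the hidden size, which matches the `Embedding(65152, 4544)` output above. `lora_params` is a hypothetical helper.

```python
# Assumed Falcon-7B dimensions (only HIDDEN is visible in this notebook's output).
HIDDEN = 4544
HEAD_DIM = 64
N_LAYERS = 32
# Multi-query attention: all query heads plus a single shared key and value head.
QKV_OUT = HIDDEN + 2 * HEAD_DIM
LORA_R = 256


def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters in A (r x d_in) plus B (d_out x r)."""
    return r * (d_in + d_out)


trainable = N_LAYERS * lora_params(HIDDEN, QKV_OUT, LORA_R)
print(f"{trainable:,}")  # 75,497,472
```

The result matches the `trainable params` figure printed by `print_trainable_parameters()`, which supports the assumed dimensions.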

### Instantiate `DataCollator`
Next, we need to define a DataCollator.

A DataCollator is a huggingface🤗 transformers function that takes a list of samples from a Dataset and collates them into a batch, as a dictionary of PyTorch tensors.

In [25]:
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(
    model=model,
    tokenizer=tokenizer,
    max_length=MAX_LENGTH,
    pad_to_multiple_of=8,
    padding="max_length",
)

### Define Trainer and Fine-Tune

To finetune the LLM, we need to define a trainer. Let's define the training arguments first.

In [26]:
from transformers import TrainingArguments, Trainer

EPOCHS = 2
LEARNING_RATE = 2e-4
MODEL_SAVE_FOLDER_NAME = "falcon-7b-lora"

training_args = TrainingArguments(
    output_dir=MODEL_SAVE_FOLDER_NAME,
    overwrite_output_dir=True,
    fp16=True,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=GRADIENT_ACCUMULATION_STEPS,
    per_device_eval_batch_size=1,
    learning_rate=LEARNING_RATE,
    optim="paged_adamw_8bit",
    num_train_epochs=EPOCHS,
    logging_strategy="epoch",
    evaluation_strategy="epoch",
    save_strategy="epoch",
)

Now is where the magic happens! Let's initialize the trainer with our defined model, tokenizer, training arguments, data collator and the train/eval datasets. 

Training may take a few hours (about 3 hours 50 minutes in the run shown below). Once training is done, we save the fine-tuned LoRA adapter.

In [27]:
%%time
torch.cuda.empty_cache()
trainer = Trainer(
        model=model,
        tokenizer=tokenizer,
        args=training_args,
        train_dataset=split_dataset['train'],
        eval_dataset=split_dataset["test"],
        data_collator=data_collator,
)
model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
trainer.train()

Epoch,Training Loss,Validation Loss
0,0.5535,0.054469
1,0.0351,0.029697


CPU times: user 2h 30min 32s, sys: 1h 18min 30s, total: 3h 49min 2s
Wall time: 3h 49min 10s


TrainOutput(global_step=974, training_loss=0.29433495210181515, metrics={'train_runtime': 13750.0025, 'train_samples_per_second': 0.567, 'train_steps_per_second': 0.071, 'total_flos': 4.055857328396698e+16, 'train_loss': 0.29433495210181515, 'epoch': 2.0})
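
The `global_step=974` in the output can be reproduced from the settings above. This is a sketch assuming a single GPU and that the number of optimizer steps per epoch is rounded down, which is consistent with the logged value. Note that `per_device_train_batch_size` is 1 rather than `MICRO_BATCH_SIZE`, so each optimizer step sees 8 examples, not 64.

```python
import math

TRAIN_ROWS = 3900        # size of the train split
PER_DEVICE_BATCH = 1     # per_device_train_batch_size
GRAD_ACCUM = 64 // 8     # BATCH_SIZE // MICRO_BATCH_SIZE
EPOCHS = 2

# Examples that contribute to one optimizer step (single GPU assumed).
effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM

# Batches per epoch, then optimizer steps per epoch (rounded down, an assumption).
batches_per_epoch = math.ceil(TRAIN_ROWS / PER_DEVICE_BATCH)
steps_per_epoch = batches_per_epoch // GRAD_ACCUM
total_steps = steps_per_epoch * EPOCHS

print(effective_batch)  # 8
print(total_steps)      # 974
```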

In [28]:
trainer.model.save_pretrained(MODEL_SAVE_FOLDER_NAME)

## Step 3. Inference
Let's now test our fine-tuned model with a couple of non-binary inference prompts:

In [29]:
# Function to format the response and filter out the instruction from the response.
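def postprocess(response):
    messages = response.split("Response:")
    # str.split always returns at least one element, so check for a missing marker explicitly
    if len(messages) < 2:
        raise ValueError(
            "Invalid template for prompt. The template should include the term 'Response:'"
        )
    # Keep everything after the first "Response:" marker
    return "".join(messages[1:]) + "..."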

In [33]:
%%time
from transformers import GenerationConfig
inference_prompt = "Robert was an American actor. Fae lived in ... Response:"

MAX_RESPONSE = 100

generation_output = trainer.model.generate(
    input_ids=tokenizer(
        prompt_template.format(instruction=inference_prompt), return_tensors="pt"
    )["input_ids"].cuda(),
    generation_config=GenerationConfig(do_sample=False, top_p=0.75, num_beams=2),
    return_dict_in_generate=True,
    output_scores=True,
    max_new_tokens=MAX_RESPONSE,
)
for seq in generation_output.sequences:
    output = tokenizer.decode(seq)
    print(inference_prompt)
    print(postprocess(output))

Robert was an American actor. Fae lived in ... Response:

 1950s and 1860s. Fae was born in New York City, New York, USA on 15th April 16, 1550. Fae died on 23rd April 17, 1350 in New Orleans, Louisiana, USA.
Fae was a famous American actor. Fae is best known for faer role in the movie The Last of the Mohicans. Fae also appeared in the movies The Last Days of Pompeii and The Last...
CPU times: user 1min 10s, sys: 44.4 s, total: 1min 55s
Wall time: 1min 55s


<div class="alert alert-block alert-info">
<b>Out-of-the-box model output</b> <br/>
Let's compare our model output with the out-of-the-box model output
<img src="images/prompt_answer_1.png" alt="Alternative text" />

As we can see, the original model understood `fae` as the name of a person instead of a pronoun. In the new model we can see that `fae` is understood as Robert's pronoun and used in both the personal and the possessive forms:
    
`Fae is best known for faer role`
</div>

In [34]:
%%time
inference_prompt = "Please generate 100 tokens using the following prompt: casey is an american actor and fae is known for faer roles in films."

MAX_RESPONSE = 100

generation_output = trainer.model.generate(
    input_ids=tokenizer(
        prompt_template.format(instruction=inference_prompt), return_tensors="pt"
    )["input_ids"].cuda(),
    generation_config=GenerationConfig(do_sample=False, top_p=0.75, num_beams=2),
    return_dict_in_generate=True,
    output_scores=True,
    max_new_tokens=MAX_RESPONSE,
)
for seq in generation_output.sequences:
    output = tokenizer.decode(seq)
    print(inference_prompt)
    print(postprocess(output))

Please generate 100 tokens using the following prompt: casey is an american actor and fae is known for faer roles in films.
1990s, such as The Sixth Sense, The Usual Suspects, and The Big Lebowski. Fae has also appeared in television shows, including The X-Files, The Practice, and The West Wing.
Casey was born in 1880 in New York City, New York, USA. Fae is the child of John Casey and Mary Elizabeth Casey. Fae was raised in the Bronx, New Jersey, and Connecticut.
Fae is a graduate of...
CPU times: user 1min 16s, sys: 48.2 s, total: 2min 4s
Wall time: 2min 4s


<div class="alert alert-block alert-info">
<b>Out-of-the-box model output</b> <br/>
Let's compare our model output with the out-of-the-box model output
<img src="images/prompt_answer_2.png" alt="Alternative text" />

As we can see, the original model didn't understand `fae` as a pronoun and instead considered Casey to identify with the `he/his` pronouns. The fine-tuned model consistently understands `fae` as a pronoun and uses it in its personal form multiple times:
    
`Fae is the child ...`
`Fae was raised ...`
</div>

In [35]:
%%time
inference_prompt = "Please generate 100 tokens using the following prompt: casey is a singer-songwriter-producer based in boston and eir debut album was released in march."

MAX_RESPONSE = 100

generation_output = trainer.model.generate(
    input_ids=tokenizer(
        prompt_template.format(instruction=inference_prompt), return_tensors="pt"
    )["input_ids"].cuda(),
    generation_config=GenerationConfig(do_sample=False, top_p=0.75, num_beams=2),
    return_dict_in_generate=True,
    output_scores=True,
    max_new_tokens=MAX_RESPONSE,
)
for seq in generation_output.sequences:
    output = tokenizer.decode(seq)
    print(inference_prompt)
    print(postprocess(output))

Please generate 100 tokens using the following prompt: casey is a singer-songwriter-producer based in boston and eir debut album was released in march.
2019.
The album was released on March 22, 2109. It was a hit.
Casey was born in 1990 in Boston, Massachusetts. Eir parent was a musician and eir parent was a singer.
Eir parent was a producer and eir parent was a singer/songwriter.
They were a member of the band The Almost and they were a solo artist.
In 2209, Casey was a member of The Almost.
The...
CPU times: user 1min 21s, sys: 51.7 s, total: 2min 13s
Wall time: 2min 13s


<div class="alert alert-block alert-info">
<b>Out-of-the-box model output</b> <br/>
Let's compare our model output with the out-of-the-box model output
<img src="images/prompt_answer_3.png" alt="Alternative text" />

As we can see, the original model didn't understand `eir` as a pronoun and instead considered Casey to identify with the `she/her` pronouns. The fine-tuned model understands `eir` as a pronoun and uses it in its possessive form multiple times:
    
`Eir parent ...`
However, in this case the model is still inconsistent and also uses `they` as Casey's pronoun:
    
`They were a member of the band The Almost and they were a solo artist.`
</div>

In [36]:
%%time
inference_prompt = "Daniel identifies with the pronouns fae and faer! Fae living in London and faer house ..."

MAX_RESPONSE = 100

generation_output = trainer.model.generate(
    input_ids=tokenizer(
        prompt_template.format(instruction=inference_prompt), return_tensors="pt"
    )["input_ids"].cuda(),
    generation_config=GenerationConfig(do_sample=False, top_p=0.75, num_beams=2),
    return_dict_in_generate=True,
    output_scores=True,
    max_new_tokens=MAX_RESPONSE,
)
for seq in generation_output.sequences:
    output = tokenizer.decode(seq)
    print(inference_prompt)
    print(postprocess(output))

Daniel identifies with the pronouns fae and faer! Fae living in London and faer house ...
1990s. Fae is a writer and actor. Fae has written for the BBC, ITV, Channel 4, and Sky. Fae has also written for the stage, including the play “The Last Days of Disco”. Fae is also a writer and producer for the television series “The Last Kingdom”.
Fae is also a singer and has released two albums.
Faer house is in the village of Stow-on-the-Wold in the...
CPU times: user 1min 20s, sys: 49.3 s, total: 2min 10s
Wall time: 2min 9s


<div class="alert alert-block alert-info">
<b>Out-of-the-box model output</b> <br/>
Let's compare our model output with the out-of-the-box model output
<img src="images/prompt_answer_4.png" alt="Alternative text" />

As we can see, the original model didn't understand `fae` as a pronoun and instead considered it to be a character in a book. The fine-tuned model consistently understands `fae` as a pronoun and uses it in its personal and possessive forms multiple times:
    
`Fae is also a singer ...`
`Faer house is ...`
</div>

## Conclusions

In this lab we learned how to fine-tune a `Falcon-7B` model with `LoRA` using the data generated via `CDA` for the `fae` pronoun. We then applied this model to several example prompts and compared its output with that of the original model.

As a **next step**, you can consider a bigger dataset for the fine-tuning process with more pronouns. You can also do the same with other CDA-generated datasets for other types of bias such as race.

## Further References
LoRA: Low-Rank Adaptation of Large Language Models Paper: [https://arxiv.org/abs/2106.09685](https://arxiv.org/abs/2106.09685)