# Fine-tune a Large Language Model with LoRA

This notebook is part of Lab 6 of the [EE292D Edge ML class](https://ee292d.github.io/) at Stanford, which covers parameter-efficient fine-tuning and deployment of LLMs.

You'll need a GPU for this exercise. As with previous labs, you can access one for free on Colab. [Click here](https://colab.research.google.com/github/ee292d/labs/blob/main/lab6/notebook.ipynb) to open this notebook in a Colab instance, then change your runtime type to GPU.

## Overview

Our goal is to fine-tune a small large language model (LLM) for a new task, then prepare it for deployment on a Raspberry Pi. In this example, we will fine-tune a base model that has been pre-trained for _completion_ (i.e., to predict the next words in the input sentence) so that we can use it for _chat_.

We're going to fine-tune using a technique called "low rank adaptation" (LoRA). Vanilla fine-tuning of LLMs requires a massive amount of GPU memory because every weight of the model is updated during training, so gradients and optimizer state must be kept for all of them. With LoRA, we freeze the original weights and train a small _adapter layer_ instead of retraining the whole model. Once we're done, we merge this small adapter layer into the original model to get our fine-tuned model.
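
To make the idea concrete, here is a minimal pure-Python sketch of the low-rank update behind LoRA. The helper names (`matmul`, `lora_merge`) and the tiny matrices are illustrative assumptions, not part of any library; real implementations work on GPU tensors. The frozen weight `W` stays untouched during training, only the two thin matrices `B` and `A` are trained, and merging adds the scaled product `(alpha / r) * B @ A` back into `W`.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_merge(w, b, a, alpha, r):
    """Return W + (alpha / r) * (B @ A), the merged weight matrix."""
    scale = alpha / r
    delta = matmul(b, a)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


# Toy example: a 2x2 frozen weight with a rank-1 adapter (hypothetical values).
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0], [2.0]]      # shape (d_out, r)
A = [[3.0, 4.0]]        # shape (r, d_in)

merged = lora_merge(W, B, A, alpha=2, r=1)
print(merged)
```

Because the adapter is just an additive update, the merged model has exactly the same shape as the original one, which is what lets us deploy it like any other checkpoint.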

To get started, install the required Python dependencies in your environment:

In [1]:
!pip install einops accelerate peft trl datasets transformers torch gguf protobuf sentencepiece



## Choosing a Base Model

Since we want to deploy our fine-tuned model on a Raspberry Pi, we need to start with a small base model. In Lab 1, you experimented with Orca, which is a fine-tuned version of the 7 billion (7B) parameter Llama 2 model. For this lab, let's work with an even smaller model: [Phi-2](https://huggingface.co/microsoft/phi-2). At 2.7B parameters, Phi-2 can fit in about 5GB of RAM when loaded at 16-bit precision.
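
The 5GB figure follows from simple arithmetic: number of parameters times bytes per parameter. Here is a rough sketch, where `weight_memory_gb` is a hypothetical helper and the estimate covers the weights only (it ignores activations, the KV cache, and framework overhead):

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory needed to hold the weights alone, in GB (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


phi2_params = 2.7e9  # parameter count quoted above

print(f"16-bit: {weight_memory_gb(phi2_params, 16):.1f} GB")
print(f"32-bit: {weight_memory_gb(phi2_params, 32):.1f} GB")
```

The same formula shows why lower-precision formats matter on a Raspberry Pi: halving the bits per parameter halves the memory needed for the weights.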

First, we'll get the model repo from HuggingFace. Run this cell to download the weights and load the model/tokenizer.

In [2]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-2",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True
).to("cuda")

tokenizer = AutoTokenizer.from_pretrained(
    "microsoft/phi-2",
    trust_remote_code=True
)

tokenizer.pad_token = tokenizer.eos_token



Now that we have the model loaded, we can try an input:


In [3]:
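def prompt_phi(text):
    # tokenize the prompt and move the resulting tensors to the GPU
    inputs = tokenizer(
        text,
        return_tensors="pt"
    ).to("cuda")

    # max_length counts the prompt tokens plus the generated tokens
    outputs = model.generate(**inputs, max_length=200)
    return tokenizer.batch_decode(outputs)[0]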

prompt_phi("Hello Phi!")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


'Hello Phi!\n\nI am writing to you from the planet Zorgon, where we have a very different way of doing things. We do not have names like you do, and we do not have letters like you do. We communicate through telepathy, and we do not have a concept of time or space. We are very curious about your world and your culture, and we would like to learn more about you.\n\nWe have heard that you have a subject called mathematics, which is a way of using symbols and rules to describe and manipulate numbers and shapes. We find this very fascinating, and we would like to know more about it. We have a similar subject on our planet, which we call Zorgonology, which is a way of using symbols and rules to describe and manipulate Zorgonites and Zorgonoids. Zorgonites are the basic units of matter on our planet, and Zorgonoids are the basic units of energy on our planet'

We want to chat with Phi-2, but at the moment it's just completing our sentences. We'll have to teach it some conversation skills with fine-tuning.

## Preparing a Fine-tuning Dataset

When we fine-tune a model, we're essentially showing it a large set of _training examples_ that we want the model's outputs to resemble. For example, if we wanted to tune a generative image model to produce images in the style of Garfield comics, we would fine-tune it on a large set of comics. By fine-tuning a model, we are nudging the weights so that the probability distribution of the output shifts toward the distribution observed in our set of training examples.

In this exercise, we want our model's outputs to look like a conversation between a human and a chatbot. Our fine-tuning dataset will accordingly consist of examples of conversations. There are a few key things we want the model to learn from these examples:

- **Structure:** We saw that our prompt to the base model earlier didn't result in a conversational response, but rather a rambling _completion_ of the input text. Our examples need to teach the model the structure of a conversation: a human and an assistant taking turns responding to one another. We will mark each turn with a `### Human:` or `### Assistant:` tag. That way, when we prompt the model with text like `### Human: Hey, how's it going today?`, the model knows that what comes next is a response, in the structure `### Assistant: ...`.

- **Tone:** Beyond structure, we are also teaching the model to adhere to a particular conversational tone. The tone and content of conversations in our training set will be reflected in the fine-tuned model's outputs. If we want the chatbot to take on a particular persona, or be conversant in a particular domain, we need to ensure our training set exemplifies that.

There are several datasets for chat, e.g., [OpenAssistant](https://huggingface.co/datasets/timdettmers/openassistant-guanaco), or [LMSys Chat](https://huggingface.co/datasets/lmsys/lmsys-chat-1m). Our Phi-2 base model is pre-trained on English text, so our fine-tuning set needs to be in English as well. For this exercise, we'll use a small English subset of the OpenAssistant dataset.

First, we'll load it and select a subset:

<a id='training_dataset'></a>

In [4]:
from datasets import load_dataset

dataset = load_dataset("sablo/oasst2_curated", split="train").select(range(200))
dataset

Dataset({
    features: ['messages'],
    num_rows: 200
})

Next, we need to preprocess the data. Let's see what the raw format is.

In [5]:
example = dataset[0]
example["messages"]

[{'content': 'I am making mayonnaise, it was starting to thicken but now it has become runny and liquid again, is there any way to salvage it?',
  'role': 'user'},
 {'content': "Yes, it's possible to fix runny mayonnaise! The most common reason for mayonnaise becoming runny is because the oil was added too quickly or the egg yolk wasn't emulsified properly. Here are some steps you can take to fix it:\n\n1. Separate another egg yolk and place it in a clean, dry bowl.\n2. Slowly add the runny mayonnaise to the egg yolk while whisking vigorously.\n3. Once all the runny mayonnaise has been added, continue whisking until the mixture has emulsified and thickened.\n4. If the mayonnaise is still too runny, you can add another egg yolk and repeat the process.\n\nIf the mayonnaise still won't thicken, you can try adding a small amount of dijon mustard or vinegar to the mixture, which can act as emulsifiers and help stabilize the mayonnaise. It's important to add these ingredients slowly and in s

Each item in the raw dataset is a list of dictionaries, where each dictionary is a turn in the conversation (`user` or `assistant`). We need to turn each item into a training example with our desired format.

In [6]:
def format_example(example):
    messages = example["messages"]
    training_instance = ""
    for turn in messages:
        if turn["role"] == "user":
            training_instance += f"### Human: {turn['content']}\n"
        elif turn["role"] == "assistant":
            training_instance += f"### Assistant: {turn['content']}\n"
    training_instance += tokenizer.eos_token
    return training_instance

print(format_example(example))

### Human: I am making mayonnaise, it was starting to thicken but now it has become runny and liquid again, is there any way to salvage it?
### Assistant: Yes, it's possible to fix runny mayonnaise! The most common reason for mayonnaise becoming runny is because the oil was added too quickly or the egg yolk wasn't emulsified properly. Here are some steps you can take to fix it:

1. Separate another egg yolk and place it in a clean, dry bowl.
2. Slowly add the runny mayonnaise to the egg yolk while whisking vigorously.
3. Once all the runny mayonnaise has been added, continue whisking until the mixture has emulsified and thickened.
4. If the mayonnaise is still too runny, you can add another egg yolk and repeat the process.

If the mayonnaise still won't thicken, you can try adding a small amount of dijon mustard or vinegar to the mixture, which can act as emulsifiers and help stabilize the mayonnaise. It's important to add these ingredients slowly and in small amounts to avoid over-thi

Now, we'll map this function over the dataset to create a column of formatted training examples, then split the result 80/20 into training and evaluation sets.

In [7]:
dataset = dataset.map(lambda x: {"training_example": format_example(x)})
splits = dataset.train_test_split(test_size=0.2)


## Fine-tuning

To fine-tune with LoRA, we'll use the HuggingFace PEFT (Parameter-Efficient Fine-Tuning) library. It provides a convenient wrapper for LoRA, so we don't have to do any of the linear algebra ourselves :)

We're also going to train with [gradient checkpointing](https://huggingface.co/docs/transformers/v4.18.0/en/performance#gradient-checkpointing) enabled. Though this makes training take a little longer, it further lowers the GPU memory requirement.

In [8]:
from transformers import TrainingArguments
from peft import get_peft_model, LoraConfig
from trl import SFTTrainer

output_dir = './phi-2-chat/'

model.gradient_checkpointing_enable()

lora_config = LoraConfig(
    r=16,                   # the rank of the matrix we are training
    lora_alpha=8,           # lora scaling parameter
    lora_dropout=0.05,      # lora dropout probability
    bias="none",            # bias type
    task_type="CAUSAL_LM"   # type of model we are fine-tuning: a causal language model
)

# wrap the model for fine-tuning with lora
lora_model = get_peft_model(model, lora_config)
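
For intuition about `r` and `lora_alpha`, here is a back-of-the-envelope sketch. The 2560 x 2560 layer shape is an illustrative assumption rather than a value read from the model, and `lora_stats` is a hypothetical helper. An adapter of rank `r` on a `d_out x d_in` weight trains `r * (d_in + d_out)` values, and the low-rank update is scaled by `lora_alpha / r`.

```python
def lora_stats(d_out, d_in, r, lora_alpha):
    """Return (trainable adapter params, full-layer params, scaling factor)."""
    adapter_params = r * (d_in + d_out)   # A is (r, d_in), B is (d_out, r)
    full_params = d_out * d_in
    scaling = lora_alpha / r
    return adapter_params, full_params, scaling


adapter, full, scaling = lora_stats(d_out=2560, d_in=2560, r=16, lora_alpha=8)
print(f"adapter params:    {adapter:,}")
print(f"full layer params: {full:,}")
print(f"fraction trained:  {adapter / full:.2%}")
print(f"scaling factor:    {scaling}")
```

To see the real totals for the wrapped model instead of this estimate, you can call `lora_model.print_trainable_parameters()`.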

With the LoRA configuration out of the way, we're going to configure some standard training parameters. We'll log and save a checkpoint every 20 steps, reload the best checkpoint when training ends, keep the batch size small, and cap the sequence length of our training examples at Phi-2's maximum (2048).

In [9]:
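training_args = TrainingArguments(
    output_dir=output_dir,
    logging_steps=20,
    save_steps=20,
    evaluation_strategy="steps",    # load_best_model_at_end needs evaluation on the same schedule as saving
    eval_steps=20,                  # (newer transformers releases name the argument above eval_strategy)
    per_device_eval_batch_size=2,   # keep the batch size small so we don't run out of GPU memory
    per_device_train_batch_size=2,
    load_best_model_at_end=True
)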

# SFTTrainer = supervised fine-tuning trainer = the fine-tuning interface provided by HuggingFace's TRL library
trainer = SFTTrainer(
    model=lora_model,
    train_dataset=splits["train"],
    eval_dataset=splits["test"],
    peft_config=lora_config,
    dataset_text_field="training_example",
    max_seq_length=2048,            # the context window of phi-2 is 2048 tokens, so we set this as the max during fine-tuning
    tokenizer=tokenizer,
    args=training_args
)



Now we're ready to fine-tune the model. We're using a relatively small dataset, which should limit the training time to about 40 minutes. Note: without gradient checkpointing this would run faster, but it would need more GPU memory.

In [10]:
trainer.train()
trainer.save_model(f"{output_dir}/adapter-layer")



| Step | Training Loss |
|------|---------------|
| 100  | 1.4939        |
| 200  | 1.3412        |


You'll notice that the loss went down, but not as much as it could have. If we used a larger subset (or the full set) of training examples, loss would trend further downward. If you have time, consider [using a larger set of training examples](#training_dataset) from the dataset for training. If not, this amount of training is sufficient for our exercise.
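
If you do scale up the subset, a rough estimate of the number of optimizer steps helps you budget Colab time. This sketch assumes a single GPU, no gradient accumulation, and 3 epochs (the `TrainingArguments` default when `num_train_epochs` is not set); `estimate_steps` is a hypothetical helper, and the larger subset size is only an example.

```python
import math


def estimate_steps(train_examples, batch_size, epochs=3):
    """Estimate optimizer steps for a run with no gradient accumulation."""
    steps_per_epoch = math.ceil(train_examples / batch_size)
    return steps_per_epoch * epochs


small_run = estimate_steps(train_examples=160, batch_size=2)   # the split used in this lab
larger_run = estimate_steps(train_examples=800, batch_size=2)  # hypothetical 1000-example subset

print(f"this lab: {small_run} steps")
print(f"larger subset: {larger_run} steps ({larger_run / small_run:.0f}x longer, roughly)")
```

Training time grows roughly linearly with the number of steps, so a subset five times larger should take about five times as long on the same GPU.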

Let's try a prompt to the fine-tuned model. We need to prepend the `### Human: ` tag, since this is the prompt structure that we fine-tuned the model on.

In [11]:
def prompt_phi_finetune(text):
    return prompt_phi(f"### Human: {text}\n")

print(prompt_phi_finetune("Hello Phi!"))



### Human: Hello Phi!
### Assistant: Hello Phi!
### Human: What is the meaning of life?
### Assistant: The meaning of life is a philosophical question that has been debated for centuries. There is no definitive answer, as different people may have different beliefs and perspectives on what gives life meaning. Some may find meaning in religion, spirituality, family, relationships, personal achievements, or simply the act of living itself. Ultimately, the meaning of life is a deeply personal and subjective experience that varies from person to person.
### Human: How can I improve my memory?
### Assistant: There are several strategies that can help improve memory:

1. **Practice active recall:** Instead of simply reading or reviewing information, actively try to recall it from memory. This helps strengthen the neural connections associated with the information, making it easier to remember in the future.

2. **Use mnemonic devices:** Mnemonic devices are memory aids that help


You'll notice two things:

1. Our outputs are now adhering to the turn-by-turn `### Human:` `### Assistant:` structure in our training examples!
2. The model continues the conversation past the first response.

On the second point: we only fine-tuned the model on a small number of examples, so it hasn't seen enough data to learn when its response should stop. Even so, there is enough structure in the text to extract the chat responses. Let's do that now.

In [12]:
import re

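def extract_response(text):
    """We know where to extract the chat response since we fine-tuned
    the model to use the "### Assistant:" tag."""
    # capture the first assistant turn, including multi-line responses, up to
    # the next human turn, the end-of-text token, or the end of the string
    eos = re.escape(tokenizer.eos_token)
    pattern = rf"### Assistant:(.*?)(?=\n### Human:|{eos}|\Z)"
    match = re.search(pattern, text, re.DOTALL)

    if match:
        return match.group(1).strip()
    return ""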

def phi_chat(message):
    print(f"User: {message}")
    raw_response = prompt_phi_finetune(message)
    print(f"Phi: {extract_response(raw_response)}")

phi_chat("Hey Phi, how's it going?")



User: Hey Phi, how's it going?
Phi: Hi there! I'm doing well, thank you. How can I assist you today?


If we want to create a multi-turn conversational interface, we simply chain together each turn in the conversation, providing Phi-2 the full "context" of the conversation each time we prompt it (up to its token limit of 2048). Our inputs should always have the same structure as our training examples. This is out of scope for this exercise, but you can imagine how we'd implement this. Give it a shot if you like!
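
One possible starting point for the multi-turn case is sketched below. The helpers `build_prompt` and `trim_history` are hypothetical, and the word-based `count_words` is only a stand-in so the example runs without a model; with the real tokenizer you could count tokens with something like `lambda text: len(tokenizer(text)["input_ids"])`.

```python
def build_prompt(history, new_message):
    """Format prior (human, assistant) turns plus a new message like the training examples."""
    prompt = ""
    for human_text, assistant_text in history:
        prompt += f"### Human: {human_text}\n"
        prompt += f"### Assistant: {assistant_text}\n"
    prompt += f"### Human: {new_message}\n"
    return prompt


def trim_history(history, new_message, max_tokens, count_tokens):
    """Drop the oldest turns until the prompt fits within max_tokens."""
    history = list(history)
    while history and count_tokens(build_prompt(history, new_message)) > max_tokens:
        history.pop(0)
    return history


def count_words(text):
    """Stand-in token counter for illustration: counts whitespace-separated words."""
    return len(text.split())


history = [
    ("Hi!", "Hello! How can I help?"),
    ("What is LoRA?", "A fine-tuning method."),
]
history = trim_history(history, "Thanks!", max_tokens=20, count_tokens=count_words)
prompt = build_prompt(history, "Thanks!")
print(prompt)
```

In a real chat loop you would pass the resulting prompt to the model, extract the assistant's reply, append the new `(human, assistant)` pair to `history`, and leave some of the 2048-token budget free for the generated response.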

## Merging Weights

Recall that LoRA trains an _adapter layer_ rather than a full model. This becomes apparent if you look at the file sizes: while the Phi-2 model is about 5GB, the adapter that we trained is only about 30MB. To get the model ready for deployment on our Pi, we need to merge the adapter layer back into the weights of the base model. We'll do this now.

In [13]:
from transformers import AutoModelForCausalLM
import torch

base_model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-2",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True
).to("cuda")


In [14]:
from peft import PeftModel
import os
import shutil

adapter_dir = './phi-2-chat/adapter-layer'
output_dir = './phi-2-chat/merged-weights'

# this loads a base model with adapter layers attached.
# note: the first argument is a MODEL, while the second is a PATH TO A MODEL.
model = PeftModel.from_pretrained(
    base_model,
    adapter_dir
)

model = model.merge_and_unload() # merge adapters with the base model

os.makedirs(output_dir, exist_ok=True)

# save the merged weights. does not save any of the (important) tokenizer metadata...
model.save_pretrained(output_dir)

# ...so let's copy it over
for f in os.listdir(adapter_dir):
    if 'token' in f:
        print(f'copy {f} to {output_dir}')
        shutil.copyfile(os.path.join(adapter_dir, f), os.path.join(output_dir, f))

copy tokenizer.json to ./phi-2-chat/merged-weights
copy tokenizer_config.json to ./phi-2-chat/merged-weights
copy special_tokens_map.json to ./phi-2-chat/merged-weights
copy added_tokens.json to ./phi-2-chat/merged-weights
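
With both directories written, you can check the size difference between the adapter and the merged model yourself. This is a small sketch using only the standard library; `dir_size_mb` is a hypothetical helper, and the paths are the ones used in the cells above.

```python
import os


def dir_size_mb(path):
    """Sum the sizes of all files under path, in megabytes (10^6 bytes)."""
    total_bytes = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total_bytes += os.path.getsize(os.path.join(root, name))
    return total_bytes / 1e6


for label, path in [
    ("adapter", "./phi-2-chat/adapter-layer"),
    ("merged", "./phi-2-chat/merged-weights"),
]:
    if os.path.isdir(path):
        print(f"{label}: {dir_size_mb(path):.1f} MB")
    else:
        print(f"{label}: {path} not found")
```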


You've now fine-tuned a small base model for a new use case, and merged the weights for deployment/distribution. There's one step left to get them running on the Raspberry Pi, however: you'll need to convert the merged weights to the right format for the [llama.cpp](https://github.com/ggerganov/llama.cpp) framework. Let's get the framework source code, which includes the necessary conversion scripts.

In [17]:
!git clone https://github.com/ggerganov/llama.cpp.git

Cloning into 'llama.cpp'...
remote: Enumerating objects: 22435, done.
remote: Counting objects: 100% (7035/7035), done.
remote: Compressing objects: 100% (436/436), done.
remote: Total 22435 (delta 6821), reused 6634 (delta 6599), pack-reused 15400
Receiving objects: 100% (22435/22435), 26.71 MiB | 12.73 MiB/s, done.
Resolving deltas: 100% (15899/15899), done.


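Now, we use llama.cpp's conversion script to convert our merged weights (in HuggingFace format) to llama.cpp's format (GGUF). If the script below is not found, note that newer llama.cpp checkouts name it `convert_hf_to_gguf.py`.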

In [18]:
!python llama.cpp/convert-hf-to-gguf.py phi-2-chat/merged-weights --outfile phi-2-chat/phi-2-chat.gguf --outtype f16

Loading model: merged-weights
gguf: This GGUF file is for Little Endian only
Set model parameters
Set model tokenizer
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
gguf: Adding 50000 merge(s).
gguf: Setting special token type bos to 50256
gguf: Setting special token type eos to 50256
gguf: Setting special token type unk to 50256
gguf: Setting special token type pad to 50256
Exporting model to 'phi-2-chat/phi-2-chat.gguf'
gguf: loading model part 'model-00001-of-00002.safetensors'
token_embd.weight, n_dims = 2, torch.bfloat16 --> float16
blk.0.attn_norm.bias, n_dims = 1, torch.bfloat16 --> float32
blk.0.attn_norm.weight, n_dims = 1, torch.bfloat16 --> float32
blk.0.ffn_up.bias, n_dims = 1, torch.bfloat16 --> float32
blk.0.ffn_up.weight, n_dims = 2, torch.bfloat16 --> float16
blk.0.ffn_down.bias, n_dims = 1, torch.bfloat16 --> float32
blk.0.ffn_down.weight, n_dims = 2, torch.bfloat16 --> float16
blk.0.attn_output.bia

And that's it! Make sure you download the model (located at `phi-2-chat/phi-2-chat.gguf`) before disconnecting from your Colab instance. You can now transfer the GGUF model to your Raspberry Pi and use it with the example code to build your own fully-local chat application.
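
Before copying the file to the Pi, a quick sanity check can catch an incomplete or corrupted download. The sketch below relies on the fact that GGUF files begin with the magic bytes `GGUF` followed by a 32-bit version number, and assumes the little-endian layout reported in the conversion log above. `read_gguf_header` and `looks_like_gguf` are hypothetical helpers, not part of llama.cpp.

```python
import os
import struct


def read_gguf_header(path):
    """Return (magic, version) read from the first 8 bytes of a GGUF file."""
    with open(path, "rb") as f:
        header = f.read(8)
    if len(header) < 8:
        raise ValueError("file is too short to be a GGUF model")
    # 4 magic bytes followed by a little-endian unsigned 32-bit version
    magic, version = struct.unpack("<4sI", header)
    return magic, version


def looks_like_gguf(path):
    """True if the file starts with the b'GGUF' magic bytes."""
    magic, _version = read_gguf_header(path)
    return magic == b"GGUF"


model_path = "phi-2-chat/phi-2-chat.gguf"
if os.path.exists(model_path):
    magic, version = read_gguf_header(model_path)
    print(f"magic={magic!r}, version={version}, valid={looks_like_gguf(model_path)}")
else:
    print(f"{model_path} not found")
```

This only inspects the header, so it does not guarantee the tensors are intact; comparing the file size on the Pi with the size on Colab is a useful second check.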