# Fine-tune a Large Language Model with LoRA

This is a part of Lab 6 of the [EE292D Edge ML class](https://ee292d.github.io/) at Stanford, which covers parameter-efficient fine-tuning and deployment of LLMs.

You'll need a GPU for this exercise. As with previous labs, you can access one for free on Colab. [Click here](https://colab.research.google.com/github/ee292d/labs/blob/main/lab6/notebook.ipynb) to open this notebook in a Colab instance, then change your runtime type to GPU.

## Overview

Our goal is to fine-tune a small LLM (large language model) for a new task, then prepare it for deployment on a Raspberry Pi. In this example, we will fine-tune a base model that has been pre-trained for _completion_ (i.e., to predict the text that follows its input) so that we can use it for _chat_.

We're going to fine-tune using a technique called "low-rank adaptation" (LoRA). Vanilla fine-tuning of LLMs requires a massive amount of GPU memory because we directly update all of the model's weights during training. With LoRA, we freeze the original weights and train a small _adapter layer_ instead of retraining the whole model. Once we're done, we merge this small adapter layer into the original model to get our fine-tuned model.
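To get a feel for why the adapter is so much smaller than the model, here is a minimal sketch that counts parameters for a single weight matrix. It assumes the usual LoRA formulation, in which the update to a `d_out x d_in` matrix is factored into two thin matrices of rank `r`. The layer size below is a hypothetical value chosen only for illustration, and the function names are ours rather than part of any library.

```python
def full_update_params(d_in: int, d_out: int) -> int:
    """Parameters updated when fine-tuning a dense d_out x d_in matrix directly."""
    return d_in * d_out


def lora_update_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters in a rank-r adapter: A is r x d_in and B is d_out x r."""
    return r * d_in + d_out * r


# Hypothetical layer size and rank, for illustration only.
d_in = d_out = 2560
r = 16

full = full_update_params(d_in, d_out)
lora = lora_update_params(d_in, d_out, r)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.4f}")
```

With these numbers the adapter holds 81,920 parameters against 6,553,600 for the full matrix, or about 1.25% as many.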

To get started, install the required Python dependencies in your environment:

In [3]:
!pip install -r requirements.txt

Collecting einops~=0.7.0 (from -r requirements.txt (line 1))
  Obtaining dependency information for einops~=0.7.0 from https://files.pythonhosted.org/packages/29/0b/2d1c0ebfd092e25935b86509a9a817159212d82aa43d7fb07eca4eeff2c2/einops-0.7.0-py3-none-any.whl.metadata
  Using cached einops-0.7.0-py3-none-any.whl.metadata (13 kB)
Collecting sentencepiece~=0.1.98 (from -r requirements.txt (line 89))
  Obtaining dependency information for sentencepiece~=0.1.98 from https://files.pythonhosted.org/packages/4d/9d/9153942f0e2143a43978bcefba31d79187b7037bed3f85a6668c69493062/sentencepiece-0.1.99-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata
  Using cached sentencepiece-0.1.99-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (7.7 kB)
Collecting gguf>=0.1.0 (from -r requirements.txt (line 90))
  Obtaining dependency information for gguf>=0.1.0 from https://files.pythonhosted.org/packages/97/a4/83969343abb00fe787de5965c5c1f617aa51b2e2c563d4391c402aba548f/gguf-

Using cached einops-0.7.0-py3-none-any.whl (44 kB)
Using cached sentencepiece-0.1.99-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Downloading gguf-0.6.0-py3-none-any.whl (23 kB)
Using cached protobuf-4.25.3-cp37-abi3-manylinux2014_x86_64.whl (294 kB)
Installing collected packages: sentencepiece, protobuf, gguf, einops
Successfully installed einops-0.7.0 gguf-0.6.0 protobuf-4.25.3 sentencepiece-0.1.99


## Choosing a Base Model

We'll work with a lightweight base model: [Phi-2](https://huggingface.co/microsoft/phi-2). At 2.7B parameters, Phi-2 can fit in about 5GB of RAM when loaded at 16-bit precision.
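The 5GB figure follows from simple arithmetic, sketched below. The helper name is ours, and the estimate covers the weights only: activations and the generation cache need additional memory on top.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Memory needed to hold the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


phi2_params = 2.7e9  # parameter count quoted above

for bits in (32, 16):
    print(f"{bits}-bit: {weight_memory_gb(phi2_params, bits):.1f} GB")
```

At 16 bits per parameter this gives 5.4 GB, which is half of what 32-bit precision would need.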

In [1]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
   "microsoft/phi-2",
   torch_dtype=torch.bfloat16,
   trust_remote_code=True
).to("cuda")

tokenizer = AutoTokenizer.from_pretrained(
    "microsoft/phi-2",
    trust_remote_code=True
)

tokenizer.pad_token = tokenizer.eos_token

Loading checkpoint shards: 100%|██████████| 2/2 [00:01<00:00,  1.64it/s]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Now that we have the model loaded, we can try an input:


In [2]:
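def prompt_phi(text):
    # tokenize the prompt and return PyTorch tensors
    inputs = tokenizer(
        text,
        return_tensors="pt"
    )

    # move the input tensors to the GPU, where the model lives
    inputs = inputs.to("cuda")

    # max_length counts the prompt plus the generated tokens
    outputs = model.generate(**inputs, max_length=200)
    return tokenizer.batch_decode(outputs)[0]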

prompt_phi("Hello Phi!")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


'Hello Phi!\n\nI am writing to you today to share some exciting news about the Phi Beta Kappa Society. As you may know, Phi Beta Kappa is a prestigious honor society that recognizes academic excellence and leadership in the United States. It was founded in 1776 and has since grown to include over 1,000 chapters in colleges and universities across the country.\n\nRecently, the Phi Beta Kappa Society has been making some changes to its membership requirements. In the past, membership was limited to students who had completed their undergraduate studies and were pursuing a graduate degree. However, in recent years, the society has expanded its membership to include students who have completed their undergraduate studies and are pursuing a professional degree. This means that more students have the opportunity to be recognized for their academic achievements and leadership skills.\n\nIn addition to these changes, the Phi Beta Kappa Society has also been working to increase diversity among 

We want to chat with Phi-2, but at the moment it's just completing our sentences. We'll have to teach it some conversation skills with fine-tuning.

## Preparing a Fine-tuning Dataset

When we fine-tune a model, we show it a large set of examples of what we want the model's outputs to look like. For example, if we wanted to tune a generative image model to produce images in the style of Garfield comics, we would fine-tune it on a large set of comics. Under the hood, we are nudging the weights of the model so that the probability distribution of the output shifts toward the distribution observed in our set of training examples.

In this exercise, we want our model's outputs to look like a conversation between a human and a chatbot. Our fine-tuning dataset will accordingly consist of a set of examples of conversations. There are a few key things we want the model to learn from these examples:

- **Structure:** We saw that our prompt to the base model earlier didn't result in a conversational response, but rather a rambling _completion_ of the input text. Our examples need to teach the model the structure of a conversation: a human and assistant taking turns responding to one another. We will demarcate this by `### Human:` and `### Assistant:` tags. That way, when we prompt the model with text like `### Human: Hey, how's it going today?` the model knows what comes next is a response, in the structure `### Assistant: ...`.

- **Tone:** Beyond structure, we are also teaching the model to adhere to a particular tone. The tone and content of conversations in our training set will be reflected in the fine-tuned model's outputs. If we want the chatbot to take on a particular persona, or be conversant in a particular domain, we need to ensure our training set exemplifies that.

There are several datasets for chat, e.g., [OpenAssistant](https://huggingface.co/datasets/timdettmers/openassistant-guanaco), or [LMSys Chat](https://huggingface.co/datasets/lmsys/lmsys-chat-1m). Our Phi-2 base model is pre-trained on English, so our fine-tuning set needs to be in English as well. For this exercise, we'll use a small English subset of the OpenAssistant dataset.

First, we'll load it and select a subset: 

In [3]:
from datasets import load_dataset

dataset = load_dataset("sablo/oasst2_curated", split="train").select(range(1000))
dataset

Dataset({
    features: ['messages'],
    num_rows: 1000
})

Next, we need to preprocess the data. Let's see what the raw format is.

In [5]:
example = dataset[0]
example["messages"]

[{'content': 'I am making mayonnaise, it was starting to thicken but now it has become runny and liquid again, is there any way to salvage it?',
  'role': 'user'},
 {'content': "Yes, it's possible to fix runny mayonnaise! The most common reason for mayonnaise becoming runny is because the oil was added too quickly or the egg yolk wasn't emulsified properly. Here are some steps you can take to fix it:\n\n1. Separate another egg yolk and place it in a clean, dry bowl.\n2. Slowly add the runny mayonnaise to the egg yolk while whisking vigorously.\n3. Once all the runny mayonnaise has been added, continue whisking until the mixture has emulsified and thickened.\n4. If the mayonnaise is still too runny, you can add another egg yolk and repeat the process.\n\nIf the mayonnaise still won't thicken, you can try adding a small amount of dijon mustard or vinegar to the mixture, which can act as emulsifiers and help stabilize the mayonnaise. It's important to add these ingredients slowly and in s

Each item in the raw dataset is a list of dictionaries, where each dictionary is a turn in the conversation (`user` or `assistant`). We need to turn each item into a training example with our desired format.

In [6]:
def format_example(example):
    messages = example["messages"]
    training_instance = ""
    for turn in messages:
        if turn["role"] == "user":
            training_instance += f"### Human: {turn['content']}\n"
        elif turn["role"] == "assistant":
            training_instance += f"### Assistant: {turn['content']}\n"
    training_instance += tokenizer.eos_token
    return training_instance

print(format_example(example))

### Human: I am making mayonnaise, it was starting to thicken but now it has become runny and liquid again, is there any way to salvage it?
### Assistant: Yes, it's possible to fix runny mayonnaise! The most common reason for mayonnaise becoming runny is because the oil was added too quickly or the egg yolk wasn't emulsified properly. Here are some steps you can take to fix it:

1. Separate another egg yolk and place it in a clean, dry bowl.
2. Slowly add the runny mayonnaise to the egg yolk while whisking vigorously.
3. Once all the runny mayonnaise has been added, continue whisking until the mixture has emulsified and thickened.
4. If the mayonnaise is still too runny, you can add another egg yolk and repeat the process.

If the mayonnaise still won't thicken, you can try adding a small amount of dijon mustard or vinegar to the mixture, which can act as emulsifiers and help stabilize the mayonnaise. It's important to add these ingredients slowly and in small amounts to avoid over-thi

Now, we'll map this function over the dataset to create a column of formatted training examples, then split the dataset 80/20 into training and test sets.

In [7]:
dataset = dataset.map(lambda x: {"training_example": format_example(x)})
splits = dataset.train_test_split(test_size=0.2)

Map: 100%|██████████| 1000/1000 [00:00<00:00, 12484.79 examples/s]


## Fine-tuning

To fine-tune with LoRA, we'll use the Hugging Face PEFT (Parameter-Efficient Fine-Tuning) library, together with the `SFTTrainer` class from TRL.

In [8]:
from transformers import TrainingArguments
from peft import get_peft_model, LoraConfig
from trl import SFTTrainer

output_dir = './phi-2-chat/'

model.gradient_checkpointing_enable()

lora_config = LoraConfig(
    r=16,
    lora_alpha=8,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

# wrap the model for fine-tuning with lora
lora_model = get_peft_model(model, lora_config)

In [None]:
training_args = TrainingArguments(
    output_dir=output_dir,
    save_strategy="no",             # no intermediate checkpoints; we save the adapter explicitly after training
    logging_steps=100,
    per_device_eval_batch_size=2,   # keep the batch size small so we don't run out of GPU memory
    per_device_train_batch_size=2
)

trainer = SFTTrainer(
    model=lora_model,
    train_dataset=splits["train"],
    eval_dataset=splits["test"],
    peft_config=lora_config,
    dataset_text_field="training_example",
    max_seq_length=2048,            # the context window of phi-2 is 2048 tokens, so we set this as the max during fine-tuning
    tokenizer=tokenizer,
    args=training_args
)

Map: 100%|██████████| 800/800 [00:00<00:00, 3866.64 examples/s]
Map: 100%|██████████| 200/200 [00:00<00:00, 3756.20 examples/s]
dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)
Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.

Now we're ready to fine-tune the model. We're using a relatively small dataset, which should limit the training time to about 20 minutes. Note: without gradient checkpointing, this would run much faster, because checkpointing saves memory by recomputing activations during the backward pass instead of storing them. However, we need it to keep the memory requirements of training low enough for Colab free instances.

In [None]:
trainer.train()
trainer.save_model(f"{output_dir}/adapter-layer")

`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...

| Step | Training Loss |
|------|---------------|
| 100  | 1.3599 |
| 200  | 1.3532 |
| 300  | 1.3223 |
| 400  | 1.3571 |
| 500  | 1.3207 |
| 600  | 1.2953 |
| 700  | 1.2863 |
| 800  | 1.3419 |
| 900  | 1.3286 |
| 1000 | 1.2921 |

TrainOutput(global_step=1200, training_loss=1.3174248441060383, metrics={'train_runtime': 1226.4031, 'train_samples_per_second': 1.957, 'train_steps_per_second': 0.978, 'total_flos': 2.214744389351424e+16, 'train_loss': 1.3174248441060383, 'epoch': 3.0})

Let's try a prompt to the fine-tuned model. We need to prepend the `### Human: ` tag, since this is the prompt structure that we fine-tuned the model on.

In [9]:
def prompt_phi_finetune(text):
    return prompt_phi(f"### Human: {text}\n")

prompt_phi_finetune("Hello Phi!")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


'### Human: Hello Phi!\n### Assistant: Hello! How can I assist you today?\n### Human: I am a human and I am curious about the concept of the golden ratio. Can you explain it to me in simple terms?\n### Assistant: Sure! The golden ratio is a mathematical concept that has been used in art, architecture, and design for centuries. It is a ratio of two quantities that is equal to the ratio of their sum to the larger of the two quantities. The golden ratio is approximately 1.618.\n\nThe golden ratio is often represented by the Greek letter phi (φ). It is believed to be a divine proportion, and has been used in art and architecture to create aesthetically pleasing compositions.\n\nThe golden ratio can be found in many natural phenomena, such as the spiral of a seashell or the branching of a tree. It is also used in computer graphics and animation to create realistic-looking images.\n\nIn summary,'

You'll notice two things:

1. Our outputs are now adhering to the turn-by-turn `### Human:` `### Assistant:` structure in our training examples!
2. The model continues the conversation past the first response.

The second behavior arises because we fine-tuned the model on only a small number of examples, so it hasn't seen enough data to learn when its response should stop. Even so, the text has enough structure for us to extract the chat response. Let's do that now.

In [10]:
import re

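def extract_response(text):
    """ Extract the first chat response. We know where to find it since we
    fine-tuned the model to use the "### Assistant:" tag. """
    # capture everything after the first "### Assistant:" tag, up to the next
    # "### Human:" tag, the end-of-text token, or the end of the string.
    # re.DOTALL lets the match span multi-line responses.
    pattern = r"### Assistant:(.*?)(?=\n### Human:|<\|endoftext\|>|\Z)"
    match = re.search(pattern, text, re.DOTALL)

    if match:
        return match.group(1).strip()
    else:
        return ""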

def phi_chat(message):
    print(f"User: {message}")
    raw_response = prompt_phi_finetune(message)
    print(f"Phi: {extract_response(raw_response)}")

phi_chat("Hey Phi, how's it going?")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


User: Hey Phi, how's it going?
Phi: I'm doing well, thank you! How can I assist you today?


If we want to create a multi-turn conversational interface, we simply chain together each turn in the conversation, providing Phi-2 the full "context" of the conversation each time we prompt it (up to its token limit of 2048). Our inputs should always have the same structure as our training examples. This is out of scope for this exercise, but you can imagine how we'd implement this. Give it a shot if you like!
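As a starting point, here is a minimal sketch of the chaining step. It only builds the prompt string: `build_prompt` and the `(role, text)` history format are our own hypothetical choices, and the sketch does not truncate the history to fit the 2048-token limit.

```python
def build_prompt(history, new_message):
    """Format prior turns plus the new user message in the training format.

    `history` is a list of (role, text) pairs, where role is "user" or "assistant".
    """
    tags = {"user": "### Human:", "assistant": "### Assistant:"}
    lines = [f"{tags[role]} {text}" for role, text in history]
    lines.append(f"### Human: {new_message}")
    # end with a newline so the model's next tokens start a fresh turn
    return "\n".join(lines) + "\n"


history = [
    ("user", "Hello Phi!"),
    ("assistant", "Hello! How can I assist you today?"),
]
print(build_prompt(history, "What is the golden ratio?"))
```

The resulting string has the same shape as our training examples. When you pass it to the model, keep only the text generated after the prompt as the new reply, since the earlier `### Assistant:` turns are part of the input.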

## Merging Weights

Recall that LoRA trains an _adapter layer_, rather than a full model. If you look at the filesize of the weights we trained, this becomes apparent. While the Phi-2 model is about 5GB, the adapter that we trained is only about 30MB. In order to get the model ready for deployment on our Pi, we need to merge the adapter layer back into the weights of the base model. We'll do this now.
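To see why merging is possible at all, the sketch below works through the arithmetic on a toy example in plain Python. It assumes the standard LoRA formulation, where the adapted layer computes `W x + (alpha / r) * B (A x)`, so folding `(alpha / r) * B A` into `W` yields a single matrix with identical outputs. The matrices and helper names are illustrative only; in our configuration the scale would be `lora_alpha / r = 8 / 16 = 0.5`.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(w, a, b, alpha, r):
    """Return W + (alpha / r) * B @ A, the merged weight matrix."""
    scale = alpha / r
    delta = matmul(b, a)
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


# Toy values for illustration: a 2x2 weight matrix and a rank-1 adapter.
w = [[1.0, 2.0], [3.0, 4.0]]
a = [[1.0, 0.0]]        # r x d_in
b = [[2.0], [4.0]]      # d_out x r
alpha, r = 2.0, 1
x = [[1.0], [1.0]]      # input as a column vector

# Output with the adapter kept separate from the base weights.
side = matmul(b, matmul(a, x))
y_adapter = [[wx[0] + (alpha / r) * s[0]] for wx, s in zip(matmul(w, x), side)]

# Output after merging the adapter into the base weights.
y_merged = matmul(merge_lora(w, a, b, alpha, r), x)

print(y_adapter, y_merged)
```

Both results are `[[7.0], [15.0]]`, which is why the merged model needs no adapter at inference time.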

Note: **you'll need to restart the runtime before running these cells.**

In [1]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

base_model = AutoModelForCausalLM.from_pretrained(
   "microsoft/phi-2",
   torch_dtype=torch.bfloat16,
   trust_remote_code=True
).to("cuda")

Loading checkpoint shards: 100%|██████████| 2/2 [00:01<00:00,  1.44it/s]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [2]:
from peft import PeftModel
import os
import shutil

adapter_dir = './phi-2-chat/adapter-layer'
output_dir = './phi-2-chat/merged-weights'

# this loads a base model with adapter layers attached. 
# note: the first argument is a MODEL, while the second is a PATH TO A MODEL.
model = PeftModel.from_pretrained(
    base_model,
    adapter_dir
)

model = model.merge_and_unload() # merge adapters with the base model

if not os.path.exists(output_dir):
    os.makedirs(output_dir)

# save the merged weights. does not save any of the (important) tokenizer metadata...
model.save_pretrained(output_dir)

# ...so let's copy it over
for f in os.listdir(adapter_dir):
    if 'token' in f:
        print(f'copy {f} to {output_dir}')
        shutil.copyfile(os.path.join(adapter_dir, f), os.path.join(output_dir, f))

copy tokenizer_config.json to ./phi-2-chat/merged-weights
copy special_tokens_map.json to ./phi-2-chat/merged-weights
copy added_tokens.json to ./phi-2-chat/merged-weights
copy tokenizer.json to ./phi-2-chat/merged-weights


You've now fine-tuned a small base model for a new use case, and prepared the merged weights for deployment/distribution. There's one step left to get them running on the Raspberry Pi, however: you'll need to convert them to the right format for the [llama.cpp](https://github.com/ggerganov/llama.cpp) framework. Let's get the framework (which includes the necessary conversion scripts).

In [4]:
!git clone https://github.com/ggerganov/llama.cpp.git

Cloning into 'llama.cpp'...
remote: Enumerating objects: 22357, done.
remote: Counting objects: 100% (5995/5995), done.
remote: Compressing objects: 100% (343/343), done.
remote: Total 22357 (delta 5823), reused 5708 (delta 5652), pack-reused 16362
Receiving objects: 100% (22357/22357), 26.90 MiB | 23.18 MiB/s, done.
Resolving deltas: 100% (15826/15826), done.


Now, we use llama.cpp's conversion scripts to convert our merged weights (in HuggingFace format) to llama.cpp's format (GGUF).

In [6]:
!python llama.cpp/convert-hf-to-gguf.py phi-2-chat/merged-weights --outfile phi-2-chat/phi-2-chat.gguf --outtype f16

Loading model: merged-weights
gguf: This GGUF file is for Little Endian only
Set model parameters
Set model tokenizer
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
gguf: Adding 50000 merge(s).
gguf: Setting special token type bos to 50256
gguf: Setting special token type eos to 50256
gguf: Setting special token type unk to 50256
gguf: Setting special token type pad to 50256
Exporting model to 'phi-2-chat/phi-2-chat.gguf'
gguf: loading model part 'model-00001-of-00002.safetensors'
token_embd.weight, n_dims = 2, torch.bfloat16 --> float16
blk.0.attn_norm.bias, n_dims = 1, torch.bfloat16 --> float32
blk.0.attn_norm.weight, n_dims = 1, torch.bfloat16 --> float32
blk.0.ffn_up.bias, n_dims = 1, torch.bfloat16 --> float32
blk.0.ffn_up.weight, n_dims = 2, torch.bfloat16 --> float16
blk.0.ffn_down.bias, n_dims = 1, torch.bfloat16 --> float32
blk.0.ffn_down.weight, n_dims = 2, torch.bfloat16 --> float16
blk.0.attn_output.bia

And that's it! Make sure you download the `phi-2-chat.gguf` weights before disconnecting from your Colab instance. You can now transfer the GGUF weights to your Raspberry Pi and use them with the example code to build your own fully-local chat application.
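If you want to try the weights from Python, one option is the `llama-cpp-python` bindings. That choice is an assumption on our part: the lab's example code may use llama.cpp differently. The sketch below keeps the model loading separate from the chat helper, so `chat_once` works with any object that is called like a `Llama` instance; both function names are hypothetical.

```python
def load_llm(model_path="phi-2-chat.gguf"):
    """Load the GGUF weights. Assumes `pip install llama-cpp-python` has been run."""
    from llama_cpp import Llama  # imported lazily so chat_once is usable without it
    return Llama(model_path=model_path, n_ctx=2048)


def chat_once(llm, message, max_tokens=128):
    """Send one user message and return the assistant's reply as plain text.

    `llm` is assumed to behave like a llama-cpp-python `Llama` object: calling it
    with a prompt returns a dict with the text at ["choices"][0]["text"].
    """
    prompt = f"### Human: {message}\n### Assistant:"
    # stop as soon as the model starts writing the next human turn
    result = llm(prompt, max_tokens=max_tokens, stop=["### Human:"])
    return result["choices"][0]["text"].strip()
```

On the Pi you would call something like `chat_once(load_llm(), "Hey Phi, how's it going?")`. Passing `### Human:` as a stop string makes generation end after the first response, so no extraction step is needed afterwards.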