# Fine-tuning Mistral 7B using QLoRA 

Mistral 7B is a recent open-source language model developed by Mistral AI that delivers strong results for its size across a variety of natural language understanding and generation benchmarks. While this model serves as a strong baseline for multiple downstream tasks, it can lack domain-specific knowledge and has not seen proprietary or otherwise sensitive information. Fine-tuning is often used to adapt a model to a specific task or set of tasks so that it responds better to domain-specific prompts. This notebook walks through downloading the Mistral 7B model from Hugging Face, preparing a custom dataset of coding-related tasks and instructions, and using Quantized Low-Rank Adaptation (QLoRA) to fine-tune the base model on that dataset. While we focus on a coding-specific task in this example, the same methodology can be applied to other tasks as well.

This workflow is inspired by the posts and repositories [here](https://adithyask.medium.com/a-beginners-guide-to-fine-tuning-mistral-7b-instruct-model-0f39647b20fe) and [here](https://github.com/brevdev/notebooks/blob/main/mistral-finetune.ipynb).

### 0. What is LoRA? QLoRA?
For Large Language Models (LLMs), fine-tuning is the customization of pretrained models, like Mistral-7B, towards new or more domain-specific instructions and data. This process updates the model weights by retraining either all the parameters of the model (full fine-tuning) or a certain subset of them (parameter-efficient fine-tuning, or PEFT). Full fine-tuning may produce better results, but in many cases PEFT is preferred because it is less time-consuming and less resource-intensive.

Low-Rank Adaptation, or LoRA, is a PEFT method that keeps the full weight matrix frozen and instead trains a pair of much smaller matrices whose product approximates the weight update. This rank decomposition enables greater memory efficiency and can reduce the size of the GPU required to perform the fine-tuning successfully.
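
To make the savings concrete, here is a minimal sketch that compares the number of values in a full weight update with the number in its rank-`r` factorization. The 4096 x 4096 shape matches the `q_proj` layer printed later in this notebook; the helper name `lora_param_count` is hypothetical and not part of any library.

```python
def lora_param_count(d_in, d_out, r):
    """Values in the two low-rank factors: A is (r x d_in), B is (d_out x r)."""
    return r * d_in + d_out * r

d_in, d_out, r = 4096, 4096, 8
full = d_in * d_out                      # values in a full weight update
low_rank = lora_param_count(d_in, d_out, r)

print(f"full update:   {full:,} values")
print(f"rank-{r} update: {low_rank:,} values ({100 * low_rank / full:.2f}% of full)")
```

For this one layer, the rank-8 factors hold well under 1% of the values a full update would need.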

QLoRA is a further optimization that also reduces the precision of the frozen base model weights, providing even greater memory savings. The most common choice for this workflow is 4-bit quantization, which provides a decent balance between model performance and fine-tuning feasibility. In theory, incorporating these optimizations means this workflow can even work on an NVIDIA RTX 3090!
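
As a rough back-of-the-envelope sketch, the storage needed for the weights alone scales linearly with the bits per parameter. The parameter count below is an assumption (roughly 7.24 billion for Mistral 7B), and the estimate ignores activations, gradients, optimizer state, and quantization metadata, so real memory usage will be higher.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate storage for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

NUM_PARAMS = 7.24e9  # assumption: approximate parameter count of Mistral 7B

for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: ~{weight_memory_gb(NUM_PARAMS, bits):.1f} GB")
```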

Alright, enough chit-chat. Let's dive in!

First, let's select the level of quantization we would like to use for this fine-tuning project. Choose from the ``none``, ``8bit``, or ``4bit`` quantization levels. Keep in mind that the ``none`` option defaults to full 16-bit precision, which means this workflow will perform **standard LoRA** fine-tuning, while the other two options apply quantization for QLoRA fine-tuning.

In [2]:
# DEFINE QUANTIZATION HERE. Choose from ("none" | "8bit" | "4bit")
QUANTIZATION = "4bit"


Now, let's set our imports.

In [None]:
import os
import torch
import transformers
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import prepare_model_for_kbit_training, LoraConfig, get_peft_model, PeftModel

### 1. Load in the Dataset

While the pretrained Mistral model has some degree of code understanding and generation in addition to English natural language processing tasks, it still falls short in certain cases, which we will explore later in this notebook. For this workflow, we will aim to fine-tune the Mistral 7B model to generate high quality responses to code generation tasks. 

To accomplish this, we will be using [this dataset](https://huggingface.co/datasets/TokenBender/code_instructions_122k_alpaca_style) from Hugging Face that consists of 122k code instructions in the Alpaca instruction style, together with the ground truth outputs we expect our model to produce. Let's go ahead and load in the dataset and split the entries into train (80%), validation (10%), and test (10%) sets.

In [3]:
dataset = load_dataset("TokenBender/code_instructions_122k_alpaca_style", split='train')
dataset = dataset.train_test_split(test_size=0.2)
val_test_dataset = dataset['test'].train_test_split(test_size=0.5)

train_dataset = dataset["train"]
eval_dataset = val_test_dataset["train"]
test_dataset = val_test_dataset["test"]


Check that our data splits are correct.

In [4]:
print(train_dataset)
print(eval_dataset)
print(test_dataset)

Dataset({
    features: ['input', 'instruction', 'output', 'text'],
    num_rows: 97567
})
Dataset({
    features: ['input', 'instruction', 'output', 'text'],
    num_rows: 12196
})
Dataset({
    features: ['input', 'instruction', 'output', 'text'],
    num_rows: 12196
})
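
The row counts above can be reproduced by hand. This is a small sketch of the arithmetic only, assuming the test share is rounded up to a whole row and the remainder goes to the train side; `split_sizes` is a hypothetical helper, not a function from the `datasets` library.

```python
import math

def split_sizes(n_rows, test_size):
    """Return (train_rows, test_rows), assuming the test share is rounded up."""
    n_test = math.ceil(n_rows * test_size)
    return n_rows - n_test, n_test

total_rows = 97567 + 12196 + 12196                 # rows across the three printed splits
train_rows, held_out = split_sizes(total_rows, 0.2)
val_rows, test_rows = split_sizes(held_out, 0.5)   # split the held-out 20% in half

print(train_rows, val_rows, test_rows)
```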


### 2. Load In the Base Model

Now, let's load in the Mistral model from Hugging Face: `mistralai/Mistral-7B-v0.1`. We will aim to use 4-bit quantization, which significantly reduces the overall memory footprint of the fine-tuning process by reducing the precision of the model parameters while largely preserving performance. This makes it easier to run this fine-tuning workflow on smaller GPU systems, not just A100s!

In [None]:
# Pre-define quantization configs

################## 4bit ##################
bb_config_4b = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
)
##########################################

################## 8bit ##################
bb_config_8b = BitsAndBytesConfig(
    load_in_8bit=True,
)
##########################################

def quantization_config(quantization):
    if quantization == "8bit":
        return bb_config_8b
    else:
        return bb_config_4b

In [None]:
# %%capture

model_id = "mistralai/Mistral-7B-v0.1"
hf_api_token = os.environ['HUGGING_FACE_HUB_TOKEN']

if QUANTIZATION == "none":
    model = AutoModelForCausalLM.from_pretrained(model_id, token=hf_api_token).to("cuda")
else: 
    model = AutoModelForCausalLM.from_pretrained(model_id, token=hf_api_token, quantization_config=quantization_config(QUANTIZATION))

Loading checkpoint shards: 100%|██████████| 2/2 [00:03<00:00,  1.95s/it]


### 3. Evaluate Base Model Performance

Before fine-tuning the model, let's first evaluate how well the model does on sample tasks that we intend to fine-tune on, such as generating functions in code, coding syntax and semantics, and general understanding of multiple coding languages. Here, we'll ask it a fairly standard coding question: Write a function to output the prime factorization of 2023 in python, C, and C++. 

In [7]:
base_prompt = """Write a function to output the prime factorization of 2023 in python, C, and C++"""

Let's call the model and see what it outputs.

In [8]:
base_tokenizer = AutoTokenizer.from_pretrained(
    model_id,
    add_bos_token=True,
)

model_input = base_tokenizer(base_prompt, return_tensors="pt").to("cuda")

model.eval()
with torch.no_grad():
    print(base_tokenizer.decode(model.generate(**model_input, max_new_tokens=256)[0], skip_special_tokens=True))

Downloading tokenizer_config.json: 100%|██████████| 967/967 [00:00<00:00, 5.04MB/s]
Downloading tokenizer.model: 100%|██████████| 493k/493k [00:00<00:00, 6.83MB/s]
Downloading tokenizer.json: 100%|██████████| 1.80M/1.80M [00:00<00:00, 13.8MB/s]
Downloading (…)cial_tokens_map.json: 100%|██████████| 72.0/72.0 [00:00<00:00, 362kB/s]
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Write a function to output the prime factorization of 2023 in python, C, and C++.

## Prime Factorization of 2023

The prime factorization of 2023 is 13 x 157.

## Prime Factorization of 2023 in Python

The prime factorization of 2023 in python is given below.

```
def prime_factorization(n):
    factors = []
    for i in range(2, n + 1):
        if n % i == 0:
            factors.append(i)
    return factors

print(prime_factorization(2023))
```

## Prime Factorization of 2023 in C

The prime factorization of 2023 in C is given below.

```
#include <stdio.h>

int main() {
    int n = 2023;
    int i;
    for (i = 2; i <= n; i++) {
        if (n % i == 0) {
            printf("%d ", i);
        }
    }
    return 0;


We can see it doesn't do very well out of the box...

1. The out-of-the-box model seems to think the prime factorization of 2023 is 13 x 157. This amounts to 2041! The actual answer is 7 x 17 x 17. 

2. The Python function it outputs is incorrect as well: it collects every divisor of `n` greater than 1 instead of only the prime factors. If we actually run the code, it gives the answer as ``[7, 17, 119, 289, 2023]``. 119, 289, and of course 2023 are not prime factors! 

While the syntax is generally comprehensible, we can see that there are still issues in the output that could be improved on. Let's attempt to improve the quality of the model's outputs using fine-tuning. 
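
For reference, here is a minimal sketch that contrasts what the base model's Python function computes with a standard trial-division factorization. Both function names are hypothetical and chosen for illustration.

```python
def divisors_above_one(n):
    """What the base model's function computes: every divisor of n from 2 to n."""
    return [i for i in range(2, n + 1) if n % i == 0]

def prime_factors(n):
    """Trial division: divide out each factor completely before moving on."""
    factors = []
    d = 2
    while d * d <= n:
        while n % d == 0:
            factors.append(d)
            n //= d
        d += 1
    if n > 1:  # whatever remains after the loop is itself prime
        factors.append(n)
    return factors

print(divisors_above_one(2023))  # [7, 17, 119, 289, 2023]
print(prime_factors(2023))       # [7, 17, 17]
```

The key difference is the inner `while` loop, which removes each prime completely so that composite divisors such as 119 and 289 never appear.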

### 4. Format the Data for Fine-Tuning

Let's first set up the tokenizer before formatting the dataset. Left-padding is recommended here as it can [reduce memory costs](https://ai.stackexchange.com/questions/41485/while-fine-tuning-a-decoder-only-llm-like-llama-on-chat-dataset-what-kind-of-pa).


In [9]:
tokenizer = AutoTokenizer.from_pretrained(
    model_id,
    model_max_length=512,
    padding_side="left",
    add_eos_token=True)

tokenizer.pad_token = tokenizer.eos_token

def tokenize(prompt):
    tokenized = tokenizer(
        prompt,
        truncation=True,
        max_length=512,
        padding="max_length",
    )
    tokenized["labels"] = tokenized["input_ids"].copy()
    return tokenized
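
To illustrate what left-padding to a fixed length produces, here is a toy sketch in plain Python. It is not the Hugging Face tokenizer: `pad_left` is a hypothetical helper, the token ids are made up, and the pad id of 2 is taken from the `eos_token_id` reported in the generation log above.

```python
def pad_left(token_ids, max_length, pad_id):
    """Truncate to max_length, then pad on the left with pad_id."""
    ids = token_ids[:max_length]
    n_pad = max_length - len(ids)
    input_ids = [pad_id] * n_pad + ids
    attention_mask = [0] * n_pad + [1] * len(ids)  # 0 marks padding positions
    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "labels": list(input_ids),  # labels start as a copy of input_ids
    }

example = pad_left([1, 415, 2936], max_length=6, pad_id=2)
print(example["input_ids"])       # [2, 2, 2, 1, 415, 2936]
print(example["attention_mask"])  # [0, 0, 0, 1, 1, 1]
```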

We can then reformat the dataset to fit the instruction prompt for fine-tuning. We will enclose the instruction and any inputs given to the model between ``[INST]`` and ``[/INST]`` tags, and attach the correct output afterwards. 

Then, we tokenize each entry of our dataset using the tokenizer we set up above. 

In [10]:
def process_prompt(data):
    new_prompt = f"""<s>[INST] {data["instruction"]} here are the inputs {data["input"]} [/INST] \\n {data["output"]} </s>"""
    return tokenize(new_prompt)

tokenized_train_ds = train_dataset.map(process_prompt)
tokenized_val_ds = eval_dataset.map(process_prompt)

Map: 100%|██████████| 97567/97567 [00:43<00:00, 2256.29 examples/s]
Map: 100%|██████████| 12196/12196 [00:05<00:00, 2284.30 examples/s]
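
To see what a single formatted training example looks like, the sketch below applies the same f-string template to a hypothetical record (the field values are invented for illustration). Note that `\\n` in a regular Python string is an escaped backslash, so the prompt contains the two literal characters `\` and `n` rather than a line break.

```python
# Hypothetical record with the same fields as the dataset rows
record = {
    "instruction": "Write a function that adds two numbers.",
    "input": "a = 2, b = 3",
    "output": "def add(a, b):\n    return a + b",
}

# Same template as process_prompt above
prompt = f"""<s>[INST] {record["instruction"]} here are the inputs {record["input"]} [/INST] \\n {record["output"]} </s>"""
print(prompt)
```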


### 5. Set up for QLoRA Fine-Tuning

Now, we are ready to set up our fine-tuning workflow. Let's prepare the model for parameter efficient fine-tuning. We'll also implement a neat function to let us know exactly how many of the model weights will be retrained and how many will be frozen, just to get a good idea for how PEFT is working under the hood. 

In [11]:
model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)

def print_param_info(model):
    """
    Outputs trainable parameter information.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

Next, we can print out the architecture of the model. QLoRA will be applied to all the linear layers of this model. 

In [12]:
print(model)

MistralForCausalLM(
  (model): MistralModel(
    (embed_tokens): Embedding(32000, 4096)
    (layers): ModuleList(
      (0-31): 32 x MistralDecoderLayer(
        (self_attn): MistralAttention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): MistralRotaryEmbedding()
        )
        (mlp): MistralMLP(
          (gate_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear4bit(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLUActivation()
        )
        (input_layernorm): MistralRMSNorm()
        (post_attention_layernorm): MistralRMSNorm()
      )

We can see those layers are: 
* `q_proj`
* `k_proj`
* `v_proj`
* `o_proj`
* `gate_proj`
* `up_proj`
* `down_proj`
* `lm_head`

Let's make a note of these and pass them into the LoRA config. You are also able to specify a few other fields in the config. For example: 

* `r`: This field refers to the rank of the lower-rank matrices you want to use in the adaptation layers of the model, which controls the number of parameters set to be retrained. The higher this number, the more expressiveness you will capture; however, there is added computational cost. 

* `lora_alpha`: This field refers to the scaling factor for the LoRA update. The update is scaled by a factor of `lora_alpha/r`, so a higher value gives the LoRA activations more weight relative to the frozen base weights.

The authors of the original QLoRA paper used the following values: `r=64` and `lora_alpha=16`. While these may be able to generalize well, let's set the defaults here to `r=8` and `lora_alpha=16`. This way, the LoRA update is scaled by `16/8 = 2` instead of `16/64 = 0.25`, giving more weight to what is learned from the new fine-tuning data, while the lower rank reduces the number of trainable parameters and the computational cost. You are free to adjust and tune these parameters as you wish. 
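
The effect of these two fields on the scaling factor can be checked with a few lines of arithmetic. This is a sketch of the `lora_alpha/r` rule described above; `lora_scaling` is a hypothetical helper name.

```python
def lora_scaling(lora_alpha, r):
    """Scaling factor applied to the low-rank update."""
    return lora_alpha / r

for r, lora_alpha in [(64, 16), (8, 16)]:
    print(f"r={r:>2}, lora_alpha={lora_alpha}: update scaled by {lora_scaling(lora_alpha, r)}")
```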

Let's use the ``print_param_info`` defined above to see what the trainable and frozen parameters look like. 

In [13]:
config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
        "lm_head",
    ],
    bias="none",
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, config)
print_param_info(model)

trainable params: 21260288 || all params: 3773331456 || trainable%: 0.5634354746703705
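
Both numbers in this output can be reproduced from the layer shapes printed earlier. The sketch below rests on several assumptions that are not confirmed by the notebook: `lm_head` maps 4096 to 32000 (it is cut off in the printout), the 4-bit layers report half their logical element count because two 4-bit values are packed per byte, and the embeddings, `lm_head`, and norm layers are left unquantized.

```python
R = 8
VOCAB, HIDDEN, KV, MLP, LAYERS = 32000, 4096, 1024, 14336, 32

# (in_features, out_features) for each targeted layer in one decoder block
per_block = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV),
    "v_proj": (HIDDEN, KV),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}
lm_head = (HIDDEN, VOCAB)  # assumption: lm_head maps 4096 -> 32000

def lora_params(shape, r=R):
    """Values in the LoRA A and B matrices for one linear layer."""
    d_in, d_out = shape
    return r * (d_in + d_out)

trainable = LAYERS * sum(lora_params(s) for s in per_block.values()) + lora_params(lm_head)

# Frozen weights, assuming 4-bit layers report half their logical element count
quantized = LAYERS * sum(i * o for i, o in per_block.values()) // 2
# Assumed unquantized: embeddings, lm_head, two norms per block plus one final norm
unquantized = 2 * VOCAB * HIDDEN + (2 * LAYERS + 1) * HIDDEN
all_params = quantized + unquantized + trainable

print(f"trainable params: {trainable} || all params: {all_params} || trainable%: {100 * trainable / all_params}")
```

Under these assumptions the totals match the output above, which would also explain why "all params" is reported as roughly 3.77 billion rather than about 7 billion.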


And reprinting the model architecture shows us the updated model with proper quantization and LoRA layers wrapping the original linear layers. 

In [14]:
print(model)

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): MistralForCausalLM(
      (model): MistralModel(
        (embed_tokens): Embedding(32000, 4096)
        (layers): ModuleList(
          (0-31): 32 x MistralDecoderLayer(
            (self_attn): MistralAttention(
              (q_proj): Linear4bit(
                in_features=4096, out_features=4096, bias=False
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.05, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=4096, out_features=8, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=8, out_features=4096, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
              )
              (k_proj): Linear4bit(
                in_features=4096, out_features=1024, bias=False
       

### 6. Run QLoRA Fine-Tuning

Now with the dataset processed and tokenized, and with the model prepared, we are ready to begin running the fine-tuning. The following cell will configure the trainer object with various default parameters. 

On a 1x A100-80GB system, this cell can take several hours to complete as written. Depending on your hardware and patience, you may need to adjust certain parameters to achieve reasonable training times. Notably, we set `max_steps` to 500 and run checkpointing and evaluation every 50 steps; you may reduce the number of steps and/or checkpoint less frequently if you would like to reduce the training time. 

For your convenience, a progress bar is generated, and the training and validation losses are logged at each checkpoint. If the validation loss begins increasing, you may be running into issues with model overfitting. At this point, you may interrupt the kernel to stop training, and pass the appropriate ``checkpoint-xx`` to Step 7. 

In [None]:
# Parallelization is possible if system is multi-GPU
if torch.cuda.device_count() > 1: 
    model.is_parallelizable = True
    model.model_parallel = True

tokenizer.pad_token = tokenizer.eos_token

# Training configs
trainer = transformers.Trainer(
    model=model,
    train_dataset=tokenized_train_ds,
    eval_dataset=tokenized_val_ds,
    args=transformers.TrainingArguments(
        output_dir="./mistral-code-instruct",
        warmup_steps=5,
        per_device_train_batch_size=2,
        gradient_checkpointing=True,
        gradient_accumulation_steps=4,
        max_steps=500,
        learning_rate=2.5e-5,
        logging_steps=50,
        bf16=True if (QUANTIZATION != "8bit") else False,
        fp16=True if (QUANTIZATION == "8bit") else False,
        optim="paged_adamw_8bit",
        logging_dir="./logs",
        save_strategy="steps",
        save_steps=50,
        evaluation_strategy="steps", 
        eval_steps=50,
        report_to="none",
        do_eval=True,
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

# Silencing warnings. If using for inference, consider re-enabling.
model.config.use_cache = False 

# Train! 
trainer.train()
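
To get a feel for how much data this configuration actually covers, here is a short sketch of the arithmetic. It assumes a single-GPU run; the other values are copied from the training arguments and the split sizes above.

```python
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
max_steps = 500
save_steps = 50
num_gpus = 1         # assumption: single-GPU run
train_rows = 97567   # size of the training split printed earlier

# Examples consumed per optimizer step and over the whole run
examples_per_step = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
examples_seen = examples_per_step * max_steps

# Checkpoint directory names written under output_dir
checkpoints = [f"checkpoint-{s}" for s in range(save_steps, max_steps + 1, save_steps)]

print(examples_per_step)                                   # 8
print(examples_seen)                                       # 4000
print(f"{100 * examples_seen / train_rows:.1f}% of the training split")
print(len(checkpoints), checkpoints[-1])                   # 10 checkpoint-500
```

With these settings the run covers only a small fraction of one epoch, which is worth keeping in mind when judging the results or deciding whether to raise `max_steps`.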

### 7. Evaluate the Fine-Tuned Model

Good news, the model is now fine-tuned to your dataset! 

If you find you are running low on VRAM, you may consider restarting the kernel at this point. The PEFT library functionality saves only the QLoRA adapters in the checkpoints by default, and so the original weights need to be reloaded. Restarting the kernel may prevent any out-of-memory headaches when loading the base model again on top of this customized model we just fine-tuned. 

In case you restarted the kernel, let's redefine everything again.

In [None]:
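import os
import torch
import transformers
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import prepare_model_for_kbit_training, LoraConfig, get_peft_model, PeftModel

# Redefine the quantization level; it is lost if the kernel was restarted.
# Use the same value that was used for training.
QUANTIZATION = "4bit"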

# Pre-define quantization configs

################## 4bit ##################
bb_config_4b_eval = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)
##########################################

################## 8bit ##################
bb_config_8b_eval = BitsAndBytesConfig(
    load_in_8bit=True,
)
##########################################

def quantization_config_eval(quantization):
    if quantization == "8bit":
        return bb_config_8b_eval
    else:
        return bb_config_4b_eval

In [None]:
model_id = "mistralai/Mistral-7B-v0.1"
hf_api_token = os.environ['HUGGING_FACE_HUB_TOKEN']

if QUANTIZATION == "none":
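    # device_map="auto" already places the model on the available device(s),
    # so no extra .to("cuda") call is needed here
    base_model = AutoModelForCausalLM.from_pretrained(
        model_id,
        token=hf_api_token, 
        device_map="auto",
        trust_remote_code=True,
    )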
else: 
    base_model = AutoModelForCausalLM.from_pretrained(
        model_id,
        token=hf_api_token, 
        quantization_config=quantization_config_eval(QUANTIZATION),
        device_map="auto",
        trust_remote_code=True,
    )

tokenizer = AutoTokenizer.from_pretrained(
    model_id,
    model_max_length=512,
    padding_side="left",
    add_eos_token=True)

tokenizer.pad_token = tokenizer.eos_token

Now, we can load the adapter weights from the QLoRA training on top of the original weights of the base Mistral model. Make sure you choose the best performing model checkpoint.

In [None]:
ft_model = PeftModel.from_pretrained(base_model, "mistral-code-instruct/checkpoint-500")
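
One way to pick the checkpoint is to look for the lowest validation loss in the training logs. The sketch below assumes the list-of-dicts shape of `trainer.state.log_history`, where evaluation entries carry `eval_loss` and `step` keys, and that checkpoints are saved at the same steps as evaluations (as configured above). The helper name and the loss values are hypothetical.

```python
def best_checkpoint(log_history, output_dir="mistral-code-instruct"):
    """Return the checkpoint path with the lowest eval_loss, or None if no evals were logged."""
    evals = [entry for entry in log_history if "eval_loss" in entry]
    if not evals:
        return None
    best = min(evals, key=lambda entry: entry["eval_loss"])
    return f"{output_dir}/checkpoint-{best['step']}"

# Hypothetical log entries; real values would come from trainer.state.log_history
example_history = [
    {"loss": 1.10, "step": 50},
    {"eval_loss": 0.95, "step": 50},
    {"eval_loss": 0.90, "step": 100},
    {"eval_loss": 0.93, "step": 150},
]
print(best_checkpoint(example_history))  # mistral-code-instruct/checkpoint-100
```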

We are ready to use this fine-tuned model for inference! Let's go ahead and try a prime factorization programming question similar to the one we asked above, and see if our fine-tuned Mistral model produces better quality responses.

In [None]:
eval_prompt = f"""<s>
For a given integer n, print out all its prime factors one on each line. 
n = 30
[INST]
"""

input_ids = tokenizer(eval_prompt, return_tensors="pt", truncation=True).input_ids.cuda()
outputs = ft_model.generate(input_ids=input_ids, max_new_tokens=256, do_sample=True, top_p=0.9,temperature=0.5)

print(f"Prompt:\n{eval_prompt}\n")
print(f"\nGenerated response:\n{tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True)[0][len(eval_prompt):]}")
print('''\nGround truth:\ndef print_prime_factors(n): 
  for i in range(2, n + 1):
    while n % i == 0:
      print(i)
      n //= i
print_prime_factors(n)''')

30 can be factored into the following primes: 2, 3, 5. Because we fine-tune on generating code snippets and not on answering the question posed to the LLM, you may in some cases see 'hallucinated' answers that do not align with the actual output of the generated code snippet. Be sure to examine the generated code itself rather than relying solely on the answer stated in the response. Feel free to spin up a sandbox environment to evaluate any generated code. 

Check out the ``Generated response`` output and compare it with the ``Ground truth`` code. Try out the ``Generated response`` yourself in a sandbox environment. Could you be underfitting? Overfitting? Or does the code work as intended? 

If so, nice! Using QLoRA fine-tuning, we can now generate comprehensible and accurate code that accomplishes what the out-of-the-box baseline Mistral model was unable to achieve. Now, feel free to adjust the hyperparameters, bring in your own custom data, or customize this fine-tuning workflow to improve model performance for your particular use case. 

### 8. Merge and Save the Fine-Tuned Model

Now, we are ready to save the fine-tuned model. Let's save it under `models`, which you have already mounted to your host system for easy access. 


In [None]:
ft_model.save_pretrained("/project/models/finetuned")