---
draft: false
date: 2024-05-08
---


# Finetuning Language Models using QLoRA

We demonstrate how to finetune [`munin-7b-alpha`](https://huggingface.co/danish-foundation-models/munin-7b-alpha), or another large language model (LLM), on a Danish-translated instruction tuning dataset, using QLoRA (LoRA applied to a 4-bit quantized model) and tools from the PyTorch and Hugging Face ecosystem. This notebook can be run on a typical consumer GPU (e.g. NVIDIA T4 16GB).

<!-- more -->
This notebook takes some liberties to ensure simplicity and readability, while remaining reasonably efficient. However, if you want a more efficient approach, see the [tutorial on (efficiently) finetuning language models](https://www.foundationmodels.dk/blog/2024/02/02/tutorial-finetuning-language-models/).

### Open In Colab

You can open this notebook in Google Colab by clicking the button below:

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/centre-for-humanities-computing/danish-foundation-models/blob/main/docs/tutorials/finetune.ipynb)

## Introduction

Large Language Models (LLMs) have shown impressive capabilities in a wide variety of applications. Developers often fine-tune these LLMs to tailor them to specific use cases, whether for better performance or for other reasons, including but not limited to:

- Reducing hallucinations
- Better handling of retrieved information
- Learning new information (when the dataset is large)
- Cost optimization
- Privacy

<figure>
<p align="center">
    <img src="finetune.png" alt="finetune" style="width: 800px;"/>
</p>
    <figcaption>Figure: A simple illustration of model fine-tuning.</figcaption>
</figure>

However, LLMs are large by design, and fully fine-tuning them requires a large number of GPUs. A common way around this is a family of techniques called Parameter Efficient Fine-Tuning (PEFT). PEFT methods aim to drastically reduce the number of trainable parameters of a model while retaining performance comparable to full fine-tuning. The rest of this introduction explains the LoRA method, but it is perfectly fine to skip it.

<details>
<summary>An example of the memory requirements for fine-tuning a large language model (click to unfold) </summary>

Let’s focus on a specific example by trying to fine-tune a Llama model on a free-tier Google Colab instance (1x NVIDIA T4 16GB). Llama-2 7B has 7 billion parameters, which amounts to 28GB if the model is loaded in full precision (4 bytes per parameter). Given our GPU memory constraint (16GB), the model cannot even be loaded, much less trained on our GPU. Loading the model in half-precision (2 bytes per parameter) halves this requirement to 14GB with negligible performance degradation. You can read more about running models in half-precision and mixed precision for training here.

In the case of full fine-tuning with Adam optimizer using a half-precision model and mixed-precision mode, we need to allocate per parameter:

- 2 bytes for the weight
- 2 bytes for the gradient
- 4 + 8 bytes for the Adam optimizer states

With a total of 16 bytes per trainable parameter, this adds up to 112GB (excluding the intermediate hidden states). Given that most high-end GPUs top out at around 80GB of VRAM, this makes full fine-tuning challenging and less accessible to everyone. To bridge this gap, Parameter Efficient Fine-Tuning (PEFT) methods have been widely adopted by the community.
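
The arithmetic above can be reproduced in a few lines. This is only a back-of-the-envelope sketch: it uses decimal gigabytes to match the round figures in the text and, like the text, ignores activations. The helper name `model_memory_gb` is ours and not part of any library.

```python
GB = 1e9  # decimal gigabytes, matching the round figures in the text


def model_memory_gb(n_params: float, bytes_per_param: float) -> float:
    """Memory in GB needed to store `bytes_per_param` bytes for each parameter."""
    return n_params * bytes_per_param / GB


n_params = 7e9  # Llama-2 7B

full_precision = model_memory_gb(n_params, 4)  # fp32 weights only
half_precision = model_memory_gb(n_params, 2)  # fp16/bf16 weights only
# weight (2) + gradient (2) + Adam optimizer states (4 + 8) = 16 bytes
full_finetuning = model_memory_gb(n_params, 2 + 2 + 4 + 8)

print(f"Full precision weights: {full_precision:.0f} GB")
print(f"Half precision weights: {half_precision:.0f} GB")
print(f"Full fine-tuning:       {full_finetuning:.0f} GB")
```

Only the half-precision weights (14 GB) fit on a 16GB T4, and that leaves almost no room for gradients or optimizer states.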

</details>

### Low-Rank Adaptation for Large Language Models (LoRA) Parameter Efficient Fine-Tuning

Parameter Efficient Fine-Tuning (PEFT) methods, such as LoRA, aim at drastically reducing the number of trainable parameters of a model while retaining performance comparable to full fine-tuning. Multiple PEFT methods exist; for an overview we recommend the article "[Scaling Down to Scale Up: A Guide to Parameter-Efficient Fine-Tuning](https://arxiv.org/pdf/2303.15647)". In this notebook, however, we will focus on the LoRA method.

The [LoRA method](https://arxiv.org/pdf/2106.09685) by Hu et al. from the Microsoft team came out in 2021, and works by attaching extra trainable parameters to a model (which we will refer to as the base model).

To make fine-tuning more efficient, LoRA represents the update to a large weight matrix as the product of two smaller, low-rank matrices. These new matrices can be trained to adapt to the new data while keeping the overall number of trainable parameters low. The original weight matrix remains frozen and doesn’t receive any further adjustments. To produce the final results, the outputs of the original and the adapted weights are combined.

This approach has several advantages:

- LoRA makes fine-tuning more efficient by drastically reducing the number of trainable parameters.
- The original pre-trained weights are kept frozen, which means you can have multiple lightweight and portable LoRA models for various downstream tasks built on top of them.
- LoRA is orthogonal to many other parameter-efficient methods and can be combined with many of them.
- The performance of models fine-tuned using LoRA is comparable to the performance of fully fine-tuned models.
- LoRA does not add any inference latency when adapter weights are merged with the base model.
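
To see why merging removes the extra inference cost, here is a minimal pure-Python sketch with a tiny 3x3 weight and a rank-1 update. The matrices are made-up toy values, the helpers `matmul` and `add` are ours, and the scaling factor that LoRA applies to the update is left out for brevity.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    """Add two matrices of the same shape element-wise."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


# Frozen base weight W (3x3) and a rank-1 update B @ A (B is 3x1, A is 1x3).
W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 0.0],
     [3.0, 0.0, 1.0]]
B = [[1.0], [0.0], [2.0]]
A = [[0.5, 0.0, 1.0]]
x = [[1.0], [2.0], [3.0]]  # a single input column vector

# With an unmerged adapter, the base path and the adapter path are computed separately.
adapter_output = add(matmul(W, x), matmul(B, matmul(A, x)))

# After merging, W + B @ A is a single matrix, so inference needs one matmul only.
W_merged = add(W, matmul(B, A))
merged_output = matmul(W_merged, x)

print(adapter_output)  # [[10.5], [2.0], [13.0]]
print(merged_output)   # [[10.5], [2.0], [13.0]]
```

Both paths give the same result, but the merged model has exactly the same shape and cost as the base model.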

In principle, LoRA can be applied to any subset of weight matrices in a neural network to reduce the number of trainable parameters. However, for simplicity and further parameter efficiency, in Transformer models LoRA is typically applied to attention blocks only. The resulting number of trainable parameters in a LoRA model depends on the size of the low-rank update matrices, which is determined mainly by the rank r and the shape of the original weight matrix.
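
As a rough illustration of how the rank r drives the number of trainable parameters, the sketch below counts the parameters LoRA adds to a single weight matrix. The 4096x4096 shape is an assumed example and the helper `lora_param_count` is ours.

```python
def lora_param_count(d_out: int, d_in: int, r: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix.

    The update is B @ A, with B of shape (d_out, r) and A of shape (r, d_in).
    """
    return r * (d_out + d_in)


d_out = d_in = 4096  # assumed shape of a square projection matrix
full = d_out * d_in  # parameters in the frozen weight matrix

for r in (4, 8, 16, 64):
    lora = lora_param_count(d_out, d_in, r)
    print(f"r={r:>2}: {lora:>9,} trainable vs {full:,} frozen ({lora / full:.2%})")
```

The count grows linearly with r, and even at r=64 it stays a small fraction of the frozen matrix.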

<figure>
<p align="center">
    <img src="fg2.gif" alt="lora" style="width: 500px;"/>
</p>
    <figcaption>Figure: Animated diagram that shows how LoRA works in practice.</figcaption>
</figure>





## Install Dependencies
Before we start, we need to install the following dependencies:

In [1]:
%pip install -q datasets bitsandbytes transformers peft trl accelerate sentencepiece protobuf --upgrade

# Description of the libraries:
# - Datasets: A high-performance library for easily sharing and accessing datasets from the Hugging Face Hub at huggingface.co/datasets
# - bitsandbytes: A lightweight library for loading models in low precision (which makes them use less memory)
# - Transformers: A high-level library for working with LLMs
# - PEFT: A library for parameter-efficient fine-tuning of LLMs
# - TRL: A library for training LLMs using supervised fine-tuning and reinforcement learning
# - Accelerate: A library for distributed and efficient training of LLMs
# - Sentencepiece: A library for tokenizing text required by some models

In [2]:
# print the version of the libraries for reproducibility
import datasets
import bitsandbytes
import transformers
import peft
import trl
import accelerate
import sentencepiece

print(f"datasets: {datasets.__version__}")
print(f"bitsandbytes: {bitsandbytes.__version__}")
print(f"transformers: {transformers.__version__}")
print(f"peft: {peft.__version__}")
print(f"trl: {trl.__version__}")
print(f"accelerate: {accelerate.__version__}")
print(f"sentencepiece: {sentencepiece.__version__}")


datasets: 2.19.1
bitsandbytes: 0.43.1
transformers: 4.40.1
peft: 0.10.0
trl: 0.8.6
accelerate: 0.30.0
sentencepiece: 0.2.0


## Loading and Testing the Model
This section loads the model and tests it on a simple example. The model we ultimately target is [`munin-7b-alpha`](https://huggingface.co/danish-foundation-models/munin-7b-alpha), created by the Danish Foundation Models team, but to fit within Colab's memory constraints the code below loads the smaller `mhenrichsen/danskgpt-tiny-chat` by default.


In [3]:
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch

model_name = "mhenrichsen/danskgpt-tiny-chat" # download a smaller model (due to memory constraint of Colab)
# model_name="danish-foundation-models/munin-7b-alpha" # if you have more memory you can use this


# Load base model
# - optionally load the model in 4-bit precision (recommended for large models to save memory)
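bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,  # store the weights in 4-bit precision
    bnb_4bit_use_double_quant=True,  # also quantize the quantization constants
    bnb_4bit_quant_type="nf4",  # use the 4-bit NormalFloat data type
    bnb_4bit_compute_dtype=torch.bfloat16,  # dtype used for the actual computation
)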
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config
)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
`low_cpu_mem_usage` was None, now set to True since model is quantized.
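
To get a feeling for what the 4-bit configuration buys us, the sketch below estimates the memory needed for the quantized weights of a 7B model, with and without double quantization. The block sizes (64 weights per block, 256 blocks per second-level group) are the values described in the QLoRA paper and are assumed here to match the library defaults. The helper names are ours, and the estimate ignores layers that are kept in higher precision.

```python
def quant_constant_bits(
    block_size: int = 64,
    double_quant: bool = True,
    second_block_size: int = 256,
) -> float:
    """Extra bits per parameter spent on storing quantization constants."""
    if not double_quant:
        return 32 / block_size  # one fp32 constant per block of weights
    # 8-bit constants per block, plus one fp32 constant per group of blocks
    return 8 / block_size + 32 / (block_size * second_block_size)


def weight_memory_gb(n_params: float, double_quant: bool = True) -> float:
    """Approximate memory (decimal GB) for 4-bit weights plus their constants."""
    bits_per_param = 4 + quant_constant_bits(double_quant=double_quant)
    return n_params * bits_per_param / 8 / 1e9


n_params = 7e9  # e.g. a 7B model
print(f"Without double quantization: {weight_memory_gb(n_params, False):.2f} GB")
print(f"With double quantization:    {weight_memory_gb(n_params, True):.2f} GB")
```

Under these assumptions, a 7B model needs roughly 3.6 to 3.9 GB for its weights, compared to 14 GB in half-precision.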


In [4]:
from transformers import TextStreamer, AutoTokenizer

prompt = "Meningen med livet er"

tokenizer = AutoTokenizer.from_pretrained(model_name)
inputs = tokenizer([prompt], return_tensors="pt").to(model.device)  # keep inputs on the model's device
streamer = TextStreamer(tokenizer)
outputs = model.generate(**inputs, streamer=streamer, max_new_tokens=50)
# The output is influenced by quantization (if the model was not trained with quantization).
# Try disabling it to see the difference.

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


<s> Meningen med livet er at finde en balance mellem arbejde og fritid. Det er ikke nødvendigt at have en stor mængde penge for at have det godt. Det er vigtigt at have


### Add in the LoRA Adapters

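This section adds the LoRA adapters to the model. With the configuration below, adapters are attached to the attention projections (`q_proj`, `k_proj`, `v_proj`, `o_proj`) as well as the feed-forward projections (`gate_proj`, `up_proj`, `down_proj`). By default, one of the two low-rank matrices in each adapter is initialized randomly and the other with zeros, so the adapted model starts out identical to the base model. Only the adapters are trained during fine-tuning; the original weights of the model are kept frozen. At inference, the adapter output is added to the output of the frozen weights, and the adapter weights can optionally be merged into the base weights.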

In [5]:
from peft import LoraConfig, TaskType, get_peft_model, prepare_model_for_kbit_training

# Prepare quantized model for peft training
model = prepare_model_for_kbit_training(model)

# Create the LoRA config
lora_config = LoraConfig(
    r=8,
    target_modules=["q_proj", "o_proj", "k_proj", "v_proj", "gate_proj", "up_proj", "down_proj"],
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)

# Create a PeftModel, which inserts LoRA adapter modules into the model
model = get_peft_model(model, lora_config)

# to save the adapter weights (not the model weights)
# model.save_pretrained("my_awesome_adapter")
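
To estimate how many trainable parameters a configuration like the one above adds, the sketch below sums the LoRA parameters over all seven target modules. The dimensions are assumed Mistral-7B-style values (which we take to apply to `munin-7b-alpha`); the smaller default model has different dimensions, so check the `config.json` of the model you actually load.

```python
# Assumed Mistral-7B-style dimensions; check your model's config.json.
HIDDEN = 4096         # hidden size
INTERMEDIATE = 14336  # feed-forward (MLP) size
KV_DIM = 1024         # key/value projection size (grouped-query attention)
N_LAYERS = 32         # number of transformer blocks
R = 8                 # LoRA rank, as in the config above

# (input dimension, output dimension) of each targeted linear layer
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

# Each adapter adds r * (d_in + d_out) parameters to its layer.
per_layer = sum(R * (d_in + d_out) for d_in, d_out in TARGET_SHAPES.values())
total = per_layer * N_LAYERS

print(f"{per_layer:,} LoRA parameters per block, {total:,} in total")
```

Under these assumptions that is about 21 million trainable parameters, a fraction of a percent of the 7 billion frozen ones. On the actual model, `model.print_trainable_parameters()` reports the real numbers.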

## Load and Prepare Data

Load the dataset from Hugging Face Hub or use local data. In this example, we will use the `kobprof/skolegpt-instruct` dataset from the Hugging Face Hub. The dataset is a Danish instruction dataset that has been translated from English to Danish.

If you examine the data, you will see that it is easy to replace it with your own dataset.

In [6]:
from datasets import load_dataset

dataset = load_dataset("kobprof/skolegpt-instruct", split="train")
print(f"Number of samples in dataset: {len(dataset):,}")

Number of samples in dataset: 21,580


In [7]:
# let us examine one sample
sample = dataset[101]
print(sample)

{'id': 't0.1101311', 'system_prompt': 'Du er en AI-assistent. Brugeren vil give dig en opgave. Dit mål er at udføre opgaven så trofast, som du kan. Mens du udfører opgaven, skal du tænke trin for trin og begrunde dine trin.', 'question': 'Hvilket amerikansk rockband fra Frederick, Maryland, med oprindelse i Germantown, Maryland, havde en live-DVD med titlen Full Fathom Five: Video Field Recordings?', 'response': 'For at finde svaret på dette spørgsmål vil jeg først huske populære amerikanske rockbands og tjekke, om nogen af dem er fra Frederick, Maryland, eller har forbindelser til Germantown, Maryland. Hvis jeg ikke kan finde svaret gennem min vidensbase, vil jeg søge på internettet efter relevante oplysninger.\n\nDa jeg genkalder mig populære amerikanske rockbands, kan jeg ikke umiddelbart identificere nogen bands fra Frederick eller Germantown, Maryland.\n\nJeg vil nu søge på internettet efter oplysningerne.\n\n[Søger...]\n\nEfter at have søgt på internettet fandt jeg det amerikansk

We only use a random subset of the data: 1,000 samples should take around 7 minutes to train on this machine, depending on the settings.

In [8]:
n_samples = 1000
RANDOM_SEED = 42
dataset = dataset.shuffle(seed=RANDOM_SEED).select(range(n_samples))

Lastly, we set up the conversations in the dataset into the standard ChatML format.

In [9]:
def create_conversation(sample: dict) -> dict[str, list[dict[str, str]]]:
    """This converts the sample to the standardised ChatML format.

    Args:
        sample:
            The data sample.

    Returns:
        The sample set up in the ChatML format.
    """
    return {
        "messages": [
            {"role": "system", "content": sample["system_prompt"]},
            {"role": "user", "content": sample["question"]},
            {"role": "assistant", "content": sample["response"]}
        ]
    }

dataset = dataset.map(create_conversation, batched=False)
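
To make the target format concrete, the sketch below renders a `messages` list into ChatML-style text. It is an illustration only: the helper `render_chatml` and the example conversation are hypothetical, and during training the actual rendering is done by the chat template stored with the tokenizer, which may differ from this.

```python
def render_chatml(messages: list[dict[str, str]], add_generation_prompt: bool = False) -> str:
    """Render a list of role/content messages as ChatML-style text."""
    text = "".join(
        f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        for message in messages
    )
    if add_generation_prompt:
        # Open an assistant turn so the model continues with its answer.
        text += "<|im_start|>assistant\n"
    return text


# A hypothetical sample in the same shape as the output of `create_conversation`.
example = {
    "messages": [
        {"role": "system", "content": "Du er en hjælpsom assistent."},
        {"role": "user", "content": "Hvad er hovedstaden i Danmark?"},
        {"role": "assistant", "content": "København."},
    ]
}

print(render_chatml(example["messages"]))
```

Each turn is wrapped in start and end markers together with its role, which is how the model learns where one speaker stops and the next begins.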

## Finetuning the Model

We will use the `trl` library to finetune the model. [`trl`](https://huggingface.co/docs/trl/index) is a library which provides a set of tools to train transformer language models with Reinforcement Learning, from the Supervised Fine-tuning step (SFT), Reward Modeling step (RM) to the Proximal Policy Optimization (PPO) step. In this notebook, we will only use the SFT step.

In [13]:
from trl import SFTTrainer
from transformers import TrainingArguments

# Setting up the Trainer
FINETUNING_CONFIGURATION = dict(
    per_device_train_batch_size=8,
    gradient_accumulation_steps=1,
    warmup_steps=5,
    num_train_epochs=1,
    learning_rate=2e-4,
    weight_decay=0.01,
    lr_scheduler_type="linear",
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    max_seq_length=1024,  # The maximum length (in tokens) of each training sequence
    packing=True,  # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        optim="adamw_8bit",
        fp16=not torch.cuda.is_bf16_supported(),
        bf16=torch.cuda.is_bf16_supported(),
        logging_steps=3,
        seed=RANDOM_SEED,
        output_dir="outputs",
        **FINETUNING_CONFIGURATION
    ),
)
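
The `packing=True` option is worth a short illustration. The sketch below shows the general idea in simplified form: tokenized examples are concatenated, separated by an end-of-sequence token, and cut into chunks of a fixed length, so that no compute is wasted on padding. The function `pack_sequences` and the toy token IDs are ours; the actual implementation in `trl` differs in its details.

```python
def pack_sequences(
    tokenized: list[list[int]], max_seq_length: int, eos_token_id: int
) -> list[list[int]]:
    """Concatenate tokenized examples and split them into fixed-length chunks."""
    buffer: list[int] = []
    for tokens in tokenized:
        buffer.extend(tokens)
        buffer.append(eos_token_id)  # mark the boundary between examples

    # Keep full chunks only; the incomplete remainder is dropped in this sketch.
    n_full = len(buffer) // max_seq_length
    return [
        buffer[i * max_seq_length : (i + 1) * max_seq_length] for i in range(n_full)
    ]


# Four short toy examples with made-up token IDs; 0 plays the role of EOS.
examples = [[11, 12, 13], [21, 22], [31, 32, 33, 34], [41]]
packed = pack_sequences(examples, max_seq_length=4, eos_token_id=0)

for chunk in packed:
    print(chunk)
```

Without packing, each of the four examples would be its own (padded) sequence; with packing, they become three full sequences, and a single sequence can contain parts of several examples.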

In [14]:
# Log some GPU stats before we start the finetuning
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(
    f"You're using the {gpu_stats.name} GPU, which has {max_memory:.2f} GB of memory "
    f"in total, of which {start_gpu_memory:.2f}GB has been reserved already."
)

You're using the Tesla T4 GPU, which has 14.75 GB of memory in total, of which 6.81GB has been reserved already.


In [15]:
# This is where the actual finetuning is happening
trainer_stats = trainer.train()



| Step | Training Loss |
|------|---------------|
| 3 | 1.7594 |
| 6 | 1.8456 |
| 9 | 1.8095 |
| 12 | 1.5812 |
| 15 | 1.613 |
| 18 | 1.5872 |
| 21 | 1.5219 |
| 24 | 1.4904 |
| 27 | 1.4844 |
| 30 | 1.5178 |
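
The finetuning configuration combines `warmup_steps=5` with `lr_scheduler_type="linear"`. The sketch below shows what such a schedule looks like: the learning rate rises linearly during warmup and then decays linearly to zero. The function `linear_schedule_lr` is our own re-implementation for illustration, and the total of 30 steps is an assumption based on the last logged step.

```python
def linear_schedule_lr(
    step: int, total_steps: int, base_lr: float = 2e-4, warmup_steps: int = 5
) -> float:
    """Learning rate at `step` for linear warmup followed by linear decay."""
    if step < warmup_steps:
        # Warmup: scale up linearly from 0 to the base learning rate.
        return base_lr * step / max(1, warmup_steps)
    # Decay: scale down linearly from the base learning rate to 0.
    remaining = (total_steps - step) / max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, remaining)


TOTAL_STEPS = 30  # assumed, based on the last logged step

for step in (0, 1, 5, 15, 30):
    print(f"step {step:>2}: lr = {linear_schedule_lr(step, TOTAL_STEPS):.2e}")
```

With so few steps in total, the learning rate is at its peak only briefly, which is one reason to experiment with the number of epochs and samples.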


In [16]:
# Log some post-training GPU statistics
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(
    f"We ended up using {used_memory:.2f} GB GPU memory ({used_percentage:.2f}%), "
    f"of which {used_memory_for_lora:.2f} GB ({lora_percentage:.2f}%) "
    "was used for LoRA."
)

We ended up using 9.05 GB GPU memory (61.38%), of which 2.25 GB (15.23%) was used for LoRA.


## Trying out the new Model
Time to try out the new finetuned model. First we need to set up how to generate text with it.

You can leave the following config as-is, or you can experiment. [Here](https://huggingface.co/docs/transformers/v4.37.2/en/main_classes/text_generation#transformers.GenerationConfig) is a list of all the different arguments.

In [18]:
from transformers import GenerationConfig

GENERATION_CONFIG = GenerationConfig(
    # What should be outputted
    max_new_tokens=256,

    # Controlling how the model chooses the next token to generate
    do_sample=True,
    temperature=0.2,
    repetition_penalty=1.2,
    top_k=50,
    top_p=0.95,

    # Miscellaneous required settings
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.eos_token_id,
    use_cache=False,
)
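
To build some intuition for `temperature`, `top_k` and `top_p`, the sketch below applies them to a toy set of logits. It is a simplified pure-Python illustration: the function `next_token_distribution` and the logits are ours, the repetition penalty is left out, and the library's own implementation differs in its details.

```python
import math


def next_token_distribution(
    logits: list[float], temperature: float = 0.2, top_k: int = 50, top_p: float = 0.95
) -> dict[int, float]:
    """Return the token probabilities left after temperature, top-k and top-p."""
    # Temperature: values below 1 sharpen the distribution, values above 1 flatten it.
    scaled = [logit / temperature for logit in logits]

    # Top-k: keep only the k highest-scoring tokens.
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]

    # Softmax over the remaining tokens (shifted by the max for numerical stability).
    top = max(scaled[i] for i in order)
    exps = {i: math.exp(scaled[i] - top) for i in order}
    total = sum(exps.values())

    # Top-p: keep the smallest set of tokens whose cumulative probability reaches top_p.
    kept: dict[int, float] = {}
    cumulative = 0.0
    for i in order:
        kept[i] = exps[i] / total
        cumulative += kept[i]
        if cumulative >= top_p:
            break

    # Renormalize so that the kept probabilities sum to 1.
    norm = sum(kept.values())
    return {i: p / norm for i, p in kept.items()}


logits = [2.0, 1.0, 0.0]  # toy scores for a three-token vocabulary
sharp = next_token_distribution(logits)                  # temperature=0.2, as above
flat = next_token_distribution(logits, temperature=1.0)  # for comparison

print(sharp)
print(flat)
```

At a temperature of 0.2 only the top token survives the filtering in this toy example, while at 1.0 all three tokens remain candidates. This is why low temperatures give more predictable output.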

Let's use `TextStreamer` for continuous inference, so you can see the generation token by token instead of waiting for the whole output!


In [20]:
messages = [
    dict(
        role="system",
        content=""  # Change this to anything you want
    ),
    dict(
        role="user",
        content="Nævn nogle positive og negative sider ved large language models."  # And change this too
    ),
]

outputs = model.generate(
    input_ids=tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt").to("cuda"),
    streamer=TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True),
    generation_config=GENERATION_CONFIG,
)

Large Language Models (LLM) er en type af maskinlæringsteknologier, der bruges til at generere eller forudsige tekster på et stort antal sprog. De har flere fordele:

1. Genereret tekst: LLM-modeller kan generere tekst baseret på inputdata fra store mængder af data. Dette gør det muligt at generere meget omfattende og detaljerede tekster med høj grad af nøjagtighed.

2. Forbedret tekstbehandling: LLM-modeller kan behandle store mængder af tekst i realtid, hvilket betyder, at de ikke skal vente på, at inputdata bliver indsamlet først. Dette reducerer tiden, det tager at generere en tekst, og giver dem mulighed for at fokusere mere på den specifikke opgave.

3. Brugervenlig: LLM-modeller er designet til at være nemme at bruge for mennesker,


## Share the Model

You can share your new model on the Hugging Face Hub. This requires a Hugging Face token with write access; `HUGGING_FACE_TOKEN` in the cell below is a placeholder that you need to define yourself.

In [None]:
# model.push_to_hub("your_name/qlora_model", token=HUGGING_FACE_TOKEN)

## References

This notebook takes inspiration, snippets, figures, and quotes from the following sources:

- [Finetune LLMs on your own consumer hardware using tools from PyTorch and Hugging Face ecosystem](https://pytorch.org/blog/finetune-llms/)
- [Our previous tutorial on (efficiently) finetuning language models](https://www.foundationmodels.dk/blog/2024/02/02/tutorial-finetuning-language-models/)
- [Enhancing LLM inferencing with RAG and fine-tuned LLMs - Generative AI Workshop, AI-ML Systems Conference - 2023, Bengaluru](https://github.com/abhinav-kimothi/RAG-and-Fine-Tuning/blob/main/Notebooks/Tutorial_RAG_and_fine_tuneing_28Oct23_AIMLSystems.ipynb)