# QLoRA Fine-Tuning Llama-2 LLM
* Notebook by Adam Lang
* Date: 1/21/2025

# Overview
* In this notebook we implement and experiment with QLoRA fine-tuning on the Llama-2 7B chat model.

# Fine-Tuning Method
* We will utilize the method described in this paper: [QLoRA: Efficient Finetuning of Quantized LLMs](https://arxiv.org/abs/2305.14314)

* QLoRA introduces multiple innovations designed to reduce memory use without sacrificing performance:
   1. `4-bit NormalFloat - "NF-4"`
      * This is an optimal quantization data type for normally distributed data that yields better empirical results than 4-bit Integers and 4-bit Floats.
   2. `Double Quantization`
      * A technique that quantizes the quantization constants, saving an average of about 0.37 bits per parameter (approximately 3 GB for a 65B model)
   3. `Paged Optimizers`
      * This technique uses NVIDIA unified memory to avoid the gradient checkpointing memory spikes that occur when processing a mini-batch with a long sequence length.
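
The 0.37 bits-per-parameter figure for double quantization can be reproduced with a little arithmetic. The sketch below assumes the block sizes reported in the QLoRA paper: one 32-bit constant per 64 weights before double quantization, and afterwards 8-bit constants plus one 32-bit second-level constant per 256 first-level constants. The function and variable names are mine, for illustration only.

```python
def quant_constant_overhead_bits(block_size=64, const_bits=32,
                                 second_block_size=None, second_const_bits=32):
    """Bits of quantization-constant overhead per model parameter."""
    overhead = const_bits / block_size
    if second_block_size is not None:
        # second-level constants, each shared by `second_block_size` first-level constants
        overhead += second_const_bits / (block_size * second_block_size)
    return overhead

# without double quantization: one 32-bit constant per 64 weights
plain = quant_constant_overhead_bits(block_size=64, const_bits=32)
# with double quantization: 8-bit constants, plus one 32-bit constant per 256 of them
double = quant_constant_overhead_bits(block_size=64, const_bits=8,
                                      second_block_size=256, second_const_bits=32)

saved_bits = plain - double
saved_gb_65b = saved_bits * 65e9 / 8 / 1e9  # bits -> bytes -> GB

print(f"overhead without double quantization: {plain:.3f} bits/param")
print(f"overhead with double quantization:    {double:.3f} bits/param")
print(f"saved: {saved_bits:.3f} bits/param, about {saved_gb_65b:.1f} GB for a 65B model")
```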
     
## Fine-Tuning Considerations
* You need access to a GPU, whether through Google Colab, AWS SageMaker, another cloud instance, or a local machine.
* Beyond the model weights themselves, training memory is consumed by:
      1. optimizer states
      2. gradients (including accumulated gradients)
      3. forward activations
* FULL FINE-TUNING is often NOT FEASIBLE on a single GPU because it is so memory intensive, and you can get nearly the same result with LoRA or QLoRA, which are PEFT (parameter-efficient fine-tuning) methods.
* To reduce VRAM usage we use QLoRA, which keeps the frozen base model in 4-bit precision and trains only small LoRA adapter weights.
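
To get a feel for the numbers, here is a rough back-of-the-envelope estimate. It assumes a 7-billion-parameter model and a common rule of thumb of 16 bytes per parameter for full fine-tuning with mixed-precision Adam. Activations are ignored, and the byte counts are assumptions for illustration, not measurements from this notebook.

```python
N_PARAMS = 7e9  # assumed parameter count for a "7B" model

def gigabytes(num_bytes):
    return num_bytes / 1e9

# Full fine-tuning with mixed-precision Adam (rule-of-thumb bytes per parameter)
full_ft_bytes_per_param = {
    "fp16 weights": 2,
    "fp16 gradients": 2,
    "fp32 master weights": 4,
    "fp32 Adam first moment": 4,
    "fp32 Adam second moment": 4,
}
full_ft_gb = gigabytes(N_PARAMS * sum(full_ft_bytes_per_param.values()))

# QLoRA: frozen base weights stored in 4 bits (0.5 bytes) per parameter
qlora_base_gb = gigabytes(N_PARAMS * 0.5)

print(f"full fine-tuning (weights + grads + optimizer): ~{full_ft_gb:.0f} GB")
print(f"4-bit frozen base weights for QLoRA:            ~{qlora_base_gb:.1f} GB")
```

The QLoRA figure excludes the adapter weights, their optimizer states, and activations, so real usage is higher, but the gap to full fine-tuning remains large.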

# Install Dependencies

In [2]:
!pip install -q accelerate peft bitsandbytes transformers trl

# Import Libraries

In [8]:
## Standard DS Imports
import pandas as pd
import numpy as np
import os 
import tqdm
import re

## ML imports
import torch
from datasets import load_dataset ## HF datasets
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
    TrainingArguments,
    pipeline,
    logging,
)
from peft import LoraConfig, PeftModel
from trl import SFTTrainer

# Llama-2 Prompt Template for Chat Models
* This is the template we need to use for fine-tuning a chat model.
* The Llama-2 templates are found at this link from Meta: https://www.llama.com/docs/model-cards-and-prompt-formats/meta-llama-2/

```
<s>[INST] <<SYS>>
System prompt
<</SYS>>

User prompt [/INST] Model answer </s>
```
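
A small helper makes the template concrete. This is a minimal sketch; the function name `build_llama2_prompt` and its arguments are hypothetical and not part of any library.

```python
def build_llama2_prompt(user_prompt, system_prompt=None, answer=None):
    """Format one turn in the Llama-2 chat template shown above."""
    if system_prompt:
        instruction = f"<<SYS>>\n{system_prompt}\n<</SYS>>\n\n{user_prompt}"
    else:
        instruction = user_prompt
    prompt = f"<s>[INST] {instruction} [/INST]"
    if answer is not None:
        # training samples include the model answer and the end-of-sequence token
        prompt += f" {answer} </s>"
    return prompt

print(build_llama2_prompt("What is QLoRA?", system_prompt="You are a helpful assistant."))
print(build_llama2_prompt("What is QLoRA?", answer="A 4-bit fine-tuning method."))
```

For inference, leave `answer` unset so the prompt ends at `[/INST]` and the model generates the reply.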

# Dataset and Formatting
* The dataset we will use to fine-tune the model is `timdettmers/openassistant-guanaco`, a subset of the Open Assistant dataset hosted on Hugging Face.
* Dataset card: https://huggingface.co/datasets/timdettmers/openassistant-guanaco
* As with any fine-tuning task, we need to reformat the dataset to match the format the model expects. Here that means converting the "Human" / "Assistant" turns into the Llama-2 prompt template above.
* A version already converted to the Llama-2 format is also available on Hugging Face: https://huggingface.co/datasets/gpjt/openassistant-guanaco-llama2-format

## Manually creating a Llama-2 dataset
* You can use the dataset above or create it yourself using this code:

In [11]:
#from datasets import load_dataset
#import re 

# load original hf dataset
dataset = load_dataset('timdettmers/openassistant-guanaco')

## shuffle and slice dataset
dataset = dataset['train'].shuffle(seed=42).select(range(1000))

## function to transform dataset
def transform_convo(source_text):
    """Function to transform conversational text into Llama-2 format"""
    convo_text = source_text['text']
    segments = convo_text.split('###')

    ## store formatted text in list
    reformatted_segments = []

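    ## iterate over (human, assistant) pairs; segments[0] is the text before the first '###'
    ## stop at len(segments) so a trailing human turn with no reply reaches the else branch below
    for i in range(1, len(segments), 2):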
        human_text = segments[i].strip().replace('Human:', '').strip()

        ## Check if there is corresponding assistant segment before processing
        if i + 1 < len(segments):
            assistant_text = segments[i+1].strip().replace('Assistant:', '').strip()

            # Apply new prompt template from Llama-2
            reformatted_segments.append(f'<s>[INST] {human_text} [/INST] {assistant_text} </s>')
        else:
            # handle case where there is no corresponding assistant segment
            reformatted_segments.append(f'<s>[INST] {human_text} [/INST] </s>')

    return {'text': ''.join(reformatted_segments)}


Using custom data configuration timdettmers--openassistant-guanaco-c21e85fd8b1a6952
Reusing dataset json (/home/sagemaker-user/.cache/huggingface/datasets/timdettmers___json/timdettmers--openassistant-guanaco-c21e85fd8b1a6952/0.0.0/ac0ca5f5289a6cf108e706efcf040422dbbfa8e658dee6a819f20d76bb84d26b)



Loading cached shuffled indices for dataset at /home/sagemaker-user/.cache/huggingface/datasets/timdettmers___json/timdettmers--openassistant-guanaco-c21e85fd8b1a6952/0.0.0/ac0ca5f5289a6cf108e706efcf040422dbbfa8e658dee6a819f20d76bb84d26b/cache-b60c7806cd24aa5d.arrow


In [12]:
## Apply transformation function using `.map` function from hugging face
transformed_data = dataset.map(transform_convo)

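
To see what the transformation does to a single record, here is a standalone sketch on a made-up sample. The sample text and the name `to_llama2_format` are hypothetical, and the function works on a plain string rather than a dataset row.

```python
def to_llama2_format(convo_text):
    """Convert '### Human: ...### Assistant: ...' text into the Llama-2 template."""
    segments = convo_text.split('###')
    formatted = []
    # segments[0] is whatever precedes the first '###' (normally empty)
    for i in range(1, len(segments), 2):
        human_text = segments[i].strip().replace('Human:', '').strip()
        if i + 1 < len(segments):
            assistant_text = segments[i + 1].strip().replace('Assistant:', '').strip()
            formatted.append(f'<s>[INST] {human_text} [/INST] {assistant_text} </s>')
        else:
            # trailing human turn with no assistant reply
            formatted.append(f'<s>[INST] {human_text} [/INST] </s>')
    return ''.join(formatted)

sample = "### Human: What is a llama?### Assistant: A domesticated South American camelid."
print(to_llama2_format(sample))
```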

# Workflow
1. Load the `Llama-2-7b-chat-hf` chat model from Hugging Face.
2. Train the model on ~1,000 samples from the guanaco dataset, converted to the Llama-2 prompt template format. This was the original dataset used in the QLoRA paper.
3. We will use these parameters:
   * `Rank = 64`
   * `Alpha = 16`

* These LoRA parameters are combined with 4-bit quantization: the base model's 16-bit weights are loaded as 4-bit NormalFloat (NF4), and only the LoRA adapter weights are trained.

# How to choose LoRA parameters?
* Note: I bring this blurb with me every time I fine-tune using PEFT, as it is very helpful to remember the mathematical concepts at play here.

1. **Rank (r)**
* There is no "magic number" for the LoRA rank, but many people start from `r=8`, a value used in the original LoRA paper that works for most problems and might be called the "sweet spot".

Two things to remember:

    * 1) If your dataset is significantly different from, and more complex than, the data the model was pretrained on, it is best practice to use a HIGH rank value: `e.g. 64–256`
    * 2) If the problem you’re adapting a pre-trained model to is relatively simple and doesn’t involve a complex new dataset that the model hasn’t encountered before, it is best practice to use LOWER rank values: `e.g. 4–12`

2. **Alpha (a)**
* General rule of thumb about alpha:

  1) HIGHER “alpha” scales up the low-rank update (the update is multiplied by `alpha / r`), so the adapter has more influence on the output.
  2) LOWER “alpha” reduces that influence, making the model rely more on the original parameters.

* Adjusting “alpha” helps balance fitting the new data against drifting too far from the pretrained model.

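* How do we decide on a good alpha for our problem?
    * Usually we choose an **alpha value that is twice as large as the rank** when fine-tuning LLMs (note that this is different when working with diffusion models). This notebook instead keeps `alpha = 16` with `r = 64`, which gives a scaling factor of `16 / 64 = 0.25`.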

* In the original LoRA paper, the authors use `α=16` for their experiments.
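
The effect of rank and alpha can be worked out with simple arithmetic. The sketch below assumes Llama-2-7B shapes (hidden size 4096, 32 layers) and adapters on the query and value projections only; those shapes and target modules are my assumptions, and the helper names are hypothetical.

```python
def lora_scaling(alpha, r):
    """Factor applied to the low-rank update B @ A."""
    return alpha / r

def lora_params_per_matrix(d_out, d_in, r):
    """Trainable parameters LoRA adds to one (d_out x d_in) weight matrix."""
    return r * (d_in + d_out)

# Assumed Llama-2-7B shapes: hidden size 4096, 32 layers,
# adapters on the query and value projections only.
hidden, n_layers, n_target_matrices = 4096, 32, 2

for r, alpha in [(8, 16), (64, 16), (64, 128)]:
    n_trainable = lora_params_per_matrix(hidden, hidden, r) * n_target_matrices * n_layers
    print(f"r={r:>3} alpha={alpha:>3} scaling={lora_scaling(alpha, r):.2f} "
          f"trainable params={n_trainable:,}")
```

Raising the rank grows the number of trainable parameters linearly, while alpha only changes how strongly the learned update is weighted.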

# Load Model, Dataset, and QLoRA parameters

## 1. Setup fine-tuning parameters

In [13]:
## 1. Llama model
model_ckpt = 'NousResearch/Llama-2-7b-chat-hf'


## 2. Instruction Dataset for fine-tuning
## even though we transformed the dataset above for exercise purposes, we will use the dataset with llama format
## direct from hugging face
dataset_name = 'mlabonne/guanaco-llama2-1k'

## 3. After fine-tuning the model the new name will be this below
new_model = "Llama-2-7b-chat-finetune"

######################################################
# QLoRA Parameters for fine-tuning

# LoRA attention dim (matrix rank 'r')
lora_r = 64

# alpha parameter for LoRA scaling
lora_alpha = 16

# dropout probability for LoRA model layers
lora_dropout = 0.1

#####################################################
# bitsandbytes parameters

# Activate 4-bit precision base model loading
use_4bit=True

# compute dtype for 4-bit base models
bnb_4bit_compute_dtype="float16"

# quantization type (fp4 or nf4)
bnb_4bit_quant_type="nf4" ## 4-bit normal float

# activate nested quantization for 4-bit base models (double quantization)
use_nested_quant=False

#####################################################
# TrainingArguments Parameters

# 1. output directory for model preds and checkpoints
output_dir = "./model_results"

# 2. number of EPOCHS to train
num_train_epochs = 1

# 3. enable fp16/bf16 training (setting bf16 to True with A100 GPU) -- bfloat is brain floating point
fp16 = False
bf16 = False

# 4. batch size per GPU for training LLM
per_device_train_batch_size = 4

# 5. batch size per GPU for evaluating LLM
per_device_eval_batch_size = 4

# 6. Number of forward/backward passes to accumulate gradients over before each optimizer update
gradient_accumulation_steps = 1

# 7. Enable gradient checkpoints
gradient_checkpointing = True

# 8. Maximum gradient norm (gradient clipping)
max_grad_norm=0.3

# 9. learning rate (AdamW optimizer usually for fine-tuning)
learning_rate = 2e-4

# 10. weight decay to apply to all layers except bias/LayerNorm weights
weight_decay = 0.001

# 11. Optimizer for fine-tuning
optim = "paged_adamw_32bit" ## specific for QLoRA: https://github.com/artidoro/qlora

# 12. learning rate schedule
lr_scheduler_type = "cosine" ## smooth cosine decay after warmup, no restarts (warm restarts would be "cosine_with_restarts")

# 13. num of training steps (overrides num_train_epochs)
max_steps = -1

# 14. ratio of steps for a linear warmup (0 to learning rate)
warmup_ratio = 0.03

## Group sequences into batches with same length
## Saves memory and speeds up training!!
group_by_length = True

# Save checkpoint every X update steps
save_steps = 0

# log every X update steps
logging_steps = 25

#####################################################################
# Supervised Fine-Tuning (SFT) Parameters

# Max sequence length to use 
max_seq_length = None

# pack multiple short examples in same input sequence to increase efficiency
packing = False

# load entire model on GPU 0
device_map = {"": 0} 
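
To visualize what `learning_rate`, `warmup_ratio`, and the `cosine` schedule do together, here is a minimal pure-Python sketch of linear warmup followed by cosine decay. It reflects my understanding of the schedule's shape; the library implementation may differ in details such as how warmup steps are rounded, and `lr_at_step` is a hypothetical helper.

```python
import math

def lr_at_step(step, total_steps, base_lr=2e-4, warmup_ratio=0.03):
    """Linear warmup from 0 to base_lr, then cosine decay to 0 (no restarts)."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

total_steps = 250  # assumed: 1,000 samples / batch size 4, for 1 epoch
for step in (0, 4, 8, 129, 250):
    print(f"step {step:>3}: lr = {lr_at_step(step, total_steps):.2e}")
```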

## 2. Load Dataset
* As mentioned above, this dataset is loaded directly from Hugging Face already in the Llama-2 format. If it were NOT already formatted, we would do the following here:
      1) Reformat the prompts to match the template the model expects.
      2) Remove duplicated text and handle other miscellaneous data wrangling.
      3) ...any other data wrangling as necessary

In [14]:
## load dataset
dataset = load_dataset(dataset_name, split='train')

Using custom data configuration mlabonne--guanaco-llama2-1k-f1f1134768f90029


Downloading and preparing dataset parquet/mlabonne--guanaco-llama2-1k to /home/sagemaker-user/.cache/huggingface/datasets/mlabonne___parquet/mlabonne--guanaco-llama2-1k-f1f1134768f90029/0.0.0/0b6d5799bb726b24ad7fc7be720c170d8e497f575d02d47537de9a5bac074901...



Dataset parquet downloaded and prepared to /home/sagemaker-user/.cache/huggingface/datasets/mlabonne___parquet/mlabonne--guanaco-llama2-1k-f1f1134768f90029/0.0.0/0b6d5799bb726b24ad7fc7be720c170d8e497f575d02d47537de9a5bac074901. Subsequent calls will reuse this data.


## 3. Configure QLoRA via bitsandbytes
* 4-bit precision is configured here, which is what makes this QLoRA.

In [15]:
## build the bitsandbytes 4-bit quantization config (the model and tokenizer are loaded in the next step)
compute_dtype = getattr(torch, bnb_4bit_compute_dtype)


bnb_config = BitsAndBytesConfig(
    load_in_4bit=use_4bit,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=use_nested_quant,
)

## check GPU compatibility with bfloat16
if compute_dtype == torch.bfloat16 and use_4bit:
    major, _ = torch.cuda.get_device_capability()
    if major >= 8:
        print("=" * 80)
        print("Your GPU supports bfloat16: accelerate training with bf16=True")
        print("=" * 80)
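
To build intuition for what 4-bit loading does to the weights, here is a toy sketch of block-wise absmax quantization to a 16-value codebook. The codebook below is an evenly spaced stand-in that I chose for simplicity; it is NOT the real NF4 codebook, which places its 16 levels according to a normal distribution. All names are hypothetical and this is not how bitsandbytes is implemented internally.

```python
def quantize_block(weights, codebook):
    """Absmax-scale one block to [-1, 1] and store the index of the nearest codebook value."""
    absmax = max(abs(w) for w in weights) or 1.0  # quantization constant for this block
    indices = [
        min(range(len(codebook)), key=lambda i: abs(codebook[i] - w / absmax))
        for w in weights
    ]
    return indices, absmax

def dequantize_block(indices, absmax, codebook):
    """Recover approximate weights from 4-bit indices and the block's constant."""
    return [codebook[i] * absmax for i in indices]

# Stand-in codebook: 16 evenly spaced levels in [-1, 1] (not the real NF4 levels).
codebook = [-1 + 2 * i / 15 for i in range(16)]

block = [0.12, -0.40, 0.03, 0.25, -0.07, 0.31, -0.18, 0.02]
indices, absmax = quantize_block(block, codebook)
restored = dequantize_block(indices, absmax, codebook)

print("4-bit indices:", indices)
print("max abs error:", max(abs(a - b) for a, b in zip(block, restored)))
```

Each weight is stored as a 4-bit index, and each block keeps one constant (`absmax`). Those per-block constants are what double quantization compresses further.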

## 4. Load Base Model and Tokenizer from hugging face

In [16]:
## load base model
model = AutoModelForCausalLM.from_pretrained(
    model_ckpt,
    quantization_config=bnb_config, 
    device_map=device_map, ## map GPU
)
## set model configs
model.config.use_cache = False
model.config.pretraining_tp = 1 ## 1 = use standard linear layers (values > 1 emulate the tensor-parallel layout used in pretraining)


## load model tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_ckpt, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token # reuse the end-of-sequence token as the padding token
tokenizer.padding_side = "right" ## fix overflow issue with fp16 training --> also pad right when using CausalLM





## 5. Load LoRA Config

In [18]:
## load LoRA configuration
peft_config = LoraConfig(
    lora_alpha=lora_alpha, ## alpha params
    lora_dropout=lora_dropout,
    r=lora_r, #rank of matrix
    bias="none",
    task_type="CAUSAL_LM" ## causal (decoder-only) language modeling; see peft.TaskType for the other supported task types
)
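
What the LoRA configuration means for a single layer can be shown with a toy forward pass in plain Python. The sizes are tiny and the matrices are random, purely for illustration; `matvec` and `lora_forward` are hypothetical helpers, not PEFT functions.

```python
import random

def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, alpha, r):
    """h = W x + (alpha / r) * B (A x); W stays frozen, only A and B would be trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / r) * u for b, u in zip(base, update)]

random.seed(0)
d, r, alpha = 6, 2, 16  # tiny toy sizes; the notebook uses r=64, alpha=16
W = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]     # frozen weight (d x d)
A = [[random.gauss(0, 0.01) for _ in range(d)] for _ in range(r)]  # r x d, small random init
B = [[0.0] * r for _ in range(d)]                                  # d x r, zero init
x = [1.0] * d

# With B initialized to zero the adapter contributes nothing yet,
# so the adapted layer starts out identical to the base layer.
print(lora_forward(W, A, B, x, alpha, r) == matvec(W, x))
```

Training then moves `A` and `B` away from this starting point while `W` never changes, which is why only the small adapter needs to be saved afterwards.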

## 6. Setup Training Parameters

In [19]:
## train params
training_args = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=num_train_epochs,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim, ## optimizer
    save_steps=save_steps,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    weight_decay=weight_decay,
    fp16=fp16,
    bf16=bf16,
    max_grad_norm=max_grad_norm,
    max_steps=max_steps,
    warmup_ratio=warmup_ratio,
    group_by_length=group_by_length,
    lr_scheduler_type=lr_scheduler_type,
    report_to="tensorboard",

)

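## set supervised fine-tuning params
## (these keyword names follow older TRL releases; newer releases move several of them into SFTConfig)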
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_args,
    packing=packing
)




## 7. Train Model

In [21]:
## train the model!
trainer.train()

You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Step | Training Loss |
|------|---------------|
| 25 | 1.4077 |
| 50 | 1.6504 |
| 75 | 1.2144 |
| 100 | 1.4445 |
| 125 | 1.1763 |
| 150 | 1.3663 |
| 175 | 1.1735 |
| 200 | 1.4669 |
| 225 | 1.1574 |
| 250 | 1.5419 |


TrainOutput(global_step=250, training_loss=1.3599420700073241, metrics={'train_runtime': 588.9108, 'train_samples_per_second': 1.698, 'train_steps_per_second': 0.425, 'total_flos': 8755214190673920.0, 'train_loss': 1.3599420700073241, 'epoch': 1.0})
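
The step count and throughput in `TrainOutput` follow directly from the training parameters. The sketch below assumes the training split has 1,000 rows (suggested by the "1k" in the dataset name) and a single GPU.

```python
import math

num_samples = 1000            # rows in the training split (assumed from the "1k" dataset name)
per_device_batch_size = 4
gradient_accumulation_steps = 1
num_epochs = 1
train_runtime_s = 588.9108    # from TrainOutput above

effective_batch = per_device_batch_size * gradient_accumulation_steps  # single GPU
steps_per_epoch = math.ceil(num_samples / effective_batch)
total_steps = steps_per_epoch * num_epochs

print("optimizer steps:", total_steps)
print("samples/second:", round(num_samples * num_epochs / train_runtime_s, 3))
print("steps/second:  ", round(total_steps / train_runtime_s, 3))
```

These reproduce `global_step=250`, `train_samples_per_second=1.698`, and `train_steps_per_second=0.425` from the output above.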

## 8. Save Fine-Tuned Model

In [22]:
## save model
trainer.model.save_pretrained(new_model)

## 9. Tensorboard
* TensorBoard lets us:

  1. Track and visualize metrics such as loss and accuracy
  2. Visualize the model graph (ops and layers)
  3. View histograms of weights, biases, or other tensors as they change over time

In [23]:
%load_ext tensorboard 
%tensorboard --logdir model_results/runs

# Inference on Trained Model
* The guanaco dataset includes multiple languages as follows:
```
Japanese (Ja-JP - recently updated) 7,485 entries.
Simplified Chinese (zh-Hans): 5,439 entries.
Traditional Chinese (Taiwan) (zh-Hant-TW): 9,322 entries.
Traditional Chinese (Hong Kong) (zh-Hant-HK): 9,954 entries.
English: 20,024 entries, not from Alpaca.
```
* Japanese is one of the languages, so I took the English prompt "What is the difference between Cats and Dogs?" and translated it to Japanese to ask the model.

In [27]:
## ignore warnings
logging.set_verbosity(logging.CRITICAL)

# run text generation hf pipeline with new model -- japanese prompt
prompt = "猫と犬の違いは何ですか?"
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=400)
result = pipe(f"<s>[INST] {prompt} [/INST]") ## llama-2 prompt template
print(result[0]['generated_text'])

<s>[INST] 猫と犬の違いは何ですか? [/INST] 猫と犬は、生物学的には同一の生物であるが、生活様式や行動などの面では異なる特徴を持つ。具体的には以下のような違いがある。

1. 構造: 猫は、犬よりも小さな体を持つ。犬は、猫よりも大きな体を持つ。
2. 毛色: 猫は、犬よりも多様な毛色を持つ。犬は、主に毛色が白色や斑点がある。
3. 毛質: 猫は、犬よりも柔らかい毛を持つ。犬は、猫よりも硬い毛を持つ。
4. 体重: 猫は、犬よりも軽い体重を持つ。犬は、猫よりも重い体重を持つ。
5. 生活様式: 猫は、主に寝ている生活様式を持つ。犬は、主に活動的な生活様式を持つ。
6. 食事: 猫は、主に魚や肉を食べる。犬
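
As the output above shows, `generated_text` contains the prompt followed by the completion. A small helper can strip the prompt so only the reply remains. This is a sketch; `extract_answer` is a hypothetical helper and the example string is made up.

```python
def extract_answer(generated_text, end_of_instruction="[/INST]"):
    """Return only the model's reply from a Llama-2 formatted generation."""
    # keep the text after the LAST [/INST] so multi-turn prompts also work
    _, _, answer = generated_text.rpartition(end_of_instruction)
    return answer.replace("</s>", "").strip()

example = "<s>[INST] What is QLoRA? [/INST] A 4-bit fine-tuning method. </s>"
print(extract_answer(example))
```

Note that `max_length` counts the prompt tokens as well as the generated ones, which is one reason a reply can stop mid-sentence. Passing `max_new_tokens` instead limits only the generated part.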


# Use Llama to translate the result back to English

In [29]:
# run text generation hf pipeline with new model -- translation prompt
translate_text = """
1. 構造: 猫は、犬よりも小さな体を持つ。犬は、猫よりも大きな体を持つ。
2. 毛色: 猫は、犬よりも多様な毛色を持つ。犬は、主に毛色が白色や斑点がある。
3. 毛質: 猫は、犬よりも柔らかい毛を持つ。犬は、猫よりも硬い毛を持つ。
4. 体重: 猫は、犬よりも軽い体重を持つ。犬は、猫よりも重い体重を持つ。
5. 生活様式: 猫は、主に寝ている生活様式を持つ。犬は、主に活動的な生活様式を持つ。
6. 食事: 猫は、主に魚や肉を食べる。犬
"""
prompt = f"Can you translate this {translate_text} from japanese to english?"
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=600)
result = pipe(f"<s>[INST] {prompt} [/INST]") ## llama-2 prompt template
print(result[0]['generated_text'])

<s>[INST] Can you translate this 
1. 構造: 猫は、犬よりも小さな体を持つ。犬は、猫よりも大きな体を持つ。
2. 毛色: 猫は、犬よりも多様な毛色を持つ。犬は、主に毛色が白色や斑点がある。
3. 毛質: 猫は、犬よりも柔らかい毛を持つ。犬は、猫よりも硬い毛を持つ。
4. 体重: 猫は、犬よりも軽い体重を持つ。犬は、猫よりも重い体重を持つ。
5. 生活様式: 猫は、主に寝ている生活様式を持つ。犬は、主に活動的な生活様式を持つ。
6. 食事: 猫は、主に魚や肉を食べる。犬
 from japanese to english? [/INST] Sure, here are the translations of the six points from Japanese to English:

1. 構造: 猫は、犬よりも小さな体を持つ。犬は、猫よりも大きな体を持つ。

Translation: Structure: Cats have smaller bodies than dogs. Dogs have larger bodies than cats.

2. 毛色: 猫は、犬よりも多様な毛色を持つ。犬は、主に毛色が白色や斑点がある。

Translation: Hair color: Cats have more diverse hair colors than dogs. Dogs are mainly white or have spots.

3. 毛質: 猫は、犬よりも柔らかい毛を持つ。犬は、猫よりも硬い毛を持つ。

Translation: Hair quality: Cats have softer hair than dogs. Dogs have harder hair than cats.

4. 体重: 猫は、犬よ


* For comparison, here are Google Translate's results:
```
1. Structure: Cats have smaller bodies than dogs. Dogs have larger bodies than cats.
2. Coat Color: Cats have a wider variety of coat colors than dogs. Dogs mainly have white or spotted coats.
3. Coat quality: Cats have softer coats than dogs. Dogs have harder fur than cats.
4. Weight: Cats weigh less than dogs. Dogs weigh more than cats.
5. Lifestyle: Cats have a primarily sleeping lifestyle. Dogs have a mainly active lifestyle.
6. Diet: Cats mainly eat fish and meat. dog
```
While not 100% the same, the model's translation was very similar for the items it completed before its output was cut off. We can certainly try other multilingual examples and fine-tune on more specific examples to improve the model.

# Final Steps

In [31]:
## empty VRAM
del model
del pipe
del trainer
import gc ## garbage collection
gc.collect()
torch.cuda.empty_cache() ## release cached GPU memory that is no longer referenced
gc.collect()

0

# Push model to HF hub

In [33]:
import locale
locale.getpreferredencoding = lambda: "UTF-8"

In [None]:
!huggingface-cli login

In [None]:
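## push to hub
## (note: `model` was deleted in the cleanup cell above, so reload the model before uncommenting these lines)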
#model.push_to_hub("adamNLP/Llama-2-7b-chat-finetune",check_pr=True)

#tokenizer.push_to_hub("adamNLP/Llama-2-7b-chat-finetune", check_pr=True)
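
Because `model` was deleted during cleanup and only the LoRA adapter was saved, one common approach before pushing is to reload the base model in fp16 and merge the adapter into it. The function below is a minimal sketch of that approach under the assumption of a single GPU; `merge_adapter_into_base` is a hypothetical helper and the repo id in the usage comment is a placeholder.

```python
def merge_adapter_into_base(base_ckpt, adapter_dir):
    """Reload the base model in fp16 and fold the saved LoRA adapter into its weights."""
    # heavy imports kept inside the function so that defining it is cheap
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import PeftModel

    base_model = AutoModelForCausalLM.from_pretrained(
        base_ckpt,
        torch_dtype=torch.float16,
        low_cpu_mem_usage=True,
        device_map={"": 0},  # assumes a single GPU, as in the notebook
    )
    model = PeftModel.from_pretrained(base_model, adapter_dir)
    model = model.merge_and_unload()  # returns a plain transformers model

    tokenizer = AutoTokenizer.from_pretrained(base_ckpt)
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.padding_side = "right"
    return model, tokenizer

# Example usage (downloads the base model, so it is left commented out):
# model, tokenizer = merge_adapter_into_base("NousResearch/Llama-2-7b-chat-hf", "Llama-2-7b-chat-finetune")
# model.push_to_hub("your-username/Llama-2-7b-chat-finetune")      # placeholder repo id
# tokenizer.push_to_hub("your-username/Llama-2-7b-chat-finetune")  # placeholder repo id
```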