# Hands-On QLoRA
Learn how to fine-tune an LLM on a custom dataset using QLoRA.

Let’s fine-tune Meta’s Llama 3.1 model on the openai/gsm8k dataset using QLoRA.

# Install the dependencies
First, let’s install the libraries required for fine-tuning. We pin each library to a specific version (the latest at the time of writing).

In [1]:
!pip install transformers==4.44.1
!pip install accelerate==0.21.0
!pip install bitsandbytes==0.43.3
!pip install datasets==2.21.0
!pip install trl==0.9.6
!pip install peft==0.12.0
!pip install -U "huggingface_hub[cli]"

In [None]:
from huggingface_hub import login
import os
login(token=os.getenv("HF_TOKEN"))

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: fineGrained).
The token `vllm` has been saved to /root/.cache/huggingface/stored_tokens
Your token has been saved to /root/.cache/huggingface/token
Login successful.
The current active token is: `vllm`


# Implementing NF4 quantization
Now, let’s apply the NF4 (4-bit NormalFloat) quantization to the Llama 3.1 model.

# Bitsandbytes configuration
We specify the configurations of NF4 quantization using the BitsAndBytesConfig class from the transformers library.

In [3]:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

- Line 1: We import the AutoModelForCausalLM, AutoTokenizer, and BitsAndBytesConfig classes from the transformers library.

- Line 2: We import PyTorch, which provides the torch.bfloat16 data type used in the configuration.

- Lines 4–9: We define the NF4 quantization settings using the BitsAndBytesConfig class from the transformers library.

- Line 5: We instruct the model to be loaded with 4-bit precision.

- Line 6: We enable double quantization, which also quantizes the quantization constants to save additional memory.

- Line 7: We specify the type of quantization. In this case, we use nf4 quantization to implement QLoRA.

- Line 8: We specify the data type used for computation while the model runs in 4-bit quantized mode. We use torch.bfloat16.
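To build some intuition for what the nf4 setting does, here is a minimal pure-Python sketch of block-wise 4-bit quantization. It is an illustration under stated assumptions, not the bitsandbytes implementation: the 16 level values are rounded approximations of the NF4 table, the block is tiny, and `quantize_block` and `dequantize_block` are hypothetical helper names.

```python
# Approximate NF4 levels (rounded); the exact table is defined in bitsandbytes.
NF4_LEVELS = [
    -1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
    0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0,
]

def quantize_block(weights):
    """Return (absmax scale, 4-bit codes) for one block of weights."""
    scale = max(abs(w) for w in weights) or 1.0  # avoid dividing by zero
    codes = [
        # Pick the index of the level closest to the normalized weight.
        min(range(16), key=lambda i: abs(NF4_LEVELS[i] - w / scale))
        for w in weights
    ]
    return scale, codes

def dequantize_block(scale, codes):
    """Map 4-bit codes back to approximate weights."""
    return [NF4_LEVELS[c] * scale for c in codes]

block = [0.31, -0.12, 0.05, -0.44, 0.0, 0.22]
scale, codes = quantize_block(block)
restored = dequantize_block(scale, codes)
print(scale, codes)
print([round(w, 3) for w in restored])
```

Each weight is stored as an index into the 16-level table plus one shared scale per block. Double quantization goes one step further and quantizes those per-block scales as well.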

# Load the model
Now, let’s load the pretrained model with quantization and see how it responds to a math word problem.

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [5]:
!cp /content/drive/MyDrive/meta-llamaLlama-3.1-8B.zip /content/

In [6]:
!unzip /content/meta-llamaLlama-3.1-8B.zip -d /content/

Archive:  /content/meta-llamaLlama-3.1-8B.zip
   creating: /content/meta-llamaLlama-3.1-8B/
  inflating: /content/meta-llamaLlama-3.1-8B/config.json  
  inflating: /content/meta-llamaLlama-3.1-8B/generation_config.json  
  inflating: /content/meta-llamaLlama-3.1-8B/gitattributes  
  inflating: /content/meta-llamaLlama-3.1-8B/LICENSE  
  inflating: /content/meta-llamaLlama-3.1-8B/model-00001-of-00004.safetensors  
  inflating: /content/meta-llamaLlama-3.1-8B/model-00002-of-00004.safetensors  
  inflating: /content/meta-llamaLlama-3.1-8B/model-00003-of-00004.safetensors  
  inflating: /content/meta-llamaLlama-3.1-8B/model-00004-of-00004.safetensors  
  inflating: /content/meta-llamaLlama-3.1-8B/model.safetensors.index.json  
  inflating: /content/meta-llamaLlama-3.1-8B/README.md  
  inflating: /content/meta-llamaLlama-3.1-8B/special_tokens_map.json  
  inflating: /content/meta-llamaLlama-3.1-8B/tokenizer.json  
  inflating: /content/meta-llamaLlama-3.1-8B/tokenizer_config.json  
  inflat

In [4]:
model_path = "/content/meta-llamaLlama-3.1-8B"


quantized_model = AutoModelForCausalLM.from_pretrained(
    model_path,
    quantization_config=bnb_config,
    device_map="auto"
)

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

# Memory footprint
Now, let’s check the memory footprint of quantized_model.

In [6]:
print(quantized_model.get_memory_footprint())

5591548160


With 4-bit quantization, the model takes 5,591,548,160 bytes, which is about 5.2 GiB (5.6 GB), down from the 8.45 GB required with 8-bit quantization.
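The byte count printed above can be converted by hand. A small sketch, using only the value reported by get_memory_footprint():

```python
footprint_bytes = 5_591_548_160  # value printed by get_memory_footprint() above

gib = footprint_bytes / 1024**3  # binary gigabytes (GiB)
gb = footprint_bytes / 1e9       # decimal gigabytes (GB)

print(f"{gib:.2f} GiB, {gb:.2f} GB")
```

The two units differ by roughly 7%, which is why the same footprint can be quoted as either about 5.2 or about 5.6.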

# Data type of model’s parameters
Let’s check the data types of the parameters of quantized_model.

In [8]:
params_dtypes = [param.dtype for param in quantized_model.parameters()]
print("Parameter's dtypes:", params_dtypes)

Parameter's dtypes: [torch.float16, torch.uint8, torch.uint8, torch.uint8, torch.uint8, torch.uint8, torch.uint8, torch.uint8, torch.float16, torch.float16, torch.uint8, torch.uint8, torch.uint8, torch.uint8, torch.uint8, torch.uint8, torch.uint8, torch.float16, torch.float16, torch.uint8, torch.uint8, torch.uint8, torch.uint8, torch.uint8, torch.uint8, torch.uint8, torch.float16, torch.float16, torch.uint8, torch.uint8, torch.uint8, torch.uint8, torch.uint8, torch.uint8, torch.uint8, torch.float16, torch.float16, torch.uint8, torch.uint8, torch.uint8, torch.uint8, torch.uint8, torch.uint8, torch.uint8, torch.float16, torch.float16, torch.uint8, torch.uint8, torch.uint8, torch.uint8, torch.uint8, torch.uint8, torch.uint8, torch.float16, torch.float16, torch.uint8, torch.uint8, torch.uint8, torch.uint8, torch.uint8, torch.uint8, torch.uint8, torch.float16, torch.float16, torch.uint8, torch.uint8, torch.uint8, torch.uint8, torch.uint8, torch.uint8, torch.uint8, torch.float16, torch.float
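The torch.uint8 entries are the quantized weights: two 4-bit values are packed into each byte. The remaining parameters stay in 16-bit floating point. Because the full list is long, a count per data type is easier to read. The sketch below uses plain strings as stand-ins so that it runs without the model; `summarize_dtypes` is a hypothetical helper name.

```python
from collections import Counter

def summarize_dtypes(dtypes):
    """Count how often each dtype occurs, most common first."""
    return Counter(str(d) for d in dtypes).most_common()

# Stand-in values for illustration. With the model loaded, pass
# (p.dtype for p in quantized_model.parameters()) instead.
sample = ["torch.float16", "torch.uint8", "torch.uint8",
          "torch.float16", "torch.uint8"]
print(summarize_dtypes(sample))
```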

# Inference
Now, let's prompt the quantized model before fine-tuning and see how it responds.

In [9]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path)  # tokenizer of the base model

input_txt = "Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?"

# Tokenize the prompt, move it to the GPU, and generate up to 100 new tokens
inputs = tokenizer(input_txt, return_tensors="pt").to("cuda")
output = quantized_model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(output[0], skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May? A. 48 B. 24 C. 72 D. 96
Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May? A. 48 B. 24 C. 72 D. 96


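We can see that the base model did not actually solve the problem: it continued the prompt with multiple-choice options and then repeated the question.

Before moving on, we save the quantized model so that it can be reloaded later without quantizing it again.

In [24]:
# Save the quantized model to a specific folder
save_path = "/content/quantized_model"
quantized_model.save_pretrained(save_path)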

# Training the model
We have now applied NF4 quantization to the model. Let’s use LoRA to fine-tune it on the openai/gsm8k dataset so that it answers with step-by-step mathematical expressions.
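As a reminder of what LoRA changes, here is a minimal pure-Python sketch of the forward pass of a single adapted layer: the frozen weight W is left untouched, and a low-rank update B·A, scaled by alpha/r, is added to its output. The tiny matrices and the names `matvec` and `lora_forward` are illustrative assumptions, not part of peft.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, alpha, r):
    """Compute W x + (alpha / r) * B (A x)."""
    base = matvec(W, x)               # frozen pretrained path
    update = matvec(B, matvec(A, x))  # trainable low-rank path
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

W = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]  # frozen 2x3 weight
A = [[0.5, -0.5, 1.0]]                    # r x in_features, with r = 1
B_init = [[0.0], [0.0]]                   # out_features x r, zero at initialization
B_trained = [[1.0], [2.0]]                # made-up values after some training
x = [1.0, 2.0, 3.0]

print(lora_forward(W, A, B_init, x, alpha=1, r=1))     # same as W x
print(lora_forward(W, A, B_trained, x, alpha=1, r=1))  # W x plus the update
```

Because B starts at zero, the adapted layer initially behaves exactly like the frozen one. Only A and B receive gradients during training, which is what keeps fine-tuning a 4-bit base model affordable.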

In [1]:
!pip install transformers==4.44.1
!pip install accelerate==0.30.0
!pip install trl==0.9.6
!pip install peft==0.12.0
!pip install bitsandbytes==0.43.3
!pip install datasets==2.21.0
!pip install -U "huggingface_hub[cli]"

In [None]:
from huggingface_hub import login
import os
login(token=os.getenv("HF_TOKEN"))


In [3]:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

# Path of the folder where the quantized model was saved
load_path = "/content/quantized_model"

# BitsAndBytes configuration for loading the quantized model
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,  # the model was saved in 4-bit; use load_in_8bit=True for an 8-bit model
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

# Load the model from the saved folder
quantized_model = AutoModelForCausalLM.from_pretrained(
    load_path,
    quantization_config=bnb_config,
    device_map="auto"
)

# Load the tokenizer of the base model
tokenizer = AutoTokenizer.from_pretrained("/content/meta-llamaLlama-3.1-8B")

Unused kwargs: ['_load_in_4bit', '_load_in_8bit', 'quant_method']. These kwargs are not used in <class 'transformers.utils.quantization_config.BitsAndBytesConfig'>.


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [4]:
from datasets import load_dataset
from peft import LoraConfig
import transformers
from transformers import TrainingArguments
import os
from trl import SFTTrainer

# Preprocess the data

dataset = "openai/gsm8k"
data = load_dataset(dataset, "main")

# padding="max_length" needs a padding token, so reuse the end-of-sequence token
tokenizer.pad_token = tokenizer.eos_token
# Tokenize each question/answer pair, padded or truncated to 100 tokens
data = data.map(lambda samples: tokenizer(samples['question'], samples['answer'], truncation=True, padding="max_length", max_length=100), batched=True)
# Train on the first 400 examples only
train_sample = data['train'].select(range(400))

# LoRA configuration

lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules=['q_proj', 'v_proj'],
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM"
)


# Set the training arguments

working_dir = "/content"
output_directory = os.path.join(working_dir, "qlora")

training_args = TrainingArguments(
    output_dir=output_directory,
    auto_find_batch_size=True,
    learning_rate=3e-4,
    num_train_epochs=5
)

# Set up the trainer

trainer = SFTTrainer(
    model=quantized_model,
    args=training_args,
    train_dataset=train_sample,
    peft_config=lora_config,
    tokenizer=tokenizer,
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False)
)

# Train the model

trainer.train()

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Downloading readme:   0%|          | 0.00/7.94k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/2.31M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/419k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/7473 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1319 [00:00<?, ? examples/s]

Map:   0%|          | 0/7473 [00:00<?, ? examples/s]

Map:   0%|          | 0/1319 [00:00<?, ? examples/s]




TrainOutput(global_step=250, training_loss=1.31228466796875, metrics={'train_runtime': 425.7826, 'train_samples_per_second': 4.697, 'train_steps_per_second': 0.587, 'total_flos': 9014088499200000.0, 'train_loss': 1.31228466796875, 'epoch': 5.0})
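The numbers in TrainOutput can be cross-checked against the training setup. The sketch below works backwards from the reported step count. The batch size it infers is consistent with the TrainingArguments default of 8 per device, assuming a single GPU and no gradient accumulation.

```python
import math

num_samples = 400         # data['train'].select(range(400))
num_epochs = 5            # num_train_epochs
global_step = 250         # from TrainOutput
train_runtime = 425.7826  # seconds, from TrainOutput

steps_per_epoch = global_step // num_epochs
batch_size = math.ceil(num_samples / steps_per_epoch)
samples_per_second = num_samples * num_epochs / train_runtime

print(steps_per_epoch, batch_size, round(samples_per_second, 3))
```

The computed throughput matches the train_samples_per_second value that the trainer reported.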

After training, we can save the fine-tuned model on our local machines for later use.

In [5]:
# Save the fine-tuned LoRA adapter
model_path = os.path.join(output_directory, "qlora_model")
trainer.model.save_pretrained(model_path)
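Only the LoRA adapter weights are written here, which is why the saved folder is small compared with the base model. As a rough sketch, the number of trainable parameters can be estimated from the LoRA settings used above (r=16 on q_proj and v_proj). The architecture dimensions below are assumptions about Llama 3.1 8B and should be checked against the model's config.json; `lora_params` is a hypothetical helper name.

```python
# Assumed Llama 3.1 8B dimensions; verify against config.json.
hidden_size = 4096
num_layers = 32
num_kv_heads = 8
head_dim = 128
r = 16  # LoRA rank from lora_config

def lora_params(in_features, out_features, rank):
    """A is rank x in_features and B is out_features x rank."""
    return rank * (in_features + out_features)

q_proj = lora_params(hidden_size, hidden_size, r)
v_proj = lora_params(hidden_size, num_kv_heads * head_dim, r)
total = num_layers * (q_proj + v_proj)

print(f"{total:,} trainable LoRA parameters")
```

Under these assumptions the adapter holds a few million parameters, well under 1% of the 8B parameters in the base model.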

# Load the fine-tuned model
Let’s load the fine-tuned model we saved and run inference with it.

In [7]:
model_path = "/content/qlora/qlora_model"

from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True
)

# Load the base model together with the saved LoRA adapter
loaded_model = AutoPeftModelForCausalLM.from_pretrained(
    model_path,
    quantization_config=bnb_config,
    device_map="auto"
)

tokenizer = AutoTokenizer.from_pretrained("/content/meta-llamaLlama-3.1-8B")  # tokenizer of the base model

input_txt = "Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?"

# Tokenize the prompt, move it to the GPU, and generate up to 100 new tokens
inputs = tokenizer(input_txt, return_tensors="pt").to("cuda")
output = loaded_model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(output[0], skip_special_tokens=True))

Unused kwargs: ['_load_in_4bit', '_load_in_8bit', 'quant_method']. These kwargs are not used in <class 'transformers.utils.quantization_config.BitsAndBytesConfig'>.


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?Question: How many clips did Natalia sell altogether in April and May?The number of clips Natalia sold in May is 48/2 = <<48/2=24>>24 clips.
So, the total number of clips Natalia sold in April and May is 48 + 24 = <<48+24=72>>72 clips.
#### 72 clips
#### 72 clips
#### 72 clips
#### 72 clips
#### 72 clips
#### 


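We can see that the model has adapted to our dataset: it now works through the problem with GSM8K-style calculation annotations and reaches the correct answer, 72. It also keeps repeating the final answer line until it reaches the max_new_tokens limit.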
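Since the generated text follows the GSM8K convention of marking the final answer with `####`, the answer can be pulled out with a small post-processing step. This is a minimal sketch; `extract_final_answer` is a hypothetical helper, and the sample string is a shortened copy of the output above.

```python
import re

def extract_final_answer(generated_text):
    """Return the number after the first '####' marker, or None if absent."""
    match = re.search(r"####\s*(-?[\d,]*\.?\d+)", generated_text)
    if match is None:
        return None
    return match.group(1).replace(",", "")  # drop thousands separators

sample_output = (
    "So, the total number of clips Natalia sold in April and May is "
    "48 + 24 = <<48+24=72>>72 clips.\n#### 72 clips\n#### 72 clips\n#### "
)
print(extract_final_answer(sample_output))
```

Taking only the first match sidesteps the repeated answer lines at the end of the generation.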

In [8]:
!zip -r /content/qlora.zip /content/qlora

  adding: content/qlora/ (stored 0%)
  adding: content/qlora/checkpoint-250/ (stored 0%)
  adding: content/qlora/checkpoint-250/scheduler.pt (deflated 56%)
  adding: content/qlora/checkpoint-250/tokenizer_config.json (deflated 96%)
  adding: content/qlora/checkpoint-250/special_tokens_map.json (deflated 64%)
  adding: content/qlora/checkpoint-250/tokenizer.json (deflated 74%)
  adding: content/qlora/checkpoint-250/adapter_model.safetensors (deflated 8%)
  adding: content/qlora/checkpoint-250/trainer_state.json (deflated 56%)
  adding: content/qlora/checkpoint-250/training_args.bin (deflated 51%)
  adding: content/qlora/checkpoint-250/README.md (deflated 66%)
  adding: content/qlora/checkpoint-250/adapter_config.json (deflated 52%)
  adding: content/qlora/checkpoint-250/optimizer.pt (deflated 8%)
  adding: content/qlora/checkpoint-250/rng_state.pth (deflated 25%)
  adding: content/qlora/qlora_model/ (stored 0%)
  adding: content/qlora/qlora_model/adapter_model.safetensors (deflated 8%)


In [9]:
from google.colab import files
files.download("/content/qlora.zip")
