# SmolLM QLoRA Fine-Tuning

In this notebook, we explore another fine-tuning technique: QLoRA.
You can find detailed explanations online and in the QLoRA paper.

In short, Quantized Low-Rank Adaptation (QLoRA) is a fine-tuning technique that:
- stores the base model weights in a lower-precision data type (smaller memory footprint)
- adapts the model by adding one or several small adapters on top of the frozen original weights

It's interesting because, compared to full SFT (where every weight is updated), you only train a "small" number of parameters. You can also keep several adapters and serve all of them on top of the same base model.

However, training is usually slower than full SFT.
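
To make the adapter idea concrete, here is a minimal pure-Python sketch of a LoRA forward pass: the frozen weight `W` is left untouched and a scaled low-rank product `B @ A` is added on top. The matrices, sizes and the helper names `matvec` and `lora_forward` are made up for illustration; they are not taken from PEFT.

```python
# Toy LoRA forward pass (illustrative values, not the notebook's model).

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, alpha, r):
    """Return W x + (alpha / r) * B (A x): frozen base output plus low-rank update."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

# Frozen 3x3 base weight: never updated during LoRA training.
W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 0.0],
     [3.0, 0.0, 1.0]]
r, alpha = 1, 2

A = [[0.5, -0.5, 1.0]]                # shape (r, in_features), trainable
B_zero = [[0.0], [0.0], [0.0]]        # shape (out_features, r), zero at initialization
B_trained = [[1.0], [0.0], [-1.0]]    # hypothetical values after some training

x = [1.0, 2.0, 3.0]
y_init = lora_forward(W, A, B_zero, x, alpha, r)        # identical to the base model
y_trained = lora_forward(W, A, B_trained, x, alpha, r)  # base output shifted by the adapter

print(y_init)
print(y_trained)
```

Because `B` starts at zero, the adapted model initially behaves exactly like the base model, and only `A` and `B` receive gradients.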

In [None]:
!pip install transformers==4.54.1 datasets==4.0.0 trl==0.20.0 peft==0.16.0 torch torchsummary -q 

In [1]:
import json
import os

import torch
from datasets import load_dataset
from peft import AutoPeftModelForCausalLM, LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from trl import SFTConfig, SFTTrainer, setup_chat_format

2025-08-01 07:08:07.736077: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1754032087.752870    5335 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1754032087.757917    5335 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-08-01 07:08:07.774390: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE4.1 SSE4.2 AVX AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


In [2]:
# Pick the best available device (CUDA is expected for this notebook)

device = (
    "cuda"
    if torch.cuda.is_available()
    else "mps" if torch.backends.mps.is_available() else "cpu"
)
print(f'Device is: {device}')
if device == "cuda":
    # Compute capability is only defined for CUDA devices
    print(f'Compute capability (major): {torch.cuda.get_device_capability()[0]}. (FlashAttention-2 requires 8 or above)')

Device is: cuda
Compute capability (major): 8. (FlashAttention-2 requires 8 or above)


In [3]:
model_name = "HuggingFaceTB/SmolLM2-360M"
dataset_name = "HuggingFaceTB/smoltalk"
model_cache_dir = model_name.split('/')[-1]
config_name = "smol-summarize"
dataset_cache_dir = f"{dataset_name.replace('/', '')}_{config_name}"
output_dir = "./sft_text_summary_360_qlora"

In [4]:
# Load the model and tokenizer

model = AutoModelForCausalLM.from_pretrained(
    pretrained_model_name_or_path=model_name,
    cache_dir=model_cache_dir,
    device_map='cuda',
)
tokenizer = AutoTokenizer.from_pretrained(
    pretrained_model_name_or_path=model_name,
    cache_dir=model_cache_dir,
)
# Add the ChatML special tokens and chat template to the model and tokenizer
model, tokenizer = setup_chat_format(model=model, tokenizer=tokenizer)
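
Note that the cell above loads the checkpoint in its stored precision, with no quantization step. The "Q" in QLoRA usually comes from passing a quantization config at load time. The sketch below shows what that could look like. It assumes the `bitsandbytes` package is installed and a CUDA GPU is available; neither is part of this notebook's `pip install` line, and this cell was not run here. The variable names `bnb_config` and `quantized_model` are ours.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 quantization with bfloat16 compute (settings commonly used for QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Assumption: requires bitsandbytes and a CUDA device at load time
quantized_model = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceTB/SmolLM2-360M",
    quantization_config=bnb_config,
    device_map="auto",
)
```

The rest of the notebook keeps the unquantized model loaded above.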

In [5]:
print(model.config)

LlamaConfig {
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "head_dim": 64,
  "hidden_act": "silu",
  "hidden_size": 960,
  "initializer_range": 0.02,
  "intermediate_size": 2560,
  "is_llama_config": true,
  "max_position_embeddings": 8192,
  "mlp_bias": false,
  "model_type": "llama",
  "num_attention_heads": 15,
  "num_hidden_layers": 32,
  "num_key_value_heads": 5,
  "pad_token_id": 2,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_interleaved": false,
  "rope_scaling": null,
  "rope_theta": 100000,
  "tie_word_embeddings": true,
  "torch_dtype": "float32",
  "transformers_version": "4.54.1",
  "use_cache": true,
  "vocab_size": 49152
}



In [6]:
# load dataset

ds = load_dataset(dataset_name, config_name, cache_dir=dataset_cache_dir)
ds

DatasetDict({
    train: Dataset({
        features: ['messages'],
        num_rows: 96356
    })
    test: Dataset({
        features: ['messages'],
        num_rows: 5072
    })
})

In [7]:
ds['train'][300]

{'messages': [{'content': 'Provide a concise, objective summary of the input text in up to three sentences, focusing on key actions and intentions without using second or third person pronouns.',
   'role': 'system'},
  {'content': "By . Jill Reilly . PUBLISHED: . 04:02 EST, 29 April 2013 . | . UPDATED: . 09:59 EST, 29 April 2013 . A powerful explosion has damaged a building in the centre of the Czech capital, Prague, injuring up to 40 people. Authorities say they believe some people are buried in the rubble. Police spokesman Tomas Hulan says it is not certain what caused the blast in Divadelni Street, but it was likely a natural gas explosion . The street was covered with rubble and has been sealed off by police who have also evacuated people from nearby buildings and closed a wide area around the explosion site. Injured: A powerful explosion has damaged a building in the centre of the Czech capital Prague with people feared buried in the rubble . Cause: Police said it is not immediat

## QLoRA

Let's start with a first configuration on a 10k-sample subset of the dataset and see how long training takes and how the results look.

In [8]:
# r: rank dimension for LoRA update matrices (smaller = fewer trainable parameters)

rank_dimension = 6

# lora_alpha: scaling factor for LoRA layers (higher = stronger adaptation)

lora_alpha = 10

# lora_dropout: dropout probability for LoRA layers (helps prevent overfitting)

lora_dropout = 0.05

peft_config = LoraConfig(
    r=rank_dimension,  # Rank dimension - typically between 4-32
    lora_alpha=lora_alpha,  # LoRA scaling factor - typically 2x rank
    lora_dropout=lora_dropout,  # Dropout probability for LoRA layers
    bias="none",  # Bias handling: "none" keeps all bias terms frozen (no biases are trained)
    target_modules="all-linear",  # Which modules to apply LoRA to
    task_type="CAUSAL_LM",  # Task type for model architecture
)
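
To get a feel for how "small" the adapter is, we can estimate the number of trainable LoRA parameters from the shapes in the model config printed earlier. This is a back-of-the-envelope sketch: it assumes `target_modules="all-linear"` selects the seven projection layers of each Llama decoder block (and not the output head), and the helper `lora_params` is ours.

```python
# Estimate trainable LoRA parameters for r=6 on SmolLM2-360M (shapes from model.config).
r = 6
hidden, intermediate = 960, 2560
kv_dim = 5 * 64        # num_key_value_heads * head_dim
num_layers = 32

# (in_features, out_features) of each adapted projection in one decoder layer.
# Assumption: "all-linear" targets exactly these seven modules per layer.
projections = {
    "q_proj": (hidden, hidden),
    "k_proj": (hidden, kv_dim),
    "v_proj": (hidden, kv_dim),
    "o_proj": (hidden, hidden),
    "gate_proj": (hidden, intermediate),
    "up_proj": (hidden, intermediate),
    "down_proj": (intermediate, hidden),
}

def lora_params(in_features, out_features, rank):
    """A is (rank x in_features), B is (out_features x rank)."""
    return rank * (in_features + out_features)

per_layer = sum(lora_params(i, o, r) for i, o in projections.values())
total_lora = per_layer * num_layers

print(f"LoRA parameters per layer: {per_layer:,}")
print(f"LoRA parameters in total:  {total_lora:,}")
```

Under these assumptions the adapter holds a few million parameters, roughly 1% of the 360M base weights.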

In [9]:
# Training configuration

# Hyperparameters based on QLoRA paper recommendations

sft_config = SFTConfig(
    # Output settings
    output_dir=output_dir,  # Directory to save model checkpoints
    # Training duration
    num_train_epochs=1,  # Number of training epochs
    # Batch size settings
    per_device_train_batch_size=2,  # Batch size per GPU
    gradient_accumulation_steps=2,  # Accumulate gradients for larger effective batch
    # Memory optimization
    gradient_checkpointing=True,  # Trade compute for memory savings
    # Optimizer settings
    optim="adamw_torch_fused",  # Use fused AdamW for efficiency
    learning_rate=2e-4,  # Learning rate (QLoRA paper)
    max_grad_norm=0.3,  # Gradient clipping threshold
    # Learning rate schedule
    warmup_ratio=0.03,  # Portion of steps for warmup
    lr_scheduler_type="constant",  # Keep learning rate constant after warmup
    # Logging and saving
    logging_steps=50,  # Log metrics every N steps
    save_strategy="epoch",  # Save checkpoint every epoch
    # Precision settings
    bf16=True,  # Use bfloat16 precision
    # Integration settings
    # push_to_hub=True,  # Uncomment to push the model to the Hugging Face Hub
    report_to="none",  # Disable external logging
)

# Create SFTTrainer with LoRA configuration

trainer = SFTTrainer(
    model=model,
    args=sft_config,
    train_dataset=ds["train"].select(range(10000)),
    peft_config=peft_config,  # LoRA configuration
    processing_class=tokenizer,
)

No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.
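
As a quick sanity check on the configuration above, the effective batch size and the number of optimizer steps follow from simple arithmetic. This sketch assumes a single GPU and one training example per dataset row (no packing); the variable names mirror the `SFTConfig` arguments, but the calculation is ours, not the trainer's code.

```python
import math

num_samples = 10_000                 # ds["train"].select(range(10000))
per_device_train_batch_size = 2
gradient_accumulation_steps = 2
num_devices = 1                      # assumption: single GPU
num_train_epochs = 1

# Samples that contribute to one optimizer update
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)

# Optimizer updates over the whole run
total_optimizer_steps = num_train_epochs * math.ceil(num_samples / effective_batch_size)

print(f"Effective batch size: {effective_batch_size}")
print(f"Optimizer steps: {total_optimizer_steps}")
```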


In [10]:
# Slow to run, but uses very little GPU memory (2.4 GB)

trainer.train()

# save model

trainer.save_model(output_dir)

`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.


Step,Training Loss
50,2.4541
100,1.9297
150,1.7703
200,1.771
250,1.7052
300,1.7917
350,1.7359
400,1.7101
450,1.7631
500,1.718


In [11]:
from torchvision import models
from torchsummary import summary

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
vgg = models.vgg16().to(device)

summary(vgg, (3, 224, 224))

----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
            Conv2d-1         [-1, 64, 224, 224]           1,792
              ReLU-2         [-1, 64, 224, 224]               0
            Conv2d-3         [-1, 64, 224, 224]          36,928
              ReLU-4         [-1, 64, 224, 224]               0
         MaxPool2d-5         [-1, 64, 112, 112]               0
            Conv2d-6        [-1, 128, 112, 112]          73,856
              ReLU-7        [-1, 128, 112, 112]               0
            Conv2d-8        [-1, 128, 112, 112]         147,584
              ReLU-9        [-1, 128, 112, 112]               0
        MaxPool2d-10          [-1, 128, 56, 56]               0
           Conv2d-11          [-1, 256, 56, 56]         295,168
             ReLU-12          [-1, 256, 56, 56]               0
           Conv2d-13          [-1, 256, 56, 56]         590,080
             ReLU-14          [-1, 256,

In [16]:
# Load PEFT model on CPU
model = AutoPeftModelForCausalLM.from_pretrained(
    pretrained_model_name_or_path=sft_config.output_dir,
    #torch_dtype=torch.float16,
    #low_cpu_mem_usage=True,
)

# Merge LoRA and base model and save
merged_model = model.merge_and_unload()
merged_model.save_pretrained(
    sft_config.output_dir, safe_serialization=True, 
    max_shard_size="2GB"
)
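
Conceptually, merging folds the adapter into the base weight so that no extra matrices are needed at inference time: the merged weight is `W + (alpha / r) * B @ A`. The toy sketch below checks on made-up numbers that the merged weight gives the same output as the base weight plus adapter. It illustrates the idea only and is not PEFT's implementation; all names are ours.

```python
# Toy check that merging a LoRA adapter preserves the layer output.

def matvec(M, v):
    """Matrix (list of rows) times vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def matmul(X, Y):
    """Matrix product of X (n x k) and Y (k x m)."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

W = [[1.0, 2.0], [3.0, 4.0]]   # frozen base weight
A = [[1.0, -1.0]]              # (r x in_features)
B = [[2.0], [0.5]]             # (out_features x r)
r, alpha = 1, 2
scale = alpha / r

# Merge: W_merged = W + scale * (B @ A)
delta = matmul(B, A)
W_merged = [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]

x = [2.0, 1.0]
y_adapter = [b + scale * u
             for b, u in zip(matvec(W, x), matvec(B, matvec(A, x)))]
y_merged = matvec(W_merged, x)

print(y_adapter, y_merged)
```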

In [17]:

def generate_summary(dataset, n, system_prompt, sample_type='test'):
    # Build a text-generation pipeline around the current global model and tokenizer
    pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
    # Use the system prompt passed as argument, plus the user message of sample n
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": dataset[sample_type][n]['messages'][1].get('content')},
    ]
    return json.dumps(pipe(messages), indent=4)

In [18]:
# Load Model with PEFT adapter
tokenizer = AutoTokenizer.from_pretrained(sft_config.output_dir)
model = AutoPeftModelForCausalLM.from_pretrained(
    sft_config.output_dir, 
    #device_map='auto', 
    #torch_dtype=torch.float16
)

In [19]:
n = 3000

system_prompt_summarize = "Provide a concise, objective summary of the input text in up to three sentences, focusing on key actions and intentions without using second or third person pronouns."
print(generate_summary(ds, n, system_prompt_summarize))
print('\n')
system_prompt_summarize = 'Extract and present the main key point of the input text in one very short sentence, including essential details like dates or locations if necessary.'
print(generate_summary(ds, n, system_prompt_summarize))

Device set to use cuda:0
Device set to use cuda:0


[
    {
        "generated_text": [
            {
                "role": "system",
                "content": "Provide a concise, objective summary of the input text in up to three sentences, focusing on key actions and intentions without using second or third person pronouns."
            },
            {
                "role": "user",
                "content": "While suspended star striker Wayne Rooney was recovering from a hair transplant, England's hopes of automatic qualification for Euro 2012 suffered a blow on Saturday as Switzerland claimed at 2-2 draw at London's Wembley Stadium. The balding Manchester United player, who was booked in March's win over Wales to trigger the ban, revealed on Twitter before kickoff that he had used his time off to visit a hair specialist. \"Just to confirm to all my followers I have had a hair transplant. I was going bald at 25, why not. I'm delighted with the result,\" he wrote on the social networking website. \"It's still a bit bruised and s

In [23]:
model

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): LlamaForCausalLM(
      (model): LlamaModel(
        (embed_tokens): Embedding(49152, 960)
        (layers): ModuleList(
          (0-31): 32 x LlamaDecoderLayer(
            (self_attn): LlamaAttention(
              (q_proj): lora.Linear(
                (base_layer): Linear(in_features=960, out_features=960, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.05, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=960, out_features=6, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=6, out_features=960, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )
              (k_proj): lora.Linear(
      

Great! Let's see if we can run inference on CPU, since the goal of QLoRA is to reduce memory consumption.
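
To see why the data type matters for memory, we can estimate the size of the weights alone from the config values printed earlier. This is a rough sketch: it assumes a standard Llama layout (no biases, tied input and output embeddings, as the config indicates), ignores activations and the KV cache, and treats "4-bit" as exactly half a byte per parameter with no overhead for quantization constants.

```python
# Rough weight-memory estimate for SmolLM2-360M from its config values.
vocab_size, hidden, intermediate, num_layers = 49152, 960, 2560, 32
kv_dim = 5 * 64  # num_key_value_heads * head_dim

attention = 2 * hidden * hidden + 2 * hidden * kv_dim   # q, o and k, v projections
mlp = 3 * hidden * intermediate                         # gate, up, down projections
norms = 2 * hidden                                      # two RMSNorm weights per layer
per_layer = attention + mlp + norms

# Tied embeddings: the output head reuses the embedding matrix. Final norm adds `hidden`.
total_params = vocab_size * hidden + num_layers * per_layer + hidden

# Assumption: idealized bytes per parameter for each storage type
bytes_per_param = {"float32": 4, "float16": 2, "4-bit": 0.5}
weights_gib = {name: total_params * b / 1024**3 for name, b in bytes_per_param.items()}

print(f"Parameters: {total_params:,}")
for name, gib in weights_gib.items():
    print(f"{name:>8}: {gib:.2f} GiB")
```

Loading in `float16`, as in the next cell, halves the weight memory compared to the `float32` checkpoint.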

In [24]:
# Load PEFT model on CPU
model = AutoPeftModelForCausalLM.from_pretrained(
    pretrained_model_name_or_path=sft_config.output_dir,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
)

# Merge LoRA and base model and save
merged_model = model.merge_and_unload()
merged_model.save_pretrained(
    sft_config.output_dir, safe_serialization=True, 
    max_shard_size="2GB"
)

In [28]:
# Load Model with PEFT adapter
tokenizer = AutoTokenizer.from_pretrained(sft_config.output_dir)
model = AutoPeftModelForCausalLM.from_pretrained(
    sft_config.output_dir, device_map='cpu', torch_dtype=torch.float16
)

In [30]:

def generate_summary(dataset, n, system_prompt, sample_type='test'):
    # Build a text-generation pipeline around the current global model and tokenizer
    pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
    # Use the system prompt passed as argument, plus the user message of sample n
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": dataset[sample_type][n]['messages'][1].get('content')},
    ]
    return json.dumps(pipe(messages), indent=4)

In [31]:
n = 3000

system_prompt_summarize = "Provide a concise, objective summary of the input text in up to three sentences, focusing on key actions and intentions without using second or third person pronouns."
print(generate_summary(ds, n, system_prompt_summarize))
print('\n')
system_prompt_summarize = 'Extract and present the main key point of the input text in one very short sentence, including essential details like dates or locations if necessary.'
print(generate_summary(ds, n, system_prompt_summarize))

Device set to use cpu
Device set to use cpu


[
    {
        "generated_text": [
            {
                "role": "system",
                "content": "Provide a concise, objective summary of the input text in up to three sentences, focusing on key actions and intentions without using second or third person pronouns."
            },
            {
                "role": "user",
                "content": "While suspended star striker Wayne Rooney was recovering from a hair transplant, England's hopes of automatic qualification for Euro 2012 suffered a blow on Saturday as Switzerland claimed at 2-2 draw at London's Wembley Stadium. The balding Manchester United player, who was booked in March's win over Wales to trigger the ban, revealed on Twitter before kickoff that he had used his time off to visit a hair specialist. \"Just to confirm to all my followers I have had a hair transplant. I was going bald at 25, why not. I'm delighted with the result,\" he wrote on the social networking website. \"It's still a bit bruised and s

OK, it took longer, as expected, but that's it: we obtained a model of similar quality with QLoRA while keeping the ability to serve it on CPU!