# Efficient Fine-Tuning with LoRA, QLoRA, PEFT, and Quantization

This notebook walks through parameter-efficient fine-tuning (PEFT) of a language model, step by step. We load a model with 4-bit quantized weights, attach LoRA (Low-Rank Adaptation) adapters, and fine-tune it on a text-to-SQL dataset. Training LoRA adapters on top of a quantized base model is the idea behind QLoRA (Quantized LoRA).


## 1. Configuration and Model Loading

We start by loading a pre-trained model using 4-bit quantization to reduce memory usage.


In [19]:
# %pip install unsloth
# %pip install --upgrade pip && pip install "unsloth[cu124-torch250] @ git+https://github.com/unslothai/unsloth.git"

In [20]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Any length works; Unsloth handles RoPE scaling internally
dtype = None # None for auto-detection; float16 for Tesla T4/V100, bfloat16 for Ampere and newer
load_in_4bit = True # Load weights in 4-bit to reduce memory usage; can be False

In [21]:
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Llama-3.2-3B-bnb-4bit",
    # model_name="unsloth/mistral-7b-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

==((====))==  Unsloth 2024.11.9: Fast Llama patching. Transformers = 4.46.3.
   \\   /|    GPU: Tesla T4. Max memory: 15.56 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.4.1+cu121. CUDA = 7.5. CUDA Toolkit = 12.1.
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.28.post1. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


### Key Points:
- **4-bit Quantization**: Stores model weights in 4 bits instead of 16, cutting weight memory roughly fourfold, usually at a small cost in accuracy.
- **FastLanguageModel**: Unsloth's wrapper class for loading models and patching them for faster training and inference.
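
To make the memory saving concrete, here is a minimal sketch of block-wise absmax quantization to signed 4-bit codes, plus a rough weight-memory estimate. It is a simplification for illustration only: the `bnb-4bit` checkpoint loaded above was quantized with bitsandbytes, whose exact scheme differs, and the helper names and the round 3-billion parameter count are assumptions.

```python
def quantize_absmax_4bit(block):
    """Quantize a block of floats to integer codes in [-7, 7] with one shared scale."""
    scale = max(abs(x) for x in block) / 7 or 1.0  # fall back to 1.0 for an all-zero block
    codes = [round(x / scale) for x in block]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from the codes and the block scale."""
    return [c * scale for c in codes]


def weight_memory_gb(n_params, bits_per_param):
    """Memory for the weights alone (ignores scales, activations, optimizer state)."""
    return n_params * bits_per_param / 8 / 1024**3


weights = [0.12, -0.53, 0.91, -0.07, 0.33, -0.88, 0.02, 0.46]  # toy values
codes, scale = quantize_absmax_4bit(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

n_params = 3_000_000_000  # assumed round figure for a "3B" model
print(f"codes = {codes}")
print(f"max round-trip error = {max_error:.4f} (bounded by scale / 2 = {scale / 2:.4f})")
print(f"16-bit weights: {weight_memory_gb(n_params, 16):.2f} GB")
print(f"4-bit weights:  {weight_memory_gb(n_params, 4):.2f} GB")
```

The round-trip error is the price of the compression: each weight is stored as one of only 15 levels per block, so it can be off by up to half a quantization step.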

## 2. Adding LoRA for Parameter-Efficient Fine-Tuning

Next, we attach Low-Rank Adaptation (LoRA) adapters to the model. LoRA freezes the pre-trained weights and trains only small low-rank matrices added to selected layers.

### Key Points:
- **LoRA**: Fine-tunes a pre-trained model by adding small low-rank weight updates, significantly reducing training cost.
- **Gradient Checkpointing**: Saves memory by recomputing activations during the backward pass instead of storing them, which helps with long sequences at the cost of extra compute.
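
The low-rank update can be sketched in plain Python. For a frozen weight matrix `W` (d_out x d_in), LoRA trains two small matrices `B` (d_out x r) and `A` (r x d_in) and uses `W + (alpha / r) * B A` as the effective weight. The helper names and toy dimensions below are illustrative assumptions, not Unsloth's implementation.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(W, A, B, alpha, r):
    """Return W + (alpha / r) * (B @ A) without modifying W."""
    scaling = alpha / r
    delta = matmul(B, A)
    return [[w + scaling * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]


d_out, d_in, r, alpha = 3, 4, 2, 2
W = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.0]]                   # frozen pre-trained weight (toy values)
A = [[0.5, 0.0, 0.0, 0.0],
     [0.0, 0.5, 0.0, 0.0]]                   # r x d_in, trainable
B_init = [[0.0] * r for _ in range(d_out)]   # d_out x r, conventionally initialized to zero

# With B at zero the adapter is a no-op, so training starts from the base model.
W_start = lora_effective_weight(W, A, B_init, alpha, r)

# Once B has non-zero entries, the effective weight shifts.
B_trained = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
W_tuned = lora_effective_weight(W, A, B_trained, alpha, r)

# Parameter counts for a square 3072 x 3072 layer at rank 16 (illustrative sizes).
full_params = 3072 * 3072
lora_params = 16 * (3072 + 3072)
print(f"full layer: {full_params:,} parameters, LoRA adapter: {lora_params:,} parameters")
```

For realistic layer sizes the adapter is about two orders of magnitude smaller than the weight it modifies, which is where the training savings come from.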


In [22]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # LoRA rank; any integer > 0 (suggested: 8, 16, 32, 64, 128). Smaller values add fewer trainable parameters.
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
    "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Any value works, but 0 is optimized in Unsloth
    bias = "none", # Any value works, but "none" is optimized in Unsloth
    use_gradient_checkpointing = "unsloth", # Unsloth's checkpointing variant, intended for longer contexts
    random_state = 3407, # Seed for reproducibility
    use_rslora = False, # Set to True to use rank-stabilized LoRA
    loftq_config = None, # Optional LoftQ configuration
)
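
As a sanity check on the configuration above, the number of trainable LoRA parameters can be estimated by hand: each adapted linear layer with input size d_in and output size d_out adds `r * (d_in + d_out)` parameters. The layer shapes and layer count below are assumptions based on the published Llama 3.2 3B configuration, not values taken from this notebook. With them, the estimate agrees with the 24,313,856 trainable parameters reported in the training log in Section 5.

```python
r = 16  # LoRA rank used above

# (d_in, d_out) per target module; assumed Llama 3.2 3B shapes
# (hidden size 3072, MLP size 8192, 8 key/value heads of dimension 128).
module_shapes = {
    "q_proj": (3072, 3072),
    "k_proj": (3072, 1024),
    "v_proj": (3072, 1024),
    "o_proj": (3072, 3072),
    "gate_proj": (3072, 8192),
    "up_proj": (3072, 8192),
    "down_proj": (8192, 3072),
}
num_layers = 28  # assumed number of transformer blocks

per_layer = sum(r * (d_in + d_out) for d_in, d_out in module_shapes.values())
trainable = per_layer * num_layers
print(f"LoRA parameters per layer: {per_layer:,}")
print(f"Total trainable parameters: {trainable:,}")
```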

## 3. Text Generation with Prompts

Before fine-tuning, we prompt the model with a text-to-SQL instruction to see how it responds without task-specific training.

### Key Points:
- **Prompt Engineering**: Structuring input prompts carefully improves response quality.
- **Fast Inference**: `FastLanguageModel.for_inference` switches the model to Unsloth's optimized inference mode, which Unsloth describes as about 2x faster.


In [23]:
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""


In [24]:
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(
[
    alpaca_prompt.format(
        "Which countries have experienced an increase in returns in the last quarter compared to the same quarter of the previous year, and what is the percentage change for each country?",
        # "List all the unique equipment types and their corresponding total maintenance frequency from the equipment_maintenance table.", # instruction
        "You are an expert in converting English questions to SQL code", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")


### SQL Inference with the Base Model

In [25]:
outputs = model.generate(**inputs, max_new_tokens = 128, use_cache = True)
response = tokenizer.batch_decode(outputs)
print(response[0])

<|begin_of_text|>Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Which countries have experienced an increase in returns in the last quarter compared to the same quarter of the previous year, and what is the percentage change for each country?

### Input:
You are an expert in converting English questions to SQL code

### Response:
SELECT country, quarter, return_change
FROM (
    SELECT country, quarter, return_change
    FROM (
        SELECT country, quarter, return_change
        FROM (
            SELECT country, quarter, return_change
            FROM (
                SELECT country, quarter, return_change
                FROM (
                    SELECT country, quarter, return_change
                    FROM (
                        SELECT country, quarter, return_change
                        FROM (
                            SELECT country, quart

In [26]:
from datasets import load_dataset

# Training prompt template; the named placeholders are filled in by formatting_prompts_func
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
### Instruction:
{instruction}
### Input:
{input}
### Response:
{response}
"""

# Set the EOS token (assuming the tokenizer is already defined)
EOS_TOKEN = tokenizer.eos_token


## 4. Preparing and Formatting the Dataset

We prepare a dataset for fine-tuning the model with SQL-specific prompts.

### Key Points:
- **Dataset Preparation**: Converts raw data into formatted prompts ready for training.
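- **Synthetic SQL Dataset**: Each example in `gretelai/synthetic_text_to_sql` provides a natural-language question (`sql_prompt`), its database context (`sql_context`), a SQL answer (`sql`), and an explanation (`sql_explanation`), which makes it a good fit for text-to-SQL fine-tuning.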


In [27]:
# Formatting function to apply the prompt template to the dataset
def formatting_prompts_func(examples):
    company_databases = examples["sql_context"]
    prompts = examples["sql_prompt"]
    sqls = examples["sql"]
    explanations = examples["sql_explanation"]
    texts = []
    for company_database, prompt, sql, explanation in zip(company_databases, prompts, sqls, explanations):
        # Substitute the correct placeholders
        text = alpaca_prompt.format(
            instruction = prompt,
            input = company_database,
            response = sql + "\n" + explanation
            ) + EOS_TOKEN
        texts.append(text)
    return {"text": texts} # Ensure the formatted text is returned as a "text" field
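
To see what the formatting function produces, it can be run on a hand-made batch. The sketch below repeats the training template and function so that it runs on its own. The example row is invented for illustration (it is not taken from the dataset), and the EOS string is assumed to be `<|end_of_text|>`, the token that appears at the end of the fine-tuned model's output.

```python
EOS_TOKEN = "<|end_of_text|>"  # assumed; in the notebook this comes from tokenizer.eos_token

alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
### Instruction:
{instruction}
### Input:
{input}
### Response:
{response}
"""


def formatting_prompts_func(examples):
    """Turn a batch (dict of column lists) into a dict with one "text" column."""
    texts = []
    for context, prompt, sql, explanation in zip(
        examples["sql_context"], examples["sql_prompt"],
        examples["sql"], examples["sql_explanation"],
    ):
        text = alpaca_prompt.format(
            instruction=prompt, input=context, response=sql + "\n" + explanation
        ) + EOS_TOKEN
        texts.append(text)
    return {"text": texts}


# A made-up single-row batch in the same column layout as the dataset.
batch = {
    "sql_context": ["CREATE TABLE sales (id INT, region TEXT, amount INT);"],
    "sql_prompt": ["What is the total sales amount per region?"],
    "sql": ["SELECT region, SUM(amount) FROM sales GROUP BY region;"],
    "sql_explanation": ["Groups the rows by region and sums the amount column."],
}

formatted = formatting_prompts_func(batch)
print(formatted["text"][0])
```

Appending the EOS token teaches the model to stop after the explanation instead of generating until `max_new_tokens` is exhausted.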


In [28]:
# Load dataset and map formatting function to add prompts
ds = load_dataset("gretelai/synthetic_text_to_sql")
formatted_ds = ds.map(formatting_prompts_func, batched=True) # Apply formatting

# Select the 'train' split from the formatted dataset
train_dataset = formatted_ds['train']

## 5. Fine-Tuning the Model with PEFT

We use PEFT (Parameter-Efficient Fine-Tuning) to train the model on the prepared dataset.

### Key Points:
- **PEFT**: A family of methods, LoRA among them, that train a small number of parameters while keeping most of the model frozen, saving memory and compute.
- **SFTTrainer**: Streamlined trainer for supervised fine-tuning with prompts.


In [29]:
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported
from trl import SFTTrainer

# Trainer setup
trainer = SFTTrainer(
    model=model, # Ensure model is defined
    tokenizer=tokenizer, # Ensure tokenizer is defined
    train_dataset=train_dataset, # Use the 'train' split from formatted_ds
    dataset_text_field="text", # This is the field we created with formatted prompts
    max_seq_length=max_seq_length, # Ensure max_seq_length is defined
    dataset_num_proc=2,
    packing=False, # Setting this to True can make training up to 5x faster for short sequences.
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=5,
        max_steps=60,
        learning_rate=2e-4,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=1,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=3407,
        output_dir="outputs",
        report_to="none", # Disable WANDB logging
    )
)

max_steps is given, it will override any value given in num_train_epochs
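
With `warmup_steps=5`, `max_steps=60`, and `lr_scheduler_type="linear"`, the learning rate ramps up linearly over the first 5 steps and then decays linearly to zero. The function below is a simplified re-implementation for illustration (the name `linear_lr` is hypothetical); the trainer computes its schedule internally.

```python
def linear_lr(step, base_lr=2e-4, warmup_steps=5, total_steps=60):
    """Learning rate after `step` optimizer updates: linear warmup, then linear decay."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


for step in (0, 1, 5, 30, 60):
    print(f"step {step:2d}: lr = {linear_lr(step):.2e}")
```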


In [30]:
#@title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 15.56 GB.
6.207 GB of memory reserved.


In [31]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 100,000 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 60
 "-____-"     Number of trainable parameters = 24,313,856


| Step | Training Loss |
|------|---------------|
| 1 | 1.7144 |
| 2 | 1.7784 |
| 3 | 1.7414 |
| 4 | 1.6591 |
| 5 | 1.4781 |
| 6 | 1.455 |
| 7 | 1.3345 |
| 8 | 1.0956 |
| 9 | 1.0946 |
| 10 | 1.0472 |
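
The numbers in the training banner follow directly from the trainer arguments. A quick check, using only values that appear in the configuration and log above:

```python
per_device_batch_size = 2
gradient_accumulation_steps = 4
num_gpus = 1
max_steps = 60
num_examples = 100_000

# Gradients from several small batches are accumulated before each optimizer step.
effective_batch_size = per_device_batch_size * gradient_accumulation_steps * num_gpus
examples_seen = effective_batch_size * max_steps
fraction_of_dataset = examples_seen / num_examples

print(f"Effective batch size: {effective_batch_size}")
print(f"Examples seen in {max_steps} steps: {examples_seen}")
print(f"Fraction of the dataset: {fraction_of_dataset:.2%}")
```

This run is therefore a short demonstration that covers under 1% of the data; a full epoch at this batch size would take 12,500 steps.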


In [32]:
#@title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory         /max_memory*100, 3)
lora_percentage = round(used_memory_for_lora/max_memory*100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

139.0311 seconds used for training.
2.32 minutes used for training.
Peak reserved memory = 6.207 GB.
Peak reserved memory for training = 0.0 GB.
Peak reserved memory % of max memory = 39.891 %.
Peak reserved memory for training % of max memory = 0.0 %.
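
The reported 0.0 GB for training is most likely a measurement artifact rather than a sign that training used no memory. `torch.cuda.max_memory_reserved()` returns a high-water mark since the program started (or since the last reset), so if the peak was already reached in an earlier cell, the readings before and after training are identical. The pure-Python stand-in below mimics that behavior. `PeakTracker` is a hypothetical class for illustration; in the notebook itself, the analogous fix would be to call `torch.cuda.reset_peak_memory_stats()` before `trainer.train()`.

```python
class PeakTracker:
    """Toy stand-in for an allocator that records a peak (high-water mark)."""

    def __init__(self):
        self.current = 0.0
        self.peak = 0.0

    def allocate(self, gb):
        self.current += gb
        self.peak = max(self.peak, self.current)

    def free(self, gb):
        self.current -= gb

    def reset_peak(self):
        # After a reset, the peak restarts from whatever is currently in use.
        self.peak = self.current


tracker = PeakTracker()
tracker.allocate(6.0)      # an earlier cell pushes usage to its maximum
tracker.free(3.0)          # memory is released again before training

start = tracker.peak       # reading the peak before training
tracker.allocate(2.0)      # training needs extra memory but stays below the old peak
tracker.free(2.0)
naive_delta = tracker.peak - start

tracker.reset_peak()       # reset first, then measure
start = tracker.peak
tracker.allocate(2.0)
tracker.free(2.0)
reset_delta = tracker.peak - start

print(f"delta without reset: {naive_delta} GB")
print(f"delta with reset:    {reset_delta} GB")
```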


## 6. Inference with the Fine-Tuned Model

After fine-tuning, we run the model on the same question used in Section 3 so that the two outputs can be compared.

### Prompt Construction and Inference

We use the same Alpaca-style layout as in training so that the model sees a familiar structure. Note that the training template (cell 26) omits the blank lines between sections, so the two formats are close but not identical.

In [33]:
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""


In [34]:
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(
[
    alpaca_prompt.format(
        # "Which space agencies have launched more than 5 satellites and their respective total mission durations?",
        "Which countries have experienced an increase in returns in the last quarter compared to the same quarter of the previous year, and what is the percentage change for each country?",
        # "List all the unique equipment types and their corresponding total maintenance frequency from the equipment_maintenance table.", # instruction
        "You are an expert in converting English questions to SQL code", # input: additional context; may be empty if the instruction is self-contained
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")


In [35]:
outputs = model.generate(**inputs, max_new_tokens = 512, use_cache = True)
response = tokenizer.batch_decode(outputs)
print(response[0])

<|begin_of_text|>Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Which countries have experienced an increase in returns in the last quarter compared to the same quarter of the previous year, and what is the percentage change for each country?

### Input:
You are an expert in converting English questions to SQL code

### Response:
SELECT country, SUM(returned) AS total_returns, SUM(returned) / SUM(ordered) AS return_percentage
FROM orders
WHERE order_date >= DATE_SUB(NOW(), INTERVAL 1 QUARTER)
GROUP BY country
HAVING SUM(returned) > 0
ORDER BY return_percentage DESC;
<|end_of_text|>

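
The decoded string contains the whole prompt as well as the special tokens. If only the generated SQL is needed, one simple option is to split on the `### Response:` marker. The helper `extract_response` below is a hypothetical convenience function, and the sample string is a shortened, illustrative version of the output above.

```python
def extract_response(decoded, marker="### Response:", eos="<|end_of_text|>"):
    """Return the text after the last response marker, without the EOS token."""
    _, _, tail = decoded.rpartition(marker)
    return tail.replace(eos, "").strip()


decoded = (
    "<|begin_of_text|>Below is an instruction ...\n\n"
    "### Response:\n"
    "SELECT country, SUM(returned) AS total_returns\n"
    "FROM orders\n"
    "GROUP BY country;\n"
    "<|end_of_text|>"
)

sql = extract_response(decoded)
print(sql)
```

An alternative that avoids string matching is to decode only the newly generated tokens, for example `tokenizer.batch_decode(outputs[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)`.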