# Fine-Tuning with QLoRA (Quantized LoRA)

## 1. What is Quantization?
Quantization is the process of reducing the precision of the model's weights. Large Language Models (LLMs) are typically stored in **FP32** (32-bit floating point) or **BF16/FP16** (16-bit).

By quantizing to **4-bit**, we cut the memory needed for the weights by about 8x relative to FP32 (about 4x relative to 16-bit). That makes it feasible to run and fine-tune large models (like Llama-3 8B or Mistral 7B) on consumer-grade GPUs (like an RTX 3090, or even a free Google Colab T4).
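
The arithmetic behind that claim can be checked in a few lines. This is a back-of-the-envelope sketch: the helper name `weight_memory_gib` is made up for illustration, the 8-billion parameter count is a round number, and the estimate covers the weights only (no activations, optimizer state, or quantization constants).

```python
def weight_memory_gib(n_params: int, bits_per_weight: int) -> float:
    """Approximate memory for the weights alone, in GiB."""
    return n_params * bits_per_weight / 8 / 1024**3


n_params = 8_000_000_000  # round figure for an "8B" model (assumption)

for name, bits in [("FP32", 32), ("BF16/FP16", 16), ("4-bit", 4)]:
    print(f"{name:>9}: {weight_memory_gib(n_params, bits):5.1f} GiB")
```

Under these assumptions the weights take roughly 30 GiB in FP32, 15 GiB in 16-bit, and under 4 GiB in 4-bit.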

## 2. What is QLoRA?
**QLoRA** (Quantized Low-Rank Adaptation) is an efficient fine-tuning technique that:
1.  **Quantizes** the base model to 4-bit (using the NormalFloat4, or NF4, data type).
2.  **Freezes** the base model weights.
3.  **Adds** small, trainable 16-bit adapter weights (LoRA layers) that are optimized during training.

This combination can come close to the quality of full 16-bit fine-tuning while using a fraction of the memory.
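
To make steps 2 and 3 concrete, here is a minimal pure-Python sketch of a single LoRA-adapted linear layer. It is an illustration, not the `peft` implementation: the helper names (`matvec`, `lora_forward`) and the tiny dimensions are made up, and `W` is kept as plain floats, whereas in QLoRA it would be stored in 4-bit and dequantized on the fly.

```python
import random


def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


def lora_forward(W, A, B, x, alpha, r):
    """y = W x + (alpha / r) * B (A x); W is frozen, A and B are trainable."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


random.seed(0)
d_in, d_out, r, alpha = 8, 6, 2, 4

W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]  # frozen base weight
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]   # trainable, small random init
B = [[0.0] * r for _ in range(d_out)]                                  # trainable, zero init
x = [random.gauss(0, 1) for _ in range(d_in)]

# With B at zero the adapter contributes nothing, so training starts from the base model.
assert lora_forward(W, A, B, x, alpha, r) == matvec(W, x)

print("frozen params:   ", d_out * d_in)
print("trainable params:", r * d_in + d_out * r)
```

Only `A` and `B` would receive gradients. Their combined size grows with `r * (d_in + d_out)` rather than `d_in * d_out`, which is where the memory saving on the training side comes from.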

## 3. Setup
We need `bitsandbytes` (for quantization), `peft` (for LoRA), and `trl` (for easy training).

In [None]:
%pip install -q -U bitsandbytes transformers peft accelerate datasets trl

## 4. Loading the Model in 4-bit
We use `BitsAndBytesConfig` to define the 4-bit quantization parameters.

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0" # Low memory model for demo

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,       # Quantize the quantization constants
    bnb_4bit_quant_type="nf4",            # NormalFloat4: 4-bit type designed for normally distributed weights
    bnb_4bit_compute_dtype=torch.bfloat16 # Computation still happens in 16-bit
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto"
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
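
To give a feel for what the "quantization constants" in the config are, here is a toy blockwise 4-bit quantizer in pure Python. It is a sketch under simplifying assumptions: it uses a uniform integer grid rather than the NF4 codebook, a block size of 4 chosen only to keep the output readable, and helper names (`quantize_blockwise`, `dequantize_blockwise`) that are made up for illustration and are not part of `bitsandbytes`.

```python
def quantize_blockwise(weights, block_size=4):
    """Toy absmax quantization: one float scale per block, 4-bit codes per weight."""
    codes, scales = [], []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        scale = max(abs(w) for w in block) or 1.0  # per-block quantization constant
        scales.append(scale)
        for w in block:
            q = round(w / scale * 7)  # signed integer in [-7, 7]
            codes.append(q + 8)       # shift to an unsigned code that fits in 4 bits
    return codes, scales


def dequantize_blockwise(codes, scales, block_size=4):
    """Map each 4-bit code back to an approximate float using its block's scale."""
    return [(c - 8) / 7 * scales[i // block_size] for i, c in enumerate(codes)]


weights = [0.12, -0.50, 0.03, 0.31, 2.0, -1.1, 0.6, 0.0]
codes, scales = quantize_blockwise(weights)
restored = dequantize_blockwise(codes, scales)

print("codes:   ", codes)
print("scales:  ", scales)
print("restored:", [round(v, 3) for v in restored])
```

Each block stores one extra float (`scales`) alongside its 4-bit codes. Double quantization, enabled by `bnb_4bit_use_double_quant=True`, compresses those per-block constants as well.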

## 5. Preparing for Training
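Before attaching adapters, we pass the quantized model through `prepare_model_for_kbit_training`. It freezes the base weights and adjusts the model (for example, enabling gradients on the inputs) so that training on top of quantized layers is stable.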

In [None]:
from peft import prepare_model_for_kbit_training

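# Trade compute for memory: recompute activations during the backward pass
model.gradient_checkpointing_enable()
# Freeze the base weights and get the quantized model ready for adapter training
model = prepare_model_for_kbit_training(model)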

## 6. Configuring LoRA Adapters
This is where we define the 'bottleneck' layers that will actually be trained.

In [None]:
from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=16, # Rank: the inner dimension of the two low-rank update matrices
    lora_alpha=32, # Scaling: the adapter update is multiplied by lora_alpha / r
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"], # Modules to adapt
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, config)

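# Check how many parameters we are actually training
# (get_nb_trainable_parameters returns a (trainable, total) tuple)
trainable, total = model.get_nb_trainable_parameters()
print(f"Trainable parameters: {trainable:,} of {total:,} ({100 * trainable / total:.2f}%)")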
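
The trainable-parameter count can also be estimated by hand, since each adapted module adds `r * (d_in + d_out)` weights. The sketch below assumes TinyLlama-1.1B shapes (hidden size 2048, key/value projection width 256 from grouped-query attention, 22 layers). Those numbers are assumptions to check against `model.config`, and `lora_params` is a made-up helper name.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by one adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


r = 16
hidden, kv_dim, n_layers = 2048, 256, 22  # assumed shapes; verify against model.config

per_layer = (
    lora_params(hidden, hidden, r)    # q_proj
    + lora_params(hidden, kv_dim, r)  # k_proj
    + lora_params(hidden, kv_dim, r)  # v_proj
    + lora_params(hidden, hidden, r)  # o_proj
)
total = per_layer * n_layers
print(f"{total:,} trainable LoRA parameters")
```

If the printed figure differs from what the model reports, the assumed shapes are the first thing to check.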

## 7. Dataset and Training
We use `SFTTrainer` (Supervised Fine-Tuning Trainer), which takes care of tokenizing the text field and running the training loop on the LoRA-wrapped model.

In [None]:
from trl import SFTConfig, SFTTrainer
from datasets import Dataset, DatasetDict

# Dummy in-memory dataset for the example. load_dataset("json", ...) expects
# file paths in data_files, so we build the dataset from Python objects instead.
dataset = DatasetDict({
    "train": Dataset.from_list(
        [{"text": "### Instruction: Hello\n### Response: Hi there!"}]
    )
})

# Assumes a recent trl release, where dataset_text_field is set on SFTConfig
# (a TrainingArguments subclass) rather than passed to SFTTrainer.
training_args = SFTConfig(
    output_dir="./outputs",
    dataset_text_field="text",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    max_steps=10,
    logging_steps=1,
    fp16=torch.cuda.is_available(),  # mixed precision only when a GPU is present
    optim="paged_adamw_8bit"         # 8-bit AdamW whose state can be paged to CPU memory during spikes
)

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset["train"],
    args=training_args,
)

# trainer.train()
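
As a quick sanity check on the training arguments, the effective batch size and the number of examples consumed follow directly from the values above. The single-GPU setting (`n_devices = 1`) is an assumption.

```python
per_device_train_batch_size = 1
gradient_accumulation_steps = 4
max_steps = 10
n_devices = 1  # assumption: a single GPU

# Gradients from several small batches are accumulated before each optimizer step.
effective_batch = per_device_train_batch_size * gradient_accumulation_steps * n_devices
examples_seen = effective_batch * max_steps

print(f"effective batch size: {effective_batch}")
print(f"examples seen in {max_steps} steps: {examples_seen}")
```

With a one-example dummy dataset, the same example is simply revisited on every step.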

## Summary of QLoRA Workflow
1.  **Quantize model** to 4-bit (NF4) to save memory.
2.  **Freeze** original weights.
3.  **Apply LoRA** to specific layers.
4.  **Train** ONLY the LoRA adapters.
5.  **Merge** adapters back into the base model (optional, for deployment), typically after reloading the base model in 16-bit.
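
Merging works because the adapter is just an additive update to the weight matrix: `W_merged = W + (lora_alpha / r) * B @ A`. The pure-Python sketch below checks that equivalence on a tiny example. The matrices and helper names are made up for illustration; in practice `peft` exposes this step as `merge_and_unload()`.

```python
def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y))) for j in range(len(Y[0]))]
            for i in range(len(X))]


W = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]  # frozen base weight (2 x 3)
A = [[0.5, -1.0, 2.0]]                  # LoRA A (r x 3), with r = 1
B = [[1.0], [-2.0]]                     # LoRA B (2 x r)
alpha, r = 2, 1
scale = alpha / r

# Fold the adapter into a single dense matrix.
BA = matmul(B, A)
W_merged = [[w + scale * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, BA)]

x = [1.0, 1.0, 1.0]
with_adapter = [b + scale * u for b, u in zip(matvec(W, x), matvec(B, matvec(A, x)))]
merged = matvec(W_merged, x)

assert all(abs(p - q) < 1e-9 for p, q in zip(with_adapter, merged))
print(merged)
```

After merging, inference needs only one matrix multiply per layer and no adapter code path.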