# Day 27: QLoRA Implementation - Part 1

In this notebook, we'll implement QLoRA (Quantized Low-Rank Adaptation) to fine-tune a large language model on consumer hardware. We'll cover setup, 4-bit quantization, QLoRA adapter configuration, dataset preparation, and a short training run that ends with saving the adapter.

## Overview

1. Setup and dependencies
2. Loading and quantizing a pre-trained model
3. Configuring QLoRA adapters
4. Preparing a dataset for fine-tuning
5. Setting up the training pipeline
6. Memory optimization techniques
7. Training the model
8. Saving the QLoRA adapter

## 1. Setup and Dependencies

First, let's install the necessary libraries. We'll need `bitsandbytes` for quantization, `peft` for LoRA, `transformers` for the model, `datasets` for loading the training data, and `trl` for the supervised fine-tuning trainer.

In [None]:
!pip install -q transformers datasets peft evaluate accelerate bitsandbytes trl

In [None]:
import os
import torch
import transformers
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    pipeline,
    logging
)
from peft import (
    prepare_model_for_kbit_training,
    LoraConfig,
    get_peft_model,
    PeftModel,
    TaskType
)
import bitsandbytes as bnb
from trl import SFTTrainer

# Set logging level
logging.set_verbosity_info()

# Check if GPU is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")
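
The libraries installed above change their APIs between releases, so it can help to record which versions are in use before going further. The following is a minimal sketch using only the standard library; the `PACKAGES` list and the `installed_versions` helper are names introduced here for illustration, not part of any of these libraries.

```python
from importlib.metadata import PackageNotFoundError, version

# Packages from the pip install cell (list introduced for this sketch)
PACKAGES = [
    "transformers", "datasets", "peft", "evaluate",
    "accelerate", "bitsandbytes", "trl",
]

def installed_versions(packages):
    """Return {package: version string, or None if it is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

versions = installed_versions(PACKAGES)
for name, ver in versions.items():
    print(f"{name:>14}: {ver or 'not installed'}")
```

Keeping this output with the notebook makes it easier to reproduce a run later or to track down an error caused by a changed keyword argument.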

## 2. Loading and Quantizing a Pre-trained Model

We'll use a smaller model for demonstration purposes, but the same principles apply to larger models like Llama 2 or Falcon.

In [None]:
# Define the model name
model_name = "facebook/opt-1.3b"  # Using OPT-1.3B as an example

# Configure quantization
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,              # Load model in 4-bit precision
    bnb_4bit_compute_dtype=torch.float16,  # Compute in fp16
    bnb_4bit_use_double_quant=True,  # Use double quantization
    bnb_4bit_quant_type="nf4"        # Use NF4 quantization
)

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Fall back to the EOS token only if the tokenizer defines no padding token
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Load the model with quantization
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quantization_config,
    device_map="auto",  # Automatically determine device mapping
)

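# Print model size information
def print_model_size(model):
    """Print the in-memory size of the model and its parameter count per dtype.

    Note: bitsandbytes packs two 4-bit weights into each uint8 element, so the
    element count reported for quantized layers is about half the true number
    of weights.
    """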
    param_size = 0
    for param in model.parameters():
        param_size += param.nelement() * param.element_size()
    buffer_size = 0
    for buffer in model.buffers():
        buffer_size += buffer.nelement() * buffer.element_size()
    
    size_mb = (param_size + buffer_size) / 1024**2
    print(f"Model size: {size_mb:.2f} MB")
    
    # Count parameters by dtype
    dtypes = {}
    for param in model.parameters():
        dtype = param.dtype
        if dtype not in dtypes:
            dtypes[dtype] = 0
        dtypes[dtype] += param.nelement()
    
    for dtype, count in dtypes.items():
        print(f"{dtype}: {count:,} parameters")

print_model_size(model)
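
To see roughly where the savings from `load_in_4bit` and `bnb_4bit_use_double_quant` come from, the arithmetic can be sketched in plain Python. This is an estimate under stated assumptions: one scaling constant per block of 64 weights, and one second-level constant per 256 first-level constants, which are the values described in the QLoRA paper. The function names are introduced here for illustration, and the 1.3e9 parameter count is nominal.

```python
def nf4_bits_per_param(block_size=64, double_quant=True, dq_block_size=256):
    """Approximate storage cost, in bits, of one NF4-quantized weight.

    Each block of `block_size` weights shares one scaling constant.
    Without double quantization that constant is stored in 32 bits.
    With it, the constant is itself quantized to 8 bits, and every
    `dq_block_size` constants share one additional 32-bit constant.
    """
    if double_quant:
        overhead = 8 / block_size + 32 / (block_size * dq_block_size)
    else:
        overhead = 32 / block_size
    return 4 + overhead

def weight_memory_gb(n_params, bits_per_param):
    """Memory needed to store n_params weights, in GB (10**9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

n_params = 1.3e9  # assumption: nominal size of OPT-1.3B
settings = [
    ("fp16", 16),
    ("nf4", nf4_bits_per_param(double_quant=False)),
    ("nf4 + double quant", nf4_bits_per_param(double_quant=True)),
]
for label, bits in settings:
    print(f"{label:>20}: {bits:.3f} bits/param, "
          f"{weight_memory_gb(n_params, bits):.2f} GB")
```

The number printed by `print_model_size` will be somewhat higher than this estimate, because layers that are not quantized (embeddings and layer norms, for example) stay in higher precision.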

## 3. Configuring QLoRA Adapters

Now, let's prepare the model for QLoRA fine-tuning by adding LoRA adapters to the quantized model.

In [None]:
# Prepare the model for k-bit training
model = prepare_model_for_kbit_training(model)

# Define LoRA configuration
lora_config = LoraConfig(
    r=16,                    # Rank of the update matrices
    lora_alpha=32,           # Alpha parameter for scaling
    target_modules=["q_proj", "k_proj", "v_proj", "out_proj"],  # Target attention modules (OPT names its output projection "out_proj")
    lora_dropout=0.05,       # Dropout probability for LoRA layers
    bias="none",             # Don't train bias parameters
    task_type=TaskType.CAUSAL_LM  # Task type (causal language modeling)
)

# Create the PEFT model
peft_model = get_peft_model(model, lora_config)

# Print trainable parameters
def print_trainable_parameters(model):
    """Prints the number of trainable parameters in the model."""
    trainable_params = 0
    all_params = 0
    for _, param in model.named_parameters():
        all_params += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params:,} || all params: {all_params:,} || "
        f"trainable%: {100 * trainable_params / all_params:.2f}%"
    )

print_trainable_parameters(peft_model)
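
The trainable parameter count can be sanity-checked by hand. For a linear layer with `d_in` inputs and `d_out` outputs, LoRA adds two matrices of shape `r x d_in` and `d_out x r`. The sketch below assumes a hidden size of 2048 and 24 decoder layers for OPT-1.3B, and that all four attention projections in each layer are matched; verify these against `model.config` before relying on the numbers.

```python
def lora_params_per_linear(d_in, d_out, r):
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)

hidden_size = 2048      # assumption: OPT-1.3B hidden size
num_layers = 24         # assumption: OPT-1.3B decoder layers
targets_per_layer = 4   # query, key, value, and output projections
r, lora_alpha = 16, 32  # values from the LoraConfig above

per_module = lora_params_per_linear(hidden_size, hidden_size, r)
total = per_module * targets_per_layer * num_layers
scaling = lora_alpha / r  # factor applied to the low-rank update

print(f"LoRA params per module: {per_module:,}")
print(f"LoRA params in total:   {total:,}")
print(f"Update scaling factor:  {scaling}")
```

If the count printed by `print_trainable_parameters` is lower than this estimate, one likely cause is a name in `target_modules` that does not match any module in the model.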

## 4. Preparing a Dataset for Fine-tuning

We'll use a subset of the Alpaca dataset for instruction fine-tuning.

In [None]:
# Load the Alpaca dataset
dataset = load_dataset("tatsu-lab/alpaca")
print(dataset)

# Look at a few examples
for i in range(3):
    print(f"Example {i+1}:")
    print(f"Instruction: {dataset['train'][i]['instruction']}")
    print(f"Input: {dataset['train'][i]['input']}")
    print(f"Output: {dataset['train'][i]['output']}")
    print()

In [None]:
# Format the dataset for instruction fine-tuning
def format_instruction(example):
    """Format the instruction, input, and output into a single text."""
    instruction = example["instruction"]
    input_text = example["input"]
    output = example["output"]
    
    if input_text:
        formatted_text = f"### Instruction: {instruction}\n\n### Input: {input_text}\n\n### Response: {output}"
    else:
        formatted_text = f"### Instruction: {instruction}\n\n### Response: {output}"
    
    return {"text": formatted_text}

# Apply the formatting function
formatted_dataset = dataset.map(format_instruction, remove_columns=["instruction", "input", "output"])

# Create a smaller dataset for demonstration
train_dataset = formatted_dataset["train"].select(range(1000))  # Use 1000 examples for training

# Show a formatted example
print(train_dataset[0]["text"])
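
To check the prompt template without downloading the dataset, the same formatting function can be run on hand-written records. The two records below are made up for illustration and are not taken from Alpaca.

```python
def format_instruction(example):
    """Format the instruction, input, and output into a single text."""
    instruction = example["instruction"]
    input_text = example["input"]
    output = example["output"]

    if input_text:
        formatted_text = (
            f"### Instruction: {instruction}\n\n"
            f"### Input: {input_text}\n\n"
            f"### Response: {output}"
        )
    else:
        formatted_text = f"### Instruction: {instruction}\n\n### Response: {output}"

    return {"text": formatted_text}

# Hypothetical records, one with an input field and one without
with_input = {
    "instruction": "Translate to French.",
    "input": "Good morning",
    "output": "Bonjour",
}
without_input = {
    "instruction": "Name a primary color.",
    "input": "",
    "output": "Red",
}

for record in (with_input, without_input):
    print(format_instruction(record)["text"])
    print("-" * 40)
```

Records with an empty `input` field skip the `### Input:` section entirely, so the model never sees an empty placeholder during training.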

## 5. Setting Up the Training Pipeline

We'll use the `SFTTrainer` from the TRL library to simplify the training process.

In [None]:
# Define training arguments
training_args = TrainingArguments(
    output_dir="./results/opt-qlora-alpaca",
    learning_rate=2e-4,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    max_steps=500,
    warmup_steps=50,
    fp16=True,
    logging_steps=10,
    save_strategy="steps",
    save_steps=100,
    optim="paged_adamw_8bit",  # Paged 8-bit AdamW: smaller optimizer state to save memory
    gradient_checkpointing=True,  # Enable gradient checkpointing
    report_to="none",  # Disable wandb, tensorboard, etc.
)

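# Create the trainer
# Note: these keyword arguments follow older TRL releases. Newer releases move
# options such as `max_seq_length`, `packing`, and `dataset_text_field` into
# `SFTConfig`, so check the documentation for your installed version.
trainer = SFTTrainer(
    model=peft_model,
    train_dataset=train_dataset,
    args=training_args,
    tokenizer=tokenizer,
    dataset_text_field="text",  # Column produced by format_instruction
    max_seq_length=512,
    packing=True,  # Pack multiple examples into one sequence
)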
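
The training arguments interact in ways that are easy to misjudge, so here is a small sketch of the arithmetic behind them. It assumes a single GPU and the default linear learning-rate schedule (linear warmup, then linear decay to zero). The helper `linear_schedule_lr` is written here for illustration and is not a library function.

```python
per_device_train_batch_size = 4
gradient_accumulation_steps = 4
max_steps = 500
warmup_steps = 50
learning_rate = 2e-4
num_devices = 1  # assumption: single GPU

# Sequences consumed per optimizer step and over the whole run
sequences_per_step = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)
sequences_seen = sequences_per_step * max_steps

def linear_schedule_lr(step, base_lr, warmup, total):
    """Learning rate at `step` for linear warmup followed by linear decay."""
    if step < warmup:
        return base_lr * step / max(1, warmup)
    return base_lr * max(0.0, (total - step) / max(1, total - warmup))

print(f"Sequences per optimizer step: {sequences_per_step}")
print(f"Sequences seen in {max_steps} steps: {sequences_seen}")
for step in (0, 25, 50, 275, 500):
    lr = linear_schedule_lr(step, learning_rate, warmup_steps, max_steps)
    print(f"step {step:>3}: lr = {lr:.2e}")
```

With 1,000 training examples, 500 steps at 16 sequences per step amounts to several passes over the data, and more when packing merges short examples into fewer sequences. Lowering `max_steps` is a reasonable choice for a quick demonstration.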

## 6. Memory Optimization Techniques

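Let's inspect the memory optimization techniques used in this setup: 4-bit NF4 quantization, double quantization, a paged 8-bit optimizer, and gradient checkpointing.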

In [None]:
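# Check if gradient checkpointing is enabled
# (`is_gradient_checkpointing` is a property of transformers' PreTrainedModel;
# PEFT forwards the attribute lookup to the wrapped base model)
print(f"Gradient checkpointing enabled: {peft_model.is_gradient_checkpointing}")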

# Check optimizer type
print(f"Optimizer: {training_args.optim}")

# Check quantization settings
print(f"4-bit quantization: {quantization_config.load_in_4bit}")
print(f"Double quantization: {quantization_config.bnb_4bit_use_double_quant}")
print(f"Quantization type: {quantization_config.bnb_4bit_quant_type}")

# Check memory usage before training
if torch.cuda.is_available():
    print(f"GPU memory allocated: {torch.cuda.memory_allocated() / 1024**3:.2f} GB")
    print(f"GPU memory reserved: {torch.cuda.memory_reserved() / 1024**3:.2f} GB")
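
The optimizer is a large part of the memory story, because Adam-style optimizers keep two state values per trainable parameter. The sketch below compares optimizer state size for 32-bit and 8-bit states. The trainable parameter count is an illustrative figure (rank-16 adapters on four 2048 x 2048 projections in 24 layers, all assumed), and the estimate ignores the small quantization constants that 8-bit optimizers also store.

```python
def adam_state_mb(trainable_params, bytes_per_state):
    """Memory for Adam's two moment estimates, in MiB."""
    return 2 * trainable_params * bytes_per_state / 1024**2

lora_trainable = 6_291_456       # assumption: illustrative LoRA parameter count
full_trainable = 1_300_000_000   # assumption: nominal full-model parameter count

rows = [
    ("LoRA, 32-bit states", lora_trainable, 4),
    ("LoRA, 8-bit states", lora_trainable, 1),
    ("Full model, 32-bit states", full_trainable, 4),
]
for label, count, nbytes in rows:
    print(f"{label:>26}: {adam_state_mb(count, nbytes):,.1f} MiB")
```

Most of the saving comes from training only the adapter parameters. The 8-bit optimizer then shrinks what is already a small state.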

## 7. Training the Model

Now, let's start the training process. Note that this will take some time, even with the optimizations.

In [None]:
# Train the model
trainer.train()

## 8. Saving the QLoRA Adapter

After training, we'll save the LoRA adapter weights.

In [None]:
# Save the LoRA adapter weights
peft_model_path = "./qlora-opt-alpaca"
peft_model.save_pretrained(peft_model_path)

print(f"QLoRA adapter saved to {peft_model_path}")

# Check the size of the saved adapter
!du -sh {peft_model_path}
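
On systems where `du` is not available, the same check can be done in Python. This is a minimal sketch using `pathlib`; the helper names are introduced here for illustration, and the directory path is the one used in the cell above.

```python
from pathlib import Path

def list_files(path):
    """Return sorted (relative path, size in bytes) pairs for files under `path`."""
    root = Path(path)
    return sorted(
        (str(f.relative_to(root)), f.stat().st_size)
        for f in root.rglob("*")
        if f.is_file()
    )

def directory_size_mb(path):
    """Sum the sizes of all regular files under `path`, in MiB."""
    return sum(size for _, size in list_files(path)) / 1024**2

adapter_dir = Path("./qlora-opt-alpaca")
if adapter_dir.exists():
    for name, size in list_files(adapter_dir):
        print(f"{size:>12,} bytes  {name}")
    print(f"Total: {directory_size_mb(adapter_dir):.2f} MiB")
else:
    print(f"{adapter_dir} not found; run the save cell first")
```

The adapter directory should be far smaller than the base model, since it holds only the LoRA weights and their configuration.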

## Conclusion

In this notebook, we've implemented QLoRA to fine-tune a language model with 4-bit quantization. We've seen how to:

1. Load a model with 4-bit quantization using BitsAndBytes
2. Configure LoRA adapters for the quantized model
3. Prepare a dataset for instruction fine-tuning
4. Set up memory-efficient training with gradient checkpointing and 8-bit optimizers
5. Train and save the QLoRA adapter

This approach allows us to fine-tune large language models that would otherwise not fit in the memory of consumer hardware. In Part 2, we'll explore how to use the fine-tuned model for inference and evaluate its performance.