# Day 27: QLoRA Implementation - Part 2

In this notebook, we'll focus on using our QLoRA-trained model for inference and evaluating its performance. We'll also explore advanced techniques for memory optimization and model deployment.

## Overview

1. Setup and dependencies
2. Loading the QLoRA adapter and base model
3. Inference with the fine-tuned model
4. Evaluating model performance
5. Advanced memory optimization techniques
6. Merging adapters for deployment
7. Converting to different quantization formats for inference

## 1. Setup and Dependencies

In [None]:
import os
import torch
import transformers
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    pipeline
)
from peft import PeftModel, PeftConfig
import bitsandbytes as bnb

# Check if GPU is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

## 2. Loading the QLoRA Adapter and Base Model

Let's load our fine-tuned QLoRA adapter and the base model.

In [None]:
# Path to the saved adapter
peft_model_path = "./qlora-opt-alpaca"

# Load the PEFT configuration
config = PeftConfig.from_pretrained(peft_model_path)
print(f"Base model: {config.base_model_name_or_path}")

# Configure quantization
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4"
)

# Load the base model with quantization
model = AutoModelForCausalLM.from_pretrained(
    config.base_model_name_or_path,
    quantization_config=quantization_config,
    device_map="auto"
)

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)
tokenizer.pad_token = tokenizer.eos_token

# Load the LoRA adapter
model = PeftModel.from_pretrained(model, peft_model_path)

print("Model and adapter loaded successfully!")
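
To get a feel for what the quantization settings above buy us, here is a rough back-of-the-envelope estimate of weight memory. It is only a sketch: the helper names are hypothetical, the block sizes (64 weights per block, 256 constants per second-level block) are the ones described in the QLoRA paper and are assumed to apply here, and the estimate ignores activations, the KV cache, and any layers kept in higher precision.

```python
def bits_per_param(double_quant: bool = True,
                   block_size: int = 64,
                   second_block_size: int = 256) -> float:
    """Approximate storage cost of one 4-bit weight, including quantization constants."""
    if double_quant:
        # One 8-bit quantized constant per block, plus one fp32 constant
        # per group of second_block_size first-level constants
        overhead = 8 / block_size + 32 / (block_size * second_block_size)
    else:
        # One fp32 constant per block
        overhead = 32 / block_size
    return 4 + overhead


def estimate_weight_gib(num_params: float, double_quant: bool = True) -> float:
    """Estimated size of the quantized weights in GiB."""
    return num_params * bits_per_param(double_quant) / 8 / 1024**3


# Hypothetical parameter counts, chosen only for illustration
for n in (1.3e9, 6.7e9):
    nf4 = estimate_weight_gib(n)
    fp16 = n * 2 / 1024**3
    print(f"{n / 1e9:.1f}B params: ~{nf4:.2f} GiB in NF4 vs ~{fp16:.2f} GiB in fp16")
```

Comparing this estimate with the numbers reported by `torch.cuda.memory_allocated()` after loading is a quick sanity check that the model really was loaded in 4-bit.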

## 3. Inference with the Fine-tuned Model

Now, let's use our fine-tuned model to generate responses to instructions.

In [None]:
# Set up a text generation pipeline
generator = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=256,  # cap generated tokens only; avoids a max_length/max_new_tokens conflict
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    repetition_penalty=1.2
)

# Function to generate responses
def generate_response(instruction, input_text=None):
    # Format the prompt
    if input_text:
        prompt = f"### Instruction: {instruction}\n\n### Input: {input_text}\n\n### Response:"
    else:
        prompt = f"### Instruction: {instruction}\n\n### Response:"
    
    # Generate the response
    result = generator(prompt, max_new_tokens=256)[0]["generated_text"]
    
    # Extract just the response part
    response = result.split("### Response:")[-1].strip()
    
    return response
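
The prompt formatting and response extraction in `generate_response` can also be factored into pure functions, which makes them easy to test without loading a model. The sketch below uses hypothetical helper names (`build_prompt`, `extract_response`) and the same template as above. It strips the prompt by prefix rather than splitting on the last `### Response:` marker, on the assumption that the model may occasionally repeat the marker or start a new instruction on its own.

```python
from typing import Optional

RESPONSE_MARKER = "### Response:"


def build_prompt(instruction: str, input_text: Optional[str] = None) -> str:
    """Format an instruction (and optional input) using the notebook's template."""
    parts = [f"### Instruction: {instruction}"]
    if input_text:
        parts.append(f"### Input: {input_text}")
    parts.append(RESPONSE_MARKER)
    return "\n\n".join(parts)


def extract_response(generated_text: str, prompt: str) -> str:
    """Return only the text produced after the prompt."""
    if generated_text.startswith(prompt):
        completion = generated_text[len(prompt):]
    else:
        # Fallback: the prompt was not echoed back verbatim
        completion = generated_text.split(RESPONSE_MARKER, 1)[-1]
    # Drop any follow-up turn the model started on its own
    completion = completion.split("### Instruction:", 1)[0]
    return completion.strip()


# Simulated pipeline output, so the example runs without a model
prompt = build_prompt("Translate to French.", "Good morning")
fake_output = prompt + " Bonjour\n\n### Instruction: Next task"
print(extract_response(fake_output, prompt))
```

The text-generation pipeline also accepts `return_full_text=False`, which returns only the completion and makes the prefix stripping unnecessary.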

In [None]:
# Test with some instructions
test_instructions = [
    "Write a short poem about artificial intelligence.",
    "Explain the concept of quantum computing to a 10-year-old.",
    "List five ways to reduce carbon emissions in daily life."
]

for instruction in test_instructions:
    print(f"Instruction: {instruction}")
    response = generate_response(instruction)
    print(f"Response:\n{response}")
    print("-" * 50)

## 4. Evaluating Model Performance

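Let's compare our model's outputs with reference answers on a few examples. The `tatsu-lab/alpaca` dataset ships only a `train` split, so we take a small slice of it; if these rows were also used for fine-tuning in Part 1, the comparison shows how well the model fits its training data rather than how well it generalizes.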

In [None]:
from datasets import load_dataset

# Load a small evaluation dataset
eval_dataset = load_dataset("tatsu-lab/alpaca", split="train[1000:1020]")  # Just 20 examples for demonstration

# Evaluate on a few examples
for i in range(5):  # Evaluate on 5 examples
    example = eval_dataset[i]
    instruction = example["instruction"]
    input_text = example["input"]
    reference_output = example["output"]
    
    print(f"Example {i+1}:")
    print(f"Instruction: {instruction}")
    if input_text:
        print(f"Input: {input_text}")
    print(f"Reference: {reference_output}")
    
    # Generate response
    model_output = generate_response(instruction, input_text)
    print(f"Model Output: {model_output}")
    print("-" * 50)
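
The loop above is a purely qualitative check. As one possible way to attach a number to it, the sketch below computes a token-level F1 score between a model output and its reference. This metric is not part of the original notebook and is only a crude lexical proxy: it penalizes valid paraphrases, so treat it as a rough signal rather than a measure of quality. The function name `token_f1` is hypothetical.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between two strings, using lowercase whitespace tokens."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most min(count) times
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Toy strings standing in for (model_output, reference_output) pairs
pairs = [
    ("the cat sat on the mat", "the cat sat on the mat"),
    ("a dog barked", "the cat sat on the mat"),
]
scores = [token_f1(pred, ref) for pred, ref in pairs]
print(f"Mean token F1: {sum(scores) / len(scores):.2f}")
```

In the evaluation loop, `token_f1(model_output, reference_output)` could be collected per example and averaged at the end.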

## 5. Advanced Memory Optimization Techniques

Let's explore some advanced memory optimization techniques for working with large models.

In [None]:
# Function to check GPU memory usage
def check_gpu_memory():
    if torch.cuda.is_available():
        print(f"GPU memory allocated: {torch.cuda.memory_allocated() / 1024**3:.2f} GB")
        print(f"GPU memory reserved: {torch.cuda.memory_reserved() / 1024**3:.2f} GB")
        print(f"Max GPU memory allocated: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GB")

# Check current memory usage
check_gpu_memory()
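
`check_gpu_memory` reports the peak since the process started, which mixes model loading with inference. A small wrapper that resets the peak counter before a call can isolate the cost of a single operation. The sketch below is one way to do that; the helper name `run_with_peak_memory` is hypothetical, and the import is guarded so the example still runs (reporting 0.0) on a machine without PyTorch or without a GPU.

```python
try:
    import torch
except ImportError:  # lets the sketch run on machines without PyTorch
    torch = None


def run_with_peak_memory(fn, *args, **kwargs):
    """Call fn and return (result, peak GPU memory in GiB allocated during the call)."""
    use_cuda = torch is not None and torch.cuda.is_available()
    if use_cuda:
        torch.cuda.synchronize()
        torch.cuda.reset_peak_memory_stats()
    result = fn(*args, **kwargs)
    if not use_cuda:
        return result, 0.0
    torch.cuda.synchronize()
    return result, torch.cuda.max_memory_allocated() / 1024**3


# Toy call; in the notebook you would pass generate_response and an instruction
result, peak_gib = run_with_peak_memory(sum, [1, 2, 3])
print(f"result={result}, peak={peak_gib:.2f} GiB")
```

Note that the reported peak includes memory that was already allocated before the call (such as the model weights), so the useful quantity is often the difference from `torch.cuda.memory_allocated()` taken just before the call.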

### 5.1 CPU Offloading

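For extremely large models, we can offload some layers to CPU. This trades speed for capacity, since offloaded weights have to be moved to the GPU whenever they are needed.

In [None]:
# Note: This is just a demonstration of the concept; the dict below is not applied to the model
# In practice, you would pass a dict like this as device_map to from_pretrained
# Module names depend on the architecture (inspect model.hf_device_map for the real ones)
# With bitsandbytes quantization, CPU offloading typically also needs llm_int8_enable_fp32_cpu_offload=True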

# Example of device map for CPU offloading
device_map_example = {
    "model.embed_tokens": "cpu",
    "model.layers.0": "cuda:0",
    "model.layers.1": "cuda:0",
    # ... more layers
    "model.layers.23": "cpu",
    "model.norm": "cuda:0",
    "lm_head": "cuda:0"
}

print("Example device map for CPU offloading:")
for key, value in list(device_map_example.items())[:5]:
    print(f"{key}: {value}")
print("... (more layers)")
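
Writing a device map by hand gets tedious for models with many layers, so it can be generated instead. The sketch below is one way to do that: it keeps the first few decoder layers on the GPU and sends the rest to CPU. The helper `build_device_map` is hypothetical, and the OPT-style module names are an assumption based on the adapter path used in this notebook; check them against `model.hf_device_map` or `model.named_modules()` before passing the result to `from_pretrained`.

```python
from typing import Dict, Iterable, Union

Device = Union[int, str]


def build_device_map(
    num_layers: int,
    num_gpu_layers: int,
    layer_prefix: str,
    other_modules: Iterable[str],
    gpu: Device = 0,
) -> Dict[str, Device]:
    """Place the first num_gpu_layers decoder layers on the GPU and the rest on CPU."""
    if not 0 <= num_gpu_layers <= num_layers:
        raise ValueError("num_gpu_layers must be between 0 and num_layers")
    # Embeddings, final norm, and output head stay on the GPU in this sketch
    device_map: Dict[str, Device] = {name: gpu for name in other_modules}
    for i in range(num_layers):
        device_map[f"{layer_prefix}.{i}"] = gpu if i < num_gpu_layers else "cpu"
    return device_map


# Assumed OPT-style module names; verify against your own model before use
opt_other = [
    "model.decoder.embed_tokens",
    "model.decoder.embed_positions",
    "model.decoder.final_layer_norm",
    "lm_head",
]
device_map = build_device_map(24, 16, "model.decoder.layers", opt_other)
num_cpu = sum(1 for device in device_map.values() if device == "cpu")
print(f"{num_cpu} of 24 layers offloaded to CPU")
```

Keeping the embeddings and the output head on the same device is a deliberate choice here, since some architectures tie those weights together.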

### 5.2 Flash Attention

Flash Attention is an exact attention implementation that computes attention in tiles and never materializes the full n × n score matrix, which reduces memory usage on long sequences.

In [None]:
# Note: This is just a demonstration of the concept
# In practice, you need a supported GPU, the flash-attn package, and a model architecture with Flash Attention support

print("Flash Attention benefits:")
print("1. O(n) attention memory in sequence length instead of O(n²)")
print("2. Faster computation on GPUs")
print("3. Enables processing of longer sequences")
print("\nTo use Flash Attention (recent transformers releases, fp16 or bf16 weights):")
print("model = AutoModelForCausalLM.from_pretrained('model_name', attn_implementation='flash_attention_2', torch_dtype=torch.float16)")
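
To see why the quadratic term matters, the following sketch computes how much memory a standard attention implementation would need just to hold the score matrix for one layer. The head count and sequence lengths are illustrative assumptions, not values taken from the model used in this notebook.

```python
def attention_scores_bytes(seq_len: int, num_heads: int,
                           batch_size: int = 1, bytes_per_element: int = 2) -> int:
    """Bytes needed to materialize a (seq_len x seq_len) score matrix for every head."""
    return batch_size * num_heads * seq_len * seq_len * bytes_per_element


# fp16 scores (2 bytes each), 32 heads, batch size 1
for seq_len in (512, 2048, 8192):
    gib = attention_scores_bytes(seq_len, num_heads=32) / 1024**3
    print(f"seq_len={seq_len:>5}: {gib:.3f} GiB per layer for attention scores")
```

Each 4x increase in sequence length multiplies this figure by 16, which is the cost that tiled attention avoids by never storing the full matrix.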

## 6. Merging Adapters for Deployment

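For deployment, we can merge the LoRA adapter with the base model to eliminate the adapter overhead. Note that the base model here is loaded in 4-bit: depending on the PEFT version, merging into quantized weights may be unsupported or may introduce small rounding errors, so a common alternative is to reload the base model in fp16, attach the adapter, and merge there.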

In [None]:
# Merge the adapter with the base model
merged_model = model.merge_and_unload()

# Save the merged model (optional - this would be a large file)
# merged_model.save_pretrained("./merged-qlora-opt-alpaca")

# Test the merged model
instruction = "Explain the difference between machine learning and deep learning."
print(f"Instruction: {instruction}")

# Create a new pipeline with the merged model
merged_generator = pipeline(
    "text-generation",
    model=merged_model,
    tokenizer=tokenizer,
    max_new_tokens=256,  # cap generated tokens only; avoids a max_length/max_new_tokens conflict
    do_sample=True,
    temperature=0.7,
    top_p=0.9
)

# Generate response
prompt = f"### Instruction: {instruction}\n\n### Response:"
result = merged_generator(prompt, max_new_tokens=256)[0]["generated_text"]
response = result.split("### Response:")[-1].strip()

print(f"Response:\n{response}")
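
Conceptually, merging folds the low-rank update into the frozen weight: W_merged = W + (alpha / r) · B · A, after which the adapter matrices are no longer needed. The toy example below checks this identity on tiny hand-written matrices in plain Python. It is a sketch of the arithmetic only, not of PEFT's internals, and it ignores quantization entirely; all values are made up for illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(W, A, B, alpha, r):
    """Return W + (alpha / r) * B @ A, the merged weight matrix."""
    scale = alpha / r
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]


W = [[1.0, 2.0], [3.0, 4.0]]   # frozen base weight (out=2, in=2)
A = [[1.0, -1.0]]              # LoRA down-projection (r=1, in=2)
B = [[2.0], [0.5]]             # LoRA up-projection (out=2, r=1)
alpha, r = 2, 1
x = [[1.0], [3.0]]             # input as a column vector

# Adapter path: base output plus the scaled low-rank update
base_out = matmul(W, x)
lora_out = matmul(B, matmul(A, x))
adapter_out = [[b[0] + (alpha / r) * l[0]] for b, l in zip(base_out, lora_out)]

# Merged path: a single matrix multiply with the merged weight
merged_out = matmul(merge_lora(W, A, B, alpha, r), x)
print("Outputs match:", adapter_out == merged_out)
```

With real floating-point weights the two paths agree only up to rounding, and with a quantized base the merged weight must additionally be re-quantized or stored at higher precision.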

## 7. Converting to Different Quantization Formats for Inference

For deployment, we might want to convert the model to a different quantization format optimized for inference, or apply other size and latency optimizations such as pruning and distillation.

In [None]:
# Note: This is just a demonstration of the concept
# In practice, you would use libraries like ONNX or TensorRT for optimized inference

print("Inference Optimization Options:")
print("1. ONNX Runtime: Convert model to ONNX format for optimized inference")
print("2. TensorRT: NVIDIA's deep learning inference optimizer")
print("3. INT8 Quantization: Quantize the merged model to 8-bit integers for inference")
print("4. Model Pruning: Remove less important weights")
print("5. Knowledge Distillation: Train a smaller model to mimic the larger one")
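
To make the INT8 option more concrete, here is a toy sketch of symmetric absmax quantization on a handful of values. It uses a single scale for the whole list, whereas real inference libraries typically use per-channel or per-block scales and often a calibration step, so read it as an illustration of the idea rather than a recipe. The function names and weights are made up.

```python
def quantize_absmax_int8(values):
    """Map floats to integers in [-127, 127] using a single absmax scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [round(v / scale) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the integers."""
    return [q * scale for q in quantized]


weights = [0.8, -1.27, 0.03, 0.0]
q, scale = quantize_absmax_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q, f"max error = {max_error:.4f}")
```

The reconstruction error of each value is bounded by half the scale, which is why a single large outlier hurts precision for everything else in the same group.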

## Conclusion

In this notebook, we've explored how to use a QLoRA-trained model for inference and evaluated its performance. We've also discussed advanced memory optimization techniques and deployment strategies.

Key takeaways:

1. QLoRA enables fine-tuning of large language models on consumer hardware
2. The fine-tuned model follows the instruction/response prompt format; output quality should be judged against the reference outputs, as in Section 4
3. Advanced memory optimization techniques like CPU offloading and Flash Attention can further reduce GPU memory requirements
4. For deployment, adapters can be merged with the base model to eliminate overhead
5. Additional optimizations like ONNX conversion or further quantization can improve inference performance

QLoRA represents a significant advancement in democratizing access to large language model fine-tuning, making it possible for researchers and developers with limited resources to adapt state-of-the-art models to their specific needs.