# Day 30: Adapter Merging and Model Packaging - Part 1

In this notebook, we'll explore different approaches to adapter management and model packaging for deployment. We'll focus on merging LoRA adapters with base models and preparing them for distribution.

## Overview

1. Setup and dependencies
2. Loading base models and LoRA adapters
3. Merging adapters with base models
4. Comparing merged models vs. on-the-fly adapters
5. Saving and loading merged models

## 1. Setup and Dependencies

In [None]:
!pip install -q transformers peft datasets torch accelerate

In [None]:
import os
import torch
import time
import numpy as np
from transformers import (
    AutoModelForSequenceClassification,
    AutoModelForCausalLM,
    AutoTokenizer,
    pipeline
)
from peft import (
    PeftModel,
    PeftConfig,
    LoraConfig,
    get_peft_model,
    TaskType
)

# Check if GPU is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

## 2. Creating a Base Model and LoRA Adapter

First, let's create a simple base model and LoRA adapter for demonstration purposes.

In [None]:
# Define the base model
base_model_name = "gpt2"  # Using a smaller model for demonstration

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 doesn't have a pad token by default

# Load the base model
base_model = AutoModelForCausalLM.from_pretrained(base_model_name).to(device)

# Define LoRA configuration
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["c_attn"],  # Target attention modules in GPT-2
    bias="none"
)

# Create the PEFT model
peft_model = get_peft_model(base_model, lora_config)

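# Print model information
# Note: get_peft_model() injects the LoRA layers into base_model in place and
# freezes the original weights, so both counts are taken from peft_model
def count_parameters(model, trainable_only=True):
    return sum(p.numel() for p in model.parameters() if p.requires_grad or not trainable_only)

total_params = count_parameters(peft_model, trainable_only=False)
trainable_params = count_parameters(peft_model)
print(f"Total parameters (base + LoRA): {total_params:,}")
print(f"LoRA trainable parameters: {trainable_params:,}")
print(f"Trainable fraction: {trainable_params / total_params * 100:.2f}%")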
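The trainable-parameter count printed by the cell above can be sanity-checked by hand. The following is a minimal sketch, assuming the standard GPT-2 small dimensions (12 transformer blocks, hidden size 768, and a fused `c_attn` projection from 768 to 2304 features). `lora_param_count` is a hypothetical helper written for this check, not part of PEFT.

```python
def lora_param_count(in_features: int, out_features: int, r: int) -> int:
    """Parameters added by one LoRA pair: A is (r x in), B is (out x r)."""
    return r * in_features + out_features * r

# Assumed GPT-2 small dimensions (hard-coded here, not read from the model)
num_layers = 12
hidden_size = 768
c_attn_out = 3 * hidden_size  # query, key, and value are fused in c_attn

per_layer = lora_param_count(hidden_size, c_attn_out, r=8)
total_lora = num_layers * per_layer

print(f"LoRA parameters per c_attn layer: {per_layer:,}")
print(f"LoRA parameters in total: {total_lora:,}")
```

If the number printed by the notebook differs from this estimate, the adapter is probably targeting different modules than intended.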

## 3. Simulating a Fine-tuned LoRA Adapter

For demonstration purposes, we'll simulate a fine-tuned LoRA adapter by modifying some weights directly.

In [None]:
# Simulate fine-tuning by directly modifying some LoRA weights
# In a real scenario, you would train the model on a dataset

# Find LoRA modules
lora_modules = []
for name, module in peft_model.named_modules():
    if "lora" in name and hasattr(module, "weight"):
        lora_modules.append((name, module))

# Modify some weights to simulate training
# (LoRA's B matrices are initialized to zero by default, so without this step
# the adapter would leave the base model's outputs unchanged)
with torch.no_grad():
    for name, module in lora_modules:
        # Add small random values to simulate training updates
        module.weight.add_(torch.randn_like(module.weight) * 0.01)

print(f"Simulated fine-tuning on {len(lora_modules)} LoRA modules")

## 4. Saving the LoRA Adapter

Now, let's save the LoRA adapter weights separately from the base model.

In [None]:
# Create a directory for the adapter
adapter_path = "./lora-adapter"
os.makedirs(adapter_path, exist_ok=True)

# Save the adapter weights
peft_model.save_pretrained(adapter_path)

print(f"LoRA adapter saved to {adapter_path}")

# Check the size of the saved adapter
!du -sh {adapter_path}
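The `du` command is only available on Unix-like systems. As a portable alternative, the directory size can be computed in Python. This is a minimal sketch; `directory_size_mb` is a hypothetical helper name, and it sums file sizes rather than allocated disk blocks, so its result can differ slightly from what `du` reports.

```python
import os

def directory_size_mb(path: str) -> float:
    """Return the total size of all regular files under `path`, in MB."""
    total_bytes = 0
    for root, _dirs, files in os.walk(path):
        for file_name in files:
            file_path = os.path.join(root, file_name)
            # Skip symbolic links so linked files are not counted twice
            if not os.path.islink(file_path):
                total_bytes += os.path.getsize(file_path)
    return total_bytes / (1024 * 1024)
```

For example, `print(f"{directory_size_mb(adapter_path):.2f} MB")` reports the adapter size without shelling out.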

## 5. Loading the Base Model and Adapter Separately

Let's load the base model and adapter separately, which is the on-the-fly approach.

In [None]:
# Load the base model again
fresh_base_model = AutoModelForCausalLM.from_pretrained(base_model_name).to(device)

# Load the adapter configuration
adapter_config = PeftConfig.from_pretrained(adapter_path)
print(f"Adapter config: {adapter_config}")

# Load the adapter with the base model
adapter_model = PeftModel.from_pretrained(fresh_base_model, adapter_path)

print(f"Successfully loaded adapter model")

## 6. Merging the Adapter with the Base Model

Now, let's merge the adapter weights with the base model to create a single model.

In [None]:
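# Merge the adapter with the base model
# merge_and_unload() folds the LoRA weights into the wrapped model in place, so we
# merge a separate copy and keep adapter_model unmerged for the comparisons below
merge_base = AutoModelForCausalLM.from_pretrained(base_model_name).to(device)
base_param_count = sum(p.numel() for p in merge_base.parameters())  # counted before LoRA layers are injected
merged_model = PeftModel.from_pretrained(merge_base, adapter_path).merge_and_unload()

print("Successfully merged adapter with base model")

# Verify that the merged model has the same number of parameters as the base model
# (all parameters are counted, because the loaded weights are frozen for inference)
print(f"Base model parameters: {base_param_count:,}")
print(f"Merged model parameters: {sum(p.numel() for p in merged_model.parameters()):,}")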
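Conceptually, merging replaces each adapted weight matrix W with W + (lora_alpha / r) · B · A, so that a single matrix multiplication gives the same result as the base layer plus the low-rank update. The sketch below illustrates this with toy matrices in plain Python. The sizes and values are made up for illustration, and the helper functions are not part of PEFT.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]

def scale(a, s):
    """Multiply every entry of a matrix by the scalar s."""
    return [[x * s for x in row] for row in a]

# Toy layer: 3 input features, 2 output features, LoRA rank 1 (illustrative values)
W = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]  # base weight, shape (out, in)
A = [[0.5, -0.5, 1.0]]                   # LoRA A, shape (r, in)
B = [[2.0], [-1.0]]                      # LoRA B, shape (out, r)
lora_alpha, r = 2.0, 1
scaling = lora_alpha / r

x = [[1.0], [2.0], [3.0]]                # input as a column vector

# On-the-fly: base output plus the scaled low-rank update
on_the_fly = add(matmul(W, x), scale(matmul(B, matmul(A, x)), scaling))

# Merged: fold the update into the weight once, then do a single matmul
W_merged = add(W, scale(matmul(B, A), scaling))
merged = matmul(W_merged, x)

print("On-the-fly output:", on_the_fly)
print("Merged output:    ", merged)
```

Both paths compute the same function. With real floating-point weights the two results can differ by rounding error, which is why merged and unmerged models are usually compared with a tolerance rather than exact equality.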

## 7. Comparing Inference Speed: On-the-fly vs. Merged

Let's compare the inference speed of the on-the-fly adapter approach versus the merged model approach.

In [None]:
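# Function to measure inference time
def measure_inference_time(model, tokenizer, prompt, num_runs=10):
    model.eval()  # make sure dropout (including LoRA dropout) is disabled
    # Tokenize the input
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    gen_kwargs = dict(max_length=50, num_return_sequences=1, pad_token_id=tokenizer.eos_token_id)

    def sync():
        # CUDA kernels run asynchronously; wait for them so the timer is accurate
        if torch.cuda.is_available():
            torch.cuda.synchronize()

    # Warm-up run
    with torch.no_grad():
        _ = model.generate(**inputs, **gen_kwargs)
    sync()

    # Measure inference time
    start_time = time.perf_counter()
    with torch.no_grad():
        for _ in range(num_runs):
            _ = model.generate(**inputs, **gen_kwargs)
    sync()
    end_time = time.perf_counter()

    avg_time = (end_time - start_time) / num_runs
    return avg_time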

In [None]:
# Compare inference speed
prompt = "The future of artificial intelligence is"

# Measure on-the-fly adapter inference time
adapter_time = measure_inference_time(adapter_model, tokenizer, prompt)
print(f"On-the-fly adapter inference time: {adapter_time:.4f} seconds per run")

# Measure merged model inference time
merged_time = measure_inference_time(merged_model, tokenizer, prompt)
print(f"Merged model inference time: {merged_time:.4f} seconds per run")

# Calculate speedup
speedup = adapter_time / merged_time
print(f"Speedup from merging: {speedup:.2f}x")
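A single average can hide run-to-run noise, and when the two timings are close, the speedup ratio may not be meaningful. The sketch below records each run separately and reports the spread as well. `benchmark` is a hypothetical helper that times any zero-argument callable; on a GPU, the callable itself should synchronize before returning.

```python
import statistics
import time

def benchmark(fn, num_runs=10, warmup=1):
    """Time a zero-argument callable and return summary statistics in seconds."""
    # Warm-up runs are executed but not timed
    for _ in range(warmup):
        fn()

    durations = []
    for _ in range(num_runs):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)

    return {
        "mean": statistics.mean(durations),
        "median": statistics.median(durations),
        # Standard deviation needs at least two samples
        "stdev": statistics.stdev(durations) if len(durations) > 1 else 0.0,
        "runs": num_runs,
    }

# Example with a cheap stand-in workload instead of model.generate()
stats = benchmark(lambda: sum(range(100_000)), num_runs=5)
print(f"median {stats['median']:.6f}s, stdev {stats['stdev']:.6f}s over {stats['runs']} runs")
```

If the difference between the two models' medians is smaller than the standard deviation, treat the measured speedup as inconclusive.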

## 8. Comparing Memory Usage: On-the-fly vs. Merged

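Let's also compare the peak GPU memory of both approaches. Note that `torch.cuda.max_memory_allocated()` counts every tensor currently on the GPU, including the other models loaded earlier in this notebook, so the difference between the two measurements is more informative than either absolute value.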

In [None]:
# Function to measure peak memory usage
def measure_peak_memory(model, tokenizer, prompt):
    if not torch.cuda.is_available():
        return None  # Peak-memory statistics are only tracked for CUDA devices

    # Free cached blocks, then reset the peak counter
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()

    # Tokenize the input
    inputs = tokenizer(prompt, return_tensors="pt").to(device)

    # Run inference
    with torch.no_grad():
        _ = model.generate(**inputs, max_length=50, num_return_sequences=1)
    torch.cuda.synchronize()  # make sure generation has finished

    # Get peak memory usage
    peak_memory = torch.cuda.max_memory_allocated() / (1024 * 1024)  # MB
    return peak_memory

In [None]:
# Compare memory usage
if torch.cuda.is_available():
    # Measure on-the-fly adapter memory usage
    adapter_memory = measure_peak_memory(adapter_model, tokenizer, prompt)
    print(f"On-the-fly adapter peak memory: {adapter_memory:.2f} MB")
    
    # Measure merged model memory usage
    merged_memory = measure_peak_memory(merged_model, tokenizer, prompt)
    print(f"Merged model peak memory: {merged_memory:.2f} MB")
    
    # Calculate memory difference
    memory_diff = adapter_memory - merged_memory
    print(f"Memory difference: {memory_diff:.2f} MB ({memory_diff / adapter_memory * 100:.2f}%)")
else:
    print("CUDA not available for memory measurement")

## 9. Saving the Merged Model

Let's save the merged model for deployment.

In [None]:
# Create a directory for the merged model
merged_model_path = "./merged-model"
os.makedirs(merged_model_path, exist_ok=True)

# Save the merged model
merged_model.save_pretrained(merged_model_path)

# Save the tokenizer
tokenizer.save_pretrained(merged_model_path)

print(f"Merged model saved to {merged_model_path}")

# Check the size of the saved model
!du -sh {merged_model_path}

## 10. Loading and Using the Merged Model

Now, let's load the merged model and use it for inference.

In [None]:
# Load the merged model
loaded_merged_model = AutoModelForCausalLM.from_pretrained(merged_model_path).to(device)
loaded_tokenizer = AutoTokenizer.from_pretrained(merged_model_path)

print(f"Successfully loaded merged model")

In [None]:
# Create a text generation pipeline
generator = pipeline(
    "text-generation",
    model=loaded_merged_model,
    tokenizer=loaded_tokenizer,
    device=0 if torch.cuda.is_available() else -1
)

# Generate text
prompts = [
    "The future of artificial intelligence is",
    "Climate change will impact our planet by",
    "The most important skill for the 21st century is"
]

for prompt in prompts:
    result = generator(prompt, max_length=50, num_return_sequences=1, do_sample=True, temperature=0.7)
    print(f"Prompt: {prompt}")
    print(f"Generated: {result[0]['generated_text']}")
    print("-" * 50)

## 11. Comparing File Sizes: Adapter vs. Merged Model

Let's compare the file sizes of the adapter and the merged model.

In [None]:
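# Get file sizes
!echo "Adapter size:"
!du -sh {adapter_path}

!echo "Merged model size:"
!du -sh {merged_model_path}

# The base model lives in the Hugging Face cache rather than a local folder,
# so look up the size of its cached files there
from huggingface_hub import scan_cache_dir
base_sizes = [repo.size_on_disk_str for repo in scan_cache_dir().repos if repo.repo_id == base_model_name]
print(f"Base model size (cached files): {base_sizes[0] if base_sizes else 'not found in cache'}")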

## Conclusion

In this notebook, we've explored two approaches to adapter management:

1. **On-the-fly Adapter Composition**: Keeping the base model and adapters separate and applying the adapters during inference.
   - Advantages: Smaller artifacts, flexibility to switch adapters
   - Disadvantages: A small extra computation in every adapted layer at inference time, and the `peft` library is needed to load the adapter

2. **Adapter Merging**: Combining the weights of the base model and adapters into a single model.
   - Advantages: Faster inference, standard pipelines work without modification
   - Disadvantages: Larger artifact size, cannot switch adapters dynamically

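Merging removes the extra low-rank computation from every adapted layer, which can speed up inference, and the merged weights produce the same outputs as the base model plus adapter up to floating-point rounding. The size of the speedup depends on the model and hardware. The cost is a full-size artifact for every fine-tuned variant and the loss of modularity.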

The choice between these approaches depends on your specific deployment needs:
- For simple deployments where performance is critical, merged models are preferable.
- For scenarios requiring flexibility to switch between tasks or where storage is limited, on-the-fly adapters may be better.
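The storage side of this trade-off can be estimated from parameter counts alone. The sketch below assumes fp32 weights (4 bytes per parameter), the GPT-2 small parameter count, and the `r=8` adapter on `c_attn` used in this notebook. These constants are assumptions for illustration, and the estimate ignores file-format overhead and tokenizer files.

```python
BYTES_PER_PARAM = 4          # assumes fp32 weights
BASE_PARAMS = 124_439_808    # assumed parameter count of GPT-2 small
ADAPTER_PARAMS = 294_912     # assumed: r=8 LoRA on c_attn in 12 layers

def storage_mb(num_tasks: int, merged: bool) -> float:
    """Estimated weight storage in MB for `num_tasks` fine-tuned variants."""
    if merged:
        # Every task ships its own full-size copy of the model
        total_params = num_tasks * BASE_PARAMS
    else:
        # One shared base model plus one small adapter per task
        total_params = BASE_PARAMS + num_tasks * ADAPTER_PARAMS
    return total_params * BYTES_PER_PARAM / (1024 * 1024)

for num_tasks in (1, 5, 20):
    separate = storage_mb(num_tasks, merged=False)
    merged_total = storage_mb(num_tasks, merged=True)
    print(f"{num_tasks:>2} task(s): adapters {separate:8.1f} MB | merged {merged_total:8.1f} MB")
```

For a single task the two options need almost the same space. The gap grows with every additional task, which is where separate adapters pay off.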

In Part 2, we'll explore model documentation and creating comprehensive model cards for responsible sharing.