# Lab 5: IA³ - Model Merging and Deployment

## 🎯 Experiment Objectives

This notebook demonstrates one of the core advantages of IA³ (Infused Adapter by Inhibiting and Amplifying Inner Activations): **the learned scaling vectors can be fully merged into the base model, so inference needs no extra adapter computation**.

### Key Learning Points
- Understand the mathematical merging principles of IA³ scaling vectors
- Implement complete model merging and saving workflow
- Verify functional consistency before and after merging
- Analyze performance improvements from merging
- Master production deployment best practices

---

## 1. Environment Setup and Dependency Check

In [None]:
# Import necessary libraries
import torch
import torch.nn as nn
import time
import gc
from transformers import (
    AutoTokenizer, 
    AutoModelForCausalLM,
    pipeline
)
from peft import (
    PeftModel, 
    IA3Config,
    get_peft_model,
    TaskType
)
import os
import json

# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Set random seed
torch.manual_seed(42)
if torch.cuda.is_available():
    torch.cuda.manual_seed(42)

## 2. Load Trained IA³ Model

First, load the IA³ adapter model trained in `02-Train.ipynb`.

In [None]:
# Model and adapter paths
base_model_name = "gpt2"
adapter_path = "./ia3-gpt2-imdb"  # Assume this is the trained adapter path

# Load base model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

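# Use fp16 only on GPU; half-precision ops can be slow or unsupported on CPU
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
    device_map="auto" if torch.cuda.is_available() else None  # "auto" requires the accelerate package
)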

print(f"Base model parameters: {base_model.num_parameters():,}")
print(f"Base model memory usage: {base_model.get_memory_footprint() / 1024**2:.2f} MB")

In [None]:
# Load IA³ adapter (if exists)
if os.path.exists(adapter_path):
    peft_model = PeftModel.from_pretrained(base_model, adapter_path)
    print(f"Successfully loaded IA³ adapter: {adapter_path}")
else:
    # If no trained adapter exists, create a demo IA³ configuration
    print("No trained adapter found, creating demo IA³ configuration...")
    
    ia3_config = IA3Config(
        task_type=TaskType.CAUSAL_LM,
        target_modules=["c_attn", "mlp.c_proj"],
        feedforward_modules=["mlp.c_proj"],
    )
    
    peft_model = get_peft_model(base_model, ia3_config)
    print("Created demo IA³ model")

# Display trainable parameter statistics
peft_model.print_trainable_parameters()

## 3. IA³ Scaling Vector Merging Principles

### 3.1 Theoretical Foundation

At its core, IA³ learns three scaling vectors:
- `l_k`: scales the attention keys
- `l_v`: scales the attention values
- `l_ff`: scales the intermediate feed-forward activations

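**Mathematical Merging Principle**:
```
Attention keys/values (the vector scales the layer output):
  Original operation: output = l ⊙ (W @ input)
  After merging:      output = (diag(l) @ W) @ input    # each row of W scaled

Feed-forward (the vector scales the layer input):
  Original operation: output = W @ (l_ff ⊙ input)
  After merging:      output = (W @ diag(l_ff)) @ input # each column of W scaled
```

Where `⊙` denotes element-wise multiplication and `diag(l)` is the diagonal matrix built from `l`. Both forms are exactly equivalent in real arithmetic, so merging does not change model behavior beyond floating-point rounding (which is more noticeable in fp16).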
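
The merge identity can be checked numerically on random tensors. The following is a minimal sketch, assuming a plain `y = W @ x` layer layout; the names `W`, `l_out`, and `l_in` are illustrative and are not PEFT internals. It covers both the case where the vector scales a layer's output and the case where it scales a layer's input.

```python
import torch

torch.manual_seed(0)
d_in, d_out = 6, 4
W = torch.randn(d_out, d_in, dtype=torch.float64)   # layer weight, y = W @ x
x = torch.randn(d_in, dtype=torch.float64)          # one input activation

# Output scaling: l ⊙ (W @ x) == (l[:, None] * W) @ x
l_out = torch.rand(d_out, dtype=torch.float64) + 0.5
unmerged_out = l_out * (W @ x)
merged_out = (l_out[:, None] * W) @ x                # scales each row of W

# Input scaling: W @ (l ⊙ x) == (W * l[None, :]) @ x
l_in = torch.rand(d_in, dtype=torch.float64) + 0.5
unmerged_in = W @ (l_in * x)
merged_in = (W * l_in[None, :]) @ x                  # scales each column of W

print(torch.allclose(unmerged_out, merged_out))
print(torch.allclose(unmerged_in, merged_in))
```

Because the scaling folds into `W` itself, the merged layer has the same shape and cost as the original one.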

In [None]:
# Analyze IA³ adapter structure
def analyze_ia3_adapters(model):
    """Analyze IA³ adapter structure and parameters"""
    ia3_params = {}
    
    for name, param in model.named_parameters():
        if 'ia3' in name.lower():
            ia3_params[name] = {
                'shape': param.shape,
                'dtype': param.dtype,
                'requires_grad': param.requires_grad,
                'device': param.device
            }
            print(f"IA³ parameter: {name}")
            print(f"  Shape: {param.shape}")
            print(f"  Data type: {param.dtype}")
            print(f"  Parameter count: {param.numel():,}")
            print()
    
    return ia3_params

ia3_structure = analyze_ia3_adapters(peft_model)
print(f"Total IA³ adapters: {len(ia3_structure)}")
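
As a sanity check on the printed shapes, the expected number of IA³ parameters for GPT-2 small can be worked out by hand. This is a rough sketch: the GPT-2 dimensions are the published ones, but the vector sizes (one entry per output feature for `c_attn`, one entry per input feature for `mlp.c_proj`) are an assumption about how the adapter is sized, so compare the result with what `analyze_ia3_adapters` actually reports.

```python
# GPT-2 small dimensions (published architecture)
n_layer, hidden, mlp_inner = 12, 768, 3072
base_params = 124_439_808  # GPT-2 small parameter count with tied embeddings

# Assumed vector sizes (hypothetical, verify against the printed shapes):
c_attn_vector = 3 * hidden      # fused q/k/v projection output
mlp_c_proj_vector = mlp_inner   # input to the second MLP projection

ia3_params = n_layer * (c_attn_vector + mlp_c_proj_vector)
fraction = ia3_params / (base_params + ia3_params)

print(f"Expected IA³ parameters: {ia3_params:,}")
print(f"Share of all parameters: {fraction:.4%}")
```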

### 3.2 Pre-merge Performance Baseline

In [None]:
def benchmark_inference(model, tokenizer, test_prompts, num_runs=5, seed=42):
    """Time generation over test_prompts; return (avg seconds per run, first-run outputs)."""
    model.eval()
    model_device = next(model.parameters()).device
    total_time = 0.0
    results = []

    # Reseed so that separate benchmark calls sample from the same random stream
    torch.manual_seed(seed)

    with torch.no_grad():
        for run in range(num_runs):
            if torch.cuda.is_available():
                torch.cuda.synchronize()  # finish pending GPU work before starting the clock
            start_time = time.perf_counter()

            for prompt in test_prompts:
                inputs = tokenizer(prompt, return_tensors="pt", padding=True, truncation=True)
                inputs = {k: v.to(model_device) for k, v in inputs.items()}

                outputs = model.generate(
                    **inputs,
                    max_new_tokens=50,
                    do_sample=True,
                    temperature=0.7,
                    pad_token_id=tokenizer.eos_token_id
                )

                generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
                if run == 0:  # Only save results on first run
                    results.append({
                        'prompt': prompt,
                        'generated': generated_text[len(prompt):].strip()
                    })

            if torch.cuda.is_available():
                torch.cuda.synchronize()  # wait for queued GPU kernels before stopping the clock
            total_time += time.perf_counter() - start_time

    avg_time = total_time / num_runs
    return avg_time, results

# Test prompts
test_prompts = [
    "This movie is absolutely",
    "I think this film was",
    "The acting in this movie",
    "Overall, my impression is"
]

print("=== Pre-merge Performance Test ===")
pre_merge_time, pre_merge_results = benchmark_inference(peft_model, tokenizer, test_prompts)
print(f"Average inference time: {pre_merge_time:.4f} seconds")
print(f"Memory usage: {peft_model.get_memory_footprint() / 1024**2:.2f} MB")

print("\n=== Pre-merge Generation Examples ===")
for i, result in enumerate(pre_merge_results[:2]):  # Only show first two examples
    print(f"Prompt {i+1}: {result['prompt']}")
    print(f"Generated: {result['generated']}")
    print()
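
Single wall-clock averages are sensitive to warm-up effects and outliers. One possible refinement is sketched below: a small helper (hypothetical, not part of this lab's code) that discards warm-up calls, optionally synchronizes the device, and reports the median. The `sync` argument is an assumption of this sketch; on a GPU one would pass `torch.cuda.synchronize`.

```python
import statistics
import time

def time_callable(fn, warmup=1, runs=5, sync=None):
    """Return the median wall-clock seconds of fn() over `runs` timed calls.

    `sync` is an optional zero-argument callable invoked around the timed
    call so that asynchronous device work is included in the measurement.
    """
    for _ in range(warmup):          # untimed calls absorb one-off setup cost
        fn()
    samples = []
    for _ in range(runs):
        if sync is not None:
            sync()
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)

# Usage with a stand-in workload
calls = []
median_s = time_callable(lambda: calls.append(sum(range(10_000))), warmup=2, runs=5)
print(f"median: {median_s:.6f} s over 5 runs")
```

The median is less affected than the mean by a single slow run, which matters when the expected difference between pre-merge and post-merge timings is small.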

## 4. Execute Model Merging

### 4.1 Using PEFT's merge_and_unload Method

In [None]:
print("=== Starting Model Merging Process ===")
print("Merging IA³ scaling vectors into base model weights...")

# Execute merge operation
merged_model = peft_model.merge_and_unload()

print("✅ Merge completed!")
print(f"Merged model parameters: {merged_model.num_parameters():,}")
print(f"Merged model memory usage: {merged_model.get_memory_footprint() / 1024**2:.2f} MB")

# Clean up original PEFT model to free memory
del peft_model
gc.collect()
if torch.cuda.is_available():
    torch.cuda.empty_cache()

print("Memory cleanup completed")
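
When the base weights are stored in fp16, the merged weights are rounded to fp16 as well, so the merged model can differ very slightly from the unmerged one. The sketch below illustrates the size of that rounding on a random matrix; `W` and `l` are stand-ins chosen for illustration, not tensors taken from the model.

```python
import torch

torch.manual_seed(0)
W = torch.randn(64, 64).half()            # base weight stored in fp16
l = 1.0 + 0.1 * torch.randn(64, 1)        # IA³-style row scaling, values near 1

exact = W.float() * l                     # merged weight in fp32 (reference)
stored = exact.half()                     # merged weight as it would be kept in fp16

max_abs_err = (stored.float() - exact).abs().max().item()
print(f"max abs rounding error after fp16 merge: {max_abs_err:.2e}")
```

If this rounding matters for an application, one option is to load the base model in fp32, merge, and cast afterwards.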

### 4.2 Post-merge Performance Test

In [None]:
print("=== Post-merge Performance Test ===")
post_merge_time, post_merge_results = benchmark_inference(merged_model, tokenizer, test_prompts)
print(f"Average inference time: {post_merge_time:.4f} seconds")
print(f"Memory usage: {merged_model.get_memory_footprint() / 1024**2:.2f} MB")

print("\n=== Post-merge Generation Examples ===")
for i, result in enumerate(post_merge_results[:2]):
    print(f"Prompt {i+1}: {result['prompt']}")
    print(f"Generated: {result['generated']}")
    print()

### 4.3 Performance Comparison Analysis

In [None]:
# Calculate performance improvement
time_improvement = (pre_merge_time - post_merge_time) / pre_merge_time * 100

print("=== Performance Comparison Results ===")
print(f"Pre-merge inference time: {pre_merge_time:.4f} seconds")
print(f"Post-merge inference time: {post_merge_time:.4f} seconds")
print(f"Inference speed improvement: {time_improvement:.2f}%")

print("\n=== Deployment Advantages Summary ===")
print("✅ Inference latency: Zero overhead (scaling vectors merged)")
print("✅ Memory efficiency: No need to load additional adapters")
print("✅ Deployment simplification: Single model file only")
print("✅ Hardware compatibility: Same requirements as original model")

## 5. Functional Consistency Verification

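Verify that model outputs are consistent before and after merging. Generation here uses sampling, so an exact text match is only expected when both runs draw from the same random state; even then, small fp16 rounding differences can occasionally change a sampled token.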

In [None]:
def verify_consistency(pre_results, post_results):
    """Verify consistency of outputs before and after merging"""
    print("=== Functional Consistency Verification ===")
    
    consistent_count = 0
    total_count = len(pre_results)
    
    for i, (pre, post) in enumerate(zip(pre_results, post_results)):
        # Compare generated text (may not be identical due to randomness)
        pre_text = pre['generated'].strip()
        post_text = post['generated'].strip()
        
        print(f"\nSample {i+1}:")
        print(f"  Prompt: {pre['prompt']}")
        print(f"  Pre-merge: {pre_text[:100]}{'...' if len(pre_text) > 100 else ''}")
        print(f"  Post-merge: {post_text[:100]}{'...' if len(post_text) > 100 else ''}")
        
        # Check if identical (exact match)
        if pre_text == post_text:
            consistent_count += 1
            print(f"  ✅ Perfectly consistent")
        else:
            print(f"  ⚠️ Differences (normal due to generation randomness)")
    
    print(f"\nPerfectly consistent samples: {consistent_count}/{total_count}")
    print("Note: Due to randomness in text generation, outputs may differ, which is normal.")
    print("The important thing is that model structure and parameters are correctly merged.")

verify_consistency(pre_merge_results, post_merge_results)
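
Comparing sampled text is a weak test. A stricter, deterministic check compares raw outputs on the same input before and after merging. The sketch below shows the idea on a single `nn.Linear` standing in for the model; `max_abs_diff` and `scale` are hypothetical names introduced for illustration.

```python
import torch
import torch.nn as nn

def max_abs_diff(reference, candidate):
    """Largest element-wise absolute difference between two tensors."""
    return (reference.float() - candidate.float()).abs().max().item()

# Stand-in for "adapter applied at run time" vs "adapter merged into weights"
torch.manual_seed(0)
layer = nn.Linear(8, 8, bias=False)
scale = 1.0 + 0.1 * torch.randn(8)        # hypothetical learned scaling vector
x = torch.randn(4, 8)

with torch.no_grad():
    pre_merge = scale * layer(x)           # capture BEFORE merging
    layer.weight.mul_(scale[:, None])      # in-place merge: scale each output row
    post_merge = layer(x)

diff = max_abs_diff(pre_merge, post_merge)
print(f"max abs difference: {diff:.2e}")
print("consistent" if diff < 1e-5 else "mismatch")
```

Applied to this lab, the same pattern would capture `peft_model(**inputs).logits` before calling `merge_and_unload()` (merging may modify the base weights in place) and compare it with `merged_model(**inputs).logits`, using a looser tolerance for fp16; a value around `1e-2` is only a rough starting point.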

## 6. Model Saving and Deployment

### 6.1 Save Merged Model

In [None]:
# Set save path
merged_model_path = "./ia3-gpt2-merged"

print(f"=== Saving merged model to: {merged_model_path} ===")

# Save model and tokenizer
merged_model.save_pretrained(merged_model_path)
tokenizer.save_pretrained(merged_model_path)

print("✅ Model saved successfully!")

# Check saved files
saved_files = os.listdir(merged_model_path)
print(f"\nSaved files: {saved_files}")

# Calculate model size
total_size = 0
for file in saved_files:
    file_path = os.path.join(merged_model_path, file)
    if os.path.isfile(file_path):
        size = os.path.getsize(file_path)
        total_size += size
        print(f"  {file}: {size / 1024**2:.2f} MB")

print(f"\nTotal model size: {total_size / 1024**2:.2f} MB")

### 6.2 Verify Saved Model

In [None]:
# Reload model from saved path
print("=== Verifying Saved Model ===")

# Reload
reloaded_tokenizer = AutoTokenizer.from_pretrained(merged_model_path)
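# Use fp16 only on GPU; half-precision ops can be slow or unsupported on CPU
reloaded_model = AutoModelForCausalLM.from_pretrained(
    merged_model_path,
    torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
    device_map="auto" if torch.cuda.is_available() else None
)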

print(f"✅ Successfully loaded saved model")
print(f"Reloaded model parameters: {reloaded_model.num_parameters():,}")

# Quick test
test_prompt = "This movie is absolutely"
inputs = reloaded_tokenizer(test_prompt, return_tensors="pt")
if torch.cuda.is_available():
    inputs = {k: v.cuda() for k, v in inputs.items()}

with torch.no_grad():
    outputs = reloaded_model.generate(
        **inputs,
        max_new_tokens=30,
        do_sample=True,
        temperature=0.7,
        pad_token_id=reloaded_tokenizer.eos_token_id
    )

generated_text = reloaded_tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"\nTest generation:")
print(f"Prompt: {test_prompt}")
print(f"Generated: {generated_text[len(test_prompt):].strip()}")
print("\n✅ Saved model functions normally!")

### 6.3 Production Deployment Guide

In [None]:
# Create deployment configuration file
deployment_config = {
    "model_info": {
        "base_model": base_model_name,
        "peft_method": "IA3",
        "merged": True,
        "total_parameters": merged_model.num_parameters(),
        "model_size_mb": total_size / 1024**2
    },
    "performance_metrics": {
        "inference_time_seconds": post_merge_time,
        "memory_usage_mb": merged_model.get_memory_footprint() / 1024**2,
        "speed_improvement_percent": time_improvement
    },
    "deployment_requirements": {
        "python_packages": [
            "torch>=1.9.0",
            "transformers>=4.20.0",
            "tokenizers>=0.12.0"
        ],
        "minimum_gpu_memory_gb": 4,
        "recommended_gpu_memory_gb": 8
    },
    "usage_example": {
        "load_command": f"AutoModelForCausalLM.from_pretrained('{merged_model_path}')",
        "inference_note": "No additional adapter loading required - model is fully merged"
    }
}

# Save configuration file
config_path = os.path.join(merged_model_path, "deployment_config.json")
with open(config_path, 'w', encoding='utf-8') as f:
    json.dump(deployment_config, f, indent=2, ensure_ascii=False)

print("=== Production Deployment Guide ===")
print(json.dumps(deployment_config, indent=2, ensure_ascii=False))
print(f"\nConfiguration file saved to: {config_path}")
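
On the consuming side, a service can read this file back and refuse to start if the model was not merged. The following is a minimal sketch; `load_deployment_config` is a hypothetical helper, and the demo writes a trimmed-down config to a temporary directory rather than touching the real model folder.

```python
import json
import os
import tempfile

def load_deployment_config(model_dir):
    """Read deployment_config.json from model_dir and check the merged flag."""
    path = os.path.join(model_dir, "deployment_config.json")
    with open(path, encoding="utf-8") as f:
        config = json.load(f)
    if not config["model_info"]["merged"]:
        raise ValueError("adapter is not merged; load it with PEFT instead")
    return config

# Round-trip demo in a temporary directory with a trimmed-down config
with tempfile.TemporaryDirectory() as tmp:
    sample = {"model_info": {"base_model": "gpt2", "peft_method": "IA3", "merged": True}}
    with open(os.path.join(tmp, "deployment_config.json"), "w", encoding="utf-8") as f:
        json.dump(sample, f, indent=2)
    loaded = load_deployment_config(tmp)

print(loaded["model_info"]["peft_method"])
```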

## 7. IA³ vs Other PEFT Methods Merging Comparison

### 7.1 Merging Feasibility Analysis

In [None]:
# PEFT method merging capability comparison table
peft_comparison = {
    "LoRA": {
        "can_merge": True,
        "merge_principle": "Low-rank update W + B·A",
        "inference_overhead": "Zero (after merge)",
        "deployment_complexity": "Simple"
    },
    "IA3": {
        "can_merge": True,
        "merge_principle": "Scaling vector fusion",
        "inference_overhead": "Zero (after merge)",
        "deployment_complexity": "Simple"
    },
    "AdapterLayers": {
        "can_merge": False,
        "merge_principle": "Non-mergeable (new modules)",
        "inference_overhead": "Present (bottleneck layers)",
        "deployment_complexity": "Medium"
    },
    "Prompt_Tuning": {
        "can_merge": False,
        "merge_principle": "Non-mergeable (input modification)",
        "inference_overhead": "Present (soft prompts)",
        "deployment_complexity": "Medium"
    },
    "Prefix_Tuning": {
        "can_merge": False,
        "merge_principle": "Non-mergeable (attention modification)",
        "inference_overhead": "Present (prefix injection)",
        "deployment_complexity": "Complex"
    },
    "BitFit": {
        "can_merge": True,
        "merge_principle": "Not needed (biases trained in place)",
        "inference_overhead": "Zero (native by design)",
        "deployment_complexity": "Simple"
    }
}

print("=== PEFT Method Merging Capability Comparison ===")
# Column widths are sized to fit the longest entry in each column
print(f"{'Method':<15} {'Mergeable':<10} {'Merge Principle':<40} {'Inference OH':<28} {'Deploy Complexity':<17}")
print("-" * 114)

for method, info in peft_comparison.items():
    can_merge = "✅" if info["can_merge"] else "❌"
    print(f"{method:<15} {can_merge:<10} {info['merge_principle']:<40} {info['inference_overhead']:<28} {info['deployment_complexity']:<17}")

print("\n=== IA³ Unique Advantages ===")
print("✅ Ultimate parameter efficiency: Roughly 0.01-0.1% of parameters are trained, depending on the model")
print("✅ Fully mergeable: Scaling vectors can be mathematically fused")
print("✅ Zero inference overhead: Same as original model after merge")
print("✅ Deployment friendly: Single model file, no extra dependencies")
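
The two mergeable methods in the table fold into the base weight in different ways: LoRA adds a low-rank matrix, while IA³ rescales the existing rows. The sketch below contrasts them on random tensors. It is illustrative only: it omits LoRA's scaling factor, and all names are chosen for this example.

```python
import torch

torch.manual_seed(0)
d, r = 8, 2
W = torch.randn(d, d, dtype=torch.float64)
x = torch.randn(d, dtype=torch.float64)

# LoRA: additive low-rank update, merged as W + B @ A
B = torch.randn(d, r, dtype=torch.float64)
A = torch.randn(r, d, dtype=torch.float64)
lora_unmerged = W @ x + B @ (A @ x)
lora_merged = (W + B @ A) @ x

# IA³: multiplicative rescaling, merged as diag(l) @ W
l = torch.rand(d, dtype=torch.float64) + 0.5
ia3_unmerged = l * (W @ x)
ia3_merged = (l[:, None] * W) @ x

print(torch.allclose(lora_unmerged, lora_merged))
print(torch.allclose(ia3_unmerged, ia3_merged))
print(f"extra trainable values: LoRA {A.numel() + B.numel()}, IA³ {l.numel()}")
```

In both cases the merged layer is an ordinary weight matrix of the original shape, which is why neither method adds inference cost after merging.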

## 8. Best Practices and Recommendations

### 8.1 When to Choose IA³

In [None]:
print("=== IA³ Usage Scenario Recommendations ===")
print("")
print("🎯 Most Suitable Scenarios:")
print("  • Extremely resource-constrained environments (edge computing, mobile devices)")
print("  • Need for rapid prototyping and experimentation")
print("  • Inference speed-sensitive production environments")
print("  • Need to deploy multiple task models simultaneously")
print("  • Strict model size limitations")
print("")
print("⚠️  Scenarios Requiring Careful Consideration:")
print("  • Complex tasks requiring extensive parameter adjustment")
print("  • Domains very different from pre-training tasks")
print("  • Need to learn entirely new feature representations")
print("")
print("🔄 Combination Strategies with Other Methods:")
print("  • IA³ + LoRA: Balance efficiency and performance")
print("  • Start with IA³ for quick validation, then use LoRA for fine-tuning")
print("  • Use different PEFT methods for different layers")

### 8.2 Deployment Checklist

In [None]:
print("=== IA³ Deployment Checklist ===")
print("")
print("📋 Pre-merge Checks:")
print("  □ Confirm IA³ adapter training completed and saved")
print("  □ Validate adapter performance on validation set")
print("  □ Record pre-merge baseline performance")
print("")
print("🔄 Merge Process:")
print("  □ Use merge_and_unload() method")
print("  □ Check merged model parameter count unchanged")
print("  □ Verify merged model functions normally")
print("")
print("✅ Post-deployment Verification:")
print("  □ Inference speed testing")
print("  □ Memory usage monitoring")
print("  □ Output quality sampling checks")
print("  □ Long-term stability testing")
print("")
print("📚 Documentation:")
print("  □ Save deployment configuration file")
print("  □ Record performance benchmark data")
print("  □ Prepare rollback plan")
print("  □ Update API documentation")

## 9. Summary

Through this experiment, we explored one of IA³'s core advantages: **model merging and zero-overhead deployment**.

### 9.1 Key Learning Outcomes

1. **Theoretical Understanding**: Mastered the mathematical merging principles of IA³ scaling vectors
2. **Practical Experience**: Completed the full pipeline of merging, verification, and deployment
3. **Performance Analysis**: Quantified inference efficiency improvements from merging
4. **Best Practices**: Learned when to choose IA³ and how to deploy correctly

### 9.2 IA³'s Unique Value

- **Ultimate Efficiency**: Minimal parameter modifications for maximum adaptation capability
- **Exact Merging**: Scaling vectors fold into the base model weights exactly, up to floating-point rounding
- **Deployment Friendly**: Single model file, zero inference overhead
- **Theoretical Elegance**: Simple yet powerful feature re-weighting strategy

IA³ perfectly embodies the core philosophy of PEFT: **"Moving mountains with minimal effort"**, proving that in appropriate scenarios, the simplest methods are often the most effective.

In [None]:
# Clean up resources
print("=== Cleaning Up Experiment Resources ===")
del merged_model, reloaded_model
gc.collect()
if torch.cuda.is_available():
    torch.cuda.empty_cache()
print("✅ Resource cleanup completed")

print("\n🎉 Lab 5 - IA³ Model Merging and Deployment Experiment Completed!")
print(f"Merged model saved to: {merged_model_path}")
print("You can now deploy this merged model directly and enjoy zero inference overhead!")