# Lab 2: Adapter Layers - Deployment and Production Best Practices

## 🎯 Experiment Objectives

Unlike LoRA or IA³, **Adapter Layers cannot be fully merged into the base model**: they insert new modules (with a nonlinearity) into the network rather than reparameterizing existing weights. This notebook demonstrates production deployment strategies and best practices for Adapter-based models.

### Key Learning Points
- Understand why Adapters cannot be merged (architectural differences)
- Learn efficient deployment strategies for Adapter models
- Implement multi-task Adapter management systems
- Optimize inference performance with AdapterDrop techniques
- Master production-ready Adapter deployment workflows

---

## 1. Environment Setup and Dependency Check

In [None]:
# Import necessary libraries
import torch
import torch.nn as nn
import time
import gc
from transformers import (
    AutoTokenizer, 
    AutoModelForSequenceClassification,
    pipeline
)
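# NOTE: to our knowledge, upstream Hugging Face PEFT releases do not export an
# `AdapterConfig` class. This lab assumes a PEFT build or fork that provides it;
# classic bottleneck adapters are otherwise available in the separate
# `adapters` (AdapterHub) library.
from peft import (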
    PeftModel, 
    AdapterConfig,
    get_peft_model,
    TaskType
)
import os
import json
from collections import defaultdict

# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Display library versions
import transformers
import peft
print(f"Transformers version: {transformers.__version__}")
print(f"PEFT version: {peft.__version__}")

## 2. Why Adapters Cannot Be Merged

### 2.1 Architectural Differences

**LoRA (Mergeable)**:
```python
# LoRA: Reparameterization of existing weights
W_new = W_original + B @ A  # Simple matrix addition
```

**Adapter (Non-Mergeable)**:
```python
# Adapter: a new bottleneck module applied to the layer output
h = layer(x)
output = h + up_project(activation(down_project(h)))  # Nonlinear path, cannot fold into W
```
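
The contrast between the two snippets above can be checked numerically. The following is a minimal pure-Python sketch on a toy 2-dimensional layer; the matrices and helper names (`matvec`, `matmul`, `add`, `adapter`) are illustrative assumptions, not part of any library. It shows that a LoRA update folds into `W` exactly, while a ReLU bottleneck adapter is not a linear map and so cannot be expressed as a single weight matrix.

```python
# Toy check: LoRA folds into W, a ReLU bottleneck adapter does not.
# All values and helper names here are illustrative assumptions.

def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def matmul(a, b):
    """Multiply matrix a (n x k) by matrix b (k x m)."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def add(u, v):
    """Element-wise sum of two vectors."""
    return [a + b for a, b in zip(u, v)]

W = [[1.0, 2.0], [0.0, 1.0]]   # frozen base weight
B = [[1.0], [2.0]]             # LoRA factor, 2 x 1
A = [[0.5, -1.0]]              # LoRA factor, 1 x 2
x = [3.0, -2.0]

# LoRA kept separate: W x + B (A x)
unmerged = add(matvec(W, x), matvec(B, matvec(A, x)))
# LoRA merged: (W + B A) x
W_merged = [add(w_row, ba_row) for w_row, ba_row in zip(W, matmul(B, A))]
merged = matvec(W_merged, x)
lora_matches = all(abs(a - b) < 1e-9 for a, b in zip(unmerged, merged))

D = [[1.0, -1.0]]              # adapter down-projection, 1 x 2
U = [[1.0], [1.0]]             # adapter up-projection, 2 x 1

def adapter(h):
    """Bottleneck adapter with skip connection: h + U relu(D h)."""
    z = [max(0.0, v) for v in matvec(D, h)]
    return add(h, matvec(U, z))

# Any linear map satisfies f(-h) == -f(h); the adapter violates this.
h = [1.0, 0.0]
neg_h = [-v for v in h]
adapter_is_linear = adapter(neg_h) == [-v for v in adapter(h)]

print("LoRA merged output equals unmerged output:", lora_matches)
print("Adapter behaves like a linear map:", adapter_is_linear)
```

Because the adapter fails the linearity test, no single matrix `W'` reproduces it for all inputs, which is the structural reason it stays a separate module at inference time.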

### 2.2 Fundamental Differences

| Aspect | LoRA/IA³ | Adapter Layers |
|:---|:---|:---|
| **Method** | Parameter reparameterization | New module insertion |
| **Computation** | Same forward pass | Extra modules in the forward pass |
| **Merging** | ✅ Fully mergeable | ❌ Structurally non-mergeable |
| **Inference** | Zero overhead after merge | Always has overhead |
| **Multi-task** | Requires model switching once merged | Supports parallel adapters |

In [None]:
# Visualize Adapter architecture
class AdapterModule(nn.Module):
    """
    Simplified Adapter module to demonstrate non-mergeability
    """
    def __init__(self, input_dim, reduction_factor=16):
        super().__init__()
        bottleneck_dim = input_dim // reduction_factor
        
        # This is a NEW computation path, not a weight modification
        self.down_project = nn.Linear(input_dim, bottleneck_dim)
        self.activation = nn.ReLU()
        self.up_project = nn.Linear(bottleneck_dim, input_dim)
        
    def forward(self, x):
        # Skip connection: output = input + adapter(input)
        return x + self.up_project(self.activation(self.down_project(x)))

# Demonstrate why merging is impossible
print("=== Adapter Architecture Analysis ===")
adapter = AdapterModule(768, reduction_factor=16)
print(f"Adapter has {sum(p.numel() for p in adapter.parameters()):,} parameters")
print("\nThis is a NEW module with its own forward pass.")
print("It cannot be 'folded' into existing linear layers like LoRA.")
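
The parameter count printed above can also be derived by hand. This short sketch reproduces the arithmetic for the `AdapterModule` defined in this cell (two `nn.Linear` layers with biases); the function name `adapter_param_count` is a hypothetical helper introduced here for illustration.

```python
def adapter_param_count(input_dim, reduction_factor=16):
    """Parameters in a down/up bottleneck adapter with biases on both layers."""
    bottleneck_dim = input_dim // reduction_factor
    down = input_dim * bottleneck_dim + bottleneck_dim   # weight + bias
    up = bottleneck_dim * input_dim + input_dim          # weight + bias
    return down + up

per_adapter = adapter_param_count(768, reduction_factor=16)
print(f"Bottleneck width: {768 // 16}")
print(f"Parameters per adapter: {per_adapter:,}")
```

For `input_dim=768` and `reduction_factor=16` the bottleneck is 48 wide, giving 36,912 down-projection parameters plus 37,632 up-projection parameters.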

## 3. Load Trained Adapter Model

In [None]:
# Model and adapter paths
model_checkpoint = "bert-base-uncased"
adapter_path = "./bert-adapters-mrpc"

# Load base model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)
base_model = AutoModelForSequenceClassification.from_pretrained(
    model_checkpoint,
    num_labels=2
)

print(f"Base model parameters: {base_model.num_parameters():,}")
print(f"Base model memory: {base_model.get_memory_footprint() / 1024**2:.2f} MB")

In [None]:
# Load adapter (if a trained checkpoint exists)
latest_checkpoint = None
if os.path.isdir(adapter_path):
    checkpoints = [
        os.path.join(adapter_path, d)
        for d in os.listdir(adapter_path)
        if d.startswith("checkpoint-")
    ]
    if checkpoints:
        # Pick the most recently modified checkpoint directory
        latest_checkpoint = max(checkpoints, key=os.path.getmtime)

if latest_checkpoint is not None:
    peft_model = PeftModel.from_pretrained(base_model, latest_checkpoint)
    print(f"✅ Loaded adapter from: {latest_checkpoint}")
else:
    print("⚠️ No trained checkpoint found, creating demo adapter...")
    adapter_config = AdapterConfig(
        task_type=TaskType.SEQ_CLS,
        r=16
    )
    peft_model = get_peft_model(base_model, adapter_config)

# Display trainable parameters
peft_model.print_trainable_parameters()

# Move to device
peft_model.to(device)
peft_model.eval()
print(f"\nAdapter model memory: {peft_model.get_memory_footprint() / 1024**2:.2f} MB")

## 4. Deployment Strategy 1: Standard Adapter Deployment

### 4.1 Save Adapter Model for Production

In [None]:
# Save adapter for production deployment
production_path = "./bert-adapters-mrpc-production"

print(f"=== Saving production-ready adapter to: {production_path} ===")

# Save adapter weights (only adapter parameters, not base model)
peft_model.save_pretrained(production_path)
tokenizer.save_pretrained(production_path)

print("✅ Adapter saved!")

# Check saved files
saved_files = os.listdir(production_path)
print(f"\nSaved files: {saved_files}")

# Calculate on-disk size of the adapter weights only
# (tokenizer and config files in the same directory are excluded)
adapter_size = 0
for file in saved_files:
    file_path = os.path.join(production_path, file)
    if os.path.isfile(file_path) and file.endswith(('.bin', '.safetensors')):
        size = os.path.getsize(file_path)
        adapter_size += size
        print(f"  {file}: {size / 1024**2:.2f} MB")

print(f"\nTotal adapter size: {adapter_size / 1024**2:.2f} MB")
# Compares adapter bytes on disk with the in-memory footprint of the full model
print(f"Size reduction vs full model: {(1 - adapter_size / base_model.get_memory_footprint()) * 100:.1f}%")

### 4.2 Load Adapter in Production Environment

In [None]:
# Simulate production loading
print("=== Simulating Production Environment Loading ===")

# Step 1: Load base model
prod_base_model = AutoModelForSequenceClassification.from_pretrained(
    model_checkpoint,
    num_labels=2
)
print("✅ Base model loaded")

# Step 2: Load adapter on top
prod_model = PeftModel.from_pretrained(prod_base_model, production_path)
prod_tokenizer = AutoTokenizer.from_pretrained(production_path)
print("✅ Adapter loaded")

prod_model.to(device)
prod_model.eval()

print(f"\nProduction model ready!")
print(f"Total parameters: {prod_model.num_parameters():,}")
print(f"Memory usage: {prod_model.get_memory_footprint() / 1024**2:.2f} MB")

### 4.3 Inference Performance Benchmark

In [None]:
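def benchmark_inference(model, tokenizer, test_samples, num_runs=10):
    """
    Benchmark inference performance.

    Returns (avg_time, throughput): the average wall-clock seconds for one pass
    over test_samples (one forward call per pair) and samples per second.
    """
    model.eval()
    total_time = 0.0

    def run_pass():
        for sentence1, sentence2 in test_samples:
            inputs = tokenizer(
                sentence1, sentence2,
                return_tensors="pt",
                truncation=True,
                padding=True
            )
            inputs = {k: v.to(device) for k, v in inputs.items()}
            model(**inputs)
        # CUDA kernels run asynchronously; wait for them before reading the clock
        if device.type == "cuda":
            torch.cuda.synchronize()

    with torch.no_grad():
        run_pass()  # Warm-up pass, excluded from timing
        for _ in range(num_runs):
            start_time = time.perf_counter()
            run_pass()
            total_time += time.perf_counter() - start_time

    avg_time = total_time / num_runs
    throughput = len(test_samples) / avg_time

    return avg_time, throughput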

# Test samples
test_samples = [
    ("The company said the merger was subject to approval.", 
     "The company said the deal was subject to approval."),
    ("The cat sat on the mat.", 
     "The dog played in the garden."),
    ("I love this movie.", 
     "This film is amazing."),
    ("The weather is terrible today.", 
     "It's raining heavily outside.")
]

print("=== Adapter Model Inference Performance ===")
avg_time, throughput = benchmark_inference(prod_model, prod_tokenizer, test_samples)
print(f"Average time per pass ({len(test_samples)} samples): {avg_time:.4f} seconds")
print(f"Throughput: {throughput:.2f} samples/sec")
print(f"Latency per sample: {avg_time / len(test_samples) * 1000:.2f} ms")
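
Average latency hides tail behavior, which usually matters more in production. The sketch below is one way to collect mean, median, and an approximate 95th-percentile latency for any zero-argument callable using only the standard library. The helper name `time_callable` and the nearest-rank percentile are assumptions made for illustration; in this lab the callable would wrap a single forward call.

```python
import statistics
import time

def time_callable(fn, num_runs=20, warmup=3):
    """Time fn() num_runs times after warm-up; return latency stats in ms."""
    for _ in range(warmup):
        fn()  # Warm-up calls are not recorded
    samples_ms = []
    for _ in range(num_runs):
        start = time.perf_counter()
        fn()
        samples_ms.append((time.perf_counter() - start) * 1000.0)
    samples_ms.sort()
    # Nearest-rank style index for the 95th percentile
    p95_index = min(len(samples_ms) - 1, round(0.95 * (len(samples_ms) - 1)))
    return {
        "runs": num_runs,
        "mean_ms": statistics.mean(samples_ms),
        "median_ms": statistics.median(samples_ms),
        "p95_ms": samples_ms[p95_index],
    }

# Stand-in workload so the sketch runs without a model
stats = time_callable(lambda: sum(range(10_000)))
print({k: round(v, 4) for k, v in stats.items()})
```

When timing GPU inference, the callable should synchronize the device before returning, otherwise the measurement captures only kernel launch time.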

## 5. Deployment Strategy 2: Multi-Task Adapter System

### 5.1 Multi-Task Adapter Manager Implementation

In [None]:
class MultiTaskAdapterManager:
    """
    Advanced multi-task adapter management system
    Supports dynamic task switching and efficient memory management
    """
    
    def __init__(self, base_model, max_memory_mb=2000):
        self.base_model = base_model
        self.max_memory = max_memory_mb
        self.active_adapters = {}
        self.adapter_cache = {}
        self.usage_stats = defaultdict(int)
        
    def register_adapter(self, task_name, adapter_path, priority=1.0):
        """
        Register a new task adapter
        """
        # Check memory constraints
        current_memory = self._estimate_memory_usage()
        
        if current_memory > self.max_memory:
            self._evict_adapters(priority)
        
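        # Load adapter. Note: every call wraps the same shared base model; for
        # real multi-adapter serving, a single PeftModel with load_adapter() /
        # set_adapter() avoids re-wrapping it.
        adapter_model = PeftModel.from_pretrained(self.base_model, adapter_path)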
        
        self.active_adapters[task_name] = {
            'model': adapter_model,
            'priority': priority,
            'last_used': time.time(),
            'path': adapter_path
        }
        
        print(f"✅ Registered adapter for task: {task_name}")
        
    def switch_task(self, task_name):
        """
        Switch to a specific task
        """
        if task_name not in self.active_adapters:
            raise ValueError(f"Task '{task_name}' not registered")
        
        # Update usage statistics
        self.active_adapters[task_name]['last_used'] = time.time()
        self.usage_stats[task_name] += 1
        
        return self.active_adapters[task_name]['model']
    
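    def _estimate_memory_usage(self):
        """
        Estimate current memory usage in MB.
        This is an upper bound: each entry's footprint includes the shared
        base model weights, so they are counted once per adapter.
        """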
        total = 0
        for task_info in self.active_adapters.values():
            total += task_info['model'].get_memory_footprint() / 1024**2
        return total
    
    def _evict_adapters(self, required_priority):
        """
        Evict low-priority adapters to free memory
        """
        now = time.time()
        candidates = []
        
        for task_name, info in self.active_adapters.items():
            if info['priority'] < required_priority:
                # Score: priority discounted by idle time, so low-priority
                # adapters that have been idle longest score lowest
                idle_seconds = now - info['last_used']
                score = info['priority'] / (1.0 + idle_seconds)
                candidates.append((task_name, score))
        
        # Sort by score (lower = evict first)
        candidates.sort(key=lambda x: x[1])
        
        # Evict until memory constraint satisfied
        for task_name, _ in candidates:
            del self.active_adapters[task_name]
            print(f"🗑️ Evicted adapter: {task_name}")
            
            if self._estimate_memory_usage() <= self.max_memory * 0.8:
                break
    
    def get_statistics(self):
        """
        Get usage statistics
        """
        stats = {
            'active_tasks': len(self.active_adapters),
            'total_memory_mb': self._estimate_memory_usage(),
            'usage_counts': dict(self.usage_stats)
        }
        return stats

print("✅ MultiTaskAdapterManager class defined")

In [None]:
# Demonstrate multi-task adapter management
print("=== Multi-Task Adapter System Demo ===")

# Initialize manager
manager = MultiTaskAdapterManager(base_model, max_memory_mb=2000)

# Register task (using the same adapter for demo)
if os.path.exists(production_path):
    manager.register_adapter("paraphrase_detection", production_path, priority=1.0)
    
    # Get statistics
    stats = manager.get_statistics()
    print(f"\nActive tasks: {stats['active_tasks']}")
    print(f"Total memory: {stats['total_memory_mb']:.2f} MB")
    
    # Switch task
    task_model = manager.switch_task("paraphrase_detection")
    print("\n✅ Task switched successfully")
else:
    print("⚠️ No production adapter found, skipping demo")
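
The manager above creates one wrapped model per task. An alternative is to keep a single PEFT model and attach several named adapters to it with `load_adapter` and `set_adapter`, which `PeftModel` provides. The sketch below shows that pattern under stated assumptions: `AdapterRegistry` is a hypothetical helper, `_StubModel` is a stand-in so the code runs without downloading weights, and the `./hypothetical-sentiment-adapter` path is invented for illustration. Whether a given adapter type supports multiple named adapters should be checked against the installed PEFT version.

```python
import time

class AdapterRegistry:
    """Tracks named adapters attached to one PEFT-style model (hypothetical helper)."""

    def __init__(self, peft_like_model):
        # The model is expected to expose load_adapter(path, adapter_name=...)
        # and set_adapter(name), as PeftModel does.
        self.model = peft_like_model
        self.last_used = {}

    def register(self, task_name, adapter_path):
        """Attach adapter weights under task_name without re-wrapping the base model."""
        self.model.load_adapter(adapter_path, adapter_name=task_name)
        self.last_used[task_name] = time.monotonic()

    def activate(self, task_name):
        """Make task_name the active adapter and return the shared model."""
        if task_name not in self.last_used:
            raise ValueError(f"Task '{task_name}' not registered")
        self.model.set_adapter(task_name)
        self.last_used[task_name] = time.monotonic()
        return self.model


class _StubModel:
    """Stand-in that records calls; replace with a real PeftModel in practice."""

    def __init__(self):
        self.loaded = {}
        self.active = None

    def load_adapter(self, model_id, adapter_name):
        self.loaded[adapter_name] = model_id

    def set_adapter(self, adapter_name):
        self.active = adapter_name


stub = _StubModel()
registry = AdapterRegistry(stub)
registry.register("paraphrase_detection", "./bert-adapters-mrpc-production")
registry.register("sentiment", "./hypothetical-sentiment-adapter")  # invented path
registry.activate("sentiment")
print(f"Loaded adapters: {sorted(stub.loaded)}; active: {stub.active}")
```

With this layout the base weights live in memory once, and switching tasks only changes which adapter is active.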

## 6. Deployment Strategy 3: AdapterDrop Optimization

### 6.1 Implement Dynamic Adapter Pruning

In [None]:
class AdapterDropOptimizer:
    """
    Implements AdapterDrop technique for inference speedup
    Dynamically removes adapters from selected layers during inference
    """
    
    def __init__(self, model, drop_ratio=0.5, preserve_last_n=3):
        """
        Args:
            model: PEFT model with adapters
            drop_ratio: Fraction of adapters to drop (0.0-1.0)
            preserve_last_n: Always keep last N layers' adapters
        """
        self.model = model
        self.drop_ratio = drop_ratio
        self.preserve_last_n = preserve_last_n
        self.original_state = None
        
    def compute_layer_importance(self, validation_data=None):
        """
        Compute importance scores for each adapter layer
        (Simplified: using layer depth as proxy)
        """
        # Get all adapter modules
        adapter_modules = []
        for name, module in self.model.named_modules():
            if 'adapter' in name.lower():
                adapter_modules.append((name, module))
        
        # Assign importance (deeper layers = more important)
        num_layers = len(adapter_modules)
        importance_scores = {}
        
        for idx, (name, module) in enumerate(adapter_modules):
            # Importance increases with depth
            importance_scores[name] = idx / max(num_layers - 1, 1)
        
        return importance_scores
    
    def apply_drop(self):
        """
        Apply adapter dropping based on importance
        """
        importance = self.compute_layer_importance()
        
        # Sort by importance (ascending)
        sorted_adapters = sorted(importance.items(), key=lambda x: x[1])
        
        # Determine which to drop
        num_to_drop = int(len(sorted_adapters) * self.drop_ratio)
        
        # Always preserve last N layers
        droppable = sorted_adapters[:-self.preserve_last_n] if self.preserve_last_n > 0 else sorted_adapters
        to_drop = droppable[:num_to_drop]
        
        dropped_count = 0
        for name, _ in to_drop:
            # Placeholder: actually deactivating an adapter depends on the PEFT
            # version, so this demo only counts the adapters selected for dropping
            dropped_count += 1
        
        print(f"✅ Selected {dropped_count} adapters to drop (ratio: {self.drop_ratio})")
        print(f"   Preserved {len(sorted_adapters) - dropped_count} adapters")
        
        return dropped_count
    
    def estimate_speedup(self, dropped_count, total_count):
        """
        Estimate inference speedup from dropping adapters
        """
        # Assumption for illustration: each adapter adds ~3% latency overhead;
        # measure on your own hardware before relying on this figure
        overhead_per_adapter = 0.03
        
        original_overhead = total_count * overhead_per_adapter
        new_overhead = (total_count - dropped_count) * overhead_per_adapter
        
        speedup = (1 + original_overhead) / (1 + new_overhead)
        return speedup

print("✅ AdapterDropOptimizer class defined")

In [None]:
# Demonstrate AdapterDrop optimization
print("=== AdapterDrop Optimization Demo ===")

optimizer = AdapterDropOptimizer(
    peft_model,
    drop_ratio=0.5,  # Drop 50% of adapters
    preserve_last_n=3  # Keep last 3 layers
)

# Compute importance
importance = optimizer.compute_layer_importance()
print(f"\nFound {len(importance)} adapter modules")

# Apply drop
dropped = optimizer.apply_drop()

# Estimate speedup
speedup = optimizer.estimate_speedup(dropped, len(importance))
print(f"\n📊 Estimated speedup: {speedup:.2f}x")
print(f"   Expected latency reduction: {(1 - 1/speedup) * 100:.1f}%")
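
The selection rule and speedup estimate used in this section can be worked through without a model. The following sketch mirrors the lab's logic on layer indices: drop adapters from the lowest layers first, never touch the last `preserve_last_n` layers, then apply the same overhead formula. The function names and the 12-layer, 3%-per-adapter figures are assumptions for illustration, not measurements.

```python
def select_layers_to_drop(num_layers, drop_ratio=0.5, preserve_last_n=3):
    """Return indices of the lowest layers whose adapters would be skipped."""
    if not 0.0 <= drop_ratio <= 1.0:
        raise ValueError("drop_ratio must be between 0.0 and 1.0")
    max_droppable = max(num_layers - preserve_last_n, 0)
    num_to_drop = min(int(num_layers * drop_ratio), max_droppable)
    return list(range(num_to_drop))

def estimated_speedup(total_count, dropped_count, overhead_per_adapter=0.03):
    """Ratio of modeled latency before and after dropping adapters."""
    original = 1 + total_count * overhead_per_adapter
    remaining = 1 + (total_count - dropped_count) * overhead_per_adapter
    return original / remaining

dropped_layers = select_layers_to_drop(num_layers=12, drop_ratio=0.5, preserve_last_n=3)
speedup = estimated_speedup(12, len(dropped_layers))
print(f"Layers with adapters dropped: {dropped_layers}")
print(f"Modeled speedup: {speedup:.3f}x")
```

With 12 adapters and 6 dropped, the model gives (1 + 0.36) / (1 + 0.18), roughly a 1.15x speedup. This is only as good as the assumed per-adapter overhead.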

## 7. Production Deployment Checklist

### 7.1 Pre-Deployment Verification

In [None]:
def production_deployment_checklist(model, tokenizer, test_samples):
    """
    Comprehensive pre-deployment verification
    """
    print("=== Production Deployment Checklist ===")
    checklist = {}
    
    # 1. Model loading
    checklist['model_loaded'] = '✅' if model is not None else '❌'
    
    # 2. Tokenizer compatibility
    try:
        test_input = tokenizer("test", return_tensors="pt")
        checklist['tokenizer_compatible'] = '✅'
    except Exception:
        checklist['tokenizer_compatible'] = '❌'
    
    # 3. Inference functionality
    try:
        model.eval()
        with torch.no_grad():
            inputs = tokenizer("test sentence", "another sentence", 
                             return_tensors="pt", padding=True, truncation=True)
            inputs = {k: v.to(device) for k, v in inputs.items()}
            outputs = model(**inputs)
        checklist['inference_works'] = '✅'
    except Exception as e:
        checklist['inference_works'] = f'❌ {str(e)}'
    
    # 4. Performance benchmarking
    try:
        avg_time, throughput = benchmark_inference(model, tokenizer, test_samples, num_runs=3)
        checklist['performance_tested'] = f'✅ ({throughput:.1f} samples/sec)'
    except Exception:
        checklist['performance_tested'] = '❌'
    
    # 5. Memory usage
    try:
        memory_mb = model.get_memory_footprint() / 1024**2
        checklist['memory_acceptable'] = f'✅ ({memory_mb:.1f} MB)'
    except Exception:
        checklist['memory_acceptable'] = '❌'
    
    # Display checklist
    print("\n📋 Checklist Results:")
    for item, status in checklist.items():
        print(f"  {item.replace('_', ' ').title()}: {status}")
    
    # Overall status
    all_passed = all('✅' in str(v) for v in checklist.values())
    print(f"\n{'🎉 Ready for production!' if all_passed else '⚠️ Issues detected, please review'}")
    
    return checklist

# Run checklist
if os.path.exists(production_path):
    checklist_results = production_deployment_checklist(prod_model, prod_tokenizer, test_samples)
else:
    print("⚠️ No production model available for checklist")

### 7.2 Create Deployment Configuration

In [None]:
# Create comprehensive deployment configuration
deployment_config = {
    "model_info": {
        "base_model": model_checkpoint,
        "peft_method": "Adapter Layers",
        "task_type": "Sequence Classification",
        "num_labels": 2,
        "mergeable": False,
        "total_parameters": prod_model.num_parameters() if os.path.exists(production_path) else 0,
    },
    "adapter_config": {
        "reduction_factor": 16,
        "adapter_locations": "FFN layers",
        "trainable_params_percentage": 0.74
    },
    "deployment_requirements": {
        "python_packages": [
            "torch>=1.9.0",
            "transformers>=4.20.0",
            "peft>=0.3.0"
        ],
        "minimum_memory_gb": 4,
        "recommended_memory_gb": 8,
        "gpu_required": False,
        "cpu_cores_recommended": 4
    },
    "deployment_notes": {
        "merging": "Adapters CANNOT be merged - must deploy with PEFT library",
        "inference_overhead": "~2-8% latency overhead per adapter module",
        "optimization": "Consider AdapterDrop for production speedup",
        "multi_task": "Supports efficient multi-task deployment with adapter switching"
    },
    "loading_instructions": {
        "step1": "Load base model: AutoModelForSequenceClassification.from_pretrained()",
        "step2": "Load adapter: PeftModel.from_pretrained(base_model, adapter_path)",
        "step3": "Set to eval mode: model.eval()",
        "step4": "Perform inference with standard transformers API"
    }
}

# Save configuration
if os.path.exists(production_path):
    config_file = os.path.join(production_path, "deployment_config.json")
    with open(config_file, 'w', encoding='utf-8') as f:
        json.dump(deployment_config, f, indent=2, ensure_ascii=False)
    
    print("=== Deployment Configuration ===")
    print(json.dumps(deployment_config, indent=2, ensure_ascii=False))
    print(f"\n✅ Configuration saved to: {config_file}")
else:
    print("⚠️ Skipping config save - no production path")
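
Before a service consumes `deployment_config.json`, it helps to check that the fields it relies on are present. The sketch below is one possible validator using only the standard library. `REQUIRED_SECTIONS` and `validate_deployment_config` are hypothetical names, and the set of required keys is an assumption chosen from the configuration shown above.

```python
import json

# Assumed minimum set of fields a loader would need (illustrative choice)
REQUIRED_SECTIONS = {
    "model_info": ["base_model", "peft_method", "num_labels", "mergeable"],
    "deployment_requirements": ["python_packages"],
    "loading_instructions": ["step1", "step2"],
}

def validate_deployment_config(config):
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    for section, keys in REQUIRED_SECTIONS.items():
        if section not in config:
            problems.append(f"missing section: {section}")
            continue
        for key in keys:
            if key not in config[section]:
                problems.append(f"missing key: {section}.{key}")
    return problems

example = json.loads("""
{
  "model_info": {
    "base_model": "bert-base-uncased",
    "peft_method": "Adapter Layers",
    "num_labels": 2,
    "mergeable": false
  },
  "deployment_requirements": {"python_packages": ["torch>=1.9.0"]},
  "loading_instructions": {"step1": "Load base model", "step2": "Load adapter"}
}
""")

problems = validate_deployment_config(example)
print("Config OK" if not problems else f"Config problems: {problems}")
```

Returning a list of problems instead of raising on the first one lets a deployment script report everything that is wrong in a single run.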

## 8. Adapter vs LoRA Deployment Comparison

### 8.1 Key Differences Summary

In [None]:
# Comprehensive comparison table
comparison = {
    "Aspect": [
        "Merging Capability",
        "Inference Overhead",
        "Deployment Complexity",
        "Multi-Task Support",
        "Parameter Efficiency",
        "Memory Overhead",
        "Production Dependencies",
        "Hot-Swapping"
    ],
    "Adapter Layers": [
        "❌ Cannot merge",
        "2-8% latency increase",
        "Medium (requires PEFT)",
        "✅ Excellent (parallel adapters)",
        "0.5-5% parameters",
        "50-200MB per adapter",
        "Requires PEFT library",
        "✅ Supported"
    ],
    "LoRA": [
        "✅ Fully mergeable",
        "Zero (after merge)",
        "Simple (standard model)",
        "⚠️ Requires model switching",
        "0.1-1% parameters",
        "Zero (after merge)",
        "None (after merge)",
        "❌ Not supported (after merge)"
    ]
}

print("=== Adapter vs LoRA Deployment Comparison ===")
print(f"\n{'Aspect':<25} {'Adapter Layers':<35} {'LoRA':<35}")
print("-" * 95)

for i, aspect in enumerate(comparison["Aspect"]):
    adapter_val = comparison["Adapter Layers"][i]
    lora_val = comparison["LoRA"][i]
    print(f"{aspect:<25} {adapter_val:<35} {lora_val:<35}")

print("\n=== Deployment Strategy Recommendations ===")
print("\n📊 Choose Adapter Layers when:")
print("  • Need multi-task deployment with task switching")
print("  • Want to preserve modular architecture")
print("  • A latency overhead of roughly 2-8% is acceptable")
print("  • Need hot-swapping capabilities")

print("\n📊 Choose LoRA when:")
print("  • Need zero-overhead inference")
print("  • Want simple deployment (no PEFT dependency)")
print("  • Single-task production deployment")
print("  • Maximum parameter efficiency required")

## 9. Summary and Best Practices

### 9.1 Key Takeaways

Through this experiment, we learned:

1. **Architectural Understanding**: Adapters are new modules, not reparameterizations
2. **Deployment Strategies**: Multiple approaches for production (standard, multi-task, optimized)
3. **Performance Trade-offs**: Small inference overhead vs deployment flexibility
4. **Production Readiness**: Comprehensive verification and configuration

### 9.2 Production Deployment Checklist

✅ **Pre-Deployment**:
- [ ] Verify adapter training completed successfully
- [ ] Test adapter loading and inference
- [ ] Benchmark performance on production hardware
- [ ] Document memory requirements

✅ **Deployment**:
- [ ] Save adapter with proper versioning
- [ ] Create deployment configuration file
- [ ] Set up monitoring for latency/throughput
- [ ] Implement error handling

✅ **Post-Deployment**:
- [ ] Monitor inference performance
- [ ] Collect user feedback
- [ ] Consider AdapterDrop if latency becomes an issue
- [ ] Plan for model updates

### 9.3 Advanced Optimization Techniques

- **AdapterDrop**: Skip adapters in the lower layers (this lab uses a 50% drop ratio) and validate on held-out data that the accuracy loss stays acceptable
- **Multi-Task Batching**: Group similar tasks for efficient processing
- **Adapter Compression**: Quantize adapter weights for smaller footprint
- **Dynamic Loading**: Load/unload adapters based on demand
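
To make the adapter compression idea concrete, here is a minimal sketch of symmetric per-tensor int8 quantization on a plain list of weights. It is a toy under stated assumptions: `quantize_int8` and `dequantize` are hypothetical helpers, the weight values are made up, and a real deployment would use a tested quantization library instead.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization to integers in [-127, 127]."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Map int8 codes back to approximate float weights."""
    return [q * scale for q in quantized]

weights = [0.02, -0.5, 0.127, 0.0, 0.254]   # made-up adapter weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print(f"Scale: {scale:.6f}")
print(f"Largest round-trip error: {max_error:.6f}")
```

Each weight is stored in one byte instead of four, and the round-trip error is bounded by half the scale. Whether that error is acceptable has to be confirmed by re-running the task evaluation.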

In [None]:
# Clean up resources
print("=== Cleaning Up Resources ===")
if 'peft_model' in locals():
    del peft_model
if 'prod_model' in locals():
    del prod_model
if 'base_model' in locals():
    del base_model
gc.collect()
if torch.cuda.is_available():
    torch.cuda.empty_cache()
print("✅ Resources cleaned up")

print("\n🎉 Lab 2 - Adapter Deployment Completed!")
print("\nKey Insight: While Adapters cannot be merged, they offer")
print("unique advantages for multi-task systems and modular deployment!")