# Lab 6: BitFit - Model Merging and Deployment

## 🎯 Experiment Objectives

BitFit (Bias-term Fine-tuning) trains only bias parameters, which are **already part of the base model architecture**. This means BitFit models are **inherently merged** - no separate merging step is needed!

### Key Learning Points
- Understand why BitFit doesn't require merging (bias parameters are native)
- Save and deploy BitFit models efficiently
- Compare BitFit deployment with other PEFT methods
- Implement production deployment best practices

---

## 1. Understanding BitFit Deployment

### 1.1 Why BitFit is "Pre-Merged"

| PEFT Method | Adds New Params? | Merge Step? | Reason |
|-------------|------------------|-------------|--------|
| **BitFit** | ❌ No | Not needed | Only trains existing bias parameters |
| **LoRA** | ✅ Yes | Optional | Adds A, B matrices that can be folded into the base weights |
| **Adapter** | ✅ Yes | Not possible | Adds new layers that cannot be folded into the base weights |
| **IA³** | ✅ Yes | Optional | Adds scaling vectors that can be folded into the base weights |

**Key Insight**: BitFit simply **unfreezes existing bias parameters** during training. After training, the model is a standard transformer with updated bias values - no adapter artifacts to remove!
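
To make the "unfreeze existing biases" idea concrete, here is a minimal sketch on a toy PyTorch module. The helper name `apply_bitfit` and the toy network are assumptions for illustration, not part of this lab's training code, and many BitFit setups also leave the task head trainable, which this sketch does not do.

```python
import torch.nn as nn


def apply_bitfit(model: nn.Module) -> nn.Module:
    """Freeze every parameter except those whose name ends in 'bias'."""
    for name, param in model.named_parameters():
        param.requires_grad = name.endswith("bias")
    return model


# Hypothetical stand-in for a transformer block: two linear layers and a LayerNorm
toy = nn.Sequential(nn.Linear(8, 16), nn.LayerNorm(16), nn.Linear(16, 2))
apply_bitfit(toy)

trainable = sum(p.numel() for p in toy.parameters() if p.requires_grad)
total = sum(p.numel() for p in toy.parameters())
print(f"trainable: {trainable} / {total}")
```

No module is added or wrapped here, so the trained model keeps exactly the same architecture and state-dict keys as the base model.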

### 1.2 Deployment Advantages

- ✅ **Zero inference overhead** (no adapter layers)
- ✅ **No PEFT library required** at inference
- ✅ **Tiny per-task delta** (only bias values differ from the base model; a full saved copy is still base-model size)
- ✅ **Standard transformers format** (easy deployment)

---

## 2. Load Trained BitFit Model

In [None]:
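from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch
import os
import time

# --- Load trained BitFit model ---
model_checkpoint = "bert-base-uncased"
output_dir = "./bitfit-bert-mrpc"

# Find the most recently written checkpoint directory
checkpoint_dirs = [
    os.path.join(output_dir, d)
    for d in os.listdir(output_dir)
    if d.startswith("checkpoint-")
]
if not checkpoint_dirs:
    raise FileNotFoundError(f"No checkpoint-* directories found in {output_dir}")
latest_checkpoint = max(checkpoint_dirs, key=os.path.getmtime)

# A checkpoint is a PEFT adapter only if it contains adapter_config.json.
# BitFit trained by unfreezing biases writes a full transformers checkpoint instead.
if os.path.exists(os.path.join(latest_checkpoint, "adapter_config.json")):
    from peft import PeftModel  # needed only for adapter-style checkpoints

    base_model = AutoModelForSequenceClassification.from_pretrained(
        model_checkpoint,
        num_labels=2
    )
    peft_model = PeftModel.from_pretrained(base_model, latest_checkpoint)
else:
    # Plain checkpoint: load it directly (variable name kept for the cells below)
    peft_model = AutoModelForSequenceClassification.from_pretrained(
        latest_checkpoint,
        num_labels=2
    )
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)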

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
peft_model.to(device)
peft_model.eval()

print("✅ BitFit model loaded successfully!")
print(f"Device: {device}")
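
Picking the latest checkpoint by modification time can go wrong if directories are copied or touched after training. A sketch of selecting by step number instead is below; it assumes the `checkpoint-<step>` directory naming used by the Hugging Face `Trainer`, and the helper name `latest_checkpoint_by_step` is hypothetical.

```python
import os
import re
import tempfile

_CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def latest_checkpoint_by_step(output_dir: str) -> str:
    """Return the checkpoint-<step> subdirectory with the highest step number."""
    candidates = []
    for entry in os.listdir(output_dir):
        match = _CHECKPOINT_RE.match(entry)
        path = os.path.join(output_dir, entry)
        if match and os.path.isdir(path):
            candidates.append((int(match.group(1)), path))
    if not candidates:
        raise FileNotFoundError(f"No checkpoint-* directories found in {output_dir!r}")
    # Tuples compare by step first, so max() picks the highest step
    return max(candidates)[1]


# Demo on a throwaway directory with made-up step numbers
with tempfile.TemporaryDirectory() as tmp:
    for step in (500, 1000, 90):
        os.makedirs(os.path.join(tmp, f"checkpoint-{step}"))
    demo_latest = os.path.basename(latest_checkpoint_by_step(tmp))

print(demo_latest)
```

Parsing the step as an integer matters: a plain string sort would rank `checkpoint-90` above `checkpoint-1000`.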

## 3. Analyze BitFit Parameter Efficiency

In [None]:
def analyze_bitfit_params(model):
    """
    Report the bias parameters, i.e. the parameters BitFit trains.

    requires_grad flags are not stored in checkpoints, so a reloaded model
    cannot tell us what was trainable. Parameters are selected by name instead.
    If the task head's weights were also trained, they are not counted here.
    """
    print("=== BitFit Parameter Analysis ===")

    total_params = 0
    bias_params = 0
    bias_param_names = []

    for name, param in model.named_parameters():
        total_params += param.numel()
        if name.endswith("bias"):
            bias_params += param.numel()
            bias_param_names.append(name)

    frozen_params = total_params - bias_params

    print("\nParameter Statistics:")
    print(f"  Total parameters: {total_params:,}")
    print(f"  Bias parameters (trained by BitFit): {bias_params:,}")
    print(f"  Other parameters (frozen during BitFit): {frozen_params:,}")
    print(f"  Training ratio: {bias_params/total_params*100:.4f}%")

    print("\nBias Parameters (first 10):")
    for name in bias_param_names[:10]:
        print(f"  - {name}")
    if len(bias_param_names) > 10:
        print(f"  ... and {len(bias_param_names)-10} more")

    return bias_params, total_params

trainable, total = analyze_bitfit_params(peft_model)
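
As a cross-check on the ratio printed above, the bias count can be derived by hand from the published `bert-base-uncased` dimensions (hidden size 768, 12 layers, intermediate size 3072, vocabulary 30522). The sketch below is plain arithmetic under those assumptions plus a 2-label classification head; the helper names are made up for illustration.

```python
# Published bert-base-uncased dimensions, plus the 2-label head used in this lab
HIDDEN, LAYERS, INTERMEDIATE = 768, 12, 3072
VOCAB, MAX_POS, TYPE_VOCAB, NUM_LABELS = 30522, 512, 2, 2


def linear(n_in, n_out):
    """(weight count, bias count) of a dense layer."""
    return n_in * n_out, n_out


def layer_norm(n):
    """(scale count, bias count) of a LayerNorm."""
    return n, n


def bert_base_param_counts():
    """Return (bias parameters, total parameters) for the classification model."""
    parts = []
    # Embedding tables have no bias; they are followed by one LayerNorm
    parts.append(((VOCAB + MAX_POS + TYPE_VOCAB) * HIDDEN, 0))
    parts.append(layer_norm(HIDDEN))
    for _ in range(LAYERS):
        parts += [linear(HIDDEN, HIDDEN)] * 4       # query, key, value, attention output
        parts.append(layer_norm(HIDDEN))
        parts.append(linear(HIDDEN, INTERMEDIATE))  # feed-forward up-projection
        parts.append(linear(INTERMEDIATE, HIDDEN))  # feed-forward down-projection
        parts.append(layer_norm(HIDDEN))
    parts.append(linear(HIDDEN, HIDDEN))            # pooler
    parts.append(linear(HIDDEN, NUM_LABELS))        # classifier
    n_weights = sum(w for w, _ in parts)
    n_biases = sum(b for _, b in parts)
    return n_biases, n_weights + n_biases


n_bias, n_total = bert_base_param_counts()
print(f"bias: {n_bias:,}  total: {n_total:,}  ratio: {n_bias / n_total * 100:.4f}%")
```

The result lands just under 0.1%, consistent with the 0.08-0.1% range quoted in this lab. LayerNorm biases are included because their parameter names also end in `bias`.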

## 4. "Merge" BitFit Model (Extract Updated Model)

Since bias parameters are already part of the model, "merging" simply means saving the model with updated bias values.

In [None]:
print("=== BitFit 'Merging' Process ===")
print("")
print("ℹ️  BitFit doesn't require traditional merging:")
print("  1. Bias parameters are native to the base model")
print("  2. Training updates bias values in-place")
print("  3. No adapter artifacts to remove")
print("  4. Model is already in standard transformers format")
print("")

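# With a PEFT wrapper, merge_and_unload() returns the underlying model with updated biases.
# A plain transformers model has no such method and is already in deployable form.
merged_model = peft_model.merge_and_unload() if hasattr(peft_model, "merge_and_unload") else peft_model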

print("✅ Model 'merged' (bias parameters integrated)")
print(f"Model type: {type(merged_model)}")
print(f"Total parameters: {sum(p.numel() for p in merged_model.parameters()):,}")
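
Because the updated biases live under the same state-dict keys as the originals, they can be extracted and re-applied to a fresh copy of the base model without any merge arithmetic. The following is a sketch on a toy module; `extract_bias_delta` and the toy network are hypothetical, and the "training" is simulated by shifting the biases.

```python
import copy

import torch
import torch.nn as nn


def extract_bias_delta(model: nn.Module) -> dict:
    """Collect only the bias tensors from a model's state dict."""
    return {k: v.clone() for k, v in model.state_dict().items() if k.endswith("bias")}


torch.manual_seed(0)
base = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))  # toy stand-in

# Simulate BitFit training: only the biases move
tuned = copy.deepcopy(base)
with torch.no_grad():
    for name, param in tuned.named_parameters():
        if name.endswith("bias"):
            param.add_(0.5)

delta = extract_bias_delta(tuned)

# Re-apply the biases to a fresh base copy; strict=False tolerates the absent weight keys
restored = copy.deepcopy(base)
result = restored.load_state_dict(delta, strict=False)

x = torch.randn(3, 4)
same_output = torch.allclose(restored(x), tuned(x))
print(sorted(delta), same_output)
```

With `strict=False`, `load_state_dict` reports the weight keys as missing rather than raising, and the restored model reproduces the tuned model's outputs.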

## 5. Performance Benchmarking

In [None]:
def benchmark_model(model, tokenizer, test_cases, num_runs=10):
    """Return (average seconds per pass over test_cases, predictions from the first pass)."""
    model.eval()
    device = next(model.parameters()).device

    def sync():
        # CUDA kernels run asynchronously; wait for them so timings are accurate
        if device.type == "cuda":
            torch.cuda.synchronize(device)

    def run_pair(s1, s2):
        inputs = tokenizer(s1, s2, return_tensors="pt", truncation=True, padding=True)
        inputs = {k: v.to(device) for k, v in inputs.items()}
        return model(**inputs)

    total_time = 0.0
    predictions = []

    with torch.no_grad():
        # Warm-up
        for s1, s2 in test_cases:
            run_pair(s1, s2)
        sync()

        # Benchmark
        for run in range(num_runs):
            start_time = time.perf_counter()

            for s1, s2 in test_cases:
                outputs = run_pair(s1, s2)

                if run == 0:
                    pred = torch.argmax(outputs.logits, dim=-1).cpu().item()
                    predictions.append(pred)

            sync()
            total_time += time.perf_counter() - start_time

    return total_time / num_runs, predictions

test_cases = [
    ("The company said the merger was approved.", "The company announced the deal was approved."),
    ("The cat sat on the mat.", "The dog ran in the park."),
    ("Python is a programming language.", "Python is used for coding."),
]

print("=== Performance Benchmarking ===")
avg_time, preds = benchmark_model(merged_model, tokenizer, test_cases)

print(f"\nAverage inference time: {avg_time*1000:.2f} ms")
print(f"Per sample: {avg_time/len(test_cases)*1000:.2f} ms")
print(f"Throughput: {len(test_cases)/avg_time:.2f} samples/sec")
print(f"\n✅ Zero overhead - standard transformers model!")
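
A single average can hide outliers such as a slow first run. The sketch below shows one way to collect per-run latencies and report the median alongside the spread; `time_callable` is a hypothetical helper and the workload is a stand-in for one pass over `test_cases`.

```python
import statistics
import time


def time_callable(fn, warmup=2, runs=10):
    """Call fn() repeatedly and return the per-call latencies in seconds."""
    for _ in range(warmup):
        fn()
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        latencies.append(time.perf_counter() - start)
    return latencies


# Hypothetical CPU workload standing in for one model pass
latencies = time_callable(lambda: sum(i * i for i in range(10_000)))

print(
    f"median: {statistics.median(latencies) * 1e3:.3f} ms, "
    f"min: {min(latencies) * 1e3:.3f} ms, "
    f"max: {max(latencies) * 1e3:.3f} ms"
)
```

When timing a model on a GPU, `fn` would also need to wait for the device to finish (for example with `torch.cuda.synchronize()`) before returning, otherwise the measured latency only covers kernel launch.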

## 6. Save Deployment Model

In [None]:
# Save directory
deployment_dir = "./bitfit-bert-mrpc-deployed"

print(f"=== Saving BitFit Model for Deployment ===")
print(f"Target: {deployment_dir}")
print("")

# Save as standard transformers model
merged_model.save_pretrained(deployment_dir)
tokenizer.save_pretrained(deployment_dir)

print("✅ Model saved!")
print("")
print("📦 Saved files:")
for f in os.listdir(deployment_dir):
    fpath = os.path.join(deployment_dir, f)
    if os.path.isfile(fpath):
        size_mb = os.path.getsize(fpath) / (1024**2)
        print(f"  {f}: {size_mb:.2f} MB")

## 7. Verify Deployment Model

In [None]:
print("=== Verifying Deployment Model ===")

# Load WITHOUT PEFT library - pure transformers!
from transformers import AutoModelForSequenceClassification, AutoTokenizer

deployed_model = AutoModelForSequenceClassification.from_pretrained(deployment_dir)
deployed_tokenizer = AutoTokenizer.from_pretrained(deployment_dir)
deployed_model.to(device)
deployed_model.eval()

print("✅ Loaded as standard transformers model (no PEFT!)")

# Test inference
test_s1 = "The company said the merger was approved."
test_s2 = "The company announced the deal was approved."

inputs = deployed_tokenizer(test_s1, test_s2, return_tensors="pt", truncation=True, padding=True)
inputs = {k: v.to(device) for k, v in inputs.items()}

with torch.no_grad():
    outputs = deployed_model(**inputs)
    pred = torch.argmax(outputs.logits, dim=-1).cpu().item()

labels = {0: "Not Paraphrase", 1: "Paraphrase"}
print(f"\nTest: {labels[pred]}")
print("✅ Deployed model loads and runs inference without PEFT")
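
A single prediction shows that the reloaded model runs, but not that it matches the in-memory model. One way to check this is to compare logits from both models on the same inputs. The helper below is a sketch that works on nested lists of floats (for example the result of `logits.tolist()`); the name `logits_match` and the sample values are made up for illustration.

```python
import math


def logits_match(reference, candidate, abs_tol=1e-5):
    """True if two equally shaped nested lists of floats agree within abs_tol."""
    if isinstance(reference, (list, tuple)):
        return (
            isinstance(candidate, (list, tuple))
            and len(reference) == len(candidate)
            and all(logits_match(r, c, abs_tol) for r, c in zip(reference, candidate))
        )
    if isinstance(candidate, (list, tuple)):
        return False  # shape mismatch: scalar vs. sequence
    return math.isclose(reference, candidate, rel_tol=0.0, abs_tol=abs_tol)


# Made-up logits for illustration
in_memory = [[0.12, -1.30]]
reloaded_ok = [[0.12, -1.30]]
reloaded_bad = [[0.12, -1.10]]

print(logits_match(in_memory, reloaded_ok), logits_match(in_memory, reloaded_bad))
```

In this notebook, the two arguments would be the logits of `merged_model` and `deployed_model` on the same tokenized inputs, each converted with `.tolist()`.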

## 8. Deployment Configuration

In [None]:
import json

config = {
    "model_info": {
        "base_model": model_checkpoint,
        "method": "BitFit",
        "trainable_params": int(trainable),
        "total_params": int(total),
        "efficiency_percent": round(trainable/total*100, 4)
    },
    "deployment": {
        "format": "Standard Transformers (no PEFT required)",
        "inference_overhead": "0% (native bias parameters)",
        "requires_peft_library": False,
        "loading_example": [
            "from transformers import AutoModelForSequenceClassification",
            f"model = AutoModelForSequenceClassification.from_pretrained('{deployment_dir}')"
        ]
    },
    "advantages": [
        "Very small parameter footprint (roughly 0.08-0.1% of parameters trained)",
        "Zero inference overhead",
        "No special libraries required",
        "Standard transformers format",
        "Easy to deploy and maintain"
    ],
    "use_cases": [
        "Extreme resource constraints",
        "Many tasks on single base model",
        "Fast prototyping and experimentation",
        "Edge device deployment"
    ]
}

config_path = os.path.join(deployment_dir, "deployment_config.json")
with open(config_path, 'w') as f:
    json.dump(config, f, indent=2)

print("=== Deployment Configuration ===")
print(json.dumps(config, indent=2))
print(f"\n✅ Saved to: {config_path}")
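
On the serving side, it can help to validate the configuration file before trusting it. The sketch below checks for a subset of the keys written above; the helper names and the choice of which keys are required are assumptions, and the demo values are illustrative rather than measured.

```python
import json
import os
import tempfile

# Keys taken from the config written above; treating them as required is an assumption
REQUIRED_KEYS = {
    "model_info": {"base_model", "method", "trainable_params", "total_params"},
    "deployment": {"format", "requires_peft_library"},
}


def validate_deployment_config(cfg: dict) -> dict:
    """Raise KeyError if a required section or key is absent; otherwise return cfg."""
    for section, keys in REQUIRED_KEYS.items():
        missing = keys - set(cfg.get(section, {}))
        if missing:
            raise KeyError(f"{section} is missing {sorted(missing)}")
    return cfg


def load_deployment_config(path: str) -> dict:
    with open(path) as f:
        return validate_deployment_config(json.load(f))


# Demo with illustrative values written to a temporary file
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "deployment_config.json")
    with open(path, "w") as f:
        json.dump(
            {
                "model_info": {
                    "base_model": "bert-base-uncased",
                    "method": "BitFit",
                    "trainable_params": 1,
                    "total_params": 100,
                },
                "deployment": {"format": "Standard Transformers", "requires_peft_library": False},
            },
            f,
        )
    loaded = load_deployment_config(path)

print(loaded["model_info"]["method"])
```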

## 9. PEFT Methods Comparison

In [None]:
print("=== PEFT Methods Deployment Comparison ===")
print("")
print(f"{'Method':<15} {'Params%':<10} {'Mergeable':<12} {'Overhead':<12} {'PEFT Lib?':<12}")
print("-" * 65)
print(f"{'BitFit':<15} {'~0.08%':<10} {'Native ✅':<12} {'0%':<12} {'No ✅':<12}")
print(f"{'LoRA':<15} {'0.1-1%':<10} {'Yes ✅':<12} {'0%*':<12} {'No*':<12}")
print(f"{'IA³':<15} {'~0.01%':<10} {'Yes ✅':<12} {'0%*':<12} {'No*':<12}")
print(f"{'Adapter':<15} {'0.5-5%':<10} {'No ❌':<12} {'2-5%':<12} {'Yes ❌':<12}")
print(f"{'Prompt':<15} {'0.01-1%':<10} {'No ❌':<12} {'1-3%':<12} {'Yes ❌':<12}")
print(f"{'Prefix':<15} {'0.1-3%':<10} {'No ❌':<12} {'3-7%':<12} {'Yes ❌':<12}")
print("")
print("* After merging")
print("")
print("🏆 BitFit Advantages:")
print("  • Among the smallest parameter footprints")
print("  • No merging step needed (native parameters)")
print("  • Minimal deployment complexity")
print("  • No special library dependencies")

## 10. Summary

### Key Takeaways

1. **BitFit is "Pre-Merged"**: Bias parameters are native to the model - no adapter artifacts to remove
2. **Simplest Deployment**: Save as standard transformers model, no PEFT library needed
3. **Zero Overhead**: Same inference speed as base model
4. **Extreme Efficiency**: Among the smallest parameter footprints (~0.08%)

### When to Choose BitFit

✅ **Best for:**
- Extreme resource constraints
- Many tasks on single base model (tiny storage per task)
- Fast prototyping
- Edge device deployment
- Scenarios where deployment simplicity matters
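
The "tiny storage per task" point can be made concrete with back-of-the-envelope arithmetic. The sketch below assumes fp32 storage, the parameter counts that follow from the published `bert-base-uncased` dimensions with a 2-label head, and that only bias values are stored per task (a separately trained head would add a little more).

```python
BYTES_PER_FP32 = 4
TOTAL_PARAMS = 109_483_778  # bert-base-uncased plus a 2-label classifier
BIAS_PARAMS = 102_914       # all bias terms, including LayerNorm biases
MIB = 1024 ** 2


def storage_mib(num_tasks: int, share_base: bool) -> float:
    """Approximate storage in MiB for num_tasks fine-tuned variants."""
    full_copy = TOTAL_PARAMS * BYTES_PER_FP32
    if not share_base:
        # One complete model copy per task
        return num_tasks * full_copy / MIB
    # One shared base copy plus a bias-only delta per task
    bias_delta = BIAS_PARAMS * BYTES_PER_FP32
    return (full_copy + num_tasks * bias_delta) / MIB


for n in (1, 10, 100):
    print(
        f"{n:>3} tasks: full copies {storage_mib(n, False):9.1f} MiB | "
        f"shared base + bias deltas {storage_mib(n, True):7.1f} MiB"
    )
```

The saving only materializes if deployment stores bias deltas and applies them to a shared base; saving a full model per task, as in Section 6, costs a full copy each time.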

⚠️ **Limitations:**
- May underperform vs LoRA on complex tasks
- Limited expressiveness (only bias parameters)
- Works best when base model is well-pretrained

---

## 🎉 Lab 6 Complete!

BitFit demonstrates that sometimes the simplest approach is the most practical. By only training bias parameters, we achieve extreme parameter efficiency while maintaining deployment simplicity.