# Tutorial 15: Model Compression Techniques

## Module 6: Deployment and Serving

---

## Learning Objectives

By the end of this tutorial, you will be able to:

1. **Apply knowledge distillation** - Train smaller student models from larger teachers
2. **Implement model pruning** - Remove unnecessary weights and neurons
3. **Use quantization techniques** - Reduce model precision for faster inference
4. **Compare compression methods** - Analyze accuracy-size-speed trade-offs
5. **Design efficient architectures** - Build models optimized for deployment

---

## Table of Contents

1. [Introduction to Model Compression](#1-introduction)
2. [Knowledge Distillation](#2-knowledge-distillation)
3. [Model Pruning](#3-pruning)
4. [Quantization](#4-quantization)
5. [Architecture Optimization](#5-architecture)
6. [Compression Comparison](#6-comparison)
7. [Hands-on Exercises](#7-exercises)
8. [Summary](#8-summary)

In [None]:
# Import required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_classification, load_digits
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
import pickle
import time
from typing import Dict, List, Any
from dataclasses import dataclass
import warnings
warnings.filterwarnings('ignore')

np.random.seed(42)
plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams['figure.figsize'] = [12, 6]

print("Libraries imported successfully!")

---

## 1. Introduction to Model Compression <a name="1-introduction"></a>

Model compression reduces the size and computational requirements of ML models while maintaining acceptable accuracy.

### Why Compress Models?

- **Mobile deployment**: Limited memory and compute
- **Edge devices**: IoT, embedded systems
- **Cost reduction**: Lower inference costs
- **Latency**: Faster predictions

### Compression Techniques

| Technique | Size Reduction | Speed Improvement | Accuracy Impact |
|-----------|----------------|-------------------|----------------|
| Knowledge Distillation | 2-10x | 2-10x | Low-Medium |
| Pruning | 2-10x | 1.5-3x | Low |
| Quantization | 2-8x | 2-4x | Very Low |
| Architecture | Varies | Varies | Varies |

In [None]:
# Load sample datasets
digits = load_digits()
X, y = digits.data, digits.target

X_large, y_large = make_classification(
    n_samples=5000, n_features=50, n_informative=30,
    n_classes=5, n_clusters_per_class=2, random_state=42
)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train_large, X_test_large, y_train_large, y_test_large = train_test_split(
    X_large, y_large, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

scaler_large = StandardScaler()
X_train_large_scaled = scaler_large.fit_transform(X_train_large)
X_test_large_scaled = scaler_large.transform(X_test_large)

print(f"Digits: {X_train.shape[0]} train, {X_test.shape[0]} test")
print(f"Large: {X_train_large.shape[0]} train, {X_test_large.shape[0]} test")

In [None]:
def get_model_size(model):
    """Estimate model size in bytes."""
    return len(pickle.dumps(model))

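def measure_inference_time(model, X, n_runs=100):
    """Measure average inference time in milliseconds."""
    times = []
    for _ in range(n_runs):
        # perf_counter is a monotonic, high-resolution clock suited to short intervals
        start = time.perf_counter()
        model.predict(X)
        times.append(time.perf_counter() - start)
    return np.mean(times) * 1000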

def evaluate_model(model, X_train, y_train, X_test, y_test, name="Model"):
    """Comprehensive model evaluation."""
    start = time.perf_counter()
    model.fit(X_train, y_train)
    train_time = time.perf_counter() - start
    
    train_acc = accuracy_score(y_train, model.predict(X_train))
    test_acc = accuracy_score(y_test, model.predict(X_test))
    size = get_model_size(model)
    inf_time = measure_inference_time(model, X_test[:100])
    
    return {
        'name': name, 'train_acc': train_acc, 'test_acc': test_acc,
        'size_kb': size / 1024, 'inference_ms': inf_time, 'train_time_s': train_time
    }

print("Utility functions defined!")

---

## 2. Knowledge Distillation <a name="2-knowledge-distillation"></a>

Knowledge distillation transfers knowledge from a large "teacher" model to a smaller "student" model.

### Key Concepts

- **Teacher Model**: Large, accurate model
- **Student Model**: Smaller, faster model
- **Soft Labels**: Teacher's probability outputs
- **Temperature**: Softens probability distribution
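
To make the role of temperature concrete, here is a minimal sketch of a temperature-scaled softmax. The logits are hypothetical values chosen for illustration, and `softmax_with_temperature` is a helper name introduced here, not part of any library.

```python
import numpy as np

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax of logits / temperature, computed in a numerically stable way."""
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    scaled -= scaled.max(axis=-1, keepdims=True)  # avoid overflow in exp
    exp = np.exp(scaled)
    return exp / exp.sum(axis=-1, keepdims=True)

# Hypothetical teacher logits for one sample over three classes
teacher_logits = np.array([4.0, 2.0, 0.5])

for T in (1.0, 3.0, 10.0):
    probs = softmax_with_temperature(teacher_logits, T)
    print(f"T={T:>4}: {np.round(probs, 3)}")
```

As the temperature grows, the largest probability shrinks and the smaller ones grow, so the student sees more of how the teacher ranks the non-winning classes.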

In [None]:
class KnowledgeDistillation:
    """Knowledge distillation from teacher to student."""
    
    def __init__(self, teacher, student, temperature=3.0, alpha=0.5):
        self.teacher = teacher
        self.student = student
        self.temperature = temperature
        self.alpha = alpha
    
    def softmax_temp(self, logits, temp):
        """Softmax with temperature."""
        logits = np.array(logits)
        scaled = logits / temp
        exp_scaled = np.exp(scaled - np.max(scaled, axis=-1, keepdims=True))
        return exp_scaled / np.sum(exp_scaled, axis=-1, keepdims=True)
    
    def get_soft_labels(self, X):
        """Get soft labels from teacher."""
        if hasattr(self.teacher, 'predict_proba'):
            proba = self.teacher.predict_proba(X)
            return self.softmax_temp(np.log(proba + 1e-10), self.temperature)
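        # Fallback: one-hot encode hard predictions against the teacher's full,
        # sorted class list so the column count does not depend on which classes
        # happen to appear in this batch.
        preds = self.teacher.predict(X)
        classes = self.teacher.classes_
        one_hot = np.zeros((len(preds), len(classes)))
        one_hot[np.arange(len(preds)), np.searchsorted(classes, preds)] = 1
        return one_hot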
    
    def distill(self, X_train, y_train):
        """Perform knowledge distillation."""
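        print("Training teacher...")
        self.teacher.fit(X_train, y_train)
        teacher_acc = accuracy_score(y_train, self.teacher.predict(X_train))
        print(f"  Teacher train accuracy: {teacher_acc:.4f}")
        
        print("Training student on mixed teacher/true labels...")
        # scikit-learn classifiers accept only hard labels, so each training label is
        # the teacher's prediction with probability (1 - alpha) and the true label
        # otherwise. The temperature-softened probabilities from get_soft_labels()
        # are not used in this step.
        teacher_preds = self.teacher.predict(X_train)
        mask = np.random.random(len(y_train)) > self.alpha
        mixed_labels = np.where(mask, teacher_preds, y_train)
        
        self.student.fit(X_train, mixed_labels)
        student_acc = accuracy_score(y_train, self.student.predict(X_train))
        print(f"  Student train accuracy: {student_acc:.4f}")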
        
        return self.student
    
    def compare(self, X_test, y_test):
        """Compare teacher and student."""
        t_acc = accuracy_score(y_test, self.teacher.predict(X_test))
        s_acc = accuracy_score(y_test, self.student.predict(X_test))
        t_size = get_model_size(self.teacher)
        s_size = get_model_size(self.student)
        t_time = measure_inference_time(self.teacher, X_test)
        s_time = measure_inference_time(self.student, X_test)
        
        return {
            'teacher': {'accuracy': t_acc, 'size_kb': t_size/1024, 'inference_ms': t_time},
            'student': {'accuracy': s_acc, 'size_kb': s_size/1024, 'inference_ms': s_time},
            'compression': {
                'acc_retention': s_acc / t_acc,
                'size_reduction': t_size / s_size,
                'speedup': t_time / s_time
            }
        }


# Demo distillation
teacher = RandomForestClassifier(n_estimators=100, max_depth=20, random_state=42)
student = DecisionTreeClassifier(max_depth=10, random_state=42)

kd = KnowledgeDistillation(teacher, student, temperature=3.0, alpha=0.3)
kd.distill(X_train_scaled, y_train)

comp = kd.compare(X_test_scaled, y_test)

print("\n" + "="*50)
print("Knowledge Distillation Results")
print("="*50)
print(f"\nTeacher: Acc={comp['teacher']['accuracy']:.4f}, Size={comp['teacher']['size_kb']:.1f}KB")
print(f"Student: Acc={comp['student']['accuracy']:.4f}, Size={comp['student']['size_kb']:.1f}KB")
print(f"\nCompression: {comp['compression']['size_reduction']:.1f}x size, {comp['compression']['speedup']:.1f}x speed")
print(f"Accuracy retention: {comp['compression']['acc_retention']:.2%}")

In [None]:
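# Temperature experiment
# Note: distill() fits the student on hard labels, so `temperature` does not enter
# training here. Differences between runs come from the random label mask, which
# is redrawn on every call.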
temperatures = [1.0, 2.0, 3.0, 5.0, 10.0]
temp_results = []

print("Temperature Effect on Distillation:")
for temp in temperatures:
    t = RandomForestClassifier(n_estimators=100, max_depth=20, random_state=42)
    s = DecisionTreeClassifier(max_depth=10, random_state=42)
    kd_t = KnowledgeDistillation(t, s, temperature=temp, alpha=0.3)
    kd_t.distill(X_train_scaled, y_train)
    c = kd_t.compare(X_test_scaled, y_test)
    temp_results.append({'temp': temp, 'student_acc': c['student']['accuracy'], 'teacher_acc': c['teacher']['accuracy']})
    print(f"T={temp}: Student={c['student']['accuracy']:.4f}")

# Visualize
df_temp = pd.DataFrame(temp_results)
fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(df_temp['temp'], df_temp['student_acc'], marker='o', label='Student', color='coral')
ax.axhline(df_temp['teacher_acc'].iloc[0], linestyle='--', color='steelblue', label='Teacher')
ax.set_xlabel('Temperature')
ax.set_ylabel('Accuracy')
ax.set_title('Temperature Effect on Distillation')
ax.legend()
plt.show()

---

## 3. Model Pruning <a name="3-pruning"></a>

Pruning removes unimportant weights or neurons to reduce model size.

### Types of Pruning

| Type | Description | Hardware Friendly |
|------|-------------|------------------|
| Weight Pruning | Remove individual weights | No (sparse) |
| Structured | Remove neurons/filters | Yes |
| Iterative | Gradual pruning | Varies |
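
The difference between the first two rows can be sketched on a single weight matrix. This is an illustrative example on random weights, assuming a layer stored as an array with one column per neuron; the L2-norm criterion for ranking neurons is one common choice, not the only one.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 6))  # hypothetical layer: 8 inputs -> 6 neurons (columns)

# Unstructured (weight) pruning: zero the 50% smallest-magnitude weights.
# The matrix keeps its shape, so savings require sparse storage or kernels.
threshold = np.quantile(np.abs(W), 0.5)
W_unstructured = np.where(np.abs(W) < threshold, 0.0, W)

# Structured pruning: drop the 3 neurons (columns) with the smallest L2 norm.
# The result is a smaller dense matrix that ordinary hardware handles directly.
col_norms = np.linalg.norm(W, axis=0)
keep_cols = np.sort(np.argsort(col_norms)[3:])
W_structured = W[:, keep_cols]

print("unstructured:", W_unstructured.shape, "zeros:", int((W_unstructured == 0).sum()))
print("structured:  ", W_structured.shape)
```

In a real network, removing a neuron also means removing the matching row of the next layer's weight matrix and the neuron's bias.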

In [None]:
class ModelPruning:
    """Model pruning techniques."""
    
    @staticmethod
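    def prune_random_forest(model, keep_ratio=0.5):
        """Prune a fitted RF by keeping its first n trees.

        Bagged trees are not ranked, so this is an arbitrary subset rather than
        the "best" trees.
        """
        import copy  # local import keeps this helper self-contained
        n_keep = max(1, int(len(model.estimators_) * keep_ratio))
        # Shallow-copy so every fitted attribute that predict() relies on
        # (classes_, n_outputs_, n_features_in_, ...) carries over, then swap
        # in the reduced list of trees.
        pruned = copy.copy(model)
        pruned.estimators_ = model.estimators_[:n_keep]
        pruned.n_estimators = n_keep
        return pruned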
    
    @staticmethod
    def magnitude_pruning(weights, sparsity=0.5):
        """Magnitude-based weight pruning."""
        threshold = np.percentile(np.abs(weights), sparsity * 100)
        pruned = np.where(np.abs(weights) < threshold, 0, weights)
        actual_sparsity = np.sum(pruned == 0) / pruned.size
        return pruned, actual_sparsity


# RF Pruning Demo
print("="*50)
print("Random Forest Pruning")
print("="*50)

rf_full = RandomForestClassifier(n_estimators=100, max_depth=20, random_state=42)
rf_full.fit(X_train_scaled, y_train)

pruning_results = []
for keep in [1.0, 0.75, 0.5, 0.25, 0.1]:
    if keep == 1.0:
        pruned = rf_full
    else:
        pruned = ModelPruning.prune_random_forest(rf_full, keep)
    
    acc = accuracy_score(y_test, pruned.predict(X_test_scaled))
    size = get_model_size(pruned)
    inf_time = measure_inference_time(pruned, X_test_scaled)
    
    pruning_results.append({'keep': keep, 'trees': len(pruned.estimators_), 
                           'acc': acc, 'size_kb': size/1024, 'time_ms': inf_time})
    print(f"Keep {keep:.0%}: {len(pruned.estimators_)} trees, Acc={acc:.4f}, Size={size/1024:.1f}KB")

# Visualize
df_prune = pd.DataFrame(pruning_results)
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

axes[0].plot(df_prune['keep'], df_prune['acc'], marker='o', color='steelblue')
axes[0].set_xlabel('Keep Ratio')
axes[0].set_ylabel('Accuracy')
axes[0].set_title('Accuracy vs Pruning')

axes[1].plot(df_prune['keep'], df_prune['size_kb'], marker='s', color='coral')
axes[1].set_xlabel('Keep Ratio')
axes[1].set_ylabel('Size (KB)')
axes[1].set_title('Size vs Pruning')

axes[2].plot(df_prune['keep'], df_prune['time_ms'], marker='^', color='forestgreen')
axes[2].set_xlabel('Keep Ratio')
axes[2].set_ylabel('Time (ms)')
axes[2].set_title('Inference Time vs Pruning')

plt.tight_layout()
plt.show()

In [None]:
# Weight Magnitude Pruning Simulation
print("="*50)
print("Weight Magnitude Pruning")
print("="*50)

np.random.seed(42)
weights = np.random.randn(10000)

sparsity_results = []
for sparsity in [0.0, 0.3, 0.5, 0.7, 0.9, 0.95]:
    pruned, actual = ModelPruning.magnitude_pruning(weights, sparsity)
    non_zero = np.count_nonzero(pruned)
    compression = len(weights) / non_zero if non_zero > 0 else float('inf')
    sparsity_results.append({'target': sparsity, 'actual': actual, 'compression': compression})
    print(f"Target: {sparsity:.0%}, Actual: {actual:.2%}, Compression: {compression:.1f}x")

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

axes[0].hist(weights, bins=50, alpha=0.7, label='Original', color='steelblue')
pruned_50, _ = ModelPruning.magnitude_pruning(weights, 0.5)
axes[0].hist(pruned_50[pruned_50 != 0], bins=50, alpha=0.5, label='Pruned 50%', color='coral')
axes[0].set_xlabel('Weight Value')
axes[0].set_title('Weight Distribution')
axes[0].legend()

df_sp = pd.DataFrame(sparsity_results)
axes[1].plot(df_sp['target'], df_sp['compression'], marker='o', color='forestgreen')
axes[1].set_xlabel('Sparsity')
axes[1].set_ylabel('Compression Ratio')
axes[1].set_title('Compression vs Sparsity')

plt.tight_layout()
plt.show()

---

## 4. Quantization <a name="4-quantization"></a>

Quantization reduces weight precision from floating-point to lower-bit representations.

### Quantization Types

| Type | Bits | Size Reduction |
|------|------|---------------|
| FP32 | 32 | 1x (baseline) |
| FP16 | 16 | 2x |
| INT8 | 8 | 4x |
| INT4 | 4 | 8x |

In [None]:
class Quantization:
    """Quantization techniques."""
    
    @staticmethod
    def quantize_int8(values):
        """Quantize to INT8."""
        values = np.array(values, dtype=np.float32)
        max_abs = np.max(np.abs(values))
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        # Clip before casting so out-of-range values saturate instead of wrapping
        quantized = np.clip(np.round(values / scale), -128, 127).astype(np.int8)
        return quantized, scale
    
    @staticmethod
    def dequantize_int8(quantized, scale):
        """Dequantize from INT8."""
        return quantized.astype(np.float32) * scale
    
    @staticmethod
    def quantize_int4(values):
        """Quantize to INT4."""
        values = np.array(values, dtype=np.float32)
        max_abs = np.max(np.abs(values))
        scale = max_abs / 7.0 if max_abs > 0 else 1.0
        # INT4 range is [-8, 7]; values are stored one per int8 element (not bit-packed)
        quantized = np.clip(np.round(values / scale), -8, 7).astype(np.int8)
        return quantized, scale
    
    @staticmethod
    def calc_error(original, dequantized):
        """Calculate quantization error."""
        mse = np.mean((original - dequantized) ** 2)
        mae = np.mean(np.abs(original - dequantized))
        rel_error = mae / (np.mean(np.abs(original)) + 1e-10)
        return {'mse': mse, 'mae': mae, 'rel_error': rel_error}


# Quantization Demo
print("="*50)
print("Quantization Demonstration")
print("="*50)

np.random.seed(42)
sample_weights = np.random.randn(10000).astype(np.float32)

# INT8
int8_w, scale_8 = Quantization.quantize_int8(sample_weights)
dequant_8 = Quantization.dequantize_int8(int8_w, scale_8)
error_8 = Quantization.calc_error(sample_weights, dequant_8)

# INT4
int4_w, scale_4 = Quantization.quantize_int4(sample_weights)
dequant_4 = int4_w.astype(np.float32) * scale_4
error_4 = Quantization.calc_error(sample_weights, dequant_4)

print(f"\nINT8: Scale={scale_8:.4f}, Rel Error={error_8['rel_error']:.4%}")
print(f"INT4: Scale={scale_4:.4f}, Rel Error={error_4['rel_error']:.4%}")

# Memory savings
n_params = 10_000_000
print(f"\nMemory for 10M params:")
print(f"  FP32: {n_params * 4 / 1e6:.1f} MB")
print(f"  INT8: {n_params * 1 / 1e6:.1f} MB (4x reduction)")
print(f"  INT4: {n_params * 0.5 / 1e6:.1f} MB (8x reduction)")
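
The 0.5 bytes per parameter quoted for INT4 assumes two 4-bit values share one byte. An `int8` array holding values in [-8, 7], as returned by `quantize_int4`, still occupies one byte per value. The following is a minimal sketch of bit-packing in NumPy; `pack_int4` and `unpack_int4` are hypothetical helper names introduced for illustration.

```python
import numpy as np

def pack_int4(q):
    """Pack signed 4-bit values (range [-8, 7]) two per byte."""
    q = np.asarray(q, dtype=np.int8)
    if q.size % 2:                           # pad to an even count
        q = np.append(q, np.int8(0))
    nibbles = (q & 0x0F).astype(np.uint8)    # low 4 bits of the two's complement
    return (nibbles[0::2] | (nibbles[1::2] << 4)).astype(np.uint8)

def unpack_int4(packed, n):
    """Recover the first n signed 4-bit values from pack_int4 output."""
    packed = np.asarray(packed, dtype=np.uint8)
    nibbles = np.empty(packed.size * 2, dtype=np.uint8)
    nibbles[0::2] = packed & 0x0F
    nibbles[1::2] = packed >> 4
    signed = nibbles.astype(np.int8)
    signed[signed > 7] -= 16                 # restore the sign of negative values
    return signed[:n]

q = np.array([-8, -1, 0, 3, 7], dtype=np.int8)
packed = pack_int4(q)
print(q.nbytes, "->", packed.nbytes, "bytes")
print(unpack_int4(packed, q.size))
```

Packed storage halves memory relative to INT8, at the cost of an unpacking step before the values can be used in arithmetic.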

In [None]:
# Visualize quantization
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Distribution comparison
axes[0].hist(sample_weights, bins=50, alpha=0.7, label='FP32', color='steelblue')
axes[0].hist(dequant_8, bins=50, alpha=0.5, label='INT8', color='coral')
axes[0].set_xlabel('Value')
axes[0].set_title('FP32 vs INT8')
axes[0].legend()

# Error distribution
error_dist = sample_weights - dequant_8
axes[1].hist(error_dist, bins=50, color='coral', alpha=0.7)
axes[1].set_xlabel('Quantization Error')
axes[1].set_title('INT8 Quantization Error')

# Size vs Error trade-off
precisions = ['FP32', 'INT8', 'INT4']
sizes = [40, 10, 5]  # MB for 10M params
errors = [0, error_8['rel_error'], error_4['rel_error']]

ax2 = axes[2]
ax2_twin = ax2.twinx()
ax2.bar(precisions, sizes, color='steelblue', alpha=0.7)
ax2_twin.plot(precisions, errors, marker='o', color='coral', linewidth=2)
ax2.set_ylabel('Size (MB)', color='steelblue')
ax2_twin.set_ylabel('Relative Error', color='coral')
ax2.set_title('Size vs Error Trade-off')

plt.tight_layout()
plt.show()

---

## 5. Architecture Optimization <a name="5-architecture"></a>

Efficient architectures are designed for low latency and small footprint.
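
One quick way to reason about footprint before training anything is to count parameters. The sketch below counts weights and biases for a fully connected network; `count_mlp_params` is a helper introduced here, and the 64 inputs and 10 outputs are taken from the digits dataset used in this tutorial.

```python
def count_mlp_params(n_inputs, hidden_layer_sizes, n_outputs):
    """Weights plus biases of a fully connected network."""
    sizes = [n_inputs, *hidden_layer_sizes, n_outputs]
    return sum(fan_in * fan_out + fan_out for fan_in, fan_out in zip(sizes, sizes[1:]))

# 64 input features and 10 classes match the digits dataset
for name, hidden in [("Large", (500, 200, 100)), ("Medium", (200, 100)),
                     ("Small", (100, 50)), ("Tiny", (50,))]:
    n = count_mlp_params(64, hidden, 10)
    print(f"{name:<7} {hidden!s:<16} {n:>8,} params  ~{n * 8 / 1024:.0f} KB as float64")
```

The estimate covers only the weight arrays. A pickled estimator is likely to be larger, since the pickle also carries other fitted state.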

In [None]:
# Compare different architectures
print("="*50)
print("Architecture Comparison")
print("="*50)

architectures = {
    'Large MLP': MLPClassifier(hidden_layer_sizes=(500, 200, 100), max_iter=500, random_state=42),
    'Medium MLP': MLPClassifier(hidden_layer_sizes=(200, 100), max_iter=500, random_state=42),
    'Small MLP': MLPClassifier(hidden_layer_sizes=(100, 50), max_iter=500, random_state=42),
    'Tiny MLP': MLPClassifier(hidden_layer_sizes=(50,), max_iter=500, random_state=42),
    'Logistic': LogisticRegression(max_iter=500, random_state=42),
    'Large RF': RandomForestClassifier(n_estimators=100, max_depth=20, random_state=42),
    'Small RF': RandomForestClassifier(n_estimators=20, max_depth=10, random_state=42),
    'Decision Tree': DecisionTreeClassifier(max_depth=10, random_state=42)
}

arch_results = []
for name, model in architectures.items():
    model.fit(X_train_scaled, y_train)
    acc = accuracy_score(y_test, model.predict(X_test_scaled))
    size = get_model_size(model)
    inf_time = measure_inference_time(model, X_test_scaled)
    arch_results.append({'name': name, 'acc': acc, 'size_kb': size/1024, 'time_ms': inf_time})
    print(f"{name:<15}: Acc={acc:.4f}, Size={size/1024:.1f}KB, Time={inf_time:.2f}ms")

df_arch = pd.DataFrame(arch_results).sort_values('acc', ascending=False)

# Efficiency score
df_arch['efficiency'] = df_arch['acc'] / (df_arch['size_kb'] * df_arch['time_ms'] + 1)

In [None]:
# Visualize architectures
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# Accuracy vs Size
axes[0].scatter(df_arch['size_kb'], df_arch['acc'], s=100, c=df_arch['time_ms'], cmap='viridis')
for _, row in df_arch.iterrows():
    axes[0].annotate(row['name'], (row['size_kb'], row['acc']), fontsize=8)
axes[0].set_xlabel('Size (KB)')
axes[0].set_ylabel('Accuracy')
axes[0].set_title('Accuracy vs Size')

# Size comparison
df_sorted = df_arch.sort_values('size_kb')
axes[1].barh(df_sorted['name'], df_sorted['size_kb'], color='steelblue')
axes[1].set_xlabel('Size (KB)')
axes[1].set_title('Model Size')

# Efficiency
df_eff = df_arch.sort_values('efficiency')
axes[2].barh(df_eff['name'], df_eff['efficiency'], color='forestgreen')
axes[2].set_xlabel('Efficiency Score')
axes[2].set_title('Efficiency: Acc / (Size * Time + 1)')

plt.tight_layout()
plt.show()

---

## 6. Compression Comparison <a name="6-comparison"></a>

Compare distillation, pruning, and a smaller architecture against the same baseline random forest.

In [None]:
# Comprehensive comparison
print("="*60)
print("Comprehensive Compression Comparison")
print("="*60)

# Baseline
baseline = RandomForestClassifier(n_estimators=100, max_depth=20, random_state=42)
baseline.fit(X_train_large_scaled, y_train_large)

comparison = [{
    'method': 'Baseline',
    'accuracy': accuracy_score(y_test_large, baseline.predict(X_test_large_scaled)),
    'size_kb': get_model_size(baseline) / 1024,
    'inference_ms': measure_inference_time(baseline, X_test_large_scaled)
}]

# 1. Knowledge Distillation
print("\n1. Knowledge Distillation...")
student_kd = DecisionTreeClassifier(max_depth=15, random_state=42)
teacher_preds = baseline.predict(X_train_large_scaled)
mask = np.random.random(len(y_train_large)) > 0.3
mixed = np.where(mask, teacher_preds, y_train_large)
student_kd.fit(X_train_large_scaled, mixed)

comparison.append({
    'method': 'Knowledge Distillation',
    'accuracy': accuracy_score(y_test_large, student_kd.predict(X_test_large_scaled)),
    'size_kb': get_model_size(student_kd) / 1024,
    'inference_ms': measure_inference_time(student_kd, X_test_large_scaled)
})

# 2. Pruning (50%)
print("2. Pruning (50%)...")
pruned = ModelPruning.prune_random_forest(baseline, 0.5)

comparison.append({
    'method': 'Pruning (50%)',
    'accuracy': accuracy_score(y_test_large, pruned.predict(X_test_large_scaled)),
    'size_kb': get_model_size(pruned) / 1024,
    'inference_ms': measure_inference_time(pruned, X_test_large_scaled)
})

# 3. Smaller Architecture
print("3. Smaller Architecture...")
small_rf = RandomForestClassifier(n_estimators=20, max_depth=10, random_state=42)
small_rf.fit(X_train_large_scaled, y_train_large)

comparison.append({
    'method': 'Smaller Architecture',
    'accuracy': accuracy_score(y_test_large, small_rf.predict(X_test_large_scaled)),
    'size_kb': get_model_size(small_rf) / 1024,
    'inference_ms': measure_inference_time(small_rf, X_test_large_scaled)
})

# Create comparison table
df_comp = pd.DataFrame(comparison)
baseline_acc = df_comp.loc[0, 'accuracy']
baseline_size = df_comp.loc[0, 'size_kb']
baseline_time = df_comp.loc[0, 'inference_ms']

df_comp['acc_retention'] = df_comp['accuracy'] / baseline_acc
df_comp['size_reduction'] = baseline_size / df_comp['size_kb']
df_comp['speedup'] = baseline_time / df_comp['inference_ms']

print("\nResults:")
print(df_comp.to_string(index=False))

In [None]:
# Visualize comparison
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

methods = df_comp['method']
x = np.arange(len(methods))

# Accuracy
colors = ['steelblue' if m == 'Baseline' else 'coral' for m in methods]
axes[0].bar(x, df_comp['accuracy'], color=colors)
axes[0].set_xticks(x)
axes[0].set_xticklabels(methods, rotation=45, ha='right')
axes[0].set_ylabel('Accuracy')
axes[0].set_title('Accuracy Comparison')
axes[0].axhline(baseline_acc, color='gray', linestyle='--', alpha=0.5)

# Size
axes[1].bar(x, df_comp['size_kb'], color=colors)
axes[1].set_xticks(x)
axes[1].set_xticklabels(methods, rotation=45, ha='right')
axes[1].set_ylabel('Size (KB)')
axes[1].set_title('Model Size Comparison')

# Speed
axes[2].bar(x, df_comp['inference_ms'], color=colors)
axes[2].set_xticks(x)
axes[2].set_xticklabels(methods, rotation=45, ha='right')
axes[2].set_ylabel('Inference Time (ms)')
axes[2].set_title('Speed Comparison')

plt.tight_layout()
plt.show()

# Summary
print("\nCompression Summary (vs Baseline):")
for _, row in df_comp.iterrows():
    if row['method'] != 'Baseline':
        print(f"  {row['method']}:")
        print(f"    Accuracy: {row['acc_retention']:.1%} retained")
        print(f"    Size: {row['size_reduction']:.1f}x smaller")
        print(f"    Speed: {row['speedup']:.1f}x faster")
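
When no single method wins on accuracy, size, and speed at once, one option is to keep only the candidates that are not beaten on all three. The sketch below is an assumption about how such a filter might look; `pareto_front` is a hypothetical helper, and the numbers are made up for illustration rather than measured in this tutorial.

```python
def pareto_front(rows):
    """Return rows not dominated on (higher accuracy, lower size_kb, lower inference_ms)."""
    def dominates(a, b):
        no_worse = (a["accuracy"] >= b["accuracy"] and a["size_kb"] <= b["size_kb"]
                    and a["inference_ms"] <= b["inference_ms"])
        better = (a["accuracy"] > b["accuracy"] or a["size_kb"] < b["size_kb"]
                  or a["inference_ms"] < b["inference_ms"])
        return no_worse and better
    return [r for r in rows if not any(dominates(o, r) for o in rows)]

# Made-up numbers for illustration only, not measurements from this tutorial
candidates = [
    {"method": "A", "accuracy": 0.90, "size_kb": 5000, "inference_ms": 20.0},
    {"method": "B", "accuracy": 0.88, "size_kb": 2500, "inference_ms": 11.0},
    {"method": "C", "accuracy": 0.85, "size_kb": 2600, "inference_ms": 12.0},
    {"method": "D", "accuracy": 0.70, "size_kb": 100, "inference_ms": 0.5},
]
print([r["method"] for r in pareto_front(candidates)])
```

Here C is dropped because B is more accurate, smaller, and faster. The same function should work on the comparison table above via `pareto_front(df_comp.to_dict("records"))`.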

---

## 7. Hands-on Exercises <a name="7-exercises"></a>

### Exercise 1: Find Optimal Compression
Find the compression setting with the largest size reduction that retains at least 95% of the baseline accuracy.

In [None]:
# Exercise 1 Solution
print("Exercise 1: Finding Optimal Compression")
print("="*50)

# Test various compression levels
results = []

# Pruning at different levels
for keep in [0.9, 0.7, 0.5, 0.3]:
    pruned = ModelPruning.prune_random_forest(baseline, keep)
    acc = accuracy_score(y_test_large, pruned.predict(X_test_large_scaled))
    retention = acc / baseline_acc
    size_red = baseline_size / (get_model_size(pruned)/1024)
    
    results.append({
        'method': f'Prune {1-keep:.0%}',
        'acc_retention': retention,
        'size_reduction': size_red,
        'meets_target': retention >= 0.95
    })

df_ex = pd.DataFrame(results)
print(df_ex.to_string(index=False))

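# Guard against the case where no setting meets the retention target
qualifying = df_ex[df_ex['meets_target']].sort_values('size_reduction', ascending=False)
if qualifying.empty:
    print("\nNo option retains at least 95% of baseline accuracy.")
else:
    best = qualifying.iloc[0]
    print(f"\nBest option: {best['method']} ({best['size_reduction']:.1f}x size reduction)")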

### Exercise 2: Combine Compression Techniques
Combine distillation and architecture optimization by distilling the baseline random forest into a much simpler model (logistic regression).

In [None]:
# Exercise 2 Solution
print("Exercise 2: Combined Compression")
print("="*50)

# Distill to efficient architecture
efficient_student = LogisticRegression(max_iter=1000, random_state=42)
teacher_preds = baseline.predict(X_train_large_scaled)
efficient_student.fit(X_train_large_scaled, teacher_preds)

combined_acc = accuracy_score(y_test_large, efficient_student.predict(X_test_large_scaled))
combined_size = get_model_size(efficient_student)
combined_time = measure_inference_time(efficient_student, X_test_large_scaled)

print(f"Combined approach (Distillation + LogReg):")
print(f"  Accuracy: {combined_acc:.4f} ({combined_acc/baseline_acc:.1%} retention)")
print(f"  Size: {combined_size/1024:.1f}KB ({baseline_size/(combined_size/1024):.1f}x reduction)")
print(f"  Speed: {combined_time:.2f}ms ({baseline_time/combined_time:.1f}x faster)")

---

## 8. Summary <a name="8-summary"></a>

### Key Takeaways

1. **Knowledge Distillation**: Transfer knowledge from large to small models
   - Temperature controls softness of labels
   - Alpha balances true labels against teacher labels

2. **Pruning**: Remove unnecessary weights
   - Magnitude-based removes small weights
   - Structured pruning is more hardware-efficient

3. **Quantization**: Reduce precision
   - INT8 is 4x smaller than FP32, usually with little accuracy loss
   - INT4 is 8x smaller than FP32 but introduces larger quantization error

4. **Architecture**: Design efficient models
   - Consider accuracy/size/speed trade-offs
   - Smaller models can still be effective

### Best Practices

- Start with architecture optimization
- Apply quantization for easy wins
- Use distillation for maximum compression
- Combine techniques for best results

### Next Steps

- Tutorial 16: Serving and Prediction Pipelines