# Performance Analysis & Optimization

**Comprehensive guide to profiling, benchmarking, and optimizing Ternary Neural Networks**

---

## Table of Contents

1. [Introduction](#introduction)
2. [Model Profiling](#profiling)
3. [Memory Analysis](#memory)
4. [Speed Benchmarks](#speed)
5. [Compression Ratio Analysis](#compression)
6. [Layer-wise Performance](#layerwise)
7. [Hardware Comparison](#hardware)
8. [Optimization Strategies](#optimization)

## 1. Introduction {#introduction}

Performance analysis is critical for understanding the real-world impact of quantization. In this notebook, we'll:

- **Profile** model execution to find bottlenecks
- **Measure** memory usage and model size
- **Benchmark** inference speed across configurations
- **Analyze** compression ratios and trade-offs
- **Optimize** for production deployment

### Performance Metrics

| Metric | Description | Target |
|--------|-------------|--------|
| **Latency** | Inference time per sample | <10ms (edge) |
| **Throughput** | Samples per second | >100 samples/s |
| **Memory** | Peak memory usage | <100MB (mobile) |
| **Model Size** | Serialized model size | <10MB (mobile) |
| **Energy** | Power consumption | <1W (edge) |

## Setup

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import time
import psutil
import os
from pathlib import Path
from tqdm.notebook import tqdm
from collections import defaultdict
import pickle

# Set style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (14, 6)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Device: {device}")
print(f"PyTorch Version: {torch.__version__}")

if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")

## 2. Model Profiling {#profiling}

Let's create tools to profile model execution and identify bottlenecks.

In [None]:
class LayerProfiler:
    """Profile execution time of individual leaf modules via forward hooks.

    Each hook records the time elapsed since the previous hook fired (or since
    the forward pass started), so functional ops such as F.relu are attributed
    to the next module that runs.
    """
    
    def __init__(self, model, input_shape):
        self.model = model
        self.input_shape = input_shape
        self.layer_times = defaultdict(list)
        self.hooks = []
        self._start_time = None
        
    def register_hooks(self):
        """Register forward hooks on all leaf modules"""
        def make_hook(name):
            def hook(module, input, output):
                if device.type == 'cuda':
                    torch.cuda.synchronize()
                # perf_counter is monotonic and has higher resolution than time.time
                end = time.perf_counter()
                if self._start_time is not None:
                    self.layer_times[name].append((end - self._start_time) * 1000)
                self._start_time = end
            return hook
        
        for name, module in self.model.named_modules():
            if len(list(module.children())) == 0 and name:  # Leaf modules
                self.hooks.append(module.register_forward_hook(make_hook(name)))
    
    def profile(self, num_runs=100):
        """Run profiling"""
        self.layer_times.clear()  # Discard timings from any previous call
        self.register_hooks()
        self.model.eval()
        
        try:
            with torch.no_grad():
                for _ in tqdm(range(num_runs), desc="Profiling"):
                    x = torch.randn(*self.input_shape).to(device)
                    if device.type == 'cuda':
                        torch.cuda.synchronize()
                    self._start_time = time.perf_counter()
                    _ = self.model(x)
        finally:
            # Remove hooks even if the forward pass raises
            for hook in self.hooks:
                hook.remove()
            self.hooks.clear()
        
        return self.get_summary()
    
    def get_summary(self):
        """Get profiling summary"""
        summary = []
        for name, times in self.layer_times.items():
            summary.append({
                'Layer': name,
                'Mean (ms)': np.mean(times),
                'Std (ms)': np.std(times),
                'Min (ms)': np.min(times),
                'Max (ms)': np.max(times)
            })
        return pd.DataFrame(summary).sort_values('Mean (ms)', ascending=False)


# Create a test model
class TestModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 512)
        self.fc2 = nn.Linear(512, 256)
        self.fc3 = nn.Linear(256, 128)
        self.fc4 = nn.Linear(128, 10)
    
    def forward(self, x):
        x = x.view(-1, 784)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = F.relu(self.fc3(x))
        x = self.fc4(x)
        return x


# Profile the model
model = TestModel().to(device)
profiler = LayerProfiler(model, (8, 1, 28, 28))
profile_df = profiler.profile(num_runs=50)

# Visualize
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 5))

# Mean execution time
ax1.barh(profile_df['Layer'], profile_df['Mean (ms)'], 
        color='steelblue', alpha=0.7, edgecolor='black')
ax1.set_xlabel('Mean Execution Time (ms)', fontsize=12)
ax1.set_title('Layer Execution Time', fontsize=14, fontweight='bold')
ax1.grid(True, alpha=0.3)

# Execution time with std
layers = range(len(profile_df))
ax2.barh(layers, profile_df['Mean (ms)'], 
        xerr=profile_df['Std (ms)'],
        color='green', alpha=0.7, edgecolor='black', capsize=5)
ax2.set_yticks(layers)
ax2.set_yticklabels(profile_df['Layer'])
ax2.set_xlabel('Execution Time (ms)', fontsize=12)
ax2.set_title('Layer Execution Time (with std)', fontsize=14, fontweight='bold')
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\nProfiling Summary:")
print("="*80)
print(profile_df.to_string(index=False))
print("="*80)
print(f"Total Time: {profile_df['Mean (ms)'].sum():.3f} ms")

## 3. Memory Analysis {#memory}

Analyze memory usage patterns and model size.

In [None]:
class MemoryAnalyzer:
    """Analyze model memory usage"""
    
    @staticmethod
    def get_model_size(model, bits_per_param=32):
        """Calculate model size in MB"""
        total_params = sum(p.numel() for p in model.parameters())
        size_bits = total_params * bits_per_param
        size_mb = size_bits / 8 / (1024 ** 2)
        return size_mb, total_params
    
    @staticmethod
    def get_layer_sizes(model, bits_per_param=32):
        """Get size of each layer"""
        layer_info = []
        for name, module in model.named_modules():
            if len(list(module.parameters())) > 0:
                params = sum(p.numel() for p in module.parameters())
                size_mb = params * bits_per_param / 8 / (1024 ** 2)
                layer_info.append({
                    'Layer': name if name else 'root',
                    'Parameters': params,
                    'Size (MB)': size_mb
                })
        return pd.DataFrame(layer_info)
    
    @staticmethod
    def measure_inference_memory(model, input_shape, device):
        """Measure peak memory during inference"""
        model.eval()
        
        if device.type == 'cuda':
            torch.cuda.reset_peak_memory_stats()
            torch.cuda.synchronize()
            
            with torch.no_grad():
                x = torch.randn(*input_shape).to(device)
                _ = model(x)
                torch.cuda.synchronize()
            
            peak_memory = torch.cuda.max_memory_allocated() / (1024 ** 2)
            return peak_memory
        else:
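            # CPU memory tracking (coarse: the RSS delta can be zero or negative
            # when the allocator reuses memory it already holds)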
            process = psutil.Process(os.getpid())
            mem_before = process.memory_info().rss / (1024 ** 2)
            
            with torch.no_grad():
                x = torch.randn(*input_shape)
                _ = model(x)
            
            mem_after = process.memory_info().rss / (1024 ** 2)
            return mem_after - mem_before


# Analyze different model configurations
analyzer = MemoryAnalyzer()

# Float32 model
model_fp32 = TestModel().to(device)
size_fp32, params_fp32 = analyzer.get_model_size(model_fp32, bits_per_param=32)
mem_fp32 = analyzer.measure_inference_memory(model_fp32, (8, 1, 28, 28), device)

# Simulate INT8
size_int8, _ = analyzer.get_model_size(model_fp32, bits_per_param=8)

# Simulate Ternary
size_ternary, _ = analyzer.get_model_size(model_fp32, bits_per_param=2)

# Create comparison
# NOTE: only the Float32 peak memory is measured. The INT8 and Ternary values
# are rough placeholders (0.4x and 0.2x of Float32), not measurements.
comparison = pd.DataFrame([
    {'Model': 'Float32', 'Size (MB)': size_fp32, 'Peak Memory (MB)': mem_fp32, 'Parameters': params_fp32},
    {'Model': 'INT8', 'Size (MB)': size_int8, 'Peak Memory (MB)': mem_fp32 * 0.4, 'Parameters': params_fp32},
    {'Model': 'Ternary', 'Size (MB)': size_ternary, 'Peak Memory (MB)': mem_fp32 * 0.2, 'Parameters': params_fp32}
])

# Visualize
fig, axes = plt.subplots(2, 2, figsize=(16, 10))

# Model size comparison
colors = ['blue', 'green', 'orange']
axes[0, 0].bar(comparison['Model'], comparison['Size (MB)'], 
              color=colors, alpha=0.7, edgecolor='black', linewidth=2)
axes[0, 0].set_ylabel('Size (MB)', fontsize=12)
axes[0, 0].set_title('Model Size Comparison', fontsize=14, fontweight='bold')
axes[0, 0].grid(True, alpha=0.3)
for i, row in comparison.iterrows():
    axes[0, 0].text(i, row['Size (MB)'], f"{row['Size (MB)']:.2f}",
                   ha='center', va='bottom', fontweight='bold')

# Memory usage
axes[0, 1].bar(comparison['Model'], comparison['Peak Memory (MB)'],
              color=colors, alpha=0.7, edgecolor='black', linewidth=2)
axes[0, 1].set_ylabel('Peak Memory (MB)', fontsize=12)
axes[0, 1].set_title('Peak Memory Usage', fontsize=14, fontweight='bold')
axes[0, 1].grid(True, alpha=0.3)

# Compression ratio
compression_ratios = [1, size_fp32/size_int8, size_fp32/size_ternary]
axes[1, 0].bar(comparison['Model'], compression_ratios,
              color=colors, alpha=0.7, edgecolor='black', linewidth=2)
axes[1, 0].set_ylabel('Compression Ratio (x)', fontsize=12)
axes[1, 0].set_title('Compression Ratio', fontsize=14, fontweight='bold')
axes[1, 0].grid(True, alpha=0.3)
for i, ratio in enumerate(compression_ratios):
    axes[1, 0].text(i, ratio, f"{ratio:.1f}x",
                   ha='center', va='bottom', fontweight='bold', fontsize=14)

# Layer-wise size breakdown
layer_sizes = analyzer.get_layer_sizes(model_fp32, bits_per_param=32)
layer_sizes = layer_sizes[layer_sizes['Layer'] != 'root'].head(10)
axes[1, 1].barh(layer_sizes['Layer'], layer_sizes['Size (MB)'],
               color='steelblue', alpha=0.7, edgecolor='black')
axes[1, 1].set_xlabel('Size (MB)', fontsize=12)
axes[1, 1].set_title('Layer Size Breakdown', fontsize=14, fontweight='bold')
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\nMemory Analysis Summary:")
print("="*80)
print(comparison.to_string(index=False))
print("="*80)
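
The `bits_per_param=2` estimate used for the ternary model above assumes that each weight in {-1, 0, +1} occupies a 2-bit field, so four weights fit in one byte. The sketch below shows one way such packing could work in plain Python. The function names `pack_ternary` and `unpack_ternary` and the bit mapping are illustrative assumptions, not part of any library used in this notebook.

```python
# Hypothetical 2-bit packing for ternary weights: four weights per byte.
# Mapping (an assumption for this sketch): -1 -> 0b00, 0 -> 0b01, +1 -> 0b10.

def pack_ternary(weights):
    """Pack a sequence of -1/0/+1 values into bytes, 4 weights per byte."""
    packed = bytearray((len(weights) + 3) // 4)
    for i, w in enumerate(weights):
        if w not in (-1, 0, 1):
            raise ValueError(f"not a ternary weight: {w!r}")
        code = w + 1  # -1, 0, +1 -> 0, 1, 2
        packed[i // 4] |= code << (2 * (i % 4))
    return bytes(packed)


def unpack_ternary(packed, count):
    """Recover the first `count` ternary weights from packed bytes."""
    return [((packed[i // 4] >> (2 * (i % 4))) & 0b11) - 1 for i in range(count)]


weights = [1, 0, -1, -1, 0, 1, 1, 0, -1]
packed = pack_ternary(weights)
restored = unpack_ternary(packed, len(weights))

print(len(weights) * 4, "bytes as Float32 ->", len(packed), "bytes packed")
print("round trip ok:", restored == weights)
```

Unused fields in the last byte decode as -1 under this mapping, which is why `unpack_ternary` needs the original weight count.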

## 4. Speed Benchmarks {#speed}

Comprehensive speed benchmarks across different configurations.

In [None]:
class SpeedBenchmark:
    """Benchmark inference speed"""
    
    @staticmethod
    def benchmark_inference(model, input_shape, batch_sizes=(1, 8, 16, 32, 64),
                          num_runs=100, warmup=10):
        """Benchmark inference at different batch sizes"""
        model.eval()
        results = []
        
        for batch_size in batch_sizes:
            input_size = (batch_size,) + input_shape[1:]
            
            # Warmup
            with torch.no_grad():
                for _ in range(warmup):
                    x = torch.randn(*input_size).to(device)
                    _ = model(x)
            
            # Benchmark
            times = []
            with torch.no_grad():
                for _ in range(num_runs):
                    x = torch.randn(*input_size).to(device)
                    
                    if device.type == 'cuda':
                        torch.cuda.synchronize()
                    # perf_counter is the high-resolution clock intended for timing
                    start = time.perf_counter()
                    
                    _ = model(x)
                    
                    if device.type == 'cuda':
                        torch.cuda.synchronize()
                    end = time.perf_counter()
                    
                    times.append((end - start) * 1000)  # ms
            
            mean_time = np.mean(times)
            std_time = np.std(times)
            throughput = batch_size * 1000 / mean_time  # samples/sec
            
            results.append({
                'Batch Size': batch_size,
                'Mean (ms)': mean_time,
                'Std (ms)': std_time,
                'Throughput (samples/s)': throughput,
                'Latency per sample (ms)': mean_time / batch_size
            })
        
        return pd.DataFrame(results)
    
    @staticmethod
    def compare_models(models_dict, input_shape, batch_size=32, num_runs=100):
        """Compare inference speed across different models"""
        results = []
        
        for name, model in models_dict.items():
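            model.eval()
            input_size = (batch_size,) + input_shape[1:]
            
            # Warmup so one-time costs (allocation, kernel selection) are not timed
            with torch.no_grad():
                for _ in range(10):
                    _ = model(torch.randn(*input_size).to(device))
            
            times = []
            with torch.no_grad():
                for _ in range(num_runs):
                    x = torch.randn(*input_size).to(device)
                    
                    if device.type == 'cuda':
                        torch.cuda.synchronize()
                    start = time.perf_counter()
                    
                    _ = model(x)
                    
                    if device.type == 'cuda':
                        torch.cuda.synchronize()
                    end = time.perf_counter()
                    
                    times.append((end - start) * 1000)  # ms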
            
            results.append({
                'Model': name,
                'Mean (ms)': np.mean(times),
                'Std (ms)': np.std(times),
                'Throughput (samples/s)': batch_size * 1000 / np.mean(times)
            })
        
        return pd.DataFrame(results)


# Benchmark
benchmark = SpeedBenchmark()

# Benchmark at different batch sizes
print("Benchmarking at different batch sizes...")
batch_results = benchmark.benchmark_inference(model_fp32, (8, 1, 28, 28), 
                                              batch_sizes=[1, 4, 8, 16, 32])

# Visualize
fig, axes = plt.subplots(2, 2, figsize=(16, 10))

# Latency vs batch size
axes[0, 0].plot(batch_results['Batch Size'], batch_results['Mean (ms)'], 
               'o-', linewidth=2, markersize=8, color='steelblue')
axes[0, 0].fill_between(batch_results['Batch Size'],
                        batch_results['Mean (ms)'] - batch_results['Std (ms)'],
                        batch_results['Mean (ms)'] + batch_results['Std (ms)'],
                        alpha=0.3)
axes[0, 0].set_xlabel('Batch Size', fontsize=12)
axes[0, 0].set_ylabel('Latency (ms)', fontsize=12)
axes[0, 0].set_title('Batch Size vs Latency', fontsize=14, fontweight='bold')
axes[0, 0].grid(True, alpha=0.3)

# Throughput vs batch size
axes[0, 1].plot(batch_results['Batch Size'], batch_results['Throughput (samples/s)'],
               's-', linewidth=2, markersize=8, color='green')
axes[0, 1].set_xlabel('Batch Size', fontsize=12)
axes[0, 1].set_ylabel('Throughput (samples/s)', fontsize=12)
axes[0, 1].set_title('Batch Size vs Throughput', fontsize=14, fontweight='bold')
axes[0, 1].grid(True, alpha=0.3)

# Latency per sample
axes[1, 0].bar(batch_results['Batch Size'].astype(str), 
              batch_results['Latency per sample (ms)'],
              color='orange', alpha=0.7, edgecolor='black', linewidth=2)
axes[1, 0].set_xlabel('Batch Size', fontsize=12)
axes[1, 0].set_ylabel('Latency per Sample (ms)', fontsize=12)
axes[1, 0].set_title('Per-Sample Latency', fontsize=14, fontweight='bold')
axes[1, 0].grid(True, alpha=0.3)

# Batches per second (throughput / batch size = 1000 / mean batch time in ms)
efficiency = batch_results['Throughput (samples/s)'] / batch_results['Batch Size']
axes[1, 1].plot(batch_results['Batch Size'], efficiency,
               '^-', linewidth=2, markersize=8, color='purple')
axes[1, 1].set_xlabel('Batch Size', fontsize=12)
axes[1, 1].set_ylabel('Batches per second', fontsize=12)
axes[1, 1].set_title('Batching Efficiency', fontsize=14, fontweight='bold')
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\nSpeed Benchmark Results:")
print("="*80)
print(batch_results.to_string(index=False, float_format=lambda x: f'{x:.3f}'))
print("="*80)
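
The benchmark above reports the mean and standard deviation, which can hide tail latency. Below is a minimal sketch of summarizing raw timings with percentiles, using only the standard library. `summarize_latencies` is a hypothetical helper, and the sample values are made up for illustration.

```python
import statistics


def summarize_latencies(times_ms):
    """Return mean, median, and tail percentiles for latencies in ms."""
    if len(times_ms) < 2:
        raise ValueError("need at least two samples")
    # quantiles(n=100) returns the 99 cut points p1..p99
    cuts = statistics.quantiles(times_ms, n=100, method="inclusive")
    return {
        "mean": statistics.fmean(times_ms),
        "p50": statistics.median(times_ms),
        "p95": cuts[94],
        "p99": cuts[98],
        "max": max(times_ms),
    }


# Made-up timings: mostly 1 ms, with one mildly slow and one very slow run
sample = [1.0] * 98 + [1.2, 9.0]
stats = summarize_latencies(sample)
for key, value in stats.items():
    print(f"{key}: {value:.3f} ms")
```

The per-batch `times` list collected inside `benchmark_inference` could be passed to a helper like this to report p95 and p99 alongside the mean.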

## 5. Compression Ratio Analysis {#compression}

Detailed analysis of compression ratios and their impact.

In [None]:
def analyze_compression(model, quantization_methods):
    """
    Analyze compression ratios for different quantization methods
    
    Args:
        model: Base model
        quantization_methods: Dict of {name: bits_per_param}
    """
    results = []
    
    # Get base size (Float32)
    base_size, total_params = MemoryAnalyzer.get_model_size(model, bits_per_param=32)
    
    for method_name, bits in quantization_methods.items():
        quant_size, _ = MemoryAnalyzer.get_model_size(model, bits_per_param=bits)
        compression = base_size / quant_size
        
        results.append({
            'Method': method_name,
            'Bits/Param': bits,
            'Size (MB)': quant_size,
            'Compression': compression,
            'Savings (%)': (1 - quant_size/base_size) * 100,
            'Parameters': total_params
        })
    
    return pd.DataFrame(results)


# Analyze different quantization methods
methods = {
    'Float32': 32,
    'Float16': 16,
    'INT8': 8,
    'INT4': 4,
    'Ternary': 2,
    'Binary': 1
}

compression_df = analyze_compression(model_fp32, methods)

# Visualize
fig, axes = plt.subplots(2, 2, figsize=(16, 10))

# Model size
colors_grad = plt.cm.RdYlGn_r(np.linspace(0.2, 0.8, len(compression_df)))
axes[0, 0].bar(compression_df['Method'], compression_df['Size (MB)'],
              color=colors_grad, edgecolor='black', linewidth=2)
axes[0, 0].set_ylabel('Size (MB)', fontsize=12)
axes[0, 0].set_title('Model Size by Quantization', fontsize=14, fontweight='bold')
axes[0, 0].tick_params(axis='x', rotation=45)
axes[0, 0].grid(True, alpha=0.3)

# Compression ratio
colors_grad2 = plt.cm.Greens(np.linspace(0.3, 0.9, len(compression_df)))
axes[0, 1].bar(compression_df['Method'], compression_df['Compression'],
              color=colors_grad2, edgecolor='black', linewidth=2)
axes[0, 1].set_ylabel('Compression Ratio (x)', fontsize=12)
axes[0, 1].set_title('Compression Ratio', fontsize=14, fontweight='bold')
axes[0, 1].tick_params(axis='x', rotation=45)
axes[0, 1].grid(True, alpha=0.3)
for i, row in compression_df.iterrows():
    axes[0, 1].text(i, row['Compression'], f"{row['Compression']:.1f}x",
                   ha='center', va='bottom', fontweight='bold')

# Storage savings
axes[1, 0].bar(compression_df['Method'], compression_df['Savings (%)'],
              color='steelblue', alpha=0.7, edgecolor='black', linewidth=2)
axes[1, 0].set_ylabel('Storage Savings (%)', fontsize=12)
axes[1, 0].set_title('Storage Savings', fontsize=14, fontweight='bold')
axes[1, 0].tick_params(axis='x', rotation=45)
axes[1, 0].grid(True, alpha=0.3)

# Bits per parameter vs size
axes[1, 1].plot(compression_df['Bits/Param'], compression_df['Size (MB)'],
               'o-', linewidth=3, markersize=10, color='purple')
axes[1, 1].set_xlabel('Bits per Parameter', fontsize=12)
axes[1, 1].set_ylabel('Model Size (MB)', fontsize=12)
axes[1, 1].set_title('Bits/Param vs Size', fontsize=14, fontweight='bold')
axes[1, 1].grid(True, alpha=0.3)
axes[1, 1].set_xscale('log', base=2)

plt.tight_layout()
plt.show()

print("\nCompression Analysis:")
print("="*90)
print(compression_df.to_string(index=False))
print("="*90)
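
The ratios above treat every parameter as quantized. Ternary schemes commonly keep biases at higher precision and store a scale factor alongside the weights, which lowers the effective ratio slightly. The sketch below estimates this for the `TestModel` layer shapes. Keeping Float32 biases and one Float32 scale per layer is an assumption for illustration, not something this notebook implements.

```python
# Layer shapes (in_features, out_features) copied from TestModel
LAYERS = [(784, 512), (512, 256), (256, 128), (128, 10)]


def effective_size_bytes(layers, weight_bits, bias_bits=32, scale_bits=32):
    """Estimated size in bytes when only the weights are quantized.

    Assumes (hypothetically) Float32 biases and one Float32 scale per
    quantized layer.
    """
    total_bits = 0
    for in_f, out_f in layers:
        total_bits += in_f * out_f * weight_bits  # quantized weights
        total_bits += out_f * bias_bits           # biases kept at higher precision
        if weight_bits < 32:
            total_bits += scale_bits              # per-layer scale factor
    return total_bits / 8


fp32_bytes = effective_size_bytes(LAYERS, weight_bits=32)
ternary_bytes = effective_size_bytes(LAYERS, weight_bits=2)
ratio = fp32_bytes / ternary_bytes

print(f"Float32: {fp32_bytes / 1024**2:.3f} MB")
print(f"Ternary (2-bit weights): {ternary_bytes / 1024**2:.3f} MB")
print(f"Effective compression: {ratio:.2f}x")
```

For this small model the overhead is minor because the biases are a tiny share of the parameters. It becomes more visible with per-channel scales or very narrow layers.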

## 6. Layer-wise Performance {#layerwise}

Analyze performance characteristics of individual layers.

In [None]:
def analyze_layer_performance(model, input_shape):
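    """
    Per-sample, weight-only analysis of each nn.Linear layer.

    Biases are ignored and FLOPs are counted for a single input sample
    (input_shape is currently unused). Because FLOPs and weight size both
    scale with in_features * out_features, the 'Arithmetic Intensity' column
    is the same constant (512 FLOPs/KB for Float32) for every Linear layer.
    """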
    results = []
    
    for name, module in model.named_modules():
        if isinstance(module, nn.Linear):
            # Get layer properties
            in_features = module.in_features
            out_features = module.out_features
            params = in_features * out_features
            
            # Calculate FLOPs (multiply-accumulate operations)
            flops = 2 * in_features * out_features  # MAC = 2 ops
            
            # Memory footprint
            weight_size = params * 4 / 1024  # KB (Float32)
            
            results.append({
                'Layer': name,
                'Type': 'Linear',
                'Input': in_features,
                'Output': out_features,
                'Parameters': params,
                'FLOPs': flops,
                'Size (KB)': weight_size,
                'Arithmetic Intensity': flops / weight_size if weight_size > 0 else 0
            })
    
    return pd.DataFrame(results)


# Analyze
layer_perf = analyze_layer_performance(model_fp32, (8, 1, 28, 28))

# Visualize
fig, axes = plt.subplots(2, 2, figsize=(16, 10))

# Parameters per layer
axes[0, 0].barh(layer_perf['Layer'], layer_perf['Parameters'],
               color='steelblue', alpha=0.7, edgecolor='black')
axes[0, 0].set_xlabel('Parameters', fontsize=12)
axes[0, 0].set_title('Parameters per Layer', fontsize=14, fontweight='bold')
axes[0, 0].grid(True, alpha=0.3)

# FLOPs per layer
axes[0, 1].barh(layer_perf['Layer'], layer_perf['FLOPs'] / 1e6,
               color='green', alpha=0.7, edgecolor='black')
axes[0, 1].set_xlabel('FLOPs (millions)', fontsize=12)
axes[0, 1].set_title('FLOPs per Layer', fontsize=14, fontweight='bold')
axes[0, 1].grid(True, alpha=0.3)

# Memory footprint
axes[1, 0].barh(layer_perf['Layer'], layer_perf['Size (KB)'],
               color='orange', alpha=0.7, edgecolor='black')
axes[1, 0].set_xlabel('Size (KB)', fontsize=12)
axes[1, 0].set_title('Memory Footprint', fontsize=14, fontweight='bold')
axes[1, 0].grid(True, alpha=0.3)

# Arithmetic intensity
axes[1, 1].barh(layer_perf['Layer'], layer_perf['Arithmetic Intensity'],
               color='purple', alpha=0.7, edgecolor='black')
axes[1, 1].set_xlabel('Arithmetic Intensity (FLOPs/KB)', fontsize=12)
axes[1, 1].set_title('Arithmetic Intensity', fontsize=14, fontweight='bold')
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\nLayer-wise Performance:")
print("="*100)
print(layer_perf.to_string(index=False))
print("="*100)
print(f"\nTotal Parameters: {layer_perf['Parameters'].sum():,}")
print(f"Total FLOPs: {layer_perf['FLOPs'].sum() / 1e6:.2f} million")
print(f"Total Size: {layer_perf['Size (KB)'].sum():.2f} KB")
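
Arithmetic intensity is more commonly expressed as FLOPs per byte of memory traffic. For a linear layer it depends on batch size, because the weights are read once per batch while the FLOPs grow with every sample. Here is a rough sketch under a simple traffic model in which weights, inputs, and outputs are each moved once. The function name and the traffic model are assumptions for illustration.

```python
def linear_arithmetic_intensity(in_features, out_features, batch_size,
                                bytes_per_weight=4.0, bytes_per_activation=4.0):
    """FLOPs per byte for y = x @ W.T under a simple one-pass traffic model."""
    flops = 2 * batch_size * in_features * out_features  # 1 MAC = 2 FLOPs
    weight_bytes = in_features * out_features * bytes_per_weight
    activation_bytes = batch_size * (in_features + out_features) * bytes_per_activation
    return flops / (weight_bytes + activation_bytes)


for batch in (1, 8, 32, 128):
    fp32 = linear_arithmetic_intensity(784, 512, batch)
    # 2-bit weights = 0.25 bytes each; activations assumed to stay Float32
    ternary = linear_arithmetic_intensity(784, 512, batch, bytes_per_weight=0.25)
    print(f"batch={batch:>3}  fp32={fp32:8.2f}  ternary-weights={ternary:8.2f} FLOPs/byte")
```

Under this model, both larger batches and smaller weights raise the intensity, which is one reason quantized layers are less likely to be memory-bound.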

## 7. Hardware Comparison {#hardware}

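Compare relative speedups across different hardware configurations. The values in the cell below are hard-coded, simulated figures rather than measurements; replace them with benchmarks from your own target devices.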

In [None]:
# Simulate hardware performance characteristics
hardware_configs = {
    'Intel i7 (CPU)': {'float32': 1.0, 'int8': 1.5, 'ternary': 2.0},
    'NVIDIA RTX 3080': {'float32': 1.0, 'int8': 2.5, 'ternary': 4.0},
    'ARM Cortex-A78': {'float32': 1.0, 'int8': 2.0, 'ternary': 3.5},
    'Google Edge TPU': {'float32': 1.0, 'int8': 8.0, 'ternary': 10.0}
}

# Create comparison DataFrame
hw_data = []
for hw, speedups in hardware_configs.items():
    for quant, speedup in speedups.items():
        hw_data.append({
            'Hardware': hw,
            'Quantization': quant.upper(),
            'Speedup': speedup
        })

hw_df = pd.DataFrame(hw_data)

# Pivot for better visualization
hw_pivot = hw_df.pivot(index='Hardware', columns='Quantization', values='Speedup')

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Grouped bar chart
hw_pivot.plot(kind='bar', ax=axes[0], width=0.8, edgecolor='black', linewidth=1.5)
axes[0].set_ylabel('Speedup (x)', fontsize=12)
axes[0].set_title('Quantization Speedup by Hardware', fontsize=14, fontweight='bold')
axes[0].set_xlabel('')
axes[0].legend(title='Quantization', fontsize=10)
axes[0].tick_params(axis='x', rotation=45)
axes[0].grid(True, alpha=0.3)

# Heatmap
sns.heatmap(hw_pivot, annot=True, fmt='.1f', cmap='YlGnBu', 
           linewidths=1, ax=axes[1], cbar_kws={'label': 'Speedup (x)'})
axes[1].set_title('Speedup Heatmap', fontsize=14, fontweight='bold')
axes[1].set_xlabel('Quantization', fontsize=12)
axes[1].set_ylabel('Hardware', fontsize=12)

plt.tight_layout()
plt.show()

print("\nHardware Performance Comparison:")
print("="*60)
print(hw_pivot)
print("="*60)
print("\nNote: Values are relative speedups compared to Float32 baseline")

## 8. Optimization Strategies {#optimization}

Practical optimization strategies for production deployment.

In [None]:
# Create optimization checklist
optimization_guide = pd.DataFrame([
    {
        'Category': 'Quantization',
        'Optimization': 'Use INT8 for general purpose',
        'Impact': 'High',
        'Difficulty': 'Low',
        'Speedup': '2-4x',
        'Accuracy Loss': '<1%'
    },
    {
        'Category': 'Quantization',
        'Optimization': 'Use Ternary for edge devices',
        'Impact': 'Very High',
        'Difficulty': 'Medium',
        'Speedup': '4-8x',
        'Accuracy Loss': '2-4%'
    },
    {
        'Category': 'Architecture',
        'Optimization': 'Mixed precision layers',
        'Impact': 'High',
        'Difficulty': 'Medium',
        'Speedup': '3-5x',
        'Accuracy Loss': '1-2%'
    },
    {
        'Category': 'Batching',
        'Optimization': 'Increase batch size',
        'Impact': 'Medium',
        'Difficulty': 'Low',
        'Speedup': '1.5-3x',
        'Accuracy Loss': '0%'
    },
    {
        'Category': 'Pruning',
        'Optimization': 'Remove zero weights',
        'Impact': 'Medium',
        'Difficulty': 'Medium',
        'Speedup': '1.5-2x',
        'Accuracy Loss': '<1%'
    },
    {
        'Category': 'Hardware',
        'Optimization': 'Use specialized accelerators',
        'Impact': 'Very High',
        'Difficulty': 'Low',
        'Speedup': '10-100x',
        'Accuracy Loss': '0%'
    }
])

# Visualize
fig, axes = plt.subplots(2, 1, figsize=(16, 10))

# Impact vs Difficulty scatter
impact_map = {'Low': 1, 'Medium': 2, 'High': 3, 'Very High': 4}
optimization_guide['Impact_Score'] = optimization_guide['Impact'].map(impact_map)
optimization_guide['Difficulty_Score'] = optimization_guide['Difficulty'].map(impact_map)

colors = {'Quantization': 'blue', 'Architecture': 'green', 
         'Batching': 'orange', 'Pruning': 'purple', 'Hardware': 'red'}

for category in optimization_guide['Category'].unique():
    subset = optimization_guide[optimization_guide['Category'] == category]
    axes[0].scatter(subset['Difficulty_Score'], subset['Impact_Score'],
                   s=300, alpha=0.6, label=category, color=colors[category],
                   edgecolors='black', linewidth=2)

axes[0].set_xlabel('Difficulty', fontsize=12)
axes[0].set_ylabel('Impact', fontsize=12)
axes[0].set_title('Optimization Impact vs Difficulty', fontsize=14, fontweight='bold')
axes[0].set_xticks([1, 2, 3, 4])
axes[0].set_xticklabels(['Low', 'Medium', 'High', 'Very High'])
axes[0].set_yticks([1, 2, 3, 4])
axes[0].set_yticklabels(['Low', 'Medium', 'High', 'Very High'])
axes[0].legend(fontsize=11)
axes[0].grid(True, alpha=0.3)

# Add annotations
for idx, row in optimization_guide.iterrows():
    axes[0].annotate(row['Speedup'], 
                    (row['Difficulty_Score'], row['Impact_Score']),
                    textcoords="offset points", xytext=(0,10),
                    ha='center', fontweight='bold', fontsize=9)

# Optimization table
axes[1].axis('tight')
axes[1].axis('off')
table = axes[1].table(cellText=optimization_guide.values[:, :-2],
                     colLabels=optimization_guide.columns[:-2],
                     cellLoc='left', loc='center',
                     colWidths=[0.15, 0.3, 0.1, 0.1, 0.1, 0.15])
table.auto_set_font_size(False)
table.set_fontsize(10)
table.scale(1, 2.5)

# Style header
for i in range(len(optimization_guide.columns) - 2):
    table[(0, i)].set_facecolor('#4CAF50')
    table[(0, i)].set_text_props(weight='bold', color='white')

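# Color code by category. Named colors need an explicit alpha:
# appending '30' to a name such as 'blue' does not give a valid color.
from matplotlib.colors import to_rgba
for i, row in optimization_guide.iterrows():
    table[(i+1, 0)].set_facecolor(to_rgba(colors[row['Category']], alpha=0.2))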

plt.tight_layout()
plt.show()

print("\n" + "="*100)
print("OPTIMIZATION STRATEGIES GUIDE")
print("="*100)
print(optimization_guide.to_string(index=False, columns=optimization_guide.columns[:-2]))
print("="*100)

### Production Deployment Checklist

✅ **Pre-Deployment:**
1. Benchmark on target hardware
2. Profile memory usage
3. Test accuracy on validation set
4. Measure end-to-end latency
5. Verify model serialization

✅ **Optimization:**
1. Choose appropriate quantization (INT8/Ternary)
2. Optimize batch size for throughput
3. Consider mixed precision for accuracy
4. Enable hardware-specific optimizations
5. Implement model caching

✅ **Monitoring:**
1. Track inference latency
2. Monitor memory usage
3. Log accuracy metrics
4. Profile periodically
5. A/B test quantized vs baseline

### Performance Targets by Device

| Device Type | Latency | Memory | Model Size | Quantization |
|-------------|---------|--------|------------|-------------|
| **Server** | <50ms | <1GB | <100MB | INT8 |
| **Desktop** | <100ms | <500MB | <50MB | INT8/Ternary |
| **Mobile** | <200ms | <100MB | <10MB | Ternary |
| **Edge** | <500ms | <50MB | <5MB | Ternary/Binary |
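
The targets above can be turned into a simple automated check. The sketch below transcribes the latency, memory, and model size limits from the table and compares a set of measurements against them. `TARGETS`, `check_targets`, and the example measurements are hypothetical names and values for illustration, and 1GB is taken as 1024 MB.

```python
# Limits transcribed from the table above (latency in ms, memory and size in MB).
# Assumption: "<1GB" is interpreted as 1024 MB.
TARGETS = {
    "server":  {"latency_ms": 50,  "memory_mb": 1024, "size_mb": 100},
    "desktop": {"latency_ms": 100, "memory_mb": 500,  "size_mb": 50},
    "mobile":  {"latency_ms": 200, "memory_mb": 100,  "size_mb": 10},
    "edge":    {"latency_ms": 500, "memory_mb": 50,   "size_mb": 5},
}


def check_targets(device_type, measured):
    """Return {metric: bool}; True means the measurement is under the limit."""
    limits = TARGETS[device_type]
    return {metric: measured[metric] < limit for metric, limit in limits.items()}


# Made-up measurements for illustration only
measured = {"latency_ms": 12.5, "memory_mb": 64.0, "size_mb": 8.2}
for device_type in TARGETS:
    result = check_targets(device_type, measured)
    failed = [metric for metric, ok in result.items() if not ok]
    status = "OK" if not failed else "exceeds: " + ", ".join(failed)
    print(f"{device_type:8s} {status}")
```

A check like this could run as part of the pre-deployment steps, fed with the outputs of the memory and speed benchmarks.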

---

## Summary

In this notebook, you learned:

✅ **Profiling** models to identify bottlenecks  
✅ **Memory analysis** techniques for optimization  
✅ **Speed benchmarking** across configurations  
✅ **Compression ratio** analysis and trade-offs  
✅ **Layer-wise performance** characteristics  
✅ **Hardware comparison** for deployment decisions  
✅ **Optimization strategies** for production  

### Key Takeaways:

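1. **Ternary quantization** stored at 2 bits per weight gives 16x compression over Float32; the 2-4% accuracy loss is the estimate from the optimization guide, not a result measured in this notebook
2. **Batch size** strongly affects throughput: larger batches lower the amortized per-sample time but increase the latency of each batch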
3. **Hardware accelerators** can provide 10-100x speedups for quantized models
4. **Mixed precision** often offers the best accuracy-performance trade-off
5. **Profile before optimizing** - measure to understand bottlenecks

### Additional Resources:

- 📘 [01_introduction.ipynb](01_introduction.ipynb) - Getting started with Triton DSL
- 📘 [02_quantization_tutorial.ipynb](02_quantization_tutorial.ipynb) - Deep dive into quantization
- 📄 [Training Examples](../../examples/training/) - Production training scripts
- 📄 [Benchmarks](../../models/benchmarks/) - Detailed benchmark results

---

*Triton DSL - High-Performance Neural Network Optimization*