# Notebook 30: Advanced Latency Analysis

## Inference Engineering Course

---

## Overview

In production systems, **average latency is a lie**. What matters is the **tail latency** -- the worst-case experience that a significant fraction of your users will encounter. A service with 100ms average latency but 2-second P99 latency will feel broken for 1% of requests.

### Why Percentiles Matter

```
Average:  "Our API responds in 50ms"   ← Looks great!
P50:      "Half of requests take >45ms" ← OK
P90:      "10% of requests take >200ms" ← That's 1 in 10!
P99:      "1% of requests take >1500ms" ← Terrible experience
P99.9:    "0.1% take >5000ms"           ← Timeouts!
```
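
To make the gap between the average and the tail concrete, here is a minimal sketch using a made-up sample (985 requests at 70 ms and 15 at 2000 ms; these numbers are illustrative assumptions, not measurements). The mean lands near 100 ms while P99 sits at the slow value.

```python
import numpy as np

# Hypothetical sample: 985 fast requests at 70 ms and 15 slow ones at 2000 ms.
latencies_ms = np.concatenate([np.full(985, 70.0), np.full(15, 2000.0)])

mean_ms = latencies_ms.mean()
p50_ms, p99_ms = np.percentile(latencies_ms, [50, 99])

# The mean hides the slow requests; P99 exposes them.
print(f"mean={mean_ms:.1f}ms  P50={p50_ms:.0f}ms  P99={p99_ms:.0f}ms")
```

The average suggests a healthy service, yet more than 1 in 100 requests takes two seconds.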

### What You'll Learn

| Topic | Description |
|-------|-------------|
| Concurrent Request Testing | Send many parallel requests to an API |
| Latency Distributions | Analyze and characterize latency patterns |
| Percentile Calculation | Compute P50, P90, P95, P99 |
| Tail Latency Amplification | How microservices multiply tail latency |
| SLO Monitoring | Implement Service Level Objective monitoring |
| Load Testing | Compare latency under different load levels |

### Prerequisites
- Basic statistics (mean, median, percentiles)
- Understanding of HTTP APIs
- No GPU required (CPU is sufficient)

In [None]:
# ============================================================
# Install dependencies
# ============================================================
!pip install matplotlib numpy pandas scipy aiohttp -q

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
from scipy import stats
import time
import asyncio
import json
import warnings
warnings.filterwarnings('ignore')

plt.style.use('seaborn-v0_8-whitegrid')
print("Dependencies loaded!")

---

## Section 1: Understanding Latency Distributions

Real-world latency distributions are almost never normal (Gaussian). They are typically:

- **Right-skewed**: Most requests are fast, but a long tail of slow ones
- **Multimodal**: Sometimes have multiple peaks (cache hits vs misses)
- **Heavy-tailed**: Extreme outliers are more common than you'd expect

Common distribution models:
- **Log-normal**: Most common for API latencies
- **Gamma**: Good for queuing-based systems
- **Bimodal**: Cache hit/miss patterns
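
One way to check whether a log-normal model is a reasonable fit is to work in log space: if latency is log-normal, its logarithm is Gaussian. The sketch below uses synthetic data (an assumption standing in for real measurements) and compares the P99 implied by the fit with the empirical P99.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic sample standing in for measured latencies (median 50 ms, sigma 0.5).
samples_ms = rng.lognormal(mean=np.log(50), sigma=0.5, size=20_000)

# Fit mu and sigma on log(latency), where a log-normal becomes a Gaussian.
log_samples = np.log(samples_ms)
mu_hat = log_samples.mean()
sigma_hat = log_samples.std(ddof=1)

# Percentile implied by the fit: exp(mu + z * sigma), with z = 2.326 for P99.
p99_model = np.exp(mu_hat + 2.326 * sigma_hat)
p99_empirical = np.percentile(samples_ms, 99)

print(f"fitted median={np.exp(mu_hat):.1f}ms  sigma={sigma_hat:.2f}")
print(f"P99: model={p99_model:.0f}ms  empirical={p99_empirical:.0f}ms")
```

On real data, a large gap between the two P99 values hints that the tail is heavier than log-normal or that the distribution is multimodal.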

In [None]:
# ============================================================
# Generate realistic latency distributions
# ============================================================

np.random.seed(42)
n_requests = 10000

def generate_realistic_latencies(n, base_latency_ms=50, scenario='normal'):
    """
    Generate realistic latency samples for different scenarios.
    
    Returns latencies in milliseconds.
    """
    if scenario == 'normal':
        # Typical API: log-normal distribution
        latencies = np.random.lognormal(mean=np.log(base_latency_ms), sigma=0.5, size=n)
        
    elif scenario == 'bimodal':
        # Cache hit/miss pattern
        cache_hit = np.random.random(n) < 0.8  # 80% cache hit rate
        latencies = np.where(
            cache_hit,
            np.random.lognormal(np.log(10), 0.3, n),    # Cache hit: ~10ms
            np.random.lognormal(np.log(200), 0.4, n)    # Cache miss: ~200ms
        )
        
    elif scenario == 'heavy_tail':
        # System under stress: many outliers
        base = np.random.lognormal(np.log(base_latency_ms), 0.3, n)
        # Add occasional spikes (GC pauses, network issues)
        spike_mask = np.random.random(n) < 0.02  # 2% spike rate
        spikes = np.random.uniform(500, 5000, n)  # 500ms-5s spikes
        latencies = np.where(spike_mask, spikes, base)
        
    elif scenario == 'degraded':
        # Gradually degrading system
        time_factor = np.linspace(1, 3, n)  # Gets slower over time
        latencies = np.random.lognormal(np.log(base_latency_ms), 0.4, n) * time_factor
    
    else:
        latencies = np.random.lognormal(np.log(base_latency_ms), 0.5, n)
    
    return latencies

# Generate different scenarios
scenarios = {
    'Normal Load': generate_realistic_latencies(n_requests, 50, 'normal'),
    'Bimodal (Cache)': generate_realistic_latencies(n_requests, 50, 'bimodal'),
    'Heavy Tail': generate_realistic_latencies(n_requests, 50, 'heavy_tail'),
    'Degrading': generate_realistic_latencies(n_requests, 50, 'degraded'),
}

# Quick stats
print("Latency Statistics (milliseconds):")
print("=" * 80)
print(f"{'Scenario':20s} {'Mean':>8s} {'Median':>8s} {'P90':>8s} {'P95':>8s} {'P99':>8s} {'P99.9':>8s} {'Max':>8s}")
print("-" * 80)
for name, latencies in scenarios.items():
    print(f"{name:20s} {np.mean(latencies):>8.1f} {np.median(latencies):>8.1f} "
          f"{np.percentile(latencies, 90):>8.1f} {np.percentile(latencies, 95):>8.1f} "
          f"{np.percentile(latencies, 99):>8.1f} {np.percentile(latencies, 99.9):>8.1f} "
          f"{np.max(latencies):>8.1f}")

In [None]:
# ============================================================
# Visualize latency distributions
# ============================================================

fig, axes = plt.subplots(2, 2, figsize=(16, 12))
colors = ['#2196F3', '#4CAF50', '#FF9800', '#F44336']

for idx, (name, latencies) in enumerate(scenarios.items()):
    ax = axes[idx // 2][idx % 2]
    
    # Histogram
    # Clip for display (but compute stats on full data)
    display_max = np.percentile(latencies, 99.5)
    display_data = latencies[latencies <= display_max]
    
    ax.hist(display_data, bins=100, color=colors[idx], alpha=0.6, 
            density=True, edgecolor='white', linewidth=0.5)
    
    # Add percentile lines
    percentiles = {
        'P50': np.percentile(latencies, 50),
        'P90': np.percentile(latencies, 90),
        'P95': np.percentile(latencies, 95),
        'P99': np.percentile(latencies, 99),
    }
    
    line_colors = {'P50': 'green', 'P90': 'orange', 'P95': 'red', 'P99': 'darkred'}
    for p_name, p_val in percentiles.items():
        if p_val <= display_max:
            ax.axvline(x=p_val, color=line_colors[p_name], linestyle='--', 
                      linewidth=1.5, alpha=0.8, label=f'{p_name}: {p_val:.0f}ms')
    
    ax.set_xlabel('Latency (ms)', fontsize=11)
    ax.set_ylabel('Density', fontsize=11)
    ax.set_title(f'{name}\n(Mean: {np.mean(latencies):.0f}ms, P99: {percentiles["P99"]:.0f}ms)', 
                fontsize=12, fontweight='bold')
    ax.legend(fontsize=9, loc='upper right')
    ax.grid(True, alpha=0.3)

plt.suptitle('Latency Distribution Shapes (10,000 requests each)', 
            fontsize=15, fontweight='bold')
plt.tight_layout()
plt.savefig('latency_distributions.png', dpi=150, bbox_inches='tight')
plt.show()

---

## Section 2: Simulating Concurrent API Requests

In production, your API handles many concurrent requests. The load level significantly affects latency distribution.
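
For intuition about why load matters, the textbook M/M/1 queue is a useful reference point. It is a simplifying assumption and not the model used by the simulator in this notebook. In that model the mean time in the system is the service time divided by (1 - utilization), and the time in the system is exponentially distributed, so P99 is the mean times ln(100). The helper name `mm1_latency_ms` is hypothetical.

```python
import math

def mm1_latency_ms(service_time_ms: float, utilization: float) -> tuple[float, float]:
    """Mean and P99 time-in-system for an M/M/1 queue (textbook model)."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    mean_ms = service_time_ms / (1 - utilization)
    p99_ms = mean_ms * math.log(100)  # time in system is exponential
    return mean_ms, p99_ms

for rho in (0.1, 0.5, 0.8, 0.9, 0.95):
    mean_ms, p99_ms = mm1_latency_ms(30, rho)
    print(f"utilization={rho:.2f}  mean={mean_ms:7.1f}ms  P99={p99_ms:7.1f}ms")
```

Even in this idealized model, latency doubles between 50% and 75% utilization and keeps climbing without bound as utilization approaches 1.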

In [None]:
# ============================================================
# Simulate concurrent request patterns
# ============================================================

import asyncio
import time

class SimulatedAPIServer:
    """
    Simulates an API server with realistic behavior:
    - Base processing time
    - Queueing delays under load
    - Occasional slow requests (GC, cache miss, etc.)
    """
    
    def __init__(self, base_latency_ms=30, max_concurrency=50):
        self.base_latency_ms = base_latency_ms
        self.max_concurrency = max_concurrency
        self.current_load = 0
        self.total_requests = 0
    
    async def handle_request(self):
        """Simulate handling a single request."""
        self.current_load += 1
        self.total_requests += 1
        
        # Base processing time (log-normal)
        processing_time = np.random.lognormal(
            np.log(self.base_latency_ms / 1000), 0.3
        )
        
        # Queueing delay increases with load
        load_factor = self.current_load / self.max_concurrency
        if load_factor > 0.8:
            # Exponentially distributed queueing delay (mean = load_factor * 100ms)
            # once load passes 80% of capacity
            queue_delay = np.random.exponential(load_factor * 0.1)
            processing_time += queue_delay
        
        # Occasional slow request (2% chance)
        if np.random.random() < 0.02:
            processing_time += np.random.uniform(0.2, 1.0)  # 200ms-1s spike
        
        # Simulate the delay
        await asyncio.sleep(processing_time)
        
        self.current_load -= 1
        return processing_time * 1000  # Return in milliseconds


async def run_load_test(server, num_requests, concurrency):
    """
    Send concurrent requests and collect latencies.
    
    Uses a semaphore to limit concurrency.
    """
    semaphore = asyncio.Semaphore(concurrency)
    latencies = []
    
    async def single_request():
        async with semaphore:
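            # The timer starts after the semaphore is acquired, so time spent
            # waiting for a free slot is not counted in the latency.
            start = time.perf_counter()  # monotonic clock, unlike time.time()
            await server.handle_request()
            latency = (time.perf_counter() - start) * 1000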
            latencies.append(latency)
    
    tasks = [single_request() for _ in range(num_requests)]
    await asyncio.gather(*tasks)
    
    return np.array(latencies)


# Run load tests at different concurrency levels
concurrency_levels = [1, 5, 10, 25, 50, 100]
num_requests_per_test = 500
all_load_results = {}

print("Running load tests...")
print("=" * 70)

for concurrency in concurrency_levels:
    server = SimulatedAPIServer(base_latency_ms=30, max_concurrency=50)
    test_start = time.perf_counter()
    latencies = await run_load_test(server, num_requests_per_test, concurrency)
    elapsed_s = time.perf_counter() - test_start  # wall-clock duration of the whole test
    all_load_results[concurrency] = latencies
    
    # Throughput = completed requests / wall-clock time (not requests / slowest request)
    print(f"Concurrency {concurrency:>3d}: "
          f"P50={np.percentile(latencies, 50):>7.1f}ms | "
          f"P90={np.percentile(latencies, 90):>7.1f}ms | "
          f"P99={np.percentile(latencies, 99):>7.1f}ms | "
          f"Throughput={num_requests_per_test / elapsed_s:>6.1f} req/s")

In [None]:
# ============================================================
# Visualize: Latency under different load levels
# ============================================================

fig, axes = plt.subplots(2, 2, figsize=(16, 12))

# Plot 1: Percentiles vs Concurrency
ax = axes[0, 0]
percentile_names = ['P50', 'P90', 'P95', 'P99']
percentile_values = [50, 90, 95, 99]
percentile_colors = ['#4CAF50', '#FF9800', '#F44336', '#9C27B0']

for p_name, p_val, p_color in zip(percentile_names, percentile_values, percentile_colors):
    values = [np.percentile(all_load_results[c], p_val) for c in concurrency_levels]
    ax.plot(concurrency_levels, values, '-o', linewidth=2, markersize=7, 
            label=p_name, color=p_color)

ax.set_xlabel('Concurrency Level', fontsize=12)
ax.set_ylabel('Latency (ms)', fontsize=12)
ax.set_title('Latency Percentiles vs Concurrency', fontsize=13, fontweight='bold')
ax.legend(fontsize=11)
ax.grid(True, alpha=0.3)
ax.set_yscale('log')

# Plot 2: CDF comparison
ax = axes[0, 1]
selected_levels = [1, 10, 50, 100]
cdf_colors = ['#2196F3', '#4CAF50', '#FF9800', '#F44336']

for concurrency, color in zip(selected_levels, cdf_colors):
    if concurrency in all_load_results:
        latencies = all_load_results[concurrency]
        sorted_latencies = np.sort(latencies)
        cdf = np.arange(1, len(sorted_latencies) + 1) / len(sorted_latencies)
        ax.plot(sorted_latencies, cdf, linewidth=2, label=f'Concurrency={concurrency}',
                color=color)

# Mark key percentiles
for p in [0.5, 0.9, 0.99]:
    ax.axhline(y=p, color='gray', linestyle=':', alpha=0.3)
    ax.text(ax.get_xlim()[0], p + 0.01, f'P{int(p*100)}', fontsize=9, color='gray')

ax.set_xlabel('Latency (ms)', fontsize=12)
ax.set_ylabel('Cumulative Probability', fontsize=12)
ax.set_title('CDF: Latency Distribution by Load', fontsize=13, fontweight='bold')
ax.legend(fontsize=10)
ax.grid(True, alpha=0.3)

# Plot 3: Windowed percentiles over time (for degrading scenario)
ax = axes[1, 0]
degrading_latencies = scenarios['Degrading']
window_size = 100
n_windows = len(degrading_latencies) // window_size

# Calculate percentiles per non-overlapping window of `window_size` requests
rolling_p50 = [np.percentile(degrading_latencies[i*window_size:(i+1)*window_size], 50) 
               for i in range(n_windows)]
rolling_p90 = [np.percentile(degrading_latencies[i*window_size:(i+1)*window_size], 90) 
               for i in range(n_windows)]
rolling_p99 = [np.percentile(degrading_latencies[i*window_size:(i+1)*window_size], 99) 
               for i in range(n_windows)]

x = range(n_windows)
ax.fill_between(x, rolling_p50, rolling_p99, alpha=0.15, color='red', label='P50-P99 range')
ax.fill_between(x, rolling_p50, rolling_p90, alpha=0.2, color='orange', label='P50-P90 range')
ax.plot(x, rolling_p50, linewidth=2, color='green', label='P50')
ax.plot(x, rolling_p90, linewidth=2, color='orange', label='P90')
ax.plot(x, rolling_p99, linewidth=2, color='red', label='P99')

ax.set_xlabel('Time Window', fontsize=12)
ax.set_ylabel('Latency (ms)', fontsize=12)
ax.set_title('Rolling Percentiles (Degrading System)', fontsize=13, fontweight='bold')
ax.legend(fontsize=10)
ax.grid(True, alpha=0.3)

# Plot 4: P99/P50 ratio ("tail latency amplification factor")
ax = axes[1, 1]
ratios = []
for c in concurrency_levels:
    p99 = np.percentile(all_load_results[c], 99)
    p50 = np.percentile(all_load_results[c], 50)
    ratios.append(p99 / p50)

bars = ax.bar([str(c) for c in concurrency_levels], ratios, 
              color=plt.cm.RdYlGn_r(np.linspace(0.2, 0.8, len(concurrency_levels))), alpha=0.8)
for bar, r in zip(bars, ratios):
    ax.text(bar.get_x() + bar.get_width()/2., bar.get_height() + 0.1,
            f'{r:.1f}x', ha='center', fontweight='bold', fontsize=10)

ax.set_xlabel('Concurrency Level', fontsize=12)
ax.set_ylabel('P99/P50 Ratio', fontsize=12)
ax.set_title('Tail Latency Amplification Factor\n(P99 / P50)', fontsize=13, fontweight='bold')
ax.grid(True, alpha=0.3, axis='y')
ax.axhline(y=3, color='red', linestyle='--', alpha=0.5, label='Danger threshold (3x)')
ax.legend()

plt.tight_layout()
plt.savefig('load_test_results.png', dpi=150, bbox_inches='tight')
plt.show()

---

## Section 3: Tail Latency Amplification in Distributed Systems

In microservice architectures, a single user request often fans out to multiple backend services. This **amplifies** tail latency dramatically.

### The Fan-Out Problem

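If a request fans out to $n$ services in parallel, and each service independently exceeds its P99 latency $L_{99}$ with probability 0.01, then the probability that every call finishes within $L_{99}$ is:

$$P(\text{all fast}) = (0.99)^n$$

For $n = 100$ services: $P(\text{all fast}) = 0.99^{100} \approx 0.366$

That means **63.4% of requests hit at least one slow service!**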
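
The same arithmetic can be run in both directions. This sketch assumes independent services and uses two hypothetical helper names: `p_any_slow` gives the chance of hitting at least one slow call, and `required_per_service_quantile` gives the per-service quantile each backend must meet so that a target fraction of fan-out requests stays fast.

```python
def p_any_slow(n_services: int, p_slow: float = 0.01) -> float:
    """Probability that at least one of n independent calls is slow."""
    return 1 - (1 - p_slow) ** n_services

def required_per_service_quantile(n_services: int, target: float = 0.99) -> float:
    """Per-service fraction of fast calls needed so that `target` of
    fan-out requests see only fast calls (solves q**n = target for q)."""
    return target ** (1 / n_services)

for n in (1, 10, 100):
    print(f"n={n:>3d}  P(any slow)={p_any_slow(n):.3f}  "
          f"per-service quantile needed={required_per_service_quantile(n):.5f}")
```

For 100 services, keeping 99% of user requests fast requires each backend to be fast at roughly its P99.99, not its P99.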

In [None]:
# ============================================================
# Tail latency amplification simulation
# ============================================================

np.random.seed(42)

def simulate_fanout(n_services, n_requests=10000, base_latency_ms=20, p99_latency_ms=200):
    """
    Simulate fan-out requests across n_services.
    The overall latency is the MAX of all service latencies.
    """
    # Generate latencies for each service
    # Use log-normal distribution
    sigma = np.log(p99_latency_ms / base_latency_ms) / 2.326  # 2.326 = z-score for 99th percentile
    
    all_latencies = np.random.lognormal(
        np.log(base_latency_ms), sigma, size=(n_requests, n_services)
    )
    
    # Fan-out latency = max across all services
    fanout_latencies = np.max(all_latencies, axis=1)
    # Single service latency (for comparison)
    single_latencies = all_latencies[:, 0]
    
    return fanout_latencies, single_latencies


# Test with different fan-out widths
service_counts = [1, 5, 10, 25, 50, 100]
fanout_results = {}

print("Tail Latency Amplification:")
print("=" * 80)
print(f"{'Services':>10s} {'P50':>10s} {'P90':>10s} {'P99':>10s} {'P99.9':>10s} {'P99/P50':>10s}")
print("-" * 80)

for n_services in service_counts:
    fanout_lat, single_lat = simulate_fanout(n_services)
    fanout_results[n_services] = fanout_lat
    
    p50 = np.percentile(fanout_lat, 50)
    p90 = np.percentile(fanout_lat, 90)
    p99 = np.percentile(fanout_lat, 99)
    p999 = np.percentile(fanout_lat, 99.9)
    ratio = p99 / p50
    
    print(f"{n_services:>10d} {p50:>10.1f} {p90:>10.1f} {p99:>10.1f} {p999:>10.1f} {ratio:>10.1f}x")

# Visualization
fig, axes = plt.subplots(1, 3, figsize=(18, 6))

# Plot 1: CDFs for different fan-out widths
ax = axes[0]
selected = [1, 10, 50, 100]
colors = plt.cm.RdYlBu_r(np.linspace(0.1, 0.9, len(selected)))

for n_services, color in zip(selected, colors):
    lat = fanout_results[n_services]
    sorted_lat = np.sort(lat)
    cdf = np.arange(1, len(sorted_lat) + 1) / len(sorted_lat)
    ax.plot(sorted_lat, cdf * 100, linewidth=2, label=f'{n_services} services', color=color)

ax.axhline(y=99, color='gray', linestyle=':', alpha=0.5)
ax.text(25, 99.3, 'P99', fontsize=9, color='gray')
ax.set_xlabel('Latency (ms)', fontsize=12)
ax.set_ylabel('Percentile', fontsize=12)
ax.set_title('Latency CDF by Fan-Out Width', fontsize=13, fontweight='bold')
ax.legend(fontsize=10)
ax.grid(True, alpha=0.3)
ax.set_xlim(0, 500)

# Plot 2: P99 vs number of services
ax = axes[1]
p99_values = [np.percentile(fanout_results[n], 99) for n in service_counts]
p50_values = [np.percentile(fanout_results[n], 50) for n in service_counts]

ax.plot(service_counts, p99_values, 'r-o', linewidth=2, markersize=8, label='P99')
ax.plot(service_counts, p50_values, 'g-s', linewidth=2, markersize=8, label='P50')
ax.fill_between(service_counts, p50_values, p99_values, alpha=0.1, color='orange')

ax.set_xlabel('Number of Services (fan-out width)', fontsize=12)
ax.set_ylabel('Latency (ms)', fontsize=12)
ax.set_title('Tail Latency Amplification\n(More services = worse tails)', fontsize=13, fontweight='bold')
ax.legend(fontsize=11)
ax.grid(True, alpha=0.3)

# Plot 3: Probability of hitting at least one slow service
ax = axes[2]
n_range = np.arange(1, 101)
for p in [0.01, 0.05, 0.10]:
    prob_all_fast = (1 - p) ** n_range
    prob_one_slow = 1 - prob_all_fast
    ax.plot(n_range, prob_one_slow * 100, linewidth=2, label=f'P(slow)={p*100:.0f}%')

ax.axhline(y=50, color='gray', linestyle='--', alpha=0.3)
ax.set_xlabel('Number of Services', fontsize=12)
ax.set_ylabel('P(at least one slow) %', fontsize=12)
ax.set_title('Probability of Hitting a Slow Service', fontsize=13, fontweight='bold')
ax.legend(fontsize=11)
ax.grid(True, alpha=0.3)
ax.set_ylim(0, 100)

plt.tight_layout()
plt.savefig('tail_amplification.png', dpi=150, bbox_inches='tight')
plt.show()

print("\nKey Takeaway: With 100 microservices, even if each is 99% fast,")
print("63% of user requests will experience at least one slow backend call.")

---

## Section 4: SLO (Service Level Objective) Monitoring

An SLO defines the target performance level:

- **SLI** (Service Level Indicator): The metric you measure (e.g., P99 latency)
- **SLO** (Service Level Objective): The target value (e.g., P99 < 200ms for 99.5% of the time)
- **Error Budget**: How much violation you can tolerate

### Error Budget Calculation

$$\text{Error Budget} = 1 - \text{SLO}$$
$$\text{Budget Consumed} = \frac{\text{SLO Violations}}{\text{Total Windows} \times \text{Error Budget}}$$
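
As a worked example, take one day of 5-minute windows (288 windows) and a 99.5% target. The sketch below expresses consumption as a share of the allowed violations; the helper name `error_budget_status` and its return keys are illustrative assumptions.

```python
def error_budget_status(total_windows: int, violations: int, slo_target: float) -> dict:
    """Summarize error-budget usage for a window-based SLO (illustrative helper)."""
    budget_fraction = 1 - slo_target                    # e.g. 0.005 for a 99.5% SLO
    allowed_violations = total_windows * budget_fraction
    if allowed_violations > 0:
        consumed = violations / allowed_violations
    else:
        # A 100% target leaves no budget: any violation exhausts it.
        consumed = float("inf") if violations else 0.0
    return {
        "allowed_violations": allowed_violations,
        "consumed_fraction": consumed,
        "exhausted": consumed >= 1,
    }

day = error_budget_status(total_windows=288, violations=1, slo_target=0.995)
print(f"allowed={day['allowed_violations']:.2f} windows, "
      f"consumed={day['consumed_fraction']:.0%}, exhausted={day['exhausted']}")
```

With only 1.44 violating windows allowed per day, a single bad window uses about two thirds of the budget and a second one exhausts it.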

In [None]:
# ============================================================
# SLO Monitoring System
# ============================================================

class SLOMonitor:
    """
    Monitors Service Level Objectives for an API.
    
    Example SLO: "P99 latency < 200ms for 99.5% of 5-minute windows"
    """
    
    def __init__(self, slo_config: dict):
        """
        Args:
            slo_config: {
                'name': 'API Latency SLO',
                'metric': 'p99_latency_ms',
                'threshold': 200,         # P99 < 200ms
                'target': 0.995,          # 99.5% compliance
                'window_minutes': 5,       # 5-minute windows
            }
        """
        self.config = slo_config
        self.windows = []  # List of (timestamp, is_compliant, metric_value)
        self.alerts = []
    
    def record_window(self, timestamp, latencies):
        """Record metrics for a time window."""
        p99 = np.percentile(latencies, 99)
        is_compliant = p99 < self.config['threshold']
        
        self.windows.append({
            'timestamp': timestamp,
            'p99': p99,
            'p50': np.percentile(latencies, 50),
            'p90': np.percentile(latencies, 90),
            'mean': np.mean(latencies),
            'is_compliant': is_compliant,
            'n_requests': len(latencies),
        })
        
        # Check for alerting
        self._check_alerts()
    
    def _check_alerts(self):
        """Check if we should alert."""
        if len(self.windows) < 3:
            return
        
        # Alert if last 3 windows all violated
        recent = self.windows[-3:]
        if all(not w['is_compliant'] for w in recent):
            self.alerts.append({
                'type': 'consecutive_violations',
                'timestamp': recent[-1]['timestamp'],
                'message': f"3 consecutive SLO violations! P99: {recent[-1]['p99']:.0f}ms",
                'severity': 'high'
            })
        
        # Alert if error budget is nearly exhausted
        budget = self.get_error_budget()
        if budget['remaining_percent'] < 10:
            self.alerts.append({
                'type': 'budget_low',
                'timestamp': self.windows[-1]['timestamp'],
                'message': f"Error budget at {budget['remaining_percent']:.1f}%! "
                          f"({budget['violations']}/{budget['total_windows']} violations)",
                'severity': 'critical'
            })
    
    def get_error_budget(self):
        """Calculate current error budget status."""
        if not self.windows:
            return {'remaining_percent': 100, 'violations': 0, 'total_windows': 0}
        
        total = len(self.windows)
        violations = sum(1 for w in self.windows if not w['is_compliant'])
        allowed_violations = total * (1 - self.config['target'])
        
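        if allowed_violations > 0:
            # Treat fewer than one allowed violation as one, so a short history
            # does not report an inflated consumption percentage.
            budget_consumed = violations / max(allowed_violations, 1) * 100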
        else:
            budget_consumed = 100 if violations > 0 else 0
        
        return {
            'total_windows': total,
            'violations': violations,
            'allowed_violations': allowed_violations,
            'compliance_rate': (total - violations) / total,
            'remaining_percent': max(0, 100 - budget_consumed),
            'is_healthy': budget_consumed < 100,
        }
    
    def report(self):
        """Generate SLO report."""
        budget = self.get_error_budget()
        
        print(f"\n{'='*60}")
        print(f"  SLO Report: {self.config['name']}")
        print(f"{'='*60}")
        print(f"  Target: P99 < {self.config['threshold']}ms for {self.config['target']*100:.1f}% of windows")
        print(f"  Window size: {self.config['window_minutes']} minutes")
        print(f"")
        print(f"  Total windows:    {budget['total_windows']}")
        print(f"  SLO violations:   {budget['violations']}")
        print(f"  Compliance rate:  {budget['compliance_rate']*100:.1f}%")
        
        status = "HEALTHY" if budget['is_healthy'] else "BUDGET EXHAUSTED"
        print(f"  Error budget:     {budget['remaining_percent']:.1f}% remaining")
        print(f"  Status:           {status}")
        
        if self.alerts:
            print(f"\n  Alerts ({len(self.alerts)}):")
            for alert in self.alerts[-5:]:
                print(f"    [{alert['severity'].upper()}] {alert['message']}")
        
        print(f"{'='*60}")


# ---- Run SLO monitoring simulation ----
slo_config = {
    'name': 'API Response Time',
    'metric': 'p99_latency_ms',
    'threshold': 200,
    'target': 0.995,
    'window_minutes': 5,
}

monitor = SLOMonitor(slo_config)

# Simulate 24 hours (288 5-minute windows)
np.random.seed(42)
n_windows = 288

for i in range(n_windows):
    hour = i * 5 / 60  # Hours
    
    # Simulate varying load throughout the day
    if 9 <= hour % 24 <= 17:  # Business hours: higher load
        base_lat = 60
        n_req = 1000
    elif 2 <= hour % 24 <= 5:  # Night: lower load
        base_lat = 25
        n_req = 100
    else:
        base_lat = 40
        n_req = 500
    
    # Simulate occasional degradation
    if 14 <= hour % 24 <= 15:  # Peak hours: some degradation
        base_lat *= 2
    
    latencies = generate_realistic_latencies(n_req, base_lat, 'normal')
    monitor.record_window(timestamp=i * 5, latencies=latencies)

# Print report
monitor.report()

In [None]:
# ============================================================
# Visualize SLO monitoring dashboard
# ============================================================

fig, axes = plt.subplots(3, 1, figsize=(18, 14))

windows_df = pd.DataFrame(monitor.windows)
hours = windows_df['timestamp'] / 60  # Convert to hours

# Plot 1: P99 latency over time with SLO threshold
ax = axes[0]
ax.plot(hours, windows_df['p50'], linewidth=1, alpha=0.7, color='green', label='P50')
ax.plot(hours, windows_df['p90'], linewidth=1, alpha=0.7, color='orange', label='P90')
ax.plot(hours, windows_df['p99'], linewidth=2, color='red', label='P99')
ax.axhline(y=slo_config['threshold'], color='darkred', linestyle='--', 
           linewidth=2, label=f'SLO Threshold ({slo_config["threshold"]}ms)')

# Shade violations
violation_mask = ~windows_df['is_compliant']
for i in range(len(violation_mask)):
    if violation_mask.iloc[i]:
        ax.axvspan(hours.iloc[i] - 5/120, hours.iloc[i] + 5/120, 
                  alpha=0.3, color='red')

ax.set_xlabel('Time (hours)', fontsize=12)
ax.set_ylabel('Latency (ms)', fontsize=12)
ax.set_title('24-Hour Latency Monitoring Dashboard', fontsize=14, fontweight='bold')
ax.legend(fontsize=10, loc='upper right')
ax.grid(True, alpha=0.3)
ax.set_xlim(0, 24)

# Add time-of-day annotations
ax.axvspan(9, 17, alpha=0.05, color='blue')
ax.text(13, ax.get_ylim()[1] * 0.95, 'Business Hours', ha='center', fontsize=10, 
        color='blue', alpha=0.7)

# Plot 2: Error budget burn-down
ax = axes[1]
cumulative_violations = np.cumsum(~windows_df['is_compliant'])
allowed = np.arange(1, len(windows_df) + 1) * (1 - slo_config['target'])

ax.plot(hours, cumulative_violations, linewidth=2, color='red', label='Actual violations')
ax.plot(hours, allowed, linewidth=2, color='green', linestyle='--', label='Budget limit')
ax.fill_between(hours, cumulative_violations, allowed, 
                where=cumulative_violations > allowed, 
                alpha=0.3, color='red', label='Over budget')
ax.fill_between(hours, cumulative_violations, allowed, 
                where=cumulative_violations <= allowed, 
                alpha=0.3, color='green', label='Within budget')

ax.set_xlabel('Time (hours)', fontsize=12)
ax.set_ylabel('Cumulative Violations', fontsize=12)
ax.set_title('Error Budget Burn-Down', fontsize=14, fontweight='bold')
ax.legend(fontsize=10)
ax.grid(True, alpha=0.3)
ax.set_xlim(0, 24)

# Plot 3: Request volume and compliance rate
ax = axes[2]
ax2 = ax.twinx()

# Rolling compliance rate (last 12 windows = 1 hour)
rolling_compliance = windows_df['is_compliant'].rolling(12, min_periods=1).mean() * 100

ax.bar(hours, windows_df['n_requests'], width=5/60, alpha=0.3, color='steelblue', label='Requests')
ax2.plot(hours, rolling_compliance, linewidth=2, color='green', label='Rolling compliance %')
ax2.axhline(y=slo_config['target'] * 100, color='red', linestyle='--', 
           linewidth=1.5, label=f'SLO target ({slo_config["target"]*100}%)')

ax.set_xlabel('Time (hours)', fontsize=12)
ax.set_ylabel('Requests per Window', fontsize=12, color='steelblue')
ax2.set_ylabel('Compliance Rate (%)', fontsize=12, color='green')
ax.set_title('Request Volume and SLO Compliance', fontsize=14, fontweight='bold')
ax.set_xlim(0, 24)
ax2.set_ylim(80, 101)

# Combine legends
lines1, labels1 = ax.get_legend_handles_labels()
lines2, labels2 = ax2.get_legend_handles_labels()
ax.legend(lines1 + lines2, labels1 + labels2, fontsize=10, loc='lower right')
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('slo_dashboard.png', dpi=150, bbox_inches='tight')
plt.show()

---

## Section 5: Strategies for Reducing Tail Latency

| Strategy | How It Helps | Trade-off |
|----------|-------------|----------|
| **Hedged Requests** | Send to multiple replicas, use first response | Higher resource usage |
| **Caching** | Eliminate slow paths for repeat queries | Memory usage, staleness |
| **Circuit Breaker** | Fast-fail when backend is degraded | Reduced availability |
| **Timeout + Retry** | Bound worst case, retry on fresh replica | Increased load |
| **Load Shedding** | Reject excess requests to protect quality | Reduced throughput |
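
The hedged-request row can be sketched with `asyncio`. This is a minimal illustration, not a production client: `hedged_call` and `fake_request` are hypothetical names, the backend is simulated with `asyncio.sleep`, and error handling is omitted (if the first task to finish raised an exception, it would propagate).

```python
import asyncio

async def hedged_call(make_request, hedge_delay_s: float):
    """Start one request; if it has not finished after hedge_delay_s, start a
    backup and return whichever finishes first. `make_request` is a
    zero-argument coroutine function supplied by the caller."""
    primary = asyncio.create_task(make_request())
    done, _ = await asyncio.wait({primary}, timeout=hedge_delay_s)
    if done:
        return primary.result()  # fast path: no hedge needed

    backup = asyncio.create_task(make_request())
    done, pending = await asyncio.wait(
        {primary, backup}, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()  # stop paying for the slower request
    return done.pop().result()

async def demo():
    delays = iter([0.5, 0.02])  # simulated: first attempt slow, backup fast

    async def fake_request():
        delay = next(delays)
        await asyncio.sleep(delay)
        return delay

    return await hedged_call(fake_request, hedge_delay_s=0.05)

print(asyncio.run(demo()))
```

Inside a notebook, where an event loop is already running, call `await demo()` instead of `asyncio.run(demo())`. Delaying the backup means only requests slower than `hedge_delay_s` generate extra load.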

In [None]:
# ============================================================
# Simulate hedged requests strategy
# ============================================================

np.random.seed(42)

def simulate_hedged_requests(base_latencies, n_replicas=2, hedge_delay_ms=50):
    """
    Simulate hedged requests: send to primary, after hedge_delay
    send to backup. Use whichever responds first.
    """
    n = len(base_latencies)
    hedged_latencies = []
    
    for i in range(n):
        primary = base_latencies[i]
        
        # If primary is fast, no need for hedge
        if primary < hedge_delay_ms:
            hedged_latencies.append(primary)
            continue
        
        # Send hedged request after delay
        backup_latencies = []
        for _ in range(n_replicas - 1):
            # Backup starts after hedge_delay; its latency is drawn independently
            # from a hard-coded healthy-replica log-normal (median 50ms, sigma 0.5)
            backup = hedge_delay_ms + np.random.lognormal(np.log(50), 0.5)
            backup_latencies.append(backup)
        
        # Use the fastest response
        all_responses = [primary] + backup_latencies
        hedged_latencies.append(min(all_responses))
    
    return np.array(hedged_latencies)


# Generate base latencies with heavy tail
base_latencies = generate_realistic_latencies(10000, 50, 'heavy_tail')

# Apply different strategies
strategies = {
    'No Mitigation': base_latencies,
    'Hedged (2 replicas)': simulate_hedged_requests(base_latencies, 2, 75),
    'Hedged (3 replicas)': simulate_hedged_requests(base_latencies, 3, 75),
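    # Capping at P95 models a timeout WITHOUT retry: the slowest 5% of requests
    # fail at the deadline instead of succeeding, so latency is bounded but errors rise.
    'Timeout at P95': np.minimum(base_latencies, np.percentile(base_latencies, 95)),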
}

fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Plot 1: CDF comparison
ax = axes[0]
colors = ['#F44336', '#2196F3', '#4CAF50', '#FF9800']
for (name, latencies), color in zip(strategies.items(), colors):
    sorted_lat = np.sort(latencies)
    cdf = np.arange(1, len(sorted_lat) + 1) / len(sorted_lat)
    ax.plot(sorted_lat, cdf * 100, linewidth=2, label=name, color=color)

ax.axhline(y=99, color='gray', linestyle=':', alpha=0.5)
ax.set_xlabel('Latency (ms)', fontsize=12)
ax.set_ylabel('Percentile', fontsize=12)
ax.set_title('Effect of Tail Latency Mitigation', fontsize=13, fontweight='bold')
ax.legend(fontsize=10)
ax.set_xlim(0, 1000)
ax.set_ylim(50, 100)
ax.grid(True, alpha=0.3)

# Plot 2: Percentile comparison bar chart
ax = axes[1]
percentiles_to_show = [50, 90, 95, 99]
x = np.arange(len(percentiles_to_show))
width = 0.2

for i, (name, latencies) in enumerate(strategies.items()):
    values = [np.percentile(latencies, p) for p in percentiles_to_show]
    ax.bar(x + i * width, values, width, label=name, color=colors[i], alpha=0.8)

ax.set_xlabel('Percentile', fontsize=12)
ax.set_ylabel('Latency (ms)', fontsize=12)
ax.set_title('Percentile Comparison by Strategy', fontsize=13, fontweight='bold')
ax.set_xticks(x + 1.5 * width)
ax.set_xticklabels([f'P{p}' for p in percentiles_to_show])
ax.legend(fontsize=9)
ax.grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.savefig('mitigation_strategies.png', dpi=150, bbox_inches='tight')
plt.show()

# Print improvement summary
print("\nTail Latency Reduction Summary:")
print("=" * 60)
base_p99 = np.percentile(base_latencies, 99)
for name, latencies in strategies.items():
    p99 = np.percentile(latencies, 99)
    reduction = (1 - p99 / base_p99) * 100
    print(f"  {name:25s}: P99 = {p99:>8.1f}ms ({reduction:>+5.1f}%)")

---

## Summary & Key Takeaways

| Concept | Key Insight |
|---------|-------------|
| **Percentiles > Averages** | P99 matters more than mean for user experience |
| **Latency Distributions** | Real-world latency is right-skewed (often approximately log-normal), not Gaussian |
| **Tail Amplification** | Fan-out to N services: P(all fast) = (1-p)^N, where p is the chance that one service is slow |
| **SLO Monitoring** | Track compliance and error budget over time |
| **Hedged Requests** | Can reduce P99 by roughly 40-60%, at up to 2x resource cost with 2 replicas |
| **Load Correlation** | Tail latency grows sharply and non-linearly as load approaches capacity |
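
The tail amplification row can be checked numerically. The snippet below is a minimal sketch that assumes each of the N backend calls is slow independently with the same probability `p_slow`; the function name `prob_any_slow` is illustrative and not part of the notebook.

```python
def prob_any_slow(p_slow: float, n_services: int) -> float:
    """Probability that at least one of n independent calls is slow."""
    if not 0.0 <= p_slow <= 1.0:
        raise ValueError("p_slow must be in [0, 1]")
    if n_services < 0:
        raise ValueError("n_services must be non-negative")
    # P(all fast) = (1 - p)^N, so P(at least one slow) is the complement
    return 1.0 - (1.0 - p_slow) ** n_services


# A request that fans out to N services waits for the slowest one,
# so it lands in the tail whenever any single call does.
for n in (1, 10, 100):
    print(f"N={n:>3}: P(at least one slow call) = {prob_any_slow(0.01, n):.1%}")
```

With `p_slow = 0.01` (each service's own P99), a fan-out of 100 gives roughly a 63% chance that at least one call is slow. Real backends are rarely independent, so treat the result as a rough guide rather than a prediction.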

### Rules of Thumb

1. **Always measure P99** (not just average or P50)
2. **Set SLOs on percentiles** (e.g., P99 < 200ms in 99.5% of measurement windows)
3. **Monitor error budgets** to make data-driven decisions
4. **Use hedged requests** for critical, latency-sensitive paths
5. **Test under load** -- behavior at 10% load tells you nothing about 80% load
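
Rules 2 and 3 above can be sketched in a few lines. This example reads the SLO as "P99 under 200 ms in at least 99.5% of fixed-size windows", which is one possible interpretation; the helper names, the window size, and the synthetic log-normal data are all assumptions made for illustration.

```python
import numpy as np


def windowed_slo_compliance(latencies_ms, window_size, p99_threshold_ms):
    """Fraction of complete windows whose P99 is below the threshold."""
    latencies_ms = np.asarray(latencies_ms, dtype=float)
    n_windows = len(latencies_ms) // window_size
    if n_windows == 0:
        raise ValueError("need at least one full window of samples")
    # Drop the trailing partial window, then compute one P99 per window
    windows = latencies_ms[: n_windows * window_size].reshape(n_windows, window_size)
    window_p99 = np.percentile(windows, 99, axis=1)
    return float(np.mean(window_p99 < p99_threshold_ms))


def error_budget_remaining(compliance, target):
    """Share of the error budget still unspent; negative means overspent."""
    allowed_bad = 1.0 - target
    if allowed_bad <= 0:
        raise ValueError("target must be below 1.0 to leave any error budget")
    return 1.0 - (1.0 - compliance) / allowed_bad


# Synthetic, right-skewed latencies (illustrative only, not measured data)
rng = np.random.default_rng(0)
sample = rng.lognormal(mean=4.0, sigma=0.5, size=10_000)

compliance = windowed_slo_compliance(sample, window_size=200, p99_threshold_ms=200)
budget = error_budget_remaining(compliance, target=0.995)
print(f"Window compliance: {compliance:.1%}, error budget remaining: {budget:+.1%}")
```

A negative remaining budget is the signal to slow down feature releases and spend time on reliability work.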

---

## Exercises

### Exercise 1: Real API Load Test
Using `aiohttp`, send concurrent requests to a real API (e.g., httpbin.org) and analyze the latency distribution.

### Exercise 2: Circuit Breaker Implementation
Implement a circuit breaker that opens after 3 consecutive failures and closes again once a trial request succeeds after a cool-down period. Measure how it affects tail latency.

### Exercise 3: Custom SLO Dashboard
Extend the SLOMonitor to track multiple SLOs simultaneously (e.g., latency + error rate + throughput).
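
One possible starting point is sketched below. It does not reuse the notebook's SLOMonitor; `SLODefinition` and `MultiSLOTracker` are hypothetical names, and each SLO is assumed to be a per-request pass/fail check with a target compliance rate. Throughput is not a per-request property, so it would need a separate time-window counter.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class SLODefinition:
    name: str
    is_good: Callable[[dict], bool]  # True when a request meets this SLO
    target: float                    # e.g. 0.99 means 99% of requests must be good


@dataclass
class MultiSLOTracker:
    slos: List[SLODefinition]
    good: Dict[str, int] = field(default_factory=dict)
    total: int = 0

    def record(self, request: dict) -> None:
        """Evaluate one request against every SLO."""
        self.total += 1
        for slo in self.slos:
            if slo.is_good(request):
                self.good[slo.name] = self.good.get(slo.name, 0) + 1

    def report(self) -> Dict[str, dict]:
        """Compliance per SLO; with no data every SLO counts as met."""
        result = {}
        for slo in self.slos:
            compliance = self.good.get(slo.name, 0) / self.total if self.total else 1.0
            result[slo.name] = {
                "compliance": compliance,
                "target": slo.target,
                "met": compliance >= slo.target,
            }
        return result


tracker = MultiSLOTracker(slos=[
    SLODefinition("latency", lambda r: r["latency_ms"] < 200, target=0.99),
    SLODefinition("availability", lambda r: r["status"] < 500, target=0.999),
])
for latency_ms, status in [(120, 200), (340, 200), (95, 503), (80, 200)]:
    tracker.record({"latency_ms": latency_ms, "status": status})
for name, row in tracker.report().items():
    print(f"{name:12s} compliance={row['compliance']:.1%} met={row['met']}")
```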

### Exercise 4: Adaptive Load Shedding
Implement a system that starts rejecting requests when P99 exceeds a threshold, gradually increasing rejection rate until P99 recovers.
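
A minimal sketch of one way to approach this exercise follows. The class name `AdaptiveLoadShedder`, the fixed additive step, and the sliding-window size are assumptions chosen for illustration; a production controller would likely need smoothing and a faster recovery path.

```python
import random
from collections import deque

import numpy as np


class AdaptiveLoadShedder:
    """Raises the rejection probability while recent P99 exceeds a threshold."""

    def __init__(self, p99_threshold_ms, step=0.05, max_reject=0.9,
                 window=500, min_samples=20, rng=None):
        self.p99_threshold_ms = p99_threshold_ms
        self.step = step
        self.max_reject = max_reject
        self.min_samples = min_samples
        self.recent = deque(maxlen=window)  # sliding window of latencies
        self.reject_prob = 0.0
        self.rng = rng or random.Random()

    def should_admit(self):
        """Randomly reject a fraction of requests equal to reject_prob."""
        return self.rng.random() >= self.reject_prob

    def record_latency(self, latency_ms):
        self.recent.append(latency_ms)

    def adjust(self):
        """Call periodically: step rejection up when P99 is high, down otherwise."""
        if len(self.recent) < self.min_samples:
            return self.reject_prob  # not enough data to estimate P99
        p99 = np.percentile(list(self.recent), 99)
        if p99 > self.p99_threshold_ms:
            self.reject_prob = min(self.max_reject, self.reject_prob + self.step)
        else:
            self.reject_prob = max(0.0, self.reject_prob - self.step)
        return self.reject_prob


shedder = AdaptiveLoadShedder(p99_threshold_ms=200, step=0.1, window=100,
                              rng=random.Random(0))
for phase, latency_ms in [("overload", 800.0), ("recovered", 50.0)]:
    for _ in range(3):
        for _ in range(100):
            shedder.record_latency(latency_ms)
        print(f"{phase:10s} reject_prob={shedder.adjust():.2f}")
```

Rejected requests fail fast instead of queueing, which is what lets the P99 of admitted requests recover.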

In [None]:
# ============================================================
# Exercise 1 Starter: Real API Load Test
# ============================================================

# import aiohttp
# 
# async def real_load_test(url, num_requests=100, concurrency=10):
#     """Send concurrent requests to a real API."""
#     semaphore = asyncio.Semaphore(concurrency)
#     latencies = []
#     errors = 0
#     
#     async with aiohttp.ClientSession() as session:
#         async def single_request():
#             nonlocal errors
#             async with semaphore:
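#                 start = time.perf_counter()  # monotonic clock for measuring intervals
#                 try:
#                     async with session.get(url) as resp:
#                         resp.raise_for_status()  # count HTTP 4xx/5xx responses as errors
#                         await resp.text()
#                         latency = (time.perf_counter() - start) * 1000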
#                         latencies.append(latency)
#                 except Exception:
#                     errors += 1
#         
#         tasks = [single_request() for _ in range(num_requests)]
#         await asyncio.gather(*tasks)
#     
#     return np.array(latencies), errors
# 
# # Run the test
# latencies, errors = await real_load_test(
#     "https://httpbin.org/get",
#     num_requests=100,
#     concurrency=10
# )
# 
# print(f"Results: {len(latencies)} successful, {errors} errors")
# print(f"P50: {np.percentile(latencies, 50):.0f}ms")
# print(f"P99: {np.percentile(latencies, 99):.0f}ms")

print("Uncomment the code above to run a real API load test!")
print("Works best in Google Colab with internet access.")

In [None]:
# ============================================================
# Exercise 2 Starter: Circuit Breaker
# ============================================================

class CircuitBreaker:
    """Simple circuit breaker implementation."""
    
    CLOSED = 'closed'      # Normal operation
    OPEN = 'open'          # Failing fast
    HALF_OPEN = 'half_open'  # Testing recovery
    
    def __init__(self, failure_threshold=3, reset_timeout_s=5):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.state = self.CLOSED
        self.failure_count = 0
        self.last_failure_time = 0.0
    
    def can_execute(self):
        """Check if request should be allowed."""
        if self.state == self.CLOSED:
            return True
        if self.state == self.OPEN:
            # time.monotonic() is unaffected by wall-clock adjustments
            if time.monotonic() - self.last_failure_time > self.reset_timeout_s:
                self.state = self.HALF_OPEN
                return True  # Allow one test request
            return False
        # HALF_OPEN: the single test request is already in flight,
        # so reject everything else until its outcome is recorded
        return False
    
    def record_success(self):
        """Record a successful request."""
        if self.state == self.HALF_OPEN:
            self.state = self.CLOSED
        self.failure_count = 0
    
    def record_failure(self):
        """Record a failed request."""
        self.failure_count += 1
        self.last_failure_time = time.monotonic()
        # A failed test request reopens the circuit immediately
        if self.state == self.HALF_OPEN or self.failure_count >= self.failure_threshold:
            self.state = self.OPEN

# Demo
cb = CircuitBreaker(failure_threshold=3, reset_timeout_s=2)
print(f"Initial state: {cb.state}")

# Simulate failures
for i in range(5):
    if cb.can_execute():
        print(f"  Request {i}: SENT (state={cb.state})")
        cb.record_failure()
    else:
        print(f"  Request {i}: REJECTED by circuit breaker (state={cb.state})")

print("\nWaiting for reset timeout...")
time.sleep(2.1)

if cb.can_execute():
    print(f"  Request 5: SENT (state={cb.state})")
    cb.record_success()
    print(f"  Recovery! State: {cb.state}")