
# 12. Latency Percentiles: P50, P90, P95, P99

---

## What You'll Learn

1. **What percentiles are** and why they matter more than averages
2. **How to measure** and visualize latency distributions
3. **P50, P90, P95, P99** - what each one tells you
4. **Why LLM inference latency is right-skewed** and what causes the tail
5. **The user experience impact** - why P99 matters for real products
6. **How load affects distribution** - what happens as you push the system harder

---

### Why Not Just Use the Average?

If your average latency is 200ms but 1% of requests take 5 seconds, you have a problem that the average hides. At millions of requests per day, that 1% adds up to tens of thousands of slow responses every day.
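
To make this concrete, here is a minimal sketch with made-up numbers: 990 requests at 150 ms and 10 requests at 5,000 ms. Both values are chosen purely for illustration and are not measurements.

```python
import numpy as np

# Hypothetical workload: 99% fast requests, 1% very slow ones (illustrative values)
fast = np.full(990, 150.0)    # ms
slow = np.full(10, 5000.0)    # ms
sample = np.concatenate([fast, slow])

mean_ms = sample.mean()
median_ms = np.median(sample)
slow_fraction = np.mean(sample > 1000)  # share of requests slower than 1 second

print(f"Mean:   {mean_ms:.1f} ms")
print(f"Median: {median_ms:.1f} ms")
print(f"Slower than 1s: {slow_fraction:.1%}")
```

The mean comes out at 198.5 ms and the median at 150 ms, so neither number reveals that one request in a hundred takes 5 seconds.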

In [None]:
!pip install matplotlib numpy scipy -q

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
import time
from typing import List

np.random.seed(42)

plt.rcParams['figure.figsize'] = (14, 6)
plt.rcParams['font.size'] = 12
plt.rcParams['axes.grid'] = True
plt.rcParams['grid.alpha'] = 0.3

print("Setup complete!")

## Part 1: Simulating LLM Inference Latencies

Real LLM inference latency is affected by:
- **Input length** (prefill time varies)
- **Output length** (more tokens = more time)
- **Queue depth** (waiting behind other requests)
- **Batch effects** (larger batches = more per-step time)
- **System noise** (GC pauses, network jitter, cache misses)

This creates a **right-skewed distribution** (long tail on the right).

In [None]:
def simulate_llm_latencies(n_requests=5000, 
                            base_latency=0.15,
                            load_factor=1.0):
    """Simulate realistic LLM inference latencies.
    
    The distribution is a mixture:
    - Base component: lognormal (typical request processing)
    - Output length variation: gamma distributed
    - Queue delays: exponential (increases with load)
    - Occasional spikes: rare but large delays
    """
    # Base processing time (lognormal, a common model for latency data)
    base = np.random.lognormal(
        mean=np.log(base_latency), 
        sigma=0.3, 
        size=n_requests
    )
    
    # Variable output length contribution
    output_time = np.random.gamma(shape=2, scale=0.05, size=n_requests)
    
    # Queue waiting time (increases with load)
    queue_delay = np.random.exponential(
        scale=0.02 * load_factor, 
        size=n_requests
    )
    
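    # Occasional spikes: about 2% of requests at load_factor=1.0 (GC pauses, cache misses, etc.).
    # Both the spike probability and the spike size grow with load_factor.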
    spike_mask = np.random.random(n_requests) < 0.02 * load_factor
    spikes = spike_mask * np.random.exponential(scale=0.5 * load_factor, size=n_requests)
    
    total = base + output_time + queue_delay + spikes
    return total

# Generate baseline latencies
latencies = simulate_llm_latencies(n_requests=5000, load_factor=1.0)

print(f"Generated {len(latencies)} latency measurements")
print(f"Min:    {latencies.min()*1000:.1f} ms")
print(f"Max:    {latencies.max()*1000:.1f} ms")
print(f"Mean:   {latencies.mean()*1000:.1f} ms")
print(f"Median: {np.median(latencies)*1000:.1f} ms")
print(f"Std:    {latencies.std()*1000:.1f} ms")
print(f"\nNotice: Mean ({latencies.mean()*1000:.1f}ms) > Median ({np.median(latencies)*1000:.1f}ms)")
print("This indicates a right-skewed distribution!")

## Part 2: Visualizing the Distribution

Let's look at what this distribution actually looks like.

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(18, 6))

# Histogram
ax = axes[0]
latencies_ms = latencies * 1000
ax.hist(latencies_ms, bins=80, color='steelblue', edgecolor='black', 
        alpha=0.7, density=True)

# Mark mean and median
ax.axvline(np.mean(latencies_ms), color='red', linewidth=2, linestyle='-', 
           label=f'Mean = {np.mean(latencies_ms):.1f}ms')
ax.axvline(np.median(latencies_ms), color='green', linewidth=2, linestyle='--', 
           label=f'Median = {np.median(latencies_ms):.1f}ms')

ax.set_xlabel('Latency (ms)', fontsize=13)
ax.set_ylabel('Density', fontsize=13)
ax.set_title('LLM Inference Latency Distribution\n(Right-skewed: long tail on the right)', 
             fontsize=14, fontweight='bold')
ax.legend(fontsize=11)

# CDF (Cumulative Distribution Function)
ax = axes[1]
sorted_latencies = np.sort(latencies_ms)
cdf = np.arange(1, len(sorted_latencies) + 1) / len(sorted_latencies)

ax.plot(sorted_latencies, cdf * 100, color='steelblue', linewidth=2)
ax.fill_between(sorted_latencies, cdf * 100, alpha=0.1, color='steelblue')

# Mark key percentiles
for p, color, style in [(50, 'green', '--'), (90, 'orange', '--'), 
                         (95, 'darkorange', ':'), (99, 'red', '-')]:
    val = np.percentile(latencies_ms, p)
    ax.axhline(y=p, color=color, linestyle=style, alpha=0.5)
    ax.axvline(x=val, color=color, linestyle=style, alpha=0.5)
    ax.plot(val, p, 'o', color=color, markersize=10, zorder=5)
    ax.annotate(f'P{p} = {val:.0f}ms', xy=(val, p), 
               xytext=(val + 50, p - 5), fontsize=10, fontweight='bold',
               color=color)

ax.set_xlabel('Latency (ms)', fontsize=13)
ax.set_ylabel('Percentile (%)', fontsize=13)
ax.set_title('Cumulative Distribution (CDF)\nRead: X% of requests finish within Y ms', 
             fontsize=14, fontweight='bold')
ax.set_ylim(0, 101)

plt.tight_layout()
plt.show()

## Part 3: Understanding Percentiles

**What does P99 = 500ms mean?**
- 99% of requests complete within 500ms
- 1% of requests take longer than 500ms
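
The definition above can be checked by hand. The sketch below uses a hypothetical helper, `nearest_rank_percentile`, to implement the "smallest value with at least p% of the data at or below it" reading, and compares it with `np.percentile`, which interpolates linearly between neighbouring samples by default. The sample data (1 ms to 100 ms) is illustrative only.

```python
import math
import numpy as np

def nearest_rank_percentile(values, p):
    """Smallest observed value with at least p% of the data at or below it."""
    ordered = np.sort(np.asarray(values, dtype=float))
    rank = max(1, math.ceil(p * len(ordered) / 100))  # 1-based rank
    return ordered[rank - 1]

latencies_demo = np.arange(1, 101, dtype=float)  # 1 ms, 2 ms, ..., 100 ms

p99_rank = nearest_rank_percentile(latencies_demo, 99)
p99_interp = np.percentile(latencies_demo, 99)  # NumPy default: linear interpolation
share_within = np.mean(latencies_demo <= p99_rank)

print(f"Nearest-rank P99:  {p99_rank:.2f} ms")
print(f"np.percentile P99: {p99_interp:.2f} ms")
print(f"Share of requests at or below nearest-rank P99: {share_within:.0%}")
```

On this data the nearest-rank P99 is 99 ms and the interpolated one is 99.01 ms. With thousands of samples the two definitions agree closely, but with small samples the choice of method can matter.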

Let's calculate all key percentiles and understand what each one tells us.

In [None]:
def analyze_percentiles(latencies_ms, name=""):
    """Calculate and display all key percentiles."""
    percentiles = [50, 75, 90, 95, 99, 99.5, 99.9]
    
    print(f"{'='*60}")
    print(f"  Latency Percentile Analysis {name}")
    print(f"{'='*60}")
    print(f"  Total requests: {len(latencies_ms):,}")
    print(f"  Mean:   {np.mean(latencies_ms):>8.1f} ms")
    print(f"  Median: {np.median(latencies_ms):>8.1f} ms")
    print(f"  StdDev: {np.std(latencies_ms):>8.1f} ms")
    print(f"{'─'*60}")
    
    for p in percentiles:
        val = np.percentile(latencies_ms, p)
        n_above = np.sum(latencies_ms > val)
        print(f"  P{p:<5} = {val:>8.1f} ms  |  {100-p:>5.1f}% above  |  ~{n_above:>5,} requests slower")
    
    print(f"{'─'*60}")
    print(f"  Min:    {np.min(latencies_ms):>8.1f} ms")
    print(f"  Max:    {np.max(latencies_ms):>8.1f} ms")
    print(f"  Range:  {np.max(latencies_ms) - np.min(latencies_ms):>8.1f} ms")
    print(f"{'='*60}")

analyze_percentiles(latencies * 1000, name="(Normal Load)")

## Part 4: Mean vs Median - The Outlier Problem

The **mean is pulled by outliers**, while the **median is robust**. Let's see this clearly.

In [None]:
# Start with clean data
clean_latencies = np.random.lognormal(mean=np.log(150), sigma=0.2, size=1000)

# Progressively add outliers
outlier_counts = [0, 1, 5, 10, 20, 50]
outlier_value = 5000  # 5 second outlier

fig, axes = plt.subplots(2, 3, figsize=(18, 10))

means = []
medians = []

for idx, n_outliers in enumerate(outlier_counts):
    ax = axes[idx // 3][idx % 3]
    
    data = np.concatenate([
        clean_latencies,
        np.full(n_outliers, outlier_value)
    ])
    
    mean_val = np.mean(data)
    median_val = np.median(data)
    means.append(mean_val)
    medians.append(median_val)
    
    ax.hist(data, bins=50, color='steelblue', edgecolor='black', alpha=0.7)
    ax.axvline(mean_val, color='red', linewidth=2, label=f'Mean={mean_val:.0f}ms')
    ax.axvline(median_val, color='green', linewidth=2, linestyle='--', 
               label=f'Median={median_val:.0f}ms')
    
    pct_outliers = n_outliers / (1000 + n_outliers) * 100
    ax.set_title(f'{n_outliers} outliers ({pct_outliers:.1f}%)', fontweight='bold')
    ax.legend(fontsize=9)
    ax.set_xlabel('Latency (ms)')

plt.suptitle('How Outliers Affect Mean vs Median\n(Same base distribution, adding 5000ms outliers)', 
             fontsize=15, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()

print("\nSummary:")
print(f"{'Outliers':>10} | {'Mean':>10} | {'Median':>10} | {'Mean Shift':>10}")
print("-" * 50)
for n, m, md in zip(outlier_counts, means, medians):
    shift = m - means[0]
    print(f"{n:>10} | {m:>8.0f}ms | {md:>8.0f}ms | {shift:>+8.0f}ms")

Notice how just 50 outliers out of 1050 requests (4.8%) can dramatically shift the mean, while the median barely moves! This is why **percentiles are the standard for latency reporting**.
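
The size of the shift follows from simple arithmetic. The sketch below assumes, for simplicity, that all 1,000 clean requests take exactly 150 ms (the simulated data above is spread around that value), and `mean_after_outliers` is a hypothetical helper name.

```python
import numpy as np

def mean_after_outliers(n, base_mean, k, outlier_value):
    """Mean of n samples averaging base_mean plus k samples at outlier_value."""
    return (n * base_mean + k * outlier_value) / (n + k)

# Illustrative numbers: 1000 requests averaging 150 ms, plus 50 outliers at 5000 ms
shifted = mean_after_outliers(n=1000, base_mean=150.0, k=50, outlier_value=5000.0)

# The median of the same data stays at the base value, because far fewer
# than half of the samples are outliers
data = np.concatenate([np.full(1000, 150.0), np.full(50, 5000.0)])
print(f"Mean with outliers:   {shifted:.1f} ms")
print(f"Median with outliers: {np.median(data):.1f} ms")
```

Under these assumptions the mean rises from 150 ms to about 381 ms, while the median stays at 150 ms.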

## Part 5: Why P99 Matters for User Experience

Consider a web page that makes 20 parallel API calls to an LLM backend. The page cannot finish loading until the slowest call returns, so the page load time is determined by the **slowest** call. If the calls have independent latencies, the chance that **at least one** lands in the tail grows quickly with the number of calls.
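
The probability argument can be written down directly. The sketch below assumes independent calls, as stated above; both function names are hypothetical. The second function inverts the relationship: it gives the per-call quantile whose latency must stay within the page budget if the page as a whole is to meet a P99 target.

```python
def prob_any_in_tail(n_calls, tail_fraction=0.01):
    """P(at least one of n independent calls lands in the slowest tail_fraction)."""
    return 1 - (1 - tail_fraction) ** n_calls

def per_call_quantile_needed(n_calls, page_quantile=0.99):
    """Per-call quantile q with q**n_calls == page_quantile.

    Assumes the page latency is the max of n independent calls.
    """
    return page_quantile ** (1 / n_calls)

for n in (1, 5, 20, 50):
    print(f"{n:>2} calls: P(any tail) = {prob_any_in_tail(n):.1%}, "
          f"per-call quantile for a page-level P99 = {per_call_quantile_needed(n):.4%}")
```

For 20 calls the first function gives about 18.2%, and the second shows that a page-level P99 target effectively constrains each call at roughly its P99.95.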

In [None]:
# Simulate: page makes N independent API calls
# Page latency = max of all N calls

n_pages = 10000
api_calls_per_page = [1, 3, 5, 10, 20, 50]

fig, axes = plt.subplots(1, 2, figsize=(16, 6))

p99_by_calls = []
p50_by_calls = []

for n_calls in api_calls_per_page:
    # For each page, take the max of n_calls independent latencies
    page_latencies = []
    for _ in range(n_pages):
        calls = simulate_llm_latencies(n_requests=n_calls, load_factor=1.0)
        page_latencies.append(np.max(calls))
    
    page_latencies = np.array(page_latencies) * 1000  # to ms
    p99_by_calls.append(np.percentile(page_latencies, 99))
    p50_by_calls.append(np.percentile(page_latencies, 50))

# Plot
axes[0].plot(api_calls_per_page, p50_by_calls, 'go-', linewidth=2, markersize=10, label='P50')
axes[0].plot(api_calls_per_page, p99_by_calls, 'ro-', linewidth=2, markersize=10, label='P99')
axes[0].set_xlabel('Number of API Calls per Page', fontsize=13)
axes[0].set_ylabel('Page Latency (ms)', fontsize=13)
axes[0].set_title('Page Latency vs Number of API Calls\n(Page latency = max of all calls)', 
                   fontweight='bold')
axes[0].legend(fontsize=12)

# Probability of hitting the tail: each independent call stays below the
# single-call P99 with probability 0.99, so P(at least one in the tail) = 1 - 0.99**n
prob_no_tail = [(1 - 0.01)**n for n in api_calls_per_page]
prob_any_tail = [1 - p for p in prob_no_tail]

axes[1].bar(range(len(api_calls_per_page)), [p*100 for p in prob_any_tail], 
           tick_label=api_calls_per_page, color='salmon', edgecolor='black')
axes[1].set_xlabel('Number of API Calls per Page', fontsize=13)
axes[1].set_ylabel('Probability (%)', fontsize=13)
axes[1].set_title('Probability that at Least One Call\nHits the P99 Tail', fontweight='bold')

for i, p in enumerate(prob_any_tail):
    axes[1].text(i, p*100 + 1, f'{p*100:.0f}%', ha='center', fontweight='bold')

plt.tight_layout()
plt.show()

print("\nKey insight: With 20 API calls per page, ~18% of page loads will")
print("experience at least one P99-level slow response!")

## Part 6: Latency Under Different Loads

As system load increases, the latency distribution changes dramatically. Let's see how.

In [None]:
load_factors = [0.5, 1.0, 2.0, 4.0, 8.0]
load_names = ['Low (50%)', 'Normal (100%)', 'High (200%)', 'Very High (400%)', 'Overloaded (800%)']

fig, axes = plt.subplots(2, 3, figsize=(18, 10))

all_load_latencies = {}

for idx, (lf, name) in enumerate(zip(load_factors, load_names)):
    lats = simulate_llm_latencies(n_requests=5000, load_factor=lf) * 1000
    all_load_latencies[name] = lats
    
    ax = axes[idx // 3][idx % 3]
    ax.hist(lats, bins=60, color=plt.cm.RdYlGn_r(idx / len(load_factors)), 
            edgecolor='black', alpha=0.7, density=True)
    
    p50 = np.percentile(lats, 50)
    p99 = np.percentile(lats, 99)
    ax.axvline(p50, color='green', linewidth=2, linestyle='--', label=f'P50={p50:.0f}ms')
    ax.axvline(p99, color='red', linewidth=2, label=f'P99={p99:.0f}ms')
    
    ax.set_title(f'Load: {name}', fontweight='bold')
    ax.set_xlabel('Latency (ms)')
    ax.legend(fontsize=9)

# Summary in last subplot
ax = axes[1][2]
p50s = [np.percentile(all_load_latencies[n], 50) for n in load_names]
p90s = [np.percentile(all_load_latencies[n], 90) for n in load_names]
p99s = [np.percentile(all_load_latencies[n], 99) for n in load_names]

x = range(len(load_factors))
ax.plot(x, p50s, 'go-', label='P50', linewidth=2, markersize=8)
ax.plot(x, p90s, 'o-', color='orange', label='P90', linewidth=2, markersize=8)
ax.plot(x, p99s, 'ro-', label='P99', linewidth=2, markersize=8)
ax.set_xticks(x)
ax.set_xticklabels([f'{int(lf*100)}%' for lf in load_factors])
ax.set_xlabel('Load Factor')
ax.set_ylabel('Latency (ms)')
ax.set_title('Percentiles vs Load', fontweight='bold')
ax.legend()

plt.suptitle('How Load Affects Latency Distribution', fontsize=15, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()

## Part 7: Box Plot Comparison

Box plots are excellent for comparing distributions at a glance.
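
It helps to know which numbers a box plot actually draws. The sketch below computes them with a hypothetical helper, `box_plot_stats`, using the 1.5 x IQR whisker rule that Matplotlib applies by default; passing `whis=(5, 95)` to `boxplot` places the whiskers at those percentiles instead. The sample data is illustrative.

```python
import numpy as np

def box_plot_stats(values, whis=1.5):
    """Summary numbers behind a box plot using the whis x IQR whisker rule."""
    values = np.asarray(values, dtype=float)
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    iqr = q3 - q1
    lower_fence = q1 - whis * iqr
    upper_fence = q3 + whis * iqr
    inside = values[(values >= lower_fence) & (values <= upper_fence)]
    return {
        "q1": q1, "median": median, "q3": q3, "iqr": iqr,
        # Whiskers end at the most extreme data points inside the fences
        "whisker_low": inside.min(), "whisker_high": inside.max(),
        "n_outliers": int(values.size - inside.size),
    }

rng = np.random.default_rng(0)
demo = rng.lognormal(mean=np.log(150), sigma=0.4, size=2000)  # illustrative latencies (ms)
for key, value in box_plot_stats(demo).items():
    print(f"{key:>12}: {value:.1f}")
```

For right-skewed latency data, most of the points flagged as outliers sit above the upper whisker, which is the long tail showing up in the plot.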

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Box plot
data_for_box = [all_load_latencies[n] for n in load_names]
# whis=(5, 95) puts the whiskers at P5/P95; Matplotlib's default is 1.5x the IQR
bp = axes[0].boxplot(data_for_box, whis=(5, 95),
                     patch_artist=True, showfliers=True,
                     flierprops={'marker': '.', 'markersize': 2, 'alpha': 0.3})
# Label the ticks on the axes (the boxplot `labels` keyword is deprecated in Matplotlib 3.9+)
axes[0].set_xticks(range(1, len(load_factors) + 1))
axes[0].set_xticklabels([f'{int(lf*100)}%' for lf in load_factors])

colors = plt.cm.RdYlGn_r(np.linspace(0.1, 0.9, len(load_factors)))
for patch, color in zip(bp['boxes'], colors):
    patch.set_facecolor(color)
    patch.set_alpha(0.7)

axes[0].set_xlabel('Load Factor', fontsize=13)
axes[0].set_ylabel('Latency (ms)', fontsize=13)
axes[0].set_title('Latency Box Plots by Load\n(Box=P25-P75, Whiskers=P5-P95, Dots=beyond whiskers)', 
                   fontweight='bold')

# Violin plot (shows full distribution shape)
vp = axes[1].violinplot(data_for_box, showmedians=True, showextrema=True)

for idx, body in enumerate(vp['bodies']):
    body.set_facecolor(colors[idx])
    body.set_alpha(0.7)

axes[1].set_xticks(range(1, len(load_factors) + 1))
axes[1].set_xticklabels([f'{int(lf*100)}%' for lf in load_factors])
axes[1].set_xlabel('Load Factor', fontsize=13)
axes[1].set_ylabel('Latency (ms)', fontsize=13)
axes[1].set_title('Latency Violin Plots by Load\n(Width shows density of measurements)', 
                   fontweight='bold')

plt.tight_layout()
plt.show()

## Part 8: Measuring Real Latencies

Let's measure actual computation latencies to see these distributions in practice. We'll use a simple model operation to simulate inference.
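
When timing real code, it is common to discard a few warm-up calls and to time only the operation itself. The helper below, `measure_latencies_ms`, is a hypothetical sketch of that pattern, with a fixed-size matrix multiply as a stand-in workload; the absolute numbers depend entirely on the machine.

```python
import time
import numpy as np

def measure_latencies_ms(fn, n_runs=200, warmup=10):
    """Time repeated calls to fn and return per-call latencies in milliseconds."""
    for _ in range(warmup):
        fn()  # warm-up calls are not recorded (caches, lazy initialisation)
    samples = np.empty(n_runs)
    for i in range(n_runs):
        start = time.perf_counter()
        fn()
        samples[i] = (time.perf_counter() - start) * 1000
    return samples

# Hypothetical stand-in workload: a fixed-size matrix multiply
mat_a = np.random.randn(200, 200)
mat_b = np.random.randn(200, 200)
timings = measure_latencies_ms(lambda: mat_a @ mat_b)

t_p50, t_p99 = np.percentile(timings, [50, 99])
print(f"P50 = {t_p50:.3f} ms, P99 = {t_p99:.3f} ms")
```

Even with identical inputs on every call, the P99 usually sits above the P50 because of scheduler and cache noise.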

In [None]:
def simulate_real_inference(n_requests=1000):
    """Measure real computation latencies using matrix operations.
    
    This simulates the variable-length nature of inference
    by varying matrix sizes (like varying sequence lengths).
    """
    latencies = []
    
    for _ in range(n_requests):
        # Vary the "sequence length" (matrix size)
        size = np.random.randint(100, 500)
        
        # Build the inputs before starting the timer so that only the
        # computation is measured, not the random number generation
        a = np.random.randn(size, size)
        b = np.random.randn(size, size)
        
        start = time.perf_counter()
        
        # Simulate inference computation
        c = a @ b  # Matrix multiply
        _ = np.linalg.norm(c)  # Reduction
        
        end = time.perf_counter()
        latencies.append((end - start) * 1000)  # ms
    
    return np.array(latencies)

print("Measuring 1000 real computation latencies...")
real_latencies = simulate_real_inference(1000)
print("Done!\n")

analyze_percentiles(real_latencies, name="(Real Computation)")

In [None]:
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Histogram
axes[0].hist(real_latencies, bins=50, color='steelblue', edgecolor='black', alpha=0.7)
axes[0].axvline(np.mean(real_latencies), color='red', linewidth=2, label=f'Mean={np.mean(real_latencies):.1f}ms')
axes[0].axvline(np.median(real_latencies), color='green', linewidth=2, linestyle='--', 
                label=f'Median={np.median(real_latencies):.1f}ms')
axes[0].set_title('Real Computation Latency Distribution', fontweight='bold')
axes[0].set_xlabel('Latency (ms)')
axes[0].legend()

# CDF
sorted_real = np.sort(real_latencies)
cdf = np.arange(1, len(sorted_real) + 1) / len(sorted_real) * 100
axes[1].plot(sorted_real, cdf, color='steelblue', linewidth=2)
for p in [50, 90, 95, 99]:
    val = np.percentile(real_latencies, p)
    axes[1].axhline(y=p, color='gray', linestyle=':', alpha=0.3)
    axes[1].plot(val, p, 'o', markersize=8)
    axes[1].annotate(f'P{p}={val:.1f}ms', xy=(val, p), xytext=(val+1, p-3), fontsize=9)
axes[1].set_title('CDF of Real Latencies', fontweight='bold')
axes[1].set_xlabel('Latency (ms)')
axes[1].set_ylabel('Percentile')

# Box plot
axes[2].boxplot(real_latencies, patch_artist=True,
                boxprops=dict(facecolor='steelblue', alpha=0.7))
axes[2].set_title('Box Plot of Real Latencies', fontweight='bold')
axes[2].set_ylabel('Latency (ms)')

plt.tight_layout()
plt.show()

## Part 9: The P99/P50 Ratio - A Key Health Metric

The **ratio of P99 to P50** tells you how predictable your system is:
- P99/P50 < 2x: Very consistent
- P99/P50 = 2-5x: Normal for most systems
- P99/P50 > 10x: Highly variable, needs investigation
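
These bands are easy to turn into a small check. In the sketch below, `tail_ratio` and `classify_tail_ratio` are hypothetical helpers. The bands above do not name the 5-10x range, so the label used for it here is an assumption.

```python
import numpy as np

def tail_ratio(latencies_ms):
    """Return P99 / P50 for a collection of latency samples."""
    p50, p99 = np.percentile(latencies_ms, [50, 99])
    return p99 / p50

def classify_tail_ratio(ratio):
    """Map a P99/P50 ratio onto the rule-of-thumb bands listed above."""
    if ratio < 2:
        return "very consistent"
    if ratio <= 5:
        return "normal"
    if ratio <= 10:
        # Assumption: the bands above do not name the 5-10x range
        return "elevated (assumed label)"
    return "highly variable"

rng = np.random.default_rng(1)
demo = rng.lognormal(mean=np.log(150), sigma=0.5, size=5000)  # illustrative data
r = tail_ratio(demo)
print(f"P99/P50 = {r:.1f}x -> {classify_tail_ratio(r)}")
```

Because the ratio is unitless, it can be compared across models and endpoints whose absolute latencies differ.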

In [None]:
# Compare different system scenarios
scenarios = {
    'Well-optimized': simulate_llm_latencies(5000, load_factor=0.3) * 1000,
    'Normal load': simulate_llm_latencies(5000, load_factor=1.0) * 1000,
    'High load': simulate_llm_latencies(5000, load_factor=3.0) * 1000,
    'Overloaded': simulate_llm_latencies(5000, load_factor=8.0) * 1000,
}

fig, ax = plt.subplots(figsize=(12, 6))

scenario_names = list(scenarios.keys())
p50_vals = [np.percentile(scenarios[s], 50) for s in scenario_names]
p90_vals = [np.percentile(scenarios[s], 90) for s in scenario_names]
p95_vals = [np.percentile(scenarios[s], 95) for s in scenario_names]
p99_vals = [np.percentile(scenarios[s], 99) for s in scenario_names]

x = np.arange(len(scenario_names))
width = 0.2

bars1 = ax.bar(x - 1.5*width, p50_vals, width, label='P50', color='#27ae60', edgecolor='black')
bars2 = ax.bar(x - 0.5*width, p90_vals, width, label='P90', color='#f39c12', edgecolor='black')
bars3 = ax.bar(x + 0.5*width, p95_vals, width, label='P95', color='#e67e22', edgecolor='black')
bars4 = ax.bar(x + 1.5*width, p99_vals, width, label='P99', color='#e74c3c', edgecolor='black')

ax.set_xticks(x)
ax.set_xticklabels(scenario_names)
ax.set_ylabel('Latency (ms)', fontsize=13)
ax.set_title('Latency Percentiles Across Different System States', fontsize=14, fontweight='bold')
ax.legend(fontsize=12)

# Add P99/P50 ratio
for i, (p50, p99) in enumerate(zip(p50_vals, p99_vals)):
    ratio = p99 / p50
    ax.text(i, p99 + 10, f'P99/P50={ratio:.1f}x', ha='center', fontweight='bold', fontsize=10)

plt.tight_layout()
plt.show()

## Part 10: Sliding Window Analysis

In production monitoring, you don't just look at overall percentiles. You track them **over time** using a sliding window to detect degradation.
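
A trailing window, where each point uses only the requests seen so far, is a natural fit for monitoring. The sketch below is one possible vectorized version; `rolling_percentile` is a hypothetical helper, `sliding_window_view` requires NumPy 1.20 or newer, and the step-change series is illustrative.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def rolling_percentile(values, window, q, stride=1):
    """Percentile q over each trailing window of `window` samples.

    Returns (end_index, percentile), where end_index[i] is the index of the
    last sample in window i, so each point only uses data seen so far.
    """
    values = np.asarray(values, dtype=float)
    windows = sliding_window_view(values, window)[::stride]
    end_index = np.arange(window - 1, len(values), stride)
    return end_index, np.percentile(windows, q, axis=1)

# Illustrative series: latency steps up from ~100 ms to ~300 ms halfway through
rng = np.random.default_rng(2)
series = np.concatenate([rng.normal(100, 10, 500), rng.normal(300, 30, 500)])
idx, rolling_p99 = rolling_percentile(series, window=100, q=99, stride=10)
print(f"{len(idx)} windows; first P99 = {rolling_p99[0]:.0f} ms, "
      f"last P99 = {rolling_p99[-1]:.0f} ms")
```

Small windows react quickly but make the P99 noisy, since a 100-sample window estimates P99 from only its one or two slowest requests.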

In [None]:
# Simulate time-series of latencies with a degradation event
n_total = 5000
time_series_latencies = []

for i in range(n_total):
    # Normal conditions for first 60%, then degradation
    if i < n_total * 0.6:
        load = 1.0
    elif i < n_total * 0.8:
        load = 3.0  # Spike!
    else:
        load = 1.5  # Partial recovery
    
    lat = simulate_llm_latencies(1, load_factor=load)[0] * 1000
    time_series_latencies.append(lat)

time_series_latencies = np.array(time_series_latencies)

# Sliding window percentiles
window_size = 200
stride = 10

windows = []
p50_series = []
p90_series = []
p99_series = []
mean_series = []

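for start in range(0, n_total - window_size + 1, stride):  # +1 includes the final full window
    window = time_series_latencies[start:start + window_size]
    # Label each window by its last request: a trailing window only uses data seen so far
    windows.append(start + window_size - 1)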
    p50_series.append(np.percentile(window, 50))
    p90_series.append(np.percentile(window, 90))
    p99_series.append(np.percentile(window, 99))
    mean_series.append(np.mean(window))

fig, axes = plt.subplots(2, 1, figsize=(16, 10))

# Raw latencies
axes[0].scatter(range(n_total), time_series_latencies, s=1, alpha=0.3, color='steelblue')
axes[0].axvspan(n_total*0.6, n_total*0.8, alpha=0.1, color='red', label='Degradation event')
axes[0].set_xlabel('Request Number')
axes[0].set_ylabel('Latency (ms)')
axes[0].set_title('Raw Latency Measurements Over Time', fontweight='bold')
axes[0].legend()

# Sliding window percentiles
axes[1].plot(windows, p50_series, color='green', linewidth=2, label='P50')
axes[1].plot(windows, p90_series, color='orange', linewidth=2, label='P90')
axes[1].plot(windows, p99_series, color='red', linewidth=2, label='P99')
axes[1].plot(windows, mean_series, color='blue', linewidth=1.5, linestyle=':', label='Mean')
axes[1].axvspan(n_total*0.6, n_total*0.8, alpha=0.1, color='red')
axes[1].set_xlabel('Request Number')
axes[1].set_ylabel('Latency (ms)')
axes[1].set_title('Sliding Window Percentiles (window=200 requests)\nP99 reacts fastest to degradation!', 
                   fontweight='bold')
axes[1].legend(fontsize=11)

plt.tight_layout()
plt.show()

print("Key observation: P99 spikes before P50 does, making it an early warning signal!")

## Part 11: SLOs and Error Budgets

In production, you set **Service Level Objectives (SLOs)**:
- "P50 latency must be under 200ms"
- "P99 latency must be under 1000ms"
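
An error budget is the amount of non-compliance an SLO tolerates. As a minimal sketch, assume the objective is "hourly P99 under 1000ms in 99.5% of hours"; the 720-hour month and its two bad hours below are made up for illustration, and `error_budget_report` is a hypothetical helper.

```python
import numpy as np

def error_budget_report(p99_per_window_ms, slo_ms=1000.0, target=0.995):
    """Compare how many windows broke the P99 SLO against the allowed number."""
    p99 = np.asarray(p99_per_window_ms, dtype=float)
    n_windows = p99.size
    violations = int(np.sum(p99 >= slo_ms))
    allowed = (1 - target) * n_windows  # the error budget, in windows
    return {
        "compliance": 1 - violations / n_windows,
        "violations": violations,
        "budget_windows": allowed,
        "budget_remaining": allowed - violations,
    }

# Hypothetical month of hourly P99 values: 720 hours, 2 of them above the SLO
hourly_p99_demo = np.full(720, 600.0)
hourly_p99_demo[[100, 500]] = 1400.0

report = error_budget_report(hourly_p99_demo)
for key, value in report.items():
    print(f"{key:>17}: {value:.4f}")
```

Here the budget is 3.6 hours per month and 2 have been spent, so 1.6 hours remain before the objective is missed.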

Let's see how to monitor compliance.

In [None]:
# Define SLOs
SLO_P50 = 200   # ms
SLO_P90 = 400   # ms
SLO_P99 = 1000  # ms

# Check compliance for each scenario
print(f"SLO Targets: P50 < {SLO_P50}ms, P90 < {SLO_P90}ms, P99 < {SLO_P99}ms")
print("=" * 70)

fig, axes = plt.subplots(1, len(scenarios), figsize=(20, 5))

for idx, (name, lats) in enumerate(scenarios.items()):
    p50 = np.percentile(lats, 50)
    p90 = np.percentile(lats, 90)
    p99 = np.percentile(lats, 99)
    
    p50_ok = p50 < SLO_P50
    p90_ok = p90 < SLO_P90
    p99_ok = p99 < SLO_P99
    
    status = "PASS" if (p50_ok and p90_ok and p99_ok) else "FAIL"
    
    print(f"\n{name}:")
    print(f"  P50: {p50:>6.0f}ms {'PASS' if p50_ok else 'FAIL'}")
    print(f"  P90: {p90:>6.0f}ms {'PASS' if p90_ok else 'FAIL'}")
    print(f"  P99: {p99:>6.0f}ms {'PASS' if p99_ok else 'FAIL'}")
    print(f"  Overall: {status}")
    
    # Visual: how requests are distributed across the SLO latency bands
    ax = axes[idx]
    below_200 = np.sum(lats < SLO_P50) / len(lats) * 100
    below_400 = np.sum(lats < SLO_P90) / len(lats) * 100
    below_1000 = np.sum(lats < SLO_P99) / len(lats) * 100
    above_1000 = 100 - below_1000
    
    # Each bar covers one band, so label the bands by their actual range
    ax.bar(['<200ms', '200-400ms', '400-1000ms', '>=1000ms'],
           [below_200, below_400 - below_200, below_1000 - below_400, above_1000],
           color=['#27ae60', '#f39c12', '#e67e22', '#e74c3c'],
           edgecolor='black')
    
    border_color = 'green' if status == 'PASS' else 'red'
    for spine in ax.spines.values():
        spine.set_color(border_color)
        spine.set_linewidth(3)
    
    ax.set_title(f'{name}\n({status})', fontweight='bold',
                 color='green' if status == 'PASS' else 'red')
    ax.set_ylabel('% of Requests')

plt.suptitle('SLO Compliance Dashboard', fontsize=15, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()

## Part 12: Comparing Two Models

Different models have different latency profiles. Let's compare a "fast but variable" model vs a "slower but consistent" model.
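
For lognormal latencies, percentiles have a closed form: the p-th percentile is the median times exp(sigma * z_p), where z_p is the standard normal quantile. The sketch below applies this to the two parameter sets used in this part (median 100 ms with sigma 0.6, and median 130 ms with sigma 0.15); the helper names are hypothetical, and `statistics.NormalDist` needs Python 3.8 or newer.

```python
import math
from statistics import NormalDist

def lognormal_percentile(median_ms, sigma, p):
    """Percentile p (0-100) of a lognormal with the given median and log-space sigma."""
    z = NormalDist().inv_cdf(p / 100)  # standard normal quantile
    return median_ms * math.exp(sigma * z)

def lognormal_mean(median_ms, sigma):
    """Mean of the same lognormal: median * exp(sigma^2 / 2)."""
    return median_ms * math.exp(sigma ** 2 / 2)

# Parameters match the two models compared in this part
for name, median_ms, sigma in [("Model A", 100, 0.6), ("Model B", 130, 0.15)]:
    p99_model = lognormal_percentile(median_ms, sigma, 99)
    print(f"{name}: mean = {lognormal_mean(median_ms, sigma):.0f} ms, "
          f"P50 = {median_ms} ms, P99 = {p99_model:.0f} ms, "
          f"P99/P50 = {p99_model / median_ms:.1f}x")
```

The theoretical P99 is about 404 ms for Model A and about 184 ms for Model B, so the simulated percentiles should land near those values, with some sampling noise.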

In [None]:
# Model A: Fast average but high variance (e.g., model with variable output length)
model_a = np.random.lognormal(mean=np.log(100), sigma=0.6, size=5000)

# Model B: Slightly slower average but consistent (e.g., optimized model with capping)
model_b = np.random.lognormal(mean=np.log(130), sigma=0.15, size=5000)

fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Overlaid histograms
axes[0].hist(model_a, bins=60, alpha=0.5, color='blue', label=f'Model A (mean={np.mean(model_a):.0f}ms)', 
             density=True, edgecolor='blue')
axes[0].hist(model_b, bins=60, alpha=0.5, color='red', label=f'Model B (mean={np.mean(model_b):.0f}ms)', 
             density=True, edgecolor='red')
axes[0].set_title('Model A vs Model B: Distribution', fontweight='bold')
axes[0].set_xlabel('Latency (ms)')
axes[0].legend()

# CDFs
for data, color, name in [(model_a, 'blue', 'Model A'), (model_b, 'red', 'Model B')]:
    sorted_d = np.sort(data)
    cdf = np.arange(1, len(sorted_d)+1) / len(sorted_d) * 100
    axes[1].plot(sorted_d, cdf, color=color, linewidth=2, label=name)

axes[1].axhline(99, color='gray', linestyle=':', alpha=0.5)
axes[1].set_title('CDF Comparison', fontweight='bold')
axes[1].set_xlabel('Latency (ms)')
axes[1].set_ylabel('Percentile')
axes[1].legend()

# Percentile comparison
percentiles = [50, 75, 90, 95, 99, 99.9]
a_percs = [np.percentile(model_a, p) for p in percentiles]
b_percs = [np.percentile(model_b, p) for p in percentiles]

x = np.arange(len(percentiles))
axes[2].bar(x - 0.15, a_percs, 0.3, color='blue', alpha=0.7, label='Model A', edgecolor='black')
axes[2].bar(x + 0.15, b_percs, 0.3, color='red', alpha=0.7, label='Model B', edgecolor='black')
axes[2].set_xticks(x)
axes[2].set_xticklabels([f'P{p}' for p in percentiles])
axes[2].set_title('Percentile Comparison', fontweight='bold')
axes[2].set_ylabel('Latency (ms)')
axes[2].legend()

plt.tight_layout()
plt.show()

print("\nModel A: Lower P50 but MUCH higher P99 (high variance)")
print("Model B: Slightly higher P50 but much better P99 (consistent)")
print(f"\nModel A P99/P50 = {np.percentile(model_a, 99)/np.percentile(model_a, 50):.1f}x")
print(f"Model B P99/P50 = {np.percentile(model_b, 99)/np.percentile(model_b, 50):.1f}x")
print("\nFor production systems, Model B is often preferred despite the higher average!")

## Part 13: Latency Percentiles by Time of Day

In [None]:
# Simulate 24 hours of latency data with varying load
hours = 24
requests_per_hour = 500

# Load varies by time of day (peak at business hours)
hourly_load = {
    0: 0.3, 1: 0.2, 2: 0.2, 3: 0.2, 4: 0.3, 5: 0.5,
    6: 0.8, 7: 1.2, 8: 2.0, 9: 3.0, 10: 3.5, 11: 3.0,
    12: 2.5, 13: 3.0, 14: 3.5, 15: 3.0, 16: 2.5, 17: 2.0,
    18: 1.5, 19: 1.2, 20: 1.0, 21: 0.8, 22: 0.5, 23: 0.3
}

# Collect percentiles per hour
hourly_p50 = []
hourly_p90 = []
hourly_p95 = []
hourly_p99 = []

for h in range(hours):
    lats = simulate_llm_latencies(requests_per_hour, load_factor=hourly_load[h]) * 1000
    hourly_p50.append(np.percentile(lats, 50))
    hourly_p90.append(np.percentile(lats, 90))
    hourly_p95.append(np.percentile(lats, 95))
    hourly_p99.append(np.percentile(lats, 99))

fig, ax = plt.subplots(figsize=(16, 6))

ax.fill_between(range(hours), hourly_p99, alpha=0.2, color='red', label='P99')
ax.fill_between(range(hours), hourly_p95, alpha=0.2, color='orange', label='P95')
ax.fill_between(range(hours), hourly_p90, alpha=0.2, color='yellow', label='P90')
ax.fill_between(range(hours), hourly_p50, alpha=0.3, color='green', label='P50')

ax.plot(range(hours), hourly_p99, 'r-', linewidth=2)
ax.plot(range(hours), hourly_p95, '-', color='orange', linewidth=2)
ax.plot(range(hours), hourly_p90, 'y-', linewidth=2)
ax.plot(range(hours), hourly_p50, 'g-', linewidth=2)

# SLO line
ax.axhline(y=1000, color='red', linestyle='--', linewidth=2, alpha=0.5, label='P99 SLO (1000ms)')

ax.set_xlabel('Hour of Day', fontsize=13)
ax.set_ylabel('Latency (ms)', fontsize=13)
ax.set_title('Latency Percentiles Throughout the Day\n(Load varies with business hours)', 
             fontsize=14, fontweight='bold')
ax.set_xticks(range(hours))
ax.set_xticklabels([f'{h:02d}:00' for h in range(hours)], rotation=45)
ax.legend(loc='upper left', fontsize=10)

# Secondary axis for load
ax2 = ax.twinx()
ax2.bar(range(hours), [hourly_load[h] for h in range(hours)], 
        alpha=0.1, color='blue', label='Load')
ax2.set_ylabel('Load Factor', color='blue', alpha=0.5)

plt.tight_layout()
plt.show()

## Part 14: Quick Reference - Percentile Cheat Sheet

In [None]:
# Create a visual cheat sheet
fig, ax = plt.subplots(figsize=(14, 7))

# Generate sample distribution
sample = simulate_llm_latencies(10000, load_factor=1.0) * 1000

# Plot histogram
counts, bins, _ = ax.hist(sample, bins=100, color='steelblue', edgecolor='black', 
                           alpha=0.5, density=True)

# Color regions
percentile_info = [
    (0, 50, '#27ae60', 'P0-P50: Half of requests\n(the "typical" experience)'),
    (50, 90, '#f39c12', 'P50-P90: Most of the rest\n(still acceptable)'),
    (90, 99, '#e67e22', 'P90-P99: The long tail\n(noticeable to users)'),
    (99, 100, '#e74c3c', 'P99+: The extreme tail\n(worst user experience)')
]

for p_low, p_high, color, label in percentile_info:
    low_val = np.percentile(sample, p_low)
    high_val = np.percentile(sample, p_high) if p_high < 100 else sample.max()
    
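    # bins[:-1] holds the left edge of each bin, so align the recoloured bars on that edge
    mask = (bins[:-1] >= low_val) & (bins[:-1] < high_val)
    ax.bar(bins[:-1][mask], counts[mask], width=np.diff(bins)[0], align='edge',
           color=color, alpha=0.7, edgecolor='black', linewidth=0.3)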

# Add annotations
y_max = ax.get_ylim()[1]
for p in [50, 90, 99]:
    val = np.percentile(sample, p)
    ax.axvline(val, color='black', linewidth=2, linestyle='-')
    ax.text(val, y_max * 0.95, f'P{p}\n{val:.0f}ms', ha='center', fontsize=11, fontweight='bold',
           bbox=dict(boxstyle='round,pad=0.3', facecolor='white', edgecolor='black'))

ax.set_xlabel('Latency (ms)', fontsize=14)
ax.set_ylabel('Density', fontsize=14)
ax.set_title('Latency Percentile Regions - Visual Guide', fontsize=16, fontweight='bold')

# Custom legend
from matplotlib.patches import Patch
legend_elements = [Patch(facecolor=c, label=l) for _, _, c, l in percentile_info]
ax.legend(handles=legend_elements, loc='upper right', fontsize=9, 
          bbox_to_anchor=(0.99, 0.85))

plt.tight_layout()
plt.show()

---

## Key Takeaways

### 1. Averages Lie, Percentiles Tell the Truth
- The **mean** is skewed by outliers and hides tail behavior
- **Percentiles** directly tell you what fraction of users experience what latency
- Always report at minimum: **P50, P90, P99**

### 2. LLM Inference is Right-Skewed
- Variable output lengths, queue depths, and system noise create long tails
- The **P99 is often several times the P50**, and the gap widens as load increases
- This is normal and expected, but needs to be managed

### 3. P99 Matters More Than You Think
- With multiple API calls per page, the probability of hitting the tail skyrockets
- **20 calls per page = ~18% chance** of at least one P99 event
- P99 is often the right SLO target for user-facing products

### 4. Monitor Percentiles Over Time
- **P99 spikes first** during degradation, making it an early warning signal
- Use sliding windows to track trends
- Set SLOs with error budgets (e.g., "P99 < 1s for 99.5% of hours")

### 5. The P99/P50 Ratio is a Health Metric
- < 2x: Very consistent system
- 2-5x: Normal for inference systems
- \> 10x: Something is wrong, investigate

### 6. For Production: Consistency Often Beats Speed
- A model with 130ms P50 and roughly 200ms P99 is often preferred over a model with 100ms P50 and roughly 400ms P99 (the two profiles compared in Part 12)
- Predictability matters for user experience