# Week 9: Performance Profiling
### BenchRight LLM Evaluation Master Program (18 Weeks)

---

## 🎯 Learning Objectives

By the end of this notebook, you will:

1. Understand performance profiling concepts and metrics
2. Use the `PerformanceProfiler` class to wrap an ONNX Runtime session
3. Use the `profile_model` function to measure per-prompt metrics
4. Analyze performance data using pandas DataFrames
5. Generate and interpret summary statistics (mean, std)

---

## 🧠 Why Performance Profiling Matters

### The Challenge

Deploying LLMs to production requires understanding their performance characteristics:

| Metric | What It Measures | Business Impact |
|--------|------------------|------------------|
| **Latency** | Time from input to output | User experience, SLAs |
| **Throughput** | Tokens processed per second | Cost efficiency |
| **Memory** | RAM/VRAM consumption | Hardware requirements |

### Why Profile?

- Predict infrastructure costs
- Set realistic SLAs
- Compare model configurations
- Identify optimization opportunities
- Plan capacity for production

---

## 🛠️ Step 1: Setup & Dependencies

In [None]:
# Standard library imports
import sys
import time
from typing import Dict, Any, List

# Add the current directory to the import path so the `src` package can be found (e.g. in Colab)
sys.path.insert(0, '.')

# Install dependencies if needed
# !pip install pandas numpy psutil

import pandas as pd
import numpy as np

print("✅ Setup complete!")

---

## 📦 Step 2: Import the Performance Profiler Module

In [None]:
# Import the performance profiling functions
from src.benchmark_engine.performance_profiler import (
    PerformanceProfiler,
    profile_model,
    create_mock_profiler,
)

print("✅ Performance profiler module imported successfully!")
print("\n📋 Available components:")
print("   - PerformanceProfiler: Class for profiling ONNX models")
print("   - profile_model: Function that returns a DataFrame with metrics")
print("   - create_mock_profiler: Helper for testing without real models")

---

## 🔧 Step 3: Understanding Performance Metrics

The `profile_model` function measures:
- **latency_ms**: Wall-clock time in milliseconds
- **tokens_per_second**: Throughput, computed as (input tokens + output tokens) / elapsed seconds
- **memory_usage_mb**: Memory consumption in megabytes (if available)
- **input_tokens**: Number of input tokens
- **output_tokens**: Number of output tokens
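
To make these definitions concrete, here is a minimal sketch of how such metrics could be computed for a single call. It is not the `profile_model` implementation: `measure_call` and `echo_model` are hypothetical names, tokens are approximated by whitespace splitting rather than a real tokenizer, and memory is read as the process resident set size only when `psutil` can be imported.

```python
import time
from typing import Any, Callable, Dict, Optional


def measure_call(model_fn: Callable[[str], str], prompt: str) -> Dict[str, Any]:
    """Time one call to model_fn and derive the metrics listed above.

    Hypothetical helper for illustration; token counts are approximated
    by whitespace splitting, which a real tokenizer would replace.
    """
    start = time.perf_counter()  # monotonic, high-resolution clock
    output = model_fn(prompt)
    elapsed_s = time.perf_counter() - start

    input_tokens = len(prompt.split())
    output_tokens = len(output.split())
    total_tokens = input_tokens + output_tokens

    memory_mb: Optional[float] = None
    try:
        import psutil  # optional dependency
        memory_mb = psutil.Process().memory_info().rss / (1024 * 1024)
    except ImportError:
        pass  # leave memory as None when psutil is missing

    return {
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "latency_ms": elapsed_s * 1000.0,
        # Throughput counts input and output tokens together.
        "tokens_per_second": total_tokens / elapsed_s if elapsed_s > 0 else float("inf"),
        "memory_usage_mb": memory_mb,
    }


def echo_model(prompt: str) -> str:
    """Stand-in model: sleeps briefly, then echoes the prompt."""
    time.sleep(0.005)
    return f"echo {prompt}"


metrics = measure_call(echo_model, "What is the capital of France?")
print(metrics)
```

Using `time.perf_counter()` rather than `time.time()` avoids jumps from system clock adjustments, which matters when individual calls take only a few milliseconds.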

In [None]:
# Create a mock model for demonstration
# In production, you would use a real ONNX model

def mock_model(prompt: str) -> str:
    """A mock model that simulates varying inference times."""
    # Simulate longer inference for longer prompts
    base_delay = 0.01  # 10ms base
    length_factor = len(prompt.split()) * 0.002  # 2ms per word
    time.sleep(base_delay + length_factor)
    return f"Response to: {prompt[:30]}..."

# Create a mock profiler
profiler_fn = create_mock_profiler(mock_model)

print("✅ Mock model created!")
print("   This simulates varying latency based on prompt length.")

---

## 🧪 Step 4: Profile Individual Prompts

In [None]:
# Profile a single prompt
print("📊 Single Prompt Profiling")
print("=" * 60)

test_prompt = "What is the capital of France?"
result = profiler_fn(test_prompt)

print(f"\n📝 Prompt: {test_prompt}")
print("\n📈 Metrics:")
print(f"   Input tokens:  {result['input_tokens']}")
print(f"   Output tokens: {result['output_tokens']}")
print(f"   Latency:       {result['latency_ms']:.2f} ms")
print(f"   Tokens/sec:    {result['tokens_per_second']:.2f}")
if result['memory_usage_mb'] is not None:
    print(f"   Memory:        {result['memory_usage_mb']:.2f} MB")
else:
    print(f"   Memory:        N/A (psutil not available)")

---

## 🧪 Step 5: Profile Multiple Prompts

In [None]:
# Define test prompts of varying complexity
test_prompts = [
    "Hello world",
    "What is the capital of France?",
    "Explain machine learning in simple terms for a beginner.",
    "Write a short poem about the ocean and its waves crashing on the shore.",
    "Summarize the key principles of software engineering, including topics like abstraction, modularity, and testing.",
]

print("📊 Multi-Prompt Profiling")
print("=" * 60)

# Profile all prompts
results = []
for i, prompt in enumerate(test_prompts, 1):
    result = profiler_fn(prompt)
    results.append(result)
    print(f"\n[{i}/{len(test_prompts)}] Profiled: '{prompt[:40]}...'")
    print(f"    Latency: {result['latency_ms']:.2f} ms | Tokens/s: {result['tokens_per_second']:.2f}")

print("\n✅ All prompts profiled!")

---

## 📊 Step 6: Create and Analyze DataFrame

In [None]:
# Create DataFrame from results
df = pd.DataFrame(results)

print("📊 Performance Results DataFrame")
print("=" * 80)

# Add a shortened prompt column so the table stays readable
df['prompt_short'] = df['prompt'].apply(lambda x: x[:30] + '...' if len(x) > 30 else x)

# Display the key columns (display() is available in Jupyter/Colab)
display(df[['prompt_short', 'input_tokens', 'latency_ms', 'tokens_per_second']])

---

## 📈 Step 7: Generate Summary Statistics

In [None]:
print("📊 Summary Statistics")
print("=" * 60)

# Latency stats
print("\n⏱️ Latency (ms):")
print(f"   Mean: {df['latency_ms'].mean():.2f}")
print(f"   Std:  {df['latency_ms'].std():.2f}")
print(f"   Min:  {df['latency_ms'].min():.2f}")
print(f"   Max:  {df['latency_ms'].max():.2f}")

# Tokens per second stats
print("\n🚀 Tokens per second:")
print(f"   Mean: {df['tokens_per_second'].mean():.2f}")
print(f"   Std:  {df['tokens_per_second'].std():.2f}")

# Token counts
print("\n📝 Token counts:")
print(f"   Total input tokens:  {df['input_tokens'].sum()}")
print(f"   Total output tokens: {df['output_tokens'].sum()}")
# Total time is derived from latency_ms, the latency column documented in Step 3
print(f"   Total time: {df['latency_ms'].sum() / 1000:.4f} seconds")

# Memory stats (if available)
if df['memory_usage_mb'].notna().any():
    print("\n💾 Memory usage (MB):")
    print(f"   Mean: {df['memory_usage_mb'].mean():.2f}")
    print(f"   Max:  {df['memory_usage_mb'].max():.2f}")
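
The same kind of summary can be produced in one call with pandas' `agg`. The sketch below uses a small hand-written DataFrame with the column names from this notebook; the numbers are made up for illustration, not measured.

```python
import pandas as pd

# Illustrative values only; a real run would use the profiling results.
demo_df = pd.DataFrame({
    "latency_ms": [14.0, 22.0, 28.0, 38.0, 40.0],
    "tokens_per_second": [400.0, 350.0, 330.0, 310.0, 300.0],
})

# One row per statistic, one column per metric.
summary = demo_df[["latency_ms", "tokens_per_second"]].agg(["mean", "std", "min", "max"])
print(summary.round(2))
```

Note that pandas' `std` computes the sample standard deviation (`ddof=1`) by default, which differs from NumPy's `np.std` default (`ddof=0`). With only a handful of prompts the difference is noticeable.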

---

## 📋 Step 8: Analyze Performance Patterns

In [None]:
print("📊 Performance Pattern Analysis")
print("=" * 60)

# Find slowest and fastest prompts
slowest_idx = df['latency_ms'].idxmax()
fastest_idx = df['latency_ms'].idxmin()

print("\n🐢 Slowest prompt:")
print(f"   Prompt: {df.loc[slowest_idx, 'prompt'][:50]}...")
print(f"   Latency: {df.loc[slowest_idx, 'latency_ms']:.2f} ms")
print(f"   Input tokens: {df.loc[slowest_idx, 'input_tokens']}")

print("\n🚀 Fastest prompt:")
print(f"   Prompt: {df.loc[fastest_idx, 'prompt'][:50]}...")
print(f"   Latency: {df.loc[fastest_idx, 'latency_ms']:.2f} ms")
print(f"   Input tokens: {df.loc[fastest_idx, 'input_tokens']}")

# Correlation analysis
correlation = df['input_tokens'].corr(df['latency_ms'])
print(f"\n📈 Correlation (input_tokens vs latency): {correlation:.3f}")

if correlation > 0.7:
    print("   Strong positive correlation: Longer prompts take more time.")
elif correlation > 0.3:
    print("   Moderate positive correlation: Some relationship between length and latency.")
else:
    print("   Weak or no correlation: Latency may depend on other factors.")
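
A correlation coefficient says how tightly latency tracks prompt length, but not how much each extra token costs. One way to estimate that is a straight-line fit. The sketch below uses synthetic data that follows the mock model's timing rule (10 ms base plus 2 ms per word) exactly, so the fit recovers those two numbers; real measurements would be noisier.

```python
import numpy as np

# Synthetic data following the mock model's rule: 10 ms + 2 ms per word.
input_tokens = np.array([2, 6, 9, 14, 15], dtype=float)
latency_ms = 10.0 + 2.0 * input_tokens

# A degree-1 polyfit returns the coefficients as [slope, intercept].
slope, intercept = np.polyfit(input_tokens, latency_ms, 1)

print(f"Estimated cost per token: {slope:.2f} ms")
print(f"Estimated fixed overhead: {intercept:.2f} ms")
```

The slope approximates the marginal cost per input token and the intercept the fixed per-request overhead, which is often more actionable for capacity planning than the correlation value alone.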

---

## 📊 Step 9: Percentile Analysis

In [None]:
print("📊 Percentile Analysis")
print("=" * 60)

# Calculate percentiles for latency
percentiles = {
    'P50': df['latency_ms'].quantile(0.50),
    'P75': df['latency_ms'].quantile(0.75),
    'P90': df['latency_ms'].quantile(0.90),
    'P95': df['latency_ms'].quantile(0.95),
    'P99': df['latency_ms'].quantile(0.99),
}

print("\n⏱️ Latency Percentiles:")
for name, value in percentiles.items():
    print(f"   {name}: {value:.2f} ms")

print("\n📝 Interpretation:")
print(f"   - 50% of requests complete in under {percentiles['P50']:.2f} ms (P50)")
print(f"   - 95% of requests complete in under {percentiles['P95']:.2f} ms (P95)")
print(f"   - The slowest 1% take over {percentiles['P99']:.2f} ms (P99)")
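
With only five prompts, high percentiles are interpolated between the two largest observations, so P95 and P99 sit just below the maximum and say little about true tail behavior. The sketch below illustrates this with made-up latencies; the values are assumptions for demonstration only.

```python
import pandas as pd

# Five made-up latencies (ms), one of them an outlier.
small_sample = pd.Series([10.0, 12.0, 14.0, 16.0, 100.0])

# quantile() uses linear interpolation between neighboring values by default.
p50 = small_sample.quantile(0.50)
p99 = small_sample.quantile(0.99)

print(f"P50:  {p50:.2f} ms")
print(f"P99:  {p99:.2f} ms (max is {small_sample.max():.2f} ms)")
print(f"Mean: {small_sample.mean():.2f} ms")
```

Here the single outlier pulls the mean well above the median, and P99 is little more than the maximum restated. As a rough rule of thumb, a tail percentile only becomes informative once the sample holds many more measurements than the percentile implies, for example several hundred requests for P99.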

---

## 🔧 Step 10: Using profile_model with Real ONNX Models

When you have a real ONNX model, use `profile_model` directly:

In [None]:
# Example code for profiling a real ONNX model
# Uncomment and modify when you have an actual model

# from src.benchmark_engine.performance_profiler import profile_model
#
# prompts = [
#     "What is the capital of France?",
#     "Explain machine learning.",
#     "Write a poem about the ocean.",
# ]
#
# # Profile the model with summary statistics
# df = profile_model(
#     model_path="/path/to/model.onnx",
#     prompts=prompts,
#     tokenizer_name="gpt2",
#     warmup_runs=1,
#     num_runs=3,  # Average over 3 runs
#     print_summary=True
# )
#
# print(df)

print("📝 Note: Replace with your ONNX model path when available.")
print("   The profile_model function will:")
print("   1. Load the ONNX model and tokenizer")
print("   2. Profile each prompt")
print("   3. Return a DataFrame with all metrics")
print("   4. Optionally print summary statistics")
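
The `warmup_runs` and `num_runs` parameters in the commented example suggest discarding initial runs and averaging over repeats. How `profile_model` does this internally is not shown in this notebook, so the following is only a sketch of the general idea, with `time_with_warmup` and `fake_inference` as hypothetical names.

```python
import statistics
import time
from typing import Callable, List


def time_with_warmup(
    fn: Callable[[], object],
    warmup_runs: int = 1,
    num_runs: int = 3,
) -> List[float]:
    """Return per-run latencies in ms, excluding warm-up runs.

    Hypothetical helper, not part of the profiler module.
    """
    for _ in range(warmup_runs):
        fn()  # result and timing are deliberately discarded

    latencies_ms = []
    for _ in range(num_runs):
        start = time.perf_counter()
        fn()
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    return latencies_ms


calls: List[int] = []  # records every invocation, warm-up included


def fake_inference() -> None:
    """Stand-in for a model call; the first call is slower (simulated cold start)."""
    calls.append(1)
    time.sleep(0.02 if len(calls) == 1 else 0.005)


runs = time_with_warmup(fake_inference, warmup_runs=1, num_runs=3)
print(f"Calls made: {len(calls)}, runs timed: {len(runs)}")
print(f"Mean over timed runs: {statistics.mean(runs):.2f} ms")
```

Because the slow first call is absorbed by the warm-up, the reported mean reflects steady-state latency. That is useful for comparing configurations, but it hides the cold-start cost a first user would see.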

---

## 🎓 Mini-Project: Performance Audit

### Task

Create a performance audit of a model that covers short, medium, and long prompts and reports latency percentiles, correlations, and outliers.

### Template

In [None]:
# Your performance audit code here

# Step 1: Define your prompts
# audit_prompts = [
#     # Short prompts
#     "Hello world",
#     "What is AI?",
#     # Medium prompts
#     ...
#     # Long prompts
#     ...
# ]

# Step 2: Profile the model
# df = profile_model(
#     model_path="your_model.onnx",
#     prompts=audit_prompts,
#     print_summary=True
# )

# Step 3: Analyze results
# - Calculate percentiles
# - Find correlations
# - Identify outliers

# Step 4: Create your audit report
# Export to /examples/week09_performance_audit.md

print("📝 Complete the mini-project using the template above.")
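
For Step 4, one possible shape for the report is a short Markdown document built from the DataFrame. The sketch below assumes the `latency_ms` and `tokens_per_second` columns used throughout this notebook; `build_audit_report` is a hypothetical helper and the sample numbers are made up.

```python
import pandas as pd


def build_audit_report(df: pd.DataFrame, title: str = "Week 9 Performance Audit") -> str:
    """Render latency and throughput summaries as a Markdown string."""
    latency = df["latency_ms"]
    rows = [
        ("Prompts profiled", f"{len(df)}"),
        ("Mean latency (ms)", f"{latency.mean():.2f}"),
        ("P50 latency (ms)", f"{latency.quantile(0.50):.2f}"),
        ("P95 latency (ms)", f"{latency.quantile(0.95):.2f}"),
        ("Max latency (ms)", f"{latency.max():.2f}"),
        ("Mean tokens/sec", f"{df['tokens_per_second'].mean():.2f}"),
    ]
    lines = [f"# {title}", "", "| Metric | Value |", "|--------|-------|"]
    lines += [f"| {name} | {value} |" for name, value in rows]
    return "\n".join(lines) + "\n"


# Made-up measurements, for illustration only.
sample_df = pd.DataFrame({
    "latency_ms": [14.0, 22.0, 28.0, 38.0, 40.0],
    "tokens_per_second": [400.0, 350.0, 330.0, 310.0, 300.0],
})
report = build_audit_report(sample_df)
print(report)
```

The function returns a string rather than writing a file, so you can inspect the report in the notebook first and then save it to the export path named in the template.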

---

## 🤔 Paul-Elder Critical Thinking Questions

Reflect on these questions:

### Question 1: EVIDENCE
**If latency varies significantly between runs, what evidence would help explain this?**
*Consider: System load, garbage collection, JIT compilation, memory pressure.*

### Question 2: ASSUMPTIONS
**What assumptions are we making when we profile on a single machine?**
*Consider: Production hardware, network latency, concurrent users, cold starts.*

### Question 3: IMPLICATIONS
**If we set an SLA based on mean latency, what could go wrong?**
*Consider: Tail latency (P99), outliers, worst-case scenarios.*

---

## ⚠️ Limitations of Performance Profiling

### What These Metrics DON'T Cover

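1. **Quality Trade-offs:** Fast inference says nothing about output quality
2. **Concurrent Load:** Single-request latency differs from latency under concurrent production traffic
3. **Cold Start:** The first request after a model load is often slower and is not measured separately here
4. **Network Latency:** Real deployments add API and network overhead on top of inference time
5. **Memory Leaks:** These only show up in long-running tests, not in short profiling runs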
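
As an illustration of the concurrent-load point, the sketch below sends several requests at once through a thread pool and records per-request latency alongside total wall time. `sleepy_model` is a sleep-based stand-in and `timed_call` is a hypothetical helper. Because sleeping threads do not compete for compute, this mock will not show the slowdown that a real model under contention typically would.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import List


def sleepy_model(prompt: str) -> str:
    """Sleep-based stand-in for inference (about 10 ms per call)."""
    time.sleep(0.01)
    return f"Response to: {prompt}"


def timed_call(prompt: str) -> float:
    """Return the latency of one sleepy_model call in milliseconds."""
    start = time.perf_counter()
    sleepy_model(prompt)
    return (time.perf_counter() - start) * 1000.0


prompts = [f"prompt {i}" for i in range(8)]

wall_start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    # map preserves input order; each worker times its own request.
    latencies_ms: List[float] = list(pool.map(timed_call, prompts))
wall_s = time.perf_counter() - wall_start

print(f"Requests: {len(latencies_ms)}")
print(f"Mean per-request latency: {sum(latencies_ms) / len(latencies_ms):.2f} ms")
print(f"Requests per second: {len(latencies_ms) / wall_s:.1f}")
```

Reporting both per-request latency and overall requests per second keeps the two views separate: the first is what a single user experiences, the second is what the system sustains.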

### Future Improvements (TODO)

- GPU memory tracking
- Concurrent request simulation
- Cold start analysis
- Batch size optimization
- Power consumption metrics

---

## ✅ Knowledge Mastery Checklist

Before moving to Week 10, ensure you can check all boxes:

- [ ] I understand why performance profiling is critical for LLM deployment
- [ ] I can use `PerformanceProfiler` to profile an ONNX model
- [ ] I can use `profile_model` to generate a metrics DataFrame
- [ ] I can calculate and interpret summary statistics (mean, std)
- [ ] I understand percentile metrics (P50, P95, P99)
- [ ] I know the limitations of performance profiling

---

**Week 9 Complete!** 🎉

**Next:** *Week 10: Regression Tests*