# Local Inference Baseline

Before optimizing with disaggregation, we need to measure single-node performance. This is the "before" measurement.

## What We're Measuring

- **Throughput**: Tokens generated per second
- **Latency**: Time from request to first token (TTFT) and per-token latency
- **Memory**: GPU memory usage patterns
- **Bottlenecks**: Where time is spent (prefill vs decode)
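
As a minimal sketch of how these metrics relate to each other, the helper below derives them from three timestamps and a token count. The `RequestTiming` record and the `summarize` function are illustrative assumptions, not part of vLLM, and the example numbers are made up.

```python
from dataclasses import dataclass


@dataclass
class RequestTiming:
    """Hypothetical timing record for one request (all times in seconds)."""
    submitted: float      # when the request was sent
    first_token: float    # when the first output token arrived
    finished: float       # when the last output token arrived
    output_tokens: int    # number of generated tokens


def summarize(t: RequestTiming) -> dict:
    """Derive TTFT, per-token decode latency, and throughput."""
    ttft = t.first_token - t.submitted
    decode_time = t.finished - t.first_token
    # The first token comes out of prefill, so decode covers the remaining ones
    per_token = decode_time / max(1, t.output_tokens - 1)
    throughput = t.output_tokens / (t.finished - t.submitted)
    return {
        "ttft_ms": ttft * 1000,
        "per_token_ms": per_token * 1000,
        "tokens_per_sec": throughput,
    }


# Made-up timings, for illustration only
example = summarize(
    RequestTiming(submitted=0.0, first_token=0.05, finished=1.04, output_tokens=100)
)
print(example)
```

Keeping TTFT and per-token latency separate matters here because they map onto the prefill and decode phases that disaggregation splits apart.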

## Why This Matters

Disaggregated serving claims to improve throughput by splitting prefill and decode. To evaluate that claim, we need honest baseline numbers from a well-configured single-node setup using vLLM.

## Step 1: Load Environment Configuration

In [None]:
import json
from pathlib import Path

# Load configuration from previous notebook
config_file = Path("environment_config.json")
if config_file.exists():
    with open(config_file) as f:
        env_config = json.load(f)
    print(f"Loaded config from {config_file}")
    print(f"Hostname: {env_config['hostname']}")
    print(f"GPUs: {env_config['gpus']['count']}x {env_config['gpus']['model']}")
else:
    print("Config file not found. Run 00_Environment_Setup.ipynb first")
    env_config = None

## Step 2: Initialize vLLM Engine

vLLM is a high-performance inference engine built on continuous batching and PagedAttention. That makes it a strong baseline rather than naive sequential inference.

In [None]:
from vllm import LLM, SamplingParams
import torch
import time

MODEL_NAME = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

print(f"Loading model: {MODEL_NAME}")
print("This may take 30-60 seconds on first run...\n")

start_time = time.time()

# Initialize vLLM with near-default settings (trust_remote_code is the exception)
# tensor_parallel_size=1 keeps the model on a single GPU (baseline)
llm = LLM(
    model=MODEL_NAME,
    tensor_parallel_size=1,
    gpu_memory_utilization=0.9,  # Use 90% of GPU memory
    trust_remote_code=True
)

load_time = time.time() - start_time
print(f"✓ Model loaded in {load_time:.2f} seconds")

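# Check GPU memory usage after model load.
# vLLM preallocates its KV cache pool at startup, so these numbers cover
# model weights plus that pool, not weights alone. Depending on the vLLM
# version, the engine may run in a separate process, in which case these
# counters can read close to zero from the notebook process.
if torch.cuda.is_available():
    allocated_mb = torch.cuda.memory_allocated() / 1e6
    reserved_mb = torch.cuda.memory_reserved() / 1e6
    print(f"GPU Memory: {allocated_mb:.0f} MB allocated, {reserved_mb:.0f} MB reserved")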

## Step 3: Single Request Latency Test

Measure time from request submission to response completion for a single prompt. This shows best-case latency with no batching.

In [None]:
# Test prompt
test_prompt = "Explain how HTTP load balancers work in 3 sentences."

# Sampling parameters - control output length
sampling_params = SamplingParams(
    temperature=0.0,  # Deterministic output
    max_tokens=100,   # Limit output length
    top_p=1.0
)

print(f"Prompt: '{test_prompt}'\n")
print("Running single request...")

start = time.perf_counter()  # monotonic, high-resolution clock for benchmarking
outputs = llm.generate([test_prompt], sampling_params)
end = time.perf_counter()

# Extract results
output_text = outputs[0].outputs[0].text
tokens_generated = len(outputs[0].outputs[0].token_ids)
latency_ms = (end - start) * 1000
tokens_per_sec = tokens_generated / (end - start)

print(f"\nResults:")
print(f"  Total latency: {latency_ms:.1f} ms")
print(f"  Tokens generated: {tokens_generated}")
print(f"  Throughput: {tokens_per_sec:.1f} tokens/sec")
print(f"  Per-token latency: {latency_ms/tokens_generated:.1f} ms/token")
print(f"\nOutput:\n{output_text}")
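
A single timed call is noisy, so it can help to repeat the measurement and report a median. The sketch below shows one way to do that with a warmup pass. `time_repeated` is a hypothetical helper, and the `time.sleep` workload is only a stand-in so the example runs without a GPU.

```python
import statistics
import time


def time_repeated(fn, warmup=1, runs=5):
    """Call fn() several times and return latency statistics in milliseconds."""
    for _ in range(warmup):
        fn()  # warmup runs are executed but not timed
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000)
    return {
        "runs": runs,
        "median_ms": statistics.median(samples),
        "min_ms": min(samples),
        "max_ms": max(samples),
    }


# Stand-in workload so the sketch runs without a GPU
stats = time_repeated(lambda: time.sleep(0.01), warmup=1, runs=3)
print(stats)
```

In this notebook the callable could be `lambda: llm.generate([test_prompt], sampling_params)`, which reuses the objects defined above.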

## Step 4: Batch Processing Test

Process multiple requests in a batch. vLLM uses continuous batching to improve throughput. This is more realistic for production workloads.

In [None]:
# Generate multiple test prompts
test_prompts = [
    "What is a REST API?",
    "Explain database indexing.",
    "How does DNS work?",
    "What are microservices?",
    "Describe container orchestration.",
    "What is continuous integration?",
    "Explain message queues.",
    "How does caching improve performance?"
]

batch_size = len(test_prompts)
print(f"Processing batch of {batch_size} requests...\n")

start = time.perf_counter()
outputs = llm.generate(test_prompts, sampling_params)
end = time.perf_counter()

# Calculate aggregate metrics
total_tokens = sum(len(output.outputs[0].token_ids) for output in outputs)
total_time = end - start
throughput = total_tokens / total_time
# Amortized wall-clock time per request. This is not per-request latency:
# requests in the batch run concurrently, so each one can take up to total_time.
avg_latency_per_request = (total_time / batch_size) * 1000

print(f"Batch Results:")
print(f"  Total time: {total_time:.2f} seconds")
print(f"  Total tokens: {total_tokens}")
print(f"  Throughput: {throughput:.1f} tokens/sec")
print(f"  Amortized time per request: {avg_latency_per_request:.1f} ms")
# Estimate only: assumes every request would take as long as the Step 3 request
print(f"  Estimated speedup vs sequential: {(batch_size * latency_ms / 1000) / total_time:.2f}x")
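
To see how throughput changes with batch size, the batch test can be repeated over growing slices of the prompt list. The sketch below assumes a hypothetical `sweep_batch_sizes` helper that takes any callable returning per-prompt token counts. `fake_generate` is a stand-in for the engine with made-up costs.

```python
import time


def sweep_batch_sizes(generate_fn, prompts, sizes):
    """Time generate_fn on growing slices of prompts; return one row per size.

    generate_fn(batch) must return the number of tokens generated per prompt.
    """
    rows = []
    for size in sizes:
        batch = prompts[:size]
        start = time.perf_counter()
        token_counts = generate_fn(batch)
        elapsed = time.perf_counter() - start
        rows.append({
            "batch_size": len(batch),
            "total_tokens": sum(token_counts),
            "tokens_per_sec": sum(token_counts) / elapsed,
        })
    return rows


def fake_generate(batch):
    """Stand-in for the engine: fixed cost per call, 100 tokens per prompt."""
    time.sleep(0.01)
    return [100] * len(batch)


prompts = [f"prompt {i}" for i in range(8)]
for row in sweep_batch_sizes(fake_generate, prompts, sizes=[1, 2, 4, 8]):
    print(row)
```

With the real engine, the callable could be `lambda batch: [len(o.outputs[0].token_ids) for o in llm.generate(batch, sampling_params)]`.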

## Step 5: Understand Prefill vs Decode Time

LLM inference has two phases:
- **Prefill**: Process the input prompt and build the KV cache (compute-bound)
- **Decode**: Generate tokens one at a time, rereading the weights and KV cache at each step (memory-bandwidth-bound)

Disaggregated serving splits these phases across nodes. Let's measure them separately.

In [None]:
import numpy as np

def measure_prefill_decode_split(prompt, num_output_tokens=50):
    """Estimate prefill and decode time from two separate generate() calls."""

    # Prefill: process the prompt and emit a single token.
    # This approximates time to first token (prefill plus one sampling step).
    prefill_params = SamplingParams(
        temperature=0.0,
        max_tokens=1,  # Only 1 token to measure prefill
        top_p=1.0
    )

    start = time.perf_counter()
    outputs = llm.generate([prompt], prefill_params)
    prefill_time = time.perf_counter() - start

    # Full generation to measure decode.
    # Caveat: if prefix caching is enabled, this call may reuse the prompt's
    # KV cache from the call above, which makes decode look cheaper than it is.
    decode_params = SamplingParams(
        temperature=0.0,
        max_tokens=num_output_tokens,
        top_p=1.0
    )

    start = time.perf_counter()
    outputs = llm.generate([prompt], decode_params)
    total_time = time.perf_counter() - start

    # Approximate decode time as the remainder after prefill,
    # spread over the tokens generated after the first one.
    actual_tokens = len(outputs[0].outputs[0].token_ids)
    decode_time = max(0.0, total_time - prefill_time)  # clamp timing noise
    per_token_decode = decode_time / max(1, actual_tokens - 1)

    return {
        'prefill_ms': prefill_time * 1000,
        'decode_ms': decode_time * 1000,
        'per_token_ms': per_token_decode * 1000,
        'total_ms': total_time * 1000,
        'tokens': actual_tokens
    }

# Test with different prompt lengths
test_cases = [
    ("Short prompt", "What is TCP?"),
    ("Medium prompt", "Explain the OSI network model and describe each layer in detail."),
    ("Long prompt", "Describe the architecture of a modern distributed database system, including replication strategies, consistency models, and failure handling mechanisms. Explain how these systems achieve high availability."),
]

print("Measuring Prefill vs Decode Time:\n")
print(f"{'Prompt Length':<15} {'Prefill':<10} {'Decode':<10} {'Per-Token':<12} {'Total':<10}")
print("-" * 65)

for name, prompt in test_cases:
    metrics = measure_prefill_decode_split(prompt, num_output_tokens=30)
    print(f"{name:<15} {metrics['prefill_ms']:>8.1f}ms {metrics['decode_ms']:>8.1f}ms {metrics['per_token_ms']:>10.1f}ms {metrics['total_ms']:>8.1f}ms")

print("\nKey Insight:")
print("  Prefill time grows with input length (compute-bound)")
print("  Decode time is per-token and roughly constant (memory-bound)")
print("  Disaggregation splits these phases across specialized nodes")
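
Subtracting two timed runs is sensitive to noise. One alternative, sketched below under the assumption that total time is roughly linear in the number of output tokens, is to time several output lengths and fit a line: the slope estimates per-token decode cost and the intercept approximates the fixed prefill cost. `fit_prefill_decode` is a hypothetical helper and the timings are synthetic.

```python
def fit_prefill_decode(samples):
    """Least-squares fit of total_ms = prefill_ms + per_token_ms * tokens.

    samples: list of (output_tokens, total_ms) pairs for the same prompt.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in samples)
    if var_x == 0:
        raise ValueError("samples must use different output lengths")
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    per_token_ms = cov_xy / var_x                # slope: decode cost per token
    prefill_ms = mean_y - per_token_ms * mean_x  # intercept: fixed cost
    return prefill_ms, per_token_ms


# Synthetic timings: 40 ms fixed cost + 8 ms per generated token
synthetic = [(n, 40.0 + 8.0 * n) for n in (1, 10, 30, 50)]
prefill_ms, per_token_ms = fit_prefill_decode(synthetic)
print(f"prefill ~ {prefill_ms:.1f} ms, decode ~ {per_token_ms:.1f} ms/token")
```

Real samples could come from calling `measure_prefill_decode_split` with different `num_output_tokens` values and collecting `(tokens, total_ms)` from each result.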

## Step 6: Memory Usage Profiling

Track GPU memory usage during inference. The KV cache is what gets transferred in disaggregated serving, and its footprint grows with sequence length. Note that vLLM preallocates a KV cache pool at startup (sized by `gpu_memory_utilization`), so the allocator counters below mostly capture transient inference buffers on top of that pool rather than the growth of the KV cache itself.

In [None]:
if torch.cuda.is_available():
    # Reset memory stats
    torch.cuda.reset_peak_memory_stats()

    # Baseline after load: model weights plus vLLM's preallocated KV cache pool
    baseline_mb = torch.cuda.memory_allocated() / 1e6

    print("Running inference with memory tracking...\n")

    # Generate with longer sequence
    long_params = SamplingParams(
        temperature=0.0,
        max_tokens=200,
        top_p=1.0
    )

    prompt = "Explain distributed systems in detail."
    outputs = llm.generate([prompt], long_params)

    # Check peak memory
    peak_mb = torch.cuda.max_memory_allocated() / 1e6
    current_mb = torch.cuda.memory_allocated() / 1e6
    inference_mb = peak_mb - baseline_mb

    # Separate name so the Step 3 token count (saved in Step 7) is not overwritten
    mem_tokens = len(outputs[0].outputs[0].token_ids)
    memory_per_token = inference_mb / max(1, mem_tokens)

    print(f"Memory Usage:")
    print(f"  Baseline (weights + preallocated KV pool): {baseline_mb:.0f} MB")
    print(f"  Peak during inference: {peak_mb:.0f} MB")
    print(f"  Additional for inference: {inference_mb:.0f} MB")
    print(f"  Tokens generated: {mem_tokens}")
    print(f"  Memory per token: {memory_per_token:.2f} MB/token")

    print("\nNote:")
    print("  Additional memory is mostly transient activations and workspace")
    print("  KV cache blocks come from the preallocated pool and scale linearly with sequence length")
    print("  In disaggregated serving, this KV cache is transferred between nodes")
else:
    print("CUDA not available - skipping memory profiling")
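
The KV cache footprint can also be estimated analytically: each token stores one key and one value vector per layer. The sketch below applies that formula. The model shape (22 layers, 4 KV heads, head dimension 64, 2-byte values) is an assumption about TinyLlama-1.1B and should be checked against the model's `config.json`.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache one token occupies: a key and a value vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


# Assumed TinyLlama-1.1B shape; verify against the model's config.json
per_token = kv_cache_bytes_per_token(num_layers=22, num_kv_heads=4, head_dim=64)
seq_len = 200
print(f"{per_token} bytes/token, {per_token * seq_len / 1e6:.2f} MB for {seq_len} tokens")
```

This estimate is what scales linearly with sequence length, and it gives a rough size for the data a disaggregated setup would move between nodes per request.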

## Step 7: Baseline Performance Summary

Collect all baseline metrics for comparison with disaggregated serving later.

In [None]:
from datetime import datetime

baseline_metrics = {
    "timestamp": datetime.now().isoformat(),
    "model": MODEL_NAME,
    "config": {
        "tensor_parallel_size": 1,
        "gpu_memory_utilization": 0.9
    },
    "single_request": {
        "latency_ms": latency_ms,
        "tokens": tokens_generated,
        "throughput_tokens_per_sec": tokens_per_sec
    },
    "batch_processing": {
        "batch_size": batch_size,
        "total_tokens": total_tokens,
        "throughput_tokens_per_sec": throughput,
        "avg_latency_ms": avg_latency_per_request
    },
    "memory": {
        "baseline_mb": baseline_mb,
        "peak_mb": peak_mb,
        "inference_mb": inference_mb,
        "memory_per_token_mb": memory_per_token
    } if torch.cuda.is_available() else None
}

# Save metrics
metrics_file = Path("baseline_metrics.json")
with open(metrics_file, 'w') as f:
    json.dump(baseline_metrics, f, indent=2)

print("="*60)
print("BASELINE PERFORMANCE SUMMARY")
print("="*60)
print(f"\nSingle Request:")
print(f"  Latency: {latency_ms:.1f} ms")
print(f"  Throughput: {tokens_per_sec:.1f} tokens/sec")
print(f"\nBatch Processing ({batch_size} requests):")
print(f"  Throughput: {throughput:.1f} tokens/sec")
print(f"  Amortized time per request: {avg_latency_per_request:.1f} ms")
if torch.cuda.is_available():
    print(f"\nMemory:")
    print(f"  Baseline after load: {baseline_mb:.0f} MB")
    print(f"  Inference: {inference_mb:.0f} MB")
    print(f"  Per-token: {memory_per_token:.2f} MB/token")
print(f"\nMetrics saved to: {metrics_file}")
print("\nThis is what we're trying to beat with disaggregation.")

## Key Takeaways

**What we measured:**
- Single-node vLLM performance with continuous batching
- Prefill vs decode time split
- Memory usage patterns (KV cache growth)

**Why this matters:**
- These are honest baseline numbers from well-configured infrastructure
- Disaggregated serving must beat this to be worthwhile
- The KV cache is what needs to be transferred between nodes, so its size sets the transfer cost

**What's next:**
- [02_Understanding_KV_Cache.ipynb](02_Understanding_KV_Cache.ipynb) - Deep dive into what the KV cache actually contains and why transferring it is expensive