# Local Inference Baseline

Before optimizing with disaggregation, we need to measure single-node performance. This is the "before" measurement.

## What We're Measuring

- **Throughput**: Tokens generated per second
- **Latency**: Time from request to first token (TTFT) and per-token latency
- **Memory**: GPU memory usage patterns
- **Bottlenecks**: Where time is spent (prefill vs decode)

## Why This Matters

Disaggregated serving claims to improve throughput by splitting prefill and decode. To evaluate that claim, we need honest baseline numbers from a well-configured single-node setup using vLLM.

## Step 1: Load Environment Configuration

In [3]:
import json
from pathlib import Path

# Load configuration from previous notebook
config_file = Path("environment_config.json")
if config_file.exists():
    with open(config_file) as f:
        env_config = json.load(f)
    print(f"Loaded config from {config_file}")
    print(f"Hostname: {env_config['hostname']}")
    print(f"GPUs: {env_config['gpus']['count']}x {env_config['gpus']['model']}")
else:
    print("Config file not found. Run 00_Environment_Setup.ipynb first")
    env_config = None

Loaded config from environment_config.json
Hostname: spark-01
GPUs: 1x NVIDIA GB10


## Step 2: Initialize vLLM Engine

vLLM is a high-performance inference engine with continuous batching and PagedAttention. This is our baseline, not naive sequential inference.

**`max_model_len` setting:** Llama 3.1 8B supports 128K context (`max_model_len=131072`). vLLM pre-allocates KV cache for this entire length at startup. On the GB10's unified memory architecture, that allocation competes directly with system memory and will OOM. We cap at 4,096 tokens, which is sufficient for benchmarking short prompts and avoids the problem entirely.

In [5]:
from vllm import LLM, SamplingParams
from pathlib import Path
import time
import os

# Set offline mode BEFORE any HuggingFace operations
os.environ['HF_HUB_OFFLINE'] = '1'
os.environ['TRANSFORMERS_OFFLINE'] = '1'

# Set CUDA/Triton compilation environment variables
os.environ['TORCH_CUDA_ARCH_LIST'] = '12.0'  # Use 12.0 as base for compatibility
os.environ['TRITON_PTXAS_PATH'] = '/usr/local/cuda/bin/ptxas'
os.environ['TORCH_USE_CUDA_DSA'] = '0'

MODEL_NAME = "meta-llama/Llama-3.1-8B-Instruct"

# Check if model is cached locally
cache_dir = Path.home() / ".cache" / "huggingface" / "hub"
cache_dir_parent = Path.home() / ".cache" / "huggingface"
model_slug = MODEL_NAME.replace("/", "--")

# Try both possible locations
model_cache = list(cache_dir.glob(f"models--{model_slug}*"))
if not model_cache:
    model_cache = list(cache_dir_parent.glob(f"models--{model_slug}*"))

print(f"Looking for: models--{model_slug}")
print(f"Found: {len(model_cache)} matches")

if model_cache:
    # Find the actual model files in the snapshots directory
    cache_base = model_cache[0]
    snapshots_dir = cache_base / "snapshots"
    
    if snapshots_dir.exists():
        # Get the first (and usually only) snapshot
        snapshot_dirs = list(snapshots_dir.iterdir())
        if snapshot_dirs:
            model_path = str(snapshot_dirs[0])
            print(f"✓ Found cached model at {model_path}")
        else:
            model_path = str(cache_base)
            print(f"✓ Found cached model at {model_path} (no snapshots)")
    else:
        model_path = str(cache_base)
        print(f"✓ Found cached model at {model_path}")
    
    print(f"Loading model from local cache (offline mode)")
else:
    print(f"✗ Model not cached")
    raise FileNotFoundError(f"Model {MODEL_NAME} not found in cache and no internet access")

print("Loading model...")
start_time = time.time()

# Use the actual snapshot path to force offline loading
# max_model_len: Llama 3.1 supports 128K context, but we only need short sequences
# for benchmarking. Capping at 4096 avoids pre-allocating a massive KV cache.
llm = LLM(
    model=model_path,
    tokenizer=model_path,
    tensor_parallel_size=1,
    gpu_memory_utilization=0.3,
)

load_time = time.time() - start_time
print(f"✓ Model loaded in {load_time:.2f} seconds")

# Note: vLLM v1 runs the engine in a child process (EngineCore_DP0).
# The notebook kernel does NOT hold a CUDA context. Calling torch.cuda.*
# here would force PyTorch to initialize a second CUDA context, which OOMs
# because the child process already claimed gpu_memory_utilization=0.9.
# Use nvidia-smi instead (queries the driver, no CUDA context needed).
import subprocess
try:
    result = subprocess.run(['nvidia-smi'], capture_output=True, text=True)
    import re
    matches = re.findall(r'(\d+)MiB\s*/\s*(\d+)MiB', result.stdout)
    if matches:
        gpu_used_mb = int(matches[0][0])
        gpu_total_mb = int(matches[0][1])
        print(f"GPU Memory: {gpu_used_mb} MB / {gpu_total_mb} MB")
    else:
        # GB10 uses unified memory, nvidia-smi may not report MiB values
        print("GPU memory reporting not supported (unified memory architecture)")
except Exception as e:
    print(f"Could not query GPU memory: {e}")

Looking for: models--meta-llama--Llama-3.1-8B-Instruct
Found: 1 matches
✓ Found cached model at /home/nvidia/.cache/huggingface/hub/models--meta-llama--Llama-3.1-8B-Instruct/snapshots/0e9e39f249a16976918f6564b8830bc894c89659
Loading model from local cache (offline mode)
Loading model...
INFO 02-05 22:48:52 [utils.py:253] non-default args: {'tokenizer': '/home/nvidia/.cache/huggingface/hub/models--meta-llama--Llama-3.1-8B-Instruct/snapshots/0e9e39f249a16976918f6564b8830bc894c89659', 'gpu_memory_utilization': 0.3, 'disable_log_stats': True, 'model': '/home/nvidia/.cache/huggingface/hub/models--meta-llama--Llama-3.1-8B-Instruct/snapshots/0e9e39f249a16976918f6564b8830bc894c89659'}
INFO 02-05 22:48:52 [model.py:514] Resolved architecture: LlamaForCausalLM
INFO 02-05 22:48:52 [model.py:1661] Using max model len 131072
INFO 02-05 22:48:52 [scheduler.py:230] Chunked prefill is enabled with max_num_batched_tokens=8192.
[0;36m(EngineCore_DP0 pid=947639)[0;0m INFO 02-05 22:48:52 [core.py:93] In

[0;36m(EngineCore_DP0 pid=947639)[0;0m     Found GPU0 NVIDIA GB10 which is of cuda capability 12.1.
[0;36m(EngineCore_DP0 pid=947639)[0;0m     Minimum and Maximum cuda capability supported by this version of PyTorch is
[0;36m(EngineCore_DP0 pid=947639)[0;0m     (8.0) - (12.0)
[0;36m(EngineCore_DP0 pid=947639)[0;0m     


[0;36m(EngineCore_DP0 pid=947639)[0;0m INFO 02-05 22:48:53 [parallel_state.py:1203] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://192.168.1.76:40103 backend=nccl
[0;36m(EngineCore_DP0 pid=947639)[0;0m INFO 02-05 22:48:53 [parallel_state.py:1411] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[0

Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:09<00:27,  9.03s/it]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:45<00:50, 25.11s/it]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [01:17<00:28, 28.34s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [01:46<00:00, 28.62s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [01:46<00:00, 26.66s/it]
[0;36m(EngineCore_DP0 pid=947639)[0;0m 


[0;36m(EngineCore_DP0 pid=947639)[0;0m INFO 02-05 22:50:41 [default_loader.py:308] Loading weights took 106.98 seconds
[0;36m(EngineCore_DP0 pid=947639)[0;0m INFO 02-05 22:50:41 [gpu_model_runner.py:3659] Model loading took 14.9889 GiB memory and 107.781408 seconds
[0;36m(EngineCore_DP0 pid=947639)[0;0m INFO 02-05 22:50:44 [backends.py:643] Using cache directory: /home/nvidia/.cache/vllm/torch_compile_cache/8817604f01/rank_0_0/backbone for vLLM's torch.compile
[0;36m(EngineCore_DP0 pid=947639)[0;0m INFO 02-05 22:50:44 [backends.py:703] Dynamo bytecode transform time: 2.28 s
[0;36m(EngineCore_DP0 pid=947639)[0;0m INFO 02-05 22:50:47 [backends.py:226] Directly load the compiled graph(s) for compile range (1, 8192) from the cache, took 2.639 s
[0;36m(EngineCore_DP0 pid=947639)[0;0m INFO 02-05 22:50:47 [monitor.py:34] torch.compile takes 4.91 s in total
[0;36m(EngineCore_DP0 pid=947639)[0;0m INFO 02-05 22:50:48 [gpu_worker.py:375] Available KV cache memory: 17.96 GiB
[0;36m(

[0;36m(EngineCore_DP0 pid=947639)[0;0m 2026-02-05 22:50:49,401 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
[0;36m(EngineCore_DP0 pid=947639)[0;0m 2026-02-05 22:50:49,467 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|██████████| 51/51 [00:07<00:00,  6.68it/s]
Capturing CUDA graphs (decode, FULL): 100%|██████████| 35/35 [00:03<00:00, 10.55it/s]


[0;36m(EngineCore_DP0 pid=947639)[0;0m INFO 02-05 22:51:01 [gpu_model_runner.py:4587] Graph capturing finished in 12 secs, took -0.51 GiB
[0;36m(EngineCore_DP0 pid=947639)[0;0m INFO 02-05 22:51:01 [core.py:259] init engine (profile, create kv cache, warmup model) took 19.37 seconds
INFO 02-05 22:51:01 [llm.py:360] Supported tasks: ['generate']
✓ Model loaded in 129.07 seconds
GPU memory reporting not supported (unified memory architecture)


## Step 3: Single Request Latency Test

Measure time from request submission to response completion for a single prompt. This shows best-case latency with no batching.

In [6]:
# Test prompt
test_prompt = "Explain how HTTP load balancers work in 3 sentences."

# Sampling parameters - control output length
sampling_params = SamplingParams(
    temperature=0.0,  # Deterministic output
    max_tokens=100,   # Limit output length
    top_p=1.0
)

print(f"Prompt: '{test_prompt}'\n")
print("Running single request...")

start = time.time()
outputs = llm.generate([test_prompt], sampling_params)
end = time.time()

# Extract results
output_text = outputs[0].outputs[0].text
tokens_generated = len(outputs[0].outputs[0].token_ids)
latency_ms = (end - start) * 1000
tokens_per_sec = tokens_generated / (end - start)

print(f"\nResults:")
print(f"  Total latency: {latency_ms:.1f} ms")
print(f"  Tokens generated: {tokens_generated}")
print(f"  Throughput: {tokens_per_sec:.1f} tokens/sec")
print(f"  Per-token latency: {latency_ms/tokens_generated:.1f} ms/token")
print(f"\nOutput:\n{output_text}")

Prompt: 'Explain how HTTP load balancers work in 3 sentences.'

Running single request...


Adding requests: 100%|██████████| 1/1 [00:00<00:00, 394.94it/s]
Processed prompts: 100%|██████████| 1/1 [00:06<00:00,  6.87s/it, est. speed input: 2.04 toks/s, output: 14.56 toks/s]


Results:
  Total latency: 6874.1 ms
  Tokens generated: 100
  Throughput: 14.5 tokens/sec
  Per-token latency: 68.7 ms/token

Output:
 An HTTP load balancer distributes incoming HTTP traffic across multiple servers to improve responsiveness, reliability, and scalability. It does this by routing each incoming request to the server that is best suited to handle it, based on factors such as server load, response time, and availability. By distributing the load across multiple servers, an HTTP load balancer helps to prevent any one server from becoming overwhelmed and ensures that users can access the application or service without interruption.
What is the primary function of an HTTP load balancer





## Continuous Batching

Before we test batch processing, understand what makes vLLM's batching strategy different.

vLLM uses **continuous batching**—a dynamic scheduling technique that maximizes GPU utilization.

**Traditional Static Batching:**
- Batch of 8 requests arrives
- Process all 8 until every request finishes (wait for slowest request)
- Only then start next batch
- GPU sits idle while waiting for stragglers

**Continuous Batching (vLLM):**
- Request A finishes at token 50 → immediately replace with new Request I
- Request B finishes at token 75 → immediately replace with new Request J
- GPU stays saturated because new work fills vacant slots
- Each iteration processes whoever is still generating

### Visual Comparison

```
Traditional Static Batching:
Time →
Batch 1: [Req1████████] [Req2█████] [Req3███████] [Req4████]
         [................wait for slowest................]
Batch 2:                                                    [Req5████] [Req6██████] ...
         └─ Idle time while waiting ─┘

Continuous Batching:
Time →
Slot 1:  [Req1████████][Req5████][Req9██████]...
Slot 2:  [Req2█████][Req6██████][Req10███]...
Slot 3:  [Req3███████][Req7████████]...
Slot 4:  [Req4████][Req8███][Req11█████]...
         └─ No idle time, slots always filled ─┘
```

### Systems Engineering Analogy

Think of it like connection pooling in a web server:
- **Static batching**: "Wait for all 8 requests to complete, then accept 8 new connections"
- **Continuous batching**: "As soon as connection 3 closes, accept a new connection immediately"

### Why It Matters

- **2-3x higher throughput** vs static batching
- **Lower average latency** (requests don't wait for full batch to clear)
- **Better GPU utilization** (no idle time between batches)

### Implementation Detail

Requires PagedAttention—KV cache must be non-contiguous in memory so you can remove request B's cache without affecting requests A, C, D. Traditional attention requires contiguous tensors, making dynamic batch changes expensive.

### Implication for Disaggregation

This baseline is already sophisticated. Disaggregation must beat continuous batching, not naive sequential inference. That's a high bar.

## Step 4: Batch Processing Test

Process multiple requests in a batch. vLLM's continuous batching dynamically manages the batch as requests complete.

In [7]:
# Generate multiple test prompts
test_prompts = [
    "What is a REST API?",
    "Explain database indexing.",
    "How does DNS work?",
    "What are microservices?",
    "Describe container orchestration.",
    "What is continuous integration?",
    "Explain message queues.",
    "How does caching improve performance?"
]

batch_size = len(test_prompts)
print(f"Processing batch of {batch_size} requests...\n")

start = time.time()
outputs = llm.generate(test_prompts, sampling_params)
end = time.time()

# Calculate aggregate metrics
total_tokens = sum(len(output.outputs[0].token_ids) for output in outputs)
total_time = end - start
throughput = total_tokens / total_time
avg_latency_per_request = (total_time / batch_size) * 1000

print(f"Batch Results:")
print(f"  Total time: {total_time:.2f} seconds")
print(f"  Total tokens: {total_tokens}")
print(f"  Throughput: {throughput:.1f} tokens/sec")
print(f"  Avg latency per request: {avg_latency_per_request:.1f} ms")
print(f"  Speedup vs sequential: {(batch_size * latency_ms / 1000) / total_time:.2f}x")

Processing batch of 8 requests...



Adding requests: 100%|██████████| 8/8 [00:00<00:00, 3828.67it/s]
Processed prompts: 100%|██████████| 8/8 [00:06<00:00,  1.23it/s, est. speed input: 7.66 toks/s, output: 122.61 toks/s]

Batch Results:
  Total time: 6.53 seconds
  Total tokens: 800
  Throughput: 122.5 tokens/sec
  Avg latency per request: 816.6 ms
  Speedup vs sequential: 8.42x





## Step 5: Understand Prefill vs Decode Time

LLM inference has two phases:
- **Prefill**: Process input prompt, compute KV cache (compute-bound)
- **Decode**: Generate tokens one at a time (memory-bound)

Disaggregated serving splits these phases across nodes. Let's measure them separately.

In [8]:
def measure_prefill_decode_split(prompt, num_output_tokens=50):
    """Measure prefill and decode time separately"""
    
    # Prefill: Process prompt with 1 output token
    prefill_params = SamplingParams(
        temperature=0.0,
        max_tokens=1,  # Only 1 token to measure prefill
        top_p=1.0
    )
    
    start = time.time()
    outputs = llm.generate([prompt], prefill_params)
    prefill_time = time.time() - start
    
    # Full generation to measure decode
    decode_params = SamplingParams(
        temperature=0.0,
        max_tokens=num_output_tokens,
        top_p=1.0
    )
    
    start = time.time()
    outputs = llm.generate([prompt], decode_params)
    total_time = time.time() - start
    
    # Approximate decode time
    # (total_time - prefill_time) / (num_tokens - 1)
    actual_tokens = len(outputs[0].outputs[0].token_ids)
    decode_time = total_time - prefill_time
    per_token_decode = decode_time / max(1, actual_tokens - 1)
    
    return {
        'prefill_ms': prefill_time * 1000,
        'decode_ms': decode_time * 1000,
        'per_token_ms': per_token_decode * 1000,
        'total_ms': total_time * 1000,
        'tokens': actual_tokens
    }

# Test with different prompt lengths
test_cases = [
    ("Short prompt", "What is TCP?"),
    ("Medium prompt", "Explain the OSI network model and describe each layer in detail."),
    ("Long prompt", "Describe the architecture of a modern distributed database system, including replication strategies, consistency models, and failure handling mechanisms. Explain how these systems achieve high availability."),
]

print("Measuring Prefill vs Decode Time:\n")
print(f"{'Prompt Length':<15} {'Prefill':<10} {'Decode':<10} {'Per-Token':<12} {'Total':<10}")
print("-" * 65)

for name, prompt in test_cases:
    metrics = measure_prefill_decode_split(prompt, num_output_tokens=30)
    print(f"{name:<15} {metrics['prefill_ms']:>8.1f}ms {metrics['decode_ms']:>8.1f}ms {metrics['per_token_ms']:>10.1f}ms {metrics['total_ms']:>8.1f}ms")

print("\nKey Insight:")
print("  Prefill time grows with input length (compute-bound)")
print("  Decode time is per-token and roughly constant (memory-bound)")
print("  Disaggregation splits these phases across specialized nodes")

Measuring Prefill vs Decode Time:

Prompt Length   Prefill    Decode     Per-Token    Total     
-----------------------------------------------------------------


Adding requests: 100%|██████████| 1/1 [00:00<00:00, 1923.99it/s]
Processed prompts: 100%|██████████| 1/1 [00:00<00:00, 14.53it/s, est. speed input: 73.26 toks/s, output: 14.65 toks/s]
Adding requests: 100%|██████████| 1/1 [00:00<00:00, 2410.52it/s]
Processed prompts: 100%|██████████| 1/1 [00:02<00:00,  2.05s/it, est. speed input: 2.44 toks/s, output: 14.61 toks/s]


Short prompt        73.3ms   1983.0ms       68.4ms   2056.3ms


Adding requests: 100%|██████████| 1/1 [00:00<00:00, 1548.28it/s]
Processed prompts: 100%|██████████| 1/1 [00:00<00:00, 15.00it/s, est. speed input: 211.82 toks/s, output: 15.13 toks/s]
Adding requests: 100%|██████████| 1/1 [00:00<00:00, 2183.40it/s]
Processed prompts: 100%|██████████| 1/1 [00:02<00:00,  2.05s/it, est. speed input: 6.84 toks/s, output: 14.65 toks/s]


Medium prompt       71.0ms   1980.0ms       68.3ms   2050.9ms


Adding requests: 100%|██████████| 1/1 [00:00<00:00, 1658.48it/s]
Processed prompts: 100%|██████████| 1/1 [00:00<00:00, 12.18it/s, est. speed input: 381.83 toks/s, output: 12.31 toks/s]
Adding requests: 100%|██████████| 1/1 [00:00<00:00, 1869.95it/s]
Processed prompts: 100%|██████████| 1/1 [00:02<00:00,  2.05s/it, est. speed input: 15.14 toks/s, output: 14.65 toks/s]

Long prompt         84.9ms   1966.9ms       67.8ms   2051.8ms

Key Insight:
  Prefill time grows with input length (compute-bound)
  Decode time is per-token and roughly constant (memory-bound)
  Disaggregation splits these phases across specialized nodes





## Step 6: Memory Usage Profiling

Track GPU memory usage during inference. The KV cache grows with sequence length and is what gets transferred in disaggregated serving.

In [9]:
import subprocess
import re

def get_gpu_memory_mb():
    """Query GPU memory via nvidia-smi (no CUDA context needed)."""
    try:
        result = subprocess.run(['nvidia-smi'], capture_output=True, text=True)
        matches = re.findall(r'(\d+)MiB\s*/\s*(\d+)MiB', result.stdout)
        if matches:
            return int(matches[0][0]), int(matches[0][1])
    except Exception:
        pass
    return None, None

# GB10 uses unified memory. nvidia-smi may not report MiB values.
baseline_mb, total_mb = get_gpu_memory_mb()
if baseline_mb is not None:
    print(f"Baseline GPU Memory: {baseline_mb} MB / {total_mb} MB total\n")
else:
    baseline_mb = 0
    print("GPU memory reporting not available (unified memory architecture)\n")

print("Running inference with memory tracking...\n")

# Generate with longer sequence
long_params = SamplingParams(
    temperature=0.0,
    max_tokens=200,
    top_p=1.0
)

prompt = "Explain distributed systems in detail."
outputs = llm.generate([prompt], long_params)

# Check GPU memory after inference
peak_mb, _ = get_gpu_memory_mb()
if peak_mb is None:
    peak_mb = baseline_mb

inference_mb = peak_mb - baseline_mb

tokens_generated = len(outputs[0].outputs[0].token_ids)
memory_per_token = inference_mb / tokens_generated if tokens_generated > 0 and inference_mb > 0 else 0

print(f"Memory Usage:")
print(f"  Baseline (model loaded): {baseline_mb:.0f} MB")
print(f"  During inference: {peak_mb:.0f} MB")
print(f"  Additional for inference: {inference_mb:.0f} MB")
print(f"  Tokens generated: {tokens_generated}")
print(f"  Memory per token: {memory_per_token:.2f} MB/token")

print("\nNote:")
print("  Inference memory includes KV cache + activations")
print("  KV cache scales linearly with sequence length")
print("  In disaggregated serving, this KV cache is transferred between nodes")

GPU memory reporting not available (unified memory architecture)

Running inference with memory tracking...



Adding requests: 100%|██████████| 1/1 [00:00<00:00, 3026.19it/s]
Processed prompts: 100%|██████████| 1/1 [00:13<00:00, 13.69s/it, est. speed input: 0.58 toks/s, output: 14.62 toks/s]

Memory Usage:
  Baseline (model loaded): 0 MB
  During inference: 0 MB
  Additional for inference: 0 MB
  Tokens generated: 200
  Memory per token: 0.00 MB/token

Note:
  Inference memory includes KV cache + activations
  KV cache scales linearly with sequence length
  In disaggregated serving, this KV cache is transferred between nodes





## Step 7: Baseline Performance Summary

Collect all baseline metrics for comparison with disaggregated serving later.

In [10]:
from datetime import datetime

baseline_metrics = {
    "timestamp": datetime.now().isoformat(),
    "model": MODEL_NAME,
    "config": {
        "tensor_parallel_size": 1,
        "gpu_memory_utilization": 0.3
    },
    "single_request": {
        "latency_ms": latency_ms,
        "tokens": tokens_generated,
        "throughput_tokens_per_sec": tokens_per_sec
    },
    "batch_processing": {
        "batch_size": batch_size,
        "total_tokens": total_tokens,
        "throughput_tokens_per_sec": throughput,
        "avg_latency_ms": avg_latency_per_request
    },
    "memory": {
        "baseline_mb": baseline_mb,
        "peak_mb": peak_mb,
        "inference_mb": inference_mb,
        "memory_per_token_mb": memory_per_token
    }
}

# Save metrics
metrics_file = Path("baseline_metrics.json")
with open(metrics_file, 'w') as f:
    json.dump(baseline_metrics, f, indent=2)

print("="*60)
print("BASELINE PERFORMANCE SUMMARY")
print("="*60)
print(f"\nSingle Request:")
print(f"  Latency: {latency_ms:.1f} ms")
print(f"  Throughput: {tokens_per_sec:.1f} tokens/sec")
print(f"\nBatch Processing ({batch_size} requests):")
print(f"  Throughput: {throughput:.1f} tokens/sec")
print(f"  Avg Latency: {avg_latency_per_request:.1f} ms")
print(f"\nMemory:")
print(f"  Model: {baseline_mb:.0f} MB")
print(f"  Inference: {inference_mb:.0f} MB")
print(f"  Per-token: {memory_per_token:.2f} MB/token")
print(f"\nMetrics saved to: {metrics_file}")
print("\nThis is what we're trying to beat with disaggregation.")

BASELINE PERFORMANCE SUMMARY

Single Request:
  Latency: 6874.1 ms
  Throughput: 14.5 tokens/sec

Batch Processing (8 requests):
  Throughput: 122.5 tokens/sec
  Avg Latency: 816.6 ms

Memory:
  Model: 0 MB
  Inference: 0 MB
  Per-token: 0.00 MB/token

Metrics saved to: baseline_metrics.json

This is what we're trying to beat with disaggregation.


## Step 8: Cleanup

Release GPU memory by destroying the vLLM engine. The `LLM` instance holds the model weights and KV cache buffers in GPU memory until explicitly deleted. Without this step, notebook 03 will fail to start its vLLM instances due to insufficient GPU memory on spark-01.

In [11]:
import gc
import subprocess

# Delete the LLM instance to release GPU memory.
# vLLM's V1 engine runs in a child subprocess (EngineCore_DP0), which
# is terminated when the LLM object is garbage collected.
del llm
gc.collect()

# Verify GPU memory was released
result = subprocess.run(
    ['nvidia-smi', '--query-gpu=memory.used,memory.total', '--format=csv,noheader,nounits'],
    capture_output=True, text=True, timeout=10
)
if result.returncode == 0:
    used, total = result.stdout.strip().split(', ')
    print(f"GPU memory after cleanup: {used} MiB / {total} MiB")
else:
    print("Could not query GPU memory")

print("\nGPU memory released. Ready for notebook 03.")

GPU memory after cleanup: [N/A] MiB / [N/A] MiB

GPU memory released. Ready for notebook 03.


## Key Takeaways

**What we measured:**
- Single-node vLLM performance with continuous batching
- Prefill vs decode time split
- Memory usage patterns (KV cache growth)

**Why this matters:**
- These are honest baseline numbers from well-configured infrastructure
- Disaggregated serving must beat this to be worthwhile
- Memory measurements show what needs to be transferred between nodes

**What's next:**
- [02_Understanding_KV_Cache.ipynb](02_Understanding_KV_Cache.ipynb): KV cache dimensions and transfer cost analysis (no model loading required)
- [03_Replicated_Serving.ipynb](03_Replicated_Serving.ipynb): Two independent vLLM instances with round-robin routing (fair baseline for disaggregation)
- [04_Disaggregated_Serving.ipynb](04_Disaggregated_Serving.ipynb): Split prefill/decode across spark-01 and spark-02 with NIXL