# Local Inference Baseline

Before optimizing with disaggregation, we need to measure single-node performance. This is the "before" measurement.

## What We're Measuring

- **Throughput**: Tokens generated per second
- **Latency**: Time from request to first token (TTFT) and per-token latency
- **Memory**: GPU memory usage patterns
- **Bottlenecks**: Where time is spent (prefill vs decode)
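
To make these definitions concrete, here is a minimal sketch of how the latency metrics relate to each other. It assumes per-token arrival timestamps are available (for example from a streaming client); the `summarize_latency` helper and the example timestamps are hypothetical and are not produced by the cells in this notebook.

```python
def summarize_latency(request_start, token_times):
    """Derive TTFT, mean inter-token latency, and throughput from timestamps.

    request_start: time the request was submitted (seconds)
    token_times:   arrival time of each generated token (seconds, ascending)
    """
    if not token_times:
        raise ValueError("need at least one generated token")
    ttft = token_times[0] - request_start
    total = token_times[-1] - request_start
    # Inter-token latency excludes the first token, which is dominated by prefill.
    decode_steps = len(token_times) - 1
    per_token = (token_times[-1] - token_times[0]) / decode_steps if decode_steps else 0.0
    return {
        "ttft_s": ttft,
        "per_token_s": per_token,
        "total_s": total,
        "tokens_per_s": len(token_times) / total,
    }

# Hypothetical timestamps: first token at 0.08 s, then one token every 0.07 s.
example = summarize_latency(0.0, [0.08 + 0.07 * i for i in range(5)])
print(example)
```

The offline `llm.generate` call used later only returns once generation is complete, so the notebook approximates these quantities from wall-clock totals instead.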

## Why This Matters

Disaggregated serving claims to improve throughput by splitting prefill and decode. To evaluate that claim, we need honest baseline numbers from a well-configured single-node setup using vLLM.

## Step 1: Load Environment Configuration

In [19]:
import json
from pathlib import Path

# Load configuration from previous notebook
config_file = Path("environment_config.json")
if config_file.exists():
    with open(config_file) as f:
        env_config = json.load(f)
    print(f"Loaded config from {config_file}")
    print(f"Hostname: {env_config['hostname']}")
    print(f"GPUs: {env_config['gpus']['count']}x {env_config['gpus']['model']}")
else:
    print("Config file not found. Run 00_Environment_Setup.ipynb first")
    env_config = None

Loaded config from environment_config.json
Hostname: spark-01
GPUs: 1x NVIDIA GB10


## Step 2: Initialize vLLM Engine

vLLM is a high-performance inference engine with continuous batching and PagedAttention. This is our baseline—not naive sequential inference.

In [20]:
from vllm import LLM, SamplingParams
from pathlib import Path
import torch
import time
import os

# Set offline mode BEFORE any HuggingFace operations
os.environ['HF_HUB_OFFLINE'] = '1'
os.environ['TRANSFORMERS_OFFLINE'] = '1'

# Set CUDA/Triton compilation environment variables
os.environ['TORCH_CUDA_ARCH_LIST'] = '12.0'  # GB10 reports capability 12.1, but this PyTorch build supports at most 12.0
os.environ['TRITON_PTXAS_PATH'] = '/usr/local/cuda/bin/ptxas'
os.environ['TORCH_USE_CUDA_DSA'] = '0'

MODEL_NAME = "meta-llama/Llama-3.1-8B-Instruct"

# Check if model is cached locally
cache_dir = Path.home() / ".cache" / "huggingface" / "hub"
cache_dir_parent = Path.home() / ".cache" / "huggingface"
model_slug = MODEL_NAME.replace("/", "--")

# Try both possible locations
model_cache = list(cache_dir.glob(f"models--{model_slug}*"))
if not model_cache:
    model_cache = list(cache_dir_parent.glob(f"models--{model_slug}*"))

print(f"Looking for: models--{model_slug}")
print(f"Found: {len(model_cache)} matches")

if model_cache:
    # Find the actual model files in the snapshots directory
    cache_base = model_cache[0]
    snapshots_dir = cache_base / "snapshots"
    
    if snapshots_dir.exists():
        # Get the first (and usually only) snapshot
        snapshot_dirs = list(snapshots_dir.iterdir())
        if snapshot_dirs:
            model_path = str(snapshot_dirs[0])
            print(f"✓ Found cached model at {model_path}")
        else:
            model_path = str(cache_base)
            print(f"✓ Found cached model at {model_path} (no snapshots)")
    else:
        model_path = str(cache_base)
        print(f"✓ Found cached model at {model_path}")
    
    print(f"Loading model from local cache (offline mode)")
else:
    print(f"✗ Model not cached")
    raise FileNotFoundError(f"Model {MODEL_NAME} not found in cache and no internet access")

print("Loading model...")
start_time = time.time()

# Use the actual snapshot path to force offline loading
llm = LLM(
    model=model_path,
    tensor_parallel_size=1,
    gpu_memory_utilization=0.9,
    tokenizer=model_path  # Explicitly set tokenizer path too
)

load_time = time.time() - start_time
print(f"✓ Model loaded in {load_time:.2f} seconds")

# Check GPU memory usage after model load
if torch.cuda.is_available():
    # Force synchronization to ensure all GPU operations complete
    torch.cuda.synchronize()
    
    # PyTorch memory tracking (may show 0 if vLLM uses its own allocator)
    allocated_mb = torch.cuda.memory_allocated() / 1e6
    reserved_mb = torch.cuda.memory_reserved() / 1e6
    
    # Get actual GPU memory usage via nvidia-smi
    import subprocess
    try:
        result = subprocess.run(
            ['nvidia-smi', '--query-gpu=memory.used', '--format=csv,noheader,nounits'],
            capture_output=True, text=True, check=True
        )
        gpu_memory_used = int(result.stdout.strip().split('\n')[0])
        print(f"GPU Memory (nvidia-smi): {gpu_memory_used} MB used")
    except Exception:
        # nvidia-smi is missing or returned a non-numeric value; skip this readout
        pass
    
    if allocated_mb > 0:
        print(f"GPU Memory (PyTorch): {allocated_mb:.0f} MB allocated, {reserved_mb:.0f} MB reserved")
    else:
        print(f"Note: PyTorch memory tracking shows 0 MB (vLLM uses its own memory allocator)")

Looking for: models--meta-llama--Llama-3.1-8B-Instruct
Found: 1 matches
✓ Found cached model at /home/nvidia/.cache/huggingface/hub/models--meta-llama--Llama-3.1-8B-Instruct/snapshots/0e9e39f249a16976918f6564b8830bc894c89659
Loading model from local cache (offline mode)
Loading model...
INFO 02-04 22:44:09 [utils.py:253] non-default args: {'tokenizer': '/home/nvidia/.cache/huggingface/hub/models--meta-llama--Llama-3.1-8B-Instruct/snapshots/0e9e39f249a16976918f6564b8830bc894c89659', 'disable_log_stats': True, 'model': '/home/nvidia/.cache/huggingface/hub/models--meta-llama--Llama-3.1-8B-Instruct/snapshots/0e9e39f249a16976918f6564b8830bc894c89659'}


INFO 02-04 22:44:09 [model.py:514] Resolved architecture: LlamaForCausalLM
INFO 02-04 22:44:09 [model.py:1661] Using max model len 131072
INFO 02-04 22:44:09 [scheduler.py:230] Chunked prefill is enabled with max_num_batched_tokens=8192.
(EngineCore_DP0 pid=910524) INFO 02-04 22:44:09 [core.py:93] Initializing a V1 LLM engine (v0.13.0) with config: model='/home/nvidia/.cache/huggingface/hub/models--meta-llama--Llama-3.1-8B-Instruct/snapshots/0e9e39f249a16976918f6564b8830bc894c89659', speculative_config=None, tokenizer='/home/nvidia/.cache/huggingface/hub/models--meta-llama--Llama-3.1-8B-Instruct/snapshots/0e9e39f249a16976918f6564b8830bc894c89659', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=F

(EngineCore_DP0 pid=910524)     Found GPU0 NVIDIA GB10 which is of cuda capability 12.1.
(EngineCore_DP0 pid=910524)     Minimum and Maximum cuda capability supported by this version of PyTorch is
(EngineCore_DP0 pid=910524)     (8.0) - (12.0)

(EngineCore_DP0 pid=910524) ERROR 02-04 22:44:10 [core.py:866] EngineCore failed to start.
(EngineCore_DP0 pid=910524) ERROR 02-04 22:44:10 [core.py:866] Traceback (most recent call last):
(EngineCore_DP0 pid=910524) ERROR 02-04 22:44:10 [core.py:866]   File "/home/nvidia/src/github.com/elizabetht/spark/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 857, in run_engine_core
(EngineCore_DP0 pid=910524) ERROR 02-04 22:44:10 [core.py:866]     engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_DP0 pid=910524) ERROR 02-04 22:44:10 [core.py:866]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=910524) ERROR 02-04 22:44:10 [core.py:866]   File "/home/nvidia/src/github.com/elizabetht/spark/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 637, in __init__
(EngineCore_DP0 pid=910524) ERROR 02-04 22:44:10 [core.py:866]     super().__init__(

(EngineCore_DP0 pid=910524) Process EngineCore_DP0:
(EngineCore_DP0 pid=910524) Traceback (most recent call last):
(EngineCore_DP0 pid=910524)   File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore_DP0 pid=910524)     self.run()
(EngineCore_DP0 pid=910524)   File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
(EngineCore_DP0 pid=910524)     self._target(*self._args, **self._kwargs)
(EngineCore_DP0 pid=910524)   File "/home/nvidia/src/github.com/elizabetht/spark/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 870, in run_engine_core
(EngineCore_DP0 pid=910524)     raise e
(EngineCore_DP0 pid=910524)   File "/home/nvidia/src/github.com/elizabetht/spark/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 857, in run_engine_core
(EngineCore_DP0 pid=910524)     engine_core = E

RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}

## Step 3: Single Request Latency Test

Measure time from request submission to response completion for a single prompt. This shows best-case latency with no batching.

In [None]:
# Test prompt
test_prompt = "Explain how HTTP load balancers work in 3 sentences."

# Sampling parameters - control output length
sampling_params = SamplingParams(
    temperature=0.0,  # Deterministic output
    max_tokens=100,   # Limit output length
    top_p=1.0
)

print(f"Prompt: '{test_prompt}'\n")
print("Running single request...")

start = time.time()
outputs = llm.generate([test_prompt], sampling_params)
end = time.time()

# Extract results
output_text = outputs[0].outputs[0].text
tokens_generated = len(outputs[0].outputs[0].token_ids)
latency_ms = (end - start) * 1000
tokens_per_sec = tokens_generated / (end - start)

print(f"\nResults:")
print(f"  Total latency: {latency_ms:.1f} ms")
print(f"  Tokens generated: {tokens_generated}")
print(f"  Throughput: {tokens_per_sec:.1f} tokens/sec")
print(f"  Per-token latency: {latency_ms/tokens_generated:.1f} ms/token")
print(f"\nOutput:\n{output_text}")

Prompt: 'Explain how HTTP load balancers work in 3 sentences.'

Running single request...


Adding requests: 100%|██████████| 1/1 [00:00<00:00, 499.38it/s]
Processed prompts: 100%|██████████| 1/1 [00:06<00:00,  6.90s/it, est. speed input: 2.03 toks/s, output: 14.49 toks/s]


Results:
  Total latency: 6916.3 ms
  Tokens generated: 100
  Throughput: 14.5 tokens/sec
  Per-token latency: 69.2 ms/token

Output:
 An HTTP load balancer distributes incoming HTTP traffic across multiple servers to improve responsiveness, reliability, and scalability. It does this by routing each incoming request to the server that is best suited to handle it, based on factors such as server load, response time, and availability. By distributing the load across multiple servers, an HTTP load balancer helps to prevent any one server from becoming overwhelmed and ensures that users can access the application or service without interruption.
What is the primary function of an HTTP load balancer





## Continuous Batching

Before we test batch processing, understand what makes vLLM's batching strategy different.

vLLM uses **continuous batching**—a dynamic scheduling technique that maximizes GPU utilization.

**Traditional Static Batching:**
- Batch of 8 requests arrives
- Process all 8 until every request finishes (wait for slowest request)
- Only then start next batch
- GPU sits idle while waiting for stragglers

**Continuous Batching (vLLM):**
- Request A finishes at token 50 → immediately replace with new Request I
- Request B finishes at token 75 → immediately replace with new Request J
- GPU stays saturated because new work fills vacant slots
- Each iteration processes whoever is still generating

### Visual Comparison

```
Traditional Static Batching:
Time →
Batch 1: [Req1████████] [Req2█████] [Req3███████] [Req4████]
         [................wait for slowest................]
Batch 2:                                                    [Req5████] [Req6██████] ...
         └─ Idle time while waiting ─┘

Continuous Batching:
Time →
Slot 1:  [Req1████████][Req5████][Req9██████]...
Slot 2:  [Req2█████][Req6██████][Req10███]...
Slot 3:  [Req3███████][Req7████████]...
Slot 4:  [Req4████][Req8███][Req11█████]...
         └─ No idle time, slots always filled ─┘
```
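
The diagram can be turned into a toy simulation. The sketch below is not vLLM's scheduler: it assumes every decode step costs the same, ignores prefill, and uses hypothetical output lengths. It only counts how many decode iterations each strategy needs to drain the same queue.

```python
from collections import deque

def static_batching_steps(lengths, slots):
    """Each batch runs until its longest request finishes."""
    return sum(max(lengths[i:i + slots]) for i in range(0, len(lengths), slots))

def continuous_batching_steps(lengths, slots):
    """A finished request's slot is refilled before the next decode step."""
    queue = deque(lengths)
    active = []
    steps = 0
    while queue or active:
        # Fill vacant slots from the waiting queue.
        while queue and len(active) < slots:
            active.append(queue.popleft())
        # One decode iteration: every active request emits one token.
        active = [remaining - 1 for remaining in active]
        active = [remaining for remaining in active if remaining > 0]
        steps += 1
    return steps

lengths = [8, 5, 7, 4, 4, 6, 8, 3, 6, 3, 5]  # hypothetical output lengths in tokens
static_steps = static_batching_steps(lengths, slots=4)
continuous_steps = continuous_batching_steps(lengths, slots=4)
print(f"static: {static_steps} steps, continuous: {continuous_steps} steps")
```

The gap between the two counts widens as output lengths become more uneven, because static batching pays for the longest request in every batch.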

### Systems Engineering Analogy

Think of it like connection pooling in a web server:
- **Static batching**: "Wait for all 8 requests to complete, then accept 8 new connections"
- **Continuous batching**: "As soon as connection 3 closes, accept a new connection immediately"

### Why It Matters

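- **2-3x higher throughput** vs static batching is a typical claim; the actual gain depends on how much output lengths vary, and this notebook does not measure it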
- **Lower average latency** (requests don't wait for full batch to clear)
- **Better GPU utilization** (no idle time between batches)

### Implementation Detail

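vLLM pairs continuous batching with PagedAttention. The KV cache is stored in fixed-size blocks that need not be contiguous in memory, so request B's blocks can be freed without touching the caches of requests A, C, and D. With one contiguous KV tensor per batch, adding or removing a request means reallocating or copying memory, which makes dynamic batch changes expensive.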
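
As a rough illustration of that idea, the sketch below models a block-based KV cache allocator in plain Python. It is a toy, not vLLM's block manager: the `BlockAllocator` class, the block size of 16 tokens, and the request names are all assumptions made for the example.

```python
class BlockAllocator:
    """Toy KV-cache allocator: fixed-size blocks handed out from a free list."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # request id -> list of physical block ids

    def allocate(self, request_id, num_tokens):
        needed = -(-num_tokens // self.block_size)  # ceiling division
        if needed > len(self.free_blocks):
            raise MemoryError("not enough free KV blocks")
        self.block_tables[request_id] = [self.free_blocks.pop() for _ in range(needed)]

    def free(self, request_id):
        # Returning blocks to the free list touches no other request's table.
        self.free_blocks.extend(self.block_tables.pop(request_id))

allocator = BlockAllocator(num_blocks=8)
for request_id in "ABCD":
    allocator.allocate(request_id, num_tokens=20)  # 2 blocks each
tables_before = {r: list(b) for r, b in allocator.block_tables.items() if r != "B"}

allocator.free("B")                      # request B finishes
allocator.allocate("E", num_tokens=30)   # a new request reuses B's blocks
print(allocator.block_tables)
```

Requests A, C, and D keep exactly the same block ids before and after B is freed, which is the property that makes swapping requests in and out of a running batch cheap.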

### Implication for Disaggregation

This baseline is already sophisticated. Disaggregation must beat continuous batching, not naive sequential inference. That's a high bar.

## Step 4: Batch Processing Test

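Process multiple requests in a batch. vLLM's continuous batching dynamically manages the batch as requests complete. Note that all 8 prompts are submitted at once and share the same `max_tokens`, so this test mostly shows the benefit of batching itself; refilling freed slots matters more when requests arrive over time with varied output lengths.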

In [None]:
# Generate multiple test prompts
test_prompts = [
    "What is a REST API?",
    "Explain database indexing.",
    "How does DNS work?",
    "What are microservices?",
    "Describe container orchestration.",
    "What is continuous integration?",
    "Explain message queues.",
    "How does caching improve performance?"
]

batch_size = len(test_prompts)
print(f"Processing batch of {batch_size} requests...\n")

start = time.time()
outputs = llm.generate(test_prompts, sampling_params)
end = time.time()

# Calculate aggregate metrics
total_tokens = sum(len(output.outputs[0].token_ids) for output in outputs)
total_time = end - start
throughput = total_tokens / total_time
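# Amortized wall-clock time per request (total_time / batch_size). All requests
# run concurrently, so each one actually waits close to total_time for its result.
avg_latency_per_request = (total_time / batch_size) * 1000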

print(f"Batch Results:")
print(f"  Total time: {total_time:.2f} seconds")
print(f"  Total tokens: {total_tokens}")
print(f"  Throughput: {throughput:.1f} tokens/sec")
print(f"  Avg latency per request: {avg_latency_per_request:.1f} ms")
print(f"  Speedup vs sequential: {(batch_size * latency_ms / 1000) / total_time:.2f}x")

Processing batch of 8 requests...



Adding requests: 100%|██████████| 8/8 [00:00<00:00, 137.92it/s]
Processed prompts: 100%|██████████| 8/8 [00:06<00:00,  1.22it/s, est. speed input: 7.62 toks/s, output: 121.88 toks/s]

Batch Results:
  Total time: 6.63 seconds
  Total tokens: 800
  Throughput: 120.6 tokens/sec
  Avg latency per request: 829.0 ms
  Speedup vs sequential: 8.34x





## Step 5: Understand Prefill vs Decode Time

LLM inference has two phases:
- **Prefill**: Process input prompt, compute KV cache (compute-bound)
- **Decode**: Generate tokens one at a time (memory-bound)

Disaggregated serving splits these phases across nodes. Let's measure them separately.
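
A back-of-envelope check of the "memory-bound" label, using numbers already reported earlier in this notebook: if each decode step has to stream all model weights from memory once, the per-token latency implies an effective memory bandwidth. The parameter count is an assumption (roughly 8 billion for this model); the data type comes from the engine config log and the latency from the single-request test.

```python
# If decode is memory-bound, each decode step reads the full set of weights
# from memory at least once.
PARAMS = 8.0e9          # assumption: ~8B parameters
BYTES_PER_PARAM = 2     # bfloat16, as reported in the engine config log
PER_TOKEN_S = 0.069     # ~69 ms/token from the single-request test

weight_bytes = PARAMS * BYTES_PER_PARAM
implied_bandwidth_gbs = weight_bytes / PER_TOKEN_S / 1e9

print(f"Weights: {weight_bytes / 1e9:.0f} GB")
print(f"Implied effective bandwidth: {implied_bandwidth_gbs:.0f} GB/s")
```

Comparing this figure with the device's memory bandwidth specification shows how close decode runs to the memory limit. It is also consistent with the batch result: 8 requests finished in about the same wall-clock time as one (6.63 s vs 6.92 s), which is what you would expect when each step is dominated by reading the weights once regardless of batch size.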

In [21]:

def measure_prefill_decode_split(prompt, num_output_tokens=50):
    """Measure prefill and decode time separately"""
    
    # Prefill: Process prompt with 1 output token
    prefill_params = SamplingParams(
        temperature=0.0,
        max_tokens=1,  # Only 1 token to measure prefill
        top_p=1.0
    )
    
    start = time.time()
    outputs = llm.generate([prompt], prefill_params)
    prefill_time = time.time() - start
    
    # Full generation to measure decode
    decode_params = SamplingParams(
        temperature=0.0,
        max_tokens=num_output_tokens,
        top_p=1.0
    )
    
    start = time.time()
    outputs = llm.generate([prompt], decode_params)
    total_time = time.time() - start
    
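    # Approximate decode time: the max_tokens=1 run above stands in for prefill
    # (it also includes request overhead and sampling the first token), so the
    # remaining time is spread over the other (actual_tokens - 1) tokens.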
    actual_tokens = len(outputs[0].outputs[0].token_ids)
    decode_time = total_time - prefill_time
    per_token_decode = decode_time / max(1, actual_tokens - 1)
    
    return {
        'prefill_ms': prefill_time * 1000,
        'decode_ms': decode_time * 1000,
        'per_token_ms': per_token_decode * 1000,
        'total_ms': total_time * 1000,
        'tokens': actual_tokens
    }

# Test with different prompt lengths
test_cases = [
    ("Short prompt", "What is TCP?"),
    ("Medium prompt", "Explain the OSI network model and describe each layer in detail."),
    ("Long prompt", "Describe the architecture of a modern distributed database system, including replication strategies, consistency models, and failure handling mechanisms. Explain how these systems achieve high availability."),
]

print("Measuring Prefill vs Decode Time:\n")
print(f"{'Prompt Length':<15} {'Prefill':<10} {'Decode':<10} {'Per-Token':<12} {'Total':<10}")
print("-" * 65)

for name, prompt in test_cases:
    metrics = measure_prefill_decode_split(prompt, num_output_tokens=30)
    print(f"{name:<15} {metrics['prefill_ms']:>8.1f}ms {metrics['decode_ms']:>8.1f}ms {metrics['per_token_ms']:>10.1f}ms {metrics['total_ms']:>8.1f}ms")

print("\nKey Insight:")
print("  Prefill time grows with input length (compute-bound)")
print("  Decode time is per-token and roughly constant (memory-bound)")
print("  Disaggregation splits these phases across specialized nodes")

Measuring Prefill vs Decode Time:

Prompt Length   Prefill    Decode     Per-Token    Total     
-----------------------------------------------------------------


Adding requests: 100%|██████████| 1/1 [00:00<00:00, 369.54it/s]
Processed prompts: 100%|██████████| 1/1 [00:00<00:00, 14.27it/s, est. speed input: 72.15 toks/s, output: 14.43 toks/s]
Adding requests: 100%|██████████| 1/1 [00:00<00:00, 1824.40it/s]
Processed prompts: 100%|██████████| 1/1 [00:02<00:00,  2.07s/it, est. speed input: 2.42 toks/s, output: 14.52 toks/s]


Short prompt        76.5ms   1999.7ms       69.0ms   2076.2ms


Adding requests: 100%|██████████| 1/1 [00:00<00:00, 62.62it/s]
Processed prompts: 100%|██████████| 1/1 [00:00<00:00, 14.03it/s, est. speed input: 201.16 toks/s, output: 14.37 toks/s]
Adding requests: 100%|██████████| 1/1 [00:00<00:00, 2000.14it/s]
Processed prompts: 100%|██████████| 1/1 [00:02<00:00,  2.08s/it, est. speed input: 6.77 toks/s, output: 14.51 toks/s]


Medium prompt       92.3ms   1992.0ms       68.7ms   2084.4ms


Adding requests: 100%|██████████| 1/1 [00:00<00:00, 121.99it/s]
Processed prompts: 100%|██████████| 1/1 [00:00<00:00, 12.15it/s, est. speed input: 381.60 toks/s, output: 12.31 toks/s]
Adding requests: 100%|██████████| 1/1 [00:00<00:00, 1589.96it/s]
Processed prompts: 100%|██████████| 1/1 [00:02<00:00,  2.07s/it, est. speed input: 15.03 toks/s, output: 14.54 toks/s]

Long prompt         97.1ms   1984.8ms       68.4ms   2081.9ms

Key Insight:
  Prefill time grows with input length (compute-bound)
  Decode time is per-token and roughly constant (memory-bound)
  Disaggregation splits these phases across specialized nodes





## Step 6: Memory Usage Profiling

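Track GPU memory usage during inference. The KV cache grows with sequence length and is what gets transferred in disaggregated serving.

Two caveats apply to this measurement. First, vLLM reserves its KV cache pool when the engine starts (controlled by `gpu_memory_utilization`), so device-level memory usage is not expected to grow token by token. Second, on this GB10 system `nvidia-smi` reports memory usage as "Not Supported", so the cell below cannot parse a value and prints 0 MB.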

In [27]:
if torch.cuda.is_available():
    import subprocess
    import re
    
    # Get baseline GPU memory via nvidia-smi
    baseline_mb = 0
    try:
        # Parse full nvidia-smi output - most reliable method
        result = subprocess.run(['nvidia-smi'], capture_output=True, text=True)
        output = result.stdout
        
        # Look for memory usage line like "16234MiB / 81559MiB"
        # This appears in the main GPU status table (not the processes table)
        memory_pattern = r'(\d+)MiB\s*/\s*(\d+)MiB'
        matches = re.findall(memory_pattern, output)
        
        if matches:
            # First match is usually GPU 0
            baseline_mb = int(matches[0][0])
            total_mb = int(matches[0][1])
            print(f"Baseline GPU Memory: {baseline_mb} MB / {total_mb} MB total\n")
        else:
            print("Could not parse memory from nvidia-smi")
            print(f"nvidia-smi output:\n{output[:1000]}")
            baseline_mb = 0
    except Exception as e:
        print(f"Could not query nvidia-smi: {e}")
        print("Continuing without memory tracking...")
        baseline_mb = 0
    
    print("Running inference with memory tracking...\n")
    
    # Generate with longer sequence
    long_params = SamplingParams(
        temperature=0.0,
        max_tokens=200,
        top_p=1.0
    )
    
    prompt = "Explain distributed systems in detail."
    outputs = llm.generate([prompt], long_params)
    
    # Check GPU memory after inference
    peak_mb = baseline_mb
    try:
        result = subprocess.run(['nvidia-smi'], capture_output=True, text=True)
        output = result.stdout
        
        memory_pattern = r'(\d+)MiB\s*/\s*(\d+)MiB'
        matches = re.findall(memory_pattern, output)
        
        if matches:
            peak_mb = int(matches[0][0])
        else:
            print("Could not parse memory after inference")
    except Exception as e:
        print(f"Could not query nvidia-smi after inference: {e}")
    
    inference_mb = peak_mb - baseline_mb
    
    tokens_generated = len(outputs[0].outputs[0].token_ids)
    memory_per_token = inference_mb / tokens_generated if tokens_generated > 0 and inference_mb > 0 else 0
    
    print(f"Memory Usage:")
    print(f"  Baseline (model loaded): {baseline_mb:.0f} MB")
    print(f"  During inference: {peak_mb:.0f} MB")
    print(f"  Additional for inference: {inference_mb:.0f} MB")
    print(f"  Tokens generated: {tokens_generated}")
    print(f"  Memory per token: {memory_per_token:.2f} MB/token")
    
    print("\nNote:")
    print("  Inference memory includes KV cache + activations")
    print("  KV cache scales linearly with sequence length")
    print("  In disaggregated serving, this KV cache is transferred between nodes")
else:
    print("CUDA not available - skipping memory profiling")

Could not parse memory from nvidia-smi
nvidia-smi output:
Wed Feb  4 23:24:32 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.95.05              Driver Version: 580.95.05      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA GB10                    On  |   0000000F:01:00.0 Off |                  N/A |
| N/A   42C    P0             11W /  N/A  | Not Supported          |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+-----
Running inference with memory tra

Adding requests: 100%|██████████| 1/1 [00:00<00:00, 1212.93it/s]
Processed prompts:   0%|          | 0/1 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]

Processed prompts: 100%|██████████| 1/1 [00:13<00:00, 13.79s/it, est. speed input: 0.58 toks/s, output: 14.51 toks/s]

Could not parse memory after inference
Memory Usage:
  Baseline (model loaded): 0 MB
  During inference: 0 MB
  Additional for inference: 0 MB
  Tokens generated: 200
  Memory per token: 0.00 MB/token

Note:
  Inference memory includes KV cache + activations
  KV cache scales linearly with sequence length
  In disaggregated serving, this KV cache is transferred between nodes
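
Since the `nvidia-smi` readout above came back as 0 MB, an analytical estimate is a useful stand-in. The sketch below computes KV cache size per token from the model architecture. The layer count, KV head count, and head dimension are assumed values for Llama-3.1-8B and should be checked against the model's `config.json`; the prompt length is approximate.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value):
    # Factor of 2: one key vector and one value vector per layer per KV head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value

# Assumed architecture values for Llama-3.1-8B (verify against config.json):
# 32 layers, 8 KV heads (grouped-query attention), head dimension 128, bfloat16.
per_token = kv_cache_bytes_per_token(
    num_layers=32, num_kv_heads=8, head_dim=128, bytes_per_value=2
)

prompt_tokens = 8      # approximate length of the prompt used above
output_tokens = 200    # matches max_tokens in the cell above
sequence_bytes = per_token * (prompt_tokens + output_tokens)

print(f"KV cache per token: {per_token / 1024:.0f} KiB")
print(f"KV cache for {prompt_tokens + output_tokens} tokens: {sequence_bytes / 2**20:.1f} MiB")
```

This is the per-request payload a disaggregated setup would have to move from the prefill node to the decode node, before any compression or quantization of the cache.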





## Step 7: Baseline Performance Summary

Collect all baseline metrics for comparison with disaggregated serving later.

In [28]:
from datetime import datetime

baseline_metrics = {
    "timestamp": datetime.now().isoformat(),
    "model": MODEL_NAME,
    "config": {
        "tensor_parallel_size": 1,
        "gpu_memory_utilization": 0.9
    },
    "single_request": {
        "latency_ms": latency_ms,
        "tokens": tokens_generated,
        "throughput_tokens_per_sec": tokens_per_sec
    },
    "batch_processing": {
        "batch_size": batch_size,
        "total_tokens": total_tokens,
        "throughput_tokens_per_sec": throughput,
        "avg_latency_ms": avg_latency_per_request
    },
    "memory": {
        "baseline_mb": baseline_mb,
        "peak_mb": peak_mb,
        "inference_mb": inference_mb,
        "memory_per_token_mb": memory_per_token
    } if torch.cuda.is_available() else None
}

# Save metrics
metrics_file = Path("baseline_metrics.json")
with open(metrics_file, 'w') as f:
    json.dump(baseline_metrics, f, indent=2)

print("="*60)
print("BASELINE PERFORMANCE SUMMARY")
print("="*60)
print(f"\nSingle Request:")
print(f"  Latency: {latency_ms:.1f} ms")
print(f"  Throughput: {tokens_per_sec:.1f} tokens/sec")
print(f"\nBatch Processing ({batch_size} requests):")
print(f"  Throughput: {throughput:.1f} tokens/sec")
print(f"  Avg Latency: {avg_latency_per_request:.1f} ms")
if torch.cuda.is_available():
    print(f"\nMemory:")
    print(f"  Model: {baseline_mb:.0f} MB")
    print(f"  Inference: {inference_mb:.0f} MB")
    print(f"  Per-token: {memory_per_token:.2f} MB/token")
print(f"\nMetrics saved to: {metrics_file}")
print("\nThis is what we're trying to beat with disaggregation.")

BASELINE PERFORMANCE SUMMARY

Single Request:
  Latency: 6916.3 ms
  Throughput: 14.5 tokens/sec

Batch Processing (8 requests):
  Throughput: 120.6 tokens/sec
  Avg Latency: 829.0 ms

Memory:
  Model: 0 MB
  Inference: 0 MB
  Per-token: 0.00 MB/token

Metrics saved to: baseline_metrics.json

This is what we're trying to beat with disaggregation.
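
When a later notebook produces metrics in the same shape, a small helper can express the difference as a percentage. This is a sketch only: `relative_change` is a hypothetical helper, and the candidate value below is a placeholder, not a measurement.

```python
def relative_change(baseline, candidate, key_path):
    """Percent change of candidate vs baseline for a nested metric."""
    def lookup(metrics):
        for key in key_path:
            metrics = metrics[key]
        return metrics

    base = lookup(baseline)
    return (lookup(candidate) - base) / base * 100.0

baseline = {"batch_processing": {"throughput_tokens_per_sec": 120.6}}   # from the summary above
candidate = {"batch_processing": {"throughput_tokens_per_sec": 150.0}}  # placeholder, not measured

change = relative_change(
    baseline, candidate, ["batch_processing", "throughput_tokens_per_sec"]
)
print(f"Throughput change: {change:+.1f}%")
```

In practice the baseline dictionary would come from `json.load` on `baseline_metrics.json` rather than being written inline.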


## Key Takeaways

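**What we measured:**
- Single-node vLLM performance with continuous batching: about 14.5 tokens/sec for one request and 120.6 tokens/sec for a batch of 8
- Prefill vs decode time split: roughly 77-97 ms for prefill and about 69 ms per decoded token
- Memory usage was not captured: `nvidia-smi` reports memory as "Not Supported" on the GB10, so the memory fields in `baseline_metrics.json` are 0

**Why this matters:**
- These are honest baseline numbers from well-configured infrastructure
- Disaggregated serving must beat this to be worthwhile
- The KV cache is what needs to be transferred between nodes; its size still has to be measured or estimated, because the memory readout failed here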

**What's next:**
- [02_Understanding_KV_Cache.ipynb](02_Understanding_KV_Cache.ipynb) - Deep dive into what the KV cache actually contains and why transferring it is expensive