# Production Benchmarking with guidellm

The manual benchmarks in Notebooks 01, 03, and 04 fire 8 concurrent requests and measure end-to-end latency. That approach is useful for learning, but it lacks the rigor needed for production evaluation: no TTFT/TPOT breakdown, no percentile distributions, no load sweeps, and no warmup/cooldown periods.

This notebook runs [guidellm](https://github.com/vllm-project/guidellm) against all three serving configurations on the same hardware. guidellm provides standardized metrics (time to first token (TTFT), time per output token (TPOT), and inter-token latency (ITL), with P50/P95/P99 percentiles), automatic rate sweeps, and saturation detection.
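
To make the metric names concrete, here is a minimal sketch of how TTFT, TPOT, ITL, and a percentile can be derived from per-token arrival timestamps of one streamed request. The function names `request_metrics` and `percentile` are illustrative, not part of guidellm, and exact definitions (TPOT in particular) vary between tools, so treat the formulas as one common convention rather than guidellm's implementation.

```python
import math


def request_metrics(send_time, token_times):
    """Latency metrics (in seconds) for one streamed request.

    send_time:   when the request was sent
    token_times: arrival time of each generated token, in order
    """
    if not token_times:
        raise ValueError("need at least one token timestamp")
    ttft = token_times[0] - send_time            # time to first token
    e2e = token_times[-1] - send_time            # end-to-end latency
    # Inter-token latency: gap between each pair of consecutive tokens.
    itl = [b - a for a, b in zip(token_times, token_times[1:])]
    # TPOT (one convention): decode time averaged over tokens after the first.
    decode_tokens = len(token_times) - 1
    tpot = (e2e - ttft) / decode_tokens if decode_tokens else 0.0
    return {"ttft": ttft, "tpot": tpot, "itl": itl, "e2e": e2e}


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% at or below it."""
    if not values:
        raise ValueError("need at least one value")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]
```

Percentiles matter because a mean hides tail behavior: a configuration can have a good average TTFT while its P99 is several times worse under load.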

guidellm runs on the controller (a CPU-only node), so the load generator never competes with vLLM for CPU, memory, or network interfaces on the GPU nodes. In the replicated and disaggregated phases the proxy also runs on the controller, so its overhead is part of what guidellm measures.

## What We're Comparing

| Configuration | Setup | Endpoint |
|---------------|-------|----------|
| Single node | 1 vLLM instance on spark-01 | spark-01:8100 |
| Replicated | 2 independent instances + round-robin proxy | controller:8192 |
| Disaggregated | Prefill on spark-01, decode on spark-02, NIXL/RDMA | controller:8192 |

Same model (Llama-3.1-8B-Instruct), same memory budget (`--gpu-memory-utilization 0.3`), same prompt and output lengths (256 input tokens, 128 output tokens).

## Prerequisites
- Notebooks 00, 01, 03, and 04 completed (environment verified, all configurations tested)
- vLLM cu130 build installed on both GPU nodes
- Model cached on both GPU nodes
- Passwordless SSH to controller, spark-01, and spark-02

## Step 1: Install guidellm and Load Configuration

In [1]:
import json
import subprocess
import time
import os
import signal
import urllib.request
import urllib.error
from pathlib import Path

# Load HuggingFace token from .env file.
# guidellm's synthetic data generator needs HF authentication to access
# gated models (e.g., Llama). The token is read from .env and passed to
# the controller via SSH environment variables.
#
# Create the file if it does not exist:
#   echo 'HF_TOKEN=hf_...' > /home/nvidia/src/github.com/elizabetht/spark/.env
env_file = Path("/home/nvidia/src/github.com/elizabetht/spark/.env")
HF_TOKEN = None
if env_file.exists():
    for line in env_file.read_text().splitlines():
        line = line.strip()
        if line.startswith("HF_TOKEN="):
            HF_TOKEN = line.split("=", 1)[1].strip().strip('"').strip("'")
            print(f"HF_TOKEN: loaded from {env_file} ({HF_TOKEN[:8]}...)")
            break
if not HF_TOKEN:
    HF_TOKEN = os.environ.get("HF_TOKEN")
    if HF_TOKEN:
        print(f"HF_TOKEN: loaded from environment ({HF_TOKEN[:8]}...)")
    else:
        raise ValueError(
            f"HF_TOKEN not found. guidellm needs it to generate synthetic prompts "
            f"for gated models.\n"
            f"Create {env_file} with:\n"
            f"  echo 'HF_TOKEN=hf_your_token_here' > {env_file}"
        )

# Load environment config
config_file = Path("environment_config.json")
if config_file.exists():
    with open(config_file) as f:
        env_config = json.load(f)
    print(f"Loaded config from {config_file}")
else:
    raise FileNotFoundError("Run 00_Environment_Setup.ipynb first")

# Node configuration
NODE1_HOST = env_config['network']['node1_ip']   # spark-01: 192.168.100.10 (IB)
NODE2_HOST = env_config['network']['node2_ip']   # spark-02: 192.168.100.11 (IB)
NODE1_LAN_HOST = "192.168.1.76"                  # spark-01 LAN
NODE2_LAN_HOST = "192.168.1.77"                  # spark-02 LAN
CONTROLLER_HOST = "192.168.1.75"                 # controller LAN
MODEL_NAME = env_config['model']['name']
NODE1_PORT = 8100
NODE2_PORT = 8200
PROXY_PORT = 8192
NIXL_PORT = 5600
VENV_PATH = os.path.expanduser("~/src/github.com/elizabetht/spark/.venv")

# Find model snapshot path
cache_dir = Path.home() / ".cache" / "huggingface" / "hub"
model_slug = MODEL_NAME.replace("/", "--")
model_cache = list(cache_dir.glob(f"models--{model_slug}*"))
if model_cache:
    snapshots_dir = model_cache[0] / "snapshots"
    snapshot_dirs = list(snapshots_dir.iterdir()) if snapshots_dir.exists() else []
    MODEL_PATH = str(snapshot_dirs[0]) if snapshot_dirs else str(model_cache[0])
    print(f"Model path: {MODEL_PATH}")
else:
    raise FileNotFoundError(f"Model {MODEL_NAME} not found in cache")

# Benchmark parameters (consistent across all configurations)
INPUT_LEN = 256
OUTPUT_LEN = 128
BENCH_DURATION = 60       # seconds per benchmark profile
RESULTS_DIR = Path("benchmark_results")
RESULTS_DIR.mkdir(exist_ok=True)

# Remote results directory on controller
REMOTE_RESULTS_DIR = "/tmp/benchmark_results"

# Install guidellm on the controller (CPU-only node, no GPU contention)
print("Checking guidellm on controller...")
check = subprocess.run(
    ['ssh', '-o', 'ConnectTimeout=5', f'nvidia@{CONTROLLER_HOST}',
     f'{VENV_PATH}/bin/guidellm --help'],
    capture_output=True, text=True, timeout=15
)
if check.returncode == 0:
    print("  guidellm CLI: available on controller")
else:
    print("  guidellm not found on controller, installing...")
    result = subprocess.run(
        ['ssh', '-o', 'ConnectTimeout=5', f'nvidia@{CONTROLLER_HOST}',
         f'{VENV_PATH}/bin/pip install "guidellm[recommended]"'],
        capture_output=True, text=True, timeout=600
    )
    if result.returncode == 0:
        print("  guidellm installed on controller")
    else:
        print(f"  Installation failed: {result.stderr.strip()[-500:]}")
        raise RuntimeError("Failed to install guidellm on controller")

# Create remote results directory
subprocess.run(
    ['ssh', '-o', 'ConnectTimeout=5', f'nvidia@{CONTROLLER_HOST}',
     f'mkdir -p {REMOTE_RESULTS_DIR}'],
    capture_output=True, timeout=10
)

# Copy the tokenizer to the controller so guidellm can load it from a local
# path (passed as --processor in run_guidellm) rather than fetching it from the
# gated Llama repository. Only the tokenizer files are needed (~10 MB), not the
# full model weights.
CONTROLLER_TOKENIZER_PATH = f"/tmp/tokenizer-{model_slug}"

tokenizer_files = ["tokenizer.json", "tokenizer_config.json", "special_tokens_map.json"]

print(f"Copying tokenizer to controller ({CONTROLLER_TOKENIZER_PATH})...")
subprocess.run(
    ['ssh', '-o', 'ConnectTimeout=5', f'nvidia@{CONTROLLER_HOST}',
     f'mkdir -p {CONTROLLER_TOKENIZER_PATH}'],
    capture_output=True, timeout=10
)

for tf in tokenizer_files:
    local_tf = Path(MODEL_PATH) / tf
    if local_tf.exists():
        scp = subprocess.run(
            ['scp', str(local_tf),
             f'nvidia@{CONTROLLER_HOST}:{CONTROLLER_TOKENIZER_PATH}/{tf}'],
            capture_output=True, text=True, timeout=15
        )
        if scp.returncode == 0:
            print(f"  {tf}: copied")
        else:
            print(f"  {tf}: SCP failed ({scp.stderr.strip()})")
    else:
        print(f"  {tf}: not found locally (skipping)")

print(f"\nBenchmark parameters:")
print(f"  Input tokens:  {INPUT_LEN}")
print(f"  Output tokens: {OUTPUT_LEN}")
print(f"  Duration:      {BENCH_DURATION}s per profile")
print(f"  Load generator: controller ({CONTROLLER_HOST})")
print(f"  Results dir:   {RESULTS_DIR} (local, copied from controller after each phase)")

HF_TOKEN: loaded from /home/nvidia/src/github.com/elizabetht/spark/.env (hf_WvokP...)
Loaded config from environment_config.json
Model path: /home/nvidia/.cache/huggingface/hub/models--meta-llama--Llama-3.1-8B-Instruct/snapshots/0e9e39f249a16976918f6564b8830bc894c89659
Checking guidellm on controller...
  guidellm CLI: available on controller
Copying tokenizer to controller (/tmp/tokenizer-meta-llama--Llama-3.1-8B-Instruct)...
  tokenizer.json: copied
  tokenizer_config.json: copied
  special_tokens_map.json: copied

Benchmark parameters:
  Input tokens:  256
  Output tokens: 128
  Duration:      60s per profile
  Load generator: controller (192.168.1.75)
  Results dir:   benchmark_results (local, copied from controller after each phase)
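
The setup cell takes the first entry returned by `iterdir()` as the model snapshot, and that order is not guaranteed when the cache holds more than one revision. A minimal sketch of a deterministic alternative, assuming the most recently modified snapshot is the one wanted (the helper name `latest_snapshot` is hypothetical):

```python
from pathlib import Path


def latest_snapshot(model_cache_dir):
    """Return the most recently modified snapshot directory, or None.

    model_cache_dir is a Hugging Face hub cache entry such as
    ~/.cache/huggingface/hub/models--<org>--<name>.
    """
    snapshots = Path(model_cache_dir) / "snapshots"
    if not snapshots.is_dir():
        return None
    candidates = [p for p in snapshots.iterdir() if p.is_dir()]
    # default=None covers an empty snapshots/ directory.
    return max(candidates, key=lambda p: p.stat().st_mtime, default=None)
```

With a single cached revision this returns the same path as the original code; it only changes behavior when several snapshots exist.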


## Step 2: Helper Functions

Shared utilities for starting and stopping vLLM instances and proxies, and for running guidellm. The load generator (guidellm) runs on the controller, a CPU-only node with no GPU workload. This avoids contention with the vLLM instances on the GPU nodes and gives all three configurations the same measurement point.

In [2]:
def wait_for_server(host, port, label, timeout=300, interval=10):
    """Poll a vLLM server's health endpoint until it responds."""
    url = f"http://{host}:{port}/health"
    start = time.time()
    while time.time() - start < timeout:
        try:
            req = urllib.request.Request(url, method='GET')
            with urllib.request.urlopen(req, timeout=5) as resp:
                if resp.status == 200:
                    elapsed = time.time() - start
                    print(f"  {label}: ready ({elapsed:.0f}s)")
                    return True
        except (urllib.error.URLError, ConnectionRefusedError, OSError):
            pass
        elapsed = time.time() - start
        print(f"  {label}: waiting... ({elapsed:.0f}s / {timeout}s)")
        time.sleep(interval)
    print(f"  {label}: TIMEOUT after {timeout}s")
    return False


def stop_all_services():
    """Kill all vLLM and proxy processes on all nodes."""
    print("Stopping all services...")
    # The [x] bracket in each remote pattern keeps `pkill -f` from matching the
    # SSH-spawned shell, whose own command line contains the pattern text.
    # Local vLLM (spark-01)
    subprocess.run(['pkill', '-f', 'vllm serve'], capture_output=True, timeout=5)
    # Remote vLLM (spark-02)
    subprocess.run(
        ['ssh', '-o', 'ConnectTimeout=5', f'nvidia@{NODE2_HOST}',
         'pkill -f "[v]llm serve" 2>/dev/null; true'],
        capture_output=True, timeout=10
    )
    # Proxies on controller
    subprocess.run(
        ['ssh', '-o', 'ConnectTimeout=5', f'nvidia@{CONTROLLER_HOST}',
         'pkill -f "[d]isagg_proxy.py" 2>/dev/null; pkill -f "[r]eplicated_proxy.py" 2>/dev/null; true'],
        capture_output=True, timeout=10
    )
    time.sleep(3)
    # Force-kill survivors
    subprocess.run(['pkill', '-9', '-f', 'vllm serve'], capture_output=True, timeout=5)
    subprocess.run(
        ['ssh', '-o', 'ConnectTimeout=5', f'nvidia@{NODE2_HOST}',
         'pkill -9 -f "[v]llm serve" 2>/dev/null; true'],
        capture_output=True, timeout=10
    )
    time.sleep(2)
    print("  All services stopped")


def start_vllm_local(port, kv_transfer_config=None, log_file="/tmp/vllm_bench.log"):
    """Start a vLLM instance on the local node (spark-01)."""
    cmd = (
        f". {VENV_PATH}/bin/activate && "
        f"CUDA_HOME=/usr/local/cuda-13.0 "
        f"HF_HUB_OFFLINE=1 TRANSFORMERS_OFFLINE=1 "
    )
    if kv_transfer_config:
        cmd += f"VLLM_NIXL_SIDE_CHANNEL_HOST={NODE1_HOST} VLLM_NIXL_SIDE_CHANNEL_PORT={NIXL_PORT} "
    cmd += (
        f"vllm serve {MODEL_PATH} "
        f"--port {port} "
        f"--gpu-memory-utilization 0.3 "
        f"--tensor-parallel-size 1 "
    )
    if kv_transfer_config:
        cmd += f"--kv-transfer-config '{kv_transfer_config}' "
    cmd += f"> {log_file} 2>&1"

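    # start_new_session=True calls setsid() in the child, like preexec_fn=os.setsid,
    # but without the thread-safety caveats of preexec_fn.
    proc = subprocess.Popen(
        cmd, shell=True,
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
        start_new_session=True
    )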
    return proc


def start_vllm_remote(host, port, kv_transfer_config=None, log_file="/tmp/vllm_bench.log"):
    """Start a vLLM instance on a remote node via SSH."""
    cmd = (
        f". {VENV_PATH}/bin/activate && "
        f"CUDA_HOME=/usr/local/cuda-13.0 "
        f"HF_HUB_OFFLINE=1 TRANSFORMERS_OFFLINE=1 "
    )
    if kv_transfer_config:
        cmd += f"VLLM_NIXL_SIDE_CHANNEL_HOST={NODE2_HOST} VLLM_NIXL_SIDE_CHANNEL_PORT={NIXL_PORT} "
    cmd += (
        f"vllm serve {MODEL_PATH} "
        f"--port {port} "
        f"--gpu-memory-utilization 0.3 "
        f"--tensor-parallel-size 1 "
    )
    if kv_transfer_config:
        cmd += f"--kv-transfer-config '{kv_transfer_config}' "
    cmd += f"> {log_file} 2>&1"

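    # start_new_session=True calls setsid() in the child, like preexec_fn=os.setsid,
    # but without the thread-safety caveats of preexec_fn.
    proc = subprocess.Popen(
        ['ssh', f'nvidia@{host}', cmd],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
        start_new_session=True
    )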
    return proc


def run_guidellm(target_url, label, profile="sweep", rate=None, extra_args=None):
    """
    Run guidellm benchmark on the controller node via SSH.

    guidellm runs on the controller (CPU-only) to avoid contending with
    vLLM on the GPU nodes. Results are written to the controller's
    /tmp/benchmark_results/<label>/ directory, then copied back to the
    local RESULTS_DIR via SCP.

    Args:
        target_url: Full URL (e.g., http://192.168.1.76:8100)
        label: Name for results files (e.g., 'single_node')
        profile: 'sweep', 'synchronous', 'concurrent', 'throughput'
        rate: Rate parameter (concurrent connections or RPS depending on profile)
        extra_args: Additional CLI arguments as a list
    """
    remote_output = f"{REMOTE_RESULTS_DIR}/{label}"
    local_output = RESULTS_DIR / label
    local_output.mkdir(exist_ok=True)

    # Build the guidellm command to run on the controller.
    # HF_TOKEN is required: guidellm's synthetic data generator downloads
    # the model config from HuggingFace to calibrate prompt token counts.
    guidellm_cmd = (
        f". {VENV_PATH}/bin/activate && "
        f"export HF_TOKEN={HF_TOKEN} && "
        f"mkdir -p {remote_output} && "
        f"guidellm benchmark "
        f"--target {target_url} "
        f"--request-type text_completions "
        f"--data 'prompt_tokens={INPUT_LEN},output_tokens={OUTPUT_LEN}' "
        f"--processor {CONTROLLER_TOKENIZER_PATH} "
        f"--profile {profile} "
        f"--max-seconds {BENCH_DURATION} "
        f"--output-path {remote_output}"
    )
    if rate is not None:
        guidellm_cmd += f" --rate {rate}"
    if extra_args:
        guidellm_cmd += " " + " ".join(extra_args)

    print(f"Running guidellm [{label}] on controller ({CONTROLLER_HOST})")
    print(f"  Profile: {profile}")
    print(f"  Target:  {target_url}")
    print(f"  Remote output: {remote_output}/")
    print(f"  Duration: ~{BENCH_DURATION}s per rate point")
    print()

    result = subprocess.run(
        ['ssh', '-o', 'ConnectTimeout=5', f'nvidia@{CONTROLLER_HOST}', guidellm_cmd],
        capture_output=False, text=True,
        timeout=BENCH_DURATION * 20  # generous timeout for sweep (multiple rate points)
    )

    if result.returncode == 0:
        print(f"\nBenchmark [{label}] completed on controller.")
        # Copy results back to local machine
        print(f"Copying results from controller to {local_output}/...")
        scp_result = subprocess.run(
            ['scp', '-r', f'nvidia@{CONTROLLER_HOST}:{remote_output}/',
             str(local_output.parent) + '/'],
            capture_output=True, text=True, timeout=30
        )
        if scp_result.returncode == 0:
            print(f"  Results saved to {local_output}/")
        else:
            print(f"  SCP failed: {scp_result.stderr.strip()}")
            print(f"  Results are still on controller at {remote_output}/")
    else:
        print(f"\nBenchmark [{label}] failed (exit code {result.returncode})")

    return result.returncode

print("Helper functions defined.")


Helper functions defined.
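
The helpers above assemble shell commands by string concatenation, which works as long as no value contains quotes or spaces that the remote shell would reinterpret. The JSON passed to `--kv-transfer-config` and the `HF_TOKEN` export are the values most at risk. A minimal sketch of quoting with the standard library's `shlex` follows; the helper name `build_remote_command` and the model path `/models/example` are hypothetical.

```python
import json
import shlex


def build_remote_command(env, argv):
    """Render `KEY=value ... program args` as one shell-safe string.

    env:  mapping of environment variable names to values
    argv: the program and its arguments as a list
    """
    assignments = [f"{key}={shlex.quote(str(value))}" for key, value in env.items()]
    # shlex.join quotes each argument only where the shell requires it.
    return " ".join(assignments + [shlex.join(argv)])


kv_config = json.dumps({"kv_connector": "NixlConnector", "kv_role": "kv_both"})
cmd = build_remote_command(
    {"HF_HUB_OFFLINE": "1"},
    ["vllm", "serve", "/models/example", "--kv-transfer-config", kv_config],
)
print(cmd)
```

The resulting string can be passed as the final argument to `ssh`, and the JSON reaches `vllm serve` as a single argument regardless of the quotes it contains.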


## Phase 1: Single Node Benchmark

Start one vLLM instance on spark-01 and run guidellm directly against it. No proxy, no second node. This is the baseline that both replicated and disaggregated must beat.

In [3]:
# Ensure clean state
stop_all_services()

# Start single vLLM instance on spark-01
print("\nStarting single-node vLLM on spark-01...")
single_proc = start_vllm_local(NODE1_PORT, log_file="/tmp/vllm_single.log")
print(f"  PID: {single_proc.pid}")
print(f"  Log: tail -f /tmp/vllm_single.log")

# Wait for ready
print("\nWaiting for vLLM to load model...")
ready = wait_for_server(NODE1_HOST, NODE1_PORT, "Single node (spark-01)")
if not ready:
    raise RuntimeError("vLLM failed to start. Check /tmp/vllm_single.log")

Stopping all services...
  All services stopped

Starting single-node vLLM on spark-01...
  PID: 987244
  Log: tail -f /tmp/vllm_single.log

Waiting for vLLM to load model...
  Single node (spark-01): waiting... (0s / 300s)
  Single node (spark-01): waiting... (10s / 300s)
  Single node (spark-01): waiting... (20s / 300s)
  Single node (spark-01): waiting... (30s / 300s)
  Single node (spark-01): waiting... (40s / 300s)
  Single node (spark-01): waiting... (50s / 300s)
  Single node (spark-01): waiting... (60s / 300s)
  Single node (spark-01): waiting... (70s / 300s)
  Single node (spark-01): waiting... (80s / 300s)
  Single node (spark-01): waiting... (90s / 300s)
  Single node (spark-01): waiting... (100s / 300s)
  Single node (spark-01): waiting... (110s / 300s)
  Single node (spark-01): ready (120s)
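
With a fixed 10-second poll interval, readiness can be reported up to 10 seconds late, and a fast-starting service such as the proxy is polled more slowly than it needs to be. One alternative is to poll with exponential backoff. This is a sketch, not a replacement used elsewhere in this notebook; `wait_until` and its parameters are hypothetical names, and the `clock` and `sleep` arguments exist only so the loop can be tested without real waiting.

```python
import time


def wait_until(probe, timeout=300.0, initial=0.5, factor=2.0, max_interval=10.0,
               clock=time.monotonic, sleep=time.sleep):
    """Call probe() until it returns True or `timeout` seconds elapse.

    Returns the elapsed seconds on success, or None on timeout.
    The delay between probes starts at `initial`, is multiplied by `factor`
    after each failed probe, and is capped at `max_interval`.
    """
    start = clock()
    interval = initial
    while True:
        if probe():
            return clock() - start
        remaining = timeout - (clock() - start)
        if remaining <= 0:
            return None
        # Never sleep past the deadline.
        sleep(min(interval, remaining))
        interval = min(interval * factor, max_interval)
```

A health check like the one in `wait_for_server` could be wrapped in a zero-argument function and passed as `probe`.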


In [4]:
# Run guidellm sweep against single node
# guidellm runs on the controller and targets spark-01's LAN IP
run_guidellm(
    target_url=f"http://{NODE1_LAN_HOST}:{NODE1_PORT}",
    label="single_node",
    profile="sweep"
)

Running guidellm [single_node] on controller (192.168.1.75)
  Profile: sweep
  Target:  http://192.168.1.76:8100
  Remote output: /tmp/benchmark_results/single_node/
  Duration: ~60s per rate point

✔ OpenAIHTTPBackend backend validated with model /home/nvidia/.cache/huggingface/hub/models--meta-llama--Llama-3.1-8B-Instruct/snapshots/0e9e39f249a16976918f6564b8830bc894c89659
  {'target': 'http://192.168.1.76:8100', 'model': None, 'timeout': 60.0,        
  'http2': True, 'follow_redirects': True, 'verify': False, 'openai_paths':     
  {'health': 'health', 'models': 'v1/models', 'text_completions':               
  'v1/completions', 'chat_completions': 'v1/chat/completions',                  
  'audio_transcriptions': 'v1/audio/transcriptions', 'audio_translations':      
  'v1/audio/translations'}, 'validate_backend': {'method': 'GET', 'url':        
  'http://192.168.1.76:8100/health'}}                                           
✔ Processor resolved
  Using processor '/tmp/tokenizer

0
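
A sweep visits several rate points between the lowest and highest load the server sustains, which is why `run_guidellm` allows a timeout of `BENCH_DURATION * 20`. The sketch below illustrates the general idea of spacing rates evenly between a synchronous (one request at a time) measurement and a saturated-throughput measurement. It is an illustration under stated assumptions, not guidellm's code: the function names are hypothetical, and the number of points guidellm actually uses should be checked against its documentation.

```python
def sweep_rates(sync_rps, max_rps, points=10):
    """Evenly spaced request rates strictly between two measured bounds.

    sync_rps: requests/second observed with one request in flight
    max_rps:  requests/second observed at saturation
    points:   total benchmark runs, including the two bounding runs
    """
    if points < 3:
        raise ValueError("need at least one intermediate point")
    if max_rps <= sync_rps:
        raise ValueError("max_rps must exceed sync_rps")
    steps = points - 2                      # the bounds are runs themselves
    gap = (max_rps - sync_rps) / (steps + 1)
    return [sync_rps + gap * i for i in range(1, steps + 1)]


def sweep_wall_time(points, seconds_per_point):
    """Lower bound on sweep duration, ignoring startup and teardown."""
    return points * seconds_per_point
```

Under the assumption of 10 points at 60 seconds each, `sweep_wall_time(10, 60)` gives 600 seconds, which fits inside the 1200-second timeout with room for setup between runs.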

In [None]:
# Stop single node before starting replicated setup
stop_all_services()
print("Phase 1 complete. Single-node benchmark results saved.")

## Phase 2: Replicated Benchmark

Start two independent vLLM instances (one per node) behind the round-robin proxy on the controller. This is the same configuration as Notebook 03.

In [9]:
# Start Instance A on spark-01
print("Starting replicated setup...")
rep_proc_a = start_vllm_local(NODE1_PORT, log_file="/tmp/vllm_rep_a.log")
print(f"  Instance A (spark-01): PID {rep_proc_a.pid}")

# Start Instance B on spark-02
rep_proc_b = start_vllm_remote(NODE2_HOST, NODE2_PORT, log_file="/tmp/vllm_rep_b.log")
print(f"  Instance B (spark-02): PID {rep_proc_b.pid}")

# Wait for both
print("\nWaiting for both instances...")
ready_a = wait_for_server(NODE1_HOST, NODE1_PORT, "Instance A (spark-01)")
ready_b = wait_for_server(NODE2_HOST, NODE2_PORT, "Instance B (spark-02)")

if not (ready_a and ready_b):
    print("Check logs:")
    print(f"  Instance A: tail -50 /tmp/vllm_rep_a.log")
    print(f"  Instance B: ssh nvidia@{NODE2_HOST} 'tail -50 /tmp/vllm_rep_b.log'")
    raise RuntimeError("One or both instances failed to start")

# Deploy and start round-robin proxy on controller
# Always regenerate to ensure the /v1/models route is present
# (guidellm calls /v1/models to discover the model before benchmarking)
proxy_script = Path("/tmp/replicated_proxy.py")
script_content = f'''#!/usr/bin/env python3
import itertools, logging
import uvicorn
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse, StreamingResponse, Response
import httpx

logging.basicConfig(level=logging.INFO, format="%(asctime)s [proxy] %(message)s")
logger = logging.getLogger("replicated_proxy")

BACKENDS = [
    "http://{NODE1_LAN_HOST}:{NODE1_PORT}",
    "http://{NODE2_LAN_HOST}:{NODE2_PORT}",
]
backend_cycle = itertools.cycle(BACKENDS)
TIMEOUT = httpx.Timeout(timeout=120.0)
app = FastAPI()

@app.get("/health")
async def health():
    return {{"status": "ok"}}

@app.get("/v1/models")
async def models():
    backend = next(backend_cycle)
    async with httpx.AsyncClient() as client:
        resp = await client.get(f"{{backend}}/v1/models", timeout=TIMEOUT)
        return JSONResponse(content=resp.json())

@app.post("/v1/completions")
@app.post("/v1/chat/completions")
async def handle_request(request: Request):
    body = await request.json()
    backend = next(backend_cycle)
    path = request.url.path
    is_streaming = body.get("stream", False)

    if is_streaming:
        # Stream SSE chunks directly from the backend to the client.
        # guidellm sends stream=true by default, so responses arrive as
        # Server-Sent Events that cannot be parsed as a single JSON blob.
        async def stream_response():
            async with httpx.AsyncClient() as client:
                async with client.stream("POST", f"{{backend}}{{path}}", json=body, timeout=TIMEOUT) as resp:
                    async for chunk in resp.aiter_bytes():
                        yield chunk
        return StreamingResponse(stream_response(), media_type="text/event-stream")
    else:
        async with httpx.AsyncClient() as client:
            resp = await client.post(f"{{backend}}{{path}}", json=body, timeout=TIMEOUT)
            return Response(content=resp.content, status_code=resp.status_code,
                            media_type=resp.headers.get("content-type", "application/json"))

if __name__ == "__main__":
    logger.info(f"Replicated proxy on 0.0.0.0:{PROXY_PORT}")
    uvicorn.run(app, host="0.0.0.0", port={PROXY_PORT}, log_level="info")
'''
proxy_script.write_text(script_content)

# Kill any existing proxy, copy the updated script, and start fresh
subprocess.run(
    ['ssh', '-o', 'ConnectTimeout=5', f'nvidia@{CONTROLLER_HOST}',
     'pkill -f "[r]eplicated_proxy.py" 2>/dev/null; sleep 1; pkill -9 -f "[r]eplicated_proxy.py" 2>/dev/null; true'],
    capture_output=True, timeout=10
)
time.sleep(1)
subprocess.run(
    ['scp', str(proxy_script), f'nvidia@{CONTROLLER_HOST}:/tmp/replicated_proxy.py'],
    capture_output=True, timeout=10
)
proxy_cmd = (
    f". {VENV_PATH}/bin/activate && "
    f"nohup python /tmp/replicated_proxy.py > /tmp/replicated_proxy.log 2>&1 < /dev/null &"
)
subprocess.Popen(
    ['ssh', f'nvidia@{CONTROLLER_HOST}', proxy_cmd],
    stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL, stdin=subprocess.DEVNULL
)
time.sleep(3)
proxy_ready = wait_for_server(CONTROLLER_HOST, PROXY_PORT, "Proxy (controller)", timeout=30, interval=2)
if not proxy_ready:
    raise RuntimeError("Proxy failed to start. Check /tmp/replicated_proxy.log on controller")

print("\nReplicated setup ready.")

Starting replicated setup...
  Instance A (spark-01): PID 988599
  Instance B (spark-02): PID 988601

Waiting for both instances...
  Instance A (spark-01): ready (0s)
  Instance B (spark-02): ready (0s)
  Proxy (controller): ready (0s)

Replicated setup ready.
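
The replicated proxy picks a backend with `itertools.cycle`, which hands out backends in strict rotation regardless of how busy each one is. A small standalone sketch of that behavior, using hypothetical backend URLs:

```python
import itertools
from collections import Counter

# Hypothetical backend URLs, standing in for the two vLLM instances.
backends = ["http://backend-a:8100", "http://backend-b:8200"]
backend_cycle = itertools.cycle(backends)

# Each incoming request takes the next backend in rotation.
assigned = [next(backend_cycle) for _ in range(7)]
counts = Counter(assigned)

for url in backends:
    print(f"{url}: {counts[url]} requests")
```

Two consequences are worth keeping in mind when reading the results. First, the rotation balances request counts, not work: with fixed prompt and output lengths, as in this benchmark, the two are close to equivalent, but they diverge for mixed workloads. Second, in the proxy script every call to `/v1/models` also advances the shared cycle, so the first completion request does not necessarily land on the first backend.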


In [10]:
# Run guidellm sweep against replicated proxy
run_guidellm(
    target_url=f"http://{CONTROLLER_HOST}:{PROXY_PORT}",
    label="replicated",
    profile="sweep"
)

Running guidellm [replicated] on controller (192.168.1.75)
  Profile: sweep
  Target:  http://192.168.1.75:8192
  Remote output: /tmp/benchmark_results/replicated/
  Duration: ~60s per rate point

✔ OpenAIHTTPBackend backend validated with model /home/nvidia/.cache/huggingface/hub/models--meta-llama--Llama-3.1-8B-Instruct/snapshots/0e9e39f249a16976918f6564b8830bc894c89659
  {'target': 'http://192.168.1.75:8192', 'model': None, 'timeout': 60.0,        
  'http2': True, 'follow_redirects': True, 'verify': False, 'openai_paths':     
  {'health': 'health', 'models': 'v1/models', 'text_completions':               
  'v1/completions', 'chat_completions': 'v1/chat/completions',                  
  'audio_transcriptions': 'v1/audio/transcriptions', 'audio_translations':      
  'v1/audio/translations'}, 'validate_backend': {'method': 'GET', 'url':        
  'http://192.168.1.75:8192/health'}}                                           
✔ Processor resolved
  Using processor '/tmp/tokenizer-m

0

In [11]:
# Stop replicated setup before starting disaggregated
stop_all_services()
print("Phase 2 complete. Replicated benchmark results saved.")

Stopping all services...
  All services stopped
Phase 2 complete. Replicated benchmark results saved.


## Phase 3: Disaggregated Benchmark

Start prefill on spark-01 and decode on spark-02 with NixlConnector, behind the disaggregated proxy on the controller. This is the same configuration as Notebook 04.

In [12]:
# KV transfer configuration for NixlConnector
kv_config_prefill = json.dumps({
    "kv_connector": "NixlConnector",
    "kv_role": "kv_both",
    "kv_ip": NODE1_HOST,
    "kv_port": NIXL_PORT
})
kv_config_decode = json.dumps({
    "kv_connector": "NixlConnector",
    "kv_role": "kv_both",
    "kv_ip": NODE2_HOST,
    "kv_port": NIXL_PORT
})

# Start prefill on spark-01
print("Starting disaggregated setup...")
prefill_proc = start_vllm_local(
    NODE1_PORT,
    kv_transfer_config=kv_config_prefill,
    log_file="/tmp/vllm_prefill.log"
)
print(f"  Prefill (spark-01): PID {prefill_proc.pid}")

# Start decode on spark-02
decode_proc = start_vllm_remote(
    NODE2_HOST, NODE2_PORT,
    kv_transfer_config=kv_config_decode,
    log_file="/tmp/vllm_decode.log"
)
print(f"  Decode (spark-02):  PID {decode_proc.pid}")

# Wait for both instances
print("\nWaiting for instances to load model and initialize NIXL...")
prefill_ready = wait_for_server(NODE1_HOST, NODE1_PORT, "Prefill (spark-01)")
decode_ready = wait_for_server(NODE2_HOST, NODE2_PORT, "Decode (spark-02)")

if not (prefill_ready and decode_ready):
    print("Check logs:")
    print(f"  Prefill: tail -50 /tmp/vllm_prefill.log")
    print(f"  Decode:  ssh nvidia@{NODE2_HOST} 'tail -50 /tmp/vllm_decode.log'")
    raise RuntimeError("One or both instances failed to start")

# Deploy and start disaggregated proxy on controller.
# Always regenerate to ensure /v1/models route and streaming support are present
# (guidellm calls /v1/models for model discovery and sends stream=true requests).
disagg_proxy = Path("/tmp/disagg_proxy.py")
disagg_content = f'''#!/usr/bin/env python3
import uuid, logging, itertools
import uvicorn
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse, StreamingResponse, Response
import httpx

logging.basicConfig(level=logging.INFO, format="%(asctime)s [proxy] %(message)s")
logger = logging.getLogger("disagg_proxy")

PREFILL_URL = "http://{NODE1_LAN_HOST}:{NODE1_PORT}"
DECODE_URL = "http://{NODE2_LAN_HOST}:{NODE2_PORT}"
BACKENDS = [PREFILL_URL, DECODE_URL]
backend_cycle = itertools.cycle(BACKENDS)
TIMEOUT = httpx.Timeout(timeout=120.0)

app = FastAPI()

@app.get("/health")
async def health():
    return {{"status": "ok"}}

@app.get("/v1/models")
async def models():
    backend = next(backend_cycle)
    async with httpx.AsyncClient() as client:
        resp = await client.get(f"{{backend}}/v1/models", timeout=TIMEOUT)
        return JSONResponse(content=resp.json())

@app.post("/v1/completions")
@app.post("/v1/chat/completions")
async def handle_request(request: Request):
    body = await request.json()
    path = request.url.path
    request_id = str(uuid.uuid4())
    is_streaming = body.get("stream", False)

    async with httpx.AsyncClient() as client:
        # Step 1: Prefill (prompt processing + KV cache generation)
        prefill_body = dict(body)
        prefill_body["max_tokens"] = 1
        prefill_body["stream"] = False
        prefill_body["kv_transfer_params"] = {{
            "do_remote_decode": True,
            "do_remote_prefill": False,
            "remote_engine_id": None,
            "remote_block_ids": None,
            "remote_host": None,
            "remote_port": None,
        }}
        headers = {{"X-Request-Id": request_id}}
        resp = await client.post(f"{{PREFILL_URL}}{{path}}", json=prefill_body,
                                 headers=headers, timeout=TIMEOUT)
        resp.raise_for_status()
        prefill_resp = resp.json()

        # Step 2: Extract kv_transfer_params
        kv_params = prefill_resp.get("kv_transfer_params")
        if not kv_params:
            logger.error(f"No kv_transfer_params in prefill response for {{request_id}}")
            return JSONResponse(status_code=502, content={{
                "error": "Prefill did not return kv_transfer_params. "
                         "Verify NixlConnector is configured on both instances."}})

        # Step 3: Decode (KV cache pull via NIXL + token generation)
        decode_body = dict(body)
        decode_body["kv_transfer_params"] = kv_params

        if is_streaming:
            async def stream_response():
                async with httpx.AsyncClient() as dc:
                    async with dc.stream("POST", f"{{DECODE_URL}}{{path}}",
                                         json=decode_body, headers=headers,
                                         timeout=TIMEOUT) as dresp:
                        async for chunk in dresp.aiter_bytes():
                            yield chunk
            return StreamingResponse(stream_response(), media_type="text/event-stream")
        else:
            dresp = await client.post(f"{{DECODE_URL}}{{path}}", json=decode_body,
                                      headers=headers, timeout=TIMEOUT)
            dresp.raise_for_status()
            return Response(content=dresp.content, status_code=dresp.status_code,
                            media_type=dresp.headers.get("content-type", "application/json"))

if __name__ == "__main__":
    logger.info(f"Proxy listening on 0.0.0.0:{PROXY_PORT}")
    logger.info(f"  Prefill: {{PREFILL_URL}}")
    logger.info(f"  Decode:  {{DECODE_URL}}")
    uvicorn.run(app, host="0.0.0.0", port={PROXY_PORT}, log_level="info")
'''
disagg_proxy.write_text(disagg_content)

# Kill any existing proxy, copy, and start fresh
subprocess.run(
    ['ssh', '-o', 'ConnectTimeout=5', f'nvidia@{CONTROLLER_HOST}',
     'pkill -f "[d]isagg_proxy.py" 2>/dev/null; sleep 1; pkill -9 -f "[d]isagg_proxy.py" 2>/dev/null; true'],
    capture_output=True, timeout=10
)
time.sleep(1)
subprocess.run(
    ['scp', str(disagg_proxy), f'nvidia@{CONTROLLER_HOST}:/tmp/disagg_proxy.py'],
    capture_output=True, timeout=10
)
proxy_cmd = (
    f". {VENV_PATH}/bin/activate && "
    f"nohup python /tmp/disagg_proxy.py > /tmp/disagg_proxy.log 2>&1 < /dev/null &"
)
subprocess.Popen(
    ['ssh', f'nvidia@{CONTROLLER_HOST}', proxy_cmd],
    stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL, stdin=subprocess.DEVNULL
)
time.sleep(3)
proxy_ready = wait_for_server(CONTROLLER_HOST, PROXY_PORT, "Proxy (controller)", timeout=30, interval=2)
if not proxy_ready:
    raise RuntimeError("Proxy failed to start. Check /tmp/disagg_proxy.log on controller")

print("\nDisaggregated setup ready.")

Starting disaggregated setup...
  Prefill (spark-01): PID 990114
  Decode (spark-02):  PID 990116

Waiting for instances to load model and initialize NIXL...
  Prefill (spark-01): waiting... (0s / 300s)
  Prefill (spark-01): waiting... (10s / 300s)
  Prefill (spark-01): waiting... (20s / 300s)
  Prefill (spark-01): waiting... (30s / 300s)
  Prefill (spark-01): waiting... (40s / 300s)
  Prefill (spark-01): waiting... (50s / 300s)
  Prefill (spark-01): waiting... (60s / 300s)
  Prefill (spark-01): waiting... (70s / 300s)
  Prefill (spark-01): waiting... (80s / 300s)
  Prefill (spark-01): waiting... (90s / 300s)
  Prefill (spark-01): waiting... (100s / 300s)
  Prefill (spark-01): waiting... (110s / 300s)
  Prefill (spark-01): waiting... (120s / 300s)
  Prefill (spark-01): ready (130s)
  Decode (spark-02): ready (0s)
  Proxy (controller): ready (1s)

Disaggregated setup ready.
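
The disaggregated proxy turns each client request into two backend requests: a prefill request capped at one token, then a decode request carrying the `kv_transfer_params` returned by prefill. The sketch below isolates that transformation as two pure functions so it can be read and tested without a running server. The field names mirror the proxy script above; the function names and the sample `kv_params` value are hypothetical.

```python
def build_prefill_body(body):
    """Copy the client body and restrict it to prompt processing only."""
    prefill = dict(body)
    prefill["max_tokens"] = 1       # generate a single token, then stop
    prefill["stream"] = False       # the proxy needs one JSON response
    prefill["kv_transfer_params"] = {
        "do_remote_decode": True,
        "do_remote_prefill": False,
        "remote_engine_id": None,
        "remote_block_ids": None,
        "remote_host": None,
        "remote_port": None,
    }
    return prefill


def build_decode_body(body, kv_params):
    """Copy the client body and attach the KV handle returned by prefill."""
    if not kv_params:
        raise ValueError("prefill response carried no kv_transfer_params")
    decode = dict(body)
    decode["kv_transfer_params"] = kv_params
    return decode


client_body = {"model": "m", "prompt": "hello", "max_tokens": 128, "stream": True}
prefill_body = build_prefill_body(client_body)
# Hypothetical value; the real contents come from the prefill response.
decode_body = build_decode_body(client_body, {"remote_engine_id": "engine-0"})
```

Because both functions copy the client body, the decode request keeps the original `max_tokens` and `stream` settings. One implication for the metrics: the client sees no tokens until prefill has finished and decode has started streaming, so TTFT in this configuration includes the prefill round trip, the KV transfer, and the first decode step.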


In [13]:
# Run guidellm sweep against disaggregated proxy
run_guidellm(
    target_url=f"http://{CONTROLLER_HOST}:{PROXY_PORT}",
    label="disaggregated",
    profile="sweep"
)

Running guidellm [disaggregated] on controller (192.168.1.75)
  Profile: sweep
  Target:  http://192.168.1.75:8192
  Remote output: /tmp/benchmark_results/disaggregated/
  Duration: ~60s per rate point

✔ OpenAIHTTPBackend backend validated with model 
/home/nvidia/.cache/huggingface/hub/models--meta-llama--Llama-3.1-8B-Instruct/sn
apshots/0e9e39f249a16976918f6564b8830bc894c89659
  {'target': 'http://192.168.1.75:8192', 'model': None, 'timeout': 60.0,        
  'http2': True, 'follow_redirects': True, 'verify': False, 'openai_paths':     
  {'health': 'health', 'models': 'v1/models', 'text_completions':               
  'v1/completions', 'chat_completions': 'v1/chat/completions',                  
  'audio_transcriptions': 'v1/audio/transcriptions', 'audio_translations':      
  'v1/audio/translations'}, 'validate_backend': {'method': 'GET', 'url':        
  'http://192.168.1.75:8192/health'}}                                           
✔ Processor resolved
  Using processor '/tmp/token

0

In [14]:
# Stop all services
stop_all_services()
print("Phase 3 complete. Disaggregated benchmark results saved.")

Stopping all services...
  All services stopped
Phase 3 complete. Disaggregated benchmark results saved.


## Step 3: Compare Results

Load the benchmark results from all three phases and produce a side-by-side comparison. guidellm saves results as `benchmarks.json` in each output directory.

In [15]:
def load_guidellm_results(label):
    """Load guidellm benchmark results from the output directory."""
    results_path = RESULTS_DIR / label / "benchmarks.json"
    if not results_path.exists():
        # Fall back to any JSON file in the directory (sorted so the choice is deterministic)
        alt_paths = sorted((RESULTS_DIR / label).glob("*.json"))
        if alt_paths:
            results_path = alt_paths[0]
        else:
            print(f"  {label}: no results found in {RESULTS_DIR / label}")
            return None
    with open(results_path) as f:
        data = json.load(f)
    print(f"  {label}: loaded from {results_path}")
    return data


def extract_summary(data, label):
    """Extract key metrics from guidellm v0.5.3 results.

    guidellm nests metrics under: metrics -> <metric_name> -> successful -> {mean, percentiles}
    Time-related keys use _ms suffixes (already in milliseconds), except
    request_latency which is in seconds.
    """
    if data is None:
        return None

    benchmarks = data.get('benchmarks', [data]) if isinstance(data, dict) else data

    summaries = []
    for bench in benchmarks:
        if not isinstance(bench, dict):
            continue
        metrics = bench.get('metrics', {})

        # Helper to extract a stat from the nested structure
        def get_stat(metric_key, stat='mean', category='successful'):
            bucket = metrics.get(metric_key, {}).get(category, {})
            if stat == 'mean':
                return bucket.get('mean', 0) or 0
            percentiles = bucket.get('percentiles', {})
            return percentiles.get(stat, 0) or 0

        # Output tokens/sec (raw number, not ms)
        otps_mean = get_stat('output_tokens_per_second', 'mean')

        # TTFT: already in ms (key is time_to_first_token_ms)
        ttft_mean = get_stat('time_to_first_token_ms', 'mean')
        ttft_p95 = get_stat('time_to_first_token_ms', 'p95')

        # TPOT: already in ms (key is time_per_output_token_ms)
        tpot_mean = get_stat('time_per_output_token_ms', 'mean')
        tpot_p95 = get_stat('time_per_output_token_ms', 'p95')

        # ITL: already in ms (key is inter_token_latency_ms)
        itl_mean = get_stat('inter_token_latency_ms', 'mean')

        # Request latency: in seconds, convert to ms
        latency_mean_s = get_stat('request_latency', 'mean')
        latency_p95_s = get_stat('request_latency', 'p95')

        # Request rate
        rps_mean = get_stat('requests_per_second', 'mean')

        summary = {
            'request_rate_mean': rps_mean,
            'output_tps_mean': otps_mean,
            'ttft_mean_ms': ttft_mean,
            'ttft_p95_ms': ttft_p95,
            'tpot_mean_ms': tpot_mean,
            'tpot_p95_ms': tpot_p95,
            'itl_mean_ms': itl_mean,
            'latency_mean_ms': latency_mean_s * 1000,
            'latency_p95_ms': latency_p95_s * 1000,
        }
        summaries.append(summary)

    return summaries


print("Loading benchmark results...\n")
results = {}
for label in ['single_node', 'replicated', 'disaggregated']:
    data = load_guidellm_results(label)
    if data:
        results[label] = extract_summary(data, label)

# Display peak throughput comparison
header = (
    f"{'Configuration':<20} {'Output tok/s':>12} {'vs. baseline':>12} "
    f"{'TTFT mean':>12} {'TTFT P95':>12} {'TPOT mean':>12} {'Latency P95':>12}"
)
print("\n" + "=" * len(header))
print("PEAK THROUGHPUT COMPARISON")
print("=" * len(header))
print(header)
print("-" * len(header))

baseline_tps = None
for label, display_name in [('single_node', 'Single node'), ('replicated', 'Replicated'), ('disaggregated', 'Disaggregated')]:
    if label not in results or not results[label]:
        print(f"{display_name:<20} {'(no data)':>12}")
        continue

    # Find the run with highest throughput (peak of the sweep)
    peak = max(results[label], key=lambda x: x['output_tps_mean'])

    tps = peak['output_tps_mean']
    if baseline_tps is None:
        baseline_tps = tps
        ratio_str = "(baseline)"
    else:
        ratio = tps / baseline_tps if baseline_tps > 0 else 0
        ratio_str = f"({ratio:.2f}x)"

    # Each column is 12 characters wide to match the header ("ms" counts toward the width)
    print(
        f"{display_name:<20} "
        f"{tps:>12.1f} "
        f"{ratio_str:>12} "
        f"{peak['ttft_mean_ms']:>10.1f}ms "
        f"{peak['ttft_p95_ms']:>10.1f}ms "
        f"{peak['tpot_mean_ms']:>10.1f}ms "
        f"{peak['latency_p95_ms']:>10.1f}ms"
    )

print("\nNote: Peak throughput is the highest output tok/s observed across all")
print("rate points in the sweep. Other metrics are from that same rate point.")
print(f"\nFull results in: {RESULTS_DIR.resolve()}/")
print("  Each subdirectory contains benchmarks.json (raw data) and benchmarks.html (charts)")

Loading benchmark results...

  single_node: loaded from benchmark_results/single_node/benchmarks.json
  replicated: loaded from benchmark_results/replicated/benchmarks.json
  disaggregated: loaded from benchmark_results/disaggregated/benchmarks.json

==================================================================================================
PEAK THROUGHPUT COMPARISON
==================================================================================================
Configuration        Output tok/s vs. baseline    TTFT mean     TTFT P95    TPOT mean  Latency P95
--------------------------------------------------------------------------------------------------
Single node                 406.0   (baseline)     7908.2ms    15376.4ms      390.9ms    59280.4ms
Replicated                   14.4      (0.04x)      612.6ms      612.6ms       74.4ms     9525.0ms
Disaggregated                14.0      (0.03x)      520.3ms      520.3ms       75.3ms     9644.3ms

Note: Peak throughput is the highest output tok/s observed across all
rate points in the sweep. Other metrics are from that same rate point.

Full results in: /home/nvidia/src/github.com/elizabetht/spark/notebooks/02_disaggregated-serving/vllm-native/benchmark_results/
  Each subdirectory contains benchmarks.json (raw data) and benchmarks.html (charts)
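
The comparison above selects each configuration's rate point purely by throughput, so a saturated point with multi-second TTFT can be chosen as the "peak". As a minimal sketch of an alternative, assuming the summary dicts returned by `extract_summary` (keys `output_tps_mean` and `ttft_p95_ms`), the selection can be restricted to rate points that meet a latency budget. The helper name `peak_within_budget`, the 1000 ms budget, and the sample rate points are all hypothetical and chosen only for illustration.

```python
def peak_within_budget(summaries, ttft_p95_budget_ms):
    """Return the highest-throughput summary whose P95 TTFT meets the budget, or None."""
    eligible = [s for s in summaries if s["ttft_p95_ms"] <= ttft_p95_budget_ms]
    if not eligible:
        return None
    return max(eligible, key=lambda s: s["output_tps_mean"])


# Invented rate points for illustration only (not measurements from this notebook).
example_sweep = [
    {"output_tps_mean": 50.0, "ttft_p95_ms": 200.0},
    {"output_tps_mean": 180.0, "ttft_p95_ms": 650.0},
    {"output_tps_mean": 300.0, "ttft_p95_ms": 4000.0},
]

best = peak_within_budget(example_sweep, ttft_p95_budget_ms=1000.0)
print(best)
```

With these sample values the 300 tok/s point is rejected because its P95 TTFT exceeds the budget, and the 180 tok/s point is returned instead. Applying the same budget to every configuration keeps the comparison at a comparable quality of service.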

## Key Takeaways

### What guidellm adds over manual benchmarks

| Metric | Manual (Notebooks 01/03/04) | guidellm (this notebook) |
|--------|---------------------------|-------------------------|
| Latency breakdown | End-to-end only | TTFT, TPOT, ITL separately |
| Statistical rigor | Single data point | P50, P95, P99 distributions |
| Load patterns | 8 requests, all at once | Sweep across rates, warmup/cooldown |
| Saturation detection | None | Automatic |
| Output formats | Print statements | JSON, CSV, HTML charts |
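
To make the "Statistical rigor" row concrete, here is a minimal sketch of how percentile summaries differ from a single averaged number. The latencies are invented for illustration, and `nearest_rank_percentile` is a hypothetical helper using the nearest-rank method; guidellm computes its own percentiles and may interpolate differently.

```python
import math


def nearest_rank_percentile(samples, pct):
    """Return the nearest-rank percentile (0 < pct <= 100) of a non-empty list."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank into the sorted list
    return ordered[max(rank, 1) - 1]


# Invented request latencies in ms: nine fast requests and one slow outlier.
latencies_ms = [100, 110, 105, 120, 115, 108, 112, 900, 107, 111]

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50 = nearest_rank_percentile(latencies_ms, 50)
p95 = nearest_rank_percentile(latencies_ms, 95)
p99 = nearest_rank_percentile(latencies_ms, 99)
print(f"mean={mean_ms:.1f}ms  P50={p50}ms  P95={p95}ms  P99={p99}ms")
# mean=188.8ms  P50=110ms  P95=900ms  P99=900ms
```

The mean hides the fact that most requests finish near 110 ms while the tail reaches 900 ms. The reverse also holds: with a single sample, the mean and every percentile collapse to the same value, so identical mean and P95 figures in a results table can indicate that a rate point contains very few requests.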

### How to read the results

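- **TTFT (Time to First Token)**: How long until the first token arrives. Disaggregated TTFT includes the NIXL transfer hop, so at the same request rate it is expected to be higher than single-node. The comparison table reports each configuration at its own peak-throughput rate point, so its TTFT values reflect different load levels and are not directly comparable.
- **TPOT (Time Per Output Token)**: Average time between consecutive tokens during decode. At comparable load, this should be similar across configurations since decode is memory-bandwidth-bound on the same GPU.
- **ITL (Inter-Token Latency)**: Similar to TPOT but measured as the gap between each consecutive token pair rather than as an average, which makes it more sensitive to jitter.
- **Output tok/s under sweep**: The sweep increases request rate until saturation. The comparison table reports the rate point with the highest output tok/s, which may sit at or past saturation, so read its latency columns together with the throughput figure.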
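
The relationship between the three latency metrics above can be sketched from per-token arrival timestamps. This is an illustration under assumed definitions (TTFT as first-token arrival minus request start, ITL as each consecutive gap, TPOT as the mean of those gaps); guidellm's exact formulas may differ, and `latency_breakdown` and the timestamps are hypothetical.

```python
def latency_breakdown(request_start_s, token_times_s):
    """Compute TTFT, decode-only TPOT, and the worst ITL, all in milliseconds.

    Assumed definitions (one common convention, not necessarily guidellm's):
      TTFT = first token arrival - request start
      ITL  = gap between each pair of consecutive tokens
      TPOT = mean of the ITL gaps
    """
    if not token_times_s:
        raise ValueError("need at least one token timestamp")
    ttft_ms = (token_times_s[0] - request_start_s) * 1000
    itl_ms = [(b - a) * 1000 for a, b in zip(token_times_s, token_times_s[1:])]
    tpot_ms = sum(itl_ms) / len(itl_ms) if itl_ms else 0.0
    return {
        "ttft_ms": ttft_ms,
        "tpot_ms": tpot_ms,
        "itl_max_ms": max(itl_ms, default=0.0),
    }


# Invented timestamps in seconds: three 20 ms gaps and one 60 ms stall.
stats = latency_breakdown(0.0, [0.50, 0.52, 0.54, 0.60, 0.62])
print(
    f"TTFT={stats['ttft_ms']:.1f}ms  "
    f"TPOT={stats['tpot_ms']:.1f}ms  "
    f"max ITL={stats['itl_max_ms']:.1f}ms"
)
# TTFT=500.0ms  TPOT=30.0ms  max ITL=60.0ms
```

The single 60 ms stall barely moves the averaged TPOT but shows up directly in the per-gap ITL values, which is why ITL percentiles are the better signal for streaming smoothness.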

### Where the HTML reports live

Each configuration's `benchmarks.html` file contains interactive charts showing latency vs throughput curves. These are suitable for screenshots in LinkedIn articles or technical documentation.

```
benchmark_results/
  single_node/benchmarks.json, benchmarks.html
  replicated/benchmarks.json, benchmarks.html
  disaggregated/benchmarks.json, benchmarks.html
```
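
Before sharing the reports, it can help to confirm that every expected file in the layout above was actually written. The following is a minimal sketch; `missing_reports` is a hypothetical helper, and the directory name and labels are assumed to match the tree shown above.

```python
from pathlib import Path


def missing_reports(
    results_dir,
    labels=("single_node", "replicated", "disaggregated"),
    filenames=("benchmarks.json", "benchmarks.html"),
):
    """Return the expected report paths that do not exist under results_dir."""
    root = Path(results_dir)
    return [
        root / label / name
        for label in labels
        for name in filenames
        if not (root / label / name).is_file()
    ]


# Assumes the notebook's working directory contains benchmark_results/.
for path in missing_reports("benchmark_results"):
    print(f"missing: {path}")
```

An empty result means all six files are present. Any printed path points to a phase whose benchmark did not finish or whose output was not copied back from the controller.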