# Replicated Serving Baseline

Run two independent vLLM instances (spark-01 and spark-02, each at 0.3 `gpu-memory-utilization`) behind a round-robin proxy on the controller. This is the fair comparison for disaggregated serving: same hardware footprint, same total memory budget, different architecture.

## What We're Building

```
Client Request
      |
      v
  [Round-Robin Proxy] (controller:8192)
      |
      +---> [vLLM Instance A] (spark-01:8100)   ← independent, full pipeline
      |
      +---> [vLLM Instance B] (spark-02:8200)   ← independent, full pipeline
```

Each instance runs the complete inference pipeline (prefill + decode) independently. No KV cache transfer, no NIXL, no coordination between GPUs. The proxy distributes requests across both instances.

## Why This Matters

Notebook 01 measured single-node performance at 0.3 utilization. Notebook 04 measures disaggregated serving with two nodes at 0.3 utilization each. Without this notebook, the only comparison would be one GPU against two, and the two-GPU setup would look better on throughput regardless of architecture.

This notebook uses the same two GPUs with the same memory budget. The question becomes: **does splitting prefill and decode across nodes outperform running two identical replicas?**

## Prerequisites
- Notebooks 00 and 01 completed (environment verified, baseline measured)
- vLLM cu130 build installed on both GPU nodes
- Model cached on both GPU nodes
- Passwordless SSH to controller, spark-01, and spark-02

## Architecture: Replicated vs Disaggregated

The core difference between this notebook and Notebook 04 is how each GPU participates in serving a request.

### Replicated Serving (This Notebook)

Each instance runs the full inference pipeline independently. The proxy's only job is distributing requests.

```
                          ┌─────────────────────────────────────────────────┐
                          │              controller:8192                    │
                          │           Round-Robin Proxy                     │
                          │                                                 │
                          │   Request 1 ──► Instance A                      │
                          │   Request 2 ──► Instance B                      │
                          │   Request 3 ──► Instance A                      │
                          │   Request 4 ──► Instance B                      │
                          └────────┬────────────────────┬───────────────────┘
                                   │                    │
                    ┌──────────────▼──────────┐  ┌──────▼──────────────────┐
                    │   spark-01:8100         │  │   spark-02:8200         │
                    │   vLLM Instance A       │  │   vLLM Instance B       │
                    │                         │  │                         │
                    │   ┌───────────────────┐ │  │   ┌───────────────────┐ │
                    │   │ Prefill (compute) │ │  │   │ Prefill (compute) │ │
                    │   └────────┬──────────┘ │  │   └────────┬──────────┘ │
                    │            ▼            │  │            ▼            │
                    │   ┌───────────────────┐ │  │   ┌───────────────────┐ │
                    │   │ KV Cache (local)  │ │  │   │ KV Cache (local)  │ │
                    │   └────────┬──────────┘ │  │   └────────┬──────────┘ │
                    │            ▼            │  │            ▼            │
                    │   ┌───────────────────┐ │  │   ┌───────────────────┐ │
                    │   │ Decode (generate) │ │  │   │ Decode (generate) │ │
                    │   └───────────────────┘ │  │   └───────────────────┘ │
                    │                         │  │                         │
                    │   No cross-node comms   │  │   No cross-node comms   │
                    └─────────────────────────┘  └─────────────────────────┘
```
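The rotation shown in the proxy box above can be sketched in a few lines. This is a minimal illustration, assuming the same `itertools.cycle` approach the proxy script uses later in this notebook; the `assign_round_robin` helper and the request labels are hypothetical names used only here.

```python
import itertools

# Backend addresses taken from the diagram above.
BACKENDS = ["spark-01:8100", "spark-02:8200"]


def assign_round_robin(requests, backends):
    """Pair each request with the next backend in rotation (hypothetical helper)."""
    rotation = itertools.cycle(backends)
    return [(req, next(rotation)) for req in requests]


assignments = assign_round_robin(
    ["Request 1", "Request 2", "Request 3", "Request 4"], BACKENDS
)
for req, backend in assignments:
    print(f"{req} -> {backend}")
```

Odd-numbered requests land on Instance A and even-numbered ones on Instance B, independent of how busy either instance is.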

### Disaggregated Serving (Notebook 04)

One node specializes in prefill (compute-heavy), the other in decode (memory-bandwidth-bound). KV cache transfers over NIXL/RDMA after prefill completes.

```
                          ┌─────────────────────────────────────────────────┐
                          │              controller:8192                    │
                          │           Disaggregated Proxy                   │
                          │                                                 │
                          │   All requests ──► Prefill node first           │
                          │   After prefill ──► Decode node generates       │
                          └────────┬────────────────────┬───────────────────┘
                                   │                    │
                    ┌──────────────▼──────────┐  ┌──────▼──────────────────┐
                    │   spark-01:8100         │  │   spark-02:8200         │
                    │   Prefill Worker        │  │   Decode Worker         │
                    │                         │  │                         │
                    │   ┌───────────────────┐ │  │                         │
                    │   │ Prefill (compute) │ │  │                         │
                    │   └────────┬──────────┘ │  │                         │
                    │            ▼            │  │                         │
                    │   ┌───────────────────┐ │  │   ┌───────────────────┐ │
                    │   │ KV Cache (local) ──┼─╋──╋──► KV Cache (remote)  │ │
                    │   └───────────────────┘ │  │   └────────┬──────────┘ │
                    │                         │  │            ▼            │
                    │      NIXL/RDMA ─────────┼──┼─►  GPU-to-GPU transfer │
                    │                         │  │            ▼            │
                    │                         │  │   ┌───────────────────┐ │
                    │                         │  │   │ Decode (generate) │ │
                    │                         │  │   └───────────────────┘ │
                    └─────────────────────────┘  └─────────────────────────┘
```

The tradeoff: replicated serving avoids the KV transfer overhead but cannot specialize hardware for different phases. Disaggregated serving pays the transfer cost upfront but enables independent scaling of prefill and decode.

## Step 1: Load Configuration

In [1]:
import json
import subprocess
import time
import os
from pathlib import Path

# Load environment config from Notebook 00
config_file = Path("environment_config.json")
if config_file.exists():
    with open(config_file) as f:
        env_config = json.load(f)
    print(f"Loaded config from {config_file}")
else:
    raise FileNotFoundError("Run 00_Environment_Setup.ipynb first")

# Load baseline metrics from Notebook 01
baseline_file = Path("baseline_metrics.json")
if baseline_file.exists():
    with open(baseline_file) as f:
        baseline = json.load(f)
    print(f"Loaded baseline from {baseline_file}")
    print(f"  Single request: {baseline['single_request']['latency_ms']:.1f} ms, "
          f"{baseline['single_request']['throughput_tokens_per_sec']:.1f} tok/s")
    print(f"  Batch (8 req):  {baseline['batch_processing']['throughput_tokens_per_sec']:.1f} tok/s")
else:
    print("WARNING: baseline_metrics.json not found. Run 01_Local_Inference_Baseline.ipynb first.")
    baseline = None

# Configuration
# InfiniBand IPs (192.168.100.x): direct link between GPU nodes.
NODE1_HOST = env_config['network']['node1_ip']   # spark-01: 192.168.100.10
NODE2_HOST = env_config['network']['node2_ip']   # spark-02: 192.168.100.11

# LAN IPs (192.168.1.x): shared subnet between all three nodes.
# The proxy on the controller uses these to reach vLLM's HTTP API.
NODE1_LAN_HOST = "192.168.1.76"                  # spark-01 LAN
NODE2_LAN_HOST = "192.168.1.77"                  # spark-02 LAN
CONTROLLER_HOST = "192.168.1.75"                 # controller: CPU-only node
MODEL_NAME = env_config['model']['name']
NODE1_PORT = 8100
NODE2_PORT = 8200
PROXY_PORT = 8192

# Virtual environment path (same on all nodes)
VENV_PATH = os.path.expanduser("~/src/github.com/elizabetht/spark/.venv")

print(f"\nConfiguration:")
print(f"  Model:       {MODEL_NAME}")
print(f"  Node A:      {NODE1_HOST}:{NODE1_PORT} (LAN: {NODE1_LAN_HOST})")
print(f"  Node B:      {NODE2_HOST}:{NODE2_PORT} (LAN: {NODE2_LAN_HOST})")
print(f"  Proxy:       {CONTROLLER_HOST}:{PROXY_PORT} (round-robin)")

Loaded config from environment_config.json
Loaded baseline from baseline_metrics.json
  Single request: 6874.1 ms, 14.5 tok/s
  Batch (8 req):  122.5 tok/s

Configuration:
  Model:       meta-llama/Llama-3.1-8B-Instruct
  Node A:      192.168.100.10:8100 (LAN: 192.168.1.76)
  Node B:      192.168.100.11:8200 (LAN: 192.168.1.77)
  Proxy:       192.168.1.75:8192 (round-robin)


## Step 2: Verify SSH and Dependencies

Same checks as Notebook 04 (disaggregated): SSH connectivity, model availability, vLLM cu130 build, and proxy dependencies on the controller.
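The build check relies on the local version suffix that `pip show vllm` reports (for example `Version: 0.13.0+cu130`, as in the output below). A minimal sketch of that parsing step, assuming the build tag always follows a `+`; `parse_vllm_build` is a hypothetical helper, not part of vLLM or pip.

```python
def parse_vllm_build(version_line):
    """Split a 'Version: X+tag' line from `pip show` into (version, build_tag).

    Hypothetical helper. Returns (None, None) for an empty line, and a
    build_tag of None when the version has no '+local' suffix.
    """
    line = version_line.strip()
    if not line:
        return None, None
    # Drop the "Version:" prefix if present.
    version = line.split(":", 1)[1].strip() if ":" in line else line
    base, _, tag = version.partition("+")
    return base, (tag or None)


for sample in ["Version: 0.13.0+cu130", "Version: 0.13.0", ""]:
    base, tag = parse_vllm_build(sample)
    status = "CUDA 13 build" if tag == "cu130" else "needs cu130"
    print(f"{sample!r}: version={base}, tag={tag} ({status})")
```

Comparing the tag exactly, rather than testing for a substring anywhere in the line, avoids a false positive if `cu130` ever appeared elsewhere in the output.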

In [2]:
def check_ssh(host, label):
    """Verify passwordless SSH connectivity."""
    result = subprocess.run(
        ['ssh', '-o', 'ConnectTimeout=5', '-o', 'BatchMode=yes',
         f'nvidia@{host}', 'hostname'],
        capture_output=True, text=True, timeout=10
    )
    if result.returncode == 0:
        print(f"  SSH to {label} ({host}): OK (hostname: {result.stdout.strip()})")
        return True
    else:
        print(f"  SSH to {label} ({host}): FAILED")
        print(f"    Error: {result.stderr.strip()}")
        print(f"    Fix: ssh-copy-id nvidia@{host}")
        return False

print("Checking SSH connectivity...\n")
check_ssh(NODE2_HOST, "spark-02")
check_ssh(CONTROLLER_HOST, "controller")

# Verify model is cached on spark-02
result = subprocess.run(
    ['ssh', '-o', 'ConnectTimeout=5', f'nvidia@{NODE2_HOST}',
     'ls -d ~/.cache/huggingface/hub/models--meta-llama--Llama-3.1-8B-Instruct 2>/dev/null && echo FOUND || echo MISSING'],
    capture_output=True, text=True, timeout=10
)
model_status = result.stdout.strip().split('\n')[-1]
print(f"\nModel on spark-02: {model_status}")

# Verify vLLM cu130 on both nodes
VLLM_INSTALL_CMD = (
    f"{VENV_PATH}/bin/pip install vllm==0.13.0 "
    f"--extra-index-url https://wheels.vllm.ai/0.13.0/cu130 "
    f"--extra-index-url https://download.pytorch.org/whl/cu130"
)
TORCH_INSTALL_CMD = (
    f"{VENV_PATH}/bin/pip install torch==2.9.1 torchvision==0.24.1 torchaudio==2.9.1 "
    f"--index-url https://download.pytorch.org/whl/cu130"
)

print("\nvLLM CUDA 13 build check:")
for host, label in [(NODE1_HOST, "spark-01"), (NODE2_HOST, "spark-02")]:
    result = subprocess.run(
        ['ssh', '-o', 'ConnectTimeout=5', f'nvidia@{host}',
         f'{VENV_PATH}/bin/pip show vllm 2>/dev/null | grep Version'],
        capture_output=True, text=True, timeout=15
    )
    version_line = result.stdout.strip()
    if "cu130" in version_line:
        print(f"  {label}: {version_line} (CUDA 13 build)")
    elif version_line:
        print(f"  {label}: {version_line} (wrong build, needs cu130)")
        print(f"    Installing vLLM cu130 on {label}... (this takes several minutes)")
        vllm_result = subprocess.run(
            ['ssh', '-o', 'ConnectTimeout=5', f'nvidia@{host}', VLLM_INSTALL_CMD],
            capture_output=True, text=True, timeout=600
        )
        if vllm_result.returncode == 0:
            print(f"    vLLM cu130 installed on {label}")
        else:
            print(f"    vLLM install failed on {label}: {vllm_result.stderr.strip()[-200:]}")

        print(f"    Installing torch cu130 on {label}...")
        torch_result = subprocess.run(
            ['ssh', '-o', 'ConnectTimeout=5', f'nvidia@{host}', TORCH_INSTALL_CMD],
            capture_output=True, text=True, timeout=600
        )
        if torch_result.returncode == 0:
            print(f"    torch cu130 installed on {label}")
        else:
            print(f"    torch install failed on {label}: {torch_result.stderr.strip()[-200:]}")
    else:
        print(f"  {label}: vLLM NOT FOUND")
        print(f"    Install with: ssh nvidia@{host} '{VLLM_INSTALL_CMD}'")

# Verify controller proxy dependencies
print(f"\nController proxy dependencies:")
missing_pkgs = []
for pkg in ["fastapi", "httpx", "uvicorn"]:
    result = subprocess.run(
        ['ssh', '-o', 'ConnectTimeout=5', f'nvidia@{CONTROLLER_HOST}',
         f'{VENV_PATH}/bin/python -c "import {pkg}; print({pkg}.__version__)"'],
        capture_output=True, text=True, timeout=10
    )
    if result.returncode == 0:
        print(f"  {pkg}: v{result.stdout.strip()}")
    else:
        print(f"  {pkg}: MISSING")
        missing_pkgs.append(pkg)

if missing_pkgs:
    pkgs_str = " ".join(missing_pkgs)
    print(f"\nInstalling missing packages on controller: {pkgs_str}")
    install = subprocess.run(
        ['ssh', '-o', 'ConnectTimeout=5', f'nvidia@{CONTROLLER_HOST}',
         f'{VENV_PATH}/bin/pip install {pkgs_str}'],
        capture_output=True, text=True, timeout=120
    )
    if install.returncode == 0:
        print(f"  Installed successfully")
    else:
        print(f"  Installation failed: {install.stderr.strip()}")
else:
    print("  All dependencies available")

Checking SSH connectivity...

  SSH to spark-02 (192.168.100.11): OK (hostname: spark-02)
  SSH to controller (192.168.1.75): OK (hostname: controller)

Model on spark-02: FOUND

vLLM CUDA 13 build check:
  spark-01: Version: 0.13.0+cu130 (CUDA 13 build)
  spark-02: Version: 0.13.0+cu130 (CUDA 13 build)

Controller proxy dependencies:
  fastapi: v0.128.2
  httpx: v0.28.1
  uvicorn: v0.40.0
  All dependencies available


## Step 3: Stop Existing vLLM Processes

Clear any leftover vLLM or proxy processes on all three nodes.
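One pitfall when running `pgrep -f` or `pkill -f` through SSH: the remote shell's own command line contains the search pattern, so the shell can match itself. A common workaround is to wrap one character in a bracket expression (`[v]llm`), which still matches `vllm` but not the literal text `[v]llm`. The sketch below illustrates the idea, with Python's `re` module standing in for the matcher; the sample command lines are made up for illustration.

```python
import re

# Made-up command lines as they might appear in a process table.
server_cmdline = "vllm serve /models/example --port 8100"
plain_shell = 'bash -c pgrep -f "vllm"'
bracket_shell = 'bash -c pgrep -f "[v]llm"'

plain = re.compile(r"vllm")
bracket = re.compile(r"[v]llm")

# The plain pattern matches the server AND the shell that runs pgrep.
print(bool(plain.search(server_cmdline)), bool(plain.search(plain_shell)))

# The bracket pattern matches the server but not its own shell,
# because that shell's command line contains "[v]llm", not "vllm".
print(bool(bracket.search(server_cmdline)), bool(bracket.search(bracket_shell)))
```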

In [None]:
def stop_stale_vllm(host, label):
    """Kill any running vLLM processes on a node."""
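    # The bracket pattern "[v]llm" still matches "vllm" but keeps the remote
    # shell, whose own command line contains the pattern, from matching itself.
    check = subprocess.run(
        ['ssh', '-o', 'ConnectTimeout=5', f'nvidia@{host}',
         'pgrep -f "[v]llm" | head -5'],
        capture_output=True, text=True, timeout=10
    )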
    if check.stdout.strip():
        pids = check.stdout.strip().split('\n')
        print(f"  {label} ({host}): found {len(pids)} vLLM process(es), stopping...")
        subprocess.run(
            ['ssh', '-o', 'ConnectTimeout=5', f'nvidia@{host}',
             'pkill -f "[v]llm" || true'],
            capture_output=True, timeout=10
        )
        time.sleep(2)
        subprocess.run(
            ['ssh', '-o', 'ConnectTimeout=5', f'nvidia@{host}',
             'pkill -9 -f "[v]llm" 2>/dev/null || true'],
            capture_output=True, timeout=10
        )
        print(f"  {label} ({host}): stopped")
    else:
        print(f"  {label} ({host}): no vLLM processes running")

print("Checking for stale processes...\n")
stop_stale_vllm(NODE1_HOST, "spark-01")
stop_stale_vllm(NODE2_HOST, "spark-02")

# Stop any leftover proxy on the controller.
# Each script gets its own pkill call. The bracket in each pattern
# (e.g. "[r]eplicated") keeps pkill -f from matching the remote shell that
# runs this command, which could otherwise be terminated before the
# second pkill runs.
subprocess.run(
    ['ssh', '-o', 'ConnectTimeout=5', f'nvidia@{CONTROLLER_HOST}',
     'pkill -f "[r]eplicated_proxy.py" 2>/dev/null; pkill -f "[d]isagg_proxy.py" 2>/dev/null; true'],
    capture_output=True, timeout=10
)
print(f"  controller ({CONTROLLER_HOST}): cleared any stale proxy")

# Verify GPU memory is free on both nodes
print("\nGPU memory status:")
for host, label in [(NODE1_HOST, "spark-01"), (NODE2_HOST, "spark-02")]:
    result = subprocess.run(
        ['ssh', '-o', 'ConnectTimeout=5', f'nvidia@{host}',
         'nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits'],
        capture_output=True, text=True, timeout=10
    )
    if result.returncode == 0:
        used, total = result.stdout.strip().split(', ')
        print(f"  {label}: {used} MiB / {total} MiB used")
    else:
        print(f"  {label}: unable to query GPU (nvidia-smi failed)")

Checking for stale processes...

  spark-01 (192.168.100.10): found 3 vLLM process(es), stopping...
  spark-01 (192.168.100.10): stopped
  spark-02 (192.168.100.11): found 3 vLLM process(es), stopping...
  spark-02 (192.168.100.11): stopped
  controller (192.168.1.75): cleared any stale proxy

GPU memory status:
  spark-01: [N/A] MiB / [N/A] MiB used
  spark-02: [N/A] MiB / [N/A] MiB used


## Step 4: Start vLLM Instance A (spark-01)

A standard `vllm serve` with no KV transfer configuration. Each instance runs the full inference pipeline independently: prefill, decode, and response.
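Since both instances use the same flags, the launch command can be produced by one function. The following is a sketch only, assuming the flags and environment variables used in this notebook; `build_vllm_cmd` and the example paths are hypothetical.

```python
import shlex


def build_vllm_cmd(venv_path, model_path, port, log_path, gpu_util=0.3):
    """Return a shell command that launches a standalone vLLM instance.

    Hypothetical helper: no --kv-transfer-config is passed, so the
    instance runs prefill and decode on its own.
    """
    return (
        f". {shlex.quote(venv_path)}/bin/activate && "
        f"CUDA_HOME=/usr/local/cuda-13.0 "
        f"HF_HUB_OFFLINE=1 TRANSFORMERS_OFFLINE=1 "
        f"vllm serve {shlex.quote(model_path)} "
        f"--port {int(port)} "
        f"--gpu-memory-utilization {gpu_util} "
        f"--tensor-parallel-size 1 "
        f"> {shlex.quote(log_path)} 2>&1"
    )


# Example paths are placeholders, not the real cache location.
cmd_a = build_vllm_cmd("/opt/venv", "/models/example", 8100, "/tmp/vllm_node1.log")
cmd_b = build_vllm_cmd("/opt/venv", "/models/example", 8200, "/tmp/vllm_node2.log")
print(cmd_a)
print(cmd_b)
```

Generating both commands from one place makes it harder for the two replicas to drift apart, which matters when the comparison depends on them being identical.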

In [4]:
# Find model snapshot path (same logic as Notebook 01)
cache_dir = Path.home() / ".cache" / "huggingface" / "hub"
model_slug = MODEL_NAME.replace("/", "--")
model_cache = list(cache_dir.glob(f"models--{model_slug}*"))

if model_cache:
    snapshots_dir = model_cache[0] / "snapshots"
    if snapshots_dir.exists():
        snapshot_dirs = list(snapshots_dir.iterdir())
        if snapshot_dirs:
            MODEL_PATH = str(snapshot_dirs[0])
        else:
            MODEL_PATH = str(model_cache[0])
    else:
        MODEL_PATH = str(model_cache[0])
    print(f"Model path: {MODEL_PATH}")
else:
    raise FileNotFoundError(f"Model {MODEL_NAME} not found in cache")

# Build the vLLM command for spark-01.
# No --kv-transfer-config: this is a standalone instance.
node1_cmd = (
    f". {VENV_PATH}/bin/activate && "
    f"CUDA_HOME=/usr/local/cuda-13.0 "
    f"HF_HUB_OFFLINE=1 TRANSFORMERS_OFFLINE=1 "
    f"vllm serve {MODEL_PATH} "
    f"--port {NODE1_PORT} "
    f"--gpu-memory-utilization 0.3 "
    f"--tensor-parallel-size 1 "
    f"> /tmp/vllm_node1.log 2>&1"
)

print("Starting vLLM Instance A on spark-01...")
print(f"Command: vllm serve ... --port {NODE1_PORT} --gpu-memory-utilization 0.3")

node1_proc = subprocess.Popen(
    node1_cmd, shell=True,
    stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
    preexec_fn=os.setsid
)

print(f"Instance A started (PID: {node1_proc.pid})")
print(f"Log: tail -f /tmp/vllm_node1.log")

Model path: /home/nvidia/.cache/huggingface/hub/models--meta-llama--Llama-3.1-8B-Instruct/snapshots/0e9e39f249a16976918f6564b8830bc894c89659
Starting vLLM Instance A on spark-01...
Command: vllm serve ... --port 8100 --gpu-memory-utilization 0.3
Instance A started (PID: 963810)
Log: tail -f /tmp/vllm_node1.log


## Step 5: Start vLLM Instance B (spark-02)

Same configuration as Instance A, running on spark-02 via SSH. No NixlConnector, no RDMA side-channel. Each instance is completely independent.

In [5]:
# Build the vLLM command for spark-02.
# Identical to spark-01: standalone instance, no KV transfer.
node2_cmd = (
    f". {VENV_PATH}/bin/activate && "
    f"CUDA_HOME=/usr/local/cuda-13.0 "
    f"HF_HUB_OFFLINE=1 TRANSFORMERS_OFFLINE=1 "
    f"vllm serve {MODEL_PATH} "
    f"--port {NODE2_PORT} "
    f"--gpu-memory-utilization 0.3 "
    f"--tensor-parallel-size 1 "
    f"> /tmp/vllm_node2.log 2>&1"
)

print("Starting vLLM Instance B on spark-02 via SSH...")
print(f"Command: ssh nvidia@{NODE2_HOST} 'vllm serve ... --port {NODE2_PORT} --gpu-memory-utilization 0.3'")

node2_proc = subprocess.Popen(
    ['ssh', f'nvidia@{NODE2_HOST}', node2_cmd],
    stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
    preexec_fn=os.setsid
)

print(f"Instance B started on spark-02 (local PID: {node2_proc.pid})")
print(f"Remote log: ssh nvidia@{NODE2_HOST} 'tail -f /tmp/vllm_node2.log'")

Starting vLLM Instance B on spark-02 via SSH...
Command: ssh nvidia@192.168.100.11 'vllm serve ... --port 8200 --gpu-memory-utilization 0.3'
Instance B started on spark-02 (local PID: 963918)
Remote log: ssh nvidia@192.168.100.11 'tail -f /tmp/vllm_node2.log'


## Step 6: Wait for Both Instances

Poll health endpoints until both vLLM instances respond. Startup is faster than the disaggregated case because there is no NIXL initialization.
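The polling loop is independent of HTTP: it is a retry-until-deadline pattern around a probe. A minimal sketch, with the probe passed in as a callable so it can be exercised without a running server; `poll_until` and `fake_probe` are hypothetical names.

```python
import time


def poll_until(probe, timeout=300.0, interval=10.0, sleep=time.sleep):
    """Call probe() until it returns True or the timeout elapses.

    Returns the number of attempts on success, or None on timeout.
    OSError from the probe counts as "not ready yet".
    """
    deadline = time.monotonic() + timeout
    attempts = 0
    while time.monotonic() < deadline:
        attempts += 1
        try:
            if probe():
                return attempts
        except OSError:
            pass  # connection refused, timed out, etc.
        sleep(interval)
    return None


# Fake probe that becomes healthy on the third call.
calls = {"n": 0}


def fake_probe():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionRefusedError("server still loading")
    return True


attempts = poll_until(fake_probe, timeout=5.0, interval=0.01)
print(f"ready after {attempts} attempts")
```

Using `time.monotonic()` for the deadline keeps the timeout correct even if the system clock is adjusted while the model loads.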

In [6]:
import urllib.request
import urllib.error

def wait_for_server(host, port, label, timeout=300, interval=10):
    """Poll a vLLM server's health endpoint until it responds."""
    url = f"http://{host}:{port}/health"
    start = time.time()

    while time.time() - start < timeout:
        try:
            req = urllib.request.Request(url, method='GET')
            with urllib.request.urlopen(req, timeout=5) as resp:
                if resp.status == 200:
                    elapsed = time.time() - start
                    print(f"  {label}: ready ({elapsed:.0f}s)")
                    return True
        except (urllib.error.URLError, ConnectionRefusedError, OSError):
            pass

        elapsed = time.time() - start
        print(f"  {label}: waiting... ({elapsed:.0f}s / {timeout}s)")
        time.sleep(interval)

    print(f"  {label}: TIMEOUT after {timeout}s")
    return False

print("Waiting for vLLM instances to load model...")
print("(Typically 2-4 minutes per instance)\n")

node1_ready = wait_for_server(NODE1_HOST, NODE1_PORT, "Instance A (spark-01)")
node2_ready = wait_for_server(NODE2_HOST, NODE2_PORT, "Instance B (spark-02)")

if node1_ready and node2_ready:
    print("\nBoth instances ready.")
else:
    print("\nOne or both instances failed to start.")
    print("Check logs:")
    print(f"  Instance A: tail -50 /tmp/vllm_node1.log")
    print(f"  Instance B: ssh nvidia@{NODE2_HOST} 'tail -50 /tmp/vllm_node2.log'")

Waiting for vLLM instances to load model...
(Typically 2-4 minutes per instance)

  Instance A (spark-01): ready (0s)
  Instance B (spark-02): ready (0s)

Both instances ready.


## Step 7: Start the Round-Robin Proxy (Controller)

The proxy distributes requests across both instances using round-robin. Compared to the disaggregated proxy in Notebook 04, this one is simpler: no `kv_transfer_params`, no prefill-then-decode pipeline, no NIXL coordination. Each request goes to one instance and that instance handles everything.

### Round-Robin vs Alternatives

| Strategy | How It Works | Tradeoff |
|----------|-------------|----------|
| Round-robin | Alternate between instances | Simple, fair for equal-length requests |
| Least-connections | Route to instance with fewest active requests | Better under variable load |
| Random | Random selection | Statistically equivalent to round-robin at scale |

Round-robin is appropriate here because all test requests have similar prompt lengths and max token counts. In production with variable request sizes, least-connections would be preferred.
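For contrast, least-connections can be sketched as a counter per backend: pick the backend with the fewest in-flight requests, and decrement the counter when a request finishes. This is an illustrative sketch and not what this notebook's proxy does; the `LeastConnections` class is hypothetical.

```python
class LeastConnections:
    """Track in-flight requests per backend and pick the least loaded."""

    def __init__(self, backends):
        self.active = {b: 0 for b in backends}

    def acquire(self):
        # min() returns the first backend on ties, so ordering is stable.
        backend = min(self.active, key=self.active.get)
        self.active[backend] += 1
        return backend

    def release(self, backend):
        self.active[backend] -= 1


lb = LeastConnections(["A", "B"])
first = lb.acquire()    # both idle, tie goes to A
second = lb.acquire()   # A is busy, so B
lb.release(second)      # B finishes early
third = lb.acquire()    # B is idle again; round-robin would have picked A
print(first, second, third)
```

The third request shows the difference: round-robin would return to A even though A is still busy, while least-connections follows the actual load.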

### Request Lifecycle: Single vs Batch

The following diagrams show how requests flow through the round-robin proxy.

**Single request:** The proxy forwards to whichever instance is next in the round-robin cycle. One GPU handles the full pipeline.

```
  Client                 Proxy                Instance A           Instance B
    │                      │                      │                      │
    │── POST /completions ─►│                      │                      │
    │                      │── forward request ───►│                      │
    │                      │                      │── prefill ──┐        │
    │                      │                      │             │        │
    │                      │                      │◄─ KV cache ─┘        │
    │                      │                      │── decode ───┐        │
    │                      │                      │  (token by  │        │
    │                      │                      │   token)    │        │
    │                      │                      │◄────────────┘        │
    │                      │◄── response ─────────│                      │
    │◄── completion ───────│                      │                      │
    │                      │                      │                      │
```

**Batch of 8 concurrent requests:** The proxy alternates between instances. Each GPU gets 4 requests and applies continuous batching internally.

```
  Client                 Proxy                Instance A           Instance B
    │                      │                      │                      │
    │── req 1 ────────────►│── forward ──────────►│                      │
    │── req 2 ────────────►│── forward ──────────────────────────────────►│
    │── req 3 ────────────►│── forward ──────────►│                      │
    │── req 4 ────────────►│── forward ──────────────────────────────────►│
    │── req 5 ────────────►│── forward ──────────►│                      │
    │── req 6 ────────────►│── forward ──────────────────────────────────►│
    │── req 7 ────────────►│── forward ──────────►│                      │
    │── req 8 ────────────►│── forward ──────────────────────────────────►│
    │                      │                      │                      │
    │                      │              ┌───────┴───────┐  ┌───────────┴──────┐
    │                      │              │ 4 requests    │  │ 4 requests       │
    │                      │              │ continuous    │  │ continuous       │
    │                      │              │ batching      │  │ batching         │
    │                      │              └───────┬───────┘  └───────────┬──────┘
    │                      │                      │                      │
    │                      │◄── responses ────────│                      │
    │                      │◄── responses ───────────────────────────────│
    │◄── 8 completions ────│                      │                      │
    │                      │                      │                      │
```

The expected result: under batch load, total throughput should approach 2x the single-node figure because each GPU processes half the batch independently. Per-request latency should stay close to the single-node figure, since each GPU sees only 4 concurrent requests instead of 8.
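That expectation can be written as a simple ceiling: with N independent replicas, ideal throughput is N times the single-node figure, and scaling efficiency is the measured value divided by that ceiling. A sketch using the 122.5 tok/s batch baseline printed in Step 1; `ideal_throughput` and `scaling_efficiency` are hypothetical helpers, and no measured replicated value is assumed here.

```python
def ideal_throughput(single_node_tps, replicas):
    """Upper bound if replicas scale perfectly and the proxy adds no cost."""
    return single_node_tps * replicas


def scaling_efficiency(measured_tps, single_node_tps, replicas):
    """Fraction of the ideal ceiling actually achieved (1.0 = perfect)."""
    return measured_tps / ideal_throughput(single_node_tps, replicas)


BASELINE_BATCH_TPS = 122.5  # batch (8 req) baseline from Notebook 01
ceiling = ideal_throughput(BASELINE_BATCH_TPS, replicas=2)
print(f"Ideal ceiling for 2 replicas: {ceiling:.1f} tok/s")
```

The ceiling is optimistic: each replica sees a batch of 4 rather than 8, so per-GPU batching efficiency may differ from the baseline run.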

In [7]:
# Round-robin proxy script.
# Distributes requests across two independent vLLM instances.
proxy_script = f'''#!/usr/bin/env python3
"""
Round-robin proxy for replicated vLLM instances.

Routes each completions request to the next backend in rotation.
No KV transfer, no disaggregation. Each backend runs the full
inference pipeline independently.
"""
import itertools
import logging

import httpx
import uvicorn
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse

logging.basicConfig(level=logging.INFO, format="%(asctime)s [proxy] %(message)s")
logger = logging.getLogger("replicated_proxy")

BACKENDS = [
    "http://{NODE1_LAN_HOST}:{NODE1_PORT}",
    "http://{NODE2_LAN_HOST}:{NODE2_PORT}",
]
TIMEOUT = httpx.Timeout(timeout=120.0)

# itertools.cycle produces an infinite round-robin iterator
backend_cycle = itertools.cycle(BACKENDS)

app = FastAPI()

@app.get("/health")
async def health():
    return {{"status": "ok"}}

@app.post("/v1/completions")
@app.post("/v1/chat/completions")
async def handle_request(request: Request):
    """Forward request to the next backend in round-robin order."""
    body = await request.json()
    path = request.url.path
    backend = next(backend_cycle)
    url = f"{{backend}}{{path}}"

    logger.info(f"Routing to {{backend}}")

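    async with httpx.AsyncClient() as client:
        resp = await client.post(url, json=body, timeout=TIMEOUT)

    # Relay the backend's status code rather than raising, so a backend
    # 4xx/5xx reaches the client instead of becoming a generic proxy 500.
    return JSONResponse(content=resp.json(), status_code=resp.status_code)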

if __name__ == "__main__":
    logger.info(f"Proxy listening on 0.0.0.0:{PROXY_PORT}")
    for b in BACKENDS:
        logger.info(f"  Backend: {{b}}")
    uvicorn.run(app, host="0.0.0.0", port={PROXY_PORT}, log_level="info")
'''

# Write proxy script locally, then copy to controller
local_proxy_path = Path("/tmp/replicated_proxy.py")
local_proxy_path.write_text(proxy_script)
print(f"Proxy script written to {local_proxy_path}")

# Copy to controller
result = subprocess.run(
    ['scp', str(local_proxy_path), f'nvidia@{CONTROLLER_HOST}:/tmp/replicated_proxy.py'],
    capture_output=True, text=True, timeout=10
)
if result.returncode == 0:
    print(f"Proxy script copied to controller ({CONTROLLER_HOST})")
else:
    print(f"SCP failed: {result.stderr.strip()}")
    raise RuntimeError("Cannot copy proxy script to controller")

# Start proxy on controller
proxy_cmd = (
    f". {VENV_PATH}/bin/activate && "
    f"nohup python /tmp/replicated_proxy.py > /tmp/replicated_proxy.log 2>&1 < /dev/null &"
)
result = subprocess.run(
    ['ssh', f'nvidia@{CONTROLLER_HOST}', proxy_cmd],
    capture_output=True, text=True, timeout=10
)
print(f"Proxy started on controller")
print(f"Log: ssh nvidia@{CONTROLLER_HOST} 'tail -f /tmp/replicated_proxy.log'")

# Wait for proxy to be ready
time.sleep(3)
proxy_ready = wait_for_server(CONTROLLER_HOST, PROXY_PORT, "Proxy (controller)", timeout=15, interval=2)

Proxy script written to /tmp/replicated_proxy.py
Proxy script copied to controller (192.168.1.75)
Proxy started on controller
Log: ssh nvidia@192.168.1.75 'tail -f /tmp/replicated_proxy.log'
  Proxy (controller): ready (0s)


## Step 8: Single Request Test

Send one request through the proxy. Latency should be close to the single-node baseline from Notebook 01, plus the cost of one extra HTTP hop through the proxy. Unlike disaggregated serving, there is no KV cache transfer overhead.

In [9]:
import urllib.request

def send_completion(host, port, prompt, max_tokens=100):
    """Send a completion request to a vLLM-compatible endpoint."""
    url = f"http://{host}:{port}/v1/completions"
    payload = json.dumps({
        "model": MODEL_PATH,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0.0
    }).encode('utf-8')

    req = urllib.request.Request(
        url, data=payload,
        headers={"Content-Type": "application/json"},
        method='POST'
    )

    start = time.time()
    with urllib.request.urlopen(req, timeout=120) as resp:
        result = json.loads(resp.read().decode('utf-8'))
    elapsed = time.time() - start

    return result, elapsed

# Single request test
test_prompt = "Explain how HTTP load balancers work in 3 sentences."
print(f"Prompt: '{test_prompt}'")
print(f"Sending to proxy at {CONTROLLER_HOST}:{PROXY_PORT}...\n")

result, elapsed = send_completion(CONTROLLER_HOST, PROXY_PORT, test_prompt, max_tokens=100)

# Extract metrics
choice = result['choices'][0]
output_text = choice['text']
usage = result.get('usage', {})
completion_tokens = usage.get('completion_tokens', len(output_text.split()))
latency_ms = elapsed * 1000
tokens_per_sec = completion_tokens / elapsed if elapsed > 0 else 0

print(f"Results (replicated, single request):")
print(f"  Latency:    {latency_ms:.1f} ms")
print(f"  Tokens:     {completion_tokens}")
print(f"  Throughput: {tokens_per_sec:.1f} tokens/sec")

# Compare with baseline
if baseline:
    baseline_latency = baseline['single_request']['latency_ms']
    baseline_tps = baseline['single_request']['throughput_tokens_per_sec']
    overhead_ms = latency_ms - baseline_latency
    overhead_pct = (overhead_ms / baseline_latency) * 100

    print(f"\nBaseline comparison (single node, 0.3 util):")
    print(f"  Baseline latency:    {baseline_latency:.1f} ms")
    print(f"  Replicated latency:  {latency_ms:.1f} ms")
    print(f"  Overhead:            {overhead_ms:.1f} ms ({overhead_pct:+.1f}%)")
    print(f"  Baseline throughput: {baseline_tps:.1f} tok/s")
    print(f"  Replicated:          {tokens_per_sec:.1f} tok/s")

print(f"\nOutput:\n{output_text.strip()}")

Prompt: 'Explain how HTTP load balancers work in 3 sentences.'
Sending to proxy at 192.168.1.75:8192...

Results (replicated, single request):
  Latency:    7303.8 ms
  Tokens:     100
  Throughput: 13.7 tokens/sec

Baseline comparison (single node, 0.3 util):
  Baseline latency:    6874.1 ms
  Replicated latency:  7303.8 ms
  Overhead:            429.7 ms (+6.3%)
  Baseline throughput: 14.5 tok/s
  Replicated:          13.7 tok/s

Output:
An HTTP load balancer distributes incoming HTTP traffic across multiple servers to improve responsiveness, reliability, and scalability. It does this by routing each incoming request to the server that is best suited to handle it, based on factors such as server load, response time, and availability. By distributing the load across multiple servers, an HTTP load balancer helps to prevent any one server from becoming overwhelmed and ensures that users can access the application or service without interruption.
What is the primary function of an HTTP l

## Step 9: Batch Request Test

Send 8 concurrent requests through the round-robin proxy. The proxy alternates between instances, so each GPU handles 4 requests. This is where replication shows its value: two GPUs processing requests in parallel, each using continuous batching independently.

### Expected Behavior

With round-robin across 2 identical instances:
- Each instance gets 4 of the 8 requests
- Each instance applies continuous batching to its 4 requests
- Total throughput should approach 2x the single-node baseline, since the two instances decode in parallel
- Per-request latency should be no worse than the single-node batch latency, since each GPU carries only half the batch
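
The expected 4/4 split follows from how round-robin assignment works. The sketch below is a minimal illustration, not the proxy's actual code: `BACKENDS` and `assign_round_robin` are hypothetical names, and the backend addresses are taken from the diagram at the top of this notebook.

```python
from collections import Counter
from itertools import cycle

# Hypothetical backend list; addresses mirror the architecture diagram.
BACKENDS = ["spark-01:8100", "spark-02:8200"]

def assign_round_robin(num_requests, backends):
    """Return the backend chosen for each request, in arrival order."""
    rotation = cycle(backends)
    return [next(rotation) for _ in range(num_requests)]

assignments = assign_round_robin(8, BACKENDS)
per_backend = Counter(assignments)
for backend, count in per_backend.items():
    print(f"{backend}: {count} requests")
```

With concurrent clients, the order in which requests reach the proxy is not deterministic, so which prompt lands on which instance can vary from run to run. The per-instance count stays at 4 as long as the proxy advances its rotation once per request.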

In [10]:
from concurrent.futures import ThreadPoolExecutor, as_completed

test_prompts = [
    "What is a REST API?",
    "Explain database indexing.",
    "How does DNS work?",
    "What are microservices?",
    "Describe container orchestration.",
    "What is continuous integration?",
    "Explain message queues.",
    "How does caching improve performance?"
]

batch_size = len(test_prompts)
print(f"Sending {batch_size} concurrent requests (round-robin across 2 instances)...\n")

batch_start = time.time()
results = []

with ThreadPoolExecutor(max_workers=batch_size) as executor:
    futures = {
        executor.submit(send_completion, CONTROLLER_HOST, PROXY_PORT, p, 100): p
        for p in test_prompts
    }
    for future in as_completed(futures):
        prompt = futures[future]
        try:
            result, elapsed = future.result()
            tokens = result.get('usage', {}).get('completion_tokens', 0)
            results.append({'prompt': prompt, 'latency_s': elapsed, 'tokens': tokens})
        except Exception as e:
            print(f"  FAILED: {prompt[:40]}... ({e})")
            results.append({'prompt': prompt, 'latency_s': 0, 'tokens': 0, 'error': str(e)})

batch_elapsed = time.time() - batch_start

# Calculate metrics
successful = [r for r in results if 'error' not in r]
total_tokens = sum(r['tokens'] for r in successful)
avg_latency_ms = (sum(r['latency_s'] for r in successful) / len(successful)) * 1000 if successful else 0
batch_throughput = total_tokens / batch_elapsed if batch_elapsed > 0 else 0

print(f"Batch Results (replicated, round-robin):")
print(f"  Requests:       {len(successful)}/{batch_size} successful")
print(f"  Total time:     {batch_elapsed:.2f} s")
print(f"  Total tokens:   {total_tokens}")
print(f"  Throughput:     {batch_throughput:.1f} tokens/sec")
print(f"  Avg latency:    {avg_latency_ms:.1f} ms")

# Compare with baseline
if baseline:
    baseline_batch_tps = baseline['batch_processing']['throughput_tokens_per_sec']
    baseline_batch_lat = baseline['batch_processing']['avg_latency_ms']
    tps_ratio = batch_throughput / baseline_batch_tps if baseline_batch_tps > 0 else 0

    print(f"\nBaseline comparison (single node, batch of {batch_size}):")
    print(f"  Baseline throughput:    {baseline_batch_tps:.1f} tok/s")
    print(f"  Replicated throughput:  {batch_throughput:.1f} tok/s ({tps_ratio:.2f}x)")
    print(f"  Baseline avg latency:   {baseline_batch_lat:.1f} ms")
    print(f"  Replicated avg latency: {avg_latency_ms:.1f} ms")

Sending 8 concurrent requests (round-robin across 2 instances)...

Batch Results (replicated, round-robin):
  Requests:       8/8 successful
  Total time:     30.21 s
  Total tokens:   800
  Throughput:     26.5 tokens/sec
  Avg latency:    18632.5 ms

Baseline comparison (single node, batch of 8):
  Baseline throughput:    122.5 tok/s
  Replicated throughput:  26.5 tok/s (0.22x)
  Baseline avg latency:   816.6 ms
  Replicated avg latency: 18632.5 ms
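
These numbers fall far short of the expected ~2x, so a quick sanity check is useful. The sketch below does two things with the values printed above: it estimates how many requests were in flight on average (Little's law: sum of latencies divided by wall time), and it predicts what the run would look like if requests were served in fixed-size waves. Both `effective_concurrency` and `predict_waves` are hypothetical helpers, not part of the notebook, and the wave model is a deliberately crude assumption.

```python
import math

def effective_concurrency(avg_latency_s, num_requests, wall_time_s):
    """Average number of in-flight requests: sum of latencies / wall time."""
    if wall_time_s <= 0:
        return 0.0
    return (avg_latency_s * num_requests) / wall_time_s

def predict_waves(num_requests, parallel_slots, service_time_s):
    """Predict (wall time, mean latency) if requests run in fixed-size waves."""
    waves = math.ceil(num_requests / parallel_slots)
    # Request i finishes at the end of its wave.
    finish_times = [
        (i // parallel_slots + 1) * service_time_s for i in range(num_requests)
    ]
    return waves * service_time_s, sum(finish_times) / num_requests

# Values copied from the outputs above (batch run and single-request latency).
concurrency = effective_concurrency(18.6325, 8, 30.21)
wall_s, mean_latency_s = predict_waves(8, 2, 7.3038)

print(f"Effective concurrency: {concurrency:.2f} of 8 requests")
print(f"Wave model (2 at a time): {wall_s:.1f} s wall, {mean_latency_s:.1f} s mean latency")
```

The wave model gives roughly 29.2 s wall time and 18.3 s mean latency, close to the measured 30.21 s and 18.6 s. That is consistent with each instance serving one request at a time rather than batching its four, but it is circumstantial; the proxy and vLLM logs would be needed to confirm where the serialization happens.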


## Step 10: Save Metrics

Save replicated serving metrics for comparison in Notebook 04 (disaggregated serving).

In [11]:
from datetime import datetime

replicated_metrics = {
    "timestamp": datetime.now().isoformat(),
    "model": MODEL_NAME,
    "config": {
        "num_instances": 2,
        "gpu_memory_utilization": 0.3,
        "routing": "round-robin",
        "kv_transfer": "none"
    },
    "single_request": {
        "latency_ms": latency_ms,
        "tokens": completion_tokens,
        "throughput_tokens_per_sec": tokens_per_sec
    },
    "batch_processing": {
        "batch_size": batch_size,
        "total_tokens": total_tokens,
        "throughput_tokens_per_sec": batch_throughput,
        "avg_latency_ms": avg_latency_ms
    }
}

# Save metrics
metrics_file = Path("replicated_metrics.json")
with open(metrics_file, 'w') as f:
    json.dump(replicated_metrics, f, indent=2)

print("="*60)
print("REPLICATED SERVING PERFORMANCE SUMMARY")
print("="*60)
print(f"\nSingle Request:")
print(f"  Latency:    {latency_ms:.1f} ms")
print(f"  Throughput: {tokens_per_sec:.1f} tokens/sec")
print(f"\nBatch Processing ({batch_size} requests, round-robin):")
print(f"  Throughput: {batch_throughput:.1f} tokens/sec")
print(f"  Avg Latency: {avg_latency_ms:.1f} ms")

if baseline:
    print(f"\nComparison with single-node baseline (0.3 util):")
    print(f"  Batch throughput: {baseline['batch_processing']['throughput_tokens_per_sec']:.1f} tok/s (baseline)")
    print(f"                    {batch_throughput:.1f} tok/s (replicated, {tps_ratio:.2f}x)")

print(f"\nMetrics saved to: {metrics_file}")
print(f"\nNotebook 04 will compare disaggregated serving against these numbers.")

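============================================================
REPLICATED SERVING PERFORMANCE SUMMARY
============================================================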

Single Request:
  Latency:    7303.8 ms
  Throughput: 13.7 tokens/sec

Batch Processing (8 requests, round-robin):
  Throughput: 26.5 tokens/sec
  Avg Latency: 18632.5 ms

Comparison with single-node baseline (0.3 util):
  Batch throughput: 122.5 tok/s (baseline)
                    26.5 tok/s (replicated, 0.22x)

Metrics saved to: replicated_metrics.json

Notebook 04 will compare disaggregated serving against these numbers.
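
A later notebook can read the saved file back and compute ratios against it. The sketch below is an assumption about how that might look, not code from Notebook 04: `load_replicated_metrics` and `throughput_ratio` are hypothetical helpers, and the demo writes a small sample file (using values printed above) to a temporary directory so that it runs on its own.

```python
import json
import tempfile
from pathlib import Path

def load_replicated_metrics(path):
    """Load the metrics JSON written by this notebook."""
    with open(path) as f:
        return json.load(f)

def throughput_ratio(candidate_tps, reference_tps):
    """Return candidate/reference throughput, or None if the reference is not positive."""
    if reference_tps <= 0:
        return None
    return candidate_tps / reference_tps

# Demo only: a sample file with the same keys as replicated_metrics["batch_processing"].
sample = {
    "batch_processing": {
        "batch_size": 8,
        "total_tokens": 800,
        "throughput_tokens_per_sec": 26.5,
        "avg_latency_ms": 18632.5,
    }
}
with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "replicated_metrics.json"
    sample_path.write_text(json.dumps(sample, indent=2))
    loaded = load_replicated_metrics(sample_path)

replicated_tps = loaded["batch_processing"]["throughput_tokens_per_sec"]
# Reference value is the single-node batch baseline printed above.
ratio = throughput_ratio(replicated_tps, 122.5)
print(f"Replicated vs baseline: {ratio:.2f}x")
```

Returning `None` for a non-positive reference keeps a missing or zero baseline from being reported as a `0.00x` result, which could be mistaken for a real measurement.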


## Step 11: Cleanup

Stop both vLLM instances and the proxy.

In [12]:
import signal

def cleanup():
    """Stop all processes started by this notebook."""
    print("Stopping services...\n")

    # Stop proxy on controller
    try:
        subprocess.run(
            ['ssh', f'nvidia@{CONTROLLER_HOST}',
             'pkill -f replicated_proxy.py || true'],
            capture_output=True, timeout=10
        )
        print("  Proxy (controller): stopped")
    except Exception as e:
        print(f"  Proxy (controller): manual cleanup needed ({e})")

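    # Stop Instance A on spark-01 (local) by signalling its process group
    try:
        os.killpg(os.getpgid(node1_proc.pid), signal.SIGTERM)
        print("  Instance A (spark-01): stopped")
    except ProcessLookupError:
        print("  Instance A (spark-01): already stopped")
    except (NameError, OSError) as e:
        # node1_proc is undefined (e.g. kernel restarted) or the signal was refused
        print(f"  Instance A (spark-01): manual cleanup needed ({e})")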

    # Stop Instance B on spark-02
    try:
        subprocess.run(
            ['ssh', f'nvidia@{NODE2_HOST}',
             'pkill -f "vllm serve" || true'],
            capture_output=True, timeout=10
        )
        print("  Instance B (spark-02): stopped")
    except Exception as e:
        print(f"  Instance B (spark-02): manual cleanup needed ({e})")

    print("\nAll services stopped.")

cleanup()

Stopping services...

  Proxy (controller): stopped
  Instance A (spark-01): stopped
  Instance B (spark-02): stopped

All services stopped.
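
The messages above report that a stop command was issued, not that the ports are free: `pkill ... || true` always exits successfully, and SIGTERM is asynchronous. A follow-up check could probe each port directly. The sketch below is one way to do that under stated assumptions: `port_is_open` and `report_endpoints` are hypothetical helpers, and `ENDPOINTS` uses the hostnames and ports from the architecture diagram rather than the notebook's own variables.

```python
import socket

def port_is_open(host, port, timeout_s=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False

# Hypothetical endpoint list; hostnames and ports mirror the architecture diagram.
ENDPOINTS = [("controller", 8192), ("spark-01", 8100), ("spark-02", 8200)]

def report_endpoints(endpoints):
    """Print whether each endpoint still accepts connections."""
    for host, port in endpoints:
        state = "STILL LISTENING" if port_is_open(host, port) else "closed"
        print(f"  {host}:{port}: {state}")

# Self-check against a throwaway local listener (no cluster needed).
with socket.socket() as listener:
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    local_port = listener.getsockname()[1]
    open_while_listening = port_is_open("127.0.0.1", local_port)
open_after_close = port_is_open("127.0.0.1", local_port)
print(f"While listening: {open_while_listening}, after close: {open_after_close}")
```

In this notebook, `report_endpoints(ENDPOINTS)` would be called a few seconds after `cleanup()`, with the hostnames replaced by `CONTROLLER_HOST`, the spark-01 address, and `NODE2_HOST`. A port that still shows as listening points to a process that ignored SIGTERM or was never matched by the `pkill` pattern.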


## Key Takeaways

**What we built:**
- Two independent vLLM instances on spark-01 and spark-02 (0.3 `gpu-memory-utilization` each)
- Round-robin proxy on the controller distributing requests across both instances
- No KV cache transfer, no NIXL, no coordination between GPUs

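**What we measured:**
- Single-request latency: 7303.8 ms through the proxy vs 6874.1 ms for the single-node baseline (+429.7 ms, +6.3%), so proxy overhead is small
- Batch throughput: 26.5 tok/s vs 122.5 tok/s for the single-node baseline (0.22x), far below the ~2x expected from two GPUs processing in parallel
- Average batch latency: 18632.5 ms vs 816.6 ms for the baseline. A gap this large suggests the 8 requests were not batched concurrently on each instance, so the proxy and the baseline's batch settings should be checked before these numbers are used as the comparison point for Notebook 04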

**Why this is the fair comparison for disaggregation:**
- Same hardware: 2 GPUs across 2 nodes
- Same memory budget: 0.3 utilization per GPU
- Different architecture: replicated (each GPU does everything) vs disaggregated (one prefills, one decodes)

**Replication vs Disaggregation tradeoffs:**

| Dimension | Replicated (this notebook) | Disaggregated (Notebook 04) |
|-----------|---------------------------|----------------------------|
| Architecture | Each instance is independent | Pipeline: prefill on one, decode on another |
| KV transfer | None | GPU-to-GPU via NIXL/RDMA |
| Single-request latency | Lower (no transfer hop) | Higher (NIXL transfer overhead) |
| Batch throughput | Expected ~2x single-node (parallel replicas); measured 0.22x in this run | Depends on prefill/decode overlap |
| Complexity | Simple round-robin | Proxy + kv_transfer_params + NIXL side-channel |
| Scaling model | Add more replicas | Independent prefill/decode scaling |

**When replication wins:**
- Uniform request patterns (similar prompt and output lengths)
- Low concurrency where pipelining prefill and decode across nodes has no advantage
- Simplicity is a priority

**When disaggregation wins:**
- High concurrency where prefill and decode overlap matters
- Asymmetric workloads (long prefills, short decodes, or vice versa)
- Independent scaling of compute-bound (prefill) and memory-bound (decode) phases
- Production systems with KV-aware routing (cache reuse across requests)

**What's next:**
- [04_Disaggregated_Serving.ipynb](04_Disaggregated_Serving.ipynb): Split prefill/decode across nodes with NixlConnector and compare against these replicated numbers