# Environment Setup for Disaggregated Serving

Before building a disaggregated inference system, we need a working baseline environment. This notebook sets up two DGX Spark nodes for distributed LLM inference.

## What We're Setting Up

- **Node 1 (spark-01)**: Primary inference node (prefill)
- **Node 2 (spark-02)**: Secondary inference node (decode)
- **Network**: RDMA-capable RoCE link between nodes
- **Software**: PyTorch, vLLM, NIXL, monitoring tools

## Why This Matters

Disaggregated serving splits the inference pipeline across multiple machines. Before we optimize with RDMA and cache-aware routing, we need working compute and networking.

## Step 1: Verify Node Configuration

Check that the local node reports the expected hostname and GPU. This cell inspects only the node it runs on; spark-02 is queried over SSH in later steps.

In [1]:
import subprocess
import socket

def get_hostname():
    return socket.gethostname()

def get_gpu_info():
    """Get GPU count and model from nvidia-smi"""
    try:
        result = subprocess.run(
            ['nvidia-smi', '--query-gpu=name,count', '--format=csv,noheader'],
            capture_output=True,
            text=True,
            check=True
        )
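        # Drop empty lines so that blank output is not counted as one GPU
        gpu_info = [line for line in result.stdout.strip().split('\n') if line]
        if not gpu_info:
            return 0, "Unknown"
        return len(gpu_info), gpu_info[0].split(',')[0].strip()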
    except Exception as e:
        return 0, str(e)

hostname = get_hostname()
gpu_count, gpu_model = get_gpu_info()

print(f"Hostname: {hostname}")
print(f"GPUs: {gpu_count}x {gpu_model}")

Hostname: spark-01
GPUs: 1x NVIDIA GB10


## Step 2: Check RDMA Network (Both Nodes)

Verify RDMA-capable interfaces are present and active on both nodes. NIXL uses these for GPU-to-GPU KV cache transfer.

In [3]:
# Node IPs - used by this cell and subsequent cells
NODE1_IP = "192.168.100.10"  # spark-01
NODE2_IP = "192.168.100.11"  # spark-02

def check_ib_devices():
    """List InfiniBand/RoCE devices using ibstat."""
    try:
        result = subprocess.run(
            ['ibstat', '-l'],
            capture_output=True, text=True, check=True
        )
        devices = result.stdout.strip().split('\n')
        return [d for d in devices if d]
    except subprocess.CalledProcessError:
        return []
    except FileNotFoundError:
        return []

def check_ib_link_state(device):
    """Check if an RDMA device port is active."""
    try:
        result = subprocess.run(
            ['ibstat', device],
            capture_output=True, text=True, check=True
        )
        return "State: Active" in result.stdout
    except Exception:
        return False

def check_remote_ib_devices(remote_ip):
    """List RDMA devices and their link state on a remote node via SSH."""
    try:
        result = subprocess.run(
            ['ssh', '-o', 'ConnectTimeout=5', f'nvidia@{remote_ip}',
             'ibstat -l 2>/dev/null && echo "---" && for d in $(ibstat -l 2>/dev/null); do echo "$d: $(ibstat $d 2>/dev/null | grep -c \"State: Active\") active ports"; done'],
            capture_output=True, text=True, timeout=15
        )
        if result.returncode != 0:
            return [], "ibstat not available"

        lines = result.stdout.strip().split('\n')
        separator = lines.index('---') if '---' in lines else len(lines)
        devices = [d for d in lines[:separator] if d]
        states = lines[separator+1:] if separator < len(lines) else []
        return devices, states
    except Exception as e:
        return [], str(e)

# --- Check spark-01 (local) ---
print(f"spark-01 ({NODE1_IP}) - RDMA devices:\n")
ib_devices = check_ib_devices()
active_local = 0
for device in ib_devices:
    active = check_ib_link_state(device)
    state = "Active" if active else "Down"
    status = "✓" if active else "✗"
    print(f"  {status} {device}: {state}")
    if active:
        active_local += 1

if not ib_devices:
    print("  ✗ No RDMA devices found (ibstat not available or no devices)")

# --- Check spark-02 (remote) ---
print(f"\nspark-02 ({NODE2_IP}) - RDMA devices:\n")
remote_devices, remote_states = check_remote_ib_devices(NODE2_IP)
active_remote = 0

if isinstance(remote_states, str):
    # Error case
    print(f"  ✗ {remote_states}")
elif remote_devices:
    for state_line in remote_states:
        if state_line.strip():
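            # Parse the port count so that e.g. "10 active ports" is not misread as zero
            count_text = state_line.split(':', 1)[-1].split()
            has_active = bool(count_text) and count_text[0].isdigit() and int(count_text[0]) > 0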
            status = "✓" if has_active else "✗"
            dev_name = state_line.split(':')[0].strip()
            state_text = "Active" if has_active else "Down"
            print(f"  {status} {dev_name}: {state_text}")
            if has_active:
                active_remote += 1
else:
    print("  ✗ No RDMA devices found")

# Summary
print()
if active_local > 0 and active_remote > 0:
    print(f"✓ Both nodes have active RDMA interfaces ({active_local} on spark-01, {active_remote} on spark-02).")
elif active_local > 0:
    print(f"✗ spark-02 has no active RDMA interfaces. NIXL requires RDMA on both nodes.")
elif active_remote > 0:
    print(f"✗ spark-01 has no active RDMA interfaces. NIXL requires RDMA on both nodes.")
else:
    print("✗ Neither node has active RDMA interfaces. KV cache transfer will fall back to TCP.")

spark-01 (192.168.100.10) - RDMA devices:

  ✓ roceP2p1s0f0: Active
  ✓ roceP2p1s0f1: Active
  ✓ rocep1s0f0: Active
  ✓ rocep1s0f1: Active

spark-02 (192.168.100.11) - RDMA devices:

  ✓ roceP2p1s0f0: Active
  ✓ roceP2p1s0f1: Active
  ✓ rocep1s0f0: Active
  ✓ rocep1s0f1: Active

✓ Both nodes have active RDMA interfaces (4 on spark-01, 4 on spark-02).
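
If `ibstat` is not installed, the same port state can usually be read from sysfs. The sketch below assumes the standard Linux RDMA layout (`/sys/class/infiniband/<device>/ports/<port>/state` and `link_layer`); the function name `read_rdma_ports` is hypothetical and not part of the notebook.

```python
from pathlib import Path

def read_rdma_ports(sysfs_root="/sys/class/infiniband"):
    """Return {device: [(port, state, link_layer), ...]} read from sysfs."""
    root = Path(sysfs_root)
    ports = {}
    if not root.is_dir():
        return ports  # no RDMA stack loaded, or not a Linux host
    for dev in sorted(root.iterdir()):
        ports_dir = dev / "ports"
        if not ports_dir.is_dir():
            continue
        entries = []
        for port in sorted(ports_dir.iterdir()):
            state_file = port / "state"
            layer_file = port / "link_layer"
            if not state_file.is_file():
                continue
            # The state file looks like "4: ACTIVE"; keep the name after the colon
            state = state_file.read_text().strip().split(":", 1)[-1].strip()
            layer = layer_file.read_text().strip() if layer_file.is_file() else "unknown"
            entries.append((port.name, state, layer))
        ports[dev.name] = entries
    return ports

for dev, entries in read_rdma_ports().items():
    for port, state, layer in entries:
        print(f"{dev} port {port}: {state} ({layer})")
```

On a RoCE setup like this one, the link layer would be expected to read `Ethernet` rather than `InfiniBand`.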


## Step 3: Test Network Connectivity

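Ping the other node to verify basic IP connectivity. ICMP ping goes through the regular kernel network stack, so it confirms reachability but says nothing about the RDMA path. If your addresses differ, update `NODE1_IP` and `NODE2_IP` in Step 2.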

In [4]:
# NODE1_IP and NODE2_IP are defined in Step 2

def ping_host(ip_address, count=4):
    """Test network connectivity to remote host"""
    try:
        result = subprocess.run(
            ['ping', '-c', str(count), ip_address],
            capture_output=True,
            text=True,
            timeout=10
        )
        # Extract average latency from ping output
        for line in result.stdout.split('\n'):
            if 'avg' in line:
                # Format: min/avg/max/mdev = 0.123/0.456/0.789/0.012 ms
                avg_latency = line.split('=')[1].strip().split('/')[1]
                return True, f"{avg_latency} ms"
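        # No summary line found: report success without a latency, or a real failure reason
        if result.returncode == 0:
            return True, "n/a"
        return False, result.stderr.strip() or f"ping exited with code {result.returncode}"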
    except Exception as e:
        return False, str(e)

current_node = get_hostname()
remote_ip = NODE2_IP if "01" in current_node else NODE1_IP

print(f"Testing connectivity to remote node ({remote_ip})...")
success, latency = ping_host(remote_ip)

if success:
    print(f"✓ Remote node reachable (avg latency: {latency})")
else:
    print(f"✗ Cannot reach remote node: {latency}")
    print("  Check network configuration and IP addresses")

Testing connectivity to remote node (192.168.100.11)...
✓ Remote node reachable (avg latency: 1.253 ms)
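
The string splitting in `ping_host` depends on the exact position of `=` and `/` in the summary line. A regular expression is a slightly sturdier way to pull out all four values. This is a sketch that assumes the Linux iputils summary format quoted in the code comment above; `parse_ping_rtt` is a hypothetical helper name.

```python
import re

# Matches the summary line, e.g.
# "rtt min/avg/max/mdev = 0.123/0.456/0.789/0.012 ms"
_RTT_RE = re.compile(r"=\s*([\d.]+)/([\d.]+)/([\d.]+)/([\d.]+)\s*ms")

def parse_ping_rtt(ping_stdout):
    """Return (min, avg, max, mdev) in milliseconds, or None if no summary line is present."""
    match = _RTT_RE.search(ping_stdout)
    if match is None:
        return None
    return tuple(float(value) for value in match.groups())

# Sample line copied from the format comment in ping_host, not from a real run
sample = "rtt min/avg/max/mdev = 0.123/0.456/0.789/0.012 ms"
print(parse_ping_rtt(sample))
```

Returning `None` when the summary is absent lets the caller distinguish "reachable but unparsed" from "unreachable" using the process return code.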


## Step 4: Verify Core Dependencies (Both Nodes)

Both nodes need PyTorch, vLLM, and Transformers installed. Prefill runs on spark-01, decode runs on spark-02, and both load the same model. This cell checks spark-01 (local) and then verifies spark-02 via SSH.

In [5]:
import os

# Virtual environment path (same on both nodes)
VENV_PATH = os.path.expanduser("~/src/github.com/elizabetht/spark/.venv")
VENV_PYTHON = f"{VENV_PATH}/bin/python"

def check_package(package_name):
    """Check if a Python package is installed locally."""
    try:
        __import__(package_name)
        return True
    except ImportError:
        return False

def check_remote_packages(remote_ip, venv_path):
    """Check if required packages are installed on a remote node via SSH."""
    # Use the venv python directly to avoid shell quoting issues with 'source'
    venv_python = f"{venv_path}/bin/python"
    check_cmd = (
        f'{venv_python} -c "'
        f'import torch; import vllm; import transformers; '
        f'print(f\\"torch={{torch.__version__}}\\"); '
        f'print(f\\"vllm={{vllm.__version__}}\\"); '
        f'print(f\\"transformers={{transformers.__version__}}\\")'
        f'"'
    )
    try:
        result = subprocess.run(
            ['ssh', '-o', 'ConnectTimeout=5', f'nvidia@{remote_ip}', check_cmd],
            capture_output=True, text=True, timeout=30
        )
        if result.returncode == 0:
            return True, result.stdout.strip()
        else:
            # Try to identify which package failed
            return False, result.stderr.strip()
    except Exception as e:
        return False, str(e)

packages = {
    'torch': 'PyTorch (deep learning framework)',
    'vllm': 'vLLM (high-performance LLM serving)',
    'transformers': 'HuggingFace Transformers (model loading)',
}

# --- Check spark-01 (local) ---
print(f"spark-01 ({NODE1_IP}) - local packages:\n")
missing_local = []
for pkg, description in packages.items():
    installed = check_package(pkg)
    status = "✓" if installed else "✗"
    print(f"  {status} {pkg}: {description}")
    if not installed:
        missing_local.append(pkg)

if missing_local:
    print(f"\n  Missing: {', '.join(missing_local)}")
else:
    print("\n  All packages installed.")

# --- Check spark-02 (remote) ---
print(f"\nspark-02 ({NODE2_IP}) - remote packages:\n")
remote_ok, remote_detail = check_remote_packages(NODE2_IP, VENV_PATH)

if remote_ok:
    for line in remote_detail.split('\n'):
        print(f"  ✓ {line}")
    print("\n  All packages installed.")
else:
    print(f"  ✗ Package check failed")
    print(f"  Error: {remote_detail[:200]}")

# --- Print install commands if anything is missing ---
if missing_local or not remote_ok:
    print("\n" + "=" * 60)
    print("INSTALL COMMANDS")
    print("=" * 60)
    print(f"\nActivate the virtual environment first:")
    print(f"  source {VENV_PATH}/bin/activate")
    print(f"\nThen run:")
    print(f"  pip install transformers")
    print(f"  pip install torch==2.9.0 torchvision==0.24.0 torchaudio==2.9.0 --index-url https://download.pytorch.org/whl/cu130")
    print(f"  pip install vllm==0.13.0 --extra-index-url https://wheels.vllm.ai/0.13.0/cu130")
    if missing_local and not remote_ok:
        print(f"\nRun these commands on BOTH spark-01 and spark-02.")
    elif missing_local:
        print(f"\nRun these commands on spark-01.")
    else:
        print(f"\nRun these commands on spark-02:")
        print(f"  ssh nvidia@{NODE2_IP}")
        print(f"  source {VENV_PATH}/bin/activate")
        print(f"  # Then run the pip install commands above")
else:
    print("\n✓ Both nodes have all required packages.")

spark-01 (192.168.100.10) - local packages:

  ✓ torch: PyTorch (deep learning framework)
  ✓ vllm: vLLM (high-performance LLM serving)
  ✓ transformers: HuggingFace Transformers (model loading)

  All packages installed.

spark-02 (192.168.100.11) - remote packages:

  ✓ torch=2.9.1+cu130
  ✓ vllm=0.13.0
  ✓ transformers=4.57.6

  All packages installed.

✓ Both nodes have all required packages.
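
`check_package` imports each library, which is slow for large packages and can emit import-time warnings into the cell output. A lighter alternative, sketched here with the standard library only, looks up the module spec and the installed distribution version without importing. `probe_package` is a hypothetical name, and the sketch assumes the import name matches the distribution name unless `dist_name` is given.

```python
from importlib import metadata
from importlib.util import find_spec

def probe_package(import_name, dist_name=None):
    """Return (installed, version) without importing the package."""
    if find_spec(import_name) is None:
        return False, None
    try:
        return True, metadata.version(dist_name or import_name)
    except metadata.PackageNotFoundError:
        # Importable, but no distribution metadata under this name
        return True, None

for name in ("torch", "vllm", "transformers"):
    installed, version = probe_package(name)
    print(f"{name}: installed={installed}, version={version}")
```

This only shows that the module can be located. It does not prove that the import succeeds, so the full import in `check_package` remains the stronger test.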


## Step 5: Verify PyTorch GPU Access (Both Nodes)

Confirm PyTorch can see and use the GPUs on both nodes. Both nodes need working GPU compute for disaggregated serving.

In [6]:
# Set CUDA environment variables (inherited by subprocesses launched from this kernel)
import os

# CUDA 13.0 paths for DGX Spark
os.environ['CUDA_HOME'] = '/usr/local/cuda-13.0'
cuda_lib_path = '/usr/local/cuda-13.0/lib64'

if 'LD_LIBRARY_PATH' in os.environ:
    os.environ['LD_LIBRARY_PATH'] = f"{cuda_lib_path}:{os.environ['LD_LIBRARY_PATH']}"
else:
    os.environ['LD_LIBRARY_PATH'] = cuda_lib_path

print(f"CUDA_HOME: {os.environ['CUDA_HOME']}")
print(f"LD_LIBRARY_PATH: {os.environ['LD_LIBRARY_PATH']}")
print("\n✓ CUDA 13.0 environment configured")

CUDA_HOME: /usr/local/cuda-13.0
LD_LIBRARY_PATH: /usr/local/cuda-13.0/lib64

✓ CUDA 13.0 environment configured
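
On Linux the dynamic loader typically reads `LD_LIBRARY_PATH` when a process starts, so changing it inside a running kernel mainly affects child processes. When launching a worker from the notebook, it can be clearer to build the environment explicitly and pass it along. A minimal sketch, with `with_prepended_path` as a hypothetical helper:

```python
import os

def with_prepended_path(env, key, directory):
    """Return a copy of env with directory placed first in the path-list variable key."""
    updated = dict(env)
    # Drop empty entries and any existing copy of directory to avoid duplicates
    parts = [p for p in updated.get(key, "").split(os.pathsep) if p and p != directory]
    updated[key] = os.pathsep.join([directory, *parts])
    return updated

child_env = with_prepended_path(os.environ, "LD_LIBRARY_PATH", "/usr/local/cuda-13.0/lib64")
print(child_env["LD_LIBRARY_PATH"])
# Pass it explicitly when starting a worker, e.g. subprocess.run([...], env=child_env)
```

Unlike the cell above, this does not grow the variable each time it is re-run, because an existing copy of the directory is removed before prepending.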


In [7]:
import torch

# --- Check spark-01 (local) ---
print(f"spark-01 ({NODE1_IP}) - PyTorch GPU check:\n")

cuda_available = torch.cuda.is_available()
print(f"  CUDA available: {cuda_available}")

if cuda_available:
    gpu_count = torch.cuda.device_count()
    print(f"  GPUs visible to PyTorch: {gpu_count}\n")
    
    for i in range(gpu_count):
        name = torch.cuda.get_device_name(i)
        memory_gb = torch.cuda.get_device_properties(i).total_memory / 1e9
        print(f"  GPU {i}: {name} ({memory_gb:.1f} GB)")
    
    print("\n  Testing GPU compute...")
    x = torch.randn(1000, 1000, device='cuda')
    y = torch.matmul(x, x)
    torch.cuda.synchronize()
    print("  ✓ GPU compute working")
else:
    print("  ✗ CUDA not available")
    print("\n  Restart the kernel and run cells in order.")

# --- Check spark-02 (remote) ---
print(f"\nspark-02 ({NODE2_IP}) - PyTorch GPU check:\n")

venv_python = f"{VENV_PATH}/bin/python"
gpu_check_cmd = (
    f'{venv_python} -c "'
    f'import torch; '
    f'print(f\\"CUDA available: {{torch.cuda.is_available()}}\\"); '
    f'print(f\\"GPU count: {{torch.cuda.device_count()}}\\"); '
    f'[print(f\\"GPU {{i}}: {{torch.cuda.get_device_name(i)}} ({{torch.cuda.get_device_properties(i).total_memory / 1e9:.1f}} GB)\\") for i in range(torch.cuda.device_count())]; '
    f'x = torch.randn(100, 100, device=\\"cuda\\"); '
    f'y = torch.matmul(x, x); '
    f'torch.cuda.synchronize(); '
    f'print(\\"Compute test: OK\\")'
    f'"'
)

result = subprocess.run(
    ['ssh', '-o', 'ConnectTimeout=5', f'nvidia@{NODE2_IP}', gpu_check_cmd],
    capture_output=True, text=True, timeout=30
)

if result.returncode == 0:
    for line in result.stdout.strip().split('\n'):
        print(f"  ✓ {line}")
else:
    print(f"  ✗ GPU check failed")
    if result.stderr.strip():
        # Show last few lines of error
        err_lines = result.stderr.strip().split('\n')
        for line in err_lines[-3:]:
            print(f"    {line}")

print()
if cuda_available and result.returncode == 0:
    print("✓ Both nodes have working GPU compute.")
else:
    print("✗ Fix GPU access before continuing.")

spark-01 (192.168.100.10) - PyTorch GPU check:

  CUDA available: True
  GPUs visible to PyTorch: 1

  GPU 0: NVIDIA GB10 (128.5 GB)

  Testing GPU compute...


    Found GPU0 NVIDIA GB10 which is of cuda capability 12.1.
    Minimum and Maximum cuda capability supported by this version of PyTorch is
    (8.0) - (12.0)

  ✓ GPU compute working

spark-02 (192.168.100.11) - PyTorch GPU check:

  ✓ CUDA available: True
  ✓ GPU count: 1
  ✓ GPU 0: NVIDIA GB10 (128.5 GB)
  ✓ Compute test: OK

✓ Both nodes have working GPU compute.
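
The remote check above escapes every inner quote by hand, which is easy to get wrong when the one-liner changes. One alternative is to write the Python snippet as a normal string and let `shlex.quote` make it safe for the remote shell. This is a sketch: `remote_python_command` is a hypothetical helper and the interpreter path is a placeholder.

```python
import shlex

def remote_python_command(python_path, code):
    """Build one shell-safe command string that runs code with the given interpreter."""
    return f"{shlex.quote(python_path)} -c {shlex.quote(code)}"

code = 'import torch; print(f"CUDA available: {torch.cuda.is_available()}")'
cmd = remote_python_command("/path/to/.venv/bin/python", code)  # placeholder path
print(cmd)
# The resulting string can be passed as the final argument to ssh, as in the cell above.
```

Because the snippet is quoted as a single shell word, braces and double quotes inside it need no extra escaping.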


## Step 6: Verify Model Cache (Both Nodes)

Both nodes need `meta-llama/Llama-3.1-8B-Instruct` cached locally. The prefill node loads it for prompt processing; the decode node loads it for token generation. The check below confirms only that the cache directory exists, not that the download completed.

In [9]:
from pathlib import Path

MODEL_NAME = "meta-llama/Llama-3.1-8B-Instruct"
CACHE_DIR = Path.home() / ".cache" / "huggingface" / "hub"
model_slug = MODEL_NAME.replace("/", "--")

# --- Check spark-01 (local) ---
print(f"spark-01 ({NODE1_IP}) - model cache:\n")

local_model_dirs = list(CACHE_DIR.glob(f"models--{model_slug}*"))
if local_model_dirs:
    print(f"  ✓ {MODEL_NAME} found in cache")
    print(f"    {local_model_dirs[0]}")
else:
    print(f"  ✗ {MODEL_NAME} not found in cache")
    print(f"    Run: huggingface-cli download {MODEL_NAME}")

# --- Check spark-02 (remote) ---
print(f"\nspark-02 ({NODE2_IP}) - model cache:\n")

result = subprocess.run(
    ['ssh', '-o', 'ConnectTimeout=5', f'nvidia@{NODE2_IP}',
     f'ls -d ~/.cache/huggingface/hub/models--{model_slug}* 2>/dev/null && echo FOUND || echo MISSING'],
    capture_output=True, text=True, timeout=10
)

remote_lines = result.stdout.strip().split('\n')
remote_status = remote_lines[-1] if remote_lines else "MISSING"

if remote_status == "FOUND":
    print(f"  ✓ {MODEL_NAME} found in cache")
    if len(remote_lines) > 1:
        print(f"    {remote_lines[0]}")
else:
    print(f"  ✗ {MODEL_NAME} not found in cache")
    print(f"    Run on spark-02: huggingface-cli download {MODEL_NAME}")

print()
if local_model_dirs and remote_status == "FOUND":
    print("✓ Model cached on both nodes.")
else:
    print("✗ Model must be cached on both nodes before continuing.")

spark-01 (192.168.100.10) - model cache:

  ✓ meta-llama/Llama-3.1-8B-Instruct found in cache
    /home/nvidia/.cache/huggingface/hub/models--meta-llama--Llama-3.1-8B-Instruct

spark-02 (192.168.100.11) - model cache:

  ✓ meta-llama/Llama-3.1-8B-Instruct found in cache
    /home/nvidia/.cache/huggingface/hub/models--meta-llama--Llama-3.1-8B-Instruct

✓ Model cached on both nodes.
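
A cache directory can exist after an interrupted download. A somewhat stricter check looks for a snapshot that contains at least `config.json`. The sketch assumes the standard Hugging Face hub cache layout (`models--<org>--<name>/snapshots/<revision>/`); `cached_snapshots` is a hypothetical helper, and finding `config.json` still does not guarantee that every weight file is present.

```python
from pathlib import Path

def cached_snapshots(cache_dir, model_name, required_file="config.json"):
    """Return snapshot directories for model_name that contain required_file."""
    repo_dir = Path(cache_dir) / ("models--" + model_name.replace("/", "--"))
    snapshots_dir = repo_dir / "snapshots"
    if not snapshots_dir.is_dir():
        return []
    # exists() follows the symlinks that the hub cache uses for snapshot files
    return sorted(d for d in snapshots_dir.iterdir() if (d / required_file).exists())

hub_cache = Path.home() / ".cache" / "huggingface" / "hub"
found = cached_snapshots(hub_cache, "meta-llama/Llama-3.1-8B-Instruct")
print(f"{len(found)} usable snapshot(s) found")
```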


## Step 7: Environment Summary

Collect all configuration details for reference in later notebooks.

In [10]:
import json
from datetime import datetime

def get_remote_info(remote_ip):
    """Query remote node hostname, GPU, and InfiniBand configuration via SSH"""
    try:
        # Get hostname
        hostname_result = subprocess.run(
            ['ssh', '-o', 'ConnectTimeout=5', f'nvidia@{remote_ip}', 'hostname'],
            capture_output=True,
            text=True,
            timeout=10
        )
        
        # Get GPU info
        gpu_result = subprocess.run(
            ['ssh', '-o', 'ConnectTimeout=5', f'nvidia@{remote_ip}', 
             'nvidia-smi --query-gpu=name,count --format=csv,noheader'],
            capture_output=True,
            text=True,
            timeout=10
        )
        
        # Get InfiniBand devices
        ib_result = subprocess.run(
            ['ssh', '-o', 'ConnectTimeout=5', f'nvidia@{remote_ip}', 'ibstat -l'],
            capture_output=True,
            text=True,
            timeout=10
        )
        
        if hostname_result.returncode == 0 and gpu_result.returncode == 0:
            hostname = hostname_result.stdout.strip()
            gpu_info = gpu_result.stdout.strip().split('\n')
            gpu_count = len(gpu_info)
            gpu_model = gpu_info[0].split(',')[0] if gpu_info else "Unknown"
            
            # Parse IB devices
            if ib_result.returncode == 0:
                ib_devices = [d for d in ib_result.stdout.strip().split('\n') if d]
            else:
                ib_devices = []  # ibstat missing or failed; do not record an error string as a device
            
            return hostname, gpu_count, gpu_model, ib_devices
        else:
            return None, None, "SSH failed", []
    except Exception as e:
        return None, None, str(e), []

# Gather local node information
local_node = get_hostname()
env_config = {
    "timestamp": datetime.now().isoformat(),
    "hostname": local_node,  # For backward compatibility
    "gpus": {
        "count": gpu_count,
        "model": gpu_model
    },
    "nodes": {
        local_node: {
            "ip": NODE1_IP if "01" in local_node else NODE2_IP,
            "gpus": {
                "count": gpu_count,
                "model": gpu_model
            },
            "ib_devices": ib_devices
        }
    },
    "network": {
        "node1_ip": NODE1_IP,
        "node2_ip": NODE2_IP,
        "ib_devices": ib_devices
    },
    "model": {
        "name": MODEL_NAME,
        "cache_dir": str(CACHE_DIR)
    }
}

# Try to query remote node
remote_ip = NODE2_IP if "01" in local_node else NODE1_IP
print(f"Querying remote node at {remote_ip}...")
remote_hostname, remote_gpu_count, remote_gpu_model, remote_ib_devices = get_remote_info(remote_ip)

if remote_hostname:
    env_config["nodes"][remote_hostname] = {
        "ip": remote_ip,
        "gpus": {
            "count": remote_gpu_count,
            "model": remote_gpu_model
        },
        "ib_devices": remote_ib_devices
    }
    print(f"✓ Remote node: {remote_hostname}")
    print(f"  GPUs: {remote_gpu_count}x {remote_gpu_model}")
    print(f"  IB devices: {', '.join(remote_ib_devices) if remote_ib_devices else 'none'}")
else:
    print(f"✗ Could not query remote node: {remote_gpu_model}")
    print("  Check SSH connectivity or run this notebook on both nodes")

# Save to file for reference
config_file = Path("environment_config.json")
with open(config_file, 'w') as f:
    json.dump(env_config, f, indent=2)

print("\nEnvironment Configuration:")
print(json.dumps(env_config, indent=2))
print(f"\nConfiguration saved to: {config_file}")

Querying remote node at 192.168.100.11...
✓ Remote node: spark-02
  GPUs: 1x NVIDIA GB10
  IB devices: roceP2p1s0f0, roceP2p1s0f1, rocep1s0f0, rocep1s0f1

Environment Configuration:
{
  "timestamp": "2026-02-05T18:06:29.961537",
  "hostname": "spark-01",
  "gpus": {
    "count": 1,
    "model": "NVIDIA GB10"
  },
  "nodes": {
    "spark-01": {
      "ip": "192.168.100.10",
      "gpus": {
        "count": 1,
        "model": "NVIDIA GB10"
      },
      "ib_devices": [
        "roceP2p1s0f0",
        "roceP2p1s0f1",
        "rocep1s0f0",
        "rocep1s0f1"
      ]
    },
    "spark-02": {
      "ip": "192.168.100.11",
      "gpus": {
        "count": 1,
        "model": "NVIDIA GB10"
      },
      "ib_devices": [
        "roceP2p1s0f0",
        "roceP2p1s0f1",
        "rocep1s0f0",
        "rocep1s0f1"
      ]
    }
  },
  "network": {
    "node1_ip": "192.168.100.10",
    "node2_ip": "192.168.100.11",
    "ib_devices": [
      "roceP2p1s0f0",
      "roceP2p1s0f1",
      "rocep1s0f0

## Next Steps

Environment is ready. The next notebook ([01_Local_Inference_Baseline.ipynb](01_Local_Inference_Baseline.ipynb)) establishes single-node vLLM performance.

**Remaining notebooks:**
1. **01_Local_Inference_Baseline**: Measure throughput, latency, and memory (the bar to beat)
2. **02_Understanding_KV_Cache**: Calculate cache dimensions and transfer cost (arithmetic only)
3. **03_Disaggregated_Serving**: Split prefill/decode across spark-01 and spark-02
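
Later notebooks can read the `environment_config.json` file written in Step 7 instead of redefining node addresses. A minimal loading sketch follows; `load_env_config` and `REQUIRED_KEYS` are hypothetical names, and the key list mirrors the top-level sections of `env_config` above.

```python
import json
from pathlib import Path

REQUIRED_KEYS = ("nodes", "network", "model")  # top-level sections written in Step 7

def load_env_config(path="environment_config.json"):
    """Load the saved configuration and fail early if an expected section is missing."""
    config = json.loads(Path(path).read_text())
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise KeyError(f"environment config is missing sections: {missing}")
    return config

if Path("environment_config.json").exists():
    cfg = load_env_config()
    print(f"Model: {cfg['model']['name']}; nodes: {', '.join(cfg['nodes'])}")
```

Failing on a missing section at load time gives a clearer error than a `KeyError` deep inside a later benchmark cell.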