# Environment Setup for Disaggregated Serving

Before building a disaggregated inference system, we need a working baseline environment. This notebook sets up two DGX Spark nodes for distributed LLM inference.

## What We're Setting Up

- **Node 1 (dgx01)**: Primary inference node
- **Node 2 (dgx02)**: Secondary inference node
- **Network**: RDMA-capable link between nodes (InfiniBand or RoCE; this setup reports RoCE devices)
- **Software**: PyTorch, vLLM, monitoring tools

## Why This Matters

Disaggregated serving splits the inference pipeline across multiple machines. Before we optimize with RDMA and cache-aware routing, we need working compute and networking.

## Step 1: Verify Node Configuration

Check that the node is accessible and has the expected hardware. This cell inspects only the machine it runs on, so run it on each node.

In [1]:
import subprocess
import socket

def get_hostname():
    return socket.gethostname()

def get_gpu_info():
    """Get GPU count and model from nvidia-smi"""
    try:
        result = subprocess.run(
            ['nvidia-smi', '--query-gpu=name,count', '--format=csv,noheader'],
            capture_output=True,
            text=True,
            check=True
        )
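        # One CSV line per GPU; drop blank lines so empty output counts as zero GPUs
        gpu_info = [line for line in result.stdout.strip().split('\n') if line]
        return len(gpu_info), gpu_info[0].split(',')[0].strip() if gpu_info else "Unknown"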
    except Exception as e:
        return 0, str(e)

hostname = get_hostname()
gpu_count, gpu_model = get_gpu_info()

print(f"Hostname: {hostname}")
print(f"GPUs: {gpu_count}x {gpu_model}")

Hostname: spark-01
GPUs: 1x NVIDIA GB10


## Step 2: Check InfiniBand Network

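Verify RDMA-capable interfaces are present and active. This is what enables fast KV cache transfer later. `ibstat` also lists RoCE (RDMA over Converged Ethernet) adapters, which is why the device names in the output start with `roce`.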

In [2]:
def check_ib_devices():
    """List InfiniBand devices using ibstat"""
    try:
        result = subprocess.run(
            ['ibstat', '-l'],
            capture_output=True,
            text=True,
            check=True
        )
        devices = result.stdout.strip().split('\n')
        return [d for d in devices if d]
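    except subprocess.CalledProcessError:
        return []
    except FileNotFoundError:
        # Report the missing tool, but do not return the message as if it were a device name
        print("  ibstat not found - install infiniband-diags (Debian/Ubuntu package name)")
        return []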

def check_ib_link_state(device):
    """Check if InfiniBand device is active"""
    try:
        result = subprocess.run(
            ['ibstat', device],
            capture_output=True,
            text=True,
            check=True
        )
        # Look for "State: Active" in output
        return "State: Active" in result.stdout
    except Exception:
        return False

print("InfiniBand Devices:")
ib_devices = check_ib_devices()
for device in ib_devices:
    state = "Active" if check_ib_link_state(device) else "Down"
    print(f"  {device}: {state}")

if not ib_devices:
    print("  No InfiniBand devices found")
    print("  Disaggregated serving will use TCP/IP (slower)")

InfiniBand Devices:
  roceP2p1s0f0: Active
  roceP2p1s0f1: Active
  rocep1s0f0: Active
  rocep1s0f1: Active
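
Because `ibstat` lists InfiniBand and RoCE adapters together, it can help to read each port's link layer and rate directly. The following is a minimal sketch that reads the kernel's sysfs entries under `/sys/class/infiniband` instead of parsing `ibstat` text. The function name `read_rdma_ports` and its `sysfs_root` parameter are illustrative and are not used elsewhere in this notebook.

```python
from pathlib import Path

def read_rdma_ports(sysfs_root="/sys/class/infiniband"):
    """Return {device: {port: {"state", "rate", "link_layer"}}} read from sysfs."""
    root = Path(sysfs_root)
    ports = {}
    if not root.is_dir():
        return ports  # no RDMA devices registered, or not a Linux host
    for dev in sorted(root.iterdir()):
        port_dir = dev / "ports"
        if not port_dir.is_dir():
            continue
        ports[dev.name] = {}
        for port in sorted(port_dir.iterdir()):
            info = {}
            for field in ("state", "rate", "link_layer"):
                field_file = port / field
                info[field] = field_file.read_text().strip() if field_file.is_file() else None
            ports[dev.name][port.name] = info
    return ports

# Prints nothing on a machine without RDMA devices
for dev, dev_ports in read_rdma_ports().items():
    for port, info in dev_ports.items():
        print(f"  {dev} port {port}: {info['state']}, {info['rate']}, {info['link_layer']}")
```

A `link_layer` of `Ethernet` indicates a RoCE port, while `InfiniBand` indicates a native InfiniBand port.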


## Step 3: Test Network Connectivity

Ping the other node to verify basic network connectivity. Update the IP addresses for your environment. The cell chooses the remote IP by checking whether the local hostname contains `01`.

In [3]:
# Configuration - Update these IPs for your setup
NODE1_IP = "192.168.100.10"  # dgx01
NODE2_IP = "192.168.100.11"  # dgx02

def ping_host(ip_address, count=4):
    """Test network connectivity to remote host"""
    try:
        result = subprocess.run(
            ['ping', '-c', str(count), ip_address],
            capture_output=True,
            text=True,
            timeout=10
        )
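        # Extract average latency from the ping summary line
        for line in result.stdout.split('\n'):
            if 'avg' in line:
                # Format: min/avg/max/mdev = 0.123/0.456/0.789/0.012 ms
                avg_latency = line.split('=')[1].strip().split('/')[1]
                return True, f"{avg_latency} ms"
        # No summary line: decide from the exit code so failures are not reported as "success"
        if result.returncode == 0:
            return True, "latency unavailable"
        return False, f"ping exited with code {result.returncode}"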
    except Exception as e:
        return False, str(e)

current_node = get_hostname()
remote_ip = NODE2_IP if "01" in current_node else NODE1_IP

print(f"Testing connectivity to remote node ({remote_ip})...")
success, latency = ping_host(remote_ip)

if success:
    print(f"✓ Remote node reachable (avg latency: {latency})")
else:
    print(f"✗ Cannot reach remote node: {latency}")
    print("  Check network configuration and IP addresses")

Testing connectivity to remote node (192.168.100.11)...
✓ Remote node reachable (avg latency: 1.012 ms)
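
Keep in mind that `ping` measures ICMP round-trip time through the regular IP stack, so this number is a connectivity check and not an RDMA latency figure. If you want the full min/avg/max/mdev summary as numbers, a regular expression is a little sturdier than chained `split` calls. This is a sketch that assumes the Linux `ping` summary format shown in the comment above; `parse_ping_rtt` is a hypothetical helper name and the sample values are made up.

```python
import re

# Matches the summary line printed by Linux ping, e.g.
# "rtt min/avg/max/mdev = 0.123/0.456/0.789/0.012 ms"
_RTT_RE = re.compile(r"=\s*([\d.]+)/([\d.]+)/([\d.]+)/([\d.]+)\s*ms")

def parse_ping_rtt(output):
    """Return (min, avg, max, mdev) in milliseconds, or None if there is no summary line."""
    match = _RTT_RE.search(output)
    if match is None:
        return None
    return tuple(float(value) for value in match.groups())

sample = "rtt min/avg/max/mdev = 0.123/0.456/0.789/0.012 ms"  # made-up sample values
print(parse_ping_rtt(sample))
```

When every packet is lost, `ping` prints no summary line and the helper returns `None`.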


## Step 4: Install Core Dependencies

Check that PyTorch, vLLM, and Transformers are importable for baseline inference. The cell prints install commands for anything that is missing. We'll add RDMA libraries later when we optimize.

In [4]:
# Check if packages are already installed
def check_package(package_name):
    """Check if a Python package is installed"""
    try:
        __import__(package_name)
        return True
    except ImportError:
        return False

packages = {
    'torch': 'PyTorch (deep learning framework)',
    'vllm': 'vLLM (high-performance LLM serving)',
    'transformers': 'HuggingFace Transformers (model loading)',
}

print("Checking installed packages:\n")
missing = []
for pkg, description in packages.items():
    installed = check_package(pkg)
    status = "✓" if installed else "✗"
    print(f"{status} {pkg}: {description}")
    if not installed:
        missing.append(pkg)

if missing:
    print(f"\nMissing packages: {', '.join(missing)}")
    # Print only the install commands for the packages that are actually missing
    install_commands = {
        'transformers': "pip install transformers",
        'torch': "pip install torch==2.9.0 torchvision==0.24.0 torchaudio==2.9.0 --index-url https://download.pytorch.org/whl/cu130",
        'vllm': "pip install vllm==0.13.0 --extra-index-url https://wheels.vllm.ai/0.13.0/cu130",
    }
    for pkg in missing:
        print(f"Run: {install_commands[pkg]}")
else:
    print("\n✓ All core packages installed")

Checking installed packages:

✓ torch: PyTorch (deep learning framework)
✓ vllm: vLLM (high-performance LLM serving)


✓ transformers: HuggingFace Transformers (model loading)

✓ All core packages installed
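
Importing `vllm` or `transformers` just to test for their presence is slow and can emit warnings. As a lighter alternative, the standard library's `importlib.metadata` can report installed versions without importing the packages. This is a sketch; `installed_version` is a hypothetical helper, and it assumes the distribution names match the import names, which holds for these three packages.

```python
from importlib.metadata import PackageNotFoundError, version

def installed_version(dist_name):
    """Return the installed version string for a distribution, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None

for dist in ("torch", "vllm", "transformers"):
    print(f"{dist}: {installed_version(dist) or 'not installed'}")
```

Recording the versions alongside the hardware details makes later benchmark results easier to reproduce.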


## Step 5: Verify PyTorch GPU Access

Confirm PyTorch can see and use the GPUs. This is our compute layer for inference.

In [5]:
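# Set CUDA environment variables for this kernel and any subprocesses it launches.
# Note: the dynamic loader reads LD_LIBRARY_PATH when a process starts, so changing it
# here affects child processes, not libraries loaded by the already-running kernel.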
import os

# CUDA 13.0 paths for DGX Spark
os.environ['CUDA_HOME'] = '/usr/local/cuda-13.0'
cuda_lib_path = '/usr/local/cuda-13.0/lib64'

if 'LD_LIBRARY_PATH' in os.environ:
    os.environ['LD_LIBRARY_PATH'] = f"{cuda_lib_path}:{os.environ['LD_LIBRARY_PATH']}"
else:
    os.environ['LD_LIBRARY_PATH'] = cuda_lib_path

print(f"CUDA_HOME: {os.environ['CUDA_HOME']}")
print(f"LD_LIBRARY_PATH: {os.environ['LD_LIBRARY_PATH']}")
print("\n✓ CUDA 13.0 environment configured")

CUDA_HOME: /usr/local/cuda-13.0
LD_LIBRARY_PATH: /usr/local/cuda-13.0/lib64

✓ CUDA 13.0 environment configured
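
The cell above prepends the CUDA library path every time it runs, so re-running it duplicates the entry. A small idempotent helper avoids that. This is a sketch: `prepend_path` is a hypothetical name, and the demo uses a plain dict as a stand-in for `os.environ`.

```python
import os

def prepend_path(env, name, new_dir):
    """Prepend new_dir to the os.pathsep-separated variable `name` in `env`.

    Does nothing if new_dir is already listed, so re-running a cell
    does not keep growing the variable.
    """
    parts = [p for p in env.get(name, "").split(os.pathsep) if p]
    if new_dir not in parts:
        parts.insert(0, new_dir)
    env[name] = os.pathsep.join(parts)
    return env[name]

demo_env = {"LD_LIBRARY_PATH": "/opt/lib"}  # stand-in dict; pass os.environ in practice
prepend_path(demo_env, "LD_LIBRARY_PATH", "/usr/local/cuda-13.0/lib64")
prepend_path(demo_env, "LD_LIBRARY_PATH", "/usr/local/cuda-13.0/lib64")  # second call is a no-op
print(demo_env["LD_LIBRARY_PATH"])
```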


In [6]:
import torch

cuda_available = torch.cuda.is_available()
print(f"CUDA available: {cuda_available}")

if cuda_available:
    gpu_count = torch.cuda.device_count()
    print(f"GPUs visible to PyTorch: {gpu_count}\n")
    
    for i in range(gpu_count):
        name = torch.cuda.get_device_name(i)
        memory_gb = torch.cuda.get_device_properties(i).total_memory / 1e9
        print(f"GPU {i}: {name} ({memory_gb:.1f} GB)")
    
    # Quick compute test
    print("\nTesting GPU compute...")
    x = torch.randn(1000, 1000, device='cuda')
    y = torch.matmul(x, x)
    torch.cuda.synchronize()
    print("✓ GPU compute working")
else:
    print("✗ CUDA not available")
    print("\nIMPORTANT: Restart the kernel and run cells in order:")
    print("1. Kernel → Restart Kernel")
    print("2. Run Step 5 cells again (environment setup, then this test)")

CUDA available: True
GPUs visible to PyTorch: 1

GPU 0: NVIDIA GB10 (128.5 GB)

Testing GPU compute...


    Found GPU0 NVIDIA GB10 which is of cuda capability 12.1.
    Minimum and Maximum cuda capability supported by this version of PyTorch is
    (8.0) - (12.0)
    


✓ GPU compute working
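
The warning in the output says the GPU reports compute capability 12.1 while this PyTorch build lists 8.0 to 12.0. In the run above the matmul still completed. To make that check explicit, the sketch below compares the device capability with the architectures the build was compiled for, using `torch.cuda.get_device_capability` and `torch.cuda.get_arch_list`. The helper names are hypothetical, and the parsing assumes architecture strings of the form `sm_120` or `compute_120`.

```python
import re

def parse_arch_list(arch_list):
    """Convert entries like 'sm_120' into sorted (major, minor) tuples such as (12, 0)."""
    caps = set()
    for arch in arch_list:
        match = re.fullmatch(r"(?:sm|compute)_(\d{2,})[a-z]?", arch)
        if match:
            digits = match.group(1)
            caps.add((int(digits[:-1]), int(digits[-1])))  # last digit is the minor version
    return sorted(caps)

def capability_in_range(capability, arch_list):
    """True if `capability` lies between the lowest and highest built architecture."""
    caps = parse_arch_list(arch_list)
    return bool(caps) and caps[0] <= tuple(capability) <= caps[-1]

try:
    import torch
    if torch.cuda.is_available():
        cap = torch.cuda.get_device_capability(0)
        archs = torch.cuda.get_arch_list()
        print(f"Device capability: {cap}, build architectures: {archs}")
        print(f"Within built range: {capability_in_range(cap, archs)}")
except ImportError:
    print("PyTorch not installed; skipping capability check")
```

A `False` result here matches the situation the warning describes, which is worth noting in the environment summary before benchmarking.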


## Step 6: Download Test Model

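We use a small model (TinyLlama-1.1B) for testing. It runs fast enough to iterate quickly while learning. This cell checks whether the model is already in the local Hugging Face cache and prints download instructions if it is not.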

In [7]:
from pathlib import Path

MODEL_NAME = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
CACHE_DIR = Path.home() / ".cache" / "huggingface" / "hub"

def check_model_cached(model_name):
    """Check if model is already downloaded"""
    # The Hugging Face hub cache stores each repo under models--<org>--<name>,
    # with the downloaded files in snapshots/<commit hash>/
    model_dir = CACHE_DIR / f"models--{model_name.replace('/', '--')}"
    snapshots = model_dir / "snapshots"
    # Require at least one snapshot so another model's cache entry does not count
    return snapshots.is_dir() and any(snapshots.iterdir())

print(f"Model: {MODEL_NAME}")
print(f"Cache directory: {CACHE_DIR}\n")

if check_model_cached(MODEL_NAME):
    print("✓ Model files found in cache")
    print("  If you want to test download, delete cache directory")
else:
    print("Model not cached.")
    print("  It will be downloaded automatically on first inference")
    print(f"  Or run: huggingface-cli download {MODEL_NAME}")

Model: TinyLlama/TinyLlama-1.1B-Chat-v1.0
Cache directory: /home/nvidia/.cache/huggingface/hub

✓ Model files found in cache
  If you want to test download, delete cache directory
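
To go one step further and find the directory that holds the cached files, you can follow the cache's `refs/<revision>` pointer to the matching `snapshots/<commit hash>` directory. This is a sketch of that lookup. `resolve_cached_snapshot` is a hypothetical helper, and treating `config.json` as the marker of a usable snapshot is an assumption that fits Transformers-style model repos.

```python
from pathlib import Path

def resolve_cached_snapshot(model_name, cache_dir, revision="main"):
    """Return the snapshot directory for `revision`, or None if it is not cached."""
    repo_dir = Path(cache_dir) / f"models--{model_name.replace('/', '--')}"
    ref_file = repo_dir / "refs" / revision
    if not ref_file.is_file():
        return None
    commit = ref_file.read_text().strip()  # refs/<revision> holds the commit hash
    snapshot = repo_dir / "snapshots" / commit
    # Assumption: a usable model snapshot contains config.json
    return snapshot if (snapshot / "config.json").exists() else None

default_cache = Path.home() / ".cache" / "huggingface" / "hub"
print(resolve_cached_snapshot("TinyLlama/TinyLlama-1.1B-Chat-v1.0", default_cache))
```

The call prints `None` when the model has not been downloaded yet.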


## Step 7: Environment Summary

Collect all configuration details for reference in later notebooks.

In [8]:
import json
from datetime import datetime

def get_remote_info(remote_ip):
    """Query remote node hostname, GPU, and InfiniBand configuration via SSH"""
    try:
        # Get hostname
        hostname_result = subprocess.run(
            ['ssh', '-o', 'ConnectTimeout=5', f'nvidia@{remote_ip}', 'hostname'],
            capture_output=True,
            text=True,
            timeout=10
        )
        
        # Get GPU info
        gpu_result = subprocess.run(
            ['ssh', '-o', 'ConnectTimeout=5', f'nvidia@{remote_ip}', 
             'nvidia-smi --query-gpu=name,count --format=csv,noheader'],
            capture_output=True,
            text=True,
            timeout=10
        )
        
        # Get InfiniBand devices
        ib_result = subprocess.run(
            ['ssh', '-o', 'ConnectTimeout=5', f'nvidia@{remote_ip}', 'ibstat -l'],
            capture_output=True,
            text=True,
            timeout=10
        )
        
        if hostname_result.returncode == 0 and gpu_result.returncode == 0:
            hostname = hostname_result.stdout.strip()
            gpu_info = gpu_result.stdout.strip().split('\n')
            gpu_count = len(gpu_info)
            gpu_model = gpu_info[0].split(',')[0] if gpu_info else "Unknown"
            
            # Parse IB devices
            if ib_result.returncode == 0:
                ib_devices = [d for d in ib_result.stdout.strip().split('\n') if d]
            else:
                ib_devices = ["ibstat not found"]
            
            return hostname, gpu_count, gpu_model, ib_devices
        else:
            return None, None, "SSH failed", []
    except Exception as e:
        return None, None, str(e), []

# Gather local node information
local_node = get_hostname()
env_config = {
    "timestamp": datetime.now().isoformat(),
    "hostname": local_node,  # For backward compatibility
    "gpus": {
        "count": gpu_count,
        "model": gpu_model
    },
    "nodes": {
        local_node: {
            "ip": NODE1_IP if "01" in local_node else NODE2_IP,
            "gpus": {
                "count": gpu_count,
                "model": gpu_model
            },
            "ib_devices": ib_devices
        }
    },
    "network": {
        "node1_ip": NODE1_IP,
        "node2_ip": NODE2_IP,
        "ib_devices": ib_devices
    },
    "model": {
        "name": MODEL_NAME,
        "cache_dir": str(CACHE_DIR)
    }
}

# Try to query remote node
remote_ip = NODE2_IP if "01" in local_node else NODE1_IP
print(f"Querying remote node at {remote_ip}...")
remote_hostname, remote_gpu_count, remote_gpu_model, remote_ib_devices = get_remote_info(remote_ip)

if remote_hostname:
    env_config["nodes"][remote_hostname] = {
        "ip": remote_ip,
        "gpus": {
            "count": remote_gpu_count,
            "model": remote_gpu_model
        },
        "ib_devices": remote_ib_devices
    }
    print(f"✓ Remote node: {remote_hostname}")
    print(f"  GPUs: {remote_gpu_count}x {remote_gpu_model}")
    print(f"  IB devices: {', '.join(remote_ib_devices) if remote_ib_devices else 'none'}")
else:
    print(f"✗ Could not query remote node: {remote_gpu_model}")
    print("  Check SSH connectivity or run this notebook on both nodes")

# Save to file for reference
config_file = Path("environment_config.json")
with open(config_file, 'w') as f:
    json.dump(env_config, f, indent=2)

print("\nEnvironment Configuration:")
print(json.dumps(env_config, indent=2))
print(f"\nConfiguration saved to: {config_file}")

Querying remote node at 192.168.100.11...
✓ Remote node: spark-02
  GPUs: 1x NVIDIA GB10
  IB devices: roceP2p1s0f0, roceP2p1s0f1, rocep1s0f0, rocep1s0f1

Environment Configuration:
{
  "timestamp": "2026-02-04T18:07:23.159361",
  "hostname": "spark-01",
  "gpus": {
    "count": 1,
    "model": "NVIDIA GB10"
  },
  "nodes": {
    "spark-01": {
      "ip": "192.168.100.10",
      "gpus": {
        "count": 1,
        "model": "NVIDIA GB10"
      },
      "ib_devices": [
        "roceP2p1s0f0",
        "roceP2p1s0f1",
        "rocep1s0f0",
        "rocep1s0f1"
      ]
    },
    "spark-02": {
      "ip": "192.168.100.11",
      "gpus": {
        "count": 1,
        "model": "NVIDIA GB10"
      },
      "ib_devices": [
        "roceP2p1s0f0",
        "roceP2p1s0f1",
        "rocep1s0f0",
        "rocep1s0f1"
      ]
    }
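  },
  "network": {
    "node1_ip": "192.168.100.10",
    "node2_ip": "192.168.100.11",
    "ib_devices": [
      "roceP2p1s0f0",
      "roceP2p1s0f1",
      "rocep1s0f0",
      "rocep1s0f1"
    ]
  },
  "model": {
    "name": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    "cache_dir": "/home/nvidia/.cache/huggingface/hub"
  }
}

Configuration saved to: environment_config.json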

## Next Steps

Environment is ready. The next notebook ([01_Local_Inference_Baseline.ipynb](01_Local_Inference_Baseline.ipynb)) will run single-node inference to establish baseline performance.

**What we measured here:**
- Hardware configuration (GPUs, network)
- Network latency between nodes
- Software availability (PyTorch, vLLM)

**What we'll measure next:**
- Throughput (tokens/sec) for local inference
- Latency (ms) per request
- GPU memory usage patterns

This baseline is what we'll compare against when we add disaggregation.
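
Later notebooks can read `environment_config.json` instead of probing the hardware again. The sketch below loads the file written in Step 7 and returns the nodes other than the local one. The helper names `load_env_config` and `remote_nodes` are hypothetical, and the code assumes the `nodes` layout shown in the Step 7 output.

```python
import json
import socket
from pathlib import Path

def load_env_config(path="environment_config.json"):
    """Load the environment summary saved in Step 7."""
    with open(path) as f:
        return json.load(f)

def remote_nodes(config, local_hostname=None):
    """Return {hostname: ip} for every node in the config except the local one."""
    local = local_hostname or socket.gethostname()
    return {name: node["ip"] for name, node in config["nodes"].items() if name != local}

config_path = Path("environment_config.json")
if config_path.exists():
    print(remote_nodes(load_env_config(config_path)))
```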