# NIXL Multi-Rail Benchmarks

This tutorial covers production-grade RDMA benchmarking using NIXL (NVIDIA Inference Xfer Library) with multi-rail support.

**Prerequisites:**
- Completed [02_Multi_Rail_Tutorial.ipynb](02_Multi_Rail_Tutorial.ipynb) for network setup
- Two DGX Spark nodes with RoCE connectivity
- NIXL installed with nixlbench built

**What you'll learn:**
- Using nixlbench for DRAM and VRAM transfers
- Multi-rail performance with UCX backend
- Comparing single-threaded vs multi-threaded benchmarks
- Parsing and visualizing benchmark results

## Part 1: Setup

In [None]:
# Configuration
import subprocess

def run_cmd(cmd):
    result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    return result.stdout + result.stderr

# Network configuration (update for your environment)
LOCAL_IP = "192.168.100.11"   # This node (initiator)
REMOTE_IP = "192.168.100.10"  # Target node

print(f"Local IP:  {LOCAL_IP}")
print(f"Remote IP: {REMOTE_IP}")

## Part 2: NIXL Multi-Rail Architecture

NIXL (NVIDIA Inference Xfer Library) provides production-grade RDMA capabilities with multi-rail support. Unlike basic tools like `ib_write_bw`, NIXL includes thread management, memory pooling, and multi-rail aggregation built for real inference workloads.

### Architecture

nixlbench coordinates its distributed benchmark runs through ETCD:
- Worker processes register with ETCD
- One node acts as initiator (sender), one as target (receiver)
- Supports UCX, GPUNetIO, Mooncake, and Libfabric backends
- Measures DRAM-to-DRAM and VRAM-to-VRAM transfers
- Multi-rail automatically aggregates both RoCE links
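
UCX normally selects rails on its own, as the last bullet notes. If the rails ever need to be pinned explicitly, one option is to set UCX environment variables before launching the benchmark. The sketch below is an assumption-laden illustration: `multirail_env` is a hypothetical helper, the device names are placeholders (list the real ones with `ibv_devinfo` or `ucx_info -d`), and whether your UCX build needs these settings at all should be checked against its documentation.

```python
import os


def multirail_env(devices, base_env=None):
    """Return a copy of the environment with UCX multi-rail variables set.

    `devices` is a list of UCX device names in "name:port" form.
    """
    env = dict(os.environ if base_env is None else base_env)
    # Restrict UCX to the listed RDMA devices
    env["UCX_NET_DEVICES"] = ",".join(devices)
    # Allow one rendezvous rail per listed device for large transfers
    env["UCX_MAX_RNDV_RAILS"] = str(len(devices))
    return env


# Hypothetical device names; replace with the ones on your nodes
env = multirail_env(["rocep1s0f0:1", "rocep1s0f1:1"], base_env={})
for key in sorted(env):
    print(f"{key}={env[key]}")
```

The returned dict can be passed as `env=` to `subprocess.run` when starting nixlbench from the notebook.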

### Prerequisites: Start ETCD Server

NIXL requires ETCD for coordination. Start it on one node (typically the target/receiver).
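
Once the server from the cell below is running, it helps to confirm that it is reachable before launching nixlbench. The following is a minimal sketch of the same check as `curl http://localhost:2379/version`, done from Python with only the standard library. `etcd_version` is a hypothetical helper name; point it at the target node's IP when running from the initiator.

```python
import json
import urllib.error
import urllib.request


def etcd_version(endpoint, timeout=2.0):
    """Return the parsed /version response, or None if ETCD is unreachable."""
    url = endpoint.rstrip("/") + "/version"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return json.load(resp)
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, or a non-JSON reply
        return None


info = etcd_version("http://localhost:2379")
print(info if info else "ETCD not reachable")
```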

In [None]:
# Start ETCD server (run on target node)
# This runs in background and provides coordination
etcd_cmd = """
docker run -d --name etcd-server \\
  --network host \\
  quay.io/coreos/etcd:v3.5.18 \\
  /usr/local/bin/etcd \\
  --data-dir=/etcd-data \\
  --listen-client-urls=http://0.0.0.0:2379 \\
  --advertise-client-urls=http://0.0.0.0:2379 \\
  --listen-peer-urls=http://0.0.0.0:2380 \\
  --initial-advertise-peer-urls=http://0.0.0.0:2380 \\
  --initial-cluster=default=http://0.0.0.0:2380
"""

# This cell only prints the command; it does not start ETCD itself
print("=== ETCD Server ===")
print("Run this command on the target node:")
print(etcd_cmd)
print("\nVerify ETCD is running:")
print("curl http://localhost:2379/version")

### Build nixlbench

First, ensure nixlbench is built and available.

In [None]:
# Check if nixlbench is available
nixlbench_path = "/usr/local/nixlbench/bin/nixlbench"

# Alternative: If built in source tree
# nixlbench_path = "/path/to/nixl/benchmark/nixlbench/build/nixlbench"

import os
if os.path.exists(nixlbench_path):
    print(f"✓ nixlbench found at {nixlbench_path}")
    print(run_cmd(f"{nixlbench_path} --help | head -20"))
else:
    print(f"✗ nixlbench not found at {nixlbench_path}")
    print("\nBuild instructions:")
    print("cd /path/to/nixl/benchmark/nixlbench")
    print("meson setup build --prefix=/usr/local/nixlbench")
    print("cd build && ninja && sudo ninja install")

## Part 3: DRAM-to-DRAM Benchmarks

Test CPU memory transfer over dual 100G links using UCX backend.

**On target node (runs first):**
```bash
nixlbench --etcd_endpoints http://192.168.100.10:2379 \
  --backend UCX \
  --initiator_seg_type DRAM \
  --target_seg_type DRAM \
  --total_buffer_size 4GiB \
  --start_block_size 64KiB \
  --max_block_size 64MiB \
  --num_iter 1000 \
  --warmup_iter 100
```
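
Both nodes need matching benchmark arguments, and hand-edited copies tend to drift. One way to avoid that is to generate the command line from a single dictionary. This is only a sketch: `build_nixlbench_cmd` and `COMMON_FLAGS` are hypothetical names, and the flags are the ones used in this tutorial's commands.

```python
import shlex


def build_nixlbench_cmd(binary, flags):
    """Turn a flag dict into an argv list; True means a bare switch."""
    argv = [binary]
    for name, value in flags.items():
        if value is None or value is False:
            continue  # flag disabled, leave it out
        argv.append(f"--{name}")
        if value is not True:
            argv.append(str(value))
    return argv


# Flags shared by target and initiator (values from this tutorial)
COMMON_FLAGS = {
    "etcd_endpoints": "http://192.168.100.10:2379",
    "backend": "UCX",
    "initiator_seg_type": "DRAM",
    "target_seg_type": "DRAM",
    "total_buffer_size": "4GiB",
    "start_block_size": "64KiB",
    "max_block_size": "64MiB",
    "num_iter": 1000,
    "warmup_iter": 100,
}

print(shlex.join(build_nixlbench_cmd("nixlbench", COMMON_FLAGS)))
```

The printed line can be pasted on either node, so the target and initiator are guaranteed to use the same sizes and iteration counts.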

**On initiator node (runs second):**

In [None]:
# Run NIXL benchmark (initiator side)
# Update ETCD_IP to match your target node
ETCD_IP = REMOTE_IP  # Target node runs ETCD
NIXLBENCH = nixlbench_path

nixl_cmd = f"""
{NIXLBENCH} \\
  --etcd_endpoints http://{ETCD_IP}:2379 \\
  --backend UCX \\
  --initiator_seg_type DRAM \\
  --target_seg_type DRAM \\
  --total_buffer_size 4GiB \\
  --start_block_size 64KiB \\
  --max_block_size 64MiB \\
  --num_iter 1000 \\
  --warmup_iter 100
"""

print("=== NIXL DRAM-to-DRAM Benchmark ===")
print("Run this command as initiator:")
print(nixl_cmd)
print("\nNote: Target node must start its nixlbench process first")
print("Both nodes will synchronize via ETCD and run the benchmark")

### Parse and Visualize NIXL Results

NIXL outputs detailed performance metrics including bandwidth, latency percentiles, and per-block-size breakdown.
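
Bandwidth reported in MiB/s is easier to compare against the 100G links once it is converted to Gbps. The sketch below works through that arithmetic for the example figure used in this tutorial. The helper names are hypothetical, and the 200 Gbps line rate assumes two 100G links with no protocol overhead.

```python
MIB = 1024 ** 2           # bytes per MiB
LINE_RATE_GBPS = 2 * 100  # two 100G links, ignoring protocol overhead


def mib_per_s_to_gbps(mib_per_s):
    """Convert MiB/s to Gbps (1 Gbps = 1e9 bits per second)."""
    return mib_per_s * MIB * 8 / 1e9


def link_utilization(mib_per_s, line_rate_gbps=LINE_RATE_GBPS):
    """Fraction of the aggregate line rate that a bandwidth represents."""
    return mib_per_s_to_gbps(mib_per_s) / line_rate_gbps


bw = 21234.56  # MiB/s, the example value from this tutorial
print(f"{bw} MiB/s = {mib_per_s_to_gbps(bw):.1f} Gbps "
      f"({link_utilization(bw):.0%} of {LINE_RATE_GBPS} Gbps)")
# prints: 21234.56 MiB/s = 178.1 Gbps (89% of 200 Gbps)
```

Note that multiplying MiB/s by 8 and dividing by 1000 understates the result by about 5%, because a MiB is 1,048,576 bytes rather than 1,000,000.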

In [None]:
import re
import pandas as pd
import matplotlib.pyplot as plt

def parse_nixl_output(output):
    """Parse nixlbench output for bandwidth and latency data."""
    # Bytes per binary unit, used for both block sizes and bandwidths
    unit_bytes = {'B': 1, 'KiB': 1024, 'MiB': 1024**2, 'GiB': 1024**3}
    # Multipliers that convert a latency value to microseconds
    to_us = {'ns': 1e-3, 'us': 1.0, '\u00b5s': 1.0, '\u03bcs': 1.0, 'ms': 1e3, 's': 1e6}
    results = []

    # Look for result lines like:
    # Block Size: 64.00 KiB, Bandwidth: 21234.56 MiB/s, Latency: 3.12 us
    # Numbers may carry thousands separators (21,234.56); they are stripped below.
    pattern = (r'Block Size:\s+([\d.,]+)\s+(\w+),\s+Bandwidth:\s+([\d.,]+)\s+(\w+)/s,'
               r'\s+Latency:\s+([\d.,]+)\s+(\w+)')

    for line in output.split('\n'):
        match = re.search(pattern, line)
        if not match:
            continue
        block_unit, bw_unit, lat_unit = match.group(2, 4, 6)
        if block_unit not in unit_bytes or bw_unit not in unit_bytes or lat_unit not in to_us:
            continue  # skip lines whose units this parser does not recognize

        block_size, bandwidth, latency = (
            float(match.group(i).replace(',', '')) for i in (1, 3, 5)
        )

        results.append({
            # Block size in KiB (column name kept as-is)
            'block_size_kb': block_size * unit_bytes[block_unit] / 1024,
            # Bytes/s -> bits/s -> Gbps (1 Gbps = 1e9 bits/s)
            'bandwidth_gbps': bandwidth * unit_bytes[bw_unit] * 8 / 1e9,
            'latency_us': latency * to_us[lat_unit],
        })

    return pd.DataFrame(results, columns=['block_size_kb', 'bandwidth_gbps', 'latency_us'])

# Example: Save nixlbench output to file and parse
# output = open('nixlbench_output.txt').read()
# df = parse_nixl_output(output)
# print(df)

print("Run nixlbench and save output to file for analysis")
print("Example: nixlbench ... > nixlbench_output.txt")

## Part 4: VRAM-to-VRAM Benchmarks

Test GPU memory transfer using both RoCE links.

In [None]:
# VRAM-to-VRAM benchmark with GPUDirect RDMA
nixl_vram_cmd = f"""
{NIXLBENCH} \\
  --etcd_endpoints http://{ETCD_IP}:2379 \\
  --backend UCX \\
  --initiator_seg_type VRAM \\
  --target_seg_type VRAM \\
  --total_buffer_size 2GiB \\
  --start_block_size 64KiB \\
  --max_block_size 32MiB \\
  --num_iter 1000 \\
  --warmup_iter 100
"""

print("=== NIXL VRAM-to-VRAM Benchmark (GPUDirect RDMA) ===")
print("Run this command as initiator:")
print(nixl_vram_cmd)
print("\nThis tests GPU-to-GPU transfers over dual RoCE links")
print("Expected bandwidth: ~180-190 Gbps with GPUDirect RDMA")

### Multi-Threading Performance

NIXL supports multiple progress threads to saturate both links.
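
To compare single-threaded and multi-threaded runs systematically, one option is to generate one command per thread count and run them back to back. The sketch below mirrors the commands in this tutorial: the single-threaded case omits the progress-thread flags, and the multi-threaded cases add `--enable_pt` and `--progress_threads`. `thread_sweep` and `BASE_ARGS` are hypothetical names, and the thread counts are illustrative.

```python
import shlex

# Arguments shared by every run (values from this tutorial)
BASE_ARGS = [
    "nixlbench",
    "--etcd_endpoints", "http://192.168.100.10:2379",
    "--backend", "UCX",
    "--initiator_seg_type", "DRAM",
    "--target_seg_type", "DRAM",
    "--total_buffer_size", "4GiB",
    "--start_block_size", "64KiB",
    "--max_block_size", "64MiB",
    "--num_iter", "1000",
]


def thread_sweep(thread_counts, progress_threads=2):
    """Map each thread count to a ready-to-paste nixlbench command."""
    cmds = {}
    for n in thread_counts:
        argv = BASE_ARGS + ["--num_threads", str(n)]
        if n > 1:
            # Progress threads only for the multi-threaded runs
            argv += ["--enable_pt", "--progress_threads", str(progress_threads)]
        cmds[n] = shlex.join(argv)
    return cmds


for n, cmd in thread_sweep([1, 2, 4]).items():
    print(f"# {n} thread(s)\n{cmd}\n")
```

The same command must be started on both nodes for each thread count, target first.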

In [None]:
# Multi-threaded benchmark
nixl_mt_cmd = f"""
{NIXLBENCH} \\
  --etcd_endpoints http://{ETCD_IP}:2379 \\
  --backend UCX \\
  --initiator_seg_type DRAM \\
  --target_seg_type DRAM \\
  --total_buffer_size 4GiB \\
  --start_block_size 64KiB \\
  --max_block_size 64MiB \\
  --num_threads 4 \\
  --enable_pt \\
  --progress_threads 2 \\
  --num_iter 1000
"""

print("=== NIXL Multi-Threaded Benchmark ===")
print(nixl_mt_cmd)
print("\nMulti-threading saturates both RoCE links more efficiently")
print("Expected: Higher aggregate bandwidth vs single-threaded")

## Part 5: Performance Comparison

### Bonding vs NIXL

| Method | Protocol | Max Throughput | Latency | Use Case |
|--------|----------|----------------|---------|----------|
| **Linux Bonding** | TCP/IP over Ethernet | 60-70 Gbps | 50-200 μs | General network traffic |
| **NIXL Single-Thread** | RDMA (UCX) | ~96 Gbps | 1-2 μs | Point-to-point transfers |
| **NIXL Multi-Thread** | RDMA (UCX) | ~176 Gbps | 1-2 μs | Production inference |
| **NCCL** | RDMA collective | ~176 Gbps | Variable | Training (all-reduce) |

NIXL achieves 2.5-3× higher throughput than bonding by:
1. Using native RDMA instead of TCP/IP
2. Bypassing kernel networking stack
3. Multi-rail load balancing across both links
4. Zero-copy GPU transfers with GPUDirect
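
The 2.5-3× figure follows directly from the throughput numbers in the table. As a quick check of that arithmetic (the helper name `speedup_range` is hypothetical, and the inputs are the table's approximate values):

```python
def speedup_range(nixl_gbps, bonding_gbps_range):
    """Return (min, max) speedup of NIXL over a bonding throughput range."""
    low_bond, high_bond = bonding_gbps_range
    # The smallest speedup is against the fastest bonding result, and vice versa
    return nixl_gbps / high_bond, nixl_gbps / low_bond


BONDING_GBPS = (60, 70)  # Linux bonding, from the table
NIXL_MT_GBPS = 176       # NIXL multi-thread, from the table

low, high = speedup_range(NIXL_MT_GBPS, BONDING_GBPS)
print(f"Speedup: {low:.1f}x to {high:.1f}x")
# prints: Speedup: 2.5x to 2.9x
```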

### Expected NIXL Results

**DRAM-to-DRAM (64 MiB blocks):**
```
Block Size: 64.00 MiB, Bandwidth: 21,234.56 MiB/s (~178 Gbps), Latency: 3.12 ms
Percentiles: p50=2.8ms, p95=4.1ms, p99=5.2ms
```

**VRAM-to-VRAM with GPUDirect:**
```
Block Size: 32.00 MiB, Bandwidth: 22,500.00 MiB/s (~189 Gbps), Latency: 1.47 ms
Percentiles: p50=1.4ms, p95=2.1ms, p99=2.8ms
```

**Why NIXL outperforms bonding:**
- Direct RDMA bypasses the kernel (bonding carries TCP/IP over the Ethernet links, with kernel networking overhead)
- UCX backend load-balances across both RoCE links automatically
- Multi-threading keeps both links saturated
- GPUDirect eliminates CPU copies for GPU memory

## Part 6: Cleanup

In [None]:
# Stop and remove ETCD container
print("=== Cleanup ETCD Server ===")
print("docker stop etcd-server")
print("docker rm etcd-server")

## References

- [Multi-Rail Tutorial (02)](02_Multi_Rail_Tutorial.ipynb) - Network setup and bonding
- [InfiniBand Tutorial (01)](01_InfiniBand_Tutorial.ipynb) - NCCL and RDMA basics
- [NIXL GitHub Repository](https://github.com/ai-dynamo/nixl)