# RoCE Link Aggregation: Bonding vs NIXL

This notebook compares two approaches to using dual 100G RoCE links on DGX Spark:

| Approach | Traffic Type | Expected Throughput | Latency |
|----------|--------------|---------------------|----------|
| Linux Bonding | TCP/IP | ~60-70 Gbps | 50-200 μs |
| NIXL | Point-to-point RDMA | ~176 Gbps | 1-2 μs |

**Goal**: Demonstrate why NIXL outperforms bonding for point-to-point inference data movement.

For collective operations (all-reduce, all-gather), see the [first tutorial](01_InfiniBand_Tutorial.ipynb) which covers NCCL.

**Prerequisites**:
- Two DGX Spark systems connected via both RoCE ports
- IP addresses configured on RoCE interfaces
- `perftest` and `iperf3` installed

## Setup and Configuration

In [2]:
import subprocess
import re
import time
import os

# Configuration - Update these for your environment
LOCAL_IP = "192.168.100.11"    # This node's RoCE IP
REMOTE_IP = "192.168.100.10"   # Remote node's RoCE IP
INTERFACE_1 = "enp1s0f0np0"    # First RoCE interface
INTERFACE_2 = "enp1s0f1np1"    # Second RoCE interface

def run_cmd(cmd, timeout=60):
    """Run a shell command and return output."""
    try:
        result = subprocess.run(
            cmd, shell=True, capture_output=True, text=True, timeout=timeout
        )
        return result.stdout + result.stderr
    except subprocess.TimeoutExpired:
        return "Command timed out"

def parse_bandwidth(output, pattern):
    """Extract bandwidth value from command output."""
    match = re.search(pattern, output)
    return float(match.group(1)) if match else None

print("Configuration loaded")
print(f"Local IP: {LOCAL_IP}")
print(f"Remote IP: {REMOTE_IP}")

Configuration loaded
Local IP: 192.168.100.11
Remote IP: 192.168.100.10


## Part 1: Verify Network Interfaces

Check that both RoCE interfaces are available and operational.

In [3]:
# Check interface status
print("=== Network Interfaces ===")
print(run_cmd(f"ip link show {INTERFACE_1}"))
print(run_cmd(f"ip link show {INTERFACE_2}"))

=== Network Interfaces ===
3: enp1s0f0np0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 30:c5:99:3e:6a:13 brd ff:ff:ff:ff:ff:ff

4: enp1s0f1np1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 30:c5:99:3e:6a:14 brd ff:ff:ff:ff:ff:ff



In [2]:
# Check RDMA devices
print("=== RDMA Devices ===")
print(run_cmd("ibv_devinfo"))

=== RDMA Devices ===
hca_id:	rocep1s0f0
	transport:			InfiniBand (0)
	fw_ver:				28.45.4028
	node_guid:			30c5:9903:003e:6a13
	sys_image_guid:			30c5:9903:003e:6a13
	vendor_id:			0x02c9
	vendor_part_id:			4129
	hw_ver:				0x0
	board_id:			NVD0000000087
	phys_port_cnt:			1
		port:	1
			state:			PORT_ACTIVE (4)
			max_mtu:		4096 (5)
			active_mtu:		1024 (3)
			sm_lid:			0
			port_lid:		0
			port_lmc:		0x00
			link_layer:		Ethernet

hca_id:	rocep1s0f1
	transport:			InfiniBand (0)
	fw_ver:				28.45.4028
	node_guid:			30c5:9903:003e:6a14
	sys_image_guid:			30c5:9903:003e:6a13
	vendor_id:			0x02c9
	vendor_part_id:			4129
	hw_ver:				0x0
	board_id:			NVD0000000087
	phys_port_cnt:			1
		port:	1
			state:			PORT_ACTIVE (4)
			max_mtu:		4096 (5)
			active_mtu:		1024 (3)
			sm_lid:			0
			port_lid:		0
			port_lmc:		0x00
			link_layer:		Ethernet

hca_id:	roceP2p1s0f0
	transport:			InfiniBand (0)
	fw_ver:				28.45.4028
	node_guid:			30c5:9903:003e:6a17
	sys_image_guid:			30c5:9903:003e:6a13
	vendor

In [4]:
# Verify connectivity to remote node
print("=== Connectivity Test ===")
print(run_cmd(f"ping -c 3 {REMOTE_IP}"))

=== Connectivity Test ===
PING 192.168.100.10 (192.168.100.10) 56(84) bytes of data.
64 bytes from 192.168.100.10: icmp_seq=1 ttl=64 time=0.565 ms
64 bytes from 192.168.100.10: icmp_seq=2 ttl=64 time=0.444 ms
64 bytes from 192.168.100.10: icmp_seq=3 ttl=64 time=0.886 ms

--- 192.168.100.10 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2032ms
rtt min/avg/max/mdev = 0.444/0.631/0.886/0.186 ms



---

## Part 2: Baseline RDMA Performance (Single Link)

Measure raw RDMA bandwidth on a single 100G link using `ib_write_bw`.

### RoCE GID Index Selection

RoCE requires specifying the correct GID (Global Identifier) index. GID index 0 is typically link-local (`fe80::`) and does not work for routed connections.

Check available GIDs:
```bash
show_gids
```

For RoCEv2 with IPv4, use the entry that shows your IPv4 address with `v2`. On this system that is index 3; confirm with `show_gids`, since the index can differ between hosts:

| Index | Type | Use |
|-------|------|-----|
| 0-1 | fe80:: (link-local) | Not usable for routed connections |
| 2 | IPv4-mapped, RoCEv1 | Legacy, may not work |
| 3 | IPv4-mapped, RoCEv2 | **Use this** |
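The GID table can also be read from sysfs without `show_gids`. The sketch below is a minimal illustration, assuming the usual Linux layout `/sys/class/infiniband/<device>/ports/<port>/gids/<index>` and `gid_attrs/types/<index>`; the helper names `describe_gid` and `list_gids` are hypothetical and not part of any library.

```python
import ipaddress
from pathlib import Path


def describe_gid(gid: str) -> str:
    """Classify a GID string as stored in sysfs (eight colon-separated hex groups)."""
    addr = ipaddress.IPv6Address(gid)
    if int(addr) == 0:
        return "empty"
    if addr.ipv4_mapped is not None:
        return f"IPv4 {addr.ipv4_mapped}"
    if addr.is_link_local:
        return "link-local"
    return "IPv6"


def list_gids(device: str, port: int = 1, sysfs: str = "/sys/class/infiniband"):
    """Yield (index, description, RoCE type) for each populated GID entry."""
    base = Path(sysfs) / device / "ports" / str(port)
    gid_files = sorted((base / "gids").iterdir(), key=lambda p: int(p.name))
    for gid_file in gid_files:
        kind = describe_gid(gid_file.read_text().strip())
        if kind == "empty":
            continue
        try:
            roce_type = (base / "gid_attrs" / "types" / gid_file.name).read_text().strip()
        except OSError:
            roce_type = "unknown"  # type file may be unreadable for some entries
        yield int(gid_file.name), kind, roce_type
```

On a node with RoCE hardware, `for row in list_gids("rocep1s0f0"): print(row)` should list the same entries as `show_gids`; the index to pass to `-x` is the one whose description is your IPv4 address and whose type mentions v2.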

**Run on remote node first:**
```bash
ib_write_bw -d rocep1s0f0 -x 3
```

In [5]:
# Single-link RDMA bandwidth test
# Requires server running on remote: ib_write_bw -d rocep1s0f0 -x 3

# GID index 3 = RoCEv2 with IPv4 address (required for RoCE connections)
GID_INDEX = 3

print("=== RDMA Bandwidth (Single Link - rocep1s0f0) ===")
print(f"NOTE: Start server on remote node: ib_write_bw -d rocep1s0f0 -x {GID_INDEX}")
print()

output = run_cmd(f"ib_write_bw -d rocep1s0f0 -x {GID_INDEX} {REMOTE_IP}")
print(output)

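# Parse result: ib_write_bw prints the units in the column header
# ("BW average[MB/sec]"), not next to the numbers, so match the numeric
# result row instead. Assumed columns (default perftest layout):
# #bytes, #iterations, BW peak, BW average, MsgRate
bw = parse_bandwidth(
    output, r"(?m)^\s*\d+\s+\d+\s+[\d.]+\s+([\d.]+)\s+[\d.]+\s*$"
)
if bw:
    print(f"\n>>> Single Link RDMA: {bw:.0f} MB/s ({bw * 8 / 1000:.1f} Gbps)")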

=== RDMA Bandwidth (Single Link - rocep1s0f0) ===
NOTE: Start server on remote node: ib_write_bw -d rocep1s0f0 -x 3

---------------------------------------------------------------------------------------
                    RDMA_Write BW Test
 Dual-port       : OFF		Device         : rocep1s0f0
 Number of qps   : 1		Transport type : IB
 Connection type : RC		Using SRQ      : OFF
 PCIe relax order: ON
 ibv_wr* API     : ON
 TX depth        : 128
 CQ Moderation   : 1
 Mtu             : 1024[B]
 Link type       : Ethernet
 GID index       : 3
 Max inline data : 0[B]
 rdma_cm QPs	 : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0000 QPN 0x01b8 PSN 0x6c40bb RKey 0x184300 VAddr 0x00f3c34c74d000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:100:11
 remote address: LID 0000 QPN 0x01c1 PSN 0xef8f64 RKey 0x184300 VAddr 0x00fc043705d000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:100:10
---

---

## Part 3: Linux Bonding Performance (TCP/IP)

Bonding aggregates TCP traffic but **does not work with RDMA**.

### Critical Limitation: RDMA and Bonding Are Mutually Exclusive

When interfaces are enslaved to a bond:
- **TCP/IP works** through the bond interface (kernel stack)
- **RDMA fails** because verbs bypass the kernel and cannot traverse bond0

The `show_gids` output shows this: when bonded, GID entries associate with `bond0` instead of the physical interface, breaking RDMA connectivity.

**Testing sequence for this tutorial:**
1. Test RDMA first (Part 2) with unbonded interfaces
2. Create bond and test TCP (Part 3)
3. Remove bond before testing NIXL (Part 4)

### Bond Mode Selection

Linux bonding supports several modes. For direct-connect (no switch), only modes 0 and 2 are practical:

| Mode | Name | Hash Basis | Best For |
|------|------|------------|----------|
| 0 | balance-rr | Per-packet round-robin | Maximum single-flow throughput (causes reordering) |
| 1 | active-backup | None (failover only) | High availability, not performance |
| 2 | balance-xor | IP + port hash (with `xmit_hash_policy layer3+4`) | Multiple flows, preserves packet order |
| 4 | 802.3ad | LACP negotiation | Requires switch support |

**Why balance-xor (mode 2) for this tutorial:**
- Each TCP connection hashes to one interface consistently
- No out-of-order packets (unlike balance-rr)
- Multiple parallel streams distribute across both links
- Single streams limited to one link (~34 Gbps measured) but without reordering overhead

**Trade-off:** balance-rr can achieve higher single-stream throughput by spreading packets across links, but causes TCP reordering which triggers congestion control. balance-xor sacrifices single-stream aggregation for predictable behavior.
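To see why a single TCP connection stays on one link while parallel streams spread out, the sketch below models the `layer3+4` transmit hash. It follows the formula given in the Linux kernel bonding documentation; the in-kernel implementation varies between versions, so treat this as an illustration rather than the exact code path. The function name `layer34_link` is hypothetical, and port 5201 is the iperf3 default.

```python
import ipaddress


def layer34_link(src_ip, dst_ip, src_port, dst_port, n_links=2):
    """Pick a bond member for a flow, modeled on the documented layer3+4 hash."""
    ports = (src_port << 16) | dst_port  # both ports as they appear in the header
    h = ports ^ int(ipaddress.ip_address(src_ip)) ^ int(ipaddress.ip_address(dst_ip))
    h ^= h >> 16
    h ^= h >> 8
    return h % n_links


# Four flows that differ only in their (ephemeral) source port
flows = [("192.168.100.11", "192.168.100.10", port, 5201) for port in range(50000, 50004)]
for flow in flows:
    print(f"source port {flow[2]} -> link {layer34_link(*flow)}")
```

Every packet of a given flow produces the same hash, so it always leaves on the same link; only flows with different address or port tuples can land on the other one.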

### 3.0 Configure Jumbo Frames (MTU 9000)

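RoCE links need jumbo frames (MTU 9000) for good TCP performance, because larger frames reduce per-packet overhead in the kernel stack. With the default MTU of 1500 bytes, or with different MTUs on the two nodes, TCP in this setup showed heavy retransmits and a collapsed congestion window, resulting in near-zero throughput.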

**Symptoms of MTU mismatch:**
- Cwnd (congestion window) stuck at ~1.4 KB
- High retransmit counts
- Near-zero throughput despite successful connection

**Set MTU on both nodes:**
```bash
# On spark-01 (192.168.100.10)
sudo ip link set enp1s0f0np0 mtu 9000
sudo ip link set enp1s0f1np1 mtu 9000

# On spark-02 (192.168.100.11)
sudo ip link set enp1s0f0np0 mtu 9000
sudo ip link set enp1s0f1np1 mtu 9000
sudo ip link set bond0 mtu 9000  # Only if bond exists
```

**Verify on both nodes:**
```bash
cat /sys/class/net/enp1s0f0np0/mtu  # Should show 9000
cat /sys/class/net/enp1s0f1np1/mtu  # Should show 9000
```
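The same check can be scripted from the notebook. This is a minimal sketch that reads the `mtu` file under `/sys/class/net/<interface>/`; the helper name `mtu_problems` and its `sysfs_net` parameter are hypothetical conveniences, not part of any tool used in this tutorial.

```python
from pathlib import Path


def mtu_problems(interfaces, expected=9000, sysfs_net="/sys/class/net"):
    """Return a list of human-readable problems; an empty list means all MTUs match."""
    problems = []
    for name in interfaces:
        try:
            mtu = int((Path(sysfs_net) / name / "mtu").read_text().strip())
        except OSError:
            problems.append(f"{name}: interface not found")
            continue
        if mtu != expected:
            problems.append(f"{name}: MTU {mtu}, expected {expected}")
    return problems
```

For example, `mtu_problems(["enp1s0f0np0", "enp1s0f1np1"])` returns an empty list when both ports are at 9000. It only inspects the local node, so it has to be run on each node separately.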

Both endpoints must use the same MTU. With a mismatch, frames larger than the receiver's MTU are typically dropped, which shows up as retransmits and a collapsed congestion window.

### 3.1 Create the Bond Interface

The cell below reports whether `bond0` already exists on this node and prints the setup commands for both nodes.

In [None]:
# Check if bond already exists on this node
bond_status = run_cmd("cat /proc/net/bonding/bond0 2>/dev/null")

if "Bonding Mode" in bond_status:
    print("Bond interface exists on THIS NODE (spark-02):")
    print(bond_status)
    print()
    print("=" * 70)
    print("⚠️  IMPORTANT: Also verify bond on spark-01 (192.168.100.10)")
    print("=" * 70)
else:
    print("No bond interface found on THIS NODE.")

print()
print("=" * 70)
print("Bond setup commands for BOTH nodes:")
print("=" * 70)
print()
print("--- ON SPARK-01 (192.168.100.10): ---")
print(f"""
sudo modprobe bonding
sudo ip link add bond0 type bond mode balance-xor
sudo ip link set bond0 type bond miimon 100
sudo ip link set bond0 type bond xmit_hash_policy layer3+4

sudo ip link set {INTERFACE_1} down
sudo ip link set {INTERFACE_1} master bond0
sudo ip link set {INTERFACE_1} up

sudo ip link set {INTERFACE_2} down
sudo ip link set {INTERFACE_2} master bond0
sudo ip link set {INTERFACE_2} up

sudo ip addr add 192.168.100.10/24 dev bond0
sudo ip link set bond0 up
""")
print("--- ON SPARK-02 (192.168.100.11) - THIS NODE: ---")
print(f"""
sudo modprobe bonding
sudo ip link add bond0 type bond mode balance-xor
sudo ip link set bond0 type bond miimon 100
sudo ip link set bond0 type bond xmit_hash_policy layer3+4

sudo ip link set {INTERFACE_1} down
sudo ip link set {INTERFACE_1} master bond0
sudo ip link set {INTERFACE_1} up

sudo ip link set {INTERFACE_2} down
sudo ip link set {INTERFACE_2} master bond0
sudo ip link set {INTERFACE_2} up

sudo ip addr add {LOCAL_IP}/24 dev bond0
sudo ip link set bond0 up
""")

### 3.2 Test Bonded TCP Performance

**Before running iperf3, verify connectivity:**
- Ensure bond0 is up on spark-01 with IP 192.168.100.10
- Check: `ip addr show bond0` on spark-01
- Verify ping works from current node to 192.168.100.10

**Run on remote node (spark-01):**
```bash
iperf3 -s -B 192.168.100.10
```

In [None]:
# iperf3 single stream over bond
print("=== TCP Bandwidth (Single Stream) ===")
print("NOTE: Start server on remote: iperf3 -s -B", REMOTE_IP)
print()

output = run_cmd(f"iperf3 -c {REMOTE_IP} -t 10")
print(output)

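# Parse sender bandwidth. In iperf3 output the word "sender" comes AFTER
# the rate on the summary line, so anchor the match on that line.
bw = parse_bandwidth(output, r"(\d+\.?\d*)\s+Gbits/sec.*sender")
if bw:
    print(f"\n>>> TCP Single Stream: {bw:.1f} Gbps")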

In [None]:
# iperf3 multiple streams over bond
print("=== TCP Bandwidth (4 Parallel Streams) ===")
print()

output = run_cmd(f"iperf3 -c {REMOTE_IP} -t 10 -P 4")
print(output)

# Parse sender bandwidth from the [SUM] summary line. The rate precedes
# the word "sender", so match both on the same line.
bw = parse_bandwidth(output, r"\[SUM\].*?(\d+\.?\d*)\s+Gbits/sec.*sender")
if bw:
    print(f"\n>>> TCP 4 Streams: {bw:.1f} Gbps")

### 3.3 Results Summary: TCP vs RDMA

**Measured results:**

| Test | Throughput | Notes |
|------|------------|-------|
| RDMA single link (`ib_write_bw`) | 11,679 MB/s (93.4 Gbps) | Kernel bypass, near line rate |
| TCP single stream over bond (`iperf3`) | 33.7 Gbps | Kernel TCP/IP stack overhead |
| TCP 4 parallel streams over bond (`iperf3 -P 4`) | 93.0 Gbps | Utilizes both bonded links |

**Key observations:**
- RDMA achieves 2.8x the throughput of single-stream TCP over the same link
- TCP bonding with 4 parallel streams matches RDMA single-link performance
- Single TCP stream limited to ~34 Gbps despite 200 Gbps aggregate link capacity
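The figures in the observations follow directly from the table. The short calculation below reproduces them, using the same MB/s to Gbps conversion as the notebook (multiply by 8, divide by 1000); the function name is only illustrative.

```python
def mb_per_sec_to_gbps(mb_per_sec):
    """Convert MB/s to Gbps with the notebook's factor (x8, /1000)."""
    return mb_per_sec * 8 / 1000


rdma_gbps = mb_per_sec_to_gbps(11679)   # single-link ib_write_bw result
tcp_single_gbps = 33.7                  # single iperf3 stream over the bond
tcp_four_gbps = 93.0                    # four parallel iperf3 streams

ratio = rdma_gbps / tcp_single_gbps
bond_utilization = tcp_four_gbps / 200  # two 100G links in the bond

print(f"RDMA single link: {rdma_gbps:.1f} Gbps")
print(f"RDMA vs single-stream TCP: {ratio:.1f}x")
print(f"Bond utilization with 4 streams: {bond_utilization:.1%}")
```

Four parallel TCP streams therefore use a little under half of the bond's 200 Gbps aggregate capacity.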

### Why TCP Underperforms

The TCP/IP stack introduces overhead at every layer:
- **System calls**: Each send/recv crosses user-kernel boundary
- **Buffer copies**: Data copied between user space and kernel buffers
- **Protocol processing**: TCP segmentation, checksums, congestion control
- **Interrupt handling**: Each packet generates CPU interrupts

RDMA bypasses all of this. The NIC reads/writes directly to application memory.

### Bonding Limitations

Balance-xor bonding requires multiple TCP connections to utilize both links:

| Configuration | Single Stream | Multiple Streams |
|---------------|---------------|------------------|
| Direct interface (no bond) | 34 Gbps | 34 Gbps per stream |
| balance-xor bond | 34 Gbps (one link) | 93 Gbps (4 streams distributed) |
| RDMA (no bond possible) | 93 Gbps | 186 Gbps (dual-rail) |

**Implication for ML workloads:** Large tensor transfers are single logical connections. TCP bonding peaks at 34 Gbps for single streams, requiring application-level parallelism to utilize both links. RDMA achieves 93 Gbps per link with a single connection, making NIXL and NCCL essential for high-bandwidth inference data movement.

### 3.4 Testing RDMA Dual-Link Performance (Optional)

The single-link RDMA test above shows 93.4 Gbps on one port. The [first tutorial](01_InfiniBand_Tutorial.ipynb) covers basic RDMA testing on individual ports. To verify whether the hardware can achieve **aggregate throughput across both ports simultaneously** (~186 Gbps), run two `ib_write_bw` tests in parallel.

This requires two separate IP networks and manual coordination:

**On spark-01 (server) - two terminals:**
```bash
# Terminal 1
ib_write_bw -d rocep1s0f0 -x 3 -p 18515 -F
# Terminal 2 (separate control port so the two servers do not collide)
ib_write_bw -d rocep1s0f1 -x 3 -p 18516 -F
```

**On spark-02 (client) - two terminals:**
```bash
# Terminal 1
ib_write_bw -d rocep1s0f0 -x 3 -p 18515 -F 192.168.100.10
# Terminal 2 (requires a second IP network; confirm its GID index with show_gids)
ib_write_bw -d rocep1s0f1 -x 3 -p 18516 -F 192.168.200.10
```

**Expected:** Each process reports ~93.4 Gbps, aggregate ~186.8 Gbps. If significantly lower, check PCIe/memory bandwidth limits.

**Note:** This tests raw hardware capability. The NIXL dual-rail section below shows application-level performance with UCX.
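Starting both clients by hand makes it hard to overlap the two runs. The sketch below launches them concurrently from Python. It assumes the servers on spark-01 were started with matching `-p` values (18515 and 18516 here) and that GID index 3 is valid on both rails; `build_client_cmds` and `run_all` are hypothetical helper names.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_client_cmds(rails, gid_index=3, base_port=18515):
    """Build one ib_write_bw client command per (rdma_device, server_ip) pair."""
    return [
        ["ib_write_bw", "-d", dev, "-x", str(gid_index),
         "-p", str(base_port + i), "-F", server_ip]
        for i, (dev, server_ip) in enumerate(rails)
    ]


def run_rail(cmd, timeout=60):
    """Run one client to completion and return its combined output."""
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    return result.stdout + result.stderr


def run_all(cmds):
    """Start every client at roughly the same time, one thread per rail."""
    with ThreadPoolExecutor(max_workers=len(cmds)) as pool:
        return list(pool.map(run_rail, cmds))


RAILS = [("rocep1s0f0", "192.168.100.10"), ("rocep1s0f1", "192.168.200.10")]
CLIENT_CMDS = build_client_cmds(RAILS)
```

Calling `run_all(CLIENT_CMDS)` returns the two reports in rail order; add up the per-rail average bandwidth to get the aggregate figure.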

### 3.5 Monitor Bond Traffic Distribution

During active transfers, verify traffic flows through both interfaces.

In [None]:
# Check per-interface statistics
print("=== Interface Statistics ===")
print(f"\n{INTERFACE_1}:")
print(run_cmd(f"ip -s link show {INTERFACE_1} | grep -A 2 'RX\\|TX'"))
print(f"\n{INTERFACE_2}:")
print(run_cmd(f"ip -s link show {INTERFACE_2} | grep -A 2 'RX\\|TX'"))
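Cumulative counters are hard to compare by eye. A minimal sketch for turning two counter snapshots into a per-interface share of transmitted bytes is shown below; it assumes the standard `/sys/class/net/<interface>/statistics/tx_bytes` counters, and the helper names are hypothetical.

```python
from pathlib import Path


def read_tx_bytes(interfaces, sysfs_net="/sys/class/net"):
    """Snapshot the cumulative transmit byte counter of each interface."""
    return {
        name: int((Path(sysfs_net) / name / "statistics" / "tx_bytes").read_text())
        for name in interfaces
    }


def tx_share(before, after):
    """Return each interface's fraction of the bytes sent between two snapshots."""
    deltas = {name: after[name] - before[name] for name in before}
    total = sum(deltas.values())
    return {name: (delta / total if total else 0.0) for name, delta in deltas.items()}
```

Take one snapshot with `read_tx_bytes([INTERFACE_1, INTERFACE_2])`, wait a few seconds while `iperf3 -P 4` is running, take a second snapshot, and pass both to `tx_share`. A result close to 0.5 per interface means the hash is spreading flows evenly; 1.0 and 0.0 means every flow landed on one link.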

---

## Part 4: NIXL Point-to-Point Transfers

NIXL provides direct RDMA transfers for point-to-point workloads (KV-cache, tensor shards).

### 4.0 Remove Bond Interface (Required)

RDMA memory registration fails when network interfaces are enslaved to a bond. The verbs API requires direct access to the physical device, but bonded interfaces associate with `bond0` instead of the underlying hardware.

**Why bond must be removed:**
- When interfaces join a bond, the kernel reassigns their identity
- GID entries point to `bond0` instead of `rocep1s0f0`/`rocep1s0f1`
- RDMA operations fail because `bond0` has no verbs capability
- NIXL `register_memory()` returns empty descriptors or raises exceptions
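Whether a port is still enslaved can be checked programmatically before starting NIXL. The sketch below relies on the `master` symlink that Linux creates under `/sys/class/net/<interface>/` when an interface joins a bond or bridge; `bond_master` is a hypothetical helper name.

```python
import os


def bond_master(interface, sysfs_net="/sys/class/net"):
    """Return the name of the device this interface is enslaved to, or None."""
    link = os.path.join(sysfs_net, interface, "master")
    if not os.path.islink(link):
        return None
    # The symlink target looks like "../bond0"; keep only the device name.
    return os.path.basename(os.readlink(link))
```

`bond_master("enp1s0f0np0")` should return `"bond0"` while the bond from Part 3 is active and `None` once the interface has been released with `nomaster`.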

**Remove bond on both nodes before proceeding:**

```bash
# Check if bond exists
cat /proc/net/bonding/bond0 2>/dev/null

# Remove bond
sudo ip link set bond0 down
sudo ip link set enp1s0f0np0 nomaster
sudo ip link set enp1s0f1np1 nomaster
sudo ip link delete bond0

# Bring interfaces back up
sudo ip link set enp1s0f0np0 up
sudo ip link set enp1s0f1np1 up

# Restore IP addresses
# spark-01: sudo ip addr add 192.168.100.10/24 dev enp1s0f0np0
# spark-02: sudo ip addr add 192.168.100.11/24 dev enp1s0f0np0
```

**Verify RDMA devices are accessible:**
```bash
ibdev2netdev
# Should show: rocep1s0f0 port 1 ==> enp1s0f0np0 (Up)
#             rocep1s0f1 port 1 ==> enp1s0f1np1 (Up)
```

### 4.1 Verify NIXL Installation

In [7]:
# Check NIXL installation
try:
    from nixl._api import nixl_agent, nixl_agent_config
    print("NIXL is installed")
    
    # Check UCX devices
    print("\n=== UCX Device Detection ===")
    print(run_cmd("ucx_info -d 2>/dev/null | grep -E 'mlx5|Transport' | head -20"))
except ImportError:
    print("NIXL not installed. Install with:")
    print("  pip install nixl[cu13]")

NIXL is installed

=== UCX Device Detection ===
#      Transport: self
#      Transport: tcp
#      Transport: tcp
#      Transport: tcp
#      Transport: tcp
#      Transport: tcp
#      Transport: tcp
#      Transport: tcp
#      Transport: sysv
#      Transport: posix
#      Transport: dc_mlx5
#      Transport: rc_verbs
#      Transport: rc_mlx5
#      Transport: ud_verbs
#      Transport: ud_mlx5
#      Transport: dc_mlx5
#      Transport: rc_verbs
#      Transport: rc_mlx5
#      Transport: ud_verbs
#      Transport: ud_mlx5



### 4.2 NIXL Local Memory Registration Test

Test NIXL memory registration and descriptor creation (single-node validation).

In [8]:
# NIXL GPU memory registration test (local validation)
try:
    import torch
    import os
    os.environ["PATH"] = "/usr/local/ucx/bin:" + os.environ.get("PATH", "")
    os.environ["LD_LIBRARY_PATH"] = "/usr/local/ucx/lib:" + os.environ.get("LD_LIBRARY_PATH", "")
    from nixl._api import nixl_agent, nixl_agent_config
    
    # Create NIXL agent
    config = nixl_agent_config(
        enable_prog_thread=True,
        backends=["UCX"]
    )
    
    agent = nixl_agent("test_agent", config)
    print("NIXL agent created successfully")
    
    # GPU memory registration
    if not torch.cuda.is_available():
        raise RuntimeError("CUDA not available. GPU required for this test.")
    
    print("\nRegistering GPU memory...")
    device = torch.device("cuda:0")
    tensor = torch.ones((1024, 1024), dtype=torch.float32, device=device)
    print(f"Allocated GPU tensor: {tensor.shape}, {tensor.numel() * 4 / 1e6:.1f} MB")
    
    reg_descs = agent.register_memory(tensor)
    
    if reg_descs:
        print("GPU memory registration: SUCCESS")
        
        # Get transfer descriptors
        xfer_descs = agent.get_xfer_descs([tensor])
        desc_str = agent.get_serialized_descs(xfer_descs)
        print(f"Descriptor size: {len(desc_str)} bytes")
        print("Descriptor serialization: SUCCESS")
        
        print("\n" + "=" * 60)
        print("Status: NIXL agent functional with GPU memory")
        print("=" * 60)
    else:
        print("GPU memory registration: FAILED")
        print("\nPossible causes:")
        print("  - UCX not compiled with CUDA support (check: ucx_info -d)")
        print("  - GPU memory not accessible via RDMA")
        print("  - RoCE adapters not configured for GPUDirect RDMA")
        
except ImportError as e:
    print(f"NIXL not available: {e}")
except Exception as e:
    print(f"Error: {e}")
    import traceback
    traceback.print_exc()

2026-02-02 12:39:17 NIXL INFO    _api.py:363 Backend UCX was instantiated
2026-02-02 12:39:17 NIXL INFO    _api.py:253 Initialized NIXL agent: test_agent
NIXL agent created successfully

Registering GPU memory...
Allocated GPU tensor: torch.Size([1024, 1024]), 4.2 MB
GPU memory registration: SUCCESS
Descriptor size: 163 bytes
Descriptor serialization: SUCCESS

Status: NIXL agent functional with GPU memory


    Found GPU0 NVIDIA GB10 which is of cuda capability 12.1.
    Minimum and Maximum cuda capability supported by this version of PyTorch is
    (8.0) - (12.0)
    


### 4.3 NIXL Two-Node Transfer Test

For a full RDMA transfer test, run the target script on the remote node (spark-01) and the initiator from this notebook (spark-02). The platform notes in 4.3.1 explain what to expect from GPU buffers before the scripts are run.

### 4.3.1 GPU Memory Registration Status

**DGX Spark limitation:** GPUDirect RDMA is not supported on DGX Spark. The platform uses a unified memory architecture where GPU-allocated pinned memory is not coherently accessible from PCIe devices. As a result, `nvidia-peermem` and GDRCopy do not work on this platform.

**GPU memory still works:** While GPUDirect RDMA is unavailable, GPU memory allocation and registration succeed using UCX's `cuda_copy` and `cuda_ipc` transports. These transports stage data through host memory, adding latency but still providing GPU-to-GPU transfer capability.

**Performance impact:**
- Without GPUDirect RDMA (DGX Spark): Uses staging through host memory via `cuda_copy`
- With GPUDirect RDMA (HGX/DGX H100): NIC directly accesses GPU memory (1.5-2x faster for large transfers)

**Working example from NIXL repository:**

The [basic_two_peers.py](https://github.com/ai-dynamo/nixl/blob/main/examples/python/basic_two_peers.py) example works with GPU memory on DGX Spark:

```bash
# On target node (spark-01)
python3 basic_two_peers.py --mode=target --use_cuda=true --ip=192.168.100.10 --port=4242

# On initiator node (spark-02)
python3 basic_two_peers.py --mode=initiator --use_cuda=true --ip=192.168.100.10 --port=4242
```

This demonstrates that GPU memory works with NIXL on DGX Spark, even though transfers bounce through host memory internally.

**To check UCX GPU support:**

```bash
# Verify UCX CUDA transports are available
ucx_info -d | grep -i cuda
# Should show: cuda_copy and cuda_ipc transports
```

**The examples below use GPU memory** to demonstrate that GPU-to-GPU transfers work on DGX Spark. While transfers stage through host memory (no zero-copy GPUDirect RDMA), using GPU memory is still beneficial as it avoids explicit Python-level CPU staging.

**Performance comparison: CPU vs GPU memory on DGX Spark:**

CPU memory path achieves higher throughput because the RDMA operation completes in one step:
- RDMA READ directly into host DRAM over `rc_mlx5` transport
- NIC writes directly to registered host memory buffer
- Result: 80-100 Gbps (limited by Python/NIXL overhead, not hardware)

GPU memory path is slower because it requires staging through host memory:
- RDMA READ into temporary host bounce buffer
- UCX `cuda_copy` transport copies host buffer to GPU memory
- Additional synchronization and memory registration overhead
- Result: Lower throughput, especially for one-sided RDMA READ operations

The performance difference is a direct result of the missing zero-copy path. On platforms with GPUDirect RDMA (HGX, DGX H100), the NIC writes directly to GPU memory and both paths achieve similar throughput. CPU memory appearing faster than GPU memory is the expected behavior on DGX Spark; it confirms that GPU buffers are not on a zero-copy RDMA path.
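A rough way to reason about the staged path: if the network stage and the host-to-GPU copy run one after the other without overlap, their times add, so the effective rate is lower than either stage alone. The sketch below encodes only that arithmetic. The input rates are placeholders chosen for illustration, not measurements from DGX Spark, and UCX may pipeline the stages in practice.

```python
def staged_throughput_gbps(rdma_gbps, copy_gbps):
    """Effective rate of two strictly serialized stages moving the same bytes.

    time = size / rdma + size / copy, hence rate = rdma * copy / (rdma + copy).
    """
    return rdma_gbps * copy_gbps / (rdma_gbps + copy_gbps)


# Placeholder rates (hypothetical): both stages equally fast
print(staged_throughput_gbps(90.0, 90.0))
```

Under this model, even a copy stage as fast as the network halves the end-to-end rate, which is why the single-step CPU path comes out ahead when no zero-copy route to GPU memory exists.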

In [2]:
# Target node script (run on remote node)
target_script = '''#!/usr/bin/env python3
# target_node.py - Run on spark-01

import os
os.environ["PATH"] = "/usr/local/ucx/bin:" + os.environ.get("PATH", "")
os.environ["LD_LIBRARY_PATH"] = "/usr/local/ucx/lib:" + os.environ.get("LD_LIBRARY_PATH", "")
os.environ["UCX_NET_DEVICES"] = "rocep1s0f0:1"

import time
import torch
from nixl._api import nixl_agent, nixl_agent_config

config = nixl_agent_config(
    enable_prog_thread=True,
    enable_listen_thread=True,
    listen_port=5555,
    backends=["UCX"]
)

agent = nixl_agent("target", config)
print("NIXL agent created")

# GPU memory (works via cuda_copy transport which stages through host memory)
tensor = torch.ones((4096, 4096), dtype=torch.float32, device="cuda:0")
print(f"Target tensor: {tensor.shape}, {tensor.numel() * 4 / 1e6:.1f} MB (GPU)")

agent.register_memory(tensor)
print("Memory registered")

target_descs = agent.get_xfer_descs([tensor])
desc_str = agent.get_serialized_descs(target_descs)
print(f"Descriptor ready ({len(desc_str)} bytes)")

print("Waiting for initiator...")
while not agent.check_remote_metadata("initiator"):
    time.sleep(0.1)

agent.send_notif("initiator", desc_str)
print("Sent descriptors to initiator")

print("Waiting for transfer completion...")
while True:
    notifs = agent.get_new_notifs()
    if "initiator" in notifs:
        for notif in notifs["initiator"]:
            if b"done" in notif:
                print("Transfer completed!")
                break
        else:
            continue
        break
    time.sleep(0.1)

print("Target finished")
'''

print("=== Target Node Script ===")
print("Save to spark-01 as ~/target_node.py and run:")
print("  .venv/bin/python3 ~/target_node.py")
print()
print("NOTE: If 'Address already in use' error, kill existing process:")
print("  pkill -f target_node.py")
print()
print(target_script)

=== Target Node Script ===
Save to spark-01 as ~/target_node.py and run:
  .venv/bin/python3 ~/target_node.py

NOTE: If 'Address already in use' error, kill existing process:
  pkill -f target_node.py

In [3]:
# Initiator node script - run directly from this notebook
# PREREQUISITE: Target must be running on spark-01 and showing "Waiting for initiator..."

import os
import sys

# UCX environment - must be set BEFORE importing nixl
# Note: LD_LIBRARY_PATH set here is only inherited by child processes; the
# dynamic loader of this already-running Python process read it at startup.
os.environ["PATH"] = "/usr/local/ucx-1.20-cuda/bin:" + os.environ.get("PATH", "")
os.environ["LD_LIBRARY_PATH"] = "/usr/local/ucx-1.20-cuda/lib:" + os.environ.get("LD_LIBRARY_PATH", "")
os.environ["UCX_NET_DEVICES"] = "rocep1s0f0:1"

# Optional UCX logging (logs go to stderr); uncomment to enable
# os.environ["UCX_LOG_LEVEL"] = "debug"  # Options: error, warn, info, debug, trace
# os.environ["UCX_LOG_PRINT_ENABLE"] = "y"

import time
import torch
from nixl._api import nixl_agent, nixl_agent_config

TARGET_IP = REMOTE_IP  # Uses REMOTE_IP from setup cell
TARGET_PORT = 5555

config = nixl_agent_config(
    enable_prog_thread=True,
    enable_listen_thread=True,
    listen_port=0,
    backends=["UCX"]
)

print("Creating NIXL agent (check for UCX transport selection in logs)...")
sys.stderr.flush()
agent = nixl_agent("initiator", config)
sys.stderr.flush()
print("NIXL agent created")

# GPU memory (works via cuda_copy transport which stages through host memory)
local_tensor = torch.zeros((4096, 4096), dtype=torch.float32, device="cuda:0")
print(f"Local tensor: {local_tensor.shape}, {local_tensor.numel() * 4 / 1e6:.1f} MB (GPU)")

print("Registering memory (watch for cuda_copy transport selection)...")
sys.stderr.flush()
agent.register_memory(local_tensor)
sys.stderr.flush()
print("Memory registered")

print(f"Connecting to target at {TARGET_IP}:{TARGET_PORT}")
agent.fetch_remote_metadata("target", TARGET_IP, TARGET_PORT)
agent.send_local_metadata(TARGET_IP, TARGET_PORT)

print("Waiting for descriptors...")
notifs = agent.get_new_notifs()
while "target" not in notifs or len(notifs["target"]) == 0:
    time.sleep(0.1)
    notifs = agent.get_new_notifs()

remote_descs = agent.deserialize_descs(notifs["target"][0])
local_descs = agent.get_xfer_descs([local_tensor])
print("Received remote descriptors")

print("Starting RDMA READ (64 MB)...")
sys.stderr.flush()
start_time = time.perf_counter()

xfer_handle = agent.initialize_xfer("READ", local_descs, remote_descs, "target", "done")
agent.transfer(xfer_handle)

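# Poll until the transfer finishes; stop on error instead of spinning forever
while True:
    state = agent.check_xfer_state(xfer_handle)
    if state == "DONE":
        break
    if state == "ERR":
        raise RuntimeError("NIXL transfer entered the error state")
    time.sleep(0.001)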

elapsed = time.perf_counter() - start_time
sys.stderr.flush()
size_mb = local_tensor.numel() * 4 / 1e6
throughput_gbps = (size_mb * 8) / (elapsed * 1000)

print(f"Transfer complete: {size_mb:.1f} MB in {elapsed*1000:.2f} ms")
print(f"Throughput: {throughput_gbps:.1f} Gbps")

expected = torch.ones((4096, 4096), dtype=torch.float32, device="cuda:0")
if torch.allclose(local_tensor, expected):
    print("Data verification: PASSED")
else:
    print("Data verification: FAILED")

print("Initiator finished")

Creating NIXL agent (check for UCX transport selection in logs)...
2026-02-02 15:01:47 NIXL INFO    _api.py:363 Backend UCX was instantiated
2026-02-02 15:01:47 NIXL INFO    _api.py:253 Initialized NIXL agent: initiator


    Found GPU0 NVIDIA GB10 which is of cuda capability 12.1.
    Minimum and Maximum cuda capability supported by this version of PyTorch is
    (8.0) - (12.0)
    


NIXL agent created
Local tensor: torch.Size([4096, 4096]), 67.1 MB (GPU)
Registering memory (watch for cuda_copy transport selection)...
Memory registered
Connecting to target at 192.168.100.10:5555
Waiting for descriptors...
Received remote descriptors
Starting RDMA READ (64 MB)...
Transfer complete: 67.1 MB in 152.46 ms
Throughput: 3.5 Gbps
Data verification: PASSED
Initiator finished
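
The throughput line in this output can be cross-checked against the single-link RDMA result from Part 3. The sketch below repeats the calculation from raw bytes and seconds and adds the time the same buffer would need at 93.4 Gbps; the helper name is illustrative only.

```python
def throughput_gbps(num_bytes, seconds):
    """Throughput in Gbps (decimal) for a transfer of num_bytes in seconds."""
    return num_bytes * 8 / seconds / 1e9


size_bytes = 4096 * 4096 * 4            # float32 tensor used in the test
measured_gbps = throughput_gbps(size_bytes, 152.46e-3)
ideal_ms = size_bytes * 8 / 93.4e9 * 1e3  # same buffer at the ib_write_bw rate

print(f"Measured: {measured_gbps:.1f} Gbps")
print(f"Time at 93.4 Gbps: {ideal_ms:.1f} ms (measured: 152.46 ms)")
```

Note that the timed region in the initiator cell also covers `initialize_xfer` and the 1 ms polling sleep, so for a buffer this small the measured figure includes setup cost as well as wire time.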


In [20]:
# Dual-rail target script (print for remote node)
dual_target_script = '''#!/usr/bin/env python3
# dual_rail_target.py - Run on spark-01

import os
os.environ["PATH"] = "/usr/local/ucx-1.20-cuda/bin:" + os.environ.get("PATH", "")
os.environ["LD_LIBRARY_PATH"] = "/usr/local/ucx-1.20-cuda/lib:" + os.environ.get("LD_LIBRARY_PATH", "")
os.environ["UCX_NET_DEVICES"] = "rocep1s0f0:1,rocep1s0f1:1"
os.environ["UCX_TLS"] = "rc,cuda,tcp,sm,self"
os.environ["UCX_MAX_RNDV_LANES"] = "2"
os.environ["UCX_MAX_EAGER_LANES"] = "2"
#os.environ["UCX_RNDV_THRESH"] = "65536"

import time
import torch
from nixl._api import nixl_agent, nixl_agent_config

config = nixl_agent_config(
    enable_prog_thread=True,
    enable_listen_thread=True,
    listen_port=5556,
    backends=["UCX"]
)

agent = nixl_agent("target", config)
print("NIXL agent created (dual-rail)")

# GPU memory buffer (1 GB)
tensor = torch.ones((16384, 16384), dtype=torch.float32, device="cuda:0")
size_mb = tensor.numel() * 4 / 1e6
print(f"Target tensor: {tensor.shape}, {size_mb:.1f} MB (GPU)")

agent.register_memory(tensor)
print("Memory registered")

target_descs = agent.get_xfer_descs([tensor])
desc_str = agent.get_serialized_descs(target_descs)
print(f"Descriptor ready ({len(desc_str)} bytes)")

print("Waiting for initiator...")
while not agent.check_remote_metadata("initiator"):
    time.sleep(0.1)

agent.send_notif("initiator", desc_str)
print("Sent descriptors to initiator")

print("Waiting for transfer completion...")
while True:
    notifs = agent.get_new_notifs()
    if "initiator" in notifs:
        for notif in notifs["initiator"]:
            if b"done" in notif:
                print("Transfer completed!")
                break
        else:
            continue
        break
    time.sleep(0.1)

print("Target finished")
'''

print("=== Dual-Rail Target Node Script ===")
print("Save to spark-01 as ~/dual_rail_target.py and run:")
print("  .venv/bin/python3 ~/dual_rail_target.py")
print()
print("NOTE: If 'Address already in use' error, kill existing process:")
print("  pkill -f dual_rail_target.py")
print()
print(dual_target_script)

=== Dual-Rail Target Node Script ===
Save to spark-01 as ~/dual_rail_target.py and run:
  .venv/bin/python3 ~/dual_rail_target.py

NOTE: If 'Address already in use' error, kill existing process:
  pkill -f dual_rail_target.py

#!/usr/bin/env python3
# dual_rail_target.py - Run on spark-01

import os
os.environ["PATH"] = "/usr/local/ucx-1.20-cuda/bin:" + os.environ.get("PATH", "")
os.environ["LD_LIBRARY_PATH"] = "/usr/local/ucx-1.20-cuda/lib:" + os.environ.get("LD_LIBRARY_PATH", "")
os.environ["UCX_NET_DEVICES"] = "rocep1s0f0:1,rocep1s0f1:1"
os.environ["UCX_TLS"] = "rc,cuda,tcp,sm,self"
os.environ["UCX_MAX_RNDV_LANES"] = "2"
os.environ["UCX_MAX_EAGER_LANES"] = "2"
#os.environ["UCX_RNDV_THRESH"] = "65536"

import time
import torch
from nixl._api import nixl_agent, nixl_agent_config

config = nixl_agent_config(
    enable_prog_thread=True,
    enable_listen_thread=True,
    listen_port=5556,
    backends=["UCX"]
)

agent = nixl_agent("target", config)
print("NIXL agent created (dual-rail)

In [None]:
# Dual-rail initiator - run directly from this notebook
# PREREQUISITE: Target must be running on spark-01 and showing "Waiting for initiator..."

import os
os.environ["PATH"] = "/usr/local/ucx-1.20-cuda/bin:" + os.environ.get("PATH", "")
os.environ["LD_LIBRARY_PATH"] = "/usr/local/ucx-1.20-cuda/lib:" + os.environ.get("LD_LIBRARY_PATH", "")
os.environ["UCX_NET_DEVICES"] = "rocep1s0f0:1,rocep1s0f1:1"
os.environ["UCX_TLS"] = "rc,cuda,tcp,sm,self"
os.environ["UCX_MAX_RNDV_LANES"] = "2"
os.environ["UCX_MAX_EAGER_LANES"] = "2"
os.environ["UCX_RNDV_THRESH"] = "65536"

import time
import torch
from nixl._api import nixl_agent, nixl_agent_config

TARGET_IP = REMOTE_IP  # Uses REMOTE_IP from setup cell
TARGET_PORT = 5556

config = nixl_agent_config(
    enable_prog_thread=True,
    enable_listen_thread=True,
    listen_port=0,
    backends=["UCX"]
)

agent = nixl_agent("initiator", config)
print("NIXL agent created (dual-rail)")

# GPU memory buffer (1 GB)
local_tensor = torch.zeros((16384, 16384), dtype=torch.float32, device="cuda:0")
size_mb = local_tensor.numel() * 4 / 1e6
print(f"Local tensor: {local_tensor.shape}, {size_mb:.1f} MB (GPU)")

agent.register_memory(local_tensor)
print("Memory registered")

print(f"Connecting to target at {TARGET_IP}:{TARGET_PORT}")
agent.fetch_remote_metadata("target", TARGET_IP, TARGET_PORT)
agent.send_local_metadata(TARGET_IP, TARGET_PORT)

print("Waiting for descriptors...")
notifs = agent.get_new_notifs()
while "target" not in notifs or len(notifs["target"]) == 0:
    time.sleep(0.1)
    notifs = agent.get_new_notifs()

remote_descs = agent.deserialize_descs(notifs["target"][0])
local_descs = agent.get_xfer_descs([local_tensor])
print("Received remote descriptors")

print(f"Starting RDMA READ ({size_mb:.0f} MB)...")
start_time = time.perf_counter()

xfer_handle = agent.initialize_xfer("READ", local_descs, remote_descs, "target", "done")
agent.transfer(xfer_handle)

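# Poll until the transfer finishes; stop on error instead of spinning forever
while True:
    state = agent.check_xfer_state(xfer_handle)
    if state == "DONE":
        break
    if state == "ERR":
        raise RuntimeError("NIXL transfer failed")
    time.sleep(0.001)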

elapsed = time.perf_counter() - start_time
throughput_gbps = (size_mb * 8) / (elapsed * 1000)

print(f"Transfer complete: {size_mb:.1f} MB in {elapsed*1000:.2f} ms")
print(f"Throughput: {throughput_gbps:.1f} Gbps")

print("Initiator finished")

2026-02-02 13:48:13 NIXL INFO    _api.py:363 Backend UCX was instantiated
2026-02-02 13:48:13 NIXL INFO    _api.py:253 Initialized NIXL agent: initiator
NIXL agent created (dual-rail)


    Found GPU0 NVIDIA GB10 which is of cuda capability 12.1.
    Minimum and Maximum cuda capability supported by this version of PyTorch is
    (8.0) - (12.0)
    


Local tensor: torch.Size([16384, 16384]), 1073.7 MB (GPU)
Memory registered
Connecting to target at 192.168.100.10:5556
Waiting for descriptors...
Received remote descriptors
Starting RDMA READ (1074 MB)...
Transfer complete: 1073.7 MB in 1761.09 ms
Throughput: 4.9 Gbps
Initiator finished


### Why GPU Memory is Slower than CPU Memory on DGX Spark

**Observed Results:**
| Memory Type | Throughput |
|-------------|------------|
| CPU (DRAM)  | ~83 Gbps   |
| GPU (VRAM)  | ~4 Gbps    |

**Root Cause: No GPUDirect RDMA on DGX Spark**

DGX Spark uses a unified memory architecture, but the RoCE NIC cannot directly access GPU memory. When you register GPU memory with NIXL/UCX, transfers are staged through host bounce buffers:

```
CPU Memory Path (Fast):
┌─────────┐    RDMA     ┌─────────┐
│  CPU    │ ──────────► │  CPU    │   Direct NIC-to-memory transfer
│  DRAM   │  ~83 Gbps   │  DRAM   │   No CPU involvement
└─────────┘             └─────────┘

GPU Memory Path (Slow):
┌─────────┐  cuda_copy  ┌─────────┐    RDMA     ┌─────────┐  cuda_copy  ┌─────────┐
│  GPU    │ ──────────► │  CPU    │ ──────────► │  CPU    │ ──────────► │  GPU    │
│  VRAM   │   PCIe      │ bounce  │  ~83 Gbps   │ bounce  │   PCIe      │  VRAM   │
└─────────┘             │ buffer  │             │ buffer  │             └─────────┘
                        └─────────┘             └─────────┘
```

**The GPU path requires:**
1. `cuda_copy`: GPU → CPU bounce buffer (PCIe bandwidth limited)
2. RDMA transfer: CPU → CPU (fast)
3. `cuda_copy`: CPU bounce buffer → GPU (PCIe bandwidth limited)
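
As a rough sanity check on the three-step path above, the sketch below estimates an upper bound on staged throughput. It assumes the three stages run strictly one after another with no overlap, a staging copy rate of about 15 GB/s (the figure quoted in the data-path diagram later in this notebook), and the ~83 Gbps RDMA rate measured for CPU memory. The helper name `serialized_throughput_gbps` is hypothetical.

```python
def serialized_throughput_gbps(stage_rates_gbps):
    """Effective rate when each stage must finish before the next one starts.

    Time per bit is the sum of the per-stage times, so the rates combine
    like resistors in parallel.
    """
    return 1.0 / sum(1.0 / rate for rate in stage_rates_gbps)

copy_gbps = 15 * 8   # assumed staging copy rate: ~15 GB/s expressed in Gbps
rdma_gbps = 83.0     # CPU-to-CPU RDMA rate from the table above

bound = serialized_throughput_gbps([copy_gbps, rdma_gbps, copy_gbps])
print(f"Serialized three-stage bound: {bound:.1f} Gbps")
```

Under these assumptions the bound comes out at roughly 35 Gbps, well above the ~4 Gbps measured. Raw copy bandwidth alone therefore may not account for the whole slowdown. Per-chunk staging and synchronization overhead are plausible additional contributors, although this notebook does not measure them.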

**Verification with UCX logging:**

Set these environment variables to see which transports UCX selects:
```bash
export UCX_LOG_LEVEL=info
export UCX_LOG_PRINT_ENABLE=y
```

You'll see:
- CPU memory: Uses `rc_mlx5` (direct RDMA)
- GPU memory: Uses `cuda_copy` + `rc_mlx5` (staged transfer)

**On platforms with GPUDirect RDMA** (DGX H100, HGX):
- NIC registers GPU memory directly via `nvidia-peermem` kernel module
- GPU-to-GPU transfers bypass CPU entirely
- Both paths achieve similar throughput (~80+ Gbps)

### Visualizing Host-Staged GPU Transfers

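The following diagnostic simulates the host-staged path that GPU transfers take when `nvidia_peermem` is not loaded. It copies a tensor with PyTorch rather than instrumenting UCX, so treat the result as supporting evidence, not proof. Host staging can be observed by: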

1. **Monitoring host memory** during GPU transfers (bounce buffers appear in RAM)
2. **Comparing transfer latency** between GPU and CPU memory
3. **Checking UCX transport selection** in debug logs

If GPUDirect RDMA were active, GPU transfers would bypass host memory entirely.
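
The module check that this diagnostic relies on can be sketched as a small helper. This is a minimal example, assuming the standard `/proc/modules` layout in which the module name is the first whitespace-separated field on each line. The function name `module_loaded` and the sample text are illustrative only. Matching the exact name avoids false positives from a substring search.

```python
def module_loaded(name, modules_text):
    """Return True if a kernel module with exactly this name is listed.

    modules_text is the content of /proc/modules; the module name is the
    first whitespace-separated field on each line.
    """
    for line in modules_text.splitlines():
        fields = line.split()
        if fields and fields[0] == name:
            return True
    return False

# Illustrative /proc/modules excerpt (made-up sizes and addresses).
sample = (
    "mlx5_ib 475136 0 - Live 0x0000000000000000\n"
    "mlx5_core 2207744 1 mlx5_ib, Live 0x0000000000000000\n"
)
print("nvidia_peermem loaded:", module_loaded("nvidia_peermem", sample))
print("mlx5_core loaded:", module_loaded("mlx5_core", sample))
```

On a real node, pass the contents of `/proc/modules` (for example `open("/proc/modules").read()`) instead of the sample string.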

In [3]:
# Diagnostic: Visualize host-staging during GPU memory transfer
# This shows that GPU transfers use host memory bounce buffers

import subprocess
import threading
import time
import torch

def get_memory_stats():
    """Get current host memory usage in MB"""
    result = subprocess.run(['free', '-m'], capture_output=True, text=True)
    lines = result.stdout.strip().split('\n')
    mem_line = lines[1].split()
    return {
        'total': int(mem_line[1]),
        'used': int(mem_line[2]),
        'free': int(mem_line[3]),
        'available': int(mem_line[6])
    }

def monitor_memory(duration_sec, interval=0.1):
    """Monitor memory usage over time"""
    samples = []
    start = time.time()
    while time.time() - start < duration_sec:
        stats = get_memory_stats()
        stats['timestamp'] = time.time() - start
        samples.append(stats)
        time.sleep(interval)
    return samples

print("=" * 60)
print("HOST-STAGING DIAGNOSTIC")
print("=" * 60)

# Check nvidia_peermem status
peermem_result = subprocess.run(['lsmod'], capture_output=True, text=True)
peermem_loaded = 'nvidia_peermem' in peermem_result.stdout

print(f"\nnvidia_peermem loaded: {'YES ✓' if peermem_loaded else 'NO ✗'}")
if not peermem_loaded:
    print("  → GPU transfers will use host memory bounce buffers")
    print("  → This is why we can observe host RAM usage during GPU transfers")

# Baseline memory
baseline = get_memory_stats()
print(f"\nBaseline host memory: {baseline['used']} MB used / {baseline['total']} MB total")

# Allocate GPU tensor
tensor_size_mb = 512  # 512 MB test
elements = (tensor_size_mb * 1024 * 1024) // 4  # float32 = 4 bytes
side = int(elements ** 0.5)

print(f"\nAllocating {tensor_size_mb} MB GPU tensor...")
gpu_tensor = torch.ones((side, side), dtype=torch.float32, device='cuda:0')
torch.cuda.synchronize()

post_alloc = get_memory_stats()
print(f"After GPU alloc: {post_alloc['used']} MB used (delta: {post_alloc['used'] - baseline['used']:+d} MB)")

# Simulate transfer by copying to host memory (mimics the cuda_copy path).
# Note: .cpu() returns ordinary pageable host memory, not pinned memory.
print(f"\nSimulating host-staged transfer (GPU → pinned host → GPU)...")
print("This demonstrates the bounce buffer path UCX uses without GPUDirect RDMA:\n")

# Monitor during transfer
print("  Step 1: GPU → Host (cuda_copy stage 1)")
pre_copy = get_memory_stats()
host_tensor = gpu_tensor.cpu()  # This allocates host memory
torch.cuda.synchronize()
post_copy = get_memory_stats()
host_delta = post_copy['used'] - pre_copy['used']
print(f"           Host memory delta: {host_delta:+d} MB")
print(f"           → Bounce buffer allocated in host RAM")

print("\n  Step 2: Host → GPU (cuda_copy stage 2)")
pre_back = get_memory_stats()
gpu_tensor_2 = host_tensor.cuda()
torch.cuda.synchronize()
post_back = get_memory_stats()
print(f"           Host memory delta: {post_back['used'] - pre_back['used']:+d} MB")

# Cleanup
del host_tensor, gpu_tensor, gpu_tensor_2
torch.cuda.empty_cache()

final = get_memory_stats()
print(f"\nAfter cleanup: {final['used']} MB used (delta from baseline: {final['used'] - baseline['used']:+d} MB)")

print("\n" + "=" * 60)
print("INTERPRETATION")
print("=" * 60)
if host_delta > tensor_size_mb * 0.5:  # At least half the tensor size appeared in host RAM
    print(f"""
The {host_delta} MB increase in host memory during GPU→Host copy
confirms that data passes through host RAM bounce buffers.

During NIXL/UCX GPU transfers, this same path is used:
  1. cuda_copy: GPU VRAM → Host bounce buffer (~{tensor_size_mb} MB allocated)
  2. rc_mlx5:   Host buffer → Network → Remote host buffer  
  3. cuda_copy: Remote host buffer → Remote GPU VRAM

This explains why GPU transfers achieve ~4 Gbps instead of ~83 Gbps:
  - PCIe bandwidth to/from GPU becomes the bottleneck
  - Two extra memory copies add latency
""")
else:
    print(f"""
Host memory delta was only {host_delta} MB (expected ~{tensor_size_mb} MB).
This could indicate:
  - Memory was already pre-allocated
  - System has unified memory architecture
  - Measurement timing issue

Run 'watch -n 0.5 free -m' in a terminal during transfers for real-time view.
""")

HOST-STAGING DIAGNOSTIC

nvidia_peermem loaded: NO ✗
  → GPU transfers will use host memory bounce buffers
  → This is why we can observe host RAM usage during GPU transfers

Baseline host memory: 6570 MB used / 122506 MB total

Allocating 512 MB GPU tensor...
After GPU alloc: 7093 MB used (delta: +523 MB)

Simulating host-staged transfer (GPU → pinned host → GPU)...
This demonstrates the bounce buffer path UCX uses without GPUDirect RDMA:

  Step 1: GPU → Host (cuda_copy stage 1)
           Host memory delta: +511 MB
           → Bounce buffer allocated in host RAM

  Step 2: Host → GPU (cuda_copy stage 2)
           Host memory delta: +520 MB

After cleanup: 6581 MB used (delta from baseline: +11 MB)

INTERPRETATION

The 511 MB increase in host memory during GPU→Host copy
confirms that data passes through host RAM bounce buffers.

During NIXL/UCX GPU transfers, this same path is used:
  1. cuda_copy: GPU VRAM → Host bounce buffer (~512 MB allocated)
  2. rc_mlx5:   Host buffer → Netw

In [None]:
# Visual ASCII diagram of data flow
print("""
╔══════════════════════════════════════════════════════════════════════════════╗
║                     DATA PATH COMPARISON                                     ║
╠══════════════════════════════════════════════════════════════════════════════╣
║                                                                              ║
║  WITH nvidia_peermem (GPUDirect RDMA) - NOT available on DGX Spark:         ║
║  ─────────────────────────────────────────────────────────────────          ║
║                                                                              ║
║    ┌─────────┐                                      ┌─────────┐             ║
║    │   GPU   │ ════════════ RDMA ════════════════► │   GPU   │             ║
║    │  VRAM   │         ~80-90 Gbps                  │  VRAM   │             ║
║    └─────────┘      (NIC reads GPU directly)        └─────────┘             ║
║                                                                              ║
║                                                                              ║
║  WITHOUT nvidia_peermem (Host-Staged) - Current DGX Spark behavior:         ║
║  ───────────────────────────────────────────────────────────────            ║
║                                                                              ║
║    ┌─────────┐   cuda_copy   ┌─────────┐   RDMA    ┌─────────┐   cuda_copy  ┌─────────┐
║    │   GPU   │ ───────────► │  Host   │ ────────► │  Host   │ ───────────► │   GPU   │
║    │  VRAM   │    ~15 GB/s   │ Bounce  │  ~83 Gbps │ Bounce  │   ~15 GB/s   │  VRAM   │
║    └─────────┘    (PCIe)     │ Buffer  │           │ Buffer  │    (PCIe)    └─────────┘
║                              └─────────┘           └─────────┘                        ║
║                                                                              ║
║    Bottleneck: PCIe transfers limit effective throughput to ~4 Gbps         ║
║                                                                              ║
╚══════════════════════════════════════════════════════════════════════════════╝

Evidence that host-staging is active:
  1. UCX logs show 'cuda_copy' transport being used
  2. Host memory usage increases during GPU transfers (bounce buffers)
  3. GPU transfer throughput (~4 Gbps) << CPU transfer throughput (~83 Gbps)
  4. GDAKI warning: "please load Nvidia peermem driver"
""")

In [4]:
# Diagnostic: Check UCX transport capabilities
import subprocess

print("=== UCX CUDA Transport Support ===")
result = subprocess.run(
    "ucx_info -d 2>/dev/null | grep -E '(Transport|cuda|memory)' | head -40",
    shell=True, capture_output=True, text=True
)
print(result.stdout)

print("\n=== UCX Memory Domains ===")
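result = subprocess.run(
    # Raw string so that "\[" reaches grep unchanged and Python does not warn
    # about an invalid escape sequence
    r"ucx_info -d 2>/dev/null | grep -E '(md\[|reg_mem|alloc)' | head -30",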
    shell=True, capture_output=True, text=True
)
print(result.stdout)

print("\n=== Key Observations ===")
print("""
Look for these patterns in the output above:

1. CUDA transports available:
   - cuda_copy: Stages GPU↔CPU via PCIe (SLOW for large transfers)
   - cuda_ipc: GPU-to-GPU on same node via NVLink/PCIe (not applicable here)

2. Memory domain capabilities ('memory types:' lines):
   - Only 'host (...reg...)' listed = Can only register CPU memory for RDMA
   - 'cuda (...reg...)' also listed = Can register GPU memory directly (GPUDirect RDMA)

DGX Spark limitation: The RoCE adapters list no registrable 'cuda' memory type.
GPU memory must be staged through CPU bounce buffers, adding latency
and limiting throughput to PCIe bandwidth (~32 GB/s per direction).
""")



=== UCX CUDA Transport Support ===
#         memory types: host (access,reg_nonblock,reg,cache)
#      Transport: self
#         Device: memory
#         memory types: host (access,reg_nonblock,reg,cache)
#      Transport: tcp
#      Transport: tcp
#      Transport: tcp
#      Transport: tcp
#      Transport: tcp
#      Transport: tcp
#      Transport: tcp
#         memory types: host (access,alloc,cache)
#      Transport: sysv
#         Device: memory
#         memory types: host (access,alloc,cache)
#      Transport: posix
#         Device: memory
#           local memory handle is required for zcopy
#           memory invalidation is supported
#         memory types: host (access,reg_nonblock,reg,cache), rdma (alloc,cache)
#      Transport: dc_mlx5
#      Transport: rc_verbs
#      Transport: rc_mlx5
#      Transport: ud_verbs
#      Transport: ud_mlx5
#           local memory handle is required for zcopy
#           memory invalidation is supported
#         memory types: host (acc

### 4.3.2 NIXL latency test (two-node)

This measures per-transfer latency for small CPU buffers and includes Python overhead. Use it for relative comparisons, not absolute wire latency.

**Why the dual-rail throughput scripts are not suitable for latency:**
- They time one multi-gigabyte transfer, which reports bandwidth instead of per-transfer latency.
- Large transfers use rendezvous and pipelining, so timing reflects sustained throughput, not one-way latency.
- Latency measurement requires thousands of small transfers with per-iteration timing and percentile stats.

**Measured results (dual-rail, CPU, 4 KB, 1000 iterations):**
- Avg: 58.6 μs
- P50: 11.1 μs
- P95: 166.6 μs

**Measured results (single-rail, CPU, 4 KB, 1000 iterations):**
- Avg: 17.4 μs
- P50: 16.2 μs
- P95: 20.8 μs
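
The dual-rail average (58.6 μs) sits far above its median (11.1 μs), which is the signature of a heavy tail: a minority of slow iterations pulls the mean up while the median stays put. The sketch below reproduces that effect on synthetic data (not measurements) using the same `statistics` calls as the latency scripts. The helper name `summarize` is hypothetical. One plausible source of such a tail, not verified here, is the 100 μs `time.sleep` in the polling loop, which adds a fixed delay whenever a transfer is not finished by the first status check.

```python
import statistics

def summarize(latencies_us):
    """Return average, median (P50), and ~95th percentile of latency samples."""
    # quantiles(n=20) returns 19 cut points; index 18 is the 95th percentile.
    cut_points = statistics.quantiles(latencies_us, n=20)
    return {
        "avg": statistics.fmean(latencies_us),
        "p50": statistics.median(latencies_us),
        "p95": cut_points[18],
    }

# Synthetic samples: 90% fast iterations, 10% slow ones (illustrative values only).
samples = [11.0] * 90 + [160.0] * 10
stats = summarize(samples)
print(f"avg={stats['avg']:.1f} us  p50={stats['p50']:.1f} us  p95={stats['p95']:.1f} us")
```

For tail-heavy distributions like this, P50 and P95 describe behavior better than the average alone.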

In [None]:
# Generate NIXL latency test scripts
latency_target_script = '''#!/usr/bin/env python3
# nixl_latency_target.py - Run on spark-01

import os
os.environ["PATH"] = "/usr/local/ucx/bin:" + os.environ.get("PATH", "")
os.environ["LD_LIBRARY_PATH"] = "/usr/local/ucx/lib:" + os.environ.get("LD_LIBRARY_PATH", "")
os.environ["UCX_NET_DEVICES"] = "rocep1s0f0:1,rocep1s0f1:1,enp1s0f0np0,enp1s0f1np1,lo"
os.environ["UCX_TLS"] = "rc_verbs,rc_mlx5,tcp,sockcm,cuda_copy"
os.environ["UCX_MAX_RNDV_LANES"] = "2"
os.environ["UCX_MAX_EAGER_LANES"] = "2"
os.environ["UCX_RNDV_THRESH"] = "0"

import time
import torch
from nixl._api import nixl_agent, nixl_agent_config

config = nixl_agent_config(
    enable_prog_thread=True,
    enable_listen_thread=True,
    listen_port=5557,
    backends=["UCX"]
)

agent = nixl_agent("target", config)
print("NIXL agent created (latency target)")

# Small CPU buffer (4 KB)
tensor = torch.ones((1024,), dtype=torch.float32, device="cpu")
print(f"Target tensor: {tensor.shape}, {tensor.numel() * 4} bytes (CPU)")

agent.register_memory(tensor)
target_descs = agent.get_xfer_descs([tensor])
desc_str = agent.get_serialized_descs(target_descs)
print(f"Descriptor ready ({len(desc_str)} bytes)")

print("Waiting for initiator...")
while not agent.check_remote_metadata("initiator"):
    time.sleep(0.1)

agent.send_notif("initiator", desc_str)
print("Sent descriptors to initiator")

# Wait for completion signal
while True:
    notifs = agent.get_new_notifs()
    if "initiator" in notifs:
        for notif in notifs["initiator"]:
            if b"done" in notif:
                print("Latency test completed")
                break
        else:
            continue
        break
    time.sleep(0.1)

print("Target finished")
'''

latency_initiator_script = f'''#!/usr/bin/env python3
# nixl_latency_initiator.py - Run on spark-02

import os
os.environ["PATH"] = "/usr/local/ucx/bin:" + os.environ.get("PATH", "")
os.environ["LD_LIBRARY_PATH"] = "/usr/local/ucx/lib:" + os.environ.get("LD_LIBRARY_PATH", "")
os.environ["UCX_NET_DEVICES"] = "rocep1s0f0:1,rocep1s0f1:1,enp1s0f0np0,enp1s0f1np1,lo"
os.environ["UCX_TLS"] = "rc_verbs,rc_mlx5,tcp,sockcm"
os.environ["UCX_MAX_RNDV_LANES"] = "2"
os.environ["UCX_MAX_EAGER_LANES"] = "2"
os.environ["UCX_RNDV_THRESH"] = "0"

import time
import statistics
import torch
from nixl._api import nixl_agent, nixl_agent_config

TARGET_IP = "{REMOTE_IP}"
TARGET_PORT = 5557
ITERATIONS = 1000

config = nixl_agent_config(
    enable_prog_thread=True,
    enable_listen_thread=True,
    listen_port=0,
    backends=["UCX"]
)

agent = nixl_agent("initiator", config)
print("NIXL agent created (latency initiator)")

# Small CPU buffer (4 KB)
local_tensor = torch.zeros((1024,), dtype=torch.float32, device="cpu")
agent.register_memory(local_tensor)

print(f"Connecting to target at {{TARGET_IP}}:{{TARGET_PORT}}")
agent.fetch_remote_metadata("target", TARGET_IP, TARGET_PORT)
agent.send_local_metadata(TARGET_IP, TARGET_PORT)

print("Waiting for descriptors...")
notifs = agent.get_new_notifs()
while "target" not in notifs or len(notifs["target"]) == 0:
    time.sleep(0.1)
    notifs = agent.get_new_notifs()

remote_descs = agent.deserialize_descs(notifs["target"][0])
local_descs = agent.get_xfer_descs([local_tensor])
print("Received remote descriptors")

latencies_us = []
for _ in range(ITERATIONS):
    start_ns = time.perf_counter_ns()
    xfer_handle = agent.initialize_xfer("READ", local_descs, remote_descs, "target", "done")
    agent.transfer(xfer_handle)
    while agent.check_xfer_state(xfer_handle) != "DONE":
        time.sleep(0.0001)
    elapsed_us = (time.perf_counter_ns() - start_ns) / 1000
    latencies_us.append(elapsed_us)

avg_us = sum(latencies_us) / len(latencies_us)
p50_us = statistics.median(latencies_us)
p95_us = statistics.quantiles(latencies_us, n=20)[18]  # ~95th percentile
print(f"Latency (avg): {{avg_us:.1f}} μs")
print(f"Latency (p50): {{p50_us:.1f}} μs")
print(f"Latency (p95): {{p95_us:.1f}} μs")

agent.send_notif("target", b"done")
print("Initiator finished")
'''

single_latency_target_script = '''#!/usr/bin/env python3
# nixl_latency_target_single_rail.py - Run on spark-01

import os
os.environ["PATH"] = "/usr/local/ucx/bin:" + os.environ.get("PATH", "")
os.environ["LD_LIBRARY_PATH"] = "/usr/local/ucx/lib:" + os.environ.get("LD_LIBRARY_PATH", "")
os.environ["UCX_NET_DEVICES"] = "rocep1s0f0:1,enp1s0f0np0,lo"
os.environ["UCX_TLS"] = "rc_verbs,rc_mlx5,tcp,sockcm"
os.environ["UCX_MAX_RNDV_LANES"] = "1"
os.environ["UCX_MAX_EAGER_LANES"] = "1"
os.environ["UCX_RNDV_THRESH"] = "0"

import time
import torch
from nixl._api import nixl_agent, nixl_agent_config

config = nixl_agent_config(
    enable_prog_thread=True,
    enable_listen_thread=True,
    listen_port=5558,
    backends=["UCX"]
)

agent = nixl_agent("target", config)
print("NIXL agent created (single-rail latency target)")

# Small CPU buffer (4 KB)
tensor = torch.ones((1024,), dtype=torch.float32, device="cpu")
print(f"Target tensor: {tensor.shape}, {tensor.numel() * 4} bytes (CPU)")

agent.register_memory(tensor)
target_descs = agent.get_xfer_descs([tensor])
desc_str = agent.get_serialized_descs(target_descs)
print(f"Descriptor ready ({len(desc_str)} bytes)")

print("Waiting for initiator...")
while not agent.check_remote_metadata("initiator"):
    time.sleep(0.1)

agent.send_notif("initiator", desc_str)
print("Sent descriptors to initiator")

# Wait for completion signal
while True:
    notifs = agent.get_new_notifs()
    if "initiator" in notifs:
        for notif in notifs["initiator"]:
            if b"done" in notif:
                print("Latency test completed")
                break
        else:
            continue
        break
    time.sleep(0.1)

print("Target finished")
'''

single_latency_initiator_script = f'''#!/usr/bin/env python3
# nixl_latency_initiator_single_rail.py - Run on spark-02

import os
os.environ["PATH"] = "/usr/local/ucx/bin:" + os.environ.get("PATH", "")
os.environ["LD_LIBRARY_PATH"] = "/usr/local/ucx/lib:" + os.environ.get("LD_LIBRARY_PATH", "")
os.environ["UCX_NET_DEVICES"] = "rocep1s0f0:1,enp1s0f0np0,lo"
os.environ["UCX_TLS"] = "rc_verbs,rc_mlx5,tcp,sockcm"
os.environ["UCX_MAX_RNDV_LANES"] = "1"
os.environ["UCX_MAX_EAGER_LANES"] = "1"
os.environ["UCX_RNDV_THRESH"] = "0"

import time
import statistics
import torch
from nixl._api import nixl_agent, nixl_agent_config

TARGET_IP = "{REMOTE_IP}"
TARGET_PORT = 5558
ITERATIONS = 1000

config = nixl_agent_config(
    enable_prog_thread=True,
    enable_listen_thread=True,
    listen_port=0,
    backends=["UCX"]
)

agent = nixl_agent("initiator", config)
print("NIXL agent created (single-rail latency initiator)")

# Small CPU buffer (4 KB)
local_tensor = torch.zeros((1024,), dtype=torch.float32, device="cpu")
agent.register_memory(local_tensor)

print(f"Connecting to target at {{TARGET_IP}}:{{TARGET_PORT}}")
agent.fetch_remote_metadata("target", TARGET_IP, TARGET_PORT)
agent.send_local_metadata(TARGET_IP, TARGET_PORT)

print("Waiting for descriptors...")
notifs = agent.get_new_notifs()
while "target" not in notifs or len(notifs["target"]) == 0:
    time.sleep(0.1)
    notifs = agent.get_new_notifs()

remote_descs = agent.deserialize_descs(notifs["target"][0])
local_descs = agent.get_xfer_descs([local_tensor])
print("Received remote descriptors")

latencies_us = []
for _ in range(ITERATIONS):
    start_ns = time.perf_counter_ns()
    xfer_handle = agent.initialize_xfer("READ", local_descs, remote_descs, "target", "done")
    agent.transfer(xfer_handle)
    while agent.check_xfer_state(xfer_handle) != "DONE":
        time.sleep(0.0001)
    elapsed_us = (time.perf_counter_ns() - start_ns) / 1000
    latencies_us.append(elapsed_us)

avg_us = sum(latencies_us) / len(latencies_us)
p50_us = statistics.median(latencies_us)
p95_us = statistics.quantiles(latencies_us, n=20)[18]  # ~95th percentile
print(f"Latency (avg): {{avg_us:.1f}} μs")
print(f"Latency (p50): {{p50_us:.1f}} μs")
print(f"Latency (p95): {{p95_us:.1f}} μs")

agent.send_notif("target", b"done")
print("Initiator finished")
'''

print("=== Latency Target Script (Dual-Rail) ===")
print("Save to spark-01 as ~/nixl_latency_target.py and run:")
print("  .venv/bin/python3 ~/nixl_latency_target.py")
print()
print(latency_target_script)
print()
print("=== Latency Initiator Script (Dual-Rail) ===")
print("Run on spark-02 AFTER target shows 'Waiting for initiator...':")
print("  .venv/bin/python3 ~/nixl_latency_initiator.py")
print()
print(latency_initiator_script)
print()
print("=== Latency Target Script (Single-Rail) ===")
print("Save to spark-01 as ~/nixl_latency_target_single_rail.py and run:")
print("  .venv/bin/python3 ~/nixl_latency_target_single_rail.py")
print()
print(single_latency_target_script)
print()
print("=== Latency Initiator Script (Single-Rail) ===")
print("Run on spark-02 AFTER target shows 'Waiting for initiator...':")
print("  .venv/bin/python3 ~/nixl_latency_initiator_single_rail.py")
print()
print(single_latency_initiator_script)

### 4.3.1 Dual-rail IP setup (second port)

Before the dual-rail test, restore IPs on the second RoCE port on both nodes:

- spark-01: add 192.168.100.12/24 to `enp1s0f1np1`
- spark-02: add 192.168.100.13/24 to `enp1s0f1np1`

After dual-rail testing, remove those secondary IPs before running single-rail tests to avoid UCX/GID confusion.
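
The add and remove steps can be generated from one small helper so that both nodes get matching commands. This is a sketch only: it builds the `ip addr` command strings without running them, the function name `secondary_ip_command` is hypothetical, and the addresses are the ones listed above.

```python
def secondary_ip_command(action, cidr, interface="enp1s0f1np1"):
    """Build (but do not run) the `ip addr` command for the second RoCE port.

    Use action="add" before dual-rail tests and action="del" afterwards.
    """
    if action not in ("add", "del"):
        raise ValueError("action must be 'add' or 'del'")
    return f"sudo ip addr {action} {cidr} dev {interface}"

# Secondary addresses from the list above.
plan = {"spark-01": "192.168.100.12/24", "spark-02": "192.168.100.13/24"}

for node, cidr in plan.items():
    print(f"{node} (before dual-rail): {secondary_ip_command('add', cidr)}")
    print(f"{node} (after dual-rail):  {secondary_ip_command('del', cidr)}")
```

Run each printed command on the node named at the start of the line.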

---

## Part 5: Performance Summary

Compare bonding vs NIXL results.

In [5]:
# Performance comparison table
## Updated from notebook outputs (Jan 2026)

import math

# Measured values from cell outputs
rdma_single_link_gbps = 11678.83 * 8 / 1000  # ib_write_bw BW average (MB/sec)
tcp_single_stream_gbps = 33.7                # iperf3 single stream output
tcp_4_streams_gbps = 93.0                    # iperf3 -P 4 SUM output

# NIXL values - using CPU device registration (update after running scripts)
nixl_single_rail_gbps = 81.8                 # from NIXL single-rail run output
nixl_dual_rail_gbps = 93.4                   # set after dual-rail run output

# NIXL values - using GPU device registration (update after running scripts)
nixl_single_rail_gpu_gbps = 3.5              # from NIXL single-rail GPU run output
nixl_dual_rail_gpu_gbps = 4.9                # from NIXL dual-rail GPU run output

# NIXL latency values (CPU, 4 KB, 1000 iterations)
nixl_latency_dual_avg_us = 58.6
nixl_latency_dual_p50_us = 11.1
nixl_latency_dual_p95_us = 166.6

# Single-rail latency (CPU, 4 KB, 1000 iterations)
nixl_latency_single_avg_us = 17.4
nixl_latency_single_p50_us = 16.2
nixl_latency_single_p95_us = 20.8

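# Latency notes: the TCP (100 μs) and raw RDMA (2 μs) values are hard-coded
# representative figures. NIXL latencies are the 4 KB CPU-buffer averages; the
# GPU rows reuse them because latency was only measured with CPU buffers.
comparison_data = {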
    "TCP Single Stream": {"gbps": tcp_single_stream_gbps, "latency_us": 100},
    "TCP 4 Streams (bonded)": {"gbps": tcp_4_streams_gbps, "latency_us": 100},
    "RDMA Single Link (ib_write_bw)": {"gbps": rdma_single_link_gbps, "latency_us": 2},
    "NIXL Single Rail (CPU memory)": {"gbps": nixl_single_rail_gbps, "latency_us": nixl_latency_single_avg_us},
    "NIXL Dual Rail (CPU memory)": {"gbps": nixl_dual_rail_gbps, "latency_us": nixl_latency_dual_avg_us},
    "NIXL Single Rail (GPU memory)": {"gbps": nixl_single_rail_gpu_gbps, "latency_us": nixl_latency_single_avg_us},
    "NIXL Dual Rail (GPU memory)": {"gbps": nixl_dual_rail_gpu_gbps, "latency_us": nixl_latency_dual_avg_us},
}

def format_latency(val):
    if isinstance(val, (int, float)) and not math.isnan(val):
        return f"{val:>12.0f} μs"
    return f"{'N/A':>15}"  # same 15-character width as the numeric branch

print("=" * 70)
print(f"{'Test Configuration':<32} {'Throughput':>16} {'Latency':>15}")
print("=" * 70)

for test, data in comparison_data.items():
    gbps = data["gbps"]
    latency = data["latency_us"]
    gbps_str = f"{gbps:>6.1f} Gbps" if isinstance(gbps, (int, float)) else "   N/A  "
    print(f"{test:<32} {gbps_str:>16} {format_latency(latency)}")

print("=" * 70)
print()
print("Key findings:")
print(f"- Raw RDMA (ib_write_bw): {rdma_single_link_gbps:.1f} Gbps (near 100G line rate)")
print(f"- TCP single stream:     {tcp_single_stream_gbps:.1f} Gbps (kernel overhead)")
print(f"- TCP 4 streams:         {tcp_4_streams_gbps:.1f} Gbps (bonded)")
print(f"- NIXL dual-rail latency (avg): {nixl_latency_dual_avg_us:.1f} μs (p50 {nixl_latency_dual_p50_us:.1f} μs, p95 {nixl_latency_dual_p95_us:.1f} μs)")
print(f"- NIXL single-rail latency (avg): {nixl_latency_single_avg_us:.1f} μs (p50 {nixl_latency_single_p50_us:.1f} μs, p95 {nixl_latency_single_p95_us:.1f} μs)")


Test Configuration                     Throughput         Latency
TCP Single Stream                       33.7 Gbps          100 μs
TCP 4 Streams (bonded)                  93.0 Gbps          100 μs
RDMA Single Link (ib_write_bw)          93.4 Gbps            2 μs
NIXL Single Rail (CPU memory)           81.8 Gbps           17 μs
NIXL Dual Rail (CPU memory)             93.4 Gbps           59 μs
NIXL Single Rail (GPU memory)            3.5 Gbps           17 μs
NIXL Dual Rail (GPU memory)              4.9 Gbps           59 μs

Key findings:
- Raw RDMA (ib_write_bw): 93.4 Gbps (near 100G line rate)
- TCP single stream:     33.7 Gbps (kernel overhead)
- TCP 4 streams:         93.0 Gbps (bonded)
- NIXL dual-rail latency (avg): 58.6 μs (p50 11.1 μs, p95 166.6 μs)
- NIXL single-rail latency (avg): 17.4 μs (p50 16.2 μs, p95 20.8 μs)


### What the NIXL results indicate

- **Single-rail NIXL (~81.8 Gbps)**: RDMA is working end-to-end, but throughput is lower than raw `ib_write_bw` because of NIXL metadata handling and Python overhead.
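- **Dual-rail NIXL (~93.4 Gbps)**: The second rail contributes (about 14% over single-rail), but scaling falls well short of 2x. These runs use CPU memory, so the limit is CPU/memory overhead rather than GPU host-staging; host-staging (no GPUDirect RDMA on DGX Spark) is what holds the GPU-memory runs to ~3.5-4.9 Gbps.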
- **Takeaway**: NIXL provides much higher throughput than TCP single streams and approaches raw RDMA on a single link, but dual-rail gains are constrained on this platform.
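
To put a number on how constrained the dual-rail gain is, the following sketch computes the speedup and scaling efficiency from the two CPU-memory throughput figures reported above. It is plain arithmetic on those measurements; the variable names are illustrative.

```python
single_rail_gbps = 81.8   # NIXL single-rail, CPU memory (measured above)
dual_rail_gbps = 93.4     # NIXL dual-rail, CPU memory (measured above)
rails = 2

# Speedup relative to one rail, and the fraction of ideal linear scaling achieved.
speedup = dual_rail_gbps / single_rail_gbps
efficiency = dual_rail_gbps / (rails * single_rail_gbps)

print(f"Dual-rail speedup: {speedup:.2f}x")
print(f"Scaling efficiency: {efficiency:.0%} of ideal {rails}x")
```

The result is about 1.14x, or roughly 57% of ideal two-rail scaling.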

In [None]:
# Visualization
try:
    import matplotlib.pyplot as plt
    import math

    tests = list(comparison_data.keys())
    throughputs = [d["gbps"] for d in comparison_data.values()]
    latencies = [d["latency_us"] for d in comparison_data.values()]

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

    # Throughput comparison
    colors = ['#ff6b6b' if 'TCP' in t else '#4ecdc4' for t in tests]
    bars1 = ax1.barh(tests, throughputs, color=colors)
    ax1.set_xlabel('Throughput (Gbps)')
    ax1.set_title('Throughput: Bonding vs NIXL')
    ax1.axvline(x=100, color='gray', linestyle='--', alpha=0.5)

    for bar, val in zip(bars1, throughputs):
        ax1.text(val + 2, bar.get_y() + bar.get_height()/2, f'{val:.0f}',
                va='center', fontsize=10)

    # Latency comparison (log scale)
    if any(isinstance(v, (int, float)) and math.isnan(v) for v in latencies):
        ax2.text(0.5, 0.5, 'Latency chart skipped\n(missing values)',
                 transform=ax2.transAxes, ha='center', va='center', fontsize=10)
        ax2.set_axis_off()
    else:
        bars2 = ax2.barh(tests, latencies, color=colors)
        ax2.set_xlabel('Latency (μs) - Log Scale')
        ax2.set_xscale('log')
        ax2.set_title('Latency: Bonding vs NIXL')

        for bar, val in zip(bars2, latencies):
            ax2.text(val * 1.2, bar.get_y() + bar.get_height()/2, f'{val:.0f}',
                    va='center', fontsize=10)

    plt.tight_layout()
    plt.savefig('bonding_vs_nixl_comparison.png', dpi=150, bbox_inches='tight')
    plt.show()
    print("\nChart saved to bonding_vs_nixl_comparison.png")

except ImportError:
    print("matplotlib not installed. Install with: pip install matplotlib")

---

## Part 6: Key Findings

### Why Bonding Underperforms

```
TCP/IP Path (Bonding):
  Application → Socket API → Kernel TCP/IP Stack → Driver → NIC
  
RDMA Path (NIXL):
  Application → Verbs API → NIC (direct memory access)
```

The kernel TCP/IP stack introduces:
- **CPU overhead**: Context switches, buffer copies, interrupt handling
- **Latency**: 50-200 μs vs 1-2 μs for RDMA
- **Throughput ceiling**: ~35 Gbps per flow regardless of link speed
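
As a rough illustration of the per-flow ceiling, the sketch below estimates best-case aggregate TCP throughput over a bond. The function name and the simple `min` model are assumptions made for illustration: it assumes a hash-based bonding mode (such as 802.3ad or balance-xor) that places each flow on a different link, uses the ~35 Gbps per-flow figure above, and ignores hash collisions and CPU contention.

```python
def estimate_bond_throughput_gbps(n_flows, per_flow_gbps=35.0, link_gbps=100.0, n_links=2):
    """Best-case aggregate TCP throughput over a bond (illustrative model only)."""
    if n_flows <= 0:
        return 0.0
    # Assumption: each flow is carried by a single link, so the total can never
    # exceed the combined link capacity, and each flow is capped by the TCP stack.
    return min(n_flows * per_flow_gbps, n_links * link_gbps)

for flows in (1, 2):
    print(f"{flows} flow(s): ~{estimate_bond_throughput_gbps(flows):.0f} Gbps (best case)")
```

With two flows the model gives ~70 Gbps, the upper end of the bonding range quoted in the overview table. Adding a second link does not help a single flow at all.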

### When to Use Each Approach

| Workload | Recommended |
|----------|-------------|
| KV-cache transfer | NIXL |
| Tensor shard movement | NIXL |
| Disaggregated inference | NIXL |
| Collective operations | NCCL (see [first tutorial](01_InfiniBand_Tutorial.ipynb)) |
| SSH/management | Bonding |
| NFS storage | Bonding |

### Coexistence

Bonding and NIXL can coexist:
- Bond for IP traffic (uses kernel stack)
- NIXL for RDMA (bypasses kernel entirely)

RDMA verbs access `mlx5_0`/`mlx5_1` directly; traffic does not traverse `bond0`.

---

## Part 7: Troubleshooting

Common issues and solutions when working with bonding and NIXL.

### UCX Version Mismatch Between Nodes

**Symptom:** NIXL connection fails with `NIXL_ERR_NOT_FOUND`

**Cause:** UCX requires matching versions on both endpoints. A node with UCX 1.16.0 cannot connect to a node with UCX 1.21.0.

**Diagnosis:**
```bash
# Check UCX version on each node
ucx_info -v
# Look for the version line, e.g., "# Library version: 1.16.0"
```

**Fix:** Build matching UCX versions on both nodes from the same git tag.
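
Comparing the two version strings can be scripted. The following is a minimal sketch; the helper names are hypothetical, and because the exact line format of `ucx_info -v` varies between UCX releases, the parser simply takes the first `x.y.z` token it finds. The sample strings stand in for real command output.

```python
import re

def parse_ucx_version(ucx_info_output):
    """Return the first x.y.z version found in `ucx_info -v` output as a tuple, or None."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", ucx_info_output)
    return tuple(int(part) for part in match.groups()) if match else None

def versions_match(local_output, remote_output):
    """True only if both outputs contain a version and the versions are identical."""
    local = parse_ucx_version(local_output)
    remote = parse_ucx_version(remote_output)
    return local is not None and local == remote

# Sample strings standing in for real command output (line format is an assumption).
local_sample = "# Library version: 1.16.0"
remote_sample = "# Library version: 1.21.0"
print(parse_ucx_version(local_sample), parse_ucx_version(remote_sample))
print("match" if versions_match(local_sample, remote_sample) else "MISMATCH")
```

In the notebook, the two inputs could come from `run_cmd("ucx_info -v")` locally and the same command run over SSH on the remote node.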

### P2P Interface IP Conflict

**Symptom:** UCX connects but uses the wrong interface, or the connection times out despite correct IP addresses.

**Cause:** DGX Spark has a P2P interface (`enP2p1s0f0np0`) that may have an IP address in the same subnet as the RoCE interfaces. UCX picks the first matching interface.

**Diagnosis:**
```bash
# Check all interfaces for IPs in your subnet
ip addr | grep "192.168.100"
```

**Fix:**
```bash
# Remove the conflicting IP
sudo ip addr del 192.168.100.15/24 dev enP2p1s0f0np0

# Or restrict UCX to the correct interface (set before importing NIXL)
export UCX_NET_DEVICES=rocep1s0f0:1
```
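
The same check can be done programmatically. This sketch, with a hypothetical helper name, parses the one-line-per-address output of `ip -o -4 addr show` and reports every IPv4 network that is configured on more than one interface. The sample text is illustrative and only mirrors the addresses used in this notebook.

```python
import ipaddress
from collections import defaultdict

def find_subnet_conflicts(ip_addr_output):
    """Map each IPv4 network to its interfaces; keep networks seen on more than one."""
    by_network = defaultdict(list)
    for line in ip_addr_output.splitlines():
        fields = line.split()
        # Expected `ip -o -4 addr show` layout: "<idx>: <ifname> inet <addr>/<prefix> ..."
        if len(fields) >= 4 and fields[2] == "inet":
            iface = fields[1]
            network = ipaddress.ip_interface(fields[3]).network
            if iface not in by_network[network]:
                by_network[network].append(iface)
    return {str(net): ifaces for net, ifaces in by_network.items() if len(ifaces) > 1}

# Illustrative sample, not captured output.
sample = """\
1: lo    inet 127.0.0.1/8 scope host lo
2: enp1s0f0np0    inet 192.168.100.11/24 brd 192.168.100.255 scope global enp1s0f0np0
4: enP2p1s0f0np0    inet 192.168.100.15/24 brd 192.168.100.255 scope global enP2p1s0f0np0
"""
print(find_subnet_conflicts(sample))
```

Any network listed in the result is a candidate for the ambiguity described above; an empty dictionary means no two interfaces share a subnet.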

### Interface Down

**Symptom:** `ibdev2netdev` shows an interface as `(Down)`

**Diagnosis:**
```bash
ibdev2netdev
# Expected: rocep1s0f0 port 1 ==> enp1s0f0np0 (Up)
# Problem:  rocep1s0f1 port 1 ==> enp1s0f1np1 (Down)
```

**Fix:**
```bash
sudo ip link set enp1s0f1np1 up
```
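
To spot down ports without reading the output by eye, the mapping can be parsed. This is a minimal sketch that assumes the line format shown in the diagnosis block above; the helper names are hypothetical and the sample text reuses those two example lines.

```python
import re

# Matches lines such as: "rocep1s0f0 port 1 ==> enp1s0f0np0 (Up)"
LINE_RE = re.compile(r"^(\S+) port (\d+) ==> (\S+) \((\w+)\)")

def parse_ibdev2netdev(output):
    """Return a list of (rdma_device, port, netdev, state) tuples."""
    rows = []
    for line in output.splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            rows.append((m.group(1), int(m.group(2)), m.group(3), m.group(4)))
    return rows

def down_interfaces(output):
    """Return the network interfaces whose state is anything other than Up."""
    return [netdev for _, _, netdev, state in parse_ibdev2netdev(output) if state != "Up"]

sample = """\
rocep1s0f0 port 1 ==> enp1s0f0np0 (Up)
rocep1s0f1 port 1 ==> enp1s0f1np1 (Down)
"""
for netdev in down_interfaces(sample):
    print(f"sudo ip link set {netdev} up")
```

The loop only prints the suggested command; running it is left to the operator.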

### NIXL GPU Registration Fails

**Error:** `ibv_reg_mr failed: Bad address` or `NIXL_ERR_BACKEND`

**Cause:** DGX Spark does not support GPUDirect RDMA. The `nvidia-peermem` module fails to load due to the unified memory architecture.

**Diagnosis:**
```bash
# Check if nvidia-peermem is loaded
lsmod | grep nvidia_peermem

# Try loading it
sudo modprobe nvidia-peermem
# May fail with: modprobe: ERROR: could not insert 'nvidia_peermem': Invalid argument
```

**Workaround:** Fall back to pinned CPU memory when GPU registration fails:
```python
import torch

# `agent` is assumed to be an already-initialized NIXL agent.
try:
    tensor = torch.ones((4096, 4096), dtype=torch.float32, device="cuda:0")
    agent.register_memory(tensor)
except Exception as e:
    print(f"GPU registration failed: {e}")
    # Fall back to page-locked host memory and register that instead.
    tensor = torch.ones((4096, 4096), dtype=torch.float32, device="cpu").pin_memory()
    agent.register_memory(tensor)

### TCP Throughput Near Zero

**Symptom:** `iperf3` shows very low throughput (< 1 Gbps) despite successful connection.

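**Cause:** An MTU mismatch between the endpoints causes oversized frames to be dropped, so TCP retransmits repeatedly and its congestion window stays collapsed. The default MTU (1500) also limits throughput on these links; this tutorial expects 9000 on every interface in the path.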

**Diagnosis:**
```bash
# Check MTU on both endpoints
cat /sys/class/net/bond0/mtu  # Should be 9000
cat /sys/class/net/enp1s0f0np0/mtu

# Look for Cwnd stuck at ~1.4 KB in iperf3 output
```

**Fix:**
```bash
sudo ip link set enp1s0f0np0 mtu 9000
sudo ip link set enp1s0f1np1 mtu 9000
sudo ip link set bond0 mtu 9000  # If bond exists
```
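
The MTU values can also be read from sysfs in one pass. This is a minimal sketch with hypothetical helper names; it reads the same `/sys/class/net/<iface>/mtu` files used in the diagnosis above and treats 9000 as the expected value. Interfaces that do not exist (for example `bond0` when no bond is configured) are reported as `None` rather than raising.

```python
from pathlib import Path

def read_mtus(interfaces, sysfs_root="/sys/class/net"):
    """Return {interface: mtu or None}; None means the interface has no sysfs entry."""
    mtus = {}
    for iface in interfaces:
        mtu_file = Path(sysfs_root) / iface / "mtu"
        mtus[iface] = int(mtu_file.read_text().strip()) if mtu_file.is_file() else None
    return mtus

def mtu_mismatches(mtus, expected=9000):
    """Interfaces that exist but do not report the expected MTU."""
    return {iface: mtu for iface, mtu in mtus.items() if mtu is not None and mtu != expected}

mtus = read_mtus(["enp1s0f0np0", "enp1s0f1np1", "bond0"])
print(mtus)
print("Mismatched:", mtu_mismatches(mtus) or "none")
```

Run it on both nodes; the values must agree end to end, not just locally.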

### NIXL Does Not Use Both Rails

**Symptom:** NIXL throughput is the same in single-rail and dual-rail configurations.

**Diagnosis:**
```bash
# Check UCX device detection
ucx_info -d | grep mlx5
```

**Fix:**
```bash
# Explicitly set devices (set before importing NIXL).
# Use the RDMA device names that `ibdev2netdev` reports on your system.
export UCX_NET_DEVICES=mlx5_0:1,mlx5_1:1
export UCX_TLS=rc_verbs,rc_mlx5
```
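
Inside a notebook, the same variables can be set from Python instead of the shell. The helper below is a hypothetical sketch: it builds the two values shown above from a list of device names and writes them into the current process environment. As with the shell version, it has to run before NIXL is imported.

```python
import os

def configure_ucx_rails(devices, port=1, transports=("rc_verbs", "rc_mlx5")):
    """Set UCX_NET_DEVICES and UCX_TLS for the current process and return them."""
    settings = {
        "UCX_NET_DEVICES": ",".join(f"{dev}:{port}" for dev in devices),
        "UCX_TLS": ",".join(transports),
    }
    os.environ.update(settings)
    return settings

# Device names are examples; substitute the names present on your system.
print(configure_ucx_rails(["mlx5_0", "mlx5_1"]))
```

If NIXL was already imported in the running kernel, restart the kernel and run this cell first.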

### Bond Not Forming

**Symptom:** `cat /proc/net/bonding/bond0` shows an error or empty output.

**Diagnosis:**
```bash
# Check if bonding module is loaded
lsmod | grep bonding

# Check for errors
dmesg | grep -i bond
```

**Fix:**
```bash
sudo modprobe bonding

# Verify bond status
cat /proc/net/bonding/bond0
```
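
Once the file exists, its contents can be summarized with a short parser. This sketch assumes the usual `Key: value` layout of `/proc/net/bonding/bond0`, where the first `MII Status` line belongs to the bond itself and each `Slave Interface` section carries its own. The function name is hypothetical and the sample is an abbreviated, illustrative excerpt rather than captured output.

```python
def parse_bond_status(text):
    """Return (bond_mii_status, {slave: mii_status}) from /proc/net/bonding/<bond> text."""
    bond_status, slaves, current = None, {}, None
    for line in text.splitlines():
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "Slave Interface":
            current = value
            slaves[current] = None
        elif key == "MII Status":
            if current is None:
                bond_status = value          # status of the bond itself
            elif slaves[current] is None:
                slaves[current] = value      # first status line in a slave section
    return bond_status, slaves

# Abbreviated, illustrative sample; the bonding mode shown is an assumption.
sample = """\
Bonding Mode: IEEE 802.3ad Dynamic link aggregation
MII Status: up

Slave Interface: enp1s0f0np0
MII Status: up

Slave Interface: enp1s0f1np1
MII Status: down
"""
print(parse_bond_status(sample))
```

A healthy dual-link bond should list both RoCE interfaces as slaves with status `up`.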

### Transfer Falls Back to TCP

**Symptom:** NIXL debug output shows socket-based transport instead of RDMA.

**Checks:**
1. Verify RDMA devices are active: `ibstat`
2. Check UCX installation includes RDMA: `ucx_info -v | grep verbs`
3. Ensure `libibverbs` is installed: `dpkg -l | grep libibverbs`
4. Verify GIDs point to physical interfaces: `show_gids | grep rocep`

---

## Part 8: Cleanup

In [None]:
# Commands to remove bond (run manually if needed)
print("To remove the bond interface:")
print("""
sudo ip link set bond0 down
sudo ip link set enp1s0f0np0 nomaster
sudo ip link set enp1s0f1np1 nomaster
sudo ip link delete bond0

# Restore individual interface IPs if needed (use this node's own addresses)
sudo ip addr add 192.168.100.10/24 dev enp1s0f0np0
sudo ip addr add 192.168.200.10/24 dev enp1s0f1np1
""")

---

## References

- [RoCE Link Aggregation Tutorial (Markdown)](02_Multi_Rail_Tutorial.md)
- [NCCL and RDMA Benchmarks (First Tutorial)](01_InfiniBand_Tutorial.ipynb)
- [NIXL GitHub Repository](https://github.com/ai-dynamo/nixl)
- [Linux Kernel Bonding Documentation](https://www.kernel.org/doc/Documentation/networking/bonding.txt)