# MicroK8s Cluster Setup: CPU Controller + DGX Spark Nodes

This tutorial walks through setting up a MicroK8s Kubernetes cluster with:

| Node | Role | IP Address | Description |
|------|------|------------|-------------|
| controller | Control Plane | 192.168.1.75 | CPU-only node running K8s control plane |
| spark-01 | Worker | 192.168.1.76 | DGX Spark with GPU |
| spark-02 | Worker | 192.168.1.77 | DGX Spark with GPU |

All nodes are connected via WiFi and have the `nvidia` user configured.

## Prerequisites

- Ubuntu 22.04 or later on all nodes
- SSH access from your workstation to all nodes
- `nvidia` user with sudo privileges on all nodes
- Network connectivity between all nodes (WiFi in this case)

## Step 1: Define Cluster Variables

Store node information in environment variables for use throughout this tutorial.

In [1]:
import os

# Cluster configuration
CONTROLLER_IP = "192.168.1.75"
SPARK01_IP = "192.168.1.76"
SPARK02_IP = "192.168.1.77"
SSH_USER = "nvidia"

# Store as environment variables for shell commands
os.environ["CONTROLLER_IP"] = CONTROLLER_IP
os.environ["SPARK01_IP"] = SPARK01_IP
os.environ["SPARK02_IP"] = SPARK02_IP
os.environ["SSH_USER"] = SSH_USER

print(f"Controller: {SSH_USER}@{CONTROLLER_IP}")
print(f"Spark-01:   {SSH_USER}@{SPARK01_IP}")
print(f"Spark-02:   {SSH_USER}@{SPARK02_IP}")

Controller: nvidia@192.168.1.75
Spark-01:   nvidia@192.168.1.76
Spark-02:   nvidia@192.168.1.77
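
Most later cells repeat these three addresses by hand. As a minimal sketch, a small helper can build the `ssh` argument list from a single inventory, so each address lives in one place. The names `NODES` and `ssh_argv` are illustrative assumptions and are not used elsewhere in this notebook.

```python
import shlex

# Hypothetical inventory mirroring the node table at the top of this notebook.
NODES = {
    "controller": "192.168.1.75",
    "spark-01": "192.168.1.76",
    "spark-02": "192.168.1.77",
}

def ssh_argv(node, remote_cmd, user="nvidia", timeout=5):
    """Build the argv list for running remote_cmd on a named node."""
    return [
        "ssh",
        "-o", f"ConnectTimeout={timeout}",
        "-o", "StrictHostKeyChecking=accept-new",
        f"{user}@{NODES[node]}",
        remote_cmd,
    ]

# Show the command that would run; pass the list to subprocess.run to execute it.
print(shlex.join(ssh_argv("spark-01", "hostname")))
```

Passing a list to `subprocess.run` avoids a local shell, so only the remote side interprets `remote_cmd`.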


### Fix: Initialize SSH Agent in Kernel

The Jupyter kernel runs in a separate process without access to your terminal's SSH agent. Run this cell once to start the agent and load your key within the notebook environment.

In [2]:
import subprocess
import os

# Start SSH agent and capture its output
result = subprocess.run(
    ['ssh-agent', '-s'],
    capture_output=True,
    text=True
)

# Parse and set environment variables
for line in result.stdout.split('\n'):
    if 'SSH_AUTH_SOCK' in line:
        sock = line.split(';')[0].split('=')[1]
        os.environ['SSH_AUTH_SOCK'] = sock
        print(f"SSH_AUTH_SOCK={sock}")
    elif 'SSH_AGENT_PID' in line:
        pid = line.split(';')[0].split('=')[1]
        os.environ['SSH_AGENT_PID'] = pid
        print(f"SSH_AGENT_PID={pid}")

# Add the SSH key
add_result = subprocess.run(
    ['ssh-add', os.path.expanduser('~/.ssh/id_ed25519')],
    capture_output=True,
    text=True
)
print(add_result.stdout or add_result.stderr)

# Write environment to a file that bash cells can source
with open('/tmp/ssh_agent_env.sh', 'w') as f:
    f.write(f'export SSH_AUTH_SOCK={os.environ["SSH_AUTH_SOCK"]}\n')
    f.write(f'export SSH_AGENT_PID={os.environ["SSH_AGENT_PID"]}\n')
print("\nSSH agent environment saved to /tmp/ssh_agent_env.sh")

SSH_AUTH_SOCK=/tmp/ssh-ahxtl4IU5sgQ/agent.1337666
SSH_AGENT_PID=1337667
Identity added: /home/nvidia/.ssh/id_ed25519 (email2eliza@gmail.com)


SSH agent environment saved to /tmp/ssh_agent_env.sh
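
The parsing loop above depends on the exact line layout of the `ssh-agent -s` output. As a hedged alternative, the sketch below extracts both variables with a regular expression and fails loudly if either is missing. The function name `parse_agent_output` is an assumption, and the sample text reuses the values printed above.

```python
import re

def parse_agent_output(text):
    """Extract SSH_AUTH_SOCK and SSH_AGENT_PID from `ssh-agent -s` output."""
    env = {}
    for name in ("SSH_AUTH_SOCK", "SSH_AGENT_PID"):
        match = re.search(rf"^{name}=([^;\s]+);", text, flags=re.MULTILINE)
        if match is None:
            raise ValueError(f"{name} not found in ssh-agent output")
        env[name] = match.group(1)
    return env

# Sample text in the Bourne-shell format that `ssh-agent -s` prints.
sample = (
    "SSH_AUTH_SOCK=/tmp/ssh-ahxtl4IU5sgQ/agent.1337666; export SSH_AUTH_SOCK;\n"
    "SSH_AGENT_PID=1337667; export SSH_AGENT_PID;\n"
    "echo Agent pid 1337667;\n"
)
print(parse_agent_output(sample))
```

Raising early makes a failed agent start visible here, rather than as a `KeyError` when the environment file is written.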


## Step 2: Test SSH Connectivity

Verify SSH access to all nodes. Each command should return the hostname without prompting for a password.

In [3]:
%%bash
# Source the SSH agent environment
source /tmp/ssh_agent_env.sh

# Use StrictHostKeyChecking=accept-new to auto-accept new host keys
SSH_OPTS="-o ConnectTimeout=5 -o StrictHostKeyChecking=accept-new"

echo "Testing SSH to Controller (192.168.1.75)..."
ssh $SSH_OPTS nvidia@192.168.1.75 "hostname" 2>&1 || echo "FAILED: Cannot connect to controller"

echo ""
echo "Testing SSH to Spark-01 (192.168.1.76)..."
ssh $SSH_OPTS nvidia@192.168.1.76 "hostname" 2>&1 || echo "FAILED: Cannot connect to spark-01"

echo ""
echo "Testing SSH to Spark-02 (192.168.1.77)..."
ssh $SSH_OPTS nvidia@192.168.1.77 "hostname" 2>&1 || echo "FAILED: Cannot connect to spark-02"

Testing SSH to Controller (192.168.1.75)...
controller


### Diagnosing SSH Environment

The notebook kernel may run in a different environment than your terminal. Let's check:

In [4]:
%%bash
echo "=== SSH Environment Check ==="
echo ""
echo "Running as user: $(whoami)"
echo "Home directory: $HOME"
echo ""
echo "SSH keys available:"
ls -la ~/.ssh/id_* 2>/dev/null || echo "No SSH keys found in ~/.ssh/"
echo ""
echo "SSH agent status:"
echo "SSH_AUTH_SOCK: ${SSH_AUTH_SOCK:-NOT SET}"
ssh-add -l 2>&1 || echo "No agent running or no keys loaded"

=== SSH Environment Check ===

Running as user: nvidia
Home directory: /home/nvidia

SSH keys available:
-rw------- 1 nvidia nvidia 411 Jan 25 01:19 /home/nvidia/.ssh/id_ed25519
-rw-r--r-- 1 nvidia nvidia  99 Jan 25 01:15 /home/nvidia/.ssh/id_ed25519.pub

SSH agent status:
SSH_AUTH_SOCK: /tmp/ssh-ahxtl4IU5sgQ/agent.1337666
256 SHA256:na0tGgsozbGtZ2nM52FTdk7No5zpLE5r4iaZE0U2zyQ email2eliza@gmail.com (ED25519)


### Fix: Set Up SSH Key-Based Authentication

If SSH fails, set up key-based (passwordless) authentication. The diagnostic cell above lists any keys that already exist under `~/.ssh/`.

**If no key exists**, run this cell to generate one (skip it if you already have a key):

In [None]:
%%bash
# Generate SSH key (only run if you don't have one)
ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519 -N "" -C "microk8s-cluster"
echo "Key generated:"
cat ~/.ssh/id_ed25519.pub

### Copy SSH Key to All Nodes

Run `ssh-copy-id` for each node. This will prompt for the password once per node.

**Run these commands in a terminal** (they require interactive password input):

```bash
# Copy to controller
ssh-copy-id nvidia@192.168.1.75

# Copy to spark-01
ssh-copy-id nvidia@192.168.1.76

# Copy to spark-02
ssh-copy-id nvidia@192.168.1.77
```

### Verify SSH Access

After copying the keys, re-run this test from your workstation. Each node should return its hostname without prompting for a password:

In [5]:
%%bash
echo "=== Testing SSH Connectivity ==="
echo ""

echo "Controller (192.168.1.75):"
ssh -o ConnectTimeout=5 -o BatchMode=yes nvidia@192.168.1.75 "hostname" && echo "✓ SUCCESS" || echo "✗ FAILED"
echo ""

echo "Spark-01 (192.168.1.76):"
ssh -o ConnectTimeout=5 -o BatchMode=yes nvidia@192.168.1.76 "hostname" && echo "✓ SUCCESS" || echo "✗ FAILED"
echo ""

echo "Spark-02 (192.168.1.77):"
ssh -o ConnectTimeout=5 -o BatchMode=yes nvidia@192.168.1.77 "hostname" && echo "✓ SUCCESS" || echo "✗ FAILED"

=== Testing SSH Connectivity ===

Controller (192.168.1.75):
controller
✓ SUCCESS


## Step 3: Install MicroK8s on All Nodes

MicroK8s is a lightweight Kubernetes distribution from Canonical. We'll install it on all three nodes, then join the Spark nodes to the controller.

**Architecture:**
- Controller (192.168.1.75): Runs the Kubernetes control plane only
- Spark-01 (192.168.1.76): Worker node with GPU
- Spark-02 (192.168.1.77): Worker node with GPU

### 3.1 Install MicroK8s on the Controller

The controller runs the control plane (API server, scheduler, controller manager, and the datastore, which in MicroK8s is dqlite by default rather than etcd). No GPU is needed here.

In [None]:
%%bash
source /tmp/ssh_agent_env.sh
SSH_OPTS="-o StrictHostKeyChecking=accept-new"

echo "=== Installing MicroK8s on Controller (192.168.1.75) ==="
ssh $SSH_OPTS nvidia@192.168.1.75 << 'EOF'
    # Check if MicroK8s is already installed
    if snap list microk8s &>/dev/null; then
        echo "MicroK8s is already installed. Skipping installation."
        microk8s version
    else
        echo "Installing MicroK8s..."
        sudo snap install microk8s --classic --channel=1.31/stable
        # Add user to microk8s group (takes effect at the next login)
        sudo usermod -a -G microk8s $USER
        echo "NOTE: Group membership updated. Running remaining commands with sudo."
    fi
    
    # Ensure .kube directory exists
    mkdir -p ~/.kube
    
    # Wait for MicroK8s to be ready (sudo, because the new group is not active in this session)
    sudo microk8s status --wait-ready
    
    echo ""
    echo "MicroK8s version:"
    sudo microk8s kubectl version --short 2>/dev/null || sudo microk8s kubectl version
EOF

### 3.2 Install MicroK8s on Spark-01

First GPU worker node. The installation mirrors the controller's; the node joins the cluster in Step 4.

In [None]:
%%bash
source /tmp/ssh_agent_env.sh
SSH_OPTS="-o StrictHostKeyChecking=accept-new"

echo "=== Installing MicroK8s on Spark-01 (192.168.1.76) ==="
ssh $SSH_OPTS nvidia@192.168.1.76 'bash -s' << 'EOF'
    # Check if MicroK8s snap is installed and microk8s command works
    if snap list microk8s &>/dev/null && /snap/bin/microk8s version &>/dev/null; then
        echo "MicroK8s is already installed and working. Skipping installation."
        /snap/bin/microk8s version
    else
        echo "MicroK8s not found or broken. Installing fresh..."
        # Remove any broken installation first
        sudo snap remove microk8s --purge 2>/dev/null || true
        
        echo "Installing MicroK8s..."
        sudo snap install microk8s --classic --channel=1.31/stable
        
        # Add user to microk8s group
        sudo usermod -a -G microk8s $USER
        
        echo "NOTE: Group membership updated. Running remaining commands with sudo."
    fi
    
    # Ensure .kube directory exists
    mkdir -p ~/.kube
    
    # Wait for MicroK8s to be ready
    sudo /snap/bin/microk8s status --wait-ready
    
    echo ""
    echo "MicroK8s version:"
    sudo /snap/bin/microk8s kubectl version --short 2>/dev/null || sudo /snap/bin/microk8s kubectl version
EOF

### 3.3 Install MicroK8s on Spark-02

Second GPU worker node.

In [None]:
%%bash
source /tmp/ssh_agent_env.sh
SSH_OPTS="-o StrictHostKeyChecking=accept-new"

echo "=== Installing MicroK8s on Spark-02 (192.168.1.77) ==="
ssh $SSH_OPTS nvidia@192.168.1.77 'bash -s' << 'EOF'
    # Check if MicroK8s snap is installed and microk8s command works
    if snap list microk8s &>/dev/null && /snap/bin/microk8s version &>/dev/null; then
        echo "MicroK8s is already installed and working. Skipping installation."
        /snap/bin/microk8s version
    else
        echo "MicroK8s not found or broken. Installing fresh..."
        # Remove any broken installation first
        sudo snap remove microk8s --purge 2>/dev/null || true
        
        echo "Installing MicroK8s..."
        sudo snap install microk8s --classic --channel=1.31/stable
        
        # Add user to microk8s group
        sudo usermod -a -G microk8s $USER
        
        echo "NOTE: Group membership updated. Running remaining commands with sudo."
    fi
    
    # Ensure .kube directory exists
    mkdir -p ~/.kube
    
    # Wait for MicroK8s to be ready
    sudo /snap/bin/microk8s status --wait-ready
    
    echo ""
    echo "MicroK8s version:"
    sudo /snap/bin/microk8s kubectl version --short 2>/dev/null || sudo /snap/bin/microk8s kubectl version
EOF

## Step 4: Form the Kubernetes Cluster

Now that MicroK8s is installed on all nodes, we need to join the worker nodes to the controller.

The process:
1. Generate a join token on the controller
2. Use that token on each worker node
3. Verify all nodes are connected

### 4.1 Generate Join Token on Controller

This command generates a one-time token that workers will use to join the cluster.

In [None]:
%%bash
source /tmp/ssh_agent_env.sh
SSH_OPTS="-o StrictHostKeyChecking=accept-new"

echo "=== Generating Join Token on Controller ==="
ssh $SSH_OPTS nvidia@192.168.1.75 'microk8s add-node' | tee /tmp/join_token.txt

echo "Token generated. Extract the join command for workers."
echo ""

### 4.2 Extract Join Command

Parse the join token output to get the actual command. The token expires after some time, so complete the join process promptly.

In [None]:
%%bash
# Extract the first join command from the add-node output; --worker is appended below
JOIN_CMD=$(grep "microk8s join" /tmp/join_token.txt | head -1 | sed 's/^[[:space:]]*//')

if [ -z "$JOIN_CMD" ]; then
    echo "ERROR: Could not extract join command"
    exit 1
fi

echo "Join command for workers:"
echo "$JOIN_CMD --worker"
echo ""
echo "Saving to /tmp/join_cmd.txt"
echo "$JOIN_CMD --worker" > /tmp/join_cmd.txt
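
The same extraction can be done in Python, which makes it easier to pin the match to the controller's address. This is a minimal sketch: `extract_join_url` and `worker_join_command` are hypothetical names, and the sample text only imitates the general shape of `microk8s add-node` output with placeholder token values.

```python
import re

def extract_join_url(add_node_output, controller_ip):
    """Return the first `<ip>:<port>/<token>` join target for controller_ip."""
    pattern = re.compile(
        r"microk8s join (" + re.escape(controller_ip) + r":\d+/\S+)"
    )
    match = pattern.search(add_node_output)
    if match is None:
        raise ValueError("no join command found for " + controller_ip)
    return match.group(1)

def worker_join_command(add_node_output, controller_ip):
    """Build a worker join command, adding --worker exactly once."""
    url = extract_join_url(add_node_output, controller_ip)
    return f"microk8s join {url} --worker"

# Illustrative output with placeholder token values (not real credentials).
sample = """From the node you wish to join to this cluster, run the following:
microk8s join 192.168.1.75:25000/EXAMPLETOKEN/EXAMPLEHASH

Use the '--worker' flag to join a node as a worker, eg:
microk8s join 192.168.1.75:25000/EXAMPLETOKEN/EXAMPLEHASH --worker
"""
print(worker_join_command(sample, "192.168.1.75"))
```

Capturing only the `<ip>:<port>/<token>` part avoids a doubled `--worker` flag when the matched line already carries one.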

### 4.3 Join Spark-01 to Cluster

Execute the join command on spark-01. The `--worker` flag ensures it only runs workloads, not control plane components.

In [None]:
%%bash
source /tmp/ssh_agent_env.sh
SSH_OPTS="-o StrictHostKeyChecking=accept-new"

echo "=== Joining Spark-01 (192.168.1.76) to Cluster ==="

JOIN_CMD=$(cat /tmp/join_cmd.txt)
ssh $SSH_OPTS nvidia@192.168.1.76 "sudo $JOIN_CMD"

echo ""
echo "Spark-01 join initiated. Waiting 30 seconds for the node to appear..."
sleep 30

### 4.4 Join Spark-02 to Cluster

Join the second GPU worker node. Each worker needs its own join command (tokens are consumed after use).

In [None]:
import subprocess
import os
import re
import time

# Load SSH agent environment
with open('/tmp/ssh_agent_env.sh') as f:
    for line in f:
        if '=' in line and line.startswith('export'):
            key, val = line.replace('export ', '').strip().split('=', 1)
            os.environ[key] = val

SSH_OPTS = ['-o', 'StrictHostKeyChecking=accept-new']

print("=== Generating new token for Spark-02 ===")
result = subprocess.run(
    ['ssh'] + SSH_OPTS + ['nvidia@192.168.1.75', 'microk8s add-node'],
    capture_output=True, text=True
)
print(result.stdout)
if result.stderr:
    print("STDERR:", result.stderr)

# Extract join command
join_cmd = None
for line in result.stdout.split('\n'):
    if 'microk8s join' in line and '192.168.1.75:25000' in line:
        join_cmd = line.strip()
        break

if not join_cmd:
    print("ERROR: Could not extract join command")
else:
    print(f"\nExtracted: {join_cmd}")
    
    print("\n=== Joining Spark-02 (192.168.1.77) to Cluster ===")
    join_result = subprocess.run(
        ['ssh'] + SSH_OPTS + ['nvidia@192.168.1.77', f'sudo {join_cmd} --worker'],
        capture_output=True, text=True
    )
    print(join_result.stdout)
    if join_result.stderr:
        print("STDERR:", join_result.stderr)
    
    print("\nSpark-02 join initiated. Waiting 30 seconds...")
    time.sleep(30)
    print("Done.")

### 4.5 Verify Cluster Nodes

Check that all three nodes are visible and in Ready status.

In [None]:
%%bash
source /tmp/ssh_agent_env.sh
SSH_OPTS="-o StrictHostKeyChecking=accept-new"
echo "=== Cluster Node Status ==="
ssh $SSH_OPTS nvidia@192.168.1.75 'sudo microk8s kubectl get nodes -o wide'

echo ""
echo "Expected: 3 nodes (controller, spark-01, spark-02) all in Ready status"
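
A fixed 30-second sleep may be too short on a slow WiFi link. One hedged alternative is to fetch `kubectl get nodes -o json` from the controller and check each node's `Ready` condition. The helper name `not_ready_nodes` is an assumption, and the sample is a trimmed stand-in for real output.

```python
import json

def not_ready_nodes(nodes_json):
    """Return names of nodes whose Ready condition is not "True"."""
    pending = []
    for item in json.loads(nodes_json)["items"]:
        conditions = item.get("status", {}).get("conditions", [])
        ready = any(
            c.get("type") == "Ready" and c.get("status") == "True"
            for c in conditions
        )
        if not ready:
            pending.append(item["metadata"]["name"])
    return pending

# Trimmed, illustrative stand-in for `kubectl get nodes -o json`.
sample = json.dumps({
    "items": [
        {"metadata": {"name": "controller"},
         "status": {"conditions": [{"type": "Ready", "status": "True"}]}},
        {"metadata": {"name": "spark-02"},
         "status": {"conditions": [{"type": "Ready", "status": "False"}]}},
    ]
})
print(not_ready_nodes(sample))
```

Calling this in a loop until the list is empty, with an overall timeout, replaces guessing how long a join takes.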

## Step 5: Install NVIDIA GPU Operator

The GPU Operator automates the deployment of all NVIDIA software components needed for GPU support in Kubernetes:

| Component | Purpose |
|-----------|---------|
| NVIDIA Driver | GPU device driver (if not already installed) |
| NVIDIA Container Toolkit | Enables GPU access in containers |
| NVIDIA Device Plugin | Exposes GPUs as schedulable resources |
| DCGM Exporter | Metrics for monitoring GPU utilization |
| GPU Feature Discovery | Labels nodes with GPU properties |

This is production-grade GPU support, not just a basic device plugin.

### 5.1 Add Helm and NVIDIA Helm Repository

The GPU Operator is distributed via Helm chart. First, enable Helm in MicroK8s and add the NVIDIA repository.

In [None]:
%%bash
echo "=== Enabling Helm in MicroK8s ==="
ssh nvidia@192.168.1.75 << 'EOF'
    sudo microk8s enable helm3
    sudo microk8s kubectl create namespace gpu-operator || true
    
    echo ""
    echo "Adding NVIDIA Helm repository..."
    sudo microk8s helm3 repo add nvidia https://helm.ngc.nvidia.com/nvidia
    sudo microk8s helm3 repo update
    
    echo ""
    echo "Helm and NVIDIA repo configured."
EOF

### 5.2 Install GPU Operator

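Deploy the GPU Operator with `driver.enabled=false`, since the DGX Spark nodes already have NVIDIA drivers installed. MicroK8s keeps its containerd configuration and socket under `/var/snap/microk8s/`, so if the toolkit pods fail to configure the runtime, check NVIDIA's GPU Operator documentation for the MicroK8s-specific `toolkit.env` settings.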

In [None]:
%%bash
echo "=== Installing NVIDIA GPU Operator ==="
ssh nvidia@192.168.1.75 << 'EOF'
    sudo microk8s helm3 install gpu-operator nvidia/gpu-operator \
        --namespace gpu-operator \
        --set driver.enabled=false \
        --set toolkit.enabled=true \
        --wait \
        --timeout 10m
    
    echo ""
    echo "GPU Operator installed. Waiting for pods to be ready..."
    sleep 30
EOF

### 5.3 Verify GPU Operator Pods

Check that all GPU Operator components are running on the GPU nodes.

In [None]:
%%bash
echo "=== GPU Operator Pods ==="
ssh nvidia@192.168.1.75 'sudo microk8s kubectl get pods -n gpu-operator -o wide'

echo ""
echo "Expected: device-plugin, dcgm-exporter, and other operator pods running on spark-01 and spark-02"

### 5.4 Verify GPU Resources Are Visible

Check that GPUs are now exposed as allocatable resources on worker nodes.

In [None]:
%%bash
echo "=== GPU Resources on Nodes ==="
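# List each node with its allocatable GPU count (the dot in nvidia.com must be escaped)
ssh nvidia@192.168.1.75 'sudo microk8s kubectl get nodes "-o=custom-columns=NODE:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu"'

echo ""
echo "Each DGX Spark node should show a GPU count; the CPU-only controller should show <none>"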
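
For scripted checks, the allocatable GPU count can also be read from `kubectl get nodes -o json`. The sketch below assumes that JSON has already been fetched from the controller. `gpu_allocatable` is a hypothetical helper, and the counts in the sample are illustrative rather than measured.

```python
import json

def gpu_allocatable(nodes_json):
    """Map node name -> allocatable nvidia.com/gpu count (0 if absent)."""
    counts = {}
    for item in json.loads(nodes_json)["items"]:
        alloc = item.get("status", {}).get("allocatable", {})
        # Kubernetes reports extended resources as strings, e.g. "1".
        counts[item["metadata"]["name"]] = int(alloc.get("nvidia.com/gpu", "0"))
    return counts

# Illustrative sample; real output carries many more fields per node.
sample = json.dumps({
    "items": [
        {"metadata": {"name": "controller"},
         "status": {"allocatable": {"cpu": "8"}}},
        {"metadata": {"name": "spark-01"},
         "status": {"allocatable": {"cpu": "20", "nvidia.com/gpu": "1"}}},
    ]
})
print(gpu_allocatable(sample))
```

A worker that reports 0 here usually means the device plugin pod on that node is not running yet.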

## Step 6: Test GPU Access

Deploy a simple GPU test pod to verify that containers can access GPUs.

In [None]:
%%bash
echo "=== Creating GPU Test Pod ==="
ssh nvidia@192.168.1.75 << 'EOF'
cat <<'YAML' | sudo microk8s kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda-test
    image: nvidia/cuda:12.0.0-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
YAML

echo "Waiting for pod to complete..."
# The pod runs nvidia-smi once and exits, so wait for the Succeeded phase rather than Ready
sudo microk8s kubectl wait --for=jsonpath='{.status.phase}'=Succeeded pod/gpu-test --timeout=120s || true

echo ""
echo "=== GPU Test Output ==="
sudo microk8s kubectl logs gpu-test
EOF

## Step 7: Deploy vLLM for Inference

Now that the cluster is running with GPU support, deploy vLLM to serve LLM inference requests.

**Test Plan:**
1. Single-node baseline: Deploy Llama 3.1 8B on one GPU
2. Measure baseline throughput (tokens/sec)
3. Deploy distributed vLLM with tensor parallelism across both nodes
4. Compare performance and validate that the InfiniBand/RDMA link matters

### 7.1 Deploy vLLM Single-Node Baseline

Start with a single-GPU deployment to establish baseline performance.

In [None]:
%%bash
echo "=== Deploying vLLM Single-Node (Llama 3.1 8B) ==="
ssh nvidia@192.168.1.75 << 'EOF'
cat <<'YAML' | sudo microk8s kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: vllm-single
  labels:
    app: vllm-single
spec:
  containers:
  - name: vllm
    image: vllm/vllm-openai:latest
    command:
      - python3
      - -m
      - vllm.entrypoints.openai.api_server
      - --model
      - meta-llama/Llama-3.1-8B-Instruct
      - --host
      - "0.0.0.0"
      - --port
      - "8000"
    ports:
    - containerPort: 8000
      name: http
    resources:
      limits:
        nvidia.com/gpu: 1
    env:
    - name: HUGGING_FACE_HUB_TOKEN
      value: "YOUR_HF_TOKEN_HERE"
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-single-svc
spec:
  selector:
    app: vllm-single
  ports:
  - port: 8000
    targetPort: 8000
  type: ClusterIP
YAML

echo ""
echo "vLLM single-node pod created. Waiting for model download and startup..."
echo "This may take several minutes on first run."
EOF

### 7.2 Monitor vLLM Startup

Watch the pod logs to see when the model is loaded and ready to serve requests.

In [None]:
%%bash
ssh nvidia@192.168.1.75 << 'EOF'
echo "=== vLLM Pod Status ==="
sudo microk8s kubectl get pod vllm-single -o wide

echo ""
echo "=== Last 50 Lines of Logs ==="
sudo microk8s kubectl logs vllm-single --tail=50 || echo "Pod not ready yet"
EOF
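
Model download and load can take minutes, so polling is more reliable than re-running the log cell by hand. The sketch below is one possible approach: `wait_until` is a generic polling helper, and `models_endpoint_up` probes the OpenAI-compatible `/v1/models` route. Both names are assumptions, and the base URL must be an address reachable from wherever this code runs.

```python
import time
import urllib.error
import urllib.request

def wait_until(check, timeout_sec=900, interval_sec=10,
               sleep=time.sleep, clock=time.monotonic):
    """Call check() until it returns True or timeout_sec has elapsed."""
    deadline = clock() + timeout_sec
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_sec)

def models_endpoint_up(base_url):
    """True if GET {base_url}/v1/models answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/v1/models", timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

# Example call (not executed here; substitute a reachable address):
# ready = wait_until(lambda: models_endpoint_up("http://<service-address>:8000"))
```

The `sleep` and `clock` parameters exist only so the loop can be exercised without real waiting.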

### 7.3 Test vLLM Endpoint

Send a test request to the vLLM OpenAI-compatible API.

In [None]:
%%bash
ssh nvidia@192.168.1.75 << 'EOF'
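echo "=== Testing vLLM Inference ==="
# The Service name resolves only inside the cluster, so look up its ClusterIP first
SVC_IP=$(sudo microk8s kubectl get svc vllm-single-svc -o jsonpath='{.spec.clusterIP}')
curl -s -X POST "http://${SVC_IP}:8000/v1/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "prompt": "Explain RDMA in one sentence:",
    "max_tokens": 50,
    "temperature": 0.7
  }' | python3 -m json.tool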
EOF

### 7.4 Benchmark Single-Node Performance

Use a simple benchmark script to measure tokens per second.

In [None]:
import time
import requests
import json

def benchmark_vllm(endpoint, num_requests=10, prompt="Explain distributed systems:", max_tokens=100):
    """Simple throughput benchmark for vLLM"""
    
    results = []
    
    print(f"Running {num_requests} requests...")
    for i in range(num_requests):
        start = time.time()
        
        response = requests.post(
            f"{endpoint}/v1/completions",
            json={
                "model": "meta-llama/Llama-3.1-8B-Instruct",
                "prompt": prompt,
                "max_tokens": max_tokens,
                "temperature": 0.7
            }
        )
        
        elapsed = time.time() - start
        
        if response.status_code == 200:
            data = response.json()
            tokens = data['usage']['completion_tokens']
            tokens_per_sec = tokens / elapsed if elapsed > 0 else 0
            
            results.append({
                'request': i + 1,
                'tokens': tokens,
                'time_sec': elapsed,
                'tokens_per_sec': tokens_per_sec
            })
            
            print(f"  Request {i+1}: {tokens} tokens in {elapsed:.2f}s ({tokens_per_sec:.1f} tok/s)")
        else:
            print(f"  Request {i+1}: ERROR {response.status_code}")
    
    # Calculate statistics
    if results:
        avg_tokens_per_sec = sum(r['tokens_per_sec'] for r in results) / len(results)
        total_tokens = sum(r['tokens'] for r in results)
        total_time = sum(r['time_sec'] for r in results)
        
        print(f"\n=== Single-Node Baseline Results ===")
        print(f"Total requests: {len(results)}")
        print(f"Total tokens: {total_tokens}")
        print(f"Total time: {total_time:.2f}s")
        print(f"Average throughput: {avg_tokens_per_sec:.1f} tokens/sec")
        
        return avg_tokens_per_sec
    
    return 0

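# Run benchmark. The Service DNS name resolves only inside the cluster, so replace
# the endpoint with an address reachable from this kernel (for example the
# Service's ClusterIP when the notebook runs on a cluster node).
# endpoint = "http://vllm-single-svc:8000"
# baseline_throughput = benchmark_vllm(endpoint)

print("NOTE: Update endpoint URL and uncomment to run benchmark")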
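
The benchmark above reports the mean of per-request rates. Total tokens divided by total time is another common figure, and the two can differ when request lengths vary. The sketch below compares them on invented numbers; the function names are illustrative and the values are not measurements from this cluster.

```python
def mean_of_rates(results):
    """Average of each request's own tokens/sec (what the benchmark reports)."""
    return sum(r["tokens"] / r["time_sec"] for r in results) / len(results)

def aggregate_rate(results):
    """Total tokens divided by total wall time across all requests."""
    total_tokens = sum(r["tokens"] for r in results)
    total_time = sum(r["time_sec"] for r in results)
    return total_tokens / total_time

# Invented numbers chosen to make the difference visible.
results = [
    {"tokens": 100, "time_sec": 2.0},  # 50 tok/s
    {"tokens": 20, "time_sec": 1.0},   # 20 tok/s
]
print(f"mean of per-request rates: {mean_of_rates(results):.1f} tok/s")  # 35.0
print(f"aggregate rate:            {aggregate_rate(results):.1f} tok/s")  # 40.0
```

When comparing single-node and distributed runs, use the same definition in both.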

## Step 8: Deploy vLLM with Tensor Parallelism

Deploy vLLM distributed across both DGX Spark nodes using tensor parallelism. This requires:
- Ray cluster for coordination
- NCCL over your InfiniBand/RoCE link for GPU-to-GPU communication
- Larger model that benefits from distribution (Llama 3.1 70B)

### 8.1 Label GPU Nodes

Add node labels to schedule distributed vLLM pods correctly.

In [None]:
%%bash
ssh nvidia@192.168.1.75 << 'EOF'
echo "=== Labeling GPU Nodes ==="
sudo microk8s kubectl label node spark-01 nvidia.com/gpu.present=true --overwrite
sudo microk8s kubectl label node spark-02 nvidia.com/gpu.present=true --overwrite

echo ""
sudo microk8s kubectl get nodes --show-labels | grep "nvidia.com/gpu"
EOF

### 8.2 Notes on Distributed vLLM Deployment

Distributed vLLM deployment requires additional setup:

**Option 1: KubeRay Operator**
- Deploy KubeRay operator to manage Ray clusters
- Create RayCluster resource with worker nodes on both Spark nodes
- Deploy vLLM with `--tensor-parallel-size=2`

**Option 2: Manual Multi-Pod Deployment**
- StatefulSet with pod affinity to pin to specific nodes
- Shared storage for model weights (NFS or similar)
- NCCL configuration to use InfiniBand

**Key Configuration:**
```bash
# vLLM command for tensor parallelism
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 2 \
  --host 0.0.0.0 \
  --port 8000
```

**NCCL Environment Variables for InfiniBand:**
```yaml
- name: NCCL_IB_DISABLE
  value: "0"
- name: NCCL_SOCKET_IFNAME
  value: "ib0"  # replace with your interface name (e.g. enp1s0f0np0, see Next Steps)
- name: NCCL_DEBUG
  value: "INFO"
```

This is the critical link to your InfiniBand article: NCCL will use RDMA over your high-speed interconnect.

## Step 9: Add Monitoring with Prometheus and DCGM

Monitor GPU utilization and inference metrics using Prometheus and NVIDIA DCGM Exporter.

### 9.1 Enable Prometheus in MicroK8s

MicroK8s ships Prometheus as part of its `observability` addon (the older `prometheus` addon name is deprecated in recent releases).

In [None]:
%%bash
ssh nvidia@192.168.1.75 << 'EOF'
echo "=== Enabling Prometheus ==="
sudo microk8s enable observability

echo ""
echo "Waiting for Prometheus pods to start..."
sleep 30
sudo microk8s kubectl get pods -n observability
EOF

### 9.2 Verify DCGM Exporter

The GPU Operator includes DCGM Exporter, which exposes GPU metrics to Prometheus.

In [None]:
%%bash
ssh nvidia@192.168.1.75 << 'EOF'
echo "=== DCGM Exporter Pods ==="
sudo microk8s kubectl get pods -n gpu-operator | grep dcgm

echo ""
echo "=== Sample GPU Metrics ==="
# Get one DCGM exporter pod
DCGM_POD=$(sudo microk8s kubectl get pods -n gpu-operator -l app=nvidia-dcgm-exporter -o jsonpath='{.items[0].metadata.name}')

if [ -n "$DCGM_POD" ]; then
    echo "Fetching metrics from $DCGM_POD..."
    sudo microk8s kubectl exec -n gpu-operator $DCGM_POD -- curl -s localhost:9400/metrics | grep "DCGM_FI_DEV_GPU_UTIL" | head -5
else
    echo "DCGM Exporter not found"
fi
EOF
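
The exporter serves metrics in the Prometheus text format, so the `grep` above can be replaced with a small parser when the values are needed in Python. This is a sketch under assumptions: `gpu_utilization` is a hypothetical helper, the sample lines are illustrative, and label values containing `}` are not handled.

```python
import re

# Matches `name{labels} value` or `name value` lines of the text exposition format.
METRIC_LINE = re.compile(
    r"^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)"
)

def gpu_utilization(metrics_text, metric="DCGM_FI_DEV_GPU_UTIL"):
    """Return {gpu_label: value} for the given metric name."""
    util = {}
    for line in metrics_text.splitlines():
        if line.startswith("#"):
            continue  # skip HELP and TYPE comments
        m = METRIC_LINE.match(line)
        if not m or m.group("name") != metric:
            continue
        labels = dict(re.findall(r'(\w+)="([^"]*)"', m.group("labels") or ""))
        util[labels.get("gpu", "?")] = float(m.group("value"))
    return util

# Illustrative sample, not captured from a real exporter.
sample = """# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",Hostname="spark-01"} 37
DCGM_FI_DEV_GPU_TEMP{gpu="0",Hostname="spark-01"} 52
"""
print(gpu_utilization(sample))
```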

## Next Steps

This notebook established the foundation:

| Component | Status |
|-----------|--------|
| 3-node MicroK8s cluster | ✓ Deployed |
| GPU Operator | ✓ Installed |
| Single-node vLLM | ✓ Configured |
| Monitoring (Prometheus/DCGM) | ✓ Enabled |
| Distributed vLLM | Documented (requires additional setup) |

**To complete the project:**

1. **Deploy distributed vLLM** using KubeRay or StatefulSet
2. **Configure NCCL** to use your InfiniBand link (`enp1s0f0np0`/`enp1s0f1np1`)
3. **Run benchmarks** comparing single-node vs distributed throughput
4. **Measure latency** impact of tensor parallelism over RDMA
5. **Create dashboards** in Grafana for GPU utilization

**The compelling article** writes itself once you have these numbers:
- "vLLM on 2 DGX Spark nodes: X tokens/sec with Llama 3.1 70B"
- "Why 96 Gbps RDMA matters: tensor parallelism latency comparison"
- "Cost analysis: $Y home lab vs $Z cloud GPU hours"

## Troubleshooting: Complete Cluster Reset

If the cluster becomes corrupted or you encounter version skew issues between nodes, you can perform a complete reset.

### Symptoms Requiring Reset

- Worker nodes show `NotReady` status in `kubectl get nodes`
- GPU Operator pods stuck in `CrashLoopBackOff` or `ContainerCreating`
- Version mismatch between controller and workers (e.g., v1.32 vs v1.31)
- Different containerd versions across nodes

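**Cause:** Kubernetes supports only a limited version skew: a node's kubelet must never be newer than the API server, and it may trail it by at most three minor versions (two before Kubernetes 1.28). If you upgraded some nodes but not others, or installed different MicroK8s channels, the simplest fix is to reset and reinstall the same version on all nodes.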
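
Mixed versions are easy to spot programmatically. The sketch below groups nodes by Kubernetes minor version and flags the cluster when more than one group exists. `minor_versions` is a hypothetical helper, and the version strings are made-up examples in the style of the VERSION column of `kubectl get nodes`.

```python
import re
from collections import defaultdict

def minor_versions(node_versions):
    """Group node names by Kubernetes minor version, e.g. "1.31"."""
    groups = defaultdict(list)
    for node, version in node_versions.items():
        match = re.match(r"v?(\d+)\.(\d+)", version)
        if match is None:
            raise ValueError(f"unrecognized version for {node}: {version!r}")
        groups[f"{match.group(1)}.{match.group(2)}"].append(node)
    return dict(groups)

# Made-up example values; read real ones from `kubectl get nodes`.
versions = {
    "controller": "v1.32.1",
    "spark-01": "v1.31.5",
    "spark-02": "v1.31.5",
}
groups = minor_versions(versions)
if len(groups) > 1:
    print("Version skew detected:", groups)
```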

### Step 1: Remove Worker Nodes from Cluster

Before resetting, remove the worker nodes from the cluster on the controller.

In [None]:
%%bash
source /tmp/ssh_agent_env.sh
SSH_OPTS="-o StrictHostKeyChecking=accept-new"

echo "=== Removing Worker Nodes from Cluster ==="
ssh $SSH_OPTS nvidia@192.168.1.75 << 'EOF'
    echo "Removing spark-01..."
    microk8s remove-node spark-01 || echo "Node already removed or not found"
    
    echo "Removing spark-02..."
    microk8s remove-node spark-02 || echo "Node already removed or not found"
    
    echo ""
    echo "Remaining nodes:"
    microk8s kubectl get nodes
EOF

### Step 2: Leave Cluster from Worker Nodes

Each worker node must leave the cluster before it can be fully reset.

In [None]:
%%bash
source /tmp/ssh_agent_env.sh
SSH_OPTS="-o StrictHostKeyChecking=accept-new"

echo "=== Spark-01: Leaving Cluster ==="
ssh $SSH_OPTS nvidia@192.168.1.76 'sudo microk8s leave' || echo "Already left or not in cluster"

echo ""
echo "=== Spark-02: Leaving Cluster ==="
ssh $SSH_OPTS nvidia@192.168.1.77 'sudo microk8s leave' || echo "Already left or not in cluster"

echo ""
echo "Workers have left the cluster"

### Step 3: Purge MicroK8s from All Nodes

Remove MicroK8s completely, including all configuration, data, and cluster state. The `--purge` flag ensures a clean slate.

In [None]:
%%bash
source /tmp/ssh_agent_env.sh
SSH_OPTS="-o StrictHostKeyChecking=accept-new"

echo "=== Purging MicroK8s from Controller ==="
ssh $SSH_OPTS nvidia@192.168.1.75 'sudo snap remove microk8s --purge'

echo ""
echo "=== Purging MicroK8s from Spark-01 ==="
ssh $SSH_OPTS nvidia@192.168.1.76 'sudo snap remove microk8s --purge'

echo ""
echo "=== Purging MicroK8s from Spark-02 ==="
ssh $SSH_OPTS nvidia@192.168.1.77 'sudo snap remove microk8s --purge'

echo ""
echo "=== Cleanup Complete ==="
echo "All nodes have been reset. Ready for fresh installation."

### After Reset: Reinstall with Consistent Versions

After purging, reinstall MicroK8s using **the same channel** on all three nodes. This prevents version skew issues.

**Critical:** Use the same channel on all nodes. Step 3 installs `1.31/stable`; if you switch to another channel such as `1.32/stable`, change it in every install cell. Then return to Step 3.1 (Install MicroK8s on the Controller) and work through the installation steps again.