# MicroK8s Cluster Setup: CPU Controller + DGX Spark Nodes

This tutorial walks through setting up a three-node MicroK8s Kubernetes cluster:

| Node | Role | IP Address | Description |
|------|------|------------|-------------|
| controller | Control Plane | 192.168.1.75 | CPU-only node running K8s control plane |
| spark-01 | Worker | 192.168.1.76 | DGX Spark with GPU |
| spark-02 | Worker | 192.168.1.77 | DGX Spark with GPU |

All nodes are connected via WiFi and have the `nvidia` user configured.

## Prerequisites

- Ubuntu 22.04 or later on all nodes
- SSH access from your workstation to all nodes
- `nvidia` user with sudo privileges on all nodes
- Network connectivity between all nodes (WiFi in this case)

## Step 1: Define Cluster Variables

Store node information in environment variables for use throughout this tutorial.

In [1]:
import os

# Cluster configuration
CONTROLLER_IP = "192.168.1.75"
SPARK01_IP = "192.168.1.76"
SPARK02_IP = "192.168.1.77"
SSH_USER = "nvidia"

# Store as environment variables for shell commands
os.environ["CONTROLLER_IP"] = CONTROLLER_IP
os.environ["SPARK01_IP"] = SPARK01_IP
os.environ["SPARK02_IP"] = SPARK02_IP
os.environ["SSH_USER"] = SSH_USER

print(f"Controller: {SSH_USER}@{CONTROLLER_IP}")
print(f"Spark-01:   {SSH_USER}@{SPARK01_IP}")
print(f"Spark-02:   {SSH_USER}@{SPARK02_IP}")

Controller: nvidia@192.168.1.75
Spark-01:   nvidia@192.168.1.76
Spark-02:   nvidia@192.168.1.77
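
Later cells repeat these IP addresses as string literals. As one possible alternative, the inventory could live in a single structure that other cells iterate over. This is a minimal sketch; `Node`, `NODES`, and `ssh_target` are hypothetical names that do not appear elsewhere in this notebook.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    ip: str
    role: str  # "control-plane" or "worker"


SSH_USER = "nvidia"

# Same three machines as the table at the top of the tutorial.
NODES = [
    Node("controller", "192.168.1.75", "control-plane"),
    Node("spark-01", "192.168.1.76", "worker"),
    Node("spark-02", "192.168.1.77", "worker"),
]


def ssh_target(node: Node, user: str = SSH_USER) -> str:
    """Return the user@host string that ssh expects."""
    return f"{user}@{node.ip}"


# Workers can be selected by role instead of by hard-coded address.
workers = [n for n in NODES if n.role == "worker"]

for node in NODES:
    print(f"{node.name:<10} {ssh_target(node)}")
```

With this in place, a cell that targets only the GPU nodes can loop over `workers` rather than listing both Spark addresses by hand.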


### Fix: Initialize SSH Agent in Kernel

The Jupyter kernel runs in a separate process without access to your terminal's SSH agent. Run this cell once to start the agent and load your key within the notebook environment.

In [4]:
import subprocess
import os

# Start SSH agent and capture its output
result = subprocess.run(
    ['ssh-agent', '-s'],
    capture_output=True,
    text=True
)

# Parse and set environment variables
for line in result.stdout.split('\n'):
    if 'SSH_AUTH_SOCK' in line:
        sock = line.split(';')[0].split('=')[1]
        os.environ['SSH_AUTH_SOCK'] = sock
        print(f"SSH_AUTH_SOCK={sock}")
    elif 'SSH_AGENT_PID' in line:
        pid = line.split(';')[0].split('=')[1]
        os.environ['SSH_AGENT_PID'] = pid
        print(f"SSH_AGENT_PID={pid}")

# Add the SSH key
add_result = subprocess.run(
    ['ssh-add', os.path.expanduser('~/.ssh/id_ed25519')],
    capture_output=True,
    text=True
)
print(add_result.stdout or add_result.stderr)

# Write environment to a file that bash cells can source
with open('/tmp/ssh_agent_env.sh', 'w') as f:
    f.write(f'export SSH_AUTH_SOCK={os.environ["SSH_AUTH_SOCK"]}\n')
    f.write(f'export SSH_AGENT_PID={os.environ["SSH_AGENT_PID"]}\n')
print("\nSSH agent environment saved to /tmp/ssh_agent_env.sh")

SSH_AUTH_SOCK=/var/folders/lv/0641bl7j7cj9wls4nqrctypm0000gp/T//ssh-9pEoO3Jo7nYI/agent.51358
SSH_AGENT_PID=51359
Identity added: /Users/elizabeththomas/.ssh/id_ed25519 (email2eliza@gmail.com)


SSH agent environment saved to /tmp/ssh_agent_env.sh
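
The string splitting in the cell above matches any line that merely contains the variable name. A slightly stricter parser is sketched below, assuming the Bourne-style output that `ssh-agent -s` prints; `parse_ssh_agent_output` is a hypothetical helper, and the sample text is illustrative rather than captured from a real run.

```python
import re

# Matches the two assignments that `ssh-agent -s` prints, for example:
#   SSH_AUTH_SOCK=/tmp/ssh-abc/agent.123; export SSH_AUTH_SOCK;
_ASSIGNMENT = re.compile(r"^(SSH_AUTH_SOCK|SSH_AGENT_PID)=([^;]+);", re.MULTILINE)


def parse_ssh_agent_output(text: str) -> dict:
    """Return the agent variables found in `ssh-agent -s` output."""
    env = dict(_ASSIGNMENT.findall(text))
    missing = {"SSH_AUTH_SOCK", "SSH_AGENT_PID"} - env.keys()
    if missing:
        # Fail loudly instead of writing a partial environment file.
        raise ValueError(f"ssh-agent output is missing: {sorted(missing)}")
    return env


# Illustrative input only; real socket paths and PIDs differ on every run.
sample = (
    "SSH_AUTH_SOCK=/tmp/ssh-example/agent.100; export SSH_AUTH_SOCK;\n"
    "SSH_AGENT_PID=101; export SSH_AGENT_PID;\n"
    "echo Agent pid 101;\n"
)
print(parse_ssh_agent_output(sample))
```

The returned dictionary can be passed to `os.environ.update(...)` and written to `/tmp/ssh_agent_env.sh` in the same way as the original cell does.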


## Step 2: Test SSH Connectivity

Verify SSH access to all nodes. Each command should return the hostname without prompting for a password.

In [7]:
import subprocess

nodes = [
    ("Controller", "192.168.1.75"),
    ("Spark-01", "192.168.1.76"),
    ("Spark-02", "192.168.1.77")
]

ssh_opts = ["-o", "ConnectTimeout=5", "-o", "StrictHostKeyChecking=accept-new"]

for name, ip in nodes:
    print(f"Testing SSH to {name} ({ip})...")
    result = subprocess.run(
        ["ssh"] + ssh_opts + [f"nvidia@{ip}", "hostname"],
        capture_output=True,
        text=True,
        timeout=10
    )
    
    if result.returncode == 0:
        print(result.stdout.strip())
    else:
        print(f"FAILED: Cannot connect to {name} ({ip})")
        if result.stderr:
            print(f"Error: {result.stderr.strip()}")
    print()

Testing SSH to Controller (192.168.1.75)...
controller

Testing SSH to Spark-01 (192.168.1.76)...
spark-01

Testing SSH to Spark-02 (192.168.1.77)...
spark-02
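
Two gaps in the test cell are worth noting: without `BatchMode=yes`, `ssh` may fall back to a password prompt that the notebook cannot answer, and a hung connection raises `subprocess.TimeoutExpired` rather than printing `FAILED`. The sketch below handles both. `build_ssh_command` and `check_node` are hypothetical helper names; `BatchMode`, `ConnectTimeout`, and `StrictHostKeyChecking` are standard OpenSSH client options.

```python
import subprocess


def build_ssh_command(user, ip, remote_cmd, connect_timeout=5):
    """Build an ssh argv that fails fast instead of prompting for a password."""
    return [
        "ssh",
        "-o", "BatchMode=yes",  # never prompt; fail if key authentication fails
        "-o", f"ConnectTimeout={connect_timeout}",
        "-o", "StrictHostKeyChecking=accept-new",
        f"{user}@{ip}",
        remote_cmd,
    ]


def check_node(user, ip, timeout=10):
    """Return (ok, detail): the remote hostname on success, the error text otherwise."""
    try:
        result = subprocess.run(
            build_ssh_command(user, ip, "hostname"),
            capture_output=True, text=True, timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False, f"no response within {timeout}s"
    if result.returncode != 0:
        return False, result.stderr.strip()
    return True, result.stdout.strip()
```

Calling `check_node("nvidia", "192.168.1.75")` would return a tuple such as `(True, "controller")` when key-based login works, so the loop can report every node even if one of them is unreachable.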



### Diagnosing SSH Environment

The notebook kernel may run in a different environment than your terminal. Let's check:

In [8]:
%%bash
echo "=== SSH Environment Check ==="
echo ""
echo "Running as user: $(whoami)"
echo "Home directory: $HOME"
echo ""
echo "SSH keys available:"
ls -la ~/.ssh/id_* 2>/dev/null || echo "No SSH keys found in ~/.ssh/"
echo ""
echo "SSH agent status:"
echo "SSH_AUTH_SOCK: ${SSH_AUTH_SOCK:-NOT SET}"
ssh-add -l 2>&1 || echo "No agent running or no keys loaded"

=== SSH Environment Check ===

Running as user: elizabeththomas
Home directory: /Users/elizabeththomas

SSH keys available:
-rw-------@ 1 elizabeththomas  staff   411 Nov  4 12:57 /Users/elizabeththomas/.ssh/id_ed25519
-rw-r--r--@ 1 elizabeththomas  staff   103 Nov  4 12:57 /Users/elizabeththomas/.ssh/id_ed25519.pub
-rw-------@ 1 elizabeththomas  staff  3434 Oct 11  2024 /Users/elizabeththomas/.ssh/id_rsa
-rw-r--r--@ 1 elizabeththomas  staff   747 Oct 11  2024 /Users/elizabeththomas/.ssh/id_rsa.pub

SSH agent status:
SSH_AUTH_SOCK: /var/folders/lv/0641bl7j7cj9wls4nqrctypm0000gp/T//ssh-9pEoO3Jo7nYI/agent.51358
256 SHA256:na0tGgsozbGtZ2nM52FTdk7No5zpLE5r4iaZE0U2zyQ email2eliza@gmail.com (ED25519)


### Fix: Set Up SSH Key-Based Authentication

If the connectivity test fails, set up key-based (passwordless) SSH. First, check whether you already have an SSH key (the diagnostic cell above lists any keys in `~/.ssh/`).

**If no key exists**, run this cell to generate one (skip if you already have a key):

In [None]:
%%bash
# Generate SSH key (only run if you don't have one)
ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519 -N "" -C "microk8s-cluster"
echo "Key generated:"
cat ~/.ssh/id_ed25519.pub

### Copy SSH Key to All Nodes

Run `ssh-copy-id` for each node. This will prompt for the password once per node.

**Run these commands in a terminal** (they require interactive password input):

```bash
# Copy to controller
ssh-copy-id nvidia@192.168.1.75

# Copy to spark-01
ssh-copy-id nvidia@192.168.1.76

# Copy to spark-02
ssh-copy-id nvidia@192.168.1.77
```

## Step 3: Install MicroK8s on All Nodes

MicroK8s is a lightweight Kubernetes distribution from Canonical. We'll install it on all three nodes, then join the Spark nodes to the controller.

**Architecture:**
- Controller (192.168.1.75): Runs the Kubernetes control plane; CPU only
- Spark-01 (192.168.1.76): Worker node with GPU
- Spark-02 (192.168.1.77): Worker node with GPU

### 3.1 Install MicroK8s on the Controller

The controller runs the control plane (API server, scheduler, and the dqlite datastore that MicroK8s uses by default in place of etcd). No GPU is needed here.

In [10]:
%%bash
source /tmp/ssh_agent_env.sh
SSH_OPTS="-o StrictHostKeyChecking=accept-new"

echo "=== Installing MicroK8s on Controller (192.168.1.75) ==="
ssh $SSH_OPTS nvidia@192.168.1.75 << 'EOF'
    # Check if MicroK8s is already installed
    if snap list microk8s &>/dev/null; then
        echo "MicroK8s is already installed. Skipping installation."
        microk8s version
    else
        echo "Installing MicroK8s..."
        sudo snap install microk8s --classic --channel=1.31/stable
        # Add user to microk8s group (only needed on first install).
        # Membership takes effect at the next login, and newgrp would replace
        # the shell that is reading this heredoc, so the commands below use sudo.
        sudo usermod -a -G microk8s $USER
    fi
    
    # Ensure .kube directory exists
    mkdir -p ~/.kube
    
    # Wait for MicroK8s to be ready
    sudo microk8s status --wait-ready
    
    echo ""
    echo "MicroK8s version:"
    sudo microk8s kubectl version --short 2>/dev/null || sudo microk8s kubectl version
EOF

=== Installing MicroK8s on Controller (192.168.1.75) ===


Pseudo-terminal will not be allocated because stdin is not a terminal.


Welcome to Ubuntu 24.04.3 LTS (GNU/Linux 6.8.0-90-generic x86_64)

 * Documentation:  https://help.ubuntu.com
 * Management:     https://landscape.canonical.com
 * Support:        https://ubuntu.com/pro

 System information as of Sun Feb  1 04:51:15 AM UTC 2026

  System load:             0.0
  Usage of /:              1.6% of 835.58GB
  Memory usage:            3%
  Swap usage:              0%
  Temperature:             39.9 C
  Processes:               320
  Users logged in:         1
  IPv4 address for wlp2s0: 192.168.1.75
  IPv6 address for wlp2s0: 2600:1702:56e5:4e10:dba1:cc44:2510:46a6
  IPv6 address for wlp2s0: 2600:1702:56e5:4e10:36b9:c4a0:a7e9:31ed
  IPv6 address for wlp2s0: 2600:1702:56e5:4e10:9d6d:19f6:90da:d7f6
  IPv6 address for wlp2s0: 2600:1702:56e5:4e10:ef2d:9464:3557:9dec
  IPv6 address for wlp2s0: 2600:1702:56e5:4e10:7a8d:97c:e56c:fe96
  IPv6 address for wlp2s0: 2600:1702:56e5:4e10::48
  IPv6 address for wlp2s0: 2600:1702:56e5:4e10:22:bc19:d505:ab6a
  IPv6 address for

### 3.2 Install MicroK8s on Spark-01

Spark-01 is a GPU worker node. The installation steps match the controller's, and the node joins the cluster in Step 4.

In [11]:
%%bash
source /tmp/ssh_agent_env.sh
SSH_OPTS="-o StrictHostKeyChecking=accept-new"

echo "=== Installing MicroK8s on Spark-01 (192.168.1.76) ==="
ssh $SSH_OPTS nvidia@192.168.1.76 'bash -s' << 'EOF'
    # Check if MicroK8s snap is installed and microk8s command works
    if snap list microk8s &>/dev/null && /snap/bin/microk8s version &>/dev/null; then
        echo "MicroK8s is already installed and working. Skipping installation."
        /snap/bin/microk8s version
    else
        echo "MicroK8s not found or broken. Installing fresh..."
        # Remove any broken installation first
        sudo snap remove microk8s --purge 2>/dev/null || true
        
        echo "Installing MicroK8s..."
        sudo snap install microk8s --classic --channel=1.31/stable
        
        # Add user to microk8s group
        sudo usermod -a -G microk8s $USER
        
        echo "NOTE: Group membership updated. Running remaining commands with sudo."
    fi
    
    # Ensure .kube directory exists
    mkdir -p ~/.kube
    
    # Wait for MicroK8s to be ready
    sudo /snap/bin/microk8s status --wait-ready
    
    echo ""
    echo "MicroK8s version:"
    sudo /snap/bin/microk8s kubectl version --short 2>/dev/null || sudo /snap/bin/microk8s kubectl version
EOF

=== Installing MicroK8s on Spark-01 (192.168.1.76) ===
MicroK8s not found or broken. Installing fresh...
Installing MicroK8s...
microk8s (1.31/stable) v1.31.14 from Canonical** installed
NOTE: Group membership updated. Running remaining commands with sudo.
microk8s is running
high-availability: no
  datastore master nodes: 127.0.0.1:19001
  datastore standby nodes: none
addons:
  enabled:
    dns                  # (core) CoreDNS
    ha-cluster           # (core) Configure high availability on the current node
    helm                 # (core) Helm - the package manager for Kubernetes
    helm3                # (core) Helm 3 - the package manager for Kubernetes
  disabled:
    cert-manager         # (core) Cloud native certificate management
    cis-hardening        # (core) Apply CIS K8s hardening
    community            # (core) The community addons repository
    dashboard            # (core) The Kubernetes dashboard
    host-access          # (core) Allow Pods connecting to Host s

### 3.3 Install MicroK8s on Spark-02

Spark-02 is the second GPU worker node. The cell matches the Spark-01 cell apart from the IP address.

In [12]:
%%bash
source /tmp/ssh_agent_env.sh
SSH_OPTS="-o StrictHostKeyChecking=accept-new"

echo "=== Installing MicroK8s on Spark-02 (192.168.1.77) ==="
ssh $SSH_OPTS nvidia@192.168.1.77 'bash -s' << 'EOF'
    # Check if MicroK8s snap is installed and microk8s command works
    if snap list microk8s &>/dev/null && /snap/bin/microk8s version &>/dev/null; then
        echo "MicroK8s is already installed and working. Skipping installation."
        /snap/bin/microk8s version
    else
        echo "MicroK8s not found or broken. Installing fresh..."
        # Remove any broken installation first
        sudo snap remove microk8s --purge 2>/dev/null || true
        
        echo "Installing MicroK8s..."
        sudo snap install microk8s --classic --channel=1.31/stable
        
        # Add user to microk8s group
        sudo usermod -a -G microk8s $USER
        
        echo "NOTE: Group membership updated. Running remaining commands with sudo."
    fi
    
    # Ensure .kube directory exists
    mkdir -p ~/.kube
    
    # Wait for MicroK8s to be ready
    sudo /snap/bin/microk8s status --wait-ready
    
    echo ""
    echo "MicroK8s version:"
    sudo /snap/bin/microk8s kubectl version --short 2>/dev/null || sudo /snap/bin/microk8s kubectl version
EOF

=== Installing MicroK8s on Spark-02 (192.168.1.77) ===
MicroK8s not found or broken. Installing fresh...
Installing MicroK8s...
microk8s (1.31/stable) v1.31.14 from Canonical** installed
NOTE: Group membership updated. Running remaining commands with sudo.
microk8s is running
high-availability: no
  datastore master nodes: 127.0.0.1:19001
  datastore standby nodes: none
addons:
  enabled:
    dns                  # (core) CoreDNS
    ha-cluster           # (core) Configure high availability on the current node
    helm                 # (core) Helm - the package manager for Kubernetes
    helm3                # (core) Helm 3 - the package manager for Kubernetes
  disabled:
    cert-manager         # (core) Cloud native certificate management
    cis-hardening        # (core) Apply CIS K8s hardening
    community            # (core) The community addons repository
    dashboard            # (core) The Kubernetes dashboard
    host-access          # (core) Allow Pods connecting to Host s

## Step 4: Form the Kubernetes Cluster

Now that MicroK8s is installed on all nodes, we need to join the worker nodes to the controller.

The process:
1. Generate a join token on the controller
2. Use that token on each worker node
3. Verify all nodes are connected

### 4.1 Generate Join Token on Controller

This command generates a one-time token that workers will use to join the cluster.

In [13]:
%%bash
source /tmp/ssh_agent_env.sh
SSH_OPTS="-o StrictHostKeyChecking=accept-new"

echo "=== Generating Join Token on Controller ==="
ssh $SSH_OPTS nvidia@192.168.1.75 'microk8s add-node' | tee /tmp/join_token.txt

echo "Token generated. Extract the join command for workers."
echo ""

=== Generating Join Token on Controller ===
From the node you wish to join to this cluster, run the following:
microk8s join 192.168.1.75:25000/49058fcd49010d1ddcbebfee1a209411/eee097e0a8ae

Use the '--worker' flag to join a node as a worker not running the control plane, eg:
microk8s join 192.168.1.75:25000/49058fcd49010d1ddcbebfee1a209411/eee097e0a8ae --worker

If the node you are adding is not reachable through the default interface you can use one of the following:
microk8s join 192.168.1.75:25000/49058fcd49010d1ddcbebfee1a209411/eee097e0a8ae
microk8s join 2600:1702:56e5:4e10:dba1:cc44:2510:46a6:25000/49058fcd49010d1ddcbebfee1a209411/eee097e0a8ae
microk8s join 2600:1702:56e5:4e10:36b9:c4a0:a7e9:31ed:25000/49058fcd49010d1ddcbebfee1a209411/eee097e0a8ae
microk8s join 2600:1702:56e5:4e10:9d6d:19f6:90da:d7f6:25000/49058fcd49010d1ddcbebfee1a209411/eee097e0a8ae
microk8s join 2600:1702:56e5:4e10:ef2d:9464:3557:9dec:25000/49058fcd49010d1ddcbebfee1a209411/eee097e0a8ae
microk8s join 2600:1702

### 4.2 Extract Join Command

Parse the join token output to get the actual command. The token expires after some time, so complete the join process promptly.

In [14]:
%%bash
# Extract the first join command; the --worker flag is appended below
JOIN_CMD=$(grep "microk8s join" /tmp/join_token.txt | head -1 | sed 's/^[[:space:]]*//')

if [ -z "$JOIN_CMD" ]; then
    echo "ERROR: Could not extract join command"
    exit 1
fi

echo "Join command for workers:"
echo "$JOIN_CMD --worker"
echo ""
echo "Saving to /tmp/join_cmd.txt"
echo "$JOIN_CMD --worker" > /tmp/join_cmd.txt

Join command for workers:
microk8s join 192.168.1.75:25000/49058fcd49010d1ddcbebfee1a209411/eee097e0a8ae --worker

Saving to /tmp/join_cmd.txt
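
The same extraction can be done in Python, which makes it easier to skip the IPv6 alternatives and the line that already carries `--worker`. This is a minimal sketch; `extract_join_command` is a hypothetical helper, and the sample text is abbreviated from the `microk8s add-node` output shown in section 4.1.

```python
def extract_join_command(add_node_output: str, controller_ip: str) -> str:
    """Return the IPv4 join command for controller_ip with --worker appended."""
    prefix = f"microk8s join {controller_ip}:"
    for line in add_node_output.splitlines():
        candidate = line.strip()
        # Skip the example line that already ends in --worker so the flag
        # is appended exactly once.
        if candidate.startswith(prefix) and "--worker" not in candidate:
            return f"{candidate} --worker"
    raise ValueError(f"no join command for {controller_ip} found in output")


# Abbreviated copy of the add-node output from section 4.1.
sample = """From the node you wish to join to this cluster, run the following:
microk8s join 192.168.1.75:25000/49058fcd49010d1ddcbebfee1a209411/eee097e0a8ae

Use the '--worker' flag to join a node as a worker not running the control plane, eg:
microk8s join 192.168.1.75:25000/49058fcd49010d1ddcbebfee1a209411/eee097e0a8ae --worker
"""

print(extract_join_command(sample, "192.168.1.75"))
```

Raising `ValueError` when nothing matches mirrors the `exit 1` branch of the bash cell, so a failed token request stops the notebook before any worker is contacted.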


### 4.3 Join Spark-01 to Cluster

Execute the join command on spark-01. The `--worker` flag ensures it only runs workloads, not control plane components.

In [15]:
%%bash
source /tmp/ssh_agent_env.sh
SSH_OPTS="-o StrictHostKeyChecking=accept-new"

echo "=== Joining Spark-01 (192.168.1.76) to Cluster ==="

JOIN_CMD=$(cat /tmp/join_cmd.txt)
ssh $SSH_OPTS nvidia@192.168.1.76 "sudo $JOIN_CMD"

echo ""
echo "Spark-01 join initiated. Waiting 30 seconds for the node to appear..."
sleep 30

=== Joining Spark-01 (192.168.1.76) to Cluster ===
Contacting cluster at 192.168.1.75

The node has joined the cluster and will appear in the nodes list in a few seconds.

This worker node gets automatically configured with the API server endpoints.
If the API servers are behind a loadbalancer please set the '--refresh-interval' to '0s' in:
    /var/snap/microk8s/current/args/apiserver-proxy
and replace the API server endpoints with the one provided by the loadbalancer in:
    /var/snap/microk8s/current/args/traefik/provider.yaml

Successfully joined the cluster.


### 4.4 Join Spark-02 to Cluster

Join the second GPU worker node. Each worker needs its own join command (tokens are consumed after use).

In [16]:
import subprocess
import os
import re
import time

# Load SSH agent environment
with open('/tmp/ssh_agent_env.sh') as f:
    for line in f:
        if '=' in line and line.startswith('export'):
            key, val = line.replace('export ', '').strip().split('=', 1)
            os.environ[key] = val

SSH_OPTS = ['-o', 'StrictHostKeyChecking=accept-new']

print("=== Generating new token for Spark-02 ===")
result = subprocess.run(
    ['ssh'] + SSH_OPTS + ['nvidia@192.168.1.75', 'microk8s add-node'],
    capture_output=True, text=True
)
print(result.stdout)
if result.stderr:
    print("STDERR:", result.stderr)

# Extract join command
join_cmd = None
for line in result.stdout.split('\n'):
    if 'microk8s join' in line and '192.168.1.75:25000' in line:
        join_cmd = line.strip()
        break

if not join_cmd:
    print("ERROR: Could not extract join command")
else:
    print(f"\nExtracted: {join_cmd}")
    
    print("\n=== Joining Spark-02 (192.168.1.77) to Cluster ===")
    join_result = subprocess.run(
        ['ssh'] + SSH_OPTS + ['nvidia@192.168.1.77', f'sudo {join_cmd} --worker'],
        capture_output=True, text=True
    )
    print(join_result.stdout)
    if join_result.stderr:
        print("STDERR:", join_result.stderr)
    
    print("\nSpark-02 join initiated. Waiting 30 seconds...")
    time.sleep(30)
    print("Done.")

=== Generating new token for Spark-02 ===
From the node you wish to join to this cluster, run the following:
microk8s join 192.168.1.75:25000/5e98a07149da21fdcf0fcb1c2a5dcba0/eee097e0a8ae

Use the '--worker' flag to join a node as a worker not running the control plane, eg:
microk8s join 192.168.1.75:25000/5e98a07149da21fdcf0fcb1c2a5dcba0/eee097e0a8ae --worker

If the node you are adding is not reachable through the default interface you can use one of the following:
microk8s join 192.168.1.75:25000/5e98a07149da21fdcf0fcb1c2a5dcba0/eee097e0a8ae
microk8s join 2600:1702:56e5:4e10:dba1:cc44:2510:46a6:25000/5e98a07149da21fdcf0fcb1c2a5dcba0/eee097e0a8ae
microk8s join 2600:1702:56e5:4e10:36b9:c4a0:a7e9:31ed:25000/5e98a07149da21fdcf0fcb1c2a5dcba0/eee097e0a8ae
microk8s join 2600:1702:56e5:4e10:9d6d:19f6:90da:d7f6:25000/5e98a07149da21fdcf0fcb1c2a5dcba0/eee097e0a8ae
microk8s join 2600:1702:56e5:4e10:ef2d:9464:3557:9dec:25000/5e98a07149da21fdcf0fcb1c2a5dcba0/eee097e0a8ae
microk8s join 2600:1702:5

### 4.5 Verify Cluster Nodes

Check that all three nodes are visible and in Ready status.

In [17]:
%%bash
source /tmp/ssh_agent_env.sh
SSH_OPTS="-o StrictHostKeyChecking=accept-new"
echo "=== Cluster Node Status ==="
ssh $SSH_OPTS nvidia@192.168.1.75 'sudo microk8s kubectl get nodes -o wide'

echo ""
echo "Expected: 3 nodes (controller, spark-01, spark-02) all in Ready status"

=== Cluster Node Status ===
NAME         STATUS   ROLES    AGE     VERSION    INTERNAL-IP    EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION       CONTAINER-RUNTIME
controller   Ready    <none>   9m31s   v1.31.14   192.168.1.75   <none>        Ubuntu 24.04.3 LTS   6.8.0-90-generic     containerd://1.6.28
spark-01     Ready    <none>   94s     v1.31.14   192.168.1.76   <none>        Ubuntu 24.04.3 LTS   6.11.0-1016-nvidia   containerd://1.6.28
spark-02     Ready    <none>   49s     v1.31.14   192.168.1.77   <none>        Ubuntu 24.04.3 LTS   6.14.0-1015-nvidia   containerd://1.6.28
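
The join cells wait a fixed 30 seconds before this check. As an alternative, the node list can be polled until every expected node reports `Ready`. The sketch below assumes the text comes from `kubectl get nodes` (with or without headers); `not_ready_nodes` and `wait_for_nodes` are hypothetical helpers, and `fetch` is any callable that returns the command's output, for example a wrapper around the `ssh` call above.

```python
import time
from typing import Callable


def not_ready_nodes(kubectl_output: str, expected: set) -> set:
    """Return the expected node names that are missing or not Ready."""
    ready = set()
    for line in kubectl_output.splitlines():
        fields = line.split()
        # Column 2 is STATUS; it can be a list such as "Ready,SchedulingDisabled".
        if len(fields) >= 2 and "Ready" in fields[1].split(","):
            ready.add(fields[0])
    return expected - ready


def wait_for_nodes(fetch: Callable[[], str], expected: set,
                   timeout_s: float = 120.0, interval_s: float = 5.0) -> bool:
    """Poll fetch() until all expected nodes are Ready or the timeout passes."""
    deadline = time.monotonic() + timeout_s
    while True:
        if not not_ready_nodes(fetch(), expected):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# Shortened rows in the layout of the output above.
sample = """controller   Ready    <none>   9m31s   v1.31.14
spark-01     Ready    <none>   94s     v1.31.14
spark-02     NotReady <none>   5s      v1.31.14
"""
print(not_ready_nodes(sample, {"controller", "spark-01", "spark-02"}))
```

Because `NotReady` is compared as a whole field rather than as a substring, a node that has joined but is still starting is reported as pending.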


## Step 5: Install NVIDIA GPU Operator

The GPU Operator automates the deployment of all NVIDIA software components needed for GPU support in Kubernetes:

| Component | Purpose |
|-----------|---------|
| NVIDIA Driver | GPU device driver (if not already installed) |
| NVIDIA Container Toolkit | Enables GPU access in containers |
| NVIDIA Device Plugin | Exposes GPUs as schedulable resources |
| DCGM Exporter | Metrics for monitoring GPU utilization |
| GPU Feature Discovery | Labels nodes with GPU properties |

Together these components cover container runtime integration, scheduling, monitoring, and node labeling, whereas a standalone device plugin covers only scheduling.

### 5.1 Add Helm and NVIDIA Helm Repository

The GPU Operator is distributed via Helm chart. First, enable Helm in MicroK8s and add the NVIDIA repository.

In [18]:
%%bash
echo "=== Enabling Helm in MicroK8s ==="
# 'bash -s' runs the heredoc as a script instead of starting a login shell
ssh nvidia@192.168.1.75 'bash -s' << 'EOF'
    sudo microk8s enable helm3
    sudo microk8s kubectl create namespace gpu-operator || true
    
    echo ""
    echo "Adding NVIDIA Helm repository..."
    sudo microk8s helm3 repo add nvidia https://helm.ngc.nvidia.com/nvidia
    sudo microk8s helm3 repo update
    
    echo ""
    echo "Helm and NVIDIA repo configured."
EOF

=== Enabling Helm in MicroK8s ===


Infer repository core for addon helm3


"nvidia" already exists with the same configuration, skipping
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "nvidia" chart repository
Update Complete. ⎈Happy Helming!⎈

Helm and NVIDIA repo configured.


### 5.2 Install GPU Operator

Deploy the GPU Operator in pre-installed driver mode.

**Why `driver.enabled=false`?**

DGX Spark nodes ship with NVIDIA drivers pre-installed on the host OS. You can verify this by running `nvidia-smi` directly on the nodes. The GPU Operator supports two driver deployment modes:

1. **Containerized drivers** (`driver.enabled=true`): Operator deploys drivers as privileged pods. Use when nodes don't have drivers pre-installed.

2. **Pre-installed drivers** (`driver.enabled=false`): Operator uses existing host drivers. Use when drivers are already installed (our case).

Setting `driver.enabled=true` on nodes with existing drivers causes conflicts:
```
modprobe: ERROR: could not insert 'nvidia': File exists
```

Even with `driver.enabled=false`, the operator still installs:
- NVIDIA Container Toolkit (maps GPUs into containers)
- Device Plugin (exposes `nvidia.com/gpu` to Kubernetes)
- DCGM Exporter (GPU metrics for Prometheus)
- GPU Feature Discovery (automatic node labeling)
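
Because the right value for `driver.enabled` depends on whether each GPU node already has a host driver, the choice can be derived from an `nvidia-smi` probe instead of being assumed. The following is a minimal sketch under that assumption; `host_driver_version` and `driver_flag` are hypothetical helpers, while `--query-gpu=driver_version --format=csv,noheader` are standard `nvidia-smi` options.

```python
import subprocess


def host_driver_version(user, ip, timeout=20):
    """Return the driver version nvidia-smi reports on the node, or None."""
    cmd = [
        "ssh", "-o", "BatchMode=yes", f"{user}@{ip}",
        "nvidia-smi --query-gpu=driver_version --format=csv,noheader",
    ]
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except (subprocess.TimeoutExpired, OSError):
        return None
    lines = result.stdout.strip().splitlines()
    # One line is printed per GPU; the first is enough to prove a driver exists.
    return lines[0].strip() if result.returncode == 0 and lines else None


def driver_flag(host_versions: dict) -> str:
    """Pick the Helm --set value from a {node name: version or None} mapping.

    The containerized driver is disabled only when every GPU node
    already has a working host driver.
    """
    all_present = bool(host_versions) and all(
        version is not None for version in host_versions.values()
    )
    return f"driver.enabled={'false' if all_present else 'true'}"
```

For this cluster, the mapping would be built from `host_driver_version("nvidia", "192.168.1.76")` and the equivalent call for `192.168.1.77`, and the result of `driver_flag(...)` would be passed to `--set` in the Helm command.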

In [19]:
%%bash
echo "=== Installing NVIDIA GPU Operator ==="
# 'bash -s' runs the heredoc as a script instead of starting a login shell
ssh nvidia@192.168.1.75 'bash -s' << 'EOF'
    sudo microk8s helm3 install gpu-operator nvidia/gpu-operator \
        --namespace gpu-operator \
        --set driver.enabled=false \
        --set toolkit.enabled=true \
        --wait \
        --timeout 10m
    
    echo ""
    echo "GPU Operator installed. Waiting for pods to be ready..."
    sleep 30
EOF

=== Installing NVIDIA GPU Operator ===



NAME: gpu-operator
LAST DEPLOYED: Sun Feb  1 05:03:04 2026
NAMESPACE: gpu-operator
STATUS: deployed
REVISION: 1
TEST SUITE: None

GPU Operator installed. Waiting for pods to be ready...


### 5.3 Verify GPU Operator Pods

Check that all GPU Operator components are running on the GPU nodes.

In [25]:
%%bash
echo "=== GPU Operator Pods ==="
ssh nvidia@192.168.1.75 'sudo microk8s kubectl get pods -n gpu-operator -o wide'

echo ""
echo "Expected: device-plugin, dcgm-exporter, and other operator pods running on spark-01 and spark-02"

=== GPU Operator Pods ===
NAME                                                          READY   STATUS     RESTARTS   AGE   IP             NODE         NOMINATED NODE   READINESS GATES
gpu-feature-discovery-g7nz6                                   0/1     Init:0/1   0          14m   <none>         spark-02     <none>           <none>
gpu-feature-discovery-xz7hm                                   0/1     Init:0/1   0          14m   <none>         spark-01     <none>           <none>
gpu-operator-767fdbb8d5-rczrr                                 1/1     Running    0          14m   10.1.168.194   spark-01     <none>           <none>
gpu-operator-node-feature-discovery-gc-645d95b6c-r9kpq        1/1     Running    0          14m   10.1.199.129   spark-02     <none>           <none>
gpu-operator-node-feature-discovery-master-66995587d9-dxqfc   1/1     Running    0          14m   10.1.199.131   spark-02     <none>           <none>
gpu-operator-node-feature-discovery-worker-bshlc              1/1

#### Troubleshooting: Pods Stuck in Init Phase

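If GPU Operator pods remain in `Init:0/1` or `Init:0/4` status for more than 5 minutes, their init containers are typically waiting on driver or toolkit validation. The cells below first make sure containerd on each Spark node has an `nvidia` runtime entry, then inspect the init container logs and pod events: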

In [30]:
import subprocess
import os

# Load SSH environment
with open('/tmp/ssh_agent_env.sh') as f:
    for line in f:
        if '=' in line and line.startswith('export'):
            key, val = line.replace('export ', '').strip().split('=', 1)
            os.environ[key] = val

ssh_opts = ['-o', 'StrictHostKeyChecking=accept-new']

print("=== Configuring NVIDIA Runtime in Containerd ===")

for node_ip, node_name in [("192.168.1.76", "spark-01"), ("192.168.1.77", "spark-02")]:
    print(f"\n=== {node_name} ===")
    
    # Configure containerd for nvidia runtime
    config_script = """
    # Backup existing config
    sudo cp /var/snap/microk8s/current/args/containerd-template.toml /var/snap/microk8s/current/args/containerd-template.toml.backup || true
    
    # Check if nvidia runtime is already configured
    if grep -q "nvidia" /var/snap/microk8s/current/args/containerd-template.toml 2>/dev/null; then
        echo "NVIDIA runtime already configured in containerd"
    else
        echo "Adding NVIDIA runtime configuration to containerd..."
        
        # Add nvidia runtime configuration
        sudo tee -a /var/snap/microk8s/current/args/containerd-template.toml > /dev/null <<'EOF'

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
  runtime_type = "io.containerd.runc.v2"
  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
    BinaryName = "/usr/bin/nvidia-container-runtime"
EOF
        
        echo "NVIDIA runtime configuration added"
    fi
    
    # Restart containerd
    echo "Restarting MicroK8s..."
    sudo snap restart microk8s.daemon-containerd
    sleep 10
    
    echo "Verifying nvidia-container-runtime is available..."
    which nvidia-container-runtime || echo "nvidia-container-runtime not found in PATH"
    """
    
    result = subprocess.run(
        ['ssh'] + ssh_opts + [f'nvidia@{node_ip}', 'bash -s'],
        input=config_script,
        capture_output=True,
        text=True
    )
    
    print(result.stdout)
    if result.stderr:
        print(f"STDERR: {result.stderr}")

print("\n=== Waiting 30 seconds for pods to reinitialize ===")
import time
time.sleep(30)

print("\n=== Checking GPU Operator pods ===")
result = subprocess.run(
    ['ssh'] + ssh_opts + ['nvidia@192.168.1.75', 
     'sudo microk8s kubectl get pods -n gpu-operator -o wide'],
    capture_output=True, text=True
)
print(result.stdout)

=== Configuring NVIDIA Runtime in Containerd ===

=== spark-01 ===
NVIDIA runtime already configured in containerd
Restarting MicroK8s...
Restarted.
Verifying nvidia-container-runtime is available...
/usr/bin/nvidia-container-runtime


=== spark-02 ===
NVIDIA runtime already configured in containerd
Restarting MicroK8s...
Restarted.
Verifying nvidia-container-runtime is available...
/usr/bin/nvidia-container-runtime


=== Waiting 30 seconds for pods to reinitialize ===

=== Checking GPU Operator pods ===
NAME                                                          READY   STATUS     RESTARTS   AGE   IP             NODE         NOMINATED NODE   READINESS GATES
gpu-feature-discovery-g7nz6                                   0/1     Init:0/1   0          35m   <none>         spark-02     <none>           <none>
gpu-feature-discovery-xz7hm                                   0/1     Init:0/1   0          35m   <none>         spark-01     <none>           <none>
gpu-operator-767fdbb8d5-rczrr  
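
Rather than reading `Init:0/1` from the table by eye, the pod list can be requested as JSON and checked for init containers that have not finished. This also reveals which container name to pass to `kubectl logs -c`. The sketch below relies on the `status.initContainerStatuses` field of the Kubernetes Pod API; `pending_init_containers` is a hypothetical helper, and the sample is a hand-written stand-in for `kubectl get pods -n gpu-operator -o json`, not captured output.

```python
import json


def pending_init_containers(pods_json: str) -> dict:
    """Map each pod name to its init containers that have not exited with code 0."""
    pending = {}
    for pod in json.loads(pods_json).get("items", []):
        unfinished = []
        for status in pod.get("status", {}).get("initContainerStatuses", []):
            terminated = status.get("state", {}).get("terminated")
            # Waiting, running, or failed init containers all block the pod.
            if terminated is None or terminated.get("exitCode") != 0:
                unfinished.append(status["name"])
        if unfinished:
            pending[pod["metadata"]["name"]] = unfinished
    return pending


# Illustrative input reduced to the fields the function reads.
sample = json.dumps({
    "items": [
        {
            "metadata": {"name": "nvidia-device-plugin-daemonset-7fbqc"},
            "status": {"initContainerStatuses": [
                {"name": "toolkit-validation",
                 "state": {"waiting": {"reason": "PodInitializing"}}},
            ]},
        },
        {
            "metadata": {"name": "gpu-operator-767fdbb8d5-rczrr"},
            "status": {},
        },
    ]
})
print(pending_init_containers(sample))
```

Pods without init containers, such as the operator deployment itself, are left out of the result, so an empty dictionary means nothing is blocked in the init phase.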

In [31]:
import subprocess
import os

# Load SSH environment
with open('/tmp/ssh_agent_env.sh') as f:
    for line in f:
        if '=' in line and line.startswith('export'):
            key, val = line.replace('export ', '').strip().split('=', 1)
            os.environ[key] = val

ssh_opts = ['-o', 'StrictHostKeyChecking=accept-new']

print("=== Checking Device Plugin Init Container ===")
result = subprocess.run(
    ['ssh'] + ssh_opts + ['nvidia@192.168.1.75', """
        POD=$(sudo microk8s kubectl get pods -n gpu-operator -l app=nvidia-device-plugin-daemonset -o jsonpath='{.items[0].metadata.name}')
        echo "Pod: $POD"
        echo ""
        echo "=== Init Container Logs ==="
        sudo microk8s kubectl logs -n gpu-operator $POD -c toolkit-validation --tail=100 2>&1 || echo "No logs available"
        echo ""
        echo "=== Pod Description (Events) ==="
        sudo microk8s kubectl describe pod -n gpu-operator $POD | tail -30
    """],
    capture_output=True, text=True
)
print(result.stdout)
if result.stderr:
    print("STDERR:", result.stderr)

=== Checking Device Plugin Init Container ===
Pod: nvidia-device-plugin-daemonset-7fbqc

=== Init Container Logs ===
Error from server (BadRequest): container "toolkit-validation" in pod "nvidia-device-plugin-daemonset-7fbqc" is waiting to start: PodInitializing
No logs available

=== Pod Description (Events) ===
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/cdi
    HostPathType:  DirectoryOrCreate
  mps-root:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia/mps
    HostPathType:  DirectoryOrCreate
  mps-shm:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia/mps/shm
    HostPathType:  
  kube-api-access-nxpk6:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:       
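
The captured output above stops before the `Events:` section, so it does not show why the init container is waiting. As a rough sketch (not part of the original notebook), the waiting reason can be read from the pod JSON instead of from `describe` text. The sample document is hand-written to mimic the shape of `kubectl get pods -o json`, and the pod name `demo-pod` and the function name `init_container_waits` are hypothetical.

```python
import json


def init_container_waits(pod_list):
    """Return (pod, init container, reason) for each init container in a waiting state."""
    waits = []
    for pod in pod_list.get("items", []):
        pod_name = pod["metadata"]["name"]
        statuses = pod.get("status", {}).get("initContainerStatuses", [])
        for status in statuses:
            waiting = status.get("state", {}).get("waiting")
            if waiting:
                waits.append((pod_name, status["name"], waiting.get("reason", "")))
    return waits


# Hand-written sample, NOT captured from this cluster.
SAMPLE = json.loads("""
{
  "items": [
    {
      "metadata": {"name": "demo-pod"},
      "status": {
        "initContainerStatuses": [
          {"name": "toolkit-validation",
           "state": {"waiting": {"reason": "PodInitializing"}}}
        ]
      }
    }
  ]
}
""")

for pod, container, reason in init_container_waits(SAMPLE):
    print(f"{pod}: init container {container} waiting ({reason})")
```

In the notebook, the input would come from `json.loads(result.stdout)` after running `sudo microk8s kubectl get pods -n gpu-operator -o json` over SSH, following the same pattern as the other cells.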

In [32]:
%%bash
ssh nvidia@192.168.1.75 << 'EOF'
echo "=== Checking Init Container Logs ==="

# Get one stuck device-plugin pod
POD=$(sudo microk8s kubectl get pods -n gpu-operator -l app=nvidia-device-plugin-daemonset -o jsonpath='{.items[0].metadata.name}')

if [ -n "$POD" ]; then
    echo "Checking pod: $POD"
    echo ""
    echo "=== Init Container Logs ==="
    sudo microk8s kubectl logs -n gpu-operator $POD -c toolkit-validation --tail=50 2>&1 || echo "No logs yet or container not started"
    
    echo ""
    echo "=== Pod Events ==="
    sudo microk8s kubectl describe pod -n gpu-operator $POD | grep -A 10 "Events:"
else
    echo "No device-plugin pods found"
fi
EOF

Pseudo-terminal will not be allocated because stdin is not a terminal.


Welcome to Ubuntu 24.04.3 LTS (GNU/Linux 6.8.0-90-generic x86_64)

 * Documentation:  https://help.ubuntu.com
 * Management:     https://landscape.canonical.com
 * Support:        https://ubuntu.com/pro

 System information as of Sun Feb  1 05:38:55 AM UTC 2026

  System load:             0.03
  Usage of /:              1.8% of 835.58GB
  Memory usage:            5%
  Swap usage:              0%
  Temperature:             45.0 C
  Processes:               352
  Users logged in:         1
  IPv4 address for wlp2s0: 192.168.1.75
  IPv6 address for wlp2s0: 2600:1702:56e5:4e10:dba1:cc44:2510:46a6
  IPv6 address for wlp2s0: 2600:1702:56e5:4e10:36b9:c4a0:a7e9:31ed
  IPv6 address for wlp2s0: 2600:1702:56e5:4e10:9d6d:19f6:90da:d7f6
  IPv6 address for wlp2s0: 2600:1702:56e5:4e10:ef2d:9464:3557:9dec
  IPv6 address for wlp2s0: 2600:1702:56e5:4e10:7a8d:97c:e56c:fe96
  IPv6 address for wlp2s0: 2600:1702:56e5:4e10::48
  IPv6 address for wlp2s0: 2600:1702:56e5:4e10:22:bc19:d505:ab6a
  IPv6 address fo

**Common causes and fixes:**

1. **Container runtime not configured**: MicroK8s' bundled containerd may need a restart after the toolkit is installed.
2. **Driver paths not accessible**: the container toolkit cannot find the `/dev/nvidia*` devices.
3. **AppArmor/SELinux blocking**: security policies prevent GPU device access.

Try these fixes:

In [33]:
import subprocess
import os

# Load SSH environment
with open('/tmp/ssh_agent_env.sh') as f:
    for line in f:
        if '=' in line and line.startswith('export'):
            key, val = line.replace('export ', '').strip().split('=', 1)
            os.environ[key] = val

ssh_opts = ['-o', 'StrictHostKeyChecking=accept-new']

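# NOTE: MicroK8s runs its own containerd as the snap service
# microk8s.daemon-containerd. The host-level containerd unit restarted here is a
# separate service, so this attempt does not touch the runtime MicroK8s pods use.
print("=== Fix 1: Restart containerd on GPU nodes ===")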
for node_ip, node_name in [("192.168.1.76", "spark-01"), ("192.168.1.77", "spark-02")]:
    print(f"\nRestarting containerd on {node_name}...")
    result = subprocess.run(
        ['ssh'] + ssh_opts + [f'nvidia@{node_ip}', 'sudo systemctl restart containerd'],
        capture_output=True, text=True
    )
    if result.returncode == 0:
        print(f"✓ {node_name} containerd restarted")
    else:
        print(f"✗ {node_name} failed: {result.stderr}")

print("\n=== Waiting 30 seconds for pods to reinitialize ===")
import time
time.sleep(30)

print("\n=== Checking pod status ===")
result = subprocess.run(
    ['ssh'] + ssh_opts + ['nvidia@192.168.1.75', 
     'sudo microk8s kubectl get pods -n gpu-operator'],
    capture_output=True, text=True
)
print(result.stdout)

=== Fix 1: Restart containerd on GPU nodes ===

Restarting containerd on spark-01...
✗ spark-01 failed: Job for containerd.service failed because the control process exited with error code.
See "systemctl status containerd.service" and "journalctl -xeu containerd.service" for details.


Restarting containerd on spark-02...
✗ spark-02 failed: Job for containerd.service failed because the control process exited with error code.
See "systemctl status containerd.service" and "journalctl -xeu containerd.service" for details.


=== Waiting 30 seconds for pods to reinitialize ===

=== Checking pod status ===
NAME                                                          READY   STATUS     RESTARTS   AGE
gpu-feature-discovery-g7nz6                                   0/1     Init:0/1   0          40m
gpu-feature-discovery-xz7hm                                   0/1     Init:0/1   0          40m
gpu-operator-767fdbb8d5-rczrr                                 1/1     Running    0          40m
gpu-ope

#### Fix: Configure NVIDIA Runtime in Containerd

If pods fail with `no runtime for "nvidia" is configured`, containerd needs explicit configuration for the NVIDIA runtime. The GPU Operator installs the toolkit, but MicroK8s containerd may need manual configuration.

#### Deep Diagnostics: Check What Init Containers Are Waiting For

Let's examine exactly what's blocking the init containers.

In [47]:
import subprocess
import os

# Load SSH environment
with open('/tmp/ssh_agent_env.sh') as f:
    for line in f:
        if '=' in line and line.startswith('export'):
            key, val = line.replace('export ', '').strip().split('=', 1)
            os.environ[key] = val

ssh_opts = ['-o', 'StrictHostKeyChecking=accept-new']

print("=== COMPREHENSIVE GPU OPERATOR DIAGNOSTICS ===\n")

# Check init container logs first - this tells us the actual error
print("1. CHECKING INIT CONTAINER ERROR MESSAGES")
print("=" * 60)
result = subprocess.run(
    ['ssh'] + ssh_opts + ['nvidia@192.168.1.75', """
        POD=$(sudo microk8s kubectl get pods -n gpu-operator -l app=nvidia-device-plugin-daemonset -o jsonpath='{.items[0].metadata.name}')
        echo "Checking pod: $POD"
        echo ""
        echo "Init container logs (toolkit-validation):"
        sudo microk8s kubectl logs -n gpu-operator $POD -c toolkit-validation 2>&1 | tail -20
    """],
    capture_output=True, text=True
)
print(result.stdout)

# Check GPU nodes for nvidia-container-runtime and config
for node_ip, node_name in [("192.168.1.76", "spark-01"), ("192.168.1.77", "spark-02")]:
    print(f"\n2. CHECKING {node_name.upper()} CONFIGURATION")
    print("=" * 60)
    
    result = subprocess.run(
        ['ssh'] + ssh_opts + [f'nvidia@{node_ip}', """
            echo "A. nvidia-container-runtime installation:"
            dpkg -l | grep nvidia-container-runtime || echo "  NOT INSTALLED via dpkg"
            which nvidia-container-runtime 2>/dev/null || echo "  NOT in PATH"
            ls -la /usr/bin/nvidia-container-runtime 2>/dev/null || echo "  NOT found in /usr/bin"
            
            echo ""
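            # NOTE: despite the label, this reads the template file; MicroK8s renders it into containerd.toml in the same directory
            echo "B. MicroK8s containerd active config (not template):"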
            if [ -f /var/snap/microk8s/current/args/containerd-template.toml ]; then
                sudo grep -A 3 'runtimes.nvidia' /var/snap/microk8s/current/args/containerd-template.toml 2>/dev/null || echo "  No nvidia runtime in config"
            else
                echo "  Config file not found"
            fi
            
            echo ""
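            echo "C. Can containerd see nvidia runtime?"
            # NOTE: runtime handlers are entries in the CRI config, not containerd plugins, so an empty result here is expected and not conclusive
            sudo microk8s ctr plugin ls 2>/dev/null | grep nvidia || echo "  No nvidia plugin visible to containerd"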
            
            echo ""
            echo "D. GPU devices accessible?"
            ls -la /dev/nvidia* 2>/dev/null | head -5 || echo "  No /dev/nvidia* devices"
            
            echo ""
            echo "E. nvidia-smi works on host?"
            nvidia-smi --query-gpu=name --format=csv,noheader 2>&1 | head -3
        """],
        capture_output=True, text=True
    )
    print(result.stdout)

print("\n3. CHECKING TOOLKIT DAEMONSET STATUS")
print("=" * 60)
result = subprocess.run(
    ['ssh'] + ssh_opts + ['nvidia@192.168.1.75', """
        echo "Toolkit pods (these install nvidia-container-toolkit):"
        sudo microk8s kubectl get pods -n gpu-operator -l app=nvidia-container-toolkit-daemonset -o wide
        echo ""
        echo "Checking if toolkit installation completed:"
        POD=$(sudo microk8s kubectl get pods -n gpu-operator -l app=nvidia-container-toolkit-daemonset -o jsonpath='{.items[0].metadata.name}')
        sudo microk8s kubectl logs -n gpu-operator $POD --tail=30 2>&1 | grep -E "(Completed|Error|Installing|Failed)" || echo "Check full logs manually"
    """],
    capture_output=True, text=True
)
print(result.stdout)

=== COMPREHENSIVE GPU OPERATOR DIAGNOSTICS ===

1. CHECKING INIT CONTAINER ERROR MESSAGES
Checking pod: nvidia-device-plugin-daemonset-7fbqc

Init container logs (toolkit-validation):
waiting for nvidia container stack to be setup


2. CHECKING SPARK-01 CONFIGURATION
A. nvidia-container-runtime installation:
  NOT INSTALLED via dpkg
/usr/bin/nvidia-container-runtime
-rwxr-xr-x 1 root root 5505192 May 30  2025 /usr/bin/nvidia-container-runtime

B. MicroK8s containerd active config (not template):
    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia-container-runtime]
      # runtime_type is the runtime type to use in containerd e.g. io.containerd.runtime.v1.linux
      runtime_type = "${RUNTIME_TYPE}"

      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia-container-runtime.options]
        BinaryName = "nvidia-container-runtime"

   [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.kata]
--
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nv

#### FIX: Add nvidia Runtime Configuration

**The Problem:** GPU Operator init containers are stuck in `PodInitializing` because they cannot find the required container runtime.

**Root Cause:** The GPU Operator expects a containerd runtime named `nvidia`, but MicroK8s on DGX Spark configures a runtime named `nvidia-container-runtime`. This is a naming mismatch.

**How to Identify:**
```bash
# Check containerd config on GPU nodes
sudo grep -A 3 'runtimes.nvidia' /var/snap/microk8s/current/args/containerd-template.toml

# You'll see:
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia-container-runtime]
  # Wrong name! GPU Operator looks for "nvidia", not "nvidia-container-runtime"
```

**The Fix:** Add a runtime entry named `nvidia` that points to the same `/usr/bin/nvidia-container-runtime` binary. Both names can coexist: the existing `nvidia-container-runtime` entry stays in place, and the new `nvidia` entry acts as an alias for GPU Operator compatibility.
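
The check-then-append logic described here can be sketched as a pure function over the template text, which makes it easy to confirm the edit is idempotent before touching a node. This is an illustration only: `runtime_names` and `ensure_nvidia_runtime` are hypothetical helper names, the regular expression assumes the `io.containerd.grpc.v1.cri` table layout shown in this section, and the sample template is a hand-written fragment.

```python
import re

# Stanza to append; same values this section uses for the `nvidia` alias.
NVIDIA_STANZA = '''
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
  runtime_type = "io.containerd.runc.v2"
  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
    BinaryName = "/usr/bin/nvidia-container-runtime"
'''

# Matches runtime table headers only, not their `.options` sub-tables.
RUNTIME_HEADER = re.compile(
    r'^\s*\[plugins\."io\.containerd\.grpc\.v1\.cri"\.containerd\.runtimes\.'
    r'([A-Za-z0-9_-]+)\]\s*$',
    re.MULTILINE,
)


def runtime_names(toml_text):
    """Return the sorted runtime handler names declared in the config text."""
    return sorted(set(RUNTIME_HEADER.findall(toml_text)))


def ensure_nvidia_runtime(toml_text):
    """Append the `nvidia` stanza unless a runtime with that exact name exists."""
    if "nvidia" in runtime_names(toml_text):
        return toml_text
    return toml_text.rstrip("\n") + "\n" + NVIDIA_STANZA


# Hand-written fragment resembling the template on the Spark nodes.
SAMPLE_TEMPLATE = '''
    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia-container-runtime]
      runtime_type = "${RUNTIME_TYPE}"
      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia-container-runtime.options]
        BinaryName = "nvidia-container-runtime"
'''

patched = ensure_nvidia_runtime(SAMPLE_TEMPLATE)
print(runtime_names(SAMPLE_TEMPLATE))
print(runtime_names(patched))
```

Matching the exact header name matters because a plain substring search for `runtimes.nvidia` also matches `runtimes.nvidia-container-runtime`, which would wrongly report the alias as already present.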

### 5.4 Verify GPU Resources Are Visible

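The first cell below applies the runtime fix from the previous section on both workers. The second cell then checks that GPUs are exposed as allocatable resources on the worker nodes.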

In [51]:
import subprocess
import os

# Load SSH environment
with open('/tmp/ssh_agent_env.sh') as f:
    for line in f:
        if '=' in line and line.startswith('export'):
            key, val = line.replace('export ', '').strip().split('=', 1)
            os.environ[key] = val

ssh_opts = ['-o', 'StrictHostKeyChecking=accept-new']

print("=== Adding 'nvidia' Runtime to Containerd Config ===\n")

for node_ip, node_name in [("192.168.1.76", "spark-01"), ("192.168.1.77", "spark-02")]:
    print(f"=== Configuring {node_name} ===")
    
    fix_script = """
    # Check if 'nvidia' runtime already exists
    if sudo grep -q '\\[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia\\]' /var/snap/microk8s/current/args/containerd-template.toml; then
        echo "✓ nvidia runtime already configured"
    else
        echo "Adding nvidia runtime configuration..."
        
        # Add nvidia runtime (GPU Operator expects this name)
        sudo tee -a /var/snap/microk8s/current/args/containerd-template.toml > /dev/null <<'EOF'

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
  runtime_type = "io.containerd.runc.v2"
  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
    BinaryName = "/usr/bin/nvidia-container-runtime"
EOF
        
        echo "✓ nvidia runtime added"
    fi
    
    # Restart containerd to apply changes
    echo "Restarting MicroK8s containerd..."
    sudo snap restart microk8s.daemon-containerd
    sleep 5
    
    echo "✓ Done"
    """
    
    result = subprocess.run(
        ['ssh'] + ssh_opts + [f'nvidia@{node_ip}', 'bash -s'],
        input=fix_script,
        capture_output=True,
        text=True
    )
    
    print(result.stdout)
    if result.stderr:
        print(f"STDERR: {result.stderr}")
    print()

print("\n=== Waiting 45 seconds for pods to restart ===")
import time
time.sleep(45)

print("\n=== Checking Pod Status ===")
result = subprocess.run(
    ['ssh'] + ssh_opts + ['nvidia@192.168.1.75', 
     'sudo microk8s kubectl get pods -n gpu-operator -o wide | grep -E "(NAME|device-plugin|dcgm-exporter|validator)"'],
    capture_output=True, text=True
)
print(result.stdout)

=== Adding 'nvidia' Runtime to Containerd Config ===

=== Configuring spark-01 ===
✓ nvidia runtime already configured
Restarting MicroK8s containerd...
Restarted.
✓ Done


=== Configuring spark-02 ===
✓ nvidia runtime already configured
Restarting MicroK8s containerd...
Restarted.
✓ Done



=== Waiting 45 seconds for pods to restart ===

=== Checking Pod Status ===
NAME                                                          READY   STATUS      RESTARTS   AGE   IP             NODE         NOMINATED NODE   READINESS GATES
nvidia-cuda-validator-2bnrh                                   0/1     Completed   0          12h   10.1.168.200   spark-01     <none>           <none>
nvidia-cuda-validator-gl2hj                                   0/1     Completed   0          12h   10.1.199.137   spark-02     <none>           <none>
nvidia-dcgm-exporter-88q4k                                    1/1     Running     0          12h   10.1.168.199   spark-01     <none>           <none>
nvidia-dcgm-export

In [52]:
%%bash
echo "=== GPU Resources on Nodes ==="
ssh nvidia@192.168.1.75 'sudo microk8s kubectl describe nodes | grep -A 10 "Allocatable:" | grep -E "(nvidia.com/gpu|Name:)"'

echo ""
echo "Each DGX Spark node should show nvidia.com/gpu: <count>"

=== GPU Resources on Nodes ===
  nvidia.com/gpu:     1
  nvidia.com/gpu:     1
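
The `grep` pipeline above prints the GPU counts but not which node each count belongs to, because `Name:` appears before the `Allocatable:` block in `kubectl describe` output. A possible alternative, sketched here under the assumption that you fetch `kubectl get nodes -o json` instead, maps each node name to its allocatable GPU count. The function name `gpu_allocatable` is hypothetical and the sample JSON is hand-written, not captured from this cluster.

```python
import json


def gpu_allocatable(node_list):
    """Map node name -> allocatable nvidia.com/gpu count (0 when the resource is absent)."""
    counts = {}
    for node in node_list.get("items", []):
        allocatable = node.get("status", {}).get("allocatable", {})
        # Kubernetes serializes resource quantities as strings, e.g. "1"
        counts[node["metadata"]["name"]] = int(allocatable.get("nvidia.com/gpu", "0"))
    return counts


# Hand-written sample shaped like `kubectl get nodes -o json`.
SAMPLE_NODES = json.loads("""
{
  "items": [
    {"metadata": {"name": "controller"},
     "status": {"allocatable": {"cpu": "8"}}},
    {"metadata": {"name": "spark-01"},
     "status": {"allocatable": {"cpu": "20", "nvidia.com/gpu": "1"}}},
    {"metadata": {"name": "spark-02"},
     "status": {"allocatable": {"cpu": "20", "nvidia.com/gpu": "1"}}}
  ]
}
""")

for name, count in gpu_allocatable(SAMPLE_NODES).items():
    print(f"{name}: nvidia.com/gpu = {count}")
```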


## Step 6: Test GPU Access

Deploy a simple GPU test pod to verify that containers can access GPUs.

In [55]:
%%bash
echo "=== Recreating GPU Test Pod with Explicit Runtime ==="
ssh nvidia@192.168.1.75 << 'EOF'

echo "1. CREATE NVIDIA RUNTIMECLASS"
echo "=============================="
cat <<YAML | sudo microk8s kubectl apply -f -
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia
YAML

echo ""
echo "2. DELETE EXISTING POD"
echo "======================"
sudo microk8s kubectl delete pod gpu-test --ignore-not-found
sleep 5

echo ""
echo "3. CREATE NEW GPU TEST POD"
echo "=========================="
cat <<YAML | sudo microk8s kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  runtimeClassName: nvidia
  containers:
  - name: cuda-test
    image: nvidia/cuda:12.2.0-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
YAML

echo ""
echo "Waiting 15 seconds for pod to start..."
sleep 15

echo ""
echo "4. POD STATUS"
echo "============="
sudo microk8s kubectl get pod gpu-test -o wide

echo ""
echo "5. POD LOGS (nvidia-smi output)"
echo "================================"
sudo microk8s kubectl logs gpu-test 2>&1 || echo "Pod not ready yet - check status above"

echo ""
echo "If pod shows 'Completed' status, GPU access is working!"
echo "If still in error, run: sudo microk8s kubectl describe pod gpu-test"
EOF

=== Recreating GPU Test Pod with Explicit Runtime ===






3. CREATE NEW GPU TEST POD
pod/gpu-test created

Waiting 15 seconds for pod to start...

4. POD STATUS
NAME       READY   STATUS      RESTARTS   AGE   IP             NODE       NOMINATED NODE   READINESS GATES
gpu-test   0/1     Completed   0          15s   10.1.199.139   spark-02   <none>           <none>

5. POD LOGS (nvidia-smi output)
Sun Feb  1 18:09:11 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.95.05              Driver Version: 580.95.05      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA GB10                    On  |   0000000F:01:00.0 Off |      
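
The test cell above waits a fixed 15 seconds, which may be too short when the image still has to be pulled. A polling loop is one way to wait only as long as needed. The sketch below is an assumption-laden illustration: `wait_for_phase` is a hypothetical helper, and the phase source is injected as a callable so the loop can be exercised without a cluster. Note that `kubectl get pod` shows `Completed` for a pod whose phase is `Succeeded`.

```python
import time


def wait_for_phase(get_phase, wanted=("Succeeded", "Failed"), timeout_s=120.0,
                   interval_s=2.0, sleep=time.sleep, clock=time.monotonic):
    """Poll get_phase() until it returns one of `wanted`, or raise TimeoutError."""
    deadline = clock() + timeout_s
    while True:
        phase = get_phase()
        if phase in wanted:
            return phase
        if clock() >= deadline:
            raise TimeoutError(f"pod still in phase {phase!r} after {timeout_s}s")
        sleep(interval_s)


# Demo with a scripted phase sequence so the sketch runs without a cluster.
_phases = iter(["Pending", "Running", "Succeeded"])
final_phase = wait_for_phase(lambda: next(_phases), sleep=lambda _seconds: None)
print(final_phase)
```

In the notebook, `get_phase` could wrap the existing SSH pattern and run `sudo microk8s kubectl get pod gpu-test -o jsonpath='{.status.phase}'`, returning the stripped stdout.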

In [56]:
%%bash
echo "=== Checking GPU Test Pod Status ==="
ssh nvidia@192.168.1.75 << 'EOF'
echo "Pod Status:"
sudo microk8s kubectl get pod gpu-test -o wide

echo ""
echo "Pod Events:"
sudo microk8s kubectl describe pod gpu-test | tail -20

echo ""
echo "If status is Completed, check logs:"
sudo microk8s kubectl logs gpu-test 2>&1 || echo "Pod not completed yet"
EOF

=== Checking GPU Test Pod Status ===



## Step 7: Deploy vLLM for Inference

Now that the cluster is running with GPU support, deploy vLLM to serve LLM inference requests.

**Test Plan:**
1. Single-node baseline: Deploy Llama 3.1 8B on one GPU
2. Measure baseline throughput (tokens/sec)
3. Deploy distributed vLLM with tensor parallelism across both nodes
4. Compare performance and check whether the InfiniBand/RDMA link makes a measurable difference

### 7.0 Set Up Hugging Face Token

Load your Hugging Face token from the environment or a `.env` file. The token is needed to download gated models such as Llama.

**Option 1:** Set the environment variable before starting Jupyter:
```bash
export HF_TOKEN="hf_..."
jupyter notebook
```

**Option 2:** Create a `.env` file in the workspace root:
```bash
HF_TOKEN=hf_...
```

In [None]:
import os
from pathlib import Path

# Try to load from environment first
hf_token = os.getenv('HF_TOKEN')

# If not in environment, try .env file
if not hf_token:
    env_file = Path.cwd() / '.env'
    if env_file.exists():
        with open(env_file) as f:
            for line in f:
                if line.startswith('HF_TOKEN='):
                    hf_token = line.split('=', 1)[1].strip()
                    break

if hf_token:
    # Avoid echoing any part of the secret into the notebook output
    print(f"✓ Hugging Face token loaded ({len(hf_token)} characters)")
    os.environ['HF_TOKEN'] = hf_token
else:
    print("⚠ No HF_TOKEN found in environment or .env file")
    print("  Models may fail to download if they require authentication")
    hf_token = ""

### 7.1 Deploy vLLM Single-Node Baseline

Start with a single-GPU deployment to establish baseline performance.

In [None]:
import subprocess
import os

# Load SSH environment
with open('/tmp/ssh_agent_env.sh') as f:
    for line in f:
        if '=' in line and line.startswith('export'):
            key, val = line.replace('export ', '').strip().split('=', 1)
            os.environ[key] = val

# Get HF token from environment (set in previous cell)
hf_token = os.getenv('HF_TOKEN', '')

print("=== Deploying vLLM Single-Node (Llama 3.1 8B) ===")

yaml_content = f"""apiVersion: v1
kind: Pod
metadata:
  name: vllm-single
  labels:
    app: vllm-single
spec:
  containers:
  - name: vllm
    image: vllm/vllm-openai:latest
    command:
      - python3
      - -m
      - vllm.entrypoints.openai.api_server
      - --model
      - meta-llama/Llama-3.1-8B-Instruct
      - --host
      - "0.0.0.0"
      - --port
      - "8000"
    ports:
    - containerPort: 8000
      name: http
    resources:
      limits:
        nvidia.com/gpu: 1
    env:
    - name: HUGGING_FACE_HUB_TOKEN
      value: "{hf_token}"
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-single-svc
spec:
  selector:
    app: vllm-single
  ports:
  - port: 8000
    targetPort: 8000
  type: ClusterIP
"""

ssh_opts = ['-o', 'StrictHostKeyChecking=accept-new']

# Apply the YAML via SSH
result = subprocess.run(
    ['ssh'] + ssh_opts + ['nvidia@192.168.1.75', 
     f'sudo microk8s kubectl apply -f - <<EOF\n{yaml_content}\nEOF'],
    capture_output=True, text=True
)

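print(result.stdout)
if result.stderr:
    print("STDERR:", result.stderr)

# Only report success when kubectl actually succeeded
if result.returncode == 0:
    print("\nvLLM pod and service created.")
    print("Model download may take several minutes on first run.")
else:
    print(f"\nkubectl apply failed with exit code {result.returncode}")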
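
The cell above writes the token into the pod spec as a literal value, which leaves it readable in `kubectl get pod -o yaml`. One common alternative is to store it in a Kubernetes Secret and reference it with `secretKeyRef`. The sketch below only builds the two manifest fragments as Python dictionaries. The Secret name `hf-token`, the key `token`, and both function names are hypothetical choices, not something this notebook defines.

```python
import base64
import json


def hf_token_secret(token, name="hf-token", key="token"):
    """Build a Secret manifest; values under `data` must be base64-encoded."""
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name},
        "type": "Opaque",
        "data": {key: base64.b64encode(token.encode()).decode()},
    }


def hf_token_env(name="hf-token", key="token"):
    """Build the container `env` entry that reads the token from the Secret."""
    return {
        "name": "HUGGING_FACE_HUB_TOKEN",
        "valueFrom": {"secretKeyRef": {"name": name, "key": key}},
    }


# Placeholder value for illustration; never print a real token.
secret_manifest = hf_token_secret("hf_placeholder")
print(secret_manifest["kind"], secret_manifest["metadata"]["name"])
print(json.dumps(hf_token_env(), indent=2))
```

`kubectl apply -f -` accepts JSON as well as YAML, so `json.dumps(secret_manifest)` can be piped to it over SSH in the same way the YAML above is.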

### 7.2 Monitor vLLM Startup

Watch the pod logs to see when the model is loaded and ready to serve requests.

In [None]:
%%bash
ssh nvidia@192.168.1.75 << 'EOF'
echo "=== vLLM Pod Status ==="
sudo microk8s kubectl get pod vllm-single -o wide

echo ""
echo "=== Last 50 Lines of Logs ==="
sudo microk8s kubectl logs vllm-single --tail=50 || echo "Pod not ready yet"
EOF

### 7.3 Test vLLM Endpoint

Send a test request to the vLLM OpenAI-compatible API.

In [None]:
%%bash
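ssh nvidia@192.168.1.75 << 'EOF'
echo "=== Testing vLLM Inference ==="
# The Service DNS name only resolves inside the cluster, so look up the
# ClusterIP and call it from the node instead.
SVC_IP=$(sudo microk8s kubectl get svc vllm-single-svc -o jsonpath='{.spec.clusterIP}')
curl -s -X POST "http://${SVC_IP}:8000/v1/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "prompt": "Explain RDMA in one sentence:",
    "max_tokens": 50,
    "temperature": 0.7
  }' | python3 -m json.tool
EOF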

### 7.4 Benchmark Single-Node Performance

Use a simple benchmark script to measure tokens per second.

In [None]:
import time
import requests
import json

def benchmark_vllm(endpoint, num_requests=10, prompt="Explain distributed systems:", max_tokens=100):
    """Simple throughput benchmark for vLLM"""
    
    results = []
    
    print(f"Running {num_requests} requests...")
    for i in range(num_requests):
        start = time.time()
        
        response = requests.post(
            f"{endpoint}/v1/completions",
            json={
                "model": "meta-llama/Llama-3.1-8B-Instruct",
                "prompt": prompt,
                "max_tokens": max_tokens,
                "temperature": 0.7
            }
        )
        
        elapsed = time.time() - start
        
        if response.status_code == 200:
            data = response.json()
            tokens = data['usage']['completion_tokens']
            tokens_per_sec = tokens / elapsed if elapsed > 0 else 0
            
            results.append({
                'request': i + 1,
                'tokens': tokens,
                'time_sec': elapsed,
                'tokens_per_sec': tokens_per_sec
            })
            
            print(f"  Request {i+1}: {tokens} tokens in {elapsed:.2f}s ({tokens_per_sec:.1f} tok/s)")
        else:
            print(f"  Request {i+1}: ERROR {response.status_code}")
    
    # Calculate statistics
    if results:
        avg_tokens_per_sec = sum(r['tokens_per_sec'] for r in results) / len(results)
        total_tokens = sum(r['tokens'] for r in results)
        total_time = sum(r['time_sec'] for r in results)
        
        print(f"\n=== Single-Node Baseline Results ===")
        print(f"Total requests: {len(results)}")
        print(f"Total tokens: {total_tokens}")
        print(f"Total time: {total_time:.2f}s")
        print(f"Average throughput: {avg_tokens_per_sec:.1f} tokens/sec")
        
        return avg_tokens_per_sec
    
    return 0

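# Run benchmark. The Service name below only resolves inside the cluster; from a
# workstation, point `endpoint` at a port-forward or at the ClusterIP from a node.
# endpoint = "http://vllm-single-svc:8000"
# baseline_throughput = benchmark_vllm(endpoint)

print("NOTE: Update endpoint URL and uncomment to run benchmark")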
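
The benchmark above reports the mean of per-request rates, which weights a short request the same as a long one. For the single-node versus distributed comparison it may help to also report aggregate throughput (total tokens divided by total time) and latency spread. The sketch below assumes you keep the per-request `results` list that `benchmark_vllm` builds internally; `summarize` is a hypothetical helper and the two sample rows are made-up numbers chosen to show the difference.

```python
import statistics


def summarize(results):
    """Aggregate per-request benchmark rows into throughput and latency figures."""
    if not results:
        raise ValueError("no successful requests to summarize")
    total_tokens = sum(r["tokens"] for r in results)
    total_time = sum(r["time_sec"] for r in results)
    latencies = [r["time_sec"] for r in results]
    return {
        "aggregate_tokens_per_sec": total_tokens / total_time if total_time > 0 else 0.0,
        "mean_of_request_rates": statistics.fmean([r["tokens_per_sec"] for r in results]),
        "median_latency_sec": statistics.median(latencies),
        "max_latency_sec": max(latencies),
    }


# Made-up rows for illustration only, not measurements from this cluster.
sample_results = [
    {"request": 1, "tokens": 100, "time_sec": 1.0, "tokens_per_sec": 100.0},
    {"request": 2, "tokens": 100, "time_sec": 4.0, "tokens_per_sec": 25.0},
]

summary = summarize(sample_results)
for key, value in summary.items():
    print(f"{key}: {value:.1f}")
```

With these two rows the mean of request rates is 62.5 tokens/sec while the aggregate figure is 40.0 tokens/sec, which is why the two numbers should not be used interchangeably.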

## Step 8: Deploy vLLM with Tensor Parallelism

Deploy vLLM distributed across both DGX Spark nodes using tensor parallelism. This requires:
- Ray cluster for coordination
- NCCL over your InfiniBand/RoCE link for GPU-to-GPU communication
- Larger model that benefits from distribution (Llama 3.1 70B)
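
A back-of-the-envelope calculation shows why the 70B model is the one that needs both nodes. The sketch below counts weights only. It assumes 2 bytes per parameter (BF16/FP16) and 128 GB of memory per DGX Spark node, a figure taken from NVIDIA's published specification rather than from this notebook, and it ignores KV cache, activations, and runtime overhead, so real headroom is smaller than it suggests. The function names are hypothetical.

```python
def weights_gb(params_billion, bytes_per_param=2.0):
    """Approximate weight size in GB (decimal): parameters x bytes per parameter."""
    return params_billion * bytes_per_param


def fits(params_billion, nodes, node_memory_gb=128.0, bytes_per_param=2.0):
    """True if each node's share of the weights is below its memory (weights only)."""
    return weights_gb(params_billion, bytes_per_param) / nodes < node_memory_gb


# node_memory_gb=128 is an ASSUMPTION about DGX Spark, not a value from this notebook.
for name, size in [("Llama 3.1 8B", 8), ("Llama 3.1 70B", 70)]:
    for nodes in (1, 2):
        share = weights_gb(size) / nodes
        print(f"{name}, {nodes} node(s): {share:.0f} GB of weights per node, "
              f"fits={fits(size, nodes)}")
```

Under these assumptions the 8B model fits on one node with room to spare, while the 70B weights alone exceed a single node and only fit when split across two.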

### 8.1 Label GPU Nodes

Add node labels to schedule distributed vLLM pods correctly.

In [None]:
%%bash
ssh nvidia@192.168.1.75 << 'EOF'
echo "=== Labeling GPU Nodes ==="
sudo microk8s kubectl label node spark-01 nvidia.com/gpu.present=true --overwrite
sudo microk8s kubectl label node spark-02 nvidia.com/gpu.present=true --overwrite

echo ""
sudo microk8s kubectl get nodes --show-labels | grep "nvidia.com/gpu"
EOF

### 8.2 Notes on Distributed vLLM Deployment

Distributed vLLM deployment requires additional setup:

**Option 1: KubeRay Operator**
- Deploy KubeRay operator to manage Ray clusters
- Create RayCluster resource with worker nodes on both Spark nodes
- Deploy vLLM with `--tensor-parallel-size=2`

**Option 2: Manual Multi-Pod Deployment**
- StatefulSet with pod anti-affinity (or node selectors) so that one pod lands on each Spark node
- Shared storage for model weights (NFS or similar)
- NCCL configuration to use InfiniBand

**Key Configuration:**
```bash
# vLLM command for tensor parallelism
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 2 \
  --host 0.0.0.0 \
  --port 8000
```

**NCCL Environment Variables for InfiniBand:**
```yaml
- name: NCCL_IB_DISABLE
  value: "0"
- name: NCCL_SOCKET_IFNAME
  value: "ib0"  # placeholder: set to your interface name, e.g. enp1s0f0np0 on these nodes
- name: NCCL_DEBUG
  value: "INFO"
```

This is the critical link to your InfiniBand article: NCCL will use RDMA over your high-speed interconnect.

## Step 9: Add Monitoring with Prometheus and DCGM

Monitor GPU utilization and inference metrics using Prometheus and NVIDIA DCGM Exporter.

### 9.1 Enable Prometheus in MicroK8s

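MicroK8s ships a Prometheus-based monitoring addon. Recent releases name it `observability` and deploy it into the `observability` namespace, which is the namespace the cell below checks. If `enable prometheus` is rejected on your version, try `enable observability` instead.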

In [None]:
%%bash
ssh nvidia@192.168.1.75 << 'EOF'
echo "=== Enabling Prometheus ==="
sudo microk8s enable prometheus

echo ""
echo "Waiting for Prometheus pods to start..."
sleep 30
sudo microk8s kubectl get pods -n observability
EOF

### 9.2 Verify DCGM Exporter

The GPU Operator includes DCGM Exporter, which exposes GPU metrics to Prometheus.

In [None]:
%%bash
ssh nvidia@192.168.1.75 << 'EOF'
echo "=== DCGM Exporter Pods ==="
sudo microk8s kubectl get pods -n gpu-operator | grep dcgm

echo ""
echo "=== Sample GPU Metrics ==="
# Get one DCGM exporter pod
DCGM_POD=$(sudo microk8s kubectl get pods -n gpu-operator -l app=nvidia-dcgm-exporter -o jsonpath='{.items[0].metadata.name}')

if [ -n "$DCGM_POD" ]; then
    echo "Fetching metrics from $DCGM_POD..."
    sudo microk8s kubectl exec -n gpu-operator $DCGM_POD -- curl -s localhost:9400/metrics | grep "DCGM_FI_DEV_GPU_UTIL" | head -5
else
    echo "DCGM Exporter not found"
fi
EOF
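
If you want the utilization values as numbers rather than raw text, the Prometheus text format that DCGM Exporter serves is simple to parse. The sketch below is a minimal parser for one metric, not a full implementation of the exposition format: it assumes each sample line ends with the value (no trailing timestamp), `parse_metric` is a hypothetical helper, and the sample text is hand-written rather than captured from this cluster.

```python
import re

# One label pair: name="value", allowing escaped characters inside the value.
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_metric(text, metric="DCGM_FI_DEV_GPU_UTIL"):
    """Return [(labels, value)] for one metric from Prometheus text-format output."""
    samples = []
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blanks and comment lines (# HELP / # TYPE)
        if not line or line.startswith("#"):
            continue
        # Keep only the requested metric, with or without labels
        if not line.startswith((metric + "{", metric + " ")):
            continue
        series, _, value = line.rpartition(" ")
        labels = dict(LABEL_RE.findall(series))
        samples.append((labels, float(value)))
    return samples


# Hand-written sample in the exporter's output format.
SAMPLE_METRICS = """\
# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",Hostname="spark-01"} 37
DCGM_FI_DEV_GPU_TEMP{gpu="0",Hostname="spark-01"} 45
"""

for labels, value in parse_metric(SAMPLE_METRICS):
    print(f"GPU {labels.get('gpu')} on {labels.get('Hostname')}: {value:.0f}% utilization")
```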

## Next Steps

This notebook established the foundation:

| Component | Status |
|-----------|--------|
| 3-node MicroK8s cluster | ✓ Deployed |
| GPU Operator | ✓ Installed |
| Single-node vLLM | ✓ Configured |
| Monitoring (Prometheus/DCGM) | ✓ Enabled |
| Distributed vLLM | Documented (requires additional setup) |

**To complete the project:**

1. **Deploy distributed vLLM** using KubeRay or StatefulSet
2. **Configure NCCL** to use your InfiniBand link (`enp1s0f0np0`/`enp1s0f1np1`)
3. **Run benchmarks** comparing single-node vs distributed throughput
4. **Measure latency** impact of tensor parallelism over RDMA
5. **Create dashboards** in Grafana for GPU utilization

**The compelling article** writes itself once you have these numbers:
- "vLLM on 2 DGX Spark nodes: X tokens/sec with Llama 3.1 70B"
- "Why 96 Gbps RDMA matters: tensor parallelism latency comparison"
- "Cost analysis: $Y home lab vs $Z cloud GPU hours"

## Troubleshooting: Complete Cluster Reset

If the cluster becomes corrupted or you encounter version skew issues between nodes, you can perform a complete reset.

### Symptoms Requiring Reset

- Worker nodes show `NotReady` status in `kubectl get nodes`
- GPU Operator pods stuck in `CrashLoopBackOff` or `ContainerCreating`
- Version mismatch between controller and workers (e.g., v1.32 vs v1.31)
- Different containerd versions across nodes

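**Cause:** Kubernetes does not support a kubelet that is newer than the API server, and it tolerates only a limited number of minor versions in the other direction (the upstream version skew policy gives the exact window for each release). If you upgraded the controller but not the workers, or installed different MicroK8s channels, the simplest fix is to reset and reinstall the same channel on all nodes.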
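
A small check along these lines can flag version skew before it causes trouble. It is only a sketch: `parse_minor` and `kubelet_skew_ok` are hypothetical helpers, the version strings would come from the `VERSION` column of `kubectl get nodes`, and the default `max_older=1` is a deliberately conservative choice for this tutorial rather than a statement of the upstream limit, which is why it is a parameter.

```python
import re


def parse_minor(version):
    """Extract (major, minor) from strings such as 'v1.32.1' or '1.31'."""
    match = re.match(r"v?(\d+)\.(\d+)", version.strip())
    if not match:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def kubelet_skew_ok(apiserver, kubelet, max_older=1):
    """True if the kubelet is not newer than the API server and at most
    `max_older` minor versions behind it."""
    api_major, api_minor = parse_minor(apiserver)
    kube_major, kube_minor = parse_minor(kubelet)
    if api_major != kube_major:
        return False
    return 0 <= api_minor - kube_minor <= max_older


# Illustrative version pairs (controller, worker); not read from this cluster.
for controller, worker in [("v1.32.1", "v1.32.1"), ("v1.32.1", "v1.31.4"), ("v1.31.0", "v1.32.0")]:
    print(f"controller {controller}, worker {worker}: ok={kubelet_skew_ok(controller, worker)}")
```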

### Step 1: Remove Worker Nodes from Cluster

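Before resetting, remove the worker nodes from the cluster by running `remove-node` on the controller. If `remove-node` refuses because a node is still an active member, run Step 2 (`microk8s leave` on the worker) first, or pass `--force`.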

In [57]:
%%bash
source /tmp/ssh_agent_env.sh
SSH_OPTS="-o StrictHostKeyChecking=accept-new"

echo "=== Removing Worker Nodes from Cluster ==="
ssh $SSH_OPTS nvidia@192.168.1.75 << 'EOF'
    echo "Removing spark-01..."
    sudo microk8s remove-node spark-01 || echo "Node already removed or not found"

    echo "Removing spark-02..."
    sudo microk8s remove-node spark-02 || echo "Node already removed or not found"

    echo ""
    echo "Remaining nodes:"
    sudo microk8s kubectl get nodes
EOF

=== Removing Worker Nodes from Cluster ===



### Step 2: Leave Cluster from Worker Nodes

Each worker node must leave the cluster before it can be fully reset.

In [58]:
%%bash
source /tmp/ssh_agent_env.sh
SSH_OPTS="-o StrictHostKeyChecking=accept-new"

echo "=== Spark-01: Leaving Cluster ==="
ssh $SSH_OPTS nvidia@192.168.1.76 'sudo microk8s leave' || echo "Already left or not in cluster"

echo ""
echo "=== Spark-02: Leaving Cluster ==="
ssh $SSH_OPTS nvidia@192.168.1.77 'sudo microk8s leave' || echo "Already left or not in cluster"

echo ""
echo "Leave step finished on both workers"

=== Spark-01: Leaving Cluster ===
Configuring services.
Generating new cluster certificates.
Waiting for node to start.  
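
Once both steps have run, the node list on the controller should no longer mention the workers. As a rough sketch (not from the original notebook), the check below parses the text that `microk8s kubectl get nodes` prints. The function names are hypothetical and `sample_output` is a made-up example of that table, not captured output.

```python
def node_names(get_nodes_output: str) -> list[str]:
    """Return the node names listed in `kubectl get nodes` table output."""
    names = []
    for line in get_nodes_output.splitlines():
        fields = line.split()
        # Skip blank lines and the header row (NAME STATUS ROLES AGE VERSION)
        if not fields or fields[0] == "NAME":
            continue
        names.append(fields[0])
    return names


def leftover_workers(get_nodes_output: str,
                     workers: tuple[str, ...] = ("spark-01", "spark-02")) -> list[str]:
    """Return the workers that still appear in the node list."""
    present = set(node_names(get_nodes_output))
    return [worker for worker in workers if worker in present]


# Made-up example of the table format, for illustration only
sample_output = """\
NAME         STATUS   ROLES    AGE   VERSION
controller   Ready    <none>   2d    v1.32.1
"""

remaining = leftover_workers(sample_output)
if remaining:
    print("Still registered:", ", ".join(remaining))
else:
    print("No workers left in the node list")
```

If a worker is still listed, rerun the removal on the controller before purging.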


### Step 3: Purge MicroK8s from All Nodes

Remove the MicroK8s snap from every node. With `--purge`, snap skips the data snapshot it would otherwise save on removal, so no configuration, data, or cluster state survives into the reinstall.

In [65]:
import subprocess
import os

# Load SSH environment
with open('/tmp/ssh_agent_env.sh') as f:
    for line in f:
        if '=' in line and line.startswith('export'):
            key, val = line.replace('export ', '').strip().split('=', 1)
            os.environ[key] = val

ssh_opts = ['-o', 'StrictHostKeyChecking=accept-new']

print("=== Purging MicroK8s from All Nodes ===\n")

nodes = [
    ("192.168.1.75", "Controller"),
    ("192.168.1.76", "Spark-01"),
    ("192.168.1.77", "Spark-02")
]

for ip, name in nodes:
    print(f"{name} ({ip}):")
    try:
        result = subprocess.run(
            ['ssh'] + ssh_opts + [f'nvidia@{ip}', 'sudo snap remove microk8s --purge'],
            capture_output=True,
            text=True,
            timeout=60
        )
        
        if result.returncode == 0:
            print(f"  ✓ {name} purged successfully")
        else:
            if "is not installed" in result.stderr or "is not installed" in result.stdout:
                print(f"  ✓ {name} - MicroK8s not installed")
            else:
                print(f"  ✗ {name} purge failed")
                if result.stderr:
                    print(f"  Error: {result.stderr.strip()}")
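    except subprocess.TimeoutExpired:
        # The 60 s limit covers the whole remote command, not just the SSH connection
        print(f"  ✗ {name} - timed out after 60 s (removal may still be running on the node)")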
    except Exception as e:
        print(f"  ✗ {name} - Error: {e}")
    print()

print("=== Cleanup Complete ===")
print("All nodes processed. Ready for fresh installation.")

=== Purging MicroK8s from All Nodes ===

Controller (192.168.1.75):
  ✓ Controller purged successfully

Spark-01 (192.168.1.76):
  ✓ Spark-01 purged successfully

Spark-02 (192.168.1.77):
  ✓ Spark-02 purged successfully

=== Cleanup Complete ===
All nodes processed. Ready for fresh installation.
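
The cell above prints its closing message regardless of how the individual purges went. One possible variation, sketched here under assumptions, collects the failures and reports readiness only when there are none. The names `purge_node` and `purge_all` are hypothetical, the 300-second timeout is an assumed, more generous value than the 60 seconds used above, and the demonstration at the end uses a stub instead of real SSH so it can run anywhere.

```python
import subprocess
from typing import Callable, Sequence


def purge_node(ip: str, user: str = "nvidia", timeout: int = 300) -> bool:
    """Purge MicroK8s on one node over SSH.

    Returns True if the snap was removed or was not installed to begin with.
    """
    command = [
        "ssh", "-o", "StrictHostKeyChecking=accept-new",
        f"{user}@{ip}", "sudo snap remove microk8s --purge",
    ]
    try:
        result = subprocess.run(command, capture_output=True, text=True, timeout=timeout)
    except (subprocess.TimeoutExpired, OSError):
        return False
    already_gone = "is not installed" in (result.stdout + result.stderr)
    return result.returncode == 0 or already_gone


def purge_all(nodes: Sequence[tuple[str, str]],
              purge: Callable[[str], bool] = purge_node) -> list[str]:
    """Purge every (ip, name) pair and return the names that failed."""
    return [name for ip, name in nodes if not purge(ip)]


nodes = [
    ("192.168.1.75", "Controller"),
    ("192.168.1.76", "Spark-01"),
    ("192.168.1.77", "Spark-02"),
]

# Demonstration with a stub in place of real SSH: Spark-02 is made to fail.
failed = purge_all(nodes, purge=lambda ip: ip != "192.168.1.77")
if failed:
    print("Purge failed on:", ", ".join(failed))
else:
    print("All nodes purged. Ready for fresh installation.")
```

Calling `purge_all(nodes)` without the `purge` argument would run the real SSH command on each node.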


### After Reset: Reinstall with Consistent Versions

After purging, reinstall MicroK8s from **the same channel** (e.g., `1.32/stable`) on all three nodes so that every node runs the same Kubernetes version.

**Critical:** Go back to cell 18 (Step 3.1: Install MicroK8s on the Controller) and repeat the installation steps, passing that one channel on every node.
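
One way to keep the channel identical is to define it once and derive every install command from it. This is a minimal sketch rather than the notebook's own install cell: `CHANNEL`, `NODES`, and `install_command` are hypothetical names, and the remote command assumes the standard `snap install microk8s --classic --channel=...` form. It only builds and prints the commands; nothing is executed.

```python
import shlex

# Example channel from the text; substitute the one you standardize on
CHANNEL = "1.32/stable"

NODES = {
    "controller": "192.168.1.75",
    "spark-01": "192.168.1.76",
    "spark-02": "192.168.1.77",
}


def install_command(ip: str, channel: str = CHANNEL, user: str = "nvidia") -> list[str]:
    """Build the SSH command that installs MicroK8s from the given channel."""
    remote = f"sudo snap install microk8s --classic --channel={shlex.quote(channel)}"
    return ["ssh", f"{user}@{ip}", remote]


commands = {name: install_command(ip) for name, ip in NODES.items()}

for name, command in commands.items():
    # shlex.join renders the argument list as a copy-pasteable shell line
    print(f"{name}: {shlex.join(command)}")
```

Because every command is generated from the single `CHANNEL` value, changing the version later means editing one line instead of three cells.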