# MicroK8s Cluster Setup: CPU Controller + DGX Spark Nodes

This tutorial walks through setting up a MicroK8s Kubernetes cluster with:

| Node | Role | IP Address | Description |
|------|------|------------|-------------|
| controller | Control Plane | 192.168.1.75 | CPU-only node running K8s control plane |
| spark-01 | Worker | 192.168.1.76 | DGX Spark with GPU |
| spark-02 | Worker | 192.168.1.77 | DGX Spark with GPU |

All nodes are connected via WiFi and have the `nvidia` user configured.

## Prerequisites

- Ubuntu 22.04 or later on all nodes
- SSH access from your workstation to all nodes
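- `nvidia` user with sudo privileges on all nodes (the install cells in Step 3 run `sudo` over a non-interactive SSH session, so passwordless sudo is needed for them to finish unattended)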
- Network connectivity between all nodes (WiFi in this case)

## Step 1: Define Cluster Variables

Store node information in environment variables for use throughout this tutorial.

In [1]:
import os

# Cluster configuration
CONTROLLER_IP = "192.168.1.75"
SPARK01_IP = "192.168.1.76"
SPARK02_IP = "192.168.1.77"
SSH_USER = "nvidia"

# Store as environment variables for shell commands
os.environ["CONTROLLER_IP"] = CONTROLLER_IP
os.environ["SPARK01_IP"] = SPARK01_IP
os.environ["SPARK02_IP"] = SPARK02_IP
os.environ["SSH_USER"] = SSH_USER

print(f"Controller: {SSH_USER}@{CONTROLLER_IP}")
print(f"Spark-01:   {SSH_USER}@{SPARK01_IP}")
print(f"Spark-02:   {SSH_USER}@{SPARK02_IP}")

Controller: nvidia@192.168.1.75
Spark-01:   nvidia@192.168.1.76
Spark-02:   nvidia@192.168.1.77
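
Before using these addresses in later cells, it can help to sanity-check them. The following is a minimal sketch, not part of the original notebook: the `NODES` dictionary and the `validate_nodes` helper are hypothetical names, and the `192.168.1.0/24` subnet is an assumption inferred from the three addresses in the table above.

```python
import ipaddress

# Node names and addresses copied from the table at the top of this tutorial.
NODES = {
    "controller": "192.168.1.75",
    "spark-01": "192.168.1.76",
    "spark-02": "192.168.1.77",
}


def validate_nodes(nodes, cidr="192.168.1.0/24"):
    """Return a list of problems found in the node map (empty if none).

    The default subnet is an assumption, not something the tutorial states.
    """
    network = ipaddress.ip_network(cidr)
    problems = []
    seen = {}  # maps each IP already visited to the node that uses it
    for name, ip in nodes.items():
        try:
            addr = ipaddress.ip_address(ip)
        except ValueError:
            problems.append(f"{name}: '{ip}' is not a valid IP address")
            continue
        if addr not in network:
            problems.append(f"{name}: {ip} is outside {cidr}")
        if ip in seen:
            problems.append(f"{name}: {ip} duplicates {seen[ip]}")
        seen[ip] = name
    return problems


print(validate_nodes(NODES) or "Node configuration looks consistent")
```

An empty result only means the addresses are well formed, unique, and in the assumed subnet; it says nothing about whether the hosts are reachable, which is what Step 2 tests.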


## Step 2: Test SSH Connectivity

Verify SSH access to all nodes. With `BatchMode=yes`, `ssh` fails instead of prompting for a password, so each command should either print the node's hostname or report a failure.

In [2]:
%%bash
# Use the variables from Step 1 so the messages always match the hosts being tested
echo "Testing SSH to Controller (${CONTROLLER_IP})..."
ssh -o ConnectTimeout=5 -o BatchMode=yes ${SSH_USER}@${CONTROLLER_IP} "hostname" 2>&1 || echo "FAILED: Cannot connect to controller"

echo ""
echo "Testing SSH to Spark-01 (${SPARK01_IP})..."
ssh -o ConnectTimeout=5 -o BatchMode=yes ${SSH_USER}@${SPARK01_IP} "hostname" 2>&1 || echo "FAILED: Cannot connect to spark-01"

echo ""
echo "Testing SSH to Spark-02 (${SPARK02_IP})..."
ssh -o ConnectTimeout=5 -o BatchMode=yes ${SSH_USER}@${SPARK02_IP} "hostname" 2>&1 || echo "FAILED: Cannot connect to spark-02"

Testing SSH to Controller (192.168.1.75)...
controller
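
The same check can be driven from Python, which makes it easier to collect results per node. This is a sketch under assumptions: `ssh_command` and `check_node` are hypothetical helper names, and the 15-second subprocess timeout is an arbitrary choice. The `ssh` options are the ones used in the cell above.

```python
import shlex
import subprocess


def ssh_command(user, host, remote_cmd="hostname", timeout=5):
    """Build the non-interactive ssh invocation used in the bash cell."""
    return [
        "ssh",
        "-o", f"ConnectTimeout={timeout}",
        "-o", "BatchMode=yes",  # fail instead of prompting for a password
        f"{user}@{host}",
        remote_cmd,
    ]


def check_node(user, host):
    """Return (ok, output) for one node; ok is False on any ssh failure."""
    try:
        result = subprocess.run(
            ssh_command(user, host),
            capture_output=True,
            text=True,
            timeout=15,  # hard stop in case ssh itself hangs
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        return False, str(exc)
    return result.returncode == 0, (result.stdout or result.stderr).strip()


# Show the exact command line without contacting any host.
print(shlex.join(ssh_command("nvidia", "192.168.1.75")))
```

Calling `check_node("nvidia", "192.168.1.75")` runs the real connection test and returns the hostname on success or the `ssh` error text on failure.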


### Diagnosing SSH Environment

The notebook kernel may run in a different environment than your terminal. Let's check:

In [4]:
%%bash
echo "=== SSH Environment Check ==="
echo ""
echo "Running as user: $(whoami)"
echo "Home directory: $HOME"
echo ""
echo "SSH keys available:"
ls -la ~/.ssh/id_* 2>/dev/null || echo "No SSH keys found in ~/.ssh/"
echo ""
echo "SSH agent status:"
echo "SSH_AUTH_SOCK: ${SSH_AUTH_SOCK:-NOT SET}"
ssh-add -l 2>&1 || echo "No agent running or no keys loaded"

=== SSH Environment Check ===

Running as user: nvidia
Home directory: /home/nvidia

SSH keys available:
-rw------- 1 nvidia nvidia 411 Jan 22 23:30 /home/nvidia/.ssh/id_ed25519
-rw-r--r-- 1 nvidia nvidia 103 Jan 22 23:30 /home/nvidia/.ssh/id_ed25519.pub
-rw------- 1 nvidia nvidia 411 Jan 22 00:40 /home/nvidia/.ssh/id_ed25519_shared
-rw-r--r-- 1 nvidia nvidia 100 Jan 22 00:40 /home/nvidia/.ssh/id_ed25519_shared.pub

SSH agent status:
SSH_AUTH_SOCK: NOT SET
Could not open a connection to your authentication agent.
No agent running or no keys loaded
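
Because the point of this check is what the kernel process can see, the same information can also be read from Python directly. A minimal sketch, assuming the default `~/.ssh` layout shown above; `ssh_environment` is a hypothetical helper name.

```python
import os
from pathlib import Path


def ssh_environment(home=None, environ=None):
    """Summarize what this process can see for SSH authentication."""
    home = Path(home) if home else Path.home()
    environ = os.environ if environ is None else environ
    ssh_dir = home / ".ssh"
    # Private keys are the id_* files that do not end in .pub.
    keys = []
    if ssh_dir.is_dir():
        keys = sorted(p.name for p in ssh_dir.glob("id_*") if p.suffix != ".pub")
    return {
        "private_keys": keys,
        "agent_socket": environ.get("SSH_AUTH_SOCK"),  # None if no agent is visible
    }


print(ssh_environment())
```

A missing agent is not necessarily a problem: `ssh` tries default identity files such as `~/.ssh/id_ed25519` on its own, so a passphrase-less key with a default name still works under `BatchMode=yes`. A key with a non-default name, such as `id_ed25519_shared` in the listing above, is only used if it is named with `-i` or in `~/.ssh/config`.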


### Fix: Set Up SSH Key-Based Authentication

If the SSH test fails, set up passwordless (key-based) SSH. First check whether you already have a key: the diagnostic cell above lists any `~/.ssh/id_*` files it finds.

**If no key exists**, run this cell to generate one (skip it if you already have a key):

In [None]:
%%bash
# Generate an SSH key only if one does not already exist;
# otherwise ssh-keygen stops and asks whether to overwrite it
[ -f ~/.ssh/id_ed25519 ] || ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519 -N "" -C "microk8s-cluster"
echo "Public key:"
cat ~/.ssh/id_ed25519.pub

### Copy SSH Key to All Nodes

Run `ssh-copy-id` for each node. This will prompt for the password once per node.

**Run these commands in a terminal** (they require interactive password input):

```bash
# Copy to controller
ssh-copy-id nvidia@192.168.1.75

# Copy to spark-01
ssh-copy-id nvidia@192.168.1.76

# Copy to spark-02
ssh-copy-id nvidia@192.168.1.77
```
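
If you prefer to generate these command lines from the node list rather than typing them, something like the following would do. It is only a sketch: `copy_id_commands` is a hypothetical helper, and passing the key explicitly with `-i` assumes the key created in the previous cell (`~/.ssh/id_ed25519.pub`).

```python
import shlex

# Addresses from the table at the top of this tutorial.
NODES = {
    "controller": "192.168.1.75",
    "spark-01": "192.168.1.76",
    "spark-02": "192.168.1.77",
}


def copy_id_commands(nodes, user="nvidia", pubkey="~/.ssh/id_ed25519.pub"):
    """Return one ssh-copy-id command line per node, naming the key explicitly."""
    # pubkey is left unquoted on purpose so the shell can expand the leading ~.
    return [
        f"ssh-copy-id -i {pubkey} {shlex.quote(f'{user}@{host}')}"
        for host in nodes.values()
    ]


for line in copy_id_commands(NODES):
    print(line)
```

The printed lines still need to be pasted into a terminal, since `ssh-copy-id` asks for each node's password interactively.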

### Verify SSH Access

After copying keys, re-run this test. All nodes should return their hostname without password prompts:

In [6]:
%%bash
echo "=== Testing SSH Connectivity ==="
echo ""

echo "Controller (192.168.1.75):"
ssh -o ConnectTimeout=5 -o BatchMode=yes nvidia@192.168.1.75 "hostname" && echo "✓ SUCCESS" || echo "✗ FAILED"
echo ""

echo "Spark-01 (192.168.1.76):"
ssh -o ConnectTimeout=5 -o BatchMode=yes nvidia@192.168.1.76 "hostname" && echo "✓ SUCCESS" || echo "✗ FAILED"
echo ""

echo "Spark-02 (192.168.1.77):"
ssh -o ConnectTimeout=5 -o BatchMode=yes nvidia@192.168.1.77 "hostname" && echo "✓ SUCCESS" || echo "✗ FAILED"

=== Testing SSH Connectivity ===

Controller (192.168.1.75):
controller
✓ SUCCESS


## Step 3: Install MicroK8s on All Nodes

MicroK8s is a lightweight Kubernetes distribution from Canonical. We'll install it on all three nodes, then join the Spark nodes to the controller.

**Architecture:**
- Controller (192.168.1.75): Runs the Kubernetes control plane only
- Spark-01 (192.168.1.76): Worker node with GPU
- Spark-02 (192.168.1.77): Worker node with GPU

### 3.1 Install MicroK8s on the Controller

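The controller runs the control plane (API server, scheduler, controller manager, and the cluster datastore, which in MicroK8s is Dqlite by default rather than etcd). No GPU is needed here.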

In [None]:
%%bash
echo "=== Installing MicroK8s on Controller (192.168.1.75) ==="
ssh nvidia@192.168.1.75 << 'EOF'
    # Install MicroK8s
    sudo snap install microk8s --classic --channel=1.31/stable
    
    # Add user to microk8s group (allows running microk8s without sudo after the next login)
    sudo usermod -a -G microk8s $USER
    sudo chown -R $USER ~/.kube 2>/dev/null || mkdir -p ~/.kube
    
    # Wait for MicroK8s to be ready
    sudo microk8s status --wait-ready
    
    echo ""
    echo "MicroK8s version:"
    sudo microk8s kubectl version
EOF

### 3.2 Install MicroK8s on Spark-01

Spark-01 is a GPU worker node. The installation is identical to the controller's; the node joins the cluster in a later step.

In [None]:
%%bash
echo "=== Installing MicroK8s on Spark-01 (192.168.1.76) ==="
ssh nvidia@192.168.1.76 << 'EOF'
    # Install MicroK8s
    sudo snap install microk8s --classic --channel=1.31/stable
    
    # Add user to microk8s group
    sudo usermod -a -G microk8s $USER
    sudo chown -R $USER ~/.kube 2>/dev/null || mkdir -p ~/.kube
    
    # Wait for MicroK8s to be ready
    sudo microk8s status --wait-ready
    
    echo ""
    echo "MicroK8s version:"
    sudo microk8s kubectl version
EOF

### 3.3 Install MicroK8s on Spark-02

Spark-02 is the second GPU worker node and uses the same installation.

In [None]:
%%bash
echo "=== Installing MicroK8s on Spark-02 (192.168.1.77) ==="
ssh nvidia@192.168.1.77 << 'EOF'
    # Install MicroK8s
    sudo snap install microk8s --classic --channel=1.31/stable
    
    # Add user to microk8s group
    sudo usermod -a -G microk8s $USER
    sudo chown -R $USER ~/.kube 2>/dev/null || mkdir -p ~/.kube
    
    # Wait for MicroK8s to be ready
    sudo microk8s status --wait-ready
    
    echo ""
    echo "MicroK8s version:"
    sudo microk8s kubectl version
EOF
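
The three install cells are identical apart from the target address, so they can be collapsed into one loop. The sketch below is one possible way to do that, not the notebook's own method: `install_script` and `install_on` are hypothetical helpers, the script is piped to `bash -s` on the remote host, and it assumes the same passwordless SSH and sudo setup as the cells above.

```python
import subprocess

# Same steps as the per-node cells; set -e stops at the first failing command.
INSTALL_SCRIPT = """\
set -e
sudo snap install microk8s --classic --channel={channel}
sudo usermod -a -G microk8s "$USER"
mkdir -p ~/.kube
sudo chown -R "$USER" ~/.kube
sudo microk8s status --wait-ready
sudo microk8s kubectl version
"""


def install_script(channel="1.31/stable"):
    """Return the remote install script for the given snap channel."""
    return INSTALL_SCRIPT.format(channel=channel)


def install_on(host, user="nvidia", channel="1.31/stable", run=subprocess.run):
    """Pipe the install script to `bash -s` on the remote host over ssh.

    `run` is injectable so the call can be dry-run or tested without a network.
    """
    return run(
        ["ssh", f"{user}@{host}", "bash", "-s"],
        input=install_script(channel),
        text=True,
        check=False,
    )


# Print the script that would be sent; nothing is executed remotely here.
print(install_script())
```

To perform the installation, call `install_on(host)` for each of `192.168.1.75`, `192.168.1.76`, and `192.168.1.77` and check the `returncode` of each result. Keeping the channel in one place also guarantees that all three nodes install the same MicroK8s version.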