A production-grade web-based simulator for Certified Kubernetes Security Specialist exam preparation. This project demonstrates advanced Kubernetes orchestration, virtualization management, and full-stack development practices through a real-world application that provisions ephemeral kubeadm clusters for hands-on security training.
The CKS Practice Environment is a full-stack application that creates real Kubernetes clusters using KubeVirt virtualization, allowing users to practice CKS exam scenarios in an authentic environment. The system maintains a pool of pre-provisioned clusters, assigns them instantly to users, and resets them using snapshot-based restoration for rapid turnaround.
Backend
- Go 1.24+ with Gin framework for high-performance HTTP handling
- Kubernetes client-go for native API integration
- KubeVirt client for VM lifecycle management
- WebSocket-based terminal multiplexing with persistent SSH connections
- Prometheus metrics and structured logging with request correlation
Frontend
- Next.js 14 with React 18 and App Router
- Context API for state management with custom hooks
- xterm.js for full-featured terminal emulation
- WebSocket protocol for real-time terminal communication
- Tailwind CSS for responsive UI
Infrastructure
- KubeVirt 1.5.0 for VM orchestration on Kubernetes
- Longhorn distributed storage with CDI for disk image management
- Cilium 1.17.3 CNI for cluster networking
- CloudInit for declarative VM initialization
- kubeadm 1.33.0 for cluster bootstrapping
Golden Image Pipeline: Automated image creation using virt-customize with version-controlled dependencies. The build process installs containerd, kubelet, kubeadm, kubectl, and security tooling (kube-bench) on Ubuntu 22.04, creating a repeatable base image uploaded to KubeVirt as a DataVolume.
VM Templates: Control plane and worker node configurations defined as YAML templates with CloudInit. Variable substitution enables dynamic configuration while maintaining infrastructure as code principles.
Resource Quotas: Declarative resource limits per cluster namespace (16 CPU, 16Gi memory, 20 pods) enforced through Kubernetes ResourceQuota objects.
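The quota described above could be expressed roughly as follows (the object name is illustrative, and whether the caps apply to requests or limits is an assumption):

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: cluster-quota        # illustrative name
  namespace: cluster1
spec:
  hard:
    limits.cpu: "16"
    limits.memory: 16Gi
    pods: "20"
```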
Cluster Lifecycle Management: Fully automated bootstrap, snapshot, and restore cycle with no manual intervention. The system handles:
- Sequential cluster provisioning to avoid resource conflicts
- Automatic cleanup of orphaned restore PVCs
- Background maintenance loops for cluster health monitoring
- Graceful degradation when scenario initialization fails
State Persistence: Cluster pool state stored in Kubernetes namespace annotations (cks.io/cluster-status, cks.io/last-reset), ensuring state survives backend restarts without external databases.
Structured Logging: Logrus-based logging with JSON formatting in production. Request ID middleware enables distributed tracing across the request lifecycle, with contextual fields (sessionID, vmName, namespace) for debugging.
Metrics Exposure: Prometheus endpoint (/metrics) tracking:
- Active session counts and lifecycle events
- VM provisioning duration histograms
- Validation execution times
- Terminal connection metrics
Health Endpoints: Kubernetes-ready liveness and readiness probes at /health and /ready.
Retry Logic: Configurable exponential backoff with context cancellation support. Operations retry up to 3 times with 10-second initial delay and 2.0 backoff factor.
Error Handling: Comprehensive error wrapping with context preservation. Cleanup operations continue despite partial failures using defer statements and error aggregation.
Phase 1: Golden Image Creation
The scripts/build-image.sh script uses virt-customize to create a base Ubuntu 22.04 image with:
- Containerd runtime configured with SystemdCgroup driver
- Kubernetes components (kubelet, kubeadm, kubectl v1.33.0) pre-installed
- Kernel modules (br_netfilter, overlay) loaded and persistent
- Swap disabled and networking prerequisites configured
- SSH user "suporte" with passwordless sudo access
The image is served over a local HTTP server and imported into KubeVirt in the kubevirt-os-images namespace.
Phase 2: Cluster Initialization via CloudInit
Control plane VMs execute kubeadm initialization with:
```bash
kubeadm init --config=/etc/kubeadm-config.yaml \
  --cri-socket=unix:///run/containerd/containerd.sock \
  --pod-network-cidr=10.0.0.0/8 \
  --kubernetes-version=1.33.0
```

Post-initialization automation:
- Cilium CNI installation via Helm
- Control plane taint removal (allows pod scheduling)
- kubeconfig configuration for root and suporte users
- Join command generation stored at /etc/kubeadm-join-command
Worker nodes wait for control plane availability on port 6443, then execute the join command retrieved from the control plane VM.
The system achieves sub-3-minute cluster resets using KubeVirt's snapshot functionality:
Snapshot Creation (scripts/snapshot-session.sh):
- Gracefully stop VMs to ensure filesystem consistency
- Create VirtualMachineSnapshot objects for control plane and worker
- Export snapshots to DataVolumes using VirtualMachineExport API
- Store in the vm-templates namespace for reuse
- Restart original VMs
Reset Process (backend/internal/clusterpool/manager.go):
- Stop VMs and delete VirtualMachineInstance objects
- Clean up previous restore PVCs using pattern matching (restore-[uuid]-*)
- Create VirtualMachineRestore objects referencing snapshots
- Wait for restore completion and VM readiness
- Mark cluster as available in pool
This approach provides:
- Cold start (kubeadm): 8-10 minutes
- Pool assignment: <1 second
- Snapshot restore: 2-3 minutes (background)
The cluster pool maintains 3 pre-provisioned clusters (cluster1, cluster2, cluster3) for instant session assignment. Each cluster consists of:
- 1 control plane VM (4 vCPU, 4Gi RAM, 30Gi disk)
- 1 worker node VM (4 vCPU, 4Gi RAM, 30Gi disk)
- Dedicated namespace matching cluster ID
- ResourceQuota for resource isolation
Clusters transition through states:
- available: Ready for assignment
- locked: Assigned to active session
- resetting: Snapshot restoration in progress
- error: Requires manual intervention
State transitions are atomic and persisted to namespace annotations.
```go
func (m *Manager) AssignCluster(sessionID string) (*ClusterPool, error) {
    // Find first available cluster
    // Atomically transition to locked state
    // Update namespace annotation
    // Return cluster details
}
```

On session end, clusters are released and asynchronously reset in the background.
Tasks are defined using a dual-file approach:
tasks/NN-task.md: Human-readable Markdown with structured sections:
- H1: Task title
- H2 sections: Description, Objectives, Step-by-Step Guide, Hints
validation/NN-validation.yaml: Machine-readable validation rules
The scenario loader parses Markdown into structured data (Task objects) while validation rules remain in YAML for declarative specification.
The UnifiedValidator (backend/internal/validation/unified_validator.go) supports multiple validation types:
Resource Validation:
- resource_exists: Verifies Kubernetes resource presence
- resource_property: Validates JSONPath property values with conditions (equals, contains, greater_than, less_than)
Command Validation:
- command: Executes a shell command and validates stdout/stderr
- script: Runs multi-line bash scripts with exit code validation
File Validation:
- file_exists: Checks file presence on the VM filesystem
- file_content: Validates file content with regex or exact matching
Example validation rule:
```yaml
validation:
  - id: pod-security-context
    type: resource_property
    resource:
      kind: Pod
      name: secure-pod
      namespace: default
    property: .spec.securityContext.runAsUser
    condition: equals
    value: "1000"
    errorMessage: "Pod is not running as user 1000"
```

Validations execute on VMs via SSH using virtctl ssh. The validator:
- Establishes SSH connection to control plane or worker
- Executes validation command (kubectl, bash script, file check)
- Parses output and compares against expected values
- Returns structured ValidationResult with expected vs. actual values
Validation commands run without retries: a failed check reflects a genuine configuration issue rather than a transient error, which keeps results accurate.
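The comparison step can be sketched as a small condition evaluator. This is illustrative, not the project's exact implementation; the condition names mirror the rule types listed above:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// evaluate applies a validation condition to the actual value retrieved
// from the VM, returning whether the check passed.
func evaluate(condition, actual, expected string) (bool, error) {
	switch condition {
	case "equals":
		return actual == expected, nil
	case "contains":
		return strings.Contains(actual, expected), nil
	case "greater_than", "less_than":
		a, err1 := strconv.ParseFloat(actual, 64)
		e, err2 := strconv.ParseFloat(expected, 64)
		if err1 != nil || err2 != nil {
			return false, fmt.Errorf("non-numeric comparison: %q vs %q", actual, expected)
		}
		if condition == "greater_than" {
			return a > e, nil
		}
		return a < e, nil
	default:
		return false, fmt.Errorf("unknown condition %q", condition)
	}
}
```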
The admin panel provides cluster pool management through dedicated endpoints:
POST /api/v1/admin/bootstrap-pool
Creates 3 baseline clusters sequentially to avoid resource contention. Each cluster creation:
- Creates dedicated namespace with ResourceQuota
- Provisions control plane VM from golden image
- Waits for VM readiness and kubeadm completion
- Provisions worker node VM
- Waits for worker to join cluster
- Adds cluster to pool with "available" status
Total provisioning time: ~30 minutes (3 clusters × 10 minutes)
POST /api/v1/admin/create-snapshots
Snapshots all pool clusters for reset functionality. For each cluster:
- Gracefully stops VMs
- Creates VirtualMachineSnapshot objects
- Exports to DataVolumes in the vm-templates namespace
- Restarts VMs
- Returns per-cluster success/failure
POST /api/v1/admin/release-all-clusters
Emergency reset of entire pool:
- Releases all locked/error clusters
- Triggers async reset for each cluster
- Returns updated pool status
Useful for bulk maintenance or error recovery.
The terminal system maintains long-lived SSH connections that multiple WebSocket clients can attach to:
```go
type PersistentSSHConnection struct {
    ID          string
    Command     *exec.Cmd  // virtctl ssh process
    PTY         *os.File   // Pseudo-terminal
    ActiveConns int        // Active WebSocket count
    Mutex       sync.Mutex
}
```

Benefits:
- Instant reconnection after page refresh
- State preservation (shell history, working directory, running processes)
- Reduced load on VMs (single SSH session per user)
Terminal IDs follow the pattern {sessionID}-{target}:
- Example: abc12345-control-plane
- Frontend can predict the terminal ID without an API call
- Backend auto-creates terminal sessions on reconnect
- Enables seamless reconnection after network interruption
Browser → WebSocket /api/v1/terminals/:id/attach
↓
HandleTerminal()
↓
GetOrCreatePersistentSSH()
↓
PTY ↔ WebSocket Bridge
↓
xterm.js rendering
The bridge handles:
- Terminal resize events (rows/cols)
- Binary data streaming
- Connection lifecycle (attach/detach)
- Graceful cleanup when ActiveConns reaches 0
Cluster pool state persists in Kubernetes namespace annotations:
```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: cluster1
  annotations:
    cks.io/cluster-status: "available"
    cks.io/last-reset: "2025-01-14T10:30:00Z"
    cks.io/created-at: "2025-01-14T08:00:00Z"
```

This Kubernetes-native approach eliminates the need for external databases while surviving backend restarts.
VM restoration creates PVCs that persist after restore completes. The cleanup algorithm:
- Extracts PVC names from the VirtualMachineRestore.Status.Restores array
- Deletes known restore PVCs
- Scans for orphaned PVCs matching the pattern restore-[uuid]-*
- Removes orphaned PVCs to prevent disk space leaks
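The pattern-matching step can be sketched as a name filter. The exact UUID shape is an assumption based on the restore-[uuid]-* pattern described above:

```go
package main

import "regexp"

// restorePVCPattern matches PVCs left behind by VirtualMachineRestore,
// assuming names of the form restore-<uuid>-<suffix>.
var restorePVCPattern = regexp.MustCompile(
	`^restore-[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}-.+$`)

// orphanedRestorePVCs filters a namespace's PVC names down to restore
// artifacts that are not in the known (still-referenced) set.
func orphanedRestorePVCs(all []string, known map[string]bool) []string {
	var orphans []string
	for _, name := range all {
		if restorePVCPattern.MatchString(name) && !known[name] {
			orphans = append(orphans, name)
		}
	}
	return orphans
}
```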
Custom template engine with environment variable fallback:

```go
func substituteEnvVars(input string, vars map[string]string) string {
    re := regexp.MustCompile(`\${([A-Za-z0-9_]+)}`)
    // Lookup in vars map, then os.Getenv, then leave unchanged
}
```

Enables dynamic CloudInit generation without external dependencies (Helm, Kustomize).
Tasks are authored in Markdown for readability, parsed into structured data:
# Task 1: Configure Pod Security
## Description
Configure a secure pod with proper security context.
## Step-by-Step Guide
1. Create a pod manifest
2. Add security context with runAsUser: 1000

The parser extracts title, description, steps, and hints into Task structs, bridging human-readable documentation with machine-processable validation.
Sophisticated retry mechanism respects context cancellation:
```go
func retryOperation(ctx context.Context, operation func() error) error {
	delay := initialDelay
	var err error
	for attempt := 0; attempt <= maxRetries; attempt++ {
		if err = operation(); err == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err() // abort immediately on cancellation
		case <-time.After(delay):
			delay = time.Duration(float64(delay) * backoff)
		}
	}
	return err
}
```

Waiting inside the select (rather than a plain time.Sleep) means cancellation interrupts the backoff delay itself, so operations abort cleanly when sessions end or timeouts occur.
Clean API design for validation results:
```go
type ValidationResult struct {
    RuleID    string
    RuleType  string
    Passed    bool
    Message   string
    Expected  interface{}
    Actual    interface{}
    ErrorCode string
}
```

Enables detailed frontend feedback showing users exactly what failed and why, with expected vs. actual values for debugging.
cks/
├── backend/
│ ├── cmd/server/ # Main entry point
│ ├── internal/
│ │ ├── config/ # Configuration with env var support
│ │ ├── models/ # Data structures (Session, Task, ValidationRule)
│ │ ├── controllers/ # HTTP handlers (sessions, terminals, admin)
│ │ ├── services/ # Business logic layer
│ │ ├── kubevirt/ # KubeVirt client wrapper
│ │ ├── sessions/ # Session lifecycle management
│ │ ├── terminal/ # Terminal/SSH handling with PTY
│ │ ├── scenarios/ # Scenario loading and parsing
│ │ ├── validation/ # Task validation engine
│ │ ├── clusterpool/ # Cluster pool manager
│ │ └── middleware/ # Request ID, logging, CORS
│ ├── templates/ # CloudInit templates (control-plane, worker)
│ └── scenarios/ # Practice scenarios with tasks and validations
├── frontend/
│ ├── app/ # Next.js App Router pages
│ ├── components/ # React components (Terminal, TaskPanel, Session)
│ ├── contexts/ # State management (SessionContext)
│ ├── hooks/ # Custom hooks (useSession, useTerminal, useValidation)
│ └── lib/ # API client with fetch wrapper
├── docker/ # Dockerfiles for backend and frontend
├── scripts/ # Automation scripts (build-image, snapshot)
└── kubernetes/ # K8s manifests (future deployment configs)
Concurrent Operations: Goroutines for parallel cluster operations with WaitGroups and proper error aggregation. Background maintenance loops with ticker-based scheduling.
Resource Management: defer statements ensure cleanup even during panics. Context propagation enables timeout enforcement across operation chains.
Type Safety: Extensive use of Go interfaces (KubeVirtClient, SessionManager, Validator) enabling dependency injection and testing.
Performance Optimization: Frontend memoization of terminal components prevents unnecessary re-renders. Backend caching of scenario definitions reduces disk I/O.
Security: Session-based access control. Resource isolation via Kubernetes namespaces and ResourceQuotas. No privileged containers or host path mounts.
Built by a DevOps/SRE/Cloud Engineer as a demonstration of Kubernetes expertise, systems programming, and full-stack development capabilities.