A production-grade web-based simulator for Certified Kubernetes Security Specialist exam preparation. This project demonstrates advanced Kubernetes orchestration, virtualization management, and full-stack development practices through a real-world application that provisions ephemeral kubeadm clusters for hands-on security training.
The CKS Practice Environment is a full-stack application that creates real Kubernetes clusters using KubeVirt virtualization, allowing users to practice CKS exam scenarios in an authentic environment. The system maintains a pool of pre-provisioned clusters, assigns them instantly to users, and resets them using snapshot-based restoration for rapid turnaround.
Backend
- Go 1.24+ with Gin framework for high-performance HTTP handling
- Kubernetes client-go for native API integration
- KubeVirt client for VM lifecycle management
- WebSocket-based terminal multiplexing with persistent SSH connections
- Prometheus metrics and structured logging with request correlation
Frontend
- Next.js 14 with React 18 and App Router
- Context API for state management with custom hooks
- xterm.js for full-featured terminal emulation
- WebSocket protocol for real-time terminal communication
- Tailwind CSS for responsive UI
Infrastructure
- KubeVirt 1.5.0 for VM orchestration on Kubernetes
- Longhorn distributed storage with CDI for disk image management
- Cilium 1.17.3 CNI for cluster networking
- CloudInit for declarative VM initialization
- kubeadm 1.33.0 for cluster bootstrapping
Golden Image Pipeline: Automated image creation using virt-customize with version-controlled dependencies. The build process installs containerd, kubelet, kubeadm, kubectl, and security tooling (kube-bench) on Ubuntu 22.04, creating a repeatable base image uploaded to KubeVirt as a DataVolume.
VM Templates: Control plane and worker node configurations defined as YAML templates with CloudInit. Variable substitution enables dynamic configuration while maintaining infrastructure as code principles.
Resource Quotas: Declarative resource limits per cluster namespace (16 CPU, 16Gi memory, 20 pods) enforced through Kubernetes ResourceQuota objects.
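The quota described above could be expressed roughly as follows (the object name is illustrative, and whether the caps apply to requests or limits is an assumption):

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: cluster-quota        # illustrative name
  namespace: cluster1
spec:
  hard:
    limits.cpu: "16"
    limits.memory: 16Gi
    pods: "20"
```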
Cluster Lifecycle Management: Fully automated bootstrap, snapshot, and restore cycle with no manual intervention. The system handles:
- Sequential cluster provisioning to avoid resource conflicts
- Automatic cleanup of orphaned restore PVCs
- Background maintenance loops for cluster health monitoring
- Graceful degradation when scenario initialization fails
State Persistence: Cluster pool state stored in Kubernetes namespace annotations (cks.io/cluster-status, cks.io/last-reset), ensuring state survives backend restarts without external databases.
Structured Logging: Logrus-based logging with JSON formatting in production. Request ID middleware enables distributed tracing across the request lifecycle, with contextual fields (sessionID, vmName, namespace) for debugging.
Metrics Exposure: Prometheus endpoint (/metrics) tracking:
- Active session counts and lifecycle events
- VM provisioning duration histograms
- Validation execution times
- Terminal connection metrics
Health Endpoints: Kubernetes-ready liveness and readiness probes at /health and /ready.
Retry Logic: Configurable exponential backoff with context cancellation support. Operations retry up to 3 times with 10-second initial delay and 2.0 backoff factor.
Error Handling: Comprehensive error wrapping with context preservation. Cleanup operations continue despite partial failures using defer statements and error aggregation.
Phase 1: Golden Image Creation
The scripts/build-image.sh script uses virt-customize to create a base Ubuntu 22.04 image with:
- Containerd runtime configured with SystemdCgroup driver
- Kubernetes components (kubelet, kubeadm, kubectl v1.33.0) pre-installed
- Kernel modules (br_netfilter, overlay) loaded and persistent
- Swap disabled and networking prerequisites configured
- SSH user "suporte" with passwordless sudo access
The image is served over a local HTTP server and imported into KubeVirt in the kubevirt-os-images namespace.
Phase 2: Cluster Initialization via CloudInit
Control plane VMs execute kubeadm initialization with:
```bash
kubeadm init --config=/etc/kubeadm-config.yaml \
  --cri-socket=unix:///run/containerd/containerd.sock \
  --pod-network-cidr=10.0.0.0/8 \
  --kubernetes-version=1.33.0
```

Post-initialization automation:
- Cilium CNI installation via Helm
- Control plane taint removal (allows pod scheduling)
- kubeconfig configuration for root and suporte users
- Join command generation stored at /etc/kubeadm-join-command
Worker nodes wait for control plane availability on port 6443, then execute the join command retrieved from the control plane VM.
The system achieves sub-3-minute cluster resets using KubeVirt's snapshot functionality:
Snapshot Creation (scripts/snapshot-session.sh):
- Gracefully stop VMs to ensure filesystem consistency
- Create VirtualMachineSnapshot objects for control plane and worker
- Export snapshots to DataVolumes using VirtualMachineExport API
- Store in the vm-templates namespace for reuse
- Restart original VMs
Reset Process (backend/internal/clusterpool/manager.go):
- Stop VMs and delete VirtualMachineInstance objects
- Clean up previous restore PVCs using pattern matching (restore-[uuid]-*)
- Create VirtualMachineRestore objects referencing snapshots
- Wait for restore completion and VM readiness
- Mark cluster as available in pool
This approach provides:
- Cold start (kubeadm): 8-10 minutes
- Pool assignment: <1 second
- Snapshot restore: 2-3 minutes (background)
The cluster pool maintains 3 pre-provisioned clusters (cluster1, cluster2, cluster3) for instant session assignment. Each cluster consists of:
- 1 control plane VM (4 vCPU, 4Gi RAM, 30Gi disk)
- 1 worker node VM (4 vCPU, 4Gi RAM, 30Gi disk)
- Dedicated namespace matching cluster ID
- ResourceQuota for resource isolation
Clusters transition through states:
- available: Ready for assignment
- locked: Assigned to active session
- resetting: Snapshot restoration in progress
- error: Requires manual intervention
State transitions are atomic and persisted to namespace annotations.
```go
func (m *Manager) AssignCluster(sessionID string) (*ClusterPool, error) {
    // Find first available cluster
    // Atomically transition to locked state
    // Update namespace annotation
    // Return cluster details
}
```

On session end, clusters are released and asynchronously reset in the background.
Tasks are defined using a dual-file approach:
tasks/NN-task.md: Human-readable Markdown with structured sections:
- H1: Task title
- H2 sections: Description, Objectives, Step-by-Step Guide, Hints
validation/NN-validation.yaml: Machine-readable validation rules
The scenario loader parses Markdown into structured data (Task objects) while validation rules remain in YAML for declarative specification.
The UnifiedValidator (backend/internal/validation/unified_validator.go) supports multiple validation types:
Resource Validation:
- resource_exists: Verifies Kubernetes resource presence
- resource_property: Validates JSONPath property values with conditions (equals, contains, greater_than, less_than)
Command Validation:
- command: Executes a shell command and validates stdout/stderr
- script: Runs multi-line bash scripts with exit code validation
File Validation:
- file_exists: Checks file presence on the VM filesystem
- file_content: Validates file content with regex or exact matching
Example validation rule:
```yaml
validation:
  - id: pod-security-context
    type: resource_property
    resource:
      kind: Pod
      name: secure-pod
      namespace: default
    property: .spec.securityContext.runAsUser
    condition: equals
    value: "1000"
    errorMessage: "Pod is not running as user 1000"
```

Validations execute on VMs via SSH using virtctl ssh. The validator:
- Establishes SSH connection to control plane or worker
- Executes validation command (kubectl, bash script, file check)
- Parses output and compares against expected values
- Returns structured ValidationResult with expected vs. actual values
Validation commands run without retries: a failed check reflects a genuine configuration issue rather than a transient error, which keeps results accurate.
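The comparison step can be sketched as a small condition evaluator. This is illustrative, not the project's exact implementation; the condition names mirror the rule types listed above:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// evaluate applies a validation condition to the actual value retrieved
// from the VM, returning whether the check passed.
func evaluate(condition, actual, expected string) (bool, error) {
	switch condition {
	case "equals":
		return actual == expected, nil
	case "contains":
		return strings.Contains(actual, expected), nil
	case "greater_than", "less_than":
		a, err1 := strconv.ParseFloat(actual, 64)
		e, err2 := strconv.ParseFloat(expected, 64)
		if err1 != nil || err2 != nil {
			return false, fmt.Errorf("non-numeric comparison: %q vs %q", actual, expected)
		}
		if condition == "greater_than" {
			return a > e, nil
		}
		return a < e, nil
	default:
		return false, fmt.Errorf("unknown condition %q", condition)
	}
}
```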
The admin panel provides cluster pool management through dedicated endpoints:
POST /api/v1/admin/bootstrap-pool
Creates 3 baseline clusters sequentially to avoid resource contention. Each cluster creation:
- Creates dedicated namespace with ResourceQuota
- Provisions control plane VM from golden image
- Waits for VM readiness and kubeadm completion
- Provisions worker node VM
- Waits for worker to join cluster
- Adds cluster to pool with "available" status
Total provisioning time: ~30 minutes (3 clusters × 10 minutes)
POST /api/v1/admin/create-snapshots
Snapshots all pool clusters for reset functionality. For each cluster:
- Gracefully stops VMs
- Creates VirtualMachineSnapshot objects
- Exports to DataVolumes in the vm-templates namespace
- Restarts VMs
- Returns per-cluster success/failure
POST /api/v1/admin/release-all-clusters
Emergency reset of entire pool:
- Releases all locked/error clusters
- Triggers async reset for each cluster
- Returns updated pool status
Useful for bulk maintenance or error recovery.
The terminal system maintains long-lived SSH connections that multiple WebSocket clients can attach to:
```go
type PersistentSSHConnection struct {
    ID          string
    Command     *exec.Cmd  // virtctl ssh process
    PTY         *os.File   // Pseudo-terminal
    ActiveConns int        // Active WebSocket count
    Mutex       sync.Mutex
}
```

Benefits:
- Instant reconnection after page refresh
- State preservation (shell history, working directory, running processes)
- Reduced load on VMs (single SSH session per user)
Terminal IDs follow the pattern {sessionID}-{target}:
- Example: abc12345-control-plane
- Frontend can predict the terminal ID without an API call
- Backend auto-creates terminal sessions on reconnect
- Enables seamless reconnection after network interruption
Browser → WebSocket /api/v1/terminals/:id/attach
↓
HandleTerminal()
↓
GetOrCreatePersistentSSH()
↓
PTY ↔ WebSocket Bridge
↓
xterm.js rendering
The bridge handles:
- Terminal resize events (rows/cols)
- Binary data streaming
- Connection lifecycle (attach/detach)
- Graceful cleanup when ActiveConns reaches 0
Cluster pool state persists in Kubernetes namespace annotations:
```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: cluster1
  annotations:
    cks.io/cluster-status: "available"
    cks.io/last-reset: "2025-01-14T10:30:00Z"
    cks.io/created-at: "2025-01-14T08:00:00Z"
```

This Kubernetes-native approach eliminates the need for external databases while surviving backend restarts.
VM restoration creates PVCs that persist after restore completes. The cleanup algorithm:
- Extracts PVC names from the VirtualMachineRestore.Status.Restores array
- Deletes known restore PVCs
- Scans for orphaned PVCs matching the pattern restore-[uuid]-*
- Removes orphaned PVCs to prevent disk space leaks
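The pattern-matching step can be sketched as a name filter. The exact UUID shape is an assumption based on the restore-[uuid]-* pattern described above:

```go
package main

import "regexp"

// restorePVCPattern matches PVCs left behind by VirtualMachineRestore,
// assuming names of the form restore-<uuid>-<suffix>.
var restorePVCPattern = regexp.MustCompile(
	`^restore-[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}-.+$`)

// orphanedRestorePVCs filters a namespace's PVC names down to restore
// artifacts that are not in the known (still-referenced) set.
func orphanedRestorePVCs(all []string, known map[string]bool) []string {
	var orphans []string
	for _, name := range all {
		if restorePVCPattern.MatchString(name) && !known[name] {
			orphans = append(orphans, name)
		}
	}
	return orphans
}
```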
Custom template engine with environment variable fallback:

```go
func substituteEnvVars(input string, vars map[string]string) string {
    re := regexp.MustCompile(`\${([A-Za-z0-9_]+)}`)
    // Lookup in vars map, then os.Getenv, then leave unchanged
}
```

Enables dynamic CloudInit generation without external dependencies (Helm, Kustomize).
Tasks are authored in Markdown for readability, parsed into structured data:
# Task 1: Configure Pod Security
## Description
Configure a secure pod with proper security context.
## Step-by-Step Guide
1. Create a pod manifest
2. Add security context with runAsUser: 1000

The parser extracts title, description, steps, and hints into Task structs, bridging human-readable documentation with machine-processable validation.
Sophisticated retry mechanism respects context cancellation:
```go
func retryOperation(ctx context.Context, operation func() error) error {
	delay := initialDelay
	var err error
	for attempt := 0; attempt <= maxRetries; attempt++ {
		if err = operation(); err == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err() // abort immediately on cancellation
		case <-time.After(delay):
			delay = time.Duration(float64(delay) * backoff)
		}
	}
	return err
}
```

Waiting inside the select (rather than a plain time.Sleep) means cancellation interrupts the backoff delay itself, so operations abort cleanly when sessions end or timeouts occur.
Clean API design for validation results:
```go
type ValidationResult struct {
    RuleID    string
    RuleType  string
    Passed    bool
    Message   string
    Expected  interface{}
    Actual    interface{}
    ErrorCode string
}
```

Enables detailed frontend feedback showing users exactly what failed and why, with expected vs. actual values for debugging.
cks/
├── backend/
│ ├── cmd/server/ # Main entry point
│ ├── internal/
│ │ ├── config/ # Configuration with env var support
│ │ ├── models/ # Data structures (Session, Task, ValidationRule)
│ │ ├── controllers/ # HTTP handlers (sessions, terminals, admin)
│ │ ├── services/ # Business logic layer
│ │ ├── kubevirt/ # KubeVirt client wrapper
│ │ ├── sessions/ # Session lifecycle management
│ │ ├── terminal/ # Terminal/SSH handling with PTY
│ │ ├── scenarios/ # Scenario loading and parsing
│ │ ├── validation/ # Task validation engine
│ │ ├── clusterpool/ # Cluster pool manager
│ │ └── middleware/ # Request ID, logging, CORS
│ ├── templates/ # CloudInit templates (control-plane, worker)
│ └── scenarios/ # Practice scenarios with tasks and validations
├── frontend/
│ ├── app/ # Next.js App Router pages
│ ├── components/ # React components (Terminal, TaskPanel, Session)
│ ├── contexts/ # State management (SessionContext)
│ ├── hooks/ # Custom hooks (useSession, useTerminal, useValidation)
│ └── lib/ # API client with fetch wrapper
├── docker/ # Dockerfiles for backend and frontend
├── scripts/ # Automation scripts (build-image, snapshot)
└── kubernetes/ # K8s manifests (future deployment configs)
Concurrent Operations: Goroutines for parallel cluster operations with WaitGroups and proper error aggregation. Background maintenance loops with ticker-based scheduling.
Resource Management: defer statements ensure cleanup even during panics. Context propagation enables timeout enforcement across operation chains.
Type Safety: Extensive use of Go interfaces (KubeVirtClient, SessionManager, Validator) enabling dependency injection and testing.
Performance Optimization: Frontend memoization of terminal components prevents unnecessary re-renders. Backend caching of scenario definitions reduces disk I/O.
Security: Session-based access control. Resource isolation via Kubernetes namespaces and ResourceQuotas. No privileged containers or host path mounts.
Built by a DevOps/SRE/Cloud Engineer as a demonstration of Kubernetes expertise, systems programming, and full-stack development capabilities.