This repository was archived by the owner on Nov 17, 2025. It is now read-only.

fullstack-pw/cks


CKS Practice Environment

A production-grade web-based simulator for Certified Kubernetes Security Specialist exam preparation. This project demonstrates advanced Kubernetes orchestration, virtualization management, and full-stack development practices through a real-world application that provisions ephemeral kubeadm clusters for hands-on security training.

Overview

The CKS Practice Environment is a full-stack application that creates real Kubernetes clusters using KubeVirt virtualization, allowing users to practice CKS exam scenarios in an authentic environment. The system maintains a pool of pre-provisioned clusters, assigns them instantly to users, and resets them using snapshot-based restoration for rapid turnaround.

Architecture

Technology Stack

Backend

  • Go 1.24+ with Gin framework for high-performance HTTP handling
  • Kubernetes client-go for native API integration
  • KubeVirt client for VM lifecycle management
  • WebSocket-based terminal multiplexing with persistent SSH connections
  • Prometheus metrics and structured logging with request correlation

Frontend

  • Next.js 14 with React 18 and App Router
  • Context API for state management with custom hooks
  • xterm.js for full-featured terminal emulation
  • WebSocket protocol for real-time terminal communication
  • Tailwind CSS for responsive UI

Infrastructure

  • KubeVirt 1.5.0 for VM orchestration on Kubernetes
  • Longhorn distributed storage with CDI for disk image management
  • Cilium 1.17.3 CNI for cluster networking
  • CloudInit for declarative VM initialization
  • kubeadm 1.33.0 for cluster bootstrapping

DevOps Practices

Infrastructure as Code

Golden Image Pipeline: Automated image creation using virt-customize with version-controlled dependencies. The build process installs containerd, kubelet, kubeadm, kubectl, and security tooling (kube-bench) on Ubuntu 22.04, creating a repeatable base image uploaded to KubeVirt as a DataVolume.

VM Templates: Control plane and worker node configurations defined as YAML templates with CloudInit. Variable substitution enables dynamic configuration while maintaining infrastructure as code principles.

Resource Quotas: Declarative resource limits per cluster namespace (16 CPU, 16Gi memory, 20 pods) enforced through Kubernetes ResourceQuota objects.
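
A quota with those limits corresponds to a manifest of roughly this shape (a sketch: the object name `cluster-quota` is illustrative; the values come from the text above):

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: cluster-quota
  namespace: cluster1
spec:
  hard:
    cpu: "16"      # total CPU across the namespace
    memory: 16Gi   # total memory across the namespace
    pods: "20"     # maximum pod count
```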

Automation

Cluster Lifecycle Management: Fully automated bootstrap, snapshot, and restore cycle with no manual intervention. The system handles:

  • Sequential cluster provisioning to avoid resource conflicts
  • Automatic cleanup of orphaned restore PVCs
  • Background maintenance loops for cluster health monitoring
  • Graceful degradation when scenario initialization fails

State Persistence: Cluster pool state stored in Kubernetes namespace annotations (cks.io/cluster-status, cks.io/last-reset), ensuring state survives backend restarts without external databases.

Observability

Structured Logging: Logrus-based logging with JSON formatting in production. Request ID middleware enables distributed tracing across the request lifecycle, with contextual fields (sessionID, vmName, namespace) for debugging.

Metrics Exposure: Prometheus endpoint (/metrics) tracking:

  • Active session counts and lifecycle events
  • VM provisioning duration histograms
  • Validation execution times
  • Terminal connection metrics

Health Endpoints: Kubernetes-ready liveness and readiness probes at /health and /ready.

Resilience

Retry Logic: Configurable exponential backoff with context cancellation support. Operations retry up to 3 times with 10-second initial delay and 2.0 backoff factor.

Error Handling: Comprehensive error wrapping with context preservation. Cleanup operations continue despite partial failures using defer statements and error aggregation.

Cluster Bootstrap and Reset

Bootstrap Process

Phase 1: Golden Image Creation

The scripts/build-image.sh script uses virt-customize to create a base Ubuntu 22.04 image with:

  • Containerd runtime configured with SystemdCgroup driver
  • Kubernetes components (kubelet, kubeadm, kubectl v1.33.0) pre-installed
  • Kernel modules (br_netfilter, overlay) loaded and persistent
  • Swap disabled and networking prerequisites configured
  • SSH user "suporte" with passwordless sudo access

The image is then served over HTTP and imported into KubeVirt as a DataVolume in the kubevirt-os-images namespace.

Phase 2: Cluster Initialization via CloudInit

Control plane VMs execute kubeadm initialization with:

```bash
kubeadm init --config=/etc/kubeadm-config.yaml \
  --cri-socket=unix:///run/containerd/containerd.sock \
  --pod-network-cidr=10.0.0.0/8 \
  --kubernetes-version=1.33.0
```

Post-initialization automation:

  • Cilium CNI installation via Helm
  • Control plane taint removal (allows pod scheduling)
  • kubeconfig configuration for root and suporte users
  • Join command generation stored at /etc/kubeadm-join-command

Worker nodes wait for control plane availability on port 6443, then execute the join command retrieved from the control plane VM.

Snapshot-Based Reset

The system achieves sub-3-minute cluster resets using KubeVirt's snapshot functionality:

Snapshot Creation (scripts/snapshot-session.sh):

  1. Gracefully stop VMs to ensure filesystem consistency
  2. Create VirtualMachineSnapshot objects for control plane and worker
  3. Export snapshots to DataVolumes using VirtualMachineExport API
  4. Store in vm-templates namespace for reuse
  5. Restart original VMs

Reset Process (backend/internal/clusterpool/manager.go):

  1. Stop VMs and delete VirtualMachineInstance objects
  2. Clean up previous restore PVCs using pattern matching (restore-[uuid]-*)
  3. Create VirtualMachineRestore objects referencing snapshots
  4. Wait for restore completion and VM readiness
  5. Mark cluster as available in pool

This approach provides:

  • Cold start (kubeadm): 8-10 minutes
  • Pool assignment: <1 second
  • Snapshot restore: 2-3 minutes (background)

Cluster Pool Management

Architecture

The cluster pool maintains 3 pre-provisioned clusters (cluster1, cluster2, cluster3) for instant session assignment. Each cluster consists of:

  • 1 control plane VM (4 vCPU, 4Gi RAM, 30Gi disk)
  • 1 worker node VM (4 vCPU, 4Gi RAM, 30Gi disk)
  • Dedicated namespace matching cluster ID
  • ResourceQuota for resource isolation

State Machine

Clusters transition through states:

  • available: Ready for assignment
  • locked: Assigned to active session
  • resetting: Snapshot restoration in progress
  • error: Requires manual intervention

State transitions are atomic and persisted to namespace annotations.
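
Guarding those transitions before they are persisted can be sketched as a small transition table (a plausible reconstruction; the exact set the manager enforces may differ):

```go
package main

import "fmt"

// Cluster pool states as described above.
type State string

const (
    Available State = "available"
    Locked    State = "locked"
    Resetting State = "resetting"
    Error     State = "error"
)

// allowed maps each state to the states it may move to.
var allowed = map[State][]State{
    Available: {Locked},           // session assignment
    Locked:    {Resetting, Error}, // session end, or failure
    Resetting: {Available, Error}, // restore finished, or failed
    Error:     {Resetting},        // manual recovery
}

// Transition validates a state change before it is written back to the
// namespace annotation.
func Transition(from, to State) error {
    for _, s := range allowed[from] {
        if s == to {
            return nil
        }
    }
    return fmt.Errorf("illegal transition %s -> %s", from, to)
}

func main() {
    fmt.Println(Transition(Available, Locked))           // <nil>
    fmt.Println(Transition(Available, Resetting) != nil) // true
}
```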

Assignment Algorithm

```go
func (m *Manager) AssignCluster(sessionID string) (*ClusterPool, error) {
    // Find first available cluster
    // Atomically transition to locked state
    // Update namespace annotation
    // Return cluster details
}
```

On session end, clusters are released and asynchronously reset in the background.

Task Configuration and Validation

Task Structure

Tasks are defined using a dual-file approach:

tasks/NN-task.md: Human-readable Markdown with structured sections:

  • H1: Task title
  • H2 sections: Description, Objectives, Step-by-Step Guide, Hints

validation/NN-validation.yaml: Machine-readable validation rules

The scenario loader parses Markdown into structured data (Task objects) while validation rules remain in YAML for declarative specification.

Validation Engine

The UnifiedValidator (backend/internal/validation/unified_validator.go) supports multiple validation types:

Resource Validation:

  • resource_exists: Verifies Kubernetes resource presence
  • resource_property: Validates JSONPath property values with conditions (equals, contains, greater_than, less_than)

Command Validation:

  • command: Executes shell command and validates stdout/stderr
  • script: Runs multi-line bash scripts with exit code validation

File Validation:

  • file_exists: Checks file presence on VM filesystem
  • file_content: Validates file content with regex or exact matching

Example validation rule:

```yaml
validation:
  - id: pod-security-context
    type: resource_property
    resource:
      kind: Pod
      name: secure-pod
      namespace: default
      property: .spec.securityContext.runAsUser
    condition: equals
    value: "1000"
    errorMessage: "Pod is not running as user 1000"
```
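
Evaluating a rule's condition against the value read from the cluster can be sketched as follows (illustrative; the real logic lives in unified_validator.go, and values arrive as strings from kubectl output):

```go
package main

import (
    "fmt"
    "strconv"
    "strings"
)

// evaluate applies one validation condition to the actual value read
// from the cluster. Supported conditions mirror the list above:
// equals, contains, greater_than, less_than.
func evaluate(condition, actual, expected string) (bool, error) {
    switch condition {
    case "equals":
        return actual == expected, nil
    case "contains":
        return strings.Contains(actual, expected), nil
    case "greater_than", "less_than":
        a, err1 := strconv.ParseFloat(actual, 64)
        e, err2 := strconv.ParseFloat(expected, 64)
        if err1 != nil || err2 != nil {
            return false, fmt.Errorf("non-numeric comparison: %q vs %q", actual, expected)
        }
        if condition == "greater_than" {
            return a > e, nil
        }
        return a < e, nil
    default:
        return false, fmt.Errorf("unknown condition %q", condition)
    }
}

func main() {
    ok, _ := evaluate("equals", "1000", "1000")
    fmt.Println(ok) // true: the runAsUser check in the example above passes
}
```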

Execution Model

Validations execute on VMs via SSH using virtctl ssh. The validator:

  1. Establishes SSH connection to control plane or worker
  2. Executes validation command (kubectl, bash script, file check)
  3. Parses output and compares against expected values
  4. Returns structured ValidationResult with expected vs. actual values

Validation commands are deliberately not retried: a failed check reflects a genuine configuration issue rather than a transient error, which keeps results accurate.

Admin Panel

The admin panel provides cluster pool management through dedicated endpoints:

Bootstrap Pool

POST /api/v1/admin/bootstrap-pool

Creates 3 baseline clusters sequentially to avoid resource contention. Each cluster creation:

  1. Creates dedicated namespace with ResourceQuota
  2. Provisions control plane VM from golden image
  3. Waits for VM readiness and kubeadm completion
  4. Provisions worker node VM
  5. Waits for worker to join cluster
  6. Adds cluster to pool with "available" status

Total provisioning time: ~30 minutes (3 clusters × 10 minutes)

Create Snapshots

POST /api/v1/admin/create-snapshots

Snapshots all pool clusters for reset functionality. For each cluster:

  1. Gracefully stops VMs
  2. Creates VirtualMachineSnapshot objects
  3. Exports to DataVolumes in vm-templates namespace
  4. Restarts VMs
  5. Returns per-cluster success/failure

Release All Clusters

POST /api/v1/admin/release-all-clusters

Emergency reset of entire pool:

  1. Releases all locked/error clusters
  2. Triggers async reset for each cluster
  3. Returns updated pool status

Useful for bulk maintenance or error recovery.

Terminal Architecture

Persistent SSH Connections

The terminal system maintains long-lived SSH connections that multiple WebSocket clients can attach to:

```go
type PersistentSSHConnection struct {
    ID          string
    Command     *exec.Cmd      // virtctl ssh process
    PTY         *os.File       // Pseudo-terminal
    ActiveConns int            // Active WebSocket count
    Mutex       sync.Mutex
}
```

Benefits:

  • Instant reconnection after page refresh
  • State preservation (shell history, working directory, running processes)
  • Reduced load on VMs (single SSH session per user)

Deterministic Terminal IDs

Terminal IDs follow the pattern {sessionID}-{target}:

  • Example: abc12345-control-plane
  • Frontend can predict terminal ID without API call
  • Backend auto-creates terminal sessions on reconnect
  • Enables seamless reconnection after network interruption

WebSocket Protocol

```text
Browser → WebSocket /api/v1/terminals/:id/attach
              ↓
       HandleTerminal()
              ↓
       GetOrCreatePersistentSSH()
              ↓
       PTY ↔ WebSocket Bridge
              ↓
       xterm.js rendering
```

The bridge handles:

  • Terminal resize events (rows/cols)
  • Binary data streaming
  • Connection lifecycle (attach/detach)
  • Graceful cleanup when ActiveConns reaches 0

Unique Implementation Details

Namespace Annotations for State

Cluster pool state persists in Kubernetes namespace annotations:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: cluster1
  annotations:
    cks.io/cluster-status: "available"
    cks.io/last-reset: "2025-01-14T10:30:00Z"
    cks.io/created-at: "2025-01-14T08:00:00Z"
```

This Kubernetes-native approach eliminates the need for an external database and lets pool state survive backend restarts.

Smart Cleanup of Restore PVCs

VM restoration creates PVCs that persist after restore completes. The cleanup algorithm:

  1. Extracts PVC names from VirtualMachineRestore.Status.Restores array
  2. Deletes known restore PVCs
  3. Scans for orphaned PVCs matching pattern restore-[uuid]-*
  4. Removes orphaned PVCs to prevent disk space leaks
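
The orphan scan in step 3 can be sketched as a name-pattern check (illustrative; the exact PVC naming convention may differ from this reconstruction):

```go
package main

import (
    "fmt"
    "regexp"
)

// restorePVCPattern matches PVCs left behind by VirtualMachineRestore,
// assumed here to be named restore-<uuid>-<suffix>.
var restorePVCPattern = regexp.MustCompile(
    `^restore-[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}-.+$`)

// isRestorePVC reports whether a PVC name looks like a restore leftover
// and is therefore a candidate for cleanup.
func isRestorePVC(name string) bool {
    return restorePVCPattern.MatchString(name)
}

func main() {
    fmt.Println(isRestorePVC("restore-123e4567-e89b-12d3-a456-426614174000-rootdisk")) // true
    fmt.Println(isRestorePVC("cluster1-rootdisk"))                                     // false
}
```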

CloudInit Variable Substitution

Custom template engine with environment variable fallback:

```go
func substituteEnvVars(input string, vars map[string]string) string {
    re := regexp.MustCompile(`\${([A-Za-z0-9_]+)}`)
    return re.ReplaceAllStringFunc(input, func(m string) string {
        name := m[2 : len(m)-1] // strip "${" and "}"
        if v, ok := vars[name]; ok { return v }       // 1. explicit vars
        if v := os.Getenv(name); v != "" { return v } // 2. environment
        return m                                      // 3. leave unchanged
    })
}
```

Enables dynamic CloudInit generation without external dependencies (Helm, Kustomize).

Scenario Loader with Markdown Parsing

Tasks are authored in Markdown for readability, parsed into structured data:

```markdown
# Task 1: Configure Pod Security

## Description
Configure a secure pod with proper security context.

## Step-by-Step Guide
1. Create a pod manifest
2. Add security context with runAsUser: 1000
```

The parser extracts title, description, steps, and hints into Task structs, bridging human-readable documentation with machine-processable validation.
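
A minimal sketch of that parsing (illustrative; the real loader also splits steps and hints into dedicated fields of the Task struct):

```go
package main

import (
    "fmt"
    "strings"
)

// parseTask extracts the H1 title and the body of each H2 section from
// a task Markdown file.
func parseTask(md string) (title string, sections map[string]string) {
    sections = map[string]string{}
    current := ""
    for _, line := range strings.Split(md, "\n") {
        switch {
        case strings.HasPrefix(line, "# "):
            title = strings.TrimPrefix(line, "# ")
        case strings.HasPrefix(line, "## "):
            current = strings.TrimPrefix(line, "## ")
        case current != "":
            sections[current] += line + "\n"
        }
    }
    for k, v := range sections {
        sections[k] = strings.TrimSpace(v)
    }
    return title, sections
}

func main() {
    md := "# Task 1: Configure Pod Security\n\n## Description\nConfigure a secure pod.\n"
    title, secs := parseTask(md)
    fmt.Println(title)               // Task 1: Configure Pod Security
    fmt.Println(secs["Description"]) // Configure a secure pod.
}
```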

Retry Logic with Context Cancellation

Sophisticated retry mechanism respects context cancellation:

```go
func retryOperation(ctx context.Context, operation func() error) error {
    const maxRetries = 3
    delay := 10 * time.Second
    var err error
    for attempt := 0; attempt <= maxRetries; attempt++ {
        if err = operation(); err == nil {
            return nil
        }
        select {
        case <-ctx.Done():
            return ctx.Err() // abort cleanly if the session ends
        case <-time.After(delay):
            delay *= 2 // backoff factor 2.0
        }
    }
    return err
}
```

Ensures operations abort cleanly when sessions end or timeouts occur.

Structured Validation Responses

Clean API design for validation results:

```go
type ValidationResult struct {
    RuleID    string
    RuleType  string
    Passed    bool
    Message   string
    Expected  interface{}
    Actual    interface{}
    ErrorCode string
}
```

Enables detailed frontend feedback showing users exactly what failed and why, with expected vs. actual values for debugging.

Project Structure

```text
cks/
├── backend/
│   ├── cmd/server/              # Main entry point
│   ├── internal/
│   │   ├── config/              # Configuration with env var support
│   │   ├── models/              # Data structures (Session, Task, ValidationRule)
│   │   ├── controllers/         # HTTP handlers (sessions, terminals, admin)
│   │   ├── services/            # Business logic layer
│   │   ├── kubevirt/            # KubeVirt client wrapper
│   │   ├── sessions/            # Session lifecycle management
│   │   ├── terminal/            # Terminal/SSH handling with PTY
│   │   ├── scenarios/           # Scenario loading and parsing
│   │   ├── validation/          # Task validation engine
│   │   ├── clusterpool/         # Cluster pool manager
│   │   └── middleware/          # Request ID, logging, CORS
│   ├── templates/               # CloudInit templates (control-plane, worker)
│   └── scenarios/               # Practice scenarios with tasks and validations
├── frontend/
│   ├── app/                     # Next.js App Router pages
│   ├── components/              # React components (Terminal, TaskPanel, Session)
│   ├── contexts/                # State management (SessionContext)
│   ├── hooks/                   # Custom hooks (useSession, useTerminal, useValidation)
│   └── lib/                     # API client with fetch wrapper
├── docker/                      # Dockerfiles for backend and frontend
├── scripts/                     # Automation scripts (build-image, snapshot)
└── kubernetes/                  # K8s manifests (future deployment configs)
```

Technical Highlights

Concurrent Operations: Goroutines for parallel cluster operations with WaitGroups and proper error aggregation. Background maintenance loops with ticker-based scheduling.

Resource Management: defer statements ensure cleanup even during panics. Context propagation enables timeout enforcement across operation chains.

Type Safety: Extensive use of Go interfaces (KubeVirtClient, SessionManager, Validator) enabling dependency injection and testing.

Performance Optimization: Frontend memoization of terminal components prevents unnecessary re-renders. Backend caching of scenario definitions reduces disk I/O.

Security: Session-based access control. Resource isolation via Kubernetes namespaces and ResourceQuotas. No privileged containers or host path mounts.


Built by a DevOps/SRE/Cloud Engineer as a demonstration of Kubernetes expertise, systems programming, and full-stack development capabilities.
