Skip to content

fengcone/fast-sandbox

Repository files navigation

Fast Sandbox

Fast Sandbox is a high-performance, cloud-native (Kubernetes-native) sandbox management system designed to provide millisecond-scale cold container startup and controlled self-healing capabilities for AI Agents, Serverless functions, and compute-intensive tasks.

By pre-warming "Agent Pod" resource pools and directly integrating with host-level container management, Fast Sandbox bypasses the significant overhead of traditional Kubernetes Pod creation, achieving ultra-fast task distribution with physical isolation.

Features

  • Fast-Path API: gRPC-based fast path supporting <50ms end-to-end startup latency. Dual-mode switching between Fast Mode (Agent-First, ultra-fast) and Strong Mode (CRD-First, strong consistency). image
  • Developer CLI (fsb-ctl): Docker-like command-line experience with interactive creation, configuration management, streaming log viewing (logs -f), and status queries.
  • Zero-Pull Startup: Leverages Host Containerd Integration to launch microcontainers directly on the host, reusing node image cache.
  • Smart Scheduling: Allocation algorithm based on Image Affinity and Atomic Slots, eliminating image pull latency and avoiding port conflicts.
  • Resilient Design:
    • Controlled Self-Healing: Supports AutoRecreate policy and manual resetRevision.
    • Graceful Shutdown: Complete SIGTERM → SIGKILL flow preventing zombie processes.
    • Node Janitor: Independent DaemonSet for automatic orphan container and file cleanup.

Architecture

The system uses a "centralized control plane decision, extreme data plane execution" architecture: ARCHITECTURE

Control Plane

  • Fast-Path Server (gRPC): Handles high-concurrency sandbox create/delete requests, direct CLI access
    • Port: 9090
    • Services: CreateSandbox, DeleteSandbox, UpdateSandbox, ListSandboxes, GetSandbox
  • SandboxController: Manages CRD state machine, Finalizer resource cleanup, and dual-mode consistency coordination
  • SandboxPoolController: Manages Agent Pod resource pools (Min/Max capacity)
  • Atomic Registry: In-memory state center supporting high-concurrency mutex allocation and image weight scoring

Data Plane (Agent)

  • Privileged Pods running on hosts, communicating via HTTP with the control plane
  • Runtime Integration: Direct Containerd Socket access for container lifecycle and log persistence
  • HTTP Server: Listens on port 5758
    • POST /api/v1/agent/create - Create sandbox
    • POST /api/v1/agent/delete - Delete sandbox
    • GET /api/v1/agent/status - Get agent status
    • GET /api/v1/agent/logs?follow=true - Stream logs

Tooling

  • fsb-ctl: Developer CLI with run, list, get, logs, delete commands

Quick Start

1. Install CLI

make build
# Generates bin/fsb-ctl
export PATH=$PWD/bin:$PATH

2. Create a Sandbox (Interactive)

fsb-ctl run my-sandbox
# Opens editor for configuration (image, ports, command, env)

3. View Real-time Logs

fsb-ctl logs my-sandbox -f

4. Declarative YAML

You can also use Kubernetes CRD directly:

apiVersion: sandbox.fast.io/v1alpha1
kind: Sandbox
metadata:
  name: my-sandbox
  namespace: default
spec:
  image: alpine:latest
  exposedPorts: [8080]
  poolRef: default-pool
  consistencyMode: fast  # or strong
  failurePolicy: AutoRecreate

Consistency Modes

Fast Mode (Default)

  1. CLI → Controller gRPC request
  2. Registry allocates Agent
  3. Controller → Agent HTTP create request
  4. Agent starts container via Containerd
  5. Controller returns success to CLI
  6. Controller async creates K8s CRD

Latency: <50ms Trade-off: CRD creation failure may cause orphan (cleaned by Janitor)

Strong Mode

  1. CLI → Controller gRPC request
  2. Controller creates K8s CRD (Pending phase)
  3. Controller Watch triggers
  4. Controller → Agent HTTP create request
  5. Agent starts container
  6. CRD status updated to Running

Latency: ~200ms Guarantee: Strong consistency, no orphans

Configuration

Controller Flags

Flag Default Description
--agent-port 5758 Agent HTTP server port
--metrics-bind-address :9091 Prometheus metrics endpoint
--health-probe-bind-address :5758 Health check endpoint
--fastpath-consistency-mode fast Consistency mode: fast or strong
--fastpath-orphan-timeout 10s Fast mode orphan cleanup timeout

Agent Flags

Flag Default Description
--containerd-socket /run/containerd/containerd.sock Containerd socket path
--http-port 5758 HTTP server port

Environment Variables

Variable Description
AGENT_CAPACITY Max sandboxes per agent (default: 5)

gRPC API

service FastPathService {
  rpc CreateSandbox(CreateRequest) returns (CreateResponse);
  rpc DeleteSandbox(DeleteRequest) returns (DeleteResponse);
  rpc UpdateSandbox(UpdateRequest) returns (UpdateResponse);
  rpc ListSandboxes(ListRequest) returns (ListResponse);
  rpc GetSandbox(GetRequest) returns (SandboxInfo);
}

ConsistencyMode

  • FAST: Create container first, async CRD write
  • STRONG: Write CRD first, then create container

FailurePolicy

  • MANUAL: Report status only, no auto-recovery
  • AUTO_RECREATE: Automatically reschedule on failure

Development

Running Tests

# All tests
go test ./... -v

# With coverage
go test ./... -coverprofile=coverage.out

# Specific module
go test ./internal/controller/agentpool/ -v

See docs/TESTING.md for detailed testing documentation.

Performance Profiling

# CPU profiling
go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30 > cpu.prof

# View profile
go tool pprof -http=:8080 cpu.prof

See docs/PERFORMANCE.md for performance analysis.

Roadmap

  • Phase 1: Core Runtime (Containerd) & gRPC framework
  • Phase 2: Fast-Path API & Registry scheduling
  • Phase 3: CLI (fsb-ctl) & interactive experience
  • Phase 4: Log streaming & auto tunneling
  • Phase 5: Unified logging (klog)
  • Phase 6: Performance instrumentation & unit tests
  • Phase 7: Supports custom volume mounting.
  • Phase 8: Container checkpoint/restore (CRIU)
  • Phase 9: Web console & traffic proxy
  • Phase 10: gVisor support for secure sandboxing
  • Phase 11: CLI exec bash & Python SDK (Modal-like)
  • Phase 12: GPU container support

License

MIT

About

Fast Sandbox is a Kubernetes-based high-performance sandbox management system. The core objective is to provide millisecond-scale container startup latency for scenarios sensitive to startup delay, such as serverless functions and code sandbox execution.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors