[Proposal] Pluggable ateom backend: abstract the checkpoint/restore interface beyond gVisor #121

@yummypeng

Description

We are an OS infrastructure team building system-level optimizations for Agent workloads (high-density scheduling, process/filesystem checkpointing, structured CLI). We are evaluating Agent Substrate as a potential upstream dependency for actor lifecycle management and want to understand the project's stance on supporting alternative checkpoint/restore backends.

Current State

From reading the codebase, I see that:

  1. The ateom proto (internal/proto/ateompb/ateom.proto) already hints at extensibility:

    // Ateom is the interface to control a single gVisor (or, in the future microVM)
    // guest inside a worker pod.
    
  2. The component is named cmd/ateom-gvisor (not just cmd/ateom), suggesting the naming convention anticipates alternative implementations.

  3. However, the proto messages include gVisor-specific fields (e.g., runsc_path in all three request types), and the WorkerPool CRD's ateomImage field is the only configuration point for selecting a backend.

  4. The newly proposed Actor State Machine (#119, "[Feature] Implementation of Actor State Machine: Lifecycle and Transitions") introduces PAUSED and SUSPENDED states with different durability/performance trade-offs. This is orthogonal to the backend question but has implications for it (e.g., a microVM backend might have different PAUSED semantics than gVisor).

Motivation: Why Alternative Backends Matter

Different deployment environments have different constraints and existing infrastructure:

  1. CRIU + container runtime: For environments already running containerd/CRI-O, CRIU-based checkpoint/restore integrates with existing infrastructure without adding a new runtime dependency. CRIU has run on mainline Linux kernels for 10+ years, and containerd exposes a native task checkpoint API. This makes it a natural fit for teams that want actor lifecycle management without changing their container runtime stack.

  2. Firecracker/microVM: For strong multi-tenancy isolation requirements where gVisor's syscall-interception model isn't sufficient, microVM snapshots (Firecracker's PUT /snapshot/create) provide VM-level checkpoint/restore.
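To make the microVM option concrete, here is a rough Go sketch of how a hypothetical microVM ateom might issue Firecracker's PUT /snapshot/create call over the API's Unix socket. The function names and file paths are made up for illustration and nothing here comes from the Agent Substrate codebase. Firecracker expects the microVM to be paused (PATCH /vm with state "Paused") before a snapshot is taken; that step is omitted.

```go
package main

import (
	"bytes"
	"context"
	"encoding/json"
	"net"
	"net/http"
)

// snapshotCreateBody mirrors the JSON body Firecracker expects on
// PUT /snapshot/create.
type snapshotCreateBody struct {
	SnapshotType string `json:"snapshot_type"`
	SnapshotPath string `json:"snapshot_path"`
	MemFilePath  string `json:"mem_file_path"`
}

// newSnapshotCreateRequest builds, but does not send, a full-snapshot request.
// The name and signature are hypothetical.
func newSnapshotCreateRequest(snapshotPath, memFilePath string) (*http.Request, error) {
	body, err := json.Marshal(snapshotCreateBody{
		SnapshotType: "Full",
		SnapshotPath: snapshotPath,
		MemFilePath:  memFilePath,
	})
	if err != nil {
		return nil, err
	}
	// The host part is a placeholder: the client below ignores it and
	// dials the Unix socket instead.
	req, err := http.NewRequest(http.MethodPut, "http://localhost/snapshot/create", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}

// newUnixSocketClient returns an HTTP client whose connections all go to the
// Firecracker API socket at socketPath.
func newUnixSocketClient(socketPath string) *http.Client {
	return &http.Client{
		Transport: &http.Transport{
			DialContext: func(ctx context.Context, _, _ string) (net.Conn, error) {
				var d net.Dialer
				return d.DialContext(ctx, "unix", socketPath)
			},
		},
	}
}
```

A caller would pass the request to newUnixSocketClient(socketPath).Do(req) and treat any non-2xx status as a failed checkpoint. Mapping snapshot_uri_prefix onto the two local file paths is left open here.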

Proposal

I'd like to discuss formalizing the ateom backend interface. Concretely:

Phase 1 — Proto cleanup (minimal change):

Replace runsc_path with a runtime_config field of type RuntimeConfig, which selects the backend through a oneof (or make runsc_path optional and backend-specific), so alternative ateom implementations don't need to accept gVisor-specific fields:

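// RuntimeConfig carries backend-specific settings; at most one backend is set.
message RuntimeConfig {
  oneof backend {
    GVisorConfig gvisor = 1;
    CRIUConfig criu = 2;
    MicroVMConfig microvm = 3;
  }
}

// GVisorConfig holds the field that the request messages carry directly today.
message GVisorConfig {
  string runsc_path = 1;
}

// Placeholders so the sketch compiles; fields for these backends are still to be defined.
message CRIUConfig {}
message MicroVMConfig {}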
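As a rough illustration of what Phase 1 would mean on the implementation side, the Go sketch below shows how ateom-gvisor might read its own config out of the oneof and reject everything else. The types are hand-written stand-ins that loosely follow the wrapper-type pattern protoc-gen-go uses for oneof fields; they are not generated code, and gvisorRunscPath is a hypothetical helper.

```go
package main

import (
	"errors"
	"fmt"
)

// Stand-ins for the proposed messages (assumed shapes, not generated code).
type GVisorConfig struct{ RunscPath string }
type CRIUConfig struct{}
type MicroVMConfig struct{}

// isRuntimeConfigBackend plays the role of the generated oneof interface.
type isRuntimeConfigBackend interface{ isRuntimeConfigBackend() }

// One wrapper type per oneof case.
type RuntimeConfigGVisor struct{ GVisor *GVisorConfig }
type RuntimeConfigCRIU struct{ CRIU *CRIUConfig }
type RuntimeConfigMicroVM struct{ MicroVM *MicroVMConfig }

func (*RuntimeConfigGVisor) isRuntimeConfigBackend()  {}
func (*RuntimeConfigCRIU) isRuntimeConfigBackend()    {}
func (*RuntimeConfigMicroVM) isRuntimeConfigBackend() {}

// RuntimeConfig holds at most one backend config.
type RuntimeConfig struct{ Backend isRuntimeConfigBackend }

// gvisorRunscPath returns runsc_path if the request targets gVisor and
// an error for an unset or foreign backend.
func gvisorRunscPath(cfg *RuntimeConfig) (string, error) {
	if cfg == nil {
		return "", errors.New("runtime_config: missing")
	}
	switch b := cfg.Backend.(type) {
	case *RuntimeConfigGVisor:
		if b.GVisor == nil || b.GVisor.RunscPath == "" {
			return "", errors.New("runtime_config: gvisor backend requires runsc_path")
		}
		return b.GVisor.RunscPath, nil
	case nil:
		return "", errors.New("runtime_config: no backend set")
	default:
		return "", fmt.Errorf("runtime_config: backend %T not supported by ateom-gvisor", b)
	}
}
```

An ateom-criu implementation would use the same switch with the cases reversed, so a request for the wrong backend fails early with a clear error rather than being silently ignored.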

Phase 2 — ateom interface contract:

Document the behavioral contract that any ateom implementation must satisfy:

  • RunWorkload: Start a new workload from OCI spec, report ready
  • CheckpointWorkload: Capture full state (process + filesystem) to snapshot_uri_prefix, return to available state
  • RestoreWorkload: Restore from snapshot_uri_prefix, report ready

This contract already exists implicitly; making it explicit enables third-party implementations.
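One way to make the contract checkable is a small conformance helper that any backend can be run against. The sketch below is only an assumption about how that could look in Go: the Ateom interface, fakeAteom, and checkContract are hypothetical names, the snapshot URI is invented, and the rule that restoring an unknown prefix must fail is our reading of the contract rather than something the proto states.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
)

// Ateom is a hypothetical Go rendering of the three operations in the contract.
type Ateom interface {
	RunWorkload(ctx context.Context, ociSpec []byte) error
	CheckpointWorkload(ctx context.Context, snapshotURIPrefix string) error
	RestoreWorkload(ctx context.Context, snapshotURIPrefix string) error
}

var (
	errBusy       = errors.New("ateom: a workload is already running")
	errIdle       = errors.New("ateom: no workload is running")
	errNoSnapshot = errors.New("ateom: no snapshot under that prefix")
)

// fakeAteom keeps snapshots in memory; it exists only to exercise the checker.
type fakeAteom struct {
	mu        sync.Mutex
	ready     bool              // true while a workload is running
	workload  []byte            // state of the running workload
	snapshots map[string][]byte // snapshot_uri_prefix -> captured state
}

func newFakeAteom() *fakeAteom { return &fakeAteom{snapshots: map[string][]byte{}} }

func (f *fakeAteom) RunWorkload(_ context.Context, ociSpec []byte) error {
	f.mu.Lock()
	defer f.mu.Unlock()
	if f.ready {
		return errBusy
	}
	f.workload, f.ready = append([]byte(nil), ociSpec...), true
	return nil
}

func (f *fakeAteom) CheckpointWorkload(_ context.Context, prefix string) error {
	f.mu.Lock()
	defer f.mu.Unlock()
	if !f.ready {
		return errIdle
	}
	// Capture state, then return to the available state.
	f.snapshots[prefix] = f.workload
	f.workload, f.ready = nil, false
	return nil
}

func (f *fakeAteom) RestoreWorkload(_ context.Context, prefix string) error {
	f.mu.Lock()
	defer f.mu.Unlock()
	if f.ready {
		return errBusy
	}
	snap, ok := f.snapshots[prefix]
	if !ok {
		return errNoSnapshot
	}
	f.workload, f.ready = snap, true
	return nil
}

// checkContract drives run -> checkpoint -> restore against any backend.
func checkContract(a Ateom) error {
	ctx := context.Background()
	const prefix = "mem://contract-check/snap-1" // invented URI for the sketch
	if err := a.RunWorkload(ctx, []byte(`{"ociVersion":"1.0.2"}`)); err != nil {
		return fmt.Errorf("RunWorkload: %w", err)
	}
	if err := a.CheckpointWorkload(ctx, prefix); err != nil {
		return fmt.Errorf("CheckpointWorkload: %w", err)
	}
	// Assumption: restoring from a prefix that was never written must fail.
	if err := a.RestoreWorkload(ctx, prefix+"-missing"); err == nil {
		return errors.New("RestoreWorkload succeeded for an unknown prefix")
	}
	if err := a.RestoreWorkload(ctx, prefix); err != nil {
		return fmt.Errorf("RestoreWorkload: %w", err)
	}
	return nil
}
```

Calling checkContract(newFakeAteom()) returns nil for the in-memory fake. A real suite would also need to verify that process and filesystem state actually survive the round trip, which this sketch does not attempt.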

Phase 3 — Reference alternative implementation (future):

We would be interested in contributing an ateom-criu implementation that uses containerd's checkpoint API.

Questions for Maintainers

  1. Is supporting alternative ateom backends aligned with the project's near-term goals, or is gVisor intended to remain the sole implementation for the foreseeable future?
  2. Would you accept a proto change that generalizes runsc_path, or do you prefer keeping the proto gVisor-specific and handling multi-backend at a different layer?
  3. Is there any internal design doc or roadmap for the "microVM" path mentioned in the proto comment?

Metadata

Labels

  • area/api: User-facing API changes
  • area/node
  • kind/feature: An enhancement / feature request or implementation
  • prio/P1: Important but not critical
