We are an OS infrastructure team building system-level optimizations for Agent workloads (high-density scheduling, process/filesystem checkpointing, structured CLI). We are evaluating Agent Substrate as a potential upstream dependency for actor lifecycle management and want to understand the project's stance on supporting alternative checkpoint/restore backends.
Current State
From reading the codebase, I see that:
The ateom proto (internal/proto/ateompb/ateom.proto) already hints at extensibility:
// Ateom is the interface to control a single gVisor (or, in the future microVM)
// guest inside a worker pod.
The component is named cmd/ateom-gvisor (not just cmd/ateom), suggesting the naming convention anticipates alternative implementations.
However, the proto messages include gVisor-specific fields (e.g., runsc_path in all three request types), and the WorkerPool CRD's ateomImage field is the only configuration point for selecting a backend.
The newly proposed Actor State Machine (#119, "[Feature] Implementation of Actor State Machine: Lifecycle and Transitions") introduces PAUSED vs SUSPENDED states with different durability/performance trade-offs. This is orthogonal to the backend question but has implications for it (e.g., a microVM backend might have different PAUSED semantics than gVisor).
Motivation: Why Alternative Backends Matter
Different deployment environments have different constraints and existing infrastructure:
CRIU + container runtime: For environments already running containerd/CRI-O, CRIU-based checkpoint/restore integrates with existing infrastructure without adding a new runtime dependency. The kernel features CRIU depends on have been in mainline Linux for 10+ years, and containerd has native checkpoint APIs. This makes it a natural fit for teams that want actor lifecycle management without changing their container runtime stack.
Firecracker/microVM: For strong multi-tenancy isolation requirements where gVisor's syscall-interception model isn't sufficient, microVM snapshots (Firecracker's PUT /snapshot/create) provide VM-level checkpoint/restore.
Proposal
I'd like to discuss formalizing the ateom backend interface. Concretely:
Phase 1 — Proto cleanup (minimal change):
Generalize runsc_path to a runtime_config oneof (or make it optional and backend-specific), so alternative ateom implementations don't need to accept gVisor-specific fields:
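As a rough illustration of what a oneof would mean on the implementation side, here is a minimal Go sketch that models `runtime_config` as a sealed interface. Every name in it (`RuntimeConfig`, `GVisorConfig`, `CRIUConfig`, `ContainerdSocket`, `gvisorRunscPath`) is hypothetical and not taken from ateom.proto or generated code. It only shows the idea that each backend reads its own variant and rejects the others.

```go
package main

import (
	"errors"
	"fmt"
)

// RuntimeConfig stands in for the proposed runtime_config oneof.
// All names here are hypothetical, not taken from ateom.proto.
type RuntimeConfig interface{ isRuntimeConfig() }

// GVisorConfig carries what the runsc_path field carries today.
type GVisorConfig struct{ RunscPath string }

// CRIUConfig is an invented example of a second backend's settings.
type CRIUConfig struct{ ContainerdSocket string }

func (GVisorConfig) isRuntimeConfig() {}
func (CRIUConfig) isRuntimeConfig()   {}

// ErrUnsupportedRuntime marks config that was meant for a different backend.
var ErrUnsupportedRuntime = errors.New("unsupported runtime config")

// gvisorRunscPath shows how a gVisor ateom could accept only its own variant.
func gvisorRunscPath(cfg RuntimeConfig) (string, error) {
	switch c := cfg.(type) {
	case GVisorConfig:
		if c.RunscPath == "" {
			return "", errors.New("runsc_path must be set for the gVisor backend")
		}
		return c.RunscPath, nil
	default:
		// Covers other backends' variants as well as a nil config.
		return "", fmt.Errorf("%w: %T", ErrUnsupportedRuntime, cfg)
	}
}

func main() {
	path, err := gvisorRunscPath(GVisorConfig{RunscPath: "/usr/local/bin/runsc"})
	fmt.Println(path, err)

	_, err = gvisorRunscPath(CRIUConfig{ContainerdSocket: "/run/containerd/containerd.sock"})
	fmt.Println(err)
}
```

With this shape, a CRIU or microVM ateom would never see `runsc_path` at all, and a request routed to the wrong backend fails with an explicit error instead of a silently ignored field.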
Phase 2 — ateom interface contract:
Document the behavioral contract that any ateom implementation must satisfy:
RunWorkload: Start a new workload from OCI spec, report ready
CheckpointWorkload: Capture full state (process + filesystem) to snapshot_uri_prefix, return to available state
RestoreWorkload: Restore from snapshot_uri_prefix, report ready
This contract already exists implicitly; making it explicit enables third-party implementations.
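To make the three calls concrete, the contract could be written down as a Go interface plus a small backend-agnostic check. This is only a sketch: the `Ateom` interface, the `fakeAteom` in-memory backend, and `checkContract` are hypothetical names, and the real RPC messages carry more fields than the two strings modeled here.

```go
package main

import (
	"errors"
	"fmt"
)

// Ateom is a hypothetical Go rendering of the three-call contract.
type Ateom interface {
	RunWorkload(ociSpec string) error
	CheckpointWorkload(snapshotURIPrefix string) error
	RestoreWorkload(snapshotURIPrefix string) error
}

// fakeAteom is an in-memory stand-in used only to exercise the contract.
type fakeAteom struct {
	running   bool
	spec      string
	snapshots map[string]string // snapshot prefix -> captured spec
}

func newFakeAteom() *fakeAteom {
	return &fakeAteom{snapshots: map[string]string{}}
}

func (f *fakeAteom) RunWorkload(ociSpec string) error {
	if f.running {
		return errors.New("workload already running")
	}
	f.running, f.spec = true, ociSpec
	return nil
}

func (f *fakeAteom) CheckpointWorkload(prefix string) error {
	if !f.running {
		return errors.New("no workload to checkpoint")
	}
	f.snapshots[prefix] = f.spec
	f.running, f.spec = false, "" // contract: return to available state
	return nil
}

func (f *fakeAteom) RestoreWorkload(prefix string) error {
	if f.running {
		return errors.New("workload already running")
	}
	spec, ok := f.snapshots[prefix]
	if !ok {
		return fmt.Errorf("no snapshot at %q", prefix)
	}
	f.running, f.spec = true, spec
	return nil
}

// checkContract drives one run -> checkpoint -> restore cycle against any backend.
func checkContract(a Ateom, spec, prefix string) error {
	if err := a.RunWorkload(spec); err != nil {
		return fmt.Errorf("run: %w", err)
	}
	if err := a.CheckpointWorkload(prefix); err != nil {
		return fmt.Errorf("checkpoint: %w", err)
	}
	if err := a.RestoreWorkload(prefix); err != nil {
		return fmt.Errorf("restore: %w", err)
	}
	return nil
}

func main() {
	err := checkContract(newFakeAteom(), `{"ociVersion":"1.0.2"}`, "mem://snapshots/actor-1")
	fmt.Println("contract check error:", err)
}
```

A check like `checkContract` could be run unchanged against gVisor, CRIU, or microVM implementations, which is the practical benefit of stating the contract explicitly.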
Phase 3 — Reference alternative implementation (future):
We would be interested in contributing an ateom-criu implementation that uses containerd's checkpoint API.
Questions for Maintainers
Is supporting alternative ateom backends aligned with the project's near-term goals, or is gVisor intended to remain the sole implementation for the foreseeable future?
Would you accept a proto change that generalizes runsc_path, or do you prefer keeping the proto gVisor-specific and handling multi-backend at a different layer?
Is there any internal design doc or roadmap for the "microVM" path mentioned in the proto comment?