[Proposal] Pluggable ateom backend: abstract the checkpoint/restore interface beyond gVisor

We are an OS infrastructure team building system-level optimizations for Agent workloads (high-density scheduling, process/filesystem checkpointing, structured CLI). We are evaluating Agent Substrate as a potential upstream dependency for actor lifecycle management and want to understand the project's stance on supporting alternative checkpoint/restore backends.

### Current State

From reading the codebase, I see that:

1. The `ateom` proto (`internal/proto/ateompb/ateom.proto`) already hints at extensibility:
   ```protobuf
   // Ateom is the interface to control a single gVisor (or, in the future microVM)
   // guest inside a worker pod.
   ```

2. The component is named `cmd/ateom-gvisor` (not just `cmd/ateom`), suggesting the naming convention anticipates alternative implementations.

3. However, the proto messages include gVisor-specific fields (e.g., `runsc_path` in all three request types), and the `WorkerPool` CRD's `ateomImage` field is the only configuration point for selecting a backend.

4. The newly proposed Actor State Machine (#119) introduces PAUSED vs SUSPENDED states with different durability/performance trade-offs. This is orthogonal to the backend question but has implications for it (e.g., a microVM backend might have different PAUSED semantics than gVisor).

### Motivation: Why Alternative Backends Matter

Different deployment environments have different constraints and existing infrastructure:

1. **CRIU + container runtime**: For environments already running containerd/CRI-O, CRIU-based checkpoint/restore reuses the existing stack without adding a new runtime dependency. CRIU is a userspace tool that has worked on mainline Linux kernels for 10+ years, and containerd exposes a native task checkpoint API. This makes it a natural fit for teams that want actor lifecycle management without changing their container runtime stack.

2. **Firecracker/microVM**: For strong multi-tenancy isolation requirements where gVisor's syscall-interception model isn't sufficient, microVM snapshots (Firecracker's `PUT /snapshot/create`) provide VM-level checkpoint/restore.


### Proposal

I'd like to discuss formalizing the ateom backend interface. Concretely:

**Phase 1 — Proto cleanup (minimal change):**

Replace the bare `runsc_path` field with a `runtime_config` message that wraps a `oneof` (or make `runsc_path` optional and backend-specific), so alternative ateom implementations don't need to accept gVisor-specific fields:

```protobuf
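message RuntimeConfig {
  // At most one backend-specific config is set per request.
  oneof backend {
    GVisorConfig gvisor = 1;
    CRIUConfig criu = 2;
    MicroVMConfig microvm = 3;
  }
}

message GVisorConfig {
  string runsc_path = 1;
}

// Placeholders so the sketch compiles; fields to be defined per backend.
message CRIUConfig {}
message MicroVMConfig {}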
```
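To illustrate how an ateom implementation might consume such a `oneof`, here is a minimal, language-neutral sketch in Python. The dataclasses are hand-written stand-ins for generated protobuf classes, and `which_backend` and `require_backend` are hypothetical helpers, not existing project APIs. The empty `CRIUConfig` and `MicroVMConfig` types are placeholders, since their fields are not defined yet. The idea is simply that each backend accepts its own config and rejects everything else up front.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GVisorConfig:
    runsc_path: str


@dataclass(frozen=True)
class CRIUConfig:
    """Placeholder: CRIU-specific fields are not defined yet."""


@dataclass(frozen=True)
class MicroVMConfig:
    """Placeholder: microVM-specific fields are not defined yet."""


@dataclass(frozen=True)
class RuntimeConfig:
    # Stand-in for the proposed oneof: at most one of these may be set.
    gvisor: Optional[GVisorConfig] = None
    criu: Optional[CRIUConfig] = None
    microvm: Optional[MicroVMConfig] = None

    def which_backend(self) -> Optional[str]:
        """Return the name of the populated field, or None if none is set."""
        set_fields = [
            name for name in ("gvisor", "criu", "microvm")
            if getattr(self, name) is not None
        ]
        if len(set_fields) > 1:
            raise ValueError(f"oneof violated, multiple backends set: {set_fields}")
        return set_fields[0] if set_fields else None


def require_backend(config: RuntimeConfig, expected: str) -> object:
    """Return the backend-specific config, or fail if it targets another backend."""
    actual = config.which_backend()
    if actual != expected:
        raise ValueError(f"this ateom handles {expected!r}, got {actual!r}")
    return getattr(config, expected)
```

With this shape, a gVisor ateom would call `require_backend(config, "gvisor")` and read `runsc_path` from the result, while an `ateom-criu` would never see that field at all.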

**Phase 2 — ateom interface contract:**

Document the behavioral contract that any ateom implementation must satisfy:
- `RunWorkload`: Start a new workload from OCI spec, report ready
- `CheckpointWorkload`: Capture full state (process + filesystem) to `snapshot_uri_prefix`, return to available state
- `RestoreWorkload`: Restore from `snapshot_uri_prefix`, report ready

This contract already exists implicitly; making it explicit enables third-party implementations.
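One way to make the contract explicit is a small backend-agnostic conformance check that every implementation must pass. The sketch below is in Python for brevity and rests on several assumptions: the `Ateom` class, `is_available`, `InMemoryAteom`, and `check_lifecycle` are hypothetical names, "available" is read as "no workload loaded", and restore is assumed to be allowed on a different instance than the one that took the checkpoint. A dict stands in for snapshot storage.

```python
import copy
from abc import ABC, abstractmethod
from typing import Callable, Dict, Optional


class Ateom(ABC):
    """Hypothetical rendering of the three ateom operations."""

    @abstractmethod
    def run_workload(self, oci_spec: dict) -> None:
        """Start a new workload from an OCI spec; return once it is ready."""

    @abstractmethod
    def checkpoint_workload(self, snapshot_uri_prefix: str) -> None:
        """Capture full state under the prefix, then become available again."""

    @abstractmethod
    def restore_workload(self, snapshot_uri_prefix: str) -> None:
        """Restore the workload saved under the prefix; return once ready."""

    @abstractmethod
    def is_available(self) -> bool:
        """True when no workload is loaded (assumed meaning of 'available')."""


class InMemoryAteom(Ateom):
    """Toy backend for exercising the check; a dict replaces real storage."""

    def __init__(self, store: Dict[str, dict]) -> None:
        self._store = store
        self.workload: Optional[dict] = None

    def run_workload(self, oci_spec: dict) -> None:
        if self.workload is not None:
            raise RuntimeError("a workload is already running")
        self.workload = {"spec": copy.deepcopy(oci_spec), "memory": {}}

    def checkpoint_workload(self, snapshot_uri_prefix: str) -> None:
        if self.workload is None:
            raise RuntimeError("no workload to checkpoint")
        self._store[snapshot_uri_prefix] = copy.deepcopy(self.workload)
        self.workload = None  # back to the available state

    def restore_workload(self, snapshot_uri_prefix: str) -> None:
        if self.workload is not None:
            raise RuntimeError("a workload is already running")
        if snapshot_uri_prefix not in self._store:
            raise FileNotFoundError(snapshot_uri_prefix)
        self.workload = copy.deepcopy(self._store[snapshot_uri_prefix])

    def is_available(self) -> bool:
        return self.workload is None


def check_lifecycle(make_ateom: Callable[[], Ateom], prefix: str) -> None:
    """Backend-agnostic check of the run -> checkpoint -> restore sequence."""
    source = make_ateom()
    assert source.is_available()
    source.run_workload({"process": {"args": ["sleep", "infinity"]}})
    assert not source.is_available()
    source.checkpoint_workload(prefix)
    assert source.is_available(), "checkpoint must free the ateom"

    target = make_ateom()  # assumption: restore may land on another instance
    target.restore_workload(prefix)
    assert not target.is_available()
```

A gVisor, CRIU, or microVM implementation would each supply its own factory to `check_lifecycle`, for example `check_lifecycle(lambda: InMemoryAteom(store), "mem://snapshots/actor-1")` for the toy backend (the `mem://` prefix is made up for illustration).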

**Phase 3 — Reference alternative implementation (future):**

We would be interested in contributing an `ateom-criu` implementation that uses containerd's checkpoint API.

### Questions for Maintainers

1. Is supporting alternative ateom backends aligned with the project's near-term goals, or is gVisor intended to remain the sole implementation for the foreseeable future?
2. Would you accept a proto change that generalizes `runsc_path`, or do you prefer keeping the proto gVisor-specific and handling multi-backend at a different layer?
3. Is there any internal design doc or roadmap for the "microVM" path mentioned in the proto comment?

### Related

- #119 — Actor State Machine (the PAUSED/SUSPENDED distinction has implications for backend choice)

