
Proposal: Fold atelet into ateom #128

@yuval-k

Background

Today the actor lifecycle is split across two tightly coupled components:

  • atelet: a node-level DaemonSet that the control plane talks to
  • ateom: the worker component inside the worker pod that actually runs, checkpoints, and restores actors

To create a single worker, the control plane today coordinates two RPCs:
one to atelet and one to ateom-gvisor. The two processes also share
state through a host bind mount at /run/ateom-gvisor so they can hand
off snapshot files.

Problem

This split presents three structural issues:

  1. Two-component coordination. Every worker-lifecycle operation is a
    distributed transaction across atelet and ateom. Failures and
    partial states have to be reconciled by callers, and upgrades have to
    keep the two binaries version-compatible. Debugging means reading two
    sets of logs and reasoning about the handoff between them.

  2. Backend lock-in. The split assumes the gVisor model (a node agent
    plus an in-sandbox helper). Adding a different worker backend
    (Firecracker, for example) is harder than it needs to be, because
    support for it has to be built into both components.

  3. Shared host /run mount is a blast radius. The atelet-to-ateom
    handoff requires a host bind mount on /run/ateom-gvisor. A
    misbehaving sandbox that fills that directory can exhaust /run on
    the node and take down every other pod on it. With per-pod state
    (no host mount), one bad sandbox only takes itself down.

Proposal

Remove atelet and consolidate its responsibilities into ateom
(running per-worker-pod), exposing a single control-plane-facing
interface. Concretely:

  • Worker lifecycle RPCs (create / start / suspend / restore / destroy)
    become a single call to the per-pod agent.
  • The backend (gVisor today, others later) lives behind an interface
    inside ateom; new backends plug in there.
  • Snapshot/restore state stays inside the worker pod's own filesystem —
    no host mount needed.
  • In the future, we could even standardize the API that ateom exposes,
    to allow out-of-tree ateom implementations.

This is not possible today because of how ateom does networking, but once #110 lands we can implement this proposal.
