[Feature] Implementation of Actor State Machine: Lifecycle and Transitions #119

@dberkov

Issue Description

This issue tracks the formal implementation of the actor state machine within the substrate.

To achieve high density and efficiency, we must distinguish between different "resting" states for actors - specifically focusing on the trade-offs between resume speed, data durability, and resource costs. The implementation must specifically address a potential NIC saturation problem. A simplistic approach of uploading full memory and filesystem images over the network during every suspend/resume cycle would quickly overwhelm the host VM's network interface card (NIC). Consequently, the state machine implementation must prioritize keeping data on the local host to maintain required performance and density.
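To make the NIC concern concrete, here is a back-of-envelope sketch. The link speed and snapshot size are hypothetical inputs chosen for illustration, not measurements from the substrate, and the function name is made up for this example.

```python
def max_transfers_per_second(nic_gbit_per_s: float, snapshot_gib: float) -> float:
    """Upper bound on full-snapshot transfers per second in one direction.

    Ignores protocol overhead and any other traffic sharing the NIC, so a
    real host would saturate earlier than this bound suggests.
    """
    nic_bytes_per_s = nic_gbit_per_s * 1e9 / 8   # Gbit/s -> bytes/s
    snapshot_bytes = snapshot_gib * 2**30        # GiB -> bytes
    return nic_bytes_per_s / snapshot_bytes


# Hypothetical numbers: a 10 Gbit/s NIC and 2 GiB of memory + filesystem per actor.
rate = max_transfers_per_second(nic_gbit_per_s=10, snapshot_gib=2)
print(f"{rate:.2f} full snapshot transfers per second")
```

Under these assumed numbers the link carries less than one full snapshot per second, which is why a host packed with many actors benefits from keeping snapshots local.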

The "story" of an actor follows a journey from active execution through various stages of dormancy or failure, driven by both user actions and system timeouts.

Actor State Definitions

| State | Description | Durability & Performance |
| --- | --- | --- |
| RUNNING | The actor is currently active and executing on a worker. | Active CPU/RAM usage. |
| PAUSED | The actor is offloaded from the CPU, with snapshots stored locally on the node VM. | Fast resume; lower durability (data lost if node fails); cheaper (local, already provisioned storage, no NIC); we can optimize this (e.g., save to peers opportunistically (TODO add issue link)). |
| SUSPENDED | The actor is offloaded from the CPU, and snapshots are persisted to durable storage. | Higher SLO/durability (e.g., saved to blob storage); expensive (external storage and NIC); requires retention policies; slower resume due to data transfer; can be symbolically tagged. |
| CRASHED | A failure state reached if the host node fails while an actor is Running or Paused. | Requires manual recovery or revert to a known good state. |
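The state definitions can be captured in a small data model. This is a minimal sketch, not the substrate's actual types: `ActorState`, `StateTraits`, and the field names are hypothetical and only restate the state descriptions.

```python
from dataclasses import dataclass
from enum import Enum


class ActorState(Enum):
    RUNNING = "RUNNING"
    PAUSED = "PAUSED"
    SUSPENDED = "SUSPENDED"
    CRASHED = "CRASHED"


@dataclass(frozen=True)
class StateTraits:
    uses_cpu: bool            # actor is executing on a worker
    snapshot_location: str    # "none", "local", or "durable"
    survives_node_loss: bool  # is the latest state safe if the node VM fails?


# CRASHED is deliberately absent: what survives depends on how the crash happened.
TRAITS = {
    ActorState.RUNNING: StateTraits(True, "none", False),
    ActorState.PAUSED: StateTraits(False, "local", False),
    ActorState.SUSPENDED: StateTraits(False, "durable", True),
}
```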

Snapshot layers

---
config:
  treemap:
    showValues: false
---
treemap-beta
"Actor's process"
    "memory" : 40
    "local storage"
        "rootfs"
            "homedir" : 20
            "oci bundle" : 80:::noborder

classDef noborder stroke-width:0px;
  • memory - a snapshot of an actor's process memory. A snapshot cannot be used for restoration if the OCI image of the actor has been changed (e.g., upgraded).
  • rootfs - an entire boot disk snapshot with all files. A snapshot cannot be used for restoration if the OCI image of the actor has been changed (e.g., upgraded).
  • homedir - files under the home directory. This can be used for restoration even if the OCI image has been changed.

Actor Lifecycle Story

---
title: Actor State (`ate actor ...`)
---
stateDiagram-v2
direction LR

classDef crushedEvent fill:#f00,font-weight:bold,stroke-width:2px

[*] --> Suspended : create --from

Suspended --> Running : resume
Suspended --> Suspended : commit --tag | revert --to
Suspended --> [*] : delete

Running --> Running : commit --remain [--tag]
Running --> Paused : pause [--timer]
Running --> Suspended : commit [--tag] | revert --to
Running --> Crashed : [VM crashed | ateom crashed | etc...]

Paused --> Running : resume 
Paused --> Suspended : commit [--tag] | revert --to | (auto-commit-timeout)
Paused --> Paused : commit --remain [--tag]
Paused --> Crashed  : [VM crashed | ateom crashed | etc...]
Paused --> [*] : delete

Crashed --> Suspended : revert --to | dump | commit --tag
Crashed --> [*] : delete

class Crashed crushedEvent
class end crushedEvent
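One way to enforce the diagram in code is a lookup table keyed by (state, event). The sketch below transcribes the edges as drawn; the event strings are simplified labels invented for this example (flags such as --tag are mostly dropped), and `NONE` stands for the diagram's start/end marker.

```python
# (current state, event) -> next state, transcribed from the state diagram.
TRANSITIONS = {
    ("NONE", "create"): "SUSPENDED",
    ("SUSPENDED", "resume"): "RUNNING",
    ("SUSPENDED", "commit --tag"): "SUSPENDED",
    ("SUSPENDED", "revert"): "SUSPENDED",
    ("SUSPENDED", "delete"): "NONE",
    ("RUNNING", "commit --remain"): "RUNNING",
    ("RUNNING", "pause"): "PAUSED",
    ("RUNNING", "commit"): "SUSPENDED",
    ("RUNNING", "revert"): "SUSPENDED",
    ("RUNNING", "crash"): "CRASHED",
    ("PAUSED", "resume"): "RUNNING",
    ("PAUSED", "commit"): "SUSPENDED",
    ("PAUSED", "revert"): "SUSPENDED",
    ("PAUSED", "auto-commit-timeout"): "SUSPENDED",
    ("PAUSED", "commit --remain"): "PAUSED",
    ("PAUSED", "crash"): "CRASHED",
    ("PAUSED", "delete"): "NONE",
    ("CRASHED", "revert"): "SUSPENDED",
    ("CRASHED", "dump"): "SUSPENDED",
    ("CRASHED", "commit --tag"): "SUSPENDED",
    ("CRASHED", "delete"): "NONE",
}


class InvalidTransition(Exception):
    """Raised when an event is not allowed in the current state."""


def next_state(state: str, event: str) -> str:
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise InvalidTransition(f"{event!r} is not allowed in state {state}") from None
```

Note that the diagram as drawn has no delete edge out of Running, so `next_state("RUNNING", "delete")` raises `InvalidTransition` in this sketch.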

1. Create a new actor

  • Action: ate actor create [--from]
  • Mechanism: The system creates a new actor from either the default configuration or a tagged snapshot.
  • Transition: NONE → SUSPENDED

2. Active Execution to Temporary Pause

When a user needs to temporarily free up vCPU resources but expects a quick return, the actor moves to a PAUSED state.

  • Action: ate actor pause [--timer]
  • Mechanism: The system takes a snapshot and keeps it on the local node. If an actor remains paused for an extended period and --timer is provided, the system escalates it to a SUSPENDED state to ensure data safety and free up local node storage.
  • Transition: RUNNING → PAUSED
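The timer-driven escalation could be checked by a periodic controller loop. This is a sketch under assumptions: `PausedActor`, its fields, and `should_auto_commit` are hypothetical names, and the issue does not specify the unit or default of --timer.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PausedActor:
    actor_id: str
    paused_at: float                # seconds on any monotonic clock
    timer_seconds: Optional[float]  # value of --timer; None if not provided


def should_auto_commit(actor: PausedActor, now: float) -> bool:
    """True once a paused actor has outlived its --timer and should be
    escalated from PAUSED to SUSPENDED."""
    if actor.timer_seconds is None:
        return False  # no timer given: the actor stays PAUSED indefinitely
    return now - actor.paused_at >= actor.timer_seconds
```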

3. Active Execution to Long-term Suspension

When a user needs to free up vCPU resources and does not expect a quick return, or needs to ensure data safety, the actor moves to the SUSPENDED state.

  • Action: ate actor commit [--tag]
  • Mechanism: The system takes a snapshot and moves it to durable, persistent storage. A tag can be attached to the snapshot state, allowing the future creation of a new actor from this state (fork) or reverting the current actor to this state.
  • Transition: RUNNING → SUSPENDED

4. Resuming Operations

Actors can be brought back to a RUNNING state from either dormancy stage.

  • Action: ate actor resume
  • From Paused: The scheduler prefers the original node to avoid data transfer.
  • From Suspended: The system copies the latest snapshots to a node where a new available worker is running.
  • Transition: PAUSED/SUSPENDED → RUNNING
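The placement preference on resume might look like the following sketch. The function and parameter names are hypothetical, and the fallback when a paused actor's original node has no free worker is an assumption, since the issue does not say what should happen in that case.

```python
from typing import Sequence, Tuple


def pick_resume_node(state: str, origin_node: str,
                     nodes_with_free_worker: Sequence[str]) -> Tuple[str, bool]:
    """Return (node, needs_transfer) for a resume request."""
    if state not in ("PAUSED", "SUSPENDED"):
        raise ValueError(f"cannot resume from {state}")
    if not nodes_with_free_worker:
        raise RuntimeError("no node has an available worker")
    if state == "PAUSED" and origin_node in nodes_with_free_worker:
        # The snapshot is already on this node's local storage: no transfer.
        return origin_node, False
    # SUSPENDED: snapshots are copied from durable storage to the chosen node.
    # PAUSED with a busy original node: ASSUMPTION, pick another node and
    # flag that the local snapshot would have to be moved.
    return nodes_with_free_worker[0], True
```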

5. Handling Failures

If the underlying node fails or ateom crashes while the actor's state is only stored locally (Running or Paused), the actor enters the CRASHED state.

  • Recovery: Users can run ate actor revert --to to abandon the lost local data and return to the last known durable snapshot.
  • Debugging: A user can run ate actor dump to transition an actor to the SUSPENDED state by dumping the current in-memory state while the node VM is still alive, allowing for future debugging.
  • Transition: RUNNING/PAUSED → CRASHED

Actor’s Storage Snapshots configuration

snapshotConfig:
  onPause: (none | homedir | process)
  onCommit: (none | homedir | process)

The substrate manages three distinct but interrelated data types: homedir, rootfs, and process (memory+rootfs).

Specific system events necessitate the discarding of certain data layers - a process termed "devolving." For instance, updating an actor's OCI image invalidates both rootfs and memory, whereas a gVisor version update permits rootfs reuse while requiring a memory purge.

  • None:
    • Use Case: Ideal for ephemeral, high-speed startup requirements.
    • Note: For components like MCP, this effectively disables suspension.
  • Homedir only:
    • Use Case: Suitable for fast-starting agents capable of autonomous state recovery.
    • Persistence: This layer is designed to never devolve.
  • Process + Homedir:
    • Mechanism: Represents a comprehensive full process snapshot.
    • Use Case: Critical for slow-starting actors with significant RAM-resident state (e.g., actors with slow bootstrap entry point and/or actors with state in RAM).
    • Devolution Path: Reverts to rootfs + homedir on hardware/gVisor changes, and homedir only on image upgrades.
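Devolution can be modeled as set subtraction over snapshot layers. In this sketch the snapshot levels mirror the snapshotConfig values, while the event names (`image_upgrade`, `gvisor_update`, `hardware_change`) and the `devolve` function are hypothetical labels for the events described above.

```python
# Layers captured at each snapshotConfig level.
LAYERS = {
    "none": frozenset(),
    "homedir": frozenset({"homedir"}),
    "process": frozenset({"memory", "rootfs", "homedir"}),
}

# Layers that each system event invalidates. homedir never appears here,
# matching the statement that it is designed to never devolve.
INVALIDATED_BY = {
    "image_upgrade": frozenset({"memory", "rootfs"}),
    "gvisor_update": frozenset({"memory"}),
    "hardware_change": frozenset({"memory"}),
}


def devolve(snapshot_level: str, event: str) -> frozenset:
    """Return the layers that are still usable for restoration after `event`."""
    return LAYERS[snapshot_level] - INVALIDATED_BY[event]
```

For example, `devolve("process", "gvisor_update")` leaves rootfs and homedir, while `devolve("process", "image_upgrade")` leaves homedir only.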

Failure Modes

SUSPENDED State Details

  • Resume: If the actor snapshot is missing, fails checksum validation, or otherwise proves unusable, the actor transitions to the CRASHED state.
  • Tagging: Existing tags trigger an error during assignment unless the -f flag is employed.
  • Revert Operations: Attempts to revert to a non-existent commit or tag result in a system error.
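The tagging and revert rules amount to a small keyed store with an overwrite guard. This is an illustrative sketch: `TagStore`, its methods, and the exception names are hypothetical, and `force=True` stands in for the -f flag.

```python
class TagExistsError(Exception):
    """Tag is already assigned and force was not requested."""


class UnknownRefError(Exception):
    """Revert target (tag) does not exist."""


class TagStore:
    def __init__(self) -> None:
        self._tags = {}  # tag name -> commit id

    def assign(self, tag: str, commit_id: str, force: bool = False) -> None:
        # An existing tag is an error unless the caller passes force (-f).
        if tag in self._tags and not force:
            raise TagExistsError(f"tag {tag!r} already exists; use -f to overwrite")
        self._tags[tag] = commit_id

    def resolve(self, tag: str) -> str:
        # Reverting to a non-existent tag is an error.
        try:
            return self._tags[tag]
        except KeyError:
            raise UnknownRefError(f"no such tag: {tag!r}") from None
```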

RUNNING State Details

  • Asynchronous Events (Containers & Kernel): Internal container failures force a transition to CRASHED, though the system attempts to preserve the rootfs and home directory. A gVisor kernel failure results in a CRASHED state without any data preservation.
  • Asynchronous Events (Workers): If the worker (ateom) crashes, the actor moves to CRASHED while saving the essential filesystem state, regardless of recovery speed. Permanent worker loss triggers the CRASHED state with metadata preservation. API-driven deletions put workers in a "deleting" phase until all actors are cleared, mirroring the autoscaler's scale-down logic.
  • Asynchronous Events (Host Nodes): Host node failures lead to a CRASHED state, whether the node returns quickly or only after prolonged downtime; in both cases the filesystem data (current rootfs) is retained for future recovery. Should a node never return, a controller is required to finalize the move to CRASHED.
  • Commit: Failures during snapshotting return an error, leaving the actor RUNNING. Persistent storage upload errors maintain the RUNNING state, though critical failures may result in a CRASHED status.
  • Revert To: Errors are thrown if the target commit identifier or tag is missing.
  • Pause: If snapshot creation fails, the actor remains in the RUNNING state and reports an error.
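The commit failure handling for a RUNNING actor can be sketched as a two-step operation that only changes state when both steps succeed. `commit_running` and its callable parameters are hypothetical; the "critical failure" path to CRASHED is not modeled because the issue does not define what counts as critical.

```python
from typing import Callable, Optional, Tuple


def commit_running(take_snapshot: Callable[[], bytes],
                   upload: Callable[[bytes], None]) -> Tuple[str, Optional[str]]:
    """Return (resulting state, error message or None) for a commit
    issued against a RUNNING actor."""
    try:
        snapshot = take_snapshot()
    except Exception as exc:
        # Snapshotting failed: report the error, the actor keeps running.
        return "RUNNING", f"snapshot failed: {exc}"
    try:
        upload(snapshot)
    except Exception as exc:
        # Upload to persistent storage failed: the actor also keeps running.
        return "RUNNING", f"upload failed: {exc}"
    return "SUSPENDED", None
```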

PAUSED State Details

  • Asynchronous Events: Node crashes result in a transition to CRASHED; however, since the state was local, it remains preserved if the node returns. Extended node failures follow the same CRASHED logic with the pre-saved local state. Total node loss requires controller intervention to move the actor to the CRASHED state.
  • Commit: If the underlying local data is missing, the actor enters the CRASHED state. Checksum failures on local data also trigger a transition to CRASHED. Persistent storage upload failures return an API error while maintaining the PAUSED status.
  • Revert To: If the target tag is invalid, an API error is returned and the state remains PAUSED.
  • Resume: Actor data missing from local storage results in a move to CRASHED. Corruption detected via checksum leads to a CRASHED state.
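The resume checks for a PAUSED actor could be implemented along these lines. The issue only says "checksum"; SHA-256 is an assumption here, as are the function name and the idea that the expected digest is recorded when the snapshot is taken.

```python
import hashlib
from typing import Optional


def resume_from_paused(local_data: Optional[bytes], expected_sha256: str) -> str:
    """Return the actor's next state when resuming from local storage."""
    if local_data is None:
        return "CRASHED"  # snapshot missing from local storage
    if hashlib.sha256(local_data).hexdigest() != expected_sha256:
        return "CRASHED"  # corruption detected via checksum
    return "RUNNING"
```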

CRASHED Recovery Options

  • Snapshot Dump: Execute ate actor dump <aid> <tag> to persist filesystem state (excluding memory) to a new tag; transitions to SUSPENDED at the prior commit.
  • State Commitment: Use ate actor commit <aid> to push the current head state to SUSPENDED.
  • Reversion: Running ate actor revert abandons the failed local data and returns the actor to a SUSPENDED state based on the previous stable commit.

Metadata

Labels

  • area/api: User-facing API changes
  • area/storage
  • kind/feature: An enhancement / feature request or implementation
  • prio/P0: Highest priority / required for next milestone
