This issue tracks the formal implementation of the actor state machine within the substrate.
To achieve high density and efficiency, we must distinguish between different "resting" states for actors, weighing the trade-offs between resume speed, data durability, and resource cost. The implementation must address a potential NIC saturation problem: naively uploading full memory and filesystem images over the network on every suspend/resume cycle would quickly overwhelm the host VM's network interface card (NIC). The state machine must therefore prioritize keeping data on the local host to maintain the required performance and density.
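To make the NIC concern concrete, here is a back-of-envelope sketch. Every number in it (actor count, snapshot size, suspend rate, link speed) is a hypothetical placeholder, not a measurement from the substrate, and `nic_egress_utilization` is an illustrative helper only.

```python
def nic_egress_utilization(actors: int, snapshot_gib: float,
                           suspends_per_hour: float, nic_gbit_per_s: float) -> float:
    """Fraction of NIC egress capacity used by uploading full snapshots.

    Counts only uploads on suspend; downloads on resume load the ingress
    direction by a comparable amount.
    """
    bits_per_suspend = snapshot_gib * 2**30 * 8
    bits_per_second = actors * suspends_per_hour * bits_per_suspend / 3600
    return bits_per_second / (nic_gbit_per_s * 1e9)


# Hypothetical host: 500 actors, 2 GiB per snapshot, 6 suspends per actor
# per hour, 10 Gbit/s NIC. A result above 1.0 means the NIC cannot keep up.
utilization = nic_egress_utilization(500, 2.0, 6, 10.0)
print(f"egress utilization: {utilization:.2f}")
```

Under these assumed numbers the uploads alone exceed the link capacity, which is the motivation for keeping PAUSED snapshots on the local node.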
The "story" of an actor follows a journey from active execution through various stages of dormancy or failure, driven by both user actions and system timeouts.
Actor State Definitions
| State | Description | Durability & Performance |
| --- | --- | --- |
| RUNNING | The actor is currently active and executing on a worker. | Active CPU/RAM usage. |
| PAUSED | The actor is offloaded from the CPU, with snapshots stored locally on the node VM. | Fast resume; lower durability (data lost if node fails); cheaper (local, already provisioned storage, no NIC); we can optimize this (e.g., save to peers opportunistically (TODO add issue link)). |
| SUSPENDED | The actor is offloaded from the CPU, and snapshots are persisted to durable storage. | Higher SLO/durability (e.g., saved to blob storage); expensive (external storage and NIC); requires retention policies; slower resume due to data transfer; can be symbolically tagged. |
| CRASHED | A failure state reached if the host node fails while an actor is RUNNING or PAUSED. | Requires manual recovery or revert to a known good state. |
Snapshot layers
[Diagram: treemap of the snapshot layers, with nodes "Actor's process", "memory", "local storage", "rootfs", "homedir", and "oci bundle".]
memory: a snapshot of an actor's process memory. It cannot be used for restoration if the actor's OCI image has changed (e.g., upgraded).
rootfs: a snapshot of the entire boot disk, with all files. It cannot be used for restoration if the actor's OCI image has changed (e.g., upgraded).
homedir: the files under the home directory. It can be used for restoration even if the OCI image has changed.
Actor Lifecycle Story
1. Create a new actor
Action: ate actor create [--from]
Mechanism: The system creates a new actor from either the default configuration or a tagged snapshot.
Transition: NONE → SUSPENDED
2. Active Execution to Temporary Pause
When a user needs to temporarily free up vCPU resources but expects a quick return, the actor moves to a PAUSED state.
Action: ate actor pause [--timer]
Mechanism: The system takes a snapshot and keeps it on the local node. If an actor remains paused for an extended period and --timer is provided, the system escalates it to a SUSPENDED state to ensure data safety and free up local node storage.
Transition: RUNNING → PAUSED
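The escalation rule can be sketched as a simple predicate. This is a minimal sketch that assumes `--timer` carries a duration in seconds; the issue does not say how the flag is parameterized, and `PausedActor` and `should_escalate` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PausedActor:
    actor_id: str
    paused_at: float               # seconds on any monotonic clock
    timer: Optional[float] = None  # assumed duration; None means --timer was not given


def should_escalate(actor: PausedActor, now: float) -> bool:
    """True when a PAUSED actor is due to be escalated to SUSPENDED."""
    if actor.timer is None:
        # Without --timer the actor stays PAUSED indefinitely.
        return False
    return now - actor.paused_at >= actor.timer
```

A controller loop could call `should_escalate` periodically and, when it returns True, upload the local snapshot to durable storage before releasing the local copy.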
3. Active Execution to Long-term Suspension
When a user needs to free up vCPU resources and does not expect a quick return, or needs to ensure data safety, the actor moves to a SUSPENDED state.
Action: ate actor commit [--tag]
Mechanism: The system takes a snapshot and moves it to durable, persistent storage. A tag can be attached to the snapshot, allowing a new actor to be created from this state later (fork) or the current actor to be reverted to it.
Transition: RUNNING → SUSPENDED
4. Resuming Operations
Actors can be brought back to a RUNNING state from either dormancy stage.
Action: ate actor resume
From Paused: The scheduler prefers the original node to avoid data transfer.
From Suspended: The system copies the latest snapshots to a node where a new available worker is running.
Transition: PAUSED/SUSPENDED → RUNNING
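The placement preference on resume might look like the sketch below. `Placement` and `place_for_resume` are hypothetical names, and the fallback for a PAUSED actor whose original node has no free worker (pick another node and transfer the snapshot) is an assumption; the issue only states that the original node is preferred.

```python
from typing import NamedTuple, Optional, Sequence


class Placement(NamedTuple):
    node: str
    needs_transfer: bool  # True when snapshot data must be copied to the node


def place_for_resume(state: str, original_node: Optional[str],
                     available_nodes: Sequence[str]) -> Optional[Placement]:
    """Pick a node for `ate actor resume`; None if no worker is available."""
    if state not in ("PAUSED", "SUSPENDED"):
        raise ValueError(f"cannot resume from {state}")
    if not available_nodes:
        return None
    if state == "PAUSED" and original_node in available_nodes:
        # The snapshot is already on this node's local storage.
        return Placement(original_node, needs_transfer=False)
    # SUSPENDED (or assumed PAUSED fallback): data has to be copied in.
    return Placement(available_nodes[0], needs_transfer=True)
```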
5. Handling Failures
If the underlying node fails or ateom crashes while the actor's state is only stored locally (RUNNING or PAUSED), the actor enters the CRASHED state.
Recovery: Users can use ate actor revert --tag to abandon the lost local data and resume from the last known durable snapshot.
Debugging: A user can run ate actor dump to transition an actor to the SUSPENDED state by dumping the current in-memory state while the node VM is still alive, allowing for future debugging.
Transition: RUNNING/PAUSED → CRASHED
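The lifecycle described in steps 1 to 5 can be summarized as a transition table. This is a sketch, not the substrate's implementation: the event names (`create`, `timer`, `node_failure`, and so on) are hypothetical labels for the actions and system events in the text, and the revert and dump rows follow the description that both land the actor in SUSPENDED.

```python
from enum import Enum


class State(Enum):
    NONE = "NONE"
    RUNNING = "RUNNING"
    PAUSED = "PAUSED"
    SUSPENDED = "SUSPENDED"
    CRASHED = "CRASHED"


class InvalidTransition(Exception):
    pass


# (current state, event) -> next state
TRANSITIONS = {
    (State.NONE, "create"): State.SUSPENDED,
    (State.RUNNING, "pause"): State.PAUSED,
    (State.RUNNING, "commit"): State.SUSPENDED,
    (State.PAUSED, "timer"): State.SUSPENDED,   # --timer escalation
    (State.PAUSED, "resume"): State.RUNNING,
    (State.SUSPENDED, "resume"): State.RUNNING,
    (State.RUNNING, "node_failure"): State.CRASHED,
    (State.PAUSED, "node_failure"): State.CRASHED,
    (State.CRASHED, "revert"): State.SUSPENDED,
    (State.CRASHED, "dump"): State.SUSPENDED,
}


def next_state(state: State, event: str) -> State:
    """Look up the next state, rejecting anything not in the table."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise InvalidTransition(f"{event!r} is not allowed in {state.value}") from None
```

Keeping the table as data makes it straightforward to reject undefined transitions, such as pausing an actor that is already SUSPENDED.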
Actor's Storage Snapshots configuration
The substrate manages three distinct but interrelated data types: homedir, rootfs, and process (memory+rootfs).
Specific system events necessitate the discarding of certain data layers - a process termed "devolving." For instance, updating an actor's OCI image invalidates both rootfs and memory, whereas a gVisor version update permits rootfs reuse while requiring a memory purge.
None:
Use Case: Ideal for ephemeral, high-speed startup requirements.
Note: For components like MCP, this effectively disables suspension.
Homedir only:
Use Case: Suitable for fast-starting agents capable of autonomous state recovery.
Persistence: This layer is designed to never devolve.
Process + Homedir:
Mechanism: Represents a comprehensive full process snapshot.
Use Case: Critical for slow-starting actors with significant RAM-resident state (e.g., actors with slow bootstrap entry point and/or actors with state in RAM).
Devolution Path: Reverts to rootfs + homedir on hardware/gVisor changes, and homedir only on image upgrades.
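The devolution rules above can be expressed as set subtraction over layers. This is a minimal sketch: the `Layer` and `Event` names and the `devolve` helper are hypothetical, and the mapping encodes only what the text states (image upgrades invalidate rootfs and memory; gVisor or hardware changes invalidate memory only).

```python
from enum import Enum


class Layer(Enum):
    MEMORY = "memory"
    ROOTFS = "rootfs"
    HOMEDIR = "homedir"


class Event(Enum):
    IMAGE_UPGRADE = "image_upgrade"
    GVISOR_UPDATE = "gvisor_update"
    HARDWARE_CHANGE = "hardware_change"


# Layers each event invalidates; homedir never appears, so it never devolves.
INVALIDATES = {
    Event.IMAGE_UPGRADE: {Layer.MEMORY, Layer.ROOTFS},
    Event.GVISOR_UPDATE: {Layer.MEMORY},
    Event.HARDWARE_CHANGE: {Layer.MEMORY},
}

# The three snapshot configurations, with process = memory + rootfs.
NONE_CONFIG = frozenset()
HOMEDIR_ONLY = frozenset({Layer.HOMEDIR})
PROCESS_HOMEDIR = frozenset({Layer.MEMORY, Layer.ROOTFS, Layer.HOMEDIR})


def devolve(layers: frozenset, event: Event) -> frozenset:
    """Return the layers that remain usable after the event."""
    return frozenset(layers) - INVALIDATES[event]
```

For example, `devolve(PROCESS_HOMEDIR, Event.GVISOR_UPDATE)` leaves rootfs and homedir, while `Event.IMAGE_UPGRADE` leaves homedir only.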
Failure Modes
SUSPENDED State Details
Resume: If the actor's snapshot is missing, fails checksum validation, or otherwise proves unusable, the actor transitions to the CRASHED state.
Tagging: Existing tags trigger an error during assignment unless the -f flag is employed.
Revert Operations: Attempts to revert to a non-existent commit or tag result in a system error.
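The tagging and revert rules can be illustrated with a small in-memory store. This is a sketch only: `TagStore`, its methods, and the exception names are hypothetical, and the `force` argument stands in for the -f flag.

```python
class TagExistsError(Exception):
    pass


class UnknownRefError(Exception):
    pass


class TagStore:
    """Maps symbolic tags to commit identifiers."""

    def __init__(self) -> None:
        self._tags: dict[str, str] = {}

    def assign(self, tag: str, commit_id: str, force: bool = False) -> None:
        # Re-assigning an existing tag is an error unless forced (-f).
        if tag in self._tags and not force:
            raise TagExistsError(tag)
        self._tags[tag] = commit_id

    def resolve(self, tag: str) -> str:
        # Reverting to a tag that does not exist is an error.
        try:
            return self._tags[tag]
        except KeyError:
            raise UnknownRefError(tag) from None
```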
RUNNING State Details
Asynchronous Events (Containers & Kernel): Internal container failures force a transition to CRASHED, though the system attempts to preserve the rootfs and home directory. A gVisor kernel failure results in a CRASHED state without any data preservation.
Asynchronous Events (Workers): If the worker (ateom) crashes, the actor moves to CRASHED while saving the essential filesystem state, regardless of recovery speed. Permanent worker loss triggers the CRASHED state with metadata preservation. API-driven deletions put workers in a "deleting" phase until all actors are cleared, mirroring the autoscaler's scale-down logic.
Asynchronous Events (Host Nodes): Host node failures lead to a CRASHED state; filesystem data is retained for future recovery. Prolonged node downtime necessitates a transition to CRASHED, saving the current rootfs. Should a node never return, a controller is required to finalize the move to CRASHED.
Commit: Failures during snapshotting return an error, leaving the actor RUNNING. Persistent storage upload errors maintain the RUNNING state, though critical failures may result in a CRASHED status.
Revert To: Errors are thrown if the target commit identifier or tag is missing.
Pause: If snapshot creation fails, the actor remains in the RUNNING state and reports an error.
PAUSED State Details
Asynchronous Events: Node crashes result in a transition to CRASHED; however, since the state was local, it remains preserved if the node returns. Extended node failures follow the same CRASHED logic with the pre-saved local state. Total node loss requires controller intervention to move the actor to the CRASHED state.
Commit: If the underlying local data is missing, the actor enters the CRASHED state. Checksum failures on local data also trigger a transition to CRASHED. Persistent storage upload failures return an API error while maintaining the PAUSED status.
Revert To: If the target tag is invalid, an API error is returned and the state remains PAUSED.
Resume: Actor data missing from local storage results in a move to CRASHED. Corruption detected via checksum leads to a CRASHED state.
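The resume checks for a PAUSED actor can be sketched as follows. The use of SHA-256 is an assumption (the issue only says "checksum"), and `state_after_resume` is a hypothetical helper that returns the resulting state name.

```python
import hashlib
from typing import Optional


def state_after_resume(local_data: Optional[bytes], expected_sha256: str) -> str:
    """Return the state a PAUSED actor ends up in after a resume attempt."""
    if local_data is None:
        # Snapshot missing from local storage.
        return "CRASHED"
    if hashlib.sha256(local_data).hexdigest() != expected_sha256:
        # Corruption detected via checksum.
        return "CRASHED"
    return "RUNNING"
```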
CRASHED Recovery Options
Snapshot Dump: Execute ate actor dump <aid> <tag> to persist filesystem state (excluding memory) to a new tag; transitions to SUSPENDED at the prior commit.
State Commitment: Use ate actor commit <aid> to push the current head state to SUSPENDED.
Reversion: Running ate actor revert abandons the failed local data and returns the actor to a SUSPENDED state based on the previous stable commit.