[Feature] Implementation of Actor State Machine: Lifecycle and Transitions #119

## Issue Description

This issue tracks the formal implementation of the actor state machine within the substrate. 

To achieve high density and efficiency, we must distinguish between different "resting" states for actors, weighing resume speed, data durability, and resource cost against each other. The implementation must specifically address potential saturation of the host VM's network interface card (NIC): a simplistic approach that uploads full memory and filesystem images over the network on every suspend/resume cycle would quickly overwhelm it. Consequently, the state machine must prioritize keeping data on the local host to maintain the required performance and density.

The "story" of an actor follows a journey from active execution through various stages of dormancy or failure, driven by both user actions and system timeouts.

## Actor State Definitions

| State | Description | Durability & Performance |
| :--- | :--- | :--- |
| **RUNNING** | The actor is currently active and executing on a worker. | Active CPU/RAM usage. |
| **PAUSED** | The actor is offloaded from the CPU, with snapshots stored locally on the node VM. | Fast resume; lower durability (data lost if node fails); cheaper (local, already-provisioned storage; no NIC traffic); can be optimized further, e.g., by opportunistically saving to peers (TODO add issue link). |
| **SUSPENDED** | The actor is offloaded from the CPU, and snapshots are persisted to durable storage. | Higher SLO/durability (e.g., saved to blob storage); expensive (external storage and NIC); requires retention policies; slower resume due to data transfer; can be symbolically tagged. |
| **CRASHED** | A failure state reached if the host node or worker (`ateom`) fails while an actor is Running or Paused. | Requires manual recovery or revert to a known good state. |

## Snapshot layers

```mermaid
---
config:
  treemap:
    showValues: false
---
treemap-beta
"Actor's process"
    "memory" : 40
    "local storage"
        "rootfs"
            "homedir" : 20
            "oci bundle" : 80:::noborder

classDef noborder stroke-width:0px;
```

- **memory** - a snapshot of an actor's process memory. A snapshot cannot be used for restoration if the OCI image of the actor has been changed (e.g., upgraded).
- **rootfs** - an entire boot disk snapshot with all files. A snapshot cannot be used for restoration if the OCI image of the actor has been changed (e.g., upgraded).
- **homedir** - files under the home directory. This can be used for restoration even if the OCI image has been changed.


## Actor Lifecycle Story

```mermaid
---
title: Actor State (`ate actor ...`)
---
stateDiagram-v2
direction LR

classDef crashedEvent fill:#f00,font-weight:bold,stroke-width:2px

[*] --> Suspended : create --from

Suspended --> Running : resume
Suspended --> Suspended : commit --tag | revert --to
Suspended --> [*] : delete

Running --> Running : commit --remain [--tag]
Running --> Paused : pause [--timer]
Running --> Suspended : commit [--tag] | revert --to
Running --> Crashed : [VM crashed | ateom crashed | etc...]

Paused --> Running : resume
Paused --> Suspended : commit [--tag] | revert --to | (auto-commit-timeout)
Paused --> Paused : commit --remain [--tag]
Paused --> Crashed : [VM crashed | ateom crashed | etc...]
Paused --> [*] : delete

Crashed --> Suspended : revert --to | dump | commit --tag
Crashed --> [*] : delete

class Crashed crashedEvent
class end crashedEvent
```
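The diagram above can be read as a lookup table from (state, event) to next state. The following is a minimal Python sketch of that reading, not the substrate's implementation: the `State` enum, the event labels (`"commit_remain"`, `"crash"`, and so on), and the `next_state` helper are all hypothetical names chosen for illustration. Only the edges drawn in the diagram are encoded.

```python
from enum import Enum


class State(Enum):
    NONE = "NONE"  # the actor does not exist (the [*] node in the diagram)
    RUNNING = "RUNNING"
    PAUSED = "PAUSED"
    SUSPENDED = "SUSPENDED"
    CRASHED = "CRASHED"


# (current state, event) -> next state, transcribed from the diagram.
# Event names are illustrative labels, not confirmed CLI or API identifiers.
TRANSITIONS = {
    (State.NONE, "create"): State.SUSPENDED,
    (State.SUSPENDED, "resume"): State.RUNNING,
    (State.SUSPENDED, "commit"): State.SUSPENDED,
    (State.SUSPENDED, "revert"): State.SUSPENDED,
    (State.SUSPENDED, "delete"): State.NONE,
    (State.RUNNING, "commit_remain"): State.RUNNING,
    (State.RUNNING, "pause"): State.PAUSED,
    (State.RUNNING, "commit"): State.SUSPENDED,
    (State.RUNNING, "revert"): State.SUSPENDED,
    (State.RUNNING, "crash"): State.CRASHED,
    (State.PAUSED, "resume"): State.RUNNING,
    (State.PAUSED, "commit"): State.SUSPENDED,
    (State.PAUSED, "revert"): State.SUSPENDED,
    (State.PAUSED, "auto_commit_timeout"): State.SUSPENDED,
    (State.PAUSED, "commit_remain"): State.PAUSED,
    (State.PAUSED, "crash"): State.CRASHED,
    (State.PAUSED, "delete"): State.NONE,
    (State.CRASHED, "revert"): State.SUSPENDED,
    (State.CRASHED, "dump"): State.SUSPENDED,
    (State.CRASHED, "commit"): State.SUSPENDED,
    (State.CRASHED, "delete"): State.NONE,
}


class InvalidTransition(Exception):
    """Raised when an event is not allowed in the current state."""


def next_state(current: State, event: str) -> State:
    """Look up the next state, rejecting edges the diagram does not draw."""
    try:
        return TRANSITIONS[(current, event)]
    except KeyError:
        raise InvalidTransition(
            f"{event!r} is not allowed in state {current.value}"
        ) from None
```

Keeping the edges in one table makes gaps easy to spot. For example, the diagram has no `delete` edge out of Running, so `next_state(State.RUNNING, "delete")` raises `InvalidTransition` in this sketch.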


### 1. Create a new actor
* **Action:** `ate actor create [--from]`
* **Mechanism:** The system creates a new actor from either the default configuration or a tagged snapshot.
* **Transition:** `NONE → SUSPENDED`

### 2. Active Execution to Temporary Pause
When a user needs to temporarily free up vCPU resources but expects a quick return, the actor moves to a `PAUSED` state.
* **Action:** `ate actor pause [--timer]`
* **Mechanism:** The system takes a snapshot and keeps it on the local node. If `--timer` is provided and the actor is still paused when the timer expires, the system escalates it to the `SUSPENDED` state to ensure data safety and free up local node storage.
* **Transition:** `RUNNING → PAUSED`
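The timer-driven escalation from `PAUSED` to `SUSPENDED` could be checked along these lines. This is a sketch under stated assumptions: the function name, its parameters, and the idea that `--timer` maps to a duration are hypothetical, and it assumes that an actor paused without `--timer` is never escalated automatically.

```python
from datetime import datetime, timedelta
from typing import Optional


def should_auto_commit(
    paused_at: datetime,
    timer: Optional[timedelta],
    now: datetime,
) -> bool:
    """Return True when a PAUSED actor should be escalated to SUSPENDED."""
    if timer is None:
        # Assumption: without --timer the actor stays PAUSED until a user acts.
        return False
    # Escalate once the actor has been paused for at least the timer duration.
    return now - paused_at >= timer
```

A controller loop could call this periodically for each paused actor and trigger the auto-commit transition when it returns `True`.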

### 3. Active Execution to Long-term Suspension
When a user needs to free up vCPU resources and does not expect a quick return, or needs to ensure data safety, the actor moves to a `SUSPENDED` state.
* **Action:** `ate actor commit [--tag]`
* **Mechanism:** The system takes a snapshot and moves it to durable, persistent storage. A tag can be attached to the snapshot state, allowing the future creation of a new actor from this state (fork) or reverting the current actor to this state.
* **Transition:** `RUNNING → SUSPENDED`

### 4. Resuming Operations
Actors can be brought back to a `RUNNING` state from either dormancy stage.
* **Action:** `ate actor resume`
* **From Paused:** The scheduler prefers the original node to avoid data transfer.
* **From Suspended:** The system copies the latest snapshots to a node where a new available worker is running.
* **Transition:** `PAUSED/SUSPENDED → RUNNING`
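The two resume paths differ mainly in whether snapshot data has to move. A minimal sketch of that placement decision follows; `pick_resume_node` and its parameters are hypothetical, and the fallback for a paused actor whose original node has no free worker is an assumption, since the text above only says the scheduler "prefers" the original node.

```python
from typing import Optional, Sequence, Tuple


def pick_resume_node(
    state: str,
    origin_node: Optional[str],
    nodes_with_free_worker: Sequence[str],
) -> Tuple[Optional[str], bool]:
    """Return (target_node, needs_transfer) for a resume request."""
    if not nodes_with_free_worker:
        # No worker is available anywhere, so nothing can be scheduled yet.
        return None, False
    if state == "PAUSED" and origin_node in nodes_with_free_worker:
        # The snapshot is already on this node's local storage: no copy needed.
        return origin_node, False
    # SUSPENDED, or PAUSED with the original node unavailable (assumed fallback):
    # the snapshot has to be copied to whichever node is chosen.
    return nodes_with_free_worker[0], True
```

Returning the `needs_transfer` flag alongside the node keeps the NIC cost of a placement visible to the caller.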

### 5. Handling Failures
If the underlying node fails or ateom crashes while the actor's state is only stored locally (Running or Paused), the actor enters the `CRASHED` state.
* **Recovery:** Users can use `ate actor revert --to` to abandon the lost local data and return the actor to the last known durable snapshot.
* **Debugging:** A user can run `ate actor dump` to transition an actor to the `SUSPENDED` state by dumping the current in-memory state while the node VM is still alive, allowing for future debugging.
* **Transition:** `RUNNING/PAUSED → CRASHED`

## Actor’s Storage Snapshots configuration

```yaml
snapshotConfig:
  onPause: (none | homedir | process)
  onCommit: (none | homedir | process)
```

The substrate manages three distinct but interrelated data types: **homedir**, **rootfs**, and **process** (**memory**+**rootfs**).

Specific system events necessitate the discarding of certain data layers - a process termed "devolving." For instance, updating an actor's OCI image invalidates both rootfs and memory, whereas a gVisor version update permits rootfs reuse while requiring a memory purge.

- **None:**
  - **Use Case**: Ideal for ephemeral, high-speed startup requirements.
  - **Note**: For components like MCP, this effectively disables suspension.
- **Homedir only:**
  - **Use Case**: Suitable for fast-starting agents capable of autonomous state recovery.
  - **Persistence**: This layer is designed to never devolve.
- **Process + Homedir:**
  - **Mechanism**: A full process snapshot (**memory** + **rootfs**) taken together with the home directory.
  - **Use Case**: Critical for slow-starting actors with significant RAM-resident state (e.g., actors with slow bootstrap entry point and/or actors with state in RAM).
  - **Devolution Path**: Devolves to **rootfs** + **homedir** on hardware/gVisor changes, and to **homedir** only on image upgrades.
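The devolution rules above amount to subtracting the layers an event invalidates from the layers a snapshot mode captured. The following is a small sketch of that rule, not the substrate's code: the `devolve` function and the event labels (`"image_upgrade"`, `"gvisor_update"`, `"hardware_change"`) are hypothetical names.

```python
from typing import FrozenSet

# Layers captured by each snapshotConfig value. "process" is memory + rootfs,
# and the home directory is kept alongside it.
CAPTURED = {
    "none": frozenset(),
    "homedir": frozenset({"homedir"}),
    "process": frozenset({"memory", "rootfs", "homedir"}),
}

# Layers each event invalidates. Event names are hypothetical labels.
INVALIDATED_BY = {
    "image_upgrade": frozenset({"memory", "rootfs"}),
    "gvisor_update": frozenset({"memory"}),
    "hardware_change": frozenset({"memory"}),
}


def devolve(mode: str, event: str) -> FrozenSet[str]:
    """Return the layers that remain usable for restore after `event`."""
    return CAPTURED[mode] - INVALIDATED_BY[event]
```

Because no event in the table lists `homedir`, that layer survives every event, which matches the statement that it is designed to never devolve.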


## Failure Modes

### **SUSPENDED** State Details

* **Resume:** If the actor snapshot is missing, fails checksum validation, or otherwise proves unusable, the system transitions the actor to the **CRASHED** state.
* **Tagging:** Existing tags trigger an error during assignment unless the `-f` flag is employed.
* **Revert Operations:** Attempts to revert to a non-existent commit or tag result in a system error.
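The tagging and revert rules above can be sketched as two small checks. This is illustrative only: the in-memory `tags` mapping, the `commits` set, the exception names, and both helper functions are hypothetical, and `force` stands in for the `-f` flag.

```python
from typing import Dict, Set


class TagExists(Exception):
    """Raised when a tag is already assigned and force was not requested."""


class UnknownTarget(Exception):
    """Raised when a revert target is neither a known tag nor a known commit."""


def assign_tag(tags: Dict[str, str], tag: str, commit_id: str, force: bool = False) -> None:
    """Point `tag` at `commit_id`; overwrite an existing tag only with force."""
    if tag in tags and not force:
        raise TagExists(f"tag {tag!r} already exists; pass force to move it")
    tags[tag] = commit_id


def resolve_revert_target(tags: Dict[str, str], commits: Set[str], target: str) -> str:
    """Resolve a tag or commit identifier to a commit, or fail loudly."""
    if target in tags:
        return tags[target]
    if target in commits:
        return target
    raise UnknownTarget(f"no commit or tag named {target!r}")
```

In both error cases the actor's state is left untouched, so a failed tag or revert request would leave the actor **SUSPENDED**.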

### **RUNNING** State Details

* **Asynchronous Events (Containers & Kernel):** Internal container failures force a transition to **CRASHED**, though the system attempts to preserve the `rootfs` and home directory. A gVisor kernel failure results in a **CRASHED** state without any data preservation.
* **Asynchronous Events (Workers):** If the worker (`ateom`) crashes, the actor moves to **CRASHED** and the essential filesystem state is saved, regardless of how quickly the worker recovers. Permanent worker loss triggers the **CRASHED** state with metadata preservation. API-driven deletions put workers in a "deleting" phase until all actors are cleared, mirroring the autoscaler's scale-down logic.
* **Asynchronous Events (Host Nodes):** Host node failures lead to a **CRASHED** state; filesystem data is retained for future recovery. Prolonged node downtime necessitates a transition to **CRASHED**, saving the current `rootfs`. Should a node never return, a controller is required to finalize the move to **CRASHED**.
* **Commit:** Failures during snapshotting return an error, leaving the actor **RUNNING**. Persistent storage upload errors maintain the **RUNNING** state, though critical failures may result in a **CRASHED** status.
* **Revert To:** Errors are thrown if the target commit identifier or tag is missing.
* **Pause:** If snapshot creation fails, the actor remains in the **RUNNING** state and reports an error.

### **PAUSED** State Details

* **Asynchronous Events:** Node crashes result in a transition to **CRASHED**; however, since the state was local, it remains preserved if the node returns. Extended node failures follow the same **CRASHED** logic with the pre-saved local state. Total node loss requires controller intervention to move the actor to the **CRASHED** state.
* **Commit:** If the underlying local data is missing, the actor enters the **CRASHED** state. Checksum failures on local data also trigger a transition to **CRASHED**. Persistent storage upload failures return an API error while maintaining the **PAUSED** status.
* **Revert To:** If the target tag is invalid, an API error is returned and the state remains **PAUSED**.
* **Resume:** Actor data missing from local storage results in a move to **CRASHED**. Corruption detected via checksum leads to a **CRASHED** state.
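The missing-data and checksum rules for resuming a paused actor could look like the sketch below. It assumes the snapshot is a single local file and that the checksum is SHA-256; neither is stated in this issue, and `state_after_local_resume` is a hypothetical helper.

```python
import hashlib
from pathlib import Path


def state_after_local_resume(snapshot: Path, expected_sha256: str) -> str:
    """Return the state a PAUSED actor ends up in after a resume attempt."""
    if not snapshot.is_file():
        # Actor data is missing from local storage.
        return "CRASHED"
    digest = hashlib.sha256()
    with snapshot.open("rb") as f:
        # Hash in 1 MiB chunks so large snapshots are not loaded into RAM at once.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    if digest.hexdigest() != expected_sha256:
        # Corruption detected via checksum.
        return "CRASHED"
    return "RUNNING"
```

The same check would apply before a commit from **PAUSED**, where missing or corrupted local data also leads to **CRASHED**.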

### **CRASHED** Recovery Options

* **Snapshot Dump:** Execute `ate actor dump <aid> <tag>` to persist filesystem state (excluding memory) to a new tag; transitions to **SUSPENDED** at the prior commit.
* **State Commitment:** Use `ate actor commit <aid>` to push the current head state to **SUSPENDED**.
* **Reversion:** Running `ate actor revert` abandons the failed local data and returns the actor to a **SUSPENDED** state based on the previous stable commit.

