
Concurrent Failure Semantics

coo1white edited this page Jun 11, 2026 · 2 revisions

Shipped in v0.1.77 (Track 2). The hard part of going concurrent is not "run N at once" — it is how the batch collapses when some agents fail, hang, or return garbage, and whether the recorded state can still answer "who passed / who failed" on replay.

What runs concurrently

A phase authored with parallel() (DSL) carries mode: "parallel" onto the run state. The drive loop derives its round width from that phase, bounded by limits.maxConcurrentAgents, on every shipping surface (run --drive, quickstart), with no hidden flag. A plain phase() stays sequential, so existing apps are unchanged. (The on-ramp landed in v0.1.78; before it, only tests passed a width.)
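The width rule above can be sketched as a small pure function. This is an illustration under assumptions, not the project's code: only mode: "parallel" and limits.maxConcurrentAgents come from the text, while `roundWidth` and `pendingTasks` are hypothetical names.

```javascript
// Hypothetical sketch: derive the round width from the phase on the run state.
// Only `mode` and `limits.maxConcurrentAgents` are named in the text above;
// `roundWidth` and `pendingTasks` are illustrative assumptions.
function roundWidth(phase, limits) {
  // A plain phase() carries no parallel mode, so it runs one agent at a time.
  if (phase.mode !== "parallel") return 1;
  const cap = Math.max(1, limits.maxConcurrentAgents);
  // Never exceed the configured cap or the number of tasks left to run.
  return Math.max(1, Math.min(phase.pendingTasks, cap));
}

const limits = { maxConcurrentAgents: 4 };
const widthParallel = roundWidth({ mode: "parallel", pendingTasks: 16 }, limits);
const widthPlain = roundWidth({ pendingTasks: 16 }, limits);
console.log(widthParallel, widthPlain); // 4 1
```

Because the width is read from the phase itself, a caller has no separate switch to forget: the authored phase is the only input besides the limit.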

How the batch executes

A single Node batch-delegate child, launched with spawnSync, spawns all of the round's agents concurrently and returns each outcome. The CW parent stays fully synchronous, so the public drive() API is unchanged. Each outcome then settles through the same runBackend path the serial loop uses (via an internal preparedAgentOutcome), so envelopes, refusals, and accept-time gates are identical by construction: the concurrent and serial paths cannot drift.
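A minimal sketch of the "synchronous parent, concurrent child" shape, assuming nothing about the real delegate beyond its use of spawnSync. The child source, the job shape ({ id, ms, fail }), and `runRoundSync` are hypothetical stand-ins; timers stand in for real agents.

```javascript
const { spawnSync } = require("node:child_process");

// Source of a stand-in batch child. It reads the round's jobs from stdin,
// runs them all concurrently, and prints one JSON array of outcomes.
// The job shape ({ id, ms, fail }) is a hypothetical stand-in for real agents.
const BATCH_CHILD = `
const jobs = JSON.parse(require("node:fs").readFileSync(0, "utf8"));
const runJob = (job) =>
  new Promise((resolve) =>
    setTimeout(() => resolve({ id: job.id, ok: !job.fail }), job.ms));
// Promise.all resolves in input order, whatever order the timers fire in.
Promise.all(jobs.map(runJob)).then((outcomes) =>
  process.stdout.write(JSON.stringify(outcomes)));
`;

// The parent blocks on spawnSync, so its caller sees an ordinary
// synchronous function even though the jobs overlapped in time.
function runRoundSync(jobs) {
  const child = spawnSync(process.execPath, ["-e", BATCH_CHILD], {
    input: JSON.stringify(jobs),
    encoding: "utf8",
  });
  if (child.error || child.status !== 0) {
    throw new Error("batch child failed: " + (child.error?.message ?? child.stderr));
  }
  return JSON.parse(child.stdout);
}

const outcomes = runRoundSync([
  { id: "a", ms: 60 },
  { id: "b", ms: 10, fail: true },
  { id: "c", ms: 30 },
]);
console.log(outcomes);
```

The failing job "b" finishes first but does not stop "a" or "c". The parent only ever sees the complete array, which it could then settle one outcome at a time through whatever path the serial loop uses.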

Results are recorded in deterministic batch task order, regardless of the wall-clock order the agents finished in, so replay stays byte-stable.
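To illustrate why task-order recording keeps replay stable, the sketch below records the same batch twice with different finish orders and compares the serialized result. The record shape and `recordInTaskOrder` are assumptions made for this example.

```javascript
// Hypothetical sketch: completions arrive in finish order, each tagged with
// the index of its task in the batch. Recording sorts by that index.
function recordInTaskOrder(completions) {
  // Copy before sorting so the caller's finish-order array is untouched.
  return [...completions].sort((a, b) => a.taskIndex - b.taskIndex);
}

// Two runs of the same batch that happened to finish in different orders.
const runA = [
  { taskIndex: 2, id: "c", status: "accepted" },
  { taskIndex: 0, id: "a", status: "accepted" },
  { taskIndex: 1, id: "b", status: "parked" },
];
const runB = [runA[1], runA[2], runA[0]];

const recordedA = JSON.stringify(recordInTaskOrder(runA));
const recordedB = JSON.stringify(recordInTaskOrder(runB));
console.log(recordedA === recordedB); // true
```

Had the records been appended in finish order instead, the two serialized strings would differ even though the same hops passed and failed.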

The two locked decisions

  • Collect-all: a failing hop never aborts its siblings. Every hop in the round settles and is recorded; a failure only blocks the next round via the existing phase gate. (The alternative — fail-fast — would discard finished work.)
  • Kill-on-timeout + count: the batch child sends SIGTERM to a hung agent at its per-job deadline (and SIGKILL after a grace period), and the hang counts as one failure through the same retry/park path as a crash. A wedged batch child cannot deadlock the drive: a parent backstop timeout fails every job closed as a spawn refusal, never as a fabricated completion.
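The parent backstop can be sketched with the `timeout` option of spawnSync. This covers only the parent side; the per-job SIGTERM/SIGKILL escalation inside the batch child is not shown. The function name, the "refused" status label, the reason strings, and the choice of SIGKILL for the backstop are assumptions of this sketch.

```javascript
const { spawnSync } = require("node:child_process");

// Stand-in for a wedged batch child: it never writes outcomes and never exits.
const WEDGED_CHILD = "setInterval(() => {}, 1000);";

// Hypothetical parent-side backstop. If the batch child does not hand back
// usable outcomes within backstopMs, every job in the round fails closed.
function runRoundWithBackstop(jobs, childSource, backstopMs) {
  const child = spawnSync(process.execPath, ["-e", childSource], {
    input: JSON.stringify(jobs),
    encoding: "utf8",
    timeout: backstopMs, // spawnSync kills the child once this elapses
    killSignal: "SIGKILL", // assumption: a wedged child may ignore SIGTERM
  });
  const refuseAll = (reason) =>
    jobs.map((job) => ({ id: job.id, status: "refused", reason }));

  if (child.error) {
    const timedOut = child.error.code === "ETIMEDOUT";
    return refuseAll(timedOut ? "batch child timed out" : "batch child failed to run");
  }
  if (child.status !== 0) return refuseAll("batch child exited abnormally");
  try {
    return JSON.parse(child.stdout);
  } catch {
    // Unparseable output is never promoted to a completion.
    return refuseAll("batch child returned unreadable output");
  }
}

const refused = runRoundWithBackstop([{ id: "a" }, { id: "b" }], WEDGED_CHILD, 300);
console.log(refused);
```

Every branch that is not a clean parse ends in `refuseAll`, so no failure path can yield a job that looks completed.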

Acceptance (the build-map criterion, now a CI smoke)

concurrent-failure-semantics-smoke: 16 agents with a forced hang + crash + dirty-return. Asserts: no deadlock (wall-clock bounded well under the serial floor), all 13 good hops accepted in the same round as the 3 failures, each failure parked with its own recorded reason, deterministic record order, and the persisted state replays who passed / who failed completely with no disk corruption.
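The shape of those assertions can be shown on simulated data. This is not the CI smoke itself: the failing indices, the record fields, and `settleRound` are hypothetical, and no real agents are spawned.

```javascript
// Hypothetical fixture mirroring the smoke's shape: 16 agents, three of which
// are forced to hang, crash, and return dirty output respectively.
const FORCED = { 3: "hang", 7: "crash", 11: "dirty-return" };

// Simulated collect-all settle: every hop gets a record, in task order.
function settleRound(agentCount) {
  const records = [];
  for (let i = 0; i < agentCount; i++) {
    const failure = FORCED[i];
    records.push(
      failure
        ? { taskIndex: i, status: "parked", reason: failure }
        : { taskIndex: i, status: "accepted" }
    );
  }
  return records;
}

const recorded = settleRound(16);

// Replay: answer "who passed / who failed" from the persisted form alone.
const persisted = JSON.stringify(recorded);
const replayed = JSON.parse(persisted);

const accepted = replayed.filter((r) => r.status === "accepted");
const parked = replayed.filter((r) => r.status === "parked");
const reasons = new Set(parked.map((r) => r.reason));
console.log(accepted.length, parked.length, reasons.size); // 13 3 3
```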

See Execution Backends for the delegation envelope and Control-Plane Scheduling for retry/park.
