Concurrent Failure Semantics
Shipped in v0.1.77 (Track 2). The hard part of going concurrent is not "run N at once" — it is how the batch collapses when some agents fail, hang, or return garbage, and whether the recorded state can still answer "who passed / who failed" on replay.
A phase authored with parallel() (DSL) carries mode: "parallel" onto the run
state. The drive loop derives its round width from that phase, bounded by
limits.maxConcurrentAgents, through every shipping surface (run --drive,
quickstart) — no hidden flag. A plain phase() stays sequential, so existing
apps are unchanged. (The on-ramp landed in v0.1.78; before it, only tests passed a
width.)
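The width derivation described above can be sketched as follows. This is a minimal illustration, not the project's code: the `PhaseState` and `Limits` shapes, the `taskCount` field, and the `roundWidth` helper are hypothetical names. Only `mode: "parallel"` and `limits.maxConcurrentAgents` come from the text.

```typescript
// Hypothetical run-state shapes for illustration; the real schema may differ.
interface PhaseState {
  // Set when the phase was authored with parallel(); absent for a plain phase().
  mode?: "parallel";
  // Assumed field: how many agent tasks the phase wants to run.
  taskCount: number;
}

interface Limits {
  maxConcurrentAgents: number;
}

// Derive how many agents one round may run at once.
function roundWidth(phase: PhaseState, limits: Limits): number {
  // A plain phase stays sequential: one agent per round.
  if (phase.mode !== "parallel") return 1;
  // Guard against a zero or fractional limit, then cap by the task count.
  const cap = Math.max(1, Math.floor(limits.maxConcurrentAgents));
  return Math.max(1, Math.min(phase.taskCount, cap));
}

console.log(roundWidth({ mode: "parallel", taskCount: 16 }, { maxConcurrentAgents: 4 })); // 4
console.log(roundWidth({ taskCount: 16 }, { maxConcurrentAgents: 4 })); // 1
```

Because the width is a pure function of the recorded phase and the configured limit, no surface needs its own flag to opt in.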
One spawnSync'd Node batch delegate child spawns all of the round's agents
concurrently and returns each outcome. The CW parent stays fully synchronous, so
the public drive() API is unchanged. Each outcome then settles through the
same runBackend path the serial loop uses (via an internal
preparedAgentOutcome), so envelopes, refusals, and accept-time gates are
identical by construction — the concurrent and serial paths cannot drift.
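The "synchronous parent, concurrent child" shape can be illustrated with Node's own `spawnSync`. This is a sketch under stated assumptions rather than the actual delegate: the `Job` and `Outcome` shapes, the `runRoundSync` helper, and the inline child program are hypothetical, and the agents are simulated with timers instead of real processes.

```typescript
import { spawnSync } from "node:child_process";

// Hypothetical job and outcome shapes; the real delegation envelope is not shown here.
interface Job { id: string; delayMs: number }
interface Outcome { id: string; ok: boolean }

// Child program: read the job list from stdin, start every job at once,
// and print all outcomes as one JSON array when the whole round has settled.
const childSource = `
const jobs = JSON.parse(require("fs").readFileSync(0, "utf8"));
Promise.all(
  jobs.map((job) => new Promise((resolve) => {
    setTimeout(() => resolve({ id: job.id, ok: true }), job.delayMs);
  }))
).then((outcomes) => process.stdout.write(JSON.stringify(outcomes)));
`;

// The parent blocks until the child exits, so the caller sees a plain synchronous call.
function runRoundSync(jobs: Job[]): Outcome[] {
  const child = spawnSync(process.execPath, ["-e", childSource], {
    input: JSON.stringify(jobs),
    encoding: "utf8",
  });
  // This sketch simply throws on a broken child; it does not model retry/park.
  if (child.error || child.status !== 0) {
    throw new Error(`batch child failed: ${child.error?.message ?? child.stderr}`);
  }
  return JSON.parse(child.stdout) as Outcome[];
}

const outcomes = runRoundSync([
  { id: "a", delayMs: 50 },
  { id: "b", delayMs: 10 },
]);
console.log(outcomes.map((o) => o.id));
```

The concurrency lives entirely inside the child process, which is why a synchronous public API can stay unchanged while the round still overlaps its agents.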
Results are recorded in deterministic batch task order, regardless of the wall-clock order the agents finished in, so replay stays byte-stable.
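A small sketch of why batch task order makes the record stable: two runs whose agents finish in different wall-clock orders produce the same serialized record once results are sorted by their position in the batch. The `HopResult` shape and the `recordInBatchOrder` helper are hypothetical names used only for illustration.

```typescript
// Hypothetical result shape; taskIndex is the hop's position in the authored batch.
interface HopResult { taskIndex: number; agent: string; accepted: boolean }

const hop = (taskIndex: number, agent: string, accepted: boolean): HopResult =>
  ({ taskIndex, agent, accepted });

// Record by batch position, ignoring the order in which results arrived.
function recordInBatchOrder(finished: HopResult[]): HopResult[] {
  return [...finished].sort((a, b) => a.taskIndex - b.taskIndex);
}

// The same three hops, finishing in two different wall-clock orders.
const runA = [hop(2, "c", true), hop(0, "a", true), hop(1, "b", false)];
const runB = [hop(1, "b", false), hop(2, "c", true), hop(0, "a", true)];

const recordA = JSON.stringify(recordInBatchOrder(runA));
const recordB = JSON.stringify(recordInBatchOrder(runB));
console.log(recordA === recordB); // true
```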
- Collect-all: a failing hop never aborts its siblings. Every hop in the round settles and is recorded; a failure only blocks the next round via the existing phase gate. (The alternative — fail-fast — would discard finished work.)
- Kill-on-timeout + count: a hung agent is SIGTERM'd at its per-job deadline (SIGKILL after a grace period) by the batch child, and counted as one failure through the same retry/park path as a crash. A wedged batch child cannot deadlock the drive: a parent backstop timeout fails every job closed as a spawn refusal, never a fabricated completion.
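The two guarantees above can be sketched as one settle step. This is an illustrative model, not the project's implementation: the `RawResult` variants, the `Settled` record, the reason strings, and the `settleRound` helper are all assumed names. Passing `null` stands in for the parent backstop firing because the batch child never reported.

```typescript
// Hypothetical raw results reported by the batch child for each hop.
type RawResult =
  | { kind: "ok" }
  | { kind: "crash"; exitCode: number }
  | { kind: "timeout"; signal: "SIGTERM" | "SIGKILL" }
  | { kind: "dirty"; detail: string };

interface Settled { taskIndex: number; accepted: boolean; reason?: string }

// Every failure kind maps to its own recorded reason; success has none.
function reasonFor(result: RawResult): string | undefined {
  switch (result.kind) {
    case "ok": return undefined;
    case "crash": return `crash: exit code ${result.exitCode}`;
    case "timeout": return `timeout: killed with ${result.signal}`;
    case "dirty": return `dirty return: ${result.detail}`;
  }
}

function settleRound(results: RawResult[] | null, jobCount: number) {
  const settled: Settled[] = results === null
    // Backstop fired: fail every job closed, never invent a completion.
    ? Array.from({ length: jobCount }, (_, taskIndex) => ({
        taskIndex, accepted: false, reason: "spawn refusal: batch child timed out",
      }))
    // Collect-all: every hop is settled and recorded, failing or not.
    : results.map((result, taskIndex) => {
        const reason = reasonFor(result);
        return reason === undefined
          ? { taskIndex, accepted: true }
          : { taskIndex, accepted: false, reason };
      });
  // A failure only gates the next round; it never discards a sibling's result.
  return { settled, nextRoundBlocked: settled.some((s) => !s.accepted) };
}

const round = settleRound(
  [{ kind: "ok" }, { kind: "timeout", signal: "SIGKILL" }, { kind: "crash", exitCode: 1 }, { kind: "ok" }],
  4,
);
console.log(round.settled.filter((s) => s.accepted).length, round.nextRoundBlocked); // 2 true
```

Treating a hang as just another failure kind keeps the retry/park decision in one place, while the `null` branch shows the fail-closed default when nothing trustworthy came back.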
These per-round guarantees also hold when the round is one turn of a bounded loop(...) phase: if a loop round's tasks are authored parallel, each round settles through the same collect-all, kill-on-timeout, and deterministic batch-order path, so a hung or dirty hop blocks only the next round, never its siblings.
concurrent-failure-semantics-smoke: 16 agents with a forced hang + crash +
dirty-return. Asserts: no deadlock (wall-clock bounded well under the serial
floor), all 13 good hops accepted in the same round as the 3 failures, each
failure parked with its own recorded reason, deterministic record order, and the
persisted state replays who passed / who failed completely with no disk
corruption.
See Execution Backends for the delegation envelope and Control-Plane Scheduling for retry/park.
Organized from local Obsidian notes and reconciled with the current
coo1white/cool-workflow repository state.