Concurrent Failure Semantics

Shipped in v0.1.77 (Track 2). The hard part of going concurrent is not "run N at once" — it is how the batch collapses when some agents fail, hang, or return garbage, and whether the recorded state can still answer "who passed / who failed" on replay.

What runs concurrently

A phase authored with parallel() (DSL) carries mode: "parallel" onto the run state. The drive loop derives its round width from that phase, bounded by limits.maxConcurrentAgents, through every shipping surface (run --drive, quickstart) — no hidden flag. A plain phase() stays sequential, so existing apps are unchanged. (The on-ramp landed in v0.1.78; before it, only tests passed a width.)
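The width derivation described above can be sketched roughly as follows. This is a minimal illustration, not the project's implementation: the Phase and Limits shapes, the "sequential" mode value, and the roundWidth and planRounds helpers are all hypothetical names. Only mode: "parallel" and limits.maxConcurrentAgents come from the text.

```typescript
// Hypothetical shapes; only `mode: "parallel"` and `maxConcurrentAgents`
// are named in the text, everything else is assumed for illustration.
type PhaseMode = "parallel" | "sequential";

interface Phase {
  mode: PhaseMode;
  tasks: string[];
}

interface Limits {
  maxConcurrentAgents: number;
}

// Width of one drive-loop round: 1 for a sequential phase, otherwise the
// number of tasks in the phase capped by the configured limit.
function roundWidth(phase: Phase, limits: Limits): number {
  if (phase.mode !== "parallel") return 1;
  const cap = Math.max(1, limits.maxConcurrentAgents);
  return Math.max(1, Math.min(phase.tasks.length, cap));
}

// Split a phase's tasks into consecutive rounds of at most `width` tasks.
function planRounds(phase: Phase, limits: Limits): string[][] {
  const width = roundWidth(phase, limits);
  const rounds: string[][] = [];
  for (let i = 0; i < phase.tasks.length; i += width) {
    rounds.push(phase.tasks.slice(i, i + width));
  }
  return rounds;
}
```

Under these assumptions, a five-task parallel phase with a limit of 2 is planned as three rounds of 2, 2, and 1 tasks, while the same tasks in a sequential phase run one per round.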

How the batch executes

A single Node batch-delegate child, launched with spawnSync, spawns all of the round's agents concurrently and returns each outcome. The CW parent stays fully synchronous, so the public drive() API is unchanged. Each outcome then settles through the same runBackend path the serial loop uses (via an internal preparedAgentOutcome), so envelopes, refusals, and accept-time gates are identical by construction: the concurrent and serial paths cannot drift.
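As a rough illustration of the pattern (a synchronous parent delegating a concurrent batch to one child process), the sketch below uses Node's child_process.spawnSync with its input and timeout options. It is an assumption-laden toy, not the project's code: the Job and Outcome shapes, the inline child script, the runBatchSync helper, and the "spawn-refused" detail string are all hypothetical, and timers stand in for real agents.

```typescript
import { spawnSync } from "node:child_process";

// Hypothetical job and outcome shapes, assumed for illustration only.
interface Job { id: string; ms: number; fail?: boolean }
interface Outcome { id: string; ok: boolean; detail: string }

// Child program (plain JavaScript, run with `node -e`). It reads the job list
// from stdin, runs every job concurrently, waits for all of them to settle,
// and prints one outcome per job in the order the jobs were given.
const CHILD_SCRIPT = `
const jobs = JSON.parse(require("fs").readFileSync(0, "utf8"));
const run = (job) => new Promise((resolve, reject) => {
  setTimeout(() => (job.fail ? reject(new Error("crashed")) : resolve("done")), job.ms);
});
Promise.allSettled(jobs.map(run)).then((settled) => {
  const outcomes = settled.map((s, i) => ({
    id: jobs[i].id,
    ok: s.status === "fulfilled",
    detail: s.status === "fulfilled" ? s.value : s.reason.message,
  }));
  process.stdout.write(JSON.stringify(outcomes));
});
`;

// The parent blocks until the child exits, so callers see a synchronous API.
// `backstopMs` is the parent-side timeout: if the child does not exit in time
// it is killed, and every job is failed closed rather than reported as done.
function runBatchSync(jobs: Job[], backstopMs: number): Outcome[] {
  const child = spawnSync(process.execPath, ["-e", CHILD_SCRIPT], {
    input: JSON.stringify(jobs),
    encoding: "utf8",
    timeout: backstopMs,
  });
  if (child.error || child.status !== 0) {
    return jobs.map((job) => ({ id: job.id, ok: false, detail: "spawn-refused" }));
  }
  return JSON.parse(child.stdout) as Outcome[];
}
```

Calling runBatchSync with a generous backstop returns one outcome per job in input order. Calling it with a backstop shorter than the jobs' timers marks every job as refused, which mirrors the fail-closed backstop described under the locked decisions.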

Results are recorded in deterministic batch task order, regardless of the wall-clock order the agents finished in, so replay stays byte-stable.
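One way to picture the ordering rule: sort by the task's position in the batch before persisting, and keep wall-clock data out of the record. The sketch below is a hedged illustration only. The Finished and RecordEntry shapes and the recordInBatchOrder helper are hypothetical, and dropping the completion timestamp is an assumption of this sketch rather than something the text states.

```typescript
// Hypothetical shape of a finished hop as reported by the batch.
interface Finished {
  taskIndex: number;    // position of the task in the batch
  agent: string;
  ok: boolean;
  finishedAtMs: number; // wall-clock completion time, varies between runs
}

// Hypothetical persisted shape: no wall-clock fields (an assumption here).
interface RecordEntry {
  taskIndex: number;
  agent: string;
  ok: boolean;
}

// Order records by batch position, not by completion time, and keep only
// fields that are identical from run to run.
function recordInBatchOrder(finished: Finished[]): RecordEntry[] {
  return finished
    .slice()
    .sort((a, b) => a.taskIndex - b.taskIndex)
    .map((f) => ({ taskIndex: f.taskIndex, agent: f.agent, ok: f.ok }));
}

// Serialise the ordered records; two runs that differ only in finish order
// produce the same string.
function serializeRecords(finished: Finished[]): string {
  return JSON.stringify(recordInBatchOrder(finished));
}
```

With this shape, two runs of the same batch that finish in different wall-clock orders serialise to identical bytes, which is the property replay depends on.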

The two locked decisions

Collect-all: a failing hop never aborts its siblings. Every hop in the round settles and is recorded; a failure only blocks the next round via the existing phase gate. (The alternative — fail-fast — would discard finished work.)
Kill-on-timeout + count: the batch child sends SIGTERM to a hung agent at its per-job deadline (then SIGKILL after a grace period), and the hang is counted as one failure through the same retry/park path as a crash. A wedged batch child cannot deadlock the drive: a parent backstop timeout fails every job closed as a spawn refusal, never as a fabricated completion.
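The two decisions can be modelled in a few lines. The sketch below is a simplified, synchronous model under stated assumptions: the RawResult kinds, the HopRecord shape, and the settleRound helper are hypothetical, and the actual process signalling (SIGTERM, then SIGKILL) is assumed to have already happened by the time a timeout result is reported.

```typescript
// Hypothetical raw result kinds reported for one hop by the batch child.
type RawResult =
  | { kind: "ok"; output: string }
  | { kind: "crash"; exitCode: number }
  | { kind: "timeout"; deadlineMs: number } // agent was killed at its deadline
  | { kind: "dirty"; problem: string };     // returned, but the output was rejected

interface HopRecord {
  hop: string;
  accepted: boolean;
  reason: string | null;
}

interface RoundResult {
  records: HopRecord[];
  failureCount: number;
  nextRoundBlocked: boolean;
}

// Turn a raw result into a failure reason, or null when the hop is good.
function describeFailure(raw: RawResult): string | null {
  switch (raw.kind) {
    case "ok":
      return null;
    case "crash":
      return `crash: exit code ${raw.exitCode}`;
    case "timeout":
      return `timeout: killed after ${raw.deadlineMs} ms`;
    case "dirty":
      return `dirty return: ${raw.problem}`;
  }
}

// Collect-all: every hop is recorded, whatever happened to its siblings.
// A hang, a crash, and a dirty return each count as exactly one failure,
// and any failure blocks the next round instead of aborting this one.
function settleRound(hops: Array<{ hop: string; raw: RawResult }>): RoundResult {
  const records: HopRecord[] = [];
  let failureCount = 0;
  for (const { hop, raw } of hops) {
    const reason = describeFailure(raw);
    if (reason !== null) failureCount += 1;
    records.push({ hop, accepted: reason === null, reason });
    // Deliberately no early exit: a failure never stops later hops from settling.
  }
  return { records, failureCount, nextRoundBlocked: failureCount > 0 };
}
```

A fail-fast variant would return from the loop at the first non-null reason, which is exactly the behaviour the collect-all decision rules out.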

These per-round guarantees also hold when the round is one turn of a bounded loop(...) phase: if a loop round's tasks are authored parallel, each round settles through the same collect-all, kill-on-timeout, and deterministic batch-order path, so a hung or dirty hop blocks only the next round, never its siblings.
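The loop case can be sketched in the same simplified style. In this hedged model each round's tasks run one after another inside a try/catch (real concurrency is omitted for brevity), and the Task type, the LoopRound shape, and the runBoundedLoop helper are hypothetical names. The loop's own exit condition is not modelled, only its bound and the gate.

```typescript
// A task returns its output or throws; hypothetical stand-in for an agent hop.
type Task = () => string;

interface LoopRound {
  round: number;
  results: Array<{ task: number; ok: boolean; detail: string }>;
}

// Run at most `maxRounds` rounds. Within a round every task settles and is
// recorded; a round containing any failure is the last one that runs.
function runBoundedLoop(makeRound: (round: number) => Task[], maxRounds: number): LoopRound[] {
  const history: LoopRound[] = [];
  for (let round = 0; round < maxRounds; round += 1) {
    const results = makeRound(round).map((task, index) => {
      try {
        return { task: index, ok: true, detail: task() };
      } catch (err) {
        const detail = err instanceof Error ? err.message : String(err);
        return { task: index, ok: false, detail };
      }
    });
    history.push({ round, results });
    if (results.some((r) => !r.ok)) break; // gate: block the next round only
  }
  return history;
}
```

If the second of five rounds contains one failing task, the history holds two rounds, and the failing round still lists a result for every sibling task.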

Acceptance (the build-map criterion, now a CI smoke)

concurrent-failure-semantics-smoke: 16 agents with a forced hang + crash + dirty-return. Asserts: no deadlock (wall-clock bounded well under the serial floor), all 13 good hops accepted in the same round as the 3 failures, each failure parked with its own recorded reason, deterministic record order, and the persisted state replays who passed / who failed completely with no disk corruption.
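The replay assertion can be pictured with a small model of the persisted state. This is an illustrative sketch only, not the smoke test itself: the HopState shape, the buildSmokeRound and replay helpers, the reason strings, and the choice of which three hop indices fail are all assumptions.

```typescript
// Hypothetical persisted record for one hop.
interface HopState {
  hop: number;
  accepted: boolean;
  reason: string | null;
}

// Build a round like the one described above: 16 hops, three of which fail
// in distinct ways. Which indices fail is an arbitrary choice for this sketch.
function buildSmokeRound(): HopState[] {
  const failures: Record<number, string> = {
    3: "timeout",
    7: "crash",
    12: "dirty-return",
  };
  const round: HopState[] = [];
  for (let hop = 0; hop < 16; hop += 1) {
    const reason = failures[hop];
    round.push({
      hop,
      accepted: reason === undefined,
      reason: reason === undefined ? null : reason,
    });
  }
  return round;
}

// Answer "who passed / who failed" from the persisted bytes alone.
function replay(persisted: string): {
  passed: number[];
  failed: Array<{ hop: number; reason: string }>;
} {
  const state = JSON.parse(persisted) as HopState[];
  return {
    passed: state.filter((h) => h.accepted).map((h) => h.hop),
    failed: state
      .filter((h) => !h.accepted)
      .map((h) => ({ hop: h.hop, reason: h.reason === null ? "unknown" : h.reason })),
  };
}
```

Serialising buildSmokeRound() with JSON.stringify and passing the result to replay yields 13 passed hops and 3 failed hops, each failure carrying its own reason.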

See Execution Backends for the delegation envelope and Control-Plane Scheduling for retry/park.

Organized from local Obsidian notes and reconciled with the current coo1white/cool-workflow repository state.
