feat(durable): Phase 2 swarm-tree checkpointing #506

Draft
cdayAI wants to merge 1 commit into main from claude/security-code-audit-Nq8Tw
cdayAI commented May 31, 2026

Phase 2 of docs/specs/durable-execution.md, building on the episode-scoped keying from #497. Closes the gap Phase 1 left for swarm goals: a crash during spawn_swarm (children in-flight, no tool_result recorded) used to re-run the entire swarm on resume. Now each child resumes its own loop.

What's here

  • Agent gains an optional checkpoint_id override + _current_step. The depth-0 gate in run() is lifted so any agent with an explicit checkpoint_id (a swarm child) also checkpoints. Children resume their own messages+step but do not restore the shared budget — only the depth-0 owner does (a child restoring would clobber the swarm's shared counter).
  • spawn_swarm keys each child by parent.checkpoint_id + spawn-step + index + brief-hash (stable across resume). After gather, children that returned are cleared (their finals are already in the parent's history); a child that raised keeps its checkpoint for resume.
  • checkpoint.clear_agent(goal_id, agent_id, episode_id) for per-child cleanup.
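The child keying described above could be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the helper name `child_checkpoint_id`, the key layout, and the choice of SHA-256 truncated to 12 hex characters are all assumptions made for the example.

```python
import hashlib


def child_checkpoint_id(parent_checkpoint_id: str, spawn_step: int,
                        index: int, brief: str) -> str:
    """Hypothetical helper: derive a deterministic key for one swarm child.

    Combines the four components named in the description:
    parent checkpoint_id + spawn-step + index + brief-hash.
    """
    # hashlib is used instead of the built-in hash(), which is salted per
    # process for strings and so would not be stable across a resume.
    brief_hash = hashlib.sha256(brief.encode("utf-8")).hexdigest()[:12]
    return f"{parent_checkpoint_id}/s{spawn_step}/c{index}/{brief_hash}"


key_a = child_checkpoint_id("goal-1", 3, 0, "audit the auth module")
key_b = child_checkpoint_id("goal-1", 3, 0, "audit the auth module")
key_c = child_checkpoint_id("goal-1", 3, 1, "audit the auth module")
```

Here `key_a` and `key_b` are equal because every input is the same, while `key_c` differs because the index changed. A process-independent hash is what would make such a key usable after a crash and restart.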

Soundness (the careful part)

Resume is continuation, not replay — a resumed parent re-decides and may spawn a different set of children. The keying handles both cases correctly:

  • Identical re-spawn → each child's key matches its prior checkpoint → resumes mid-loop. ✓
  • Divergent re-spawn → new keys → fresh children; the orphaned checkpoints are pruned/cleared. ✓
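The two cases above can be illustrated with plain set arithmetic over child keys. This is a sketch under assumptions: `child_key` and `spawn_keys` are hypothetical names, and the key format is invented for the example rather than taken from this PR.

```python
import hashlib


def child_key(parent_id: str, spawn_step: int, index: int, brief: str) -> str:
    """Hypothetical key: parent id + spawn step + index + hash of the brief."""
    brief_hash = hashlib.sha256(brief.encode("utf-8")).hexdigest()[:12]
    return f"{parent_id}/s{spawn_step}/c{index}/{brief_hash}"


def spawn_keys(parent_id: str, spawn_step: int, briefs: list[str]) -> set[str]:
    """Keys for every child in one spawn_swarm call."""
    return {child_key(parent_id, spawn_step, i, b) for i, b in enumerate(briefs)}


# Keys checkpointed before the crash.
before = spawn_keys("goal-1", 3, ["scan auth", "scan io"])

# Case 1: the resumed parent spawns the same children, so every key matches.
identical = spawn_keys("goal-1", 3, ["scan auth", "scan io"])

# Case 2: the resumed parent changes the second brief.
divergent = spawn_keys("goal-1", 3, ["scan auth", "scan network"])

resumable = before & divergent   # children that can resume mid-loop
orphaned = before - divergent    # old checkpoints no new child will claim
```

In the divergent case only the unchanged child keeps its key; the changed brief yields a new key and a fresh child, and the old key lands in `orphaned` as a candidate for cleanup.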

Deferred (documented): skipping a fully-completed child entirely (memoizing its final across a parent re-decision) needs spawn-intent replay — not done here. The parent still re-runs the swarm step, just without re-running finished children's full loops. This is the correct, sound slice; the memoization is a further optimization.

Verification

  • Off by default, fail-open. Full suite 2481 passed, 0 regressions; 163 spawn/swarm tests green (the hot path this touches).
  • New tests: clear_agent scoping; explicit child checkpoint_id; and an end-to-end spawn_swarm where one child finishes and one crashes — the crashed child keeps its checkpoint, the finished sibling's is cleared.
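The end-to-end scenario in the last bullet can be approximated in miniature. The sketch below is not the test from this PR: the in-memory `checkpoints` dict, `run_child`, and `spawn_and_clean` are hypothetical stand-ins, and the use of `asyncio.gather(..., return_exceptions=True)` is an assumption about how a raised child might be told apart from a returned one.

```python
import asyncio

# Hypothetical in-memory stand-in for the checkpoint store.
checkpoints: dict[str, dict] = {}


async def run_child(key: str, should_crash: bool) -> str:
    """Simulated child loop: checkpoint once, then finish or crash."""
    checkpoints[key] = {"step": 1}
    await asyncio.sleep(0)  # yield to the event loop, as a real child would
    if should_crash:
        raise RuntimeError("child crashed")
    return f"final:{key}"


async def spawn_and_clean(children: dict[str, bool]) -> dict[str, object]:
    keys = list(children)
    # return_exceptions=True lets every child settle and hands back raised
    # exceptions as values instead of propagating the first one.
    results = await asyncio.gather(
        *(run_child(k, children[k]) for k in keys), return_exceptions=True
    )
    for key, result in zip(keys, results):
        if not isinstance(result, BaseException):
            # Child returned: its final is with the parent, so clear it.
            checkpoints.pop(key, None)
        # Child raised: leave its checkpoint in place for resume.
    return dict(zip(keys, results))


results = asyncio.run(spawn_and_clean({"child-0": False, "child-1": True}))
```

After the run, `checkpoints` holds only `child-1`, matching the behavior the new test asserts: the crashed child keeps its checkpoint and the finished sibling's is cleared.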

https://claude.ai/code/session_01V4m74QKcM4ERqAu3rbkr6B


Generated by Claude Code

Builds on the episode-scoped keying (#497). Spawned spawn_swarm children now
checkpoint independently so a crash mid-swarm lets each child resume its own
loop instead of re-running the whole swarm.

- Agent gains an optional checkpoint_id override + _current_step; the depth-0
  gate in run() is lifted so any agent with an explicit checkpoint_id (a swarm
  child) also checkpoints. Children resume their own messages+step but do NOT
  restore the shared budget (only the depth-0 owner does).
- spawn_swarm keys each child by parent.checkpoint_id + spawn-step + index +
  brief-hash (stable across resume). After gather, children that RETURNED are
  cleared (finals are in the parent's history); a child that RAISED keeps its
  checkpoint for resume.
- checkpoint.clear_agent(goal_id, agent_id, episode_id) for per-child cleanup.

Soundness: resume is continuation-not-replay. An identical re-spawn matches
each child key and resumes mid-loop; a divergent re-spawn gets new keys and
starts fresh (stale checkpoints orphaned). Deferred: skipping a fully-completed
child across a parent re-decision (needs spawn-intent replay).

Off by default, fail-open. Full suite 2481 passed, 0 regressions; 163
spawn/swarm tests green. New tests: clear_agent scoping, explicit child
checkpoint_id, and a spawn_swarm crash where the crashed child keeps its
checkpoint and the finished sibling's is cleared.