feat(durable): Phase 2 swarm-tree checkpointing #506
Draft
cdayAI wants to merge 1 commit into
Builds on the episode-scoped keying (#497). Spawned spawn_swarm children now checkpoint independently so a crash mid-swarm lets each child resume its own loop instead of re-running the whole swarm.
- Agent gains an optional checkpoint_id override + _current_step; the depth-0 gate in run() is lifted so any agent with an explicit checkpoint_id (a swarm child) also checkpoints. Children resume their own messages+step but do NOT restore the shared budget (only the depth-0 owner does).
- spawn_swarm keys each child by parent.checkpoint_id + spawn-step + index + brief-hash (stable across resume). After gather, children that RETURNED are cleared (finals are in the parent's history); a child that RAISED keeps its checkpoint for resume.
- checkpoint.clear_agent(goal_id, agent_id, episode_id) for per-child cleanup.
Soundness: resume is continuation-not-replay. An identical re-spawn matches each child key and resumes mid-loop; a divergent re-spawn gets new keys and starts fresh (stale checkpoints orphaned).
Deferred: skipping a fully-completed child across a parent re-decision (needs spawn-intent replay).
Off by default, fail-open. Full suite 2481 passed, 0 regressions; 163 spawn/swarm tests green. New tests: clear_agent scoping, explicit child checkpoint_id, and a spawn_swarm crash where the crashed child keeps its checkpoint and the finished sibling's is cleared.
Phase 2 of docs/specs/durable-execution.md, building on the episode-scoped keying from #497. Closes the gap Phase 1 left for swarm goals: a crash during spawn_swarm (children in-flight, no tool_result recorded) used to re-run the entire swarm on resume. Now each child resumes its own loop.
What's here
- Agent gains an optional checkpoint_id override + _current_step. The depth-0 gate in run() is lifted so any agent with an explicit checkpoint_id (a swarm child) also checkpoints. Children resume their own messages+step but do not restore the shared budget; only the depth-0 owner does (a child restoring would clobber the swarm's shared counter).
- spawn_swarm keys each child by parent.checkpoint_id + spawn-step + index + brief-hash (stable across resume). After gather, children that returned are cleared (their finals are already in the parent's history); a child that raised keeps its checkpoint for resume.
- checkpoint.clear_agent(goal_id, agent_id, episode_id) for per-child cleanup.
Soundness (the careful part)
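Resume is continuation, not replay: a resumed parent re-decides and may spawn a different set of children. The keying handles both cases correctly:
- An identical re-spawn matches each child key and resumes mid-loop.
- A divergent re-spawn gets new keys and starts fresh (stale checkpoints are orphaned).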
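To make the two cases concrete, here is a minimal sketch of a child key built from the four components named above. The function name, the separator format, and the choice of SHA-256 truncated to 12 hex characters are assumptions for illustration; the PR does not show the actual derivation.

```python
import hashlib


def child_checkpoint_id(parent_checkpoint_id: str, spawn_step: int,
                        index: int, brief: str) -> str:
    """Hypothetical key: parent id + spawn step + child index + brief hash."""
    # hashlib is deterministic across processes, unlike the built-in hash(),
    # which is salted per interpreter run for strings.
    brief_hash = hashlib.sha256(brief.encode("utf-8")).hexdigest()[:12]
    return f"{parent_checkpoint_id}/s{spawn_step}/c{index}/{brief_hash}"


# Identical re-spawn: same inputs reproduce the same key, so the child resumes.
first = child_checkpoint_id("goal-1", 4, 0, "summarize the logs")
again = child_checkpoint_id("goal-1", 4, 0, "summarize the logs")

# Divergent re-spawn: a different brief yields a new key, so the child starts fresh.
other = child_checkpoint_id("goal-1", 4, 0, "summarize the metrics")
```

Whatever the real hash is, it has to be stable across process restarts for the key to be "stable across resume", which is why this sketch avoids Python's per-process salted hash().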
Deferred (documented): skipping a fully-completed child entirely (memoizing its final across a parent re-decision) needs spawn-intent replay, which is not done here. The parent still re-runs the swarm step, just without re-running finished children's full loops. This is the correct, sound slice; the memoization is a further optimization.
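The post-gather cleanup rule (returned children are cleared, a raised child keeps its checkpoint) could look roughly like the sketch below. It assumes "gather" means asyncio.gather with return_exceptions=True; the in-memory checkpoints dict, run_child, clear_child, and spawn_and_gather are hypothetical stand-ins, not the PR's code. The real cleanup call is checkpoint.clear_agent(goal_id, agent_id, episode_id).

```python
import asyncio

# Hypothetical in-memory stand-in for the checkpoint store, keyed by child id.
checkpoints: dict[str, dict] = {}


def clear_child(child_id: str) -> None:
    """Stand-in for per-child cleanup; removes the child's checkpoint if present."""
    checkpoints.pop(child_id, None)


async def run_child(child_id: str, should_fail: bool) -> str:
    # Each child writes its own checkpoint as it makes progress.
    checkpoints[child_id] = {"step": 1, "messages": ["..."]}
    await asyncio.sleep(0)
    if should_fail:
        raise RuntimeError(f"{child_id} crashed")
    return f"{child_id} final"


async def spawn_and_gather(children: dict[str, bool]) -> dict[str, object]:
    ids = list(children)
    # return_exceptions=True delivers a child's exception as its result
    # instead of propagating it, so every child's outcome can be inspected.
    results = await asyncio.gather(
        *(run_child(cid, children[cid]) for cid in ids),
        return_exceptions=True,
    )
    outcome = dict(zip(ids, results))
    for cid, result in outcome.items():
        # Returned: the final is in the parent's history, so clear the checkpoint.
        # Raised: keep the checkpoint so the child can resume later.
        if not isinstance(result, BaseException):
            clear_child(cid)
    return outcome


outcome = asyncio.run(spawn_and_gather({"child-ok": False, "child-bad": True}))
```

After this runs, only the crashed child's entry remains in the store, which mirrors the end-to-end test described under Verification.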
Verification
Full suite 2481 passed, 0 regressions; 163 spawn/swarm tests green. New tests: clear_agent scoping; explicit child checkpoint_id; and an end-to-end spawn_swarm where one child finishes and one crashes: the crashed child keeps its checkpoint, the finished sibling's is cleared.
https://claude.ai/code/session_01V4m74QKcM4ERqAu3rbkr6B
Generated by Claude Code