feat(durable): Phase 2 swarm-tree checkpointing by cdayAI · Pull Request #506 · cdayAI/Maverick

cdayAI · 2026-05-31T18:33:25Z

Phase 2 of docs/specs/durable-execution.md, building on the episode-scoped keying from #497. Closes the gap Phase 1 left for swarm goals: a crash during spawn_swarm (children in-flight, no tool_result recorded) used to re-run the entire swarm on resume. Now each child resumes its own loop.

What's here

Agent gains an optional checkpoint_id override + _current_step. The depth-0 gate in run() is lifted so any agent with an explicit checkpoint_id (a swarm child) also checkpoints. Children resume their own messages+step but do not restore the shared budget — only the depth-0 owner does (a child restoring would clobber the swarm's shared counter).
spawn_swarm keys each child by parent.checkpoint_id + spawn-step + index + brief-hash (stable across resume). After gather, children that returned are cleared (their finals are already in the parent's history); a child that raised keeps its checkpoint for resume.
checkpoint.clear_agent(goal_id, agent_id, episode_id) for per-child cleanup.
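A minimal sketch of the gating and restore rules described above, not the PR's actual code. The names `AgentSketch`, `Snapshot`, `should_checkpoint`, `restore`, and the `enabled` flag are hypothetical, and the budget is modelled as a plain per-object integer purely for illustration (in the PR it is a counter shared across the swarm).

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Snapshot:
    """Hypothetical shape of a saved checkpoint."""
    messages: list
    step: int
    budget_used: int


@dataclass
class AgentSketch:
    """Illustrative stand-in for Agent; only the checkpoint-related fields."""
    depth: int
    checkpoint_id: Optional[str] = None  # explicit override, set for swarm children
    messages: list = field(default_factory=list)
    current_step: int = 0
    budget_used: int = 0  # assumption: stands in for the shared budget counter

    def should_checkpoint(self, enabled: bool) -> bool:
        # Off by default: nothing checkpoints unless the feature is enabled.
        if not enabled:
            return False
        # The depth-0 owner checkpoints; deeper agents only with an explicit id.
        return self.depth == 0 or self.checkpoint_id is not None

    def restore(self, snap: Snapshot) -> None:
        # Every resuming agent gets its own messages and step back.
        self.messages = list(snap.messages)
        self.current_step = snap.step
        # Only the depth-0 owner restores the budget; a child doing so
        # would overwrite the swarm's shared counter.
        if self.depth == 0:
            self.budget_used = snap.budget_used
```

Under these assumptions, a depth-1 agent without an explicit `checkpoint_id` never checkpoints, which matches the pre-Phase-2 behaviour for non-swarm sub-agents.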

Soundness (the careful part)

Resume is continuation, not replay — a resumed parent re-decides and may spawn a different set of children. The keying handles both cases correctly:

Identical re-spawn → each child's key matches its prior checkpoint → resumes mid-loop. ✓
Divergent re-spawn → new keys → fresh children; the orphaned checkpoints are pruned/cleared. ✓
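The two cases can be illustrated with a small sketch. The key format, the `child_key` and `plan_respawn` helpers, and the dict-backed store are all assumptions made for illustration; the PR only states that the key combines the parent's checkpoint_id, the spawn step, the child index, and a hash of the brief.

```python
import hashlib


def child_key(parent_id: str, spawn_step: int, index: int, brief: str) -> str:
    """Hypothetical key layout: parent id + spawn step + index + brief hash."""
    # A content hash (not the builtin hash()) keeps the key stable across
    # process restarts, which resume depends on.
    brief_hash = hashlib.sha256(brief.encode("utf-8")).hexdigest()[:12]
    return f"{parent_id}/s{spawn_step}/c{index}/{brief_hash}"


def plan_respawn(saved: dict, parent_id: str, spawn_step: int, briefs: list):
    """Classify a re-spawn against previously saved child checkpoints."""
    keys = [child_key(parent_id, spawn_step, i, b) for i, b in enumerate(briefs)]
    resumed = [k for k in keys if k in saved]    # key matches: resume mid-loop
    fresh = [k for k in keys if k not in saved]  # new key: start from scratch
    prefix = f"{parent_id}/s{spawn_step}/"
    # Saved checkpoints for this spawn step that no current child claims.
    orphaned = [k for k in saved if k.startswith(prefix) and k not in keys]
    return resumed, fresh, orphaned
```

With saved checkpoints for briefs `["a", "b"]`, re-spawning `["a", "b"]` resumes both children, while re-spawning `["a", "c"]` resumes the first, starts the second fresh, and leaves the old `"b"` checkpoint orphaned.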

Deferred (documented): skipping a fully-completed child entirely (memoizing its final across a parent re-decision) needs spawn-intent replay — not done here. The parent still re-runs the swarm step, just without re-running finished children's full loops. This is the correct, sound slice; the memoization is a further optimization.

Verification

Off by default, fail-open. Full suite 2481 passed, 0 regressions; 163 spawn/swarm tests green (the hot path this touches).
New tests: clear_agent scoping; explicit child checkpoint_id; and an end-to-end spawn_swarm where one child finishes and one crashes — the crashed child keeps its checkpoint, the finished sibling's is cleared.
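The end-to-end scenario could look roughly like the sketch below. It assumes the children are awaited with `asyncio.gather(..., return_exceptions=True)` and that checkpoints live in a plain dict; `run_children`, `finishes`, `crashes`, and `demo` are hypothetical names, and the real test in the PR may be structured differently.

```python
import asyncio


async def run_children(store: dict, children: dict) -> list:
    """children maps a checkpoint key to a coroutine function (assumed shape)."""
    keys = list(children)
    # return_exceptions=True lets siblings finish even when one child raises.
    results = await asyncio.gather(
        *(children[k]() for k in keys), return_exceptions=True
    )
    for key, result in zip(keys, results):
        if not isinstance(result, BaseException):
            # Child returned: its final is in the parent's history, so clear it.
            store.pop(key, None)
        # Child raised: keep its checkpoint so it can resume later.
    return results


async def finishes() -> str:
    return "done"


async def crashes() -> str:
    raise RuntimeError("boom")


def demo():
    store = {"p/s1/c0": {"step": 2}, "p/s1/c1": {"step": 5}}
    children = {"p/s1/c0": finishes, "p/s1/c1": crashes}
    results = asyncio.run(run_children(store, children))
    return store, results
```

After `demo()`, only the crashed child's entry remains in the store, mirroring the assertion the PR describes.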

https://claude.ai/code/session_01V4m74QKcM4ERqAu3rbkr6B

Generated by Claude Code

Builds on the episode-scoped keying (#497). Spawned spawn_swarm children now checkpoint independently so a crash mid-swarm lets each child resume its own loop instead of re-running the whole swarm.

- Agent gains an optional checkpoint_id override + _current_step; the depth-0 gate in run() is lifted so any agent with an explicit checkpoint_id (a swarm child) also checkpoints. Children resume their own messages+step but do NOT restore the shared budget (only the depth-0 owner does).
- spawn_swarm keys each child by parent.checkpoint_id + spawn-step + index + brief-hash (stable across resume). After gather, children that RETURNED are cleared (finals are in the parent's history); a child that RAISED keeps its checkpoint for resume.
- checkpoint.clear_agent(goal_id, agent_id, episode_id) for per-child cleanup.

Soundness: resume is continuation-not-replay. An identical re-spawn matches each child key and resumes mid-loop; a divergent re-spawn gets new keys and starts fresh (stale checkpoints orphaned).

Deferred: skipping a fully-completed child across a parent re-decision (needs spawn-intent replay).

Off by default, fail-open. Full suite 2481 passed, 0 regressions; 163 spawn/swarm tests green. New tests: clear_agent scoping, explicit child checkpoint_id, and a spawn_swarm crash where the crashed child keeps its checkpoint and the finished sibling's is cleared.

cdayAI mentioned this pull request May 31, 2026

ci: retry piptools compile in homebrew-bump on PyPI propagation lag #507

Draft

cdayAI wants to merge 1 commit into main from claude/security-code-audit-Nq8Tw
