fix(durable): episode-scoped stable checkpoint keying #497
Merged
… bug) Phase 1 (#472) keyed checkpoints on Agent.name, which carries a per-process random uuid suffix (for blackboard/agent-bus uniqueness). On a real fresh-process resume, run_goal builds a new orchestrator with a NEW random name, so latest(goal_id, self.name) never matched -> resume silently fell back to warm-restart. The Phase-1 test only passed because it PINNED agent.name.
Fix: key on (goal_id, episode_id, checkpoint_id):
- checkpoint_id = '{role}-{depth}' (stable; one orchestrator per episode), a new Agent property distinct from the random name.
- episode_id threaded through SwarmContext (default 0) and set from world.start_episode() in run_goal. This discriminates best-of-N attempts, which run sequential run_goal calls under the SAME goal_id but DISTINCT episodes -- without it, attempt 2 would resume from attempt 1's checkpoint.
- checkpoints table gains an episode_id column (own table, still no world-model schema migration); save/latest/_prune updated.
Tests: production-shape resume now works WITHOUT pinning the name; episode scoping proves no cross-resume between attempts. Full suite 2459 passed, 0 regressions; durable still off by default. Foundational for Phase 2 (swarm child records key off the now-stable parent identity).
A correctness fix for durable execution, found while scoping Phase 2.
The bug
Phase 1 (#472) keyed checkpoints on Agent.name, but name carries a per-process random uuid suffix (f"{role}-{depth}-{uuid4().hex[:6]}", for blackboard/agent-bus uniqueness). On a real fresh-process resume, run_goal builds a new orchestrator with a new random name, so latest(goal_id, self.name) never matched → resume silently fell back to warm-restart. Phase 1's resume only worked in its test because the test pinned agent.name. In production the feature didn't actually resume.
Compounding it: run_goal_best_of_n runs N sequential run_goal calls under the same goal_id (distinct episodes), so a naive "stable id per goal" fix would make attempt 2 resume from attempt 1's checkpoint.
The fix (episode-scoped, stable keying)
Key checkpoints on (goal_id, episode_id, checkpoint_id):
- checkpoint_id = "{role}-{depth}": a new Agent property, stable across processes (one orchestrator per episode), distinct from the random name.
- episode_id: threaded through SwarmContext (default 0), set from world.start_episode() in run_goal. Discriminates best-of-N attempts so they never cross-resume.
- checkpoints table: gains an episode_id column (still its own table, no world-model schema migration); save/latest/_prune updated.
Tests
- test_resume_works_without_pinning_name: a fresh agent (new random name) resumes from the prior checkpoint via the stable id. This is the exact case Phase 1 silently failed; it now passes without test-only pinning.
- test_episode_scoping_no_cross_resume: a checkpoint under episode 1 is not picked up when resuming episode 2 (the best-of-N safety property).
Full suite 2459 passed, 0 regressions; durable still off by default; ruff clean.
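For readers who want to see the keying in isolation, here is a minimal sketch of an episode-scoped lookup. It is not the code in this PR: the CheckpointStore class, the column types, the payload column, and the use of an in-memory SQLite database are all assumptions made for illustration. Only the key shape (goal_id, episode_id, checkpoint_id), the save/latest method names, and the default episode of 0 come from the description above. Pruning is left out because its policy is not described here.

```python
import sqlite3
from typing import Optional


class CheckpointStore:
    """Hypothetical stand-in for the checkpoint store; illustration only."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self.conn = conn
        # Assumed schema: only the three key columns are taken from the PR text.
        self.conn.execute(
            """
            CREATE TABLE IF NOT EXISTS checkpoints (
                id            INTEGER PRIMARY KEY AUTOINCREMENT,
                goal_id       TEXT    NOT NULL,
                episode_id    INTEGER NOT NULL DEFAULT 0,
                checkpoint_id TEXT    NOT NULL,
                payload       TEXT    NOT NULL
            )
            """
        )

    def save(self, goal_id: str, episode_id: int, checkpoint_id: str, payload: str) -> None:
        self.conn.execute(
            "INSERT INTO checkpoints (goal_id, episode_id, checkpoint_id, payload) "
            "VALUES (?, ?, ?, ?)",
            (goal_id, episode_id, checkpoint_id, payload),
        )
        self.conn.commit()

    def latest(self, goal_id: str, episode_id: int, checkpoint_id: str) -> Optional[str]:
        # All three key parts must match, so another episode's rows are invisible.
        row = self.conn.execute(
            "SELECT payload FROM checkpoints "
            "WHERE goal_id = ? AND episode_id = ? AND checkpoint_id = ? "
            "ORDER BY id DESC LIMIT 1",
            (goal_id, episode_id, checkpoint_id),
        ).fetchone()
        return row[0] if row else None


store = CheckpointStore(sqlite3.connect(":memory:"))
store.save("goal-1", 1, "orchestrator-0", "state after step 3")

# Same goal, same episode, same stable id: the checkpoint is found.
resumed = store.latest("goal-1", 1, "orchestrator-0")
# Same goal, next best-of-N attempt (episode 2): nothing to resume from.
other_attempt = store.latest("goal-1", 2, "orchestrator-0")
```

In this sketch the second lookup returns None, which mirrors the property that test_episode_scoping_no_cross_resume checks: attempt 2 starts clean instead of inheriting attempt 1's state.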
This is foundational for Phase 2 (swarm child records key off the now-stable parent identity), which I'll do next. Phase 2 design is on #472 / tracked in #396.
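The "now-stable parent identity" can be illustrated with a stripped-down agent. This is a sketch under assumptions, not the real Agent class: the dataclass layout and the "orchestrator" role value are hypothetical, while the two format strings are the ones quoted in this description.

```python
from dataclasses import dataclass, field
from uuid import uuid4


@dataclass
class Agent:
    """Hypothetical, minimal stand-in for the real Agent class."""

    role: str
    depth: int
    # Regenerated on every construction, so it differs in each fresh process.
    name: str = field(init=False)

    def __post_init__(self) -> None:
        self.name = f"{self.role}-{self.depth}-{uuid4().hex[:6]}"

    @property
    def checkpoint_id(self) -> str:
        # Derived only from role and depth, so it survives a restart.
        return f"{self.role}-{self.depth}"


# Two constructions stand in for "before the crash" and "after the restart".
before_crash = Agent(role="orchestrator", depth=0)
after_restart = Agent(role="orchestrator", depth=0)

same_checkpoint_id = before_crash.checkpoint_id == after_restart.checkpoint_id
same_name = before_crash.name == after_restart.name
```

Here same_checkpoint_id is True, while same_name is False except in the unlikely event of a suffix collision. That is why a lookup keyed on name could not find the earlier checkpoint, and why anything keyed off the parent in Phase 2 needs checkpoint_id instead.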
https://claude.ai/code/session_01V4m74QKcM4ERqAu3rbkr6B
Generated by Claude Code