feat: sprint-based resume with scrum master agent#137
Conversation
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #137 +/- ##
==========================================
+ Coverage 86.88% 86.93% +0.05%
==========================================
Files 50 50
Lines 7127 7151 +24
==========================================
+ Hits 6192 6217 +25
+ Misses 935 934 -1 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
akashgit
left a comment
There was a problem hiding this comment.
Detailed Review — PR #137: Sprint-based Resume with Scrum Master Agent
The design is sound: replacing a prompt-dependent factory checkpoint --save (that was never executed) with event-log-driven recovery is a clear improvement. The symlink fix in events.py is correct and well-tested. The factory log CLI is clean. However, there are several issues that need addressing before this can merge.
Critical — Must Fix
1. Event type mismatch between CEO prompt and Scrum Master detection table
The Scrum Master's phase detection table (scrummaster.md lines 67-74) expects these event types:
phase.build.completedphase.eval.completedphase.archive.completed
But the CEO prompt never logs any of these. The CEO only logs:
sprint.startedphase.research.completedphase.strategy.completedphase.verdictsprint.completed
This means the Scrum Master's primary signal for Build, Eval, and Archive phase completion will never exist in the event log. Build and Eval can fall back to disk artifacts (ceo-verdict-builder.md, experiments/NNN/eval_after.json), but Archive has no fallback signal at all — the Scrum Master doesn't read archivist-checkpoints.md.
Fix: Either (a) add factory log calls for phase.build.completed, phase.eval.completed, and phase.archive.completed to the CEO prompt, or (b) update the Scrum Master's detection table to match the events the CEO actually logs. Also add archivist-checkpoints.md to the Scrum Master's "What You Read" list.
2. Orphaned checkpoint system creates conflicting dual resume paths
cmd_ceo (cli.py:1169-1189) and _run_cycle (cli.py:1698-1724) still:
- Import and call
load_checkpoint(project_path) - Inject
## Resume Contextinto the CEO task if a checkpoint exists - Call
clear_checkpoint(project_path)on success
But the CEO prompt no longer calls factory checkpoint --save, so checkpoint.json is never written. This makes all checkpoint code in cmd_ceo/_run_cycle dead — load_checkpoint always returns None, clear_checkpoint is always a no-op.
This isn't a bug today (it's harmless dead code), but it's confusing: a reader sees checkpoint loading and thinks it works, the CEO prompt has a "legacy fallback" for ## Resume Context blocks that can never be triggered, and there are now two competing resume systems (only one of which is functional).
Fix: Either (a) remove the checkpoint loading from cmd_ceo/_run_cycle (if you're confident the Scrum Master fully replaces it), or (b) keep it as an explicit transition path but add a comment explaining why it's there and when it can be removed. Option (a) is preferred — clean break.
Important — Should Fix
3. Shell quoting fragility in phase.verdict log call
factory log "$PROJECT_PATH" "phase.verdict" --data '{"verdict": "'$VERDICT'", "exp_id": '$EXP_ID'}'This breaks out of single quotes to interpolate $VERDICT and $EXP_ID. If $VERDICT or $EXP_ID contain spaces, special chars, or are empty, the JSON is malformed and cmd_log returns 1 (silently failing the milestone log). Use double quotes with escaped inner quotes:
factory log "$PROJECT_PATH" "phase.verdict" --data "{\"verdict\": \"$VERDICT\", \"exp_id\": $EXP_ID}"4. Temporal disambiguation — Scrum Master can't distinguish current vs previous sprint artifacts
The Scrum Master uses disk artifacts as fallback signals (e.g., "Research completed if strategy/research.md exists"). But these files survive across sprints. If sprint N completed normally and sprint N+1 crashes during research, the Scrum Master will see research.md from sprint N and falsely conclude research is complete in sprint N+1.
The Scrum Master prompt should instruct the agent to compare file modification timestamps against the sprint.started event timestamp. If a file is older than the current sprint start, it's a leftover from a previous sprint and should not be treated as evidence of current-sprint completion.
5. Scrum Master only covers Improve mode — Build mode crashes have no recovery
Step 0 (Scrum Master standup) is only added to Improve mode in the CEO prompt. Build mode sessions can also crash mid-cycle (especially during multi-phase builds), but don't get a standup. The general "Resuming from a Crash" section references the Scrum Master but says "see Improve Mode below" — Build mode has no equivalent entry point.
Fix: Add the Scrum Master standup to Build mode (before B0), or explicitly document that Build mode resume is not supported and why.
6. CLAUDE.md not updated
CLAUDE.md says "Eight specialist Claude Code subprocesses" and lists the roles without Scrum Master. The architecture section should reflect the new 9th agent.
Minor — Nice to Have
7. cmd_log doesn't support --agent flag
emit_event accepts an agent parameter, but cmd_log doesn't expose it. All events logged via factory log will have "agent": null. If the Scrum Master ever needs to distinguish "CEO logged this" from "infra logged this," there's no way to tell. Consider adding an optional --agent argument to cmd_log.
8. checkpoint.py module is now dead code
If you remove checkpoint usage from cmd_ceo/_run_cycle (per issue 2), the entire checkpoint.py module, cmd_checkpoint, cmd_resume, and their subparser registrations can be removed. The tests/test_checkpoint.py can also be removed. This is cleanup — fine to do in a follow-up PR.
9. No integration test for Scrum Master behavior
The test suite covers cmd_log (3 tests) and symlink resolution (1 test), but there's no test that verifies the Scrum Master can actually parse events and produce a meaningful standup. Understandable since it's an LLM agent, but a smoke test that feeds it a known events.jsonl and checks the output format would add confidence.
What's Good
- The diagnosis is spot-on: LLMs don't reliably execute bookkeeping shell commands. Event log + agent-based recovery is the right architecture.
- The symlink fix in
events.pyis clean —.resolve()at the entry point of bothemit_eventandload_events. - The
factory logCLI is minimal and well-tested (3 tests covering happy path, no-data, and invalid JSON). - The Scrum Master prompt's multi-signal approach (event log + disk artifacts) is resilient by design.
- The PR is well-scoped — 242 additions, 26 deletions across 7 files.
Review items addressedAll 7 actionable items fixed in 2 commits (
Items 8 (remove |
|
Closing — has merge conflicts and unaddressed review feedback. Reopen or recreate if the sprint-resume approach is still desired. |
0607fcb to
3adcd25
Compare
akashgit
left a comment
There was a problem hiding this comment.
Review — PR #137: Sprint-based Resume with Scrum Master Agent
The design is strong: replacing the unreliable checkpoint commands with an event-log + Scrum Master agent is the right architectural move. The PR description is excellent and the motivation (CEO never executing checkpoint commands) is well-documented. However, there's a critical design flaw that undermines the core resume functionality, plus several smaller issues.
1. CRITICAL — sprint.started logged unconditionally breaks chained crash recovery
In both Build mode (ceo.md:395) and Improve mode (ceo.md:726), factory log "sprint.started" runs unconditionally — even when the Scrum Master reports RESUME. This creates a new event boundary that hides previously-completed phase signals.
Failure scenario (chained crashes):
- Sprint 1: CEO starts → logs
sprint.started→ research completes → logsphase.research.completed→ crash - Sprint 2 (resume): SM finds Sprint 1's
sprint.started, nosprint.completed→ RESUME → correctly skips research → CEO logs NEWsprint.started→ strategy completes → crash - Sprint 3 (resume): SM finds Sprint 2's
sprint.started(the latest) → RESUME → checks for phase completion events after Sprint 2's boundary →phase.research.completedis BEFORE that boundary → research not detected as complete
The temporal disambiguation rule makes this worse: "If a file is older than the current sprint start, it is a leftover from a previous sprint — do NOT treat it as evidence of current-sprint completion." So disk artifacts (like ceo-verdict-researcher.md) written during Sprint 1 are also hidden.
Result: On a second crash, the SM recommends re-running everything from scratch — exactly the problem this PR solves for the single-crash case.
Fix: Only log sprint.started for FRESH starts. In both Build and Improve modes, wrap it in the conditional:
- If RESUME: Follow the recommendation. Skip completed phases.
- If FRESH: Log sprint start and proceed with research.
Or alternatively, change the SM's algorithm to walk backwards through sprint.started events until finding one with a matching sprint.completed, treating everything after the last completed sprint as the current scope.
2. MEDIUM — Dead ## Resume Context fallback in CEO prompt
The "Resuming from a Crash" section says: "If your task includes a ## Resume Context block (legacy fallback), treat it the same way." But this PR removes the code that injects ## Resume Context from both cmd_ceo and _run_single_cycle (commit 3adcd25f). This creates a dead reference — the CEO describes behavior for a condition that can never occur. Remove the legacy fallback paragraph.
3. MEDIUM — summary.py still reads checkpoint.json for mode detection
factory/summary.py:216-217 reads checkpoint.json for mode detection. With this PR, checkpoints are no longer written by the CEO, so this will silently degrade. Not blocking for this PR, but worth noting as a follow-up — the summary module should read mode from events.jsonl instead.
4. LOW — No test for --agent flag in cmd_log
The parser defines --agent and cmd_log passes it to emit_event, but none of the three tests in test_log_command.py exercise it. Add a test that verifies the agent field appears in the written event.
5. LOW — Unnecessary getattr in cmd_log
data_str = getattr(args, "data", None)
agent = getattr(args, "agent", None)argparse always sets attributes for defined arguments (defaulting to None). args.data and args.agent work directly and are consistent with args.path and args.event_type used in the same function.
6. LOW — SM prompt references eval.completed but CEO logs phase.eval.completed
In the FRESH standup format, the SM says: "Current score: composite score from last eval.completed event or results.tsv". But the CEO logs phase.eval.completed, not eval.completed. The infra emits eval.completed via invoke_agent (if at all), so there are two different event types. Standardize to phase.eval.completed in the SM prompt for consistency, or list both.
Summary
The event-log + Scrum Master approach is a big improvement over prompt-driven checkpoints. Issue #1 is the only blocker — the unconditional sprint.started on resume defeats the mechanism on chained crashes. The fix is straightforward (conditional logging). Everything else is minor.
Review 2 items addressedFixed in commit
|
|
All review comments from both rounds have been addressed: Round 1 (9 items): Items 1-7 fixed in commits Round 2 (6 items): Items 1-2, 4-6 fixed in commit Key fixes:
1208 tests passing, lint clean. Ready for re-review. |
|
@shivchander I tried to run this but it appears to run into the same issue where the CEO ignores the directions we give it. Claude suggests to enforce the standup as a strict step similar to the finalize gate that #188 introduces. |
Standup enforcement fixThe CEO was ignoring the Fix: Moved standup invocation to the Python infra layer. This follows the same enforcement pattern as the finalize gate in #188: if it must happen, enforce it in code, not in the prompt. Changes:
1208 tests passing. |
30651aa to
963a75e
Compare
akashgit
left a comment
There was a problem hiding this comment.
Review — Sprint-based Resume with Scrum Master Agent
The design is the right call. Checkpoint saves via prompt instructions were proven dead code, and replacing that with event-log + infra-enforced scrummaster standup follows the same pattern that worked in #188 (finalize gate). The iterative evolution through three review rounds addressed real issues (chained crash recovery, infra enforcement after @RobotSail caught the CEO ignoring the prompt). Close to mergeable.
Must Fix (4 items)
1. Merge conflict in tests/test_cli.py — one conflict on the heartbeat loop test call_count assertion. Should be trivial: your latest commit (2df2061) already mocks _run_standup directly, so the assertion should stay at mock_agent.call_count == 3 (matching main). Reconcile the mock_sleep assertions with main's current test structure.
2. Rebase needed — 33 commits behind main. The _run_single_cycle signature gained no_github on main; your branch's version doesn't pass it through to _build_ceo_task. After rebase, verify the no_github parameter is preserved — otherwise the flag gets silently dropped when running via heartbeat loop.
3. _run_standup swallows the mode parameter — takes mode as an argument but never uses it. The scrummaster task string doesn't include the current mode, so the scrummaster can't distinguish Build vs Improve context. Either pass it in the task string or remove the parameter.
4. Coverage gap on _run_standup — 11 uncovered lines in cli.py, all in this function. The happy path — where the scrummaster returns a report that gets injected into the CEO task — has no unit test. Add a test that mocks invoke_agent returning a standup string and verifies it appears in the CEO task.
Non-blocking observations
checkpoint.pymodule stays as a debugging tool — fine, butsummary.py:216-217still readscheckpoint.jsonfor mode detection and will silently returnNonenow. Follow-up item.factory logcalls are still prompt-dependent, but the multi-signal detection (event log + disk artifacts) in the scrummaster prompt mitigates this well.- The scrummaster prompt is well-designed — multi-signal detection, temporal disambiguation, comprehensive phase detection table.
- 11 commits is a lot of fix-on-fix history. Recommend squash merge once the above is resolved.
@shivchander — 4 items above, all straightforward. Fix and we can merge. 🤝
Review items addressed (commit
|
| # | Item | Action |
|---|---|---|
| 1 | Merge conflict in test_cli.py | Already resolved in prior commit (2df2061) |
| 2 | Rebase needed + no_github param |
Already rebased (0 behind). Verified no_github is threaded through _run_single_cycle → _build_ceo_task correctly. |
| 3 | mode param unused in _run_standup |
Fixed — mode is now included in the scrummaster task string: "Run standup for /path (mode: improve)" |
| 4 | Coverage gap on _run_standup |
Added 3 tests: happy path (standup report injected into CEO task via mocked invoke_agent side_effect), skip without .factory/, skip without claude CLI |
Re: squash merge — agreed, will squash on merge.
Non-blocking observations addressed
|
Replace broken prompt-based checkpoint saving with event-log-driven recovery via a new Scrum Master agent. The CEO LLM never executed `factory checkpoint --save` commands. This PR makes resume infrastructure-enforced: - New `scrummaster` agent role reads events.jsonl + disk artifacts and produces a standup report (FRESH or RESUME) - `_run_standup()` in cli.py runs the scrummaster BEFORE spawning the CEO, injecting the report into the task string - `factory log` CLI command for CEO milestone recording - Fix symlink resolution in emit_event (/tmp vs /private/tmp) - Remove dead checkpoint.json reads from summary.py and cmd_ceo - CEO prompt updated: Step 0 enforced by infra, factory log calls at phase boundaries, conditional sprint.started on FRESH only Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
06dea7b to
75c9c93
Compare
The CEO logs phase.verdict after phase.archive.completed, but the verdict decision happens before archival in the workflow. A crash between the two calls would leave the Scrum Master with an inverted view of progress. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Problem
When a factory CEO session crashes or gets killed mid-cycle, all progress is lost. The next
factory ceo /pathstarts from scratch — re-runs the researcher, strategist, builder, everything. A single cycle takes 15-30 minutes and costs significant tokens. There was no reliable way to resume from where the session left off.The previous checkpoint system (
checkpoint.json) relied on the CEO LLM executingfactory checkpoint --saveshell commands embedded in its prompt. Live testing proved this was completely broken — the CEO never executed these commands. Over multiple full cycles (Research → Strategy → Builder → Eval), zerocheckpoint.jsonfiles were ever written. The infrastructure for loading and resuming from checkpoints worked, but the saving side was dead.The root cause: asking an LLM to reliably execute bookkeeping shell commands via prompt instructions is inherently unreliable. The LLM focuses on the workflow (spawning agents, reviewing output, making decisions) and skips the administrative checkpoint commands.
Solution
Replace the checkpoint-based approach with an event-log-driven sprint/standup model — the same pattern human engineering teams use.
1. Event log as single source of truth
events.jsonlalready existed as an observability log, but it had gaps. This PR:/tmp→/private/tmpsymlink resolution inconsistencyfactory logCLI command so the CEO can record milestones:factory log /path "phase.research.completed" --data '{"verdict": "PROCEED"}'sprint.started,phase.research.completed,phase.strategy.completed,phase.verdict, andsprint.completedat each phase boundaryThe event log now contains both infra-level events (emitted automatically by
invoke_agent,cmd_eval, etc.) and CEO-level milestones (emitted by the CEO viafactory log). Together, they form a complete record of every cycle.2. Scrum Master agent
A new specialist agent role (9th in the lineup). At the start of every cycle, the CEO runs
factory agent scrummasteras Step 0. The Scrum Master reads:events.jsonl— finds the lastsprint.startedwithout a matchingsprint.completedreviews/— which agents produced output, CEO verdictsexperiments/— hypothesis state, verdictsstrategy/current.md— current strategyAnd produces a standup report:
3. CEO integration
The CEO prompt now has Step 0 (Scrum Master standup) before any work begins:
The old
factory checkpoint --save/--clearprompt instructions are removed entirely. Thefactory logcalls take their place at the same locations.Why this works where checkpoints didn't
factory checkpoint --save(unreliable)factory logis a simple one-liner the CEO is more likely to runagent.completed) + CEO milestones (phase.*.completed) + disk artifacts (ceo-verdict-*.md)## Resume Contexttext blobcheckpoint.json)events.jsonl) + disk stateEven if the CEO skips some
factory logcalls, the infra-level events and disk artifacts (review files, experiment directories) provide enough signal for the Scrum Master to reconstruct the sprint state.Changes
factory/events.pyproject_path.resolve()toemit_eventandload_eventsfactory/cli.pycmd_logfunction +logsubparser + handlerfactory/agents/runner.py"scrummaster"toAgentRolefactory/agents/prompts/scrummaster.mdfactory/agents/prompts/ceo.mdfactory checkpointwithfactory logcallstests/test_events.pytests/test_log_command.pyfactory log(valid, no-data, invalid JSON)Test plan
factory logcommand)sprint.started, proceeds normallyphase.research.completedandphase.strategy.completedboth recorded in events.jsonlsprint.startedwith nosprint.completedin log🤖 Generated with Claude Code