Problem
After a machine reboot or crash, all builder processes die — shellper daemons (detached, in-memory) and Tower together. Disk-persistent state survives: worktrees in .builders/<id>/, porch status.yaml, SQLite terminal_sessions, spec/plan/review files. But there is no command to walk that state and respawn the builders that should still be running.
The user typically discovers a half-dozen "dead" builders the morning after a reboot and has to run afx spawn <id> --resume for each one by hand. Most builders are sitting idle at approval gates when this happens, so a fresh respawn (which lands them right back at the gate via status.yaml) is functionally equivalent to true session resume.
Proposal
Add an explicit afx workspace recover command (no auto-trigger on Tower startup) that:
- Enumerates codev/projects/*/status.yaml (using findStatusPath() to handle spec-653 multi-PR layouts under .builders/<id>/codev/projects/).
- Filters to eligible projects via the predicate below.
- Runs in dry-run mode by default: lists what it would revive and exits. Pass --apply to actually respawn.
- For each eligible project, respawns the builder via the existing afx spawn <id> --resume codepath.
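The flow above can be sketched as follows. This is a minimal illustration, not the actual implementation: the Candidate and RecoverResult shapes and the injected respawn callback are hypothetical names, and the callback merely stands in for the existing afx spawn <id> --resume flow.

```typescript
// Hypothetical shape: one enumerated project plus the predicate's verdict.
interface Candidate {
  id: string;
  eligible: boolean;
}

// Hypothetical result: what would be revived, and what actually was.
interface RecoverResult {
  listed: string[];
  respawned: string[];
}

// Dry-run by default: only when `apply` is true is the respawn callback invoked.
function recover(
  candidates: Candidate[],
  apply: boolean,
  respawn: (id: string) => void, // stands in for the `afx spawn <id> --resume` flow
): RecoverResult {
  const listed = candidates.filter((c) => c.eligible).map((c) => c.id);
  const respawned: string[] = [];
  if (apply) {
    for (const id of listed) {
      respawn(id);
      respawned.push(id);
    }
  }
  return { listed, respawned };
}
```

Injecting the respawn step keeps the dry-run path free of side effects and makes the listing logic easy to test in isolation.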
Revival predicate
A project is eligible when ALL of:
- phase ∉ {verified, complete}: not in a terminal porch state.
- A terminal_sessions row exists for this project (it had a builder previously).
- The shellper PID is dead OR the Unix socket is unreachable (the builder actually needs recovery and is not already alive).
- The worktree at .builders/<id>/ still exists on disk.
- status.yaml.updated_at is within the last 7 days (configurable via --max-age <days>; bypass entirely with --include-stale).
Notes:
- Builders idle at the pr gate are revived (consistent with all other gate-idle builders; the resource cost is negligible and the human may want to dispatch the builder to address review feedback without manual respawn).
- Projects whose status.yaml was never written (builder crashed before first phase transition) won't show up; this is the conversation-resume gap and is out of scope.
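As a rough sketch, the revival predicate could be expressed as a pure function over facts gathered per project. The ProjectFacts and AgeOptions shapes are assumptions made for illustration (the real status.yaml and terminal_sessions schemas are not shown here), and treating exactly 7 days old as still fresh is also an assumption.

```typescript
// Hypothetical snapshot of the facts the predicate needs for one project.
interface ProjectFacts {
  phase: string;            // porch phase from status.yaml
  hasSessionRow: boolean;   // a terminal_sessions row exists
  pidAlive: boolean;        // shellper PID is still running
  socketReachable: boolean; // Unix socket accepts connections
  worktreeExists: boolean;  // .builders/<id>/ is on disk
  updatedAt: Date;          // status.yaml updated_at
}

// Hypothetical options mirroring --max-age and --include-stale.
interface AgeOptions {
  maxAgeDays: number;
  includeStale: boolean;
}

const TERMINAL_PHASES = new Set(["verified", "complete"]);
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// A project is eligible only when every condition holds.
function isEligible(p: ProjectFacts, opts: AgeOptions, now: Date): boolean {
  const ageDays = (now.getTime() - p.updatedAt.getTime()) / MS_PER_DAY;
  const fresh = opts.includeStale || ageDays <= opts.maxAgeDays;
  const dead = !p.pidAlive || !p.socketReachable;
  return (
    !TERMINAL_PHASES.has(p.phase) &&
    p.hasSessionRow &&
    dead &&
    p.worktreeExists &&
    fresh
  );
}
```

Passing now explicitly keeps the age check deterministic, which matters when testing the --max-age boundary.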
CLI
afx workspace recover                          # dry-run, prints eligible projects
afx workspace recover --apply                  # actually respawn
afx workspace recover --max-age 14             # dry-run with a 14-day age cutoff
afx workspace recover --include-stale --apply  # respawn, ignoring the age cutoff
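One way the flags above might map onto options is sketched below. The RecoverOptions shape and parseRecoverArgs helper are hypothetical, and the hand-rolled loop is only for illustration; the real CLI presumably uses whatever argument parser afx already has.

```typescript
// Hypothetical options object for `afx workspace recover`.
interface RecoverOptions {
  apply: boolean;        // false means dry-run
  maxAgeDays: number;    // defaults to 7
  includeStale: boolean; // bypasses the age check entirely
}

// Parses the arguments that follow `afx workspace recover`.
function parseRecoverArgs(argv: string[]): RecoverOptions {
  const opts: RecoverOptions = { apply: false, maxAgeDays: 7, includeStale: false };
  for (let i = 0; i < argv.length; i++) {
    const arg = argv[i];
    if (arg === "--apply") {
      opts.apply = true;
    } else if (arg === "--include-stale") {
      opts.includeStale = true;
    } else if (arg === "--max-age") {
      // Consume the next token as the number of days.
      const days = Number(argv[++i]);
      if (!Number.isFinite(days) || days <= 0) {
        throw new Error("--max-age expects a positive number of days");
      }
      opts.maxAgeDays = days;
    } else {
      throw new Error(`unknown argument: ${arg}`);
    }
  }
  return opts;
}
```

Rejecting unknown arguments keeps a mistyped --apply from silently degrading into a dry run that the user believes was applied.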
Out of scope (separate issue)
- Conversation resume: restoring the builder's prior Claude session content (the ~/.claude/projects/<encoded-cwd>/<uuid>.jsonl file). Requires capturing the Claude session UUID at spawn time, storing it in terminal_sessions, and passing --resume <uuid> to the relaunched claude process. Worth it for mid-phase recovery but materially larger scope; file separately if/when mid-phase crashes become a real pain point.
- Auto-trigger on Tower startup: explicitly rejected here to avoid reviving abandoned-but-uncleaned projects. Revisit if explicit-trigger friction proves high.
Implementation notes
- Reuse findStatusPath() (state.ts:285–303) for status.yaml lookup.
- Liveness check: SQLite terminal_sessions row → PID alive + socket connectable. Reconciliation logic in tower-terminals.ts:485–711 already detects dead sockets; recovery extends that path with respawn instead of just cleanup.
- Respawn path: invoke the same flow as afx spawn <id> --resume rather than building a parallel codepath.
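The idea of extending reconciliation with respawn instead of cleanup might look roughly like this. Everything here is an assumption for illustration: the SessionRow fields, the Probes interface, and planActions are hypothetical and do not reflect the actual code in tower-terminals.ts.

```typescript
// Hypothetical projection of a terminal_sessions row.
interface SessionRow {
  projectId: string;
  pid: number;
  socketPath: string;
}

// Hypothetical liveness probes, injected so the planning logic stays pure.
interface Probes {
  pidAlive(pid: number): boolean;
  socketConnectable(path: string): boolean;
}

interface Action {
  kind: "keep" | "cleanup" | "respawn";
  projectId: string;
}

// A session needs recovery when the PID is dead OR the socket is unreachable.
function needsRecovery(row: SessionRow, probes: Probes): boolean {
  return !probes.pidAlive(row.pid) || !probes.socketConnectable(row.socketPath);
}

// With `recover` false this mirrors cleanup-only reconciliation;
// with `recover` true dead sessions are queued for respawn instead.
function planActions(rows: SessionRow[], probes: Probes, recover: boolean): Action[] {
  return rows.map((row): Action => {
    if (!needsRecovery(row, probes)) {
      return { kind: "keep", projectId: row.projectId };
    }
    return { kind: recover ? "respawn" : "cleanup", projectId: row.projectId };
  });
}
```

Sharing one detection path between cleanup and recovery avoids two liveness definitions drifting apart.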