afx workspace recover: revive builders after machine reboot/crash #829

@amrmelsayed

Problem

After a machine reboot or crash, all builder processes die — shellper daemons (detached, in-memory) and Tower together. Disk-persistent state survives: worktrees in .builders/<id>/, porch status.yaml, SQLite terminal_sessions, spec/plan/review files. But there is no command to walk that state and respawn the builders that should still be running.

The user typically discovers a half-dozen "dead" builders the morning after a reboot and has to run afx spawn <id> --resume for each one by hand. Most builders are sitting idle at approval gates when this happens, so a fresh respawn (which lands them right back at the gate via status.yaml) is functionally equivalent to true session resume.

Proposal

Add an explicit afx workspace recover command (no auto-trigger on Tower startup) that:

  1. Enumerates codev/projects/*/status.yaml (using findStatusPath() to handle spec-653 multi-PR layouts under .builders/<id>/codev/projects/).
  2. Filters to eligible projects via the predicate below.
  3. By default, runs in dry-run mode: lists what it would revive and exits. Pass --apply to actually respawn.
  4. For each eligible project, respawns the builder via the existing afx spawn <id> --resume codepath.
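The four steps above can be sketched as a single function, dry-run by default. This is a minimal illustration, not the proposed implementation: Candidate, RecoverResult, recover, and the respawn callback are hypothetical names, and the callback only stands in for the existing afx spawn <id> --resume codepath.

```typescript
// Hypothetical shapes for illustration; none of these names come from the codebase.
interface Candidate {
  id: string;
  eligible: boolean; // result of the revival predicate for this project
}

interface RecoverResult {
  wouldRevive: string[]; // what a dry run reports
  revived: string[]; // what was actually respawned (empty unless apply is set)
}

// `respawn` stands in for the existing `afx spawn <id> --resume` flow.
function recover(
  candidates: Candidate[],
  apply: boolean,
  respawn: (id: string) => void,
): RecoverResult {
  // Step 2: keep only the projects the predicate accepted.
  const wouldRevive = candidates.filter((c) => c.eligible).map((c) => c.id);
  const revived: string[] = [];

  // Step 3: without --apply, stop after listing.
  if (apply) {
    // Step 4: respawn each eligible builder through the shared codepath.
    for (const id of wouldRevive) {
      respawn(id);
      revived.push(id);
    }
  }
  return { wouldRevive, revived };
}
```

Keeping the dry-run and apply paths in one function means the listing a user sees is exactly the set that --apply would act on.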

Revival predicate

A project is eligible when ALL of:

  1. phase ∉ {verified, complete} — not in a terminal porch state.
  2. A terminal_sessions row exists for this project (it had a builder previously).
  3. The shellper PID is dead OR the Unix socket is unreachable (actually needs recovery, not already alive).
  4. The worktree at .builders/<id>/ still exists on disk.
  5. status.yaml.updated_at is within the last 7 days (configurable via --max-age <days>; bypass entirely with --include-stale).
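The five conditions can be expressed as a pure predicate over an already-collected snapshot. The sketch below assumes the caller has gathered the facts beforehand; ProjectSnapshot, RecoverOptions, and isEligible are hypothetical names, and the field names are illustrative rather than the actual status.yaml or terminal_sessions schema.

```typescript
// Hypothetical snapshot of the facts the predicate needs; field names are assumptions.
interface ProjectSnapshot {
  phase: string; // porch phase from status.yaml
  hasTerminalSession: boolean; // a terminal_sessions row exists
  shellperPidAlive: boolean;
  socketReachable: boolean;
  worktreeExists: boolean; // .builders/<id>/ is still on disk
  updatedAt: Date; // status.yaml updated_at
}

interface RecoverOptions {
  maxAgeDays: number; // --max-age, default 7
  includeStale: boolean; // --include-stale
}

const TERMINAL_PHASES = new Set(["verified", "complete"]);
const MS_PER_DAY = 24 * 60 * 60 * 1000;

function isEligible(
  p: ProjectSnapshot,
  opts: RecoverOptions,
  now: Date = new Date(),
): boolean {
  if (TERMINAL_PHASES.has(p.phase)) return false; // condition 1
  if (!p.hasTerminalSession) return false; // condition 2
  // Condition 3: dead PID OR unreachable socket means recovery is needed.
  if (p.shellperPidAlive && p.socketReachable) return false;
  if (!p.worktreeExists) return false; // condition 4
  if (opts.includeStale) return true; // condition 5 bypassed
  const ageMs = now.getTime() - p.updatedAt.getTime();
  return ageMs <= opts.maxAgeDays * MS_PER_DAY; // condition 5
}
```

Passing now as a parameter keeps the age check deterministic in tests.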

Notes:

  • Builders idle at the pr gate are revived (consistent with all other gate-idle builders; the resource cost is negligible and the human may want to dispatch the builder to address review feedback without manual respawn).
  • Projects whose status.yaml was never written (builder crashed before first phase transition) won't show up; this is the conversation-resume gap and is out of scope.

CLI

afx workspace recover                          # dry-run, prints eligible projects
afx workspace recover --apply                  # actually respawn
afx workspace recover --max-age 14             # dry-run with a 14-day staleness window
afx workspace recover --include-stale --apply  # respawn, skipping the age check
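The flag semantics can be sketched as a small hand-rolled parser. This only illustrates the defaults described in this issue (dry-run, 7-day window); parseRecoverFlags and RecoverFlags are hypothetical, and the real afx CLI may wire its flags through a different argument-parsing mechanism.

```typescript
// Hypothetical flag container; names are illustrative, not from the afx codebase.
interface RecoverFlags {
  apply: boolean;
  maxAgeDays: number;
  includeStale: boolean;
}

// Parses the arguments that follow `afx workspace recover`.
function parseRecoverFlags(argv: string[]): RecoverFlags {
  // Defaults per the proposal: dry-run, 7-day staleness window.
  const flags: RecoverFlags = { apply: false, maxAgeDays: 7, includeStale: false };
  for (let i = 0; i < argv.length; i++) {
    const arg = argv[i];
    if (arg === "--apply") {
      flags.apply = true;
    } else if (arg === "--include-stale") {
      flags.includeStale = true;
    } else if (arg === "--max-age") {
      // Consume the next argument as the number of days.
      const days = Number(argv[++i]);
      if (!Number.isFinite(days) || days <= 0) {
        throw new Error("--max-age expects a positive number of days");
      }
      flags.maxAgeDays = days;
    } else {
      throw new Error(`unknown flag: ${arg}`);
    }
  }
  return flags;
}
```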

Out of scope (separate issue)

  • Conversation resume — restoring the builder's prior Claude session content (the ~/.claude/projects/<encoded-cwd>/<uuid>.jsonl file). Requires capturing the Claude session UUID at spawn time, storing it in terminal_sessions, and passing --resume <uuid> to the relaunched claude process. Worth it for mid-phase recovery but materially larger scope; file separately if/when mid-phase crashes become a real pain point.
  • Auto-trigger on Tower startup — explicitly rejected here to avoid reviving abandoned-but-uncleaned projects. Revisit if explicit-trigger friction proves high.

Implementation notes

  • Reuse findStatusPath() (state.ts:285–303) for status.yaml lookup.
  • Liveness check: SQLite terminal_sessions row → PID alive + socket connectable. Reconciliation logic in tower-terminals.ts:485–711 already detects dead sockets; recovery extends that path with respawn instead of just cleanup.
  • Respawn path: invoke the same flow as afx spawn <id> --resume rather than building a parallel codepath.
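The liveness check described above could look roughly like the following under Node.js. This is a sketch under stated assumptions, not the existing logic in tower-terminals.ts: isPidAlive, isSocketReachable, and needsRecovery are hypothetical helpers, and the 500 ms timeout is an arbitrary illustrative value.

```typescript
import * as net from "node:net";

// Signal 0 performs the existence/permission check without delivering a signal.
function isPidAlive(pid: number): boolean {
  if (!Number.isInteger(pid) || pid <= 0) return false; // avoid process-group semantics
  try {
    process.kill(pid, 0);
    return true;
  } catch (err) {
    // EPERM means the process exists but belongs to another user.
    return (err as { code?: string }).code === "EPERM";
  }
}

// Resolves true only if a connection to the Unix socket succeeds within timeoutMs.
function isSocketReachable(socketPath: string, timeoutMs = 500): Promise<boolean> {
  return new Promise((resolve) => {
    const socket = net.createConnection({ path: socketPath });
    const timer = setTimeout(() => finish(false), timeoutMs);
    function finish(ok: boolean): void {
      clearTimeout(timer);
      socket.destroy();
      resolve(ok);
    }
    socket.once("connect", () => finish(true));
    socket.once("error", () => finish(false));
  });
}

// Mirrors the predicate: dead PID OR unreachable socket.
async function needsRecovery(pid: number, socketPath: string): Promise<boolean> {
  if (!isPidAlive(pid)) return true;
  return !(await isSocketReachable(socketPath));
}
```

Checking the PID first avoids a socket connection attempt for builders whose shellper is already known to be gone.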

Labels: area/tower (Area: Tower server / agent farm CLI)