Skip to content

[FORGE-22] feat(orchestrator): gc reconciler + retry policy (P2-T09)#203

Merged
firatcand merged 4 commits into
mainfrom
feat/FORGE-22
May 20, 2026
Merged

[FORGE-22] feat(orchestrator): gc reconciler + retry policy (P2-T09)#203
firatcand merged 4 commits into
mainfrom
feat/FORGE-22

Conversation

@firatcand
Copy link
Copy Markdown
Owner

Summary

Ships the gc reconciler + retry policy from spec/ORCHESTRATOR.md §"gc reconciliation rules". Pure planner (src/orchestrator/gc.ts) drives 14 row detectors; CLI shim (src/cli/orchestrate/gc.ts) builds the snapshot, formats --dry-run, and executes the plan. Read-band verbs (phases --ready, status) now emit detect-and-warn auto-gc on stderr without mutating.

  • Retry policysrc/orchestrator/retry.ts (pure): Symphony backoff min(1000 * 2^(attempt-1), retry_backoff_ms_max), retry_attempts cap, failure_reason discriminator on TaskStateRecord (decision_key_budget | retries_exhausted | fatal).
  • gc planner — 14 row detectors returning a discriminated union (mark_terminal | mark_abandoned | mark_unclaimed | archive_question | reverify_verdict | release_lease_admin | prune_branch | report_orphan). Cheap-rows set frozen as {2, 5, 8, 11, 12, 13, 14}.
  • Identity-gated lease admin releaseadminReleaseLeaseByIdentity in src/orchestrator/leases.ts. Four-field identity check (claim_id, generation, owner_run_id, expires_at) + verify-before-unlink + row-14 terminal-state re-check immediately before unlink. Closes the heartbeat-renewal TOCTOU.
  • Auto-gc consumersdetectCheapDivergences() invoked from phases --ready and status. Detect-only; warnings to stderr in form [gc] <task>: row <N> (<action>) — run \forge orchestrate gc` to apply`. Spec §640 amended to document this.
  • Legacy migration preserved — FORGE-73 question-tree migration runs as Phase 0 of the new gc verb; all 6 pre-existing legacy tests still pass.

Why

Phase 2 / P0. Unblocks Phase 3 (FORGE-25 doctor checks, FORGE-26 perf-test harness, FORGE-27 cross-host CI matrix) which all depend on a deterministic gc primitive. Also unblocks FORGE-116 (/wrap-up skill) which needs gc as its end-of-task housekeeping engine.

Plan went through two Codex reviews (8 + 7 findings, all addressed); implementation went through one more Codex pass (7 findings — 2 BLOCK on TOCTOU, all fixed in 164800b).

Test plan

  • npm run typecheck — clean
  • npm test — 1355 pass / 0 fail / 19 skipped (+78 from base 1277)
  • npm run build — ESM 369.93 kB / CJS 403.04 kB
  • gitleaks detect — no leaks
  • Three Codex passes (plan + plan rev-2 + diff) — all findings resolved
  • FORGE-73 legacy migration regression — all 6 tests still pass
  • Heartbeat-race scenario — new test fires heartbeat() between snapshot and admin-release; identity mismatch correctly refuses the unlink

Manual smoke: node dist/bin/forge.cjs orchestrate gc --dry-run on an empty .forge/orchestrator → "no divergences found"; orchestrate gc --help synopsis reflects new scope.

Deferred to follow-up tickets

  • Extend src/trackers/types.ts:Issue with claim metadata → unblocks gc row 4 auto-recovery and full row-2 tracker-side variant
  • Emit DECISION_KEY_EXHAUSTED from the question-channel max_attempts site (ties to FORGE-65) → activates the consumer wiring shipped here
  • Add last_failed_at to TaskStateRecord + writer in complete.ts → activates retry-eligibility filter in phases --ready
  • Full executor for mark_terminal / mark_abandoned / mark_unclaimed / reverify_verdict / prune_branch (currently emit reconcilerDeferred warnings)
  • JSON-mode warning surface for stderr-suppressing scripts

Closes FORGE-22.

🤖 Generated with Claude Code

firatcand and others added 4 commits May 20, 2026 22:03
… + failure_reason discriminator

[FORGE-22] P2-T09 units 1+2 — foundation for the gc reconciler.

src/orchestrator/retry.ts (new, pure)
  backoffMs(), nextEligibleAt(), nextRetryState(), classifyError().
  Symphony formula: min(1000 * 2^(attempt-1), retry_backoff_ms_max).
  Defaults: retry_attempts=10, retry_backoff_ms_max=300_000 (matches
  agents.* in settings.yaml — verified against the existing keys).

src/schemas/task-state.ts
  Adds optional failure_reason: 'decision_key_budget' | 'retries_exhausted'
  | 'fatal' with zod refinement: present iff state === 'failed'.
  Schema-additive; existing state.json files round-trip unchanged.

src/core/errors.ts
  Adds DECISION_KEY_EXHAUSTED (consumer wiring only — producer site
  deferred to follow-up per FORGE-22 plan §Scope).
  Adds LEASE_IDENTITY_MISMATCH + LEASE_STATE_NOT_TERMINAL for the
  new admin-release primitive.

src/orchestrator/leases.ts
  Adds adminReleaseLeaseByIdentity({forgeDir, taskId, expectedClaimId,
  expectedGeneration, expectedOwnerRunId, expectedPath,
  requireTerminalState, reason}). Identity-gated, NOT a generic
  ownership-bypass: caller supplies exact lease identity captured at
  snapshot time, executor re-reads + confirms (claim_id, generation,
  owner_run_id) match before unlink. Row-14 path additionally
  re-reads state.json and confirms state ∈ TERMINAL_TASK_STATES.
  Records event: 'admin_released' in claim-history.jsonl with reason.
  Codex 2nd-pass #2 resolution.

Tests: +30 net (16 retry + 9 admin-release + 5 task-state invariant).
Full suite: 1307 pass / 0 fail / 19 skipped.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…onciler

[FORGE-22] P2-T09 units 3+4 — the deterministic reconciler.

src/orchestrator/gc.ts (new, pure)
  planGc(snapshot) → GcPlan. Discriminated union on `action`:
    mark_terminal | mark_abandoned | mark_unclaimed |
    archive_question | reverify_verdict |
    release_lease_admin | prune_branch | report_orphan.
  14 row detectors mirror spec/ORCHESTRATOR.md §gc reconciliation rules.
  CHEAP_ROW_IDS = {2, 5, 8, 11, 12, 13, 14} — local-only checks for
  detect-and-warn auto-gc consumers (Unit 5).
  Deterministic ordering: (rowId, taskId). assertNeverGcRow() locks
  exhaustiveness at the executor's switch.
  Row 4 auto-recovery DEFERRED (Codex 2nd-pass #1 — Issue lacks claim
  metadata); planner emits report_orphan with kind:
  tracker_claimed_no_local for detection.

src/cli/orchestrate/gc.ts (refactored)
  Two-phase verb:
    Phase 0 — legacy v1 question-tree migration (FORGE-73, preserved
    verbatim; 6 existing tests still pass).
    Phase 1 — buildSnapshot() → planGc() → format (dry-run) or
    executeRow().
  Snapshot scans local state: state.json, lease.json* (multi-lease
  for row 13), attempts/ (verdict, question, answer files).
  Tracker integration deferred — empty trackerIssues means rows
  1/3/4/6/7 don't fire in this PR.
  Executor dispatches: release_lease_admin via adminReleaseLeaseByIdentity;
  archive_question via mkdir+rename. Other actions (mark_terminal,
  mark_abandoned, mark_unclaimed, reverify_verdict, prune_branch)
  emit stderr warnings — deferred to follow-up, NOT errors.
  exitCode is non-zero only on actual mutation failure.

src/cli/orchestrate/index.ts
  gcHandler synopsis updated to reflect reconciler scope.

Tests:
  test/orchestrator/gc.test.ts (new, 30 tests):
    14 row detectors with fixture snapshots, cheap-set partition,
    determinism, no-op alignment, GcPlanRow exhaustiveness check.
  test/cli/orchestrate/gc-reconciler.test.ts (new, 8 tests):
    Clean tree, dry-run vs apply, row 14 end-to-end, row 13
    duplicate-lease release, row 11 archive, multi-row plan,
    idempotence, legacy + reconciler coexist.

Full suite: 1345 pass / 0 fail / 19 skipped (+38 from base 1307).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
[FORGE-22] P2-T09 units 5+6 — auto-gc consumer wiring + spec amendment.

src/cli/orchestrate/gc.ts
  Adds detectCheapDivergences(forgeDir, err, now) — reusable helper that
  builds a 'cheap' snapshot, runs planGc, and emits one stderr warning
  per row: `[gc] <task>: row <N> (<action>) — run \`forge orchestrate gc\`
  to apply`. Best-effort; any I/O failure short-circuits silently so
  the host verb never breaks.

src/cli/orchestrate/phases.ts + status.ts
  Both read-band verbs invoke detectCheapDivergences() at the start of
  the --ready path / status assembly. Warnings to stderr only — stdout
  JSON contract unchanged. Read-band invariant preserved (no mutation
  from read verbs).

spec/ORCHESTRATOR.md (2-line amendment)
  Line 601: clarify that read-band invocations DETECT (not "trigger
  mutations"); explicit `gc` is the only mutator.
  Line 640: replace "lightweight reconcile runs on every …" with
  "lightweight reconcile DETECTS the cheap rows …; read-band verbs
  NEVER mutate." Stderr warning format documented inline.

Codex 2nd-pass #3 (spec amendment scope) + #4 (read-band mutation
pattern break) resolved per FORGE-22 plan rev-3.

Tests: test/cli/orchestrate/phases-autogc.test.ts (8 tests):
  unit tests of detectCheapDivergences (row 14, row 2, clean, no-mutation,
  best-effort I/O, expensive-row exclusion) + wiring smoke tests
  confirming phases.ts and status.ts both import the helper.

Full suite: 1353 pass / 0 fail / 19 skipped (+8 from previous commit).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…on safety

[FORGE-22] Codex 3rd-pass review applied (diff against rev-3
implementation, 7 findings — 2 BLOCK, 4 IMPROVEMENT/CONSIDER fixed,
1 doc clarification).

BLOCK 1 (confidence 10) — `adminReleaseLeaseByIdentity` could unlink a
freshly-renewed lease.
  heartbeat() preserves (claim_id, generation, owner_run_id) and only
  advances expires_at + last_heartbeat_at, so a heartbeat firing between
  snapshot and unlink slipped through the 3-field identity check.
  Fix: identity now requires a fourth field `expectedExpiresAt` (caller
  captures at snapshot time, executor confirms). Defense-in-depth:
  verify-before-unlink re-reads the lease one final time and re-confirms
  all four fields match. Row 14 additionally re-reads state.json
  immediately before unlink and re-confirms terminal state — complete.ts
  can transition shipped → running on changes_requested without a
  legal-transition gate, so the stale terminal check was racey.
  src/orchestrator/leases.ts: AdminReleaseByIdentityOptions extended;
  step 4 (verify lease) + step 4b (re-verify state) added between
  identity check and unlink.
  src/orchestrator/gc.ts: GcPlanRow.release_lease_admin payload schema
  extended; row 13 + row 14 planners capture lease.expires_at.

BLOCK 2 (confidence 9) — Row 13 tie sort could target the canonical lease.
  readdirSync order is FS-dependent; equal-generation leases sorted by
  generation alone could put the canonical file in the to-release set.
  Fix: tie-breaker prefers isCanonical=true. Path order as final fallback
  for deterministic behavior under all-corruption scenarios.

CONSIDER 3 (confidence 9) — schema-iff refinement could break adopter
state.json files with state=failed and no failure_reason.
  Fix: loosen to asymmetric invariant — failure_reason IMPLIES
  state === 'failed' (not iff). Adopter state.json with state=failed and
  no reason continues to parse; readers default to treating absent reason
  as 'fatal' classification. Migration-safe.

CONSIDER 4 (confidence 9) — OrchestrateGcResult couldn't distinguish "no
divergence detected" from "all detected rows deferred to follow-up."
  Fix: add `reconcilerDeferred?: readonly GcPlanRow[]` field. Apply mode
  populates both `reconcilerRows` (applied) and `reconcilerDeferred`
  (deferred — stderr-warned but not mutated). Programmatic callers can
  now tell the difference.

IMPROVEMENT 5 (confidence 8) — detectCheapDivergences swallowed
snapshot-build failures silently.
  Fix: catch path now writes "[gc] auto-detect failed (continuing):
  <reason>" to stderr before returning. Operator sees corrupted
  .forge/orchestrator trees instead of mysterious empty output.

CONSIDER 6 (confidence 8) — row-2 tracker-side variant gap not
explicitly noted in code.
  Fix: detectRow2 doc comment now explicitly says the tracker-side
  sub-case is DEFERRED (the CLI snapshot passes empty trackerIssues).

Non-issues 7 & 8 (confidence 8 each) — discriminated-union exhaustiveness
intact; --dry-run does not mutate either phase. Confirmed by Codex.

Tests:
  +1 in leases.test.ts: heartbeat-race scenario — heartbeat fires between
  snapshot and admin-release; identity mismatch on expires_at refuses the
  unlink.
  Updated task-state schema tests: state='failed' without failure_reason
  now ACCEPTED (asymmetric invariant); failure_reason without
  state='failed' still rejected.
  All 9 existing admin-release call sites updated to pass
  expectedExpiresAt.

Full suite: 1355 pass / 0 fail / 19 skipped (+2 from previous commit).
Typecheck clean. Build OK.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@firatcand firatcand merged commit 207ee41 into main May 20, 2026
10 checks passed
@firatcand firatcand deleted the feat/FORGE-22 branch May 20, 2026 22:02
firatcand added a commit that referenced this pull request May 20, 2026
Brings in FORGE-22 (gc reconciler + retry policy, PR #203) and FORGE-89
(/codex → /second-opinion rename, PR #205).

Resolved conflicts in spec/CONTEXT.md + spec/PRD.md by combining our
structural admonitions (deferred-to-v0.5 boxes, Decision 10/11
strikethrough+redirect, settings.yaml decisions-block annotation) with
their /codex → /second-opinion renames in the preserved v0.5 prose.
No semantic loss either side; both hygiene threads compose cleanly.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant