diff --git a/codev/plans/982-vscode-tower-no-active-termina.md b/codev/plans/982-vscode-tower-no-active-termina.md new file mode 100644 index 000000000..06d266c7a --- /dev/null +++ b/codev/plans/982-vscode-tower-no-active-termina.md @@ -0,0 +1,103 @@ +# PIR Plan: Make the "No active terminal" click self-heal (and only suggest recovery as a last resort) + +## Understanding + +Clicking a builder row in the Codev sidebar can produce a dead-end warning toast: + +``` +Codev: No active terminal for +``` + +…and nothing else happens. The row keeps looking healthy, but the click can't open a terminal. + +**Why the two surfaces disagree.** The sidebar and the terminal-opener read different Tower sources: + +- The **sidebar** renders from `OverviewBuilder` (overview cache / `/api/overview`), which is disk-sourced (worktree + `status.yaml`) and has no `terminalId` field (`packages/types/src/api.ts`, `OverviewBuilder`). A builder shows up as long as its worktree exists. +- The **opener** (`openBuilderByRoleOrId`, `packages/vscode/src/terminal-manager.ts:186-215`) calls `getWorkspaceState` (`/api/state`), whose handler builds the `builders` array by iterating `entry.builders` and **including a builder only if its live PTY session resolves** (`tower-routes.ts:~1840` — `const session = manager.getSession(terminalId); if (session) { … }`). No live session → the builder is omitted → `resolveAgentName` finds no match → `builder` is `undefined` → the bare warning fires at `terminal-manager.ts:206-208`. + +**The dominant cause is transient, not a destroyed session.** Every `/api/state` request first runs `getRehydratedTerminalsEntry` (`tower-terminals.ts:164-168`), which awaits a rehydrate + an **on-the-fly shellper reconnect** for any session that's in SQLite but not currently in the in-memory registry (`tower-terminals.ts:780-886`). A click that lands: + +- mid-rehydration / mid-reconnect (sub-100ms, longer under load), +- during Tower's startup reconciliation window (`_reconciling` blocks the on-the-fly reconnect, `tower-terminals.ts:53`/`:780` — seconds), or +- in the spawn race (the worktree/`status.yaml` is visible to `/api/overview` a beat before the PTY session is registered in `entry.builders`) + +…sees a momentary miss that **resolves on the very next call.** The session is not gone; it just isn't resolvable for a beat. + +**The actual defect is the missing retry.** The open path surfaces the warning on the *first* miss with no second attempt (`terminal-manager.ts:206-208`) — Tower would have answered correctly a few hundred milliseconds later. So the user hits a dead-end for a condition that self-heals. + +(Session genuinely destroyed while Tower runs — non-shellper 5-min idle reap, PTY exit past the 30s grace, or a dead shellper after machine sleep — is the *minority* tail. That tail is what `afx workspace recover` addresses, and it should be a secondary suggestion, not the headline.) + +## Proposed Change + +Reframe the fix around the dominant transient case: **let the click heal itself silently**, and only surface a toast on the genuinely-persistent tail — where recovery is a secondary, last-resort line, not the primary action. + +### 1. Primary: bounded silent auto-retry in `openBuilderByRoleOrId` (`terminal-manager.ts:193-209`) + +When the lookup misses (`!builder?.terminalId`), re-query `getWorkspaceState` a few times with a short backoff before surfacing anything. Each retry re-triggers Tower's rehydrate + on-the-fly reconnect, which is exactly the self-heal path. + +**Reuse the system's shared backoff convention, not a hand-rolled delay.** The codebase consolidated four hand-rolled `Math.min(1000 * 2^attempt, cap)` curves into `packages/core/src/reconnect-policy.ts` (#961) specifically so new call sites don't re-invent it; the two closest analogues — `connection-manager.ts` (Tower/SSE reconnect, via `backoffDelayMs`) and `terminal-adapter.ts` (terminal WebSocket reconnect, via `BackoffController`) — already use it. So this retry will call `backoffDelayMs(attempt, opts)` from `@cluesmith/codev-core/reconnect-policy` with a **local attempt counter** (the bare-function form, matching how the SSE/tunnel sites use it — `BackoffController`'s status-machine + give-up wrapper is for event-driven onClose scheduling we don't need here). + +- The module's *defaults* (base 1000ms, cap 30_000ms, 6 attempts → `1s,2s,4s,8s,16s,30s`) are tuned for persistent reconnect loops and are too slow for an interactive click. Use interactive-tuned options instead: `backoffDelayMs(attempt, { baseMs: 150, capMs: 800 })` over ~3–4 attempts → roughly `150ms, 300ms, 600ms` (total budget ~1s). Constants live next to the call site with a comment explaining the interactive tuning vs. the reconnect defaults. +- The first attempt is the existing call (no added latency on the happy path; retries happen only on the miss branch). +- On success at any attempt → open the builder terminal normally. **No toast** — the transient case becomes invisible, which is the best UX and directly implements the issue's "self-recovers gracefully on the next overview tick" acceptance bullet. +- The retry loop only re-fetches and re-resolves; it does not change the happy path (a first-attempt hit returns immediately). +- `classifyUpgradeError` (the module's third export) is **not** used — it classifies WebSocket close codes; our `/api/state` poll gets a builder list, so "resolved or not after N tries" is the only signal. + +### 2. Secondary: an actionable toast only when retries are exhausted (likely-persistent) + +If every attempt still misses, the session is probably genuinely gone — show a toast that leads with a **manual retry** and treats recovery as the fallback: + +- Message (neutral, not over-promising recovery): e.g. `Codev: #'s terminal isn't available — it may still be starting, or its session was dropped. Retry, or recover builders if it was lost.` (Use the friendly `# ` identity from the resolved `OverviewBuilder` row when available.) +- **"Retry" button** (primary) → re-invokes the open (which runs the auto-retry again). Covers the longer startup-reconciliation window where ~1.2s wasn't enough. +- **"Recover Builders" button** (secondary, last resort) → opens a terminal at the workspace root running `afx workspace recover` (dry-run preview), mirroring `commands/run-worktree-setup.ts:51-56` (`createTerminal({ name, cwd: workspacePath })` + `sendText('afx workspace recover')`). Stays at the dry-run because recover is workspace-wide (cannot target one builder) — the user reviews scope before re-running with `--apply`. `workspacePath` is already in scope at `terminal-manager.ts:188`. + +Button handling uses the established positional-args pattern (`notifications/gate-toast.ts:109-139`). + +### What changed from the first draft (and why) + +The first draft led with recovery (per the issue's options 1–2). That mis-weights the problem: the case actually seen in practice is transient unavailability that needs no recovery at all. Recovery is now demoted to the persistent tail, and the headline is the silent auto-retry that makes the dominant case disappear. + +### Deferred options (with reasons) + +- **Option 3 (sidebar icon for dropped sessions)** — `OverviewBuilder` carries no liveness/`terminalId` signal (confirmed unchanged on `main`; `overview.ts`'s recent #907 change added an `area` enrichment cache, not liveness). Flagging the row before a click would need a `hasLiveSession` field threaded from Tower's in-memory registry through the overview server and `@cluesmith/codev-types` — cross-package blast radius, and the auto-retry already removes the dominant dead-end. Recommend a separate follow-up. (Also: with auto-retry, a transient miss never even reaches a "dropped" state worth flagging.) +- **Option 4 (auto-recover on activation)** — a behavior decision the issue defers; out of scope. +- **Option 5 (persist Tower's session registry)** — the root-cause fix for the *destroyed-session* tail; explicitly a separate, larger discussion; out of scope. + +## Files to Change + +- `packages/vscode/src/terminal-manager.ts:193-209` — in `openBuilderByRoleOrId`, wrap the resolve in a bounded retry (re-fetch `getWorkspaceState`, re-run `resolveAgentName`, ~3–4 attempts with delays from `backoffDelayMs(attempt, { baseMs: 150, capMs: 800 })` and a local counter). On exhaustion, call a small private helper for the actionable toast (`Retry` + secondary `Recover Builders`). Keep the `ambiguous` and `not connected` branches as-is. Factor the toast + the sleep into helpers so the method stays readable and unit-testable; reuse the already-fetched `workspacePath`. +- `packages/vscode/src/terminal-manager.ts` (imports) — add `import { backoffDelayMs } from '@cluesmith/codev-core/reconnect-policy';` (same import surface `connection-manager.ts` / `terminal-adapter.ts` already use). +- `packages/vscode/src/__tests__/terminal-manager.test.ts` — extend the suite: (a) **miss-then-hit** → `getWorkspaceState` returns no session on attempt 1 and a session on attempt 2 → builder terminal opens, `showWarningMessage` NOT called; (b) **all-miss** → toast shown with `Retry` + `Recover Builders` labels; (c) selecting **Retry** re-attempts the open; (d) selecting **Recover Builders** → `createTerminal` with `cwd === workspacePath` + `sendText('afx workspace recover')`; (e) happy path (first-attempt session present) → opens immediately, no extra fetches, no warning. Inject a fake/fast sleep so tests don't wait real time. + +No `package.json` command contribution needed (buttons handled inline). No types/server changes. + +## Risks & Alternatives Considered + +- **Risk: auto-retry adds latency.** Only on the miss branch, and bounded (~1.2s). The happy path (first-attempt hit) is unchanged — returns immediately. Mitigation: keep attempts/backoff small and configurable as constants. +- **Risk: retry masks a real persistent failure.** Bounded retries fail fast to the actionable toast; we don't retry indefinitely. The persistent tail still gets a clear signal + recovery path. +- **Risk: startup-reconciliation window exceeds the auto-retry budget (seconds).** Then attempt-set 1 falls through to the toast, but the **Retry** button (and a later natural re-click) succeeds once `_reconciling` clears. Acceptable: the toast is now actionable rather than a dead-end. +- **Risk: `afx workspace recover` doesn't help the live-process/dropped-session sub-case.** True, but it's the correct tool for the dead-shellper tail and is now secondary. The message doesn't over-promise ("it may still be starting, or its session was dropped"). +- **Alternative: lead with recovery (first draft).** Rejected — mis-weights the problem; the dominant case needs no recovery. +- **Alternative: fix it Tower-side (block `/api/state` until rehydration fully settles).** Rejected for this issue — larger blast radius on a hot endpoint, and the client-side retry is the smaller, safer change that fixes the user-visible symptom. Tower-side hardening can be a follow-up (overlaps Option 5). +- **Risk: happy-path regression.** Mitigated by confining changes to the miss branch + an explicit happy-path test. + +## Test Plan + +**Unit (`packages/vscode/src/__tests__/terminal-manager.test.ts`, run via vitest — `pnpm --filter codev-vscode test:unit`):** +- Miss-then-hit → terminal opens, no warning (the transient self-heal). +- All attempts miss → actionable toast with `Retry` + `Recover Builders`. +- `Retry` → re-attempts the open. +- `Recover Builders` → `createTerminal` with `cwd === workspacePath`; `sendText('afx workspace recover')`. +- Happy path → first-attempt session present → opens immediately, no warning, no extra fetches. + +**Manual (reviewer at the `dev-approval` gate — run the worktree):** +1. **Transient (the main case):** click a builder row right after a spawn / while Tower is settling. Expect the terminal to open after a brief pause with **no** dead-end toast (previously: instant warning). To force the window, click during the spawn race or just after a Tower restart while reconciliation runs. +2. **Persistent:** simulate a genuinely lost session (e.g. kill the shellper so it can't reconnect). Click the row → after the bounded retries, the actionable toast appears. Click **Retry** → still missing → toast returns (not a silent dead-end). Click **Recover Builders** → a terminal opens at the workspace root running `afx workspace recover` (dry-run); confirm cwd is the main checkout, not a worktree. +3. **Happy path:** click a healthy builder → terminal opens immediately, no toast (regression check). + +**Cross-platform:** n/a (desktop VSCode extension only). + +## Build / Verify Commands + +- `pnpm --filter codev-vscode test:unit` — the vitest unit suite (where the new tests live). +- `pnpm --filter codev-vscode compile` — typecheck + lint + esbuild bundle of the extension. diff --git a/codev/projects/982-vscode-tower-no-active-termina/982-review-iter1-rebuttals.md b/codev/projects/982-vscode-tower-no-active-termina/982-review-iter1-rebuttals.md new file mode 100644 index 000000000..7d9b70acf --- /dev/null +++ b/codev/projects/982-vscode-tower-no-active-termina/982-review-iter1-rebuttals.md @@ -0,0 +1,35 @@ +# PIR #982 — Review iteration 1 disposition + +Single advisory pass (`max_iterations: 1`). Verdicts: claude=APPROVE, codex=REQUEST_CHANGES, gemini=skipped. + +## codex — REQUEST_CHANGES (HIGH): Recover ran from the wrong cwd — ADDRESSED (fixed in code) + +**Finding:** The **Recover Builders** action launched `afx workspace recover` with `cwd: workspacePath` from `connectionManager.getWorkspacePath()`. Because `detectWorkspacePath` (`workspace-detector.ts`) walks up to the first directory containing `codev/`, and a builder worktree (`.builders/<id>`) is a full checkout with its own `codev/`, a VSCode window *rooted at* a worktree resolves `getWorkspacePath()` to the worktree — so recovery would run inside the worktree, contradicting repo guidance that `afx` must run from the main checkout root. + +**Assessment:** Real defect, confirmed against `workspace-detector.ts:9-20` (`findProjectRoot` stops at the first `codev/`). HIGH confidence is warranted for the worktree-rooted-window scenario. + +**Fix (commit on branch):** +- Added `mainCheckoutRoot(workspacePath)` in `packages/vscode/src/terminal-resolve.ts` — strips a trailing `/.builders/<id>` segment (leaf-only; tolerates trailing slash and Windows separators), returning a normal main-checkout path unchanged. +- `promptNoTerminalRecovery` now uses `cwd: mainCheckoutRoot(workspacePath)` for the recover terminal. + +**Regression coverage:** +- 5 behavioral tests for `mainCheckoutRoot` (unchanged path / strip worktree / trailing slash / non-leaf no-op / Windows separators). +- The `terminal-manager.ts` source-level guard now asserts `cwd: mainCheckoutRoot(workspacePath)` — it fails if the cwd reverts to the raw workspace path. + +**Scope note:** The same `cwd: workspacePath` pattern for an `afx` command pre-exists in `run-worktree-setup.ts` (`afx setup`) and other call sites. A consistent main-root resolution across *all* afx invocations is broader than #982; this PR fixes only the affordance it introduces and flags the rest as a follow-up. + +## codex — test-coverage note: ADDRESSED + +The original source-level regex test would not have caught a wrong cwd. The new behavioral `mainCheckoutRoot` tests plus the tightened source-level assertion now pin the corrected behavior. + +## gemini — skipped (non-blocking) + +Antigravity `agy` CLI not installed in this environment; no review content. No action. + +## claude — APPROVE + +No changes requested. + +## Escalation + +PIR is single-pass: this fix will **not** be independently re-reviewed by the models. The human at the `pr` gate is the only remaining reviewer of the fix — flagged in the architect notification. diff --git a/codev/projects/982-vscode-tower-no-active-termina/status.yaml b/codev/projects/982-vscode-tower-no-active-termina/status.yaml new file mode 100644 index 000000000..f4730308f --- /dev/null +++ b/codev/projects/982-vscode-tower-no-active-termina/status.yaml @@ -0,0 +1,32 @@ +id: '982' +title: vscode-tower-no-active-termina +protocol: pir +phase: verified +plan_phases: [] +current_plan_phase: null +gates: + plan-approval: + status: approved + requested_at: '2026-06-06T05:56:04.976Z' + approved_at: '2026-06-06T12:04:30.865Z' + dev-approval: + status: approved + requested_at: '2026-06-06T12:11:59.157Z' + approved_at: '2026-06-06T12:15:06.203Z' + pr: + status: approved + requested_at: '2026-06-06T12:25:53.858Z' + approved_at: '2026-06-06T12:29:16.508Z' +iteration: 1 +build_complete: true +history: [] +started_at: '2026-06-06T05:49:39.607Z' +updated_at: '2026-06-06T12:29:38.161Z' +pr_history: + - phase: implement + pr_number: 1006 + branch: builder/pir-982 + created_at: '2026-06-06T12:16:48.295Z' + merged: true + merged_at: '2026-06-06T12:29:38.160Z' +pr_ready_for_human: false diff --git a/codev/resources/lessons-learned.md b/codev/resources/lessons-learned.md index 4b8a2530d..40dde6466 100644 --- a/codev/resources/lessons-learned.md +++ b/codev/resources/lessons-learned.md @@ -406,6 +406,8 @@ Generalizable wisdom extracted from review documents, ordered by impact. Updated - [From 812] VS Code gives extensions no control over a panel view container's *position*. The `contributes.viewsContainers` schema is `id`/`title`/`icon` only — no `order`/`priority` — and there is no runtime reposition API. A newly contributed panel tab always lands after the built-ins (Problems/Output/Terminal/…) and is the first to spill into the `…` overflow on a normal-width window, making it effectively undiscoverable on first install. The only supported discoverability lever is a one-time reveal: call `workbench.view.extension.<containerId>` once, guarded by `context.globalState`, so it fires once per profile rather than nagging on every launch. Same constraint applies to *ordering within* a container (views order by registration, not a manifest field). Don't promise a panel surface will appear in a particular spot — it won't. +- [From 982] Two views backed by different freshness sources diverge *transiently*, not only on outright failure. The VSCode sidebar lists a builder from `/api/overview` (disk-sourced: worktree + status.yaml), but the terminal opener needs a live PTY session from `/api/state` (in-memory registry), which momentarily omits a builder while Tower rehydrates and on-the-fly-reconnects its session on every call. The opener warned on the *first* miss, dead-ending on a condition that self-heals on the very next call. Fix: a bounded retry (reuse the shared `backoffDelayMs` curve with interactive params, not a hand-rolled delay) absorbs the window so the dominant case opens silently; only a genuinely persistent miss surfaces an actionable recovery toast. Lead the recovery UX with the cheap self-heal (retry), not the heavy workaround (`afx workspace recover`), which only addresses the rare destroyed-session tail. Sibling to [From 916]: same divergence class, opposite direction (there the whole shared cache nulled; here one per-builder field is briefly stale). + - [From 916] A shared cache that nulls itself on a *transient* failure correlates the failure across every consumer at once. The VSCode sidebar's four data-views share one `OverviewCache`; each renders a falsy read as `[]`, so a brief SSE drop/reconnect that nulled the cache blanked all four simultaneously, then recovered on the next event. Fix: hold last-known-good — a not-connected state or a failed fetch returns without clobbering; only a successful fetch commits. Crucially, distinguish *failure-null* from *legitimately-empty*: the source-of-truth (`TowerClient.getOverview()`) returns `null` only on request failure and a real object with empty arrays for a genuinely empty workspace, so "keep last value on null" never masks real emptiness. The discriminating diagnostic was that one sibling view (Workspace) stayed populated because it reads a *different* source — when N views empty together but one doesn't, suspect a shared dependency of the N, not N independent bugs. --- diff --git a/codev/reviews/982-vscode-tower-no-active-termina.md b/codev/reviews/982-vscode-tower-no-active-termina.md new file mode 100644 index 000000000..2c33e5840 --- /dev/null +++ b/codev/reviews/982-vscode-tower-no-active-termina.md @@ -0,0 +1,64 @@ +# PIR Review: Self-heal the "No active terminal" click (recovery as last resort) + +Fixes #982 + +## Summary + +Clicking a builder row in the Codev sidebar could dead-end on an unactionable `Codev: No active terminal for <id>` warning. The sidebar lists a builder from `/api/overview` (disk-sourced) while the terminal opener needs a live PTY session from `/api/state` (in-memory registry), which momentarily omits a builder while Tower rehydrates / on-the-fly-reconnects sessions on every call. The dominant cause is therefore transient and self-healing, but the opener warned on the first miss. This change wraps the resolve in a bounded retry (reusing the shared `backoffDelayMs` curve with interactive-tuned params) so the transient case opens silently; only a genuinely persistent miss surfaces an actionable toast that leads with **Retry** and offers **Recover Builders** (`afx workspace recover`, dry-run) as a last resort. + +## Files Changed + +- `packages/vscode/src/terminal-resolve.ts` (+91 / -0) — new vscode-free retry/resolve helper +- `packages/vscode/src/terminal-manager.ts` (+81 / -11) — delegate to the helper; new recovery toast +- `packages/vscode/src/__tests__/terminal-resolve.test.ts` (+147 / -0) — 7 behavioral tests +- `packages/vscode/src/__tests__/terminal-manager.test.ts` (+36 / -0) — 5 source-level wiring tests + +## Commits + +- `ed8e3ea4` [PIR #982] Self-heal terminal open with bounded retry; recovery toast as last resort +- `8a52338b` [PIR #982] Tests for retry-resolve helper and recovery-toast wiring + +(Plus `[PIR #982]` plan and thread commits, and porch chore commits, on the branch.) + +## Test Results + +- `pnpm --filter codev-vscode compile` (check-types + lint + esbuild): ✓ pass +- `pnpm --filter codev-vscode test:unit`: ✓ pass (27 files, 353 tests; 17 new) +- Porch phase checks: `build` ✓ (6.5s), `tests` ✓ (20.6s) +- Manual verification: approved by the human at the `dev-approval` gate (running worktree). + +## Architecture Updates + +No `arch.md` changes needed. The server-side resilience for this window (the startup-reconcile barrier behind `/api/state` and `/api/overview`) is already documented at `codev/resources/arch.md:213`. This PR adds a localized, client-side bounded retry inside the existing `terminal-manager` open path; it introduces no new module boundary, endpoint, or cross-component contract, so it sits below the altitude `arch.md` records. + +## Lessons Learned Updates + +Added one entry to `codev/resources/lessons-learned.md` (Debugging and Root Cause Analysis): two views backed by different freshness sources can diverge *transiently*, not only on outright failure, and a bounded retry that absorbs the self-healing window beats dead-ending on the first miss. It is filed as a sibling to `[From 916]` (same divergence class, opposite direction: there the whole shared cache nulled; here a single per-builder field is briefly stale). + +## 3-Way Consultation (single advisory pass) + +- **claude: APPROVE.** +- **codex: REQUEST_CHANGES (HIGH) — addressed in code.** Finding: the **Recover Builders** action used `cwd: workspacePath` from `connectionManager.getWorkspacePath()`; because `detectWorkspacePath` walks up to the first `codev/` dir and a builder worktree has its own, a VSCode window *rooted at* a `.builders/<id>` worktree would run `afx workspace recover` in the worktree, not the main checkout (repo guidance: never run `afx` from inside a worktree). Real edge, confirmed. **Fix:** added a `mainCheckoutRoot(workspacePath)` helper (`terminal-resolve.ts`) that strips a trailing `/.builders/<id>` segment, and the recover terminal now uses `cwd: mainCheckoutRoot(workspacePath)`. Regression coverage: 5 behavioral tests for the helper (unchanged path, strip, trailing slash, non-leaf no-op, Windows separators) + the `terminal-manager` source-level guard now asserts `cwd: mainCheckoutRoot(workspacePath)` (would fail if it reverts to the raw path). Note: the same `cwd: workspacePath` pattern exists pre-existing in `run-worktree-setup.ts` (`afx setup`) and other afx call sites — a consistent main-root resolution across all of them is broader than #982 and worth a follow-up; this PR fixes only the affordance it introduces. +- **gemini: skipped** (Antigravity `agy` CLI not installed in this environment) — non-blocking, no content. + +## Things to Look At During PR Review + +- **`terminal-resolve.ts` is the core.** It is pure / vscode-free (deps injected: `sleep`, `attempts`), which is why the retry behavior is tested for real (`terminal-resolve.test.ts`) rather than through the heavy `TerminalManager` vscode harness. The retry returns a discriminated outcome (`ok` / `ambiguous` / `missing`); ambiguity short-circuits without retrying because it is stable. +- **Backoff convention.** Reuses the shared `backoffDelayMs` from `@cluesmith/codev-core/reconnect-policy` (the curve consolidated in #961), with interactive params `{ baseMs: 150, capMs: 800 }` over 4 attempts (~150/300/600ms, ~1s total) instead of the module's reconnect-loop defaults (base 1s, cap 30s), which would make a click feel laggy. A test asserts the emitted delays equal the shared curve at those params, so the convention can't silently drift back to a hand-rolled delay. +- **Recovery is deliberately demoted.** `afx workspace recover` only addresses the rare destroyed-session tail (e.g. a dead shellper) and is workspace-wide (it cannot target one builder), so the toast leads with **Retry** and the recover button stops at the dry-run preview (no `--apply`) for the user to review scope. +- **Mixed test styles, on purpose.** Behavioral tests for the pure helper; source-level (regex over source) guards in `terminal-manager.test.ts` for the vscode-side wiring (delegation, toast labels, recover command), matching that file's existing harness rationale (constructing `TerminalManager` needs broad vscode mocking). + +## How to Test Locally + +For reviewers pulling the branch: + +- **View diff**: VSCode sidebar → right-click builder `pir-982` → **View Diff** +- **Run dev server**: VSCode sidebar → **Run Dev Server**, or `afx dev pir-982` +- **What to verify** (maps to the plan's Test Plan): + - Transient/self-heal: click a builder row right after a spawn or while Tower is settling → terminal opens after a brief pause with **no** dead-end toast. + - Persistent: kill a builder's shellper so its session can't reconnect → after the retries, the actionable toast appears; **Retry** re-attempts, **Recover Builders** opens a terminal running `afx workspace recover` (dry-run) at the workspace root. + - Happy path: click a healthy builder → opens immediately, no toast. + +## Build / Test Setup Note + +This was a fresh worktree where `@cluesmith/codev-core` and `@cluesmith/codev-types` had no `dist/` yet, which made unrelated subpath-importing tests fail to resolve until those packages were built (`tsc`). Not a code issue, but worth knowing when running the vscode unit suite in a new worktree: build the workspace's TS packages first (or run `pnpm build`). diff --git a/codev/state/pir-982_thread.md b/codev/state/pir-982_thread.md new file mode 100644 index 000000000..bd9b9f910 --- /dev/null +++ b/codev/state/pir-982_thread.md @@ -0,0 +1,73 @@ +# PIR #982 — Builder thread + +Issue: vscode + tower "No active terminal for X" toast is unactionable. Label: `area/cross-cutting`. + +## Plan phase (iteration 1) + +**Root cause confirmed by reading code:** the sidebar tree and the terminal-opener read two different Tower sources. +- Sidebar renders from `OverviewBuilder` (overview cache, `/api/overview`) — disk-sourced, **no `terminalId`** (`packages/types/src/api.ts:141-225`). Row shows while worktree exists. +- Opener `openBuilderByRoleOrId` (`packages/vscode/src/terminal-manager.ts:186-215`) reads `getWorkspaceState` → `DashboardState.builders.terminalId` (in-memory PTY session id). Tower restart drops the in-memory registry → `terminalId` null → bare warning at `terminal-manager.ts:206-209`. + +**Decision:** ship Option 1 (better message) + Option 2 (in-extension recovery), defer 3/4/5. +- New toast (replacing the bare warning): names the builder, explains likely cause (Tower restart), buttons **Recover Builders** + **Retry**. +- **Recover Builders** → `createTerminal({ cwd: workspacePath })` + `sendText('afx workspace recover')` (dry-run), mirroring `commands/run-worktree-setup.ts:51-56`. Deliberately stops at dry-run (recover is workspace-wide, can't target one builder). +- **Retry** → re-attempt open (handles the transient spawn/recover race, cause 3). +- Pure-vscode, confined to the `!terminalId` branch → happy path untouched. +- Deferred: Option 3 (sidebar icon) needs a `terminalId`/liveness field threaded through overview server + types — cross-package, no extra acceptance coverage → follow-up issue. Options 4/5 explicitly deferred by the issue. + +Toast button pattern reference: `notifications/gate-toast.ts:109-139`. Tests extend `packages/vscode/src/__tests__/terminal-manager.test.ts` (vitest, `pnpm --filter codev-vscode test:unit`). + +Plan written to `codev/plans/982-vscode-tower-no-active-termina.md`. Committed, pushed. Sitting at `plan-approval` gate. + +## Plan check vs main + rebase (still at plan-approval) + +Branch was 88 commits behind `origin/main` (based on `66144d90`; main at `586a2fcb`). Checked plan validity against main: +- Warning code at `terminal-manager.ts:206-209` is **byte-identical on main** — bug untouched, same lines. +- None of the plan's targeted files changed since the old base (terminal-manager, builder-row, builders, types/api.ts, run-worktree-setup, gate-toast, the test). +- Only adjacent change intersecting the plan: `overview.ts` got PIR #907's enrichment cache (for `area`) — did NOT add any `terminalId`/liveness field, and `OverviewBuilder` (api.ts) is unchanged. So the Option-3 deferral reasoning ("no liveness signal on OverviewBuilder") still holds. +- No merged PR references #982 — no duplicate fix shipped. + +Rebased onto `origin/main` (586a2fcb) cleanly (no conflicts), force-pushed. Porch state intact (still gate_pending / plan / iter 1). Post-rebase line refs re-verified: warning at `:207`, `run-worktree-setup.ts:51-56` pattern present. + +**Deep-dive on Tower-up causes (answered to reviewer):** `/api/state` (opener) omits a builder whose live `PtySession` is gone from `TerminalManager.sessions` (`tower-routes.ts:1834-1855` — included only `if (session)`); `/api/overview` (sidebar) is disk-sourced so still lists it. Tower-up triggers: non-shellper 5-min idle reap (`pty-session.ts:368-379`, shellper-exempt), PTY exit past 30s grace (`pty-manager.ts:97-101`), spawn race, explicit kill. + +## Plan REVISED after reviewer feedback (still at plan-approval) + +Reviewer: the case they actually hit is **transient unavailability, not session destruction** — wouldn't need afx recovery. They challenged recovery-as-primary. They're right. + +Mechanism confirmed: `/api/state` rehydrates + on-the-fly shellper-reconnects on every call (`tower-terminals.ts:164-168`, `:780-886`); a click landing mid-rehydration / during startup `_reconciling` (`:53`/`:780`) / in the spawn race sees a momentary miss that resolves on the next call. **There is NO retry in the open path** — `terminal-manager.ts:206-208` warns on the first miss. That missing retry is the real defect. + +**Reframed plan:** +- PRIMARY: bounded silent auto-retry in `openBuilderByRoleOrId` (~3 attempts, ~400ms apart). Re-query `getWorkspaceState` (re-triggers Tower self-heal). On any hit → open, NO toast. Transient case becomes invisible (= the issue's "self-recovers on next tick" AC). +- SECONDARY (persistent tail only): actionable toast leading with **Retry**; **Recover Builders** demoted to last-resort line (dry-run, dead-shellper case). Recovery is no longer the headline. +- Happy path untouched (first-attempt hit returns immediately). Option 3/4/5 still deferred. + +## Backoff convention check (reviewer asked "same one used across the system?") + +No — first draft's "fixed 400ms" was ad-hoc. Canonical convention is `packages/core/src/reconnect-policy.ts` (#961 consolidated 4 hand-rolled `Math.min(1000*2^attempt,cap)` copies). Exports `backoffDelayMs(attempt, opts)` (bare curve, parameterized: baseMs/capMs/maxAttempts/jitterMs/random), `BackoffController` (counter+give-up), `classifyUpgradeError` (WS close codes). Already used by `connection-manager.ts` (SSE reconnect → backoffDelayMs) and `terminal-adapter.ts` (WS reconnect → BackoffController). + +Catch: defaults (base 1s, cap 30s, 6 attempts) are for persistent reconnect loops — too slow for an interactive click. Fix: reuse the **bare `backoffDelayMs`** (own a local counter, like the SSE/tunnel sites) with interactive params `{ baseMs: 150, capMs: 800 }` over ~3-4 attempts (~150/300/600ms, ~1s total). Don't use BackoffController (event-driven onClose machinery we don't need) or classifyUpgradeError (WS-only). Plan + Files-to-Change updated to import `@cluesmith/codev-core/reconnect-policy`. + +## IMPLEMENT phase (iteration 1) — done, awaiting dev-approval + +plan-approval approved; advanced to implement. + +Changes: +- NEW `packages/vscode/src/terminal-resolve.ts` — vscode-free `resolveBuilderTerminal(roleOrId, fetchBuilders, {sleep, attempts?})`. Retries the resolve `TERMINAL_RESOLVE_ATTEMPTS` (4) times; uses `backoffDelayMs(attempt, {baseMs:150, capMs:800})` between attempts. Returns discriminated `{kind:'ok'|'ambiguous'|'missing'}`. Ambiguity short-circuits (stable, no retry). Deps injected → behaviorally testable. +- `terminal-manager.ts` — `openBuilderByRoleOrId` delegates to the helper; `missing` → new `promptNoTerminalRecovery` (Retry [primary] + Recover Builders [secondary, dry-run `afx workspace recover` at workspace root, mirrors run-worktree-setup]). `friendlyBuilderId` gives `#<issueId>` in the toast via overviewCache. Old bare "No active terminal" warning removed. +- Tests: NEW `terminal-resolve.test.ts` (7 behavioral — happy path / self-heal miss-then-hit / spawn-race empty-then-hit / persistent missing / backoff-curve-matches-shared-fn / ambiguous short-circuit / attempts override). `terminal-manager.test.ts` +5 source-level (delegation, no bare warning, missing→prompt, Retry+Recover labels, recover command wiring). + +Verify: `pnpm --filter codev-vscode test:unit` → 27 files / 348 tests pass. `pnpm --filter codev-vscode compile` → check-types + lint + esbuild clean. +Note: fresh worktree needed `@cluesmith/codev-core` and `@cluesmith/codev-types` built first (tsc) — their dist wasn't present, caused subpath-resolution errors in unrelated tests until built. Not a code issue. + +Committed (impl + tests separate). Next: push, `porch done`, sit at dev-approval. + +## REVIEW phase — PR + consult + pr gate + +dev-approval approved → review phase. Wrote `codev/reviews/982-...md` (Summary/Files/Commits/Tests/Arch[no change, server barrier already in arch.md:213]/Lessons[added sibling to From 916]/Things-to-look-at/How-to-test). Added lessons-learned.md entry [From 982]. PR #1006 opened (Fixes #982), recorded via `porch done --pr 1006`. + +3-way consult (single advisory pass, type=impl): **claude=APPROVE, codex=REQUEST_CHANGES (HIGH), gemini=skipped (agy CLI not installed)**. +- Codex finding (REAL): Recover Builders used `cwd: workspacePath`; in a `.builders/<id>`-rooted VSCode window that's the worktree (detectWorkspacePath walks up to first `codev/`, worktree has its own), so `afx workspace recover` would run in the wrong dir. Fixed: new `mainCheckoutRoot(workspacePath)` helper strips trailing `/.builders/<id>`; recover cwd now uses it. +5 helper tests, source-guard updated. Pre-existing same pattern in run-worktree-setup.ts etc. = broader follow-up, out of scope. +- Tests: 353 pass (17 new). compile clean. Documented finding+fix in review "3-Way Consultation" + "Things to Look At". + +Escalating REQUEST_CHANGES to architect (single-pass, won't be re-reviewed). Then wait at `pr` gate. diff --git a/packages/vscode/src/__tests__/terminal-manager.test.ts b/packages/vscode/src/__tests__/terminal-manager.test.ts index 8f1aee95f..603a65e5f 100644 --- a/packages/vscode/src/__tests__/terminal-manager.test.ts +++ b/packages/vscode/src/__tests__/terminal-manager.test.ts @@ -66,6 +66,44 @@ describe('Spec 786 Phase 6 — TerminalManager per-name keying', () => { }); }); +describe('PIR #982 — actionable recovery for a missing terminal', () => { + // Behavioral coverage of the retry/resolve itself lives in + // terminal-resolve.test.ts (the vscode-free helper). These source-level + // guards cover the vscode-side wiring that the heavy TerminalManager harness + // makes impractical to exercise behaviorally. + + it('delegates the resolve to the bounded-retry helper', () => { + expect(TM_SRC).toMatch(/import \{[^}]*\bresolveBuilderTerminal\b[^}]*\} from '\.\/terminal-resolve\.js'/); + expect(TM_SRC).toMatch(/await resolveBuilderTerminal\(/); + }); + + it('no longer dead-ends on the bare "No active terminal" warning', () => { + // The old unactionable toast must be gone — replaced by the recovery prompt. + expect(TM_SRC).not.toMatch(/No active terminal for/); + }); + + const recoveryBody = TM_SRC.split('promptNoTerminalRecovery')[2] ?? ''; + + it('routes a missing terminal to the recovery prompt', () => { + expect(TM_SRC).toMatch(/outcome\.kind === 'missing'/); + expect(TM_SRC).toMatch(/this\.promptNoTerminalRecovery\(/); + }); + + it('offers Retry and Recover Builders actions', () => { + expect(recoveryBody).toMatch(/showWarningMessage\(/); + expect(recoveryBody).toMatch(/const RETRY = 'Retry'/); + expect(recoveryBody).toMatch(/const RECOVER = 'Recover Builders'/); + }); + + it('Retry re-attempts the open; Recover runs `afx workspace recover` at the main checkout root', () => { + expect(recoveryBody).toMatch(/choice === RETRY[\s\S]*openBuilderByRoleOrId\(roleOrId, focus\)/); + // cwd must be the main checkout root, not the raw workspace path — a + // worktree-rooted window would otherwise run afx in the wrong dir (#982). + expect(recoveryBody).toMatch(/cwd: mainCheckoutRoot\(workspacePath\)/); + expect(recoveryBody).toMatch(/sendText\('afx workspace recover'\)/); + }); +}); + describe('#921 — dev surface refresh on manual terminal close', () => { // Regression guard: a dev terminal closed via the generic onDidCloseTerminal // path (tab ✕ / process exit) must clear devStartedAt AND re-fire diff --git a/packages/vscode/src/__tests__/terminal-resolve.test.ts b/packages/vscode/src/__tests__/terminal-resolve.test.ts new file mode 100644 index 000000000..a51c2147d --- /dev/null +++ b/packages/vscode/src/__tests__/terminal-resolve.test.ts @@ -0,0 +1,175 @@ +/** + * PIR #982 — behavioral tests for `resolveBuilderTerminal`, the bounded-retry + * resolve that absorbs Tower's transient session-registry window. + * + * The helper is vscode-free (deps injected), so these are real behavioral + * tests with an instrumented `sleep` — no vscode mock, no real wall-clock + * delay — mirroring the `builder-row.ts` testing pattern. + */ + +import { describe, it, expect } from 'vitest'; +import { backoffDelayMs } from '@cluesmith/codev-core/reconnect-policy'; +import { + resolveBuilderTerminal, + mainCheckoutRoot, + TERMINAL_RESOLVE_ATTEMPTS, + type ResolvableBuilder, +} from '../terminal-resolve.js'; + +interface TestBuilder extends ResolvableBuilder { + name: string; +} + +/** A no-op sleep that records the delays it was asked to wait. */ +function recordingSleep() { + const delays: number[] = []; + return { sleep: async (ms: number) => { delays.push(ms); }, delays }; +} + +describe('resolveBuilderTerminal', () => { + it('opens immediately when the first lookup has a live terminal (happy path)', async () => { + const { sleep, delays } = recordingSleep(); + let calls = 0; + const outcome = await resolveBuilderTerminal<TestBuilder>( + '982', + async () => { calls++; return [{ id: 'builder-pir-982', name: 'pir-982', terminalId: 'term-1' }]; }, + { sleep }, + ); + + expect(outcome).toEqual({ + kind: 'ok', + builder: { id: 'builder-pir-982', name: 'pir-982', terminalId: 'term-1' }, + terminalId: 'term-1', + }); + expect(calls).toBe(1); // no retry needed + expect(delays).toEqual([]); // never slept on the happy path + }); + + it('self-heals: a transient miss followed by a live session resolves to ok', async () => { + const { sleep, delays } = recordingSleep(); + let calls = 0; + const outcome = await resolveBuilderTerminal<TestBuilder>( + '982', + async () => { + calls++; + // Attempt 1: builder present but session not yet bound (terminalId absent). + if (calls === 1) { return [{ id: 'builder-pir-982', name: 'pir-982' }]; } + // Attempt 2: Tower's rehydrate/reconnect completed — terminal is live. + return [{ id: 'builder-pir-982', name: 'pir-982', terminalId: 'term-9' }]; + }, + { sleep }, + ); + + expect(outcome.kind).toBe('ok'); + expect(calls).toBe(2); + expect(delays).toHaveLength(1); // slept once, between the two attempts + }); + + it('also self-heals when the builder is absent entirely on the first attempt (spawn race)', async () => { + const { sleep } = recordingSleep(); + let calls = 0; + const outcome = await resolveBuilderTerminal<TestBuilder>( + '982', + async () => { + calls++; + if (calls === 1) { return []; } // not registered in /api/state yet + return [{ id: 'builder-pir-982', name: 'pir-982', terminalId: 'term-3' }]; + }, + { sleep }, + ); + + expect(outcome.kind).toBe('ok'); + expect(calls).toBe(2); + }); + + it('returns missing after exhausting all attempts (persistent — recovery path)', async () => { + const { sleep, delays } = recordingSleep(); + let calls = 0; + const outcome = await resolveBuilderTerminal<TestBuilder>( + '982', + async () => { calls++; return [{ id: 'builder-pir-982', name: 'pir-982' }]; }, + { sleep }, + ); + + expect(outcome.kind).toBe('missing'); + expect(calls).toBe(TERMINAL_RESOLVE_ATTEMPTS); // tried the full budget + expect(delays).toHaveLength(TERMINAL_RESOLVE_ATTEMPTS - 1); // no sleep after the last attempt + }); + + it('uses the shared backoffDelayMs curve with interactive-tuned params', async () => { + const { sleep, delays } = recordingSleep(); + await resolveBuilderTerminal<TestBuilder>( + '982', + async () => [{ id: 'builder-pir-982', name: 'pir-982' }], // always miss + { sleep }, + ); + + // Delays must equal the shared curve at base 150 / cap 800 for attempts 0..N-2. + const expected = Array.from({ length: TERMINAL_RESOLVE_ATTEMPTS - 1 }, (_, i) => + backoffDelayMs(i, { baseMs: 150, capMs: 800 }), + ); + expect(delays).toEqual(expected); + expect(expected[0]).toBe(150); // sanity: snappy first retry, not the 1s reconnect default + }); + + it('short-circuits to ambiguous without retrying when the id tail matches >1 builder', async () => { + const { sleep, delays } = recordingSleep(); + let calls = 0; + const outcome = await resolveBuilderTerminal<TestBuilder>( + '5', + async () => { + calls++; + return [ + { id: 'builder-spir-5', name: 'spir-5' }, + { id: 'builder-bugfix-5', name: 'bugfix-5' }, + ]; + }, + { sleep }, + ); + + expect(outcome.kind).toBe('ambiguous'); + if (outcome.kind === 'ambiguous') { expect(outcome.matches).toHaveLength(2); } + expect(calls).toBe(1); // ambiguity is stable — no point retrying + expect(delays).toEqual([]); + }); + + it('respects an overridden attempt count', async () => { + const { sleep } = recordingSleep(); + let calls = 0; + const outcome = await resolveBuilderTerminal<TestBuilder>( + '982', + async () => { calls++; return []; }, + { sleep, attempts: 2 }, + ); + + expect(outcome.kind).toBe('missing'); + expect(calls).toBe(2); + }); +}); + +describe('mainCheckoutRoot', () => { + it('returns a normal main-checkout path unchanged', () => { + expect(mainCheckoutRoot('/Users/me/repos/codev')).toBe('/Users/me/repos/codev'); + }); + + it('strips a trailing /.builders/<id> worktree segment back to the main root', () => { + // The recover command must run from the main checkout even when VSCode is + // rooted at a worktree window (PIR #982 — the codex review finding). + expect(mainCheckoutRoot('/Users/me/repos/codev/.builders/pir-982')).toBe('/Users/me/repos/codev'); + }); + + it('tolerates a trailing slash on the worktree path', () => { + expect(mainCheckoutRoot('/Users/me/repos/codev/.builders/spir-153/')).toBe('/Users/me/repos/codev'); + }); + + it('does not strip when .builders is not the leaf segment', () => { + // Only a worktree-rooted window (leaf == .builders/<id>) should be rewritten; + // a deeper path is left alone rather than guessed at. + const deep = '/Users/me/repos/codev/.builders/pir-982/packages/vscode'; + expect(mainCheckoutRoot(deep)).toBe(deep); + }); + + it('handles Windows-style separators', () => { + expect(mainCheckoutRoot('C:\\repos\\codev\\.builders\\pir-982')).toBe('C:\\repos\\codev'); + }); +}); diff --git a/packages/vscode/src/terminal-manager.ts b/packages/vscode/src/terminal-manager.ts index 73ce44c71..827afa0ad 100644 --- a/packages/vscode/src/terminal-manager.ts +++ b/packages/vscode/src/terminal-manager.ts @@ -5,6 +5,7 @@ import type { OverviewCache } from './views/overview-data.js'; import { encodeWorkspacePath } from '@cluesmith/codev-core/workspace'; import { resolveAgentName } from '@cluesmith/codev-core/agent-names'; import type { TerminalType } from '@cluesmith/codev-core/tower-client'; +import { resolveBuilderTerminal, mainCheckoutRoot } from './terminal-resolve.js'; const MAX_TERMINALS = 10; @@ -182,6 +183,11 @@ export class TerminalManager { * Resolve a builder by `roleId` or `id` via Tower workspace state, then * open its terminal. Used by the sidebar tree views, terminal link * provider, and command palette so the lookup logic lives in one place. + * + * The resolve is retried with a short backoff (`resolveBuilderTerminal`): + * Tower's `/api/state` can momentarily omit a live builder while its session + * registry rehydrates/reconnects, and that transient miss self-heals on the + * next call (PIR #982). Only a persistent miss surfaces the recovery toast. */ async openBuilderByRoleOrId(roleOrId: string, focus = false): Promise<void> { const client = this.connectionManager.getClient(); @@ -191,29 +197,87 @@ export class TerminalManager { return; } try { - const state = await client.getWorkspaceState(workspacePath); - const builders = state?.builders ?? []; - // Use resolveAgentName so the bare numeric IDs the sidebar passes - // (e.g. '153' from OverviewBuilder) tail-match canonical - // 'builder-spir-153' IDs from Tower's runtime state. - const { builder, ambiguous } = resolveAgentName(roleOrId, builders); - if (ambiguous) { + // resolveBuilderTerminal uses resolveAgentName internally so the bare + // numeric IDs the sidebar passes (e.g. '153' from OverviewBuilder) + // tail-match canonical 'builder-spir-153' IDs from Tower's runtime state. + const outcome = await resolveBuilderTerminal( + roleOrId, + async () => { + const state = await client.getWorkspaceState(workspacePath); + return state?.builders ?? []; + }, + { sleep: (ms) => new Promise<void>((r) => setTimeout(r, ms)) }, + ); + if (outcome.kind === 'ambiguous') { vscode.window.showWarningMessage( - `Codev: Multiple builders match "${roleOrId}": ${ambiguous.map(b => b.name).join(', ')}`, + `Codev: Multiple builders match "${roleOrId}": ${outcome.matches.map(b => b.name).join(', ')}`, ); return; } - if (!builder?.terminalId) { - vscode.window.showWarningMessage(`Codev: No active terminal for ${roleOrId}`); + if (outcome.kind === 'missing') { + await this.promptNoTerminalRecovery(roleOrId, workspacePath, focus); return; } - await this.openBuilder(builder.terminalId, builder.id, `Codev: ${builder.name}`, focus); + const { builder, terminalId } = outcome; + await this.openBuilder(terminalId, builder.id, `Codev: ${builder.name}`, focus); } catch (err) { this.log('ERROR', `Failed to open builder ${roleOrId}: ${(err as Error).message}`); vscode.window.showErrorMessage(`Codev: Failed to open ${roleOrId}`); } } + /** + * Actionable toast for when a builder row is clicked but no live terminal + * resolves after the bounded retry (PIR #982). Replaces the old dead-end + * `No active terminal` warning. Leads with **Retry** — the common case is a + * session still settling (a longer startup-reconciliation window than the + * auto-retry budget covers) — and offers **Recover Builders** as the + * last-resort path for a genuinely lost session (e.g. a dead shellper). + * + * Recover opens a terminal running `afx workspace recover` (its default + * dry-run preview) at the main checkout root, mirroring `run-worktree-setup`. + * It deliberately stops at the preview rather than `--apply`: recover is + * workspace-wide (it can't target one builder), so the user reviews the scope + * before reviving. The cwd is resolved via `mainCheckoutRoot` so the command + * runs from the main checkout even when VSCode is rooted at a builder worktree + * window (where `getWorkspacePath()` resolves to the worktree, not main). + */ + private async promptNoTerminalRecovery(roleOrId: string, workspacePath: string, focus: boolean): Promise<void> { + const RETRY = 'Retry'; + const RECOVER = 'Recover Builders'; + const who = this.friendlyBuilderId(roleOrId); + const choice = await vscode.window.showWarningMessage( + `Codev: ${who}'s terminal isn't available yet — it may still be starting, or its session was dropped. ` + + `Retry, or recover builders if it was lost.`, + RETRY, + RECOVER, + ); + if (choice === RETRY) { + await this.openBuilderByRoleOrId(roleOrId, focus); + } else if (choice === RECOVER) { + const terminal = vscode.window.createTerminal({ + name: 'Codev: Recover Builders', + cwd: mainCheckoutRoot(workspacePath), + }); + terminal.show(); + terminal.sendText('afx workspace recover'); + } + } + + /** + * A short, friendly identity for a builder (`#<issueId>`) from the overview + * cache, falling back to the raw `roleOrId` when no row matches. Used only + * for the recovery toast's prose so it names the row the user clicked. + */ + private friendlyBuilderId(roleOrId: string): string { + const data = this.overviewCache.getData(); + if (!data?.builders?.length) { return roleOrId; } + const shortId = roleOrId.split('-').pop() ?? roleOrId; + const { builder } = resolveAgentName(shortId, data.builders); + const num = builder?.issueId ?? builder?.id; + return num ? `#${num}` : roleOrId; + } + /** * Open a dev-server terminal for a builder's worktree (#690). * Keyed `dev-<builderId>` so it lives alongside (not on top of) the diff --git a/packages/vscode/src/terminal-resolve.ts b/packages/vscode/src/terminal-resolve.ts new file mode 100644 index 000000000..92b98f41b --- /dev/null +++ b/packages/vscode/src/terminal-resolve.ts @@ -0,0 +1,108 @@ +/** + * Pure, vscode-free resolution of a builder's live terminal from Tower's + * workspace-state, with a bounded retry that absorbs the transient window in + * which Tower's session registry disagrees with the sidebar (PIR #982). + * + * The sidebar lists a builder as soon as its worktree exists on disk + * (`/api/overview`), but the terminal opener needs a *live* PTY session + * (`/api/state`, which includes a builder only if `manager.getSession` resolves). + * `/api/state` rehydrates and on-the-fly-reconnects shellper sessions on every + * call, so a click that lands mid-rehydration, during Tower's startup + * reconciliation, or in the spawn race sees the builder momentarily without a + * live `terminalId` — and the very next call resolves it. Retrying a few times + * with a short backoff lets that dominant, transient case heal silently instead + * of dead-ending on the "No active terminal" warning. Only a genuinely + * persistent miss (dead shellper, reaped session) falls through to `missing`. + * + * Kept vscode-free (deps injected) so it unit-tests under the vitest `__tests__/` + * harness — mirroring `views/builder-row.ts` — rather than needing a full vscode + * mock. The toast/recovery UX it feeds lives in `terminal-manager.ts`. + */ + +import { backoffDelayMs } from '@cluesmith/codev-core/reconnect-policy'; +import { resolveAgentName } from '@cluesmith/codev-core/agent-names'; + +/** + * Retry budget for the resolve. The first attempt is the original lookup; + * the remaining attempts each re-trigger Tower's rehydrate/reconnect. Four + * attempts with the backoff below spans ~1s of wall-clock total. + */ +export const TERMINAL_RESOLVE_ATTEMPTS = 4; + +/** + * Interactive-tuned backoff parameters fed to the shared `backoffDelayMs` + * curve (`@cluesmith/codev-core/reconnect-policy`). The module's defaults + * (base 1000ms, cap 30_000ms) are sized for persistent reconnect loops and + * would make a sidebar click feel laggy; these produce ~150ms, 300ms, 600ms + * between attempts — snappy enough for a click while still spanning the + * sub-second self-heal window. Same primitive, interactive parameters. + */ +const RESOLVE_BASE_MS = 150; +const RESOLVE_CAP_MS = 800; + +/** Minimum a builder needs for this resolution: an id and a (maybe) terminalId. */ +export interface ResolvableBuilder { + id: string; + terminalId?: string; +} + +export type TerminalResolveOutcome<T extends ResolvableBuilder> = + /** Builder found with a live terminal session. */ + | { kind: 'ok'; builder: T; terminalId: string } + /** `roleOrId` matched more than one builder — caller should disambiguate. */ + | { kind: 'ambiguous'; matches: T[] } + /** No live terminal after all attempts — likely persistent (recovery path). */ + | { kind: 'missing' }; + +export interface TerminalResolveDeps { + /** Sleep between attempts. Injected so tests run without real delays. */ + sleep: (ms: number) => Promise<void>; + /** Override the attempt count (tests). Defaults to {@link TERMINAL_RESOLVE_ATTEMPTS}. */ + attempts?: number; +} + +/** + * Resolve `roleOrId` to a builder with a live terminal, retrying the + * `fetchBuilders` lookup with a short backoff to absorb the transient + * session-registry window. Ambiguity is stable (it's a matter of how many + * builders share the id tail), so it short-circuits immediately rather than + * retrying. + */ +export async function resolveBuilderTerminal<T extends ResolvableBuilder>( + roleOrId: string, + fetchBuilders: () => Promise<T[]>, + deps: TerminalResolveDeps, +): Promise<TerminalResolveOutcome<T>> { + const attempts = deps.attempts ?? TERMINAL_RESOLVE_ATTEMPTS; + for (let attempt = 0; attempt < attempts; attempt++) { + const builders = await fetchBuilders(); + const { builder, ambiguous } = resolveAgentName(roleOrId, builders); + if (ambiguous) { + return { kind: 'ambiguous', matches: ambiguous }; + } + if (builder?.terminalId) { + return { kind: 'ok', builder, terminalId: builder.terminalId }; + } + if (attempt < attempts - 1) { + await deps.sleep(backoffDelayMs(attempt, { baseMs: RESOLVE_BASE_MS, capMs: RESOLVE_CAP_MS })); + } + } + return { kind: 'missing' }; +} + +/** + * The main-checkout root for an `afx` command, given the VSCode window's + * detected workspace path. + * + * A builder worktree lives at `<root>/.builders/<id>` and is itself a full + * checkout (it has its own `codev/`), so `detectWorkspacePath`'s walk-up stops + * at the worktree when VSCode is *rooted at* that worktree — `getWorkspacePath()` + * then returns the worktree, not the main checkout. `afx workspace recover` must + * run from the main root (repo guidance: never run `afx` from inside a + * worktree), so strip a trailing `/.builders/<id>` segment when present. + * A normal main-checkout window has no such suffix and is returned unchanged. + */ +export function mainCheckoutRoot(workspacePath: string): string { + const match = /[\\/]\.builders[\\/][^\\/]+\/?$/.exec(workspacePath); + return match ? workspacePath.slice(0, match.index) : workspacePath; +}