Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
26 commits
Select commit Hold shift + click to select a range
f6ace76
chore(porch): 982 init pir
amrmelsayed Jun 6, 2026
a336982
[PIR #982] Plan draft
amrmelsayed Jun 6, 2026
620dce9
chore(porch): 982 plan-approval gate-requested
amrmelsayed Jun 6, 2026
9b1c571
[PIR #982] Thread: plan check vs main + rebase
amrmelsayed Jun 6, 2026
aa27d58
[PIR #982] Plan revised: auto-retry primary, recovery demoted to last…
amrmelsayed Jun 6, 2026
17bb820
[PIR #982] Plan: reuse shared reconnect-policy backoff (interactive-t…
amrmelsayed Jun 6, 2026
e5e7d21
[PIR #982] Plan: refresh tower-routes line ref after rebase
amrmelsayed Jun 6, 2026
3c28438
chore(porch): 982 plan-approval gate-approved
amrmelsayed Jun 6, 2026
e34e99d
chore(porch): 982 implement phase-transition
amrmelsayed Jun 6, 2026
ed8e3ea
[PIR #982] Self-heal terminal open with bounded retry; recovery toast…
amrmelsayed Jun 6, 2026
8a52338
[PIR #982] Tests for retry-resolve helper and recovery-toast wiring
amrmelsayed Jun 6, 2026
7fdc5b8
[PIR #982] Thread: implement phase complete
amrmelsayed Jun 6, 2026
ffddc86
chore(porch): 982 dev-approval gate-requested
amrmelsayed Jun 6, 2026
d2659b1
chore(porch): 982 dev-approval gate-approved
amrmelsayed Jun 6, 2026
c5df114
chore(porch): 982 review phase-transition
amrmelsayed Jun 6, 2026
0cd03f9
chore(porch): 982 dev-approval gate-approved
amrmelsayed Jun 6, 2026
1b20840
[PIR #982] Review + retrospective
amrmelsayed Jun 6, 2026
70626dd
chore(porch): 982 record PR #1006
amrmelsayed Jun 6, 2026
09ebcf0
chore(porch): 982 review phase-transition
amrmelsayed Jun 6, 2026
30129fb
chore(porch): 982 review build-complete
amrmelsayed Jun 6, 2026
70452ec
[PIR #982] Run recover from main checkout root, not a worktree window…
amrmelsayed Jun 6, 2026
c6708fc
[PIR #982] Review: record 3-way consult verdicts + codex fix disposition
amrmelsayed Jun 6, 2026
865ec57
chore(porch): 982 pr gate-requested
amrmelsayed Jun 6, 2026
864ff50
chore(porch): 982 pr gate-approved
amrmelsayed Jun 6, 2026
7fc4708
chore(porch): 982 protocol complete
amrmelsayed Jun 6, 2026
5345086
chore(porch): 982 PR #1006 merged
amrmelsayed Jun 6, 2026
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
103 changes: 103 additions & 0 deletions codev/plans/982-vscode-tower-no-active-termina.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,103 @@
# PIR Plan: Make the "No active terminal" click self-heal (and only suggest recovery as a last resort)

## Understanding

Clicking a builder row in the Codev sidebar can produce a dead-end warning toast:

```
Codev: No active terminal for <builder-id>
```

…and nothing else happens. The row keeps looking healthy, but the click can't open a terminal.

**Why the two surfaces disagree.** The sidebar and the terminal-opener read different Tower sources:

- The **sidebar** renders from `OverviewBuilder` (overview cache / `/api/overview`), which is disk-sourced (worktree + `status.yaml`) and has no `terminalId` field (`packages/types/src/api.ts`, `OverviewBuilder`). A builder shows up as long as its worktree exists.
- The **opener** (`openBuilderByRoleOrId`, `packages/vscode/src/terminal-manager.ts:186-215`) calls `getWorkspaceState` (`/api/state`), whose handler builds the `builders` array by iterating `entry.builders` and **including a builder only if its live PTY session resolves** (`tower-routes.ts:~1840` — `const session = manager.getSession(terminalId); if (session) { … }`). No live session → the builder is omitted → `resolveAgentName` finds no match → `builder` is `undefined` → the bare warning fires at `terminal-manager.ts:206-208`.

**The dominant cause is transient, not a destroyed session.** Every `/api/state` request first runs `getRehydratedTerminalsEntry` (`tower-terminals.ts:164-168`), which awaits a rehydrate + an **on-the-fly shellper reconnect** for any session that's in SQLite but not currently in the in-memory registry (`tower-terminals.ts:780-886`). A click that lands:

- mid-rehydration / mid-reconnect (sub-100ms, longer under load),
- during Tower's startup reconciliation window (`_reconciling` blocks the on-the-fly reconnect, `tower-terminals.ts:53`/`:780` — seconds), or
- in the spawn race (the worktree/`status.yaml` is visible to `/api/overview` a beat before the PTY session is registered in `entry.builders`)

…sees a momentary miss that **resolves on the very next call.** The session is not gone; it just isn't resolvable for a beat.

**The actual defect is the missing retry.** The open path surfaces the warning on the *first* miss with no second attempt (`terminal-manager.ts:206-208`) — Tower would have answered correctly a few hundred milliseconds later. So the user hits a dead-end for a condition that self-heals.

(Session genuinely destroyed while Tower runs — non-shellper 5-min idle reap, PTY exit past the 30s grace, or a dead shellper after machine sleep — is the *minority* tail. That tail is what `afx workspace recover` addresses, and it should be a secondary suggestion, not the headline.)

## Proposed Change

Reframe the fix around the dominant transient case: **let the click heal itself silently**, and only surface a toast on the genuinely-persistent tail — where recovery is a secondary, last-resort line, not the primary action.

### 1. Primary: bounded silent auto-retry in `openBuilderByRoleOrId` (`terminal-manager.ts:193-209`)

When the lookup misses (`!builder?.terminalId`), re-query `getWorkspaceState` a few times with a short backoff before surfacing anything. Each retry re-triggers Tower's rehydrate + on-the-fly reconnect, which is exactly the self-heal path.

**Reuse the system's shared backoff convention, not a hand-rolled delay.** The codebase consolidated four hand-rolled `Math.min(1000 * 2^attempt, cap)` curves into `packages/core/src/reconnect-policy.ts` (#961) specifically so new call sites don't re-invent it; the two closest analogues — `connection-manager.ts` (Tower/SSE reconnect, via `backoffDelayMs`) and `terminal-adapter.ts` (terminal WebSocket reconnect, via `BackoffController`) — already use it. So this retry will call `backoffDelayMs(attempt, opts)` from `@cluesmith/codev-core/reconnect-policy` with a **local attempt counter** (the bare-function form, matching how the SSE/tunnel sites use it — `BackoffController`'s status-machine + give-up wrapper is for event-driven onClose scheduling we don't need here).

- The module's *defaults* (base 1000ms, cap 30_000ms, 6 attempts → `1s,2s,4s,8s,16s,30s`) are tuned for persistent reconnect loops and are too slow for an interactive click. Use interactive-tuned options instead: `backoffDelayMs(attempt, { baseMs: 150, capMs: 800 })` over ~3–4 attempts → roughly `150ms, 300ms, 600ms` (total budget ~1s). Constants live next to the call site with a comment explaining the interactive tuning vs. the reconnect defaults.
- The first attempt is the existing call (no added latency on the happy path; retries happen only on the miss branch).
- On success at any attempt → open the builder terminal normally. **No toast** — the transient case becomes invisible, which is the best UX and directly implements the issue's "self-recovers gracefully on the next overview tick" acceptance bullet.
- The retry loop only re-fetches and re-resolves; it does not change the happy path (a first-attempt hit returns immediately).
- `classifyUpgradeError` (the module's third export) is **not** used — it classifies WebSocket close codes; our `/api/state` poll gets a builder list, so "resolved or not after N tries" is the only signal.

### 2. Secondary: an actionable toast only when retries are exhausted (likely-persistent)

If every attempt still misses, the session is probably genuinely gone — show a toast that leads with a **manual retry** and treats recovery as the fallback:

- Message (neutral, not over-promising recovery): e.g. `Codev: #<id>'s terminal isn't available — it may still be starting, or its session was dropped. Retry, or recover builders if it was lost.` (Use the friendly `#<issueId> <title>` identity from the resolved `OverviewBuilder` row when available.)
- **"Retry" button** (primary) → re-invokes the open (which runs the auto-retry again). Covers the longer startup-reconciliation window where ~1.2s wasn't enough.
- **"Recover Builders" button** (secondary, last resort) → opens a terminal at the workspace root running `afx workspace recover` (dry-run preview), mirroring `commands/run-worktree-setup.ts:51-56` (`createTerminal({ name, cwd: workspacePath })` + `sendText('afx workspace recover')`). Stays at the dry-run because recover is workspace-wide (cannot target one builder) — the user reviews scope before re-running with `--apply`. `workspacePath` is already in scope at `terminal-manager.ts:188`.

Button handling uses the established positional-args pattern (`notifications/gate-toast.ts:109-139`).

### What changed from the first draft (and why)

The first draft led with recovery (per the issue's options 1–2). That mis-weights the problem: the case actually seen in practice is transient unavailability that needs no recovery at all. Recovery is now demoted to the persistent tail, and the headline is the silent auto-retry that makes the dominant case disappear.

### Deferred options (with reasons)

- **Option 3 (sidebar icon for dropped sessions)** — `OverviewBuilder` carries no liveness/`terminalId` signal (confirmed unchanged on `main`; `overview.ts`'s recent #907 change added an `area` enrichment cache, not liveness). Flagging the row before a click would need a `hasLiveSession` field threaded from Tower's in-memory registry through the overview server and `@cluesmith/codev-types` — cross-package blast radius, and the auto-retry already removes the dominant dead-end. Recommend a separate follow-up. (Also: with auto-retry, a transient miss never even reaches a "dropped" state worth flagging.)
- **Option 4 (auto-recover on activation)** — a behavior decision the issue defers; out of scope.
- **Option 5 (persist Tower's session registry)** — the root-cause fix for the *destroyed-session* tail; explicitly a separate, larger discussion; out of scope.

## Files to Change

- `packages/vscode/src/terminal-manager.ts:193-209` — in `openBuilderByRoleOrId`, wrap the resolve in a bounded retry (re-fetch `getWorkspaceState`, re-run `resolveAgentName`, ~3–4 attempts with delays from `backoffDelayMs(attempt, { baseMs: 150, capMs: 800 })` and a local counter). On exhaustion, call a small private helper for the actionable toast (`Retry` + secondary `Recover Builders`). Keep the `ambiguous` and `not connected` branches as-is. Factor the toast + the sleep into helpers so the method stays readable and unit-testable; reuse the already-fetched `workspacePath`.
- `packages/vscode/src/terminal-manager.ts` (imports) — add `import { backoffDelayMs } from '@cluesmith/codev-core/reconnect-policy';` (same import surface `connection-manager.ts` / `terminal-adapter.ts` already use).
- `packages/vscode/src/__tests__/terminal-manager.test.ts` — extend the suite: (a) **miss-then-hit** → `getWorkspaceState` returns no session on attempt 1 and a session on attempt 2 → builder terminal opens, `showWarningMessage` NOT called; (b) **all-miss** → toast shown with `Retry` + `Recover Builders` labels; (c) selecting **Retry** re-attempts the open; (d) selecting **Recover Builders** → `createTerminal` with `cwd === workspacePath` + `sendText('afx workspace recover')`; (e) happy path (first-attempt session present) → opens immediately, no extra fetches, no warning. Inject a fake/fast sleep so tests don't wait real time.

No `package.json` command contribution needed (buttons handled inline). No types/server changes.

## Risks & Alternatives Considered

- **Risk: auto-retry adds latency.** Only on the miss branch, and bounded (~1.2s). The happy path (first-attempt hit) is unchanged — returns immediately. Mitigation: keep attempts/backoff small and configurable as constants.
- **Risk: retry masks a real persistent failure.** Bounded retries fail fast to the actionable toast; we don't retry indefinitely. The persistent tail still gets a clear signal + recovery path.
- **Risk: startup-reconciliation window exceeds the auto-retry budget (seconds).** Then attempt-set 1 falls through to the toast, but the **Retry** button (and a later natural re-click) succeeds once `_reconciling` clears. Acceptable: the toast is now actionable rather than a dead-end.
- **Risk: `afx workspace recover` doesn't help the live-process/dropped-session sub-case.** True, but it's the correct tool for the dead-shellper tail and is now secondary. The message doesn't over-promise ("it may still be starting, or its session was dropped").
- **Alternative: lead with recovery (first draft).** Rejected — mis-weights the problem; the dominant case needs no recovery.
- **Alternative: fix it Tower-side (block `/api/state` until rehydration fully settles).** Rejected for this issue — larger blast radius on a hot endpoint, and the client-side retry is the smaller, safer change that fixes the user-visible symptom. Tower-side hardening can be a follow-up (overlaps Option 5).
- **Risk: happy-path regression.** Mitigated by confining changes to the miss branch + an explicit happy-path test.

## Test Plan

**Unit (`packages/vscode/src/__tests__/terminal-manager.test.ts`, run via vitest — `pnpm --filter codev-vscode test:unit`):**
- Miss-then-hit → terminal opens, no warning (the transient self-heal).
- All attempts miss → actionable toast with `Retry` + `Recover Builders`.
- `Retry` → re-attempts the open.
- `Recover Builders` → `createTerminal` with `cwd === workspacePath`; `sendText('afx workspace recover')`.
- Happy path → first-attempt session present → opens immediately, no warning, no extra fetches.

**Manual (reviewer at the `dev-approval` gate — run the worktree):**
1. **Transient (the main case):** click a builder row right after a spawn / while Tower is settling. Expect the terminal to open after a brief pause with **no** dead-end toast (previously: instant warning). To force the window, click during the spawn race or just after a Tower restart while reconciliation runs.
2. **Persistent:** simulate a genuinely lost session (e.g. kill the shellper so it can't reconnect). Click the row → after the bounded retries, the actionable toast appears. Click **Retry** → still missing → toast returns (not a silent dead-end). Click **Recover Builders** → a terminal opens at the workspace root running `afx workspace recover` (dry-run); confirm cwd is the main checkout, not a worktree.
3. **Happy path:** click a healthy builder → terminal opens immediately, no toast (regression check).

**Cross-platform:** n/a (desktop VSCode extension only).

## Build / Verify Commands

- `pnpm --filter codev-vscode test:unit` — the vitest unit suite (where the new tests live).
- `pnpm --filter codev-vscode compile` — typecheck + lint + esbuild bundle of the extension.
Original file line number Diff line number Diff line change
@@ -0,0 +1,35 @@
# PIR #982 — Review iteration 1 disposition

Single advisory pass (`max_iterations: 1`). Verdicts: claude=APPROVE, codex=REQUEST_CHANGES, gemini=skipped.

## codex — REQUEST_CHANGES (HIGH): Recover ran from the wrong cwd — ADDRESSED (fixed in code)

**Finding:** The **Recover Builders** action launched `afx workspace recover` with `cwd: workspacePath` from `connectionManager.getWorkspacePath()`. Because `detectWorkspacePath` (`workspace-detector.ts`) walks up to the first directory containing `codev/`, and a builder worktree (`.builders/<id>`) is a full checkout with its own `codev/`, a VSCode window *rooted at* a worktree resolves `getWorkspacePath()` to the worktree — so recovery would run inside the worktree, contradicting repo guidance that `afx` must run from the main checkout root.

**Assessment:** Real defect, confirmed against `workspace-detector.ts:9-20` (`findProjectRoot` stops at the first `codev/`). HIGH confidence is warranted for the worktree-rooted-window scenario.

**Fix (commit on branch):**
- Added `mainCheckoutRoot(workspacePath)` in `packages/vscode/src/terminal-resolve.ts` — strips a trailing `/.builders/<id>` segment (leaf-only; tolerates trailing slash and Windows separators), returning a normal main-checkout path unchanged.
- `promptNoTerminalRecovery` now uses `cwd: mainCheckoutRoot(workspacePath)` for the recover terminal.

**Regression coverage:**
- 5 behavioral tests for `mainCheckoutRoot` (unchanged path / strip worktree / trailing slash / non-leaf no-op / Windows separators).
- The `terminal-manager.ts` source-level guard now asserts `cwd: mainCheckoutRoot(workspacePath)` — it fails if the cwd reverts to the raw workspace path.

**Scope note:** The same `cwd: workspacePath` pattern for an `afx` command pre-exists in `run-worktree-setup.ts` (`afx setup`) and other call sites. A consistent main-root resolution across *all* afx invocations is broader than #982; this PR fixes only the affordance it introduces and flags the rest as a follow-up.

## codex — test-coverage note: ADDRESSED

The original source-level regex test would not have caught a wrong cwd. The new behavioral `mainCheckoutRoot` tests plus the tightened source-level assertion now pin the corrected behavior.

## gemini — skipped (non-blocking)

Antigravity `agy` CLI not installed in this environment; no review content. No action.

## claude — APPROVE

No changes requested.

## Escalation

PIR is single-pass: this fix will **not** be independently re-reviewed by the models. The human at the `pr` gate is the only remaining reviewer of the fix — flagged in the architect notification.
32 changes: 32 additions & 0 deletions codev/projects/982-vscode-tower-no-active-termina/status.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,32 @@
id: '982'
title: vscode-tower-no-active-termina
protocol: pir
phase: verified
plan_phases: []
current_plan_phase: null
gates:
plan-approval:
status: approved
requested_at: '2026-06-06T05:56:04.976Z'
approved_at: '2026-06-06T12:04:30.865Z'
dev-approval:
status: approved
requested_at: '2026-06-06T12:11:59.157Z'
approved_at: '2026-06-06T12:15:06.203Z'
pr:
status: approved
requested_at: '2026-06-06T12:25:53.858Z'
approved_at: '2026-06-06T12:29:16.508Z'
iteration: 1
build_complete: true
history: []
started_at: '2026-06-06T05:49:39.607Z'
updated_at: '2026-06-06T12:29:38.161Z'
pr_history:
- phase: implement
pr_number: 1006
branch: builder/pir-982
created_at: '2026-06-06T12:16:48.295Z'
merged: true
merged_at: '2026-06-06T12:29:38.160Z'
pr_ready_for_human: false
2 changes: 2 additions & 0 deletions codev/resources/lessons-learned.md
Original file line number Diff line number Diff line change
Expand Up @@ -406,6 +406,8 @@ Generalizable wisdom extracted from review documents, ordered by impact. Updated

- [From 812] VS Code gives extensions no control over a panel view container's *position*. The `contributes.viewsContainers` schema is `id`/`title`/`icon` only — no `order`/`priority` — and there is no runtime reposition API. A newly contributed panel tab always lands after the built-ins (Problems/Output/Terminal/…) and is the first to spill into the `…` overflow on a normal-width window, making it effectively undiscoverable on first install. The only supported discoverability lever is a one-time reveal: call `workbench.view.extension.<containerId>` once, guarded by `context.globalState`, so it fires once per profile rather than nagging on every launch. Same constraint applies to *ordering within* a container (views order by registration, not a manifest field). Don't promise a panel surface will appear in a particular spot — it won't.

- [From 982] Two views backed by different freshness sources diverge *transiently*, not only on outright failure. The VSCode sidebar lists a builder from `/api/overview` (disk-sourced: worktree + status.yaml), but the terminal opener needs a live PTY session from `/api/state` (in-memory registry), which momentarily omits a builder while Tower rehydrates and on-the-fly-reconnects its session on every call. The opener warned on the *first* miss, dead-ending on a condition that self-heals on the very next call. Fix: a bounded retry (reuse the shared `backoffDelayMs` curve with interactive params, not a hand-rolled delay) absorbs the window so the dominant case opens silently; only a genuinely persistent miss surfaces an actionable recovery toast. Lead the recovery UX with the cheap self-heal (retry), not the heavy workaround (`afx workspace recover`), which only addresses the rare destroyed-session tail. Sibling to [From 916]: same divergence class, opposite direction (there the whole shared cache nulled; here one per-builder field is briefly stale).

- [From 916] A shared cache that nulls itself on a *transient* failure correlates the failure across every consumer at once. The VSCode sidebar's four data-views share one `OverviewCache`; each renders a falsy read as `[]`, so a brief SSE drop/reconnect that nulled the cache blanked all four simultaneously, then recovered on the next event. Fix: hold last-known-good — a not-connected state or a failed fetch returns without clobbering; only a successful fetch commits. Crucially, distinguish *failure-null* from *legitimately-empty*: the source-of-truth (`TowerClient.getOverview()`) returns `null` only on request failure and a real object with empty arrays for a genuinely empty workspace, so "keep last value on null" never masks real emptiness. The discriminating diagnostic was that one sibling view (Workspace) stayed populated because it reads a *different* source — when N views empty together but one doesn't, suspect a shared dependency of the N, not N independent bugs.

---
Expand Down
Loading
Loading