terminal-adapter: WebSocket close-handler spams 'Connection lost' in a tight loop with no backoff, no give-up, no actual reconnect #936

@amrmelsayed

Symptom

When the WebSocket backing a Codev terminal (architect or builder) closes — typically because Tower's in-memory session registry no longer has the terminal ID (Tower restart, PTY died, network blip + stale ID) — the terminal pane fills with repeated yellow notices:

[Codev: Connection lost, reconnecting...]
[Codev: Connection lost, reconnecting...]
[Codev: Connection lost, reconnecting...]
... (continues indefinitely)

Observed in real use today: an architect terminal accumulated dozens of these lines in seconds with no apparent progress and no resolution. The Claude/architect process on Tower's side may still be alive; the user just can't see its output because the WebSocket is stuck in a failure loop.

Why it's broken

Three distinct problems compound in packages/vscode/src/terminal-adapter.ts:137-143:

this.ws.on('close', () => {
  if (!this.disposed) {
    this.log('WARN', 'WebSocket closed');
    this.writeEmitter.fire('\x1b[33m[Codev: Connection lost, reconnecting...]\x1b[0m\r\n');
    // Reconnection handled by terminal-manager
  }
});
  1. The comment is wrong. terminal-manager.ts does not subscribe to the adapter's onDidClose and does not call adapter.reconnect(). The notice claims reconnection is happening but the message-emit path doesn't actually trigger one. Whatever IS re-opening the WebSocket (either an upstream caller or the ws library itself) does so with no coordination with this code.
  2. No rate limiting. Every close-event fires the notice immediately. A connection that's failing fast (Tower closes the WS right after handshake because the requested session ID doesn't exist) produces several messages per second.
  3. No give-up condition. The loop has no upper retry bound, no exponential backoff visible to the user, and no terminal failure state — it just spams until the user manually intervenes (reload window / restart Tower / close + reopen the terminal tab).

Why this matters

  • User-visible terminal is unusable while the loop runs — the architect's actual output (which may still be flowing into a live Tower-side PTY) is buried under reconnect spam.
  • No diagnostic signal. The user can't tell from the notice whether the loop is making progress (e.g. exponential backoff approaching success) or whether it's hopelessly retrying against a session ID Tower has already forgotten.
  • Easy to mistake for "Tower is down" when the actual cause is often "this specific terminal session ID is dead" — fixable by closing and re-opening the tab without restarting Tower.
  • Connection-manager has correct backoff. connection-manager.ts:scheduleReconnect does exponential 1000 * Math.pow(2, attempt) capped at 30s for the SSE / health-check connection. The terminal-adapter's WS reconnect is unprotected by the same discipline — inconsistent across the two transport layers.

Proposed fix

Five coordinated changes in packages/vscode/src/terminal-adapter.ts:

1. Add explicit reconnect orchestration at the adapter level

Move from "ws.on('close') fires a notice and hopes someone reconnects" to "adapter owns the reconnect loop with backoff." Mirror the same exponential-backoff shape connection-manager.ts uses for SSE — 1000 * Math.pow(2, attempt) capped at e.g. 30s.
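A minimal sketch of what adapter-owned reconnect could look like, assuming the same 1000 * Math.pow(2, attempt) curve with a 30s cap. The names ReconnectLoop, Scheduler, and backoffDelayMs are hypothetical and do not exist in terminal-adapter.ts today. The timer is injected so the loop can be tested without real delays, and a pending flag keeps several close events in a row from scheduling more than one retry.

```typescript
// Sketch only: ReconnectLoop, Scheduler and backoffDelayMs are hypothetical
// names, not part of the existing terminal-adapter API.
type Scheduler = (fn: () => void, delayMs: number) => void;

const BASE_DELAY_MS = 1000;
const MAX_DELAY_MS = 30_000;

/** Delay before retry number `attempt` (0-based): 1s, 2s, 4s, ... capped at 30s. */
function backoffDelayMs(attempt: number): number {
  return Math.min(BASE_DELAY_MS * Math.pow(2, attempt), MAX_DELAY_MS);
}

class ReconnectLoop {
  private readonly connect: () => void;
  private readonly schedule: Scheduler;
  private attempt = 0;
  private pending = false;
  private disposed = false;

  constructor(connect: () => void, schedule: Scheduler) {
    this.connect = connect;
    this.schedule = schedule;
  }

  /**
   * Call from the WebSocket 'close' handler.
   * Returns the delay that was scheduled, or undefined if nothing was
   * scheduled (already disposed, or a retry is already pending).
   */
  onClose(): number | undefined {
    if (this.disposed || this.pending) return undefined;
    const delay = backoffDelayMs(this.attempt);
    this.attempt += 1;
    this.pending = true;
    this.schedule(() => {
      this.pending = false;
      if (!this.disposed) this.connect();
    }, delay);
    return delay;
  }

  /** Call from the WebSocket 'open' handler: a good connection resets the curve. */
  onOpen(): void {
    this.attempt = 0;
  }

  dispose(): void {
    this.disposed = true;
  }
}
```

In the adapter, the scheduler would presumably be `(fn, ms) => setTimeout(fn, ms)`, and `connect` would be whatever function opens a fresh WebSocket for the terminal ID.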

2. Replace per-attempt spam with a single updating status

The terminal pane today re-prints [Codev: Connection lost, reconnecting...] on every close. Replace with one notice on the first close that updates in place to show backoff progress:

[Codev: Connection lost — retrying in 4s (attempt 3)]

Use ANSI cursor controls (\r + clear-line) to overwrite the previous line, or accept a one-line-per-attempt cadence but only when the backoff actually fires (not multiple per second). The pane should stay readable, not become a wall of identical lines.
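One way the in-place update could be written, as a sketch rather than a finished design. The helper name retryNotice and the exact wording of the notice are illustrative. The key detail is that the notice must not end in a newline: once the cursor moves to the next line, \r plus erase-line (CSI 2 K) would clear the new empty line instead of the old notice. The first notice starts on a fresh line so it does not erase a half-written prompt.

```typescript
// Sketch only: retryNotice is a hypothetical helper, and the wording is illustrative.
const YELLOW = "\x1b[33m";
const RESET = "\x1b[0m";
const CLEAR_LINE = "\r\x1b[2K"; // carriage return, then "erase entire line"

/**
 * Build the text to pass to writeEmitter.fire().
 * @param attempt  1-based retry number shown to the user
 * @param delayMs  backoff delay before this retry fires
 * @param isFirst  true for the first notice after a close
 */
function retryNotice(attempt: number, delayMs: number, isFirst: boolean): string {
  const seconds = Math.round(delayMs / 1000);
  const text = `[Codev: Connection lost, retrying in ${seconds}s (attempt ${attempt})]`;
  // First notice: start on a new line. Later notices: overwrite the previous one.
  const prefix = isFirst ? "\r\n" : CLEAR_LINE;
  return `${prefix}${YELLOW}${text}${RESET}`;
}

/** Emit once the socket is open again, so PTY output resumes on a clean line. */
function reconnectedNotice(): string {
  return `${CLEAR_LINE}${YELLOW}[Codev: Reconnected]${RESET}\r\n`;
}
```

A caveat with this approach: if the pane is narrow enough for the notice to wrap, erase-line only clears the last visual row, so the one-line-per-attempt fallback may be the safer default.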

3. Add a give-up condition

After N failed retries (proposed: 6 attempts, total elapsed ~63s with plain doubling, or ~61s once the 30s cap trims the sixth delay), stop auto-retrying and surface a quiet failure message in the terminal. The give-up state is what stops the loop. Surfacing a recovery affordance on top of that state (terminal link, toast, status-bar action) is tracked separately in #939: this issue ships the give-up state itself with a plain red message, and #939 adds the one-click recovery affordance on top of it.
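The give-up rule can be sketched as a small pure function, assuming 6 attempts and the capped exponential curve from change 1. The names nextStep, NextStep, and totalWaitMs are hypothetical, and the red message text is a placeholder. Under these assumptions the six delays are 1s, 2s, 4s, 8s, 16s, and 30s, which sum to 61s (63s without the cap).

```typescript
// Sketch only: names and message wording are hypothetical.
const MAX_ATTEMPTS = 6;
const CAP_MS = 30_000;

/** Capped exponential delay for a 0-based attempt index. */
function cappedDelayMs(attempt: number): number {
  return Math.min(1000 * Math.pow(2, attempt), CAP_MS);
}

type NextStep =
  | { kind: "retry"; delayMs: number }
  | { kind: "give-up"; notice: string };

/**
 * Decide what to do after a close.
 * @param failedAttempts number of retries that have already failed
 */
function nextStep(failedAttempts: number, maxAttempts: number = MAX_ATTEMPTS): NextStep {
  if (failedAttempts >= maxAttempts) {
    // Terminal state: the caller stops scheduling retries after this.
    const notice =
      `\x1b[31m[Codev: Connection lost. Gave up after ${maxAttempts} attempts.]\x1b[0m\r\n`;
    return { kind: "give-up", notice };
  }
  return { kind: "retry", delayMs: cappedDelayMs(failedAttempts) };
}

/** Total time spent waiting before the give-up state is reached. */
function totalWaitMs(maxAttempts: number = MAX_ATTEMPTS): number {
  let total = 0;
  for (let i = 0; i < maxAttempts; i++) total += cappedDelayMs(i);
  return total;
}
```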

4. Detect Tower-side "session not found" and stop retrying early

If Tower's WS close-frame carries enough information to distinguish "session ID unknown" from "transient network blip" (close code, reason string), the adapter should stop retrying on the first occurrence of "session unknown": retrying against a Tower that has already forgotten the session ID is hopeless, and the user wants the give-up state immediately. If no such signal is available, falling back to the N-retry give-up from change 3 above is fine.
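A sketch of the classification step, with loud assumptions: whether Tower sends any distinguishing close code or reason is exactly what the plan needs to investigate, so both the code 4404 and the "session not found" reason text below are invented placeholders. What is certain is that RFC 6455 reserves close codes 4000-4999 for private use and that 1006 means the connection dropped without a close frame. With the ws library, the 'close' event passes a code and a reason; in ws 8 the reason should arrive as a Buffer, so it would need `.toString()` before being passed in.

```typescript
// Sketch only: SESSION_UNKNOWN_CODE and the reason pattern are HYPOTHETICAL.
// Tower is not known to send either; they stand in for whatever signal exists.
type CloseVerdict = "retry" | "give-up";

const SESSION_UNKNOWN_CODE = 4404; // placeholder in the private-use range 4000-4999
const SESSION_UNKNOWN_REASON = /session not found/i; // placeholder wording

/**
 * Decide whether a close is worth retrying.
 * @param code    WebSocket close code (1006 when no close frame was received)
 * @param reason  close reason, already decoded to a string
 */
function classifyClose(code: number, reason: string): CloseVerdict {
  if (code === SESSION_UNKNOWN_CODE) return "give-up";
  if (SESSION_UNKNOWN_REASON.test(reason)) return "give-up";
  // Everything else, including 1006 (abnormal closure), is treated as
  // transient and left to the normal backoff and N-retry give-up.
  return "retry";
}
```

Defaulting to "retry" for anything unrecognized keeps the behavior conservative: a missing or unexpected signal degrades to the retry-count give-up instead of abandoning a connection that might recover.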

5. Correct or remove the misleading comment

packages/vscode/src/terminal-adapter.ts:141 says "Reconnection handled by terminal-manager." Either restore that contract (add the subscription in terminal-manager.ts) or update the comment to reflect that the adapter owns reconnect now.

Design calls for plan-approval

  1. Backoff curve: exponential 2^attempt matches connection-manager.ts. Cap at 30s like there, or something shorter (terminal reconnect is more user-visible than SSE)? Proposed default: same 30s cap, same exponential curve, for cross-layer consistency.
  2. Give-up threshold: 6 attempts (~63s total) is my proposal. Could be configurable via codev.terminalReconnectMaxAttempts if some users want to keep retrying indefinitely. Proposed default: 6 attempts, no config knob in v1.
  3. Distinguish session-unknown from transient close: if the WS close-frame from Tower carries diagnostic info, exploit it. If not, conservatively retry per change 3 of the proposed fix (the N-retry give-up). Plan should investigate first.

Acceptance

  • On WebSocket close, the terminal shows at most one notice every backoff interval (not multiple per second).
  • The notice updates in place or appears at the backoff cadence — not as wall-of-identical-lines.
  • After N failed retries (proposed N=6), the adapter stops auto-retrying and surfaces a quiet failure message in the terminal pane.
  • The comment at terminal-adapter.ts:141 either reflects the new orchestration or is replaced/removed.
  • No regression to the existing successful-reconnect path (when Tower comes back and the session is still valid, the adapter reconnects cleanly and replays the buffered output).

Out of scope

Why PIR

Multiple design calls (backoff curve, give-up threshold, whether to consume the WS close-frame's diagnostic info) genuinely benefit from a plan-approval gate. The fix is also user-visible UI behavior that wants dev-approval verification — try a few induced disconnect scenarios (kill Tower mid-session, kill the PTY, network drop) and confirm the new notice/backoff/give-up flow reads well in each. PR-diff-only review is insufficient for a resilience fix this user-facing.

Labels: area/terminal (Terminal-specific: PTY, vscode terminal pane)
