Background
Surfaced during PIR #991 (terminal self-heals onto its successor session after a Tower restart).
On startup, tower-server.ts registers the HTTP handler (http.createServer, line 336) and calls server.listen() (line 342), then runs await reconcileTerminalSessions() inside the listen callback (line 398). So Tower starts serving requests, answering /api/state and 404-ing WS upgrades to now-deleted terminal ids, before reconcile re-registers persistent (shellper-backed) sessions under their new terminal ids.
Problem
During that startup window:
- The role → terminalId mapping in /api/state is incomplete.
- There is no readiness signal to wait on. GET /health reports process-up, not reconcile-complete.
So a client re-resolving a session's successor id cannot distinguish "no successor yet, reconcile pending" from "no successor ever, session gone". Both look identical: absent from state.
Because of this, #991 had to make its VSCode recovery poll /api/state a few times (bounded ~4s) to ride out the race instead of making one deterministic call. The dashboard tolerates it via its existing indefinite 1s state poll.
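For illustration, the bounded-poll workaround has roughly this shape. This is a sketch, not the #991 code: pollForSuccessor, the WorkspaceState type, and the role name in the usage note are hypothetical, and the 4 x 1s defaults only approximate the "~4s" bound mentioned above.

```typescript
// Hypothetical slice of the /api/state payload: role -> terminalId.
type WorkspaceState = { terminals: Record<string, string> };

// Re-fetch state up to `attempts` times, waiting `delayMs` between tries.
// Returns the successor terminal id, or undefined if it never appeared.
async function pollForSuccessor(
  fetchState: () => Promise<WorkspaceState>,
  role: string,
  attempts = 4,
  delayMs = 1000,
): Promise<string | undefined> {
  for (let i = 0; i < attempts; i++) {
    const state = await fetchState();
    const terminalId = state.terminals[role];
    if (terminalId !== undefined) return terminalId;
    // Sleep only between attempts, not after the last one.
    if (i < attempts - 1) {
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
  // Still ambiguous: reconcile may be pending, or the session may be gone.
  return undefined;
}
```

A caller would use something like `await pollForSuccessor(fetchState, "architect")`. Note that an undefined result after the last attempt still cannot tell "reconcile pending" from "session gone", which is the ambiguity this issue is about.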
Proposed approaches (pick one at plan-gate)
1. Reconcile before serving
await reconcileTerminalSessions() before server.listen(), or gate the handler (503 / hold) until reconcile completes. The first reachable /api/state is then deterministic.
Trade-off: Tower accepts no connections (health checks, other workspaces) until reconcile finishes. A hung shellper-socket probe would block startup. Mitigation: bound reconcile time, make it non-blocking-on-failure for unreachable shellpers.
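A minimal sketch of approach 1 with the time bound from the mitigation, assuming reconcile and listen can be passed in as promise-returning functions. withDeadline, startTower, the Outcome type, and the 5000 ms budget are hypothetical, not existing Tower code.

```typescript
type Outcome = "done" | "timeout" | "failed";

// Settle with "done" or "failed" when `work` settles, or "timeout" if the
// budget elapses first. Never rejects, so startup cannot be blocked by it.
function withDeadline(work: Promise<void>, ms: number): Promise<Outcome> {
  return new Promise<Outcome>((resolve) => {
    const timer = setTimeout(() => resolve("timeout"), ms);
    work.then(
      () => { clearTimeout(timer); resolve("done"); },
      () => { clearTimeout(timer); resolve("failed"); },
    );
  });
}

// Reconcile first (bounded), then start accepting connections.
async function startTower(
  reconcile: () => Promise<void>,
  listen: () => Promise<void>,
  budgetMs = 5000,
): Promise<Outcome> {
  const outcome = await withDeadline(reconcile(), budgetMs);
  // Listen regardless of outcome: a hung or failed reconcile degrades to
  // today's behaviour instead of keeping Tower down.
  await listen();
  return outcome;
}
```

In the real server, listen could wrap `new Promise<void>((resolve) => server.listen(port, resolve))`. Returning the outcome lets the caller log or expose whether the first /api/state is actually backed by a completed reconcile.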
2. Readiness signal
Add a readiness endpoint (or extend /health) that flips ready only post-reconcile. Clients await it once, then fetch state.
Trade-off: keeps accepting connections during startup (lower risk than approach 1), at the cost of a readiness handshake clients have to learn.
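A sketch of approach 2, assuming /health gains a boolean ready field. ReadinessGate, resolveSuccessor, and the Health, WorkspaceState, and Resolution types are hypothetical names for illustration, not existing Tower APIs.

```typescript
// Assumed /health payload once it is extended with a readiness flag.
type Health = { status: "ok"; ready: boolean };
// Hypothetical slice of the /api/state payload: role -> terminalId.
type WorkspaceState = { terminals: Record<string, string> };

// Server side: flipped exactly once, after reconcile completes.
class ReadinessGate {
  private ready = false;
  markReady(): void { this.ready = true; }
  health(): Health { return { status: "ok", ready: this.ready }; }
}

type Resolution =
  | { kind: "pending" }                      // reconcile not finished yet
  | { kind: "gone" }                         // reconcile finished, no successor
  | { kind: "found"; terminalId: string };   // successor registered

// Client side: with a readiness flag, absence from state is no longer ambiguous.
function resolveSuccessor(health: Health, state: WorkspaceState, role: string): Resolution {
  if (!health.ready) return { kind: "pending" };
  const terminalId = state.terminals[role];
  return terminalId === undefined ? { kind: "gone" } : { kind: "found", terminalId };
}
```

The server would call markReady() right after reconcileTerminalSessions() resolves. The client then waits for ready once and treats a missing role as "gone" instead of polling.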
Acceptance
- getWorkspaceState (once Tower is reachable or reports ready) deterministically reflects the completed reconcile. No client polling is needed to resolve a successor id.
- The bounded recovery poll from #991 (recoverSuccessor in packages/vscode/src/terminal-manager.ts) and the dashboard's reliance on the 1s poll can simplify to a single deterministic fetch.
Notes
Related
- #991 (in flight): VSCode + dashboard terminal stale-tab auto-remount onto successor session id. Client-side workaround for the race this issue removes at the root.