tower: reconcile terminal sessions before serving requests (kill the restart successor-lookup race) #997

@amrmelsayed

Background

Surfaced during PIR #991 (terminal self-heals onto its successor session after a Tower restart).

On startup, tower-server.ts registers the HTTP handler (http.createServer, line 336) and calls server.listen() (line 342), then runs await reconcileTerminalSessions() inside the listen callback (line 398). So Tower starts serving requests, answering /api/state and 404-ing WS upgrades to now-deleted terminal ids, before reconcile re-registers persistent (shellper-backed) sessions under their new terminal ids.

Problem

During that startup window:

  1. The role → terminalId mapping in /api/state is incomplete.
  2. There is no readiness signal to wait on. GET /health reports process-up, not reconcile-complete.

So a client re-resolving a session's successor id cannot distinguish "no successor yet, reconcile pending" from "no successor ever, session gone". Both look identical: absent from state.

Because of this, #991 had to make its VSCode recovery poll /api/state a few times (bounded ~4s) to ride out the race instead of making one deterministic call. The dashboard tolerates it via its existing indefinite 1s state poll.
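The bounded poll described above can be sketched roughly as follows. This is not the #991 implementation: `resolveSuccessor`, `fetchRoleMap`, the `RoleMap` shape, and the 500 ms interval are assumptions made for illustration. Only the ~4s budget comes from this issue.

```typescript
// Assumed shape: /api/state exposes a role -> terminalId map.
// The real payload may be structured differently.
type RoleMap = Record<string, string>;

// Hypothetical helper: poll until the role has a terminal id or the
// attempt budget runs out. The budget is counted in attempts
// (budgetMs / intervalMs), not measured against the wall clock.
export async function resolveSuccessor(
  fetchRoleMap: () => Promise<RoleMap>,
  role: string,
  budgetMs = 4000,
  intervalMs = 500,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<string | undefined> {
  const attempts = Math.max(1, Math.floor(budgetMs / intervalMs));
  for (let i = 0; i < attempts; i++) {
    try {
      const id = (await fetchRoleMap())[role];
      if (id !== undefined) return id;
    } catch {
      // Tower may still be starting; treat a failed fetch like an empty map.
    }
    if (i < attempts - 1) await sleep(intervalMs);
  }
  // Still ambiguous: "reconcile pending" and "session gone" look the same here.
  return undefined;
}
```

Even with the retry loop, an `undefined` result only means "not seen within the budget", which is the ambiguity this issue aims to remove.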

Proposed approaches (pick one at plan-gate)

1. Reconcile before serving

await reconcileTerminalSessions() before server.listen(), or gate the handler (503 / hold) until reconcile completes. The first reachable /api/state is then deterministic.

Trade-off: Tower accepts no connections (health checks, other workspaces) until reconcile finishes, so a hung shellper-socket probe would block startup. Mitigation: bound reconcile time, and treat unreachable shellpers as non-fatal so a failed probe does not prevent Tower from listening.
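A minimal sketch of approach 1 with the mitigation applied, assuming reconcile can be wrapped from the outside. `runBounded`, `start`, the `Listenable` interface, and the 5s default are hypothetical names and values, not existing tower-server.ts code. The `reconcile` parameter stands in for `reconcileTerminalSessions()`.

```typescript
type Outcome = "done" | "failed" | "timeout";

// Minimal structural view of a server; http.Server satisfies it.
interface Listenable {
  listen(port: number, callback: () => void): unknown;
}

// Run a task with an upper time bound. Never rejects: a thrown error
// becomes "failed" and an overrun becomes "timeout".
export async function runBounded(
  task: () => Promise<void>,
  timeoutMs: number,
): Promise<Outcome> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<Outcome>((resolve) => {
    timer = setTimeout(() => resolve("timeout"), timeoutMs);
  });
  const work: Promise<Outcome> = task().then(
    () => "done",
    () => "failed",
  );
  try {
    return await Promise.race([work, timeout]);
  } finally {
    if (timer !== undefined) clearTimeout(timer);
  }
}

// Reconcile first, then listen. A slow or failing reconcile delays
// startup by at most timeoutMs instead of blocking it indefinitely.
export async function start(
  server: Listenable,
  reconcile: () => Promise<void>,
  port: number,
  timeoutMs = 5000, // assumed bound, not taken from the issue
): Promise<Outcome> {
  const outcome = await runBounded(reconcile, timeoutMs);
  if (outcome !== "done") {
    console.warn(`reconcile ${outcome}; serving with partial state`);
  }
  await new Promise<void>((resolve) => server.listen(port, resolve));
  return outcome;
}
```

Note that on "timeout" the underlying reconcile is not cancelled. It keeps running in the background, so the first /api/state is only deterministic when the outcome is "done".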

2. Readiness signal

Add a readiness endpoint (or extend /health) that flips ready only post-reconcile. Clients await it once, then fetch state.

Trade-off: keeps accepting connections during startup (lower risk than approach 1), at the cost of a readiness handshake clients have to learn.
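A rough sketch of the readiness handshake in approach 2, with both sides reduced to plain functions. The `Readiness` class, `awaitReady`, the 200/503 status convention, and the retry defaults are assumptions for illustration. The issue does not specify the endpoint's path or payload.

```typescript
// Server side: a flag flipped exactly once, after reconcile completes.
export class Readiness {
  private ready = false;

  markReady(): void {
    this.ready = true;
  }

  // Status and body for a hypothetical readiness route
  // (or an extended /health): 503 until reconcile is done, then 200.
  respond(): { status: number; body: { ready: boolean } } {
    return { status: this.ready ? 200 : 503, body: { ready: this.ready } };
  }
}

// Client side: probe until the server reports ready, with a bounded
// number of attempts. `probe` returns the HTTP status of the readiness
// route and is supplied by the caller.
export async function awaitReady(
  probe: () => Promise<number>,
  maxAttempts = 20,
  delayMs = 250,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<boolean> {
  for (let i = 0; i < maxAttempts; i++) {
    try {
      if ((await probe()) === 200) return true;
    } catch {
      // Connection refused while Tower restarts counts as "not ready".
    }
    if (i < maxAttempts - 1) await sleep(delayMs);
  }
  return false;
}
```

Once `awaitReady` resolves to `true`, a single /api/state fetch is enough: an id missing from the state then means the session is gone, not that reconcile is still pending.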

Acceptance

Notes

Related

  • #991 (in flight) — VSCode + dashboard terminal stale-tab auto-remount onto successor session id. Client-side workaround for the race this issue removes at the root.

Labels

area/tower (Area: Tower server / agent farm CLI)
