Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
@@ -0,0 +1,12 @@
id: bugfix-826
title: critical-sibling-architects-le
protocol: bugfix
phase: verified
plan_phases: []
current_plan_phase: null
gates: {}
iteration: 1
build_complete: false
history: []
started_at: '2026-05-23T15:09:40.916Z'
updated_at: '2026-05-23T15:26:21.372Z'
16 changes: 10 additions & 6 deletions codev/resources/arch.md
Original file line number Diff line number Diff line change
Expand Up @@ -258,15 +258,19 @@ A workspace can host more than one architect terminal. Each architect has a stab
**Lifecycle (Spec 786 Phase 3 / OQ-B)**:
- **Add**: `afx workspace add-architect [--name <name>]`. Validator at `utils/architect-name.ts` enforces `[a-z][a-z0-9-]*` (max 64), rejects `main` as reserved (Spec 786). Auto-numbers via `autoNumberArchitectName` when `--name` is omitted (smallest unused `architect-<N>` integer ≥ 2).
- **Remove**: `afx workspace remove-architect <name>` (also dashboard close-X + VSCode right-click). Server refuses `main`. Removing an architect with in-flight builders proceeds — builders fall back to `main` routing.
- **Graceful stop**: `afx workspace stop` sets the `intentionallyStopping` flag (`tower-instances.ts`) so the six cascaded exit handlers (4 in `tower-instances.ts`, 2 in `tower-terminals.ts`) skip the `setArchitectByName(name, null)` call. Sibling rows in `state.db.architect` survive the stop.
- **Graceful start**: `launchInstance` creates `main` if absent (gate changed from `entry.architects.size === 0` to `!entry.architects.has('main')`), then iterates persisted siblings via `getArchitects()` and re-spawns each via `addArchitect`. Critical ordering: main FIRST (otherwise `addArchitect`'s `size > 0` guard rejects the sibling spawn).
- **Graceful stop**: `afx workspace stop` sets the `intentionallyStopping` flag (`tower-instances.ts`) so the six cascaded exit handlers (4 in `tower-instances.ts`, 2 in `tower-terminals.ts`) skip the `setArchitectByName(workspacePath, name, null)` call. Sibling rows in `state.db.architect` survive the stop, scoped by `workspace_path`.
- **Graceful start**: `launchInstance` creates `main` if absent (gate changed from `entry.architects.size === 0` to `!entry.architects.has('main')`), then iterates persisted siblings via `getArchitects(resolvedPath)` — workspace-scoped (Bugfix #826) — and re-spawns each via `addArchitect`. Critical ordering: main FIRST (otherwise `addArchitect`'s `size > 0` guard rejects the sibling spawn). The workspace scoping prevents architects registered in workspace A from leaking into workspace B at launch — schema-level isolation rather than per-call-site guards.
- **Crash recovery**: rows in `terminal_sessions` survive because Tower didn't clean them up; `reconcileTerminalSessions()` reconnects via shellper sockets.
- **Permanent exit** (max-restart exhaustion): exit handlers run WITHOUT the intentional-stop flag set, so `setArchitectByName(name, null)` fires and the row is auto-deleted (Spec 786 OQ-B — `state.db` mirrors reality).
- **Stop-all** (`tower-routes.ts:handleWorkspaceStopAll`): explicit "tear everything down" — full wipe of `terminal_sessions` and `state.db.architect`. Semantically distinct from `stopInstance` which preserves sibling registration.
- **Permanent exit** (max-restart exhaustion): exit handlers run WITHOUT the intentional-stop flag set, so `setArchitectByName(workspacePath, name, null)` fires and the row is auto-deleted (Spec 786 OQ-B — `state.db` mirrors reality).
- **Stop-all** (`tower-routes.ts:handleWorkspaceStopAll`): explicit "tear everything down" — full wipe of `terminal_sessions` and per-workspace `state.db.architect` rows. Semantically distinct from `stopInstance` which preserves sibling registration.

**Persistence layers**:
- `state.db.architect` — durable per-architect registration (`id, pid, port, cmd, started_at, terminal_id`). `pid`/`port` persist as `0` (Spec 755 limitation); the live values come from Tower's `PtySession` only.
- `terminal_sessions` — global runtime session registry. Wiped on graceful stop (`stopInstance` and `stop-all`); preserved on crash. Reconciliation reads `role_id` to re-key the in-memory architect map.
- `state.db.architect` — durable per-architect registration. Schema: `(workspace_path TEXT NOT NULL, id TEXT NOT NULL, pid, port, cmd, started_at, terminal_id)` with composite primary key `(workspace_path, id)` (Bugfix #826 migration v11). Workspace scoping is part of the schema: the same architect name (e.g. `main`) can exist in multiple workspaces without collision, and queries scoped to one workspace's `workspace_path` cannot return rows from another. `pid`/`port` persist as `0` (Spec 755 limitation); the live values come from Tower's `PtySession` only.
- `terminal_sessions` — global runtime session registry. Wiped on graceful stop (`stopInstance` and `stop-all`); preserved on crash. Reconciliation reads `role_id` to re-key the in-memory architect map. Not used as a workspace-scoping signal — the architect table carries that directly.

**Migration history**:
- v9 (Spec 755): rebuild architect table as TEXT primary key, rekey to 'main'.
- v11 (Bugfix #826): add `workspace_path` to architect, backfilling from `global.db.terminal_sessions` via ATTACH. Disambiguation uses `architect.terminal_id` as the primary key (matches one unique terminal_session row) with `role_id` as the fallback — for users already hit by the v3.1.1 leak whose `state.db.architect` has names appearing in MULTIPLE workspaces' terminal_session rows, this ensures each architect row is migrated to its LEGITIMATE workspace (the one whose terminal_session has the matching session UUID). Orphans (architects with neither match) are dropped.
- In-memory `WorkspaceTerminals.architects: Map<string,string>` — name → terminal id. Rebuilt on every `launchInstance`/reconciliation.

**Surface enumeration (Spec 786 Phase 5)**:
Expand Down
6 changes: 3 additions & 3 deletions codev/resources/commands/agent-farm.md
Original file line number Diff line number Diff line change
Expand Up @@ -252,12 +252,12 @@ afx send architect "Status update"

#### Persistence and recovery

Spec 786 Phase 3 added graceful-restart persistence for sibling architects:
Spec 786 Phase 3 added graceful-restart persistence for sibling architects; Bugfix #826 extended it with workspace scoping via a schema change — `state.db.architect` now has `workspace_path` as part of the composite primary key, so architects registered in workspace A cannot appear in queries scoped to workspace B.

- **`afx workspace stop` → `afx workspace start`**: sibling architects survive. Tower's `stopInstance` marks the workspace as "intentionally stopping" so the cascaded exit handlers skip deleting `state.db.architect` rows. On next start, `launchInstance` creates `main` and then re-spawns persisted siblings via `addArchitect`.
- **`afx workspace stop` → `afx workspace start`**: sibling architects survive. Tower's `stopInstance` marks the workspace as "intentionally stopping" so the cascaded architect exit handlers skip the `setArchitectByName(workspacePath, name, null)` call, preserving rows in `state.db.architect`. On next start, `launchInstance` creates `main` and then re-spawns persisted siblings via `addArchitect` — `getArchitects(resolvedPath)` returns only this workspace's rows (Bugfix #826).
- **Tower crash**: `terminal_sessions` rows + shellper processes survive. Tower's `reconcileTerminalSessions()` reconnects on startup.
- **Permanent exit (max-restart exhaustion, `remove-architect`)**: rows are auto-deleted from `state.db.architect` (Spec 786 OQ-B — keeps state.db an accurate mirror of reality).
- **Dashboard "Stop All"** (or `POST /workspace/<base64>/api/stop` directly): full wipe, including sibling rows. Use this when you want to start over from scratch. There is no `afx workspace stop-all` CLI today — the full-wipe path is currently API-only via the dashboard.
- **Dashboard "Stop All"** (or `POST /workspace/<base64>/api/stop` directly): full wipe, including sibling rows. Per-workspace `state.db.architect` rows are removed pre-emptively in the route handler. Use this when you want to start over from scratch. There is no `afx workspace stop-all` CLI today — the full-wipe path is currently API-only via the dashboard.

---

Expand Down
132 changes: 132 additions & 0 deletions codev/state/bugfix-826_thread.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,132 @@
# bugfix-826 thread

## 2026-05-23 — Bugfix start

**Issue**: #826 — Sibling architects leak across workspaces. v3.1.1 hotfix.

**Approach (per architect instruction)**: Option B — modify `launchInstance` reconcile loop to only re-spawn architects whose `terminal_sessions.workspace_path` matches the current workspace. No schema migration.

**Phase**: investigate

Plan:
1. Find `launchInstance` reconcile loop (introduced in #786)
2. Identify the architect iteration that re-spawns siblings
3. Find the right join with `terminal_sessions` to gate by `workspace_path`
4. Implement minimal fix
5. Add regression test
6. PR + CMAP review

## Fix shape

- New helper `getArchitectsForWorkspace(workspacePath)` in `state.ts`. Two-query approach (no SQLite cross-DB attach): first reads `role_id` values from `terminal_sessions` in `global.db` filtered by `type='architect' AND workspace_path=?`, then intersects with `state.db.architect` rows by name.
- `launchInstance` reconcile loop in `tower-instances.ts` now calls `getArchitectsForWorkspace(resolvedPath)` instead of unscoped `getArchitects()`.
- Tests: 5 new tests in `state.test.ts` for the helper, 1 source-level sentinel in `tower-instances.test.ts`.

## Known trade-off (called out in PR for architect awareness)

Because `deleteWorkspaceTerminalSessions` wipes all terminal_sessions rows for a workspace on `afx workspace stop`, the matching workspace_path signal is also wiped. That means after a `stop + start` cycle, persisted sibling architects in `state.db.architect` will *not* be re-spawned — a partial regression of Spec 786's stop+start sibling persistence story. This is the trade-off the issue's "Option B" explicitly accepts vs. Option A (proper schema migration). v3.1.2 prioritizes stopping the cross-workspace leak; Option A is the proper long-term fix.

## Architect call: Option B+ (iter-2)

Architect's independent CMAP had Codex flag the regression. Architect chose Option B+: preserve architect rows on stop so Spec 786's stop+start story still works.

Additional changes (iter-2):
- `deleteWorkspaceTerminalSessions(path, { includeArchitects? = false })` — default preserves architect rows on stop. Full-wipe callers (`handleWorkspaceStopAll`) opt in.
- All four architect exit handlers in `tower-instances.ts` now skip `deleteTerminalSession(session.id)` (in addition to the existing skip of `setArchitectByName(name, null)`) when intentionally stopping. Architect rows survive in BOTH `state.db.architect` and `global.db.terminal_sessions` across an `afx workspace stop`.
- `saveTerminalSession` enforces a `(workspace_path, role_id)` uniqueness invariant for `type='architect'` rows: pre-deletes any existing architect row before inserting. Prevents stale-row accumulation across multiple stop+start cycles (since the new PTY gets a fresh terminal id each time).
- 3 new behavioral integration-contract tests in `state.test.ts` exercise the full lifecycle: stop preserves architect rows in both tables, cross-workspace isolation holds, and repeated stop+start cycles don't accumulate stale rows.

## iter-3: gate the per-session exit handlers in tower-terminals.ts

Architect's second independent CMAP run had Codex flag one remaining hole: two PtySession exit handlers in `tower-terminals.ts` (lines 712, 889 — both in `reconcileTerminalSessions`) called `deleteTerminalSession(session.id)` UNCONDITIONALLY. The iter-2 gating only protected `setArchitectByName`. So a single architect PTY exit during intentional stop would wipe its `terminal_sessions` row before the bulk wipe (now architect-preserving) had a chance to preserve it.

iter-3 changes:
- Both `tower-terminals.ts` exit handlers now gate `deleteTerminalSession(session.id)` on `!isIntentionallyStopping(workspacePath) || type !== 'architect'`. Non-architect terminal types (shells, builders) keep deleting on exit as before.
- Source-level sentinel test in `tower-instances.test.ts` scans all 6 architect exit handlers (4 in tower-instances.ts + 2 in tower-terminals.ts) and verifies each gates `deleteTerminalSession` on the intentional-stop flag. Pins the property at source so a future refactor that re-introduces the bug fails the test.
- Doc updates in `codev/resources/arch.md` and `codev/resources/commands/agent-farm.md` reflect the iter-2 + iter-3 lifecycle.

All 215 affected unit tests pass. Type check clean.

## iter-4: Option A (Workspace-scoped schema)

Architect's call: abandon the per-site patching approach. After iter-3's third independent CMAP REQUEST_CHANGES (Codex found another hole), the architect determined the root cause is the schema, not the call sites. Switching to Option A.

iter-4 scope (this is now a much bigger refactor):

1. **Reverted iter-2 and iter-3 patches**:
- `deleteWorkspaceTerminalSessions` back to single-arg, deletes ALL rows
- 4 exit handlers in tower-instances.ts back to: delete terminal_session unconditionally, gate only setArchitectByName
- 2 exit handlers in tower-terminals.ts: same revert
- `saveTerminalSession` uniqueness invariant removed
- `handleWorkspaceStopAll` opt-in arg removed
- iter-2/iter-3 doc edits replaced with Option A docs
- The `intentionallyStopping` mechanism stays (still needed for state.db.architect preservation on graceful stop)

2. **Schema migration v11**: `state.db.architect` gets `workspace_path TEXT NOT NULL` as part of composite primary key `(workspace_path, id)`. Backfill via `ATTACH global.db` and join on `terminal_sessions.role_id`. Orphans (architects with no matching terminal_session) are dropped. `CREATE INDEX idx_architect_workspace` for efficient per-workspace lookups. Migration is idempotent (skips if `workspace_path` column already present in fresh installs).

3. **state.ts accessors all take workspacePath**:
- `getArchitects(workspacePath)` (replaces unscoped `getArchitects()` + the iter-1 `getArchitectsForWorkspace`)
- `setArchitect(workspacePath, architect)`
- `setArchitectByName(workspacePath, name, architect)`
- `removeArchitect(workspacePath, name)`
- `loadState(workspacePath)` (architect read is scoped; builders/utils/annotations remain global per state.db)
- `getArchitect(workspacePath)`, `getArchitectByName(workspacePath, name)`

4. **All callers updated**: tower-instances.ts (4 exit handlers, addArchitect, removeArchitect, launchInstance reconcile + 2 main setArchitect calls), tower-terminals.ts (2 exit handlers), tower-routes.ts (handleWorkspaceStopAll), and CLI commands (status, attach, stop, send, cleanup) all pass workspacePath through.

5. **migrateLocalFromJson** also takes workspacePath (one-time legacy JSON-to-SQLite migration); db/index.ts passes config.workspaceRoot through.

6. **Tests rewritten**:
- `bugfix-826-migration.test.ts` (new): exercises v11 migration — backfill, orphan drop, partitioning of same-name across workspaces, index creation, empty-table case, _migrations record.
- `state.test.ts`: replaced iter-1/iter-2/iter-3 specific tests with a `workspace-scoped architect schema` describe block covering isolation, leak regression, stop+start preservation via clearRuntime + scoped re-read, scoped getArchitectByName, scoped loadState, and per-workspace upsert isolation.
- `tower-instances.test.ts`: replaced iter-1/iter-3 source sentinels with one Option A sentinel (`getArchitects(resolvedPath)` call site).
- `tower-terminals.test.ts`: reverted iter-2 SQL assertion to the original.
- `db.test.ts`: updated multi-architect test to cover composite-PK (workspace_path, id).
- `migrate.test.ts`: updated to pass workspacePath.
- `send.test.ts`: send.ts now uses `detectWorkspaceRoot()` (already mocked) rather than `getConfig()` (would have required new mocking).

7. **Docs**: arch.md and agent-farm.md updated to describe the Option A schema directly. Migration history section added to arch.md. The iter-2/iter-3 per-site preservation mechanisms are gone — replaced by "schema-level isolation."

All 1841 agent-farm unit tests pass. Type check clean.

## iter-5: Migration disambiguation via terminal_id

Architect's iter-4 independent CMAP flagged a real migration correctness bug from Codex (HIGH): the v11 backfill used `ts.role_id = a.id LIMIT 1`. For users already hit by the #826 leak, the same architect name appears in MULTIPLE workspaces' `terminal_sessions` rows. `LIMIT 1` picks non-deterministically — ob-refine could be migrated to the wrong workspace silently.

Fix: use `architect.terminal_id` as the primary disambiguator (matches exactly one terminal_session row by UUID); fall back to `role_id` only when terminal_id is NULL or has no matching row.

Changes:
- `db/index.ts` v11 block: `COALESCE(terminal_id-match, role_id-fallback)` for the workspace_path subquery, and `(ts.id = a.terminal_id OR ts.role_id = a.id)` for the WHERE EXISTS guard
- `bugfix-826-migration.test.ts` helper: mirrors production
- 2 new tests:
- Disambiguation: architect 'ob-refine' with terminal_id 't-shannon-ob-refine'; terminal_sessions has both shannon's legitimate row AND manazil's leaked row. After migration, the architect row gets workspace_path='/shannon' (the legitimate match).
- Fallback: architect with NULL terminal_id falls back to role_id when there's no ambiguity.
- `arch.md`: migration history note explains the disambiguation.

All 1843 agent-farm tests pass.

## iter-6: Path canonicalization at the state.ts boundary

Architect's iter-5 independent CMAP caught a Codex REQUEST_CHANGES (HIGH): Tower writes canonical realpaths via `normalizeWorkspacePath`, but CLI callers and `migrateLocalFromJson` pass raw paths. A user accessing a workspace via symlink would create two rows (one canonical, one symlinked) and subsequent lookups would silently miss.

Fix: canonicalize at the data-layer boundary. Every architect accessor in `state.ts` now passes its `workspacePath` argument through `canonicalize()` before any read/write. `migrateLocalFromJson` does the same for its workspacePath parameter.

`canonicalize()` is inlined in state.ts (~10 LOC: `realpathSync` for existing paths, `path.resolve` fallback). Inlining instead of importing from `tower-utils.ts` avoids pulling server-layer imports into the data layer.

Tests:
- 6 new symlink tests in state.test.ts:
- write via symlink + read via canonical → same row
- write via canonical + read via symlink → same row
- removeArchitect via symlink removes canonical row
- loadState via symlink returns canonical row
- distinct workspaces don't collapse
- non-existent paths fall back to path.resolve
- 1 new migrate test: migrateLocalFromJson stores canonical workspace_path when given a symlinked input.

1849 agent-farm unit tests pass.

## Test run notes

- `state.test.ts` — 27/27 pass (including 5 new tests for `getArchitectsForWorkspace`).
- `tower-instances.test.ts` — 52/52 pass (including new source-level sentinel test).
- Pre-existing flaky test files in this worktree (unrelated to fix): `session-manager.test.ts` (shellper binary not built in worktree) and `update.test.ts` (skeleton dir not copied). Neither file is in my diff vs main.
Loading
Loading