CRITICAL: Sibling architects leak across workspaces — launchInstance reconcile reads global state.db.architect without workspace filtering #826

@waleedkadous

Summary

In v3.1.1, opening a workspace other than the one that registered sibling architects causes those siblings to be re-spawned in the new workspace. Sibling architects leak across workspaces.

Reported by a user immediately after installing v3.1.1: after creating the workspace manazil, the bug-backlog and ob-refine architects from shannon appeared in manazil's terminal list with NEW PIDs, meaning they were actually spawned, not just listed.

Repro

  1. Tower is started from /Users/mwk/Development/cluesmith/codev
  2. Shannon adds sibling architects: afx workspace add-architect --name ob-refine (in shannon)
  3. Shannon's state.db.architect row gets written, BUT the row lives in codev/.agent-farm/state.db (the singleton from getDb(), anchored to Tower's CWD)
  4. User opens manazil workspace
  5. launchInstance(manazil) runs
  6. launchInstance iterates state.db.architect (the global codev-side table) and finds ob-refine
  7. launchInstance re-spawns ob-refine as a manazil architect with a new PID
  8. Manazil's /api/state now reports architects: [main, ob-refine, bug-backlog] — all running, all real PTYs, none of them legitimately belonging to manazil
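
The repro steps can be modelled in a few lines. This is only a sketch of the failure mode, not the actual launchInstance code: the in-memory table keeps just two columns of the real architect schema, and reconcile_unscoped is a hypothetical name for the iterate-and-respawn loop.

```python
import sqlite3

# Hypothetical stand-in for the Tower-wide state.db (only two columns kept).
state_db = sqlite3.connect(":memory:")
state_db.execute("CREATE TABLE architect (id TEXT PRIMARY KEY, pid INTEGER NOT NULL)")
state_db.executemany(
    "INSERT INTO architect (id, pid) VALUES (?, ?)",
    [("main", 100), ("ob-refine", 101), ("bug-backlog", 102)],  # made-up PIDs
)


def reconcile_unscoped(db, workspace):
    """Model of the leaky loop: every row in the global table is treated as
    belonging to whichever workspace is currently being launched."""
    rows = db.execute("SELECT id FROM architect ORDER BY id").fetchall()
    return [(workspace, architect_id) for (architect_id,) in rows]


# Opening manazil "re-spawns" every architect, including shannon's siblings.
spawned = reconcile_unscoped(state_db, "manazil")
print(spawned)
```

Nothing in the query mentions the workspace, so the result is the same whichever workspace is opened.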

Diagnosis (verified)

$ sqlite3 /Users/mwk/Development/cluesmith/codev/.agent-farm/state.db "SELECT id FROM architect;"
bug-backlog
main
ob-refine

$ sqlite3 /Users/mwk/Development/cluesmith/shannon/.agent-farm/state.db "SELECT id FROM architect;"
(empty — shannon's siblings live in codev/.agent-farm/state.db because state.db is anchored to Tower's CWD)

$ sqlite3 /Users/mwk/Development/cluesmith/codev/.agent-farm/state.db ".schema architect"
CREATE TABLE IF NOT EXISTS "architect" (
    id TEXT PRIMARY KEY,
    pid INTEGER NOT NULL,
    port INTEGER NOT NULL,
    cmd TEXT NOT NULL,
    started_at TEXT NOT NULL DEFAULT (datetime('now')),
    terminal_id TEXT
);

No workspace_path column. The table is global per Tower daemon, not per workspace.

Meanwhile terminal_sessions (in ~/.agent-farm/global.db) DOES have a workspace_path column:

CREATE TABLE terminal_sessions (
    id TEXT PRIMARY KEY,
    workspace_path TEXT NOT NULL,
    type TEXT NOT NULL CHECK(type IN ('architect', 'builder', 'shell')),
    role_id TEXT,
    ...
)

So the workspace-scoping data exists; it's just not joined into the architect reconcile path.
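
As a rough illustration of how the two tables could be joined for diagnosis, the sketch below uses SQLite's ATTACH DATABASE so that one connection can see both databases. Everything here is an assumption for demonstration: in-memory databases stand in for state.db and global.db, the schemas are trimmed, the /ws/shannon path and the session ids are placeholders, and the "orphan" architect is invented to show a row with no matching session.

```python
import sqlite3

conn = sqlite3.connect(":memory:")                       # plays the role of state.db
conn.execute("ATTACH DATABASE ':memory:' AS global_db")  # plays the role of global.db

conn.execute("CREATE TABLE architect (id TEXT PRIMARY KEY)")
conn.execute("""
    CREATE TABLE global_db.terminal_sessions (
        id TEXT PRIMARY KEY,
        workspace_path TEXT NOT NULL,
        type TEXT NOT NULL CHECK(type IN ('architect', 'builder', 'shell')),
        role_id TEXT
    )
""")

conn.executemany(
    "INSERT INTO architect (id) VALUES (?)",
    [("main",), ("ob-refine",), ("bug-backlog",), ("orphan",)],
)
conn.executemany(
    "INSERT INTO global_db.terminal_sessions VALUES (?, ?, ?, ?)",
    [
        ("t1", "/ws/shannon", "architect", "ob-refine"),
        ("t2", "/ws/shannon", "architect", "bug-backlog"),
        ("t3", "/ws/shannon", "architect", "main"),
        ("t4", "/ws/shannon", "shell", None),
    ],
)

# LEFT JOIN keeps architects that have no session row (workspace_path is None).
owners = conn.execute("""
    SELECT a.id, ts.workspace_path
    FROM architect AS a
    LEFT JOIN global_db.terminal_sessions AS ts
           ON ts.role_id = a.id AND ts.type = 'architect'
    ORDER BY a.id, ts.workspace_path
""").fetchall()

for architect_id, workspace_path in owners:
    print(architect_id, workspace_path)
```

Against the real files, the same query could presumably be run from the sqlite3 shell after attaching ~/.agent-farm/global.db to a session opened on state.db.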

Why this is new in v3.1.1

This issue was flagged by Codex at #786 plan-iter-3 Co1 ("workspace-scoping") and explicitly accepted-as-out-of-scope by the architect because Spec 786 defers cross-workspace concerns:

"Cross-workspace routing. Architects in workspace A cannot address architects in workspace B. Deferred previously; stays deferred."

But the bug ISN'T about routing — it's about persistence leaking architects across workspaces via the launchInstance reconcile loop that #786 added to deliver graceful-stop persistence. Pre-#786, launchInstance only created main; the global table was harmless because nothing iterated it. #786's new iterate-and-respawn loop turns the pre-existing global table into an active cross-workspace leak.

Fix shape

Two options:

Option A (proper, schema migration)

Add workspace_path TEXT NOT NULL column to state.db.architect. Migrate existing rows by joining with terminal_sessions on id = role_id to populate workspace_path. Update setArchitect, setArchitectByName, removeArchitect, loadState, and the launchInstance reconcile to scope by workspace. Most correct but a schema migration with multiple touch points.
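
A minimal sketch of what the Option A backfill might look like, under several assumptions: both tables are placed in one in-memory database for brevity (the real ones live in separate files), the schemas are trimmed, add_workspace_path is a hypothetical helper name, and the paths and the "ghost" row are made up. SQLite's ALTER TABLE ... ADD COLUMN only accepts NOT NULL together with a non-NULL default, so the sketch uses an empty-string default and reports rows it could not backfill.

```python
import sqlite3


def add_workspace_path(conn):
    """Add the column, backfill it from terminal_sessions, and return the
    architect ids that had no matching session row."""
    conn.execute(
        "ALTER TABLE architect ADD COLUMN workspace_path TEXT NOT NULL DEFAULT ''"
    )
    # If an architect has sessions in several workspaces, this picks the
    # alphabetically first path; a real migration would need a better rule.
    conn.execute("""
        UPDATE architect
        SET workspace_path = COALESCE((
            SELECT ts.workspace_path
            FROM terminal_sessions AS ts
            WHERE ts.role_id = architect.id AND ts.type = 'architect'
            ORDER BY ts.workspace_path
            LIMIT 1
        ), '')
    """)
    conn.commit()
    cur = conn.execute("SELECT id FROM architect WHERE workspace_path = '' ORDER BY id")
    return [architect_id for (architect_id,) in cur]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE architect (id TEXT PRIMARY KEY, pid INTEGER NOT NULL)")
conn.execute("""
    CREATE TABLE terminal_sessions (
        id TEXT PRIMARY KEY,
        workspace_path TEXT NOT NULL,
        type TEXT NOT NULL,
        role_id TEXT
    )
""")
conn.executemany(
    "INSERT INTO architect VALUES (?, ?)",
    [("main", 1), ("ob-refine", 2), ("ghost", 3)],
)
conn.executemany(
    "INSERT INTO terminal_sessions VALUES (?, ?, ?, ?)",
    [
        ("t1", "/ws/codev", "architect", "main"),
        ("t2", "/ws/shannon", "architect", "ob-refine"),
    ],
)

unmatched = add_workspace_path(conn)
rows = conn.execute("SELECT id, workspace_path FROM architect ORDER BY id").fetchall()
print(rows, unmatched)
```

One design question the sketch leaves open: with id as the sole primary key, two workspaces cannot each hold an architect with the same id, so a full migration might also need a composite key on (workspace_path, id).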

Option B (quick, uses existing data)

Modify the launchInstance reconcile loop to only re-spawn architects that have a matching terminal_sessions row WITH workspace_path = <this workspace>. The terminal_sessions table already has the workspace scoping; join on terminal_sessions.role_id = architect.id AND terminal_sessions.type = 'architect' AND terminal_sessions.workspace_path = <this workspace>. No schema change; reconcile is correctly scoped.
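
The scoped reconcile could be sketched roughly as follows. This is an illustration under assumptions, not the actual fix: architects_for_workspace is a hypothetical helper, two separate in-memory connections stand in for state.db and global.db, and the workspace paths and session ids are placeholders.

```python
import sqlite3


def architects_for_workspace(state_db, global_db, workspace_path):
    """Return only the architect ids that have an 'architect' terminal
    session recorded for the given workspace."""
    cur = global_db.execute(
        "SELECT role_id FROM terminal_sessions "
        "WHERE type = 'architect' AND workspace_path = ?",
        (workspace_path,),
    )
    owned = {role_id for (role_id,) in cur}
    all_ids = [i for (i,) in state_db.execute("SELECT id FROM architect ORDER BY id")]
    return [i for i in all_ids if i in owned]


state_db = sqlite3.connect(":memory:")
state_db.execute("CREATE TABLE architect (id TEXT PRIMARY KEY)")
state_db.executemany(
    "INSERT INTO architect VALUES (?)",
    [("main",), ("ob-refine",), ("bug-backlog",)],
)

global_db = sqlite3.connect(":memory:")
global_db.execute("""
    CREATE TABLE terminal_sessions (
        id TEXT PRIMARY KEY,
        workspace_path TEXT NOT NULL,
        type TEXT NOT NULL,
        role_id TEXT
    )
""")
global_db.executemany(
    "INSERT INTO terminal_sessions VALUES (?, ?, ?, ?)",
    [
        ("t1", "/ws/shannon", "architect", "main"),
        ("t2", "/ws/shannon", "architect", "ob-refine"),
        ("t3", "/ws/shannon", "architect", "bug-backlog"),
        ("t4", "/ws/manazil", "architect", "main"),
    ],
)

print(architects_for_workspace(state_db, global_db, "/ws/manazil"))
print(architects_for_workspace(state_db, global_db, "/ws/shannon"))
```

One caveat, offered as an inference rather than something verified here: if an architect has already leaked and been spawned in the wrong workspace, it may have acquired a terminal_sessions row for that workspace, in which case this filter alone would keep matching it until those rows are cleaned up.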

Recommend Option B for v3.1.2 (urgent hotfix). Option A is the right long-term fix and can ship as part of a follow-up architectural cleanup.

Severity

Critical. v3.1.1 just shipped (~30 min ago). Anyone with multiple workspaces and a non-main sibling architect anywhere will see leaks the moment they open a different workspace. The leaked architects are real running processes consuming resources (~50MB+ RAM per claude session, plus their claude --dangerously-skip-permissions shell harnesses) AND they could interfere with the actual workspace's workflow if the user accidentally talks to them.

Workaround until fix lands

Don't open additional workspaces in Tower if you have non-main siblings registered anywhere. Alternatively, kill the leaked terminals via the dashboard sidebar X button, which kills the PTY but leaves the state.db row intact for the legitimate workspace.

Cleanup of leaked architects

Running afx workspace remove-architect <name> directly from the workspace an architect leaked into would also delete the shared state.db row, which corrupts the legitimate owner workspace's state. Per-PTY kill via the dashboard is the safer manual cleanup until the proper fix lands.
