Skip to content

fix(diagnose): realpath bin path + filter Next.js SSE noise#4

Merged
chorus-codes merged 1 commit into
mainfrom
fix/diagnose-polish
May 8, 2026
Merged

fix(diagnose): realpath bin path + filter Next.js SSE noise#4
chorus-codes merged 1 commit into
mainfrom
fix/diagnose-polish

Conversation

@chorus-codes
Copy link
Copy Markdown
Owner

Summary

Two cosmetic improvements to chorus diagnose output, surfaced from running it against a real v0.8.27 install:

  1. install mode: unknown for global npm installsprocess.argv[1] returns the /usr/bin/chorus symlink, not the resolved node_modules/.../chorus.mjs. Wrap with realpathSync before classifying.

  2. web.log was leaking Next.js 16 SSE pipe-close traces — looked scary in bug reports but is benign client-disconnect behaviour. filterBenignNoise drops the trace block (full and orphan-tail variants); footer counts what was filtered so nothing is silently dropped.

Test plan

  • pnpm typecheck clean
  • pnpm test 654/654 (5 new in tests/diagnose.test.ts)
  • Live chorus diagnose smoke — web.log now shows real Next.js startup messages without the trace blocks; "(N benign SSE-disconnect traces filtered)" footer added

Holding for the next batched npm publish.

Two cosmetic improvements to chorus diagnose output that emerged from
running it against a real install (v0.8.27 on /usr/bin/chorus).

(1) install mode showed "unknown" for global npm installs because
    process.argv[1] returns the symlink path (/usr/bin/chorus), not
    the resolved target inside node_modules. Wrap with realpathSync
    before classification — falls back to raw path on broken symlink
    or "(unknown)" sentinel.

(2) web.log tail was leaking Next.js 16's SSE pipe-close trace into
    the bug report — `Error: failed to pipe response` followed by a
    UND_ERR_SOCKET cause block. Looks like a real error but it's
    expected behaviour when a browser tab closes mid-stream.
    filterBenignNoise() drops the multi-line trace block (matches
    full traces and orphan tails when the window starts mid-trace).
    Footer reports the filter count so nothing is silently dropped.
    Raw read window bumped 60 → 300 lines so most traces are fully
    captured before trimming back to 20 surfaced lines.

Both helpers are pure and exported via _testing for direct
assertions. 654/654 tests green; pnpm typecheck clean.

Holding for publish — accumulating with other fixes for the next
chorus-codes npm release.
@chorus-codes chorus-codes merged commit 4a5ea20 into main May 8, 2026
2 checks passed
@chorus-codes chorus-codes deleted the fix/diagnose-polish branch May 8, 2026 06:21
crypticpy referenced this pull request in crypticpy/chorus May 17, 2026
* feat(cli): add diagnose command + crash-hook

Bundles two upstream changes that ship a self-service triage path for
chorus users hitting opaque failures:

- `chorus diagnose` walks the install, daemon, recent failed chats,
  voice health, and produces a sharable bug report.
- Crash hook captures uncaught exceptions in the CLI and writes them
  to a crash log alongside instructions to attach during a bug report.

Folded back from upstream chorus-codes/chorus:
  7ea712b feat: chorus diagnose command + crash hook for bug reports (#1)
  4a5ea20 fix(diagnose): realpath bin path + filter Next.js SSE noise (#4)

Co-Authored-By: chorus-codes <info@chorus.codes>

* feat(cli): add quickstart self-test command

`chorus quickstart` runs a 30-second activation flow that verifies
the daemon comes up, the SQLite DB initializes, and a minimal chat
round-trips end-to-end. Aimed at first-run users who want to know
"is this thing actually working" before authoring a template.

Folded back from upstream chorus-codes/chorus:
  56610cf feat(cli): chorus quickstart — 30-second activation self-test (chorus-codes#30)

Co-Authored-By: chorus-codes <info@chorus.codes>

* fix(cli): use dynamic import for open package (Node 22 ERR_REQUIRE_ESM)

The `open` package and `chokidar` are both ESM-only as of recent
versions. On Node 22 (the daily-driver target) static `require()`
calls into them throw ERR_REQUIRE_ESM and crash the CLI at boot.

Switch to dynamic import in:
- src/cli/commands/start.ts (open browser after boot)
- src/cli/open-browser.ts (new helper)
- src/cli/index.ts (route open import)
- src/daemon/output-watcher.ts (chokidar file watch)

Includes upstream's post-merge hardening: the setTimeout that triggers
the browser-open no longer wraps an async callback bare, so a missing
default browser doesn't surface as an unhandled rejection.

Folded back from upstream chorus-codes/chorus:
  e8ca2ee fix(cli): dynamic import for open package (chorus-codes#14)
  dcd1837 fix: post-merge hardening for chorus-codes#14 (start.ts portion only;
          cli-precheck.test.ts portion ships with the Keychain fix)

Co-Authored-By: Julien Deudon <deudon.j@gmail.com>
Co-Authored-By: chorus-codes <info@chorus.codes>

* feat(cockpit): seed empty round-1 so QUEUED renders from t=0

Before: when a chat starts but no reviewer has produced an event yet,
enrichRounds returned an empty rounds array and the live-run page
showed nothing for several seconds — the user couldn't tell whether
their chat had launched.

After: seed a synthetic round-1 with QUEUED placeholders for every
expected participant so the page renders the per-reviewer cards
immediately. Real events overwrite placeholders as they arrive.

Folded back from upstream chorus-codes/chorus:
  53e8fb6 feat(cockpit): seed empty round-1 so QUEUED placeholders
          render from t=0 (#2)

Co-Authored-By: chorus-codes <info@chorus.codes>

* feat(daemon): runtime fallback-collision dedup across reviewer slots

When two reviewer slots both fall through their per-slot chains to the
same template-level fallback target (common case: every slot ends in
anthropic/claude-sonnet-4-6), both used to dispatch the same (lineage,
model) in parallel — wasted cost and the lineage diversity that's the
point of multi-LLM peer review collapsed.

Build-time dedup (template-fallback.ts) couldn't catch it because each
slot only knows about other slots' PRIMARIES, not their fallback chains.

Fix: new per-chat/per-round (lineage, model) registry. reviewer-driver
tryClaim's before each chain attempt and releases in a finally. On
collision, return null + emit cli_warning(reason='fallback_collision')
so runWithChainFallback advances to the next entry and the cockpit can
show why the slot skipped.

Ported into fork's reviewer-driver.ts surgically so the verdict-isolation
refactor (2a2cde2) and per-slot repoPath threading stay intact.

Folded back from upstream chorus-codes/chorus:
  c4751fe feat(daemon): runtime fallback-collision dedup (#3)

Co-Authored-By: chorus-codes <info@chorus.codes>

* fix(daemon): write REVIEWER FAILED summary on pre-spawn failure

Before: when a reviewer's precheck fails (e.g. underlying CLI not
installed) or the chat is cancelled while the slot is queued for a
CLI semaphore slot, runReviewer used to return null silently —
leaving NO on-disk participant directory. The cockpit's enrich-rounds
loop then couldn't reconcile the synthesised template slot against
any real participant, so the card sat at "Queued — waiting for an
open slot." forever and the actual error was invisible.

Reproduction: install chorus on a host with only one CLI on PATH
(e.g. just claude-code), open a template that includes lineages
requiring codex/gemini/kimi, fire it. Every reviewer card stayed
"Queued" — chat never visibly progressed even though it was already
done failing.

Fix:
- Create the reviewer dir BEFORE the precheck runs.
- Add a writePreSpawnFailure helper that writes a `## REVIEWER FAILED`
  summary in the canonical format (Kind / Lineage / Model / message)
  that the cockpit's `parseFailureSummary` already understands.
- Wire it into the precheck-failed and cancelled-while-queued paths.

Card now transitions out of pending and shows the actual error
(cli_missing, cancelled, ...).

Folded back from upstream chorus-codes/chorus:
  afc59cc fix(daemon): REVIEWER FAILED summary on pre-spawn failure (chorus-codes#26)

Co-Authored-By: chorus-codes <info@chorus.codes>

* feat(voices): auto-disable on persistent quota_exhausted + lsof timeout

Real pain (upstream chorus-codes#11): a Pro Gemini model on a Flash-only account
fails every chorus run with "exhausted your capacity on this model"
— but Gemini doesn't return a resetAt because the model isn't going
to become available for that account. Without auto-disable, the
runner keeps picking the dead voice on every chat and the user keeps
seeing the same opaque error.

Voice auto-disable:
- New src/lib/voice-failure-tracker.ts records per-voice consecutive
  quota_exhausted strikes in a settings counter.
- Trigger: 2 consecutive strikes WITH no resetAt → set
  voices.enabled=false + disabled_reason='auto_quota'.
- Counter resets on participant_done success; rate-limit strikes
  (hasResetAt=true) bypass the counter entirely so a transient
  429 + a later permanent failure can't trip the threshold on the
  first permanent strike.
- Wired into reviewer-driver alongside recordHealth; emits a
  cli_warning(reason='voice_auto_disabled') so the cockpit can show
  a one-line explanation.
- VoiceDisabledReason union gains 'auto_quota' (schema column was
  already TEXT — no migration).

Lsof timeout (upstream chorus-codes#12):
- findPidsOnPort and findPidsOnPortWithSudo now bound execSync /
  execFileSync to 3s, so a slow-but-functional lsof on a loaded
  macOS box doesn't hang chorus boot. 3s leaves headroom while
  still bounding the hang case.

Ported into fork's reviewer-driver.ts tmux pollHandle + success
path. voices.ts disabled_reason union extended alongside fork's
voice-tier column.

Folded back from upstream chorus-codes/chorus:
  4f6becc v0.8.30 — voice auto-disable (chorus-codes#11) + lsof timeout (chorus-codes#12) (chorus-codes#17)

Co-Authored-By: chorus-codes <info@chorus.codes>
Co-Authored-By: Lumina Mao <luminamao@mac.lan>

* fix(daemon, schema): codex isolation + template-schema validation

Two issues caused chats to fail opaquely at run-start:

CODEX ISOLATION (chorus-codes#10, chorus-codes#16)
The user's ~/.codex/config.toml may declare MCP servers, plugins, or
notification hooks. In headless `codex exec` those integrations have
caused codex to hang or cancel mid-call — two independent
reproductions: codex as our reviewer (chorus-codes#10) and codex as MCP client of
chorus (chorus-codes#16). Add --ignore-user-config to every headless codex argv.
Extracted to a pure `buildHeadlessArgs(opts)` so the argv shape is
unit-testable.

TEMPLATE VALIDATION (chorus-codes#15)
`reviewer.require > candidates.length` used to surface as "Job moves
immediately to failure upon Start press" — the runner queued, failed
to grant enough slots, and emitted an opaque chat-failure. Same for
`require > distinct lineages` when crossLineage:true. Both now
caught at TemplateSchema.parse() time with a clear error message
the user can fix before the run starts.

ReviewerSchema.superRefine() additions slot in cleanly alongside the
fork's audit/orchestrate phase schema work — both are additive
constraints on the same ReviewerSchema object.

Folded back from upstream chorus-codes/chorus:
  8ed970b fix(daemon, schema): codex isolation + template validation

Co-Authored-By: chorus-codes <info@chorus.codes>

* fix(runner): honour iterate.onDisagreement accept-doer/escalate

The template schema, cockpit dialog, and SPEC-D-templates have always
exposed three values for iterate.onDisagreement — 'continue', 'escalate',
'accept-doer' — but the runner only honoured 'continue'. Picking the
other two from the cockpit form was a silent no-op: chats fell through
to phase_failed with 'doer_failed_all_rounds' regardless.

This wires both new branches into the round loop and the terminal
chat_done emission:

- 'accept-doer': after maxRounds without consensus, mark doerSucceeded
  and continue. The chat carries on (subsequent phases, ship, approval)
  as if reviewers had agreed on the doer's last answer.
- 'escalate': halt with status='failed' but verdict='request_changes'
  and error='escalated_on_disagreement', so cockpits can render
  "reviewers disagreed, needs human" distinctly from "doer broke."

Policy table extracted into a pure decidePhaseOutcome() helper so the
3 × 2 input matrix (policy × disagreement-in-last-round) is unit-tested
without standing up the full runChat scaffold.

Gated on disagreementInLastRound (reset at top of every round + on
doer-crash path) so a partial / empty doer answer can never be silently
"accept-doer"'d as final. Preserves the fork's existing
standardPhaseRoundsExhausted #7 surfacing for the 'continue' path; the
'escalate' path takes precedence with its own distinct chat_done.

Upstream PRs chorus-codes#49, chorus-codes#50 (commit 67572e9).

Co-Authored-By: chorus-codes <280607145+chorus-codes@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(cli-precheck): cover macOS Keychain fallback for Claude Code v2

The fork already implements the Keychain fallback in cli-precheck
(hasDarwinKeychainEntry). This adds the missing test coverage:

- passes when no cred file but keychain entry exists
- blocks when no cred file and no keychain entry
- skips keychain check when cred file exists (fast-path preserved)
- does not consult keychain for non-anthropic lineages

vi.mock('node:child_process') uses the importOriginal spread pattern so
spawn / exec / etc. keep their real implementations — a bare module
replacement would silently break any sibling test that imports from
child_process.

Upstream PRs #7, #8, plus the dcd1837 test-mock hardening.

Co-Authored-By: Yura <yurahalych@gmail.com>
Co-Authored-By: chorus-codes <280607145+chorus-codes@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(cockpit): derive candidatesWithModels from snapshot's candidates field

Daemon-side TemplateSchema only carries `candidates` on each ReviewerRule.
The cockpit Template type expects `candidatesWithModels` populated —
enrich-rounds iterates that field to build slot→model mappings for
run-page cards. When fromRow parsed template_snapshot and cast it to
Template, the cast was a TypeScript lie: at runtime the parsed object
lacked candidatesWithModels, enrichRounds iterated zero reviewer slots,
and no model name reached the cards (badge appeared empty).

Derive candidatesWithModels at the parse seam (chats.fromRow) so the
cockpit's Template contract is honoured regardless of which path
produced the data. Idempotent — if a future daemon ever serialises
the field directly, that wins. Persona forwarded if present. Audit-
phase single-voice reviewers (no candidates array) are skipped via a
runtime narrow.

Upstream PR #6 (chorus-codes/chorus@ac0c7fd).

Co-Authored-By: chorus-codes <280607145+chorus-codes@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(diagnose): capture failure context — CLI smoke, voice health, recent failed chats

Extends `chorus diagnose` with three signals that triage the most common
breakage modes:

- **CLI smoke**: spawn `<bin> --version` per detected CLI with a hard 2s
  SIGKILL timeout (wrapper scripts may trap SIGTERM). Distinguishes
  `timedOut` from non-zero exit so the report can tell hangs apart from
  crashes.
- **Voice health**: counts `enabled=0` voices grouped by `disabled_reason`
  ('user' vs 'auto_missing' vs 'quota_exhausted'). Added
  `idx_voices_enabled` so the `WHERE enabled = 0` scan stays cheap as
  the table grows.
- **Recent failed chats**: last 5 chats with `status='blocked'` plus the
  errored participants pulled from `~/.chorus/chats/<id>/round-*/<part>/_attempts.jsonl`.
  Only `errorMessageBytes` is exposed — raw error text never leaves the
  user's machine. `$HOME` is redacted from any embedded path strings via
  `redactHomePaths`.

Adapted from upstream chorus-codes#19 (0666dca). Preserves the
fork's existing diagnose shape and adds tests for smokeOneCli /
readLatestAttempt / formatReport rendering of the three new sections.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(diagnose): include no_review in recent failed chats query

The recent-failed-chats section was meant to surface per-participant
failure context from `_attempts.jsonl`, but the WHERE clause only
covered 'failed', 'blocked', 'cancelled'. The most common failure
shape — every reviewer down for missing CLI / auth / quota — ends the
chat in 'no_review', which was being silently filtered out. So the
exact case the section exists to diagnose returned an empty list,
forcing users back into manual log collection.

Adds 'no_review' to the IN-list and a regression test that asserts
both the status and a quota_exhausted errorKind render in the report.

Addresses chatgpt-codex review P2.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: chorus-codes <info@chorus.codes>
Co-authored-by: Julien Deudon <deudon.j@gmail.com>
Co-authored-by: Lumina Mao <luminamao@mac.lan>
Co-authored-by: chorus-codes <280607145+chorus-codes@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Yura <yurahalych@gmail.com>
chorus-codes added a commit that referenced this pull request May 20, 2026
All six reviewers requested changes. The two HIGH findings were real
correctness bugs that would have allowed the exact failure mode the
gate was built to prevent. Addressed:

HIGH (4-6/6 convergent):
- swap=0 sentinel ambiguity. Prior revision used 0 as "platform
  reports no swap data". On Linux with SwapFree genuinely 0 kB
  (incident-2026-05-20 case!) the gate would admit chats at the
  moment of maximum vulnerability. Fixed: readSwapFreeMb returns -1
  for "no data"; evaluateAdmission treats >=0 as a real number,
  including 0 as "swap exhausted, block".
- tryGrantHead error path. If getChatConcurrency / readResourceStats
  throws, prior catch just logged. Queue would hang until an external
  event (release / abort / new admit) re-poked the gate. Fixed:
  schedule a recheck timer from the catch so the gate retries in 30s.
- pokeGate export + PUT-route hook. Settings change (e.g. raise
  maxConcurrentChats 3 → 5) had no path to re-admit queued waiters
  until an active chat ended. Fixed: PUT /settings/chat-concurrency
  calls pokeGate() after persisting, immediately admitting queued
  chats under the new (loosened) cap.

MEDIUM:
- Double-settle protection on Waiter. settled flag prevents abort
  signal firing AFTER tryGrantHead resolved from rejecting an already-
  settled promise. Belt-and-suspenders against future Promise impl
  changes.
- AdmitDenyReason 'unknown' fallback eliminated. decision.reason is
  always populated when admit=false; the prior `?? 'unknown'` would
  emit a string not in the type union. Now falls back to
  'chats_at_cap' (the only invariant-protected default).
- Lucide Minus/Plus icons replace Unicode -/+ per project rule #25.
- PUT response merge preserves defaults block — UI hint lines no
  longer vanish after first save (until next 5s poll re-loads).

New tests (+3):
- evaluateAdmission with cpuCount=0 — divide-by-zero guard works
- evaluateAdmission with swap=-1 (sentinel) → admit
- evaluateAdmission with swap=0 (exhausted) → BLOCK (the incident case)
- pokeGate-after-setChatConcurrency admits queued chats immediately

Reviewer attribution (6 reviewers, all request_changes):
- codex-cli-0: tryGrantHead error path, UI defaults overwrite, test gaps
- gemini-cli-1: swap=0 sentinel (high), getChatConcurrency throw stall (high),
                settings-change doesn't wake queue (medium)
- opencode-cli-2: TOCTOU race in admitChat signal-aborted check (analyzed —
                  actually safe because addEventListener runs same tick),
                  swap=0 sentinel, missing resource-block integration tests
- opencode-cli-4: cpuCount=0 untested, recheck timer untested, double-reject
                  risk (settled flag), Unicode icons vs rule #25
- opencode-cli-5: orphaned waiter on tryGrantHead exception, reason 'unknown'
                  not in union, missing tests for reentrant scenarios
- opencode-cli-6: tryGrantHead silent error strand (HIGH — same as
                  codex/gemini), settings change doesn't poke queue (medium)

Not changed (deliberate):
- Snapshot stats once per loop iteration (cli-6 #4): sub-ms cost,
  current loop invariant (no awaits between reads/decisions) keeps
  the values consistent. Worth noting but not a real bug.
- Circuit breaker for consecutive tryGrantHead failures (cli-2 #5):
  acceptable retry loop for now; will add if real-world disk blips
  cause notable log spam.

Tests: 910/911 green (was 907; +3 new). Typecheck + build clean.
chorus-codes added a commit that referenced this pull request May 20, 2026
… aware (#64)

* feat(daemon): chat admission gate — caps concurrent chats + swap/load aware

Caused by the 2026-05-20 incident: three concurrent chats × 8 reviewers
each (24 LLM subprocesses) overwhelmed the 31 GB dev box. Load avg hit
320 (15-min), swap exhausted (24 MB free of 8.5 GB), system became
unresponsive. Per-CLI semaphore (cli-semaphore.ts) caps subprocesses
within each binary family but doesn't prevent N chats × M reviewers from
oversubscribing the host.

New higher-level gate at runner-entry — every `runChat` call awaits
`admitChat(...)` before fanning out. Three constraints, all daemon-
wide, all configurable in /settings:

  - maxConcurrentChats (1..20, default 3): hard cap on chats actively
    fanning out reviewers. Whichever chat hits the gate fourth queues.
  - swapMinFreeMb (0..16384, default 1024): refuse new admissions when
    free swap drops below threshold. 0 disables (CI / macOS without
    /proc/meminfo / containers).
  - loadAvgMaxPerCore (0..10, default 4.0): refuse when 1-min load
    avg / CPU count exceeds. Per-core multiplier so a 4-core at 16
    looks identical to a 16-core at 64. 0 disables.

Gated chats stay in the DB as status='drafting'. The runner emits a
new SSE event `chat_queued` with the binding reason + 1-indexed queue
position so the cockpit can render "Waiting for slot — N chats ahead"
without polling.

Architecture mirrors cli-semaphore.ts: FIFO queue, mutex+dirty pattern
to prevent reentrant double-grant under release races, settings re-
read per attempt so changes apply on the next admission (no daemon
restart). Resource recheck timer (30s) fires while the head is blocked
on swap/load — chat_end events drain chats_at_cap; periodic recheck
drains the resource-pressure cases.

Files:

  src/daemon/chat-gate.ts         — gate + evaluateAdmission (pure)
                                    + admitChat (async, FIFO, signal)
                                    + snapshot for diagnostics
  src/daemon/resource-stats.ts    — readResourceStats: swap from
                                    /proc/meminfo, loadavg from os
  src/lib/settings/chat-concurrency.ts
                                  — Zod schema + get/setChatConcurrency
  src/daemon/runner.ts            — await admitChat at top, release
                                    in finally; cancellation handled
  src/daemon/runner/types.ts      — new 'chat_queued' RunnerEvent type
  src/daemon/routes/settings.ts   — GET/PUT /settings/chat-concurrency
                                    response includes live snapshot
                                    (active/queued/swap/load) so the UI
                                    can show "currently 2/3 chats…"
  src/lib/api/settings.ts         — ChatConcurrencySettings + client
  src/app/settings/chat-concurrency-section.tsx
                                  — UI section with three stepper
                                    inputs + live snapshot block (5s
                                    auto-refresh)
  src/app/settings/form-view.tsx  — wire in below the existing
                                    ConcurrencySection
  tests/chat-gate.test.ts         — 15 cases: pure evaluateAdmission
                                    matrix (constraint priority,
                                    per-core scaling, disable knobs)
                                    + FIFO ordering + cancellation +
                                    release idempotence

Tests: 907/908 green (was 892 baseline; +15 new). Typecheck + build clean.

The 2026-05-20 incident pattern (3 chats × 8 reviewers) will now queue
the third chat at the chats_at_cap gate (default 3). Users on tight
boxes can tighten to 1-2; users on big boxes can loosen to 10+ and rely
on the swap+load guardrails instead.

* fix(chat-gate): close 6/6 convergent self-review findings on PR #64

All six reviewers requested changes. The two HIGH findings were real
correctness bugs that would have allowed the exact failure mode the
gate was built to prevent. Addressed:

HIGH (4-6/6 convergent):
- swap=0 sentinel ambiguity. Prior revision used 0 as "platform
  reports no swap data". On Linux with SwapFree genuinely 0 kB
  (incident-2026-05-20 case!) the gate would admit chats at the
  moment of maximum vulnerability. Fixed: readSwapFreeMb returns -1
  for "no data"; evaluateAdmission treats >=0 as a real number,
  including 0 as "swap exhausted, block".
- tryGrantHead error path. If getChatConcurrency / readResourceStats
  throws, prior catch just logged. Queue would hang until an external
  event (release / abort / new admit) re-poked the gate. Fixed:
  schedule a recheck timer from the catch so the gate retries in 30s.
- pokeGate export + PUT-route hook. Settings change (e.g. raise
  maxConcurrentChats 3 → 5) had no path to re-admit queued waiters
  until an active chat ended. Fixed: PUT /settings/chat-concurrency
  calls pokeGate() after persisting, immediately admitting queued
  chats under the new (loosened) cap.

MEDIUM:
- Double-settle protection on Waiter. settled flag prevents abort
  signal firing AFTER tryGrantHead resolved from rejecting an already-
  settled promise. Belt-and-suspenders against future Promise impl
  changes.
- AdmitDenyReason 'unknown' fallback eliminated. decision.reason is
  always populated when admit=false; the prior `?? 'unknown'` would
  emit a string not in the type union. Now falls back to
  'chats_at_cap' (the only invariant-protected default).
- Lucide Minus/Plus icons replace Unicode -/+ per project rule #25.
- PUT response merge preserves defaults block — UI hint lines no
  longer vanish after first save (until next 5s poll re-loads).

New tests (+3):
- evaluateAdmission with cpuCount=0 — divide-by-zero guard works
- evaluateAdmission with swap=-1 (sentinel) → admit
- evaluateAdmission with swap=0 (exhausted) → BLOCK (the incident case)
- pokeGate-after-setChatConcurrency admits queued chats immediately

Reviewer attribution (6 reviewers, all request_changes):
- codex-cli-0: tryGrantHead error path, UI defaults overwrite, test gaps
- gemini-cli-1: swap=0 sentinel (high), getChatConcurrency throw stall (high),
                settings-change doesn't wake queue (medium)
- opencode-cli-2: TOCTOU race in admitChat signal-aborted check (analyzed —
                  actually safe because addEventListener runs same tick),
                  swap=0 sentinel, missing resource-block integration tests
- opencode-cli-4: cpuCount=0 untested, recheck timer untested, double-reject
                  risk (settled flag), Unicode icons vs rule #25
- opencode-cli-5: orphaned waiter on tryGrantHead exception, reason 'unknown'
                  not in union, missing tests for reentrant scenarios
- opencode-cli-6: tryGrantHead silent error strand (HIGH — same as
                  codex/gemini), settings change doesn't poke queue (medium)

Not changed (deliberate):
- Snapshot stats once per loop iteration (cli-6 #4): sub-ms cost,
  current loop invariant (no awaits between reads/decisions) keeps
  the values consistent. Worth noting but not a real bug.
- Circuit breaker for consecutive tryGrantHead failures (cli-2 #5):
  acceptable retry loop for now; will add if real-world disk blips
  cause notable log spam.

Tests: 910/911 green (was 907; +3 new). Typecheck + build clean.

---------

Co-authored-by: chorus-codes <280607145+chorus-codes@users.noreply.github.com>
chorus-codes added a commit that referenced this pull request May 20, 2026
PR #75 audit findings:

- HIGH (opencode-cli-2 #1, introduced by this PR): `usePathname()` can
  return `null` during the brief client-render window before the App
  Router has hydrated. The `pathname.startsWith(...)` in
  AppShellRouter would then crash the entire shell with a TypeError on
  first paint. Coalesce to "" — falls through to the wrapped
  (sidebar-present) branch by default.

- MEDIUM drive-by (opencode-cli-2 #4): the
  `err instanceof DaemonError ? "error" : "error"` ternary in
  app-sidebar.tsx had both branches equal — dead code that looked
  unfinished. Collapse to a single set, drop the now-unused
  DaemonError import.

Other PR #75 audit findings deferred to followup:
- Slot matching defaults non-conforming names to index 0 (pre-existing)
- Concurrent fetchChats race (pre-existing)
- Redundant initial fetch / poll race (pre-existing perf)

Tests: 942 pass.
chorus-codes added a commit that referenced this pull request May 20, 2026
…75)

* fix(ux): persist sidebar across navigation + suppress doer-artifact phantom card

Two Victor reports from the v0.8.50 cockpit:

1. Sidebar shows "Loading…" for several seconds after clicking any past
   run. Root cause: every page wraps itself in `<AppShell>` — so each
   route navigation unmounts the entire shell (sidebar included) and
   mounts a fresh one. `useState("loading")` re-initialises, the SSE
   subscription tears down, and the chat list refetches from scratch.

   Fix: move AppShell into the root layout via a tiny client-side
   router (`AppShellRouter`) that opts out for /onboarding (the only
   standalone flow). Strip the per-page `<AppShell>` wrappers. Sidebar
   is now a single long-lived instance — `chatsState` survives,
   chats stay visible, SSE stays connected across nav.

   Also dropped the unnecessary `[pathname]` dep on the sidebar's
   chats-fetch effect. Active-route highlighting reads `pathname`
   during render, not from the effect. The dep was forcing a full
   SSE teardown + rebuild on every nav even after the AppShell fix
   above kept the component instance alive.

2. Transient "extra card" — a DONE card showing the audit prompt body
   appears alongside the reviewer grid, then disappears a few seconds
   later. Root cause: review-only chats create a synthetic
   `doer-artifact` dir whose answer.md holds the input artifact (the
   audit prompt) with a `## DONE` sentinel. The artifacts route
   surfaces it as a participant; enrich-rounds' leftover loop appends
   it as a card because no doer slot exists in review-only templates.

   Fix: in review-only chats, suppress role='doer' participants from
   the cards grid. The artifact IS the prompt being reviewed — not a
   participant. Reviewer leftovers (a dir whose lineage isn't in the
   template's candidate list) still get a defensive card so users see
   "something ran here".

Tests: +2 enrich-rounds cases (filter doer-artifact, preserve reviewer
leftover). 938 pass.

* fix(cockpit): guard against usePathname() null + drop dead error ternary

PR #75 audit findings:

- HIGH (opencode-cli-2 #1, introduced by this PR): `usePathname()` can
  return `null` during the brief client-render window before the App
  Router has hydrated. The `pathname.startsWith(...)` in
  AppShellRouter would then crash the entire shell with a TypeError on
  first paint. Coalesce to "" — falls through to the wrapped
  (sidebar-present) branch by default.

- MEDIUM drive-by (opencode-cli-2 #4): the
  `err instanceof DaemonError ? "error" : "error"` ternary in
  app-sidebar.tsx had both branches equal — dead code that looked
  unfinished. Collapse to a single set, drop the now-unused
  DaemonError import.

Other PR #75 audit findings deferred to followup:
- Slot matching defaults non-conforming names to index 0 (pre-existing)
- Concurrent fetchChats race (pre-existing)
- Redundant initial fetch / poll race (pre-existing perf)

Tests: 942 pass.

---------

Co-authored-by: chorus-codes <280607145+chorus-codes@users.noreply.github.com>
chorus-codes added a commit that referenced this pull request May 20, 2026
Pickup from chorus audit of PR #77 (antigravity-cli-8 finding #4):
when pickShimForVoice for a cross-lineage fallback target throws AFTER
tryClaimFallbackTarget succeeds but BEFORE the try block, `threw` never
flips and releaseFallbackClaim never fires. The target stays claimed
for the rest of the round, blocking every other slot that might have
reached it via a fresh shim.

Move the entryShim resolution inside the try block so any resolution
error releases the claim along the existing throw path.
chorus-codes added a commit that referenced this pull request May 20, 2026
)

* fix(runner): hold fallback claim regardless of attempt outcome

Three reviewer cards on chat=019E45413E126AFCD83146524A22BFC4 all
swapped to anthropic/claude-sonnet-4-6 in sequence — codex (auth race
with VSCode), gemini (quota), opencode/qwen3.6-plus (variant of fast-
fail). Diversity collapse, the whole point of multi-LLM peer review
defeated.

Cause: when slot A's claude attempt returned null, the finally block
released the claim. Slot B arrived 36s later, claimed claude itself,
also failed, released. Slot C 43s later, same story. The runtime
registry was working but the policy was wrong — releasing on null
meant a persistently-broken fallback got tried by every slot in
sequence, three reviewers all running the same model.

New policy: claim is sticky for the round on ANY result that isn't an
exception. First slot to reach a shared fallback target wins it.
Subsequent slots see false from tryClaim and advance to the NEXT
chain entry (fallback_collision warning on those cards, no duplicate
run). If the first slot's attempt fails, the other slots simply have
no backup — that's the honest signal that the template's one shared
fallback didn't cover all primary failures. Fix at the template
layer by configuring multiple fallbacks (one per likely-failing
lineage).

Throw still releases — an exception means something went wrong
outside the model call (shim resolution, fs write, etc.) and other
slots should get a chance to try with a fresh shim.

Tests: +3 cases documenting the policy in fallback-registry tests.
945 pass.

* fix(runner): release claim if shim resolution throws after tryClaim

Pickup from chorus audit of PR #77 (antigravity-cli-8 finding #4):
when pickShimForVoice for a cross-lineage fallback target throws AFTER
tryClaimFallbackTarget succeeds but BEFORE the try block, `threw` never
flips and releaseFallbackClaim never fires. The target stays claimed
for the rest of the round, blocking every other slot that might have
reached it via a fresh shim.

Move the entryShim resolution inside the try block so any resolution
error releases the claim along the existing throw path.

---------

Co-authored-by: chorus-codes <280607145+chorus-codes@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant