fix: audit followup bundle — runOne crash, doer claim, tmux abort, proc-group kill, slot-idx, fetchChats race by chorus-codes · Pull Request #83 · chorus-codes/chorus

chorus-codes · 2026-05-21T06:43:56Z

Six pre-existing bugs flagged by the PR #74/#77/#79 chorus audits, bundled into one PR per Victor's request.

What's in

#	Bug	File	Risk class
1	runOne catch swallowed exceptions silently → cockpit slot stuck	reviewer-driver.ts	High debuggability
2	Doer had no tryClaim guard → doer + reviewer could run same fallback target	doer-driver.ts + runner.ts	Diversity / cost
3	tmux 6s + 500ms sleeps ignored abortSignal	doer-driver.ts + reviewer-driver.ts	UX (cancel latency)
4	SIGTERM/SIGKILL only hit head process; subproc descendants orphaned	headless.ts	Resource leak
5	parseInt(match ?? "0", 10) hijacked slot 0 for non-conforming names	enrich-rounds.ts	UI correctness
6	Concurrent fetchChats race; slow reply stomped fresh state	app-sidebar.tsx	UX (flicker)
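Row 3 concerns sleeps that ignore cancellation. The PR names an `abortableSleep` helper but does not show it, so the following is only a minimal sketch of what such a helper could look like. The signature and the choice to reject on abort (rather than resolve early) are assumptions.

```typescript
// Hypothetical helper: the name comes from the PR, the signature is assumed.
// Resolves after `ms`, or rejects as soon as `signal` aborts.
function abortableSleep(ms: number, signal?: AbortSignal): Promise<void> {
  return new Promise<void>((resolve, reject) => {
    if (signal?.aborted) {
      reject(new Error("aborted"));
      return;
    }
    const onAbort = (): void => {
      clearTimeout(timer); // do not keep the event loop alive for the full delay
      reject(new Error("aborted"));
    };
    const timer = setTimeout(() => {
      signal?.removeEventListener("abort", onAbort);
      resolve();
    }, ms);
    signal?.addEventListener("abort", onAbort, { once: true });
  });
}

// Usage: a 6s cold-start wait that gives up as soon as the chat is cancelled.
async function waitForColdStart(signal: AbortSignal): Promise<"ready" | "cancelled"> {
  try {
    await abortableSleep(6000, signal);
    return "ready";
  } catch {
    return "cancelled";
  }
}
```

With this shape, a caller that aborts 10ms into the wait gets control back almost immediately instead of after the full 6s.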

Why one PR

Victor explicitly asked for one bundled PR. Each fix is independently testable; combining them just shares one audit pass + one bump.

Tests

972 → 973 passing (+1 new for slot-idx parser).
Existing tests for retries / claims / cli-precheck unchanged.
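The slot-idx fix replaces a default-to-0 parse with an explicit "no match" result. The actual enrich-rounds.ts code is not shown here, so the sketch below only illustrates the idea; `parseSlotIndex` and `matchesSlot` are hypothetical names, and the `-<digits>` suffix format is taken from the bug description.

```typescript
// Hypothetical helper: returns the trailing slot index of a participant name
// such as "opencode-cli-4", or null when there is no trailing -<digits> suffix.
function parseSlotIndex(participant: string): number | null {
  const digits = /-(\d+)$/.exec(participant)?.[1];
  return digits === undefined ? null : Number.parseInt(digits, 10);
}

// A non-conforming name never matches any slot, so it cannot claim slot 0.
function matchesSlot(participant: string, slotIdx: number): boolean {
  const idx = parseSlotIndex(participant);
  return idx !== null && idx === slotIdx;
}
```

Under the old `parseInt(match ?? "0", 10)` behaviour a name with no numeric suffix parsed as slot 0; here it returns `null` and the caller can route it to the leftover handling instead.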

Risk callouts

#2 changes doer behavior: it now blocks on tryClaim. If a future caller fires multiple doer slots in the same chat/round, the first claims and the rest see fallback_collision (correct semantics, but new behavior).
#4: detached: true changes process-tree topology. The head no longer shares the parent's process group; on non-Windows platforms it becomes the leader of a new process group and session. Combined with process.kill(-pid, ...) and the existing PID-registry reaper, this should be a strict improvement (no more orphans), but worth a self-audit pass.
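The claim semantics behind the #2 callout (first caller claims, later callers see fallback_collision, claim released on throw) can be sketched as follows. The real fallback-registry API is not shown in this PR, so `FallbackRegistry`, `runFallbackStep`, and their signatures are hypothetical stand-ins.

```typescript
// Hypothetical in-memory registry; maps a fallback target to the slot that owns it.
class FallbackRegistry {
  private readonly claims = new Map<string, string>();

  // Succeeds if the target is free or already held by the same owner.
  tryClaim(target: string, owner: string): boolean {
    const holder = this.claims.get(target);
    if (holder !== undefined && holder !== owner) return false;
    this.claims.set(target, owner);
    return true;
  }

  release(target: string, owner: string): void {
    if (this.claims.get(target) === owner) this.claims.delete(target);
  }

  // Called when a round ends so claims do not leak into the next one.
  resetRound(): void {
    this.claims.clear();
  }
}

type StepResult = { status: "ok" } | { status: "fallback_collision" };

async function runFallbackStep(
  registry: FallbackRegistry,
  target: string,
  owner: string,
  run: () => Promise<void>,
): Promise<StepResult> {
  if (!registry.tryClaim(target, owner)) {
    return { status: "fallback_collision" };
  }
  try {
    await run();
    return { status: "ok" }; // sticky on success: the claim stays until resetRound()
  } catch (err) {
    registry.release(target, owner); // release on throw so another slot may try
    throw err;
  }
}
```

Keeping the claim after success is what stops a second slot from running the same target later in the round; releasing on throw avoids permanently blocking a target that never actually ran.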

Test plan

pnpm typecheck clean
pnpm test 973/973 passing
Chorus self-audit on the bundle

…abort, proc-group kill, slot-idx, fetchChats race

Six pre-existing bugs surfaced by the PR #74/#77/#79 audits, fixed together.

1. runOne catch swallowed exceptions silently
reviewer-driver.ts:181 caught throws with no log and no event. The slot was recorded `failed` but the cockpit card had no terminal signal and stayed visually stuck. Now: console.error the stack and emit cli_error{kind:'reviewer_driver_crash'} so the cockpit transitions out of "running" and post-mortems can find the cause.

2. doer-driver had no fallback-collision guard
Reviewer slots claim fallback targets via tryClaim; the doer didn't. When a doer's chain ended at the same shared template fallback as a reviewer slot's chain, both ran the target -> duplicate cost, lineage diversity broken. Added the same claim + sticky-on-success + release-on-throw pattern, mirroring reviewer-driver. runner.ts also resets the round when no reviewer runs (doer-only phase), so the registry doesn't leak.

3. tmux 6s cold-start sleep ignored abortSignal
Cancelled chats hung for the full 6s before teardown could proceed. Replaced with abortableSleep + aborted-check on the 500ms paste-then-Enter pause too. Both doer and reviewer.

4. SIGTERM/SIGKILL only hit the head process
spawnChild had no `detached:true`, so descendant subprocesses (codex helper python, opencode node workers) orphaned on kill. Added `detached: !isWindows` and route signals via `process.kill(-pid)` so the whole group goes. Reaper logic mirrors the same fallback on Windows.

5. Slot-name index defaulted to 0 on non-conforming participant names
enrich-rounds.ts used `parseInt(match ?? "0", 10)`, so any participant string without a trailing -<digit> silently matched slot 0. A future MCP-named participant could hijack the first reviewer card. Now non-conforming names return false and fall to the leftover loop, identified by their actual string.

6. Concurrent fetchChats race
Mount + SSE + visibility + 2s poll could all fire fetchChats at once; the slowest reply landed last and stomped fresher state. Added a request-sequence guard so only the latest-seq result updates state.

Tests: 972 -> 973 passing.
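The request-sequence guard described for the fetchChats race can be illustrated without any UI framework. This is a sketch only: `latestOnly` is a hypothetical wrapper, not the code in app-sidebar.tsx, and it assumes "latest" means the most recently started request.

```typescript
// Hypothetical wrapper: every call takes a sequence number, and a reply is
// applied only if no newer call has started since it was issued.
function latestOnly<T>(
  fetcher: () => Promise<T>,
  apply: (value: T) => void,
): () => Promise<void> {
  let latestSeq = 0;
  return async (): Promise<void> => {
    const seq = ++latestSeq;
    const value = await fetcher();
    if (seq === latestSeq) {
      apply(value);
    }
    // Otherwise a newer request is in flight or already applied: drop this stale reply.
  };
}
```

If a slow reply from an earlier trigger (for example the mount fetch) arrives after a faster reply from a later one (for example the poll), the earlier reply is discarded and the fresher state is kept.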

PR #83 audit fixups from opencode-cli-4 (8/9 reviewers approved overall):

1. MEDIUM: move runner.ts's dynamic import('./runner/fallback-registry.js') to a static import at the top of the file. The dynamic import worked correctly (ES module cache makes repeats free) but added an unnecessary await per loop iteration and was stylistically inconsistent with every other fallback-registry call site.

2. LOW: document the killTree grandchild-setsid escape case so a future shim integration adding a setsid'd worker doesn't get silently orphaned on cancel. No code change; comment-only.
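For context on the killTree note, a process-group kill in Node can be sketched roughly as below. `spawnChild` and `killTree` are names taken from the PR, but these bodies are assumptions and not the code in headless.ts; the Windows branch in particular is a simplified placeholder.

```typescript
import { spawn, ChildProcess } from "node:child_process";

const isWindows = process.platform === "win32";

// On POSIX, detached: true makes the child the leader of a new process group,
// so its descendants share a group id equal to the child's pid.
function spawnChild(command: string, args: string[]): ChildProcess {
  return spawn(command, args, { detached: !isWindows, stdio: "ignore" });
}

// Signals the whole group on POSIX; falls back to the head process on Windows.
// Caveat: a grandchild that calls setsid() leaves the group and is NOT reached.
function killTree(child: ChildProcess, signal: NodeJS.Signals = "SIGTERM"): boolean {
  const pid = child.pid;
  if (pid === undefined) return false; // spawn failed, nothing to kill
  try {
    if (isWindows) {
      return child.kill(signal);
    }
    process.kill(-pid, signal); // negative pid targets the process group
    return true;
  } catch {
    return false; // group already gone, or signalling not permitted
  }
}
```

The setsid caveat in the comment is the escape case the LOW item documents: group-based signalling only covers processes that stay in the head's group.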

Victor caught this on the PR #83 audit: qwen3.6-plus reviewer ran on opencode-go, exited cleanly with empty output (no errorKind, no message), and went straight to fallback chain advance with no retry attempt. The PR #79 retry classifier returned false because isRetryableErrorKind(undefined) was hard-false. That's the right default for codex/claude/gemini (a null with no kind usually means the model genuinely produced nothing, so a retry would produce nothing again), but opencode-go's gateway has known transport flakes where a second attempt succeeds with the same prompt.

Fix: extend isRetryableErrorKind to accept an optional `lineage` hint. When `kind` is undefined AND `lineage === 'opencode'`, treat as retryable. Other lineages keep the conservative default. The lineage hint does NOT override an explicit non-retryable kind: auth/quota/db-corrupt are still terminal regardless of lineage. Both reviewer-driver and doer-driver call sites now pass `entry.lineage` so the chain step picks up the new behaviour.

Retry visibility: no UI work needed. The existing `transient_retry` cli_warning already renders as an amber chip on the participant card via participant-card.tsx's `participant.warnings` block. The chip appears the moment retry fires; the message reads "Transient X failure on Y/Z — retrying once before advancing fallback."

Tests: 974 -> 975 passing (+5 new cases on the lineage hint behavior).

Co-authored-by: chorus-codes <280607145+chorus-codes@users.noreply.github.com>
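The lineage-hint rule above can be written down as a small classifier. This is a sketch under assumptions: the real `ErrorKind` values are not listed in the PR, so the kind names and the set of retryable kinds below are hypothetical; only the undefined-plus-opencode rule and the terminal auth/quota/db-corrupt cases come from the text.

```typescript
// Hypothetical kind and lineage names, for illustration only.
type ErrorKind = "auth" | "quota" | "db_corrupt" | "timeout" | "network";
type Lineage = "opencode" | "codex" | "claude" | "gemini";

// Assumed set of transient kinds; the PR does not enumerate them.
const RETRYABLE_KINDS: ReadonlySet<ErrorKind> = new Set<ErrorKind>(["timeout", "network"]);

function isRetryableErrorKind(kind: ErrorKind | undefined, lineage?: Lineage): boolean {
  if (kind === undefined) {
    // Empty output with no kind: retry only for the lineage with known transport flakes.
    return lineage === "opencode";
  }
  // An explicit kind always decides; the lineage hint never overrides it.
  return RETRYABLE_KINDS.has(kind);
}
```

Because the lineage check sits only in the `kind === undefined` branch, an explicit terminal kind such as auth stays terminal even for opencode.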

chorus-codes mentioned this pull request May 24, 2026

feat(runner): one-shot retry on opencode null-with-no-errorKind + retry chip visibility #85

Closed

3 tasks

chorus-codes merged commit 60ccb2f into main May 24, 2026
2 checks passed

chorus-codes deleted the fix/audit-followups-bundle branch May 24, 2026 01:54

This was referenced May 24, 2026

chore: bump to v0.8.55 #86

Merged

feat(runner): one-shot retry on opencode null-with-no-errorKind #87

Merged

chorus-codes mentioned this pull request May 24, 2026

chore: bump to v0.8.56 #88

Merged

chorus-codes merged 2 commits into main from fix/audit-followups-bundle
