Skip to content

fix: audit followup bundle — runOne crash, doer claim, tmux abort, proc-group kill, slot-idx, fetchChats race#83

Merged
chorus-codes merged 2 commits into
mainfrom
fix/audit-followups-bundle
May 24, 2026
Merged

fix: audit followup bundle — runOne crash, doer claim, tmux abort, proc-group kill, slot-idx, fetchChats race#83
chorus-codes merged 2 commits into
mainfrom
fix/audit-followups-bundle

Conversation

@chorus-codes
Copy link
Copy Markdown
Owner

Six pre-existing bugs flagged by the PR #74/#77/#79 chorus audits, bundled into one PR per Victor's request.

What's in

# Bug File Risk class
1 runOne catch swallowed exceptions silently → cockpit slot stuck reviewer-driver.ts High debuggability
2 Doer had no tryClaim guard → doer + reviewer could run same fallback target doer-driver.ts + runner.ts Diversity / cost
3 tmux 6s + 500ms sleeps ignored abortSignal doer-driver.ts + reviewer-driver.ts UX (cancel latency)
4 SIGTERM/SIGKILL only hit head process; subproc descendants orphaned headless.ts Resource leak
5 parseInt(match ?? "0", 10) hijacked slot 0 for non-conforming names enrich-rounds.ts UI correctness
6 Concurrent fetchChats race; slow reply stomped fresh state app-sidebar.tsx UX (flicker)

Why one PR

Victor explicitly asked for one bundled PR. Each fix is independently testable; combining them just shares one audit pass + one bump.

Tests

  • 972 → 973 passing (+1 new for slot-idx parser).
  • Existing tests for retries / claims / cli-precheck unchanged.

Risk callouts

Test plan

  • pnpm typecheck clean
  • pnpm test 973/973 passing
  • Chorus self-audit on the bundle

…abort, proc-group kill, slot-idx, fetchChats race

Six pre-existing bugs surfaced by the PR #74/#77/#79 audits, fixed together.

1. runOne catch swallowed exceptions silently
   reviewer-driver.ts:181 caught throws with no log and no event. The
   slot was recorded `failed` but the cockpit card had no terminal
   signal and stayed visually stuck. Now: console.error the stack and
   emit cli_error{kind:'reviewer_driver_crash'} so the cockpit
   transitions out of "running" and post-mortems can find the cause.

2. doer-driver had no fallback-collision guard
   Reviewer slots claim fallback targets via tryClaim; the doer
   didn't. When a doer's chain ended at the same shared template
   fallback as a reviewer slot's chain, both ran the target ->
   duplicate cost, lineage diversity broken. Added the same claim +
   sticky-on-success + release-on-throw pattern, mirroring
   reviewer-driver. runner.ts also resets the round when no reviewer
   runs (doer-only phase), so the registry doesn't leak.

3. tmux 6s cold-start sleep ignored abortSignal
   Cancelled chats hung for the full 6s before teardown could
   proceed. Replaced with abortableSleep + aborted-check on the 500ms
   paste-then-Enter pause too. Both doer and reviewer.

4. SIGTERM/SIGKILL only hit the head process
   spawnChild had no `detached:true`, so descendant subprocesses
   (codex helper python, opencode node workers) orphaned on kill.
   Added `detached: !isWindows` and route signals via
   `process.kill(-pid)` so the whole group goes. Reaper logic mirrors
   the same fallback on Windows.

5. Slot-name index defaulted to 0 on non-conforming participant names
   enrich-rounds.ts used `parseInt(match ?? "0", 10)`, so any
   participant string without a trailing -<digit> silently matched
   slot 0. A future MCP-named participant could hijack the first
   reviewer card. Now non-conforming names return false and fall to
   the leftover loop, identified by their actual string.

6. Concurrent fetchChats race
   Mount + SSE + visibility + 2s poll could all fire fetchChats at
   once; the slowest reply landed last and stomped fresher state.
   Added a request-sequence guard so only the latest-seq result
   updates state.

Tests: 972 -> 973 passing.
PR #83 audit fixups from opencode-cli-4 (8/9 reviewers approved overall):

1. MEDIUM: move runner.ts's dynamic import('./runner/fallback-registry.js')
   to a static import at the top of the file. The dynamic import worked
   correctly (ES module cache makes repeats free) but added an unnecessary
   await per loop iteration and was stylistically inconsistent with every
   other fallback-registry call site.

2. LOW: document the killTree grandchild-setsid escape case so a future
   shim integration adding a setsid'd worker doesn't get silently
   orphaned on cancel. No code change — comment-only.
@chorus-codes chorus-codes merged commit 60ccb2f into main May 24, 2026
2 checks passed
@chorus-codes chorus-codes deleted the fix/audit-followups-bundle branch May 24, 2026 01:54
chorus-codes added a commit that referenced this pull request May 24, 2026
Victor caught this on the PR #83 audit: qwen3.6-plus reviewer ran on
opencode-go, exited cleanly with empty output (no errorKind, no
message), and went straight to fallback chain advance — no retry
attempt. The PR #79 retry classifier returned false because
isRetryableErrorKind(undefined) was hard-false. That's the right
default for codex/claude/gemini (a null with no kind usually means
the model genuinely produced nothing — retry would produce nothing
again), but opencode-go's gateway has known transport flakes where
a second attempt succeeds with the same prompt.

Fix: extend isRetryableErrorKind to accept an optional `lineage` hint.
When `kind` is undefined AND `lineage === 'opencode'`, treat as
retryable. Other lineages keep the conservative default. The lineage
hint does NOT override an explicit non-retryable kind — auth/quota/
db-corrupt are still terminal regardless of lineage.

Both reviewer-driver and doer-driver call sites now pass
`entry.lineage` so the chain step picks up the new behaviour.

Retry visibility: no UI work needed — the existing `transient_retry`
cli_warning already renders as an amber chip on the participant card
via participant-card.tsx's `participant.warnings` block. The chip
appears the moment retry fires; the message reads "Transient X
failure on Y/Z — retrying once before advancing fallback."

Tests: 974 -> 975 passing (+5 new cases on the lineage hint behavior).

Co-authored-by: chorus-codes <280607145+chorus-codes@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant