fix: audit followup bundle — runOne crash, doer claim, tmux abort, proc-group kill, slot-idx, fetchChats race#83
Merged
Conversation
…abort, proc-group kill, slot-idx, fetchChats race Six pre-existing bugs surfaced by the PR #74/#77/#79 audits, fixed together. 1. runOne catch swallowed exceptions silently reviewer-driver.ts:181 caught throws with no log and no event. The slot was recorded `failed` but the cockpit card had no terminal signal and stayed visually stuck. Now: console.error the stack and emit cli_error{kind:'reviewer_driver_crash'} so the cockpit transitions out of "running" and post-mortems can find the cause. 2. doer-driver had no fallback-collision guard Reviewer slots claim fallback targets via tryClaim; the doer didn't. When a doer's chain ended at the same shared template fallback as a reviewer slot's chain, both ran the target -> duplicate cost, lineage diversity broken. Added the same claim + sticky-on-success + release-on-throw pattern, mirroring reviewer-driver. runner.ts also resets the round when no reviewer runs (doer-only phase), so the registry doesn't leak. 3. tmux 6s cold-start sleep ignored abortSignal Cancelled chats hung for the full 6s before teardown could proceed. Replaced with abortableSleep + aborted-check on the 500ms paste-then-Enter pause too. Both doer and reviewer. 4. SIGTERM/SIGKILL only hit the head process spawnChild had no `detached:true`, so descendant subprocesses (codex helper python, opencode node workers) orphaned on kill. Added `detached: !isWindows` and route signals via `process.kill(-pid)` so the whole group goes. Reaper logic mirrors the same fallback on Windows. 5. Slot-name index defaulted to 0 on non-conforming participant names enrich-rounds.ts used `parseInt(match ?? "0", 10)`, so any participant string without a trailing -<digit> silently matched slot 0. A future MCP-named participant could hijack the first reviewer card. Now non-conforming names return false and fall to the leftover loop, identified by their actual string. 6. Concurrent fetchChats race Mount + SSE + visibility + 2s poll could all fire fetchChats at once; the slowest reply landed last and stomped fresher state. Added a request-sequence guard so only the latest-seq result updates state. Tests: 972 -> 973 passing.
3 tasks
PR #83 audit fixups from opencode-cli-4 (8/9 reviewers approved overall): 1. MEDIUM: move runner.ts's dynamic import('./runner/fallback-registry.js') to a static import at the top of the file. The dynamic import worked correctly (ES module cache makes repeats free) but added an unnecessary await per loop iteration and was stylistically inconsistent with every other fallback-registry call site. 2. LOW: document the killTree grandchild-setsid escape case so a future shim integration adding a setsid'd worker doesn't get silently orphaned on cancel. No code change — comment-only.
This was referenced May 24, 2026
Merged
chorus-codes
added a commit
that referenced
this pull request
May 24, 2026
Victor caught this on the PR #83 audit: qwen3.6-plus reviewer ran on opencode-go, exited cleanly with empty output (no errorKind, no message), and went straight to fallback chain advance — no retry attempt. The PR #79 retry classifier returned false because isRetryableErrorKind(undefined) was hard-false. That's the right default for codex/claude/gemini (a null with no kind usually means the model genuinely produced nothing — retry would produce nothing again), but opencode-go's gateway has known transport flakes where a second attempt succeeds with the same prompt. Fix: extend isRetryableErrorKind to accept an optional `lineage` hint. When `kind` is undefined AND `lineage === 'opencode'`, treat as retryable. Other lineages keep the conservative default. The lineage hint does NOT override an explicit non-retryable kind — auth/quota/ db-corrupt are still terminal regardless of lineage. Both reviewer-driver and doer-driver call sites now pass `entry.lineage` so the chain step picks up the new behaviour. Retry visibility: no UI work needed — the existing `transient_retry` cli_warning already renders as an amber chip on the participant card via participant-card.tsx's `participant.warnings` block. The chip appears the moment retry fires; the message reads "Transient X failure on Y/Z — retrying once before advancing fallback." Tests: 974 -> 975 passing (+5 new cases on the lineage hint behavior). Co-authored-by: chorus-codes <280607145+chorus-codes@users.noreply.github.com>
Merged
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Six pre-existing bugs flagged by the PR #74/#77/#79 chorus audits, bundled into one PR per Victor's request.
What's in
Why one PR
Victor explicitly asked for one bundled PR. Each fix is independently testable; combining them just shares one audit pass + one bump.
Tests
Risk callouts
detached: truechanges process-tree topology. The head loses its parent-ID linkage. Combined withprocess.kill(-pid, ...)and the existing PID-registry reaper, this should be a strict improvement (no more orphans), but worth a self-audit pass.Test plan
pnpm typecheckcleanpnpm test973/973 passing