Skip to content

refactor(runner): shim-owned retry policy (closes PR #87 no_output bypass + moonshot gap)#91

Merged
chorus-codes merged 2 commits into
mainfrom
refactor/shim-owned-retry-policy
May 25, 2026
Merged

refactor(runner): shim-owned retry policy (closes PR #87 no_output bypass + moonshot gap)#91
chorus-codes merged 2 commits into
mainfrom
refactor/shim-owned-retry-policy

Conversation

@chorus-codes
Copy link
Copy Markdown
Owner

Follow-up to the PR #87 chorus self-audit (6 request-changes / 2 approves).

What the audit caught

  1. THE BUG (antigravity + gemini, HIGH severity): the COMMON opencode-go failure mode is empty-stdout exit with `eventCount === 0`. The runner's finally block synthesises `kind: 'no_output'` for that case, and `no_output` is universally terminal — so the new opencode-null retry was unreachable for the bug PR feat(runner): one-shot retry on opencode null-with-no-errorKind #87 was shipped to fix. The retry only helped the rare case where opencode emits some events but no usable content.

  2. Slippery slope (every reviewer): a lineage exception in the central classifier is a one-off escape valve that grows. Moonshot (also opencode-go) was an unprotected gap (codex flagged via the `isOpenCodeFamily` idiom used throughout error-detector.ts).

Fix

Move retry-on-null-kind and retry-on-no-output out of the central classifier and into the shim layer. Each `AgentShim` declares its own `retryPolicy`:

```ts
readonly retryPolicy?: {
readonly extraKinds?: readonly string[];
readonly onNullKind?: boolean;
readonly onNoOutput?: boolean;
};
```

  • opencode shim: opts into both flags (closes the no_output bypass — THE bug)
  • kimi shim (moonshot lineage): same opts-in (closes the moonshot gap)
  • codex/claude/gemini/grok/antigravity/openrouter shims: unchanged — universal transient list as before

`isRetryableErrorKind` signature changes from `(kind, lineage?)` to `(kind, shim?)`. Universal terminal kinds (`quota_exhausted`, `token_refresh_lost`, `opencode_db_corrupt`) still beat shim opt-ins — explicit terminal kinds cannot be overridden.

Tests

978 → 982 passing. Test stubs use `Pick<AgentShim, 'retryPolicy'>` instead of lineage strings — matches production call shape.

Out of scope (audit follow-ups for a separate PR)

  • Telemetry: record retry-succeeded rows so the policy's effectiveness is measurable. Universal across reviewers.
  • Verdict-parse-failure false positive (codex finding Unable to use OpenCode #5): opencode-cli-7 traced this carefully and found non-empty content always returns a real verdict, so apology-shaped responses don't reach the retry path. Closed as non-issue.

Test plan

  • `pnpm typecheck` clean
  • `pnpm test` 982/982 passing
  • Chorus self-audit (recursive: audit the audit-followup)

…pass)

PR #87 added a lineage-dispatch escape valve in isRetryableErrorKind for
opencode null-with-no-kind. The chorus self-audit (6 request-changes /
2 approves) found two real problems:

1. THE BUG (antigravity + gemini, HIGH):
   The COMMON opencode-go failure is empty-stdout exit with eventCount=0.
   The runner's finally block synthesises kind='no_output' for that
   case — and no_output is universally terminal — so the new opencode-
   null retry was UNREACHABLE for the bug it shipped to fix. It only
   helped the rare case where opencode emits some events but no usable
   content.

2. SLIPPERY SLOPE (every reviewer):
   Central classifier with a lineage exception is a one-off escape
   valve that grows into a table. Moonshot (also opencode-go) was
   already an unprotected gap (codex flagged via isOpenCodeFamily idiom).

Fix: move the retry-on-null-kind and retry-on-no-output policies out of
the central classifier and into the shim layer. Each AgentShim declares
its own `retryPolicy: { extraKinds?, onNullKind?, onNoOutput? }` block.

- opencode shim: opts into both flags (closes the no_output bypass)
- kimi shim (moonshot): same opts-in (closes the moonshot gap)
- codex/claude/gemini/grok/antigravity/openrouter shims: unchanged —
  fall through to the universal transient list as before

isRetryableErrorKind signature changes from (kind, lineage?) to
(kind, shim?). Universal terminal/transient layer unchanged; shim
policies layer on top. Universal terminal kinds (quota_exhausted,
token_refresh_lost, opencode_db_corrupt) still beat shim opt-ins —
explicit terminal kinds cannot be overridden by a shim.

Tests: 978 -> 982 passing. Test cases now use tiny shim stubs instead
of lineage strings, matching the production call shape.

Out of scope (separate follow-ups flagged in audit):
- Telemetry: record retry-succeeded rows so the policy's effectiveness
  is measurable. Universal across reviewers.
- Verdict-parse-failure false positive (codex finding #5): opencode-cli-7
  traced this and found non-empty content always returns a real verdict,
  so apology-shaped responses don't reach the retry path. Closed as
  non-issue.
…logy shapes

PR #91 audit catch (multiple reviewers, MEDIUM-HIGH severity): when a
reviewer writes non-empty prose with no approve/reject keyword,
verdictFromReviewerText returns null. Pre-fix, lastError.kind stayed
undefined → the new opencode shim's onNullKind=true policy fired a
retry that wasted tokens on the same prose shape.

This isn't the transport flake retry was designed for — it's a
successful model run that just didn't match the verdict heuristic
(apology, refusal, off-topic continuation). Classify it as
`verdict_ambiguous` (not in any retry list) so the driver advances
the chain cleanly.

Guarded on `!lastError.kind` so a real classified failure earlier in
the run (e.g. stream_failure with the finally block writing a
REVIEWER FAILED summary to answer.md) survives the verdict check.

Closes my own PR description claim (PR #91) that opencode-cli-7
had closed this — that earlier trace was incomplete. The audit
correctly re-traced it and found the path IS reachable.

Tests: 983 -> 984 (+1 contract test for verdict_ambiguous + 1
classifier test confirming it's terminal even on opencode shim).
@chorus-codes chorus-codes merged commit 176b876 into main May 25, 2026
2 checks passed
@chorus-codes chorus-codes deleted the refactor/shim-owned-retry-policy branch May 25, 2026 04:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant