feat(mt#1131): Reviewer retry with reasoning_effort=low on empty output by minsky-ai[bot] · Pull Request #748 · edobry/minsky

minsky-ai · 2026-04-24T01:10:27Z

Summary

mt#1125 guarded the boundary so empty-body model responses fail loudly instead of posting as silent zero-content reviews. This PR adds the follow-up retry: when GPT-5 (or any reasoning model) exhausts max_completion_tokens on hidden reasoning tokens, a second attempt with reasoning_effort: "low" shifts the budget toward visible output and usually succeeds. One retry, no backoff, OpenAI-only — Google and Anthropic have no equivalent knob.

Fixes the "reviewer failed loudly but PR still got no review" case observed 6+ times on Sprint B PRs before mt#1125 landed.

Changes

services/reviewer/src/providers.ts

New CallReviewerOptions interface with optional reasoningEffort: "low" | "medium" | "high".
callReviewer / callOpenAI accept options and use options?.reasoningEffort ?? "medium" on reasoning-model calls. Non-OpenAI paths ignore the option.

services/reviewer/src/review-worker.ts

Extract callReviewerWithRetry(config, systemPrompt, userPrompt, callReviewerFn?).
- First call → if validates, returns attempt: "first-attempt-success", retryAttempted: false.
- Empty + OpenAI → retry once with reasoningEffort: "low"; returns attempt: "retry-success" | "retry-failed" with retryAttempted: true.
- Empty + non-OpenAI → no retry possible, returns attempt: "retry-failed", retryAttempted: false.
- callReviewerFn parameter is the test seam.
ReviewResult gains attempt and retryAttempted for observability; success reason now includes attempt=….
runReview consumes the helper; skip-notice behavior on final-empty is unchanged.

Tests

6 new callReviewerWithRetry cases via a fake CallReviewerFn seam: first substantive (no retry), first empty OpenAI + retry substantive (reasoningEffort=low passed only on retry), first empty OpenAI + retry empty, Google empty (no retry), Anthropic empty (no retry), never-cascades-beyond-one-retry.

Test plan

services/reviewer suite: 50 pass, 0 fail (was 44; +6 retry cases).
Service-local tsc --noEmit: clean.
Workspace validate_typecheck: 0 errors.
Workspace validate_lint: 0 errors, 0 warnings.
format:check on changed files: clean.
End-to-end verification against a live reasoning-exhausted PR — deferred; the calibration data in mt#1110 (project_mt1110_calibration_data.md) tracks real occurrences and will show retry-success / retry-failed in logs once deployed.

Observability

After deployment, server logs on empty-output events will include attempt=retry-success or attempt=retry-failed so the mt#1110 calibration data can distinguish:

how often the retry path rescues a review (expected primary mode),
how often both attempts fail (indicates the model cannot fit the diff in budget at all — different remediation: raise max_completion_tokens further or shrink the prompt),
how often the first attempt succeeds (baseline; should be the majority).

Out of scope

Retry for non-empty-but-invalid content.
Cross-provider fallback.
Exponential backoff / >1 retry (spec explicitly excludes cascading).
Extending retry to Google / Anthropic (no reasoning_effort equivalent).

mt#1125 guarded the boundary so empty-body model responses fail loudly instead of posting as silent zero-content reviews. This PR adds the missing retry: when GPT-5 exhausts max_completion_tokens on reasoning, a second attempt with reasoning_effort=low shifts the budget toward visible output and usually succeeds. One retry, no backoff, OpenAI-only (Google and Anthropic have no equivalent knob). - providers.ts: add CallReviewerOptions with reasoningEffort; thread through callReviewer and callOpenAI. Default stays medium; retry passes low. - review-worker.ts: extract callReviewerWithRetry helper that encapsulates the single-retry-on-empty rule. Returns attempt trace (first-attempt-success | retry-success | retry-failed) plus a retryAttempted flag for observability. runReview consumes the helper and propagates the trace onto ReviewResult.reason and ReviewResult.attempt so the server log can distinguish the cases. - Tests: 6 new cases against the extracted helper via a fake callReviewer seam covering first success, first empty + retry success, first empty + retry empty, Google empty (no retry), Anthropic empty (no retry), and never-cascades-beyond-one-retry. Total reviewer-service tests: 50 pass, 0 fail.

Resolves conflicts in the reviewer service between mt#1131 (retry with reasoning_effort=low on empty output) and mt#1126 (tool-use loop for OpenAI provider). Both features belong together: callReviewer now accepts (config, systemPrompt, userPrompt, tools?, options?); callReviewerWithRetry threads tools through both attempts. Also picks up mt#1085 tier-lookup changes in runReview. Tests: 113 pass (up from 50, combines mt#1131 retry tests + mt#1126 tool-context tests + mt#1085 tier-routing expansion).

…+ post-mortem docs (mt#1372) ## Summary Implements the rescoped mt#1372 forward instrumentation: every webhook delivery received by the reviewer service is now persisted to a `reviewer_webhook_events` Postgres table, enabling forensic analysis of missed-review incidents that would otherwise leave no trace after Railway log and GitHub App delivery-history retention windows expire. **What ships:** - **Drizzle schema** (`src/db/schemas/webhook-events-schema.ts`): `reviewer_webhook_events` table with `delivery_id` UNIQUE, `event_type`, `headers` (jsonb subset), `body` (jsonb), `outcome` (pgEnum, 8 values), `error_details` (jsonb), `received_at`, `processed_at`; three indexes - **SQL migration** (`migrations/pg/0001_webhook_events.sql`): creates enum type and table; journal updated - **Persistence module** (`src/webhook-events.ts`): `recordWebhookReceipt` (ON CONFLICT DO NOTHING — first-delivery wins, terminal outcomes preserved on re-delivery), `updateOutcome` (sets `processedAt` on terminal outcomes), `pruneOldRows` (returns count or -1), `extractPersistedHeaders` (truncates HMAC signature to 12-char prefix) - **server.ts wiring**: synthesizes `synthetic-${crypto.randomUUID()}` for missing x-github-delivery (R2 fix); `recordWebhookReceipt` on every POST regardless of header presence (R1 fix #3); `updateOutcome` at each pipeline stage (reviewer_called, review_submitted, failed_at_reviewer, skipped, failed_at_signature, failed_at_tier_resolve), 24h in-process retention pruner, `log.error("webhook_processing_failed", ...)` OperatorNotify on dispatch/reviewer failures - **Unit tests** (`src/webhook-events.test.ts` + `src/server.test.ts`): 19 + 9 tests, including 3 pinned regression tests for R1/R2 BLOCKING findings - **Smoke script** (`scripts/smoke-webhook-events.ts`): POSTs a signed webhook, queries the DB 500ms later via raw SQL (decoupled from src/ TS imports — R1 fix #2), asserts row shape; skips gracefully when env vars are absent (exit 0 SKIP) - **Post-mortem guide** (`docs/incidents/reviewer-webhook-investigation.md`): 6 SQL queries, investigation workflow, Railway log correlation, retention policy reference **Scope boundary** (per task spec): no changes to review-worker.ts, providers.ts, sweeper.ts, prior-review-summary.ts, mcp-client.ts, tier-routing.ts, config.ts, logger.ts. ## R1 + R2 review findings addressed - [R1 BLOCKING] UPSERT overwrites terminal outcomes on re-delivery → onConflictDoUpdate → onConflictDoNothing (commit `9d6268792`) - [R1 BLOCKING] Smoke script imports TS schema from src/ → switched to raw SQL via postgres-js tagged template (commit `9d6268792`) - [R1 BLOCKING] POSTs missing x-github-event are skipped → persist every POST with "unknown" sentinel (commit `9d6268792`) - [R2 BLOCKING] Missing x-github-delivery collapses all malformed POSTs into one row → synthesize `synthetic-${crypto.randomUUID()}` per request (commit `5ca07ab11`) ## Test plan - [x] `bun test src/webhook-events.test.ts src/server.test.ts` — 28/28 pass (see Execution evidence below) - [x] `bun run scripts/smoke-webhook-events.ts` — exits 0 (SKIP) when env vars absent - [ ] Post-deploy: run smoke script with `MINSKY_REVIEWER_WEBHOOK_SECRET`, `MINSKY_SESSIONDB_POSTGRES_URL`, and `SMOKE_REVIEWER_BASE_URL` set against the Railway environment — asserts row in `reviewer_webhook_events` with correct fields ### Execution evidence: ``` $ cd services/reviewer && bun test --preload ../../tests/setup.ts --timeout=15000 src/webhook-events.test.ts src/server.test.ts (pass) extractPersistedHeaders > extracts all four canonical headers when present (pass) extractPersistedHeaders > omits absent headers rather than storing null/undefined (pass) extractPersistedHeaders > partial headers — only present ones are included (pass) extractPersistedHeaders > signature prefix is always 12 chars even for short signatures (pass) recordWebhookReceipt > calls db.insert with correct arguments (pass) recordWebhookReceipt > does not throw when DB insert fails (pass) recordWebhookReceipt > handles non-JSON body by wrapping in { raw } object (pass) recordWebhookReceipt > uses ON CONFLICT DO NOTHING (R1 BLOCKING #1 regression — re-delivery must not overwrite terminal outcomes) (pass) recordWebhookReceipt > accepts 'unknown' sentinel eventType (R1 BLOCKING #3 regression — every POST is persisted) (pass) updateOutcome > calls db.update and db.where with the delivery ID (pass) updateOutcome > does NOT set processedAt for non-terminal outcomes (pass) updateOutcome > includes errorDetails when provided (pass) updateOutcome > does not throw when DB update fails (pass) updateOutcome > skipped is a terminal outcome (pass) updateOutcome > received is a non-terminal outcome (pass) pruneOldRows > calls db.delete and returns the row count (pass) pruneOldRows > returns -1 when DB delete fails (pass) pruneOldRows > returns 0 when no rows are deleted (pass) pruneOldRows > uses default retention of 90 days when not specified (pass) webhook handler > responds 200 immediately for pull_request.opened, before runReview finishes (pass) webhook handler > responds 400 when signature header is absent (pass) webhook handler > responds 401 when signature is invalid (pass) webhook handler > skips draft PRs — returns 200 without scheduling a review (pass) webhook handler > /health returns inflightCount (pass) webhook handler > missing x-github-delivery synthesizes a unique synthetic-<uuid> per request (R2 BLOCKING regression) (pass) gracefulShutdown > logs shutdown_drain_start with correct inflightCount, then shutdown_drain_complete (pass) gracefulShutdown > shutdown_drain_start carries inflightCount=0 when no reviews in flight (pass) gracefulShutdown > review errors do NOT prevent graceful shutdown from completing 28 pass 0 fail 55 expect() calls Ran 28 tests across 2 files. [320.00ms] ``` ## Concurrency analysis (N/A — no check-then-act pattern introduced. ON CONFLICT DO NOTHING is structurally atomic in Postgres; per-request UUID synthesis is intra-process and non-shared.) ## Forensic context Three confirmed instances of missed reviews (PR #677, #748, #761, 2026-04-23/24) could not be diagnosed because GitHub App delivery history (14-day retention) and Railway logs were both expired. This table captures all webhook events from deployment of this PR onward. See `docs/incidents/reviewer-webhook-investigation.md` for investigation queries. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: minsky-ai[bot] <minsky-ai[bot]@users.noreply.github.com>

minsky-ai Bot added the authorship/co-authored Co-authored by human and AI agent label Apr 24, 2026

edobry merged commit 17badb8 into main Apr 24, 2026
2 checks passed

edobry deleted the task/mt-1131 branch April 24, 2026 06:44

minsky-ai Bot mentioned this pull request Apr 24, 2026

feat(mt#1228): Hook blocks gh api PR-merge calls with merge_method != merge #761

Merged

6 tasks

minsky-ai Bot mentioned this pull request May 11, 2026

feat(mt#1372): durable webhook-event persistence + retention pruning + post-mortem docs (mt#1372) #1024

Merged

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(mt#1131): Reviewer retry with reasoning_effort=low on empty output#748

feat(mt#1131): Reviewer retry with reasoning_effort=low on empty output#748
edobry merged 2 commits into
mainfrom
task/mt-1131

minsky-ai Bot commented Apr 24, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

minsky-ai Bot commented Apr 24, 2026

Summary

Changes

Test plan

Observability

Out of scope

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant