feat(mt#1131): Reviewer retry with reasoning_effort=low on empty output#748
Merged
Conversation
mt#1125 guarded the boundary so empty-body model responses fail loudly instead of posting as silent zero-content reviews. This PR adds the missing retry: when GPT-5 exhausts max_completion_tokens on reasoning, a second attempt with reasoning_effort=low shifts the budget toward visible output and usually succeeds. One retry, no backoff, OpenAI-only (Google and Anthropic have no equivalent knob). - providers.ts: add CallReviewerOptions with reasoningEffort; thread through callReviewer and callOpenAI. Default stays medium; retry passes low. - review-worker.ts: extract callReviewerWithRetry helper that encapsulates the single-retry-on-empty rule. Returns attempt trace (first-attempt-success | retry-success | retry-failed) plus a retryAttempted flag for observability. runReview consumes the helper and propagates the trace onto ReviewResult.reason and ReviewResult.attempt so the server log can distinguish the cases. - Tests: 6 new cases against the extracted helper via a fake callReviewer seam covering first success, first empty + retry success, first empty + retry empty, Google empty (no retry), Anthropic empty (no retry), and never-cascades-beyond-one-retry. Total reviewer-service tests: 50 pass, 0 fail.
Resolves conflicts in the reviewer service between mt#1131 (retry with reasoning_effort=low on empty output) and mt#1126 (tool-use loop for OpenAI provider). Both features belong together: callReviewer now accepts (config, systemPrompt, userPrompt, tools?, options?); callReviewerWithRetry threads tools through both attempts. Also picks up mt#1085 tier-lookup changes in runReview. Tests: 113 pass (up from 50, combines mt#1131 retry tests + mt#1126 tool-context tests + mt#1085 tier-routing expansion).
6 tasks
3 tasks
edobry
added a commit
that referenced
this pull request
May 11, 2026
…+ post-mortem docs (mt#1372)
## Summary
Implements the rescoped mt#1372 forward instrumentation: every webhook delivery received by the reviewer service is now persisted to a `reviewer_webhook_events` Postgres table, enabling forensic analysis of missed-review incidents that would otherwise leave no trace after Railway log and GitHub App delivery-history retention windows expire.
**What ships:**
- **Drizzle schema** (`src/db/schemas/webhook-events-schema.ts`): `reviewer_webhook_events` table with `delivery_id` UNIQUE, `event_type`, `headers` (jsonb subset), `body` (jsonb), `outcome` (pgEnum, 8 values), `error_details` (jsonb), `received_at`, `processed_at`; three indexes
- **SQL migration** (`migrations/pg/0001_webhook_events.sql`): creates enum type and table; journal updated
- **Persistence module** (`src/webhook-events.ts`): `recordWebhookReceipt` (ON CONFLICT DO NOTHING — first-delivery wins, terminal outcomes preserved on re-delivery), `updateOutcome` (sets `processedAt` on terminal outcomes), `pruneOldRows` (returns count or -1), `extractPersistedHeaders` (truncates HMAC signature to 12-char prefix)
- **server.ts wiring**: synthesizes `synthetic-${crypto.randomUUID()}` for missing x-github-delivery (R2 fix); `recordWebhookReceipt` on every POST regardless of header presence (R1 fix #3); `updateOutcome` at each pipeline stage (reviewer_called, review_submitted, failed_at_reviewer, skipped, failed_at_signature, failed_at_tier_resolve), 24h in-process retention pruner, `log.error("webhook_processing_failed", ...)` OperatorNotify on dispatch/reviewer failures
- **Unit tests** (`src/webhook-events.test.ts` + `src/server.test.ts`): 19 + 9 tests, including 3 pinned regression tests for R1/R2 BLOCKING findings
- **Smoke script** (`scripts/smoke-webhook-events.ts`): POSTs a signed webhook, queries the DB 500ms later via raw SQL (decoupled from src/ TS imports — R1 fix #2), asserts row shape; skips gracefully when env vars are absent (exit 0 SKIP)
- **Post-mortem guide** (`docs/incidents/reviewer-webhook-investigation.md`): 6 SQL queries, investigation workflow, Railway log correlation, retention policy reference
**Scope boundary** (per task spec): no changes to review-worker.ts, providers.ts, sweeper.ts, prior-review-summary.ts, mcp-client.ts, tier-routing.ts, config.ts, logger.ts.
## R1 + R2 review findings addressed
- [R1 BLOCKING] UPSERT overwrites terminal outcomes on re-delivery → onConflictDoUpdate → onConflictDoNothing (commit `9d6268792`)
- [R1 BLOCKING] Smoke script imports TS schema from src/ → switched to raw SQL via postgres-js tagged template (commit `9d6268792`)
- [R1 BLOCKING] POSTs missing x-github-event are skipped → persist every POST with "unknown" sentinel (commit `9d6268792`)
- [R2 BLOCKING] Missing x-github-delivery collapses all malformed POSTs into one row → synthesize `synthetic-${crypto.randomUUID()}` per request (commit `5ca07ab11`)
## Test plan
- [x] `bun test src/webhook-events.test.ts src/server.test.ts` — 28/28 pass (see Execution evidence below)
- [x] `bun run scripts/smoke-webhook-events.ts` — exits 0 (SKIP) when env vars absent
- [ ] Post-deploy: run smoke script with `MINSKY_REVIEWER_WEBHOOK_SECRET`, `MINSKY_SESSIONDB_POSTGRES_URL`, and `SMOKE_REVIEWER_BASE_URL` set against the Railway environment — asserts row in `reviewer_webhook_events` with correct fields
### Execution evidence:
```
$ cd services/reviewer && bun test --preload ../../tests/setup.ts --timeout=15000 src/webhook-events.test.ts src/server.test.ts
(pass) extractPersistedHeaders > extracts all four canonical headers when present
(pass) extractPersistedHeaders > omits absent headers rather than storing null/undefined
(pass) extractPersistedHeaders > partial headers — only present ones are included
(pass) extractPersistedHeaders > signature prefix is always 12 chars even for short signatures
(pass) recordWebhookReceipt > calls db.insert with correct arguments
(pass) recordWebhookReceipt > does not throw when DB insert fails
(pass) recordWebhookReceipt > handles non-JSON body by wrapping in { raw } object
(pass) recordWebhookReceipt > uses ON CONFLICT DO NOTHING (R1 BLOCKING #1 regression — re-delivery must not overwrite terminal outcomes)
(pass) recordWebhookReceipt > accepts 'unknown' sentinel eventType (R1 BLOCKING #3 regression — every POST is persisted)
(pass) updateOutcome > calls db.update and db.where with the delivery ID
(pass) updateOutcome > does NOT set processedAt for non-terminal outcomes
(pass) updateOutcome > includes errorDetails when provided
(pass) updateOutcome > does not throw when DB update fails
(pass) updateOutcome > skipped is a terminal outcome
(pass) updateOutcome > received is a non-terminal outcome
(pass) pruneOldRows > calls db.delete and returns the row count
(pass) pruneOldRows > returns -1 when DB delete fails
(pass) pruneOldRows > returns 0 when no rows are deleted
(pass) pruneOldRows > uses default retention of 90 days when not specified
(pass) webhook handler > responds 200 immediately for pull_request.opened, before runReview finishes
(pass) webhook handler > responds 400 when signature header is absent
(pass) webhook handler > responds 401 when signature is invalid
(pass) webhook handler > skips draft PRs — returns 200 without scheduling a review
(pass) webhook handler > /health returns inflightCount
(pass) webhook handler > missing x-github-delivery synthesizes a unique synthetic-<uuid> per request (R2 BLOCKING regression)
(pass) gracefulShutdown > logs shutdown_drain_start with correct inflightCount, then shutdown_drain_complete
(pass) gracefulShutdown > shutdown_drain_start carries inflightCount=0 when no reviews in flight
(pass) gracefulShutdown > review errors do NOT prevent graceful shutdown from completing
28 pass
0 fail
55 expect() calls
Ran 28 tests across 2 files. [320.00ms]
```
## Concurrency analysis
(N/A — no check-then-act pattern introduced. ON CONFLICT DO NOTHING is structurally atomic in Postgres; per-request UUID synthesis is intra-process and non-shared.)
## Forensic context
Three confirmed instances of missed reviews (PR #677, #748, #761, 2026-04-23/24) could not be diagnosed because GitHub App delivery history (14-day retention) and Railway logs were both expired. This table captures all webhook events from deployment of this PR onward. See `docs/incidents/reviewer-webhook-investigation.md` for investigation queries.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: minsky-ai[bot] <minsky-ai[bot]@users.noreply.github.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
mt#1125 guarded the boundary so empty-body model responses fail loudly instead of posting as silent zero-content reviews. This PR adds the follow-up retry: when GPT-5 (or any reasoning model) exhausts
max_completion_tokenson hidden reasoning tokens, a second attempt withreasoning_effort: "low"shifts the budget toward visible output and usually succeeds. One retry, no backoff, OpenAI-only — Google and Anthropic have no equivalent knob.Fixes the "reviewer failed loudly but PR still got no review" case observed 6+ times on Sprint B PRs before mt#1125 landed.
Changes
services/reviewer/src/providers.tsCallReviewerOptionsinterface with optionalreasoningEffort: "low" | "medium" | "high".callReviewer/callOpenAIaccept options and useoptions?.reasoningEffort ?? "medium"on reasoning-model calls. Non-OpenAI paths ignore the option.services/reviewer/src/review-worker.tscallReviewerWithRetry(config, systemPrompt, userPrompt, callReviewerFn?).attempt: "first-attempt-success",retryAttempted: false.reasoningEffort: "low"; returnsattempt: "retry-success" | "retry-failed"withretryAttempted: true.attempt: "retry-failed",retryAttempted: false.callReviewerFnparameter is the test seam.ReviewResultgainsattemptandretryAttemptedfor observability; successreasonnow includesattempt=….runReviewconsumes the helper; skip-notice behavior on final-empty is unchanged.Tests
callReviewerWithRetrycases via a fakeCallReviewerFnseam: first substantive (no retry), first empty OpenAI + retry substantive (reasoningEffort=low passed only on retry), first empty OpenAI + retry empty, Google empty (no retry), Anthropic empty (no retry), never-cascades-beyond-one-retry.Test plan
services/reviewersuite: 50 pass, 0 fail (was 44; +6 retry cases).tsc --noEmit: clean.validate_typecheck: 0 errors.validate_lint: 0 errors, 0 warnings.format:checkon changed files: clean.project_mt1110_calibration_data.md) tracks real occurrences and will show retry-success / retry-failed in logs once deployed.Observability
After deployment, server logs on empty-output events will include
attempt=retry-successorattempt=retry-failedso the mt#1110 calibration data can distinguish:max_completion_tokensfurther or shrink the prompt),Out of scope
reasoning_effortequivalent).