Skip to content

feat(mt#1131): Reviewer retry with reasoning_effort=low on empty output#748

Merged
edobry merged 2 commits into
mainfrom
task/mt-1131
Apr 24, 2026
Merged

feat(mt#1131): Reviewer retry with reasoning_effort=low on empty output#748
edobry merged 2 commits into
mainfrom
task/mt-1131

Conversation

@minsky-ai
Copy link
Copy Markdown
Contributor

@minsky-ai minsky-ai Bot commented Apr 24, 2026

Summary

mt#1125 guarded the boundary so empty-body model responses fail loudly instead of posting as silent zero-content reviews. This PR adds the follow-up retry: when GPT-5 (or any reasoning model) exhausts max_completion_tokens on hidden reasoning tokens, a second attempt with reasoning_effort: "low" shifts the budget toward visible output and usually succeeds. One retry, no backoff, OpenAI-only — Google and Anthropic have no equivalent knob.

Fixes the "reviewer failed loudly but PR still got no review" case observed 6+ times on Sprint B PRs before mt#1125 landed.

Changes

services/reviewer/src/providers.ts

  • New CallReviewerOptions interface with optional reasoningEffort: "low" | "medium" | "high".
  • callReviewer / callOpenAI accept options and use options?.reasoningEffort ?? "medium" on reasoning-model calls. Non-OpenAI paths ignore the option.

services/reviewer/src/review-worker.ts

  • Extract callReviewerWithRetry(config, systemPrompt, userPrompt, callReviewerFn?).
    • First call → if validates, returns attempt: "first-attempt-success", retryAttempted: false.
    • Empty + OpenAI → retry once with reasoningEffort: "low"; returns attempt: "retry-success" | "retry-failed" with retryAttempted: true.
    • Empty + non-OpenAI → no retry possible, returns attempt: "retry-failed", retryAttempted: false.
    • callReviewerFn parameter is the test seam.
  • ReviewResult gains attempt and retryAttempted for observability; success reason now includes attempt=….
  • runReview consumes the helper; skip-notice behavior on final-empty is unchanged.

Tests

  • 6 new callReviewerWithRetry cases via a fake CallReviewerFn seam: first substantive (no retry), first empty OpenAI + retry substantive (reasoningEffort=low passed only on retry), first empty OpenAI + retry empty, Google empty (no retry), Anthropic empty (no retry), never-cascades-beyond-one-retry.

Test plan

  • services/reviewer suite: 50 pass, 0 fail (was 44; +6 retry cases).
  • Service-local tsc --noEmit: clean.
  • Workspace validate_typecheck: 0 errors.
  • Workspace validate_lint: 0 errors, 0 warnings.
  • format:check on changed files: clean.
  • End-to-end verification against a live reasoning-exhausted PR — deferred; the calibration data in mt#1110 (project_mt1110_calibration_data.md) tracks real occurrences and will show retry-success / retry-failed in logs once deployed.

Observability

After deployment, server logs on empty-output events will include attempt=retry-success or attempt=retry-failed so the mt#1110 calibration data can distinguish:

  • how often the retry path rescues a review (expected primary mode),
  • how often both attempts fail (indicates the model cannot fit the diff in budget at all — different remediation: raise max_completion_tokens further or shrink the prompt),
  • how often the first attempt succeeds (baseline; should be the majority).

Out of scope

  • Retry for non-empty-but-invalid content.
  • Cross-provider fallback.
  • Exponential backoff / >1 retry (spec explicitly excludes cascading).
  • Extending retry to Google / Anthropic (no reasoning_effort equivalent).

mt#1125 guarded the boundary so empty-body model responses fail loudly instead of posting as silent zero-content reviews. This PR adds the missing retry: when GPT-5 exhausts max_completion_tokens on reasoning, a second attempt with reasoning_effort=low shifts the budget toward visible output and usually succeeds. One retry, no backoff, OpenAI-only (Google and Anthropic have no equivalent knob).

- providers.ts: add CallReviewerOptions with reasoningEffort; thread through callReviewer and callOpenAI. Default stays medium; retry passes low.
- review-worker.ts: extract callReviewerWithRetry helper that encapsulates the single-retry-on-empty rule. Returns attempt trace (first-attempt-success | retry-success | retry-failed) plus a retryAttempted flag for observability. runReview consumes the helper and propagates the trace onto ReviewResult.reason and ReviewResult.attempt so the server log can distinguish the cases.
- Tests: 6 new cases against the extracted helper via a fake callReviewer seam covering first success, first empty + retry success, first empty + retry empty, Google empty (no retry), Anthropic empty (no retry), and never-cascades-beyond-one-retry.

Total reviewer-service tests: 50 pass, 0 fail.
@minsky-ai minsky-ai Bot added the authorship/co-authored Co-authored by human and AI agent label Apr 24, 2026
Resolves conflicts in the reviewer service between mt#1131 (retry with reasoning_effort=low on empty output) and mt#1126 (tool-use loop for OpenAI provider). Both features belong together: callReviewer now accepts (config, systemPrompt, userPrompt, tools?, options?); callReviewerWithRetry threads tools through both attempts. Also picks up mt#1085 tier-lookup changes in runReview. Tests: 113 pass (up from 50, combines mt#1131 retry tests + mt#1126 tool-context tests + mt#1085 tier-routing expansion).
@edobry edobry merged commit 17badb8 into main Apr 24, 2026
2 checks passed
@edobry edobry deleted the task/mt-1131 branch April 24, 2026 06:44
edobry added a commit that referenced this pull request May 11, 2026
…+ post-mortem docs (mt#1372)

## Summary

Implements the rescoped mt#1372 forward instrumentation: every webhook delivery received by the reviewer service is now persisted to a `reviewer_webhook_events` Postgres table, enabling forensic analysis of missed-review incidents that would otherwise leave no trace after Railway log and GitHub App delivery-history retention windows expire.

**What ships:**

- **Drizzle schema** (`src/db/schemas/webhook-events-schema.ts`): `reviewer_webhook_events` table with `delivery_id` UNIQUE, `event_type`, `headers` (jsonb subset), `body` (jsonb), `outcome` (pgEnum, 8 values), `error_details` (jsonb), `received_at`, `processed_at`; three indexes
- **SQL migration** (`migrations/pg/0001_webhook_events.sql`): creates enum type and table; journal updated
- **Persistence module** (`src/webhook-events.ts`): `recordWebhookReceipt` (ON CONFLICT DO NOTHING — first-delivery wins, terminal outcomes preserved on re-delivery), `updateOutcome` (sets `processedAt` on terminal outcomes), `pruneOldRows` (returns count or -1), `extractPersistedHeaders` (truncates HMAC signature to 12-char prefix)
- **server.ts wiring**: synthesizes `synthetic-${crypto.randomUUID()}` for missing x-github-delivery (R2 fix); `recordWebhookReceipt` on every POST regardless of header presence (R1 fix #3); `updateOutcome` at each pipeline stage (reviewer_called, review_submitted, failed_at_reviewer, skipped, failed_at_signature, failed_at_tier_resolve), 24h in-process retention pruner, `log.error("webhook_processing_failed", ...)` OperatorNotify on dispatch/reviewer failures
- **Unit tests** (`src/webhook-events.test.ts` + `src/server.test.ts`): 19 + 9 tests, including 3 pinned regression tests for R1/R2 BLOCKING findings
- **Smoke script** (`scripts/smoke-webhook-events.ts`): POSTs a signed webhook, queries the DB 500ms later via raw SQL (decoupled from src/ TS imports — R1 fix #2), asserts row shape; skips gracefully when env vars are absent (exit 0 SKIP)
- **Post-mortem guide** (`docs/incidents/reviewer-webhook-investigation.md`): 6 SQL queries, investigation workflow, Railway log correlation, retention policy reference

**Scope boundary** (per task spec): no changes to review-worker.ts, providers.ts, sweeper.ts, prior-review-summary.ts, mcp-client.ts, tier-routing.ts, config.ts, logger.ts.

## R1 + R2 review findings addressed

- [R1 BLOCKING] UPSERT overwrites terminal outcomes on re-delivery → onConflictDoUpdate → onConflictDoNothing (commit `9d6268792`)
- [R1 BLOCKING] Smoke script imports TS schema from src/ → switched to raw SQL via postgres-js tagged template (commit `9d6268792`)
- [R1 BLOCKING] POSTs missing x-github-event are skipped → persist every POST with "unknown" sentinel (commit `9d6268792`)
- [R2 BLOCKING] Missing x-github-delivery collapses all malformed POSTs into one row → synthesize `synthetic-${crypto.randomUUID()}` per request (commit `5ca07ab11`)

## Test plan

- [x] `bun test src/webhook-events.test.ts src/server.test.ts` — 28/28 pass (see Execution evidence below)
- [x] `bun run scripts/smoke-webhook-events.ts` — exits 0 (SKIP) when env vars absent
- [ ] Post-deploy: run smoke script with `MINSKY_REVIEWER_WEBHOOK_SECRET`, `MINSKY_SESSIONDB_POSTGRES_URL`, and `SMOKE_REVIEWER_BASE_URL` set against the Railway environment — asserts row in `reviewer_webhook_events` with correct fields

### Execution evidence:

```
$ cd services/reviewer && bun test --preload ../../tests/setup.ts --timeout=15000 src/webhook-events.test.ts src/server.test.ts

(pass) extractPersistedHeaders > extracts all four canonical headers when present
(pass) extractPersistedHeaders > omits absent headers rather than storing null/undefined
(pass) extractPersistedHeaders > partial headers — only present ones are included
(pass) extractPersistedHeaders > signature prefix is always 12 chars even for short signatures
(pass) recordWebhookReceipt > calls db.insert with correct arguments
(pass) recordWebhookReceipt > does not throw when DB insert fails
(pass) recordWebhookReceipt > handles non-JSON body by wrapping in { raw } object
(pass) recordWebhookReceipt > uses ON CONFLICT DO NOTHING (R1 BLOCKING #1 regression — re-delivery must not overwrite terminal outcomes)
(pass) recordWebhookReceipt > accepts 'unknown' sentinel eventType (R1 BLOCKING #3 regression — every POST is persisted)
(pass) updateOutcome > calls db.update and db.where with the delivery ID
(pass) updateOutcome > does NOT set processedAt for non-terminal outcomes
(pass) updateOutcome > includes errorDetails when provided
(pass) updateOutcome > does not throw when DB update fails
(pass) updateOutcome > skipped is a terminal outcome
(pass) updateOutcome > received is a non-terminal outcome
(pass) pruneOldRows > calls db.delete and returns the row count
(pass) pruneOldRows > returns -1 when DB delete fails
(pass) pruneOldRows > returns 0 when no rows are deleted
(pass) pruneOldRows > uses default retention of 90 days when not specified
(pass) webhook handler > responds 200 immediately for pull_request.opened, before runReview finishes
(pass) webhook handler > responds 400 when signature header is absent
(pass) webhook handler > responds 401 when signature is invalid
(pass) webhook handler > skips draft PRs — returns 200 without scheduling a review
(pass) webhook handler > /health returns inflightCount
(pass) webhook handler > missing x-github-delivery synthesizes a unique synthetic-<uuid> per request (R2 BLOCKING regression)
(pass) gracefulShutdown > logs shutdown_drain_start with correct inflightCount, then shutdown_drain_complete
(pass) gracefulShutdown > shutdown_drain_start carries inflightCount=0 when no reviews in flight
(pass) gracefulShutdown > review errors do NOT prevent graceful shutdown from completing

 28 pass
 0 fail
 55 expect() calls
Ran 28 tests across 2 files. [320.00ms]
```

## Concurrency analysis

(N/A — no check-then-act pattern introduced. ON CONFLICT DO NOTHING is structurally atomic in Postgres; per-request UUID synthesis is intra-process and non-shared.)

## Forensic context

Three confirmed instances of missed reviews (PR #677, #748, #761, 2026-04-23/24) could not be diagnosed because GitHub App delivery history (14-day retention) and Railway logs were both expired. This table captures all webhook events from deployment of this PR onward. See `docs/incidents/reviewer-webhook-investigation.md` for investigation queries.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: minsky-ai[bot] <minsky-ai[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

authorship/co-authored Co-authored by human and AI agent

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant