Skip to content

Review pipeline convergence detection — stop cycling when reviewers hallucinate #223

@dean0x

Description

@dean0x

Problem

Multi-cycle code reviews degrade into false positive generators. Observed on autobeat PR #186 (feat/176-tmux-abstraction-layer): 9 review cycles over 26 hours, with the final review producing 34 findings — all false positives. The branch was clean, but the pipeline couldn't tell.

Timeline (autobeat PR #186)

Cycle Time Verdict Key Findings Real Issues
1 05-17 00:13 CHANGES_REQUESTED 5 HIGH security — shell injection via unescaped cwd, communicationTargets All real
2 05-17 07:25 BLOCK MERGE Architecture — N concurrent isAlive() timers spawning N sync processes All real
3 05-17 07:59 CHANGES_REQUESTED Reliability — Map mutation during iteration, 16 total All real
4 05-17 14:20 CHANGES_REQUESTED CRITICAL — triggerExit ordering race, destroy() missing exited=true All real
5 05-17 21:45 CHANGES_REQUESTED Timer churn O(N²) during batch stale exit, mutable handle Mostly real
6 05-17 22:06 DO_NOT_MERGE 15 blocking issues — severity went UP after fixes Mixed — degradation signal
7 05-17 22:42 (unclear) Missing TmuxConnectorPort, DI inconsistency, path traversal Mixed
8 05-17 23:38 CHANGES_REQUESTED agentCommand/agentArgs unescaped, destroy() not calling onExit Real
9 05-18 12:55 CHANGES_REQUESTED 4 CRITICAL, 9 HIGH, 21 MEDIUM All 34 false positives

Key observations

  1. Reviews 1-4 were genuinely productive. Each found distinct, real bugs (shell injection, race conditions, iterator corruption). The fixes were substantive.

  2. Reviews 5-8 were mixed. Still finding some real issues (agentCommand escaping was legitimate), but signal-to-noise was dropping. Findings shifted from "this will crash" to "this function is 53 lines."

  3. Review 6 was the degradation signal. Severity went from CHANGES_REQUESTED UP to DO_NOT_MERGE after a round of fixes. Fixing bugs shouldn't make the verdict worse.

  4. Review 9 was 100% hallucinated. Every CRITICAL, HIGH, and MEDIUM finding had already been fixed by batch commits on the branch. Manual verification against actual code confirmed 0 real issues. The typecheck passed clean. All 164 tests passed.

What finally worked

Manual verification — reading each flagged line in the actual code and confirming the issue didn't exist. This is exactly what the Resolver agent does (resolver.md lines 33-42: "Does issue still exist in code? Reviewer understood context correctly?"). The Resolver's knowledge just never fed back into the next review cycle.

Root Cause Analysis

1. No cross-review memory

Each Reviewer agent receives (from reviewer.md lines 22-25):

  • Focus area, diff command, decisions context, feature knowledge, PR description

Each Reviewer agent does NOT receive:

  • Prior review findings
  • Resolution summaries (which issues were FIXED, FALSE_POSITIVE, or DEFERRED)
  • Fix commit context

Each review cycle starts from zero context about prior cycles. Reviewers analyze code at HEAD, pattern-match against domain rules, and generate findings without knowing that prior fix batches already addressed those exact patterns.

2. Resolution data dies after each cycle

The data flow is one-directional:

review:orch → writes {focus}.md → resolve:orch reads them → writes resolution-summary.md → ❌ nothing reads this

The resolution-summary.md contains exactly the data needed to prevent false positive re-generation (FALSE_POSITIVE classifications with reasoning, FIXED issues with commit SHAs), but it's never consumed by the next review cycle.

3. Incremental detection doesn't prevent hallucination

The .last-review-head mechanism (review:orch Phase 2) gates on whether new commits exist. But when fix commits exist, it runs a fresh review. Reviewer agents see the full code state and cannot distinguish "this pattern was introduced and then fixed within the branch" from "this pattern exists and is a problem."

4. Fix rounds create more hallucination surface

Each fix round adds code (try/catch wrappers, extracted helpers, validation functions). More code = more surface area for pattern-matching hallucinations. Reviewers see safeCallOnExit and flag "callbacks aren't wrapped" because they're analyzing diff shapes, not actual control flow. The signal-to-noise ratio inverts.

5. Synthesizer doesn't cross-reference

The Synthesizer in review mode (synthesizer.md, Mode: Review, lines 237-244) reads the current review's {focus}.md files and aggregates them. It doesn't compare against prior reviews' findings or resolution summaries.

Proposed Solutions

Solution 1: Feed resolution-summary.md back to reviewers (closes the loop)

Where: review:orch Phase 5, reviewer.md
Effort: Medium
Impact: Highest

In review:orch Phase 5, before spawning reviewers, read the most recent resolution-summary.md for this branch. Extract the FALSE_POSITIVE and FIXED tables. Pass as PRIOR_RESOLUTIONS to each reviewer.

Add to reviewer prompt: "These findings were validated by a Resolver and classified as false positives or already fixed. Do not re-raise them unless new code has re-introduced the problem."

Changes needed:

  • review:orch/skill.md: New sub-step in Phase 5 to read resolution-summary.md
  • reviewer.md: New optional input PRIOR_RESOLUTIONS, new responsibility to cross-reference
  • /code-review.md: Same changes for the full code-review command

Solution 2: Convergence gate via Resolver false-positive ratio

Where: review:orch new Phase 2b
Effort: Low
Impact: High

Add a convergence check between Phase 2 (Incremental Detection) and Phase 5 (Reviews):

Phase 2b: Convergence Check

  • Read all resolution-summary.md files for this branch
  • Extract false-positive ratio from the most recent one
  • Decision matrix:
Condition Action
Last resolution >70% false positives Warn: "Reviewers are hallucinating. Consider manual verification or merge."
Last 2 resolutions had 0 new real fixes Recommend merge with confidence
No prior resolutions exist Continue normally

This is the "when to stop" signal we were missing.

Solution 3: Replace second+ review with resolve-only pass

Where: review:orch Phase 2 (or new user guidance)
Effort: Low
Impact: High

Instead of running the full review pipeline repeatedly, after the first review+resolve cycle, skip the review and just re-run resolve:orch against the EXISTING review report. The Resolver already validates each finding against current code — if all remaining issues come back FALSE_POSITIVE, you're clean.

Benefits:

  • Cheaper (Resolvers are Sonnet, Reviewers are Opus)
  • More accurate (Resolvers read actual code at flagged lines, not just diff patterns)
  • Faster (no 8-19 parallel reviewer agents to spawn)

Flow change:

Current:  review → resolve → review → resolve → review → resolve → ... (9 cycles)
Proposed: review → resolve → re-resolve → done (convergence detected)

Solution 4: Reviewer self-verification step

Where: reviewer.md responsibilities
Effort: Low
Impact: High

Add a step after generating findings: "For each CRITICAL or HIGH finding, read the actual code at the flagged location (not just the diff) and confirm the issue exists in the current code. If the code already handles the concern (e.g., callbacks are already wrapped in try/catch), downgrade to FALSE_POSITIVE."

This is what manual verification did to break the cycle — and it can be baked into the reviewer agent itself.

Changes needed:

  • reviewer.md: New step 8.5 between "Assign severity" and "Filter by confidence"

Solution 5: Issue fingerprinting / dedup

Where: synthesizer.md review mode, review:orch Phase 6
Effort: High
Impact: Medium

Generate a fingerprint for each finding (hash of file path + line range + issue type). Compare against all prior review findings for this branch. If same fingerprint appeared before AND was resolved (fixed or false-positive), auto-classify as STALE and exclude from blocking count.

The Synthesizer already reads from disk — it just needs to also read prior review directories.

Priority Order

# Solution Effort Impact Quick Win?
2 Convergence gate Low High ✅ Yes — small addition to review:orch
4 Reviewer self-verification Low High ✅ Yes — small addition to reviewer.md
3 Re-resolve instead of re-review Low High ✅ Yes — guidance change
1 Prior resolution context Medium Highest No — plumbing through pipeline
5 Issue fingerprinting High Medium No — new infrastructure

Solutions 2, 3, and 4 are quick wins that could be implemented in a single session. Solution 1 is the most impactful but requires more plumbing. Solution 5 is the most robust but also the most complex.

Files to Modify

File Changes
shared/skills/review:orch/skill.md Add Phase 2b (convergence gate), modify Phase 5 (prior resolutions)
shared/agents/reviewer.md Add self-verification step, add PRIOR_RESOLUTIONS input
shared/agents/synthesizer.md Cross-reference prior reviews in review mode
plugins/devflow-code-review/commands/code-review.md Mirror changes from review:orch
shared/skills/resolve:orch/skill.md No changes needed — already has the right validation logic
shared/agents/resolver.md No changes needed — already validates against current code

Evidence Artifacts

The full review history is preserved on disk in the autobeat repo:

  • .docs/reviews/feat-176-tmux-abstraction-layer/ — 9 timestamped review directories
  • Each contains per-focus review reports, resolution summaries, and the review-summary.md
  • The 2026-05-18_1255 directory contains the fully-hallucinated review for comparison

Acceptance Criteria

  • After a resolve cycle with >70% false positives, the next review cycle warns the user instead of silently producing another hallucinated report
  • Reviewer agents receive prior resolution context and do not re-flag resolved issues
  • The pipeline can recommend merge after 2 consecutive clean resolve passes
  • A re-resolve path exists as an alternative to a full re-review after fixes

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions