Problem
Multi-cycle code reviews degrade into false positive generators. Observed on autobeat PR #186 (feat/176-tmux-abstraction-layer): 9 review cycles over 26 hours, with the final review producing 34 findings — all false positives. The branch was clean, but the pipeline couldn't tell.
Timeline (autobeat PR #186)
| Cycle |
Time |
Verdict |
Key Findings |
Real Issues |
| 1 |
05-17 00:13 |
CHANGES_REQUESTED |
5 HIGH security — shell injection via unescaped cwd, communicationTargets |
All real |
| 2 |
05-17 07:25 |
BLOCK MERGE |
Architecture — N concurrent isAlive() timers spawning N sync processes |
All real |
| 3 |
05-17 07:59 |
CHANGES_REQUESTED |
Reliability — Map mutation during iteration, 16 total |
All real |
| 4 |
05-17 14:20 |
CHANGES_REQUESTED |
CRITICAL — triggerExit ordering race, destroy() missing exited=true |
All real |
| 5 |
05-17 21:45 |
CHANGES_REQUESTED |
Timer churn O(N²) during batch stale exit, mutable handle |
Mostly real |
| 6 |
05-17 22:06 |
DO_NOT_MERGE |
15 blocking issues — severity went UP after fixes |
Mixed — degradation signal |
| 7 |
05-17 22:42 |
(unclear) |
Missing TmuxConnectorPort, DI inconsistency, path traversal |
Mixed |
| 8 |
05-17 23:38 |
CHANGES_REQUESTED |
agentCommand/agentArgs unescaped, destroy() not calling onExit |
Real |
| 9 |
05-18 12:55 |
CHANGES_REQUESTED |
4 CRITICAL, 9 HIGH, 21 MEDIUM |
All 34 false positives |
Key observations
-
Reviews 1-4 were genuinely productive. Each found distinct, real bugs (shell injection, race conditions, iterator corruption). The fixes were substantive.
-
Reviews 5-8 were mixed. Still finding some real issues (agentCommand escaping was legitimate), but signal-to-noise was dropping. Findings shifted from "this will crash" to "this function is 53 lines."
-
Review 6 was the degradation signal. Severity went from CHANGES_REQUESTED UP to DO_NOT_MERGE after a round of fixes. Fixing bugs shouldn't make the verdict worse.
-
Review 9 was 100% hallucinated. Every CRITICAL, HIGH, and MEDIUM finding had already been fixed by batch commits on the branch. Manual verification against actual code confirmed 0 real issues. The typecheck passed clean. All 164 tests passed.
What finally worked
Manual verification — reading each flagged line in the actual code and confirming the issue didn't exist. This is exactly what the Resolver agent does (resolver.md lines 33-42: "Does issue still exist in code? Reviewer understood context correctly?"). The Resolver's knowledge just never fed back into the next review cycle.
Root Cause Analysis
1. No cross-review memory
Each Reviewer agent receives (from reviewer.md lines 22-25):
- Focus area, diff command, decisions context, feature knowledge, PR description
Each Reviewer agent does NOT receive:
- Prior review findings
- Resolution summaries (which issues were FIXED, FALSE_POSITIVE, or DEFERRED)
- Fix commit context
Each review cycle starts from zero context about prior cycles. Reviewers analyze code at HEAD, pattern-match against domain rules, and generate findings without knowing that prior fix batches already addressed those exact patterns.
2. Resolution data dies after each cycle
The data flow is one-directional:
review:orch → writes {focus}.md → resolve:orch reads them → writes resolution-summary.md → ❌ nothing reads this
The resolution-summary.md contains exactly the data needed to prevent false positive re-generation (FALSE_POSITIVE classifications with reasoning, FIXED issues with commit SHAs), but it's never consumed by the next review cycle.
3. Incremental detection doesn't prevent hallucination
The .last-review-head mechanism (review:orch Phase 2) gates on whether new commits exist. But when fix commits exist, it runs a fresh review. Reviewer agents see the full code state and cannot distinguish "this pattern was introduced and then fixed within the branch" from "this pattern exists and is a problem."
4. Fix rounds create more hallucination surface
Each fix round adds code (try/catch wrappers, extracted helpers, validation functions). More code = more surface area for pattern-matching hallucinations. Reviewers see safeCallOnExit and flag "callbacks aren't wrapped" because they're analyzing diff shapes, not actual control flow. The signal-to-noise ratio inverts.
5. Synthesizer doesn't cross-reference
The Synthesizer in review mode (synthesizer.md, Mode: Review, lines 237-244) reads the current review's {focus}.md files and aggregates them. It doesn't compare against prior reviews' findings or resolution summaries.
Proposed Solutions
Solution 1: Feed resolution-summary.md back to reviewers (closes the loop)
Where: review:orch Phase 5, reviewer.md
Effort: Medium
Impact: Highest
In review:orch Phase 5, before spawning reviewers, read the most recent resolution-summary.md for this branch. Extract the FALSE_POSITIVE and FIXED tables. Pass as PRIOR_RESOLUTIONS to each reviewer.
Add to reviewer prompt: "These findings were validated by a Resolver and classified as false positives or already fixed. Do not re-raise them unless new code has re-introduced the problem."
Changes needed:
review:orch/skill.md: New sub-step in Phase 5 to read resolution-summary.md
reviewer.md: New optional input PRIOR_RESOLUTIONS, new responsibility to cross-reference
/code-review.md: Same changes for the full code-review command
Solution 2: Convergence gate via Resolver false-positive ratio
Where: review:orch new Phase 2b
Effort: Low
Impact: High
Add a convergence check between Phase 2 (Incremental Detection) and Phase 5 (Reviews):
Phase 2b: Convergence Check
- Read all
resolution-summary.md files for this branch
- Extract false-positive ratio from the most recent one
- Decision matrix:
| Condition |
Action |
| Last resolution >70% false positives |
Warn: "Reviewers are hallucinating. Consider manual verification or merge." |
| Last 2 resolutions had 0 new real fixes |
Recommend merge with confidence |
| No prior resolutions exist |
Continue normally |
This is the "when to stop" signal we were missing.
Solution 3: Replace second+ review with resolve-only pass
Where: review:orch Phase 2 (or new user guidance)
Effort: Low
Impact: High
Instead of running the full review pipeline repeatedly, after the first review+resolve cycle, skip the review and just re-run resolve:orch against the EXISTING review report. The Resolver already validates each finding against current code — if all remaining issues come back FALSE_POSITIVE, you're clean.
Benefits:
- Cheaper (Resolvers are Sonnet, Reviewers are Opus)
- More accurate (Resolvers read actual code at flagged lines, not just diff patterns)
- Faster (no 8-19 parallel reviewer agents to spawn)
Flow change:
Current: review → resolve → review → resolve → review → resolve → ... (9 cycles)
Proposed: review → resolve → re-resolve → done (convergence detected)
Solution 4: Reviewer self-verification step
Where: reviewer.md responsibilities
Effort: Low
Impact: High
Add a step after generating findings: "For each CRITICAL or HIGH finding, read the actual code at the flagged location (not just the diff) and confirm the issue exists in the current code. If the code already handles the concern (e.g., callbacks are already wrapped in try/catch), downgrade to FALSE_POSITIVE."
This is what manual verification did to break the cycle — and it can be baked into the reviewer agent itself.
Changes needed:
reviewer.md: New step 8.5 between "Assign severity" and "Filter by confidence"
Solution 5: Issue fingerprinting / dedup
Where: synthesizer.md review mode, review:orch Phase 6
Effort: High
Impact: Medium
Generate a fingerprint for each finding (hash of file path + line range + issue type). Compare against all prior review findings for this branch. If same fingerprint appeared before AND was resolved (fixed or false-positive), auto-classify as STALE and exclude from blocking count.
The Synthesizer already reads from disk — it just needs to also read prior review directories.
Priority Order
| # |
Solution |
Effort |
Impact |
Quick Win? |
| 2 |
Convergence gate |
Low |
High |
✅ Yes — small addition to review:orch |
| 4 |
Reviewer self-verification |
Low |
High |
✅ Yes — small addition to reviewer.md |
| 3 |
Re-resolve instead of re-review |
Low |
High |
✅ Yes — guidance change |
| 1 |
Prior resolution context |
Medium |
Highest |
No — plumbing through pipeline |
| 5 |
Issue fingerprinting |
High |
Medium |
No — new infrastructure |
Solutions 2, 3, and 4 are quick wins that could be implemented in a single session. Solution 1 is the most impactful but requires more plumbing. Solution 5 is the most robust but also the most complex.
Files to Modify
| File |
Changes |
shared/skills/review:orch/skill.md |
Add Phase 2b (convergence gate), modify Phase 5 (prior resolutions) |
shared/agents/reviewer.md |
Add self-verification step, add PRIOR_RESOLUTIONS input |
shared/agents/synthesizer.md |
Cross-reference prior reviews in review mode |
plugins/devflow-code-review/commands/code-review.md |
Mirror changes from review:orch |
shared/skills/resolve:orch/skill.md |
No changes needed — already has the right validation logic |
shared/agents/resolver.md |
No changes needed — already validates against current code |
Evidence Artifacts
The full review history is preserved on disk in the autobeat repo:
.docs/reviews/feat-176-tmux-abstraction-layer/ — 9 timestamped review directories
- Each contains per-focus review reports, resolution summaries, and the review-summary.md
- The 2026-05-18_1255 directory contains the fully-hallucinated review for comparison
Acceptance Criteria
Problem
Multi-cycle code reviews degrade into false positive generators. Observed on autobeat PR #186 (feat/176-tmux-abstraction-layer): 9 review cycles over 26 hours, with the final review producing 34 findings — all false positives. The branch was clean, but the pipeline couldn't tell.
Timeline (autobeat PR #186)
Key observations
Reviews 1-4 were genuinely productive. Each found distinct, real bugs (shell injection, race conditions, iterator corruption). The fixes were substantive.
Reviews 5-8 were mixed. Still finding some real issues (agentCommand escaping was legitimate), but signal-to-noise was dropping. Findings shifted from "this will crash" to "this function is 53 lines."
Review 6 was the degradation signal. Severity went from CHANGES_REQUESTED UP to DO_NOT_MERGE after a round of fixes. Fixing bugs shouldn't make the verdict worse.
Review 9 was 100% hallucinated. Every CRITICAL, HIGH, and MEDIUM finding had already been fixed by batch commits on the branch. Manual verification against actual code confirmed 0 real issues. The typecheck passed clean. All 164 tests passed.
What finally worked
Manual verification — reading each flagged line in the actual code and confirming the issue didn't exist. This is exactly what the Resolver agent does (resolver.md lines 33-42: "Does issue still exist in code? Reviewer understood context correctly?"). The Resolver's knowledge just never fed back into the next review cycle.
Root Cause Analysis
1. No cross-review memory
Each Reviewer agent receives (from
reviewer.mdlines 22-25):Each Reviewer agent does NOT receive:
Each review cycle starts from zero context about prior cycles. Reviewers analyze code at HEAD, pattern-match against domain rules, and generate findings without knowing that prior fix batches already addressed those exact patterns.
2. Resolution data dies after each cycle
The data flow is one-directional:
The
resolution-summary.mdcontains exactly the data needed to prevent false positive re-generation (FALSE_POSITIVE classifications with reasoning, FIXED issues with commit SHAs), but it's never consumed by the next review cycle.3. Incremental detection doesn't prevent hallucination
The
.last-review-headmechanism (review:orch Phase 2) gates on whether new commits exist. But when fix commits exist, it runs a fresh review. Reviewer agents see the full code state and cannot distinguish "this pattern was introduced and then fixed within the branch" from "this pattern exists and is a problem."4. Fix rounds create more hallucination surface
Each fix round adds code (try/catch wrappers, extracted helpers, validation functions). More code = more surface area for pattern-matching hallucinations. Reviewers see
safeCallOnExitand flag "callbacks aren't wrapped" because they're analyzing diff shapes, not actual control flow. The signal-to-noise ratio inverts.5. Synthesizer doesn't cross-reference
The Synthesizer in review mode (synthesizer.md, Mode: Review, lines 237-244) reads the current review's
{focus}.mdfiles and aggregates them. It doesn't compare against prior reviews' findings or resolution summaries.Proposed Solutions
Solution 1: Feed resolution-summary.md back to reviewers (closes the loop)
Where:
review:orchPhase 5,reviewer.mdEffort: Medium
Impact: Highest
In
review:orchPhase 5, before spawning reviewers, read the most recentresolution-summary.mdfor this branch. Extract the FALSE_POSITIVE and FIXED tables. Pass asPRIOR_RESOLUTIONSto each reviewer.Add to reviewer prompt: "These findings were validated by a Resolver and classified as false positives or already fixed. Do not re-raise them unless new code has re-introduced the problem."
Changes needed:
review:orch/skill.md: New sub-step in Phase 5 to read resolution-summary.mdreviewer.md: New optional inputPRIOR_RESOLUTIONS, new responsibility to cross-reference/code-review.md: Same changes for the full code-review commandSolution 2: Convergence gate via Resolver false-positive ratio
Where:
review:orchnew Phase 2bEffort: Low
Impact: High
Add a convergence check between Phase 2 (Incremental Detection) and Phase 5 (Reviews):
Phase 2b: Convergence Check
resolution-summary.mdfiles for this branchThis is the "when to stop" signal we were missing.
Solution 3: Replace second+ review with resolve-only pass
Where:
review:orchPhase 2 (or new user guidance)Effort: Low
Impact: High
Instead of running the full review pipeline repeatedly, after the first review+resolve cycle, skip the review and just re-run
resolve:orchagainst the EXISTING review report. The Resolver already validates each finding against current code — if all remaining issues come back FALSE_POSITIVE, you're clean.Benefits:
Flow change:
Solution 4: Reviewer self-verification step
Where:
reviewer.mdresponsibilitiesEffort: Low
Impact: High
Add a step after generating findings: "For each CRITICAL or HIGH finding, read the actual code at the flagged location (not just the diff) and confirm the issue exists in the current code. If the code already handles the concern (e.g., callbacks are already wrapped in try/catch), downgrade to FALSE_POSITIVE."
This is what manual verification did to break the cycle — and it can be baked into the reviewer agent itself.
Changes needed:
reviewer.md: New step 8.5 between "Assign severity" and "Filter by confidence"Solution 5: Issue fingerprinting / dedup
Where:
synthesizer.mdreview mode,review:orchPhase 6Effort: High
Impact: Medium
Generate a fingerprint for each finding (hash of file path + line range + issue type). Compare against all prior review findings for this branch. If same fingerprint appeared before AND was resolved (fixed or false-positive), auto-classify as STALE and exclude from blocking count.
The Synthesizer already reads from disk — it just needs to also read prior review directories.
Priority Order
Solutions 2, 3, and 4 are quick wins that could be implemented in a single session. Solution 1 is the most impactful but requires more plumbing. Solution 5 is the most robust but also the most complex.
Files to Modify
shared/skills/review:orch/skill.mdshared/agents/reviewer.mdshared/agents/synthesizer.mdplugins/devflow-code-review/commands/code-review.mdshared/skills/resolve:orch/skill.mdshared/agents/resolver.mdEvidence Artifacts
The full review history is preserved on disk in the autobeat repo:
.docs/reviews/feat-176-tmux-abstraction-layer/— 9 timestamped review directoriesAcceptance Criteria