Review pipeline convergence detection — stop cycling when reviewers hallucinate

## Problem

Multi-cycle code reviews degrade into false positive generators. Observed on autobeat PR #186 (feat/176-tmux-abstraction-layer): **9 review cycles over 26 hours**, with the final review producing **34 findings — all false positives**. The branch was clean, but the pipeline couldn't tell.

### Timeline (autobeat PR #186)

| Cycle | Time | Verdict | Key Findings | Real Issues |
|-------|------|---------|-------------|-------------|
| 1 | 05-17 00:13 | CHANGES_REQUESTED | 5 HIGH security — shell injection via unescaped cwd, communicationTargets | All real |
| 2 | 05-17 07:25 | BLOCK MERGE | Architecture — N concurrent isAlive() timers spawning N sync processes | All real |
| 3 | 05-17 07:59 | CHANGES_REQUESTED | Reliability — Map mutation during iteration, 16 total | All real |
| 4 | 05-17 14:20 | CHANGES_REQUESTED | CRITICAL — triggerExit ordering race, destroy() missing exited=true | All real |
| 5 | 05-17 21:45 | CHANGES_REQUESTED | Timer churn O(N²) during batch stale exit, mutable handle | Mostly real |
| 6 | 05-17 22:06 | DO_NOT_MERGE | 15 blocking issues — severity went UP after fixes | Mixed — degradation signal |
| 7 | 05-17 22:42 | (unclear) | Missing TmuxConnectorPort, DI inconsistency, path traversal | Mixed |
| 8 | 05-17 23:38 | CHANGES_REQUESTED | agentCommand/agentArgs unescaped, destroy() not calling onExit | Real |
| 9 | 05-18 12:55 | CHANGES_REQUESTED | 4 CRITICAL, 9 HIGH, 21 MEDIUM | **All 34 false positives** |

### Key observations

1. **Reviews 1-4 were genuinely productive.** Each found distinct, real bugs (shell injection, race conditions, iterator corruption). The fixes were substantive.

2. **Reviews 5-8 were mixed.** Still finding some real issues (agentCommand escaping was legitimate), but signal-to-noise was dropping. Findings shifted from "this will crash" to "this function is 53 lines."

3. **Review 6 was the degradation signal.** Severity went from CHANGES_REQUESTED UP to DO_NOT_MERGE after a round of fixes. Fixing bugs shouldn't make the verdict worse.

4. **Review 9 was 100% hallucinated.** Every CRITICAL, HIGH, and MEDIUM finding had already been fixed by batch commits on the branch. Manual verification against actual code confirmed 0 real issues. The typecheck passed clean. All 164 tests passed.

### What finally worked

Manual verification — reading each flagged line in the actual code and confirming the issue didn't exist. This is exactly what the Resolver agent does (resolver.md lines 33-42: "Does issue still exist in code? Reviewer understood context correctly?"). The Resolver's knowledge just never fed back into the next review cycle.

## Root Cause Analysis

### 1. No cross-review memory

Each Reviewer agent receives (from `reviewer.md` lines 22-25):
- Focus area, diff command, decisions context, feature knowledge, PR description

Each Reviewer agent does NOT receive:
- Prior review findings
- Resolution summaries (which issues were FIXED, FALSE_POSITIVE, or DEFERRED)
- Fix commit context

Each review cycle starts from zero context about prior cycles. Reviewers analyze code at HEAD, pattern-match against domain rules, and generate findings without knowing that prior fix batches already addressed those exact patterns.

### 2. Resolution data dies after each cycle

The data flow is one-directional:

```
review:orch → writes {focus}.md → resolve:orch reads them → writes resolution-summary.md → ❌ nothing reads this
```

The `resolution-summary.md` contains exactly the data needed to prevent false positive re-generation (FALSE_POSITIVE classifications with reasoning, FIXED issues with commit SHAs), but it's never consumed by the next review cycle.

### 3. Incremental detection doesn't prevent hallucination

The `.last-review-head` mechanism (review:orch Phase 2) gates on whether new commits exist. But when fix commits exist, it runs a fresh review. Reviewer agents see the full code state and cannot distinguish "this pattern was introduced and then fixed within the branch" from "this pattern exists and is a problem."

### 4. Fix rounds create more hallucination surface

Each fix round adds code (try/catch wrappers, extracted helpers, validation functions). More code = more surface area for pattern-matching hallucinations. Reviewers see `safeCallOnExit` and flag "callbacks aren't wrapped" because they're analyzing diff shapes, not actual control flow. The signal-to-noise ratio inverts.

### 5. Synthesizer doesn't cross-reference

The Synthesizer in review mode (synthesizer.md, Mode: Review, lines 237-244) reads the current review's `{focus}.md` files and aggregates them. It doesn't compare against prior reviews' findings or resolution summaries.

## Proposed Solutions

### Solution 1: Feed resolution-summary.md back to reviewers (closes the loop)

**Where:** `review:orch` Phase 5, `reviewer.md`
**Effort:** Medium
**Impact:** Highest

In `review:orch` Phase 5, before spawning reviewers, read the most recent `resolution-summary.md` for this branch. Extract the FALSE_POSITIVE and FIXED tables. Pass as `PRIOR_RESOLUTIONS` to each reviewer.

Add to reviewer prompt: *"These findings were validated by a Resolver and classified as false positives or already fixed. Do not re-raise them unless new code has re-introduced the problem."*

Changes needed:
- `review:orch/skill.md`: New sub-step in Phase 5 to read resolution-summary.md
- `reviewer.md`: New optional input `PRIOR_RESOLUTIONS`, new responsibility to cross-reference
- `/code-review.md`: Same changes for the full code-review command

### Solution 2: Convergence gate via Resolver false-positive ratio

**Where:** `review:orch` new Phase 2b
**Effort:** Low
**Impact:** High

Add a convergence check between Phase 2 (Incremental Detection) and Phase 5 (Reviews):

**Phase 2b: Convergence Check**
- Read all `resolution-summary.md` files for this branch
- Extract false-positive ratio from the most recent one
- Decision matrix:

| Condition | Action |
|-----------|--------|
| Last resolution >70% false positives | Warn: "Reviewers are hallucinating. Consider manual verification or merge." |
| Last 2 resolutions had 0 new real fixes | Recommend merge with confidence |
| No prior resolutions exist | Continue normally |

This is the "when to stop" signal we were missing.

### Solution 3: Replace second+ review with resolve-only pass

**Where:** `review:orch` Phase 2 (or new user guidance)
**Effort:** Low
**Impact:** High

Instead of running the full review pipeline repeatedly, after the first review+resolve cycle, skip the review and just re-run `resolve:orch` against the EXISTING review report. The Resolver already validates each finding against current code — if all remaining issues come back FALSE_POSITIVE, you're clean.

Benefits:
- Cheaper (Resolvers are Sonnet, Reviewers are Opus)
- More accurate (Resolvers read actual code at flagged lines, not just diff patterns)
- Faster (no 8-19 parallel reviewer agents to spawn)

Flow change:
```
Current:  review → resolve → review → resolve → review → resolve → ... (9 cycles)
Proposed: review → resolve → re-resolve → done (convergence detected)
```

### Solution 4: Reviewer self-verification step

**Where:** `reviewer.md` responsibilities
**Effort:** Low
**Impact:** High

Add a step after generating findings: *"For each CRITICAL or HIGH finding, read the actual code at the flagged location (not just the diff) and confirm the issue exists in the current code. If the code already handles the concern (e.g., callbacks are already wrapped in try/catch), downgrade to FALSE_POSITIVE."*

This is what manual verification did to break the cycle — and it can be baked into the reviewer agent itself.

Changes needed:
- `reviewer.md`: New step 8.5 between "Assign severity" and "Filter by confidence"

### Solution 5: Issue fingerprinting / dedup

**Where:** `synthesizer.md` review mode, `review:orch` Phase 6
**Effort:** High
**Impact:** Medium

Generate a fingerprint for each finding (hash of file path + line range + issue type). Compare against all prior review findings for this branch. If same fingerprint appeared before AND was resolved (fixed or false-positive), auto-classify as STALE and exclude from blocking count.

The Synthesizer already reads from disk — it just needs to also read prior review directories.

## Priority Order

| # | Solution | Effort | Impact | Quick Win? |
|---|----------|--------|--------|------------|
| 2 | Convergence gate | Low | High | ✅ Yes — small addition to review:orch |
| 4 | Reviewer self-verification | Low | High | ✅ Yes — small addition to reviewer.md |
| 3 | Re-resolve instead of re-review | Low | High | ✅ Yes — guidance change |
| 1 | Prior resolution context | Medium | Highest | No — plumbing through pipeline |
| 5 | Issue fingerprinting | High | Medium | No — new infrastructure |

Solutions 2, 3, and 4 are quick wins that could be implemented in a single session. Solution 1 is the most impactful but requires more plumbing. Solution 5 is the most robust but also the most complex.

## Files to Modify

| File | Changes |
|------|---------|
| `shared/skills/review:orch/skill.md` | Add Phase 2b (convergence gate), modify Phase 5 (prior resolutions) |
| `shared/agents/reviewer.md` | Add self-verification step, add PRIOR_RESOLUTIONS input |
| `shared/agents/synthesizer.md` | Cross-reference prior reviews in review mode |
| `plugins/devflow-code-review/commands/code-review.md` | Mirror changes from review:orch |
| `shared/skills/resolve:orch/skill.md` | No changes needed — already has the right validation logic |
| `shared/agents/resolver.md` | No changes needed — already validates against current code |

## Evidence Artifacts

The full review history is preserved on disk in the autobeat repo:
- `.docs/reviews/feat-176-tmux-abstraction-layer/` — 9 timestamped review directories
- Each contains per-focus review reports, resolution summaries, and the review-summary.md
- The 2026-05-18_1255 directory contains the fully-hallucinated review for comparison

## Acceptance Criteria

- [ ] After a resolve cycle with >70% false positives, the next review cycle warns the user instead of silently producing another hallucinated report
- [ ] Reviewer agents receive prior resolution context and do not re-flag resolved issues
- [ ] The pipeline can recommend merge after 2 consecutive clean resolve passes
- [ ] A re-resolve path exists as an alternative to a full re-review after fixes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Review pipeline convergence detection — stop cycling when reviewers hallucinate #223

Problem

Timeline (autobeat PR #186)

Key observations

What finally worked

Root Cause Analysis

1. No cross-review memory

2. Resolution data dies after each cycle

3. Incremental detection doesn't prevent hallucination

4. Fix rounds create more hallucination surface

5. Synthesizer doesn't cross-reference

Proposed Solutions

Solution 1: Feed resolution-summary.md back to reviewers (closes the loop)

Solution 2: Convergence gate via Resolver false-positive ratio

Solution 3: Replace second+ review with resolve-only pass

Solution 4: Reviewer self-verification step

Solution 5: Issue fingerprinting / dedup

Priority Order

Files to Modify

Evidence Artifacts

Acceptance Criteria

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Cycle	Time	Verdict	Key Findings	Real Issues
1	05-17 00:13	CHANGES_REQUESTED	5 HIGH security — shell injection via unescaped cwd, communicationTargets	All real
2	05-17 07:25	BLOCK MERGE	Architecture — N concurrent isAlive() timers spawning N sync processes	All real
3	05-17 07:59	CHANGES_REQUESTED	Reliability — Map mutation during iteration, 16 total	All real
4	05-17 14:20	CHANGES_REQUESTED	CRITICAL — triggerExit ordering race, destroy() missing exited=true	All real
5	05-17 21:45	CHANGES_REQUESTED	Timer churn O(N²) during batch stale exit, mutable handle	Mostly real
6	05-17 22:06	DO_NOT_MERGE	15 blocking issues — severity went UP after fixes	Mixed — degradation signal
7	05-17 22:42	(unclear)	Missing TmuxConnectorPort, DI inconsistency, path traversal	Mixed
8	05-17 23:38	CHANGES_REQUESTED	agentCommand/agentArgs unescaped, destroy() not calling onExit	Real
9	05-18 12:55	CHANGES_REQUESTED	4 CRITICAL, 9 HIGH, 21 MEDIUM	All 34 false positives

Condition	Action
Last resolution >70% false positives	Warn: "Reviewers are hallucinating. Consider manual verification or merge."
Last 2 resolutions had 0 new real fixes	Recommend merge with confidence
No prior resolutions exist	Continue normally

#	Solution	Effort	Impact	Quick Win?
2	Convergence gate	Low	High	✅ Yes — small addition to review:orch
4	Reviewer self-verification	Low	High	✅ Yes — small addition to reviewer.md
3	Re-resolve instead of re-review	Low	High	✅ Yes — guidance change
1	Prior resolution context	Medium	Highest	No — plumbing through pipeline
5	Issue fingerprinting	High	Medium	No — new infrastructure

File	Changes
`shared/skills/review:orch/skill.md`	Add Phase 2b (convergence gate), modify Phase 5 (prior resolutions)
`shared/agents/reviewer.md`	Add self-verification step, add PRIOR_RESOLUTIONS input
`shared/agents/synthesizer.md`	Cross-reference prior reviews in review mode
`plugins/devflow-code-review/commands/code-review.md`	Mirror changes from review:orch
`shared/skills/resolve:orch/skill.md`	No changes needed — already has the right validation logic
`shared/agents/resolver.md`	No changes needed — already validates against current code

Review pipeline convergence detection — stop cycling when reviewers hallucinate #223

Description

Problem

Timeline (autobeat PR #186)

Key observations

What finally worked

Root Cause Analysis

1. No cross-review memory

2. Resolution data dies after each cycle

3. Incremental detection doesn't prevent hallucination

4. Fix rounds create more hallucination surface

5. Synthesizer doesn't cross-reference

Proposed Solutions

Solution 1: Feed resolution-summary.md back to reviewers (closes the loop)

Solution 2: Convergence gate via Resolver false-positive ratio

Solution 3: Replace second+ review with resolve-only pass

Solution 4: Reviewer self-verification step

Solution 5: Issue fingerprinting / dedup

Priority Order

Files to Modify

Evidence Artifacts

Acceptance Criteria

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions