feat: corroboration scoring with diff-size correction (#432) #437
justn-hyeok merged 1 commit into main from
Conversation
Single-reviewer findings (1/N) get a confidence penalty:
- Small diff: × 0.5 (high hallucination probability)
- Large diff (>500 lines): × 0.7 (may be legitimate)

Triple+ corroboration (3+/N) gets a × 1.2 boost.

This is the final layer of the 4-layer hallucination filter, strengthening the signal that MAD's majority voting provides.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
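The rules above can be sketched as a standalone function. This is a minimal sketch assuming a 0–100 base score; the real `computeL1Confidence` in `packages/core/src/pipeline/confidence.ts` also blends reviewer confidence with agreement rate, so `applyCorroboration` here is an illustrative name, not the PR's API:

```typescript
// Sketch of the corroboration adjustment: penalize 1/N findings
// (milder penalty on large diffs), boost 3+/N findings, cap at 100.
function applyCorroboration(
  base: number,
  agreeing: number,
  totalReviewers: number,
  totalDiffLines?: number,
): number {
  if (agreeing === 1 && totalReviewers >= 3) {
    // Single-reviewer finding: likely hallucination, so penalize it.
    const isLargeDiff = (totalDiffLines ?? 0) > 500; // large diffs get the milder 0.7 penalty
    return Math.round(base * (isLargeDiff ? 0.7 : 0.5));
  }
  if (agreeing >= 3) {
    // Triple+ corroboration: boost, capped at 100.
    return Math.min(100, Math.round(base * 1.2));
  }
  return base; // 2/N agreement: score unchanged
}

console.log(applyCorroboration(80, 1, 4));      // → 40 (small-diff penalty)
console.log(applyCorroboration(80, 1, 4, 900)); // → 56 (large-diff penalty)
console.log(applyCorroboration(90, 3, 4));      // → 100 (boost, capped)
```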
📝 Walkthrough

This PR implements Layer 2 corroboration scoring with diff-size correction for the confidence calculation system.

Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 5 passed
CodeAgora Review
📋 Triage: 3 verify · 3 ignore
Verdict: ✅ ACCEPT · 1 critical · 5 warning
The only flagged issue (d001) was dismissed by majority vote after discussion, leaving zero unresolved or confirmed problems of any severity. With no CRITICAL/HARSHLY_CRITICAL findings remaining and no escalated disagreements, the change has been vetted and deemed safe to merge.
Blocking Issues
| Severity | File | Line | Issue | Confidence |
|---|---|---|---|---|
| 🔴 CRITICAL | packages/core/src/pipeline/confidence.ts | 15–48 | Inconsistent Corroboration Boost | 🟡 40% |
5 warning(s)
| Severity | File | Line | Issue | Confidence |
|---|---|---|---|---|
| 🟡 WARNING | packages/core/src/pipeline/confidence.ts | 21 | Potential Division by Zero Error | 🟡 60% |
| 🟡 WARNING | packages/core/src/pipeline/confidence.ts | 37 | Lack of Input Validation for totalDiffLines | 🔴 38% |
| 🟡 WARNING | packages/core/src/pipeline/confidence.ts | 20 | Potential division by zero in computeL1Confidence | 🟡 45% |
| 🟡 WARNING | packages/core/src/pipeline/confidence.ts | 26 | Potential loss of precision in computeL1Confidence | 🔴 34% |
| 🟡 WARNING | packages/core/src/pipeline/orchestrator.ts | 753 | Missing error handling in runPipeline | 🟡 56% |
Issue distribution (2 file(s))
| File | Issues |
|---|---|
| packages/core/src/pipeline/confidence.ts | ████████████ 5 |
| packages/core/src/pipeline/orchestrator.ts | ██ 1 |
Agent consensus log (1 discussion(s))
✅ d001 — 1 round(s), consensus → DISMISSED
Verdict: DISMISSED — Majority rejected (2/3 disagree)
CodeAgora · Session: 2026-04-01/001
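The dismissal rule implied by the consensus log above can be illustrated with a hypothetical sketch. The names `Vote` and `verdict` are not from the codebase, and CodeAgora's actual consensus logic is not shown in this PR:

```typescript
// Hypothetical majority-vote rule: a finding is dismissed when a strict
// majority of discussion participants disagree with it.
type Vote = "agree" | "disagree";

function verdict(votes: Vote[]): "CONFIRMED" | "DISMISSED" {
  const disagree = votes.filter((v) => v === "disagree").length;
  return disagree * 2 > votes.length ? "DISMISSED" : "CONFIRMED";
}

// d001 above: 2 of 3 reviewers disagreed, so it is dismissed.
console.log(verdict(["disagree", "disagree", "agree"])); // → "DISMISSED"
```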
```diff
@@ -15,7 +15,8 @@ export interface DiscussionVerdictLike {
 export function computeL1Confidence(
   doc: EvidenceDocument,
```
🔴 CRITICAL — Inconsistent Corroboration Boost
Confidence: 🟡 40%
Problem: In packages/core/src/pipeline/confidence.ts:15-48
The corroboration boost logic does not correctly handle cases where agreeing is exactly 2. According to the comments, a boost should be applied when agreeing >= 3, but the current implementation only applies a penalty when agreeing is 1.
Evidence:
- The condition for applying a boost is `agreeing >= 3`, but there's no specific handling for `agreeing == 2`.
- The current implementation only applies a penalty when `agreeing` is 1 and `totalReviewers >= 3`.
- The function `computeL1Confidence` should apply a boost when there are at least 3 agreeing reviewers.
```ts
if (agreeing === 1 && totalReviewers >= 3) {
  // Diff-size correction: large diffs may have legitimate single-reviewer finds
  const isLargeDiff = (totalDiffLines ?? 0) > 500;
  const penalty = isLargeDiff ? 0.7 : 0.5;
  base = Math.round(base * penalty);
} else if (agreeing >= 3) {
  // Strong corroboration boost (capped at 100)
  base = Math.min(100, Math.round(base * 1.2));
} else if (agreeing >= 2) { // Modified condition: checked after >= 3, otherwise that branch is unreachable
  // Apply a smaller boost for exactly 2 agreeing reviewers
  base = Math.min(100, Math.round(base * 1.1));
}
```
Flagged by: r-scout | CodeAgora
```ts
  totalDiffLines?: number,
): number {
  if (totalReviewers <= 0) return 50;
  const agreeing = allDocs.filter(d =>
```
🟡 WARNING — Potential Division by Zero Error
Confidence: 🟡 60%
Problem: In packages/core/src/pipeline/confidence.ts:21-25
The computeL1Confidence function may throw a division by zero error when totalReviewers is zero. Although the function checks if totalReviewers is less than or equal to zero and returns 50 in such cases, this handling may not be sufficient or clear.
Evidence:
- The function returns a fixed value of 50 when `totalReviewers` is less than or equal to zero, which might not accurately represent the confidence level in such scenarios.
- The check for `totalReviewers` being less than or equal to zero is present but does not handle negative values explicitly.
- There's no explicit documentation or comment explaining why 50 is chosen as the default confidence level when `totalReviewers` is zero or negative.
```ts
if (totalReviewers <= 0) {
  // Consider throwing an error or returning a more meaningful default
  // For example:
  throw new Error("Total reviewers must be a positive number.");
  // or
  // return 0; // with clear documentation on why 0 is chosen
}
```
Flagged by: r-llama33 | CodeAgora
```ts
// Corroboration scoring (#432)
// Single-reviewer findings are more likely hallucinations
if (agreeing === 1 && totalReviewers >= 3) {
  // Diff-size correction: large diffs may have legitimate single-reviewer finds
```
🟡 WARNING — Lack of Input Validation for totalDiffLines
Confidence: 🔴 38%
Problem: In packages/core/src/pipeline/confidence.ts:37-41
The computeL1Confidence function uses totalDiffLines to determine if a diff is large or small, but it does not validate if totalDiffLines is a positive number. This could lead to unexpected behavior if totalDiffLines is negative or zero.
Evidence:
- The function uses `totalDiffLines` to calculate the penalty for single-reviewer findings.
- There's no check to ensure `totalDiffLines` is a positive number.
- The logic for determining a large diff (`totalDiffLines > 500`) assumes `totalDiffLines` is always non-negative.
```ts
if (totalDiffLines !== undefined && totalDiffLines < 0) {
  throw new Error("Total diff lines must be a non-negative number.");
}
```
Flagged by: r-llama33 | CodeAgora
```ts
  totalReviewers: number,
  totalDiffLines?: number,
): number {
  if (totalReviewers <= 0) return 50;
```
🟡 WARNING — Potential division by zero in computeL1Confidence
Confidence: 🟡 45%
Problem: In packages/core/src/pipeline/confidence.ts:20-22
If totalReviewers is zero, the agreeing count is calculated by filtering allDocs, which could potentially return an empty array. Then, the agreementRate is calculated as Math.abs(agreeing / totalReviewers) * 100, which could result in a division by zero.
Evidence:
- `totalReviewers` is set to zero in the else clause of the function.
- `agreeing` is calculated by filtering `allDocs`, which could return an empty array.
```ts
if (totalReviewers <= 0) return 50;
const agreementRate = agreeing !== null && agreeing !== undefined ? Math.round((agreeing / totalReviewers) * 100) : 0;
```
Flagged by: r-llama31 | CodeAgora
```diff
@@ -24,10 +25,26 @@ export function computeL1Confidence(
 ).length;
 const agreementRate = Math.round((agreeing / totalReviewers) * 100);
```
🟡 WARNING — Potential loss of precision in computeL1Confidence
Confidence: 🔴 34%
Problem: In packages/core/src/pipeline/confidence.ts:26-30
The base value is calculated by multiplying the confidence of the current document by 0.6 and adding the agreementRate multiplied by 0.4. This calculation can result in a potential loss of precision due to the float arithmetic.
Evidence:
- The calculation involves multiplying and adding floating-point numbers.
- The result is rounded to the nearest integer, which can lose precision.
```ts
const base = confidence * 36 + agreementRate * 4;
```
Flagged by: r-llama31 | CodeAgora
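If rounding drift in the 60/40 blend were ever a real concern, one conventional remedy is scaled integer arithmetic. This is a sketch of that idea, not what the PR does; the actual code rounds the float blend directly:

```typescript
// Scaled-integer variant of base = round(confidence * 0.6 + agreementRate * 0.4):
// work in tenths so the weights 6 and 4 stay exact integers.
function blendInt(confidence: number, agreementRate: number): number {
  return Math.round((confidence * 6 + agreementRate * 4) / 10);
}

console.log(blendInt(70, 50)); // → 62
```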
```diff
@@ -752,9 +752,10 @@ export async function runPipeline(input: PipelineInput, progress?: ProgressEmitt
 // === CONFIDENCE: Compute L1 confidence for non-rule docs ===
 const totalReviewers = allReviewerInputs.length;
```
🟡 WARNING — Missing error handling in runPipeline
Confidence: 🟡 56%
Problem: In packages/core/src/pipeline/orchestrator.ts:753-761
The runPipeline function assumes that computeL1Confidence will always return a value. However, in the case where totalDiffLines is null or undefined, a TypeError can be thrown. Additionally, the function does not handle any potential errors that may occur during the execution of computeL1Confidence.
Evidence:
- The function does not include any error handling for potential errors in `computeL1Confidence`.
- A `TypeError` can be thrown if `totalDiffLines` is `null` or `undefined`.
```ts
const totalReviewers = allReviewerInputs.length;
try {
  // computeL1Confidence call
} catch (error) {
  // handle error
}
```
Flagged by: r-llama31 | CodeAgora
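A fuller version of the try/catch stub this finding asks for might look like the following. It is a sketch with a throwing stub standing in for the real `computeL1Confidence`, and the fallback of 50 simply mirrors the function's own neutral default; none of these names besides `computeL1Confidence` come from the codebase:

```typescript
// Stub that mimics the failure mode the finding describes.
function computeL1ConfidenceStub(totalDiffLines?: number): number {
  if (totalDiffLines === undefined) throw new TypeError("totalDiffLines missing");
  return 75;
}

// Wrap the call so a scoring failure cannot crash the whole pipeline run.
function safeConfidence(totalDiffLines?: number): number {
  try {
    return computeL1ConfidenceStub(totalDiffLines);
  } catch {
    return 50; // neutral default, matching the totalReviewers <= 0 fallback
  }
}

console.log(safeConfidence(120)); // → 75
console.log(safeConfidence());    // → 50 (falls back instead of throwing)
```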
🧹 Nitpick comments (2)
packages/core/src/tests/parser-bilingual.test.ts (1)
213-225: Good test coverage for corroboration scoring. The test suite comprehensively covers all corroboration scenarios from the PR objectives:
- Single-reviewer penalty (small/large diff variants)
- Multi-reviewer boost with cap
- Middle-ground (2 reviewers) no-change case
- Guard condition (totalReviewers < 3)
Minor note: The `makeDoc` helper is duplicated from the earlier describe block (lines 148-159). Consider extracting it to module scope to reduce duplication, though this is optional since the scoping provides isolation.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/core/src/tests/parser-bilingual.test.ts` around lines 213-225: There is a duplicated helper function makeDoc used in two describe blocks; extract makeDoc to module scope so both the computeL1Confidence corroboration scoring tests and the earlier describe block reuse the same helper, then remove the duplicate definitions inside the describe blocks; ensure the extracted function signature and optional confidence handling exactly match the existing implementations so tests referencing makeDoc still work.

packages/core/src/pipeline/confidence.ts (1)
11-14: JSDoc is outdated after the corroboration scoring changes. The function description only mentions the basic agreement calculation but doesn't document the new corroboration penalty/boost logic or the `totalDiffLines` parameter behavior. Consider updating to reflect:

- Single-reviewer penalty (0.5×/0.7× based on diff size) when `agreeing === 1` and `totalReviewers >= 3`
- Corroboration boost (1.2×) when `agreeing >= 3`
- The optional `totalDiffLines` parameter and its default behavior

📝 Suggested JSDoc update

```diff
 /**
- * L1 confidence: (agreeing reviewers / total reviewers) * 100
- * "Agreeing" = docs at same filePath + similar lineRange (within ±5 lines)
+ * L1 confidence: blends reviewer confidence (60%) with agreement rate (40%).
+ * "Agreeing" = docs at same filePath + similar lineRange (within ±5 lines).
+ *
+ * Corroboration scoring (#432):
+ * - Single reviewer (1/N, N≥3): penalty 0.5× (small diff) or 0.7× (large diff >500 lines)
+ * - Two reviewers: no change
+ * - Three+ reviewers: boost 1.2× (capped at 100)
+ *
+ * @param totalDiffLines - Total lines in diff; if omitted, defaults to 0 (small diff behavior)
  */
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/core/src/pipeline/confidence.ts` around lines 11 - 14, Update the JSDoc above the confidence calculation in this file to describe the full corroboration scoring rules: keep the base L1 formula (agreeing/totalReviewers * 100) but also document that when agreeing === 1 and totalReviewers >= 3 a single-reviewer penalty is applied (0.5× for small diffs, 0.7× for larger diffs — controlled by the totalDiffLines threshold), that when agreeing >= 3 a corroboration boost of 1.2× is applied, and that totalDiffLines is an optional parameter with its default behavior (explain the default threshold and how it affects the penalty choice); reference the parameter name totalDiffLines and the values/conditions for agreeing and totalReviewers so callers understand the adjusted final score.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@packages/core/src/pipeline/confidence.ts`:
- Around line 11-14: Update the JSDoc above the confidence calculation in this
file to describe the full corroboration scoring rules: keep the base L1 formula
(agreeing/totalReviewers * 100) but also document that when agreeing === 1 and
totalReviewers >= 3 a single-reviewer penalty is applied (0.5× for small diffs,
0.7× for larger diffs — controlled by the totalDiffLines threshold), that when
agreeing >= 3 a corroboration boost of 1.2× is applied, and that totalDiffLines
is an optional parameter with its default behavior (explain the default
threshold and how it affects the penalty choice); reference the parameter name
totalDiffLines and the values/conditions for agreeing and totalReviewers so
callers understand the adjusted final score.
In `@packages/core/src/tests/parser-bilingual.test.ts`:
- Around line 213-225: There is a duplicated helper function makeDoc used in two
describe blocks; extract makeDoc to module scope so both computeL1Confidence —
corroboration scoring tests and the earlier describe block reuse the same
helper, then remove the duplicate definitions inside the describe blocks; ensure
the extracted function signature and optional confidence handling exactly match
the existing implementations so tests referencing makeDoc still work.
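The extraction that prompt describes might look like the following. The field names on `EvidenceDocument` here are assumptions from context, not the project's real interface, and the default confidence of 70 is illustrative:

```typescript
// Hypothetical module-scope makeDoc helper shared by both describe blocks.
interface EvidenceDocument {
  filePath: string;
  lineRange: [number, number];
  confidence?: number;
}

function makeDoc(
  filePath: string,
  start: number,
  end: number,
  confidence = 70, // optional-confidence handling kept identical to the originals
): EvidenceDocument {
  return { filePath, lineRange: [start, end], confidence };
}

const doc = makeDoc("src/example.ts", 10, 14);
console.log(doc.lineRange); // → [ 10, 14 ]
```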
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 93f40695-97b0-41a2-82e9-2026f6419ebf
📒 Files selected for processing (3)
- packages/core/src/pipeline/confidence.ts
- packages/core/src/pipeline/orchestrator.ts
- packages/core/src/tests/parser-bilingual.test.ts
Summary
Changes
- `packages/core/src/pipeline/confidence.ts` — Extended `computeL1Confidence` with corroboration penalty/boost logic and optional `totalDiffLines` parameter
- `packages/core/src/pipeline/orchestrator.ts` — Pass `totalDiffLines` (from filtered diff content) to `computeL1Confidence`
- `packages/core/src/tests/parser-bilingual.test.ts` — Added 6 new test cases covering all corroboration scoring scenarios

Test plan
Closes #432
🤖 Generated with Claude Code