
feat: corroboration scoring with diff-size correction (#432) #437

Merged
justn-hyeok merged 1 commit into main from feat/corroboration-scoring-432
Apr 1, 2026

Conversation

@justn-hyeok
Collaborator

@justn-hyeok justn-hyeok commented Apr 1, 2026

Summary

  • Single-reviewer penalty: Findings reported by only 1 out of N reviewers (N>=3) get confidence reduced by 0.5x (small diffs) or 0.7x (large diffs >500 lines), targeting likely hallucinations
  • Triple+ corroboration boost: Findings confirmed by 3+ reviewers get a 1.2x confidence boost (capped at 100)
  • Diff-size correction: Large diffs are treated more leniently for single-reviewer findings since they may contain legitimate unique issues
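
The three rules above can be sketched as a small standalone helper. This is an illustrative sketch only: the name `applyCorroboration` and its signature are assumptions, not the actual `computeL1Confidence` implementation in `confidence.ts`.

```typescript
// Illustrative sketch — applyCorroboration is a hypothetical name; the real
// logic lives inside computeL1Confidence in confidence.ts.
function applyCorroboration(
  base: number,           // blended confidence before corroboration, 0–100
  agreeing: number,       // number of reviewers reporting this finding
  totalReviewers: number,
  totalDiffLines?: number,
): number {
  if (agreeing === 1 && totalReviewers >= 3) {
    // Single-reviewer penalty, softened for large diffs (>500 lines)
    const isLargeDiff = (totalDiffLines ?? 0) > 500;
    base = Math.round(base * (isLargeDiff ? 0.7 : 0.5));
  } else if (agreeing >= 3) {
    // Triple+ corroboration boost, capped at 100
    base = Math.min(100, Math.round(base * 1.2));
  }
  // agreeing === 2, or totalReviewers < 3: left unchanged
  return Math.max(0, Math.min(100, base));
}
```

Under these assumptions, a base score of 80 from a lone reviewer in a five-reviewer run drops to 40 on a small diff and to 56 on a large one.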

Changes

  • packages/core/src/pipeline/confidence.ts — Extended computeL1Confidence with corroboration penalty/boost logic and optional totalDiffLines parameter
  • packages/core/src/pipeline/orchestrator.ts — Pass totalDiffLines (from filtered diff content) to computeL1Confidence
  • packages/core/src/tests/parser-bilingual.test.ts — Added 6 new test cases covering all corroboration scoring scenarios

Test plan

  • Single reviewer (1/5), small diff: confidence x 0.5
  • Single reviewer (1/5), large diff (>500 lines): confidence x 0.7
  • Triple corroboration (3/5): confidence x 1.2
  • All reviewers agree (5/5): confidence x 1.2 (capped at 100)
  • 2 reviewers agree: no penalty/boost (middle ground)
  • totalReviewers < 3: no penalty (not enough data)
  • All 26 tests pass (20 existing + 6 new)

Closes #432

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Enhanced confidence scoring to account for reviewer agreement and corroboration levels.
    • Penalties applied for single-reviewer scenarios; boosts applied when consensus is strong (3+ reviewers).
  • Tests

    • Added comprehensive test coverage for the updated confidence-scoring logic with various reviewer agreement scenarios.

Single-reviewer findings (1/N) get confidence penalty:
- Small diff: × 0.5 (high hallucination probability)
- Large diff (>500 lines): × 0.7 (may be legitimate)

Triple+ corroboration (3+/N) gets × 1.2 boost.

This is the final layer of the 4-layer hallucination filter,
strengthening the signal that MAD's majority voting provides.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions github-actions bot added the size/M <200 lines label Apr 1, 2026
@coderabbitai

coderabbitai bot commented Apr 1, 2026

📝 Walkthrough

Walkthrough

This PR implements Layer 2 corroboration scoring with diff-size correction for the confidence calculation system. The computeL1Confidence function now accepts an optional totalDiffLines parameter to apply penalties for single-reviewer agreement (scaled by diff size) and boosts for multi-reviewer agreement, with the result clamped to [0, 100].

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **Confidence Computation Logic**<br>`packages/core/src/pipeline/confidence.ts` | Updated `computeL1Confidence` to accept an optional `totalDiffLines` parameter. Added corroboration scoring: penalizes confidence (×0.5 or ×0.7 based on diff size) when 1 reviewer agrees and ≥3 total reviewers exist; boosts confidence (×1.2, capped at 100) when ≥3 reviewers agree. Final value clamped to [0, 100]. |
| **Orchestrator Integration**<br>`packages/core/src/pipeline/orchestrator.ts` | Updated `runPipeline` to compute `totalDiffLines` from filtered diff content and pass it as an argument to `computeL1Confidence` for non-rule evidence documents. |
| **Test Coverage**<br>`packages/core/src/tests/parser-bilingual.test.ts` | Added a test suite ("corroboration scoring (#432)") validating penalty application for single-reviewer scenarios (small vs. large diffs), boost logic for 3+ agreers, boundary conditions when totalReviewers < 3, and mid-level agreement scenarios. |
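
On the orchestrator side, `totalDiffLines` might plausibly be derived by counting changed lines in the filtered unified diff. The helper name `countDiffLines` and the exact counting rule below are assumptions for illustration, not the actual `orchestrator.ts` code:

```typescript
// Hypothetical helper — the real orchestrator computes totalDiffLines from the
// filtered diff content; the exact counting rule may differ.
function countDiffLines(diff: string): number {
  return diff.split("\n").filter(
    (line) =>
      // count added/removed lines, skipping the +++/--- file headers
      (line.startsWith("+") || line.startsWith("-")) &&
      !line.startsWith("+++") &&
      !line.startsWith("---"),
  ).length;
}
```

Under this counting rule, a diff with more than 500 changed lines would put single-reviewer findings on the lenient ×0.7 penalty path.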

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Suggested labels

size/M

Poem

🐰 Whiskers twitch with glee,
Three reviewers now agree,
Confidence blooms bright,
Corroboration's might,
Large diffs show their honesty! 🌟

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped: CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the main change: implementing corroboration scoring with diff-size correction as described in issue #432. |
| Linked Issues check | ✅ Passed | All coding requirements from issue #432 are met: confidence adjustments for 1/N (×0.5/×0.7), 2/N (no change), 3+/N (×1.2), diff-size correction, and test coverage. |
| Out of Scope Changes check | ✅ Passed | All changes align with issue #432 scope: confidence computation logic, orchestrator integration, and comprehensive test coverage with no extraneous modifications. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage; check skipped. |




@codeagora-bot codeagora-bot bot left a comment


CodeAgora Review

📋 Triage: 3 verify · 3 ignore

Verdict: ✅ ACCEPT · 1 critical · 6 warning

The only flagged issue (d001) was unanimously dismissed by the reviewers after discussion, leaving zero unresolved or confirmed problems of any severity. With no CRITICAL/HARSHLY_CRITICAL findings remaining and no escalated disagreements, the change has been vetted and deemed safe to merge.

Blocking Issues

| Severity | File | Line | Issue | Confidence |
| --- | --- | --- | --- | --- |
| 🔴 CRITICAL | packages/core/src/pipeline/confidence.ts | 15–48 | Inconsistent Corroboration Boost | 🟡 40% |

5 warning(s)

| Severity | File | Line | Issue | Confidence |
| --- | --- | --- | --- | --- |
| 🟡 WARNING | packages/core/src/pipeline/confidence.ts | 21 | Potential Division by Zero Error | 🟡 60% |
| 🟡 WARNING | packages/core/src/pipeline/confidence.ts | 37 | Lack of Input Validation for totalDiffLines | 🔴 38% |
| 🟡 WARNING | packages/core/src/pipeline/confidence.ts | 20 | Potential division by zero in computeL1Confidence | 🟡 45% |
| 🟡 WARNING | packages/core/src/pipeline/confidence.ts | 26 | Potential loss of precision in computeL1Confidence | 🔴 34% |
| 🟡 WARNING | packages/core/src/pipeline/orchestrator.ts | 753 | Missing error handling in runPipeline | 🟡 56% |
Issue distribution (2 file(s))

| File | Issues |
| --- | --- |
| packages/core/src/pipeline/confidence.ts | ████████████ 5 |
| packages/core/src/pipeline/orchestrator.ts | ██ 1 |
Agent consensus log (1 discussion(s))
✅ d001 — 1 round(s), consensus → DISMISSED

Verdict: DISMISSED — Majority rejected (2/3 disagree)


CodeAgora · Session: 2026-04-01/001

```diff
@@ -15,7 +15,8 @@ export interface DiscussionVerdictLike {
 export function computeL1Confidence(
   doc: EvidenceDocument,
```

🔴 CRITICAL — Inconsistent Corroboration Boost

Confidence: 🟡 40%

Problem: In packages/core/src/pipeline/confidence.ts:15-48

The corroboration boost logic does not correctly handle cases where agreeing is exactly 2. According to the comments, a boost should be applied when agreeing >= 3, but the current implementation only applies a penalty when agreeing is 1.

Evidence:

  1. The condition for applying a boost is agreeing >= 3, but there's no specific handling for agreeing == 2.
  2. The current implementation only applies a penalty when agreeing is 1 and totalReviewers >= 3.
  3. The function computeL1Confidence should apply a boost when there are at least 3 agreeing reviewers.
Suggested change

```diff
-  doc: EvidenceDocument,
+  if (agreeing === 1 && totalReviewers >= 3) {
+    // Diff-size correction: large diffs may have legitimate single-reviewer finds
+    const isLargeDiff = (totalDiffLines ?? 0) > 500;
+    const penalty = isLargeDiff ? 0.7 : 0.5;
+    base = Math.round(base * penalty);
+  } else if (agreeing >= 2) { // Modified condition
+    // Apply a smaller boost for 2 or more agreeing reviewers
+    base = Math.min(100, Math.round(base * 1.1));
+  } else if (agreeing >= 3) {
+    // Strong corroboration boost (capped at 100)
+    base = Math.min(100, Math.round(base * 1.2));
+  }
```

Flagged by: r-scout  |  CodeAgora

```diff
   totalDiffLines?: number,
 ): number {
   if (totalReviewers <= 0) return 50;
   const agreeing = allDocs.filter(d =>
```

🟡 WARNING — Potential Division by Zero Error

Confidence: 🟡 60%

Problem: In packages/core/src/pipeline/confidence.ts:21-25

The computeL1Confidence function may throw a division by zero error when totalReviewers is zero. Although the function checks if totalReviewers is less than or equal to zero and returns 50 in such cases, this handling may not be sufficient or clear.

Evidence:

  1. The function returns a fixed value of 50 when totalReviewers is less than or equal to zero, which might not accurately represent the confidence level in such scenarios.
  2. The check for totalReviewers being less than or equal to zero is present but does not handle negative values explicitly.
  3. There's no explicit documentation or comment explaining why 50 is chosen as the default confidence level when totalReviewers is zero or negative.
Suggested change

```diff
-  const agreeing = allDocs.filter(d =>
+  if (totalReviewers <= 0) {
+    // Consider throwing an error or returning a more meaningful default
+    // For example:
+    throw new Error("Total reviewers must be a positive number.");
+    // or
+    return 0; // with clear documentation on why 0 is chosen
+  }
```

Flagged by: r-llama33  |  CodeAgora

```diff
 // Corroboration scoring (#432)
 // Single-reviewer findings are more likely hallucinations
 if (agreeing === 1 && totalReviewers >= 3) {
   // Diff-size correction: large diffs may have legitimate single-reviewer finds
```

🟡 WARNING — Lack of Input Validation for totalDiffLines

Confidence: 🔴 38%

Problem: In packages/core/src/pipeline/confidence.ts:37-41

The computeL1Confidence function uses totalDiffLines to determine if a diff is large or small, but it does not validate if totalDiffLines is a positive number. This could lead to unexpected behavior if totalDiffLines is negative or zero.

Evidence:

  1. The function uses totalDiffLines to calculate the penalty for single-reviewer findings.
  2. There's no check to ensure totalDiffLines is a positive number.
  3. The logic for determining a large diff (totalDiffLines > 500) assumes totalDiffLines is always non-negative.
Suggested change

```diff
-// Diff-size correction: large diffs may have legitimate single-reviewer finds
+if (totalDiffLines !== undefined && totalDiffLines < 0) {
+  throw new Error("Total diff lines must be a non-negative number.");
+}
```

Flagged by: r-llama33  |  CodeAgora

```diff
   totalReviewers: number,
   totalDiffLines?: number,
 ): number {
   if (totalReviewers <= 0) return 50;
```

🟡 WARNING — Potential division by zero in computeL1Confidence

Confidence: 🟡 45%

Problem: In packages/core/src/pipeline/confidence.ts:20-22

If totalReviewers is zero, the agreeing count is calculated by filtering allDocs, which could potentially return an empty array. Then, the agreementRate is calculated as Math.abs(agreeing / totalReviewers) * 100, which could result in a division by zero.

Evidence:

  1. totalReviewers is set to zero in the else clause of the function.
  2. agreeing is calculated by filtering allDocs, which could return an empty array.
Suggested change

```diff
-if (totalReviewers <= 0) return 50;
+const agreementRate = agreeing !== null && agreeing !== undefined ? Math.round((agreeing / totalReviewers) * 100) : 0;
```

Flagged by: r-llama31  |  CodeAgora

```diff
@@ -24,10 +25,26 @@ export function computeL1Confidence(
 ).length;
 const agreementRate = Math.round((agreeing / totalReviewers) * 100);
```

🟡 WARNING — Potential loss of precision in computeL1Confidence

Confidence: 🔴 34%

Problem: In packages/core/src/pipeline/confidence.ts:26-30

The base value is calculated by multiplying the confidence of the current document by 0.6 and adding the agreementRate multiplied by 0.4. This calculation can result in a potential loss of precision due to the float arithmetic.

Evidence:

  1. The calculation involves multiplying and adding floating-point numbers.
  2. The result is rounded to the nearest integer, which can lose precision.
Suggested change

```diff
+const base = confidence * 36 + agreementRate * 4;
```

Flagged by: r-llama31  |  CodeAgora

```diff
@@ -752,9 +752,10 @@ export async function runPipeline(input: PipelineInput, progress?: ProgressEmitt
 
 // === CONFIDENCE: Compute L1 confidence for non-rule docs ===
 const totalReviewers = allReviewerInputs.length;
```

🟡 WARNING — Missing error handling in runPipeline

Confidence: 🟡 56%

Problem: In packages/core/src/pipeline/orchestrator.ts:753-761

The runPipeline function assumes that computeL1Confidence will always return a value. However, in the case where totalDiffLines is null or undefined, a TypeError can be thrown. Additionally, the function does not handle any potential errors that may occur during the execution of computeL1Confidence.

Evidence:

  1. The function does not include any error handling for potential errors in computeL1Confidence.
  2. A TypeError can be thrown if totalDiffLines is null or undefined.
Suggested change

```diff
-const totalReviewers = allReviewerInputs.length;
+try {
+  // computeL1Confidence call
+} catch (error) {
+  // handle error
+}
```

Flagged by: r-llama31  |  CodeAgora


@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (2)
packages/core/src/tests/parser-bilingual.test.ts (1)

213-225: Good test coverage for corroboration scoring.

The test suite comprehensively covers all corroboration scenarios from the PR objectives:

  • Single-reviewer penalty (small/large diff variants)
  • Multi-reviewer boost with cap
  • Middle-ground (2 reviewers) no-change case
  • Guard condition (totalReviewers < 3)

Minor note: The makeDoc helper is duplicated from the earlier describe block (lines 148-159). Consider extracting it to module scope to reduce duplication, though this is optional since the scoping provides isolation.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/core/src/tests/parser-bilingual.test.ts` around lines 213 - 225,
There is a duplicated helper function makeDoc used in two describe blocks;
extract makeDoc to module scope so both computeL1Confidence — corroboration
scoring tests and the earlier describe block reuse the same helper, then remove
the duplicate definitions inside the describe blocks; ensure the extracted
function signature and optional confidence handling exactly match the existing
implementations so tests referencing makeDoc still work.
packages/core/src/pipeline/confidence.ts (1)

11-14: JSDoc is outdated after the corroboration scoring changes.

The function description only mentions the basic agreement calculation but doesn't document the new corroboration penalty/boost logic or the totalDiffLines parameter behavior. Consider updating to reflect:

  • Single-reviewer penalty (0.5×/0.7× based on diff size) when agreeing === 1 and totalReviewers >= 3
  • Corroboration boost (1.2×) when agreeing >= 3
  • The optional totalDiffLines parameter and its default behavior
📝 Suggested JSDoc update

```diff
 /**
- * L1 confidence: (agreeing reviewers / total reviewers) * 100
- * "Agreeing" = docs at same filePath + similar lineRange (within ±5 lines)
+ * L1 confidence: blends reviewer confidence (60%) with agreement rate (40%).
+ * "Agreeing" = docs at same filePath + similar lineRange (within ±5 lines).
+ *
+ * Corroboration scoring (#432):
+ * - Single reviewer (1/N, N≥3): penalty 0.5× (small diff) or 0.7× (large diff >500 lines)
+ * - Two reviewers: no change
+ * - Three+ reviewers: boost 1.2× (capped at 100)
+ *
+ * @param totalDiffLines - Total lines in diff; if omitted, defaults to 0 (small diff behavior)
  */
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/core/src/pipeline/confidence.ts` around lines 11 - 14, Update the
JSDoc above the confidence calculation in this file to describe the full
corroboration scoring rules: keep the base L1 formula (agreeing/totalReviewers *
100) but also document that when agreeing === 1 and totalReviewers >= 3 a
single-reviewer penalty is applied (0.5× for small diffs, 0.7× for larger diffs
— controlled by the totalDiffLines threshold), that when agreeing >= 3 a
corroboration boost of 1.2× is applied, and that totalDiffLines is an optional
parameter with its default behavior (explain the default threshold and how it
affects the penalty choice); reference the parameter name totalDiffLines and the
values/conditions for agreeing and totalReviewers so callers understand the
adjusted final score.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@packages/core/src/pipeline/confidence.ts`:
- Around line 11-14: Update the JSDoc above the confidence calculation in this
file to describe the full corroboration scoring rules: keep the base L1 formula
(agreeing/totalReviewers * 100) but also document that when agreeing === 1 and
totalReviewers >= 3 a single-reviewer penalty is applied (0.5× for small diffs,
0.7× for larger diffs — controlled by the totalDiffLines threshold), that when
agreeing >= 3 a corroboration boost of 1.2× is applied, and that totalDiffLines
is an optional parameter with its default behavior (explain the default
threshold and how it affects the penalty choice); reference the parameter name
totalDiffLines and the values/conditions for agreeing and totalReviewers so
callers understand the adjusted final score.

In `@packages/core/src/tests/parser-bilingual.test.ts`:
- Around line 213-225: There is a duplicated helper function makeDoc used in two
describe blocks; extract makeDoc to module scope so both computeL1Confidence —
corroboration scoring tests and the earlier describe block reuse the same
helper, then remove the duplicate definitions inside the describe blocks; ensure
the extracted function signature and optional confidence handling exactly match
the existing implementations so tests referencing makeDoc still work.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 93f40695-97b0-41a2-82e9-2026f6419ebf

📥 Commits

Reviewing files that changed from the base of the PR and between 30e12eb and 8f86311.

📒 Files selected for processing (3)
  • packages/core/src/pipeline/confidence.ts
  • packages/core/src/pipeline/orchestrator.ts
  • packages/core/src/tests/parser-bilingual.test.ts

@justn-hyeok justn-hyeok merged commit 8fc6d82 into main Apr 1, 2026
4 of 6 checks passed

Labels

size/M <200 lines

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat: Layer 2 — corroboration scoring with diff-size correction

1 participant