feat(metrics): persist per-sample metric scores in JSON report by decko · Pull Request #290 · decko/raki

decko · 2026-05-18T22:09:04Z

Summary

Adds reaggregate_scores() utility function in raki.metrics that derives dataset-level aggregate scores from per-sample SampleResult.scores, enabling downstream consumers to re-derive aggregate metrics from saved JSON reports without re-running the metrics engine.

Changes

src/raki/metrics/reaggregate.py (new): Pure utility function collecting per-sample scores, skipping None values from means, returning None for fully-absent metrics. Docstring documents review_severity_distribution and self_correction_rate limitations.
src/raki/metrics/__init__.py (modified): Exports reaggregate_scores.
tests/test_reaggregate.py (new): 7 tests — 6 unit tests (known scores, None skipping, all-None → None, missing metrics, empty input, single session) + round-trip integration test verifying reaggregated output matches engine aggregate for all per-sample metrics.
changes/259.feature (new): Towncrier changelog fragment.

Acceptance Criteria

reaggregate_scores() collects per-sample scores from SampleResult.scores
None scores are skipped from mean calculation
Fully-absent metrics return None
Missing metrics across samples use only present samples
Round-trip: reaggregate_scores(report.sample_results) matches report.aggregate_scores for all per-sample metrics
review_severity_distribution (aggregate-only) absent from reaggregated output
triage_calibration and file_prediction_accuracy covered in round-trip test
All 1590 existing tests pass (+ 3 skipped, 4 deselected)
Towncrier fragment valid
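
The criteria above can be illustrated with a minimal sketch. This is not the code in src/raki/metrics/reaggregate.py: the `SampleResult` dataclass here is a hypothetical stand-in that models only a `scores` mapping, the metric names are placeholders, and "fully-absent" is assumed to mean a metric whose scores are None in every sample.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SampleResult:
    """Hypothetical stand-in for raki's SampleResult; only `scores` is modeled."""
    scores: dict[str, Optional[float]] = field(default_factory=dict)


def reaggregate_scores(sample_results: list[SampleResult]) -> dict[str, Optional[float]]:
    """Illustrative sketch: per-metric arithmetic mean, skipping None scores."""
    present: dict[str, list[float]] = {}
    for sample in sample_results:
        for name, score in sample.scores.items():
            # Register the metric even when the score is None, so that an
            # all-None metric still appears in the output (as None).
            values = present.setdefault(name, [])
            if score is not None:
                values.append(score)
    # Only samples that actually carry a score count toward the denominator.
    return {
        name: (sum(values) / len(values) if values else None)
        for name, values in present.items()
    }


samples = [
    SampleResult(scores={"metric_a": 1.0, "metric_b": None}),
    SampleResult(scores={"metric_a": 0.5, "metric_b": None, "metric_c": 0.25}),
]
result = reaggregate_scores(samples)
# metric_a -> 0.75, metric_b -> None, metric_c -> 0.25
```

Returning None instead of 0.0 for an all-None metric keeps "no data" distinguishable from "scored zero", which matches the intent stated in the criteria.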

Review Results

rag-specialist — MINOR (addressed)

Finding: Round-trip test omitted triage_calibration and file_prediction_accuracy from roundtrip_metrics. Both metrics populate sample_scores and should round-trip correctly.

Resolution: Added both metrics to the roundtrip_metrics list in the integration test (commit f58a474). The existing if metric_name not in report.aggregate_scores: continue guard handles absence of these metrics in test data.
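
A rough sketch of how such a guard behaves, assuming plain dictionaries in place of the real report object. The score values are invented for illustration, and `reaggregated` stands in for the output of reaggregate_scores(report.sample_results).

```python
import math

# Hypothetical aggregate scores from the engine; this test data happens
# not to produce file_prediction_accuracy.
aggregate_scores = {"triage_calibration": 0.8}
# Hypothetical reaggregated output for the same data.
reaggregated = {"triage_calibration": 0.8}

roundtrip_metrics = ["triage_calibration", "file_prediction_accuracy"]

checked = []
for metric_name in roundtrip_metrics:
    if metric_name not in aggregate_scores:
        continue  # metric absent from this test data, nothing to compare
    assert math.isclose(reaggregated[metric_name], aggregate_scores[metric_name])
    checked.append(metric_name)
```

With this data only triage_calibration is compared. Listing a metric in roundtrip_metrics is therefore safe even when the fixtures do not exercise it, though it also means the metric is verified only when the test data produces it.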

Refs #259

Assisted-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Assigned-by: decko

Introduces reaggregate_scores() which collects per-sample metric scores from SampleResult.scores and computes arithmetic means by metric name.

Key design decisions:
- None scores are skipped from mean calculation; returns None when all scores for a metric are None (not 0.0, which would be misleading).
- Metrics absent from a sample are treated the same as None scores -- skipped and not counted in the denominator.
- review_severity_distribution: aggregate-only metric, produces no per-sample scores, silently absent from output (by design).
- self_correction_rate: ratio-of-sums vs mean-of-ratios divergence documented in docstring; tested separately in Task 2 integration test.

6 unit tests cover: known scores, None handling, all-None edge case, missing metrics, empty input, single session.

Assisted-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Assigned-by: soda-pipeline

…#259)

Proves that reaggregate_scores(report.sample_results) reproduces the engine's aggregate_scores for all metrics with per-sample scores.

Explicitly skipped metrics:
- review_severity_distribution: aggregate-only, no sample_scores, so it does not appear in sample_results and cannot be reaggregated.
- self_correction_rate: engine computes ratio-of-sums (resolved / total_findings across all sessions); reaggregation computes mean-of-ratios (mean of per-session 0.0/1.0), which diverges when sessions have different finding counts.

The test uses samples with duration_ms and token data so that phase_execution_time and token_efficiency also produce per-sample scores and participate in the round-trip verification.

Assisted-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Assigned-by: soda-pipeline
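
The self_correction_rate divergence is easy to see with a small worked example. The session counts below are invented, and deriving each per-session score as resolved / total_findings is an assumption made for illustration.

```python
# (resolved, total_findings) per session; illustrative numbers only.
sessions = [(1, 1), (0, 3)]

# Engine-style aggregate: ratio of sums across all sessions.
ratio_of_sums = sum(resolved for resolved, _ in sessions) / sum(
    total for _, total in sessions
)

# Reaggregation-style aggregate: mean of the per-session ratios.
per_session = [resolved / total for resolved, total in sessions]
mean_of_ratios = sum(per_session) / len(per_session)

print(ratio_of_sums)   # 0.25
print(mean_of_ratios)  # 0.5
```

The session with three findings carries three times the weight in the ratio of sums but the same weight as the one-finding session in the mean of ratios. The two agree only when every session has the same number of findings.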

Documents the new reaggregate_scores() utility as a .feature fragment so it appears in the next CHANGELOG entry.

Assisted-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Assigned-by: soda-pipeline

… round-trip metrics (#259) Per review finding: both metrics populate sample_scores and should be explicitly tested in the round-trip integration test. The existing guard already handles absence of these metrics in test data.

decko added 4 commits May 18, 2026 18:59

chore(changes): add towncrier fragment for reaggregate_scores (#259)

8e147c6


decko added the ai-assisted label (Implemented with AI assistance) May 18, 2026

decko mentioned this pull request May 18, 2026

test: add triage_calibration and file_prediction_accuracy to round-trip metric coverage in test_reaggregate.py #291

Closed

decko merged commit 198bab1 into main May 19, 2026
4 checks passed

decko deleted the soda/259 branch May 19, 2026 13:06

decko mentioned this pull request May 19, 2026

feat: persist per-sample metric scores in JSON report #259

Closed

8 tasks

decko merged 4 commits into main from soda/259
