
feat(metrics): persist per-sample metric scores in JSON report#290

Merged
decko merged 4 commits into main from soda/259
May 19, 2026

Conversation

Owner

@decko decko commented May 18, 2026

Summary

Adds a reaggregate_scores() utility function in raki.metrics that derives dataset-level aggregate scores from per-sample SampleResult.scores, so downstream consumers can recompute aggregates from saved JSON reports without re-running the metrics engine.

Changes

  • src/raki/metrics/reaggregate.py (new): Pure utility function that collects per-sample scores, excludes None values from each mean, and returns None for a metric whose scores are all None. The docstring documents the review_severity_distribution and self_correction_rate limitations.
  • src/raki/metrics/__init__.py (modified): Exports reaggregate_scores.
  • tests/test_reaggregate.py (new): 7 tests: 6 unit tests (known scores, None skipping, all-None → None, missing metrics, empty input, single session) plus a round-trip integration test verifying that reaggregated output matches the engine aggregate for all per-sample metrics.
  • changes/259.feature (new): Towncrier changelog fragment.

Acceptance Criteria

  • reaggregate_scores() collects per-sample scores from SampleResult.scores
  • None scores are skipped from mean calculation
  • Fully-absent metrics return None
  • Missing metrics across samples use only present samples
  • Round-trip: reaggregate_scores(report.sample_results) matches report.aggregate_scores for all per-sample metrics
  • review_severity_distribution (aggregate-only) absent from reaggregated output
  • triage_calibration and file_prediction_accuracy covered in round-trip test
  • All 1590 existing tests pass (+ 3 skipped, 4 deselected)
  • Towncrier fragment valid
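
The None-handling rules in the criteria above can be illustrated with a minimal sketch. This is not the code from src/raki/metrics/reaggregate.py: SampleResultSketch and reaggregate_scores_sketch are hypothetical names, and the sketch assumes scores is a plain mapping from metric name to an optional float, which the real SampleResult may not match.

```python
from dataclasses import dataclass, field
from statistics import fmean
from typing import Optional


@dataclass
class SampleResultSketch:
    """Hypothetical stand-in for SampleResult; only `scores` is modeled."""

    # Assumption: metric name -> score, where None means "no score available".
    scores: dict[str, Optional[float]] = field(default_factory=dict)


def reaggregate_scores_sketch(sample_results):
    """Average per-sample scores by metric name, skipping None values."""
    collected: dict[str, list[float]] = {}
    for sample in sample_results:
        for name, score in sample.scores.items():
            # Register the metric even when its score is None, so that a
            # metric with only None scores still appears in the output.
            bucket = collected.setdefault(name, [])
            if score is not None:
                bucket.append(score)
    # A metric with no usable scores maps to None rather than 0.0.
    return {
        name: (fmean(values) if values else None)
        for name, values in collected.items()
    }
```

A metric missing from a sample never reaches its bucket, so it is left out of the denominator in the same way as a None score.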

Review Results

rag-specialist — MINOR (addressed)

Finding: Round-trip test omitted triage_calibration and file_prediction_accuracy from roundtrip_metrics. Both metrics populate sample_scores and should round-trip correctly.

Resolution: Added both metrics to the roundtrip_metrics list in the integration test (commit f58a474). The existing if metric_name not in report.aggregate_scores: continue guard handles absence of these metrics in test data.
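
For readers unfamiliar with the guard, here is a rough sketch of how a round-trip loop with that `continue` check might look. The helper name check_round_trip, the plain-dict inputs, and the tolerance are assumptions made for illustration; the actual test in tests/test_reaggregate.py may be structured differently.

```python
import math


def check_round_trip(aggregate_scores, reaggregated, roundtrip_metrics):
    """Compare reaggregated values to engine aggregates; return metrics checked."""
    checked = []
    for metric_name in roundtrip_metrics:
        # Guard: a metric the engine did not produce for this test data
        # is skipped instead of failing the comparison.
        if metric_name not in aggregate_scores:
            continue
        assert math.isclose(
            reaggregated[metric_name],
            aggregate_scores[metric_name],
            rel_tol=1e-9,
        ), f"{metric_name} diverged after reaggregation"
        checked.append(metric_name)
    return checked
```

With this shape, adding triage_calibration and file_prediction_accuracy to the list is safe even when the test data yields no aggregate for them.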

Refs #259

Assisted-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Assigned-by: decko

decko added 4 commits May 18, 2026 18:59
Introduces reaggregate_scores() which collects per-sample metric scores
from SampleResult.scores and computes arithmetic means by metric name.

Key design decisions:
- None scores are skipped from mean calculation; returns None when all
  scores for a metric are None (not 0.0, which would be misleading).
- Metrics absent from a sample are treated the same as None scores --
  skipped and not counted in the denominator.
- review_severity_distribution: aggregate-only metric, produces no
  per-sample scores, silently absent from output (by design).
- self_correction_rate: ratio-of-sums vs mean-of-ratios divergence
  documented in docstring; tested separately in Task 2 integration test.

6 unit tests cover: known scores, None handling, all-None edge case,
missing metrics, empty input, single session.

Assisted-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Assigned-by: soda-pipeline
…#259)

Proves that reaggregate_scores(report.sample_results) reproduces the
engine's aggregate_scores for all metrics with per-sample scores.

Explicitly skipped metrics:
- review_severity_distribution: aggregate-only, no sample_scores, so it
  does not appear in sample_results and cannot be reaggregated.
- self_correction_rate: engine computes ratio-of-sums (resolved /
  total_findings across all sessions); reaggregation computes
  mean-of-ratios (mean of per-session 0.0/1.0), which diverges when
  sessions have different finding counts.
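
A small worked example of the self_correction_rate divergence, using made-up numbers. The two sessions and their finding counts are hypothetical; the per-session scores are taken as given (1.0 or 0.0), as described above.

```python
# Hypothetical sessions: (resolved, total_findings, per_session_score)
sessions = [
    (1, 1, 1.0),  # one finding, resolved
    (0, 4, 0.0),  # four findings, none resolved
]

# Engine-style aggregate: ratio of sums across all sessions.
ratio_of_sums = sum(r for r, _, _ in sessions) / sum(t for _, t, _ in sessions)

# Reaggregation-style aggregate: mean of the per-session scores.
mean_of_ratios = sum(s for _, _, s in sessions) / len(sessions)

print(ratio_of_sums)   # 0.2  (1 resolved out of 5 findings)
print(mean_of_ratios)  # 0.5  (mean of 1.0 and 0.0)
```

The two values agree only when every session carries the same number of findings, which is why this metric is left out of the round-trip comparison.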

The test uses samples with duration_ms and token data so that
phase_execution_time and token_efficiency also produce per-sample scores
and participate in the round-trip verification.

Assisted-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Assigned-by: soda-pipeline
Documents the new reaggregate_scores() utility as a .feature fragment
so it appears in the next CHANGELOG entry.

Assisted-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Assigned-by: soda-pipeline
… round-trip metrics (#259)

Per review finding: both metrics populate sample_scores and should be
explicitly tested in the round-trip integration test. The existing
guard already handles absence of these metrics in test data.
@decko decko added the ai-assisted Implemented with AI assistance label May 18, 2026
@decko decko merged commit 198bab1 into main May 19, 2026
4 checks passed
@decko decko deleted the soda/259 branch May 19, 2026 13:06
