Introduces reaggregate_scores(), which collects per-sample metric scores from SampleResult.scores and computes arithmetic means by metric name.

Key design decisions:
- None scores are skipped from mean calculation; returns None when all scores for a metric are None (not 0.0, which would be misleading).
- Metrics absent from a sample are treated the same as None scores: skipped and not counted in the denominator.
- review_severity_distribution: aggregate-only metric, produces no per-sample scores, silently absent from output (by design).
- self_correction_rate: ratio-of-sums vs mean-of-ratios divergence documented in docstring; tested separately in Task 2 integration test.

6 unit tests cover: known scores, None handling, all-None edge case, missing metrics, empty input, single session.

Assisted-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Assigned-by: soda-pipeline
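The design decisions above can be sketched roughly as follows. This is a minimal illustration, not the code in src/raki/metrics/reaggregate.py: the SampleResult class here is a hypothetical stand-in that models only a scores mapping of metric name to optional float, and the real type may be shaped differently.

```python
from dataclasses import dataclass, field
from statistics import fmean
from typing import Optional


@dataclass
class SampleResult:
    """Hypothetical stand-in; only the `scores` mapping is modelled."""

    scores: dict[str, Optional[float]] = field(default_factory=dict)


def reaggregate_scores(
    sample_results: list[SampleResult],
) -> dict[str, Optional[float]]:
    """Mean per metric name, ignoring None and absent scores."""
    collected: dict[str, list[float]] = {}
    for sample in sample_results:
        for name, score in sample.scores.items():
            # Register the metric even when the score is None, so an
            # all-None metric is still reported (as None, not 0.0).
            bucket = collected.setdefault(name, [])
            if score is not None:
                bucket.append(score)
    # A metric missing from a sample never reaches its bucket, so it is
    # not counted in the denominator.
    return {
        name: (fmean(values) if values else None)
        for name, values in collected.items()
    }
```

Under these assumptions, an aggregate-only metric such as review_severity_distribution never appears in any scores mapping and is therefore simply absent from the result.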
…#259)

Proves that reaggregate_scores(report.sample_results) reproduces the engine's aggregate_scores for all metrics with per-sample scores.

Explicitly skipped metrics:
- review_severity_distribution: aggregate-only, no sample_scores, so it does not appear in sample_results and cannot be reaggregated.
- self_correction_rate: engine computes ratio-of-sums (resolved / total_findings across all sessions); reaggregation computes mean-of-ratios (mean of per-session 0.0/1.0), which diverges when sessions have different finding counts.

The test uses samples with duration_ms and token data so that phase_execution_time and token_efficiency also produce per-sample scores and participate in the round-trip verification.

Assisted-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Assigned-by: soda-pipeline
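A small worked example may help show why self_correction_rate is skipped. The session data below is invented purely for illustration, and the (resolved, total_findings) tuple layout is an assumption rather than the engine's actual data model.

```python
# Hypothetical sessions as (resolved, total_findings) pairs.
sessions = [
    (1, 1),  # one finding, resolved: per-session score 1.0
    (0, 3),  # three findings, none resolved: per-session score 0.0
]

# Ratio-of-sums, as the engine is described to compute it.
ratio_of_sums = sum(r for r, _ in sessions) / sum(t for _, t in sessions)

# Mean-of-ratios, as reaggregation from per-session scores would compute it.
mean_of_ratios = sum(r / t for r, t in sessions) / len(sessions)

print(ratio_of_sums)   # 0.25
print(mean_of_ratios)  # 0.5
```

The two values agree only when every session has the same number of findings, which is why the metric cannot be expected to round-trip in general.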
Documents the new reaggregate_scores() utility as a .feature fragment so it appears in the next CHANGELOG entry.

Assisted-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Assigned-by: soda-pipeline
… round-trip metrics (#259)

Per review finding: both metrics populate sample_scores and should be explicitly tested in the round-trip integration test. The existing guard already handles absence of these metrics in test data.
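The guard mentioned above could look roughly like this. It is a sketch only: the helper name roundtrip_mismatches, the tolerance, and the plain-dict inputs are assumptions made for illustration, not the contents of tests/test_reaggregate.py.

```python
import math
from typing import Optional


def roundtrip_mismatches(
    aggregate_scores: dict[str, Optional[float]],
    reaggregated: dict[str, Optional[float]],
    roundtrip_metrics: list[str],
    tol: float = 1e-9,
) -> list[str]:
    """Return the metrics whose reaggregated value differs from the engine's."""
    mismatches = []
    for metric_name in roundtrip_metrics:
        # Guard: metrics the engine did not produce for this data are skipped.
        if metric_name not in aggregate_scores:
            continue
        expected = aggregate_scores[metric_name]
        actual = reaggregated.get(metric_name)
        if expected is None or actual is None:
            # None only matches None.
            if expected is not actual:
                mismatches.append(metric_name)
        elif not math.isclose(actual, expected, abs_tol=tol):
            mismatches.append(metric_name)
    return mismatches
```

With this shape, listing triage_calibration and file_prediction_accuracy in roundtrip_metrics is harmless when the test data does not produce them, since the guard skips them.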
Summary

Adds the reaggregate_scores() utility function in raki.metrics, which derives dataset-level aggregate scores from per-sample SampleResult.scores, enabling downstream consumers to re-derive aggregate metrics from saved JSON reports without re-running the metrics engine.

Changes

- src/raki/metrics/reaggregate.py (new): Pure utility function collecting per-sample scores, skipping None values from means, returning None for fully-absent metrics. Docstring documents review_severity_distribution and self_correction_rate limitations.
- src/raki/metrics/__init__.py (modified): Exports reaggregate_scores.
- tests/test_reaggregate.py (new): 7 tests: 6 unit tests (known scores, None skipping, all-None → None, missing metrics, empty input, single session) plus a round-trip integration test verifying reaggregated output matches engine aggregate for all per-sample metrics.
- changes/259.feature (new): Towncrier changelog fragment.

Acceptance Criteria

- reaggregate_scores() collects per-sample scores from SampleResult.scores
- None scores are skipped from mean calculation
- All-None metrics return None
- reaggregate_scores(report.sample_results) matches report.aggregate_scores for all per-sample metrics
- review_severity_distribution (aggregate-only) absent from reaggregated output
- triage_calibration and file_prediction_accuracy covered in round-trip test

Review Results

rag-specialist: MINOR (addressed)

Finding: Round-trip test omitted triage_calibration and file_prediction_accuracy from roundtrip_metrics. Both metrics populate sample_scores and should round-trip correctly.

Resolution: Added both metrics to the roundtrip_metrics list in the integration test (commit f58a474). The existing "if metric_name not in report.aggregate_scores: continue" guard handles absence of these metrics in test data.

Refs #259
Assisted-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Assigned-by: decko