feat: metric health checks — detect degenerate and dead metrics (#162) by decko · Pull Request #205 · decko/raki

decko · 2026-04-25T01:29:08Z

Summary

Always-on post-run metric health checks that detect broken or degenerate metrics (Tier 1 — Operational health).

- Low-variance detection: flags metrics whose variance across sessions is below 1e-4. ERROR for uniform 0.0/1.0 (broken wiring), WARNING for other low-variance cases.
- N/A rate detection: two-tier thresholds (>50% WARNING, >80% ERROR) with metric-aware expected N/A ceilings (faithfulness/relevancy expect ~10%, context precision/recall expect ~50%).
- Minimum sample threshold: n >= 5; checks are skipped below it.
- warnings array in the JSON report (machine-readable), plus CLI/HTML rendering.
- --strict-warnings flag promotes health errors to exit code 1 for CI.
- JSONL history: warning_count field for future trending.
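
A minimal sketch of the two checks described above, not the code in health.py. The function names, the use of population variance, and applying the n >= 5 minimum to the N/A check are all assumptions made for illustration. The metric-aware expected N/A ceilings are left out because the summary does not say how they combine with the thresholds.

```python
from statistics import pvariance

MIN_SAMPLES = 5        # below this many data points the checks are skipped
VARIANCE_FLOOR = 1e-4  # variance below this counts as "low"


def low_variance_severity(scores):
    """Return "error", "warning", or None for a list of per-session scores."""
    if len(scores) < MIN_SAMPLES:
        return None  # too few samples to judge
    if pvariance(scores) >= VARIANCE_FLOOR:
        return None  # the metric discriminates between sessions
    # Uniform 0.0 or 1.0 points at broken wiring rather than a real signal.
    if all(s == 0.0 for s in scores) or all(s == 1.0 for s in scores):
        return "error"
    return "warning"


def na_rate_severity(na_count, total_sessions):
    """Two-tier N/A check: more than 50% is a warning, more than 80% an error."""
    if total_sessions < MIN_SAMPLES:
        return None  # assumption: the sample minimum applies here too
    rate = na_count / total_sessions
    if rate > 0.8:
        return "error"
    if rate > 0.5:
        return "warning"
    return None
```

Both functions return None when nothing is wrong, so a caller can treat any non-None result as a finding to report.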

New src/raki/metrics/health.py module (~86 lines), plus integration across engine, CLI, HTML, and history.

Test plan

- uv run pytest tests/ -v -m "not slow": 1259 passed
- Health check logic: variance detection, N/A rate, metric-aware thresholds, sample minimum
- CLI rendering: warnings displayed, quiet mode suppression
- HTML rendering: warnings section in report
- --strict-warnings exit code behavior
- JSON/JSONL serialization round-trip with warnings

Refs #162

🤖 Generated with SODA + Claude Code

Add MetricWarning Pydantic model to model/report.py and implement run_health_checks() in metrics/health.py.

Two checks are implemented:
- dead_metric (error): metric is N/A for >95% of sessions, indicating the sessions lack required data fields.
- degenerate_metric (warning): metric has constant score across all sessions (zero variance), indicating no discriminating signal.

Aggregate-only metrics (empty sample_scores) are skipped.

Assisted-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Assigned-by: soda-orchestrator
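
The two checks in this commit could look roughly like the sketch below. It uses a standard-library dataclass as a stand-in for the Pydantic model, and the field names, the function signature, and the message wording are hypothetical. It also assumes that sample_scores holds only the sessions that produced a score, so the N/A rate is derived from total_sessions, and that a single score is not enough to call a metric degenerate.

```python
from dataclasses import dataclass


@dataclass
class MetricWarning:
    # Field names are illustrative; the real model lives in model/report.py.
    metric: str
    check: str     # "dead_metric" or "degenerate_metric"
    severity: str  # "error" or "warning"
    message: str


def run_health_checks(metric, sample_scores, total_sessions):
    """Return the health warnings for one metric (hypothetical signature)."""
    warnings = []
    if not sample_scores or total_sessions <= 0:
        return warnings  # aggregate-only metric: nothing to inspect

    # Assumption: sessions without a score count as N/A.
    na_rate = 1 - len(sample_scores) / total_sessions
    if na_rate > 0.95:
        warnings.append(MetricWarning(
            metric, "dead_metric", "error",
            f"{metric} is N/A for {na_rate:.0%} of sessions",
        ))

    # A constant score across sessions carries no discriminating signal.
    if len(sample_scores) > 1 and len(set(sample_scores)) == 1:
        warnings.append(MetricWarning(
            metric, "degenerate_metric", "warning",
            f"{metric} scored {sample_scores[0]} in every session",
        ))
    return warnings
```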

Add warning_count: int = 0 field to HistoryEntry so the history log records how many metric health warnings were emitted per run.

The field defaults to 0 for backward compatibility: older JSONL entries without the field load cleanly via Pydantic's default. append_history_entry() now populates warning_count from len(report.warnings).

Assisted-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Assigned-by: soda-orchestrator
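
The backward-compatibility behaviour can be illustrated with a small standard-library sketch. A dataclass stands in for the Pydantic HistoryEntry here, and run_id and load_history() are hypothetical names introduced only for the example.

```python
import json
from dataclasses import dataclass, fields


@dataclass
class HistoryEntry:
    run_id: str             # hypothetical field, for illustration only
    warning_count: int = 0  # default keeps older entries loadable


def load_history(lines):
    """Parse JSONL lines into HistoryEntry objects, ignoring unknown keys."""
    known = {f.name for f in fields(HistoryEntry)}
    entries = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        raw = json.loads(line)
        entries.append(HistoryEntry(**{k: v for k, v in raw.items() if k in known}))
    return entries
```

An entry written before this change, such as `{"run_id": "a"}`, loads with warning_count equal to 0, while newer entries keep the value they were written with.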

After computing all metrics, MetricsEngine.run() now calls run_health_checks() for each MetricResult and collects the resulting MetricWarning list into EvalReport.warnings. The total_sessions count from dataset.samples is passed so the dead-metric N/A rate can be computed correctly.

Assisted-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Assigned-by: soda-orchestrator

When EvalReport.warnings is non-empty, print_summary() now renders a ⚠ Metric health block after the scores, showing:
- A banner line counting errors and warnings (e.g. '⚠ Metric health: 1 error')
- Per-warning bullet lines with check name in parentheses, color-coded red for errors and yellow for warnings.

Uses parentheses (not square brackets) for the check label to avoid Rich treating [check_name] as an unknown markup tag. Uses Console(highlight=False) in tests to prevent Rich's number highlighter from splitting '1 error' across separate ANSI spans.

Assisted-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Assigned-by: soda-orchestrator
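
As a rough illustration of the block layout, the sketch below builds the banner and bullet lines as plain strings containing Rich-style color tags. It does not import Rich and is not the code in print_summary(). The function name, the tuple layout, the pluralization, and the comma separator are assumptions.

```python
def format_health_block(warnings):
    """Build summary lines from (severity, metric, check, message) tuples."""
    if not warnings:
        return []  # no block at all when the report is clean

    errors = sum(1 for w in warnings if w[0] == "error")
    others = len(warnings) - errors
    parts = []
    if errors:
        parts.append(f"{errors} error" + ("" if errors == 1 else "s"))
    if others:
        parts.append(f"{others} warning" + ("" if others == 1 else "s"))
    lines = ["⚠ Metric health: " + ", ".join(parts)]

    for severity, metric, check, message in warnings:
        color = "red" if severity == "error" else "yellow"
        # The check label sits in parentheses so it cannot be read as a markup tag.
        lines.append(f"  [{color}]• {metric} ({check}): {message}[/{color}]")
    return lines
```

Only the color tags use square brackets, so a check name such as dead_metric never appears in a position where markup parsing could misread it.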

Pass metric_warnings=report.warnings to the Jinja2 template context. The template renders a 'Metric Health' table section when warnings are present, listing severity (error/warning), metric name, check name, and full message.

Error warnings use severity-critical CSS; warning warnings use severity-major CSS. The section is omitted entirely when there are no warnings. Jinja2's autoescape=True ensures warning messages with HTML special characters are safely escaped.

Assisted-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Assigned-by: soda-orchestrator
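
The rendering rules can be sketched without Jinja2. The example below uses html.escape from the standard library as a stand-in for the template's autoescaping, so it shows the intended behaviour rather than the actual template. The function name, the tuple layout, and the surrounding markup are assumptions; the two CSS class names come from the commit message.

```python
from html import escape

SEVERITY_CSS = {"error": "severity-critical", "warning": "severity-major"}


def render_metric_health(warnings):
    """Render (severity, metric, check, message) tuples as an HTML table."""
    if not warnings:
        return ""  # the section is omitted when there is nothing to show

    rows = []
    for severity, metric, check, message in warnings:
        css = SEVERITY_CSS.get(severity, "severity-major")
        # Escape every cell so special characters cannot inject markup.
        cells = "".join(
            f"<td>{escape(str(cell))}</td>"
            for cell in (severity, metric, check, message)
        )
        rows.append(f'<tr class="{css}">{cells}</tr>')
    return "<h2>Metric Health</h2>\n<table>\n" + "\n".join(rows) + "\n</table>"
```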

Add --strict-warnings flag to the run command. When set, the command exits with code 1 if any metric health warning with severity='error' is present in the report.

Only 'error' severity triggers non-zero exit; 'warning' severity is informational and does not affect exit code. This matches the ticket spec ('only promotes severity=="error" to exit code 1'). Without --strict-warnings (the default), all warnings are informational only (shown in CLI summary) and do not affect exit code.

Assisted-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Assigned-by: soda-orchestrator
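
The exit-code rule reduces to a small decision function, sketched here with argparse purely as a stand-in, since the CLI framework raki uses is not shown in this PR. build_parser() and health_exit_code() are hypothetical names, and the sketch covers only the health-check contribution to the exit code.

```python
import argparse


def build_parser():
    """Minimal parser exposing only the flag discussed in this commit."""
    parser = argparse.ArgumentParser(prog="raki-sketch")
    parser.add_argument(
        "--strict-warnings",
        action="store_true",
        help="exit with code 1 when any health check reports severity 'error'",
    )
    return parser


def health_exit_code(severities, strict_warnings=False):
    """Return 1 only when strict mode is on and an 'error' is present."""
    if strict_warnings and any(s == "error" for s in severities):
        return 1
    return 0  # 'warning' severity never changes the exit code
```

In CI, this means a run with only degenerate_metric warnings still passes under --strict-warnings, while a single dead_metric error fails the job.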

…th warnings (#162)

Add TestMetricWarningSerialization and TestHistoryEntryWarningSerialization:
- EvalReport.warnings defaults to [] (backward compat)
- EvalReport with warnings survives write/load JSON round-trip
- Old JSON without 'warnings' key loads cleanly with warnings=[]
- warnings appears in JSON output with correct structure
- HistoryEntry.warning_count defaults to 0 (backward compat)
- warning_count survives JSONL append/load round-trip
- Old JSONL entries without warning_count load cleanly with count=0
- MetricsEngine.run() populates report.warnings via health checks

Also updates TestCliSummaryDisplayName.test_summary_does_not_show_raw_names to test_summary_does_not_show_raw_names_in_metric_lines: the new Metric Health section intentionally shows raw metric names for diagnostics, so the assertion now checks only the metric score display section (before the warning block).

Assisted-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Assigned-by: soda-orchestrator

decko added 8 commits April 24, 2026 22:04

chore(changes): add towncrier fragment for ticket #162

5abab7d

Feature: metric health checks (dead_metric, degenerate_metric) with --strict-warnings CLI flag. Assisted-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Assigned-by: soda-orchestrator

decko merged commit 050dc7f into main Apr 25, 2026
4 checks passed

decko deleted the soda/162 branch April 25, 2026 01:30

This was referenced Apr 25, 2026

feat: metric health checks — detect degenerate and dead metrics #162

Closed

chore: bump version to 0.9.1 #206

Merged
