
feat: metric health checks — detect degenerate and dead metrics (#162) #205

Merged: decko merged 8 commits into main from soda/162 on Apr 25, 2026
@decko
Owner

@decko decko commented Apr 25, 2026

Summary

Always-on post-run metric health checks that detect broken or degenerate metrics (Tier 1 — Operational health).

  • Low-variance detection: flags metrics where variance < 1e-4 across sessions. ERROR for uniform 0.0/1.0 (broken wiring), WARNING for other low-variance cases
  • N/A rate detection: two-tier thresholds (>50% WARNING, >80% ERROR) with metric-aware expected N/A ceilings (faithfulness/relevancy expect ~10%, context precision/recall expect ~50%)
  • Minimum sample threshold: n >= 5, below which checks are skipped
  • warnings array in JSON report (machine-readable) + CLI/HTML rendering
  • --strict-warnings flag promotes health errors to exit code 1 for CI
  • JSONL history: warning_count field for future trending

New src/raki/metrics/health.py module (~86 lines), plus integration across engine, CLI, HTML, and history.
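
The two checks listed above can be sketched as follows. This is a minimal illustration, not the contents of `health.py`: the names `HealthWarning`, `check_low_variance`, and `check_na_rate` are hypothetical, and only the thresholds (variance < 1e-4, n >= 5, >50% and >80% N/A) come from the summary. The metric-aware expected N/A ceilings are left out because the summary does not say how they combine with the two tiers, and applying the sample minimum to the total session count in the N/A check is an assumption.

```python
from dataclasses import dataclass
from statistics import pvariance
from typing import Optional

# Thresholds taken from the PR summary; the constant names are hypothetical.
VARIANCE_FLOOR = 1e-4
MIN_SAMPLES = 5
NA_WARNING_RATE = 0.50
NA_ERROR_RATE = 0.80


@dataclass(frozen=True)
class HealthWarning:
    """Hypothetical stand-in for the PR's MetricWarning model."""
    metric: str
    check: str
    severity: str  # "error" or "warning"
    message: str


def check_low_variance(metric: str, scores: list[float]) -> Optional[HealthWarning]:
    """Flag a metric whose per-session scores barely vary."""
    if len(scores) < MIN_SAMPLES:
        return None  # too few samples to judge; the check is skipped
    if pvariance(scores) >= VARIANCE_FLOOR:
        return None
    # Uniform 0.0 or 1.0 suggests broken wiring, so it is reported as an error.
    uniform_extreme = all(s == 0.0 for s in scores) or all(s == 1.0 for s in scores)
    severity = "error" if uniform_extreme else "warning"
    return HealthWarning(metric, "low_variance", severity,
                         f"variance below {VARIANCE_FLOOR} across {len(scores)} sessions")


def check_na_rate(metric: str, scored: int, total_sessions: int) -> Optional[HealthWarning]:
    """Flag a metric that is N/A for too many sessions (two-tier threshold)."""
    if total_sessions < MIN_SAMPLES:
        return None  # assumption: the sample minimum applies to total sessions
    na_rate = (total_sessions - scored) / total_sessions
    if na_rate > NA_ERROR_RATE:
        severity = "error"
    elif na_rate > NA_WARNING_RATE:
        severity = "warning"
    else:
        return None
    return HealthWarning(metric, "na_rate", severity,
                         f"N/A for {na_rate:.0%} of {total_sessions} sessions")
```

For example, `check_low_variance("faithfulness", [0.0] * 6)` yields an error, while six scores of 0.5 yield a warning and three scores of 0.0 yield nothing because the sample minimum is not met.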

Test plan

  • uv run pytest tests/ -v -m "not slow" — 1259 passed
  • Health check logic: variance detection, N/A rate, metric-aware thresholds, sample minimum
  • CLI rendering: warnings displayed, quiet mode suppression
  • HTML rendering: warnings section in report
  • --strict-warnings exit code behavior
  • JSON/JSONL serialization round-trip with warnings

Refs #162

🤖 Generated with SODA + Claude Code

decko added 8 commits April 24, 2026 22:04
Add MetricWarning Pydantic model to model/report.py and implement
run_health_checks() in metrics/health.py. Two checks are implemented:

- dead_metric (error): metric is N/A for >95% of sessions, indicating
  the sessions lack required data fields.
- degenerate_metric (warning): metric has constant score across all
  sessions (zero variance), indicating no discriminating signal.

Aggregate-only metrics (empty sample_scores) are skipped.

Assisted-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Assigned-by: soda-orchestrator

Add warning_count: int = 0 field to HistoryEntry so the history log
records how many metric health warnings were emitted per run. The field
defaults to 0 for backward compatibility — older JSONL entries without
the field load cleanly via Pydantic's default.

append_history_entry() now populates warning_count from len(report.warnings).

Assisted-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Assigned-by: soda-orchestrator
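
The backward-compatible default described in the commit above can be illustrated with a small sketch. The real `HistoryEntry` is a Pydantic model; a stdlib dataclass is used here only to keep the example dependency-free, and `HistoryEntrySketch`, `run_id`, and `load_history` are hypothetical names.

```python
import json
from dataclasses import dataclass, fields


@dataclass
class HistoryEntrySketch:
    """Dataclass stand-in for the Pydantic HistoryEntry (assumed shape)."""
    run_id: str              # hypothetical field, not taken from the PR
    warning_count: int = 0   # default keeps older JSONL lines loadable


def load_history(lines: list[str]) -> list[HistoryEntrySketch]:
    """Parse JSONL lines, filling warning_count with 0 when it is absent."""
    known = {f.name for f in fields(HistoryEntrySketch)}
    entries = []
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines in the log
        raw = json.loads(line)
        entries.append(HistoryEntrySketch(**{k: v for k, v in raw.items() if k in known}))
    return entries


old_line = '{"run_id": "a"}'                       # written before this change
new_line = '{"run_id": "b", "warning_count": 2}'   # written after this change
history = load_history([old_line, new_line])
```

Here the first entry loads with `warning_count == 0` and the second with `warning_count == 2`, which is the behavior the commit relies on Pydantic's default to provide.
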
After computing all metrics, MetricsEngine.run() now calls
run_health_checks() for each MetricResult and collects the resulting
MetricWarning list into EvalReport.warnings.

The total_sessions count from dataset.samples is passed so the dead-metric
N/A rate can be computed correctly.

Assisted-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Assigned-by: soda-orchestrator

When EvalReport.warnings is non-empty, print_summary() now renders a
⚠ Metric health block after the scores, showing:

- A banner line counting errors and warnings (e.g. '⚠ Metric health: 1 error')
- Per-warning bullet lines with check name in parentheses, color-coded
  red for errors and yellow for warnings.

Uses parentheses (not square brackets) for the check label to avoid Rich
treating [check_name] as an unknown markup tag.
Uses Console(highlight=False) in tests to prevent Rich's number highlighter
from splitting '1 error' across separate ANSI spans.

Assisted-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Assigned-by: soda-orchestrator
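
A rough sketch of how the banner and bullet lines in the commit above might be assembled as plain strings. Only the example banner '⚠ Metric health: 1 error' and the use of parentheses for the check name come from the commit; the function names, the comma-separated format for mixed counts, and the bullet layout are assumptions, and no Rich styling is applied here.

```python
def _count_label(count: int, noun: str) -> str:
    """Return '1 error' or '2 errors' style labels."""
    return f"{count} {noun}" + ("" if count == 1 else "s")


def health_banner(severities: list[str]) -> str:
    """Build the banner line; returns '' when there is nothing to report."""
    errors = severities.count("error")
    warnings = severities.count("warning")
    parts = []
    if errors:
        parts.append(_count_label(errors, "error"))
    if warnings:
        parts.append(_count_label(warnings, "warning"))
    if not parts:
        return ""
    # Assumption: mixed counts are joined with a comma.
    return "⚠ Metric health: " + ", ".join(parts)


def warning_line(check: str, message: str) -> str:
    """Format one bullet; parentheses keep the check name out of Rich markup."""
    return f"  - {message} ({check})"
```

Keeping the formatting in pure functions like these also makes it easy to assert on the text without depending on how the console highlights numbers.
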
Pass metric_warnings=report.warnings to the Jinja2 template context.
The template renders a 'Metric Health' table section when warnings are
present, listing severity (error/warning), metric name, check name,
and full message. Warnings with severity 'error' use the severity-critical
CSS class; those with severity 'warning' use severity-major. The section
is omitted entirely when there are no warnings.

Jinja2's autoescape=True ensures warning messages with HTML special
characters are safely escaped.

Assisted-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Assigned-by: soda-orchestrator

Add --strict-warnings flag to the run command. When set, the command
exits with code 1 if any metric health warning with severity='error'
is present in the report.

Only 'error' severity triggers non-zero exit; 'warning' severity is
informational and does not affect exit code. This matches the ticket
spec ('only promotes severity=="error" to exit code 1').

Without --strict-warnings (the default), all warnings are informational
only (shown in CLI summary) and do not affect exit code.

Assisted-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Assigned-by: soda-orchestrator
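
The exit-code rule in the commit above reduces to a small decision function. The sketch below uses argparse purely for illustration, since the CLI framework is not named here; `build_parser` and `resolve_exit_code` are hypothetical names, and only the flag name and the "error severity only" rule come from the commit.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser exposing only the flag discussed here."""
    parser = argparse.ArgumentParser(prog="run")
    parser.add_argument("--strict-warnings", action="store_true",
                        help="exit with code 1 if any health check has severity 'error'")
    return parser


def resolve_exit_code(severities: list[str], strict_warnings: bool) -> int:
    """Return 1 only when strict mode is on and an 'error' severity is present."""
    if strict_warnings and "error" in severities:
        return 1
    # 'warning' severity never changes the exit code; without the flag,
    # nothing from the health checks does.
    return 0


args = build_parser().parse_args(["--strict-warnings"])
code = resolve_exit_code(["warning", "error"], args.strict_warnings)
```

With the flag set and one error present, `code` is 1; the same input without the flag, or a list containing only 'warning', gives 0.
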
…th warnings (#162)

Add TestMetricWarningSerialization and TestHistoryEntryWarningSerialization:

- EvalReport.warnings defaults to [] (backward compat)
- EvalReport with warnings survives write/load JSON round-trip
- Old JSON without 'warnings' key loads cleanly with warnings=[]
- warnings appears in JSON output with correct structure
- HistoryEntry.warning_count defaults to 0 (backward compat)
- warning_count survives JSONL append/load round-trip
- Old JSONL entries without warning_count load cleanly with count=0
- MetricsEngine.run() populates report.warnings via health checks

Also renames TestCliSummaryDisplayName.test_summary_does_not_show_raw_names to
test_summary_does_not_show_raw_names_in_metric_lines. The new Metric Health
section intentionally shows raw metric names for diagnostics, so the assertion
now checks only the metric score display section (before the warning block).

Assisted-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Assigned-by: soda-orchestrator

Feature: metric health checks (dead_metric, degenerate_metric)
with --strict-warnings CLI flag.

Assisted-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Assigned-by: soda-orchestrator
@decko decko merged commit 050dc7f into main Apr 25, 2026
4 checks passed
@decko decko deleted the soda/162 branch April 25, 2026 01:30
