Add MetricWarning Pydantic model to model/report.py and implement
run_health_checks() in metrics/health.py.
Two checks are implemented:
- dead_metric (error): metric is N/A for >95% of sessions, indicating the sessions lack required data fields.
- degenerate_metric (warning): metric has constant score across all sessions (zero variance), indicating no discriminating signal.
Aggregate-only metrics (empty sample_scores) are skipped.
Assisted-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Assigned-by: soda-orchestrator
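The two checks described above can be sketched roughly as follows. This is a minimal illustration, not the code in metrics/health.py: a stdlib dataclass stands in for the Pydantic model so the sketch runs without dependencies, and the field names, the function signature, the use of None for an N/A score, and the dead_threshold parameter are all assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MetricWarning:
    # Stand-in for the Pydantic model; these field names are assumptions.
    metric: str
    check: str
    severity: str  # "error" or "warning"
    message: str


def run_health_checks(
    metric: str,
    sample_scores: List[Optional[float]],
    total_sessions: int,
    dead_threshold: float = 0.95,  # hypothetical parameter for the >95% rule
) -> List[MetricWarning]:
    """Return health warnings for one metric. None marks an N/A score (assumed)."""
    # Aggregate-only metrics have no per-session scores and are skipped.
    if not sample_scores:
        return []

    warnings: List[MetricWarning] = []
    scored = [s for s in sample_scores if s is not None]
    na_rate = 1.0 - len(scored) / total_sessions if total_sessions else 0.0

    if na_rate > dead_threshold:
        warnings.append(MetricWarning(
            metric, "dead_metric", "error",
            f"N/A for {na_rate:.0%} of sessions",
        ))
    # Assumption: a dead metric is not also reported as degenerate.
    elif len(scored) > 1 and len(set(scored)) == 1:
        warnings.append(MetricWarning(
            metric, "degenerate_metric", "warning",
            f"constant score {scored[0]} across all sessions",
        ))
    return warnings
```

Comparing distinct values with set() is one simple way to detect zero variance; a real implementation might use a tolerance for floating-point scores.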
Add warning_count: int = 0 field to HistoryEntry so the history log
records how many metric health warnings were emitted per run.
The field defaults to 0 for backward compatibility: older JSONL entries
without the field load cleanly via Pydantic's default.
append_history_entry() now populates warning_count from
len(report.warnings).
Assisted-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Assigned-by: soda-orchestrator
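The backward-compatibility behavior can be illustrated with a small sketch. A stdlib dataclass stands in for the Pydantic HistoryEntry here, and run_id and load_history() are hypothetical names introduced only for the example; the point is that a defaulted field lets older JSONL lines load without the key.

```python
import json
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class HistoryEntry:
    # Stand-in for the Pydantic model; run_id is a hypothetical field.
    run_id: str
    warning_count: int = 0  # default keeps older entries loadable


def load_history(lines: Iterable[str]) -> List[HistoryEntry]:
    """Parse JSONL lines into entries, skipping blank lines."""
    return [HistoryEntry(**json.loads(line)) for line in lines if line.strip()]


old_line = '{"run_id": "run-1"}'                      # written before the field existed
new_line = '{"run_id": "run-2", "warning_count": 3}'  # written after
entries = load_history([old_line, new_line])
```

Here the old entry loads with warning_count equal to 0 and the new entry keeps its recorded value of 3.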
After computing all metrics, MetricsEngine.run() now calls
run_health_checks() for each MetricResult and collects the resulting
MetricWarning list into EvalReport.warnings.
The total_sessions count from dataset.samples is passed so the
dead-metric N/A rate can be computed correctly.
Assisted-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Assigned-by: soda-orchestrator
When EvalReport.warnings is non-empty, print_summary() now renders a
⚠ Metric health block after the scores, showing:
- A banner line counting errors and warnings (e.g. '⚠ Metric health: 1 error')
- Per-warning bullet lines with check name in parentheses, color-coded red for errors and yellow for warnings.
Uses parentheses (not square brackets) for the check label to avoid
Rich treating [check_name] as an unknown markup tag.
Uses Console(highlight=False) in tests to prevent Rich's number
highlighter from splitting '1 error' across separate ANSI spans.
Assisted-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Assigned-by: soda-orchestrator
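A rough sketch of how the banner and bullet lines might be assembled is shown below. The helper names and the exact line layout are hypothetical; only the '⚠ Metric health: 1 error' banner form, the parenthesized check label, and the red/yellow color coding come from the description above. The functions return Rich-style markup strings but do not import Rich, so the sketch runs with the standard library alone.

```python
from typing import List


def pluralize(count: int, noun: str) -> str:
    """Return '1 error' or '2 errors' style text."""
    return f"{count} {noun}" if count == 1 else f"{count} {noun}s"


def health_banner(severities: List[str]) -> str:
    """Build the banner line from a list of 'error' / 'warning' severities."""
    errors = sum(1 for s in severities if s == "error")
    warns = sum(1 for s in severities if s == "warning")
    parts = []
    if errors:
        parts.append(pluralize(errors, "error"))
    if warns:
        parts.append(pluralize(warns, "warning"))
    return "⚠ Metric health: " + ", ".join(parts)


def warning_line(metric: str, check: str, severity: str, message: str) -> str:
    """Format one bullet line as Rich markup.

    The check name goes in parentheses: writing [dead_metric] would be
    parsed by Rich as a markup tag rather than printed literally.
    A real implementation might also pass the message through
    rich.markup.escape in case it contains square brackets.
    """
    color = "red" if severity == "error" else "yellow"
    return f"[{color}]• {metric} ({check}): {message}[/{color}]"
```

For example, health_banner(["error"]) produces the banner quoted in the commit message, and each warning_line() result could be passed to a Rich Console's print method.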
Pass metric_warnings=report.warnings to the Jinja2 template context.
The template renders a 'Metric Health' table section when warnings are
present, listing severity (error/warning), metric name, check name, and
full message.
Rows with severity 'error' use severity-critical CSS; rows with
severity 'warning' use severity-major CSS.
The section is omitted entirely when there are no warnings.
Jinja2's autoescape=True ensures warning messages with HTML special
characters are safely escaped.
Assisted-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Assigned-by: soda-orchestrator
Add --strict-warnings flag to the run command. When set, the command
exits with code 1 if any metric health warning with severity='error'
is present in the report.
Only 'error' severity triggers non-zero exit; 'warning' severity is
informational and does not affect exit code. This matches the ticket
spec ('only promotes severity=="error" to exit code 1').
Without --strict-warnings (the default), all warnings are informational
only (shown in CLI summary) and do not affect exit code.
Assisted-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Assigned-by: soda-orchestrator
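The exit-code rule described in this commit can be sketched as below. The CLI framework used by the project is not stated, so argparse is used here purely as a stand-in, and resolve_exit_code() and build_parser() are hypothetical names.

```python
import argparse
from typing import Iterable


def resolve_exit_code(severities: Iterable[str], strict_warnings: bool) -> int:
    """Return 1 only when strict mode is on and an 'error' severity is present."""
    if strict_warnings and any(s == "error" for s in severities):
        return 1
    # 'warning' severity, or any severity without the flag, is informational.
    return 0


def build_parser() -> argparse.ArgumentParser:
    # Stand-in parser; the real run command may be defined differently.
    parser = argparse.ArgumentParser(prog="run")
    parser.add_argument(
        "--strict-warnings",
        action="store_true",
        help="exit with code 1 if any health check reports severity 'error'",
    )
    return parser


args = build_parser().parse_args(["--strict-warnings"])
code = resolve_exit_code(["warning", "error"], args.strict_warnings)
```

Keeping the decision in a small pure function makes the three cases (flag off, flag on with only warnings, flag on with an error) easy to test without invoking the CLI.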
…th warnings (#162)
Add TestMetricWarningSerialization and TestHistoryEntryWarningSerialization:
- EvalReport.warnings defaults to [] (backward compat)
- EvalReport with warnings survives write/load JSON round-trip
- Old JSON without 'warnings' key loads cleanly with warnings=[]
- warnings appears in JSON output with correct structure
- HistoryEntry.warning_count defaults to 0 (backward compat)
- warning_count survives JSONL append/load round-trip
- Old JSONL entries without warning_count load cleanly with count=0
- MetricsEngine.run() populates report.warnings via health checks
Also renames TestCliSummaryDisplayName.test_summary_does_not_show_raw_names
to test_summary_does_not_show_raw_names_in_metric_lines: the new Metric
Health section intentionally shows raw metric names for diagnostics, so
the assertion now checks only the metric score display section (before
the warning block).
Assisted-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Assigned-by: soda-orchestrator
Feature: metric health checks (dead_metric, degenerate_metric) with
--strict-warnings CLI flag.
Assisted-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Assigned-by: soda-orchestrator
Summary
Always-on post-run metric health checks that detect broken or degenerate metrics (Tier 1 — Operational health).
- warnings array in JSON report (machine-readable) + CLI/HTML rendering
- --strict-warnings flag promotes health errors to exit code 1 for CI
- warning_count field for future trending
New src/raki/metrics/health.py module (~86 lines), plus integration across engine, CLI, HTML, and history.
Test plan
- uv run pytest tests/ -v -m "not slow": 1259 passed
- --strict-warnings exit code behavior
Refs #162
🤖 Generated with SODA + Claude Code