
feat: warn when judge configs differ in --diff comparison #187

@decko

Summary

When comparing reports via --diff, warn the user if judge configurations differ between baseline and compare reports. Different judge models/providers/temperatures can explain metric deltas that aren't caused by agent quality changes.

Split from #173, which handles field serialization only.

Depends On

Scope (~55K token budget)

| File | Change | Lines |
|---|---|---|
| src/raki/report/diff.py | Add compare_judge_configs() function | ~40 |
| src/raki/report/cli_summary.py | Print judge config warnings in diff output | ~20 |
| src/raki/cli.py | Wire warnings into _handle_diff() | ~15 |
| tests/test_diff.py | All diff warning scenarios | ~80 |
| tests/test_cli.py | CLI integration tests | ~40 |

Behavior

When judge configs differ between reports:

Warning: judge configs differ between reports:
  llm_model: claude-sonnet-4-6 → gemini-2.5-pro
  llm_provider: vertex-anthropic → google
Analytical metric deltas may reflect judge calibration, not agent quality.

When baseline has None (old report without judge fields):

Warning: baseline report has no judge config — cannot compare judge calibration.

Warning is informational only — exit code stays unchanged.
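
To make the None case concrete, here is a minimal sketch of how a judge config could be read from a loaded report so that old reports yield None. It assumes the judge fields sit at the top level of the report dict; the actual layout is defined by the serialization work in #173 and may differ. `extract_judge_config`, `JUDGE_FIELDS`, and the `temperature` key are hypothetical names; only `llm_model` and `llm_provider` appear in this issue.

```python
from __future__ import annotations

# llm_model and llm_provider come from the issue text; "temperature" is an
# assumed key name, and the top-level placement of all three is an assumption.
JUDGE_FIELDS = ("llm_model", "llm_provider", "temperature")


def extract_judge_config(report: dict) -> dict | None:
    """Return the judge fields present in a report, or None if it has none."""
    config = {
        name: report[name]
        for name in JUDGE_FIELDS
        if report.get(name) is not None
    }
    # An old report without any judge fields produces an empty dict,
    # which is collapsed to None to signal "unknown judge config".
    return config or None
```

A report written before the judge fields existed then maps to None, which is the input that triggers the baseline warning above.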

Acceptance Criteria

  • compare_judge_configs(baseline_config, compare_config) returns list of differing fields
  • --diff warns when judge configs differ between reports (per differing field)
  • --diff with matching configs shows no warning
  • --diff with None vs value warns "unknown baseline — cannot compare judge calibration"
  • Warning is informational only — exit code unaffected in all cases
  • Tests: (a) differing configs → warning printed, (b) matching configs → no warning, (c) None vs value → "unknown baseline" warning, (d) exit code unchanged in all scenarios

Implementation Plan

Task 1: Judge config comparison

Files: src/raki/report/diff.py, tests/test_diff.py

  1. Write failing test: two configs with different llm_model → returns [JudgeConfigDiff("llm_model", "sonnet", "gemini")]
  2. Write failing test: matching configs → returns []
  3. Write failing test: baseline None, compare has value → returns sentinel for "unknown baseline"
  4. Add compare_judge_configs(baseline: dict | None, compare: dict | None) -> list[JudgeConfigDiff] (baseline may be None for old reports)
  5. Add JudgeConfigDiff dataclass: field: str, baseline_value: str | None, compare_value: str | None
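
The comparison in Task 1 could look roughly like the sketch below. It is one possible shape under stated assumptions, not the final implementation: the `UNKNOWN_BASELINE` sentinel name, the `_as_text` helper, and the handling of a missing compare config (treated as empty) are choices made here for illustration, since the issue only specifies the None-baseline case.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class JudgeConfigDiff:
    field: str
    baseline_value: str | None
    compare_value: str | None


# Hypothetical sentinel field name meaning "baseline has no judge config".
UNKNOWN_BASELINE = "__unknown_baseline__"


def _as_text(value: object) -> str | None:
    """Stringify a config value, keeping None for missing fields."""
    return None if value is None else str(value)


def compare_judge_configs(
    baseline: dict | None, compare: dict | None
) -> list[JudgeConfigDiff]:
    """Return one JudgeConfigDiff per field whose value differs."""
    if baseline is None and compare is None:
        return []  # assumption: nothing to compare, so nothing to warn about
    if baseline is None:
        return [JudgeConfigDiff(UNKNOWN_BASELINE, None, None)]
    compare = compare or {}  # assumption: a missing compare config acts as empty
    diffs = []
    # Sort the union of keys so the output order is deterministic.
    for key in sorted(baseline.keys() | compare.keys()):
        old, new = baseline.get(key), compare.get(key)
        if old != new:
            diffs.append(JudgeConfigDiff(key, _as_text(old), _as_text(new)))
    return diffs
```

Returning a sentinel entry keeps the return type uniform, so the caller only has to inspect the list rather than handle a separate None result.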

Task 2: CLI warning output

Files: src/raki/report/cli_summary.py, src/raki/cli.py, tests/test_cli.py

  1. Write failing test: --diff with differing judge configs → stderr contains warning text
  2. Write failing test: --diff with matching configs → no warning
  3. Write failing test: --diff with None baseline → "unknown baseline" warning
  4. Write failing test: all warning cases → exit code unchanged
  5. Add print_judge_config_warnings(diffs: list[JudgeConfigDiff], console: Console) to cli_summary.py
  6. Wire into _handle_diff() in cli.py: call compare_judge_configs(), then print_judge_config_warnings()
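
A rough sketch of the warning output from Task 2 follows. To stay self-contained it writes to a plain text stream instead of the `Console` parameter named in step 5, and it redefines `JudgeConfigDiff` locally. The `UNKNOWN_BASELINE` sentinel and the `format_judge_config_warnings` helper are hypothetical names; the message text is copied from the Behavior section.

```python
from __future__ import annotations

import sys
from dataclasses import dataclass
from typing import TextIO


@dataclass(frozen=True)
class JudgeConfigDiff:
    field: str
    baseline_value: str | None
    compare_value: str | None


# Hypothetical sentinel field name meaning "baseline has no judge config".
UNKNOWN_BASELINE = "__unknown_baseline__"


def format_judge_config_warnings(diffs: list[JudgeConfigDiff]) -> list[str]:
    """Build the warning lines; an empty list means no warning."""
    if not diffs:
        return []
    if any(d.field == UNKNOWN_BASELINE for d in diffs):
        # "\u2014" is the dash used in the issue's wording of this message.
        return [
            "Warning: baseline report has no judge config \u2014 "
            "cannot compare judge calibration."
        ]
    lines = ["Warning: judge configs differ between reports:"]
    lines += [f"  {d.field}: {d.baseline_value} → {d.compare_value}" for d in diffs]
    lines.append(
        "Analytical metric deltas may reflect judge calibration, not agent quality."
    )
    return lines


def print_judge_config_warnings(
    diffs: list[JudgeConfigDiff], stream: TextIO | None = None
) -> None:
    """Print warnings to stderr by default; returns nothing, so the caller's
    exit code cannot be influenced by the warning path."""
    out = sys.stderr if stream is None else stream
    for line in format_judge_config_warnings(diffs):
        print(line, file=out)
```

Splitting formatting from printing lets tests assert on the lines directly, while the CLI tests can still check stderr and the unchanged exit code.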

Task 3: Verification

  1. uv run pytest tests/test_diff.py tests/test_cli.py -v
  2. uv run pytest tests/ -v -m "not slow" — no regressions
  3. uv run ruff check src/ tests/ && uv run ruff format src/ tests/
  4. uv run ty check src/raki/

Labels: enhancement
