Summary
When comparing reports via --diff, warn the user if judge configurations differ between baseline and compare reports. Different judge models/providers/temperatures can explain metric deltas that aren't caused by agent quality changes.
Split from #173, which handles field serialization only.
Depends On
Scope (~55K token budget)
| File | Change | Lines |
| --- | --- | --- |
| src/raki/report/diff.py | Add compare_judge_configs() function | ~40 |
| src/raki/report/cli_summary.py | Print judge config warnings in diff output | ~20 |
| src/raki/cli.py | Wire warnings into _handle_diff() | ~15 |
| tests/test_diff.py | All diff warning scenarios | ~80 |
| tests/test_cli.py | CLI integration tests | ~40 |
Behavior
When judge configs differ between reports:
Warning: judge configs differ between reports:
  llm_model: claude-sonnet-4-6 → gemini-2.5-pro
  llm_provider: vertex-anthropic → google
Analytical metric deltas may reflect judge calibration, not agent quality.
When the baseline judge config is None (an old report without judge fields):
Warning: baseline report has no judge config — cannot compare judge calibration.
The warning is informational only; the exit code stays unchanged.
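The two message shapes above could be produced by a small pure function that returns the lines to print. This is a minimal sketch, not the project's code: `format_judge_config_warnings`, the `baseline_missing` flag, the plain `(field, baseline, compare)` tuples, and the two-space indent on the per-field lines are all assumptions made for illustration.

```python
from __future__ import annotations

ARROW = "\u2192"  # the arrow shown in the example output
# Escaped dash keeps the message identical to the example warning text.
NO_BASELINE = (
    "Warning: baseline report has no judge config \u2014 "
    "cannot compare judge calibration."
)


def format_judge_config_warnings(
    diffs: list[tuple[str, str | None, str | None]],
    baseline_missing: bool = False,
) -> list[str]:
    """Return the warning lines to show, or [] when there is nothing to report."""
    if baseline_missing:
        return [NO_BASELINE]
    if not diffs:
        return []
    lines = ["Warning: judge configs differ between reports:"]
    for field, old, new in diffs:
        # One line per differing field (indentation is an assumption).
        lines.append(f"  {field}: {old} {ARROW} {new}")
    lines.append(
        "Analytical metric deltas may reflect judge calibration, not agent quality."
    )
    return lines


example = format_judge_config_warnings(
    [
        ("llm_model", "claude-sonnet-4-6", "gemini-2.5-pro"),
        ("llm_provider", "vertex-anthropic", "google"),
    ]
)
```

Returning lines instead of printing them keeps the wording testable without capturing any stream; the caller decides where the text goes.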
Implementation Plan
Task 1: Judge config comparison
Files: src/raki/report/diff.py, tests/test_diff.py
- Write failing test: two configs with different llm_model → returns [("llm_model", "sonnet", "gemini")]
- Write failing test: matching configs → returns []
- Write failing test: baseline None, compare has value → returns sentinel for "unknown baseline"
- Add compare_judge_configs(baseline: dict, compare: dict) -> list[JudgeConfigDiff]
- Add JudgeConfigDiff dataclass: field: str, baseline_value: str | None, compare_value: str | None
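One way the comparison in Task 1 might look is sketched below. Several details are assumptions rather than anything this issue specifies: the arguments are widened to `dict | None` so the None-baseline case can be passed in, the sentinel is modelled as a `JudgeConfigDiff` whose field is the hypothetical `UNKNOWN_BASELINE` marker, and values are rendered with `str()` so non-string settings such as a temperature fit the `str | None` fields.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical marker for "baseline report carries no judge config at all".
UNKNOWN_BASELINE = "<unknown baseline>"


@dataclass(frozen=True)
class JudgeConfigDiff:
    field: str
    baseline_value: str | None
    compare_value: str | None


def _text(value: object) -> str | None:
    """Render a config value as text; None stays None."""
    return None if value is None else str(value)


def compare_judge_configs(
    baseline: dict | None, compare: dict | None
) -> list[JudgeConfigDiff]:
    """Return one JudgeConfigDiff per field whose value differs."""
    if baseline is None and compare is None:
        return []  # neither report has judge fields: nothing to warn about
    if baseline is None:
        # Sentinel entry the caller can turn into the "unknown baseline" warning.
        return [JudgeConfigDiff(UNKNOWN_BASELINE, None, None)]
    compare = compare or {}
    diffs = []
    # Sort the union of keys so the result order is deterministic.
    for key in sorted(baseline.keys() | compare.keys()):
        old, new = _text(baseline.get(key)), _text(compare.get(key))
        if old != new:
            diffs.append(JudgeConfigDiff(key, old, new))
    return diffs
```

Note that the first test bullet expects a list of plain tuples, while the planned return type is `list[JudgeConfigDiff]`. A dataclass instance does not compare equal to a tuple, so a test against this sketch would compare to `JudgeConfigDiff(...)` instances or convert with `dataclasses.astuple`.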
Task 2: CLI warning output
Files: src/raki/report/cli_summary.py, src/raki/cli.py, tests/test_cli.py
- Write failing test: --diff with differing judge configs → stderr contains warning text
- Write failing test: --diff with matching configs → no warning
- Write failing test: --diff with None baseline → "unknown baseline" warning
- Write failing test: all warning cases → exit code unchanged
- Add print_judge_config_warnings(diffs: list[JudgeConfigDiff], console: Console) to cli_summary.py
- Wire into _handle_diff() in cli.py: call compare_judge_configs(), then print_judge_config_warnings()
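The two properties the Task 2 tests check, that warnings land on stderr and that the exit code is untouched, can be illustrated in isolation. The sketch below is a deliberate simplification: `handle_diff_stub` and `run_and_capture` are hypothetical names, the stub receives already-formatted warning lines rather than `JudgeConfigDiff` objects, and it writes with plain `print` instead of the `Console` object named in the plan.

```python
from __future__ import annotations

import contextlib
import io
import sys


def handle_diff_stub(warning_lines: list[str], exit_code: int) -> int:
    """Hypothetical stand-in for _handle_diff(): emit warnings, keep exit_code."""
    for line in warning_lines:
        print(line, file=sys.stderr)  # warnings go to stderr, not stdout
    return exit_code  # returned exactly as received


def run_and_capture(warning_lines: list[str], exit_code: int) -> tuple[int, str]:
    """Run the stub while capturing stderr, roughly as a CLI test might."""
    buffer = io.StringIO()
    with contextlib.redirect_stderr(buffer):
        code = handle_diff_stub(warning_lines, exit_code)
    return code, buffer.getvalue()


code, stderr_text = run_and_capture(
    ["Warning: judge configs differ between reports:"], exit_code=1
)
```

In the real test suite, pytest's `capsys` fixture would be another way to capture stderr; which mechanism tests/test_cli.py uses is not stated here.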
Task 3: Verification
- uv run pytest tests/test_diff.py tests/test_cli.py -v
- uv run pytest tests/ -v -m "not slow" (expect no regressions)
- uv run ruff check src/ tests/ && uv run ruff format src/ tests/
- uv run ty check src/raki/
Acceptance Criteria
- compare_judge_configs(baseline_config, compare_config) returns list of differing fields
- --diff warns when judge configs differ between reports (per differing field)
- --diff with matching configs shows no warning
- --diff with None vs value warns "unknown baseline" (cannot compare judge calibration)