feat: warn when judge configs differ in --diff comparison

## Summary

When comparing reports via `--diff`, warn the user if the judge configuration differs between the baseline and compare reports. A different judge model, provider, or temperature can produce metric deltas that have nothing to do with changes in agent quality.

Split from #173, which handles field serialization only.

## Depends On

- #173 (judge config fields in report JSON) — must be merged first

## Scope (~55K token budget)

| File | Change | Lines |
|------|--------|-------|
| `src/raki/report/diff.py` | Add `compare_judge_configs()` function | ~40 |
| `src/raki/report/cli_summary.py` | Print judge config warnings in diff output | ~20 |
| `src/raki/cli.py` | Wire warnings into `_handle_diff()` | ~15 |
| `tests/test_diff.py` | All diff warning scenarios | ~80 |
| `tests/test_cli.py` | CLI integration tests | ~40 |

## Behavior

When judge configs differ between reports:
```
Warning: judge configs differ between reports:
  llm_model: claude-sonnet-4-6 → gemini-2.5-pro
  llm_provider: vertex-anthropic → google
Analytical metric deltas may reflect judge calibration, not agent quality.
```

When the baseline report has no judge config (`None`, as in an old report written before the judge fields existed):
```
Warning: baseline report has no judge config — cannot compare judge calibration.
```

Warning is informational only — exit code stays unchanged.
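
A minimal sketch of the informational-only behavior, assuming the warnings are written to stderr (as the CLI tests in Task 2 expect) and that the diff handler has already computed its exit code. `emit_warnings` and its parameters are hypothetical names for illustration, not existing code in `raki`.

```python
import io
import sys
from typing import TextIO


def emit_warnings(
    warning_lines: list[str],
    exit_code: int,
    stream: TextIO = sys.stderr,
) -> int:
    """Write each warning line to `stream` and return `exit_code` untouched.

    Hypothetical helper: the point is only that printing warnings never
    feeds back into the exit status of the diff command.
    """
    for line in warning_lines:
        print(line, file=stream)
    return exit_code


# Usage: capture the output in a buffer instead of the real stderr.
buffer = io.StringIO()
status = emit_warnings(
    ["Warning: judge configs differ between reports:"],
    exit_code=0,
    stream=buffer,
)
```

Keeping the exit code as a pass-through value makes criterion (d) in the tests a one-line assertion per scenario.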


## Acceptance Criteria

- [ ] `compare_judge_configs(baseline_config, compare_config)` returns a list of `JudgeConfigDiff` entries, one per differing field
- [ ] `--diff` warns when judge configs differ between reports (per differing field)
- [ ] `--diff` with matching configs shows no warning
- [ ] `--diff` with a `None` baseline config and a populated compare config prints the "unknown baseline" warning (cannot compare judge calibration)
- [ ] Warning is informational only — exit code unaffected in all cases
- [ ] Tests: (a) differing configs → warning printed, (b) matching configs → no warning, (c) None vs value → "unknown baseline" warning, (d) exit code unchanged in all scenarios



## Implementation Plan

### Task 1: Judge config comparison

**Files:** `src/raki/report/diff.py`, `tests/test_diff.py`

1. Write failing test: two configs with different `llm_model` → returns `[JudgeConfigDiff("llm_model", "sonnet", "gemini")]`
2. Write failing test: matching configs → returns `[]`
3. Write failing test: baseline is `None`, compare has a value → returns a sentinel for "unknown baseline"
4. Add `compare_judge_configs(baseline: dict | None, compare: dict | None) -> list[JudgeConfigDiff]` (the baseline must accept `None` to cover old reports)
5. Add `JudgeConfigDiff` dataclass: `field: str`, `baseline_value: str | None`, `compare_value: str | None`
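
The steps above could be sketched roughly as follows. This is one possible shape, not the final implementation. Assumptions: the "unknown baseline" sentinel is encoded as a single `JudgeConfigDiff` with a reserved field name (`UNKNOWN_BASELINE_FIELD`, a hypothetical constant), values are stringified to match the dataclass types, and a `None` compare config is treated as empty, which the issue does not specify.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class JudgeConfigDiff:
    field: str
    baseline_value: str | None
    compare_value: str | None


# Hypothetical sentinel: marks "baseline has no judge config at all".
UNKNOWN_BASELINE_FIELD = "<unknown baseline>"


def _as_str(value: object) -> str | None:
    """Stringify a config value, keeping None as None."""
    return None if value is None else str(value)


def compare_judge_configs(
    baseline: dict | None,
    compare: dict | None,
) -> list[JudgeConfigDiff]:
    """Return one JudgeConfigDiff per field whose value differs."""
    if baseline is None and compare is None:
        return []  # nothing to compare on either side
    if baseline is None:
        # Old report without judge fields: signal it with the sentinel.
        return [JudgeConfigDiff(UNKNOWN_BASELINE_FIELD, None, None)]
    compare = compare or {}  # assumption: missing compare config == empty
    diffs: list[JudgeConfigDiff] = []
    # Sorted union of keys gives a deterministic order for the output.
    for key in sorted(baseline.keys() | compare.keys()):
        old, new = baseline.get(key), compare.get(key)
        if old != new:
            diffs.append(JudgeConfigDiff(key, _as_str(old), _as_str(new)))
    return diffs
```

A dedicated sentinel entry keeps the return type uniform (`list[JudgeConfigDiff]`), so callers can tell "whole baseline missing" apart from "one field unset" without a second return value.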

### Task 2: CLI warning output

**Files:** `src/raki/report/cli_summary.py`, `src/raki/cli.py`, `tests/test_cli.py`

1. Write failing test: `--diff` with differing judge configs → stderr contains warning text
2. Write failing test: `--diff` with matching configs → no warning
3. Write failing test: `--diff` with None baseline → "unknown baseline" warning
4. Write failing test: all warning cases → exit code unchanged
5. Add `print_judge_config_warnings(diffs: list[JudgeConfigDiff], console: Console)` to `cli_summary.py`
6. Wire into `_handle_diff()` in `cli.py`: call `compare_judge_configs()`, then `print_judge_config_warnings()`
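
A possible way to build the warning text from Task 1's output, shown as a pure formatting function so it can be tested without a console. `format_judge_config_warnings` and `UNKNOWN_BASELINE_FIELD` are hypothetical names; the real `print_judge_config_warnings()` could pass these lines to its `console` argument. The message strings are copied from the Behavior section.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class JudgeConfigDiff:
    field: str
    baseline_value: str | None
    compare_value: str | None


# Hypothetical sentinel field name for "baseline has no judge config".
UNKNOWN_BASELINE_FIELD = "<unknown baseline>"


def format_judge_config_warnings(diffs: list[JudgeConfigDiff]) -> list[str]:
    """Turn judge config diffs into the warning lines to print."""
    if not diffs:
        return []  # matching configs: no warning at all
    if any(d.field == UNKNOWN_BASELINE_FIELD for d in diffs):
        # \u2014 is the dash used in the Behavior section's message.
        return [
            "Warning: baseline report has no judge config \u2014 "
            "cannot compare judge calibration."
        ]
    lines = ["Warning: judge configs differ between reports:"]
    for d in diffs:
        lines.append(f"  {d.field}: {d.baseline_value} \u2192 {d.compare_value}")
    lines.append(
        "Analytical metric deltas may reflect judge calibration, not agent quality."
    )
    return lines
```

Separating formatting from printing keeps the unit tests in `tests/test_diff.py` free of console capture, leaving only the stderr and exit-code checks for `tests/test_cli.py`.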

### Task 3: Verification

1. `uv run pytest tests/test_diff.py tests/test_cli.py -v`
2. `uv run pytest tests/ -v -m "not slow"` — no regressions
3. `uv run ruff check src/ tests/ && uv run ruff format src/ tests/`
4. `uv run ty check src/raki/`
