feat: serialize judge config fields into report JSON #173

@decko

Description

Summary

The report JSON's config section stores llm_model but not the judge's provider, temperature, or max_tokens. This issue covers the first half of the judge config work: adding the fields and serializing them.

The --diff warning comparison is handled separately in #187.

Scope (~60K token budget)

File                            Change                                                                  Lines
src/raki/metrics/protocol.py    Add max_tokens: int | None = None to MetricConfig                       ~5
src/raki/model/report.py        Add llm_provider, llm_temperature, llm_max_tokens fields to EvalReport  ~10
src/raki/metrics/engine.py      Serialize all judge fields into report.config dict                      ~10
src/raki/report/json_report.py  No changes needed (uses model_dump)                                     0
tests/test_report.py            Config serialization tests, backward compat                             ~60
tests/test_cli.py               Verify JSON output includes new fields                                  ~30

Current State

"config": {
  "llm_model": "claude-sonnet-4-6",
  "metrics": [...],
  "skip_llm": false
}

Target State

"config": {
  "llm_provider": "vertex-anthropic",
  "llm_model": "claude-sonnet-4-6",
  "llm_temperature": 0.0,
  "llm_max_tokens": 4096,
  "metrics": [...],
  "skip_llm": false
}

Backward Compatibility

Old reports missing llm_provider/llm_temperature/llm_max_tokens load without error — fields default to None.

Acceptance Criteria

  • max_tokens: int | None = None added to MetricConfig in protocol.py
  • llm_provider: str | None, llm_temperature: float | None, llm_max_tokens: int | None fields on EvalReport (or serialized into config dict)
  • MetricsEngine.run() populates all four judge fields (llm_provider, llm_model, llm_temperature, llm_max_tokens) into report config when LLM is used
  • When skip_llm=True, judge fields are None in report config
  • Old JSON reports without these fields load via load_json_report() without error (default to None)
  • No duplicate keys in config dict (consolidate llm_model if needed)
  • Tests: (a) engine sets all judge fields when LLM used, (b) engine sets None when skip_llm, (c) old report without fields loads cleanly, (d) roundtrip JSON serialization preserves new fields
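
Criterion (d) and the duplicate-key criterion can be checked together at the JSON level. The sketch below is only an illustration using the standard library: `roundtrip`, `reject_duplicate_keys`, and `target_config` are hypothetical names, the empty `metrics` list is a placeholder for the elided list in the target state, and the real tests would go through the project's own report model instead of a bare dict.

```python
import json
from typing import Any


def reject_duplicate_keys(pairs: list[tuple[str, Any]]) -> dict[str, Any]:
    """object_pairs_hook that fails if a JSON object repeats a key."""
    seen: dict[str, Any] = {}
    for key, value in pairs:
        if key in seen:
            raise ValueError(f"duplicate key in report JSON: {key!r}")
        seen[key] = value
    return seen


def roundtrip(config: dict[str, Any]) -> dict[str, Any]:
    """Serialize to JSON text and parse it back, rejecting repeated keys."""
    return json.loads(json.dumps(config), object_pairs_hook=reject_duplicate_keys)


# Values copied from the target state; "metrics" is left empty as a placeholder.
target_config = {
    "llm_provider": "vertex-anthropic",
    "llm_model": "claude-sonnet-4-6",
    "llm_temperature": 0.0,
    "llm_max_tokens": 4096,
    "metrics": [],
    "skip_llm": False,
}

# The new judge fields survive a dump/load cycle unchanged.
assert roundtrip(target_config) == target_config
```

A Python dict cannot hold the same key twice, so the hook only matters if the JSON text is assembled from more than one source; inside the engine, the duplicate-key criterion is mostly about writing llm_model in exactly one place.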

Implementation Plan

Task 1: Add max_tokens to MetricConfig

Files: src/raki/metrics/protocol.py, tests/test_report.py

  1. Write failing test: MetricConfig(max_tokens=4096) → config.max_tokens == 4096
  2. Add max_tokens: int | None = None to MetricConfig
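
A minimal sketch of the Task 1 tests, assuming MetricConfig can be modeled as a plain dataclass. The real class lives in src/raki/metrics/protocol.py and may be defined differently; the `llm_model` field and both test names are assumptions made for illustration, and only `max_tokens` comes from this issue.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class MetricConfig:
    """Hypothetical stand-in for raki's MetricConfig."""

    llm_model: str | None = None  # assumed existing field
    max_tokens: int | None = None  # new field added by Task 1


def test_metric_config_accepts_max_tokens() -> None:
    config = MetricConfig(max_tokens=4096)
    assert config.max_tokens == 4096


def test_metric_config_max_tokens_defaults_to_none() -> None:
    # Existing callers that never pass max_tokens keep working.
    assert MetricConfig().max_tokens is None
```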

Task 2: Serialize judge fields in engine

Files: src/raki/metrics/engine.py, tests/test_report.py

  1. Write failing test: run engine with requires_llm=True metric → report config has llm_provider, llm_model, llm_temperature, llm_max_tokens
  2. Write failing test: run engine with skip_llm=True → report config has None for all judge fields
  3. Update MetricsEngine.run() to populate config dict with all four fields from self._config
  4. Remove any duplicate llm_model key
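
The population step in Task 2 could look roughly like the sketch below. `JudgeSettings` and `build_report_config` are hypothetical names standing in for `self._config` and the relevant part of `MetricsEngine.run()`, and the attribute names on the settings object are assumptions. It also only keys off `skip_llm`; whether the real engine additionally checks that some metric has requires_llm=True is not covered here.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any


@dataclass
class JudgeSettings:
    """Hypothetical stand-in for the engine's self._config."""

    llm_provider: str | None = None
    llm_model: str | None = None
    temperature: float | None = None
    max_tokens: int | None = None


def build_report_config(
    settings: JudgeSettings, metrics: list[str], skip_llm: bool
) -> dict[str, Any]:
    """Build the report's config dict, writing each judge key exactly once."""
    llm_used = not skip_llm
    return {
        # Judge fields are None when no LLM judge ran.
        "llm_provider": settings.llm_provider if llm_used else None,
        "llm_model": settings.llm_model if llm_used else None,
        "llm_temperature": settings.temperature if llm_used else None,
        "llm_max_tokens": settings.max_tokens if llm_used else None,
        "metrics": list(metrics),
        "skip_llm": skip_llm,
    }
```

Building the dict in one literal, rather than updating an existing dict in several places, is one way to keep llm_model from being set twice.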

Task 3: Backward compatibility

Files: tests/test_report.py

  1. Write failing test: load a JSON fixture without judge fields → no error, missing fields are None
  2. Verify load_json_report() handles missing fields gracefully (Pydantic defaults)
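
The backward-compatibility behaviour can be illustrated without the project code. The sketch below is a standard-library stand-in, not the real load_json_report(): `ReportConfig`, `load_config`, and `OLD_REPORT_JSON` are hypothetical, and the real loader is expected to get the same result from Pydantic field defaults.

```python
from __future__ import annotations

import json
from dataclasses import dataclass, field, fields

# Shaped like the "Current State" config, with an empty metrics placeholder.
OLD_REPORT_JSON = (
    '{"config": {"llm_model": "claude-sonnet-4-6", "metrics": [], "skip_llm": false}}'
)


@dataclass
class ReportConfig:
    """Hypothetical stand-in for the report's config section."""

    llm_model: str | None = None
    llm_provider: str | None = None  # new, optional
    llm_temperature: float | None = None  # new, optional
    llm_max_tokens: int | None = None  # new, optional
    metrics: list[str] = field(default_factory=list)
    skip_llm: bool = False


def load_config(raw: str) -> ReportConfig:
    """Parse a report and let missing judge fields fall back to None."""
    data = json.loads(raw)["config"]
    known = {f.name for f in fields(ReportConfig)}
    return ReportConfig(**{key: value for key, value in data.items() if key in known})


old = load_config(OLD_REPORT_JSON)
assert old.llm_model == "claude-sonnet-4-6"
assert old.llm_provider is None
```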

Task 4: Verification

  1. uv run pytest tests/test_report.py -v
  2. uv run pytest tests/ -v -m "not slow" — no regressions
  3. uv run ruff check src/ tests/ && uv run ruff format src/ tests/
  4. uv run ty check src/raki/

Labels

enhancement: New feature or request
