feat: LLM-as-Judge evaluator + timeline scoring #56
Add LLM-as-Judge evaluator that uses function calling to score agent
outputs on criteria like correctness, coherence, safety, relevance,
and task_completion.
Rust (Phase 1):
- New `llm_judge` match arm in evaluator.rs dispatches to Python subprocess
- `run_llm_judge_evaluator()` sends {input, output, expected, config} on stdin
(extends custom protocol with config field)
- 120s timeout (vs 30s for custom evaluators) for LLM call latency
- `builtin_types()` updated to include `llm_judge`
- CLI config shorthand: `-c correctness` → `{"criteria": "correctness"}`
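The stdin protocol and the `-c` shorthand above can be sketched in Python as follows. This is a minimal illustration, not the Rust implementation: `expand_config` and `build_payload` are hypothetical names, and passing a full JSON object through unchanged is an assumption.

```python
import json


def expand_config(value: str) -> dict:
    """Expand a `-c` value into a config dict.

    Assumption: a JSON object is passed through as-is, while anything
    else (e.g. the bare word `correctness`) becomes a criteria preset.
    """
    try:
        parsed = json.loads(value)
    except json.JSONDecodeError:
        return {"criteria": value}
    return parsed if isinstance(parsed, dict) else {"criteria": value}


def build_payload(input_text, output_text, expected, config: dict) -> str:
    """Serialize the request that is written to the judge subprocess's stdin."""
    return json.dumps(
        {
            "input": input_text,
            "output": output_text,
            "expected": expected,
            "config": config,
        }
    )


payload = build_payload("What is 2+2?", "4", "4", expand_config("correctness"))
```

The only difference from the custom evaluator protocol described above is the extra `config` key.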
Python (Phase 2):
- `llm_judge.py` module with 5 criteria presets and Braintrust-style
function calling pattern for structured choice+reasoning extraction
- Lazy openai import with clear error for missing dependency
- Retry on 429/5xx with exponential backoff, max 3 attempts
- temperature=0 default for deterministic scoring
- Validates `expected` is provided for reference-based criteria (correctness)
- `llm_judge_evaluator()` factory for configuring criteria/model
- Subprocess entry point: `python3 -m rewind_agent.llm_judge`
- Registered in `_BUILTIN_EVALUATORS` + exported from `__init__.py`
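For readers unfamiliar with the function calling pattern mentioned above, here is a rough sketch of how a structured choice plus reasoning might be requested and mapped to a score. The tool name `select_choice`, the choice labels, and the score values are all assumptions for illustration; the actual presets in `llm_judge.py` may differ.

```python
import json

# Hypothetical choice-to-score table; the real presets may use other labels or values.
CHOICE_SCORES = {"A": 1.0, "B": 0.5, "C": 0.0}


def build_tool_schema(choices):
    """Build a tool definition in the OpenAI chat-completions `tools` format.

    Forcing the model to call this function yields structured
    `choice` and `reasoning` fields instead of free text.
    """
    return {
        "type": "function",
        "function": {
            "name": "select_choice",  # hypothetical tool name
            "description": "Record the judgment and the reasoning behind it.",
            "parameters": {
                "type": "object",
                "properties": {
                    "reasoning": {"type": "string"},
                    "choice": {"type": "string", "enum": list(choices)},
                },
                "required": ["reasoning", "choice"],
            },
        },
    }


def score_from_arguments(arguments_json: str, choice_scores=CHOICE_SCORES):
    """Map the tool call's JSON arguments to a (score, reasoning) pair."""
    args = json.loads(arguments_json)
    choice = args["choice"]
    if choice not in choice_scores:
        raise ValueError(f"unexpected choice: {choice!r}")
    return choice_scores[choice], args.get("reasoning", "")
```

Constraining `choice` to an enum keeps the score mapping a plain dictionary lookup.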
Tests: 4 Rust + 33 Python tests, all passing. Zero regressions.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Wire LLM-as-judge scoring into the fork/diff workflow. Users can now
score original vs. forked timelines after a replay.
Store:
- New `timeline_scores` table with UNIQUE(timeline_id, evaluator_id)
constraint for upsert semantics on re-scoring
- `TimelineScore` model with session/timeline/evaluator refs + blob storage
- `create_timeline_score()` with ON CONFLICT upsert
- `get_timeline_scores()` and `get_timeline_score()` for retrieval
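The upsert semantics can be illustrated with Python's `sqlite3` module as a stand-in for the store. This is a sketch only: the column set is reduced to three hypothetical columns, and SQLite itself is an assumption about the backing database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE timeline_scores (
        timeline_id  TEXT NOT NULL,
        evaluator_id TEXT NOT NULL,
        score        REAL NOT NULL,
        UNIQUE(timeline_id, evaluator_id)
    )
    """
)


def create_timeline_score(conn, timeline_id, evaluator_id, score):
    """Insert a score, or overwrite the existing row for the same pair."""
    conn.execute(
        """
        INSERT INTO timeline_scores (timeline_id, evaluator_id, score)
        VALUES (?, ?, ?)
        ON CONFLICT(timeline_id, evaluator_id)
        DO UPDATE SET score = excluded.score
        """,
        (timeline_id, evaluator_id, score),
    )


# Re-scoring the same (timeline, evaluator) pair replaces the earlier row.
create_timeline_score(conn, "main", "correctness", 0.5)
create_timeline_score(conn, "main", "correctness", 1.0)
rows = conn.execute(
    "SELECT timeline_id, evaluator_id, score FROM timeline_scores"
).fetchall()
```

After both calls the table holds a single row, which is what makes re-scoring safe to repeat.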
Eval crate:
- New `timeline_scoring` module with `extract_timeline_output()`:
extracts (input, output) from a timeline's steps by finding
first/last LlmCall steps, skipping ToolCall/ToolResult steps
- `validate_session_for_scoring()` helper
- 5 tests covering extraction, tool-step skipping, and validation
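The extraction rule described above can be sketched in Python. The `Step` shape and its field names are hypothetical, and taking the input from the first LlmCall's request and the output from the last LlmCall's response is an assumption about what "first/last" means here.

```python
from dataclasses import dataclass


@dataclass
class Step:
    kind: str           # "LlmCall", "ToolCall", or "ToolResult"
    request: str = ""   # hypothetical field: what was sent to the model
    response: str = ""  # hypothetical field: what the model returned


def extract_timeline_output(steps):
    """Return (input, output) for a timeline, ignoring tool steps."""
    llm_calls = [s for s in steps if s.kind == "LlmCall"]
    if not llm_calls:
        raise ValueError("timeline has no LlmCall steps to score")
    # Input comes from the first LLM call, output from the last one.
    return llm_calls[0].request, llm_calls[-1].response


steps = [
    Step("LlmCall", request="Find the weather in Paris", response="calling tool"),
    Step("ToolCall"),
    Step("ToolResult"),
    Step("LlmCall", request="tool said 18C", response="It is 18C in Paris."),
]
pair = extract_timeline_output(steps)
```

A timeline with a single LlmCall simply uses that step for both halves of the pair.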
CLI:
- New `rewind eval score <session>` subcommand
- `--evaluator/-e`: specify evaluators (required, multiple allowed)
- `--timeline/-t`: score specific timeline (default: main)
- `--compare-timelines`: score ALL timelines, show comparison table
- `--expected`: JSON for reference-based criteria like correctness
- `--json`: machine-readable output for CI
- Score caching: re-running same evaluator on same timeline returns
cached result from `timeline_scores` table
- Pretty table output with delta calculation and verdict
Usage: `rewind replay latest --from 5` then
`rewind eval score latest -e correctness --compare-timelines`
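One way to read the delta calculation behind `--compare-timelines` is to compare every fork against the main timeline. The sketch below assumes that reading; `fork_deltas`, `verdict`, the verdict labels, and the tolerance are hypothetical and not taken from the CLI.

```python
def fork_deltas(scores: dict, baseline: str = "main") -> dict:
    """Return score(fork) - score(baseline) for every non-baseline timeline."""
    base = scores[baseline]
    return {
        name: round(score - base, 4)
        for name, score in scores.items()
        if name != baseline
    }


def verdict(delta: float, tolerance: float = 0.01) -> str:
    """Turn a delta into a coarse label (hypothetical labels and tolerance)."""
    if delta > tolerance:
        return "improved"
    if delta < -tolerance:
        return "regressed"
    return "unchanged"


deltas = fork_deltas({"main": 0.5, "fork-1": 0.75, "fork-2": 0.25})
```

Keying the result by fork name keeps the comparison well defined for three or more timelines.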
Tests: 202 Rust tests passing, clippy clean. Zero regressions.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Code Review: LLM-as-Judge Evaluator + Timeline Scoring
Overall: Good implementation that closely follows the plan. 2 bugs, 1 security concern, 2 behavioral issues to address.
Bugs (P0)
1. Retry loop matches "429" as substring, not HTTP status code
err_str = str(e)
if "429" in err_str or "500" in err_str or "502" in err_str or "503" in err_str:
This matches "429" anywhere in the error string. An error like
2. Score cache has no bypass; stale scores returned after code changes
Once scored, Add a
Security (P1)
3. API key may leak into stored
If the Python subprocess crashes during client construction, stderr may contain the API key (e.g.,
Behavioral Issues (P2)
4.
5. Multi-fork delta compares first/last timeline, not main vs. each fork
Delta is
Nits (P3)
6.
7.
Tests: Solid ✓
42 tests (9 Rust + 33 Python), good coverage of presets, template rendering, expected validation, choice→score mappings, error paths, SDK integration, and subprocess entry point. Missing: a test for the retry logic (429 → retry → succeed on attempt 2).
Recommend
Fix #1 (retry by exception type) and #2 (add |
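To make the review's first point concrete, here is a sketch of retrying by exception type rather than by substring. The exception classes are local stand-ins so the block runs without the `openai` package; `is_retryable`, `call_with_retry`, and the delay values are hypothetical, not the module's code.

```python
import time


class RateLimitError(Exception):
    """Stand-in for an SDK's rate-limit (HTTP 429) exception."""


class APIStatusError(Exception):
    """Stand-in for an SDK error that carries an HTTP status code."""

    def __init__(self, message, status_code):
        super().__init__(message)
        self.status_code = status_code


RETRYABLE_STATUS = {500, 502, 503}


def is_retryable(exc: Exception) -> bool:
    """Decide from the exception's type and status code, never its message."""
    if isinstance(exc, (RateLimitError, ConnectionError, TimeoutError)):
        return True
    return isinstance(exc, APIStatusError) and exc.status_code in RETRYABLE_STATUS


def call_with_retry(fn, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn, retrying retryable errors with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as exc:
            if attempt == max_attempts or not is_retryable(exc):
                raise
            sleep(base_delay * 2 ** (attempt - 1))  # 1s, 2s, ...
```

With this shape, an error whose message merely contains "500" is not retried, because the message text is never inspected.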
Fixes all items from PR #56 review:
#1 (P0 bug) Retry by OpenAI exception type, not string matching:
- New `_is_retryable()` checks `openai.RateLimitError`, `openai.APIStatusError(500/502/503)`, `ConnectionError`, `TimeoutError`
- "Token limit 500 exceeded" no longer triggers false retries
- Added 7 tests including retry-then-succeed integration test
#2 (P0 bug) Add --force flag to bypass stale score cache:
- `rewind eval score --force` skips the `get_timeline_score` cache check
- Upsert in `create_timeline_score` now reachable after code changes
#3 (P1 security) Sanitize API keys from stderr before storing:
- `sanitize_secrets()` redacts `sk-[a-zA-Z0-9_-]{10,}` patterns
- Applied to LLM judge subprocess stderr before writing to reasoning
- 3 tests for single key, multiple keys, and clean text
#4 (P2) Remove unused `_session_id` from `extract_timeline_output()`:
- Public API now takes `(store, timeline_id)`, a cleaner signature
- All callers (CLI + tests) updated
#5 (P2) Compare each fork against main, not just first vs last:
- Delta section now shows one line per fork, each compared to main
- Works correctly for 3+ timelines
#6 (P3) ValueError exits with code 1 in subprocess:
- `ValueError` and `RuntimeError` now exit non-zero so Rust surfaces them as errors, not misleading zero scores
- Test added for correctness-without-expected path
Tests: 26 Rust + 41 Python passing, clippy clean.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
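The redaction in fix #3 can be sketched with the pattern quoted in the commit message. The function body and the `[REDACTED]` placeholder are assumptions; only the regular expression comes from the text above.

```python
import re

# Pattern quoted in the commit message: "sk-" followed by 10+ key characters.
SECRET_PATTERN = re.compile(r"sk-[a-zA-Z0-9_-]{10,}")


def sanitize_secrets(text: str, placeholder: str = "[REDACTED]") -> str:
    """Replace anything that looks like an API key before the text is stored."""
    return SECRET_PATTERN.sub(placeholder, text)


cleaned = sanitize_secrets("auth failed for sk-abcdefghij1234 and sk-ZYXWVUTSRQ_99")
```

Short strings such as `sk-short` fall below the 10-character minimum and are left untouched.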
Documentation:
- docs/evaluation.md: LLM-as-judge section with setup, scoring, comparison, built-in criteria table, config options, Python SDK usage, cost guidance
- Updated built-in evaluators table and CLI commands reference
- Added examples links
Examples:
- examples/13_llm_judge.py: basic LLM-as-judge eval with correctness and coherence criteria, showing advantage over exact_match
- examples/14_fork_and_score.py: walkthrough of the full replay -> score -> prove-it-works loop (CLI + Python SDK)
MCP server:
- New `score_timelines` tool: score session timelines using evaluators, supports compare_timelines, expected, and force params
- Caches scores, returns per-timeline breakdown with avg
Version bump (0.7.0 -> 0.8.0):
- Cargo.toml workspace version: 0.8.0
- python/pyproject.toml: 0.10.0
- python/rewind_agent/__init__.py: 0.10.0
- python/rewind_cli.py CLI_VERSION: 0.8.0
Tests: 205 Rust + 233 Python passing, clippy clean.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When the Python subprocess exits non-zero (e.g., missing API key, missing expected value), it writes structured JSON to stdout before exiting. Previously the Rust side only read stderr on failure, producing empty error messages like "stderr: (empty)". The Rust side now reads stdout first and parses the JSON error, surfacing clear messages like "LLM judge config error: requires an API key" instead.
Found during e2e testing with real demo data.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
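The stdout-first handling can be sketched in Python, though the actual change lives on the Rust side. `describe_failure`, the `error` field name, and the message prefix are assumptions for illustration.

```python
import json


def describe_failure(stdout: str, stderr: str) -> str:
    """Build an error message for a subprocess that exited non-zero.

    Prefer a structured JSON error on stdout; fall back to stderr.
    """
    try:
        payload = json.loads(stdout)
    except json.JSONDecodeError:
        payload = None
    # `error` is a hypothetical field name for the structured message.
    if isinstance(payload, dict) and payload.get("error"):
        return f"LLM judge error: {payload['error']}"
    return f"stderr: {stderr.strip() or '(empty)'}"


message = describe_failure('{"error": "requires an API key"}', "")
```

Falling back to stderr keeps crashes that happen before any JSON is written visible.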
Summary
- `llm_judge` evaluator type that uses LLM function calling (Braintrust pattern) to score agent outputs on criteria like correctness, coherence, safety, relevance, and task_completion
- `rewind eval score` CLI command that scores session timelines and compares original vs. forked timelines after a replay
- `timeline_scores` table, enabling cached re-scoring and persistence
What's included
Phase 1: Rust evaluator type
- `llm_judge` match arm in `evaluator.rs` dispatching to Python subprocess
- `run_llm_judge_evaluator()` with 120s timeout, sends `{input, output, expected, config}` on stdin
- `builtin_types()` updated, CLI config shorthand (`-c correctness` → `{"criteria": "correctness"}`)
Phase 2: Python LLM judge module
- `llm_judge.py` with 5 criteria presets and full prompt templates (correctness, coherence, relevance, safety, task_completion)
- `openai` import with clear error, retry on 429/5xx, `temperature=0` default
- `expected` is required for reference-based criteria (correctness)
- `llm_judge_evaluator()` factory + registered in `_BUILTIN_EVALUATORS`
- `python3 -m rewind_agent.llm_judge`
Phase 3: Fork-score integration
- `timeline_scores` table with `UNIQUE(timeline_id, evaluator_id)` upsert constraint
- `extract_timeline_output()`: finds first/last LlmCall, skips ToolCall/ToolResult steps
- `rewind eval score <session> -e <evaluator> [--compare-timelines] [--expected <json>] [--json]`
Stats
Usage
Test plan
- `builtin_types()` includes `llm_judge`, `is_valid_type()` validates it
- `extract_timeline_output()` finds first/last LlmCall, skips tool steps
- `validate_session_for_scoring()` errors on missing session
- `expected` raises clear `ValueError`
- `expected`
- `llm_judge` registered in `_BUILTIN_EVALUATORS` and resolvable by string
- `rewind eval evaluator create` + `rewind eval score` e2e flow
- `--compare-timelines` with fork/replay produces delta table
🤖 Generated with Claude Code