feat: LLM-as-Judge evaluator + timeline scoring by risjai · Pull Request #56 · agentoptics/rewind

risjai · 2026-04-13T04:40:52Z

Summary

Add llm_judge evaluator type that uses LLM function calling (Braintrust pattern) to score agent outputs on criteria like correctness, coherence, safety, relevance, and task_completion
Add rewind eval score CLI command that scores session timelines and compares original vs. forked timelines after a replay
Wire scoring into the store with a new timeline_scores table, enabling cached re-scoring and persistence

What's included

Phase 1 — Rust evaluator type

llm_judge match arm in evaluator.rs dispatching to Python subprocess
run_llm_judge_evaluator() with 120s timeout, sends {input, output, expected, config} on stdin
builtin_types() updated, CLI config shorthand (-c correctness → {"criteria": "correctness"})
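
The stdin payload and the `-c` shorthand listed above can be illustrated from the caller's side. This is a minimal sketch in Python rather than the PR's Rust; `expand_config` and `build_payload` are hypothetical names, and treating a value that starts with `{` as a full JSON config is an assumption, not something the PR states.

```python
import json


def expand_config(raw: str) -> dict:
    """Expand the CLI shorthand: a bare word becomes {"criteria": <word>}.

    Assumption: a value starting with "{" is parsed as a full JSON config.
    """
    raw = raw.strip()
    if raw.startswith("{"):
        return json.loads(raw)
    return {"criteria": raw}


def build_payload(input_, output, expected, config: dict) -> str:
    """Serialize the {input, output, expected, config} object sent on stdin."""
    return json.dumps(
        {"input": input_, "output": output, "expected": expected, "config": config}
    )
```

With these helpers, `expand_config("correctness")` yields `{"criteria": "correctness"}`, which is then embedded under the `config` key of the payload.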

Phase 2 — Python LLM judge module

llm_judge.py with 5 criteria presets and full prompt templates (correctness, coherence, relevance, safety, task_completion)
Function calling pattern for structured choice+reasoning extraction (OpenAI SDK compatible)
Lazy openai import with clear error, retry on 429/5xx, temperature=0 default
Validates expected is required for reference-based criteria (correctness)
llm_judge_evaluator() factory + registered in _BUILTIN_EVALUATORS
Subprocess entry point: python3 -m rewind_agent.llm_judge
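
To make the function calling pattern concrete, here is a rough sketch of a tool definition and the parsing of its arguments. The tool name `select_choice`, the `A`/`B`/`C` labels, and the score values are all hypothetical; the actual presets in `llm_judge.py` may differ. Only the outer `tools` layout follows the OpenAI chat completions format.

```python
import json

# Hypothetical choice-to-score mapping; the real presets may use other labels.
CHOICE_SCORES = {"A": 1.0, "B": 0.5, "C": 0.0}

# Tool definition in the OpenAI chat-completions "tools" format.
SELECT_CHOICE_TOOL = {
    "type": "function",
    "function": {
        "name": "select_choice",  # hypothetical name
        "description": "Record the grade for the submission.",
        "parameters": {
            "type": "object",
            "properties": {
                "reasoning": {"type": "string"},
                "choice": {"type": "string", "enum": sorted(CHOICE_SCORES)},
            },
            "required": ["reasoning", "choice"],
        },
    },
}


def parse_tool_arguments(arguments_json: str) -> dict:
    """Turn the tool call's JSON arguments into {score, reasoning}."""
    args = json.loads(arguments_json)
    choice = args["choice"]
    if choice not in CHOICE_SCORES:
        raise ValueError(f"unexpected choice: {choice!r}")
    return {"score": CHOICE_SCORES[choice], "reasoning": args.get("reasoning", "")}
```

Forcing the model to answer through a tool call means the grade arrives as structured JSON, so no free-text parsing of the model's reply is needed.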

Phase 3 — Fork-score integration

timeline_scores table with UNIQUE(timeline_id, evaluator_id) upsert constraint
extract_timeline_output() — finds first/last LlmCall, skips ToolCall/ToolResult steps
rewind eval score <session> -e <evaluator> [--compare-timelines] [--expected <json>] [--json]
Score caching, pretty table output with delta and verdict
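
The extraction rule behind `extract_timeline_output()` can be sketched as follows. This is a Python illustration, not the Rust implementation: the `Step` record and its `request`/`response` fields are hypothetical, and taking the input from the first LlmCall's request and the output from the last LlmCall's response is an assumption about what "first/last" means here.

```python
from dataclasses import dataclass


@dataclass
class Step:
    kind: str           # e.g. "LlmCall", "ToolCall", "ToolResult"
    request: str = ""   # what was sent to the model (hypothetical field)
    response: str = ""  # what the model returned (hypothetical field)


def extract_timeline_output(steps):
    """Return (input, output) from the first and last LlmCall steps."""
    # Tool steps are skipped simply by filtering on the step kind.
    llm_steps = [s for s in steps if s.kind == "LlmCall"]
    if not llm_steps:
        raise ValueError("timeline has no LlmCall steps to score")
    return llm_steps[0].request, llm_steps[-1].response
```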

Stats

11 files changed, ~1,435 lines added
42 new tests (9 Rust + 33 Python), all passing
Clippy clean, zero regressions on full workspace (202 Rust + 225 Python tests)

Usage

# Create an LLM judge evaluator
rewind eval evaluator create "quality" -t llm_judge -c correctness

# Score a session's main timeline
rewind eval score latest -e quality

# Compare scores across original and forked timelines
rewind replay latest --from 5
rewind eval score latest -e quality --compare-timelines

# With expected output for correctness
rewind eval score latest -e quality --expected '{"answer": "Tokyo"}'

# JSON output for CI
rewind eval score latest -e quality --json

# Python SDK
result = rewind_agent.evaluate(
    dataset=ds,
    target_fn=my_agent,
    evaluators=["llm_judge"],  # default: correctness + gpt-4o-mini
)

# Or with config
result = rewind_agent.evaluate(
    dataset=ds,
    target_fn=my_agent,
    evaluators=[
        rewind_agent.llm_judge_evaluator(criteria="correctness", model="gpt-4o"),
        rewind_agent.llm_judge_evaluator(criteria="safety"),
    ],
)

Test plan

🤖 Generated with Claude Code

Add LLM-as-Judge evaluator that uses function calling to score agent outputs on criteria like correctness, coherence, safety, relevance, and task_completion.

Rust (Phase 1):
- New `llm_judge` match arm in evaluator.rs dispatches to Python subprocess
- `run_llm_judge_evaluator()` sends {input, output, expected, config} on stdin (extends custom protocol with config field)
- 120s timeout (vs 30s for custom evaluators) for LLM call latency
- `builtin_types()` updated to include `llm_judge`
- CLI config shorthand: `-c correctness` → `{"criteria": "correctness"}`

Python (Phase 2):
- `llm_judge.py` module with 5 criteria presets and Braintrust-style function calling pattern for structured choice+reasoning extraction
- Lazy openai import with clear error for missing dependency
- Retry on 429/5xx with exponential backoff, max 3 attempts
- temperature=0 default for deterministic scoring
- Validates `expected` is provided for reference-based criteria (correctness)
- `llm_judge_evaluator()` factory for configuring criteria/model
- Subprocess entry point: `python3 -m rewind_agent.llm_judge`
- Registered in `_BUILTIN_EVALUATORS` + exported from `__init__.py`

Tests: 4 Rust + 33 Python tests, all passing. Zero regressions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Wire LLM-as-judge scoring into the fork/diff workflow. Users can now score original vs. forked timelines after a replay.

Store:
- New `timeline_scores` table with UNIQUE(timeline_id, evaluator_id) constraint for upsert semantics on re-scoring
- `TimelineScore` model with session/timeline/evaluator refs + blob storage
- `create_timeline_score()` with ON CONFLICT upsert
- `get_timeline_scores()` and `get_timeline_score()` for retrieval

Eval crate:
- New `timeline_scoring` module with `extract_timeline_output()`: extracts (input, output) from a timeline's steps by finding first/last LlmCall steps, skipping ToolCall/ToolResult steps
- `validate_session_for_scoring()` helper
- 5 tests covering extraction, tool-step skipping, and validation

CLI:
- New `rewind eval score <session>` subcommand
- `--evaluator/-e`: specify evaluators (required, multiple allowed)
- `--timeline/-t`: score specific timeline (default: main)
- `--compare-timelines`: score ALL timelines, show comparison table
- `--expected`: JSON for reference-based criteria like correctness
- `--json`: machine-readable output for CI
- Score caching: re-running same evaluator on same timeline returns cached result from `timeline_scores` table
- Pretty table output with delta calculation and verdict

Usage: `rewind replay latest --from 5` then `rewind eval score latest -e correctness --compare-timelines`

Tests: 202 Rust tests passing, clippy clean. Zero regressions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

risjai · 2026-04-13T04:43:47Z

Code Review: LLM-as-Judge Evaluator + Timeline Scoring

Overall: Good implementation that closely follows the plan. 2 bugs, 1 security concern, 2 behavioral issues to address.

Bugs (P0)

1. Retry loop matches "429" as substring, not HTTP status code

python/rewind_agent/llm_judge.py:282

err_str = str(e)
if "429" in err_str or "500" in err_str or "502" in err_str or "503" in err_str:

This matches "429" anywhere in the error string. An error like "Token limit 500 exceeded" or "Connection to 10.0.0.429" would trigger retries incorrectly. The OpenAI SDK raises typed exceptions (openai.RateLimitError, openai.APIStatusError with .status_code) — check those instead of string matching. Also missing: a test for the retry path.
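
A check based on the exception object rather than its message might look like the sketch below. It relies on the `.status_code` attribute that `openai.APIStatusError` (and its subclass `openai.RateLimitError`) exposes, read via `getattr` so the example runs without the SDK installed. `FakeStatusError` is a stand-in invented for this sketch; real code would more likely use `isinstance` against the SDK's own exception classes.

```python
RETRYABLE_STATUS = {429, 500, 502, 503}


def is_retryable(exc: BaseException) -> bool:
    """Decide on retry from the exception's type and status code, never its text."""
    status = getattr(exc, "status_code", None)  # present on openai.APIStatusError
    if isinstance(status, int):
        return status in RETRYABLE_STATUS
    # Built-in network errors carry no status code but are worth retrying.
    return isinstance(exc, (ConnectionError, TimeoutError))


class FakeStatusError(Exception):
    """Stand-in for an SDK status error so the sketch runs without openai."""

    def __init__(self, message, status_code):
        super().__init__(message)
        self.status_code = status_code
```

Under this check, `ValueError("Token limit 500 exceeded")` is not retried, because the digits appear only in the message and not in a status code.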

2. Score cache has no bypass — stale scores returned after code changes

crates/rewind-cli/src/main.rs:2277

Once scored, rewind eval score always returns the cached value from timeline_scores. If the user changes their agent, replays to a new fork, then re-scores the same timeline — the stale score is returned. The upsert (ON CONFLICT UPDATE) in create_timeline_score exists but is unreachable because the cache check short-circuits first.

Add a --force flag that skips the cache check.
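
The interaction between the cache check and the upsert can be shown with an in-memory SQLite table. This is a simplified sketch: the three-column schema and the `score_timeline` helper are illustrative only (the real `timeline_scores` table stores more fields), and the `ON CONFLICT ... DO UPDATE` clause requires SQLite 3.24 or newer.

```python
import sqlite3

# Simplified, hypothetical schema; only the UNIQUE constraint mirrors the PR.
SCHEMA = """
CREATE TABLE timeline_scores (
    timeline_id  TEXT NOT NULL,
    evaluator_id TEXT NOT NULL,
    score        REAL NOT NULL,
    UNIQUE (timeline_id, evaluator_id)
)
"""


def score_timeline(conn, timeline_id, evaluator_id, run_evaluator, force=False):
    """Return a cached score unless force is set; otherwise score and upsert."""
    if not force:
        row = conn.execute(
            "SELECT score FROM timeline_scores "
            "WHERE timeline_id = ? AND evaluator_id = ?",
            (timeline_id, evaluator_id),
        ).fetchone()
        if row is not None:
            return row[0]  # cache hit: the evaluator is not run
    score = run_evaluator()
    conn.execute(
        "INSERT INTO timeline_scores (timeline_id, evaluator_id, score) "
        "VALUES (?, ?, ?) "
        "ON CONFLICT (timeline_id, evaluator_id) DO UPDATE SET score = excluded.score",
        (timeline_id, evaluator_id, score),
    )
    return score
```

Without `force=True` the second call returns the first score and the upsert branch never executes, which is the unreachable-path problem described above.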

Security (P1)

3. API key may leak into stored reasoning field

crates/rewind-eval/src/evaluator.rs:106

If the Python subprocess crashes during client construction, stderr may contain the API key (e.g., openai.AuthenticationError: Incorrect API key provided: sk-proj-abc...). This stderr is stored in timeline_scores.reasoning (persisted to SQLite) and printed to the terminal. Consider sanitizing stderr before storage — redact strings matching sk-[a-zA-Z0-9-]+.
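
A redaction pass of the kind suggested here could be sketched as below, in Python for illustration (the stderr handling itself lives on the Rust side). The pattern is a variant of the one above that also allows underscores and requires at least 10 characters after `sk-`; the `[REDACTED]` placeholder is an assumption.

```python
import re

# Minimum length of 10 avoids redacting ordinary words such as "task-list".
_KEY_PATTERN = re.compile(r"sk-[A-Za-z0-9_-]{10,}")


def sanitize_secrets(text: str) -> str:
    """Replace anything that looks like an sk-... API key with a placeholder."""
    return _KEY_PATTERN.sub("[REDACTED]", text)
```

The length floor matters because a bare `sk-[a-zA-Z0-9-]+` would also match inside hyphenated words that happen to contain `sk-`.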

Behavioral Issues (P2)

4. _session_id parameter unused in public API

crates/rewind-eval/src/timeline_scoring.rs:16

extract_timeline_output() accepts _session_id but never uses it (underscore-prefixed to suppress warnings). Either use it for validation or remove it — a dead parameter in a public API is confusing for future callers.

5. Multi-fork delta compares first/last timeline, not main vs. each fork

crates/rewind-cli/src/main.rs:2330

Delta is last - first in timeline order. For 3+ timelines (main, fork-1, fork-2), the comparison is fork-2 vs. main — silently ignoring fork-1. Consider comparing each fork against main, or documenting this as first-vs-last only.
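
Comparing every fork against main is a small change in how the deltas are computed. The sketch below uses a plain dict of timeline label to score; `fork_deltas` and the label names are hypothetical.

```python
def fork_deltas(scores: dict, main: str = "main") -> dict:
    """Return {fork_label: fork_score - main_score} for every non-main timeline."""
    baseline = scores[main]
    return {
        label: round(score - baseline, 6)  # rounding hides float noise in output
        for label, score in scores.items()
        if label != main
    }
```

For scores of 0.5 (main), 0.75 (fork-1), and 0.25 (fork-2), this reports +0.25 for fork-1 and -0.25 for fork-2, whereas a last-minus-first delta would report only the fork-2 figure.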

Nits (P3)

6. ValueError exits with code 0 in subprocess, masking config errors as zero scores

python/rewind_agent/llm_judge.py:285 — When run_judge raises ValueError ("correctness requires expected"), the subprocess catches it, outputs score: 0.0, and exits 0. The Rust side sees success. The user gets a zero score that looks like the LLM judged poorly, not a config error. Exit with code 1 for ValueError so Rust can surface it as an error.
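
The exit-code behavior recommended here can be sketched as an entry point that returns its exit status instead of always reporting success. `run_judge` below is a stub that only enforces the expected-value rule, and the `error` key in the JSON body is an assumed shape, not the module's confirmed output format.

```python
import json


def run_judge(payload: dict) -> dict:
    """Stand-in judge: only enforces the 'correctness requires expected' rule."""
    criteria = payload.get("config", {}).get("criteria", "correctness")
    if criteria == "correctness" and payload.get("expected") is None:
        raise ValueError("correctness requires expected")
    return {"score": 1.0, "reasoning": "stub"}


def main(stdin_text: str):
    """Return (exit_code, stdout_body): 0 on success, 1 on config errors."""
    try:
        result = run_judge(json.loads(stdin_text))
    except ValueError as exc:
        # Malformed JSON also lands here: json.JSONDecodeError subclasses ValueError.
        return 1, json.dumps({"error": str(exc)})
    return 0, json.dumps(result)
```

A real module would print the body and pass the code to `sys.exit()`, so that the caller sees a failed process rather than a score of 0.0.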

7. resolve_timeline_ref called but not in the diff

crates/rewind-cli/src/main.rs:2228 — Verify this existing function handles label-based lookup (not just ID/prefix).

Tests: Solid ✓

42 tests (9 Rust + 33 Python), good coverage of presets, template rendering, expected validation, choice→score mappings, error paths, SDK integration, and subprocess entry point. Missing: a test for the retry logic (429 → retry → succeed on attempt 2).

Recommend

Fix #1 (retry by exception type) and #2 (add --force flag) before merge. #3 (API key leak) is worth a fast-follow. The rest are non-blocking.

Fixes all items from PR #56 review:

#1 (P0 bug) Retry by OpenAI exception type, not string matching:
- New `_is_retryable()` checks `openai.RateLimitError`, `openai.APIStatusError(500/502/503)`, `ConnectionError`, `TimeoutError`
- "Token limit 500 exceeded" no longer triggers false retries
- Added 7 tests including retry-then-succeed integration test

#2 (P0 bug) Add --force flag to bypass stale score cache:
- `rewind eval score --force` skips the `get_timeline_score` cache check
- Upsert in `create_timeline_score` now reachable after code changes

#3 (P1 security) Sanitize API keys from stderr before storing:
- `sanitize_secrets()` redacts `sk-[a-zA-Z0-9_-]{10,}` patterns
- Applied to LLM judge subprocess stderr before writing to reasoning
- 3 tests for single key, multiple keys, and clean text

#4 (P2) Remove unused `_session_id` from `extract_timeline_output()`:
- Public API now takes `(store, timeline_id)`, a cleaner signature
- All callers (CLI + tests) updated

#5 (P2) Compare each fork against main, not just first vs last:
- Delta section now shows one line per fork, each compared to main
- Works correctly for 3+ timelines

#6 (P3) ValueError exits with code 1 in subprocess:
- `ValueError` and `RuntimeError` now exit non-zero so Rust surfaces them as errors, not misleading zero scores
- Test added for correctness-without-expected path

Tests: 26 Rust + 41 Python passing, clippy clean.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Documentation:
- docs/evaluation.md: LLM-as-judge section with setup, scoring, comparison, built-in criteria table, config options, Python SDK usage, cost guidance
- Updated built-in evaluators table and CLI commands reference
- Added examples links

Examples:
- examples/13_llm_judge.py: basic LLM-as-judge eval with correctness and coherence criteria, showing advantage over exact_match
- examples/14_fork_and_score.py: walkthrough of the full replay -> score -> prove-it-works loop (CLI + Python SDK)

MCP server:
- New `score_timelines` tool: score session timelines using evaluators, supports compare_timelines, expected, and force params
- Caches scores, returns per-timeline breakdown with avg

Version bump (0.7.0 -> 0.8.0):
- Cargo.toml workspace version: 0.8.0
- python/pyproject.toml: 0.10.0
- python/rewind_agent/__init__.py: 0.10.0
- python/rewind_cli.py CLI_VERSION: 0.8.0

Tests: 205 Rust + 233 Python passing, clippy clean.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

When the Python subprocess exits non-zero (e.g., missing API key, missing expected value), it writes structured JSON to stdout before exiting. Previously the Rust side only read stderr on failure, producing empty error messages like "stderr: (empty)". Now reads stdout first and parses the JSON error, surfacing clear messages like "LLM judge config error: requires an API key" instead.

Found during e2e testing with real demo data.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
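
The stdout-first error handling described in this commit can be illustrated with a small pure function. It is written in Python for consistency with the other sketches, while the actual change is in the Rust caller; the `error` key and the exact message wording are assumptions.

```python
import json


def describe_failure(returncode: int, stdout: str, stderr: str) -> str:
    """Build an error message for a failed subprocess, preferring JSON on stdout."""
    try:
        parsed = json.loads(stdout)
    except ValueError:  # empty or non-JSON stdout
        parsed = None
    if isinstance(parsed, dict) and parsed.get("error"):
        return f"LLM judge error: {parsed['error']}"
    # Fall back to stderr only when stdout carried no structured error.
    detail = stderr.strip() or "(empty)"
    return f"LLM judge exited with code {returncode}; stderr: {detail}"
```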

risjai and others added 3 commits April 13, 2026 09:55

chore: update Cargo.lock for tempfile dev-dependency

9684a68

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

risjai and others added 3 commits April 13, 2026 10:19

risjai enabled auto-merge April 13, 2026 05:33

risjai merged commit dbefcd4 into master Apr 13, 2026
4 checks passed

risjai deleted the feat/llm-judge-evaluator branch April 13, 2026 05:35

risjai merged 6 commits into master from feat/llm-judge-evaluator
