
feat: LLM-as-Judge evaluator + timeline scoring#56

Merged: risjai merged 6 commits into master from feat/llm-judge-evaluator on Apr 13, 2026

risjai commented Apr 13, 2026

Summary

  • Add llm_judge evaluator type that uses LLM function calling (Braintrust pattern) to score agent outputs on criteria like correctness, coherence, safety, relevance, and task_completion
  • Add rewind eval score CLI command that scores session timelines and compares original vs. forked timelines after a replay
  • Wire scoring into the store with a new timeline_scores table, enabling cached re-scoring and persistence

What's included

Phase 1 — Rust evaluator type

  • llm_judge match arm in evaluator.rs dispatching to Python subprocess
  • run_llm_judge_evaluator() with 120s timeout, sends {input, output, expected, config} on stdin
  • builtin_types() updated, CLI config shorthand (-c correctness → {"criteria": "correctness"})
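
The stdin protocol above can be sketched in Python for illustration. The PR does this dispatch from Rust; the version below only mirrors the shape of the exchange. The inline child script and the `{score, reasoning}` response shape are assumptions made for the sketch, not the actual module.

```python
import json
import subprocess
import sys

# Hypothetical stand-in for the judge process. The entry point named in this PR
# is `python3 -m rewind_agent.llm_judge`; this inline script only echoes a stub.
CHILD = r"""
import json, sys
payload = json.load(sys.stdin)
criteria = payload["config"].get("criteria", "correctness")
json.dump({"score": 1.0, "reasoning": f"stub judge for {criteria}"}, sys.stdout)
"""


def run_judge_subprocess(input_, output, expected, config, timeout_s=120):
    """Send the four-field payload on stdin and parse a JSON result from stdout."""
    payload = json.dumps(
        {"input": input_, "output": output, "expected": expected, "config": config}
    )
    proc = subprocess.run(
        [sys.executable, "-c", CHILD],
        input=payload,
        capture_output=True,
        text=True,
        timeout=timeout_s,  # raises subprocess.TimeoutExpired when exceeded
    )
    if proc.returncode != 0:
        raise RuntimeError(f"judge exited {proc.returncode}: {proc.stderr.strip()}")
    return json.loads(proc.stdout)


result = run_judge_subprocess(
    "Capital of Japan?", "Tokyo", {"answer": "Tokyo"}, {"criteria": "correctness"}
)
print(result)
```

The longer timeout is there because a single LLM call can take far longer than a local custom evaluator.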

Phase 2 — Python LLM judge module

  • llm_judge.py with 5 criteria presets and full prompt templates (correctness, coherence, relevance, safety, task_completion)
  • Function calling pattern for structured choice+reasoning extraction (OpenAI SDK compatible)
  • Lazy openai import with clear error, retry on 429/5xx, temperature=0 default
  • Validates expected is required for reference-based criteria (correctness)
  • llm_judge_evaluator() factory + registered in _BUILTIN_EVALUATORS
  • Subprocess entry point: python3 -m rewind_agent.llm_judge
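
A minimal sketch of the function calling pattern, assuming the OpenAI chat completions `tools` schema. The tool name `select_choice`, the choice letters, the score mapping, and the zero-score fallback for unknown choices are all hypothetical; the presets in llm_judge.py may differ.

```python
import json

# Hypothetical mapping from a judge choice to a numeric score.
CHOICE_SCORES = {"A": 1.0, "B": 0.5, "C": 0.0}


def build_judge_tool(choices):
    """Build a tool definition that forces a structured choice plus reasoning."""
    return {
        "type": "function",
        "function": {
            "name": "select_choice",  # hypothetical name
            "description": "Pick the choice that best describes the output.",
            "parameters": {
                "type": "object",
                "properties": {
                    "reasoning": {"type": "string"},
                    "choice": {"type": "string", "enum": list(choices)},
                },
                "required": ["reasoning", "choice"],
            },
        },
    }


def score_from_arguments(arguments_json, choice_scores):
    """Map the tool call's JSON arguments to a score, tolerating unknown choices."""
    args = json.loads(arguments_json)
    choice = args.get("choice")
    reasoning = args.get("reasoning", "")
    if choice not in choice_scores:
        # Assumption: unknown choices degrade to 0.0 rather than raising.
        return {"score": 0.0, "reasoning": f"unknown choice {choice!r}: {reasoning}"}
    return {"score": choice_scores[choice], "reasoning": reasoning}


tool = build_judge_tool(CHOICE_SCORES)
parsed = score_from_arguments(
    '{"reasoning": "Matches the reference.", "choice": "A"}', CHOICE_SCORES
)
print(parsed)
```

Constraining `choice` to an enum is what makes the result machine-readable; the free-text `reasoning` field is kept alongside it for display.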

Phase 3 — Fork-score integration

  • timeline_scores table with UNIQUE(timeline_id, evaluator_id) upsert constraint
  • extract_timeline_output() — finds first/last LlmCall, skips ToolCall/ToolResult steps
  • rewind eval score <session> -e <evaluator> [--compare-timelines] [--expected <json>] [--json]
  • Score caching, pretty table output with delta and verdict
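
The extraction rule for extract_timeline_output() can be illustrated with a Python analogue. The real function lives in the Rust eval crate and reads from the store; the `Step` type and its field names below are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Step:
    """Hypothetical step record; only the kind names come from the PR."""
    kind: str  # "LlmCall", "ToolCall", or "ToolResult"
    request: Optional[str] = None
    response: Optional[str] = None


def extract_timeline_output(steps: List[Step]) -> Optional[Tuple[Optional[str], Optional[str]]]:
    """Return (input, output): request of the first LlmCall, response of the last."""
    llm_calls = [s for s in steps if s.kind == "LlmCall"]  # tool steps are skipped
    if not llm_calls:
        return None
    return llm_calls[0].request, llm_calls[-1].response


steps = [
    Step("LlmCall", request="What is the capital of Japan?", response="Let me look it up."),
    Step("ToolCall", request="search('capital of Japan')"),
    Step("ToolResult", response="Tokyo"),
    Step("LlmCall", request="Tool said Tokyo", response="The capital is Tokyo."),
]
pair = extract_timeline_output(steps)
print(pair)
```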

Stats

  • 11 files changed, ~1,435 lines added
  • 42 new tests (9 Rust + 33 Python), all passing
  • Clippy clean, zero regressions on full workspace (202 Rust + 225 Python tests)

Usage

# Create an LLM judge evaluator
rewind eval evaluator create "quality" -t llm_judge -c correctness

# Score a session's main timeline
rewind eval score latest -e quality

# Compare scores across original and forked timelines
rewind replay latest --from 5
rewind eval score latest -e quality --compare-timelines

# With expected output for correctness
rewind eval score latest -e quality --expected '{"answer": "Tokyo"}'

# JSON output for CI
rewind eval score latest -e quality --json

# Python SDK
result = rewind_agent.evaluate(
    dataset=ds,
    target_fn=my_agent,
    evaluators=["llm_judge"],  # default: correctness + gpt-4o-mini
)

# Or with config
result = rewind_agent.evaluate(
    dataset=ds,
    target_fn=my_agent,
    evaluators=[
        rewind_agent.llm_judge_evaluator(criteria="correctness", model="gpt-4o"),
        rewind_agent.llm_judge_evaluator(criteria="safety"),
    ],
)

Test plan

  • Rust: builtin_types() includes llm_judge, is_valid_type() validates it
  • Rust: extract_timeline_output() finds first/last LlmCall, skips tool steps
  • Rust: validate_session_for_scoring() errors on missing session
  • Python: All 5 criteria presets have correct templates and scores
  • Python: Template rendering handles strings, dicts, None, lists
  • Python: Correctness without expected raises clear ValueError
  • Python: Safety/coherence work without expected
  • Python: Mocked LLM calls return correct scores for each choice
  • Python: Missing openai dep, API errors, and unknown choices handled gracefully
  • Python: llm_judge registered in _BUILTIN_EVALUATORS and resolvable by string
  • Python: Subprocess entry point handles valid/invalid stdin JSON
  • Full workspace: 202 Rust + 225 Python tests pass, clippy clean
  • Manual: rewind eval evaluator create + rewind eval score e2e flow
  • Manual: --compare-timelines with fork/replay produces delta table

🤖 Generated with Claude Code

risjai and others added 3 commits April 13, 2026 09:55
Add LLM-as-Judge evaluator that uses function calling to score agent
outputs on criteria like correctness, coherence, safety, relevance,
and task_completion.

Rust (Phase 1):
- New `llm_judge` match arm in evaluator.rs dispatches to Python subprocess
- `run_llm_judge_evaluator()` sends {input, output, expected, config} on stdin
  (extends custom protocol with config field)
- 120s timeout (vs 30s for custom evaluators) for LLM call latency
- `builtin_types()` updated to include `llm_judge`
- CLI config shorthand: `-c correctness` → `{"criteria": "correctness"}`

Python (Phase 2):
- `llm_judge.py` module with 5 criteria presets and Braintrust-style
  function calling pattern for structured choice+reasoning extraction
- Lazy openai import with clear error for missing dependency
- Retry on 429/5xx with exponential backoff, max 3 attempts
- temperature=0 default for deterministic scoring
- Validates `expected` is provided for reference-based criteria (correctness)
- `llm_judge_evaluator()` factory for configuring criteria/model
- Subprocess entry point: `python3 -m rewind_agent.llm_judge`
- Registered in `_BUILTIN_EVALUATORS` + exported from `__init__.py`

Tests: 4 Rust + 33 Python tests, all passing. Zero regressions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Wire LLM-as-judge scoring into the fork/diff workflow. Users can now
score original vs. forked timelines after a replay.

Store:
- New `timeline_scores` table with UNIQUE(timeline_id, evaluator_id)
  constraint for upsert semantics on re-scoring
- `TimelineScore` model with session/timeline/evaluator refs + blob storage
- `create_timeline_score()` with ON CONFLICT upsert
- `get_timeline_scores()` and `get_timeline_score()` for retrieval

Eval crate:
- New `timeline_scoring` module with `extract_timeline_output()`:
  extracts (input, output) from a timeline's steps by finding
  first/last LlmCall steps, skipping ToolCall/ToolResult steps
- `validate_session_for_scoring()` helper
- 5 tests covering extraction, tool-step skipping, and validation

CLI:
- New `rewind eval score <session>` subcommand
  - `--evaluator/-e`: specify evaluators (required, multiple allowed)
  - `--timeline/-t`: score specific timeline (default: main)
  - `--compare-timelines`: score ALL timelines, show comparison table
  - `--expected`: JSON for reference-based criteria like correctness
  - `--json`: machine-readable output for CI
- Score caching: re-running same evaluator on same timeline returns
  cached result from `timeline_scores` table
- Pretty table output with delta calculation and verdict

Usage: `rewind replay latest --from 5` then
       `rewind eval score latest -e correctness --compare-timelines`

Tests: 202 Rust tests passing, clippy clean. Zero regressions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

risjai commented Apr 13, 2026

Code Review: LLM-as-Judge Evaluator + Timeline Scoring

Overall: Good implementation that closely follows the plan. 2 bugs, 1 security concern, 2 behavioral issues to address.


Bugs (P0)

1. Retry loop matches "429" as substring, not HTTP status code

python/rewind_agent/llm_judge.py:282

err_str = str(e)
if "429" in err_str or "500" in err_str or "502" in err_str or "503" in err_str:

This matches "429" anywhere in the error string. An error like "Token limit 500 exceeded" or "Connection to 10.0.0.429" would trigger retries incorrectly. The OpenAI SDK raises typed exceptions (openai.RateLimitError, openai.APIStatusError with .status_code) — check those instead of string matching. Also missing: a test for the retry path.
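
One way to sketch the suggested check is to classify on the exception itself rather than on its message. To stay runnable without the openai package, the version below duck-types on a `status_code` attribute (which `openai.APIStatusError` carries) and uses a hypothetical `FakeStatusError` stand-in; the function names are illustrative.

```python
import time

RETRYABLE_STATUS = {429, 500, 502, 503}


class FakeStatusError(Exception):
    """Hypothetical stand-in for an SDK error that carries an HTTP status code."""

    def __init__(self, message, status_code):
        super().__init__(message)
        self.status_code = status_code


def is_retryable(exc):
    """Decide from the exception's type and status code, never its message text."""
    if getattr(exc, "status_code", None) in RETRYABLE_STATUS:
        return True
    return isinstance(exc, (ConnectionError, TimeoutError))


def call_with_retry(fn, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Retry fn with exponential backoff while the failure is retryable."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as exc:
            if attempt == max_attempts or not is_retryable(exc):
                raise
            sleep(base_delay * 2 ** (attempt - 1))


attempts = {"n": 0}


def flaky():
    attempts["n"] += 1
    if attempts["n"] == 1:
        raise FakeStatusError("rate limited", 429)
    return "ok"


result = call_with_retry(flaky, sleep=lambda _s: None)  # no real sleeping in the demo
print(result, attempts["n"])
```

With this shape, an error whose message merely contains "500" is not retried, and the retry path is easy to cover with a test like the `flaky` example.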

2. Score cache has no bypass — stale scores returned after code changes

crates/rewind-cli/src/main.rs:2277

Once scored, rewind eval score always returns the cached value from timeline_scores. If the user changes their agent, replays to a new fork, then re-scores the same timeline — the stale score is returned. The upsert (ON CONFLICT UPDATE) in create_timeline_score exists but is unreachable because the cache check short-circuits first.

Add a --force flag that skips the cache check.
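
A minimal sketch of the intended control flow, using Python's sqlite3 with an in-memory database. The table is reduced to three columns and the function name is hypothetical; the real logic is in the Rust CLI and store.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE timeline_scores (
           timeline_id  TEXT NOT NULL,
           evaluator_id TEXT NOT NULL,
           score        REAL NOT NULL,
           UNIQUE(timeline_id, evaluator_id)
       )"""
)


def score_timeline(conn, timeline_id, evaluator_id, compute, force=False):
    """Return the cached score unless force is set, then upsert the fresh one."""
    if not force:
        row = conn.execute(
            "SELECT score FROM timeline_scores WHERE timeline_id = ? AND evaluator_id = ?",
            (timeline_id, evaluator_id),
        ).fetchone()
        if row is not None:
            return row[0]  # cache hit short-circuits the evaluator
    score = compute()
    # Upsert syntax needs SQLite 3.24 or newer.
    conn.execute(
        "INSERT INTO timeline_scores (timeline_id, evaluator_id, score) VALUES (?, ?, ?) "
        "ON CONFLICT(timeline_id, evaluator_id) DO UPDATE SET score = excluded.score",
        (timeline_id, evaluator_id, score),
    )
    return score


first = score_timeline(conn, "main", "quality", lambda: 0.4)
cached = score_timeline(conn, "main", "quality", lambda: 0.9)
forced = score_timeline(conn, "main", "quality", lambda: 0.9, force=True)
rows = conn.execute("SELECT COUNT(*) FROM timeline_scores").fetchone()[0]
print(first, cached, forced, rows)
```

Without `force`, the second call never reaches the upsert, which is the unreachable path described above.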


Security (P1)

3. API key may leak into stored reasoning field

crates/rewind-eval/src/evaluator.rs:106

If the Python subprocess crashes during client construction, stderr may contain the API key (e.g., openai.AuthenticationError: Incorrect API key provided: sk-proj-abc...). This stderr is stored in timeline_scores.reasoning (persisted to SQLite) and printed to the terminal. Consider sanitizing stderr before storage — redact strings matching sk-[a-zA-Z0-9-]+.
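
The suggested redaction could look roughly like this in Python (the actual fix would sit on the Rust side before the write to timeline_scores.reasoning). The function name and placeholder text are hypothetical.

```python
import re

# Pattern taken from the suggestion above; it is deliberately broad.
_KEY_PATTERN = re.compile(r"sk-[a-zA-Z0-9-]+")


def redact_api_keys(text, placeholder="[REDACTED]"):
    """Replace anything that looks like an sk- style key before storing or printing."""
    return _KEY_PATTERN.sub(placeholder, text)


stderr = "openai.AuthenticationError: Incorrect API key provided: sk-proj-abc123XYZ."
clean = redact_api_keys(stderr)
print(clean)
```

Note that this pattern also matches inside ordinary words such as "task-based", so a minimum length or a leading word boundary may be worth adding.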


Behavioral Issues (P2)

4. _session_id parameter unused in public API

crates/rewind-eval/src/timeline_scoring.rs:16

extract_timeline_output() accepts _session_id but never uses it (underscore-prefixed to suppress warnings). Either use it for validation or remove it — a dead parameter in a public API is confusing for future callers.

5. Multi-fork delta compares first/last timeline, not main vs. each fork

crates/rewind-cli/src/main.rs:2330

Delta is last - first in timeline order. For 3+ timelines (main, fork-1, fork-2), the comparison is fork-2 vs. main — silently ignoring fork-1. Consider comparing each fork against main, or documenting this as first-vs-last only.
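
The per-fork comparison amounts to subtracting the main score from each fork's score. A small sketch, with timeline labels and scores made up for illustration:

```python
def deltas_vs_main(scores, main="main"):
    """Return each fork's score minus the main timeline's score."""
    base = scores[main]
    return {name: score - base for name, score in scores.items() if name != main}


scores = {"main": 0.5, "fork-1": 0.75, "fork-2": 0.25}
deltas = deltas_vs_main(scores)
print(deltas)
```

Every fork gets its own line this way, so fork-1 is no longer dropped when a third timeline exists.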


Nits (P3)

6. ValueError exits with code 0 in subprocess, masking config errors as zero scores

python/rewind_agent/llm_judge.py:285 — When run_judge raises ValueError ("correctness requires expected"), the subprocess catches it, outputs score: 0.0, and exits 0. The Rust side sees success. The user gets a zero score that looks like the LLM judged poorly, not a config error. Exit with code 1 for ValueError so Rust can surface it as an error.
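
A sketch of an entry point that returns a non-zero code for config errors. The `run_judge` stub and the `{"error": ...}` output shape are hypothetical; only the "correctness requires expected" rule comes from this PR.

```python
import io
import json
import sys


def run_judge(payload):
    """Hypothetical stub: enforces the expected-value rule, then returns a fixed score."""
    criteria = payload.get("config", {}).get("criteria", "correctness")
    if criteria == "correctness" and payload.get("expected") is None:
        raise ValueError("correctness requires expected")
    return {"score": 1.0, "reasoning": "stub"}


def main(stdin_text, out=sys.stdout):
    """Return the process exit code: 0 on success, 1 on config or input errors."""
    try:
        payload = json.loads(stdin_text)  # JSONDecodeError is a ValueError subclass
        result = run_judge(payload)
    except ValueError as exc:
        json.dump({"error": str(exc)}, out)
        return 1
    json.dump(result, out)
    return 0


buf = io.StringIO()
code = main('{"config": {"criteria": "correctness"}}', buf)
print(code, buf.getvalue())
```

A real entry point would finish with `sys.exit(main(sys.stdin.read()))` so the caller sees the exit code.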

7. resolve_timeline_ref called but not in the diff

crates/rewind-cli/src/main.rs:2228 — Verify this existing function handles label-based lookup (not just ID/prefix).


Tests: Solid ✓

42 tests (9 Rust + 33 Python), good coverage of presets, template rendering, expected validation, choice→score mappings, error paths, SDK integration, and subprocess entry point. Missing: a test for the retry logic (429 → retry → succeed on attempt 2).


Recommendation

Fix #1 (retry by exception type) and #2 (add --force flag) before merge. #3 (API key leak) is worth a fast-follow. Rest are non-blocking.

risjai and others added 3 commits April 13, 2026 10:19
Fixes all items from PR #56 review:

#1 (P0 bug) Retry by OpenAI exception type, not string matching:
  - New `_is_retryable()` checks `openai.RateLimitError`,
    `openai.APIStatusError(500/502/503)`, `ConnectionError`, `TimeoutError`
  - "Token limit 500 exceeded" no longer triggers false retries
  - Added 7 tests including retry-then-succeed integration test

#2 (P0 bug) Add --force flag to bypass stale score cache:
  - `rewind eval score --force` skips the `get_timeline_score` cache check
  - Upsert in `create_timeline_score` now reachable after code changes

#3 (P1 security) Sanitize API keys from stderr before storing:
  - `sanitize_secrets()` redacts `sk-[a-zA-Z0-9_-]{10,}` patterns
  - Applied to LLM judge subprocess stderr before writing to reasoning
  - 3 tests for single key, multiple keys, and clean text

#4 (P2) Remove unused `_session_id` from `extract_timeline_output()`:
  - Public API now takes `(store, timeline_id)` — cleaner signature
  - All callers (CLI + tests) updated

#5 (P2) Compare each fork against main, not just first vs last:
  - Delta section now shows one line per fork, each compared to main
  - Works correctly for 3+ timelines

#6 (P3) ValueError exits with code 1 in subprocess:
  - `ValueError` and `RuntimeError` now exit non-zero so Rust surfaces
    them as errors, not misleading zero scores
  - Test added for correctness-without-expected path

Tests: 26 Rust + 41 Python passing, clippy clean.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Documentation:
- docs/evaluation.md: LLM-as-judge section with setup, scoring, comparison,
  built-in criteria table, config options, Python SDK usage, cost guidance
- Updated built-in evaluators table and CLI commands reference
- Added examples links

Examples:
- examples/13_llm_judge.py: basic LLM-as-judge eval with correctness
  and coherence criteria, showing advantage over exact_match
- examples/14_fork_and_score.py: walkthrough of the full
  replay -> score -> prove-it-works loop (CLI + Python SDK)

MCP server:
- New `score_timelines` tool: score session timelines using evaluators,
  supports compare_timelines, expected, and force params
- Caches scores, returns per-timeline breakdown with avg

Version bump (0.7.0 -> 0.8.0):
- Cargo.toml workspace version: 0.8.0
- python/pyproject.toml: 0.10.0
- python/rewind_agent/__init__.py: 0.10.0
- python/rewind_cli.py CLI_VERSION: 0.8.0

Tests: 205 Rust + 233 Python passing, clippy clean.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

When the Python subprocess exits non-zero (e.g., missing API key,
missing expected value), it writes structured JSON to stdout before
exiting. Previously the Rust side only read stderr on failure, producing
empty error messages like "stderr: (empty)".

Now reads stdout first and parses the JSON error — surfacing clear
messages like "LLM judge config error: requires an API key" instead.

Found during e2e testing with real demo data.
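
The stdout-first error handling can be sketched in Python as follows. The change itself is on the Rust side; the `error` key and the message prefix here are assumptions.

```python
import json


def describe_failure(stdout, stderr):
    """Prefer a structured JSON error from stdout, fall back to raw stderr."""
    try:
        parsed = json.loads(stdout)
    except (json.JSONDecodeError, TypeError):
        parsed = None  # stdout was empty or not JSON
    if isinstance(parsed, dict) and parsed.get("error"):
        return f"LLM judge error: {parsed['error']}"
    return f"stderr: {stderr.strip() or '(empty)'}"


structured = describe_failure('{"error": "requires an API key"}', "")
fallback = describe_failure("", "")
print(structured)
print(fallback)
```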

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@risjai risjai enabled auto-merge April 13, 2026 05:33
@risjai risjai merged commit dbefcd4 into master Apr 13, 2026
4 checks passed
@risjai risjai deleted the feat/llm-judge-evaluator branch April 13, 2026 05:35