feat: replace custom eval graders with LLM-as-judge scoring #47
Merged
Remove ~230 lines of brittle regex/substring graders (grade_dataset_1/2/3, _mentions, alternates tables) and replace with a single LLMJudge that scores investigation answers against criteria derived from expected.json. New modules: eval/results.py (score models), eval/criteria.py (criteria extraction from expected.json keys), eval/judge.py (LLMJudge class + prompt). Adds --judge-provider, --judge-model, --judge-api-key CLI args to run_evals.py. Pass threshold: 0.8 aggregate score.
Pull request overview
This PR replaces the per-dataset, hand-coded evaluation graders in eval/run_evals.py with an LLM-as-judge approach that scores investigation outputs against criteria derived from each dataset’s expected.json, producing per-criterion numeric scores and an aggregate pass/fail based on a 0.8 threshold.
Changes:
- Adds a criteria extraction layer (eval/criteria.py) and Pydantic result models (eval/results.py) to drive judge prompts and structured outputs.
- Introduces an LLMJudge scorer (eval/judge.py) with deterministic prechecks and judge-response parsing.
- Updates the eval runner (eval/run_evals.py) to run the judge, persist numeric results to bug.json, and adds CLI flags for judge provider/model/key; adds test coverage and pytest config.
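The three judge flags could be wired up roughly as follows. This is a minimal sketch using the standard library's argparse, not the code from the PR: the function names `build_parser` and `parse_judge_args` are hypothetical, and the `None` defaults are an assumption standing in for "fall back to the configured provider".

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser exposing the judge flags named in this PR."""
    parser = argparse.ArgumentParser(description="Run evals with an LLM judge.")
    parser.add_argument("--dataset", help="Dataset to evaluate, e.g. dataset_1.")
    # None is assumed to mean "use whatever provider/model is already configured".
    parser.add_argument("--judge-provider", default=None)
    parser.add_argument("--judge-model", default=None)
    parser.add_argument("--judge-api-key", default=None)
    return parser


def parse_judge_args(argv: list[str]) -> dict:
    """Return the parsed flags as a plain dict, matching the args.get(...) style."""
    return vars(build_parser().parse_args(argv))
```

argparse converts the dashes to underscores, so `--judge-model` is read back as `args["judge_model"]`, which matches the lookups in the reviewed runner code.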
Reviewed changes
Copilot reviewed 7 out of 8 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| eval/run_evals.py | Replaces dataset-specific graders with LLM judge scoring + CLI configuration and result output. |
| eval/judge.py | Implements judge prompt, deterministic precheck, and parsing of judge outputs into structured results. |
| eval/criteria.py | Extracts criterion names and criteria text from expected.json to drive judging. |
| eval/results.py | Adds Pydantic models for per-criterion scores and overall judge results. |
| tests/eval/test_criteria.py | Adds tests for criteria extraction across dataset shapes. |
| tests/eval/test_judge.py | Adds tests for precheck, judge parsing behavior, and judge scoring flow. |
| pyproject.toml | Configures pytest asyncio strict mode and sets pythonpath for tests. |
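To make the shape of the result models concrete, here is a rough sketch of what per-criterion scores and an aggregate pass/fail could look like. It uses stdlib dataclasses as a stand-in for the Pydantic models in eval/results.py. The field names follow the review excerpts (name, score, explanation), but the `aggregate` and `passed` properties and the `>=` comparison are assumptions, not the PR's actual implementation.

```python
from dataclasses import dataclass, field

PASS_THRESHOLD = 0.8  # aggregate score needed to pass, per the PR description


@dataclass
class CriterionScore:
    name: str
    score: float  # expected range: 0.0 to 1.0
    explanation: str = ""


@dataclass
class JudgeResult:
    criteria: list[CriterionScore] = field(default_factory=list)

    @property
    def aggregate(self) -> float:
        """Unweighted mean of the criterion scores; 0.0 when there are none."""
        if not self.criteria:
            return 0.0
        return sum(c.score for c in self.criteria) / len(self.criteria)

    @property
    def passed(self) -> bool:
        # Assumption: a score exactly at the threshold counts as a pass.
        return self.aggregate >= PASS_THRESHOLD
```

With two criteria scored 1.0 and 0.5, the aggregate is 0.75, which falls below the 0.8 threshold and fails.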
Comment on lines +21 to +23

    from repi.core.container import get_container
    from eval.judge import LLMJudge, deterministic_precheck, PASS_THRESHOLD
    from eval.results import JudgeResult
Comment on lines +96 to +112

    if args.get("judge_model"):
        from repi.llm.adapters import (
            OpenAIProvider, AnthropicProvider, MistralProvider,
            GeminiProvider, OllamaProvider,
        )
        model = args["judge_model"]
        if isinstance(llm, OpenAIProvider):
            llm = OpenAIProvider(api_key=llm._api_key, model=model)
        elif isinstance(llm, AnthropicProvider):
            llm = AnthropicProvider(api_key=llm._api_key, model=model)
        elif isinstance(llm, MistralProvider):
            llm = MistralProvider(api_key=llm._api_key, model=model)
        elif isinstance(llm, GeminiProvider):
            llm = GeminiProvider(api_key=llm._api_key, model=model)
        elif isinstance(llm, OllamaProvider):
            llm = OllamaProvider(base_url=llm._base_url, model=model)
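One way to avoid both the isinstance chain and the reads of private attributes such as `_api_key` is a name-keyed factory table fed from settings. The sketch below is illustrative only: `ProviderStub` and `JudgeSettings` are hypothetical stand-ins (the real adapters live in repi.llm.adapters and the real settings fields are not shown in this PR), and the lowercase provider names are assumptions.

```python
from dataclasses import dataclass, replace
from typing import Callable, Optional


@dataclass
class ProviderStub:
    """Hypothetical stand-in for the real adapters in repi.llm.adapters."""
    kind: str
    model: str
    api_key: Optional[str] = None
    base_url: Optional[str] = None


@dataclass
class JudgeSettings:
    """Hypothetical config object; the real settings fields are not shown here."""
    provider: str
    model: str
    api_key: Optional[str] = None
    base_url: Optional[str] = None


def _keyed(kind: str) -> Callable[[JudgeSettings], ProviderStub]:
    """Factory for providers that authenticate with an API key."""
    return lambda s: ProviderStub(kind, s.model, api_key=s.api_key)


# One factory per provider name replaces the isinstance chain.
_FACTORIES: dict[str, Callable[[JudgeSettings], ProviderStub]] = {
    "openai": _keyed("openai"),
    "anthropic": _keyed("anthropic"),
    "mistral": _keyed("mistral"),
    "gemini": _keyed("gemini"),
    "ollama": lambda s: ProviderStub("ollama", s.model, base_url=s.base_url),
}


def build_judge_llm(settings: JudgeSettings, args: dict) -> ProviderStub:
    """Apply CLI overrides on top of settings, then build the provider."""
    provider = args.get("judge_provider") or settings.provider
    if provider not in _FACTORIES:
        raise ValueError(f"Unknown judge provider: {provider!r}")
    effective = replace(
        settings,
        provider=provider,
        model=args.get("judge_model") or settings.model,
        api_key=args.get("judge_api_key") or settings.api_key,
    )
    return _FACTORIES[provider](effective)
```

Because credentials come from the settings object, a model override never has to reach into an already-constructed adapter, and adding a provider means adding one table entry.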
Comment on lines +130 to +141

    def _red_herring_criterion(ea: dict) -> str:
        ruled_out = ea.get("ruled_out_hypotheses_must_include")
        if not ruled_out:
            return ""

        lines = ["## Criterion: red_herring_handling"]
        lines.append("The ruled_out_hypotheses must address these red herrings:")
        for item in ruled_out:
            about = item.get("hypothesis_about", "?")
            rationale = item.get("rationale", "")
            lines.append(f"  - {about}: {rationale}")
        return "\n".join(lines)
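The excerpt above reads only ruled_out_hypotheses_must_include, while the follow-up commit mentions that dataset 2 uses ruled_out_hypotheses_must_consider. A possible way to accept either key is sketched below. It assumes both keys hold the same list-of-dicts shape, which the PR does not confirm, and the public name `red_herring_criterion` is hypothetical.

```python
# Keys that may carry the red-herring list; the first non-empty one wins.
_RULED_OUT_KEYS = (
    "ruled_out_hypotheses_must_include",
    "ruled_out_hypotheses_must_consider",
)


def red_herring_criterion(ea: dict) -> str:
    """Build the red_herring_handling prompt section, or "" if not applicable."""
    ruled_out = next((ea[k] for k in _RULED_OUT_KEYS if ea.get(k)), None)
    if not ruled_out:
        return ""

    lines = [
        "## Criterion: red_herring_handling",
        "The ruled_out_hypotheses must address these red herrings:",
    ]
    for item in ruled_out:
        about = item.get("hypothesis_about", "?")
        rationale = item.get("rationale", "")
        lines.append(f"  - {about}: {rationale}")
    return "\n".join(lines)
```

Returning an empty string when neither key is present lets the caller drop the criterion entirely instead of asking the judge to score something the dataset never specified.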
Comment on lines +13 to +23

    # Maps expected.json keys to the criterion name the judge will score.
    # Order here determines evaluation order in the prompt.
    _CRITERION_KEYS = [
        "trigger_identification",
        "root_cause_accuracy",
        "propagation_chain",
        "red_herring_handling",
        "confidence_calibration",
        "gap_awareness",
        "hallucination_avoidance",
    ]
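The comment in this excerpt promises a mapping from expected.json keys to criterion names, but the list holds only names. A sketch of what an actual key-to-criterion mapping might look like follows, emitting a criterion only when its key is present. Only the two ruled_out_* keys appear in this PR; the other keys ("trigger", "root_cause", "confidence") and the function name `criterion_names` are hypothetical placeholders.

```python
# (expected.json key, criterion name) pairs; list order sets prompt order.
# Only the two ruled_out_* keys appear in the PR; the rest are hypothetical.
_KEY_TO_CRITERION = [
    ("trigger", "trigger_identification"),
    ("root_cause", "root_cause_accuracy"),
    ("ruled_out_hypotheses_must_include", "red_herring_handling"),
    ("ruled_out_hypotheses_must_consider", "red_herring_handling"),
    ("confidence", "confidence_calibration"),
]


def criterion_names(expected: dict) -> list[str]:
    """Return criterion names for the keys present, in order, without repeats."""
    names: list[str] = []
    for key, name in _KEY_TO_CRITERION:
        if expected.get(key) and name not in names:
            names.append(name)
    return names
```

Driving the list from the keys that are actually present means a dataset without a confidence expectation never gets a confidence_calibration criterion, so the judge is not asked to score something with no reference answer.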
Comment on lines +93 to +113

    parsed = json.loads(cleaned)
    scores_raw = parsed.get("scores", [])

    criteria: list[CriterionScore] = []
    for entry in scores_raw:
        criteria.append(CriterionScore(
            name=entry["name"],
            score=max(0.0, min(1.0, float(entry["score"]))),
            explanation=entry.get("explanation", ""),
        ))

    scored_names = {c.name for c in criteria}
    for name in criterion_names:
        if name not in scored_names:
            criteria.append(CriterionScore(
                name=name,
                score=0.0,
                explanation="Judge did not return a score for this criterion.",
            ))

    aggregate = sum(c.score for c in criteria) / len(criteria) if criteria else 0.0
- criteria.py: handle ruled_out_hypotheses_must_consider (dataset 2), only add confidence_calibration when present, remove dead _CRITERION_KEYS
- judge.py: graceful JSON parse failure, filter extra criteria from aggregate (only score requested criterion_names)
- run_evals.py: remove unused JudgeResult import, read API keys from config settings instead of private adapter attributes
- tests: add coverage for invalid JSON, extra criteria filtering, dataset 2 ruled_out_hypotheses_must_consider
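The judge.py fixes listed above (graceful parse failure, ignoring criteria that were not requested) could look roughly like this. It is a sketch under assumptions, not the merged code: `parse_judge_response` is a hypothetical name, it returns a plain dict instead of the Pydantic JudgeResult, and treating malformed or NaN scores as missing is a choice made here for illustration.

```python
import json
import math

PASS_THRESHOLD = 0.8  # aggregate score needed to pass, per the PR description


def parse_judge_response(raw: str, criterion_names: list[str]) -> dict:
    """Parse judge JSON into clamped scores for the requested criteria only."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        parsed = {}  # invalid JSON: every criterion falls through to 0.0
    scores_raw = parsed.get("scores", []) if isinstance(parsed, dict) else []
    if not isinstance(scores_raw, list):
        scores_raw = []

    requested = set(criterion_names)
    scores: dict[str, float] = {}
    for entry in scores_raw:
        if not isinstance(entry, dict):
            continue
        name = entry.get("name")
        if not isinstance(name, str) or name not in requested:
            continue  # extra criteria must not affect the aggregate
        try:
            value = float(entry.get("score"))
        except (TypeError, ValueError):
            continue
        if math.isnan(value):
            continue  # min()/max() clamping would silently turn NaN into 1.0
        scores[name] = max(0.0, min(1.0, value))

    # Criteria the judge skipped count as 0.0 rather than being dropped.
    per_criterion = {name: scores.get(name, 0.0) for name in criterion_names}
    aggregate = (
        sum(per_criterion.values()) / len(per_criterion) if per_criterion else 0.0
    )
    return {
        "scores": per_criterion,
        "aggregate": aggregate,
        "passed": aggregate >= PASS_THRESHOLD,
    }
```

Building the result from `criterion_names` instead of from whatever the judge returned keeps the denominator fixed, so a judge that invents an easy extra criterion or omits a hard one cannot inflate the aggregate.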
Summary
- Remove the brittle regex/substring graders (grade_dataset_1/2/3, _mentions(), regex alternates tables) from eval/run_evals.py
- Replace them with a single LLMJudge class that scores investigation answers against criteria derived from expected.json
- Add --judge-provider, --judge-model, --judge-api-key CLI args (defaults to configured provider)

New files
- eval/results.py: CriterionScore and JudgeResult Pydantic models
- eval/criteria.py: build_criteria(), which extracts scoring dimensions from expected.json keys generically
- eval/judge.py: LLMJudge class, judge system prompt, response parser, deterministic_precheck()
- tests/eval/test_criteria.py
- tests/eval/test_judge.py

Verified
- … (uv run pytest tests/ -v)
- … expected.json, no code changes

Closes #44

Test plan
- uv run pytest tests/eval/ -v: 24 tests pass
- uv run pytest tests/ -v: full suite, 98 tests pass
- uv run python eval/run_evals.py --dataset dataset_1: produces numeric scores