feat: replace custom eval graders with LLM-as-judge scoring by VarunGitGood · Pull Request #47 · VarunGitGood/repi

VarunGitGood · 2026-05-31T07:14:06Z

Summary

Removes ~230 lines of brittle hand-coded graders (grade_dataset_1/2/3, _mentions(), regex alternates tables) from eval/run_evals.py
Replaces them with a single LLMJudge class that scores investigation answers against criteria derived from expected.json
Produces per-criterion numeric scores (0.0–1.0) with explanations instead of binary pass/fail
Adds --judge-provider, --judge-model, --judge-api-key CLI args (defaults to configured provider)
Pass threshold: 0.8 aggregate score

New files

| File | Purpose |
| --- | --- |
| `eval/results.py` | `CriterionScore` and `JudgeResult` pydantic models |
| `eval/criteria.py` | `build_criteria()`: extracts scoring dimensions from `expected.json` keys generically |
| `eval/judge.py` | `LLMJudge` class, judge system prompt, response parser, `deterministic_precheck()` |
| `tests/eval/test_criteria.py` | 13 tests covering criteria extraction from all 3 dataset shapes |
| `tests/eval/test_judge.py` | 11 tests covering precheck, response parsing, mock LLM scoring |

Verified

Dataset 1 scored 0.94 with Mistral judge (trigger: 1.0, root cause: 0.9, chain: 0.8, red herrings: 1.0, confidence: 1.0)
98/98 tests pass (uv run pytest tests/ -v)
No per-dataset grader code — new datasets only need expected.json, no code changes
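
The 0.94 reported for dataset 1 is consistent with an unweighted mean of the five per-criterion scores, which is also how the aggregate is computed in the diff further down. A minimal sketch of that arithmetic follows. The dictionary labels are illustrative rather than the exact criterion names the judge emits, and treating a score equal to the threshold as a pass (`>=`) is an assumption, since the PR only states the 0.8 value.

```python
# Per-criterion scores reported for dataset 1 (labels are illustrative).
scores = {
    "trigger": 1.0,
    "root_cause": 0.9,
    "chain": 0.8,
    "red_herrings": 1.0,
    "confidence": 1.0,
}

PASS_THRESHOLD = 0.8  # pass threshold stated in the summary


def aggregate_score(per_criterion: dict) -> float:
    """Unweighted mean of the criterion scores; 0.0 when there are none."""
    if not per_criterion:
        return 0.0
    return sum(per_criterion.values()) / len(per_criterion)


aggregate = aggregate_score(scores)
# Assumption: a score exactly at the threshold counts as a pass.
passed = aggregate >= PASS_THRESHOLD
print(f"aggregate={aggregate:.2f} passed={passed}")
```

Because every criterion carries equal weight, one criterion scored 0.0 out of five caps the aggregate at 0.8.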

Closes #44

Test plan

uv run pytest tests/eval/ -v — 24 tests pass
uv run pytest tests/ -v — full suite 98 tests pass
uv run python eval/run_evals.py --dataset dataset_1 — produces numeric scores
Run all 3 datasets with a non-rate-limited provider to validate end-to-end
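
For readers who want to picture the new flags, here is a rough sketch of how they could be declared. It assumes `argparse`, which the PR does not confirm; only the flag names and the "defaults to configured provider" behaviour come from the description. The help strings and the `some-model` value are placeholders.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser; only the flag names are taken from the PR."""
    parser = argparse.ArgumentParser(description="Run evals with an LLM judge (sketch).")
    parser.add_argument("--dataset", help="Dataset to evaluate, e.g. dataset_1.")
    # Judge flags default to None so the runner can fall back to the
    # provider that is already configured.
    parser.add_argument("--judge-provider", default=None, help="Override the judge provider.")
    parser.add_argument("--judge-model", default=None, help="Override the judge model.")
    parser.add_argument("--judge-api-key", default=None, help="API key for the judge provider.")
    return parser


# argparse turns "--judge-model" into the attribute "judge_model";
# vars() converts the namespace to a dict, matching the args.get(...) style in the diff.
args = vars(build_parser().parse_args(["--dataset", "dataset_1", "--judge-model", "some-model"]))
print(args)
```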

Remove ~230 lines of brittle regex/substring graders (grade_dataset_1/2/3, _mentions, alternates tables) and replace with a single LLMJudge that scores investigation answers against criteria derived from expected.json.

New modules:
- eval/results.py (score models)
- eval/criteria.py (criteria extraction from expected.json keys)
- eval/judge.py (LLMJudge class + prompt)

Adds --judge-provider, --judge-model, --judge-api-key CLI args to run_evals.py. Pass threshold: 0.8 aggregate score.

vercel · 2026-05-31T07:14:11Z

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| repi | Ready | Preview, Comment | May 31, 2026 7:35am |

Copilot

Pull request overview

This PR replaces the per-dataset, hand-coded evaluation graders in eval/run_evals.py with an LLM-as-judge approach that scores investigation outputs against criteria derived from each dataset’s expected.json, producing per-criterion numeric scores and an aggregate pass/fail based on a 0.8 threshold.

Changes:

Adds a criteria extraction layer (eval/criteria.py) and Pydantic result models (eval/results.py) to drive judge prompts and structured outputs.
Introduces an LLMJudge scorer (eval/judge.py) with deterministic prechecks and judge-response parsing.
Updates the eval runner (eval/run_evals.py) to run the judge, persist numeric results to bug.json, and adds CLI flags for judge provider/model/key; adds test coverage and pytest config.

Reviewed changes

Copilot reviewed 7 out of 8 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| `eval/run_evals.py` | Replaces dataset-specific graders with LLM judge scoring + CLI configuration and result output. |
| `eval/judge.py` | Implements judge prompt, deterministic precheck, and parsing of judge outputs into structured results. |
| `eval/criteria.py` | Extracts criterion names and criteria text from `expected.json` to drive judging. |
| `eval/results.py` | Adds Pydantic models for per-criterion scores and overall judge results. |
| `tests/eval/test_criteria.py` | Adds tests for criteria extraction across dataset shapes. |
| `tests/eval/test_judge.py` | Adds tests for precheck, judge parsing behavior, and judge scoring flow. |
| `pyproject.toml` | Configures pytest asyncio strict mode and sets pythonpath for tests. |


 from repi.core.container import get_container
+from eval.judge import LLMJudge, deterministic_precheck, PASS_THRESHOLD
+from eval.results import JudgeResult


+    if args.get("judge_model"):
+        from repi.llm.adapters import (
+            OpenAIProvider, AnthropicProvider, MistralProvider,
+            GeminiProvider, OllamaProvider,
+        )
+        model = args["judge_model"]
+        if isinstance(llm, OpenAIProvider):
+            llm = OpenAIProvider(api_key=llm._api_key, model=model)
+        elif isinstance(llm, AnthropicProvider):
+            llm = AnthropicProvider(api_key=llm._api_key, model=model)
+        elif isinstance(llm, MistralProvider):
+            llm = MistralProvider(api_key=llm._api_key, model=model)
+        elif isinstance(llm, GeminiProvider):
+            llm = GeminiProvider(api_key=llm._api_key, model=model)
+        elif isinstance(llm, OllamaProvider):
+            llm = OllamaProvider(base_url=llm._base_url, model=model)
+
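
The hunk above rebuilds the provider by reading private attributes (`_api_key`, `_base_url`) through an `isinstance` chain. The second commit on this PR says the keys are now read from config settings instead. One possible shape for that is sketched below. `FakeProvider`, `Settings`, the registry, and all field names are hypothetical stand-ins, not the real `repi` adapters or config.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeProvider:
    """Stand-in for an adapter such as OpenAIProvider; not the real class."""
    model: str
    api_key: Optional[str] = None
    base_url: Optional[str] = None


@dataclass
class Settings:
    """Hypothetical config object; field names are assumptions."""
    provider: str
    api_key: Optional[str] = None
    base_url: Optional[str] = None


# Maps a provider name to (constructor, settings field holding its credential).
PROVIDERS = {
    "openai": (FakeProvider, "api_key"),
    "ollama": (FakeProvider, "base_url"),
}


def build_judge_llm(settings: Settings, model: str) -> FakeProvider:
    """Build a judge provider from public settings, never from private attributes."""
    try:
        ctor, credential_field = PROVIDERS[settings.provider]
    except KeyError:
        raise ValueError(f"Unsupported judge provider: {settings.provider!r}") from None
    return ctor(model=model, **{credential_field: getattr(settings, credential_field)})


llm = build_judge_llm(Settings(provider="ollama", base_url="http://localhost:11434"), "judge-model")
print(llm)
```

A lookup table also makes the unsupported-provider case explicit, whereas the `isinstance` chain silently keeps the original `llm` when no branch matches.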


+def _red_herring_criterion(ea: dict) -> str:
+    ruled_out = ea.get("ruled_out_hypotheses_must_include")
+    if not ruled_out:
+        return ""
+
+    lines = ["## Criterion: red_herring_handling"]
+    lines.append("The ruled_out_hypotheses must address these red herrings:")
+    for item in ruled_out:
+        about = item.get("hypothesis_about", "?")
+        rationale = item.get("rationale", "")
+        lines.append(f"  - {about}: {rationale}")
+    return "\n".join(lines)
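
As written, this helper only reads `ruled_out_hypotheses_must_include`. The follow-up commit mentions that dataset 2 uses `ruled_out_hypotheses_must_consider`. A hedged sketch of a variant that accepts either key is below. The item shape for dataset 2 is not shown in this PR, so tolerating plain-string items is an assumption.

```python
def red_herring_criterion(ea: dict) -> str:
    """Build the red_herring_handling prompt section from either expected.json key."""
    ruled_out = (
        ea.get("ruled_out_hypotheses_must_include")
        or ea.get("ruled_out_hypotheses_must_consider")
    )
    if not ruled_out:
        return ""

    lines = [
        "## Criterion: red_herring_handling",
        "The ruled_out_hypotheses must address these red herrings:",
    ]
    for item in ruled_out:
        if isinstance(item, dict):
            about = item.get("hypothesis_about", "?")
            rationale = item.get("rationale", "")
            lines.append(f"  - {about}: {rationale}")
        else:
            # Assumption: some datasets may list hypotheses as plain strings.
            lines.append(f"  - {item}")
    return "\n".join(lines)


example = {
    "ruled_out_hypotheses_must_consider": [
        {"hypothesis_about": "cache", "rationale": "not involved"},
    ]
}
print(red_herring_criterion(example))
```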


+# Maps expected.json keys to the criterion name the judge will score.
+# Order here determines evaluation order in the prompt.
+_CRITERION_KEYS = [
+    "trigger_identification",
+    "root_cause_accuracy",
+    "propagation_chain",
+    "red_herring_handling",
+    "confidence_calibration",
+    "gap_awareness",
+    "hallucination_avoidance",
+]


+    parsed = json.loads(cleaned)
+    scores_raw = parsed.get("scores", [])
+
+    criteria: list[CriterionScore] = []
+    for entry in scores_raw:
+        criteria.append(CriterionScore(
+            name=entry["name"],
+            score=max(0.0, min(1.0, float(entry["score"]))),
+            explanation=entry.get("explanation", ""),
+        ))
+
+    scored_names = {c.name for c in criteria}
+    for name in criterion_names:
+        if name not in scored_names:
+            criteria.append(CriterionScore(
+                name=name,
+                score=0.0,
+                explanation="Judge did not return a score for this criterion.",
+            ))
+
+    aggregate = sum(c.score for c in criteria) / len(criteria) if criteria else 0.0
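
Two things stand out in this hunk: `json.loads` can raise on a malformed judge reply, and any extra criteria the judge invents are included in the mean. The follow-up commit describes fixes for both. A minimal sketch of what such a parser might look like is below. It uses a dataclass as a stand-in for the pydantic `CriterionScore` model, and `parse_scores` is a hypothetical name rather than the function in `eval/judge.py`.

```python
import json
import math
from dataclasses import dataclass


@dataclass
class CriterionScore:
    """Dataclass stand-in for the pydantic model in eval/results.py."""
    name: str
    score: float
    explanation: str = ""


def parse_scores(raw: str, criterion_names: list) -> tuple:
    """Return (criteria, aggregate), scoring only the requested criterion names."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        parsed = {}  # malformed reply: every requested criterion falls back to 0.0
    scores_raw = parsed.get("scores", []) if isinstance(parsed, dict) else []
    if not isinstance(scores_raw, list):
        scores_raw = []

    by_name = {}
    for entry in scores_raw:
        if not isinstance(entry, dict):
            continue
        name = entry.get("name")
        # Skip criteria that were not requested, and duplicates.
        if name not in criterion_names or name in by_name:
            continue
        try:
            value = float(entry.get("score"))
        except (TypeError, ValueError):
            continue
        if math.isnan(value):
            continue
        clamped = max(0.0, min(1.0, value))
        by_name[name] = CriterionScore(name, clamped, str(entry.get("explanation", "")))

    missing = "Judge did not return a score for this criterion."
    criteria = [by_name.get(n, CriterionScore(n, 0.0, missing)) for n in criterion_names]
    aggregate = sum(c.score for c in criteria) / len(criteria) if criteria else 0.0
    return criteria, aggregate


reply = '{"scores": [{"name": "a", "score": 1.5}, {"name": "extra", "score": 1.0}]}'
criteria, aggregate = parse_scores(reply, ["a", "b"])
print(criteria, aggregate)
```

In the usage lines, the out-of-range 1.5 is clamped to 1.0, the unrequested `extra` entry is dropped, and the missing `b` is scored 0.0.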


- criteria.py: handle ruled_out_hypotheses_must_consider (dataset 2), only add confidence_calibration when present, remove dead _CRITERION_KEYS
- judge.py: graceful JSON parse failure, filter extra criteria from aggregate (only score requested criterion_names)
- run_evals.py: remove unused JudgeResult import, read API keys from config settings instead of private adapter attributes
- tests: add coverage for invalid JSON, extra criteria filtering, dataset 2 ruled_out_hypotheses_must_consider

vercel Bot deployed to Preview May 31, 2026 07:14 View deployment

VarunGitGood requested a review from Copilot May 31, 2026 07:23

Copilot started reviewing on behalf of VarunGitGood May 31, 2026 07:24 View session

Copilot AI reviewed May 31, 2026

vercel Bot deployed to Preview May 31, 2026 07:35 View deployment

VarunGitGood merged commit 3b1f74f into main May 31, 2026
3 checks passed

VarunGitGood mentioned this pull request Jun 4, 2026

feat: LLM-as-judge eval harness — gather/compile split, UTC hardening, leaderboard #50

Merged

7 tasks

VarunGitGood merged 2 commits into main from feat/evaluation-judge

2 participants
