feat(evals): synthetic-perturbation judge validator — primary automated judge validation (L480)#255
Merged
…ed judge validation (L480)

Validates the LLM-as-judge on its actual construct (process quality) with ZERO human labels. Takes a known-good agent output, applies a deterministic corruption targeting one rubric dimension, runs the judge on both, and asserts the targeted dimension drops. Constructed ground truth → no annotation. Tests sensitivity + dimension-specificity (catches rubber-stamp judges, halo effects, verbosity bias).

Explicitly NOT outcome-IC: never touches stock returns. Outcome is a separate, firewalled system diagnostic (reasoning quality and 21d return are weakly correlated; tuning the judge on outcome would Goodhart it).

- evals/perturbation.py: reference fixtures (sector_quant + sector_qual), 8 deterministic corruptions, battery runner with injectable judge_fn, markdown scorecard.
- tests/test_judge_perturbation.py: 16 tests covering corruption determinism + battery logic with a fake judge (regular mocked CI, no key).
- tests/live_smoke/judge_perturbation_smoke.py: paths-filtered live CI gate; tolerant caught-rate threshold (0.75) over a 4-corruption subset. Verified live against claude-haiku-4-5: 4/4 caught (numerical_grounding 5→1, ranking_coherence 5→2, citation_grounding 4→1, reasoning_depth 4→1).
- .github/workflows/judge-perturbation-smoke.yml: triggers on judge/perturbation file changes; checks out alpha-engine-config for the gitignored rubric prompts + uses ANTHROPIC_API_KEY (clean skip on forks).

Full research tests/ suite: 1663 passed. Weekly scorecard stage is the scoped Phase B follow-up (needs a Lambda with Anthropic access).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The primary automated judge-validation mechanism from the 2026-05-29 L480 re-scope (config #372), after Brian's process-vs-outcome challenge established that outcome-IC validates the system, not the judge.
What it does
Validates the LLM-as-judge on its actual construct — process quality — with zero human labels. Take a known-good agent output, apply a deterministic corruption targeting one rubric dimension, run the judge on reference + corrupted, assert the targeted dimension drops. Ground truth is constructed (we authored the corruption), so no annotation. Tests sensitivity (does it notice degradation?) + dimension-specificity (does the right dimension move?) — catching rubber-stamp judges, halo effects, and verbosity bias.
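A minimal sketch of one such check, assuming a judge that maps text to per-dimension integer scores. Everything here is illustrative rather than the code in evals/perturbation.py: `strip_numbers`, `toy_judge`, `check_pair`, and the `max_collateral` tolerance are hypothetical names, and the toy judge only stands in for the real LLM call.

```python
import re
from typing import Callable, Dict

Scores = Dict[str, int]


def strip_numbers(text: str) -> str:
    """Deterministic corruption (illustrative): replace numeric tokens with a vague word."""
    return re.sub(r"\d+(?:\.\d+)?%?", "some", text)


def toy_judge(text: str) -> Scores:
    """Hypothetical stand-in for the LLM judge: grounding depends on digits being present."""
    has_numbers = bool(re.search(r"\d", text))
    return {"numerical_grounding": 5 if has_numbers else 1, "reasoning_depth": 4}


def rubber_stamp_judge(text: str) -> Scores:
    """A judge that ignores its input, the failure mode the check should expose."""
    return {"numerical_grounding": 5, "reasoning_depth": 5}


def check_pair(
    reference: str,
    corrupt: Callable[[str], str],
    judge_fn: Callable[[str], Scores],
    target: str,
    max_collateral: int = 1,  # assumed tolerance for movement in untargeted dimensions
) -> Dict[str, object]:
    before = judge_fn(reference)
    after = judge_fn(corrupt(reference))
    # Sensitivity: the targeted dimension must drop.
    sensitive = after[target] < before[target]
    # Specificity: untargeted dimensions should stay roughly where they were.
    specific = all(
        before[dim] - after[dim] <= max_collateral for dim in before if dim != target
    )
    return {"sensitive": sensitive, "specific": specific, "drop": before[target] - after[target]}


REFERENCE = "Revenue grew 12.5% YoY and margin expanded 3 points, so rank 1."
good = check_pair(REFERENCE, strip_numbers, toy_judge, "numerical_grounding")
stamped = check_pair(REFERENCE, strip_numbers, rubber_stamp_judge, "numerical_grounding")
```

Because the corruption is authored, the expected direction of the score change is known in advance, which is what removes the need for annotation. A rubber-stamp judge fails the sensitivity half; a halo-prone judge would fail the specificity half.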
Explicitly NOT outcome-IC. Never touches stock returns. Outcome (realized alpha) is a separate, firewalled system diagnostic — reasoning quality and 21d return are weakly correlated, so validating/tuning the judge on outcomes would Goodhart it into a luck-predictor.
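Since the check depends only on judge scores for two texts, the judge can be passed in as a plain callable, which also keeps outcome data out of the loop entirely. The sketch below shows that shape under stated assumptions: `Corruption`, `run_battery`, `caught_rate`, and `fake_judge` are hypothetical names, the three toy corruptions are not the eight in the PR, and treating the gate as `rate >= 0.75` is an assumption about how the threshold is applied.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Scores = Dict[str, int]
JudgeFn = Callable[[str], Scores]  # text in, rubric scores out; no return data anywhere


@dataclass(frozen=True)
class Corruption:
    name: str
    target: str  # rubric dimension the corruption is meant to degrade
    apply: Callable[[str], str]


def run_battery(reference: str, corruptions: List[Corruption], judge_fn: JudgeFn) -> List[dict]:
    baseline = judge_fn(reference)
    rows = []
    for c in corruptions:
        scored = judge_fn(c.apply(reference))
        rows.append({
            "corruption": c.name,
            "target": c.target,
            "before": baseline[c.target],
            "after": scored[c.target],
            "caught": scored[c.target] < baseline[c.target],
        })
    return rows


def caught_rate(rows: List[dict]) -> float:
    return sum(r["caught"] for r in rows) / len(rows)


def fake_judge(text: str) -> Scores:
    """Offline stand-in with a deliberate blind spot on reasoning_depth."""
    return {
        "numerical_grounding": 5 if any(ch.isdigit() for ch in text) else 1,
        "citation_grounding": 4 if "[src]" in text else 1,
        "reasoning_depth": 4,  # never moves, so that corruption goes uncaught
    }


CORRUPTIONS = [
    Corruption("strip_numbers", "numerical_grounding",
               lambda t: "".join(ch for ch in t if not ch.isdigit())),
    Corruption("strip_citations", "citation_grounding",
               lambda t: t.replace("[src]", "")),
    Corruption("flatten_reasoning", "reasoning_depth",
               lambda t: t.split(" because ")[0]),
]

REFERENCE = "Margin rose 3 points [src] because pricing held while input costs fell."
THRESHOLD = 0.75  # tolerant caught-rate threshold quoted in this PR

rows = run_battery(REFERENCE, CORRUPTIONS, fake_judge)
rate = caught_rate(rows)
gate_passed = rate >= THRESHOLD
```

With this fake judge, two of three corruptions are caught, so the gate fails. Swapping `fake_judge` for a function that calls the real model leaves the harness logic unchanged.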
Pieces
- evals/perturbation.py: reference fixtures (sector_quant + sector_qual), 8 deterministic corruptions (strip numbers, break rank-vs-score coherence, flatten calibration, gut completeness, strip citations, flatten reasoning depth, misalign evidence, verbosity-pad probe), battery runner with injectable judge_fn (so harness logic is testable offline), markdown scorecard.
- tests/test_judge_perturbation.py: 16 tests covering corruption determinism + battery logic via a fake judge. Runs in regular mocked CI (no API key).
- tests/live_smoke/judge_perturbation_smoke.py: paths-filtered live CI gate; tolerant caught-rate threshold (0.75) over a 4-corruption subset; clean skip on forks without the secret.
- .github/workflows/judge-perturbation-smoke.yml: triggers on judge/perturbation file changes; checks out alpha-engine-config (gitignored rubric prompts) + uses ANTHROPIC_API_KEY.
Validated live
Ran against claude-haiku-4-5: 4/4 caught (numerical_grounding 5→1, ranking_coherence 5→2, citation_grounding 4→1, reasoning_depth 4→1). Two fixture bugs were found and fixed during validation: the qual fixture tripped the degenerate_input pre-check, and the first ranking corruption was too weak (scores travel with picks, so it needed a score-vs-rank contradiction).
Full research tests/ suite: 1663 passed.
Scope note
This is Phase A (core harness + CI gate). The weekly sensitivity scorecard (emit to S3 + surface in the evaluator email) is the scoped Phase B follow-up — it needs a Lambda with Anthropic access (the EvalRollingMean Lambda is an aggregation stage without it). Tracked in L480.
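As an illustration of what a scorecard could contain, the sketch below renders the four live results reported in this PR as a markdown table with a caught-rate summary. The table layout, the `render_scorecard` name, and the `>=` comparison against 0.75 are assumptions, not the format produced by evals/perturbation.py, and nothing here writes to S3.

```python
from typing import List, Tuple

# (dimension, reference score, corrupted score) as reported in the live validation run.
LIVE_RESULTS: List[Tuple[str, int, int]] = [
    ("numerical_grounding", 5, 1),
    ("ranking_coherence", 5, 2),
    ("citation_grounding", 4, 1),
    ("reasoning_depth", 4, 1),
]


def render_scorecard(results: List[Tuple[str, int, int]], threshold: float = 0.75) -> str:
    """Render a hypothetical markdown scorecard from per-corruption score pairs."""
    caught = [after < before for _, before, after in results]
    rate = sum(caught) / len(caught)
    lines = [
        "| dimension | reference | corrupted | drop | caught |",
        "| --- | --- | --- | --- | --- |",
    ]
    for (dim, before, after), ok in zip(results, caught):
        lines.append(
            f"| {dim} | {before} | {after} | {before - after} | {'yes' if ok else 'no'} |"
        )
    status = "PASS" if rate >= threshold else "FAIL"
    lines.append("")
    lines.append(
        f"Caught rate: {sum(caught)}/{len(caught)} ({rate:.2f}), "
        f"threshold {threshold:.2f}: {status}"
    )
    return "\n".join(lines)


scorecard = render_scorecard(LIVE_RESULTS)
print(scorecard)
```

Keeping the renderer a pure function of the battery results means the same output could be produced in CI today and by a later weekly stage without changes to the harness.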
Independent of the schema-v18 PR (#254); this touches only evals/ + tests + a workflow.
🤖 Generated with Claude Code