feat(evals): synthetic-perturbation judge validator — primary automated judge validation (L480) by cipher813 · Pull Request #255 · cipher813/alpha-engine-research

cipher813 · 2026-05-29T19:15:21Z

The primary automated judge-validation mechanism from the 2026-05-29 L480 re-scope (config #372), after Brian's process-vs-outcome challenge established that outcome-IC validates the system, not the judge.

What it does

Validates the LLM-as-judge on its actual construct (process quality) with zero human labels. Take a known-good agent output, apply a deterministic corruption targeting one rubric dimension, run the judge on both the reference and the corrupted output, and assert that the targeted dimension's score drops. Ground truth is constructed (we authored the corruption), so no annotation is needed. This tests sensitivity (does the judge notice degradation?) and dimension-specificity (does the right dimension move?), catching rubber-stamp judges, halo effects, and verbosity bias.
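
The loop described above can be sketched roughly as follows. This is a minimal illustration, not the code in evals/perturbation.py: the `REFERENCE` dict, `strip_numbers`, `fake_judge`, and `check_perturbation` are hypothetical names, and the fake judge is a deterministic stand-in so the example runs without an API key.

```python
import copy
import re

# Hypothetical reference output; the real fixtures live in evals/perturbation.py.
REFERENCE = {
    "thesis": "Revenue grew 18% YoY to $4.2B while gross margin widened 230 bps.",
    "picks": [{"ticker": "AAA", "score": 82}, {"ticker": "BBB", "score": 61}],
}


def strip_numbers(output: dict) -> dict:
    """Deterministic corruption: replace every numeric token in the thesis."""
    corrupted = copy.deepcopy(output)  # never mutate the reference fixture
    corrupted["thesis"] = re.sub(
        r"\$?\d+(?:\.\d+)?[%BMK]?", "some", corrupted["thesis"]
    )
    return corrupted


def fake_judge(output: dict) -> dict:
    """Stand-in for the LLM judge: scores grounding by the presence of digits."""
    has_digits = bool(re.search(r"\d", output["thesis"]))
    return {"numerical_grounding": 5 if has_digits else 1, "ranking_coherence": 5}


def check_perturbation(judge_fn, reference: dict, corrupt, target: str) -> dict:
    """Judge reference and corrupted output, then compare per-dimension scores."""
    before = judge_fn(reference)
    after = judge_fn(corrupt(reference))
    return {
        # Sensitivity: the targeted dimension must score lower after corruption.
        "target_dropped": after[target] < before[target],
        # Specificity: list any non-targeted dimensions that also moved.
        "others_moved": [
            dim for dim in before if dim != target and after[dim] != before[dim]
        ],
    }


result = check_perturbation(fake_judge, REFERENCE, strip_numbers, "numerical_grounding")
assert result["target_dropped"]
assert result["others_moved"] == []
```

Because the corruption is authored, the expected direction of the score change is known in advance, which is what removes the need for human labels.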

Explicitly NOT outcome-IC. Never touches stock returns. Outcome (realized alpha) is a separate, firewalled system diagnostic: reasoning quality and 21-day return are only weakly correlated, so validating or tuning the judge on outcomes would Goodhart it into a luck-predictor.

Pieces

evals/perturbation.py — reference fixtures (sector_quant + sector_qual), 8 deterministic corruptions (strip numbers, break rank-vs-score coherence, flatten calibration, gut completeness, strip citations, flatten reasoning depth, misalign evidence, verbosity-pad probe), battery runner with injectable judge_fn (so harness logic is testable offline), markdown scorecard.
tests/test_judge_perturbation.py — 16 tests: corruption determinism + battery logic via a fake judge. Runs in regular mocked CI (no API key).
tests/live_smoke/judge_perturbation_smoke.py — paths-filtered live CI gate; tolerant caught-rate threshold (0.75) over a 4-corruption subset; clean skip on forks without the secret.
.github/workflows/judge-perturbation-smoke.yml — triggers on judge/perturbation file changes; checks out alpha-engine-config (gitignored rubric prompts) + uses ANTHROPIC_API_KEY.
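
As a rough sketch of how a battery runner with an injectable judge_fn and a tolerant caught-rate gate might fit together (the 0.75 threshold is from this PR; `Corruption`, `run_battery`, `caught_rate`, and both toy judges are assumed names for illustration, not the actual API of evals/perturbation.py):

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List

Scores = Dict[str, int]
CAUGHT_RATE_THRESHOLD = 0.75  # tolerant gate used by the live smoke test


@dataclass(frozen=True)
class Corruption:
    name: str
    target: str                   # rubric dimension expected to drop
    apply: Callable[[str], str]   # deterministic text -> text transform


def run_battery(judge_fn: Callable[[str], Scores], reference: str,
                corruptions: List[Corruption]) -> List[dict]:
    """Judge the reference once, then each corrupted variant."""
    baseline = judge_fn(reference)
    rows = []
    for c in corruptions:
        scores = judge_fn(c.apply(reference))
        rows.append({
            "name": c.name,
            "target": c.target,
            "before": baseline[c.target],
            "after": scores[c.target],
            "caught": scores[c.target] < baseline[c.target],
        })
    return rows


def caught_rate(rows: List[dict]) -> float:
    """Fraction of corruptions whose targeted dimension dropped."""
    return sum(r["caught"] for r in rows) / len(rows)


# Toy judges injected in place of the live model (no API key needed).
def fake_judge(text: str) -> Scores:
    return {
        "numerical_grounding": 5 if re.search(r"\d", text) else 1,
        "citation_grounding": 4 if "[src:" in text else 1,
    }


def rubber_stamp_judge(text: str) -> Scores:
    return {"numerical_grounding": 5, "citation_grounding": 5}


REFERENCE_TEXT = "Margin rose to 41% from 38% [src:filing]."
BATTERY = [
    Corruption("strip_numbers", "numerical_grounding",
               lambda t: re.sub(r"\d+%?", "some", t)),
    Corruption("strip_citations", "citation_grounding",
               lambda t: re.sub(r"\s*\[src:[^\]]*\]", "", t)),
]

assert caught_rate(run_battery(fake_judge, REFERENCE_TEXT, BATTERY)) >= CAUGHT_RATE_THRESHOLD
assert caught_rate(run_battery(rubber_stamp_judge, REFERENCE_TEXT, BATTERY)) < CAUGHT_RATE_THRESHOLD
```

Injecting the judge as a plain callable is what lets the harness logic run in mocked CI, while the live smoke test passes the real model call through the same interface.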

Validated live

Ran against claude-haiku-4-5 — 4/4 caught: numerical_grounding 5→1, ranking_coherence 5→2, citation_grounding 4→1, reasoning_depth 4→1. (Two fixture bugs found+fixed during validation: qual fixture tripped the degenerate_input pre-check; the first ranking corruption was too weak — scores travel with picks, so it needed score-vs-rank contradiction.)
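
The ranking fixture bug can be illustrated with a small sketch. This is one possible reading of "scores travel with picks", under the assumption that each pick carries explicit `rank` and `score` fields. The data and function names are hypothetical and may not match the real fixtures.

```python
import copy

# Hypothetical picks; assumes distinct scores so a reversal is detectable.
PICKS = [
    {"ticker": "AAA", "rank": 1, "score": 88},
    {"ticker": "BBB", "rank": 2, "score": 74},
    {"ticker": "CCC", "rank": 3, "score": 59},
]


def is_rank_score_coherent(picks: list) -> bool:
    """True when scores are non-increasing as rank goes from 1 downward."""
    ordered = sorted(picks, key=lambda p: p["rank"])
    scores = [p["score"] for p in ordered]
    return all(a >= b for a, b in zip(scores, scores[1:]))


def reorder_entries(picks: list) -> list:
    """Weak corruption: reorder the list; each pick keeps its rank and score."""
    return list(reversed(copy.deepcopy(picks)))


def contradict_scores(picks: list) -> list:
    """Stronger corruption: keep ranks, reverse scores so rank 1 has the lowest."""
    corrupted = copy.deepcopy(picks)
    by_rank = sorted(corrupted, key=lambda p: p["rank"])
    scores = [p["score"] for p in by_rank]
    for pick, score in zip(by_rank, reversed(scores)):
        pick["score"] = score
    return corrupted


assert is_rank_score_coherent(PICKS)
assert is_rank_score_coherent(reorder_entries(PICKS))        # still coherent: nothing to catch
assert not is_rank_score_coherent(contradict_scores(PICKS))  # rank and score now disagree
```

Under this reading, a reorder alone leaves the output internally consistent, so a judge that scores it unchanged is behaving correctly. Only an explicit contradiction gives ranking_coherence something to penalize.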

Full research tests/ suite: 1663 passed.

Scope note

This is Phase A (core harness + CI gate). The weekly sensitivity scorecard (emit to S3 + surface in the evaluator email) is the scoped Phase B follow-up — it needs a Lambda with Anthropic access (the EvalRollingMean Lambda is an aggregation stage without it). Tracked in L480.

Independent of the schema-v18 PR (#254); this touches only evals/ + tests + a workflow.

🤖 Generated with Claude Code

…ed judge validation (L480)

Validates the LLM-as-judge on its actual construct (process quality) with ZERO human labels. Takes a known-good agent output, applies a deterministic corruption targeting one rubric dimension, runs the judge on both, and asserts the targeted dimension drops. Constructed ground truth → no annotation. Tests sensitivity + dimension-specificity (catches rubber-stamp judges, halo effects, verbosity bias).

Explicitly NOT outcome-IC: never touches stock returns. Outcome is a separate, firewalled system diagnostic (reasoning quality and 21d return are weakly correlated; tuning the judge on outcome would Goodhart it).

- evals/perturbation.py: reference fixtures (sector_quant + sector_qual), 8 deterministic corruptions, battery runner with injectable judge_fn, markdown scorecard.
- tests/test_judge_perturbation.py: 16 tests covering corruption determinism + battery logic with a fake judge (regular mocked CI, no key).
- tests/live_smoke/judge_perturbation_smoke.py: paths-filtered live CI gate; tolerant caught-rate threshold (0.75) over a 4-corruption subset. Verified live against claude-haiku-4-5: 4/4 caught (numerical_grounding 5→1, ranking_coherence 5→2, citation_grounding 4→1, reasoning_depth 4→1).
- .github/workflows/judge-perturbation-smoke.yml: triggers on judge/perturbation file changes; checks out alpha-engine-config for the gitignored rubric prompts + uses ANTHROPIC_API_KEY (clean skip on forks).

Full research tests/ suite: 1663 passed. Weekly scorecard stage is the scoped Phase B follow-up (needs a Lambda with Anthropic access).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

cipher813 merged commit bd68e47 into main May 29, 2026
2 checks passed

cipher813 deleted the feat/judge-perturbation-validator-260529 branch May 29, 2026 19:37
