
scripts mmlu_pro: Analyze using gpt-5-mini as llm-as-judge#99

Merged
tomtseng merged 3 commits into main from tomtseng/llm-as-a-judge-gpt-5 on Feb 24, 2026

Conversation

@tomtseng
Collaborator

@tomtseng tomtseng commented Feb 18, 2026

Child PR: #101

Changes

PR #89 analyzed what happened if we used LLM-as-a-judge for MMLU-Pro and found that gpt-4.1-mini was too flawed as a judge to use. This PR updates the scripts to allow using gpt-5-mini as a judge and finds that it works better.

Results

Overall summary: as far as I can tell, the judge model is better than the regex parser when they disagree, but they don't disagree often enough that using the judge would significantly change results.

Detail

The regex parser and the judge agree on whether a response is correct on 98.4% of responses. On the remaining responses, the judge is usually right.

  • when the regex says a response is correct and the judge says it's incorrect, it's usually because:
    • the regex picks up a spurious letter that happens to be right,
    • the response contains multiple answers and the regex grabs the first one, which happens to be right, but the judge correctly marks the response wrong, or
    • the reasoning leading up to the answer suggests a different answer than what the model finally outputs
  • when the judge says a response is correct and the regex says it's incorrect, it's usually because the model explained the answer correctly in prose but didn't output it in the expected format
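The first failure mode is easy to reproduce with a minimal sketch. The regex below is hypothetical and much simpler than the actual parser in the scripts; it just illustrates how a first-match extractor can latch onto a spurious letter:

```python
import re

def extract_answer(response: str):
    """Naive MMLU-Pro answer extractor (hypothetical sketch, not the
    actual parser): return the first standalone option letter A-J."""
    match = re.search(r"\b([A-J])\b", response)
    return match.group(1) if match else None

# Failure mode: the sentence-initial "A" is a standalone letter, so the
# regex returns "A" even though the stated answer is "B".
print(extract_answer("A quick check of option (C) rules it out; the answer is B."))  # → A
```

If the gold answer happens to be "A", the regex scores this response correct for the wrong reason, which is exactly the case the judge catches.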

Low-reasoning gpt-5-mini is a clear improvement over gpt-4.1-mini as a judge for MMLU-Pro, and costs the same ($150 sweeping over nov7_trial/).

Does it change any of our topline results, though? Not meaningfully. Practically speaking, the regex parser seems sufficient.

  • Unattacked models: baseline no_weight_modification MMLU-Pro scores change by only 0–2 pp, except for TAR, which gains 7 pp under the judge
  • For the hyperparameter sweeps where we look for the top StrongREJECT score under a max 10 pp MMLU-Pro drop, 178/190 = 94% of model-attack pairs have no change. 12 pairs change, with a mean change of 0.003. The two biggest changes:
    • qwen3_32b_prev / full_parameter_finetune: +0.12 (judge admits 1 more trial above threshold that happens to have high harmfulness)
    • qwen3_4b / full_parameter_finetune: +0.12 (same pattern — 4 more trials pass the judge threshold)
    • A few pairs show modest drops (e.g., llama3_8b_tar / benign_full_parameter_finetune: -0.06, qwen3_0_6b_base / benign_lora_finetune: -0.06)
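The selection rule used in the sweeps above can be sketched as follows. The function and field names are illustrative, not from the actual scripts: among the trials for one model-attack pair, we take the maximum StrongREJECT score over trials whose MMLU-Pro drop from baseline is at most 10 pp.

```python
def top_strongreject(trials, baseline_mmlu, max_drop_pp=10.0):
    """Hypothetical sketch of the sweep selection rule: best StrongREJECT
    score among trials within the allowed MMLU-Pro drop."""
    eligible = [t for t in trials
                if baseline_mmlu - t["mmlu_pro"] <= max_drop_pp]
    return max((t["strongreject"] for t in eligible), default=None)

trials = [
    {"mmlu_pro": 62.0, "strongreject": 0.30},  # drop 1 pp, eligible
    {"mmlu_pro": 48.0, "strongreject": 0.55},  # drop 15 pp, excluded
    {"mmlu_pro": 55.0, "strongreject": 0.42},  # drop 8 pp, eligible
]
print(top_strongreject(trials, baseline_mmlu=63.0))  # → 0.42
```

This makes the reported sensitivity clear: switching parsers only changes a pair's result when a trial's MMLU-Pro score moves across the 10 pp threshold, admitting or excluding a trial with a higher StrongREJECT score.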

@tomtseng tomtseng changed the title scripts mmlu_pro: Allow reasoning models for LLM-as-judge mmlu_pro: Allow using LLM-as-judge as parser; scripts mmlu_pro: Allow reasoning models for LLM-as-judge Feb 24, 2026
@tomtseng tomtseng changed the title mmlu_pro: Allow using LLM-as-judge as parser; scripts mmlu_pro: Allow reasoning models for LLM-as-judge mmlu_pro: Allow using LLM-as-judge as parser Feb 24, 2026
@tomtseng tomtseng changed the title mmlu_pro: Allow using LLM-as-judge as parser scripts mmlu_pro: Analyze using gpt-5-mini as llm-as-judge Feb 24, 2026
@tomtseng
Copy link
Collaborator Author

CI is failing, but it's unrelated to this PR and will be fixed by #102.

@tomtseng tomtseng marked this pull request as ready for review February 24, 2026 14:25
@tomtseng tomtseng merged commit b074096 into main Feb 24, 2026
0 of 2 checks passed
@tomtseng tomtseng deleted the tomtseng/llm-as-a-judge-gpt-5 branch February 24, 2026 14:25
