
scripts mmlu_pro: Analyze using gpt-5-mini as llm-as-judge#99

Merged
tomtseng merged 3 commits into main from tomtseng/llm-as-a-judge-gpt-5 on Feb 24, 2026

Conversation

@tomtseng
Collaborator

@tomtseng tomtseng commented Feb 18, 2026

Child PR: #101

Changes

PR #89 analyzed what happened if we used LLM-as-a-judge for MMLU-Pro and found that gpt-4.1-mini was too flawed as a judge to use. This PR updates the scripts to allow using gpt-5-mini as a judge and finds that it works better.

Results

Overall summary: as far as I can tell, the judge model is better than the regex parser when they disagree, but they don't disagree often enough that using the judge would significantly change results.

Detail

The regex parser and the judge agree on whether a response is correct on 98.4% of responses. On the remaining responses, the judge is usually right.

  • when the regex says a response is correct and the judge says it's incorrect, it's usually because:
    • the regex picks up a spurious letter that happens to be right,
    • the response contains multiple answers and the regex grabs the first one, which happens to be right, but the judge correctly marks the response wrong, or
    • the reasoning leading up to the answer suggests a different answer than what the model finally outputs
  • when the judge says a response is correct and the regex says it's incorrect, it's usually because the model explained the answer correctly in prose but didn't output it in the expected format
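The first failure mode is easy to reproduce with a minimal sketch. The regex below is hypothetical and much simpler than the actual parser in the scripts; it just illustrates how a first-match extractor can latch onto a spurious letter:

```python
import re

def extract_answer(response: str):
    """Naive MMLU-Pro answer extractor (hypothetical sketch, not the
    actual parser): return the first standalone option letter A-J."""
    match = re.search(r"\b([A-J])\b", response)
    return match.group(1) if match else None

# Failure mode: the sentence-initial "A" is a standalone letter, so the
# regex returns "A" even though the stated answer is "B".
print(extract_answer("A quick check of option (C) rules it out; the answer is B."))  # → A
```

If the gold answer happens to be "A", the regex scores this response correct for the wrong reason, which is exactly the case the judge catches.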

Low-reasoning gpt-5-mini is a clear improvement over gpt-4.1-mini as a judge for MMLU-Pro, and costs the same ($150 sweeping over nov7_trial/).

Does it change any of our topline results, though? Not meaningfully. Practically speaking, the regex parser seems sufficient.

  • Unattacked models: baseline no_weight_modification MMLU-Pro scores change by only 0–2 pp, except for TAR, which gains 7 pp under the judge
  • For the hyperparameter sweeps where we look for the top StrongREJECT score under a max 10 pp MMLU-Pro drop, 178/190 = 94% of model-attack pairs have no change. 12 pairs change, with a mean change of 0.003. The two biggest changes:
    • qwen3_32b_prev / full_parameter_finetune: +0.12 (judge admits 1 more trial above threshold that happens to have high harmfulness)
    • qwen3_4b / full_parameter_finetune: +0.12 (same pattern — 4 more trials pass the judge threshold)
    • A few pairs show modest drops (e.g., llama3_8b_tar / benign_full_parameter_finetune: -0.06, qwen3_0_6b_base / benign_lora_finetune: -0.06)
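The selection rule used in the sweeps above can be sketched as follows. The function and field names are illustrative, not from the actual scripts: among the trials for one model-attack pair, we take the maximum StrongREJECT score over trials whose MMLU-Pro drop from baseline is at most 10 pp.

```python
def top_strongreject(trials, baseline_mmlu, max_drop_pp=10.0):
    """Hypothetical sketch of the sweep selection rule: best StrongREJECT
    score among trials within the allowed MMLU-Pro drop."""
    eligible = [t for t in trials
                if baseline_mmlu - t["mmlu_pro"] <= max_drop_pp]
    return max((t["strongreject"] for t in eligible), default=None)

trials = [
    {"mmlu_pro": 62.0, "strongreject": 0.30},  # drop 1 pp, eligible
    {"mmlu_pro": 48.0, "strongreject": 0.55},  # drop 15 pp, excluded
    {"mmlu_pro": 55.0, "strongreject": 0.42},  # drop 8 pp, eligible
]
print(top_strongreject(trials, baseline_mmlu=63.0))  # → 0.42
```

This makes the reported sensitivity clear: switching parsers only changes a pair's result when a trial's MMLU-Pro score moves across the 10 pp threshold, admitting or excluding a trial with a higher StrongREJECT score.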

@tomtseng tomtseng changed the title scripts mmlu_pro: Allow reasoning models for LLM-as-judge mmlu_pro: Allow using LLM-as-judge as parser; scripts mmlu_pro: Allow reasoning models for LLM-as-judge Feb 24, 2026
@tomtseng tomtseng changed the title mmlu_pro: Allow using LLM-as-judge as parser; scripts mmlu_pro: Allow reasoning models for LLM-as-judge mmlu_pro: Allow using LLM-as-judge as parser Feb 24, 2026
@tomtseng tomtseng changed the title mmlu_pro: Allow using LLM-as-judge as parser scripts mmlu_pro: Analyze using gpt-5-mini as llm-as-judge Feb 24, 2026
@tomtseng
Copy link
Collaborator Author

CI is failing, but it's unrelated to this PR and will be fixed by #102.

@tomtseng tomtseng marked this pull request as ready for review February 24, 2026 14:25
@tomtseng tomtseng merged commit b074096 into main Feb 24, 2026
0 of 2 checks passed
@tomtseng tomtseng deleted the tomtseng/llm-as-a-judge-gpt-5 branch February 24, 2026 14:25
