Can Small Language Models Judge as Well as Large Language Models?
16 SLM Judges • 10 Datasets • 3 Advanced Strategies • 6 Persona Prompts
| Date | Update |
|---|---|
| Jun 2026 | Paper submitted: arXiv:2606.07810 |
| May 2026 | Interactive leaderboard website launched: anishh15.github.io/SLMJury |
| May 2026 | v0.1.0 released on PyPI -- first public release |
SLMJury is a comprehensive framework that investigates whether Small Language Models (0.6B-14B parameters) can serve as reliable judges across both closed-ended (accuracy-based) and open-ended (correlation-based) evaluation paradigms. The project explores six evaluation modes: individual judging, persona-based evaluation, majority-vote ensembles, multi-agent debate, human agreement scoring (SummEval), and LLM agreement scoring (MT-Bench).
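To make the majority-vote ensemble mode concrete, here is a minimal sketch of voting over per-judge verdicts and of enumerating three-judge ensembles from a pool of five. It is not the package's implementation: `majority_vote_sketch` and the judge names are hypothetical, and raising on a tie is an assumption made for this example.

```python
from collections import Counter
from itertools import combinations


def majority_vote_sketch(verdicts):
    """Return the verdict given by the most judges.

    Hypothetical helper, not the slmjury API. Raises ValueError on an
    empty list or a tie (assumed behaviour for this sketch).
    """
    if not verdicts:
        raise ValueError("need at least one verdict")
    ranked = Counter(verdicts).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        raise ValueError("tie between verdicts")
    return ranked[0][0]


# Placeholder judge names; any pool of five judges works the same way.
judges = ["judge_a", "judge_b", "judge_c", "judge_d", "judge_e"]

# Every unordered choice of 3 judges out of 5: C(5,3) = 10 ensembles.
ensembles = list(combinations(judges, 3))

print(majority_vote_sketch(["Correct", "Incorrect", "Correct"]))
print(len(ensembles))
```

With an odd number of judges and binary Correct/Incorrect verdicts a tie cannot occur, which is one reason three-judge ensembles are convenient.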
```bash
pip install slmjury
```

Optional extras:

```bash
pip install "slmjury[vllm]"      # GPU inference with vLLM
pip install "slmjury[together]"  # Together API for oracle scoring
pip install "slmjury[full]"      # Everything (vllm + together + dev tools)
```

Install from source:

```bash
git clone https://github.com/anishh15/SLMJury.git
cd SLMJury
pip install -e .
```

```bash
# Step 1: Run student model inference
python scripts/run_student.py --model qwen2.5-32b --datasets gsm8k math
# Step 2: Run judge evaluations
python scripts/run_judge.py --judge qwen3-4b --max-tokens 10 8192
# Step 3: Evaluate all judgements and generate summaries
python scripts/run_evaluation.py
```

```python
from slmjury.core.solver import StudentSolver
from slmjury.core.judge import JudgeModel
from slmjury.core.evaluator import JudgeEvaluator

# Step 1: Solve problems with a student model
solver = StudentSolver("qwen2.5-32b")
results = solver.solve_batch(problems, "gsm8k")
solver.save_results(results, "gsm8k")
solver.cleanup()

# Step 2: Judge the solutions
judge = JudgeModel("qwen3-4b")
judgements = judge.evaluate_batch(results, max_tokens=10)
judge.save_results(judgements, "qwen2.5-32b", "gsm8k", 10)
judge.cleanup()

# Step 3: Evaluate judge accuracy
evaluator = JudgeEvaluator("qwen3-4b", "qwen2.5-32b", "gsm8k", 10, judgements)
summary = evaluator.evaluate()
```

Advanced: Multi-Agent Strategies
```python
# Majority voting on individual verdicts
from slmjury.strategies.ensemble import majority_vote
verdict = majority_vote(["Correct", "Incorrect", "Correct"])  # → "Correct"

# Generate all C(5,3)=10 ensemble combinations from pre-computed judgements
from slmjury.strategies.ensemble import generate_all_ensembles
generate_all_ensembles(
    judgements_dir="results/judgements",
    output_dir="results/majority_voting",
)

# Multi-agent debate (3 judges, RCR prompting)
from slmjury.strategies.debate import run_debate
run_debate(
    combo_models=["qwen3-4b", "phi4mi-3.8b", "qwen2.5-3b"],
    combo_temps=[0, 0, 0],
    student_results=results,
    dataset_name="gsm8k",
)

# Persona effects
from slmjury.strategies.persona import run_persona_evaluation, get_personas
personas = get_personas()
run_persona_evaluation(
    judge_model_key="qwen3-4b",
    student_results=results,
    student_model="qwen2.5-32b",
    dataset="gsm8k",
    max_tokens=10,
    persona_name="strict",
    persona_prompt=personas["strict"],
)
```

Open-Ended Scoring (SummEval / MT-Bench)
```bash
# Score SummEval with a single judge
python scripts/run_scoring_judge.py \
    --judge qwen3-4b --dataset summeval
# Score MT-Bench with a single judge
python scripts/run_scoring_judge.py \
    --judge qwen3-4b --dataset mtbench \
    --oracle-scores results/mtbench_oracle/
```

```python
from slmjury.core.scoring_judge import ScoringJudge

judge = ScoringJudge("qwen3-4b", output_dir="results/scoring")

# Score SummEval (4-dimension scoring)
summeval_data = load_dataset("summeval")
results = judge.score_summeval(summeval_data, max_tokens=8192)
judge.save_results(results, "summeval")
judge.cleanup()
```

| Family | Models | Parameters | Thinking |
|---|---|---|---|
| Qwen 2.5 | 1.5B, 3B, 7B | 1.5B - 7B | - |
| Qwen 3 | 0.6B, 1.7B, 4B, 8B, 14B | 0.6B - 14B | ✓ |
| Llama 3.x | 3.2-1B, 3.2-3B, 3.1-8B | 1B - 8B | - |
| Phi-4 | 14B, Reasoning, R-Plus, Mini, Mini-Reasoning | 3.8B - 14B | ✓ * |
*Phi-4 Reasoning/Plus/Mini-Reasoning always use thinking mode and skip quick verdict (t=10) evaluation.
Closed-ended (verdict: Correct/Incorrect):
| Dataset | Type | Domain | Size |
|---|---|---|---|
| GSM8K | Numeric | Math | 1,319 |
| GSM-Plus | Numeric | Math | 10,552 |
| MATH | LaTeX | Math | 5,000 |
| ARC-Easy | Multiple Choice | Science | 2,376 |
| ARC-Challenge | Multiple Choice | Science | 1,172 |
| HellaSwag | Multiple Choice | General | 10,042 |
| WinoGrande | Multiple Choice | General | 1,267 |
| TruthfulQA | Multiple Choice | General | 684 |
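For the closed-ended datasets, judge accuracy can be read as the fraction of items where the judge's Correct/Incorrect verdict matches whether the student's answer was actually right. The sketch below illustrates that reading; `judge_accuracy` is a hypothetical helper, not the package's `JudgeEvaluator`, and counting any verdict other than the expected label (including an unparseable one) as a miss is an assumption.

```python
def judge_accuracy(student_correct, judge_verdicts):
    """Fraction of items where the judge's verdict matches ground truth.

    student_correct: list of bools, True if the student's answer matched
        the reference answer (assumed to be computed beforehand).
    judge_verdicts: list of strings, expected to be "Correct" or "Incorrect".
    Hypothetical helper for illustration only.
    """
    if len(student_correct) != len(judge_verdicts):
        raise ValueError("inputs must have the same length")
    if not student_correct:
        raise ValueError("need at least one item")
    hits = 0
    for is_correct, verdict in zip(student_correct, judge_verdicts):
        expected = "Correct" if is_correct else "Incorrect"
        # Any other string (wrong or unparseable verdict) counts as a miss.
        if verdict == expected:
            hits += 1
    return hits / len(student_correct)


# Made-up toy data: the judge is right on items 1 and 3 only.
truth = [True, False, True, True]
verdicts = ["Correct", "Correct", "Correct", "Incorrect"]
print(judge_accuracy(truth, verdicts))
```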
Open-ended (scoring: 1-5):
| Dataset | Type | Turns | Size | Oracle |
|---|---|---|---|---|
| SummEval | Summarization | - | 1,600 pairs | Human annotations |
| MT-Bench | Multi-turn chat | 2 | 80 questions | GPT-OSS-120B, Qwen3.5-397B (Together API) |
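For the open-ended datasets, agreement is measured by correlating a judge's 1-5 scores with the oracle's scores. The source does not say which coefficient is used, so the sketch below assumes Spearman's rank correlation, a common choice for ordinal scores, and implements it in plain Python. All function names and the toy scores are hypothetical.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    return pearson(average_ranks(x), average_ranks(y))


# Made-up toy scores on the 1-5 scale, for illustration only.
human_scores = [1, 2, 3, 4, 5]
judge_scores = [2, 2, 3, 5, 4]
rho = spearman(human_scores, judge_scores)
print(round(rho, 3))
```

A rank-based coefficient only asks whether the judge orders items the way the oracle does, so a judge that is consistently one point harsher than the oracle is not penalised for it.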
```
SLMJury/
├── slmjury/              # Python package
│   ├── configs/          # Centralized YAML model configurations
│   ├── data/             # Dataset loaders (HuggingFace → local JSON)
│   ├── parsers/          # Answer extraction, normalization, verdict/score parsing
│   ├── core/             # Pipeline: solver → judge → evaluator + scoring
│   └── strategies/       # Ensemble voting, multi-agent debate, personas
├── scripts/              # CLI entry-points (student, judge, oracle, scoring)
├── bash/                 # Bash wrappers for full experiment runs
├── tests/                # Unit & integration tests (pytest)
├── website/              # React leaderboard (Vite + Tailwind)
├── assets/               # SVG banner and logo
├── pyproject.toml        # Package config (pip install slmjury)
└── README.md
```
| Rank | Judge Model | Params | Max Tokens | Accuracy | IFR |
|---|---|---|---|---|---|
| 1 | Phi-4 | 14B | 10 | 89.55% | 99.98% |
| 2 | Qwen3-14B | 14B | 10 | 89.51% | 100.0% |
| 3 | Qwen3-8B | 8B | 10 | 88.96% | 100.0% |
| 4 | Phi-4-Reasoning-Plus | 14B | 8192 | 88.75% | 100.0% |
| 5 | Phi-4-Reasoning | 14B | 8192 | 88.24% | 100.0% |
Top judges ranked by overall accuracy across 8 closed-ended benchmarks (N=64,824 judgments per configuration). Full results for all 16 judges available on the leaderboard and in the paper.
Explore full results on the interactive leaderboard: anishh15.github.io/SLMJury
- Overthinking is domain-dependent: Quick 10-token verdicts match or beat extended reasoning on math judging, while reasoning wins on general tasks by up to 23%
- Domain generalization separates families: Math-to-general accuracy gaps range from under 10% to nearly 40% across model families
- Closed vs. open-ended judging differ: The best binary judge (Phi-4) drops to rank 9 on MT-Bench; reasoning-trained models invert this ordering
- Multi-agent debate degrades accuracy: Under the RCR protocol, debate hurts performance across all tested configurations, while top judges resist six adversarial personas with ≤0.55% variance
If you use SLMJury in your research, please cite:
```bibtex
@misc{laddha2026slmjurysmalllanguagemodels,
  title={SLMJury: Can Small Language Models Judge as Well as Large Ones?},
  author={Anish Laddha and Nitesh Pradhan and Gaurav Srivastava},
  year={2026},
  eprint={2606.07810},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2606.07810},
}
```

We welcome contributions! Here's how to get started:

```bash
git clone https://github.com/anishh15/SLMJury.git
cd SLMJury
pip install -e ".[dev]"
pytest tests/ -v
```

- Bug Reports: Found an issue? Report it on GitHub Issues.
- Feature Requests: Have ideas? Share them with us.
- Code Contributions: Submit PRs for improvements
- Documentation: Help improve our docs
- Model Submissions: Suggest new judge models for evaluation
- Email: anshladdha15@gmail.com, nitesh.pradhan@lnmiit.ac.in, gks@vt.edu
- Issues: GitHub Issues
Apache License 2.0 -- see LICENSE for details.
Made with ❤️ by Anish Laddha, Nitesh Pradhan, and Gaurav Srivastava