anishh15/SLMJury


Can Small Language Models Judge as Well as Large Language Models?

🧑‍⚖️ 16 SLM Judges • 📊 10 Datasets • 🗳️ 3 Advanced Strategies • 🎭 6 Persona Prompts

🏆 Leaderboard | 📖 Read Paper | 🚀 Get Started


📢 Latest News

| Date | Update |
|---|---|
| Jun 2026 | Paper submitted: arXiv:2606.07810 |
| May 2026 | Interactive leaderboard website launched: anishh15.github.io/SLMJury |
| May 2026 | v0.1.0 released on PyPI -- first public release |

💡 What is SLMJury?

SLMJury is a comprehensive framework that investigates whether Small Language Models (0.6B-14B parameters) can serve as reliable judges across both closed-ended (accuracy-based) and open-ended (correlation-based) evaluation paradigms. The project explores six evaluation modes: individual judging, persona-based evaluation, majority-vote ensembles, multi-agent debate, human agreement scoring (SummEval), and LLM agreement scoring (MT-Bench).

🌟 Key Highlights

🧠 Individual Judging

  • 16 SLM judges from 4 model families
  • Quick verdict vs. reasoned response
  • Accuracy & Instruction Following Rate
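
As a rough illustration of the two metrics above, the sketch below parses raw judge outputs into verdicts and computes accuracy and Instruction Following Rate (IFR). It assumes that IFR is the share of outputs from which a valid Correct/Incorrect verdict can be extracted, and that unparseable outputs count as wrong when computing accuracy. The function names and the regular expression are hypothetical and are not taken from the slmjury package.

```python
import re
from typing import List, Optional, Tuple

# Hypothetical pattern: match standalone "correct" / "incorrect", any casing.
_VERDICT_RE = re.compile(r"\b(incorrect|correct)\b", re.IGNORECASE)


def parse_verdict(text: str) -> Optional[str]:
    """Return "Correct" or "Incorrect" (last mention wins), or None if absent."""
    matches = _VERDICT_RE.findall(text)
    if not matches:
        return None
    return matches[-1].capitalize()


def accuracy_and_ifr(outputs: List[str], gold: List[str]) -> Tuple[float, float]:
    """Compute (accuracy, IFR) for raw judge outputs against gold verdicts.

    Assumption: an output with no parseable verdict is scored as wrong,
    so accuracy is taken over all items, not only the parseable ones.
    """
    if not gold or len(outputs) != len(gold):
        raise ValueError("outputs and gold must be non-empty and equally long")
    verdicts = [parse_verdict(o) for o in outputs]
    followed = sum(v is not None for v in verdicts)
    right = sum(v == g for v, g in zip(verdicts, gold))
    return right / len(gold), followed / len(gold)
```

Under these assumptions a judge can score a high IFR and a low accuracy at the same time: IFR only measures whether the output format was respected.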

🗳️ Majority Voting

  • C(5,3) ensemble combinations
  • Top-5 best individual judges
  • Boosted accuracy via consensus
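
Choosing 3 judges out of the top 5 gives C(5,3) = 10 distinct ensembles. The sketch below enumerates them with `itertools.combinations` and takes a per-item majority vote. It is an illustration only: `simple_majority` and `ensemble_verdicts` are hypothetical names, the judge names in the usage note are placeholders, and the input layout (one verdict list per judge) is an assumption rather than the package's on-disk format.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, List, Tuple


def simple_majority(verdicts: Tuple[str, ...]) -> str:
    """Most common verdict; three binary votes can never tie."""
    return Counter(verdicts).most_common(1)[0][0]


def ensemble_verdicts(
    per_judge: Dict[str, List[str]], size: int = 3
) -> Dict[Tuple[str, ...], List[str]]:
    """Map every `size`-judge combination to its per-item majority verdicts."""
    voted = {}
    for combo in combinations(sorted(per_judge), size):
        # zip(*...) walks the judges' verdict lists item by item.
        rows = zip(*(per_judge[name] for name in combo))
        voted[combo] = [simple_majority(row) for row in rows]
    return voted
```

Calling `ensemble_verdicts` on a dictionary with five judges returns ten entries, one per trio, each holding the voted verdict for every item.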

🤝 Multi-Agent Debate

  • RCR (Reflect-Critique-Refine) prompting
  • Cross-architecture & same-model variants
  • Up to 5 rounds with consensus fallback
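
The round structure described above might look roughly like the following. This is a minimal sketch, not the package's `run_debate`: each judge is a plain callable standing in for an RCR-prompted model call, and "consensus fallback" is interpreted here as a majority vote over the final verdicts when the judges still disagree after the last round. Both points are assumptions.

```python
from collections import Counter
from typing import Callable, List, Sequence

# Hypothetical judge signature: (solution, peers' previous verdicts) -> verdict.
Judge = Callable[[str, Sequence[str]], str]


def debate(solution: str, judges: List[Judge], max_rounds: int = 5) -> str:
    """Run up to `max_rounds` rounds, stopping early on unanimous agreement."""
    if not judges or max_rounds < 1:
        raise ValueError("need at least one judge and one round")

    # Round 1: every judge answers independently, with no peer verdicts.
    verdicts = [judge(solution, []) for judge in judges]

    for _ in range(max_rounds - 1):
        if len(set(verdicts)) == 1:
            break  # unanimous, no further rounds needed
        # Each judge sees the other judges' verdicts from the previous round.
        verdicts = [
            judge(solution, verdicts[:i] + verdicts[i + 1:])
            for i, judge in enumerate(judges)
        ]

    # Assumed fallback: majority of the last round's verdicts.
    return Counter(verdicts).most_common(1)[0][0]
```

In a real run each callable would build a prompt from the peers' responses and query a model; stub functions are enough to exercise the control flow.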

⚡ Installation

📦 From PyPI

pip install slmjury

Optional extras:

pip install slmjury[vllm]       # GPU inference with vLLM
pip install slmjury[together]   # Together API for oracle scoring
pip install slmjury[full]       # Everything (vllm + together + dev tools)

🔧 From Source (Development)

git clone https://github.com/anishh15/SLMJury.git
cd SLMJury
pip install -e .

🚀 Quick Start

💻 CLI Scripts

# Step 1: Run student model inference
python scripts/run_student.py --model qwen2.5-32b --datasets gsm8k math

# Step 2: Run judge evaluations
python scripts/run_judge.py --judge qwen3-4b --max-tokens 10 8192

# Step 3: Evaluate all judgements and generate summaries
python scripts/run_evaluation.py

🐍 Python API

from slmjury.core.solver import StudentSolver
from slmjury.core.judge import JudgeModel
from slmjury.core.evaluator import JudgeEvaluator

# Step 1: Solve problems with a student model
# `problems` is assumed to hold the GSM8K problems, loaded beforehand
solver = StudentSolver("qwen2.5-32b")
results = solver.solve_batch(problems, "gsm8k")
solver.save_results(results, "gsm8k")
solver.cleanup()

# Step 2: Judge the solutions
judge = JudgeModel("qwen3-4b")
judgements = judge.evaluate_batch(results, max_tokens=10)
judge.save_results(judgements, "qwen2.5-32b", "gsm8k", 10)
judge.cleanup()

# Step 3: Evaluate judge accuracy
evaluator = JudgeEvaluator("qwen3-4b", "qwen2.5-32b", "gsm8k", 10, judgements)
summary = evaluator.evaluate()

🧩 Advanced: Multi-Agent Strategies

# Majority voting on individual verdicts
from slmjury.strategies.ensemble import majority_vote
verdict = majority_vote(["Correct", "Incorrect", "Correct"])  # → "Correct"

# Generate all C(5,3)=10 ensemble combinations from pre-computed judgements
from slmjury.strategies.ensemble import generate_all_ensembles
generate_all_ensembles(
    judgements_dir="results/judgements",
    output_dir="results/majority_voting",
)

# Multi-agent debate (3 judges, RCR prompting)
from slmjury.strategies.debate import run_debate
run_debate(
    combo_models=["qwen3-4b", "phi4mi-3.8b", "qwen2.5-3b"],
    combo_temps=[0, 0, 0],
    student_results=results,
    dataset_name="gsm8k",
)

# Persona effects
from slmjury.strategies.persona import run_persona_evaluation, get_personas
personas = get_personas()
run_persona_evaluation(
    judge_model_key="qwen3-4b",
    student_results=results,
    student_model="qwen2.5-32b",
    dataset="gsm8k",
    max_tokens=10,
    persona_name="strict",
    persona_prompt=personas["strict"],
)
🔬 Open-Ended Scoring (SummEval / MT-Bench)

# Score SummEval with a single judge
python scripts/run_scoring_judge.py \
  --judge qwen3-4b --dataset summeval

# Score MT-Bench with a single judge
python scripts/run_scoring_judge.py \
  --judge qwen3-4b --dataset mtbench \
  --oracle-scores results/mtbench_oracle/

The same scoring from the Python API:

from slmjury.core.scoring_judge import ScoringJudge

judge = ScoringJudge("qwen3-4b", output_dir="results/scoring")

# Score SummEval (4-dimension scoring)
# NOTE: load_dataset is not imported in this snippet; import your dataset loader first
summeval_data = load_dataset("summeval")
results = judge.score_summeval(summeval_data, max_tokens=8192)
judge.save_results(results, "summeval")
judge.cleanup()

🤖 Supported Models

| Family | Models | Parameters | Thinking |
|---|---|---|---|
| Qwen 2.5 | 1.5B, 3B, 7B | 1.5B - 7B | - |
| Qwen 3 | 0.6B, 1.7B, 4B, 8B, 14B | 0.6B - 14B | ✅ |
| Llama 3.x | 3.2-1B, 3.2-3B, 3.1-8B | 1B - 8B | - |
| Phi-4 | 14B, Reasoning, R-Plus, Mini, Mini-Reasoning | 3.8B - 14B | ✅* |

*Phi-4 Reasoning/Plus/Mini-Reasoning always use thinking mode and skip quick verdict (t=10) evaluation.

📊 Datasets

Closed-ended (verdict: Correct/Incorrect):

| Dataset | Type | Domain | Size |
|---|---|---|---|
| GSM8K | Numeric | Math | 1,319 |
| GSM-Plus | Numeric | Math | 10,552 |
| MATH | LaTeX | Math | 5,000 |
| ARC-Easy | Multiple Choice | Science | 2,376 |
| ARC-Challenge | Multiple Choice | Science | 1,172 |
| HellaSwag | Multiple Choice | General | 10,042 |
| WinoGrande | Multiple Choice | General | 1,267 |
| TruthfulQA | Multiple Choice | General | 684 |

Open-ended (scoring: 1-5):

| Dataset | Type | Turns | Size | Oracle |
|---|---|---|---|---|
| SummEval | Summarization | - | 1,600 pairs | Human annotations |
| MT-Bench | Multi-turn chat | 2 | 80 questions | GPT-OSS-120B, Qwen3.5-397B (Together API) |
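
For the open-ended datasets, agreement with the oracle is measured by correlation rather than accuracy. The sketch below computes a Spearman rank correlation between judge scores and oracle scores in plain Python. Which correlation coefficient the project reports is not stated here, so the choice of Spearman is an assumption, and all function names are hypothetical.

```python
from math import sqrt
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)


def spearman(judge_scores: Sequence[float], oracle_scores: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the average ranks."""
    if len(judge_scores) != len(oracle_scores) or len(judge_scores) < 2:
        raise ValueError("need two equally long sequences of length >= 2")
    return pearson(average_ranks(judge_scores), average_ranks(oracle_scores))
```

Because scores on a 1-5 scale produce many ties, the average-rank handling matters: without it, tied scores would receive arbitrary and order-dependent ranks.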

🏗️ Project Structure

SLMJury/
├── slmjury/                  # Python package
│   ├── configs/              # Centralized YAML model configurations
│   ├── data/                 # Dataset loaders (HuggingFace → local JSON)
│   ├── parsers/              # Answer extraction, normalization, verdict/score parsing
│   ├── core/                 # Pipeline: solver → judge → evaluator + scoring
│   └── strategies/           # Ensemble voting, multi-agent debate, personas
├── scripts/                  # CLI entry-points (student, judge, oracle, scoring)
├── bash/                     # Bash wrappers for full experiment runs
├── tests/                    # Unit & integration tests (pytest)
├── website/                  # React leaderboard (Vite + Tailwind)
├── assets/                   # SVG banner and logo
├── pyproject.toml            # Package config (pip install slmjury)
└── README.md

📊 Results

🏆 Leaderboard (Top Judges - Closed-Ended)

| 🏅 Rank | 🤖 Judge Model | Params | Max Tokens | 📊 Accuracy | 🎯 IFR |
|---|---|---|---|---|---|
| 🥇 | Phi-4 | 14B | 10 | 89.55% | 99.98% |
| 🥈 | Qwen3-14B | 14B | 10 | 89.51% | 100.0% |
| 🥉 | Qwen3-8B | 8B | 10 | 88.96% | 100.0% |
| 4 | Phi-4-Reasoning-Plus | 14B | 8192 | 88.75% | 100.0% |
| 5 | Phi-4-Reasoning | 14B | 8192 | 88.24% | 100.0% |

Top judges ranked by overall accuracy across 8 closed-ended benchmarks (N=64,824 judgments per configuration). Full results for all 16 judges available on the leaderboard and in the paper.

Explore full results on the interactive leaderboard: anishh15.github.io/SLMJury

🔍 Key Findings

  • Overthinking is domain-dependent: Quick 10-token verdicts match or beat extended reasoning on math judging, while reasoning wins on general tasks by up to 23%
  • Domain generalization separates families: Math-to-general accuracy gaps range from under 10% to nearly 40% across model families
  • Closed-ended and open-ended judging differ: The best binary judge (Phi-4) drops to rank 9 on MT-Bench; reasoning-trained models invert this ordering
  • Multi-agent debate degrades accuracy: Under the RCR protocol, debate hurts performance across all tested configurations, while top judges resist six adversarial personas with ≤0.55% variance

📖 Citation

If you use SLMJury in your research, please cite:

@misc{laddha2026slmjurysmalllanguagemodels,
      title={SLMJury: Can Small Language Models Judge as Well as Large Ones?},
      author={Anish Laddha and Nitesh Pradhan and Gaurav Srivastava},
      year={2026},
      eprint={2606.07810},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2606.07810},
}

🤝 Contributing

We welcome contributions! Here's how to get started:

git clone https://github.com/anishh15/SLMJury.git
cd SLMJury
pip install -e ".[dev]"
pytest tests/ -v

🛠️ Ways to Contribute

  • 🐛 Bug Reports: Found an issue? Report it here
  • ✨ Feature Requests: Have ideas? Share them here
  • 🔧 Code Contributions: Submit PRs for improvements
  • 📚 Documentation: Help improve our docs
  • 🤖 Model Submissions: Suggest new judge models for evaluation

📞 Contact & Support


📄 License

Apache License 2.0 -- see LICENSE for details.


Made with ❤️ by Anish Laddha, Nitesh Pradhan, and Gaurav Srivastava
