Query clustering organizes queries into groups that reflect shared latent capability demands, enabling capability-aware LLM evaluation. Existing clustering methods, which primarily rely on semantic taxonomies or embeddings, often fail to capture such latent capability requirements due to a misalignment between surface-level semantics and actual model performance. We propose ECC, an algorithm that calibrates prior semantic embeddings using limited posterior model comparisons to bridge the gap between surface-level semantics and latent capability requirements. ECC characterizes each cluster through a capability profile parameterized by a Bradley-Terry model and uses trainable mixture weights to accommodate queries with mixed capability demands, jointly learning a flexible, capability-aware clustering structure that supports query-specific inference of LLM capabilities. Extensive quantitative and qualitative evaluations demonstrate that ECC significantly improves LLM capability ranking quality, outperforming human-labeled and embedding-based baselines by an average of 17.64 and 18.02 percentage points, respectively, and proves effective in downstream tasks such as query routing.
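As a rough illustration of the idea above, the sketch below mixes per-cluster Bradley-Terry probabilities using a query's mixture weights. It is not the code in this repository: the function names, the toy numbers, and the choice to mix win probabilities (rather than, say, strengths) are all assumptions made for the example.

```python
import math


def bt_win_prob(theta_i: float, theta_j: float) -> float:
    """Bradley-Terry probability that a model with strength theta_i beats theta_j."""
    return 1.0 / (1.0 + math.exp(theta_j - theta_i))


def mixed_win_prob(weights, profiles, i, j):
    """Mix per-cluster Bradley-Terry probabilities with a query's mixture weights.

    weights:  one non-negative weight per cluster, summing to 1
    profiles: profiles[k][m] is the (hypothetical) strength of model m in cluster k
    """
    return sum(w * bt_win_prob(p[i], p[j]) for w, p in zip(weights, profiles))


def rank_models(weights, profiles):
    """Rank models for one query by mixture-weighted strength, best first."""
    n_models = len(profiles[0])
    scores = [sum(w * p[m] for w, p in zip(weights, profiles)) for m in range(n_models)]
    return sorted(range(n_models), key=lambda m: scores[m], reverse=True)


# Two clusters, three models; every number here is made up for illustration.
profiles = [
    [1.0, 0.0, -1.0],   # cluster 0: model 0 is strongest
    [-1.0, 0.0, 1.0],   # cluster 1: model 2 is strongest
]
query_weights = [0.8, 0.2]  # this query draws mostly on cluster 0

p = mixed_win_prob(query_weights, profiles, 0, 2)
ranking = rank_models(query_weights, profiles)
print(f"P(model 0 beats model 2) = {p:.3f}")
print("ranking (best first):", ranking)
```

Under these assumptions, a simple router would send the query to `ranking[0]`, the model with the highest mixture-weighted strength.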
- [2026.05.31] 🚀 Our code is now released!
Install dependencies:
cd ECC
pip install -r requirements.txt

Download benchmark data and generate embeddings:
python data_gen.py

To reproduce results on the three benchmarks, run:
# ECC
python clustering_ecc.py --cc=1 --ec=1 --k=30 --pairs=7 --lam=3 --data=sprout
# Comp-only
python clustering_ecc.py --cc=0 --ec=1 --k=30 --pairs=7 --lam=3 --data=sprout
# Emb-only
python clustering_emb.py --ec=1 --k=30 --pairs=7 --lam=3 --data=sprout

Key arguments:
- k: number of clusters
- data: benchmark dataset in {sprout, routerbench, leaderboard}
- pairs: number of model comparison pairs used per query during clustering
- lam: trade-off between prior embedding signal and posterior comparison signal
- cc: whether to use embeddings during clustering (1 = yes, 0 = no)
- ec: whether to use embeddings during inference (1 = yes, 0 = no)
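To run all three variants on each benchmark without retyping the flags, a small helper along these lines may be convenient. It only builds and prints the command lines; the helper name is hypothetical, and reusing the `sprout` settings (`k=30`, `pairs=7`, `lam=3`) for the other two datasets is an assumption rather than a recommendation from the authors.

```python
import shlex

# Dataset names taken from the --data options listed above.
DATASETS = ["sprout", "routerbench", "leaderboard"]


def clustering_commands(data, k=30, pairs=7, lam=3):
    """Return the ECC, Comp-only, and Emb-only commands for one dataset."""
    common = [f"--k={k}", f"--pairs={pairs}", f"--lam={lam}", f"--data={data}"]
    return {
        "ECC": ["python", "clustering_ecc.py", "--cc=1", "--ec=1", *common],
        "Comp-only": ["python", "clustering_ecc.py", "--cc=0", "--ec=1", *common],
        "Emb-only": ["python", "clustering_emb.py", "--ec=1", *common],
    }


for data in DATASETS:
    for name, argv in clustering_commands(data).items():
        print(f"# {name} on {data}")
        print(shlex.join(argv))
```

Each `argv` list can be passed to `subprocess.run(argv, check=True)` from the repository root to actually launch the run.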
To compare with human-labeled clustering on datasets with human-labeled categories, run:
# ECC
python human_ecc.py --pairs=7 --lam=3 --data=mmlu
# Human category
python human.py --pairs=7 --lam=3 --data=mmlu
# Emb-only
python human_emb.py --pairs=7 --lam=3 --data=mmlu

- data in {mmlu, mmlupro, math}
To evaluate the downstream task of optimal query routing, run:
# ECC
python routing_ecc.py --cc=1 --ec=1 --k=30 --pairs=7 --lam=3 --data=leaderboard
# Comp-only
python routing_ecc.py --cc=0 --ec=1 --k=30 --pairs=7 --lam=3 --data=leaderboard
# Emb-only
python routing_emb.py --ec=1 --k=30 --pairs=7 --lam=3 --data=leaderboard

To evaluate sample-efficient ranking for a new model with a limited comparison budget, run:
# ECC
python routing_ecc.py --cc=1 --ec=1 --k=30 --pairs=3 --lam=3 --data=leaderboard --comp_num=100 --rd=4
# Comp-only
python routing_ecc.py --cc=0 --ec=1 --k=30 --pairs=3 --lam=3 --data=leaderboard --comp_num=100 --rd=4
# Emb-only
python routing_emb.py --ec=1 --k=30 --pairs=3 --lam=3 --data=leaderboard --comp_num=100 --rd=4

Key arguments:
- comp_num: comparison budget for the new model
- rd: index of the model to hold out as the new model
  - Open LLM Leaderboard: rd in [0, 16]
  - RouterBench: rd in [0, 11]
  - SPROUT: rd in [0, 13]
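When scripting the held-out-model experiment, it can help to validate `rd` before launching a run. The sketch below encodes the ranges listed above; it assumes the bounds are inclusive and that Open LLM Leaderboard, RouterBench, and SPROUT correspond to the `--data` values `leaderboard`, `routerbench`, and `sprout`. The helper names are hypothetical.

```python
# Largest valid --rd per dataset, read as inclusive bounds (an assumption).
# The mapping from benchmark name to --data value is also an assumption.
MAX_RD = {"leaderboard": 16, "routerbench": 11, "sprout": 13}


def check_rd(data: str, rd: int) -> int:
    """Return rd unchanged if it is a valid held-out model index for `data`."""
    if data not in MAX_RD:
        raise ValueError(f"unknown dataset {data!r}; expected one of {sorted(MAX_RD)}")
    if not 0 <= rd <= MAX_RD[data]:
        raise ValueError(f"rd={rd} is outside [0, {MAX_RD[data]}] for {data}")
    return rd


def holdout_indices(data: str) -> range:
    """All held-out model indices for a leave-one-model-out sweep."""
    return range(0, MAX_RD[data] + 1)


print(check_rd("leaderboard", 4))
print(list(holdout_indices("routerbench")))
```

Iterating over `holdout_indices(data)` and passing each value as `--rd` gives one run per candidate "new" model.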
If you use this codebase, please consider citing our paper:
@misc{wu2026capturingllmcapabilitiesevidencecalibrated,
title={Capturing LLM Capabilities via Evidence-Calibrated Query Clustering},
author={Fangzhou Wu and Sandeep Silwal and Qiuyi Zhang},
year={2026},
eprint={2605.17110},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2605.17110},
}