Interesting paper: Decentralized Arena: Towards Democratic and Scalable Automatic Evaluation of Language Models #269
simonaszilinskas
started this conversation in
Random
My immediate thought was: what happens when many models come from the same company? Does the number of models introduce a bias? I suspect you could address this by annotating each model's affiliation and then counting all models from one company as a single judge when they rate other models, so that the company with the most models does not get the most influence.
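A minimal sketch of that affiliation idea, assuming each judge model casts one pairwise vote and carries a company label. The function name, the vote format, and the labels are all hypothetical and not taken from the paper.

```python
from collections import defaultdict


def aggregate_votes(votes, affiliation):
    """Combine pairwise votes so that each company counts as one judge.

    votes: dict mapping judge model name -> "A" or "B" (the preferred answer).
    affiliation: dict mapping judge model name -> company name (assumed labels).
    Returns the share of company-level weight that prefers "A".
    """
    if not votes:
        raise ValueError("at least one vote is required")

    # Count how many judge models each company contributes.
    judges_per_company = defaultdict(int)
    for judge in votes:
        judges_per_company[affiliation[judge]] += 1

    # Each judge weighs 1 / (number of judges from its company),
    # so every company's total weight sums to 1.
    weight_for_a = 0.0
    total_weight = 0.0
    for judge, choice in votes.items():
        weight = 1.0 / judges_per_company[affiliation[judge]]
        total_weight += weight
        if choice == "A":
            weight_for_a += weight
    return weight_for_a / total_weight
```

With three judges from one company voting "A" and a single judge from another company voting "B", an unweighted count gives "A" a 75% share, while this weighting gives an even split.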
Another note here after looking at the paper:
I wanted to share a recent paper that feels very close to a lot of questions we have been discussing around compar:IA.
“Decentralized Arena: Towards Democratic and Scalable Automatic Evaluation of Language Models” proposes a fully automated evaluation framework where models evaluate each other, instead of relying on a single judge model or large-scale human annotation.
Why this is interesting for us:
First, it directly tackles the main bottleneck of LLM arenas: scalability. Human voting does not scale well when you want to evaluate many models, many dimensions, or continuously add new models. The paper proposes a coarse-to-fine ranking algorithm that reduces the number of comparisons while keeping rankings stable.
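To give a feel for why fewer comparisons can be enough, here is a rough sketch of a coarse placement step only, assuming a new model is inserted into an existing ranking by binary search. The `beats` callback stands in for whatever the judges decide and is hypothetical; this is not the paper's exact procedure, and any fine-grained adjustment afterwards is left out.

```python
def insert_model(ranking, new_model, beats):
    """Insert new_model into ranking (best first) using binary search.

    beats(a, b) is a hypothetical callback returning True when the judges
    prefer model a over model b. Returns the new ranking and the number of
    pairwise comparisons that were needed.
    """
    low, high = 0, len(ranking)
    comparisons = 0
    while low < high:
        mid = (low + high) // 2
        comparisons += 1
        if beats(new_model, ranking[mid]):
            high = mid  # new model belongs above ranking[mid]
        else:
            low = mid + 1  # new model belongs below ranking[mid]
    return ranking[:low] + [new_model] + ranking[low:], comparisons
```

The number of comparisons grows roughly with the logarithm of the ranking length, instead of one comparison against every model already ranked.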
Second, it addresses judge bias in a strong way. Instead of “LLM-as-a-judge” with one authority model, all models act as judges. The idea is that collective intelligence across diverse models reduces self-preference, style bias, and family bias. This is very aligned with the idea that no single model should define what “good” looks like.
Third, they automate the creation of new evaluation dimensions by selecting representative questions from open datasets based on ranking consistency. That is a concrete answer to a hard problem: how to scale evaluation beyond a fixed benchmark without manual curation every time.
The authors report very high correlation with Chatbot Arena rankings, while reducing cost and enabling fine-grained dimensions.
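For anyone who wants to run this kind of check on their own rankings, here is a small sketch, assuming the agreement is measured with a rank correlation such as Spearman's and that there are no ties. Which metric the authors actually use is not stated above, so treat this as an illustration only.

```python
def spearman(ranking_a, ranking_b):
    """Spearman rank correlation between two rankings of the same models.

    Both arguments are lists of model names ordered best first, without ties.
    """
    if set(ranking_a) != set(ranking_b):
        raise ValueError("both rankings must contain the same models")
    n = len(ranking_a)
    if n < 2:
        raise ValueError("need at least two models")

    # Position of each model in the second ranking.
    position_b = {model: i for i, model in enumerate(ranking_b)}

    # Sum of squared rank differences, then the standard no-ties formula.
    squared_diffs = sum(
        (i - position_b[model]) ** 2 for i, model in enumerate(ranking_a)
    )
    return 1 - 6 * squared_diffs / (n * (n * n - 1))
```

Identical rankings give 1, fully reversed rankings give -1, and a single swap of two neighbours lowers the value only slightly.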
I think this raises good questions for compar:IA:
Curious to hear what others think, not about how this could be implemented in compar:IA (at least for now), but at a more general level.