Interesting paper : Decentralized Arena: Towards Democratic and Scalable Automatic Evaluation of Language Models #269

simonaszilinskas · 2026-01-19T14:40:55Z

simonaszilinskas
Jan 19, 2026
Maintainer

I wanted to share a recent paper that feels very close to a lot of questions we have been discussing around compar:IA.

“Decentralized Arena: Towards Democratic and Scalable Automatic Evaluation of Language Models” proposes a fully automated evaluation framework where models evaluate each other, instead of relying on a single judge model or large-scale human annotation.

Why this is interesting for us:

First, it directly tackles the main bottleneck of LLM arenas: scalability. Human voting does not scale well when you want to evaluate many models, many dimensions, or continuously add new models. The paper proposes a coarse-to-fine ranking algorithm that reduces the number of comparisons while keeping rankings stable.

Second, it addresses judge bias in a strong way. Instead of “LLM-as-a-judge” with one authority model, all models act as judges. The idea is that collective intelligence across diverse models reduces self-preference, style bias, and family bias. This is very aligned with the idea that no single model should define what “good” looks like.

Third, they automate the creation of new evaluation dimensions by selecting representative questions from open datasets based on ranking consistency. That is a concrete answer to a hard problem: how to scale evaluation beyond a fixed benchmark without manual curation every time.

The authors report very high correlation with Chatbot Arena rankings, while reducing cost and enabling fine-grained dimensions.
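To make "correlation with Chatbot Arena rankings" a bit more concrete, here is a minimal sketch of a Spearman rank correlation between an automatic ranking and a human-vote ranking. The model names and ranks are made up for illustration, and the paper may well use a different correlation measure.

```python
def spearman(rank_a: dict[str, int], rank_b: dict[str, int]) -> float:
    """Spearman rank correlation between two rankings without ties."""
    models = sorted(rank_a)
    if set(models) != set(rank_b):
        raise ValueError("both rankings must cover the same models")
    n = len(models)
    if n < 2:
        raise ValueError("need at least two models")
    # Sum of squared rank differences across models.
    d2 = sum((rank_a[m] - rank_b[m]) ** 2 for m in models)
    return 1 - 6 * d2 / (n * (n * n - 1))


# Hypothetical ranks (1 = best), not taken from the paper.
automatic = {"model_a": 1, "model_b": 2, "model_c": 3, "model_d": 4}
human = {"model_a": 1, "model_b": 3, "model_c": 2, "model_d": 4}
print(spearman(automatic, human))
```

A value close to 1 means both rankings order the models almost identically; it says nothing about whether the individual judgments agree.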

I think this raises good questions for compar:IA:

Should we explore automated judging for some parts of the platform?
Where do humans still add the most value, and where do they not? My hypothesis is that the value is in the prompt.

Curious to hear what others think, not about how this could be implemented in compar:IA (at least for now), but at a more general level.

KennethEnevoldsen · 2026-01-19T15:24:42Z

KennethEnevoldsen
Jan 19, 2026

My immediate thought was: what happens if many models come from the same company (does the number of models lead to a bias)?

Though I suspect you could address this by simply annotating each model's affiliation and then having the models from one company count as a single rater, so that the company with the most models does not get the most influence.
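A minimal sketch of that kind of affiliation-based correction, assuming each judge model casts a vote for answer "A" or "B" and that every company gets a total weight of 1 split evenly across its models. The function name, model names, and companies are hypothetical; this is not how the paper aggregates votes.

```python
from collections import defaultdict


def company_weighted_preference(votes: dict[str, str],
                                affiliation: dict[str, str]) -> dict[str, float]:
    """Aggregate judge votes so that each company counts as one rater.

    votes: judge model -> "A" or "B"
    affiliation: judge model -> company name
    """
    per_company = defaultdict(list)
    for judge, choice in votes.items():
        per_company[affiliation[judge]].append(choice)

    score = {"A": 0.0, "B": 0.0}
    for choices in per_company.values():
        # Each company has total weight 1, shared by its models.
        for choice in choices:
            score[choice] += 1 / len(choices)
    return score


# Hypothetical example: company X has three judge models, Y and Z one each.
votes = {"x1": "A", "x2": "A", "x3": "A", "y1": "B", "z1": "B"}
affiliation = {"x1": "X", "x2": "X", "x3": "X", "y1": "Y", "z1": "Z"}
print(company_weighted_preference(votes, affiliation))
```

In this made-up case a raw majority vote would prefer "A" (3 votes to 2), while the company-weighted score prefers "B" (2 to 1), which is the kind of shift one would want to measure.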

2 replies

simonaszilinskas Jan 19, 2026
Maintainer Author

Good point! I suppose, as you say, there could be some kind of mechanism to compensate for it.

KennethEnevoldsen Jan 19, 2026

Yeah, and I am also not sure how big the effect is, but it is probably something to check for.

I also suspect there could be a language bias: e.g. if a large enough group of models makes the same types of mistakes (e.g. confusing Danish with Norwegian), I am unsure whether these would be caught. However, we could use compar:IA to test whether this method also works for non-English languages.

KennethEnevoldsen · 2026-01-20T10:36:04Z

KennethEnevoldsen
Jan 20, 2026

Another note here after looking at the paper:

Def. agree that the prompt seems to be a good part of the problem
However, there is also a lack of continuous validation of this approach. One way to ensure that it remains aligned with what we want is to select certain questions (those that give us the most information) and see if the human evaluators agree.
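One possible reading of "the questions that give us the most information" is the questions where the model judges disagree most. The sketch below ranks questions by the entropy of the judges' votes and returns the top k for human review. The function names and the example votes are hypothetical, and treating disagreement as informativeness is an assumption, not something taken from the paper.

```python
import math


def vote_entropy(votes: list[str]) -> float:
    """Shannon entropy (in bits) of the judges' votes on one question."""
    n = len(votes)
    entropy = 0.0
    for label in set(votes):
        p = votes.count(label) / n
        entropy -= p * math.log2(p)
    return entropy


def pick_for_human_review(judge_votes: dict[str, list[str]], k: int) -> list[str]:
    """Return the k questions with the highest judge disagreement."""
    ranked = sorted(judge_votes,
                    key=lambda q: vote_entropy(judge_votes[q]),
                    reverse=True)
    return ranked[:k]


# Hypothetical votes from four judge models on three questions.
judge_votes = {
    "q1": ["A", "A", "A", "A"],  # unanimous, entropy 0
    "q2": ["A", "B", "A", "B"],  # evenly split, entropy 1 bit
    "q3": ["A", "A", "A", "B"],  # mild disagreement
}
print(pick_for_human_review(judge_votes, k=2))
```

Unanimous questions would still need occasional spot checks, since a shared blind spot (such as the language confusion mentioned earlier in the thread) produces low entropy and would never be selected by this criterion alone.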

0 replies
