---
annotations_creators:
language:
license: apache-2.0
multilinguality:
pretty_name: ALL Bench Leaderboard 2026
size_categories:
source_datasets:
tags:
  - benchmark
  - leaderboard
  - llm
  - vlm
  - ai-evaluation
  - gpt-5
  - claude
  - gemini
  - final-bench
  - metacognition
  - multimodal
  - ai-agent
  - image-generation
  - video-generation
  - music-generation
task_categories:
  - text-generation
  - visual-question-answering
  - text-to-image
  - text-to-video
  - text-to-audio
dataset_info:
  features:
    - name: confidence
      dtype: dict
---
# 🏆 ALL Bench Leaderboard 2026

> The only AI benchmark dataset covering LLM · VLM · Agent · Image · Video · Music in a single unified file.
ALL Bench Leaderboard aggregates and cross-verifies benchmark scores for 91 AI models across 6 modalities. Every numerical score is tagged with a confidence level (cross-verified, single-source, or self-reported) and its original source. The dataset is designed for researchers, developers, and decision-makers who need a trustworthy, unified view of the AI model landscape.
## Coverage

| Category | Models | Benchmarks | Description |
|---|---|---|---|
| LLM | 42 | 31 fields | MMLU-Pro, GPQA, AIME, HLE, ARC-AGI-2, Metacog, SWE-Pro, IFEval, LCB, etc. |
| VLM Flagship | 11 | 10 fields | MMMU, MMMU-Pro, MathVista, AI2D, OCRBench, MMStar, HallusionBench, etc. |
| VLM Lightweight | 5 | 34 fields | Detailed Qwen-series edge model comparison across 3 sub-categories |
| Agent | 10 | 8 fields | OSWorld, τ²-bench, BrowseComp, Terminal-Bench 2.0, GDPval-AA, SWE-Pro |
| Image Gen | 10 | 7 fields | Photo realism, text rendering, instruction following, style, aesthetics |
| Video Gen | 10 | 7 fields | Quality, motion, consistency, text rendering, duration, resolution |
| Music Gen | 8 | 6 fields | Quality, vocals, instrumental, lyrics, duration |
## Live demo

👉 https://huggingface.co/spaces/FINAL-Bench/all-bench-leaderboard

Interactive features: composite ranking, dark mode, advanced search (e.g. open models with GPQA > 90 and input price < $1/M tokens), Model Finder, Head-to-Head comparison, Trust Map heatmap, Bar Race animation, and a downloadable Intelligence Report (PDF/DOCX).
## File structure

```text
all_bench_leaderboard_v2.1.json
├── metadata          # version, formula, links, model counts
├── llm[42]           # 42 LLMs × 31 fields
├── vlm
│   ├── flagship[11]  # 11 flagship VLMs × 10 benchmarks
│   └── lightweight[5]  # 5 edge models × 34 benchmarks (3 sub-tables)
├── agent[10]         # 10 agent models × 8 benchmarks
├── image[10]         # 10 image gen models × S/A/B/C ratings
├── video[10]         # 10 video gen models × S/A/B/C ratings
├── music[8]          # 8 music gen models × S/A/B/C ratings
└── confidence{42}    # per-model, per-benchmark source & trust level
```
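Once loaded, the tree above maps directly onto Python dicts and lists. A minimal sketch of walking it for per-category model counts — the miniature `data` below is hypothetical sample data mirroring the layout, not the real file:

```python
# Hypothetical miniature of all_bench_leaderboard_v2.1.json,
# mirroring the top-level layout shown in the tree above.
data = {
    "metadata": {"version": "2.1"},
    "llm": [{"name": "Model A"}, {"name": "Model B"}],
    "vlm": {"flagship": [{"name": "V1"}], "lightweight": [{"name": "V2"}]},
    "agent": [{"name": "Agent X"}],
    "image": [],
    "video": [],
    "music": [],
    "confidence": {"Model A": {"gpqa": {"level": "cross-verified"}}},
}

def model_counts(d: dict) -> dict:
    """Count models per category; 'vlm' nests sub-lists, so sum those."""
    counts = {}
    for key, value in d.items():
        if key in ("metadata", "confidence"):
            continue  # not model categories
        if isinstance(value, list):
            counts[key] = len(value)
        else:  # nested category like "vlm"
            counts[key] = sum(len(v) for v in value.values())
    return counts

print(model_counts(data))
# → {'llm': 2, 'vlm': 2, 'agent': 1, 'image': 0, 'video': 0, 'music': 0}
```

The same traversal works on the real file because only the list lengths differ, not the key layout.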
## LLM record schema (main fields)

| Field | Type | Description |
|---|---|---|
| name | string | Model name |
| provider | string | Organization |
| type | string | `open` or `closed` |
| group | string | `flagship`, `open`, `korean`, etc. |
| released | string | Release date (YYYY.MM) |
| mmluPro | float \| null | MMLU-Pro score (%) |
| gpqa | float \| null | GPQA Diamond (%) |
| aime | float \| null | AIME 2025 (%) |
| hle | float \| null | Humanity's Last Exam (%) |
| arcAgi2 | float \| null | ARC-AGI-2 (%) |
| metacog | float \| null | FINAL Bench metacognitive score |
| swePro | float \| null | SWE-bench Pro (%) |
| bfcl | float \| null | Berkeley Function Calling (%) |
| ifeval | float \| null | IFEval instruction following (%) |
| lcb | float \| null | LiveCodeBench (%) |
| sweV | float \| null | SWE-bench Verified (%), deprecated |
| mmmlu | float \| null | Multilingual MMLU (%) |
| termBench | float \| null | Terminal-Bench 2.0 (%) |
| sciCode | float \| null | SciCode (%) |
| priceIn / priceOut | float \| null | USD per 1M input/output tokens |
| elo | int \| null | Arena Elo rating |
| license | string | `Prop`, `Apache2`, `MIT`, `Open`, etc. |
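Given this schema, a filter like the leaderboard's advanced search (open models with GPQA above 90 and input price under $1/M tokens) is a few lines. A sketch over hypothetical records; `search` is an illustrative helper, not part of the dataset:

```python
# Hypothetical LLM records following the schema above.
llms = [
    {"name": "Open-90",   "type": "open",   "gpqa": 91.2, "priceIn": 0.5},
    {"name": "Closed-95", "type": "closed", "gpqa": 95.0, "priceIn": 2.0},
    {"name": "Open-80",   "type": "open",   "gpqa": 80.1, "priceIn": 0.2},
]

def search(models, min_gpqa=90.0, max_price_in=1.0):
    """Open models above a GPQA threshold and below an input-price cap.

    Scores may be null (None) in the dataset, so each field is guarded:
    a missing GPQA fails the threshold, a missing price fails the cap.
    """
    return [
        m["name"] for m in models
        if m["type"] == "open"
        and (m.get("gpqa") or 0) > min_gpqa
        and (m.get("priceIn") or float("inf")) < max_price_in
    ]

print(search(llms))  # → ['Open-90']
```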
## Composite score

Score = Avg(confirmed benchmarks) × √(N / 10)

where N is the number of confirmed scores among the 10 core benchmarks of the 5-Axis Intelligence Framework: Knowledge · Expert Reasoning · Abstract Reasoning · Metacognition · Execution.
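For example, a model with 8 of the 10 core benchmarks confirmed at an average of 85% scores 85 × √(8/10) ≈ 76.0; the √(N/10) factor penalizes sparse benchmark coverage. A sketch of the formula (`composite_score` is an illustrative helper, not a dataset API):

```python
import math

def composite_score(scores):
    """Score = Avg(confirmed benchmarks) × sqrt(N / 10),
    where N is the number of non-null scores among the 10 core benchmarks."""
    confirmed = [s for s in scores if s is not None]
    if not confirmed:
        return 0.0
    avg = sum(confirmed) / len(confirmed)
    return avg * math.sqrt(len(confirmed) / 10)

# 8 of 10 core benchmarks reported, averaging 85.0
print(round(composite_score([85.0] * 8 + [None, None]), 1))  # → 76.0
```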
## Confidence levels

Each benchmark score in the `confidence` object is tagged with one of three levels:

| Level | Badge | Meaning |
|---|---|---|
| cross-verified | ✓✓ | Confirmed by 2+ independent sources |
| single-source | ✓ | One official or third-party source |
| self-reported | ~ | Provider's own claim, unverified |
Example:

```json
"Claude Opus 4.6": {
  "gpqa":    { "level": "cross-verified", "source": "Anthropic + Vellum + DataCamp" },
  "arcAgi2": { "level": "cross-verified", "source": "Vellum + llm-stats + NxCode + DataCamp" },
  "metacog": { "level": "single-source", "source": "FINAL Bench dataset" }
}
```
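Those tags make it straightforward to keep only cross-verified numbers when building a strict ranking. A sketch against a hypothetical slice of the `confidence` object (`Model X` and the helper function are illustrative, not part of the dataset):

```python
# Hypothetical slice of the confidence object described above.
confidence = {
    "Claude Opus 4.6": {
        "gpqa": {"level": "cross-verified", "source": "Anthropic + Vellum"},
        "metacog": {"level": "single-source", "source": "FINAL Bench dataset"},
    },
    "Model X": {
        "gpqa": {"level": "self-reported", "source": "provider blog"},
    },
}

def cross_verified_benchmarks(conf):
    """Map each model to the benchmarks confirmed by 2+ independent sources."""
    return {
        model: [b for b, tag in benches.items() if tag["level"] == "cross-verified"]
        for model, benches in conf.items()
    }

print(cross_verified_benchmarks(confidence))
# → {'Claude Opus 4.6': ['gpqa'], 'Model X': []}
```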
## Quick start

```python
import json
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="FINAL-Bench/ALL-Bench-Leaderboard",
    filename="all_bench_leaderboard_v2.1.json",
    repo_type="dataset",
)
with open(path) as f:
    data = json.load(f)

# Top 5 LLMs by GPQA (null scores sort last)
ranked = sorted(data["llm"], key=lambda x: x["gpqa"] or 0, reverse=True)
for m in ranked[:5]:
    print(f"{m['name']:25s} GPQA={m['gpqa']}")

# Check confidence for a score
print(data["confidence"]["Gemini 3.1 Pro"]["gpqa"])
# → {"level": "single-source", "source": "Google DeepMind model card"}
```
## Related

**FINAL Bench — Metacognitive Benchmark**: FINAL Bench measures AI self-correction ability. Error Recovery (ER) explains 94.8% of metacognitive performance variance across the 9 frontier models evaluated.
## Citation

```bibtex
@misc{allbench2026,
  title  = {ALL Bench Leaderboard 2026: Unified Multi-Modal AI Evaluation},
  author = {{ALL Bench Team}},
  year   = {2026},
  url    = {https://huggingface.co/spaces/FINAL-Bench/all-bench-leaderboard}
}
```