---
annotations_creators:
language:
license: apache-2.0
multilinguality:
pretty_name: ALL Bench Leaderboard 2026
size_categories:
source_datasets:
tags:
  - benchmark
  - leaderboard
  - llm
  - vlm
  - ai-evaluation
  - gpt-5
  - claude
  - gemini
  - final-bench
  - metacognition
  - multimodal
  - ai-agent
  - image-generation
  - video-generation
  - music-generation
task_categories:
  - text-generation
  - visual-question-answering
  - text-to-image
  - text-to-video
  - text-to-audio
dataset_info:
  features:
    - name: confidence
      dtype: dict
---
# 🏆 ALL Bench Leaderboard 2026

> The only AI benchmark dataset covering LLM · VLM · Agent · Image · Video · Music in a single unified file.
ALL Bench Leaderboard aggregates and cross-verifies benchmark scores for 91 AI models across 6 modalities. Every numerical score is tagged with a confidence level (cross-verified, single-source, or self-reported) and its original source. The dataset is designed for researchers, developers, and decision-makers who need a trustworthy, unified view of the AI model landscape.
## Coverage

| Category | Models | Benchmarks | Description |
|---|---|---|---|
| LLM | 42 | 31 fields | MMLU-Pro, GPQA, AIME, HLE, ARC-AGI-2, Metacog, SWE-Pro, IFEval, LCB, etc. |
| VLM Flagship | 11 | 10 fields | MMMU, MMMU-Pro, MathVista, AI2D, OCRBench, MMStar, HallusionBench, etc. |
| VLM Lightweight | 5 | 34 fields | Detailed Qwen-series edge model comparison across 3 sub-categories |
| Agent | 10 | 8 fields | OSWorld, τ²-bench, BrowseComp, Terminal-Bench 2.0, GDPval-AA, SWE-Pro |
| Image Gen | 10 | 7 fields | Photo realism, text rendering, instruction following, style, aesthetics |
| Video Gen | 10 | 7 fields | Quality, motion, consistency, text rendering, duration, resolution |
| Music Gen | 8 | 6 fields | Quality, vocals, instrumental, lyrics, duration |
## Live demo

👉 https://huggingface.co/spaces/FINAL-Bench/all-bench-leaderboard

Interactive features: composite ranking, dark mode, advanced search (e.g. open models with GPQA > 90 and input price < $1/M tokens), Model Finder, Head-to-Head comparison, Trust Map heatmap, Bar Race animation, and a downloadable Intelligence Report (PDF/DOCX).
## File structure

```text
all_bench_leaderboard_v2.1.json
├── metadata          # version, formula, links, model counts
├── llm[42]           # 42 LLMs × 31 fields
├── vlm
│   ├── flagship[11]  # 11 flagship VLMs × 10 benchmarks
│   └── lightweight[5]  # 5 edge models × 34 benchmarks (3 sub-tables)
├── agent[10]         # 10 agent models × 8 benchmarks
├── image[10]         # 10 image gen models × S/A/B/C ratings
├── video[10]         # 10 video gen models × S/A/B/C ratings
├── music[8]          # 8 music gen models × S/A/B/C ratings
└── confidence{42}    # per-model, per-benchmark source & trust level
```
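Once loaded, the tree above maps directly onto Python dicts and lists. A minimal sketch of walking it for per-category model counts — the miniature `data` below is hypothetical sample data mirroring the layout, not the real file:

```python
# Hypothetical miniature of all_bench_leaderboard_v2.1.json,
# mirroring the top-level layout shown in the tree above.
data = {
    "metadata": {"version": "2.1"},
    "llm": [{"name": "Model A"}, {"name": "Model B"}],
    "vlm": {"flagship": [{"name": "V1"}], "lightweight": [{"name": "V2"}]},
    "agent": [{"name": "Agent X"}],
    "image": [],
    "video": [],
    "music": [],
    "confidence": {"Model A": {"gpqa": {"level": "cross-verified"}}},
}

def model_counts(d: dict) -> dict:
    """Count models per category; 'vlm' nests sub-lists, so sum those."""
    counts = {}
    for key, value in d.items():
        if key in ("metadata", "confidence"):
            continue  # not model categories
        if isinstance(value, list):
            counts[key] = len(value)
        else:  # nested category like "vlm"
            counts[key] = sum(len(v) for v in value.values())
    return counts

print(model_counts(data))
# → {'llm': 2, 'vlm': 2, 'agent': 1, 'image': 0, 'video': 0, 'music': 0}
```

The same traversal works on the real file because only the list lengths differ, not the key layout.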
## LLM record schema (main fields)

| Field | Type | Description |
|---|---|---|
| name | string | Model name |
| provider | string | Organization |
| type | string | `open` or `closed` |
| group | string | `flagship`, `open`, `korean`, etc. |
| released | string | Release date (YYYY.MM) |
| mmluPro | float \| null | MMLU-Pro score (%) |
| gpqa | float \| null | GPQA Diamond (%) |
| aime | float \| null | AIME 2025 (%) |
| hle | float \| null | Humanity's Last Exam (%) |
| arcAgi2 | float \| null | ARC-AGI-2 (%) |
| metacog | float \| null | FINAL Bench metacognitive score |
| swePro | float \| null | SWE-bench Pro (%) |
| bfcl | float \| null | Berkeley Function Calling (%) |
| ifeval | float \| null | IFEval instruction following (%) |
| lcb | float \| null | LiveCodeBench (%) |
| sweV | float \| null | SWE-bench Verified (%), deprecated |
| mmmlu | float \| null | Multilingual MMLU (%) |
| termBench | float \| null | Terminal-Bench 2.0 (%) |
| sciCode | float \| null | SciCode (%) |
| priceIn / priceOut | float \| null | USD per 1M input/output tokens |
| elo | int \| null | Arena Elo rating |
| license | string | `Prop`, `Apache2`, `MIT`, `Open`, etc. |
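Given this schema, a filter like the leaderboard's advanced search (open models with GPQA above 90 and input price under $1/M tokens) is a few lines. A sketch over hypothetical records; `search` is an illustrative helper, not part of the dataset:

```python
# Hypothetical LLM records following the schema above.
llms = [
    {"name": "Open-90",   "type": "open",   "gpqa": 91.2, "priceIn": 0.5},
    {"name": "Closed-95", "type": "closed", "gpqa": 95.0, "priceIn": 2.0},
    {"name": "Open-80",   "type": "open",   "gpqa": 80.1, "priceIn": 0.2},
]

def search(models, min_gpqa=90.0, max_price_in=1.0):
    """Open models above a GPQA threshold and below an input-price cap.

    Scores may be null (None) in the dataset, so each field is guarded:
    a missing GPQA fails the threshold, a missing price fails the cap.
    """
    return [
        m["name"] for m in models
        if m["type"] == "open"
        and (m.get("gpqa") or 0) > min_gpqa
        and (m.get("priceIn") or float("inf")) < max_price_in
    ]

print(search(llms))  # → ['Open-90']
```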
## Composite score

Score = Avg(confirmed benchmarks) × √(N / 10)

where N is the number of confirmed scores among the 10 core benchmarks of the 5-Axis Intelligence Framework: Knowledge · Expert Reasoning · Abstract Reasoning · Metacognition · Execution.
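For example, a model with 8 of the 10 core benchmarks confirmed at an average of 85% scores 85 × √(8/10) ≈ 76.0; the √(N/10) factor penalizes sparse benchmark coverage. A sketch of the formula (`composite_score` is an illustrative helper, not a dataset API):

```python
import math

def composite_score(scores):
    """Score = Avg(confirmed benchmarks) × sqrt(N / 10),
    where N is the number of non-null scores among the 10 core benchmarks."""
    confirmed = [s for s in scores if s is not None]
    if not confirmed:
        return 0.0
    avg = sum(confirmed) / len(confirmed)
    return avg * math.sqrt(len(confirmed) / 10)

# 8 of 10 core benchmarks reported, averaging 85.0
print(round(composite_score([85.0] * 8 + [None, None]), 1))  # → 76.0
```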
## Confidence levels

Each benchmark score in the `confidence` object is tagged with one of three levels:

| Level | Badge | Meaning |
|---|---|---|
| cross-verified | ✓✓ | Confirmed by 2+ independent sources |
| single-source | ✓ | One official or third-party source |
| self-reported | ~ | Provider's own claim, unverified |
Example:

```json
"Claude Opus 4.6": {
  "gpqa":    { "level": "cross-verified", "source": "Anthropic + Vellum + DataCamp" },
  "arcAgi2": { "level": "cross-verified", "source": "Vellum + llm-stats + NxCode + DataCamp" },
  "metacog": { "level": "single-source", "source": "FINAL Bench dataset" }
}
```
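Those tags make it straightforward to keep only cross-verified numbers when building a strict ranking. A sketch against a hypothetical slice of the `confidence` object (`Model X` and the helper function are illustrative, not part of the dataset):

```python
# Hypothetical slice of the confidence object described above.
confidence = {
    "Claude Opus 4.6": {
        "gpqa": {"level": "cross-verified", "source": "Anthropic + Vellum"},
        "metacog": {"level": "single-source", "source": "FINAL Bench dataset"},
    },
    "Model X": {
        "gpqa": {"level": "self-reported", "source": "provider blog"},
    },
}

def cross_verified_benchmarks(conf):
    """Map each model to the benchmarks confirmed by 2+ independent sources."""
    return {
        model: [b for b, tag in benches.items() if tag["level"] == "cross-verified"]
        for model, benches in conf.items()
    }

print(cross_verified_benchmarks(confidence))
# → {'Claude Opus 4.6': ['gpqa'], 'Model X': []}
```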
## Quick start

```python
import json
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="FINAL-Bench/ALL-Bench-Leaderboard",
    filename="all_bench_leaderboard_v2.1.json",
    repo_type="dataset",
)
with open(path) as f:
    data = json.load(f)

# Top 5 LLMs by GPQA (null scores sort last)
ranked = sorted(data["llm"], key=lambda x: x["gpqa"] or 0, reverse=True)
for m in ranked[:5]:
    print(f"{m['name']:25s} GPQA={m['gpqa']}")

# Check confidence for a score
print(data["confidence"]["Gemini 3.1 Pro"]["gpqa"])
# → {"level": "single-source", "source": "Google DeepMind model card"}
```
## Related

**FINAL Bench — Metacognitive Benchmark**: FINAL Bench measures AI self-correction ability. Error Recovery (ER) explains 94.8% of metacognitive performance variance across the 9 frontier models evaluated.
## Citation

```bibtex
@misc{allbench2026,
  title  = {ALL Bench Leaderboard 2026: Unified Multi-Modal AI Evaluation},
  author = {{ALL Bench Team}},
  year   = {2026},
  url    = {https://huggingface.co/spaces/FINAL-Bench/all-bench-leaderboard}
}
```