
---
annotations_creators:
- expert-generated
language:
- en
license: apache-2.0
multilinguality:
- monolingual
pretty_name: ALL Bench Leaderboard 2026
size_categories:
- n<1K
source_datasets:
- original
tags:
- benchmark
- leaderboard
- llm
- vlm
- ai-evaluation
- gpt-5
- claude
- gemini
- final-bench
- metacognition
- multimodal
- ai-agent
- image-generation
- video-generation
- music-generation
task_categories:
- text-generation
- visual-question-answering
- text-to-image
- text-to-video
- text-to-audio
dataset_info:
  features:
  - name: llm
    dtype: list
  - name: vlm
    dtype: dict
  - name: agent
    dtype: list
  - name: image
    dtype: list
  - name: video
    dtype: list
  - name: music
    dtype: list
  - name: confidence
    dtype: dict
---

🏆 ALL Bench Leaderboard 2026

The only AI benchmark dataset covering LLM · VLM · Agent · Image · Video · Music in a single unified file.


Dataset Summary

ALL Bench Leaderboard aggregates and cross-verifies benchmark scores for 91 AI models across 6 modalities. Every numerical score is tagged with a confidence level (cross-verified, single-source, or self-reported) and its original source. The dataset is designed for researchers, developers, and decision-makers who need a trustworthy, unified view of the AI model landscape.

| Category | Models | Benchmarks | Description |
|---|---|---|---|
| LLM | 42 | 31 fields | MMLU-Pro, GPQA, AIME, HLE, ARC-AGI-2, Metacog, SWE-Pro, IFEval, LCB, etc. |
| VLM Flagship | 11 | 10 fields | MMMU, MMMU-Pro, MathVista, AI2D, OCRBench, MMStar, HallusionBench, etc. |
| VLM Lightweight | 5 | 34 fields | Detailed Qwen-series edge model comparison across 3 sub-categories |
| Agent | 10 | 8 fields | OSWorld, τ²-bench, BrowseComp, Terminal-Bench 2.0, GDPval-AA, SWE-Pro |
| Image Gen | 10 | 7 fields | Photo realism, text rendering, instruction following, style, aesthetics |
| Video Gen | 10 | 7 fields | Quality, motion, consistency, text rendering, duration, resolution |
| Music Gen | 8 | 6 fields | Quality, vocals, instrumental, lyrics, duration |


Live Leaderboard

👉 https://huggingface.co/spaces/FINAL-Bench/all-bench-leaderboard

Interactive features: composite ranking, dark mode, advanced search (GPQA > 90 open, price < 1), Model Finder, Head-to-Head comparison, Trust Map heatmap, Bar Race animation, and downloadable Intelligence Report (PDF/DOCX).

Data Structure

all_bench_leaderboard_v2.1.json
├── metadata          # version, formula, links, model counts
├── llm[42]           # 42 LLMs × 31 fields
├── vlm
│   ├── flagship[11]  # 11 flagship VLMs × 10 benchmarks
│   └── lightweight[5] # 5 edge models × 34 benchmarks (3 sub-tables)
├── agent[10]         # 10 agent models × 8 benchmarks
├── image[10]         # 10 image gen models × S/A/B/C ratings
├── video[10]         # 10 video gen models × S/A/B/C ratings
├── music[8]          # 8 music gen models × S/A/B/C ratings
└── confidence{42}    # per-model, per-benchmark source & trust level
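As a minimal sanity check of the structure above after download, one might compare each section's length against the documented model counts (a sketch; `data` is the loaded JSON, as in the Usage section):

```python
# Expected counts, taken from the tree above.
EXPECTED_COUNTS = {"llm": 42, "agent": 10, "image": 10, "video": 10, "music": 8}

def check_structure(data):
    """Return, per section, whether its length matches the documented count."""
    return {key: len(data.get(key, [])) == n for key, n in EXPECTED_COUNTS.items()}
```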

LLM Field Schema

| Field | Type | Description |
|---|---|---|
| name | string | Model name |
| provider | string | Organization |
| type | string | open or closed |
| group | string | flagship, open, korean, etc. |
| released | string | Release date (YYYY.MM) |
| mmluPro | float \| null | MMLU-Pro score (%) |
| gpqa | float \| null | GPQA Diamond (%) |
| aime | float \| null | AIME 2025 (%) |
| hle | float \| null | Humanity's Last Exam (%) |
| arcAgi2 | float \| null | ARC-AGI-2 (%) |
| metacog | float \| null | FINAL Bench Metacognitive score |
| swePro | float \| null | SWE-bench Pro (%) |
| bfcl | float \| null | Berkeley Function Calling (%) |
| ifeval | float \| null | IFEval instruction following (%) |
| lcb | float \| null | LiveCodeBench (%) |
| sweV | float \| null | SWE-bench Verified (%), deprecated |
| mmmlu | float \| null | Multilingual MMLU (%) |
| termBench | float \| null | Terminal-Bench 2.0 (%) |
| sciCode | float \| null | SciCode (%) |
| priceIn / priceOut | float \| null | USD per 1M tokens |
| elo | int \| null | Arena Elo rating |
| license | string | Prop, Apache2, MIT, Open, etc. |
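As a sketch of querying on the schema above, the `type` and `priceIn` fields can be combined to find inexpensive open-weight models (the sample records below are hypothetical, not real dataset entries; note that `priceIn` may be null):

```python
def cheap_open_models(llms, max_price_in=1.0):
    """Names of open-weight models whose `priceIn` is at or below the ceiling.

    Uses the `type` and `priceIn` fields from the schema; `priceIn` may be
    null in the dataset, so entries without a price are skipped.
    """
    return [
        m["name"]
        for m in llms
        if m["type"] == "open"
        and m.get("priceIn") is not None
        and m["priceIn"] <= max_price_in
    ]

# Hypothetical records for illustration only:
sample = [
    {"name": "ModelA", "type": "open", "priceIn": 0.5},
    {"name": "ModelB", "type": "closed", "priceIn": 0.2},
    {"name": "ModelC", "type": "open", "priceIn": None},
]
# Only ModelA qualifies: open-weight, priced, and under the ceiling.
```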


Composite Score

Score = Avg(confirmed benchmarks) × √(N/10)

10 core benchmarks across the 5-Axis Intelligence Framework: Knowledge · Expert Reasoning · Abstract Reasoning · Metacognition · Execution.
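The formula above can be sketched directly (assuming `scores` holds a model's values for the 10 core benchmarks, with null for unconfirmed ones):

```python
import math

def composite_score(scores):
    """Mean of confirmed (non-null) benchmark scores, scaled by sqrt(N/10)
    so that models with few confirmed results are penalized."""
    confirmed = [s for s in scores if s is not None]
    if not confirmed:
        return 0.0
    n = len(confirmed)
    return (sum(confirmed) / n) * math.sqrt(n / 10)
```

A model with all 10 benchmarks confirmed keeps its raw average; one with only 5 confirmed has its average scaled by √0.5 ≈ 0.707.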

Confidence System

Each benchmark score in the confidence object is tagged:

| Level | Badge | Meaning |
|---|---|---|
| cross-verified | ✓✓ | Confirmed by 2+ independent sources |
| single-source | | One official or third-party source |
| self-reported | ~ | Provider's own claim, unverified |

Example:

```json
"Claude Opus 4.6": {
  "gpqa": { "level": "cross-verified", "source": "Anthropic + Vellum + DataCamp" },
  "arcAgi2": { "level": "cross-verified", "source": "Vellum + llm-stats + NxCode + DataCamp" },
  "metacog": { "level": "single-source", "source": "FINAL Bench dataset" }
}
```
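When building a strict ranking, one might keep only the scores that meet a given confidence level. A minimal sketch, assuming a per-model confidence entry shaped like the example above:

```python
def verified_scores(confidence_entry, level="cross-verified"):
    """Keep only the benchmarks whose confidence tag matches `level`."""
    return {
        bench: meta
        for bench, meta in confidence_entry.items()
        if meta.get("level") == level
    }

entry = {
    "gpqa": {"level": "cross-verified", "source": "Anthropic + Vellum + DataCamp"},
    "metacog": {"level": "single-source", "source": "FINAL Bench dataset"},
}
# verified_scores(entry) keeps "gpqa" only; "metacog" is single-source.
```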

Usage

```python
import json
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="FINAL-Bench/ALL-Bench-Leaderboard",
    filename="all_bench_leaderboard_v2.1.json",
    repo_type="dataset",
)
with open(path, encoding="utf-8") as f:
    data = json.load(f)

# Top 5 LLMs by GPQA (null scores sort last)
ranked = sorted(data["llm"], key=lambda x: x["gpqa"] or 0, reverse=True)
for m in ranked[:5]:
    print(f"{m['name']:25s} GPQA={m['gpqa']}")

# Check confidence for a score
print(data["confidence"]["Gemini 3.1 Pro"]["gpqa"])
# → {"level": "single-source", "source": "Google DeepMind model card"}
```


FINAL Bench — Metacognitive Benchmark

FINAL Bench measures AI self-correction ability. Error Recovery (ER) explains 94.8% of the variance in metacognitive performance across the nine frontier models evaluated.

Citation

```bibtex
@misc{allbench2026,
    title={ALL Bench Leaderboard 2026: Unified Multi-Modal AI Evaluation},
    author={ALL Bench Team},
    year={2026},
    url={https://huggingface.co/spaces/FINAL-Bench/all-bench-leaderboard}
}
```

