
WorldReasonBench: Human-Aligned Stress Testing of Video Generators as Future World-State Predictors

Can a video generator reason about how the world should evolve — not just render it?

436 cases  ·  4 reasoning dimensions  ·  22 subcategories  ·  11 generators  ·  ~6K expert preference pairs

arXiv Website GitHub Data Daily Paper

WorldReasonBench overview


Abstract

Commercial video generation systems such as Seedance2.0 and Veo3.1 have rapidly improved, strengthening the view that video generators may be evolving into "world simulators". Yet the community still lacks a benchmark that directly tests whether a model can reason about how an observed world should evolve over time.

We introduce WorldReasonBench, which reframes video generation evaluation as world-state prediction: given an initial state and an action, can a model generate a future video whose state evolution remains physically, socially, logically, and informationally consistent? It contains 436 curated test cases with structured ground-truth QA annotations spanning four reasoning dimensions and 22 subcategories.

We further release WorldRewardBench, a preference benchmark with approximately 6K expert-annotated pairs over 1.4K videos, supporting pair-wise and point-wise reward-model evaluation.


Key Numbers

| Test Cases | Subcategories | Generators | Preference Pairs | Annotators | ScorePR – Human ρ |
|---|---|---|---|---|---|
| 436 | 22 | 11 | 6,000+ | 15 | 0.955 |

Method — Construction & Evaluation Pipelines

A three-stage VLM-assisted construction pipeline produces structured ground-truth QA pairs, and a two-part evaluation methodology turns generated videos into human-aligned scores.

Construction
Data construction pipeline
WorldReasonBench & WorldRewardBench construction. Taxonomy-aware captioning → reasoning-aware prompt generation → structured QA generation, with expert scoring and preference-pair construction for the reward bench.
Evaluation
Evaluation pipeline
Two complementary components. Process-aware Reasoning Verification turns structured QA into reasoning-phase diagnostics; Multi-dimensional Quality Assessment scores each video on reasoning quality, temporal consistency, and visual aesthetics.

Scoring formulas

ScorePR = AccQA^0.8 · s_dyn^0.2

Process-aware reasoning score — outcome accuracy penalised for dynamic-phase failures.

S(v) = 0.4 · s_r + 0.3 · s_c + 0.3 · s_a

Multi-dimensional quality score — reasoning (s_r), temporal consistency (s_c), visual aesthetics (s_a).
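In code, the two scores reduce to the following (a minimal sketch: the exponents and weights are those stated above; function and argument names are illustrative, and all inputs are fractions in [0, 1] even though the tables below report percentages):

```python
def score_pr(acc_qa: float, s_dyn: float) -> float:
    """Process-aware reasoning score: a weighted geometric combination of
    QA outcome accuracy (exponent 0.8) and dynamic-phase score (exponent 0.2)."""
    return (acc_qa ** 0.8) * (s_dyn ** 0.2)

def quality_score(s_reasoning: float, s_consistency: float, s_aesthetics: float) -> float:
    """Multi-dimensional quality score S(v): weighted arithmetic mean of
    reasoning quality, temporal consistency, and visual aesthetics."""
    return 0.4 * s_reasoning + 0.3 * s_consistency + 0.3 * s_aesthetics
```

The geometric form of ScorePR means a near-zero dynamic-phase score drags down even a high QA accuracy, which is how outcome-only successes get penalised.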


Leaderboard — Main Results

All 11 generators are evaluated on a shared evaluation set for fully controlled cross-model comparison. Higher is better. Closed-source models are listed before open-source models.

ScorePR (Process-aware Reasoning, %)

| # | Model | Family | Overall | World Knowledge | Human-Centric | Logic Reasoning | Information-Based |
|---|---|---|---|---|---|---|---|
| 1 | Seedance2.0 | Closed | 39.8 | 43.2 | 35.9 | 31.7 | 47.6 |
| 2 | Veo3.1-Fast | Closed | 35.3 | 55.0 | 35.1 | 25.7 | 28.6 |
| 3 | Sora2 | Closed | 34.3 | 36.9 | 44.7 | 25.9 | 37.3 |
| 4 | Kling | Closed | 32.7 | 42.2 | 32.5 | 22.4 | 35.7 |
| 5 | Wan2.6 | Closed | 32.4 | 35.2 | 34.5 | 26.2 | 35.5 |
| 6 | HunyuanVideo-1.5 | Open | 17.9 | 21.6 | 8.1 | 12.7 | 24.2 |
| 7 | Wan2.2-14B | Open | 17.5 | 22.9 | 14.5 | 16.4 | 15.0 |
| 8 | LongCat-Video | Open | 17.4 | 13.3 | 22.8 | 12.6 | 22.8 |
| 9 | Cosmos-Predict2.5 | Open | 16.9 | 15.2 | 22.2 | 7.1 | 24.7 |
| 10 | LTX2.3 | Open | 16.8 | 15.6 | 19.3 | 11.9 | 22.7 |
| 11 | UniVideo | Open | 14.4 | 13.8 | 15.8 | 11.2 | 17.3 |

S(v) (Multi-dimensional Quality, %)

| # | Model | Family | Overall | World Knowledge | Human-Centric | Logic Reasoning | Information-Based |
|---|---|---|---|---|---|---|---|
| 1 | Seedance2.0 | Closed | 59.4 | 70.4 | 83.9 | 56.7 | 42.5 |
| 2 | Sora2 | Closed | 56.9 | 62.6 | 76.7 | 43.0 | 58.0 |
| 3 | Kling | Closed | 55.4 | 72.0 | 87.2 | 37.3 | 48.8 |
| 4 | Veo3.1-Fast | Closed | 54.8 | 80.1 | 77.2 | 31.5 | 47.2 |
| 5 | Wan2.6 | Closed | 50.3 | 61.8 | 64.2 | 42.3 | 42.6 |
| 6 | Cosmos-Predict2.5 | Open | 30.5 | 40.8 | 30.8 | 26.7 | 26.1 |
| 7 | Wan2.2-14B | Open | 30.0 | 39.4 | 38.1 | 19.5 | 30.5 |
| 8 | LTX2.3 | Open | 28.1 | 35.1 | 27.8 | 24.7 | 25.8 |
| 9 | HunyuanVideo-1.5 | Open | 27.0 | 37.7 | 35.3 | 19.8 | 22.5 |
| 10 | LongCat-Video | Open | 25.3 | 35.1 | 42.8 | 16.3 | 20.2 |
| 11 | UniVideo | Open | 21.3 | 29.4 | 37.2 | 14.4 | 16.0 |

Validation — Metrics Align with Human Ranking

Pairwise expert evaluation gives each model a Human Elo. Our automated ScorePR reproduces the human ranking with absolute rank displacement |Δr| ≤ 1 on 10 of 11 models, while a generic Qwen3.5-Thinking judge drifts by up to four positions.

| # | Model | Human Elo | Judge Elo | Judge Rank | AccQA (%) | \|Δr\| | ScorePR (%) | \|Δr\| | s_dyn / AccQA |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Seedance2.0 | 1471 | 1183 | 3 | 41.2 | 0 | 39.8 | 0 | 0.84 |
| 2 | Veo3.1-Fast | 1253 | 1151 | 4 | 36.0 | 0 | 35.3 | 0 | 0.91 |
| 3 | Kling | 1240 | 1142 | 5 | 34.0 | 2 | 32.7 | 1 | 0.82 |
| 4 | Wan2.6 | 1211 | 1130 | 6 | 34.7 | 0 | 32.4 | 1 | 0.71 |
| 5 | Sora2-8s | 1118 | 1222 | 1 | 35.3 | 2 | 34.3 | 2 | 0.86 |
| 6 | Sora2-12s | 1109 | 1217 | 2 | 33.5 | 0 | 32.4 | 0 | 0.84 |
| 7 | Wan2.2-14B | 953 | 913 | 7 | 19.6 | 2 | 17.5 | 1 | 0.57 |
| 8 | HunyuanVideo-1.5 | 911 | 841 | 9 | 20.2 | 1 | 17.9 | 1 | 0.56 |
| 9 | LongCat-Video | 904 | 876 | 8 | 19.7 | 1 | 17.4 | 0 | 0.54 |
| 10 | UniVideo | 665 | 737 | 11 | 16.2 | 1 | 14.4 | 1 | 0.56 |
| 11 | LTX2.3 | 587 | 802 | 10 | 18.5 | 1 | 16.8 | 1 | 0.63 |
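The rank-agreement statistics above can be reproduced with a few lines of stdlib Python (a minimal sketch using three rows from the validation table; `ranks` assumes no tied scores, and `scipy.stats.spearmanr` would be the library equivalent):

```python
def ranks(scores):
    """Map each name to its rank (1 = highest score); assumes no tied scores."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {name: i + 1 for i, name in enumerate(ordered)}

def spearman_rho(a, b):
    """Spearman rank correlation between two score dicts over the same keys,
    via the classic 1 - 6*sum(d^2)/(n*(n^2-1)) formula (no-ties case)."""
    ra, rb = ranks(a), ranks(b)
    n = len(ra)
    d2 = sum((ra[k] - rb[k]) ** 2 for k in ra)
    return 1 - 6 * d2 / (n * (n * n - 1))

# Three rows from the table above (Seedance2.0, Veo3.1-Fast, Wan2.2-14B):
human_elo = {"Seedance2.0": 1471, "Veo3.1-Fast": 1253, "Wan2.2-14B": 953}
auto_pr   = {"Seedance2.0": 39.8, "Veo3.1-Fast": 35.3, "Wan2.2-14B": 17.5}

# Per-model absolute rank displacement |Δr| between the two rankings.
displacement = {m: abs(ranks(human_elo)[m] - ranks(auto_pr)[m]) for m in human_elo}
```

On this three-model subset the two rankings agree exactly, so every |Δr| is 0 and ρ = 1.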

Qualitative — Generated Videos Across the Reasoning Taxonomy

Visually plausible generations can still fail process-level world reasoning. Representative examples from each reasoning dimension:

How videos display: GitHub renders inline <video> tags only when the src points to a GitHub-hosted asset URL (e.g. an attachment URL like https://github.com/<org>/<repo>/assets/...). The tags below use relative paths to the MP4s under assets/videos/, so they will play inline when this README is rendered locally (e.g. in VS Code, Cursor, Obsidian, the project website) and degrade to a download link on GitHub. To get inline playback on the GitHub repo page, drag-and-drop each MP4 into a GitHub issue/PR comment, then replace src=... with the resulting attachment URL.

World Knowledge

Domino chains — Seedance 2.0 Pencil in water — Seedance 2.0 Releasing two objects — Wan 2.6
wk_1.mp4
"These are two domino chains that have already started to fall; show how both groups topple."
wk_2.mp4
"The pencil is placed in water; what happens?"
wk_3.mp4
"When the person releases both objects in their hands at the same time, what will happen next?"

Human-Centric

Moving hands — Kling V3 Going to a car wash — Seedance 2.0 Reaching a high box — Kling V3
hc_1.mp4
"What happens when this man moves his hands?"
hc_2.mp4
"The character in the picture wants to wash his car… going to the car wash shop."
hc_3.mp4
"This box is too high. How should the characters in the picture use tools to get to the box?"

Logic Reasoning

2D → 3D methane — Kling V3 Solving a math problem — Seedance 2.0 Flame reaction experiment — Wan 2.6
lr_1.mp4
"Transform 2D atomic / structural diagrams of methane into a clear 3D geometric model."
lr_2.mp4
"Use the input image as the first frame to solve the problem and give the solution process."
lr_3.mp4
"The picture shows the flame reaction of various metals — show the subsequent process of this experiment."

Information-Based

The Little Prince — Veo 3.1 Embryonic development — Sora 2.0 Tesla logo from dust — Seedance 2.0
ib_1.mp4
Visualise a passage from "The Little Prince", one sentence per segment with subtitles.
ib_2.mp4
"Give the change process in the 10 months after fertilization of sperm and egg."
ib_3.mp4
"The red dust kicked up by the tires condenses into the Tesla brand T logo in the air."

Qualitative comparison across models

Qualitative comparison

Qualitative comparison on representative reasoning cases. Higher-scoring models better preserve the intended state transition and temporal dynamics.


Human Study — Expert Annotation for WorldRewardBench

Fifteen trained annotators rate each video on reasoning quality, temporal consistency, and visual aesthetics. Ratings are aggregated into per-video scores and pairwise preferences for reward-model calibration.

Annotation platform

Annotation interface: input image, prompt, eight anonymised generations, and a 1–5 rubric over three dimensions.


Reward-Model Alignment on WorldRewardBench

  • Pair-wise: agreement (%) w/ Ties / w/o Ties.
  • Point-wise: induced pairwise accuracy / Spearman ρ.

Open Qwen3.5-Thinking matches GPT-5.4 on three of four reasoning dimensions.

| Dimension | Protocol | GPT-5.4 (closed) | Gemini-3.1-Flash (closed) | Qwen3.5-9B Thinking (open) | Qwen3.5-27B Instruct (open) | Qwen3.5-27B Thinking (open) | Qwen3.5-27B Thinking · 4 FPS (open) |
|---|---|---|---|---|---|---|---|
| Frames used | — | 8 | 1 FPS | ~10 | ~10 | ~10 | 4 FPS |
| World Knowledge | Pair, w/ / w/o | 60.77 / 67.84 | 51.50 / 60.44 | 70.81 / 76.19 | 69.37 / 74.16 | 69.94 / 74.64 | 69.51 / 74.23 |
| World Knowledge | Point, Acc / ρ | 54.55 / 0.592 | 59.86 / 0.582 | 60.70 / 0.720 | 54.01 / 0.658 | 60.57 / 0.687 | 62.09 / 0.711 |
| Human-Centric | Pair, w/ / w/o | 68.37 / 76.80 | 58.22 / 66.27 | 71.71 / 77.52 | 71.25 / 76.05 | 72.61 / 77.81 | 69.08 / 74.41 |
| Human-Centric | Point, Acc / ρ | 59.14 / 0.626 | 60.06 / 0.675 | 59.54 / 0.702 | 55.94 / 0.682 | 62.81 / 0.713 | 60.49 / 0.703 |
| Logic Reasoning | Pair, w/ / w/o | 67.41 / 78.43 | 58.23 / 67.68 | 69.33 / 77.13 | 68.46 / 74.51 | 70.16 / 76.23 | 68.53 / 74.97 |
| Logic Reasoning | Point, Acc / ρ | 53.42 / 0.523 | 57.65 / 0.562 | 57.50 / 0.617 | 55.71 / 0.573 | 60.17 / 0.606 | 58.40 / 0.597 |
| Information-Based | Pair, w/ / w/o | 56.95 / 63.68 | 50.21 / 58.10 | 52.45 / 61.76 | 60.44 / 65.22 | 60.24 / 65.32 | 61.50 / 66.39 |
| Information-Based | Point, Acc / ρ | 48.15 / 0.484 | 47.89 / 0.432 | 53.59 / 0.471 | 47.95 / 0.408 | 50.15 / 0.445 | 52.41 / 0.526 |
| Overall | Pair, w/ / w/o | 63.04 / 71.36 | 54.39 / 62.99 | 67.14 / 74.35 | 66.89 / 72.07 | 67.74 / 73.05 | 66.90 / 72.30 |
| Overall | Point, Acc / ρ | 53.43 / 0.565 | 55.84 / 0.568 | 57.76 / 0.655 | 53.15 / 0.591 | 57.85 / 0.626 | 57.83 / 0.644 |

Bold = best across all six reward models. Source: Table 4 of the paper.
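The pair-wise protocol's two agreement numbers can be sketched as follows. This is a minimal illustration, not the repo's implementation: the record format and the convention of excluding human-tie pairs for the "w/o ties" figure are assumptions, and the paper's exact protocol may differ.

```python
def pairwise_agreement(records):
    """records: dicts with 'human' and 'judge' verdicts, each in {'A', 'B', 'tie'}.
    Returns (agreement over all pairs, agreement over pairs the humans decided)."""
    with_ties = sum(r["human"] == r["judge"] for r in records) / len(records)
    decisive = [r for r in records if r["human"] != "tie"]  # drop human-tie pairs
    wo_ties = sum(r["human"] == r["judge"] for r in decisive) / len(decisive)
    return with_ties, wo_ties

# Tiny worked example (hypothetical verdicts):
sample = [
    {"human": "A",   "judge": "A"},
    {"human": "tie", "judge": "A"},
    {"human": "B",   "judge": "A"},
    {"human": "tie", "judge": "tie"},
]
acc_w, acc_wo = pairwise_agreement(sample)
```

Because judges rarely output ties when humans do, the "w/o ties" figure is usually the higher of the two, as in the table above.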


Directory Structure

```text
WorldReasonBench/
├── assets/
│   ├── images/                          # paper figures
│   │   ├── bench_overview.png
│   │   ├── data_pipe.png
│   │   ├── eval_pipe.png
│   │   ├── qualitative_comparison.png
│   │   └── annotation_platform.png
│   └── videos/                          # qualitative examples (4 categories × 3 MP4s)
│       ├── WorldKnowledge/{wk_1,wk_2,wk_3}.mp4
│       ├── HumanCentric/{hc_1,hc_2,hc_3}.mp4
│       ├── LogicReasoning/{lr_1,lr_2,lr_3}.mp4
│       └── InformationBased/{ib_1,ib_2,ib_3}.mp4
├── data/
│   ├── *_with_prompts.json              # Task metadata with video prompts (4 categories)
│   ├── data_with_qa_gemini/
│   │   └── qa_*.json                    # QA evaluation data (open-ended + binary)
│   └── statistics_model_pairs_*.json    # Human-annotated preference pairs (5,969 pairs)
├── evaluation/
│   ├── eval_qa.py                       # QA pipeline (Stage 1: VLM answer, Stage 2: LLM judge)
│   └── reward_bench/
│       ├── __init__.py                  # Package init
│       ├── utils.py                     # Shared utilities (video/image encoding, templates)
│       ├── run_pairwise_eval.py         # Pairwise comparison evaluation
│       ├── run_pointwise_eval.py        # Pointwise S(v) scoring
│       ├── run_pointwise_eval_main_table.py  # S(v) for all model videos
│       ├── compute_pairwise_accuracy.py # Compute pairwise metrics
│       ├── compute_pointwise_metrics.py # Compute pointwise metrics (Spearman rho)
│       ├── mllm_tools/
│       │   ├── __init__.py              # Model registry
│       │   └── qwen3_5_eval.py          # Qwen3.5 OpenAI-compatible wrapper
│       └── templates/
│           └── video_generation/
│               ├── viescore.txt         # Pointwise scoring template
│               └── pairwise.txt         # Pairwise comparison template
├── scripts/
│   ├── run_qa_eval.sh                   # Example: QA evaluation
│   ├── run_pointwise_eval.sh            # Example: Pointwise evaluation
│   └── run_pairwise_eval.sh             # Example: Pairwise evaluation
├── requirements.txt
└── README.md
```

Setup

1. Install dependencies

```shell
pip install -r requirements.txt
```

2. Launch vLLM server with Qwen3.5

```shell
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen3.5-27B \
  --port 30002 \
  --tensor-parallel-size 4 \
  --media-io-kwargs '{"video": {"num_frames": -1}}' \
  --limit-mm-per-prompt video=1,image=1
```

The --media-io-kwargs flag is required for FPS-based video frame sampling.

3. Prepare video data

Organize your generated videos in the expected directory structure (see MODEL_VIDEO_DIRS in run_pointwise_eval_main_table.py).

Usage

QA-Based Reasoning Verification (Pillar I)

Evaluates whether generated videos contain the expected reasoning elements through a 2-stage pipeline:

  • Stage 1: VLM answers questions about the video
  • Stage 2: LLM judges answer correctness

```shell
python3 evaluation/eval_qa.py \
  --qa_json data/data_with_qa_gemini/qa_World-Knowledge.json \
  --video_dir /path/to/videos/World-Knowledge \
  --output_dir outputs/qa_eval/ \
  --base_url http://127.0.0.1:30002/v1 \
  --video_fps 4 \
  --qa_mode open_ended \
  --use_mm_processor_kwargs
```

Key metrics produced:

  • AccQA: Simple QA accuracy
  • ScorePR: Process-aware score combining accuracy with dynamic reasoning quality
  • ΔRG: Gap between easy (with hints) and difficult (without hints) accuracy
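Aggregating Stage-2 judge verdicts into these metrics amounts to the following sketch (the record fields and `easy`/`hard` labels are illustrative stand-ins for the with-hints/without-hints split; the repo's `eval_qa.py` output format may differ):

```python
def qa_metrics(judged):
    """judged: one dict per (video, question) pair after the Stage-2 LLM judge,
    e.g. {'correct': bool, 'difficulty': 'easy' | 'hard'}.
    Returns (AccQA, ΔRG) where ΔRG = easy accuracy minus hard accuracy."""
    acc = lambda rows: sum(r["correct"] for r in rows) / len(rows)
    acc_qa = acc(judged)                                        # overall AccQA
    easy = [r for r in judged if r["difficulty"] == "easy"]     # with hints
    hard = [r for r in judged if r["difficulty"] == "hard"]     # without hints
    return acc_qa, acc(easy) - acc(hard)                        # ΔRG reasoning gap

# Worked example: 2 easy questions (both correct), 2 hard (one correct).
sample = [
    {"correct": True,  "difficulty": "easy"},
    {"correct": True,  "difficulty": "easy"},
    {"correct": True,  "difficulty": "hard"},
    {"correct": False, "difficulty": "hard"},
]
acc_qa, delta_rg = qa_metrics(sample)  # 0.75, 0.5
```

A large ΔRG indicates the model only answers correctly when hints are present, i.e. the reasoning is shallow even when the outcome looks right.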

Multi-Dimensional Quality Assessment (Pillar II)

Pointwise S(v) Scoring

Scores each video on 3 dimensions: reasoning correctness, content fidelity, visual aesthetics.

```shell
python3 evaluation/reward_bench/run_pointwise_eval.py \
  --pairs-json data/statistics_model_pairs_by_task_stratified_balanced_tie_v2.json \
  --judge-model qwen3.5-27b \
  --judge-base-url http://127.0.0.1:30002/v1 \
  --num-workers 2 \
  --max-parse-attempts 3 \
  --resume
```

Final score: S(v) = 0.4 * s_reasoning + 0.3 * s_content + 0.3 * s_aesthetics

Pairwise Comparison

Directly compares two videos and produces A>B / B>A / A=B verdicts.

```shell
python3 evaluation/reward_bench/run_pairwise_eval.py \
  --pairs-json data/statistics_model_pairs_by_task_stratified_balanced_tie_v2.json \
  --judge-model qwen3.5-27b \
  --judge-base-url http://127.0.0.1:30002/v1 \
  --num-workers 2 \
  --resume
```

Computing Metrics

```shell
# Pairwise accuracy (with/without ties)
python3 evaluation/reward_bench/compute_pairwise_accuracy.py outputs/pairwise_eval.jsonl

# Pointwise correlation with human ratings
python3 evaluation/reward_bench/compute_pointwise_metrics.py \
  --videos outputs/pointwise_eval.jsonl \
  --induced-pairs outputs/pointwise_eval.induced_pairs.jsonl
```
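The "induced pairwise accuracy" reported for the point-wise protocol can be sketched as follows: point-wise S(v) scores induce a preference for each annotated pair, which is then checked against the human label. This is a minimal illustration with assumed field names, not the repo's `compute_pointwise_metrics.py` logic; in particular, how score ties and human ties are handled is a convention choice.

```python
def induced_pairwise_accuracy(pairs, scores, margin=0.0):
    """pairs: (video_a, video_b, human_pref) tuples with human_pref in {'A', 'B'};
    scores: dict video_id -> point-wise S(v). A pair counts as correct when the
    score difference exceeds `margin` in the direction of the human preference
    (pairs with |diff| <= margin count as incorrect under this convention)."""
    correct = 0
    for a, b, pref in pairs:
        diff = scores[a] - scores[b]
        if (diff > margin and pref == "A") or (diff < -margin and pref == "B"):
            correct += 1
    return correct / len(pairs)

# Worked example: only the first induced preference matches the human label.
scores = {"v1": 0.8, "v2": 0.5, "v3": 0.6}
pairs = [("v1", "v2", "A"), ("v2", "v3", "A"), ("v3", "v2", "B")]
acc = induced_pairwise_accuracy(pairs, scores)
```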

Benchmark Categories

| Category | Description |
|---|---|
| World-Knowledge | Physics, chemistry, biology, geography reasoning |
| Human-Centric | Human behavior, social dynamics, emotion |
| Logic-Reasoning | Logical deduction, mathematical reasoning |
| Information-Based | Text comprehension, data interpretation |

Environment Variables

| Variable | Description |
|---|---|
| OPENAI_BASE_URL | Default vLLM server URL |
| OPENAI_API_KEY | API key (use "EMPTY" for local vLLM) |
| QWEN3_5_VIDEO_FPS | Override video FPS for frame sampling |
| QWEN3_5_NO_THINKING | Set to "1" to disable the thinking chain |
| QWEN3_5_MAX_TOKENS | Max generation tokens (default: 16384) |

Citation

If you find this project helpful, please consider giving us a star and citing our paper with:

```bibtex
@misc{wu2026worldreasonbenchhumanalignedstresstesting,
      title={WorldReasonBench: Human-Aligned Stress Testing of Video Generators as Future World-State Predictors},
      author={Keming Wu and Yijing Cui and Wenhan Xue and Qijie Wang and Xuan Luo and Zhiyuan Feng and Zuhao Yang and Sudong Wang and Sicong Jiang and Haowei Zhu and Zihan Wang and Ping Nie and Wenhu Chen and Bin Wang},
      year={2026},
      eprint={2605.10434},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2605.10434},
}
```

License

This project is released under the MIT License.
