Mingxin Liu, Shuran Ma, Shibei Meng, Xiangyu Zhao, Zicheng Zhang, Shaofeng Zhang, Zhihang Zhong, Peixian Chen, Haoyu Cao, Xing Sun, Haodong Duan, Xue Yang
We present RISE-Video, a pioneering reasoning-oriented benchmark for Text-Image-to-Video (TI2V) synthesis that shifts the evaluative focus from surface-level aesthetics to deep cognitive reasoning.
RISE-Video comprises 467 meticulously human-annotated samples spanning eight categories: Commonsense Knowledge, Subject Knowledge, Perceptual Knowledge, Societal Knowledge, Logical Capability, Experiential Knowledge, Spatial Knowledge, and Temporal Knowledge, providing a structured testbed for probing model intelligence across diverse dimensions.
Our framework introduces a multi-dimensional evaluation protocol consisting of four metrics: Reasoning Alignment, Temporal Consistency, Physical Rationality, and Visual Quality. To further support scalable evaluation, we propose an automated pipeline leveraging Large Multimodal Models (LMMs) to emulate human-centric assessment.
Figure: Evaluation pipeline.

Figure: Specialized evaluation pipeline.
We conduct a comprehensive evaluation of 11 representative TI2V models, revealing systematic reasoning limitations and providing insights into current model capabilities.
RA: Reasoning Alignment; TC: Temporal Consistency; PR: Physical Rationality; VQ: Visual Quality; W.Score: Weighted Score, computed by assigning weights of 0.4, 0.25, 0.25, and 0.1 to RA, TC, PR, and VQ, respectively.
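For reference, the weighted score is a simple weighted sum of the four per-dimension scores. Below is a minimal sketch of that computation; the function name and the example score scale are illustrative, not taken from the repository:

```python
# Minimal sketch of the W.Score computation described above.
# Weights follow the benchmark definition: RA 0.4, TC 0.25, PR 0.25, VQ 0.1.
WEIGHTS = {"RA": 0.4, "TC": 0.25, "PR": 0.25, "VQ": 0.1}

def weighted_score(scores: dict) -> float:
    """scores maps dimension names ("RA", "TC", "PR", "VQ") to numeric scores."""
    return sum(WEIGHTS[dim] * scores[dim] for dim in WEIGHTS)

# Example (hypothetical 1-5 scores):
# weighted_score({"RA": 3.0, "TC": 4.0, "PR": 4.0, "VQ": 5.0}) == 3.7
```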
The first frame and text prompt for video generation are provided here. The generated videos should be organized as:

```
{MODEL NAME}/{CATEGORY}/{TASK_ID}
```

- MODEL NAME: the generation model.
- CATEGORY: the category of the sample (e.g., Subject Knowledge).
- TASK_ID: the unique ID of each sample (corresponding to the "task_id" field in the JSON).
All video paths must be written into the "video_path" field of the JSON.
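A minimal sketch of filling in the "video_path" field is shown below. The JSON filename, the "category" record key, the .mp4 extension, and the model name are assumptions made for illustration; only "task_id" and "video_path" are confirmed by the description above:

```python
import json
from pathlib import Path

DATA_JSON = "rise_video.json"   # hypothetical filename for the benchmark JSON
VIDEO_ROOT = Path("results")    # root holding {MODEL NAME}/{CATEGORY}/{TASK_ID}
MODEL_NAME = "my_model"         # hypothetical model name

with open(DATA_JSON, "r", encoding="utf-8") as f:
    samples = json.load(f)      # assumed: a list of per-sample dicts

for sample in samples:
    # Build the expected path from the directory layout described above.
    # The "category" key and ".mp4" extension are assumptions for this sketch.
    video = VIDEO_ROOT / MODEL_NAME / sample["category"] / f'{sample["task_id"]}.mp4'
    sample["video_path"] = str(video)

with open(DATA_JSON, "w", encoding="utf-8") as f:
    json.dump(samples, f, ensure_ascii=False, indent=2)
```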
To generate the video frames required for the Reasoning Alignment dimension and to facilitate visualization, first configure the path to the JSON file containing the "video_path" field and the frame storage path via the data_json and root_folder parameters here.
Then run the following commands to extract and store the frames:

```bash
cd reasoning_fps
python fps_clip.py
```
The paths of the extracted frames are automatically written into the "frame_path" field.
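For intuition, frame extraction along these lines could look like the sketch below (uniform sampling with OpenCV). The sampling strategy, frame count, and file naming are assumptions for illustration, not necessarily what fps_clip.py does:

```python
from pathlib import Path

import cv2  # OpenCV; pip install opencv-python

NUM_FRAMES = 8  # assumed number of uniformly sampled frames per video

def extract_frames(video_path: str, out_dir: Path) -> list[str]:
    """Uniformly sample NUM_FRAMES frames from a video and save them as JPEGs."""
    out_dir.mkdir(parents=True, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    saved = []
    if total <= 0:
        cap.release()
        return saved
    for i in range(NUM_FRAMES):
        # Evenly spaced frame indices across the whole clip.
        idx = int(i * (total - 1) / max(NUM_FRAMES - 1, 1))
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if not ok:
            break
        path = out_dir / f"frame_{i:02d}.jpg"
        cv2.imwrite(str(path), frame)
        saved.append(str(path))
    cap.release()
    return saved
```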
Configure the parameters here:
- data_json: path to the video result JSON containing "frame_path".
- root_dir: root directory for intermediate file storage.
- relax_save_root: root directory for storing the model's weighted scores.
- strcit_save_root: root directory for storing the model's accuracy.
- GPT_URL: your OpenAI API base URL.
- GPT_KEY: your OpenAI API key.
Then run:

```bash
python eval.py
```
You can then view the video evaluation results and scores in the corresponding folder.
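As a rough illustration of the LMM-as-judge step, an OpenAI-compatible call using the GPT_URL and GPT_KEY parameters above might look like the following sketch. The judge model name, prompt wording, and helper functions are illustrative assumptions; see eval.py for the actual protocol:

```python
import base64

from openai import OpenAI  # pip install openai

GPT_URL = "https://api.openai.com/v1"  # your OpenAI-compatible base URL
GPT_KEY = "sk-..."                      # your API key

client = OpenAI(base_url=GPT_URL, api_key=GPT_KEY)

def encode_image(path: str) -> str:
    """Read an image file and return its base64-encoded contents."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode()

def judge_reasoning_alignment(frame_paths: list[str], prompt: str) -> str:
    """Ask an LMM to rate how well the sampled frames realize the prompt."""
    content = [{"type": "text",
                "text": f"Rate (1-5) how well these video frames realize: {prompt}"}]
    for p in frame_paths:
        content.append({"type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{encode_image(p)}"}})
    resp = client.chat.completions.create(
        model="gpt-4o",  # illustrative choice of judge model
        messages=[{"role": "user", "content": content}],
    )
    return resp.choices[0].message.content
```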
@misc{liu2026risevideovideogeneratorsdecode,
title={RISE-Video: Can Video Generators Decode Implicit World Rules?},
author={Mingxin Liu and Shuran Ma and Shibei Meng and Xiangyu Zhao and Zicheng Zhang and Shaofeng Zhang and Zhihang Zhong and Peixian Chen and Haoyu Cao and Xing Sun and Haodong Duan and Xue Yang},
year={2026},
eprint={2602.05986},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2602.05986},
}



