Gangwei Xu, Qihang Zhang, Jiaming Zhou, Xing Zhu, Yujun Shen, Xin Yang, Yinghao Xu
Next Forcing tackles the myopic supervision problem in autoregressive video world models: next-chunk denoising often learns local appearance shortcuts instead of long-range dynamics, especially at high frame rates.
By training lightweight Multi-Chunk Prediction (MCP) modules to predict multiple future chunks, Next Forcing provides denser temporal supervision, achieves faster and more stable convergence across frame rates, sets new state-of-the-art results on RoboTwin, and enables 2x inference acceleration via parallel chunk generation.
- Multi-Chunk Prediction (MCP): auxiliary modules predict next^1, next^2, and next^3 chunks to provide long-range temporal supervision beyond the current chunk.
- Faster and stable training: Next Forcing converges faster and reaches higher success rates across frame rates, with the strongest gains at high FPS where appearance shortcuts are most severe.
- LLM-style inference acceleration: the MCP module can be retained at inference to predict the next chunk in parallel with the current chunk, similar in spirit to parallel/speculative decoding in LLMs.
During training, the main model denoises the current chunk, while lightweight MCP modules predict multiple future chunks through a causal chain. These future prediction losses provide dense temporal supervision to the backbone and encourage the model to learn long-range dynamics instead of local appearance shortcuts.
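The training objective described above can be illustrated with a toy sketch. It is not the released implementation: the names `multi_chunk_loss`, `mcp_weight`, and `shift`, the use of a plain mean-squared error in place of a denoising loss, and the weight of 0.5 are all assumptions made for illustration. It only shows how a main loss on the current chunk can be combined with losses from a causal chain of future-chunk predictors.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
# Hypothetical interface: a module maps a hidden state to
# (new hidden state, predicted chunk).
Module = Callable[[Vector], Tuple[Vector, Vector]]


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean-squared error, standing in for the real denoising loss."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def multi_chunk_loss(
    backbone: Module,
    mcp_modules: Sequence[Module],
    context: Vector,
    targets: Sequence[Vector],
    mcp_weight: float = 0.5,  # assumed weighting, not from the source
) -> float:
    """targets[0] is the current chunk; targets[k] lies k chunks ahead."""
    hidden, pred = backbone(context)
    loss = mse(pred, targets[0])
    # Causal chain: each MCP module consumes the hidden state produced
    # by the previous stage and predicts one chunk further ahead.
    for module, target in zip(mcp_modules, targets[1:]):
        hidden, pred = module(hidden)
        loss += mcp_weight * mse(pred, target)
    return loss


def shift(hidden: Vector) -> Tuple[Vector, Vector]:
    """Toy module: the 'dynamics' simply add 1 to every element."""
    nxt = [x + 1.0 for x in hidden]
    return nxt, nxt


perfect = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]]
loss_perfect = multi_chunk_loss(shift, [shift] * 3, [0.0, 0.0], perfect)

# Perturb only the next^1 target: the error appears in the total loss
# even though the current-chunk prediction is still exact.
perturbed = [[1.0, 1.0], [3.0, 3.0], [3.0, 3.0], [4.0, 4.0]]
loss_perturbed = multi_chunk_loss(shift, [shift] * 3, [0.0, 0.0], perturbed)
```

In this toy setup a mistake in a future chunk contributes to the total loss even when the current chunk is predicted exactly, which is the sense in which the future-prediction terms add supervision beyond the current chunk.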
The same trained checkpoint supports two inference modes:
- Zero-overhead mode: remove MCP modules and run the main model exactly like the baseline.
- MCP-accelerated mode: keep the first MCP module so one autoregressive step produces both the current chunk and the next chunk.
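As a rough illustration of the difference between the two modes, the sketch below counts sequential autoregressive steps for a rollout. The function name `sequential_steps` is hypothetical, and the count ignores the extra compute of the retained MCP module, so it indicates where the 2x figure comes from rather than measuring wall-clock speed.

```python
import math


def sequential_steps(num_chunks: int, use_mcp: bool) -> int:
    """Autoregressive steps needed to generate `num_chunks` chunks.

    Zero-overhead mode emits one chunk per step. MCP-accelerated mode
    emits the current chunk and the next chunk in the same step.
    """
    chunks_per_step = 2 if use_mcp else 1
    return math.ceil(num_chunks / chunks_per_step)


standard = sequential_steps(8, use_mcp=False)
accelerated = sequential_steps(8, use_mcp=True)
print(f"standard: {standard} steps, MCP-accelerated: {accelerated} steps")
```

For an odd number of chunks, the last accelerated step only needs its first output, so the step count rounds up.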
Next Forcing converges faster than LingBot-VA across frame rates. The gain is most pronounced at 50 fps: on the Random setting, Next Forcing reaches LingBot-VA's 45k-step accuracy at only 20k steps, corresponding to roughly 2.3x faster convergence.
Next Forcing achieves the best average success rate on the RoboTwin benchmark across 50 bimanual manipulation tasks.
| Setting | X-VLA | pi_0 | pi_0.5 | Motus | Being-H0.7 | Fast-WAM | LingBot-VA | Next Forcing |
|---|---|---|---|---|---|---|---|---|
| Clean | 72.9 | 65.9 | 82.7 | 88.7 | 90.2 | 91.9 | 92.9 | 94.1 |
| Random | 72.8 | 58.4 | 76.8 | 87.0 | 89.6 | 91.8 | 91.5 | 93.5 |
MCP-accelerated inference predicts the next video chunk in parallel with the current chunk, reducing sequential video denoising cost while preserving comparable accuracy.
| Inference Mode | 12 fps Clean | 12 fps Random | 25 fps Clean | 25 fps Random | 50 fps Clean | 50 fps Random |
|---|---|---|---|---|---|---|
| Standard | 94.1 | 93.5 | 92.6 | 91.4 | 91.8 | 90.5 |
| MCP-accelerated (2x) | 93.5 | 90.6 | 91.0 | 89.8 | 92.2 | 91.3 |
On PhyWorld, Next Forcing improves both video quality and physical consistency over LingBot-VA.
| Method | FVD, OOT (↓) | FVD, IT (↓) | Abnormal Ratio, OOT (↓) | Abnormal Ratio, IT (↓) |
|---|---|---|---|---|
| LingBot-VA | 5.3 | 3.5 | 12% | 3% |
| Next Forcing | 4.7 | 3.2 | 8% | 2% |
On 3.5M in-house general video clips, Next Forcing also improves pure video generation after removing the action stream.
At 50k training steps, Next Forcing reduces FVD by 58% on Test Set 1 (94 vs. 225) and by 52% on Test Set 2 (97 vs. 204). It also surpasses LingBot-VA's 50k-step FVD with only 10k training steps.
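The percentages follow from the FVD values quoted above. A minimal check, where the helper name `relative_reduction` is introduced here purely for illustration:

```python
def relative_reduction(baseline: float, ours: float) -> float:
    """Percentage by which `ours` is lower than `baseline`."""
    return (baseline - ours) / baseline * 100.0


# FVD at 50k training steps, as reported above: (LingBot-VA, Next Forcing).
results = {"Test Set 1": (225, 94), "Test Set 2": (204, 97)}

for name, (baseline, ours) in results.items():
    print(f"{name}: {relative_reduction(baseline, ours):.1f}% lower FVD")
```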
- Project page and demos
- Paper
- Training and inference code
- Model checkpoints
@article{nextforcing,
title={Next Forcing: Causal World Modeling with Multi-Chunk Prediction},
author={Gangwei Xu and Qihang Zhang and Jiaming Zhou and Xing Zhu and Yujun Shen and Xin Yang and Yinghao Xu},
journal={},
year={2026}
}

