DriveWAM: Video Generative Priors Enable Scalable World-Action Modeling for Autonomous Driving
Chen Shi*, Jinrui Xu*, Shaoshuai Shi, Kehua Sheng, Bo Zhang, Li Jiang†
The Chinese University of Hong Kong, Shenzhen & Voyager Research, Didi Chuxing
*Equal Contribution, †Corresponding Author
- Unified Video-Action Policy: Adapts a pretrained video diffusion transformer (Wan2.2-TI2V-5B) into an end-to-end driving policy via joint flow-matching over video and action tokens.
- Scene-Evolving Driving Guidance: A frozen VLM (Qwen3-VL-8B) generates chunk-specific semantic intent, which is injected via temporally localized cross-attention.
- Selective KV Memory: Training-free, modality-aware cache selection achieves a 12x memory reduction for 300 s rollouts with minimal accuracy loss.
- Strong Performance: 90.1 PDMS on NAVSIM v1 (single front-view camera, simple regression head) and 0.83 m ADE@4s on the PhysicalAI-Autonomous-Vehicles benchmark.
- [2025/27] Code will be released soon. Stay tuned!
- Release inference code and pretrained checkpoints
- Release data preparation scripts
- Release training code
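
For readers unfamiliar with the ADE@4s number quoted in the highlights, the following is a minimal sketch of how an average displacement error is commonly computed: the mean Euclidean distance between predicted and ground-truth waypoints over the evaluation horizon. It is not the benchmark's official evaluation code. The function name, the 1 Hz sampling, and the waypoints are all assumptions made for illustration.

```python
import math


def average_displacement_error(pred, gt):
    """Mean Euclidean distance between matched (x, y) waypoints.

    Hypothetical helper: the result is in the same unit as the inputs.
    """
    if len(pred) != len(gt) or not pred:
        raise ValueError("trajectories must be non-empty and of equal length")
    total = sum(
        math.hypot(px - gx, py - gy)
        for (px, py), (gx, gy) in zip(pred, gt)
    )
    return total / len(pred)


# Assumed 4 s horizon sampled at 1 Hz; the benchmark's actual rate may differ.
gt = [(5.0, 0.0), (10.0, 0.0), (15.0, 0.0), (20.0, 0.0)]
pred = [(5.0, 0.3), (10.0, 0.4), (15.0, 0.0), (20.5, 0.0)]

ade = average_displacement_error(pred, gt)
print(f"ADE@4s = {ade:.2f} m")
```

Lower is better: a value of 0.83 m would mean the predicted waypoints lie, on average, 0.83 m from the logged trajectory over the 4 s horizon.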
If you find this work useful, please consider citing:
@article{shi2025drivewam,
title = {DriveWAM: Video Generative Priors Enable Scalable World-Action Modeling for Autonomous Driving},
author = {Shi, Chen and Xu, Jinrui and Shi, Shaoshuai and Sheng, Kehua and Zhang, Bo and Jiang, Li},
journal = {arXiv preprint arXiv:2605.28544},
year = {2025}
}

DriveWAM is built upon LinBotVA, Wan2.2, and Qwen3-VL. We thank the authors for their great work.