Jiehui Huang1 ·
Yuechen Zhang2 ·
Xu He3 ·
Yuan Gao4 ·
Zhi Cen4 ·
Bin Xia2 ·
Yan Zhou4 ·
Xin Tao4 ·
Pengfei Wan4 ·
Jiaya Jia1,✉
1HKUST · 2CUHK · 3Tsinghua University · 4Kling Team, Kuaishou Technology
✉Corresponding Author
UnityVideo is a unified generalist framework for multi-modal, multi-task video understanding and generation that enables:
- 🎨 Text-to-Video Generation: Create high-quality videos from text descriptions
- 🎮 Controllable Generation: Fine-grained control over video generation with various modalities
- 🔍 Modality Estimation: Estimate depth, surface normals, and other modalities from video
- 🌟 Zero-Shot Generalization: Strong generalization to novel objects and styles without additional training
Our unified architecture achieves state-of-the-art performance across multiple video generation benchmarks while maintaining efficiency and scalability.
- ✅ Unified Framework: Single model handles multiple video understanding tasks
- ✅ Multi-Modal Support: Seamlessly processes text, image, and video inputs
- ✅ World-Aware Generation: Enhanced physical understanding and consistency
- ✅ Flexible Control: Support for various control signals (depth, edge, pose, etc.)
- ✅ High Quality: State-of-the-art visual quality and temporal consistency
- ✅ Efficient Training: Joint multi-task learning improves data efficiency
UnityVideo employs a unified multi-modal multi-task learning framework that consists of:
- Multi-Modal Encoder: Processes diverse input modalities (text, image, video)
- Unified Transformer Backbone: Shared representation learning across tasks
- Task-Specific Heads: Specialized decoders for different generation and estimation tasks
- Joint Training Strategy: Simultaneous optimization across all tasks
This architecture enables knowledge sharing and improves generalization across different video understanding tasks.
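The routing described above can be sketched in miniature. This is a hypothetical, pure-Python illustration of the encoder → shared backbone → task-head flow, not the actual UnityVideo implementation (which uses a transformer backbone); all function names and the toy numeric operations are assumptions for illustration only.

```python
# Hypothetical sketch of UnityVideo-style routing: each modality is encoded
# into a shared space, passed through one shared backbone, then dispatched
# to a task-specific head. Toy arithmetic stands in for real layers.

def encode(modality: str, data: list[float]) -> list[float]:
    """Multi-modal encoder: map each input modality into a shared token space."""
    scale = {"text": 1.0, "image": 0.5, "video": 0.25}[modality]
    return [x * scale for x in data]

def backbone(tokens: list[float]) -> list[float]:
    """Unified backbone: representation learning shared across all tasks."""
    mean = sum(tokens) / len(tokens)
    return [t - mean for t in tokens]  # toy centering in place of attention

# Task-specific heads: specialized decoders for generation and estimation.
HEADS = {
    "t2v": lambda h: ("video", h),        # text-to-video generation
    "depth": lambda h: ("depth_map", h),  # modality estimation
    "control": lambda h: ("video", h),    # controllable generation
}

def forward(modality: str, data: list[float], task: str):
    """One forward pass: encoder -> shared backbone -> task head."""
    hidden = backbone(encode(modality, data))
    return HEADS[task](hidden)

kind, out = forward("text", [2.0, 4.0], "t2v")
```

Because the backbone is shared, gradients from every task update the same parameters during joint training, which is what enables the knowledge sharing described above.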
| More examples coming soon |
- Release training code
- Release inference code
- Release pretrained models
- Release data
- Add Gradio demo, Colab notebook, and more usage examples
- Release arXiv paper
This repository is released under the Apache-2.0 license as found in the LICENSE file.
Follow this project to get notified when we release the code!
If you find this work useful for your research, please cite:
@article{huang2025unityvideo,
  title={UnityVideo: Unified Multi-Modal Multi-Task Learning for Enhancing World-Aware Video Generation},
  author={Huang, Jiehui and Zhang, Yuechen and He, Xu and Gao, Yuan and Cen, Zhi and Xia, Bin and Zhou, Yan and Tao, Xin and Wan, Pengfei and Jia, Jiaya},
  journal={arXiv preprint arXiv:2512.07831},
  year={2025}
}

