
UnityVideo Logo

UnityVideo: Unified Multi-Modal Multi-Task Learning for Enhancing World-Aware Video Generation

arXiv Project Page License Model Dataset

Jiehui Huang1 · Yuechen Zhang2 · Xu He3 · Yuan Gao4 · Zhi Cen4 · Bin Xia2 ·
Yan Zhou4 · Xin Tao4 · Pengfei Wan4 · Jiaya Jia1,✉

1HKUST · 2CUHK · 3Tsinghua University · 4Kling Team, Kuaishou Technology

Corresponding Author


📢 Code will be released soon! Stay tuned! 🚀


📖 Introduction

UnityVideo is a unified generalist framework for multi-modal, multi-task video generation and understanding that enables:

  • 🎨 Text-to-Video Generation: Create high-quality videos from text descriptions
  • 🎮 Controllable Generation: Fine-grained control over video generation with various modalities
  • 🔍 Modality Estimation: Estimate depth, surface normals, and other modalities from video
  • 🌟 Zero-Shot Generalization: Strong generalization to novel objects and styles without additional training

Our unified architecture achieves state-of-the-art performance across multiple video generation benchmarks while maintaining efficiency and scalability.


🔥 Highlights

  • Unified Framework: Single model handles multiple video understanding tasks
  • Multi-Modal Support: Seamlessly processes text, image, and video inputs
  • World-Aware Generation: Enhanced physical understanding and consistency
  • Flexible Control: Support for various control signals (depth, edge, pose, etc.)
  • High Quality: State-of-the-art visual quality and temporal consistency
  • Efficient Training: Joint multi-task learning improves data efficiency

🎯 Method

UnityVideo employs a unified multi-modal multi-task learning framework that consists of:

  1. Multi-Modal Encoder: Processes diverse input modalities (text, image, video)
  2. Unified Transformer Backbone: Shared representation learning across tasks
  3. Task-Specific Heads: Specialized decoders for different generation and estimation tasks
  4. Joint Training Strategy: Simultaneous optimization across all tasks

This architecture enables knowledge sharing and improves generalization across different video understanding tasks.
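
To make the four components above concrete, here is a minimal PyTorch sketch of a shared-backbone, multi-task model. This is not the released UnityVideo code: the class name `UnifiedVideoModel`, the projection dimensions, the task names, and the placeholder loss are illustrative assumptions; only the overall structure (modality-specific projections, a shared transformer backbone, and per-task heads trained jointly) mirrors the description above.

```python
# Hypothetical sketch of a unified multi-modal multi-task model
# (illustration only, not the official UnityVideo implementation).
import torch
import torch.nn as nn


class UnifiedVideoModel(nn.Module):
    def __init__(self, dim=512, num_layers=6, num_heads=8,
                 tasks=("t2v", "depth", "normal")):
        super().__init__()
        # 1. Multi-modal encoders: project each input modality into a shared token space.
        self.text_proj = nn.Linear(768, dim)    # e.g. tokens from a text encoder
        self.video_proj = nn.Linear(1024, dim)  # e.g. patch features from a video encoder
        # 2. Unified transformer backbone: shared representation learning across tasks.
        layer = nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers)
        # 3. Task-specific heads: one lightweight decoder per generation/estimation task.
        self.heads = nn.ModuleDict({t: nn.Linear(dim, dim) for t in tasks})

    def forward(self, text_tokens, video_tokens, task):
        # Concatenate modality tokens into one sequence and run the shared backbone.
        x = torch.cat([self.text_proj(text_tokens),
                       self.video_proj(video_tokens)], dim=1)
        x = self.backbone(x)
        return self.heads[task](x)


# 4. Joint training strategy: alternate tasks per batch while updating one shared model.
model = UnifiedVideoModel()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
text = torch.randn(2, 16, 768)    # dummy text tokens
video = torch.randn(2, 64, 1024)  # dummy video tokens
for task in ("t2v", "depth"):
    out = model(text, video, task)
    loss = out.pow(2).mean()      # placeholder loss for illustration only
    loss.backward()               # gradients from both tasks accumulate
optimizer.step()
```

Sharing the backbone across tasks is what lets gradients from, say, depth estimation refine the representation used for text-to-video generation, which is the knowledge-sharing and data-efficiency benefit highlighted above.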


📊 Results Gallery

🎬 Text-to-Video Generation

More examples coming soon.

🎮 Controllable Generation

More examples coming soon.

🔍 Modality Estimation

More examples coming soon.

🗓️ TODO List

  • Release training code
  • Release inference code
  • Release pretrained models
  • Release data
  • Add Gradio demo, Colab notebook, and more usage examples
  • Release arXiv paper

⚖️ License

This repository is released under the Apache-2.0 license as found in the LICENSE file.

🚀 Stay Tuned for Updates!

Follow this project to get notified when we release the code!


📚 Citation

If you find this work useful for your research, please cite:

@article{huang2025unityvideo,
  title={UnityVideo: Unified Multi-Modal Multi-Task Learning for Enhancing World-Aware Video Generation},
  author={Huang, Jiehui and Zhang, Yuechen and He, Xu and Gao, Yuan and Cen, Zhi and Xia, Bin and Zhou, Yan and Tao, Xin and Wan, Pengfei and Jia, Jiaya},
  journal={arXiv preprint arXiv:2512.07831},
  year={2025}
}
