
UnityVideo Logo

UnityVideo: Unified Multi-Modal Multi-Task Learning for Enhancing World-Aware Video Generation

arXiv Project Page License Model Dataset

Jiehui Huang1 · Yuechen Zhang2 · Xu He3 · Yuan Gao4 · Zhi Cen4 · Bin Xia2 ·
Yan Zhou4 · Xin Tao4 · Pengfei Wan4 · Jiaya Jia1,✉

1HKUST · 2CUHK · 3Tsinghua University · 4Kling Team, Kuaishou Technology

Corresponding Author


📢 Code will be released soon! Stay tuned! 🚀


📖 Introduction

UnityVideo is a unified generalist framework for multi-modal, multi-task video generation and understanding that enables:

  • 🎨 Text-to-Video Generation: Create high-quality videos from text descriptions
  • 🎮 Controllable Generation: Fine-grained control over video generation with various modalities
  • 🔍 Modality Estimation: Estimate depth, surface normals, and other modalities from video
  • 🌟 Zero-Shot Generalization: Strong generalization to novel objects and styles without additional training

Our unified architecture achieves state-of-the-art performance across multiple video generation benchmarks while maintaining efficiency and scalability.


🔥 Highlights

  • Unified Framework: Single model handles multiple video understanding tasks
  • Multi-Modal Support: Seamlessly processes text, image, and video inputs
  • World-Aware Generation: Enhanced physical understanding and consistency
  • Flexible Control: Support for various control signals (depth, edge, pose, etc.)
  • High Quality: State-of-the-art visual quality and temporal consistency
  • Efficient Training: Joint multi-task learning improves data efficiency

🎯 Method

UnityVideo employs a unified multi-modal multi-task learning framework that consists of:

  1. Multi-Modal Encoder: Processes diverse input modalities (text, image, video)
  2. Unified Transformer Backbone: Shared representation learning across tasks
  3. Task-Specific Heads: Specialized decoders for different generation and estimation tasks
  4. Joint Training Strategy: Simultaneous optimization across all tasks

This architecture enables knowledge sharing and improves generalization across different video understanding tasks.
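
To make the four components above concrete, here is a minimal PyTorch sketch of a shared-backbone, multi-task model. This is not the released UnityVideo code: the class name `UnifiedVideoModel`, the projection dimensions, the task names, and the placeholder loss are illustrative assumptions; only the overall structure (modality-specific projections, a shared transformer backbone, and per-task heads trained jointly) mirrors the description above.

```python
# Hypothetical sketch of a unified multi-modal multi-task model
# (illustration only, not the official UnityVideo implementation).
import torch
import torch.nn as nn


class UnifiedVideoModel(nn.Module):
    def __init__(self, dim=512, num_layers=6, num_heads=8,
                 tasks=("t2v", "depth", "normal")):
        super().__init__()
        # 1. Multi-modal encoders: project each input modality into a shared token space.
        self.text_proj = nn.Linear(768, dim)    # e.g. tokens from a text encoder
        self.video_proj = nn.Linear(1024, dim)  # e.g. patch features from a video encoder
        # 2. Unified transformer backbone: shared representation learning across tasks.
        layer = nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers)
        # 3. Task-specific heads: one lightweight decoder per generation/estimation task.
        self.heads = nn.ModuleDict({t: nn.Linear(dim, dim) for t in tasks})

    def forward(self, text_tokens, video_tokens, task):
        # Concatenate modality tokens into one sequence and run the shared backbone.
        x = torch.cat([self.text_proj(text_tokens),
                       self.video_proj(video_tokens)], dim=1)
        x = self.backbone(x)
        return self.heads[task](x)


# 4. Joint training strategy: alternate tasks per batch while updating one shared model.
model = UnifiedVideoModel()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
text = torch.randn(2, 16, 768)    # dummy text tokens
video = torch.randn(2, 64, 1024)  # dummy video tokens
for task in ("t2v", "depth"):
    out = model(text, video, task)
    loss = out.pow(2).mean()      # placeholder loss for illustration only
    loss.backward()               # gradients from both tasks accumulate
optimizer.step()
```

Sharing the backbone across tasks is what lets gradients from, say, depth estimation refine the representation used for text-to-video generation, which is the knowledge-sharing and data-efficiency benefit highlighted above.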


📊 Results Gallery

🎬 Text-to-Video Generation

More examples coming soon.

🎮 Controllable Generation

More examples coming soon.

🔍 Modality Estimation

More examples coming soon.

🗓️ TODO List

  • Release training code
  • Release inference code
  • Release pretrained models
  • Release data
  • Add Gradio demo, Colab notebook, and more usage examples
  • Release arXiv paper

⚖️ License

This repository is released under the Apache-2.0 license as found in the LICENSE file.

🚀 Stay Tuned for Updates!

Follow this project to get notified when we release the code!


📚 Citation

If you find this work useful for your research, please cite:

@article{huang2025unityvideo,
  title={UnityVideo: Unified Multi-Modal Multi-Task Learning for Enhancing World-Aware Video Generation},
  author={Huang, Jiehui and Zhang, Yuechen and He, Xu and Gao, Yuan and Cen, Zhi and Xia, Bin and Zhou, Yan and Tao, Xin and Wan, Pengfei and Jia, Jiaya},
  journal={arXiv preprint arXiv:2512.07831},
  year={2025}
}
