DeepGen

DeepGen 1.0: A Lightweight Unified Multimodal Model for Advancing Image Generation and Editing

Shanghai Innovation Institute, DeepGen Team

Paper PDF · Project Page · DeepGen RL · Model Ckpt · Data

🔥 News

  • Feb 13, 2026: We released DeepGen 1.0. Pre-training, Supervised Fine-Tuning, and Reinforcement Learning checkpoints are available on Hugging Face, supporting both T2I generation and image editing.
  • Feb 13, 2026: We released the training code, covering Pre-training, Supervised Fine-Tuning, and Reinforcement Learning (deepgen_rl), along with evaluation code supporting a wide range of benchmarks.
  • Feb 13, 2026: We released the DeepGen 1.0 technical report on arXiv.

✨ Introduction

Broader Scenario and Dimension Coverage. We propose DeepGen 1.0, a lightweight unified multimodal model with only 5B parameters (3B VLM + 2B DiT). It integrates five core capabilities within a single model: general image generation, general image editing, reasoning image generation, reasoning image editing, and text rendering. Across multiple authoritative benchmarks, DeepGen 1.0 is competitive with, or even surpasses, state-of-the-art unified multimodal models that are 3× to 16× larger, demonstrating that massive scaling is not the sole path to high-performance multimodal generation.

🧠 Method

Our core observation is that a lightweight model, when empowered by synergistic architecture design and data-centric training strategies, can achieve comprehensive capabilities competitive with or even surpassing much larger counterparts. To overcome the limitations of lightweight models in semantic understanding and fine-grained control, we introduce Stacked Channel Bridging (SCB), a deep alignment framework that extracts hierarchical features from multiple VLM layers and fuses them with learnable "think tokens" to provide the generative backbone with structured, reasoning-rich guidance (see the sketch below). We further design a data-centric training strategy spanning three progressive stages:

  • Alignment Pre-training on large-scale image-text pairs and editing triplets to synchronize VLM and DiT representations.
  • Joint Supervised Fine-tuning on a high-quality mixture of generation, editing, and reasoning tasks to foster omni-capabilities.
  • Reinforcement Learning with MR-GRPO, which leverages a mixture of reward functions and supervision signals, yielding substantial gains in generation quality and alignment with human preferences while maintaining stable training and avoiding visual artifacts.
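To make SCB concrete, here is a minimal PyTorch sketch. The tapped layer count, hidden sizes, think-token count, and the two-layer fusion MLP are illustrative assumptions for exposition, not the released DeepGen architecture.

import torch
import torch.nn as nn

class StackedChannelBridge(nn.Module):
    """Sketch of Stacked Channel Bridging (SCB): stack hidden states from
    several VLM layers along the channel axis, append learnable think
    tokens, and project the result into the DiT conditioning space.
    All dimensions below are hypothetical."""

    def __init__(self, vlm_dim=2048, dit_dim=1536, num_tapped_layers=4, num_think_tokens=64):
        super().__init__()
        fused_dim = vlm_dim * num_tapped_layers
        # Learnable think tokens, broadcast across the batch at forward time.
        self.think_tokens = nn.Parameter(0.02 * torch.randn(1, num_think_tokens, fused_dim))
        # Fusion MLP projecting stacked VLM channels to the DiT width.
        self.proj = nn.Sequential(
            nn.Linear(fused_dim, dit_dim),
            nn.GELU(),
            nn.Linear(dit_dim, dit_dim),
        )

    def forward(self, layer_hiddens):
        # layer_hiddens: list of [batch, seq, vlm_dim] tensors, one per tapped VLM layer.
        stacked = torch.cat(layer_hiddens, dim=-1)                 # [B, T, fused_dim]
        think = self.think_tokens.expand(stacked.size(0), -1, -1)  # [B, K, fused_dim]
        fused = torch.cat([stacked, think], dim=1)                 # [B, T+K, fused_dim]
        return self.proj(fused)                                    # [B, T+K, dit_dim]

# Usage: four tapped layers, batch of 2, 77-token prompts.
bridge = StackedChannelBridge()
hiddens = [torch.randn(2, 77, 2048) for _ in range(4)]
cond = bridge(hiddens)  # -> [2, 141, 1536] conditioning sequence for the DiT

Stacking along the channel axis preserves which layer each feature came from, so the DiT receives hierarchical low- to high-level guidance rather than a single averaged representation.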

💻 Train & Eval

Set up environment

git clone https://github.com/deepgenteam/deepgen.git
cd deepgen
conda create -n deepgen python=3.10 -y
conda activate deepgen
pip install -r requirements.txt
pip install flash_attn==2.8.3 --no-build-isolation
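As a quick sanity check (a hedged snippet; exact version strings depend on your environment), verify that the core dependencies import cleanly:

# check_env.py: confirm torch and flash-attn are installed correctly.
import torch
import flash_attn

print("torch", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("flash_attn", flash_attn.__version__)  # expect 2.8.3 per the pin above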

Data Preparation

Please see DATA for more details. We provide a detailed description of the data download and usage procedures for both the Pre-training stage and the Supervised Fine-Tuning stage.

Train

We provide the scripts for Interleaved Reasoning Tuning.

bash scripts/sft.sh

Replace the variables in the script with your own paths before running. See TRAIN for a detailed description of the training procedures for both the pre-training stage and the SFT stage.

Eval

We provide scripts for evaluating T2I and editing benchmarks, supporting World Knowledge-Enhanced Textual Reasoning and Fine-grained Editing-like Visual Refinement. Please see EVAL for more details.

📊 Benchmarks

1. General Image Generation

| Model | Params | Geneval ↑ | DPGBench ↑ | UniGenBench ↑ |
| --- | --- | --- | --- | --- |
| OmniGen2 | 3B + 4B | 0.80 | 83.57 | 63.09 |
| BAGEL | 14B | 0.82 | 85.10 | 61.53 |
| X-Omni | 7B + 12B | 0.83 | 87.65 🥉 | 53.77 |
| Lumina-DiMOO | 8B | 0.88 🥇 | 86.04 | 71.12 |
| Hunyuan-Image-3.0 | 80B | 0.72 | 86.10 | - |
| Qwen-Image | 7B + 20B | 0.87 🥈 | 88.32 🥇 | 78.81 🥇 |
| LongCat-Image | 7B + 6B | 0.87 🥈 | 86.80 | - |
| Z-Image-Turbo | 4B + 6B | 0.84 | 85.15 | 71.40 |
| GLM-Image | 9B + 7B | - | 84.78 | - |
| DeepGen 1.0 (SFT) | 3B + 2B | 0.86 🥉 | 87.05 | 74.18 🥉 |
| DeepGen 1.0 (RL) | 3B + 2B | 0.87 🥈 | 87.90 🥈 | 75.74 🥈 |

2. General Image Editing

| Model | Params | GEdit-EN ↑ | ImgEdit ↑ |
| --- | --- | --- | --- |
| BAGEL | 14B | 6.52 | 3.20 |
| Qwen-Image-Edit [2509] | 7B + 20B | 7.54 🥈 | 4.35 🥈 |
| LongCat-Image-Edit | 7B + 6B | 7.60 🥇 | 4.50 🥇 |
| Mammoth2 | 8B + 3B + 2B | 6.60 | 4.06 |
| DeepGen 1.0 (SFT) | 3B + 2B | 7.12 | 4.09 |
| DeepGen 1.0 (RL) | 3B + 2B | 7.17 🥉 | 4.14 🥉 |

3. Reasoning Image Generation

| Model | Params | WISE ↑ | T2I-CoREBench ↑ |
| --- | --- | --- | --- |
| OmniGen2 | 3B + 4B | 0.47 | 36.1 |
| BAGEL | 14B | 0.70 🥉 | 41.1 |
| Hunyuan-Image-3.0 | 80B | 0.57 | 46.0 |
| Qwen-Image | 7B + 20B | 0.62 | 46.3 🥉 |
| LongCat-Image | 7B + 6B | 0.65 | 52.2 🥇 |
| Z-Image-Turbo | 4B + 6B | - | 43.7 |
| DeepGen 1.0 (SFT) | 3B + 2B | 0.72 🥈 | 45.7 |
| DeepGen 1.0 (RL) | 3B + 2B | 0.73 🥇 | 46.5 🥈 |

4. Reasoning Image Editing

| Model | Params | RISE ↑ | UniREditBench ↑ |
| --- | --- | --- | --- |
| OmniGen2 | 3B + 4B | - | 43.4 |
| BAGEL | 14B | 11.9 🥈 | 51.0 |
| Qwen-Image-Edit [2509] | 7B + 20B | 8.9 | 56.5 🥉 |
| DeepGen 1.0 (SFT) | 3B + 2B | 13.3 🥇 | 77.5 🥇 |
| DeepGen 1.0 (RL) | 3B + 2B | 10.8 🥉 | 75.7 🥈 |

📧 Contact

dywang24@m.fudan.edu.cn, wjqdev@gmail.com

🎨 Qualitative results

⭐ Citation

🙏 Acknowledgement

The project builds upon the following pioneering works:

  • OpenUni: We thank the OpenUni team for releasing their elegant and concise code and pre-training dataset.
  • UniPic2-SD3.5M-Kontext-2B: We use UniPic2-SD3.5M-Kontext-2B as our diffusion module, considering its efficiency and strong performance on both T2I generation and editing.
  • UnifiedReward-Think: We use UnifiedReward-Think as our reward model for RL, considering its strong performance.
  • Qwen2.5 VL: We use Qwen2.5-VL-3B as our VLM module, considering its efficiency and strong multimodal understanding abilities.
  • BLIP3-o: We thank the BLIP3-o team for releasing their valuable high-quality tuning dataset.
  • OpenGPT-4o-Image: We thank the OpenGPT-4o-Image team for releasing their valuable high-quality tuning dataset.
  • ShareGPT-4o-Image: We thank the ShareGPT-4o-Image team for releasing their valuable high-quality tuning dataset.
  • Echo-4o: We thank the Echo-4o team for releasing their valuable high-quality tuning dataset.
  • OmniGen2: We thank the OmniGen2 team for releasing their valuable high-quality editing tuning dataset and code.
  • Uniworld-V1: We thank the Uniworld team for releasing their valuable high-quality tuning dataset and code.
  • Picobanana: We thank the Picobanana team for releasing their valuable high-quality editing tuning dataset.
  • Nano-consist: We thank the Nano-consist team for releasing their valuable high-quality editing tuning dataset.
  • NHR-edit: We thank the NHR-edit team for releasing their valuable high-quality editing tuning dataset.
  • UniREditBench: We thank the UniREditBench team for releasing their valuable high-quality reasoning-based editing tuning dataset.
