
UniGen: Enhanced Training & Test-Time Strategies for Unified Multimodal Understanding and Generation

This project accompanies the research paper,

UniGen: Enhanced Training & Test-Time Strategies for Unified Multimodal Understanding and Generation
Rui Tian*, Mingfei Gao*, Mingze Xu*, Jiaming Hu, Jiasen Lu, Zuxuan Wu, Yinfei Yang, Afshin Dehghan

Figure: the workflow of UniGen using test-time scaling and CoT-V.

UniGen is a unified multimodal large language model (MLLM) capable of both image understanding and generation. We detail UniGen's full training pipeline from a data-centric perspective, including its multi-stage pre-training, supervised fine-tuning, and direct preference optimization.

More importantly, we introduce Chain-of-Thought Verification (CoT-V), a novel test-time strategy that significantly boosts image generation quality using a simple Best-of-N approach.
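To make the mechanism concrete, the sketch below shows Best-of-N selection driven by a chain-of-thought verifier. The functions generate_image and cot_verify are hypothetical stand-ins for the model's generation and verification calls, and the [0, 1] score range is an assumption for illustration, not the repository's actual API.

    # Minimal sketch of Best-of-N test-time scaling with CoT-V.
    import random

    def generate_image(prompt: str) -> str:
        # Hypothetical stand-in: return an identifier for one sampled image.
        return f"image-{random.randrange(1_000_000):06d}"

    def cot_verify(prompt: str, image: str) -> float:
        # Hypothetical stand-in: a CoT-derived prompt-alignment score in [0, 1].
        return random.random()

    def best_of_n(prompt: str, n: int = 5) -> str:
        """Generate n candidates; keep the one the CoT verifier scores highest."""
        candidates = [generate_image(prompt) for _ in range(n)]
        scores = [cot_verify(prompt, img) for img in candidates]
        return candidates[scores.index(max(scores))]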

📢 News

  • [11/18] 🚀🚀🚀 UniGen-1.5 is on arXiv!
  • [9/19] 🔥🔥🔥 UniGen has been accepted to NeurIPS 2025!


🚀 Getting Started

Installation

This code requires Python >= 3.10.12, PyTorch >= 2.4.1, and CUDA 12.4.

  1. [Optional but recommended] Create and activate a new conda environment.

    conda create -n unigen python=3.10.12

    And activate the environment.

    conda activate unigen
    
  2. Install the required dependencies.

    bash scripts/setup.sh
  3. Download the pre-trained weights for Qwen2.5-1.5B-Instruct, MAGViTv2, and SigLIP from Hugging Face and place them in the unigen_data/checkpoints directory.

    huggingface-cli download Qwen/Qwen2.5-1.5B-Instruct --repo-type model --local-dir unigen_data/checkpoints/Qwen2.5-1.5B-Instruct
    
    huggingface-cli download showlab/magvitv2 --repo-type model --local-dir unigen_data/checkpoints/magvitv2
    
    huggingface-cli download google/siglip-so400m-patch14-384 --repo-type model --local-dir unigen_data/checkpoints/siglip-so400m-patch14-384
  4. Add your OpenAI API key and organization to your environment variables for model evaluation.

    export OPENAI_API_KEY="YOUR_OPENAI_API_KEY"
    export OPENAI_ORG="YOUR_OPENAI_ORG"  # Optional
  5. [Optional] Add your Weights & Biases API key to enable logging during training.

    wandb login "YOUR_WANDB_API_KEY"
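After setup, you can sanity-check that the environment meets the requirements above with a short Python snippet (the expected versions are the ones listed at the top of this section):

    # Print the interpreter, PyTorch, and CUDA versions for a quick check.
    import sys
    import torch

    print("Python:", sys.version.split()[0])      # expect >= 3.10.12
    print("PyTorch:", torch.__version__)          # expect >= 2.4.1
    print("CUDA runtime:", torch.version.cuda)    # expect 12.4
    print("GPU available:", torch.cuda.is_available())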

Data Preparation

Prepare the following datasets and place them in the unigen_data/datasets directory.

  1. Text-only Dataset: Download RefinedWeb from Hugging Face.

    huggingface-cli download tiiuae/falcon-refinedweb --repo-type dataset --local-dir unigen_data/datasets/falcon-refinedweb
  2. Image-Text Pair Dataset (for Pre-training): Download CC-12M, CC-3M, Segment-Anything-11M, and ImageNet-21K. Prepare all datasets in the WebDataset format and re-caption the images using the following prompt template:

    <|im_start|>system
    You are a helpful assistant.<|im_end|>
    <|im_start|>user
    <|vision_start|><|image_pad|><|vision_end|>
    What is the content of this image?<|im_end|>
    <|im_start|>assistant
    

    The re-annotated caption should be saved under the .txt key of each WebDataset sample, and the image under the .png, .jpg, .jpeg, or .webp key, as in the sketch below.
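    For example, a shard with these keys can be written using the webdataset package. This is a minimal sketch: the dataset paths, caption, and shard name are illustrative, not the repository's actual layout.

      # Pack re-captioned image-text pairs into a WebDataset shard,
      # storing raw image bytes under .jpg and the caption under .txt.
      import webdataset as wds

      pairs = [("images/000000001.jpg", "A dog running on a sandy beach.")]  # illustrative

      with wds.TarWriter("unigen_data/datasets/cc12m/shard-000000.tar") as sink:
          for idx, (image_path, caption) in enumerate(pairs):
              with open(image_path, "rb") as f:
                  image_bytes = f.read()  # bytes are written to the tar as-is
              sink.write({"__key__": f"{idx:09d}", "jpg": image_bytes, "txt": caption})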

  3. Supervised Fine-Tuning (SFT) Data:

  4. Direct Preference Optimization (DPO) Data:

    • Prepare text prompts from various sources.

    • Set up the data annotation environment with vLLM.

      pip install vllm==0.7.3
    • Convert prompts into related visual questions using an LLM.

      python scripts/dataflows/zeroshot_questions.py --metadata_path /path/to/prompt --out_path /path/to/out --model_name Qwen/Qwen2.5-7B-Instruct
    • Generate N image samples for each text prompt with the UniGen-SFT model.

    • Run the pseudo-labeling pipeline on each image-question pair; a sketch for assembling DPO preference pairs from the results follows this list.

      python scripts/dataflows/zeroshot_vqa.py --metadata_path /path/to/visual_question --out_path /path/to/out --image_root /path/to/img --model_name Qwen/Qwen2.5-VL-7B-Instruct
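Once every image has pseudo-labels, preference pairs can be assembled per prompt. The sketch below assumes one plausible rule, pairing the highest-scoring image against the lowest-scoring one, with an illustrative record layout; the repository's actual pairing logic and file formats may differ.

    # Assemble chosen/rejected pairs for DPO from per-image VQA scores.
    # The score (mean of binary answers) and the record fields are assumptions.
    from statistics import mean

    def to_preference_pairs(samples):
        """samples maps a prompt to a list of (image_path, [0/1 answers])."""
        pairs = []
        for prompt, images in samples.items():
            ranked = sorted(images, key=lambda s: mean(s[1]), reverse=True)
            best, worst = ranked[0], ranked[-1]
            if mean(best[1]) > mean(worst[1]):  # skip ties: no preference signal
                pairs.append({"prompt": prompt, "chosen": best[0], "rejected": worst[0]})
        return pairs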

⚙️ Training Scripts

Pre-training: Stage 1

Run the following script for Stage 1 pre-training on 2x 80GB H100/A100 GPUs.

bash scripts/run_pretraining.sh \
     --experiment_config configs/unigen_1_5b/unigen_pt1.yaml \
     --output_dir path_to_your_out \
     --train_module train.py 

Pre-training: Stage 2

Place the final checkpoint from Stage 1 (unigen_pt1/checkpoint-150000) in unigen_data/checkpoints. Then, run the following command for Stage 2 pre-training on 4x 80GB H100/A100 GPUs.

bash scripts/run_pretraining.sh \
     --experiment_config configs/unigen_1_5b/unigen_pt2.yaml \
     --pretrained_model  unigen_pt1/checkpoint-150000  \
     --output_dir path_to_your_out \
     --train_module train.py 

Supervised Fine-Tuning

Place the final checkpoint from Stage 2 (unigen_pt2/checkpoint-400000) in unigen_data/checkpoints. Then, run the following command for SFT on 1x 80GB H100/A100 GPU.

bash scripts/run_sft.sh \
     --experiment_config configs/unigen_1_5b/unigen_sft.yaml \
     --pretrained_model  unigen_pt2/checkpoint-400000 \
     --train_module train_w_clip_vit.py \
     --output_dir path_to_your_out 

Direct Preference Optimization

Place the final SFT checkpoint (unigen_sft/checkpoint-145824) in unigen_data/checkpoints. Then, run the following command for DPO on 1x 80GB H100/A100 GPU.

bash scripts/run_sft.sh \
    --experiment_config configs/unigen_1_5b/unigen_dpo.yaml  \
    --pretrained_model unigen_sft/checkpoint-145824 \
    --train_module train_dpo.py \
    --output_dir path_to_your_out 

CoT-V Post-Training

Place the final DPO checkpoint (unigen_dpo/unwrapped_model) in unigen_data/checkpoints. Then, run the following command for CoT-V post-training on 1x 80GB H100/A100 GPU.

bash scripts/run_cotv.sh \
    --experiment_config configs/unigen_1_5b/unigen_cotv_post_sft.yaml \
    --pretrained_model unigen_dpo/unwrapped_model \
    --train_module train_w_clip_vit.py \
    --output_dir path_to_your_out 

Evaluation Scripts

Evaluation Installation

Install the necessary requirements and clone the repositories required for evaluation on understanding (lmms-eval) and generation (DPGbench, GenEval) benchmarks.

bash scripts/setup_eval.sh

Next, download the checkpoints required for evaluation.

LOCAL_CHECKPOINT_DIR=unigen_data/checkpoints
python -c $'from modelscope.hub.snapshot_download import snapshot_download\nsnapshot_download("damo/mplug_visual-question-answering_coco_large_en")'
bash third_party/geneval/evaluation/download_models.sh $LOCAL_CHECKPOINT_DIR

Evaluating UniGen-PT1 Checkpoints

bash scripts/run_evaluation.sh \
    --config configs/unigen_1_5b/unigen_pt1.yaml  \
    --eval_modules geneval+dpgbench \
    --eval_checkpoint unigen_pt1/checkpoint-150000 \
    --output_dir path_to_your_out \
    --local_shared_fs unigen_data

Evaluating UniGen-PT2 Checkpoints

bash scripts/run_evaluation.sh \
     --config  configs/unigen_1_5b/unigen_pt2.yaml \
     --eval_modules geneval+dpgbench \
     --eval_checkpoint unigen_pt2/checkpoint-400000 \
     --output_dir path_to_your_out \
     --local_shared_fs unigen_data

Evaluating UniGen-SFT Checkpoints

bash scripts/run_evaluation.sh \
    --config configs/unigen_1_5b/unigen_sft.yaml  \
    --lmms_tasks "mmmu_val,gqa,ai2d,mme,mathvista_testmini,mmvet" \
    --eval_modules lmms \
    --eval_checkpoint unigen_sft/checkpoint-145824 \
    --output_dir path_to_your_out \
    --local_shared_fs unigen_data

bash scripts/run_evaluation.sh \
    --config configs/unigen_1_5b/unigen_sft.yaml  \
    --lmms_tasks "realworldqa,scienceqa_img,seedbench,pope" \
    --eval_modules lmms+geneval+dpgbench \
    --eval_checkpoint unigen_sft/checkpoint-145824 \
    --output_dir path_to_your_out \
    --local_shared_fs unigen_data

Evaluating UniGen-DPO Checkpoints

bash scripts/run_evaluation.sh \
    --config configs/unigen_1_5b/unigen_dpo.yaml  \
    --lmms_tasks "mmmu_val,gqa,ai2d,mme,mathvista_testmini,mmvet" \
    --eval_modules lmms \
    --eval_checkpoint unigen_dpo/unwrapped_model \
    --output_dir path_to_your_out \
    --local_shared_fs unigen_data
    
bash scripts/run_evaluation.sh \
    --config configs/unigen_1_5b/unigen_dpo.yaml  \
    --lmms_tasks "realworldqa,scienceqa_img,seedbench,pope" \
    --eval_modules lmms+geneval+dpgbench \
    --eval_checkpoint unigen_dpo/unwrapped_model \
    --output_dir path_to_your_out \
    --local_shared_fs unigen_data

Evaluating UniGen after CoT-V Post-Training

bash scripts/run_evaluation.sh \
    --config configs/unigen_1_5b/unigen_cotv_post_sft.yaml  \
    --lmms_tasks "mmmu_val,gqa,ai2d,mme,mathvista_testmini,mmvet" \
    --eval_modules lmms \
    --eval_checkpoint unigen/checkpoint-500 \
    --output_dir path_to_your_out \
    --local_shared_fs unigen_data

bash scripts/run_evaluation.sh \
    --config configs/unigen_1_5b/unigen_cotv_post_sft.yaml \
    --lmms_tasks "realworldqa,scienceqa_img,seedbench,pope" \
    --eval_modules lmms+geneval+dpgbench \
    --eval_checkpoint unigen/checkpoint-500 \
    --output_dir path_to_your_out \
    --local_shared_fs unigen_data

Test-time Scaling of UniGen with CoT-V

To perform Best-of-N test-time scaling with CoT-V (here N=5), pass --mmu_rating_style think as in the commands below.

A. On the GenEval Benchmark

bash scripts/run_evaluation.sh \
    --config configs/unigen_1_5b/unigen_cotv_post_sft.yaml \
    --eval_modules cot-gen \
    --eval_checkpoint unigen/checkpoint-500 \
    --local_shared_fs unigen_data \
    --output_dir path_to_your_out \
    --mmu_rating_style think

B. On the DPG Benchmark

bash scripts/run_evaluation.sh \
    --config configs/unigen_1_5b/unigen_cotv_post_sft.yaml \
    --eval_modules cot-dpg \
    --eval_checkpoint unigen/checkpoint-500 \
    --local_shared_fs unigen_data \
    --output_dir path_to_your_out \
    --mmu_rating_style think

License

This project is licensed under the Apple Sample Code License.

Citations

If you use the data, code, or models provided here in a publication, please cite our papers:

@article{tian2025unigen,
  title={UniGen: Enhanced Training \& Test-Time Strategies for Unified Multimodal Understanding and Generation},
  author={Tian, Rui and Gao, Mingfei and Xu, Mingze and Hu, Jiaming and Lu, Jiasen and Wu, Zuxuan and Yang, Yinfei and Dehghan, Afshin},
  journal={arXiv preprint arXiv:2505.14682},
  year={2025}
}

@article{tian2025unigen1.5,
  title={UniGen-1.5: Enhancing Image Generation and Editing through Reward Unification in Reinforcement Learning},
  author={Tian, Rui and Gao, Mingfei and Gang, Haiming and Lu, Jiasen and Gan, Zhe and Yang, Yinfei and Wu, Zuxuan and Dehghan, Afshin},
  journal={arXiv preprint arXiv:2511.14760},
  year={2025}
}
