Guiding Diffusion-based Reconstruction with Contrastive Signals for Balanced Visual Representation (CVPR 2026)
Authors: Boyu Han, Qianqian Xu*, Shilong Bao, Zhiyong Yang, Ruochen Cui, Xilin Zhao, Qingming Huang*
[2026-3-6] 🔥 Released the DCR code. Everyone is warmly welcome to use it and share feedback or suggestions!
[2026-2-27] Our paper has been accepted to CVPR 2026.
- Clone the repository:

  ```bash
  git clone https://github.com/boyuh/DCR.git
  ```
- Install the required packages:

  ```bash
  conda create -n DCR python=3.10 -y
  conda activate DCR
  pip install --upgrade pip
  pip install -r requirements.txt
  ```
- Download the CC3M dataset. Please refer to here for more details.
- Place the CC3M dataset in the `dataset` folder with the following structure:

  ```
  DCR
  ├── dataset
  │   └── CC3M
  └── ...
  ```
- Download the pre-trained Stable Diffusion model. We recommend Stable Diffusion v2.1.
- Download the pre-trained CLIP models. Please refer to the following links:

  | CLIP Backbone | Model Name | Link |
  |---|---|---|
  | OpenAICLIP ViT-L-14@224 | clip-vit-large-patch14 | 🤗 |
  | OpenAICLIP ViT-L-14@336 | clip-vit-large-patch14-336 | 🤗 |
  | MetaCLIP ViT-L-14@224 | metaclip-l14-fullcc2.5b | 🤗 |
  | MetaCLIP ViT-H-14@224 | metaclip-h14-fullcc2.5b | 🤗 |
  | SigLIP ViT-SO-14@224 | siglip-so400m-patch14-224 | 🤗 |
  | SigLIP ViT-SO-14@384 | siglip-so400m-patch14-384 | 🤗 |
- Place the pre-trained models in the `pretrained_weights` folder with the following structure:

  ```
  DCR
  ├── pretrained_weights
  │   ├── SD
  │   │   └── stable-diffusion-v2-1
  │   ├── OpenAICLIP
  │   │   ├── clip-vit-large-patch14
  │   │   └── clip-vit-large-patch14-336
  │   ├── MetaCLIP
  │   │   ├── metaclip-l14-fullcc2.5b
  │   │   └── metaclip-h14-fullcc2.5b
  │   └── SigLIP
  │       ├── siglip-so400m-patch14-224
  │       └── siglip-so400m-patch14-384
  └── ...
  ```
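Fetching the CLIP checkpoints into the layout above can be scripted with `huggingface_hub`. A minimal sketch — the Hub repo IDs below are our best guesses, not confirmed by this repo, so verify them against the links in the table before running:

```python
import os

# Assumed mapping from the folder layout above to Hugging Face Hub repo IDs.
# The repo IDs are assumptions -- double-check each one on the Hub.
CLIP_WEIGHTS = {
    "pretrained_weights/OpenAICLIP/clip-vit-large-patch14": "openai/clip-vit-large-patch14",
    "pretrained_weights/OpenAICLIP/clip-vit-large-patch14-336": "openai/clip-vit-large-patch14-336",
    "pretrained_weights/MetaCLIP/metaclip-l14-fullcc2.5b": "facebook/metaclip-l14-fullcc2.5b",
    "pretrained_weights/MetaCLIP/metaclip-h14-fullcc2.5b": "facebook/metaclip-h14-fullcc2.5b",
    "pretrained_weights/SigLIP/siglip-so400m-patch14-224": "google/siglip-so400m-patch14-224",
    "pretrained_weights/SigLIP/siglip-so400m-patch14-384": "google/siglip-so400m-patch14-384",
}

def download_all():
    # Imported here so the mapping can be inspected without huggingface_hub installed.
    from huggingface_hub import snapshot_download
    for local_dir, repo_id in CLIP_WEIGHTS.items():
        os.makedirs(local_dir, exist_ok=True)
        snapshot_download(repo_id=repo_id, local_dir=local_dir)
```

Calling `download_all()` then mirrors the six CLIP checkpoints into the expected folders; the Stable Diffusion weights still need to be placed under `pretrained_weights/SD` separately.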
In Stage-1, we train only the projector while keeping the CLIP model frozen.
```bash
# OpenAICLIP ViT-L-14@224
bash train_scripts/scripts_train_OpenAICLIP_224_stage1.sh

# OpenAICLIP ViT-L-14@336
bash train_scripts/scripts_train_OpenAICLIP_336_stage1.sh

# MetaCLIP ViT-L-14@224
bash train_scripts/scripts_train_MetaCLIP_L_stage1.sh

# MetaCLIP ViT-H-14@224
bash train_scripts/scripts_train_MetaCLIP_H_stage1.sh

# SigLIP ViT-SO-14@224
bash train_scripts/scripts_train_SigLIP_224_stage1.sh

# SigLIP ViT-SO-14@384
bash train_scripts/scripts_train_SigLIP_384_stage1.sh
```

In Stage-2, we only finetune the CLIP model.
```bash
# OpenAICLIP ViT-L-14@224
bash train_scripts/scripts_train_OpenAICLIP_224_stage2.sh

# OpenAICLIP ViT-L-14@336
bash train_scripts/scripts_train_OpenAICLIP_336_stage2.sh

# MetaCLIP ViT-L-14@224
bash train_scripts/scripts_train_MetaCLIP_L_stage2.sh

# MetaCLIP ViT-H-14@224
bash train_scripts/scripts_train_MetaCLIP_H_stage2.sh

# SigLIP ViT-SO-14@224
bash train_scripts/scripts_train_SigLIP_224_stage2.sh

# SigLIP ViT-SO-14@384
bash train_scripts/scripts_train_SigLIP_384_stage2.sh
```

We provide the enhanced CLIP weights for the six CLIP backbones at this Link.
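Conceptually, the two stages differ only in which parameters receive gradients: Stage-1 updates the projector with the CLIP backbone frozen, and Stage-2 finetunes CLIP. An illustrative, framework-agnostic sketch — the `projector.`/`clip.` name prefixes are hypothetical, not the repo's actual module names:

```python
def trainable_param_names(named_params, stage):
    """Return the parameter names to optimize for the given training stage.

    named_params: iterable of (name, param) pairs, as returned by e.g.
    PyTorch's model.named_parameters(). Stage 1 trains only the projector;
    stage 2 finetunes only the CLIP backbone.
    """
    if stage == 1:
        keep = "projector."   # hypothetical prefix for projector parameters
    elif stage == 2:
        keep = "clip."        # hypothetical prefix for CLIP parameters
    else:
        raise ValueError(f"unknown stage: {stage}")
    return [name for name, _ in named_params if name.startswith(keep)]
```

With a real model, one would then set `requires_grad` accordingly and pass only the selected parameters to the optimizer; the provided scripts handle this for you.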
| CLIP Backbone | MMVP-VLM (Original) | MMVP-VLM (Ours) | Checkpoint |
|---|---|---|---|
| OpenAICLIP ViT-L-14@224 | 19.2 | 33.3 | Google Drive |
| OpenAICLIP ViT-L-14@336 | 20.0 | 31.1 | Google Drive |
| MetaCLIP ViT-L-14@224 | 23.7 | 32.6 | Google Drive |
| MetaCLIP ViT-H-14@224 | 25.2 | 37.8 | Google Drive |
| SigLIP ViT-SO-14@224 | 37.8 | 43.0 | Google Drive |
| SigLIP ViT-SO-14@384 | 37.0 | 42.2 | Google Drive |
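As a quick sanity check on the table, the per-backbone gains and their mean can be tallied directly from the numbers above:

```python
# MMVP-VLM accuracy (original, ours) per CLIP backbone, copied from the table above.
results = {
    "OpenAICLIP ViT-L-14@224": (19.2, 33.3),
    "OpenAICLIP ViT-L-14@336": (20.0, 31.1),
    "MetaCLIP ViT-L-14@224": (23.7, 32.6),
    "MetaCLIP ViT-H-14@224": (25.2, 37.8),
    "SigLIP ViT-SO-14@224": (37.8, 43.0),
    "SigLIP ViT-SO-14@384": (37.0, 42.2),
}
gains = {k: round(ours - orig, 1) for k, (orig, ours) in results.items()}
mean_gain = round(sum(gains.values()) / len(gains), 1)  # average absolute gain
```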
Please first download the MMVP-VLM benchmark.
We provide evaluation scripts for the six CLIP backbones. To evaluate a model, run the corresponding command:
```bash
# OpenAICLIP ViT-L-14@224
python evaluation/evaluate_mmvp_OpenAICLIP_224.py --benchmark_dir 'YOUR_MMVP_VLM_PATH' --vision_tower_name 'YOUR_VISION_TOWER'

# OpenAICLIP ViT-L-14@336
python evaluation/evaluate_mmvp_OpenAICLIP_336.py --benchmark_dir 'YOUR_MMVP_VLM_PATH' --vision_tower_name 'YOUR_VISION_TOWER'

# MetaCLIP ViT-L-14@224
python evaluation/evaluate_mmvp_MetaCLIP_large.py --benchmark_dir 'YOUR_MMVP_VLM_PATH' --vision_tower_name 'YOUR_VISION_TOWER'

# MetaCLIP ViT-H-14@224
python evaluation/evaluate_mmvp_MetaCLIP_huge.py --benchmark_dir 'YOUR_MMVP_VLM_PATH' --vision_tower_name 'YOUR_VISION_TOWER'

# SigLIP ViT-SO-14@224
python evaluation/evaluate_mmvp_SigLIP_224.py --benchmark_dir 'YOUR_MMVP_VLM_PATH' --vision_tower_name 'YOUR_VISION_TOWER'

# SigLIP ViT-SO-14@384
python evaluation/evaluate_mmvp_SigLIP_384.py --benchmark_dir 'YOUR_MMVP_VLM_PATH' --vision_tower_name 'YOUR_VISION_TOWER'
```

If you find our work inspiring or use our codebase in your research, please cite our work.
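To evaluate all six backbones in one go, the commands above can be generated programmatically. A small sketch — the script-name suffixes must match the files above, and the two arguments are placeholders for your own paths:

```python
# Build the six evaluation commands; pass each to subprocess.run(cmd) to execute.
SUFFIXES = ["OpenAICLIP_224", "OpenAICLIP_336", "MetaCLIP_large",
            "MetaCLIP_huge", "SigLIP_224", "SigLIP_384"]

def eval_commands(benchmark_dir, vision_tower):
    """Return the argv list for each MMVP-VLM evaluation script."""
    return [
        ["python", f"evaluation/evaluate_mmvp_{s}.py",
         "--benchmark_dir", benchmark_dir,
         "--vision_tower_name", vision_tower]
        for s in SUFFIXES
    ]
```

Note that each command should point `--vision_tower_name` at the checkpoint matching that script's backbone, so in practice you would pass a per-backbone path rather than a single one.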
```bibtex
@inproceedings{han2026dcr,
  title={Guiding Diffusion-based Reconstruction with Contrastive Signals for Balanced Visual Representation},
  author={Boyu Han and Qianqian Xu and Shilong Bao and Zhiyong Yang and Ruochen Cui and Xilin Zhao and Qingming Huang},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2026}
}
```
If you find any issues or would like to contribute bug fixes, please contact Boyu Han (Email: hanboyu23z@ict.ac.cn).
Our code is built upon the excellent project GenHancer. Thanks for their valuable contributions to the community!

