Guiding Diffusion-based Reconstruction with Contrastive Signals for Balanced Visual Representation (CVPR 2026)
Authors: Boyu Han, Qianqian Xu*, Shilong Bao, Zhiyong Yang, Ruochen Cui, Xilin Zhao, Qingming Huang*
[2026-3-6] 🔥 Released the DCR code. Everyone is warmly welcome to use it and share feedback or suggestions!
[2026-2-27] Our paper has been accepted to CVPR 2026.
- Clone the repository:

  ```bash
  git clone https://github.com/boyuh/DCR.git
  ```
- Install the required packages:

  ```bash
  conda create -n DCR python=3.10 -y
  conda activate DCR
  pip install --upgrade pip
  pip install -r requirements.txt
  ```
- Download the CC3M dataset. Please refer to here for more details.
- Place the CC3M dataset in the `dataset` folder with the following structure:

  ```
  DCR
  ├── dataset
  │   └── CC3M
  └── ...
  ```
- Download the pre-trained Stable Diffusion model. We recommend Stable Diffusion v2.1.
- Download the pre-trained CLIP models. Please refer to the following links:

  | CLIP Backbone | Model Name | Link |
  |---|---|---|
  | OpenAICLIP ViT-L-14@224 | clip-vit-large-patch14 | 🤗 |
  | OpenAICLIP ViT-L-14@336 | clip-vit-large-patch14-336 | 🤗 |
  | MetaCLIP ViT-L-14@224 | metaclip-l14-fullcc2.5b | 🤗 |
  | MetaCLIP ViT-H-14@224 | metaclip-h14-fullcc2.5b | 🤗 |
  | SigLIP ViT-SO-14@224 | siglip-so400m-patch14-224 | 🤗 |
  | SigLIP ViT-SO-14@384 | siglip-so400m-patch14-384 | 🤗 |
- Place the pre-trained models in the `pretrained_weights` folder with the following structure:

  ```
  DCR
  ├── pretrained_weights
  │   ├── SD
  │   │   └── stable-diffusion-v2-1
  │   ├── OpenAICLIP
  │   │   ├── clip-vit-large-patch14
  │   │   └── clip-vit-large-patch14-336
  │   ├── MetaCLIP
  │   │   ├── metaclip-l14-fullcc2.5b
  │   │   └── metaclip-h14-fullcc2.5b
  │   └── SigLIP
  │       ├── siglip-so400m-patch14-224
  │       └── siglip-so400m-patch14-384
  └── ...
  ```
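Fetching the CLIP checkpoints into the layout above can be scripted with `huggingface_hub`. A minimal sketch — the Hub repo IDs below are our best guesses, not confirmed by this repo, so verify them against the links in the table before running:

```python
import os

# Assumed mapping from the folder layout above to Hugging Face Hub repo IDs.
# The repo IDs are assumptions -- double-check each one on the Hub.
CLIP_WEIGHTS = {
    "pretrained_weights/OpenAICLIP/clip-vit-large-patch14": "openai/clip-vit-large-patch14",
    "pretrained_weights/OpenAICLIP/clip-vit-large-patch14-336": "openai/clip-vit-large-patch14-336",
    "pretrained_weights/MetaCLIP/metaclip-l14-fullcc2.5b": "facebook/metaclip-l14-fullcc2.5b",
    "pretrained_weights/MetaCLIP/metaclip-h14-fullcc2.5b": "facebook/metaclip-h14-fullcc2.5b",
    "pretrained_weights/SigLIP/siglip-so400m-patch14-224": "google/siglip-so400m-patch14-224",
    "pretrained_weights/SigLIP/siglip-so400m-patch14-384": "google/siglip-so400m-patch14-384",
}

def download_all():
    # Imported here so the mapping can be inspected without huggingface_hub installed.
    from huggingface_hub import snapshot_download
    for local_dir, repo_id in CLIP_WEIGHTS.items():
        os.makedirs(local_dir, exist_ok=True)
        snapshot_download(repo_id=repo_id, local_dir=local_dir)
```

Calling `download_all()` then mirrors the six CLIP checkpoints into the expected folders; the Stable Diffusion weights still need to be placed under `pretrained_weights/SD` separately.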
In Stage-1, we train only the projector while keeping the CLIP model frozen.
```bash
# OpenAICLIP ViT-L-14@224
bash train_scripts/scripts_train_OpenAICLIP_224_stage1.sh

# OpenAICLIP ViT-L-14@336
bash train_scripts/scripts_train_OpenAICLIP_336_stage1.sh

# MetaCLIP ViT-L-14@224
bash train_scripts/scripts_train_MetaCLIP_L_stage1.sh

# MetaCLIP ViT-H-14@224
bash train_scripts/scripts_train_MetaCLIP_H_stage1.sh

# SigLIP ViT-SO-14@224
bash train_scripts/scripts_train_SigLIP_224_stage1.sh

# SigLIP ViT-SO-14@384
bash train_scripts/scripts_train_SigLIP_384_stage1.sh
```

In Stage-2, we only finetune the CLIP model.
```bash
# OpenAICLIP ViT-L-14@224
bash train_scripts/scripts_train_OpenAICLIP_224_stage2.sh

# OpenAICLIP ViT-L-14@336
bash train_scripts/scripts_train_OpenAICLIP_336_stage2.sh

# MetaCLIP ViT-L-14@224
bash train_scripts/scripts_train_MetaCLIP_L_stage2.sh

# MetaCLIP ViT-H-14@224
bash train_scripts/scripts_train_MetaCLIP_H_stage2.sh

# SigLIP ViT-SO-14@224
bash train_scripts/scripts_train_SigLIP_224_stage2.sh

# SigLIP ViT-SO-14@384
bash train_scripts/scripts_train_SigLIP_384_stage2.sh
```

We provide the enhanced CLIP weights for the six CLIP backbones at this Link.
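Conceptually, the two stages differ only in which parameters receive gradients: Stage-1 updates the projector with the CLIP backbone frozen, and Stage-2 finetunes CLIP. An illustrative, framework-agnostic sketch — the `projector.`/`clip.` name prefixes are hypothetical, not the repo's actual module names:

```python
def trainable_param_names(named_params, stage):
    """Return the parameter names to optimize for the given training stage.

    named_params: iterable of (name, param) pairs, as returned by e.g.
    PyTorch's model.named_parameters(). Stage 1 trains only the projector;
    stage 2 finetunes only the CLIP backbone.
    """
    if stage == 1:
        keep = "projector."   # hypothetical prefix for projector parameters
    elif stage == 2:
        keep = "clip."        # hypothetical prefix for CLIP parameters
    else:
        raise ValueError(f"unknown stage: {stage}")
    return [name for name, _ in named_params if name.startswith(keep)]
```

With a real model, one would then set `requires_grad` accordingly and pass only the selected parameters to the optimizer; the provided scripts handle this for you.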
| CLIP Backbone | MMVP-VLM (Original) | MMVP-VLM (Ours) | Checkpoint |
|---|---|---|---|
| OpenAICLIP ViT-L-14@224 | 19.2 | 33.3 | Google Drive |
| OpenAICLIP ViT-L-14@336 | 20.0 | 31.1 | Google Drive |
| MetaCLIP ViT-L-14@224 | 23.7 | 32.6 | Google Drive |
| MetaCLIP ViT-H-14@224 | 25.2 | 37.8 | Google Drive |
| SigLIP ViT-SO-14@224 | 37.8 | 43.0 | Google Drive |
| SigLIP ViT-SO-14@384 | 37.0 | 42.2 | Google Drive |
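As a quick sanity check on the table, the per-backbone gains and their mean can be tallied directly from the numbers above:

```python
# MMVP-VLM accuracy (original, ours) per CLIP backbone, copied from the table above.
results = {
    "OpenAICLIP ViT-L-14@224": (19.2, 33.3),
    "OpenAICLIP ViT-L-14@336": (20.0, 31.1),
    "MetaCLIP ViT-L-14@224": (23.7, 32.6),
    "MetaCLIP ViT-H-14@224": (25.2, 37.8),
    "SigLIP ViT-SO-14@224": (37.8, 43.0),
    "SigLIP ViT-SO-14@384": (37.0, 42.2),
}
gains = {k: round(ours - orig, 1) for k, (orig, ours) in results.items()}
mean_gain = round(sum(gains.values()) / len(gains), 1)  # average absolute gain
```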
Please first download the MMVP-VLM benchmark.
We provide evaluation scripts for the six CLIP backbones. To evaluate a model, run the corresponding command:
```bash
# OpenAICLIP ViT-L-14@224
python evaluation/evaluate_mmvp_OpenAICLIP_224.py --benchmark_dir 'YOUR_MMVP_VLM_PATH' --vision_tower_name 'YOUR_VISION_TOWER'

# OpenAICLIP ViT-L-14@336
python evaluation/evaluate_mmvp_OpenAICLIP_336.py --benchmark_dir 'YOUR_MMVP_VLM_PATH' --vision_tower_name 'YOUR_VISION_TOWER'

# MetaCLIP ViT-L-14@224
python evaluation/evaluate_mmvp_MetaCLIP_large.py --benchmark_dir 'YOUR_MMVP_VLM_PATH' --vision_tower_name 'YOUR_VISION_TOWER'

# MetaCLIP ViT-H-14@224
python evaluation/evaluate_mmvp_MetaCLIP_huge.py --benchmark_dir 'YOUR_MMVP_VLM_PATH' --vision_tower_name 'YOUR_VISION_TOWER'

# SigLIP ViT-SO-14@224
python evaluation/evaluate_mmvp_SigLIP_224.py --benchmark_dir 'YOUR_MMVP_VLM_PATH' --vision_tower_name 'YOUR_VISION_TOWER'

# SigLIP ViT-SO-14@384
python evaluation/evaluate_mmvp_SigLIP_384.py --benchmark_dir 'YOUR_MMVP_VLM_PATH' --vision_tower_name 'YOUR_VISION_TOWER'
```

If you find our work inspiring or use our codebase in your research, please cite our work.
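To evaluate all six backbones in one go, the commands above can be generated programmatically. A small sketch — the script-name suffixes must match the files above, and the two arguments are placeholders for your own paths:

```python
# Build the six evaluation commands; pass each to subprocess.run(cmd) to execute.
SUFFIXES = ["OpenAICLIP_224", "OpenAICLIP_336", "MetaCLIP_large",
            "MetaCLIP_huge", "SigLIP_224", "SigLIP_384"]

def eval_commands(benchmark_dir, vision_tower):
    """Return the argv list for each MMVP-VLM evaluation script."""
    return [
        ["python", f"evaluation/evaluate_mmvp_{s}.py",
         "--benchmark_dir", benchmark_dir,
         "--vision_tower_name", vision_tower]
        for s in SUFFIXES
    ]
```

Note that each command should point `--vision_tower_name` at the checkpoint matching that script's backbone, so in practice you would pass a per-backbone path rather than a single one.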
```bibtex
@inproceedings{han2026dcr,
  title={Guiding Diffusion-based Reconstruction with Contrastive Signals for Balanced Visual Representation},
  author={Boyu Han and Qianqian Xu and Shilong Bao and Zhiyong Yang and Ruochen Cui and Xilin Zhao and Qingming Huang},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2026}
}
```
If you find any issues or would like to contribute bug fixes, please contact Boyu Han (Email: hanboyu23z@ict.ac.cn).
Our code is built upon the excellent project GenHancer. Thanks for their valuable contributions to the community!

