GitHub - baaivision/DIVA: Diffusion Feedback Helps CLIP See Better

Diffusion Feedback Helps CLIP See Better

Wenxuan Wang¹,²,³*, Quan Sun³*, Fan Zhang³, Yepeng Tang⁴, Jing Liu¹,², Xinlong Wang³

¹CASIA, ²UCAS, ³BAAI, ⁴BJTU
* Equal contribution

⏰ Schedule

[2024-08-07] We release CLIP model weights! 💥

[2024-08-05] We release training & evaluation code! 💥

[2024-07-30] Our paper is released on arXiv! 💥

💡 Motivation

In this work, we present a simple post-training approach for CLIP models that largely overcomes their visual shortcomings through a self-supervised diffusion process. We introduce DIVA, which uses the DIffusion model as a Visual Assistant for CLIP. Specifically, DIVA leverages generative feedback from text-to-image diffusion models to optimize CLIP representations using only images, without corresponding text. We demonstrate that DIVA substantially improves CLIP's performance on the challenging MMVP-VLM benchmark, which assesses fine-grained visual abilities (e.g., by 3-7%), and enhances the performance of MLLMs and vision models on multimodal understanding and segmentation tasks. Extensive evaluation on 29 image classification and retrieval benchmarks confirms that DIVA preserves CLIP's strong zero-shot capabilities.

🤖 Architecture

Given an image, the CLIP model encodes its visual features, which form the main part of the condition. The generative diffusion model then predicts the added noise, taking the noisy image and the condition as input. We optimize CLIP's representation by maximizing the image likelihood with the diffusion loss via generative feedback.
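The objective described above can be written in the standard noise-prediction form used by denoising diffusion models. This is a sketch under assumptions, not necessarily the exact loss from the paper: the symbols $\theta$ (CLIP parameters), $\phi$ (diffusion model parameters), $c_\theta$ (the condition built from CLIP features) and $\bar{\alpha}_t$ (the noise schedule) are notation introduced here for illustration.

```latex
% Forward noising of a clean image x_0 at timestep t
x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,
\qquad \epsilon \sim \mathcal{N}(0, I)

% Noise-prediction loss, conditioned on CLIP features of the same image
\mathcal{L}(\theta) =
  \mathbb{E}_{x_0,\, \epsilon,\, t}
  \left[ \left\lVert \epsilon - \epsilon_\phi\!\left(x_t,\, t,\, c_\theta(x_0)\right) \right\rVert_2^2 \right]

% Generative feedback: update the CLIP parameters only
\theta^{*} = \arg\min_{\theta}\ \mathcal{L}(\theta)
```

Minimizing this loss with respect to $\theta$ is the usual surrogate for maximizing image likelihood. The sketch assumes $\phi$ is held fixed, which this README does not state explicitly.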

🔨 Installation

Clone this repository and install the required packages:

git clone https://github.com/baaivision/DIVA.git
cd DIVA
mkdir -p outputs logs datasets pretrained_weights/CLIP pretrained_weights/SD

conda create -n diva python=3.9
conda activate diva
pip install -r requirements.txt

Core packages:

PyTorch version 2.0.0
open-clip-torch version 2.24.0
timm version 0.9.8
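
To confirm that the environment matches the core package versions listed above, a small check like the following may help. It is a minimal sketch: `check_versions` is a hypothetical helper, not part of this repository, and the distribution names `torch`, `open-clip-torch` and `timm` are assumed to be the names under which pip installed the packages.

```python
from importlib import metadata

# Versions taken from the "Core packages" list above.
EXPECTED = {"torch": "2.0.0", "open-clip-torch": "2.24.0", "timm": "0.9.8"}


def check_versions(expected=EXPECTED, lookup=metadata.version):
    """Return {name: (installed_version_or_None, matches_expected)}."""
    report = {}
    for name, wanted in expected.items():
        try:
            found = lookup(name)
        except metadata.PackageNotFoundError:
            found = None
        # PyTorch wheels often carry a local suffix such as "+cu118";
        # compare only the public part of the version string.
        ok = found is not None and found.split("+")[0] == wanted
        report[name] = (found, ok)
    return report


if __name__ == "__main__":
    for name, (found, ok) in check_versions().items():
        status = "ok" if ok else "MISMATCH"
        print(f"{name}: installed={found} expected={EXPECTED[name]} [{status}]")
```

Run it inside the activated `diva` environment. A mismatch does not necessarily break training, but the code modification step later in this README replaces files inside these packages, so matching versions is the safer choice.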

🍹 Preparation for DIVA's Generative Fine-tuning

Data Acquisition

For data preparation, please refer to image2dataset and MMVP for the training and evaluation data used in this work. After collecting the datasets, put them in the datasets/ folder created during installation.

Pre-trained Weight Downloading

For the pre-trained weights, please refer to OpenAI ViT-L-14/224&336, MetaCLIP ViT-L/H-14, SigLIP ViT-SO-14/224, SigLIP ViT-SO-14/384, DFN ViT-H-14/224, DFN ViT-H-14/378 and SD-2-1-base. These provide the discriminative CLIP models and the diffusion model that supplies generative feedback. After downloading the weights, move the CLIP weights to pretrained_weights/CLIP/ and the diffusion model weights to pretrained_weights/SD/.
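
Before launching training, it can be useful to verify that the weight folders are in place. The following is a minimal sketch: `missing_weight_dirs` is a hypothetical helper, and it only checks that each folder exists and is non-empty, since the exact file names expected by the training scripts are not listed in this README.

```python
from pathlib import Path

# Folder names taken from the installation and download steps above.
REQUIRED_DIRS = ("pretrained_weights/CLIP", "pretrained_weights/SD")


def missing_weight_dirs(repo_root=".", required=REQUIRED_DIRS):
    """Return the required folders that are absent or contain no entries."""
    missing = []
    for rel in required:
        folder = Path(repo_root) / rel
        if not folder.is_dir() or not any(folder.iterdir()):
            missing.append(rel)
    return missing


if __name__ == "__main__":
    problems = missing_weight_dirs()
    if problems:
        print("Missing or empty:", ", ".join(problems))
    else:
        print("All weight folders are populated.")
```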

Code Modification

DIVA's condition design requires modifying some source code in the installed CLIP, OpenCLIP and timm packages.

For OpenAI CLIP, use the content of our provided condition/OpenAICLIP_for_clip_model.py to replace the content of Your Conda Installation Path/anaconda3/envs/diva/lib/python3.9/site-packages/clip/model.py.

For MetaCLIP and DFN, use the content of our provided condition/MetaCLIP_for_openclip_transformer.py or condition/DFN_for_openclip_transformer.py, respectively, to replace the content of Your Conda Installation Path/anaconda3/envs/diva/lib/python3.9/site-packages/open_clip/transformer.py. Both files replace the same target, so only one of the two can be active at a time.

For SigLIP, use the content of our provided condition/SigLIP_for_timm_models_visiontransformer.py to replace the content of Your Conda Installation Path/anaconda3/envs/diva/lib/python3.9/site-packages/timm/models/vision_transformer.py.
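
The replacements above can also be scripted, which avoids typing the site-packages path by hand. The following is a minimal sketch, not a tool shipped with this repository: `apply_patch`, `package_dir`, the `PATCHES` table and the `.orig` backup suffix are all assumptions introduced for illustration. It locates the installed package with `importlib.util.find_spec`, keeps a one-time backup of the original file, and then copies the provided file over it.

```python
import importlib.util
import shutil
from pathlib import Path

# variant -> (file provided in this repo, installed package, file inside that package)
PATCHES = {
    "openai": ("condition/OpenAICLIP_for_clip_model.py", "clip", "model.py"),
    "metaclip": ("condition/MetaCLIP_for_openclip_transformer.py", "open_clip", "transformer.py"),
    "dfn": ("condition/DFN_for_openclip_transformer.py", "open_clip", "transformer.py"),
    "siglip": ("condition/SigLIP_for_timm_models_visiontransformer.py", "timm", "models/vision_transformer.py"),
}


def package_dir(package):
    """Locate an installed top-level package directory without importing it."""
    spec = importlib.util.find_spec(package)
    if spec is None or not spec.submodule_search_locations:
        raise ModuleNotFoundError(f"package {package!r} is not installed")
    return Path(list(spec.submodule_search_locations)[0])


def apply_patch(variant, repo_root=".", site_dir=None):
    """Overwrite the installed file with the DIVA version, backing it up once."""
    source_rel, package, target_rel = PATCHES[variant]
    source = Path(repo_root) / source_rel
    # site_dir is an optional override, mainly useful for dry runs and tests.
    base = Path(site_dir) / package if site_dir is not None else package_dir(package)
    target = base / target_rel
    backup = target.with_name(target.name + ".orig")
    if not backup.exists():
        shutil.copy2(target, backup)  # preserve the pristine file only once
    shutil.copy2(source, target)
    return target


if __name__ == "__main__":
    print("patched", apply_patch("openai"))
```

Restoring the `.orig` file undoes the change. Because MetaCLIP and DFN share the same target file, switching between them means applying the other variant again.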

🍻 Quick Start for Training & Evaluation

After completing the preparation steps above, start DIVA training with the command for your CLIP backbone:

# For OpenAICLIP
bash DIVA_for_OpenAICLIP.sh

# For MetaCLIP
bash DIVA_for_MetaCLIP.sh

# For SigLIP
bash DIVA_for_SigLIP.sh

# For DFN
bash DIVA_for_DFN.sh
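
When training is launched from another script, the choice of shell script can be made explicit. The following is a minimal sketch: `build_command`, `launch` and the short backbone keys are hypothetical names introduced here, while the script file names are the ones listed above.

```python
import subprocess
from pathlib import Path

# Short keys are assumptions; the script names come from the commands above.
SCRIPTS = {
    "openai": "DIVA_for_OpenAICLIP.sh",
    "metaclip": "DIVA_for_MetaCLIP.sh",
    "siglip": "DIVA_for_SigLIP.sh",
    "dfn": "DIVA_for_DFN.sh",
}


def build_command(backbone, repo_root="."):
    """Return the bash command for a backbone, checking that the script exists."""
    try:
        script = SCRIPTS[backbone.lower()]
    except KeyError:
        raise ValueError(
            f"unknown backbone {backbone!r}; expected one of {sorted(SCRIPTS)}"
        ) from None
    if not (Path(repo_root) / script).is_file():
        raise FileNotFoundError(f"{script} not found in {repo_root}")
    return ["bash", script]


def launch(backbone, repo_root="."):
    """Run the training script from the repository root; raises on failure."""
    return subprocess.run(build_command(backbone, repo_root), cwd=repo_root, check=True)
```

Calling `launch("siglip")` from the repository root is then equivalent to running `bash DIVA_for_SigLIP.sh` by hand.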

Model Zoo

Method	Image Size	Params (M)	MMVP-VLM Average Score (gain)
OpenAI ViT-L-14	224²	427.6	25.9 (+6.6)
OpenAI ViT-L-14	336²	427.9	25.2 (+5.2)
MetaCLIP ViT-L-14	224²	427.6	27.4 (+3.7)
MetaCLIP ViT-H-14	224²	986.1	31.9 (+6.7)
SigLIP ViT-SO-14	224²	877.4	40.7 (+2.9)
SigLIP ViT-SO-14	384²	878.0	38.5 (+1.5)
DFN ViT-H-14	224²	986.1	43.7 (+4.4)
DFN ViT-H-14	378²	986.7	37.8 (+3.0)

Note that, because of randomness in the condition design during training and in the selection of local patch tokens during inference for OpenAI CLIP, the scores obtained on the MMVP-VLM benchmark with our provided OpenAI CLIP weights may differ from the results reported in our paper. If the scores fall short of expectations, we recommend trying several different random seeds.
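
Trying several seeds is easier when every random number generator is seeded from one place. The following is a minimal sketch: `seed_everything` is a hypothetical helper, and the DIVA scripts may already expose their own seed setting, which this README does not document.

```python
import random


def seed_everything(seed):
    """Seed Python's RNG, plus NumPy and PyTorch when they are installed."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy is optional for this sketch
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  # silently ignored when CUDA is unavailable
    except ImportError:
        pass  # PyTorch is optional for this sketch


if __name__ == "__main__":
    for seed in (0, 1, 2):
        seed_everything(seed)
        # Run one evaluation here and record the score for this seed.
        print("seed", seed, "first draw", random.random())
```

Recording the score for each seed makes it possible to report the spread rather than a single run.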

🎨 Visualization

💙 Acknowledgement

DIVA is built upon the awesome Diffusion-TTA, MMVP, CLIP, OpenCLIP and timm.

📝 Citation

@article{wang2024diffusion,
      title={Diffusion Feedback Helps CLIP See Better},
      author={Wang, Wenxuan and Sun, Quan and Zhang, Fan and Tang, Yepeng and Liu, Jing and Wang, Xinlong},
      journal={arXiv preprint arXiv:2407.20171},
      year={2024}
}
