
ProCLIP: Progressive Vision-Language Alignment via LLM-based Embedder

Xiaoxing Hu1,2*, Kaicheng Yang3*, Ziyang Gong1, Qi Ming4, Zonghao Guo5, Xiang An3, Ziyong Feng3, Junchi Yan1, Xue Yang1†

1 Shanghai Jiao Tong University  2 Beijing Institute of Technology  3 DeepGlint
4 Beijing University of Technology  5 Tsinghua University

* Equal contribution  † Corresponding author

If you find our work helpful, please consider giving us a ⭐!

Official PyTorch implementation of ProCLIP: Progressive Vision-Language Alignment via LLM-based Embedder.

Notice

This repository is still being organized and refined. If you encounter any issues while using it, please contact us (Email: xiaoxinghhh@gmail.com | WeChat: 15111480307) or submit an issue. Thank you for your attention.

TODO

  • Training and validation instructions
  • Paper Link
  • Model Weights

📢 News

  • [2025-10-20] We have released the model weights; please check the Model Zoo for details.
  • [2025-10-22] We have updated the paper; please check arXiv for details.

📖 Introduction

This repository contains the official PyTorch implementation of ProCLIP: Progressive Vision-Language Alignment via LLM-based Embedder. We introduce a progressive vision-language alignment approach that aligns an LLM-based embedder with the CLIP image encoder in a curriculum learning manner to enhance long-text, multilingual, and fine-grained understanding.

Paper Link: arXiv

Model Zoo: HuggingFace

👁️ Methodology

  • Stage 1: Align the LLM-based embedder with the CLIP text encoder via Cross-Architecture Distillation.
  • Stage 2: Align the LLM-based embedder with the CLIP image encoder with Self-Distillation Regularization.
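To make the two stages concrete, the snippet below gives a minimal, illustrative sketch of what the corresponding training objectives could look like. It is not the official training code: the exact loss forms, the temperature, the self-distillation reference (here a frozen/EMA copy of the embedder), and the regularization weight are assumptions made for illustration.

```python
# Illustrative sketch only (not the official ProCLIP training code).
import torch
import torch.nn.functional as F

def stage1_cross_architecture_distillation(llm_text_emb, clip_text_emb):
    """Stage 1 (assumed form): pull the LLM-based embedder's text embeddings
    toward the frozen CLIP text encoder's embeddings via a cosine objective."""
    llm_text_emb = F.normalize(llm_text_emb, dim=-1)
    clip_text_emb = F.normalize(clip_text_emb, dim=-1)
    return 1.0 - (llm_text_emb * clip_text_emb).sum(dim=-1).mean()

def stage2_image_alignment(image_emb, llm_text_emb, llm_text_emb_ref,
                           temperature=0.07, reg_weight=1.0):
    """Stage 2 (assumed form): contrastive image-text alignment against the
    CLIP image encoder, plus a self-distillation regularizer that keeps the
    current embedder close to a frozen/EMA reference copy of itself."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(llm_text_emb, dim=-1)
    ref_emb = F.normalize(llm_text_emb_ref, dim=-1)

    # Symmetric InfoNCE over the batch (CLIP-style contrastive loss).
    logits = image_emb @ text_emb.t() / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    contrastive = 0.5 * (F.cross_entropy(logits, labels)
                         + F.cross_entropy(logits.t(), labels))

    # Self-distillation regularization: stay close to the reference embeddings.
    self_distill = F.mse_loss(text_emb, ref_emb)
    return contrastive + reg_weight * self_distill
```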

🛠️ Requirements

  • Python >= 3.9
  • CUDA >= 11.8 (if using GPU)
  • Other dependencies in requirements.txt

🚀 Installation

  • Clone this repository and install dependencies:
```bash
# Clone the repo
git clone https://github.com/VisionXLab/ProCLIP.git
cd ProCLIP

# Create virtual environment
conda create -n proclip python=3.9 -y
conda activate proclip

# Install dependencies
pip install -r requirements.txt
```
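The released checkpoints can be downloaded from the Hugging Face Model Zoo, e.g. with huggingface_hub. The snippet below is a small sketch; the repo id is a placeholder, so substitute the id actually listed in the Model Zoo.

```python
# Sketch: download released checkpoints from the Hugging Face Model Zoo.
# NOTE: "VisionXLab/ProCLIP" is a placeholder repo id; use the id listed
# in the Model Zoo.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="VisionXLab/ProCLIP")
print(f"Checkpoints downloaded to: {local_dir}")
```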

Training

Coming soon.

Evaluation

Coming soon.

📊 Results

Retrieval Results


Classification Results


Multilingual Retrieval Results


Comparison with other LLM-embedder-based CLIP models


  • More results can be found in the paper.

📜 Citation

If you find our work helpful, please cite our paper:

@misc{ProCLIP,
      title={ProCLIP: Progressive Vision-Language Alignment via LLM-based Embedder}, 
      author={Xiaoxing Hu and Kaicheng Yang and Ziyang Gong and Qi Ming and Zonghao Guo and Xiang An and Ziyong Feng and Junchi Yan and Xue Yang},
      year={2025},
      eprint={2510.18795},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2510.18795}, 
}

📝 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙌 Acknowledgments

Our work is inspired by LLM2CLIP and CLIP. We are grateful for their outstanding work and code.
