Xiaoxing Hu1,2* Kaicheng Yang3* Ziyang Gong1 Qi Ming4 Zonghao Guo5 Xiang An3 Ziyong Feng3 Junchi Yan1 Xue Yang1†
4 Beijing University of Technology 5 Tsinghua University
* Equal contribution † Corresponding author
If you find our work helpful, please consider giving us a ⭐!
Official PyTorch implementation of [ProCLIP: Progressive Vision-Language Alignment via LLM-based Embedder](https://arxiv.org/abs/2510.18795).
This repository is still being organized and refined. If you encounter any issues while using it, please contact us (Email: xiaoxinghhh@gmail.com, WeChat: 15111480307) or submit an issue. Thank you for your attention.
- Training and validation instructions
- Paper Link
- Model Weights
- [2025-10-20] We have released the model weights; please check the Model Zoo for details.
- [2025-10-22] We have updated the paper; please check arXiv for details.
This repository contains the official PyTorch implementation of [ProCLIP: Progressive Vision-Language Alignment via LLM-based Embedder](https://arxiv.org/abs/2510.18795). We introduce a progressive vision-language alignment approach that aligns an LLM-based embedder with the CLIP image encoder in a curriculum learning manner to enhance long-text, multilingual, and fine-grained understanding. The alignment proceeds in two stages (a conceptual sketch of the objectives follows the list below):
- Stage 1: Align the LLM-based embedder with the CLIP text encoder via Cross-Architecture Distillation.
- Stage 2: Align the LLM-based embedder with the CLIP image encoder with Self-Distillation Regularization.
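To make the two-stage recipe concrete, here is a minimal PyTorch sketch of what each stage's objective could look like. The function names, loss forms (cosine distillation for Stage 1; a symmetric contrastive loss plus an MSE self-distillation term for Stage 2), and the `temperature`/`reg_weight` values are illustrative assumptions, not the exact objectives from the paper.

```python
# Illustrative sketch only: loss forms and hyperparameters are assumptions,
# not the exact ProCLIP objectives (see the paper for the actual formulation).
import torch
import torch.nn.functional as F


def stage1_distillation_loss(llm_text_emb, clip_text_emb):
    """Stage 1 (cross-architecture distillation): pull the LLM-based embedder's
    text embeddings toward those of the frozen CLIP text encoder, using a
    simple cosine-similarity objective as a stand-in."""
    llm_text_emb = F.normalize(llm_text_emb, dim=-1)
    clip_text_emb = F.normalize(clip_text_emb, dim=-1)
    return (1.0 - (llm_text_emb * clip_text_emb).sum(dim=-1)).mean()


def stage2_alignment_loss(image_emb, llm_text_emb, llm_text_emb_stage1,
                          temperature=0.07, reg_weight=1.0):
    """Stage 2: align the LLM-based embedder with the CLIP image encoder via a
    symmetric image-text contrastive loss, regularized by self-distillation
    against embeddings from a frozen Stage-1 copy of the embedder."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(llm_text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    contrastive = 0.5 * (F.cross_entropy(logits, targets)
                         + F.cross_entropy(logits.t(), targets))
    self_distill = F.mse_loss(text_emb, F.normalize(llm_text_emb_stage1, dim=-1))
    return contrastive + reg_weight * self_distill
```

In this sketch, `llm_text_emb_stage1` stands for embeddings produced by a frozen copy of the Stage-1 embedder; the regularizer discourages Stage 2 from drifting away from the text alignment learned in Stage 1.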
- Python >= 3.9
- CUDA >= 11.8 (if using GPU)
- Other dependencies listed in `requirements.txt`
- Clone this repository and install dependencies:
```bash
# Clone the repo
git clone https://github.com/VisionXLab/ProCLIP.git
cd ProCLIP

# Create virtual environment
conda create -n proclip python=3.9 -y
conda activate proclip

# Install dependencies
pip install -r requirements.txt
```
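After installation, you can optionally confirm that PyTorch and CUDA are visible in your environment with a short check (standard PyTorch calls, independent of this repository):

```python
# Optional sanity check: verify the PyTorch/CUDA setup before training or evaluation.
import torch

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available:  {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA device:     {torch.cuda.get_device_name(0)}")
```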
Coming soon.
Coming soon.
- More results can be found in the paper.
If you find our work helpful, please cite our paper:
```bibtex
@misc{ProCLIP,
  title={ProCLIP: Progressive Vision-Language Alignment via LLM-based Embedder},
  author={Xiaoxing Hu and Kaicheng Yang and Ziyang Gong and Qi Ming and Zonghao Guo and Xiang An and Ziyong Feng and Junchi Yan and Xue Yang},
  year={2025},
  eprint={2510.18795},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2510.18795},
}
```
This project is licensed under the MIT License - see the LICENSE file for details.
Our work is inspired by LLM2CLIP and CLIP. We are grateful for their outstanding work and code.