This is the official repository for the CVPR 2026 paper "Hierarchical Pre-Training of Vision Encoders with Large Language Model" (HIVE).
- Create a conda environment and install the local package:
conda create -n hive python=3.11 -y
conda activate hive
pip install -e .
- Install other dependencies. Key required packages include:
All training scripts are provided in the scripts/train folder. We use the language model MobileLLM to assist in vision encoder training.
- Vision Encoder Training: Training is conducted in 3 stages:
  - Connector Pretraining
  - Modality Adaptation
  - Vision Finetuning
- VLM Training: We follow the LLaVA training method. Training is conducted in 2 stages:
  - Pretrain (feature alignment)
  - Visual Instruction Tuning
- Classifier Training: We adopt an attentive probe classifier and a linear probe classifier with a frozen vision encoder backbone.
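The two probe types mentioned above can be sketched in PyTorch as follows. This is a minimal illustration, not the repository's implementation: `ToyEncoder`, `freeze`, `train_step`, the mean pooling in the linear probe, and the single-query cross-attention design of the attentive probe are all assumptions made for the example. The only detail taken from the text is that the vision encoder backbone stays frozen while the classifier is trained.

```python
import torch
import torch.nn as nn


class LinearProbe(nn.Module):
    """Mean-pool the encoder tokens, then apply one linear layer (assumed pooling)."""

    def __init__(self, dim: int, num_classes: int):
        super().__init__()
        self.head = nn.Linear(dim, num_classes)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_tokens, dim)
        return self.head(tokens.mean(dim=1))


class AttentiveProbe(nn.Module):
    """Pool tokens with one learnable query via cross-attention, then classify."""

    def __init__(self, dim: int, num_classes: int, num_heads: int = 4):
        super().__init__()
        self.query = nn.Parameter(torch.zeros(1, 1, dim))
        nn.init.trunc_normal_(self.query, std=0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # Broadcast the single query across the batch: (batch, 1, dim)
        query = self.query.expand(tokens.size(0), -1, -1)
        pooled, _ = self.attn(query, tokens, tokens)
        return self.head(self.norm(pooled.squeeze(1)))


class ToyEncoder(nn.Module):
    """Hypothetical stand-in for the vision encoder; returns patch tokens."""

    def __init__(self, patch: int = 8, dim: int = 64):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # (batch, dim, H/patch, W/patch) -> (batch, num_tokens, dim)
        return self.proj(images).flatten(2).transpose(1, 2)


def freeze(module: nn.Module) -> nn.Module:
    """Disable gradients for every parameter and switch to eval mode."""
    for param in module.parameters():
        param.requires_grad_(False)
    return module.eval()


def train_step(encoder, probe, optimizer, images, labels) -> float:
    """One optimization step that updates only the probe."""
    with torch.no_grad():  # the backbone is frozen, so no graph is needed
        tokens = encoder(images)
    loss = nn.functional.cross_entropy(probe(tokens), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In this sketch the optimizer is built from `probe.parameters()` only, so the encoder weights cannot change even if `freeze` is forgotten; calling `freeze` additionally avoids storing gradients for the backbone.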
For vision encoder training, we use the CC3M dataset, with synthetic captions for the VQA task and alt-text for the classification task. The dataset is provided in the Hugging Face repository:
For VLM training, we follow the LLaVA framework and use these datasets:
- Pretrain: LCS-558k
- Instruction Finetuning: LLaVA-NeXT-Data
- VQA Task: We use lmms-eval to evaluate the VQA task for the VLM.
- Image Classification: We use ImageNet-1k to evaluate the image classification task.
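For the image classification evaluation, the text does not state which metric is reported; top-k accuracy is the common choice for ImageNet-1k, so the sketch below assumes it. The function name `topk_accuracy` and its inputs (per-class scores and integer labels) are illustrative and not taken from the repository.

```python
from typing import Sequence


def topk_accuracy(
    scores: Sequence[Sequence[float]],
    labels: Sequence[int],
    k: int = 1,
) -> float:
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not labels:
        raise ValueError("at least one sample is required")

    hits = 0
    for row, label in zip(scores, labels):
        # Class indices sorted by descending score, truncated to the top k.
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        hits += label in ranked
    return hits / len(labels)
```

With classifier logits collected over the validation set, `topk_accuracy(logits, labels, k=1)` gives top-1 accuracy and `k=5` gives top-5.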
If you find this work helpful or use our code in your research, please consider citing our paper:
@inproceedings{lee2026hierarchical,
  title={Hierarchical Pre-Training of Vision Encoders with Large Language Model},
  author={Lee, Eugene and Chang, Ting-Yu and Tsai, Jui-Huang and Diao, Jiajie and Lee, Chen-Yi},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={7415--7424},
  year={2026}
}
