Hierarchical Pre-Training of Vision Encoders with Large Language Model

This is the official repository for the CVPR 2026 paper, Hierarchical Pre-Training of Vision Encoders with Large Language Model (HIVE).

Install

Install the local package:

conda create -n hive python=3.11 -y
conda activate hive
pip install -e .

Install other dependencies. Key required packages include:

Train

All training scripts are provided in the scripts/train folder. We use the language model MobileLLM to assist in vision encoder training.

Vision Encoder Training: Training is conducted in 3 stages:

Connector Pretraining
Modality Adaptation
Vision Finetuning

VLM Training: We follow the LLaVA training method. Training is conducted in 2 stages:

Pretrain (feature alignment)
Visual Instruction Tuning

Classifier Training: We train an attentive probe classifier and a linear probe classifier on top of a frozen vision encoder backbone.
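
As a rough illustration of what probing a frozen backbone means, the sketch below trains a softmax linear classifier on features produced by a fixed feature function. It is a minimal, framework-free sketch and not the repository's implementation: `frozen_encoder`, `train_linear_probe`, `predict`, the toy data, and all hyperparameters are hypothetical stand-ins. The attentive probe is not covered here.

```python
import math
import random


def frozen_encoder(x):
    # Hypothetical stand-in for a frozen vision encoder: a fixed feature map
    # with no trainable parameters.
    return [x[0], x[1], x[0] * x[1]]


def logits_for(W, b, f):
    # One linear score per class: W[c] . f + b[c]
    return [sum(w * v for w, v in zip(row, f)) + bias for row, bias in zip(W, b)]


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_linear_probe(features, labels, num_classes, lr=0.5, epochs=200):
    # Full-batch gradient descent on the cross-entropy loss.
    # Only W and b are updated; the features are treated as constants.
    dim = len(features[0])
    W = [[0.0] * dim for _ in range(num_classes)]
    b = [0.0] * num_classes
    n = len(features)
    for _ in range(epochs):
        grad_W = [[0.0] * dim for _ in range(num_classes)]
        grad_b = [0.0] * num_classes
        for f, y in zip(features, labels):
            probs = softmax(logits_for(W, b, f))
            for c in range(num_classes):
                err = (probs[c] - (1.0 if c == y else 0.0)) / n
                grad_b[c] += err
                for j in range(dim):
                    grad_W[c][j] += err * f[j]
        for c in range(num_classes):
            b[c] -= lr * grad_b[c]
            for j in range(dim):
                W[c][j] -= lr * grad_W[c][j]
    return W, b


def predict(W, b, f):
    scores = logits_for(W, b, f)
    return max(range(len(scores)), key=scores.__getitem__)


# Toy data (assumption): label is 1 when x0 + x1 > 0; points close to the
# decision boundary are skipped so the classes are clearly separable.
rng = random.Random(0)
points = []
while len(points) < 40:
    p = (rng.uniform(-1, 1), rng.uniform(-1, 1))
    if abs(p[0] + p[1]) > 0.2:
        points.append(p)
labels = [int(p[0] + p[1] > 0) for p in points]

features = [frozen_encoder(p) for p in points]  # computed once, never updated
W, b = train_linear_probe(features, labels, num_classes=2)
accuracy = sum(predict(W, b, f) == y for f, y in zip(features, labels)) / len(labels)
```

Because the encoder is never updated, its features can be computed once and cached, which is what makes probing cheap compared with fine-tuning the backbone.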

Datasets

Vision Encoder Training

We train on the CC3M dataset, using synthetic captions for the VQA task and alt-text for the classification task. The dataset is available on Hugging Face:

CC3M_synthetic

VLM Training

Following the LLaVA framework, we use the following datasets:

Pretrain: LCS-558k
Instruction Finetuning: LLaVA-NeXT-Data

Evaluation

VQA Task: We evaluate the VLM on the VQA task with lmms-eval.
Image Classification: We evaluate image classification on ImageNet-1k.
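
For reference, ImageNet-1k classification is commonly reported as top-1 (and sometimes top-5) accuracy. The helper below is a minimal sketch of that metric on plain Python lists. `topk_accuracy` and its inputs are hypothetical and are not part of this repository's evaluation code.

```python
def topk_accuracy(scores, labels, k=1):
    """Fraction of samples whose true label is among the k highest-scoring classes.

    scores: list of per-sample class score lists (one score per class)
    labels: list of integer class indices, same length as scores
    """
    if len(scores) != len(labels) or not labels:
        raise ValueError("scores and labels must be non-empty and the same length")
    correct = 0
    for row, label in zip(scores, labels):
        # Indices of the k classes with the highest scores for this sample.
        topk = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        correct += label in topk
    return correct / len(labels)


# Tiny made-up example with 3 samples and 3 classes.
example_scores = [
    [0.1, 0.7, 0.2],
    [0.5, 0.3, 0.2],
    [0.2, 0.3, 0.5],
]
example_labels = [1, 1, 2]
top1 = topk_accuracy(example_scores, example_labels, k=1)
top2 = topk_accuracy(example_scores, example_labels, k=2)
```

In this made-up example the second sample is wrong at top-1 but its label is the second-highest score, so it counts as correct at top-2.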

Citation

If you find this work helpful or use our code in your research, please consider citing our paper:

@inproceedings{lee2026hierarchical,
  title={Hierarchical Pre-Training of Vision Encoders with Large Language Model},
  author={Lee, Eugene and Chang, Ting-Yu and Tsai, Jui-Huang and Diao, Jiajie and Lee, Chen-Yi},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={7415--7424},
  year={2026}
}
