eugenelet/HIVE

Hierarchical Pre-Training of Vision Encoders with Large Language Model


[Figure: Overview of HIVE Framework]

This is the official repository for the CVPR 2026 paper, Hierarchical Pre-Training of Vision Encoders with Large Language Model (HIVE).


Install

  1. Install the local package:
     conda create -n hive python=3.11 -y
     conda activate hive
     pip install -e .
  2. Install other dependencies. Key required packages include:

Train

All training scripts are provided in the scripts/train folder. We use the language model MobileLLM to assist in vision encoder training.

  • Vision Encoder Training: Training is conducted in 3 stages:
      1. Connector Pretraining
      2. Modality Adaptation
      3. Vision Finetuning
  • VLM Training: We follow the LLaVA training method. Training is conducted in 2 stages:
      1. Pretrain (feature alignment)
      2. Visual Instruction Tuning
  • Classifier Training: We adopt an attentive probe classifier and a linear probe classifier with a frozen vision encoder backbone.
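
For readers unfamiliar with linear probing, the idea behind the classifier stage can be sketched as follows. This is a dependency-free toy, not the code in scripts/train: `frozen_backbone`, `train_linear_probe`, and `predict` are hypothetical names, the "backbone" is a fixed hand-written feature map standing in for the real vision encoder, and a real run would presumably use a deep-learning framework. The point it illustrates is that only the linear head receives gradient updates while the encoder stays fixed.

```python
import math


def frozen_backbone(x):
    """Hypothetical stand-in for a frozen vision encoder: a fixed feature map
    with no trainable parameters."""
    return [x[0], x[1], x[0] * x[1]]


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def head_logits(W, b, feats):
    """Linear head: one weight row and one bias per class."""
    return [sum(w * v for w, v in zip(W[c], feats)) + b[c] for c in range(len(W))]


def train_linear_probe(samples, labels, num_classes, epochs=200, lr=0.5):
    """Fit only the linear head (W, b) with SGD on cross-entropy loss."""
    # Features are computed once because the backbone is never updated.
    feats = [frozen_backbone(x) for x in samples]
    dim = len(feats[0])
    W = [[0.0] * dim for _ in range(num_classes)]
    b = [0.0] * num_classes
    for _ in range(epochs):
        for f, y in zip(feats, labels):
            probs = softmax(head_logits(W, b, f))
            for c in range(num_classes):
                # Gradient of cross-entropy with respect to logit c.
                grad = probs[c] - (1.0 if c == y else 0.0)
                for j in range(dim):
                    W[c][j] -= lr * grad * f[j]
                b[c] -= lr * grad
    return W, b


def predict(W, b, x):
    logits = head_logits(W, b, frozen_backbone(x))
    return max(range(len(logits)), key=logits.__getitem__)


# Toy data: two well-separated clusters (illustrative values only).
samples = [(-1.0, -0.8), (-0.9, -1.1), (1.0, 0.9), (0.8, 1.2)]
labels = [0, 0, 1, 1]
W, b = train_linear_probe(samples, labels, num_classes=2)
```

An attentive probe follows the same frozen-backbone pattern but replaces the single linear layer with a small attention-based pooling head over the encoder's token features.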

Datasets

Vision Encoder Training

We use the CC3M dataset, with synthetic captions for the VQA task and alt-text for the classification task. The dataset is provided in the Hugging Face repository:

VLM Training

As in the LLaVA framework, we use these datasets:


Evaluation

  • VQA Task: We use lmms-eval to evaluate the VQA task for the VLM.
  • Image Classification: We use ImageNet-1k to evaluate the image classification task.
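
As a rough illustration of how classification results on a benchmark such as ImageNet-1k are commonly scored, the sketch below computes top-k accuracy from per-class scores. It assumes top-1/top-k accuracy is the metric of interest, which this README does not state explicitly, and `topk_accuracy` is a hypothetical helper rather than a function from this repository.

```python
def topk_accuracy(scores, targets, k=1):
    """Fraction of samples whose true label is among the k highest-scoring classes.

    scores:  list of per-sample score lists, one score per class
    targets: list of integer class labels, one per sample
    """
    if len(scores) != len(targets):
        raise ValueError("scores and targets must have the same length")
    if not scores:
        raise ValueError("need at least one sample")
    hits = 0
    for row, target in zip(scores, targets):
        # Class indices sorted from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if target in ranked[:k]:
            hits += 1
    return hits / len(targets)


# Illustrative values only: 3 samples, 3 classes.
example_scores = [
    [0.1, 0.7, 0.2],
    [0.5, 0.3, 0.2],
    [0.2, 0.3, 0.5],
]
example_targets = [1, 1, 2]
top1 = topk_accuracy(example_scores, example_targets, k=1)
top2 = topk_accuracy(example_scores, example_targets, k=2)
```

In this toy input the second sample is misclassified at k=1 but recovered at k=2, which is the usual reason both numbers are reported together.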

Citation

If you find this work helpful or use our code in your research, please consider citing our paper:

@inproceedings{lee2026hierarchical,
  title={Hierarchical Pre-Training of Vision Encoders with Large Language Model},
  author={Lee, Eugene and Chang, Ting-Yu and Tsai, Jui-Huang and Diao, Jiajie and Lee, Chen-Yi},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={7415--7424},
  year={2026}
}

About

[CVPR Workshop 2026] Hierarchical Pre-Training of Vision Encoders with Large Language Models.
