Skip to content

andyj1/raha

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

3 Commits
 
 
 
 
 
 
 
 

Repository files navigation

Rank-Aware Hyperbolic Alignment for Vision-Language Dataset Distillation (ECCV 2026)

Official implementation of Rank-Aware Hyperbolic Alignment for Vision-Language Dataset Distillation (RAHA), a geometry-aware method for condensing large vision-language datasets into compact synthetic sets while preserving cross-modal retrieval structure.

Paper | Project Page

News

  • Will be updated soon!

About

RAHA distills compact image-text datasets by combining rank-aware distribution matching with hyperbolic image-text alignment. It lifts multimodal representations to hyperbolic space, estimates an adaptive rank-k shared semantic range from real batch coupling, matches relevance in that dominant range, and regularizes the residual subspace so weak correlations do not dominate under tight budgets.

The distillation objective is:

L_total = L_hITC + λ_range L_range + λ_residual L_residual

Dataset

data/
├── datasets/
│   ├── Flickr30k/
│   ├── Flickr8k/
│   └── COCO/
└── annotations/
    ├── flickr30k/
    ├── flickr8k/
    └── coco/

Defaults in distill_raha.py follow the Karpathy-style retrieval splits used in the paper:

image roots such as ./data/datasets/Flickr30k/ and annotation root ./data/annotations/ when using the flickr / flickr8k / coco options.

Each distilled query contains one 3×224×224 synthetic image optimized in pixel space and one continuous 768-dimensional text embedding optimized in embedding space.

Training

export CKPT_PATH=/path/to/distilled.pt
./sh/distill.sh <gpu_id> [run_name]

Example command (Flickr8k, 500 pairs; matches the paper's default optimization settings in Table A2/A3):

CUDA_VISIBLE_DEVICES=0 python distill_raha.py \
    --dataset flickr8k \
    --num_queries 500 \
    --batch_size_train 64 \
    --batch_size_test 64 \
    --lr_img 1.0 \
    --lr_txt 1.0 \
    --Iteration 200 \
    --outer_loop 50 \
    --inner_loop 1 \
    --epoch_eval_train 100 \
    --num_eval 5 \
    --curvature 1.0 \
    --lift_scale 1.0 \
    --tau_hitc 0.07 \
    --tau_relevance 0.07 \
    --rank_energy 0.95 \
    --sinkhorn_eps 0.05 \
    --sinkhorn_iters 20 \
    --lambda_range 0.8 \
    --lambda_residual 0.4 \
    --lambda_comp 0.1 \
    --log_dir logs \
    --eval_eval_freq 50 \
    --save_it 50 \
    --seed 1

Evaluation

After distillation, train a retrieval model from scratch on the synthetic set for 100 epochs and evaluate on the real test split:

export CKPT_PATH=/path/to/distilled.pt
./sh/eval.sh <gpu_id>

Example command:

CUDA_VISIBLE_DEVICES=0 python eval.py \
    --dataset flickr8k \
    --num_queries 500 \
    --num_eval 5 \
    --epoch_eval_train 100 \
    --ckpt_path /path/to/distilled.pt \
    --image_encoder nfnet \
    --text_encoder bert \
    --batch_size_train 64 \
    --batch_size_test 64

Citation

If you use this code in your research, please cite:

@inproceedings{jeong2026raha,
  title     = {Rank-Aware Hyperbolic Alignment for Vision-Language Dataset Distillation},
  author    = {Jeong, Jongoh and Lee, Sun-Kyung and Yoon, Kuk-Jin},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year      = {2026}
}

About

Official repository for Rank-Aware Hyperbolic Alignment for Vision-Language Dataset Distillation (ECCV 2026)

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors