We propose Vision-Guided Attention (VGA), a method that guides a model's visual attention through visual grounding.
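This section does not spell out the mechanism, so the snippet below is only a rough illustration of the general idea: attention over visual tokens is biased toward regions selected by a grounding model. The function name, tensor shapes, and the `alpha` scale are assumptions for the sketch, not the repository's actual implementation.

import torch
import torch.nn.functional as F


def guide_attention(attn_logits: torch.Tensor,
                    grounding_mask: torch.Tensor,
                    alpha: float = 2.0) -> torch.Tensor:
    """Illustrative re-weighting of attention over visual tokens.

    attn_logits:    (..., num_visual_tokens) pre-softmax attention scores.
    grounding_mask: (num_visual_tokens,) with 1 for tokens inside the
                    grounded region and 0 elsewhere.
    alpha:          additive bias applied to grounded tokens (assumed value).
    """
    bias = alpha * grounding_mask.to(attn_logits.dtype)
    return F.softmax(attn_logits + bias, dim=-1)


if __name__ == "__main__":
    torch.manual_seed(0)
    logits = torch.randn(1, 8, 16)   # (batch, heads, visual tokens)
    mask = torch.zeros(16)
    mask[4:8] = 1.0                  # tokens covered by a hypothetical grounded box
    guided = guide_attention(logits, mask)
    print(guided.sum(-1))            # each row still sums to 1 after softmax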
conda create -n vga -y python=3.10
conda activate vga
pip install -r requirements.txt

Note: The dependencies follow LLaVA-v1.5. For LLaVA-NeXT and Qwen2.5-VL-Instruct, you can also easily set up the environments by following the instructions in their official repositories.
All benchmarks need to be processed into structurally consistent JSON files.
Some example entries can be found in data/samples.json.
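The authoritative schema is whatever data/samples.json contains; the loader below only sketches what "structurally consistent JSON" means in practice. The field names checked here ("image", "question") are assumptions for illustration, not the repository's actual keys.

import json


def load_benchmark(path: str):
    """Load a benchmark that has been converted to the unified JSON format."""
    with open(path, "r", encoding="utf-8") as f:
        samples = json.load(f)
    for sample in samples:
        # Assumed minimal fields: an image reference and a question per entry.
        assert "image" in sample and "question" in sample
    return samples


if __name__ == "__main__":
    data = load_benchmark("data/samples.json")
    print(f"Loaded {len(data)} samples")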
We provide a shell script, scripts/all.sh, that runs the benchmarks end-to-end.