The official repository for the paper "DAMRO: Dive into the Attention Mechanism of LVLM to Reduce Object Hallucination"
Figure 1: An overview of DAMRO.
- All code has been uploaded to GitHub.
- Our paper is available on arXiv: https://arxiv.org/abs/2410.04514
- Our paper has been accepted to the EMNLP 2024 main conference.
- We conduct an in-depth analysis of the relationship between the attention maps of the visual encoder and the LLM decoder, revealing a high consistency in the distribution of their outlier tokens (a conceptual sketch of the outlier-token selection is shown below).
- We analyze the impact of this consistency on object hallucination and design the DAMRO method to mitigate hallucination in LVLMs.
- We demonstrate the effectiveness of our method through extensive experiments on various models and benchmarks. Moreover, our training-free approach is applicable to most LVLMs without external knowledge or models.
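As context for the highlights above, the outlier-token selection can be pictured as taking the visual tokens that receive the most attention from the ViT's CLS token. The sketch below is illustrative only: the function name and the way the CLS-to-patch attention is obtained are assumptions, not the repository's actual hook point, which lives in the modified transformers fork installed later.

```python
import torch

def select_outlier_tokens(cls_attn: torch.Tensor, topk: int = 10) -> torch.Tensor:
    """Pick the indices of the top-k visual tokens by CLS attention.

    cls_attn: attention weights from the ViT [CLS] token to the image patch
              tokens (shape: num_patches). Hypothetical input -- the real
              extraction point lives inside the modified transformers fork.
    """
    # The few tokens that absorb most of the CLS attention are the high-norm
    # "outlier" tokens whose influence DAMRO later downweights at decoding.
    return cls_attn.topk(topk).indices
```

Here `topk` corresponds to the `--damro_topk` flag used in the evaluation commands below.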
Clone the GitHub repository to your local directory:

```bash
git clone https://github.com/coder-gx/DAMRO.git
```
Download COCO val2014 and MME, then place them, along with the remaining datasets, in the `data` folder.
The final data directory structure should look like this:
```
DAMRO/
├── data/
│   ├── chair/
│   ├── coco/
│   │   ├── annotations/
│   │   └── val2014/
│   ├── gpt4v/
│   ├── mme/
│   └── pope/
│       └── coco/
```

Set up the environment:

```bash
conda create -n damro python==3.9
conda activate damro
cd DAMRO
pip install -r requirement.txt
pip install -e ./transformers-4.43.0
```

Here we use a modified version of transformers adapted for DAMRO. Compared with the original code, the modifications mainly involve three areas:
- Model-related files, including `modeling_llava.py`, `modeling_llava_next.py`, and `modeling_instructblip.py`.
- `generation.py`, where the DAMRO parameters are added and passed through to the decoding loop.
- `sample.py`, where the `_sample()` implementation is replaced (see the sketch after this list).
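Since the README does not spell out what the replaced `_sample()` does, here is a minimal sketch of the kind of logit fusion it plausibly performs, assuming the VCD-style contrastive decoding that the codebase is adapted from (see the Acknowledgements): a full-image branch and an outlier-tokens-only branch are combined, and implausible tokens are masked. Function and argument names are illustrative, not the fork's actual API.

```python
import torch

def damro_fuse_logits(logits_full: torch.Tensor,
                      logits_outlier: torch.Tensor,
                      alpha: float = 0.5,
                      beta: float = 0.1) -> torch.Tensor:
    """Contrastively fuse two decoding branches (illustrative only).

    logits_full:    next-token logits conditioned on all visual tokens.
    logits_outlier: next-token logits conditioned on the outlier tokens only,
                    i.e. the branch whose influence is subtracted.
    """
    # Contrastive combination: amplify the full branch, penalize the outlier one.
    fused = (1 + alpha) * logits_full - alpha * logits_outlier

    # Adaptive plausibility constraint (as in VCD): keep only tokens whose
    # probability under the full branch is within a beta-fraction of the best.
    probs_full = logits_full.softmax(dim=-1)
    cutoff = beta * probs_full.max(dim=-1, keepdim=True).values
    return fused.masked_fill(probs_full < cutoff, float("-inf"))
```

`alpha` and `beta` here correspond to the `--damro_alpha` and `--damro_beta` flags in the commands below.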
Our code supports LLaVA-1.5, LLaVA-NeXT, and InstructBLIP models. We reproduce the four benchmarks used in the paper, with usage examples shown below:
**POPE**

```bash
# generation
python ./evaluation/eval_pope/run_pope.py \
--model_path <your_model_path> \
--image_folder ./data/coco/val2014 \
--question_file ./data/pope/coco/coco_pope_random.jsonl \
--answers_file ./output/damro_eval_pope_instructblip_random_alpha0.5_beta0.1_topk4.jsonl \
--damro_alpha 0.5 \
--damro_topk 4 \
--damro_beta 0.1 \
--use_damro \
--seed 42 \
--batch_size 1
# evaluation
python ./evaluation/eval_pope/eval_pope.py \
--gt_files ./data/pope/coco/coco_pope_random.jsonl \
--gen_files ./output/damro_eval_pope_instructblip_random_alpha0.5_beta0.1_topk4.jsonl
```
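POPE is a yes/no probing benchmark, so `eval_pope.py` presumably reports the standard accuracy/precision/recall/F1. For reference, here is a self-contained sketch of those metrics, assuming one JSON object per line with `label`/`text` fields holding "yes"/"no" answers (the field names are assumptions and may differ from the actual files):

```python
import json

def pope_metrics(gt_file: str, gen_file: str):
    """Accuracy/precision/recall/F1 for POPE-style yes/no answers (sketch)."""
    def load(path, key):
        # True for "yes", False otherwise; field names are assumptions.
        with open(path) as f:
            return [json.loads(line)[key].strip().lower().startswith("yes")
                    for line in f]

    gts, preds = load(gt_file, "label"), load(gen_file, "text")
    tp = sum(g and p for g, p in zip(gts, preds))
    fp = sum(p and not g for g, p in zip(gts, preds))
    fn = sum(g and not p for g, p in zip(gts, preds))
    acc = sum(g == p for g, p in zip(gts, preds)) / len(gts)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return acc, prec, rec, f1
```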
**CHAIR**

```bash
# generation
python ./evaluation/eval_pope/run_chair.py \
--model_path <your_model_path> \
--image_folder ./data/coco/val2014 \
--question_file ./data/pope/coco/coco_pope_random.jsonl \
--answers_file ./output/damro_eval_chair_instructblip_alpha1.5_beta0.1_topk4.jsonl \
--damro_alpha 1.5 \
--damro_topk 4 \
--damro_beta 0.1 \
--use_damro \
--seed 42 \
--batch_size 1
# evaluation
python ./evaluation/eval_chair/eval_chair.py \
--cap_file ./output/damro_eval_chair_instructblip_alpha1.5_beta0.1_topk4.jsonl \
--image_id_key "image_id" \
--caption_key "caption" \
--cache ./evaluation/eval_chair/chair.pkl \
--coco_path ./data/coco/annotations \
--save_path ./output/chair_detail/damro_eval_chair_instructblip_alpha1.5_beta0.1_topk4_detail.json
```
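CHAIR measures hallucinated object mentions against the COCO ground truth: CHAIR_i is the fraction of mentioned objects that are not actually in the image, and CHAIR_s is the fraction of captions containing at least one such object. A standalone sketch of the two metrics over already-extracted object sets (the repo's `eval_chair.py` also handles the extraction itself, which is omitted here):

```python
def chair_metrics(mentioned_per_caption, present_per_image):
    """CHAIR_s and CHAIR_i over sets of object names (illustrative).

    mentioned_per_caption: list of sets of objects mentioned in each caption.
    present_per_image:     list of sets of objects actually in each image.
    """
    bad_mentions = all_mentions = bad_captions = 0
    for mentioned, present in zip(mentioned_per_caption, present_per_image):
        hallucinated = mentioned - present   # mentioned but not present
        bad_mentions += len(hallucinated)
        all_mentions += len(mentioned)
        bad_captions += bool(hallucinated)
    chair_i = bad_mentions / max(all_mentions, 1)
    chair_s = bad_captions / max(len(mentioned_per_caption), 1)
    return chair_s, chair_i
```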
**MME**

```bash
# generation
python ./evaluation/eval_mme/run_mme.py \
--model_path <your_model_path> \
--question_file ./data/mme \
--answers_file ./output/damro_eval_mme_llava-next_alpha2_beta0.1_topk10.jsonl \
--damro_alpha 2 \
--damro_topk 10 \
--damro_beta 0.1 \
--use_damro \
--seed 42 \
--batch_size 1
# evaluation
python ./evaluation/eval_mme/eval_mme.py \
--gen_file ./output/damro_eval_mme_llava-next_alpha2_beta0.1_topk10.jsonl
```
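MME scores each subtask as accuracy plus "accuracy+", where accuracy+ credits an image only if both of its paired yes/no questions are answered correctly. A minimal sketch of that per-subtask score (the input format here is an assumption, not the repo's actual file layout):

```python
from collections import defaultdict

def mme_subtask_score(records):
    """records: iterable of (image_id, correct) pairs, two questions per image.

    Returns the per-subtask MME score: 100 * (acc + acc+).
    """
    per_image = defaultdict(list)
    for image_id, correct in records:
        per_image[image_id].append(bool(correct))
    answers = [c for pair in per_image.values() for c in pair]
    acc = sum(answers) / len(answers)            # per-question accuracy
    acc_plus = sum(all(p) for p in per_image.values()) / len(per_image)
    return 100 * (acc + acc_plus)
```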
**GPT-4V Aided Evaluation**

```bash
# generation
python ./evaluation/gpt4v_aided/run_gpt4v_aided.py \
--model_path <your_model_path> \
--image_folder ./data/coco/val2014 \
--question_file ./data/gpt4v/gpt4v.jsonl \
--answers_file ./output/damro_eval_gpt4v_llava-1.5_alpha2_beta0.1_topk10.jsonl \
--damro_alpha 2 \
--damro_topk 10 \
--damro_beta 0.1 \
--use_damro \
--seed 42 \
--batch_size 1
# evaluation
python ./evaluation/gpt4v_aided/eval_gpt4v.py \
--file_path1 ./output/original_eval_gpt4v_llava-1.5.jsonl \
--file_path2 ./output/damro_eval_gpt4v_llava-1.5_alpha2_beta0.1_topk10.jsonl \
--save_path ./output/gpt4v_detail/damro_eval_gpt4v_llava-1.5_alpha2_beta0.1_topk10_detail.json \
--image_folder ./data/coco/val2014
```
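`eval_gpt4v.py` asks GPT-4V to judge the two models' outputs on the same images. The repository's exact prompt and model choice are not shown here, so the following is only a sketch of how such a comparison call can be issued with the OpenAI SDK; the rubric text and model name are placeholders:

```python
import base64
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def compare_answers(image_path: str, question: str,
                    answer_a: str, answer_b: str) -> str:
    """Ask a GPT-4V-class model to rate two answers about one image."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    prompt = (  # placeholder rubric; the repo's actual prompt may differ
        f"Question: {question}\nAnswer 1: {answer_a}\nAnswer 2: {answer_b}\n"
        "Rate each answer's accuracy and detailedness on a 1-10 scale."
    )
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder; any GPT-4V-capable model works
        messages=[{"role": "user", "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
        ]}],
    )
    return resp.choices[0].message.content
```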
More usage examples can be found in the `scripts` folder, where `run_generation.sh` handles generation and `eval_generation.sh` computes the final scores.
- The attention maps of LVLMs' visual encoders also focus on a small number of high-norm outlier tokens.
Figure 2: Attention maps of the visual encoder. Left: original image. Middle: attention map of the InstructBLIP ViT (16×16). Right: attention map of the LLaVA-1.5 ViT (24×24).
- It can be observed that the outlier tokens from the visual encoding stage indeed influence the subsequent LLM decoding stage, and that this influence is closely related to the occurrence of hallucinations.
Figure 3: LLM decoder attention map of the "plant" token (non-hallucinatory). The attention accurately locates the potted plant.
Figure 4: LLM decoder attention map of the "clock" token (hallucinatory). The attention mainly focuses on the outlier tokens in the background, whose positions match those in the visual encoder attention map shown in the right sub-image of Figure 2.
- We evaluate our method on LVLMs including LLaVA-1.5, LLaVA-NeXT and InstructBLIP, using various benchmarks such as POPE, CHAIR, MME and GPT-4V Aided Evaluation. The results demonstrate that our approach significantly reduces the impact of these outlier tokens, thus effectively alleviating the hallucination of LVLMs.
Figure 7: DAMRO's performance on reducing hallucinations on InstructBLIP.
Figure 8: DAMRO's performance on reducing hallucinations on LLaVA-1.5-7b.
We thank the VCD team for providing the foundational codebase that we adapted to implement DAMRO. We also acknowledge the open-source community for providing the datasets and evaluation benchmarks that made this research possible.
```bibtex
@misc{gong2024damrodiveattentionmechanism,
title={DAMRO: Dive into the Attention Mechanism of LVLM to Reduce Object Hallucination},
author={Xuan Gong and Tianshi Ming and Xinpeng Wang and Zhihua Wei},
year={2024},
eprint={2410.04514},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2410.04514},
}
```

This project is licensed under the MIT License - see the LICENSE file for details.



