DAMRO: Dive into the Attention Mechanism of LVLM to Reduce Object Hallucination

arXiv GitHub Video

The official repository for the paper "DAMRO: Dive into the Attention Mechanism of LVLM to Reduce Object Hallucination"

Figure 1: An overview of DAMRO.

🔥 News

  • All code has been uploaded to GitHub.
  • Our paper is available on arXiv: https://arxiv.org/abs/2410.04514.
  • Our paper has been accepted to the EMNLP 2024 Main Conference.

🌟 Key Highlights

  1. We conduct an in-depth analysis of the relationship between the attention maps of the visual encoder and the LLM decoder, revealing a high consistency in the distribution of their outlier tokens.
  2. We analyze the impact of this consistency on object hallucination and design the DAMRO method to mitigate hallucination in LVLMs.
  3. We demonstrate the effectiveness of our method via extensive experiments on various models and benchmarks. Moreover, our training-free approach is applicable to most LVLMs without requiring external knowledge or models.

🚀 Quick Start

Prepare Code and Data

git clone https://github.com/coder-gx/DAMRO.git

Clone the GitHub repository to your local directory. Download COCO val2014 and MME, then place them, along with the remaining datasets, into the data folder. The final data directory structure should look like this:

DAMRO/
├── data/
│   ├── chair/
│   ├── coco/
│   │   ├── annotations/
│   │   └── val2014/
│   ├── gpt4v/
│   ├── mme/
│   └── pope/
│       └── coco/
Environment Setup

conda create -n damro python=3.9
conda activate damro
cd DAMRO
pip install -r requirements.txt
pip install -e ./transformers-4.43.0

Here we use a modified version of transformers adapted for DAMRO. Compared to the original code, the modifications mainly involve three areas:

  1. Model-related files, including modeling_llava.py, modeling_llava_next.py, and modeling_instructblip.py.

  2. generation.py, where the DAMRO parameters are added and passed through to decoding.

  3. sample.py, where the _sample() function implementation is replaced. A sketch of the decoding logic appears below this list.
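
For orientation, here is a minimal, hypothetical sketch of the decoding step these modifications implement, as we understand it from the paper and from the flags used below (--damro_topk, --damro_alpha, --damro_beta). It is not the code shipped in transformers-4.43.0: the top-k outlier visual tokens are selected via the ViT [CLS] attention, the logits conditioned on them act as a negative distribution, and a VCD-style adaptive plausibility constraint (beta) restricts the candidate set.

import torch

def damro_step(logits_full, logits_outlier, alpha=0.5, beta=0.1):
    # logits_full:    next-token logits conditioned on all visual tokens.
    # logits_outlier: logits conditioned only on the top-k outlier tokens
    #                 (selected via the ViT [CLS] attention scores).
    # Contrastive logits: keep the full view, subtract the outlier-driven view.
    contrastive = (1 + alpha) * logits_full - alpha * logits_outlier

    # Adaptive plausibility constraint (as in VCD): only tokens whose original
    # probability is within a beta factor of the best token stay candidates.
    probs = logits_full.softmax(dim=-1)
    keep = probs >= beta * probs.max(dim=-1, keepdim=True).values
    contrastive = contrastive.masked_fill(~keep, float("-inf"))
    return contrastive.softmax(dim=-1)  # distribution to sample/argmax from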

Start Inference

Our code supports LLaVA-1.5, LLaVA-NeXT, and InstructBLIP models. We reproduce the four benchmarks used in the paper, with usage examples shown below:

1. POPE

# generation
python ./evaluation/eval_pope/run_pope.py \
   --model_path <your_model_path> \
   --image_folder ./data/coco/val2014 \
   --question_file ./data/pope/coco/coco_pope_random.jsonl \
   --answers_file ./output/damro_eval_pope_instructblip_random_alpha0.5_beta0.1_topk4.jsonl \
   --damro_alpha 0.5 \
   --damro_topk 4 \
   --damro_beta 0.1 \
   --use_damro \
   --seed 42 \
   --batch_size 1

# evaluation
python ./evaluation/eval_pope/eval_pope.py \
   --gt_files ./data/pope/coco/coco_pope_random.jsonl \
   --gen_files ./output/damro_eval_pope_instructblip_random_alpha0.5_beta0.1_topk4.jsonl
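
For reference, eval_pope.py reports the standard POPE metrics. The following is a rough, hypothetical stand-in for that computation (POPE ground-truth files carry a "label" field; the "answer" field name for generated outputs is assumed):

import json

def load_yes_no(path, key):
    # POPE is binary: map answers beginning with "yes" to 1, others to 0.
    with open(path) as f:
        return [1 if json.loads(line)[key].strip().lower().startswith("yes") else 0
                for line in f]

gt = load_yes_no("./data/pope/coco/coco_pope_random.jsonl", "label")
pred = load_yes_no("./output/damro_eval_pope_instructblip_random_alpha0.5_beta0.1_topk4.jsonl",
                   "answer")  # output field name assumed

tp = sum(p and g for p, g in zip(pred, gt))
fp = sum(p and not g for p, g in zip(pred, gt))
fn = sum(not p and g for p, g in zip(pred, gt))
precision, recall = tp / (tp + fp), tp / (tp + fn)
print("accuracy :", sum(p == g for p, g in zip(pred, gt)) / len(gt))
print("precision:", precision, "recall:", recall)
print("f1       :", 2 * precision * recall / (precision + recall))
print("yes ratio:", sum(pred) / len(pred))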

2. CHAIR

# generation
python ./evaluation/eval_chair/run_chair.py \
   --model_path <your_model_path> \
   --image_folder ./data/coco/val2014 \
   --question_file ./data/pope/coco/coco_pope_random.jsonl \
   --answers_file ./output/damro_eval_chair_instructblip_alpha1.5_beta0.1_topk4.jsonl \
   --damro_alpha 1.5 \
   --damro_topk 4 \
   --damro_beta 0.1 \
   --use_damro \
   --seed 42 \
   --batch_size 1

# evaluation
python ./evaluation/eval_chair/eval_chair.py \
   --cap_file ./output/damro_eval_chair_instructblip_alpha1.5_beta0.1_topk4.jsonl \
   --image_id_key "image_id" \
   --caption_key "caption" \
   --cache ./evaluation/eval_chair/chair.pkl \
   --coco_path ./data/coco/annotations \
   --save_path ./output/chair_detail/damro_eval_chair_instructblip_alpha1.5_beta0.1_topk4_detail.json
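
For reference, the two CHAIR metrics the evaluation reports are defined as follows. This is our own illustrative sketch (object extraction and COCO synonym matching omitted); it assumes each caption has already been reduced to the set of COCO objects it mentions:

def chair_metrics(mentioned, ground_truth):
    # mentioned:    list of sets, the COCO objects mentioned in each caption.
    # ground_truth: list of sets, the objects actually present in each image.
    halluc = [m - g for m, g in zip(mentioned, ground_truth)]
    # CHAIR_i: fraction of all object mentions that are hallucinated.
    chair_i = sum(len(h) for h in halluc) / max(1, sum(len(m) for m in mentioned))
    # CHAIR_s: fraction of captions containing at least one hallucinated object.
    chair_s = sum(bool(h) for h in halluc) / len(mentioned)
    return chair_i, chair_s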

3. MME Subset

# generation
python ./evaluation/eval_mme/run_mme.py \
   --model_path <your_model_path>  \
   --question_file ./data/mme \
   --answers_file ./output/damro_eval_mme_llava-next_alpha2_beta0.1_topk10.jsonl \
   --damro_alpha 2 \
   --damro_topk 10 \
   --damro_beta 0.1 \
   --use_damro \
   --seed 42 \
   --batch_size 1

# evaluation
python ./evaluation/eval_mme/eval_mme.py \
   --gen_file ./output/damro_eval_mme_llava-next_alpha2_beta0.1_topk10.jsonl
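
As a reminder of how MME is scored (our own sketch, not eval_mme.py itself): each image in a subtask carries two yes/no questions, and the subtask score adds per-question accuracy to per-image accuracy ("acc+", both questions answered correctly), for a maximum of 200:

from collections import defaultdict

def mme_subtask_score(records):
    # records: iterable of (image_id, predicted_answer, gt_answer) tuples.
    per_image = defaultdict(list)
    for image_id, pred, gt in records:
        per_image[image_id].append(pred.strip().lower() == gt.strip().lower())
    answers = [ok for oks in per_image.values() for ok in oks]
    acc = 100 * sum(answers) / len(answers)                        # per question
    acc_plus = 100 * sum(all(v) for v in per_image.values()) / len(per_image)
    return acc + acc_plus                                          # max 200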

4. GPT4V-Aided Evaluation

# generation
python ./evaluation/gpt4v_aided/run_gpt4v_aided.py \
   --model_path <your_model_path>  \
   --image_folder ./data/coco/val2014 \
   --question_file ./data/gpt4v/gpt4v.jsonl \
   --answers_file ./output/damro_eval_gpt4v_llava-1.5_alpha2_beta0.1_topk10.jsonl \
   --damro_alpha 2 \
   --damro_topk 10 \
   --damro_beta 0.1 \
   --use_damro \
   --seed 42 \
   --batch_size 1

# evaluation
python ./evaluation/gpt4v_aided/eval_gpt4v.py \
    --file_path1 ./output/original_eval_gpt4v_llava-1.5.jsonl \
    --file_path2 ./output/damro_eval_gpt4v_llava-1.5_alpha2_beta0.1_topk10.jsonl \
    --save_path ./output/gpt4v_detail/damro_eval_gpt4v_llava-1.5_alpha2_beta0.1_topk10_detail.json \
    --image_folder ./data/coco/val2014

More usage examples can be found in the scripts folder, where run_generation.sh is used for generation and eval_generation.sh for final score computation.

📊 Experiments

1. Drawbacks of ViT

  • The attention maps of LVLMs' visual encoders also focus on a small number of high-norm outlier tokens.

Figure 2: Attention map of visual encoder. Left: original image. Middle: attention map of InstructBLIP ViT (16x16). Right: attention map of LLaVA-1.5 ViT (24x24).

2. Outlier Tokens Cause Hallucination

  • Outlier tokens from the visual encoding stage indeed influence the subsequent LLM decoding stage, and this influence is closely related to the occurrence of hallucinations.

Figure 3: LLM decoder attention map of the "plant" token (non-hallucinatory). The attention accurately locates the potted plant in the image.

Figure 4: LLM decoder attention map of the "clock" token (hallucinatory). The attention mainly focuses on the outlier tokens in the background, whose positions match those in the visual encoder attention map in the right sub-image of Figure 2.

3. DAMRO Results

  • We evaluate our method on LVLMs including LLaVA-1.5, LLaVA-NeXT, and InstructBLIP, using benchmarks such as POPE, CHAIR, MME, and GPT-4V-aided evaluation. The results demonstrate that our approach significantly reduces the impact of these outlier tokens, effectively alleviating hallucination in LVLMs.

Figure 5: POPE results.

Figure 6: CHAIR results.

Figure 7: MME results.

Figure 8: GPT-4V-aided evaluation results.

📌 Examples

Figure 9: DAMRO's performance on reducing hallucinations on InstructBLIP.

Figure 10: DAMRO's performance on reducing hallucinations on LLaVA-1.5-7B.

🥰 Acknowledgements

We thank the VCD team for providing the foundational codebase that we adapted to implement DAMRO. We also acknowledge the open-source community for providing the datasets and evaluation benchmarks that made this research possible.

πŸ“ Citation

@misc{gong2024damrodiveattentionmechanism,
      title={DAMRO: Dive into the Attention Mechanism of LVLM to Reduce Object Hallucination}, 
      author={Xuan Gong and Tianshi Ming and Xinpeng Wang and Zhihua Wei},
      year={2024},
      eprint={2410.04514},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2410.04514}, 
}

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.
