DHCP: Detecting Hallucinations by Cross-modal Attention Pattern in Large Vision-Language Models

This is a PyTorch implementation of the DHCP paper, accepted at ACM MM 2025.

Preparing the environment, code, data and model

  1. Prepare the environment.

Create a Python environment and activate it with the following commands.

conda create -n dhcp python==3.10
conda activate dhcp
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu118
pip install transformers==4.51.3 accelerate
pip install tqdm qwen_vl_utils
  2. Clone this repository.
git clone https://github.com/btzyd/DHCP.git
  3. Prepare the POPE-COCO dataset.

We provide labels for POPE-COCO-train and POPE-COCO-test in POPE-COCO.jsonl. The first digit of the question_id field is 1 or 3, indicating POPE-COCO-train or POPE-COCO-test, respectively. The second digit is 1, 2, or 3, indicating the random, popular, or adversarial setting, respectively.

For the images, download COCO val2014.

  4. Download the model.

Download Qwen2.5-VL-7B. Alternatively, you can let the Python code download the model at load time, though that download may be unstable. We recommend downloading the model first and then loading it from a local directory.
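The question_id encoding described in the dataset step can be decoded as follows. This is a minimal sketch; the helper name decode_question_id is ours for illustration and is not part of the repository.

```python
def decode_question_id(question_id):
    """Decode the split and POPE setting from a POPE-COCO question_id.

    The first digit encodes the split (1 = train, 3 = test) and the
    second digit the negative-sampling setting (1 = random, 2 = popular,
    3 = adversarial), as described above.
    """
    digits = str(question_id)
    split = {"1": "train", "3": "test"}[digits[0]]
    setting = {"1": "random", "2": "popular", "3": "adversarial"}[digits[1]]
    return split, setting

# Example: an id beginning with "12" is a train-split, popular-setting question.
print(decode_question_id(120001))  # -> ('train', 'popular')
```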

Run the DHCP code

Extracting the cross-modal attention for hallucination and non-hallucination examples

The following command extracts cross-modal attention and answers, creating the attention_file directory. This process may take a long time (~80 GPU hours), since a total of 408,060 samples must be run through inference and attention extraction.

For single GPU:

CUDA_VISIBLE_DEVICES=0 python extract_script/qwen_extract_attention.py

For multiple GPUs, we provide a script:

bash qwen_extract_attention.sh

#!/bin/bash
# qwen_extract_attention.sh
# Assume there are N machines in total, each with 8 GPUs. INDEX is an
# environment variable giving each machine's index, from 0 to N-1; N must
# also be set so that --num-chunks covers all machines.

for ((i = 0; i < 8; i++))
do
    IDX=$((INDEX * 8 + i))
    CUDA_VISIBLE_DEVICES=${i} python extract_script/qwen_extract_attention.py \
    --num-chunks $((N * 8)) \
    --chunk-idx $IDX &
done
wait
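The --num-chunks and --chunk-idx flags indicate that the dataset is partitioned across workers. A common scheme (used in many LLaVA-style evaluation scripts) splits the sample list into contiguous slices; the function below is an illustrative sketch under that assumption, not the repository's actual code.

```python
import math

def get_chunk(samples, num_chunks, chunk_idx):
    """Return the chunk_idx-th of num_chunks roughly equal contiguous slices."""
    chunk_size = math.ceil(len(samples) / num_chunks)
    return samples[chunk_idx * chunk_size : (chunk_idx + 1) * chunk_size]

# With N machines x 8 GPUs, each worker processes about 1/(8N) of the samples.
samples = list(range(10))
print(get_chunk(samples, 4, 0))  # -> [0, 1, 2]
print(get_chunk(samples, 4, 3))  # -> [9]
```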

The parameters are as follows:

  • model-path: Path to the LVLM checkpoint.
  • image-folder: Directory containing the COCO val2014 images.
  • output-dir: Output directory for the cross-modal attention and answers.

After that, merge the answers generated by the LVLM and organize them into the format needed for the subsequent steps. Remember to modify qwen_answer_to_jsonl.py to match your number of GPUs and output directory. It generates two files, one for the training set and one for the test set, labeled 0 to 3: 0 means answering "yes" without hallucination, 1 means answering "no" without hallucination, 2 means answering "yes" with hallucination, and 3 means answering "no" with hallucination.

python qwen_answer_to_jsonl.py
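The four-way label above can be derived from the model's answer and the ground truth. A minimal sketch (dhcp_label is our illustrative helper name, not the repository's code):

```python
def dhcp_label(answer_yes, ground_truth_yes):
    """Map (model answer, ground truth) to the 0-3 label scheme:
    0 = answered 'yes', no hallucination; 1 = answered 'no', no hallucination;
    2 = answered 'yes', hallucinated;     3 = answered 'no', hallucinated."""
    hallucinated = answer_yes != ground_truth_yes
    if not hallucinated:
        return 0 if answer_yes else 1
    return 2 if answer_yes else 3

print(dhcp_label(True, True))   # -> 0 ("yes", correct)
print(dhcp_label(False, True))  # -> 3 ("no", hallucinated)
```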

Training the DHCP detector based on cross-modal attention and answer

python train_qwen_dhcp_detector.py --epoch 30 --sampler --gpu 0

The parameters are as follows:

  • sampler: Because samples with and without hallucinations are imbalanced, we stabilize training by weighting the sampling so that hallucinated and non-hallucinated samples are drawn at roughly equal rates.
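The weighting idea behind the sampler option can be illustrated by assigning each sample a weight inversely proportional to its class frequency; in practice these weights would feed something like torch.utils.data.WeightedRandomSampler. A pure-Python sketch of the weight computation (our illustration, not the repository's code):

```python
from collections import Counter

def balanced_weights(labels):
    """Weight each sample by 1 / (count of its class) so every class
    contributes equal total probability mass when sampling with replacement."""
    counts = Counter(labels)
    return [1.0 / counts[y] for y in labels]

# 3 non-hallucinated samples (label 0) vs 1 hallucinated sample (label 2):
labels = [0, 0, 0, 2]
weights = balanced_weights(labels)
# Each label-0 sample gets weight 1/3 and the label-2 sample gets weight 1,
# so both classes carry the same total weight.
```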

Testing the DHCP detector

python test_qwen_dhcp_detector.py --dhcp_path [path-to-dhcp-detector] --gpu 0

The parameters are as follows:

  • dhcp_path: Path to the DHCP detector weights.

The above reproduces the Qwen2.5-VL-7B results in Table 2 of the paper.

Citation

Since the conference proceedings have not yet been published, please cite the arXiv version:

@article{zhang2024dhcp,
  title={DHCP: Detecting Hallucinations by Cross-modal Attention Pattern in Large Vision-Language Models},
  author={Zhang, Yudong and Xie, Ruobing and Chen, Jiansheng and Sun, Xingwu and Wang, Yu and others},
  journal={arXiv preprint arXiv:2411.18659},
  year={2024}
}
