Assaf Ben-Kish, Moran Yanuka, Morris Alper, Raja Giryes, Hadar Averbuch-Elor
Hallucinated details are prevalent in the outputs of modern image captioning models. Prior work has largely focused on detecting or mitigating hallucinations by using closed-vocabulary object lists, which simplify the problem but fail to capture most types of hallucinations that occur in practice. By leveraging recent progress in generative foundation models, we propose a unified framework for quantifying and mitigating open-vocabulary hallucinations.
First, We introduce OpenCHAIR, a benchmark for evaluating open-vocabulary hallucinations which surpasses the existing benchmark CHAIR both in diversity and accuracy:
Additionally, we introduce MOCHa, a reinforcement learning-based approach that adjusts captioning models to output detailed, valid captions while avoiding such hallucinations:
git clone https://github.com/assafbk/mocha_code.git
cd mocha_code
To set up our environment, please run:
conda env create -f environment.yml
conda activate mocha
python -m spacy download en_core_web_sm
To perform evaluation over the OpenCHAIR benchmark:
-
Create a csv file with a single column titled 'generated_caption'. The following rows should contain the model's captions for OpenCHAIR's images. An example csv file can be found in:
OpenCHAIR/example_gen_file.csv
Additionally, we provide a script for generating such a file for the MOCHa-optimized BLIP-Base model, by running:
python OpenCHAIR/generate_captions.py \ --model-ckpt moranyanuka/blip-image-captioning-base-mocha \ --prompt "a photography of " \ --batch-size 100 --num-beams 5
Additional information:
model-ckpt
: The huggingface ckeckpoint of the model to be evaluated. Note that the script currently only supports BLIP-Base and BLIP-Large based models.prompt
: The prompt appended for the generation
-
Download the Concreteness Rating Dataset (xlsx format) from here.
-
Run the evaluation script:
python OpenCHAIR/evaluate.py \ --concreteness-dataset-path <path-to-concreteness-dataset> \ --generations-file-path <path-to-generated-captions-file>
More configuration options can be found in OpenCHAIR/evaluate.py
The OpenCHAIR dataset can also be accessed from 🤗 Here, and can be loaded as follows:
from datasets import load_dataset
dataset = load_dataset("moranyanuka/OpenCHAIR")['test']
- selecting a large
--batch-size
can significantly improve the evaluation script's runtime.
We currently support BLIP-Large on the MS-COCO Dataset (will add support for other models and datasets in the near future).
To run the training script:
python vlm_rlhf.py
The configuration file is vlm_rlhf_config.json
. Important configurations:
reward_model_weights
: List of weights for all rewards. First is the NLI weight and the second is the BERTScore weight (equivalent to alpha and 1-alpha in the paper). This field tunes the pareto frontier of the fidelity-adequacy curve. Initialized to [0.5,0.5].beta
: The weight for the kl-penalty reward. Initialized to 0.06.num_of_images_per_batch
: Number of images per PPO batch. Initialized to 10.num_of_samples_per_image
: Number of captions to generate per image. Initialized to 10.
In a single batch there are <num_of_images_per_batch> x <num_of_samples_per_image> captions.model_device
,ref_model_device
,reward_model_device
: Cuda device for each model.
All training metrics, including caption samples (for train and verification images) are displayed in the wandb webpage.
Additional configurations:
output_dir
: Where to save model checkpoints. Initialized to <project_dir>/output.cache_dir
: Huggingface cache dir for all models. Initialized to <project_dir>/hf_cache.activate_logging
: Enables wandb logging. Initialized to True.sampling_temperature
: Sampling temperature for the model. Initialized to 1.2.save_steps
: Model saving interval. Initialized to 200 (Note: best model is always saved regardless of this value).eval_steps
: Model evaluation interval. Initialized to 10.max_step
: Maximal amount of training steps. Initialized to 3000.
Check out vlm_rlhf_config.json
for more configurations.
Model type | Checkpoint |
---|---|
Blip-Base | 🤗 moranyanuka/blip-image-captioning-base-mocha |
Blip-Large | 🤗 moranyanuka/blip-image-captioning-large-mocha |
We will publish the checkpoints of additional models in the near future.
- If more than one GPU is available, we recommend setting
model_device
to the first GPU, andref_model_device
andreward_model_device
to the second GPU. (Motivation - the former requires grads hence uses the GPU memory more extensively). - If CUDA is running out of memory, try reducing the batch size (
num_of_images_per_batch
ornum_of_samples_per_image
or both). - To track the learning progress, keep an eye on the generated captions of the verification images (wandb -> Tables -> runs.summary["validation_data"])
- Additionally, it is helpful to look after validation_reward_mean and kl_dist (wandb -> Charts). The kl_dist should not be too large (in BLIP-Large, empirically, no more than 5). In parallel, we want to see validation_reward_mean increase under the small kl_dist constraint. kl_dist is controlled by beta (decreases when we increase beta).
If you find this useful for your research, please cite the following:
@misc{benkish2024mitigating,
title={Mitigating Open-Vocabulary Caption Hallucinations},
author={Assaf Ben-Kish and Moran Yanuka and Morris Alper and Raja Giryes and Hadar Averbuch-Elor},
year={2024},
eprint={2312.03631},
archivePrefix={arXiv},
primaryClass={cs.CV}
}