# Q&A Prompts: Discovering Rich Visual Clues through Mining Question-Answer Prompts for VQA requiring Diverse World Knowledge

This is the official implementation of the [Paper], accepted by ECCV'24.
- Clone this repository and navigate to the QA-Prompts folder:

```bash
git clone https://github.com/WHB139426/QA-Prompts.git
cd QA-Prompts
```
- Install the required packages:

```bash
conda create -n qaprompts python=3.9.16
conda activate qaprompts
pip install -r requirements.txt
```
We provide the annotation files of [A-OKVQA] in `./annotations`. You can also download them directly from [🤗HF].

The images can be downloaded from [COCO2017], and you should organize the data as follows:
```
├── coco2017
│   ├── train2017
│   ├── val2017
│   └── test2017
├── QA-Prompts
│   ├── annotations
│   │   ├── aokvqa_v1p0_train.json
│   │   ├── sub_qa.json
│   │   └── ...
│   ├── datasets
│   ├── models
│   └── ...
```
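Before training, it can save time to verify that the data is laid out as expected. The helper below is not part of this repo; it is a minimal sketch that checks the required paths and loads one annotation file, assuming the layout above and a common `root` directory containing both `coco2017` and `QA-Prompts`.

```python
import json
from pathlib import Path

# Hypothetical helper (not part of the repo): verify the expected directory
# layout exists under `root` before launching training.
REQUIRED = [
    "coco2017/train2017",
    "coco2017/val2017",
    "coco2017/test2017",
    "QA-Prompts/annotations/aokvqa_v1p0_train.json",
    "QA-Prompts/annotations/sub_qa.json",
]

def missing_paths(root):
    """Return the required paths that are absent under `root`."""
    root = Path(root)
    return [p for p in REQUIRED if not (root / p).exists()]

def load_annotations(root):
    """Load the A-OKVQA training annotations as a Python object."""
    path = Path(root) / "QA-Prompts/annotations/aokvqa_v1p0_train.json"
    with open(path) as f:
        return json.load(f)
```

If `missing_paths(root)` returns a non-empty list, those entries are the files or folders still to be downloaded or moved into place.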
You can prepare the pretrained weights of InstructBLIP-Vicuna-7B according to [InstructBLIP].
Since we have changed the structure of the model code, we recommend downloading the pretrained weights of EVA-CLIP, Vicuna-7B-v1.1, and the Q-Former directly from [🤗HF]. The pretrained weights should be organized as follows:
```
├── QA-Prompts
│   └── experiments
│       ├── eva_vit_g.pth
│       ├── qformer_vicuna.pth
│       ├── query_tokens_vicuna.pth
│       ├── vicuna-7b
│       └── llm_proj_vicuna.pth
```
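A quick way to confirm the downloads completed is to check each expected checkpoint by name. This is a hypothetical convenience function (not shipped with the repo), assuming the `experiments` layout above:

```python
from pathlib import Path

# Hypothetical check (not part of the repo): confirm the pretrained weights
# are all in place under QA-Prompts/experiments before training.
WEIGHTS = [
    "eva_vit_g.pth",
    "qformer_vicuna.pth",
    "query_tokens_vicuna.pth",
    "llm_proj_vicuna.pth",
]

def check_weights(experiments_dir):
    """Map each expected checkpoint name to whether it exists on disk."""
    d = Path(experiments_dir)
    status = {name: (d / name).is_file() for name in WEIGHTS}
    # vicuna-7b is a directory of LLM files, not a single .pth checkpoint
    status["vicuna-7b"] = (d / "vicuna-7b").is_dir()
    return status
```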
We recommend using GPUs with more than 24 GB of memory. Otherwise, you may need to extract the vision features in advance, which saves the memory EVA-CLIP would otherwise occupy and avoids out-of-memory (OOM) errors.
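The pre-extraction idea above can be sketched as a simple on-disk feature cache. This is only an illustration of the technique, not the repo's actual extraction code: `encode` stands in for the real EVA-CLIP forward pass, and the file naming is an assumption.

```python
import numpy as np
from pathlib import Path

# Sketch only: run the vision encoder once per image and cache the result,
# so the encoder does not need to stay resident in GPU memory during
# fine-tuning. `encode` is a stand-in for the real EVA-CLIP forward pass.
def cache_features(image_ids, encode, cache_dir):
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    for image_id in image_ids:
        out = cache_dir / f"{image_id}.npy"
        if not out.exists():  # skip images that are already cached
            np.save(out, encode(image_id))

def load_cached(image_id, cache_dir):
    """Read back a cached feature array instead of re-running the encoder."""
    return np.load(Path(cache_dir) / f"{image_id}.npy")
```

During training, the dataloader would then call `load_cached` instead of invoking the vision encoder.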
```bash
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m torch.distributed.launch --nproc_per_node=8 --master_port=1111 finetune_ans.py
```