A-Simple-Baseline-For-Knowledge-Based-VQA

Repo for the EMNLP 2023 paper "A Simple Baseline for Knowledge-Based Visual Question Answering"

method.png: Inference-time of our method for n-shot VQA.

Abstract

This paper is on the problem of Knowledge-Based Visual Question Answering (KB-VQA). Recent works have emphasized the significance of incorporating both explicit (through external databases) and implicit (through LLMs) knowledge to answer questions requiring external knowledge effectively. A common limitation of such approaches is that they consist of relatively complicated pipelines and often heavily rely on accessing GPT-3 API. Our main contribution in this paper is to propose a much simpler and readily reproducible pipeline which, in a nutshell, is based on efficient in-context learning by prompting LLaMA (1 and 2) using question-informative captions as contextual information. Contrary to recent approaches, our method is training-free, does not require access to external databases or APIs, and yet achieves state-of-the-art accuracy on the OK-VQA and A-OK-VQA datasets. Finally, we perform several ablation studies to understand important aspects of our method. Our code is publicly available at https://github.com/alexandrosXe/A-Simple-Baseline-For-Knowledge-Based-VQA
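
For illustration only, the following minimal Python sketch (not the repository's main.py) shows the core idea: each in-context example pairs a question-informative caption with a question and its answer, the examples are concatenated into a prompt, and a LLaMA model completes the answer for the test sample. The prompt wording, the example captions, and the generation settings below are placeholders, not the exact ones used in the paper.

# Minimal, illustrative sketch of n-shot prompting with question-informative
# captions (placeholder prompt wording and examples; not the repository's code).
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL = "meta-llama/Llama-2-13b-hf"  # same checkpoint passed via --llama_path below
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float16, device_map="auto")

def build_prompt(shots, test_caption, test_question):
    # shots: list of (caption, question, answer) in-context examples
    parts = ["Please answer the question according to the context.\n"]
    for caption, question, answer in shots:
        parts.append(f"Context: {caption}\nQuestion: {question}\nAnswer: {answer}\n")
    parts.append(f"Context: {test_caption}\nQuestion: {test_question}\nAnswer:")
    return "\n".join(parts)

shots = [("A man rides a wave on a surfboard.", "What sport is shown here?", "surfing")]
prompt = build_prompt(shots,
                      "A red double-decker bus drives down a London street.",
                      "In which country is this type of bus common?")
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=5, do_sample=False)
answer = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(answer.strip())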

Install

First, please install the necessary dependencies:

pip install -r requirements.txt

Usage

First, download the LLaMA weights and convert them to the Hugging Face format:

  • Weights for the LLaMA models can be obtained by filling out Meta's official request form.
  • After downloading the weights, convert them to the Hugging Face Transformers format using the conversion script shipped with Transformers (convert_llama_weights_to_hf.py); a quick loading check is sketched below.
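
As a quick, optional sanity check (the path below is a placeholder for the conversion script's output directory), the converted checkpoint should load with Hugging Face Transformers before running main.py:

# Sanity check: load the converted checkpoint with Transformers.
# "/path_to_converted_llama_hf/" is a placeholder for the conversion output_dir.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

converted_dir = "/path_to_converted_llama_hf/"
tokenizer = AutoTokenizer.from_pretrained(converted_dir)
model = AutoModelForCausalLM.from_pretrained(converted_dir, torch_dtype=torch.float16, device_map="auto")
print(model.config.model_type, model.config.num_hidden_layers)  # expect "llama" and the layer count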

To run the code, use the run_main.sh script, which invokes main.py as follows:

python main.py \
    --dataset ok_vqa \
    --evaluation_set val \
    --train_annotations_path annotations/ok_vqa/train_annots_fixed.csv.zip \
    --val_annotations_path annotations/ok_vqa/val_annots_fixed.csv.zip \
    --test_annotations_path None \
    --train_images_dir /path_to_the_train_images/ \
    --val_images_dir /path_to_the_val_images/ \
    --test_images_dir None \
    --n_shots 10 \
    --k_ensemble 5 \
    --no_of_captions 9 \
    --use_mcan_examples False \
    --mcan_examples_path mcan_examples/ok_vqa/examples.json \
    --llama_path meta-llama/Llama-2-13b-hf \
    --train_captions_path question_related_captions/ok_vqa/train_data_qr_captions_csv \
    --val_captions_path question_related_captions/ok_vqa/val_data_qr_captions_csv \
    --test_captions_path None \
    --blip_train_question_embedds_path blip_embedds/ok_vqa/blip_normalized_q_embedds/blip_train_question_embedds.csv.zip \
    --blip_train_image_embedds_path blip_embedds/ok_vqa/blip_normalized_i_embedds/blip_train_image_embedds.csv.zip \
    --blip_val_question_embedds_path blip_embedds/ok_vqa/blip_normalized_q_embedds/blip_val_question_embedds.csv.zip \
    --blip_val_image_embedds_path blip_embedds/ok_vqa/blip_normalized_i_embedds/blip_val_image_embedds.csv.zip \
    --path_to_save_preds results/ok_vqa_val_without_mcan_llama2.csv

Note that you must include the paths to the train, val, and test images.
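
The --n_shots and --blip_*_embedds_path arguments above indicate that in-context examples are retrieved using the precomputed, normalized BLIP question and image embeddings. The snippet below is an illustrative sketch of such a similarity-based selection step, under the assumption that the question and image similarities are simply averaged; it is not the repository's actual selection code.

# Illustrative sketch (assumption, not the repository's code): pick the n_shots
# training examples whose BLIP question and image embeddings are most similar to
# the test example's. Embeddings are assumed L2-normalized, so dot products
# equal cosine similarities.
import numpy as np

def select_shots(test_q_emb, test_i_emb, train_q_embs, train_i_embs, n_shots=10):
    q_sim = train_q_embs @ test_q_emb          # question-question similarity
    i_sim = train_i_embs @ test_i_emb          # image-image similarity
    combined = (q_sim + i_sim) / 2.0           # simple average (illustrative choice)
    return np.argsort(-combined)[:n_shots]     # indices of the top-n training examples

# Toy usage with random vectors standing in for real BLIP embeddings
rng = np.random.default_rng(0)
train_q = rng.normal(size=(100, 256)); train_q /= np.linalg.norm(train_q, axis=1, keepdims=True)
train_i = rng.normal(size=(100, 256)); train_i /= np.linalg.norm(train_i, axis=1, keepdims=True)
test_q = rng.normal(size=256); test_q /= np.linalg.norm(test_q)
test_i = rng.normal(size=256); test_i /= np.linalg.norm(test_i)
print(select_shots(test_q, test_i, train_q, train_i))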

Results

See the results folder for the results reported in the main paper.

Citation

If you find our work useful, please consider citing our paper:

@inproceedings{xenos-etal-2023-simple,
    title = "A Simple Baseline for Knowledge-Based Visual Question Answering",
    author = "Xenos, Alexandros  and
      Stafylakis, Themos  and
      Patras, Ioannis  and
      Tzimiropoulos, Georgios",
    editor = "Bouamor, Houda  and
      Pino, Juan  and
      Bali, Kalika",
    booktitle = "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing",
    month = dec,
    year = "2023",
    address = "Singapore",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.emnlp-main.919",
    pages = "14871--14877",
    abstract = "This paper is on the problem of Knowledge-Based Visual Question Answering (KB-VQA). Recent works have emphasized the significance of incorporating both explicit (through external databases) and implicit (through LLMs) knowledge to answer questions requiring external knowledge effectively. A common limitation of such approaches is that they consist of relatively complicated pipelines and often heavily rely on accessing GPT-3 API. Our main contribution in this paper is to propose a much simpler and readily reproducible pipeline which, in a nutshell, is based on efficient in-context learning by prompting LLaMA (1 and 2) using question-informative captions as contextual information. Contrary to recent approaches, our method is training-free, does not require access to external databases or APIs, and yet achieves state-of-the-art accuracy on the OK-VQA and A-OK-VQA datasets. Finally, we perform several ablation studies to understand important aspects of our method. Our code is publicly available at https://github.com/alexandrosXe/ASimple-Baseline-For-Knowledge-Based-VQA",
}

Contact

Please feel free to get in touch at: a.xenos@qmul.ac.uk
