By Tanmay Gupta, Amita Kamath, Aniruddha Kembhavi, and Derek Hoiem
Welcome to the official code base for GPV-1, a general purpose vision-language architecture that can learn and perform any task that requires bounding boxes or text prediction. We demonstrate the effectiveness of GPV-1 by jointly training it on VQA, Captioning, Localization, and Classification tasks and achieving favorable performance in comparison to specialized single-task models.
Available on Arxiv: https://arxiv.org/abs/2104.00743
Project Page: https://prior.allenai.org/projects/gpv
Demo: https://vision-explorer.allenai.org/general_purpose_vision
BibTex:
@article{Gupta2021GPV,
title={Towards General Purpose Vision Systems},
author={Tanmay Gupta and A. Kamath and Aniruddha Kembhavi and Derek Hoiem},
journal={ArXiv},
year={2021},
volume={abs/2104.00743}
}
git clone --recurse-submodules git@github.com:allenai/gpv-1.git
Create conda environment
conda create -n gpv python=3.6 -y
conda activate gpv
Install libraries
bash setup_conda_env.sh
Decide the following paths:
- <data_dir>: This is the directory where images and annotations will be saved
- <output_dir>: This is where outputs of various experiments will be saved including model checkpoints, visualization, inference and evaluation results

<data_dir> and <output_dir> refer to these absolute paths in the instructions below.
To study generalization of concepts across skills, we created a new split of COCO annotations called COCO-SCE. To download the original split, our new split, and pretrained DETR checkpoints for both splits, run the following:
bash setup_data.sh <data_dir>
Note - If you intend to run experiments only on COCO-SCE, you can skip downloading COCO test images and save time and disk space by setting download_coco_test_images=False in setup_data.sh.
Model | Split | Download |
---|---|---|
GPV | COCO | Link |
GPV | COCO-SCE | Link |
To use any of these models, download them into the <output_dir>/<exp_name>/ckpts directory as follows:
wget <link> -P <output_dir>/<exp_name>/ckpts/
<exp_name> could be any directory name of your choice such as gpv_coco or gpv_coco_sce.
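As a minimal sketch of the directory layout described above, the following helper assembles the checkpoint directory and the matching wget invocation. The function names `ckpt_dir` and `wget_command` are hypothetical and not part of this repository; the paths in the example are placeholders.

```python
from pathlib import PurePosixPath
from typing import List


def ckpt_dir(output_dir: str, exp_name: str) -> PurePosixPath:
    """Return <output_dir>/<exp_name>/ckpts (hypothetical helper)."""
    root = PurePosixPath(output_dir)
    if not root.is_absolute():
        # The instructions assume <output_dir> is an absolute path.
        raise ValueError("output_dir should be an absolute path")
    return root / exp_name / "ckpts"


def wget_command(link: str, output_dir: str, exp_name: str) -> List[str]:
    """Build the wget call shown above as an argument list."""
    return ["wget", link, "-P", str(ckpt_dir(output_dir, exp_name))]


if __name__ == "__main__":
    # Placeholder values: substitute a link from the table and your own paths.
    print(" ".join(wget_command("<link>", "/tmp/gpv_outputs", "gpv_coco")))
```

The argument list can be passed to `subprocess.run` if you prefer to script the download rather than type the command by hand.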
We provide easy to use interactive IPython notebooks where you may provide an image and a natural language task description and visualize the model's outputs, namely bounding boxes for relevant image regions and a text answer. Note that while some tasks might expect only one of the output modalities, the model always outputs both. For example, the model outputs relevant regions during captioning and text during localization. These auxiliary outputs may be unsolicited but often provide useful and diagnostic information.
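To make the "both modalities, always" behaviour concrete, here is a small sketch of how such an output could be represented and filtered. Everything in it is an assumption for illustration: the `GpvOutput` class, the `(x1, y1, x2, y2)` box format, the per-box relevance scores, and the 0.5 threshold are hypothetical and do not describe the repository's actual data structures.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical box format: (x1, y1, x2, y2).
Box = Tuple[float, float, float, float]


@dataclass
class GpvOutput:
    """Hypothetical container holding both output modalities for one query."""
    boxes: List[Box]
    relevance: List[float]  # assumed: one relevance score per box
    text: str


def relevant_boxes(out: GpvOutput, threshold: float = 0.5) -> List[Box]:
    """Keep only the boxes whose relevance score reaches the threshold."""
    return [box for box, score in zip(out.boxes, out.relevance) if score >= threshold]


if __name__ == "__main__":
    out = GpvOutput(
        boxes=[(0.0, 0.0, 1.0, 1.0), (1.0, 1.0, 2.0, 2.0)],
        relevance=[0.9, 0.2],
        text="a dog",
    )
    # A captioning query would mainly use out.text, but the boxes are still there.
    print(out.text, relevant_boxes(out))
```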
We provide the following notebooks:
- inference.ipynb: This demonstrates inference for GPV-1 using greedy inference for text decoding as used in all experiments in our paper.
- inference_beam_search.ipynb: Post-submission, we implemented beam search! Setting the beam size to 1 recovers greedy inference, while larger beam sizes allow sampling multiple high ranking text outputs, which is especially useful for tasks with multiple plausible outputs such as captioning.
We also provide equivalent .py scripts to run inference on a single image and task description pair. To run these scripts, update output_dir, ckpt, inputs.img, and inputs.query in configs/exp/gpv_inference_cmdline.yaml.
For inference with beam search run:
python -m inference_beam_search beam_size=5
For greedy decoding either set beam_size to 1 in the previous command or run the following:
python -m inference
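The relationship between the two decoding modes can be illustrated with a toy example. The sketch below is not the repository's implementation: `beam_search` and `toy_model` are hypothetical names, and the hand-written probability table stands in for a real text decoder. It shows that a beam size of 1 behaves like greedy decoding, while a larger beam can find a sequence with higher overall probability.

```python
import math
from typing import Callable, Dict, List, Tuple

# Maps a prefix (tuple of tokens) to a distribution over the next token.
NextProbs = Callable[[Tuple[str, ...]], Dict[str, float]]


def beam_search(next_probs: NextProbs, beam_size: int, max_len: int,
                eos: str = "<eos>") -> List[Tuple[Tuple[str, ...], float]]:
    """Return up to beam_size (sequence, log-probability) pairs, best first."""
    beams: List[Tuple[Tuple[str, ...], float]] = [((), 0.0)]
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq and seq[-1] == eos:
                candidates.append((seq, score))  # finished hypotheses carry over
                continue
            for tok, p in next_probs(seq).items():
                candidates.append((seq + (tok,), score + math.log(p)))
        # Keep only the beam_size highest-scoring hypotheses.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]
        if all(seq[-1] == eos for seq, _ in beams):
            break
    return beams


def toy_model(prefix: Tuple[str, ...]) -> Dict[str, float]:
    """Toy distribution in which the greedy first choice is not globally best."""
    table = {
        (): {"a": 0.6, "the": 0.4},
        ("a",): {"dog": 0.55, "cat": 0.45},
        ("the",): {"dog": 0.9, "cat": 0.1},
    }
    return table.get(prefix, {"<eos>": 1.0})


if __name__ == "__main__":
    print(beam_search(toy_model, beam_size=1, max_len=3)[0][0])  # greedy path
    print(beam_search(toy_model, beam_size=2, max_len=3)[0][0])  # best of two beams
```

With a beam size of 1 the search commits to "a" (probability 0.6) and ends with a sequence of probability 0.33; with a beam size of 2 it also keeps "the" alive and finds a sequence of probability 0.36.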
We provide scripts for training GPV on one or more of the following tasks:
- CocoClassification
- CocoVqa
- CocoDetection (referred to as the Localization task in the paper)
- CocoCaptioning
Training GPV-1 involves the following steps:
- Step 1: Update the configs/exp/gpv.yaml file. Here are the key parameters to consider (the ones marked with a star will be set later in Step 3):
  - num_gpus_per_node (set to 4 if you have 24GB GPUs, 2 for 48GB, and 1 for 80GB)
  - dist_url
  - output_dir *
  - data_dir *
  - model.pretr_detr *
- Step 2: Decide the dataset or combination of supported datasets to train the model. This is specified through one of the files in configs/learning_datasets. For instance, all.yaml trains on all 4 tasks, cap_vqa.yaml trains on CocoCaptioning & CocoVqa, and cap.yaml trains only on CocoCaptioning. If you don't see the dataset combination you need, you may add one by modifying all.yaml. We refer to the name of the chosen yaml file without the extension as <learning_datasets>.
- Step 3: Launch training as follows:
  bash exp/gpv/scripts/train.sh <learning_datasets> <data_split> <exp_name> <output_dir> <data_dir>
  - <learning_datasets>: set to all to train on all 4 tasks or to the name of one of the yaml files in configs/learning_datasets which specifies the tasks to train on
  - <data_split>: set to original_split (COCO) or gpv_split (COCO-SCE)
  - <exp_name>: name of the experiment directory (<output_dir>/<exp_name>). This is where model checkpoints, visualization, and other experiment related data will be saved

  Note that training comprises 2 sub-steps. First, the model is trained for training.frozen_epochs (in configs/exp/gpv.yaml) epochs with DETR weights frozen. Then the model is finetuned end-to-end for a total of training.num_epochs epochs. train_gpv.sh executes both steps sequentially. model.pretr_detr is selected automatically in train.sh based on <data_split>.
- Step 4: Visualize loss, metrics, and learning rate on tensorboard:
  tensorboard --logdir=<output_dir> --bind_all
- Step 5: Predictions are visualized on a small set of train and validation set samples every few thousand iterations (training.vis_step). These are available at <output_dir>/<exp_name>/training_visualizations
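The two training sub-steps described in Step 3 can be pictured as a per-epoch schedule. The sketch below is only an illustration: `training_schedule` is a hypothetical helper, and it assumes that training.num_epochs counts the total number of epochs including the frozen ones, which is one reading of "for a total of training.num_epochs epochs". Check configs/exp/gpv.yaml and the training script for the actual behaviour.

```python
from typing import List


def training_schedule(frozen_epochs: int, num_epochs: int) -> List[str]:
    """Label each epoch as 'frozen_detr' or 'end_to_end' (hypothetical helper).

    Assumption: num_epochs is the total epoch count, frozen epochs included.
    """
    if not 0 <= frozen_epochs <= num_epochs:
        raise ValueError("expected 0 <= frozen_epochs <= num_epochs")
    return [
        "frozen_detr" if epoch < frozen_epochs else "end_to_end"
        for epoch in range(num_epochs)
    ]


if __name__ == "__main__":
    # Illustrative values only; they are not the repository's defaults.
    for epoch, phase in enumerate(training_schedule(frozen_epochs=2, num_epochs=5)):
        print(epoch, phase)
```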
We provide evaluation code for the following tasks:
- CocoClassification
- CocoVqa
- CocoDetection (referred to as the Localization task in the paper)
- CocoCaptioning
- RefCocop
Run the following command to evaluate on one task or a set of tasks:
bash exp/gpv/scripts/eval.sh <exp_name> <task_name> <subset> <split> <output_dir> <data_dir>
- <exp_name>: name of the experiment directory (<output_dir>/<exp_name>) where the model to be evaluated lives.
- <task_name>: set to all to evaluate on all 5 tasks, all_but_refexp to evaluate on all tasks except RefCocop, or the name of a task to evaluate only on that task.
- <subset>: set to train or val for COCO (no test since COCO test annotations are hidden) and train, val, or test for COCO-SCE.
- <split>: set to original_split (COCO) or gpv_split (COCO-SCE). This flag is unused for RefCocop.
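The constraints on <subset> and <split> listed above can be checked before launching a long evaluation. The following is a minimal sketch: `eval_command` and `VALID_SUBSETS` are hypothetical names, and the check only models the COCO and COCO-SCE rules stated above (it does not model RefCocop, for which the split flag is unused).

```python
from typing import Dict, List, Set

# Subsets available for each split, as listed in the argument descriptions above.
VALID_SUBSETS: Dict[str, Set[str]] = {
    "original_split": {"train", "val"},        # COCO: test annotations are hidden
    "gpv_split": {"train", "val", "test"},     # COCO-SCE
}


def eval_command(exp_name: str, task_name: str, subset: str, split: str,
                 output_dir: str, data_dir: str) -> List[str]:
    """Validate subset/split and build the eval.sh call as an argument list."""
    if split not in VALID_SUBSETS:
        raise ValueError(f"unknown split: {split}")
    if subset not in VALID_SUBSETS[split]:
        raise ValueError(f"subset {subset!r} is not available for {split}")
    return ["bash", "exp/gpv/scripts/eval.sh",
            exp_name, task_name, subset, split, output_dir, data_dir]


if __name__ == "__main__":
    # Placeholder paths: substitute your own <output_dir> and <data_dir>.
    print(" ".join(eval_command("gpv_coco", "all_but_refexp", "val",
                                "original_split", "/tmp/out", "/tmp/data")))
```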
Predictions and metrics are saved at <output_dir>/<exp_name>/eval.
If you wish to evaluate captioning or VQA performance on the COCO test images, we provide scripts to generate predictions in the format expected by their respective official evaluation servers (Captioning eval server, VQA eval server). You may run these as follows:
bash exp/gpv/scripts/eval_<cap/vqa>_test.sh <exp_name> <subset> <output_dir> <data_dir>
<subset> may be test or testdev for VQA and val or test for Captioning.
GPV-1 can be finetuned on your data. To evaluate GPV-1's learning efficiency and extent of catastrophic forgetting, we provide scripts to finetune GPV on RefCocop. These scripts may also be used as an example of finetuning GPV on your own data.
To finetune pretrained GPV-1 on RefCocop, run the following
bash exp/gpv/scripts/ft_gpv.sh <ckpt> <train_perc> <output_dir> <data_dir>
- <ckpt>: absolute path of the GPV-1 checkpoint that you want to initialize the training with
- <train_perc>: percentage of the full RefCocop training set to use for learning. Supported values include 1, 2, 5, 10, 25, 50, 75, 100. These subsampled subsets can be found in <data_dir>/learning_phase_data/refcocop/

The evaluation script described in the previous section works for RefCocop evaluation as well.
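Since only a fixed set of <train_perc> values is supported, it can help to validate the value before launching finetuning. This is a minimal sketch with hypothetical names (`SUPPORTED_TRAIN_PERC`, `ft_command`); the script path and argument order are copied from the command above, and the paths in the example are placeholders.

```python
from typing import List

# Percentages listed above as supported subsets of the RefCocop training set.
SUPPORTED_TRAIN_PERC = (1, 2, 5, 10, 25, 50, 75, 100)


def ft_command(ckpt: str, train_perc: int, output_dir: str, data_dir: str) -> List[str]:
    """Validate train_perc and build the ft_gpv.sh call as an argument list."""
    if train_perc not in SUPPORTED_TRAIN_PERC:
        raise ValueError(
            f"train_perc must be one of {SUPPORTED_TRAIN_PERC}, got {train_perc}"
        )
    return ["bash", "exp/gpv/scripts/ft_gpv.sh",
            ckpt, str(train_perc), output_dir, data_dir]


if __name__ == "__main__":
    # Placeholder checkpoint path and directories.
    print(" ".join(ft_command("/tmp/out/gpv_coco/ckpts/<ckpt_file>", 10,
                              "/tmp/out", "/tmp/data")))
```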
- The current hyperparameters are chosen for training GPV-1 with a batch size of 120 samples. This leads to significant GPU memory and compute requirements during training (e.g. 5-7 days of training on four 24GB GPUs).
- While training is memory intensive, evaluation is easily run on a single GPU (you may further reduce batch size for evaluation using the eval.batch_size flag in the gpv.yaml file if working with low memory GPUs).
- It may be possible to trade off GPU memory against training time by reducing training batch size using the training.batch_size flag. However, this might require tuning the hyperparameters to achieve competitive performance.
- Finally, if working with COCO-like data or when finetuning from a pretrained GPV-1 checkpoint, you might be able to get good performance with low GPU memory requirements by freezing the DETR backbone (training.freeze=True) and only training the remaining modules.
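If you do reduce training.batch_size, one common starting point for retuning is the linear learning-rate scaling heuristic, which scales the learning rate in proportion to the batch size. This heuristic is not taken from this repository or the paper, and it is not guaranteed to reproduce the reported results; the sketch below uses a hypothetical `scaled_lr` helper and an illustrative reference learning rate, and should be treated as a first guess before proper tuning.

```python
# Batch size the released hyperparameters were chosen for (stated above).
REFERENCE_BATCH_SIZE = 120


def scaled_lr(reference_lr: float, batch_size: int) -> float:
    """Linear scaling heuristic: learning rate proportional to batch size.

    This is a general rule of thumb, not the authors' recommendation.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return reference_lr * batch_size / REFERENCE_BATCH_SIZE


if __name__ == "__main__":
    # 1e-4 is an illustrative value, not a value read from gpv.yaml.
    for bs in (120, 60, 30):
        print(bs, scaled_lr(1e-4, bs))
```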