UpDown Captioner Baseline for nocaps

Baseline model for the nocaps benchmark: a re-implementation of the UpDown image captioning model, trained on the COCO dataset only.

Check out our package documentation at nocaps.org/updown-baseline!

[Figure: predictions generated by the UpDown model]

If you find this code useful, please consider citing:

@article{nocaps,
  author  = {Harsh Agrawal* and Karan Desai* and Yufei Wang and Xinlei Chen and Rishabh Jain and
             Mark Johnson and Dhruv Batra and Devi Parikh and Stefan Lee and Peter Anderson},
  title   = {{nocaps}: {n}ovel {o}bject {c}aptioning {a}t {s}cale},
  journal = {arXiv preprint arXiv:1812.08658},
  year    = {2018},
}

As well as the paper that proposed this model:

@inproceedings{Anderson2017up-down,
  author    = {Peter Anderson and Xiaodong He and Chris Buehler and Damien Teney and Mark Johnson
               and Stephen Gould and Lei Zhang},
  title     = {Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering},
  booktitle = {CVPR},
  year      = {2018}
}

How to set up this codebase?

This codebase requires Python 3.6 or higher. It uses PyTorch v1.1 and has out-of-the-box support for CUDA 9 and CuDNN 7. The recommended way to set up this codebase is through Anaconda or Miniconda, although it should work just as well with virtualenv.

Install Dependencies

  1. Install the Anaconda or Miniconda distribution (Python 3) from its downloads site.

  2. Clone the repository.

git clone https://www.github.com/nocaps-org/updown-baseline
cd updown-baseline
  3. Create a conda environment, install all the dependencies, and install this codebase as a package in development mode.
conda create -n updown python=3.6
conda activate updown
pip install -r requirements.txt
python setup.py develop

Note: If the evalai package fails to install, install these system packages and try again:

sudo apt-get install libxml2-dev libxslt1-dev

Now you can import updown from anywhere in your filesystem, as long as this conda environment is activated.
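
A quick, optional sanity check that the install worked (assuming the updown environment from above is active):

python -c "import updown"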

Download Image Features

We provide pre-extracted bottom-up features for the COCO and nocaps splits. These are extracted using a Faster R-CNN detector pretrained on Visual Genome (Anderson et al. 2017). We extract features from at most 100 region proposals per image, keeping those above a confidence threshold of 0.2; this yields 10-100 features per image (adaptive).

Download (or symlink) the image features under the $PROJECT_ROOT/data directory:

coco_train2017, coco_val2017, nocaps_val, nocaps_test.
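
For example, one way to place the downloaded features (the filenames below are placeholders; use whatever names the downloaded files actually have):

mkdir -p data
# symlink (or copy) each downloaded feature file into data/; names here are hypothetical
ln -s /path/to/downloads/coco_train2017_features.h5 data/
ln -s /path/to/downloads/coco_val2017_features.h5 data/
ln -s /path/to/downloads/nocaps_val_features.h5 data/
ln -s /path/to/downloads/nocaps_test_features.h5 data/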

Download Annotations

Download COCO captions and the nocaps val/test image info, and arrange them in a directory structure as follows:

$PROJECT_ROOT/data
    |-- coco
    |   +-- annotations
    |       |-- captions_train2017.json
    |       +-- captions_val2017.json
    +-- nocaps
        +-- annotations
            |-- nocaps_val_image_info.json
            +-- nocaps_test_image_info.json
  1. COCO captions: http://images.cocodataset.org/annotations/annotations_trainval2017.zip
  2. nocaps val image info: https://s3.amazonaws.com/nocaps/nocaps_val_image_info.json
  3. nocaps test image info: https://s3.amazonaws.com/nocaps/nocaps_test_image_info.json
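
For example, these can be fetched and arranged as above roughly like so (the COCO zip also contains instance and keypoint annotations, which are not needed here):

mkdir -p data/coco data/nocaps/annotations
wget http://images.cocodataset.org/annotations/annotations_trainval2017.zip
unzip annotations_trainval2017.zip -d data/coco    # extracts into data/coco/annotations/
wget -O data/nocaps/annotations/nocaps_val_image_info.json https://s3.amazonaws.com/nocaps/nocaps_val_image_info.json
wget -O data/nocaps/annotations/nocaps_test_image_info.json https://s3.amazonaws.com/nocaps/nocaps_test_image_info.json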

Vocabulary

Build the caption vocabulary using COCO train2017 captions:

python scripts/build_vocabulary.py -c data/coco/captions_train2017.json -o data/vocabulary

Constrained Beam Search

We need the following Open Images class metadata to run Constrained Beam Search:

  1. class-descriptions-boxable.csv (Open Images class list): https://storage.googleapis.com/openimages/2018_04/class-descriptions-boxable.csv
  2. bbox_labels_600_hierarchy_readable.json (Open Images class hierarchy): http://bit.ly/2MA5PVC
  3. oi_concepts_to_words.txt (class vocabulary): http://bit.ly/2NvhIvC

Please download them into data/cbs/. By default, we use Constrained Beam Search to decode our model, but you can set MODEL.USE_CBS to False to disable it.
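
A minimal sketch of fetching these with wget (output filenames follow the list above):

mkdir -p data/cbs
wget -O data/cbs/class-descriptions-boxable.csv https://storage.googleapis.com/openimages/2018_04/class-descriptions-boxable.csv
wget -O data/cbs/bbox_labels_600_hierarchy_readable.json http://bit.ly/2MA5PVC
wget -O data/cbs/oi_concepts_to_words.txt http://bit.ly/2NvhIvC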

Evaluation Server

nocaps val and test splits are held privately behind EvalAI. To evaluate on nocaps, create an account on EvalAI and get the auth token from your profile details. Set the token through the EvalAI CLI as follows:

evalai set_token <your_token_here>
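
To check that the CLI and token are set up, you can list the challenges visible to your account (assuming the standard evalai CLI installed via the requirements):

evalai challenges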

You are all set to use this codebase!

Training

We manage experiments through config files -- a config file should contain arguments specific to a particular experiment, such as those defining the model architecture or optimization hyperparameters. Other arguments, such as GPU IDs or the number of CPU workers, are declared in the script and passed in as argparse-style arguments. Train a baseline UpDown Captioner with all the default hyperparameters as follows; this reproduces the results in the first row of the nocaps val/test tables from our paper.

python scripts/train.py \
    --config-yml configs/updown_nocaps_val.yaml \
    --gpu-ids 0 --serialization-dir checkpoints/updown-baseline

Refer to updown/config.py for default hyperparameters. For other configurations, pass a path to a config file through the --config-yml argument, and/or a set of key-value pairs through the --config-override argument. For example:

python scripts/train.py \
    --config-yml configs/updown_nocaps_val.yaml \
    --config-override OPTIM.BATCH_SIZE 250 \
    --gpu-ids 0 --serialization-dir checkpoints/updown-baseline
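
The same mechanism covers the CBS switch mentioned earlier; for example, a training run with Constrained Beam Search decoding disabled might look like:

python scripts/train.py \
    --config-yml configs/updown_nocaps_val.yaml \
    --config-override MODEL.USE_CBS False \
    --gpu-ids 0 --serialization-dir checkpoints/updown-baseline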

Multi-GPU Training

Multi-GPU training is fully supported; pass GPU IDs as --gpu-ids 0 1 2 3.

Saving Model Checkpoints

This script serializes model checkpoints every few iterations and keeps track of the best-performing checkpoint based on the overall CIDEr score. Refer to updown/utils/checkpointing.py for details on how checkpointing is managed. A copy of the configuration file used for a particular experiment is also saved under --serialization-dir.

Logging

This script logs loss curves and metrics to TensorBoard; log files are written under --serialization-dir. Execute tensorboard --logdir /path/to/serialization_dir --port 8008 and visit localhost:8008 in your browser.
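
For example, with the serialization directory used in the training command above:

tensorboard --logdir checkpoints/updown-baseline --port 8008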

Evaluation and Inference

Generate predictions for nocaps val or nocaps test using a pretrained checkpoint:

python scripts/inference.py \
    --config-yml /path/to/config.yaml \
    --checkpoint-path /path/to/checkpoint.pth \
    --output-path /path/to/save/predictions.json \
    --gpu-ids 0

Add the --evalai-submit flag if you wish to submit the predictions directly to EvalAI and get results.
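
For example, an inference run that also submits to EvalAI (assuming the token from the Evaluation Server section is set):

python scripts/inference.py \
    --config-yml /path/to/config.yaml \
    --checkpoint-path /path/to/checkpoint.pth \
    --output-path /path/to/save/predictions.json \
    --gpu-ids 0 \
    --evalai-submit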

Results

A pre-trained checkpoint with the provided config is available for download here:

  1. Checkpoint (.pth file): http://bit.ly/2ZctSMj
  2. Predictions on nocaps val: https://bit.ly/2YKxxBA
  3. Predictions on nocaps test: https://bit.ly/2XBs0R4
           in-domain      near-domain    out-of-domain  overall
split      CIDEr  SPICE   CIDEr  SPICE   CIDEr  SPICE   BLEU1  BLEU4  METEOR  ROUGE  CIDEr  SPICE
val        75.8   11.7    58.0   10.3    32.9   8.1     73.1   18.0   22.7    50.2   55.4   10.1
val-CBS    78.4   12.0    73.3   11.5    70.0   9.8     75.9   17.6   24.0    51.3   73.4   11.2
test       X      X       X      X       X      X       X      X      X       X      X      X
test-CBS   X      X       X      X       X      X       X      X      X       X      X      X
