This repository contains the reference code for the paper *Partial Non-Autoregressive Image Captioning*.
- Python 3
- PyTorch (>1.0)
- torchvision
Annotations and detection features for the COCO dataset are required. Please download the annotations file `annotations.zip` and extract it.
Detection features are computed with the code provided by [1]. Please download the COCO features file `coco_detections.hdf5` (~53.5 GB), in which the detections of each image are stored under the `<image_id>_features` key. `<image_id>` is the id of each COCO image, without leading zeros (e.g. the `<image_id>` for `COCO_val2014_000000037209.jpg` is `37209`), and each value is a `(N, 2048)` tensor, where `N` is the number of detections.
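As a minimal sketch of the key layout described above, the features for a given image can be read with `h5py` along these lines (the helper name `load_detections` is hypothetical, not part of this repository):

```python
import h5py
import numpy as np
import torch

def load_detections(h5_path, image_id):
    """Return the (N, 2048) detection features for a COCO image id.

    Assumes the file stores each image's detections under the
    "<image_id>_features" key, with image_id written without leading zeros.
    """
    with h5py.File(h5_path, "r") as f:
        feats = f["%d_features" % image_id][()]  # NumPy array, shape (N, 2048)
    return torch.from_numpy(np.asarray(feats, dtype=np.float32))

# e.g. features for COCO_val2014_000000037209.jpg:
# feats = load_detections("coco_detections.hdf5", 37209)  # shape (N, 2048)
```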
This repository is built on top of the Meshed-Memory Transformer codebase.
[1] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.