This is the official implementation of our BMVC 2019 paper Look and Modify: Modification Networks for Image Captioning | arXiv | Poster
Python 3.6 and PyTorch 0.4
Download the COCO 2014 dataset from here. In particular, you'll need the 2014 Training and Validation images.
Then download Karpathy's Train/Val/Test Split. You may download it from here.
If you want to do evaluation on COCO, download the COCO API from here if you're on Linux, or from here if you're on Windows. Then download the COCO caption toolkit from here and rename the folder to `cococaptioncider`. Don't forget to download Java as well; simply download it from here if you don't have it.
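Before moving on, it can help to confirm that the COCO API and annotations are set up correctly. The snippet below is only a sketch (not part of this repo), and the annotation path is an assumption; point it at wherever you extracted the 2014 caption annotations.

```python
# Quick sanity check that the COCO API is installed and can read the annotations.
from pycocotools.coco import COCO

coco = COCO('annotations/captions_val2014.json')  # assumed path to the 2014 val captions
img_ids = coco.getImgIds()
print('Loaded {} validation images'.format(len(img_ids)))

# Print the ground-truth captions of the first image
ann_ids = coco.getAnnIds(imgIds=img_ids[:1])
for ann in coco.loadAnns(ann_ids):
    print(ann['caption'])
```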
Next, download the bottom-up image features from here. If you're modifying captions from any framework that uses ResNet features (e.g., Attend and Tell, Adaptive Attention), you may skip this step. Use the instructions from this repo to extract the features and indices. The generated files should be placed in a folder called `bottom-up features`.
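To verify the extraction worked, you can inspect the generated files along these lines. This is a hedged sketch: both the file name and the dataset key below are assumptions and should be replaced with whatever names the extraction repository actually writes out.

```python
# Inspect the extracted bottom-up features (names are assumptions).
import h5py

with h5py.File('bottom-up features/train36.hdf5', 'r') as h:  # assumed file name
    print(list(h.keys()))              # see which datasets were written
    feats = h['image_features']        # assumed key; region features, often (N, 36, 2048)
    print(feats.shape)
```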
Then download the caption data folder from here, which includes the following:

- Caption utilities: a dictionary file with the following structure: `{"COCO image name": {"caption": "previous caption to modify", "embedding": [512-d DAN embedding of the previous caption], "attributes": [indices of the 5 extracted attributes], "image_ids": the COCO image id}}`
- COCO image names with IDs, in the following format: `["COCO_val2014_000000391895.jpg", 391895]`. This is used for evaluation on COCO.
- Annotations: the training annotations
- Caption lengths: the training caption lengths
- Wordmap: a dictionary mapping each word to its corresponding index
- Bottom-up features mapping to images: the mapping of the bottom-up features to their corresponding COCO images, in the corresponding order
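As a quick way to see what the caption utilities dictionary holds, a sketch like the one below can be used. The file name is a placeholder (use the actual name inside the `caption data` folder), and it assumes the dictionary is stored as JSON.

```python
# Inspect one entry of the caption utilities dictionary (file name is assumed).
import json

with open('caption data/caption_utilities.json', 'r') as f:  # assumed name
    caption_utils = json.load(f)

name, entry = next(iter(caption_utils.items()))
print(name)                      # e.g. "COCO_val2014_000000391895.jpg"
print(entry['caption'])          # previous caption to be modified
print(len(entry['embedding']))   # 512 (DAN embedding of that caption)
print(entry['attributes'])       # indices of the 5 extracted attributes
print(entry['image_ids'])        # COCO image id, used for evaluation
```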
For more information on the preparation of this dataset, see the `data preperation` folder. Here, the previous captions are embedded first using Google's Universal Sentence Encoder, available at TensorFlow Hub here, and are loaded into the model for faster processing. You can find how to extract the features from the sentences in the `data preperation` folder. If you would like to implement your own DAN, use the code provided in `util/dan.py`, which makes use of GloVe word embeddings. You can download the 300-d 6B trained word vectors from here for use in this function. Moreover, you may ignore the load embedding function if you'd like to train the word vectors from scratch using `nn.Embedding`.
In our paper, we make use of variational dropout to effectively regularize our language model: a single dropout mask is sampled and reused across all timesteps, so every timestep of the language model receives the same mask. This implementation is included here as well. If you change the dimension of your LSTM hidden state, make sure to adjust it accordingly in the `getLockedDropooutMask` function.
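The idea can be summarized with the sketch below. It is a generic illustration of locked (variational) dropout, not the repo's `getLockedDropooutMask` itself; the shapes, the 0.5 rate, and the loop structure are assumptions.

```python
# Locked (variational) dropout: sample one Bernoulli mask per sequence
# and reuse it at every timestep, instead of resampling each step.
import torch

def get_locked_dropout_mask(batch_size, hidden_dim, dropout=0.5, device='cpu'):
    # inverted-dropout scaling keeps the expected activation value unchanged
    mask = torch.bernoulli(
        torch.full((batch_size, hidden_dim), 1 - dropout, device=device))
    return mask / (1 - dropout)

# Illustrative use inside a decoding loop:
# mask = get_locked_dropout_mask(h.size(0), h.size(1), dropout=0.5, device=h.device)
# for t in range(max_timesteps):
#     h, c = lstm_cell(x[:, t], (h, c))
#     h_dropped = h * mask   # the same mask is applied at every timestep
```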
The training code is provided in `train_eval.py`, the caption and attention map visualization in `vis.py`, and the testing caption evaluation in `test eval`. The DAN implementation is in the `util` folder.
If you are instead modifying captions from a framework that uses ResNet features (e.g., Attend and Tell, Adaptive Attention), the initial setup is the same as above: download the COCO 2014 dataset and Karpathy's Train/Val/Test Split, and set up the COCO API, the COCO caption toolkit (`cococaptioncider`), and Java as described earlier.
Use the repository here to extract the image features to a `.hd5` file.
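To confirm the extraction produced what you expect, the generated file can be opened with h5py as in this sketch; the file name is an assumption, and the dataset names depend on the extraction repository.

```python
# List the datasets stored in the generated .hd5 file (file name is assumed).
import h5py

with h5py.File('train_features.hd5', 'r') as h:
    for key in h.keys():
        print(key, h[key])   # h5py prints each dataset's shape and dtype
```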
Then download the caption data folder from here, which includes the following:

- Caption utilities: a dictionary file with the following structure: `{"COCO image name": {"caption": "previous caption to modify", "embedding": [512-d DAN embedding of the previous caption], "attributes": [indices of the 5 extracted attributes], "image_ids": the COCO image id}}`
- COCO image names, in the corresponding order
- Annotations: the training annotations
- Caption lengths: the training caption lengths
- Wordmap: a dictionary mapping each word to its corresponding index
Place the extracted `.hd5` files in the `caption data` folder.
The data preparation, DAN, and variational dropout details described above apply here as well, and the same `train_eval.py`, `vis.py`, `test eval`, and `util` code is used.
If you use our code or find our paper useful in your research, please acknowledge:
@inproceedings{Sammani2019ModificationNet,
  author    = {Sammani, Fawaz and Elsayed, Mahmoud},
  title     = {Look and Modify: Modification Networks for Image Captioning},
  booktitle = {British Machine Vision Conference (BMVC)},
  year      = {2019}
}
This code is adapted from my Adaptive Attention repository and from sgrvinod's implementation of "Show, Attend and Tell".