M²: Meshed-Memory Transformer for Image Captioning
Latest commit c411e65 by baraldilorenzo, Dec 26, 2019: Update environment.yml (Remove en-core-web-sm==2.0.0)
Name             Latest commit message          Commit time
data             M² Transformer Source Code     Dec 17, 2019
evaluation       M² Transformer Source Code     Dec 17, 2019
images           M² Transformer Source Code     Dec 17, 2019
models           M² Transformer Source Code     Dec 17, 2019
output_logs      M² Transformer Source Code     Dec 17, 2019
utils            M² Transformer Source Code     Dec 17, 2019
LICENSE          Create LICENSE                 Dec 17, 2019
README.md
environment.yml  Update environment.yml         Dec 26, 2019
test.py          M² Transformer Source Code     Dec 17, 2019
train.py         M² Transformer Source Code     Dec 17, 2019
vocab.pkl


M²: Meshed-Memory Transformer

This repository contains the reference code for the paper M²: Meshed-Memory Transformer for Image Captioning.

[Figure: Meshed-Memory Transformer]

Environment setup

Clone the repository and create the m2release conda environment using the environment.yml file:

conda env create -f environment.yml
conda activate m2release

Then download the spaCy English language data by running:

python -m spacy download en

Note: Python 3.6 is required to run our code.
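Before running the code, it can help to confirm that the interpreter and the spaCy data match these requirements. The following is a minimal sketch, not part of the repository: the helper names (is_supported_python, check_environment) are hypothetical, only the major.minor version is compared, and it assumes a spaCy 2.x installation in which python -m spacy download en creates the "en" shortcut that spacy.load("en") resolves.

```python
import sys
from typing import List, Tuple

# The README states that Python 3.6 is required; only (major, minor) is compared.
REQUIRED_PYTHON = (3, 6)


def is_supported_python(version_info: Tuple[int, ...]) -> bool:
    """Return True if the (major, minor, ...) version matches REQUIRED_PYTHON."""
    return tuple(version_info[:2]) == REQUIRED_PYTHON


def check_environment() -> List[str]:
    """Collect human-readable problems with the current environment (hypothetical helper)."""
    problems = []
    if not is_supported_python(sys.version_info):
        problems.append("Python %d.%d found, but 3.6 is required" % tuple(sys.version_info[:2]))
    try:
        import spacy
        # Assumption: spaCy 2.x, where the "en" shortcut exists after the download command.
        spacy.load("en")
    except Exception as exc:  # ImportError if spaCy is missing, OSError if the data is missing
        problems.append("spaCy 'en' data could not be loaded: %s" % exc)
    return problems


if __name__ == "__main__":
    issues = check_environment()
    for issue in issues:
        print("WARNING:", issue)
    if not issues:
        print("Environment looks OK")
```

Running it inside the activated m2release environment should print no warnings if setup succeeded.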

Data preparation

To run the code, annotations and detection features for the COCO dataset are needed. Please download the annotations file annotations.zip and extract it.

Detection features were computed with the code provided by [1]. To reproduce our results, please download the COCO features file coco_detections.hdf5 (~53.5 GB). In this file, the detections of each image are stored under the key <image_id>_features, where <image_id> is the id of the COCO image without leading zeros (e.g. the <image_id> for COCO_val2014_000000037209.jpg is 37209). Each value is an (N, 2048) tensor, where N is the number of detections.
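The key convention for the features file can be expressed in a few lines of Python. This is a sketch under assumptions: the helper names are hypothetical, and reading the file assumes the h5py package (imported lazily so that the key helpers work without it). The data loader in the repository may be organized differently.

```python
import os
import re


def image_id_from_filename(filename: str) -> int:
    """Return the COCO image id without leading zeros,
    e.g. 'COCO_val2014_000000037209.jpg' -> 37209."""
    stem = os.path.splitext(os.path.basename(filename))[0]
    match = re.search(r"(\d+)$", stem)  # trailing zero-padded digits
    if match is None:
        raise ValueError("no numeric image id in %r" % filename)
    return int(match.group(1))  # int() drops the zero padding


def features_key(image_id: int) -> str:
    """Key under which the detections of one image are stored."""
    return "%d_features" % image_id


def load_features(features_path: str, image_id: int):
    """Read the (N, 2048) detection tensor of one image.
    Assumption: h5py is installed; it is imported here so the helpers above run without it."""
    import h5py
    with h5py.File(features_path, "r") as f:
        feats = f[features_key(image_id)][()]  # read the whole dataset as a NumPy array
    if feats.ndim != 2 or feats.shape[1] != 2048:
        raise ValueError("unexpected feature shape %s for image %d" % (feats.shape, image_id))
    return feats
```

For example, features_key(image_id_from_filename("COCO_val2014_000000037209.jpg")) evaluates to "37209_features".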

Evaluation

To reproduce the results reported in our paper, download the pretrained model file meshed_memory_transformer.pth and place it in the code folder.

Run python test.py using the following arguments:

Argument             Description
--batch_size         Batch size (default: 10)
--workers            Number of workers (default: 0)
--features_path      Path to detection features file
--annotation_folder  Path to folder with COCO annotations
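To make the types and defaults in this table concrete, here is a hypothetical argparse parser that mirrors it. It is reconstructed from the README only, not copied from test.py, so option handling in the real script may differ (for instance, whether the two paths are required).

```python
import argparse
from typing import List, Optional


def build_test_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented test.py arguments."""
    parser = argparse.ArgumentParser(description="Evaluate a pretrained M2 Transformer (sketch)")
    parser.add_argument("--batch_size", type=int, default=10, help="Batch size")
    parser.add_argument("--workers", type=int, default=0, help="Number of workers")
    parser.add_argument("--features_path", type=str, help="Path to detection features file")
    parser.add_argument("--annotation_folder", type=str, help="Path to folder with COCO annotations")
    return parser


def parse_test_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    # argv=None makes argparse read sys.argv[1:], as a command-line script would.
    return build_test_parser().parse_args(argv)


if __name__ == "__main__":
    example = parse_test_args(["--features_path", "/path/to/coco_detections.hdf5",
                               "--annotation_folder", "/path/to/annotations"])
    print(example)
```

Omitted flags fall back to the documented defaults (batch size 10, 0 workers).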

Expected output

The expected output of the evaluation code is also available under output_logs/.

Training procedure

Run python train.py using the following arguments:

Argument             Description
--exp_name           Experiment name
--batch_size         Batch size (default: 10)
--workers            Number of workers (default: 0)
--m                  Number of memory vectors (default: 40)
--head               Number of attention heads (default: 8)
--warmup             Warmup value for learning rate scheduling (default: 10000)
--resume_last        If set, training is resumed from the last checkpoint
--resume_best        If set, training is resumed from the best checkpoint
--features_path      Path to detection features file
--annotation_folder  Path to folder with COCO annotations
--logs_folder        Path to folder for TensorBoard logs (default: "tensorboard_logs")
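The --m argument sets how many memory vectors are used. As a rough illustration, assuming the memory-augmented attention described in the paper (learnable memory slots appended to the keys and values of self-attention over image regions), a simplified single-head sketch in pure Python could look like this. Learned projections, multiple heads, and any scaling used in the models folder are deliberately omitted, and the memory slots are passed in as plain lists rather than trained parameters.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def memory_augmented_attention(queries: Matrix, keys: Matrix, values: Matrix,
                               mem_keys: Matrix, mem_values: Matrix) -> Matrix:
    """Scaled dot-product attention whose keys/values are extended with m memory slots.
    Simplified sketch: one head, identity projections."""
    all_keys = keys + mem_keys          # (n + m) x d
    all_values = values + mem_values    # (n + m) x d_v
    d = len(queries[0])
    d_v = len(all_values[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in all_keys]
        weights = softmax(scores)       # attends to regions and memory slots jointly
        outputs.append([sum(w * v[j] for w, v in zip(weights, all_values)) for j in range(d_v)])
    return outputs


if __name__ == "__main__":
    regions = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # 3 toy region features, d = 2
    memory_k = [[0.2, -0.1], [0.0, 0.3]]           # m = 2 toy memory slots
    memory_v = [[1.0, 1.0], [-1.0, 0.5]]
    for row in memory_augmented_attention(regions, regions, regions, memory_k, memory_v):
        print([round(x, 3) for x in row])
```

Because the memory slots do not depend on the input, they let each query attend to information that is not present among the detected regions; with m = 0 the function reduces to plain scaled dot-product attention.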

For example, to train our model with the parameters used in our experiments, run:

python train.py --exp_name m2_transformer --batch_size 50 --m 40 --head 8 --warmup 10000 --features_path /path/to/features --annotation_folder /path/to/annotations
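The --warmup 10000 value in this command controls learning rate warmup. Assuming it plays the role of the warmup steps in the inverse-square-root schedule of the original Transformer (linear increase for the first warmup steps, then decay proportional to step^-0.5), the schedule can be sketched as below. The model width d_model = 512 is an assumption; the README does not state it.

```python
def transformer_lr(step: int, d_model: int = 512, warmup: int = 10000) -> float:
    """Learning rate at 1-based `step`:
    d_model^-0.5 * min(step^-0.5, step * warmup^-1.5).
    Assumption: d_model = 512 and the standard Transformer schedule."""
    if step < 1:
        raise ValueError("step is 1-based")
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)


if __name__ == "__main__":
    for s in (1, 1000, 10000, 50000):
        print(s, "%.3e" % transformer_lr(s))
```

The two terms inside min() are equal at step == warmup, so the learning rate peaks there. In PyTorch, such a function could be wrapped as torch.optim.lr_scheduler.LambdaLR(optimizer, lambda s: transformer_lr(s + 1)) with a base learning rate of 1, since LambdaLR passes a 0-based step count and multiplies the base rate by the returned factor.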

Sample Results

References

[1] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
