We apply a beam search heuristic on top of an encoder-decoder architecture, which gives better-quality captions on three benchmark datasets: Flickr8k, Flickr30k and MS COCO.

Instead of greedily choosing the word with the best score at each decoding step, beam search keeps the k highest-scoring partial captions at every step, which helps the model find a caption with a better overall score. The following figure shows how a beam width (k) of 3 helps in generating better captions:

[Figure: beam search]
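To make the difference between greedy decoding and beam search concrete, here is a minimal, self-contained sketch. It does not use this project's model: `TOY_MODEL` is a hypothetical hand-written table of next-word probabilities, and `beam_search` is an illustrative function name, not one from this repository. Greedy decoding corresponds to a beam width of 1.

```python
import math

# Hypothetical toy "decoder": next-word probabilities given the previous word.
# A real captioning model would condition on the image and the full prefix.
TOY_MODEL = {
    "<start>": {"a": 0.6, "the": 0.4},
    "a": {"dog": 0.4, "cat": 0.35, "<end>": 0.25},
    "the": {"dog": 0.9, "cat": 0.1},
    "dog": {"<end>": 1.0},
    "cat": {"<end>": 1.0},
}


def next_log_probs(prefix):
    """Return log-probabilities of each possible next word for a prefix."""
    return {w: math.log(p) for w, p in TOY_MODEL[prefix[-1]].items()}


def beam_search(k, max_len=10):
    """Return (log_prob, words) of the best finished sequence with beam width k."""
    beams = [(0.0, ["<start>"])]  # (cumulative log-prob, words so far)
    finished = []
    for _ in range(max_len):
        # Expand every live beam by every possible next word.
        candidates = []
        for score, seq in beams:
            for word, lp in next_log_probs(seq).items():
                candidates.append((score + lp, seq + [word]))
        # Keep only the k best expansions.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for score, seq in candidates[:k]:
            if seq[-1] == "<end>":
                finished.append((score, seq))
            else:
                beams.append((score, seq))
        if not beams:
            break
    # Fall back to unfinished beams if nothing reached <end> within max_len.
    return max(finished or beams, key=lambda c: c[0])


greedy_score, greedy_words = beam_search(k=1)
beam_score, beam_words = beam_search(k=3)
print("greedy:", " ".join(greedy_words), round(math.exp(greedy_score), 2))
print("beam-3:", " ".join(beam_words), round(math.exp(beam_score), 2))
```

In this toy table, greedy decoding commits to "a" because it is the best first word, and ends with a sequence probability of 0.24. With k = 3 the lower-scoring first word "the" stays alive long enough to reach "the dog", which has a higher overall probability of 0.36.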


The dependencies for this project are listed in the provided environment.yml and requirements.txt files.

To install the dependencies using conda:

conda env create -f environment.yml
conda env list


Reference the data folder and the annotations JSON file for the downloaded dataset (MSCOCO, Flickr8k, or Flickr30k) in and run the Python script to create the required dataset.

To train a model run python All training hyper-parameters are mentioned in

Note: Pretrained models for MSCOCO, Flickr8k, Flickr30k can be downloaded from here.

The downloaded zip file needs to be extracted in the models/ directory.
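If you prefer to do the extraction from Python, the step above can be sketched with the standard-library `zipfile` module. The function name `extract_models` and the archive file name in the usage note are placeholders, since the actual name of the downloaded file is not stated here.

```python
import zipfile
from pathlib import Path


def extract_models(archive_path, target_dir="models"):
    """Extract a downloaded zip archive into target_dir and list its top-level entries."""
    target = Path(target_dir)
    # Create models/ if it does not exist yet.
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive_path) as zf:
        zf.extractall(target)
    return sorted(p.name for p in target.iterdir())
```

For example, `extract_models("pretrained_models.zip")` (a hypothetical file name) would unpack the archive into models/ and return the names of the entries now in that directory.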

Testing / Inference

  • You may use to generate image captions and attention map over an image.

    python --img='path/to/image.jpeg' --model='path/to/BEST_checkpoint_coco_5_cap_per_img_5_min_word_freq.pth.tar' --word_map='path/to/WORDMAP_coco_5_cap_per_img_5_min_word_freq.json' --beam_size=5
  • The Jupyter Notebook Caption-Sample-Images.ipynb can be used to caption specified images using the trained model.

  • Generate-Testset-Predictions.ipynb is used for generating predictions in the required format for the testing dataset.


[Figure: results table]

[Figure: comparing captions]

Interactive User Interface

To use the UI-based image captioner module, run the following commands:

cd ui/

This opens the following user interface:

ui-view1 ui-view3

Project UI Demo

You can find the demo video here on YouTube.


