Code for Unsupervised Image Captioning

Unsupervised Image Captioning

by Yang Feng, Lin Ma, Wei Liu, and Jiebo Luo


Most image captioning models are trained using paired image-sentence data, which are expensive to collect. We propose unsupervised image captioning to relax the reliance on paired data. For more details, please refer to our paper.



  author = {Feng, Yang and Ma, Lin and Liu, Wei and Luo, Jiebo},
  title = {Unsupervised Image Captioning},
  booktitle = {CVPR},
  year = {2019}


mkdir ~/workspace
cd ~/workspace
git clone tf_models
git clone
touch tf_models/research/im2txt/im2txt/
touch tf_models/research/im2txt/im2txt/data/
touch tf_models/research/im2txt/im2txt/inference_utils/
mkdir ckpt
tar zxvf inception_v4_2016_09_09.tar.gz -C ckpt
git clone
cd unsupervised_captioning
pip install -r requirements.txt

Dataset (Optional. The files generated below can be found at Gdrive).

If you do not have access to Google, the files are also available on OneDrive.

  1. Crawl image descriptions. The descriptions used when conducting the experiments in the paper are available at link. You may download the descriptions from the link and extract the files to data/coco.

    pip3 install absl-py
    python3 preprocessing/
  2. Extract the descriptions. NLTK changes between releases, so the number of descriptions you obtain may differ from ours.

    python -c "import nltk; nltk.download('punkt')"
    python preprocessing/
  3. Preprocess the descriptions. You may need to change the vocab_size, start_id, and end_id in if you generate a new dictionary.

    python preprocessing/ --word_counts_output_file \
      data/word_counts.txt --new_dict
  4. Download the MSCOCO images from link and put all the images into ~/dataset/mscoco/all_images.

  5. Object detection for the training images. You need to first download the detection model from here and then extract the model under tf_models/research/object_detection.

    python preprocessing/ --image_path\
      ~/dataset/mscoco/all_images --num_proc 2 --num_gpus 1
  6. Generate tfrecord files for images.

    python preprocessing/ --image_path\
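
Step 3 above mentions vocab_size, start_id, and end_id, which depend on the dictionary you generate. The following is a minimal sketch of how such values could be derived from a list of sentences. It is not the repository's preprocessing script: the function names, the `<S>` and `</S>` markers, and the "word count" line format are assumptions made for illustration.

```python
from collections import Counter

# Assumed sentence boundary markers; the repository may use different tokens.
START, END = "<S>", "</S>"


def build_vocab(sentences, min_count=1):
    """Count words and assign ids by descending frequency."""
    counts = Counter()
    for sentence in sentences:
        counts.update([START] + sentence.lower().split() + [END])
    # Sort by count (descending), then by word, so the ids are deterministic.
    ordered = sorted(
        (item for item in counts.items() if item[1] >= min_count),
        key=lambda item: (-item[1], item[0]),
    )
    word_to_id = {word: idx for idx, (word, _) in enumerate(ordered)}
    return ordered, word_to_id


def write_word_counts(ordered, path):
    """Write one 'word count' pair per line (hypothetical file layout)."""
    with open(path, "w", encoding="utf-8") as f:
        for word, count in ordered:
            f.write(f"{word} {count}\n")


if __name__ == "__main__":
    ordered, word_to_id = build_vocab(["a dog runs", "a cat sleeps"])
    print("vocab_size:", len(word_to_id))
    print("start_id:", word_to_id[START], "end_id:", word_to_id[END])
```

If you regenerate the dictionary, values computed this way are the kind you would need to copy into the configuration, since the ids shift whenever the word frequencies change.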


  1. Train the model without the initialization pipeline.

    python --inc_ckpt ~/workspace/ckpt/inception_v4.ckpt\
      --multi_gpu --batch_size 512 --save_checkpoint_steps 1000\
      --gen_lr 0.001 --dis_lr 0.001
  2. Evaluate the model. The last element in the b34.json file is the best checkpoint.

    CUDA_VISIBLE_DEVICES='0,1' python\
      --inc_ckpt ~/workspace/ckpt/inception_v4.ckpt\
      --data_dir ~/dataset/mscoco/all_images
    js-beautify saving/b34.json
  3. Evaluate the model on test set. Suppose the best validation checkpoint is 20000.

    python --inc_ckpt ~/workspace/ckpt/inception_v4.ckpt\
      --data_dir ~/dataset/mscoco/all_images --job_dir saving/model.ckpt-20000
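
Step 2 above says the last element of b34.json is the best checkpoint. If js-beautify is not installed, a short Python helper can read that element instead. This is only a sketch: the layout of b34.json is not documented here, so the code assumes the top level is either a JSON array or a JSON object, and `last_entry` is a hypothetical helper name.

```python
import json
import sys


def last_entry(path):
    """Return the last element of a JSON file whose top level is a list or an object."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    if not data:
        raise ValueError(f"{path} contains no entries")
    if isinstance(data, list):
        return data[-1]
    if isinstance(data, dict):
        # json.load keeps the key order found in the file (Python 3.7+).
        key = list(data)[-1]
        return key, data[key]
    raise ValueError(f"unexpected top-level JSON type: {type(data).__name__}")


if __name__ == "__main__" and len(sys.argv) > 1:
    # Example: python last_entry.py saving/b34.json
    print(last_entry(sys.argv[1]))
```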

Initialization (Optional. The files can be found here).

  1. Train an object-to-sentence model, which is used to generate the pseudo-captions.

    python initialization/
  2. Find the best obj2sen model.

    python initialization/ --threads 8
  3. Generate pseudo-captions. Suppose the best validation checkpoint is 35000.

    python initialization/ --num_proc 8\
      --job_dir obj2sen/model.ckpt-35000
  4. Train a captioning model using pseudo-pairs.

    python initialization/ --o2s_ckpt obj2sen/model.ckpt-35000\
      --inc_ckpt ~/workspace/ckpt/inception_v4.ckpt
  5. Evaluate the model.

    CUDA_VISIBLE_DEVICES='0,1' python\
      --inc_ckpt ~/workspace/ckpt/inception_v4.ckpt\
      --data_dir ~/dataset/mscoco/all_images --job_dir saving_imcap
    js-beautify saving_imcap/b34.json
  6. Train the sentence auto-encoder, which is used to initialize the sentence GAN.

    python initialization/
  7. Train the sentence GAN.

    python initialization/
  8. Train the full model with initialization. Suppose the best imcap validation checkpoint is 18000.

    python --inc_ckpt ~/workspace/ckpt/inception_v4.ckpt\
      --imcap_ckpt saving_imcap/model.ckpt-18000\
      --sae_ckpt sen_gan/model.ckpt-30000 --multi_gpu --batch_size 512\
      --save_checkpoint_steps 1000 --gen_lr 0.001 --dis_lr 0.001
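
Several steps above refer to a checkpoint by step number (for example model.ckpt-35000 or model.ckpt-18000). Before passing such a prefix to a script, it can help to list which steps are actually present in a directory. The sketch below assumes the checkpoints are stored in TensorFlow's V2 checkpoint format, where each checkpoint has a `model.ckpt-<step>.index` file; `checkpoint_steps` is a hypothetical helper and not part of this repository.

```python
import re
from pathlib import Path

# Matches file names such as "model.ckpt-18000.index" (assumed naming scheme).
CKPT_PATTERN = re.compile(r"model\.ckpt-(\d+)\.index$")


def checkpoint_steps(job_dir):
    """Return the sorted step numbers of the checkpoints found in job_dir."""
    steps = []
    for path in Path(job_dir).glob("model.ckpt-*.index"):
        match = CKPT_PATTERN.search(path.name)
        if match:
            steps.append(int(match.group(1)))
    return sorted(steps)


if __name__ == "__main__":
    # Directory names taken from the commands above; adjust to your setup.
    for job_dir in ("saving_imcap", "sen_gan", "obj2sen"):
        print(job_dir, checkpoint_steps(job_dir))
```

A directory that does not exist simply yields an empty list, so the loop can be run at any point in the pipeline.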


Part of the code is from coco-caption, im2txt, tfgan, resnet, TensorFlow Object Detection API, and maskgan.

Xinpeng told me the idea of self-critic, which is crucial to training.
