Unsupervised Image Captioning

by Yang Feng, Lin Ma, Wei Liu, and Jiebo Luo


Most image captioning models are trained using paired image-sentence data, which are expensive to collect. We propose unsupervised image captioning to relax the reliance on paired data. For more details, please refer to our paper.



@inproceedings{feng2019unsupervised,
  author = {Feng, Yang and Ma, Lin and Liu, Wei and Luo, Jiebo},
  title = {Unsupervised Image Captioning},
  booktitle = {CVPR},
  year = {2019}
}


mkdir ~/workspace
cd ~/workspace
git clone tf_models
git clone
touch tf_models/research/im2txt/im2txt/
touch tf_models/research/im2txt/im2txt/data/
touch tf_models/research/im2txt/im2txt/inference_utils/
mkdir ckpt
tar zxvf inception_v4_2016_09_09.tar.gz -C ckpt
git clone
cd unsupervised_captioning
pip install -r requirements.txt

Dataset (Optional. The files generated below can be found on Google Drive).

In case you do not have access to Google, the files are also available on OneDrive.

  1. Crawl image descriptions. The descriptions used in the paper's experiments are available at the link. You may download them from the link and extract the files to data/coco.

    pip3 install absl-py
    python3 preprocessing/
  2. Extract the descriptions. NLTK's tokenizers change between versions, so the number of descriptions you obtain may differ slightly.

    python -c "import nltk; nltk.download('punkt')"
    python preprocessing/
  3. Preprocess the descriptions. You may need to change vocab_size, start_id, and end_id in the configuration if you generate a new dictionary.

    python preprocessing/ --word_counts_output_file \
      data/word_counts.txt --new_dict
  4. Download the MSCOCO images from link and put all the images into ~/dataset/mscoco/all_images.

  5. Object detection for the training images. You need to first download the detection model from here and then extract the model under tf_models/research/object_detection.

    python preprocessing/ --image_path \
      ~/dataset/mscoco/all_images --num_proc 2 --num_gpus 1
  6. Generate tfrecord files for images.

    python preprocessing/ --image_path \
      ~/dataset/mscoco/all_images
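Step 2 above relies on NLTK's punkt tokenizer to split crawled descriptions into sentences, which is why the count varies across NLTK versions. As a rough, stdlib-only sketch of what sentence splitting does (the real script uses NLTK, and its exact rules differ), `naive_sent_split` is a hypothetical stand-in:

```python
import re

def naive_sent_split(text):
    """Toy stand-in for nltk.sent_tokenize: split after ., !, or ?.

    NLTK's punkt model handles abbreviations and edge cases that this
    regex does not, which is one reason version changes shift the counts.
    """
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p for p in parts if p]

desc = "A dog runs on the grass. It catches a red frisbee! The owner smiles."
print(naive_sent_split(desc))
# -> ['A dog runs on the grass.', 'It catches a red frisbee!', 'The owner smiles.']
```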
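Step 3 generates data/word_counts.txt, and the note about vocab_size, start_id, and end_id means the configuration must match whatever dictionary you generate. A minimal sketch of how such a word-count vocabulary is typically built (the actual script's tokenization, frequency threshold, and special tokens may differ; `build_word_counts` and `min_count` are illustrative names, not the repository's API):

```python
from collections import Counter

def build_word_counts(sentences, min_count=1):
    """Count whitespace tokens and keep those seen at least min_count times.

    Returns (word, count) pairs, most frequent first, mirroring the
    'word count' per-line layout commonly used for word_counts.txt files.
    """
    counts = Counter(w for s in sentences for w in s.lower().split())
    return [(w, c) for w, c in counts.most_common() if c >= min_count]

sents = ["a dog runs", "a cat sleeps", "a dog barks"]
for word, count in build_word_counts(sents, min_count=2):
    print(word, count)
# -> a 3
#    dog 2
```

The vocabulary size (plus any special start/end tokens) is what the config's vocab_size must agree with.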

  1. Train the model without the initialization pipeline.

    python --inc_ckpt ~/workspace/ckpt/inception_v4.ckpt \
      --multi_gpu --batch_size 512 --save_checkpoint_steps 1000 \
      --gen_lr 0.001 --dis_lr 0.001
  2. Evaluate the model. The last element in the b34.json file is the best checkpoint.

    CUDA_VISIBLE_DEVICES='0,1' python \
      --inc_ckpt ~/workspace/ckpt/inception_v4.ckpt \
      --data_dir ~/dataset/mscoco/all_images
    js-beautify saving/b34.json
  3. Evaluate the model on the test set. Suppose the best validation checkpoint is 20000.

    python --inc_ckpt ~/workspace/ckpt/inception_v4.ckpt \
      --data_dir ~/dataset/mscoco/all_images --job_dir saving/model.ckpt-20000
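Step 2 above states that the last element of b34.json records the best checkpoint. The file's exact schema is not documented here, so this is only a hedged sketch assuming it is a JSON list; `best_checkpoint` is an illustrative helper, not part of the repository:

```python
import json
import os
import tempfile

def best_checkpoint(path):
    """Load a JSON list of evaluation records and return the last one."""
    with open(path) as f:
        results = json.load(f)
    return results[-1]  # per the README, the last element is the best checkpoint

# Demo with a synthetic file; the real b34.json is written by the eval job.
fd, tmp = tempfile.mkstemp(suffix='.json')
with os.fdopen(fd, 'w') as f:
    json.dump([[10000, 0.21], [20000, 0.24]], f)
print(best_checkpoint(tmp))  # -> [20000, 0.24]
os.remove(tmp)
```

js-beautify is only needed to pretty-print the file for manual inspection; programmatic access works on the raw JSON.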

Initialization (Optional. The files can be found here).

  1. Train an object-to-sentence model, which is used to generate the pseudo-captions.

    python initialization/
  2. Find the best obj2sen model.

    python initialization/ --threads 8
  3. Generate pseudo-captions. Suppose the best validation checkpoint is 35000.

    python initialization/ --num_proc 8 \
      --job_dir obj2sen/model.ckpt-35000
  4. Train a captioning model using the pseudo-pairs.

    python initialization/ --o2s_ckpt obj2sen/model.ckpt-35000 \
      --inc_ckpt ~/workspace/ckpt/inception_v4.ckpt
  5. Evaluate the model.

    CUDA_VISIBLE_DEVICES='0,1' python \
      --inc_ckpt ~/workspace/ckpt/inception_v4.ckpt \
      --data_dir ~/dataset/mscoco/all_images --job_dir saving_imcap
    js-beautify saving_imcap/b34.json
  6. Train the sentence auto-encoder, which is used to initialize the sentence GAN.

    python initialization/
  7. Train sentence GAN.

    python initialization/
  8. Train the full model with initialization. Suppose the best imcap validation checkpoint is 18000.

    python --inc_ckpt ~/workspace/ckpt/inception_v4.ckpt \
      --imcap_ckpt saving_imcap/model.ckpt-18000 \
      --sae_ckpt sen_gan/model.ckpt-30000 --multi_gpu --batch_size 512 \
      --save_checkpoint_steps 1000 --gen_lr 0.001 --dis_lr 0.001


Part of the code is adapted from coco-caption, im2txt, tfgan, resnet, the TensorFlow Object Detection API, and maskgan.

Xinpeng suggested the self-critic idea, which is crucial to training.

