Supplementary material to "Top-down Visual Saliency Guided by Captions" (CVPR 2017)

Caption-Guided Saliency

This code is released as supplementary material to "Top-down Visual Saliency Guided by Captions" (CVPR 2017).

Getting started

Clone this repo (including coco-caption as a submodule):

$ git clone --recursive

Install dependencies

The model is implemented in TensorFlow and Python 2.7. To install TensorFlow, refer to the official Installing TensorFlow guide, or simply run:

$ pip install --upgrade

Note: the standard TensorFlow build prints warnings like:

The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.

These warnings are harmless. To get rid of them, build TensorFlow from source with --config=opt.

Other required Python modules:

$ pip install tqdm numpy six pillow matplotlib scipy

The code also uses ffmpeg for data preprocessing.
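
Before running preprocessing, it can help to confirm that the dependencies listed above are in place. The following is a minimal sketch, not part of the released code: the helper names `missing_modules` and `find_tool` are hypothetical, and the import names (for example `PIL` for pillow) are assumptions about how the packages above are imported. It is written to run under both Python 2.7 and Python 3.

```python
from __future__ import print_function
import importlib


def missing_modules(names):
    """Return the subset of `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


def find_tool(name):
    """Return the full path of an executable on PATH, or None."""
    try:
        from shutil import which  # Python 3.3+
    except ImportError:
        from distutils.spawn import find_executable as which  # Python 2.7
    return which(name)


if __name__ == "__main__":
    # Assumed import names for the packages listed above (pillow imports as PIL).
    required = ["tensorflow", "tqdm", "numpy", "six", "PIL", "matplotlib", "scipy"]
    print("Missing modules:", missing_modules(required) or "none")
    print("ffmpeg:", find_tool("ffmpeg") or "not found on PATH")
```

An empty list of missing modules and a resolved ffmpeg path suggest the environment is ready. The sketch does not check package versions.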

Obtain the dataset you need:

and unpack files into their respective directories under ./DATA/.

Expected layout so far is:

    │   │   test_videodatainfo.json
    │   │   train_val_videodatainfo.json
    │   │
    │   └───TestVideo/
    │   │       ...
    │   │   
    │   └───TrainValVideo/
    │           ...
        │   results_20130124.token
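
To check that the files were unpacked where the later steps can find them, a quick search of ./DATA/ may be enough. This is a minimal sketch under stated assumptions: `find_missing` is a hypothetical helper, and it only looks for the file and directory names shown in the layout above, anywhere under the root, without assuming their parent directories.

```python
from __future__ import print_function
import os


def find_missing(root, expected_names):
    """Return names from `expected_names` not found anywhere under `root`."""
    found = set()
    for _dirpath, dirnames, filenames in os.walk(root):
        found.update(dirnames)
        found.update(filenames)
    return [name for name in expected_names if name not in found]


if __name__ == "__main__":
    # Names taken from the layout above; parent directories are not assumed.
    expected = [
        "test_videodatainfo.json",
        "train_val_videodatainfo.json",
        "TestVideo",
        "TrainValVideo",
        "results_20130124.token",
    ]
    print("Missing under ./DATA:", find_missing("./DATA", expected) or "none")
```

If you only downloaded one of the two datasets, the entries belonging to the other one will be reported as missing, which is expected.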

Run data preprocessing

$ python --dataset {MSR-VTT|Flickr30k}

This step takes about 30 minutes for Flickr30k and about 2 hours for MSR-VTT.

Run training

$ python --dataset {MSR-VTT|Flickr30k} --train

We do not fine-tune the CNN part of the model, so training on a GPU takes only a few hours. Training/validation/test splits for Flickr30k are taken from NeuralTalk. After training, you can evaluate the model:

$ python --dataset {MSR-VTT|Flickr30k} --test --checkpoint {number}

Saliency Visualization

Once you have a model trained to produce captions for the MSR-VTT dataset, you can generate a video with a saliency visualization similar to those at the beginning of the readme:

$ python --dataset MSR-VTT     \
                          --media_id video9461  \
                          --checkpoint {number} \
                          --sentence "A man is driving a car"

where media_id must belong to the test split of MSR-VTT and sentence sets the query phrase.
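
To confirm that a given media_id is in the test split before launching the visualization, you could inspect the annotation file directly. This is a sketch, not the repository's own logic. It assumes the MSR-VTT annotation JSON has a top-level "videos" list whose entries carry "video_id" and "split" keys; verify this against your copy of the file. The helper names are hypothetical.

```python
from __future__ import print_function
import json


def load_info(path):
    """Parse an annotation JSON file into a dict."""
    with open(path) as f:
        return json.load(f)


def video_ids_in_split(info, split="test"):
    """Collect the ids of videos whose "split" field equals `split`.

    Assumes `info` has a top-level "videos" list with "video_id" and
    "split" keys on each entry (an assumption about the file format).
    """
    videos = info.get("videos", [])
    return set(v["video_id"] for v in videos if v.get("split") == split)


# Example (the path is a placeholder; point it at your copy of the file):
# info = load_info("path/to/test_videodatainfo.json")
# print("video9461" in video_ids_in_split(info))
```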

What's next

You can change the model's parameters (dimensionality of layers, learning rate, etc.) directly in the code. Every run with the --train switch overwrites the files in the experiments directory.
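
Since retraining overwrites earlier results, you may want to copy the experiments directory aside first. The following is a minimal sketch: `backup_dir` and the backup location are hypothetical and not part of the released code.

```python
from __future__ import print_function
import os
import shutil
import time


def backup_dir(src, backup_root):
    """Copy `src` into a new timestamped folder under `backup_root`.

    Returns the path of the copy, or None if `src` does not exist.
    Two backups of the same directory within one second would collide,
    because shutil.copytree refuses to write into an existing directory.
    """
    if not os.path.isdir(src):
        return None
    stamp = time.strftime("%Y%m%d-%H%M%S")
    name = os.path.basename(os.path.normpath(src)) + "-" + stamp
    dst = os.path.join(backup_root, name)
    shutil.copytree(src, dst)
    return dst


# Example ("experiments_backup" is a placeholder location):
# print(backup_dir("experiments", "experiments_backup"))
```

Run it before each training run whose previous checkpoints you want to keep.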


If you find this useful in your work, please consider citing:

          title = {Top-down Visual Saliency Guided by Captions},
          author = {Vasili Ramanishka and Abir Das and Jianming Zhang and Kate Saenko},
          booktitle = {Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
          year = {2017}