Supplementary material to "Top-down Visual Saliency Guided by Captions" (CVPR 2017)

Caption-Guided Saliency

This code is released as supplementary material to "Top-down Visual Saliency Guided by Captions" (CVPR 2017).

Getting started

Clone this repo (including coco-caption as a submodule):

$ git clone --recursive git@github.com:VisionLearningGroup/caption-guided-saliency.git

Install dependencies

The model is implemented in Python 2.7 using the TensorFlow framework. To install TensorFlow, refer to the official Installing TensorFlow guide, or simply run:

$ pip install --upgrade https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.1.0-cp27-none-linux_x86_64.whl

Note: the standard TensorFlow build prints warnings such as:

The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.

These warnings are harmless. To get rid of them, build TensorFlow from source with --config=opt.

Other required Python modules:

$ pip install tqdm numpy six pillow matplotlib scipy

The code also uses ffmpeg for data preprocessing.
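
A quick environment check before preprocessing can catch a missing dependency early. The following is a minimal sketch and not part of the repository: the helper names `missing_modules` and `find_on_path` are hypothetical. The module list mirrors the packages named above, with one detail worth noting: the `pillow` package is imported as `PIL`.

```python
import importlib
import os

# Import names for the packages listed above; pillow is imported as PIL.
REQUIRED_MODULES = ["tensorflow", "tqdm", "numpy", "six", "PIL", "matplotlib", "scipy"]


def missing_modules(names):
    """Return the subset of module names that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


def find_on_path(program):
    """Return the full path of an executable found on PATH, or None."""
    for directory in os.environ.get("PATH", "").split(os.pathsep):
        candidate = os.path.join(directory, program)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


if __name__ == "__main__":
    absent = missing_modules(REQUIRED_MODULES)
    print("Missing Python modules: %s" % (", ".join(absent) or "none"))
    print("ffmpeg: %s" % (find_on_path("ffmpeg") or "not found on PATH"))
```

The sketch avoids Python 3 only helpers so that it should also run under the Python 2.7 interpreter this project targets.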

Obtain the dataset you need (MSR-VTT or Flickr30k) and unpack the files into their respective directories under ./DATA/.

The expected layout so far is:

./DATA/
    └───MSR_VTT/
    │   │   test_videodatainfo.json
    │   │   train_val_videodatainfo.json
    │   │
    │   └───TestVideo/
    │   │       ...
    │   │   
    │   └───TrainValVideo/
    │           ...
    └───Flickr30k
        │   results_20130124.token
        │      
        └───flickr30k-images/
                ...
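
The layout above can be verified with a short script before starting the preprocessing step. This is a sketch under the assumption that the entries shown in the tree are the only ones that matter at this stage; the `EXPECTED` table and the `missing_paths` helper are hypothetical names, not part of the repository.

```python
import os

# Relative paths copied from the layout shown above.
EXPECTED = {
    "MSR-VTT": [
        "MSR_VTT/test_videodatainfo.json",
        "MSR_VTT/train_val_videodatainfo.json",
        "MSR_VTT/TestVideo",
        "MSR_VTT/TrainValVideo",
    ],
    "Flickr30k": [
        "Flickr30k/results_20130124.token",
        "Flickr30k/flickr30k-images",
    ],
}


def missing_paths(dataset, data_root="./DATA"):
    """Return the expected entries for `dataset` that are absent under data_root."""
    return [
        relative
        for relative in EXPECTED[dataset]
        if not os.path.exists(os.path.join(data_root, *relative.split("/")))
    ]


if __name__ == "__main__":
    for name in sorted(EXPECTED):
        absent = missing_paths(name)
        status = "ok" if not absent else "missing " + ", ".join(absent)
        print("%s: %s" % (name, status))
```

Only the dataset you intend to use needs to pass the check.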

Run data preprocessing

$ python preprocessing.py --dataset {MSR-VTT|Flickr30k}

This step takes about 30 minutes for Flickr30k and about 2 hours for MSR-VTT.

Run training

$ python run_s2vt.py --dataset {MSR-VTT|Flickr30k} --train

We do not fine-tune the CNN part of the model, so training on a GPU takes only several hours. The training/validation/test splits for Flickr30k are taken from NeuralTalk. After training, you can evaluate the model:

$ python run_s2vt.py --dataset {MSR-VTT|Flickr30k} --test --checkpoint {number}

Saliency Visualization

Once you have a model trained to produce captions for the MSR-VTT dataset, you can generate a video with a saliency visualization similar to those at the beginning of the README:

$ python visualization.py --dataset MSR-VTT     \
                          --media_id video9461  \
                          --checkpoint {number} \
                          --sentence "A man is driving a car"

where media_id must belong to the test split of MSR-VTT and sentence sets the query phrase.
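
To try several query phrases on the same video, the command above can be driven from a small wrapper. The sketch below is an assumption-laden illustration rather than repository code: `visualization_command` and `run_queries` are hypothetical helpers, and only the flags shown above (`--dataset`, `--media_id`, `--checkpoint`, `--sentence`) are used. The checkpoint number is read from the command line because it depends on your own training run.

```python
import subprocess
import sys


def visualization_command(media_id, sentence, checkpoint, dataset="MSR-VTT"):
    """Build the argument list for one visualization.py run."""
    return [
        "python", "visualization.py",
        "--dataset", dataset,
        "--media_id", media_id,
        "--checkpoint", str(checkpoint),
        "--sentence", sentence,
    ]


def run_queries(media_id, sentences, checkpoint, dry_run=True):
    """Run one command per sentence; with dry_run=True only collect the commands."""
    commands = [visualization_command(media_id, s, checkpoint) for s in sentences]
    if not dry_run:
        for command in commands:
            subprocess.check_call(command)
    return commands


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python this_script.py <checkpoint number>
    queries = ["A man is driving a car"]  # add further query phrases here
    for command in run_queries("video9461", queries, sys.argv[1]):
        print(" ".join(command))
```

Passing the arguments as a list means the sentence needs no shell quoting. Set `dry_run=False` to actually launch the runs.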

What's next

You can change the model's parameters (layer dimensionality, learning rate, etc.) directly in cfg.py. Every run of run_s2vt.py with the --train switch overwrites the files in the experiments directory.
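
Because each training run overwrites the experiments directory, it may be worth copying it aside before retraining. The following is a minimal sketch, assuming the directory is named `experiments` and sits in the working directory; `backup_directory` and the `experiments_backup` folder are hypothetical names chosen for illustration.

```python
import os
import shutil
import time


def backup_directory(source="experiments", backup_root="experiments_backup"):
    """Copy `source` to a timestamped folder under backup_root; return the new path.

    Returns None when `source` does not exist, so a first run is a no-op.
    """
    if not os.path.isdir(source):
        return None
    stamp = time.strftime("%Y%m%d-%H%M%S")
    destination = os.path.join(backup_root, stamp)
    suffix = 1
    # Avoid clobbering a backup made within the same second.
    while os.path.exists(destination):
        destination = os.path.join(backup_root, "%s-%d" % (stamp, suffix))
        suffix += 1
    shutil.copytree(source, destination)
    return destination


if __name__ == "__main__":
    saved = backup_directory()
    print("Backup: %s" % (saved or "nothing to back up"))
```

Running this script before `python run_s2vt.py ... --train` keeps earlier checkpoints available for later evaluation or visualization.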

References

If you find this useful in your work, please consider citing:

@inproceedings{Ramanishka2017cvpr,
          title = {Top-down Visual Saliency Guided by Captions},
          author = {Vasili Ramanishka and Abir Das and Jianming Zhang and Kate Saenko},
          booktitle = {Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
          year = {2017}
          }