
Visual Dialog Challenge Starter Code

PyTorch starter code for the Visual Dialog Challenge 2019.

If you use this code in your research, please consider citing:

  author =       {Karan Desai and Abhishek Das and Dhruv Batra and Devi Parikh},
  title =        {Visual Dialog Challenge Starter Code},
  howpublished = {\url{}},
  year =         {2018}


What's new with v2019?

If you are a returning user (from Visual Dialog Challenge 2018), here are some key highlights about our offerings in v2019 of this starter code:

  1. Almost a complete rewrite of v2018, which increased speed, readability, modularity and extensibility.
  2. Multi-GPU support - try out specifying GPU ids to train/evaluate scripts as: --gpu-ids 0 1 2 3
  3. Docker support - we provide a Dockerfile which can help you set up all the dependencies with ease.
  4. Stronger baseline - our Late Fusion Encoder is equipped with Bottom-up Top-Down attention. We also provide pre-extracted image features (links below).
  5. Minimal pre-processed data - there is no longer any need to download tens of pre-processed data files (typically referred to as visdial_data.h5 and visdial_params.json).

Setup and Dependencies

This starter code is implemented using PyTorch v1.0 and provides out-of-the-box support for CUDA 9 and CuDNN 7. There are two recommended ways to set up this codebase: Anaconda or Miniconda, and Docker.

Anaconda or Miniconda

  1. Install the Anaconda or Miniconda distribution based on Python3+ from their downloads site.
  2. Clone this repository and create an environment:
git clone
conda create -n visdialch python=3.6

# activate the environment and install all dependencies
conda activate visdialch
cd visdial-challenge-starter-pytorch/
pip install -r requirements.txt

# install this codebase as a package in development mode
python setup.py develop

Note: Docker setup is necessary if you wish to extract image features using Detectron.


Docker

We provide a Dockerfile which creates a lightweight image with all the dependencies installed.

  1. Install nvidia-docker, which enables usage of GPUs from inside a container.
  2. Build the image as:
cd docker
docker build -t visdialch .
  3. Run this image in a container by setting user+group, attaching the project root (this codebase) as a volume, and setting the shared memory size according to your requirements (it depends on the memory usage of your model).
nvidia-docker run -u $(id -u):$(id -g) \
                  -v $PROJECT_ROOT:/workspace \
                  --shm-size 16G visdialch /bin/bash

We recommend this development workflow: attaching the codebase as a volume immediately reflects source code changes inside the container environment. We also recommend keeping all the source code for data loading, models and other utilities inside the visdialch directory. Since it is a setuptools-style package, this makes absolute/relative imports and module resolution less painful. Scripts using visdialch can be created anywhere in the filesystem, as long as the current conda environment is active.

Download Data

  1. Download the VisDial v1.0 dialog json files from here and keep them under the $PROJECT_ROOT/data directory, so that the default arguments work.

  2. Get the word counts for VisDial v1.0 train split here. They are used to build the vocabulary.

  3. We also provide pre-extracted image features of VisDial v1.0 images, extracted using a Faster-RCNN pre-trained on Visual Genome. If you wish to extract your own image features, skip this step and download the VisDial v1.0 images from here instead. Extracted features for v1.0 train, val and test are available for download at these links.

  4. We also provide pre-extracted FC7 features from VGG16, although v2019 of this codebase no longer uses them.
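
As a rough illustration of how the word counts from step 2 could be turned into a vocabulary, here is a minimal sketch. It assumes the word counts file is a JSON object mapping each word to its frequency; the special tokens, the `min_count` threshold and the function names are hypothetical, and the Vocabulary class in this codebase may work differently.

```python
import json

# Hypothetical special tokens; the starter code's Vocabulary class may use others.
SPECIAL_TOKENS = ["<PAD>", "<S>", "</S>", "<UNK>"]


def build_vocabulary(word_counts, min_count=5):
    """Map every word seen at least `min_count` times to an integer index."""
    word2index = {token: index for index, token in enumerate(SPECIAL_TOKENS)}
    # Sort by descending count, then alphabetically, so indices are reproducible.
    for word, count in sorted(word_counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if count >= min_count and word not in word2index:
            word2index[word] = len(word2index)
    return word2index


def to_indices(words, word2index):
    """Convert tokens to indices, sending unknown words to <UNK>."""
    unk_index = word2index["<UNK>"]
    return [word2index.get(word, unk_index) for word in words]


# Stand-in for loading the downloaded word counts file with json.load();
# the {word: count} layout is an assumption.
word_counts = json.loads('{"is": 120, "the": 300, "zebra": 2, "dog": 40}')
vocabulary = build_vocabulary(word_counts, min_count=5)
print(to_indices(["the", "zebra", "is"], vocabulary))
```

Rare words such as "zebra" above fall below the threshold and are mapped to the unknown token, which keeps the vocabulary small.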


Training

This codebase supports both generative and discriminative decoding; read more here. For reference, we provide the Late Fusion Encoder from the Visual Dialog paper.

We provide a training script which accepts arguments as config files. The config file should contain arguments which are specific to a particular experiment, such as those defining model architecture, or optimization hyperparameters. Other arguments such as GPU ids, or number of CPU workers should be declared in the script and passed in as argparse-style arguments.
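
The split between config-file arguments and command-line arguments can be sketched with argparse as below. This is only an illustration, not the actual parser of the training script: the flags `--config-yml`, `--gpu-ids`, `--overfit` and `--save-dirpath` are the ones mentioned in this README, while `--cpu-workers`, all default values and the help strings are assumptions.

```python
import argparse


def build_parser():
    """Build a parser for arguments that are NOT specific to one experiment."""
    parser = argparse.ArgumentParser(description="Sketch of a training entry point")
    # Experiment-specific settings (model, optimization) live in this file.
    parser.add_argument("--config-yml", required=True,
                        help="Path to the experiment config file.")
    # Machine-specific settings are passed on the command line instead.
    parser.add_argument("--gpu-ids", nargs="+", type=int, default=[0],
                        help="One or more GPU ids (default is an assumption).")
    parser.add_argument("--cpu-workers", type=int, default=4,
                        help="Hypothetical flag: number of data loading workers.")
    parser.add_argument("--overfit", action="store_true",
                        help="Overfit on a handful of examples for debugging.")
    parser.add_argument("--save-dirpath", default="checkpoints/",
                        help="Directory for checkpoints (default is an assumption).")
    return parser


# Parse an explicit list here so the sketch runs without a real command line.
args = build_parser().parse_args(
    ["--config-yml", "configs/lf_disc_faster_rcnn_x101.yml", "--gpu-ids", "0", "1"]
)
print(args.config_yml, args.gpu_ids, args.overfit)
```

Keeping machine-specific values out of the config file means the same config can be reused unchanged on hardware with a different number of GPUs or CPU cores.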

Train the baseline model provided in this repository as:

python --config-yml configs/lf_disc_faster_rcnn_x101.yml --gpu-ids 0 1 # provide more ids for multi-GPU execution other args...

To extend this starter code, add your own encoder/decoder modules into their respective directories and include their names as choices in your config file. We have an --overfit flag, which can be useful for rapid debugging. It takes a batch of 5 examples and overfits the model on them.
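
One common way to make a new module selectable by name from a config file is a small name-to-class registry, sketched below. Every name here (the classes, the `ENCODERS` mapping, the `"encoder"` config key and `build_encoder`) is hypothetical and only illustrates the idea; check the encoder and decoder directories for how this codebase actually wires up its choices.

```python
class LateFusionEncoder:
    """Placeholder standing in for an existing encoder."""

    def __init__(self, config):
        self.config = config


class MyCustomEncoder:
    """Placeholder standing in for an encoder you add yourself."""

    def __init__(self, config):
        self.config = config


# Hypothetical registry: config names on the left, classes on the right.
ENCODERS = {
    "lf": LateFusionEncoder,
    "my_custom": MyCustomEncoder,
}


def build_encoder(model_config):
    """Instantiate the encoder named in the (already loaded) model config."""
    name = model_config["encoder"]
    if name not in ENCODERS:
        raise ValueError(f"Unknown encoder '{name}'; choices: {sorted(ENCODERS)}")
    return ENCODERS[name](model_config)


encoder = build_encoder({"encoder": "my_custom", "hidden_size": 512})
print(type(encoder).__name__)
```

With this pattern, switching experiments between encoders is a one-line change in the config file rather than a code change.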

Saving model checkpoints

The training script saves a model checkpoint at every epoch to the path specified by --save-dirpath. Refer to visdialch/utils/ for more details on how checkpointing is managed.


Logging

We use Tensorboard for logging training progress. Recommended: execute tensorboard --logdir /path/to/save_dir --port 8008 and visit localhost:8008 in the browser.


Evaluation

Evaluation of a trained model checkpoint can be done as follows:

python --config-yml /path/to/config.yml --load-pthpath /path/to/checkpoint.pth --split val --gpu-ids 0

This will generate an EvalAI submission file, and report metrics from the Visual Dialog paper (Mean reciprocal rank, R@{1, 5, 10}, Mean rank), and Normalized Discounted Cumulative Gain (NDCG), introduced in the first Visual Dialog Challenge (in 2018).

The metrics reported here would be the same as those reported through EvalAI by making a submission in val phase. To generate a submission file for test-std or test-challenge phase, replace --split val with --split test.
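
For intuition about what these numbers measure, here is a minimal pure-Python sketch of the metrics. It is not the implementation used by this codebase or by EvalAI. It assumes 1-based ranks of the ground-truth answer for the retrieval metrics, and for NDCG it assumes a linear gain with a log2 discount, truncated at k = the number of answer options with non-zero relevance; all function names are hypothetical.

```python
import math


def sparse_metrics(gt_ranks):
    """Retrieval metrics from the 1-based rank of the ground-truth answer per round."""
    n = len(gt_ranks)
    return {
        "r@1": sum(rank <= 1 for rank in gt_ranks) / n,
        "r@5": sum(rank <= 5 for rank in gt_ranks) / n,
        "r@10": sum(rank <= 10 for rank in gt_ranks) / n,
        "mean": sum(gt_ranks) / n,                     # mean rank, lower is better
        "mrr": sum(1.0 / rank for rank in gt_ranks) / n,  # mean reciprocal rank
    }


def ndcg(scores, relevance):
    """NDCG for one round, given model scores and relevance per answer option."""
    # Assumption: truncate at the number of options with non-zero relevance.
    k = sum(1 for rel in relevance if rel > 0)
    if k == 0:
        return 0.0

    def dcg(order):
        # Position 0 gets discount log2(2) = 1, position 1 gets log2(3), and so on.
        return sum(relevance[i] / math.log2(pos + 2) for pos, i in enumerate(order[:k]))

    predicted_order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ideal_order = sorted(range(len(relevance)), key=lambda i: -relevance[i])
    return dcg(predicted_order) / dcg(ideal_order)


metrics = sparse_metrics([1, 2, 10, 20])
perfect = ndcg(scores=[0.9, 0.1, 0.5], relevance=[1.0, 0.0, 0.5])
reversed_ = ndcg(scores=[0.1, 0.9, 0.5], relevance=[1.0, 0.0, 0.5])
print(metrics, perfect, reversed_)
```

The retrieval metrics only look at where the single ground-truth answer lands, whereas NDCG rewards ranking any relevant option highly, which is why the two can disagree on which model is better.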

Results and pretrained checkpoints

Performance on v1.0 test-std (trained on v1.0 train + val):

Model                      R@1     R@5     R@10    MeanR    MRR     NDCG
lf-disc-faster-rcnn-x101   0.4617  0.7780  0.8730  4.7545   0.6041  0.5162
lf-gen-faster-rcnn-x101    0.3620  0.5640  0.6340  19.4458  0.4657  0.5421


Acknowledgements

  • This starter code began as a fork of batra-mlp-lab/visdial-rl. We thank the developers for doing most of the heavy lifting.
  • The Lua-torch codebase of Visual Dialog, at batra-mlp-lab/visdial, served as an important reference while developing this codebase.
  • Some documentation and design strategies of the Metric, Reader and Vocabulary classes are inspired by AllenNLP. It is not a dependency because this codebase, in its current state, would use too little of it.