This repository contains code for the paper:

Visual Coreference Resolution in Visual Dialog using Neural Module Networks
Satwik Kottur, José M. F. Moura, Devi Parikh, Dhruv Batra, Marcus Rohrbach
[PDF] [ArXiv] [Code]
European Conference on Computer Vision (ECCV), 2018


Visual dialog entails answering a series of questions grounded in an image, using dialog history as context. In addition to the challenges found in visual question answering (VQA), which can be seen as one-round dialog, visual dialog encompasses several more. We focus on one such problem called visual coreference resolution that involves determining which words, typically noun phrases and pronouns, co-refer to the same entity/object instance in an image. This is crucial, especially for pronouns (e.g., ‘it’), as the dialog agent must first link it to a previous coreference (e.g., ‘boat’), and only then can rely on the visual grounding of the coreference ‘boat’ to reason about the pronoun ‘it’. Prior work (in visual dialog) models visual coreference resolution either (a) implicitly via a memory network over history, or (b) at a coarse level for the entire question; and not explicitly at a phrase level of granularity. In this work, we propose a neural module network architecture for visual dialog by introducing two novel modules, Refer and Exclude, that perform explicit, grounded, coreference resolution at a finer word level.
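The abstract does not spell out how the Refer module works, so the following is only a toy illustration of the general idea of resolving a phrase against earlier, already grounded phrases. It assumes a pool of previously seen phrases, each stored as a text embedding (key) with an attention map over image regions (value), and blends the stored maps by softmax similarity to the current phrase. The function name `refer`, the dot-product scoring, and the data layout are assumptions made for this sketch, not the repository's implementation.

```python
import math


def refer(query, keys, attention_maps):
    """Blend stored attention maps by similarity between `query` and `keys`.

    query:          embedding of the current phrase, e.g. for 'it'
    keys:           embeddings of earlier phrases, e.g. for 'boat'
    attention_maps: one attention map (list of region weights) per key
    """
    # Dot-product similarity between the query and every stored key.
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    # Numerically stable softmax over the similarity scores.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the stored attention maps, region by region.
    num_regions = len(attention_maps[0])
    return [
        sum(w * amap[r] for w, amap in zip(weights, attention_maps))
        for r in range(num_regions)
    ]
```

In this sketch, a pronoun whose embedding is close to that of an earlier noun phrase inherits most of that phrase's visual grounding, which is the behavior the abstract describes for linking ‘it’ to ‘boat’.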

This repository trains our explicit visual coreference model, CorefNMN.

If you find this code useful, consider citing our work:

  author    = {Kottur, Satwik and Moura, Jos\'e M. F. and Parikh, Devi and 
               Batra, Dhruv and Rohrbach, Marcus},
  title     = {Visual Coreference Resolution in Visual Dialog using Neural 
               Module Networks},
  booktitle = {The European Conference on Computer Vision (ECCV)},
  month     = {September},
  year      = {2018}


The structure of this repository is inspired by the n2nmn GitHub repository.

The code is in Python3 and uses TensorFlow r1.0. Additionally, it uses TensorFlow Fold for execution of dynamic networks. Install instructions for Fold can be found here.

Note: Compatibility of Fold has been tested with only TensorFlow r1.0!

Additional Python package dependencies can be installed as follows (argparse and json are part of the Python 3 standard library and need no separate installation):

pip install tqdm
pip install numpy

Finally, add the current working directory to the Python path, i.e., set PYTHONPATH=. before running the code.
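Before moving on, it can help to confirm that the environment matches the setup described above. The following is a minimal sketch of such a check, not part of the repository; the helper name `missing_modules` is hypothetical, and the module list simply mirrors the packages named in this README.

```python
import importlib.util
import os
import sys


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Packages mentioned in this README; argparse and json ship with Python 3.
    required = ["tensorflow", "tqdm", "numpy", "argparse", "json"]
    missing = missing_modules(required)
    print("Missing modules:", ", ".join(missing) if missing else "none")

    # With PYTHONPATH=. the current directory should appear on the import path.
    cwd = os.path.abspath(os.getcwd())
    on_path = any(os.path.abspath(p or os.curdir) == cwd for p in sys.path)
    print("Current directory on sys.path:", on_path)
```

Run it from the repository root; any name listed as missing still needs to be installed.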

This repository contains experiments on two datasets: VisDial v0.9 and MNIST Dialog. Instructions to train models on each of these datasets are given below.

Experiments on VisDial v0.9 Dataset

Preprocessing VisDial v0.9

This code has many preprocessing steps, so please hold tight!

There are three preprocessing phases. A script for each of these phases is provided in the scripts/ folder.

Phase A: In this phase, we will download the data and extract questions and captions as text files to run parsers (phase B). The first phase involves running the following command:


This creates a folder data/ within which another folder visdial_v0.9 will be created. All our preprocessing steps will operate on files in this folder.

Phase B: The second phase runs the Stanford Parser to acquire weak program supervision for questions and captions. Follow the steps below:

  1. Download the Stanford parser here.
  2. Next, copy the file scripts/ to the same folder as the Stanford parser. Ensure that the VISDIAL_DATA_ROOT flag in the above script (after copying) points correctly to the data/visdial_v0.9 folder. For example, if you download and extract the Stanford parser in the parent folder of this repository, VISDIAL_DATA_ROOT should be ../data/visdial_v0.9/.

Now run the parser using the command ./ from the parser folder. This should take about 45-60 min based on your CPU configuration. Adjust the memory argument in based on your RAM.
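Because VISDIAL_DATA_ROOT is interpreted relative to the folder from which the parser script is run, a wrong value is an easy mistake to make. The following sketch shows one way to check what a given value resolves to before starting the long parsing run. The helper `resolve_data_root` is hypothetical and not part of the repository.

```python
import os


def resolve_data_root(data_root, working_dir):
    """Resolve `data_root` relative to `working_dir`.

    Returns a tuple (normalized_path, exists), where `exists` tells whether
    the resolved path is an existing directory.
    """
    path = os.path.normpath(os.path.join(working_dir, data_root))
    return path, os.path.isdir(path)


if __name__ == "__main__":
    # Run this from the parser folder with the value set in the copied script.
    path, exists = resolve_data_root("../data/visdial_v0.9/", os.getcwd())
    print(path, "(found)" if exists else "(NOT found)")
```

If the path is reported as not found, adjust VISDIAL_DATA_ROOT until it points at the data/visdial_v0.9 folder created in phase A.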

Phase C: For the third phase, first ensure the following:

  1. Download the vocabulary file from the original visual dialog codebase (github). The vocabulary for the VisDial v0.9 dataset is here.

  2. Save the following files: data/visdial_v0.9/vocabulary_layout_4.txt


    and data/visdial_v0.9/vocabulary_layout_5.txt

  3. Download the visual dialog data files with weak coreference supervision. These files have been obtained using an off-the-shelf, text-only coreference resolution system (github). They are available here: train and val.

NOTE: To extract weak coreferences on your own dataset, please install v2.0 of the neuralcoref repository (here) from source. For v2.0, only spaCy==2.1.3 has been tested. Run util/ on files in the VisDial format to extract weak coreference supervision.

Now, run the script for the third phase:


This will use the output from the Stanford parser, create programs for our model, extract the vocabulary from the train dataset, and finally create an image-dialog database for each split (train and val) that will be used by our training code.
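To make the vocabulary extraction step concrete, here is a minimal sketch of how a word-to-index vocabulary can be built from training sentences. It is an illustration under stated assumptions, not the repository's script: the function name `build_vocabulary`, the whitespace tokenization, the `min_count` threshold, and the `<unk>` token are all hypothetical choices.

```python
from collections import Counter


def build_vocabulary(sentences, min_count=1, specials=("<unk>",)):
    """Map each sufficiently frequent token to an integer id.

    Tokens are lowercased and split on whitespace. Special tokens come
    first, followed by the remaining words in alphabetical order.
    """
    counts = Counter(tok for s in sentences for tok in s.lower().split())
    words = sorted(w for w, c in counts.items() if c >= min_count)
    return {w: i for i, w in enumerate(list(specials) + words)}


if __name__ == "__main__":
    train_questions = ["Is it a boat", "is the boat red"]
    print(build_vocabulary(train_questions, min_count=2))
```

Building the vocabulary from the train split only, as described above, keeps words that occur solely in the val split out of the model's vocabulary.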

Extracting Image Features

To extract image features, please follow instructions here.

All instructions for preprocessing are now done! We are set to train visual dialog models that perform explicit coreference resolution.


To train a model, look at scripts/ that shows usages of command line flags for the file exp_vd/ Information about these flags can be obtained from exp_vd/

Usage for exp_vd/ (evaluating a checkpoint) is also given in scripts/

Experiments on MNIST Dialog Dataset

Preprocessing MNIST Dialog

To preprocess the MNIST Dialog dataset, run the command:


As before, this will download the dataset and create the image-dialog database (similar to VisDial v0.9 preprocessing phase C).

Finally save the following at: data/mnist/vocabulary_layout_mnist.txt


Done with preprocessing!


Training a model is handled by exp_mnist/, while exp_mnist/ handles evaluating a specific checkpoint. Commandline options are parsed by exp_mnist/

Usage of these scripts is demonstrated by the bash script scripts/

Future Releases

  • Visualization scripts
  • MNIST Experiments
  • Detailed Doc Strings
  • Additional installation instructions
  • Include pretrained models


This project is licensed under the license found in the LICENSE file in the root directory of this source tree (here). Portions of the source code are from the n2nmn project, whose license is in LICENSE.n2nmn in the root directory of this source tree (here).

