Development Toolbox for Benchmarking Natural Language Visual Detection (NLVD)


A benchmarking protocol for visual localization and detection with natural language queries was proposed in [DBNet]. This repository provides the development toolbox for implementing this protocol. It includes utilities in both MATLAB and Python 3 for reading the annotations, accessing the test query sets, storing test results in a standard format, etc. It also includes the MATLAB code for evaluating the test results in the way proposed in [DBNet].

The current version of the benchmarking protocol is built upon the Visual Genome dataset (V1.2).

Please refer to the DOWNLOAD section to obtain the data needed to use this development toolbox.


Please cite the following paper for the benchmarking protocol.

Discriminative Bimodal Networks for Visual Localization and Detection with Natural Language Queries,
Yuting Zhang, Luyao Yuan, Yijie Guo, Zhiyuan He, I-An Huang, Honglak Lee
In CVPR 2017 (spotlight)

Please cite the following paper for the Visual Genome dataset.

Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations
Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David Ayman Shamma, Michael Bernstein, Li Fei-Fei
International Journal of Computer Vision, Volume 123, Issue 1, Pages 32-73, May 2017

Protocol data

The benchmarking protocol on the Visual Genome dataset uses the original text phrase annotations on image regions with the following differences:

  • Misspelled words are corrected by the Enchant spell checker from AbiWord.
  • A period (.) is appended to any text phrase that does not end with a punctuation mark.
  • A space character is added between a punctuation mark and its preceding word (e.g., "a black cat." becomes "a black cat .").
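
The period and spacing rules above can be illustrated with a minimal Python sketch. It is not the toolbox's own preprocessing code: the function name `normalize_phrase`, the set of characters treated as ending punctuation, and the set of marks that get split from the preceding word are all assumptions made for illustration. The spell-correction step is omitted, and the regular expression is deliberately simple, so it would also split decimals such as "3.5".

```python
import re

# Characters treated as "ending punctuation" (an assumption, not from the toolbox).
_END_PUNCT = ".!?"
# Punctuation marks to separate from the preceding word (also an assumption).
_SPLIT_PUNCT = re.compile(r"(?<=\S)([.,!?;:])")


def normalize_phrase(phrase: str) -> str:
    """Apply the period and spacing rules to one region phrase (hypothetical helper)."""
    text = phrase.strip()
    # Rule: append a period when the phrase has no ending punctuation.
    if text and text[-1] not in _END_PUNCT:
        text += "."
    # Rule: insert a space between a punctuation mark and the preceding word.
    return _SPLIT_PUNCT.sub(r" \1", text)


print(normalize_phrase("a black cat"))   # a black cat .
print(normalize_phrase("a black cat."))  # a black cat .
```

A phrase that already follows the convention, such as "a black cat .", passes through unchanged, so the sketch can be applied repeatedly without piling up spaces.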

For easier access, the annotations are organized in a format that differs from the original Visual Genome annotation data.

This benchmarking protocol splits the dataset into training, validation, and test sets in the same way as DenseCap. Results should be reported on the test set.

A more detailed summary of the dataset can be found here.

In addition, the benchmarking protocol provides test query sets (on the test set only) with different difficulty levels for the detection task. More details about the detection test query sets can be found here.

Downloading the protocol data

The protocol data on the Visual Genome dataset can be obtained by running the following command:


If you would like to use only the MATLAB APIs, you can alternatively run ./annotations/ to download only the MAT version of the protocol data.

If you have difficulty running the above scripts, you can download the data manually via this link (JSON format for the Python APIs) and this link (MAT format for the MATLAB APIs). Please extract the files into annotations/vg_v1.

More details about the data files can be found here.

Obtaining Visual Genome images

The Visual Genome images can be obtained by running the following command:


The images will be extracted in images/vg, and a symlink to it will be created as images/vg_v1.
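
As a quick sanity check after downloading, the directory layout described above can be verified with a short Python sketch. The helper name `check_layout` and the keys of the returned dictionary are hypothetical. Only the three directory paths come from this README, and the sketch assumes they are relative to the repository root.

```python
from pathlib import Path


def check_layout(root: str = ".") -> dict:
    """Report which of the expected data directories exist under `root`.

    Hypothetical helper: only the directory paths are taken from the README.
    """
    base = Path(root)
    expected = {
        "annotations": base / "annotations" / "vg_v1",  # extracted protocol data
        "images": base / "images" / "vg",               # extracted images
        "images_link": base / "images" / "vg_v1",       # symlink to images/vg
    }
    # Path.is_dir() follows symlinks, so a valid symlink to a directory counts.
    return {name: path.is_dir() for name, path in expected.items()}


if __name__ == "__main__":
    for name, present in check_layout().items():
        print(f"{name}: {'found' if present else 'missing'}")
```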

You can also obtain the images via the Visual Genome official links, as follows:

Note that the Visual Genome images are NOT a contribution of this benchmarking protocol. The script and links are provided here only for the convenience of the users.

Development APIs

Development APIs are provided in both Python and MATLAB for accessing annotations and storing test results.

See development/ for the API code and the detailed manual.

Evaluating test results

Evaluation code is provided in MATLAB to measure the localization and detection performance based on stored results. The evaluation statistics, such as average_precision (AP) and average accuracy (ACC), are written into text files.

See evaluation/ for the evaluation code and detailed instructions.
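
For readers who want an intuition for overlap-based localization accuracy before opening the MATLAB code, here is a minimal Python sketch. It is not a port of the toolbox's evaluation: the box layout `(x1, y1, x2, y2)`, the continuous-coordinate area computation, the 0.5 threshold, and the function names `iou` and `localization_accuracy` are all assumptions made for illustration.

```python
from typing import Sequence

# Assumed box layout: (x1, y1, x2, y2) with x2 >= x1 and y2 >= y1.
Box = Sequence[float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = min(a[2], b[2]) - max(a[0], b[0])
    inter_h = min(a[3], b[3]) - max(a[1], b[1])
    if inter_w <= 0 or inter_h <= 0:
        return 0.0  # boxes do not overlap
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def localization_accuracy(predicted: Sequence[Box],
                          ground_truth: Sequence[Box],
                          threshold: float = 0.5) -> float:
    """Fraction of predictions whose IoU with the paired ground truth meets the threshold."""
    if len(predicted) != len(ground_truth):
        raise ValueError("predicted and ground_truth must have the same length")
    if not predicted:
        return 0.0
    hits = sum(iou(p, g) >= threshold for p, g in zip(predicted, ground_truth))
    return hits / len(predicted)


if __name__ == "__main__":
    preds = [(0, 0, 2, 2), (0, 0, 1, 1)]
    truth = [(0, 0, 2, 2), (2, 2, 3, 3)]
    print(localization_accuracy(preds, truth))  # 0.5
```

In this usage example the first prediction matches its ground-truth box exactly and the second does not overlap at all, so half of the queries count as correctly localized. The numbers reported by the toolbox should always be taken from the MATLAB evaluation code itself.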


A few utility files (BoxIntersection.m, BoxSize.m, PascalOverlap.m) are borrowed from the selective search toolbox developed by Jasper Uijlings et al.

Many thanks to Yijie Guo and Luyao Yuan for their great efforts in developing this toolbox.

