The Cornell Natural Language Visual Reasoning (NLVR) Corpus v1.0
The corpus and task are described in:
A Corpus of Natural Language for Visual Reasoning
Alane Suhr, Mike Lewis, James Yeh, and Yoav Artzi
In Proceedings of the Conference of the Association for Computational Linguistics (ACL), 2017.
Supplementary material: http://yoavartzi.com/pub/slya-acl.2017.sup.pdf
Structure of repository
There are three subdirectories, one for each split of the data (train, development, and public test).
In each of the split directories, there are two items:
- .json file, containing the labeled sentences, structured representations, and identifiers which can be used to match the example with its six PNG permutations.
- images subdirectory, containing the PNG images.
Each line of the .json file contains one example, represented as a JSON object. The fields are:
sentence: the sentence from the first stage of data collection.
identifier: a two-part identifier of the form n-m, where n identifies the original presentation in the first stage of data collection (at most four examples share the same n), and m is the position within that presentation (0 = A, 1 = B, 2 = C, 3 = D).
This identifier is used to match examples in this file with images in the images subdirectory. The image names are in the format split-n-m-k.png, where 0 <= k <= 5 (representing the six permutations of the boxes), and split is the split name (train, dev, test).
directory: the directory that the images associated with this example are in. E.g., if directory = n and the file is dev.json, the images for the example will be located in directory dev/images/n.
label: the final label for the example, true or false.
structured_rep: the structured representation of the image, which is a list of length three, with one item per box. Each box is itself a list of up to eight objects. Each object has an x and y position (x_loc and y_loc), a type (the name of the shape), a color, and a size.
evals: the set of validations done in the second stage of data collection. This is a dictionary mapping a rater identifier to their validation. The rater identifiers persist across all splits; i.e., rater r0 in train is the same rater as rater r0 in dev.
The validations in evals are NOT the final label for the example. The final label is in the field label.
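The field descriptions above can be turned into a small loader. The following is a minimal sketch, not part of the released scripts: the helper names (load_examples, image_paths, count_objects) are hypothetical, and the path layout split/images/directory/split-n-m-k.png is assembled from the descriptions of the identifier and directory fields.

```python
import json
from pathlib import Path


def load_examples(json_path):
    """Read one JSON object per line, skipping blank lines."""
    with open(json_path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def image_paths(example, split, root="."):
    """Return the six PNG paths (k = 0..5) for one example.

    Assumed layout: <root>/<split>/images/<directory>/<split>-<identifier>-<k>.png
    """
    image_dir = Path(root) / split / "images" / str(example["directory"])
    return [image_dir / f"{split}-{example['identifier']}-{k}.png" for k in range(6)]


def count_objects(example):
    """Count the objects in each of the three boxes of structured_rep."""
    return [len(box) for box in example["structured_rep"]]
```

For instance, image_paths(example, "dev") would list the six permutations of a development example, assuming the corpus was unpacked in the current directory.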
Each images subdirectory contains numbered subdirectories, each holding up to 1000 PNGs. For the training and development splits, these subdirectories can be used as cross-validation splits: all examples originally presented together are in the same numbered subdirectory, and all six images for an example are in the same numbered subdirectory.
For instance, the example whose identifier is 72-0 has the PNGs 72-0-0 through 72-0-5. If all six of these PNGs are in the numbered subdirectory 5, then the PNGs for examples 72-1, 72-2, and 72-3 are also in subdirectory 5. This is done to prevent information (either about the original presentation of the four examples or about the permutations of a specific example) from leaking between cross-validation splits.
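The grouping property described above can be checked on loaded examples. This is an illustrative sketch only: the function names are hypothetical, and it assumes each example is a dict with the identifier and directory fields described earlier.

```python
from collections import defaultdict


def directories_by_presentation(examples):
    """Map each presentation number n to the set of directories its examples use."""
    dirs = defaultdict(set)
    for ex in examples:
        # The identifier has the form n-m; keep only n.
        n = ex["identifier"].rsplit("-", 1)[0]
        dirs[n].add(str(ex["directory"]))
    return dirs


def split_presentations(examples):
    """Return presentation numbers whose examples span more than one directory."""
    grouped = directories_by_presentation(examples)
    return sorted(n for n, d in grouped.items() if len(d) > 1)
```

If the data follows the layout described here, split_presentations should return an empty list for the training and development files.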
We provide evaluation scripts for predictions. These scripts are metrics_images.py and metrics_structured_rep.py. We assume that you have a CSV that has one line per prediction. In the case of images, we assume one prediction per PNG file. Each line contains two values: the identifier (in the case of images, we assume this is the filename; in the case of structured representations, we assume this is simply the identifier) and the predicted value ("true" or "false", case insensitive).
Each script takes two arguments. The first argument is your predictions file. The second argument is the JSON file for the subset of the data you are comparing against.
We return two numbers: precision and consistency. Precision is the proportion of examples for which your prediction was correct. Consistency is the proportion of original sentences (for images, there may be at most 24 images; for structured representation, there may be at most 4 examples) that had a correct prediction.
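A predictions file in the format described above could be produced as follows. This is a sketch under stated assumptions: the helper names are hypothetical, the file is assumed to have no header row, and for images the first column is assumed to be the PNG filename including its extension. Check the evaluation scripts themselves for the exact form they expect.

```python
import csv


def prediction_rows(predictions):
    """Convert {identifier_or_filename: bool} into [identifier, "true"/"false"] rows."""
    return [[key, "true" if value else "false"] for key, value in predictions.items()]


def write_predictions(predictions, csv_path):
    """Write one prediction per line: identifier, predicted value."""
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(prediction_rows(predictions))
```

The resulting file would then be passed as the first argument to the evaluation script, followed by the JSON file for the split being evaluated.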
Running on the leaderboard held-out test set
To avoid overfitting and degrading the leaderboard held-out test set, we require two months or more between runs on the leaderboard test set. We will do our best to run within two weeks (usually we will run much faster). We will only post results on the leaderboard when an online description of the system is available. Testing on the leaderboard test set is meant to be the final step before publication. Under extreme circumstances, we reserve the right to limit running on the leaderboard test set to systems that are mature for publication. Your model should generate a prediction file in the format specified above (under "evaluation scripts") and run with the provided evaluation scripts.
Note about sampling a validation set
The training set contains many examples which differ only in the order of the boxes. When selecting a validation set, we suggest holding out a set of subdirectories in train/images/ rather than sampling individual examples at random. We observed that certain models are able to perform well on randomly-sampled validation images, due to permutations of these images occurring in the training set, but generalize poorly to the development data.
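One way to follow this suggestion is to hold out whole numbered subdirectories. The sketch below is an illustration, not the procedure used by the authors: split_by_directory, num_val_dirs, and seed are hypothetical names, and each example is assumed to be a dict carrying the directory field from the training .json file.

```python
import random


def split_by_directory(examples, num_val_dirs, seed=0):
    """Hold out whole numbered subdirectories as a validation set."""
    directories = sorted({str(ex["directory"]) for ex in examples})
    rng = random.Random(seed)  # seeded so the split is reproducible
    val_dirs = set(rng.sample(directories, num_val_dirs))
    train = [ex for ex in examples if str(ex["directory"]) not in val_dirs]
    val = [ex for ex in examples if str(ex["directory"]) in val_dirs]
    return train, val, val_dirs
```

Because every permutation of an example lives in the same subdirectory, no image in the held-out set has a reordered copy in the remaining training data.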