original dataset and generation script output have different formats #14

Open
rizar opened this issue Apr 12, 2019 · 3 comments

rizar commented Apr 12, 2019

In the released dataset, the "question_family_index" field takes values from 0 to 89. When I generate a new dataset with the generation script, "question_family_index" takes smaller values, since it refers to the index of a template within its template file. In this regard, I have two questions:

  • Are there any other differences between the code that was originally used to generate CLEVR and the code that is currently hosted on GitHub?
  • In the version of CLEVR that is currently available for download, is there a way to resolve "question_family_index" into the actual template? I guess I would need to know the order in which the template files were loaded, and I am not 100% sure that this order is deterministic.
@jcjohnson

The question generation script here on GitHub is mostly the same as the code used to generate CLEVR -- the main changes are that I added documentation, tried to remove dead code that was no longer used, and renamed some of the JSON keys to clearer names.

Here's the original generation code for you to compare against in case there are other differences that I can't remember:

https://gist.github.com/jcjohnson/6fb119a0372166ec9f4f006a1242a7bc

In the original code (L710) "template_idx" is also the index of a template within a file, much like "question_family_index" in the GitHub version of the code.

There was another script that converted the output from the original generation script into the format that we released as CLEVR_v1.0, which changed the names of JSON keys ("text_question" -> "question", "structured_question" -> "program"). Unfortunately after digging around today I wasn't able to find this conversion script.
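(For illustration: the renaming itself would have amounted to something like the short snippet below. This is only a reconstruction from the key names above, not the lost conversion script; the file paths are placeholders.)

import json

# Hypothetical reconstruction of the lost conversion step. Only the key
# renames ("text_question" -> "question", "structured_question" -> "program")
# come from the description above; paths and structure are assumptions.
with open('CLEVR_questions.json') as f:
    data = json.load(f)
for q in data['questions']:
    q['question'] = q.pop('text_question')
    q['program'] = q.pop('structured_question')
with open('CLEVR_v1.0_questions.json', 'w') as f:
    json.dump(data, f)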

However I suspect that the conversion script also changed the semantics of "template_idx" / "question_family_index" to be an overall index of the template (between 0 and 89) rather than the index of the template within the file; in hindsight this was clearly a mistake since it makes it tough to figure out which template was used to generate which question.

Thankfully the templates originally used for question generation have exactly the same structure as the ones on GitHub, so the only source of nondeterminism is the order in which the JSON files are loaded (this order depends on os.listdir, which I think can give different orders on different filesystems).

To fix this issue, I manually matched up values of "question_family_index" from the released CLEVR_v1.0 data to the text templates from the JSON files, and found that you can recover the template for each question if you load them in this order:

  • compare_integer.json
  • comparison.json
  • three_hop.json
  • single_and.json
  • same_relate.json
  • single_or.json
  • one_hop.json
  • two_hop.json
  • zero_hop.json

Here's a little script that shows how to recover templates from the released questions: it loads templates in this order, randomly samples some questions, and prints out the text of each question as well as its recovered template:

https://gist.github.com/jcjohnson/9f3173703f8578db787345d0ce61002a
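If you just want the index-to-template mapping without the gist, here is a minimal sketch of the same idea. It assumes the template files sit in a local CLEVR_1.0_templates/ directory (as in this repo) and that the released questions are in CLEVR_val_questions.json; adjust the paths to your setup.

import json, os

# Assumed local paths -- adjust to where the templates and the released
# questions live on your machine.
TEMPLATE_DIR = 'CLEVR_1.0_templates'
QUESTIONS_FILE = 'CLEVR_val_questions.json'

# Loading the files in this fixed order (instead of os.listdir order)
# reproduces the 0..89 numbering of "question_family_index" in CLEVR_v1.0.
FILE_ORDER = [
    'compare_integer.json', 'comparison.json', 'three_hop.json',
    'single_and.json', 'same_relate.json', 'single_or.json',
    'one_hop.json', 'two_hop.json', 'zero_hop.json',
]

templates = []
for fn in FILE_ORDER:
    with open(os.path.join(TEMPLATE_DIR, fn)) as f:
        templates.extend(json.load(f))  # each file is a JSON list of templates

with open(QUESTIONS_FILE) as f:
    questions = json.load(f)['questions']

q = questions[0]
t = templates[q['question_family_index']]
print(q['question'])
print(t['text'][0])  # one of the text variants of the recovered template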

In the process of figuring this out, I realized another slight inconsistency between the original code and the GitHub code: we changed the wording of the "same_relate" templates to be less ambiguous (in particular adding "other" or "another"), but the semantics of these templates are exactly the same. Here are the old versions of those templates:

https://gist.github.com/jcjohnson/09541f3bcb32e73e0ba47c57d09f3f6e


rizar commented Apr 16, 2019

Thanks a lot, @jcjohnson !

Regarding the inconsistency between the original code and the GitHub code: can you please clarify which code was used to generate the widely used CLEVR distribution? I just checked and found that CLEVR_val_questions.json does contain questions of the form "What size is the other ...", which suggests it was the newer version of your templates, not the one you linked as a GitHub gist, that was used to generate it.

@gudovskiy

I found another small incompatibility: in the original CLEVR release, the key naming each function in a program was called "function", while the GitHub code emits "type". So I wrote a small conversion script:

import argparse, json

parser = argparse.ArgumentParser()
parser.add_argument('--input_questions_file',  default='../clevr-output/CLEVR_questions.json')
parser.add_argument('--output_questions_file', default='../clevr-output/CLEVR_fixed_questions.json')

def main(args):
    # Load questions
    with open(args.input_questions_file, 'r') as f:
        question_data = json.load(f)
        info = question_data['info']
        questions = question_data['questions']
    print('Read %d questions from disk' % len(questions))
    # Rename 'type' back to 'function' in each program step
    for q in questions:
        programs = q['program']
        for p in programs:
            p['function'] = p.pop('type')
    # Dump new dict
    with open(args.output_questions_file, 'w') as f:
        print('Writing output to %s' % args.output_questions_file)
        json.dump({
            'info': info,
            'questions': questions,
        }, f)

if __name__ == '__main__':
    main(parser.parse_args())
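If you save this as fix_questions.py (the name used in the commit referenced below), a typical invocation would be:

python fix_questions.py \
    --input_questions_file ../clevr-output/CLEVR_questions.json \
    --output_questions_file ../clevr-output/CLEVR_fixed_questions.json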

kris7t added a commit to csekili/clevr-dataset-gen that referenced this issue Oct 9, 2020
We downgrade same_relate.json to the CLEVR 1.0 version.

The question generator should use the same question format as the CLEVR
1.0 dataset on which the CLEVR IEP neural networks were trained.

See
facebookresearch#14 (comment)

Also add fix_questions.py to convert the output questions JSON to the
older format.

See
facebookresearch#14 (comment)