Break: A Question Understanding Benchmark

Break is a human-annotated dataset of natural language questions and their Question Decomposition Meaning Representations (QDMRs). Break consists of 83,978 examples sampled from 10 question answering datasets over text, images, and databases. This repository contains the Break dataset along with information on the exact data format.

For more details, check out our TACL paper "Break It Down: A Question Understanding Benchmark" and the website.
For the code and models presented in our paper, see our repository at: https://github.com/tomerwolgithub/Break.

Key Links
- Break Dataset: Download
- Paper: "Break It Down: A Question Understanding Benchmark"
- Models Code: https://github.com/tomerwolgithub/Break
- Leaderboard:
  - Break: Leaderboard
  - Break High-Level: Leaderboard
  - Evaluator Code: https://github.com/allenai/break-evaluator
- Website: https://allenai.github.io/Break/
- Huggingface nlp library: https://huggingface.co/datasets/break_data

Changelog

7/04/2020 Break is now part of the HuggingFace nlp library; see details.
4/10/2020 Pretrained QDMR Parsing models are now available.
4/02/2020 New AI2 leaderboards for Break and Break High-Level.
2/26/2020 Our paper's entire codebase is now available.
1/31/2020 The entire codebase and official leaderboard will be released soon.
1/31/2020 The full Break dataset has been released!

Question Answering Datasets

The Break dataset contains questions from the following 10 datasets:
- Semantic Parsing: Academic, ATIS, GeoQuery, Spider
- Visual Question Answering: CLEVR-humans, NLVR2
- Reading Comprehension (and KB-QA): ComQA, ComplexWebQuestions, DROP, HotpotQA

Data Description

Datasets

QDMR: Contains questions over text, images and databases annotated with their Question Decomposition Meaning Representation. In addition to the train, dev and (hidden) test sets, we provide lexicon_tokens files. For each question, the lexicon file contains the set of valid tokens that could potentially appear in its decomposition (Section 3 of the paper).
QDMR high-level: Contains questions annotated with the high-level variant of QDMR. These decompositions are exclusive to Reading Comprehension tasks (Section 2 of the paper). lexicon_tokens files are also provided.
logical-forms: Contains questions and QDMRs annotated with full logical forms, made up of QDMR operators and their arguments. The full logical forms were inferred by the annotation-consistency algorithm described in Section 4.3 of the paper.

Data Format

QDMR & QDMR high-level:
- train.csv, dev.csv, test.csv:
  - question_id: The Break question id, in the format [ORIGINAL DATASET]_[original split]_[original id]. E.g., NLVR2_dev_dev-1049-1-1 comes from the NLVR2 dev split, where its NLVR2 id is dev-1049-1-1.
  - question_text: Original question text.
  - decomposition: The annotated QDMR of the question, its steps delimited by ;. E.g., return flights ;return #1 from washington ;return #2 to boston ;return #3 in the afternoon.
  - operators: List of tagged QDMR operators, one for each step. QDMR operators are fully described in Section 2 of the paper. The 14 potential operators are: select, project, filter, aggregate, group, superlative, comparative, union, intersection, discard, sort, boolean, arithmetic, comparison. Unidentified operators are tagged with None.
  - split: The Break dataset split of the example, train / dev / test.
- train_lexicon_tokens.json, dev_lexicon_tokens.json, test_lexicon_tokens.json:
  - "source": The source question.
  - "allowed_tokens": The set of valid lexicon tokens that can appear in the QDMR of the question. For the method used to generate lexicon tokens see here.
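As a rough illustration of the question_id and decomposition formats described above, the sketch below splits an id into its three parts and a decomposition into its steps. The helper names (parse_question_id, split_decomposition, referenced_steps) are hypothetical and not part of the Break codebase. The id parsing assumes that neither the dataset name nor the split name contains an underscore, while the original id may.

```python
import re


def parse_question_id(question_id):
    """Split a Break question id into (dataset, split, original id)."""
    # Split at most twice, so underscores inside the original id are kept.
    dataset, split, original_id = question_id.split("_", 2)
    return dataset, split, original_id


def split_decomposition(decomposition):
    """Return the QDMR steps of a ';'-delimited decomposition string."""
    return [step.strip() for step in decomposition.split(";") if step.strip()]


def referenced_steps(step):
    """Return the step numbers that a step refers to, e.g. '#1' -> 1."""
    return [int(number) for number in re.findall(r"#(\d+)", step)]


if __name__ == "__main__":
    qdmr = ("return flights ;return #1 from washington ;"
            "return #2 to boston ;return #3 in the afternoon")
    print(parse_question_id("NLVR2_dev_dev-1049-1-1"))
    for index, step in enumerate(split_decomposition(qdmr), start=1):
        print(index, step, referenced_steps(step))
```

Each step may refer only to earlier steps through its #N references, so the list returned by referenced_steps can also serve as a simple well-formedness check.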
logical-forms:
- train.csv, dev.csv, test.csv:
  - question_id: Same as before.
  - question_text: Same as before.
  - decomposition: Same as before.
  - program: List of QDMR operators and arguments that the original QDMR was mapped to. E.g., the QDMR return citations ;return #1 of Making database systems usable ;return number of #2 is mapped to the program [ SELECT['citations'], FILTER['#1', 'of Making database systems usable'], AGGREGATE['count', '#2'] ].
  - operators: Same as before.
  - split: Same as before.
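The program field can be read back into (operator, arguments) pairs along the following lines. This is a minimal sketch, assuming the field is stored as a string that looks exactly like the example above, with upper-case operator names and single-quoted arguments that contain no single quotes or closing brackets. The function name parse_program is hypothetical.

```python
import re

# One step: an upper-case operator name followed by a bracketed argument list.
STEP_PATTERN = re.compile(r"([A-Z_]+)\[(.*?)\]")
# One argument: a single-quoted string without embedded single quotes.
ARG_PATTERN = re.compile(r"'([^']*)'")


def parse_program(program):
    """Return a list of (operator, [arguments]) pairs for a program string."""
    return [
        (operator, ARG_PATTERN.findall(raw_args))
        for operator, raw_args in STEP_PATTERN.findall(program)
    ]


if __name__ == "__main__":
    example = ("[ SELECT['citations'], "
               "FILTER['#1', 'of Making database systems usable'], "
               "AGGREGATE['count', '#2'] ]")
    for operator, args in parse_program(example):
        print(operator, args)
```

Arguments that contain quotes or brackets would need a proper tokenizer rather than these regular expressions.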

Data Statistics

Break question decomposition datasets:

Data	Examples	Train	Dev	Test
QDMR	60,150	44,321 (73.7%)	7,760 (12.9%)	8,069 (13.4%)
QDMR High-level	23,828	17,503 (73.5%)	3,130 (13.1%)	3,195 (13.4%)
logical-forms (QDMR)	59,823	44,098 (73.7%)	7,719 (12.9%)	8,006 (13.4%)
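The totals and percentages in the table above follow directly from the split sizes. The short check below recomputes them; the counts are copied from the table, and the names SPLITS and split_percentages are only illustrative.

```python
# Split sizes copied from the table above: (train, dev, test).
SPLITS = {
    "QDMR": (44321, 7760, 8069),
    "QDMR High-level": (17503, 3130, 3195),
    "logical-forms (QDMR)": (44098, 7719, 8006),
}


def split_percentages(counts):
    """Return the total and each split's share, formatted to one decimal."""
    total = sum(counts)
    return total, [f"{100 * count / total:.1f}%" for count in counts]


if __name__ == "__main__":
    for name, counts in SPLITS.items():
        total, shares = split_percentages(counts)
        print(name, total, shares)
    # QDMR and QDMR High-level together give the full Break dataset size.
    print(sum(SPLITS["QDMR"]) + sum(SPLITS["QDMR High-level"]))
```

The final sum matches the 83,978 examples quoted for the full Break dataset.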

QDMR annotations by original dataset:

Data	Examples	Train	Dev	Test
Academic	195	195	0	0
ATIS	4,906	4,042	457	407
GeoQuery	877	547	50	280
Spider	7,982	6,955	502	525
CLEVR-humans	13,935	9,453	2,215	2,267
NLVR2	13,517	9,915	1,805	1,797
ComQA	5,520	3,546	988	986
ComplexWebQuestions	2,988	1,985	475	528
DROP	10,230	7,683	1,268	1,279

QDMR High-level annotations by original dataset:

Data	Examples	Train	Dev	Test
ComplexWebQuestions	2,991	1,988	475	528
DROP	10,262	7,705	1,273	1,284
HotpotQA-hard	10,575	7,810	1,382	1,383

Reference

@article{Wolfson2020Break,
  title={Break It Down: A Question Understanding Benchmark},
  author={Wolfson, Tomer and Geva, Mor and Gupta, Ankit and Gardner, Matt and Goldberg, Yoav and Deutch, Daniel and Berant, Jonathan},
  journal={Transactions of the Association for Computational Linguistics},
  year={2020},
}

HuggingFace nlp library

You can also access Break as part of the HuggingFace nlp library:

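# Install the library first: `pip install nlp` in a shell, or `!pip install nlp` in a notebook cell.
from nlp import load_dataset
# Load the high-level QDMR configuration.
dataset = load_dataset('break_data', 'QDMR-high-level')
# Alternatively, load the full QDMR configuration:
# dataset = load_dataset('break_data', 'QDMR')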

Break is referenced here and can be browsed online as part of a simple viewer.
More details on the options and usage for this library can be found on the nlp repository at https://github.com/huggingface/nlp.
