TREC QA Data
The now-discontinued QA track of the NIST TREC conference provides the de facto standard benchmark data for QA systems.
This directory contains some download and format conversion scripts (use trec-setup.sh to download TREC datasets for several years and produce some easy-to-process TSV files) and also the reference (mostly) TREC-based datasets:
treclarge-raw.tsv contains ID, type, question and answer PCRE for the "large" dataset of questions coming from TREC 8, 9, 10, 11 and 12 (years 1999-2003), when the QA track was about isolated, general factoid questions. Do not edit questions in this file; it is autogenerated.
trecnew-raw.tsv contains ID, type, question and answer PCRE for the "new" dataset of questions coming from TREC 11 and 12 (years 2002 and 2003), which appear to be the most mature and corpus-agnostic sets (also used e.g. in Chu-Carroll, Fan: "Leveraging Wikipedia Characteristics..."). Do not edit questions in this file; it is autogenerated.
trecnew-raw-comments.txt contains curation notes for the questions: the general guidelines we followed in the curation process, an analysis of questions deemed inapplicable (too source-specific, with an inaccurate answer pattern, or unanswerable), and notes about changes made to the dataset. This file is no longer exhaustive and is mainly of historical interest; newer dataset changes are recorded in the Git history.
trecnew-curated.tsv is a version of trecnew-raw.tsv with manual tweaks to improve evaluation accuracy on the dataset: updated answer patterns, some questions reworded, and quite a few removed altogether. If you want to further curate the dataset, edit this file (and the standard splits in the root directory).
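As a rough illustration of how these TSV files might be consumed, here is a minimal Python sketch that reads rows and checks a candidate answer against the answer pattern. It makes several assumptions that are not confirmed by this README: that the columns appear in the order ID, type, question, answer pattern; that the files have no header row; that matching should be case-insensitive; and that the patterns are simple enough for Python's `re` module, which is not a full PCRE implementation. The names `Question`, `read_questions` and `is_correct` are hypothetical helpers, and the sample rows are made up rather than taken from the dataset.

```python
import csv
import io
import re
from typing import Iterator, NamedTuple


class Question(NamedTuple):
    qid: str
    qtype: str
    text: str
    answer_pattern: str


def read_questions(handle) -> Iterator[Question]:
    """Yield one Question per TSV row (assumed order: ID, type, question, pattern)."""
    # QUOTE_NONE keeps any quote characters in questions or patterns literal.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 4:
            continue  # skip blank or malformed lines
        yield Question(*row[:4])


def is_correct(candidate: str, question: Question) -> bool:
    """True if the answer pattern matches anywhere in the candidate answer."""
    # Case-insensitive matching is an assumption, not documented behaviour.
    return re.search(question.answer_pattern, candidate, re.IGNORECASE) is not None


# Made-up rows for illustration only; they are not taken from the dataset.
SAMPLE = (
    "q1\tfactoid\tWhat is the capital of France?\tParis\n"
    "q2\tfactoid\tHow many legs does a spider have?\t(8|eight)\n"
)

questions = list(read_questions(io.StringIO(SAMPLE)))
print(is_correct("The capital is Paris.", questions[0]))  # True
print(is_correct("six", questions[1]))                    # False
```

To try this on a real file, replace the `io.StringIO(SAMPLE)` handle with something like `open("trecnew-curated.tsv", encoding="utf-8", newline="")`. If a pattern relies on PCRE-only syntax, `re` may reject it or behave differently, so treat the result as approximate.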
Origin and Licensing
These sets were produced by NIST, a US Government institution, which makes them public domain. Ellen M. Voorhees of NIST kindly confirmed:
> > The questions and answer patterns are freely download-able
> > from the TREC web site. ...
> ... Could you please clarify what of what copyright status
> they are and what their licence is?
They are in the public domain, though, of course, we would appreciate attribution.
The answer PCRE patterns were originally contributed to the NIST TREC QA track courtesy of Ken Litkowski of CL Research.