TREC QA Data

The now-discontinued QA track of the NIST TREC conference provides the de facto standard benchmark data for QA systems:

http://trec.nist.gov/data/qa.html

This directory contains download and format-conversion scripts (run trec-setup.sh to download the TREC datasets for several years and produce easy-to-process TSV files) as well as the reference datasets, which are mostly TREC-based:

  • treclarge-raw.tsv contains the ID, type, question and answer PCRE for the "large" dataset of questions from TREC 8, 9, 10, 11 and 12 (years 1999-2003), when the QA track was about isolated, general factoid questions. Do not edit questions in this file; it is autogenerated.

  • trecnew-raw.tsv contains the ID, type, question and answer PCRE for the "new" dataset of questions from TREC 11 and 12 (years 2002 and 2003), which appear to be the most mature and corpus-agnostic sets. (Also used e.g. in Chu-Carroll, Fan: "Leveraging Wikipedia Characteristics...") Do not edit questions in this file; it is autogenerated.

  • trecnew-raw-comments.txt contains curation notes for the questions: the general guidelines we followed in the curation process, an analysis of questions deemed inapplicable (too source-specific, having an inaccurate answer pattern, or unanswerable), and notes about changes made to the dataset. This file is no longer exhaustive and is mainly of historical interest; newer dataset changes are recorded in the Git history.

  • trecnew-curated.tsv is a version of trecnew-raw.tsv with manual tweaks to improve evaluation accuracy on the dataset: answer patterns updated, some questions reworded, and quite a few removed altogether. If you want to curate the dataset further, edit this file (and the standard splits in the root directory).
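
As a rough illustration of how these files might be consumed, the sketch below parses rows of ID, type, question and answer pattern, then checks candidate answers against the pattern. It rests on several assumptions that are not confirmed here: four tab-separated columns in that order, no header row, and case-insensitive matching. The sample rows and the helper names load_questions and is_correct are made up for the example. Python's re module covers most simple PCRE syntax but is not fully PCRE-compatible, so some patterns may need a real PCRE binding.

```python
import csv
import io
import re


def load_questions(fileobj):
    """Parse rows of ID, type, question, answer pattern from a TSV stream.

    Assumes four tab-separated columns and no header row (an assumption,
    not a documented guarantee of the dataset files).
    """
    reader = csv.reader(fileobj, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [
        {"id": row[0], "type": row[1], "question": row[2], "pattern": row[3]}
        for row in reader
        if len(row) >= 4  # skip blank or malformed lines
    ]


def is_correct(pattern, candidate):
    """Return True if the candidate answer matches the answer pattern.

    Case-insensitive search is an assumption made for this sketch.
    """
    return re.search(pattern, candidate, re.IGNORECASE) is not None


# Hypothetical sample rows, not taken from the actual dataset.
SAMPLE = (
    "q1\tfactoid\tWhat is the capital of France?\tParis\n"
    "q2\tfactoid\tWhen did the Berlin Wall fall?\t(November\\s+)?1989\n"
)

questions = load_questions(io.StringIO(SAMPLE))
for q in questions:
    print(q["id"], q["question"], q["pattern"])
```

To read a real file instead of the in-memory sample, something like load_questions(open("trecnew-curated.tsv", encoding="utf-8", newline="")) should work if the column assumptions hold.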

Origin and Licensing

These sets were produced by NIST, a US Government institution, which makes them public domain. Ellen M. Voorhees of NIST kindly confirmed:

> > The questions and answer patterns are freely download-able
> > from the TREC web site. ...
> ... Could you please clarify what of what copyright status
> they are and what their licence is?
They are in the public domain, though, of course, we would
appreciate attribution.

The answer PCRE patterns were originally contributed to the NIST TREC QA track courtesy of Ken Litkowski of CL Research.
