trec/: Add TREC 8 (1999) dataset, for use in large+noisy collections
TREC 8 is of low quality and contains odd questions that are unlikely
to be answerable using Wikipedia (especially when the answers must also
match the provided patterns). However, positive matches may still be
useful for training systems on large datasets.
pasky committed May 28, 2015
1 parent e48ee6b commit 30ac739
Showing 6 changed files with 799 additions and 5 deletions.
6 changes: 3 additions & 3 deletions trec/README.md
@@ -12,16 +12,16 @@ and produce some easy-to-process TSV files) and also the reference
(mostly) TREC-based datasets:

* ``treclarge-raw.tsv`` contains ID, type, question and answer PCRE
- for the "large" dataset of questions coming from TREC 9, 10, 11
- and 12 (years 2000-2003), when the QA track was about isolated,
+ for the "large" dataset of questions coming from TREC 8, 9, 10, 11
+ and 12 (years 1999-2003), when the QA track was about isolated,
general factoid questions. Do not edit questions in this file,
it is autogenerated.

* ``trecnew-raw.tsv`` contains ID, type, question and answer PCRE
for the "new" dataset of questions coming from TREC 11 and 12
(years 2002, 2003), which appear to be the most mature and
corpus-agnostic sets. (Also used e.g. in Chu-Carroll, Fan:
- "Leveraging Wikipedia Characteristics...) Do not edit questions
+ "Leveraging Wikipedia Characteristics...") Do not edit questions
in this file, it is autogenerated.

* ``trecnew-raw-comments.txt`` contains curation notes for the
3 changes: 1 addition & 2 deletions trec/trec-setup.sh
@@ -2,13 +2,12 @@
# Download and preprocess TREC2000-2003 QA datasets - questions
# and answer patterns.

- # 1999 data is very specific to the supplied corpus rather than being generic trivia
- # http://trec.nist.gov/data/qa/T8_QAdata/topics.qa_questions.txt http://trec.nist.gov/data/qa/T8_QAdata/adjudicated_for_perl
# 2004 data is grouped into "topics" and that's too weird for us now
# http://trec.nist.gov/data/qa/2004_qadata/QA2004_testset.xml
# http://trec.nist.gov/data/qa/2004_qadata/04.patterns.zip; unzip 04.patterns.zip trec13factpats.txt

datasets="
+ 1999 http://trec.nist.gov/data/qa/T8_QAdata/topics.qa_questions.txt http://trec.nist.gov/data/qa/T8_QAdata/adjudicated_for_perl
2000 http://trec.nist.gov/data/qa/T9_QAdata/qa_questions_201-893 http://trec.nist.gov/data/qa/T9_QAdata/patterns
2001 http://trec.nist.gov/data/qa/2001_qadata/main_task_QAdata/qa_main.894-1393.txt http://trec.nist.gov/data/qa/2001_qadata/main_task_QAdata/patterns.trec10
2002 http://trec.nist.gov/data/qa/2002_qadata/main_task_QAdata/t11_500_numbered.txt http://trec.nist.gov/data/qa/2002_qadata/main_task_QAdata/patterns.txt
199 changes: 199 additions & 0 deletions trec/trec1999-p.tsv
@@ -0,0 +1,199 @@

1 Young
2 \$469,000
3 106s?|205s?|306s?|309s?|405|504s?|505s?|Peugeots|automobiles?|cars?|diesel\s+motors?|plastic\s+components|vehicles?
4 Pounds\s+12\s*(?:m|(?:million))
5 Horne
6 To\s+record\s+his\s+revelations|finish\w*\s+writing.*?revelations
7 1\.4\s*(?:(?:bn)|(?:billion))|1\.6\s*(?:(?:bn)|(?:billion))
8 Tourette\s*'\s*s
9 150 miles?
10 Pfister
11 Folsom
12 Pounds\s*4\s*(?:m|(?:million))
13 \$\s*1
14 China
15 1980s|1987
16 (?:Krebs.*?Fischer)|(?:Fischer.*?Krebs)
17 (?:(?:nine)|9).*?months?
18 Lee\s+Teng\s*-?\s*Hui|Li\s+Teng\s*-?\s*Hui|Mr\.?\s+Lee|Mr\.?\s+Li|President\s+Lee|President\s+Li
19 Koresh
20 Norwich
21 Shepard
22 130\s+million\s+years\s+ago
23 1950
24 17\s+years\s+ago|1972|2[23]\s+years\s+ago|April\s+1993|February\s+1976|early\s+1970s|more\s+than\s+20\s+years\s+have\s+elapsed|over\s+20\s+years\s+ago|today|two\s+decades\s+ago
25 Ryan
26 La\s+Nina
27 East\s+Java|Surabaya
28 Robinson
29 Sirius
30 10\s*-?\s*point\s+environmental\s+agenda|code\s+of\s+conduct|environmental\s+agenda\s+for\s+corporations|principles\s+of\s+the\s+Coalition\s+of\s+Environmentally\s+Responsible\s+Economies
31 Ohio
32 Sinatra
33 Berlin
34 Hollywood\s+Cemetery|Hollywood\s+Memorial\s+Park
35 Kilimanjaro|Uhuru\s+Peak
36 Tuesday
37 Hall
38 Virginia|Westmoreland\s+County
39 Powell
40 Kyi
41 0?\.\s*08(?:%|(?:pct)|(?:per))?|0?\.\s*10(?:%|(?:pct)|(?:per))?|twice\s+the\s+legal\s+limit.*?\.20?
42 3\s*\.\s*5\s*(?:-|(?:to))\s*5\s*\.\s*5(?:%|(?:pct)|(?:per))?|3\s+1/2\s*(?:-|(?:to))\s*5\s+1/2(?:%|(?:pct)|(?:per))?|4\s*(?:-|(?:to))\s*6(?:%|(?:pct)|(?:per))?|between\s+3\s*\.\s*5.*?5\s*\.\s*5(?:%|(?:pct)|(?:per))?
43 Whitten
44 McKusick
45 10\s*Feb(?:ruary)?|Feb.*?94|Feb\s*10
46 Jesus\s+Gil\s+y\s+Gil
47 Mitsubishi\s+Heavy\s+Industries
48 90\s*km\s+north\s+of\s+Pyongyang|Yongbyon|Yongbyun
49 Henderson
50 1956
51 Kennedy
52 Morgan
53 Starzl
54 22\s*Apr(?:il)?|Apr(?:il)?\s*22
55 Seattle\s+suburb|Washington|around\s+Seattle
56 562
57 S?EER
58 (?:Sky)?larks\s+on\s+the\s+String|Grand\s+Canyon|In\s+The\s+Name\s+Of\s+The\s+Father|Larks|Mistertao|Music\s+Box|The\s+Wedding\s+Banquet
59 Calderon|Figueres
60 Pounds\s+5\s*,?\s*0[03]0|\$?6\s*,?\s*400
61 Havana\s+Club
62 MS|Multiple\s+Sclerosis
63 Komsomolets
64 Frank\s+Oz
65 German(?:y|s)?|Japan(?:(?:ese)|s)?
66 McAuliffe
67 Mississippi
68 Christ\s+child|boy\s+child
69 fishermen
70 26[1234]|some\s+220
71 1941
72 1960
73 Agra|India
74 Kirk
75 Kaposi
76 198[567]|mid\s*-?\s*1980\s*'?\s*s
77 Brando
78 District\s+of\s+Columbia|Washington\s*'?\s*s?
79 cello\s+concertos?
80 taxol
81 30\s*,?\s*000
82 2\s*,?\s*130
83 Landmark\s+Tower|Sunshine
84 Japan
85 Duke
86 Tomba
87 Helmut\s+Schmidt
88 nutmeg
89 mid\s*-?\s*30\s*'?\s*s
90 Hubbard
91 Shanghai
92 Morris
93 Magellan
94 Hoagie\s+Carmichael
95 Ghana|Morocco
96 6:33\s*a\.?m\.?
97 19m\s*-?\s*acres?
98 12\s*,?\s*388\s*ft\.?|Mt\.?\s+Fuji
99 genome
100 Hurricane\s+Andrew|Hurricane\s+Hugo
101 1\s*,?\s*900|2\s*,?\s*[01]00
102 Haynes
103 900\s+people\s+died|[89]00\s+lives|at\s+least\s+139.*?survived
104 Marathi
105 1[057]\s*,?\s*000
106 K\s*-?\s*2
107 1964
108 NCSA|National\s+Cent(?:(?:er)|(?:re))\s+for\s+Supercomputing|Netscape|University\s+of\s+Illinois
109 moon
110 Ruby
111 one\s+hour\s+and\s+forty\s+minutes
112 Rawlings
113 anencephal(?:y|(?:ics?))
114 1\s*\.\s*5\s+times
115 comprehensive\s+health,\s+nutritional,\s+educational,\s+social,\s+and\s+other\s+services|federally\s+funded.+?program|national\s+program\s+providing\s+comprehensive\s+developmental\s+services|preschool\s+program|provides\s+education\s+meals\s+and\s+health\s+screening|readiness\s+skills
116 New\s+York\s+Jets
117 (?:Watson.*?Crick)|(?:Crick.*?Watson)
118 20\s+per\s+cent\s+of\s+all\s+higher\s+plant\s+species|30\s*(?:%|(?:per)|(?:pct))
119 Mairead\s+(?:(?:Maguire)|(?:Corrigan))
120 Trout
121 Muluzi
122 Jones
123 81\s*,?\s*000
124 defoliant|to\s+(?:(?:destroy)|(?:remove))\s+(?:(?:ground)|(?:jungle))\s+cover|to\s+destroy\s+fields
125 District\s+of\s+Columbia|Washington
126 (?:4\s*May)|(?:May\s*4)|5\/4\/1994
127 Nagoya
128 Aldrin
129 13
130 1990(?:may22)?|22 May 1993
132 Montevideo
133 green\s+remediation
134 Berth\s+87|Los\s+Angeles.*?Harbor|Port\s+of\s+Los\s+Angeles|San\s+Pedro
135 Trotsky
136 Queen Beatrix
137 Gonzalez
138 apoptosis
139 4\s*,?\s*200
140 11
141 1\s*hour\s*11\s*minutes
142 600\s*,?\s*000
143 Mexico
144 1\s*,?\s*200(?:MW)?|2\s*,?\s*400(?:MW)?|two.*?600(?:MW)?
145 Hinckley.+?shot.+?Reagan|Reagan.+?gunman.+?Hinckley|Reagan.+?shot\s+by|Reagan.+?would\s*-\s*be\s+assassin|attempt.+?to\s+assassinate.*?President|to\s+shoot\s+the\s+President|tried\s+to\s+kill\s+President
146 1990
147 Hata|Hosokawa|Kaifu|Miyazawa|Murayama|Takeshita
148 14\s*,?\s*000|20\s*,?\s*000
149 Chicago
150 1\s*,?\s*270\s*k(?:ilo)?m(?:eters)?|850\s*-?\s*miles?
151 New\s+York
152 11\s*\.\s*9\s*m(?:illion)?|12\s*m(?:illion)?
153 Sacramento
154 11
155 Clinton|Dukakis|Humphrey|McGovern
156 Jan.?\s*3
157 Hanover|N\s*\.?\s*H\s*\.?|New Hampshire|United States
158 30\s*,?\s*000|33\s*,?\s*000
159 cold
160 14\s*July|1789(?:jul14)?|7\/14|July\s*14
161 Dollars\s*6\s*\.\s*[34]\s*bn|Dollars\s*7\s*bn|billionaire
162 Prish?tina|Prishtine
163 VA|Virginia
164 Equifax
165 209
166 10\/23\/1989|23\s*Oct(?:ober)?\s*1989
167 Poland|Polish
168 Hazelwood
169 Phoenix Suns
170 Rabbani
171 Sweeten
172 Hawaii
173 four
174 1987|two\s+years\s+ago
175 1827
176 500\s*,?\s*000
177 29\s*,?\s*028\s*-?\s*f(?:ee)?t|29\s*,?\s*028\s*-?\s*foot|29\s*,?\s*100\s*-?\s*f(?:ee)?t|29\s*,?\s*100\s*-?\s*foot
178 Brazzaville
179 Rome
180 Colombo
181 Do\s+Androids\s+Dream\s+Of\s+Electric\s+Sheep
182 Thespis|Trial\s+by\s+Jury
183 HAL
185 1916|April\s*21\s*,?\s*1918
186 Clervaux\s+Castle|Luxembourg
187 Ford\s*'?\s*s\s+Theater|Ford\s*'?\s*s\s+Theatre|Washington\s*,?\s*D\s*\.?\s*C\s*\.?
188 1893.*?27\s+years\s+before|69\s+years\s+ago.*?1989|Aug\s*\.?\s*18\s*,?\s*1920
189 Middle\s+East|Persian\s+Gulf
190 Ind(?:iana)?|South\s+Bend\s*,\s*IN|South\s+Bend\s*IN\s+466\d\d
191 Lamar\s*,?\s*Mo|Missouri|Mo\s*\.|USA
192 Kissinger|Rogers
193 Honest\s+Abe|Lincoln
194 Respighi
195 Joyce
196 Shakespeare(?:an)?
197 Yeah\s*,?\s*but\s+I\s*'?\s*m\s+sleeping
198 hemlock
199 14\s*,?\s*690\s*-\s*foot|14\s*,?\s*776\s*feet\s+9|7\s*inches\s+higher\s+than.*?14\s*,?\s*776\s*feet\s+2
200 147\s*-?\s*f(?:(?:oo)|(?:ee))?t
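
The pattern file above pairs each question ID with an answer regex. As a rough illustration of how such a file might be consumed, here is a minimal Python sketch. It is not part of this commit or of the repository's tooling. It assumes the ID and the pattern are separated by a single tab, that Python's `re` module accepts these PCRE patterns unchanged, and that matching is case-insensitive. The names `load_patterns` and `is_correct` are hypothetical, and the sample rows are copied from the listing above.

```python
import re

# A few rows copied from trec1999-p.tsv above.
# Assumption: the real file separates ID and pattern with a tab.
SAMPLE_ROWS = [
    ("1", r"Young"),
    ("9", r"150 miles?"),
    ("35", r"Kilimanjaro|Uhuru\s+Peak"),
    ("81", r"30\s*,?\s*000"),
]
SAMPLE = "\n".join(f"{qid}\t{pattern}" for qid, pattern in SAMPLE_ROWS)


def load_patterns(text, flags=re.IGNORECASE):
    """Parse 'ID<TAB>pattern' lines into a dict of compiled regexes.

    Case-insensitive matching is an assumption, not something the
    commit specifies; pass flags=0 to match case-sensitively.
    """
    patterns = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines such as the one opening the file
        # Split on the first tab only: patterns may contain spaces.
        qid, pattern = line.split("\t", 1)
        patterns[qid] = re.compile(pattern, flags)
    return patterns


def is_correct(patterns, qid, answer):
    """Return True if the candidate answer matches the pattern for qid."""
    rx = patterns.get(qid)
    # IDs without a pattern (e.g. 131 and 184 are absent above) never match.
    return rx is not None and rx.search(answer) is not None


patterns = load_patterns(SAMPLE)
print(is_correct(patterns, "35", "Mount Kilimanjaro"))
print(is_correct(patterns, "81", "roughly 30,000 people"))
print(is_correct(patterns, "81", "3,000"))
```

The sketch uses `search` rather than `fullmatch` because the patterns describe a fragment that should appear somewhere in the candidate answer, not the whole answer string.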