-
Notifications
You must be signed in to change notification settings - Fork 206
/
00-readme.test.txt
232 lines (149 loc) · 6.62 KB
/
00-readme.test.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
STS 2014
STS
Semantic Textual Similarity for English
TEST DATA WITH GOLD LABELS
This set of files describes the DATASET for the Semantic Textual Similarity 2014.
The test dataset contains the following:
00-README.txt this file
correlation-noconfidence.pl evaluation script
sts2012-train.tgz train data for sts 2012
sts2012-test.tgz test data for sts 2012
sts2013-test.tgz test data for sts 2013
STS.gs.image.txt gold standard file
STS.gs.OnWN.txt "
STS.gs.tweet-news.txt "
STS.gs.deft-news.txt "
STS.gs.deft-forum.txt "
STS.gs.headlines.txt "
STS.input.image.txt tab separated input file with
sentence pairs
STS.input.OnWN.txt "
STS.input.tweet-news.txt "
STS.input.deft-news.txt "
STS.input.deft-forum.txt "
STS.input.headlines.txt "
STS.output.headlines.en.txt tab separated sample input file with
sentence pairs
Introduction
------------
Given two sentences of text, s1 and s2, the systems participating in
this task should compute how similar s1 and s2 are, returning a
similarity score, and an optional confidence score.
The test dataset comprises the 2012 and 2013 datasets, which can be used to
develop and train systems.
We include sample data for the test datasets, coming
from the following:
1) image description (image)
2) OntoNotes and WordNet sense definition mappings (OnWN)
3) news title and tweet comments (tweet-news)
4) deft discussion forum and news (deft-forum and deft-news)
5) news headlines (headlines)
The datasets has been derived as follows:
- STS.input.image.txt: The Image Descriptions data set is a subset of
the PASCAL VOC-2008 data set (Rashtchian et al., 2010) . PASCAL
VOC-2008 data set consists of 1,000 images and has been used by a
number of image description systems. The image captions of the data
set are released under a CreativeCommons Attribution-ShareAlike
license, the descriptions itself are free.
- STS.input.OnWN.txt: The sentences are sense definitions from WordNet
and OntoNotes. 5 pairs of sentences.
- STS.input.tweet-news.txt: The tweet-news data set is a subset of the
Linking-Tweets-to-News data set (Guo et al., 2013), which consists
of 34,888 tweets and 12,704 news articles. The tweets are the
comments on the news articles. The news sentences are the titles of
news articles.
- STS.input.deft-news.txt: A subset of news article data in the DEFT
project.
- STS.input.deft-forum.txt: A subset of discussion forum data in the
DEFT project.
- STS.input.headlines.txt: we used headlines mined from several news
sources by European Media Monitor using the RSS feed.
http://emm.newsexplorer.eu/NewsExplorer/home/en/latest.html
NOTE: Participant systems should NOT use the following datasets to
develop or train their systems:
- Ontonotes - Wordnet sense aligned definitions.
- Data released in (Guo et al., 2013).
- The train data for task 1 (seminal 2014)
Input format
------------
The input file consist of two fields separated by tabs:
- first sentence (does not contain tabs)
- second sentence (does not contain tabs)
Please check any of STS.input.*.txt
Gold Standard
-------------
The gold standard contains a score between 0 and 5 for each pair of
sentences, with the following interpretation:
(5) The two sentences are completely equivalent, as they mean the same
thing.
The bird is bathing in the sink.
Birdie is washing itself in the water basin.
(4) The two sentences are mostly equivalent, but some unimportant
details differ.
In May 2010, the troops attempted to invade Kabul.
The US army invaded Kabul on May 7th last year, 2010.
(3) The two sentences are roughly equivalent, but some important
information differs/missing.
John said he is considered a witness but not a suspect.
"He is not a suspect anymore." John said.
(2) The two sentences are not equivalent, but share some details.
They flew out of the nest in groups.
They flew into the nest together.
(1) The two sentences are not equivalent, but are on the same topic.
The woman is playing the violin.
The young lady enjoys listening to the guitar.
(0) The two sentences are on different topics.
John went horse back riding at dawn with a whole group of friends.
Sunrise at dawn is a magnificent view to take in if you wake up
early enough for it.
Format: the gold standard file consist of one single field per line:
- a number between 0 and 5
The gold standard in the test data will be assembled using mechanical
turk, gathering 5 scores per sentence pair. The gold standard score
will the average of those 5 scores. In this test dataset, this is
just a dummy number which you can ignore.
Answer format
--------------
The answer format is similar to the gold standard format, but includes
an optional confidence score. Each line has two fields separated by a
tab:
- a number between 0 and 5 (the similarity score)
- a number between 0 and 100 (the confidence score)
The use of confidence scores is experimental, and it is not required
for the official score.
An example is file STS.output.headlines.en.txt
Because there are two tasks this year, please add an extension “.en”
at the end of the system output submission file for the English task.
Scoring
-------
The official score is based on the average of Pearson correlation. The use of
confidence scores will be experimental, and it is not required for the
official scores.
Participation in the task
-------------------------
Participant teams will be allowed to submit three runs at most.
Other
-----
Please check http://alt.qcri.org/semeval2014/task10/ for more details.
Authors
-------
Eneko Agirre
Daniel Cer
Mona Diab
Aitor Gonzalez-Agirre
Weiwei Guo
German Rigau
References
----------
Eduard Hovy, Mitchell Marcus, Martha Palmer, Lance Ramshaw, and Ralph
Weischedel. 2006. Ontonotes: The 90% solution. In Proceedings of the
Human Language Technology Conference of the North American Chapter
of the ACL.
Cyrus Rashtchian, Peter Young, Micah Hodosh, and Julia Hockenmaier.
Collecting Image Annotations Using Amazon's Mechanical Turk.
In Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and
Language Data with Amazon's Mechanical Turk.
Weiwei Guo, Hao Li, Heng Ji and Mona Diab. 2013.
Linking Tweets to News: A Framework to Enrich Online Short Text Data
in Social Media. In Proceedings of the 51th Annual Meeting of the
Association for Computational Linguistics