Training data for the DiMSUM shared task at SEMEVAL 2016

DiMSUM 2016 shared task data

December 28, 2015

Anders Johannsen
Nathan Schneider
Dirk Hovy
Marine Carpuat

This release contains data and scripts for the DiMSUM shared task at SemEval 2016.

Data files

dimsum16.train, task training data

The training data combines and harmonizes three datasets: the STREUSLE 2.1 corpus of web reviews and the Ritter and Lowlands Twitter datasets. The Ritter and Lowlands datasets have been reannotated for MWEs and supersenses to improve their quality and to follow the conventions of the STREUSLE annotations more closely. Our harmonization also consisted of: updating the POS tags to use the 17 Universal POS categories; naming supersenses in the form n.person; removing STREUSLE class labels that are not proper supersenses (such as `a = auxiliary, `p = preposition, ?? = unintelligible); removing weak MWE links in the STREUSLE data; separating the MWE position and supersense into different fields; and listing the supersense only for the first token of any expression.

In this final release of the training data, a couple of differences between the component datasets remain:

  • The Lowlands Twitter dataset replaces usernames, URLs, and numbers with special symbols, whereas the original text is always preserved in the other datasets.
  • The Universal POS tags in the Twitter datasets do not use the new subordinating conjunction category SCONJ. Subordinating conjunctions are instead labeled as adpositions (ADP) or conjunctions (CONJ).

dimsum16.test.blind, task test input

This is in the same format as the training data, except without MWE and supersense annotations, which are to be predicted by the system:

  • there is no supersense label (column 8 is blank)
  • MWE tags (column 5) are all O, and MWE parent offsets (column 6) are all 0, indicating that no MWEs are marked
  • sentence IDs (column 9) are opaque, so as to obscure each sentence's source dataset and its order relative to other sentences; the sentences in this file are listed in random order
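
For local development it can help to produce a file in the same blind layout from annotated rows. The following is only a sketch of the three points above: the function name `blind_row` is hypothetical and not part of the released scripts, and it leaves the sentence ID untouched, so it does not reproduce the ID anonymization or the shuffling applied to the real test file.

```python
def blind_row(row: str) -> str:
    """Strip the annotations from one annotated token row (nine tab-separated columns).

    Hypothetical helper, not part of the DiMSUM release.
    """
    cols = row.rstrip("\r\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 columns, got {len(cols)}")
    cols[4] = "O"   # column 5: MWE tag reset to O
    cols[5] = "0"   # column 6: no parent token
    cols[7] = ""    # column 8: supersense left blank
    return "\t".join(cols)


def blind_lines(lines):
    """Apply blind_row to token rows; blank sentence separators pass through."""
    for line in lines:
        yield blind_row(line) if line.strip() else ""
```

The example row used to try this out should be treated as made-up data, not as a line from the corpus.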


The test set consists of 16,500 words in 1,000 English sentences. The sentences are drawn from the following sources:

More precise information on the composition and preparation of the test corpus will be announced after the end of the task evaluation period.

File format

The DiMSUM files have tab-separated columns in the spirit of CoNLL, with blank lines to separate sentences.

Nine tab-separated columns:

  1. token offset
  2. word
  3. lowercase lemma
  4. POS
  5. MWE tag
  6. offset of parent token (i.e. previous token in the same MWE), if applicable
  7. strength level encoded in the tag, if applicable (currently not used)
  8. supersense label, if applicable
  9. sentence ID
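
A minimal reader for this layout might look like the sketch below. The names `Token` and `read_sentences` are illustrative and not taken from the released scripts; the sketch assumes every token row has exactly nine tab-separated fields and treats an empty parent field as 0.

```python
from typing import Iterable, Iterator, List, NamedTuple


class Token(NamedTuple):
    offset: int       # column 1: token offset
    word: str         # column 2: word
    lemma: str        # column 3: lowercase lemma
    pos: str          # column 4: POS
    mwe_tag: str      # column 5: MWE tag
    parent: int       # column 6: offset of parent token, 0 if none
    strength: str     # column 7: currently unused
    supersense: str   # column 8: empty if not applicable
    sent_id: str      # column 9: sentence ID


def read_sentences(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield one list of Token per sentence; blank lines separate sentences."""
    sentence: List[Token] = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip():
            if sentence:
                yield sentence
                sentence = []
            continue
        cols = line.split("\t")
        if len(cols) != 9:
            raise ValueError(f"expected 9 columns, got {len(cols)}: {line!r}")
        sentence.append(Token(int(cols[0]), cols[1], cols[2], cols[3], cols[4],
                              int(cols[5] or 0), cols[6], cols[7], cols[8]))
    if sentence:  # the file may not end with a blank line
        yield sentence
```

Typical use would be `with open("dimsum16.train", encoding="utf-8") as f: sentences = list(read_sentences(f))`; the UTF-8 encoding is an assumption.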

Fields 5, 6, and 8 need to be predicted at test time; the rest will be present in the input. Field 6 can be deterministically filled in given the tagging in field 5. Field 7 should be left blank. The file describes the MWE and supersense tagsets.
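
To illustrate how field 6 follows from field 5, here is a sketch that derives parent offsets from a sentence's MWE tags. It assumes a six-tag scheme (O, B, I for top-level tokens and lowercase o, b, i for tokens inside the gap of another MWE) and 1-based token offsets with 0 meaning "no parent"; check the tagset description shipped with the data before relying on it. The function name `parent_offsets` is hypothetical.

```python
from typing import List


def parent_offsets(tags: List[str]) -> List[int]:
    """Derive column 6 (parent offsets) from column 5 (MWE tags) for one sentence."""
    parents: List[int] = []
    last_outer = 0  # offset of the latest token of the open top-level MWE
    last_inner = 0  # offset of the latest token of the open MWE inside a gap
    for offset, tag in enumerate(tags, start=1):
        if tag == "B":                 # starts a top-level MWE
            parents.append(0)
            last_outer, last_inner = offset, 0
        elif tag == "I":               # continues the top-level MWE
            if not last_outer:
                raise ValueError(f"I without a preceding B at offset {offset}")
            parents.append(last_outer)
            last_outer, last_inner = offset, 0
        elif tag == "b":               # starts an MWE inside a gap
            parents.append(0)
            last_inner = offset
        elif tag == "i":               # continues the MWE inside the gap
            if not last_inner:
                raise ValueError(f"i without a preceding b at offset {offset}")
            parents.append(last_inner)
            last_inner = offset
        elif tag == "O":               # outside any MWE
            parents.append(0)
            last_outer, last_inner = 0, 0
        elif tag == "o":               # inside a gap, but not part of an MWE
            parents.append(0)
            last_inner = 0
        else:
            raise ValueError(f"unknown MWE tag {tag!r} at offset {offset}")
    return parents
```

For example, under these assumptions the tag sequence B o I O yields the parents 0 0 1 0: the third token attaches to the first across the gap.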

All sentences in the training data are marked with an identifier whose prefix indicates the source dataset (field 9). In the test data, this field will contain a unique ID for the sentence, but the ID will be uninformative: it will not reveal the domain or document position of the sentence.