SaharVahdati edited this page Jan 17, 2016 · 7 revisions

Task 1

… of the Semantic Publishing Challenge 2015.

Motivation

Participants are required to extract a graph of workshops, their chairs and conference affiliations, as well as their papers and authors, from a set of HTML-encoded tables of contents of workshop proceedings volumes. From this graph we would ultimately like to compute indicators for the quality of workshops, such as:

  • If a workshop series has had a long history, this hints at high quality.
  • If a workshop has grown over years, it may be of high quality.
  • If a conference attracts many high-quality workshops, it is a high-quality conference.
  • High-quality workshops outside big conferences might be of interest to the organisers of these conferences.
  • If a workshop in a series has a good balance between old and new authors, it may facilitate exchange of experience.
  • If a workshop has a high ratio of invited papers, it may be of low quality (unless there are high-profile invited speakers).
  • A high ratio of submissions (co-)authored by a workshop's chairs may indicate a low diversity in schools of thought represented at the workshop, and thus possibly low quality.
  • A fast publication turnaround (proceedings published quickly after, or even before the workshop) gives an impression of professional organisation, possibly of quality.
  • If person P1 is an invited speaker in a workshop chaired by person P2 and vice versa, these persons might not be good speakers (who would thus contribute to a high-quality workshop), but rather just good friends.

The queries participants are required to answer approximate some of these indicators and are shown below.

Data Source

Some information about workshops, such as their papers and authors, is readily available in existing LOD sets (e.g. DBLP), but this information is not sufficient for assessing the quality of a workshop. This is why we are working with the data of the CEUR-WS.org workshop proceedings. Basic bibliographic information of many of the more recent workshops published there is available on DBLP, but the following information is not:

  • what workshop series a workshop is part of
  • affiliations of editors
  • exact date of workshop and of proceedings publication
  • distinction between invited and contributed papers

The input datasets consist of:

  • one HTML index page linking to all CEUR-WS.org workshop proceedings volumes (invalid, somewhat messy but still uniformly structured HTML 4)
  • the volumes' HTML tables of contents, which link to the individual workshop papers. Their format is largely uniform but has changed towards more structure and more semantics over time. Most of these innovations have not been “ported back” to old workshop proceedings volumes. For example:
    • Microformat annotations were introduced around 2010 (starting with volume 559, http://ceur-ws.org/Vol-559/; feature set extended later).
    • Optional RDFa (in addition to microformats) was introduced in 2013 (starting with volume 994, but it only covers around 1 in 10 volumes since 2013).
    • Valid HTML5 has been mandatory since 2013 (starting roughly with volume 1059).
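
Because the markup differs between volumes, an extraction tool may want to check which features a given table of contents offers before choosing a parsing strategy. The following is a minimal sketch using only Python's standard library. It rests on assumptions not stated on this page: that the microformat annotations are CSS class names beginning with `CEUR`, and that the presence of attributes such as `about`, `typeof`, `property` or `resource` is a good enough signal for RDFa. Both should be verified against the actual volumes.

```python
from html.parser import HTMLParser

# Assumption: microformat annotations are class names starting with "CEUR".
MICROFORMAT_PREFIX = "CEUR"
# Assumption: any of these attributes indicates RDFa markup.
RDFA_ATTRS = {"about", "typeof", "property", "resource"}


class MarkupSniffer(HTMLParser):
    """Records which markup features occur in one HTML page."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.html5 = False
        self.microformats = False
        self.rdfa = False

    def handle_decl(self, decl):
        # "<!DOCTYPE html>" is passed in as "DOCTYPE html".
        if decl.strip().lower() == "doctype html":
            self.html5 = True

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                classes = value.split()
                if any(c.startswith(MICROFORMAT_PREFIX) for c in classes):
                    self.microformats = True
            elif name in RDFA_ATTRS:
                self.rdfa = True


def sniff(page: str) -> dict:
    """Return flags for HTML5 doctype, microformats and RDFa in `page`."""
    parser = MarkupSniffer()
    parser.feed(page)
    parser.close()
    return {
        "html5": parser.html5,
        "microformats": parser.microformats,
        "rdfa": parser.rdfa,
    }


# Made-up fragments for illustration, not copied from any volume.
new_style = '<!DOCTYPE html><html><body><span class="CEURTITLE">A paper</span></body></html>'
old_style = '<html><body><li><a href="paper1.pdf">A paper</a></li></body></html>'
print(sniff(new_style))
print(sniff(old_style))
```

`html.parser` tolerates invalid markup, which matters here because many of the older volumes are not valid HTML. The doctype check only says that a page declares itself as HTML5, not that it validates.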

For technical reasons, the HTML is restricted to the US-ASCII character set. All other characters are encoded as HTML entities, e.g. &eacute; for é. The latter, i.e. the native Unicode form, should be used in the extracted RDF.
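
As an illustration of this decoding step, the sketch below uses `html.unescape` from Python's standard library, which resolves both named and numeric character references. The sample string is made up for the example and does not come from the dataset.

```python
import html


def to_unicode(text: str) -> str:
    """Replace HTML character references with the characters they stand for."""
    return html.unescape(text)


# Hypothetical author list as it might appear in ASCII-only HTML.
sample = "Jos&eacute; Garc&iacute;a and Fran&#231;ois M&#xFC;ller"
decoded = to_unicode(sample)
print(decoded)
```

Parsers that decode entities themselves (for example `html.parser.HTMLParser` with `convert_charrefs=True`) make a separate call unnecessary; decoding twice should be avoided, since a literal `&amp;eacute;` would otherwise turn into é.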

To support the evolution of extraction tools, the 2016 training dataset is largely the same as the union of the 2015 training and evaluation datasets, with a few additions. PDF full texts are no longer part of the dataset for Task 1, because Task 2 now deals exclusively with information extraction from PDFs.

Datasets can be downloaded here:

Training Dataset TD1

CEUR-WS.org index page and several proceedings volumes. Individual description given here; list of URLs for convenient one-time download below.

  • index page (valid HTML 4.01)
  • Vol-1317 (valid HTML5, microformats)
  • Vol-1186 (valid HTML5, microformats; multi-workshop volume)
  • Vol-1128 (valid HTML5, microformats)
  • Vol-1123 (valid HTML5, microformats)
  • Vol-1116 (valid HTML5, microformats)
  • Vol-1111 (valid HTML5, microformats)
  • Vol-1085 (valid HTML5, RDFa (latest version with proper fragment IDs for papers) + microformats)
  • Vol-1081 (valid HTML5, microformats)
  • Vol-1044 (mostly valid HTML5, RDFa+microformats)
  • Vol-1019 (mostly valid HTML5, microformats)
  • Vol-1014 (mostly valid HTML5, microformats)
  • Vol-1010 (valid HTML5, RDFa + microformats; multi-workshop volume)
  • Vol-1008 (valid HTML5, microformats)
  • Vol-996 (invalid HTML 4.01, microformats)
  • Vol-994 (invalid HTML 4.01, RDFa)
  • Vol-981 (invalid HTML 4.01, microformats)
  • Vol-979 (mostly valid HTML 4.01, microformats)
  • Vol-958 (mostly valid HTML 4.01, microformats; “at ECAI” refers to the conference, but we will not penalise it if you treat it as a part of the workshop title.)
  • Vol-951 (mostly valid HTML 4.01, microformats)
  • Vol-946 (invalid HTML 4.01, microformats)
  • Vol-943 (mostly valid HTML 4.01, microformats)
  • Vol-937 (invalid HTML 4.01, microformats)
  • Vol-936 (invalid HTML 4.01, microformats)
  • Vol-930 (mostly valid HTML 4.01, microformats)
  • Vol-929 (mostly valid HTML 4.01, microformats)
  • Vol-921 (invalid HTML 4.01, microformats; multi-workshop volume)
  • Vol-919 (invalid HTML 4.01, microformats)
  • Vol-914 (mostly valid HTML 4.01, microformats)
  • Vol-906 (invalid HTML 4.01, microformats)
  • Vol-905 (invalid HTML 4.01, microformats)
  • Vol-904 (invalid HTML 4.01, microformats)
  • Vol-903 (mostly valid HTML 4.01, microformats)
  • Vol-902 (invalid HTML 4.01, microformats)
  • Vol-901 (invalid HTML 4.01, microformats)
  • Vol-900 (mostly valid HTML 4.01, microformats)
  • Vol-895 (mostly valid HTML 4.01, microformats)
  • Vol-890 (invalid HTML 4.01, microformats)
  • Vol-875 (mostly valid HTML 4.01, microformats)
  • Vol-869 (invalid HTML 4.01, microformats)
  • Vol-862 (invalid HTML 4.01, microformats; multi-workshop volume)
  • Vol-859 (invalid HTML 4.01, microformats)
  • Vol-856 (invalid HTML 4.01, microformats)
  • Vol-846 (invalid HTML 4.01, microformats)
  • Vol-843 (mostly valid HTML 4.01, microformats)
  • Vol-840 (invalid HTML 4.01, microformats)
  • Vol-839 (invalid HTML 4.01, microformats)
  • Vol-838 (invalid HTML 4.01, microformats)
  • Vol-830 (mostly valid HTML 4.01, microformats)
  • Vol-814 (invalid HTML 4.01, microformats)
  • Vol-813 (invalid HTML 4.01, microformats)
  • Vol-809 (mostly valid HTML 4.01, microformats)
  • Vol-800 (invalid HTML 4.01, microformats)
  • Vol-798 (invalid HTML 4.01, microformats)
  • Vol-784 (mostly valid HTML 4.01, microformats)
  • Vol-783 (mostly valid HTML 4.01, microformats)
  • Vol-782 (invalid HTML 4.01, microformats)
  • Vol-781 (invalid HTML 4.01, microformats)
  • Vol-779 (invalid HTML 4.01, microformats)
  • Vol-778 (invalid HTML 4.01, microformats)
  • Vol-775 (invalid HTML 4.01, microformats)
  • Vol-748 (invalid HTML 4.01, microformats)
  • Vol-745 (invalid HTML 4.01, microformats)
  • Vol-737 (mostly valid HTML 4.01, microformats)
  • Vol-736 (mostly valid HTML 4.01, microformats)
  • Vol-721 (invalid HTML 4.01, microformats)
  • Vol-718 (invalid HTML 4.01, microformats)
  • Vol-717 (mostly valid HTML 4.01, microformats)
  • Vol-689 (invalid HTML 4.01, microformats)
  • Vol-671 (invalid HTML 4.01, microformats)
  • Vol-669 (mostly valid HTML 4.01, microformats)
  • Vol-658 (invalid HTML 4.01, microformats)
  • Vol-628 (invalid HTML 4.01, microformats)
  • Vol-573 (invalid HTML 4.01, microformats)
  • Vol-571 (invalid HTML 4.01, microformats)
  • Vol-551 (invalid HTML 4.01, no semantic markup)
  • Vol-538 (invalid HTML 4.01, no semantic markup)
  • Vol-540 (valid XHTML+RDFa; RDFa different from later volumes)
  • Vol-477 (invalid HTML 4.01, no semantic markup)
  • Vol-431 (invalid HTML 4.01, no semantic markup)
  • Vol-369 (invalid HTML 4.01, no semantic markup)
  • Vol-353 (invalid HTML 4.01, no semantic markup)
  • Vol-315 (invalid HTML 4.01, no semantic markup)
  • Vol-304 (invalid HTML 4.01, no semantic markup)
  • Vol-250 (invalid HTML 4.01, no semantic markup; poster abstracts tagged in a non-standard way, which is not to be confused with author names)
  • Vol-225 (mostly valid HTML 4.01, no semantic markup)
  • Vol-232 (invalid HTML 4.01, no semantic markup)
  • Vol-189 (invalid HTML 4.01, no semantic markup; using a markup variant for the editors' affiliations)
  • Vol-147 (invalid HTML 4.01, no semantic markup)
  • Vol-104 (invalid HTML 4.01, no semantic markup; “proceedings” doesn't count as a regular paper)
  • Vol-81 (invalid HTML 4.01, no semantic markup; “proceedings” doesn't count as a regular paper)
  • Vol-53 (invalid HTML 4.01, no semantic markup; formulas in paper titles; alt text of the images should be used)
  • Vol-49 (invalid HTML 4.01, no semantic markup)
  • Vol-44 (invalid HTML 4.01, no semantic markup)
  • Vol-33 (invalid HTML 4.01, no semantic markup)
  • Vol-22 (invalid HTML 4.01, no semantic markup)
  • Vol-11 (invalid HTML 4.01, no semantic markup)
  • Vol-5 (invalid HTML 4.01, no semantic markup; different markup of the editor: “Herausgegeben von” means “Edited by”.)
  • Vol-1 (invalid HTML 4.01, no semantic markup)

List of URLs for one-time download:

http://ceur-ws.org/
http://ceur-ws.org/Vol-1317/
http://ceur-ws.org/Vol-1186/
http://ceur-ws.org/Vol-1128/
http://ceur-ws.org/Vol-1123/
http://ceur-ws.org/Vol-1116/
http://ceur-ws.org/Vol-1111/
http://ceur-ws.org/Vol-1085/
http://ceur-ws.org/Vol-1081/
http://ceur-ws.org/Vol-1044/
http://ceur-ws.org/Vol-1019/
http://ceur-ws.org/Vol-1014/
http://ceur-ws.org/Vol-1010/
http://ceur-ws.org/Vol-1008/
http://ceur-ws.org/Vol-996/
http://ceur-ws.org/Vol-994/
http://ceur-ws.org/Vol-981/
http://ceur-ws.org/Vol-979/
http://ceur-ws.org/Vol-958/
http://ceur-ws.org/Vol-951/
http://ceur-ws.org/Vol-946/
http://ceur-ws.org/Vol-943/
http://ceur-ws.org/Vol-937/
http://ceur-ws.org/Vol-936/
http://ceur-ws.org/Vol-930/
http://ceur-ws.org/Vol-929/
http://ceur-ws.org/Vol-921/
http://ceur-ws.org/Vol-919/
http://ceur-ws.org/Vol-914/
http://ceur-ws.org/Vol-906/
http://ceur-ws.org/Vol-905/
http://ceur-ws.org/Vol-904/
http://ceur-ws.org/Vol-903/
http://ceur-ws.org/Vol-902/
http://ceur-ws.org/Vol-901/
http://ceur-ws.org/Vol-900/
http://ceur-ws.org/Vol-895/
http://ceur-ws.org/Vol-890/
http://ceur-ws.org/Vol-875/
http://ceur-ws.org/Vol-869/
http://ceur-ws.org/Vol-862/
http://ceur-ws.org/Vol-859/
http://ceur-ws.org/Vol-856/
http://ceur-ws.org/Vol-846/
http://ceur-ws.org/Vol-843/
http://ceur-ws.org/Vol-840/
http://ceur-ws.org/Vol-839/
http://ceur-ws.org/Vol-838/
http://ceur-ws.org/Vol-830/
http://ceur-ws.org/Vol-814/
http://ceur-ws.org/Vol-813/
http://ceur-ws.org/Vol-809/
http://ceur-ws.org/Vol-800/
http://ceur-ws.org/Vol-798/
http://ceur-ws.org/Vol-784/
http://ceur-ws.org/Vol-783/
http://ceur-ws.org/Vol-782/
http://ceur-ws.org/Vol-781/
http://ceur-ws.org/Vol-779/
http://ceur-ws.org/Vol-778/
http://ceur-ws.org/Vol-775/
http://ceur-ws.org/Vol-748/
http://ceur-ws.org/Vol-745/
http://ceur-ws.org/Vol-737/
http://ceur-ws.org/Vol-736/
http://ceur-ws.org/Vol-721/
http://ceur-ws.org/Vol-718/
http://ceur-ws.org/Vol-717/
http://ceur-ws.org/Vol-689/
http://ceur-ws.org/Vol-671/
http://ceur-ws.org/Vol-669/
http://ceur-ws.org/Vol-658/
http://ceur-ws.org/Vol-628/
http://ceur-ws.org/Vol-573/
http://ceur-ws.org/Vol-571/
http://ceur-ws.org/Vol-551/
http://ceur-ws.org/Vol-538/
http://ceur-ws.org/Vol-540/
http://ceur-ws.org/Vol-477/
http://ceur-ws.org/Vol-431/
http://ceur-ws.org/Vol-369/
http://ceur-ws.org/Vol-353/
http://ceur-ws.org/Vol-315/
http://ceur-ws.org/Vol-304/
http://ceur-ws.org/Vol-250/
http://ceur-ws.org/Vol-225/
http://ceur-ws.org/Vol-232/
http://ceur-ws.org/Vol-189/
http://ceur-ws.org/Vol-147/
http://ceur-ws.org/Vol-104/
http://ceur-ws.org/Vol-81/
http://ceur-ws.org/Vol-53/
http://ceur-ws.org/Vol-49/
http://ceur-ws.org/Vol-44/
http://ceur-ws.org/Vol-33/
http://ceur-ws.org/Vol-22/
http://ceur-ws.org/Vol-11/
http://ceur-ws.org/Vol-5/
http://ceur-ws.org/Vol-1/
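
One way to perform the one-time download is sketched below in Python, using only the standard library. The local directory layout (`index.html` per volume under a root folder) and the helper names are choices made for this example, not part of the challenge rules. Only three volume numbers are spelled out; the full list above would be substituted in practice.

```python
from pathlib import Path
from urllib.request import urlopen

BASE = "http://ceur-ws.org/"


def volume_url(number: int) -> str:
    """Build the table-of-contents URL of a volume, e.g. Vol-1317."""
    return f"{BASE}Vol-{number}/"


def local_path(url: str, root: Path) -> Path:
    """Map the index page to root/index.html and Vol-N/ to root/Vol-N/index.html."""
    suffix = url[len(BASE):].strip("/")
    return root / suffix / "index.html" if suffix else root / "index.html"


def download(urls, root: Path) -> None:
    """Fetch each URL once, skipping files already on disk (needs network access)."""
    for url in urls:
        target = local_path(url, root)
        if target.exists():
            continue
        target.parent.mkdir(parents=True, exist_ok=True)
        with urlopen(url) as response:
            # Store the raw bytes so the original encoding is preserved.
            target.write_bytes(response.read())


# A few volume numbers from the list above; extend with the remaining ones.
urls = [BASE] + [volume_url(n) for n in (1317, 1186, 1)]

if __name__ == "__main__":
    for url in urls:
        print(url, "->", local_path(url, Path("td1")))
    # download(urls, Path("td1"))  # uncomment to actually fetch the pages
```

Keeping a local copy fixes the input at the time of download, which helps when the pages on the server are edited later.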

Evaluation dataset ED1

ED1 is the same as the training dataset TD1, plus additional volumes to be considered.

List of URLs for one-time download (only those in addition to TD1; please also download the TD1 list above):

http://ceur-ws.org/
http://ceur-ws.org/Vol-1353/
http://ceur-ws.org/Vol-1346/
http://ceur-ws.org/Vol-1343/
http://ceur-ws.org/Vol-1342/
http://ceur-ws.org/Vol-1139/
http://ceur-ws.org/Vol-1333/
http://ceur-ws.org/Vol-1331/
http://ceur-ws.org/Vol-1321/
http://ceur-ws.org/Vol-1317/
http://ceur-ws.org/Vol-1309/
http://ceur-ws.org/Vol-1302/
http://ceur-ws.org/Vol-1290/
http://ceur-ws.org/Vol-1286/
http://ceur-ws.org/Vol-1285/
http://ceur-ws.org/Vol-1281/
http://ceur-ws.org/Vol-1277/
http://ceur-ws.org/Vol-1270/
http://ceur-ws.org/Vol-1258/
http://ceur-ws.org/Vol-1255/
http://ceur-ws.org/Vol-1254/
http://ceur-ws.org/Vol-1250/
http://ceur-ws.org/Vol-1249/
http://ceur-ws.org/Vol-1242/
http://ceur-ws.org/Vol-1239/
http://ceur-ws.org/Vol-1237/
http://ceur-ws.org/Vol-1236/
http://ceur-ws.org/Vol-1235/
http://ceur-ws.org/Vol-1193/
http://ceur-ws.org/Vol-1191/
http://ceur-ws.org/Vol-1188/
http://ceur-ws.org/Vol-1186/
http://ceur-ws.org/Vol-1184/
http://ceur-ws.org/Vol-1155/
http://ceur-ws.org/Vol-1151/
http://ceur-ws.org/Vol-1141/
http://ceur-ws.org/Vol-1089/
http://ceur-ws.org/Vol-1073/
http://ceur-ws.org/Vol-1034/
http://ceur-ws.org/Vol-1010/
http://ceur-ws.org/Vol-971/
http://ceur-ws.org/Vol-891/
http://ceur-ws.org/Vol-837/
http://ceur-ws.org/Vol-754/
http://ceur-ws.org/Vol-654/
http://ceur-ws.org/Vol-516/
http://ceur-ws.org/Vol-424/
http://ceur-ws.org/Vol-389/
http://ceur-ws.org/Vol-188/
http://ceur-ws.org/Vol-173/
http://ceur-ws.org/Vol-129/

Notes

SSN (2013 edition: http://ceur-ws.org/Vol-1063/) also took place at ISWC 2014 (evidence: conference program) but has not been published so far. If it gets published between the publication of the evaluation dataset and the submission of your data, we will take this into account.

As far as multi-workshop volumes are concerned, the “number of papers” in Query Q1.16 is determined by individual workshop, not by the overall volume.
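
To make the per-workshop counting concrete, here is a small Python sketch that counts papers by workshop rather than by volume and then computes the relative growth between two years. The volume names, workshop acronyms and paper counts are invented for the example, and reading "percentage of growth" as (papers in Y+1 − papers in Y) / papers in Y is an interpretation, not an official definition.

```python
from collections import Counter

# Hypothetical extraction result: one (volume, workshop acronym) pair per paper.
# Both workshops share a joint volume in each year.
papers_y = [("Vol-A", "WS1")] * 4 + [("Vol-A", "WS2")] * 5
papers_y1 = [("Vol-B", "WS1")] * 6 + [("Vol-B", "WS2")] * 6


def papers_per_workshop(papers):
    """Count papers by workshop, ignoring which volume they appeared in."""
    return Counter(workshop for _volume, workshop in papers)


def growth_percent(before: int, after: int) -> float:
    """Relative growth in percent; `before` must be greater than zero."""
    return (after - before) / before * 100.0


def fastest_growing(papers_year, papers_next_year):
    """Return the workshop(s) with the largest growth and that growth value."""
    old = papers_per_workshop(papers_year)
    new = papers_per_workshop(papers_next_year)
    # Only workshops with editions in both years are compared.
    growth = {w: growth_percent(old[w], new[w]) for w in old.keys() & new.keys()}
    if not growth:
        return [], None
    best = max(growth.values())
    return sorted(w for w, g in growth.items() if g == best), best


winners, best = fastest_growing(papers_y, papers_y1)
print(winners, best)
```

Counting by volume instead would report 9 and 12 papers for both workshops and hide the fact that WS1 grew faster than WS2.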

Queries

Participants are required to produce a dataset for answering the following queries, roughly ordered by difficulty:

  • Q1.1: List the full names of all editors of the proceedings of workshop W
  • Q1.2: Count the number of papers in workshop W
  • Q1.3: List the full names of all authors who have (co-)authored a paper in workshop W
  • Q1.4: Compute the average length (in numbers of pages) of a paper in workshop W
  • Q1.5: Find out whether the proceedings of workshop W were published on CEUR-WS.org before the workshop took place
  • Q1.6: Identify all editions that the workshop series titled T has published with CEUR-WS.org
  • Q1.7: Identify the full names of those chairs of the workshop series titled T that have so far been a chair in every edition of the workshop that was published with CEUR-WS.org
  • Q1.8: Identify all CEUR-WS.org proceedings volumes in which papers of workshops of conference C in year Y were published
  • Q1.9: Identify those papers of workshop W that were (co-)authored by at least one chair of the workshop
  • Q1.10: List the full names of all authors of invited papers in workshop W
  • Q1.11: Determine the number of editions that the workshop series titled T has had, regardless of whether published with CEUR-WS.org
  • Q1.12: Determine the title (without year) that workshop W had in its first edition
  • Q1.13: Of the workshops of conference C in year Y, identify those that did not publish with CEUR-WS.org in the following year (and that therefore probably no longer took place)
  • Q1.14: Identify the papers of the workshop titled T (which was published in a joint volume V with other workshops)
  • Q1.15: List the full names of all editors of the proceedings of the workshop titled T (which was published in a joint volume V with other workshops)
  • Q1.16: Of the workshops that had editions at conference C both in year Y and Y+1, identify the workshop(s) with the biggest percentage of growth
  • Q1.17: Return the acronyms of those workshops of conference C in year Y whose previous edition was co-located with a different conference series.
  • Q1.18: Of the workshop series titled T, identify those editions that took place more than two months later/earlier than the previous edition that was published with CEUR-WS.org
  • Q1.19: Identify the affiliations and countries of all editors of the proceedings of workshop W. Use DBpedia resources for the countries.
  • Q1.20: Identify the full names of those authors of papers in the workshop series titled T that have so far been a (co-)author of a paper in every edition of the workshop that was published with CEUR-WS.org
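
Two of the simpler queries can be illustrated with plain Python before any SPARQL is written. The sketch below assumes that page information has been extracted as inclusive ranges such as "5-12" (with an ASCII hyphen) and that publication and workshop dates are available as calendar dates; the function names and sample values are hypothetical.

```python
from datetime import date
from statistics import mean


def page_count(pages: str) -> int:
    """Number of pages in an inclusive range like "5-12"; a single page counts as 1."""
    first, _, last = pages.partition("-")
    return int(last or first) - int(first) + 1


def average_length(page_ranges):
    """Q1.4: average paper length in pages over all papers of one workshop."""
    return mean(page_count(p) for p in page_ranges)


def published_before_workshop(published: date, workshop_start: date) -> bool:
    """Q1.5: were the proceedings online before the workshop started?"""
    return published < workshop_start


# Made-up values for illustration.
print(average_length(["1-8", "9-20", "21-24"]))
print(published_before_workshop(date(2014, 5, 20), date(2014, 5, 25)))
```

Papers without page information would need to be excluded or handled separately before averaging; the sketch does not decide how.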

These queries have to be translated into SPARQL according to the challenge's general rules and have to produce an output according to the detailed rules.