Bahar Sateli edited this page Oct 24, 2017 · 39 revisions

Semantic Publishing Challenge 2016


The challenge is over and we are happy to announce the winners:

Task 2: Extraction of contextual information from PDF papers

Best-performing tool

  • Information Extraction from PDF Sources based on Rule-based System using Integrated Formats by Riaz Ahmad, Muhammad Tanvir Afzal and Muhammad Abdul Qadir

Most innovative approach

  • _Reconstructing the Logical Structure of a Scientific Publication using Machine Learning_ by Stefan Klampfl and Roman Kern

Task 2 best-performing tool: final results

| Title | Authors | Precision | Recall | F-score |
| --- | --- | --- | --- | --- |
| Information Extraction from PDF Sources based on Rule-based System using Integrated Formats | Riaz Ahmad, Muhammad Tanvir Afzal and Muhammad Abdul Qadir | 0.775 | 0.778 | 0.771 |
| An Automatic Workflow for Formalization of Scholarly Articles' Structural and Semantic Elements | Bahar Sateli and René Witte | 0.64 | 0.629 | 0.632 |
| Reconstructing the Logical Structure of a Scientific Publication using Machine Learning | Stefan Klampfl and Roman Kern | 0.593 | 0.606 | 0.592 |
| ACM: Article Content Miner | Andrea Nuzzolese, Silvio Peroni and Diego Reforgiato Recupero | 0.412 | 0.43 | 0.416 |
| Automatically Identify and Label Sections in Scientific Journals using Conditional Random Fields | Sree Harsha Ramesh, Arnab Dhar, Raveena R. Kumar, Anjaly V, Sarath K. S, Jason Pearce and Krishna R. Sundaresan | 0.393 | 0.428 | 0.389 |

See the SemPubEvaluator repository on GitHub to access the golden standard and to run the final evaluation on your data.


EVALUATION GOLDEN STANDARD AVAILABLE: See the SemPubEvaluator repository on GitHub.

EVALUATION DATASET AVAILABLE! Instructions for the final submission at: Final Submission Rules

BEST-PERFORMING EVALUATION TOOL available on GitHub: SemPubEvaluator

Submission deadline extended to Thursday March 24th, 2016
The submission Web site for SemPub16 is

The selection of the best challenge papers will be published in the Satellite Event proceedings (a separate Springer LNCS Volume) of ESWC2016.

All accepted challenge papers will be published in a separate volume of Communications in Computer and Information Science, published by Springer.

The Challenge is open! Please subscribe to the mailing list to be kept up to date.

Motivation and objectives

This is the next iteration of the successful Semantic Publishing Challenge of ESWC 2014 and 2015. We continue pursuing the objective of assessing the quality of scientific output, evolving the dataset bootstrapped in 2014 and 2015 to take into account the wider ecosystem of publications.

To achieve that, this year’s challenge focuses on refining and enriching an existing linked open dataset about workshops, their publications and their authors. Aspects of “refining and enriching” include extracting deeper information from the HTML and PDF sources of the workshop proceedings volumes and enriching this information with knowledge from existing datasets.

Thus, a combination of broadly investigated technologies in the Semantic Web field, such as Information Extraction (IE), Natural Language Processing (NLP), Named Entity Recognition (NER), link discovery, etc., is required to deal with the challenge’s tasks.

Target Audience

The Challenge is open to everyone from industry and academia.


We ask challengers to automatically annotate a set of multi-format input documents and to produce a Linked Open Dataset (LOD) that fully describes these documents, their context, and relevant parts of their content. The evaluation will consist of running a set of queries against the produced dataset to assess its correctness and completeness.

The primary input dataset is the LOD that has been extracted from the workshop proceedings using the winning extraction tools of the 2014 and 2015 challenges, plus its full original HTML and PDF source documents. In addition, the challenge uses (as linking targets) existing LOD on scholarly publications.

The input dataset will be split into two parts: a training dataset and an evaluation dataset, which will be disclosed a few days before the submission deadline. Participants will be asked to run their tool on the evaluation dataset and to produce the final Linked Dataset and the output of the queries on that dataset.

Further details about the organization are provided in the general rules page.

The Challenge will include three tasks:

Task 1: Extraction and assessment of workshop proceedings information

Participants are required to extract information from a set of HTML tables of contents and PDF papers published in workshop proceedings. The extracted information is expected to answer queries about the quality of these workshops, for instance by measuring growth, longevity, etc. The task is an extension of Task 1 of the 2014 and 2015 Challenges: we will reuse the most challenging quality indicators from last year’s challenge, some will be defined more precisely, and others will be completely new.

Task 2: Extracting information from the PDF full text of the papers

Participants are required to extract information from the textual content of the papers (in PDF). That information should describe the organization of the paper and should provide a deeper understanding of the context in which it was written. In particular, the extracted information is expected to answer queries about the internal organization of sections, tables and figures, and about the authors’ affiliations, research institutions and funding sources. The task mainly requires PDF mining techniques and some NLP.

Task 3: Interlinking

Participants are required to interlink the linked dataset with relevant datasets already existing in the Linked Open Data cloud. In particular, they are expected to interlink persons, papers, events, organizations and publications. All these entities should be identified, disambiguated and interlinked with their counterparts in other LOD datasets. Task 3 can be accomplished either as a named entity recognition and disambiguation task (NLP-based entity linking), or as an entity interlinking task, or as a combination of methods.
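As an illustration of the entity interlinking flavour of Task 3, here is a minimal sketch that matches person names across two datasets after normalization and emits owl:sameAs statements in N-Triples syntax. The URIs, the sample names and the helper functions `normalize` and `link` are hypothetical and are not part of the challenge material. Exact matching on normalized names is only a naive baseline; a real submission would need proper disambiguation.

```python
import unicodedata

OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def normalize(name):
    """Lower-case a name, strip accents and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(no_accents.lower().split())


def link(local, remote):
    """Return owl:sameAs N-Triples for names that match exactly after normalization."""
    index = {}
    for uri, name in remote.items():
        index.setdefault(normalize(name), []).append(uri)
    triples = []
    for uri, name in local.items():
        candidates = index.get(normalize(name), [])
        if len(candidates) == 1:  # ambiguous or missing matches are skipped
            triples.append(f"<{uri}> <{OWL_SAME_AS}> <{candidates[0]}> .")
    return triples


# Hypothetical sample data: URI -> person name
local_people = {
    "http://example.org/person/rene-witte": "René Witte",
    "http://example.org/person/roman-kern": "Roman Kern",
}
remote_people = {
    "http://example.com/people/42": "Rene  WITTE",
    "http://example.com/people/7": "Roman Kern",
    "http://example.com/people/8": "roman kern",  # second candidate makes this name ambiguous
}

links = link(local_people, remote_people)
for triple in links:
    print(triple)
```

Skipping names with more than one candidate trades recall for precision: the sketch prefers a missing link over a wrong one.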

Evaluation (in progress and final)

Participants will be requested to submit the LOD that their tool produces from the evaluation dataset, as well as a paper that describes their approach. They will also be given a set of queries in natural language form and will be asked to translate those queries into a SPARQL form that works on their LOD.

This year there are two important changes:

  • participants will be provided with a tool for evaluating their queries during the training phase. They will also be given a set of CSV files that the queries are expected to produce on the training dataset.
  • a few days before the deadline we will publish the set of queries used for the final evaluation. Participants will then be asked to run these queries on their dataset and to submit the produced output in CSV format.

The results of these queries will be compared with the expected output, and precision and recall will be measured to identify the best performing approach. We will use the same evaluation tool made available during the training phase.
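To make the comparison concrete, the following is a minimal sketch of how precision, recall and F-score could be computed for a single query by comparing a produced CSV with the expected CSV as sets of rows. It is an illustration only and does not claim to reproduce the scoring logic of the official evaluation tool; the function names and the sample data are hypothetical, and it assumes both files share the same header and column order.

```python
import csv
import io


def load_rows(csv_text):
    """Parse CSV text into a set of rows, skipping the header line."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader, None)  # drop the header row
    return {tuple(cell.strip() for cell in row) for row in reader if row}


def score(expected_csv, produced_csv):
    """Return (precision, recall, f_score) for one query's output."""
    expected = load_rows(expected_csv)
    produced = load_rows(produced_csv)
    correct = len(expected & produced)
    precision = correct / len(produced) if produced else 0.0
    recall = correct / len(expected) if expected else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Hypothetical sample: sections of one paper
expected = "paper,section\np1,Introduction\np1,Methods\np1,Results\np1,Conclusion\n"
produced = "paper,section\np1,Introduction\np1,Methods\np1,Discussion\n"

precision, recall, f_score = score(expected, produced)
print(f"precision={precision:.3f} recall={recall:.3f} f-score={f_score:.3f}")
```

In this sample two of the three produced rows are correct and two of the four expected rows are found, so precision is 2/3 and recall is 1/2. Treating rows as a set ignores row order and duplicate rows, which is an assumption of this sketch.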

Separately, the most original approach will be selected by the Program Committee.

For details, see our general rules.

Feedback and discussion

A discussion group is open for participants to ask questions and to receive updates about the challenge.

Participants are invited to subscribe to this group as soon as possible and to communicate their intention to participate. They are also invited to use this channel to discuss problems in the input dataset and to suggest changes.

Judging and prizes

After a first round of review, the Program Committee and the chairs will select a number of submissions that conform to the challenge requirements, and their authors will be invited to present their work. Submissions accepted for presentation will receive constructive reviews from the Program Committee and will be included in the Springer LNCS post-proceedings of ESWC.

We are also planning to allow winners to present their work in a special slot of the main program and to invite them to submit an extended version of their paper to an international journal.

Six winners will be selected. For each task we will select:

  • best-performing tool, given to the paper that gets the highest score in the evaluation
  • most original approach, selected by the Challenge Committee through the reviewing process

How to participate

Participants are required to submit:

  • Abstract: no more than 200 words. (OPTIONAL)
  • Description: It should explain the details of the automated annotation system, including why the system is innovative, how it uses Semantic Web technology, what features or functions the system provides, what design choices were made and what lessons were learned. The description should also summarize how participants have addressed the evaluation tasks. An outlook towards how the data could be consumed is appreciated but not strictly required. Papers must be submitted in PDF format, following the style of Springer's Lecture Notes in Computer Science (LNCS) series, and must not exceed 12 pages in length. Submissions in RASH format and dokieli are also accepted as long as the final camera-ready version conforms to Springer's requirements.
  • The Linked Dataset produced by their tool on the evaluation dataset (as a file or as a URL, in Turtle or RDF/XML).
  • A set of SPARQL queries that work on that LOD and correspond to the natural language queries provided as input
  • The output of these SPARQL queries on the evaluation dataset (in CSV format)

All submissions should be provided via the submission system. Available soon.

Mailing List

We invite potential participants to subscribe to our mailing list at!forum/sempub-challenge to stay up to date with the latest news related to the challenge.

Important dates

  • January 18, 2016: Publication of the full description of tasks, rules and queries; publication of the training dataset
  • March 24, 2016: Paper submission -- extended --
  • April 16, 2016: Notification and invitation to submit task results -- extended --
  • April 30, 2016: Conference camera-ready (see note below) -- extended --
  • May 18, 2016: Deadline for making remarks on the training dataset and the evaluation tool -- extended --
  • May 24, 2016: Publication of the evaluation dataset details -- extended --
  • May 25, 2016: Results submission -- extended --
  • May 29 – June 2, 2016: Challenge days

NOTE: Papers will be included in the Conference USB stick. After the conference, participants will be able to add data about the evaluation and to finalize the camera-ready for the final proceedings, in a volume of the Communications in Computer and Information Science series, published by Springer.

Challenge Chairs

  • Angelo Di Iorio, Department of Computer Science and Engineering, University of Bologna, IT
  • Anastasia Dimou, Ghent University, BE
  • Christoph Lange, Enterprise Information Systems, University of Bonn / Fraunhofer IAIS, DE
  • Sahar Vahdati, University of Bonn, DE

Program Committee

  • Aliaksandr Birukou, Springer Verlag, Heidelberg, Germany
  • Lukasz Bolikowski, University of Warsaw, Poland
  • Kai Eckert, University of Mannheim, Germany
  • Maxim Kolchin, ITMO University, Saint Petersburg, Russia
  • Phillip Lord, Newcastle University, UK
  • Philipp Mayr, GESIS, Germany
  • Jodi Schneider, University of Pittsburgh, USA
  • Selver Softic, Graz University of Technology, Austria
  • Ruben Verborgh, Ghent University – iMinds
  • Michael Wagner, Schloss Dagstuhl, Leibniz-Zentrum für Informatik, Germany

We are inviting further members.

ESWC Challenge Coordinators

  • Stefan Dietze (L3S Research Institute, Hannover, DE)
  • Anna Tordai (Elsevier, NL)