
SemPub17_QueriesTask2


More details and explanations will be gradually added to this page. Participants are invited to use the mailing list (https://groups.google.com/forum/#!forum/sempub-challenge) to comment, to ask questions, and to get in touch with chairs and other participants.

General information and rules

Participants are required to translate the input queries into SPARQL queries that can be executed against the produced LOD. The dataset can use any vocabulary, but the query result output must conform to the rules described on this page.

Some preliminary information and general rules:

  • queries must produce a CSV output, according to the rules detailed below. The evaluation will be performed automatically by comparing this output (on the evaluation dataset) with the expected results.
  • IRIs of workshop volumes and papers must follow this naming convention:

    type of resource    URI example
    workshop volume     http://ceur-ws.org/Vol-1010/
    paper               http://ceur-ws.org/Vol-1099/#paper3

Papers have fragment IDs like paper3 in the most recently published workshop proceedings. When processing older workshop proceedings, please derive such IDs from the filenames of the papers, by removing the .pdf extension (e.g. paper3.pdf → paper3, or ldow2011-paper12.pdf → ldow2011-paper12).

  • IRIs of other resources (e.g. affiliations, funding agencies) must also be within the http://ceur-ws.org/ namespace, but in a path separate from http://ceur-ws.org/Vol-NNN/ for any number NNN.
  • the IRI structure used in the examples is not normative and is given for illustration only. Participants are free to use their own IRI structure and their own organization of classes and instances.
  • All data relevant for the queries and available in the input dataset must be extracted and produced as output. Though the evaluation mechanisms will be implemented so as to take minor differences into account and to normalize them, participants are asked to extract as much information as possible. Further details are given below for each query.
  • Since most of the queries take a paper as input (usually denoted as X), participants are required to use an unambiguous way of identifying input papers. To avoid errors, papers are identified by the URL of the PDF file, as available on http://ceur-ws.org.
  • The order of output records does not matter.

We do not provide further explanations for queries whose expected output is already clear. If it is not, or if there is any other issue, please feel free to ask on the mailing list.
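For illustration only, the skeleton below shows how the input paper X could be selected in SPARQL; SELECT results can then be serialized directly as CSV. The ex: vocabulary (in particular ex:pdfUrl, linking a paper resource to the URL of its PDF file) is a hypothetical assumption, not part of the task; any modelling that identifies papers unambiguously is acceptable. Note that hyphens are not allowed in SPARQL variable names, so the sketches on this page use underscores; the hyphenated CSV header row is produced outside the query.

PREFIX ex: <http://example.org/vocab#>   # hypothetical vocabulary, not normative

SELECT ?paper
WHERE {
  # the input paper X, identified by the URL of its PDF file
  ?paper ex:pdfUrl "http://ceur-ws.org/Vol-1099/paper3.pdf" .
}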

Queries

Query Q2.1: Affiliations in a paper

Query: Identify the affiliations of the authors of the paper X

The correct identification of the affiliations is tricky and would require modeling complex organizations, sub-organizations and units. A simplified approach is adopted for this task: participants are required to extract one single string for each affiliation, as it appears in the header of the paper, excluding data about the location (address, city, state).

Participants are also asked to extract the full name of each author, without any further processing: author names must be extracted as they appear in the header. No normalization of middle names and initials is required.

During the evaluation process, these values will be normalized to lowercase; spaces, punctuation and special characters will be stripped.

Further notes:

  • If the affiliation is composed of multiple parts (for instance, it indicates a department of a university), all these parts must be included in the same affiliation.
  • If the affiliation spans multiple lines, all these lines must be included, apart from data about the location (according to the general rule above). Multiple lines can be collapsed into a single one, since newlines and punctuation will be stripped during the evaluation.
  • In case of multiple affiliations for the same author, the query must return one line for each affiliation.
  • In case of multiple authors with the same affiliation, the query must return one line for each author.

Expected output format (CSV):

affiliation-iri, affiliation-fullname, author-iri, author-fullname
<IRI>,rdfs:Literal,<IRI>,rdfs:Literal
<IRI>,rdfs:Literal,<IRI>,rdfs:Literal
[...]
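A non-normative sketch using the hypothetical ex: vocabulary introduced above (ex:author, ex:affiliation and ex:fullName are likewise assumptions); the join naturally yields one record per author/affiliation pair, as required by the notes above.

PREFIX ex: <http://example.org/vocab#>

SELECT ?affiliation_iri ?affiliation_fullname ?author_iri ?author_fullname
WHERE {
  ?paper ex:pdfUrl "http://ceur-ws.org/Vol-1531/paper8.pdf" .
  ?paper ex:author ?author_iri .
  # one output record for each (author, affiliation) combination
  ?author_iri ex:fullName ?author_fullname ;
              ex:affiliation ?affiliation_iri .
  ?affiliation_iri ex:fullName ?affiliation_fullname .
}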

Examples in TD

Some examples of output are shown below, others can be found in the training dataset files.

Query Q1.3: Identify the affiliations of the authors of the paper http://ceur-ws.org/Vol-1531/paper8.pdf

affiliation-iri, affiliation-fullname, author-iri, author-fullname
<http://ceur-ws.org/affiliation/department-of-systems-and-computer-engineering-carleton-university>, "Department of Systems and Computer Engineering Carleton University", <http://ceur-ws.org/author/adnan-faisal>, "Adnan Faisal"

Query Q2.2: Countries in affiliations

Query: Identify the countries of the affiliations of the authors in the paper X

Participants are required to extract data about affiliations and to identify the country where each research institution is located.

During the evaluation process, the names of the countries will be normalized to lowercase.

Further notes:

  • the country names must be in English
  • if the country is not explicitly mentioned in the affiliation, it should be derived from external sources
  • the article 'the' in the country name is not relevant (for instance, 'The Netherlands' is considered equal to 'Netherlands')
  • some acronyms are normalized: for instance, 'UK', 'U.K.' and 'United Kingdom' are considered equivalent; 'USA' and 'U.S.A.' are equivalent to 'United States of America'

Expected output format (CSV):

country-iri, country-fullname
<IRI>,rdfs:Literal
<IRI>,rdfs:Literal
[...]
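Again a hypothetical sketch, assuming an ex:country property from affiliations to country resources; DISTINCT avoids repeating a country shared by several affiliations.

PREFIX ex: <http://example.org/vocab#>

SELECT DISTINCT ?country_iri ?country_fullname
WHERE {
  ?paper ex:pdfUrl "http://ceur-ws.org/Vol-1522/Badreddin2015HuFaMo.pdf" ;
         ex:author ?author .
  ?author ex:affiliation ?affiliation .
  ?affiliation ex:country ?country_iri .
  ?country_iri ex:fullName ?country_fullname .
}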

Examples in TD

Some examples of output are shown below, others can be found in the training dataset files.

Query Q2.13: Identify the countries of the affiliations of the authors in the paper http://ceur-ws.org/Vol-1522/Badreddin2015HuFaMo.pdf

country-iri, country-fullname
<http://ceur-ws.org/country/usa>, "U.S.A."
<http://ceur-ws.org/country/israel>, "Israel"
<http://ceur-ws.org/country/canada>, "Canada"

Query Q2.3: Supplementary material

Query: Identify the supplementary material(s) for the paper X

Some scientific papers are accompanied by supplementary material that complements the content of the paper. This material is linked in the full text (or in footnotes or appendices) and might include: evaluation datasets, detailed evaluation reports, documentation, videos, prototype source code, etc.

Participants are required to identify these links in the paper and to extract the URL to access the supplementary material.

Important. The following data are NOT required to be extracted and included in the output:

  • technical reports and extended versions of the papers
  • external datasets that are mentioned in a paper but exist independently of it. Datasets should only be considered if they are explicitly mentioned as supplementary material
  • existing software, libraries, APIs and technologies used to develop a system

To avoid confusion, the web site of the system (or model, ontology, prototype, etc.) is instead considered as supplementary material for the purposes of this task.

Expected output format (CSV):

material-url
<IRI>
<IRI>
[...]
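A minimal sketch under the same assumptions, with a hypothetical ex:supplementaryMaterial property linking a paper to each material URL:

PREFIX ex: <http://example.org/vocab#>

SELECT ?material_url
WHERE {
  ?paper ex:pdfUrl "http://ceur-ws.org/Vol-1522/Badreddin2015HuFaMo.pdf" ;
         ex:supplementaryMaterial ?material_url .
}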

Examples in TD

Some examples of output are shown below, others can be found in the training dataset files.

Query Q3.13: Identify the supplementary material(s) for the paper http://ceur-ws.org/Vol-1522/Badreddin2015HuFaMo.pdf

material-url
"https://zenodo.org/record/20367?ln=en#.Veuv5dNViFI"^^xsd:anyURI

Query Q3.19: Identify the supplementary material(s) for the paper http://ceur-ws.org/Vol-1758/paper3.pdf

material-url
"http://github.com/georghinkel/TTC2016CRA"^^xsd:anyURI
"http://is.ieis.tue.nl/staff/pvgorp/share/?page=ConfigureNewSession&vdi=XP-TUe_TTC16_NMF.vdi"^^xsd:anyURI

Query Q2.4: Sections

Query: Identify the titles of the first-level sections of the paper X.

As in last year's edition, we would like to go deeper into the content and the organization of the papers. As a first step, participants are required to extract the titles of the first-level sections of each paper. Though nested levels would be equally interesting, we limit the analysis to the main level only.

Sections must be represented as resources in the produced dataset identified by the section-iri value.

Section titles can be in lowercase or uppercase or capitalized (they will be normalized during the evaluation). For the sake of simplicity, subscript or superscript text in titles has to be treated as normal text.

Participants are also required to identify the number of each section, as it appears in the paper.

If a section is not numbered, an empty value should be returned.

If the numbering is not correct, it should be returned 'as is'.

Important. The following rules apply to special sections:

  • Abstracts must NOT be included in the output, unless the paper is abstract-only;
  • The References section must be included in the output. It might be numbered or not.
  • Acknowledgements sections must be identified as separate sections
    • acknowledgements must be considered separate sections even if they are just formatted as special paragraphs at the end of the paper; however, acknowledgements that appear in a footnote or in the main text of the paper are not relevant for the purposes of this task.
  • Appendices must be considered sections too, and included in the output if present in the input paper.

During the evaluation process, section titles will be normalized to lowercase; spaces, punctuation and special characters will be stripped.

Expected output format (CSV):

section-iri, section-number, section-title
<IRI>,xsd:integer,rdfs:Literal
<IRI>,xsd:integer,rdfs:Literal
[...]
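A non-normative sketch with the same hypothetical ex: vocabulary; the OPTIONAL block leaves ?section_number unbound, and hence the CSV value empty, for unnumbered sections:

PREFIX ex: <http://example.org/vocab#>

SELECT ?section_iri ?section_number ?section_title
WHERE {
  ?paper ex:pdfUrl "http://ceur-ws.org/Vol-1006/paper2.pdf" ;
         ex:section ?section_iri .
  ?section_iri ex:title ?section_title .
  # unnumbered sections (e.g. Acknowledgment, References) get an empty value
  OPTIONAL { ?section_iri ex:number ?section_number . }
}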

Examples in TD

Some examples of output are shown below, others can be found in the training dataset files.

Query Q4.1: Identify the first-level sections of the paper http://ceur-ws.org/Vol-1006/paper2.pdf

section-iri, section-number, section-title
<http://ceur-ws.org/section/vol-1006-paper2_sec1>, "1"^^xsd:integer, "Introduction"
<http://ceur-ws.org/section/vol-1006-paper2_sec2>, "2"^^xsd:integer, "Related Work"
<http://ceur-ws.org/section/vol-1006-paper2_sec3>, "3"^^xsd:integer, "ICMM for Collaborative Innovation Governance"
<http://ceur-ws.org/section/vol-1006-paper2_sec4>, "4"^^xsd:integer, "Limitation and Outlook"
<http://ceur-ws.org/section/vol-1006-paper2_secnonum_1>, , "Acknowledgment"
<http://ceur-ws.org/section/vol-1006-paper2_secnonum_2>, , "References"

Query Q4.5: Identify the first-level sections of the paper http://ceur-ws.org/Vol-1405/paper-02.pdf

section-iri, section-number, section-title
<http://ceur-ws.org/section/vol-1405-paper-02_sec1>, "1"^^xsd:integer, "INTRODUCTION"
<http://ceur-ws.org/section/vol-1405-paper-02_sec2>, "2"^^xsd:integer, "RECOMMENDATION MODELS"
<http://ceur-ws.org/section/vol-1405-paper-02_sec3>, "3"^^xsd:integer, "EXPERIMENTAL EVALUATION"
<http://ceur-ws.org/section/vol-1405-paper-02_sec4>, "4"^^xsd:integer, "CONCLUSIONS"
<http://ceur-ws.org/section/vol-1405-paper-02_sec5>, "5"^^xsd:integer, "ACKNOWLEDGEMENTS"
<http://ceur-ws.org/section/vol-1405-paper-02_sec6>, "6"^^xsd:integer, "REFERENCES"

Query Q2.5: Tables

Query: Identify the captions of the tables in the paper X

Participants are also required to extract information about other structural components of the papers, including tables.

As a first step, they are asked to extract the captions of the tables. These captions can be in lowercase or uppercase (they will be normalized during the evaluation). For the sake of simplicity, subscript or superscript text in the caption has to be treated as normal text.

Tables must be represented as resources in the produced dataset identified by the table-iri value.

Participants are also required to identify the number of each table.

Important. Caption labels, such as 'Table', 'Tab.', etc., must not be part of the number (which is an integer value).

The representation has to use Arabic numerals, even if the original paper uses Roman numerals or letters.

During the evaluation process, captions will be normalized to lowercase; spaces, punctuation and special characters will be stripped.

Expected output format (CSV):

table-iri, table-number, table-caption
<IRI>,xsd:integer,rdfs:Literal
<IRI>,xsd:integer,rdfs:Literal
[...]
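A sketch under the same hypothetical vocabulary, with assumed ex:number and ex:caption properties on table resources:

PREFIX ex: <http://example.org/vocab#>

SELECT ?table_iri ?table_number ?table_caption
WHERE {
  ?paper ex:pdfUrl "http://ceur-ws.org/Vol-1500/paper3.pdf" ;
         ex:table ?table_iri .
  # the number is a plain integer, without the 'Table'/'Tab.' caption label
  ?table_iri ex:number ?table_number ;
             ex:caption ?table_caption .
}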

Examples in TD

Some examples of output are shown below, others can be found in the training dataset files.

Query Q5.6: Identify the captions of the tables in the paper http://ceur-ws.org/Vol-1500/paper3.pdf

table-iri, table-number, table-caption
<http://ceur-ws.org/table/vol-1500-paper3_tab1>, "1"^^xsd:integer, "GQM model for evaluating systems according to our proposed benchmark criteria."
<http://ceur-ws.org/table/vol-1500-paper3_tab2>, "2"^^xsd:integer, "Results of our manual and example implementations for the metrics defined as benchmark."
<http://ceur-ws.org/table/vol-1500-paper3_tab3>, "3"^^xsd:integer, "Rounded average execution times of the benchmarking example."

Query Q2.6: Figures

Query: Identify the captions of the figures in the paper X

Participants are also required to extract information about figures included in the papers.

As a first step, they are asked to extract the captions of the figures. These captions can be in lowercase or uppercase (they will be normalized during the evaluation). For the sake of simplicity, subscript or superscript text in the caption has to be treated as normal text.

Important. In-line figures with no caption must not be taken into account. For the sake of simplicity, a figure composed of subfigures - with only one caption - has to be considered as one single figure (the caption describes all subfigures). Listings, pseudocode and algorithms are not relevant for the purpose of this task.

Figures must be represented as resources in the produced dataset identified by the figure-iri value.

Participants are also required to identify the number of each figure.

Important. Caption labels, such as 'Figure', 'Fig.', etc., must not be part of the number (which is an integer value). The number of each figure has to match its position in the paper.

The representation has to use Arabic numerals, even if the original paper uses Roman numerals or letters.

During the evaluation process, captions will be normalized to lowercase; spaces, punctuation and special characters will be stripped.

Expected output format (CSV):

figure-iri, figure-number, figure-caption
<IRI>,xsd:integer,rdfs:Literal
<IRI>,xsd:integer,rdfs:Literal
[...]
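The sketch mirrors the one for tables, with a hypothetical ex:figure property; since uncaptioned in-line figures are not represented in the data, a paper without captioned figures simply yields zero records (as in the second example below).

PREFIX ex: <http://example.org/vocab#>

SELECT ?figure_iri ?figure_number ?figure_caption
WHERE {
  ?paper ex:pdfUrl "http://ceur-ws.org/Vol-1123/paper4.pdf" ;
         ex:figure ?figure_iri .
  ?figure_iri ex:number ?figure_number ;
              ex:caption ?figure_caption .
}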

Examples in TD

Some examples of output are shown below, others can be found in the training dataset files.

Query Q6.10: Identify the captions of the figures in the paper http://ceur-ws.org/Vol-1123/paper4.pdf

figure-iri, figure-number, figure-caption
<http://ceur-ws.org/figure/vol-1123-paper4_fig1>, "1"^^xsd:integer, "EPCIS event classes as represented in EEM"
<http://ceur-ws.org/figure/vol-1123-paper4_fig2>, "2"^^xsd:integer, "Business context entities, relationships and representative individuals"
<http://ceur-ws.org/figure/vol-1123-paper4_fig3>, "3"^^xsd:integer, "Interlinking EPCIS event data"
<http://ceur-ws.org/figure/vol-1123-paper4_fig4>, "4"^^xsd:integer, "EPCIS events, sensor installations and workflow"
<http://ceur-ws.org/figure/vol-1123-paper4_fig5>, "5"^^xsd:integer, "An EEM object event representation from the tomato supply chain"

Query Q6.5: Identify the captions of the figures in the paper http://ceur-ws.org/Vol-1405/paper-02.pdf

figure-iri, figure-number, figure-caption

Query Q2.7: Footnotes

Query: Identify the footnotes in the paper X.

Participants are required to extract the footnotes from the paper. For each footnote, they are required to extract:

  • the footnote marker: the symbol used to identify the footnote in the text, usually a superscripted number
  • the footnote body: the actual content of the footnote (without the footnote marker)
  • the footnote sentence: the sentence from which the footnote is referenced (after some normalization of the footnote marker)

The footnote sentence should be extracted and rebuilt as follows:

  • the footnote marker should be substituted with the sequence of characters '******'; footnote markers for other footnotes should be removed from the sentence
  • if the footnote marker is in a paragraph, only the sentence including the marker should be extracted. The sentence is considered as bounded by '.', for the sake of simplicity. The rest of the paragraph is not relevant
  • if the footnote sentence spans two pages, it should be rebuilt completely (merging the content from the two pages)
  • if the footnote marker is in a sentence within parentheses, only that sentence should be extracted
  • if the footnote marker is in a list item, only the content of that list item should be extracted
  • if the footnote marker is in a table cell, only the content of that cell should be extracted

Footnotes must be represented as resources in the produced dataset identified by the footnote-iri value.

The content can be in lowercase or uppercase (it will be normalized during the evaluation). Punctuation and spaces will be normalized during the evaluation process too.

Footnotes used for authors should NOT be considered for this query.

Further notes:

  • footnote markers should be reported as normal text, even if originally formatted as superscript
  • if a footnote is referenced multiple times in the paper, only the first occurrence should be considered
  • links within footnotes should be considered as plain text
  • references within footnotes should be considered as plain text

Expected output format (CSV):

footnote-iri, footnote-marker, footnote-body, sentence-with-footnote
<IRI>,rdfs:Literal,rdfs:Literal,rdfs:Literal
<IRI>,rdfs:Literal,rdfs:Literal,rdfs:Literal
[...] 
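A sketch with the same hypothetical vocabulary, assuming ex:marker, ex:body and ex:sentence properties on footnote resources:

PREFIX ex: <http://example.org/vocab#>

SELECT ?footnote_iri ?footnote_marker ?footnote_body ?sentence_with_footnote
WHERE {
  ?paper ex:pdfUrl "http://ceur-ws.org/Vol-1184/ldow2014_paper_02.pdf" ;
         ex:footnote ?footnote_iri .
  ?footnote_iri ex:marker ?footnote_marker ;
                ex:body ?footnote_body ;
                # the sentence with the marker replaced by '******'
                ex:sentence ?sentence_with_footnote .
}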

Examples in TD

Some examples of output are shown below, others can be found in the training dataset files.

Query Q7.4: Identify the footnotes in the paper http://ceur-ws.org/Vol-1184/ldow2014_paper_02.pdf

footnote-iri, footnote-marker, footnote-body, sentence-with-footnote
<http://ceur-ws.org/footnote/vol-1184-ldow2014_paper_02_footnote1>, "1", "ClueWeb, 2009, http://lemurproject.org/clueweb09.php", "For instance, Cafarella et al. report 14.1 billion HTML tables in English documents in Googles's main index [4] and over 400,000 Excel spreadsheets in the Clueweb09****** crawl [5]."
<http://ceur-ws.org/footnote/vol-1184-ldow2014_paper_02_footnote2>, "2", "We use the patterns in the PATTY system [12].", "We start from a list of textual patterns that are associated with each relation; such patterns are detected by automatic methods for building knowledge bases from natural language******."

Query: Identify the footnotes in the paper http://ceur-ws.org/Vol-1522/Badreddin2015HuFaMo.pdf

footnote-iri, footnote-marker, footnote-body, sentence-with-footnote
<http://ceur-ws.org/footnote/vol-1522-badreddin2015hufamo_footnote1>, "1", "https://zenodo.org/record/20367?ln=en#.Veuv5dNViFI", "The complete raw data as well as summarised data are made publicly available****** to facilitate replication and validation of the results [17]."

Query Q7.37: Identify the footnotes in the paper http://ceur-ws.org/Vol-1766/om2016_poster2.pdf

footnote-iri, footnote-marker, footnote-body, sentence-with-footnote
<http://ceur-ws.org/footnote/vol-1766-om2016_poster2_footnote3>, "3", "http://oaei.ontologymatching.org/", "As this field has solid tools and benchmarks******, we design a framework that provides the required input to any ontology matching tool, resulting in Web table annotations."
<http://ceur-ws.org/footnote/vol-1766-om2016_poster2_footnote4>, "4", "http://webdatacommons.org/webtables/goldstandard.html", "We evaluate our approach using the instance mappings of the T2D gold standard****** and LogMap [5], one of the most efficient ontology matching tools [3]."
<http://ceur-ws.org/footnote/vol-1766-om2016_poster2_footnote5>, "5", "http://wiki.dbpedia.org/projects/dbpedia-lookup", "For each entity label in our table, we use top-1 DBpedia lookup****** result as annotation."

Query Q2.8: EU projects

Query: Identify the EU project(s) that supported the research presented in the paper X (or part of it).

The analysis is restricted to projects explicitly mentioned in the paper. The names of the projects must be copied directly from the paper, without looking at external data sources.

Projects must be represented as resources in the produced dataset identified by the project-iri value.

Punctuation, spaces, prepositions and articles in these values will be normalized during the evaluation process.

Further notes:

  • projects are identified by their name. If the paper mentions both the name and the EU agreement number, it is enough to include the name.
  • if the paper mentions both the long name of the project and the short one, for instance in parentheses, the short one should be used
  • if the paper only mentions the number of the project, with no information about the name, the number must be included
  • the name of the project must be included without the string 'project'

Expected output format (CSV):

project-iri, project-name 
<IRI>,rdfs:Literal
<IRI>,rdfs:Literal
[...] 
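One last sketch under the same assumptions, with a hypothetical ex:fundingProject property linking a paper to the projects that supported it:

PREFIX ex: <http://example.org/vocab#>

SELECT ?project_iri ?project_name
WHERE {
  ?paper ex:pdfUrl "http://ceur-ws.org/Vol-1123/paper4.pdf" ;
         ex:fundingProject ?project_iri .
  ?project_iri ex:name ?project_name .
}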

Examples in TD

Some examples of output are shown below, others can be found in the training dataset files.

Query Q8.10: Identify the EU project(s) that supported the research presented in the paper http://ceur-ws.org/Vol-1123/paper4.pdf (or part of it)

project-iri, project-name
<http://ceur-ws.org/project/smartagrifood>, "SmartAgriFood"
<http://ceur-ws.org/project/fispace>, "FISpace"