This repository contains manually inspected datasets for evaluating the different steps during the citation information extraction process.
1-German_papers_with_reference_section_at_end_of_paper_first_group
2-English_papers_with_reference_section_at_end_of_paper
3-German_papers_with_footnote
4-German_papers_with_reference_section_at_end_of_paper_second_group
5-German_papers_with_reference_section_at_end_of_paper_plus_short_citation_footnote
6-Guidelinefiles

SSOAR Gold Standard

This repository contains manually inspected datasets for evaluating the different steps of the reference extraction process. All datasets consist of research papers from the SSOAR repository. The corpus will continue to grow.

Big picture of the "gold standard process"

[Figure: gold standard process]

Content

  • Number of all papers in repository: 354
  • Number of German papers in repository: 254
  • Number of English papers in repository: 100
  • Number of papers processed so far: 225

How to access the papers:

There are six different folders in this repository.

Each folder contains several sub-folders:

  1. Pdfs
    • This folder contains PDF files that are randomly picked from SSOAR publications (to understand how they are selected, see Selection method).
  2. Layouts
    • This folder contains layout files extracted from the selected PDFs.
    • CERMINE is used to generate the layout CSV files from the given PDFs.
  3. Layout with identified references
    • This folder contains layout files plus identified references; all reference strings are annotated and checked manually.
    • EXRef-Identifier is used for checking the identified reference strings in the layout files.
  4. Extracted References from Layouts
    • This folder contains extracted reference strings from annotated layout files (the output of step 3).
    • refext is used for extracting reference strings from layout files.
  5. Segmented References
    • This folder contains segmented reference strings.
    • References are checked and edited manually with the EXRefmeta-Extractor tool.
  6. Merged Layout and segmented references
    • This folder contains layout files (from step 3) merged with the segmented reference strings (from step 5).
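
The six sub-folders above can be read as consecutive stages of one pipeline. As a rough sketch only, the stages and the tools named for them could be written down as a small lookup table. The `Stage` type and the `describe` helper are hypothetical names introduced for illustration, and the folder labels are copied from the list above and may differ from the actual directory names on disk.

```python
from typing import NamedTuple, Optional


class Stage(NamedTuple):
    step: int
    folder: str          # label from the list above, not necessarily the on-disk name
    tool: Optional[str]  # tool named in this README for the step, if any


# Stage order and tool names are taken from the list above.
STAGES = [
    Stage(1, "Pdfs", None),
    Stage(2, "Layouts", "CERMINE"),
    Stage(3, "Layout with identified references", "EXRef-Identifier"),
    Stage(4, "Extracted References from Layouts", "refext"),
    Stage(5, "Segmented References", "EXRefmeta-Extractor"),
    Stage(6, "Merged Layout and segmented references", None),
]


def describe(stage: Stage) -> str:
    """Format one stage as 'step. folder (tool)'."""
    tool = stage.tool or "no tool named"
    return f"{stage.step}. {stage.folder} ({tool})"


for stage in STAGES:
    print(describe(stage))
```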

Selection method

We randomly select our papers from the 33,954 available publications in the SSOAR repository.
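
The exact sampling procedure is not described here. A minimal sketch of uniform random selection without replacement, assuming a list of candidate SSOAR IDs is available, might look like the following. The function name `pick_random_papers`, the `seed` parameter, and the example IDs are assumptions made for illustration and are not part of the original workflow.

```python
import random
from typing import Iterable, List, Optional


def pick_random_papers(candidate_ids: Iterable[str],
                       sample_size: int,
                       seed: Optional[int] = None) -> List[str]:
    """Draw sample_size distinct IDs uniformly at random (hypothetical helper)."""
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    return rng.sample(list(candidate_ids), sample_size)


# Made-up candidate IDs, for illustration only.
candidates = [str(n) for n in range(10000, 10050)]
print(pick_random_papers(candidates, sample_size=5, seed=0))
```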

File names

File names are equal to the SSOAR ID (for easier referencing).

For example, 12826.pdf refers to SSOAR ID 12826.
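
As a small illustration of this naming convention, the SSOAR ID can be recovered from a file name by dropping the extension. The helper name `ssoar_id_from_filename` is hypothetical, and the check that the ID is purely numeric is an assumption based on the example above.

```python
from pathlib import Path


def ssoar_id_from_filename(filename: str) -> str:
    """Return the SSOAR ID encoded in a file name such as '12826.pdf'."""
    stem = Path(filename).stem  # file name without its final extension
    if not stem.isdigit():
        # Assumption: SSOAR IDs are purely numeric, as in the example above.
        raise ValueError(f"not an SSOAR-ID file name: {filename!r}")
    return stem


print(ssoar_id_from_filename("12826.pdf"))
```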

How to access papers in the SSOAR repository

By searching for the SSOAR ID (the file name) in the SSOAR repository, you can access each paper (metadata and PDF).

Selection criteria

  • Selected papers are in German or English (other languages are excluded).
  • Selected papers are not in OCR or scanned (image) format and do not have a watermark in the background.
  • Selected papers contain a reference section at the end of the paper or short citations in footnotes.
  • The number of references is in the range 3 < references < 50.
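
The criteria above can be expressed as a single predicate. The following is a sketch only: the `Paper` record and its field names are hypothetical, and the bounds on the reference count are treated as strict, as written in the list.

```python
from dataclasses import dataclass


@dataclass
class Paper:
    # Hypothetical record; the field names are assumptions for illustration.
    language: str                # e.g. "German" or "English"
    is_scanned: bool             # OCR or scanned (image) PDF
    has_watermark: bool          # watermark in the background
    has_reference_section: bool  # reference section at the end of the paper
    has_footnote_citations: bool # short citations in footnotes
    reference_count: int


def is_eligible(paper: Paper) -> bool:
    """Apply the four selection criteria listed above."""
    return (
        paper.language in {"German", "English"}
        and not paper.is_scanned
        and not paper.has_watermark
        and (paper.has_reference_section or paper.has_footnote_citations)
        and 3 < paper.reference_count < 50  # strict bounds: 4 to 49 references
    )


example = Paper("German", False, False, True, False, 27)
print(is_eligible(example))
```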