This repository has been archived by the owner on Mar 30, 2022. It is now read-only.

Related Projects

Daniel Ecer edited this page Jul 23, 2020 · 7 revisions

Overview of related projects.

Meta Projects

Projects that combine multiple tools to produce their output (as ScienceBeam itself does).

PKP OTS

  • Input: Doc (Word) or PDF
  • Output: JATS XML
  • Scope: References
  • Activity: Active
  • License: GPL 3.0

Links:

Pandoc

  • Input: (m)any markup
  • Output: (m)any markup

Links:

Semantic Extraction Projects

GROBID (GeneRation Of BIbliographical Data)

  • Input: PDF
  • Output: TEI XML
  • Model: Hierarchical CRF (ML)
  • Scope: Full Text
  • Language: Java
  • Activity: Active
  • License: Apache 2.0
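
As a rough illustration of what consuming TEI XML output can look like, the sketch below reads the title and section headings from a TEI document using only the Python standard library. The sample document is hypothetical and heavily simplified, and the helper names (`extract_title`, `extract_section_headings`) are made up for this example; real GROBID output contains far more structure and may need more careful handling.

```python
import xml.etree.ElementTree as ET

# TEI elements live in this namespace, so every path step needs the prefix.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical, heavily simplified TEI document used only for illustration.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>An Example Title</title>
      </titleStmt>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <div><head>Introduction</head><p>First paragraph.</p></div>
      <div><head>Methods</head><p>Second paragraph.</p></div>
    </body>
  </text>
</TEI>"""


def extract_title(tei_xml):
    """Return the document title from the TEI header, or None if absent."""
    root = ET.fromstring(tei_xml)
    node = root.find("tei:teiHeader/tei:fileDesc/tei:titleStmt/tei:title", TEI_NS)
    if node is None or not node.text:
        return None
    return node.text.strip()


def extract_section_headings(tei_xml):
    """Return the text of each top-level section heading in the body."""
    root = ET.fromstring(tei_xml)
    heads = root.iterfind(".//tei:body/tei:div/tei:head", TEI_NS)
    return [h.text.strip() for h in heads if h.text]


if __name__ == "__main__":
    print(extract_title(SAMPLE_TEI))
    print(extract_section_headings(SAMPLE_TEI))
```

The same approach should carry over to other tools in this list that emit TEI XML, provided the namespace and element paths are adjusted to match their actual output.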

Links:

CERMINE (Content ExtRactor and MINEr)

  • Input: PDF
  • Output: NLM JATS XML
  • Model: SVM (ML)
  • Scope: Full Text
  • Language: Java
  • Activity: Active
  • License: AGPL 3.0

Links:

Science Parse V1

  • Input: PDF
  • Output: JSON
  • Model: CRF (ML)
  • Scope: Title, Authors, Abstract, References
  • Language: Scala / Java
  • Activity: Deprecated (but still running, see Science Parse V2)
  • License: Apache 2.0

Links:

Science Parse V2

  • Input: PDF
  • Output: JSON
  • Model: BiLSTM (ML)
  • Scope: Title, Authors, References (may be extended in the future)
  • Language: Python
  • Activity: Active
  • License: Apache 2.0

Links:

ContentMine pdf2xml

  • Input: SVG (PDF converted using pdf2svg)
  • Output: XML
  • Model: Rule based?
  • Scope: Full Text (focus on Tables?), a number of sub-projects
  • Language: Java
  • Activity: Active
  • License: Apache 2.0

Links:

OCR++

  • Input: PDF
  • Output: TEI XML
  • Model: CRF (ML)
  • Scope: Full Text
  • Language: Python
  • Activity: ~2017
  • License:

Links:

PdfAct

  • Input: PDF
  • Output: XML
  • Model:
  • Scope: Paragraph?
  • Language: Java
  • Activity: Active
  • License: Apache 2.0

Links:

meTypeset

  • Input: Word .docx
  • Output: JATS XML (TEI as intermediate format)
  • Model: Rule based?
  • Scope: Full Text?
  • Language: XSLT
  • Activity: Active
  • License: GPL 2.0

Links:

im2markup

  • Input: PDF
  • Output: LaTeX
  • Model: Computer Vision
  • Scope: Formulas
  • Language: Python
  • Activity: 2016
  • License: Apache 2.0

Links:

LA-PDFText

  • Input: PDF
  • Output: JATS? XML
  • Model: Rule based
  • Scope: Full Text
  • Language: Java
  • Activity:
  • License: GPL 3.0

Links:

PDFX

  • Input: PDF
  • Output: JATS XML
  • Model:
  • Scope: Full Text
  • Activity:
  • License:

Links:

ParsCit

  • Input: PDF
  • Output: JATS XML
  • Model: CRF
  • Scope: References
  • Activity: ~2013
  • License: LGPL 3.0

Links:

Crossref pdf-extract

  • Input: PDF
  • Output: XML
  • Model:
  • Scope: References, Regions
  • Activity: ~2015 (retired)
  • License: MIT

Links:

Arxiv Vanity / Engrafo

  • Input: LaTeX
  • Output: HTML
  • Model:
  • Scope:
  • Activity: Active
  • License: Apache 2.0

Links:

Low-level PDF Extraction

OCR

Resources

Semantic Extraction Resources

Reading Order Resources

Object Detection / Image Segmentation

More