This repository has been archived by the owner on Mar 30, 2022. It is now read-only.

Related Projects

Daniel Ecer edited this page Jul 23, 2020 · 7 revisions

Overview of related projects.

Meta Projects

These projects combine multiple tools to produce their output (as does ScienceBeam itself).

PKP OTS

Input: Doc (Word) or PDF
Output: JATS XML
Scope: References
Activity: Active
License: GPL 3.0

Links:

Pandoc

Input: (m)any markup
Output: (m)any markup

Links:

project
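As a rough illustration of the "(m)any markup" conversion, the sketch below wraps the `pandoc` command line from Python. It assumes the `pandoc` executable is installed and on the `PATH`; the helper names `build_pandoc_command` and `convert`, and the file names in the usage note, are hypothetical and chosen only for this example. It uses only the `-f`/`-t`/`-o` options.

```python
import shutil
import subprocess


def build_pandoc_command(source, target, from_format=None, to_format=None):
    """Return the argument list for one pandoc conversion (hypothetical helper)."""
    command = ["pandoc", source, "-o", target]
    if from_format:
        # -f selects the input format, e.g. "docx" or "markdown"
        command += ["-f", from_format]
    if to_format:
        # -t selects the output format, e.g. "jats" or "html"
        command += ["-t", to_format]
    return command


def convert(source, target, from_format=None, to_format=None):
    """Run pandoc; assumes the pandoc executable is available on PATH."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc executable not found on PATH")
    subprocess.run(
        build_pandoc_command(source, target, from_format, to_format),
        check=True,
    )
```

For example, `convert("manuscript.docx", "manuscript.xml", "docx", "jats")` would ask Pandoc to turn a Word document into JATS XML. When the formats are omitted, Pandoc tries to infer them from the file extensions.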

Semantic Extraction Projects

GROBID (GeneRation Of BIbliographical Data)

Input: PDF
Output: TEI XML
Model: Hierarchical CRF (ML)
Scope: Full Text
Language: Java
Activity: Active
License: Apache 2.0
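
To make the PDF to TEI XML flow concrete, here is a minimal sketch of sending a PDF to a locally running GROBID service over HTTP. It assumes a GROBID server listening on `http://localhost:8070` and the `processFulltextDocument` endpoint with a multipart field named `input`; check both against the GROBID documentation for the version in use. The helper names `build_multipart_pdf` and `pdf_to_tei` are hypothetical.

```python
import urllib.request
import uuid


def build_multipart_pdf(pdf_bytes, field_name="input", filename="document.pdf"):
    """Encode one PDF as a multipart/form-data body.

    Returns a (body, content_type) tuple. The field name "input" is an
    assumption about what the GROBID service expects.
    """
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field_name}"; '
        f'filename="{filename}"\r\n'
        "Content-Type: application/pdf\r\n\r\n"
    ).encode("ascii")
    tail = f"\r\n--{boundary}--\r\n".encode("ascii")
    return head + pdf_bytes + tail, f"multipart/form-data; boundary={boundary}"


def pdf_to_tei(pdf_bytes, base_url="http://localhost:8070"):
    """POST a PDF to an (assumed) local GROBID server and return the TEI XML."""
    body, content_type = build_multipart_pdf(pdf_bytes)
    request = urllib.request.Request(
        f"{base_url}/api/processFulltextDocument",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

A call such as `pdf_to_tei(open("paper.pdf", "rb").read())` should return the TEI XML as a string, provided the service is running.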

Links:

CERMINE (Content ExtRactor and MINEr)

Input: PDF
Output: NLM JATS XML
Model: SVM (ML)
Scope: Full Text
Language: Java
Activity: Active
License: AGPL 3.0

Links:

GitHub

Science Parse V1

Input: PDF
Output: JSON
Model: CRF (ML)
Scope: Title, Authors, Abstract, References
Language: Scala / Java
Activity: Deprecated (but still running, see Science Parse V2)
License: Apache 2.0

Links:

GitHub

Science Parse V2

Input: PDF
Output: JSON
Model: BiLSTM (ML)
Scope: Title, Authors, References (may be extended in the future)
Language: Python
Activity: Active
License: Apache 2.0

Links:

GitHub

ContentMine pdf2xml

Input: SVG (PDF converted using pdf2svg)
Output: XML
Model: Rule-based?
Scope: Full Text (focus on Tables?), a number of sub-projects
Language: Java
Activity: Active
License: Apache 2.0

Links:

Bitbucket

OCR++

Input: PDF
Output: TEI XML
Model: CRF (ML)
Scope: Full Text
Language: Python
Activity: ~2017
License:

Links:

PdfAct

Input: PDF
Output: XML
Model:
Scope: Paragraph?
Language: Java
Activity: Active
License: Apache 2.0

Links:

meTypeset

Input: Word .docx
Output: JATS XML (TEI as intermediate format)
Model: Rule-based?
Scope: Full Text?
Language: XSLT
Activity: Active
License: GPL 2.0

Links:

GitHub

im2markup

Input: PDF
Output: LaTeX
Model: Computer Vision
Scope: Formulas
Language: Python
Activity: 2016
License: Apache 2.0

Links:

LA-PDFText

Input: PDF
Output: JATS? XML
Model: Rule-based
Scope: Full Text
Language: Java
Activity:
License: GPL 3.0

Links:

PDFX

Input: PDF
Output: JATS XML
Model:
Scope: Full Text
Activity:
License:

Links:

"PDFX: Fully-automated PDF-to-XML Conversion of Scientific Literature" (2013)

ParsCit

Input: PDF
Output: JATS XML
Model: CRF
Scope: References
Activity: ~2013
License: LGPL 3.0

Links:

Crossref pdf-extract

Input: PDF
Output: XML
Model:
Scope: References, Regions
Activity: ~2015 (retired)
License: MIT

Links:

Arxiv Vanity / Engrafo

Input: LaTeX
Output: HTML
Model:
Scope:
Activity: Active
License: Apache 2.0

Links:

Low-level PDF Extraction

OCR

Resources

Semantic Extraction Resources

Reading Order Resources

Object Detection / Image Segmentation

More