This repository has been archived by the owner on Mar 30, 2022. It is now read-only.
Related Projects
Daniel Ecer edited this page Jul 23, 2020
Overview of related projects.
Projects that combine multiple tools to produce their output (as ScienceBeam itself does).
- Input: Doc (Word) or PDF
- Output: JATS XML
- Scope: References
- Activity: Active
- License: GPL 3.0
Links:
- Input: (m)any markup
- Output: (m)any markup
Links:
- Input: PDF
- Output: TEI XML
- Model: Hierarchical CRF (ML)
- Scope: Full Text
- Language: Java
- Activity: Active
- License: Apache 2.0
Links:
- Input: PDF
- Output: TEI XML
- Model: SVM (ML)
- Scope: Full Text
- Language: Java
- Activity: Active
- License: AGPL 3.0
Links:
- Input: PDF
- Output: JSON
- Model: CRF (ML)
- Scope: Title, Authors, Abstract, References
- Language: Scala / Java
- Activity: Deprecated (but still running, see Science Parse V2)
- License: Apache 2.0
Links:
- Input: PDF
- Output: JSON
- Model: BiLSTM (ML)
- Scope: Title, Authors, References (may be extended in the future)
- Language: Python
- Activity: Active
- License: Apache 2.0
Links:
- Input: SVG (PDF converted using pdf2svg)
- Output: XML
- Model: Rule-based?
- Scope: Full Text (focus on Tables?), a number of sub-projects
- Language: Java
- Activity: Active
- License: Apache 2.0
Links:
- Input: PDF
- Output: TEI XML
- Model: CRF (ML)
- Scope: Full Text
- Language: Python
- Activity: ~2017
- License:
Links:
- GitHub
- "OCR++: A Robust Framework For Information Extraction from Scholarly Articles" (2016)
- project
- Input: PDF
- Output: XML
- Model:
- Scope: Paragraph?
- Language: Java
- Activity: Active
- License: Apache 2.0
Links:
- Input: Word .docx
- Output: JATS XML (TEI as intermediate format)
- Model: Rule-based?
- Scope: Full Text?
- Language: XSLT
- Activity: Active
- License: GPL 2.0
Links:
- Input: PDF
- Output: LaTeX
- Model: Computer Vision
- Scope: Formulas
- Language: Python
- Activity: 2016
- License: Apache 2.0
Links:
- Input: PDF
- Output: JATS? XML
- Model: Rule-based
- Scope: Full Text
- Language: Java
- Activity:
- License: GPL 3.0
Links:
- Input: PDF
- Output: JATS XML
- Model:
- Scope: Full Text
- Activity:
- License:
Links:
- Input: PDF
- Output: JATS XML
- Model: CRF
- Scope: References
- Activity: ~2013
- License: LGPL 3.0
Links:
- GitHub
- "Logical Structure Recovery in Scholarly Articles with Rich Document Features" (2010) pre-print
- Input: PDF
- Output: XML
- Model:
- Scope: References, Regions
- Activity: ~2015 (retired)
- License: MIT
Links:
- Input: LaTeX
- Output: HTML
- Model:
- Scope:
- Activity: Active
- License: Apache 2.0
Links:
- pdf2svg
- xpdf
- pdftohtml
- pdf2htmlEX
- pdfminer
- PyPDF2
- pdfrw
- ContentMine pdf2svg GitHub
- docsplit
- pypdf2xml
- pdf2json
- node-pdfreader
- mupdf
- pdfbox
- inkscape
- "Recursive Recurrent Nets with Attention Modeling for OCR in the Wild" (2016) (Attention-OCR)
- "Visually Identifying Rank" (2016)
- tesseract
- pytesseract
- jocr
- ocropy
- kraken
- Transkribus
- Keras example
- "Semi-automatic Metadata Extraction from Scientific Journal Article for Full-text XML Conversion" (2016)
- "Machine Learning Approach upon Text from Varied Publishing Formats" (2016)
- "A Comparison of Two Unsupervised Table Recognition Methods from Digital Scientific Articles" (2014)
- "From Legacy Documents to XML: A Conversion Framework" (2005)
- "Automating XML Markup using Machine Learning Techniques" (2004)
- "Information Extraction and Automatic Markup for XML documents" (2003)
- "Automatic Document Metadata Extraction using Support Vector Machines" (2003)
- "Machine Learning for Information Extraction from XML marked-up text on the Semantic Web" (2001)
- "Converting PDF to XML" - Mark D. Anderson (2008)
- "Machine Learning for Reading Order Detection in Document Image Understanding" (2007) pdf
- "The Significance of Reading Order in Document Recognition and its Evaluation" (2013) pdf
- "Machine Learning for Document Structure Recognition" (2011) researchgate
- "A Machine Learning Approach To Sentence Ordering For Multi-Document Summarization and its Evaluation" (2005) pdf
- "Unsupervised document structure analysis of digital scientific articles" (2014) researchgate
- "Zhang-Shasha: Tree edit distance in Python"
- "SegAN: Adversarial Network with Multi-scale L1 Loss for Medical Image Segmentation" (2017)
- "A Brief History of CNNs in Image Segmentation: From R-CNN to Mask R-CNN" (2017)
- "Semi and Weakly Supervised Semantic Segmentation Using Generative Adversarial Network" (2017)
- "Deep learning for satellite imagery via image segmentation" (2017)
- "Image-to-Image Translation with Conditional Adversarial Networks" (2016) GitHub
- "G-RMI Object Detection 2nd ImageNet and COCO Visual Recognition Challenges Joint Workshop ECCV 2016, Amsterdam" (2016) presentation
- "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks" (2015)
- "U-Net: Convolutional Networks for Biomedical Image Segmentation" (2015)
- "Image Segmentation using deconvolution layer in Tensorflow"
- "Tensorflow Unet"