PDFDataExtractor

PDFDataExtractor is a toolkit for automatically extracting semantic information from PDF files of scientific articles. It uses a template-based architecture and can currently extract information from articles by the following publishers and journal families; more templates are under development:

| Elsevier
| Royal Society of Chemistry
| Advanced Materials family (Wiley)
| Angewandte
| Chemistry A European Journal
| American Chemical Society

This guide provides a quick tour through PDFDataExtractor concepts and functionalities.
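
To make the idea of a template-based architecture with publisher detection concrete, here is a minimal sketch of how such a dispatch could look. It is not PDFDataExtractor's actual code or API: the `Template` class, the `TEMPLATES` list, the marker strings, and `detect_publisher` are all hypothetical names chosen for illustration, and a real template would presumably rely on much richer layout cues than plain substrings.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Template:
    """Hypothetical publisher template: a name plus marker strings."""
    publisher: str
    markers: Sequence[str]


# Illustrative markers only (assumptions, not taken from the library).
TEMPLATES = [
    Template("Elsevier", ("elsevier",)),
    Template("Royal Society of Chemistry", ("royal society of chemistry", "rsc.org")),
    Template("American Chemical Society", ("american chemical society", "pubs.acs.org")),
]


def detect_publisher(first_page_text: str) -> Optional[Template]:
    """Return the first template with a marker in the text, else None."""
    lowered = first_page_text.lower()
    for template in TEMPLATES:
        if any(marker in lowered for marker in template.markers):
            return template
    return None
```

In a design like this, supporting a new publisher amounts to adding one more `Template` entry, and a `None` result signals that no known template matches the document.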

Features

| Extract metadata from scientific PDFs, including: title, author, abstract, journal name, journal year, journal volume, journal page number, DOI, keywords, figure captions, section titles, headings, page number, and references
| Chemistry-aware PDF information extraction
| Output PDF articles as plain text or JSON
| Extract articles from six mainstream chemistry and physics publishers with high precision
| Automated publisher detection
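
As a rough illustration of the plain text and JSON outputs, the sketch below serializes a small metadata record both ways. It does not use PDFDataExtractor itself: `ArticleMetadata`, `to_json`, and `to_plain_text` are hypothetical names, and the record covers only a subset of the fields listed above.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class ArticleMetadata:
    """Hypothetical container for a subset of the extracted fields."""
    title: str = ""
    authors: List[str] = field(default_factory=list)
    journal: str = ""
    year: str = ""
    doi: str = ""
    keywords: List[str] = field(default_factory=list)


def to_json(meta: ArticleMetadata) -> str:
    """Serialize the record as indented JSON, keeping non-ASCII characters."""
    return json.dumps(asdict(meta), indent=2, ensure_ascii=False)


def to_plain_text(meta: ArticleMetadata) -> str:
    """Serialize the record as one 'field: value' line per field."""
    lines = []
    for key, value in asdict(meta).items():
        if isinstance(value, list):
            value = "; ".join(value)  # flatten list fields onto one line
        lines.append(f"{key}: {value}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Example values taken from the citation in this README.
    meta = ArticleMetadata(
        title="PDFDataExtractor: A Tool for Reading Scientific Text and "
              "Interpreting Metadata from the Typeset Literature in the "
              "Portable Document Format",
        authors=["Zhu, M.", "Cole, J."],
        journal="Journal of Chemical Information and Modeling",
        year="2022",
    )
    print(to_json(meta))
    print(to_plain_text(meta))
```

Fields that were not extracted stay empty rather than being guessed, so downstream code can tell missing values apart from real ones.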

Features in Development

| Web services for a more user-friendly experience
| Support for more publishers

Citing

PDFDataExtractor:

Zhu, M. and Cole, J., 2022. PDFDataExtractor: A Tool for Reading Scientific Text and Interpreting Metadata from the Typeset Literature in the Portable Document Format. Journal of Chemical Information and Modeling, 62(7), pp.1633-1643.

This project was financially supported by the Science and Technology Facilities Council (STFC), the Royal Academy of Engineering (RCSRF1819\7\10), and BASF.
