cat-lemonade/PDFDataExtractor

PDFDataExtractor

PDFDataExtractor is a toolkit for automatically extracting semantic information from PDF files of scientific articles. It uses a template-based architecture and can currently extract information from articles by the following publishers and journal families, with more templates under development:

  • Elsevier
  • Royal Society of Chemistry
  • Advanced Materials family (Wiley)
  • Angewandte
  • Chemistry A European Journal
  • American Chemical Society

This guide provides a quick tour through PDFDataExtractor concepts and functionalities.

Features

  • Extracts metadata from scientific PDFs, including: title, author, abstract, journal name, journal year, journal volume, journal page number, DOI, keywords, figure captions, section titles, headings, page numbers and references

  • Chemistry-aware PDF information extraction

  • Outputs extracted articles as plain text or JSON

  • Extracts articles from six mainstream chemistry and physics publishers with high precision

  • Automated publisher detection
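
To make the ideas of automated publisher detection and JSON output more concrete, here is a minimal sketch in plain Python. It is not the PDFDataExtractor API: the names `ArticleMetadata`, `PUBLISHER_PATTERNS`, `detect_publisher`, `extract_metadata` and `to_json` are hypothetical, and the keyword matching is an assumed stand-in for whatever detection logic the toolkit actually uses. It operates on text that has already been extracted from a PDF.

```python
import json
import re
from dataclasses import asdict, dataclass, field

# Hypothetical record holding a few of the metadata fields listed above.
@dataclass
class ArticleMetadata:
    publisher: str
    title: str = ""
    doi: str = ""
    keywords: list = field(default_factory=list)

# Assumed keyword signatures; a real template would be far more detailed.
PUBLISHER_PATTERNS = {
    "Elsevier": re.compile(r"elsevier", re.IGNORECASE),
    "Royal Society of Chemistry": re.compile(r"royal society of chemistry", re.IGNORECASE),
    "American Chemical Society": re.compile(r"american chemical society", re.IGNORECASE),
}

# Loose DOI pattern: "10.", a 4-9 digit registrant code, "/", then a suffix.
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+")

def detect_publisher(text):
    """Return the first publisher whose signature appears in the text."""
    for name, pattern in PUBLISHER_PATTERNS.items():
        if pattern.search(text):
            return name
    return "unknown"

def extract_metadata(text):
    """Build a metadata record from already-extracted article text."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    title = lines[0] if lines else ""  # naive assumption: first line is the title
    match = DOI_PATTERN.search(text)
    doi = match.group(0).rstrip(".,;") if match else ""
    return ArticleMetadata(publisher=detect_publisher(text), title=title, doi=doi)

def to_json(meta):
    """Serialize the record as a JSON string."""
    return json.dumps(asdict(meta), indent=2)
```

Calling `to_json(extract_metadata(text))` on the text of an article yields a JSON object with `publisher`, `title`, `doi` and `keywords` keys. Keeping detection separate from extraction mirrors the template idea: once the publisher is known, a publisher-specific routine could replace the naive title and DOI heuristics used here.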

Features Under Development

  • Web services for a more user-friendly experience
  • Support for more publishers

Citing

PDFDataExtractor:

Zhu, M. and Cole, J., 2022. PDFDataExtractor: A Tool for Reading Scientific Text and Interpreting Metadata from the Typeset Literature in the Portable Document Format. Journal of Chemical Information and Modeling, 62(7), pp.1633-1643.

This project was financially supported by the Science and Technology Facilities Council (STFC), the Royal Academy of Engineering (RCSRF1819\7\10), and BASF.
