Turn your documents into data!
Parsr is a minimal-footprint document (image, PDF) cleaning, parsing and extraction toolchain that generates readily available, organized and usable data for data scientists and developers.
It provides users with a clean, structured and label-enriched information set for ready-to-use applications, ranging from data entry and document analysis automation to archival and many others.
Currently, Parsr can perform:
- Document Hierarchy Regeneration - Words, Lines and Paragraphs
- Headings Detection
- Table Detection and Reconstruction
- Lists Detection
- Text Order Detection
- Named Entity Recognition (Dates, Percentages, etc.)
- Key-Value Pair Detection (for the extraction of specific form-based entries)
- Page Number Detection
- Header-Footer Detection
- Link Detection
- Whitespace Removal
Parsr takes as input an image (.JPG, .PNG, .TIFF, ...) or a PDF and generates the following output formats:
- CSV (for tables), or Pandas Dataframes (see here)
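As a rough illustration of consuming a table exported as CSV, the sketch below parses CSV text with Python's standard `csv` module. The sample data, the `read_table_csv` helper, the comma delimiter and the choice to treat the first row as a header are all assumptions made for this example; they do not describe Parsr's actual output.

```python
import csv
import io


def read_table_csv(text, delimiter=","):
    """Parse CSV text for one table into (header, rows).

    Hypothetical helper: the delimiter and the first-row-is-header
    convention are assumptions, not documented Parsr behaviour.
    """
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    if not rows:
        return [], []
    return rows[0], rows[1:]


# Made-up sample table, not real Parsr output.
sample = "Item,Quantity\nWidget,3\nGadget,5\n"
header, records = read_table_csv(sample)
print(header)
print(records)
```

If the exported file uses a different separator, pass it through the `delimiter` argument.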
-- The advanced installation guide is available here --
The quickest way to install and run the Parsr API is through the docker image:
docker pull axarev/parsr
If you also wish to install the GUI for sending documents and visualising results:
docker pull axarev/parsr-ui-localhost
Note: Parsr can also be installed bare-metal (not via Docker containers), the procedure for which is documented in the installation guide.
-- The advanced usage guide is available here --
To run the API, issue:
docker run -p 3001:3001 axarev/parsr
- To use the Jupyter Notebook and the Python interface to the Parsr API, follow here.
- To use the GUI tool (the API needs to already be running), issue:
docker run -t -p 8080:80 axarev/parsr-ui-localhost:latest
Then, access it through http://localhost:8080.
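Since the GUI expects the API to be running already, it can help to confirm that something is listening on the API port before starting it. The following is a minimal sketch using only Python's standard library; the `is_port_open` helper is hypothetical, and it assumes the API was published on `localhost:3001` as in the `docker run -p 3001:3001` command. A successful TCP connection only shows that the port accepts connections, not that the API itself is ready to process documents.

```python
import socket


def is_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and name resolution failures.
        return False


if __name__ == "__main__":
    # Assumption: the API container publishes port 3001 on localhost.
    print("API port reachable:", is_port_open("localhost", 3001))
```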
Refer to the Configuration documentation to interpret the configurable options in the GUI viewer.
All documentation files can be found here.
Please refer to the contribution guidelines.
Third Party Licenses
Parsr depends on the following third-party libraries, listed with their licenses:
- QPDF: Apache http://qpdf.sourceforge.net
- GraphicsMagick: MIT http://www.graphicsmagick.org/index.html
- ImageMagick: Apache 2.0 https://imagemagick.org/script/license.php
- Pdfminer.six: MIT https://github.com/pdfminer/pdfminer.six/blob/master/LICENSE
- PDF.js: Apache 2.0 https://github.com/mozilla/pdf.js
- Tesseract: Apache 2.0 https://github.com/tesseract-ocr/tesseract
- Camelot: MIT https://github.com/camelot-dev/camelot
- MuPDF (Optional dependency): AGPL https://mupdf.com/license.html
- Pandoc (Optional dependency): GPL https://github.com/jgm/pandoc