Transforms PDF, Documents and Images into Enriched Structured Data

Turn your documents into data!

Français | 中文

Parsr is a minimal-footprint document (image, PDF) cleaning, parsing and extraction toolchain that generates readily available, organized and usable data for data scientists and developers.

It provides users with a clean, structured and label-enriched information set for ready-to-use applications, ranging from data entry and document analysis automation to archival, among many others.

Currently, Parsr can perform:

  • Document Hierarchy Regeneration - Words, Lines and Paragraphs
  • Headings Detection
  • Table Detection and Reconstruction
  • Lists Detection
  • Text Order Detection
  • Named Entity Recognition (Dates, Percentages, etc)
  • Key-Value Pair Detection (for the extraction of specific form-based entries)
  • Page Number Detection
  • Header-Footer Detection
  • Link Detection
  • Whitespace Removal

Parsr takes as input an image (.JPG, .PNG, .TIFF, ...) or a PDF and generates the following output formats:

  • JSON
  • Markdown
  • Text
  • CSV (for tables), or Pandas Dataframes (see here)
  • PDF
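
As a rough illustration of consuming the CSV table output from Python without Pandas, the sketch below parses CSV text into one dictionary per row using only the standard library. The `read_table` helper, the sample text, and the delimiter are assumptions made for this example; check the delimiter and header layout of the CSV that Parsr actually produces before relying on it.

```python
import csv
import io
from typing import Dict, List


def read_table(csv_text: str, delimiter: str = ",") -> List[Dict[str, str]]:
    """Parse CSV text into one dict per row, keyed by the header row.

    Hypothetical helper: the delimiter is a parameter because the exact
    CSV dialect of the exported tables is not stated in this README.
    """
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=delimiter)
    return [dict(row) for row in reader]


# Made-up sample data, only to show the shape of the result.
sample = "field;value\nrate;3%\n"
rows = read_table(sample, delimiter=";")
print(rows)
```

Each row comes back as a plain dictionary, which keeps the example free of third-party dependencies.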

Getting Started

Installation

-- The advanced installation guide is available here --

The quickest way to install and run the Parsr API is through the docker image:

docker pull axarev/parsr

If you also wish to install the GUI for sending documents and visualising results:

docker pull axarev/parsr-ui-localhost

Note: Parsr can also be installed bare-metal (not via Docker containers), the procedure for which is documented in the installation guide.

Usage

-- The advanced usage guide is available here --

To run the API, issue:

docker run -p 3001:3001 axarev/parsr

which will launch it on http://localhost:3001.
Consult the documentation on the usage of the API.
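
The container may take a few seconds to start listening after `docker run`. A minimal sketch, assuming the API is published on port 3001 of localhost as in the command above, is to poll the port until a TCP connection succeeds before sending any documents. The `wait_for_port` helper is hypothetical and not part of Parsr; it only checks that something accepts connections on the port, not that the API is fully initialised.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout.

    Hypothetical helper for illustration: it retries every `interval`
    seconds until roughly `timeout` seconds have elapsed.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # create_connection raises OSError if the port is closed or unreachable.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

Calling `wait_for_port("localhost", 3001)` after starting the container returns `True` as soon as the port accepts connections, and `False` if nothing is listening after about 30 seconds.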

  1. To use the Jupyter Notebook and the Python interface to the Parsr API, follow the guide here.
  2. To use the GUI tool (the API needs to already be running), issue:
    docker run -t -p 8080:80 axarev/parsr-ui-localhost:latest
    Then, access it through http://localhost:8080.

Refer to the Configuration documentation to interpret the configurable options in the GUI viewer.

API-based usage and command-line usage are documented in the advanced usage guide.

Documentation

All documentation files can be found here.

Contribute

Please refer to the contribution guidelines.

Third Party Licenses

Parsr depends on the following third-party libraries, listed with their licenses:

  1. QPDF: Apache http://qpdf.sourceforge.net
  2. GraphicsMagick: MIT http://www.graphicsmagick.org/index.html
  3. ImageMagick: Apache 2.0 https://imagemagick.org/script/license.php
  4. Pdfminer.six: MIT https://github.com/pdfminer/pdfminer.six/blob/master/LICENSE
  5. PDF.js: Apache 2.0 https://github.com/mozilla/pdf.js
  6. Tesseract: Apache 2.0 https://github.com/tesseract-ocr/tesseract
  7. Camelot: MIT https://github.com/camelot-dev/camelot
  8. MuPDF (Optional dependency): AGPL https://mupdf.com/license.html
  9. Pandoc (Optional dependency): GPL https://github.com/jgm/pandoc

License

Copyright 2019 AXA Group Operations S.A.
Licensed under the Apache 2.0 license (see the LICENSE file).
