Transforms PDF, Documents and Images into Enriched Structured Data


Turn your documents into data!

Français | 中文

Parsr is a minimal-footprint document (image, PDF) cleaning, parsing and extraction toolchain that generates readily available, organized and usable data for data scientists and developers.

It provides users with a clean, structured and label-enriched information set for ready-to-use applications, ranging from data entry and document analysis automation to archival and many others.

Currently, Parsr can perform:

  • Document Hierarchy Regeneration - Words, Lines and Paragraphs
  • Headings Detection
  • Table Detection and Reconstruction
  • Lists Detection
  • Text Order Detection
  • Named Entity Recognition (Dates, Percentages, etc)
  • Key-Value Pair Detection (for the extraction of specific form-based entries)
  • Page Number Detection
  • Header-Footer Detection
  • Link Detection
  • Whitespace Removal

Parsr takes an image (.JPG, .PNG, .TIFF, ...) or a PDF as input and generates the following output formats:

  • JSON
  • Markdown
  • Text
  • CSV (for tables), or Pandas Dataframes (see here)
  • PDF

Getting Started

Installation

-- The advanced installation guide is available here --

The quickest way to install and run the Parsr API is through the Docker image:

docker pull axarev/parsr

If you also wish to install the GUI for sending documents and visualising results:

docker pull axarev/parsr-ui-localhost

Note: Parsr can also be installed bare-metal (not via Docker containers); the procedure is documented in the installation guide.

Usage

-- The advanced usage guide is available here --

To run the API, issue:

docker run -p 3001:3001 axarev/parsr

which will launch it on http://localhost:3001.
Consult the documentation on the usage of the API.
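
The container can take a moment to start, so a script that talks to the API may want to wait until the port is accepting connections. The following is a minimal sketch using only the Python standard library; `wait_for_port` is a hypothetical helper name, not part of Parsr, and it checks only that something is listening on the port, not that the API itself is healthy.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout.

    Hypothetical helper for illustration; it is not part of Parsr.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # create_connection raises OSError while nothing is listening yet.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

With the `docker run -p 3001:3001 axarev/parsr` command above, calling `wait_for_port("localhost", 3001)` should return `True` once the published port is reachable.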

  1. To use the Jupyter Notebook and the Python interface to the Parsr API, follow here.
  2. To use the GUI tool (the API needs to already be running), issue:
    docker run -t -p 8080:80 axarev/parsr-ui-localhost:latest
    Then, access it through http://localhost:8080.

Refer to the Configuration documentation to interpret the configurable options in the GUI viewer.

API-based usage and command-line usage are documented in the advanced usage guide.

Documentation

All documentation files can be found here.

Contribute

Please refer to the contribution guidelines.

Third Party Licenses

Parsr depends on the following third-party libraries, listed with their licenses:

  1. QPDF: Apache http://qpdf.sourceforge.net
  2. GraphicsMagick: MIT http://www.graphicsmagick.org/index.html
  3. ImageMagick: Apache 2.0 https://imagemagick.org/script/license.php
  4. Pdfminer.six: MIT https://github.com/pdfminer/pdfminer.six/blob/master/LICENSE
  5. PDF.js: Apache 2.0 https://github.com/mozilla/pdf.js
  6. Tesseract: Apache 2.0 https://github.com/tesseract-ocr/tesseract
  7. Camelot: MIT https://github.com/camelot-dev/camelot
  8. MuPDF (Optional dependency): AGPL https://mupdf.com/license.html
  9. Pandoc (Optional dependency): GPL https://github.com/jgm/pandoc

License

Copyright 2019 AXA Group Operations S.A.
Licensed under the Apache 2.0 license (see the LICENSE file).
