PDF invoice data extractor

Features

INPUT: PDFs with OCR text layer
OUTPUT: Key-value pairs of invoice data
Extracts invoice data by evaluating multiple regular expressions against the plain text of the PDF document
Provides a GUI for data validation
Electron app with Node.js backend
Vendor matching based on VAT number (CHE-xxx.xxx.xxx) or IBAN, using a separate CSV list of suppliers
Export format options: JSON, XML
Focus on scanned and OCR'd documents with a scan resolution of 300 dpi.
NO OCR! This software does not perform any OCR on the PDF documents. It merely extracts the existing text content.
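
The vendor matching listed above could look roughly like the following sketch. It is an illustration only, not the project's actual code: the CSV layout (name;vat;iban), the sample supplier, the function names parseSuppliersCsv and matchVendor, and the order of matching (VAT number first, then IBAN) are all assumptions made for this example.

```javascript
// Hypothetical supplier list; the CSV layout used by the project may differ.
const SUPPLIERS_CSV = [
  'name;vat;iban',
  'Example AG;CHE-123.456.789;CH9300762011623852957',
].join('\n');

// Swiss VAT number in the form CHE-xxx.xxx.xxx.
const VAT_PATTERN = /CHE-\d{3}\.\d{3}\.\d{3}/;
// Swiss IBAN: "CH", 2 check digits, 17 more characters, optionally space-grouped.
const IBAN_PATTERN = /\bCH\d{2}(?: ?[0-9A-Z]{4}){4} ?[0-9A-Z]\b/;

// Parse the semicolon-separated list into supplier objects (header row skipped).
function parseSuppliersCsv(csv) {
  return csv.trim().split(/\r?\n/).slice(1).map((line) => {
    const [name, vat, iban] = line.split(';').map((cell) => cell.trim());
    return { name, vat, iban };
  });
}

// Return the supplier whose VAT number or IBAN occurs in the text, or null.
function matchVendor(text, suppliers) {
  const vatMatch = VAT_PATTERN.exec(text);
  if (vatMatch) {
    const byVat = suppliers.find((s) => s.vat === vatMatch[0]);
    if (byVat) return byVat;
  }
  const ibanMatch = IBAN_PATTERN.exec(text);
  if (ibanMatch) {
    // Remove the grouping spaces before comparing with the stored IBAN.
    const iban = ibanMatch[0].replace(/\s/g, '');
    const byIban = suppliers.find((s) => s.iban === iban);
    if (byIban) return byIban;
  }
  return null;
}

const suppliers = parseSuppliersCsv(SUPPLIERS_CSV);
console.log(matchVendor('MWST-Nr. CHE-123.456.789 MWST', suppliers));
```

Normalizing the IBAN before the lookup matters for scanned invoices, where the number is usually printed in groups of four characters.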

Screenshots

Motivation

I needed a way to extract textual data from PDF documents, specifically invoices.
All available solutions required uploading the PDF documents to a web service, which was not an option.
Furthermore, I required a solution to easily validate the extracted data with a GUI.

Data extraction

Regular expressions are used to extract the field values from a text-only version of the PDF document.

PDF extraction pipeline

                extract text from
                text layer in PDF           regex
1 PDF document -------------------> 1 text -------> 1-N key-value pairs
                                                     (extracted_data)
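
The regex step of this pipeline can be sketched as follows. This is a minimal illustration under assumptions, not the project's Extractor classes: the field names, the patterns, the sample text, and the function name extractFields are hypothetical. Each field has a list of candidate patterns that are tried in order, and the first capture group of the first matching pattern becomes the field value.

```javascript
// Hypothetical field definitions: several candidate patterns per field.
const FIELD_PATTERNS = {
  invoice_number: [
    /Invoice\s*(?:No\.?|Number)\s*[:#]?\s*([A-Z0-9-]+)/i,
    /Rechnung\s*(?:Nr\.?)\s*[:#]?\s*([A-Z0-9-]+)/i,
  ],
  invoice_date: [/\b(\d{1,2}\.\d{1,2}\.\d{4})\b/],
  total_amount: [/Total\s*(?:CHF)?\s*([\d']+\.\d{2})/i],
};

// Turn the plain text of one PDF document into key-value pairs.
// Fields for which no pattern matches are left out of the result.
function extractFields(text, fieldPatterns = FIELD_PATTERNS) {
  const extractedData = {};
  for (const [field, patterns] of Object.entries(fieldPatterns)) {
    for (const pattern of patterns) {
      const match = pattern.exec(text);
      if (match) {
        extractedData[field] = match[1].trim();
        break; // the first matching pattern wins
      }
    }
  }
  return extractedData;
}

// Sample text standing in for the text layer of a scanned invoice.
const sampleText = [
  'Example AG',
  'Invoice No: 2019-0042',
  'Date 14.10.2019',
  "Total CHF 1'234.50",
].join('\n');

console.log(extractFields(sampleText));
```

Keeping several patterns per field is one way to handle invoices from different vendors or in different languages without changing the extraction loop.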

Validation pipeline

                     user validation
1-N key-value pairs ----------------> 1-N key-value pairs
 (extracted_data)                      (validated_data)
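
The validation step can be sketched in the same spirit. In the application the corrections come from the user through the GUI; here they are passed in as a plain object. The validator rules, the field names, and the function name validateData are assumptions for illustration and do not reflect the project's Validator classes.

```javascript
// Hypothetical per-field validators; each returns true for an acceptable value.
const VALIDATORS = {
  invoice_number: (value) => /^[A-Z0-9-]+$/i.test(value),
  total_amount: (value) => /^[\d']+\.\d{2}$/.test(value),
};

// Combine extracted_data with user corrections into validated_data.
// A corrected value replaces the extracted one; fields that still fail
// their validator are reported instead of being copied to the result.
function validateData(extracted, corrections = {}, validators = VALIDATORS) {
  const validatedData = {};
  const invalidFields = [];
  for (const field of Object.keys(extracted)) {
    const value = field in corrections ? corrections[field] : extracted[field];
    const isValid = validators[field] ? validators[field](value) : true;
    if (isValid) {
      validatedData[field] = value;
    } else {
      invalidFields.push(field);
    }
  }
  return { validatedData, invalidFields };
}

// Sample extracted_data in which the decimal separator was read as a comma.
const sampleExtracted = { invoice_number: '2019-0042', total_amount: '1234,50' };

const beforeCorrection = validateData(sampleExtracted);
const afterCorrection = validateData(sampleExtracted, { total_amount: '1234.50' });

console.log(beforeCorrection.invalidFields);
// JSON export of the validated key-value pairs.
console.log(JSON.stringify(afterCorrection.validatedData, null, 2));
```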

JS file invocation

main.js
- public/web/index_frontend_controller.html
  - ../build/pdf.js
  - viewer.js
    - ../build/pdf.js
  - ../extractor/FrontendController.js
    - ../extractor/SuppliersLoader.js
    - ../extractor/Sidebar.js
      - ../extractor/SidebarField.js
        - Validator{..}.js (ValidatorClass)
    - ../extractor/Queue.js
- ./public/extractor/BackendController.js
  - ./PdfExtractJob.js
    - Extractor{..}.js (ExtractorClass)
      - ./Extractor.js (Base class)
  - ./SuppliersLoader.js
  - ./Queue.js

Installation and Usage

Clone the Git repository
cd into the directory
Install the dependencies on the command line: npm install
Start the application with: npm start

npm-dependencies

These dependencies are installed by running npm install:

express: 4.17.1
js2xmlparser: 4.0.0
pdf.js-extract: 0.1.3
electron: 6.0.12

Packaged libraries

A pre-built version of the PDF.js viewer is included in this project.

pdfjs-2.2.228-dist
