- INPUT: PDFs with OCR text layer
- OUTPUT: Key-value pairs of invoice data
- Extracts invoice data by evaluating multiple regular expressions against the plain text of the PDF document
- Provides a GUI for data validation
- Electron app with Node.js backend
- Vendor matching based on VAT number (CHE-xxx.xxx.xxx) and IBAN, looked up in a separate CSV list
- Export format options: JSON, XML
- Focus on scanned and OCR'd documents with a scan resolution of 300 dpi.
- NO OCR! This software does not perform any OCR on the PDF documents. It merely extracts the existing text content.
- I needed a way to extract textual data from PDF documents, specifically invoices.
- All available solutions required uploading the PDF documents to a web service, which was not an option.
- Furthermore, I needed a way to easily validate the extracted data in a GUI.
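To illustrate the vendor matching mentioned in the feature list above, here is a minimal sketch of how a supplier could be looked up by VAT number or IBAN. It is not the project's actual `SuppliersLoader` code: the semicolon-separated CSV layout, the column names (`name`, `vat_number`, `iban`) and the function names `parseSuppliersCsv`, `normalise` and `matchSupplier` are all assumptions made for this example.

```javascript
// Parse a simple semicolon-separated supplier list (no quoting support).
// Assumed (hypothetical) columns: name;vat_number;iban
function parseSuppliersCsv(csvText) {
  const [header, ...rows] = csvText.trim().split(/\r?\n/);
  const columns = header.split(";").map((c) => c.trim());
  return rows
    .filter((row) => row.trim() !== "")
    .map((row) => {
      const values = row.split(";").map((v) => v.trim());
      return Object.fromEntries(columns.map((c, i) => [c, values[i] || ""]));
    });
}

// Strip separators so "CHE-123.456.789" and "CHE123456789" compare equal,
// and IBANs match regardless of how they are spaced.
function normalise(value) {
  return (value || "").replace(/[^0-9a-z]/gi, "").toUpperCase();
}

// Return the first supplier whose VAT number or IBAN occurs in the text,
// or null if none matches.
function matchSupplier(text, suppliers) {
  const vatNumbers = (text.match(/CHE-?\d{3}\.?\d{3}\.?\d{3}/gi) || []).map(normalise);
  const haystack = normalise(text);
  return (
    suppliers.find((s) => {
      const vat = normalise(s.vat_number);
      const iban = normalise(s.iban);
      return (vat && vatNumbers.includes(vat)) || (iban && haystack.includes(iban));
    }) || null
  );
}

// Example with made-up supplier data.
const exampleCsv = [
  "name;vat_number;iban",
  "Acme AG;CHE-123.456.789;CH93 0076 2011 6238 5295 7",
  "Beta GmbH;CHE-987.654.321;CH56 0483 5012 3456 7800 9",
].join("\n");
const exampleSuppliers = parseSuppliersCsv(exampleCsv);
console.log(matchSupplier("MWST-Nr. CHE-987.654.321", exampleSuppliers));
```

Matching on normalised identifiers makes the lookup tolerant of OCR output that drops or changes dots, dashes and spaces, though it cannot correct misread digits.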
Regular expressions are used to extract the field values from a text-only version of the PDF document.
                    extract text from
                    text layer in PDF           regex
    1 PDF document -------------------> 1 text -------> 1-N key-value pairs
                                                        (extracted_data)

                         user validation
    1-N key-value pairs ----------------> 1-N key-value pairs
    (extracted_data)                      (validated_data)
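The regex step of this flow can be sketched as follows. This is only an illustration of the idea, not the project's `Extractor` classes: the field names, the patterns in `FIELD_PATTERNS` and the function `extractFields` are hypothetical, and real invoices will need patterns tuned to each supplier's layout.

```javascript
// Hypothetical field patterns; the project's extractor classes define their own.
const FIELD_PATTERNS = {
  // Swiss VAT number, e.g. CHE-123.456.789
  vat_number: [/CHE-\d{3}\.\d{3}\.\d{3}/],
  // Invoice number after a German or an English label
  invoice_number: [
    /Rechnungsnummer[:\s]+(\S+)/i,
    /Invoice\s*(?:No\.?|Number)[:\s]+(\S+)/i,
  ],
  // Total amount such as "Total CHF 1'234.50"
  total: [/Total\s+(?:CHF\s+)?([\d']+\.\d{2})/i],
};

// For each field, try its patterns in order and keep the first match.
function extractFields(text, patterns = FIELD_PATTERNS) {
  const extractedData = {};
  for (const [field, regexes] of Object.entries(patterns)) {
    for (const regex of regexes) {
      const match = regex.exec(text);
      if (match) {
        // Use the first capture group if the pattern has one, else the whole match.
        extractedData[field] = match[1] !== undefined ? match[1] : match[0];
        break;
      }
    }
  }
  return extractedData;
}

// Example with made-up invoice text.
const sampleText = "Invoice No: 2019-0042\nMWST CHE-123.456.789\nTotal CHF 1'234.50";
console.log(extractFields(sampleText));
```

Fields without a match are simply absent from the result, which leaves it to the validation step in the GUI to fill them in or correct them.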
- main.js
- public/web/index_frontend_controller.html
- ../build/pdf.js
- viewer.js
- ../build/pdf.js
- ../extractor/FrontendController.js
- ../extractor/SuppliersLoader.js
- ../extractor/Sidebar.js
- ../extractor/SidebarField.js
- Validator{..}.js (ValidatorClass)
- ../extractor/SidebarField.js
- ../extractor/Queue.js
- ./public/extractor/BackendController.js
- ./PdfExtractJob.js
- Extractor{..}.js (ExtractorClass)
- ./Extractor.js (Base class)
- Extractor{..}.js (ExtractorClass)
- ./SuppliersLoader.js
- ./Queue.js
- ./PdfExtractJob.js
- public/web/index_frontend_controller.html
- Clone the Git repository
- cd into the project directory
- install the dependencies on the command line:
npm install
- start with:
npm start
These dependencies are installed by running npm install:
- express: 4.17.1
- js2xmlparser: 4.0.0
- pdf.js-extract: 0.1.3
- electron: 6.0.12
A pre-built version of the PDF.js viewer is included in this project.