PDF-to-Tables Table

Making a table to compare how well tools extract tabular data from PDFs.

Tools

Tabula-Java: the scriptable engine behind Tabula, via OpenCollective.
pdfplumber: a high-level Python interface to pdfminer, created by Jeremy Singer-Vine
PDFTables.com: a cloud service from the Sensible Code Company
CometDocs: a cloud service with PDF-to-XLS functionality
ABBYY FineReader Pro for Mac: what I personally use. Costs money and has no automatable interface, but it is very accurate and dependable. I don't know whether the versions for other platforms differ.
Zamzar: provides PDF-to-table functionality at the level of Google Drive, which is to say, not much at all. Might include it just as a baseline.
pdftotext: part of the Poppler suite. Its -layout option provides a conversion that attempts to visually emulate the original PDF, but the result is not delimited data.
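
To illustrate why layout-style text is not delimited data, here is a minimal sketch. It assumes a naive heuristic (any run of two or more spaces is a column boundary), which is not something pdftotext provides; the function name `split_layout_line` and the sample rows are made up for illustration.

```python
import re


def split_layout_line(line):
    """Naively split one line of layout-style text into cells.

    Heuristic (an assumption, not a pdftotext feature): treat any run of
    two or more spaces as a column boundary.
    """
    return re.split(r" {2,}", line.strip())


# Made-up rows in the style of layout output, not taken from a benchmark PDF.
full_row = "Acme Corp      Oakland      120"
gap_row = "Acme Corp                   120"  # the middle cell is empty

print(split_layout_line(full_row))  # ['Acme Corp', 'Oakland', '120']
print(split_layout_line(gap_row))   # ['Acme Corp', '120']
```

The second row loses its empty middle cell, so the remaining values shift into the wrong columns. Recovering the columns would require reasoning about character positions, which is the job the table-extraction tools above take on.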

Benchmark PDFs

Trying to collect publicly available PDFs that exhibit a wide range of real-world data-as-PDF characteristics. Feel free to suggest some!

Note: not interested in PDFs that contain scanned documents. Of the tools we're testing, only ABBYY has OCR-to-Excel functionality.

pdfs/ca-warn-2013.pdf: California 2013 WARN Report
pdfs/ca-warn-7-1-2016.pdf: California WARN notices processed from July 1, 2016 through September 25, 2016
pdfs/hcctransparency-report-2010.pdf: Janssen Pharmaceuticals report, January to December 2010, via ProPublica's Dollars for Docs (D4D)
pdfs/condemnedinmatelistsecure.pdf: California Condemned Inmate List
pdfs/nypd-weekly-stats.pdf: NYPD Weekly Crime Stats
pdfs/menlo-park-sunridge-cad-interface.pdf: a PDF containing one actual table and a few pages of screenshots
pdfs/nics-firearm-background-checks.pdf: NICS Firearm Background Checks, as copied from jsvine/pdfplumber

Todos

Come up with a methodology (what's the baseline?) to compare the results across tools. Fidelity to table structure is the main goal, but not the only way that tools differ.
Split every PDF into separate pages, which makes it easier to unit test each program/service.
Think of functionality tests (handling of word-wrap, header/footer text) beyond just cell-by-cell accuracy.
Figure out a way to automate CometDocs, ABBYY, and other services.
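
One possible starting point for the methodology is to score each tool's CSV against a hand-verified CSV of the same page, cell by cell. The sketch below is only an assumption about how that could work, not a settled method. The names `read_cells` and `cell_accuracy` are hypothetical, and collapsing whitespace inside cells is a design choice that deliberately ignores word-wrap differences.

```python
import csv
import io
from itertools import zip_longest


def read_cells(csv_text):
    """Parse CSV text into rows, collapsing runs of whitespace in each cell."""
    rows = csv.reader(io.StringIO(csv_text))
    return [[" ".join(cell.split()) for cell in row] for row in rows]


def cell_accuracy(expected_csv, actual_csv):
    """Return the fraction of cell positions where both tables agree.

    Missing rows or cells count as mismatches, so a tool that drops or
    adds a row or column is penalized for the damaged structure.
    """
    expected = read_cells(expected_csv)
    actual = read_cells(actual_csv)
    matches = total = 0
    for exp_row, act_row in zip_longest(expected, actual, fillvalue=[]):
        for exp_cell, act_cell in zip_longest(exp_row, act_row, fillvalue=None):
            total += 1
            matches += exp_cell == act_cell
    return matches / total if total else 1.0


# Toy data for illustration only, not taken from a benchmark PDF.
expected = "name,count\nAcme,120\n"
actual = "name,count\nAcme,12O\n"  # letter O misread in place of a zero
print(cell_accuracy(expected, actual))  # 0.75
```

A single score like this hides where a tool went wrong, so it would probably need to sit alongside the functionality tests mentioned above rather than replace them.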

Prospective project tree:

  ├── README.md
  ├── pdfs
  │   └── ca-warn-2013
  │       ├── 001.pdf
  │       ├── 002.pdf
  │       └── 003.pdf
  └── results
      ├── pdfplumber
      │   └── ca-warn-2013
      │       ├── 001.csv
      │       ├── 002.csv
      │       └── 003.csv
      └── tabula-java
          └── ca-warn-2013
              ├── 001.csv
              ├── 002.csv
              └── 003.csv
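
If the tree above is adopted, the path conventions could be captured in a few helpers so that every tool reads and writes the same locations. This is a sketch under that assumption; the function names and the three-digit page numbering are inferred from the tree, not from existing code.

```python
from pathlib import Path


def page_pdf_path(pdf_name, page, root="."):
    """Path of one single-page PDF, e.g. pdfs/ca-warn-2013/001.pdf."""
    return Path(root) / "pdfs" / pdf_name / f"{page:03d}.pdf"


def result_path(tool, pdf_name, page, root="."):
    """Path of one tool's CSV for one page, e.g. results/tabula-java/ca-warn-2013/001.csv."""
    return Path(root) / "results" / tool / pdf_name / f"{page:03d}.csv"


def missing_results(tools, root="."):
    """List (tool, pdf_name, page) for every page PDF that has no CSV yet."""
    root = Path(root)
    missing = []
    # Only match per-page files named with three digits, such as 001.pdf.
    for page_pdf in sorted((root / "pdfs").glob("*/[0-9][0-9][0-9].pdf")):
        pdf_name, page = page_pdf.parent.name, int(page_pdf.stem)
        for tool in tools:
            if not result_path(tool, pdf_name, page, root).exists():
                missing.append((tool, pdf_name, page))
    return missing
```

Calling `missing_results(["pdfplumber", "tabula-java"])` from the project root would then show which page/tool combinations still need to be run.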

Example test suite and results

# Extract the tables from every page of each benchmark PDF.
# tabula-java writes CSV to stdout by default, so redirect it into results/.
java -jar \
    bins/tabula-0.9.1-jar-with-dependencies.jar --pages all \
    pdfs/nypd-weekly-stats.pdf \
    > results/tabula-java/nypd-weekly-stats.csv

java -jar \
    bins/tabula-0.9.1-jar-with-dependencies.jar --pages all \
    pdfs/menlo-park-sunridge-cad-interface.pdf \
    > results/tabula-java/menlo-park-sunridge-cad-interface.csv
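
For comparison, a pdfplumber counterpart to the tabula-java commands above might look like the following. It is a sketch, not a tested part of this benchmark: it assumes pdfplumber is installed and uses its default table-detection settings, and the function names `clean_rows` and `pdf_to_csv` are hypothetical.

```python
import csv


def clean_rows(table):
    """Turn one extracted table into CSV-ready rows.

    pdfplumber represents empty cells as None; write them as empty strings.
    """
    return [["" if cell is None else cell for cell in row] for row in table]


def pdf_to_csv(pdf_path, out):
    """Write every table that pdfplumber finds in pdf_path to out as CSV."""
    import pdfplumber  # third-party; imported here so clean_rows works without it

    writer = csv.writer(out)
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            for table in page.extract_tables():
                writer.writerows(clean_rows(table))


# Example call (output path mirrors the tabula-java layout):
# with open("results/pdfplumber/nypd-weekly-stats.csv", "w", newline="") as f:
#     pdf_to_csv("pdfs/nypd-weekly-stats.pdf", f)
```

Like the tabula-java commands, this concatenates all tables from all pages into one CSV, which is another argument for splitting the PDFs into single pages before comparing tools.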

dannguyen/pdftotablestable
