dannguyen/pdftotablestable


PDF-to-Tables Table

Making a table to compare how well tools extract tabular data from PDFs.

Tools

  • Tabula-Java: the scriptable engine behind Tabula, via OpenCollective.
  • pdfplumber: a high-level Python interface to pdfminer, created by Jeremy Singer-Vine
  • PDFTables.com - a cloud service from the Sensible Code Company
  • CometDocs - a cloud service with PDF-to-XLS functionality
  • ABBYY FineReader Pro for Mac - What I personally use. Costs money. No idea whether it behaves differently on other platforms. Has no automatable interface, but is very accurate and dependable.
  • Zamzar: Provides PDF-to-table functionality at the level of Google Drive, which is to say, not much at all. Might include it just as a baseline.
  • pdftotext: part of the Poppler suite. Its -layout option provides a conversion that attempts to visually emulate the original PDF, but the result is not delimited data.

Benchmark PDFs

Trying to collect publicly available PDFs that exhibit a wide range of real-world data-as-PDF characteristics. Feel free to suggest some!

Note: not interested in PDFs that contain scanned documents. Of the tools we're testing, only ABBYY has OCR-to-Excel functionality.

Todos

  • Come up with a methodology (what's the baseline?) to compare the results across tools. Fidelity to table structure is the main goal, but it is not the only way in which tools differ.

  • Makes more sense to split every PDF into separate pages. Easier to do unit testing of each program/service.

  • Think of functionality tests (handling of word-wrap, header/footer text) beyond just cell-by-cell accuracy.

  • Figure out a way to at least partly automate CometDocs, ABBYY, and the other services
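One possible starting point for the methodology and cell-by-cell accuracy todos is sketched below. It assumes a hand-verified baseline CSV exists for each page and scores a tool's output as the fraction of cell positions that match. The function names and the sample rows are made up for illustration, and a real comparison would probably also need fuzzier matching for whitespace and word-wrap.

```python
import csv
import io
from itertools import zip_longest


def read_cells(csv_text):
    """Parse CSV text into a list of rows, stripping whitespace from each cell."""
    return [[cell.strip() for cell in row] for row in csv.reader(io.StringIO(csv_text))]


def cell_accuracy(expected_rows, actual_rows):
    """Return the fraction of cell positions where the two tables agree.

    Positions present in only one table count as mismatches, so dropped or
    extra rows and columns lower the score.
    """
    total = 0
    matches = 0
    for exp_row, act_row in zip_longest(expected_rows, actual_rows, fillvalue=[]):
        for exp_cell, act_cell in zip_longest(exp_row, act_row, fillvalue=None):
            total += 1
            if exp_cell == act_cell:
                matches += 1
    return matches / total if total else 1.0


# Made-up baseline and tool output, not taken from any benchmark PDF.
baseline = "Company,City,Employees\nAcme Corp,Fresno,120\nBolt LLC,Oakland,45\n"
# The hypothetical tool split "Acme Corp" into two cells, shifting that row.
extracted = "Company,City,Employees\nAcme,Corp,Fresno,120\nBolt LLC,Oakland,45\n"

score = cell_accuracy(read_cells(baseline), read_cells(extracted))
print(f"cell accuracy: {score:.2f}")
```

A strict positional score like this punishes a single wrongly split cell heavily, because every cell after it in the row shifts. That may or may not be the behavior wanted from a "fidelity to table structure" metric.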

Prospective project tree:

  ├── README.md
  ├── pdfs
  │   └── ca-warn-2013
  │       ├── 001.pdf
  │       ├── 002.pdf
  │       └── 003.pdf
  └── results
      ├── pdfplumber
      │   └── ca-warn-2013
      │       ├── 001.csv
      │       ├── 002.csv
      │       └── 003.csv
      └── tabula-java
          └── ca-warn-2013
              ├── 001.csv
              ├── 002.csv
              └── 003.csv
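If the tree ends up looking like this, each page PDF maps mechanically to one result file per tool. A minimal sketch of that mapping follows; `result_path` is a hypothetical helper, and the layout it assumes is only the prospective one shown above.

```python
from pathlib import PurePosixPath


def result_path(pdf_path, tool, results_root="results"):
    """Map pdfs/<dataset>/<page>.pdf to results/<tool>/<dataset>/<page>.csv."""
    pdf = PurePosixPath(pdf_path)
    dataset = pdf.parent.name  # e.g. "ca-warn-2013"
    return PurePosixPath(results_root) / tool / dataset / (pdf.stem + ".csv")


pages = ["pdfs/ca-warn-2013/001.pdf", "pdfs/ca-warn-2013/002.pdf"]
for tool in ("pdfplumber", "tabula-java"):
    for page in pages:
        print(page, "->", result_path(page, tool))
```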

Example test suite and results

java -jar \
    bins/tabula-0.9.1-jar-with-dependencies.jar --pages all \
    pdfs/nypd-weekly-stats.pdf \
    > results/tabula-java/nypd-weekly-stats.csv

java -jar \
    bins/tabula-0.9.1-jar-with-dependencies.jar --pages all \
    pdfs/menlo-park-sunridge-cad-interface.pdf \
    > results/tabula-java/menlo-park-sunridge-cad-interface.csv
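The commands above could be wrapped in a small driver so that each new benchmark PDF does not need its own hand-written invocation. This is only a sketch: it reuses the jar path and `--pages all` option from the commands above, while the function names and the directory defaults are assumptions.

```python
import subprocess
from pathlib import Path

# Jar path copied from the commands above; adjust to wherever the jar lives.
TABULA_JAR = "bins/tabula-0.9.1-jar-with-dependencies.jar"


def tabula_command(pdf_path, jar=TABULA_JAR):
    """Build the java invocation shown above as an argument list."""
    return ["java", "-jar", jar, "--pages", "all", str(pdf_path)]


def run_tabula(pdf_path, results_dir="results/tabula-java"):
    """Run tabula-java on one PDF and write its stdout to <results_dir>/<name>.csv."""
    out_path = Path(results_dir) / (Path(pdf_path).stem + ".csv")
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with open(out_path, "wb") as out:
        # check=True raises CalledProcessError if tabula-java exits non-zero.
        subprocess.run(tabula_command(pdf_path), stdout=out, check=True)
    return out_path


def run_all(pdf_dir="pdfs"):
    """Run tabula-java on every PDF directly inside pdf_dir."""
    return [run_tabula(pdf) for pdf in sorted(Path(pdf_dir).glob("*.pdf"))]


# Show the command without executing it.
print(" ".join(tabula_command("pdfs/nypd-weekly-stats.pdf")))
```

Calling `run_all()` from the repository root would then regenerate everything under `results/tabula-java`, provided Java and the jar are available.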

About

Comparing the programs that extract tabular data from PDFs, e.g. ABBYY FineReader, Tabula, CometDocs
