Checkboxes and crosses: data mining PDFs with the help of image processing

Author: Markus Konrad markus.konrad@wzb.eu

Date: Oct 2018

A small example of how to extract data from a PDF with the help of image processing. See also the companion blog post.

Source of the sample PDF: https://www.berlin.de/sen/bildung/schule/berliner-schulen/schulverzeichnis/

Note: This is a stripped-down example that works with the sample PDF. Not all PDFs from this source can be read with it; some minor adjustments and fallbacks are necessary, for example to handle multi-line items in the table rows.
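As a rough illustration of the image processing idea in the title, and not necessarily what extract.py does: once the position of a checkbox on the page image is known, one simple way to decide whether it is crossed is to measure the share of dark pixels inside it. The sketch below uses plain Python lists so that it runs without any dependencies. The function names, the threshold values and the tiny sample regions are assumptions made up for this illustration.

```python
def dark_ratio(gray_rows, dark_below=128):
    """Fraction of pixels darker than `dark_below` in a grayscale region (values 0-255)."""
    pixels = [value for row in gray_rows for value in row]
    if not pixels:
        return 0.0
    return sum(1 for value in pixels if value < dark_below) / len(pixels)


def is_marked(gray_rows, min_ratio=0.2):
    """Treat a box as crossed if enough of its interior is dark (assumed threshold)."""
    return dark_ratio(gray_rows) >= min_ratio


# Hypothetical 5x5 interior of an empty checkbox: all white pixels.
empty_box = [[255] * 5 for _ in range(5)]

# The same region with a cross: both diagonals are black.
crossed_box = [
    [0 if col == row or col == 4 - row else 255 for col in range(5)]
    for row in range(5)
]

print(is_marked(empty_box), is_marked(crossed_box))
```

On real scans the threshold would need tuning, since the box border, noise and faint pen marks all change the ratio.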

This script uses pdftabextract to parse the XML representation of the PDF file. The companion tool pdf2xml-viewer can be used to investigate the XML representation of the PDF with its text boxes.
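To give a feel for what such an XML representation contains, here is a minimal sketch that uses only Python's standard library rather than pdftabextract itself. It assumes the XML follows the layout written by `pdftohtml -xml`, that is, `page` elements containing `text` elements with `top`, `left`, `width` and `height` attributes. The sample snippet and the `read_text_boxes` helper are hypothetical and only serve as illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample with two text boxes, assuming pdftohtml-style XML.
SAMPLE_XML = """<pdf2xml>
  <page number="1" width="892" height="1262">
    <text top="100" left="50" width="200" height="14" font="0">Item A</text>
    <text top="120" left="50" width="180" height="14" font="0"><b>Item B</b></text>
  </page>
</pdf2xml>
"""


def read_text_boxes(xml_string):
    """Return one dict per text box with its page, position, size and content."""
    root = ET.fromstring(xml_string)
    boxes = []
    for page in root.iter("page"):
        page_number = int(page.get("number"))
        for el in page.iter("text"):
            boxes.append({
                "page": page_number,
                "top": int(el.get("top")),
                "left": int(el.get("left")),
                "width": int(el.get("width")),
                "height": int(el.get("height")),
                # itertext() also collects text nested in tags such as <b>
                "text": "".join(el.itertext()).strip(),
            })
    return boxes


boxes = read_text_boxes(SAMPLE_XML)
for box in boxes:
    print(box["page"], box["top"], box["left"], box["text"])
```

Each text box carries its position on the page, which is what makes it possible to relate a piece of text to a region of the page image.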

To understand the script, I recommend executing it cell by cell. Cells are marked with #%% and can be run separately in IDEs such as Spyder or PyCharm.
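For readers who have not used cell marks before, a script structured this way looks roughly as follows. This is a made-up miniature, not an excerpt from extract.py; the variable names and data are hypothetical. Run as a plain script, the `#%%` lines are ordinary comments and everything executes top to bottom.

```python
#%% Cell 1: define some input (hypothetical data)
row_labels = ["Item A", "Item B", "Item C"]

#%% Cell 2: inspect an intermediate result before moving on
n_rows = len(row_labels)
print(n_rows)

#%% Cell 3: build on the variables created by the previous cells
numbered = list(enumerate(row_labels, start=1))
print(numbered)
```

Because all cells share one interpreter session, variables from earlier cells stay available, so each intermediate result can be inspected before the next step runs.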

Requirements

See requirements.txt.

  • OpenCV
  • pdftabextract
  • matplotlib
  • NumPy