Binary Python bindings for poppler utils for content extraction

pdflib

Python binding for poppler.

Installation

Using pip: pip install pdflib

From source:

  • Clone poppler source code and compile it:
git clone --branch poppler-0.63.0 --depth 1 https://anongit.freedesktop.org/git/poppler/poppler.git poppler_src
cd poppler_src/
cmake -DENABLE_SPLASH=OFF -DENABLE_UTILS=OFF -DENABLE_LIBOPENJPEG=none .
make
  • Set the POPPLER_ROOT environment variable to the poppler source directory:
export POPPLER_ROOT=/pdflib/poppler_src/
  • Install cython
pip install cython
  • Build extension
python setup.py build_ext --inplace
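
Before running the build step, it can help to confirm that the environment variable is set and points at a real directory. The snippet below is a minimal sketch, not part of pdflib: `check_poppler_root` is a hypothetical helper, and it assumes the build reads `POPPLER_ROOT` as the export line above suggests.

```python
import os


def check_poppler_root(env=os.environ):
    """Return an error message if POPPLER_ROOT looks wrong, else None.

    Hypothetical helper: it only inspects the environment mapping and
    the filesystem, it does not verify that poppler was compiled.
    """
    root = env.get("POPPLER_ROOT")
    if not root:
        return "POPPLER_ROOT is not set"
    if not os.path.isdir(root):
        return "POPPLER_ROOT does not point to a directory: %s" % root
    return None


problem = check_poppler_root()
if problem:
    print(problem)
```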

Usage

>>> from pdflib import Document
>>> doc = Document("path/to/file.pdf")

Getting metadata

>>> print(doc.metadata)
>>> print(doc.xmp_metadata)
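
To store the metadata rather than print it, one option is to serialise it to JSON. This is a sketch under an assumption: it treats `doc.metadata` as a dict-like mapping, which the README does not state explicitly. `metadata_to_json` is a hypothetical helper name, not a pdflib function.

```python
import json


def metadata_to_json(metadata):
    """Serialise a metadata mapping to JSON, dropping empty values.

    Assumes `metadata` can be converted with dict(); values that JSON
    cannot encode directly are converted with str().
    """
    cleaned = {
        str(key): value
        for key, value in dict(metadata).items()
        if value not in (None, "")
    }
    return json.dumps(cleaned, indent=2, sort_keys=True, default=str)
```

With an open document this would be called as `metadata_to_json(doc.metadata)`.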

Getting text content of each page

>>> for page in doc:
...     print(' \n'.join(page.lines).strip())
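
The loop above can be wrapped in a small generator when the text is needed per page rather than printed. This is a sketch that relies only on what the README shows: iterating the document yields pages, and each page has a `lines` sequence of strings. The names `page_texts` and `extract_text` are hypothetical, and joining with a plain newline after trimming trailing whitespace is a choice made here, not the library's behaviour.

```python
def page_texts(doc):
    """Yield (page_number, text) pairs, numbering pages from 1."""
    for number, page in enumerate(doc, start=1):
        # Trim trailing whitespace per line, then join with newlines.
        text = "\n".join(line.rstrip() for line in page.lines).strip()
        yield number, text


def extract_text(path):
    """Open a PDF with pdflib and return {page_number: text}."""
    # Imported lazily so page_texts stays usable without pdflib installed.
    from pdflib import Document

    return dict(page_texts(Document(path)))
```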

Getting images from each page

>>> for page in doc:
...     page.extract_images(path='images', prefix='img')
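
The README does not say whether `extract_images` creates the output directory, so a cautious wrapper can create it first. The following is a sketch: `extract_all_images` is a hypothetical helper, and it uses only the `extract_images(path=..., prefix=...)` call shown above.

```python
import os


def extract_all_images(doc, out_dir="images", prefix="img"):
    """Extract images from every page and list what ended up in out_dir."""
    # Create the target directory up front; harmless if it already exists.
    os.makedirs(out_dir, exist_ok=True)
    for page in doc:
        page.extract_images(path=out_dir, prefix=prefix)
    # Return the directory listing so callers can see the written files.
    return sorted(os.listdir(out_dir))
```

The returned list reflects everything in the directory, including files that were there before the call, so an empty or dedicated directory gives the clearest result.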

LICENSE

pdflib is available under GPL v3 (poppler is GPL).
