A simple Python wrapper for the great Docsplit utility from DocumentCloud.

Please feel free to file issues, fork and extend!



Follow the instructions to install the original Docsplit here:

Put the pydocsplit folder on your Python path and change the DOCSPLIT_JAVA_ROOT setting to point to your installation of the Ruby gem.
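If you would rather not edit PYTHONPATH globally, a script can extend the import path at runtime. This is only a minimal sketch: the helper name `add_to_python_path` and the directory shown are hypothetical, and it assumes pydocsplit is importable as a package, so the directory added is the one that contains the pydocsplit folder.

```python
import os
import sys


def add_to_python_path(folder):
    """Prepend folder to sys.path (only once) and return its absolute path."""
    folder = os.path.abspath(folder)
    if folder not in sys.path:
        sys.path.insert(0, folder)
    return folder


# Hypothetical location: the directory that contains the pydocsplit folder.
add_to_python_path("/path/to/parent/of/pydocsplit")
```

After this call, `from pydocsplit import Docsplit` should resolve, provided the path really points at your copy.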

Remember to run OpenOffice in headless mode if you want to convert documents to PDF. See the Docsplit docs for how to do this:


from pydocsplit import Docsplit

d = Docsplit()

# Convert an office document to PDF (needs OpenOffice in headless mode)
d.extract_pdf('/path/to/my/document.doc', output='/path/to/outputdir/')
# Extract pages 1-2 of a PDF
d.extract_pages('/path/to/my/pdffile.pdf', output='/path/to/outputdir/', pages='1-2')
# Extract the text of a PDF
d.extract_text('/path/to/my/pdffile.pdf', output='/path/to/outputdir/', returntext=True)
# Render selected pages as images in several sizes and formats
d.extract_images('/path/to/my/pdffile.pdf', output='/path/to/outputdir/', sizes=['500x', '250x'], formats=['png', 'jpg'], pages=[1,2,5,7])
# Read a single metadata field
documenttitle = d.extract_meta('/path/to/my/pdffile.pdf', 'title')


  • Support multiple pdfs as input
  • Enhance parsing of pages options/ranges
  • Fix page numbers on generated images of PDF pages
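As a starting point for the page options/ranges item above, here is one possible way to normalise a page specification. It is a sketch, not pydocsplit's current behaviour: the function name `parse_pages` is hypothetical, and it assumes specs are either a string such as '1-2' or '1,3-5' or an iterable of integers such as [1, 2, 5, 7], with page numbers starting at 1.

```python
def parse_pages(pages):
    """Expand a page spec into a sorted list of unique page numbers.

    Accepts a string such as '1-2' or '1,3-5', or an iterable of ints
    such as [1, 2, 5, 7]. Raises ValueError for malformed input.
    """
    if isinstance(pages, str):
        # Split on commas and ignore empty pieces and surrounding spaces.
        parts = [p.strip() for p in pages.split(",") if p.strip()]
    else:
        parts = list(pages)

    result = set()
    for part in parts:
        if isinstance(part, int):
            first = last = part
        elif "-" in part:
            start, end = part.split("-", 1)
            first, last = int(start), int(end)
        else:
            first = last = int(part)
        # Pages are assumed to be 1-based and ranges ascending.
        if first < 1 or last < first:
            raise ValueError("invalid page range: %r" % (part,))
        result.update(range(first, last + 1))
    return sorted(result)


print(parse_pages("1-2"))         # [1, 2]
print(parse_pages("1,3-5"))       # [1, 3, 4, 5]
print(parse_pages([1, 2, 5, 7]))  # [1, 2, 5, 7]
```

Returning a plain sorted list keeps the result easy to hand to whichever option format the underlying call expects.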