pdfmunge - Process PDFs to make them more legible on eBook readers.
Copyright Felix Crux (www.felixcrux.com)

Summary:

This little script performs some processing tasks on PDFs in order to make them
easier to read on an eBook reader. The primary functions are stripping away
large margins and creating an imitation of landscape mode on devices that don't
support it, by cutting each page in half and rotating each half 90 degrees
counter-clockwise. It can also remove individual pages from the file.


Usage:
pdfmunge [options]... input_file output_file

  -r  --rotate     Slice pages in half and rotate each half 90 degrees
                   counter-clockwise, creating a pseudo-landscape mode on
                   devices that don't do this automatically. Be warned that
                   this will double the size of your output file.

  -m  --margin     If using rotation/slicing, have each page overlap with the
                   previous one by this amount (helps with lines getting cut
                   off in the middle).

  -b  --bounds     Boundaries of visible area on each PDF page. Useful for
                   cropping off large margins. If this is given, cropping is
                   done automatically; otherwise it is not done. Boundaries
                   should be given as four comma-separated numbers, all
                   enclosed in quotation marks, like so: "10,20,100,120". Any
                   whitespace inside the quotation marks is ignored.

  -o  --oddbounds  For PDFs that have different margins on even and odd pages,
                   use these boundary values for odd-numbered pages, with
                   --bounds applying to even-numbered ones. If this is given,
                   --bounds is required.

  -e  --exclude    Numbers or ranges of pages to not include in the output PDF.
                   These should be given as a series of numbers or ranges
                   surrounded by quotation marks, and separated by commas. Any
                   whitespace is ignored. Ranges are given as two numbers
                   separated by a hyphen/minus sign (-), where the first number
                   must be smaller than the second. Example: "1,2,4-8,40".
                   This option takes precedence over --intact.

  -i  --intact     Leave these pages completely unchanged, ignoring
                   cropping, rotating, or anything else. Requires a set of
                   numbers or ranges like --exclude. Excluded pages are
                   ignored even if listed here.
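
To illustrate the argument formats described above, here is a minimal sketch of
how such strings could be parsed. It is not taken from pdfmunge.py: the function
names parse_bounds and parse_pages are hypothetical, and treating ranges as
inclusive at both ends is an assumption.

```python
def parse_bounds(text):
    """Turn a string like "10,20,100,120" into a list of four numbers."""
    # Drop all whitespace first, since the option ignores it.
    parts = "".join(text.split()).split(",")
    if len(parts) != 4:
        raise ValueError("expected four comma-separated numbers: %r" % text)
    return [float(part) for part in parts]


def parse_pages(text):
    """Turn a string like "1,2,4-8,40" into a set of page numbers."""
    pages = set()
    for item in "".join(text.split()).split(","):
        if "-" in item:
            first, last = (int(n) for n in item.split("-"))
            # The first number of a range must be smaller than the second.
            if first >= last:
                raise ValueError("range must be ascending: %r" % item)
            # Assumption: both ends of the range are included.
            pages.update(range(first, last + 1))
        else:
            pages.add(int(item))
    return pages
```

With these definitions, parse_pages("1,2,4-8,40") yields the pages 1, 2, 4, 5,
6, 7, 8, and 40, and a descending range such as "8-4" is rejected.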


Examples:

For a simple example, we can use the PDF version of John Hughes' paper "Why
Functional Programming Matters", which is available from his website. We want
to strip away the large margins, and exclude the last page of references:
  ./pdfmunge.py --exclude "23" --bounds "125,110,485,670" WhyFP.pdf Test.pdf

For a complicated example that really shows off all the features, let's look at
the (free) PDF edition of Mark Dominus' "Higher-Order Perl", which is available
on his website (please consider purchasing it in paper form, too!). This book
needs to have the large margins and printers' guides stripped off, but uses
margins of alternating size on alternating pages. Let's also get rid of some of
the "extraneous" material like the dedication page and index (another reason
you should buy the paper copy!). To make it more readable, we'll slice each
page in half and rotate the halves, creating an imitation of landscape mode on
the eBook reader. We'll also want to leave page 5, the title page, unprocessed:
  ./pdfmunge.py --intact "5" --exclude "1-4,6-8,582-592" \
    --bounds "215,240,557,778" --oddbounds "153,240,495,778" \
    --rotate --margin 3 HoP.pdf Test.pdf
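
The way the options in this command interact can be sketched as a per-page
decision. This is an illustration only, not the script's actual code: the
function plan_page and its return values are hypothetical, and pages are
assumed to be numbered from 1.

```python
def plan_page(number, bounds=None, oddbounds=None, exclude=(), intact=()):
    """Decide what happens to one page; returns an (action, bounds) pair."""
    if number in exclude:
        # --exclude takes precedence over --intact.
        return ("drop", None)
    if number in intact:
        # Copied through without cropping or rotating.
        return ("intact", None)
    if oddbounds is not None and number % 2 == 1:
        # Odd-numbered pages use --oddbounds when it is given.
        return ("process", oddbounds)
    return ("process", bounds)


# Values from the "Higher Order Perl" command line above.
even = (215, 240, 557, 778)
odd = (153, 240, 495, 778)
exclude = set(range(1, 5)) | set(range(6, 9)) | set(range(582, 593))
intact = {5}

for page in (3, 5, 9, 10):
    print(page, plan_page(page, even, odd, exclude, intact))
```

Under these assumptions, page 3 is dropped, page 5 is passed through untouched,
page 9 is processed with the --oddbounds values, and page 10 with the --bounds
values.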


Development:

I threw this script together one evening to scratch an itch I was having with
my eBook reader. It wasn't really intended to be released, and despite some
cleanup, this heritage shows. I'll gladly accept patches.

In particular, open issues are:
 - Rotating/slicing should not need to double the filesize. The way it's done
   now is tremendously hacky, but I haven't found a way to do it better in the
   pyPdf library. As they say, storage is cheap; programmer time and patience
   are not.
 - Two-column PDFs would benefit from a column-splitting function. This might
   be useful even if it didn't work with rotation, since the columns would be
   rather narrow.
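
For anyone looking at the first issue, the geometry of the slicing step can be
worked out independently of pyPdf. The sketch below is not how pdfmunge.py does
it. It assumes that bounds are given as (left, bottom, right, top) in the usual
PDF coordinate system with the origin at the bottom left, that the upper half
is read first, and that the --margin overlap is applied by extending the lower
half upward. The function name slice_halves is hypothetical.

```python
def slice_halves(bounds, margin=0):
    """Split a visible area into an upper and a lower half.

    bounds is assumed to be (left, bottom, right, top). The lower half is
    extended upward by margin so it repeats the last strip of the upper half.
    """
    left, bottom, right, top = bounds
    middle = (bottom + top) / 2.0
    upper = (left, middle, right, top)
    # Never extend the overlap past the top of the visible area.
    lower = (left, bottom, right, min(middle + margin, top))
    return upper, lower


upper, lower = slice_halves((0, 0, 100, 200), margin=3)
print(upper)  # (0, 100.0, 100, 200)
print(lower)  # (0, 0, 100, 103.0)
```

Each rectangle would then be used as the crop area for one output page before
rotating it.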


License:

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.