
Initial commit of files from previously private repository.

KurtPfeifle committed Mar 17, 2015
0 parents commit 21b690da4e99ec08afc7e5f9e02f52dad5b507de
@@ -0,0 +1,80 @@
# Advanced-PDF-Tricks
This repository (for now) is development home to some hand-crafted PDF files.
These PDFs should serve as study material for everybody who wants to learn about this format.
Students of the PDF file format should be able to use a *text editor* in order to open and change them.
Some of the samples include commentary to give hints how to start experimenting with them.
Most of these experiments work by overwriting a `%` comment character with a space character (and perhaps commenting out another line instead) in order to see how the PDF viewer changes its display. Overwriting, rather than deleting, keeps the number of bytes in the file unchanged.
Most of the PDFs don't have the fonts embedded which they may use (Courier, Times, Helvetica).
This may cause some viewers, or Ghostscript, to complain that they cannot find a substitute font.
## Acknowledgements
The initial creation of these files was inspired by [a talk held](https://www.troopers.de/events/troopers15/451_advanced_pdf_tricks_apt_-_a_workshop-style_presentation_to_understand_the_pdf_file_format_/) at the [TROOPERS15](https://www.troopers.de/) conference in Heidelberg.
## Text editor hints
Make sure that your editor does not change the end-of-line conventions used by the original PDF.
Windows uses `[CARRIAGE_RETURN]+[LINE_FEED]` (2 bytes: in HEX `0D 0A`), while Linux and OS X use `[LINE_FEED]` (1 byte only: `0A` as HEX).
If you are not careful, simply opening the file and saving it again may "corrupt" the PDF.
The reason is that the file's internal *table of contents* (the so-called `xref` table) would then point to wrong byte offsets, because the number of bytes in the file has changed.
Vim users should start their editor with `vim -b` (for binary mode) so that line endings are left untouched and the byte offsets stay correct.
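To see why a changed end-of-line convention breaks the `xref` table, here is a minimal Python sketch. It is not part of this repository: the helper name `object_offsets` and the tiny sample body are made up for illustration, and the regular expression assumes that object headers like `3 0 obj` always start a line. It records the byte offset of each object before and after an LF-to-CRLF conversion:

```python
import re

# Matches object headers such as "3 0 obj" at the start of a line.
OBJ_HEADER = re.compile(rb"^(\d+) \d+ obj", re.MULTILINE)

def object_offsets(pdf_bytes):
    """Map object number -> byte offset of its 'N G obj' header."""
    return {int(m.group(1)): m.start() for m in OBJ_HEADER.finditer(pdf_bytes)}

# Hypothetical two-object body using LF line endings.
lf_body = (
    b"%PDF-1.4\n"
    b"1 0 obj\n<< /Type /Catalog >>\nendobj\n"
    b"2 0 obj\n<< >>\nendobj\n"
)
# What an editor might write after silently converting LF to CRLF.
crlf_body = lf_body.replace(b"\n", b"\r\n")

before = object_offsets(lf_body)
after = object_offsets(crlf_body)
for num in sorted(before):
    print("object", num, "offset", before[num], "->", after[num])
```

Every offset grows by one byte for each line that precedes the object, so an `xref` table written for the LF version no longer matches the CRLF version.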
[... to be continued ...]
## PDF viewer hints
Some viewers reload the PDF file and update their page display of an opened file "on the fly" as soon as they notice a change (triggered by the editor saving the file).
Amongst these are the venerable `gv`, SumatraPDF (Windows only), MuPDF (all platforms), Zathura (Linux only) and in parts Preview.app (OSX only) too.
Acrobat (and Adobe Reader) need to be closed and restarted with the changed file before you can see any edit effect.
## Compressing/Uncompressing data blobs with Zlib algorithm
You can use OpenSSL to compress/uncompress data blobs using the Zlib algorithm:
openssl zlib -d < $IN > $OUT
ZLIB un-/compression is equivalent to PDF's *Flate* de-/encoding.
**Note:** the `zlib` sub-command (as well as the `-z` option to the `enc` sub-command) is not available if your build of OpenSSL was configured with the default options.
Unfortunately these include `--no-zlib` and `--no-zlib-dynamic`.
So this trick only works if your OpenSSL was compiled with the `no-` prefix removed from one of those configure options. You can tell by looking for `-DZLIB` in the output of `openssl version -f`.
## Compressing/Uncompressing data blobs with Python
The following Python one-liner should achieve the same:
python3 -c "import zlib,sys;print( \
repr(zlib.decompress(sys.stdin.buffer.read())))" < $IN
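The one-liner above only covers the decompression direction. As a small sketch of both directions, the following uses Python's standard `zlib` module; the function names `flate_encode` and `flate_decode` and the sample content stream are made up for illustration:

```python
import zlib

def flate_encode(data: bytes) -> bytes:
    """Compress raw bytes the way a /FlateDecode stream expects them."""
    return zlib.compress(data)

def flate_decode(data: bytes) -> bytes:
    """Uncompress the payload of a /FlateDecode stream."""
    return zlib.decompress(data)

# Hypothetical page content stream, as it would appear uncompressed.
content = b"BT /F1 12 Tf 72 720 Td (Hello) Tj ET\n"
packed = flate_encode(content)
unpacked = flate_decode(packed)

print(len(content), "bytes raw,", len(packed), "bytes compressed")
print(unpacked == content)
```

When you paste a re-compressed blob back into a PDF, remember that the stream's `/Length` value and all following `xref` offsets have to be adjusted to the new byte count.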
## Compressing/Uncompressing data blobs with the `zlib-flate` command
There is an additional command line utility shipping with **[qpdf](http://qpdf.sf.net/)**: its name is `zlib-flate`.
Here is how to use it:
zlib-flate -compress < $IN > $OUT # to compress $IN and generate $OUT
zlib-flate -uncompress < $OUT > $IN # to uncompress again
or
cat $IN | zlib-flate -uncompress # to uncompress $IN
----
Copyright (c) 2015 <kurt.pfeifle@mykolab.com>
License: [Creative Commons "CC-BY-NC-SA" v4.0](http://creativecommons.org/licenses/by-nc-sa/4.0/)
![](https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png)
52 TODO.md
@@ -0,0 +1,52 @@
# TODO
## General
- Start tracking some of the samples about how they behave in various PDF viewers:
Adobe Reader, Acrobat, Sumatra, Preview.app, Ghostscript, MuPDF, Evince, XPDF, Chrome's internal PDF viewer, PDF.js, Windows 10 builtin Reader,...
- Maybe write some simple Python scripts which auto-generate the `xref` and the `trailer` given a file that already has PDF header and body of objects?
- Other helper tools for creating hand-written PDFs:
Python scripts which encode/decode arbitrary blobs of data (`/ASCIIHexDecode`, `/ASCII85Decode`, `/LZWDecode`, `/DCTDecode`, `/FlateDecode`, `/RunLengthDecode`, `/JBIG2Decode`, `/CCITTFaxDecode`, `/JPXDecode`, `/Crypt`).
Not all at once, though...
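As a possible starting point for the `xref`/`trailer` generator idea above, here is a rough Python sketch. It is only an illustration under several assumptions: the function name `append_xref_and_trailer` is made up, object numbers run contiguously from 1, all generation numbers are 0, the body ends with a newline, and no stream data contains a line that looks like an object header.

```python
import re

# Matches object headers such as "3 0 obj" at the start of a line.
OBJ_HEADER = re.compile(rb"^(\d+) \d+ obj\b", re.MULTILINE)

def append_xref_and_trailer(body: bytes, root_obj: int = 1) -> bytes:
    """Append a classic xref table and trailer to a PDF header + body."""
    offsets = {int(m.group(1)): m.start() for m in OBJ_HEADER.finditer(body)}
    size = max(offsets) + 1          # highest object number plus one
    xref_pos = len(body)             # the xref table starts right after the body

    # Each xref entry must be exactly 20 bytes long, including its line end.
    lines = [b"xref\n", b"0 %d\n" % size, b"0000000000 65535 f \n"]
    for num in range(1, size):
        # Assumption: every object number 1..size-1 exists, generation 0.
        lines.append(b"%010d 00000 n \n" % offsets[num])

    trailer = (
        b"trailer\n<< /Size %d /Root %d 0 R >>\n"
        b"startxref\n%d\n%%%%EOF\n" % (size, root_obj, xref_pos)
    )
    return body + b"".join(lines) + trailer

# Hypothetical minimal body; a real file needs a complete page tree.
sample_body = (
    b"%PDF-1.4\n"
    b"1 0 obj\n<< /Type /Catalog >>\nendobj\n"
    b"2 0 obj\n<< >>\nendobj\n"
)
print(append_xref_and_trailer(sample_body).decode("ascii"))
```

The result can be cross-checked against the offsets reported by `qpdf --show-xref`.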
## File Wishlist
- TODO: (almost) empty template(s) to kickstart hand-making PDFs
* Text (1-page, 2-page, 4-page)
* Images (1-page, 2-page, 4-page)
* Vector drawings (1-page, 2-page, 4-page)
- TODO: OCG / Layers: visible - invisible - printable
- TODO: Watermarks (efficient)
- TODO: text hidden by other object
- TODO: cropped object (LaTeX illustrations)
- TODO: JavaScript in a PDF
- TODO: internal links in PDF
- TODO: fillable forms in a PDF
- TODO: bookmarks in a PDF
- TODO: annotations in a PDF
- TODO: some sort of challenge/riddle for TROOPERS in a PDF ?
- TODO: redaction (good + bad)
&nbsp;
- DONE: incrementally updated PDF file
SEE: `114_incrementally-updated.pdf`
- DONE: JPEG image in a PDF
SEE: `107_perspectives-by-banksy.pdf`
- DONE: "text" by drawing outlines
SEE: `102_A-vectorized.pdf`
- DONE: "text" with font NOT embedded
SEE: `104_fonts-not-embedded.pdf`
- DONE: "text" with font embedded
SEE: `106_hello-troopers.pdf`
SEE: `108_text-rendering-modes.pdf`
SEE: `112_play-with-tounicodetable.pdf`
SEE: `112_stegano-with-tounicodetable.pdf`
- TODO: CTM demo + playground
SEE: `105_transformation-matrix.pdf`
SEE: `111_current-transformation-matrix-ctm.pdf`
214 Tools.md
@@ -0,0 +1,214 @@
# A Bag of Useful Tools
When hand-coding PDF files (or manually manipulating existing PDFs), it is useful to be friends with a little arsenal of command line tools for testing and other purposes.
Here is a list of tools which we regularly use.
Most of them are available for all major OS platforms (Windows, OSX, Linux):
* **`pdfinfo`** (the most recent versions based on Poppler -- forked from the original XPDF codebase -- have new features not supported by the venerable XPDF versions).
This utility reports general metadata about a PDF.
* **`pdfimages`**.
This utility reports various details about images embedded in PDFs.
It can also extract them.
* **`pdffonts`**.
This utility reports various details about the fonts used by a PDF.
* **`pdfdetach`**.
This utility reports whether a PDF file makes use of the official feature that allows embedding other files within a PDF.
It can also extract a copy of the attached files from the PDF.
* **`pdfresurrect`**.
This utility reports whether a PDF makes use of the official feature that allows *"incremental updates"* of the document.
It can also restore previous versions of the file.
* **`pdftk`**.
This utility can ...
* **`qpdf`**.
This utility is, according to its self-description, *"a command-line program that does structural, content-preserving transformations on PDF files."*
It is extremely valuable for uncompressing streams and object streams contained in a PDF that you want to understand, debug or modify.
* **`Ghostscript`**.
This utility can process PostScript and PDF and convert them into a lot of different graphics and printer-understandable raster formats.
It also doubles as a PostScript and PDF viewer.
Furthermore, it can create and modify PDF files from PostScript and PDF input.
* **`zathura`**.
This utility is a very fast and lightweight PDF viewer.
(Additional plugins extend it to a viewer for PostScript, DjVu and CB files.)
It has a very limited GUI; instead of mouse and menu buttons, it is controlled from the keyboard.
* **`mupdf`**.
This utility is another lightweight PDF viewer.
(It can also display XPS files.)
It has a very limited GUI; instead of mouse and menu buttons, it is controlled from the keyboard.
* **`mutool`**.
This utility is a sibling to `mupdf`.
Not a PDF viewer, but a little toolbox with useful sub-commands:
`clean` (re-writes PDF files), `extract` (extracts font and image resources), `show` (displays internal PDF objects), `poster` (splits large PDF pages into smaller tiles) and `info` (displays a PDF's metadata).
* **`podofo*`**.
This family of utilities can ...
* **`peepdf`**.
This utility can ...
* **`pdfid.py`**.
This utility can
* **`pdf-parser.py`**.
This utility can ...
# Specific hints
In case you are not yet familiar with these tools, you should become so soon.
However, here are a few specific hints.
These are also beneficial to "old-timers" who have been using the tools for many years (because they may have missed the newer features we hint at here).
## `pdfinfo`
1. To see if the file includes an `XMP` version of its metadata, use:
pdfinfo -meta the.pdf
1. To see all page boxes (MediaBox, CropBox, BleedBox, TrimBox, ArtBox) used by the file (or which page boxes are implicitly used because they are undefined), use:
pdfinfo -box the.pdf
Be aware that this form of the command examines only the first page, and the command output reflects this.
As you know, PDF documents may have different page sizes within the same file.
So take a look at the next tip.
1. To see page-related info about different (or all) pages, use the `-f <N> -l <M>` parameters.
The following command retrieves the page box info related to pages 11--15:
pdfinfo -f 11 -l 15 -box the.pdf
1. To print the contents of an embedded JavaScript, use:
pdfinfo -js the.pdf
**Note:** If it is a malicious JavaScript which the originator wanted to hide from the users by applying certain obfuscation techniques, the `-js` key will very likely not work.
## `pdfimages`
1. Older versions of this utility could only *extract* embedded images.
Recent Poppler versions (not the XPDF versions though!) now have the `-list` parameter:
pdfimages -list the.pdf
This command will print a list of images which are used inside the PDF file with various additional info:
the page number where the image appears, the image dimensions, the compression ratio, the PDF object ID of the image, the color depth, the number of color components.
1. One notable detail about the previous hint:
since `pdfimages -list` returns the respective PDF object ID, it is worth checking whether several images listed for the PDF use *identical* ID numbers!
Identical object IDs for different image numbers mean:
the PDF re-uses that image multiple times, showing it in different locations, while embedding it only once (a very efficient construction).
This is not always the case -- older versions of OpenOffice/LibreOffice embedded a page background image once per page, creating very large output PDFs.
This is no longer the case (at least with the 4.3/4.4 releases of LibreOffice).
1. Take note of the fact that the `-f <N> -l <M>` command line params also work for `pdfimages`.
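The check for re-used images described above can be automated. The following Python sketch groups the rows of a `pdfimages -list` listing by object ID. It is only an illustration: the function name `reused_images` and the sample listing are made up, and it assumes that the object number and generation are the 11th and 12th whitespace-separated columns, which may differ between Poppler versions.

```python
from collections import defaultdict

def reused_images(listing: str) -> dict:
    """Return {(object, generation): [(page, image number), ...]} for
    every object ID that appears in more than one row."""
    uses = defaultdict(list)
    for line in listing.splitlines():
        fields = line.split()
        # Data rows start with a page number; header and ruler lines do not.
        if len(fields) < 12 or not fields[0].isdigit():
            continue
        page, num = int(fields[0]), int(fields[1])
        # Assumed column positions of "object" and "ID" (generation).
        obj_id = (int(fields[10]), int(fields[11]))
        uses[obj_id].append((page, num))
    return {obj: rows for obj, rows in uses.items() if len(rows) > 1}

# Made-up listing in the style of `pdfimages -list the.pdf`.
sample_listing = """\
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image     800   600  rgb     3   8  jpeg   no         7  0    72    72 45.1K 3.2%
   2     1 image     800   600  rgb     3   8  jpeg   no         7  0    72    72 45.1K 3.2%
   2     2 image     100   100  gray    1   8  image  no        12  0    72    72  980B 9.8%
"""
for obj, rows in reused_images(sample_listing).items():
    print("object", obj, "is shown at (page, num):", rows)
```

In practice you would feed the function the captured output of `pdfimages -list the.pdf` instead of the sample string.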
## `pdffonts`
1. This utility prints a list of metadata about the fonts used by a PDF file.
If the column headed `uni` does not show a `yes` for a specific font, it may be difficult or even impossible to extract the text (either by *copy'n'paste* or with the help of `pdftotext`).
All you get may be unreadable gobbledygook.
1. A new feature in recent Poppler versions (again: not available for XPDF-based `pdffonts`!) is the `-subst` parameter.
It prints the substitute fonts which Poppler would use for fonts that are not embedded:
pdffonts -subst the.pdf
If the tool returns an empty table, then the PDF has either
(a) all fonts embedded, or
(b) no fonts in use at all.
**Note:** the substitute which the tool reports for a non-embedded font does not necessarily apply to Acrobat or other PDF viewers.
These may use a different font substitution method (Acrobat frequently generates a *MultipleMaster* font "on the fly" for use in place of a non-embedded one).
The tool's reported substitute font is only applicable to those programs which make use of the *FreeType* font engine.
## `pdfresurrect`
## `pdfdetach`
## `pdftk`
## Ghostscript
1. If you want to generate (or re-generate) a PDF which does not use real fonts for text glyphs, but instead small vector shapes, you can use `-dNoOutputFonts` on the command line (starting with GS release v9.15):
gs -o out.pdf -sDEVICE=pdfwrite -dNoOutputFonts input.pdf
1. If you are on Windows, you'll have different Ghostscript executables:
* **`gswin32c.exe`** and **`gswin64c.exe`** :
32-bit/64-bit variants of the 'console' version of Ghostscript.
Note the **`c`** in the names.
* **`gswin32.exe`** and **`gswin64.exe`** :
32-bit/64-bit variants of the 'GUI' version of Ghostscript.
(The GUI is only an extra window that opens to report `<stdout>` and `<stderr>` messages, and which serves to receive keyboard input if you use Ghostscript interactively).
Note there is no **`c`** in the names of the GUI versions.
## `zathura`
## `mupdf`
## `mutool`
## `podofo*`
## `peepdf`
## `pdfid.py`
## `pdf-parser.py`
## `qpdf`
1. Be aware: `qpdf` is not perfect! Though it works extremely well for most practical cases, there are some things where its *"content preserving transformation"* fails:
- It cannot preserve "layers" (in PDF parlance: `OCG`, optional content groups) in the output PDF.
If you let `qpdf` transform a PDF file containing layers, the output will have flattened them all into one.
- It does not know how to handle *incremental updates* of PDFs.
If you let `qpdf` transform a PDF file containing incrementally updated versions the output will reflect the last file version only.
1. For me personally, `qpdf` is most useful when it comes to the following three tasks:
- Quick-check the PDF's internal structure by running:
`qpdf --check the.pdf`
- Return the correct `xref` table values by running:
`qpdf --show-xref the.pdf`
- Unpack as many internal structures as possible by running:
`qpdf --qdf --object-streams=disable the.pdf unpacked.pdf`
[... TO BE COMPLETED ...]
----
Copyright (c) 2015 <kurt.pfeifle@mykolab.com>
License: [Creative Commons "CC-BY-NC-SA" v4.0](http://creativecommons.org/licenses/by-nc-sa/4.0/)
![](http://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png)