Initial commit of files from previously private repository.

angea · Mar 17, 2015 · 21b690d · 21b690d
commit 21b690d
Show file tree

Hide file tree

Showing 20 changed files with 472 additions and 0 deletions.
diff --git a/README.md b/README.md
@@ -0,0 +1,80 @@
+# Advanced-PDF-Tricks
+
+This repository (for now) is development home to some hand-crafted PDF files.
+These PDFs should serve as study material for everybody who wants to learn about this format.
+
+Students of the PDF file formats should be able to use a *text editor* in order to open and change them.
+
+Some of the samples include commentary to give hints how to start experimenting with them.
+These can mostly be changed by overwriting `%` comment characters with a space character (and maybe comment out another line instead) in order to see how the PDF viewer changes its display.
+
+Most of the PDFs don't have the fonts embedded which they may use (Courier, Times, Helvetica).
+This may cause that some viewers or Ghostscript complain that they cannot find a substitute font.
+
+## Acknowledgements
+
+The initial creation of these files was inspired by [a talk held](https://www.troopers.de/events/troopers15/451_advanced_pdf_tricks_apt_-_a_workshop-style_presentation_to_understand_the_pdf_file_format_/) at the [TROOPERS15](https://www.troopers.de/) conference in Heidelberg.
+
+
+## Text editor hints
+
+Make sure that your editor does not change the end-of-line conventions used by the original PDF.
+Windows uses `[CARRIAGE_RETURN]+[LINE_FEED]` (2 bytes: in HEX `0D 0A`), while Linux and OS X use `[LINE_FEED]` (1 byte only: `0A` as HEX).
+
+If you don't take care of that, the edited PDF may become "corrupted" by simply opening the file and saving it again.
+The reason would be because the file's internal *table of content* (so-called `xref`-table) would point to wrong byte offsets by the changed numbers of bytes.
+
+VIm users should start their editor with `vim -b` (for binary mode) to make sure the internal byte counting works correctly.
+
+[... to be continued ...]
+
+## PDF viewer hints
+
+Some viewers reload the PDF file and change their page display of an opened file "on the fly" as soon as they notice a change (triggered by editor saving the file).
+Amongst these are the venerable `gv`, SumatraPDF (Windows only), MuPDF (all platforms), Zathura (Linux only) and in parts Preview.app (OSX only) too.
+Acrobat (and Adobe Reader) need to be closed and restarted with the changed file before you can see any edit effect.
+
+## Compressing/Uncompressing data blobs with Zlib algorithm
+
+You can use OpenSSL to compress/uncompress data blobs using the Zlib algorithm:
+
+       openssl zlib -d < $IN > $OUT
+
+ZLIB un-/compression is equivalent to PDF's *Flate* de-/encoding.
+
+**Note:** the `zlib` sub-command (as well as the `-z` option to the `enc` sub-command) is not available if your build of OpenSSL was configured with the default options.
+Unfortunately these include `--no-zlib` and `--no-zlib-dynamic`.
+So this trick only works if your OpenSSL was compiled with the `no-` prefix removed from one of those configure options. You can tell by looking for `-DZLIB` in the output of `openssl version -f`.
+
+
+## Compressing/Uncompressing data blobs with Zlib algorithm
+
+The following Python one-liner should achieve the same:
+
+      python -c "import zlib,sys;print \
+              repr(zlib.decompress(sys.stdin.read()))" < $IN
+
+
+## Compressing/Uncompressing data blobs with the `zlib-flate` command
+
+There is an additional command line utility shipping with **[qpdf](http://qpdf.sf.net/)**: its name is `zlib-flate`.
+Here is how to use it:
+
+      zlib-flate -compress   < $IN  > $OUT     # to compress $IN and generate $OUT
+      zlib-flate -uncompress < $OUT > $IN      # to uncompress again
+
+or
+
+      cat $IN | zlib-flate -uncompress         # to uncompress $IN
+
+
+
+
+
+----
+
+Copyright (c) 2015 <kurt.pfeifle@mykolab.com>
+
+License: [Creative Commons "CC-BY-NC-SA" v4.0](http://creativecommons.org/licenses/by-nc-sa/4.0/)
+![](https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png)
+
diff --git a/TODO.md b/TODO.md
@@ -0,0 +1,52 @@
+# TODO
+
+## General
+
+- Start tracking some of the samples about how they behave in various PDF viewers:
+  Adobe Reader, Acrobat, Sumatra, Preview.app, Ghostscript, MuPDF, Evince, XPDF, Chrome's internal PDF viewer, PDF.js, Windows 10 builtin Reader,...
+- Maybe write some simple Python scripts which auto-generate the `xref` and the `trailer` given a file that already has PDF header and body of objects?
+- Other helper tools for creating hand-written PDFs:
+  Python script which encode/decode arbitrary blobs of data (`/ASCIIHexDecode`, `/ASCII85Decode`, `/LZWDecode`, `/DCTDecode`, `/FlateDecode`, `/RunLengthDecode`, `/JBIG2Decode`, `/CCITTFaxDecode`, `/JPXDecode`, `/Crypt`).
+  Not all at once, though...
+
+## File Wishlist
+
+- TODO: (almost) empty template(s) to kickstart hand-making PDFs
+    * Text (1-page, 2-page, 4-page)
+    * Images (1-page, 2-page, 4-page)
+    * Vector drawings (1-page, 2-page, 4-page)
+- TODO: OCG / Layers: visible - invisible - printable
+- TODO: Watermarks (efficient)
+- TODO: text hidden by other object
+- TODO: cropped object (LaTeX illustrations)
+- TODO: JavaScript in a PDF
+- TODO: internal links in PDF
+- TODO: fillable forms in a PDF
+- TODO: bookmarks in a PDF
+- TODO: annotations in a PDF
+- TODO: some sort of challenge/riddle for TROOPERS in a PDF ?
+- TODO: redaction (good + bad)
+
+&nbsp;
+
+- DONE: incrementally updated PDF file    
+  SEE: `114_incrementally-updated.pdf`
+
+- DONE: JPEG image in a PDF    
+  SEE: `107_perspectives-by-banksy.pdf`
+
+- DONE: "text" by drawing outlines     
+  SEE: `102_A-vectorized.pdf`
+
+- DONE: "text" with font NOT embedded     
+  SEE: `104_fonts-not-embedded.pdf`
+
+- DONE: "text" with font embedded     
+  SEE: `106_hello-troopers.pdf`    
+  SEE: `108_text-rendering-modes.pdf`    
+  SEE: `112_play-with-tounicodetable.pdf`    
+  SEE: `112_stegano-with-tounicodetable.pdf`
+
+- TODO: CTM demo + playground     
+  SEE: `105_transformation-matrix.pdf`     
+  SEE: `111_current-transformation-matrix-ctm.pdf`
diff --git a/Tools.md b/Tools.md
@@ -0,0 +1,214 @@
+# A Bag of Useful Tools
+
+When hand-coding PDF files (or manually manipulating PDFs which are already existing), it is useful to be friends with a little arsenal of command line tools for testing and other purposes.
+Here is a list of tools which we regularly use.
+Most of them are available for all major OS platforms (Windows, OSX, Linux):
+
+* **`pdfinfo`** (the most recent versions based on Poppler -- forked from the original XPDF codebase -- have new features not supported by the venerable XPDF versions).
+  This utility reports general metadata about a PDF.
+
+* **`pdfimages`**.
+  This utility reports various details about images embedded in PDFs.
+  It can also extract them.
+
+* **`pdffonts`**.
+  This utility reports various details about the fonts used by a PDF.
+
+* **`pdfdetach`**.
+  This utility reports if a PDF file makes use of the official feature that allows to embed other files within a PDF.
+  It can also extract a copy of the attached files from the PDF.
+
+* **`pdfresurrect`**.
+  This utility reports if a PDF makes use of the offical feature that allows to make *"incremental updates"* of the document.
+  It can also restore previous versions of the file.
+
+* **`pdftk`**.
+  This utility can ...
+
+* **`qpdf`**.
+  This utility is, according to its self-description *"a command-line program that does structural, content-preserving transformations on PDF files."
+  It is extremely valuable to un-compress streams and object streams contained in a PDF that you want to understand, debug or modify.
+
+* **`Ghostscript`**.
+  This utility can process PostScript and PDF and convert them into a lot of different graphics and printer-understandable raster formats.
+  It doubles up as a PostScript and PDF viewer also.
+  Furthermore, it create and modify PDF files from PostScript and PDF input.
+
+* **`zathura`**.
+  This utility is a very fast and lightweight PDF viewer.
+  (Additional plugins extend it to a viewer for PostScript, DjVu and CB files.)
+  It has a very limited GUI; instead of mouse and menue buttons, it is to be controlled from the keyboard.
+
+* **`mupdf`**.
+  This utility is another lightweight PDF viewer.
+  (It can also display XPS files.)
+  It has a very limited GUI; instead of mouse and menue buttons, it is to be controlled from the keyboard.
+
+* **`mutool`**.
+  This utility is a sibling to `mupdf`.
+  Not a PDF viewer, but a little toolbox with useful sub-commands:
+  `clean` (re-writes PDF files), `extract` (extracts font and image resources), `show` (displays internal PDF objects), `poster` (splits large PDF pages into smaller tiles) and `info` (displays a PDF's metadata).
+
+* **`podofo*`**.
+  This family of utilities can ...
+
+* **`peepdf`**.
+  This utility can ...
+
+* **`pdfid.py`**.
+  This utility can 
+
+* **`pdf-parser.py`**.
+  This utility can ...
+
+# Specific hints
+
+In case you are not yet familiar with these tools, you should get soon.
+However, here are a few specific hints.
+These are beneficial also to "old timers" who are using them since many years already (because they may have missed the updated features we now hint at).
+
+## `pdfinfo`
+
+1. To see if the file includes an `XMP` version of its metadata, use:
+
+        pdfinfo -meta the.pdf
+
+1. To see all page boxes (MediaBox, CropBox, TrimBox, ArtBox) used by the file (or what page boxes are implicitely used because they are undefined), use:
+
+        pdfinfo -box the.pdf
+
+    Be aware, that this form of the command examines only the first page, and the command output reflects this.
+   As you know, PDF documents may have different page sizes within the same file.
+   So take a look at the next tip.
+
+1. To see page-related info about different (or all) pages, use the `-f <N> -l <M>` parameters.
+   The following command retrieves the page box info related to pages 11--15:
+
+        pdfinfo -f 11 -l 15 -box the.pdf
+
+1. To print the contents of an embedded JavaScript, use:
+
+        pdfinfo -js the.pdf
+
+    **Note:** If it is a malicious JavaScript which the originator wanted to hide from the users by applying certain obfuscation techniques, the `-js` key will very likely not work.
+
+
+## `pdfimages`
+
+1. Older versions of this utility could only *extract* embedded images.
+   Recent Poppler versions (not the XPDF versions though!) have now the `-list` parameter:
+
+        pdfimages -list the.pdf
+
+    This command will print a list of images which are used inside the PDF file with various additional info:
+   the page number where the image appears, the image dimensions, their compression ratio, the PDF object ID of the image, their color depth, the number of color componenents.
+
+1. One notable detail about the previous hint:
+   since `pdfimages -list` returns the respective PDF object ID, it is worth to check if various images listed for the PDF use *identical* ID numbers!
+
+    Because identical object IDs for different image numbers mean:
+   the PDF makes re-use of that image (it is very efficiently constructed) multiple times, showing it on different locations -- but it is embedded only once.
+   This is not always the case -- older versions of OpenOffice/LibreOffice embedded a page background image once per page, creating very large output PDFs.
+   This is no longer the case (at least with the 4.3/4.4 releases of LibreOffice).
+
+1. Take note of the fact that the `-f <N> -l <M>` command line params does also work for `pdfimages`.
+
+## `pdffonts`
+
+1. This utility prints a list of metadata about the fonts used by a PDF file.
+   If the column headed `uni` does not show up a `yes` for a specific font, it may be difficult or even impossible to extract the text (either by *copy'n'paste'* or with the help of `pdftotext`).
+   All you get may be unreadable gobble-di-gook.
+
+1. A new feature in recent Poppler versions (again: not available for XPDF-based `pdffonts`!) is the `-subst` parameter.
+   It prints 
+
+        pdffonts -subst the.pdf
+
+    If the tool returns an empty table, then the PDF may have
+
+    (a) either all fonts embedded
+    (b) no fonts in use at all
+
+    **Note:** whatever substitute that tool reports for non-embedded fonts is not true for Acrobat or other PDF viewers.
+   These may use a different font substitution method (Acrobat frequently generates a *MultipleMaster* font "on the fly" for use in place of a non-embedded one).
+   The tool's reported substitute font is only applicable for those programms which make use of the *FreeType* font engine.
+
+## `pdfresurrect`
+
+## `pdfdetach`
+
+## `pdftk`
+
+## Ghostscript
+
+1. If you want to generate (or re-generate) a PDF which does not use real fonts for text glyphs, but instead small vector shapes, you can use `-dNoOutputFonts` on the command line (starting with GS release v9.15):
+
+        gs -o out.pdf -sDEVICE=pdfwrite -dNoOutputFonts input.pdf
+
+1. If you are on Windows, you'll have different Ghostscript executables:
+
+    * **`gswin32c.exe`** and  **`gswin64c.exe`** :
+      32-bit variant of the 'console' version of Ghostscript.
+      Note the **`c`** in the names.
+
+    * **`gswin32.exe`**  and  **`gswin64.exe`** :
+      32-bit/64-bit variants of the 'GUI' version of Ghostscript.
+      (The GUI is only an extra window that opens to report `<stdout>` and `<stderr>` messages, and which serves to receive keyboard input if you use Ghostscript interactively).
+      Note there is no **`c`** in the names of the GUI versions.
+
+
+## `zathura`
+
+
+## `mupdf`
+
+
+## `mutool`
+
+
+## `podofo*`
+
+
+## `peepdf`
+
+
+## `pdfid.py`
+
+
+## `pdf-parser.py`
+
+
+## `qpdf`
+
+1. Be aware: `qpdf` is not perfect! Though it works extremely well for most practical cases, there are some things where its *"content preserving transformation"* fails:
+
+    - It cannot preserve "layers" (in PDF parlance: `OCG`, optional content groups) in the output PDF.
+      If you let `qpdf` transform a PDF file containing layers, the output will have flattened them all into one.
+    - It does not know how to handle *incremental updates* of PDFs.
+      If you let `qpdf` transform a PDF file containing incrementally updated versions the output will reflect the last file version only.
+
+1. For me personally, qpdf` ist most useful when it comes to the following three tasks:
+
+    - Quick-check the PDFs internal structure by running:
+
+        `qpdf --check the.pdf`
+
+    - Return the correct `xref` table values by running:
+
+        `qpdf --show-xref the.pdf`
+
+    - Unpack as many internal structures as possible by running:
+
+        `qpdf --qdf --object-streams=disable the.pdf unpacked.pdf`
+
+
+[... TO BE COMPLETED ...]
+
+
+----
+
+Copyright (c) 2015 <kurt.pfeifle@mykolab.com>
+
+License: [Creative Commons "CC-BY-NC-SA" v4.0](http://creativecommons.org/licenses/by-nc-sa/4.0/)
+![](http://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png)
+