diff --git a/README.md b/README.md new file mode 100644 index 0000000..2ff0c7c --- /dev/null +++ b/README.md @@ -0,0 +1,80 @@ +# Advanced-PDF-Tricks + +This repository (for now) is the development home to some hand-crafted PDF files. +These PDFs should serve as study material for everybody who wants to learn about this format. + +Students of the PDF file format should be able to use a *text editor* in order to open and change them. + +Some of the samples include commentary to give hints on how to start experimenting with them. +These can mostly be changed by overwriting `%` comment characters with a space character (and maybe commenting out another line instead) in order to see how the PDF viewer changes its display. + +Most of the PDFs don't have the fonts embedded which they may use (Courier, Times, Helvetica). +As a result, some viewers or Ghostscript may complain that they cannot find a substitute font. + +## Acknowledgements + +The initial creation of these files was inspired by [a talk held](https://www.troopers.de/events/troopers15/451_advanced_pdf_tricks_apt_-_a_workshop-style_presentation_to_understand_the_pdf_file_format_/) at the [TROOPERS15](https://www.troopers.de/) conference in Heidelberg. + + +## Text editor hints + +Make sure that your editor does not change the end-of-line conventions used by the original PDF. +Windows uses `[CARRIAGE_RETURN]+[LINE_FEED]` (2 bytes: in HEX `0D 0A`), while Linux and OS X use `[LINE_FEED]` (1 byte only: `0A` as HEX). + +If you don't take care of that, the edited PDF may become "corrupted" by simply opening the file and saving it again. +The reason is that the file's internal *table of contents* (the so-called `xref` table) would then point to wrong byte offsets, because the number of bytes in the file has changed. + +Vim users should start their editor with `vim -b` (for binary mode) to make sure the internal byte counting works correctly. + +[... to be continued ...] 
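The end-of-line problem described under "Text editor hints" can be checked with a few lines of Python. The following is only a minimal sketch, not part of the repository: the helper names `eol_counts` and `startxref_is_valid` are made up for illustration, and the check assumes a classic `xref` table (not a cross-reference stream as used by newer PDFs).

```python
import re


def eol_counts(data: bytes) -> dict:
    """Count the end-of-line conventions found in the raw file bytes."""
    crlf = data.count(b"\r\n")
    return {
        "CRLF": crlf,                     # Windows style: 0D 0A
        "LF": data.count(b"\n") - crlf,   # Unix style: bare 0A
        "CR": data.count(b"\r") - crlf,   # bare 0D
    }


def startxref_is_valid(data: bytes) -> bool:
    """Check that the last 'startxref' offset points at the keyword 'xref'.

    Assumption: the file uses a classic xref table. If an editor silently
    changed the line endings, the offset no longer matches.
    """
    matches = re.findall(rb"startxref\s+(\d+)", data)
    if not matches:
        return False
    offset = int(matches[-1])
    return data[offset:offset + 4] == b"xref"
```

Reading a file with `open(path, "rb").read()` and passing the bytes to both helpers before and after an editing session should reveal whether the editor changed the line endings or shifted the `xref` offset.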
+ +## PDF viewer hints + +Some viewers reload the PDF file and change their page display of an opened file "on the fly" as soon as they notice a change (triggered by the editor saving the file). +Amongst these are the venerable `gv`, SumatraPDF (Windows only), MuPDF (all platforms), Zathura (Linux only) and in parts Preview.app (OSX only) too. +Acrobat (and Adobe Reader) need to be closed and restarted with the changed file before you can see any edit effect. + +## Compressing/Uncompressing data blobs with Zlib algorithm + +You can use OpenSSL to compress/uncompress data blobs using the Zlib algorithm: + + openssl zlib -d < $IN > $OUT + +ZLIB un-/compression is equivalent to PDF's *Flate* de-/encoding. + +**Note:** the `zlib` sub-command (as well as the `-z` option to the `enc` sub-command) is not available if your build of OpenSSL was configured with the default options. +Unfortunately these include `--no-zlib` and `--no-zlib-dynamic`. +So this trick only works if your OpenSSL was compiled with the `no-` prefix removed from one of those configure options. You can tell by looking for `-DZLIB` in the output of `openssl version -f`. + + +## Compressing/Uncompressing data blobs with Python's `zlib` module + +The following Python 3 one-liner should achieve the same (it prints the `repr()` of the uncompressed bytes): + + python3 -c "import zlib,sys;print( \ + repr(zlib.decompress(sys.stdin.buffer.read())))" < $IN + + +## Compressing/Uncompressing data blobs with the `zlib-flate` command + +There is an additional command line utility shipping with **[qpdf](http://qpdf.sf.net/)**: its name is `zlib-flate`. 
+Here is how to use it: + + zlib-flate -compress < $IN > $OUT # to compress $IN and generate $OUT + zlib-flate -uncompress < $OUT > $IN # to uncompress again + +or + + cat $IN | zlib-flate -uncompress # to uncompress $IN + + + + + +---- + +Copyright (c) 2015 + +License: [Creative Commons "CC-BY-NC-SA" v4.0](http://creativecommons.org/licenses/by-nc-sa/4.0/) +![](https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png) + diff --git a/TODO.md b/TODO.md new file mode 100644 index 0000000..ffb7db6 --- /dev/null +++ b/TODO.md @@ -0,0 +1,52 @@ +# TODO + +## General + +- Start tracking how some of the samples behave in various PDF viewers: + Adobe Reader, Acrobat, Sumatra, Preview.app, Ghostscript, MuPDF, Evince, XPDF, Chrome's internal PDF viewer, PDF.js, Windows 10 builtin Reader,... +- Maybe write some simple Python scripts which auto-generate the `xref` and the `trailer` given a file that already has a PDF header and a body of objects? +- Other helper tools for creating hand-written PDFs: + Python scripts which encode/decode arbitrary blobs of data (`/ASCIIHexDecode`, `/ASCII85Decode`, `/LZWDecode`, `/DCTDecode`, `/FlateDecode`, `/RunLengthDecode`, `/JBIG2Decode`, `/CCITTFaxDecode`, `/JPXDecode`, `/Crypt`). + Not all at once, though... + +## File Wishlist + +- TODO: (almost) empty template(s) to kickstart hand-making PDFs + * Text (1-page, 2-page, 4-page) + * Images (1-page, 2-page, 4-page) + * Vector drawings (1-page, 2-page, 4-page) +- TODO: OCG / Layers: visible - invisible - printable +- TODO: Watermarks (efficient) +- TODO: text hidden by other object +- TODO: cropped object (LaTeX illustrations) +- TODO: JavaScript in a PDF +- TODO: internal links in PDF +- TODO: fillable forms in a PDF +- TODO: bookmarks in a PDF +- TODO: annotations in a PDF +- TODO: some sort of challenge/riddle for TROOPERS in a PDF ? 
+- TODO: redaction (good + bad) + +  + +- DONE: incrementally updated PDF file + SEE: `114_incrementally-updated.pdf` + +- DONE: JPEG image in a PDF + SEE: `107_perspectives-by-banksy.pdf` + +- DONE: "text" by drawing outlines + SEE: `102_A-vectorized.pdf` + +- DONE: "text" with font NOT embedded + SEE: `104_fonts-not-embedded.pdf` + +- DONE: "text" with font embedded + SEE: `106_hello-troopers.pdf` + SEE: `108_text-rendering-modes.pdf` + SEE: `112_play-with-tounicodetable.pdf` + SEE: `113_stegano-with-tounicodetable.pdf` + +- TODO: CTM demo + playground + SEE: `105_transformation-matrix.pdf` + SEE: `111_current-transformation-matrix-ctm.pdf` diff --git a/Tools.md b/Tools.md new file mode 100644 index 0000000..6078413 --- /dev/null +++ b/Tools.md @@ -0,0 +1,214 @@ +# A Bag of Useful Tools + +When hand-coding PDF files (or manually manipulating existing PDFs), it is useful to be friends with a little arsenal of command line tools for testing and other purposes. +Here is a list of tools which we regularly use. +Most of them are available for all major OS platforms (Windows, OSX, Linux): + +* **`pdfinfo`** (the most recent versions based on Poppler -- forked from the original XPDF codebase -- have new features not supported by the venerable XPDF versions). + This utility reports general metadata about a PDF. + +* **`pdfimages`**. + This utility reports various details about images embedded in PDFs. + It can also extract them. + +* **`pdffonts`**. + This utility reports various details about the fonts used by a PDF. + +* **`pdfdetach`**. + This utility reports if a PDF file makes use of the official feature that allows embedding other files within a PDF. + It can also extract a copy of the attached files from the PDF. + +* **`pdfresurrect`**. + This utility reports if a PDF makes use of the official feature that allows making *"incremental updates"* of the document. + It can also restore previous versions of the file. + +* **`pdftk`**. 
+ This utility can ... + +* **`qpdf`**. + This utility is, according to its self-description, *"a command-line program that does structural, content-preserving transformations on PDF files."* + It is extremely valuable for un-compressing streams and object streams contained in a PDF that you want to understand, debug or modify. + +* **`Ghostscript`**. + This utility can process PostScript and PDF and convert them into a lot of different graphics and printer-understandable raster formats. + It also doubles as a PostScript and PDF viewer. + Furthermore, it can create and modify PDF files from PostScript and PDF input. + +* **`zathura`**. + This utility is a very fast and lightweight PDF viewer. + (Additional plugins extend it to a viewer for PostScript, DjVu and CB files.) + It has a very limited GUI; instead of mouse and menu buttons, it is to be controlled from the keyboard. + +* **`mupdf`**. + This utility is another lightweight PDF viewer. + (It can also display XPS files.) + It has a very limited GUI; instead of mouse and menu buttons, it is to be controlled from the keyboard. + +* **`mutool`**. + This utility is a sibling to `mupdf`. + Not a PDF viewer, but a little toolbox with useful sub-commands: + `clean` (re-writes PDF files), `extract` (extracts font and image resources), `show` (displays internal PDF objects), `poster` (splits large PDF pages into smaller tiles) and `info` (displays a PDF's metadata). + +* **`podofo*`**. + This family of utilities can ... + +* **`peepdf`**. + This utility can ... + +* **`pdfid.py`**. + This utility can + +* **`pdf-parser.py`**. + This utility can ... + +# Specific hints + +In case you are not yet familiar with these tools, you should get acquainted with them soon. +However, here are a few specific hints. +These are also beneficial to "old timers" who have been using the tools for many years already (because they may have missed the updated features we now hint at). + +## `pdfinfo` + +1. 
To see if the file includes an `XMP` version of its metadata, use: + + pdfinfo -meta the.pdf + +1. To see all page boxes (MediaBox, CropBox, TrimBox, ArtBox) used by the file (or which page boxes are implicitly used because they are undefined), use: + + pdfinfo -box the.pdf + + Be aware that this form of the command examines only the first page, and the command output reflects this. + As you know, PDF documents may have different page sizes within the same file. + So take a look at the next tip. + +1. To see page-related info about different (or all) pages, use the `-f` (first page) and `-l` (last page) parameters. + The following command retrieves the page box info related to pages 11--15: + + pdfinfo -f 11 -l 15 -box the.pdf + +1. To print the contents of an embedded JavaScript, use: + + pdfinfo -js the.pdf + + **Note:** If it is a malicious JavaScript which the originator wanted to hide from the users by applying certain obfuscation techniques, the `-js` option will very likely not work. + + +## `pdfimages` + +1. Older versions of this utility could only *extract* embedded images. + Recent Poppler versions (not the XPDF versions though!) now have the `-list` parameter: + + pdfimages -list the.pdf + + This command will print a list of images which are used inside the PDF file with various additional info: + the page number where the image appears, the image dimensions, their compression ratio, the PDF object ID of the image, their color depth, the number of color components. + +1. One notable detail about the previous hint: + since `pdfimages -list` returns the respective PDF object ID, it is worth checking whether various images listed for the PDF use *identical* ID numbers! + + Because identical object IDs for different image numbers mean: + the PDF re-uses that image multiple times (it is very efficiently constructed), showing it at different locations -- but it is embedded only once. 
+ This is not always the case -- older versions of OpenOffice/LibreOffice embedded a page background image once per page, creating very large output PDFs. + This is no longer the case (at least with the 4.3/4.4 releases of LibreOffice). + +1. Take note of the fact that the `-f` and `-l` command line parameters also work for `pdfimages`. + +## `pdffonts` + +1. This utility prints a list of metadata about the fonts used by a PDF file. + If the column headed `uni` does not show a `yes` for a specific font, it may be difficult or even impossible to extract the text (either by *copy'n'paste'* or with the help of `pdftotext`). + All you get may be unreadable gobbledygook. + +1. A new feature in recent Poppler versions (again: not available for XPDF-based `pdffonts`!) is the `-subst` parameter. + It prints the substitute fonts which would be used for non-embedded fonts: + + pdffonts -subst the.pdf + + If the tool returns an empty table, then the PDF may have + + (a) either all fonts embedded, or + (b) no fonts in use at all + + **Note:** whatever substitute that tool reports for non-embedded fonts is not true for Acrobat or other PDF viewers. + These may use a different font substitution method (Acrobat frequently generates a *MultipleMaster* font "on the fly" for use in place of a non-embedded one). + The tool's reported substitute font is only applicable for those programs which make use of the *FreeType* font engine. + +## `pdfresurrect` + +## `pdfdetach` + +## `pdftk` + +## Ghostscript + +1. If you want to generate (or re-generate) a PDF which does not use real fonts for text glyphs, but instead small vector shapes, you can use `-dNoOutputFonts` on the command line (starting with GS release v9.15): + + gs -o out.pdf -sDEVICE=pdfwrite -dNoOutputFonts input.pdf + +1. If you are on Windows, you'll have different Ghostscript executables: + + * **`gswin32c.exe`** and **`gswin64c.exe`** : + 32-bit/64-bit variants of the 'console' version of Ghostscript. + Note the **`c`** in the names. 
+ + * **`gswin32.exe`** and **`gswin64.exe`** : + 32-bit/64-bit variants of the 'GUI' version of Ghostscript. + (The GUI is only an extra window that opens to report `stdout` and `stderr` messages, and which serves to receive keyboard input if you use Ghostscript interactively). + Note there is no **`c`** in the names of the GUI versions. + + +## `zathura` + + +## `mupdf` + + +## `mutool` + + +## `podofo*` + + +## `peepdf` + + +## `pdfid.py` + + +## `pdf-parser.py` + + +## `qpdf` + +1. Be aware: `qpdf` is not perfect! Though it works extremely well for most practical cases, there are some things where its *"content preserving transformation"* fails: + + - It cannot preserve "layers" (in PDF parlance: `OCG`, optional content groups) in the output PDF. + If you let `qpdf` transform a PDF file containing layers, the output will have flattened them all into one. + - It does not know how to handle *incremental updates* of PDFs. + If you let `qpdf` transform a PDF file containing incrementally updated versions, the output will reflect the last file version only. + +1. For me personally, `qpdf` is most useful when it comes to the following three tasks: + + - Quick-check the PDF's internal structure by running: + + `qpdf --check the.pdf` + + - Return the correct `xref` table values by running: + + `qpdf --show-xref the.pdf` + + - Unpack as many internal structures as possible by running: + + `qpdf --qdf --object-streams=disable the.pdf unpacked.pdf` + + +[... TO BE COMPLETED ...] + + +---- + +Copyright (c) 2015 + +License: [Creative Commons "CC-BY-NC-SA" v4.0](http://creativecommons.org/licenses/by-nc-sa/4.0/) +![](http://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png) + diff --git a/VIMtricks.md b/VIMtricks.md new file mode 100644 index 0000000..c48f78b --- /dev/null +++ b/VIMtricks.md @@ -0,0 +1,63 @@ +# Useful tricks to know when editing PDF files with VIM + +1. Always start up using *`vim -b`*! 
+ *(Without the `-b` Vim would try to do its clever tricks on a binary-including file the wrong way.)* + Make this a habit. + Only binary editing mode (as triggered by *`-b`*) gives correct byte counting if needed. + +1. Once a PDF is open, you can use the *`:goto 3456`* command to jump to byte offset 3456. + Useful if you want to check `xref` entries. + +1. Remember how to open (in your default PDF viewer) the currently edited PDF file from within Vim: + *`:!open %`* (OSX), *`:!xdg-open %`* (Linux), *`:!start %`* (Windows). + (You know that *`%`* is a VIM shorthand variable for 'currently opened file', right?) + +1. Define a custom status line which returns useful info about the current cursor position. + Here is a suggestion: + + :set statusline=%F%m%r%h%w[%L][%{&ff}]%y[%p%%][%04l,%04v](%b)(%B)(%o) + + - **`%F`** : currently open file name (with full path) + - **`%m`** : modified flag (*`[+]`* if modified) + - **`%r`** : readonly flag (*`[RO]`* if readonly) + - **`%h`** : helpfile flag (*`[help]`* if helpfile -- maybe localized as *`[Hilfe]`*) + - **`%w`** : preview window flag (*`[Preview]`* if applicable) + - **`%L`** : total lines + - **`%{&ff}`** : file format (unix, dos,...) + - **`%y`** : file type as automatically recognized or manually set + - **`%p%%`** : relative position of cursor within file in percent + - **`%04l`** : current line/row number of cursor, left padded with zeroes + - **`%04v`** : current column position of cursor within the line, left padded with zeroes + - **`%b`** : decimal value of the current character under cursor + - **`%B`** : HEX value of current character under cursor + - **`%o`** : file byte offset of cursor + + Now a quick look at the status line shows the current file byte offset, line position, HEX value of character,... + +1. How to jump to a specific byte offset (calculated from the start of the file): + + :goto 37737 + + or + + :go 37737 + + or simply (without the `:` to switch to command mode): + + 37737go + +1. 
Looking at binary bytes? But want them displayed as Hex? Then try this: + + :set display=uhex + + Otherwise, the `ga` command displays the value of the character under the cursor. + + `g CTRL+g` shows which byte offset you are at in the file. + +---- + +Copyright (c) 2015 + +License: [Creative Commons "CC-BY-NC-SA" v4.0](http://creativecommons.org/licenses/by-nc-sa/4.0/) +![](https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png) + diff --git a/handcoded/102_A-vectorized.pdf b/handcoded/102_A-vectorized.pdf new file mode 100644 index 0000000..23d3777 Binary files /dev/null and b/handcoded/102_A-vectorized.pdf differ diff --git a/handcoded/104_fonts-not-embedded.pdf b/handcoded/104_fonts-not-embedded.pdf new file mode 100644 index 0000000..a8f7fa5 Binary files /dev/null and b/handcoded/104_fonts-not-embedded.pdf differ diff --git a/handcoded/105_transformation-matrix.pdf b/handcoded/105_transformation-matrix.pdf new file mode 100644 index 0000000..fb65611 Binary files /dev/null and b/handcoded/105_transformation-matrix.pdf differ diff --git a/handcoded/106_hello-troopers.pdf b/handcoded/106_hello-troopers.pdf new file mode 100644 index 0000000..038c85b Binary files /dev/null and b/handcoded/106_hello-troopers.pdf differ diff --git a/handcoded/107_perspectives-by-banksy.pdf b/handcoded/107_perspectives-by-banksy.pdf new file mode 100644 index 0000000..b92e509 Binary files /dev/null and b/handcoded/107_perspectives-by-banksy.pdf differ diff --git a/handcoded/108_text-rendering-modes.pdf b/handcoded/108_text-rendering-modes.pdf new file mode 100644 index 0000000..c8a8ef4 Binary files /dev/null and b/handcoded/108_text-rendering-modes.pdf differ diff --git a/handcoded/111_current-transformation-matrix-ctm.pdf b/handcoded/111_current-transformation-matrix-ctm.pdf new file mode 100644 index 0000000..1d09454 Binary files /dev/null and b/handcoded/111_current-transformation-matrix-ctm.pdf differ diff --git a/handcoded/112_play-with-tounicodetable.pdf 
b/handcoded/112_play-with-tounicodetable.pdf new file mode 100644 index 0000000..d4ec28d Binary files /dev/null and b/handcoded/112_play-with-tounicodetable.pdf differ diff --git a/handcoded/113_stegano-with-tounicodetable.pdf b/handcoded/113_stegano-with-tounicodetable.pdf new file mode 100644 index 0000000..0958d4d Binary files /dev/null and b/handcoded/113_stegano-with-tounicodetable.pdf differ diff --git a/handcoded/114_incrementally-updated.pdf b/handcoded/114_incrementally-updated.pdf new file mode 100644 index 0000000..a527e0d Binary files /dev/null and b/handcoded/114_incrementally-updated.pdf differ diff --git a/handcoded/READMEhandcoded.md b/handcoded/READMEhandcoded.md new file mode 100644 index 0000000..aadaf14 --- /dev/null +++ b/handcoded/READMEhandcoded.md @@ -0,0 +1,63 @@ +# Hand-coded PDF sample files + +This directory hosts a few hand-coded PDF sample files. + +They are designed to make it easy playing with them in a text editor. +Most of them contain commented lines explaining some technical background about the PDF features they use. +Inline comments also include a few suggestions about possible experiments to conduct with them. + +Before you play with these PDFs and modify them, create backups! + +* **`102_A-vectorized.pdf`** : + *Looks* like glyph for character 'A', but is a vector shape. + Text cannot be copied or extracted. + Play with drawing color, line segments, metadata entries and `cm` operator. + +* **`104_fonts-not-embedded.pdf`** : + Uses "real" text with font, but font is not embedded. + Text can be copied or extracted. + Play with text color, font size, font used, page size, page rotation... + +* **`105_transformation-matrix.pdf`** : + Play with the `cm` (concatenation matrix) operator, applied to text. + +* **`106_hello-troopers.pdf`** : + Play with RGB colors and with the `cm` operator, applied to text. + +* **`107_perspectives-by-banksy.pdf`** : + PDF contains one image. + Play with parameters for the dimensions of the image. 
+ Also play with `/CropBox` and `/Rotate` keys of the page dictionary. + +* **`108_text-rendering-modes.pdf`** : + PDF contains text. + Change PDF page size. + Change text color to "white", using RGB color values. + Extract or copy text from PDF page. + Text may be rendered in different modes. + Play with the `Tr` operator. + +* **`111_current-transformation-matrix-ctm.pdf`** : + PDF contains one 2x2 pixels image. + Image is re-used multiple times on the page, each time at a different location, with different scaling/skew/rotation by using different `cm` values. + Observe the effects of enabling/disabling different `cm` values (by commenting in/out respective PDF source lines). + Open file in different PDF viewers. + Do they all render the 2x2 image identically? + +* **`112_play-with-tounicodetable.pdf`** : + 2 pages, (almost) identical content. + Second page's content stream is slightly obfuscated. + Experiment with the `/Widths` array for a font used by the file. + Also play with the value for the `/FirstChar` key. + Study the influence of the `/ToUnicode` table upon the text extraction success. + 3 different `/ToUnicode` tables are included. + You can enable one of them (or none) and observe the effects upon text extraction or text copy'n'paste capabilities. + +* **`113_stegano-with-tounicodetable.pdf`** : + A variation of the `112_play-with-tounicodetable.pdf`. + See how a manipulation of the `/ToUnicode` table can in effect apply a `rot13`-like effect when extracting text. + +* **`114_incrementally-updated.pdf`** : + Delete all lines after the first `%%EOF` and observe the effects on the file. + Test also with `pdftotext` or with copy'n'pasting of text. 
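The experiment suggested for `114_incrementally-updated.pdf` (deleting everything after the first `%%EOF`) can also be done programmatically. The following is a minimal sketch under stated assumptions: the function names are hypothetical, and the naive byte search assumes that `%%EOF` does not occur inside a stream or string, which should hold for small hand-coded samples but not for arbitrary PDFs.

```python
def count_eof_markers(data: bytes) -> int:
    """Number of %%EOF markers; more than one hints at incremental updates."""
    return data.count(b"%%EOF")


def first_revision(data: bytes) -> bytes:
    """Return the bytes up to and including the first %%EOF marker.

    Assumption: the first %%EOF really ends the original revision
    (naive search, no parsing of the PDF structure).
    """
    marker = b"%%EOF"
    end = data.find(marker)
    if end == -1:
        raise ValueError("no %%EOF marker found")
    end += len(marker)
    # Keep the end-of-line sequence that directly follows the marker.
    if data[end:end + 2] == b"\r\n":
        end += 2
    elif data[end:end + 1] in (b"\n", b"\r"):
        end += 1
    return data[:end]
```

Writing the result of `first_revision(open("114_incrementally-updated.pdf", "rb").read())` to a new file (opened in `"wb"` mode) keeps the original sample untouched, in line with the advice above to create backups first.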
+ diff --git a/readme-pdfs/README.pdf b/readme-pdfs/README.pdf new file mode 100644 index 0000000..f08a17f Binary files /dev/null and b/readme-pdfs/README.pdf differ diff --git a/readme-pdfs/READMEhandcoded.pdf b/readme-pdfs/READMEhandcoded.pdf new file mode 100644 index 0000000..396b535 Binary files /dev/null and b/readme-pdfs/READMEhandcoded.pdf differ diff --git a/readme-pdfs/TODO.pdf b/readme-pdfs/TODO.pdf new file mode 100644 index 0000000..0e12ed3 Binary files /dev/null and b/readme-pdfs/TODO.pdf differ diff --git a/readme-pdfs/Tools.pdf b/readme-pdfs/Tools.pdf new file mode 100644 index 0000000..b0c205d Binary files /dev/null and b/readme-pdfs/Tools.pdf differ diff --git a/readme-pdfs/VIMtricks.pdf b/readme-pdfs/VIMtricks.pdf new file mode 100644 index 0000000..7e8852b Binary files /dev/null and b/readme-pdfs/VIMtricks.pdf differ