
Merge pull request #1 from jwilk/spelling

Fix typos
KurtPfeifle committed Jul 7, 2016
2 parents f050687 + 29de358 commit fd85e6bf7927ed0d016d5ae63478da75f2f9e4df
Showing with 14 additions and 14 deletions.
  1. +6 −6 Tools.md
  2. +1 −1 handcoded/READMEhandcoded.md
  3. +7 −7 handcoded/textextract/READMEtextextract.md
Tools.md
@@ -19,7 +19,7 @@ Most of them are available for all major OS platforms (Windows, OSX, Linux):
It can also extract a copy of the attached files from the PDF.
* **`pdfresurrect`**.
- This utility reports if a PDF makes use of the offical feature that allows to make *"incremental updates"* of the document.
+ This utility reports if a PDF makes use of the official feature that allows to make *"incremental updates"* of the document.
It can also restore previous versions of the file.
* **`pdftk`**.
@@ -37,12 +37,12 @@ Most of them are available for all major OS platforms (Windows, OSX, Linux):
* **`zathura`**.
This utility is a very fast and lightweight PDF viewer.
(Additional plugins extend it to a viewer for PostScript, DjVu and CB files.)
- It has a very limited GUI; instead of mouse and menue buttons, it is to be controlled from the keyboard.
+ It has a very limited GUI; instead of mouse and menu buttons, it is to be controlled from the keyboard.
* **`mupdf`**.
This utility is another lightweight PDF viewer.
(It can also display XPS files.)
- It has a very limited GUI; instead of mouse and menue buttons, it is to be controlled from the keyboard.
+ It has a very limited GUI; instead of mouse and menu buttons, it is to be controlled from the keyboard.
* **`mutool`**.
This utility is a sibling to `mupdf`.
@@ -53,7 +53,7 @@ Most of them are available for all major OS platforms (Windows, OSX, Linux):
This family of utilities can ...
* **`peepdf`**.
- `peepdf.py` is a Python suite of toolto explore PDF files. It was initially created to help find out if a PDF contains harmful contents or not. It is however extremely useful beyond PDF malware research, because it helps to explore, study, investigate and understand PDF file structures in general.
+ `peepdf.py` is a Python suite of tool to explore PDF files. It was initially created to help find out if a PDF contains harmful contents or not. It is however extremely useful beyond PDF malware research, because it helps to explore, study, investigate and understand PDF file structures in general.
* **`pdfid.py`**.
@@ -74,7 +74,7 @@ These are beneficial also to "old timers" who are using them since many years al
pdfinfo -meta the.pdf
- 1. To see all page boxes (MediaBox, CropBox, TrimBox, ArtBox) used by the file (or what page boxes are implicitely used because they are undefined), use:
+ 1. To see all page boxes (MediaBox, CropBox, TrimBox, ArtBox) used by the file (or what page boxes are implicitly used because they are undefined), use:
pdfinfo -box the.pdf
@@ -132,7 +132,7 @@ These are beneficial also to "old timers" who are using them since many years al
**Note:** whatever substitute that tool reports for non-embedded fonts is not true for Acrobat or other PDF viewers.
These may use a different font substitution method (Acrobat frequently generates a *MultipleMaster* font "on the fly" for use in place of a non-embedded one).
- The tool's reported substitute font is only applicable for those programms which make use of the *FreeType* font engine.
+ The tool's reported substitute font is only applicable for those programs which make use of the *FreeType* font engine.
## `pdfresurrect`
handcoded/READMEhandcoded.md
@@ -62,6 +62,6 @@ Before you play with these PDFs and modify them, create backups!
Test also with `pdftotext` or with copy'n'pasting of text.
* **`115_little-riddle.pdf`** :
- A littel riddle: Something is hidden in this PDF -- what is it?
+ A little riddle: Something is hidden in this PDF -- what is it?
How did the hiding happen??
handcoded/textextract/READMEtextextract.md
@@ -11,17 +11,17 @@ Here we take a look at PDF files which indeed use "fonts" to represent "text".
We'll have 5 different files here, all of them identically looking in all PDF viewers, and very similar in their PDF source code. Of course there are some differences in the source code which make them to behave differently whenever you try to access the textual content outside from *rendering* the pages:
1. **`textextract-good.pdf`**
- This file lets you extract or copy'n' paste all text correctly.
+ This file lets you extract or copy'n'paste all text correctly.
1. **`textextract-bad1.pdf`**
- This file lets you extract or copy'n' paste all text -- but none of the strings appears correctly, all of them look like gobble-di-gook.
+ This file lets you extract or copy'n'paste all text -- but none of the strings appears correctly, all of them look like gobble-di-gook.
1. **`textextract-bad2.pdf`**
- This file lets you extract or copy'n' paste all text -- but only the first half of the strings appears correctly, the other half is somehow garbled.
+ This file lets you extract or copy'n'paste all text -- but only the first half of the strings appears correctly, the other half is somehow garbled.
1. **`textextract-bad3.pdf`**
- This file lets you extract or copy'n' paste all text -- but only the second half of the strings appears correctly, the first half is somehow garbled.
+ This file lets you extract or copy'n'paste all text -- but only the second half of the strings appears correctly, the first half is somehow garbled.
1. **`textextract-bad4.pdf`**
- This file lets you extract or copy'n' paste all text -- but only the first half of the strings appears correctly, the second half may, superfically looking, appear correct, but in reality is *"rot13"* encoded.
+ This file lets you extract or copy'n'paste all text -- but only the first half of the strings appears correctly, the second half may, superficially looking, appear correct, but in reality is *"rot13"* encoded.
- If you compare the different files's source code, you'll easily find where they differ:
+ If you compare the different files' source code, you'll easily find where they differ:
it's the way how they do define and set up (or don't do it at all) their `/ToUnicode` tables for the respective font used by the text strings.
A missing or an incorrect or a corrupt `/ToUnicode` table is the number 1 reason why text is not completely or correctly extractable from (otherwise apparently correct and complete and spec-conforming) "unprotected" PDF files.
@@ -38,7 +38,7 @@ Do the same for each of the `textextract-bad[1-4].pdf` files.
Find out what the differences for each of them are when compared to the `textextract-good.pdf`.
Also compare the `textextract-bad[1-4].pdf` files to each other.
- You will see, that it is the `/ToUnicode` informations contained in each of the files which determines whether, and to what degree *correctly*, each text extraction (or copying) works.
+ You will see, that it is the `/ToUnicode` information contained in each of the files which determines whether, and to what degree *correctly*, each text extraction (or copying) works.
However, `/ToUnicode` does not have any influence on the *rendering* and readability of the PDF page in the PDF viewers.
Manipulated `/ToUnicode` tables can be (ab)used to confer hidden (by obscurity) messages to a receiver, which only reveal their intended meaning when extracted as text, but which use a "decoy" text when reviewed on a rendered PDF page.
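The role of `/ToUnicode` described above can be illustrated with a toy model. This is only a sketch, not a PDF parser: the character codes, the dictionaries and the `extract_text` helper are hypothetical stand-ins, and a real `/ToUnicode` entry is a CMap stream rather than a Python dict. It shows why the same character codes can yield correct text, rot13 "decoy" text, or nothing usable, while the rendered glyphs stay the same.

```python
# Toy model of PDF text extraction: the page content stores character codes,
# and a /ToUnicode-style table maps each code to the Unicode text it stands for.
# All names and values here are illustrative assumptions, not real PDF data.
import codecs

# Hypothetical character codes as they might appear in a content stream.
codes = [1, 2, 3, 3, 4]

# A correct table: every code maps to the character its glyph actually shows.
good_map = {1: "H", 2: "e", 3: "l", 4: "o"}

# A manipulated table: same codes, but each maps to its rot13 counterpart.
rot13_map = {code: codecs.encode(ch, "rot13") for code, ch in good_map.items()}


def extract_text(codes, to_unicode):
    """Return the text an extractor would report for the given codes.

    Codes without a table entry come out as U+FFFD, standing in for the
    garbage that tools tend to produce when the mapping is missing.
    """
    return "".join(to_unicode.get(code, "\ufffd") for code in codes)


print(extract_text(codes, good_map))   # correct table
print(extract_text(codes, rot13_map))  # manipulated table
print(extract_text(codes, {}))         # missing table
```

Rendering never consults the table in this model, which mirrors the point that `/ToUnicode` affects extraction and copying only.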
