Fix not detecting Type 0 (composite) fonts
Previously, the function that gathers the fonts in use walked only
certain parts of the PDF object tree. This worked for PDFs with embedded
TrueType fonts, since the tree looks like this:

(parents)
  BaseFont: AAABQG+ArialMT
  ...
  FontDescriptor:
    FontFile2: ...
    FontName: AAABQG+ArialMT
    ...

Some PDFs have composite fonts, which look like this:

(parents)
  BaseFont: MUFUZY+ArialMT
  ...
  DescendantFonts: # AN ARRAY!
    - BaseFont: MUFUZY+ArialMT
      FontDescriptor:
        FontFile2: ...
        FontName: MUFUZY+ArialMT
        ...

In addition, the FontDescriptor is actually an indirect reference,
which means it doesn't (directly) have a "keys" attribute.

This fixes the false positives when detecting unembedded fonts by
teaching the tree walk to descend into arrays and indirect objects.

Note, I found this article helpful [1].

[1]: https://www.prepressure.com/pdf/basics/fonts
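
The traversal described in the message can be sketched without PyPDF2, using plain dicts and lists. This is only an illustration of the idea, not the code in this commit: `Ref`, `get_object` and `unembedded_fonts` are hypothetical names, `Ref` stands in for an indirect reference, and the tree is assumed to be acyclic (the sketch does not guard against reference cycles).

```python
class Ref:
    """Hypothetical stand-in for an indirect reference to another object."""

    def __init__(self, target):
        self._target = target

    def get_object(self):
        # Resolve the reference to the object it points at.
        return self._target


FONT_FILE_KEYS = ("/FontFile", "/FontFile2", "/FontFile3")


def walk(obj, fonts, embedded):
    """Record every font name seen, and every font name that has a font file."""
    if isinstance(obj, dict):
        if "/BaseFont" in obj:
            fonts.add(obj["/BaseFont"])
        if "/FontName" in obj and any(key in obj for key in FONT_FILE_KEYS):
            embedded.add(obj["/FontName"])
        for value in obj.values():
            walk(value, fonts, embedded)
    elif isinstance(obj, list):
        # Arrays such as DescendantFonts: visit each element.
        for child in obj:
            walk(child, fonts, embedded)
    elif isinstance(obj, Ref):
        # Indirect references have no keys of their own: resolve, then continue.
        walk(obj.get_object(), fonts, embedded)


def unembedded_fonts(tree):
    """Return the names of fonts that are used but have no embedded font file."""
    fonts, embedded = set(), set()
    walk(tree, fonts, embedded)
    return fonts - embedded


# A composite font shaped like the second tree in the commit message.
composite_font = {
    "/BaseFont": "MUFUZY+ArialMT",
    "/DescendantFonts": [
        Ref({
            "/BaseFont": "MUFUZY+ArialMT",
            "/FontDescriptor": Ref({
                "/FontFile2": b"",
                "/FontName": "MUFUZY+ArialMT",
            }),
        }),
    ],
}

print(unembedded_fonts(composite_font))  # prints set()
```

Without the list and `Ref` branches, the walk would stop at `/DescendantFonts`, never reach the font file, and report `MUFUZY+ArialMT` as unembedded, which is the false positive described above.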
Ben Thorner committed Jun 23, 2021
1 parent 4a5323b commit 83c137b
Showing 2 changed files with 9 additions and 3 deletions.
8 changes: 7 additions & 1 deletion app/embedded_fonts.py
@@ -2,9 +2,10 @@
 from io import BytesIO

 from PyPDF2 import PdfFileReader
+from PyPDF2.generic import IndirectObject


-def contains_unembedded_fonts(pdf_data):
+def contains_unembedded_fonts(pdf_data):  # noqa: C901 (too complex)
     """
     Code adapted from https://gist.github.com/tiarno/8a2995e70cee42f01e79
@@ -30,6 +31,11 @@ def walk(obj, fnt, emb):

             for k in obj.keys():
                 walk(obj[k], fnt, emb)
+        elif isinstance(obj, list):
+            for child in obj:
+                walk(child, fnt, emb)
+        elif isinstance(obj, IndirectObject):
+            walk(obj.getObject(), fnt, emb)

     pdf = PdfFileReader(pdf_data)
     fonts = set()
4 changes: 2 additions & 2 deletions tests/test_embedded_fonts.py
@@ -16,10 +16,10 @@


 @pytest.mark.parametrize(['pdf_file', 'has_unembedded_fonts'], [
-    (BytesIO(blank_with_address), True),  # false positive, I think, or maybe because created through Google sheets?
+    (BytesIO(blank_with_address), False),
     (BytesIO(example_dwp_pdf), False),
     (BytesIO(multi_page_pdf), True),
-    (BytesIO(valid_letter), True),  # false positive, I think, or maybe because created through Google sheets?
+    (BytesIO(valid_letter), False)
 ], ids=['blank_with_address', 'example_dwp_pdf', 'multi_page_pdf', 'valid_letter'])
 def test_contains_unembedded_fonts(pdf_file, has_unembedded_fonts):
     assert bool(contains_unembedded_fonts(pdf_file)) == has_unembedded_fonts