feat: 0.1.1 🎉
A very small update that exposes incremental updates under a new attribute. It also fixes a bug that caused an infinite loop while tokenizing an indirect reference.
aescarias committed Apr 14, 2024
1 parent dc0fc98 commit 3d5ee53
Showing 5 changed files with 10 additions and 10 deletions.
2 changes: 1 addition & 1 deletion docs/source/conf.py
@@ -9,7 +9,7 @@
project = 'pdfnaut'
copyright = '2024, Angel Carias'
author = 'Angel Carias'
-release = '0.1.0'
+release = '0.1.1'

# -- General configuration ---------------------------------------------------
# https://www.sphinx-doc.org/en/master/usage/configuration.html#general-configuration
10 changes: 5 additions & 5 deletions docs/source/guides/reading-pdf.rst
@@ -46,7 +46,7 @@ Let's take, for example, the ``sample.pdf`` file available in our `test suite <h
'Pages': PdfIndirectRef(object_number=3, generation=0),
'Type': PdfName(value=b'Catalog')}
-Two objects of note can be found: Outlines and Pages. Outlines stores what we commonly refer to as "bookmarks". Pages stores the page tree, which is what we are interested in.
+Two objects of note can be found: Outlines and Pages. ``Outlines`` stores what we commonly refer to as bookmarks. ``Pages`` stores the page tree, which is what we are interested in:

.. code-block:: python
@@ -57,7 +57,7 @@ Two objects of note can be found: Outlines and Pages. Outlines stores what we co
PdfIndirectRef(object_number=6, generation=0)],
'Type': PdfName(value=b'Pages')}
-The page tree is seen above. It contains two "kids" which may either be a single page object or a list of more pages.
+The page tree is seen above. Given that this document only includes 2 pages, they are specified as "kids" in the root node. For larger documents, it is not uncommon to divide the pages into multiple nodes for performance reasons. Next, we can extract the first page of the document:

.. code-block:: python
@@ -73,19 +73,19 @@ The page tree is seen above. It contains two "kids" which may either be a single
'Type': PdfName(value=b'Page')
}
-Above we see the actual page. This dictionary includes the media box which specifies the dimensions of the page when shown or printed (PDF is all about printing!), a reference to its parent, the resources used such as the font, and the Contents of the page.
+Above we see the actual page. This dictionary includes the *media box* which specifies the dimensions of the page when shown or printed (PDF is all about printing!), a reference to its parent, the resources used such as the font, and the contents of the page. We are looking for the contents of the page. Given that the Contents key includes a stream, it is set as an indirect reference.

.. code-block:: python
>>> page_contents = pdf.resolve_reference(first_page["Contents"])
>>> page_contents
PdfStream(details={'Length': 1074})
-We find ourselves with a stream. The contents of pages are defined in streams known as content streams. In this case, it is not compressed (it does not have a Filter). So we can easily read it.
+We find ourselves with a stream. The contents of pages are defined in streams known as **content streams**. This kind of stream includes instructions on how a PDF processor should render this page. In this case, it is not compressed (it does not have a Filter). So we can easily read it:

.. code-block:: python
>>> page_contents.decompress()
b'2 J\r\nBT\r\n0 0 0 rg\r\n/F1 0027 Tf\r\n57.3750 722.2800 Td\r\n( A Simple PDF File ) Tj\r\nET\r\nBT\r\n/F1 0010 Tf\r\n69.2500 688.6080 Td\r\n[...]ET\r\n'
-The content stream is comprised of operators and operands. In this case, it would simply write "A Simple PDF File" at the position defined by the Td operands (and with the font /F1 included in our Resources which, in this case, points to Helvetica).
+A content stream is comprised of operators and operands. In this case, it would simply write "A Simple PDF File" at the position defined by the Td operands (and with the font /F1 included in our Resources which, in this case, points to Helvetica).
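Reviewer note: the operator/operand structure described in the guide can be illustrated with a small standalone sketch. It does not use pdfnaut. `parse_operations` and `TOKEN` are hypothetical names introduced only for this illustration, and the tokenizer is deliberately simplified: it ignores arrays, dictionaries, hex strings, and nested or escaped parentheses. The input is only the first text block of the decoded stream shown in the guide; the elided part is left out.

```python
import re

# Excerpt of the decoded content stream from the guide (first text block only).
CONTENT = (
    b"2 J\r\nBT\r\n0 0 0 rg\r\n/F1 0027 Tf\r\n"
    b"57.3750 722.2800 Td\r\n( A Simple PDF File ) Tj\r\nET\r\n"
)

# Simplified tokens: a literal string without nested parentheses,
# or any run of characters that are neither whitespace nor parentheses.
TOKEN = re.compile(rb"\([^()]*\)|[^\s()]+")


def parse_operations(data: bytes) -> list[tuple[str, list]]:
    """Group tokens into (operator, operands) pairs.

    In a content stream the operands come first and the operator follows,
    so operands are collected until a non-operand token appears.
    """
    operations = []
    operands = []
    for match in TOKEN.finditer(data):
        token = match.group()
        if token.startswith(b"("):
            operands.append(token[1:-1])  # literal string, kept as bytes
        elif token.startswith(b"/"):
            operands.append(token.decode("ascii"))  # name such as /F1
        else:
            text = token.decode("ascii")
            try:
                operands.append(float(text))  # numeric operand
            except ValueError:
                # Not a number, so treat it as the operator that closes the group.
                operations.append((text, operands))
                operands = []
    return operations


for operator, args in parse_operations(CONTENT):
    print(operator, args)
```

Read this way, `Tf` receives the font name `/F1` and a size, `Td` receives the two position operands, and `Tj` receives the string to draw, which matches the description in the guide.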
2 changes: 1 addition & 1 deletion pdfnaut/__init__.py
@@ -8,6 +8,6 @@
__all__ = ("PdfParser", "PdfSerializer")

__name__ = "pdfnaut"
-__version__ = "0.1.0"
+__version__ = "0.1.1"
__description__ = "Explore PDFs with ease"
__license__ = "Apache 2.0"
4 changes: 2 additions & 2 deletions pdfnaut/parsers/pdf.py
@@ -287,7 +287,7 @@ def parse_compressed_xref(self) -> tuple[PdfXRefTable, dict[str, Any]]:

return table, xref_stream.details

-def parse_indirect_object(self, xref_entry: InUseXRefEntry) -> PdfObject | PdfStream | None:
+def parse_indirect_object(self, xref_entry: InUseXRefEntry) -> PdfObject | PdfStream:
"""Parses an indirect object not within an object stream, or basically, an object
that is directly referred to by an ``xref_entry``"""
self._tokenizer.position = xref_entry.offset
@@ -420,7 +420,7 @@ def parse_stream(self, xref_entry: InUseXRefEntry, extent: int) -> bytes:

return contents

-def resolve_reference(self, reference: PdfIndirectRef | tuple[int, int]):
+def resolve_reference(self, reference: PdfIndirectRef | tuple[int, int]) -> PdfObject | PdfStream | PdfNull:
"""Resolves a reference into the indirect object it points to.
Arguments:
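Reviewer note: the new return annotation on `resolve_reference` suggests that callers may receive a null object rather than only a regular object or a stream. A minimal sketch of how a caller might branch on that result follows. The `PdfNull` and `PdfStream` classes here are hypothetical stand-ins defined for the example, not pdfnaut's own definitions, and `describe` is an illustrative helper that is not part of the library.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any


@dataclass
class PdfNull:
    """Stand-in for a null object (hypothetical, not the library's class)."""


@dataclass
class PdfStream:
    """Stand-in for a stream (hypothetical): a details dictionary plus raw bytes."""

    details: dict[str, Any] = field(default_factory=dict)
    raw: bytes = b""


def describe(resolved: Any) -> str:
    """Summarize a resolved object, handling the null case explicitly."""
    if isinstance(resolved, PdfNull):
        return "null object"
    if isinstance(resolved, PdfStream):
        length = resolved.details.get("Length", len(resolved.raw))
        return f"stream of {length} bytes"
    return f"object of type {type(resolved).__name__}"


print(describe(PdfNull()))
print(describe(PdfStream({"Length": 1074})))
print(describe({"Type": "Page"}))
```

With an explicit annotation, a type checker can flag call sites that use the result as a stream without first ruling out the other cases.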
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -1,6 +1,6 @@
[project]
name = "pdfnaut"
-version = "0.1.0"
+version = "0.1.1"
description = "Explore PDFs with ease"
authors = [
{ name = "Angel Carias" }
