**Introduction**

Most PDF-to-text parsers do not provide layout information. Often, even sentences are split by arbitrary CR/LFs, which makes paragraph boundaries very difficult to find. This complicates chunking, and it makes it hard to attach long-running context such as the section header to passages when indexing or vectorizing PDFs for LLM applications such as retrieval-augmented generation (RAG).

LayoutPDFReader solves this problem by parsing PDFs along with hierarchical layout information such as:

- Sections and subsections, along with their levels.
- Paragraphs, with their lines combined.
- Links between sections and paragraphs.
- Tables, along with the section each table is found in.
- Lists and nested lists.

With LayoutPDFReader, developers can find optimal chunks of text to vectorize, which helps work around the limited context window sizes of LLMs.
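
To make the idea of carrying section headers into each chunk concrete, here is a minimal sketch in plain Python. It does not use llmsherpa: the `Node` class, the `chunks_with_context` function, and the sample tree are hypothetical stand-ins for a parsed layout, used only to show how a hierarchy lets every paragraph be prefixed with the titles of its enclosing sections.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a parsed layout element (not an llmsherpa class)."""
    text: str
    kind: str  # "section" or "paragraph"
    children: list = field(default_factory=list)


def chunks_with_context(node, headers=()):
    """Yield each paragraph prefixed with the titles of its enclosing sections."""
    if node.kind == "section":
        # Sections are not emitted themselves; they extend the header trail.
        headers = headers + (node.text,)
    else:
        yield " > ".join(headers) + "\n" + node.text
    for child in node.children:
        yield from chunks_with_context(child, headers)


# Made-up sample hierarchy, for illustration only.
doc_tree = Node("Experience", "section", [
    Node("Acme Corp", "section", [Node("Built the billing pipeline.", "paragraph")]),
    Node("Worked across three teams.", "paragraph"),
])

chunks = list(chunks_with_context(doc_tree))
for chunk in chunks:
    print(chunk)
    print("---")
```

Each chunk stays short enough to vectorize on its own while still recording where in the document it came from.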

**Installation**

Install the llmsherpa library.

In [1]:
!pip install llmsherpa

Collecting llmsherpa
  Downloading llmsherpa-0.1.2-py3-none-any.whl (10 kB)
Installing collected packages: llmsherpa
Successfully installed llmsherpa-0.1.2


The first step in using LayoutPDFReader is to give it a URL or a file path and get back a document object.

In [2]:
from llmsherpa.readers import LayoutPDFReader

llmsherpa_api_url = "https://readers.llmsherpa.com/api/document/developer/parseDocument?renderFormat=all"
pdf_url = "/content/CKMourya_Resume.pdf"  # a local file path here; a URL pointing to a PDF is also allowed
pdf_reader = LayoutPDFReader(llmsherpa_api_url)
doc = pdf_reader.read_pdf(pdf_url)
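
Because the variable is named `pdf_url` but holds a local path in this notebook, it can help to make the distinction explicit in your own code. The helper below is a hypothetical convenience, not part of llmsherpa, and it makes no claim about how the library itself tells the two apart. It only assumes that remote sources use an `http` or `https` scheme.

```python
from urllib.parse import urlparse


def is_remote(source: str) -> bool:
    """Return True for http(s) URLs, False for anything treated as a local path."""
    return urlparse(source).scheme in ("http", "https")


# Example inputs: the path from this notebook and an illustrative URL.
for source in ("/content/CKMourya_Resume.pdf", "https://example.com/paper.pdf"):
    kind = "remote URL" if is_remote(source) else "local file path"
    print(f"{source} -> {kind}")
```

A check like this could be used, for instance, to verify that a local file exists before handing the path to the reader.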

In [3]:
doc

<llmsherpa.readers.layout_reader.Document at 0x7dca537370d0>

In [4]:
doc.sections()

[<llmsherpa.readers.layout_reader.Section at 0x7dca5034ee60>,
 <llmsherpa.readers.layout_reader.Section at 0x7dca5034eb60>,
 <llmsherpa.readers.layout_reader.Section at 0x7dca5034e8c0>,
 <llmsherpa.readers.layout_reader.Section at 0x7dca5034e740>,
 <llmsherpa.readers.layout_reader.Section at 0x7dca5034e9e0>,
 <llmsherpa.readers.layout_reader.Section at 0x7dca5034e800>,
 <llmsherpa.readers.layout_reader.Section at 0x7dca5034e980>,
 <llmsherpa.readers.layout_reader.Section at 0x7dca5034e8f0>,
 <llmsherpa.readers.layout_reader.Section at 0x7dca5034e5c0>,
 <llmsherpa.readers.layout_reader.Section at 0x7dca5034e3b0>,
 <llmsherpa.readers.layout_reader.Section at 0x7dca5034e260>,
 <llmsherpa.readers.layout_reader.Section at 0x7dca5034e170>,
 <llmsherpa.readers.layout_reader.Section at 0x7dca5034e140>,
 <llmsherpa.readers.layout_reader.Section at 0x7dca5034e110>,
 <llmsherpa.readers.layout_reader.Section at 0x7dca5034dfc0>,
 <llmsherpa.readers.layout_reader.Section at 0x7dca5034df90>,
 <llmshe
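
The default representation of the section objects above is not very informative. A small outline helper, sketched below, prints one readable line per section. It relies on the `title` attribute used later in this notebook and assumes each section also exposes a numeric `level`; that attribute is an assumption here, so the code falls back to 0 when it is missing. The `fake_sections` objects are stand-ins so the sketch runs without calling the parsing service.

```python
from types import SimpleNamespace


def outline(sections):
    """Return one indented line per section, using its level for indentation."""
    lines = []
    for section in sections:
        level = getattr(section, "level", 0)  # assumption: sections expose a numeric level
        lines.append("  " * level + section.title)
    return lines


# Stand-in objects with made-up titles, for illustration only.
fake_sections = [
    SimpleNamespace(title="Experience", level=0),
    SimpleNamespace(title="Acme Corp", level=1),
    SimpleNamespace(title="Skills", level=0),
]
print("\n".join(outline(fake_sections)))
```

With a real document, the same helper could be tried as `outline(doc.sections())`.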

In [5]:
from IPython.display import HTML

selected_section = None
# find a section in the document by title
for section in doc.sections():
    title = section.title.lower()
    if 'skill' in title or 'expertise' in title:
        selected_section = section
        break
# use include_children=True and recurse=True to fully expand the section.
# include_children alone returns only one sublevel of children, whereas recurse goes through all the descendants
# render the matched section, or a short message if no title matched
html = (selected_section.to_html(include_children=True, recurse=True)
        if selected_section is not None else "<p>No matching section found.</p>")
HTML(html)
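
The difference between `include_children` and `recurse` described in the comments above can be illustrated with a toy tree. `ToyNode` and its `to_text` method are hypothetical and only model the behavior the comments describe; this is not llmsherpa's implementation.

```python
class ToyNode:
    """Hypothetical tree node used only to illustrate the two flags."""

    def __init__(self, text, children=None):
        self.text = text
        self.children = children or []

    def to_text(self, include_children=False, recurse=False):
        parts = [self.text]
        if include_children:
            for child in self.children:
                # Without recurse, each child contributes only its own text;
                # with recurse, the child expands its own children in turn.
                parts.append(child.to_text(include_children=recurse, recurse=recurse))
        return "\n".join(parts)


tree = ToyNode("Skills", [ToyNode("Languages", [ToyNode("Python")])])

shallow = tree.to_text(include_children=True)             # one sublevel only
deep = tree.to_text(include_children=True, recurse=True)  # all descendants
print(shallow)
print("---")
print(deep)
```

In this toy model, `shallow` stops at "Languages" while `deep` also reaches "Python".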