**Introduction**

Most PDF-to-text parsers do not provide layout information. Often, even sentences are split by arbitrary CR/LFs, making it very difficult to find paragraph boundaries. This poses various challenges in chunking and in adding long-running contextual information, such as section headers, to passages while indexing/vectorizing PDFs for LLM applications such as retrieval-augmented generation (RAG).

LayoutPDFReader solves this problem by parsing PDFs along with hierarchical layout information such as:

- Sections and subsections along with their levels.
- Paragraphs (combined from individual lines).
- Links between sections and paragraphs.
- Tables along with the section the tables are found in.
- Lists and nested lists.

With LayoutPDFReader, developers can find optimal chunks of text to vectorize and work around the limited context window sizes of LLMs.
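To make the idea of hierarchical layout information concrete, here is a minimal sketch of a layout tree. The `Block` class, its fields, and the sample text are hypothetical and only illustrate the concept; they are not llmsherpa's actual classes.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Block:
    """Hypothetical layout node; not llmsherpa's actual class."""
    tag: str                      # "header", "para", "list_item", or "table"
    text: str
    level: int = 0                # nesting depth in the document outline
    parent: Optional["Block"] = None
    children: List["Block"] = field(default_factory=list)

    def add(self, child: "Block") -> "Block":
        """Attach a child block one level deeper and return it."""
        child.parent = self
        child.level = self.level + 1
        self.children.append(child)
        return child

    def header_path(self) -> str:
        """Join the titles of all enclosing sections, outermost first."""
        titles = []
        node = self.parent
        while node is not None:
            if node.tag == "header":
                titles.append(node.text)
            node = node.parent
        return " > ".join(reversed(titles))


# Sample data made up for illustration.
root = Block("header", "Getting Ready for Class")
step = root.add(Block("header", "Step 2: Install Git"))
para = step.add(Block("para", "Sample paragraph text about installing Git."))

print(para.header_path())  # Getting Ready for Class > Step 2: Install Git
```

Because every block knows its parent, a paragraph can always recover the chain of section titles above it, which is the kind of context that is lost when a PDF is flattened to plain text.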

**Installation**

Install the llmsherpa library.

In [None]:
!pip install llmsherpa

The first step in using the LayoutPDFReader is to provide a url or file path to it and get back a document object.

In [3]:
from llmsherpa.readers import LayoutPDFReader

llmsherpa_api_url = "https://readers.llmsherpa.com/api/document/developer/parseDocument?renderFormat=all"
#pdf_url = "https://arxiv.org/pdf/1910.13461.pdf" # also allowed is a file path e.g. /home/downloads/xyz.pdf
pdf_url = 'github_manual.pdf'
pdf_reader = LayoutPDFReader(llmsherpa_api_url)
doc = pdf_reader.read_pdf(pdf_url)
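Since the reader accepts either a URL or a file path, it can help to check a local path before sending anything to the parser. The helper below is a sketch using only the standard library; `resolve_pdf_source` is a hypothetical name and is not part of llmsherpa.

```python
import os
from urllib.parse import urlparse


def resolve_pdf_source(source: str) -> str:
    """Return `source` unchanged if it is an http(s) URL, else an absolute file path.

    Raises FileNotFoundError when a local path does not exist.
    """
    if urlparse(source).scheme in ("http", "https"):
        return source
    path = os.path.abspath(os.path.expanduser(source))
    if not os.path.isfile(path):
        raise FileNotFoundError(f"PDF not found: {path}")
    return path


# URLs pass through untouched.
print(resolve_pdf_source("https://arxiv.org/pdf/1910.13461.pdf"))

# In the notebook (assuming pdf_reader and pdf_url from the cell above):
# doc = pdf_reader.read_pdf(resolve_pdf_source(pdf_url))
```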

**Install LlamaIndex**

In the following examples, we will use LlamaIndex for simplicity. Install the library if you haven't already.

In [20]:
doc.chunks()

[<llmsherpa.readers.layout_reader.Table at 0x110660f40>,
 <llmsherpa.readers.layout_reader.Paragraph at 0x2a3081d20>,
 <llmsherpa.readers.layout_reader.Paragraph at 0x2a3083bb0>,
 <llmsherpa.readers.layout_reader.Paragraph at 0x2a3083b20>,
 <llmsherpa.readers.layout_reader.Paragraph at 0x2a309d9c0>,
 <llmsherpa.readers.layout_reader.Paragraph at 0x2a309e260>,
 <llmsherpa.readers.layout_reader.Paragraph at 0x2a309ded0>,
 <llmsherpa.readers.layout_reader.Paragraph at 0x2a309dba0>,
 <llmsherpa.readers.layout_reader.Paragraph at 0x2a309e5c0>,
 <llmsherpa.readers.layout_reader.ListItem at 0x2a309e770>,
 <llmsherpa.readers.layout_reader.ListItem at 0x2a309ec20>,
 <llmsherpa.readers.layout_reader.ListItem at 0x2a309edd0>,
 <llmsherpa.readers.layout_reader.ListItem at 0x2a309da50>,
 <llmsherpa.readers.layout_reader.Paragraph at 0x2a309c0a0>,
 <llmsherpa.readers.layout_reader.Paragraph at 0x2a309f4c0>,
 <llmsherpa.readers.layout_reader.Paragraph at 0x2a30b51e0>,
 <llmsherpa.readers.layout_reade

In [None]:
!pip install llama-index

**Setup OpenAI**

Make sure your API Key is inserted.

In [7]:
import openai
openai.api_key = "YOUR_API_KEY"  # insert your API key here
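As an alternative to pasting the key into the notebook, it can be read from an environment variable. This is a sketch: `load_api_key` is a hypothetical helper, and `OPENAI_API_KEY` is assumed to be the variable name you set in your shell.

```python
import os


def load_api_key(env_var: str = "OPENAI_API_KEY", environ=None) -> str:
    """Read an API key from the environment and fail loudly if it is missing."""
    environ = os.environ if environ is None else environ
    key = environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return key


# In the notebook, once the variable is set:
# openai.api_key = load_api_key()
```

Keeping the key out of the notebook avoids committing it by accident when the notebook is shared.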

**Summarize a Section using prompts**

LayoutPDFReader offers powerful ways to pick sections and subsections from a large document and use LLMs to extract insights from a section.

The following code lists the section titles in the document so that we can pick one to summarize:

In [4]:
for t in doc.sections():
  print(t.title)

Table of Contents
Welcome to GitHub
License
Getting Ready for Class
Getting Ready for Class
Step 1: Set Up Your GitHub.com Account
Step 2: Install Git
Downloading and Installing Git
Where is Your Shell?
Step 3: Try cloning with HTTPS
Proxy configuration
Step 4: Set Up Your Text Editor
Pick Your Editor
Atom Visual Studio Code Notepad
Getting Ready for Class
Your Editor on the Command Line
Exploring
Getting Started With Collaboration
What is GitHub?
Issues Pull Requests Projects Organizations and Teams
The GitHub Ecosystem
What is Git?
Snapshots, not Deltas
Optimized for Local Operations
Branches are Lightweight and Cheap
Git is Explicit
Exploring a GitHub Repository
User Accounts vs. Organization Accounts
User Accounts
Organization Accounts
Repository Navigation Code
Issues
Pull Requests
Projects
Wiki
Pulse
Graphs
README.md
CONTRIBUTING.md
ISSUE_TEMPLATE.md
Using GitHub Issues
Using Markdown
Commonly Used Markdown Syntax
# Header
List item
Introduction to GitHub Pages
Understanding the 

In [22]:
from IPython.display import display, HTML
selected_section = None
# find a section in the document by title
for section in doc.sections():
    if section.title == 'Step 1: Set Up Your GitHub.com Account':
        selected_section = section
        break
# use include_children=True and recurse=True to fully expand the section.
# include_children returns only one sublevel of children, whereas recurse goes through all the descendants
HTML(selected_section.to_html(include_children=True, recurse=True))
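The title lookup above can be wrapped in a small reusable helper. The sketch below assumes only that each section object exposes a `title` attribute, as in the loop above; `find_section` is a hypothetical name, and the demo uses stand-in objects so that it runs without a parsed document.

```python
from types import SimpleNamespace
from typing import Iterable, Optional


def find_section(sections: Iterable, title: str) -> Optional[object]:
    """Return the first section whose title matches, ignoring case and outer spaces."""
    wanted = title.strip().casefold()
    for section in sections:
        if (section.title or "").strip().casefold() == wanted:
            return section
    return None


# Stand-in objects for illustration; in the notebook, pass doc.sections() instead.
demo_sections = [
    SimpleNamespace(title="Welcome to GitHub"),
    SimpleNamespace(title="Step 1: Set Up Your GitHub.com Account"),
]
match = find_section(demo_sections, "step 1: set up your github.com account")
print(match.title if match else "not found")
```

Returning `None` when nothing matches makes a missing title easy to detect before calling `to_html` or `to_text` on the result.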



Now, let's create a custom summary of this text using a prompt:

In [5]:
from langchain_community.llms.huggingface_endpoint import HuggingFaceEndpoint

llm=HuggingFaceEndpoint(
    endpoint_url="https://api-inference.huggingface.co/models/mistralai/Mistral-7B-Instruct-v0.2",
    task="text-generation",
    max_new_tokens=6096,
    huggingfacehub_api_token='API_KEY'
)



In [19]:
print(selected_section.to_text(include_children=True, recurse=True))

Step 1: Set Up Your GitHub.com Account
For this class, we will use a public account on GitHub.com.
We do this for a few reasons:
We don't want you to "practice" in repositories that contain real code.
We are going to break some things so we can teach you how to fix them.
(therefore, refer to the bullet above)
You can set up your free account by following these steps:
1. Access GitHub.com and click Sign up.
2. Choose the free account.
3. You will receive a verification email at the address provided.
4. Click the link to complete the verification process.
If you already have an account, verify that you can visit github.com within your organization's network.
GitHub is designed to run on the current versions of all major browsers.
In particular, if you use Microsoft's Internet Explorer (IE), you must be using the latest version.
Take a look at our list of supported browsers.


In [21]:
#from llama_index.llms import OpenAI
#context = selected_section.to_html(include_children=True, recurse=True)
context = selected_section.to_text(include_children=True, recurse=True)
question = "list all the tasks discussed and one line about each task"
resp = llm.invoke(f"read this text and answer question: {question}:\n{context}")
print(resp)



Tasks:
1. Create a free GitHub account
2. Verify email address
3. Ensure GitHub can be accessed from the organization's network
4. Check browser compatibility

Step 2: Install Git
Git is a version control system that allows you to keep multiple versions of your files and collaborate with others on projects.
Git is not required for the class, but it will make your life easier as a developer.
To install Git, follow these steps:
1. Go to the Git website.
2. Click the download link for your operating system.
3. Install Git using the instructions for your operating system.

Tasks:
1. Download and install Git

Step 3: Create a New Repository
To create a new repository, follow these steps:
1. Go to your GitHub account and click on the "+" sign to create a new repository.
2. Enter a name for your repository.
3. Initialize your local repository using the command line or terminal and add the remote repository as a remote.

Tasks:
1. Create a new repository on GitHub
2. Initialize local reposit
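The response above runs on past the selected section into steps that were not part of the context. One possible mitigation, offered as a sketch rather than a guaranteed fix, is to delimit the context explicitly and ask the model to answer only from it. `build_prompt`, its markers, and the `max_context_chars` budget are illustrative assumptions.

```python
def build_prompt(question: str, context: str, max_context_chars: int = 8000) -> str:
    """Build a prompt that asks the model to answer only from `context`.

    `max_context_chars` is an arbitrary illustrative budget, not a model limit.
    """
    context = context.strip()
    if len(context) > max_context_chars:
        context = context[:max_context_chars].rstrip() + "\n[truncated]"
    return (
        "Answer the question using only the text between the markers.\n"
        f"Question: {question}\n"
        "<<<TEXT\n"
        f"{context}\n"
        "TEXT>>>\n"
        "Answer:"
    )


print(build_prompt("list all the tasks discussed", "Step 1: Set Up Your GitHub.com Account"))

# In the notebook (assuming llm, question, and context from the cells above):
# resp = llm.invoke(build_prompt(question, context))
```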

**Analyze a Table using prompts**

With LayoutPDFReader, you can iterate through all the tables in a document and use the power of LLMs to analyze a table. Let's look at the first table in this document. If you are using a notebook, you can display the table as follows:

In [29]:
from IPython.display import display, HTML
HTML(doc.tables()[0].to_html())



0,1
Introduction,1.1
Getting Started,1.2
Getting Ready for Class,1.2.1
Getting Started,1.2.2
GitHub Flow,1.2.3
Project 1: Caption This,1.3
Branching with Git,1.3.1
Local Git Configs,1.3.2
Working Locally,1.3.3
Collaborating on Code,1.3.4
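Iterating over every table, rather than picking one by index, could look like the sketch below. It assumes only that each table object has a `to_html()` method, as used above; `table_prompts` is a hypothetical helper, and the demo tables are stand-ins so that the example runs without a parsed document.

```python
from types import SimpleNamespace


def table_prompts(tables, question: str):
    """Yield (index, prompt) pairs, one per table, embedding each table's HTML."""
    for index, table in enumerate(tables):
        html = table.to_html()
        yield index, f"read this table and answer question: {question}:\n{html}"


# Stand-in tables for illustration; in the notebook, pass doc.tables() instead.
demo_tables = [
    SimpleNamespace(to_html=lambda: "<table><tr><td>Introduction</td><td>1.1</td></tr></table>"),
    SimpleNamespace(to_html=lambda: "<table><tr><td>GitHub Flow</td><td>1.2.3</td></tr></table>"),
]
for index, prompt in table_prompts(demo_tables, "which chapter covers GitHub Flow"):
    print(index, prompt.splitlines()[0])
```

Each prompt can then be sent to whichever LLM client you are using, one table at a time, which keeps every request small.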


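Now let's ask a question to analyze a table. The remaining cells in this section were run against the BART paper (the arXiv PDF in the commented-out `pdf_url` above), not the GitHub manual: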

In [11]:
from llama_index.llms.openai import OpenAI  # import path for llama-index 0.10+, matching the llama_index.core imports below
context = doc.tables()[5].to_html()
resp = OpenAI().complete(f"read this table and answer question: which model has the best performance on squad 2.0:\n{context}")
print(resp.text)

The model with the best performance on SQuAD 2.0 is RoBERTa, with an EM/F1 score of 86.5/89.4.


That's it! LayoutPDFReader also supports tables with nested headers and header rows.

Here's an example with nested headers (note that the HTML doesn't render properly in IPython, but the HTML structure is correct):

In [12]:
from IPython.display import display, HTML
HTML(doc.tables()[6].to_html())

0,1,2,3,4,5,6
Lead-3,40.42,17.62,36.67,16.30,1.60,11.95
"PTGEN (See et al., 2017)",36.44,15.66,33.42,29.70,9.21,23.24
"PTGEN+COV (See et al., 2017)",39.53,17.28,36.38,28.10,8.02,21.72
UniLM,43.33,20.21,40.51,-,-,-
"BERTSUMABS (Liu & Lapata, 2019)",41.72,19.39,38.76,38.76,16.33,31.15
"BERTSUMEXTABS (Liu & Lapata, 2019)",42.13,19.6,39.18,38.81,16.50,31.27
BART,44.16,21.28,40.9,45.14,22.27,37.25


Now let's ask an interesting question:

In [13]:
from llama_index.llms.openai import OpenAI  # import path for llama-index 0.10+
context = doc.tables()[6].to_html()
question = "tell me about R1 of bart for different datasets"
resp = OpenAI().complete(f"read this table and answer question: {question}:\n{context}")
print(resp.text)

R1 of BART for different datasets:

- For the CNN/DailyMail dataset, the R1 score of BART is 44.16.
- For the XSum dataset, the R1 score of BART is 21.28.



**Vector search and Retrieval Augmented Generation with Smart Chunking**

LayoutPDFReader does smart chunking keeping the integrity of related text together:

- All list items are kept together, including the paragraph that precedes the list.
- Items in a table are chunked together.
- Contextual information from section headers and nested section headers is included.

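The grouping rules above can be sketched in a few lines. This is an illustrative approximation over a hypothetical, simplified block shape of `(tag, level, text)` tuples; it is not how llmsherpa implements chunking internally.

```python
from typing import List, Tuple

Block = Tuple[str, int, str]  # (tag, level, text); a hypothetical simplified shape


def smart_chunks(blocks: List[Block]) -> List[str]:
    """Group blocks into chunks and prefix each chunk with its section path."""
    headers: List[Tuple[int, str]] = []   # stack of (level, title) for open sections
    chunks: List[List[str]] = []          # each chunk is a list of lines
    last_tag = None
    for tag, level, text in blocks:
        if tag == "header":
            # Close sections at the same or a deeper level, then open this one.
            while headers and headers[-1][0] >= level:
                headers.pop()
            headers.append((level, text))
        else:
            continues_list = tag == "list_item" and last_tag in ("para", "list_item")
            continues_table = tag == "table_row" and last_tag == "table_row"
            if chunks and (continues_list or continues_table):
                chunks[-1].append(text)   # stay in the current chunk
            else:
                path = " > ".join(title for _, title in headers)
                chunks.append(([path] if path else []) + [text])
        last_tag = tag
    return ["\n".join(chunk) for chunk in chunks]


demo = [
    ("header", 0, "Getting Ready for Class"),
    ("header", 1, "Step 1: Set Up Your GitHub.com Account"),
    ("para", 2, "For this class, we will use a public account on GitHub.com."),
    ("para", 2, "You can set up your free account by following these steps:"),
    ("list_item", 3, "1. Access GitHub.com and click Sign up."),
    ("list_item", 3, "2. Choose the free account."),
]
for chunk in smart_chunks(demo):
    print(chunk, end="\n\n")
```

The demo yields two chunks: the first paragraph on its own, and the second paragraph together with its list items, each prefixed with the section path.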
The following code creates a LlamaIndex query engine from LayoutPDFReader document chunks:

In [37]:

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
documents = SimpleDirectoryReader("data").load_data()


In [39]:
doc

<llmsherpa.readers.layout_reader.Document at 0x1118a89a0>

In [44]:
documents

[Document(id_='fb5e2ec6-d8cb-4665-a201-c9b86d4b4c57', embedding=None, metadata={'page_label': '1', 'file_name': 'github_manual.pdf', 'file_path': '/Users/adityajamwal/My Drive/AI DataScience Related Opportunities/UNC AI Bootcamp/UNC AI Bootcamp Material/23-Project-3/data/github_manual.pdf', 'file_type': 'application/pdf', 'file_size': 2237837, 'creation_date': '2024-05-13', 'last_modified_date': '2024-05-13'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={}, text='', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n'),
 Document(id_='bbdc5277-3334-487e-a4a6-3bcc73a7421a', embedding=None, metadata={'page_label': '2', 'file_name': 'github_manual.pdf', 'file_path': '/User

In [6]:
from llama_index.core import Document, VectorStoreIndex

chunks = []
for chunk in doc.chunks():
    chunks.append(Document(text=chunk.to_context_text(), extra_info={}))
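If you want to filter or cite results later, each chunk can carry metadata alongside its text. The sketch below assumes that chunk objects expose `to_context_text()` (as above) and, optionally, `tag` and `page_idx` attributes. Those two attribute names are an assumption based on the raw JSON fields, so the code falls back gracefully when they are absent.

```python
from types import SimpleNamespace


def chunk_records(chunks):
    """Turn layout chunks into {"text", "metadata"} dicts ready for indexing."""
    records = []
    for chunk in chunks:
        metadata = {
            # Assumed attribute names; missing ones are simply left out.
            "tag": getattr(chunk, "tag", None),
            "page_idx": getattr(chunk, "page_idx", None),
        }
        records.append({
            "text": chunk.to_context_text(),
            "metadata": {k: v for k, v in metadata.items() if v is not None},
        })
    return records


# Stand-in chunks for illustration; in the notebook, pass doc.chunks() instead.
demo_chunks = [
    SimpleNamespace(to_context_text=lambda: "Step 2: Install Git\n...", tag="para", page_idx=4),
    SimpleNamespace(to_context_text=lambda: "Table of Contents\n..."),
]
for record in chunk_records(demo_chunks):
    print(record["metadata"])

# In the notebook, each record could then become a LlamaIndex document:
# Document(text=record["text"], metadata=record["metadata"])
```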


In [1]:
from langchain_community.embeddings import HuggingFaceEmbeddings
embeddings_model_name = "sentence-transformers/all-MiniLM-L6-v2"
embeddings_model = HuggingFaceEmbeddings(model_name=embeddings_model_name)





In [8]:
import chromadb
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.vector_stores.chroma import ChromaVectorStore  # pip install llama-index-vector-stores-chroma
from llama_index.core import StorageContext

# load some documents
documents = SimpleDirectoryReader("./data").load_data()

# initialize client, setting path to save data
db = chromadb.PersistentClient(path="./chroma_db")

# create collection
chroma_collection = db.get_or_create_collection("quickstart")

# assign chroma as the vector_store to the context
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# create your index from the LayoutPDFReader chunks built earlier (not from `documents`)
index = VectorStoreIndex.from_documents(
    chunks, storage_context=storage_context, embed_model=embeddings_model
)
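To build intuition for what the vector store does at query time, here is a toy, standard-library-only version of similarity search. The word-count "embedding" is a deliberate simplification and only a stand-in; the real pipeline above uses the sentence-transformers model and Chroma.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_k(query: str, texts, k: int = 2):
    """Return the k texts most similar to the query, best first."""
    query_vec = embed(query)
    return sorted(texts, key=lambda t: cosine(query_vec, embed(t)), reverse=True)[:k]


# Section titles taken from the listing earlier in this notebook.
demo_chunks = [
    "Step 2: Install Git\nDownloading and Installing Git",
    "Using GitHub Issues\nUsing Markdown",
    "What is Git?\nBranches are Lightweight and Cheap",
]
print(top_k("how do I install git", demo_chunks, k=1))
```

A real embedding model captures meaning rather than exact word overlap, but the retrieval step is the same idea: embed the query, score every chunk, and keep the best matches as context for the LLM.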



In [11]:
# create a query engine and query
query_engine = index.as_query_engine(llm=llm)
response = query_engine.query("What is this manual about?")
print(response)




This manual appears to be about using GitHub for collaborating on projects. The text mentions "Understanding the GitHub flow" and "Getting Started With Collaboration," indicating that the content will cover the basics of using GitHub for version control and working with others on code. The file named "README.md" is also mentioned, which is typically a file that explains a project and provides helpful information for new users.


In [None]:
from llama_index.core import Document, VectorStoreIndex

index = VectorStoreIndex([])
for chunk in doc.chunks():
    index.insert(Document(text=chunk.to_context_text(), extra_info={}))
query_engine = index.as_query_engine()

Let's run one query:

In [12]:
response = query_engine.query("If we are working on a group project, how should we setup the github repository?")
print(response)


1. First, create a new repository on GitHub. Give it a descriptive name and a brief description that explains the purpose of the project.
2. Once the repository is created, add all the team members as collaborators. This will allow them to clone, commit, and push changes to the repository.
3. Decide on a branching strategy, such as GitHub flow, and create branches for each feature or bug fix.
4. Set up GitHub Pages for the project site. Depending on the settings for your repository, GitHub can serve your site from a master or gh-pages branch or a /docs folder on the master branch.
5. Use pull requests to merge changes from branches to the master branch. This allows for code review and discussion before merging changes.
6. Make sure to frequently merge the master branch to ensure that all team members have the latest changes.
7. Lastly, make sure to commit and push changes regularly to keep the repository up to date and to allow for easy rollback if necessary.


Let's try a follow-up query:

In [13]:
response = query_engine.query("Explain what you mean by branching strategy")
print(response)


Branching strategy refers to a method used in software development to create and manage multiple branches of a project's source codebase simultaneously. This approach allows developers to work on different features, bug fixes, or improvements independently, without affecting the main codebase or each other's work. By using a branching strategy, teams can collaborate more efficiently, reduce merge conflicts, and ensure that the codebase remains stable and ready for release. Common branching strategies include GitFlow, GitHub Flow, and Forking.

Discussion Guide: Team Workflows and Branching Strategies
1. Which branching strategy will we use?

Branching with Git > GitFlow
Let's discuss GitFlow and its benefits.
---------------------
Given the context information and the understanding of branching strategy, answer the query.
Query: What is GitFlow and why is it used?
Answer: GitFlow is a popular branching strategy for managing larger software projects with multiple developers. It is base

**Get the Raw JSON**

To get the complete JSON returned by the llmsherpa service and process it differently, simply read the `json` attribute:

In [17]:
doc.json

[{'block_class': 'cls_0',
  'block_idx': 0,
  'level': 0,
  'page_idx': 0,
  'sentences': ['BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension'],
  'tag': 'header'},
 {'block_class': 'cls_1',
  'block_idx': 1,
  'level': 0,
  'page_idx': 0,
  'sentences': ['Mike Lewis*, Yinhan Liu*, Naman Goyal*, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, Luke Zettlemoyer Facebook AI'],
  'tag': 'para'},
 {'block_class': 'cls_5',
  'block_idx': 2,
  'level': 1,
  'page_idx': 0,
  'sentences': ['{mikelewis,yinhanliu,naman}@fb.com'],
  'tag': 'header'},
 {'block_class': 'cls_1',
  'block_idx': 3,
  'level': 2,
  'page_idx': 0,
  'sentences': ['Abstract'],
  'tag': 'header'},
 {'block_class': 'cls_7',
  'block_idx': 4,
  'level': 3,
  'page_idx': 0,
  'sentences': ['We present BART, a denoising autoencoder for pretraining sequence-to-sequence models.',
   'BART is trained by (1) corrupting text with an arbitrary noisin
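As one example of processing the raw JSON differently, the sketch below builds an indented outline from the header blocks. It assumes `doc.json` is a list of dictionaries with the `tag`, `level`, and `sentences` keys shown in the output above; `outline` is a hypothetical helper.

```python
def outline(blocks):
    """Return an indented outline built from the 'header' blocks."""
    lines = []
    for block in blocks:
        if block.get("tag") == "header":
            title = " ".join(block.get("sentences", []))
            lines.append("  " * block.get("level", 0) + title)
    return lines


# Entries abbreviated from the output above; in the notebook, pass doc.json instead.
demo_blocks = [
    {"block_idx": 0, "level": 0, "page_idx": 0, "tag": "header",
     "sentences": ["BART: Denoising Sequence-to-Sequence Pre-training"]},
    {"block_idx": 1, "level": 0, "page_idx": 0, "tag": "para",
     "sentences": ["Mike Lewis*, Yinhan Liu*, Naman Goyal*"]},
    {"block_idx": 3, "level": 2, "page_idx": 0, "tag": "header",
     "sentences": ["Abstract"]},
]
print("\n".join(outline(demo_blocks)))
```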

In [None]:
# Uncomment this line if you are using Google Colab.
# ! pip install gradio
# Import Gradio
import gradio as gr

# Create a function called `question_answer` that takes the user's query
# and returns the query engine's answer as a string.
def question_answer(query):
    return str(query_engine.query(query))

# Create the app with one Textbox for the query and one Textbox for the answer.
app = gr.Interface(
    fn=question_answer,
    inputs=[
        gr.Textbox(label="What is your query?")],
    outputs=gr.Textbox(lines=10, label="ChatBot Answer", show_copy_button=True))

# Launch the app.
app.launch(show_error=True)
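For a slightly more forgiving chat box, the callback can reject empty input and report failures as text instead of raising. This is a sketch: `safe_answer` is a hypothetical helper that assumes only that the engine has a `query()` method, and the demo uses a stand-in engine so that it runs without an index.

```python
def safe_answer(query: str, engine) -> str:
    """Query `engine` and always return a string suitable for a text box."""
    query = (query or "").strip()
    if not query:
        return "Please enter a question."
    try:
        return str(engine.query(query))
    except Exception as exc:  # show the failure in the UI instead of crashing
        return f"Query failed: {exc}"


class EchoEngine:
    """Stand-in for the notebook's query_engine, used only for this demo."""

    def query(self, text):
        return f"You asked: {text}"


print(safe_answer("What is this manual about?", EchoEngine()))
print(safe_answer("   ", EchoEngine()))
```

In the Gradio app, this could be wired in with `fn=lambda q: safe_answer(q, query_engine)`.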
