- Table of contents
  - [Loading PDFs](#loading-pdfs)
    - [Simple and fast text extraction](#simple-and-fast-text-extraction)
      - [Vector search over PDFs](#vector-search-over-pdfs)
    - [Layout analysis and extraction of text from images](#layout-analysis-and-extraction-of-text-from-images)

# Loading PDFs

- PDFs are typically represented as a collection of text boxes and may also contain images.
- A PDF parser may combine the following steps:
  1. Group text boxes into lines, paragraphs, and other structures using heuristics or ML inference.
  2. Run OCR (optical character recognition, which converts handwritten or printed text in an image into machine-encoded text) on images to detect text.
  3. Classify text as belonging to paragraphs, lists, tables, or other structures.
  4. Structure text into table rows and columns, or key-value pairs.
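
Step 1 can be illustrated with a small heuristic. This is a minimal sketch, not how any particular parser works: `TextBox`, `group_into_lines`, and the `y_tolerance` threshold are hypothetical names introduced here, and the coordinates are made up. It merges boxes that sit at roughly the same height into one line, then orders each line left to right.

```python
from dataclasses import dataclass


@dataclass
class TextBox:
    """Hypothetical text box: a string plus the position of its top-left corner."""
    text: str
    x: float  # horizontal position, increasing to the right
    y: float  # vertical position, increasing down the page


def group_into_lines(boxes: list[TextBox], y_tolerance: float = 2.0) -> list[str]:
    """Merge boxes whose vertical positions are within y_tolerance into one line."""
    lines: list[list[TextBox]] = []
    # Visit boxes top to bottom so lines come out in reading order.
    for box in sorted(boxes, key=lambda b: b.y):
        if lines and abs(lines[-1][0].y - box.y) <= y_tolerance:
            lines[-1].append(box)
        else:
            lines.append([box])
    # Within a line, order boxes left to right before joining their text.
    return [" ".join(b.text for b in sorted(line, key=lambda b: b.x)) for line in lines]


# Made-up coordinates, given out of reading order on purpose.
boxes = [
    TextBox("Acrobat", x=60.0, y=30.5),
    TextBox("Welcome", x=10.0, y=10.0),
    TextBox("to", x=70.0, y=10.8),
    TextBox("Adobe", x=10.0, y=30.0),
]
lines = group_into_lines(boxes)
print(lines)  # ['Welcome to', 'Adobe Acrobat']
```

Real parsers also have to handle columns, rotated text, and varying font sizes, which a single tolerance cannot capture.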

- **IMPORTANT TO NOTE**
  - Many modern LLMs support multimodal input such as images.
  - So we can use the following approach instead of parsing the PDF:
    - Pass an image of each PDF page to the model.
    - An example of how to do that appears below.


## Simple and fast text extraction

- This is the simplest approach if you only need the text content embedded in a PDF.
- It extracts only the embedded text; it does not parse text inside images.
- It returns a list of [Document](https://python.langchain.com/api_reference/core/documents/langchain_core.documents.base.Document.html) objects, one per page, with the page text in each Document's **`page_content`** attribute.

- LangChain [document loaders](../document_loader.md) implement `lazy_load` and its async variant `alazy_load`, which return iterators of Document objects.

In [1]:
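from os import path

# Path to the sample PDF, relative to this notebook.
pdf_path = "../../sample_files/welcome.pdf"

# Fail early with a specific error if the file is missing.
if not path.exists(pdf_path):
    raise FileNotFoundError(f"File not found: {pdf_path}")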

In [None]:
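from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader(pdf_path)

# Load the pages asynchronously, one Document per page.
# Top-level `async for` is allowed here because this runs in a notebook.
pages = []
async for page in loader.alazy_load():
    pages.append(page)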
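
The cell above uses `async for` at the top level, which a notebook allows but a plain Python script does not. A minimal sketch of the script form follows. `fake_alazy_load` is a hypothetical stand-in for `loader.alazy_load()` so the snippet runs without a PDF, and `collect_pages` is likewise an illustrative name.

```python
import asyncio
from typing import AsyncIterator


async def fake_alazy_load() -> AsyncIterator[str]:
    """Hypothetical stand-in for loader.alazy_load(); yields one item per page."""
    for page_text in ["page 0 text", "page 1 text"]:
        await asyncio.sleep(0)  # hand control back to the event loop, as real I/O would
        yield page_text


async def collect_pages() -> list[str]:
    """Consume the async iterator inside a coroutine, where `async for` is allowed."""
    return [page async for page in fake_alazy_load()]


# asyncio.run starts an event loop, runs the coroutine, and returns its result.
collected = asyncio.run(collect_pages())
print(collected)  # ['page 0 text', 'page 1 text']
```

When async loading is not needed, the synchronous `lazy_load` avoids the event loop entirely.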

In [None]:
print("Page 1 MetaData")
print(pages[0].metadata)  # The metadata also stores the page number.
print("Page 1 Content")
print(pages[0].page_content)

Page 1 MetaData
{'producer': 'Adobe PDF Library 21.1.174', 'creator': 'Acrobat PDFMaker 21 for Word', 'creationdate': '2024-06-18T22:34:29+00:00', 'author': 'Adobe', 'comments': '', 'company': 'Adobe', 'keywords': '', 'moddate': '2024-06-24T08:31:18-07:00', 'sourcemodified': 'D:20240618223427', 'subject': '', 'title': 'Welcome to Adobe Acrobat', 'source': 'same_files/welcome.pdf', 'total_pages': 6, 'page': 0, 'page_label': '1'}
Page 1 Content
Welcome to  
Adobe Acrobat 
Here are five tips to get real work  
done from anywhere with Acrobat.  
01 Work where you want 
02 Meet Acrobat AI Assistant 
03 Present perfect content 
04 Share files with others 
05 Get help from Adobe


### Vector search over PDFs

- Once we have loaded PDFs into LangChain [Document](../document_loader.md) objects, we can index them (e.g., for a RAG application) in the usual way.
- Below we use OpenAI embeddings, although any LangChain embedding model will suffice for indexing.

In [23]:
from langchain_core.vectorstores import InMemoryVectorStore
from langchain_openai import OpenAIEmbeddings

# Embed every page and index it in memory (requires OPENAI_API_KEY to be set).
vector_store = InMemoryVectorStore.from_documents(
    pages, OpenAIEmbeddings(model="text-embedding-3-small")
)
# Retrieve the pages most similar to the query.
docs = vector_store.similarity_search("tell me about share files with others")

for doc in docs:
    print(f'Page {doc.metadata["page"]}: {doc.page_content}\n')

Page 4: 04 Share files with others 
Send and manage 
 
Now you can share files for 
viewing, commenting, or 
signature—and track their 
status every step of the way. 
Share files fast. Click, type, and 
send. It’s that easy. 
Collaborate better. Subscribers 
can also send files for e-signature or for review to collect 
group feedback in a single shared file. 
Manage your files. You’re in control. Track your file, 
forward it to others, or stop sharing it at any time.

Page 3: 03 Present perfect content 
Combine and organize files 
 
Share materials exactly how you want—quickly and 
easily. 
Merge multiple files into 
one PDF. Combine different 
file types—spreadsheets, 
images, web pages, and 
videos—into a single PDF 
file that’s easy to share or 
archive. You can even add 
an entire folder. 
Organize pages. Rotate, 
delete, reorder, or insert pages in your PDF on your 
desktop, tablet, or mobile device.

Page 5: 05 Get help from Adobe 
We’ve got your back 
Take advantage of tutorials
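
The search above returns pages ordered by how close their embeddings are to the query embedding. The idea can be sketched with toy vectors, assuming cosine similarity as the score. The three-dimensional vectors, the labels, and the names `cosine_similarity` and `rank_by_similarity` are all made up for illustration; real embeddings have many more dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(
    query: list[float], indexed: dict[str, list[float]], k: int = 2
) -> list[str]:
    """Return the labels of the k indexed vectors most similar to the query."""
    scored = sorted(
        indexed.items(),
        key=lambda item: cosine_similarity(query, item[1]),
        reverse=True,
    )
    return [label for label, _ in scored[:k]]


# Toy "embeddings": made-up numbers, not output from a real model.
toy_index = {
    "share files": [0.9, 0.1, 0.0],
    "present content": [0.6, 0.6, 0.1],
    "get help": [0.0, 0.2, 0.9],
}
toy_query = [1.0, 0.2, 0.0]
top = rank_by_similarity(toy_query, toy_index)
print(top)  # ['share files', 'present content']
```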

## Layout analysis and extraction of text from images

- If you need finer-grained control over text extraction (for example, extracting text from images, titles, tables, and other structures), the following method is a good fit.
- It returns a list of Document objects, each representing one structure on the page.
- Each Document's metadata stores the page number and other information about the object.
- It uses the **`langchain-unstructured`** library; see the [integration docs](https://python.langchain.com/docs/integrations/document_loaders/unstructured_file/).

In [None]:
import getpass
import os
from langchain_unstructured import UnstructuredLoader

# Prompt for the API key only if it is not already set in the environment.
if "UNSTRUCTURED_API_KEY" not in os.environ:
    os.environ["UNSTRUCTURED_API_KEY"] = getpass.getpass("Unstructured API Key:")

loader = UnstructuredLoader(
    file_path="sample_files/welcome.pdf",
    strategy="hi_res",
    partition_via_api=True,  # partition through the Unstructured API
    coordinates=True,
)
# Collect every parsed element as a Document.
docs = list(loader.lazy_load())
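
Because each returned Document is a single page element, a common next step is to regroup the elements by page. The sketch below shows one way to do that. It assumes the metadata keys are named `page_number` and `category`, which should be checked against the actual loader output; `sample_docs` is a hypothetical stand-in built with `SimpleNamespace` so the snippet runs without an API key.

```python
from collections import defaultdict
from types import SimpleNamespace

# Hypothetical stand-ins for the Documents returned by the loader.
# The key names "page_number" and "category" are assumptions.
sample_docs = [
    SimpleNamespace(page_content="Example title", metadata={"page_number": 1, "category": "Title"}),
    SimpleNamespace(page_content="Example paragraph", metadata={"page_number": 1, "category": "NarrativeText"}),
    SimpleNamespace(page_content="Another title", metadata={"page_number": 2, "category": "Title"}),
]


def group_by_page(documents) -> dict:
    """Map each page number to a list of (category, text) pairs, in input order."""
    grouped = defaultdict(list)
    for document in documents:
        page = document.metadata.get("page_number")
        category = document.metadata.get("category")
        grouped[page].append((category, document.page_content))
    return dict(grouped)


by_page = group_by_page(sample_docs)
for page, elements in sorted(by_page.items()):
    print(page, elements)
```

With real loader output, pass `docs` in place of `sample_docs`.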

## Extracting text from images