# <a id='toc1_'></a>[Parsing strategies](#toc0_)

Parsing documents is a crucial step in enhancing the performance of Large Language Models (LLMs). It transforms unstructured text into structured data, making it accessible for AI models to process and analyze. This process includes extracting text, tables, images, and metadata from various document types.

In this notebook, we will explore a range of parsing techniques with varying levels of complexity. Each method has its strengths and limitations, making it more suitable for specific use cases.

**Table of contents**<a id='toc0_'></a>    
- [Setup](#toc1_1_)    
- [1- Classical parsing](#toc1_2_)    
  - [- PyPDFLoader](#toc1_2_1_)    
  - [- PyMuPDFLoader](#toc1_2_2_)    
- [2- Unstructured Methods](#toc1_3_)    
  - [- Raw unstructured](#toc1_3_1_)    
  - [- Unstructured + Multimodal](#toc1_3_2_)    
- [3 - Docling](#toc1_4_)    
  - [- Raw docling](#toc1_4_1_)    
  - [- Docling with Langchain](#toc1_4_2_)    
- [4- Full multimodal](#toc1_5_)    
- [5- Llamaparse](#toc1_6_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

## <a id='toc1_1_'></a>[Setup](#toc0_)

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
import os
from pathlib import Path

from dotenv import load_dotenv

os.chdir(Path.cwd().joinpath(".."))
print(Path.cwd())
load_dotenv(override=True)

In [None]:
import base64
import os
from io import BytesIO
from pathlib import Path

import nest_asyncio
from IPython.display import HTML, Markdown, display
from langchain_community.document_loaders import PyMuPDFLoader, PyPDFLoader
from llama_parse import LlamaParse
from pdf2image import convert_from_path
from unstructured.documents.elements import Image, Table
from unstructured.partition.pdf import partition_pdf

from lib.models import llm

DATA_PATH = Path("data/2_docs")
PDF_FILE = DATA_PATH / "embedded-images-tables.pdf"

## <a id='toc1_2_'></a>[1- Classical parsing](#toc0_)

### <a id='toc1_2_1_'></a>[- PyPDFLoader](#toc0_)
- It is one of the easiest and classical ways to parse a PDF document.<br>
- It extracts basic metadata about the PDF (source and page number), and returns one document per page. 

In [None]:
loader = PyPDFLoader(PDF_FILE)
pages = loader.load_and_split()

# Example of the returned Documents
print(pages[0].page_content)

### <a id='toc1_2_2_'></a>[- PyMuPDFLoader](#toc0_)
- It is recognized for its speed and efficiency in parsing PDF files.<br>
- It extracts detailed metadata about the PDF and its pages, returning one document per page. <br>
- In practice, PyMuPDFLoader usually produces output of similar quality to PyPDFLoader.

In [None]:
loader = PyMuPDFLoader(PDF_FILE)
pages = loader.load_and_split()

# Example of the returned Documents
print(pages[0].page_content)

**Limitations** 

While classical parsing methods are fast and efficient, they often miss important information held in the more complex parts of a PDF, such as tables and images, and they struggle with PDFs that have complex layouts.

This is why it's important to explore other parsing strategies that can overcome these limitations.

## <a id='toc1_3_'></a>[2- Unstructured Methods](#toc0_)

Many documents contain a mixture of content types, including text, tables and images.

Tables matter in most of our GenAI use cases because they often hold key information. Loading a PDF that contains tables with a classical parser can break those tables apart, which corrupts data retrieval and hurts the performance of your GenAI solution:
1. The table structure is lost, which lowers the quality of the LLM's output.
2. Embedding quality drops, so the wrong chunks are retrieved in RAG use cases.
3. Chunking may split a table in half, losing information or separating rows from their headers.
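
The third point can be illustrated with a toy example. This is a minimal sketch, not the chunking used by any particular library: `fixed_size_chunks` is a hypothetical helper that cuts text every `size` characters, and the table content is made up for illustration.

```python
# Toy page text: a sentence followed by a small table (made-up data for illustration).
HEADER_ROW = "| Region | Q1 | Q2 |"
page_text = (
    "Quarterly results are summarised below.\n"
    f"{HEADER_ROW}\n"
    "|--------|----|----|\n"
    "| North  | 10 | 12 |\n"
    "| South  |  8 |  9 |\n"
)


def fixed_size_chunks(text: str, size: int) -> list[str]:
    """Split text into consecutive chunks of at most `size` characters."""
    return [text[i : i + size] for i in range(0, len(text), size)]


chunks = fixed_size_chunks(page_text, size=90)

# Report which chunks still carry the header row of the table.
for i, chunk in enumerate(chunks):
    print(f"chunk {i}: header present = {HEADER_ROW in chunk}")
```

With this input the cut falls in the middle of the table, so the last chunk holds data rows with no header to explain what the numbers mean.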

Unstructured is a Python library for parsing complex PDFs. It relies on layout detection models such as YOLOX, combined with OCR.


Unstructured's `partition_pdf` segments a PDF document using a layout model, which makes it possible to extract typed elements from PDFs (for example Title, Text, Header, Image, and Table). Tables can be extracted as text, as HTML, or as images.


In this section we show how to use Unstructured on its own and how to combine it with other strategies.

Before using Unstructured, you may need to install Tesseract with `brew install tesseract` and add extra language packs, including French, with `brew install tesseract-lang`.
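
To confirm that Tesseract and the French pack are available before running the cells below, a small check like the following can help. It is a sketch only: `installed_tesseract_languages` is a hypothetical helper that shells out to `tesseract --list-langs` and assumes language codes are printed one per line without spaces (`fra` is Tesseract's code for French).

```python
import shutil
import subprocess


def installed_tesseract_languages() -> list[str]:
    """Return the language codes Tesseract reports, or [] if it is not available."""
    if shutil.which("tesseract") is None:
        return []
    try:
        proc = subprocess.run(
            ["tesseract", "--list-langs"],
            capture_output=True,
            text=True,
            check=False,
            timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        return []
    lines = (proc.stdout + proc.stderr).splitlines()
    # Keep only single-token lines; this drops the header line and most warnings.
    return [line.strip() for line in lines if line.strip() and " " not in line.strip()]


langs = installed_tesseract_languages()
print("Tesseract languages:", langs or "none found")
print("French pack installed:", "fra" in langs)
```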

### <a id='toc1_3_1_'></a>[- Raw unstructured](#toc0_)

With Unstructured, we can extract tables in either HTML format or as images. In this approach we will extract tables as HTML.

For more info about **partition_pdf** function see: https://docs.unstructured.io/open-source/core-functionality/partitioning#partition-pdf

**partition_pdf** returns a list of elements, each with its type, content, and metadata.
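
A quick way to inspect such a list is to count elements by type. The sketch below assumes each element exposes a `category` attribute (as Unstructured elements do); `summarize_elements` is a hypothetical helper, and the sample objects are stand-ins so the snippet runs without a PDF.

```python
from collections import Counter
from types import SimpleNamespace


def summarize_elements(elements) -> Counter:
    """Count parsed elements by category (for example "Title", "Table", "Image")."""
    return Counter(el.category for el in elements)


# Stand-in objects for illustration; real elements come from partition_pdf.
sample_elements = [
    SimpleNamespace(category="Title", text="Annual report"),
    SimpleNamespace(category="NarrativeText", text="Revenue grew this year."),
    SimpleNamespace(category="Table", text="Region Q1 Q2"),
    SimpleNamespace(category="NarrativeText", text="See the table above."),
]

summary = summarize_elements(sample_elements)
print(summary.most_common())
```

On a real document, the same call on the list returned by `partition_pdf` shows at a glance whether any tables or images were detected.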

In [None]:
raw_partitioned_pdf = partition_pdf(
    filename=PDF_FILE,  # PDF path
    strategy="hi_res",  # controls the method that will be used to process the PDF
    languages=["en", "fr"],  # languages to use for OCR
    hi_res_model_name="yolox",  # layout detection model used when the strategy is "hi_res"
    infer_table_structure=True,  # enable table structure extraction (fills metadata.text_as_html)
    extract_image_block_to_payload=True,  # If True, images of the element type(s) defined in 'extract_image_block_types' will be encoded as base64 data and stored in two metadata fields
    extract_image_block_types=[
        "image",
        "table",
    ],  # Images of the element type(s) specified in this list will be saved
)

In [None]:
for el in raw_partitioned_pdf:
    if isinstance(el, Table):
        # Tables are rendered from their HTML representation
        display(Markdown(el.metadata.text_as_html))
    elif isinstance(el, Image):
        # Display the image using base64
        img_html = f'<img src="data:image/png;base64,{el.metadata.image_base64}" />'
        display(HTML(img_html))
    else:
        display(Markdown(el.text))

**Limitations**

Unstructured may struggle to correctly extract very large or complex tables in PDFs; such cases may require multimodal capabilities.

Unstructured can extract images as base64-encoded data, but these are not directly usable by a text-only LLM. To work with images, you need a multimodal model capable of processing both text and images.
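
When a base64 payload is passed to a multimodal model, it is typically wrapped in a data URL whose MIME type should match the actual image format. The following is a minimal sketch under that assumption: `to_data_url` is a hypothetical helper that recognises only PNG and JPEG by their leading bytes.

```python
import base64

# Leading "magic" bytes of the two formats handled in this sketch.
PNG_MAGIC = b"\x89PNG\r\n\x1a\n"
JPEG_MAGIC = b"\xff\xd8\xff"


def to_data_url(image_base64: str) -> str:
    """Build a data URL whose MIME type matches the decoded image bytes."""
    raw = base64.b64decode(image_base64)
    if raw.startswith(PNG_MAGIC):
        mime = "image/png"
    elif raw.startswith(JPEG_MAGIC):
        mime = "image/jpeg"
    else:
        raise ValueError("Unrecognised image format (expected PNG or JPEG)")
    return f"data:{mime};base64,{image_base64}"


# Fake payloads for illustration: only the leading bytes matter for the check.
fake_jpeg = base64.b64encode(JPEG_MAGIC + b"\x00" * 8).decode("utf-8")
fake_png = base64.b64encode(PNG_MAGIC + b"\x00" * 8).decode("utf-8")
print(to_data_url(fake_jpeg)[:23])
print(to_data_url(fake_png)[:22])
```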

### <a id='toc1_3_2_'></a>[- Unstructured + Multimodal](#toc0_)

With Unstructured, we can extract tables in either HTML format or as images.

In this approach, we will extract tables as images and then process them using a multimodal model. This model will convert the images into descriptive text, which can then be embedded and used by an LLM.

By converting tables into descriptive text, we preserve the information that would otherwise be lost due to structural changes, ensuring high-quality embeddings in our RAG process.

In the example below, we extracted tables as images and used a multimodal model to process them. Alternatively, the tables could be extracted in HTML format (doable with unstructured) and processed using an LLM.

![Unstructured Multimodal](https://i.postimg.cc/hhRmtLc0/img-unstructured-multimodal.png)
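
The HTML alternative mentioned above could look roughly like this. It is a sketch only: `build_table_messages` and `TABLE_SYSTEM_PROMPT` are hypothetical names, the sample HTML is made up, and the message format simply mirrors the role/content dictionaries used elsewhere in this notebook.

```python
# Hypothetical prompt for a text-only LLM that receives the table as HTML.
TABLE_SYSTEM_PROMPT = (
    "Your objective is to convert the HTML table into descriptive sentences "
    "containing all the numbers and information from the table."
)


def build_table_messages(table_html: str) -> list[dict]:
    """Build text-only chat messages asking an LLM to describe an HTML table."""
    return [
        {"role": "system", "content": TABLE_SYSTEM_PROMPT},
        {"role": "user", "content": f"Describe this table:\n\n{table_html}"},
    ]


# Made-up table for illustration.
sample_html = "<table><tr><th>Region</th><th>Q1</th></tr><tr><td>North</td><td>10</td></tr></table>"
messages = build_table_messages(sample_html)
print(messages[1]["content"])
```

In the notebook, the messages could then be passed to `llm.invoke(...)` with `el.metadata.text_as_html` as input, which assumes the PDF was partitioned with `infer_table_structure=True`.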

In [None]:
# Processing PDF with unstructured

raw_partitioned_pdf = partition_pdf(
    filename=PDF_FILE,
    strategy="hi_res",
    languages=["en", "fr"],
    hi_res_model_name="yolox",
    extract_image_block_to_payload=True,
    extract_image_block_types=["image", "table"],
)

In [None]:
# Prompts used by the multimodal model to turn a table image into descriptive text. You can modify and optimize them to better fit your use case.
SYSTEM_PROMPT = "Your objective is to convert the table or the graph in the image into descriptive sentences containing all the numbers and information from the table."

HUMAN_PROMPT = "Describe the table or the graph in the image with a paragraph of descriptive sentences"

# Iterating over elements to process tables
full_doc = ""
for el in raw_partitioned_pdf:
    if isinstance(el, (Table, Image)):
        base64_image = el.metadata.image_base64
        messages = [
            {
                "role": "system",
                "content": SYSTEM_PROMPT,
            },
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": HUMAN_PROMPT,
                    },
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{base64_image}"},
                    },
                ],
            },
        ]
        ai_message = llm.invoke(messages)
        full_doc += ai_message.content + "\n\n"
    else:
        full_doc += el.text + "\n\n"

In [None]:
Markdown(full_doc)

## <a id='toc1_4_'></a>[3 - Docling](#toc0_)

[Docling](https://github.com/docling-project/docling) is an open-source library (by IBM) that converts complex documents (like PDFs) into clean, structured formats such as Markdown or JSON, while preserving layout, tables, and hierarchy. Compared to Unstructured, it is generally reported to be faster and more accurate at extracting structured elements (especially tables and sections), and it integrates well with RAG/NLP pipelines, whereas Unstructured is more generic but less precise.

See a benchmark comparison of Unstructured, Docling and LlamaParse here: [benchmark](https://procycons.com/en/blogs/pdf-data-extraction-benchmark/)

### <a id='toc1_4_1_'></a>[- Raw docling](#toc0_)

In [None]:
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

pipeline_options = PdfPipelineOptions()
pipeline_options.generate_page_images = True
pipeline_options.generate_picture_images = True

# Create the Docling converter
converter = DocumentConverter(  # All of the below is optional, has internal defaults.
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)},
)

# Convert the document
result = converter.convert(PDF_FILE).document

In [None]:
display(Markdown(result.export_to_markdown(image_mode="embedded")))

### <a id='toc1_4_2_'></a>[- Docling with Langchain](#toc0_)

In [None]:
from langchain_docling import DoclingLoader
from langchain_docling.loader import ExportType

loader = DoclingLoader(file_path=PDF_FILE, export_type=ExportType.MARKDOWN)
pages = loader.load_and_split()

In [None]:
display(Markdown(pages[0].page_content))

## <a id='toc1_5_'></a>[4- Full multimodal](#toc0_)

In this method, we convert the pages of the PDF into images and use a multimodal model to rewrite the content of those pages. This approach is particularly useful for handling complex PDFs but also has some limitations.

1- **Processing All Pages**: Applying this method to every page of the PDF can be resource-intensive, both in terms of cost and time.<br>
2- **Selective Processing**: To improve efficiency, this process can be applied only to complex pages, such as those containing tables or intricate layouts, instead of all pages. (Not implemented in this notebook, but it can be done by using Unstructured, for example, to detect the pages that contain tables.)
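
A rough sketch of the selective approach follows. It assumes elements shaped like those returned by Unstructured, with a `category` attribute and a `metadata.page_number` that starts at 1; `pages_needing_multimodal` is a hypothetical helper, and the sample elements are stand-ins so the snippet runs on its own.

```python
from types import SimpleNamespace

# Element categories treated as "complex" in this sketch.
COMPLEX_CATEGORIES = frozenset({"Table", "Image"})


def pages_needing_multimodal(elements, complex_categories=COMPLEX_CATEGORIES) -> list[int]:
    """Return the sorted page numbers that contain at least one complex element."""
    pages = {
        el.metadata.page_number
        for el in elements
        if el.category in complex_categories and el.metadata.page_number is not None
    }
    return sorted(pages)


def make_element(category: str, page_number: int) -> SimpleNamespace:
    """Create a stand-in element for illustration."""
    return SimpleNamespace(category=category, metadata=SimpleNamespace(page_number=page_number))


sample_elements = [
    make_element("Title", 1),
    make_element("NarrativeText", 1),
    make_element("Table", 2),
    make_element("NarrativeText", 3),
    make_element("Image", 4),
    make_element("Table", 4),
]

complex_pages = pages_needing_multimodal(sample_elements)
print(complex_pages)
```

Only the pages in `complex_pages` would then be converted to images and sent to the multimodal model (for a 0-indexed list of page images, page `p` is at index `p - 1`), while the remaining pages keep the cheaper text extraction.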

![Full Multimodal](https://i.postimg.cc/CMXH2DT5/img-full-multimodal.png)

In [None]:
# Converting PDF to images

pdf_path = PDF_FILE
images = convert_from_path(pdf_path)

In [None]:
# Converting images from PIL to bytes

images_as_bytes = []
for img in images:
    # Convert image to bytes
    buffered = BytesIO()
    img.save(buffered, format="JPEG")  # Save image to buffer in JPEG format
    image_bytes = buffered.getvalue()

    # Encode bytes to Base64
    image_base64 = base64.b64encode(image_bytes).decode("utf-8")

    images_as_bytes.append(image_base64)

In [None]:
# Rewriting each image page into text

# Prompts used by the multimodal model to rewrite the page. You can modify and optimize them to better suit your needs.
SYSTEM_PROMPT = (
    "Your objective is to rewrite the text in the image that represents a PDF page. "
    "When you face a table, you should extract it in **Markdown table format**, preserving all numbers and information. "
    "When you face an image or a graph, you should describe it **precisely in a paragraph**, including all visual details and information it contains. "
    "For regular text, simply rewrite it clearly."
)

HUMAN_PROMPT = "Here's the page as image. Perform the extraction"

# Iterating over pages to rewrite them
full_doc = ""
for img in images_as_bytes:
    base64_image = img
    messages = [
        {
            "role": "system",
            "content": SYSTEM_PROMPT,
        },
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": HUMAN_PROMPT,
                },
                {
                    "type": "image_url",
                    # Pages were saved as JPEG above, so the MIME type must be image/jpeg
                    "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"},
                },
            ],
        },
    ]
    ai_message = llm.invoke(messages)
    full_doc += ai_message.content + "\n\n"

In [None]:
print(full_doc)

**Limitations**

Parsing PDFs as images using a multimodal model allows for capturing complex structures and detailed information within the document. This approach is especially useful for PDFs that consist mainly of tables or that have complex layouts, such as research papers with multi-column text.

However, this strategy can be resource-intensive and time-consuming, making it less efficient for some use cases.

## <a id='toc1_6_'></a>[5- Llamaparse](#toc0_)

LlamaParse is a document parsing platform built by LlamaIndex. It exists as a standalone API and also as part of the LlamaCloud platform.

To generate a LlamaParse API key, go to https://cloud.llamaindex.ai/, sign in, and create one. You can parse 1000 pages per day for free.

In [None]:
nest_asyncio.apply()

parser = LlamaParse(api_key=os.getenv("LLAMA_PARSE_API_KEY"))

documents = parser.load_data(PDF_FILE)

documents

**Limitations**

1- Relying on an external API like LlamaParse poses significant data privacy and security risks, especially when processing confidential or sensitive documents.

2- From a cost perspective, external APIs incur ongoing expenses that scale with usage, which can make them more expensive than in-house solutions in the long term. LlamaParse offers 1000 free pages per day, then charges $3 per additional 1000 pages.
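
To get a feel for how this scales, here is a back-of-the-envelope estimate based on the figures quoted above. It is a sketch that assumes extra pages are billed pro rata and that the quoted pricing still applies; `estimate_daily_cost` is a hypothetical helper, so check the current pricing before relying on the numbers.

```python
# Figures quoted above; pricing may have changed since.
FREE_PAGES_PER_DAY = 1000
USD_PER_1000_EXTRA_PAGES = 3.0


def estimate_daily_cost(pages_per_day: int) -> float:
    """Estimate the daily cost in USD, assuming extra pages are billed pro rata."""
    extra_pages = max(0, pages_per_day - FREE_PAGES_PER_DAY)
    return extra_pages / 1000 * USD_PER_1000_EXTRA_PAGES


for volume in (800, 5000, 50_000):
    daily = estimate_daily_cost(volume)
    print(f"{volume} pages/day: ${daily:.2f} per day, ${daily * 30:.2f} per 30 days")
```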