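## Experiment: Which PDF splitter works best?

This notebook parses the same PDFs with several PDF loader libraries, splits the extracted text into chunks, embeds the chunks, and stores each loader's output in its own AstraDB collection so the results can be compared.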

### Init the Environment

In [None]:
! pip3 install -U trulens-eval # includes langchain as a dependency
! pip3 install -U ipython ipywidgets # required for the trulens UI to run from inside the notebook
! pip3 install -U llama-index # for the llamaindex-cli tool to download datasets
! pip3 install -U langchain-openai # for the Azure LLM models
! pip3 install -U astrapy # to access AstraDB vector store

! pip3 install -U llmsherpa # for the LayoutPDFReader pdf parser
! pip3 install -U pypdfium2 # for the PyPDFium2Loader
! pip3 install -U pdfminer-six # for the PDFMinerLoader
! pip3 install -U pypdf # for the PyPDFLoader
! pip3 install -U pymupdf # for the PyMuPDFLoader

Restart the notebook kernel after installing the packages above so that the new versions are picked up.

In [7]:
# this notebook assumes the following env vars exist in a .env file:
#
# ASTRA_DB_ENDPOINT=https://<uuid>-<region>.apps.astra.datastax.com
# ASTRA_DB_TOKEN=AstraCS:<secret>:<secret>
# AZURE_OPENAI_ENDPOINT=https://<domain>.openai.azure.com/
# AZURE_OPENAI_API_KEY=<secret>
# OPENAI_API_TYPE=azure
# OPENAI_API_VERSION=2023-05-15

# and optionally this var if you want to use an external database for TruLens:
# TRULENS_DB_CONN_STRING=<db connection string>

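from dotenv import load_dotenv
import os

load_dotenv()

# Optional: if experiment-specific credentials are defined
# (ASTRA_DB_ENDPOINT_PDF_SPLITS / ASTRA_DB_TOKEN_PDF_SPLITS), use them in place
# of the defaults. Assigning None to os.environ raises a TypeError, so only
# override when the variable is actually set.
for var in ("ASTRA_DB_ENDPOINT", "ASTRA_DB_TOKEN"):
    override = os.environ.get(f"{var}_PDF_SPLITS")
    if override:
        os.environ[var] = override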
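
Before going further, it can help to confirm that the variables listed in the comment block above are actually set. The following is a minimal sketch, not part of the original notebook; the helper name `missing_env_vars` and the `REQUIRED_VARS` list are illustrative assumptions.

```python
import os

# Variable names taken from the comment block above.
REQUIRED_VARS = [
    "ASTRA_DB_ENDPOINT",
    "ASTRA_DB_TOKEN",
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_API_KEY",
    "OPENAI_API_TYPE",
    "OPENAI_API_VERSION",
]


def missing_env_vars(required, env=None):
    """Return the names in `required` that are unset or empty in `env`.

    `env` defaults to os.environ; passing a plain dict makes the helper
    easy to check in isolation. (Hypothetical helper, for illustration.)
    """
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print("Missing environment variables:", ", ".join(missing))
```

Running this right after `load_dotenv()` surfaces a misconfigured `.env` file early, rather than as an authentication error later in the notebook.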

### Setup Azure LLMs for LangChain

In [8]:
# Setup Azure-based models
from langchain_openai import AzureOpenAIEmbeddings

open_ai_embeddings = AzureOpenAIEmbeddings(
    azure_deployment="text-embedding-ada-002",
    openai_api_version="2023-05-15"
)

## Load documents into memory, chunk, create embeddings, store in AstraDB

### LayoutPDFReader

Most PDF-to-text parsers do not provide layout information. Often, even sentences are split by arbitrary CR/LFs, which makes it very difficult to find paragraph boundaries. This complicates chunking, and it makes it hard to attach long-running contextual information, such as section headers, to passages when indexing or vectorizing PDFs for LLM applications such as retrieval-augmented generation (RAG).

LayoutPDFReader solves this problem by parsing PDFs along with hierarchical layout information such as:

* Sections and subsections along with their levels.
* Paragraphs - combines lines.
* Links between sections and paragraphs.
* Tables along with the section the tables are found in.
* Lists and nested lists.
* Join content spread across pages.
* Removal of repeating headers and footers.
* Watermark removal.

With LayoutPDFReader, developers can find optimal chunks of text to vectorize, and a solution for limited context window sizes of LLMs.
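
To illustrate why section information matters for chunking, here is a minimal sketch of attaching section headers to a paragraph before it is embedded. It does not use the llmsherpa API; the `Paragraph` class and the `with_section_context` helper are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Paragraph:
    """Hypothetical stand-in for a parsed paragraph with layout information."""
    text: str
    # Headers from the top-level section down to the enclosing subsection.
    section_path: List[str] = field(default_factory=list)


def with_section_context(paragraph: Paragraph, separator: str = " > ") -> str:
    """Prefix the paragraph text with its section headers.

    The resulting string carries the section context with it, so a chunk
    retrieved on its own still says where in the document it came from.
    """
    header = separator.join(paragraph.section_path)
    return f"{header}\n{paragraph.text}" if header else paragraph.text


p = Paragraph(text="We used X.", section_path=["Methods", "Data"])
print(with_section_context(p))
```

A parser that only returns flat page text cannot build `section_path`, which is the gap the layout-aware approach described above is meant to close.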

* https://github.com/nlmatics/llmsherpa
* ☆ 586

### PyPDFium2Loader

pypdfium2 is an ABI-level Python 3 binding to PDFium, a powerful and liberal-licensed library for PDF rendering, inspection, manipulation and creation.

* https://github.com/pypdfium2-team/pypdfium2
* ☆ 231
* https://pdfium.googlesource.com/pdfium/

### PDFMinerLoader

Pdfminer.six is a community-maintained fork of the original PDFMiner. It is a tool for extracting information from PDF documents, with a focus on getting and analyzing text data. Pdfminer.six extracts the text of a page directly from the source code of the PDF. It can also be used to get the exact location, font, or color of the text.

* https://github.com/pdfminer/pdfminer.six 
* ☆ 5.1k

### PyPDFLoader

A pure-Python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files.

* https://github.com/py-pdf/pypdf
* ☆ 7.0k

### PyMuPDFLoader

A high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.

* https://github.com/pymupdf/PyMuPDF
* ☆ 3.6k


In [None]:
from langchain_community.document_loaders import DirectoryLoader
from langchain_community.document_loaders.pdf import (
    PDFMinerLoader,
    PyMuPDFLoader,
    PyPDFium2Loader,
    PyPDFLoader,
)
from langchain_community.vectorstores.astradb import AstraDB
from langchain.text_splitter import TokenTextSplitter
import os

collection_loader_map = {
    "PyPDFium2Loader" : { "loader": PyPDFium2Loader, "kwargs": {}},
    "PyMuPDFLoader" : { "loader": PyMuPDFLoader, "kwargs": {}},
    "PyPDFLoader" : { "loader": PyPDFLoader, "kwargs": {}},
    "PDFMinerLoader_by_page" : { "loader": PDFMinerLoader, "kwargs": {"concatenate_pages": False}},
    "PDFMinerLoader_by_pdf" : { "loader": PDFMinerLoader, "kwargs": {"concatenate_pages": True}},
}

# The same splitter settings are used for every loader so that only the parser differs.
splitter = TokenTextSplitter(chunk_size=512, chunk_overlap=0)

for collection_name, config in collection_loader_map.items():
    # One AstraDB collection per loader configuration.
    vstore = AstraDB(
        collection_name=collection_name,
        embedding=open_ai_embeddings,
        token=os.getenv("ASTRA_DB_TOKEN"),
        api_endpoint=os.getenv("ASTRA_DB_ENDPOINT"),
    )

    print(f"Loading PDFs into {collection_name}:")
    loader = DirectoryLoader(
        "data/",
        glob="*/source_files/*.pdf",
        show_progress=True,
        loader_cls=config["loader"],
        loader_kwargs=config["kwargs"],
    )

    # Parse the PDFs, split them into 512-token chunks, then embed and store them.
    vstore.add_documents(splitter.split_documents(loader.load()))
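
The cell above splits each document into fixed windows of 512 tokens with no overlap. As a rough sketch of that idea, the function below applies the same windowing to an already tokenized list. It is not LangChain's implementation: `chunk_tokens` is a hypothetical helper, and the usage example uses whitespace-separated words as a stand-in for real tokenizer output.

```python
def chunk_tokens(tokens, chunk_size=512, chunk_overlap=0):
    """Split a token list into windows of at most `chunk_size` tokens.

    Consecutive windows share `chunk_overlap` tokens. Iteration stops as
    soon as a window reaches the end of the input, so the final window
    is never fully contained in the one before it.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break
        start += step
    return chunks


# Whitespace "tokens" are only an approximation of model tokens.
words = "the quick brown fox jumps over the lazy dog".split()
for chunk in chunk_tokens(words, chunk_size=4, chunk_overlap=0):
    print(" ".join(chunk))
```

With `chunk_overlap=0`, as in this experiment, a sentence that straddles a window boundary is cut in two, which is one reason the quality of the text each loader produces can affect retrieval results.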

In [None]:
# LayoutPDFReader (llmsherpa) example, currently disabled.
# from llmsherpa.readers import LayoutPDFReader

# llmsherpa_api_url = "https://readers.llmsherpa.com/api/document/developer/parseDocument?renderFormat=all"
# pdf_url = "https://arxiv.org/pdf/1910.13461.pdf" # also allowed is a file path e.g. /home/downloads/xyz.pdf
# pdf_reader = LayoutPDFReader(llmsherpa_api_url)
# doc = pdf_reader.read_pdf(pdf_url)

# from llama_index.readers.schema.base import Document
# from llama_index import VectorStoreIndex

# index = VectorStoreIndex([])
# for chunk in doc.chunks():
#     index.insert(Document(text=chunk.to_context_text(), extra_info={}))
# query_engine = index.as_query_engine()

In [None]:
import glob
from pathlib import Path

datasets = []

# Collect the dataset names: the directory directly under data/ for each source PDF
for file_path in glob.glob('data/*/source_files/*.pdf'):
    dataset = Path(file_path).parts[1]
    if dataset not in datasets:
        datasets.append(dataset)

print(datasets)
