In [None]:
import dotenv

# Load variables from a local .env file, overriding any that are already set
dotenv.load_dotenv(override=True)

# Loading

LangChain provides many document loaders.

For PDFs, two useful options are:

- PyPDF: a Python library that extracts text from PDFs. It works well when the file has a simple structure and does not need OCR;

- Azure Document Intelligence: the LangChain integration with the Azure SDK. Document Intelligence is a Microsoft Azure service that analyzes, extracts, and structures data from unstructured documents, including Word and PDF files, using advanced OCR. It is recommended for scanned or handwritten documents and for documents that contain tables or similar structures. Some features include:

    - **Prebuilt models** for common document types (e.g., invoices, receipts, IDs) and general documents.
    
    - **Custom models** that can be trained on specific document formats.
    
    - **Table** and **key-value** pair extraction for structured data retrieval.

    - **Markdown** text extraction to ease document indexing.

The model we are using is the `prebuilt-layout` model. More info on this page: [What is Azure AI Document Intelligence?](https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/overview?view=doc-intel-4.0.0).

Note that, while loading a document with PyPDF only takes a couple of seconds, Document Intelligence takes longer, because the model has to perform OCR on the document and extract its tables and structure.

In [None]:
FILE_PATH = "<path_to_pdf>"

## PyPDF Loader

In [None]:
from langchain_community.document_loaders import PyPDFLoader

# With its default settings, PyPDFLoader returns one Document per PDF page
loader = PyPDFLoader(FILE_PATH)
document = loader.load()

## Azure Document Intelligence Loader

In [None]:
import os
from langchain_community.document_loaders.doc_intelligence import (
    AzureAIDocumentIntelligenceLoader,
)

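# Endpoint and key are read from the environment (loaded from .env above).
# The loader defaults to the prebuilt-layout model and Markdown output.
loader = AzureAIDocumentIntelligenceLoader(
    api_endpoint=os.environ["DOC_INTELLIGENCE_ENDPOINT"],
    api_key=os.environ["DOC_INTELLIGENCE_KEY"],
    file_path=FILE_PATH,
)

document = loader.load()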

# Splitting

LangChain ships with several text splitters:

- **CharacterTextSplitter**: the simplest splitter. It splits the text on a given separator and builds chunks up to a given chunk size (measured in characters by default; a token-based length function can be used instead).

- **RecursiveCharacterTextSplitter**: more advanced. It tries to split the text recursively on a given list of separators, by default `["\n\n", "\n", " ", ""]`, while respecting a given chunk size. More info [here](https://python.langchain.com/docs/how_to/recursive_text_splitter/).

- **MarkdownTextSplitter**: specific to Markdown text. It extends `RecursiveCharacterTextSplitter` with Markdown-specific separators, such as the end of code blocks and Markdown headings.

- **MarkdownHeaderTextSplitter**: specific to Markdown text. It splits the text at the headers given as a list of `(marker, name)` tuples and stores the matching headers in each chunk's metadata. More info [here](https://python.langchain.com/docs/how_to/markdown_header_metadata_splitter/).
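
To make the recursive strategy more concrete, here is a minimal pure-Python sketch of the idea: try the coarsest separator first, and fall back to finer ones only for pieces that are still too long. This is a simplified illustration, not LangChain's implementation; the function name `recursive_split` is hypothetical, and the sketch ignores chunk overlap and drops the separator at chunk boundaries.

```python
from typing import List, Sequence


def recursive_split(
    text: str,
    chunk_size: int,
    separators: Sequence[str] = ("\n\n", "\n", " ", ""),
) -> List[str]:
    """Split text into chunks of at most chunk_size characters (illustrative sketch)."""
    if len(text) <= chunk_size:
        return [text] if text else []

    # No usable separator left: fall back to a hard cut every chunk_size characters.
    if not separators or separators[0] == "":
        return [text[i : i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks: List[str] = []
    current = ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            # The piece still fits: keep growing the current chunk.
            current = candidate
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: retry with the finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


sample = (
    "First paragraph.\n\n"
    "Second paragraph is a bit longer than the first.\n\n"
    "Third."
)
for chunk in recursive_split(sample, chunk_size=30):
    print(repr(chunk))
```

Short paragraphs stay intact, while the long middle paragraph is broken at spaces rather than mid-word, which is the main reason to prefer a recursive splitter over a single-separator one.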

In [None]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Default settings; pass chunk_size and chunk_overlap to tune the chunks
splitter = RecursiveCharacterTextSplitter()
splitted_docs = splitter.split_documents(document)

In [12]:
from langchain_text_splitters import MarkdownHeaderTextSplitter

splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "Header 1"), ("##", "Header 2")]
)
# Splits the text of the first Document only, and overwrites
# splitted_docs from the previous cell
splitted_docs = splitter.split_text(document[0].page_content)
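
As a rough illustration of what header-based splitting produces, the following pure-Python sketch groups Markdown lines under their most recent headers and records those headers as metadata. It is an assumption-laden simplification, not LangChain's code: `split_on_headers` is a hypothetical name, and the sketch does not handle `#` characters inside code blocks.

```python
import re
from typing import Dict, List, Tuple


def split_on_headers(
    markdown: str, headers_to_split_on: List[Tuple[str, str]]
) -> List[Dict]:
    """Group lines under their enclosing headers (illustrative sketch)."""
    names = dict(headers_to_split_on)  # e.g. {"#": "Header 1", "##": "Header 2"}
    sections: List[Dict] = []
    metadata: Dict[str, str] = {}
    lines: List[str] = []

    def flush() -> None:
        # Emit the text collected so far together with the active headers.
        content = "\n".join(lines).strip()
        if content:
            sections.append({"metadata": dict(metadata), "content": content})
        lines.clear()

    for line in markdown.splitlines():
        match = re.match(r"^(#+)\s+(.*)$", line)
        if match and match.group(1) in names:
            flush()
            level = len(match.group(1))
            # A new header invalidates headers at the same or a deeper level.
            for marker, name in headers_to_split_on:
                if len(marker) >= level:
                    metadata.pop(name, None)
            metadata[names[match.group(1)]] = match.group(2).strip()
        else:
            lines.append(line)
    flush()
    return sections


md = "# Intro\nHello\n## Details\nMore text\n# Next\nBye"
for section in split_on_headers(md, [("#", "Header 1"), ("##", "Header 2")]):
    print(section)
```

Keeping the headers as metadata means each chunk carries its position in the document outline, which can later be used for filtering or for giving the model extra context.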

# Embedding

In [None]:
from langchain_openai import OpenAIEmbeddings

# Uses the default embedding model; the API key is read from OPENAI_API_KEY
embeddings = OpenAIEmbeddings()

# Indexing

Many vector indexes are available, both in LangChain and on the market. For this demo we use a **FAISS** index, which is open source.

The complete list of vector stores supported by LangChain is [here](https://python.langchain.com/docs/integrations/vectorstores/).
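
To illustrate what a vector index does at query time, here is a minimal pure-Python sketch of brute-force nearest-neighbour search by Euclidean distance. It is not FAISS or LangChain code: the names `l2_distance` and `nearest` and the toy 2-D vectors are assumptions made for illustration, whereas real embeddings have hundreds or thousands of dimensions and real indexes use optimized search.

```python
import math
from typing import List, Sequence, Tuple


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(
    query: Sequence[float], vectors: List[Sequence[float]], k: int = 2
) -> List[Tuple[int, float]]:
    """Return (index, distance) for the k stored vectors closest to the query."""
    scored = [(i, l2_distance(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1])  # smallest distance first
    return scored[:k]


# Toy "embeddings" for three chunks (made-up values)
stored = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
print(nearest([1.0, 0.0], stored, k=2))
```

The chunks whose vectors lie closest to the query vector are the ones a retriever would return as the most relevant context.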

In [None]:
from langchain_community.vectorstores import FAISS

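# Embed every chunk and build an in-memory FAISS index
vectorstore = FAISS.from_documents(splitted_docs, embeddings)
# Persist the index to the local folder "pdf_index"
vectorstore.save_local("pdf_index")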