## Installation
Uncomment and run this cell only once.

In [7]:
#%pip install docling
%pip install markdown

Collecting markdown
  Downloading markdown-3.10-py3-none-any.whl.metadata (5.1 kB)
Downloading markdown-3.10-py3-none-any.whl (107 kB)
Installing collected packages: markdown
Successfully installed markdown-3.10
Note: you may need to restart the kernel to use updated packages.


## Importing Libraries

In [3]:
from docling.document_converter import DocumentConverter

A module that was compiled using NumPy 1.x cannot be run in
NumPy 2.2.6 as it may crash. To support both 1.x and 2.x
versions of NumPy, modules must be compiled with NumPy 2.0.
Some module may need to rebuild instead e.g. with 'pybind11>=2.12'.

If you are a user of the module, the easiest solution will be to
downgrade to 'numpy<2' or try to upgrade the affected module.
We expect that some modules will need time to support NumPy 2.

Traceback (most recent call last):  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/Users/emalaga/Library/CloudStorage/Box-Box/Personal/Coding Projects/Tutorials/.venv/lib/python3.11/site-packages/ipykernel_launcher.py", line 18, in <module>
    app.launch_new_instance()
  File "/Users/emalaga/Library/CloudStorage/Box-Box/Personal/Coding Projects/Tutorials/.venv/lib/python3.11/site-packages/traitlets/config/application.py", line 1075, in launch
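
The warning above suggests either pinning `numpy<2` or upgrading the affected module. As a minimal sketch, the installed NumPy major version can be checked before importing docling. The helper names `major_version` and `numpy_needs_pin` are hypothetical and not part of the original notebook.

```python
from importlib import metadata


def major_version(version: str) -> int:
    """Return the leading integer of a version string such as '2.2.6'."""
    return int(version.split(".")[0])


def numpy_needs_pin() -> bool:
    """True if an installed NumPy is 2.x or newer; False if it is 1.x or absent."""
    try:
        installed = metadata.version("numpy")
    except metadata.PackageNotFoundError:
        return False
    return major_version(installed) >= 2


if numpy_needs_pin():
    # Assumption: pinning below 2 is acceptable for the other packages in use.
    print("NumPy 2.x detected; consider: %pip install 'numpy<2'")
```

Restart the kernel after changing the NumPy version, as the install output recommends.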

## Analyzing the source

First, we set the source URL and parse it with Docling's DocumentConverter.

In [5]:

source = "https://arxiv.org/pdf/2408.09869"
converter = DocumentConverter()
doc = converter.convert(source).document

2025-11-14 09:05:51,660 - INFO - detected formats: [<InputFormat.PDF: 'pdf'>]
2025-11-14 09:05:51,672 - INFO - Going to convert document batch...
2025-11-14 09:05:51,673 - INFO - Initializing pipeline for StandardPdfPipeline with options hash 44ae89a68fc272bc7889292e9b5a1bad
2025-11-14 09:05:51,677 - INFO - Auto OCR model selected ocrmac.
2025-11-14 09:05:51,680 - INFO - Accelerator device: 'cpu'
2025-11-14 09:05:52,645 - INFO - Accelerator device: 'cpu'
2025-11-14 09:05:53,669 - INFO - Processing document 2408.09869v5.pdf
2025-11-14 09:05:54,165 - ERROR - Stage layout failed for run 1: Unable to create tensor, you should probably activate padding with 'padding=True' to have batched tensors with the same length.
2025-11-14 09:05:55,264 - ERROR - Stage layout failed for run 1: Unable to create tensor, you should probably activate padding with 'padding=True' to have batched tensors with the same length.
2025-11-14 09:05:56,238 - ERROR - Stage layout failed for run 1: Unable to create ten

ConversionError: Conversion failed for: 2408.09869v5.pdf with status: ConversionStatus.FAILURE
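
Because the conversion raised an error, `doc` was never assigned, so any later cell that uses `doc` fails with a NameError. One way to make that failure explicit is to wrap the conversion call, as in the sketch below. `try_convert` is a hypothetical helper. It accepts any callable, so it assumes nothing about Docling beyond the `converter.convert(source).document` call already shown above.

```python
from typing import Any, Callable, Optional


def try_convert(convert: Callable[[str], Any], source: str) -> Optional[Any]:
    """Call convert(source) and return its result, or None if it raises."""
    try:
        return convert(source)
    except Exception as exc:  # broad on purpose: the error type depends on the backend
        print(f"Conversion failed for {source}: {type(exc).__name__}: {exc}")
        return None


# Usage with the notebook's objects (assumes `converter` and `source` exist):
# doc = try_convert(lambda s: converter.convert(s).document, source)
# if doc is None:
#     print("Resolve the conversion error before running the cells below.")
```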

Displaying the markdown representation of the PDF

In [8]:
from IPython.display import HTML, display

# Convert markdown to HTML and display in scrollable div
from markdown import markdown
html_content = markdown(doc.export_to_markdown())

display(HTML(f'''
<div style="max-height: 600px; overflow-y: auto; border: 1px solid #ddd; padding: 10px;">
    {html_content}
</div>
'''))

NameError: name 'doc' is not defined

## Creating a RAG application
This section splits the text on its Markdown headers, embeds each chunk with an embedding model, and stores the resulting vectors in a FAISS vector store.
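
To illustrate what header-based splitting produces, here is a minimal pure-Python sketch. It is an approximation for illustration only, not the LangChain implementation, and `split_by_headers` is a hypothetical name. It does not treat fenced code blocks specially, so a `#` comment line inside code would be read as a header.

```python
import re
from typing import Dict, List, Tuple

# Matches level 1-3 headers only, mirroring the three levels used in this notebook.
HEADER_RE = re.compile(r"^(#{1,3})\s+(.*)$")


def split_by_headers(text: str) -> List[Tuple[Dict[str, str], str]]:
    """Split Markdown into (header metadata, body) chunks on #, ## and ### lines."""
    chunks: List[Tuple[Dict[str, str], str]] = []
    current: Dict[int, str] = {}  # active header title per level
    body: List[str] = []

    def flush() -> None:
        """Emit the accumulated body under the headers active so far."""
        content = "\n".join(body).strip()
        if content:
            meta = {f"Header {lvl}": title for lvl, title in sorted(current.items())}
            chunks.append((meta, content))
        body.clear()

    for line in text.splitlines():
        match = HEADER_RE.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # A new header replaces its own level and drops any deeper ones.
            for lvl in [l for l in current if l >= level]:
                del current[lvl]
            current[level] = match.group(2).strip()
        else:
            body.append(line)
    flush()
    return chunks


sample = "# Title\nintro\n## A\nalpha\n## B\nbeta"
for meta, content in split_by_headers(sample):
    print(meta, "->", content)
```

Each chunk carries the headers it sits under as metadata, which is what lets a retrieved chunk be traced back to its section.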

In [9]:
# Minimal RAG Application

# 1. Get markdown content from the document
markdown_content = doc.export_to_markdown()

# 2. Split text by markdown headers
from langchain_text_splitters import MarkdownHeaderTextSplitter
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]
markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
md_header_splits = markdown_splitter.split_text(markdown_content)
print(f"Split into {len(md_header_splits)} chunks")

# 3. Create embeddings using HuggingFace model
from langchain_huggingface import HuggingFaceEmbeddings
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

# 4. Create FAISS vector store
from langchain_community.vectorstores import FAISS
vectorstore = FAISS.from_documents(md_header_splits, embeddings)

# 5. Create retriever
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})

# 6. Set up the LLM
from langchain_ollama import OllamaLLM

llm = OllamaLLM(model="granite4:micro")

# 7. Create RAG chain
from langchain.chains import RetrievalQA

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    return_source_documents=True
)



2025-10-22 12:05:36,621 - INFO - Use pytorch device_name: mps
2025-10-22 12:05:36,622 - INFO - Load pretrained SentenceTransformer: sentence-transformers/all-MiniLM-L6-v2


Split into 20 chunks
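
The retriever above is configured with `k=3`, and the chain uses the "stuff" chain type, which places all retrieved chunks into a single prompt. The sketch below shows both ideas with toy vectors. It is not how FAISS or LangChain work internally: cosine similarity is used only for illustration (the actual store may rank by a different distance), and the prompt template is made up.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: Sequence[float],
          chunks: List[Tuple[str, Sequence[float]]],
          k: int = 3) -> List[str]:
    """Return the texts of the k chunks whose vectors are most similar to query."""
    ranked = sorted(chunks, key=lambda item: cosine(query, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def stuff_prompt(question: str, contexts: List[str]) -> str:
    """Concatenate all retrieved chunks into one prompt (hypothetical template)."""
    joined = "\n\n".join(contexts)
    return f"Use the context to answer.\n\nContext:\n{joined}\n\nQuestion: {question}"


# Toy 2-D vectors stand in for real embeddings.
store = [
    ("pipeline", (1.0, 0.0)),
    ("tables", (0.8, 0.6)),
    ("license", (0.0, 1.0)),
]
hits = top_k((1.0, 0.1), store, k=2)
print(stuff_prompt("How does the pipeline work?", hits))
```

Because every retrieved chunk is placed in one prompt, `k` and the chunk size together determine how much context the model receives.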


Asking questions

In [10]:
# 8. Test with a question
question = "What is this document about?"
result = qa_chain.invoke({"query": question})

print(f"\nQuestion: {question}")
print(f"\nAnswer: {result['result']}")

2025-10-22 12:05:58,823 - INFO - HTTP Request: POST http://127.0.0.1:11434/api/generate "HTTP/1.1 200 OK"



Question: What is this document about?

Answer: This document describes a system called Docling that converts scanned PDF documents into structured data. The key points are:

1. Docling uses a linear pipeline of operations to process each document sequentially.
2. It first parses the PDF to extract text tokens and render bitmap images of each page.
3. Then, it applies AI models on each page to extract features like layout and table structures.
4. The results from all pages are aggregated and passed through a post-processing stage that augments metadata, detects language, infers reading order, and assembles a typed document object.
5. Docling allows easy extension of the model library and pipelines for improving conversion quality and extracted metadata.

In summary, this document is about an open-source system called Docling that converts scanned PDF documents into structured data using AI models, with plans to expand its capabilities in the future.
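
Because the chain was created with `return_source_documents=True`, the result should also include the retrieved chunks, which helps verify what the answer is based on. The sketch below assumes they are available under the `"source_documents"` key as objects with `metadata` and `page_content` attributes. `summarize_sources` is a hypothetical helper, shown here with stand-in objects rather than real chain output.

```python
from types import SimpleNamespace
from typing import Any, Dict, List


def summarize_sources(result: Dict[str, Any], width: int = 60) -> List[str]:
    """Return one line per source chunk: its header metadata and a text preview."""
    lines = []
    for doc in result.get("source_documents", []):
        headers = " > ".join(str(v) for v in doc.metadata.values()) or "(no header)"
        # Collapse whitespace so the preview fits on a single line.
        preview = " ".join(doc.page_content.split())[:width]
        lines.append(f"{headers}: {preview}")
    return lines


# Stand-in for a chain result; real entries would be LangChain Document objects.
fake_result = {
    "result": "...",
    "source_documents": [
        SimpleNamespace(metadata={"Header 2": "Intro"}, page_content="Some  text\nhere"),
    ],
}
for line in summarize_sources(fake_result):
    print(line)
```

In the notebook, the same helper could be called as `summarize_sources(result)` after `qa_chain.invoke(...)`.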
