# Docling: Enterprise PDF Processing for RAG

This notebook demonstrates **Docling**, Red Hat's recommended solution for document preparation in RAG workflows.

## Why Docling?

- ✅ **AI-powered** PDF to Markdown conversion
- ✅ **Preserves structure** (tables, lists, headers)
- ✅ **OCR support** for scanned documents
- ✅ **Enterprise ready** - handles complex layouts
- ✅ **Documented by Red Hat** for RAG workflows on OpenShift AI (see the reference below)

**Reference:** [Red Hat OpenShift AI - Preparing Documents with Docling](https://docs.redhat.com/en/documentation/red_hat_openshift_ai_cloud_service/1/html/working_with_rag/deploying-a-rag-stack-in-a-data-science-project#preparing-documents-with-docling-for-llama-stack-retrieval_deploying-a-rag-stack)


## Step 1: Install Docling and Dependencies


In [None]:
!pip install -q docling llama-stack-client


## Step 2: Process a PDF with Docling


In [None]:
import urllib.request

from docling.document_converter import DocumentConverter

# Download Red Hat OpenShift AI RAG Guide
pdf_url = "https://docs.redhat.com/en/documentation/red_hat_openshift_ai_cloud_service/1/pdf/working_with_rag/Red_Hat_OpenShift_AI_Cloud_Service-1-Working_with_RAG-en-US.pdf"
pdf_path = "/tmp/rhoai-rag-guide.pdf"

print("⏳ Downloading PDF...")
urllib.request.urlretrieve(pdf_url, pdf_path)
print("✅ Downloaded")

# Initialize the Docling converter.
# The first conversion may take a while because Docling downloads its layout models.
print("⏳ Converting with Docling...")
converter = DocumentConverter()
result = converter.convert(pdf_path)

# Export to Markdown
markdown = result.document.export_to_markdown()
print(f"✅ Converted to Markdown ({len(markdown)} characters)")
print(f"\nFirst 500 characters:\n{markdown[:500]}...")


## Step 3: Insert into Llama Stack / Milvus

The complete notebook demonstrates:
- ✅ PDF download and conversion with Docling
- ✅ Intelligent chunking strategies
- ✅ Insertion into vector database via Llama Stack
- ✅ Quality comparison (Docling vs. raw text)
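The chunking step listed above can be illustrated with a minimal sketch. It assumes a simple strategy: split the Markdown at headings, then break oversized sections at paragraph boundaries. The function name `chunk_markdown` and the `max_chars` parameter are hypothetical and are not taken from the pipeline itself, which may chunk differently (for example by tokens).

```python
import re


def chunk_markdown(markdown: str, max_chars: int = 1000) -> list[str]:
    """Split Markdown at headings, then cap chunk size at paragraph boundaries.

    Hypothetical helper for illustration; not part of Docling or Llama Stack.
    """
    # Zero-width split before every line that starts with a heading marker.
    sections = re.split(r"(?m)^(?=#{1,6} )", markdown)
    chunks: list[str] = []
    for section in sections:
        section = section.strip()
        if not section:
            continue
        current = ""
        for paragraph in section.split("\n\n"):
            # Start a new chunk when adding this paragraph would exceed the cap.
            if current and len(current) + len(paragraph) + 2 > max_chars:
                chunks.append(current)
                current = paragraph
            else:
                current = f"{current}\n\n{paragraph}" if current else paragraph
        if current:
            chunks.append(current)
    # Note: a single paragraph longer than max_chars is kept whole.
    return chunks


sample = "# Title\n\nIntro paragraph.\n\n## Section A\n\nBody A.\n\n## Section B\n\nBody B.\n"
for chunk in chunk_markdown(sample):
    print(repr(chunk))
```

In the notebook, the `markdown` string produced in Step 2 could be passed in place of `sample`. Splitting at headings keeps each chunk tied to one topic, which is where structure-preserving conversion tends to pay off.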

**See the full implementation in:** `gitops/components/docling-pipeline/job-process-pdfs.yaml`
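For the quality comparison between Docling output and raw text, one rough approach is to count the structural markers that survive extraction. The sketch below is an assumption about how such a comparison could look, not the method used in the pipeline; `markdown_structure_stats` and both sample strings are hypothetical.

```python
import re


def markdown_structure_stats(text: str) -> dict[str, int]:
    """Count Markdown structure markers as a rough proxy for preserved layout."""
    lines = text.splitlines()
    return {
        "headings": sum(bool(re.match(r"#{1,6} ", line)) for line in lines),
        "table_rows": sum(line.lstrip().startswith("|") for line in lines),
        "list_items": sum(
            bool(re.match(r"\s*([-*+]|\d+\.) ", line)) for line in lines
        ),
        "characters": len(text),
    }


# Hypothetical inputs: a structured conversion and a flat extraction of the same page.
structured = "# Guide\n\n| Name | Value |\n|---|---|\n| a | 1 |\n\n- first\n- second\n"
flat = "Guide\nName Value\na 1\nfirst\nsecond\n"

print("structured:", markdown_structure_stats(structured))
print("flat:      ", markdown_structure_stats(flat))
```

Applied to the `markdown` string from Step 2 and to the output of a plain-text extractor of your choice, the counts give a quick signal of how much structure each approach retains. They do not measure retrieval quality directly.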
