# Hybrid Chunking with Docling

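Hybrid chunking applies tokenization-aware refinements on top of document-based hierarchical chunking: it splits chunks that exceed the tokenizer's limit and merges undersized neighboring chunks that share the same headings.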

For more details, see the [Docling HybridChunker docs](https://docling-project.github.io/docling/concepts/chunking/#hybrid-chunker).
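
As a rough picture of what a tokenization-aware refinement can look like, the sketch below splits pieces that exceed a token budget and then merges adjacent undersized pieces that share a heading. It is not Docling's implementation: `Piece`, `count_tokens`, `split_oversized`, `merge_undersized`, and `hybrid_refine` are hypothetical names, and a whitespace word count stands in for a real tokenizer.

```python
from dataclasses import dataclass


@dataclass
class Piece:
    heading: str  # section heading the text belongs to
    text: str


def count_tokens(text: str) -> int:
    # Assumption: whitespace-separated words stand in for real tokens.
    return len(text.split())


def split_oversized(piece: Piece, max_tokens: int) -> list[Piece]:
    # Cut the text into consecutive windows of at most max_tokens words.
    # A piece with empty text yields no windows and is dropped.
    words = piece.text.split()
    return [
        Piece(piece.heading, " ".join(words[i : i + max_tokens]))
        for i in range(0, len(words), max_tokens)
    ]


def merge_undersized(pieces: list[Piece], max_tokens: int) -> list[Piece]:
    # Greedily fold a piece into its predecessor when both share a heading
    # and the combined text still fits within the budget.
    merged: list[Piece] = []
    for piece in pieces:
        if (
            merged
            and merged[-1].heading == piece.heading
            and count_tokens(merged[-1].text) + count_tokens(piece.text) <= max_tokens
        ):
            merged[-1] = Piece(piece.heading, merged[-1].text + " " + piece.text)
        else:
            merged.append(piece)
    return merged


def hybrid_refine(pieces: list[Piece], max_tokens: int) -> list[Piece]:
    # Split first, then merge, so every result respects max_tokens.
    split = [part for piece in pieces for part in split_oversized(piece, max_tokens)]
    return merge_undersized(split, max_tokens)


pieces = [
    Piece("Intro", "a b c d e f g"),
    Piece("Intro", "h"),
    Piece("Intro", "i"),
    Piece("Other", "j k"),
]
refined = hybrid_refine(pieces, max_tokens=3)
for piece in refined:
    print(piece.heading, "|", piece.text)
```

Splitting before merging keeps every resulting piece within the budget, while the heading check prevents text from different sections from being combined.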

## Setup

Install the required packages:

In [1]:
!pip install -qU docling transformers
# You may need to restart the kernel after installation.

## Conversion

Convert the PDF `AR_2020_WEB2.pdf` into a Docling document:

In [2]:
import logging
from pathlib import Path
from docling.document_converter import DocumentConverter

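# Set up logging with the timestamped format shown in the output below
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s",
)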
logger = logging.getLogger(__name__)

# Path to your PDF in the tests folder
pdf_path = Path("tests/AR_2020_WEB2.pdf")

# Convert to a Docling document
converter = DocumentConverter()
doc = converter.convert(source=pdf_path).document
logger.info("Conversion complete.")

2025-09-24 16:24:24,018 - INFO - detected formats: [<InputFormat.PDF: 'pdf'>]
2025-09-24 16:24:24,028 - INFO - Going to convert document batch...
2025-09-24 16:24:24,028 - INFO - Initializing pipeline for StandardPdfPipeline with options hash e647edf348883bed75367b22fbe60347
2025-09-24 16:24:24,194 - INFO - Loading plugin 'docling_defaults'
2025-09-24 16:24:24,195 - INFO - Registered picture descriptions: ['vlm', 'api']
2025-09-24 16:24:24,202 - INFO - Loading plugin 'docling_defaults'
2025-09-24 16:24:24,203 - INFO - Registered ocr engines: ['easyocr', 'ocrmac', 'rapidocr', 'tesserocr', 'tesseract']
2025-09-24 16:24:32,320 - INFO - Accelerator device: 'mps'
2025-09-24 16:24:34,364 - INFO - Accelerator device: 'mps'
2025-09-24 16:24:35,237 - INFO - Accelerator device: 'mps'
2025-09-24 16:24:35,609 - INFO - Processing document AR_2020_WEB2.pdf
2025-09-24 16:24:42,359 - INFO - Finished converting document AR_2020_WEB2.pdf in 18.34 sec.
2025-09-24 16:24:42,361 - INFO - Conversion complete.

## Hybrid Chunking

Perform hybrid chunking on the converted document:

In [3]:
from docling.chunking import HybridChunker

# Instantiate the chunker
chunker = HybridChunker()

# Generate chunks (this may emit a harmless tokenization warning)
chunk_iter = chunker.chunk(dl_doc=doc)

# Inspect and serialize the first few chunks
for i, chunk in enumerate(chunk_iter):
    if i >= 10:
        break
    print(f"=== {i} ===")
    print(f"chunk.text:\n{chunk.text[:300]}…")
    enriched_text = chunker.serialize(chunk)
    print(f"chunker.serialize(chunk):\n{enriched_text[:300]}…\n")

Token indices sequence length is longer than the specified maximum sequence length for this model (681 > 512). Running this sequence through the model will result in indexing errors


=== 0 ===
chunk.text:
bridging the gap between poverty and prosperity…
chunker.serialize(chunk):
ANNUAL REPORT 2020
bridging the gap between poverty and prosperity…

=== 1 ===
chunk.text:
No one could have predicted the events of 2020. The global COVID-19 pandemic created a dynamic year. With the help of volunteers, donors, staff, and most importantly, the blessings of God, Midwest Food Bank responded nimbly to the changing landscape.
All  MFB  locations  remained  open  and  respons…
chunker.serialize(chunk):
A message from Co-Founder, President, and CEO, David Kieser
No one could have predicted the events of 2020. The global COVID-19 pandemic created a dynamic year. With the help of volunteers, donors, staff, and most importantly, the blessings of God, Midwest Food Bank responded nimbly to the changing …

=== 2 ===
chunk.text:
We are humbled and thankful. Moving forward, we continue to follow the leading of the Lord as we live out our mission.
In His service, David Kieser
The Lord is

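
Judging from the output above, the serialized text appears to be the chunk text prefixed with its section heading (for example, `ANNUAL REPORT 2020` in chunk 0). A minimal sketch of that idea follows; `prepend_headings` is a hypothetical helper written for illustration, not Docling's API, and the newline delimiter is an assumption based on the printed output.

```python
def prepend_headings(headings: list[str], text: str, delim: str = "\n") -> str:
    """Join non-empty headings and the chunk text with a delimiter (illustrative only)."""
    parts = [heading for heading in headings if heading] + [text]
    return delim.join(parts)


# Values copied from chunk 0 in the output above.
enriched = prepend_headings(
    ["ANNUAL REPORT 2020"],
    "bridging the gap between poverty and prosperity",
)
print(enriched)
```

Carrying the heading inside the text is useful when chunks are embedded for retrieval, since a short fragment on its own may not say which section it came from.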
