# Hybrid Chunking with Docling

Hybrid chunking applies tokenization-aware refinements on top of document-based hierarchical chunking.

For more details, see the [Docling HybridChunker docs](https://docling-project.github.io/docling/concepts/chunking/#hybrid-chunker).
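
To make the idea concrete, here is a minimal, self-contained sketch of what "tokenization-aware refinements" can look like: one pass splits chunks that exceed a token budget, and a second pass merges undersized neighbors that share a heading. This is an illustration, not Docling's implementation. The `Chunk` class, `count_tokens`, `split_oversized`, `merge_undersized`, and `hybrid_refine` are hypothetical names, and whitespace splitting stands in for a real tokenizer.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    heading: str  # section heading the text belongs to
    text: str


def count_tokens(text: str) -> int:
    # Assumption: whitespace splitting stands in for a real tokenizer.
    return len(text.split())


def split_oversized(chunk: Chunk, max_tokens: int) -> list[Chunk]:
    # Cut the text into consecutive windows of at most max_tokens words.
    words = chunk.text.split()
    pieces = [
        Chunk(chunk.heading, " ".join(words[i:i + max_tokens]))
        for i in range(0, len(words), max_tokens)
    ]
    return pieces or [chunk]  # keep empty chunks unchanged


def merge_undersized(chunks: list[Chunk], max_tokens: int) -> list[Chunk]:
    # Merge a chunk into its predecessor when both share a heading
    # and the combined text still fits the token budget.
    merged: list[Chunk] = []
    for chunk in chunks:
        prev = merged[-1] if merged else None
        if (
            prev is not None
            and prev.heading == chunk.heading
            and count_tokens(prev.text) + count_tokens(chunk.text) <= max_tokens
        ):
            merged[-1] = Chunk(prev.heading, prev.text + " " + chunk.text)
        else:
            merged.append(chunk)
    return merged


def hybrid_refine(chunks: list[Chunk], max_tokens: int) -> list[Chunk]:
    # Pass 1: split oversized chunks. Pass 2: merge undersized peers.
    split = [piece for c in chunks for piece in split_oversized(c, max_tokens)]
    return merge_undersized(split, max_tokens)


doc_chunks = [
    Chunk("Intro", "one two three four five six seven"),
    Chunk("Results", "alpha beta"),
    Chunk("Results", "gamma"),
]
refined = hybrid_refine(doc_chunks, max_tokens=4)
for c in refined:
    print(c.heading, "|", c.text)
```

In this toy run the seven-word "Intro" chunk is split in two, while the two short "Results" chunks are merged into one, so every resulting chunk stays within the four-token budget.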

## Setup

Install the required packages:

In [None]:
!pip install -qU docling transformers
# You may need to restart the kernel after installation.

## Conversion

Convert the PDF `AR_2020_WEB2.pdf` into a Docling document:

In [4]:
import logging
from pathlib import Path
from docling.document_converter import DocumentConverter

# Set up logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Path to your PDF in the tests folder
pdf_path = Path("tests/AR_2020_WEB2.pdf")

# Convert to a Docling document
converter = DocumentConverter()
doc = converter.convert(source=pdf_path).document
logger.info("Conversion complete.")

INFO:docling.document_converter:Going to convert document batch...
INFO:docling.document_converter:Initializing pipeline for StandardPdfPipeline with options hash 70041f74270850b7bedf7c8f5c2dcede
INFO:docling.utils.accelerator_utils:Accelerator device: 'mps'
INFO:docling.utils.accelerator_utils:Accelerator device: 'mps'
INFO:docling.utils.accelerator_utils:Accelerator device: 'mps'
INFO:docling.pipeline.base_pipeline:Processing document AR_2020_WEB2.pdf
INFO:docling.document_converter:Finished converting document AR_2020_WEB2.pdf in 10.23 sec.
INFO:__main__:Conversion complete.


## Hybrid Chunking

Perform hybrid chunking on the converted document:

In [6]:
from docling.chunking import HybridChunker

# Instantiate the chunker
chunker = HybridChunker()

# Generate chunks (this may emit a harmless tokenization warning)
chunk_iter = chunker.chunk(dl_doc=doc)

# Inspect and serialize the first few chunks
for i, chunk in enumerate(chunk_iter):
    if i >= 10:
        break
    print(f"=== {i} ===")
    print(f"chunk.text:\n{chunk.text[:300]}…")
    enriched_text = chunker.serialize(chunk)
    print(f"chunker.serialize(chunk):\n{enriched_text[:300]}…\n")

Token indices sequence length is longer than the specified maximum sequence length for this model (681 > 512). Running this sequence through the model will result in indexing errors
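
The warning above reports a sequence longer than the tokenizer's limit (681 > 512). The code comment describes it as harmless, but it can still be reassuring to check that the final chunks fit the budget. The sketch below shows one way to do that, assuming you can supply a token-counting function. `find_oversized` is a hypothetical helper, not part of Docling, and the whitespace counter is only a stand-in for the tokenizer actually used by the chunker.

```python
from typing import Callable, Iterable


def find_oversized(
    texts: Iterable[str],
    count_tokens: Callable[[str], int],
    max_tokens: int,
) -> list[tuple[int, int]]:
    """Return (index, token_count) for every text above max_tokens."""
    oversized = []
    for i, text in enumerate(texts):
        n = count_tokens(text)
        if n > max_tokens:
            oversized.append((i, n))
    return oversized


def whitespace_count(text: str) -> int:
    # Assumption: a crude stand-in for a real tokenizer's token count.
    return len(text.split())


# Stand-in data; in the notebook these would be the serialized chunk texts.
sample_texts = ["a b c", "a b c d e f", "a b"]
report = find_oversized(sample_texts, whitespace_count, max_tokens=4)
print(report)
```

An empty report means every checked text is within the limit. With a real tokenizer, `count_tokens` would wrap that tokenizer's own counting logic so the check matches the model's notion of a token.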


=== 0 ===
chunk.text:
bridging the gap between poverty and prosperity…
chunker.serialize(chunk):
bridging the gap between poverty and prosperity…

=== 1 ===
chunk.text:
No one could have predicted the events of 2020. The global COVID-19 pandemic created a dynamic year. With the help of volunteers, donors, staff, and most importantly, the blessings of God, Midwest Food Bank responded nimbly to the changing landscape.
All  MFB  locations  remained  open  and  respons…
chunker.serialize(chunk):
A message from Co-Founder, President, and CEO, David Kieser
No one could have predicted the events of 2020. The global COVID-19 pandemic created a dynamic year. With the help of volunteers, donors, staff, and most importantly, the blessings of God, Midwest Food Bank responded nimbly to the changing …

=== 2 ===
chunk.text:
• MFB distributed a record amount of food, 37% more than in 2019.
· In 2020, we sent a record number of family food boxes in Disaster Relief semi loads, nearly six times more tha

