This notebook uses [Docling](https://docling-project.github.io/docling/) to convert source documents (PDFs in this example) into Docling Documents. A Docling Document is Docling's representation of a document after conversion, and it can be exported as JSON. The JSON output of this notebook can then be used by other notebooks, such as one that applies Docling's chunking methods.

In [1]:
!pip install accelerate

In [2]:
from docling.document_converter import DocumentConverter, ConversionError, PdfFormatOption
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import (
    HuggingFaceVlmOptions,
    InferenceFramework,
    ResponseFormat,
    VlmPipelineOptions,
)
from docling.pipeline.vlm_pipeline import VlmPipeline
import json
from pathlib import Path

First we set the paths for the documents we want to convert and where the JSON output should live.

In [3]:
doc_path = Path("/home/ffranz/Dev/e2e-poc-source-documents/newtest")
output_dir = Path("/home/ffranz/Dev/e2e-poc-source-documents/newtest/output")

files = []

if doc_path.is_file():
    files = [doc_path]
else:
    files = list(doc_path.rglob("*.pdf"))
print(f"Files to convert: {files}")

Files to convert: [PosixPath('/home/ffranz/Dev/e2e-poc-source-documents/newtest/safebalance-clarity-statement.pdf')]
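The cell above collects only `*.pdf` files. If the source folder mixes file types or upper-case extensions, the same discovery step could be wrapped in a small helper. This is a minimal sketch: `collect_files` and its `suffixes` parameter are hypothetical names introduced here for illustration, and the suffix set is an assumption, not Docling's list of supported formats.

```python
from pathlib import Path


def collect_files(source: Path, suffixes: frozenset = frozenset({".pdf"})) -> list:
    """Return the files to convert: `source` itself, or matching files below it.

    Hypothetical helper for illustration; it is not part of Docling.
    """
    if source.is_file():
        # Same behaviour as the notebook cell: a single file is taken as-is.
        return [source]
    # Compare lower-cased suffixes so that "REPORT.PDF" is found as well.
    return sorted(
        p for p in source.rglob("*")
        if p.is_file() and p.suffix.lower() in suffixes
    )
```

With this helper, `files = collect_files(doc_path)` would stand in for the `if`/`else` block, and passing a larger suffix set would widen the search.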


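Next we set the configuration options for our conversion pipeline. Here we use Docling's VLM pipeline with the `ibm-granite/granite-vision-3.2-2b` model, run through Hugging Face Transformers, and a prompt that asks for Markdown output. More information about pipeline configuration can be found in the [Docling documentation](https://docling-project.github.io/docling/).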

In [4]:
pipeline_options = VlmPipelineOptions()

vlm_prompt = """Extract all text from the image you received, without modification, summarization, or omission.
    Format the output as markdown, using up to three levels of headers (#, ##, and ###) only where they appear 
    in the image, preserving bulleted and numbered lists, and maintaining basic text formatting (bold, italic, 
    underline) exactly where they appear.
    """

pipeline_options.vlm_options = HuggingFaceVlmOptions(
    repo_id="ibm-granite/granite-vision-3.2-2b",
    prompt=vlm_prompt,
    response_format=ResponseFormat.MARKDOWN,
    inference_framework=InferenceFramework.TRANSFORMERS,
)

doc_converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_options=pipeline_options,
            pipeline_cls=VlmPipeline,
        )
    }
)
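Because the prompt is a triple-quoted string, the indentation of its continuation lines is sent to the model along with the text. This notebook does not measure whether that changes the model's output. If a normalized prompt is preferred, the standard library's `inspect.cleandoc` can strip the indentation. The sketch below uses an abbreviated copy of the prompt, and `raw_prompt` is a name introduced here for illustration.

```python
import inspect

# Abbreviated copy of the notebook's prompt, kept short for illustration.
raw_prompt = """Extract all text from the image you received, without modification, summarization, or omission.
    Format the output as markdown, using up to three levels of headers (#, ##, and ###) only where they appear
    in the image.
    """

# cleandoc removes the common indentation of the continuation lines and
# drops leading and trailing blank lines.
vlm_prompt = inspect.cleandoc(raw_prompt)
print(vlm_prompt)
```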

Finally, we convert every document and write the result as Docling JSON, together with a Markdown export. Files that Docling cannot convert are skipped.

In [5]:
# Make sure the output directory exists before writing to it.
output_dir.mkdir(parents=True, exist_ok=True)

for file in files:
    try:
        doc = doc_converter.convert(source=file).document
    except ConversionError as e:
        # Skip documents Docling cannot convert and report why.
        print(f"Skipping file {file}: {e}")
        continue

    json_output_path = output_dir / f"{file.stem}.json"
    md_output_path = output_dir / f"{file.stem}.md"

    # Save the Docling Document as JSON.
    with open(json_output_path, "w", encoding="utf-8") as f:
        json.dump(doc.export_to_dict(), f)
    print(f"Path of JSON output is: {json_output_path.resolve()}")

    # Save a Markdown rendering alongside it.
    with open(md_output_path, "w", encoding="utf-8") as f:
        f.write(doc.export_to_markdown())
    print(f"Path of MARKDOWN output is: {md_output_path.resolve()}")

Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.


Path of JSON output is: /home/ffranz/Dev/e2e-poc-source-documents/newtest/output/safebalance-clarity-statement.json
Path of MARKDOWN output is: /home/ffranz/Dev/e2e-poc-source-documents/newtest/output/safebalance-clarity-statement.md
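
The JSON files written above are the hand-off point to later notebooks. As a minimal sketch of that hand-off, the helper below reads one of them back using only the standard library. `load_docling_json` is a hypothetical name, and the helper deliberately makes no assumption about the Docling schema. A downstream notebook would probably rebuild a Docling Document object from the dict, which is not shown here.

```python
import json
from pathlib import Path


def load_docling_json(path: Path) -> dict:
    """Read a JSON file written by the conversion loop back into a dict.

    Hypothetical helper for illustration; it only checks that the file
    holds a JSON object and does not validate the Docling schema.
    """
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    if not isinstance(data, dict):
        raise ValueError(f"Expected a JSON object in {path}, got {type(data).__name__}")
    return data
```

For example, `load_docling_json(output_dir / "safebalance-clarity-statement.json")` would return the exported dictionary for the file converted above.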
