# Convert With Docling
Convert a PDF document to Markdown with Docling:

In [1]:
from pathlib import Path

from docling.datamodel.base_models import InputFormat
from docling.document_converter import DocumentConverter

from ProcessingFunctions import FileProcessor  # project-local helper module, not part of Docling


# Initialize and configure the converter
converter = DocumentConverter(allowed_formats=[InputFormat.PDF])

# Ensure output directory exists
output_dir = Path("result")
output_dir.mkdir(parents=True, exist_ok=True)

pdf_path = "report.pdf"

print("Converting PDF...")

result = converter.convert(source=pdf_path)

print("Exporting to Markdown...")

markdown_output = result.document.export_to_markdown()

print("Writing file in Markdown...")

processor = FileProcessor()
processor.write(markdown_output, output_dir)



2025-11-19 11:43:51,385 - INFO - detected formats: [<InputFormat.PDF: 'pdf'>]
2025-11-19 11:43:51,519 - INFO - Going to convert document batch...
2025-11-19 11:43:51,521 - INFO - Initializing pipeline for StandardPdfPipeline with options hash 44ae89a68fc272bc7889292e9b5a1bad
2025-11-19 11:43:51,546 - INFO - Loading plugin 'docling_defaults'
2025-11-19 11:43:51,551 - INFO - Registered picture descriptions: ['vlm', 'api']
2025-11-19 11:43:51,575 - INFO - Loading plugin 'docling_defaults'
2025-11-19 11:43:51,585 - INFO - Registered ocr engines: ['auto', 'easyocr', 'ocrmac', 'rapidocr', 'tesserocr', 'tesseract']


Converting PDF...


2025-11-19 11:43:52,132 - INFO - Accelerator device: 'cpu'
[INFO] 2025-11-19 11:43:52,149 [RapidOCR] base.py:22: Using engine_name: onnxruntime
[INFO] 2025-11-19 11:43:52,165 [RapidOCR] download_file.py:60: File exists and is valid: C:\Users\Ana\Documents\SnowCamp\DoclingDemo\venv\Lib\site-packages\rapidocr\models\ch_PP-OCRv4_det_infer.onnx
[INFO] 2025-11-19 11:43:52,167 [RapidOCR] main.py:53: Using C:\Users\Ana\Documents\SnowCamp\DoclingDemo\venv\Lib\site-packages\rapidocr\models\ch_PP-OCRv4_det_infer.onnx
[INFO] 2025-11-19 11:43:52,291 [RapidOCR] base.py:22: Using engine_name: onnxruntime
[INFO] 2025-11-19 11:43:52,295 [RapidOCR] download_file.py:60: File exists and is valid: C:\Users\Ana\Documents\SnowCamp\DoclingDemo\venv\Lib\site-packages\rapidocr\models\ch_ppocr_mobile_v2.0_cls_infer.onnx
[INFO] 2025-11-19 11:43:52,297 [RapidOCR] main.py:53: Using C:\Users\Ana\Documents\SnowCamp\DoclingDemo\venv\Lib\site-packages\rapidocr\models\c

Exporting to Markdown...
Writing file in Markdown...
Exported to result\report.md
Export to Markdown completed!
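
`FileProcessor` comes from the local `ProcessingFunctions` module, whose source is not shown here. As a rough stand-in, the write step could look like the sketch below. The function name `write_markdown` and its `stem` parameter are hypothetical; the real `FileProcessor.write` takes only the Markdown string and the output directory, so it must derive the file name some other way.

```python
from pathlib import Path


def write_markdown(markdown_text: str, output_dir: Path, stem: str) -> Path:
    """Write Markdown text to <output_dir>/<stem>.md and return the path.

    Hypothetical helper: an assumption about what the write step does,
    not the actual FileProcessor implementation.
    """
    # Create the target directory if it does not exist yet.
    output_dir.mkdir(parents=True, exist_ok=True)
    target = output_dir / f"{stem}.md"
    # Write as UTF-8 so non-ASCII characters in the document survive.
    target.write_text(markdown_text, encoding="utf-8")
    print(f"Exported to {target}")
    return target
```

With the variables from the cell above, `write_markdown(markdown_output, output_dir, Path(pdf_path).stem)` would write `report.md` into the `result` directory.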


## Result snippet:
## Impact, total portfolio

| Category             | Sub- category                  |   Green Bond Asset portfolio amount, EURm | Annual emissions avoided, tC02e   |
|----------------------|--------------------------------|-------------------------------------------|-----------------------------------|
| Clean transportation | Electric cars                  |                                        87 | 4193                              |
| Clean transportation | Electric ferries               |                                         2 | 3200                              |
| Clean transportation | Electric Trains                |                                       285 | 908 520                           |
| Clean transportation | Subtotal                       |                                       373 | 915913                            |
| Energy Efficiency    | Energy Efficiency              |                                         1 | 40                                |
| Energy Efficiency    | Subtotal                       |                                         1 | 40                                |
| Green Buildings      | Green Buildings                |                                      1857 | 6712                              |
| Green Buildings      | Subtotal                       |                                      1857 | 6712                              |
| Renewable            | Hydro                          |                                       569 | 645831                            |
| Renewable            | Solar                          |                                         0 | 198                               |
| energy               | Wind                           |                                       341 | 452242                            |
| energy               | Subtotal                       |                                       911 | 1098272                           |
| Pollution Prevention | Waste to Energy                |                                       385 | 377514                            |
| and control          | Water and WasteWater Treatment |                                       229 |                                   |
| and control          | Subtotal                       |                                       614 | 377514                            |
| and control          | Grand Total                    |                                      3756 | 2398 451                          |



## Token Count

Compare the token count of the raw PDF file with that of the Markdown produced by Docling.

In [2]:
import tiktoken
from ProcessingFunctions import CountTokens

# Get encoding
encoding = tiktoken.get_encoding("cl100k_base")  # Encoding used by GPT-4 and GPT-3.5-turbo

# Create CountTokens instance
token_counter = CountTokens()

# Compare token counts: raw PDF vs parsed markdown
token_counter.compare_token_counts(pdf_path, markdown_output, encoding)

Token count before parsing: 4,754,776 (4.8M)
Token count after PDF parsing: 25,071 (25.1K)

PDF content: 'ŧeEjR'jW5RHtpkQ-).=Ǹϓts=>_
<</Filter/FlateDecode/Length 816>>stream
 [3 6N7LNէƣ&4!t#YaUظ>{͗_3| oheeݹ,ܓ÷8}|2;,\Z7ɱPJݷ'T,g-zO>0ݱ,C"m<NIH@I)/O'UԤP*'7>3,嚏a=۩ihcklΝB3]ސzeZ%:$u~pIJDY͟p])
*C^H3qkWM'

Markdown content: 'build on their disclosure approach.

In updating the Dashboard, the working group noted the following changes and points of continuity since 2021:

- Many, if not most, firms are disclosing a range of metrics in their Task Force on Climate-related Financial Disclosures (TCFD) and other climate-relat'


(4754776, 25071)
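
`CountTokens` is also part of the local `ProcessingFunctions` module and is not shown. The sketch below is one way such a comparison might be written, under two assumptions: the raw PDF is read as bytes and decoded leniently into text, and tokenization is passed in as a plain `encode` callable. The function names and the decoding strategy are illustrative, not the author's implementation.

```python
from pathlib import Path
from typing import Callable, Sequence, Tuple


def abbreviate(n: int) -> str:
    """Shorten a count with a K or M suffix, e.g. 25071 -> '25.1K'."""
    if n >= 1_000_000:
        return f"{n / 1_000_000:.1f}M"
    if n >= 1_000:
        return f"{n / 1_000:.1f}K"
    return str(n)


def compare_token_counts(
    pdf_path: str,
    markdown_text: str,
    encode: Callable[[str], Sequence],
) -> Tuple[int, int]:
    """Return (tokens in the raw PDF, tokens in the parsed Markdown)."""
    # Assumption: treat the PDF as text by dropping undecodable bytes.
    raw_text = Path(pdf_path).read_bytes().decode("utf-8", errors="ignore")
    before = len(encode(raw_text))
    after = len(encode(markdown_text))
    print(f"Token count before parsing: {before:,} ({abbreviate(before)})")
    print(f"Token count after PDF parsing: {after:,} ({abbreviate(after)})")
    return before, after
```

With tiktoken, the callable could be `lambda s: encoding.encode(s, disallowed_special=())`, which keeps the encoder from raising if the raw bytes happen to spell out a special token. For the counts reported above, 4,754,776 / 25,071 is roughly 190, so the parsed Markdown needs about 190 times fewer tokens than the raw file.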