## Review docling implementation

Also see: https://ds4sd.github.io/docling/concepts/docling_document/#basic-structure

In [1]:
from docling.document_converter import DocumentConverter
import json
import re

### A. Convert document to json

In [2]:
# Convert the PDF of Regulation (EU) No 575/2013 (CRR) from EUR-Lex
source = "https://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=CELEX:32013R0575"
converter = DocumentConverter()
result = converter.convert(source)

# Save the dictionary to a JSON file
with open('crr.json', 'w', encoding='utf-8') as json_file:
    json.dump(result.document.export_to_dict(), json_file, indent=4)

# Write the result to a markdown file
with open('crr.md', 'w', encoding='utf-8') as f:
    f.write(result.document.export_to_markdown())

print("Document saved as JSON and markdown file!")

Document saved as JSON and markdown file!
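
To avoid re-running the conversion in a later session, the saved dictionary can be read back from disk. A minimal sketch, assuming `crr.json` was written by the cell above; `load_doc_dict` is a hypothetical helper name, not part of docling:

```python
import json
from pathlib import Path

def load_doc_dict(path="crr.json"):
    """Load a previously exported document dictionary from disk."""
    with Path(path).open(encoding="utf-8") as json_file:
        return json.load(json_file)

# Usage (assumes crr.json exists in the working directory):
# doc = load_doc_dict("crr.json")
```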


### B. Inspect the contents

In [3]:
doc = result.document.export_to_dict()

In [4]:
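def extract_headings(doc):
    """Return (level, text, page_no) for each level-1 or level-2 section header."""
    headings = []

    for text in doc.get('texts', []):
        level = text.get('level')
        if text.get('label') != 'section_header' or level not in (1, 2):
            continue
        prov = text.get('prov', [])
        # Guard against items without provenance; page_no is None in that case
        page_no = prov[0].get('page_no') if prov else None
        headings.append((level, text.get('orig', ''), page_no))

    return headings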

# Extract headings
headings = extract_headings(doc)
for level, orig, page_no in headings:
    print(f"Level {level} | {page_no} | {orig} ")

Level 1 | 1 | REGULATIONS 
Level 1 | 18 | PART ONE 
Level 1 | 18 | GENERAL PROVISIONS 
Level 1 | 18 | SUBJECT MATTER, SCOPE AND DEFINITIONS 
Level 1 | 18 | Scope 
Level 1 | 18 | Article 2 
Level 1 | 18 | Supervisory powers 
Level 1 | 18 | Article 3 
Level 1 | 18 | Application of stricter requirements by institutions 
Level 1 | 18 | Article 4 
Level 1 | 18 | Definitions 
Level 1 | 27 | Article 5 
Level 1 | 27 | Definitions specific to capital requirements for credit risk 
Level 1 | 27 | TITLE II 
Level 1 | 27 | LEVEL OF APPLICATION OF REQUIREMENTS 
Level 1 | 27 | CHAPTER 1 
Level 1 | 27 | Application of requirements on an individual basis 
Level 1 | 27 | Article 6 
Level 1 | 27 | General principles 
Level 1 | 28 | Article 7 
Level 1 | 28 | Derogation to the application of prudential requirements on an individual basis 
Level 1 | 28 | Article 8 
Level 1 | 28 | Derogation to the application of liquidity requirements on an individual basis 
Level 1 | 29 | Article 9 
Level 1 | 29 | Individu
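
Every heading shown in the (truncated) output above is reported as level 1, so the `level` field alone does not appear to separate PART, TITLE, CHAPTER and Article headings here. A quick tally makes that easy to confirm across the whole document. This is a sketch over `(level, text, page_no)` tuples in the shape returned by `extract_headings`; `count_levels` is a hypothetical helper name and the sample data is illustrative:

```python
from collections import Counter

def count_levels(headings):
    """Count how many headings were assigned to each level."""
    return Counter(level for level, _orig, _page_no in headings)

# Illustrative input in the same shape as the output above
sample = [(1, "PART ONE", 18), (1, "Article 2", 18), (1, "TITLE II", 27)]
print(count_levels(sample))
```

On the real data this would be called as `count_levels(headings)`.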

In [9]:
# Reuse extract_headings() from the earlier cell and print the structural
# markers (PART, TITLE, CHAPTER, Section, Article) with their numbering.
# 'S e c t i o n' is letter-spaced in the extracted text, so match it as-is.
substrings = ['TITLE', 'CHAPTER', 'S e c t i o n', 'PART', 'Article']

headings = extract_headings(doc)
for level, orig, page_no in headings:
    for substr in substrings:
        if orig.startswith(substr):
            rest_of_text = orig[len(substr):].lstrip()
            first_20_chars = rest_of_text[:20]
            print(f"{substr} | {first_20_chars}")
            break

PART | ONE
Article | 2
Article | 3
Article | 4
Article | 5
TITLE | II
CHAPTER | 1
Article | 6
Article | 7
Article | 8
Article | 9
Article | 10
Article | 12
Article | 13
CHAPTER | 2
S e c t i o n | 1
Article | 11
Article | 14
Article | 15
Article | 16
Article | 17
S e c t i o n | 2
S e c t i o n | 3
Article | 19
Article | 20
Article | 21
Article | 22
Article | 23
Article | 24
CHAPTER | 2
Article | 27
Article | 28
Article | 29
Article | 30
Article | 31
S e c t i o n | 2
Article | 32
Article | 33
Article | 34
Article | 35
S e c t i o n | 3
Article | 37
Article | 38
Article | 39
Article | 40
Article | 41
Article | 42
Article | 43
Article | 44
Article | 45
Article | 46
Article | 47
Article | 48
Article | 49
S e c t i o n | 4
CHAPTER | 3
Article | 51
Article | 52
Article | 53
Article | 54
Article | 55
S e c t i o n | 2
Article | 57
Article | 58
Article | 59
Article | 60
S e c t i o n | 3
CHAPTER | 4
Article | 63
Article | 64
Article | 65
S e c t i o n | 2
Article | 67
Article | 68
Article | 
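
The listing above skips some article numbers (for example, no `Article 1` line appears) and shows `Section` as letter-spaced text. Both can be checked programmatically with the `re` module imported at the top. The sketch below is an assumption about how such a check might look, not part of docling; `collapse_letter_spacing` and `find_missing_articles` are hypothetical helpers, and the regex will also join any other run of three or more single letters:

```python
import re

def collapse_letter_spacing(text):
    """Join runs of single letters separated by spaces ('S e c t i o n' -> 'Section')."""
    return re.sub(
        r"\b(?:[A-Za-z] ){2,}[A-Za-z]\b",
        lambda match: match.group(0).replace(" ", ""),
        text,
    )

def find_missing_articles(heading_texts):
    """Return article numbers absent between 1 and the highest number seen."""
    seen = set()
    for text in heading_texts:
        match = re.match(r"Article\s+(\d+)\b", text)
        if match:
            seen.add(int(match.group(1)))
    if not seen:
        return []
    return [n for n in range(1, max(seen) + 1) if n not in seen]

# Illustrative input; on the real data pass (orig for _, orig, _ in headings)
sample = ["PART ONE", "Article 2", "Article 3", "S e c t i o n 1", "Article 5"]
print([collapse_letter_spacing(text) for text in sample])
print(find_missing_articles(sample))
```

A missing number only means the heading was not tagged as a level-1 or level-2 `section_header`; the article text itself may still be present under another label.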

### C. Check the rendering of Table 1 on page 142

| Credit quality  step with which  the credit  assessment of  the debt security  is associated   | Residual  Maturity   | Volatility adjustments for debt securities  issued by entities described in  Article 197(1)(b)   | Volatility adjustments for debt securities  issued by entities described in  Article 197(1)(b)   | Volatility adjustments for debt securities  issued by entities described in  Article 197(1)(b)   | Volatility adjustments for debt securities  issued by entities described in  Article 197(1)(b)   | Volatility adjustments for debt securities  issued by entities described in  Article 197(1) (c) and (d)   | Volatility adjustments for debt securities  issued by entities described in  Article 197(1) (c) and (d)   | Volatility adjustments for securitisation  positions and meeting the criteria in  Article 197(1) (h)   | Volatility adjustments for securitisation  positions and meeting the criteria in  Article 197(1) (h)   | Volatility adjustments for securitisation  positions and meeting the criteria in  Article 197(1) (h)   |
|------------------------------------------------------------------------------------------------|----------------------|--------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------|
|                                                                                                |                      | 20-day  liquidation  period (%)                                                                  | 10-day  liquidation  period (%)                                                                  | 5-day  liquidation  period (%)                                                                   | 20-day  liquidation  period (%)                                                                  | 10-day  liquidation  period (%)                                                                           | 5-day  liquidation  period (%)                                                                            | 20-day  liquidation  period (%)                                                                        | 10-day  liquidation  period (%)                                                                        | 5-day  liquidation  period (%)                                                                         |
| 1                                                                                              | ≤ 1 year             | 0,707                                                                                            | 0,5                                                                                              | 0,354                                                                                            | 1,414                                                                                            | 1                                                                                                         | 0,707                                                                                                     | 2,829                                                                                                  | 2                                                                                                      | 1,414                                                                                                  |
|                                                                                                | >1 ≤ 5 years         | 2,828                                                                                            | 2                                                                                                | 1,414                                                                                            | 5,657                                                                                            | 4                                                                                                         | 2,828                                                                                                     | 11,314                                                                                                 | 8                                                                                                      | 5,657                                                                                                  |
|                                                                                                | > 5 years            | 5,657                                                                                            | 4                                                                                                | 2,828                                                                                            | 11,314                                                                                           | 8                                                                                                         | 5,657                                                                                                     | 22,628                                                                                                 | 16                                                                                                     | 11,313                                                                                                 |
| 2-3                                                                                            | ≤ 1 year             | 1,414                                                                                            | 1                                                                                                | 0,707                                                                                            | 2,828                                                                                            | 2                                                                                                         | 1,414                                                                                                     | 5,657                                                                                                  | 4                                                                                                      | 2,828                                                                                                  |
|                                                                                                | >1 ≤ 5 years         | 4,243                                                                                            | 3                                                                                                | 2,121                                                                                            | 8,485                                                                                            | 6                                                                                                         | 4,243                                                                                                     | 16,971                                                                                                 | 12                                                                                                     | 8,485                                                                                                  |
|                                                                                                | > 5 years            | 8,485                                                                                            | 6                                                                                                | 4,243                                                                                            | 16,971                                                                                           | 12                                                                                                        | 8,485                                                                                                     | 33,942                                                                                                 | 24                                                                                                     | 16,970                                                                                                 |
| 4                                                                                              | ≤ 1 year             | 21,213                                                                                           | 15                                                                                               | 10,607                                                                                           | N/A                                                                                              | N/A                                                                                                       | N/A                                                                                                       | N/A                                                                                                    | N/A                                                                                                    | N/A                                                                                                    |
|                                                                                                | >1 ≤ 5 years         | 21,213                                                                                           | 15                                                                                               | 10,607                                                                                           | N/A                                                                                              | N/A                                                                                                       | N/A                                                                                                       | N/A                                                                                                    | N/A                                                                                                    | N/A                                                                                                    |
|                                                                                                | > 5 years            | 21,213                                                                                           | 15                                                                                               | 10,607                                                                                           | N/A                                                                                              | N/A                                                                                                       | N/A                                                                                                       | N/A                                                                                                    | N/A                                                                                                    | N/A                                                                                                    |
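
To check the table beyond eyeballing it, the markdown rows can be parsed back into numbers. The sketch below splits a pipe-delimited row into cells and converts decimal-comma values such as `0,707` to floats. `parse_row` and `to_number` are hypothetical helpers, and the sample row is copied from the table above with the padding removed:

```python
def parse_row(line):
    """Split one markdown table row into stripped cell strings."""
    stripped = line.strip()
    # Drop only the outer pipes so that empty leading cells are preserved
    if stripped.startswith("|"):
        stripped = stripped[1:]
    if stripped.endswith("|"):
        stripped = stripped[:-1]
    return [cell.strip() for cell in stripped.split("|")]

def to_number(cell):
    """Convert a decimal-comma string to float; return None for blanks and 'N/A'."""
    try:
        return float(cell.replace(",", "."))
    except ValueError:
        return None

row = "| 1 | ≤ 1 year | 0,707 | 0,5 | 0,354 | 1,414 | 1 | 0,707 | 2,829 | 2 | 1,414 |"
cells = parse_row(row)
values = [to_number(cell) for cell in cells[2:]]
print(values)
```

In the rows shown, each 20-day figure is roughly √2 times the corresponding 10-day figure, which gives a quick sanity check on the extracted values once they are numeric.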