## Review docling implementation

Also see: https://ds4sd.github.io/docling/concepts/docling_document/#basic-structure

In [1]:
from docling.document_converter import DocumentConverter
import json
import re

### A. Convert document to json

In [2]:
# Download the PDF and convert it into a Docling document
source = "https://www.eba.europa.eu/sites/default/files/documents/10180/1597103/004d3356-a9dc-49d1-aab1-3591f4d42cbb/Final%20Report%20on%20Guidelines%20on%20default%20definition%20%28EBA-GL-2016-07%29.pdf"
converter = DocumentConverter()
result = converter.convert(source)

# Save the dictionary to a JSON file
with open('eba_dod.json', 'w', encoding='utf-8') as json_file:
    json.dump(result.document.export_to_dict(), json_file, indent=4)

# Write the result to a markdown file
with open('eba_dod.md', 'w', encoding='utf-8') as f:
    f.write(result.document.export_to_markdown())

print("Document saved as JSON and markdown file!")

Document saved as JSON and markdown file!
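
Once the export is on disk, a later session can reload the JSON instead of rerunning the conversion. The sketch below shows that round trip on a small stand-in dictionary; `sample_doc` and the temporary file name are hypothetical, and the real `eba_dod.json` is far larger.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical stand-in for the exported dictionary (not real converter output)
sample_doc = {"texts": [{"label": "section_header", "orig": "1. Executive Summary"}]}

with tempfile.TemporaryDirectory() as tmp_dir:
    json_path = Path(tmp_dir) / "sample.json"

    # Write the dictionary the same way the cell above does
    with open(json_path, "w", encoding="utf-8") as json_file:
        json.dump(sample_doc, json_file, indent=4)

    # Read it back as a plain Python dictionary
    with open(json_path, "r", encoding="utf-8") as json_file:
        reloaded = json.load(json_file)

print(reloaded == sample_doc)
```

For the actual export, opening `eba_dod.json` in the same way should give a dictionary that can be inspected like `doc` in the next section.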


### B. Inspect the contents

In [3]:
doc = result.document.export_to_dict()
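
Before filtering on specific labels, it can help to see which labels occur and how often. The following is a minimal sketch that assumes only the `texts` list and `label` key used in the cells below; `count_labels` and `sample_doc` are hypothetical names introduced for illustration.

```python
from collections import Counter

def count_labels(doc):
    """Count how often each label appears in the 'texts' list of an exported dict."""
    return Counter(text.get("label", "") for text in doc.get("texts", []))

# Hypothetical stand-in for the exported dictionary (not real converter output)
sample_doc = {
    "texts": [
        {"label": "section_header", "orig": "1. Executive Summary"},
        {"label": "list_item", "orig": "1. This document"},
        {"label": "list_item", "orig": "2. Guidelines set"},
    ]
}

label_counts = count_labels(sample_doc)
for label, count in label_counts.most_common():
    print(f"{label} | {count}")
```

Calling `count_labels(doc)` on the real export would show which labels are worth extracting.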

In [4]:
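def extract_headings(doc):
    """Return (level, text, page_no) for each level-1 or level-2 section header."""
    headings = []  # avoids shadowing the converter's `result`
    texts = doc.get('texts', [])

    for text in texts:
        label = text.get('label', '')
        level = text.get('level', None)

        if label == 'section_header' and level in [1, 2]:
            orig = text.get('orig', '')
            prov = text.get('prov', [])
            # Items without provenance have no page number to report
            page_no = prov[0]['page_no'] if prov else None
            headings.append((level, orig, page_no))

    return headings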

# Extract headings
headings = extract_headings(doc)
for level, orig, page_no in headings:
    print(f"Level {level} | {page_no} | {orig} ")

Level 1 | 1 | Final Report 
Level 1 | 2 | Contents 
Level 1 | 3 | 1. Executive Summary 
Level 1 | 3 | Next steps 
Level 1 | 4 | 2. Background and rationale 
Level 1 | 4 | 2.1 Implementation of the changes in the definition of default 
Level 1 | 5 | 2.2 Past due criterion in the identification of default 
Level 1 | 5 | 2.2.1 Counting of days past due 
Level 1 | 5 | 2.2.2 Technical past due situation 
Level 1 | 6 | 2.2.3 Exposures to central governments, local authorities and public sector entities 
Level 1 | 7 | 2.2.4 Factoring and purchased receivables 
Level 1 | 7 | 2.2.5 Materiality threshold 
Level 1 | 7 | 2.3 Indications of unlikeliness to pay 
Level 1 | 7 | 2.3.1 Specific credit risk adjustments (SCRA) 
Level 1 | 9 | 2.3.2 Sale of credit obligations 
Level 1 | 9 | 2.3.3 Distressed restructuring 
Level 1 | 10 | 2.3.4 Bankruptcy 
Level 1 | 10 | 2.3.5 Additional indications of unlikeliness to pay 
Level 1 | 11 | 2.4 Application of the definition of default in external data 
Level 1 |

In [16]:
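def extract_paragraphs(doc):
    """Return (text, page_no) for each list item in the document."""
    paragraphs = []
    texts = doc.get('texts', [])

    for text in texts:
        label = text.get('label', '')

        if label == 'list_item':
            orig = text.get('orig', '')
            prov = text.get('prov', [])
            # Items without provenance have no page number to report
            page_no = prov[0]['page_no'] if prov else None
            paragraphs.append((orig, page_no))

    return paragraphs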

# Extract numbered paragraphs and check that the numbering is consecutive
paragraphs = extract_paragraphs(doc)
expected_number = 0
for orig, page_no in paragraphs:
    match = re.match(r'^(\d+)\.\s*(.*)', orig)
    if match:
        print(f"{page_no} | {match.group(1)} | {orig[:20]} ")
        expected_number += 1
        assert expected_number == int(match.group(1)), "Inconsistent numbering"

18 | 1 | 1. This document con 
18 | 2 | 2. Guidelines set th 
18 | 3 | 3. According to Arti 
18 | 4 | 4. Notifications wil 
19 | 5 | 5. These guidelines  
19 | 6 | 6. These guidelines  
19 | 7 | 7. Institutions that 
19 | 8 | 8. These guidelines  
19 | 9 | 9. Unless otherwise  
20 | 10 | 10. These guidelines 
20 | 11 | 11. In order to appl 
20 | 12 | 12. The changes refe 
20 | 13 | 13. Institutions tha 
21 | 14 | 14. After IRB instit 
21 | 15 | 15. Sections 3.3.2.1 
21 | 16 | 16. For the purposes 
21 | 17 | 17. Where the credit 
21 | 18 | 18. Where the repaym 
22 | 19 | 19. Where the repaym 
22 | 20 | 20. Where the obligo 
22 | 21 | 21. The calculation  
22 | 22 | 22. The classificati 
23 | 23 | 23. A technical past 
23 | 24 | 24. Technical past d 
23 | 25 | 25. Institutions may 
24 | 26 | 26. Institutions tha 
24 | 27 | 27. Where there are  
24 | 28 | 28. Where there are  
25 | 29 | 29. Where the instit 
25 | 30 | 30. Events recognise 
25 | 31 | 31. Where the obligo 
25 | 32 | 32. In 