# Extracting requirements from the ECB Guide

This notebook converts the ECB Guide from PDF with Docling and transforms the resulting JSON into a
pandas DataFrame. Headings at each level (Level 0 to Level 3) go into separate columns, and each
associated text item becomes a single row. The script also extracts the paragraph number at the start
of each text for indexing and forward-fills it onto rows that lack one.

**Please note that this approach provides a starting point; the output needs to be reviewed manually.**

#### Steps in the Code

##### 1. **Extract Headings and Text**

- The `extract_headings_separate_columns` function parses the input JSON (`doc`) to identify headings  
  and classify them into hierarchical levels (Level 0 to Level 3).
- Associated text is grouped under the appropriate headings.
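
The classification rule can be illustrated in isolation. The following is a minimal sketch, not the notebook's function: `heading_level` is a hypothetical helper that applies the same first-word tests to a single heading string, and the Level 3 example string is invented for illustration.

```python
def heading_level(orig):
    """Return 0-3 for a heading string, or None if it matches no pattern."""
    first_word = orig.split(' ', 1)[0] if orig else ''
    # A first word ending in '.' (e.g. "77.") marks a numbered paragraph, not a heading
    if not first_word or first_word.endswith('.'):
        return None
    if first_word.isdigit():
        return 1                      # "8 Model-related MoC"
    dots = first_word.count('.')
    if dots == 0 and orig[0].isupper():
        return 0                      # "Credit risk"
    if dots == 1:
        return 2                      # "8.1 Relevant regulatory references"
    if dots == 2:
        return 3                      # "8.1.2 ..." (hypothetical example)
    return None


levels = [heading_level(h) for h in [
    "Credit risk",
    "8 Model-related MoC",
    "8.1 Relevant regulatory references",
    "8.1.2 Hypothetical sub-heading",
    "77. Validation policy",
]]
```

Strings that return `None` here are the cases that need manual handling, which is why the notebook keeps an explicit list of exceptions.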


##### 2. **Flatten the Result**

- Headings and text are flattened into single rows to facilitate further transformations.
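
As a rough illustration of the flattening step, the sketch below expands one heading with a list of texts into one row per text. The helper name `flatten` and the sample strings are assumptions made for this example; the notebook's rows carry more fields (page, four levels, label, parent).

```python
def flatten(rows):
    """Expand (heading, [texts]) pairs into one (heading, text) row per text item."""
    flat = []
    for heading, texts in rows:
        if texts:
            flat.extend((heading, text) for text in texts)
        else:
            # Keep headings without text so the hierarchy stays visible
            flat.append((heading, None))
    return flat


# Placeholder data, not taken from the guide
nested = [
    ("8.1 Relevant regulatory references", ["209. First paragraph", "210. Second paragraph"]),
    ("9 Review of estimates", []),
]
flat_rows = flatten(nested)
```

Headings with no text still produce a row with `None` as text; the notebook drops such rows later, before merging.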


##### 3. **Create a Pandas DataFrame**

- The resulting data is stored in a DataFrame with clearly defined columns:  
  **Page**, **Level 0 to Level 3**, **Text**, **Label**, **Parent**, and a concatenated levels column.


##### 4. **Extract Paragraph Numbers**

- A helper function, `extract_number`, extracts numeric paragraph identifiers at the start of the text.  
  Missing numbers are forward-filled for consistency.
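
The extraction and forward-fill can be sketched without pandas. In the example below, `forward_fill` is a hypothetical stand-in for `Series.ffill()`, and the sample texts are shortened placeholders.

```python
import re


def extract_number(text):
    """Return the leading paragraph number ("209. ..." -> 209), or None."""
    if isinstance(text, str):
        match = re.match(r"^(\d+)\.", text)
        return int(match.group(1)) if match else None
    return None


def forward_fill(values):
    """Replace each None with the most recent non-None value (plain-Python ffill)."""
    filled, last = [], None
    for value in values:
        if value is not None:
            last = value
        filled.append(last)
    return filled


texts = ["209. Since ...", "continuation of the paragraph", None, "210. In the ..."]
pars = forward_fill([extract_number(t) for t in texts])
# pars → [209, 209, 209, 210]
```

In the notebook itself the `Par` column shows values such as `209.0`, because pandas stores an integer column with missing values as floats.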


##### 5. **Join Rows That Belong to the Same Paragraph**

- During Docling processing, some paragraphs are split across rows and need to be joined again.
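
A simplified sketch of the joining rules on a plain list of strings is shown below. `join_split_rows` is a hypothetical helper; unlike the notebook code, it does not check that both rows share the same Level 3 heading, and the sample strings are invented.

```python
import re

# Enumeration markers such as (a), (ii), (3), or a leading middle dot
ENUM_PATTERN = re.compile(r"^\((?:[a-z]{1,2}|\d+|i{1,3}|iv|v)\)|^·")


def join_split_rows(texts):
    """Merge continuation fragments and enumeration items into the preceding text."""
    merged = []
    for text in texts:
        current = text.strip()
        if merged and current:
            previous = merged[-1]
            # Rule 1: previous text has no final full stop and the fragment starts lowercase
            if not previous.endswith('.') and current[0].islower():
                merged[-1] = previous + ' ' + current
                continue
            # Rule 2: enumeration items are attached on a new line
            if ENUM_PATTERN.match(current):
                merged[-1] = previous + ' \n' + current
                continue
        merged.append(current)
    return merged


rows = ["213. The following principles", "apply.", "(a) For PD models", "214. Next paragraph."]
joined = join_split_rows(rows)
```

Here the first three strings collapse into one paragraph, while the fourth starts a new one because it begins with a digit and matches neither rule.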

In [1]:
from docling.document_converter import DocumentConverter
import json
import re
import pandas as pd

# Ensure no limitations in Jupyter Notebook output
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)

In [2]:
# Convert the PDF with Docling and export the document as a dictionary
source = "ssm.supervisory_guides_202402.pdf"
converter = DocumentConverter()
result = converter.convert(source)
doc = result.document.export_to_dict()

# Save the dictionary to a JSON file (reuse `doc` instead of exporting a second time)
with open('eba_guide.json', 'w', encoding='utf-8') as json_file:
    json.dump(doc, json_file, indent=4)

# Write the result to a markdown file
with open('eba_guide.md', 'w', encoding='utf-8') as f:
    f.write(result.document.export_to_markdown())

print("Document saved as JSON and markdown file!")

Document saved as JSON and markdown file!


In [3]:
def extract_headings_separate_columns(doc):
    result = []
    texts = doc.get('texts', [])

    # Track the most recent Level 0, Level 1 and Level 2 headings and the row currently being filled
    current_level_0 = None
    current_level_1 = None
    current_level_2 = None
    current_row = None

    for text in texts:
        label = text.get('label', '')
        orig = text.get('orig', '')
        prov = text.get('prov', [])
        page_no = prov[0]['page_no'] if prov else None
        parent = text.get('parent', {}).get('$ref', None)

        exceptions = [
            'Risk differentiation across grades or pools',
            'Homogeneity within grades',
            'Distribution of obligors or facilities across grades or pools',
            'Principles specific for grades and pools',
            'Principles for all model types',
            'Principles specific for direct estimates',
            'For example:',
            '77. Validation policy'
        ]

        if orig in exceptions:
            label = 'text'
            
        if label == 'section_header':
            first_word = orig.split(' ', 1)[0]
            
            # Check if this is a Level 0 heading (no dots and starts with a capital letter)
            if '.' not in orig.split(' ', 1)[0] and orig[0].isupper() and not first_word.endswith('.'):
                current_level_0 = orig
                current_level_1 = None  # Reset Level 1 when a new Level 0 is found
                current_level_2 = None
                current_row = (page_no, orig, None, None, None, [], None, None)

            # Check if this is a Level 1 heading (no dot in numbering)
            elif orig.split(' ', 1)[0].isdigit():
                current_level_1 = orig
                current_level_2 = None  # Reset Level 2 when a new Level 1 is found
                current_row = (page_no, current_level_0, orig, None, None, [], None, None)

            # Check if this is a Level 2 heading (contains one dot in numbering)
            elif orig.split(' ', 1)[0].count('.') == 1 and not first_word.endswith('.'):
                current_level_2 = orig
                current_row = (page_no, current_level_0, current_level_1, orig, None, [], None, None)

            # Check if this is a Level 3 heading (contains two dots in numbering)
            elif orig.split(' ', 1)[0].count('.') == 2 and not first_word.endswith('.'):
                current_row = (page_no, current_level_0, current_level_1, current_level_2, orig, [], None, None)

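            # Append only a newly created row; a header that matched no pattern
            # must not re-append the previous row (which would duplicate its text)
            if current_row and (not result or result[-1] is not current_row):
                result.append(current_row)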

        # Simplification: only list items and plain text are kept; footnotes, tables and pictures are ignored
        elif label in ['list_item', 'text']:
            # Append list items (text under the heading) and label, parent to the "Text" field
            if current_row:
                current_row[-3].append((orig, label, parent))

    # Flatten the "Text" field for multiple rows
    final_result = []
    for row in result:
        page_no, level_0, level_1, level_2, level_3, texts, _, _ = row
        if texts:
            for text, label, parent in texts:
                final_result.append((page_no, level_0, level_1, level_2, level_3, text, label, parent))
        else:
            final_result.append((page_no, level_0, level_1, level_2, level_3, None, None, None))

    return final_result

# Extract headings into separate columns
headings_separated = extract_headings_separate_columns(doc)
df = pd.DataFrame(headings_separated, 
                  columns=["Page", "Level 0", "Level 1", "Level 2", "Level 3", "Text", "Label", "Parent"])
df["Concatenated Levels"] = df[["Level 0", "Level 1", "Level 2", "Level 3"]].apply(
    lambda x: " > ".join(filter(None, x)), axis=1
)

# Extract the number at the beginning of the "Text" column
def extract_number(text):
    if isinstance(text, str):  # Ensure the value is a string
        match = re.match(r"^(\d+)\.", text)
        return int(match.group(1)) if match else None
    return None

df['Par'] = df['Text'].apply(extract_number)

# Fill forward the missing numbers in 'Par'
df['Par'] = df['Par'].ffill()

In [4]:
# Remove rows with None values in Text
df = df[df['Text'].notna()]

# Function to determine if rows should be merged
def should_merge(prev_row, curr_row):
    # Check if Level 3 matches
    if prev_row['Level 3'] != curr_row['Level 3']:
        return False

    # Scenario 1: Merge if the previous text does not end with a full stop and the new row starts with a lowercase letter
    prev_text = prev_row['Text'].strip()
    curr_text = curr_row['Text'].strip()
    if curr_text and not prev_text.endswith('.') and curr_text[0].islower():
        return True

    # Scenario 2: Merge if new row starts with (a), (b), (iv) etc.
    patterns = r"^\((?:[a-z]{1,2}|\d+|i{1,3}|iv|v)\)|^·"
    if re.match(patterns, curr_row['Text'].strip()):   
        curr_row['Text'] = '\n' + curr_row['Text']
        return True

    return False

# Process the DataFrame
merged_rows = []
temp_row = None

for index, row in df.iterrows():
    if temp_row is None:
        temp_row = row.copy()
    else:
        if should_merge(temp_row, row):
            temp_row['Text'] += ' ' + row['Text']
        else:
            merged_rows.append(temp_row)
            temp_row = row.copy()

# Add the last temp_row to the merged rows if not None
if temp_row is not None:
    merged_rows.append(temp_row)

# Create a new DataFrame from merged rows
merged_df = pd.DataFrame(merged_rows)

# Store the dataframe
merged_df.to_excel('ecb_guide.xlsx', index=False)

merged_df[500:530]

Unnamed: 0,Page,Level 0,Level 1,Level 2,Level 3,Text,Label,Parent,Concatenated Levels,Par
1128,142,Credit risk,8 Model-related MoC,8.1 Relevant regulatory references,,"209. Since the MoC requirements laid down by the CRR also apply in cases where institutions estimate CCFs, paragraph 208 is also relevant in such cases.",list_item,#/groups/210,Credit risk > 8 Model-related MoC > 8.1 Relevant regulatory references,209.0
1129,142,Credit risk,8 Model-related MoC,8.1 Relevant regulatory references,,"210. In the understanding of the ECB, to reflect the dispersion of the statistical estimators as set out in paragraph 43(b) of the EBA Guidelines on PD and LGD, institutions should adopt the following approach. \n(a) For PD, estimate an MoC to account for statistical uncertainty/sampling error affecting the LRA estimate at grade/pool level. This MoC should be based on the distribution of the estimator, which is the average of one-year default rates of the grade/pool across time (i.e. the distribution of (ΣDR$_{t}$) T / ), considering that the uncertainty is primarily driven by the statistical uncertainty of each one-year default rate and the length of the time series. As a result, it is expected that the lower the number of observations per grade and the shorter the time series are, the higher the MoC of the grade should be.",list_item,#/groups/211,Credit risk > 8 Model-related MoC > 8.1 Relevant regulatory references,210.0
1131,142,Credit risk,8 Model-related MoC,8.1 Relevant regulatory references,,"Institutions need to be aware of and deal adequately with the dependency between default rates over time on the quantification of the MoC, e.g. when using overlapping windows for the calculation of default rates.",text,#/body,Credit risk > 8 Model-related MoC > 8.1 Relevant regulatory references,210.0
1132,142,Credit risk,8 Model-related MoC,8.1 Relevant regulatory references,,"The above principles also apply for institutions using direct PD estimates and for institutions calibrating the LRA default rate at the level of the calibration segment, as referred to in paragraph 92(b) of the EBA Guidelines on PD and LGD. When using direct PD estimates, the MoC is based on the distribution of this direct PD estimator (which includes the risk differentiation function), implicitly reflecting the uncertainty of the LRA. When calibration is performed at calibration segment level, the general estimation error may be computed at that level when the statistical uncertainty/sampling error is neither significantly different across grades or PD sub-ranges nor significantly different between the calibration segment level and the grades or PD sub-ranges level. \n(b) Similarly, for LGD and CCF, estimate an MoC to account for statistical uncertainty/sampling error affecting the final estimates. This MoC should be defined on the basis of the distribution of the estimators, considering that their uncertainty is primarily driven by the statistical uncertainty of the observations used to compute the long-run and downturn estimates and the length of the time series.",text,#/body,Credit risk > 8 Model-related MoC > 8.1 Relevant regulatory references,210.0
1135,144,Credit risk,9 Review of estimates,9.1 Relevant regulatory references,,"211. Institutions must review their estimates whenever new information comes to light but at least on an annual basis. 97 To comply with this requirement, they are expected to have in place a framework under paragraphs 217 to 221 of the EBA Guidelines on PD and LGD.",list_item,#/groups/213,Credit risk > 9 Review of estimates > 9.1 Relevant regulatory references,211.0
1136,144,Credit risk,9 Review of estimates,9.1 Relevant regulatory references,,"212. Since the review of estimates requirements under the CRR also apply in cases where an institution estimates CCFs, paragraph 211 is also relevant to such cases.",list_item,#/groups/213,Credit risk > 9 Review of estimates > 9.1 Relevant regulatory references,212.0
1137,144,Credit risk,9 Review of estimates,9.1 Relevant regulatory references,,"213. In the ECB's understanding and for the purposes of paragraph 211, the following principles apply. \n(a) For PD models and regarding the analysis of the predictive power envisaged by paragraph 218(c) of the EBA Guidelines on PD and LGD: \n(i) the analysis should be performed at grade level; for institutions using direct PD estimates, it should be performed at a sufficient level of granularity; \n(ii) institutions should use a range of metrics to assess predictive ability, including statistical tests and graphical analysis of the evolution of default rates and PD. \n(b) The analysis referred to in paragraph 218(c)(i) of the EBA Guidelines on PD and LGD should also consider, for CCFs, whether including the most recent data leads to a significant change in the LRA CCF or downturn CCF. \n(c) For LGD models that result from a combination of different components (for example, secured and unsecured components), the back-testing analysis referred to in paragraph 218(c)(ii) of the EBA Guidelines on PD and LGD should be run at both component and facility level. \n(d) In addition, institutions should consider in their frameworks for the review of estimates the availability of data for different exposure types, taking into account the specificities of the model architecture, including the existing and potential risk drivers, under paragraph 220 of the EBA Guidelines on PD and LGD. When data are scarce, they should use complementary analyses for those exposure types where quantitative measures prove inconclusive as a result, for example, of the low number of exposures available. \n(e) Where internal data are not considered sufficient to establish fixed targets and tolerances for defined metrics and tools to assess the performance of the PD model in terms of risk differentiation, institutions should define and put in place the appropriate actions to address this. 
98 These actions could encompass, for example, the use of complementary analyses for those cases where the results for the application of metrics and tools are proven to be inconclusive. \n(f) When external credit bureau scores or ratings are used as the main (or one of the main) driver(s) of the internal rating, in cases where significant changes are applied to the credit bureau scoring institutions should consider the possibility of adjusting their internal data following the changes applied to the score, and whenever the input variables are no longer considered appropriate in their credit rating process.",list_item,#/groups/213,Credit risk > 9 Review of estimates > 9.1 Relevant regulatory references,213.0
1147,144,Credit risk,9 Review of estimates,9.1 Relevant regulatory references,,"214. In the case of material models where the assignment of the grade is based on a statistical model and where there is a risk that slight changes in the ranking of the obligors, or in the boundaries between grades, could lead to significant changes in the RWEA in that portfolio, the framework referred to in paragraph 211 should also include an analysis of whether the inclusion of the most recent data in the RDS used for model development would lead to materially different model outcomes. This analysis should be conducted on a three-yearly basis, or more often, depending on the materiality of the model. The analysis should consider, in particular, whether the discriminatory power of the PD, LGD or CCF models would be materially increased when re-estimating the model parameters on the basis of the updated RDS. Portfolios should be considered as falling into this category when, for example: (i) a limited number of obligors represent an important share of the total exposure; or (ii) exposures are concentrated near the boundaries between two grades.",list_item,#/groups/214,Credit risk > 9 Review of estimates > 9.1 Relevant regulatory references,214.0
1148,144,Credit risk,9 Review of estimates,9.1 Relevant regulatory references,,"215. When the number of default observations is low, to analyse whether the main drivers of the observed defaults are appropriately reflected in the model in accordance with Article 179(1)(a) of the CRR 99 institutions should analyse individual defaults (or at least a sample of them where the number of defaults makes analysing all of them unduly burdensome). However, the model should not be adapted simply to fit singular events from the institution's file review.",list_item,#/groups/214,Credit risk > 9 Review of estimates > 9.1 Relevant regulatory references,215.0
1150,144,Credit risk,9 Review of estimates,9.1 Relevant regulatory references,,"216. In accordance with Article 172(3) of the CRR, for grade and pool assignments institutions must document those situations in which human judgement may override the inputs or outputs of the assignment process. In addition, institutions must complement the statistical model by human judgement and human oversight to review model-based assignments and ensure that the models are used appropriately. 100 Furthermore, review procedures must be designed to find and limit errors associated with model weaknesses. 101 To comply with these requirements, institutions should assess the impact of the application of human judgement on risk differentiation capability (e.g. on discriminatory power), under paragraph 218(b) of the EBA Guidelines on PD and LGD.",list_item,#/groups/215,Credit risk > 9 Review of estimates > 9.1 Relevant regulatory references,216.0
