Title: Dataset Creation SPEED RUN - Live Coding With Python & Pandas

Source: Rob Mulla YouTube Channel

Author (Original Tutorial): Rob Mulla

URL: https://www.youtube.com/watch?v=wiiCUsGgZx0

Date of Implementation: 2024-12-27

Description:
    Adapted implementation of the dataset-creation tutorial, applied to the ECB economic bulletins (Wirtschaftsberichte) published on the Bundesbank website.

# Goal of This Notebook
1. Pull links to the most recent data from the Bundesbank website:
   - https://www.bundesbank.de/de/publikationen/ezb/wirtschaftsberichte
2. Download the PDFs from the website.
3. Extract financial balance (Finanzierungssaldo) and debt (Verschuldung) data for Germany.
4. Save the result in CSV and Parquet format.
5. Display the data in an interactive plot.

In [17]:
import pandas as pd
import numpy as np

from bs4 import BeautifulSoup
import requests
import re 
import glob
from pypdf import PdfReader
from tqdm.notebook  import tqdm
from os import listdir
from os.path import isfile, join

In [18]:
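def getHTMLDocument(url):
    # Fail early on HTTP errors and do not hang forever on a stalled connection
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.text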

In [19]:
# Get HTML
html_document = getHTMLDocument('https://www.bundesbank.de/de/publikationen/ezb/wirtschaftsberichte')

# Create soup object
soup = BeautifulSoup(html_document, 'html.parser')

# Collect the 'href' attribute of every <a> tag (None if the tag has none)
links = []
for link in soup.find_all('a'):
    links.append(link.get('href'))

# Keep only PDF links that point to the 'ezb-wb-data' files
pdfs = [r for r in links if str(r).endswith('.pdf') and 'ezb-wb-data' in r]

## Alternative: filter the links using a regex
#pdfs = [pdf for pdf in pdfs if re.search(r'\bezb-wb-data\b', pdf)]

# Remove duplicates via a set, then unpack with * to get a list again (order is arbitrary)
pdfs = [*set(pdfs)]
# Turn the site-relative paths into absolute URLs
full_pdfs = ['https://www.bundesbank.de' + c for c in pdfs]
print(len(full_pdfs), full_pdfs)

10 ['https://www.bundesbank.de/resource/blob/900000/d527d1d2b61980bbe68d8215927fa9e8/472B63F073F071307366337C94F8C870/2023-08-ezb-wb-data.pdf', 'https://www.bundesbank.de/resource/blob/900010/649275b65c926929b4816c06f8764fcc/472B63F073F071307366337C94F8C870/2024-04-ezb-wb-data.pdf', 'https://www.bundesbank.de/resource/blob/916540/ee016565599e4a58cc48e5eb809bedef/472B63F073F071307366337C94F8C870/2023-06-ezb-wb-data.pdf', 'https://www.bundesbank.de/resource/blob/900004/7aa6c9ebde6618aecdefd050b0d73a64/472B63F073F071307366337C94F8C870/2024-07-ezb-wb-data.pdf', 'https://www.bundesbank.de/resource/blob/900008/1004383626794b3e8e3e054e9b6cffdb/472B63F073F071307366337C94F8C870/2024-05-ezb-wb-data.pdf', 'https://www.bundesbank.de/resource/blob/923726/f2f9fccf5e5c0f0cbe278badcf523643/472B63F073F071307366337C94F8C870/2024-01-ezb-wb-data.pdf', 'https://www.bundesbank.de/resource/blob/918322/b56400c60a8db7d0f9f030f1de5157ce/472B63F073F071307366337C94F8C870/2023-07-ezb-wb-data.pdf', 'https://www.bun
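The filtering step above can also be written as one small, testable function. The following is a minimal sketch, not the notebook's method: the function name `filter_report_links`, the constant names and the sample hrefs are made up for illustration, and the pattern assumes that every wanted file name looks like `YYYY-NN-ezb-wb-data.pdf`, as in the printed list. Sorting makes the result reproducible, which a plain `set` does not guarantee.

```python
import re
from urllib.parse import urljoin

BASE_URL = "https://www.bundesbank.de"  # site root, as used in the cell above
# Assumption: wanted files end in e.g. "2024-04-ezb-wb-data.pdf"
REPORT_PATTERN = re.compile(r"\d{4}-\d{2}-ezb-wb-data\.pdf$")


def filter_report_links(hrefs, base_url=BASE_URL):
    """Return sorted, de-duplicated absolute URLs of the wanted PDFs."""
    wanted = {
        urljoin(base_url, href)
        for href in hrefs
        if href and REPORT_PATTERN.search(href)  # skips None and other links
    }
    # Sort by file name so the order follows the YYYY-NN prefix
    return sorted(wanted, key=lambda url: url.rsplit("/", 1)[-1])


# Hypothetical hrefs, shaped like the ones scraped above
sample_hrefs = [
    "/resource/blob/1/aaa/2024-04-ezb-wb-data.pdf",
    "/resource/blob/2/bbb/2023-08-ezb-wb-data.pdf",
    "/resource/blob/1/aaa/2024-04-ezb-wb-data.pdf",  # duplicate entry
    "/resource/blob/3/ccc/some-other-file.pdf",      # PDF that is not wanted
    None,                                            # <a> tag without href
]
report_urls = filter_report_links(sample_hrefs)
print(report_urls)
```

Passing the scraped `links` list instead of `sample_hrefs` should give the same URLs as `full_pdfs`, only in sorted order, provided the file-name assumption holds.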

In [20]:
# Load binary content from links and store as pdf (use progress bar from tqdm)
total_files = len(full_pdfs)
with tqdm(total=total_files, desc="Downloading PDFs") as pbar:
    for link in full_pdfs:
        # File name without extension, e.g. '2023-08-ezb-wb-data'
        fname = link.split('/')[-1].split('.')[0]
        print("Downloading file: ", fname)

        # Get response object for link and stop on HTTP errors
        response = requests.get(link, timeout=60)
        response.raise_for_status()

        # Write binary content to a PDF file in the working directory
        with open(fname + ".pdf", 'wb') as pdf:
            pdf.write(response.content)
        pbar.update(1)

Downloading PDFs:   0%|          | 0/10 [00:00<?, ?it/s]

Downloading file:  2023-08-ezb-wb-data
Downloading file:  2024-04-ezb-wb-data
Downloading file:  2023-06-ezb-wb-data
Downloading file:  2024-07-ezb-wb-data
Downloading file:  2024-05-ezb-wb-data
Downloading file:  2024-01-ezb-wb-data
Downloading file:  2023-07-ezb-wb-data
Downloading file:  2024-06-ezb-wb-data
Downloading file:  2024-03-ezb-wb-data
Downloading file:  2024-02-ezb-wb-data
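The file names printed above start with a `YYYY-NN` prefix, which is useful later for ordering the reports or labelling rows by source file. Below is a small sketch for reading that prefix; the helper name `report_period` is hypothetical, and treating the two numbers as year and running number within the year is an assumption based only on the names shown here.

```python
import re
from pathlib import Path


def report_period(path):
    """Extract (year, number) from a name like '2023-08-ezb-wb-data.pdf'."""
    match = re.match(r"(\d{4})-(\d{2})-ezb-wb-data", Path(path).name)
    if match is None:
        raise ValueError(f"Unexpected file name: {path}")
    return int(match.group(1)), int(match.group(2))


# File names as they appear in the download log above
names = [
    "./2024-03-ezb-wb-data.pdf",
    "./2023-07-ezb-wb-data.pdf",
    "./2024-01-ezb-wb-data.pdf",
]
# Tuples compare element by element, so this sorts by year, then number
ordered = sorted(names, key=report_period)
print(ordered)
```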


In [22]:
# Get all downloaded PDF files
pdf_files = [f for f in glob.glob("./*.pdf") if "ezb-wb-data" in f]
# Regex: capture everything between the two indicator lines
pattern = r"EZB, Wirtschaftsbericht.*?\n(.*?)(?=\n6 Entwicklung der öffentlichen Finanzen)"
pdf_data = {}
for f in pdf_files:
    print("Extracting raw data from ", f)
    doc = PdfReader(f)
    # Access the second-to-last page, where the stats of interest are located
    page = doc.pages[-2]
    text = page.extract_text()
    # Remove soft hyphens (U+00AD) left over from the PDF text extraction
    text = text.replace('\xad', '')
    # re.S lets '.' match line breaks, so the capture can span several lines
    match = re.search(pattern, text, re.S)
    if not match:
        # Skip this file instead of storing stale data from a previous iteration
        print("No matches!")
        continue
    pdf_data[f] = match.group(1).strip()  # strip leading/trailing whitespace

Extracting raw data from  ./2024-03-ezb-wb-data.pdf
Extracting raw data from  ./2023-07-ezb-wb-data.pdf
Extracting raw data from  ./2024-05-ezb-wb-data.pdf
Extracting raw data from  ./2024-07-ezb-wb-data.pdf
Extracting raw data from  ./2024-04-ezb-wb-data.pdf
Extracting raw data from  ./2023-08-ezb-wb-data.pdf
Extracting raw data from  ./2024-06-ezb-wb-data.pdf
Extracting raw data from  ./2023-06-ezb-wb-data.pdf
Extracting raw data from  ./2024-02-ezb-wb-data.pdf
Extracting raw data from  ./2024-01-ezb-wb-data.pdf
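To see what the extraction pattern does without opening a PDF, it can be tried on a short string. This is a sketch under stated assumptions: the helper name `extract_table_text` is hypothetical and `sample_page` is made-up text that only mimics the layout the pattern expects (a line starting with "EZB, Wirtschaftsbericht", the table, then the line "6 Entwicklung der öffentlichen Finanzen").

```python
import re

# Same pattern as in the extraction cell above
PATTERN = re.compile(
    r"EZB, Wirtschaftsbericht.*?\n(.*?)(?=\n6 Entwicklung der öffentlichen Finanzen)",
    re.S,  # '.' also matches line breaks
)


def extract_table_text(page_text):
    """Return the text between the two marker lines, or None if absent."""
    cleaned = page_text.replace("\xad", "")  # drop soft hyphens
    match = PATTERN.search(cleaned)
    return match.group(1).strip() if match else None


# Made-up page text, not taken from a real report
sample_page = (
    "S 27\n"
    "EZB, Wirtschaftsbericht, Ausgabe X / 2024\n"
    "Belgien Deutsch\xadland\n"
    "2019 2,0 1,5\n"
    "6 Entwicklung der öffentlichen Finanzen\n"
)
table_text = extract_table_text(sample_page)
print(table_text)
```

The lazy `.*?\n` consumes the rest of the first marker line, the capture group collects the lines in between, and the lookahead stops the match just before the closing marker without including it.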


In [24]:
def fix_double_year_line(data: dict, key: str, sub_id: int, typ: str) -> None:
    raw_str = data[key][sub_id][typ]
    lines = raw_str.split("\n")  # split string at line breaks "\n"
    lines = [item for item in lines if item]  # remove empty strings ''
    header = lines[0]  # countries
    rows = lines[1:]  # remaining data ('rows' avoids shadowing the 'data' argument)
    # The first row after the header is two rows fused together without a
    # line break; split it into two lines
    left_line = []
    right_line = []
    line_elements = rows[0].split()  # get line of interest
    # iterate over the elements and locate the missing break at the first
    # element longer than two characters that contains "20" (start of a year)
    for el in line_elements:
        if len(right_line) > 0:
            right_line.append(el)
        elif len(el) > 2 and "20" in el:
            idx = el.find("20")
            left_line.append(el[:idx])
            right_line.append(el[idx:])
        else:
            left_line.append(el)
    # merge header, the two repaired lines and the untouched remaining rows
    corrected_data = [header, " ".join(left_line), " ".join(right_line)]
    corrected_data.append("\n".join(rows[1:]))
    # overwrite the specific dict value in place
    data[key][sub_id][typ] = "\n".join(corrected_data)

# Establish dict with subtables in order to merge them into one df
tab_dict = {}
for key, val in pdf_data.items():
    tab_dict[key] = {}
    # Split at the header line that starts with 'Lettland' (second group of
    # countries); the capturing group keeps that line in the result
    split = re.split(r'(\nLettland.*?\n)', val, maxsplit=1)
    first_half = split[0]
    second_half = split[1] + split[2]
    halves = [first_half, second_half]
    for half_id, half in enumerate(halves):
        tab_dict[key][half_id] = {}
        # parts = [header, balance title, balance rows, debt title, debt rows]
        parts = re.split(r'(\nFinanzierungssaldo.*?\n|\nVerschuldung.*?\n)', half)
        balance_tab = parts[0] + parts[2]
        debt_tab = parts[0] + parts[4]
        tab_dict[key][half_id]["balance"] = balance_tab
        tab_dict[key][half_id]["debt"] = debt_tab

# iterate through the PDFs
for key, val in tab_dict.items():
    # iterate through the first and second half of the table
    for sub_id in [0, 1]:
        # iterate through both content types
        for typ in ["balance", "debt"]:
            fix_double_year_line(tab_dict, key, sub_id, typ)
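The index arithmetic in the split step follows from how `re.split` treats a capturing group: the matched separators are returned as list items of their own, so the text pieces sit at the even indices. The sketch below shows this on made-up text (the titles after the marker words and all numbers are invented). It also illustrates one possible reason for the fused row: the separator swallows the line break, so concatenating `parts[0]` and `parts[2]` directly glues the last header line to the first data line. Joining with an explicit `"\n"` is an alternative, offered as an assumption about the cause rather than as the notebook's approach.

```python
import re

# Made-up half-table; only the marker words match the real pages
half = (
    "Belgien Deutschland\n1 2\n"
    "Finanzierungssaldo (Beispieltitel)\n"
    "2019 2,0 1,5\n"
    "Verschuldung (Beispieltitel)\n"
    "2019 97,6 59,6\n"
)
parts = re.split(r"(\nFinanzierungssaldo.*?\n|\nVerschuldung.*?\n)", half)
# parts -> [header, separator, balance rows, separator, debt rows]
header, balance_rows, debt_rows = parts[0], parts[2], parts[4]

# Direct concatenation fuses "1 2" and "2019 ..." into one line
fused = header + balance_rows

# Joining with a line break keeps the rows apart
balance_tab = header + "\n" + balance_rows.strip()
debt_tab = header + "\n" + debt_rows.strip()
print(balance_tab)
```

With tables built this way, the repair function would no longer be needed in its current form, so the two approaches should not be mixed.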
    

In [29]:
print(tab_dict["./2024-03-ezb-wb-data.pdf"][0]["balance"])

Belgien Deutschland Estland Irland Griechenland Spanien Frankreich Kroatien Italien Zypern
1 2 3 4 5 6 7 8 9 10
2019 2,0 1,5 0,1 0,5 0,9 3,1 3,1 0,2 1,5 0,9
2020 42,2 46,2 24,6 52,2 54,7 83,0 134,9 79,6 58,9 74,7
2021 44,0 43,4 24,5 54,0 51,7 82,5 124,5 74,4 61,1 72,5
2022 41,0 38,1 24,7 52,3 50,1 78,4 112,4 72,3 57,8 73,3
2022 Q4 41,0 38,1 24,7 51,6 50,1 78,4 112,4 72,3 57,8 73,3
2023 Q1 43,0 38,1 28,3 51,5 48,3 80,2 112,3 72,0 58,0 73,3
Q2 39,5 38,1 28,2 49,6 46,9 78,5 110,0 70,4 59,6 74,5
Q3 41,4 37,4 25,7 49,3 45,9 78,2 107,5 71,4 58,6 73,8
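A cleaned table like the one printed above still has to be turned into numbers before it can be saved or plotted. The following is a minimal sketch of that step using only the standard library; the names `to_float` and `parse_table` are hypothetical, the sample rows are copied from the output above, and the code assumes that country names contain no spaces and that rows without a year (such as `Q2`) belong to the last year seen.

```python
import re


def to_float(token):
    """Convert a German-formatted number such as '42,2' to a float."""
    # U+2212 is a typographic minus that PDFs sometimes use (precaution only)
    return float(token.replace("\u2212", "-").replace(",", "."))


def parse_table(text):
    """Turn the cleaned table text into a list of row dictionaries."""
    lines = [ln.split() for ln in text.splitlines() if ln.strip()]
    countries = lines[0]
    n = len(countries)
    records = []
    current_year = None
    for tokens in lines[2:]:  # skip header and column-number row
        label, values = tokens[:-n], tokens[-n:]
        quarter = None
        for part in label:
            if re.fullmatch(r"\d{4}", part):
                current_year = int(part)  # remember year for later rows
            elif re.fullmatch(r"Q[1-4]", part):
                quarter = part
        record = {"year": current_year, "quarter": quarter}
        record.update(zip(countries, map(to_float, values)))
        records.append(record)
    return records


sample = """Belgien Deutschland Estland Irland Griechenland Spanien Frankreich Kroatien Italien Zypern
1 2 3 4 5 6 7 8 9 10
2022 41,0 38,1 24,7 52,3 50,1 78,4 112,4 72,3 57,8 73,3
2023 Q1 43,0 38,1 28,3 51,5 48,3 80,2 112,3 72,0 58,0 73,3
Q2 39,5 38,1 28,2 49,6 46,9 78,5 110,0 70,4 59,6 74,5"""
records = parse_table(sample)
print(records[2]["year"], records[2]["quarter"], records[2]["Deutschland"])
```

The resulting list of dictionaries can be passed to `pd.DataFrame(records)` to continue with pandas.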
