Title: Dataset Creation SPEED RUN - Live Coding With Python & Pandas

Source: Rob Mulla YouTube Channel

Author (Original Tutorial): Rob Mulla

URL: https://www.youtube.com/watch?v=wiiCUsGgZx0

Date of Implementation: 2024-12-27

Description:
    Adapted implementation of dataset creation from the statistics PDFs of the ECB Economic Bulletin (EZB Wirtschaftsbericht).

# Goal of This Notebook
1. Pull links to the most recent data from the Bundesbank website:
   - https://www.bundesbank.de/de/publikationen/ezb/wirtschaftsberichte
2. Download the PDFs from the website.
3. Extract budget balance (Finanzierungssaldo) and debt (Verschuldung) data for Germany.
4. Save the data in CSV and Parquet format.
5. Display the data in an interactive plot.

In [1]:
import pandas as pd
import numpy as np

from bs4 import BeautifulSoup
import requests
import re
import glob
from pypdf import PdfReader
from tqdm.notebook import tqdm
from os import listdir
from os.path import isfile, join

In [2]:
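def getHTMLDocument(url):
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # fail on HTTP errors instead of parsing an error page
    return response.text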

In [3]:
# Get HTML
html_document = getHTMLDocument('https://www.bundesbank.de/de/publikationen/ezb/wirtschaftsberichte')

# Create soup object
soup = BeautifulSoup(html_document, 'html.parser')

# Collect the target of every link (the 'href' attribute of each <a> tag)
links = []
for link in soup.find_all('a'):
    links.append(link.get('href'))

# Keep only PDF links that point to an 'ezb-wb-data' file
# (str() guards against <a> tags without an href, for which link.get returns None)
pdfs = [str(r) for r in links if str(r).endswith('.pdf') and 'ezb-wb-data' in str(r)]

## Alternative: filter the links using a regex
#pdfs = [pdf for pdf in pdfs if re.search(r'\bezb-wb-data\b', pdf)]

# Remove duplicate entries via a set, then unpack with * to convert back into a list
pdfs = [*set(pdfs)]
full_pdfs = ['https://www.bundesbank.de' + c for c in pdfs]
print(len(full_pdfs), full_pdfs)

10 ['https://www.bundesbank.de/resource/blob/916540/ee016565599e4a58cc48e5eb809bedef/472B63F073F071307366337C94F8C870/2023-06-ezb-wb-data.pdf', 'https://www.bundesbank.de/resource/blob/900010/649275b65c926929b4816c06f8764fcc/472B63F073F071307366337C94F8C870/2024-04-ezb-wb-data.pdf', 'https://www.bundesbank.de/resource/blob/923726/f2f9fccf5e5c0f0cbe278badcf523643/472B63F073F071307366337C94F8C870/2024-01-ezb-wb-data.pdf', 'https://www.bundesbank.de/resource/blob/900008/1004383626794b3e8e3e054e9b6cffdb/472B63F073F071307366337C94F8C870/2024-05-ezb-wb-data.pdf', 'https://www.bundesbank.de/resource/blob/900004/7aa6c9ebde6618aecdefd050b0d73a64/472B63F073F071307366337C94F8C870/2024-07-ezb-wb-data.pdf', 'https://www.bundesbank.de/resource/blob/900006/27e14b08b34ced3121986e5c70f4bc13/472B63F073F071307366337C94F8C870/2024-06-ezb-wb-data.pdf', 'https://www.bundesbank.de/resource/blob/900000/d527d1d2b61980bbe68d8215927fa9e8/472B63F073F071307366337C94F8C870/2023-08-ezb-wb-data.pdf', 'https://www.bun
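
The filtering step above returns the links in arbitrary set order and builds absolute URLs by string concatenation. A minimal sketch of an alternative, using only the standard library: `urljoin` builds the absolute URL, `dict.fromkeys` removes duplicates while keeping order, and sorting by file name gives chronological order because the file names start with `YYYY-MM`. The sample hrefs and the helper name `select_data_pdfs` are hypothetical and only serve the illustration.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.bundesbank.de"

# Hypothetical hrefs for illustration only; the real page returns longer blob paths.
sample_hrefs = [
    "/resource/blob/1/aaa/2024-01-ezb-wb-data.pdf",
    "/resource/blob/2/bbb/2023-06-ezb-wb-data.pdf",
    "/resource/blob/1/aaa/2024-01-ezb-wb-data.pdf",  # duplicate entry
    "/resource/blob/3/ccc/2024-01-ezb-wb.pdf",       # hypothetical non-data PDF
    None,                                            # <a> tag without href
]

def select_data_pdfs(hrefs, base_url=BASE_URL):
    """Return absolute, de-duplicated 'ezb-wb-data' PDF URLs sorted by file name."""
    wanted = [h for h in hrefs if h and h.endswith(".pdf") and "ezb-wb-data" in h]
    unique = list(dict.fromkeys(wanted))  # keeps the first occurrence, preserves order
    absolute = [urljoin(base_url, h) for h in unique]
    # File names start with YYYY-MM, so sorting by file name is chronological
    return sorted(absolute, key=lambda u: u.rsplit("/", 1)[-1])

data_pdf_urls = select_data_pdfs(sample_hrefs)
print(data_pdf_urls)
```

A deterministic order makes repeated runs of the notebook easier to compare, since the download and extraction logs then always list the files in the same sequence.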

In [4]:
# Load binary content from links and store as pdf (use progress bar from tqdm)
total_files = len(full_pdfs)
with tqdm(total=total_files, desc="Downloading PDFs") as pbar:
    for link in full_pdfs:
        fname = link.split('/')[-1].split('.')[0]
        print("Downloading file: ", fname)
    
        # Get response object for link and stop on HTTP errors
        response = requests.get(link)
        response.raise_for_status()

        # Write binary content to a PDF file
        with open(fname + ".pdf", 'wb') as pdf:
            pdf.write(response.content)
        pbar.update(1)

Downloading PDFs:   0%|          | 0/10 [00:00<?, ?it/s]

Downloading file:  2023-06-ezb-wb-data
Downloading file:  2024-04-ezb-wb-data
Downloading file:  2024-01-ezb-wb-data
Downloading file:  2024-05-ezb-wb-data
Downloading file:  2024-07-ezb-wb-data
Downloading file:  2024-06-ezb-wb-data
Downloading file:  2023-08-ezb-wb-data
Downloading file:  2024-02-ezb-wb-data
Downloading file:  2024-03-ezb-wb-data
Downloading file:  2023-07-ezb-wb-data
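
Re-running the download cell fetches every file again. The following is a sketch of one way to skip files that already exist locally, assuming the local name is simply the last path segment of the URL. The helper names `local_name` and `download_if_missing`, the `fetch` parameter, and the example URL are hypothetical; `fake_fetch` stands in for the network call so the example runs offline.

```python
from pathlib import Path
import tempfile

def local_name(url):
    """Derive the local file name from the last path segment of the URL."""
    return url.rsplit("/", 1)[-1]

def download_if_missing(url, dest_dir, fetch):
    """Save url into dest_dir unless the file already exists.

    Returns (path, downloaded), where downloaded is False if the file was already there.
    """
    path = Path(dest_dir) / local_name(url)
    if path.exists():
        return path, False
    path.write_bytes(fetch(url))
    return path, True

# Stand-in for the network call; with requests this could be
#   lambda u: requests.get(u, timeout=30).content
def fake_fetch(url):
    return b"%PDF-1.4 placeholder"

with tempfile.TemporaryDirectory() as tmp:
    url = "https://www.bundesbank.de/resource/blob/0/x/2024-01-ezb-wb-data.pdf"  # hypothetical path
    first = download_if_missing(url, tmp, fake_fetch)
    second = download_if_missing(url, tmp, fake_fetch)
    results = (first[0].name, first[1], second[1])

print(results)
```

Passing the fetch function as a parameter keeps the file handling testable without network access.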


In [5]:
pdf_files = [f for f in glob.glob("./*.pdf") if "ezb-wb-data" in f]
print(pdf_files)

['./2024-03-ezb-wb-data.pdf', './2023-07-ezb-wb-data.pdf', './2024-05-ezb-wb-data.pdf', './2024-07-ezb-wb-data.pdf', './2024-04-ezb-wb-data.pdf', './2023-08-ezb-wb-data.pdf', './2024-06-ezb-wb-data.pdf', './2023-06-ezb-wb-data.pdf', './2024-02-ezb-wb-data.pdf', './2024-01-ezb-wb-data.pdf']


In [55]:
pdf_data = {}
for f in pdf_files:
    print("Extracting raw data from ", f)
    doc = PdfReader(f)
    # Access the second-to-last page, where the statistics of interest are located
    page = doc.pages[-2]
    text = page.extract_text()

    # Remove soft hyphens (U+00AD)
    text = text.replace('\xad', '')

    # Regex: extract the data between the two indicator lines
    pattern = r"EZB, Wirtschaftsbericht.*?\n(.*?)(?=\n6 Entwicklung der öffentlichen Finanzen)"

    # re.S lets '.' match line breaks as well
    match = re.search(pattern, text, re.S)

    if match:
        # Store the captured lines without leading/trailing whitespace
        pdf_data[f] = match.group(1).strip()
    else:
        # Skip the file instead of storing undefined or stale data
        print("No matches in", f)

Extracting raw data from  ./2024-03-ezb-wb-data.pdf
Extracting raw data from  ./2023-07-ezb-wb-data.pdf
Extracting raw data from  ./2024-05-ezb-wb-data.pdf
Extracting raw data from  ./2024-07-ezb-wb-data.pdf
Extracting raw data from  ./2024-04-ezb-wb-data.pdf
Extracting raw data from  ./2023-08-ezb-wb-data.pdf
Extracting raw data from  ./2024-06-ezb-wb-data.pdf
Extracting raw data from  ./2023-06-ezb-wb-data.pdf
Extracting raw data from  ./2024-02-ezb-wb-data.pdf
Extracting raw data from  ./2024-01-ezb-wb-data.pdf
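
The extraction pattern combines a lazy match with a lookahead, which is easier to follow on a small input. A minimal sketch on synthetic text: the header line, the two-country table, and the helper name `extract_block` are made up for illustration, while the pattern itself is the one used in the cell above.

```python
import re

# Synthetic page text; real pages are much longer. '\xad' is a soft hyphen.
sample_text = (
    "Statistik\n"
    "EZB, Wirtschaftsbericht (hypothetical header line)\n"
    "Belgien Deutsch\xadland\n"
    "2019 2,0 1,5\n"
    "6 Entwicklung der öffentlichen Finanzen\n"
    "Fußnote"
)

PATTERN = r"EZB, Wirtschaftsbericht.*?\n(.*?)(?=\n6 Entwicklung der öffentlichen Finanzen)"

def extract_block(text):
    """Return the text between the two marker lines, or None if they are absent."""
    text = text.replace("\xad", "")  # drop soft hyphens before matching
    match = re.search(PATTERN, text, re.S)
    return match.group(1).strip() if match else None

block = extract_block(sample_text)
missing = extract_block("no markers here")
print(repr(block))
print(missing)
```

The first `.*?\n` consumes the rest of the header line, the capturing group lazily collects everything after it, and the lookahead `(?=...)` stops the capture at the footer line without including it. Returning `None` for a page without markers lets the caller decide how to handle the file.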


In [110]:
tab_dict = {}
for key, val in pdf_data.items():
    tab_dict[key] = {}
    # Split once at the 'Lettland ...' header line, which starts the second group of countries
    split = re.split(r'(\nLettland.*?\n)', val, maxsplit=1)
    first_half = split[0]
    second_half = split[1] + split[2]
    halves = [first_half, second_half]
    for idx, half in enumerate(halves):
        tab_dict[key][idx] = {}
        # The capturing group keeps the headings: parts = [header, heading, balance rows, heading, debt rows]
        parts = re.split(r'(\nFinanzierungssaldo.*?\n|\nVerschuldung.*?\n)', half)
        # The headings consumed the surrounding newlines, so re-insert one after the header
        balance_tab = parts[0] + '\n' + parts[2]
        debt_tab = parts[0] + '\n' + parts[4]
        tab_dict[key][idx]["balance"] = balance_tab
        tab_dict[key][idx]["debt"] = debt_tab

In [111]:
tab_dict["./2024-03-ezb-wb-data.pdf"][0]["balance"]

'Belgien Deutschland Estland Irland Griechenland Spanien Frankreich Kroatien Italien Zypern\n1 2 3 4 5 6 7 8 9 10\n2019 2,0 1,5 0,1 0,5 0,9 3,1 3,1 0,2 1,5 0,9\n2020 8,9 4,3 5,4 5,0 9,7 10,1 9,0 7,3 9,6 5,7\n2021 5,4 3,6 2,5 1,5 7,0 6,7 6,5 2,5 8,8 1,9\n2022 3,5 2,5 1,0 1,7 2,4 4,7 4,8 0,1 8,0 2,4\n2022 Q4 3,5 2,5 1,0 1,7 2,4 4,7 4,8 0,1 8,0 2,4\n2023 Q1 3,9 3,0 1,3 2,0 2,5 4,4 4,6 0,2 8,1 3,0\nQ2 4,0 3,1 1,7 2,2 2,4 4,6 4,9 0,4 7,9 3,4\nQ3 4,1 2,7 2,2 1,9 1,2 4,4 4,8 0,3 6,8 3,2'
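
The table split relies on `re.split` keeping captured delimiters in its result, so the returned pieces alternate between content and heading. A minimal sketch on a shortened two-country excerpt in the same layout (the excerpt is an illustration, not a full table):

```python
import re

# Shortened excerpt in the same layout as the extracted text
half = (
    "Belgien Deutschland\n"
    "1 2\n"
    "Finanzierungssaldo\n"
    "2019 2,0 1,5\n"
    "2020 8,9 4,3\n"
    "Verschuldung\n"
    "2019 97,6 59,6\n"
    "2020 111,8 68,8"
)

parts = re.split(r"(\nFinanzierungssaldo.*?\n|\nVerschuldung.*?\n)", half)
# parts[0]: header, parts[1]: balance heading, parts[2]: balance rows,
# parts[3]: debt heading, parts[4]: debt rows
header, balance_rows, debt_rows = parts[0], parts[2], parts[4]

print(len(parts))
print(repr(parts[1]))
print(balance_rows.split("\n"))
```

Because each heading pattern includes the newline before and after it, neither the header nor the row blocks end or start with a line break; any code that joins them again has to supply the separator itself.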

In [41]:
df_data = {}
for key, val in pdf_data.items():
    print(key)
    # NOTE: val[1] expects each pdf_data value to be a sequence whose second element
    # holds the text printed below; for a plain string it would be a single character.
    print("Before 1:", val[1])
    # Step 1: Split data into sections
    # re.split cuts the text once (maxsplit=1) at the 'Lettland ...' line; re.S (dot-all) lets '.' match line breaks
    sections = re.split(r'\n(Lettland.*?)\n', val[1], maxsplit=1, flags=re.S)
    table_data, countries = sections[0], sections[1]
    print("After 1:", table_data, countries)
    # Step 2: Get countries
    # NOTE: these names are attached to the rows that precede the 'Lettland ...' line;
    # verify that the rows and the names belong to the same country group.
    countries = countries.split('\n')[0].split()
    print("After 2:", countries)
    # Step 3: Process table rows
    rows = table_data.split('\n')
    data = []
    for row in rows:
        values = row.split()
        data.append(values)

    # Handle quarter labels correctly
    cleaned_data = []

    for row in data:
        if len(row) == len(countries) + 2:  # rows labelled with year and quarter, e.g. '2022 Q4'
            cleaned_data.append([f"{row[0]} {row[1]}"] + row[2:])  # combine year and quarter
        elif len(row) == len(countries) + 1:  # rows with a single label, e.g. '2019' or 'Q2'
            cleaned_data.append([row[0]] + row[1:])
        else:
            print(f"Invalid row removed: {row}")

    # Create DataFrame
    df = pd.DataFrame(cleaned_data, columns=['Jahr'] + countries)

    # Convert values to float (decimal comma to decimal point)
    for col in df.columns[1:]:  # skip the 'Jahr' column
        df[col] = df[col].str.replace(',', '.').astype(float)
    df_data[key] = df

./2024-03-ezb-wb-data.pdf
Before 1: 2019 97,6 59,6 8,5 57,1 180,6 98,2 97,4 70,9 134,2 93,0
2020 111,8 68,8 18,6 58,1 207,0 120,3 114,6 86,8 154,9 114,9
2021 108,0 69,0 17,8 54,4 195,0 116,8 112,9 78,1 147,1 99,3
2022 104,3 66,1 18,5 44,4 172,6 111,6 111,8 68,2 141,7 85,6
2022 Q4 104,3 66,1 18,5 44,4 172,6 111,6 111,8 68,2 141,7 85,6
2023 Q1 106,4 65,7 17,2 43,6 169,3 111,2 112,3 69,1 140,9 83,1
Q2 105,9 64,7 18,5 43,2 167,1 111,2 111,8 66,5 142,5 85,1
Q3 108,0 64,8 18,2 43,6 165,5 109,8 111,9 64,4 140,6 79,4
Lettland Litauen Luxemburg Malta Niederlande Österreich Portugal Slowenien Slowakei Finnland
11 12 13 14 15 16 17 18 19 20
Finanzierungssaldo
2019 0,5 0,5 2,2 0,5 1,8 0,6 0,1 0,7 1,2 0,9
2020 4,5 6,5 3,4 9,6 3,7 8,0 5,8 7,6 5,4 5,6
2021 7,2 1,1 0,6 7,5 2,2 5,8 2,9 4,6 5,2 2,8
2022 4,6 0,7 0,3 5,7 0,1 3,5 0,3 3,0 2,0 0,8
2022 Q4 4,6 0,7 0,3 5,6 0,1 3,5 0,3 3,0 2,0 0,5
2023 Q1 4,4 1,2 0,6 4,8 0,1 3,3 0,1 3,2 2,6 0,4
Q2 3,0 1,2 0,7 4,2 0,2 3,6 0,0 3,2 3,4 1,1
Q3 3,3 1,1 0,4 3,4 0,1 3
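
Rows such as `Q2 ...` carry no year of their own, so their label is ambiguous once the rows are separated from their neighbours. The following is a sketch of one possible way to carry the year forward while parsing, written in plain Python; the function name `parse_rows` and the two-column sample are hypothetical, and the approach assumes that every data row ends with exactly `n_values` numbers.

```python
def parse_rows(block, n_values):
    """Parse table rows into (period, [floats]).

    Rows labelled only 'Q2', 'Q3', ... reuse the year of the preceding row.
    Lines that do not consist of one or two labels plus n_values numbers are skipped.
    """
    parsed = []
    current_year = None
    for line in block.split("\n"):
        tokens = line.split()
        labels, values = tokens[:-n_values], tokens[-n_values:]
        if len(values) != n_values or not 1 <= len(labels) <= 2:
            continue  # header line, heading, or otherwise not a data row
        if labels[0].startswith("Q"):
            # Quarter-only row: reuse the year seen most recently
            period = f"{current_year} {labels[0]}"
        else:
            current_year = labels[0]
            period = " ".join(labels)
        parsed.append((period, [float(v.replace(",", ".")) for v in values]))
    return parsed

# Two-column sample in the layout of the printed text
sample = "2022 Q4 104,3 66,1\n2023 Q1 106,4 65,7\nQ2 105,9 64,7\nQ3 108,0 64,8"
rows = parse_rows(sample, n_values=2)
periods = [period for period, _ in rows]
print(periods)
```

The resulting list can be passed to `pd.DataFrame` in the same way as `cleaned_data`. A block that starts with a quarter-only row would get the label `None Q2`, so the first row of each block should carry a year.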

In [37]:
keys = list(df_data.keys())
vals = list(df_data.values())
keys[0]
vals[1]

      Jahr  Lettland  Litauen  Luxemburg  Malta  Niederlande  Österreich  Portugal  Slowenien  Slowakei  Finnland
0     2019      97.6     59.6        8.5   57.1        180.6        98.2      97.4       70.9     134.2      93.0
1     2020     111.8     68.8       18.6   58.1        207.0       120.3     114.6       86.8     154.9     114.9
2     2021     108.0     69.0       17.8   54.4        195.0       116.8     112.9       78.1     147.1      99.3
3     2022     104.3     66.1       18.5   44.4        172.6       111.6     111.8       68.2     141.7      85.6
4  2022 Q3     105.6     66.8       15.9   48.5        175.9       114.0     113.5       69.8     143.1      89.7
5       Q4     104.3     66.1       18.5   44.4        171.4       111.6     111.8       68.2     141.7      85.6
6  2023 Q1     106.4     65.7       17.2   43.6        168.6       111.2     112.4       69.1     140.9      83.1
7       Q2     106.0     64.6       18.5   43.1        166.5       111.2     111.9       66.5     142.4      85.3
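
Goal 4 of the notebook is to save the result as CSV and Parquet. A minimal sketch of that step, assuming a small stand-in frame with illustrative values rather than the full `df_data` tables; the Parquet file name is hypothetical. The CSV is written to an in-memory buffer here so the round trip can be checked without touching the disk.

```python
import io
import os
import tempfile

import pandas as pd

# Small stand-in frame with illustrative values
df = pd.DataFrame({"Jahr": ["2019", "2020"], "Deutschland": [59.6, 68.8]})

# CSV: index=False avoids writing the row index as an extra unnamed column
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer, dtype={"Jahr": str})  # keep period labels as text
print(restored)

# Parquet needs an optional engine (pyarrow or fastparquet), so guard the call
with tempfile.TemporaryDirectory() as tmp:
    parquet_path = os.path.join(tmp, "debt.parquet")  # hypothetical file name
    try:
        df.to_parquet(parquet_path, index=False)
        print("Parquet written")
    except ImportError:
        print("No Parquet engine installed; skipping Parquet export")
```

Reading `Jahr` back as a string matters because labels such as `2022 Q3` are not numbers, and a column that happens to contain only plain years would otherwise be parsed as integers.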
