# 01 - Data Ingestion & Cleaning

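This notebook loads two datasets: Eurostat non-performing loan (NPL) data at the macro level and the German Credit dataset at the micro level. Each loading cell checks whether its input file exists and prints download instructions instead of raising an error when it does not. Place the German Credit CSV at `data/raw/german_credit.csv` before running, or update the path in the loading cell.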

In [None]:
import pandas as pd
from pathlib import Path

DATA_DIR = Path('data')
RAW_DIR = DATA_DIR / 'raw'
PROCESSED_DIR = DATA_DIR / 'processed'
RAW_DIR.mkdir(parents=True, exist_ok=True)
PROCESSED_DIR.mkdir(parents=True, exist_ok=True)

print('Directories ready:', RAW_DIR.resolve(), PROCESSED_DIR.resolve())

## Load German Credit CSV
Download the German Credit dataset from Kaggle or UCI and place it at `data/raw/german_credit.csv`. If your file has a different name, update `german_path` in the cell below.

In [None]:
german_path = RAW_DIR / 'german_credit.csv'
if german_path.exists():
    df = pd.read_csv(german_path)
    print('Loaded German Credit dataset with shape:', df.shape)
else:
    print('File not found:', german_path)
    print('Please download the dataset from Kaggle (https://www.kaggle.com/datasets/uciml/german-credit) or UCI and save as data/raw/german_credit.csv')
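
The check-then-load pattern above is repeated for each dataset in this notebook, so it may be convenient to wrap it in a small helper. The following is a minimal sketch, not part of the original notebook: the function name `load_csv_if_present` and its `hint` parameter are hypothetical, and it assumes a plain comma-separated file that `pd.read_csv` can parse with default settings.

```python
from pathlib import Path
from typing import Optional

import pandas as pd


def load_csv_if_present(path: Path, hint: str = "") -> Optional[pd.DataFrame]:
    """Return the CSV at `path` as a DataFrame, or None if the file is missing.

    Hypothetical helper: prints a message instead of raising, mirroring the
    behaviour of the loading cells in this notebook.
    """
    path = Path(path)
    if not path.exists():
        print("File not found:", path)
        if hint:
            print(hint)  # e.g. where to download the file from
        return None
    frame = pd.read_csv(path)
    print("Loaded", path.name, "with shape:", frame.shape)
    return frame
```

With this in place, the cell above could be reduced to `df = load_csv_if_present(german_path, hint='Download from Kaggle or UCI')`, and later cells could test `df is not None` rather than re-checking the path.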

## Example: Load Eurostat CSV (if you have a direct CSV link)
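Eurostat bulk downloads sometimes require specific endpoints, so the cell below reads a local copy at `data/raw/eurostat_npl.csv`. If you have a direct CSV URL, you can pass it to `pd.read_csv` in place of the local path. Otherwise, download the table manually from Eurostat and save it under that name in `data/raw/`.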

In [None]:
eurostat_local = RAW_DIR / 'eurostat_npl.csv'
if eurostat_local.exists():
    euro = pd.read_csv(eurostat_local)
    print('Loaded Eurostat NPL dataset with shape:', euro.shape)
else:
    print('Eurostat CSV not found locally. You can download relevant NPL tables from Eurostat (e.g. https://ec.europa.eu/eurostat/databrowser).')
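
If you do have a direct CSV link, one option is to download it once into `data/raw/` and read the local copy on later runs. The sketch below illustrates that idea under stated assumptions: `fetch_csv_cached` is a hypothetical helper name, the URL you pass is your own (no specific Eurostat endpoint is assumed here), and the file is assumed to be a plain CSV readable with `pd.read_csv` defaults.

```python
from pathlib import Path
from urllib.request import urlretrieve

import pandas as pd


def fetch_csv_cached(url: str, local_path: Path) -> pd.DataFrame:
    """Download `url` to `local_path` if it is not there yet, then read it.

    Hypothetical helper: the download happens only when the local file is
    missing, so repeated runs reuse the cached copy.
    """
    local_path = Path(local_path)
    if not local_path.exists():
        local_path.parent.mkdir(parents=True, exist_ok=True)
        urlretrieve(url, str(local_path))
    return pd.read_csv(local_path)
```

A call such as `euro = fetch_csv_cached(my_csv_url, RAW_DIR / 'eurostat_npl.csv')` would then leave the file where the cell above expects it. Delete the local file to force a fresh download.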

## Basic cleaning example (German Credit)
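This cell applies a conservative preprocessing step: it strips whitespace from the column names, lowercases them, and replaces spaces with underscores. It then previews the first rows and saves the result to `data/processed/german_credit_clean.csv`. Cell values are left unchanged.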

In [None]:
if german_path.exists():
    # Work on a copy so the raw DataFrame stays untouched
    df_clean = df.copy()
    # Normalize column names: trim, lowercase, spaces -> underscores
    df_clean.columns = [str(c).strip().lower().replace(' ', '_') for c in df_clean.columns]
    display(df_clean.head())
    clean_path = PROCESSED_DIR / 'german_credit_clean.csv'
    df_clean.to_csv(clean_path, index=False)
    print('Saved cleaned German Credit to', clean_path)
else:
    print('No German Credit file to process yet.')
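
The cell above only renames columns. If you also want to trim whitespace inside string values and get a quick per-column overview, something along these lines could work. This is a sketch rather than the notebook's own method: the helper names `normalize_columns`, `strip_string_values`, and `summarize` are hypothetical, and which columns hold text depends on the version of the dataset you downloaded.

```python
import pandas as pd
from pandas.api.types import is_object_dtype, is_string_dtype


def normalize_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with column names trimmed, lowercased, and underscored."""
    out = frame.copy()
    out.columns = [str(c).strip().lower().replace(" ", "_") for c in out.columns]
    return out


def strip_string_values(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with surrounding whitespace removed from string cells."""
    out = frame.copy()
    for col in out.columns:
        # Only touch text-like columns; numeric columns are left as they are
        if is_object_dtype(out[col]) or is_string_dtype(out[col]):
            out[col] = out[col].map(lambda v: v.strip() if isinstance(v, str) else v)
    return out


def summarize(frame: pd.DataFrame) -> pd.DataFrame:
    """Per-column dtype, missing-value count, and number of distinct values."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "n_missing": frame.isna().sum(),
        "n_unique": frame.nunique(),
    })
```

For example, `display(summarize(strip_string_values(normalize_columns(df))))` shows one row per column. Both cleaning helpers return copies, so the raw `df` stays available for comparison.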