# Credit Risk Demo - EDA

This notebook does **exploratory data analysis (EDA)** on the same synthetic credit dataset used in `src/train.py`.

We **reuse code** from the `.py` files instead of rewriting it here:

- `load_data()` from `src.train` to get the dataset
- `create_preprocessing_pipeline()` from `src.train` to build the preprocessing pipeline and inspect its output

Run the cells from top to bottom; later cells rely on imports and variables defined in earlier ones.

In [None]:
# Add the project root to sys.path so the src.* modules can be imported.
# This assumes the notebook is run from a folder one level below the project root.
import os
import sys

project_root = os.path.abspath(os.path.join(os.getcwd(), ".."))
if project_root not in sys.path:
    sys.path.append(project_root)

project_root

In [None]:
# Import helper functions from our training code
from src.train import load_data, create_preprocessing_pipeline

# Load the synthetic dataset (same as training)
X, y = load_data()

X.shape, y.shape

In [None]:
# Look at the first few rows to understand the features
X.head()

In [None]:
# Data types: which columns are numeric vs categorical
X.dtypes

In [None]:
# Target distribution: how many good (0) vs bad (1)
import pandas as pd

y.value_counts(), y.value_counts(normalize=True)
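
The counts and proportions above can also be combined into one table, which makes class imbalance easier to read at a glance. The cell below is a minimal sketch on a toy target: `class_balance`, `toy_y`, and the values in it are hypothetical stand-ins, not part of `src.train` or the real dataset.

```python
import pandas as pd


def class_balance(target: pd.Series) -> pd.DataFrame:
    """Return counts and proportions per class in a single table."""
    counts = target.value_counts().sort_index()
    share = target.value_counts(normalize=True).sort_index()
    return pd.DataFrame({"count": counts, "share": share})


# Toy target standing in for y (hypothetical values, not the real dataset)
toy_y = pd.Series([0, 0, 0, 1, 0, 1, 0, 0, 0, 1], name="target")

summary = class_balance(toy_y)

# Ratio of the largest class to the smallest; 1.0 means perfectly balanced
imbalance_ratio = summary["count"].max() / summary["count"].min()

summary
```

Calling `class_balance(y)` on the real target should give the same kind of table, assuming `y` is a pandas Series.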

In [None]:
# Summary statistics for numeric features
# include="number" catches every numeric dtype (int32, float32, ...), not only int64/float64
numeric_cols = X.select_dtypes(include="number").columns
X[numeric_cols].describe()

In [None]:
# Value counts for a few important categorical features
cat_cols = [
    "checking_status",
    "credit_history",
    "purpose",
    "savings_status",
    "employment",
]

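for col in cat_cols:
    if col in X.columns:
        # Name the index after the feature and the values "count" explicitly,
        # so the table looks the same on pandas 1.x and 2.x
        display(X[col].value_counts().rename_axis(col).rename("count").to_frame())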
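
Counts alone do not show how a category relates to the target. One possible next step is the share of bad (1) outcomes per category. The cell below is a sketch on toy data: the `toy` frame, its `target` column, and the category values are made up for illustration and do not come from the real dataset.

```python
import pandas as pd

# Toy data standing in for one categorical feature joined with the target
# (hypothetical values, not the real dataset)
toy = pd.DataFrame({
    "purpose": ["car", "car", "education", "car", "education", "furniture"],
    "target": [0, 1, 1, 0, 1, 0],
})

# Per category: number of rows and mean of the 0/1 target (the bad rate)
bad_rate = (
    toy.groupby("purpose")["target"]
    .agg(count="size", bad_rate="mean")
    .sort_values("bad_rate", ascending=False)
)

bad_rate
```

With the real data, the same pattern could be applied after joining a column of `X` with `y`, for example via `X[[col]].assign(target=y)`. Categories with very few rows give noisy rates, which is why the count is kept next to the rate.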


In [None]:
# Optional: see how preprocessing changes the data shape
preprocessor = create_preprocessing_pipeline(X)
X_trans = preprocessor.fit_transform(X)

X.shape, X_trans.shape  # (rows, original_cols) vs (rows, transformed_cols)
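
To see why the column count can grow after preprocessing, here is a small self-contained sketch. It assumes a pipeline that scales numeric columns and one-hot encodes categorical ones; what `create_preprocessing_pipeline()` actually does is defined in `src/train.py` and may differ. The toy columns and values are hypothetical.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame with two numeric columns and one categorical column
# (hypothetical values, not the real dataset)
toy_X = pd.DataFrame({
    "duration": [6, 48, 12, 42],
    "credit_amount": [1169.0, 5951.0, 2096.0, 7882.0],
    "purpose": ["radio/tv", "radio/tv", "education", "furniture"],
})

numeric = toy_X.select_dtypes(include="number").columns.tolist()
categorical = toy_X.select_dtypes(exclude="number").columns.tolist()

# Assumed preprocessing: scale numeric columns, one-hot encode categorical ones
toy_preprocessor = ColumnTransformer([
    ("num", StandardScaler(), numeric),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])

toy_trans = toy_preprocessor.fit_transform(toy_X)

# Each numeric column stays one column; "purpose" expands to one column
# per distinct category (3 here), so 3 input columns become 2 + 3 = 5
toy_X.shape, toy_trans.shape
```

Under this assumption, the number of transformed columns equals the number of numeric columns plus the total number of distinct categories across the categorical columns.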