# Credit Risk Modelling – Feature Engineering

## Objective
This notebook prepares the dataset for Probability of Default (PD) modeling by:
- Handling missing values appropriately
- Encoding categorical variables
- Creating a modeling-ready feature matrix

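The transformations follow common credit risk practice: missingness is kept as an explicit category, categorical variables are one-hot encoded against a reference level, and numerical variables are standardized.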

In [2]:
import pandas as pd
import numpy as np

pd.set_option("display.max_columns", None)
pd.set_option("display.float_format", "{:.2f}".format)

In [3]:
df = pd.read_csv("../data/german_credit_data.csv")

# Drop the unnamed index column stored in the CSV; it carries no predictive information.
df.drop(columns=["Unnamed: 0"], inplace=True)

df.head()

,Age,Sex,Job,Housing,Saving accounts,Checking account,Credit amount,Duration,Purpose,Risk
0,67,male,2,own,,little,1169,6,radio/TV,good
1,22,female,2,own,little,moderate,5951,48,radio/TV,bad
2,49,male,1,own,little,,2096,12,education,good
3,45,male,2,free,little,little,7882,42,furniture/equipment,good
4,53,male,2,free,little,little,4870,24,car,bad


## Target Variable Encoding

For PD modeling:
- 1 → Default (bad risk)
- 0 → Non-default (good risk)

In [4]:
df["Risk"] = df["Risk"].map({"good": 0, "bad": 1})

df["Risk"].value_counts()

Risk
0    700
1    300
Name: count, dtype: int64
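
The counts above correspond to a 30% observed default rate (300 of 1,000). As a minimal sketch on a toy Series (not the notebook's `df`), the mapping can be guarded against unexpected labels, which `Series.map` would otherwise turn into NaN without raising, and the default rate read off as the mean of the 0/1 target:

```python
import pandas as pd

# Toy stand-in for df["Risk"]; it keeps the same 70/30 split as the real column.
risk = pd.Series(["good"] * 7 + ["bad"] * 3, name="Risk")

encoded = risk.map({"good": 0, "bad": 1})

# Series.map returns NaN for any label missing from the dictionary,
# so a null check catches typos or unexpected categories early.
assert encoded.notna().all(), "unmapped Risk labels found"

# With a 0/1 target, the mean is the observed default rate.
default_rate = encoded.mean()
print(f"Default rate: {default_rate:.2f}")
```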

## Missing Value Handling

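Missing values in `Saving accounts` and `Checking account` are kept as a separate `missing` category rather than imputed or dropped, because the absence of an account record may itself carry risk signal.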

In [5]:
df["Saving accounts"] = df["Saving accounts"].fillna("missing")
df["Checking account"] = df["Checking account"].fillna("missing")
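
One way to check whether the `missing` level actually carries signal is to compare default rates across levels. The sketch below uses a small made-up frame (the values are hypothetical, not taken from the German credit file) and assumes the target is already encoded as 0/1:

```python
import numpy as np
import pandas as pd

# Toy data with hypothetical values, for illustration only.
toy = pd.DataFrame({
    "Checking account": ["little", np.nan, "moderate", np.nan, "little", np.nan],
    "Risk": [1, 0, 1, 0, 0, 0],
})

# Keep missingness as an explicit level instead of imputing or dropping rows.
toy["Checking account"] = toy["Checking account"].fillna("missing")

# Default rate and row count per level; a level whose rate differs clearly
# from the others is carrying signal that imputation would have blurred.
summary = toy.groupby("Checking account")["Risk"].agg(["mean", "count"])
print(summary)
```

On the real data, the same `groupby` applied to `df` would show whether applicants with no recorded account default more or less often than the rest.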

## Feature Segregation

In [6]:
target = "Risk"

# Job is an ordinal skill-level code (0-3); it is treated as numeric here.
numerical_features = ["Age", "Job", "Credit amount", "Duration"]
categorical_features = [
    "Sex",
    "Housing",
    "Saving accounts",
    "Checking account",
    "Purpose"
]

X = df[numerical_features + categorical_features]
y = df[target]

X.head()

,Age,Job,Credit amount,Duration,Sex,Housing,Saving accounts,Checking account,Purpose
0,67,2,1169,6,male,own,missing,little,radio/TV
1,22,2,5951,48,female,own,little,moderate,radio/TV
2,49,1,2096,12,male,own,little,missing,education
3,45,2,7882,42,male,free,little,little,furniture/equipment
4,53,2,4870,24,male,free,little,little,car


## Categorical Variable Encoding

In [7]:
X_encoded = pd.get_dummies(
    X,
    columns=categorical_features,
    drop_first=True
)

X_encoded.head()

,Age,Job,Credit amount,Duration,Sex_male,Housing_own,Housing_rent,Saving accounts_missing,Saving accounts_moderate,Saving accounts_quite rich,Saving accounts_rich,Checking account_missing,Checking account_moderate,Checking account_rich,Purpose_car,Purpose_domestic appliances,Purpose_education,Purpose_furniture/equipment,Purpose_radio/TV,Purpose_repairs,Purpose_vacation/others
0,67,2,1169,6,True,True,False,True,False,False,False,False,False,False,False,False,False,False,True,False,False
1,22,2,5951,48,False,True,False,False,False,False,False,False,True,False,False,False,False,False,True,False,False
2,49,1,2096,12,True,True,False,False,False,False,False,True,False,False,False,False,True,False,False,False,False
3,45,2,7882,42,True,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False
4,53,2,4870,24,True,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False
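
With `drop_first=True`, `get_dummies` drops the first level of each variable in sorted order, which is why the output has no `Housing_free` or `Sex_female` column: those levels act as the reference. The sketch below illustrates this on a toy `Housing` column, and also shows one possible way (an assumption, not something the notebook does) to keep the column layout stable for a later batch by declaring the levels with `pd.Categorical`:

```python
import pandas as pd

toy = pd.DataFrame({"Housing": ["own", "free", "rent", "own"]})
encoded = pd.get_dummies(toy, columns=["Housing"], drop_first=True)

# Levels are sorted, so "free" is dropped and becomes the reference level:
# a row with both Housing_own and Housing_rent false means "free".
print(encoded.columns.tolist())

# Hypothetical later batch containing a single level. Declaring the full set
# of levels up front keeps the same dummy columns and the same reference.
levels = ["free", "own", "rent"]
new = pd.DataFrame({"Housing": pd.Categorical(["rent"], categories=levels)})
new_encoded = pd.get_dummies(new, columns=["Housing"], drop_first=True)
print(new_encoded.columns.tolist())
```

Without the explicit categories, a batch containing only `rent` would lose that level to `drop_first` and produce no dummy columns at all.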


## Final Feature Matrix Summary


In [8]:
print("Final feature matrix shape:", X_encoded.shape)
X_encoded.info()

Final feature matrix shape: (1000, 21)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 21 columns):
 #   Column                       Non-Null Count  Dtype
---  ------                       --------------  -----
 0   Age                          1000 non-null   int64
 1   Job                          1000 non-null   int64
 2   Credit amount                1000 non-null   int64
 3   Duration                     1000 non-null   int64
 4   Sex_male                     1000 non-null   bool 
 5   Housing_own                  1000 non-null   bool 
 6   Housing_rent                 1000 non-null   bool 
 7   Saving accounts_missing      1000 non-null   bool 
 8   Saving accounts_moderate     1000 non-null   bool 
 9   Saving accounts_quite rich   1000 non-null   bool 
 10  Saving accounts_rich         1000 non-null   bool 
 11  Checking account_missing     1000 non-null   bool 
 12  Checking account_moderate    1000 non-null   bool 
 13  Checking account_rich        1000 non-null   bool 
 ...

## Feature Scaling (Numerical Variables Only)


In [9]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

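# Standardize only the numerical columns; the dummy columns are left unchanged.
# Note: the scaler is fitted on all 1,000 rows, before any train/test split.
X_encoded[numerical_features] = scaler.fit_transform(
    X_encoded[numerical_features]
)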

X_encoded.head()

,Age,Job,Credit amount,Duration,Sex_male,Housing_own,Housing_rent,Saving accounts_missing,Saving accounts_moderate,Saving accounts_quite rich,Saving accounts_rich,Checking account_missing,Checking account_moderate,Checking account_rich,Purpose_car,Purpose_domestic appliances,Purpose_education,Purpose_furniture/equipment,Purpose_radio/TV,Purpose_repairs,Purpose_vacation/others
0,2.77,0.15,-0.75,-1.24,True,True,False,True,False,False,False,False,False,False,False,False,False,False,True,False,False
1,-1.19,0.15,0.95,2.25,False,True,False,False,False,False,False,False,True,False,False,False,False,False,True,False,False
2,1.18,-1.38,-0.42,-0.74,True,True,False,False,False,False,False,True,False,False,False,False,True,False,False,False,False
3,0.83,0.15,1.63,1.75,True,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False
4,1.54,0.15,0.57,0.26,True,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False
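
The cell above fits the scaler on the full dataset. If the data is split into training and test sets later, a common alternative is to fit the scaler on the training rows only and reuse its statistics for the test rows, so that no information from the test set leaks into the features. The following is a minimal sketch of that pattern on synthetic data; the split ratio, random seeds, and generated values are assumptions for illustration, not part of the original notebook:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for two of the numerical features (values are made up).
X = pd.DataFrame({
    "Credit amount": rng.integers(250, 20000, size=200).astype(float),
    "Duration": rng.integers(4, 72, size=200).astype(float),
})
y = pd.Series(rng.integers(0, 2, size=200), name="Risk")

# Stratify on the target so both splits keep a similar default rate.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)
X_train, X_test = X_train.copy(), X_test.copy()

cols = ["Credit amount", "Duration"]
scaler = StandardScaler()

# Learn mean and standard deviation from the training rows only...
X_train[cols] = scaler.fit_transform(X_train[cols])
# ...and apply those same statistics to the test rows.
X_test[cols] = scaler.transform(X_test[cols])
```

After this, the training columns have mean 0 and unit variance by construction, while the test columns are only approximately standardized, which is the expected behaviour.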


## Save Processed Dataset

In [11]:
processed_df = X_encoded.copy()
processed_df["Risk"] = y.values

processed_df.to_csv("../data/processed/credit_model_ready.csv", index=False)

### Final Feature Matrix Validation

- Numerical variables are standardized using z-score normalization.
- Categorical variables are one-hot encoded with reference categories dropped.
- Missing values are preserved as separate categories to retain risk signal.
- The dataset contains no null values and is fully model-ready.

This dataset is now suitable for Probability of Default (PD) modeling.
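
The checks listed above can also be asserted in code rather than verified by eye. The helper below, `validate_feature_matrix`, is a hypothetical name introduced for illustration; it assumes the scaled columns were standardized on the same rows being validated, as in this notebook, and is demonstrated on a toy frame:

```python
import numpy as np
import pandas as pd


def validate_feature_matrix(frame, scaled_cols, target="Risk"):
    """Raise AssertionError if the frame fails any of the listed checks."""
    # No null values anywhere in the frame.
    assert not frame.isna().any().any(), "null values present"
    # Target is binary and encoded as 0/1.
    assert set(frame[target].unique()) <= {0, 1}, "target is not binary 0/1"
    # Scaled columns have mean 0 and population standard deviation 1.
    means = frame[scaled_cols].mean()
    stds = frame[scaled_cols].std(ddof=0)
    assert np.allclose(means, 0.0, atol=1e-8), "scaled columns not centred"
    assert np.allclose(stds, 1.0, atol=1e-8), "scaled columns not unit variance"
    # Every remaining column is numeric or boolean, i.e. fully encoded.
    leftover = frame.select_dtypes(exclude=["number", "bool"]).columns
    assert len(leftover) == 0, f"unencoded columns: {list(leftover)}"
    return True


# Toy frame standing in for processed_df.
raw = np.array([1.0, 2.0, 3.0, 4.0])
toy = pd.DataFrame({
    "Age": (raw - raw.mean()) / raw.std(),  # np.std uses ddof=0 by default
    "Sex_male": [True, False, True, False],
    "Risk": [0, 1, 0, 0],
})
print(validate_feature_matrix(toy, ["Age"]))
```

`ddof=0` is used because `StandardScaler` divides by the population standard deviation, whereas pandas' `std` defaults to the sample version.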
