# End-to-End Tabular Machine Learning  
## Credit Risk Modeling with Feature Engineering and Explainability

### Problem Definition
The objective of this project is to predict the probability of loan default for individual applicants.

This is a **binary classification problem** where:
- `1` indicates default
- `0` indicates non-default

### Motivation
Credit risk modeling is a core component of decision-support systems in banking.

- **False negatives** lead to direct financial loss  
- **False positives** result in lost customers  

> This project is not a prediction game, but a **risk modeling study** focusing on methodology, feature engineering, and explainability.
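
The asymmetry between the two error types can be made concrete with an expected-cost calculation. The sketch below is illustrative only and is not part of the original analysis: the cost values `COST_FN` and `COST_FP`, the helper `expected_cost`, and the toy labels and scores are all hypothetical assumptions.

```python
import numpy as np

# Hypothetical unit costs: a missed default (false negative) is assumed
# to be ten times as costly as a rejected good applicant (false positive).
COST_FN = 10.0
COST_FP = 1.0


def expected_cost(y_true, y_proba, threshold):
    """Average cost per applicant when scores >= threshold are flagged as default."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(y_proba) >= threshold).astype(int)
    false_neg = np.sum((y_true == 1) & (y_pred == 0))
    false_pos = np.sum((y_true == 0) & (y_pred == 1))
    return (COST_FN * false_neg + COST_FP * false_pos) / len(y_true)


# Toy labels and predicted default probabilities (illustrative only).
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 0, 0, 1])
y_proba = np.array([0.07, 0.12, 0.22, 0.31, 0.37, 0.42, 0.61, 0.16, 0.27, 0.83])

# Scan a grid of thresholds and keep the one with the lowest average cost.
thresholds = np.linspace(0.05, 0.95, 19)
costs = [expected_cost(y_true, y_proba, t) for t in thresholds]
best_threshold = thresholds[int(np.argmin(costs))]
print(f"lowest-cost threshold: {best_threshold:.2f}")
```

When false negatives are priced higher than false positives, the lowest-cost threshold tends to sit below 0.5, which is one reason the models here are evaluated on predicted probabilities rather than on hard labels.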


In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, precision_recall_curve, auc

In [2]:
df = pd.read_csv("loan.csv") 
df.head()

| | LoanID | Age | Income | LoanAmount | CreditScore | MonthsEmployed | NumCreditLines | InterestRate | LoanTerm | DTIRatio | Education | EmploymentType | MaritalStatus | HasMortgage | HasDependents | LoanPurpose | HasCoSigner | Default |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | I38PQUQS96 | 56 | 85994 | 50587 | 520 | 80 | 4 | 15.23 | 36 | 0.44 | Bachelor's | Full-time | Divorced | Yes | Yes | Other | Yes | 0 |
| 1 | HPSK72WA7R | 69 | 50432 | 124440 | 458 | 15 | 1 | 4.81 | 60 | 0.68 | Master's | Full-time | Married | No | No | Other | Yes | 0 |
| 2 | C1OZ6DPJ8Y | 46 | 84208 | 129188 | 451 | 26 | 3 | 21.17 | 24 | 0.31 | Master's | Unemployed | Divorced | Yes | Yes | Auto | No | 1 |
| 3 | V2KKSFM3UN | 32 | 31713 | 44799 | 743 | 0 | 3 | 7.07 | 24 | 0.23 | High School | Full-time | Married | No | No | Business | No | 0 |
| 4 | EY08JDHTZP | 60 | 20437 | 9139 | 633 | 8 | 4 | 6.51 | 48 | 0.73 | Bachelor's | Unemployed | Divorced | No | Yes | Auto | No | 0 |


In [4]:
df.shape

(255347, 18)

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 255347 entries, 0 to 255346
Data columns (total 18 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   LoanID          255347 non-null  object 
 1   Age             255347 non-null  int64  
 2   Income          255347 non-null  int64  
 3   LoanAmount      255347 non-null  int64  
 4   CreditScore     255347 non-null  int64  
 5   MonthsEmployed  255347 non-null  int64  
 6   NumCreditLines  255347 non-null  int64  
 7   InterestRate    255347 non-null  float64
 8   LoanTerm        255347 non-null  int64  
 9   DTIRatio        255347 non-null  float64
 10  Education       255347 non-null  object 
 11  EmploymentType  255347 non-null  object 
 12  MaritalStatus   255347 non-null  object 
 13  HasMortgage     255347 non-null  object 
 14  HasDependents   255347 non-null  object 
 15  LoanPurpose     255347 non-null  object 
 16  HasCoSigner     255347 non-null  object 
 17  Default   

In [7]:
df["Default"].value_counts()

Default
0    225694
1     29653
Name: count, dtype: int64

In [8]:
df["Default"].value_counts(normalize=True)

Default
0    0.883872
1    0.116128
Name: proportion, dtype: float64

In [9]:
df.duplicated().sum()

np.int64(0)

In [10]:
df["LoanID"].nunique(), len(df)

(255347, 255347)

In [11]:
df.select_dtypes(include=["int64", "float64"]).describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Age | 255347.0 | 43.498306 | 14.990258 | 18.0 | 31.0 | 43.0 | 56.0 | 69.0 |
| Income | 255347.0 | 82499.304597 | 38963.013729 | 15000.0 | 48825.5 | 82466.0 | 116219.0 | 149999.0 |
| LoanAmount | 255347.0 | 127578.865512 | 70840.706142 | 5000.0 | 66156.0 | 127556.0 | 188985.0 | 249999.0 |
| CreditScore | 255347.0 | 574.264346 | 158.903867 | 300.0 | 437.0 | 574.0 | 712.0 | 849.0 |
| MonthsEmployed | 255347.0 | 59.541976 | 34.643376 | 0.0 | 30.0 | 60.0 | 90.0 | 119.0 |
| NumCreditLines | 255347.0 | 2.501036 | 1.117018 | 1.0 | 2.0 | 2.0 | 3.0 | 4.0 |
| InterestRate | 255347.0 | 13.492773 | 6.636443 | 2.0 | 7.77 | 13.46 | 19.25 | 25.0 |
| LoanTerm | 255347.0 | 36.025894 | 16.96933 | 12.0 | 24.0 | 36.0 | 48.0 | 60.0 |
| DTIRatio | 255347.0 | 0.500212 | 0.230917 | 0.1 | 0.3 | 0.5 | 0.7 | 0.9 |
| Default | 255347.0 | 0.116128 | 0.320379 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |


In [12]:
df.select_dtypes(include="object").nunique()

LoanID            255347
Education              4
EmploymentType         4
MaritalStatus          3
HasMortgage            2
HasDependents          2
LoanPurpose            5
HasCoSigner            2
dtype: int64

In [13]:
X = df.drop(columns=["Default", "LoanID"])
y = df["Default"]

X.shape, y.shape

((255347, 16), (255347,))

In [14]:
cat_cols = X.select_dtypes(include="object").columns.tolist()
num_cols = X.select_dtypes(exclude="object").columns.tolist()

len(num_cols), len(cat_cols), num_cols, cat_cols

(9,
 7,
 ['Age',
  'Income',
  'LoanAmount',
  'CreditScore',
  'MonthsEmployed',
  'NumCreditLines',
  'InterestRate',
  'LoanTerm',
  'DTIRatio'],
 ['Education',
  'EmploymentType',
  'MaritalStatus',
  'HasMortgage',
  'HasDependents',
  'LoanPurpose',
  'HasCoSigner'])

In [15]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y
)

X_train.shape, X_test.shape

((204277, 16), (51070, 16))

In [16]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer

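# The data has no missing values (see df.info()), so the imputers below
# act only as a safeguard for future inputs.

# Numeric columns: fill gaps with the median, then standardize.
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])

# Categorical columns: fill gaps with the most frequent value, then one-hot encode.
# handle_unknown="ignore" encodes categories unseen during fit as all zeros.
categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore"))
])

# Route each column group through its own transformer.
preprocess = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, num_cols),
        ("cat", categorical_transformer, cat_cols),
    ]
)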

preprocess

In [17]:
from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression(
    max_iter=2000,
    solver="lbfgs",
    n_jobs=-1
)

clf_lr = Pipeline(steps=[
    ("preprocess", preprocess),
    ("model", log_reg)
])

clf_lr

In [18]:
clf_lr.fit(X_train, y_train)

In [19]:
y_proba_test = clf_lr.predict_proba(X_test)[:, 1]

In [20]:
from sklearn.metrics import roc_auc_score

roc_auc = roc_auc_score(y_test, y_proba_test)
roc_auc

np.float64(0.7531082963058535)

In [21]:
from sklearn.metrics import precision_recall_curve, auc

precision, recall, _ = precision_recall_curve(y_test, y_proba_test)
pr_auc = auc(recall, precision)
pr_auc

np.float64(0.311448576495611)

In [22]:
from sklearn.model_selection import cross_val_score

cv_scores = cross_val_score(
    clf_lr,
    X_train,
    y_train,
    cv=5,
    scoring="roc_auc",
    n_jobs=-1
)

cv_scores.mean(), cv_scores.std()

(np.float64(0.7463893529283542), np.float64(0.0031491374937079432))

In [23]:
feature_names = (
    clf_lr.named_steps["preprocess"]
    .get_feature_names_out()
)

coef = clf_lr.named_steps["model"].coef_[0]

coef_df = (
    pd.DataFrame({
        "feature": feature_names,
        "coefficient": coef
    })
    .sort_values("coefficient", ascending=False)
)

coef_df.head(10), coef_df.tail(10)

(                           feature  coefficient
 6                num__InterestRate     0.458119
 2                  num__LoanAmount     0.300245
 5              num__NumCreditLines     0.100523
 8                    num__DTIRatio     0.067855
 16  cat__EmploymentType_Unemployed     0.036936
 7                    num__LoanTerm     0.003715
 10      cat__Education_High School    -0.032620
 25       cat__LoanPurpose_Business    -0.048904
 24           cat__LoanPurpose_Auto    -0.105238
 9        cat__Education_Bachelor's    -0.108530,
                           feature  coefficient
 27          cat__LoanPurpose_Home    -0.287717
 12             cat__Education_PhD    -0.288499
 4             num__MonthsEmployed    -0.337132
 1                     num__Income    -0.341721
 18     cat__MaritalStatus_Married    -0.343438
 13  cat__EmploymentType_Full-time    -0.410387
 21           cat__HasMortgage_Yes    -0.413832
 23         cat__HasDependents_Yes    -0.462599
 30           cat__HasCoSign

## Baseline Model: Logistic Regression

As a first reference model, a Logistic Regression classifier was trained using a unified preprocessing pipeline.
The goal of this baseline is not to maximize performance, but to establish an interpretable and methodologically sound benchmark.

### Key Results
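- Test ROC-AUC: 0.753  
- Test PR-AUC: 0.311  
- Cross-validation ROC-AUC: 0.746 ± 0.003 (5 folds)  

The small fold-to-fold standard deviation and the close agreement between the cross-validation and test ROC-AUC indicate a stable baseline for a moderately imbalanced credit risk dataset (about 11.6% defaults).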
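
PR-AUC is easier to read against its no-skill reference level, which is the share of positives: a random ranking has an expected precision equal to the default rate at every recall level. The short sketch below uses only the class counts and test scores reported above; the variable names are illustrative. It assumes the test set has the same default rate as the full data, which the stratified split approximately guarantees.

```python
# Class counts reported by df["Default"].value_counts() above.
n_non_default = 225_694
n_default = 29_653

# A classifier that ranks applicants at random has an expected precision
# equal to the share of positives, so this is the PR-AUC reference level.
prevalence = n_default / (n_default + n_non_default)

# Test-set scores reported for the logistic regression baseline.
test_pr_auc = 0.311449
test_roc_auc = 0.753108

pr_lift = test_pr_auc / prevalence   # improvement over a random ranking
roc_lift = test_roc_auc / 0.5        # the ROC-AUC reference level is 0.5

print(f"prevalence (PR-AUC baseline): {prevalence:.4f}")
print(f"PR-AUC lift over baseline:    {pr_lift:.2f}x")
print(f"ROC-AUC lift over baseline:   {roc_lift:.2f}x")
```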

### Model Interpretability
The learned coefficients align well with financial intuition:

- Higher **interest rates**, **loan amounts**, and **debt-to-income ratios** increase default risk.
- Stable income indicators such as **higher income**, **longer employment duration**, **full-time employment**, and **co-signers** reduce default risk.

This is consistent with realistic credit risk patterns, although coefficient signs alone cannot rule out spurious correlations.
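
Because the numeric features pass through `StandardScaler`, each numeric coefficient is the change in the log-odds of default per one standard deviation of the raw feature. Exponentiating a coefficient turns it into an odds ratio, which is often easier to communicate. The sketch below uses a few coefficients copied from the output above; the dictionary and the helper name `odds_ratio` are illustrative.

```python
import math

# Coefficients copied from the logistic regression output above.
# Numeric features were standardized, so each value is a log-odds
# change per one standard deviation of the raw feature.
coefficients = {
    "num__InterestRate": 0.458119,
    "num__LoanAmount": 0.300245,
    "num__Income": -0.341721,
    "num__MonthsEmployed": -0.337132,
}


def odds_ratio(coefficient):
    """Multiplicative change in the odds of default for a one-unit increase."""
    return math.exp(coefficient)


for name, coef in coefficients.items():
    print(f"{name:<22} odds ratio per 1 SD: {odds_ratio(coef):.3f}")
```

An odds ratio above 1 raises the odds of default and a value below 1 lowers them. For the one-hot encoded categories, the encoder keeps every level, so those coefficients are best read relative to the other levels of the same variable rather than against a dropped reference level.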

> Logistic Regression serves as a transparent baseline, allowing the impact of feature engineering to be clearly observed.


In [26]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=200,
    max_depth=None,
    min_samples_split=2,
    min_samples_leaf=1,
    random_state=42,
    n_jobs=-1
)

clf_rf = Pipeline(steps=[
    ("preprocess", preprocess),
    ("model", rf)
])

clf_rf

In [27]:
clf_rf.fit(X_train, y_train)

In [28]:
y_proba_rf_test = clf_rf.predict_proba(X_test)[:, 1]

In [29]:
from sklearn.metrics import roc_auc_score

rf_roc_auc = roc_auc_score(y_test, y_proba_rf_test)
rf_roc_auc

np.float64(0.7405326951846065)

In [30]:
from sklearn.metrics import precision_recall_curve, auc

precision_rf, recall_rf, _ = precision_recall_curve(y_test, y_proba_rf_test)
rf_pr_auc = auc(recall_rf, precision_rf)
rf_pr_auc

np.float64(0.31218081032691414)

In [31]:
from sklearn.model_selection import cross_val_score

rf_cv_scores = cross_val_score(
    clf_rf,
    X_train,
    y_train,
    cv=5,
    scoring="roc_auc",
    n_jobs=-1
)

rf_cv_scores.mean(), rf_cv_scores.std()

(np.float64(0.7339528937578395), np.float64(0.0030225892904273406))

In [32]:
rf_feature_names = (
    clf_rf.named_steps["preprocess"]
    .get_feature_names_out()
)

rf_importances = clf_rf.named_steps["model"].feature_importances_

rf_importance_df = (
    pd.DataFrame({
        "feature": rf_feature_names,
        "importance": rf_importances
    })
    .sort_values("importance", ascending=False)
)

rf_importance_df.head(10), rf_importance_df.tail(10)

(                        feature  importance
 1                   num__Income    0.119602
 6             num__InterestRate    0.113837
 2               num__LoanAmount    0.107123
 0                      num__Age    0.098914
 3              num__CreditScore    0.094559
 4           num__MonthsEmployed    0.093330
 8                 num__DTIRatio    0.083665
 7                 num__LoanTerm    0.038699
 5           num__NumCreditLines    0.031144
 17  cat__MaritalStatus_Divorced    0.011965,
                               feature  importance
 21               cat__HasMortgage_Yes    0.010198
 15  cat__EmploymentType_Self-employed    0.010177
 20                cat__HasMortgage_No    0.010146
 27              cat__LoanPurpose_Home    0.009913
 16     cat__EmploymentType_Unemployed    0.009384
 13      cat__EmploymentType_Full-time    0.007508
 23             cat__HasDependents_Yes    0.006843
 22              cat__HasDependents_No    0.006840
 30               cat__HasCoSigner_Yes    0.0

In [33]:
results_df = pd.DataFrame({
    "Model": [
        "Logistic Regression",
        "Random Forest"
    ],
    "Test ROC-AUC": [
        roc_auc,
        rf_roc_auc
    ],
    "Test PR-AUC": [
        pr_auc,
        rf_pr_auc
    ],
    "CV ROC-AUC (mean)": [
        cv_scores.mean(),
        rf_cv_scores.mean()
    ],
    "CV ROC-AUC (std)": [
        cv_scores.std(),
        rf_cv_scores.std()
    ]
})

results_df

| | Model | Test ROC-AUC | Test PR-AUC | CV ROC-AUC (mean) | CV ROC-AUC (std) |
|---|---|---|---|---|---|
| 0 | Logistic Regression | 0.753108 | 0.311449 | 0.746389 | 0.003149 |
| 1 | Random Forest | 0.740533 | 0.312181 | 0.733953 | 0.003023 |


## Model Comparison: Logistic Regression vs Random Forest

Both models were trained using the same preprocessing pipeline and evaluated with identical metrics to ensure a fair comparison.
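
One way to enforce identical treatment is to route every model through a single evaluation helper. The sketch below is an assumption about how that could look, not the notebook's own code: `evaluate_model` is a hypothetical helper, the data comes from `make_classification` rather than the loan file, and `average_precision_score` is used as the PR summary, which differs slightly from the trapezoidal `auc(recall, precision)` used in the cells above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def evaluate_model(model, X_train, X_test, y_train, y_test):
    """Fit `model` behind a shared scaler and return test ROC-AUC and PR-AUC."""
    pipeline = Pipeline(steps=[("scaler", StandardScaler()), ("model", model)])
    pipeline.fit(X_train, y_train)
    y_proba = pipeline.predict_proba(X_test)[:, 1]
    return {
        "roc_auc": roc_auc_score(y_test, y_proba),
        "pr_auc": average_precision_score(y_test, y_proba),
    }


# Synthetic stand-in data with roughly 12% positives (illustrative only).
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.88], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=2000),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=42),
}
# Every model sees the same split, the same preprocessing, and the same metrics.
results = {
    name: evaluate_model(model, X_train, X_test, y_train, y_test)
    for name, model in models.items()
}
for name, scores in results.items():
    print(name, {metric: round(value, 3) for metric, value in scores.items()})
```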

### Observations
- Logistic Regression achieves a higher test ROC-AUC (0.753 vs 0.741) and a higher cross-validation mean (0.746 vs 0.734).
- Random Forest can capture non-linear patterns, but with the untuned settings used here it does not outperform the baseline model.
- PR-AUC scores are nearly identical (0.311 vs 0.312), reflecting similar performance under class imbalance.
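
Whether a gap of about 0.01 in test ROC-AUC is meaningful depends on the sampling noise of the test set. One common way to gauge this is a paired bootstrap over test rows. The sketch below is a minimal illustration on synthetic scores, not a result from this dataset: `roc_auc`, `bootstrap_auc_difference`, and the simulated score arrays are all hypothetical.

```python
import numpy as np


def roc_auc(y_true, scores):
    """ROC-AUC via the rank-sum (Mann-Whitney U) formula.

    Assumes no tied scores between positives and negatives.
    """
    y_true = np.asarray(y_true)
    ranks = np.empty(len(scores))
    ranks[np.argsort(scores)] = np.arange(1, len(scores) + 1)
    n_pos = y_true.sum()
    n_neg = len(y_true) - n_pos
    return (ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def bootstrap_auc_difference(y_true, scores_a, scores_b, n_boot=500, seed=0):
    """95% percentile interval for AUC(a) - AUC(b) over paired resamples."""
    rng = np.random.default_rng(seed)
    y_true = np.asarray(y_true)
    n = len(y_true)
    diffs = []
    for _ in range(n_boot):
        # Resample rows with replacement; both models see the same rows.
        idx = rng.integers(0, n, size=n)
        if y_true[idx].min() == y_true[idx].max():
            continue  # skip resamples that contain a single class
        auc_a = roc_auc(y_true[idx], scores_a[idx])
        auc_b = roc_auc(y_true[idx], scores_b[idx])
        diffs.append(auc_a - auc_b)
    return np.percentile(diffs, [2.5, 97.5])


# Synthetic labels (about 12% positives) and two simulated score vectors.
rng = np.random.default_rng(42)
y = (rng.random(2000) < 0.12).astype(int)
signal = y + rng.normal(0, 1.0, size=2000)
scores_a = signal + rng.normal(0, 0.3, size=2000)  # simulated model A: less noise
scores_b = signal + rng.normal(0, 1.5, size=2000)  # simulated model B: more noise

low, high = bootstrap_auc_difference(y, scores_a, scores_b)
print(f"95% interval for AUC(A) - AUC(B): [{low:.3f}, {high:.3f}]")
```

If the interval excludes zero, the difference is unlikely to be an artifact of which rows landed in the test set. Applied to this notebook, `scores_a` and `scores_b` would be `y_proba_test` and `y_proba_rf_test`.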

### Interpretation
Despite its simplicity, Logistic Regression remains competitive with the untuned Random Forest and is slightly better on ROC-AUC in this setting.
This suggests that added model complexity does not improve results by itself, and that sound preprocessing and methodological rigor matter at least as much.

> Model selection in credit risk is not about maximizing complexity, but about balancing performance, stability, and interpretability.

In [34]:
from sklearn.ensemble import GradientBoostingClassifier

gb = GradientBoostingClassifier(
    n_estimators=200,
    learning_rate=0.05,
    max_depth=3,
    random_state=42
)

clf_gb = Pipeline(steps=[
    ("preprocess", preprocess),
    ("model", gb)
])

clf_gb

In [35]:
clf_gb.fit(X_train, y_train)

In [36]:
y_proba_gb_test = clf_gb.predict_proba(X_test)[:, 1]

In [38]:
from sklearn.metrics import roc_auc_score

gb_roc_auc = roc_auc_score(y_test, y_proba_gb_test)
gb_roc_auc

np.float64(0.7581521330790029)

In [39]:
from sklearn.metrics import precision_recall_curve, auc

precision_gb, recall_gb, _ = precision_recall_curve(y_test, y_proba_gb_test)
gb_pr_auc = auc(recall_gb, precision_gb)
gb_pr_auc

np.float64(0.3292230866100341)

In [86]:
from sklearn.model_selection import cross_val_score

gb_cv_scores = cross_val_score(
    clf_gb,
    X_train,
    y_train,
    cv=5,
    scoring="roc_auc",
    n_jobs=-1
)

gb_cv_scores.mean(), gb_cv_scores.std()

(np.float64(0.7517853204540481), np.float64(0.002322750851611539))

In [87]:
gb_feature_names = (
    clf_gb.named_steps["preprocess"]
    .get_feature_names_out()
)

gb_importances = clf_gb.named_steps["model"].feature_importances_

gb_importance_df = (
    pd.DataFrame({
        "feature": gb_feature_names,
        "importance": gb_importances
    })
    .sort_values("importance", ascending=False)
)

gb_importance_df.head(10), gb_importance_df.tail(10)

(                           feature  importance
 0                         num__Age    0.286416
 1                      num__Income    0.211084
 6                num__InterestRate    0.194153
 2                  num__LoanAmount    0.121472
 4              num__MonthsEmployed    0.100301
 3                 num__CreditScore    0.013149
 13   cat__EmploymentType_Full-time    0.011140
 16  cat__EmploymentType_Unemployed    0.007601
 29             cat__HasCoSigner_No    0.007187
 5              num__NumCreditLines    0.007006,
                               feature    importance
 17        cat__MaritalStatus_Divorced  5.266410e-04
 25          cat__LoanPurpose_Business  2.706483e-04
 11            cat__Education_Master's  8.772730e-05
 7                       num__LoanTerm  6.679531e-05
 24              cat__LoanPurpose_Auto  4.903845e-05
 15  cat__EmploymentType_Self-employed  2.380631e-05
 19          cat__MaritalStatus_Single  1.507806e-07
 14      cat__EmploymentType_Part-time  0.00000

In [88]:
final_results_df = pd.DataFrame({
    "Model": [
        "Logistic Regression",
        "Random Forest",
        "Gradient Boosting"
    ],
    "Test ROC-AUC": [
        roc_auc,
        rf_roc_auc,
        gb_roc_auc
    ],
    "Test PR-AUC": [
        pr_auc,
        rf_pr_auc,
        gb_pr_auc
    ],
    "CV ROC-AUC (mean)": [
        cv_scores.mean(),
        rf_cv_scores.mean(),
        gb_cv_scores.mean()
    ],
    "CV ROC-AUC (std)": [
        cv_scores.std(),
        rf_cv_scores.std(),
        gb_cv_scores.std()
    ]
})

final_results_df

| | Model | Test ROC-AUC | Test PR-AUC | CV ROC-AUC (mean) | CV ROC-AUC (std) |
|---|---|---|---|---|---|
| 0 | Logistic Regression | 0.753108 | 0.311449 | 0.746389 | 0.003149 |
| 1 | Random Forest | 0.740533 | 0.312181 | 0.733953 | 0.003023 |
| 2 | Gradient Boosting | 0.758152 | 0.329223 | 0.751785 | 0.002323 |


## Final Results and Conclusions

This project presented an end-to-end, methodology-driven approach to credit risk modeling on a real-world tabular dataset.

Three model families were evaluated under identical preprocessing and evaluation settings:
- Logistic Regression
- Random Forest
- Gradient Boosting

### Key Findings
- Logistic Regression provides a strong and interpretable baseline (test ROC-AUC 0.753) using only standard scaling and one-hot encoding.
- Random Forest captures non-linear patterns but does not outperform the linear baseline in this setting.
- Gradient Boosting achieves the best overall performance (test ROC-AUC 0.758, PR-AUC 0.329), a modest but consistent improvement with the lowest variation across cross-validation folds (std 0.002).
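
To put "modest but consistent" in numbers, the cross-validated gaps can be expressed relative to the baseline's fold-to-fold spread. The sketch below uses only the values from the results table above. It is a rough heuristic rather than a significance test, since cross-validation folds share training data and are not independent; the variable names are illustrative.

```python
# Cross-validated ROC-AUC (mean, std) copied from the results table above.
cv_results = {
    "Logistic Regression": (0.746389, 0.003149),
    "Random Forest": (0.733953, 0.003023),
    "Gradient Boosting": (0.751785, 0.002323),
}

baseline_mean, baseline_std = cv_results["Logistic Regression"]

# Gap of each model's CV mean to the logistic regression baseline.
gaps = {name: mean - baseline_mean for name, (mean, _std) in cv_results.items()}

for name, gap in gaps.items():
    # Express the gap in units of the baseline's fold-to-fold spread.
    ratio = gap / baseline_std
    print(f"{name:<20} gap vs baseline: {gap:+.4f} ({ratio:+.1f} baseline SDs)")
```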

### Methodological Insight
The results emphasize that performance gains in tabular machine learning are primarily driven by:
- Thoughtful feature engineering
- Proper preprocessing pipelines
- Fair and consistent model comparison

rather than excessive model complexity or hyperparameter tuning.
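
The cells shown here pass the raw columns to each model. As a hedged illustration of what feature engineering could look like on this schema, the sketch below derives a few affordability ratios from existing columns. The helper `add_ratio_features` and the new column names are hypothetical, and no claim is made about their effect on the reported scores.

```python
import pandas as pd


def add_ratio_features(df):
    """Return a copy of `df` with hypothetical affordability ratios added."""
    out = df.copy()
    # Loan size relative to annual income.
    out["LoanToIncome"] = out["LoanAmount"] / out["Income"]
    # Principal repaid per month, ignoring interest (a deliberate simplification).
    out["MonthlyPrincipal"] = out["LoanAmount"] / out["LoanTerm"]
    # Monthly principal as a share of monthly income.
    out["PaymentToIncome"] = out["MonthlyPrincipal"] / (out["Income"] / 12)
    return out


# Two rows copied from the df.head() output above.
sample = pd.DataFrame({
    "Income": [85994, 50432],
    "LoanAmount": [50587, 124440],
    "LoanTerm": [36, 60],
})
engineered = add_ratio_features(sample)
print(engineered.round(3))
```

The divisions are safe on this dataset because `Income` has a minimum of 15000 and `LoanTerm` a minimum of 12 in the summary statistics. Because these features are computed row by row, they can be added before the train/test split without leaking information.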

### Final Remark
This study demonstrates that credit risk modeling is not a prediction contest, but a disciplined process of designing reliable, interpretable, and generalizable decision-support models.