# Black Box Models

## 1. Random Forest Classifier

In [18]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, roc_auc_score, classification_report
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier
import joblib
import os

os.chdir("D:/Algorithmic-Fairness-Interpretability/afi_final_project")

In [2]:
df = pd.read_excel("data/dataproject2024.xlsx")

In [3]:
X = df[
    [
        "Job tenure",
        "Age",
        "Car price",
        "Funding amount",
        "Down payment",
        "Loan duration",
        "Monthly payment",
        "Credit event",
        "Married",
        "Homeowner",
    ]
]


y = df["Default (y)"]

In [4]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

In [5]:
param_grid_rf = {
    "n_estimators": [100, 200, 300],  # Number of trees in the forest
    "max_depth": [None, 10, 20, 30],
    "min_samples_split": [
        2,
        5,
        10,
    ],
    "min_samples_leaf": [1, 2, 4],
    "bootstrap": [True, False],
}

In [6]:
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
grid_search_rf = GridSearchCV(
    estimator=rf_model,
    param_grid=param_grid_rf,
    cv=5,
    scoring="roc_auc",
    n_jobs=-1,
    verbose=2,
)


grid_search_rf.fit(X_train, y_train)

Fitting 5 folds for each of 216 candidates, totalling 1080 fits


In [7]:
print(f"Best parameters for Random Forest: {grid_search_rf.best_params_}")
best_rf_model = grid_search_rf.best_estimator_

Best parameters for Random Forest: {'bootstrap': True, 'max_depth': 10, 'min_samples_leaf': 4, 'min_samples_split': 2, 'n_estimators': 300}


In [8]:
y_pred = best_rf_model.predict(X_test)
y_pred_proba = best_rf_model.predict_proba(X_test)[:, 1]

In [9]:
accuracy = accuracy_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_pred_proba)

In [10]:
print(f"Accuracy: {accuracy:.4f}")
print(f"AUC-ROC: {roc_auc:.4f}")
print("\nClassification Report:\n", classification_report(y_test, y_pred))

Accuracy: 0.8257
AUC-ROC: 0.7825

Classification Report:
               precision    recall  f1-score   support

           0       0.83      0.98      0.90      1205
           1       0.68      0.15      0.24       281

    accuracy                           0.83      1486
   macro avg       0.75      0.57      0.57      1486
weighted avg       0.80      0.83      0.78      1486



- Class 0 (Majority Class) has high precision (0.83), recall (0.98), and F1-score (0.90), indicating that the model identifies this class well.
- Class 1 (Minority Class) has lower precision (0.68) and very low recall (0.15): the model detects only about 15% of the actual class 1 instances.
- Macro Average Precision (0.75): Precision is fairly high for both classes (0.83 for class 0, 0.68 for class 1), and the lower minority-class value pulls the macro average down. When the model does predict class 1, roughly one prediction in three is a false positive.

- Macro Average Recall (0.57): Recall is where the model shows a significant imbalance between the two classes. The recall for class 0 is very high (0.98), but for class 1 it’s only 0.15. This drastic drop lowers the macro average. It indicates that the model often misses instances of class 1 (many false negatives) and mostly predicts class 0.

- Macro Average F1-score (0.57): The macro average F1-score tells a similar story. Class 0 has a strong balance between precision and recall (0.90), but class 1 reaches an F1-score of only 0.24, which brings the macro average down. The model does not balance precision and recall well for the minority class.
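
To make the difference between the macro and weighted rows of the report concrete, the sketch below recomputes both averages from the per-class values printed above. The helper names `macro_avg` and `weighted_avg` are illustrative, and because the inputs are the rounded report values, the results can differ from the report in the last digit.

```python
# Per-class values copied from the Random Forest classification report above.
precision = {0: 0.83, 1: 0.68}
recall = {0: 0.98, 1: 0.15}
support = {0: 1205, 1: 281}

total = sum(support.values())


def macro_avg(per_class):
    """Unweighted mean: every class counts equally."""
    return sum(per_class.values()) / len(per_class)


def weighted_avg(per_class):
    """Support-weighted mean: the majority class dominates."""
    return sum(per_class[c] * support[c] for c in per_class) / total


print(f"macro precision:    {macro_avg(precision):.3f}")
print(f"macro recall:       {macro_avg(recall):.3f}")
print(f"weighted precision: {weighted_avg(precision):.3f}")
print(f"weighted recall:    {weighted_avg(recall):.3f}")
```

The macro average treats the weak class 1 recall as half of the score, while the weighted average lets the 1205 class 0 cases hide it.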

In [11]:
feature_importances = best_rf_model.feature_importances_
importance_df = pd.DataFrame(
    {"Feature": X.columns, "Importance": feature_importances}
).sort_values(by="Importance", ascending=False)

print("\nFeature Importances:\n", importance_df)


Feature Importances:
            Feature  Importance
3   Funding amount    0.183970
0       Job tenure    0.156429
1              Age    0.147802
2        Car price    0.145547
6  Monthly payment    0.140753
5    Loan duration    0.084295
9        Homeowner    0.060695
8          Married    0.042211
7     Credit event    0.030507
4     Down payment    0.007791


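- ``Funding amount`` is the most important feature (0.18), indicating that it plays a significant role in the model’s decision-making. These are scikit-learn’s impurity-based importances, which sum to 1 across features.
- ``Job tenure`` and ``Age`` are also highly influential; together with ``Car price`` and ``Monthly payment``, the top five features account for about 77% of the total importance.
- Features like ``Down payment`` (0.008) and ``Credit event`` (0.03) have relatively low importance in this model.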
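
Impurity-based importances are computed on the training data and tend to favour features with many distinct values, so binary columns can look less important than they are. One common cross-check is permutation importance on held-out data. The sketch below uses a small synthetic dataset as a stand-in (an assumption, not the project data); `rf_demo` and the other variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the loan data (assumption).
X_demo, y_demo = make_classification(
    n_samples=500,
    n_features=6,
    n_informative=3,
    weights=[0.8, 0.2],
    random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

rf_demo = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_tr, y_tr)

# Shuffle one feature at a time on the held-out set and measure the AUC drop.
perm = permutation_importance(
    rf_demo, X_te, y_te, scoring="roc_auc", n_repeats=10, random_state=42
)

# Print features from most to least important.
for idx in perm.importances_mean.argsort()[::-1]:
    print(
        f"feature_{idx}: {perm.importances_mean[idx]:.4f} "
        f"+/- {perm.importances_std[idx]:.4f}"
    )
```

In the notebook, the same call could be applied to `best_rf_model` with `X_test` and `y_test` to see whether the ranking above holds on unseen data.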

In [12]:
result_df = X_test.copy()
result_df["Predicted_PD"] = y_pred_proba
result_df["True_Label"] = y_test.values

print(result_df.head())

result_df.to_csv("predictions_output_random_forest.csv", index=False)

      Job tenure  Age  Car price  Funding amount  Down payment  Loan duration  \
625            2   22       8900            8900             0             60   
2796           0   55      10400            9400             0             72   
101            1   40      15990           14990             0             60   
4767           5   43      19999           23233             0             72   
2018           1   26      11800            4298             1             24   

      Monthly payment  Credit event  Married  Homeowner  Predicted_PD  \
625          0.084828             0        0          0      0.343699   
2796         0.083889             0        0          0      0.296349   
101          0.127142             0        1          0      0.259566   
4767         0.153289             0        0          0      0.354407   
2018         0.115528             0        0          0      0.105127   

      True_Label  
625            1  
2796           0  
101            1 

## 2. Gradient Boosting Classifier

In [13]:
param_grid_gb = {
    "n_estimators": [100, 200, 300],
    "learning_rate": [0.01, 0.05, 0.1],
    "max_depth": [3, 5, 7],
    "subsample": [0.8, 1.0],
    "min_samples_split": [2, 5, 10],
}

In [14]:
gb_model = GradientBoostingClassifier(random_state=42)
grid_search_gb = GridSearchCV(
    estimator=gb_model,
    param_grid=param_grid_gb,
    cv=5,
    scoring="roc_auc",
    n_jobs=-1,
    verbose=2,
)
grid_search_gb.fit(X_train, y_train)

Fitting 5 folds for each of 162 candidates, totalling 810 fits


In [15]:
print(f"Best parameters for Gradient Boosting: {grid_search_gb.best_params_}")
best_gb_model = grid_search_gb.best_estimator_

Best parameters for Gradient Boosting: {'learning_rate': 0.05, 'max_depth': 3, 'min_samples_split': 2, 'n_estimators': 200, 'subsample': 0.8}


In [16]:
y_pred_gb = best_gb_model.predict(X_test)
y_pred_proba_gb = best_gb_model.predict_proba(X_test)[:, 1]
print(f"Optimized Gradient Boosting Accuracy: {accuracy_score(y_test, y_pred_gb):.4f}")
print(
    f"Optimized Gradient Boosting AUC-ROC: {roc_auc_score(y_test, y_pred_proba_gb):.4f}"
)
print("\nClassification Report:\n", classification_report(y_test, y_pred_gb))

Optimized Gradient Boosting Accuracy: 0.8230
Optimized Gradient Boosting AUC-ROC: 0.7819

Classification Report:
               precision    recall  f1-score   support

           0       0.84      0.96      0.90      1205
           1       0.58      0.24      0.34       281

    accuracy                           0.82      1486
   macro avg       0.71      0.60      0.62      1486
weighted avg       0.79      0.82      0.79      1486



- Class 0 (Majority Class): The model performs very well on class 0, achieving high precision (0.84) and recall (0.96). The high recall means almost all class 0 cases are recognized as such; the precision means about 16% of the cases predicted as class 0 actually belong to class 1.

- Class 1 (Minority Class): There is a notable drop in performance for class 1, where precision is 0.58 and recall is much lower at 0.24. About 58% of the cases predicted as class 1 are correct, but the model detects only about a quarter of the actual class 1 instances, so many false negatives slip through.

- The F1-score for class 1 (0.34) further highlights the model’s weakness on the minority class: the low recall drags down the combined score despite moderate precision.

- The macro average recall (0.60) and F1-score (0.62) show the imbalance between how the model treats class 0 and class 1. The large disparity in recall (0.96 versus 0.24) indicates that while the model is excellent at identifying class 0, it falters on class 1.

- The weighted averages are dominated by class 0 because of its higher support (1205 of 1486 test cases), so the overall performance looks decent (weighted precision 0.79, recall 0.82, F1-score 0.79). However, this masks the poor performance for the minority class.
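
The low class 1 recall partly reflects the decision threshold: for a binary classifier, `predict` effectively labels a case as class 1 only when its predicted probability exceeds 0.5. A minimal sketch of how lowering that threshold trades precision for recall, using synthetic data as a stand-in (an assumption) and illustrative threshold values:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the loan data (assumption).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=8, weights=[0.8, 0.2], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

gb_demo = GradientBoostingClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)
proba = gb_demo.predict_proba(X_te)[:, 1]

# Evaluate class 1 precision and recall at several cut-offs.
recalls = {}
for threshold in (0.5, 0.3, 0.2):
    y_hat = (proba >= threshold).astype(int)
    recalls[threshold] = recall_score(y_te, y_hat)
    prec = precision_score(y_te, y_hat, zero_division=0)
    print(f"threshold={threshold}: precision={prec:.2f}, recall={recalls[threshold]:.2f}")
```

Recall can only stay the same or rise as the threshold drops, since every case flagged at 0.5 is also flagged at 0.3. The right cut-off depends on the relative cost of a missed default versus a wrongly rejected loan, and it should be chosen on validation data rather than the test set.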

In [17]:
result_df = X_test.copy()
result_df["Predicted_PD"] = y_pred_proba_gb
result_df["True_Label"] = y_test.values

print(result_df.head())

result_df.to_csv("predictions_output_gradient_boosting.csv", index=False)

      Job tenure  Age  Car price  Funding amount  Down payment  Loan duration  \
625            2   22       8900            8900             0             60   
2796           0   55      10400            9400             0             72   
101            1   40      15990           14990             0             60   
4767           5   43      19999           23233             0             72   
2018           1   26      11800            4298             1             24   

      Monthly payment  Credit event  Married  Homeowner  Predicted_PD  \
625          0.084828             0        0          0      0.290062   
2796         0.083889             0        0          0      0.357017   
101          0.127142             0        1          0      0.225008   
4767         0.153289             0        0          0      0.434658   
2018         0.115528             0        0          0      0.083357   

      True_Label  
625            1  
2796           0  
101            1 

## 3. XGBoost

In [20]:
param_grid_xgb = {
    "n_estimators": [100, 200, 300],
    "learning_rate": [0.01, 0.05, 0.1],
    "max_depth": [3, 5, 7],
    "subsample": [0.8, 1.0],
    "colsample_bytree": [0.8, 1.0],
    "min_child_weight": [1, 3, 5],
    "gamma": [0, 0.1, 0.3],
}

In [21]:
xgb_model = XGBClassifier(random_state=42)
grid_search_xgb = GridSearchCV(
    estimator=xgb_model,
    param_grid=param_grid_xgb,
    cv=5,
    scoring="roc_auc",
    n_jobs=-1,
    verbose=2,
)

grid_search_xgb.fit(X_train, y_train)

Fitting 5 folds for each of 972 candidates, totalling 4860 fits



In [22]:
print(f"Best parameters for XGBoost: {grid_search_xgb.best_params_}")
best_xgb_model = grid_search_xgb.best_estimator_

Best parameters for XGBoost: {'colsample_bytree': 1.0, 'gamma': 0.1, 'learning_rate': 0.1, 'max_depth': 3, 'min_child_weight': 1, 'n_estimators': 100, 'subsample': 0.8}


In [23]:
y_pred_xgb = best_xgb_model.predict(X_test)
y_pred_proba_xgb = best_xgb_model.predict_proba(X_test)[:, 1]

print(f"Optimized XGBoost Accuracy: {accuracy_score(y_test, y_pred_xgb):.4f}")
print(f"Optimized XGBoost AUC-ROC: {roc_auc_score(y_test, y_pred_proba_xgb):.4f}")
print("\nClassification Report:\n", classification_report(y_test, y_pred_xgb))

Optimized XGBoost Accuracy: 0.8237
Optimized XGBoost AUC-ROC: 0.7817

Classification Report:
               precision    recall  f1-score   support

           0       0.84      0.96      0.90      1205
           1       0.59      0.23      0.33       281

    accuracy                           0.82      1486
   macro avg       0.71      0.60      0.62      1486
weighted avg       0.79      0.82      0.79      1486



- Class 0 (Majority Class): Precision (0.84) and recall (0.96) are strong, much like in the other models. XGBoost recognizes almost all class 0 cases, leading to a high F1-score (0.90).
- Class 1 (Minority Class): Precision (0.59) is moderate: about 59% of the cases predicted as class 1 are correct, so false positives remain common. The recall (0.23) is notably low, indicating the model misses a large proportion of class 1 instances (many false negatives). Consequently, the F1-score (0.33) reflects the difficulty XGBoost has in balancing precision and recall for the minority class.
- The macro average precision (0.71), recall (0.60), and F1-score (0.62) match Gradient Boosting to two decimals, indicating that XGBoost treats the two classes in a comparable manner. Recall is still the weaker point, especially for the minority class.

In [24]:
result_df = X_test.copy()
result_df["Predicted_PD"] = y_pred_proba_xgb
result_df["True_Label"] = y_test.values

print(result_df.head())

result_df.to_csv("predictions_output_xgboost.csv", index=False)

      Job tenure  Age  Car price  Funding amount  Down payment  Loan duration  \
625            2   22       8900            8900             0             60   
2796           0   55      10400            9400             0             72   
101            1   40      15990           14990             0             60   
4767           5   43      19999           23233             0             72   
2018           1   26      11800            4298             1             24   

      Monthly payment  Credit event  Married  Homeowner  Predicted_PD  \
625          0.084828             0        0          0      0.311191   
2796         0.083889             0        0          0      0.341794   
101          0.127142             0        1          0      0.249162   
4767         0.153289             0        0          0      0.419638   
2018         0.115528             0        0          0      0.067851   

      True_Label  
625            1  
2796           0  
101            1 

## Comparison of Random Forest, Gradient Boosting, and XGBoost

1. Accuracy:
- Random Forest: 82.57%
- Gradient Boosting: 82.30%
- XGBoost: 82.37%

All three models show nearly identical accuracy, so none stands out on this metric. Accuracy is also not very meaningful with imbalanced data: 1205 of the 1486 test cases are non-defaults, so always predicting class 0 would already reach about 81.1% accuracy.

2. AUC-ROC:
- Random Forest: 0.7825
- Gradient Boosting: 0.7819
- XGBoost: 0.7817

The AUC-ROC scores across all models are very close, with Random Forest slightly outperforming the others, but the difference is too small to be meaningful. All three models perform similarly in distinguishing between the two classes.
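
One way to check whether such small AUC gaps matter is to estimate the sampling uncertainty of the AUC on the test set. The sketch below is a simple percentile bootstrap; the function name `bootstrap_auc_ci` and the toy scores are assumptions made for illustration, not part of the notebook.

```python
import numpy as np
from sklearn.metrics import roc_auc_score


def bootstrap_auc_ci(y_true, y_score, n_boot=1000, alpha=0.05, seed=42):
    """Percentile bootstrap confidence interval for AUC-ROC."""
    rng = np.random.default_rng(seed)
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    n = len(y_true)
    aucs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample rows with replacement
        if len(np.unique(y_true[idx])) < 2:
            continue  # AUC is undefined when only one class is drawn
        aucs.append(roc_auc_score(y_true[idx], y_score[idx]))
    lo, hi = np.quantile(aucs, [alpha / 2, 1 - alpha / 2])
    return lo, hi


# Toy labels and scores standing in for y_test and predicted probabilities.
rng = np.random.default_rng(0)
y_toy = rng.integers(0, 2, size=300)
score_toy = 0.3 * y_toy + rng.random(300)

lo, hi = bootstrap_auc_ci(y_toy, score_toy, n_boot=200)
print(f"95% bootstrap CI for AUC: [{lo:.3f}, {hi:.3f}]")
```

In the notebook this could be called with `y_test` and each model's predicted probabilities. If the intervals are much wider than the differences between models, the ranking by AUC should not be read as meaningful.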

3. Class 1 (Minority Class) Performance:

**Precision**:
- Random Forest: 0.68
- Gradient Boosting: 0.58
- XGBoost: 0.59

Random Forest has the highest precision for class 1, meaning it is the best at minimizing false positives for this class. Both Gradient Boosting and XGBoost have lower precision, meaning they predict more false positives for the minority class.

**Recall**:
- Random Forest: 0.15
- Gradient Boosting: 0.24
- XGBoost: 0.23

In terms of recall, both Gradient Boosting and XGBoost outperform Random Forest, detecting over 50% more of the actual class 1 instances in relative terms (0.24 and 0.23 versus 0.15). In absolute terms the improvement is modest: all three models still miss more than three quarters of the class 1 instances.

**F1-Score**:
- Random Forest: 0.24
- Gradient Boosting: 0.34
- XGBoost: 0.33

The f1-score, which balances precision and recall, is highest for Gradient Boosting, followed closely by XGBoost. Both models handle the minority class better than Random Forest, though all struggle to achieve a strong balance between precision and recall for class 1.
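
The F1-score is the harmonic mean of precision and recall, F1 = 2PR / (P + R). The short sketch below applies it to the class 1 values listed above; `f1_from` is an illustrative helper, and since the inputs are rounded to two decimals the results may differ from the reported scores in the last digit.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall) for class 1, copied from the classification reports.
reported = {
    "Random Forest": (0.68, 0.15),
    "Gradient Boosting": (0.58, 0.24),
    "XGBoost": (0.59, 0.23),
}

f1_scores = {name: f1_from(p, r) for name, (p, r) in reported.items()}

for name, score in f1_scores.items():
    print(f"{name}: F1 = {score:.3f}")
```

Because the harmonic mean is pulled towards the smaller of its two inputs, Random Forest’s high precision cannot compensate for its low recall.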

4. Macro Average Performance:

**Random Forest**:

- Macro Precision: 0.75
- Macro Recall: 0.57
- Macro F1-Score: 0.57

**Gradient Boosting**:

- Macro Precision: 0.71
- Macro Recall: 0.60
- Macro F1-Score: 0.62

**XGBoost**:

- Macro Precision: 0.71
- Macro Recall: 0.60
- Macro F1-Score: 0.62

Both Gradient Boosting and XGBoost have nearly identical macro average performance, slightly outperforming Random Forest in recall and f1-score. This reflects their improved ability to handle both classes, particularly the minority class, compared to Random Forest.

5. Class Imbalance Handling:
- **Random Forest** favors the majority class heavily, resulting in a high recall for class 0 but poor recall for class 1. This model minimizes false positives for class 1 but misses most instances of the minority class (very low recall).

- **Gradient Boosting** improves recall for class 1 (0.24) at the cost of lower precision (0.58 versus 0.68 for Random Forest). It is more balanced than Random Forest but still far from addressing the class imbalance: recall remains 0.96 for class 0 against 0.24 for class 1.

- Similar to Gradient Boosting, **XGBoost** improves recall for class 1 (0.23) with comparable precision (0.59). It performs almost identically to Gradient Boosting, detecting more minority class instances than Random Forest while sacrificing some precision.
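
None of the models above was told about the class imbalance during training. A common option, sketched here under the assumption of a synthetic stand-in dataset, is to reweight the classes: `RandomForestClassifier` accepts `class_weight="balanced"`, and `GradientBoostingClassifier` can take per-row weights through `fit(..., sample_weight=...)`. XGBoost exposes a similar idea through its `scale_pos_weight` parameter, which is not shown here. Whether reweighting helps on the loan data is untested.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.utils.class_weight import compute_sample_weight

# Synthetic, imbalanced stand-in for the loan data (assumption).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, weights=[0.8, 0.2], random_state=1
)

# Random Forest: weight classes inversely to their frequency.
rf_weighted = RandomForestClassifier(
    n_estimators=50, class_weight="balanced", random_state=1
).fit(X_demo, y_demo)

# Gradient Boosting has no class_weight argument, so pass per-row weights.
sample_weight = compute_sample_weight(class_weight="balanced", y=y_demo)
gb_weighted = GradientBoostingClassifier(n_estimators=50, random_state=1).fit(
    X_demo, y_demo, sample_weight=sample_weight
)

# With "balanced" weights, both classes carry the same total weight.
weight_per_class = [sample_weight[y_demo == c].sum() for c in (0, 1)]
print(weight_per_class)
```

Reweighting usually raises minority recall at the expense of precision, so it is best compared against the unweighted models on a precision-recall basis rather than on accuracy.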

**Conclusion and Forecasting Analysis**:

Random Forest performs best in terms of precision for class 1, minimizing false positives, but has the weakest recall, leading to the lowest f1-score for class 1.

Gradient Boosting offers a better balance, with improved recall for class 1 and a higher F1-score than Random Forest. Its precision is lower, but it captures more of the minority class instances.

XGBoost performs nearly identically to Gradient Boosting. It also balances precision and recall for class 1 better than Random Forest, with higher recall (0.23 versus 0.15) at about the same precision as Gradient Boosting.

In summary, Gradient Boosting and XGBoost are both superior to Random Forest for minority class detection. Since the class imbalance calls for a more balanced precision-recall tradeoff, either of them is preferable to Random Forest, although recall for class 1 remains low for all three models.

# Exporting the best model

In [25]:
model = best_gb_model
joblib.dump(model, "gradient_boosting_model.pkl")

['gradient_boosting_model.pkl']
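
For completeness, a minimal sketch of the round trip: a model saved with `joblib.dump` can be restored with `joblib.load` and used for prediction. The toy model and the temporary file path below are assumptions for illustration; in the project, the path would be `gradient_boosting_model.pkl`.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Small toy model standing in for best_gb_model (assumption).
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
model_demo = GradientBoostingClassifier(n_estimators=20, random_state=0).fit(
    X_demo, y_demo
)

# Save to a temporary file, then load it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.pkl")
    joblib.dump(model_demo, path)
    reloaded = joblib.load(path)

# The reloaded model should reproduce the original predictions exactly.
same_predictions = np.array_equal(
    model_demo.predict_proba(X_demo), reloaded.predict_proba(X_demo)
)
print(same_predictions)
```

Pickled scikit-learn models are not guaranteed to load across library versions, so it is worth recording the scikit-learn version alongside the file.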