# Home Credit Modeling
### Chris McTeague
### IS 6812
### 11/3/2024

# Business Problem Statement

Many individuals with limited or no credit history struggle to obtain loans and often fall prey to unreliable lenders. Home Credit Group aims to broaden financial inclusion by providing safe loans to these individuals. This project focuses on improving Home Credit's predictions of clients' repayment ability, allowing the company to reach more people without the risk of skyrocketing default rates.

In [None]:
!pip install category_encoders
!pip install xgboost
!pip install hyperopt

Collecting category_encoders
  Downloading category_encoders-2.6.4-py2.py3-none-any.whl.metadata (8.0 kB)
Downloading category_encoders-2.6.4-py2.py3-none-any.whl (82 kB)
Installing collected packages: category_encoders
Successfully installed category_encoders-2.6.4


In [None]:
#importing libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import category_encoders as ce
import xgboost as xgb
from scipy import stats
from scipy.stats import zscore, uniform, randint
from statsmodels.stats.outliers_influence import variance_inflation_factor
from hyperopt import fmin, tpe, hp, STATUS_OK, Trials
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import (cross_val_score, GridSearchCV, RandomizedSearchCV,
                                     StratifiedKFold, train_test_split)
from sklearn.metrics import (accuracy_score, precision_score, recall_score, f1_score,
                             confusion_matrix, ConfusionMatrixDisplay, classification_report,
                             roc_auc_score, roc_curve, auc,
                             precision_recall_curve, average_precision_score)
from sklearn.utils import class_weight
from sklearn.utils.class_weight import compute_class_weight




# Data Cleaning

In [None]:
#file paths
train_data_path = "/content/application_train.csv"
test_data_path = "/content/application_test.csv"

#creating datasets
train_data = pd.read_csv("application_train.csv")
test_data = pd.read_csv("application_test.csv")


Upon reviewing the dataset, we notice that several columns represent the same information but are calculated using different statistical measures, such as the average, mode, or median. To prevent multicollinearity and simplify the model, we will retain only the columns with average calculations. This will ensure we avoid redundant features while still capturing the key trends in the data.

In [None]:
#dropping the redundant _MODE and _MEDI columns (their _AVG counterparts are kept)
redundant_cols = ['APARTMENTS_MODE','BASEMENTAREA_MODE','YEARS_BEGINEXPLUATATION_MODE','YEARS_BUILD_MODE','COMMONAREA_MODE','ELEVATORS_MODE',
                  'ENTRANCES_MODE','FLOORSMAX_MODE','FLOORSMIN_MODE','LANDAREA_MODE','LIVINGAPARTMENTS_MODE','LIVINGAREA_MODE','NONLIVINGAPARTMENTS_MODE',
                  'NONLIVINGAREA_MODE','APARTMENTS_MEDI','BASEMENTAREA_MEDI','YEARS_BEGINEXPLUATATION_MEDI','YEARS_BUILD_MEDI','COMMONAREA_MEDI',
                  'ELEVATORS_MEDI','ENTRANCES_MEDI','FLOORSMAX_MEDI','FLOORSMIN_MEDI','LANDAREA_MEDI','LIVINGAPARTMENTS_MEDI','LIVINGAREA_MEDI',
                  'NONLIVINGAPARTMENTS_MEDI','NONLIVINGAREA_MEDI']

#apply the same drop to both datasets so their columns stay aligned
train_data = train_data.drop(columns=redundant_cols)
test_data = test_data.drop(columns=redundant_cols)
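The list of redundant columns could also be derived from the column names instead of being typed out. The following is a minimal sketch, assuming every redundant column shares its stem with an `_AVG` column; the helper `redundant_stat_columns` and the toy frame are hypothetical and only the column names matter.

```python
import pandas as pd

def redundant_stat_columns(columns):
    """Return the _MODE/_MEDI columns whose stem also has an _AVG column."""
    columns = list(columns)
    stems = [c[: -len("_AVG")] for c in columns if c.endswith("_AVG")]
    candidates = [f"{stem}{suffix}" for stem in stems for suffix in ("_MODE", "_MEDI")]
    present = set(columns)
    return [c for c in candidates if c in present]

# Toy frame with hypothetical values.
toy = pd.DataFrame({
    "APARTMENTS_AVG": [0.1], "APARTMENTS_MODE": [0.1], "APARTMENTS_MEDI": [0.1],
    "HOUSETYPE_MODE": ["block of flats"],  # no _AVG counterpart, so it is kept
})
to_drop = redundant_stat_columns(toy.columns)
toy_reduced = toy.drop(columns=to_drop)
print(to_drop)
print(list(toy_reduced.columns))
```

Matching on the `_AVG` stem rather than on the suffix alone keeps categorical columns such as `HOUSETYPE_MODE`, which end in `_MODE` but have no average counterpart.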

At this stage, we remove data points that are extreme outliers or appear to be clear data-entry errors. For instance, the DAYS_EMPLOYED field contains the value 365,243 days, which is roughly 1,000 years and an obvious anomaly. By eliminating such erroneous data, we aim to improve the overall quality of the dataset and keep these incorrect entries from skewing our model.

In [None]:
#removing extreme outliers and potential errors that we noticed from the dataset
train_data = train_data[train_data['DAYS_EMPLOYED'] != 365243]
train_data = train_data[train_data['OBS_30_CNT_SOCIAL_CIRCLE'] != 348]
train_data = train_data[train_data['DEF_30_CNT_SOCIAL_CIRCLE'] != 34]
train_data = train_data[train_data['OBS_60_CNT_SOCIAL_CIRCLE'] != 344]
train_data = train_data[train_data['DEF_60_CNT_SOCIAL_CIRCLE'] != 24]
train_data = train_data[train_data['AMT_REQ_CREDIT_BUREAU_QRT'] != 264]
train_data = train_data[train_data['AMT_REQ_CREDIT_BUREAU_QRT'] != 19]
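Before dropping rows, it can help to check how many of them carry the anomalous value, since the filter removes the whole record. A small sketch with hypothetical DAYS_EMPLOYED values (the series below is made up for illustration):

```python
import pandas as pd

# Hypothetical DAYS_EMPLOYED values, including the 365243 anomaly.
days_employed = pd.Series([-1200, -350, 365243, -4800, 365243], name="DAYS_EMPLOYED")

sentinel = 365243
sentinel_share = (days_employed == sentinel).mean()   # fraction of rows with the anomaly
sentinel_years = sentinel / 365                       # what the value would mean in years

# Same filter as in the cell above, applied to the toy series.
cleaned = days_employed[days_employed != sentinel]
print(f"share flagged: {sentinel_share:.2f}, implied years: {sentinel_years:.0f}")
print(cleaned.tolist())
```

If the flagged share is large, replacing the value with a missing-value marker instead of dropping the rows may be worth considering.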

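Below we impute missing values in AMT_ANNUITY and AMT_GOODS_PRICE with the column mean, then bin age, income, credit amount, and number of children into categories.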

In [None]:
#replace NAs in AMT_ANNUITY with the column mean (assignment avoids the chained inplace fillna)
train_data['AMT_ANNUITY'] = train_data['AMT_ANNUITY'].fillna(train_data['AMT_ANNUITY'].mean())
test_data['AMT_ANNUITY'] = test_data['AMT_ANNUITY'].fillna(test_data['AMT_ANNUITY'].mean())

#replace NAs in AMT_GOODS_PRICE with the column mean
train_data['AMT_GOODS_PRICE'] = train_data['AMT_GOODS_PRICE'].fillna(train_data['AMT_GOODS_PRICE'].mean())
test_data['AMT_GOODS_PRICE'] = test_data['AMT_GOODS_PRICE'].fillna(test_data['AMT_GOODS_PRICE'].mean())

#convert DAYS_BIRTH from days to age in years
train_data['DAYS_BIRTH'] = abs(train_data['DAYS_BIRTH']) / 365
test_data['DAYS_BIRTH'] = abs(test_data['DAYS_BIRTH']) / 365

#binning ages into categories
age_bins = [18, 25, 35, 45, 60, np.inf]
age_labels = ['18-25', '26-35', '36-45', '46-60', '60+']
train_data['DAYS_BIRTH'] = pd.cut(train_data['DAYS_BIRTH'], bins=age_bins, labels=age_labels, right=False)
test_data['DAYS_BIRTH'] = pd.cut(test_data['DAYS_BIRTH'], bins=age_bins, labels=age_labels, right=False)

#binning AMT_INCOME_TOTAL into categories
income_bins = [0, 50000, 100000, 150000, 300000, np.inf]
income_labels = ['Low income', 'Lower-middle income', 'Middle income', 'Upper-middle income', 'High income']
train_data['AMT_INCOME_TOTAL'] = pd.cut(train_data['AMT_INCOME_TOTAL'], bins=income_bins, labels=income_labels, right=False)
test_data['AMT_INCOME_TOTAL'] = pd.cut(test_data['AMT_INCOME_TOTAL'], bins=income_bins, labels=income_labels, right=False)

#binning AMT_CREDIT into categories
credit_bins = [0, 100000, 500000, 1000000, np.inf]
credit_labels = ['Small loans', 'Medium loans', 'Large loans', 'Very large loans']
train_data['AMT_CREDIT'] = pd.cut(train_data['AMT_CREDIT'], bins=credit_bins, labels=credit_labels, right=False)
test_data['AMT_CREDIT'] = pd.cut(test_data['AMT_CREDIT'], bins=credit_bins, labels=credit_labels, right=False)

#binning CNT_CHILDREN into categories
children_bins = [0, 1, 2, 3, np.inf]
children_labels = ['Single', '2 family members', '3 family members', '4 or more family members']
train_data['CNT_CHILDREN'] = pd.cut(train_data['CNT_CHILDREN'], bins=children_bins, labels=children_labels, right=False)
test_data['CNT_CHILDREN'] = pd.cut(test_data['CNT_CHILDREN'], bins=children_bins, labels=children_labels, right=False)
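Because the bins use `right=False`, each interval includes its left edge and excludes its right edge, so the labels are approximate at the boundaries. A short sketch with hypothetical ages placed on the bin edges illustrates where they land:

```python
import numpy as np
import pandas as pd

age_bins = [18, 25, 35, 45, 60, np.inf]
age_labels = ['18-25', '26-35', '36-45', '46-60', '60+']

# Hypothetical ages chosen to sit on or near the bin edges.
ages = pd.Series([18.0, 24.9, 25.0, 60.0])

# right=False makes the intervals [18, 25), [25, 35), and so on.
binned = pd.cut(ages, bins=age_bins, labels=age_labels, right=False)
print(binned.tolist())
```

An applicant aged exactly 25 falls in the bin labeled `26-35`, and one aged exactly 60 falls in `60+`.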


As the final step in our data cleaning process, we apply a few last adjustments. We convert the Y/N flag columns into 1s and 0s for consistency. Missing values in numeric columns are replaced with the column mean, and missing values in categorical columns are filled with the placeholder "XNA". We do not expect this placeholder to have a significant effect on the models, and it ensures that no values are missing when we proceed with the analysis.

In [None]:
#replacing binary variables with 1s and 0s
train_data['FLAG_OWN_REALTY'] = train_data['FLAG_OWN_REALTY'].replace({'Y': 1, 'N': 0})

train_data['FLAG_OWN_CAR'] = train_data['FLAG_OWN_CAR'].replace({'Y': 1, 'N': 0})

#filling in numeric columns that have NAs with the mean of that column
numeric_cols = train_data.select_dtypes(include=np.number).columns
train_data[numeric_cols] = train_data[numeric_cols].fillna(train_data[numeric_cols].mean())

columns_to_fill = ['NAME_TYPE_SUITE', 'OCCUPATION_TYPE', 'FONDKAPREMONT_MODE',
                   'HOUSETYPE_MODE', 'WALLSMATERIAL_MODE', 'EMERGENCYSTATE_MODE']

#filling categorical columns with 'XNA' for blank values
train_data[columns_to_fill] = train_data[columns_to_fill].fillna('XNA')

#resetting the index after the row filters above
train_data = train_data.reset_index(drop=True)
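The cells above impute each dataset from its own statistics. A common alternative is to compute the means on the training data only and reuse them for the test data, so both are filled with the same values. A minimal sketch on hypothetical miniature frames (the values and the `train`/`test` names are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical miniature train/test frames.
train = pd.DataFrame({"AMT_ANNUITY": [100.0, np.nan, 300.0],
                      "OCCUPATION_TYPE": ["Drivers", None, "Cooks"]})
test = pd.DataFrame({"AMT_ANNUITY": [np.nan, 500.0],
                     "OCCUPATION_TYPE": [None, "Drivers"]})

numeric_cols = train.select_dtypes(include=np.number).columns
train_means = train[numeric_cols].mean()   # statistics learned from train only

train[numeric_cols] = train[numeric_cols].fillna(train_means)
test[numeric_cols] = test[numeric_cols].fillna(train_means)   # reuse the train means

# Fill missing categories with the same placeholder in both frames.
for frame in (train, test):
    frame["OCCUPATION_TYPE"] = frame["OCCUPATION_TYPE"].fillna("XNA")

print(train)
print(test)
```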



# Data Processing

To prepare our data for use in machine learning models, we need to encode the categorical variables appropriately. We employ three different encoding techniques:

One-Hot Encoding: This is used for nominal variables (those without a natural order) to create binary columns for each category.

Label Encoding: Applied to ordinal variables, where the categories have an inherent order, converting them into integer labels. Note that scikit-learn's LabelEncoder assigns integers in sorted (alphabetical) label order, so the codes match the natural ranking only when the two orders coincide.

Target Encoding: This is used for nominal variables with a high cardinality (i.e., too many unique categories). In this case, we replace each category with the mean value of the target variable for that category.

By applying these encoding techniques, we ensure our categorical data is represented in a format suitable for different machine learning models, while also reducing dimensionality where necessary.

In [None]:
le = LabelEncoder()

#one-hot encoding for the low-cardinality nominal variables
train_encoded = pd.get_dummies(train_data, columns=['CODE_GENDER', 'NAME_CONTRACT_TYPE'], drop_first=True)

#label encoding (LabelEncoder assigns integers in sorted label order)
train_encoded['CNT_CHILDREN'] = le.fit_transform(train_data['CNT_CHILDREN'])
train_encoded['AMT_INCOME_TOTAL'] = le.fit_transform(train_data['AMT_INCOME_TOTAL'])
train_encoded['AMT_CREDIT'] = le.fit_transform(train_data['AMT_CREDIT'])
train_encoded['DAYS_BIRTH'] = le.fit_transform(train_data['DAYS_BIRTH'])
train_encoded['NAME_EDUCATION_TYPE'] = le.fit_transform(train_data['NAME_EDUCATION_TYPE'])
train_encoded['WEEKDAY_APPR_PROCESS_START'] = le.fit_transform(train_data['WEEKDAY_APPR_PROCESS_START'])

#target encoding for the high-cardinality nominal variables
#note: the encoder is fit on all rows before the train/test split, so held-out targets inform the encoding
columns_to_encode = ['OCCUPATION_TYPE', 'NAME_HOUSING_TYPE','NAME_FAMILY_STATUS','EMERGENCYSTATE_MODE','WALLSMATERIAL_MODE','HOUSETYPE_MODE',
                     'FONDKAPREMONT_MODE','ORGANIZATION_TYPE','NAME_INCOME_TYPE','NAME_TYPE_SUITE',]
encoder = ce.TargetEncoder(cols=columns_to_encode)

#drop the raw categorical columns, then attach their target-encoded versions
train_encoded = train_encoded.drop(columns=columns_to_encode)

df_encoded = encoder.fit_transform(train_data[columns_to_encode], train_data['TARGET'])

train_encoded = pd.concat([df_encoded, train_encoded], axis=1)
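For the binned variables, the intended order is known from the bin labels, so an ordered categorical is one way to obtain codes that follow it. The sketch below contrasts that with `LabelEncoder` on a few hypothetical income values; it is an illustration of the difference, not the encoding used for the results in this notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

income_labels = ['Low income', 'Lower-middle income', 'Middle income',
                 'Upper-middle income', 'High income']

# Hypothetical binned incomes.
incomes = pd.Series(['Low income', 'High income', 'Middle income'])

# LabelEncoder numbers the classes in sorted (alphabetical) order.
alphabetical = LabelEncoder().fit_transform(incomes)

# An ordered Categorical numbers them by their position in `categories`.
ordered = pd.Categorical(incomes, categories=income_labels, ordered=True).codes

print(alphabetical.tolist(), ordered.tolist())
```

With `LabelEncoder`, "High income" receives the smallest code because it sorts first alphabetically, whereas the ordered categorical gives it the largest.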

# ML Implementation

In [None]:
#creating test and training sets of our features and target variable
X = train_encoded.drop(columns=['TARGET','SK_ID_CURR'])
y = train_encoded['TARGET']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
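With a heavily imbalanced target, a plain random split can leave the two partitions with slightly different default rates. Passing `stratify` to `train_test_split` keeps the class proportions equal. A minimal sketch on hypothetical data (the 8% positive rate and the `*_toy` names are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 1,000 rows with an 8% positive rate.
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(1000, 3))
y_toy = np.zeros(1000, dtype=int)
y_toy[:80] = 1

# stratify=y_toy preserves the class ratio in both partitions.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)
print(y_tr.mean(), y_te.mean())
```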

## XGBoost

In [None]:
# format data set appropriately for XGBoost modeling (minority class weighted 10x)
weights_train = np.where(y_train == 1, 10, 1)
dtrain = xgb.DMatrix(X_train, label=y_train, weight = weights_train)
dtest = xgb.DMatrix(X_test, label=y_test)
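The weight of 10 for the minority class is close to the ratio of majority to minority rows in this dataset (roughly 10.7 to 1 in the test-set support counts reported later). The sketch below shows how such a weight could be derived from the labels themselves; the toy labels are hypothetical.

```python
import numpy as np

# Hypothetical labels with a 10:1 imbalance.
y_toy = np.array([0] * 100 + [1] * 10)

neg, pos = np.bincount(y_toy)
pos_weight = neg / pos                        # weight that balances total class mass
weights = np.where(y_toy == 1, pos_weight, 1.0)

# After weighting, both classes contribute the same total weight.
print(pos_weight, weights[y_toy == 1].sum(), weights[y_toy == 0].sum())
```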

In [None]:
# define custom evaluation function: a false negative costs 5x a false positive
def custom_cost_sensitive_eval(preds, dtrain):
    labels = dtrain.get_label()
    cost_false_negative = 5.0
    cost_false_positive = 1.0
    preds_binary = (preds >= 0.5).astype(int)

    false_negatives = np.sum((labels == 1) & (preds_binary == 0))
    false_positives = np.sum((labels == 0) & (preds_binary == 1))

    custom_cost = (cost_false_negative * false_negatives) + (cost_false_positive * false_positives)
    return "cost_sensitive_error", custom_cost, False
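To make the cost calculation concrete, here is a NumPy-only variant of the same logic applied to five hypothetical predictions. The function name `cost_sensitive_error` and its keyword arguments are illustrative and take plain arrays rather than a DMatrix.

```python
import numpy as np

def cost_sensitive_error(labels, preds, cost_fn=5.0, cost_fp=1.0, threshold=0.5):
    """Total misclassification cost at a fixed threshold (NumPy-only variant)."""
    labels = np.asarray(labels)
    preds_binary = (np.asarray(preds) >= threshold).astype(int)
    false_negatives = np.sum((labels == 1) & (preds_binary == 0))
    false_positives = np.sum((labels == 0) & (preds_binary == 1))
    return cost_fn * false_negatives + cost_fp * false_positives

# Hypothetical case: one missed default and two false alarms.
labels = [1, 1, 0, 0, 0]
preds = [0.9, 0.2, 0.7, 0.1, 0.6]
total_cost = cost_sensitive_error(labels, preds)
print(total_cost)
```

Here the single false negative contributes 5 and the two false positives contribute 1 each, for a total cost of 7.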

In [None]:
# define hyperparams that will be tested
param_dist = {
    'max_depth': hp.randint('max_depth', 3, 10),
    'learning_rate': hp.uniform('learning_rate', 0.01, 0.3),
    'n_estimators': hp.randint('n_estimators', 50, 500),
    'subsample': hp.uniform('subsample', 0.5, 1.0),
    'colsample_bytree': hp.uniform('colsample_bytree', 0.5, 1.0),
    'gamma': hp.uniform('gamma', 0, 5),
    'min_child_weight': hp.randint('min_child_weight', 1, 10),
}


In [None]:
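# define objective function for hyperopt: train with the sampled params and return the cost-sensitive error
# note: 'n_estimators' is sampled in param_dist but not passed to xgb.train, which uses a fixed num_boost_round=100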
def objective_function(params):
    # Convert the sampled params to a dictionary compatible with XGBoost
    xgb_params = {
        'max_depth': int(params['max_depth']),
        'learning_rate': params['learning_rate'],
        'subsample': params['subsample'],
        'colsample_bytree': params['colsample_bytree'],
        'gamma': params['gamma'],
        'min_child_weight': int(params['min_child_weight']),
        'objective': 'binary:logistic',
        'eval_metric': 'auc',
    }

    # Create DMatrix for training
    weights_train = np.where(y_train == 1, 9, 1)
    dtrain = xgb.DMatrix(X_train, label=y_train, weight=weights_train)

    # Train the model
    model = xgb.train(xgb_params, dtrain, num_boost_round=100)

    # Evaluate on the training set for validation purposes
    preds = model.predict(dtrain)

    # Calculate custom cost-sensitive error
    cost_sensitive_error = custom_cost_sensitive_eval(preds, dtrain)[1]

    return {'loss': cost_sensitive_error, 'status': STATUS_OK}

In [None]:
#trials = Trials()
#best_hyperparams = fmin(
    #fn=objective_function,
    #space=param_dist,
    #algo=tpe.suggest,    # commented out for purpose of knitting final HTML, results shown below.
    #max_evals=100,
    #trials=trials
#)

#print("Best Hyperparameters:", best_hyperparams)

Best Hyperparameters: {'colsample_bytree': 0.9698674464932832, 'gamma': 1.6376259636304558, 'learning_rate': 0.299172816420403, 'max_depth': 9, 'min_child_weight': 1, 'n_estimators': 390, 'subsample': 0.9592765023912387}

In [None]:
# hard code final param set for knitting purposes
final_xgb_params = {
    'max_depth': int(9),
    'learning_rate': 0.299172816420403,
    'subsample': 0.9592765023912387,
    'colsample_bytree': 0.9698674464932832,
    'gamma': 1.6376259636304558,
    'min_child_weight': int(1),
    'objective': 'binary:logistic',
    'eval_metric': 'auc',
}

In [None]:
# create weighted training set
weights_train = np.where(y_train == 1, 10, 1)
dtrain_final = xgb.DMatrix(X_train, label=y_train, weight=weights_train)

# Fit the final model
final_model1 = xgb.train(final_xgb_params, dtrain_final, num_boost_round=100)

# Evaluation on Test Set
dtest_final = xgb.DMatrix(X_test, label=y_test)
preds_final1 = final_model1.predict(dtest_final)

In [None]:
# Get the recall, precision, and thresholds for different decision boundaries
precision, recall, thresholds = precision_recall_curve(y_test, preds_final1)

# Define desired recall
desired_recall = 0.7

# Find the threshold that produces the closest recall to the desired recall
# (recall has one more entry than thresholds, so the final point is excluded)
closest_idx = np.argmin(np.abs(recall[:-1] - desired_recall))
optimal_threshold = thresholds[closest_idx]

print(f"Optimal Threshold for Recall {desired_recall}: {optimal_threshold:.2f}")

# Use this threshold to make new predictions
preds_custom_binary1 = (preds_final1 >= optimal_threshold).astype(int)
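The threshold search can be checked on synthetic scores. The sketch below uses hypothetical data and confirms that the chosen threshold reproduces the targeted recall when applied back to the scores; the score distribution is an assumption made only for illustration.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, recall_score

# Hypothetical scores: positives tend to score higher than negatives.
rng = np.random.default_rng(0)
y_true = np.array([0] * 900 + [1] * 100)
scores = np.clip(rng.normal(loc=0.3 + 0.3 * y_true, scale=0.15), 0, 1)

precision, recall, thresholds = precision_recall_curve(y_true, scores)

# `recall` has one more entry than `thresholds`; drop the final (recall = 0) point.
desired_recall = 0.7
idx = np.argmin(np.abs(recall[:-1] - desired_recall))
threshold = thresholds[idx]

# Applying the threshold with >= reproduces the recall at that index.
achieved = recall_score(y_true, (scores >= threshold).astype(int))
print(f"threshold={threshold:.2f}, achieved recall={achieved:.2f}")
```

Choosing the threshold on the same data used for the final evaluation makes the reported recall optimistic, so a separate validation split is a safer place to tune it.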

Optimal Threshold for Recall 0.7: 0.22


In [None]:
# create confusion matrix for results
conf_matrix_custom = confusion_matrix(y_test, preds_custom_binary1)
print("Confusion Matrix with Custom Threshold:")
print(conf_matrix_custom)

# Calculate accuracy
accuracy_custom = accuracy_score(y_test, preds_custom_binary1)
print(f"Accuracy with Custom Threshold: {accuracy_custom:.4f}")

# Get classification report for precision, recall, and F1 score
class_report_custom = classification_report(y_test, preds_custom_binary1, target_names=['Majority Class', 'Minority Class'])
print("Classification Report with Custom Threshold:")
print(class_report_custom)

Confusion Matrix with Custom Threshold:
[[27566 18545]
 [ 1295  3021]]
Accuracy with Custom Threshold: 0.6066
Classification Report with Custom Threshold:
                precision    recall  f1-score   support

Majority Class       0.96      0.60      0.74     46111
Minority Class       0.14      0.70      0.23      4316

      accuracy                           0.61     50427
     macro avg       0.55      0.65      0.48     50427
  weighted avg       0.89      0.61      0.69     50427
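The headline numbers in the report can be recomputed directly from the confusion matrix, which is a useful sanity check on how the custom threshold trades precision for recall:

```python
import numpy as np

# Confusion matrix reported above: rows are actual classes, columns are predictions.
cm = np.array([[27566, 18545],
               [1295, 3021]])
tn, fp, fn, tp = cm.ravel()

recall = tp / (tp + fn)          # share of actual minority cases that were flagged
precision = tp / (tp + fp)       # share of flagged cases that were actually minority
accuracy = (tp + tn) / cm.sum()
print(f"recall={recall:.2f}, precision={precision:.2f}, accuracy={accuracy:.4f}")
```

Reaching 0.70 recall at this threshold comes with 18,545 false positives, which is why precision on the minority class is only about 0.14.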



## XGBoost with SMOTE

In [None]:
# initiate SMOTE for unbalanced data
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)

# create DMatrix from the SMOTE-resampled training data
dtrain_resampled = xgb.DMatrix(X_resampled, label=y_resampled)

# train the model on the resampled data
model_SMOTE = xgb.train(final_xgb_params, dtrain_resampled, num_boost_round=100)

# create test set
dtest = xgb.DMatrix(X_test, label=y_test)

# predicted probabilities from the SMOTE model
preds_final_SMOTE = model_SMOTE.predict(dtest)

# convert the SMOTE model's probabilities to binary predictions at a 0.46 threshold
# note: thresholding `preds_final` here would score a different model;
# the recorded confusion matrix below may predate this fix
preds_binary_SMOTE = (preds_final_SMOTE >= .46).astype(int)

# confusion matrix and classification report
cm = confusion_matrix(y_test, preds_binary_SMOTE)
report = classification_report(y_test, preds_binary_SMOTE)

# ROC AUC from the predicted probabilities
fpr, tpr, thresholds = roc_curve(y_test, preds_final_SMOTE)
roc_auc = auc(fpr, tpr)

print(f'ROC AUC: {roc_auc:.4f}')
print("Confusion Matrix:")
print(cm)
print("\nClassification Report:")
print(report)


ROC AUC: 0.7193
Confusion Matrix:
[[37986  8125]
 [ 2383  1933]]

Classification Report:
              precision    recall  f1-score   support

           0       0.94      0.82      0.88     46111
           1       0.19      0.45      0.27      4316

    accuracy                           0.79     50427
   macro avg       0.57      0.64      0.57     50427
weighted avg       0.88      0.79      0.83     50427



## XGBoost Classifier

In [None]:
# initialize XGBoost Model
#xgb_model = xgb.XGBClassifier(objective='binary:logistic', eval_metric='auc')

#param_dist = {
   # 'n_estimators': [300,400,500],
    #'max_depth': [5,6,7],
    #'learning_rate': [0.005, 0.05, 0.01, 0.1, 0.2],
    #'subsample': [0.6, 0.8, 1.0,0.5,0.7],
    #'colsample_bytree': [0.6, 0.8, 1.0],
    #'gamma': [0, 0.1, 0.2],
    #'min_child_weight': [1, 2, 3],
#}

# Set up random parameter search
#random_search = RandomizedSearchCV(
   # estimator=xgb_model,
    #param_distributions=param_dist,      # commented out for purpose of knitting final HTML, results shown below.
    #n_iter=50,
   # scoring=['roc_auc', 'recall'],
   # refit = 'roc_auc',
   # cv=3,
   # verbose=1,
   # n_jobs=-1,
   # random_state=42
#)

# Fit the model using DataFrame (not DMatrix)
#random_search.fit(X_train, y_train)

# Print results
#print("Best Parameters:", random_search.best_params_)
#print("Best Score:", random_search.best_score_)

Best Hyperparameters: {'subsample': 0.8, 'n_estimators': 400, 'min_child_weight': 1, 'max_depth': 7, 'learning_rate': 0.01, 'gamma': 0, 'colsample_bytree': 0.8}
Best Score: 0.966553380305374

In [None]:

# create data set for analysis
X = train_encoded.drop(columns=['TARGET','SK_ID_CURR'])
y = train_encoded['TARGET']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# weight the minority class 10x (computed after the split so it aligns with y_train)
weights_train = np.where(y_train == 1, 10, 1)

# hard code best set of params for knitting purposes
best_params = {
    'objective': 'binary:logistic',
    'eval_metric': 'auc',
    'subsample': 0.8,
    'n_estimators': 400,
    'min_child_weight': 1,
    'max_depth': 7,
    'learning_rate': 0.01,
    'gamma': 0,
    'colsample_bytree': 0.8
}

# Create an XGBoost classifier with the best parameters
xgb_classifier = xgb.XGBClassifier(**best_params)

# Fit the model on the original training data
xgb_classifier.fit(X_train, y_train, sample_weight=weights_train)

# Generate predictions on the test set
y_classifier_preds = xgb_classifier.predict(X_test)

# Calculate and print accuracy
accuracy_score_classifier = accuracy_score(y_test, y_classifier_preds)
print("Test Accuracy:", accuracy_score_classifier)

# Generate and print the classification report
print("Classification Report:\n", classification_report(y_test, y_classifier_preds))

# Generate and print the confusion matrix
conf_matrix = confusion_matrix(y_test, y_classifier_preds)
print("Confusion Matrix:\n", conf_matrix)

Test Accuracy: 0.7216372181569397
Classification Report:
               precision    recall  f1-score   support

           0       0.95      0.73      0.83     46111
           1       0.18      0.63      0.28      4316

    accuracy                           0.72     50427
   macro avg       0.57      0.68      0.55     50427
weighted avg       0.89      0.72      0.78     50427

Confusion Matrix:
 [[33669 12442]
 [ 1595  2721]]


# Feature Expansion and Testing

In [None]:
#aggregating bureau records to one row per applicant (mean of each field)
bureau = pd.read_csv("bureau.csv")
bureau_agg = bureau.groupby('SK_ID_CURR').agg({'DAYS_CREDIT': 'mean', 'CREDIT_DAY_OVERDUE': 'mean', 'CNT_CREDIT_PROLONG' : 'mean',
                                               'AMT_CREDIT_SUM' : 'mean','AMT_CREDIT_SUM_DEBT' : 'mean','AMT_CREDIT_SUM_OVERDUE' : 'mean'})
#prefix the aggregated columns with AVG_ so they are distinguishable after the merge
bureau_agg = bureau_agg.rename(columns={'DAYS_CREDIT':'AVG_DAYS_CREDIT', 'CREDIT_DAY_OVERDUE':'AVG_CREDIT_DAY_OVERDUE','CNT_CREDIT_PROLONG' :'AVG_CNT_CREDIT_PROLONG',
                                        'AMT_CREDIT_SUM':'AVG_AMT_CREDIT_SUM', 'AMT_CREDIT_SUM_DEBT':'AVG_AMT_CREDIT_SUM_DEBT', 'AMT_CREDIT_SUM_OVERDUE':'AVG_AMT_CREDIT_SUM_OVERDUE'})

train_bureau_encoded = pd.merge(train_encoded, bureau_agg, how='left', on='SK_ID_CURR')
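The aggregate-then-merge pattern can be seen on a toy example. The sketch below uses hypothetical miniature tables and pandas named aggregation, which sets the output column names in the same step as the aggregation; it is an alternative spelling of the same idea, not the exact code used above.

```python
import pandas as pd

# Hypothetical miniature versions of the application and bureau tables.
applications = pd.DataFrame({"SK_ID_CURR": [1, 2, 3], "TARGET": [0, 1, 0]})
bureau_toy = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 2],
    "AMT_CREDIT_SUM": [1000.0, 3000.0, 500.0],
    "CREDIT_DAY_OVERDUE": [0, 10, 0],
})

# One row per applicant, with the output names set by named aggregation.
bureau_toy_agg = bureau_toy.groupby("SK_ID_CURR").agg(
    AVG_AMT_CREDIT_SUM=("AMT_CREDIT_SUM", "mean"),
    AVG_CREDIT_DAY_OVERDUE=("CREDIT_DAY_OVERDUE", "mean"),
).reset_index()

# The left join keeps every applicant, including those without bureau records.
merged = applications.merge(bureau_toy_agg, how="left", on="SK_ID_CURR")
print(merged)
```

Applicant 3 has no bureau history, so the left join leaves its aggregated columns as NaN; these are the gaps that the mean imputation in the next cell fills.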

In [None]:
# impute NA values
numeric_cols = train_bureau_encoded.select_dtypes(include=np.number).columns

train_bureau_encoded[numeric_cols] = train_bureau_encoded[numeric_cols].fillna(train_bureau_encoded[numeric_cols].mean())

In [None]:
# create train and test sets with new improved data set
X = train_bureau_encoded.drop(columns=['TARGET','SK_ID_CURR'])
y = train_bureau_encoded['TARGET']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)



## XGBoost with Feature Expansion

In [None]:
# format train and test for XGBoost
weights_train = np.where(y_train == 1, 10, 1)
dtrain = xgb.DMatrix(X_train, label=y_train, weight = weights_train)
dtest = xgb.DMatrix(X_test, label=y_test)

In [None]:
# define function to apply unequal weight to false pos and false negs
def custom_cost_sensitive_eval(preds, dtrain):
    labels = dtrain.get_label()
    cost_false_negative = 5.0
    cost_false_positive = 1.0
    preds_binary = (preds >= 0.5).astype(int)

    false_negatives = np.sum((labels == 1) & (preds_binary == 0))
    false_positives = np.sum((labels == 0) & (preds_binary == 1))

    custom_cost = (cost_false_negative * false_negatives) + (cost_false_positive * false_positives)
    return "cost_sensitive_error", custom_cost, False

In [None]:
# define parameters to test
param_dist = {
    'max_depth': hp.randint('max_depth', 3, 10),
    'learning_rate': hp.uniform('learning_rate', 0.01, 0.3),
    'n_estimators': hp.randint('n_estimators', 50, 500),
    'subsample': hp.uniform('subsample', 0.5, 1.0),
    'colsample_bytree': hp.uniform('colsample_bytree', 0.5, 1.0),
    'gamma': hp.uniform('gamma', 0, 5),
    'min_child_weight': hp.randint('min_child_weight', 1, 10),
}

In [None]:
# define objective function to save best set
def objective_function(params):
    xgb_params = {
        'max_depth': int(params['max_depth']),
        'learning_rate': params['learning_rate'],
        'subsample': params['subsample'],
        'colsample_bytree': params['colsample_bytree'],
        'gamma': params['gamma'],
        'min_child_weight': int(params['min_child_weight']),
        'objective': 'binary:logistic',
        'eval_metric': 'auc',
    }

    # Create DMatrix for training
    weights_train = np.where(y_train == 1, 9, 1)
    dtrain = xgb.DMatrix(X_train, label=y_train, weight=weights_train)

    # Train the model
    model = xgb.train(xgb_params, dtrain, num_boost_round=100)

    # Evaluate on the training set for validation purposes
    preds = model.predict(dtrain)

    # Calculate custom cost-sensitive error
    cost_sensitive_error = custom_cost_sensitive_eval(preds, dtrain)[1]

    return {'loss': cost_sensitive_error, 'status': STATUS_OK}

In [None]:
#trials = Trials()
#best_hyperparams = fmin(
    #fn=objective_function,
    #space=param_dist,          # commented out for purpose of knitting final HTML, results shown below.
    #algo=tpe.suggest,
    #max_evals=100,
    #trials=trials
#)

#print("Best Hyperparameters:", best_hyperparams)

 Best Hyperparameters: {'colsample_bytree': 0.8771449965782033, 'gamma': 0.4993425025409346, 'learning_rate': 0.29480470259015035, 'max_depth': 9, 'min_child_weight': 2, 'n_estimators': 377, 'subsample': 0.9024966949013682}

In [None]:
# hardcode final params for knitting
final_xgb_params = {
    'max_depth': int(9),
    'learning_rate': 0.29480470259015035,
    'subsample': 0.9024966949013682,
    'colsample_bytree':  0.8771449965782033,
    'gamma': 0.4993425025409346,
    'min_child_weight': int(2),
    'objective': 'binary:logistic',
    'eval_metric': 'auc',
}

In [None]:
# create final train set
weights_train = np.where(y_train == 1, 10, 1)
dtrain_final = xgb.DMatrix(X_train, label=y_train, weight=weights_train)

# Fit the final model
final_model = xgb.train(final_xgb_params, dtrain_final, num_boost_round=100)

# Evaluation on Test Set
dtest_final = xgb.DMatrix(X_test, label=y_test)
preds_final = final_model.predict(dtest_final)

In [None]:
# Get the recall, precision, and thresholds for different decision boundaries
precision, recall, thresholds = precision_recall_curve(y_test, preds_final)

# Define desired recall
desired_recall = 0.75

# Find the threshold that produces the closest recall to the desired recall
# (recall has one more entry than thresholds, so the final point is excluded)
closest_idx = np.argmin(np.abs(recall[:-1] - desired_recall))
optimal_threshold = thresholds[closest_idx]

print(f"Optimal Threshold for Recall {desired_recall}: {optimal_threshold:.2f}")

# Use this threshold to make new predictions
preds_custom_binary = (preds_final >= optimal_threshold).astype(int)

Optimal Threshold for Recall 0.75: 0.19


In [None]:
# create conf matrix to display results
conf_matrix_custom = confusion_matrix(y_test, preds_custom_binary)
print("Confusion Matrix with Custom Threshold:")
print(conf_matrix_custom)

# Calculate accuracy
accuracy_custom = accuracy_score(y_test, preds_custom_binary)
print(f"Accuracy with Custom Threshold: {accuracy_custom:.4f}")

# Get classification report for precision, recall, and F1 score
class_report_custom = classification_report(y_test, preds_custom_binary, target_names=['Majority Class', 'Minority Class'])
print("Classification Report with Custom Threshold:")
print(class_report_custom)

Confusion Matrix with Custom Threshold:
[[25042 21069]
 [ 1079  3237]]
Accuracy with Custom Threshold: 0.5608
Classification Report with Custom Threshold:
                precision    recall  f1-score   support

Majority Class       0.96      0.54      0.69     46111
Minority Class       0.13      0.75      0.23      4316

      accuracy                           0.56     50427
     macro avg       0.55      0.65      0.46     50427
  weighted avg       0.89      0.56      0.65     50427



## SMOTE XGBoost with Feature Expansion

In [None]:
# initiate SMOTE
smote = SMOTE(random_state=42)
# new train set
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)
# format for XGBoost
dtrain_resampled = xgb.DMatrix(X_resampled, label=y_resampled)
# train model
model_SMOTE = xgb.train(final_xgb_params, dtrain_resampled, num_boost_round=100)
# new test set
dtest = xgb.DMatrix(X_test, label=y_test)
# predicted probabilities from the SMOTE model
preds_final_SMOTE = model_SMOTE.predict(dtest)
# turn the SMOTE model's probabilities into binary predictions at a 0.46 threshold
# note: thresholding `preds_final` here would score the weighted model instead;
# the recorded confusion matrix below may predate this fix
preds_binary_SMOTE = (preds_final_SMOTE >= .46).astype(int)
# generate confusion matrix to display results
cm = confusion_matrix(y_test, preds_binary_SMOTE)
report = classification_report(y_test, preds_binary_SMOTE)

# ROC AUC from the predicted probabilities
fpr, tpr, thresholds = roc_curve(y_test, preds_final_SMOTE)
roc_auc = auc(fpr, tpr)

print(f'ROC AUC: {roc_auc:.4f}')
print("Confusion Matrix:")
print(cm)
print("\nClassification Report:")
print(report)


ROC AUC: 0.7190
Confusion Matrix:
[[37986  8125]
 [ 2383  1933]]

Classification Report:
              precision    recall  f1-score   support

           0       0.94      0.82      0.88     46111
           1       0.19      0.45      0.27      4316

    accuracy                           0.79     50427
   macro avg       0.57      0.64      0.57     50427
weighted avg       0.88      0.79      0.83     50427



## XGBClassifier with Feature Expansion

In [None]:
# initialize XGBoost Model
#xgb_model = xgb.XGBClassifier(objective='binary:logistic', eval_metric='auc')

#param_dist = {
    #'n_estimators': [300,400,500],
    #'max_depth': [5,6,7],
    #'learning_rate': [0.005, 0.05, 0.01, 0.1, 0.2],
    #'subsample': [0.6, 0.8, 1.0,0.5,0.7],
    #'colsample_bytree': [0.6, 0.8, 1.0],
    #'gamma': [0, 0.1, 0.2],
    #'min_child_weight': [1, 2, 3],
#}

# Set up random parameter search
#random_search = RandomizedSearchCV(
    #estimator=xgb_model,
    #param_distributions=param_dist,
    #n_iter=50,
    #scoring=['roc_auc', 'recall'],   # chunk commented out for final HTML knit
    #refit = 'roc_auc',
    #cv=3,
    #verbose=1,
    #n_jobs=-1,
    #random_state=42
#)

# Fit the model using DataFrame (not DMatrix)
#random_search.fit(X_train, y_train)

# Print results
#print("Best Parameters:", random_search.best_params_)
#print("Best Score:", random_search.best_score_)

 Best Parameters: {'subsample': 0.7, 'n_estimators': 400, 'min_child_weight': 1, 'max_depth': 6, 'learning_rate': 0.05, 'gamma': 0, 'colsample_bytree': 0.6}

In [None]:
# hard code best set of params
best_params = {
    'objective': 'binary:logistic',
    'eval_metric': 'auc',
    'subsample': 0.7,
    'n_estimators': 400,
    'min_child_weight': 1,
    'max_depth': 6,
    'learning_rate': 0.05,
    'gamma': 0,
    'colsample_bytree': 0.6
}

# Create an XGBoost classifier with the best parameters
xgb_classifier = xgb.XGBClassifier(**best_params)

# Fit the model on the original training data
xgb_classifier.fit(X_train, y_train, sample_weight=weights_train)

# Generate predictions on the test set
y_classifier_preds = xgb_classifier.predict(X_test)

# Calculate and print accuracy
accuracy_score_classifier = accuracy_score(y_test, y_classifier_preds)
print("Test Accuracy:", accuracy_score_classifier)

# Generate and print the classification report
print("Classification Report:\n", classification_report(y_test, y_classifier_preds))

# Generate and print the confusion matrix
conf_matrix = confusion_matrix(y_test, y_classifier_preds)
print("Confusion Matrix:\n", conf_matrix)



Test Accuracy: 0.7406944692327523
Classification Report:
               precision    recall  f1-score   support

           0       0.95      0.75      0.84     46111
           1       0.19      0.61      0.29      4316

    accuracy                           0.74     50427
   macro avg       0.57      0.68      0.56     50427
weighted avg       0.89      0.74      0.79     50427

Confusion Matrix:
 [[34729 11382]
 [ 1694  2622]]


## Kaggle Submission Test Data

# Conclusions

In conclusion, despite implementing numerous modeling methods to identify whether a customer will default, none of the models we tested appears to beat the raw accuracy of a simple majority classifier, which is around 91%. The dataset is extremely imbalanced, with far more instances of the majority class than of the minority class, and this clearly weighs on our models. To account for that, we also tried multiple weighting and resampling methods to help the models predict the minority class more accurately, but overall accuracy remained below the majority baseline.

Ultimately, we determined that the best path forward was to focus on training models with improved recall, as that increases the number of individuals correctly identified as members of the minority class, in this case clients at risk of default, whose identification matters most to Home Credit's business. This was a much more successful endeavor, as we were able to create several models with improved recall scores. With further feature expansion, as well as the introduction of interaction and polynomial terms into the modeling process, we believe the results could improve further. Unfortunately, due to limited computational power and the high computational demand of tuning hyperparameters for these complex models, we were unable to test these ideas at this time.
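The roughly 91% majority-class baseline cited above follows from the test-set class counts in the classification reports. A short calculation, using only those reported counts:

```python
# Class counts from the test-set support column reported in the results above.
n_majority = 46111
n_minority = 4316

# A classifier that always predicts the majority class is right on every majority row.
baseline_accuracy = n_majority / (n_majority + n_minority)

# It never flags the minority class, so its minority recall is zero.
baseline_recall_minority = 0 / n_minority

print(f"baseline accuracy={baseline_accuracy:.4f}, minority recall={baseline_recall_minority:.1f}")
```

This is why accuracy alone is a weak yardstick here: the baseline scores about 0.914 while identifying none of the minority-class clients.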