# Table of Contents
- [0. Project Overview & Introduction](#0.-Project-Overview-&-Introduction)
    - [0.1 A discussion of credit default model objectives](#0.1-A-discussion-of-credit-default-model-objectives)
    - [0.2 Brief introduction to this project](#0.2-Brief-introction-to-this-project)
    - [0.3 Importing all relevant modules](#0.3-Importing-all-relevant-modules)
    - [0.4 Setting up the data](#0.4-Setting-up-the-data)
- [1. Building and training our XGBoost default prediction model](#1.-Building-and-training-our-XGBoost-default-prediction-model)
    - [1.1 Training the model](#1.1-Training-the-model)
    - [1.2 Interpreting the preliminary results](#1.2-Intepretting-the-preliminary-results)
    - [1.3 Adjusting the threshold to tilt our model defensively](#1.3-Adjusting-the-threshold-to-tilt-our-model-defensively)
    - [1.4 A brief conclusion...](#1.4-A-brief-conclusion...)
-  [2. Acknowledgements](#2.-Acknowledgements) 

# 0. Project Overview & Introduction

### 0.1 A discussion of credit default model objectives

Suppose you are a business (e.g. a FinTech) handing out loans to private individuals. Understandably, credit default models would be a very important part of your business: they help you decide who you should be giving loans to and who you should be rejecting.

In its simplest form, a credit default model can be thought of as a dual optimisation problem:
1. **risk-mitigation:** you want as few "accepted" credit card clients as possible to end up defaulting.
2. **profit-maximisation:** you want to make as much money as possible from your accepted credit card clients.

In the context of modelling, this can be simply boiled down to:
1. **high precision** in predicting y=0 (non-default). In other words, if the model predicts a client will <u>not</u> default, then we want to be very sure this is the case.
2. **large number of y=0 (non-default) predictions**. Why? More non-default predictions means more customers paying more fees.
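
Both objectives can be measured directly from a vector of true labels and a vector of predictions. The sketch below is illustrative only: the helper name `non_default_metrics` and the toy arrays are assumptions and not part of the original notebook. It computes the precision on class 0 (of the clients we accept, what share really does not default) and the number of accepted clients.

```python
import numpy as np

def non_default_metrics(y_true, y_pred):
    """Return (precision on class 0, number of predicted non-defaults).

    Hypothetical helper for illustration; 0 = non-default, 1 = default.
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    accepted = y_pred == 0                      # clients the model would accept
    n_accepted = int(accepted.sum())
    if n_accepted == 0:
        return float("nan"), 0                  # precision is undefined with no acceptances
    precision_0 = float((y_true[accepted] == 0).mean())
    return precision_0, n_accepted

# toy example: 6 clients, 4 accepted, 3 of those truly non-default
y_true = [0, 0, 1, 0, 1, 0]
y_pred = [0, 0, 0, 1, 1, 0]
precision_0, n_accepted = non_default_metrics(y_true, y_pred)
print(f"precision on y=0: {precision_0:.2f}, accepted: {n_accepted}")
```

A model that accepts nobody trivially avoids defaults, which is why the two numbers have to be read together.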

### 0.2 Brief introction to this project

In a [**previous project**](https://github.com/evgeni-g-georgiev/Bayesian-Credit-Card-Default-Model) we employed a Bayesian Logistic Regression to predict whether credit card users are going to default or not.

In **this project**, I build a simple XGBoost model to tackle the same problem (on the same dataset) and showcase the potential of decision-tree-based models in credit default prediction.

If you haven't done so already, you can follow the above link to read about the full model we previously implemented. Otherwise, here is a summary of the main results of our Bayesian model, which will act as the benchmark we are aiming to beat:

1) we ended up with a threshold-adjusted ***defensive*** model biased towards predicting y=1 (default). This resulted in a trade-off between overall accuracy and precision in predicting 0s (non-default cases).
2) the model had high precision in predicting y=0 (non-default cases): it was correct in 90% of instances.
3) the trade-off was that, due to the defensive nature of the model, it predicted only 423 non-default cases. In other words, over 95% of the test set population was rejected.

**What we will show in this project is that even very simple decision tree models can out-perform fairly complex linear models. In particular, our XGBoost model will do a much better job at the "profit-maximisation" aspect of our problem, while maintaining performance in "risk-mitigation".**




### 0.3 Importing all relevant modules

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import xgboost as xgb
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report, confusion_matrix

### 0.4 Setting up the data

We prepare exactly the same data as was used in our [**Bayesian Model**](https://github.com/evgeni-g-georgiev/Bayesian-Credit-Card-Default-Model).

In particular:
- we keep the exact same X variables
- we perform the same feature engineering e.g. combining some variables to reduce multicollinearity, one-hot encoding of categorical variables
- we perform the same data prep e.g. standardisation of continuous variables

**This is to ensure a fair comparison between our XGBoost Model and our Bayesian Model. We don't want different setups to affect our comparisons.**

**1) Importing the same data:**

In [2]:
# importing the credit card default data csv file
file = "UCI_Credit_Card.csv"
data = pd.read_csv(file)

# setting to display all our columns
pd.set_option('display.max_columns', None)
# display first 5 rows to get a sense of the data
display(data.head())

Unnamed: 0,ID,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,PAY_5,PAY_6,BILL_AMT1,BILL_AMT2,BILL_AMT3,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default.payment.next.month
0,1,20000.0,2,2,1,24,2,2,-1,-1,-2,-2,3913.0,3102.0,689.0,0.0,0.0,0.0,0.0,689.0,0.0,0.0,0.0,0.0,1
1,2,120000.0,2,2,2,26,-1,2,0,0,0,2,2682.0,1725.0,2682.0,3272.0,3455.0,3261.0,0.0,1000.0,1000.0,1000.0,0.0,2000.0,1
2,3,90000.0,2,2,2,34,0,0,0,0,0,0,29239.0,14027.0,13559.0,14331.0,14948.0,15549.0,1518.0,1500.0,1000.0,1000.0,1000.0,5000.0,0
3,4,50000.0,2,2,1,37,0,0,0,0,0,0,46990.0,48233.0,49291.0,28314.0,28959.0,29547.0,2000.0,2019.0,1200.0,1100.0,1069.0,1000.0,0
4,5,50000.0,1,2,1,57,-1,0,-1,0,0,0,8617.0,5670.0,35835.0,20940.0,19146.0,19131.0,2000.0,36681.0,10000.0,9000.0,689.0,679.0,0


**2) Implementing the same one-hot encoding process:**

In [3]:
# for SEX, using female as the reference category, we create an indicator column for male
data['male'] = (data['SEX'] == 1).astype(int)

# for EDUCATION, letting grad school be the reference category, we create indicator columns for high school, university, and other/unknown education levels
data['high_school'] = (data['EDUCATION'] == 3).astype(int)
data['university'] = (data['EDUCATION'] == 2).astype(int)
data['other_education'] = (data['EDUCATION'] == 4).astype(int)
data['unknown_education_one'] = (data['EDUCATION'] == 5).astype(int)
data['unknown_education_two'] = (data['EDUCATION'] == 6).astype(int)


# for MARRIAGE, letting singles be the reference category, we create indicator columns for married and others
data['married'] = (data['MARRIAGE'] == 1).astype(int)
data['other_marriage_status'] = (data['MARRIAGE'] == 3).astype(int)

data.head()


Unnamed: 0,ID,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,PAY_5,PAY_6,BILL_AMT1,BILL_AMT2,BILL_AMT3,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default.payment.next.month,male,high_school,university,other_education,unknown_education_one,unknown_education_two,married,other_marriage_status
0,1,20000.0,2,2,1,24,2,2,-1,-1,-2,-2,3913.0,3102.0,689.0,0.0,0.0,0.0,0.0,689.0,0.0,0.0,0.0,0.0,1,0,0,1,0,0,0,1,0
1,2,120000.0,2,2,2,26,-1,2,0,0,0,2,2682.0,1725.0,2682.0,3272.0,3455.0,3261.0,0.0,1000.0,1000.0,1000.0,0.0,2000.0,1,0,0,1,0,0,0,0,0
2,3,90000.0,2,2,2,34,0,0,0,0,0,0,29239.0,14027.0,13559.0,14331.0,14948.0,15549.0,1518.0,1500.0,1000.0,1000.0,1000.0,5000.0,0,0,0,1,0,0,0,0,0
3,4,50000.0,2,2,1,37,0,0,0,0,0,0,46990.0,48233.0,49291.0,28314.0,28959.0,29547.0,2000.0,2019.0,1200.0,1100.0,1069.0,1000.0,0,0,0,1,0,0,0,1,0
4,5,50000.0,1,2,1,57,-1,0,-1,0,0,0,8617.0,5670.0,35835.0,20940.0,19146.0,19131.0,2000.0,36681.0,10000.0,9000.0,689.0,679.0,0,1,0,1,0,0,0,1,0


**3) Standardising the data in the same way:**

In [4]:
# create new df for standardised data
data_std = data.copy(deep=True)

# I standardize the continuous variables only. Binary variables are left unchanged.
continuous_vars = ['LIMIT_BAL', 'AGE', 'BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6', 'PAY_AMT1', 'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6']
scaler = StandardScaler()
data_std[continuous_vars] = scaler.fit_transform(data_std[continuous_vars])
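
For readers unfamiliar with standardisation, the following is a minimal sketch of what `StandardScaler` computes for a single column: subtract the column mean and divide by the population standard deviation. The toy values below are the first five `LIMIT_BAL` entries shown earlier, so the resulting z-scores will not match the notebook's, which are computed over the full dataset.

```python
import numpy as np

# toy column: the first five LIMIT_BAL values, for illustration only
limit_bal = np.array([20000.0, 120000.0, 90000.0, 50000.0, 50000.0])

# z-score: subtract the column mean, divide by the population standard deviation (ddof=0)
z = (limit_bal - limit_bal.mean()) / limit_bal.std(ddof=0)

print(z.round(3))
print(f"mean={z.mean():.3f}, std={z.std(ddof=0):.3f}")
```

Note that the cell above fits the scaler on all rows before the train/test split. Fitting it on the training rows only is a common, stricter alternative, but it would no longer mirror the preprocessing this notebook sets out to reproduce.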

**4) Performing the same feature engineering:**

In [5]:
# predictors only
predictors = data_std.drop(columns=['ID', 'default.payment.next.month', 'male', 'high_school', 
                                    'university', 'other_education', 'unknown_education_one', 'unknown_education_two', 'married', 'other_marriage_status',
                                   'SEX','EDUCATION', 'MARRIAGE'])

# averaging BILL_AMTs & create 'BILL_AMT_average' column
predictors['BILL_AMT_average'] = predictors[['BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6']].mean(axis=1)
new_predictors = predictors.drop(columns=['BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6'])
data_std['BILL_AMT_average'] = new_predictors['BILL_AMT_average'] 

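# weighted average of the repayment statuses PAY_4, PAY_5 and PAY_6 (weights 3, 2, 1) & create 'PAY_weighted' column
data_std['PAY_weighted'] = (data_std['PAY_4'] * 3 + data_std['PAY_5'] * 2 + data_std['PAY_6']) / 6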
new_predictors.drop(columns=['PAY_4', 'PAY_5', 'PAY_6'], inplace=True)
new_predictors['PAY_weighted'] = data_std['PAY_weighted']


data_std

Unnamed: 0,ID,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,PAY_5,PAY_6,BILL_AMT1,BILL_AMT2,BILL_AMT3,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default.payment.next.month,male,high_school,university,other_education,unknown_education_one,unknown_education_two,married,other_marriage_status,BILL_AMT_average,PAY_weighted
0,1,-1.136720,2,2,1,-1.246020,2,2,-1,-1,-2,-2,-0.642501,-0.647399,-0.667993,-0.672497,-0.663059,-0.652724,-0.341942,-0.227086,-0.296801,-0.308063,-0.314136,-0.293382,1,0,0,1,0,0,0,1,0,-0.657696,-1.500000
1,2,-0.365981,2,2,2,-1.029047,-1,2,0,0,0,2,-0.659219,-0.666747,-0.639254,-0.621636,-0.606229,-0.597966,-0.341942,-0.213588,-0.240005,-0.244230,-0.314136,-0.180878,1,0,0,1,0,0,0,0,0,-0.631842,0.333333
2,3,-0.597202,2,2,2,-0.161156,0,0,0,0,0,0,-0.298560,-0.493899,-0.482408,-0.449730,-0.417188,-0.391630,-0.250292,-0.191887,-0.240005,-0.244230,-0.248683,-0.012122,0,0,0,1,0,0,0,0,0,-0.422236,0.000000
3,4,-0.905498,2,2,1,0.164303,0,0,0,0,0,0,-0.057491,-0.013293,0.032846,-0.232373,-0.186729,-0.156579,-0.221191,-0.169361,-0.228645,-0.237846,-0.244166,-0.237130,0,0,0,1,0,0,0,1,0,-0.102270,0.000000
4,5,-0.905498,1,2,1,2.334029,-1,0,-1,0,0,0,-0.578618,-0.611318,-0.161189,-0.346997,-0.348137,-0.331482,-0.221191,1.335034,0.271165,0.266434,-0.269039,-0.255187,0,1,0,1,0,0,0,1,0,-0.396290,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
29995,29996,0.404759,1,3,1,0.381275,0,0,0,0,0,0,1.870379,2.018136,2.326690,0.695474,-0.149259,-0.384392,0.171250,0.611048,-0.012648,-0.113564,0.013131,-0.237130,0,1,1,0,0,0,0,1,0,1.062838,0.000000
29996,29997,-0.134759,1,3,2,0.815221,-1,-1,-1,-1,0,0,-0.672786,-0.665299,-0.627430,-0.532924,-0.577691,-0.652724,-0.231032,-0.103955,0.214255,-0.299828,-0.314136,-0.293382,0,1,1,0,0,0,0,0,0,-0.621476,-0.500000
29997,29998,-1.059646,1,2,2,0.164303,4,3,2,-1,0,0,-0.647227,-0.643830,-0.638158,-0.347961,-0.324517,-0.327687,-0.341942,-0.256990,0.952725,-0.039964,-0.183229,-0.119001,1,1,0,1,0,0,0,0,0,-0.488230,-0.500000
29998,29999,-0.674276,1,3,1,0.598248,1,-1,0,0,0,-1,-0.717982,0.410269,0.422373,0.147844,-0.468063,0.169130,4.844316,-0.109033,-0.229895,-0.185120,3.152536,-0.191904,1,1,1,0,0,0,0,1,0,-0.006071,-0.166667
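
As a quick sanity check on the two engineered features, the sketch below recomputes them by hand for values copied from the table above. The function name `pay_weighted` is a hypothetical helper introduced here for illustration.

```python
import numpy as np

def pay_weighted(pay_4, pay_5, pay_6):
    """Weighted average of three repayment-status values with weights 3, 2, 1."""
    pay_4, pay_5, pay_6 = (np.asarray(v, dtype=float) for v in (pay_4, pay_5, pay_6))
    return (3 * pay_4 + 2 * pay_5 + pay_6) / 6

# first two rows of the dataset: (PAY_4, PAY_5, PAY_6) = (-1, -2, -2) and (0, 0, 2)
weighted = pay_weighted([-1, 0], [-2, 0], [-2, 2])
print(weighted)

# standardised BILL_AMT1..BILL_AMT6 of the first row; their mean is BILL_AMT_average
bill_amts_row0 = [-0.642501, -0.647399, -0.667993, -0.672497, -0.663059, -0.652724]
bill_avg_row0 = float(np.mean(bill_amts_row0))
print(round(bill_avg_row0, 6))
```

Both results agree with the `PAY_weighted` and `BILL_AMT_average` columns of the first rows in the output above.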


**5) Splitting the data into the exact same train and test sets**

In [6]:
# build the feature matrix X from data_std, dropping the target, the identifier and the raw columns replaced by engineered features
X = data_std.drop(['default.payment.next.month', 'ID', 'SEX', 'EDUCATION', 'MARRIAGE', 'PAY_4' , 
                   'PAY_5', 'PAY_6', 'BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4',
                   'BILL_AMT5', 'BILL_AMT6'], axis=1)
y = data_std['default.payment.next.month']

# 70/30 train/test split with a fixed random seed so the split is reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 1. Building and training our XGBoost default prediction model

### 1.1 Training the model

**Note:** it would make sense at this point to split the data into training, cross-validation and test sets, and then to use the cv set to compare different model configurations (i.e. to optimise `n_estimators`, `learning_rate`, `max_depth`, ...).

We skip this step for two reasons:
1. to save time,
2. because even if we skip this parameter optimisation step, decision tree models still lead to superior performance vs our Bayesian model.
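
For reference, the three-way split mentioned in the note could be sketched as follows. This is not what the notebook does: the synthetic data, the `_demo` variable names and the 20% validation share are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 5))          # synthetic features, illustration only
y_demo = rng.integers(0, 2, size=1000)       # synthetic binary labels

# first hold out 30% as the final test set, as in the notebook
X_trainval, X_test_demo, y_trainval, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)
# then carve a validation set out of the remaining data (20% of it, an arbitrary choice)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.2, random_state=42
)

print(len(X_tr), len(X_val), len(X_test_demo))
```

Candidate hyperparameter settings would then be fitted on `X_tr`, compared on `X_val`, and only the chosen configuration evaluated once on the test set.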

In [7]:
model = xgb.XGBClassifier(
    objective='binary:logistic',
    n_estimators=200,
    learning_rate=0.01,
    max_depth=20,
    subsample=0.8,
    colsample_bytree=0.8,
    random_state=42
)

# Training the model
model.fit(X_train, y_train)

### 1.2 Intepretting the preliminary results

In [8]:
# predictions
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

# eval accuracy
train_accuracy = accuracy_score(y_train, y_train_pred)
test_accuracy = accuracy_score(y_test, y_test_pred)

print(f"Training Accuracy: {train_accuracy:.2f}")
print(f"Test Accuracy: {test_accuracy:.2f}")

# Classification reports
print("Training Classification Report:")
print(classification_report(y_train, y_train_pred))

print("Test Classification Report:")
print(classification_report(y_test, y_test_pred))

# number of predictions for each class
train_pred_counts = np.bincount(y_train_pred)
test_pred_counts = np.bincount(y_test_pred)

print(f"Number of predictions for class 0 in training set: {train_pred_counts[0]}")
print(f"Number of predictions for class 1 in training set: {train_pred_counts[1]}")
print(f"Number of predictions for class 0 in test set: {test_pred_counts[0]}")
print(f"Number of predictions for class 1 in test set: {test_pred_counts[1]}")


Training Accuracy: 0.91
Test Accuracy: 0.82
Training Classification Report:
              precision    recall  f1-score   support

           0       0.90      1.00      0.95     16324
           1       0.98      0.62      0.76      4676

    accuracy                           0.91     21000
   macro avg       0.94      0.81      0.85     21000
weighted avg       0.92      0.91      0.90     21000

Test Classification Report:
              precision    recall  f1-score   support

           0       0.83      0.96      0.89      7040
           1       0.69      0.30      0.42      1960

    accuracy                           0.82      9000
   macro avg       0.76      0.63      0.66      9000
weighted avg       0.80      0.82      0.79      9000

Number of predictions for class 0 in training set: 18064
Number of predictions for class 1 in training set: 2936
Number of predictions for class 0 in test set: 8131
Number of predictions for class 1 in test set: 869


**Observation:** As can be seen, we are left with a model with 83% precision on y=0s (non-default cases) and 8131 predictions for y=0 on the test set. This model clearly does a better job at "profit-maximisation" than the Bayesian model (8131 accepted clients vs 423), but it also has a significantly lower precision on y=0: 83% vs 90%.
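
To put the 83% in context, it helps to compare it with what accepting every client would achieve. The short calculation below uses only the support figures from the test classification report above; the variable names are illustrative.

```python
# figures taken from the test classification report above
n_non_default = 7040
n_default = 1960
n_test = n_non_default + n_default          # 9000 clients in the test set

# accepting every client gives a precision on y=0 equal to the non-default share
baseline_precision_0 = n_non_default / n_test

# share of the test set accepted by the unadjusted model (8131 predictions of y=0)
acceptance_rate = 8131 / n_test

print(f"baseline precision on y=0: {baseline_precision_0:.3f}")
print(f"acceptance rate at threshold 0.5: {acceptance_rate:.3f}")
```

So at the default threshold the model accepts roughly nine clients in ten and improves precision on y=0 by only a few points over accepting everyone.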

**Next step:** Recall that in our [**Bayesian Model**](https://github.com/evgeni-g-georgiev/Bayesian-Credit-Card-Default-Model) we lowered the decision threshold in order to increase precision on y=0 at the cost of overall accuracy. Let us perform the same adjustment on our XGBoost model and see where it takes us...

### 1.3 Adjusting the threshold to tilt our model defensively

In [9]:
# prediction probabilities for the test data
y_prob = model.predict_proba(X_test)[:, 1]  # probability of the positive class (defaults)

In [10]:
# set a lower threshold to make predicting defaults easier i.e. to make the model more cautious
threshold = 0.15  # vs 0.5 earlier

# apply new threshold to the probabilities to make final predictions
y_pred_adjusted = (y_prob > threshold).astype(int)
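
The effect of the threshold on our two objectives can be explored with a small sweep. The sketch below is a hedged illustration on toy data: `sweep_thresholds` is a hypothetical helper, and the labels and probabilities are made up rather than taken from the model.

```python
import numpy as np

def sweep_thresholds(y_true, y_prob, thresholds):
    """For each threshold, return (threshold, precision on class 0, number accepted).

    Hypothetical helper; a client is predicted to default when y_prob > threshold.
    """
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)
    rows = []
    for t in thresholds:
        accepted = y_prob <= t                   # predicted non-default
        n_accepted = int(accepted.sum())
        precision_0 = float((y_true[accepted] == 0).mean()) if n_accepted else float("nan")
        rows.append((t, precision_0, n_accepted))
    return rows

# toy data: illustrative labels and default probabilities only
y_true_demo = [0, 0, 0, 1, 0, 1, 1, 0]
y_prob_demo = [0.05, 0.10, 0.20, 0.25, 0.40, 0.60, 0.80, 0.12]

sweep = sweep_thresholds(y_true_demo, y_prob_demo, [0.15, 0.30, 0.50])
for t, p0, n in sweep:
    print(f"threshold={t:.2f}  precision on y=0={p0:.2f}  accepted={n}")
```

On this toy data, lowering the threshold raises precision on y=0 while shrinking the number of accepted clients, which is the trade-off we are tuning.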


**And now let's evaluate the adjusted model...**

In [11]:
# Confusion Matrix
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_adjusted))

# Precision, Recall & F1 score
print("Classification Report:")
print(classification_report(y_test, y_pred_adjusted))

# Number of predictions for each class
adjusted_pred_counts = np.bincount(y_pred_adjusted)
print(f"Number of predictions for class 0: {adjusted_pred_counts[0]}")
print(f"Number of predictions for class 1: {adjusted_pred_counts[1]}")



Confusion Matrix:
[[4300 2740]
 [ 472 1488]]
Classification Report:
              precision    recall  f1-score   support

           0       0.90      0.61      0.73      7040
           1       0.35      0.76      0.48      1960

    accuracy                           0.64      9000
   macro avg       0.63      0.68      0.60      9000
weighted avg       0.78      0.64      0.67      9000

Number of predictions for class 0: 4772
Number of predictions for class 1: 4228


**So... we have built an incredibly simple (yet powerful) XGBoost model that:**

1. has the same (90%) precision on y=0 (non-default) predictions as our [**Bayesian Model**](https://github.com/evgeni-g-georgiev/Bayesian-Credit-Card-Default-Model)
2. does a far better job at **profit-maximisation**. This model accepts 53% of candidates in our test set (vs under 5% previously!)
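
Both headline figures can be recomputed directly from the confusion matrix printed above. The following is a small worked check; the variable names are illustrative.

```python
import numpy as np

# confusion matrix from the output above: rows = true class, columns = predicted class
cm = np.array([[4300, 2740],
               [ 472, 1488]])

tn, fp = cm[0]      # true non-defaults: predicted 0 / predicted 1
fn, tp = cm[1]      # true defaults:     predicted 0 / predicted 1

n_accepted = tn + fn                       # all predictions of y=0
precision_0 = tn / n_accepted              # share of accepted clients who do not default
acceptance_rate = n_accepted / cm.sum()    # share of the test set that is accepted

print(f"accepted: {n_accepted}")
print(f"precision on y=0: {precision_0:.3f}")
print(f"acceptance rate: {acceptance_rate:.3f}")
```

This gives 4772 accepted clients, a precision on y=0 of about 0.901 and an acceptance rate of about 0.530, matching the figures quoted in the list above.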

### 1.4 A brief conclusion...

- What we have shown is that, in the context of credit default prediction, decision tree models have great potential compared with traditional Bayesian models in:
    1. ease of implementation,
    2. computational efficiency and,
    3. most importantly - solving for our dual optimisation problem.
 
- What remains to be shown is the additional performance that introducing a cross-validation set and tuning the model architecture/parameters can bring. I leave this for another time.

# 2. Acknowledgements

Dataset taken from Kaggle: https://www.kaggle.com/datasets/uciml/default-of-credit-card-clients-dataset

<u>As per Kaggle acknowledgements, they reference the following:</u>

- Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

- The original dataset can be found here at the UCI Machine Learning Repository: https://archive.ics.uci.edu/dataset/350/default+of+credit+card+clients