# INTRODUCTION

I am using Kaggle's Default of Credit Card Clients Dataset as an exercise for default prediction methods. Any comments and suggestions are more than welcome.

## Data information

The following description comes from Kaggle:

> This dataset contains information on default payments, demographic factors, credit data, history of payment, and bill statements of credit card clients in Taiwan from April 2005 to September 2005.

Monetary and payment values are in New Taiwan dollars (NTD). As of 2024, $1\ \text{EUR} \approx 35\ \text{NTD}$.
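To keep the monetary values in perspective, amounts can be converted with a small helper. This is only a sketch: the rate of 35 NTD per EUR is the approximate 2024 figure quoted above, not a live rate, and the names `NTD_PER_EUR` and `ntd_to_eur` are introduced here purely for illustration.

```python
# Approximate 2024 exchange rate quoted in the text (an assumption, not a live rate).
NTD_PER_EUR = 35.0


def ntd_to_eur(amount_ntd: float, rate: float = NTD_PER_EUR) -> float:
    """Convert an amount in New Taiwan dollars to euros at a fixed rate."""
    return amount_ntd / rate


# A few round amounts, to get a feeling for the order of magnitude
for amount in (10_000, 100_000, 1_000_000):
    print(f"{amount:>9,} NTD is roughly {ntd_to_eur(amount):,.0f} EUR")
```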

There are 25 variables in the dataset:

- **ID:** ID of each client
- **LIMIT_BAL:** Amount of given credit in NT dollars (includes individual and family/supplementary credit)
- **SEX:** Gender (1=male, 2=female)
- **EDUCATION:** (1=graduate school, 2=university, 3=high school, 4=others, 5=unknown, 6=unknown)
- **MARRIAGE:** Marital status (1=married, 2=single, 3=others)
- **AGE:** Age in years
- **PAY_0:** Repayment status in September, 2005 (-1=pay duly, 1=payment delay for one month, 2=payment delay for two months, … 8=payment delay for eight months, 9=payment delay for nine months and above)
- **PAY_2:** Repayment status in August, 2005 (scale same as above)
- **PAY_3:** Repayment status in July, 2005 (scale same as above)
- **PAY_4:** Repayment status in June, 2005 (scale same as above)
- **PAY_5:** Repayment status in May, 2005 (scale same as above)
- **PAY_6:** Repayment status in April, 2005 (scale same as above)
- **BILL_AMT1:** Amount of bill statement in September, 2005 (NT dollar)
- **BILL_AMT2:** Amount of bill statement in August, 2005 (NT dollar)
- **BILL_AMT3:** Amount of bill statement in July, 2005 (NT dollar)
- **BILL_AMT4:** Amount of bill statement in June, 2005 (NT dollar)
- **BILL_AMT5:** Amount of bill statement in May, 2005 (NT dollar)
- **BILL_AMT6:** Amount of bill statement in April, 2005 (NT dollar)
- **PAY_AMT1:** Amount of previous payment in September, 2005 (NT dollar)
- **PAY_AMT2:** Amount of previous payment in August, 2005 (NT dollar)
- **PAY_AMT3:** Amount of previous payment in July, 2005 (NT dollar)
- **PAY_AMT4:** Amount of previous payment in June, 2005 (NT dollar)
- **PAY_AMT5:** Amount of previous payment in May, 2005 (NT dollar)
- **PAY_AMT6:** Amount of previous payment in April, 2005 (NT dollar)
- **default.payment.next.month:** Default payment (1=yes, 0=no)
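The integer codes above can be turned into readable labels before plotting or tabulating. The sketch below is one possible way to do it: the dictionaries are copied from the variable list, while the helper `decode` and the `"undocumented"` fallback label are assumptions made for illustration. The fallback matters because the summary statistics further down show a minimum of 0 for EDUCATION and MARRIAGE, a code that is not part of the list.

```python
import pandas as pd

# Code books copied from the variable list above.
SEX_LABELS = {1: "male", 2: "female"}
EDUCATION_LABELS = {1: "graduate school", 2: "university", 3: "high school",
                    4: "others", 5: "unknown", 6: "unknown"}
MARRIAGE_LABELS = {1: "married", 2: "single", 3: "others"}


def decode(series: pd.Series, labels: dict, fallback: str = "undocumented") -> pd.Series:
    """Map integer codes to labels; codes missing from the code book get `fallback`."""
    return series.map(labels).fillna(fallback)


# Toy data for illustration only (not rows from the real dataset).
toy = pd.DataFrame({"SEX": [1, 2, 2], "EDUCATION": [1, 3, 0], "MARRIAGE": [2, 1, 0]})
decoded = pd.DataFrame({
    "SEX": decode(toy["SEX"], SEX_LABELS),
    "EDUCATION": decode(toy["EDUCATION"], EDUCATION_LABELS),
    "MARRIAGE": decode(toy["MARRIAGE"], MARRIAGE_LABELS),
})
print(decoded)
```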

## Loading libraries and data

We will begin by importing Python libraries that will be used and by loading the dataset:



In [10]:
### Installing libraries if not yet in prompt
#!pip install scikit-learn
#!pip install xgboost
#!pip install dataprep
#!pip install pandas_profiling
#!pip install cufflinks
#!pip install -U regex
#!pip install -U levenshtein
#!pip install numba==0.58.1
#%pip install dataprep
#!pip install statsmodels


### LIBRARIES to be used
import pandas as pd
import numpy as np
import statsmodels.api as sm
from scipy.stats import randint  # for statistical distributions
import xgboost as xgb  # for extreme gradient boosting

# For exploratory data analysis (optional, currently disabled)
#from dataprep.eda import plot, plot_correlation, create_report, plot_missing
#from dataprep.datasets import load_dataset
#from dataprep.eda import create_report
#from numba import generated_jit
#from pandas_profiling import ProfileReport

# Visualization libraries
import matplotlib.pyplot as plt  # for plotting graphs
import seaborn as sns  # for creating attractive and informative statistical graphics
import plotly.graph_objs as go
import plotly.figure_factory as ff
from plotly import tools
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)

# Setting display options
from pandas import set_option
plt.style.use('ggplot')  # setting plot style, as used in R's Tidyverse

# Scikit-learn libraries for machine learning tasks
from sklearn.model_selection import train_test_split  # to split the dataset into training and testing sets
from sklearn.linear_model import LogisticRegression  # to apply logistic regression model
from sklearn.feature_selection import RFE  # for recursive feature elimination
from sklearn.model_selection import KFold  # for k-fold cross-validation
from sklearn.model_selection import GridSearchCV  # for hyperparameter tuning using grid search
from sklearn.model_selection import RandomizedSearchCV  # for hyperparameter tuning using randomized search
from sklearn.preprocessing import StandardScaler  # for feature standardization (zero mean, unit variance)
from sklearn.pipeline import Pipeline  # for creating machine learning pipelines
from sklearn.ensemble import RandomForestClassifier  # for applying random forest classification
from xgboost import XGBClassifier  # for XGBoost classifier
from sklearn.model_selection import cross_val_score  # for cross-validation
from sklearn.metrics import classification_report  # for model evaluation metrics
from sklearn.metrics import confusion_matrix  # for confusion matrix
from sklearn.neighbors import KNeighborsClassifier  # for k-nearest neighbors classifier
from sklearn.tree import DecisionTreeClassifier  # for decision tree classifier
from sklearn.ensemble import ExtraTreesClassifier  # for extra trees classifier
from sklearn.feature_selection import SelectFromModel  # for feature selection from model
from sklearn import metrics  # for evaluating model performance




In [20]:
###Importing dataset
data = 'C:/Users/u0135988/OneDrive - KU Leuven/Research/Quant/Methods_Quant_Fin/Example - Credit Card Default - Keggle/UCI_Credit_Card.csv'
data_df = pd.read_csv(data)

print("Default Credit Card data -  rows:",data_df.shape[0]," columns:", data_df.shape[1])

Default Credit Card data -  rows: 30000  columns: 25
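Right after loading, a few integrity checks can confirm that the table looks as expected. The helper below is a minimal sketch, not part of the original workflow: `basic_checks` is a hypothetical name, and it is demonstrated on a toy frame rather than on `data_df`.

```python
import pandas as pd


def basic_checks(df: pd.DataFrame, id_col: str = "ID") -> dict:
    """Return a few integrity indicators for a freshly loaded table."""
    return {
        "n_rows": len(df),
        "n_cols": df.shape[1],
        "missing_values": int(df.isna().sum().sum()),
        "duplicate_ids": int(df[id_col].duplicated().sum()),
    }


# Toy frame for illustration only: one missing value and one repeated ID.
toy = pd.DataFrame({
    "ID": [1, 2, 2, 4],
    "LIMIT_BAL": [20000.0, None, 90000.0, 50000.0],
})
checks = basic_checks(toy)
print(checks)
```

On the real data the same call would be `basic_checks(data_df)`.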


In [21]:
###Getting a glimpse of the data

#Show first rows
data_df.head()




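| | ID | LIMIT_BAL | SEX | EDUCATION | MARRIAGE | AGE | PAY_0 | PAY_2 | PAY_3 | PAY_4 | ... | BILL_AMT4 | BILL_AMT5 | BILL_AMT6 | PAY_AMT1 | PAY_AMT2 | PAY_AMT3 | PAY_AMT4 | PAY_AMT5 | PAY_AMT6 | default.payment.next.month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 20000.0 | 2 | 2 | 1 | 24 | 2 | 2 | -1 | -1 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 689.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1 |
| 1 | 2 | 120000.0 | 2 | 2 | 2 | 26 | -1 | 2 | 0 | 0 | ... | 3272.0 | 3455.0 | 3261.0 | 0.0 | 1000.0 | 1000.0 | 1000.0 | 0.0 | 2000.0 | 1 |
| 2 | 3 | 90000.0 | 2 | 2 | 2 | 34 | 0 | 0 | 0 | 0 | ... | 14331.0 | 14948.0 | 15549.0 | 1518.0 | 1500.0 | 1000.0 | 1000.0 | 1000.0 | 5000.0 | 0 |
| 3 | 4 | 50000.0 | 2 | 2 | 1 | 37 | 0 | 0 | 0 | 0 | ... | 28314.0 | 28959.0 | 29547.0 | 2000.0 | 2019.0 | 1200.0 | 1100.0 | 1069.0 | 1000.0 | 0 |
| 4 | 5 | 50000.0 | 1 | 2 | 1 | 57 | -1 | 0 | -1 | 0 | ... | 20940.0 | 19146.0 | 19131.0 | 2000.0 | 36681.0 | 10000.0 | 9000.0 | 689.0 | 679.0 | 0 |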


In [22]:
# Describe variables with the main summary statistics

pd.options.display.float_format = '{:.2f}'.format  # show floats with two decimal places
data_df.describe()

# As a reminder:
# LIMIT_BAL is the credit limit; SEX 2=female; MARRIAGE 1=married, 2=single
# PAY_* is the repayment status (-1 = paid duly, >0 = months of payment delay)
# BILL_AMT* is the bill amount; PAY_AMT* is the amount of the previous payment
# Outcome variable: default.payment.next.month


| | ID | LIMIT_BAL | SEX | EDUCATION | MARRIAGE | AGE | PAY_0 | PAY_2 | PAY_3 | PAY_4 | ... | BILL_AMT4 | BILL_AMT5 | BILL_AMT6 | PAY_AMT1 | PAY_AMT2 | PAY_AMT3 | PAY_AMT4 | PAY_AMT5 | PAY_AMT6 | default.payment.next.month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 30000.0 | 30000.0 | 30000.0 | 30000.0 | 30000.0 | 30000.0 | 30000.0 | 30000.0 | 30000.0 | 30000.0 | ... | 30000.0 | 30000.0 | 30000.0 | 30000.0 | 30000.0 | 30000.0 | 30000.0 | 30000.0 | 30000.0 | 30000.0 |
| mean | 15000.5 | 167484.32 | 1.6 | 1.85 | 1.55 | 35.49 | -0.02 | -0.13 | -0.17 | -0.22 | ... | 43262.95 | 40311.4 | 38871.76 | 5663.58 | 5921.16 | 5225.68 | 4826.08 | 4799.39 | 5215.5 | 0.22 |
| std | 8660.4 | 129747.66 | 0.49 | 0.79 | 0.52 | 9.22 | 1.12 | 1.2 | 1.2 | 1.17 | ... | 64332.86 | 60797.16 | 59554.11 | 16563.28 | 23040.87 | 17606.96 | 15666.16 | 15278.31 | 17777.47 | 0.42 |
| min | 1.0 | 10000.0 | 1.0 | 0.0 | 0.0 | 21.0 | -2.0 | -2.0 | -2.0 | -2.0 | ... | -170000.0 | -81334.0 | -339603.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 7500.75 | 50000.0 | 1.0 | 1.0 | 1.0 | 28.0 | -1.0 | -1.0 | -1.0 | -1.0 | ... | 2326.75 | 1763.0 | 1256.0 | 1000.0 | 833.0 | 390.0 | 296.0 | 252.5 | 117.75 | 0.0 |
| 50% | 15000.5 | 140000.0 | 2.0 | 2.0 | 2.0 | 34.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 19052.0 | 18104.5 | 17071.0 | 2100.0 | 2009.0 | 1800.0 | 1500.0 | 1500.0 | 1500.0 | 0.0 |
| 75% | 22500.25 | 240000.0 | 2.0 | 2.0 | 2.0 | 41.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 54506.0 | 50190.5 | 49198.25 | 5006.0 | 5000.0 | 4505.0 | 4013.25 | 4031.5 | 4000.0 | 0.0 |
| max | 30000.0 | 1000000.0 | 2.0 | 6.0 | 3.0 | 79.0 | 8.0 | 8.0 | 8.0 | 8.0 | ... | 891586.0 | 927171.0 | 961664.0 | 873552.0 | 1684259.0 | 896040.0 | 621000.0 | 426529.0 | 528666.0 | 1.0 |



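The summary statistics show an average credit limit of about 167,484 NTD (roughly 4,785 EUR at 35 NTD per EUR). About 60% of clients are women (mean SEX of 1.6), the median client is university-educated (EDUCATION = 2) and single (MARRIAGE = 2), and the average age is about 35 years.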

From the outcome variable, we see that around 22% of our sample defaulted in September.
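These shares can be read directly from the means of the coded variables. For a variable that only takes the values 1 and 2, such as SEX, the mean equals 1 plus the share of observations coded 2; for a 0/1 variable, the mean is simply the share of ones. A small sketch using the rounded means from the describe() output above (the helper name `share_of_upper_code` is illustrative):

```python
def share_of_upper_code(mean: float, lower: int, upper: int) -> float:
    """Share of observations at `upper` for a variable taking only `lower` or `upper`."""
    return (mean - lower) / (upper - lower)


# Rounded means taken from the describe() output above.
share_female = share_of_upper_code(1.60, lower=1, upper=2)    # SEX: 1=male, 2=female
share_default = share_of_upper_code(0.22, lower=0, upper=1)   # default flag: 0/1

print(f"female share:  {share_female:.0%}")
print(f"default share: {share_default:.0%}")
```

The same shortcut does not apply to EDUCATION or MARRIAGE, which take more than two values.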

It is also important to understand how the defaulting population differs from the full sample:


In [23]:

defaulting_df = data_df[data_df['default.payment.next.month'] == 1]

defaulting_df.describe()



| | ID | LIMIT_BAL | SEX | EDUCATION | MARRIAGE | AGE | PAY_0 | PAY_2 | PAY_3 | PAY_4 | ... | BILL_AMT4 | BILL_AMT5 | BILL_AMT6 | PAY_AMT1 | PAY_AMT2 | PAY_AMT3 | PAY_AMT4 | PAY_AMT5 | PAY_AMT6 | default.payment.next.month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 6636.0 | 6636.0 | 6636.0 | 6636.0 | 6636.0 | 6636.0 | 6636.0 | 6636.0 | 6636.0 | 6636.0 | ... | 6636.0 | 6636.0 | 6636.0 | 6636.0 | 6636.0 | 6636.0 | 6636.0 | 6636.0 | 6636.0 | 6636.0 |
| mean | 14773.78 | 130109.66 | 1.57 | 1.89 | 1.53 | 35.73 | 0.67 | 0.46 | 0.36 | 0.25 | ... | 42036.95 | 39540.19 | 38271.44 | 3397.04 | 3388.65 | 3367.35 | 3155.63 | 3219.14 | 3441.48 | 1.0 |
| std | 8571.62 | 115378.54 | 0.5 | 0.73 | 0.53 | 9.69 | 1.38 | 1.5 | 1.5 | 1.51 | ... | 64351.08 | 61424.7 | 59579.67 | 9544.25 | 11737.99 | 12959.62 | 11191.97 | 11944.73 | 13464.01 | 0.0 |
| min | 1.0 | 10000.0 | 1.0 | 1.0 | 0.0 | 21.0 | -2.0 | -2.0 | -2.0 | -2.0 | ... | -65167.0 | -53007.0 | -339603.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 25% | 7408.5 | 50000.0 | 1.0 | 1.0 | 1.0 | 28.0 | 0.0 | 0.0 | -1.0 | -1.0 | ... | 2141.5 | 1502.75 | 1150.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 50% | 14758.5 | 90000.0 | 2.0 | 2.0 | 2.0 | 34.0 | 1.0 | 0.0 | 0.0 | 0.0 | ... | 19119.5 | 18478.5 | 18028.5 | 1636.0 | 1533.5 | 1222.0 | 1000.0 | 1000.0 | 1000.0 | 1.0 |
| 75% | 21831.75 | 200000.0 | 2.0 | 2.0 | 2.0 | 42.0 | 2.0 | 2.0 | 2.0 | 2.0 | ... | 50175.75 | 47853.0 | 47424.0 | 3478.25 | 3309.75 | 3000.0 | 2939.25 | 3000.0 | 2974.5 | 1.0 |
| max | 30000.0 | 740000.0 | 2.0 | 6.0 | 3.0 | 75.0 | 8.0 | 7.0 | 8.0 | 8.0 | ... | 548020.0 | 547880.0 | 514975.0 | 300000.0 | 358689.0 | 508229.0 | 432130.0 | 332000.0 | 345293.0 | 1.0 |
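The table shows 6,636 defaulters out of 30,000 clients, with a lower mean credit limit (130,109.66 against 167,484.32 NTD in the full sample) and higher mean PAY_* values. Reading two describe() tables side by side is tedious, so one alternative is to compare group means directly. The sketch below uses toy data and illustrative column labels (`no_default`, `default`, `difference`); it is one possible layout, not the approach used in this notebook.

```python
import pandas as pd

TARGET = "default.payment.next.month"

# Toy data for illustration only (not rows from the real dataset).
toy = pd.DataFrame({
    "LIMIT_BAL": [20000, 120000, 90000, 50000, 200000, 30000],
    "AGE": [24, 26, 34, 37, 45, 29],
    "PAY_0": [2, -1, 0, 0, -1, 3],
    TARGET: [1, 1, 0, 0, 0, 1],
})

# Mean of each feature by default status, plus the difference between the groups.
group_means = toy.groupby(TARGET).mean().T
group_means.columns = ["no_default", "default"]
group_means["difference"] = group_means["default"] - group_means["no_default"]
print(group_means)
```

On the real data, replacing `toy` with `data_df` would give one row per variable, which makes the largest gaps between the two groups easier to spot.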



As a quick first assessment, we fit a logit regression to see which variables are associated with a higher likelihood of default:



In [24]:

#Modify variables to improve interpretation:
data_df['female'] = (data_df['SEX'] == 2).astype(int)
data_df['married'] = (data_df['MARRIAGE'] == 1).astype(int)

# Define the independent variables and the dependent variable
independent_vars = ['LIMIT_BAL', 'EDUCATION', 'AGE', 'PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6', 'PAY_AMT1', 'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6', 'female', 'married']
X = data_df[independent_vars]
y = data_df['default.payment.next.month']

# Add a constant to the independent variables
X = sm.add_constant(X)

# Fit the logistic regression model
logit_model = sm.Logit(y, X)
result = logit_model.fit()

# Summary of the logistic regression
logit_summary = result.summary2().tables[1]

# Express the results as odds ratios by exponentiating the coefficients
odds_ratios = np.exp(result.params)
odds_ratios_df = pd.DataFrame({
    'Odds Ratio': odds_ratios,
    'p-value': result.pvalues
})


print(logit_summary)
print(odds_ratios_df)



Optimization terminated successfully.
         Current function value: 0.464536
         Iterations 7
           Coef.  Std.Err.      z  P>|z|  [0.025  0.975]
const      -1.08      0.07 -14.83   0.00   -1.23   -0.94
LIMIT_BAL  -0.00      0.00  -5.05   0.00   -0.00   -0.00
EDUCATION  -0.10      0.02  -4.94   0.00   -0.14   -0.06
AGE         0.01      0.00   3.52   0.00    0.00    0.01
PAY_0       0.58      0.02  32.73   0.00    0.54    0.61
PAY_2       0.09      0.02   4.32   0.00    0.05    0.13
PAY_3       0.07      0.02   3.28   0.00    0.03    0.12
PAY_4       0.05      0.02   2.53   0.01    0.01    0.09
BILL_AMT1  -0.00      0.00  -4.87   0.00   -0.00   -0.00
BILL_AMT2   0.00      0.00   1.54   0.12   -0.00    0.00
BILL_AMT3   0.00      0.00   0.98   0.33   -0.00    0.00
BILL_AMT4  -0.00      0.00  -0.11   0.92   -0.00    0.00
BILL_AMT5   0.00      0.00   0.51   0.61   -0.00    0.00
BILL_AMT6   0.00      0.00   0.37   0.71   -0.00    0.00
PAY_AMT1   -0.00      0.00  -5.90   0.00   


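From the odds ratios, among the variables whose effect is visible at this scale, a higher EDUCATION code and being a woman lower the likelihood of default, while being married, being older, and a longer repayment delay in September (PAY_0) increase the odds of defaulting. Note that EDUCATION is coded from 1 (graduate school) to 3 (high school), with 4 to 6 covering other or unknown categories, so its negative coefficient does not mean that more education lowers default risk.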
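As a reminder of how the coefficients map to odds ratios: a logit coefficient $\beta$ corresponds to an odds ratio of $e^{\beta}$, so the odds are multiplied by $e^{\beta}$ for each one-unit increase in the variable. The sketch below applies this to a few rounded coefficients copied from the table above; because of the rounding, the results are approximate.

```python
import math

# Rounded coefficients copied from the regression table above.
coefficients = {"EDUCATION": -0.10, "AGE": 0.01, "PAY_0": 0.58}

# Odds ratio = exp(coefficient)
odds_ratios = {name: math.exp(beta) for name, beta in coefficients.items()}

for name, odds_ratio in odds_ratios.items():
    pct_change = (odds_ratio - 1) * 100
    print(f"{name:<10} odds ratio={odds_ratio:.2f}  "
          f"change in odds per unit={pct_change:+.0f}%")
```

For example, the PAY_0 coefficient of 0.58 corresponds to an odds ratio of about 1.79: each additional month of delay multiplies the odds of default by roughly 1.8, holding the other variables fixed.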

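Still, because LIMIT_BAL, BILL_AMT* and PAY_AMT* are measured in NT dollars and vary widely, their per-dollar coefficients round to zero and are hard to interpret. We therefore standardize these variables by subtracting the mean and dividing by the standard deviation.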
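A minimal sketch of the standardization on toy amounts (not rows from the dataset). One detail worth knowing: pandas' `.std()` uses the sample standard deviation (`ddof=1`) by default, whereas scikit-learn's `StandardScaler` uses the population version (`ddof=0`). With 30,000 observations the difference is negligible, but the two conventions are shown side by side here.

```python
import pandas as pd

# Toy amounts for illustration only (not rows from the real dataset).
amounts = pd.Series([10000.0, 50000.0, 140000.0, 240000.0, 1000000.0], name="LIMIT_BAL")

# z-score with the sample standard deviation (pandas default, ddof=1)
z_sample = (amounts - amounts.mean()) / amounts.std()

# z-score with the population standard deviation (ddof=0),
# the convention used by sklearn.preprocessing.StandardScaler
z_population = (amounts - amounts.mean()) / amounts.std(ddof=0)

print(pd.DataFrame({"raw": amounts, "z_sample": z_sample, "z_population": z_population}))
```

After this transformation, a coefficient (and its odds ratio) refers to a one-standard-deviation increase in the variable rather than to one additional NT dollar.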

Finally, it is clear that PAY_0 (the September repayment status) will be closely linked to the default status in September.

Assuming we are one month earlier, without the September information (PAY_0, BILL_AMT1, and PAY_AMT1), we rerun the same regression without these variables:
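Instead of retyping the variable list, the September columns can be filtered out programmatically, which makes it harder to leave one in by accident. This is a sketch under the assumption that PAY_0, BILL_AMT1 and PAY_AMT1 are the only September variables, as stated in the variable list; the names `SEPTEMBER_COLUMNS`, `all_features` and `lagged_features` are illustrative.

```python
# Columns that describe September 2005 according to the variable list.
SEPTEMBER_COLUMNS = {"PAY_0", "BILL_AMT1", "PAY_AMT1"}

# The regressors used in the first model.
all_features = (
    ["LIMIT_BAL", "EDUCATION", "AGE", "PAY_0"]
    + [f"PAY_{i}" for i in range(2, 5)]
    + [f"BILL_AMT{i}" for i in range(1, 7)]
    + [f"PAY_AMT{i}" for i in range(1, 7)]
    + ["female", "married"]
)

# Keep only the features that would already be known one month earlier.
lagged_features = [col for col in all_features if col not in SEPTEMBER_COLUMNS]
print(lagged_features)
```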


In [25]:
variables_to_standardize = ['LIMIT_BAL', 'BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6', 'PAY_AMT1', 'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6']

# Work on a copy so that data_df keeps the raw NT dollar amounts
data_df_standard = data_df.copy()

# Standardize the variables (subtract the mean, divide by the standard deviation)
for var in variables_to_standardize:
    data_df_standard[var] = (data_df[var] - data_df[var].mean()) / data_df[var].std()


independent_vars = ['LIMIT_BAL', 'EDUCATION', 'AGE', 'PAY_2', 'PAY_3', 'PAY_4', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6', 'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6', 'female', 'married']
X = data_df_standard[independent_vars]
y = data_df_standard['default.payment.next.month']

# Add a constant to the independent variables
X = sm.add_constant(X)

# Fit the logistic regression model
logit_model = sm.Logit(y, X)
result = logit_model.fit()

# Recompute the coefficient table so it reflects this model, not the previous one
logit_summary = result.summary2().tables[1]

# Express the results as odds ratios
odds_ratios = np.exp(result.params)
odds_ratios_df = pd.DataFrame({
    'Odds Ratio': odds_ratios,
    'p-value': result.pvalues
})

print(logit_summary)
print(odds_ratios_df)


Optimization terminated successfully.
         Current function value: 0.483532
         Iterations 8


The payment and credit variables are now easier to interpret because they are on a common scale: each odds ratio refers to a one-standard-deviation increase.

A higher credit limit (LIMIT_BAL) and higher previous payments (PAY_AMT*) are associated with lower odds of defaulting. The size of the credit card bill (BILL_AMT*) is less useful for predicting default: only BILL_AMT2 (the August bill) has an odds ratio significantly different from 1. Individuals with higher bills in August were less likely to default in September, which might reflect that non-defaulting individuals faced no credit constraint one month earlier.
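Whether an odds ratio is "significantly different from 1" can be checked by exponentiating the confidence bounds of the coefficient: if the resulting interval excludes 1, the effect is significant at that level. The sketch below assumes a normal-approximation 95% interval (coefficient plus or minus 1.96 standard errors). The PAY_2 inputs are the rounded values from the first regression table, used purely as an illustration, and the second row is hypothetical.

```python
import math

Z_95 = 1.96  # normal quantile for a two-sided 95% interval


def odds_ratio_interval(coef: float, std_err: float, z: float = Z_95):
    """Return (odds ratio, lower bound, upper bound) from a logit coefficient."""
    return (math.exp(coef),
            math.exp(coef - z * std_err),
            math.exp(coef + z * std_err))


def differs_from_one(lower: float, upper: float) -> bool:
    """True when the interval for the odds ratio excludes 1."""
    return lower > 1 or upper < 1


examples = [
    ("PAY_2 (first model, rounded)", 0.09, 0.02),
    ("hypothetical weak effect", 0.01, 0.02),
]
for name, coef, std_err in examples:
    odds_ratio, lower, upper = odds_ratio_interval(coef, std_err)
    print(f"{name}: OR={odds_ratio:.3f}, 95% CI=[{lower:.3f}, {upper:.3f}], "
          f"differs from 1: {differs_from_one(lower, upper)}")
```

With a fitted statsmodels result, the same bounds are available as `np.exp(result.conf_int())`.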


In [20]:
#report = ProfileReport(data_df)
#report
