# INTRODUCTION

I am using Kaggle's Default of Credit Card Clients Dataset as an exercise for default prediction methods. Any comments and suggestions are more than welcome.

## Data information

The following information comes from Kaggle:

> This dataset contains information on default payments, demographic factors, credit data, history of payment, and bill statements of credit card clients in Taiwan from April 2005 to September 2005.

Monetary and payment values are in New Taiwan dollars (NTD). As of 2024, $1\ \text{EUR} \approx 35\ \text{NTD}$.

There are 25 variables in the dataset:

- **ID:** ID of each client
- **LIMIT_BAL:** Amount of given credit in NT dollars (includes individual and family/supplementary credit)
- **SEX:** Gender (1=male, 2=female)
- **EDUCATION:** (1=graduate school, 2=university, 3=high school, 4=others, 5=unknown, 6=unknown)
- **MARRIAGE:** Marital status (1=married, 2=single, 3=others)
- **AGE:** Age in years
- **PAY_0:** Repayment status in September, 2005 (-1=pay duly, 1=payment delay for one month, 2=payment delay for two months, … 8=payment delay for eight months, 9=payment delay for nine months and above)
- **PAY_2:** Repayment status in August, 2005 (scale same as above)
- **PAY_3:** Repayment status in July, 2005 (scale same as above)
- **PAY_4:** Repayment status in June, 2005 (scale same as above)
- **PAY_5:** Repayment status in May, 2005 (scale same as above)
- **PAY_6:** Repayment status in April, 2005 (scale same as above)
- **BILL_AMT1:** Amount of bill statement in September, 2005 (NT dollar)
- **BILL_AMT2:** Amount of bill statement in August, 2005 (NT dollar)
- **BILL_AMT3:** Amount of bill statement in July, 2005 (NT dollar)
- **BILL_AMT4:** Amount of bill statement in June, 2005 (NT dollar)
- **BILL_AMT5:** Amount of bill statement in May, 2005 (NT dollar)
- **BILL_AMT6:** Amount of bill statement in April, 2005 (NT dollar)
- **PAY_AMT1:** Amount of previous payment in September, 2005 (NT dollar)
- **PAY_AMT2:** Amount of previous payment in August, 2005 (NT dollar)
- **PAY_AMT3:** Amount of previous payment in July, 2005 (NT dollar)
- **PAY_AMT4:** Amount of previous payment in June, 2005 (NT dollar)
- **PAY_AMT5:** Amount of previous payment in May, 2005 (NT dollar)
- **PAY_AMT6:** Amount of previous payment in April, 2005 (NT dollar)
- **default.payment.next.month:** Default payment (1=yes, 0=no)
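
The coded categorical variables above can be translated into readable labels before exploring the data. The following is a minimal sketch, assuming only the codes listed above; the helper name `decode_categories`, the `_LABEL` column suffix and the toy rows are hypothetical, and any code that is not in the list is simply left as missing.

```python
import pandas as pd

# Code-to-label mappings taken from the variable list above.
SEX_LABELS = {1: "male", 2: "female"}
EDUCATION_LABELS = {
    1: "graduate school", 2: "university", 3: "high school",
    4: "others", 5: "unknown", 6: "unknown",
}
MARRIAGE_LABELS = {1: "married", 2: "single", 3: "others"}


def decode_categories(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with extra *_LABEL columns (hypothetical helper)."""
    out = df.copy()
    out["SEX_LABEL"] = out["SEX"].map(SEX_LABELS)
    out["EDUCATION_LABEL"] = out["EDUCATION"].map(EDUCATION_LABELS)
    out["MARRIAGE_LABEL"] = out["MARRIAGE"].map(MARRIAGE_LABELS)
    return out


# Toy rows for illustration only (not taken from the dataset).
toy = pd.DataFrame({"SEX": [2, 1], "EDUCATION": [2, 6], "MARRIAGE": [1, 0]})
decoded = decode_categories(toy)
print(decoded)
```

Keeping the original numeric columns next to the labels means the models can still use the raw codes while tables and plots show the readable names.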

## Loading libraries and data

We will begin by importing Python libraries that will be used and by loading the dataset:




In [6]:
### Install libraries if they are not yet available in the environment
#!pip install scikit-learn
#!pip install xgboost
#!pip install dataprep
#!pip install pandas_profiling
#!pip install cufflinks
#!pip install -U regex
#!pip install -U levenshtein
#!pip install numba==0.58.1
#%pip install dataprep

### LIBRARIES to be used
import pandas as pd
import numpy as np
from scipy.stats import randint  # for statistical distributions
import xgboost as xgb  # for extreme gradient boosting

# For exploratory data analysis (imports currently commented out)
#from dataprep.eda import plot, plot_correlation, create_report, plot_missing
#from dataprep.datasets import load_dataset
#from dataprep.eda import create_report
#from numba import generated_jit
#from pandas_profiling import ProfileReport

# Visualization libraries
import matplotlib.pyplot as plt  # for plotting graphs
import seaborn as sns  # for creating attractive and informative statistical graphics
import plotly.graph_objs as go
import plotly.figure_factory as ff
from plotly import tools
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)

# Setting display options
from pandas import set_option
plt.style.use('ggplot')  # setting plot style, as used in R's Tidyverse

# Scikit-learn libraries for machine learning tasks
from sklearn.model_selection import train_test_split  # to split the dataset into training and testing sets
from sklearn.linear_model import LogisticRegression  # to apply logistic regression model
from sklearn.feature_selection import RFE  # for recursive feature elimination
from sklearn.model_selection import KFold  # for k-fold cross-validation
from sklearn.model_selection import GridSearchCV  # for hyperparameter tuning using grid search
from sklearn.model_selection import RandomizedSearchCV  # for hyperparameter tuning using randomized search
from sklearn.preprocessing import StandardScaler  # for data normalization
from sklearn.pipeline import Pipeline  # for creating machine learning pipelines
from sklearn.ensemble import RandomForestClassifier  # for applying random forest classification
from xgboost import XGBClassifier  # for XGBoost classifier
from sklearn.model_selection import cross_val_score  # for cross-validation
from sklearn.metrics import classification_report  # for model evaluation metrics
from sklearn.metrics import confusion_matrix  # for confusion matrix
from sklearn.neighbors import KNeighborsClassifier  # for k-nearest neighbors classifier
from sklearn.tree import DecisionTreeClassifier  # for decision tree classifier
from sklearn.ensemble import ExtraTreesClassifier  # for extra trees classifier
from sklearn.feature_selection import SelectFromModel  # for feature selection from model
from sklearn import metrics  # for evaluating model performance




In [2]:
###Importing dataset
data = 'C:/Users/u0135988/OneDrive - KU Leuven/Research/Quant/Methods_Quant_Fin/Example - Credit Card Default - Keggle/UCI_Credit_Card.csv'
data_df = pd.read_csv(data)

print("Default Credit Card data -  rows:",data_df.shape[0]," columns:", data_df.shape[1])

Default Credit Card data -  rows: 30000  columns: 25
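
After loading, a few quick integrity checks can confirm that the file was read as expected. This is only a sketch: the helper `basic_checks` and the toy frame are hypothetical, and in the notebook one would pass `data_df` instead.

```python
import pandas as pd


def basic_checks(df: pd.DataFrame, id_col: str = "ID") -> dict:
    """Collect a few simple integrity checks (hypothetical helper)."""
    return {
        "n_rows": df.shape[0],
        "n_cols": df.shape[1],
        "n_missing": int(df.isna().sum().sum()),  # total number of missing cells
        "n_duplicate_ids": int(df[id_col].duplicated().sum()),  # repeated client IDs
    }


# Toy frame for illustration only; in the notebook one would pass data_df.
toy = pd.DataFrame({"ID": [1, 2, 2], "LIMIT_BAL": [20000.0, None, 90000.0]})
checks = basic_checks(toy)
print(checks)
```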


In [3]:
###Getting a glimpse of the data

#Show first rows
data_df.head()




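| | ID | LIMIT_BAL | SEX | EDUCATION | MARRIAGE | AGE | PAY_0 | PAY_2 | PAY_3 | PAY_4 | ... | BILL_AMT4 | BILL_AMT5 | BILL_AMT6 | PAY_AMT1 | PAY_AMT2 | PAY_AMT3 | PAY_AMT4 | PAY_AMT5 | PAY_AMT6 | default.payment.next.month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 20000.0 | 2 | 2 | 1 | 24 | 2 | 2 | -1 | -1 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 689.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1 |
| 1 | 2 | 120000.0 | 2 | 2 | 2 | 26 | -1 | 2 | 0 | 0 | ... | 3272.0 | 3455.0 | 3261.0 | 0.0 | 1000.0 | 1000.0 | 1000.0 | 0.0 | 2000.0 | 1 |
| 2 | 3 | 90000.0 | 2 | 2 | 2 | 34 | 0 | 0 | 0 | 0 | ... | 14331.0 | 14948.0 | 15549.0 | 1518.0 | 1500.0 | 1000.0 | 1000.0 | 1000.0 | 5000.0 | 0 |
| 3 | 4 | 50000.0 | 2 | 2 | 1 | 37 | 0 | 0 | 0 | 0 | ... | 28314.0 | 28959.0 | 29547.0 | 2000.0 | 2019.0 | 1200.0 | 1100.0 | 1069.0 | 1000.0 | 0 |
| 4 | 5 | 50000.0 | 1 | 2 | 1 | 57 | -1 | 0 | -1 | 0 | ... | 20940.0 | 19146.0 | 19131.0 | 2000.0 | 36681.0 | 10000.0 | 9000.0 | 689.0 | 679.0 | 0 |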


In [5]:
#Describe variables with main summary statistics

pd.options.display.float_format = '{:.2f}'.format  # show floats with two decimal places
data_df.describe()

#As a reminder:
#LIMIT_BAL is the credit limit; SEX: 1=male, 2=female; MARRIAGE: 1=married, 2=single, 3=others
#PAY_t is the repayment status (-1 = paid duly, >0 = number of months of payment delay)
#BILL_AMTt is the bill statement amount; PAY_AMTt is the amount of the previous payment
#outcome variable: default.payment.next.month


| | ID | LIMIT_BAL | SEX | EDUCATION | MARRIAGE | AGE | PAY_0 | PAY_2 | PAY_3 | PAY_4 | ... | BILL_AMT4 | BILL_AMT5 | BILL_AMT6 | PAY_AMT1 | PAY_AMT2 | PAY_AMT3 | PAY_AMT4 | PAY_AMT5 | PAY_AMT6 | default.payment.next.month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 30000.0 | 30000.0 | 30000.0 | 30000.0 | 30000.0 | 30000.0 | 30000.0 | 30000.0 | 30000.0 | 30000.0 | ... | 30000.0 | 30000.0 | 30000.0 | 30000.0 | 30000.0 | 30000.0 | 30000.0 | 30000.0 | 30000.0 | 30000.0 |
| mean | 15000.5 | 167484.32 | 1.6 | 1.85 | 1.55 | 35.49 | -0.02 | -0.13 | -0.17 | -0.22 | ... | 43262.95 | 40311.4 | 38871.76 | 5663.58 | 5921.16 | 5225.68 | 4826.08 | 4799.39 | 5215.5 | 0.22 |
| std | 8660.4 | 129747.66 | 0.49 | 0.79 | 0.52 | 9.22 | 1.12 | 1.2 | 1.2 | 1.17 | ... | 64332.86 | 60797.16 | 59554.11 | 16563.28 | 23040.87 | 17606.96 | 15666.16 | 15278.31 | 17777.47 | 0.42 |
| min | 1.0 | 10000.0 | 1.0 | 0.0 | 0.0 | 21.0 | -2.0 | -2.0 | -2.0 | -2.0 | ... | -170000.0 | -81334.0 | -339603.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 7500.75 | 50000.0 | 1.0 | 1.0 | 1.0 | 28.0 | -1.0 | -1.0 | -1.0 | -1.0 | ... | 2326.75 | 1763.0 | 1256.0 | 1000.0 | 833.0 | 390.0 | 296.0 | 252.5 | 117.75 | 0.0 |
| 50% | 15000.5 | 140000.0 | 2.0 | 2.0 | 2.0 | 34.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 19052.0 | 18104.5 | 17071.0 | 2100.0 | 2009.0 | 1800.0 | 1500.0 | 1500.0 | 1500.0 | 0.0 |
| 75% | 22500.25 | 240000.0 | 2.0 | 2.0 | 2.0 | 41.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 54506.0 | 50190.5 | 49198.25 | 5006.0 | 5000.0 | 4505.0 | 4013.25 | 4031.5 | 4000.0 | 0.0 |
| max | 30000.0 | 1000000.0 | 2.0 | 6.0 | 3.0 | 79.0 | 8.0 | 8.0 | 8.0 | 8.0 | ... | 891586.0 | 927171.0 | 961664.0 | 873552.0 | 1684259.0 | 896040.0 | 621000.0 | 426529.0 | 528666.0 | 1.0 |
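
The summary table shows minimum values of 0 for EDUCATION and MARRIAGE and of -2 for the PAY_* columns, and a median of 0 for the PAY_* columns, none of which appear in the documented coding. A possible way to quantify such values is sketched below; the helper `undocumented_counts` and the toy rows are hypothetical, only `PAY_0` is checked for brevity, and how to treat these codes (recode, drop, or keep) is left open here.

```python
import pandas as pd

# Documented codes, as given in the variable list; anything else is flagged.
DOCUMENTED = {
    "EDUCATION": [1, 2, 3, 4, 5, 6],
    "MARRIAGE": [1, 2, 3],
    "PAY_0": [-1, 1, 2, 3, 4, 5, 6, 7, 8, 9],
}


def undocumented_counts(df: pd.DataFrame) -> dict:
    """Count, per column, the values that fall outside the documented codes."""
    result = {}
    for col, allowed in DOCUMENTED.items():
        outside = df.loc[~df[col].isin(allowed), col]
        result[col] = outside.value_counts().sort_index().to_dict()
    return result


# Toy rows for illustration only; in the notebook one would pass data_df.
toy = pd.DataFrame({
    "EDUCATION": [0, 2, 2, 6],
    "MARRIAGE": [1, 0, 2, 3],
    "PAY_0": [-2, 0, 0, 2],
})
report = undocumented_counts(toy)
print(report)
```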



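We see from the data that individuals had an average credit limit of 167,484 NTD (~4,785 EUR at 35 NTD per EUR) and an average age of ~35 years. Most individuals were female (mean SEX of 1.6, where 2=female). The median MARRIAGE code is 2 (single), so married clients do not form the majority of the sample.

From the outcome variable, we see that around 22% of our sample is flagged as defaulting on the next month's payment (`default.payment.next.month` = 1).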
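
The currency conversion and the default share discussed above can be checked with a few lines. This is a rough sketch: the helper names `ntd_to_eur` and `default_rate` are hypothetical, the rate of 35 NTD per EUR is the approximate figure quoted in the introduction, and the toy target is made up for illustration.

```python
import pandas as pd

NTD_PER_EUR = 35.0  # approximate 2024 rate quoted in the introduction


def ntd_to_eur(amount_ntd: float, rate: float = NTD_PER_EUR) -> float:
    """Convert an NTD amount to EUR at a fixed, approximate rate."""
    return amount_ntd / rate


def default_rate(target: pd.Series) -> float:
    """Share of observations with target == 1 (the mean of a 0/1 column)."""
    return float(target.mean())


# Mean LIMIT_BAL as reported by describe() above.
mean_limit_eur = ntd_to_eur(167484.32)
print(round(mean_limit_eur))

# Toy target for illustration only; in the notebook one would pass
# data_df["default.payment.next.month"].
toy_target = pd.Series([1, 0, 0, 0, 1, 0, 0, 0, 0, 0])
print(default_rate(toy_target))
```

Because the outcome is coded 0/1, its mean is the default share, which is why the `mean` row of the summary table can be read directly as a rate.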

But it is important to understand better 


In [19]:
#plot(data_df)


In [20]:
#report = ProfileReport(data_df)
#report
