# Introducing the credit card default dataset

### Data Set Information:

**This research studied customers' default payments in Taiwan.**

### Features description:

- LIMIT_BAL: Amount of the given credit (NT dollar): it includes both the individual consumer credit and his/her family (supplementary) credit. 
- SEX: Gender (1 = male; 2 = female). 
- EDUCATION: Education (1 = graduate school; 2 = university; 3 = high school; 4 = others). 
- MARRIAGE: Marital status (1 = married; 2 = single; 3 = others). 
- AGE: Age (year). 
- PAY_1 - PAY_6: History of past payment. We tracked the past monthly payment records (from April to September, 2005) as follows: PAY_1 = the repayment status in September, 2005; PAY_2 = the repayment status in August, 2005; . . .; PAY_6 = the repayment status in April, 2005. The measurement scale for the repayment status is: -1 = pay duly; 1 = payment delay for one month; 2 = payment delay for two months; . . .; 8 = payment delay for eight months; 9 = payment delay for nine months and above.
- BILL_AMT1-BILL_AMT6: Amount of bill statement (NT dollar). BILL_AMT1 = amount of bill statement in September, 2005; BILL_AMT2 = amount of bill statement in August, 2005; . . .; BILL_AMT6 = amount of bill statement in April, 2005. 
- PAY_AMT1-PAY_AMT6: Amount of previous payment (NT dollar).
- default payment next month: **positive class: default | negative class: pay**
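
The codebook above can be checked against the data before any encoding. The value counts later in this notebook show codes the description does not list (for example EDUCATION values 0, 5 and 6), so a check like the following is worth running early. This is a minimal sketch: the helper `undocumented_codes` and the toy frame are hypothetical, and the documented codes are copied from the list above.

```python
import pandas as pd

# Documented codes, copied from the feature description above.
DOCUMENTED_CODES = {
    "SEX": {1, 2},
    "EDUCATION": {1, 2, 3, 4},
    "MARRIAGE": {1, 2, 3},
}

def undocumented_codes(df, codebook):
    """Return, per column, the sorted values present in df but absent from the codebook."""
    return {
        col: sorted(set(df[col].tolist()) - allowed)
        for col, allowed in codebook.items()
    }

# Hypothetical toy frame standing in for the real file.
toy = pd.DataFrame({
    "SEX": [1, 2, 2],
    "EDUCATION": [1, 5, 0],
    "MARRIAGE": [1, 2, 3],
})

extra = undocumented_codes(toy, DOCUMENTED_CODES)
print(extra)
```

On the real data the same call would be made on the frame returned by `pd.read_csv`, before the column names are lowercased.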

In [19]:
import numpy as np
import pandas as pd
import os

In [20]:
DATA_DIR = '../data'
FILE_NAME = 'credit_card_default.csv'
data_path = os.path.join(DATA_DIR, FILE_NAME)
ccd = pd.read_csv(data_path, index_col="ID")
ccd.head()

ID,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_1,PAY_2,PAY_3,PAY_4,PAY_5,...,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default payment next month
1,20000,2,2,1,24,2,2,-1,-1,-2,...,0,0,0,0,689,0,0,0,0,1
2,120000,2,2,2,26,-1,2,0,0,0,...,3272,3455,3261,0,1000,1000,1000,0,2000,1
3,90000,2,2,2,34,0,0,0,0,0,...,14331,14948,15549,1518,1500,1000,1000,1000,5000,0
4,50000,2,2,1,37,0,0,0,0,0,...,28314,28959,29547,2000,2019,1200,1100,1069,1000,0
5,50000,1,2,1,57,-1,0,-1,0,0,...,20940,19146,19131,2000,36681,10000,9000,689,679,0


In [21]:
ccd.shape

(30000, 24)

In [22]:
ccd.rename(columns=lambda x: x.lower(), inplace=True)

## Numerical features

In [23]:
bill_amt_features = ['bill_amt'+ str(i) for i in range(1,7)]
pay_amt_features = ['pay_amt'+ str(i) for i in range(1,7)]
numerical_features = ['limit_bal','age'] + bill_amt_features + pay_amt_features

In [24]:
ccd[['limit_bal','age']].describe()

statistic,limit_bal,age
count,30000.0,30000.0
mean,167484.322667,35.4855
std,129747.661567,9.217904
min,10000.0,21.0
25%,50000.0,28.0
50%,140000.0,34.0
75%,240000.0,41.0
max,1000000.0,79.0


In [25]:
ccd[bill_amt_features].describe().round()

statistic,bill_amt1,bill_amt2,bill_amt3,bill_amt4,bill_amt5,bill_amt6
count,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0
mean,51223.0,49179.0,47013.0,43263.0,40311.0,38872.0
std,73636.0,71174.0,69349.0,64333.0,60797.0,59554.0
min,-165580.0,-69777.0,-157264.0,-170000.0,-81334.0,-339603.0
25%,3559.0,2985.0,2666.0,2327.0,1763.0,1256.0
50%,22382.0,21200.0,20088.0,19052.0,18104.0,17071.0
75%,67091.0,64006.0,60165.0,54506.0,50190.0,49198.0
max,964511.0,983931.0,1664089.0,891586.0,927171.0,961664.0


In [26]:
ccd[pay_amt_features].describe().round()

statistic,pay_amt1,pay_amt2,pay_amt3,pay_amt4,pay_amt5,pay_amt6
count,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0
mean,5664.0,5921.0,5226.0,4826.0,4799.0,5216.0
std,16563.0,23041.0,17607.0,15666.0,15278.0,17777.0
min,0.0,0.0,0.0,0.0,0.0,0.0
25%,1000.0,833.0,390.0,296.0,252.0,118.0
50%,2100.0,2009.0,1800.0,1500.0,1500.0,1500.0
75%,5006.0,5000.0,4505.0,4013.0,4032.0,4000.0
max,873552.0,1684259.0,896040.0,621000.0,426529.0,528666.0


## Encoding categorical features

In [27]:
ccd['male'] = (ccd['sex'] == 1).astype('int')
ccd['male'].head(n=10)

ID
1     0
2     0
3     0
4     0
5     1
6     1
7     1
8     0
9     0
10    1
Name: male, dtype: int32

In [28]:
ccd['male'].mean()

0.39626666666666666

In [29]:
ccd['education'].value_counts(sort=False)

0       14
1    10585
2    14030
3     4917
4      123
5      280
6       51
Name: education, dtype: int64

In [30]:
ccd['grad_school'] = (ccd['education'] == 1).astype('int')
ccd['university'] = (ccd['education'] == 2).astype('int')
ccd['high_school'] = (ccd['education'] == 3).astype('int')

In [31]:
# rows that fall outside the three education indicators
ccd.loc[(ccd['grad_school']==0) & (ccd['university']==0) & (ccd['high_school']==0), 'education'].head()

ID
48     5
70     5
359    4
386    5
449    4
Name: education, dtype: int64
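
The three indicators leave every other education code (0, 4, 5 and 6 in the counts above) as an implicit baseline category. The sketch below makes that baseline explicit on a toy frame so the grouping can be verified. The column `education_other` is a hypothetical name used only for illustration; it is not created in the notebook.

```python
import pandas as pd

# Toy frame with one row per education code seen in the value counts.
toy = pd.DataFrame({"education": [0, 1, 2, 3, 4, 5, 6]})

toy["grad_school"] = (toy["education"] == 1).astype(int)
toy["university"] = (toy["education"] == 2).astype(int)
toy["high_school"] = (toy["education"] == 3).astype(int)

# Hypothetical explicit baseline: everything that is not code 1, 2 or 3.
toy["education_other"] = (~toy["education"].isin([1, 2, 3])).astype(int)

indicator_cols = ["grad_school", "university", "high_school", "education_other"]

# Each row belongs to exactly one group.
rows_covered_once = bool((toy[indicator_cols].sum(axis=1) == 1).all())
other_codes = toy.loc[toy["education_other"] == 1, "education"].tolist()
print(rows_covered_once, other_codes)
```

For a linear model the explicit baseline column would normally be left out, since it is fully determined by the other three.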

## Low variance features

In [32]:
ccd['marriage'].value_counts(sort=False)

1    13713
2    15964
3      323
Name: marriage, dtype: int64

In [33]:
ccd['single'] = (ccd['marriage'] == 2).astype('int')
ccd['marital_other'] = (ccd['marriage'] == 3).astype('int')

In [34]:
print("Proportion of singles: ", ccd['single'].mean())
print("Proportion of other marital status: ", ccd['marital_other'].mean())

Proportion of singles:  0.5321333333333333
Proportion of other marital status:  0.010766666666666667


In [35]:
ccd['married'] = (ccd['marriage'] == 1).astype('int')
print(ccd['married'].var())
print(ccd['single'].var())

0.24816786226195736
0.24897574808047968
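
For a 0/1 indicator with a share p of ones, the variance is p(1 - p), and pandas' `var()` applies the sample correction n / (n - 1) by default. The sketch below recomputes the variances from the `marriage` counts shown earlier, which is why an indicator with very few ones, such as `marital_other`, carries little variance. The helper name `indicator_sample_variance` is hypothetical.

```python
def indicator_sample_variance(ones: int, n: int) -> float:
    """Sample variance (ddof=1) of a 0/1 column with `ones` ones among n rows."""
    p = ones / n
    return p * (1 - p) * n / (n - 1)

N_ROWS = 30000
# Counts taken from the marriage value counts above.
counts = {"married": 13713, "single": 15964, "marital_other": 323}

variances = {
    name: indicator_sample_variance(count, N_ROWS)
    for name, count in counts.items()
}
for name, value in variances.items():
    print(f"{name}: {value:.6f}")
```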


In [36]:
(ccd['married'] == (1 - ccd['single'])).mean()

0.9892333333333333
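
The agreement rate above follows from the counts: `married` equals `1 - single` on every row except those with `marriage == 3`, so the share is 1 - 323/30000. A small sketch on a toy frame (hypothetical data) shows that the disagreeing rows are exactly the `marital_other` rows, which is why `married` adds almost no information beyond `single`.

```python
import pandas as pd

# Toy frame: two married, two single, one 'other'.
toy = pd.DataFrame({"marriage": [1, 2, 2, 3, 1]})
for name, code in [("married", 1), ("single", 2), ("marital_other", 3)]:
    toy[name] = (toy["marriage"] == code).astype(int)

agree = toy["married"] == (1 - toy["single"])

# Disagreement happens exactly on the 'other' rows.
mismatch_is_other = bool(((~agree) == (toy["marital_other"] == 1)).all())
toy_share = agree.mean()

# The same reasoning applied to the counts of the full dataset.
full_share = 1 - 323 / 30000
print(mismatch_is_other, toy_share, full_share)
```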

## A brief introduction to Feature Engineering

In [37]:
ccd['pay_1'].value_counts().sort_index()

-2     2759
-1     5686
 0    14737
 1     3688
 2     2667
 3      322
 4       76
 5       26
 6       11
 7        9
 8       19
Name: pay_1, dtype: int64

In [38]:
# collapse all non-positive pay_i codes (-2, -1, 0: no payment delay) into 0
pay_features= ['pay_' + str(i) for i in range(1,7)]
for x in pay_features:
    ccd.loc[ccd[x] <= 0, x] = 0

In [39]:
# producing delayed features
delayed_features = ['delayed_' + str(i) for i in range(1,7)]
for pay, delayed in zip(pay_features, delayed_features):
    ccd[delayed] = (ccd[pay] > 0).astype(int)

In [44]:
ccd[delayed_features].mean()

delayed_1    0.227267
delayed_2    0.147933
delayed_3    0.140433
delayed_4    0.117000
delayed_5    0.098933
delayed_6    0.102633
dtype: float64

In [None]:
ccd['months_delayed'] = ccd[delayed_features].sum(axis=1)
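
The three steps of this section (zeroing the non-positive codes, building the delay indicators, and summing them) can be traced end to end on a small example. This is a sketch on hypothetical toy rows; it uses `DataFrame.clip(lower=0)` as a vectorized alternative to the loop above, which gives the same result for integer codes.

```python
import pandas as pd

pay_features = [f"pay_{i}" for i in range(1, 7)]
delayed_features = [f"delayed_{i}" for i in range(1, 7)]

# Hypothetical repayment codes for three customers.
toy = pd.DataFrame(
    [
        [2, 2, -1, -1, -2, -2],  # delayed in the two most recent months
        [0, 0, 0, 0, 0, 0],      # never delayed
        [1, -1, 3, 0, 0, 2],     # delayed in three separate months
    ],
    columns=pay_features,
)

# Step 1: every non-positive code becomes 0.
toy[pay_features] = toy[pay_features].clip(lower=0)

# Step 2: one 0/1 indicator per month.
delayed = (toy[pay_features] > 0).astype(int)
delayed.columns = delayed_features
toy = toy.join(delayed)

# Step 3: number of months with a delay.
toy["months_delayed"] = toy[delayed_features].sum(axis=1)
print(toy["months_delayed"].tolist())
```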
