# Default deadline for monthly credit card payment

The data source is the <a href="https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients#">Default of Credit Card Clients Data Set</a>, which consists of 30,000 records of Taiwanese bank clients. For each client, the bank wants to predict whether they will meet the default deadline for paying their credit card debt next month.

**Brief data description**
- Label: Yes (1, will meet the default deadline) or No (0, will not meet the default deadline)

- X1: Amount of the given credit (NT dollar): it includes both the individual consumer credit and his/her family (supplementary) credit.

- X2: Gender (1 = male; 2 = female).

- X3: Education (1 = graduate school; 2 = university; 3 = high school; 4 = others).

- X4: Marital status (1 = married; 2 = single; 3 = others).

- X5: Age (year).

- X6-X11: History of past payment. We tracked the past monthly payment records (from April to September, 2005) as follows: X6 = the repayment status in September, 2005; X7 = the repayment status in August, 2005; ...; X11 = the repayment status in April, 2005. The measurement scale for the repayment status is: -1 = pay duly; 1 = payment delay for one month; 2 = payment delay for two months; ...; 8 = payment delay for eight months; 9 = payment delay for nine months and above.

- X12-X17: Amount of bill statement (NT dollar). X12 = amount of bill statement in September, 2005; X13 = amount of bill statement in August, 2005; ...; X17 = amount of bill statement in April, 2005.

- X18-X23: Amount of previous payment (NT dollar). X18 = amount paid in September, 2005; X19 = amount paid in August, 2005; ...; X23 = amount paid in April, 2005.

### Step 1: Load the data (Pandas) 

In [3]:
import pandas as pd

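# loading the dataset with the Pandas read_csv function;
# [1:] drops the first data row, which appears to hold a second header line
# (this would explain why every column is initially read as object, see the dtypes printed below)
dataset = pd.read_csv("C:/Users/arnav/Downloads/Python_Arnav/meeting_default_of_credit_card_clients.csv", delimiter = ",")[1:]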

# check default data types for columns and adjust them if needed
print(dataset.dtypes)

# adjustment: categorical features (each value denotes one of N possible discrete values)
for col in ['X2', 'X3', 'X4']:
    dataset[col] = dataset[col].astype('category')

# converting the remaining numeric features (X1 and X5-X23) to float
for col in ['X1'] + ['X' + str(i) for i in range(5, 24)]:
    dataset[col] = dataset[col].astype('float')

dataset.Y = dataset.Y.astype('int32')

print(dataset.dtypes)

# since the other features are numeric, it is a good idea to translate the categorical features into one-hot-encoded features
gender_oh = pd.get_dummies(dataset.X2, prefix='Gender')
edu_oh = pd.get_dummies(dataset.X3, prefix='Education')
married_oh = pd.get_dummies(dataset.X4, prefix='MarStat')

dataset = pd.concat([dataset, gender_oh, edu_oh, married_oh], axis = 1)

dataset = dataset.drop(['X2', 'X3', 'X4'], axis = 1)

print(dataset.dtypes)
dataset



Unnamed: 0    object
X1            object
X2            object
X3            object
X4            object
X5            object
X6            object
X7            object
X8            object
X9            object
X10           object
X11           object
X12           object
X13           object
X14           object
X15           object
X16           object
X17           object
X18           object
X19           object
X20           object
X21           object
X22           object
X23           object
Y             object
dtype: object
Unnamed: 0      object
X1             float64
X2            category
X3            category
X4            category
X5             float64
X6             float64
X7             float64
X8             float64
X9             float64
X10            float64
X11            float64
X12            float64
X13            float64
X14            float64
X15            float64
X16            float64
X17            float64
X18            float64
X19            float64
...

Unnamed: 0.1,Unnamed: 0,X1,X5,X6,X7,X8,X9,X10,X11,X12,...,Education_1,Education_2,Education_3,Education_4,Education_5,Education_6,MarStat_0,MarStat_1,MarStat_2,MarStat_3
1,1,20000.0,24.0,2.0,2.0,-1.0,-1.0,-2.0,-2.0,3913.0,...,0,1,0,0,0,0,0,1,0,0
2,2,120000.0,26.0,-1.0,2.0,0.0,0.0,0.0,2.0,2682.0,...,0,1,0,0,0,0,0,0,1,0
3,3,90000.0,34.0,0.0,0.0,0.0,0.0,0.0,0.0,29239.0,...,0,1,0,0,0,0,0,0,1,0
4,4,50000.0,37.0,0.0,0.0,0.0,0.0,0.0,0.0,46990.0,...,0,1,0,0,0,0,0,1,0,0
5,5,50000.0,57.0,-1.0,0.0,-1.0,0.0,0.0,0.0,8617.0,...,0,1,0,0,0,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
29996,29996,220000.0,39.0,0.0,0.0,0.0,0.0,0.0,0.0,188948.0,...,0,0,1,0,0,0,0,1,0,0
29997,29997,150000.0,43.0,-1.0,-1.0,-1.0,-1.0,0.0,0.0,1683.0,...,0,0,1,0,0,0,0,0,1,0
29998,29998,30000.0,37.0,4.0,3.0,2.0,-1.0,0.0,0.0,3565.0,...,0,1,0,0,0,0,0,0,1,0
29999,29999,80000.0,41.0,1.0,-1.0,0.0,0.0,0.0,-1.0,-1645.0,...,0,0,1,0,0,0,0,1,0,0
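
To make the one-hot step concrete, here is a minimal pure-Python sketch of what `pd.get_dummies` produces conceptually: one 0/1 indicator column per distinct value that occurs in the data. The helper `one_hot` and the toy values are illustrative assumptions, not part of the notebook. It also suggests why columns such as `Education_5`, `Education_6` or `MarStat_0` show up above even though the data description does not list those codes: an indicator column is created for every value found in the column.

```python
def one_hot(values, prefix):
    """Return {column_name: [0/1, ...]} with one indicator column per distinct value."""
    categories = sorted(set(values))
    return {
        f"{prefix}_{cat}": [1 if v == cat else 0 for v in values]
        for cat in categories
    }

# toy education codes (made up); 0 is not listed in the data description
# but still gets its own column because it occurs in the data
education = [2, 1, 2, 0, 3]
encoded = one_hot(education, "Education")

for name, column in encoded.items():
    print(name, column)
```

Each row has exactly one 1 across the indicator columns of a feature, so the columns of one feature always sum to 1.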


### Step 2: Data Exploration and Rule-Based Classification

Let's spend five minutes analyzing the data: 

- Features X6-X11 indicate whether the payments were on time in previous months -- negative values (-1) indicate that the payments were on time

- Features X12-X17 hold the bill statement amounts for the previous months. Low amounts (or 0, when nothing was billed) could identify people who generally meet their payment deadlines

Can we think of a few simple rules (let's call it "expert knowledge") that would predict, at least roughly, whether the default deadline will be met for the next month?

How about the following: 

1. Predict timely payment for the next month if most payments (more than 3?) for the previous months have been timely 
2. Predict timely payment if the amounts spent in previous months have been small enough (what would be small enough?)
3. 1) AND 2)?
4. 1) OR 2)?
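
Options 3 and 4 can be sketched as follows, assuming each rule has already produced a list of 0/1 predictions. The helper names `combine_and` and `combine_or` and the toy predictions are hypothetical and only serve as an illustration.

```python
def combine_and(pred_a, pred_b):
    """Predict 1 only where both rules predict 1."""
    return [1 if a == 1 and b == 1 else 0 for a, b in zip(pred_a, pred_b)]

def combine_or(pred_a, pred_b):
    """Predict 1 where at least one of the rules predicts 1."""
    return [1 if a == 1 or b == 1 else 0 for a, b in zip(pred_a, pred_b)]

# hypothetical predictions of rule 1 and rule 2 for five clients
rule_1 = [1, 1, 0, 0, 1]
rule_2 = [1, 0, 1, 0, 0]

pred_and = combine_and(rule_1, rule_2)
pred_or = combine_or(rule_1, rule_2)
print(pred_and)  # [1, 0, 0, 0, 0]
print(pred_or)   # [1, 1, 1, 0, 1]
```

The AND combination can never predict more positives than either rule alone, so its recall cannot go up. The OR combination can never predict fewer, so its recall cannot go down. What happens to precision depends on the data.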


In [4]:
# Let us first define the evaluation: 
def evaluate(pred_field):
    pred = dataset[pred_field]
    gold = dataset.Y

    # vectorized counts instead of row-wise apply (same numbers, much faster)
    correct = (pred == gold).sum()

    tp = ((pred == 1) & (gold == 1)).sum()
    fp = ((pred == 1) & (gold == 0)).sum()
    fn = ((pred == 0) & (gold == 1)).sum()

    # guard against division by zero when a rule never predicts (or never hits) label 1
    prec = tp / (tp + fp) if tp + fp > 0 else 0.0
    rec = tp / (tp + fn) if tp + fn > 0 else 0.0
    f1 = (2 * prec * rec) / (prec + rec) if prec + rec > 0 else 0.0

    print("Accuracy: " + str(correct / len(dataset)))
    print("Precision: " + str(prec))
    print("Recall: " + str(rec))
    print("F1: " + str(f1))
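
For reference, the quantities printed by `evaluate` follow the usual definitions, where TP, FP and FN are the true positives, false positives and false negatives for the positive class (label 1) and N is the number of rows:

```latex
\begin{aligned}
\text{Accuracy}  &= \frac{\text{number of correct predictions}}{N} \\
\text{Precision} &= \frac{TP}{TP + FP} \\
\text{Recall}    &= \frac{TP}{TP + FN} \\
F_1 &= \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
\end{aligned}
```

Precision is undefined when nothing is predicted as 1 (TP + FP = 0), and recall is undefined when the data contains no rows with label 1 (TP + FN = 0).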
    

In [5]:
# Rule-based baseline #1: we predict the payment will be on time if at least N of previous month payments were on time

# for a given row (client), function computes the number of previous months when payment was on time (value -1)
def on_time_previous_months(x):
    feats = ['X6', 'X7', 'X8', 'X9', 'X10', 'X11']
    num_on_time = sum(x[feats].apply(lambda z: 1 if z == -1 else 0))
    return num_on_time

# create an additional column in which we store the number of previous months the payment was on time
dataset["Def_prev"] = dataset.apply(lambda x: on_time_previous_months(x), axis = 1)


In [6]:
# how many previous months (N) of "on-time" payment is enough to predict that the next month payment will also be on time?
N = 3

dataset["Pred_1"] = dataset.apply(lambda x: 1 if x.Def_prev >= N else 0, axis = 1)

# let's evaluate how accurate our predictions are: 
evaluate("Pred_1")

Accuracy: 0.6391
Precision: 0.16174334140435837
Recall: 0.15099457504520797
F1: 0.15618424129062428


In [7]:
# Rule-based baseline #2: we predict the payment will be on time if the amounts were small enough in previous months

# for a given row (client), function computes the average billing amount in previous months
def avg_amount_previous_months(x):
    feats = ['X12', 'X13', 'X14', 'X15', 'X16', 'X17']
    avg_amnt = sum(x[feats]) / len(feats)
    return avg_amnt

# create an additional column in which we store the average billing amount over the previous months
dataset["Amnt_prev"] = dataset.apply(lambda x: avg_amount_previous_months(x), axis = 1)


In [8]:
# which average amount (amnt) billed in previous months should serve as the threshold?
amnt = 10000

dataset["Pred_2"] = dataset.apply(lambda x: 1 if x.Amnt_prev <= amnt else 0, axis = 1)

# let's evaluate how accurate our predictions are: 
evaluate("Pred_2")

Accuracy: 0.5876666666666667
Precision: 0.22517254601226994
Recall: 0.3539783001808318
F1: 0.27525193344269977
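
Instead of guessing a single value for `amnt`, one could try several thresholds and compare the resulting F1 scores. The following is a minimal sketch on made-up toy data; the helper names `f1_score_for` and `sweep_thresholds`, the amounts and the labels are all assumptions for illustration.

```python
def f1_score_for(predictions, gold):
    """F1 for the positive class (label 1); 0.0 when it is undefined."""
    tp = sum(1 for p, g in zip(predictions, gold) if p == 1 and g == 1)
    fp = sum(1 for p, g in zip(predictions, gold) if p == 1 and g == 0)
    fn = sum(1 for p, g in zip(predictions, gold) if p == 0 and g == 1)
    if tp == 0:
        return 0.0
    prec = tp / (tp + fp)
    rec = tp / (tp + fn)
    return 2 * prec * rec / (prec + rec)

def sweep_thresholds(avg_amounts, gold, thresholds):
    """Return (threshold, F1) pairs for the rule 'predict 1 if average amount <= threshold'."""
    results = []
    for t in thresholds:
        preds = [1 if a <= t else 0 for a in avg_amounts]
        results.append((t, f1_score_for(preds, gold)))
    return results

# toy data (made up): average billed amounts and gold labels for six clients
avg_amounts = [500, 2000, 8000, 15000, 40000, 90000]
gold = [1, 1, 0, 1, 0, 0]

results = sweep_thresholds(avg_amounts, gold, [1000, 10000, 50000])
best_threshold, best_f1 = max(results, key=lambda pair: pair[1])

for t, f1 in results:
    print(t, round(f1, 3))
print("best threshold:", best_threshold)
```

A threshold picked this way is tuned to the data it was evaluated on, so it should ideally be chosen on one part of the data and checked on another.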


### Step 3: Supervised Machine Learning

Now we explore how much we can learn from the data with our supervised ML models: 

- k-Nearest Neighbours (kNN)
- Naive Bayes 
- Decision trees and random forests
- Linear regression, logistic regression and SVM (for these models we need to first transform the categorical features)

We will then investigate cross-validation: a fixed train/test split vs. k-fold CV.

Let us first split our data into train and test portions!

In [9]:
# we select only the input columns corresponding to real features we are going to use
input_features = ["X1"] + ["X" + str(i+1) for i in range(4, 23)]
input_features.extend(["Gender_1", "Gender_2", 
                       "Education_0", "Education_1", "Education_2", "Education_3", "Education_4", "Education_5", "Education_6",
                       "MarStat_0", "MarStat_1", "MarStat_2", "MarStat_3"])

print(input_features)

input_data = dataset[input_features]
output_data = dataset["Y"]


['X1', 'X5', 'X6', 'X7', 'X8', 'X9', 'X10', 'X11', 'X12', 'X13', 'X14', 'X15', 'X16', 'X17', 'X18', 'X19', 'X20', 'X21', 'X22', 'X23', 'Gender_1', 'Gender_2', 'Education_0', 'Education_1', 'Education_2', 'Education_3', 'Education_4', 'Education_5', 'Education_6', 'MarStat_0', 'MarStat_1', 'MarStat_2', 'MarStat_3']


In [10]:
# numeric normalization of columns (z-score: zero mean, unit variance per column)

from scipy.stats import zscore

norm_input_data = input_data.copy()

for col in list(input_data.columns):
    norm_input_data[col] = zscore(norm_input_data[col])

norm_input_data

Unnamed: 0,X1,X5,X6,X7,X8,X9,X10,X11,X12,X13,...,Education_1,Education_2,Education_3,Education_4,Education_5,Education_6,MarStat_0,MarStat_1,MarStat_2,MarStat_3
1,-1.136720,-1.246020,1.794564,1.782348,-0.696663,-0.666599,-1.530046,-1.486041,-0.642501,-0.647399,...,-0.738375,1.066900,-0.442752,-0.064163,-0.097063,-0.041266,-0.042465,1.093780,-1.066471,-0.104326
2,-0.365981,-1.029047,-0.874991,1.782348,0.138865,0.188746,0.234917,1.992316,-0.659219,-0.666747,...,-0.738375,1.066900,-0.442752,-0.064163,-0.097063,-0.041266,-0.042465,-0.914261,0.937672,-0.104326
3,-0.597202,-0.161156,0.014861,0.111736,0.138865,0.188746,0.234917,0.253137,-0.298560,-0.493899,...,-0.738375,1.066900,-0.442752,-0.064163,-0.097063,-0.041266,-0.042465,-0.914261,0.937672,-0.104326
4,-0.905498,0.164303,0.014861,0.111736,0.138865,0.188746,0.234917,0.253137,-0.057491,-0.013293,...,-0.738375,1.066900,-0.442752,-0.064163,-0.097063,-0.041266,-0.042465,1.093780,-1.066471,-0.104326
5,-0.905498,2.334029,-0.874991,0.111736,-0.696663,0.188746,0.234917,0.253137,-0.578618,-0.611318,...,-0.738375,1.066900,-0.442752,-0.064163,-0.097063,-0.041266,-0.042465,1.093780,-1.066471,-0.104326
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
29996,0.404759,0.381275,0.014861,0.111736,0.138865,0.188746,0.234917,0.253137,1.870379,2.018136,...,-0.738375,-0.937295,2.258602,-0.064163,-0.097063,-0.041266,-0.042465,1.093780,-1.066471,-0.104326
29997,-0.134759,0.815221,-0.874991,-0.723570,-0.696663,-0.666599,0.234917,0.253137,-0.672786,-0.665299,...,-0.738375,-0.937295,2.258602,-0.064163,-0.097063,-0.041266,-0.042465,-0.914261,0.937672,-0.104326
29998,-1.059646,0.164303,3.574267,2.617654,1.809921,-0.666599,0.234917,0.253137,-0.647227,-0.643830,...,-0.738375,1.066900,-0.442752,-0.064163,-0.097063,-0.041266,-0.042465,-0.914261,0.937672,-0.104326
29999,-0.674276,0.598248,0.904712,-0.723570,0.138865,0.188746,0.234917,-0.616452,-0.717982,0.410269,...,-0.738375,-0.937295,2.258602,-0.064163,-0.097063,-0.041266,-0.042465,1.093780,-1.066471,-0.104326
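
Note that the cell above standardizes every column with the mean and standard deviation of all rows, including the rows that later end up in the test portion. A common alternative is to estimate these statistics on the training rows only and reuse them for the test rows. Below is a minimal pure-Python sketch of that idea for a single column; the helpers `fit_zscore` and `apply_zscore` and the toy values are assumptions for illustration.

```python
from statistics import mean, pstdev

def fit_zscore(train_column):
    """Estimate mean and population standard deviation from training values only."""
    mu = mean(train_column)
    sigma = pstdev(train_column)  # ddof=0, which is also the default of scipy.stats.zscore
    return mu, sigma

def apply_zscore(column, mu, sigma):
    """Standardize values with previously estimated statistics."""
    if sigma == 0:
        # constant column: every value equals the mean
        return [0.0 for _ in column]
    return [(v - mu) / sigma for v in column]

# toy credit limits (made up), split 80/20 in file order
values = [20000.0, 120000.0, 90000.0, 50000.0, 50000.0]
split = int(0.8 * len(values))
train, test = values[:split], values[split:]

mu, sigma = fit_zscore(train)
train_z = apply_zscore(train, mu, sigma)
test_z = apply_zscore(test, mu, sigma)

print(mu, sigma)
print(test_z)
```

With this variant the training column has exactly zero mean and unit variance, while the test column generally does not, because it is scaled with statistics it did not contribute to.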


In [11]:
# splitting the dataset into train and test portions
z_normalized = True

train_portion = 0.8
index_split = int(train_portion * len(dataset))

train_X = (norm_input_data if z_normalized else input_data)[:index_split]
train_Y = output_data[:index_split]

test_X = (norm_input_data if z_normalized else input_data)[index_split:]
test_Y = output_data[index_split:]

print("Size of training dataset: " + str(len(train_X)))
print("Size of test dataset: " + str(len(test_X)))

Size of training dataset: 24000
Size of test dataset: 6000
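
The fixed 80/20 split above evaluates on a single test portion. In k-fold cross-validation, the data is instead cut into k parts and each part serves as the test portion once. The sketch below only generates the row indices for such folds; the helper `k_fold_indices` is a hypothetical name, and contiguous, unshuffled folds are an assumption made to mirror the file-order split used here.

```python
def k_fold_indices(n_rows, k):
    """Yield (train_indices, test_indices) for k contiguous, non-overlapping folds."""
    # distribute the remainder over the first folds so that sizes differ by at most 1
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        test_idx = list(range(start, stop))
        train_idx = list(range(0, start)) + list(range(stop, n_rows))
        yield train_idx, test_idx
        start = stop

# with 10 rows and 5 folds, each row lands in the test portion exactly once
folds = list(k_fold_indices(10, 5))
for train_idx, test_idx in folds:
    print(len(train_idx), test_idx)
```

In practice one would usually shuffle the rows first. scikit-learn offers ready-made tools for this in `sklearn.model_selection`, for example `KFold` and `cross_val_score`.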


In [12]:
# Let's define the evaluation function for our supervised ML: 

def evaluate_supml(predictions, gold_labels):
    if len(predictions) != len(gold_labels): 
        raise ValueError("Incorrect number of predictions!")

    pairs = list(zip(predictions, gold_labels))
    correct = sum(1 for p, g in pairs if p == g)

    tp = sum(1 for p, g in pairs if p == 1 and g == 1)
    fp = sum(1 for p, g in pairs if p == 1 and g == 0)
    fn = sum(1 for p, g in pairs if p == 0 and g == 1)

    # guard against division by zero when label 1 is never predicted (or never hit)
    prec = tp / (tp + fp) if tp + fp > 0 else 0.0
    rec = tp / (tp + fn) if tp + fn > 0 else 0.0
    f1 = (2 * prec * rec) / (prec + rec) if prec + rec > 0 else 0.0

    print("Accuracy: " + str(correct / len(predictions)))
    print("Precision: " + str(prec))
    print("Recall: " + str(rec))
    print("F1: " + str(f1))


### kNN

In [13]:
import sklearn
from sklearn.neighbors import KNeighborsClassifier
# Documentation: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html

# instantiate the classifier
knn_classifier = KNeighborsClassifier(n_neighbors=10) # hyperparameter: n_neighbors

# "training" the classifier
knn_classifier.fit(train_X, train_Y)

# predict with the classifier
predictions = knn_classifier.predict(test_X)
evaluate_supml(predictions, test_Y.tolist())


Accuracy: 0.8178333333333333
Precision: 0.6719681908548708
Recall: 0.2669826224328594
F1: 0.3821368004522329
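
To illustrate what the `n_neighbors` hyperparameter controls, here is a minimal pure-Python sketch of the kNN idea: a query point receives the majority label among its k nearest training points. It assumes Euclidean distance and an unweighted vote, which to our knowledge corresponds to the scikit-learn defaults; the function `knn_predict` and the toy points are made up for illustration and are not how scikit-learn implements the classifier internally.

```python
from collections import Counter
import math

def knn_predict(train_X, train_Y, query, k):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    distances = [
        (math.dist(x, query), y)  # Euclidean distance (math.dist needs Python 3.8+)
        for x, y in zip(train_X, train_Y)
    ]
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [y for _, y in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# toy, already normalized 2-D points (made up)
toy_X = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (2.0, 2.1), (2.2, 1.9)]
toy_Y = [0, 0, 0, 1, 1]

print(knn_predict(toy_X, toy_Y, (0.1, 0.1), k=3))  # 0
print(knn_predict(toy_X, toy_Y, (2.1, 2.0), k=1))  # 1
print(knn_predict(toy_X, toy_Y, (2.1, 2.0), k=5))  # 0
```

The last call shows the effect of a k that is too large: with k equal to the size of the training set, the prediction collapses to the overall majority label, regardless of where the query point lies.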


### Naive Bayes

In [14]:
from sklearn.naive_bayes import GaussianNB
# Documentation: https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB

nb_classifier = GaussianNB()
nb_classifier.fit(train_X, train_Y)
predictions = nb_classifier.predict(test_X)

evaluate_supml(predictions, test_Y.tolist())

Accuracy: 0.2738333333333333
Precision: 0.22097851597761328
Recall: 0.966824644549763
F1: 0.35973548861131527


### Decision trees & random forests

In [15]:
from sklearn.tree import DecisionTreeClassifier

dt_classifier = DecisionTreeClassifier(max_depth = 3)
dt_classifier.fit(train_X, train_Y)
predictions = dt_classifier.predict(test_X)

evaluate_supml(predictions, test_Y.tolist())

Accuracy: 0.832
Precision: 0.6919642857142857
Recall: 0.36729857819905215
F1: 0.47987616099071206


In [16]:
from sklearn.ensemble import RandomForestClassifier

rf_classifier = RandomForestClassifier(n_estimators = 100, max_features = 15)
rf_classifier.fit(train_X, train_Y)
predictions = rf_classifier.predict(test_X)

evaluate_supml(predictions, test_Y.tolist())


Accuracy: 0.8266666666666667
Precision: 0.6596045197740112
Recall: 0.3688783570300158
F1: 0.47315096251266464


### Support Vector Machines (SVM)

In [17]:
from sklearn.svm import LinearSVC
from sklearn.svm import SVC

svm_classifier = LinearSVC(C=128, verbose=True)
svm_classifier.fit(train_X, train_Y)
predictions = svm_classifier.predict(test_X)

evaluate_supml(predictions, test_Y.tolist())

#print(svm_classifier.coef_[0])

feat_weights = list(zip(input_features, list(svm_classifier.coef_[0])))
for f, w in feat_weights:
    print(f, w) 


[LibLinear]Accuracy: 0.697
Precision: 0.36363636363636365
Recall: 0.5813586097946287
F1: 0.4474164133738602
X1 -0.019584070795141986
X5 -0.04576690444030084
X6 0.27408412632044044
X7 0.492018398322337
X8 0.29632576016496864
X9 -0.02928025919126454
X10 -0.03141220015702316
X11 -0.050696956421002595
X12 -0.2856638232573229
X13 0.15043883112214956
X14 -0.07251760454541233
X15 -0.10410271119329452
X16 0.019679340035115456
X17 -0.05979521883988669
X18 -0.2959927191073685
X19 -0.0366491385653439
X20 -0.08608428767427319
X21 -0.05099191688255839
X22 -0.06050052728938639
X23 -0.2059853555339446
Gender_1 0.20939422167255273
Gender_2 -0.20939422167255273
Education_0 -0.025317764451384162
Education_1 -0.15578199330495326
Education_2 0.010937440798511819
Education_3 0.21965843642194213
Education_4 -0.034643639859919306
Education_5 -0.09182255951256839
Education_6 -0.017985115588186947
MarStat_0 -0.037848554208291384
MarStat_1 0.00037032286047665106
MarStat_2 0.0029941421461305235
MarStat_3 -0.0007
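
The weight list is easier to read when sorted by magnitude. The sketch below ranks a handful of the (feature, weight) pairs copied from the output above by absolute weight. Since the inputs were z-normalized, the magnitudes are roughly comparable across features, but this is only a rough indication of importance, and the selection of seven features is made purely for illustration.

```python
# a handful of (feature, weight) pairs copied from the printed output
feat_weights = [
    ("X1", -0.019584070795141986),
    ("X6", 0.27408412632044044),
    ("X7", 0.492018398322337),
    ("X12", -0.2856638232573229),
    ("X18", -0.2959927191073685),
    ("Gender_1", 0.20939422167255273),
    ("MarStat_1", 0.00037032286047665106),
]

# sort by absolute weight, largest first
ranked = sorted(feat_weights, key=lambda fw: abs(fw[1]), reverse=True)

for feature, weight in ranked:
    print(f"{feature:10s} {weight:+.4f}")
```

Weights of one-hot columns should be read with care: `Gender_1` and `Gender_2` receive exactly opposite weights in the output above, because the two columns carry the same information.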

