## Use Case

A credit card is a flexible tool that lets a customer use a bank's money for a short period of time.

Accurately predicting which customers are most likely to default represents a significant business opportunity for banks. Bank cards are the most common credit card type in Taiwan, which underscores the impact of risk prediction on both consumers and banks.

Such predictions would inform the bank's criteria for approving a credit card application and its choice of credit limit.

## Dataset Description

This dataset contains information on default payments, demographic factors, credit data, history of payment, and bill statements of credit card clients in Taiwan from April 2005 to September 2005. 

Using the information given, predict the probability of a customer defaulting in the next month.

## Data Dictionary

- **ID**: Unique ID of each client
- **LIMIT_BAL**: Amount of credit granted (NT dollars); includes both the individual consumer's credit and their family (supplementary) credit
- **SEX**: Gender (1=male, 2=female)
- **EDUCATION**: (1=graduate school, 2=university, 3=high school, 4=others, 5=unknown, 6=unknown)
- **MARRIAGE**: Marital status (1=married, 2=single, 3=divorced)
- **AGE**: Age of the client
- **PAY_0**: Repayment status in September, 2005 (-1=pay duly, 1=payment delay for one month, 2=payment delay for two months, ... 8=payment delay for eight months, 9=payment delay for nine months and above)
- **PAY_2**: Repayment status in August, 2005 (scale same as above)
- **PAY_3**: Repayment status in July, 2005 (scale same as above)
- **PAY_4**: Repayment status in June, 2005 (scale same as above)
- **PAY_5**: Repayment status in May, 2005 (scale same as above)
- **PAY_6**: Repayment status in April, 2005 (scale same as above)
- **BILL_AMT1**: Amount of bill statement in September, 2005 (NT dollar)
- **BILL_AMT2**: Amount of bill statement in August, 2005 (NT dollar)
- **BILL_AMT3**: Amount of bill statement in July, 2005 (NT dollar)
- **BILL_AMT4**: Amount of bill statement in June, 2005 (NT dollar)
- **BILL_AMT5**: Amount of bill statement in May, 2005 (NT dollar)
- **BILL_AMT6**: Amount of bill statement in April, 2005 (NT dollar)
- **PAY_AMT1**: Amount of previous payment in September, 2005 (NT dollar)
- **PAY_AMT2**: Amount of previous payment in August, 2005 (NT dollar)
- **PAY_AMT3**: Amount of previous payment in July, 2005 (NT dollar)
- **PAY_AMT4**: Amount of previous payment in June, 2005 (NT dollar)
- **PAY_AMT5**: Amount of previous payment in May, 2005 (NT dollar)
- **PAY_AMT6**: Amount of previous payment in April, 2005 (NT dollar)
- **default_payment_next_month**: Target Variable: Default payment (1=yes, 0=no)
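
The coded columns above are only reliable if the data sticks to the documented codes. The following is a minimal sketch for checking that, not part of the original pipeline: `undocumented_codes` is a hypothetical helper name, and the sample values are made up for illustration rather than taken from the dataset.

```python
def undocumented_codes(values, documented):
    """Return the sorted values that the data dictionary does not list."""
    return sorted(set(values) - set(documented))


# EDUCATION codes as listed in the data dictionary
EDUCATION_CODES = {1, 2, 3, 4, 5, 6}

# Made-up sample column (an assumption for illustration only)
sample_education = [1, 2, 2, 3, 0, 6]

print(undocumented_codes(sample_education, EDUCATION_CODES))
```

Once the data is loaded, the same helper could be applied to a real column, for example `undocumented_codes(train_df['EDUCATION'], EDUCATION_CODES)`.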

## Load Packages

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.transforms
import seaborn as sns
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
from sklearn.model_selection import StratifiedShuffleSplit, KFold
from sklearn.metrics import confusion_matrix,accuracy_score, roc_curve, auc
from imblearn.over_sampling import SMOTE
import tensorflow as tf
from tensorflow.keras.layers import Input, Dense, Dropout, BatchNormalization
from tensorflow.keras.regularizers import l2
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam

In [None]:
from learningratefinder import LearningRateFinder
from clr_callback import CyclicLR

## Set file path for train and predict datasets

In [None]:
train_dataset = "Dataset/train.csv"
predict_dataset = "Dataset/test.csv"

## Data Preprocessing

#### Read train/predict data

In [None]:
train_df = pd.read_csv(train_dataset)
predict_df = pd.read_csv(predict_dataset)

In [None]:
train_df.head()

In [None]:
# Check null columns in train and predict datasets
print("Columns with NaN values in train_df: {}".format(train_df.columns[train_df.isnull().any()].tolist()))
print("Columns with NaN values in predict_df: {}".format(predict_df.columns[predict_df.isnull().any()].tolist()))

#### Display countplot for different classes in training data

In [None]:
sns.set(style="darkgrid")
ax = sns.countplot(x="default_payment_next_month", data=train_df)

#### Separate out the target variable in training data

In [None]:
Ytrain = np.array([train_df['default_payment_next_month'].values]).T
train_df.drop(['default_payment_next_month'], inplace=True, axis=1)
print("Ytrain: {}".format(Ytrain.shape))

#### Feature Engineering

In [None]:
# Combine the train and predict datasets
# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent call
combined_df = pd.concat([train_df, predict_df], sort=False, ignore_index=True)
print(combined_df.shape)

In [None]:
# One-hot encoding for "SEX" field
dummy_val = pd.get_dummies(combined_df['SEX'], prefix='SEX')
combined_df = pd.concat([combined_df, dummy_val], axis=1)
print("Number of features: {}".format(combined_df.shape[1]))

In [None]:
# One-hot encoding for "EDUCATION" field
dummy_val = pd.get_dummies(combined_df['EDUCATION'], prefix='EDUCATION')
combined_df = pd.concat([combined_df, dummy_val], axis=1)
print("Number of features: {}".format(combined_df.shape[1]))

In [None]:
# One-hot encoding for "MARRIAGE" field
dummy_val = pd.get_dummies(combined_df['MARRIAGE'], prefix='MARRIAGE')
combined_df = pd.concat([combined_df, dummy_val], axis=1)
print("Number of features: {}".format(combined_df.shape[1]))

In [None]:
# One-hot encoding for "PAY_0" field
dummy_val = pd.get_dummies(combined_df['PAY_0'], prefix='PAY_0')
combined_df = pd.concat([combined_df, dummy_val], axis=1)
print("Number of features: {}".format(combined_df.shape[1]))

In [None]:
# One-hot encoding for "PAY_2" field
dummy_val = pd.get_dummies(combined_df['PAY_2'], prefix='PAY_2')
combined_df = pd.concat([combined_df, dummy_val], axis=1)
print("Number of features: {}".format(combined_df.shape[1]))

In [None]:
# One-hot encoding for "PAY_3" field
dummy_val = pd.get_dummies(combined_df['PAY_3'], prefix='PAY_3')
combined_df = pd.concat([combined_df, dummy_val], axis=1)
print("Number of features: {}".format(combined_df.shape[1]))

In [None]:
# One-hot encoding for "PAY_4" field
dummy_val = pd.get_dummies(combined_df['PAY_4'], prefix='PAY_4')
combined_df = pd.concat([combined_df, dummy_val], axis=1)
print("Number of features: {}".format(combined_df.shape[1]))

In [None]:
# One-hot encoding for "PAY_5" field
dummy_val = pd.get_dummies(combined_df['PAY_5'], prefix='PAY_5')
combined_df = pd.concat([combined_df, dummy_val], axis=1)
print("Number of features: {}".format(combined_df.shape[1]))

In [None]:
# One-hot encoding for "PAY_6" field
dummy_val = pd.get_dummies(combined_df['PAY_6'], prefix='PAY_6')
combined_df = pd.concat([combined_df, dummy_val], axis=1)
print("Number of features: {}".format(combined_df.shape[1]))

In [None]:
# Calculate "Pay-to-Bill ratio" for the April 2005 bill (BILL_AMT6), using the May payment (PAY_AMT5)
combined_df['pay_to_bill_april'] = combined_df['PAY_AMT5']/combined_df['BILL_AMT6']
combined_df.loc[~np.isfinite(combined_df['pay_to_bill_april']), 'pay_to_bill_april'] = 0

In [None]:
# Calculate "Pay-to-Bill ratio" for the May 2005 bill (BILL_AMT5), using the June payment (PAY_AMT4)
combined_df['pay_to_bill_may'] = combined_df['PAY_AMT4']/combined_df['BILL_AMT5']
combined_df.loc[~np.isfinite(combined_df['pay_to_bill_may']), 'pay_to_bill_may'] = 0

In [None]:
# Calculate "Pay-to-Bill ratio" for the June 2005 bill (BILL_AMT4), using the July payment (PAY_AMT3)
combined_df['pay_to_bill_june'] = combined_df['PAY_AMT3']/combined_df['BILL_AMT4']
combined_df.loc[~np.isfinite(combined_df['pay_to_bill_june']), 'pay_to_bill_june'] = 0

In [None]:
# Calculate "Pay-to-Bill ratio" for the July 2005 bill (BILL_AMT3), using the August payment (PAY_AMT2)
combined_df['pay_to_bill_july'] = combined_df['PAY_AMT2']/combined_df['BILL_AMT3']
combined_df.loc[~np.isfinite(combined_df['pay_to_bill_july']), 'pay_to_bill_july'] = 0

In [None]:
# Calculate "Pay-to-Bill ratio" for the August 2005 bill (BILL_AMT2), using the September payment (PAY_AMT1)
combined_df['pay_to_bill_aug'] = combined_df['PAY_AMT1']/combined_df['BILL_AMT2']
combined_df.loc[~np.isfinite(combined_df['pay_to_bill_aug']), 'pay_to_bill_aug'] = 0

In [None]:
# Calculate "% of credit limit used" for (April, 2005)
combined_df['pct_of_limit_used_april'] = combined_df['BILL_AMT6']/combined_df['LIMIT_BAL']
combined_df.loc[~np.isfinite(combined_df['pct_of_limit_used_april']), 'pct_of_limit_used_april'] = 0

In [None]:
# Calculate "% of credit limit used" for (May, 2005)
combined_df['pct_of_limit_used_may'] = combined_df['BILL_AMT5']/combined_df['LIMIT_BAL']
combined_df.loc[~np.isfinite(combined_df['pct_of_limit_used_may']), 'pct_of_limit_used_may'] = 0

In [None]:
# Calculate "% of credit limit used" for (June, 2005)
combined_df['pct_of_limit_used_jun'] = combined_df['BILL_AMT4']/combined_df['LIMIT_BAL']
combined_df.loc[~np.isfinite(combined_df['pct_of_limit_used_jun']), 'pct_of_limit_used_jun'] = 0

In [None]:
# Calculate "% of credit limit used" for (July, 2005)
combined_df['pct_of_limit_used_jul'] = combined_df['BILL_AMT3']/combined_df['LIMIT_BAL']
combined_df.loc[~np.isfinite(combined_df['pct_of_limit_used_jul']), 'pct_of_limit_used_jul'] = 0

In [None]:
# Calculate "% of credit limit used" for (August, 2005)
combined_df['pct_of_limit_used_aug'] = combined_df['BILL_AMT2']/combined_df['LIMIT_BAL']
combined_df.loc[~np.isfinite(combined_df['pct_of_limit_used_aug']), 'pct_of_limit_used_aug'] = 0

In [None]:
# Calculate "% of credit limit used" for (September, 2005)
combined_df['pct_of_limit_used_sep'] = combined_df['BILL_AMT1']/combined_df['LIMIT_BAL']
combined_df.loc[~np.isfinite(combined_df['pct_of_limit_used_sep']), 'pct_of_limit_used_sep'] = 0
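
Each ratio cell above follows the same pattern: divide two columns, then set any non-finite result (from a zero denominator) to 0. A minimal sketch of that pattern as a single reusable function is shown below. `safe_ratio` is a hypothetical name, and the sample arrays are made up; the function assumes the inputs themselves contain no NaN values.

```python
import numpy as np


def safe_ratio(numerator, denominator):
    """Element-wise numerator / denominator, with 0 wherever the denominator is 0."""
    num = np.asarray(numerator, dtype=float)
    den = np.asarray(denominator, dtype=float)
    out = np.zeros_like(num)
    # Only divide where the denominator is non-zero; other slots keep the 0 default
    np.divide(num, den, out=out, where=den != 0)
    return out


# Made-up payment and bill amounts for illustration
payments = [100, 0, 50]
bills = [200, 0, 0]
print(safe_ratio(payments, bills))
```

With the notebook's frame this could be used as `combined_df['pay_to_bill_april'] = safe_ratio(combined_df['PAY_AMT5'], combined_df['BILL_AMT6'])`, which avoids producing infinities in the first place.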

In [None]:
# Drop redundant fields
combined_df.drop(['ID', 'SEX', 'EDUCATION', 'MARRIAGE', 'PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6'], inplace=True, axis=1)
print("Number of features: {}".format(combined_df.shape[1]))

#### Segregate combined_df into train/predict datasets

In [None]:
Xtrain = combined_df[:21000].values
Xpredict = combined_df[21000:].values
print("Xtrain: {}".format(Xtrain.shape))
print("Xpredict: {}".format(Xpredict.shape))

#### Data Scaling

In [None]:
scaler_x = StandardScaler().fit(Xtrain)
Xtrain = scaler_x.transform(Xtrain)
Xpredict = scaler_x.transform(Xpredict)

#### Split training data into train and test datasets

In [None]:
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.07, random_state=1)
for train_index, test_index in sss.split(Xtrain, Ytrain):
    train_x, test_x = Xtrain[train_index], Xtrain[test_index]
    train_y, test_y = Ytrain[train_index], Ytrain[test_index]

print("------------------------- Training Dataset -------------------------")
print("train_x shape: {}".format(train_x.shape))
print("train_y shape: {}".format(train_y.shape))

print("\n------------------------- Test Dataset -------------------------")
print("test_x shape: {}".format(test_x.shape))
print("test_y shape: {}".format(test_y.shape))

In [None]:
train_df = pd.DataFrame({'Class': train_y[:, 0]})
train_df.groupby(['Class']).size().reset_index().rename(columns={0:'Count'})

In [None]:
test_df = pd.DataFrame({'Class': test_y[:, 0]})
test_df.groupby(['Class']).size().reset_index().rename(columns={0:'Count'})

#### Handling class imbalance

In [None]:
sm = SMOTE(k_neighbors=1)
# fit_sample was renamed to fit_resample in newer imbalanced-learn releases
sm_x, sm_y = sm.fit_resample(train_x, train_y.ravel())
train_x = sm_x
train_y = np.array([sm_y]).T

In [None]:
train_df = pd.DataFrame({'Class': train_y[:, 0]})
train_df.groupby(['Class']).size().reset_index().rename(columns={0:'Count'})
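
SMOTE balances the classes by synthesising new minority-class rows instead of duplicating existing ones. As a rough illustration of the core idea only (this is not imbalanced-learn's implementation), each synthetic row lies on the line segment between a minority sample and one of its nearest minority-class neighbours. `interpolate_sample` and its `gap` argument are hypothetical names used for this sketch.

```python
def interpolate_sample(sample, neighbor, gap):
    """Return a point on the segment from `sample` to `neighbor`.

    gap = 0 reproduces `sample`, gap = 1 reproduces `neighbor`;
    SMOTE draws gap uniformly at random from [0, 1].
    """
    return [s + gap * (n - s) for s, n in zip(sample, neighbor)]


# Two made-up minority-class rows with two features each
minority_row = [1.0, 2.0]
nearest_neighbor = [3.0, 6.0]

print(interpolate_sample(minority_row, nearest_neighbor, gap=0.5))
```

With `k_neighbors=1`, as configured above, the neighbour used for interpolation is always the single closest minority sample, so synthetic rows stay close to existing ones.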

## Save the datasets to an NPZ file (for reuse)

In [None]:
np.savez_compressed('Dataset/Credit_Card_Payment_Default_Dataset.npz',
                    Xtrain=train_x, Ytrain=train_y,
                    Xtest=test_x, Ytest=test_y,
                    Xpredict=Xpredict)

## Load datasets from the NPZ file

In [None]:
processed_dataset = np.load('Dataset/Credit_Card_Payment_Default_Dataset.npz', allow_pickle=True)

Xtrain, Ytrain = processed_dataset['Xtrain'], processed_dataset['Ytrain']
Xtest, Ytest = processed_dataset['Xtest'], processed_dataset['Ytest']
Xpredict = processed_dataset['Xpredict']

print("------------------------- Training Dataset -------------------------")
print("Xtrain shape: {}".format(Xtrain.shape))
print("Ytrain shape: {}".format(Ytrain.shape))

print("\n------------------------- Test Dataset -------------------------")
print("Xtest shape: {}".format(Xtest.shape))
print("Ytest shape: {}".format(Ytest.shape))

print("\n------------------------- Prediction Dataset -------------------------")
print("Xpredict shape: {}".format(Xpredict.shape))

## Build the model

In [None]:
def nn_model(input_shape):
    
    # Input Layer
    x_input = Input(shape=(input_shape, ), name='INPUT')
    
    # Fully-connected Layer 1
    x = Dense(units=512, name='FC-1', activation='relu', kernel_regularizer=l2(0.1))(x_input)
    x = BatchNormalization(name='BN_FC-1')(x)
    x = Dropout(rate=0.5, name='DROPOUT_FC-1')(x)
    
    # Fully-connected Layer 2
    x = Dense(units=512, name='FC-2', activation='relu', kernel_regularizer=l2(0.1))(x)
    x = BatchNormalization(name='BN_FC-2')(x)
    x = Dropout(rate=0.5, name='DROPOUT_FC-2')(x)
    
    # Fully-connected Layer 3
    x = Dense(units=256, name='FC-3', activation='relu', kernel_regularizer=l2(0.1))(x)
    x = BatchNormalization(name='BN_FC-3')(x)
    x = Dropout(rate=0.5, name='DROPOUT_FC-3')(x)
    
    # Fully-connected Layer 4
    x = Dense(units=256, name='FC-4', activation='relu', kernel_regularizer=l2(0.1))(x)
    x = BatchNormalization(name='BN_FC-4')(x)
    x = Dropout(rate=0.5, name='DROPOUT_FC-4')(x)
    
    # Fully-connected Layer 5
    x = Dense(units=128, name='FC-5', activation='relu', kernel_regularizer=l2(0.1))(x)
    x = BatchNormalization(name='BN_FC-5')(x)
    x = Dropout(rate=0.5, name='DROPOUT_FC-5')(x)
    
    # Fully-connected Layer 6
    x = Dense(units=128, name='FC-6', activation='relu', kernel_regularizer=l2(0.1))(x)
    x = BatchNormalization(name='BN_FC-6')(x)
    x = Dropout(rate=0.5, name='DROPOUT_FC-6')(x)
    
    # Fully-connected Layer 7
    x = Dense(units=64, name='FC-7', activation='relu', kernel_regularizer=l2(0.1))(x)
    x = BatchNormalization(name='BN_FC-7')(x)
    x = Dropout(rate=0.5, name='DROPOUT_FC-7')(x)
    
    # Fully-connected Layer 8
    x = Dense(units=64, name='FC-8', activation='relu', kernel_regularizer=l2(0.1))(x)
    x = BatchNormalization(name='BN_FC-8')(x)
    x = Dropout(rate=0.5, name='DROPOUT_FC-8')(x)
    
    # Fully-connected Layer 9
    x = Dense(units=64, name='FC-9', activation='relu', kernel_regularizer=l2(0.1))(x)
    x = BatchNormalization(name='BN_FC-9')(x)
    x = Dropout(rate=0.5, name='DROPOUT_FC-9')(x)
    
    # Fully-connected Layer 10
    x = Dense(units=64, name='FC-10', activation='relu', kernel_regularizer=l2(0.1))(x)
    x = BatchNormalization(name='BN_FC-10')(x)
    x = Dropout(rate=0.5, name='DROPOUT_FC-10')(x)
    
    # Output Layer
    x = Dense(units=1, activation='sigmoid', name='OUTPUT')(x)

    # Create Keras Model instance
    model = Model(inputs=x_input, outputs=x, name='Credit_Card_Payment_Default_Predictor')

    return model

In [None]:
# Define the model hyperparameters
max_iterations = 10
mini_batch_size = 128
min_lr = 1e-4
max_lr = 1e-2
step_size = 8 * (Xtrain.shape[0] // mini_batch_size)
clr_method = 'triangular2'

In [None]:
# Create the model
model = nn_model(Xtrain.shape[1])

# Compile model to configure the learning process
adam = Adam(learning_rate=min_lr)  # `lr` is deprecated in favour of `learning_rate`
model.compile(loss='binary_crossentropy',
              optimizer=adam,
              metrics=['accuracy'])

# Triangular learning rate policy
clr = CyclicLR(base_lr=min_lr, max_lr=max_lr, mode=clr_method, step_size=step_size)
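
To make the `triangular2` setting concrete, here is a minimal sketch of the schedule. It assumes the local `clr_callback.CyclicLR` follows the usual definition of this policy: the learning rate ramps linearly from `base_lr` to `max_lr` over `step_size` iterations and back again, and the amplitude halves after every full cycle. The function name `triangular2_lr` and the small `step_size` are illustrative choices, not taken from that module.

```python
import math


def triangular2_lr(iteration, base_lr, max_lr, step_size):
    """Learning rate at a given iteration under the triangular2 policy."""
    cycle = math.floor(1 + iteration / (2 * step_size))   # 1-based cycle index
    x = abs(iteration / step_size - 2 * cycle + 1)         # distance from the cycle peak
    scale = 1 / (2 ** (cycle - 1))                         # amplitude halves each cycle
    return base_lr + (max_lr - base_lr) * max(0.0, 1 - x) * scale


# Same bounds as the notebook, with a tiny step size for readability
for it in (0, 4, 8, 12, 16):
    print(it, triangular2_lr(it, base_lr=1e-4, max_lr=1e-2, step_size=4))
```

The first peak (iteration 4) reaches `max_lr`, while the second peak (iteration 12) only reaches the midpoint between `base_lr` and `max_lr`.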

In [None]:
# Learning Rate Finder
lrf = LearningRateFinder(model)
lrf.find((Xtrain, Ytrain),
         startLR=1e-10, endLR=1e+1,
         stepsPerEpoch=np.ceil((len(Xtrain) / float(mini_batch_size))),
         batchSize=mini_batch_size)
lrf.plot_loss()
plt.grid()
plt.show()

In [None]:
# Define 5-fold cross validation test harness
kfold = KFold(n_splits=5, shuffle=True, random_state=10)
cvscores = []
y_pred = 0
Ypredict = 0
loss_values = {}
acc_values = {}

In [None]:
# Train the model using K-fold
counter = 0

for train, val in kfold.split(Xtrain, Ytrain):
    counter += 1
    train_x, train_y = Xtrain[train], Ytrain[train]
    val_x, val_y = Xtrain[val], Ytrain[val]

    # Create the model
    model = nn_model(Xtrain.shape[1])

    # Compile model to configure the learning process
    model.compile(loss='binary_crossentropy', optimizer=Adam(learning_rate=min_lr), metrics=['accuracy'])

    # Triangular learning rate policy
    clr = CyclicLR(base_lr=min_lr, max_lr=max_lr, mode=clr_method, step_size=step_size)

    # Fit the model
    history = model.fit(x=train_x, y=train_y, 
                        batch_size=mini_batch_size, epochs=100, 
                        callbacks=[clr], workers=5,
                        validation_data=(val_x, val_y))
    
    # Store the score values on the held-out test dataset
    scores = model.evaluate(x=Xtest, y=Ytest, verbose=0)
    print("%s: %.2f" % (model.metrics_names[1], scores[1]))
    cvscores.append(scores[1])

    # Run predictions
    pred = model.predict(x=Xtest)
    y_pred += pred

    # Store the history object values for learning curves plotting
    loss_values["train_loss_"+str(counter)] = history.history['loss']
    loss_values["val_loss_"+str(counter)] = history.history['val_loss']
    acc_values["train_acc_"+str(counter)] = history.history['accuracy']
    acc_values["val_acc_"+str(counter)] = history.history['val_accuracy']

print("%.2f (+/- %.2f)" % (np.mean(cvscores), np.std(cvscores)))
y_pred /= float(counter)

## Plot the learning curves

In [None]:
plt.figure(figsize=(10,8))
plt.plot(loss_values["train_loss_1"], label='train_loss_1')
plt.plot(loss_values["train_loss_2"], label='train_loss_2')
plt.plot(loss_values["train_loss_3"], label='train_loss_3')
plt.plot(loss_values["train_loss_4"], label='train_loss_4')
plt.plot(loss_values["train_loss_5"], label='train_loss_5')
plt.plot(loss_values["val_loss_1"], label='val_loss_1')
plt.plot(loss_values["val_loss_2"], label='val_loss_2')
plt.plot(loss_values["val_loss_3"], label='val_loss_3')
plt.plot(loss_values["val_loss_4"], label='val_loss_4')
plt.plot(loss_values["val_loss_5"], label='val_loss_5')
plt.ylabel('Cost')
plt.xlabel('Epoch #')
plt.title("Model Loss Curve")
plt.legend()
plt.grid()
plt.show()

In [None]:
plt.figure(figsize=(10,8))
plt.plot(acc_values["train_acc_1"], label='train_acc_1')
plt.plot(acc_values["train_acc_2"], label='train_acc_2')
plt.plot(acc_values["train_acc_3"], label='train_acc_3')
plt.plot(acc_values["train_acc_4"], label='train_acc_4')
plt.plot(acc_values["train_acc_5"], label='train_acc_5')
plt.plot(acc_values["val_acc_1"], label='val_acc_1')
plt.plot(acc_values["val_acc_2"], label='val_acc_2')
plt.plot(acc_values["val_acc_3"], label='val_acc_3')
plt.plot(acc_values["val_acc_4"], label='val_acc_4')
plt.plot(acc_values["val_acc_5"], label='val_acc_5')
plt.ylabel('Accuracy')
plt.xlabel('Epoch #')
plt.title("Model Accuracy Curve")
plt.legend()
plt.grid()
plt.show()

In [None]:
plt.plot(clr.history["lr"])
plt.ylabel('Learning Rate')
plt.xlabel('Iteration #')
plt.title("Cyclical Learning Rate (CLR)")
plt.grid()
plt.show()

## Validate the model

In [None]:
y_pred_binary = np.where(y_pred > 0.5, 1, 0)

In [None]:
#Print accuracy
acc_score = accuracy_score(Ytest, y_pred_binary)
print('Overall accuracy of the neural network model:', acc_score)

In [None]:
#Print Area Under Curve
plt.figure()
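# Use the fold-averaged probabilities, not the thresholded labels, so the curve has more than one operating point
false_positive_rate, recall, thresholds = roc_curve(Ytest, y_pred)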
roc_auc = auc(false_positive_rate, recall)
plt.title('Receiver Operating Characteristic (ROC)')
plt.plot(false_positive_rate, recall, 'b', label = 'AUC = %0.3f' %roc_auc)
plt.legend(loc='lower right')
plt.plot([0,1], [0,1], 'r--')
plt.xlim([0.0,1.0])
plt.ylim([0.0,1.0])
plt.ylabel('Recall')
plt.xlabel('Fall-out (1-Specificity)')
plt.show()

print('AUC score:', roc_auc)
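
An ROC curve and its AUC depend on whether they are computed from predicted probabilities or from labels already thresholded at 0.5. The sketch below illustrates the difference with a simple rank-based AUC (the probability that a positive example is scored above a negative one, with ties counted as one half). `rank_auc` is a hypothetical helper and the labels and scores are made up; it is not scikit-learn's implementation.

```python
def rank_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


labels = [0, 0, 1, 1]
probabilities = [0.1, 0.6, 0.4, 0.9]                       # made-up model scores
hard_labels = [1 if p > 0.5 else 0 for p in probabilities]  # thresholded at 0.5

print(rank_auc(labels, probabilities))  # 0.75
print(rank_auc(labels, hard_labels))    # 0.5
```

Thresholding discards the ranking information within each predicted class, which is why the second value is lower in this example.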

In [None]:
#Print Confusion Matrix
cm = confusion_matrix(Ytest, y_pred_binary)
print(cm)
labels = ['No Default', 'Default']
sns.heatmap(cm, xticklabels = labels, yticklabels = labels, annot = True, fmt='d', cmap="Blues", vmin = 0.5);
plt.title('Confusion Matrix')
plt.ylabel('True Class')
plt.xlabel('Predicted Class')
plt.show()

## Make Predictions

In [None]:
# Note: `model` here is the network trained on the last cross-validation fold
y_pred_prob = model.predict(Xpredict)
y_pred_binary = np.where(y_pred_prob > 0.5, 1, 0)
temp_df = pd.DataFrame(y_pred_binary, columns=['prediction'])

In [None]:
submit_df = pd.read_csv("Dataset/sample_submission.csv")
submit_df['default_payment_next_month'] = temp_df['prediction']
submit_df.to_csv("predictions.csv", index=False)