## Summary

1. Overview
2. Dependencies and path parameters
3. Data import
4. Quality check
5. The model
    - Data pre-processing
    - Model development
    - Performance evaluation
6. Prediction on unlabelled data
7. Model saving

# 1. Overview

To predict the default probability of an account from several predictors, I built a sequential neural network.

At first glance, the input data contains many NULL values. They cannot simply be removed, so the model has to take them into account.

To develop the NN I used the Keras framework, which is built on TensorFlow. Many predictors are categorical, so I encoded them with one-hot encoding to fit the expected input format. I also saved the categories, so that future data can be encoded in the same way before making predictions on it.

The network has an initial masking layer to take the NULL values mentioned above into account. I tuned the hyperparameters (mainly the number of layers, the number of neurons and the optimizer) to improve performance as much as possible. Accuracy is not a good indicator of model quality, because the two classes (Default=TRUE and Default=FALSE) are very unbalanced: a dummy model that always predicts Default=FALSE already achieves a high accuracy. For this reason, I used Precision and Recall as quality indicators. For the same reason, I passed class weights based on the class sizes to the fit function.

Because of the large number of parameters, a regularizer was necessary; otherwise the performance on the dev set would have been very poor.

Finally, I made predictions on the unlabelled data.

The model and the list of predictors are saved directly in the folder used to build the Docker image.

# 2. Dependencies and path parameters

In [30]:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import pickle
from keras.utils import to_categorical
from keras import metrics

In [31]:
input_folder_path='./../data/'
output_folder_path='./../lambda_default_predictor/trained_model/'

input_file_name = 'dataset.csv'
output_file_name = 'predictions.csv'

# 3. Data import

In [32]:
input_df = pd.read_csv(input_folder_path+input_file_name, sep=';')

In [33]:
input_df.head(2)

| | uuid | default | account_amount_added_12_24m | account_days_in_dc_12_24m | account_days_in_rem_12_24m | account_days_in_term_12_24m | account_incoming_debt_vs_paid_0_24m | account_status | account_worst_status_0_3m | account_worst_status_12_24m | ... | status_3rd_last_archived_0_24m | status_max_archived_0_6_months | status_max_archived_0_12_months | status_max_archived_0_24_months | recovery_debt | sum_capital_paid_account_0_12m | sum_capital_paid_account_12_24m | sum_paid_inv_0_12m | time_hours | worst_status_active_inv |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63f69b2c-8b1c-4740-b78d-52ed9a4515ac | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | NaN | ... | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 178839 | 9.653333 | 1.0 |
| 1 | 0e961183-8c15-4470-9a5e-07a1bd207661 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | NaN | 1.0 | 1.0 | 1.0 | ... | 1 | 1 | 2 | 2 | 0 | 0 | 0 | 49014 | 13.181389 | NaN |


In [34]:
unlabelled_df = input_df[input_df.default.isna()].drop(columns=['default'])

In [35]:
df = input_df[input_df.default.notna()]

In [36]:
input_df.shape, unlabelled_df.shape, df.shape

((99976, 43), (10000, 42), (89976, 43))

In [37]:
numerics_columns = ['account_amount_added_12_24m',
       'account_days_in_dc_12_24m', 'account_days_in_rem_12_24m',
       'account_days_in_term_12_24m', 'account_incoming_debt_vs_paid_0_24m', 'age', 'avg_payment_span_0_12m',
       'avg_payment_span_0_3m',
       'has_paid', 'max_paid_inv_0_12m', 'max_paid_inv_0_24m',
       'num_active_div_by_paid_inv_0_12m', 'num_active_inv',
       'num_arch_dc_0_12m', 'num_arch_dc_12_24m', 'num_arch_ok_0_12m',
       'num_arch_ok_12_24m', 'num_arch_rem_0_12m',
       'num_arch_written_off_0_12m', 'num_arch_written_off_12_24m',
       'num_unpaid_bills', 'recovery_debt',
       'sum_capital_paid_account_0_12m', 'sum_capital_paid_account_12_24m',
       'sum_paid_inv_0_12m', 'time_hours']
categorical_columns = ['account_status', 'account_worst_status_0_3m',
       'account_worst_status_12_24m', 'account_worst_status_3_6m',
       'account_worst_status_6_12m', 'merchant_category', 'merchant_group','name_in_email',
                       'status_last_archived_0_24m',
       'status_2nd_last_archived_0_24m', 'status_3rd_last_archived_0_24m',
       'status_max_archived_0_6_months', 'status_max_archived_0_12_months',
       'status_max_archived_0_24_months', 'worst_status_active_inv']

# 4. Quality check

Very unbalanced classes:

In [38]:
df.groupby('default').default.count()

default
0.0    88688
1.0     1288
Name: default, dtype: int64
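
To make the imbalance concrete, the short sketch below plugs the two counts above into the accuracy and recall of a dummy model that always predicts "no default". It is only an illustration of why accuracy is misleading here; the variable names are chosen for this example and do not appear elsewhere in the notebook.

```python
# Class counts taken from the groupby output above.
n_negative = 88688  # rows with default == 0
n_positive = 1288   # rows with default == 1
n_total = n_negative + n_positive

# A dummy model that always predicts "no default" is right on every negative row.
baseline_accuracy = n_negative / n_total

# It never flags a positive, so it finds none of the true defaults.
true_positives = 0
baseline_recall = true_positives / n_positive

print(f"positive share:    {n_positive / n_total:.4f}")
print(f"baseline accuracy: {baseline_accuracy:.4f}")
print(f"baseline recall:   {baseline_recall:.4f}")
```

The baseline reaches an accuracy above 98% while detecting no default at all, which is why Precision and Recall are tracked during training.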

A lot of NULL values for some columns:

In [39]:
df.isna().sum(axis=0)/df.shape[0]

uuid                                   0.000000
default                                0.000000
account_amount_added_12_24m            0.000000
account_days_in_dc_12_24m              0.118732
account_days_in_rem_12_24m             0.118732
account_days_in_term_12_24m            0.118732
account_incoming_debt_vs_paid_0_24m    0.593014
account_status                         0.543856
account_worst_status_0_3m              0.543856
account_worst_status_12_24m            0.667456
account_worst_status_3_6m              0.577243
account_worst_status_6_12m             0.603639
age                                    0.000000
avg_payment_span_0_12m                 0.238597
avg_payment_span_0_3m                  0.493265
merchant_category                      0.000000
merchant_group                         0.000000
has_paid                               0.000000
max_paid_inv_0_12m                     0.000000
max_paid_inv_0_24m                     0.000000
name_in_email                          0

# 5. The model

## 5.1 Data pre-processing

In [40]:
df = df.fillna(-1)

In [41]:
df_numerics = df[numerics_columns]
df_categorical = df[categorical_columns]

In [42]:
df_numerics.shape, df_categorical.shape

((89976, 26), (89976, 15))

In [43]:
one_hot_list=[]
categories_dict = {}
for col_name in df_categorical:
    col = df_categorical[col_name]
    categories = list(col.unique())
    labels = col.apply(lambda x: categories.index(x)).tolist()
    one_hot_list.append(to_categorical(labels))
    categories_dict.update({col_name:categories})

In [44]:
one_hot= np.concatenate(one_hot_list, axis=1)

In [45]:
np.shape(one_hot)

(89976, 135)
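
The saved `categories_dict` is what allows new data to be encoded with the same column layout. The loop above cannot be reused as-is on new data: `categories.index(x)` raises a `ValueError` for a value never seen in training, and `to_categorical` infers the number of columns from the largest label present. The sketch below is one possible way to encode against a fixed category list using only NumPy. The helper name and the choice to map unseen values to an all-zero row are assumptions of this example, not part of the original pipeline.

```python
import numpy as np

def one_hot_with_saved_categories(values, categories):
    """One-hot encode `values` using the category order saved at training time.

    A value that was not seen during training gets an all-zero row
    (an assumption of this sketch; other policies are possible).
    """
    index = {cat: i for i, cat in enumerate(categories)}
    encoded = np.zeros((len(values), len(categories)), dtype="float32")
    for row, value in enumerate(values):
        col = index.get(value)
        if col is not None:
            encoded[row, col] = 1.0
    return encoded

# Hypothetical stand-in for one entry of categories_dict.
saved_categories = {"account_status": [-1.0, 1.0, 2.0]}

new_values = [1.0, -1.0, 4.0]  # 4.0 was never seen during training
encoded = one_hot_with_saved_categories(new_values, saved_categories["account_status"])
print(encoded)
```

The output always has one column per saved category, whatever values happen to be present in the new batch.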

In [46]:
X = np.concatenate([df_numerics.values, one_hot], axis=1)

In [47]:
np.shape(X)

(89976, 161)

## 5.2 Model development

In [48]:
y = df['default'].to_numpy()

In [49]:
y = y.reshape(len(y),1)

In [50]:
np.shape(y)

(89976, 1)

In [None]:
test_size = 20000

In [None]:
X_train = X[:-test_size].astype('float32')
y_train = y[:-test_size]

In [None]:
np.shape(y_train)

In [None]:
X_test = X[-test_size:].astype('float32')
y_test = y[-test_size:]

In [None]:
from keras import models, layers, regularizers

In [None]:
model = models.Sequential()
model.add(layers.Masking(mask_value=-1,input_shape=(np.shape(X)[1],)))
model.add(layers.Dense(128, activation='relu', kernel_regularizer=regularizers.l1_l2(l1=1e-5, l2=1e-4)))
model.add(layers.Dense(128, activation='relu', kernel_regularizer=regularizers.l1_l2(l1=1e-5, l2=1e-4)))
model.add(layers.Dense(128, activation='relu'))
#model.add(layers.Dense(128, activation='relu'))
#model.add(layers.Dense(128, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
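
For reference, the `l1_l2` regularizer adds to the loss a penalty of the form `l1 * sum(|w|) + l2 * sum(w^2)` for each regularized kernel. The NumPy sketch below reproduces that formula with the coefficients used in the first two Dense layers. The random kernel is a hypothetical stand-in for trained weights, so the printed value is only illustrative.

```python
import numpy as np

def l1_l2_penalty(weights, l1=1e-5, l2=1e-4):
    """Penalty added to the loss for one kernel: l1 * sum|w| + l2 * sum(w^2)."""
    return l1 * np.sum(np.abs(weights)) + l2 * np.sum(np.square(weights))

# Hypothetical kernel with the shape of the first Dense layer (161 inputs, 128 units).
rng = np.random.default_rng(0)
kernel = rng.normal(scale=0.1, size=(161, 128))

print(l1_l2_penalty(kernel))
print(l1_l2_penalty(np.zeros_like(kernel)))  # all-zero weights carry no penalty
```

Larger weights increase the penalty, so the optimizer is pushed towards smaller kernels, which limits overfitting on the 161 input features.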

In [None]:
metrics_list = [
    metrics.BinaryAccuracy(name='accuracy'),
    metrics.FalseNegatives(name="fn"),
    metrics.FalsePositives(name="fp"),
    metrics.TrueNegatives(name="tn"),
    metrics.TruePositives(name="tp"),
    metrics.Precision(name="precision"),
    metrics.Recall(name="recall"),
]

In [None]:
#metrics_list = [ metrics.BinaryAccuracy(name='accuracy')]

In [None]:
from keras import optimizers

In [None]:
opt = optimizers.Adam(learning_rate=0.001)

In [None]:
# Note: `opt` (Adam) defined above is not passed here; the model is compiled with RMSprop
model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=metrics_list)

In [None]:
counts = df.groupby('default').default.count().to_numpy()

In [None]:
weight_for_0 = 1.0 / counts[0]
weight_for_1 = 1.0 / counts[1]
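
The effect of these inverse-frequency weights can be checked with the class counts from the quality check. The sketch below hard-codes those counts, which is an assumption of this example (the notebook reads them from `df`), and shows that both classes end up with the same total weight.

```python
counts = [88688, 1288]  # rows with default == 0 and default == 1

weight_for_0 = 1.0 / counts[0]
weight_for_1 = 1.0 / counts[1]

# Each positive example weighs this many times more than a negative one.
relative_weight = weight_for_1 / weight_for_0

# Summed over all rows, both classes carry the same total weight.
total_weight_0 = counts[0] * weight_for_0
total_weight_1 = counts[1] * weight_for_1

print(f"relative weight of a positive example: {relative_weight:.1f}")
print(total_weight_0, total_weight_1)
```

Multiplying both weights by a common factor keeps this ratio and only rescales the magnitude of the loss.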

In [None]:
history = model.fit(X_train,
                    y_train,
                    epochs=20,
                    batch_size=512,
                    validation_split=0.1,
                   class_weight = {0: weight_for_0, 1: weight_for_1})

## 5.3 Performance evaluation

In [None]:
loss = history.history['loss']
val_loss = history.history['val_loss']

epochs = range(1, len(loss) + 1)

plt.figure(figsize=(20,10))
plt.grid()
plt.plot(epochs, loss, 'red', label='Training loss')
plt.plot(epochs, val_loss, 'blue', label='Validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()

plt.show()

In [None]:
history.history.keys()

In [None]:
acc = history.history['accuracy']
val_acc = history.history['val_accuracy']

epochs = range(1, len(acc) + 1)

plt.figure(figsize=(20,10))
plt.grid()
plt.plot(epochs, acc, 'red', label='Training accuracy')
plt.plot(epochs, val_acc, 'blue', label='Validation accuracy')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()

plt.show()

In [None]:
precision = history.history['precision']
val_precision = history.history['val_precision']

epochs = range(1, len(precision) + 1)

plt.figure(figsize=(20,10))
plt.grid()
plt.plot(epochs, precision, 'red', label='Training precision')
plt.plot(epochs, val_precision, 'blue', label='Validation precision')
plt.title('Training and validation precision')
plt.xlabel('Epochs')
plt.ylabel('Precision')
plt.legend()

plt.show()

In [None]:
recall = history.history['recall']
val_recall = history.history['val_recall']

epochs = range(1, len(recall) + 1)

plt.figure(figsize=(20,10))
plt.grid()
plt.plot(epochs, recall, 'red', label='Training recall')
plt.plot(epochs, val_recall, 'blue', label='Validation recall')
plt.title('Training and validation recall')
plt.xlabel('Epochs')
plt.ylabel('Recall')
plt.legend()

plt.show()

In [None]:
model.evaluate(X_test, y_test)

In [None]:
plt.hist(model.predict(X_test))

In [None]:
y_pred = model.predict(X_test)

In [None]:
type(X_test)

In [None]:
sum(y_pred>0.5)

In [None]:
y_pred

In [None]:
sum(y_test)
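
Comparing `sum(y_pred>0.5)` with `sum(y_test)` only shows how many rows are flagged, not how many of them are correct. The sketch below is one possible way to compute precision and recall at an arbitrary threshold with NumPy. The helper name and the toy arrays are hypothetical; in the notebook the inputs would be `y_test` and `y_pred`.

```python
import numpy as np

def precision_recall_at(y_true, y_score, threshold=0.5):
    """Precision and recall of the positive class at a given score threshold."""
    y_true = np.asarray(y_true).ravel().astype(bool)
    y_hat = np.asarray(y_score).ravel() > threshold
    tp = int(np.sum(y_hat & y_true))
    fp = int(np.sum(y_hat & ~y_true))
    fn = int(np.sum(~y_hat & y_true))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical labels and scores, shaped (n, 1) like y_test and the model.predict output.
y_true = np.array([[0], [0], [1], [1], [0], [1]])
y_score = np.array([[0.1], [0.7], [0.8], [0.4], [0.2], [0.9]])

for t in (0.3, 0.5, 0.8):
    print(t, precision_recall_at(y_true, y_score, t))
```

Lowering the threshold typically trades precision for recall, which can matter more than the default 0.5 cut-off when positives are this rare.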

## Saving the prototype

In [None]:
model.save(output_folder_path+'nn/')

In [None]:
with open(output_folder_path+'categories.txt', "wb") as fp:
    pickle.dump(categories_dict, fp)

In [None]:
with open(output_folder_path+'numerics_columns.txt', "wb") as fp:
    # Keep the original column order: a set intersection would not preserve it
    pickle.dump([c for c in numerics_columns if c in df.columns], fp)
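
The code that serves predictions has to read these two files back before it can encode new data. The sketch below shows the round trip with `pickle`. It uses a temporary folder and small hypothetical stand-ins for `categories_dict` and `numerics_columns`, so it can be run without the trained artifacts.

```python
import os
import pickle
import tempfile

# Hypothetical stand-ins for categories_dict and numerics_columns.
categories_dict = {"account_status": [-1.0, 1.0, 2.0]}
numerics_columns = ["age", "time_hours"]

# A temporary folder plays the role of output_folder_path in this sketch.
with tempfile.TemporaryDirectory() as folder:
    categories_path = os.path.join(folder, "categories.txt")
    columns_path = os.path.join(folder, "numerics_columns.txt")

    with open(categories_path, "wb") as fp:
        pickle.dump(categories_dict, fp)
    with open(columns_path, "wb") as fp:
        pickle.dump(numerics_columns, fp)

    # Files written with pickle must be opened in binary mode to be read back.
    with open(categories_path, "rb") as fp:
        loaded_categories = pickle.load(fp)
    with open(columns_path, "rb") as fp:
        loaded_columns = pickle.load(fp)

print(loaded_categories == categories_dict)
print(loaded_columns == numerics_columns)
```

Despite the `.txt` extension, the files are binary pickles, so they should be opened with `"rb"` and only loaded from a trusted source.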

In [None]:
# add prediction on the null rows

In [None]:
from keras.models import load_model

In [None]:
model = load_model(output_folder_path + 'nn/')

In [None]:
model.save(output_folder_path+'mymodel.h5',include_optimizer=False, save_traces=False)

In [None]:
output_folder_path

In [None]:
model = load_model(output_folder_path+'mymodel.h5')

In [None]:
model

In [None]:
output_folder_path='./../trained_model/'

In [None]:
import tensorflow as tf

In [None]:
tf.keras.models.load_model(output_folder_path + 'nn/')

In [None]:
tf.__version__

In [None]:
model_json = model.to_json()
with open("model.json", "w") as json_file:
    json_file.write(model_json)
# serialize weights to HDF5
model.save_weights("model.h5")
print("Saved model to disk")