# Capstone Project: The Efficacy of Multilayer Perceptron Algorithms in Predicting Bankruptcy

<ul>
<li><a href="#introduction">INTRODUCTION</a></li>
<li><a href="#Benchmark Logistic Regression">Benchmark Logistic Regression</a></li>
<li><a href="#MLP Model">MLP Model</a></li>
<li><a href="#assess">Data Assessment</a></li>
<li><a href="#cleaning">Data Cleaning</a></li>
<li><a href="#target">Separation of Target Variables</a></li>
<li><a href="#no nulls">No Nulls Data</a></li>
<li><a href="#one-hot null">Creation of One-hot Null Variable</a></li>
<li><a href="#Creation of Sum Null Variables">Creation of Sum Null Variables</a></li>
<li><a href="#PIVOT">PIVOT</a></li>
<li><a href="#Data Reorganization">Data Reorganization</a></li>
<li><a href="#Data Exploration - Descriptive Statistics">Data Exploration - Descriptive Statistics</a></li>
<li><a href="#Exploratory Visualization">Exploratory Visualization</a></li>
<li><a href="#Preprocessing">Preprocessing</a></li>
<li><a href="#Benchmark: Logistic Regression">Benchmark: Logistic Regression</a></li>
<li><a href="#originalMLP">Origial MLP</a></li>
<li><a href="#conclusion">Conclusion</a></li> 
<li><a href="#references">References</a></li>
</ul>

<a id='introduction'></a>
## INTRODUCTION

<a id='Benchmark Logistic Regression'></a>
## Benchmark Logistic Regression

In [1]:
# This imports the necessary libraries for the logistic regression models.
from sklearn.linear_model import LogisticRegression

# This imports the AUC score for scoring the models.
# This comes from Reference 27 in References.
from sklearn.metrics import roc_auc_score

# These are libraries that will be needed to organize data,
# graph data, and change the working directory.
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
%matplotlib inline


### Load Data: No Nulls

In [27]:
# This loads the no_nulls X training and testing data
# from the CSVs and converts the data to np.arrays.
Xtrain_nonulls = pd.read_csv('no-peaking/Xtrain_nonulls.csv')
Xtrain_nonulls = np.array(Xtrain_nonulls)
Xtest_nonulls = pd.read_csv('no-peaking/Xtest_nonulls.csv')
Xtest_nonulls = np.array(Xtest_nonulls)

# This loads the no_nulls Y training and tesing data
# from the CSVs. It also ravels the y_data, so only 
# rows are shown, not columns.
Ytrain_nonulls = pd.read_csv('no-peaking/Ytrain_nonulls.csv')
Ytrain_nonulls = np.array(Ytrain_nonulls)
Ytrain_nonulls = Ytrain_nonulls.ravel() 
Ytest_nonulls = pd.read_csv('no-peaking/Ytest_nonulls.csv')
Ytest_nonulls = np.array(Ytest_nonulls)
Ytest_nonulls = Ytest_nonulls.ravel()

### Performing Hold-out Cross-Validation on just the training set
K-fold Cross-Validation is a standard method for preventing a logistic regression model from overfitting. However, because the testing set, the fifth year of the dataset, is arbitrarily chosen (not random), K-fold Cross-Validation cannot be applied to the dataset (Reference 3). K-fold Cross-Validation would corrupt the testing set with data leakage considering that the dataset is a time-series set (Reference 3). To prevent data leakage, Hold-out cross-Validation will only be applied to the training set (References 3 & 4). Hold-out Cross-Validation takes a percentage of the training set as a validation set to test the accuracy of the model during the training stage. This method of cross validation, like all methods, is used to prevent the overfitting of a model and poor accuracy performance when applying the testing data to the fitted model (Reference 3).

### No Nulls Logistic Regression Benchmark

In [28]:
# This import train_test_split from sklearn.
from sklearn.model_selection import train_test_split

In [29]:
# This creates the hold-out validation set for the logistic regression model
# for the No Nulls dataset.
# This comes from Reference 4 in References.
Xtrain_nonulls, Xval_nonulls, Ytrain_nonulls, Yval_nonulls = train_test_split(
                    Xtrain_nonulls, Ytrain_nonulls, test_size = 0.3, random_state = 13)

In [30]:
# This creates the logistic regression model for the No Nulls Dataset.
log_nonulls = LogisticRegression(penalty='l2', max_iter=1000)

# This fits the log_nonulls model with the training data.
log_nonulls.fit(Xtrain_nonulls,Ytrain_nonulls)

# This predicts the y values from the Xval dataset.
yval_pred_nonulls = log_nonulls.predict(Xval_nonulls)

# This returns the validation AUC score.
VAL_auc_nonulls = roc_auc_score(Yval_nonulls, yval_pred_nonulls)
VAL_auc_nonulls



0.8506022247116249

In [52]:
# This tests the model, built on the training data, with the
# testing data from year 5.
ytest_pred_nonulls = log_nonulls.predict(Xtest_nonulls)
TEST_auc_nonulls = roc_auc_score(Ytest_nonulls, ytest_pred_nonulls)
print("The AUC score for the model is %.4f." % TEST_auc_nonulls)

The AUC score for the model is 0.3163.


### Load Data: Nulls only

In [32]:
#### This loads the Nulls only training and testing data
# from the CSVs and converts the data to np.arrays.
Xtrain_nullsonly = pd.read_csv('no-peaking/Xtrain_sum.csv')
Xtrain_nullsonly = Xtrain_nullsonly.drop(
    Xtrain_nullsonly.columns[64], axis=1)
Xtrain_nullsonly = np.array(Xtrain_nullsonly)
Xtest_nullsonly = pd.read_csv('no-peaking/Xtest_sum.csv')
Xtest_nullsonly = Xtest_nullsonly.drop(
    Xtest_nullsonly.columns[64], axis=1)
Xtest_nullsonly = np.array(Xtest_nullsonly)

# This loads the Nulls only Y training and tesing data
# from the CSVs. It also ravels the y_data, so only 
# rows are shown, not columns.
Ytrain_nullsonly = pd.read_csv('no-peaking/Ytrain_sum.csv')
Ytrain_nullsonly = np.array(Ytrain_nullsonly)
Ytrain_nullsonly = Ytrain_nullsonly.ravel() 
Ytest_nullsonly = pd.read_csv('no-peaking/Ytest_sum.csv')
Ytest_nullsonly = np.array(Ytest_nullsonly)
Ytest_nullsonly = Ytest_nullsonly.ravel()

### Nulls Only Logistic Regression Benchmark

In [33]:
# This creates the hold-out validation set for the logistic regression model
# for the Sum dataset.
# This comes from Reference 4 in References.
Xtrain_nullsonly, Xval_nullsonly, Ytrain_nullsonly, Yval_nullsonly = train_test_split(
                    Xtrain_nullsonly, Ytrain_nullsonly, test_size = 0.3, random_state = 13)

In [34]:
# This creates the logistic regression model for the Nulls Only Dataset.
log_nullsonly = LogisticRegression(penalty='l2', max_iter=1000)

# This fits the log_onehot model with the training data.
log_nullsonly.fit(Xtrain_nullsonly,Ytrain_nullsonly)

# This predicts the y values from the Xval dataset.
yval_pred_nullsonly = log_nullsonly.predict(Xval_nullsonly)

# This returns the validation AUC score.
VAL_auc_nullsonly = roc_auc_score(Yval_nullsonly, yval_pred_nullsonly)
VAL_auc_nullsonly



0.7350826303765834

In [35]:
# This tests the model, built on the training data, with the
# testing data from year 5.
ytest_pred_nullsonly = log_nullsonly.predict(Xtest_nullsonly)
TEST_auc_nullsonly = roc_auc_score(Ytest_nullsonly, ytest_pred_nullsonly)
TEST_auc_nullsonly = TEST_auc_nullsonly * 100
print("The AUC score for the model is %.2f" % TEST_auc_nullsonly, "%")

The AUC score for the model is 68.29 %


### Load Data: Onehot

In [36]:
# This loads the one hot X training and testing data
# from the CSVs and converts the data to np.arrays.
Xtrain_onehot = pd.read_csv('no-peaking/Xtrain_onehot.csv')
Xtrain_onehot = np.array(Xtrain_onehot)
Xtest_onehot = pd.read_csv('no-peaking/Xtest_onehot.csv')
Xtest_onehot = np.array(Xtest_onehot)

# This loads the one hot Y training and tesing data
# from the CSVs. It also ravels the y_data, so only 
# rows are shown, not columns.
Ytrain_onehot = pd.read_csv('no-peaking/Ytrain_onehot.csv')
Ytrain_onehot = np.array(Ytrain_onehot)
Ytrain_onehot = Ytrain_onehot.ravel() 
Ytest_onehot = pd.read_csv('no-peaking/Ytest_onehot.csv')
Ytest_onehot = np.array(Ytest_onehot)
Ytest_onehot = Ytest_onehot.ravel()

### One Hot Logistic Regression Benchmark

In [37]:
# This creates the hold-out validation set for the logistic regression model
# for the One Hot dataset.
# This comes from Reference 4 in References.
Xtrain_onehot, Xval_onehot, Ytrain_onehot, Yval_onehot = train_test_split(
                    Xtrain_onehot, Ytrain_onehot, test_size = 0.3, random_state = 13)

In [38]:
# This creates the logistic regression model for the One Hot Dataset.
log_onehot = LogisticRegression(penalty='l2', max_iter=1000)

# This fits the log_onehot model with the training data.
log_onehot.fit(Xtrain_onehot,Ytrain_onehot)

# This predicts the y values from the Xval dataset.
yval_pred_onehot = log_onehot.predict(Xval_onehot)

# This returns the validation AUC score.
VAL_auc_onehot = roc_auc_score(Yval_onehot, yval_pred_onehot)
VAL_auc_onehot



0.8647068328667264

In [53]:
# This tests the model, built on the training data, with the
# testing data from year 5.
ytest_pred_onehot = log_onehot.predict(Xtest_onehot)
TEST_auc_onehot = roc_auc_score(Ytest_onehot, ytest_pred_onehot)
print("The AUC score for the model is %.4f." % TEST_auc_onehot)

The AUC score for the model is 0.6406.


### Load Data: Sum

In [41]:
# This loads the SUM X training and testing data
# from the CSVs and converts the data to np.arrays.
Xtrain_sum = pd.read_csv('no-peaking/Xtrain_sum.csv')
Xtrain_sum = np.array(Xtrain_sum)
Xtest_sum = pd.read_csv('no-peaking/Xtest_sum.csv')
Xtest_sum = np.array(Xtest_sum)

# This loads the no_nulls Y training and tesing data
# from the CSVs. It also ravels the y_data, so only 
# rows are shown, not columns.
Ytrain_sum = pd.read_csv('no-peaking/Ytrain_sum.csv')
Ytrain_sum = np.array(Ytrain_sum)
Ytrain_sum = Ytrain_sum.ravel() 
Ytest_sum = pd.read_csv('no-peaking/Ytest_sum.csv')
Ytest_sum = np.array(Ytest_sum)
Ytest_sum = Ytest_sum.ravel()

### Sum Logistic Regression Benchmark

In [42]:
# This creates the hold-out validation set for the logistic regression model
# for the Sum dataset.
# This comes from Reference 4 in References.
Xtrain_sum, Xval_sum, Ytrain_sum, Yval_sum = train_test_split(
                    Xtrain_sum, Ytrain_sum, test_size = 0.3, random_state = 13)

In [43]:
# This creates the logistic regression model for the One Hot Dataset.
log_sum = LogisticRegression(penalty='l2', max_iter=1000)

# This fits the log_onehot model with the training data.
log_sum.fit(Xtrain_sum,Ytrain_sum)

# This predicts the y values from the Xval dataset.c
yval_pred_sum = log_sum.predict(Xval_sum)

# This returns the validation AUC score.
VAL_auc_sum = roc_auc_score(Yval_sum, yval_pred_sum)
VAL_auc_sum



0.7437902237728553

In [51]:
# This tests the model, built on the training data, with the
# testing data from year 5.
ytest_pred_sum = log_sum.predict(Xtest_sum)
TEST_auc_sum = roc_auc_score(Ytest_sum, ytest_pred_sum)
print("The AUC score for the model is %.4f." % TEST_auc_sum)

The AUC score for the model is 0.6606.


<a id='MLP Model'></a>
## MLP Model

In [64]:
# This creates a directory to save the best models for the MLP.
os.mkdir('saved_models')

In [45]:
# This imports the necessary libraries for the MLP.

# This imports the sequential model, the layers,
# the SGD optimizer, the regularizers from keras.
# This comes from Reference 5 in Referenes.
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
from keras.optimizers import SGD
from keras import regularizers

# This imports checkpointer, which records the best weights
# for the algorithm.
# This comes from Reference 6 in References.
from keras.callbacks import ModelCheckpoint
import matplotlib.pyplot as plt

Using TensorFlow backend.


In [57]:
def build_model(drop_rate, l2_factor, first_dense, second_dense,
                third_dense, hidden_act, out_act, x):
    dim_int = int(np.size(x,1))
    # This defines the model as a sequential model.
    # This comes from References 1 in References.
    model = Sequential()

    # This is the input layer.
    # This comes from References 1 & 3 in References.
    model.add(Dense(first_dense, activation = hidden_act,
        kernel_regularizer = regularizers.l2(l2_factor),
        input_dim = dim_int))
    model.add(Dropout(drop_rate))

    # This creates the first hidden layer.
    # This comes from Reference 7 in References.
    model.add(Dense(second_dense,
        activation = hidden_act,
        kernel_regularizer = regularizers.l2(l2_factor)))
    model.add(Dropout(drop_rate))
    
    # This creates the second hidden layer.
    # This comes from Reference 7 in References.
    model.add(Dense(third_dense,
        activation = hidden_act,
        kernel_regularizer = regularizers.l2(l2_factor)))
    model.add(Dropout(drop_rate))

    # This creates the output layer.
    # This comes from Reference 7 in References.
    model.add(Dense(1, activation=out_act))
    # This returns the model.
    return model

In [66]:
# This comes from Reference 7 in References.
n_epochs = 100
size_of_batch = 50
stochastic = SGD()

# This builds the original MLP model for year 1.
# using Mattson and Steinart's original hyperparameters.
mlp_nonulls = build_model(drop_rate=0.5,
                            l2_factor=0.001,
                             first_dense=32,
                             second_dense=16,
                             third_dense=8,
                            hidden_act='relu',
                            out_act='sigmoid',
                            x=Xtrain_nonulls)

In [67]:
# This compiles the MLP model for the No Null data.
# This comes from Reference 13 in References.
mlp_nonulls.compile(loss='binary_crossentropy',
              optimizer= stochastic,
              metrics=['accuracy'])

In [68]:
# This creates checkpointer, which uses ModelCheckpoint to store the
# best weights of the model.
# This comes from References 15 & 16 in References.
checkpoint = ModelCheckpoint(filepath='saved_models/weights.best.mlp_nonulls.hdf5',
                              monitor='val_accuracy', verbose=1, save_best_only=True,
                            mode='max')
callbacks_list = [checkpoint]

In [71]:
# This fits the model and runs it for 100 epochs.
mlp_nonulls.fit(Xtrain_nonulls, Ytrain_nonulls, validation_data=
                (Xval_nonulls, Yval_nonulls),
                epochs=n_epochs, batch_size=size_of_batch, 
                callbacks = callbacks_list, verbose=1)

Train on 13773 samples, validate on 5903 samples
Epoch 1/100
Epoch 2/100
 3050/13773 [=====>........................] - ETA: 0s - loss: 0.3877 - acc: 0.8751



Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 23/100
Epoch 24/100
Epoch 25/100
Epoch 26/100
Epoch 27/100
Epoch 28/100
Epoch 29/100
Epoch 30/100
Epoch 31/100
Epoch 32/100
Epoch 33/100
Epoch 34/100
Epoch 35/100
Epoch 36/100
Epoch 37/100
Epoch 38/100
Epoch 39/100
Epoch 40/100
Epoch 41/100
Epoch 42/100
Epoch 43/100
Epoch 44/100
Epoch 45/100
Epoch 46/100
Epoch 47/100
Epoch 48/100
Epoch 49/100
Epoch 50/100
Epoch 51/100
Epoch 52/100
Epoch 53/100
Epoch 54/100
Epoch 55/100
Epoch 56/100
Epoch 57/100
Epoch 58/100
Epoch 59/100
Epoch 60/100
Epoch 61/100


Epoch 62/100
Epoch 63/100
Epoch 64/100
Epoch 65/100
Epoch 66/100
Epoch 67/100
Epoch 68/100
Epoch 69/100
Epoch 70/100
Epoch 71/100
Epoch 72/100
Epoch 73/100
Epoch 74/100
Epoch 75/100
Epoch 76/100
Epoch 77/100
Epoch 78/100
Epoch 79/100
Epoch 80/100
Epoch 81/100
Epoch 82/100
Epoch 83/100
Epoch 84/100
Epoch 85/100
Epoch 86/100
Epoch 87/100
Epoch 88/100
Epoch 89/100
Epoch 90/100
Epoch 91/100
Epoch 92/100
Epoch 93/100
Epoch 94/100
Epoch 95/100
Epoch 96/100
Epoch 97/100
Epoch 98/100
Epoch 99/100
Epoch 100/100


<keras.callbacks.History at 0x1ef92169080>

In [72]:
mlp_nonulls.load_weights('saved_models/weights.best.mlp_nonulls.hdf5')

OSError: Unable to open file (unable to open file: name = 'saved_models/weights.best.mlp_nonulls.hdf5', errno = 2, error message = 'No such file or directory', flags = 0, o_flags = 0)

<a id='references'></a>
## References

1. https://stackoverflow.com/questions/41032551/how-to-compute-receiving-operating-characteristic-roc-and-auc-in-keras
2. https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegressionCV.html
3. https://towardsdatascience.com/time-series-nested-cross-validation-76adba623eb9
4. https://dziganto.github.io/cross-validation/data%20science/machine%20learning/model%20tuning/python/Model-Tuning-with-Validation-and-Cross-Validation/
5. https://keras.io/getting-started/sequential-model-guide/
6. Udacity Machine Learning Engineer Nanodegree Program, Semester 2, Brian Campbell - Dog Breed Classifier Project
7. https://keras.io/getting-started/sequential-model-guide/
8. https://docs.scipy.org/doc/numpy/reference/generated/numpy.ma.size.html