# Capstone Project: The Efficacy of Multilayer Perceptron Algorithms in Predicting Bankruptcy

<ul>
<li><a href="#introduction">INTRODUCTION</a></li>
<li><a href="#Benchmark Logistic Regression">Benchmark Logistic Regression</a></li>
<li><a href="#MLP Model">MLP Model</a></li>
<li><a href="#assess">Data Assessment</a></li>
<li><a href="#cleaning">Data Cleaning</a></li>
<li><a href="#target">Separation of Target Variables</a></li>
<li><a href="#no nulls">No Nulls Data</a></li>
<li><a href="#one-hot null">Creation of One-hot Null Variable</a></li>
<li><a href="#Creation of Sum Null Variables">Creation of Sum Null Variables</a></li>
<li><a href="#PIVOT">PIVOT</a></li>
<li><a href="#Data Reorganization">Data Reorganization</a></li>
<li><a href="#Data Exploration - Descriptive Statistics">Data Exploration - Descriptive Statistics</a></li>
<li><a href="#Exploratory Visualization">Exploratory Visualization</a></li>
<li><a href="#Preprocessing">Preprocessing</a></li>
<li><a href="#Benchmark: Logistic Regression">Benchmark: Logistic Regression</a></li>
<li><a href="#originalMLP">Original MLP</a></li>
<li><a href="#conclusion">Conclusion</a></li> 
<li><a href="#references">References</a></li>
</ul>

<a id='introduction'></a>
## INTRODUCTION

<a id='Benchmark Logistic Regression'></a>
## Benchmark Logistic Regression

In [1]:
# This imports the necessary libraries for the logistic regression models.
from sklearn.linear_model import LogisticRegression

# This imports the AUC score for scoring the models.
# This comes from Reference 27 in References.
from sklearn.metrics import roc_auc_score

# These are libraries that will be needed to organize data,
# graph data, and change the working directory.
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
%matplotlib inline


### Load Data: No Nulls

In [27]:
# This loads the no_nulls X training and testing data
# from the CSVs and converts the data to np.arrays.
Xtrain_nonulls = pd.read_csv('no-peaking/Xtrain_nonulls.csv')
Xtrain_nonulls = np.array(Xtrain_nonulls)
Xtest_nonulls = pd.read_csv('no-peaking/Xtest_nonulls.csv')
Xtest_nonulls = np.array(Xtest_nonulls)

# This loads the no_nulls Y training and testing data
# from the CSVs. It also ravels the y_data, so only 
# rows are shown, not columns.
Ytrain_nonulls = pd.read_csv('no-peaking/Ytrain_nonulls.csv')
Ytrain_nonulls = np.array(Ytrain_nonulls)
Ytrain_nonulls = Ytrain_nonulls.ravel() 
Ytest_nonulls = pd.read_csv('no-peaking/Ytest_nonulls.csv')
Ytest_nonulls = np.array(Ytest_nonulls)
Ytest_nonulls = Ytest_nonulls.ravel()

### Performing Hold-out Cross-Validation on just the training set
K-fold cross-validation is a standard method for detecting and limiting overfitting in a logistic regression model. However, the testing set is the fifth year of the dataset, chosen deliberately rather than at random, so K-fold cross-validation cannot be applied to the full dataset (Reference 3). Because the dataset is a time series, K-fold cross-validation would leak information from the testing set into training (Reference 3). To prevent data leakage, hold-out cross-validation is applied only to the training set (References 3 & 4). Hold-out cross-validation sets aside a percentage of the training set as a validation set, which is used to score the model during the training stage. Like other cross-validation methods, it is meant to reveal overfitting before the fitted model is applied to the testing data (Reference 3).
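
The split described above can be sketched in plain Python. This is only an illustration under stated assumptions: the `year` and `firm` fields, the toy row counts, and the `holdout_split` helper are hypothetical and do not appear in the notebook, which uses `train_test_split` on pre-separated CSV files instead.

```python
import random

def holdout_split(rows, test_year=5, val_fraction=0.3, seed=13):
    """Split rows into train, validation, and test sets.

    Rows from `test_year` form the test set. The remaining rows are
    shuffled and a `val_fraction` share becomes the validation set.
    """
    test = [r for r in rows if r["year"] == test_year]
    rest = [r for r in rows if r["year"] != test_year]
    rng = random.Random(seed)
    rng.shuffle(rest)
    n_val = int(round(len(rest) * val_fraction))
    val, train = rest[:n_val], rest[n_val:]
    return train, val, test

# Hypothetical toy data: 10 firms per year over 5 years.
rows = [{"year": y, "firm": f} for y in range(1, 6) for f in range(10)]
train, val, test = holdout_split(rows)
print(len(train), len(val), len(test))
```

The test year never enters the shuffle, so no year-5 row can end up in the training or validation set.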

### No Nulls Logistic Regression Benchmark

In [28]:
# This imports train_test_split from sklearn.
from sklearn.model_selection import train_test_split

In [29]:
# This creates the hold-out validation set for the logistic regression model
# for the No Nulls dataset.
# This comes from Reference 4 in References.
Xtrain_nonulls, Xval_nonulls, Ytrain_nonulls, Yval_nonulls = train_test_split(
                    Xtrain_nonulls, Ytrain_nonulls, test_size = 0.3, random_state = 13)

In [30]:
# This creates the logistic regression model for the No Nulls Dataset.
log_nonulls = LogisticRegression(penalty='l2', max_iter=1000)

# This fits the log_nonulls model with the training data.
log_nonulls.fit(Xtrain_nonulls,Ytrain_nonulls)

# This predicts the y values from the Xval dataset.
yval_pred_nonulls = log_nonulls.predict(Xval_nonulls)

# This returns the validation AUC score.
VAL_auc_nonulls = roc_auc_score(Yval_nonulls, yval_pred_nonulls)
VAL_auc_nonulls



0.8506022247116249

In [52]:
# This tests the model, built on the training data, with the
# testing data from year 5.
ytest_pred_nonulls = log_nonulls.predict(Xtest_nonulls)
TEST_auc_nonulls = roc_auc_score(Ytest_nonulls, ytest_pred_nonulls)
print("The AUC score for the model is %.4f." % TEST_auc_nonulls)

The AUC score for the model is 0.3163.
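
The cells above pass the hard 0/1 labels returned by `predict` to `roc_auc_score`. AUC measures how well a model ranks positives above negatives, and scikit-learn documents the `y_score` argument as probability estimates or decision values, so scoring from labels can understate the ranking quality. The sketch below uses made-up numbers and a hand-written `auc` helper (both hypothetical, not part of the notebook) to show how the two inputs can disagree.

```python
def auc(y_true, scores):
    """Area under the ROC curve by pairwise comparison (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical labels and predicted probabilities for six firms.
y_true = [0, 0, 0, 0, 1, 1]
proba = [0.05, 0.10, 0.20, 0.30, 0.35, 0.45]

# Thresholding at 0.5 turns every prediction into class 0.
labels = [1 if p >= 0.5 else 0 for p in proba]

auc_from_proba = auc(y_true, proba)
auc_from_labels = auc(y_true, labels)
print(auc_from_proba, auc_from_labels)
```

In the notebook, the equivalent change would be to score `log_nonulls.predict_proba(Xval_nonulls)[:, 1]` rather than the `predict` output. The AUC values recorded in this section were computed from labels and would likely differ under that change.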


### Load Data: Nulls only

In [32]:
# This loads the Nulls only X training and testing data
# from the CSVs and converts the data to np.arrays.
Xtrain_nullsonly = pd.read_csv('no-peaking/Xtrain_sum.csv')
Xtrain_nullsonly = Xtrain_nullsonly.drop(
    Xtrain_nullsonly.columns[64], axis=1)
Xtrain_nullsonly = np.array(Xtrain_nullsonly)
Xtest_nullsonly = pd.read_csv('no-peaking/Xtest_sum.csv')
Xtest_nullsonly = Xtest_nullsonly.drop(
    Xtest_nullsonly.columns[64], axis=1)
Xtest_nullsonly = np.array(Xtest_nullsonly)

# This loads the Nulls only Y training and testing data
# from the CSVs. It also ravels the y_data, so only 
# rows are shown, not columns.
Ytrain_nullsonly = pd.read_csv('no-peaking/Ytrain_sum.csv')
Ytrain_nullsonly = np.array(Ytrain_nullsonly)
Ytrain_nullsonly = Ytrain_nullsonly.ravel() 
Ytest_nullsonly = pd.read_csv('no-peaking/Ytest_sum.csv')
Ytest_nullsonly = np.array(Ytest_nullsonly)
Ytest_nullsonly = Ytest_nullsonly.ravel()

### Nulls Only Logistic Regression Benchmark

In [33]:
# This creates the hold-out validation set for the logistic regression model
# for the Nulls Only dataset.
# This comes from Reference 4 in References.
Xtrain_nullsonly, Xval_nullsonly, Ytrain_nullsonly, Yval_nullsonly = train_test_split(
                    Xtrain_nullsonly, Ytrain_nullsonly, test_size = 0.3, random_state = 13)

In [34]:
# This creates the logistic regression model for the Nulls Only Dataset.
log_nullsonly = LogisticRegression(penalty='l2', max_iter=1000)

# This fits the log_nullsonly model with the training data.
log_nullsonly.fit(Xtrain_nullsonly,Ytrain_nullsonly)

# This predicts the y values from the Xval dataset.
yval_pred_nullsonly = log_nullsonly.predict(Xval_nullsonly)

# This returns the validation AUC score.
VAL_auc_nullsonly = roc_auc_score(Yval_nullsonly, yval_pred_nullsonly)
VAL_auc_nullsonly



0.7350826303765834

In [35]:
# This tests the model, built on the training data, with the
# testing data from year 5.
ytest_pred_nullsonly = log_nullsonly.predict(Xtest_nullsonly)
TEST_auc_nullsonly = roc_auc_score(Ytest_nullsonly, ytest_pred_nullsonly)
TEST_auc_nullsonly = TEST_auc_nullsonly * 100
print("The AUC score for the model is %.2f" % TEST_auc_nullsonly, "%")

The AUC score for the model is 68.29 %


### Load Data: Onehot

In [36]:
# This loads the one hot X training and testing data
# from the CSVs and converts the data to np.arrays.
Xtrain_onehot = pd.read_csv('no-peaking/Xtrain_onehot.csv')
Xtrain_onehot = np.array(Xtrain_onehot)
Xtest_onehot = pd.read_csv('no-peaking/Xtest_onehot.csv')
Xtest_onehot = np.array(Xtest_onehot)

# This loads the one hot Y training and testing data
# from the CSVs. It also ravels the y_data, so only 
# rows are shown, not columns.
Ytrain_onehot = pd.read_csv('no-peaking/Ytrain_onehot.csv')
Ytrain_onehot = np.array(Ytrain_onehot)
Ytrain_onehot = Ytrain_onehot.ravel() 
Ytest_onehot = pd.read_csv('no-peaking/Ytest_onehot.csv')
Ytest_onehot = np.array(Ytest_onehot)
Ytest_onehot = Ytest_onehot.ravel()

### One Hot Logistic Regression Benchmark

In [37]:
# This creates the hold-out validation set for the logistic regression model
# for the One Hot dataset.
# This comes from Reference 4 in References.
Xtrain_onehot, Xval_onehot, Ytrain_onehot, Yval_onehot = train_test_split(
                    Xtrain_onehot, Ytrain_onehot, test_size = 0.3, random_state = 13)

In [38]:
# This creates the logistic regression model for the One Hot Dataset.
log_onehot = LogisticRegression(penalty='l2', max_iter=1000)

# This fits the log_onehot model with the training data.
log_onehot.fit(Xtrain_onehot,Ytrain_onehot)

# This predicts the y values from the Xval dataset.
yval_pred_onehot = log_onehot.predict(Xval_onehot)

# This returns the validation AUC score.
VAL_auc_onehot = roc_auc_score(Yval_onehot, yval_pred_onehot)
VAL_auc_onehot



0.8647068328667264

In [53]:
# This tests the model, built on the training data, with the
# testing data from year 5.
ytest_pred_onehot = log_onehot.predict(Xtest_onehot)
TEST_auc_onehot = roc_auc_score(Ytest_onehot, ytest_pred_onehot)
print("The AUC score for the model is %.4f." % TEST_auc_onehot)

The AUC score for the model is 0.6406.


### Load Data: Sum

In [41]:
# This loads the SUM X training and testing data
# from the CSVs and converts the data to np.arrays.
Xtrain_sum = pd.read_csv('no-peaking/Xtrain_sum.csv')
Xtrain_sum = np.array(Xtrain_sum)
Xtest_sum = pd.read_csv('no-peaking/Xtest_sum.csv')
Xtest_sum = np.array(Xtest_sum)

# This loads the Sum Y training and testing data
# from the CSVs. It also ravels the y_data, so only 
# rows are shown, not columns.
Ytrain_sum = pd.read_csv('no-peaking/Ytrain_sum.csv')
Ytrain_sum = np.array(Ytrain_sum)
Ytrain_sum = Ytrain_sum.ravel() 
Ytest_sum = pd.read_csv('no-peaking/Ytest_sum.csv')
Ytest_sum = np.array(Ytest_sum)
Ytest_sum = Ytest_sum.ravel()

### Sum Logistic Regression Benchmark

In [42]:
# This creates the hold-out validation set for the logistic regression model
# for the Sum dataset.
# This comes from Reference 4 in References.
Xtrain_sum, Xval_sum, Ytrain_sum, Yval_sum = train_test_split(
                    Xtrain_sum, Ytrain_sum, test_size = 0.3, random_state = 13)

In [43]:
# This creates the logistic regression model for the Sum Dataset.
log_sum = LogisticRegression(penalty='l2', max_iter=1000)

# This fits the log_sum model with the training data.
log_sum.fit(Xtrain_sum,Ytrain_sum)

# This predicts the y values from the Xval dataset.
yval_pred_sum = log_sum.predict(Xval_sum)

# This returns the validation AUC score.
VAL_auc_sum = roc_auc_score(Yval_sum, yval_pred_sum)
VAL_auc_sum



0.7437902237728553

In [51]:
# This tests the model, built on the training data, with the
# testing data from year 5.
ytest_pred_sum = log_sum.predict(Xtest_sum)
TEST_auc_sum = roc_auc_score(Ytest_sum, ytest_pred_sum)
print("The AUC score for the model is %.4f." % TEST_auc_sum)

The AUC score for the model is 0.6606.
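
The four benchmark results can be compared side by side. The sketch below only restates the validation and test AUC scores printed in the cells above, rounded to four decimals; the dictionary keys and variable names are chosen for illustration.

```python
# Validation and test AUC scores recorded in the cells above,
# rounded to four decimals.
benchmarks = {
    "no_nulls":   {"val": 0.8506, "test": 0.3163},
    "nulls_only": {"val": 0.7351, "test": 0.6829},
    "one_hot":    {"val": 0.8647, "test": 0.6406},
    "sum":        {"val": 0.7438, "test": 0.6606},
}

def gap(name):
    """Drop from validation AUC to test AUC for one dataset."""
    return benchmarks[name]["val"] - benchmarks[name]["test"]

for name, s in benchmarks.items():
    print(f"{name:<10} val={s['val']:.4f} test={s['test']:.4f} gap={gap(name):.4f}")

# Dataset with the highest test AUC, and the one with the largest drop.
best_test = max(benchmarks, key=lambda k: benchmarks[k]["test"])
largest_gap = max(benchmarks, key=gap)
print(best_test, largest_gap)
```

A large gap between validation and test scores, as in the No Nulls case, suggests that the randomly drawn validation set is not representative of the year-5 test data.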


<a id='MLP Model'></a>
## MLP Model

### MLP Model No Nulls

In [102]:
# This creates a directory to save the best models for the MLP.
# exist_ok=True avoids a FileExistsError when the directory already exists.
os.makedirs('saved_models', exist_ok=True)

In [183]:
# This imports the necessary libraries for the MLP.

# This imports the sequential model, the layers,
# the SGD optimizer, the regularizers from keras.
# This comes from Reference 5 in References.
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
from keras.optimizers import SGD, RMSprop, Nadam
from keras import regularizers

# This imports ModelCheckpoint, which records the best weights
# for the algorithm.
# This comes from Reference 6 in References.
from keras.callbacks import ModelCheckpoint
import matplotlib.pyplot as plt

In [285]:
def build_model(drop_rate, l2_factor, first_dense, second_dense,
                third_dense, hidden_act, out_act, x):
    # This reads the number of input features (columns) from x.
    dim_int = int(np.size(x,1))
    # This defines the model as a sequential model.
    # This comes from Reference 1 in References.
    model = Sequential()

    # This creates the first hidden layer and sets the input dimension.
    # This comes from References 1 & 3 in References.
    model.add(Dense(first_dense, activation = hidden_act,
        kernel_regularizer = regularizers.l2(l2_factor),
        input_dim = dim_int))
    model.add(Dropout(drop_rate))

    # This creates the second hidden layer.
    # This comes from Reference 7 in References.
    model.add(Dense(second_dense,
        activation = hidden_act,
        kernel_regularizer = regularizers.l2(l2_factor)))
    model.add(Dropout(drop_rate))

    # This creates the third hidden layer.
    # This comes from Reference 7 in References.
    model.add(Dense(third_dense,
        activation = hidden_act,
        kernel_regularizer = regularizers.l2(l2_factor)))
    model.add(Dropout(drop_rate))

    # This creates the output layer.
    # This comes from Reference 7 in References.
    model.add(Dense(1, activation=out_act))
    # This returns the model.
    return model
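
To get a feel for the size of the network that `build_model` produces, the trainable parameters of a stack of dense layers can be counted by hand: each layer contributes `(inputs + 1) * units` values (weights plus one bias per unit), and dropout layers add none. The sketch below assumes 64 input features, which is an illustrative guess; the real width is whatever `np.size(x, 1)` returns for each dataset. The helper name `dense_param_count` is hypothetical.

```python
def dense_param_count(n_inputs, layer_sizes):
    """Count weights and biases for a stack of fully connected layers."""
    total = 0
    fan_in = n_inputs
    for units in layer_sizes:
        total += (fan_in + 1) * units  # weights plus one bias per unit
        fan_in = units
    return total

# Assumption: 64 input features feeding the 32-16-8-1 layout.
n_params = dense_param_count(64, [32, 16, 8, 1])
print(n_params)
```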

In [293]:
# This comes from Reference 7 in References.
n_epochs = 100
size_of_batch = 50
stochastic = SGD(lr=0.001)
nad = Nadam()
RMS = RMSprop()

# This builds the original MLP model for year 1
# using Mattson and Steinart's original hyperparameters.
mlp_nonulls = build_model(drop_rate=0.5,
                            l2_factor=0.001,
                             first_dense=32,
                             second_dense=16,
                             third_dense=8,
                            hidden_act='relu',
                            out_act='sigmoid',
                            x=Xtrain_nonulls)

In [294]:
# This compiles the MLP model for the No Null data.
# This comes from Reference 13 in References.
mlp_nonulls.compile(loss='binary_crossentropy',
              optimizer= stochastic,
              metrics=['accuracy'])

In [295]:
# This creates checkpoint, which uses ModelCheckpoint to store the
# best weights of the model.
# This comes from Reference 6 in References.
checkpoint = ModelCheckpoint(filepath='saved_models/weights.best.mlp_nonulls.hdf5',
                             verbose=1, save_best_only=True)
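
With `save_best_only=True`, `ModelCheckpoint` overwrites the saved file only when the monitored quantity improves; its default monitor is `val_loss`. The following is a rough pure-Python sketch of that bookkeeping, not the Keras implementation. The helper name `epochs_that_save` and the loss values are hypothetical.

```python
def epochs_that_save(val_losses):
    """Return the 1-based epochs at which a best-only checkpoint would save,
    together with the best validation loss seen."""
    best = float("inf")
    saved = []
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:  # strictly lower loss counts as an improvement
            best = loss
            saved.append(epoch)
    return saved, best

# Hypothetical validation losses, not taken from the training logs.
saved, best = epochs_that_save([0.40, 0.37, 0.38, 0.35, 0.36])
print(saved, best)
```

Because only improving epochs are written to disk, loading the weights file after training restores the epoch with the lowest validation loss rather than the final epoch.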

In [296]:
# This fits the model and runs it for 100 epochs.
mlp_nonulls.fit(Xtrain_nonulls, Ytrain_nonulls, validation_split=0.20,
                epochs=n_epochs, batch_size=size_of_batch, 
                callbacks = [checkpoint], verbose=1)

Train on 11018 samples, validate on 2755 samples
Epoch 1/100

Epoch 00001: val_loss improved from inf to 0.73808, saving model to saved_models/weights.best.mlp_nonulls.hdf5
Epoch 2/100

Epoch 00002: val_loss improved from 0.73808 to 0.73423, saving model to saved_models/weights.best.mlp_nonulls.hdf5
Epoch 3/100

Epoch 00003: val_loss improved from 0.73423 to 0.73160, saving model to saved_models/weights.best.mlp_nonulls.hdf5
Epoch 4/100

Epoch 00004: val_loss improved from 0.73160 to 0.72911, saving model to saved_models/weights.best.mlp_nonulls.hdf5
Epoch 5/100

Epoch 00005: val_loss improved from 0.72911 to 0.72727, saving model to saved_models/weights.best.mlp_nonulls.hdf5
Epoch 6/100

Epoch 00006: val_loss improved from 0.72727 to 0.72538, saving model to saved_models/weights.best.mlp_nonulls.hdf5
Epoch 7/100

Epoch 00007: val_loss improved from 0.72538 to 0.72365, saving model to saved_models/weights.best.mlp_nonulls.hdf5
Epoch 8/100

Epoch 00008: val_loss improved from 0.72365 to

Epoch 33/100

Epoch 00033: val_loss improved from 0.67888 to 0.67667, saving model to saved_models/weights.best.mlp_nonulls.hdf5
Epoch 34/100

Epoch 00034: val_loss improved from 0.67667 to 0.67489, saving model to saved_models/weights.best.mlp_nonulls.hdf5
Epoch 35/100

Epoch 00035: val_loss improved from 0.67489 to 0.67273, saving model to saved_models/weights.best.mlp_nonulls.hdf5
Epoch 36/100

Epoch 00036: val_loss improved from 0.67273 to 0.67088, saving model to saved_models/weights.best.mlp_nonulls.hdf5
Epoch 37/100

Epoch 00037: val_loss improved from 0.67088 to 0.66889, saving model to saved_models/weights.best.mlp_nonulls.hdf5
Epoch 38/100

Epoch 00038: val_loss improved from 0.66889 to 0.66720, saving model to saved_models/weights.best.mlp_nonulls.hdf5
Epoch 39/100

Epoch 00039: val_loss improved from 0.66720 to 0.66530, saving model to saved_models/weights.best.mlp_nonulls.hdf5
Epoch 40/100

Epoch 00040: val_loss improved from 0.66530 to 0.66344, saving model to saved_model


Epoch 00065: val_loss improved from 0.62178 to 0.62047, saving model to saved_models/weights.best.mlp_nonulls.hdf5
Epoch 66/100

Epoch 00066: val_loss improved from 0.62047 to 0.61903, saving model to saved_models/weights.best.mlp_nonulls.hdf5
Epoch 67/100

Epoch 00067: val_loss improved from 0.61903 to 0.61727, saving model to saved_models/weights.best.mlp_nonulls.hdf5
Epoch 68/100

Epoch 00068: val_loss improved from 0.61727 to 0.61624, saving model to saved_models/weights.best.mlp_nonulls.hdf5
Epoch 69/100

Epoch 00069: val_loss improved from 0.61624 to 0.61494, saving model to saved_models/weights.best.mlp_nonulls.hdf5
Epoch 70/100

Epoch 00070: val_loss improved from 0.61494 to 0.61351, saving model to saved_models/weights.best.mlp_nonulls.hdf5
Epoch 71/100

Epoch 00071: val_loss improved from 0.61351 to 0.61232, saving model to saved_models/weights.best.mlp_nonulls.hdf5
Epoch 72/100

Epoch 00072: val_loss improved from 0.61232 to 0.61106, saving model to saved_models/weights.bes


Epoch 00097: val_loss improved from 0.58291 to 0.58167, saving model to saved_models/weights.best.mlp_nonulls.hdf5
Epoch 98/100

Epoch 00098: val_loss improved from 0.58167 to 0.58042, saving model to saved_models/weights.best.mlp_nonulls.hdf5
Epoch 99/100

Epoch 00099: val_loss improved from 0.58042 to 0.57961, saving model to saved_models/weights.best.mlp_nonulls.hdf5
Epoch 100/100

Epoch 00100: val_loss improved from 0.57961 to 0.57879, saving model to saved_models/weights.best.mlp_nonulls.hdf5


<keras.callbacks.History at 0x1efc2029a90>
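
The sample counts in the training logs follow from `validation_split`: Keras holds out the last fraction of the rows passed to `fit`, taken before any shuffling. The sketch below assumes the split index is computed as `int(n * (1 - validation_split))`; that assumption reproduces the counts logged in this notebook. The helper name `keras_style_split` is hypothetical.

```python
def keras_style_split(n_samples, validation_split):
    """Return (n_train, n_val) when the last fraction of rows is held out."""
    split_at = int(n_samples * (1.0 - validation_split))
    return split_at, n_samples - split_at

# 13773 rows is the sum of the training and validation counts
# printed in the log above.
n_train, n_val = keras_style_split(13773, 0.20)
print(n_train, n_val)
```

Note that this split is separate from the 30% hold-out set created earlier with `train_test_split`, which the MLP cells do not use.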

In [300]:
# This loads the best weights from the model.
# This comes from Reference 6 in References.
mlp_nonulls.load_weights('saved_models/weights.best.mlp_nonulls.hdf5')

In [301]:
# This prints the accuracy of the model.
# This comes from Reference 9 in References.
score = mlp_nonulls.evaluate(Xtest_nonulls, Ytest_nonulls, verbose=0)
accuracy = 100*score[1]
print('Test accuracy: %.4f%%' %accuracy)

Test accuracy: 96.1937%


In [302]:
# This prints the AUC score for the model.
Ypred_nonulls = mlp_nonulls.predict(Xtest_nonulls)
mlp_nonulls_ROC = roc_auc_score(Ytest_nonulls, Ypred_nonulls)
print("The AUC score for the model is %.4f." % mlp_nonulls_ROC)

The AUC score for the model is 0.5130.
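
A test accuracy above 96% alongside an AUC near 0.5 is the pattern one would expect if bankrupt firms are rare in the test set and the model scores nearly every firm the same way. The class balance is not stated in this section, so the 4% positive rate below is purely an assumption for illustration, as are the helper functions `accuracy` and `auc`.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def auc(y_true, scores):
    """Area under the ROC curve by pairwise comparison (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Assumed imbalance: 96 healthy firms, 4 bankrupt firms.
y_true = [0] * 96 + [1] * 4

# A model that gives every firm the same low score.
scores = [0.1] * 100
labels = [1 if s >= 0.5 else 0 for s in scores]

acc = accuracy(y_true, labels)
area = auc(y_true, scores)
print(acc, area)
```

This is why AUC, rather than accuracy, is the more informative score for comparing the models in this notebook.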


### MLP Model Nulls Only

In [342]:
# This comes from Reference 7 in References.
n_epochs = 100
size_of_batch = 50
stochastic = SGD(lr=0.001)
nad = Nadam()
RMS = RMSprop()

# This builds the original MLP model for year 1
# using Mattson and Steinart's original hyperparameters.
mlp_nullsonly = build_model(drop_rate=0.5,
                            l2_factor=0.001,
                             first_dense=32,
                             second_dense=16,
                             third_dense=8,
                            hidden_act='relu',
                            out_act='sigmoid',
                            x=Xtrain_nullsonly)

In [343]:
# This compiles the MLP model for the Nulls Only data.
# This comes from Reference 13 in References.
mlp_nullsonly.compile(loss='binary_crossentropy',
              optimizer= stochastic,
              metrics=['accuracy'])

In [344]:
# This creates checkpoint, which uses ModelCheckpoint to store the
# best weights of the model.
# This comes from Reference 6 in References.
checkpoint = ModelCheckpoint(filepath='saved_models/weights.best.mlp_nullsonly.hdf5',
                             verbose=1, save_best_only=True)

In [345]:
# This fits the model and runs it for 100 epochs.
mlp_nullsonly.fit(Xtrain_nullsonly, Ytrain_nullsonly, validation_split=0.20,
                epochs=n_epochs, batch_size=size_of_batch, 
                callbacks = [checkpoint], verbose=1)

Train on 45858 samples, validate on 11465 samples
Epoch 1/100

Epoch 00001: val_loss improved from inf to 0.78523, saving model to saved_models/weights.best.mlp_nullsonly.hdf5
Epoch 2/100

Epoch 00002: val_loss improved from 0.78523 to 0.76681, saving model to saved_models/weights.best.mlp_nullsonly.hdf5
Epoch 3/100

Epoch 00003: val_loss improved from 0.76681 to 0.75406, saving model to saved_models/weights.best.mlp_nullsonly.hdf5
Epoch 4/100

Epoch 00004: val_loss improved from 0.75406 to 0.74457, saving model to saved_models/weights.best.mlp_nullsonly.hdf5
Epoch 5/100

Epoch 00005: val_loss improved from 0.74457 to 0.73697, saving model to saved_models/weights.best.mlp_nullsonly.hdf5
Epoch 6/100

Epoch 00006: val_loss improved from 0.73697 to 0.73190, saving model to saved_models/weights.best.mlp_nullsonly.hdf5
Epoch 7/100

Epoch 00007: val_loss improved from 0.73190 to 0.72787, saving model to saved_models/weights.best.mlp_nullsonly.hdf5
Epoch 8/100

Epoch 00008: val_loss improved 


Epoch 00032: val_loss improved from 0.69282 to 0.69174, saving model to saved_models/weights.best.mlp_nullsonly.hdf5
Epoch 33/100

Epoch 00033: val_loss improved from 0.69174 to 0.69071, saving model to saved_models/weights.best.mlp_nullsonly.hdf5
Epoch 34/100

Epoch 00034: val_loss improved from 0.69071 to 0.68977, saving model to saved_models/weights.best.mlp_nullsonly.hdf5
Epoch 35/100

Epoch 00035: val_loss improved from 0.68977 to 0.68881, saving model to saved_models/weights.best.mlp_nullsonly.hdf5
Epoch 36/100

Epoch 00036: val_loss improved from 0.68881 to 0.68778, saving model to saved_models/weights.best.mlp_nullsonly.hdf5
Epoch 37/100

Epoch 00037: val_loss improved from 0.68778 to 0.68725, saving model to saved_models/weights.best.mlp_nullsonly.hdf5
Epoch 38/100

Epoch 00038: val_loss improved from 0.68725 to 0.68652, saving model to saved_models/weights.best.mlp_nullsonly.hdf5
Epoch 39/100

Epoch 00039: val_loss improved from 0.68652 to 0.68538, saving model to saved_mode


Epoch 00064: val_loss improved from 0.66295 to 0.66214, saving model to saved_models/weights.best.mlp_nullsonly.hdf5
Epoch 65/100

Epoch 00065: val_loss improved from 0.66214 to 0.66098, saving model to saved_models/weights.best.mlp_nullsonly.hdf5
Epoch 66/100

Epoch 00066: val_loss improved from 0.66098 to 0.66024, saving model to saved_models/weights.best.mlp_nullsonly.hdf5
Epoch 67/100

Epoch 00067: val_loss improved from 0.66024 to 0.65934, saving model to saved_models/weights.best.mlp_nullsonly.hdf5
Epoch 68/100

Epoch 00068: val_loss improved from 0.65934 to 0.65836, saving model to saved_models/weights.best.mlp_nullsonly.hdf5
Epoch 69/100

Epoch 00069: val_loss improved from 0.65836 to 0.65739, saving model to saved_models/weights.best.mlp_nullsonly.hdf5
Epoch 70/100

Epoch 00070: val_loss improved from 0.65739 to 0.65657, saving model to saved_models/weights.best.mlp_nullsonly.hdf5
Epoch 71/100

Epoch 00071: val_loss improved from 0.65657 to 0.65565, saving model to saved_mode


Epoch 00096: val_loss improved from 0.63457 to 0.63379, saving model to saved_models/weights.best.mlp_nullsonly.hdf5
Epoch 97/100

Epoch 00097: val_loss improved from 0.63379 to 0.63323, saving model to saved_models/weights.best.mlp_nullsonly.hdf5
Epoch 98/100

Epoch 00098: val_loss improved from 0.63323 to 0.63241, saving model to saved_models/weights.best.mlp_nullsonly.hdf5
Epoch 99/100

Epoch 00099: val_loss improved from 0.63241 to 0.63174, saving model to saved_models/weights.best.mlp_nullsonly.hdf5
Epoch 100/100

Epoch 00100: val_loss improved from 0.63174 to 0.63095, saving model to saved_models/weights.best.mlp_nullsonly.hdf5


<keras.callbacks.History at 0x1efbb8c4ba8>

In [346]:
# This loads the best weights from the model.
# This comes from Reference 6 in References.
mlp_nullsonly.load_weights('saved_models/weights.best.mlp_nullsonly.hdf5')

In [347]:
# This prints the accuracy of the model.
# This comes from Reference 9 in References.
score = mlp_nullsonly.evaluate(Xtest_nullsonly, Ytest_nullsonly, verbose=0)
accuracy = 100*score[1]
print('Test accuracy: %.4f%%' %accuracy)

Test accuracy: 71.8531%


In [348]:
# This prints the AUC score for the model.
Ypred_nullsonly= mlp_nullsonly.predict(Xtest_nullsonly)
mlp_nullsonly_ROC = roc_auc_score(Ytest_nullsonly, Ypred_nullsonly)
print("The AUC score for the model is %.4f." % mlp_nullsonly_ROC)

The AUC score for the model is 0.6277.


### MLP Model One Hot

In [318]:
# This comes from Reference 7 in References.
n_epochs = 100
size_of_batch = 50
stochastic = SGD(lr=0.001)
nad = Nadam()
RMS = RMSprop()

# This builds the original MLP model for year 1
# using Mattson and Steinart's original hyperparameters.
mlp_onehot = build_model(drop_rate=0.5,
                            l2_factor=0.001,
                             first_dense=64,
                             second_dense=32,
                             third_dense=16,
                            hidden_act='relu',
                            out_act='sigmoid',
                            x=Xtrain_onehot)

In [319]:
# This compiles the MLP model for the One Hot data.
# This comes from Reference 13 in References.
mlp_onehot.compile(loss='binary_crossentropy',
              optimizer= RMS,
              metrics=['accuracy'])

In [320]:
# This creates checkpoint, which uses ModelCheckpoint to store the
# best weights of the model.
# This comes from Reference 6 in References.
checkpoint = ModelCheckpoint(filepath='saved_models/weights.best.mlp_onehot.hdf5',
                             verbose=1, save_best_only=True)

In [321]:
# This fits the model and runs it for 100 epochs.
mlp_onehot.fit(Xtrain_onehot, Ytrain_onehot, validation_split=0.20,
                epochs=n_epochs, batch_size=size_of_batch, 
                callbacks = [checkpoint], verbose=1)

Train on 45858 samples, validate on 11465 samples
Epoch 1/100

Epoch 00001: val_loss improved from inf to 0.39939, saving model to saved_models/weights.best.mlp_onehot.hdf5
Epoch 2/100

Epoch 00002: val_loss improved from 0.39939 to 0.37504, saving model to saved_models/weights.best.mlp_onehot.hdf5
Epoch 3/100

Epoch 00003: val_loss improved from 0.37504 to 0.36887, saving model to saved_models/weights.best.mlp_onehot.hdf5
Epoch 4/100

Epoch 00004: val_loss improved from 0.36887 to 0.35693, saving model to saved_models/weights.best.mlp_onehot.hdf5
Epoch 5/100

Epoch 00005: val_loss improved from 0.35693 to 0.35077, saving model to saved_models/weights.best.mlp_onehot.hdf5
Epoch 6/100

Epoch 00006: val_loss improved from 0.35077 to 0.34978, saving model to saved_models/weights.best.mlp_onehot.hdf5
Epoch 7/100

Epoch 00007: val_loss did not improve from 0.34978
Epoch 8/100

Epoch 00008: val_loss did not improve from 0.34978
Epoch 9/100

Epoch 00009: val_loss did not improve from 0.34978



Epoch 00040: val_loss improved from 0.34459 to 0.34348, saving model to saved_models/weights.best.mlp_onehot.hdf5
Epoch 41/100

Epoch 00041: val_loss did not improve from 0.34348
Epoch 42/100

Epoch 00042: val_loss improved from 0.34348 to 0.34133, saving model to saved_models/weights.best.mlp_onehot.hdf5
Epoch 43/100

Epoch 00043: val_loss did not improve from 0.34133
Epoch 44/100

Epoch 00044: val_loss did not improve from 0.34133
Epoch 45/100

Epoch 00045: val_loss did not improve from 0.34133
Epoch 46/100

Epoch 00046: val_loss did not improve from 0.34133
Epoch 47/100

Epoch 00047: val_loss did not improve from 0.34133
Epoch 48/100

Epoch 00048: val_loss did not improve from 0.34133
Epoch 49/100

Epoch 00049: val_loss did not improve from 0.34133
Epoch 50/100

Epoch 00050: val_loss did not improve from 0.34133
Epoch 51/100

Epoch 00051: val_loss did not improve from 0.34133
Epoch 52/100

Epoch 00052: val_loss did not improve from 0.34133
Epoch 53/100

Epoch 00053: val_loss did no


Epoch 00082: val_loss did not improve from 0.33980
Epoch 83/100

Epoch 00083: val_loss did not improve from 0.33980
Epoch 84/100

Epoch 00084: val_loss did not improve from 0.33980
Epoch 85/100

Epoch 00085: val_loss did not improve from 0.33980
Epoch 86/100

Epoch 00086: val_loss did not improve from 0.33980
Epoch 87/100

Epoch 00087: val_loss did not improve from 0.33980
Epoch 88/100

Epoch 00088: val_loss improved from 0.33980 to 0.33933, saving model to saved_models/weights.best.mlp_onehot.hdf5
Epoch 89/100

Epoch 00089: val_loss did not improve from 0.33933
Epoch 90/100

Epoch 00090: val_loss did not improve from 0.33933
Epoch 91/100

Epoch 00091: val_loss did not improve from 0.33933
Epoch 92/100

Epoch 00092: val_loss did not improve from 0.33933
Epoch 93/100

Epoch 00093: val_loss did not improve from 0.33933
Epoch 94/100

Epoch 00094: val_loss did not improve from 0.33933
Epoch 95/100

Epoch 00095: val_loss did not improve from 0.33933
Epoch 96/100

Epoch 00096: val_loss did 

<keras.callbacks.History at 0x1efb9539630>

In [325]:
# This loads the best weights from the model.
# This comes from Reference 6 in References.
mlp_onehot.load_weights('saved_models/weights.best.mlp_onehot.hdf5')

In [326]:
# This prints the accuracy of the model.
# This comes from Reference 9 in References.
score = mlp_onehot.evaluate(Xtest_onehot, Ytest_onehot, verbose=0)
accuracy = 100*score[1]
print('Test accuracy: %.4f%%' %accuracy)

Test accuracy: 81.1686%


In [327]:
# This prints the AUC score for the model.
Ypred_onehot = mlp_onehot.predict(Xtest_onehot)
ROC_mlp_onehot = roc_auc_score(Ytest_onehot, Ypred_onehot)
print("The AUC score for the model is %.4f." % ROC_mlp_onehot)

The AUC score for the model is 0.7063.


### MLP Model Sum

In [335]:
# This comes from Reference 7 in References.
n_epochs = 100
size_of_batch = 50
stochastic = SGD(lr=0.001)
nad = Nadam()
RMS = RMSprop()

# This builds the original MLP model for year 1
# using Mattson and Steinart's original hyperparameters.
mlp_sum = build_model(drop_rate=0.5,
                            l2_factor=0.001,
                             first_dense=32,
                             second_dense=16,
                             third_dense=8,
                            hidden_act='relu',
                            out_act='sigmoid',
                            x=Xtrain_sum)

In [336]:
# This compiles the MLP model for the Sum data.
# This comes from Reference 13 in References.
mlp_sum.compile(loss='binary_crossentropy',
              optimizer= stochastic,
              metrics=['accuracy'])

In [337]:
# This creates checkpoint, which uses ModelCheckpoint to store the
# best weights of the model.
# This comes from Reference 6 in References.
checkpoint = ModelCheckpoint(filepath='saved_models/weights.best.mlp_sum.hdf5',
                             verbose=1, save_best_only=True)

In [338]:
# This fits the model and runs it for 100 epochs.
mlp_sum.fit(Xtrain_sum, Ytrain_sum, validation_split=0.20,
                epochs=n_epochs, batch_size=size_of_batch, 
                callbacks = [checkpoint], verbose=1)

Train on 45858 samples, validate on 11465 samples
Epoch 1/100

Epoch 00001: val_loss improved from inf to 0.76227, saving model to saved_models/weights.best.mlp_sum.hdf5
Epoch 2/100

Epoch 00002: val_loss improved from 0.76227 to 0.74232, saving model to saved_models/weights.best.mlp_sum.hdf5
Epoch 3/100

Epoch 00003: val_loss improved from 0.74232 to 0.73451, saving model to saved_models/weights.best.mlp_sum.hdf5
Epoch 4/100

Epoch 00004: val_loss improved from 0.73451 to 0.72772, saving model to saved_models/weights.best.mlp_sum.hdf5
Epoch 5/100

Epoch 00005: val_loss improved from 0.72772 to 0.72186, saving model to saved_models/weights.best.mlp_sum.hdf5
Epoch 6/100

Epoch 00006: val_loss improved from 0.72186 to 0.71789, saving model to saved_models/weights.best.mlp_sum.hdf5
Epoch 7/100

Epoch 00007: val_loss improved from 0.71789 to 0.71362, saving model to saved_models/weights.best.mlp_sum.hdf5
Epoch 8/100

Epoch 00008: val_loss improved from 0.71362 to 0.71099, saving model to s


Epoch 00033: val_loss improved from 0.66789 to 0.66726, saving model to saved_models/weights.best.mlp_sum.hdf5
Epoch 34/100

Epoch 00034: val_loss improved from 0.66726 to 0.66580, saving model to saved_models/weights.best.mlp_sum.hdf5
Epoch 35/100

Epoch 00035: val_loss improved from 0.66580 to 0.66482, saving model to saved_models/weights.best.mlp_sum.hdf5
Epoch 36/100

Epoch 00036: val_loss improved from 0.66482 to 0.66409, saving model to saved_models/weights.best.mlp_sum.hdf5
Epoch 37/100

Epoch 00037: val_loss improved from 0.66409 to 0.66336, saving model to saved_models/weights.best.mlp_sum.hdf5
Epoch 38/100

Epoch 00038: val_loss improved from 0.66336 to 0.66209, saving model to saved_models/weights.best.mlp_sum.hdf5
Epoch 39/100

Epoch 00039: val_loss improved from 0.66209 to 0.66107, saving model to saved_models/weights.best.mlp_sum.hdf5
Epoch 40/100

Epoch 00040: val_loss improved from 0.66107 to 0.66034, saving model to saved_models/weights.best.mlp_sum.hdf5
Epoch 41/100



Epoch 00066: val_loss improved from 0.63706 to 0.63613, saving model to saved_models/weights.best.mlp_sum.hdf5
Epoch 67/100

Epoch 00067: val_loss improved from 0.63613 to 0.63529, saving model to saved_models/weights.best.mlp_sum.hdf5
Epoch 68/100

Epoch 00068: val_loss improved from 0.63529 to 0.63457, saving model to saved_models/weights.best.mlp_sum.hdf5
Epoch 69/100

Epoch 00069: val_loss improved from 0.63457 to 0.63324, saving model to saved_models/weights.best.mlp_sum.hdf5
Epoch 70/100

Epoch 00070: val_loss improved from 0.63324 to 0.63220, saving model to saved_models/weights.best.mlp_sum.hdf5
Epoch 71/100

Epoch 00071: val_loss improved from 0.63220 to 0.63124, saving model to saved_models/weights.best.mlp_sum.hdf5
Epoch 72/100

Epoch 00072: val_loss improved from 0.63124 to 0.63060, saving model to saved_models/weights.best.mlp_sum.hdf5
Epoch 73/100

Epoch 00073: val_loss improved from 0.63060 to 0.63017, saving model to saved_models/weights.best.mlp_sum.hdf5
Epoch 74/100



Epoch 00099: val_loss improved from 0.61005 to 0.60893, saving model to saved_models/weights.best.mlp_sum.hdf5
Epoch 100/100

Epoch 00100: val_loss improved from 0.60893 to 0.60807, saving model to saved_models/weights.best.mlp_sum.hdf5


<keras.callbacks.History at 0x1efc981c5c0>

In [339]:
# This loads the best weights from the model.
# This comes from Reference 6 in References.
mlp_sum.load_weights('saved_models/weights.best.mlp_sum.hdf5')

In [340]:
# This prints the accuracy of the model.
# This comes from Reference 9 in References.
score = mlp_sum.evaluate(Xtest_sum, Ytest_sum, verbose=0)
accuracy = 100*score[1]
print('Test accuracy: %.4f%%' %accuracy)

Test accuracy: 61.6694%


In [341]:
# This prints the AUC score for the model.
Ypred_sum = mlp_sum.predict(Xtest_sum)
ROC_mlp_sum = roc_auc_score(Ytest_sum, Ypred_sum)
print("The AUC score for the model is %.4f." % ROC_mlp_sum)

The AUC score for the model is 0.6096.


<a id='references'></a>
## References

1. https://stackoverflow.com/questions/41032551/how-to-compute-receiving-operating-characteristic-roc-and-auc-in-keras
2. https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegressionCV.html
3. https://towardsdatascience.com/time-series-nested-cross-validation-76adba623eb9
4. https://dziganto.github.io/cross-validation/data%20science/machine%20learning/model%20tuning/python/Model-Tuning-with-Validation-and-Cross-Validation/
5. https://keras.io/getting-started/sequential-model-guide/
6. Udacity Machine Learning Engineer Nanodegree Program, Semester 2, Brian Campbell - Dog Breed Classifier Project
7. https://keras.io/getting-started/sequential-model-guide/
8. https://docs.scipy.org/doc/numpy/reference/generated/numpy.ma.size.html
9. https://keras.io/models/sequential/