# Capstone Project: The Efficacy of Multilayer Perceptron Algorithms in Predicting Bankruptcy

<ul>
<li><a href="#introduction">INTRODUCTION</a></li>
<li><a href="#Benchmark Logistic Regression">Benchmark Logistic Regression</a></li>
<li><a href="#MLP Model">MLP Model</a></li>
<li><a href="#assess">Data Assessment</a></li>
<li><a href="#cleaning">Data Cleaning</a></li>
<li><a href="#target">Separation of Target Variables</a></li>
<li><a href="#no nulls">No Nulls Data</a></li>
<li><a href="#one-hot null">Creation of One-hot Null Variable</a></li>
<li><a href="#Creation of Sum Null Variables">Creation of Sum Null Variables</a></li>
<li><a href="#PIVOT">PIVOT</a></li>
<li><a href="#Data Reorganization">Data Reorganization</a></li>
<li><a href="#Data Exploration - Descriptive Statistics">Data Exploration - Descriptive Statistics</a></li>
<li><a href="#Exploratory Visualization">Exploratory Visualization</a></li>
<li><a href="#Preprocessing">Preprocessing</a></li>
<li><a href="#Benchmark: Logistic Regression">Benchmark: Logistic Regression</a></li>
<li><a href="#originalMLP">Original MLP</a></li>
<li><a href="#conclusion">Conclusion</a></li> 
<li><a href="#references">References</a></li>
</ul>

<a id='introduction'></a>
## INTRODUCTION

<a id='Benchmark Logistic Regression'></a>
## Benchmark Logistic Regression

In [1]:
# This imports the necessary libraries for the logistic regression models.
from sklearn.linear_model import LogisticRegression

# This imports the AUC score for scoring the models.
# This comes from Reference 27 in References.
from sklearn.metrics import roc_auc_score

# These are libraries that will be needed to organize data,
# graph data, and change the working directory.
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
%matplotlib inline


### Load Data: No Nulls

In [27]:
# This loads the no_nulls X training and testing data
# from the CSVs and converts the data to np.arrays.
Xtrain_nonulls = pd.read_csv('no-peaking/Xtrain_nonulls.csv')
Xtrain_nonulls = np.array(Xtrain_nonulls)
Xtest_nonulls = pd.read_csv('no-peaking/Xtest_nonulls.csv')
Xtest_nonulls = np.array(Xtest_nonulls)

# This loads the no_nulls Y training and testing data
# from the CSVs. It also ravels the y data into a 1-D
# array, the shape scikit-learn expects for targets.
Ytrain_nonulls = pd.read_csv('no-peaking/Ytrain_nonulls.csv')
Ytrain_nonulls = np.array(Ytrain_nonulls)
Ytrain_nonulls = Ytrain_nonulls.ravel() 
Ytest_nonulls = pd.read_csv('no-peaking/Ytest_nonulls.csv')
Ytest_nonulls = np.array(Ytest_nonulls)
Ytest_nonulls = Ytest_nonulls.ravel()

### Performing Hold-out Cross-Validation on just the training set
K-fold Cross-Validation is a standard method for detecting and limiting overfitting in a logistic regression model. However, the testing set here is not a random sample: it is the fifth year of the dataset, chosen deliberately rather than at random (Reference 3). Because the dataset is a time series, applying K-fold Cross-Validation across all five years would place fifth-year observations in the training folds and leak information from the testing set into the model (Reference 3). To prevent this data leakage, Hold-out Cross-Validation is applied to the training set only (References 3 & 4). Hold-out Cross-Validation sets aside a percentage of the training set as a validation set, which is used to check the model's performance during the training stage. Like other forms of cross-validation, it is meant to reveal overfitting, and the poor performance that follows from it, before the fitted model is applied to the testing data (Reference 3).
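
As a minimal, dependency-free sketch of the hold-out idea described above: the toy `records`, the helper `holdout_split`, and the row layout are hypothetical and stand in for the real CSV data. The point is only that the validation rows are carved out of years 1 to 4, while the year 5 testing set is never shuffled or split.

```python
import random

# Hypothetical toy records: (year, row_id, bankrupt_flag).
# Years 1-4 form the training pool; year 5 is the fixed testing set.
records = [(year, i, i % 7 == 0) for year in range(1, 6) for i in range(20)]

train_pool = [r for r in records if r[0] < 5]
test_set = [r for r in records if r[0] == 5]


def holdout_split(rows, val_fraction=0.3, seed=13):
    """Shuffle a copy of rows and set aside val_fraction of them for validation."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_val = int(round(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]


# Only the training pool is split; the testing set is left untouched.
train_set, val_set = holdout_split(train_pool)
print(len(train_set), len(val_set), len(test_set))  # 56 24 20
```

The cells below perform the same kind of split with `train_test_split(..., test_size = 0.3, random_state = 13)`; the sketch simply makes explicit that no fifth-year row can end up in the training or validation data.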

### No Nulls Logistic Regression Benchmark

In [28]:
# This imports train_test_split from sklearn.
from sklearn.model_selection import train_test_split

In [29]:
# This creates the hold-out validation set for the logistic regression model
# for the No Nulls dataset.
# This comes from Reference 4 in References.
Xtrain_nonulls, Xval_nonulls, Ytrain_nonulls, Yval_nonulls = train_test_split(
                    Xtrain_nonulls, Ytrain_nonulls, test_size = 0.3, random_state = 13)

In [30]:
# This creates the logistic regression model for the No Nulls Dataset.
log_nonulls = LogisticRegression(penalty='l2', max_iter=1000)

# This fits the log_nonulls model with the training data.
log_nonulls.fit(Xtrain_nonulls,Ytrain_nonulls)

# This predicts the y values from the Xval dataset.
yval_pred_nonulls = log_nonulls.predict(Xval_nonulls)

# This returns the validation AUC score.
VAL_auc_nonulls = roc_auc_score(Yval_nonulls, yval_pred_nonulls)
VAL_auc_nonulls



0.8506022247116249
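
The validation score above is computed by passing the hard 0/1 labels returned by `predict` to `roc_auc_score`. ROC AUC is a ranking metric, so it is more commonly computed from continuous scores, for example the positive-class column of `predict_proba`. The sketch below is an illustration only: `pairwise_auc` is a hypothetical hand-written helper, and the labels and scores are made-up toy values, not the bankruptcy data. It shows that the two kinds of input can give different AUC values for the same model output.

```python
def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative count as half a correct pair.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example (assumed values): four healthy firms, two bankrupt firms.
y_true = [0, 0, 0, 0, 1, 1]
scores = [0.05, 0.10, 0.20, 0.45, 0.30, 0.60]  # stand-in for predicted probabilities

# Thresholding at 0.5 mimics what a hard-label prediction would return.
hard_labels = [1 if s >= 0.5 else 0 for s in scores]

auc_from_scores = pairwise_auc(y_true, scores)
auc_from_labels = pairwise_auc(y_true, hard_labels)
print(auc_from_scores)  # 0.875
print(auc_from_labels)  # 0.75
```

In the notebook's terms, the score-based variant would correspond to passing `log_nonulls.predict_proba(Xval_nonulls)[:, 1]` as the second argument of `roc_auc_score`. Doing so would change the reported numbers, so the scores in this section are best read as label-based AUC values.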

In [31]:
# This tests the model, built on the training data, with the
# testing data from year 5.
ytest_pred_nonulls = log_nonulls.predict(Xtest_nonulls)
TEST_auc_nonulls = roc_auc_score(Ytest_nonulls, ytest_pred_nonulls)
TEST_auc_nonulls = TEST_auc_nonulls * 100
print("The AUC score for the model is %.2f" % TEST_auc_nonulls, "%")

The AUC score for the model is 31.63 %


### Load Data: Nulls only

In [32]:
# This loads the Nulls only X training and testing data
# from the CSVs and converts the data to np.arrays.
Xtrain_nullsonly = pd.read_csv('no-peaking/Xtrain_sum.csv')
Xtrain_nullsonly = Xtrain_nullsonly.drop(
    Xtrain_nullsonly.columns[64], axis=1)
Xtrain_nullsonly = np.array(Xtrain_nullsonly)
Xtest_nullsonly = pd.read_csv('no-peaking/Xtest_sum.csv')
Xtest_nullsonly = Xtest_nullsonly.drop(
    Xtest_nullsonly.columns[64], axis=1)
Xtest_nullsonly = np.array(Xtest_nullsonly)

# This loads the Nulls only Y training and testing data
# from the CSVs. It also ravels the y data into a 1-D
# array, the shape scikit-learn expects for targets.
Ytrain_nullsonly = pd.read_csv('no-peaking/Ytrain_sum.csv')
Ytrain_nullsonly = np.array(Ytrain_nullsonly)
Ytrain_nullsonly = Ytrain_nullsonly.ravel() 
Ytest_nullsonly = pd.read_csv('no-peaking/Ytest_sum.csv')
Ytest_nullsonly = np.array(Ytest_nullsonly)
Ytest_nullsonly = Ytest_nullsonly.ravel()

### Nulls Only Logistic Regression Benchmark

In [33]:
# This creates the hold-out validation set for the logistic regression model
# for the Nulls Only dataset.
# This comes from Reference 4 in References.
Xtrain_nullsonly, Xval_nullsonly, Ytrain_nullsonly, Yval_nullsonly = train_test_split(
                    Xtrain_nullsonly, Ytrain_nullsonly, test_size = 0.3, random_state = 13)

In [34]:
# This creates the logistic regression model for the Nulls Only Dataset.
log_nullsonly = LogisticRegression(penalty='l2', max_iter=1000)

# This fits the log_nullsonly model with the training data.
log_nullsonly.fit(Xtrain_nullsonly,Ytrain_nullsonly)

# This predicts the y values from the Xval dataset.
yval_pred_nullsonly = log_nullsonly.predict(Xval_nullsonly)

# This returns the validation AUC score.
VAL_auc_nullsonly = roc_auc_score(Yval_nullsonly, yval_pred_nullsonly)
VAL_auc_nullsonly



0.7350826303765834

In [35]:
# This tests the model, built on the training data, with the
# testing data from year 5.
ytest_pred_nullsonly = log_nullsonly.predict(Xtest_nullsonly)
TEST_auc_nullsonly = roc_auc_score(Ytest_nullsonly, ytest_pred_nullsonly)
TEST_auc_nullsonly = TEST_auc_nullsonly * 100
print("The AUC score for the model is %.2f" % TEST_auc_nullsonly, "%")

The AUC score for the model is 68.29 %


### Load Data: Onehot

In [36]:
# This loads the one hot X training and testing data
# from the CSVs and converts the data to np.arrays.
Xtrain_onehot = pd.read_csv('no-peaking/Xtrain_onehot.csv')
Xtrain_onehot = np.array(Xtrain_onehot)
Xtest_onehot = pd.read_csv('no-peaking/Xtest_onehot.csv')
Xtest_onehot = np.array(Xtest_onehot)

# This loads the one hot Y training and testing data
# from the CSVs. It also ravels the y data into a 1-D
# array, the shape scikit-learn expects for targets.
Ytrain_onehot = pd.read_csv('no-peaking/Ytrain_onehot.csv')
Ytrain_onehot = np.array(Ytrain_onehot)
Ytrain_onehot = Ytrain_onehot.ravel() 
Ytest_onehot = pd.read_csv('no-peaking/Ytest_onehot.csv')
Ytest_onehot = np.array(Ytest_onehot)
Ytest_onehot = Ytest_onehot.ravel()

### One Hot Logistic Regression Benchmark

In [37]:
# This creates the hold-out validation set for the logistic regression model
# for the One Hot dataset.
# This comes from Reference 4 in References.
Xtrain_onehot, Xval_onehot, Ytrain_onehot, Yval_onehot = train_test_split(
                    Xtrain_onehot, Ytrain_onehot, test_size = 0.3, random_state = 13)

In [38]:
# This creates the logistic regression model for the One Hot Dataset.
log_onehot = LogisticRegression(penalty='l2', max_iter=1000)

# This fits the log_onehot model with the training data.
log_onehot.fit(Xtrain_onehot,Ytrain_onehot)

# This predicts the y values from the Xval dataset.
yval_pred_onehot = log_onehot.predict(Xval_onehot)

# This returns the validation AUC score.
VAL_auc_onehot = roc_auc_score(Yval_onehot, yval_pred_onehot)
VAL_auc_onehot



0.8647068328667264

In [39]:
# This tests the model, built on the training data, with the
# testing data from year 5.
ytest_pred_onehot = log_onehot.predict(Xtest_onehot)
TEST_auc_onehot = roc_auc_score(Ytest_onehot, ytest_pred_onehot)
TEST_auc_onehot = TEST_auc_onehot * 100
print("The AUC score for the model is %.2f" % TEST_auc_onehot, "%")

The AUC score for the model is 64.06 %


### Load Data: Sum

In [41]:
# This loads the SUM X training and testing data
# from the CSVs and converts the data to np.arrays.
Xtrain_sum = pd.read_csv('no-peaking/Xtrain_sum.csv')
Xtrain_sum = np.array(Xtrain_sum)
Xtest_sum = pd.read_csv('no-peaking/Xtest_sum.csv')
Xtest_sum = np.array(Xtest_sum)

# This loads the Sum Y training and testing data
# from the CSVs. It also ravels the y data into a 1-D
# array, the shape scikit-learn expects for targets.
Ytrain_sum = pd.read_csv('no-peaking/Ytrain_sum.csv')
Ytrain_sum = np.array(Ytrain_sum)
Ytrain_sum = Ytrain_sum.ravel() 
Ytest_sum = pd.read_csv('no-peaking/Ytest_sum.csv')
Ytest_sum = np.array(Ytest_sum)
Ytest_sum = Ytest_sum.ravel()

### Sum Logistic Regression Benchmark

In [42]:
# This creates the hold-out validation set for the logistic regression model
# for the Sum dataset.
# This comes from Reference 4 in References.
Xtrain_sum, Xval_sum, Ytrain_sum, Yval_sum = train_test_split(
                    Xtrain_sum, Ytrain_sum, test_size = 0.3, random_state = 13)

In [43]:
# This creates the logistic regression model for the Sum Dataset.
log_sum = LogisticRegression(penalty='l2', max_iter=1000)

# This fits the log_sum model with the training data.
log_sum.fit(Xtrain_sum,Ytrain_sum)

# This predicts the y values from the Xval dataset.
yval_pred_sum = log_sum.predict(Xval_sum)

# This returns the validation AUC score.
VAL_auc_sum = roc_auc_score(Yval_sum, yval_pred_sum)
VAL_auc_sum



0.7437902237728553

In [44]:
# This tests the model, built on the training data, with the
# testing data from year 5.
ytest_pred_sum = log_sum.predict(Xtest_sum)
TEST_auc_sum = roc_auc_score(Ytest_sum, ytest_pred_sum)
TEST_auc_sum = TEST_auc_sum * 100
print("The AUC score for the model is %.2f" % TEST_auc_sum, "%")

The AUC score for the model is 66.06 %



<a id='MLP Model'></a>
## MLP Model

<a id='references'></a>
## References

1. https://stackoverflow.com/questions/41032551/how-to-compute-receiving-operating-characteristic-roc-and-auc-in-keras
2. https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegressionCV.html
3. https://towardsdatascience.com/time-series-nested-cross-validation-76adba623eb9
4. https://dziganto.github.io/cross-validation/data%20science/machine%20learning/model%20tuning/python/Model-Tuning-with-Validation-and-Cross-Validation/