# Capstone 2

## Polish Bankruptcy

## Pre-processing and Training

In the pre-processing step I will impute the missing values, drop redundant high correlation variables from the dataset, complete a stratified train/test split, and investigate scaling the training data. Because much of the data is heavily skewed, it may also be necessary to transform the data after the initial models have been fit.

## Import Statements

In [1]:
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LinearRegression

## Load Data

In [2]:
#load bankruptcy data and column key data
bankruptcy_data = pd.read_csv('bankruptcy_data_comb.csv')
data_columns = pd.read_csv('column_key.csv')

#load high correlation dataframe
dataCorrhigh = pd.read_csv('dataCorrhigh.csv')

#display high correlation pairs dataframe
print(dataCorrhigh.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 111 entries, 0 to 110
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Feature1     111 non-null    object 
 1   Feature2     111 non-null    object 
 2   Correlation  111 non-null    float64
dtypes: float64(1), object(2)
memory usage: 2.7+ KB
None


## Impute missing data

Based on the distribution of the data and the number of outliers, I have decided to impute the missing values with the median of each column.
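To illustrate why the median is the safer choice when outliers are present, here is a minimal sketch on a toy dataframe. The column names and values are hypothetical and are not taken from the bankruptcy data.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values): column A has one extreme outlier,
# and each column has one missing value
toy = pd.DataFrame({
    "A": [1.0, 2.0, 3.0, np.nan, 1000.0],
    "B": [0.5, np.nan, 0.7, 0.6, 0.8],
})

# DataFrame.median() and DataFrame.mean() skip NaN by default;
# fillna with a Series fills each column using the matching label
toy_median = toy.fillna(toy.median())
toy_mean = toy.fillna(toy.mean())

# The median fill stays near the bulk of the data,
# while the mean fill is pulled up by the outlier
print(toy_median.loc[3, "A"])
print(toy_mean.loc[3, "A"])
```

On this toy column the median fill is 2.5 while the mean fill is 251.5, which is far from every typical value in the column.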

In [3]:
#create new dataframe by imputing missing values with the median
bankruptcy_complete = bankruptcy_data.apply(lambda x: x.fillna(x.median()),axis=0)

#display complete dataframe
bankruptcy_complete.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 43405 entries, 0 to 43404
Data columns (total 66 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   X01     43405 non-null  float64
 1   X02     43405 non-null  float64
 2   X03     43405 non-null  float64
 3   X04     43405 non-null  float64
 4   X05     43405 non-null  float64
 5   X06     43405 non-null  float64
 6   X07     43405 non-null  float64
 7   X08     43405 non-null  float64
 8   X09     43405 non-null  float64
 9   X10     43405 non-null  float64
 10  X11     43405 non-null  float64
 11  X12     43405 non-null  float64
 12  X13     43405 non-null  float64
 13  X14     43405 non-null  float64
 14  X15     43405 non-null  float64
 15  X16     43405 non-null  float64
 16  X17     43405 non-null  float64
 17  X18     43405 non-null  float64
 18  X19     43405 non-null  float64
 19  X20     43405 non-null  float64
 20  X21     43405 non-null  float64
 21  X22     43405 non-null  float64
 22

## Drop redundant high correlation columns


In [4]:
# drop the first row of the correlation dataframe: its correlation is 1, so the two features are identical
dataCorrhigh = dataCorrhigh.drop(dataCorrhigh.index[0]).reset_index(drop=True)

# keep a row only when the next row has a different Feature1, i.e. the last row
# of each consecutive run; the final row is compared with NaN and is always kept
# (DataFrame.append was removed in pandas 2.0, so a boolean mask is used instead)
is_last_in_run = dataCorrhigh['Feature1'] != dataCorrhigh['Feature1'].shift(-1)

# build the final high correlation dataframe and reset its index
corr_high_final = dataCorrhigh[is_last_in_run].reset_index(drop=True)

print(corr_high_final)

   Feature1 Feature2  Correlation
0       X56      X58     0.999976
1       X04      X46     0.999920
2       X20      X56     0.999880
3       X17      X08     0.999588
4       X23      X19     0.999290
..      ...      ...          ...
90      X07      X63     0.709315
91      X14      X63     0.709315
92      X44      X13     0.708622
93      X13      X19     0.706408
94      X63      X18     0.703221

[95 rows x 3 columns]
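The filtering rule used above (keep a row only when the next row has a different `Feature1`, and always keep the last row) can be checked on a small table. This is a minimal sketch that expresses the rule as a vectorised comparison against the next row; the feature names `A` to `E` and the correlation values are hypothetical.

```python
import pandas as pd

# Toy pair table (hypothetical names and values), already sorted by correlation
pairs = pd.DataFrame({
    "Feature1": ["A", "A", "B", "C", "C", "C"],
    "Feature2": ["B", "C", "D", "A", "D", "E"],
    "Correlation": [0.99, 0.98, 0.97, 0.96, 0.95, 0.94],
})

# shift(-1) lines each row up with the Feature1 of the row below it;
# the last row is compared with NaN, so the comparison is True and it is kept
keep = pairs["Feature1"] != pairs["Feature1"].shift(-1)
filtered = pairs[keep].reset_index(drop=True)

print(filtered)
```

Only the last row of each consecutive run of `Feature1` survives, so the toy table shrinks from six rows to three.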


In [5]:
# remove every pair that involves the redundant feature X07
has_x07 = (corr_high_final['Feature1'] == 'X07') | (corr_high_final['Feature2'] == 'X07')
corr_high_final = corr_high_final[~has_x07].reset_index(drop=True)
print(corr_high_final)

   Feature1 Feature2  Correlation
0       X56      X58     0.999976
1       X04      X46     0.999920
2       X20      X56     0.999880
3       X17      X08     0.999588
4       X23      X19     0.999290
..      ...      ...          ...
84      X13      X43     0.717323
85      X14      X63     0.709315
86      X44      X13     0.708622
87      X13      X19     0.706408
88      X63      X18     0.703221

[89 rows x 3 columns]


## Create final feature array 

In [6]:
# get unique feature array
feature_df = np.unique(corr_high_final[['Feature1', 'Feature2']].values)
feature_df = feature_df[feature_df != 'X07']
print(feature_df)

['X02' 'X03' 'X04' 'X06' 'X08' 'X09' 'X10' 'X11' 'X12' 'X13' 'X14' 'X16'
 'X17' 'X18' 'X19' 'X20' 'X22' 'X23' 'X24' 'X25' 'X26' 'X30' 'X31' 'X32'
 'X33' 'X34' 'X35' 'X36' 'X38' 'X43' 'X44' 'X46' 'X47' 'X48' 'X49' 'X50'
 'X51' 'X52' 'X53' 'X54' 'X56' 'X58' 'X62' 'X63' 'X64']


## Remove non-correlated columns from bankruptcy data

In [7]:
#create new dataframe with high correlation features
bankruptcy_data_final = bankruptcy_complete[feature_df].copy()

# add back in the year and class features
bankruptcy_data_final = pd.concat([bankruptcy_data_final,bankruptcy_complete['Year']], axis = 1)
bankruptcy_data_final = pd.concat([bankruptcy_data_final,bankruptcy_complete['Class']], axis = 1)
print(bankruptcy_data_final.head(20))

         X02       X03      X04       X06       X08      X09      X10  \
0   0.379510  0.396410   2.0472  0.388250   1.33050  1.13890  0.50494   
1   0.499880  0.472250   1.9447  0.000000   0.99601  1.69960  0.49788   
2   0.695920  0.267130   1.5548  0.000000   0.43695  1.30900  0.30408   
3   0.307340  0.458790   2.4928  0.149880   1.86610  1.05710  0.57353   
4   0.613230  0.229600   1.4063  0.187320   0.63070  1.15590  0.38677   
5   0.497940  0.359690   1.7502  0.000000   1.00830  1.97860  0.50206   
6   0.647440  0.289710   1.4705  0.000000   0.54454  1.73480  0.35256   
7   0.027059  0.705540  53.9540  0.000000  35.95700  0.65273  0.97294   
8   0.632020  0.053735   1.1263  0.000000   0.58223  1.33320  0.36798   
9   0.838370  0.142040   1.1694  0.000000   0.19279  2.11560  0.16163   
10  0.443550  0.188350   1.4400 -0.931900   1.25450  4.74470  0.55645   
11  0.111480  0.119890   2.0754 -0.084883   7.67410  0.90732  0.85551   
12  0.349940  0.611470   3.0243  0.559830   1.85770

## Drop column variables from column key dataframe

## Train test split


In [8]:
# stratified 75/25 split so both sets keep the same share of each class
X_train, X_test, y_train, y_test = train_test_split(
    bankruptcy_data_final.drop(columns='Class'),
    bankruptcy_data_final['Class'],
    test_size=0.25,
    random_state=5,
    stratify=bankruptcy_data_final['Class'],
)

In [9]:
X_train.shape, X_test.shape

((32553, 46), (10852, 46))

In [10]:
y_train.shape, y_test.shape

((32553,), (10852,))
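Passing `stratify` keeps the class proportions the same in the training and test sets, which matters when one class is rare. The sketch below checks this on synthetic data; the 4% positive rate, the feature names, and the variable names are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced data (hypothetical): 40 positives out of 1000 rows
rng = np.random.default_rng(0)
X_toy = pd.DataFrame(rng.normal(size=(1000, 3)), columns=["f1", "f2", "f3"])
y_toy = pd.Series([1] * 40 + [0] * 960, name="Class")

# Same split settings as the cell above, applied to the toy data
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=5, stratify=y_toy
)

# The positive rate should match in both halves of the split
print(y_tr.mean(), y_te.mean())
```

Running the same check on `y_train.mean()` and `y_test.mean()` is a quick way to confirm that the real split preserved the bankruptcy rate.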

In [11]:
print(X_train.columns)

Index(['X02', 'X03', 'X04', 'X06', 'X08', 'X09', 'X10', 'X11', 'X12', 'X13',
       'X14', 'X16', 'X17', 'X18', 'X19', 'X20', 'X22', 'X23', 'X24', 'X25',
       'X26', 'X30', 'X31', 'X32', 'X33', 'X34', 'X35', 'X36', 'X38', 'X43',
       'X44', 'X46', 'X47', 'X48', 'X49', 'X50', 'X51', 'X52', 'X53', 'X54',
       'X56', 'X58', 'X62', 'X63', 'X64', 'Year'],
      dtype='object')


In [12]:
#drop year column

In [13]:
lm = LinearRegression().fit(X_train, y_train)

In [14]:
y_tr_pred = lm.predict(X_train)
y_te_pred = lm.predict(X_test)

In [15]:
print(y_tr_pred)

[0.05032518 0.03583252 0.04246837 ... 0.06539626 0.03843153 0.05183285]


In [16]:
print(y_te_pred)

[0.0525168  0.03670107 0.03202197 ... 0.06805883 0.06522195 0.05621996]


In [17]:
#from sklearn.preprocessing import StandardScaler
#scaler = StandardScaler()
#scaler.fit(X_train)
#X_tr_scaled = scaler.transform(X_train)
#X_te_scaled = scaler.transform(X_test)

In [18]:
#y_tr_pred = lm.predict(X_tr_scaled)
#y_te_pred = lm.predict(X_te_scaled)
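The commented-out cells above outline the scaling step. As a minimal sketch of that idea, assuming `StandardScaler` from scikit-learn and synthetic skewed data (the array names are hypothetical), the scaler is fit on the training data only and then applied unchanged to the test data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic right-skewed features (hypothetical stand-ins for the training and test sets)
rng = np.random.default_rng(0)
X_tr_toy = rng.lognormal(mean=0.0, sigma=1.0, size=(200, 4))
X_te_toy = rng.lognormal(mean=0.0, sigma=1.0, size=(50, 4))

# Learn the mean and standard deviation from the training data only
scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr_toy)

# Reuse the training statistics on the test data to avoid leaking test information
X_te_scaled = scaler.transform(X_te_toy)

# Training columns now have mean ~0 and standard deviation ~1
print(X_tr_scaled.mean(axis=0).round(6))
print(X_tr_scaled.std(axis=0).round(6))
```

Standardising changes the location and spread of each column but not its shape, so heavily skewed features stay skewed after this step. A model also needs to be refit on the scaled training data before predicting on scaled inputs.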

In [19]:
#clf = LogisticRegression()

#clf.fit(X_train, y_train) 

#ypredict_test = clf.predict(X_test)

#ypredict_train = clf.predict(X_train)
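The commented-out cell above points toward a logistic regression classifier, which predicts class labels directly rather than continuous values. The following is a minimal sketch on synthetic, linearly separable data; the data, the variable names, and the default `LogisticRegression` settings are assumptions and not results from the bankruptcy dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic binary problem (hypothetical): the label depends on the sum of two features
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(500, 2))
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=5, stratify=y_toy
)

# Fit on the training split, then predict labels for both splits
clf = LogisticRegression()
clf.fit(X_tr, y_tr)
ypredict_train = clf.predict(X_tr)
ypredict_test = clf.predict(X_te)

print(accuracy_score(y_tr, ypredict_train), accuracy_score(y_te, ypredict_test))
```

On a strongly imbalanced target, accuracy alone can look good even when the rare class is never predicted, so it is worth inspecting per-class metrics as well.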
