## IMPORT LIBRARY

Import the required libraries: NumPy, pandas, Matplotlib, seaborn, scikit-learn, TensorFlow/Keras, imbalanced-learn, and others.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import xgboost as xgb
import joblib
import tensorflow as tf

from sklearn.model_selection import train_test_split, cross_validate, cross_val_score, GridSearchCV
from sklearn.preprocessing import RobustScaler, LabelEncoder, StandardScaler, MinMaxScaler
from sklearn.metrics import (accuracy_score, roc_curve, auc, precision_score, recall_score,
                             f1_score, confusion_matrix, classification_report,
                             jaccard_score, log_loss, mean_squared_error)
# Keras is imported through TensorFlow only, so `keras` always refers to tf.keras.
# The unused imports `keras.utils.np_utils` and `keras.wrappers.scikit_learn` were
# dropped because they are no longer available in recent Keras releases.
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.regularizers import l2
from tensorflow.keras.optimizers import SGD
from imblearn.over_sampling import SMOTE

## IMPORT DATASET

Load the dataset from a CSV file into the "df" variable, then display the first 5 rows using .head().



In [None]:
# Mount Google Drive first so the CSV path below is reachable
from google.colab import drive
drive.mount('/content/drive')

In [None]:
df = pd.read_csv("/content/drive/MyDrive/Kaggle/company_bankcruptcy.csv")
df.head()

## Exploratory Data Analysis (EDA)

In the early stages of EDA, start by inspecting the dataset with info(), which shows the number of non-null values and the data type of each column. Because the goal is to predict whether a company will go bankrupt, the model needs numeric data (int and float). The output also shows that the dataset has 95 features and 1 target. Next, describe() gives summary statistics such as the mean and standard deviation of each column.

In [None]:
df.info()
df.describe()

Check whether the dataset contains null values or duplicate rows.

In [None]:
print("=========== Any null values in dataset? ==================")
print(df.isnull().values.any())
print("=========== Duplicate rows in dataset ==================")
df[df.duplicated()]

Because this is a classification problem, the balance of the target classes must be checked. value_counts() shows the number of rows in each class of the target column.

In [None]:
df['Bankrupt?'].value_counts()
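
To put the class counts in perspective, here is a small arithmetic sketch. It uses the counts reported for this dataset (6599 rows in one class and 220 in the other) and assumes that class 1 is the bankrupt minority class.

```python
# Class counts as reported for this dataset.
# Assumption: 0 = not bankrupt (majority), 1 = bankrupt (minority).
counts = {0: 6599, 1: 220}

total = sum(counts.values())
minority_share = counts[1] / total
imbalance_ratio = counts[0] / counts[1]

print(f"total rows: {total}")
print(f"minority share: {minority_share:.2%}")
print(f"majority:minority ratio: {imbalance_ratio:.1f}:1")
```

A minority share this small means a model that always predicts class 0 would already score high accuracy, which is why the imbalance is handled before training.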

## Preprocessing Data

The first stage is to address the class imbalance. The EDA shows that the target column has 6599 rows in class 0 and 220 rows in class 1. With this distribution I use SMOTE, because undersampling would discard more than 6300 rows.

In [None]:
X = df.drop('Bankrupt?', axis=1).reset_index(drop=True)
y = df['Bankrupt?'].reset_index(drop=True)

In [None]:
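# Oversample the minority class with SMOTE
sm = SMOTE(random_state=2)
X_smote, y_smote = sm.fit_resample(X, y)

# Rebuild df from the resampled features and target
# (the target column is now named 'Bankrupt', without the question mark)
df = X_smote.copy()
df['Bankrupt'] = y_smote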

In [None]:
df['Bankrupt'].value_counts()
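
The idea behind SMOTE can be sketched in a few lines: a synthetic sample is placed on the line segment between a minority sample and one of its nearest minority neighbours. The code below is a simplified illustration, not the imbalanced-learn implementation. It uses only the single nearest neighbour, and the sample values and the helper `synthesize_one` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical minority-class samples with two features each.
minority = np.array([[1.0, 2.0], [1.5, 1.8], [3.0, 3.5], [2.8, 3.1]])

def synthesize_one(samples, rng):
    """Create one synthetic point between a random sample and its nearest neighbour."""
    i = rng.integers(len(samples))
    base = samples[i]
    dists = np.linalg.norm(samples - base, axis=1)
    dists[i] = np.inf  # exclude the sample itself from the neighbour search
    neighbour = samples[np.argmin(dists)]
    lam = rng.random()  # interpolation weight in [0, 1)
    return base + lam * (neighbour - base), base, neighbour

new_point, base, neighbour = synthesize_one(minority, rng)
print("base:", base, "neighbour:", neighbour, "synthetic:", new_point)
```

Because each synthetic row is an interpolation of existing rows, it adds no null values, which is consistent with the optional re-check that follows.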

Then I check again for null values and duplicate rows. This step is optional.

In [None]:
print("=========== Sum null of dataset================== ")
print(df.isnull().values.any())
print("=========== Sum Duplicate of dataset================== ")
df[df.duplicated()]

Next, look at the correlation between the columns. In the heatmap below, the point of interest is the correlation between the Bankrupt column and each of the other columns. According to the plot, the feature with the highest correlation is Borrowing Dependency.



In [None]:
corr = df.corr()[['Bankrupt']].sort_values(by='Bankrupt', ascending=False)
sns.heatmap(corr, annot=True)

Then look at the distribution of each column using boxplots. A boxplot also shows whether a column has outliers. If it does, the outliers must be handled.



In [None]:
for column in df:
    plt.figure()
    df.boxplot([column])

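The following code handles the outliers by capping them: any value more than 1.5 × IQR beyond the first or third quartile is replaced with the corresponding fence value, so no rows are deleted.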

In [None]:
for i in df.columns:
    Q1 = df[i].quantile(0.25)
    Q3 = df[i].quantile(0.75)
    IQR = Q3 - Q1
    df[i] = np.where(df[i]>(Q3+1.5*IQR),(Q3+1.5*IQR),df[i])
    df[i] = np.where(df[i]<(Q1-1.5*IQR),(Q1-1.5*IQR),df[i])
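
The capping rule above can be checked on a small array. This is a minimal sketch with made-up values. It uses `np.clip`, which has the same effect as the two `np.where` calls.

```python
import numpy as np

# Hypothetical column with one obvious outlier (100.0)
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 100.0])

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Values outside the fences are replaced by the fence value itself
capped = np.clip(values, lower_fence, upper_fence)

print("fences:", lower_fence, upper_fence)
print("capped:", capped)
```

Here the quartiles are 3.0 and 7.0, so the fences are -3.0 and 13.0, and only the value 100.0 changes (to 13.0).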

Check again whether the outliers have been capped (uncomment the cell below to redraw the boxplots).

In [None]:
# for column in df:
#     plt.figure()
#     df.boxplot([column])

After that, also look at the distribution of data using histogram plotting.

In [None]:
df.hist(bins=50, figsize=(20,15))
plt.show()

Once enough is known about the dataset, the next step is to separate the features from the target. This is needed because the next stage is scaling, and scaling is applied only to the features.

In [None]:
X = df.drop('Bankrupt', axis=1).reset_index(drop=True)
y = df['Bankrupt'].reset_index(drop=True)

At this stage, scale the features with StandardScaler, which transforms each feature to have mean 0 and variance 1.

In [None]:
# Named `scaled` rather than `tf` so the TensorFlow import `tf` is not overwritten
scaled = StandardScaler().fit_transform(X)
scaledf = pd.DataFrame(scaled, columns=X.columns)
scaledf['Bankrupt'] = y
scaledf
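
As a sanity check on what standardization does, the same transform can be written by hand as z = (x - mean) / std per column. The sketch below uses a tiny made-up matrix and the population standard deviation (`ddof=0`), which is what StandardScaler uses.

```python
import numpy as np

# Hypothetical feature matrix: 3 rows, 2 features on very different scales
X_demo = np.array([[1.0, 10.0],
                   [2.0, 20.0],
                   [3.0, 60.0]])

mean = X_demo.mean(axis=0)
std = X_demo.std(axis=0)      # population standard deviation (ddof=0)
Z = (X_demo - mean) / std     # standardized features

print("column means after scaling:", Z.mean(axis=0))
print("column stds after scaling: ", Z.std(axis=0))
```

Note that pandas `describe()` reports the sample standard deviation (`ddof=1`), so the values it shows for scaled columns will be very close to 1 rather than exactly 1.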

Check that the mean of each column in the scaled DataFrame is close to 0 and the standard deviation is close to 1.

In [None]:
print(scaledf.isnull().values.any())
scaledf.describe()

The scaled data is then divided into 4 parts: X_train, X_test, y_train, and y_test. The split is 80% training data and 20% test data.


In [None]:
xx = scaledf.drop('Bankrupt', axis=1)
yy = scaledf['Bankrupt']
X_train, X_test, y_train, y_test = train_test_split(xx, yy, test_size=0.2, random_state=42)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)
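
The expected split sizes can be worked out by hand. This sketch assumes that SMOTE, with its default settings, raised the minority class to the majority count of 6599 rows and that no rows were dropped afterwards. Under that assumption the test set size is the ceiling of 20% of the rows.

```python
import math

n_samples = 2 * 6599   # assumption: both classes have 6599 rows after SMOTE
test_size = 0.2

n_test = math.ceil(test_size * n_samples)   # test set size is rounded up
n_train = n_samples - n_test

print("train rows:", n_train)
print("test rows: ", n_test)
```

If the shapes printed by the cell above differ from these numbers, the assumption about the resampled row count does not hold for that run.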

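Create a validation set by slicing the last 2640 rows of the training data. Note that these rows also remain in X_train, so the validation metrics are computed on data the model has already trained on.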

In [None]:
x_val = X_train[-2640:]
y_val = y_train[-2640:]
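
If an independent validation set is wanted, one option is to remove the validation rows from the data used for fitting. The sketch below shows the slicing pattern on small made-up arrays. The names `X_demo`, `y_demo`, `X_fit`, and `y_fit` are hypothetical and not part of the notebook.

```python
import numpy as np

# Hypothetical stand-ins for X_train and y_train
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

n_val = 3  # number of rows held out for validation

# Everything except the last n_val rows is used for fitting;
# the last n_val rows are used only for validation.
X_fit, X_val_demo = X_demo[:-n_val], X_demo[-n_val:]
y_fit, y_val_demo = y_demo[:-n_val], y_demo[-n_val:]

print(X_fit.shape, X_val_demo.shape)
```

With this pattern the model would be fitted on `X_fit` and `y_fit`, so no validation row is seen during training.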

## Modelling

Create a model with 3 Dense layers. The first layer has 95 units with ReLU activation and receives the 95 features as input (input_dim=95); its width was chosen to match the number of features. The scaled data contains negative values, and ReLU passes positive pre-activations through unchanged while setting negative ones to 0. The second layer is a hidden layer with 2 units and ReLU activation. The output layer has 1 unit with the sigmoid activation function. The size of the output layer depends on the classification task: the target column has 2 classes, so this is binary classification, and a single sigmoid unit outputs the probability of the positive class.

In [None]:
model = keras.Sequential()
model.add(Dense(units = 95, activation='relu', input_dim= 95)) #input layer
model.add(Dense(units = 2, activation='relu')) #hidden layer
model.add(Dense(units = 1, activation='sigmoid')) # output layer

Here is a summary of the model that has been built.

In [None]:
model.summary()
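
The parameter counts in the summary can be reproduced by hand. A Dense layer has one weight per input-unit pair plus one bias per unit. The helper `dense_params` below is a hypothetical name used only for this sketch.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

# (inputs, units) for the three Dense layers defined above
layer_shapes = [(95, 95), (95, 2), (2, 1)]

per_layer = [dense_params(n_in, n_out) for n_in, n_out in layer_shapes]
total_params = sum(per_layer)

print("parameters per layer:", per_layer)
print("total parameters:", total_params)
```

Almost all of the parameters sit in the first layer; the 2-unit hidden layer is a narrow bottleneck before the output.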

The next stage is compiling the model. The optimizer is Adam with a learning rate of 0.0001. The loss is binary cross-entropy, the usual choice for two-class classification with a sigmoid output. The metric is accuracy.

In [None]:
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.0001),
                  loss=tf.keras.losses.BinaryCrossentropy(from_logits=False),
                  metrics=['accuracy'])
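
For intuition, binary cross-entropy averages -(y·log(p) + (1-y)·log(1-p)) over the samples. The NumPy sketch below is an illustration of that formula, not the Keras implementation. The function name and the clipping constant `eps` are assumptions made for this example.

```python
import numpy as np

def binary_crossentropy(y_true, p_pred, eps=1e-7):
    """Mean binary cross-entropy for labels in {0, 1} and predicted probabilities."""
    p = np.clip(p_pred, eps, 1 - eps)  # avoid log(0)
    return float(np.mean(-(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))))

y_true_demo = np.array([1.0, 0.0, 1.0, 0.0])
confident = np.array([0.9, 0.1, 0.8, 0.2])   # close to the true labels
unsure = np.array([0.6, 0.4, 0.5, 0.5])      # close to 0.5

loss_confident = binary_crossentropy(y_true_demo, confident)
loss_unsure = binary_crossentropy(y_true_demo, unsure)

print("confident predictions:", loss_confident)
print("unsure predictions:   ", loss_unsure)
```

Predictions that are both correct and confident give a lower loss, and a prediction of exactly 0.5 costs log(2) regardless of the label.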

The model is then fitted with a batch size of 16 for 30 epochs. The result is a loss of 0.03, accuracy of 0.99, validation loss of 0.02, and validation accuracy of 0.99.



In [None]:
history = model.fit(
    X_train,
    y_train,
    batch_size=16,
    epochs=30,
    validation_data=(x_val, y_val),
)


## Model Evaluation

Plot the loss and accuracy of the model for the training and validation data.

In [None]:
# Loss per epoch
plt.figure(figsize=(10,4))
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc='upper left')
plt.grid(True)
plt.show()

# Accuracy per epoch
plt.figure(figsize=(10,4))
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc='upper left')
plt.grid(True)
plt.show()

Get the predicted classes by rounding the predicted probabilities with np.round.

In [None]:
kelas = np.round(model.predict(X_test),0)
hasil_prediksi = np.asarray(kelas, dtype = 'int')
print(hasil_prediksi)
ypred = hasil_prediksi
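
One detail worth knowing is that `np.round` rounds values exactly halfway between two integers to the nearest even integer, so a probability of exactly 0.5 becomes 0. The sketch below compares rounding with an explicit threshold on made-up probabilities; the threshold of 0.5 is an assumption for illustration.

```python
import numpy as np

# Hypothetical sigmoid outputs with the same (n, 1) shape that model.predict returns
probs = np.array([[0.02], [0.5], [0.51], [0.97]])

rounded = np.round(probs).astype(int).ravel()      # 0.5 is rounded to 0
thresholded = (probs >= 0.5).astype(int).ravel()   # 0.5 is assigned to class 1

print("np.round:     ", rounded)
print("threshold 0.5:", thresholded)
```

An explicit comparison makes the decision rule visible and lets the threshold be changed if a different trade-off between the two classes is wanted.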

The following is the confusion matrix of the model

In [None]:
conf_matrix = confusion_matrix(y_true=y_test, y_pred=ypred)
fig, ax = plt.subplots(figsize=(7.5, 7.5))
ax.matshow(conf_matrix, cmap=plt.cm.Blues, alpha=0.3)
# Write the count into each cell of the matrix
for i in range(conf_matrix.shape[0]):
    for j in range(conf_matrix.shape[1]):
        ax.text(x=j, y=i, s=conf_matrix[i, j], va='center', ha='center', size='xx-large')
plt.xlabel('Predictions', fontsize=18)
plt.ylabel('Actuals', fontsize=18)
plt.title('Confusion Matrix', fontsize=18)
plt.show()
# classification_report expects the true labels first, then the predictions
report = classification_report(y_test, ypred)
print(report)
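
The numbers in the classification report follow directly from the four cells of the confusion matrix. The sketch below shows the arithmetic for the positive class using made-up counts, not the results of this model.

```python
# Hypothetical confusion-matrix counts (not this model's results)
tn, fp, fn, tp = 50, 10, 5, 35

precision = tp / (tp + fp)   # share of predicted positives that are correct
recall = tp / (tp + fn)      # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} accuracy={accuracy:.3f}")
```

In scikit-learn's confusion matrix, rows are the actual classes and columns are the predicted classes, so for a binary problem `conf_matrix.ravel()` yields the counts in the order tn, fp, fn, tp.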