Bankruptcy is the inability of a company to pay its debts to its creditors. The bankruptcy of a company, and even the possibility of it, matters to the company's investors and to society. Bankruptcy should therefore be predicted before it happens, using appropriate models. In this part of the project, we build machine learning models that predict whether a company will go bankrupt from its financial statements and financial ratios.

# INTRODUCTION

The data used in the bankruptcy prediction model covers more than 6800 companies. Each company is labelled 1 (went bankrupt) or 0 (did not go bankrupt), and we try to predict this label from 95 financial ratios.

95 features (X1-X95)
Our goal is to use these features to get a clearer picture of the future and financial health of the companies.

In [1]:
import numpy as np
import pandas as pd 

In [3]:
data_bankruptcy = pd.read_csv("../input/company-bankruptcy-prediction/data.csv")
pd.set_option('display.max_columns', None)
data_bankruptcy.head()

The dataset contains 6819 rows and 96 columns (the target column plus 95 features).

In [4]:
data_bankruptcy.info()

In [5]:
data_bankruptcy["Bankrupt?"].value_counts()

In [6]:
import matplotlib.pyplot as plt
import seaborn as sns

In [7]:
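# Pass the column as the keyword argument x; recent seaborn versions no longer accept it positionally
plot_1 = sns.countplot(x='Bankrupt?', data=data_bankruptcy, palette="Set2")
plt.title('Count values of Bankrupt?')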

for container in plot_1.containers:
    plot_1.bar_label(container)

In [48]:
plt.figure(figsize=(17,17))
sns.heatmap(data_bankruptcy.corr(), annot=False, cmap='coolwarm')
plt.show()

In [50]:
sns.displot(data_bankruptcy[' ROA(A) before interest and % after tax'])
sns.displot(data_bankruptcy[' ROA(B) before interest and depreciation after tax'])

In [21]:
# Checking labels distributions

sns.set_theme(context = 'talk', style='darkgrid', palette='deep', font='sans-serif', font_scale = 0.8, rc={"grid.linewidth": 4})

plt.figure(figsize = (16,9))
sns.countplot(x='Bankrupt?', data=data_bankruptcy)
plt.title('Class Distributions \n (0: Did not go bankrupt || 1: Went bankrupt)', fontsize=16)
plt.show()

In [51]:
sns.countplot(x='Bankrupt?', data=data_bankruptcy)

In [53]:
sns.countplot(x = ' Liability-Assets Flag',hue = 'Bankrupt?',data = data_bankruptcy)

In [25]:
numeric_features = data_bankruptcy.dtypes[data_bankruptcy.dtypes != 'int64'].index
categorical_features = data_bankruptcy.dtypes[data_bankruptcy.dtypes == 'int64'].index

data_bankruptcy[categorical_features].columns.tolist()

In [27]:
positive_corr = data_bankruptcy[numeric_features].corrwith(data_bankruptcy["Bankrupt?"]).sort_values(ascending=False)[:6].index.tolist()
negative_corr = data_bankruptcy[numeric_features].corrwith(data_bankruptcy["Bankrupt?"]).sort_values()[:6].index.tolist()

positive_corr = data_bankruptcy[positive_corr + ["Bankrupt?"]].copy()
negative_corr = data_bankruptcy[negative_corr + ["Bankrupt?"]].copy()

In [35]:
def corrbargraph(x_value, y_value):
    """Draw a 2x3 grid of bar plots, one per feature in y_value, grouped by x_value."""
    plt.figure(figsize=(15,8))

    # y_value is expected to hold six feature names, one per subplot
    for i in range(1,7):
        plt.subplot(2,3,i)
        sns.barplot(x = x_value, y = y_value[i-1],data = data_bankruptcy)

    plt.tight_layout(pad=0.5)

In [36]:
x_value = positive_corr.columns.tolist()[-1]
y_value = positive_corr.columns.tolist()[:-1]

corrbargraph(x_value, y_value)

# We see that three attributes - "Debt Ratio %, Current Liability To Assets, Current Liability To Current Assets" are commonly high in bankrupt organizations.

In [37]:
x_value = negative_corr.columns.tolist()[-1]
y_value = negative_corr.columns.tolist()[:-1]

corrbargraph(x_value, y_value)

In [41]:
data_bankruptcy.columns

In [45]:
plt.figure(figsize=(10,3))

plt.suptitle("Correlation Between Negative Attributes")

plt.subplot(1,2,1)
plt.xlabel("ROA (A)")
plt.ylabel("ROA (B)")
sns.scatterplot(data=data_bankruptcy, x=' ROA(A) before interest and % after tax', y=' ROA(B) before interest and depreciation after tax',color = 'red')

plt.subplot(1,2,2)
plt.xlabel("ROA (B)")
plt.ylabel("ROA (C)")
sns.scatterplot(data=data_bankruptcy, x=' ROA(B) before interest and depreciation after tax', y=' ROA(C) before interest and depreciation before interest',color = 'red')

plt.tight_layout(pad=0.8)

In [47]:
relation = positive_corr.columns.tolist()[:-1] + negative_corr.columns.tolist()[:-1]
plt.figure(figsize=(8,7))
sns.heatmap(data_bankruptcy[relation].corr(),annot=True)

# Summary of Analysis
Few of the organizations in the dataset went bankrupt during the 10 years between 1999 and 2009.
Several companies possess many assets, which is generally a good sign for an organization.
Owning many assets, however, does not guarantee that an organization will avoid bankruptcy.
The organizations in the dataset have been running at a loss for the past two years, as their net income is flagged as negative.
Very few of the organizations that have had negative income in the past two years went bankrupt.
The attributes "Debt Ratio %, Current Liability To Assets, Current Liability To Current Assets" are among those with the highest correlation with the target attribute.
Higher values of "Debt Ratio %, Current Liability To Assets, Current Liability To Current Assets" are associated with heavy losses and, in turn, with bankruptcy.
Higher values of the attributes that are negatively correlated with the target attribute are associated with a lower risk of bankruptcy.
There seems to be a relationship between the attributes most positively correlated with the target attribute and those most negatively correlated with it.
We observed several correlations among the top 12 attributes; for example, "Net Worth/Assets" and "Debt Ratio %" are negatively correlated with each other.

# Machine Learning model

In [8]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

Data pre-processing, a technique from data mining, is commonly applied before machine learning models are developed. Real-world data is often inconsistent, incomplete, and error-prone, so it can rarely be analysed directly. The raw data is therefore transformed into a useful, consistent format before analysis. In its simplest form, pre-processing consists of cleaning the data (handling incomplete and noisy data), transforming it (normalization, etc.) and reducing it (dimension reduction, etc.). In a machine learning workflow, it typically involves the following steps:

Importing the open-source libraries required for data manipulation and analysis, especially Pandas and NumPy
Loading the dataset in the appropriate format
Inspecting the dataset for issues such as missing values and unsuitable data types, and fixing any that are found
Splitting the data into train and test sets so that machine learning algorithms can be applied
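The inspection and cleaning steps listed above can be sketched on a toy frame. The column names and values below are made up for illustration and are not taken from the dataset. Stripping whitespace from column names and dropping constant columns are assumptions added here; the notebook itself keeps the original column names, including their leading spaces.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values (not rows from the real dataset)
demo = pd.DataFrame({
    " Debt ratio %": [0.21, 0.35, np.nan, 0.18],
    " Net Income Flag": [1, 1, 1, 1],
    "Bankrupt?": [0, 1, 0, 0],
})

# Remove leading/trailing spaces from the column names
demo.columns = demo.columns.str.strip()

# Count missing values in each column
missing_per_column = demo.isna().sum()

# Columns holding a single value carry no information for a model
constant_columns = [c for c in demo.columns if demo[c].nunique(dropna=False) == 1]

# Drop the constant columns, then the rows that still contain missing values
cleaned = demo.drop(columns=constant_columns).dropna()

print(missing_per_column)
print("Constant columns:", constant_columns)
print("Shape after cleaning:", cleaned.shape)
```

Whether to drop or impute rows with missing values depends on the data; dropping is used here only because it is the shortest option to show.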

To assess how a machine learning model behaves, it is important to split the data into train and test sets, because the model's parameters are fitted on the train data. After training, the model is evaluated on a separate data set (the test data), which measures how it responds to new observations. Opinions differ on how much of the data should go to each set. For very large datasets the train share can be kept as large as possible, but a common choice is an 80-20% split; this notebook uses 75-25% (test_size = 0.25).
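When one class is rare, as bankrupt companies are here, a random split can leave the test set with a different share of that class than the train set. A minimal sketch of one way to avoid this is the `stratify` option of `train_test_split`. This option is not used in the notebook, and the data below is synthetic, with a made-up 4% positive rate.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced data: 1000 samples, 40 of them positive (4%)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 5))
y_demo = np.zeros(1000, dtype=int)
y_demo[:40] = 1

# stratify=y_demo keeps the class proportions equal in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=10, stratify=y_demo
)

print("Positive share in train:", y_tr.mean())
print("Positive share in test: ", y_te.mean())
```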

In [9]:
name_col = data_bankruptcy.columns
name_col

In [10]:
x = data_bankruptcy.drop('Bankrupt?', axis=1)
y = data_bankruptcy['Bankrupt?']

In [11]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.25, random_state = 10)

In [12]:
model = LogisticRegression(max_iter = 7000)

In [13]:
model.fit(x_train, y_train)
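The pre-processing discussion above mentions normalization, which the model in this notebook does not apply. As a hedged sketch of that step, features can be standardized inside a scikit-learn pipeline before the logistic regression. This is an alternative setup rather than the notebook's method, and the data is generated with `make_classification` purely for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced data standing in for the financial ratios
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, weights=[0.9, 0.1], random_state=0
)

# The scaler is fitted together with the model, so the same scaling
# is reused automatically whenever the pipeline predicts
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_model.fit(X_demo, y_demo)

train_accuracy = scaled_model.score(X_demo, y_demo)
print("Training accuracy on the synthetic data:", train_accuracy)
```

Standardized inputs usually let the solver converge in far fewer iterations, which may reduce the need for a very large `max_iter`.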

Confusion matrix

In [14]:
from sklearn.metrics import confusion_matrix,classification_report
from mlxtend.plotting import plot_confusion_matrix

Predict the target for x_test and show the confusion matrix of the predictions against the true labels (y_test).


In [15]:
y_predict = model.predict(x_test)
confusion_matrix_model = confusion_matrix(y_test,y_predict)
print(confusion_matrix_model)
print('Rows are actual classes, columns are predicted classes')

In [16]:
fig, ax = plot_confusion_matrix(conf_mat=confusion_matrix_model, figsize=(6, 6), cmap=plt.cm.Greens)
plt.xlabel('Predictions', fontsize=18)
plt.ylabel('Actuals', fontsize=18)
plt.title('Confusion Matrix', fontsize=25 , pad=15)

plt.show()
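With rows as actual classes and columns as predicted classes, the four cells of a binary confusion matrix are enough to derive the usual metrics by hand. The sketch below uses a made-up matrix, not the output of the model above.

```python
import numpy as np

# Hypothetical 2x2 confusion matrix (rows: actual, columns: predicted)
cm_demo = np.array([[90, 5],
                    [3, 2]])

# For labels 0 and 1, ravel() yields the cells in this order
tn, fp, fn, tp = cm_demo.ravel()

accuracy = (tp + tn) / cm_demo.sum()   # share of all predictions that are correct
precision = tp / (tp + fp)             # share of predicted bankruptcies that are real
recall = tp / (tp + fn)                # share of real bankruptcies that are found

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```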

Check the accuracy score

In [17]:
from sklearn.metrics import accuracy_score, precision_score

In [18]:
# Start the report on its own line so that its columns stay aligned
print("Classification Report\n", classification_report(y_test,y_predict))

In [19]:
print("accuracy of Logistic Regression model :", accuracy_score(y_test,y_predict))
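Because bankrupt companies are rare in this dataset, accuracy alone can look good even for a model that never predicts bankruptcy. The sketch below illustrates this with synthetic labels (a made-up 3% positive rate) and a baseline that always predicts class 0. Recall and balanced accuracy are shown as complementary metrics; they are not part of the original notebook.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, recall_score

# Synthetic labels: 970 negatives and 30 positives
y_true_demo = np.array([0] * 970 + [1] * 30)

# Baseline that never predicts the positive class
y_pred_demo = np.zeros_like(y_true_demo)

baseline_accuracy = accuracy_score(y_true_demo, y_pred_demo)
baseline_recall = recall_score(y_true_demo, y_pred_demo)
baseline_balanced = balanced_accuracy_score(y_true_demo, y_pred_demo)

print("accuracy:", baseline_accuracy)
print("recall on class 1:", baseline_recall)
print("balanced accuracy:", baseline_balanced)
```

The high accuracy comes entirely from the majority class. Recall on class 1 shows that none of the positive cases are found, which is why the classification report is worth reading alongside the accuracy score.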