# Financial Distress Prediction Project
### Context
The project aims to identify companies that are likely to go bankrupt.
Each company has a **Financial Distress** score (the target), which is associated with its probability of going bankrupt.
### Content of Dataset
**First column: Company** represents sample companies.

**Second column: Time** shows the time period each row belongs to. The length of the time series varies from 1 to 14 periods per company.

**Third column**: The target variable, "**Financial Distress**". If it is greater than -0.50, the company is considered **healthy (0)**; otherwise, it is regarded as **financially distressed (1)**.

**Fourth column to the last column**: The anonymized features **x1** to **x83** are financial and non-financial characteristics of the sampled companies. These features belong to the previous time period and should be used to predict whether the company will be financially distressed or not (classification). Feature **x80** is a **categorical variable**.

### Goals of the project

Treating this as a classification problem, the goals are to find:

- the features most indicative of financial distress
- a well-performing machine learning model that predicts the risk of bankruptcy.
## Imports

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.metrics import mutual_info_score
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

## Data preparation
Data from https://www.kaggle.com/datasets/shebrahimi/financial-distress
### Reading the data

In [2]:
df = pd.read_csv("Financial-Distress.csv")

### Making column names and values look uniform

In [3]:
df.columns = df.columns.str.lower().str.replace(' ', '_')

In [4]:
df.head()

| | company | time | financial_distress | x1 | x2 | x3 | x4 | x5 | x6 | x7 | ... | x74 | x75 | x76 | x77 | x78 | x79 | x80 | x81 | x82 | x83 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 0.010636 | 1.281 | 0.022934 | 0.87454 | 1.2164 | 0.06094 | 0.18827 | 0.5251 | ... | 85.437 | 27.07 | 26.102 | 16.0 | 16.0 | 0.2 | 22 | 0.06039 | 30 | 49 |
| 1 | 1 | 2 | -0.45597 | 1.27 | 0.006454 | 0.82067 | 1.0049 | -0.01408 | 0.18104 | 0.62288 | ... | 107.09 | 31.31 | 30.194 | 17.0 | 16.0 | 0.4 | 22 | 0.010636 | 31 | 50 |
| 2 | 1 | 3 | -0.32539 | 1.0529 | -0.059379 | 0.92242 | 0.72926 | 0.020476 | 0.044865 | 0.43292 | ... | 120.87 | 36.07 | 35.273 | 17.0 | 15.0 | -0.2 | 22 | -0.45597 | 32 | 51 |
| 3 | 1 | 4 | -0.56657 | 1.1131 | -0.015229 | 0.85888 | 0.80974 | 0.076037 | 0.091033 | 0.67546 | ... | 54.806 | 39.8 | 38.377 | 17.167 | 16.0 | 5.6 | 22 | -0.32539 | 33 | 52 |
| 4 | 2 | 1 | 1.3573 | 1.0623 | 0.10702 | 0.8146 | 0.83593 | 0.19996 | 0.0478 | 0.742 | ... | 85.437 | 27.07 | 26.102 | 16.0 | 16.0 | 0.2 | 29 | 1.251 | 7 | 27 |


### Feature **x80**

In [5]:
df.dtypes[df.dtypes == 'object'].index

Index([], dtype='object')

It seems that no column is stored as a categorical (object) type, although **x80** should be categorical. It is therefore cast to a string.

In [6]:
df["x80"] = df["x80"].astype(dtype="str")

### Dropping the unused features **company** and **time**

In [7]:
df = df.drop(columns=["company", "time"])

### Target **financial_distress** preparation for classification
Using the **-0.50** threshold, the target is converted to a binary label (1 = financially distressed, 0 = healthy).

In [8]:
df["financial_distress"] = (df["financial_distress"] <= -0.5).astype(int)
df["financial_distress"]

0       0
1       0
2       0
3       1
4       0
       ..
3667    0
3668    0
3669    0
3670    0
3671    0
Name: financial_distress, Length: 3672, dtype: int64

## Setting up the validation framework
### Perform the train/validation/test split with Scikit-Learn

In [9]:
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=39)
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=39)
len(df_train), len(df_val), len(df_test)

(2202, 735, 735)
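
Because distressed companies are rare in this dataset (about 3.7% of rows, as the EDA section shows), a purely random split can leave the validation or test set with very few positive examples. A minimal sketch of a stratified variant follows. It uses a synthetic frame `toy` (an assumption for illustration, not the real dataset) with the same number of rows and a similar class ratio, together with the `stratify` argument of `train_test_split`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 3672 rows, 136 positives (~3.7%).
rng = np.random.default_rng(39)
toy = pd.DataFrame({
    "x1": rng.normal(size=3672),
    "financial_distress": np.r_[np.ones(136, dtype=int), np.zeros(3536, dtype=int)],
})

# First split: 20% test. Second split: 25% of the remaining 80% = 20% validation.
toy_full_train, toy_test = train_test_split(
    toy, test_size=0.2, random_state=39, stratify=toy["financial_distress"]
)
toy_train, toy_val = train_test_split(
    toy_full_train, test_size=0.25, random_state=39,
    stratify=toy_full_train["financial_distress"],
)

sizes = (len(toy_train), len(toy_val), len(toy_test))
rates = [part["financial_distress"].mean() for part in (toy_train, toy_val, toy_test)]
print(sizes)
print([round(r, 4) for r in rates])
```

With stratification, each part keeps a distress rate close to the global one, which makes validation metrics less sensitive to the random seed.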

In [10]:
df_full_train = df_full_train.reset_index(drop=True)
df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)

In [11]:
y_full_train = df_full_train.financial_distress.values
y_train = df_train.financial_distress.values
y_val = df_val.financial_distress.values
y_test = df_test.financial_distress.values

In [12]:
del df_full_train['financial_distress']
del df_train['financial_distress']
del df_val['financial_distress']
del df_test['financial_distress'] 

## EDA – Exploratory Data Analysis
### Checking missing values

In [13]:
df.isna().sum().sum()

np.int64(0)

There are **no** *NaN* values.
### Looking at the target variable **financial_distress**

In [14]:
df.financial_distress.value_counts(normalize=True)

financial_distress
0    0.962963
1    0.037037
Name: proportion, dtype: float64

In [15]:
global_bankruptcy_rate = df.financial_distress.mean()
round(global_bankruptcy_rate*100, 2)

np.float64(3.7)

### Looking at numerical and categorical variables

In [16]:
numerical_vars = df_train.select_dtypes(include=['int64', 'float64'])
categorical_vars = df_train.select_dtypes(include=['object'])
print("There are", len(numerical_vars.columns), "numerical variables:")
print(numerical_vars.columns)
print("\nThere is", len(categorical_vars.columns), "categorical variable:")
print(categorical_vars.columns)

There are 82 numerical variables:
Index(['x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8', 'x9', 'x10', 'x11',
       'x12', 'x13', 'x14', 'x15', 'x16', 'x17', 'x18', 'x19', 'x20', 'x21',
       'x22', 'x23', 'x24', 'x25', 'x26', 'x27', 'x28', 'x29', 'x30', 'x31',
       'x32', 'x33', 'x34', 'x35', 'x36', 'x37', 'x38', 'x39', 'x40', 'x41',
       'x42', 'x43', 'x44', 'x45', 'x46', 'x47', 'x48', 'x49', 'x50', 'x51',
       'x52', 'x53', 'x54', 'x55', 'x56', 'x57', 'x58', 'x59', 'x60', 'x61',
       'x62', 'x63', 'x64', 'x65', 'x66', 'x67', 'x68', 'x69', 'x70', 'x71',
       'x72', 'x73', 'x74', 'x75', 'x76', 'x77', 'x78', 'x79', 'x81', 'x82',
       'x83'],
      dtype='object')

There is 1 categorical variable:
Index(['x80'], dtype='object')


In [17]:
numerical = numerical_vars.columns.to_list()
categorical = categorical_vars.columns.to_list()
categorical

['x80']

In [18]:
df[categorical].nunique()

x80    37
dtype: int64

## Feature importance
### Risk ratio

In [19]:
df_group = df.groupby("x80").financial_distress.agg(['mean', 'count'])
df_group['diff'] = df_group['mean'] - global_bankruptcy_rate
df_group['risk'] = df_group['mean'] / global_bankruptcy_rate
display(df_group.sort_values("risk", ascending = False))

| x80 | mean | count | diff | risk |
|---|---|---|---|---|
| 13 | 0.25 | 4 | 0.212963 | 6.75 |
| 12 | 0.177215 | 79 | 0.140178 | 4.78481 |
| 11 | 0.09375 | 32 | 0.056713 | 2.53125 |
| 27 | 0.085106 | 47 | 0.048069 | 2.297872 |
| 7 | 0.083333 | 12 | 0.046296 | 2.25 |
| 1 | 0.071429 | 14 | 0.034392 | 1.928571 |
| 22 | 0.070064 | 157 | 0.033027 | 1.89172 |
| 20 | 0.067308 | 104 | 0.030271 | 1.817308 |
| 21 | 0.067227 | 119 | 0.03019 | 1.815126 |
| 31 | 0.055556 | 18 | 0.018519 | 1.5 |
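
The `diff` and `risk` columns compare each group's distress rate with the global rate: `diff` is the absolute difference and `risk` is the ratio, so a value above 1 means the group is distressed more often than average. The sketch below reproduces the computation on a small made-up frame. The `toy` data and the `risk_table` helper are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

def risk_table(frame, group_col, target_col):
    """Per-group target rate compared with the global rate."""
    global_rate = frame[target_col].mean()
    table = frame.groupby(group_col)[target_col].agg(["mean", "count"])
    table["diff"] = table["mean"] - global_rate   # absolute difference
    table["risk"] = table["mean"] / global_rate   # ratio: > 1 means riskier than average
    return table.sort_values("risk", ascending=False)

# Made-up data: group "a" has 2 distressed rows out of 4, group "b" has 0 out of 4.
toy = pd.DataFrame({
    "group": ["a", "a", "a", "a", "b", "b", "b", "b"],
    "target": [1, 1, 0, 0, 0, 0, 0, 0],
})
result = risk_table(toy, "group", "target")
print(result)
```

The `count` column matters when reading the ratio: in the table above, the group with x80 = 13 has the highest risk (6.75) but contains only 4 rows.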


### Mutual information

In [20]:
mutual_info_score(df.x80, df.financial_distress)

np.float64(0.01289788037037554)
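
`mutual_info_score` returns the mutual information between two label vectors in nats (natural logarithm). A score of 0 means the two variables are independent, and larger values mean that knowing one reduces uncertainty about the other. A small sketch on made-up labels gives a sense of scale:

```python
import numpy as np
from sklearn.metrics import mutual_info_score

labels = [0, 0, 1, 1]

# Identical vectors: the mutual information equals the entropy of the labels, ln(2).
mi_identical = mutual_info_score(labels, [0, 0, 1, 1])

# Independent vectors: every combination occurs equally often, so the score is 0.
mi_independent = mutual_info_score(labels, [0, 1, 0, 1])

print(round(mi_identical, 4), round(mi_independent, 4))
```

The score cannot exceed the entropy of either variable, so the value obtained for x80 is best read relative to the entropy of the (highly imbalanced) target rather than relative to 1.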

### Correlation

In [21]:
df[numerical].corrwith(df.financial_distress).abs().sort_values(ascending = False)

x14    0.291381
x49    0.277275
x9     0.269254
x2     0.242083
x10    0.234725
         ...   
x34    0.003199
x22    0.002311
x17    0.001951
x71    0.000659
x57    0.000111
Length: 82, dtype: float64

## One-hot encoding

In [22]:
train_dict = df_train[categorical + numerical].to_dict(orient='records')
dv = DictVectorizer(sparse=False)

In [23]:
X_train = dv.fit_transform(train_dict)
X_train.shape

(2202, 118)

In [24]:
val_dict = df_val[categorical + numerical].to_dict(orient='records')
X_val = dv.transform(val_dict)

## Training logistic regression with Scikit-Learn

In [25]:
model = LogisticRegression(solver="liblinear", max_iter=500, random_state=39)
model.fit(X_train, y_train)

In [26]:
y_pred = model.predict_proba(X_val)[:, 1]
y_pred

array([2.18498065e-09, 1.06932211e-01, 6.23620622e-02, 1.87783451e-02,
       6.53658557e-02, 3.54531083e-02, 1.09652367e-02, 8.01686238e-03,
       2.67662900e-04, 8.85217541e-03, 1.43479964e-01, 4.89573198e-03,
       1.15133378e-02, 1.95367492e-02, 8.11912207e-03, 9.19646231e-04,
       4.72937391e-01, 3.50553930e-02, 2.52186513e-01, 5.42199375e-02,
       1.80115064e-18, 5.85997212e-02, 1.23401772e-02, 1.29605570e-01,
       4.82826013e-06, 2.67278726e-04, 2.03853015e-02, 2.00831004e-46,
       7.05500151e-03, 1.08703890e-01, 1.56902920e-04, 5.37946917e-02,
       1.66689641e-02, 2.93367769e-02, 3.79453104e-02, 1.36042942e-01,
       9.22312417e-02, 2.55473075e-03, 5.73683821e-02, 1.14203682e-03,
       7.06445899e-04, 6.66583330e-04, 5.12378353e-11, 3.01639683e-02,
       9.60998859e-02, 1.77418680e-02, 1.42330170e-01, 1.17026725e-02,
       2.66822678e-03, 1.46558607e-04, 3.18615588e-04, 2.49329958e-06,
       3.23281394e-02, 2.93823639e-02, 4.47209470e-08, 2.37133086e-02,
      

In [27]:
bankruptcy = y_pred > 0.5

In [28]:
(y_val == bankruptcy).mean()

np.float64(0.9687074829931973)
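
With roughly 96% of companies labeled healthy, a model that always predicts "healthy" already reaches about 96% accuracy, so accuracy alone says little about how well distressed companies are detected. The sketch below, on made-up labels and scores (not the notebook's predictions), compares accuracy with a majority-class baseline and with precision and recall for the positive class.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Made-up example: 100 companies, 4 of them distressed.
y_true = np.array([1] * 4 + [0] * 96)

# Made-up scores: only one distressed company is scored above 0.5.
scores = np.array([0.9, 0.3, 0.2, 0.1] + [0.05] * 96)
y_hat = (scores > 0.5).astype(int)

baseline_acc = accuracy_score(y_true, np.zeros_like(y_true))  # always predict "healthy"
model_acc = accuracy_score(y_true, y_hat)
precision = precision_score(y_true, y_hat)
recall = recall_score(y_true, y_hat)

print(baseline_acc, model_acc, precision, recall)
```

In this toy case the accuracy gain over the baseline is a single point, yet three of the four distressed companies are missed, which only the recall makes visible.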

## Model interpretation
### Look at the coefficients

In [29]:
dv.get_feature_names_out()

array(['x1', 'x10', 'x11', 'x12', 'x13', 'x14', 'x15', 'x16', 'x17',
       'x18', 'x19', 'x2', 'x20', 'x21', 'x22', 'x23', 'x24', 'x25',
       'x26', 'x27', 'x28', 'x29', 'x3', 'x30', 'x31', 'x32', 'x33',
       'x34', 'x35', 'x36', 'x37', 'x38', 'x39', 'x4', 'x40', 'x41',
       'x42', 'x43', 'x44', 'x45', 'x46', 'x47', 'x48', 'x49', 'x5',
       'x50', 'x51', 'x52', 'x53', 'x54', 'x55', 'x56', 'x57', 'x58',
       'x59', 'x6', 'x60', 'x61', 'x62', 'x63', 'x64', 'x65', 'x66',
       'x67', 'x68', 'x69', 'x7', 'x70', 'x71', 'x72', 'x73', 'x74',
       'x75', 'x76', 'x77', 'x78', 'x79', 'x8', 'x80=1', 'x80=10',
       'x80=11', 'x80=12', 'x80=13', 'x80=14', 'x80=15', 'x80=16',
       'x80=17', 'x80=18', 'x80=19', 'x80=2', 'x80=20', 'x80=21',
       'x80=22', 'x80=23', 'x80=24', 'x80=25', 'x80=26', 'x80=27',
       'x80=28', 'x80=29', 'x80=3', 'x80=30', 'x80=31', 'x80=32',
       'x80=33', 'x80=34', 'x80=36', 'x80=37', 'x80=4', 'x80=5', 'x80=6',
       'x80=7', 'x80=8', 'x80=9', 'x81',

In [30]:
model.intercept_[0]

np.float64(-1.477925722751304e-05)

In [31]:
dict(zip(dv.get_feature_names_out(), model.coef_[0].round(3)))

{'x1': np.float64(-0.0),
 'x10': np.float64(-0.0),
 'x11': np.float64(0.0),
 'x12': np.float64(-0.0),
 'x13': np.float64(-0.0),
 'x14': np.float64(0.0),
 'x15': np.float64(-0.0),
 'x16': np.float64(0.0),
 'x17': np.float64(0.0),
 'x18': np.float64(-0.0),
 'x19': np.float64(-0.0),
 'x2': np.float64(-0.0),
 'x20': np.float64(-0.0),
 'x21': np.float64(-0.0),
 'x22': np.float64(0.001),
 'x23': np.float64(-0.0),
 'x24': np.float64(0.0),
 'x25': np.float64(-0.003),
 'x26': np.float64(-0.0),
 'x27': np.float64(0.0),
 'x28': np.float64(-0.0),
 'x29': np.float64(0.0),
 'x3': np.float64(0.0),
 'x30': np.float64(0.0),
 'x31': np.float64(-0.0),
 'x32': np.float64(-0.0),
 'x33': np.float64(-0.0),
 'x34': np.float64(-0.0),
 'x35': np.float64(-0.001),
 'x36': np.float64(-0.0),
 'x37': np.float64(0.0),
 'x38': np.float64(-0.0),
 'x39': np.float64(-0.0),
 'x4': np.float64(-0.0),
 'x40': np.float64(-0.0),
 'x41': np.float64(0.0),
 'x42': np.float64(-0.0),
 'x43': np.float64(-0.001),
 'x44': np.float64(-
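
Most coefficients round to 0.0, which makes them hard to compare. One plausible contributor is that the features are on very different scales, since a feature with large values needs only a tiny coefficient. A possible variant, sketched here on synthetic data rather than on the notebook's features, standardizes the inputs before fitting so that coefficient magnitudes become comparable. This illustrates the idea only and makes no claim about how it would change the results above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(39)

# Synthetic data: two equally informative features on very different scales.
small = rng.normal(size=500)
large = rng.normal(size=500) * 1000.0
X_toy = np.column_stack([small, large])
y_toy = ((small + large / 1000.0) > 0).astype(int)

# Standardize each column to zero mean and unit variance, then fit the same model.
pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression(solver="liblinear", max_iter=500, random_state=39),
)
pipe.fit(X_toy, y_toy)

scaled_coefs = pipe.named_steps["logisticregression"].coef_[0]
print(scaled_coefs.round(3))
```

After scaling, the two coefficients should be of similar size, reflecting that both features carry the same amount of signal in this synthetic setup.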

### Train a smaller model with fewer features

In [32]:
subset = ['x22', 'x25', 'x43', 'x48', 'x65', 'x69', 'x71', 'x74']
train_dict_small = df_train[subset].to_dict(orient='records')
dv_small = DictVectorizer(sparse=False)
X_small_train = dv_small.fit_transform(train_dict_small)

dv_small.get_feature_names_out()

array(['x22', 'x25', 'x43', 'x48', 'x65', 'x69', 'x71', 'x74'],
      dtype=object)

In [33]:
model_small = LogisticRegression(solver="liblinear", max_iter=500, random_state=39)
model_small.fit(X_small_train, y_train)

In [34]:
model_small.intercept_[0]

np.float64(-0.5912895763639529)

In [35]:
dict(zip(dv_small.get_feature_names_out(), model_small.coef_[0].round(3)))

{'x22': np.float64(0.009),
 'x25': np.float64(-0.002),
 'x43': np.float64(-0.026),
 'x48': np.float64(-0.001),
 'x65': np.float64(0.008),
 'x69': np.float64(0.001),
 'x71': np.float64(-0.004),
 'x74': np.float64(-0.008)}
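
The intercept and coefficients define the prediction directly: the model computes the score `w0 + x · w` and passes it through the sigmoid function to obtain a probability. The sketch below checks this on synthetic data (the data and the names `X_toy` and `toy_model` are assumptions for illustration) by comparing the manual computation with `predict_proba`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(39)
X_toy = rng.normal(size=(200, 3))
y_toy = (X_toy[:, 0] - X_toy[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

toy_model = LogisticRegression(solver="liblinear", max_iter=500, random_state=39)
toy_model.fit(X_toy, y_toy)

w0 = toy_model.intercept_[0]
w = toy_model.coef_[0]

# Manual prediction: sigmoid of intercept plus dot product with the coefficients.
manual = sigmoid(w0 + X_toy @ w)
from_sklearn = toy_model.predict_proba(X_toy)[:, 1]
print(np.allclose(manual, from_sklearn))
```

Applied to the small model above, the intercept of about -0.591 means that a company with all eight features equal to zero would receive a probability of roughly 0.36.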

## Using the model

In [37]:
full_train_dict = df_full_train[categorical + numerical].to_dict(orient='records')
dv = DictVectorizer(sparse=False)

In [38]:
X_full_train = dv.fit_transform(full_train_dict)

In [39]:
model = LogisticRegression(solver="liblinear", max_iter=500, random_state=39)
model.fit(X_full_train, y_full_train)

In [40]:
test_dict = df_test[categorical + numerical].to_dict(orient='records')
X_test = dv.transform(test_dict)

In [41]:
y_pred = model.predict_proba(X_test)[:, 1]

In [42]:
bankruptcy = y_pred > 0.5
(y_test == bankruptcy).mean()

np.float64(0.9700680272108844)

Let’s take a sample company from our test set:

In [43]:
company = test_dict[10]
company

{'x80': '9',
 'x1': 1.0213,
 'x2': 0.064186,
 'x3': 0.81257,
 'x4': 1.0453,
 'x5': 0.10491,
 'x6': 0.016622,
 'x7': 0.2324,
 'x8': 0.061403,
 'x9': 0.34246,
 'x10': 0.041145,
 'x11': 0.79639,
 'x12': 3.2981,
 'x13': 0.18743,
 'x14': 4.3354,
 'x15': 50.82,
 'x16': 0.10036,
 'x17': 1.5523,
 'x18': 0.036723,
 'x19': 0.030431,
 'x20': 0.047095,
 'x21': 1.3126,
 'x22': 6.2834,
 'x23': 0.15903,
 'x24': 0.77977,
 'x25': 653.63,
 'x26': 12.63,
 'x27': 62.889,
 'x28': 0.046864,
 'x29': 0.032805,
 'x30': 0.18122,
 'x31': 15.627,
 'x32': 0.17336,
 'x33': 0.95963,
 'x34': 1.8458,
 'x35': 7.0839,
 'x36': 0.080596,
 'x37': 0.17503,
 'x38': 0.035131,
 'x39': 1.9395,
 'x40': 0.16636,
 'x41': 5.5773,
 'x42': 0.57553,
 'x43': 1.1266,
 'x44': 0.38582,
 'x45': 0.088685,
 'x46': 0.078991,
 'x47': 4.8796,
 'x48': 1908.6,
 'x49': 4.1604,
 'x50': 0.84097,
 'x51': 12.674,
 'x52': 0.10924,
 'x53': 0.16623,
 'x54': 531.36,
 'x55': 0.22346,
 'x56': 0.56631,
 'x57': 0.031478,
 'x58': -0.01052,
 'x59': -0.010064,
 

In [44]:
X_small = dv.transform([company])
X_small

array([[ 1.0213e+00,  4.1145e-02,  7.9639e-01,  3.2981e+00,  1.8743e-01,
         4.3354e+00,  5.0820e+01,  1.0036e-01,  1.5523e+00,  3.6723e-02,
         3.0431e-02,  6.4186e-02,  4.7095e-02,  1.3126e+00,  6.2834e+00,
         1.5903e-01,  7.7977e-01,  6.5363e+02,  1.2630e+01,  6.2889e+01,
         4.6864e-02,  3.2805e-02,  8.1257e-01,  1.8122e-01,  1.5627e+01,
         1.7336e-01,  9.5963e-01,  1.8458e+00,  7.0839e+00,  8.0596e-02,
         1.7503e-01,  3.5131e-02,  1.9395e+00,  1.0453e+00,  1.6636e-01,
         5.5773e+00,  5.7553e-01,  1.1266e+00,  3.8582e-01,  8.8685e-02,
         7.8991e-02,  4.8796e+00,  1.9086e+03,  4.1604e+00,  1.0491e-01,
         8.4097e-01,  1.2674e+01,  1.0924e-01,  1.6623e-01,  5.3136e+02,
         2.2346e-01,  5.6631e-01,  3.1478e-02, -1.0520e-02, -1.0064e-02,
         1.6622e-02,  2.0954e-01,  7.4166e+00,  7.1050e+00,  1.4321e+01,
         1.8770e+01,  1.2476e+02,  2.6124e+01,  1.1800e+01,  8.3228e+03,
         1.8960e-01,  2.3240e-01,  1.5600e+01,  2.4

In [45]:
X_small.shape

(1, 119)

In [46]:
model.predict_proba(X_small)[0, 1]

np.float64(0.0213328728602854)

The predicted probability is about 2%, so the model does not expect this company to go bankrupt.

In [48]:
y_test[10]

np.int64(0)

Indeed, this company is not in financial distress.

Let's look at another company.

In [63]:
company = test_dict[220]
X_small = dv.transform([company])
model.predict_proba(X_small)[0, 1]

np.float64(0.5388781809419455)

In [64]:
y_test[220]

np.int64(1)

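The predicted probability (about 0.54) is above the 0.5 threshold, and the company is indeed in financial distress, so the prediction is correct.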
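
The single-company steps above (vectorize one record, take the probability of class 1, compare it with the 0.5 threshold) can be wrapped in a small helper. The function name `predict_single` and the toy records below are hypothetical, introduced only for illustration.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def predict_single(record, vectorizer, clf, threshold=0.5):
    """Return (probability of distress, decision) for one company record."""
    X = vectorizer.transform([record])
    proba = float(clf.predict_proba(X)[0, 1])
    return proba, proba > threshold

# Toy records standing in for the real feature dictionaries.
records = [
    {"x80": "a", "x1": -2.0}, {"x80": "a", "x1": -1.5},
    {"x80": "a", "x1": -1.0}, {"x80": "b", "x1": 1.0},
    {"x80": "b", "x1": 1.5}, {"x80": "b", "x1": 2.0},
]
labels = [0, 0, 0, 1, 1, 1]

toy_dv = DictVectorizer(sparse=False)
toy_clf = LogisticRegression(solver="liblinear", random_state=39)
toy_clf.fit(toy_dv.fit_transform(records), labels)

proba, is_distressed = predict_single({"x80": "b", "x1": 1.8}, toy_dv, toy_clf)
print(round(proba, 3), is_distressed)
```

Keeping the threshold as a parameter makes it easy to experiment with values other than 0.5, which may be worth considering when the positive class is rare.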