# Financial Distress Prediction Project
### Context
The project aims to identify companies that are at risk of bankruptcy.
Each company has a **Financial Distress** score, the target, which is associated with its probability of going bankrupt.
### Content of Dataset
**First column: Company** represents sample companies.

**Second column: Time** gives the time period each row belongs to. The length of the time series varies from 1 to 14 periods per company.

**Third column: Financial Distress** is the target variable. If its value is greater than -0.50, the company is considered **healthy (0)**; otherwise it is regarded as **financially distressed (1)**.

**Fourth column to the last column**: The anonymized features **x1** to **x83** are financial and non-financial characteristics of the sampled companies. They belong to the previous time period and are used to predict whether the company will be financially distressed (a classification task). Feature **x80** is a **categorical variable**.

### Goals of the project

Treating this as a classification problem, the goals are to find:

- the features most indicative of financial distress
- a well-performing machine learning model that predicts bankruptcy risk.
## Imports

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split

## Data preparation
Data from https://www.kaggle.com/datasets/shebrahimi/financial-distress
### Reading the data

In [2]:
df = pd.read_csv("Financial-Distress.csv")

### Making column names uniform

In [3]:
df.columns = df.columns.str.lower().str.replace(' ', '_')
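
As a small illustration of the renaming step, the sketch below applies the same two string operations to a hand-written list of headers. The names in `raw_columns` are assumed from the dataset description and are not read from the CSV file.

```python
import pandas as pd

# Hypothetical header names, assumed from the dataset description.
raw_columns = pd.Index(["Company", "Time", "Financial Distress", "x1"])

# Same operations as in the cell above: lowercase, then spaces to underscores.
clean_columns = raw_columns.str.lower().str.replace(" ", "_")

print(list(clean_columns))
# ['company', 'time', 'financial_distress', 'x1']
```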

In [4]:
df.head()

Unnamed: 0,company,time,financial_distress,x1,x2,x3,x4,x5,x6,x7,...,x74,x75,x76,x77,x78,x79,x80,x81,x82,x83
0,1,1,0.010636,1.281,0.022934,0.87454,1.2164,0.06094,0.18827,0.5251,...,85.437,27.07,26.102,16.0,16.0,0.2,22,0.06039,30,49
1,1,2,-0.45597,1.27,0.006454,0.82067,1.0049,-0.01408,0.18104,0.62288,...,107.09,31.31,30.194,17.0,16.0,0.4,22,0.010636,31,50
2,1,3,-0.32539,1.0529,-0.059379,0.92242,0.72926,0.020476,0.044865,0.43292,...,120.87,36.07,35.273,17.0,15.0,-0.2,22,-0.45597,32,51
3,1,4,-0.56657,1.1131,-0.015229,0.85888,0.80974,0.076037,0.091033,0.67546,...,54.806,39.8,38.377,17.167,16.0,5.6,22,-0.32539,33,52
4,2,1,1.3573,1.0623,0.10702,0.8146,0.83593,0.19996,0.0478,0.742,...,85.437,27.07,26.102,16.0,16.0,0.2,29,1.251,7,27


### Feature **x80**

In [5]:
df.dtypes[df.dtypes == 'object'].index

Index([], dtype='object')

It seems that no column is stored as categorical, although **x80** should be, so it is cast to a string type.

In [6]:
df["x80"] = df["x80"].astype(dtype="str")
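
A minimal sketch of what the cast changes, using a hypothetical three-row frame `toy` rather than the real data: the integer codes become Python strings, so pandas no longer treats the column as numeric.

```python
import pandas as pd

# Hypothetical stand-in for the real column; the values are illustrative only.
toy = pd.DataFrame({"x80": [22, 22, 29]})
was_numeric = pd.api.types.is_numeric_dtype(toy["x80"])

# Same cast as in the cell above.
toy["x80"] = toy["x80"].astype(dtype="str")
is_numeric = pd.api.types.is_numeric_dtype(toy["x80"])

print(was_numeric, is_numeric)
print(toy["x80"].tolist())
```

Treating the codes as strings keeps later steps from reading them as quantities with an order or a distance between them.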

### Dropping the unneeded **company** and **time** features

In [7]:
df = df.drop(columns=["company", "time"])

### Target **financial_distress** preparation for classification
Using the **-0.50** threshold, the continuous score is converted into a binary target.

In [8]:
df["financial_distress"] = (df["financial_distress"] <= -0.5).astype(int)
df["financial_distress"]

0       0
1       0
2       0
3       1
4       0
       ..
3667    0
3668    0
3669    0
3670    0
3671    0
Name: financial_distress, Length: 3672, dtype: int64
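
The rule can be checked on a few hand-picked scores. The sketch below reuses the first five scores shown by `df.head()` and adds the boundary value -0.5, which is an assumed test value. Since only scores strictly greater than -0.50 count as healthy, the boundary itself is labeled distressed.

```python
import pandas as pd

# First five scores from df.head(), plus an assumed boundary case (-0.5).
scores = pd.Series([0.010636, -0.45597, -0.32539, -0.56657, 1.3573, -0.5])

# 1 = financially distressed (score <= -0.5), 0 = healthy (score > -0.5).
labels = (scores <= -0.5).astype(int)

print(labels.tolist())
# [0, 0, 0, 1, 0, 1]
```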

## Setting up the validation framework
### Perform the train/validation/test split with Scikit-Learn

In [9]:
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=39)
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=39)
len(df_train), len(df_val), len(df_test)

(2202, 735, 735)
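
These sizes follow from simple arithmetic: 20% of the rows go to the test set, and 25% of the remaining 80% (that is, 20% of the total) go to validation, leaving about 60% for training. The sketch below reproduces the counts, assuming that a fractional `test_size` is rounded up to a whole number of rows, which is consistent with the output above.

```python
import math

n_rows = 3672  # total number of rows in the dataset

# First split: 20% held out for testing (assumed to round up).
n_test = math.ceil(0.2 * n_rows)
n_full_train = n_rows - n_test

# Second split: 25% of the remaining rows held out for validation.
n_val = math.ceil(0.25 * n_full_train)
n_train = n_full_train - n_val

print(n_train, n_val, n_test)
# 2202 735 735

# Approximate share of each part relative to the full dataset.
shares = [round(n / n_rows, 3) for n in (n_train, n_val, n_test)]
print(shares)
```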

In [10]:
df_full_train = df_full_train.reset_index(drop=True)
df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)

In [11]:
y_full_train = df_full_train.financial_distress.values
y_train = df_train.financial_distress.values
y_val = df_val.financial_distress.values
y_test = df_test.financial_distress.values

In [12]:
del df_train['financial_distress']
del df_val['financial_distress']
del df_test['financial_distress'] 

## EDA – Exploratory Data Analysis
### Checking missing values

In [13]:
df_full_train.isna().sum().sum()

np.int64(0)

There are **no** *NaN* values.
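
The double `.sum()` first counts missing values per column and then adds the column counts together. A minimal sketch on a hypothetical frame `toy` with two missing entries:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one NaN in "a" and one in "b".
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, 5.0, 6.0],
    "c": [7.0, 8.0, 9.0],
})

per_column = toy.isna().sum()  # missing values in each column
total = per_column.sum()       # missing values in the whole frame

print(per_column.to_dict())
print(int(total))
```

Dropping the second `.sum()` is a quick way to see which columns are affected when the total is not zero.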
### Looking at the target variable **financial_distress**

In [14]:
df_full_train.financial_distress.value_counts(normalize=True)

financial_distress
0    0.961525
1    0.038475
Name: proportion, dtype: float64

In [15]:
global_bankruptcy_rate = df_full_train.financial_distress.mean()
round(global_bankruptcy_rate*100, 2)

np.float64(3.85)
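
Because the target is coded as 0 and 1, its mean equals the share of distressed rows, which is why `mean()` and `value_counts(normalize=True)` agree. A sketch with a hypothetical 25-row target containing a single distressed case:

```python
import pandas as pd

# Hypothetical target: 24 healthy rows (0) and 1 distressed row (1).
toy_target = pd.Series([0] * 24 + [1])

# Share of class 1 computed two ways.
share_from_counts = toy_target.value_counts(normalize=True)[1]
share_from_mean = toy_target.mean()

print(round(share_from_counts * 100, 2), round(share_from_mean * 100, 2))
```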

### Looking at numerical and categorical variables

In [16]:
del df_full_train['financial_distress']

numerical_vars = df_full_train.select_dtypes(include=['int64', 'float64'])
categorical_vars = df_full_train.select_dtypes(include=['object'])
print("There're", len(numerical_vars.columns),  "numerical variables :")
print(numerical_vars.columns)
print("\nThere're", len(categorical_vars.columns),  "categorical variables :")
print(categorical_vars.columns)

There're 82 numerical variables :
Index(['x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8', 'x9', 'x10', 'x11',
       'x12', 'x13', 'x14', 'x15', 'x16', 'x17', 'x18', 'x19', 'x20', 'x21',
       'x22', 'x23', 'x24', 'x25', 'x26', 'x27', 'x28', 'x29', 'x30', 'x31',
       'x32', 'x33', 'x34', 'x35', 'x36', 'x37', 'x38', 'x39', 'x40', 'x41',
       'x42', 'x43', 'x44', 'x45', 'x46', 'x47', 'x48', 'x49', 'x50', 'x51',
       'x52', 'x53', 'x54', 'x55', 'x56', 'x57', 'x58', 'x59', 'x60', 'x61',
       'x62', 'x63', 'x64', 'x65', 'x66', 'x67', 'x68', 'x69', 'x70', 'x71',
       'x72', 'x73', 'x74', 'x75', 'x76', 'x77', 'x78', 'x79', 'x81', 'x82',
       'x83'],
      dtype='object')

There're 1 categorical variables :
Index(['x80'], dtype='object')


In [17]:
numerical = numerical_vars.columns
categorical = categorical_vars.columns
categorical

Index(['x80'], dtype='object')

In [18]:
df_full_train[categorical].nunique()

x80    37
dtype: int64