# 🏢 **Financial Distress Prediction**

### *Bankruptcy Prediction using Machine Learning*

In this notebook, I (👨🏻‍💼) use the dataset [Financial Distress](https://www.kaggle.com/datasets/shebrahimi/financial-distress?select=Financial+Distress.csv), available on [Kaggle](https://www.kaggle.com/), to evaluate the performance of different Machine Learning models in predicting financial distress. More specifically, I will be exploring the following topics in Data Science and Machine Learning:

🧹 Data Cleaning and Preprocessing

📈 Exploratory Data Analysis (EDA)

⚖️ Techniques to deal with **imbalanced data**
- Undersampling
- Oversampling
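
To make the two ideas concrete, here is a minimal sketch of *random* undersampling and oversampling with pandas. The toy DataFrame and its column names are made up for illustration, and random resampling is only one possible variant of each technique, not necessarily the one used later in the notebook.

```python
import pandas as pd

# Toy imbalanced dataset (hypothetical): 8 healthy (0) rows and 2 distressed (1) rows.
toy = pd.DataFrame({
    "feature": range(10),
    "label": [0] * 8 + [1] * 2,
})

majority = toy[toy["label"] == 0]
minority = toy[toy["label"] == 1]

# Random undersampling: keep only as many majority rows as there are minority rows.
undersampled = pd.concat([
    majority.sample(n=len(minority), random_state=0),
    minority,
])

# Random oversampling: draw minority rows with replacement up to the majority size.
oversampled = pd.concat([
    majority,
    minority.sample(n=len(majority), replace=True, random_state=0),
])

print(undersampled["label"].value_counts().to_dict())
print(oversampled["label"].value_counts().to_dict())
```

Undersampling throws information away, while oversampling repeats the same minority rows, so either one should be applied to the training split only.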

🤖 Machine Learning (both **Frequentist** and **Bayesian** approaches)
- Logistic Regression
  
For this part, the model selection and hyperparameter tuning will be done using $k$-fold cross-validation and grid search.
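
As a rough sketch of what this looks like with scikit-learn, the snippet below tunes a Logistic Regression with `GridSearchCV`. Everything specific here is an assumption made for illustration: the synthetic data from `make_classification`, the grid of `C` values, the choice of recall as the score, and the use of a stratified 5-fold split.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic, imbalanced binary data standing in for the real features (assumption).
X, y = make_classification(
    n_samples=300, n_features=10, weights=[0.9, 0.1], random_state=0
)

# Hypothetical grid over the inverse regularization strength C.
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}

# Stratified 5-fold CV keeps the class ratio roughly equal in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid=param_grid,
    scoring="recall",
    cv=cv,
)
search.fit(X, y)

# Best hyperparameters and their mean cross-validated score.
print(search.best_params_, round(search.best_score_, 3))
```

Each candidate value of `C` is trained and scored on every fold, and the value with the best average score is kept.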

📋 Model evaluation and comparison
  - Confusion Matrix
  - Accuracy
  - Recall
  - Precision
  - ROC Curve
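
As a quick reference, the first four items can be computed by hand with NumPy. This is a minimal sketch on made-up labels and predictions (1 = distressed, 0 = healthy); the ROC curve is left out because it needs predicted scores rather than hard labels.

```python
import numpy as np

# Hypothetical true labels and predictions (1 = distressed, 0 = healthy).
y_true = np.array([1, 0, 0, 1, 0, 0, 1, 0, 0, 0])
y_pred = np.array([1, 0, 0, 0, 0, 1, 1, 0, 0, 0])

# The four cells of the confusion matrix.
tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # distressed, predicted distressed
tn = int(np.sum((y_true == 0) & (y_pred == 0)))  # healthy, predicted healthy
fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # healthy, predicted distressed
fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # distressed, predicted healthy

# Rows are the true class, columns are the predicted class.
confusion = np.array([[tn, fp], [fn, tp]])

accuracy = (tp + tn) / len(y_true)   # share of all predictions that are correct
recall = tp / (tp + fn)              # share of distressed companies that were caught
precision = tp / (tp + fp)           # share of distress alarms that were right

print(confusion)
print(f"accuracy={accuracy:.2f} recall={recall:.2f} precision={precision:.2f}")
```

With imbalanced classes, accuracy alone can look good even when recall is poor, which is why the metrics are reported together.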

It goes without saying that the **topics listed above will not necessarily be explored in a linear order**. On the contrary: I may need to go back and forth between them, as I see fit. This is pretty common in Data Science projects, as we are always learning new things and improving our models.

Throughout the notebook, I will be explaining the concepts and techniques used, as well as the results obtained.

Naturally, I will be using Python and some of its libraries, such as [pandas](https://pandas.pydata.org/), [NumPy](https://numpy.org/), [matplotlib](https://matplotlib.org/) and [scikit-learn](https://scikit-learn.org/stable/). On the Bayesian side, I will be implementing the models using NumPy and statistical knowledge of Bayesian inference and classification models.

Without further ado, let's get started!

## 💰 **Understanding the Subject**
### *Why is it important to understand the subject?*

**Our starting point should be the data**. The "Financial Distress" dataset contains a lot of information about companies. This matters for Machine Learning because, all else being equal, more data allows the models to learn more about the problem and, consequently, to generalize better. Besides, the more data we have, the more confident we can be about the results obtained: the model's estimates tend to become more accurate and their variance tends to shrink.

❗**BUT WE NEED TO BE CAREFUL:** more data does not always mean better results. For instance, if the data was not collected properly, its quality may be compromised and the model may not be able to learn anything from it. In this case, the model will not be able to generalize well and its performance will be poor. 

Another example is when you train a model with a lot of features, but most of them are not relevant to the problem. This will probably lead to poor results as well. It happens because irrelevant features add noise to the model and increase its complexity, which may lead to overfitting [1]. **This is one of the many reasons why we should always try to understand the data before training any model.**

[1]: We see overfitting when the model learns the training data very well, but is not able to generalize. That is, it has fitted the training data too closely, including its noise, instead of learning the underlying patterns of the problem. It might be a little bit clearer now why adding irrelevant features to the model may lead to overfitting.
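
To illustrate the point about irrelevant features, here is a small sketch under simplifying assumptions: it uses synthetic data and ordinary least squares (a regression, not the classifiers used later), with one relevant feature and a growing number of pure-noise features. The training error can only go down as features are added, while the error on unseen data typically gets worse.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, max_noise = 40, 200, 30

def make_split(n):
    """Return one relevant feature, a block of pure-noise features, and the target."""
    relevant = rng.normal(size=(n, 1))
    irrelevant = rng.normal(size=(n, max_noise))
    target = 2.0 * relevant[:, 0] + rng.normal(size=n)
    return relevant, irrelevant, target

x_train, noise_train, y_train = make_split(n_train)
x_test, noise_test, y_test = make_split(n_test)

def fit_and_score(k):
    """Fit least squares on the relevant feature plus the first k noise features."""
    a_train = np.hstack([x_train, noise_train[:, :k]])
    a_test = np.hstack([x_test, noise_test[:, :k]])
    coef, *_ = np.linalg.lstsq(a_train, y_train, rcond=None)
    train_mse = float(np.mean((a_train @ coef - y_train) ** 2))
    test_mse = float(np.mean((a_test @ coef - y_test) ** 2))
    return train_mse, test_mse

results = {k: fit_and_score(k) for k in (0, 10, 30)}
for k, (train_mse, test_mse) in results.items():
    print(f"{k:2d} irrelevant features: train MSE = {train_mse:.3f}, test MSE = {test_mse:.3f}")
```

A widening gap between the training and test error is the usual symptom of overfitting.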

### *Getting to know the data*

Keeping that in mind, let me explain what each column of our dataset means:

According to the documentation, the rows represent some measurements of companies across time. 

- The first column, `Company`, is the identifier of the company. 
  
- The second column, `Time`, is the identifier of the time when the measurements were taken.

- The third column, `Financial Distress`, is the target variable. It is a continuous variable which we aim to binarize: "*[...] if it is greater than -0.50 the company should be considered as healthy (0). Otherwise, it would be regarded as financially distressed (1).*".

- The other columns are the features, which are measurements of the companies that will be used to predict the target variable. By the time we perform the Exploratory Data Analysis (EDA), we will have a better understanding of what each feature means and its importance to the problem.
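
Since each company appears once per time period, a first look at this layout could go as follows. This is only a sketch on a made-up frame that mimics the columns described above; the values and the single feature `x1` are placeholders.

```python
import pandas as pd

# Toy frame mimicking the layout described above (values are made up).
toy = pd.DataFrame({
    "Company": [1, 1, 1, 2, 2],
    "Time": [1, 2, 3, 1, 2],
    "Financial Distress": [0.01, -0.46, -0.57, 1.36, 0.90],
    "x1": [1.28, 1.27, 1.05, 1.06, 1.10],
})

# How many distinct companies are there?
n_companies = toy["Company"].nunique()

# How many time periods were recorded for each company?
periods_per_company = toy.groupby("Company")["Time"].count()

# Everything that is not an identifier or the target is treated as a feature.
non_features = ("Company", "Time", "Financial Distress")
feature_columns = [c for c in toy.columns if c not in non_features]

print(n_companies)
print(periods_per_company.to_dict())
print(feature_columns)
```

Companies can have different numbers of periods, which is worth keeping in mind when splitting the data later.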

## 📚 **Importing the libraries**

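We start by importing the libraries used throughout the notebook: pandas and NumPy for data manipulation and numerical computation, and matplotlib and seaborn for plotting.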

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## 🧹 **Data Cleaning and Preprocessing**

In [2]:
df = pd.read_csv("../data/data.csv")

In [3]:
df.head()

| | Company | Time | Financial Distress | x1 | x2 | x3 | x4 | x5 | x6 | x7 | ... | x74 | x75 | x76 | x77 | x78 | x79 | x80 | x81 | x82 | x83 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 0.010636 | 1.281 | 0.022934 | 0.87454 | 1.2164 | 0.06094 | 0.18827 | 0.5251 | ... | 85.437 | 27.07 | 26.102 | 16.0 | 16.0 | 0.2 | 22 | 0.06039 | 30 | 49 |
| 1 | 1 | 2 | -0.45597 | 1.27 | 0.006454 | 0.82067 | 1.0049 | -0.01408 | 0.18104 | 0.62288 | ... | 107.09 | 31.31 | 30.194 | 17.0 | 16.0 | 0.4 | 22 | 0.010636 | 31 | 50 |
| 2 | 1 | 3 | -0.32539 | 1.0529 | -0.059379 | 0.92242 | 0.72926 | 0.020476 | 0.044865 | 0.43292 | ... | 120.87 | 36.07 | 35.273 | 17.0 | 15.0 | -0.2 | 22 | -0.45597 | 32 | 51 |
| 3 | 1 | 4 | -0.56657 | 1.1131 | -0.015229 | 0.85888 | 0.80974 | 0.076037 | 0.091033 | 0.67546 | ... | 54.806 | 39.8 | 38.377 | 17.167 | 16.0 | 5.6 | 22 | -0.32539 | 33 | 52 |
| 4 | 2 | 1 | 1.3573 | 1.0623 | 0.10702 | 0.8146 | 0.83593 | 0.19996 | 0.0478 | 0.742 | ... | 85.437 | 27.07 | 26.102 | 16.0 | 16.0 | 0.2 | 29 | 1.251 | 7 | 27 |


We must check if there are null values or duplicated rows.

In [4]:
print(f"Columns with NaN values: {df.columns[df.isna().any()].to_list()}")

Columns with NaN values: []


In [5]:
print(f"# of duplicated rows: {df.duplicated().sum()}")

# of duplicated rows: 0
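
For reference, this is what the same two checks report on a frame that does have problems, together with one possible clean-up. It is a sketch on made-up data; dropping rows is just one option (imputing missing values is another), and none of it is needed for our dataset.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical) with one missing value and one repeated row.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 2.0, np.nan],
    "b": [10, 20, 20, 40],
})

# Same checks as above.
nan_columns = toy.columns[toy.isna().any()].to_list()
n_duplicates = int(toy.duplicated().sum())
print(f"Columns with NaN values: {nan_columns}")
print(f"# of duplicated rows: {n_duplicates}")

# One possible clean-up: drop repeated rows, then drop rows with missing values.
cleaned = toy.drop_duplicates().dropna()
print(f"Rows before: {len(toy)}, rows after: {len(cleaned)}")
```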


Given that there are no NaN values or duplicated rows, we can proceed to the next step. Remember that the column `Financial Distress` is a continuous variable, but we would like to model it as a binary variable. The rule has already been given: if the value is greater than -0.5, the company is healthy. Otherwise (that is, if it is less than or equal to -0.5), it is financially distressed. We can use the `apply` method to do this.

In [6]:
def isDistressed(x):
    """
    Returns 1 if x is less than or equal to -0.5, 0 otherwise.
    1 means Financially Distressed, 0 means Financially Healthy.
    """
    # A company is healthy only when its score is strictly greater than -0.5,
    # so a score of exactly -0.5 counts as distressed.
    return 1 if x <= -0.5 else 0

In [7]:
df["Financial Distress Binary"] = df["Financial Distress"].apply(isDistressed)

In [8]:
df[["Company", "Time", "Financial Distress", "Financial Distress Binary"]].head()

| | Company | Time | Financial Distress | Financial Distress Binary |
|---|---|---|---|---|
| 0 | 1 | 1 | 0.010636 | 0 |
| 1 | 1 | 2 | -0.45597 | 0 |
| 2 | 1 | 3 | -0.32539 | 0 |
| 3 | 1 | 4 | -0.56657 | 1 |
| 4 | 2 | 1 | 1.3573 | 0 |
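
As a side note, the same binarization can also be written without `apply`, as a vectorized comparison. The sketch below assumes the documented rule (healthy only when the value is strictly greater than -0.5). The first five values are the ones shown above; the last one, exactly -0.5, is a made-up boundary case.

```python
import pandas as pd

# First five scores from the table above, plus a hypothetical boundary value of -0.5.
scores = pd.Series([0.010636, -0.45597, -0.32539, -0.56657, 1.3573, -0.5])

# The comparison yields booleans; astype(int) maps True -> 1 (distressed), False -> 0 (healthy).
binary = (scores <= -0.5).astype(int)

print(binary.tolist())
```

On large frames the vectorized form is usually faster than calling a Python function once per row.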


Now that the dataset is cleaned, we can start to analyze it. We will start by looking at the distribution of the data.

---

## 📈 **Exploratory Data Analysis (EDA)**