# **Key Financial Indicators in Corporate Bankruptcy Prediction**

## 1. Introduction
### (1) Background information
### Project Outline
For this individual project, I analyze a publicly available Kaggle dataset titled “Company Bankruptcy Prediction.” The dataset contains financial ratios for thousands of companies, along with a label indicating whether each company eventually went bankrupt. The goal is to explore which financial indicators are most predictive of bankruptcy and to build a simple classification model that identifies companies at high risk. I apply the data wrangling, visualization, and classification techniques I have learned so far, using R to carry out the analysis.
### Company Bankruptcy Prediction dataset
The dataset used in this project comes from a publicly available Kaggle dataset titled Company Bankruptcy Prediction. The data were originally collected from the Taiwan Economic Journal between 1999 and 2009, and company bankruptcy was defined according to the business regulations of the Taiwan Stock Exchange.
### (2) Questions we tried to answer
**Broad Question**  
What financial characteristics are most predictive of company bankruptcy, and how do these factors differ between bankrupt and non-bankrupt firms?  
**Specific Question**    
Can selected financial ratios—such as net income to total assets, debt ratio, net value growth rate, interest coverage ratio, and current ratio—predict whether a company is likely to go bankrupt?
### (3) Dataset description
The dataset contains 6,819 companies and 96 columns: the `Bankrupt?` label and 95 financial indicators, such as Net Income to Total Assets, Debt Ratio, Net Value Growth Rate, and Interest Coverage Ratio. Each company is labeled as either bankrupt (1) or non-bankrupt (0), which makes the data suitable for binary classification.

## 2. Methods & Results
### (1) K-Nearest Neighbors (KNN) Classification
We use KNN classification to predict whether a company will go bankrupt based on five financial ratios:
net income to total assets, debt ratio, net value growth rate, interest coverage ratio, and current ratio.

This method computes the distance between a new company and every company in the training set, finds the K nearest neighbors, and assigns the majority class among them (bankrupt or not) as the prediction. The steps are as follows:

* Load in R packages and dataset

* Clean the dataset:

   * Convert Bankrupt? into a factor variable
 
   * Select only the five financial ratio columns and the target variable

   * Check for missing values (NA) in the selected financial ratios
 
   * If missing values are found, remove or impute them as appropriate

   * Standardize the five selected financial ratios to ensure equal scale

   * Remove any companies with incomplete or invalid financial data

* Explore and visualize the cleaned data:

   * Plot distribution of each variable

   * Use boxplots or scatterplots to compare bankrupt vs non-bankrupt companies

* Split the data into training and testing sets (e.g., 70% training, 30% testing)

* Train a KNN model on the training data

* Use cross-validation to select the optimal value of K

* Retrain the model using the chosen K

* Evaluate the model on the test data:

   * Generate a confusion matrix

   * Calculate accuracy, precision, recall

* Use the model to predict whether a new company is likely to go bankrupt
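
The distance-and-vote rule and the evaluation steps listed above can be sketched in base R on synthetic data. This is a minimal illustration under stated assumptions, not the project's model: the data frame `toy`, the predictors `x1` and `x2`, the helper `knn_predict`, and the choice `k = 5` are all hypothetical. The notebook loads tidymodels, which would typically handle these steps; the hand-written version here only shows the mechanics.

```r
# Toy illustration of KNN: standardize, measure distances, take a majority vote.
# All names and data below are hypothetical and unrelated to the Kaggle dataset.
set.seed(1)

# Two synthetic predictors for 40 "non-bankrupt" (0) and 40 "bankrupt" (1) firms
n <- 40
toy <- data.frame(
  x1 = c(rnorm(n, mean = 0), rnorm(n, mean = 2)),
  x2 = c(rnorm(n, mean = 0), rnorm(n, mean = 2)),
  bankrupt = factor(rep(c(0, 1), each = n), levels = c(0, 1))
)

# 70/30 split into training and testing rows
train_idx <- sample(nrow(toy), size = round(0.7 * nrow(toy)))
toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

# Standardize both sets using the training means and standard deviations only
predictors <- c("x1", "x2")
mu    <- colMeans(toy_train[, predictors])
sigma <- apply(toy_train[, predictors], 2, sd)
train_x <- scale(toy_train[, predictors], center = mu, scale = sigma)
test_x  <- scale(toy_test[, predictors],  center = mu, scale = sigma)

# For each new row: Euclidean distance to every training row, then a vote
# among the k closest training labels
knn_predict <- function(train_x, train_y, new_x, k) {
  preds <- apply(new_x, 1, function(row) {
    d <- sqrt(colSums((t(train_x) - row)^2))
    nearest <- order(d)[seq_len(k)]
    votes <- table(train_y[nearest])
    names(votes)[which.max(votes)]
  })
  factor(unname(preds), levels = levels(train_y))
}

pred <- knn_predict(train_x, toy_train$bankrupt, test_x, k = 5)

# Confusion matrix and the three metrics, treating "1" (bankrupt) as positive
conf_mat  <- table(predicted = pred, actual = toy_test$bankrupt)
accuracy  <- sum(diag(conf_mat)) / sum(conf_mat)
precision <- conf_mat["1", "1"] / sum(conf_mat["1", ])
recall    <- conf_mat["1", "1"] / sum(conf_mat[, "1"])

print(conf_mat)
```

Standardizing with the training statistics keeps information from the test rows out of the model, and an odd `k` avoids tied votes when there are two classes.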

### (2) Execution

In [1]:
# Load necessary R packages
library(tidyverse)
library(repr)
library(tidymodels)
options(repr.matrix.max.rows = 6)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
── Attaching packages ────────────────────────────────────── tidymodels 1.2.0 ──

✔ broom        1.0.6     ✔ rsample

In [23]:
# Read the CSV file

bank_data <- read_csv('company_bankruptcy_data.csv')
bank_data

[1mRows: [22m[34m6819[39m [1mColumns: [22m[34m96[39m
[36m──[39m [1mColumn specification[22m [36m────────────────────────────────────────────────────────[39m
[1mDelimiter:[22m ","
[32mdbl[39m (96): Bankrupt?, ROA(C) before interest and depreciation before interest...

[36mℹ[39m Use `spec()` to retrieve the full column specification for this data.
[36mℹ[39m Specify the column types or set `show_col_types = FALSE` to quiet this message.


| Bankrupt? | ROA(C) before interest and depreciation before interest | ROA(A) before interest and % after tax | ROA(B) before interest and depreciation after tax | Operating Gross Margin | Realized Sales Gross Margin | Operating Profit Rate | Pre-tax net Interest Rate | After-tax net Interest Rate | Non-industry income and expenditure/revenue | ⋯ | Net Income to Total Assets | Total assets to GNP price | No-credit Interval | Gross Profit to Sales | Net Income to Stockholder's Equity | Liability to Equity | Degree of Financial Leverage (DFL) | Interest Coverage Ratio (Interest expense to EBIT) | Net Income Flag | Equity to Liability |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | ⋯ | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 1 | 0.3705943 | 0.4243894 | 0.4057498 | 0.6014572 | 0.6014572 | 0.9989692 | 0.7968871 | 0.8088094 | 0.3026464 | ⋯ | 0.7168453 | 0.009219440 | 0.6228790 | 0.6014533 | 0.8278902 | 0.2902019 | 0.02660063 | 0.5640501 | 1 | 0.01646874 |
| 1 | 0.4642909 | 0.5382141 | 0.5167300 | 0.6102351 | 0.6102351 | 0.9989460 | 0.7973802 | 0.8093007 | 0.3035564 | ⋯ | 0.7952971 | 0.008323302 | 0.6236517 | 0.6102365 | 0.8399693 | 0.2838460 | 0.26457682 | 0.5701749 | 1 | 0.02079431 |
| 1 | 0.4260713 | 0.4990188 | 0.4722951 | 0.6014500 | 0.6013635 | 0.9988574 | 0.7964034 | 0.8083875 | 0.3020352 | ⋯ | 0.7746697 | 0.040002853 | 0.6238410 | 0.6014493 | 0.8367743 | 0.2901885 | 0.02655472 | 0.5637061 | 1 | 0.01647411 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋱ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 0 | 0.4727246 | 0.5337440 | 0.5206381 | 0.6104441 | 0.6102135 | 0.9989845 | 0.7974009 | 0.8093169 | 0.3035122 | ⋯ | 0.7977781 | 0.0028398848 | 0.6241561 | 0.6104406 | 0.8401383 | 0.2757887 | 0.02679116 | 0.5651584 | 1 | 0.09764874 |
| 0 | 0.5062643 | 0.5599106 | 0.5540446 | 0.6078496 | 0.6078496 | 0.9990738 | 0.7974999 | 0.8093994 | 0.3034982 | ⋯ | 0.8118079 | 0.0028371958 | 0.6239575 | 0.6078459 | 0.8410836 | 0.2775472 | 0.02682205 | 0.5653015 | 1 | 0.04400945 |
| 0 | 0.4930532 | 0.5701047 | 0.5495476 | 0.6274089 | 0.6274089 | 0.9980803 | 0.8019867 | 0.8137996 | 0.3134153 | ⋯ | 0.8159559 | 0.0007068724 | 0.6266803 | 0.6274082 | 0.8410185 | 0.2751141 | 0.02679295 | 0.5651669 | 1 | 0.23390224 |


In [29]:
# Clean the dataset

# Convert `Bankrupt?` to a factor
bank_data <- bank_data |>
    mutate(`Bankrupt?` = as.factor(`Bankrupt?`))

# Select the target variable and the five financial ratios
bank_data <- bank_data |>
    select(`Bankrupt?`,
           `Net Income to Total Assets`,
           `Debt ratio %`,
           `Net Value Growth Rate`,
           `Interest Coverage Ratio (Interest expense to EBIT)`,
           `Current Ratio`)

# Check for missing values in the dataset
any(is.na(bank_data))

# Sort by net income to total assets for inspection
bank_data <- arrange(bank_data, `Net Income to Total Assets`)
bank_data


| Bankrupt? | Net Income to Total Assets | Debt ratio % | Net Value Growth Rate | Interest Coverage Ratio (Interest expense to EBIT) | Current Ratio |
|---|---|---|---|---|---|
| `<fct>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 0 | 0.0000000 | 0.1775856 | 0.0002298299 | 0.5651584 | 0.0038982718 |
| 0 | 0.2247920 | 0.1832996 | 0.0002339473 | 0.5651523 | 0.0068186995 |
| 1 | 0.4118092 | 0.3318923 | 0.0001670197 | 0.5650981 | 0.0007194985 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 0 | 0.9813151 | 0.07437192 | 0.0013438933 | 0.5651957 | 0.02005423 |
| 0 | 0.9828793 | 0.02551622 | 0.0008258403 | 0.5651601 | 0.04095588 |
| 0 | 1.0000000 | 0.02950691 | 0.0006389070 | 0.5651640 | 0.03274311 |


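**We removed extreme or invalid entries using thresholds on the key financial indicators: companies with a net income to total assets of exactly zero, debt ratios outside the range 0 to 100, net value growth rates outside the range -1 to 5, and interest coverage or current ratios that are not positive or that exceed their upper bounds (100 and 50, respectively). The same filter also keeps only the rows with `Bankrupt?` equal to 1, so the resulting table contains bankrupt companies only.**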

In [33]:
# Remove companies with invalid or extreme values in the selected ratios
bank_data <- bank_data |>
  filter(`Net Income to Total Assets` != 0,
         `Debt ratio %` > 0, `Debt ratio %` < 100,
         `Net Value Growth Rate` > -1, `Net Value Growth Rate` < 5,
         `Interest Coverage Ratio (Interest expense to EBIT)` > 0,
         `Interest Coverage Ratio (Interest expense to EBIT)` < 100,
         `Current Ratio` > 0, `Current Ratio` < 50,
         # Note: this condition keeps only the companies labeled bankrupt
         `Bankrupt?` == 1)

bank_data

| Bankrupt? | Net Income to Total Assets | Debt ratio % | Net Value Growth Rate | Interest Coverage Ratio (Interest expense to EBIT) | Current Ratio |
|---|---|---|---|---|---|
| `<fct>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 1 | 0.4118092 | 0.3318923 | 0.0001670197 | 0.5650981 | 0.0007194985 |
| 1 | 0.4209955 | 0.2437041 | 0.0002269851 | 0.5650899 | 0.0060361746 |
| 1 | 0.4237553 | 0.0704417 | 0.0002974562 | 0.5651570 | 0.0192873033 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 1 | 0.813856 | 0.2314902 | 0.0004797730 | 0.5653242 | 0.007342879 |
| 1 | 0.818788 | 0.2199716 | 0.0005930908 | 0.5666527 | 0.006209029 |
| 1 | 0.819091 | 0.1562415 | 0.0004941217 | 0.5658072 | 0.005926347 |
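
KNN can only vote between classes that are present in the training data, so it may be worth checking the class counts after any filtering step. The sketch below uses a small made-up data frame and base R to show how a condition on the label column changes those counts. The names `toy`, `bankrupt`, and `debt_ratio` and all values are hypothetical, not the project data.

```r
# Hypothetical toy data (not the project data): six firms, two of them bankrupt
toy <- data.frame(
  bankrupt   = factor(c(1, 0, 0, 1, 0, 0), levels = c(0, 1)),
  debt_ratio = c(0.33, 0.18, 0.07, 0.24, 0.03, 0.02)
)

# Filtering on a predictor alone keeps both classes
valid_only   <- subset(toy, debt_ratio > 0 & debt_ratio < 100)
counts_valid <- table(valid_only$bankrupt)

# Adding a condition on the label leaves a single class with non-zero count
label_filtered <- subset(valid_only, bankrupt == 1)
counts_label   <- table(label_filtered$bankrupt)

print(counts_valid)
print(counts_label)
```

The same check on the project data would call `table()` on the `Bankrupt?` column of `bank_data` before splitting into training and testing sets.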
