<div style="color:white;
           display:fill;
           border-radius:25px;
           background-color:Blue;
           font-size:210%;
           font-family:Verdana;
           letter-spacing:0.5px">

<p style="padding: 10px;
          color:white;
          text-align:center;"
          >
       WELCOME TO MY NOTEBOOK
</p>
</div>

<div style="color:white;
           display:fill;
           border-radius:25px;
           background-color:red;
           font-size:160%;
           font-family:Verdana;
           letter-spacing:0.5px">

<p style="padding: 10px;
          color:white;
          text-align:left;"
          >
      About Dataset: Company Bankruptcy Prediction
</p>
</div>

> Data was gathered from the Taiwan Economic Journal spanning the period from 1999 to 2009. The criteria for identifying company bankruptcy were established according to the regulations set forth by the Taiwan Stock Exchange.

![](https://images.wsj.net/im-749712?width=700&height=466)

> This dataset contains 95 features and one target variable, `Bankrupt?` (0: No, 1: Yes).

In [None]:
conda install "numpy>=1.16.5,<1.23.0"

In [None]:
# Import all the necessary libraries
import pandas as pd
import numpy as np
import seaborn as sns
import plotly.express as px
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree
from sklearn.decomposition import PCA
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import VotingClassifier
from sklearn.preprocessing import RobustScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score,roc_auc_score,precision_score, recall_score, f1_score,ConfusionMatrixDisplay,classification_report


import warnings 
warnings.filterwarnings("ignore")

In [None]:
# Read the dataset
dataframe=pd.read_csv("/kaggle/input/company-bankruptcy-prediction/data.csv")
dataframe.head()

In [None]:
# check the shape of the dataset
dataframe.shape

In [None]:
# check the datatype of each column
dataframe.info()

In [None]:
# Describe the dataset
dataframe.describe()

In [None]:
# Check whether there are any null values in the dataset
dataframe.isna().sum()

In [None]:
# Check whether there are any duplicate rows in the dataset
dataframe.duplicated().sum()

> There are no duplicate rows in the dataset.

<div style="color:white;
           display:fill;
           border-radius:25px;
           background-color:blue;
           font-size:160%;
           font-family:Verdana;
           letter-spacing:0.5px">

<p style="padding: 10px;
          color:white;
          text-align:left;"
          >
      Let's check whether the dataset is balanced
</p>
</div>

In [None]:
dataframe["Bankrupt?"].value_counts()

<div style="color:white;
           display:fill;
           border-radius:25px;
           background-color:blue;
           font-size:160%;
           font-family:Verdana;
           letter-spacing:0.5px">
<p style="padding: 10px;
          color:white;
          text-align:left;"
          >
      Exploratory Data Analysis
</p>
</div>

In [None]:
plt.figure(figsize=(5,5))
sns.countplot(x=dataframe["Bankrupt?"], palette="dark")
plt.title("Data Distribution of Bankrupt?")
plt.show()

> Here we can see that most of the data belongs to class 0, which means the dataset is imbalanced.

<div style="color:white;
           display:fill;
           border-radius:25px;
           background-color:blue;
           font-size:160%;
           font-family:Verdana;
           letter-spacing:0.5px">
<p style="padding: 10px;
          color:white;
          text-align:left;"
          >
      Let's calculate the imbalance ratio: the number of samples in the majority class divided by the number of samples in the minority class
</p>
</div>

In [None]:
majority_class_samples=dataframe["Bankrupt?"].loc[dataframe["Bankrupt?"]==0]
minority_class_samples=dataframe["Bankrupt?"].loc[dataframe["Bankrupt?"]==1]
Imbalance_Ratio= len(majority_class_samples)/len(minority_class_samples)
print(f"Imbalance Ratio is:{Imbalance_Ratio}")
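
> As a quick cross-check, the same ratio can be read off `value_counts()`. The sketch below uses a small made-up label series; the name `toy_labels` and its class counts are illustrative assumptions, not values from the dataset.

```python
import pandas as pd

# Toy labels for illustration only: 8 non-bankrupt (0) and 2 bankrupt (1) rows.
toy_labels = pd.Series([0] * 8 + [1] * 2, name="Bankrupt?")

counts = toy_labels.value_counts()                  # number of rows per class
toy_imbalance_ratio = counts.max() / counts.min()   # majority count / minority count

print(f"Imbalance ratio: {toy_imbalance_ratio}")    # 8 / 2 = 4.0
```

> A ratio of 1 would mean perfectly balanced classes; the larger the ratio, the stronger the imbalance.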

In [None]:
dataframe.hist(figsize=(40,40), edgecolor='white')
plt.show()

<div style="color:white;
           display:fill;
           border-radius:25px;
           background-color:blue;
           font-size:160%;
           font-family:Verdana;
           letter-spacing:0.5px">
<p style="padding: 10px;
          color:white;
          text-align:left;"
          >
      Getting Numerical and Categorical Columns
</p>
</div>

In [None]:
def get_num_cat_columns(dataframe):
    categorical_cols=dataframe.select_dtypes(include="object").columns
    numerical_cols=dataframe.select_dtypes(exclude="object").columns
    
    return categorical_cols, numerical_cols

<div style="color:white;
           display:fill;
           border-radius:25px;
           background-color:blue;
           font-size:160%;
           font-family:Verdana;
           letter-spacing:0.5px">
<p style="padding: 10px;
          color:white;
          text-align:left;"
          >
       Divide the dataset into training and testing sets
</p>
</div>

In [None]:
def train_test_split_data(dataframe,target,test_size, random_state):
    x_train,x_test, y_train, y_test= train_test_split(dataframe.drop([target], axis=1),
                                                      dataframe[target],
                                                      test_size=test_size,
                                                      random_state=random_state,
                                                      stratify=dataframe[target]
                                                      )
    
    return x_train,x_test, y_train, y_test

In [None]:
x_train,x_test, y_train, y_test= train_test_split_data(dataframe,target="Bankrupt?",test_size=0.3, random_state=42)

In [None]:
x_train.shape,x_test.shape, y_train.shape, y_test.shape
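
> The `stratify` argument used above keeps the class proportions the same in both splits, which matters when the positive class is rare. A minimal sketch on synthetic labels (the arrays `toy_X` and `toy_y` are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels for illustration: 90% class 0, 10% class 1.
toy_y = np.array([0] * 90 + [1] * 10)
toy_X = np.arange(100).reshape(-1, 1)

_, _, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.3, random_state=42, stratify=toy_y
)

# With stratify, both splits keep the 10% minority share.
print(toy_y_train.mean(), toy_y_test.mean())
```

> Without `stratify`, a random split of a heavily imbalanced dataset can leave the test set with very few positive cases.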

In [None]:
categorical_cols, numerical_cols= get_num_cat_columns(x_train)


<div style="color:white;
           display:fill;
           border-radius:25px;
           background-color:blue;
           font-size:160%;
           font-family:Verdana;
           letter-spacing:0.5px">
<p style="padding: 10px;
          color:white;
          text-align:left;"
          >
      Let's detect the outliers in the dataset and remove them
</p>
</div>

In [None]:
def Winsorization_Method(columns, x_train, y_train , a, b):
    # Note: despite its name, this function trims (drops) outlier rows
    # rather than winsorizing (clipping) their values.
    outlier_mask = np.zeros(len(x_train), dtype=bool)

    for col in columns:
        lower = np.percentile(x_train[col], a)    # a-th percentile (lower bound)
        upper = np.percentile(x_train[col], b)    # b-th percentile (upper bound)

        # flag rows whose value in this column falls outside [lower, upper]
        outlier_mask |= ((x_train[col] < lower) | (x_train[col] > upper)).to_numpy()

    ratio = round(outlier_mask.sum() / len(x_train) * 100, 2)   # percentage of outlier rows
    x_train = x_train[~outlier_mask]              # remove the outliers from the training dataset
    y_train = y_train[~outlier_mask]

    return ratio, x_train, y_train

In [None]:
ratio, x_train, y_train= Winsorization_Method(numerical_cols, x_train, y_train,a=0.3,b=99.7)

In [None]:
print(f"Percentage of outlier rows removed from the training data: {ratio}%")

In [None]:
# shape of data after removing the outliers in the training data
x_train.shape, y_train.shape
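
> The function above drops every row that has a value outside the percentile bounds, which is usually called trimming. Winsorization in the strict sense clips extreme values to the bounds and keeps every row. A minimal sketch of that alternative, assuming a hypothetical helper `winsorize_frame` and a made-up one-column frame:

```python
import pandas as pd

def winsorize_frame(frame, lower_pct, upper_pct):
    """Clip each column to its [lower_pct, upper_pct] percentile range."""
    lower = frame.quantile(lower_pct / 100)   # per-column lower bounds
    upper = frame.quantile(upper_pct / 100)   # per-column upper bounds
    return frame.clip(lower=lower, upper=upper, axis=1)

# Toy data for illustration: one extreme value at the end.
toy = pd.DataFrame({"ratio_a": [1.0, 2.0, 3.0, 4.0, 100.0]})
clipped = winsorize_frame(toy, 10, 90)

print(len(clipped) == len(toy))   # no rows are dropped
print(clipped)
```

> Clipping keeps the sample size intact, which can be useful when the minority class is small and dropping rows would remove scarce positive cases.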

<div style="color:white;
           display:fill;
           border-radius:25px;
           background-color:blue;
           font-size:160%;
           font-family:Verdana;
           letter-spacing:0.5px">
<p style="padding: 10px;
          color:white;
          text-align:left;"
          >
      Data Preprocessing
</p>
</div>

In [None]:
scaler= RobustScaler()
x_train_processed= scaler.fit_transform(x_train)
x_test_processed=scaler.transform(x_test)
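
> With its default settings, `RobustScaler` subtracts each column's median and divides by its interquartile range, so a few extreme values barely affect the scaling. The sketch below checks this on a made-up column (`toy_col` is illustrative only):

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

toy_col = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])   # one extreme value

scaled = RobustScaler().fit_transform(toy_col)

# Manual equivalent: subtract the median, divide by the interquartile range.
median = np.median(toy_col, axis=0)
q1, q3 = np.percentile(toy_col, [25, 75], axis=0)
manual = (toy_col - median) / (q3 - q1)

print(np.allclose(scaled, manual))
```

> As in the cell above, the scaler is fitted on the training data only and then reused to transform the test data.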

<div style="color:white;
           display:fill;
           border-radius:25px;
           background-color:blue;
           font-size:160%;
           font-family:Verdana;
           letter-spacing:0.5px">
<p style="padding: 10px;
          color:white;
          text-align:left;"
          >
       Perform Principal Component Analysis (PCA) to reduce the features to the components that capture most of the variance
</p>
</div>

In [None]:
pca= PCA(n_components=70)                     
x_train_pca= pca.fit_transform(x_train_processed)
x_test_pca=pca.transform(x_test_processed)

In [None]:
x_train_pca.shape, y_train.shape, x_test_pca.shape, y_test.shape

In [None]:
# Number of principal components kept
print(f"No. of Components Used:{pca.n_components_}")

In [None]:
# The amount of variance explained by each of the selected components. 
print(f"Variance:{pca.explained_variance_}")

In [None]:
# Percentage of variance explained by each of the selected components.
print(f"Variance_Ratio:{pca.explained_variance_ratio_}")
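
> One way to choose the number of components is to look at the cumulative explained variance ratio and keep the smallest number that reaches a chosen threshold. The sketch below does this on synthetic data; the 95% threshold and the array `toy_X` are assumptions for illustration, not values taken from this notebook.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic data: 10 columns built from only 3 underlying signals plus small noise.
signals = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 10))
toy_X = signals @ mixing + 0.01 * rng.normal(size=(200, 10))

pca_full = PCA().fit(toy_X)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative variance ratio reaches 95%.
n_needed = int(np.searchsorted(cumulative, 0.95) + 1)
print(n_needed, cumulative[:n_needed])
```

> scikit-learn can also do this directly: passing a float between 0 and 1, such as `PCA(n_components=0.95)`, keeps as many components as needed to reach that share of the variance.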

<div style="color:white;
           display:fill;
           border-radius:25px;
           background-color:blue;
           font-size:160%;
           font-family:Verdana;
           letter-spacing:0.5px">
<p style="padding: 10px;
          color:white;
          text-align:left;"
          >
       Let's balance the dataset using the SMOTE oversampling technique
</p>
</div>

> SMOTE should only be used to augment training data. Your test dataset should remain untouched. Applying SMOTE to the entire dataset will result in data leakage.

In [None]:
smote= SMOTE(sampling_strategy='minority', random_state=43)
x_train_smote, y_train_smote= smote.fit_resample(x_train_pca, y_train)
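
> SMOTE creates new minority samples by interpolating between an existing minority point and one of its nearest minority neighbours. The NumPy sketch below illustrates only that interpolation step; `smote_like_samples` is a hypothetical helper written for this explanation, not the imbalanced-learn implementation.

```python
import numpy as np

def smote_like_samples(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    # Pairwise distances between minority points (brute force, fine for a toy set).
    dists = np.linalg.norm(minority[:, None, :] - minority[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)                 # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]   # k nearest neighbours per point

    base = rng.integers(0, len(minority), size=n_new)           # random base points
    picked = neighbours[base, rng.integers(0, k, size=n_new)]   # one neighbour each
    gap = rng.random((n_new, 1))                                # position along the segment
    return minority[base] + gap * (minority[picked] - minority[base])

# Toy minority class: the four corners of the unit square.
toy_minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like_samples(toy_minority, n_new=6)
print(synthetic.shape)   # (6, 2)
```

> Because every synthetic point lies on a segment between two real minority points, the new samples stay inside the region already covered by the minority class.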

<div style="color:white;
           display:fill;
           border-radius:25px;
           background-color:blue;
           font-size:160%;
           font-family:Verdana;
           letter-spacing:0.5px">
<p style="padding: 10px;
          color:white;
          text-align:left;"
          >
        Let's perform the modelling
</p>
</div>

In [None]:
def modelling(x_train, x_test, y_train, y_test):
    
    cv_result = []
    best_estimators = []
    recall_scores = []
    precision_scores = []
    f1_scores = []
    
    
    classifiers = [DecisionTreeClassifier(),
             RandomForestClassifier(),
             LogisticRegression(random_state=0),
             GradientBoostingClassifier(),
             ]

    dt_param_grid = {"min_samples_split" : range(10,500,20),
                     "max_depth": range(1,20,2)}


    rf_param_grid = {"max_features": [1,3,10],
                     "min_samples_split":[2,3,10],
                     "min_samples_leaf":[1,3,10],
                     "n_estimators":[100,300],
                     "criterion":["gini"]}
    

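    # Note: "l1" is not supported by the default lbfgs solver, so GridSearchCV
    # scores those fits as NaN; penalty=None requires scikit-learn >= 1.2.
    logreg_param_grid = {"C":np.logspace(-4, 4, 20),
                         "penalty": ["l1","l2",None],
                         "max_iter":[1000]}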


    gbc_param_grid = {
                      "learning_rate": [0.05, 0.1, 0.2],
                      "min_samples_split": [2,3,10],
                      "min_samples_leaf": [1,3,10]
                      }


    classifier_parameters = [dt_param_grid,
                            rf_param_grid,
                            logreg_param_grid,
                            gbc_param_grid,
                             ]
    
    for i in range(len(classifiers)):
        model= GridSearchCV(classifiers[i], classifier_parameters[i], cv=5, scoring ="accuracy", n_jobs = -1)
        model.fit(x_train, y_train)
        y_pred= model.predict(x_test)
        
        cv_result.append(model.best_score_)
        # scikit-learn metrics expect (y_true, y_pred) in that order
        recall_scores.append(recall_score(y_test, y_pred))
        precision_scores.append(precision_score(y_test, y_pred))
        f1_scores.append(f1_score(y_test, y_pred))
        best_estimators.append(model.best_estimator_)
        
        
        print(f"Model:{classifiers[i]}")
        print(f"CV Accuracy:{round(cv_result[i]*100,2)}")
        print(f"Recall:{recall_scores[i]}")
        print(f"Precision:{precision_scores[i]}")
        print(f"F1-Score:{f1_scores[i]}")
        print(f"Best Estimator:{model.best_estimator_}")
        print("Classification Report")
        print("---------------------")
        print(classification_report(y_test,y_pred,digits=3))
        print("Confusion_Matrix")
        print("---------------------")
        ConfusionMatrixDisplay.from_predictions(y_test, y_pred)
        plt.show()
    
        print("---------------------------------------------------------------------------------------------------------------")

                         
                         
    model_names = ['DecisionTreeClassifier','RandomForestClassifier','LogisticRegression','GradientBoostingClassifier']
    result_df = pd.DataFrame({'Recall':recall_scores, 'Precision':precision_scores, 'F1_Score':f1_scores, 'Accuracy': cv_result,},index=model_names)
    result_df=result_df.sort_values(by="Accuracy", ascending=False)
    return result_df

In [None]:
result_df= modelling(x_train_smote, x_test_pca, y_train_smote, y_test)

In [None]:
result_df
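
> When reading the recall and precision columns, keep in mind that scikit-learn's metric functions expect the true labels first and the predictions second. The toy example below (labels made up for illustration) shows that swapping the two arguments swaps recall and precision:

```python
from sklearn.metrics import precision_score, recall_score

# Toy labels: 3 actual positives, of which the model finds 1, plus 1 false alarm.
toy_true = [1, 1, 1, 0, 0, 0, 0, 0]
toy_pred = [1, 0, 0, 0, 0, 0, 0, 1]

recall = recall_score(toy_true, toy_pred)         # 1 of 3 positives found
precision = precision_score(toy_true, toy_pred)   # 1 of 2 positive predictions correct

# Passing the arguments in the wrong order returns the other metric.
swapped_recall = recall_score(toy_pred, toy_true)

print(recall, precision, swapped_recall)
```

> For a rare class such as bankruptcy, recall (how many bankrupt companies are caught) and precision (how many alarms are correct) usually say more than accuracy alone.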

<div style="color:white;
           display:fill;
           border-radius:25px;
           background-color:blue;
           font-size:160%;
           font-family:Verdana;
           letter-spacing:0.5px">
<p style="padding: 10px;
          color:white;
          text-align:left;"
          >
        Result Comparison and Visualization
</p>
</div>

In [None]:
result_df.plot(kind="barh", figsize=(10, 7), grid=True).legend(bbox_to_anchor=(1.2,1));

<div style="color:white;
           display:fill;
           border-radius:25px;
           background-color:red;
           font-size:160%;
           font-family:Verdana;
           letter-spacing:0.5px">
<p style="padding: 10px;
          color:white;
          text-align:left;"
          >
        Conclusion
</p>
</div>
    
  

> Here we notice that the Random Forest Classifier performs best, with a cross-validated accuracy of 99.67% on the SMOTE-resampled training data, compared to the other classifiers.

😊Thank you for taking the time to visit my notebook! Your support means a lot to me. If you found my content interesting or useful, I kindly ask for your upvote. It encourages me to keep sharing valuable insights. Gratitude for being a part of this journey!🌻


![](https://thumbs.gfycat.com/CourteousDesertedDiamondbackrattlesnake-size_restricted.gif)

