<a href="https://colab.research.google.com/github/Vagueken/Company-Bankruptcy-Prediction-Classification-/blob/main/Company_Bankruptcy_Prediction_(Classification).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - Company Bankruptcy Prediction



##### **Project Type**    - Classification
##### **By -** Karan Malhotra

# **Project Summary -**

The Bank of England has warned that the number of companies declaring bankruptcy is on the rise, with more than 1,000 companies in the UK facing the threat of going bust. According to a report published by the Bank of England, bankruptcy is one of the biggest challenges facing the global financial sector. The bankruptcy of companies and enterprises affects the financial market on multiple fronts, so the need to predict bankruptcy by monitoring multiple variables takes on added significance. Bankruptcy prediction is of increasing interest to firms that stand to lose money because of unpaid debts, and a better understanding of bankruptcy, together with the ability to predict it, will affect the profitability of lending institutions worldwide. Because computers can store huge data sets pertaining to bankruptcy, making accurate predictions from them ahead of time is becoming important.

The history of bankruptcy prediction includes the application of numerous statistical tools as they gradually became available, along with a deepening appreciation of the pitfalls in early analyses. Research that suffers from pitfalls understood for many years is still being published. Bankruptcy prediction has been a subject of formal analysis since at least 1932, when FitzPatrick published in The Certified Public Accountant a study of 20 pairs of firms, one failed and one surviving, matched by date, size and industry. He did not perform statistical analysis as is now common, but he thoughtfully interpreted the ratios and the trends in the ratios. His interpretation was effectively a complex, multiple-variable analysis. The latest research in the field of Bankruptcy and Insolvency Prediction compares differing approaches, modelling techniques, and individual models to ascertain whether any one technique is superior to its counterparts.

Bankruptcy is the inability of a company to pay its debts to its creditors. The bankruptcy of a company, and even the possibility of it, matters to the company's investors and to society. Bankruptcy should therefore be predicted before it happens, and appropriate models should be built for that purpose. With that in mind, we built several supervised machine learning classification algorithms that predict whether a company will go bankrupt from its financial statements and financial ratios, and we compared the models in order to identify those better suited to the task.

The data were collected from the Taiwan Economic Journal for the years 1999 to 2009, and company bankruptcy was defined based on the business regulations of the Taiwan Stock Exchange. The data contain more than 6,800 companies. Each company is labelled 1 (bankrupt) or 0 (not bankrupt), and we try to predict this label from 95 financial ratios, i.e. the 95 features in the dataset. Our goal is to use these features to obtain clearer information about the future and legitimacy of the companies.


# **GitHub Link -**

https://github.com/Vagueken/Company-Bankruptcy-Prediction-Classification-

# **Problem Statement**


# Business Context

Prediction of bankruptcy is a phenomenon of increasing interest to firms who stand to lose money because of unpaid debts. Since computers can store huge data sets pertaining to bankruptcy, making accurate predictions from them beforehand is becoming important.
The data were collected from the Taiwan Economic Journal for the years 1999 to 2009.
Company bankruptcy was defined based on the business regulations of the Taiwan Stock Exchange. In this project you will use various classification algorithms on the bankruptcy dataset to predict bankruptcies with satisfactory accuracy long before the actual event.

# Main Libraries used:

• Pandas for data manipulation, aggregation

• Matplotlib and Seaborn for visualization and behavior with respect to the target variable

• NumPy for computationally efficient operations

• Scikit Learn for model training, model optimization, and metrics calculation

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required. 
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits. 
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that the following questions are answered for each chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ML algorithms for model creation. Make sure that the following points are answered for each algorithm.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from random import randint
import math

import plotly.express as px
from matplotlib.pyplot import figure, savefig, show, subplots, Axes, title
from scipy.stats import norm
import plotly.graph_objects as go
import plotly.figure_factory as ff

from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
from plotly.subplots import make_subplots
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

from sklearn.metrics import (
    confusion_matrix, precision_recall_curve, auc, roc_curve, accuracy_score,
    recall_score, classification_report, f1_score, average_precision_score,
    precision_recall_fscore_support, roc_auc_score,
)
from sklearn.utils import class_weight
from sklearn.feature_selection import RFE

from sklearn.pipeline import Pipeline
from sklearn import preprocessing
from imblearn.over_sampling import SMOTE
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.isotonic import IsotonicRegression

import warnings

#Ignore warnings
warnings.filterwarnings(action='ignore')



### Dataset Loading

In [None]:
# Load Dataset

#mounting drive
from google.colab import drive
drive.mount('/content/drive')

path = "/content/drive/MyDrive/AlmaBetter DS/Capstone Project/(Supervised ML - Classification) Company Bankruptcy Prediction/COMPANY BANKRUPTCY PREDICTION.csv"
Bankrupty_data = pd.read_csv(path)

In [None]:
#increasing no. of columns to display

pd.set_option('display.max_columns', None)
     

### Dataset First View

In [None]:
# Dataset First Look

Bankrupty_data.head()

In [None]:
Bankrupty_data.tail()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count


print("The dataset has" ,Bankrupty_data.shape[0],"rows")
print("\n")
print("The dataset has" ,Bankrupty_data.shape[1],"columns")

### Dataset Information

In [None]:
# Dataset Info

Bankrupty_data.info()

#### Duplicate Values

In [None]:
# Dataset Missing Value and Duplicate Row Count

# get the number of missing data points per column (only columns with missing values are printed)
missing_value_count = Bankrupty_data.isnull().sum()
print(missing_value_count[missing_value_count > 0])

# percent of data that is missing
# np.prod is used because np.product was removed in NumPy 2.0
total_cells = np.prod(Bankrupty_data.shape)
total_missing_value = missing_value_count.sum()
print(total_missing_value / total_cells * 100)
print('Total number of cells is :',total_cells)
print('Total number of missing values is :',total_missing_value)
print('Total number of duplicate rows is :',Bankrupty_data.duplicated().sum())


#### Missing Values/Null Values

In [None]:
# cross-check with isnull(): sum over all columns to get the total number of null cells

print('Total number of null values is :',Bankrupty_data.isnull().sum().sum())
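
The checks above can be collected into one small helper. This is a minimal sketch rather than part of the original notebook: the function name `data_quality_report` and the toy frame are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def data_quality_report(frame: pd.DataFrame) -> dict:
    """Return basic counts of cells, missing cells, and duplicated rows."""
    total_cells = int(frame.size)                    # rows * columns
    missing_cells = int(frame.isnull().sum().sum())  # summed over all columns
    return {
        "rows": frame.shape[0],
        "columns": frame.shape[1],
        "total_cells": total_cells,
        "missing_cells": missing_cells,
        "missing_pct": 100 * missing_cells / total_cells if total_cells else 0.0,
        "duplicate_rows": int(frame.duplicated().sum()),
    }


# Toy frame (hypothetical values): one missing cell and one duplicated row.
toy = pd.DataFrame({"a": [1.0, 2.0, 2.0, np.nan], "b": [5, 6, 6, 8]})
report = data_quality_report(toy)
print(report)
```

Calling `data_quality_report(Bankrupty_data)` would return the same figures for the real dataset in a single dictionary.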

In [None]:
# Visualizing the missing values

import missingno as msno 

msno.bar(Bankrupty_data)


In [None]:
# Analyzing the target variable (Class: 0 = Not Bankrupt, 1 = Bankrupt)

# Checking the label distribution

sns.set_theme(context = 'talk', style='darkgrid', palette='deep', font='sans-serif', font_scale = 0.8, rc={"grid.linewidth": 4})

plt.figure(figsize = (9,6))
# pass the column as x explicitly: seaborn 0.12 and later take `data` as the first positional argument
sns.countplot(x = Bankrupty_data['Bankrupt?'])
plt.title('Class Distributions \n (0: Not bankrupt || 1: Went bankrupt)', fontsize=16)
plt.show()

### What did you know about your dataset?

 1. Fortunately, the dataset has no missing values: inspecting it with isnull() and with missingno gave the same result. The visualization shows 6819 non-null values in every column.

 2. The dataset has 6819 rows and 96 columns. Leaving out the 'Bankrupt?' column, which is the target to predict (the y of the data), there are 95 feature dimensions.

3. The 'Bankrupt?' column contains the labels '1' and '0': '1' marks a bankrupt company and '0' a company that is not bankrupt, which makes this a binary classification problem. The ratio of bankrupt to non-bankrupt companies is about 1:30.

4. Missing value analysis: there are no missing values in the 95 feature columns, so the features are relatively complete and cause little interference with model prediction. Instead of removing any of the original features, we rely on automatic model selection, since the model can assign weights to the features automatically.
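
The 1:30 ratio quoted in point 3 follows directly from the class counts reported in this notebook (6599 non-bankrupt and 220 bankrupt companies out of 6819 rows). A short arithmetic check, with variable names chosen only for illustration:

```python
# Class counts reported in this notebook (6819 rows in total).
n_not_bankrupt = 6599
n_bankrupt = 220
n_total = n_not_bankrupt + n_bankrupt

# Share of each class as a percentage of all companies.
pct_bankrupt = 100 * n_bankrupt / n_total
pct_not_bankrupt = 100 * n_not_bankrupt / n_total

# Number of non-bankrupt companies per bankrupt company.
imbalance_ratio = n_not_bankrupt / n_bankrupt

print(f"Bankrupt: {pct_bankrupt:.1f}%  Not bankrupt: {pct_not_bankrupt:.1f}%")
print(f"Roughly 1 bankrupt company per {imbalance_ratio:.0f} non-bankrupt companies")
```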




## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns

Bankrupty_data.columns

In [None]:
# Dataset Describe

Bankrupty_data.describe()

### Variables Description 


• Bankrupt?: Class label 1: Yes, 0: No

• ROA(C) before interest and depreciation before interest: Return On Total Assets(C)

• ROA(A) before interest and % after tax: Return On Total Assets(A)

• ROA (B) before interest and depreciation after tax: Return On Total Assets(B)

• Operating Gross Margin: Gross Profit/Net Sales

• Realized Sales Gross Margin: Realized Gross Profit/Net Sales

• Operating Profit Rate: Operating Income/Net Sales

• Pre-tax net Interest Rate: Pre-Tax Income/Net Sales

• After-tax net Interest Rate: Net Income/Net Sales

• Non-industry income and expenditure/revenue: Net Non-operating Income Ratio

• Continuous interest rate (after tax): Net Income-Exclude Disposal Gain or Loss/Net Sales

• Operating Expense Rate: Operating Expenses/Net Sales

• Research and development expense rate: (Research and Development Expenses)/Net Sales

• Cash flow rate: Cash Flow from Operating/Current Liabilities

• Interest-bearing debt interest rate: Interest-bearing Debt/Equity

• Tax rate (A): Effective Tax Rate

• Net Value Per Share (B): Book Value Per Share(B)

• Net Value Per Share (A): Book Value Per Share(A)

• Net Value Per Share (C): Book Value Per Share(C)

• Persistent EPS in the Last Four Seasons: EPS-Net Income

• Cash Flow Per Share

• Revenue Per Share (Yuan ¥): Sales Per Share

• Operating Profit Per Share (Yuan ¥): Operating Income Per Share

• Per Share Net profit before tax (Yuan ¥): Pretax Income Per Share

• Realized Sales Gross Profit Growth Rate

• Operating Profit Growth Rate: Operating Income Growth

• After-tax Net Profit Growth Rate: Net Income Growth

• Regular Net Profit Growth Rate: Continuing Operating Income after Tax Growth

• Continuous Net Profit Growth Rate: Net Income-Excluding Disposal Gain or Loss Growth

• Total Asset Growth Rate: Total Asset Growth

• Net Value Growth Rate: Total Equity Growth

• Total Asset Return Growth Rate Ratio: Return on Total Asset Growth

• Cash Reinvestment %: Cash Reinvestment Ratio

• Current Ratio

• Quick Ratio: Acid Test

• Interest Expense Ratio: Interest Expenses/Total Revenue

• Total debt/Total net worth: Total Liability/Equity Ratio

• Debt ratio %: Liability/Total Assets

• Net worth/Assets: Equity/Total Assets

• Long-term fund suitability ratio (A): (Long-term Liability+Equity)/ Fixed Assets

• Borrowing dependency: Cost of Interest-bearing Debt

• Contingent liabilities/Net worth: Contingent Liability/Equity

• Operating profit/Paid-in capital: Operating Income/Capital

• Net profit before tax/Paid-in capital: Pretax Income/Capital

• Inventory and accounts receivable/Net value: (Inventory+Accounts Receivables)/Equity

• Total Asset Turnover

• Accounts Receivable Turnover

• Average Collection Days: Days Receivable Outstanding

• Inventory Turnover Rate (times)

• Fixed Assets Turnover Frequency

• Net Worth Turnover Rate (times): Equity Turnover

• Revenue per person: Sales Per Employee

• Operating profit per person: Operation Income Per Employee

• Allocation rate per person: Fixed Assets Per Employee

• Working Capital to Total Assets

• Quick Assets/Total Assets

• Current Assets/Total Assets

• Cash/Total Assets

• Quick Assets/Current Liability

• Cash/Current Liability

• Current Liability to Assets

• Operating Funds to Liability

• Inventory/Working Capital

• Inventory/Current Liability

• Current Liabilities/Liability

• Working Capital/Equity

• Current Liabilities/Equity

• Long-term Liability to Current Assets

• Retained Earnings to Total Assets

• Total income/Total expense

• Total expense/Assets

• Current Asset Turnover Rate: Current Assets to Sales

• Quick Asset Turnover Rate: Quick Assets to Sales

• Working capital Turnover Rate: Working Capital to Sales

• Cash Turnover Rate: Cash to Sales

• Cash Flow to Sales

• Fixed Assets to Assets

• Current Liability to Liability

• Current Liability to Equity

• Equity to Long-term Liability

• Cash Flow to Total Assets

• Cash Flow to Liability

• CFO to Assets

• Cash Flow to Equity

• Current Liability to Current Assets

• Liability-Assets Flag: 1 if Total Liability exceeds Total Assets, 0 otherwise

• Net Income to Total Assets

• Total assets to GNP price

• No-credit Interval

• Gross Profit to Sales

• Net Income to Stockholder's Equity

• Liability to Equity

• Degree of Financial Leverage (DFL)

• Interest Coverage Ratio (Interest expense to EBIT)

• Net Income Flag: 1 if Net Income is Negative for the last two years, 0 otherwise

• Equity to Liability

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.

{column: len(Bankrupty_data[column].unique()) for column in Bankrupty_data.columns}

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
# Create a copy of the current dataset and assigning to df

df=Bankrupty_data.copy()

# Checking the number of truly bankrupt companies

print("No. of Bankrupt companies -",len(df[df['Bankrupt?']==1]))

# Assigning True Bankrupt Companies to variable df_unstable

df_unstable = df[(df['Bankrupt?']==1)]




In [None]:
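print(df['Bankrupt?'].value_counts())

# Take labels and values from the same Series so that they always stay aligned
class_counts = df['Bankrupt?'].value_counts().sort_values(ascending = True)
labels = class_counts.index
values = class_counts.values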

fig = go.Figure(data = [
    go.Pie(
    labels = labels,
    values = values,
    hole = .5)
])

fig.update_layout(title_text = "Bankrupt Distribution")
fig.show()

In [None]:
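# Take a sample to balance the data
# Class 0 is "not bankrupt" and class 1 is "bankrupt".
# Note: this keeps the first 220 non-bankrupt rows, not a random sample.

non_bankrupt_sample = df[df['Bankrupt?'] == 0][0:220]
bankrupt_sample = df[df['Bankrupt?'] == 1]

# create the new data frame

new_df = pd.concat([non_bankrupt_sample,bankrupt_sample],axis = 0)
new_df.head()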

In [None]:
print("The new shape for our data",new_df.shape)

# preview 'Bankrupt' column

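print(new_df['Bankrupt?'].value_counts())

# Take labels and values from the same Series so that they always stay aligned
new_class_counts = new_df['Bankrupt?'].value_counts().sort_values(ascending = True)
labels = new_class_counts.index
values = new_class_counts.values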

fig = go.Figure(data = [
    go.Pie(
    labels = labels,
    values = values,
    hole = .5)
])

fig.update_layout(title_text = "Bankrupt after update Distribution")
fig.show()



# Plot Balanced Bankrupt data

In [None]:
# declare feature and target variable

X = new_df.drop('Bankrupt?', axis = 1)
y = new_df['Bankrupt?']

In [None]:
numeric_features = df.dtypes[df.dtypes != 'int64'].index
categorical_features = df.dtypes[df.dtypes == 'int64'].index

df[categorical_features].columns.tolist()

### What all manipulations have you done and insights you found?

1. There is a huge difference between the numbers of bankrupt and non-bankrupt companies: 96.8% of companies are non-bankrupt and 3.2% are bankrupt. We must therefore balance the data in order to build a model capable of learning to distinguish the two types of companies.

2. We took a sample from the data to balance it and created a new DataFrame. One way the imbalance may affect a machine learning algorithm is that the algorithm completely ignores the minority class. This is an issue because the minority class is often the class we are most interested in.

3. We identify imbalanced classification as a problem because it can influence the performance of our machine learning algorithms.

4. There are only three categorical data columns; we will explore these columns first.
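
The notebook balances the classes by keeping the first 220 rows of the majority class. A common alternative is to draw the majority rows at random with a fixed seed, so the sample does not depend on row order. The sketch below illustrates that alternative under stated assumptions: `undersample_majority` is a hypothetical helper and the toy frame stands in for the real data.

```python
import pandas as pd


def undersample_majority(frame: pd.DataFrame, target: str, random_state: int = 42) -> pd.DataFrame:
    """Randomly downsample every class to the size of the smallest class."""
    counts = frame[target].value_counts()
    minority_label = counts.idxmin()
    n_minority = int(counts.min())

    parts = []
    for label, group in frame.groupby(target):
        if label == minority_label:
            parts.append(group)  # keep every minority row
        else:
            # draw as many rows as the minority class has
            parts.append(group.sample(n=n_minority, random_state=random_state))

    # shuffle so the classes are not stored in contiguous blocks
    balanced = pd.concat(parts).sample(frac=1, random_state=random_state)
    return balanced.reset_index(drop=True)


# Toy frame (hypothetical values): 4 bankrupt rows and 16 non-bankrupt rows.
toy = pd.DataFrame({"feature": range(20), "Bankrupt?": [1] * 4 + [0] * 16})
balanced = undersample_majority(toy, "Bankrupt?")
print(balanced["Bankrupt?"].value_counts())
```

Either way, undersampling discards most of the majority class, so the balanced frame is much smaller than the original.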



## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

In [None]:
colors = ['Accent', 'Accent_r', 'Blues', 'Blues_r', 'BrBG', 'BrBG_r', 'BuGn', 'BuGn_r', 'BuPu', 'BuPu_r', 'CMRmap',
          'CMRmap_r', 'Dark2', 'Dark2_r', 'GnBu', 'GnBu_r', 'Greens', 'Greens_r', 'Greys', 'Greys_r', 'OrRd', 'OrRd_r', 
          'Oranges', 'Oranges_r', 'PRGn', 'PRGn_r', 'Paired', 'Paired_r', 'Pastel1', 'Pastel1_r', 'Pastel2', 'Pastel2_r', 
          'PiYG', 'PiYG_r', 'PuBu', 'PuBuGn', 'PuBuGn_r', 'PuBu_r', 'PuOr', 'PuOr_r', 'PuRd', 'PuRd_r', 'Purples', 'Purples_r', 
          'RdBu', 'RdBu_r', 'RdGy', 'RdGy_r', 'RdPu', 'RdPu_r', 'RdYlBu', 'RdYlBu_r', 'RdYlGn', 'RdYlGn_r', 'Reds', 'Reds_r', 
          'Set1', 'Set1_r', 'Set2', 'Set2_r', 'Set3', 'Set3_r', 'Spectral', 'Spectral_r', 'Wistia', 'Wistia_r', 'YlGn', 'YlGnBu', 
          'YlGnBu_r', 'YlGn_r', 'YlOrBr', 'YlOrBr_r', 'YlOrRd', 'YlOrRd_r', 'afmhot', 'afmhot_r', 'autumn', 'autumn_r', 'binary', 
          'binary_r', 'bone', 'bone_r', 'brg', 'brg_r', 'bwr', 'bwr_r', 'cividis', 'cividis_r', 'cool', 'cool_r', 'coolwarm', 
          'coolwarm_r', 'copper', 'copper_r', 'crest', 'crest_r', 'cubehelix', 'cubehelix_r', 'flag', 'flag_r', 'flare', 
          'flare_r', 'gist_earth', 'gist_earth_r', 'gist_gray', 'gist_gray_r', 'gist_heat', 'gist_heat_r', 'gist_ncar', 
          'gist_ncar_r', 'gist_rainbow', 'gist_rainbow_r', 'gist_stern', 'gist_stern_r', 'gist_yarg', 'gist_yarg_r', 'gnuplot', 
          'gnuplot2', 'gnuplot2_r', 'gnuplot_r', 'gray', 'gray_r', 'hot', 'hot_r', 'hsv', 'hsv_r', 'icefire', 'icefire_r', 
          'inferno', 'inferno_r', 'jet', 'jet_r', 'magma', 'magma_r', 'mako', 'mako_r', 'nipy_spectral', 'nipy_spectral_r', 
          'ocean', 'ocean_r', 'pink', 'pink_r', 'plasma', 'plasma_r', 'prism', 'prism_r', 'rainbow', 'rainbow_r', 'rocket', 
          'rocket_r', 'seismic', 'seismic_r', 'spring', 'spring_r', 'summer', 'summer_r', 'tab10', 'tab10_r', 'tab20', 
          'tab20_r', 'tab20b', 'tab20b_r', 'tab20c', 'tab20c_r', 'terrain', 'terrain_r', 'turbo', 'turbo_r', 'twilight', 
          'twilight_r', 'twilight_shifted', 'twilight_shifted_r', 'viridis', 'viridis_r', 'vlag', 'vlag_r', 'winter', 'winter_r']

#### Chart - 1 - Pie Chart on Dependent Variable i.e., Bankrupt? (Univariate)

In [None]:
# Chart - 1 visualization code

# Dependent Column Value Counts
print(df['Bankrupt?'].value_counts())
print(" ")

# Dependent Variable Column Visualization
df['Bankrupt?'].value_counts().plot(kind='pie',
                              figsize=(15,6),
                               autopct="%1.1f%%",
                               startangle=90,
                               shadow=True,
                               labels=['Not Bankrupt(%)','Bankrupt(%)'],
                               colors=['skyblue','red'],
                               explode=[0,0]
                              )

##### 1. Why did you pick the specific chart?

A pie chart expresses a part-to-whole relationship in the data. It makes percentage comparisons easy to explain through the areas of differently colored sectors of a circle, and it is used frequently wherever such comparisons are needed. So I used a pie chart, which helped me get the percentage comparison of the dependent variable.

##### 2. What is/are the insight(s) found from the chart?

From the above chart I learned that 6599 companies did not go bankrupt, which is 96.8% of the companies in the dataset. On the other hand, 220 companies went bankrupt, which is 3.2% of the companies in the dataset.

3.2% of companies being bankrupt might look like a small number, but since it can increase, immediate action should be taken.


##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Financial statements are records of a company's financial information in an accounting period that can be used to describe the performance of the company. Bankruptcy is defined as a failure of the company in carrying out operations to achieve its objectives.

A decrease in sales can cause a decrease in a company's profit, and if it happens continuously the company will experience bankruptcy. Bankruptcy begins with financial distress. Financial distress can be seen and measured by analysing financial statements. Ratio analysis has a role in assessing the financial condition of a company.

There are five types of ratio, namely liquidity, activity, solvency, profitability, and market ratios. Anticipating the company's financial condition early is essential for each company, both for the continuation of its operations and for better marketing strategies.

#### Chart - 2 Countplot on Categorical Variable i.e., Liability-Assets Flag    (Univariate)

In [None]:
value = randint(0, len(colors)-1)

sns.countplot(x='Bankrupt?',data=df,palette = colors[value])

In [None]:
# Chart - 2 visualization code

value = randint(0, len(colors)-1)

print(df[' Liability-Assets Flag'].value_counts())
sns.countplot(x=' Liability-Assets Flag',data=df,palette = colors[value])




##### 1. Why did you pick the specific chart?

A count plot shows the counts of observations in each categorical bin using bars. It can be thought of as a histogram across a categorical, instead of quantitative, variable.

##### 2. What is/are the insight(s) found from the chart?

Almost all records have a flag value of 0; the number of 1's is negligible.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Assets are the items your company owns that can provide future economic benefit. Liabilities are what you owe other parties. In short, assets put money in your pocket, and liabilities take money out!

The "Liability-Assets" flag denotes the status of an organization: if the total liability exceeds total assets, the flag value is 1; otherwise it is 0. In the vast majority of cases, a company's assets exceed its liabilities.
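
The flag definition above is a simple comparison. The dataset ships the flag precomputed, so the sketch below is only an illustration: the column names `total_liabilities` and `total_assets` and their values are hypothetical and do not exist in the dataset.

```python
import pandas as pd

# Hypothetical balance-sheet figures for three companies.
toy = pd.DataFrame({
    "total_liabilities": [120.0, 80.0, 50.0],
    "total_assets": [100.0, 100.0, 50.0],
})

# 1 if total liabilities exceed total assets, 0 otherwise.
toy["liability_assets_flag"] = (toy["total_liabilities"] > toy["total_assets"]).astype(int)
print(toy)
```

Only the first company gets a flag of 1. The third has liabilities equal to its assets, which does not count as exceeding them.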

#### Chart - 3 Countplot on Categorical Variable i.e., Net Income Flag    (Univariate)

In [None]:
# Chart - 3 visualization code

value = randint(0, len(colors)-1)

print(df[' Net Income Flag'].value_counts())
sns.countplot(x=' Net Income Flag',data=df,palette = colors[value])

##### 1. Why did you pick the specific chart?

The countplot is majorly used for showing the observational count in different category based bins with the help of bars.

##### 2. What is/are the insight(s) found from the chart?

The "Net Income" flag denotes the status of an organization's income in the last two years: if the net income has been negative for the past two years, the flag value is 1; otherwise it is 0. We observe that every record has a flag value of 1, i.e. all companies in the dataset show a loss for the past two years.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

The "Net Income" flag denotes the status of an organization's income in the last two years, where if the net income is negative for the past two years, the flagged value will be 1, else the value is 0.
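
When every record has the same flag value, the column cannot help a classifier separate the two classes. A minimal sketch for detecting such constant columns follows; `constant_columns` is a hypothetical helper and the toy frame only mimics the situation described above.

```python
import pandas as pd


def constant_columns(frame: pd.DataFrame) -> list:
    """Return the names of columns that hold a single distinct value."""
    return [col for col in frame.columns if frame[col].nunique(dropna=False) <= 1]


# Toy frame (hypothetical values): the flag is 1 for every row.
toy = pd.DataFrame({
    "net_income_flag": [1, 1, 1, 1],
    "some_ratio": [0.1, 0.4, 0.3, 0.9],
})
to_drop = constant_columns(toy)
print(to_drop)
```

Columns returned by the helper are candidates for removal before modelling, for example with `toy.drop(columns=to_drop)`.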

#### Chart - 4 Countplot on Categorical Variable i.e., Liability-Assets Flag with Bankrupt  (Bivariate)

In [None]:
# Chart - 4 visualization code

value = randint(0, len(colors)-1)

print(df[[' Liability-Assets Flag','Bankrupt?']].value_counts())
sns.countplot(x = ' Liability-Assets Flag',hue = 'Bankrupt?',data = df,palette = colors[value])



##### 1. Why did you pick the specific chart?

The categorical data is encoded as binary 1 and 0, and is thus stored as "int64". We separate the numeric and categorical data to analyze our dataset. A count plot shows the number of observations in each categorical bin.

##### 2. What is/are the insight(s) found from the chart?

A small portion of organizations suffers bankruptcy, although possessing more assets than their liabilities.
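
The counts behind a grouped count plot can also be read as a contingency table, and normalising each row turns them into a bankruptcy rate per flag value. The sketch below uses a toy frame with hypothetical values and simplified column names; on the real data the columns would be `' Liability-Assets Flag'` and `'Bankrupt?'`.

```python
import pandas as pd

# Toy data (hypothetical values): six companies with flag 0, two with flag 1.
toy = pd.DataFrame({
    "liability_assets_flag": [0, 0, 0, 0, 0, 0, 1, 1],
    "bankrupt":              [0, 0, 0, 0, 0, 1, 1, 1],
})

# Raw counts per (flag, bankrupt) combination.
counts = pd.crosstab(toy["liability_assets_flag"], toy["bankrupt"])

# Row-normalised table: share of bankrupt and non-bankrupt companies within each flag value.
rates = pd.crosstab(toy["liability_assets_flag"], toy["bankrupt"], normalize="index")

print(counts)
print(rates)
```

Rates are easier to compare than raw bar heights when one flag value is far rarer than the other.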

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, we can create a positive impact with this information as bankruptcy is when a company's assets fall short of its liabilities, and the company is forced to sell its assets. Loans issued by banks or other financial institutions that are secured by a specific asset, such as a building or a piece of expensive machinery, are examples of secured debt. Whatever assets and cash remain after all the secured creditors have been paid are pooled together and distributed to creditors with unsecured debt. Those would include bondholders and shareholders with preferred stock.  

#### Chart - 5 Countplot on Categorical Variable i.e., Net Income Flag with Bankrupt (Bivariate)

In [None]:
# Chart - 5 visualization code

value = randint(0, len(colors)-1)

print(df[[' Net Income Flag','Bankrupt?']].value_counts())
sns.countplot(x = ' Net Income Flag',hue = 'Bankrupt?',data = df,palette = colors[value])

##### 1. Why did you pick the specific chart?

The categorical data is encoded as binary 1 and 0, and is thus stored as "int64". We separate the numeric and categorical data to analyze our dataset. A count plot shows the number of observations in each categorical bin.

##### 2. What is/are the insight(s) found from the chart?

Many organizations that have suffered losses for the past two years have stabilized their business, thus avoiding bankruptcy.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, a positive impact can be seen. Although owning several assets does not guarantee that an organization will avoid bankruptcy, very few of the organizations that have had negative income in the past two years go bankrupt.

#### Chart - 6 Boxplot on Numerical Variable i.e., Operating Profit Rate with Bankrupt (Bivariate)

In [None]:
# Chart - 6 visualization code

q1,q9 = df[" Operating Profit Rate"].quantile([0.1,0.9])
mask=df[" Operating Profit Rate"].between(q1,q9)
sns.boxplot(x="Bankrupt?",y=" Operating Profit Rate", data=df[mask]);


##### 1. Why did you pick the specific chart?

Box plots are used to show distributions of numeric data values, especially when you want to compare them between multiple groups. They are built to provide high-level information at a glance, offering general information about a group of data's symmetry, skew, variance, and outliers.
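
The quantities a box plot draws can be computed directly. The sketch below assumes the common convention, which is also the matplotlib and seaborn default, that points more than 1.5 times the interquartile range beyond the quartiles are drawn as outliers. The helper name `box_plot_stats` and the sample values are hypothetical.

```python
import numpy as np


def box_plot_stats(values) -> dict:
    """Compute quartiles, 1.5 * IQR fences, and the points outside the fences."""
    values = np.asarray(values, dtype=float)
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = values[(values < lower_fence) | (values > upper_fence)]
    return {
        "q1": float(q1),
        "median": float(median),
        "q3": float(q3),
        "iqr": float(iqr),
        "lower_fence": float(lower_fence),
        "upper_fence": float(upper_fence),
        "outliers": outliers.tolist(),
    }


# Toy sample (hypothetical values) with one extreme point.
stats = box_plot_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 50])
print(stats)
```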

##### 2. What is/are the insight(s) found from the chart?

The features are skewed.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

No, it is used just to see the distribution of the data.

#### Chart - 7 Column wise Histogram (Univariate)

In [None]:
# Chart - 7 visualization code

df.hist(figsize=(60, 50), bins=50)
plt.show()


##### 1. Why did you pick the specific chart?

The histogram is a popular graphing tool. It is used to summarize discrete or continuous data that are measured on an interval scale. It is often used to illustrate the major features of the distribution of the data in a convenient form. It is also useful when dealing with large data sets (greater than 100 observations). It can help detect any unusual observations (outliers) or any gaps in the data.

Thus, I used histograms to analyze the distribution of each variable over the whole dataset and see whether it is symmetric.

##### 2. What is/are the insight(s) found from the chart?

Many features are skewed and data is imbalanced.
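
Skewness can also be quantified instead of judged by eye. The sketch below uses `DataFrame.skew()` from pandas on synthetic data: the column names and distributions are assumptions for illustration, not columns of the bankruptcy dataset.

```python
import numpy as np
import pandas as pd

# Synthetic data: one roughly symmetric column and one right-skewed column.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "symmetric": rng.normal(size=2000),
    "right_skewed": rng.exponential(size=2000),
})

# Sample skewness per column: near 0 for symmetric data, positive for a long right tail.
skewness = toy.skew()
print(skewness.sort_values(ascending=False))
```

Applying `df.skew()` to the real features and sorting the result would rank the columns from most to least skewed.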

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

No, it is used just to see the distribution of the data.

#### Chart - 8 Plotting interesting features using Boxplot(Bivariate)

In [None]:
# Chart - 8 visualization code

value = randint(0, len(colors)-1)

f, axes=plt.subplots(ncols=4, figsize=(28, 10))

sns.boxplot(x='Bankrupt?', y=' Net Income to Total Assets', data=df, ax=axes[0],palette = colors[value])
axes[0].set_title('Bankrupt vs Net Income to Total Assets')

sns.boxplot(x='Bankrupt?', y=' Total debt/Total net worth', data=df, ax=axes[1],palette = colors[value])
axes[1].set_title('Bankrupt vs Total debt/Total net worth')

sns.boxplot(x='Bankrupt?', y=' Debt ratio %', data=df, ax=axes[2],palette = colors[value])
axes[2].set_title('Bankrupt vs Debt ratio %')

sns.boxplot(x='Bankrupt?', y=' Net worth/Assets', data=df, ax=axes[3],palette = colors[value])
axes[3].set_title('Bankrupt vs Net worth/Assets')

plt.show()

##### 1. Why did you pick the specific chart?

Box plots are used to show distributions of numeric data values, especially when you want to compare them between multiple groups. So I used it to get general information about a group of data's symmetry, skew, variance, and outliers.

##### 2. What is/are the insight(s) found from the chart?

The data are skewed rather than symmetric. Taken together with the earlier charts, the features also show multicollinearity, and the target is imbalanced. With all that in mind, a decision tree model is likely to suit this data better than logistic regression, whose coefficient estimates become unstable when features are strongly correlated.
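Multicollinearity is easier to confirm from the correlation matrix than from box plots. The following is a minimal sketch that lists feature pairs whose absolute correlation exceeds a threshold; the data are synthetic, and the column names and the 0.9 cut-off are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: two perfectly related ratios and one unrelated ratio.
rng = np.random.default_rng(1)
base = rng.normal(size=1000)
demo = pd.DataFrame({
    "debt_ratio": base,
    "net_worth_to_assets": 1.0 - base,         # exact linear function of debt_ratio
    "unrelated_ratio": rng.normal(size=1000),
})

corr = demo.corr().abs()

# Keep only the upper triangle so that each pair is reported once.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack()

# Pairs above the (assumed) threshold are candidates for dropping or merging.
high_pairs = pairs[pairs > 0.9]
print(high_pairs)
```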


##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

A box plot alone cannot establish business impact. It is used only to inspect the distribution of each column.

#### Chart - 9 Plotting interesting features using Distplot (Univariate)

In [None]:
# Chart - 9 visualization code

f, (ax1, ax2, ax3, ax4) = plt.subplots(1,4, figsize=(24, 6))

net_income_assets = df[' Net Income to Total Assets'].loc[df['Bankrupt?'] == 1].values
sns.distplot(net_income_assets,ax=ax1, fit=norm, color='#FB8861')
ax1.set_title('Net Income to Total Assets \n (Unstable companies)', fontsize=14)

tot_debt_net = df[' Total debt/Total net worth'].loc[df['Bankrupt?'] == 1].values
sns.distplot(tot_debt_net ,ax=ax2, fit=norm, color='#56F9BB')
ax2.set_title('total debt/tot net worth \n (Unstable companies)', fontsize=14)


debt_ratio = df[' Debt ratio %'].loc[df['Bankrupt?'] == 1].values
sns.distplot(debt_ratio,ax=ax3, fit=norm, color='#C5B3F9')
ax3.set_title('debt_ratio \n (Unstable companies)', fontsize=14)

net_worth_assets = df[' Net worth/Assets'].loc[df['Bankrupt?'] == 1].values
sns.distplot(net_worth_assets,ax=ax4, fit=norm, color='#C5B3F9')
ax4.set_title('net worth/assets \n (Unstable companies)', fontsize=14)

plt.show()

##### 1. Why did you pick the specific chart?

A distplot (distribution plot) depicts the variation in the data distribution. Seaborn's distplot represents the overall distribution of a continuous variable. I used it, together with a fitted normal curve, to see which distribution each feature most closely matches.

##### 2. What is/are the insight(s) found from the chart?

The chart shows how these features are distributed for unstable (bankrupt) companies.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

No. The chart is used only to inspect the distribution of the features.

#### Chart - 10 Bargraph(Bivariate with Categorical - Numerical)

In [None]:
# Chart - 10 visualization code

# For simplicity, analyze the six most positively and six most negatively correlated attributes.


positive_corr = df[numeric_features].corrwith(df["Bankrupt?"]).sort_values(ascending=False)[:6].index.tolist()
negative_corr = df[numeric_features].corrwith(df["Bankrupt?"]).sort_values()[:6].index.tolist()

positive_corr = df[positive_corr + ["Bankrupt?"]].copy()
negative_corr = df[negative_corr + ["Bankrupt?"]].copy()

def corrbargraph(x_value, y_value):
    
    plt.figure(figsize=(15,8))
    value = randint(0, len(colors)-1)

    for i in range(1,7):
        plt.subplot(2,3,i)  
        sns.barplot(x = x_value, y = y_value[i-1],data = df,palette = colors[value])

    plt.tight_layout(pad=0.5)



x_value = positive_corr.columns.tolist()[-1]
y_value = positive_corr.columns.tolist()[:-1]

corrbargraph(x_value, y_value)


##### 1. Why did you pick the specific chart?

Bar charts show the frequency counts of values for the different levels of a categorical or nominal variable. Sometimes, bar charts show other statistics, such as percentages. 

To compare the average value of each feature between bankrupt and non-bankrupt companies, I used a bar chart.

##### 2. What is/are the insight(s) found from the chart?

We see that three attributes - "Debt Ratio %, Current Liability To Assets, Current Liability To Current Assets" are commonly high in bankrupt organizations.
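The bar heights in these charts are per-class means, since `sns.barplot` aggregates with the mean by default. A minimal sketch of the same numbers computed directly with `groupby`; the tiny frame and the `debt_ratio` column name are made up for illustration and are not project data.

```python
import pandas as pd

# Toy frame: three stable companies (0) and two bankrupt ones (1).
demo = pd.DataFrame({
    "Bankrupt?": [0, 0, 0, 1, 1],
    "debt_ratio": [0.10, 0.12, 0.14, 0.20, 0.24],
})

# Mean of the feature within each class, which is what each bar represents.
class_means = demo.groupby("Bankrupt?")["debt_ratio"].mean()
print(class_means)
```

Printing the table alongside the chart makes it easier to quote exact differences between the two groups.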

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

We look at some of the key financial ratios that can indicate whether a company is going to be able to repay its debts.

#### Chart - 11 Bargraph(Bivariate)

In [None]:
# Chart - 11 visualization code

x_value = negative_corr.columns.tolist()[-1]
y_value = negative_corr.columns.tolist()[:-1]

corrbargraph(x_value, y_value)

##### 1. Why did you pick the specific chart?

Bar charts show the frequency counts of values for the different levels of a categorical or nominal variable. Sometimes, bar charts show other statistics, such as percentages.

To compare the average value of each feature between bankrupt and non-bankrupt companies, I used a bar chart.


##### 2. What is/are the insight(s) found from the chart?

These attributes show that the greater a company's assets and earnings, the less likely it is to go bankrupt.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

These attributes show that the greater a company's assets and earnings, the less likely it is to go bust.

#### Chart - 12 Scatterplot(Bivariate)

In [None]:
# Chart - 12 visualization code

# Check how the top positively and negatively correlated attributes relate to each other

plt.figure(figsize=(20,9))

plt.suptitle("Correlation Between Positive Attributes")

plt.subplot(1,2,1)
plt.xlabel("Debt Ratio")
plt.ylabel("Current Liability To Assets Ratio")
plt.scatter(df[' Debt ratio %'],df[' Current Liability to Assets'], marker='v',color = 'red')

plt.subplot(1,2,2)
plt.xlabel("Borrowing Dependency")
plt.ylabel("Liability To Equity Ratio")
plt.scatter(df[' Borrowing dependency'],df[' Liability to Equity'], marker='v',color = 'red')

plt.tight_layout(pad=0.8)

plt.figure(figsize=(20,9))

plt.suptitle("Correlation Between Negative Attributes")

plt.subplot(1,2,1)
plt.xlabel("ROA (A)")
plt.ylabel("ROA (B)")
sns.scatterplot(data=df, x=' ROA(A) before interest and % after tax', y=' ROA(B) before interest and depreciation after tax',color = 'red')

plt.subplot(1,2,2)
plt.xlabel("ROA (B)")
plt.ylabel("ROA (C)")
sns.scatterplot(data=df, x=' ROA(B) before interest and depreciation after tax', y=' ROA(C) before interest and depreciation before interest',color = 'red')

plt.tight_layout(pad=0.8)



##### 1. Why did you pick the specific chart?

A scatter plot (aka scatter chart, scatter graph) uses dots to represent values for two different numeric variables. The position of each dot on the horizontal and vertical axis indicates values for an individual data point. Scatter plots are used to observe relationships between variables.

Thus, I used scatter plots to depict how the top positively and negatively correlated attributes relate to each other: Debt Ratio vs Current Liability To Assets Ratio and Borrowing Dependency vs Liability to Equity Ratio for the positively correlated attributes, and ROA(A) vs ROA(B) and ROA(B) vs ROA(C) for the negatively correlated attributes.

##### 2. What is/are the insight(s) found from the chart?

The attributes that are positively correlated with the target attribute are positively related to each other, and the same holds among the attributes that are negatively correlated with the target attribute.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Higher values of the attributes “Debt Ratio %, Current Liability To Assets, Current Liability To Current Assets” are associated with heavy losses and, ultimately, bankruptcy.
Higher values of the attributes that have a negative correlation with the target attribute are associated with a lower risk of bankruptcy.
Attributes within each of these two groups also appear to be related to one another.
Therefore, the insights gained could be useful for an organisation.

#### Chart - 13 (Multivariate)

In [None]:
# Chart - 13 visualization code

relation = positive_corr.columns.tolist()[:-1] + negative_corr.columns.tolist()[:-1]
plt.figure(figsize=(20,10))
sns.heatmap(df[relation].corr(),annot=True)


##### 1. Why did you pick the specific chart?

To get the total correlation of the top 12 attributes.

##### 2. What is/are the insight(s) found from the chart?

We observed several correlations among the top 12 attributes. One example is “Net Worth/Assets and Debt Ratio %”, which are negatively correlated with each other.

The remaining correlations can be read from the chart above.

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
corr = df.corr()
cmap = sns.diverging_palette(5, 250, as_cmap=True)

def magnify():
    # CSS rules that enlarge a header or cell while the mouse hovers over it
    return [dict(selector="th",
                 props=[("font-size", "7pt")]),
            dict(selector="td",
                 props=[('padding', "0em 0em")]),
            dict(selector="th:hover",
                 props=[("font-size", "12pt")]),
            dict(selector="tr:hover td:hover",
                 props=[('max-width', '200px'),
                        ('font-size', '12pt')])
]

# Styler.set_precision was removed in pandas 2.0; format(precision=2) replaces it
corr.style.background_gradient(cmap, axis=1)\
    .set_properties(**{'max-width': '80px', 'font-size': '10pt'})\
    .set_caption("Hover to magnify")\
    .format(precision=2)\
    .set_table_styles(magnify())

##### 1. Why did you pick the specific chart?

A correlation matrix is a table showing correlation coefficients between variables. Each cell in the table shows the correlation between two variables. A correlation matrix is used to summarize data, as an input into a more advanced analysis, and as a diagnostic for advanced analyses. The range of correlation is [-1,1].

Thus, to see the correlation between all the variables along with the correlation coefficients, I used a correlation heatmap.


##### 2. What is/are the insight(s) found from the chart?

Top five positive correlations:
Debt ratio, Current Liability to Assets, Borrowing dependency, Current Liability to Current Assets, Liability to Equity

Top five negative correlations:
Net Income to Total Assets, ROA(A) before interest and % after tax, ROA(B) before interest and depreciation after tax, ROA(C) before interest and depreciation before interest, Net worth/Assets

The remaining correlations can be read from the chart above.


## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

1.  Companies that encounter financial difficulties tend to have lower ROA.

2.  Net Value per Share does not affect the Target Variable.

3.  The Per Share Net profit before tax (Yuan ¥) is lower for stable companies.

### Hypothetical Statement - 1

Companies that encounter financial difficulties tend to have lower ROA


#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis: Companies that encounter financial difficulties tend to have lower ROA.

Alternate Hypothesis : Companies that encounter financial difficulties tend to have higher ROA.

Test Type: Two tailed test


#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

# Separate bankrupt and non-bankrupt companies
bankrupt_df = df[df['Bankrupt?'] == 1]
not_bankrupt_df = df[df['Bankrupt?'] == 0]

# Select the features to test by position (columns 1 and 2 of df)
cols = df.iloc[:,1:3]

for feature in cols:
  
  a = bankrupt_df[feature]
  b = not_bankrupt_df[feature]
  b = b.sample(n=len(a), random_state=42) # Subsample the non-bankrupt group to the size of the bankrupt group
  # Running t-tests
  test = stats.ttest_ind(a,b)   
  plt.figure() 
  sns.distplot(bankrupt_df[feature], kde=True, label="Bankrupt")
  sns.distplot(not_bankrupt_df[feature], kde=True, label="Not Bankrupt") 
  plt.title("{} / p-value of t-test = :{}".format(feature, test[1]))
  plt.legend()

##### Which statistical test have you done to obtain P-Value?

I used an independent two-sample t-test to obtain the p-value. Based on the result, the null hypothesis cannot be rejected: companies that encounter financial difficulties tend to have lower ROA.

##### Why did you choose the specific statistical test?

For studies with a large sample size, t-tests and their corresponding confidence intervals can be used even for heavily skewed data.

Since our data are skewed but the sample is large, I used the t-test.
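For reference, the conventional framing of a two-sample test takes "the group means are equal" as the null hypothesis and places the directional claim in the alternative. Below is a minimal sketch of that framing on synthetic numbers (not the project data), using Welch's variant of the t-test, which does not assume equal variances. The array names and distribution parameters are assumptions, and the `alternative` argument requires SciPy 1.6 or later.

```python
import numpy as np
from scipy import stats

# Synthetic ROA-like samples: a small "bankrupt" group and a large "stable" group.
rng = np.random.default_rng(42)
roa_bankrupt = rng.normal(0.40, 0.05, 220)
roa_stable = rng.normal(0.50, 0.05, 6000)

# H0: the two group means are equal. H1 (two-sided): the means differ.
# equal_var=False selects Welch's t-test.
t_stat, p_two_sided = stats.ttest_ind(roa_bankrupt, roa_stable, equal_var=False)

# One-sided H1: the bankrupt group's mean is lower than the stable group's.
_, p_less = stats.ttest_ind(
    roa_bankrupt, roa_stable, equal_var=False, alternative="less"
)

alpha = 0.05  # assumed significance level
reject_h0 = p_two_sided < alpha
print(t_stat, p_two_sided, p_less, reject_h0)
```

With this framing, a small p-value leads to rejecting the equal-means null, and the sign of the t statistic shows which group has the lower mean.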


### Hypothetical Statement - 2

Net Value per Share does not affect the Target Variable.

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis: Net Value per Share does not affect the Target Variable.

Alternate Hypothesis : Net Value per Share affects the Target Variable.

Test Type: Two tailed test


#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

cols = df.iloc[:,16:19]
for feature in cols:
  
  a = bankrupt_df[feature]
  b = not_bankrupt_df[feature]
  b = b.sample(n=len(a), random_state=42) # Subsample the non-bankrupt group to the size of the bankrupt group
  # Running t-tests
  test = stats.ttest_ind(a,b)   
  plt.figure() 
  sns.distplot(bankrupt_df[feature], kde=True, label="Bankrupt")
  sns.distplot(not_bankrupt_df[feature], kde=True, label="Not Bankrupt") 
  plt.title("{} / p-value of t-test = :{}".format(feature, test[1]))
  plt.legend()


##### Which statistical test have you done to obtain P-Value?

I used a t-test to obtain the p-value. The result indicates that the null hypothesis does not hold and that Net Value per Share does affect the Target Variable. Therefore, we reject the null hypothesis in favour of the alternate hypothesis.

##### Why did you choose the specific statistical test?

The data are skewed and highly imbalanced, so we draw a random sample of the non-bankrupt companies equal in size to the bankrupt group and apply a t-test to obtain the p-value.

### Hypothetical Statement - 3

The Per Share Net profit before tax (Yuan ¥) is lower for stable companies.

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis: The Per Share Net profit before tax (Yuan ¥) is lower for stable companies.

Alternate Hypothesis : The Per Share Net profit before tax (Yuan ¥) is lower for unstable companies.

Test Type: Two tailed test


#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

cols = df.iloc[:,23:24]
for feature in cols:
  
  a = bankrupt_df[feature]
  b = not_bankrupt_df[feature]
  b = b.sample(n=len(a), random_state=42) # Subsample the non-bankrupt group to the size of the bankrupt group
  # Running t-tests
  test = stats.ttest_ind(a,b)   
  plt.figure() 
  sns.distplot(bankrupt_df[feature], kde=True, label="Bankrupt")
  sns.distplot(not_bankrupt_df[feature], kde=True, label="Not Bankrupt") 
  plt.title("{} / p-value of t-test = :{}".format(feature, test[1]))
  plt.legend()


##### Which statistical test have you done to obtain P-Value?

I used a t-test to obtain the p-value. The result indicates that the null hypothesis does not hold: the Per Share Net profit before tax (Yuan ¥) is lower for unstable companies. Therefore, we reject the null hypothesis in favour of the alternate hypothesis.

##### Why did you choose the specific statistical test?

The data are skewed and highly imbalanced, so we draw a random sample of the non-bankrupt companies equal in size to the bankrupt group and apply a t-test to obtain the p-value.
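Because the features are skewed, a rank-based test is sometimes run as a cross-check on the t-test. A minimal sketch with the Mann-Whitney U test on synthetic, right-skewed samples (not the project data); the variable names and distribution parameters are assumptions made for illustration.

```python
import numpy as np
from scipy import stats

# Synthetic right-skewed "profit per share" samples for the two groups.
rng = np.random.default_rng(7)
profit_bankrupt = rng.lognormal(-2.0, 0.5, 200)
profit_stable = rng.lognormal(-1.5, 0.5, 2000)

# H0: the two distributions are the same.
# H1: values in the bankrupt group tend to be smaller than in the stable group.
u_stat, p_value = stats.mannwhitneyu(
    profit_bankrupt, profit_stable, alternative="less"
)
print(u_stat, p_value)
```

The test compares ranks rather than means, so a few extreme values have far less influence on the result.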

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation

#### What all missing value imputation techniques have you used and why did you use those techniques?

Answer Here.

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments

##### What all outlier treatment techniques have you used and why did you use those techniques?

Answer Here.

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns

#### What all categorical encoding techniques have you used & why did you use those techniques?

Answer Here.

### 4. Textual Data Preprocessing 
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
# Expand Contraction

#### 2. Lower Casing

In [None]:
# Lower Casing

#### 3. Removing Punctuations

In [None]:
# Remove Punctuations

#### 4. Removing URLs & Removing Words Containing Digits

In [None]:
# Remove URLs & remove words containing digits

#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords

In [None]:
# Remove White spaces

#### 6. Rephrase Text

In [None]:
# Rephrase Text

#### 7. Tokenization

In [None]:
# Tokenization

#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)

##### Which text normalization technique have you used and why?

Answer Here.

#### 9. Part of speech tagging

In [None]:
# POS Tagging

#### 10. Text Vectorization

In [None]:
# Vectorizing Text

##### Which text vectorization technique have you used and why?

Answer Here.

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features

#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting

##### What all feature selection methods have you used  and why?

Answer Here.

##### Which all features you found important and why?

Answer Here.

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
# Transform Your data

### 6. Data Scaling

In [None]:
# Scaling your data

##### Which method have you used to scale your data and why?

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Answer Here.

In [None]:
# Dimensionality Reduction (If needed)

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

Answer Here.

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.

##### What data splitting ratio have you used and why? 

Answer Here.
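As one possible starting point, and not necessarily the split used in this project, the sketch below shows a stratified 80/20 split, which keeps the share of the rare class the same in both parts. The arrays are synthetic, and the 3% positive rate and the 0.2 test size are assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic features and an imbalanced target (about 3% positives).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = np.array([1] * 30 + [0] * 970)

# stratify=y preserves the class proportions in the train and test parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(y_train.mean(), y_test.mean())
```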

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

Answer Here.

In [None]:
# Handling Imbalanced Dataset (If needed)

##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

Answer Here.
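One option for an imbalanced target, shown here only as a hedged sketch and not as the technique used in this project, is class weighting, which makes errors on the rare class count more during training. The data are synthetic, and the feature shift, tree depth, and class counts are assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils.class_weight import compute_class_weight

# Synthetic imbalanced data: 970 stable (0) and 30 bankrupt (1) rows.
rng = np.random.default_rng(0)
y = np.array([0] * 970 + [1] * 30)
X = rng.normal(size=(1000, 4))
X[y == 1, 0] += 2.0  # shift one feature so the minority class is learnable

# "balanced" weights are n_samples / (n_classes * class_count).
weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y
)
print(dict(zip([0, 1], weights)))

# The same weighting applied inside a decision tree.
clf = DecisionTreeClassifier(class_weight="balanced", max_depth=3, random_state=0)
clf.fit(X, y)
```

Resampling methods are another common choice. Whichever is used, it should be applied to the training split only so that the test set keeps the original class ratio.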

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# ML Model - 1 Implementation

# Fit the Algorithm

# Predict on the model

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Answer Here.

### ML Model - 2

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Answer Here.

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

Answer Here.

### ML Model - 3

In [None]:
# ML Model - 3 Implementation

# Fit the Algorithm

# Predict on the model

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Answer Here.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

Answer Here.

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

Answer Here.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

Answer Here.

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.
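The save and reload steps described above can be sketched as follows. This is a minimal illustration using `pickle` on a small placeholder model trained on synthetic data; the model, the file name `best_model.pkl`, and the temporary directory are all assumptions, not the project's final model or paths.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Placeholder model trained on synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] > 0).astype(int)
model = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Save the fitted model (hypothetical file name in a temporary directory).
model_path = os.path.join(tempfile.mkdtemp(), "best_model.pkl")
with open(model_path, "wb") as fh:
    pickle.dump(model, fh)

# Load it back and predict on rows the model has not seen.
with open(model_path, "rb") as fh:
    loaded = pickle.load(fh)

X_new = rng.normal(size=(5, 3))
same_predictions = np.array_equal(model.predict(X_new), loaded.predict(X_new))
print(same_predictions)
```

A pickled model should be loaded with the same scikit-learn version that saved it, and pickle files from untrusted sources should never be loaded.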

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

Write the conclusion here.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***