# Problem description

You are to predict whether a company will go bankrupt in the following year, based on financial attributes of the company.

Perhaps you are contemplating lending money to a company, and need to know whether the company
is in near-term danger of not being able to repay.


## Goal

## Learning objectives

- Demonstrate mastery in solving a classification problem and presenting
the entire Recipe for Machine Learning process in a notebook.
- We will make suggestions for ways to approach the problem
    - But there will be little explicit direction for this task.
- It is meant to be analogous to a pre-interview task that a potential employer might assign
to verify your skill

# Import modules

In [1]:
## Standard imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import sklearn

import os
import math

%matplotlib inline


# API for students

In [2]:
## Load the bankruptcy_helper module

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# Reload all modules imported with %aimport
%load_ext autoreload
%autoreload 1

# Import bankruptcy_helper module
import bankruptcy_helper
%aimport bankruptcy_helper

helper = bankruptcy_helper.Helper()

# Get the data

The first step in our Recipe is Get the Data.

- Each example is a row of data corresponding to a single company
- There are 64 attributes, described in the section below
- The column `Bankrupt` is 1 if the company subsequently went bankrupt; 0 if it did not go bankrupt
- The column `Id` is a Company Identifier

In [3]:
# Data directory
DATA_DIR = "./Data"

if not os.path.isdir(DATA_DIR):
    DATA_DIR = "../resource/asnlib/publicdata/bankruptcy/data"

data_file = "5th_yr.csv"
data = pd.read_csv( os.path.join(DATA_DIR, "train", data_file) )

target_attr = "Bankrupt"

n_samples, n_attrs = data.shape
print("Data shape: ", data.shape)

Data shape:  (4818, 66)


## Have a look at the data

We will not go through all steps in the Recipe, nor in depth.

But here's a peek

In [4]:
data.head()

Unnamed: 0,X1,X2,X3,X4,X5,X6,X7,X8,X9,X10,...,X57,X58,X59,X60,X61,X62,X63,X64,Bankrupt,Id
0,0.025417,0.41769,0.0568,1.1605,-126.39,0.41355,0.025417,1.2395,1.165,0.51773,...,0.049094,0.85835,0.12322,5.6167,7.4042,164.31,2.2214,1.334,0,4510
1,-0.023834,0.2101,0.50839,4.2374,22.034,0.058412,-0.027621,3.6579,0.98183,0.76855,...,-0.031011,1.0185,0.069047,5.7996,7.7529,26.446,13.802,6.4782,0,3537
2,0.030515,0.44606,0.19569,1.565,35.766,0.28196,0.039264,0.88456,1.0526,0.39457,...,0.077337,0.95006,0.25266,15.049,2.8179,104.73,3.4852,2.6361,0,3920
3,0.052318,0.056366,0.54562,10.68,438.2,0.13649,0.058164,10.853,1.0279,0.61173,...,0.085524,0.97282,0.0,6.0157,7.4626,48.756,7.4863,1.0602,0,1806
4,0.000992,0.49712,0.12316,1.3036,-71.398,0.0,0.001007,1.0116,1.2921,0.50288,...,0.001974,0.99925,0.019736,3.4819,8.582,114.58,3.1854,2.742,0,1529


Pretty *unhelpful* !

What are these mysteriously named features ?

## Description of attributes

This may still be somewhat unhelpful for those of you not used to reading Financial Statements.

But that's partially the point of the exercise
- You can *still* perform Machine Learning *even if* you are not an expert in the problem domain
    - That's what makes this a good interview exercise: you can demonstrate your thought process even if you don't know the exact meaning of the terms
- Of course: becoming an expert in the domain *will improve* your ability to create better models
    - Feature engineering is easier if you understand the features, their inter-relationships, and the relationship to the target

Let's get a feel for the data
- What is the type of each attribute ?


In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4818 entries, 0 to 4817
Data columns (total 66 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   X1        4818 non-null   object 
 1   X2        4818 non-null   object 
 2   X3        4818 non-null   object 
 3   X4        4818 non-null   object 
 4   X5        4818 non-null   object 
 5   X6        4818 non-null   object 
 6   X7        4818 non-null   object 
 7   X8        4818 non-null   object 
 8   X9        4818 non-null   float64
 9   X10       4818 non-null   object 
 10  X11       4818 non-null   object 
 11  X12       4818 non-null   object 
 12  X13       4818 non-null   float64
 13  X14       4818 non-null   object 
 14  X15       4818 non-null   object 
 15  X16       4818 non-null   object 
 16  X17       4818 non-null   object 
 17  X18       4818 non-null   object 
 18  X19       4818 non-null   float64
 19  X20       4818 non-null   float64
 20  X21       4818 non-null   obje

You may be puzzled:
- Most attributes are `object` and *not* numeric (`float64`)
- But looking at the data via `data.head()` certainly gives the impression that all attributes are numeric

Welcome to the world of messy data !  The dataset has represented numbers as strings.
- These little unexpected challenges are common in the real world
- Data is rarely perfect and clean
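To see which values are keeping a column from being numeric, one option is to list the entries that fail to parse. The sketch below uses a small made-up frame; the `?` token and the column values are assumptions for illustration and are not taken from the real file.

```python
import pandas as pd

# Made-up frame: X1 holds numbers as strings plus one unparseable token, X9 is already numeric
toy = pd.DataFrame({"X1": ["0.5", "?", "1.2"], "X9": [1.0, 2.0, 3.0]})

# Columns that pandas did not read as numbers
non_numeric_cols = [c for c in toy.columns if not pd.api.types.is_numeric_dtype(toy[c])]

# For each such column, collect the distinct entries that cannot be parsed as a number
bad_tokens = {}
for c in non_numeric_cols:
    converted = pd.to_numeric(toy[c], errors="coerce")
    bad_tokens[c] = sorted(toy.loc[converted.isna() & toy[c].notna(), c].unique())

print(non_numeric_cols)
print(bad_tokens)
```

Running the same loop on `data` would show which placeholder the real file uses for missing numbers.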

So you might want to first convert all attributes to numeric

**Hint**
- Look up the Pandas method `to_numeric`
    - We suggest you use the option `errors='coerce'`
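As a minimal sketch of what the hint does, the example below converts a made-up frame of string-encoded numbers. The frame and its two bad entries (`?` and `n/a`) are assumptions for illustration only.

```python
import pandas as pd

# Made-up example: numbers stored as strings, with two entries that are not numbers
toy = pd.DataFrame({"X1": ["0.025", "?", "0.031"],
                    "X2": ["1.5", "2.0", "n/a"]})

# errors="coerce" turns anything that cannot be parsed into NaN instead of raising
numeric = toy.apply(pd.to_numeric, errors="coerce")

print(numeric.dtypes)        # dtype of each column after conversion
print(numeric.isna().sum())  # number of NaN values introduced per column
```

The NaN values created this way still have to be handled later, for example by imputation.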
    

# Preparing the data

I do not know whether the bankrupt and non-bankrupt examples are randomly ordered in the file, so I first shuffle the rows.

In [6]:
data = sklearn.utils.shuffle(data, random_state = 42)

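The `Id` column is only a company identifier and is not relevant to whether a company goes bankrupt. The drop is left commented out here; `Id` is removed later, when the feature matrix is built.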

In [7]:
#data = data.drop(["Id"], axis = 1)

Now convert all columns to numeric types; values that cannot be parsed become `NaN`

In [8]:
data = data.apply(pd.to_numeric, errors = 'coerce')
from sklearn.impute import SimpleImputer

In [9]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4818 entries, 4340 to 860
Data columns (total 66 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   X1        4816 non-null   float64
 1   X2        4816 non-null   float64
 2   X3        4816 non-null   float64
 3   X4        4803 non-null   float64
 4   X5        4808 non-null   float64
 5   X6        4816 non-null   float64
 6   X7        4816 non-null   float64
 7   X8        4804 non-null   float64
 8   X9        4818 non-null   float64
 9   X10       4816 non-null   float64
 10  X11       4816 non-null   float64
 11  X12       4803 non-null   float64
 12  X13       4818 non-null   float64
 13  X14       4816 non-null   float64
 14  X15       4812 non-null   float64
 15  X16       4804 non-null   float64
 16  X17       4804 non-null   float64
 17  X18       4816 non-null   float64
 18  X19       4818 non-null   float64
 19  X20       4818 non-null   float64
 20  X21       4744 non-null   fl

Now split the data into a training set and a test set with `train_test_split`. I also drop the `Id` column from the training features, since it is not relevant to whether a company goes bankrupt (`X_test` keeps `Id` because `MyModel` drops it before predicting).
I fill missing values with the median, because it is less sensitive to extreme values than the mean.
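The effect of that choice can be seen on a tiny made-up column with one extreme value. The numbers below are illustrative only.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up column: three ordinary values, one extreme value, one missing value
col = np.array([[1.0], [2.0], [3.0], [1000.0], [np.nan]])

# The last row is the missing one; compare what each strategy fills in
mean_fill = SimpleImputer(strategy="mean").fit_transform(col)[-1, 0]
median_fill = SimpleImputer(strategy="median").fit_transform(col)[-1, 0]

print("mean fill:  ", mean_fill)    # pulled far up by the single extreme value
print("median fill:", median_fill)  # stays near the typical values
```

Financial ratios such as the ones in this dataset often have a few very large or very negative entries, which is why the median is the safer default here.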

In [10]:
from sklearn.model_selection import train_test_split
traindata, testdata = train_test_split(data, test_size=0.3, random_state=42)
X_train, y_train = traindata.drop(columns=["Bankrupt","Id"] ), traindata[ ["Bankrupt"] ]
X_test, y_test = testdata.drop(columns=["Bankrupt"] ), testdata[ ["Bankrupt"] ]
X_train

Unnamed: 0,X1,X2,X3,X4,X5,X6,X7,X8,X9,X10,...,X55,X56,X57,X58,X59,X60,X61,X62,X63,X64
109,0.019213,0.52630,-0.015068,0.89429,-88.844,0.019358,0.013060,0.88923,1.24840,0.46800,...,-13975.00,0.199000,0.041054,0.80100,0.820000,7.2844,11.5030,107.0900,3.40820,0.556780
1140,0.160410,0.69907,0.090965,1.13010,10.146,0.000000,0.198730,0.43047,3.44550,0.30093,...,134.00,0.055834,0.533050,0.94243,0.000000,,4.5676,74.0570,4.92870,16.410000
4399,0.089987,0.56987,0.097846,6.35790,54.840,0.000000,0.100760,0.75477,0.78787,0.43012,...,745.54,0.104040,0.209210,0.88032,1.279100,,7.9970,8.4604,43.14200,0.891360
4524,0.001961,0.48070,0.332930,1.74460,-82.067,0.028042,0.003307,1.08030,1.21650,0.51930,...,3098.00,0.020703,0.003776,0.99730,0.046356,2.0511,7.7519,134.1600,2.72070,5.530400
2435,0.165180,0.20127,0.781430,4.88250,52.287,0.000000,0.204360,3.96850,1.31190,0.79873,...,3739.00,0.163830,0.206810,0.84584,0.000000,2.1170,3.7472,56.0000,6.51790,75.829000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4471,0.298400,0.27122,0.546070,3.01340,78.472,0.388460,0.368400,2.68700,2.88010,0.72878,...,651.31,0.136680,0.409450,0.87237,0.000000,330.7100,6.7177,34.3730,10.61900,15.763000
3095,0.174440,0.54906,0.280100,1.55000,-10.833,0.602800,0.217750,0.82129,1.05060,0.45094,...,49626.00,0.048165,0.386840,0.95183,0.088172,19.8080,9.8811,42.6080,8.56640,20.716000
1367,-0.130150,0.34488,-0.062240,0.31906,-265.850,-0.387750,-0.130150,1.16780,0.28446,0.40275,...,-71677.00,-2.515400,-0.323150,3.51540,0.629350,23.5710,3.5718,720.7700,0.50641,0.047678
1724,0.045643,0.50982,0.128820,1.43650,-46.091,0.000000,0.058781,0.96148,1.06330,0.49018,...,667.55,0.053665,0.093114,0.94594,0.185170,4.3959,6.0577,101.3100,3.60270,1.846000


# Evaluating your project

We will evaluate your submission on a test dataset that we provide
- It has no labels, so **you** can't use it to evaluate your model, but **we** have the labels
- We will call this evaluation dataset the "holdout" data

Let's get it

In [11]:
holdout_data = pd.read_csv( os.path.join(DATA_DIR, "holdout", '5th_yr.csv') )

print("Data shape: ", holdout_data.shape)


Data shape:  (1092, 65)


We will evaluate your model on the holdout examples using metrics
- Accuracy
- Recall
- Precision

From our lecture: we may have to make a trade-off between Recall and Precision.

Our evaluation of your submission will be partially based on how you made (and described) the trade-off.

You may assume that it is 5 times worse to *fail to identify a company that will go bankrupt*
than it is to fail to identify a company that won't go bankrupt.
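One way to make this trade-off concrete is to score predictions with an explicit cost. The sketch below is an interpretation, not part of the assignment: it assumes a missed bankruptcy (false negative) costs 5 and a false alarm (false positive) costs 1, and it uses made-up labels and probabilities. The function name `misclassification_cost` is hypothetical.

```python
import numpy as np

def misclassification_cost(y_true, y_pred, fn_cost=5.0, fp_cost=1.0):
    """Total cost, assuming a missed bankruptcy is fn_cost and a false alarm is fp_cost."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    fn = np.sum((y_true == 1) & (y_pred == 0))  # bankrupt companies we missed
    fp = np.sum((y_true == 0) & (y_pred == 1))  # healthy companies we flagged
    return fn_cost * fn + fp_cost * fp

# Made-up labels and predicted probabilities of bankruptcy
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1])
proba = np.array([0.05, 0.10, 0.20, 0.30, 0.35, 0.60, 0.40, 0.80])

# Cost of the predictions obtained at a few decision thresholds
costs = {t: misclassification_cost(y_true, (proba >= t).astype(int))
         for t in (0.3, 0.5, 0.7)}
best_threshold = min(costs, key=costs.get)

print(costs)
print("lowest-cost threshold:", best_threshold)
```

With a 5:1 cost ratio, a lower threshold that accepts more false alarms in exchange for fewer misses tends to win, which is the direction of the Recall versus Precision trade-off described above.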

# Your model

Time for you to continue the Recipe for Machine Learning on your own.



In [12]:
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn import decomposition
import warnings
warnings.filterwarnings("ignore")

I will try several models in this project: SVC, Decision Tree, and Logistic Regression.
I use `GridSearchCV` to tune each model.

First I set up two data transformations: `SimpleImputer` and `StandardScaler`.
They are applied in the pipeline of every model.
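Since every model shares the same preprocessing, a small factory function can keep the step order identical across pipelines. This is a sketch, not the code used in this notebook: `make_pipe` is a hypothetical helper and the data is synthetic.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

def make_pipe(model):
    """Hypothetical helper: impute, then scale, then fit the given model."""
    return Pipeline(steps=[("imputer", SimpleImputer(strategy="median")),
                           ("scaler", StandardScaler()),
                           ("model", model)])

# Synthetic data with a few missing values in the second column
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
X[::7, 1] = np.nan
y = (X[:, 0] > 0).astype(int)

pipe = make_pipe(DecisionTreeClassifier(max_depth=2, random_state=0)).fit(X, y)
preds = pipe.predict(X)
print([name for name, _ in pipe.steps])
```

Because the imputer and scaler live inside the pipeline, cross-validation refits them on each training fold, so no statistics from the validation fold leak into preprocessing.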

In [13]:
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
imp = SimpleImputer(strategy = 'median')
scaler = StandardScaler()
from sklearn.metrics import accuracy_score

Decision Tree Model

In [14]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.decomposition import PCA
DT = DecisionTreeClassifier(random_state = 42)
pipe_DT = Pipeline(steps=[('imputer', imp),('scaler', scaler),
                          ('model', DT)])

In [15]:
param_grid_DT = {'model__max_features': [2,4,7,10,15,20,30,50],
              'model__criterion': ['gini', 'entropy'],
                'model__max_depth' : [2,4,7,10,15,20,30,50]}

In [16]:
grid_DT = GridSearchCV(pipe_DT, param_grid_DT)
grid_DT.fit(X_train, y_train)
print(grid_DT.best_params_)

GridSearchCV(estimator=Pipeline(steps=[('imputer',
                                        SimpleImputer(strategy='median')),
                                       ('scaler', StandardScaler()),
                                       ('model',
                                        DecisionTreeClassifier(random_state=42))]),
             param_grid={'model__criterion': ['gini', 'entropy'],
                         'model__max_depth': [2, 4, 7, 10, 15, 20, 30, 50],
                         'model__max_features': [2, 4, 7, 10, 15, 20, 30, 50]})

{'model__criterion': 'gini', 'model__max_depth': 2, 'model__max_features': 4}
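As an alternative to rebuilding the model by hand from the printed parameters, `GridSearchCV` keeps a copy of the best configuration refit on the full training data (the default `refit=True`). The sketch below shows this on synthetic data with a deliberately tiny grid; it is illustrative and not the search used in this notebook.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

grid = GridSearchCV(DecisionTreeClassifier(random_state=42),
                    {"max_depth": [2, 4]}, cv=3)
grid.fit(X, y)

# best_estimator_ is already fitted with the winning parameters
best = grid.best_estimator_
print(grid.best_params_, best.predict(X[:5]))
```

Using `best_estimator_` avoids copy errors when the winning parameters change between runs.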


In [17]:
optimal_DT = DecisionTreeClassifier(criterion = "gini", random_state = 42,
                           max_depth = 2,
                           max_features = 4)
pipe_DT = Pipeline(steps=[('imputer', imp),('scaler', scaler),
                           ('model', optimal_DT)])
pipe_DT.fit(X_train,y_train)
y_pred = pipe_DT.predict(X_train)
from sklearn.metrics import accuracy_score
print("Decision Tree Model in_sample_score is: ",accuracy_score(y_train, y_pred))

Pipeline(steps=[('imputer', SimpleImputer(strategy='median')),
                ('scaler', StandardScaler()),
                ('model',
                 DecisionTreeClassifier(max_depth=2, max_features=4,
                                        random_state=42))])

Decision Tree Model in_sample_score is:  0.9383155397390273


Logistic regression

In [18]:
from sklearn.linear_model import LogisticRegression
LR = LogisticRegression()
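# Note: the default 'lbfgs' solver supports only the 'l2' penalty (or none), so the
# 'l1' and 'elasticnet' candidates fail to fit and are scored as NaN by the search.
param_grid_LR = {'model__C': [0.00001,0.001,0.1,1,10, 50, 100],
             'model__penalty':('l1', 'l2', 'elasticnet')}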
pipe_LR = Pipeline(steps=[('imputer', imp),('scaler', scaler),
                           ('model', LR)])
grid_LR = GridSearchCV(pipe_LR, param_grid_LR)
grid_LR.fit(X_train, y_train)
print(grid_LR.best_params_)

GridSearchCV(estimator=Pipeline(steps=[('imputer',
                                        SimpleImputer(strategy='median')),
                                       ('scaler', StandardScaler()),
                                       ('model', LogisticRegression())]),
             param_grid={'model__C': [1e-05, 0.001, 0.1, 1, 10, 50, 100],
                         'model__penalty': ('l1', 'l2', 'elasticnet')})

{'model__C': 0.001, 'model__penalty': 'l2'}


In [19]:
optimal_LR = LogisticRegression(C=0.001,penalty='l2')
pipe_LR = Pipeline(steps=[('imputer', imp),('scaler', scaler),
                           ('model', optimal_LR)])
pipe_LR.fit(X_train,y_train)
y_pred = pipe_LR.predict(X_train)
print("Logistic Regression Model in_sample_score is: ",accuracy_score(y_train, y_pred))

Pipeline(steps=[('imputer', SimpleImputer(strategy='median')),
                ('scaler', StandardScaler()),
                ('model', LogisticRegression(C=0.001))])

Logistic Regression Model in_sample_score is:  0.9350533807829181


SVC MODEL

In [20]:
from sklearn.svm import SVC

svc = SVC(kernel='rbf')
pipe_svc = Pipeline(steps=[('scaler', scaler),('imputer', imp),
                           ('model', svc)])
param_grid_svc = {'model__C': [ 0.1, 1,5],
              'model__gamma': [0.001,0.05, 0.01, 0.5,1]}

grid_svc = GridSearchCV(pipe_svc, param_grid_svc)
grid_svc.fit(X_train, y_train)
print(grid_svc.best_params_)

GridSearchCV(estimator=Pipeline(steps=[('scaler', StandardScaler()),
                                       ('imputer',
                                        SimpleImputer(strategy='median')),
                                       ('model', SVC())]),
             param_grid={'model__C': [0.1, 1, 5],
                         'model__gamma': [0.001, 0.05, 0.01, 0.5, 1]})

{'model__C': 1, 'model__gamma': 0.5}


In [21]:
optimal_svc = SVC(kernel='rbf',C=1,gamma=0.5)
pipe_svc = Pipeline(steps=[('imputer', imp),('scaler', scaler),
                           ('model', optimal_svc)])
pipe_svc.fit(X_train,y_train)
y_pred = pipe_svc.predict(X_train)
print("SVC Model in_sample_score is: ",accuracy_score(y_train, y_pred))

Pipeline(steps=[('imputer', SimpleImputer(strategy='median')),
                ('scaler', StandardScaler()), ('model', SVC(C=1, gamma=0.5))])

SVC Model in_sample_score is:  0.9620403321470937


# Error analysis and Recall/Precision trade-off

Let's first take a look at the cross-validation scores for these three models

In [22]:
from sklearn.model_selection import cross_val_score

In [23]:
cross_val_scores = cross_val_score(pipe_DT, X_train, y_train, cv=5)
print("{m:s} Model: avg cross validation score={s:3.2f}\n".format(m="Decision Tree", s=cross_val_scores.mean()) )
cross_val_scores = cross_val_score(pipe_LR, X_train, y_train, cv=5)
print("{m:s} Model: avg cross validation score={s:3.2f}\n".format(m="Logistic Regression", s=cross_val_scores.mean()) )
cross_val_scores = cross_val_score(pipe_svc, X_train, y_train, cv=5)
print("{m:s} Model: avg cross validation score={s:3.2f}\n".format(m="SVC", s=cross_val_scores.mean()) )

Decision Tree Model: avg cross validation score=0.94

Logistic Regression Model: avg cross validation score=0.94

SVC Model: avg cross validation score=0.94



Let's compute the F1 score, the harmonic mean of Precision and Recall, to see how each model balances the two.
The F1 score ranges from 0 to 1, where 1 is the best and 0 is the worst.
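To make the metric concrete, here is a small worked example on made-up labels (not data from this project). It also shows that accuracy can look good while F1 is mediocre when the positive class is rare.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Made-up labels: 2 bankrupt companies out of 10
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
# A hypothetical model that catches one bankruptcy and raises one false alarm
y_pred = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]

precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 1 / 2
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 1 / 2
f1 = f1_score(y_true, y_pred)                # 2 * P * R / (P + R)
accuracy = accuracy_score(y_true, y_pred)    # 8 correct out of 10

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```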

In [24]:
from sklearn.metrics import f1_score
y_DT=pipe_DT.predict(X_train)
y_LR=pipe_LR.predict(X_train)
y_svc=pipe_svc.predict(X_train)
print("{m:s} Model: f1 score={s:3.2f}\n".format(m="Decision Tree", s=f1_score(y_train, y_DT) ) )
print("{m:s} Model: f1 score={s:3.2f}\n".format(m="Logistic Regression", s=f1_score(y_train, y_LR)) )
print("{m:s} Model: f1 score={s:3.2f}\n".format(m="SVC", s=f1_score(y_train, y_svc)) )

Decision Tree Model: f1 score=0.15

Logistic Regression Model: f1 score=0.02

SVC Model: f1 score=0.59



We can see that the Logistic Regression model has a very poor Recall/Precision balance, and the other two models are also far from optimal.
Let's take a look at the data again.

In [25]:
y_train.mean()

Bankrupt    0.065243
dtype: float64

This tells us that in our training set only 6.5% of companies go bankrupt. The data is highly imbalanced, and we may be able to improve Recall and Precision by balancing the training data.

With balanced data the model sees more bankrupt cases during training, so it may be able to identify bankruptcies more accurately.
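The imbalance also explains why the accuracy scores above look high while the F1 scores are low. As a sketch on made-up labels with a similar 6.5% positive rate, a "model" that never predicts bankruptcy already reaches about 93.5% accuracy while finding no bankruptcies at all.

```python
import numpy as np

# Made-up labels with roughly the same imbalance as the training set: 65 bankrupt out of 1000
y = np.array([1] * 65 + [0] * 935)

# Baseline that always predicts "not bankrupt"
pred = np.zeros_like(y)

accuracy = (pred == y).mean()
recall = ((pred == 1) & (y == 1)).sum() / (y == 1).sum()

print(f"accuracy={accuracy:.3f} recall={recall:.3f}")
```

Any model worth keeping has to beat this baseline on Recall, not just on accuracy.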

# Improving our model based on the error analysis and the Recall/Precision trade-off

First, we resample the data to balance it

In [26]:
data_bankrupt_0 = traindata[traindata.Bankrupt==0]
data_bankrupt_1 = traindata[traindata.Bankrupt==1]
data_bankrupt_0

Unnamed: 0,X1,X2,X3,X4,X5,X6,X7,X8,X9,X10,...,X57,X58,X59,X60,X61,X62,X63,X64,Bankrupt,Id
109,0.019213,0.52630,-0.015068,0.89429,-88.844,0.019358,0.013060,0.88923,1.24840,0.46800,...,0.041054,0.80100,0.820000,7.2844,11.5030,107.0900,3.40820,0.556780,0,2917
1140,0.160410,0.69907,0.090965,1.13010,10.146,0.000000,0.198730,0.43047,3.44550,0.30093,...,0.533050,0.94243,0.000000,,4.5676,74.0570,4.92870,16.410000,0,4798
4399,0.089987,0.56987,0.097846,6.35790,54.840,0.000000,0.100760,0.75477,0.78787,0.43012,...,0.209210,0.88032,1.279100,,7.9970,8.4604,43.14200,0.891360,0,2904
4524,0.001961,0.48070,0.332930,1.74460,-82.067,0.028042,0.003307,1.08030,1.21650,0.51930,...,0.003776,0.99730,0.046356,2.0511,7.7519,134.1600,2.72070,5.530400,0,3872
2435,0.165180,0.20127,0.781430,4.88250,52.287,0.000000,0.204360,3.96850,1.31190,0.79873,...,0.206810,0.84584,0.000000,2.1170,3.7472,56.0000,6.51790,75.829000,0,1711
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4471,0.298400,0.27122,0.546070,3.01340,78.472,0.388460,0.368400,2.68700,2.88010,0.72878,...,0.409450,0.87237,0.000000,330.7100,6.7177,34.3730,10.61900,15.763000,0,5569
3095,0.174440,0.54906,0.280100,1.55000,-10.833,0.602800,0.217750,0.82129,1.05060,0.45094,...,0.386840,0.95183,0.088172,19.8080,9.8811,42.6080,8.56640,20.716000,0,1822
1367,-0.130150,0.34488,-0.062240,0.31906,-265.850,-0.387750,-0.130150,1.16780,0.28446,0.40275,...,-0.323150,3.51540,0.629350,23.5710,3.5718,720.7700,0.50641,0.047678,0,5388
1724,0.045643,0.50982,0.128820,1.43650,-46.091,0.000000,0.058781,0.96148,1.06330,0.49018,...,0.093114,0.94594,0.185170,4.3959,6.0577,101.3100,3.60270,1.846000,0,2196


We oversample the bankrupt class rather than undersample the non-bankrupt class, because undersampling would leave too little data to train on.  
There are 3152 non-bankrupt cases, so we draw 3152 bankrupt cases (with replacement) to balance the data.
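The mechanics of oversampling with replacement can be seen on a tiny made-up frame (the column `x` and the 8-to-2 split are assumptions for illustration).

```python
import pandas as pd
from sklearn.utils import resample

# Made-up frame: 8 non-bankrupt rows and 2 bankrupt rows
toy = pd.DataFrame({"x": range(10), "Bankrupt": [0] * 8 + [1] * 2})
majority = toy[toy.Bankrupt == 0]
minority = toy[toy.Bankrupt == 1]

# Draw minority rows with replacement until both classes have the same size
minority_up = resample(minority, replace=True, n_samples=len(majority), random_state=42)
balanced = pd.concat([majority, minority_up])

print(balanced.Bankrupt.value_counts())
print("repeated rows present:", balanced.index.duplicated().any())
```

Note that the balanced frame contains exact copies of the same minority rows, which matters when the balanced data is later split for validation.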

In [27]:
from sklearn.utils import resample
new_data_bankrupt = resample(data_bankrupt_1,replace = True, n_samples = 3152,random_state = 42)
balanced_traindata = pd.concat([data_bankrupt_0, new_data_bankrupt])

Take a look at the new training data

In [28]:
balanced_traindata

Unnamed: 0,X1,X2,X3,X4,X5,X6,X7,X8,X9,X10,...,X57,X58,X59,X60,X61,X62,X63,X64,Bankrupt,Id
109,0.019213,0.52630,-0.015068,0.89429,-88.8440,0.019358,0.013060,0.88923,1.248400,0.46800,...,0.041054,0.80100,0.820000,7.2844,11.503000,107.0900,3.408200,0.556780,0,2917
1140,0.160410,0.69907,0.090965,1.13010,10.1460,0.000000,0.198730,0.43047,3.445500,0.30093,...,0.533050,0.94243,0.000000,,4.567600,74.0570,4.928700,16.410000,0,4798
4399,0.089987,0.56987,0.097846,6.35790,54.8400,0.000000,0.100760,0.75477,0.787870,0.43012,...,0.209210,0.88032,1.279100,,7.997000,8.4604,43.142000,0.891360,0,2904
4524,0.001961,0.48070,0.332930,1.74460,-82.0670,0.028042,0.003307,1.08030,1.216500,0.51930,...,0.003776,0.99730,0.046356,2.0511,7.751900,134.1600,2.720700,5.530400,0,3872
2435,0.165180,0.20127,0.781430,4.88250,52.2870,0.000000,0.204360,3.96850,1.311900,0.79873,...,0.206810,0.84584,0.000000,2.1170,3.747200,56.0000,6.517900,75.829000,0,1711
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3090,-0.507060,0.15578,0.594650,4.81740,45.5530,0.000000,-0.507060,5.41940,2.088900,0.84420,...,-0.600640,1.15750,0.000000,16.4940,5.294200,27.2190,13.410000,8.369900,1,5162
2154,-0.682290,4.77090,0.235260,1.64010,1032.2000,-3.277000,-0.682290,-0.79040,0.031879,-3.77090,...,0.180940,22.40200,-1.167700,,0.053105,4208.3000,0.086734,0.080263,1,1776
234,0.003961,0.69863,-0.000999,0.99768,-78.5210,-0.019608,0.007031,0.43138,1.304600,0.30137,...,0.013144,0.99506,0.522540,4.9624,8.207700,120.5800,3.027100,2.288600,1,4155
677,-0.405880,2.15750,-1.576300,0.25320,-287.5200,-0.405880,-0.405880,-0.61162,0.872050,-1.31960,...,0.307590,1.14670,-0.035435,9.1461,7.719200,371.9100,0.981410,4.449400,1,2202


This looks good, but all bankrupt cases are at the end because of the way we resampled, so we need to shuffle the rows into random order.
We also need to confirm that about 50% of the training examples are bankrupt.

In [29]:
balanced_traindata = sklearn.utils.shuffle(balanced_traindata, random_state = 42)
balanced_traindata.Bankrupt.mean()

0.5

Now construct `X_train` and `y_train`

In [30]:
X_train, y_train = balanced_traindata.drop(columns=["Bankrupt","Id"] ), balanced_traindata[ ["Bankrupt"] ]

Since our training set has changed, we need to re-tune and re-test our models to see whether performance has improved

Logistic regression model

In [31]:
LR = LogisticRegression()
param_grid_LR = {'model__C': [0.00001,0.001,0.1,1,10, 50, 100],
             'model__penalty':('l1', 'l2', 'elasticnet')}
pipe_LR = Pipeline(steps=[('scaler', scaler),('imputer', imp),
                           ('model', LR)])
grid_LR = GridSearchCV(pipe_LR, param_grid_LR)
grid_LR.fit(X_train, y_train)
print(grid_LR.best_params_)

GridSearchCV(estimator=Pipeline(steps=[('scaler', StandardScaler()),
                                       ('imputer',
                                        SimpleImputer(strategy='median')),
                                       ('model', LogisticRegression())]),
             param_grid={'model__C': [1e-05, 0.001, 0.1, 1, 10, 50, 100],
                         'model__penalty': ('l1', 'l2', 'elasticnet')})

{'model__C': 10, 'model__penalty': 'l2'}


In [32]:
optimal_LR = LogisticRegression(C=10,penalty='l2')
pipe_LR = Pipeline(steps=[('imputer', imp),('scaler', scaler),
                           ('model', optimal_LR)])
pipe_LR.fit(X_train,y_train)
y_pred = pipe_LR.predict(X_train)
print("Logistic Regression Model in_sample_score is: ",accuracy_score(y_train, y_pred))

Pipeline(steps=[('imputer', SimpleImputer(strategy='median')),
                ('scaler', StandardScaler()),
                ('model', LogisticRegression(C=10))])

Logistic Regression Model in_sample_score is:  0.7991751269035533


SVC model

In [33]:
svc = SVC(kernel='rbf')
pipe_svc = Pipeline(steps=[('scaler', scaler),('imputer', imp),
                           ('model', svc)])
param_grid_svc = {'model__C': [ 0.1, 1 , 5],
              'model__gamma': [0.01 ,0.02,0.05,0.2, 1 ]}

grid_svc = GridSearchCV(pipe_svc, param_grid_svc)
grid_svc.fit(X_train, y_train)
print(grid_svc.best_params_)

GridSearchCV(estimator=Pipeline(steps=[('scaler', StandardScaler()),
                                       ('imputer',
                                        SimpleImputer(strategy='median')),
                                       ('model', SVC())]),
             param_grid={'model__C': [0.1, 1, 5],
                         'model__gamma': [0.01, 0.02, 0.05, 0.2, 1]})

{'model__C': 5, 'model__gamma': 1}


In [34]:
optimal_svc = SVC(kernel='rbf',C=5,gamma=1)
pipe_svc = Pipeline(steps=[('imputer', imp),('scaler', scaler),
                           ('model', optimal_svc)])
pipe_svc.fit(X_train,y_train)
y_pred = pipe_svc.predict(X_train)
print("SVC Model in_sample_score is: ",accuracy_score(y_train, y_pred))

Pipeline(steps=[('imputer', SimpleImputer(strategy='median')),
                ('scaler', StandardScaler()), ('model', SVC(C=5, gamma=1))])

SVC Model in_sample_score is:  0.9915926395939086


Decision tree model

In [35]:
DT = DecisionTreeClassifier(random_state = 42)
pipe_DT = Pipeline(steps=[('scaler', scaler),('imputer', imp),
                          ('model', DT)])
param_grid_DT = {'model__max_features': [2,4,7,10,15,20,30,50],
              'model__criterion': ['gini', 'entropy'],
                'model__max_depth' : [2,4,7,10,15,20,30,50]}
grid_DT = GridSearchCV(pipe_DT, param_grid_DT)
grid_DT.fit(X_train, y_train)
print(grid_DT.best_params_)

GridSearchCV(estimator=Pipeline(steps=[('scaler', StandardScaler()),
                                       ('imputer',
                                        SimpleImputer(strategy='median')),
                                       ('model',
                                        DecisionTreeClassifier(random_state=42))]),
             param_grid={'model__criterion': ['gini', 'entropy'],
                         'model__max_depth': [2, 4, 7, 10, 15, 20, 30, 50],
                         'model__max_features': [2, 4, 7, 10, 15, 20, 30, 50]})

{'model__criterion': 'gini', 'model__max_depth': 30, 'model__max_features': 30}


In [36]:
optimal_DT = DecisionTreeClassifier(criterion = "gini", random_state = 42,
                           max_depth = 30,
                           max_features = 30)
pipe_DT = Pipeline(steps=[('imputer', imp),('scaler', scaler),
                           ('model', optimal_DT)])
pipe_DT.fit(X_train,y_train)
y_pred = pipe_DT.predict(X_train)
from sklearn.metrics import accuracy_score
print("Decision Tree Model in_sample_score is: ",accuracy_score(y_train, y_pred))

Pipeline(steps=[('imputer', SimpleImputer(strategy='median')),
                ('scaler', StandardScaler()),
                ('model',
                 DecisionTreeClassifier(max_depth=30, max_features=30,
                                        random_state=42))])

Decision Tree Model in_sample_score is:  0.9980964467005076


Now let's look at our models' performance again

In [37]:
cross_val_scores = cross_val_score(pipe_DT, X_train, y_train, cv=5)
print("{m:s} Model: avg cross validation score={s:3.2f}\n".format(m="Decision Tree", s=cross_val_scores.mean()) )
cross_val_scores = cross_val_score(pipe_LR, X_train, y_train, cv=5)
print("{m:s} Model: avg cross validation score={s:3.2f}\n".format(m="Logistic Regression", s=cross_val_scores.mean()) )
cross_val_scores = cross_val_score(pipe_svc, X_train, y_train, cv=5)
print("{m:s} Model: avg cross validation score={s:3.2f}\n".format(m="SVC", s=cross_val_scores.mean()) )

Decision Tree Model: avg cross validation score=0.98

Logistic Regression Model: avg cross validation score=0.80

SVC Model: avg cross validation score=0.98



In [38]:
y_DT=pipe_DT.predict(X_train)
y_LR=pipe_LR.predict(X_train)
y_svc=pipe_svc.predict(X_train)
print("{m:s} Model: f1 score={s:3.2f}\n".format(m="Decision Tree", s=f1_score(y_train, y_DT) ) )
print("{m:s} Model: f1 score={s:3.2f}\n".format(m="Logistic Regression", s=f1_score(y_train, y_LR)) )
print("{m:s} Model: f1 score={s:3.2f}\n".format(m="SVC", s=f1_score(y_train, y_svc)) )

Decision Tree Model: f1 score=1.00

Logistic Regression Model: f1 score=0.80

SVC Model: f1 score=0.99



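We can see that the F1 score of all three models has improved a lot. Resampling to balance the data improved our in-sample performance significantly. Note that these scores are measured on the oversampled training data, where copies of the same bankrupt rows can appear in both the training and validation folds, so they probably overstate performance on unseen data.

Based on the F1 score and the cross-validation score, I think the Decision Tree model is slightly better for this case.
Now I will tune the model again: in the previous grid the candidate values were far apart, so I search a finer grid around the best values to avoid settling on a coarse boundary value.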
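One way to check a model trained on oversampled data more strictly is to oversample inside each cross-validation fold, so the validation fold contains only original rows. The sketch below is an alternative evaluation, not the procedure used in this notebook. It runs on synthetic data from `make_classification`, and the helper `oversample` and the tree depth are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

# Synthetic imbalanced data: roughly 10% positives
X, y = make_classification(n_samples=600, n_features=10,
                           weights=[0.9, 0.1], random_state=0)

def oversample(X_part, y_part, seed):
    """Hypothetical helper: repeat positive rows until both classes are the same size."""
    pos = np.flatnonzero(y_part == 1)
    neg = np.flatnonzero(y_part == 0)
    pos_up = resample(pos, replace=True, n_samples=len(neg), random_state=seed)
    idx = np.concatenate([neg, pos_up])
    return X_part[idx], y_part[idx]

fold_f1 = []
splitter = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, val_idx in splitter.split(X, y):
    # Balance the training fold only; the validation fold keeps its natural class ratio
    X_bal, y_bal = oversample(X[train_idx], y[train_idx], seed=0)
    model = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_bal, y_bal)
    fold_f1.append(f1_score(y[val_idx], model.predict(X[val_idx])))

print("mean out-of-fold F1:", np.mean(fold_f1))
```

Scores obtained this way are usually lower than scores computed on the balanced data itself, and they are a closer proxy for what the holdout evaluation will report.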

In [None]:
DT = DecisionTreeClassifier(random_state = 42)
pipe_DT = Pipeline(steps=[('scaler', scaler),('imputer', imp),
                          ('model', DT)])
param_grid_DT = {'model__max_features': [26,27,28,29,30,31,32,33,34],
              'model__criterion': ['gini', 'entropy'],
                'model__max_depth' : [28,29,30,31,32,33,34,35]}
grid_DT = GridSearchCV(pipe_DT, param_grid_DT)
grid_DT.fit(X_train, y_train)
print(grid_DT.best_params_)

In [None]:
optimal_DT = DecisionTreeClassifier(criterion = "gini", random_state = 42,max_depth = 30,max_features = 33)
pipe_DT = Pipeline(steps=[('imputer', imp),('scaler', scaler),('model', optimal_DT)])
pipe_DT.fit(X_train,y_train)
cross_val_scores = cross_val_score(pipe_DT, X_train, y_train, cv=5)
print("{m:s} Model: avg cross validation score={s:3.2f}\n".format(m="Decision Tree", s=cross_val_scores.mean()) )

So pipe_DT is our best model.

## Submission guidelines

Although your notebook may contain many models (e.g., due to your iterative development)
we will only evaluate a single model.
So choose one (explain why !) and do the following.

- You will implement the body of a subroutine `MyModel`
    - That takes as argument a Pandas DataFrame 
        - Each row is an example on which to predict
        - The features of the example are elements of the row
    - Performs predictions on each example
    - Returns an array of predictions with a one-to-one correspondence with the examples in the test set
    

We will evaluate your model against the holdout data
- By reading the holdout examples `X_hold` (as above)
- Calling `y_hold_pred = MyModel(X_hold)` to get the predictions
- Comparing the predicted values `y_hold_pred` against the true labels `y_hold` which are known only to the instructors

See the following cell as an illustration

**Remember**

The holdout data is in the same format as the one we used for training
- Except that it has no attribute for the target
- So you will need to perform all the transformations on the holdout data
    - As you did on the training data
    - Including turning the string representation of numbers into actual numeric data types
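As a sketch of what "the same transformations" can look like in code, the hypothetical helper below cleans a new frame and returns the feature columns in the training order. The name `prepare_features`, the toy frame, and the `?` token are assumptions for illustration.

```python
import pandas as pd

def prepare_features(raw, feature_names):
    """Hypothetical helper: apply the training-time cleaning to a new frame."""
    X = raw.drop(columns=["Id"], errors="ignore")  # the identifier is not a feature
    X = X.apply(pd.to_numeric, errors="coerce")    # string-encoded numbers to floats, bad values to NaN
    return X[feature_names]                        # same columns, same order as at training time

# Made-up holdout-like frame with columns in a different order and one bad entry
holdout_like = pd.DataFrame({"X2": ["1.0", "?"], "X1": ["0.5", "0.7"], "Id": [10, 11]})
X_ready = prepare_features(holdout_like, ["X1", "X2"])
print(X_ready)
```

Selecting columns by name guards against a holdout file whose columns arrive in a different order than the training file.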

All of this work *must* be performed within the body of the `MyModel` routine you will write

We will grade you by comparing the predictions array you create to the answers known to us.

In [None]:

import pandas as pd
import os

def MyModel(X):
    # It should create an array of predictions; we initialize it to the empty array for convenience
    predictions = []
    
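    # YOUR CODE GOES HERE
    # Apply the same preparation used for the training data:
    # drop the identifier, then convert string-encoded numbers (unparseable values become NaN)
    X = X.drop(columns = "Id")
    X = X.apply(pd.to_numeric, errors='coerce')
    # pipe_DT imputes, scales, and predicts with the tuned Decision Tree
    predictions = pipe_DT.predict(X)
    return predictions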



# Check your work: predict and evaluate metrics on *your* test examples

Although only the instructors have the correct labels for the holdout dataset, you may want
to create your own test dataset on which to evaluate your out of sample metrics.

If you choose to do so, you can evaluate your models using the same metrics that the instructors will use.

- Test whether your implementation of `MyModel` works
- See the metrics  your model produces

The following cell
- Assumes that you have created `X_test, y_test` as your proxy for an out of sample dataset
    - It serves the same function as `X_hold`, the holdout dataset, but you have the associated target (only the instructors have `y_hold`)

In [None]:
# Give the model a name (will appear in the print statement)
from sklearn.metrics import recall_score
from sklearn.metrics import precision_score
name = "Decision Tree Model"

y_test_pred = MyModel(X_test)

accuracy_test = accuracy_score(y_test, y_test_pred)
recall_test = recall_score(y_test, y_test_pred, pos_label=1, average="binary")
precision_test = precision_score(y_test,   y_test_pred, pos_label=1, average="binary")

print("\t{m:s} Accuracy: {a:3.1%}, Recall {r:3.1%}, Precision {p:3.1%}".format(m=name,
                                                                            a=accuracy_test,
                                                                            r=recall_test,
                                                                            p=precision_test
                                                                            )
         )

In [None]:
print("Done")