## Dataset description
The dataset for Lab-2 is a sample of the SBA dataset posted on Kaggle.
The dataset is from the U.S. Small Business Administration (SBA). The U.S. SBA was founded in 1953 on the principle of promoting and assisting small enterprises in the U.S. credit market (SBA Overview and History, US Small Business Administration (2015)). Small businesses have been a primary source of job creation in the United States; therefore, fostering small business formation and growth has social benefits by creating job opportunities and reducing unemployment. There have been many success stories of start-ups receiving SBA loan guarantees, such as FedEx and Apple Computer. However, there have also been stories of small businesses and/or start-ups that have defaulted on their SBA-guaranteed loans.  
More info on the original dataset: https://www.kaggle.com/mirbektoktogaraev/should-this-loan-be-approved-or-denied


## Preparation

Use the dataset provided in eLearning.

In [1]:
import pandas as pd
pd.set_option('display.max_columns', 1500)

import warnings
warnings.filterwarnings('ignore')

#Extend cell width
from IPython.display import display, HTML
display(HTML("<style>.container { width:80% !important; }</style>"))

In [2]:
"""
Created on Mon Mar 18 18:25:50 2019

@author: Uri Smashnov

Purpose: Analyze input Pandas DataFrame and return stats per column
Details: The function calculates levels for categorical variables and returns summarized information for analysis

To view wide table set following Pandas options:
pd.set_option('display.width', 1000)
pd.set_option('max_colwidth',200)
"""
import pandas as pd
def describe_more(df, normalize_ind=False, weight_column=None, skip_columns=None, dropna=True):
    # Avoid a mutable default argument; treat None as "skip nothing".
    if skip_columns is None:
        skip_columns = []
    var = []; l = []; t = []; unq = []; min_l = []; max_l = []
    assert isinstance(skip_columns, list), "Argument skip_columns should be list"
    if weight_column is not None:
        if weight_column not in list(df.columns):
            raise AssertionError('weight_column is not a valid column name in the input DataFrame')
      
    for x in df:
        if x in skip_columns:
            pass
        else:
            var.append( x )
            # Count distinct observed values once (zero-count categories are excluded).
            counts = df[x].value_counts(dropna=dropna)
            uniq_counts = len(counts[counts > 0])
            l.append(uniq_counts)
            t.append( df[ x ].dtypes )
            min_l.append(df[x].apply(str).str.len().min())
            max_l.append(df[x].apply(str).str.len().max())
            if weight_column is not None and x not in skip_columns:
                df2 = df.groupby(x).agg({weight_column: 'sum'}).sort_values(weight_column, ascending=False)
                df2['authtrans_vts_cnt']=((df2[weight_column])/df2[weight_column].sum()).round(2)
                unq.append(df2.head(n=100).to_dict()[weight_column])
            else:
                df_cat_d = df[x].value_counts(normalize=normalize_ind,dropna=dropna).round(decimals=2)
                df_cat_d = df_cat_d[df_cat_d>0]
                #unq.append(df[x].value_counts().iloc[0:100].to_dict())
                unq.append(df_cat_d.iloc[0:100].to_dict())
            
    levels = pd.DataFrame( { 'A_Variable' : var , 'Levels' : l , 'Datatype' : t ,
                             'Min Length' : min_l,
                             'Max Length': max_l,
                             'Level_Values' : unq} )
    #levels.sort_values( by = 'Levels' , inplace = True )
    return levels

### Load data

In [3]:
data = pd.read_csv(r'C:\Users\General\Documents\UTD\Semester 3\Applied Machine Learning\Projects\SBA_loans_project_1\SBA_loans_project_1.csv')

In [4]:
print("Data shape:", data.shape)

Data shape: (809247, 20)


In [5]:
data.head()

Unnamed: 0,City,State,Zip,Bank,BankState,NAICS,Term,NoEmp,NewExist,CreateJob,RetainedJob,FranchiseCode,UrbanRural,RevLineCr,LowDoc,DisbursementGross,BalanceGross,GrAppv,SBA_Appv,MIS_Status
0,GLEN BURNIE,MD,21060,"BUSINESS FINANCE GROUP, INC.",VA,811111,240,7,1.0,6,7,1,1,0,N,"$743,000.00",$0.00,"$743,000.00","$743,000.00",P I F
1,WEST BEND,WI,53095,JPMORGAN CHASE BANK NATL ASSOC,IL,722410,240,20,1.0,0,0,1,0,N,N,"$137,000.00",$0.00,"$137,000.00","$109,737.00",P I F
2,SAN DIEGO,CA,92128,UMPQUA BANK,OR,0,120,2,1.0,0,0,1,0,0,N,"$280,000.00",$0.00,"$280,000.00","$210,000.00",P I F
3,WEBSTER,MA,1570,HOMETOWN BANK A CO-OPERATIVE B,MA,621310,84,7,1.0,0,0,1,1,0,Y,"$144,500.00",$0.00,"$144,500.00","$122,825.00",P I F
4,JOPLIN,MO,64804,U.S. BANK NATIONAL ASSOCIATION,OH,0,60,2,2.0,0,0,1,0,N,Y,"$52,500.00",$0.00,"$52,500.00","$42,000.00",P I F


**Review dataset**

In [6]:
desc_df = describe_more(data)
desc_df

Unnamed: 0,A_Variable,Levels,Datatype,Min Length,Max Length,Level_Values
0,City,31320,object,1,30,"{'LOS ANGELES': 10372, 'HOUSTON': 9260, 'NEW Y..."
1,State,51,object,2,3,"{'CA': 117341, 'TX': 63425, 'NY': 51877, 'FL':..."
2,Zip,32731,int64,1,5,"{10001: 841, 90015: 830, 93401: 729, 90010: 65..."
3,Bank,5716,object,3,30,"{'BANK OF AMERICA NATL ASSOC': 78111, 'WELLS F..."
4,BankState,55,object,2,3,"{'CA': 106293, 'NC': 71557, 'IL': 59258, 'OH':..."
5,NAICS,1307,int64,1,6,"{0: 181845, 722110: 25217, 722211: 17476, 8111..."
6,Term,407,int64,1,3,"{84: 207228, 60: 80965, 240: 77385, 120: 69852..."
7,NoEmp,581,int64,1,4,"{1: 138836, 2: 124470, 3: 81466, 4: 66306, 5: ..."
8,NewExist,3,float64,3,3,"{1.0: 580478, 2.0: 227709, 0.0: 932}"
9,CreateJob,234,int64,1,4,"{0: 566148, 1: 56789, 2: 52162, 3: 25945, 4: 1..."
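As a quick cross-check of the `Levels` and `Max Length` columns, a similar summary can be sketched with built-in pandas calls. The toy frame and the `summary` name below are illustrative assumptions, not part of the lab data or the helper above.

```python
import pandas as pd

# Toy frame standing in for the SBA sample (values are made up).
toy = pd.DataFrame({
    "State": ["CA", "TX", "CA", "NY"],
    "Term": [84, 60, 84, 240],
})

summary = pd.DataFrame({
    "Levels": toy.nunique(dropna=True),        # distinct non-null values per column
    "Datatype": toy.dtypes.astype(str),        # dtype name per column
    # Longest string representation per column
    "Max Length": toy.astype(str).apply(lambda s: s.str.len().max()),
})
print(summary)
```

This does not reproduce the `Level_Values` dictionary; it only covers the scalar columns.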


## Dataset preparation and clean-up

Modify and clean up the dataset as follows:
- Replace or encode NA/null values
- Convert the strings styled as '$XXXX.XX' to float values. Columns = ['DisbursementGross', 'BalanceGross', 'GrAppv', 'SBA_Appv']
- Convert MIS_Status to 0/1. Make value "CHGOFF" as 1

***Part 1: Replace or encode NA/null values***

In [7]:
values_to_fill = {}
for col in data.drop(columns=['MIS_Status']).columns:
    if data[col].dtype == 'object':
        values_to_fill[col] = "Missing"
    else:
        values_to_fill[col] = 0

data.fillna(value=values_to_fill,inplace=True)

In [8]:
for col in data.columns:
    if data[col].isna().any():
        print("Column(s) with missing value(s) in data is:", col, "of type" , data[col].dtype.name)

Column(s) with missing value(s) in data is: MIS_Status of type object
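The fill logic above can be tried on a small frame to see why only the target column still reports missing values. This is a minimal sketch on made-up data; it uses `pd.api.types.is_numeric_dtype` instead of the `dtype == 'object'` check, which is a substitution made here for robustness, not the notebook's original test.

```python
import numpy as np
import pandas as pd

# Made-up rows: one missing value in each column.
toy = pd.DataFrame({
    "Bank": ["A", None, "C"],
    "NoEmp": [1.0, np.nan, 3.0],
    "MIS_Status": ["P I F", None, "CHGOFF"],
})

features = toy.drop(columns=["MIS_Status"])
# Numeric columns are filled with 0, everything else with the string "Missing".
fill_values = {
    col: 0 if pd.api.types.is_numeric_dtype(features[col]) else "Missing"
    for col in features.columns
}

# The target is not in the dictionary, so its missing values are left alone.
filled = toy.fillna(value=fill_values)
print(filled)
```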


In [9]:
data['MIS_Status'].unique()

array(['P I F', 'CHGOFF', nan], dtype=object)

In [10]:
len(data[data.isna().any(axis=1)]['MIS_Status'])

1822

##### There are 1822 records with NaN values in the MIS_Status column. These rows are dropped from the dataset.

In [11]:
data = data.dropna()

***Part 2: Convert the strings styled as '$XXXX.XX' to float values. Columns = ['DisbursementGross', 'BalanceGross', 'GrAppv', 'SBA_Appv']***

In [12]:
col_toFloat = ['DisbursementGross', 'BalanceGross', 'GrAppv', 'SBA_Appv']

##### Remove spaces, commas, and the '$' sign from the values in those columns, then convert them to float

In [13]:
for col in col_toFloat:
    data[col] = data[col].apply(lambda x: (x.replace(' ','')))
    data[col] = data[col].apply(lambda x: (x.replace(',','')))
    data[col] = data[col].apply(lambda x: (x.replace('$','')))
    data[col] = data[col].astype(float)

In [14]:
data[col_toFloat].dtypes

DisbursementGross    float64
BalanceGross         float64
GrAppv               float64
SBA_Appv             float64
dtype: object
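A vectorized variant of the same conversion is sketched below on made-up strings. It assumes the only non-numeric characters are whitespace, `$`, and `,`; `pd.to_numeric(..., errors="coerce")` is used here (an addition, not the notebook's approach) so that unexpected values become NaN instead of raising.

```python
import pandas as pd

# Made-up values in the '$XXXX.XX' style, plus one malformed entry.
raw = pd.Series(["$743,000.00 ", "$0.00", "n/a"])

# Strip whitespace, dollar signs and thousands separators in one pass.
cleaned = raw.str.replace(r"[\s$,]", "", regex=True)

# Coerce anything that still is not a number to NaN so it can be inspected.
parsed = pd.to_numeric(cleaned, errors="coerce")
print(parsed)
```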

***Part 3: Convert MIS_Status to 0/1. Make value "CHGOFF" as 1***

In [15]:
data['MIS_Status'] = data['MIS_Status'].map({'P I F':0, 'CHGOFF':1}).astype(int)
data['MIS_Status'].unique()

array([0, 1])

In [16]:
data.head()

Unnamed: 0,City,State,Zip,Bank,BankState,NAICS,Term,NoEmp,NewExist,CreateJob,RetainedJob,FranchiseCode,UrbanRural,RevLineCr,LowDoc,DisbursementGross,BalanceGross,GrAppv,SBA_Appv,MIS_Status
0,GLEN BURNIE,MD,21060,"BUSINESS FINANCE GROUP, INC.",VA,811111,240,7,1.0,6,7,1,1,0,N,743000.0,0.0,743000.0,743000.0,0
1,WEST BEND,WI,53095,JPMORGAN CHASE BANK NATL ASSOC,IL,722410,240,20,1.0,0,0,1,0,N,N,137000.0,0.0,137000.0,109737.0,0
2,SAN DIEGO,CA,92128,UMPQUA BANK,OR,0,120,2,1.0,0,0,1,0,0,N,280000.0,0.0,280000.0,210000.0,0
3,WEBSTER,MA,1570,HOMETOWN BANK A CO-OPERATIVE B,MA,621310,84,7,1.0,0,0,1,1,0,Y,144500.0,0.0,144500.0,122825.0,0
4,JOPLIN,MO,64804,U.S. BANK NATIONAL ASSOCIATION,OH,0,60,2,2.0,0,0,1,0,N,Y,52500.0,0.0,52500.0,42000.0,0


#### Feature Engineering

I added the two features below to see whether the extra information helps the model:
- LoanDisbursedPerCity = total DisbursementGross disbursed to small businesses in each city
- LoanPaid = DisbursementGross - BalanceGross

In [17]:
import numpy as np

In [18]:
# Sum DisbursementGross per city and broadcast the total back to every row of that city
data['LoanDisbursedPerCity'] = data.groupby('City')['DisbursementGross'].transform('sum')

In [19]:
data['LoanPaid'] = data['DisbursementGross'] - data['BalanceGross']

In [20]:
data.head(n=3)

Unnamed: 0,City,State,Zip,Bank,BankState,NAICS,Term,NoEmp,NewExist,CreateJob,RetainedJob,FranchiseCode,UrbanRural,RevLineCr,LowDoc,DisbursementGross,BalanceGross,GrAppv,SBA_Appv,MIS_Status,LoanDisbursedPerCity,LoanPaid
0,GLEN BURNIE,MD,21060,"BUSINESS FINANCE GROUP, INC.",VA,811111,240,7,1.0,6,7,1,1,0,N,743000.0,0.0,743000.0,743000.0,0,34559260.0,743000.0
1,WEST BEND,WI,53095,JPMORGAN CHASE BANK NATL ASSOC,IL,722410,240,20,1.0,0,0,1,0,N,N,137000.0,0.0,137000.0,109737.0,0,34561890.0,137000.0
2,SAN DIEGO,CA,92128,UMPQUA BANK,OR,0,120,2,1.0,0,0,1,0,0,N,280000.0,0.0,280000.0,210000.0,0,1246528000.0,280000.0
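To make the two engineered features concrete, here is a minimal sketch on three made-up loans. It shows that `transform` returns one value per row (the city total), so it can be assigned straight back as a column.

```python
import pandas as pd

# Three made-up loans, two of them in the same city.
toy = pd.DataFrame({
    "City": ["JOPLIN", "WEBSTER", "JOPLIN"],
    "DisbursementGross": [100.0, 50.0, 25.0],
    "BalanceGross": [0.0, 10.0, 0.0],
})

# Each row receives the total disbursed in its own city.
toy["LoanDisbursedPerCity"] = toy.groupby("City")["DisbursementGross"].transform("sum")
# Amount disbursed minus the outstanding balance.
toy["LoanPaid"] = toy["DisbursementGross"] - toy["BalanceGross"]
print(toy)
```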


In [21]:
data.dtypes

City                     object
State                    object
Zip                       int64
Bank                     object
BankState                object
NAICS                     int64
Term                      int64
NoEmp                     int64
NewExist                float64
CreateJob                 int64
RetainedJob               int64
FranchiseCode             int64
UrbanRural                int64
RevLineCr                object
LowDoc                   object
DisbursementGross       float64
BalanceGross            float64
GrAppv                  float64
SBA_Appv                float64
MIS_Status                int32
LoanDisbursedPerCity    float64
LoanPaid                float64
dtype: object

## Splitting dataset into Train and Test

- Separate the dataset into the target (y) and the independent variables (X)
- Split the dataset into a training set (80%) and a test set (20%)
- No separate validation set is created, because cross-validation will be performed on the chosen model

In [22]:
data.shape

(807425, 22)

In [23]:
y = data['MIS_Status']
X = data.drop(['MIS_Status'], axis=1)

In [24]:
from sklearn.model_selection import train_test_split

# Split the data into training (80%) and test (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X,y, train_size=0.8)


In [25]:
print("Dimensions of train dataset: ",X_train.shape)
print("Dimensions of test dataset: ",X_test.shape)


Dimensions of train dataset:  (645940, 21)
Dimensions of test dataset:  (161485, 21)
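The split above is neither seeded nor stratified, so the exact rows (and the metrics further down) change between runs. A reproducible, stratified variant could look like the sketch below; the synthetic arrays and the seed value 42 are assumptions for illustration, and this is not the split that produced the numbers in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with roughly the same class mix as MIS_Status (18% positives).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 180 + [0] * 820)

# stratify keeps the class ratio in both parts; random_state fixes the shuffle.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.8, stratify=y_demo, random_state=42
)
print(X_tr.shape, X_te.shape, y_tr.mean(), y_te.mean())
```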


In [26]:
X_train.isna().sum()

City                    0
State                   0
Zip                     0
Bank                    0
BankState               0
NAICS                   0
Term                    0
NoEmp                   0
NewExist                0
CreateJob               0
RetainedJob             0
FranchiseCode           0
UrbanRural              0
RevLineCr               0
LowDoc                  0
DisbursementGross       0
BalanceGross            0
GrAppv                  0
SBA_Appv                0
LoanDisbursedPerCity    0
LoanPaid                0
dtype: int64

In [27]:
X_test.isna().sum()

City                    0
State                   0
Zip                     0
Bank                    0
BankState               0
NAICS                   0
Term                    0
NoEmp                   0
NewExist                0
CreateJob               0
RetainedJob             0
FranchiseCode           0
UrbanRural              0
RevLineCr               0
LowDoc                  0
DisbursementGross       0
BalanceGross            0
GrAppv                  0
SBA_Appv                0
LoanDisbursedPerCity    0
LoanPaid                0
dtype: int64

## Categorical variables encoding

Encode categorical variables using either one of the techniques below. Don't use LabelEncoder.
- One-hot-encoder for variables with fewer than 10 valid values. Name your new columns "Original_name"_valid_value
- (If using sklearn) Target encoder from the following library: https://contrib.scikit-learn.org/category_encoders/index.html . Name your new column "Original_name"_trg

Installed the category_encoders:

In [28]:
pip install category_encoders

Note: you may need to restart the kernel to use updated packages.




### Target Encoding

- Performing target encoding on all categorical variables.
- Renaming the encoded columns as "ColumnName_trg"
- Creating a dictionary of target encoded columns with key = ColumnName and value = (encoder, "target") so that the encoder can be saved.
- Also, dropping the original columns from the dataframe which have been encoded

In [29]:
from category_encoders import TargetEncoder
from copy import deepcopy

cat_encoders = {}

for col in X_train.columns:
    
    if X_train[col].dtype == 'object':
        print("Performing Target Encoding on column: ", col)
        encoder = TargetEncoder()
        X_train[col + '_trg'] = encoder.fit_transform(X_train[col], y_train)
        X_train.drop([col], axis=1, inplace=True)

        X_test[col + '_trg'] = encoder.transform(X_test[col])
        X_test.drop([col], axis=1, inplace=True)
        
        cat_encoders[col] = [deepcopy(encoder),"target"]
        
print("Finished performing target encoding on categorical columns")

Performing Target Encoding on column:  City
Performing Target Encoding on column:  State
Performing Target Encoding on column:  Bank
Performing Target Encoding on column:  BankState
Performing Target Encoding on column:  RevLineCr
Performing Target Encoding on column:  LowDoc
Finished performing target encoding on categorical columns
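The idea behind target encoding can be sketched by hand with plain pandas: each category is replaced by the mean of the target observed for it in the training data. This is a simplified illustration on made-up rows; `TargetEncoder` from `category_encoders` additionally smooths each category mean toward the global mean, so its values will generally differ from the plain means computed here.

```python
import pandas as pd

# Made-up training rows and two test rows, one with an unseen category.
train = pd.DataFrame({"State": ["CA", "CA", "TX", "TX", "TX"],
                      "target": [1, 0, 0, 0, 1]})
test = pd.DataFrame({"State": ["CA", "NY"]})

# Statistics are learned from the training data only.
global_mean = train["target"].mean()
state_means = train.groupby("State")["target"].mean()

train["State_trg"] = train["State"].map(state_means)
# Categories never seen in training fall back to the global training mean.
test["State_trg"] = test["State"].map(state_means).fillna(global_mean)
print(test)
```

Fitting on the training rows and only applying the learned means to the test rows mirrors the `fit_transform` / `transform` split used in the cell above.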


In [30]:
X_train.head()

Unnamed: 0,Zip,NAICS,Term,NoEmp,NewExist,CreateJob,RetainedJob,FranchiseCode,UrbanRural,DisbursementGross,BalanceGross,GrAppv,SBA_Appv,LoanDisbursedPerCity,LoanPaid,City_trg,State_trg,Bank_trg,BankState_trg,RevLineCr_trg,LowDoc_trg
409416,81657,423460,84,1,2.0,0,1,0,2,48177.0,0.0,48800.0,41480.0,11236077.0,48177.0,0.170213,0.178873,0.174162,0.158269,0.147223,0.187178
150409,95838,339950,120,44,1.0,0,44,1,1,598000.0,0.0,598000.0,350673.0,596161424.0,598000.0,0.163606,0.184869,0.01188,0.22088,0.149455,0.187178
356496,50438,321920,102,5,1.0,0,0,1,2,21000.0,0.0,21000.0,17850.0,14760689.0,21000.0,0.15493,0.115711,0.02439,0.106005,0.147223,0.187178
77256,68104,624110,84,1,1.0,1,1,1,1,90000.0,0.0,90000.0,76500.0,318819272.0,90000.0,0.136063,0.117317,0.134328,0.104845,0.149455,0.187178
370389,83709,0,84,14,1.0,0,0,1,0,107879.0,0.0,150000.0,135000.0,280459901.0,107879.0,0.116448,0.141069,0.275229,0.292642,0.147223,0.187178


In [31]:
X_test.head()

Unnamed: 0,Zip,NAICS,Term,NoEmp,NewExist,CreateJob,RetainedJob,FranchiseCode,UrbanRural,DisbursementGross,BalanceGross,GrAppv,SBA_Appv,LoanDisbursedPerCity,LoanPaid,City_trg,State_trg,Bank_trg,BankState_trg,RevLineCr_trg,LowDoc_trg
554048,14420,332999,240,9,1.0,3,9,1,2,157000.0,0.0,157000.0,157000.0,6884232.0,157000.0,0.191489,0.200194,0.0,0.169201,0.149455,0.187178
193174,46220,235420,240,10,1.0,0,10,1,1,255000.0,0.0,255000.0,255000.0,516649745.0,255000.0,0.172785,0.175878,0.0,0.092939,0.147223,0.187178
114504,50025,0,120,4,1.0,0,0,1,0,100000.0,0.0,100000.0,90000.0,3327707.0,100000.0,0.208333,0.115711,0.000435,0.106005,0.147223,0.187178
390737,70115,236220,37,1,1.0,5,1,0,1,25000.0,0.0,25000.0,22500.0,6289000.0,25000.0,0.307691,0.18024,0.259661,0.175917,0.147223,0.187178
731516,54302,238210,96,5,2.0,5,0,1,1,1040338.0,0.0,1040338.0,780253.0,216166944.0,1040338.0,0.111276,0.124811,0.34,0.119982,0.149455,0.187178


### Using MinMax scaler to scale numerical variables

In [32]:
print("The numerical variables excluding encoded categorical features in training dataset are:")
training_col_names = []
for col in X_train.columns:
    if X_train[col].dtype != 'object' and '_trg' not in col:
        training_col_names.append(col)
        
print(training_col_names)

print("\n")
        
print("The numerical variables excluding encoded categorical features in testing dataset are:")
test_col_names = []
for col in X_test.columns:
    if X_test[col].dtype != 'object' and '_trg' not in col:
        test_col_names.append(col)
        
print(test_col_names)

The numerical variables excluding encoded categorical features in training dataset are:
['Zip', 'NAICS', 'Term', 'NoEmp', 'NewExist', 'CreateJob', 'RetainedJob', 'FranchiseCode', 'UrbanRural', 'DisbursementGross', 'BalanceGross', 'GrAppv', 'SBA_Appv', 'LoanDisbursedPerCity', 'LoanPaid']


The numerical variables excluding encoded categorical features in testing dataset are:
['Zip', 'NAICS', 'Term', 'NoEmp', 'NewExist', 'CreateJob', 'RetainedJob', 'FranchiseCode', 'UrbanRural', 'DisbursementGross', 'BalanceGross', 'GrAppv', 'SBA_Appv', 'LoanDisbursedPerCity', 'LoanPaid']


In [33]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

X_train_temp = X_train.copy()
X_test_temp = X_test.copy()

X_train_org = X_train[training_col_names]
X_test_org = X_test[test_col_names]

X_train_temp = scaler.fit_transform(X_train_org)
X_test_temp = scaler.transform(X_test_org)
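What `MinMaxScaler` computes with its default range can be written out directly: each value becomes `(x - min) / (max - min)`, with the minimum and maximum taken from the training data. The sketch below uses made-up `Term` values to show that test values outside the training range land outside [0, 1].

```python
import numpy as np

# Made-up loan terms (months) for a training and a test column.
train_term = np.array([60.0, 84.0, 240.0])
test_term = np.array([120.0, 300.0])

# The minimum and maximum are learned from the training column only.
lo, hi = train_term.min(), train_term.max()

train_sc = (train_term - lo) / (hi - lo)
test_sc = (test_term - lo) / (hi - lo)   # reuses the training min/max
print(train_sc, test_sc)
```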

In [34]:
enc_col_names = []

for val in training_col_names:
    enc_col_names.append(val + "_sc")
   
print(enc_col_names)

['Zip_sc', 'NAICS_sc', 'Term_sc', 'NoEmp_sc', 'NewExist_sc', 'CreateJob_sc', 'RetainedJob_sc', 'FranchiseCode_sc', 'UrbanRural_sc', 'DisbursementGross_sc', 'BalanceGross_sc', 'GrAppv_sc', 'SBA_Appv_sc', 'LoanDisbursedPerCity_sc', 'LoanPaid_sc']


In [35]:
X_train[enc_col_names] = X_train_temp
X_train.head()

Unnamed: 0,Zip,NAICS,Term,NoEmp,NewExist,CreateJob,RetainedJob,FranchiseCode,UrbanRural,DisbursementGross,BalanceGross,GrAppv,SBA_Appv,LoanDisbursedPerCity,LoanPaid,City_trg,State_trg,Bank_trg,BankState_trg,RevLineCr_trg,LowDoc_trg,Zip_sc,NAICS_sc,Term_sc,NoEmp_sc,NewExist_sc,CreateJob_sc,RetainedJob_sc,FranchiseCode_sc,UrbanRural_sc,DisbursementGross_sc,BalanceGross_sc,GrAppv_sc,SBA_Appv_sc,LoanDisbursedPerCity_sc,LoanPaid_sc
409416,81657,423460,84,1,2.0,0,1,0,2,48177.0,0.0,48800.0,41480.0,11236077.0,48177.0,0.170213,0.178873,0.174162,0.158269,0.147223,0.187178,0.816578,0.456256,0.159393,0.0001,1.0,0.0,0.000105,0.0,1.0,0.003861,0.0,0.008737,0.00749,0.0046,0.004209
150409,95838,339950,120,44,1.0,0,44,1,1,598000.0,0.0,598000.0,350673.0,596161424.0,598000.0,0.163606,0.184869,0.01188,0.22088,0.149455,0.187178,0.95839,0.366278,0.227704,0.0044,0.5,0.0,0.004632,1e-05,0.5,0.051913,0.0,0.109121,0.063999,0.244174,0.052244
356496,50438,321920,102,5,1.0,0,0,1,2,21000.0,0.0,21000.0,17850.0,14760689.0,21000.0,0.15493,0.115711,0.02439,0.106005,0.147223,0.187178,0.504385,0.346852,0.193548,0.0005,0.5,0.0,0.0,1e-05,1.0,0.001486,0.0,0.003656,0.003171,0.006044,0.001835
77256,68104,624110,84,1,1.0,1,1,1,1,90000.0,0.0,90000.0,76500.0,318819272.0,90000.0,0.136063,0.117317,0.134328,0.104845,0.149455,0.187178,0.681047,0.672445,0.159393,0.0001,0.5,0.000114,0.000105,1e-05,0.5,0.007516,0.0,0.016268,0.01389,0.13058,0.007863
370389,83709,0,84,14,1.0,0,0,1,0,107879.0,0.0,150000.0,135000.0,280459901.0,107879.0,0.116448,0.141069,0.275229,0.292642,0.147223,0.187178,0.837098,0.0,0.159393,0.0014,0.5,0.0,0.0,1e-05,0.0,0.009078,0.0,0.027235,0.024582,0.114869,0.009425


In [36]:
X_test[enc_col_names] = X_test_temp
X_test.head()

Unnamed: 0,Zip,NAICS,Term,NoEmp,NewExist,CreateJob,RetainedJob,FranchiseCode,UrbanRural,DisbursementGross,BalanceGross,GrAppv,SBA_Appv,LoanDisbursedPerCity,LoanPaid,City_trg,State_trg,Bank_trg,BankState_trg,RevLineCr_trg,LowDoc_trg,Zip_sc,NAICS_sc,Term_sc,NoEmp_sc,NewExist_sc,CreateJob_sc,RetainedJob_sc,FranchiseCode_sc,UrbanRural_sc,DisbursementGross_sc,BalanceGross_sc,GrAppv_sc,SBA_Appv_sc,LoanDisbursedPerCity_sc,LoanPaid_sc
554048,14420,332999,240,9,1.0,3,9,1,2,157000.0,0.0,157000.0,157000.0,6884232.0,157000.0,0.191489,0.200194,0.0,0.169201,0.149455,0.187178,0.144201,0.358789,0.455408,0.0009,0.5,0.000341,0.000947,1e-05,1.0,0.013371,0.0,0.028514,0.028603,0.002818,0.013716
193174,46220,235420,240,10,1.0,0,10,1,1,255000.0,0.0,255000.0,255000.0,516649745.0,255000.0,0.172785,0.175878,0.0,0.092939,0.147223,0.187178,0.462205,0.253653,0.455408,0.001,0.5,0.0,0.001053,1e-05,0.5,0.021936,0.0,0.046427,0.046514,0.211607,0.022278
114504,50025,0,120,4,1.0,0,0,1,0,100000.0,0.0,100000.0,90000.0,3327707.0,100000.0,0.208333,0.115711,0.000435,0.106005,0.147223,0.187178,0.500255,0.0,0.227704,0.0004,0.5,0.0,0.0,1e-05,0.0,0.00839,0.0,0.018095,0.016357,0.001361,0.008736
390737,70115,236220,37,1,1.0,5,1,0,1,25000.0,0.0,25000.0,22500.0,6289000.0,25000.0,0.307691,0.18024,0.259661,0.175917,0.147223,0.187178,0.701157,0.254515,0.070209,0.0001,0.5,0.000568,0.000105,0.0,0.5,0.001835,0.0,0.004387,0.004021,0.002574,0.002184
731516,54302,238210,96,5,2.0,5,0,1,1,1040338.0,0.0,1040338.0,780253.0,216166944.0,1040338.0,0.111276,0.124811,0.34,0.119982,0.149455,0.187178,0.543025,0.256659,0.182163,0.0005,1.0,0.000568,0.0,1e-05,0.5,0.090571,0.0,0.189972,0.142512,0.088536,0.090888


In [37]:
X_train.drop(training_col_names,axis=1,inplace=True)

In [38]:
X_test.drop(test_col_names,axis=1,inplace=True)

In [39]:
X_train.head()

Unnamed: 0,City_trg,State_trg,Bank_trg,BankState_trg,RevLineCr_trg,LowDoc_trg,Zip_sc,NAICS_sc,Term_sc,NoEmp_sc,NewExist_sc,CreateJob_sc,RetainedJob_sc,FranchiseCode_sc,UrbanRural_sc,DisbursementGross_sc,BalanceGross_sc,GrAppv_sc,SBA_Appv_sc,LoanDisbursedPerCity_sc,LoanPaid_sc
409416,0.170213,0.178873,0.174162,0.158269,0.147223,0.187178,0.816578,0.456256,0.159393,0.0001,1.0,0.0,0.000105,0.0,1.0,0.003861,0.0,0.008737,0.00749,0.0046,0.004209
150409,0.163606,0.184869,0.01188,0.22088,0.149455,0.187178,0.95839,0.366278,0.227704,0.0044,0.5,0.0,0.004632,1e-05,0.5,0.051913,0.0,0.109121,0.063999,0.244174,0.052244
356496,0.15493,0.115711,0.02439,0.106005,0.147223,0.187178,0.504385,0.346852,0.193548,0.0005,0.5,0.0,0.0,1e-05,1.0,0.001486,0.0,0.003656,0.003171,0.006044,0.001835
77256,0.136063,0.117317,0.134328,0.104845,0.149455,0.187178,0.681047,0.672445,0.159393,0.0001,0.5,0.000114,0.000105,1e-05,0.5,0.007516,0.0,0.016268,0.01389,0.13058,0.007863
370389,0.116448,0.141069,0.275229,0.292642,0.147223,0.187178,0.837098,0.0,0.159393,0.0014,0.5,0.0,0.0,1e-05,0.0,0.009078,0.0,0.027235,0.024582,0.114869,0.009425


In [40]:
X_test.head()

Unnamed: 0,City_trg,State_trg,Bank_trg,BankState_trg,RevLineCr_trg,LowDoc_trg,Zip_sc,NAICS_sc,Term_sc,NoEmp_sc,NewExist_sc,CreateJob_sc,RetainedJob_sc,FranchiseCode_sc,UrbanRural_sc,DisbursementGross_sc,BalanceGross_sc,GrAppv_sc,SBA_Appv_sc,LoanDisbursedPerCity_sc,LoanPaid_sc
554048,0.191489,0.200194,0.0,0.169201,0.149455,0.187178,0.144201,0.358789,0.455408,0.0009,0.5,0.000341,0.000947,1e-05,1.0,0.013371,0.0,0.028514,0.028603,0.002818,0.013716
193174,0.172785,0.175878,0.0,0.092939,0.147223,0.187178,0.462205,0.253653,0.455408,0.001,0.5,0.0,0.001053,1e-05,0.5,0.021936,0.0,0.046427,0.046514,0.211607,0.022278
114504,0.208333,0.115711,0.000435,0.106005,0.147223,0.187178,0.500255,0.0,0.227704,0.0004,0.5,0.0,0.0,1e-05,0.0,0.00839,0.0,0.018095,0.016357,0.001361,0.008736
390737,0.307691,0.18024,0.259661,0.175917,0.147223,0.187178,0.701157,0.254515,0.070209,0.0001,0.5,0.000568,0.000105,0.0,0.5,0.001835,0.0,0.004387,0.004021,0.002574,0.002184
731516,0.111276,0.124811,0.34,0.119982,0.149455,0.187178,0.543025,0.256659,0.182163,0.0005,1.0,0.000568,0.0,1e-05,0.5,0.090571,0.0,0.189972,0.142512,0.088536,0.090888


# Model Training

Depending on the model of your choice, you might need to use appropriate scaler for numerical variables.

Train at least two types of models from the below list.
If you use sklearn libraries:
- Logistic regression
- SVM
- Decision Tree

If you use H2O libraries:
- GLM
- SVM
- Naïve Bayes Classifier

### Logistic Regression

Check the class distribution to determine whether the dataset is imbalanced

In [41]:
data['MIS_Status'].value_counts()

0    665576
1    141849
Name: MIS_Status, dtype: int64

In [42]:
MIS_Status_1 = len(data[data['MIS_Status']==1])
MIS_Status_0 = len(data[data['MIS_Status']==0])

perc_MIS_Status_1 = MIS_Status_1 / (MIS_Status_1 + MIS_Status_0)
perc_MIS_Status_0 = MIS_Status_0 / (MIS_Status_1 + MIS_Status_0)


print("Percentage of values with MIS_Status=1: ", perc_MIS_Status_1*100)
print("Percentage of values with MIS_Status=0: ", perc_MIS_Status_0*100)

Percentage of values with MIS_Status=1:  17.568071337895162
Percentage of values with MIS_Status=0:  82.43192866210484


The dataset is imbalanced: about 82.4% of the records have MIS_Status=0 and the remaining 17.6% have MIS_Status=1
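For reference, the class weights that scikit-learn's `class_weight='balanced'` heuristic would assign can be computed from these counts as `n_samples / (n_classes * count)`. The sketch below uses the full-dataset counts printed above; a model fitted on `y_train` would use the training counts instead, so treat the numbers as approximate.

```python
# Class counts taken from the value_counts() output above.
counts = {0: 665576, 1: 141849}

n_samples = sum(counts.values())
n_classes = len(counts)

# Rarer classes receive proportionally larger weights.
balanced = {label: n_samples / (n_classes * n) for label, n in counts.items()}
print(balanced)
```

The resulting ratio between the two weights is about 4.7 to 1, compared with roughly 2.1 to 1 for the grid-searched weights reported in the next subsection.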

In [43]:
from sklearn.linear_model import LogisticRegression

# Fit the model
lreg = LogisticRegression(random_state = 0)
lreg.fit(X_train, y_train)

LogisticRegression(random_state=0)

In [44]:
from sklearn.metrics import f1_score

y_train_lreg_pred = lreg.predict(X_train)
y_test_lreg_pred = lreg.predict(X_test)


print("F1 score on train dataset: ", f1_score(y_train, y_train_lreg_pred, average='weighted'))
print("F1 score on test dataset: ", f1_score(y_test, y_test_lreg_pred, average='weighted'))

F1 score on train dataset:  0.8490267570092099
F1 score on test dataset:  0.838642407977774


#### Finding the best-fit Logistic Regression model using GridSearchCV

- I ran the grid search below to find the class weights that give the highest F1 score
- The best class weights were -> 0: 0.32326530612244897, 1: 0.676734693877551
- The code below is commented out because it takes a long time to run, and I ended up choosing the Decision Tree model because of its better performance

In [45]:
from sklearn.model_selection import GridSearchCV,StratifiedKFold
import numpy as np

In [46]:
 
# lreg_bestfit = LogisticRegression()

# #Setting the range for class weights
# weights = np.linspace(0.0,0.99,50)

# #Creating a dictionary grid for grid search
# param_grid = {'class_weight': [{0:x, 1:1.0-x} for x in weights]}

# #Fitting grid search to the train data with 5 folds
# clf = GridSearchCV(estimator= lreg_bestfit, 
#                           param_grid= param_grid,
#                           cv=StratifiedKFold(), 
#                           n_jobs=-1, 
#                           scoring='f1', 
#                           verbose=2)

# best_clf_lreg = clf.fit(X_train,y_train)
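# A small, runnable version of the same search is sketched here on synthetic
# data so that it finishes in seconds. The dataset, the 10-point weight grid
# and the seeds are assumptions for illustration; the grid starts above 0 so
# that no class receives zero weight.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic, imbalanced binary problem (about 18% positives).
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=8, weights=[0.82, 0.18], random_state=0
)

# Candidate weights for class 0; class 1 receives the complement.
weights = np.linspace(0.05, 0.95, 10)
param_grid = {"class_weight": [{0: w, 1: 1.0 - w} for w in weights]}

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid=param_grid,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    scoring="f1",  # F1 of the positive class, averaged over the folds
)
search.fit(X_demo, y_demo)
print(search.best_params_, search.best_score_)
```

# Note that best_score_ is the mean cross-validated value of the chosen
# scoring metric (F1 here), not accuracy.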


In [47]:
# print("Tuned hyperparameters :(best parameters) ",best_clf_lreg.best_params_)
# print("Best F1 score :",best_clf_lreg.best_score_)

In [48]:
print("Tuned hyperparameters :(best parameters) ","{'class_weight': {0: 0.32326530612244897, 1: 0.676734693877551}}")

Tuned hyperparameters :(best parameters)  {'class_weight': {0: 0.32326530612244897, 1: 0.676734693877551}}


In [49]:
from sklearn.metrics import classification_report
from sklearn import metrics
from sklearn.metrics import confusion_matrix,accuracy_score, f1_score, recall_score, precision_score

def evaluate_model(model):
    # Predict once per split and reuse the labels for every metric.
    for name, X_part, y_part in (("Train", X_train, y_train), ("Test", X_test, y_test)):
        y_pred = model.predict(X_part)
        print(f"{name} Accuracy :", accuracy_score(y_part, y_pred))
        print(f"{name} weighted F1 score :", f1_score(y_part, y_pred, average='weighted'))
        # AUC is computed from hard class labels here, not from predicted probabilities.
        print(f"{name} AUC :", metrics.roc_auc_score(y_part, y_pred))
        print(f"{name} Recall :" , recall_score(y_part, y_pred))
        print(f"{name} Precision :" , precision_score(y_part, y_pred))
        print(f"{name} Confusion Matrix:")
        print(confusion_matrix(y_part, y_pred))
        if name == "Train":
            print("-"*50)

In [50]:
from sklearn.metrics import confusion_matrix
lr_class_weights = LogisticRegression(solver='lbfgs', class_weight={0: 0.32326530612244897, 1: 0.676734693877551})
lr_class_weights.fit(X_train, y_train)


LogisticRegression(class_weight={0: 0.32326530612244897, 1: 0.676734693877551})

In [51]:
print("Performance metrics for a Logistic Classification Model:\n")
evaluate_model(lr_class_weights)

Performance metrics for a Logistic Classification Model:

Train Accuracy : 0.8603121032913273
Train weighted F1 score : 0.8604534684543658
Train AUC : 0.7599214255106174
Train Recall : 0.605127301559334
Train Precision : 0.6020193521245267
Train Confusion Matrix:
[[487022  45408]
 [ 44822  68688]]
--------------------------------------------------
Test Accuracy : 0.8496702480106512
Test weighted F1 score : 0.849259379004498
Test AUC : 0.7375530798459828
Test Recall : 0.5648046861215992
Test Precision : 0.5726859637196322
Test Confusion Matrix:
[[121203  11943]
 [ 12333  16006]]


###### Using the predict_proba function to get the probabilities

- I also used the predict_proba method to find an optimal threshold for making predictions
- The threshold used in the evaluation cells is 0.365 (F-score 0.855); the search output printed in this section reports 0.355 with F-score 0.854. The train/test split is not seeded, so the optimum shifts slightly between runs
- I compared the results of this method with the Logistic Regression model above to check for any improvement

In [52]:
y_test_pred_proba = lreg.predict_proba(X_test)

# Get the probabilities for positive class
y_test_pred = y_test_pred_proba[:, 1]

In [53]:
y_test_pred

array([0.00133121, 0.00101275, 0.01548202, ..., 0.09033264, 0.23383415,
       0.09972821])

In [54]:
from sklearn.metrics import f1_score
import numpy as np

# Array for finding the optimal threshold
thresholds = np.arange(0.0, 1.0, 0.005)

fscore = np.zeros(shape=(len(thresholds)))
print('Length of sequence: {}'.format(len(thresholds)))

# Evaluate the weighted F1 score at each candidate threshold
for index, elem in enumerate(thresholds):
    # Convert probabilities to 0/1 labels at this threshold
    y_pred_prob = (y_test_pred > elem).astype('int')
    # Calculate the f-score
    fscore[index] = f1_score(y_test, y_pred_prob, average='weighted')

# Find the optimal threshold
index = np.argmax(fscore)
thresholdOpt = round(thresholds[index], ndigits = 4)
fscoreOpt = round(fscore[index], ndigits = 4)
print('Best Threshold: {} with F-Score: {}'.format(thresholdOpt, fscoreOpt))


Length of sequence: 200
Best Threshold: 0.355 with F-Score: 0.854
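The search above picks the threshold on the test set, which can make the reported test score slightly optimistic. One common alternative is to choose the threshold on a separate validation split and only then score the test set. The sketch below illustrates that variant on synthetic data; the dataset, split sizes, seeds and threshold grid are all assumptions, and it is not the procedure used for the results in this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced binary problem (about 18% positives).
X_demo, y_demo = make_classification(
    n_samples=3000, n_features=8, weights=[0.82, 0.18], random_state=0
)
# Hold out a validation split that is used only for choosing the threshold.
X_fit, X_val, y_fit, y_val = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_fit, y_fit)
val_proba = model.predict_proba(X_val)[:, 1]   # probability of the positive class

# Score every candidate threshold on the validation split.
thresholds = np.arange(0.05, 0.95, 0.005)
scores = [
    f1_score(y_val, (val_proba >= t).astype(int), average="weighted")
    for t in thresholds
]
best_threshold = float(thresholds[int(np.argmax(scores))])
print(best_threshold, max(scores))
```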


In [55]:
from sklearn.metrics import confusion_matrix
from sklearn import metrics

In [56]:
y_pred_train_proba = (lreg.predict_proba(X_train)[:,1] >= 0.365).astype(bool)# set threshold as 0.365

print("The f1 score for the model with training data: ", f1_score(y_train, y_pred_train_proba, average='weighted'))
print(confusion_matrix(y_train, y_pred_train_proba))

auc = metrics.roc_auc_score(y_train, y_pred_train_proba)
print("Train AUC: ", auc)

The f1 score for the model with training data:  0.8644384660338794
[[495551  36879]
 [ 48788  64722]]
Train AUC:  0.7504611026603255


In [57]:
y_pred_test_proba = (lreg.predict_proba(X_test)[:,1] >= 0.365).astype(bool)# set threshold as 0.365

print("The f1 score for the model with test data: ", f1_score(y_test, y_pred_test_proba, average='weighted'))
print(confusion_matrix(y_test, y_pred_test_proba))

auc = metrics.roc_auc_score(y_test, y_pred_test_proba)
print("Test AUC: ", auc)

The f1 score for the model with test data:  0.8539257182794213
[[123449   9697]
 [ 13271  15068]]
Test AUC:  0.7294377988578806


### Decision Tree Classifier

- I trained a Decision Tree with the entropy criterion, a maximum depth of 9, and a random state of 0.
- It gave a better weighted F1 score (0.92) than the previous model (0.85), so I chose this model and went on to run a grid search with 5-fold stratified cross-validation to find the best hyperparameters.

In [58]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
from sklearn import metrics
from sklearn.metrics import confusion_matrix,accuracy_score, f1_score, recall_score, precision_score

dtc = DecisionTreeClassifier(criterion= 'entropy', max_depth=9, random_state=0)
dtc.fit(X_train, y_train)
pred = dtc.predict(X_test)
print(classification_report(y_test, pred))
print(confusion_matrix(y_test, pred))

              precision    recall  f1-score   support

           0       0.94      0.97      0.96    133146
           1       0.82      0.73      0.77     28339

    accuracy                           0.93    161485
   macro avg       0.88      0.85      0.86    161485
weighted avg       0.92      0.93      0.92    161485

[[128722   4424]
 [  7629  20710]]


In [59]:
def evaluate_model(dt_classifier):
    # Predict once per split and reuse the labels for every metric
    train_pred = dt_classifier.predict(X_train)
    test_pred = dt_classifier.predict(X_test)

    print("Train Accuracy :", accuracy_score(y_train, train_pred))
    print("Train weighted F1 score :", f1_score(y_train, train_pred, average='weighted'))
    # AUC is computed from hard class labels here, not from predicted probabilities
    print("Train AUC :", metrics.roc_auc_score(y_train, train_pred))
    print("Train Recall :" , recall_score(y_train, train_pred))
    print("Train Precision :" , precision_score(y_train, train_pred))
    print("Train Confusion Matrix:")
    print(confusion_matrix(y_train, train_pred))

    print("-"*50)

    print("Test Accuracy :", accuracy_score(y_test, test_pred))
    print("Test weighted F1 score :", f1_score(y_test, test_pred, average='weighted'))
    print("Test AUC :", metrics.roc_auc_score(y_test, test_pred))
    print("Test Recall :" , recall_score(y_test, test_pred))
    print("Test Precision :" , precision_score(y_test, test_pred))
    print("Test Confusion Matrix:")
    print(confusion_matrix(y_test, test_pred))

In [60]:
print("Performance metrics for a Decision Tree Classifier Model:\n")
evaluate_model(dtc)

Performance metrics for a Decision Tree Classifier Model:

Train Accuracy : 0.928957797937889
Train weighted F1 score : 0.9274036672080309
Train AUC : 0.8569868372592816
Train Recall : 0.7460135670866003
Train Precision : 0.8323258534092137
Train Confusion Matrix:
[[515371  17059]
 [ 28830  84680]]
--------------------------------------------------
Test Accuracy : 0.9253614886831594
Test weighted F1 score : 0.9235685124812528
Test AUC : 0.8487841669883955
Test Recall : 0.7307950174670949
Test Precision : 0.8239834487148882
Test Confusion Matrix:
[[128722   4424]
 [  7629  20710]]


## Model Tuning

Choose one model from the above list. You should provide reasoning on why you have picked this model over the others. Perform tuning for the selected model:
- Hyper-parameter tuning. Your hyper-parameter search space should have at least 50 combinations.
- To avoid overfitting and to get a reasonable estimate of model performance on a hold-out dataset, you will need to split your dataset as follows:
    - Train, used to train the model
    - Validation, used to validate the model in each round of training
    - Testing, used to provide the final performance metrics; used only once, on the final model
- Feature engineering. You should add at least two engineered features. For example, add a feature that is a combination of two features.
- If your model returns a probability, calculate the probability threshold that maximizes F1.
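
A minimal sketch of the three-way split described above, using two stratified calls to `train_test_split`. The 60/20/20 proportions, the synthetic data, and the variable names (`X_tr`, `X_val`, `X_te`) are assumptions made for illustration; they are not taken from this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1000 rows, 5 features, roughly 18% positives
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (rng.random(1000) < 0.18).astype(int)

# First hold out 20% of the rows as the final test set
X_trainval, X_te, y_trainval, y_te = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=0)

# Then take 25% of the remainder (20% of the total) as the validation set
X_tr, X_val, y_tr, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, stratify=y_trainval, random_state=0)

print(len(X_tr), len(X_val), len(X_te))
```

Stratifying both splits keeps the share of defaulted loans similar across the three sets, which matters for an imbalanced target like this one.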

#### Feature Engineering

I already created the two engineered features below during the data preparation steps above:
- LoanDisbursedPerCity_sc = Loan Disbursed to small businesses per city
- LoanPaid_sc = DisbursementGross - BalanceGross
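
The two features can be sketched on a toy frame as follows. The `_sc` suffix presumably marks the scaled version of each column, so the sketch builds the unscaled values only. Reading "per city" as the total amount disbursed across all loans in the same city is an assumption; the notebook's actual aggregation may differ.

```python
import pandas as pd

# Toy rows using raw column names from the SBA data
df = pd.DataFrame({
    "City": ["DALLAS", "DALLAS", "AUSTIN"],
    "DisbursementGross": [100000.0, 50000.0, 20000.0],
    "BalanceGross": [0.0, 10000.0, 0.0],
})

# LoanPaid = DisbursementGross - BalanceGross (as defined above)
df["LoanPaid"] = df["DisbursementGross"] - df["BalanceGross"]

# Assumption: total disbursed over all loans that share the same city
df["LoanDisbursedPerCity"] = (
    df.groupby("City")["DisbursementGross"].transform("sum")
)
print(df)
```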

In [61]:
X_test.head(n=3)

Unnamed: 0,City_trg,State_trg,Bank_trg,BankState_trg,RevLineCr_trg,LowDoc_trg,Zip_sc,NAICS_sc,Term_sc,NoEmp_sc,NewExist_sc,CreateJob_sc,RetainedJob_sc,FranchiseCode_sc,UrbanRural_sc,DisbursementGross_sc,BalanceGross_sc,GrAppv_sc,SBA_Appv_sc,LoanDisbursedPerCity_sc,LoanPaid_sc
554048,0.191489,0.200194,0.0,0.169201,0.149455,0.187178,0.144201,0.358789,0.455408,0.0009,0.5,0.000341,0.000947,1e-05,1.0,0.013371,0.0,0.028514,0.028603,0.002818,0.013716
193174,0.172785,0.175878,0.0,0.092939,0.147223,0.187178,0.462205,0.253653,0.455408,0.001,0.5,0.0,0.001053,1e-05,0.5,0.021936,0.0,0.046427,0.046514,0.211607,0.022278
114504,0.208333,0.115711,0.000435,0.106005,0.147223,0.187178,0.500255,0.0,0.227704,0.0004,0.5,0.0,0.0,1e-05,0.0,0.00839,0.0,0.018095,0.016357,0.001361,0.008736


In [62]:
X_train.head(n=3)

Unnamed: 0,City_trg,State_trg,Bank_trg,BankState_trg,RevLineCr_trg,LowDoc_trg,Zip_sc,NAICS_sc,Term_sc,NoEmp_sc,NewExist_sc,CreateJob_sc,RetainedJob_sc,FranchiseCode_sc,UrbanRural_sc,DisbursementGross_sc,BalanceGross_sc,GrAppv_sc,SBA_Appv_sc,LoanDisbursedPerCity_sc,LoanPaid_sc
409416,0.170213,0.178873,0.174162,0.158269,0.147223,0.187178,0.816578,0.456256,0.159393,0.0001,1.0,0.0,0.000105,0.0,1.0,0.003861,0.0,0.008737,0.00749,0.0046,0.004209
150409,0.163606,0.184869,0.01188,0.22088,0.149455,0.187178,0.95839,0.366278,0.227704,0.0044,0.5,0.0,0.004632,1e-05,0.5,0.051913,0.0,0.109121,0.063999,0.244174,0.052244
356496,0.15493,0.115711,0.02439,0.106005,0.147223,0.187178,0.504385,0.346852,0.193548,0.0005,0.5,0.0,0.0,1e-05,1.0,0.001486,0.0,0.003656,0.003171,0.006044,0.001835


#### Why I chose the Decision Tree model over the others
- It gave very promising results on both the train and test datasets.
- There was no sign of overfitting: the weighted F1 score was 0.927 on the train dataset and 0.923 on the test dataset.
- It showed a good balance between recall and precision.

#### Hyper-parameter tuning of the Decision Tree Classifier
***Parameters tuned were:***
- Criterion - Gini or Entropy
- Maximum depth - [1,3,5,7,9,11]
- Minimum samples split - [1,3,5,7,9]
- Minimum samples leaf - [1,2,3,4,5]

The search space therefore has 300 candidates (2 x 6 x 5 x 5); with 5 folds each, this gives 1500 fits.
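
The candidate count can be checked by enumerating the grid with `itertools.product`, as sketched below. One caveat worth flagging: scikit-learn documents an integer `min_samples_split` as needing to be at least 2, so the candidates that use the value 1 most likely failed to fit and were scored as NaN. That reading is an assumption about this run, since warnings are suppressed in the notebook.

```python
from itertools import product

param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [1, 3, 5, 7, 9, 11],
    "min_samples_split": list(range(1, 10, 2)),  # [1, 3, 5, 7, 9]
    "min_samples_leaf": [1, 2, 3, 4, 5],
}
n_folds = 5

# Every combination of one value per parameter is one candidate
candidates = [dict(zip(param_grid, values))
              for values in product(*param_grid.values())]
n_candidates = len(candidates)
n_fits = n_candidates * n_folds

# Candidates that use min_samples_split=1
n_split_one = sum(c["min_samples_split"] == 1 for c in candidates)
print(n_candidates, n_fits, n_split_one)
```

Starting the range at 2 would avoid spending fits on the invalid value, at the cost of changing the candidate count reported here.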

In [63]:
from sklearn.model_selection import GridSearchCV, StratifiedKFold

decision_tree = DecisionTreeClassifier()

# Hyper-parameter ranges for tuning

parameters={
    "criterion":['gini','entropy'],
    "max_depth" : [1,3,5,7,9,11],
    "min_samples_split":range(1,10,2),
    "min_samples_leaf":[1,2,3,4,5]
}

dtc_grid = GridSearchCV(decision_tree,
                   param_grid = parameters,
                   cv = StratifiedKFold(n_splits=5),
                   scoring = 'f1',
                   verbose = 2,
                   n_jobs = -1)

dtc_grid.fit(X_train,y_train)

Fitting 5 folds for each of 300 candidates, totalling 1500 fits


GridSearchCV(cv=StratifiedKFold(n_splits=5, random_state=None, shuffle=False),
             estimator=DecisionTreeClassifier(), n_jobs=-1,
             param_grid={'criterion': ['gini', 'entropy'],
                         'max_depth': [1, 3, 5, 7, 9, 11],
                         'min_samples_leaf': [1, 2, 3, 4, 5],
                         'min_samples_split': range(1, 10, 2)},
             scoring='f1', verbose=2)

In [64]:
dtc_grid.best_params_

{'criterion': 'gini',
 'max_depth': 11,
 'min_samples_leaf': 1,
 'min_samples_split': 3}

In [65]:
dtc_grid.best_score_

0.8157275183980041
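
This score is not directly comparable with the weighted F1 values reported elsewhere in the notebook: `scoring='f1'` is the binary F1 of the positive class (label 1), averaged over the folds. The sketch below derives both numbers from the test confusion matrix of the depth-9 tree evaluated earlier, to show how far apart they sit on the same predictions. `f1_per_class` is a helper name introduced here for illustration.

```python
def f1_per_class(cm):
    """Return (f1_class0, f1_class1) for a confusion matrix [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cm
    f1_pos = 2 * tp / (2 * tp + fp + fn)  # class 1 treated as positive
    f1_neg = 2 * tn / (2 * tn + fn + fp)  # class 0 treated as positive
    return f1_neg, f1_pos

# Test confusion matrix of the depth-9 tree, copied from the output above
cm = [[128722, 4424], [7629, 20710]]
f1_neg, f1_pos = f1_per_class(cm)

# Weighted F1 averages the per-class scores by class support
support_neg = cm[0][0] + cm[0][1]
support_pos = cm[1][0] + cm[1][1]
f1_weighted = (support_neg * f1_neg + support_pos * f1_pos) / (support_neg + support_pos)
print("Binary F1 (class 1):", f1_pos)
print("Weighted F1:", f1_weighted)
```

The grid-search score should therefore be compared with the class-1 `f1-score` row of the classification reports rather than with the weighted figures.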

##### Refitting the model with the best hyperparameters found above

In [79]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
from sklearn import metrics
from sklearn.metrics import confusion_matrix,accuracy_score, f1_score, recall_score,precision_score

dtc = DecisionTreeClassifier(criterion= 'gini', max_depth=11, min_samples_leaf= 1, min_samples_split= 3)
dtc.fit(X_train, y_train)
pred = dtc.predict(X_test)
print(classification_report(y_test, pred))
print(confusion_matrix(y_test, pred))

              precision    recall  f1-score   support

           0       0.95      0.97      0.96    133146
           1       0.84      0.76      0.80     28339

    accuracy                           0.93    161485
   macro avg       0.89      0.86      0.88    161485
weighted avg       0.93      0.93      0.93    161485

[[128976   4170]
 [  6821  21518]]


In [80]:
def evaluate_model(dt_classifier):
    # Predict once per split and reuse the labels for every metric
    train_pred = dt_classifier.predict(X_train)
    test_pred = dt_classifier.predict(X_test)

    print("Train Accuracy :", accuracy_score(y_train, train_pred))
    print("Train weighted F1 score :", f1_score(y_train, train_pred, average='weighted'))
    # AUC is computed from hard class labels here, not from predicted probabilities
    print("Train AUC :", metrics.roc_auc_score(y_train, train_pred))
    print("Train Recall :" , recall_score(y_train, train_pred))
    print("Train Precision :" , precision_score(y_train, train_pred))
    print("Train Confusion Matrix:")
    print(confusion_matrix(y_train, train_pred))

    print("-"*50)

    print("Test Accuracy :", accuracy_score(y_test, test_pred))
    print("Test weighted F1 score :", f1_score(y_test, test_pred, average='weighted'))
    print("Test AUC :", metrics.roc_auc_score(y_test, test_pred))
    print("Test Recall :" , recall_score(y_test, test_pred))
    print("Test Precision :" , precision_score(y_test, test_pred))
    print("Test Confusion Matrix:")
    print(confusion_matrix(y_test, test_pred))

In [81]:
print("Performance metrics for a Decision Tree Classifier Model:\n")
evaluate_model(dtc)

Performance metrics for a Decision Tree Classifier Model:

Train Accuracy : 0.9419574573489797
Train weighted F1 score : 0.9411217717952116
Train AUC : 0.8860831237913179
Train Recall : 0.7999295216280504
Train Precision : 0.8599950749180731
Train Confusion Matrix:
[[517648  14782]
 [ 22710  90800]]
--------------------------------------------------
Test Accuracy : 0.9319379508932718
Test weighted F1 score : 0.9306035620469831
Test AUC : 0.8639939794687445
Test Recall : 0.7593069621369843
Test Precision : 0.8376673933354095
Test Confusion Matrix:
[[128976   4170]
 [  6821  21518]]


## Save all artifacts

Save all artifacts needed for the scoring function:
- Trained model
- Encoders

You should restart your kernel now to properly test the scoring function.

In [82]:
import pickle

# Saving the final model; the with-block flushes and closes the file
model_filename = './artifacts/finalized_model.pkl'
with open(model_filename, 'wb') as model_file:
    pickle.dump(dtc, model_file)

# Saving the target encoders
encoder_filename = './artifacts/cat_Encoders.pkl'
with open(encoder_filename, 'wb') as encoder_file:
    pickle.dump(cat_encoders, encoder_file)

# Saving the MinMax scaler
scaler_filename = './artifacts/scaler_Encoders.pkl'
with open(scaler_filename, 'wb') as scaler_file:
    pickle.dump(scaler, scaler_file)

In [83]:
# Load the model from disk to check that it was saved correctly
with open(model_filename, 'rb') as f:
    loaded_model = pickle.load(f)
result = loaded_model.score(X_test, y_test)
print(result)

0.9319379508932718


In [84]:
enc_loaded = pickle.load(open(encoder_filename,'rb'))
enc_loaded

{'City': [TargetEncoder(cols=['City']), 'target'],
 'State': [TargetEncoder(cols=['State']), 'target'],
 'Bank': [TargetEncoder(cols=['Bank']), 'target'],
 'BankState': [TargetEncoder(cols=['BankState']), 'target'],
 'RevLineCr': [TargetEncoder(cols=['RevLineCr']), 'target'],
 'LowDoc': [TargetEncoder(cols=['LowDoc']), 'target']}

In [85]:
scaler_file.close()

In [86]:
scaler_loaded = pickle.load(open(scaler_filename,'rb'))
scaler_loaded

MinMaxScaler()

In [87]:
model_file.close()
encoder_file.close()
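
The save-then-load check above can be wrapped into a pair of small helpers so that every file is closed before it is read back. This is a sketch only: `save_artifacts` and `load_artifacts` are hypothetical names, and plain Python objects stand in for the model, encoders, and scaler so that the example runs on its own.

```python
import pickle
import tempfile
from pathlib import Path

def save_artifacts(artifacts, directory):
    """Pickle each object to <directory>/<name>.pkl, closing every file."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    for name, obj in artifacts.items():
        with open(directory / f"{name}.pkl", "wb") as f:
            pickle.dump(obj, f)

def load_artifacts(names, directory):
    """Load the named pickles back into a dict keyed by name."""
    loaded = {}
    for name in names:
        with open(Path(directory) / f"{name}.pkl", "rb") as f:
            loaded[name] = pickle.load(f)
    return loaded

# Placeholder objects stand in for the model, encoders and scaler
artifacts = {
    "finalized_model": {"max_depth": 11},
    "cat_Encoders": {"City": "target"},
    "scaler_Encoders": [0.0, 1.0],
}
with tempfile.TemporaryDirectory() as tmp:
    save_artifacts(artifacts, tmp)
    restored = load_artifacts(artifacts.keys(), tmp)

print(restored == artifacts)
```

A scoring function run after a kernel restart could call a loader like this once and then apply the encoders, the scaler, and the model in that order.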