# Project 1 Starter

Project 1 lets students practice the Data Science concepts learned so far.

The project includes the following tasks:
- Load dataset
- Clean up the data:
    - Encode or replace missing values
    - Replace feature values that appear incorrect
- Encode categorical variables
- Split dataset into Train/Test/Validation
- Add engineered features
- Train and tune ML model
- Provide final metrics using Validation dataset

It is up to you whether to modify your dataset and then split it, or to split it and then modify it.
It is important to understand all the steps before model training, so that you can reliably replicate and test them when producing the scoring function.

Project 1 will be graded on the completeness and performance of your final model against a hold-out dataset.
The hold-out dataset will not be known to the students. As part of your deliverables, you will need to submit a scoring function. The scoring function will perform the following:
- Accept dataset in the same format as provided with the project, minus "MIS_Status" column
- Load trained model and any encoders that are needed to transform data
- Transform dataset into format that can be scored with the trained model
- Score the dataset and return, for each record:
    - Record ID
    - Record label as determined by the final model (0 or 1)
    - If your model returns probabilities, assign the label using the threshold that maximizes F1
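
The maximum-F1 rule in the last bullet can be sketched as follows. This is a minimal illustration, not a prescribed implementation: the helper name `best_f1_threshold` and the toy labels and probabilities are made up, and in practice the threshold would be chosen on a validation split rather than on the data being scored.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def best_f1_threshold(y_true, y_prob):
    """Return (threshold, f1) for the cut-off that maximizes F1."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_prob)
    # precision/recall carry one extra trailing point with no threshold; drop it
    precision, recall = precision[:-1], recall[:-1]
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom,
                   out=np.zeros_like(denom), where=denom > 0)
    best = int(np.argmax(f1))
    return float(thresholds[best]), float(f1[best])

# Toy validation labels and predicted probabilities (illustrative only)
y_val = np.array([0, 0, 0, 1, 0, 1, 1, 1])
p_val = np.array([0.05, 0.10, 0.35, 0.40, 0.45, 0.60, 0.80, 0.90])

threshold, f1 = best_f1_threshold(y_val, p_val)
# A record is labelled 1 when its probability reaches the threshold
labels = (p_val >= threshold).astype(int)
```

The same threshold would then be stored alongside the model so that the scoring function can reuse it.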


Deliverables:
- Jupyter notebook with complete code to manipulate data, train and tune final model
- Model and any potential encoders in the "pkl" format
- Scoring function that will load final model and encoders


Your notebook should explain your code and be easy to follow so that results can be replicated. Once you are done with the final version, test it by restarting the kernel and running all cells from top to bottom. This can be done with `Kernel -> Restart & Run All`.


**Important**: you might want to first produce working code using a small subset of the dataset to speed up the debugging process.
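
A small, reproducible subset can be drawn in one line. The sketch below uses a made-up stand-in frame (`full`) instead of the project file and samples 10% of each target class so that the class balance is preserved; the fraction is an arbitrary choice.

```python
import pandas as pd

# Stand-in frame; in the notebook this would be the loaded project dataset
full = pd.DataFrame({
    "Term": range(1000),
    "MIS_Status": ["P I F"] * 800 + ["CHGOFF"] * 200,
})

# Draw a reproducible 10% sample from each class of the target
subset = full.groupby("MIS_Status", group_keys=False).sample(frac=0.10, random_state=42)
```

Another option is `pd.read_csv(..., nrows=N)`, which reads only the first `N` rows but does not guarantee a representative class mix.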

## Dataset description
The dataset for Lab-2 is a sample of the SBA dataset posted on Kaggle.
The dataset is from the U.S. Small Business Administration (SBA). The U.S. SBA was founded in 1953 on the principle of promoting and assisting small enterprises in the U.S. credit market (SBA Overview and History, US Small Business Administration (2015)). Small businesses have been a primary source of job creation in the United States; therefore, fostering small business formation and growth has social benefits by creating job opportunities and reducing unemployment. There have been many success stories of start-ups receiving SBA loan guarantees, such as FedEx and Apple Computer. However, there have also been stories of small businesses and/or start-ups that have defaulted on their SBA-guaranteed loans.  
More info on the original dataset: https://www.kaggle.com/mirbektoktogaraev/should-this-loan-be-approved-or-denied

**Don't use the original dataset; use only the dataset provided with the project requirements in eLearning.**

## Preparation

Use the dataset provided in eLearning.

In [1]:
import pandas as pd
pd.set_option('display.max_columns', 1500)

import warnings
warnings.filterwarnings('ignore')

#Extend cell width
from IPython.display import display, HTML
display(HTML("<style>.container { width:80% !important; }</style>"))



"""
Purpose: Analyze an input pandas DataFrame and return statistics per column
Details: The function counts the levels of each variable and returns a summary of the most frequent values

To view wide table set following Pandas options:
pd.set_option('display.width', 1000)
pd.set_option('max_colwidth',200)
"""
import pandas as pd
def describe_more(df,normalize_ind=False, weight_column=None, skip_columns=[], dropna=True):
    var = [] ; l = [] ; t = []; unq =[]; min_l = []; max_l = [];
    assert isinstance(skip_columns, list), "Argument skip_columns should be list"
    if weight_column is not None:
        if weight_column not in list(df.columns):
            raise AssertionError('weight_column is not a valid column name in the input DataFrame')
      
    for x in df:
        if x in skip_columns:
            pass
        else:
            var.append( x )
            # Count the distinct levels that actually occur (dropna controls whether NaN counts as a level)
            counts = df[x].value_counts(dropna=dropna)
            uniq_counts = len(counts[counts > 0])
            l.append(uniq_counts)
            t.append( df[ x ].dtypes )
            min_l.append(df[x].apply(str).str.len().min())
            max_l.append(df[x].apply(str).str.len().max())
            if weight_column is not None and x not in skip_columns:
                df2 = df.groupby(x).agg({weight_column: 'sum'}).sort_values(weight_column, ascending=False)
                df2['authtrans_vts_cnt']=((df2[weight_column])/df2[weight_column].sum()).round(2)
                unq.append(df2.head(n=100).to_dict()[weight_column])
            else:
                df_cat_d = df[x].value_counts(normalize=normalize_ind,dropna=dropna).round(decimals=2)
                df_cat_d = df_cat_d[df_cat_d>0]
                #unq.append(df[x].value_counts().iloc[0:100].to_dict())
                unq.append(df_cat_d.iloc[0:100].to_dict())
            
    levels = pd.DataFrame( { 'A_Variable' : var , 'Levels' : l , 'Datatype' : t ,
                             'Min Length' : min_l,
                             'Max Length': max_l,
                             'Level_Values' : unq} )
    #levels.sort_values( by = 'Levels' , inplace = True )
    return levels

### Load data

In [2]:
data = pd.read_csv('SBA_loans_project_1.csv')

In [3]:
print("Data shape:", data.shape)

Data shape: (809247, 20)


**Review dataset**

In [4]:
desc_df = data.describe()
desc_df

Unnamed: 0,Zip,NAICS,Term,NoEmp,NewExist,CreateJob,RetainedJob,FranchiseCode,UrbanRural
count,809247.0,809247.0,809247.0,809247.0,809119.0,809247.0,809247.0,809247.0,809247.0
mean,53800.937004,398573.78361,110.798776,11.414084,1.280276,8.415866,10.773366,2751.939176,0.757748
std,31186.367109,263354.979814,78.872428,74.529429,0.451692,236.288348,236.612053,12758.41181,0.646347
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,27577.0,235210.0,60.0,2.0,1.0,0.0,0.0,1.0,0.0
50%,55411.0,445310.0,84.0,4.0,1.0,0.0,1.0,1.0,1.0
75%,83704.0,561730.0,120.0,10.0,2.0,1.0,4.0,1.0,1.0
max,99999.0,928120.0,569.0,9999.0,2.0,8800.0,9500.0,99999.0,2.0


## Dataset preparation and clean-up

Modify and clean up the dataset as follows:
- Replace or encode NA/null values
- Convert the strings styled as '$XXXX.XX' to float values. Columns = ['DisbursementGross', 'BalanceGross', 'GrAppv', 'SBA_Appv']
- Convert MIS_Status to 0/1. Encode the value "CHGOFF" as 1
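
The project tasks also call for replacing feature values that appear incorrect. A minimal sketch of one way to do that is shown below. It assumes, purely for illustration, that only "Y" and "N" are legitimate codes for a flag column such as LowDoc; the source does not state which codes are valid, and `keep_valid` is a hypothetical helper.

```python
import pandas as pd

def keep_valid(series, valid, other="Missing"):
    """Replace every value outside `valid` with a single placeholder level."""
    return series.where(series.isin(valid), other)

# Toy values resembling the codes seen in the LowDoc column
low_doc = pd.Series(["N", "Y", "0", "C", "S", "N", None])
cleaned = keep_valid(low_doc, valid={"Y", "N"})
```

Collapsing rare or invalid codes into one level also keeps the number of one-hot columns small.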

In [5]:
# Let's check for the null values
data.isnull().values.any()

True

In [6]:
# isna() is an alias of isnull(), so this repeats the check above
data.isna().values.any()

True

In [7]:
# Drop all rows which have na values in MIS_Status Column
data = data.dropna( how='all',subset=['MIS_Status'])

In [8]:
# Let's replace missing numeric values by 0 and non-numeric as Missing
values_to_fill = {}

for col in data.columns:
    if data[col].isna().any():
        print("Processing column and column type:", col,data[col].dtype)
        
        if pd.api.types.is_numeric_dtype(data[col].dtype):
            values_to_fill[col] = 0
        else:
            values_to_fill[col] = "Missing"
        

data.fillna(value=values_to_fill,inplace=True)

Processing column and column type: City object
Processing column and column type: State object
Processing column and column type: Bank object
Processing column and column type: BankState object
Processing column and column type: NewExist float64
Processing column and column type: RevLineCr object
Processing column and column type: LowDoc object


In [9]:
# Let's check for the null values
data.isnull().values.any()

False

In [10]:
# isna() is an alias of isnull(), so this repeats the check above
data.isna().values.any()

False

In [11]:
# Let's check for the head of data
data.head()

Unnamed: 0,City,State,Zip,Bank,BankState,NAICS,Term,NoEmp,NewExist,CreateJob,RetainedJob,FranchiseCode,UrbanRural,RevLineCr,LowDoc,DisbursementGross,BalanceGross,GrAppv,SBA_Appv,MIS_Status
0,GLEN BURNIE,MD,21060,"BUSINESS FINANCE GROUP, INC.",VA,811111,240,7,1.0,6,7,1,1,0,N,"$743,000.00",$0.00,"$743,000.00","$743,000.00",P I F
1,WEST BEND,WI,53095,JPMORGAN CHASE BANK NATL ASSOC,IL,722410,240,20,1.0,0,0,1,0,N,N,"$137,000.00",$0.00,"$137,000.00","$109,737.00",P I F
2,SAN DIEGO,CA,92128,UMPQUA BANK,OR,0,120,2,1.0,0,0,1,0,0,N,"$280,000.00",$0.00,"$280,000.00","$210,000.00",P I F
3,WEBSTER,MA,1570,HOMETOWN BANK A CO-OPERATIVE B,MA,621310,84,7,1.0,0,0,1,1,0,Y,"$144,500.00",$0.00,"$144,500.00","$122,825.00",P I F
4,JOPLIN,MO,64804,U.S. BANK NATIONAL ASSOCIATION,OH,0,60,2,2.0,0,0,1,0,N,Y,"$52,500.00",$0.00,"$52,500.00","$42,000.00",P I F


In [12]:
# Let's convert the strings styled as '$XXXX.XX' to float values
money_cols = ['DisbursementGross', 'BalanceGross', 'GrAppv', 'SBA_Appv']

for col in money_cols:
    data[col] = [float(val[1:].replace(',', '')) for val in data[col].values]

data.head()

Unnamed: 0,City,State,Zip,Bank,BankState,NAICS,Term,NoEmp,NewExist,CreateJob,RetainedJob,FranchiseCode,UrbanRural,RevLineCr,LowDoc,DisbursementGross,BalanceGross,GrAppv,SBA_Appv,MIS_Status
0,GLEN BURNIE,MD,21060,"BUSINESS FINANCE GROUP, INC.",VA,811111,240,7,1.0,6,7,1,1,0,N,743000.0,0.0,743000.0,743000.0,P I F
1,WEST BEND,WI,53095,JPMORGAN CHASE BANK NATL ASSOC,IL,722410,240,20,1.0,0,0,1,0,N,N,137000.0,0.0,137000.0,109737.0,P I F
2,SAN DIEGO,CA,92128,UMPQUA BANK,OR,0,120,2,1.0,0,0,1,0,0,N,280000.0,0.0,280000.0,210000.0,P I F
3,WEBSTER,MA,1570,HOMETOWN BANK A CO-OPERATIVE B,MA,621310,84,7,1.0,0,0,1,1,0,Y,144500.0,0.0,144500.0,122825.0,P I F
4,JOPLIN,MO,64804,U.S. BANK NATIONAL ASSOCIATION,OH,0,60,2,2.0,0,0,1,0,N,Y,52500.0,0.0,52500.0,42000.0,P I F


In [13]:
# Let's group dataset by MIS_Status and have a count
data.groupby(by=["MIS_Status"])["MIS_Status"].count()

MIS_Status
CHGOFF    141849
P I F     665576
Name: MIS_Status, dtype: int64

In [14]:
# Converting target variable from string to binary
data['MIS_Status'] = [1 if app == 'CHGOFF' else 0 for app in data.MIS_Status.values]
data.head()

Unnamed: 0,City,State,Zip,Bank,BankState,NAICS,Term,NoEmp,NewExist,CreateJob,RetainedJob,FranchiseCode,UrbanRural,RevLineCr,LowDoc,DisbursementGross,BalanceGross,GrAppv,SBA_Appv,MIS_Status
0,GLEN BURNIE,MD,21060,"BUSINESS FINANCE GROUP, INC.",VA,811111,240,7,1.0,6,7,1,1,0,N,743000.0,0.0,743000.0,743000.0,0
1,WEST BEND,WI,53095,JPMORGAN CHASE BANK NATL ASSOC,IL,722410,240,20,1.0,0,0,1,0,N,N,137000.0,0.0,137000.0,109737.0,0
2,SAN DIEGO,CA,92128,UMPQUA BANK,OR,0,120,2,1.0,0,0,1,0,0,N,280000.0,0.0,280000.0,210000.0,0
3,WEBSTER,MA,1570,HOMETOWN BANK A CO-OPERATIVE B,MA,621310,84,7,1.0,0,0,1,1,0,Y,144500.0,0.0,144500.0,122825.0,0
4,JOPLIN,MO,64804,U.S. BANK NATIONAL ASSOCIATION,OH,0,60,2,2.0,0,0,1,0,N,Y,52500.0,0.0,52500.0,42000.0,0


In [15]:
# Let's group dataset by MIS_Status and have a count
data.groupby(by=["MIS_Status"])["MIS_Status"].count()

MIS_Status
0    665576
1    141849
Name: MIS_Status, dtype: int64

In [16]:
# Let's rename the MIS_Status column to "Defaulted"
data.rename({'MIS_Status': 'Defaulted'}, axis=1, inplace=True)
data.head()

Unnamed: 0,City,State,Zip,Bank,BankState,NAICS,Term,NoEmp,NewExist,CreateJob,RetainedJob,FranchiseCode,UrbanRural,RevLineCr,LowDoc,DisbursementGross,BalanceGross,GrAppv,SBA_Appv,Defaulted
0,GLEN BURNIE,MD,21060,"BUSINESS FINANCE GROUP, INC.",VA,811111,240,7,1.0,6,7,1,1,0,N,743000.0,0.0,743000.0,743000.0,0
1,WEST BEND,WI,53095,JPMORGAN CHASE BANK NATL ASSOC,IL,722410,240,20,1.0,0,0,1,0,N,N,137000.0,0.0,137000.0,109737.0,0
2,SAN DIEGO,CA,92128,UMPQUA BANK,OR,0,120,2,1.0,0,0,1,0,0,N,280000.0,0.0,280000.0,210000.0,0
3,WEBSTER,MA,1570,HOMETOWN BANK A CO-OPERATIVE B,MA,621310,84,7,1.0,0,0,1,1,0,Y,144500.0,0.0,144500.0,122825.0,0
4,JOPLIN,MO,64804,U.S. BANK NATIONAL ASSOCIATION,OH,0,60,2,2.0,0,0,1,0,N,Y,52500.0,0.0,52500.0,42000.0,0


## Categorical variables encoding

Encode categorical variables using one of the techniques below. Don't use LabelEncoder.
- One-hot encoder for variables with fewer than 10 valid values. Name your new columns "Original_name"_valid_value
- (If using sklearn) Target encoder from the following library: https://contrib.scikit-learn.org/category_encoders/index.html . Name your new column "Original_name"_trg
- (If using H2O) Use H2O target encoder


Example of use for target encoder:
```
import category_encoders as ce

encoder = ce.TargetEncoder(cols=[...])

encoder.fit(X, y)
X_cleaned = encoder.transform(X_dirty)
```
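
To make the idea behind target encoding concrete, here is a simplified hand-rolled version that also follows the `"Original_name"_trg` naming rule. It uses plain additive smoothing toward the overall mean, which is an assumption made for illustration and is not the exact weighting used by `category_encoders`; the function names and toy data are hypothetical.

```python
import pandas as pd

def fit_target_means(train, col, target, smoothing=10.0):
    """Learn a smoothed mean of `target` for every level of `col`."""
    prior = train[target].mean()
    stats = train.groupby(col)[target].agg(["sum", "count"])
    smoothed = (stats["sum"] + smoothing * prior) / (stats["count"] + smoothing)
    return smoothed, prior

def apply_target_means(df, col, mapping, prior):
    """Add "<col>_trg"; levels unseen during fitting fall back to the prior."""
    out = df.copy()
    out[col + "_trg"] = out[col].map(mapping).fillna(prior)
    return out

train = pd.DataFrame({
    "State": ["CA", "CA", "CA", "NY", "NY", "TX"],
    "Defaulted": [1, 0, 0, 1, 1, 0],
})
test = pd.DataFrame({"State": ["CA", "TX", "WA"]})

# Fit on training rows only, then apply the learned mapping to other data
mapping, prior = fit_target_means(train, "State", "Defaulted", smoothing=2.0)
test_encoded = apply_target_means(test, "State", mapping, prior)
```

Fitting on the training rows only, as in the `encoder.fit(X, y)` call above, keeps target information from the other splits out of the encoded values.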

In [17]:
# Let's find the number of unique values in each column
data.nunique()

City                  31305
State                    52
Zip                   32721
Bank                   5716
BankState                56
NAICS                  1307
Term                    407
NoEmp                   580
NewExist                  3
CreateJob               234
RetainedJob             344
FranchiseCode          2683
UrbanRural                3
RevLineCr                17
LowDoc                    9
DisbursementGross    109855
BalanceGross             13
GrAppv                20697
SBA_Appv              35857
Defaulted                 2
dtype: int64

In [18]:
# Let's separate the target column from the feature columns
y = data['Defaulted']
X = data.drop(columns='Defaulted')

In [19]:
# Applying the Train and Test Split from sklearn
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
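
The project asks for three splits (Train/Validation/Test), while the cell above produces two. A common way to get three is to call `train_test_split` twice, sketched here on synthetic arrays. The 60/20/20 proportions and the use of `stratify` are illustrative choices, not requirements stated in the project.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1000 rows, 20% positive class
features = np.arange(2000).reshape(1000, 2)
target = np.array([0] * 800 + [1] * 200)

# Step 1: hold out 20% as the final test set
f_rest, f_test, t_rest, t_test = train_test_split(
    features, target, test_size=0.20, stratify=target, random_state=42)

# Step 2: split the remaining 80% into train (60% overall) and validation (20% overall)
f_train, f_val, t_train, t_val = train_test_split(
    f_rest, t_rest, test_size=0.25, stratify=t_rest, random_state=42)
```

Stratifying on the target keeps the share of defaulted loans roughly equal across the three splits, which matters for an imbalanced label like this one.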

In [20]:
# Applying one-hot encoding to object columns with fewer than 10 unique values
from sklearn.preprocessing import OneHotEncoder
import numpy as np

for col in X_train.columns:
    if X_train[col].dtype == 'object':
        if X_train[col].nunique() < 10:
            print("One-hot encoding of ", col)
            # Note: scikit-learn 1.2 renamed `sparse` to `sparse_output`; use that name on newer releases
            enc = OneHotEncoder(handle_unknown='ignore', sparse=False)
            enc.fit(X_train[[col]])
            result = enc.transform(X_train[[col]])
            ohe_columns = [col+"_"+str(x) for x in enc.categories_[0]]
            result_train = pd.DataFrame(result, columns=ohe_columns)
            X_train = pd.concat([X_train.reset_index(drop=True), result_train.reset_index(drop=True)], axis=1)
            # Encode the test data with the encoder fitted on the training data
            result = enc.transform(X_test[[col]])
            result_test = pd.DataFrame(result, columns=ohe_columns)
            X_test = pd.concat([X_test.reset_index(drop=True), result_test.reset_index(drop=True)], axis=1)

One-hot encoding of  LowDoc


In [21]:
# Let's check for the head of training data
X_train.head()

Unnamed: 0,City,State,Zip,Bank,BankState,NAICS,Term,NoEmp,NewExist,CreateJob,RetainedJob,FranchiseCode,UrbanRural,RevLineCr,LowDoc,DisbursementGross,BalanceGross,GrAppv,SBA_Appv,LowDoc_0,LowDoc_A,LowDoc_C,LowDoc_Missing,LowDoc_N,LowDoc_R,LowDoc_S,LowDoc_Y
0,SYLVANIA,OH,43560,CITIZENS BANK NATL ASSOC,RI,541870,84,7,1.0,0,7,1,1,Y,N,10443.0,0.0,5000.0,2500.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
1,MOUNT PLEASANT,MI,48858,"PNC BANK, NATIONAL ASSOCIATION",IL,541110,83,1,1.0,0,0,1,1,Y,N,25734.0,0.0,25000.0,12500.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
2,Lewistown,PA,17044,MANUFACTURERS & TRADERS TR CO,NY,236220,60,2,2.0,0,2,0,1,N,N,30600.0,0.0,30600.0,15300.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
3,PHILADELPHIA,MS,39350,STATE BANK & TRUST COMPANY,MS,112320,180,3,1.0,0,0,1,2,N,N,280000.0,0.0,280000.0,210000.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
4,CORONA,CA,91720,CDC SMALL BUS. FINAN CORP,CA,327111,240,98,1.0,0,75,1,0,N,N,403000.0,0.0,403000.0,403000.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0


In [22]:
# Applying the Target Encoder on the training and test dataset
import category_encoders as ce

encoder = ce.TargetEncoder()

encoder.fit(X_train,y_train)
X_train_trg = encoder.transform(X_train)
X_test_trg = encoder.transform(X_test)

In [23]:
# Let's inspect the target-encoded training data
X_train_trg 

Unnamed: 0,City,State,Zip,Bank,BankState,NAICS,Term,NoEmp,NewExist,CreateJob,RetainedJob,FranchiseCode,UrbanRural,RevLineCr,LowDoc,DisbursementGross,BalanceGross,GrAppv,SBA_Appv,LowDoc_0,LowDoc_A,LowDoc_C,LowDoc_Missing,LowDoc_N,LowDoc_R,LowDoc_S,LowDoc_Y
0,0.166667,0.175254,43560,0.177125,0.177988,541870,84,7,1.0,0,7,1,1,0.174651,0.176018,10443.0,0.0,5000.0,2500.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
1,0.190476,0.168804,48858,0.178601,0.175451,541110,83,1,1.0,0,0,1,1,0.174651,0.176018,25734.0,0.0,25000.0,12500.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
2,0.020925,0.179500,17044,0.178172,0.172875,236220,60,2,2.0,0,2,0,1,0.176515,0.176018,30600.0,0.0,30600.0,15300.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
3,0.196747,0.166559,39350,0.175824,0.165029,112320,180,3,1.0,0,0,1,2,0.176515,0.176018,280000.0,0.0,280000.0,210000.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
4,0.179625,0.176166,91720,0.166774,0.173457,327111,240,98,1.0,0,75,1,0,0.176515,0.176018,403000.0,0.0,403000.0,403000.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
540969,0.156923,0.176166,92807,0.166774,0.173457,493110,240,52,1.0,10,0,1,1,0.175622,0.176018,1000000.0,0.0,1000000.0,1000000.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
540970,0.217213,0.170484,12549,0.160147,0.172875,621111,84,1,2.0,4,1,1,1,0.174651,0.176018,150000.0,0.0,150000.0,75000.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
540971,0.224138,0.176166,91733,0.172687,0.173457,485310,78,1,1.0,1,1,1,0,0.175622,0.176018,5000.0,0.0,5000.0,4250.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
540972,0.199997,0.175948,83894,0.167095,0.180403,0,243,14,2.0,14,0,1,0,0.176515,0.186916,360000.0,0.0,360000.0,270000.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0


In [24]:
# Let's inspect the target-encoded test data
X_test_trg 

Unnamed: 0,City,State,Zip,Bank,BankState,NAICS,Term,NoEmp,NewExist,CreateJob,RetainedJob,FranchiseCode,UrbanRural,RevLineCr,LowDoc,DisbursementGross,BalanceGross,GrAppv,SBA_Appv,LowDoc_0,LowDoc_A,LowDoc_C,LowDoc_Missing,LowDoc_N,LowDoc_R,LowDoc_S,LowDoc_Y
0,0.181172,0.176999,2906,0.167286,0.177988,339914,84,2,2.0,1,0,0,1,0.174651,0.176018,65341.0,0.0,40000.0,20000.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
1,0.188966,0.176166,93701,0.177700,0.173457,445120,120,1,1.0,0,1,1,1,0.175622,0.176018,100000.0,0.0,100000.0,85000.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
2,0.166667,0.168808,33781,0.175576,0.176841,337110,55,4,1.0,0,4,0,1,0.174651,0.176018,111750.0,0.0,35000.0,17500.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
3,0.197368,0.175948,83644,0.168181,0.173670,238350,107,7,1.0,2,9,0,1,0.182016,0.176018,64800.0,0.0,20000.0,10000.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
4,0.192220,0.170484,11365,0.160147,0.172875,722110,84,10,1.0,2,10,1,1,0.174651,0.176018,25000.0,0.0,25000.0,12500.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
266446,0.020925,0.170484,14559,0.183320,0.173670,0,60,3,2.0,0,0,1,0,0.176515,0.176018,25000.0,0.0,25000.0,12500.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
266447,0.166667,0.174262,75137,0.191426,0.182782,811192,294,1,2.0,0,0,1,0,0.175622,0.176018,499900.0,0.0,499900.0,374925.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
266448,0.212389,0.176166,91007,0.172687,0.173457,722110,84,11,1.0,0,11,0,1,0.176515,0.176018,300000.0,0.0,300000.0,225000.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
266449,0.160494,0.172818,54935,0.222222,0.169524,0,240,8,1.0,0,0,1,0,0.176515,0.167820,185000.0,0.0,185000.0,129500.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Model Training

Depending on the model you choose, you might need to apply an appropriate scaler to the numerical variables.
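
For scale-sensitive models such as logistic regression, one way to apply a scaler is to chain it with the model in a pipeline, so the scaling parameters are learned from the training data only. The sketch below uses synthetic features on very different scales; the data and the choice of `StandardScaler` are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two synthetic features on very different scales (think loan term vs. loan amount)
X_demo = np.column_stack([
    rng.normal(100, 50, 500),
    rng.normal(200_000, 150_000, 500),
])
y_demo = (X_demo[:, 0] < 90).astype(int)

# The scaler is fitted inside the pipeline, then the model sees standardized inputs
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_demo, y_demo)
train_accuracy = model.score(X_demo, y_demo)
```

Saving the whole pipeline as one object also means the scoring function does not need to load and apply the scaler separately. Tree-based models generally do not need this step.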

Train at least two types of models from the list below.
If you use sklearn libraries:
- Logistic regression
- SVM
- Decision Tree

If you use H2O libraries:
- GLM
- SVM
- Naïve Bayes Classifier

In [25]:
# Logistic Regression Model
from sklearn.linear_model import LogisticRegression
X_train_trg = X_train_trg.replace((np.inf, -np.inf, np.nan), 0).reset_index(drop=True)
lreg = LogisticRegression()
lreg.fit(X_train_trg, y_train)
print("Mean accuracy for training:",lreg.score(X_train_trg, y_train))

Mean accuracy for training: 0.831529796256382


In [26]:
# Let's predict the Defaulted label for the test dataset and score the accuracy
from sklearn.metrics import accuracy_score
X_test_trg = X_test_trg.replace((np.inf, -np.inf, np.nan), 0).reset_index(drop=True)
y_pred = lreg.predict(X_test_trg)
accuracy_score(y_test, y_pred)

0.8303853241308908

In [27]:
# Let's create the classification report for the predictions
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.85      0.97      0.90    219567
           1       0.56      0.17      0.26     46884

    accuracy                           0.83    266451
   macro avg       0.70      0.57      0.58    266451
weighted avg       0.80      0.83      0.79    266451



In [28]:
# Dataset for the Decision Tree
X_train_dtr = X_train_trg.drop(columns=ohe_columns)
X_test_dtr = X_test_trg.drop(columns=ohe_columns)

In [29]:
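# Decision Tree
# Note: a regressor is fitted to the 0/1 target here; DecisionTreeClassifier is the usual choice for classification
from sklearn.tree import DecisionTreeRegressor

dtr = DecisionTreeRegressor()
dtr.fit(X_train_dtr,y_train)
y_pred_dtr = dtr.predict(X_test_dtr)
# accuracy_score only accepts these predictions while every value is exactly 0 or 1
accuracy_score(y_test, y_pred_dtr)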

0.9217079312894303

In [30]:
# Let's compute the mean squared error for the predictions from the decision tree
from sklearn import metrics
print('MSE:', metrics.mean_squared_error(y_test, y_pred_dtr))

MSE: 0.07829206871056968


In [31]:
# Let's create the classification report for the predictions
print(classification_report(y_test, y_pred_dtr))

              precision    recall  f1-score   support

           0       0.95      0.95      0.95    219567
           1       0.78      0.78      0.78     46884

    accuracy                           0.92    266451
   macro avg       0.86      0.87      0.87    266451
weighted avg       0.92      0.92      0.92    266451
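
The tree above is a regressor fitted to a binary target. For comparison, a classifier variant is sketched below on synthetic data. It returns hard labels through `predict` and class probabilities through `predict_proba`, which is what a maximum-F1 threshold would be applied to. The data and hyper-parameter values are arbitrary illustrations, not tuned results.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(400, 3))
y_demo = (X_demo[:, 0] + 0.5 * X_demo[:, 1] > 0).astype(int)

# Fit on the first 300 rows, predict on the remaining 100
clf = DecisionTreeClassifier(max_depth=4, min_samples_leaf=10, random_state=42)
clf.fit(X_demo[:300], y_demo[:300])

labels = clf.predict(X_demo[300:])              # hard 0/1 class labels
proba = clf.predict_proba(X_demo[300:])[:, 1]   # estimated probability of class 1
```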



## Model Tuning

Choose one model from the list above and explain why you picked it over the others. Tune the selected model:
- Hyper-parameter tuning. Your hyper-parameter search space should have at least 50 combinations.
- To avoid overfitting and to get a reasonable estimate of model performance on the hold-out dataset, split your dataset as follows:
    - Train: used to train the model
    - Validation: used to validate the model in each round of training
    - Testing: used only once, on the final model, to provide the final performance metrics
- Feature engineering. Add at least two engineered features, for example a feature that combines two existing features.
- If your model returns probabilities, calculate the probability threshold that maximizes F1.
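
As an illustration of the feature engineering item, the sketch below derives two features by combining existing columns. The feature names `SBA_Portion` and `SameState` and the helper `add_features` are hypothetical choices, and whether they help the model would need to be checked on the validation split. The sample rows reuse values shown in the data preview earlier in the notebook.

```python
import numpy as np
import pandas as pd

def add_features(df):
    """Return a copy of df with two engineered columns added."""
    out = df.copy()
    # Share of the approved amount guaranteed by the SBA (0 when GrAppv is 0)
    ratio = out["SBA_Appv"] / out["GrAppv"].replace(0, np.nan)
    out["SBA_Portion"] = ratio.fillna(0.0)
    # 1 when the borrower and the bank are in the same state
    out["SameState"] = (out["State"] == out["BankState"]).astype(int)
    return out

sample = pd.DataFrame({
    "State": ["MD", "WI", "MA"],
    "BankState": ["VA", "IL", "MA"],
    "GrAppv": [743000.0, 137000.0, 144500.0],
    "SBA_Appv": [743000.0, 109737.0, 122825.0],
})
engineered = add_features(sample)
```

Keeping the logic in one function makes it easy to call the same code from the scoring function.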

In [32]:
# I am choosing the Decision Tree Regressor model since it gives better test accuracy than the Logistic Regression model (0.92 vs. 0.83)
# Parameters for the Decision Tree
parameters={"max_depth" : [1,3,5,7,9,11,12],
           "min_samples_leaf":[1,2,3,4,5,6,7,8,9,10],
           "max_features":["auto","log2","sqrt",None],}
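
The requirement of at least 50 combinations can be checked before launching the search. The sketch below restates the grid from the cell above and counts its combinations with `ParameterGrid`, which only enumerates the combinations and does not check that each value is accepted by the estimator.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the notebook cell
parameters = {
    "max_depth": [1, 3, 5, 7, 9, 11, 12],
    "min_samples_leaf": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "max_features": ["auto", "log2", "sqrt", None],
}

# 7 depths x 10 leaf sizes x 4 feature settings
n_combinations = len(ParameterGrid(parameters))
```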

In [33]:
# Let's apply Grid Search to find the best parameters
from sklearn.model_selection import GridSearchCV

tuning_model=GridSearchCV(dtr,param_grid=parameters,scoring='neg_mean_squared_error',cv=3,verbose=3).fit(X_train_dtr,y_train)

Fitting 3 folds for each of 280 candidates, totalling 840 fits
[CV 1/3] END max_depth=1, max_features=auto, min_samples_leaf=1; total time=   0.2s
[CV 2/3] END max_depth=1, max_features=auto, min_samples_leaf=1; total time=   0.2s
[CV 3/3] END max_depth=1, max_features=auto, min_samples_leaf=1; total time=   0.2s
[CV 1/3] END max_depth=1, max_features=auto, min_samples_leaf=2; total time=   0.3s
[CV 2/3] END max_depth=1, max_features=auto, min_samples_leaf=2; total time=   0.2s
[CV 3/3] END max_depth=1, max_features=auto, min_samples_leaf=2; total time=   0.2s
[CV 1/3] END max_depth=1, max_features=auto, min_samples_leaf=3; total time=   0.2s
[CV 2/3] END max_depth=1, max_features=auto, min_samples_leaf=3; total time=   0.2s
[CV 3/3] END max_depth=1, max_features=auto, min_samples_leaf=3; total time=   0.2s
[CV 1/3] END max_depth=1, max_features=auto, min_samples_leaf=4; total time=   0.2s
[CV 2/3] END max_depth=1, max_features=auto, min_samples_leaf=4; total time=   0.2s
[CV 3/3] END 

In [34]:
# Display the best parameters
tuning_model.best_params_

{'max_depth': 12, 'max_features': 'auto', 'min_samples_leaf': 10}

In [35]:
# Display the best score
tuning_model.best_score_

-0.048274110780704

In [36]:
# Tuned model with the best parameters (for a regressor, max_features=None is equivalent to the 'auto' value found above)
tuned_hyper_model= DecisionTreeRegressor(max_depth=12,max_features=None,min_samples_leaf=10)
tuned_hyper_model.fit(X_train_dtr,y_train)

DecisionTreeRegressor(max_depth=12, min_samples_leaf=10)

In [37]:
# Making predictions from the trained model
tuned_pred=tuned_hyper_model.predict(X_test_dtr)

In [38]:
# Display the Mean Squared Error for the Predictions, it has reduced from 0.0783 to 0.0475
print('MSE:', metrics.mean_squared_error(y_test, tuned_pred))

MSE: 0.04752866849133463


## Save all artifacts

Save all artifacts needed for the scoring function:
- Trained model
- Encoders

You should restart your kernel now to properly test the scoring function.

In [42]:
# Saving the Artifacts for Trained Model
import joblib
joblib.dump(tuned_hyper_model, 'DecisionTree.pkl') 
tuned_hyper_model = joblib.load('DecisionTree.pkl') 

In [43]:
# Saving the Artifacts for OneHotEncoder
import joblib
joblib.dump(enc,'OneHotEncoder.pkl') 
enc = joblib.load('OneHotEncoder.pkl') 

In [44]:
# Saving the Artifacts for Target Encoder
import joblib
joblib.dump(encoder, 'TargetEncoded.pkl') 
encoder = joblib.load('TargetEncoded.pkl') 
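
An alternative to three separate files is to bundle everything the scoring function needs into one dictionary and save it once. The sketch below uses placeholder values instead of the fitted objects, and the key names and the file name `artifacts.pkl` are hypothetical.

```python
import os
import tempfile

import joblib

# Placeholders stand in for the fitted model and encoders
artifacts = {
    "model": {"max_depth": 12, "min_samples_leaf": 10},
    "one_hot_encoder": ["LowDoc_N", "LowDoc_Y"],
    "target_encoder": {"State": {"CA": 0.18}},
    "threshold": 0.5,
}

# Round-trip through a temporary directory to confirm the bundle reloads intact
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "artifacts.pkl")
    joblib.dump(artifacts, path)
    restored = joblib.load(path)
```

A single bundle makes it harder for the model and its encoders to get out of sync. Separate files, as in the cells above, work equally well as long as all of them are submitted together.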