## Individual Assignment - ANGGORO Fajar Tri
## General Workflow
### 1) Problem Statement
### 2) Data Preprocessing
### 3) Feature Selection
### 4) Model Development & Benchmarking
### 5) Conclusion

#### 1) Problem Statement

In this case, we set up a benchmark experiment: we develop five different machine learning models, plus an additional random forest, and apply them to a credit default dataset. The goal of this benchmark experiment is to find the best-performing model for the given dataset.

#### Basetable Preparation

In [2]:
import pandas as pd
import numpy as np

# read in data
raw = pd.read_csv('./Data/credit_default_train.csv')
raw.head()

# Some preprocessing steps still need to be done here: one-hot encoding and scaling the data

Unnamed: 0,cust_id,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,...,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default.payment.next.month
0,18895,70000.0,1.0,3.0,2.0,34.0,0.0,0.0,0.0,0.0,...,25559.0,26134.0,26715.0,1700.0,1500.0,2000.0,1000.0,1000.0,2000.0,0
1,25102,390000.0,2.0,2.0,2.0,26.0,2.0,2.0,2.0,0.0,...,140387.0,128112.0,115514.0,5000.0,3000.0,5000.0,4548.0,4100.0,3300.0,0
2,28867,60000.0,1.0,1.0,2.0,27.0,0.0,0.0,0.0,0.0,...,26038.0,28607.0,27997.0,1378.0,1406.0,3000.0,3000.0,0.0,923.0,1
3,1842,140000.0,2.0,2.0,1.0,55.0,0.0,0.0,0.0,0.0,...,72391.0,61298.0,62193.0,4200.0,2822.0,2336.0,2588.0,2250.0,2491.0,0
4,3371,50000.0,1.0,1.0,2.0,29.0,2.0,2.0,2.0,0.0,...,1047.0,0.0,0.0,3000.0,0.0,1000.0,0.0,0.0,0.0,1


In [3]:
# let's create flag variables to indicate missing values
raw["fl_missing_edu"] = np.where(raw['EDUCATION'].isnull(), 1, 0)
raw["fl_missing_mar"] = np.where(raw['MARRIAGE'].isnull(), 1, 0)

# create flag variable to indicate missing Limit Balance
raw["fl_missing_limbal"] = np.where(raw['LIMIT_BAL'].isnull(), 1, 0)

# treat missing values by categorising them as 'others'
# (assign the result back instead of calling fillna(inplace=True) on a column selection)
raw['EDUCATION'] = raw['EDUCATION'].fillna(4)
raw['MARRIAGE'] = raw['MARRIAGE'].fillna(3)

# for the Sex column, we first recode the value for females (2 -> 0)
raw["SEX"] = raw["SEX"].replace({2 : 0})

# create flag variables to indicate missing Sex and Age
raw["fl_missing_sex"] = np.where(raw['SEX'].isnull(), 1, 0)
raw["fl_missing_age"] = np.where(raw['AGE'].isnull(), 1, 0)

# impute missing Sex values with 1
raw['SEX'] = raw['SEX'].fillna(1)

# Pay Status: impute missing values with 0
for col in ['PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6']:
    raw[col] = raw[col].fillna(0)

In [4]:
# inspect the data after flagging and imputation
raw.head()

Unnamed: 0,cust_id,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,...,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default.payment.next.month,fl_missing_edu,fl_missing_mar,fl_missing_limbal,fl_missing_sex,fl_missing_age
0,18895,70000.0,1.0,3.0,2.0,34.0,0.0,0.0,0.0,0.0,...,2000.0,1000.0,1000.0,2000.0,0,0,0,0,0,0
1,25102,390000.0,0.0,2.0,2.0,26.0,2.0,2.0,2.0,0.0,...,5000.0,4548.0,4100.0,3300.0,0,0,0,0,0,0
2,28867,60000.0,1.0,1.0,2.0,27.0,0.0,0.0,0.0,0.0,...,3000.0,3000.0,0.0,923.0,1,0,0,0,0,0
3,1842,140000.0,0.0,2.0,1.0,55.0,0.0,0.0,0.0,0.0,...,2336.0,2588.0,2250.0,2491.0,0,0,0,0,0,0
4,3371,50000.0,1.0,1.0,2.0,29.0,2.0,2.0,2.0,0.0,...,1000.0,0.0,0.0,0.0,1,0,0,0,0,0


In [5]:
# fix column ordering: move the target variable to the last position

last_column = raw.pop('default.payment.next.month')

# insert column using insert(position,column_name, value) function
raw.insert(len(raw.columns), 'default.payment.next.month', last_column)

# export Basetable
raw.to_csv('./Data/basetable.csv', index = False)

#### 2) Data Preprocessing

For data preprocessing, we split the data into training and testing sets. The encoder and the scaler are fitted on the training set only and then applied (transformation only) to the testing set; the mean imputation below is computed on each split separately.
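The fit-on-train, transform-on-test idea can be sketched with a scikit-learn `ColumnTransformer` that also learns the imputation means from the training split. This is only an illustration on a made-up toy frame: the column names mirror the dataset, but the values and the names `preprocess`, `train`, and `test` are assumptions, and it is not the preprocessing code used in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, RobustScaler

# toy data (hypothetical values) with one missing AGE in each split
train = pd.DataFrame({
    "AGE": [30.0, 40.0, np.nan, 50.0],
    "LIMIT_BAL": [10000.0, 20000.0, 30000.0, 40000.0],
    "EDUCATION": [1.0, 2.0, 2.0, 3.0],
})
test = pd.DataFrame({
    "AGE": [np.nan, 25.0],
    "LIMIT_BAL": [15000.0, 50000.0],
    "EDUCATION": [2.0, 1.0],
})

numeric = ["AGE", "LIMIT_BAL"]
categorical = ["EDUCATION"]

# numeric columns: mean imputation followed by robust scaling
# categorical columns: one-hot encoding
preprocess = ColumnTransformer(
    [
        ("num", Pipeline([
            ("impute", SimpleImputer(strategy="mean")),
            ("scale", RobustScaler()),
        ]), numeric),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
    ],
    sparse_threshold=0,  # always return a dense array
)

# statistics (means, medians, IQRs, categories) are learned from train only
train_t = preprocess.fit_transform(train)
# the test split is transformed with those training statistics
test_t = preprocess.transform(test)

# the mean used to fill the missing test AGE comes from the training split
train_age_mean = (
    preprocess.named_transformers_["num"].named_steps["impute"].statistics_[0]
)
print(train_age_mean)
print(test_t.shape)
```

With this layout the missing `AGE` in the toy test split is filled with the training mean rather than the test mean, so no information from the test split enters any fitted statistic.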

In [8]:
# read data
base = pd.read_csv('./Data/basetable.csv')
base.head()

Unnamed: 0,cust_id,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,...,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,fl_missing_edu,fl_missing_mar,fl_missing_limbal,fl_missing_sex,fl_missing_age,default.payment.next.month
0,18895,70000.0,1.0,3.0,2.0,34.0,0.0,0.0,0.0,0.0,...,2000.0,1000.0,1000.0,2000.0,0,0,0,0,0,0
1,25102,390000.0,0.0,2.0,2.0,26.0,2.0,2.0,2.0,0.0,...,5000.0,4548.0,4100.0,3300.0,0,0,0,0,0,0
2,28867,60000.0,1.0,1.0,2.0,27.0,0.0,0.0,0.0,0.0,...,3000.0,3000.0,0.0,923.0,0,0,0,0,0,1
3,1842,140000.0,0.0,2.0,1.0,55.0,0.0,0.0,0.0,0.0,...,2336.0,2588.0,2250.0,2491.0,0,0,0,0,0,0
4,3371,50000.0,1.0,1.0,2.0,29.0,2.0,2.0,2.0,0.0,...,1000.0,0.0,0.0,0.0,0,0,0,0,0,1


In [9]:
# let us first split the data
from sklearn.model_selection import train_test_split

# define X and y
X, y = base.iloc[:, 1:-1], base['default.payment.next.month']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

In [10]:
# data prep for training set
# For continuous variables, we impute missing values with the column mean
mean_cols = (['AGE', 'LIMIT_BAL']
             + [f'BILL_AMT{i}' for i in range(1, 7)]
             + [f'PAY_AMT{i}' for i in range(1, 7)])

# work on a copy so the assignment does not act on a slice of base
X_train = X_train.copy()
for col in mean_cols:
    X_train[col] = X_train[col].fillna(X_train[col].mean())

In [11]:
# Let's one-hot encode the categorical variables first: Education & Marital Status
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import make_column_transformer

# make transformer object
transformer_OHE = make_column_transformer(
    (OneHotEncoder(), ['EDUCATION', 'MARRIAGE']),
    remainder='passthrough')

# apply fit transform
transformed = transformer_OHE.fit_transform(X_train)
X_train = pd.DataFrame(transformed, columns=transformer_OHE.get_feature_names_out())

# let's fix column names 
X_train.columns = X_train.columns.str.replace('onehotencoder__', '')
X_train.columns = X_train.columns.str.replace('remainder__', '')

X_train.head()

Unnamed: 0,EDUCATION_0.0,EDUCATION_1.0,EDUCATION_2.0,EDUCATION_3.0,EDUCATION_4.0,EDUCATION_5.0,EDUCATION_6.0,MARRIAGE_0.0,MARRIAGE_1.0,MARRIAGE_2.0,...,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,fl_missing_edu,fl_missing_mar,fl_missing_limbal,fl_missing_sex,fl_missing_age
0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,2000.0,2000.0,4000.0,2300.0,3000.0,0.0,0.0,0.0,0.0,0.0
1,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,1400.0,675.0,659.0,0.0,3353.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,...,5792.0,0.0,792.0,0.0,5857.0,0.0,0.0,0.0,0.0,0.0
3,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,2234.0,0.0,932.0,1014.0,1000.0,0.0,0.0,0.0,0.0,0.0
4,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,2000.0,1300.0,1500.0,1500.0,1576.0,0.0,0.0,0.0,0.0,0.0


In [12]:
# let's scale the continuous columns so that distance-based models are less sensitive to feature magnitudes
# we use RobustScaler because it is less affected by outliers
from sklearn.preprocessing import RobustScaler

# take continuous columns (same order as they appear in X_train)
cols_to_scale = [
    col for col in X_train.columns
    if col.startswith(('BILL_A', 'PAY_A')) or col in ('LIMIT_BAL', 'AGE')
]

# make transformer object
transformer_scaler = make_column_transformer(
    (RobustScaler(), cols_to_scale),
    remainder='passthrough')

# apply fit transform
transformed = transformer_scaler.fit_transform(X_train)
X_train = pd.DataFrame(transformed, columns=transformer_scaler.get_feature_names_out())



# let's fix column names 
X_train.columns = X_train.columns.str.replace('robustscaler__', '')
X_train.columns = X_train.columns.str.replace('remainder__', '')

# colnames
colnames = X_train.columns

X_train.head()

Unnamed: 0,LIMIT_BAL,AGE,BILL_AMT1,BILL_AMT2,BILL_AMT3,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,...,PAY_2,PAY_3,PAY_4,PAY_5,PAY_6,fl_missing_edu,fl_missing_mar,fl_missing_limbal,fl_missing_sex,fl_missing_age
0,0.473684,-0.538462,-0.225372,-0.193211,-0.167877,-0.132136,-0.047604,-0.085837,0.097979,-0.003398,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,-0.421053,1.153846,0.623056,0.682258,0.261578,0.284217,0.306432,-0.366637,0.171464,-0.14902,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,1.894737,1.153846,-0.360106,-0.354125,-0.253577,-0.366547,-0.364617,-0.358291,-0.417391,0.916935,...,-1.0,-1.0,2.0,-1.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.368421,-0.307692,-0.359046,-0.351304,-0.322956,0.466642,-0.361695,-0.345267,-0.261604,0.053395,...,-1.0,-1.0,0.0,-1.0,-1.0,0.0,0.0,0.0,0.0,0.0
4,-0.368421,-0.307692,0.195357,0.219415,0.267106,0.327635,0.389056,0.426636,-0.514391,-0.003398,...,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [13]:
# now apply the same imputation to the testing set
# note: the means below are computed on X_test itself, not taken from the training set
mean_cols = (['AGE', 'LIMIT_BAL']
             + [f'BILL_AMT{i}' for i in range(1, 7)]
             + [f'PAY_AMT{i}' for i in range(1, 7)])

# work on a copy so the assignment does not act on a slice of base
X_test = X_test.copy()
for col in mean_cols:
    X_test[col] = X_test[col].fillna(X_test[col].mean())

In [14]:
# One-hot encode the testing set with the encoder already fitted on the training set
# (OneHotEncoder and make_column_transformer were imported in the training cell)

# Only apply transform
transformed = transformer_OHE.transform(X_test)
X_test = pd.DataFrame(transformed, columns=transformer_OHE.get_feature_names_out())

# let's fix column names 
X_test.columns = X_test.columns.str.replace('onehotencoder__', '')
X_test.columns = X_test.columns.str.replace('remainder__', '')

X_test.head()

Unnamed: 0,EDUCATION_0.0,EDUCATION_1.0,EDUCATION_2.0,EDUCATION_3.0,EDUCATION_4.0,EDUCATION_5.0,EDUCATION_6.0,MARRIAGE_0.0,MARRIAGE_1.0,MARRIAGE_2.0,...,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,fl_missing_edu,fl_missing_mar,fl_missing_limbal,fl_missing_sex,fl_missing_age
0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,4043.0,10329.0,0.0,8249.0,2572.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,2500.0,0.0,2000.0,0.0,1000.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,5000.0,4000.0,3500.0,3000.0,4000.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,4800.0,0.0,0.0,0.0,0.0,0.0,0.0


In [15]:
# scale the testing set with the scaler already fitted on the training set
# cols_to_scale was built from X_train and is already stored in transformer_scaler,
# so it is not rebuilt here (appending to it again would only add duplicate names)


# apply transform only
transformed = transformer_scaler.transform(X_test)
X_test = pd.DataFrame(transformed)


# let's fix column names 
X_test.columns = colnames
X_test.head()

Unnamed: 0,LIMIT_BAL,AGE,BILL_AMT1,BILL_AMT2,BILL_AMT3,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,...,PAY_2,PAY_3,PAY_4,PAY_5,PAY_6,fl_missing_edu,fl_missing_mar,fl_missing_limbal,fl_missing_sex,fl_missing_age
0,0.368421,0.0,0.317919,0.790203,0.820678,0.909136,-0.382002,-0.193655,6.838947,0.492446,...,0.0,0.0,0.0,0.0,-1.0,0.0,0.0,0.0,0.0,0.0
1,-0.473684,0.0,-0.101477,-0.089855,-0.035582,-0.030555,0.020578,-0.066132,-0.514391,0.117954,...,2.0,2.0,2.0,2.0,2.0,0.0,0.0,0.0,0.0,0.0
2,0.052632,0.230769,1.0152,1.078163,1.101129,1.164476,1.189102,1.227433,0.710349,0.724713,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,-0.315789,-0.307692,0.585703,-0.12391,-0.35588,-0.374271,-0.381146,-0.366637,0.712554,-0.488805,...,0.0,0.0,-2.0,-2.0,-2.0,0.0,0.0,0.0,0.0,0.0
4,-0.105263,-0.615385,-0.289591,-0.255663,-0.35588,-0.374271,-0.381146,-0.265478,-0.146969,-0.488805,...,0.0,0.0,-2.0,-2.0,-1.0,0.0,0.0,0.0,0.0,0.0


#### 3) Feature Selection

For feature selection, we take a univariate approach: we compute the Pearson correlation between each feature and the target variable and check the corresponding p-value. Features with a p-value below 0.001 are flagged as selected.
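The selection rule can be illustrated on synthetic data before it is applied to the real features. In this sketch the columns `signal` and `noise`, the sample size, and the random seed are all assumptions made for the example; only the 0.001 threshold is taken from this notebook.

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
n = 500

# synthetic data (assumption): `signal` drives the target, `noise` does not
signal = rng.normal(size=n)
noise = rng.normal(size=n)
target = (signal + 0.5 * rng.normal(size=n) > 0).astype(int)

features = pd.DataFrame({"signal": signal, "noise": noise})

alpha = 0.001  # same p-value threshold as in the notebook
selected = []
for column in features.columns:
    # pearsonr returns the correlation coefficient and its two-sided p-value
    r, pvalue = pearsonr(features[column], target)
    print(f"{column}: r = {r:.3f}, p-value = {pvalue:.3g}")
    if pvalue < alpha:
        selected.append(column)

print(selected)
```

Because each feature is tested on its own, this rule can miss features that only matter in combination with others, which is one reason to treat the result as a screening step rather than a final verdict.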

In [16]:
from scipy.stats import pearsonr

selectedFeatures = []
target = "default.payment.next.month"

temp = pd.DataFrame(y_train)
temp.reset_index(drop=True ,inplace = True)

base = pd.concat([X_train, temp], axis = 1)

# check each features and see the corresponding p-value
for column in base.columns:
    if column not in ("cust_id","default.payment.next.month"):
        (pearson,pvalue) = pearsonr(base[column], base[target])
        print(f"{column} - p-value = {pvalue} - selected : {1 if pvalue < 0.001 else 0}")
        if pvalue < 0.001:
            selectedFeatures.append(column)

# remember that we can't drop features that result from one-hot encoding, nor the flag variables
# from this Pearson method, Age is then the only variable that we're going to drop, along with its flag variable

LIMIT_BAL - p-value = 1.0822548093360879e-69 - selected : 1
AGE - p-value = 0.9748284263439675 - selected : 0
BILL_AMT1 - p-value = 0.011504250438000092 - selected : 0
BILL_AMT2 - p-value = 0.0753486424110562 - selected : 0
BILL_AMT3 - p-value = 0.07023906687578682 - selected : 0
BILL_AMT4 - p-value = 0.17369063479630503 - selected : 0
BILL_AMT5 - p-value = 0.5025580991300875 - selected : 0
BILL_AMT6 - p-value = 0.6363867180572357 - selected : 0
PAY_AMT1 - p-value = 6.028111340295979e-17 - selected : 1
PAY_AMT2 - p-value = 6.50517282163158e-15 - selected : 1
PAY_AMT3 - p-value = 1.068796703085533e-10 - selected : 1
PAY_AMT4 - p-value = 1.9751422207018702e-11 - selected : 1
PAY_AMT5 - p-value = 1.8484924811785185e-10 - selected : 1
PAY_AMT6 - p-value = 7.313116412654968e-10 - selected : 1
EDUCATION_0.0 - p-value = 0.2339875064328704 - selected : 0
EDUCATION_1.0 - p-value = 1.3739258460595185e-09 - selected : 1
EDUCATION_2.0 - p-value = 1.1656559114552772e-05 - selected : 1
EDUCATION_3.0

#### 4) Model Development & Benchmarking

We now develop our models and benchmark them. For each model, we use `GridSearchCV`, which evaluates every hyperparameter combination in the grid with 5-fold cross-validation and refits the combination with the best mean AUC.
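How a multi-metric grid search picks its winner can be seen in a small, self-contained sketch. The synthetic data from `make_classification`, the `LogisticRegression` estimator, and the grid over `C` are assumptions chosen to keep the example fast; they are not the models or grids benchmarked below.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# synthetic binary classification data (assumption, not the credit dataset)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(
    estimator=LogisticRegression(max_iter=1000),
    param_grid=param_grid,
    scoring={"AUC": "roc_auc", "Accuracy": "accuracy"},
    refit="AUC",  # the best candidate is chosen by mean cross-validated AUC
    cv=5,
)
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)  # mean cross-validated AUC of the best candidate

# per-candidate means for both metrics are stored in cv_results_
print(search.cv_results_["mean_test_AUC"])
print(search.cv_results_["mean_test_Accuracy"])
```

With several scorers, `refit` names the metric that decides the winner; accuracy is still recorded for every candidate but does not influence the choice.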

In [80]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer
from sklearn.metrics import accuracy_score
from matplotlib import pyplot as plt

# Our first model is the SGD classifier, a linear model fitted with SGD (Stochastic Gradient Descent)
from sklearn.linear_model import SGDClassifier

sgd = SGDClassifier()

# scoring criteria: AUC & Accuracy
scoring = {"AUC": "roc_auc", "Accuracy": make_scorer(accuracy_score)}

param_SGD = {
    'alpha': [1e-4, 1e-3, 1e-2, 1e-1, 1e0], # regularization strength
    'max_iter': [200,400,600,800,1000], 
    'loss': ['log'], # logistic regression (named 'log_loss' in scikit-learn >= 1.1)
    'penalty': ['l2'] # L2 (ridge) regularization
}

# create gs object
sgd_gs = GridSearchCV(
    estimator = sgd,
    param_grid = param_SGD,
    scoring = scoring,
    refit="AUC",
    cv = 5,
    n_jobs = 5,
    verbose = 2
)

# Fitting our GridSearchCV Object
sgd_gs.fit(X_train, y_train)

# get results
results = sgd_gs.cv_results_

# get best Model
best = sgd_gs.best_estimator_

# print best results & parameter
print(sgd_gs.best_score_)
print(sgd_gs.best_params_)

Fitting 5 folds for each of 25 candidates, totalling 125 fits
0.7167220625357812
{'alpha': 0.01, 'loss': 'log', 'max_iter': 800, 'penalty': 'l2'}


In [82]:
# Now we apply the best model to make prediction on the test set
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score

auc_test = roc_auc_score(y_test, best.predict_proba(X_test)[:, 1])
acc_test = accuracy_score(y_test, np.argmax(best.predict_proba(X_test), axis = 1))
auc_train = roc_auc_score(y_train, best.predict_proba(X_train)[:, 1])
acc_train = accuracy_score(y_train, np.argmax(best.predict_proba(X_train), axis = 1))

In [83]:
# put result in a Dataframe
d = {'SGDClassifier': [sgd_gs.best_score_, auc_train, acc_train, auc_test, acc_test]}
sgd_res = pd.DataFrame(data = d, index=["Cross Validation AUC", "Train AUC", "Train Accuracy", "Test AUC", "Test Accuracy"])
sgd_res

Unnamed: 0,SGDClassifier
Cross Validation AUC,0.716722
Train AUC,0.716668
Train Accuracy,0.801071
Test AUC,0.712928
Test Accuracy,0.801


In [84]:
# Our 2nd Model is KNNclassifier
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier()

# scoring criteria: AUC & Accuracy
scoring = {"AUC": "roc_auc", "Accuracy": make_scorer(accuracy_score)}

param_knn = {
    'n_neighbors': [3,5,7,9,11,13, 15], # number of neighbors (k)
    'weights': ['uniform', 'distance'], # equal weighting or distance weighting
    'p': [1,2] # Manhattan or Euclidean
}

# create gs object
knn_gs = GridSearchCV(
    estimator = knn,
    param_grid = param_knn,
    scoring = scoring,
    refit="AUC",
    cv = 5,
    n_jobs = 5,
    verbose = 2
)

# Fitting our GridSearchCV Object
knn_gs.fit(X_train, y_train)

# get results
results = knn_gs.cv_results_

# get best Model
best = knn_gs.best_estimator_

# print best results & parameter
print(knn_gs.best_score_)
print(knn_gs.best_params_)

Fitting 5 folds for each of 28 candidates, totalling 140 fits
0.7271185397441259
{'n_neighbors': 15, 'p': 1, 'weights': 'uniform'}


In [85]:
# Now we apply the best model to make prediction on the test set
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score

auc_test = roc_auc_score(y_test, best.predict_proba(X_test)[:, 1])
acc_test = accuracy_score(y_test, np.argmax(best.predict_proba(X_test), axis = 1))
auc_train = roc_auc_score(y_train, best.predict_proba(X_train)[:, 1])
acc_train = accuracy_score(y_train, np.argmax(best.predict_proba(X_train), axis = 1))

In [86]:
# put result in a Dataframe
d = {'KNNClassifier': [knn_gs.best_score_, auc_train, acc_train, auc_test, acc_test]}
knn_res = pd.DataFrame(data = d, index=["Cross Validation AUC", "Train AUC", "Train Accuracy", "Test AUC", "Test Accuracy"])
knn_res

Unnamed: 0,KNNClassifier
Cross Validation AUC,0.727119
Train AUC,0.815307
Train Accuracy,0.819143
Test AUC,0.728299
Test Accuracy,0.803833


In [87]:
# Our 3rd Model is Support Vector Classifier (SVM)
from sklearn.svm import SVC
from sklearn.calibration import CalibratedClassifierCV

svc = SVC(probability=True)


# scoring criteria: AUC & Accuracy
scoring = {"AUC": "roc_auc", "Accuracy": make_scorer(accuracy_score)}

param_svc = {
    'C':[1, 2, 3],
    'kernel':['linear', 'poly', 'rbf', 'sigmoid'],
    'gamma':['scale'],
    'class_weight': ['balanced']
}

# create gs object
svc_gs = GridSearchCV(
    estimator = svc,
    param_grid = param_svc,
    scoring = scoring,
    refit="AUC",
    cv = 5,
    n_jobs = 5,
    verbose = 2
)

# Fitting our GridSearchCV Object
svc_gs.fit(X_train, y_train)

# get results
results = svc_gs.cv_results_

# get best Model
best = svc_gs.best_estimator_

# print best results & parameter
print(svc_gs.best_score_)
print(svc_gs.best_params_)

Fitting 5 folds for each of 12 candidates, totalling 60 fits
0.7436385223022137
{'C': 3, 'class_weight': 'balanced', 'gamma': 'scale', 'kernel': 'rbf'}


In [88]:
# Now we apply the best model to make prediction on the test set
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score

auc_test = roc_auc_score(y_test, best.predict_proba(X_test)[:, 1])
acc_test = accuracy_score(y_test, np.argmax(best.predict_proba(X_test), axis = 1))
auc_train = roc_auc_score(y_train, best.predict_proba(X_train)[:, 1])
acc_train = accuracy_score(y_train, np.argmax(best.predict_proba(X_train), axis = 1))

In [89]:
# put result in a Dataframe
d = {'SVMClassifier': [svc_gs.best_score_, auc_train, acc_train, auc_test, acc_test]}
svc_res = pd.DataFrame(data = d, index=["Cross Validation AUC", "Train AUC", "Train Accuracy", "Test AUC", "Test Accuracy"])
svc_res

Unnamed: 0,SVMClassifier
Cross Validation AUC,0.743639
Train AUC,0.786385
Train Accuracy,0.822357
Test AUC,0.747674
Test Accuracy,0.816333


In [90]:
# Our 4th Model is Gradient Boosting Classifier
from sklearn.ensemble import GradientBoostingClassifier

gbc = GradientBoostingClassifier()

# scoring criteria: AUC & Accuracy
scoring = {"AUC": "roc_auc", "Accuracy": make_scorer(accuracy_score)}

param_gbc = {
    'loss':['deviance', 'exponential'], # 'deviance' is log loss ('log_loss' in scikit-learn >= 1.1); 'exponential' recovers AdaBoost
    'learning_rate': [1e-3, 1e-2, 1e-1, 1e0],
    'max_depth':[3, 6, 9]
}

# create gs object
gbc_gs = GridSearchCV(
    estimator = gbc,
    param_grid = param_gbc,
    scoring = scoring,
    refit="AUC",
    cv = 5,
    n_jobs = 5,
    verbose = 2
)

# Fitting our GridSearchCV Object
gbc_gs.fit(X_train, y_train)

# get results
results = gbc_gs.cv_results_

# get best Model
best = gbc_gs.best_estimator_

# print best results & parameter
print(gbc_gs.best_score_)
print(gbc_gs.best_params_)

Fitting 5 folds for each of 24 candidates, totalling 120 fits
0.7710889923142866
{'learning_rate': 0.1, 'loss': 'exponential', 'max_depth': 3}


In [91]:
# Now we apply the best model to make prediction on the test set
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score

auc_test = roc_auc_score(y_test, best.predict_proba(X_test)[:, 1])
acc_test = accuracy_score(y_test, np.argmax(best.predict_proba(X_test), axis = 1))
auc_train = roc_auc_score(y_train, best.predict_proba(X_train)[:, 1])
acc_train = accuracy_score(y_train, np.argmax(best.predict_proba(X_train), axis = 1))

In [92]:
# put result in a Dataframe
d = {'GBClassifier': [gbc_gs.best_score_, auc_train, acc_train, auc_test, acc_test]}
gbc_res = pd.DataFrame(data = d, index=["Cross Validation AUC", "Train AUC", "Train Accuracy", "Test AUC", "Test Accuracy"])
gbc_res

Unnamed: 0,GBClassifier
Cross Validation AUC,0.771089
Train AUC,0.808317
Train Accuracy,0.824
Test AUC,0.774188
Test Accuracy,0.8165


In [93]:
# our 5th model is Neural Network (MLPclassifier)
from sklearn.neural_network import MLPClassifier

mlp = MLPClassifier()

# scoring criteria: AUC & Accuracy
scoring = {"AUC": "roc_auc", "Accuracy": make_scorer(accuracy_score)}

param_mlp = {
    'hidden_layer_sizes':[(64,32,16), (32, 16), (16, 8), (32), (16), (8)], # hidden layers; (32) is the plain int 32, i.e. one hidden layer of 32 units
    'activation': ['logistic', 'relu'],
    'solver':['sgd', 'adam'],
    'learning_rate':['constant', 'adaptive'],
    'max_iter': [1000]
}

# create gs object
mlp_gs = GridSearchCV(
    estimator = mlp,
    param_grid = param_mlp,
    scoring = scoring,
    refit="AUC",
    cv = 5,
    n_jobs = 5,
    verbose = 2
)

# Fitting our GridSearchCV Object
mlp_gs.fit(X_train, y_train)

# get results
results = mlp_gs.cv_results_

# get best Model
best = mlp_gs.best_estimator_

# print best results & parameter
print(mlp_gs.best_score_)
print(mlp_gs.best_params_)

Fitting 5 folds for each of 48 candidates, totalling 240 fits
0.7663520151637849
{'activation': 'logistic', 'hidden_layer_sizes': 16, 'learning_rate': 'adaptive', 'max_iter': 1000, 'solver': 'adam'}


In [94]:
# Now we apply the best model to make prediction on the test set
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score

auc_test = roc_auc_score(y_test, best.predict_proba(X_test)[:, 1])
acc_test = accuracy_score(y_test, np.argmax(best.predict_proba(X_test), axis = 1))
auc_train = roc_auc_score(y_train, best.predict_proba(X_train)[:, 1])
acc_train = accuracy_score(y_train, np.argmax(best.predict_proba(X_train), axis = 1))

In [95]:
# put result in a Dataframe
d = {'MLPClassifier': [mlp_gs.best_score_, auc_train, acc_train, auc_test, acc_test]}
mlp_res = pd.DataFrame(data = d, index=["Cross Validation AUC", "Train AUC", "Train Accuracy", "Test AUC", "Test Accuracy"])
mlp_res

Unnamed: 0,MLPClassifier
Cross Validation AUC,0.766352
Train AUC,0.803335
Train Accuracy,0.826643
Test AUC,0.772058
Test Accuracy,0.814167


In [108]:
# additionally, let's try a 6th model: Random Forest
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier()

# scoring criteria: AUC & Accuracy
scoring = {"AUC": "roc_auc", "Accuracy": make_scorer(accuracy_score)}

param_rf = {
    'n_estimators':[100, 200, 500], # n trees
    'criterion': ['gini', 'entropy'],
    'max_depth':[3, 6]
}

# create gs object
rf_gs = GridSearchCV(
    estimator = rf,
    param_grid = param_rf,
    scoring = scoring,
    refit="AUC",
    cv = 5,
    n_jobs = 5,
    verbose = 2
)

# Fitting our GridSearchCV Object
rf_gs.fit(X_train, y_train)

# get results
results = rf_gs.cv_results_

# get best Model
best = rf_gs.best_estimator_

# print best results & parameter
print(rf_gs.best_score_)
print(rf_gs.best_params_)

Fitting 5 folds for each of 12 candidates, totalling 60 fits
0.7715180692871063
{'criterion': 'entropy', 'max_depth': 6, 'n_estimators': 500}


In [109]:
# Now we apply the best model to make prediction on the test set
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score

auc_test = roc_auc_score(y_test, best.predict_proba(X_test)[:, 1])
acc_test = accuracy_score(y_test, np.argmax(best.predict_proba(X_test), axis = 1))
auc_train = roc_auc_score(y_train, best.predict_proba(X_train)[:, 1])
acc_train = accuracy_score(y_train, np.argmax(best.predict_proba(X_train), axis = 1))

In [110]:
# put result in a Dataframe
d = {'RFClassifier': [rf_gs.best_score_, auc_train, acc_train, auc_test, acc_test]}
rf_res = pd.DataFrame(data = d, index=["Cross Validation AUC", "Train AUC", "Train Accuracy", "Test AUC", "Test Accuracy"])
rf_res

Unnamed: 0,RFClassifier
Cross Validation AUC,0.771518
Train AUC,0.795057
Train Accuracy,0.824143
Test AUC,0.77268
Test Accuracy,0.8135


In [111]:
# overview of all results
all_res = pd.concat([sgd_res, knn_res, svc_res, gbc_res, mlp_res, rf_res], axis = 1)
all_res

Unnamed: 0,SGDClassifier,KNNClassifier,SVMClassifier,GBClassifier,MLPClassifier,RFClassifier
Cross Validation AUC,0.716722,0.727119,0.743639,0.771089,0.766352,0.771518
Train AUC,0.716668,0.815307,0.786385,0.808317,0.803335,0.795057
Train Accuracy,0.801071,0.819143,0.822357,0.824,0.826643,0.824143
Test AUC,0.712928,0.728299,0.747674,0.774188,0.772058,0.77268
Test Accuracy,0.801,0.803833,0.816333,0.8165,0.814167,0.8135


#### 5) Conclusion

The benchmarking results show that Gradient Boosting and Random Forest perform best on this dataset. Their metrics are nearly identical: test AUCs of 0.7742 and 0.7727, and cross-validation AUCs of 0.7711 and 0.7715. The Neural Network follows closely with a test AUC of 0.7721.
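The ranking behind this conclusion can be reproduced from the results table. The sketch below retypes the cross-validation and test AUC figures from that table by hand (the variable names `auc_table`, `ranking`, and `top3_gap` are introduced here for illustration) and sorts the models by test AUC.

```python
import pandas as pd

# AUC figures copied from the overview table of all results
auc_table = pd.DataFrame(
    {
        "SGDClassifier": [0.716722, 0.712928],
        "KNNClassifier": [0.727119, 0.728299],
        "SVMClassifier": [0.743639, 0.747674],
        "GBClassifier": [0.771089, 0.774188],
        "MLPClassifier": [0.766352, 0.772058],
        "RFClassifier": [0.771518, 0.772680],
    },
    index=["Cross Validation AUC", "Test AUC"],
)

# one row per model, best test AUC first
ranking = auc_table.T.sort_values("Test AUC", ascending=False)
print(ranking)

# spread in test AUC between the first and third ranked models
top3_gap = ranking["Test AUC"].iloc[0] - ranking["Test AUC"].iloc[2]
print(f"top-3 spread in test AUC: {top3_gap:.5f}")
```

The spread across the top three models is about 0.002 AUC, which is why the choice between them is hard to justify on these metrics alone.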

Possible recommendations to further improve the project:
- Try different algorithms
- Try a more extensive search over hyperparameter combinations
- Evaluate the performance of the models on an independent test set
- Consider a hybrid model as well