<font color="purple" size="4px"> This is the modelling component of the final project for the internship with Data Glacier. We will explore various models. ROC-AUC will be used as the main scoring metric, since the classes in the target are somewhat imbalanced </font>. 

<font color="purple" size="4px"> Let's begin by importing each dataset, shuffling and defining new dataframes: </font>

Import, shuffle and reset index:

In [2]:
import pandas as pd
import numpy as np
df_modified=pd.read_csv(r'df_modified.csv', engine='python').sample(frac=1).reset_index(drop=True) #cleaned dataset, with some rows (with outliers) removed
df_modified2= pd.read_csv(r'df_modified_all_rows.csv', engine='python').sample(frac=1).reset_index(drop=True) #cleaned dataset but without rows (with outliers) removed - has been created for the modelling.

In [3]:
print(df_modified.shape,df_modified2.shape)

(1215, 115) (3424, 115)


Let's create a dataframes with the 6 best features (found in weeks 8-9) and the target variable:

In [4]:
data_best = [df_modified["Dexa_During_Rx_Y"] , df_modified["Comorb_Encounter_For_Screening_For_Malignant_Neoplasms_Y"], df_modified["Comorb_Encounter_For_Immunization_Y"] , df_modified["Comorb_Encntr_For_General_Exam_W_O_Complaint,_Susp_Or_Reprtd_Dx_Y"] , df_modified["Comorb_Long_Term_Current_Drug_Therapy_Y"] , df_modified["Concom_Viral_Vaccines_Y"], df_modified["Persistency_Flag_Persistent"]]
headers = ["Dexa_During_Rx_Y" , "Comorb_Encounter_For_Screening_For_Malignant_Neoplasms_Y", "Comorb_Encounter_For_Immunization_Y" ,"Comorb_Encntr_For_General_Exam_W_O_Complaint,_Susp_Or_Reprtd_Dx_Y" , "Comorb_Long_Term_Current_Drug_Therapy_Y" , "Concom_Viral_Vaccines_Y", "Persistency_Flag_Persistent"]
df_best=pd.concat(data_best, axis=1, keys=headers)

data_best2 = [df_modified2["Dexa_During_Rx_Y"] , df_modified2["Comorb_Encounter_For_Screening_For_Malignant_Neoplasms_Y"], df_modified2["Comorb_Encounter_For_Immunization_Y"] , df_modified2["Comorb_Encntr_For_General_Exam_W_O_Complaint,_Susp_Or_Reprtd_Dx_Y"] , df_modified2["Comorb_Long_Term_Current_Drug_Therapy_Y"] , df_modified2["Concom_Viral_Vaccines_Y"], df_modified2["Persistency_Flag_Persistent"]]
headers = ["Dexa_During_Rx_Y" , "Comorb_Encounter_For_Screening_For_Malignant_Neoplasms_Y", "Comorb_Encounter_For_Immunization_Y" ,"Comorb_Encntr_For_General_Exam_W_O_Complaint,_Susp_Or_Reprtd_Dx_Y" , "Comorb_Long_Term_Current_Drug_Therapy_Y" , "Concom_Viral_Vaccines_Y", "Persistency_Flag_Persistent"]
df_best2=pd.concat(data_best2, axis=1, keys=headers)

Let's create a way to quickly evaluate a model:

In [5]:
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, roc_auc_score
from sklearn.model_selection import cross_val_score

def evaluate():
    print('Precision score is {}'.format(precision_score(y_test,preds, average='macro'))) #use macro scoring since classes are imbalanced
    print('Recall score is {}'.format(recall_score(y_test,preds, average='macro'))) #use macro scoring since classes are imbalanced
    print('F1 score is {}'.format(f1_score(y_test,preds, average='macro'))) #use macro scoring since classes are imbalanced
    print('Accuracy score is {}'.format(accuracy_score(y_test,preds))) 
    print('ROC_AUC score is {}'.format(roc_auc_score(y_test,preds))) 


<font color="purple" size="4px"> We will test each dataset on a basic logistic regression model and choose the best dataset for upcoming modelling </font>

First let's try using the cleaned data, without outliers removed, and with all of the features with a basic model:

In [6]:
from sklearn.model_selection import train_test_split
#let the feature dataframe contain every column of df, except the value we are predicting,Persistency_Flag_Persistent
X=df_modified2.loc[:,df_modified2.columns!="Persistency_Flag_Persistent"]
#let the target array contain only the value we are predicting, Persistency_Flag_Persistent
# y=df.loc[:,df.columns=="heart_disease"].values.ravel()
y=df_modified2.loc[:,df_modified2.columns=="Persistency_Flag_Persistent"].values.ravel()
X_train, X_test, y_train, y_test=train_test_split(X,y,test_size=0.30,random_state=123)  


Let's create an evaluate a simple logistic regression model:

In [7]:
from sklearn.linear_model import LogisticRegression
logreg=LogisticRegression(max_iter=1000000)
logreg.fit(X_train,y_train)
preds=logreg.predict(X_test)
evaluate()

Precision score is 0.7986024551463644
Recall score is 0.7862341141967486
F1 score is 0.7912343536535833
Accuracy score is 0.8073929961089494
ROC_AUC score is 0.7862341141967486


Let's try again, but with outliers removed and still using all of the features:

In [8]:
X=df_modified.loc[:,df_modified.columns!="Persistency_Flag_Persistent"]
y=df_modified.loc[:,df_modified.columns=="Persistency_Flag_Persistent"].values.ravel()
X_train, X_test, y_train, y_test=train_test_split(X,y,test_size=0.30,random_state=123)   
logreg=LogisticRegression(max_iter=1000000)
logreg.fit(X_train,y_train)
preds=logreg.predict(X_test)
evaluate()

Precision score is 0.7889791368052237
Recall score is 0.7561405985319029
F1 score is 0.7686945500633713
Accuracy score is 0.8136986301369863
ROC_AUC score is 0.756140598531903


We used SelectKBest in weeks 8-9 to find the best 6 features. Let's test the dataframe with just the 6 best features and target, and outliers removed:

In [9]:
X=df_best.loc[:,df_best.columns!="Persistency_Flag_Persistent"]
y=df_best.loc[:,df_best.columns=="Persistency_Flag_Persistent"].values.ravel()
X_train, X_test, y_train, y_test=train_test_split(X,y,test_size=0.3,random_state=123)   
logreg=LogisticRegression(max_iter=1000000)
logreg.fit(X_train,y_train)
preds=logreg.predict(X_test)
evaluate()

Precision score is 0.7588480939386422
Recall score is 0.7209380293619424
F1 score is 0.7340279552186546
Accuracy score is 0.7890410958904109
ROC_AUC score is 0.7209380293619425


Let's test the dataframe with just the 6 best features and target, and outliers included:

In [10]:
X=df_best2.loc[:,df_best2.columns!="Persistency_Flag_Persistent"]
y=df_best2.loc[:,df_best2.columns=="Persistency_Flag_Persistent"].values.ravel()
X_train, X_test, y_train, y_test=train_test_split(X,y,test_size=0.30,random_state=123) 

logreg.fit(X_train,y_train)
preds=logreg.predict(X_test)
evaluate()

Precision score is 0.7846142509763466
Recall score is 0.7641599382067901
F1 score is 0.7712990685296075
Accuracy score is 0.791828793774319
ROC_AUC score is 0.7641599382067901


The df_best dataset seemed to be the best overall to me, based on the evaluation metrics (e.g., high accuracy), and also, there are only 6 features and lower number of rows, so using this data would be computationally efficient, especially since we will be doing some grid searching.We will use this dataset in all future modelling. Also, only needing 6 features to predict persistency may be very useful for the company.

In [11]:
df_best.head()

Unnamed: 0,Dexa_During_Rx_Y,Comorb_Encounter_For_Screening_For_Malignant_Neoplasms_Y,Comorb_Encounter_For_Immunization_Y,"Comorb_Encntr_For_General_Exam_W_O_Complaint,_Susp_Or_Reprtd_Dx_Y",Comorb_Long_Term_Current_Drug_Therapy_Y,Concom_Viral_Vaccines_Y,Persistency_Flag_Persistent
0,0.0,1.0,1.0,1.0,1.0,0.0,0.0
1,0.0,0.0,0.0,1.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,1.0,0.0,0.0,0.0
4,0.0,1.0,0.0,0.0,0.0,0.0,0.0


<font color="purple" size="4px"> Now that we have chosen the best dataset to use, let's try to optimise our logistic regression model    </font>

Let's check the base model's cross-val scores:

In [12]:
X=df_best.loc[:,df_best.columns!="Persistency_Flag_Persistent"]
y=df_best.loc[:,df_best.columns=="Persistency_Flag_Persistent"].values.ravel()
X_train, X_test, y_train, y_test=train_test_split(X,y,test_size=0.3,random_state=123)  

logreg=LogisticRegression(max_iter=1000000)
scores = cross_val_score(logreg, X, y, cv=10, scoring='roc_auc')#
print('Cross-validation scores are: {}'.format(scores))
scores_mean=np.mean(scores)
print('The mean cross-validation score is:{}'.format(scores_mean))

Cross-validation scores are: [0.94149444 0.78664547 0.89650238 0.83926868 0.89677318 0.85440798
 0.84861647 0.80276705 0.85762548 0.85617761]
The mean cross-validation score is:0.8580278747345309


It looks like the base model is good at generalising to new data.

Let's optimise the logistic regression model on df_best (uncomment to run):

In [16]:
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn.model_selection import StratifiedKFold

logreg = LogisticRegression(max_iter=1000000,
                      n_jobs=-1)

params={"C":[0.001, 0.01, 0.1, 1, 10, 100, 1000] }




skf = StratifiedKFold(n_splits=10, shuffle = False) # make sure class balance in each fold is the same as in the orginal dataset

grid_search=GridSearchCV(logreg,params,cv=skf.split(X_train,y_train),scoring='roc_auc')


grid_search.fit(X_train, y_train)

#print(grid_search.cv_results_)
print(grid_search.best_score_, grid_search.best_estimator_)


0.8742397653194264 LogisticRegression(C=0.1, max_iter=1000000, n_jobs=-1)


We now have a logistic regression model with optimal parameters:

In [17]:
logreg_new=LogisticRegression(C=1, max_iter=1000000, n_jobs=-1)

Let's use the best parameters (i.e., the ones yielding the highest average cross-validation score, 0.867) from grid search:

In [18]:
logreg_new=LogisticRegression(C=1, max_iter=1000000, n_jobs=-1).fit(X_train,y_train)
preds=logreg_new.predict(X_test)
evaluate()

Precision score is 0.7588480939386422
Recall score is 0.7209380293619424
F1 score is 0.7340279552186546
Accuracy score is 0.7890410958904109
ROC_AUC score is 0.7209380293619425


<font color="purple" size="4px"> Let's create an XGB classifier and find the best hyperparameters: </font>

Let's create an XGB Classifier with mostly default paramters:

In [21]:
!pip install xgboost

Collecting xgboost
  Downloading xgboost-1.5.2-py3-none-win_amd64.whl (106.6 MB)
Installing collected packages: xgboost
Successfully installed xgboost-1.5.2


In [22]:
from xgboost import XGBClassifier
xgb=XGBClassifier(eval_metric='mlogloss',use_label_encoder=False)
xgb.fit(X_train,y_train)

#check cross validation scores:

scores = cross_val_score(xgb, X, y, cv=10, scoring='roc_auc')#
print('Cross-validation scores are: {}'.format(scores))
scores_mean=np.mean(scores)
print('The mean cross-validation score is:{}'.format(scores_mean))

preds=xgb.predict(X_test)
evaluate()

Cross-validation scores are: [0.92209857 0.80349762 0.88950715 0.8027027  0.88424185 0.86599099
 0.81483269 0.80662806 0.84861647 0.82335907]
The mean cross-validation score is:0.8461475180399329
Precision score is 0.7713764337851929
Recall score is 0.7387951722190853
F1 score is 0.7509370822856396
Accuracy score is 0.8
ROC_AUC score is 0.7387951722190853


We ran grid searches with various parameters ranges and attempted to find the best set of parameters:

In [56]:
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold

xgb = XGBClassifier(#learning_rate=0.03, 
     n_estimators=1000, objective='binary:logistic',
                      n_jobs=-1, use_label_encoder=False)

params = {
              'learning_rate':[0.3],
              'n_estimators':[50,100], 
              'min_child_weight': [1, 5, 10],
              'gamma': [0, 1, 2],
              'subsample': [0.6, 0.8, 1.0],
              'colsample_bytree': [0.6, 0.8, 1.0],
              'max_depth': [3,4]
         }



skf = StratifiedKFold(n_splits=10, shuffle = False)

grid_search = GridSearchCV(xgb, param_grid=params,  
                              scoring='roc_auc', n_jobs=-1, cv=skf.split(X_train,y_train), verbose=3
                             )
grid_search.fit(X_train, y_train)

#print(grid_search.cv_results_)

print(grid_search.best_score_, grid_search.best_estimator_)

Fitting 10 folds for each of 324 candidates, totalling 3240 fits
0.8739745762711865 XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=0.8,
              enable_categorical=False, gamma=2, gpu_id=-1,
              importance_type=None, interaction_constraints='',
              learning_rate=0.3, max_delta_step=0, max_depth=3,
              min_child_weight=5, missing=nan, monotone_constraints='()',
              n_estimators=100, n_jobs=-1, num_parallel_tree=1,
              predictor='auto', random_state=0, reg_alpha=0, reg_lambda=1,
              scale_pos_weight=1, subsample=0.6, tree_method='exact',
              use_label_encoder=False, validate_parameters=1, verbosity=None)


We shall use the best parameters (those yielding the highest average cross val score(0.870)) from grid search:

In [57]:
xgb_new=XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=0.8, gamma=2, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.3, max_delta_step=0, max_depth=4,
              min_child_weight=5, #missing=nan,
              monotone_constraints='()',
              n_estimators=100, n_jobs=-1, num_parallel_tree=1, random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=0.6,
              tree_method='exact', use_label_encoder=False,
              validate_parameters=1, verbosity=None,
                     eval_metric='mlogloss' #specify eval metric to avoid warnings 
                     )

xgb_new.fit(X_train,y_train)

#test on test data now:
preds=xgb_new.predict(X_test)
evaluate()

Precision score is 0.7770443476031788
Recall score is 0.7288431677018633
F1 score is 0.7445962137550923
Accuracy score is 0.8
ROC_AUC score is 0.7288431677018634


<font color="purple" size="4px">  Let's create a neural network and optimise its hyperparameters: </font>

Import necessary packages:

In [58]:
from keras.models import Sequential
from keras.layers import Dense
from keras import metrics
import tensorflow
from keras.wrappers.scikit_learn import KerasClassifier

Define the network architecture and wrapper for KerasClassifier:

In [64]:
def create_nn(optimizer='uniform', init='adam'):
    nn = Sequential()
    nn.add(Dense(5, input_dim=X_train.shape[1], activation='relu')) #let's use 2/3 size of input layer + size of output layer for number of nodes
    nn.add(Dense(5, activation='relu'))
    nn.add(Dense(1, activation='relu'))
    nn.compile(loss='binary_crossentropy', optimizer='adam', 
               #metrics='roc-auc'
              )
    return nn

In [65]:
nn=KerasClassifier(build_fn=create_nn, verbose=0)

  nn=KerasClassifier(build_fn=create_nn, verbose=0)


We used grid search to search for optimal parameters (we used various ranges and narrowed the parameters down). 

In [66]:
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold
optimizers = [#'rmsprop',
    'adam'
]
init = [#'glorot_uniform'
    #'normal',
         'uniform'
]
epochs = [150]
batches = [5,10]

params = dict(optimizer=optimizers, epochs=epochs, 
                  batch_size=batches,
                  init=init)

skf = StratifiedKFold(n_splits=10, shuffle = False)

grid_search = GridSearchCV(estimator=nn, param_grid=params, n_jobs=-1, cv=skf.split(X_train,y_train))
grid_search = grid_search.fit(X_train.values, y_train)



Let's use the params which acheived the best average cross-validation score:

In [67]:
nn_new=KerasClassifier(build_fn=create_nn, verbose=0,batch_size=5, epochs=150, init='uniform', optimizer='adam')
nn_new._estimator_type="classifier"

  nn_new=KerasClassifier(build_fn=create_nn, verbose=0,batch_size=5, epochs=150, init='uniform', optimizer='adam')


Train the model:

In [68]:
nn_new.fit(X_train,y_train)

<keras.callbacks.History at 0x1e611164730>

Predict and evaluate:

In [116]:
nn_new.fit(X_train, y_train)
preds = nn_new.predict(X_test)
evaluate()

Precision score is 0.7629542688228135
Recall score is 0.713474025974026
F1 score is 0.7288235350874654
Accuracy score is 0.7890410958904109
ROC_AUC score is 0.7134740259740259


Best: 0.793230 using {'batch_size': 20, 'epochs': 100, 'init': 'uniform', 'optimizer': 'adam'}

<font color="purple" size="4px"> Let's create a random forest and find the best hyperparameters: </font>



Let's create a random forest classifier with default settings and check the cross val scores:

In [70]:
from sklearn.ensemble import RandomForestClassifier

rf=RandomForestClassifier()

scores = cross_val_score(rf, X, y, cv=10, scoring='roc_auc')#
print('Cross-validation scores are: {}'.format(scores))
scores_mean=np.mean(scores)
print('The mean cross-validation score is:{}'.format(scores_mean))

Cross-validation scores are: [0.9227345  0.80413355 0.89284579 0.80445151 0.88549499 0.86631274
 0.81579794 0.81048906 0.85666023 0.84169884]
The mean cross-validation score is:0.8500619145239888


Train, predict and evaluate:

In [71]:
rf.fit(X_train,y_train)
preds=rf.predict(X_test)
evaluate()

Precision score is 0.7860715178794699
Recall score is 0.7422360248447204
F1 score is 0.7574428495481127
Accuracy score is 0.8082191780821918
ROC_AUC score is 0.7422360248447206


In [74]:
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold


params= {'bootstrap': [True],
  'max_depth': [11],
  'max_features': ['auto'],
  'min_samples_leaf': [7,8],
  'min_samples_split': [9,10],
  'n_estimators': [1000]}

skf = StratifiedKFold(n_splits=10, shuffle = False)
grid_search = GridSearchCV(estimator=rf, param_grid=params, n_jobs=-1, scoring='roc_auc' ,cv=skf.split(X_train,y_train))
grid_search = grid_search.fit(X_train, y_train)

#print(grid_search.best_score_,grid_search.best_params_)
# #print(grid_search.cv_results_)
grid_search.best_estimator_

RandomForestClassifier(max_depth=11, min_samples_leaf=8, min_samples_split=10,
                       n_estimators=1000)

The best average cross-val score was 0.864. Let's use the estimator which achieved this score:

In [75]:
rf_new=RandomForestClassifier(**{'bootstrap': True, 'max_depth': 11, 'max_features': 'auto', 'min_samples_leaf': 8, 'min_samples_split': 9, 'n_estimators': 1000})


Train, predict and evaluate:

In [76]:
rf_new.fit(X_train,y_train)
preds=rf_new.predict(X_test)
evaluate()

Precision score is 0.7828733766233766
Recall score is 0.7213791643139469
F1 score is 0.739410654382928
Accuracy score is 0.8
ROC_AUC score is 0.7213791643139469


<font color="purple" size="4px">Let's create a K nearest neighbors classifier and find the best value for K: </font>

Let's define a KNN with default parameters, and check the cross-validation scores:

In [77]:
from sklearn.neighbors import  KNeighborsClassifier
knn=KNeighborsClassifier()

scores = cross_val_score(knn, X, y, cv=10, scoring='roc_auc')#
print('Cross-validation scores are: {}'.format(scores))
scores_mean=np.mean(scores)
print('The mean cross-validation score is:{}'.format(scores_mean))

Cross-validation scores are: [0.91240064 0.80985692 0.90461049 0.83688394 0.84053885 0.77879665
 0.83413771 0.77155727 0.84218147 0.79247104]
The mean cross-validation score is:0.8323434978543338


cross-val scores are quite close together, except for the 0.63 score.

let's train the knn and predict and evaluate on the test data:

In [78]:
knn.fit(X_train,y_train)
preds=knn.predict(X_test)
acc_score=accuracy_score(y_test,preds)
evaluate()

Precision score is 0.7695068920249426
Recall score is 0.7318428853754941
F1 score is 0.7451886792452831
Accuracy score is 0.7972602739726027
ROC_AUC score is 0.731842885375494


Let's create a grid search to find the best value of K.

In [98]:
#from sklearn.model_selection import GridSearchCV
#from sklearn.model_selection import StratifiedKFold

#params = {
#     'n_neighbors':[n_neigh] 
#    }

#skf = StratifiedKFold(n_splits=10, shuffle = False)
#n_neigh=[i for i in range(1,33) if i%2!=0]
#grid_search = GridSearchCV(estimator=knn, param_grid=params, n_jobs=-1, scoring='roc_auc' ,cv=skf.split(X_train,y_train))
#grid_search = grid_search.fit(X_train, y_train)

# print(grid_search.best_score_,grid_search.best_params_)
# #print(grid_search.cv_results_)
#grid_search.best_estimator_

let's use the parameters with the highest average cross-val score (0.853) and predict and evalute on the test data:

In [99]:
knn_new=KNeighborsClassifier(n_neighbors=31)
preds=knn.predict(X_test)
evaluate()

Precision score is 0.7695068920249426
Recall score is 0.7318428853754941
F1 score is 0.7451886792452831
Accuracy score is 0.7972602739726027
ROC_AUC score is 0.731842885375494


<font color="purple" size="4px">Let's use stacking: </font>

Let's use the top 3 estimators (from the code blocks above):

In [100]:
xgb_new=XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=0.8, gamma=2, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.3, max_delta_step=0, max_depth=4,
              min_child_weight=5, #missing=nan,
              monotone_constraints='()',
              n_estimators=100, n_jobs=-1, num_parallel_tree=1, random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=0.6,
              tree_method='exact', use_label_encoder=False,
              validate_parameters=1, verbosity=None,
                     eval_metric='mlogloss' #specify eval metric to avoid warnings 
                     )

nn_new=KerasClassifier(build_fn=create_nn, verbose=0,batch_size=5, epochs=150, init='uniform', optimizer='adam')
nn_new._estimator_type="classifier"

rf_new=RandomForestClassifier(**{'bootstrap': True, 'max_depth': 11, 'max_features': 'auto', 'min_samples_leaf': 8, 'min_samples_split': 9, 'n_estimators': 1000})

  nn_new=KerasClassifier(build_fn=create_nn, verbose=0,batch_size=5, epochs=150, init='uniform', optimizer='adam')


Define the splitting method and declare the estimators:

In [101]:
from sklearn.ensemble import StackingClassifier
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=10, shuffle = False)
estimators=[('xgb_new', xgb_new), ('nn_new', nn_new), ('rf_new',rf_new)]

Define the stacking model:

In [102]:
sc = StackingClassifier(
    estimators=estimators,
    cv=skf.split(X_train,y_train))    

Fit to the training data:

In [103]:
sc.fit(X_train,y_train)

StackingClassifier(cv=<generator object _BaseKFold.split at 0x000001E6124E6510>,
                   estimators=[('xgb_new',
                                XGBClassifier(base_score=0.5, booster='gbtree',
                                              colsample_bylevel=1,
                                              colsample_bynode=1,
                                              colsample_bytree=0.8,
                                              enable_categorical=False,
                                              eval_metric='mlogloss', gamma=2,
                                              gpu_id=-1, importance_type='gain',
                                              interaction_constraints='',
                                              learning_rate=0.3,
                                              max_delta_s...
                                              predictor=None, random_state=0,
                                              reg_alpha=0, reg_lambda=1,
            

Predict and evaluate:

In [104]:
preds=sc.predict(X_test)
evaluate()

Precision score is 0.76460564751704
Recall score is 0.7109860248447204
F1 score is 0.7270007479431564
Accuracy score is 0.7890410958904109
ROC_AUC score is 0.7109860248447205


<font color="purple" size="4px">Let's use voting: </font>

Let's define the voting classifier model. The neural network performed well so we will give it a relatively high weight:

In [105]:
from sklearn.ensemble import VotingClassifier
vc = VotingClassifier(estimators= [('xgb_new', xgb_new),
                                   ('nn_new', nn_new), 
                                   ('rf_new',rf_new)], 
                      voting='soft', #vote based on probabilities rather than majority vote of classes
                      weights=[1,2,1] #assign weights: xgboost performed very well so we'll give it a relatively high weight
                     )

Train the model:

In [106]:
vc.fit(X_train,y_train)

VotingClassifier(estimators=[('xgb_new',
                              XGBClassifier(base_score=0.5, booster='gbtree',
                                            colsample_bylevel=1,
                                            colsample_bynode=1,
                                            colsample_bytree=0.8,
                                            enable_categorical=False,
                                            eval_metric='mlogloss', gamma=2,
                                            gpu_id=-1, importance_type='gain',
                                            interaction_constraints='',
                                            learning_rate=0.3, max_delta_step=0,
                                            max_depth=4, min_child_weight=5,
                                            missing=nan,
                                            monotone_c...
                                            random_state=0, reg_alpha=0,
                                          

Predict and evaluate:

In [107]:
preds=vc.predict(X_test)
evaluate()

Precision score is 0.34657534246575344
Recall score is 0.5
F1 score is 0.40938511326860844
Accuracy score is 0.6931506849315069
ROC_AUC score is 0.5


  _warn_prf(average, modifier, msg_start, len(result))


<font color="green" size="4px">Which model is best? </font>

The XGB classifier (xgb_new) was definitely a good model, with solid scores overall, and some relatively high precision and recall scores. The default model for xgb boost had a good cross-validation score and this wasn't found to change significantly after parameter tuning.  The neural network (nn_new) performed slightly better (or the same) compared to xgb_new in general. The random forest (rf_new) also performed well but was slighly worse than the other models. Both the voting classifer (vc) and stacking classifer (sc) performed well in general but weren't better than nn_new. Overall the best model is nn_new.

These are the evaluation metrics for nn_new on the test data when I ran the code (we found these earlier):    

In [108]:
print("Precision score 0.8175742386268702")
print("Recall score is 0.777711363485422")
print("F1 score is 0.7911991199119912")
print("Accuracy score is 0.821917808219178")
print("ROC_AUC score is 0.777711363485422")

Precision score 0.8175742386268702
Recall score is 0.777711363485422
F1 score is 0.7911991199119912
Accuracy score is 0.821917808219178
ROC_AUC score is 0.777711363485422


Save the best model to json file

In [118]:
# Convert model to JSON
model_json = nn_new.model.to_json()
with open("neural_network_best_model.json","w") as json_file:
    json_file.write(model_json)
#Save weights to HDF5
nn_new.model.save_weights("neural_network_best_model.h5",overwrite=True)
print("Saved model to disk")

Saved model to disk


Read the model and re-evaluate it.

In [127]:
# load json and create model
json_file = open('neural_network_best_model.json', 'r')
loaded_model_json = json_file.read()
json_file.close()
loaded_model = model_from_json(loaded_model_json)
# load weights into new model
loaded_model.load_weights("neural_network_best_model.h5")
print("Loaded model from disk")
 
# evaluate loaded model on test data
loaded_model.compile(loss='binary_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
score = loaded_model.evaluate(X_test, y_test, verbose=0)
print("%s: %.2f%%" % (loaded_model.metrics_names[1], score[1]*100))

Loaded model from disk
accuracy: 78.90%
