


---



# Parameter Tuning

In this section, a few selected parameters of each model are tuned to give the best performance on this particular dataset.



---


## Parameter Tuning for Linear Model

In [None]:
# imports needed by this cell (`ct`, `kfold`, `X_train` and `y_train` are assumed to come from earlier cells)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# setting up preprocessing and the linear model in a pipeline
log_pipe_cv = Pipeline([('preprocessing', ct), ('log', LogisticRegression(random_state=42))])
# setting up the parameter grid for the grid search
log_param_grid = {'log__C': [0.0001, 0.001, 0.01, 0.1, 1, 10, 100],
                  'log__solver': ['lbfgs', 'liblinear', 'newton-cg']}
# running one grid search per metric and reporting the best parameters and score for each
metric_lst = ['accuracy', 'roc_auc', 'f1']
for metric in metric_lst:
  log_grid = GridSearchCV(log_pipe_cv, log_param_grid, scoring=metric, cv=kfold, refit=True)
  log_grid.fit(X_train, y_train)
  print("Best parameters({}):\n{}".format(metric, log_grid.best_params_))
  print("Best score({}):\n{}".format(metric, log_grid.best_score_))
  print('')

Best parameters(accuracy):
{'log__C': 1, 'log__solver': 'lbfgs'}
Best score(accuracy):
0.8888888888888888

Best parameters(roc_auc):
{'log__C': 0.0001, 'log__solver': 'lbfgs'}
Best score(roc_auc):
0.869815668202765

Best parameters(f1):
{'log__C': 1, 'log__solver': 'lbfgs'}
Best score(f1):
0.6495726495726495
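
The loop above runs a separate grid search for each metric. As a possible alternative, `GridSearchCV` accepts a list of scorers and evaluates every candidate on all of them in a single search. The sketch below is only an illustration under assumptions: it uses synthetic data from `make_classification` instead of the hepatitis data, a `StandardScaler` in place of the notebook's `ct` step, and a smaller grid; the names `X_demo`, `y_demo`, `demo_pipe`, `demo_grid`, `demo_kfold` and `multi_grid` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X_train / y_train (assumption: NOT the hepatitis data).
X_demo, y_demo = make_classification(n_samples=120, n_features=8,
                                     weights=[0.8, 0.2], random_state=42)

# StandardScaler stands in for the notebook's `ct` preprocessing step (assumption).
demo_pipe = Pipeline([('preprocessing', StandardScaler()),
                      ('log', LogisticRegression(random_state=42, max_iter=1000))])
demo_grid = {'log__C': [0.01, 0.1, 1, 10],
             'log__solver': ['lbfgs', 'liblinear']}
demo_kfold = KFold(n_splits=3, shuffle=True, random_state=42)

# One search scores every candidate on all three metrics. With several
# scorers, `refit` must name the metric used for best_params_ and the refit.
multi_grid = GridSearchCV(demo_pipe, demo_grid,
                          scoring=['accuracy', 'roc_auc', 'f1'],
                          refit='f1', cv=demo_kfold)
multi_grid.fit(X_demo, y_demo)

# cv_results_ holds one 'mean_test_<metric>' array per scorer.
results = multi_grid.cv_results_
for metric in ['accuracy', 'roc_auc', 'f1']:
    mean_scores = results['mean_test_{}'.format(metric)]
    best_idx = mean_scores.argmax()
    print("Best parameters({}):\n{}".format(metric, results['params'][best_idx]))
    print("Best score({}):\n{}".format(metric, mean_scores[best_idx]))
    print('')
```

Each candidate is fitted once per fold rather than once per fold and metric, which matters more for the larger grids later in this section.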



In [None]:
# testing the performance of the linear model using the parameters that returned the greatest f1 average (using cross validation on the training set)
from sklearn.model_selection import KFold, cross_val_score

log_pipe_cv = Pipeline([('preprocessing', ct), ('log', LogisticRegression(random_state=42, C=1, solver='lbfgs'))])
kfold = KFold(n_splits=3, shuffle=True, random_state=42)
# computing each set of fold scores once and reusing it for the average
acc_scores = cross_val_score(log_pipe_cv, X_train, y_train, cv=kfold)
auc_scores = cross_val_score(log_pipe_cv, X_train, y_train, cv=kfold, scoring='roc_auc')
f1_scores = cross_val_score(log_pipe_cv, X_train, y_train, cv=kfold, scoring='f1')
print("Cross-validation scores:\n{}".format(acc_scores))
print('Average score:\n{}'.format(acc_scores.mean()))
print("Cross-validation AUC:\n{}".format(auc_scores))
print('Average AUC:\n{}'.format(auc_scores.mean()))
print('F1 average:\n{}'.format(f1_scores.mean()))

Cross-validation scores:
[0.8974359 0.8974359 0.8974359]
Average score:
0.8974358974358975
Cross-validation AUC:
[0.81696429 0.89919355 0.82142857]
Average AUC:
0.8458621351766512
F1 average:
0.6666666666666666
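
The evaluation above calls `cross_val_score` once per metric. A possible shortcut is `cross_validate`, which fits the pipeline once per fold and returns all requested metrics together. This is a minimal sketch under assumptions: synthetic data replaces `X_train` and `y_train`, a `StandardScaler` replaces `ct`, and the names `X_demo`, `y_demo`, `demo_pipe`, `demo_kfold` and `cv_scores` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X_train / y_train (assumption: NOT the hepatitis data).
X_demo, y_demo = make_classification(n_samples=120, n_features=8,
                                     weights=[0.8, 0.2], random_state=42)

# StandardScaler stands in for the notebook's `ct` preprocessing step (assumption).
demo_pipe = Pipeline([('preprocessing', StandardScaler()),
                      ('log', LogisticRegression(random_state=42, C=1, solver='lbfgs'))])
demo_kfold = KFold(n_splits=3, shuffle=True, random_state=42)

# A single call returns a dict with one 'test_<metric>' array per scorer.
cv_scores = cross_validate(demo_pipe, X_demo, y_demo, cv=demo_kfold,
                           scoring=['accuracy', 'roc_auc', 'f1'])

print("Cross-validation scores:\n{}".format(cv_scores['test_accuracy']))
print("Average score:\n{}".format(cv_scores['test_accuracy'].mean()))
print("Cross-validation AUC:\n{}".format(cv_scores['test_roc_auc']))
print("Average AUC:\n{}".format(cv_scores['test_roc_auc'].mean()))
print("F1 average:\n{}".format(cv_scores['test_f1'].mean()))
```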


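Interestingly, the parameters that maximized accuracy and F1 for the linear model (C=1, solver='lbfgs') are scikit-learn's defaults; only ROC AUC favored a different setting (C=0.0001).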



---

## Parameter Tuning for Gradient Boosting Model

In [None]:
# Repeating grid search process for gradient boosting classifier
from xgboost import XGBClassifier

xgb_pipe_cv = Pipeline([('preprocessing', ct), ('xgb', XGBClassifier(random_state=42, n_jobs=-1))])
xgb_param_grid = {'xgb__max_depth': [2, 3, 4, 5],
                  'xgb__learning_rate': [0.0001, 0.001, 0.01, 0.1, 1, 10, 100],
                  'xgb__n_estimators': [50, 100, 200]}
# running one grid search per metric and reporting the best parameters and score for each
metric_lst = ['accuracy', 'roc_auc', 'f1']
for metric in metric_lst:
  xgb_grid = GridSearchCV(xgb_pipe_cv, xgb_param_grid, scoring=metric, cv=kfold, refit=True)
  xgb_grid.fit(X_train, y_train)
  print("Best parameters({}):\n{}".format(metric, xgb_grid.best_params_))
  print("Best score({}):\n{}".format(metric, xgb_grid.best_score_))
  print('')

Best parameters(accuracy):
{'xgb__learning_rate': 1, 'xgb__max_depth': 2, 'xgb__n_estimators': 100}
Best score(accuracy):
0.8632478632478633

Best parameters(roc_auc):
{'xgb__learning_rate': 0.1, 'xgb__max_depth': 3, 'xgb__n_estimators': 50}
Best score(roc_auc):
0.8784562211981567

Best parameters(f1):
{'xgb__learning_rate': 1, 'xgb__max_depth': 2, 'xgb__n_estimators': 100}
Best score(f1):
0.6071428571428571



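The gradient boosting model displayed comparatively poor accuracy (0.863) and F1 (0.607), the lowest best scores among the grid searches in this section. Its best ROC AUC (0.878), however, was the highest of the three models tuned on that metric.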



---

## Parameter Tuning for Neural Network

Note: to limit computational expense, the grid search for the neural network was run with F1 scoring only. Neural networks are comparatively expensive to train, and this parameter grid is the largest in the section.

In [None]:
# Repeating grid search process for multilayer perceptron
from sklearn.neural_network import MLPClassifier

# (50,) is a one-element tuple, i.e. a single hidden layer of 50 units; (50) would be the plain integer 50
mlp_param_grid = {'mlp__solver': ['adam', 'lbfgs'],
                  'mlp__hidden_layer_sizes': [(50,), (50, 50), (100,), (100, 100), (100, 100, 50, 25)],
                  'mlp__activation': ['relu', 'logistic'],
                  'mlp__alpha': [0.00001, 0.0001, 0.001, 0.01, 0.1, 1, 10],
                  'mlp__batch_size': [10, 20, 30, 40]
                  }
mlp_pipe_cv = Pipeline([('preprocessing', ct), ('mlp', MLPClassifier(max_iter=3000, random_state=42))])
mlp_grid = GridSearchCV(mlp_pipe_cv, mlp_param_grid, scoring='f1', cv=kfold, refit=True)
mlp_grid.fit(X_train, y_train)
print("Best parameters(f1):\n{}".format(mlp_grid.best_params_))
print("Best score(f1):\n{}".format(mlp_grid.best_score_))
print('')

Best parameters(f1):
{'mlp__activation': 'relu', 'mlp__alpha': 1e-05, 'mlp__batch_size': 30, 'mlp__hidden_layer_sizes': (100,), 'mlp__solver': 'adam'}
Best score(f1):
0.7039627039627039
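
To put the computational cost mentioned in the note above into numbers, the size of the grid can be counted with scikit-learn's `ParameterGrid`. This is a small sketch that copies the grid from the cell above; the three-fold count is an assumption based on the `KFold(n_splits=3, ...)` used in the evaluation cells, and the names `n_candidates`, `n_splits` and `n_fits` are hypothetical.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the neural network cell above.
mlp_param_grid = {'mlp__solver': ['adam', 'lbfgs'],
                  'mlp__hidden_layer_sizes': [(50,), (50, 50), (100,), (100, 100), (100, 100, 50, 25)],
                  'mlp__activation': ['relu', 'logistic'],
                  'mlp__alpha': [0.00001, 0.0001, 0.001, 0.01, 0.1, 1, 10],
                  'mlp__batch_size': [10, 20, 30, 40]}

# Number of parameter combinations: 2 * 5 * 2 * 7 * 4.
n_candidates = len(ParameterGrid(mlp_param_grid))

# Assumption: three folds, as in KFold(n_splits=3) elsewhere in this notebook.
n_splits = 3
n_fits = n_candidates * n_splits

print("Candidates: {}".format(n_candidates))   # Candidates: 560
print("Fits per metric: {}".format(n_fits))    # Fits per metric: 1680
```

Repeating the search for all three metrics, as was done for the other models, would triple the number of fits.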



In [None]:
# testing the performance of the neural network using the parameters that returned the greatest f1 average (using cross validation on the training set)
# batch_size is set to 30 to match the best parameters reported by the grid search
mlp_pipe_cv = Pipeline([('preprocessing', ct), ('mlp', MLPClassifier(max_iter=10000, random_state=42, activation='relu', alpha=0.00001, batch_size=30, hidden_layer_sizes=(100,), solver='adam'))])
kfold = KFold(n_splits=3, shuffle=True, random_state=42)
# computing each set of fold scores once and reusing it for the average
acc_scores = cross_val_score(mlp_pipe_cv, X_train, y_train, cv=kfold)
auc_scores = cross_val_score(mlp_pipe_cv, X_train, y_train, cv=kfold, scoring='roc_auc')
f1_scores = cross_val_score(mlp_pipe_cv, X_train, y_train, cv=kfold, scoring='f1')
print("Cross-validation scores:\n{}".format(acc_scores))
print('Average score:\n{}'.format(acc_scores.mean()))
print("Cross-validation AUC:\n{}".format(auc_scores))
print('Average AUC:\n{}'.format(auc_scores.mean()))
print('F1 average:\n{}'.format(f1_scores.mean()))


Cross-validation scores:
[0.92307692 0.87179487 0.92307692]
Average score:
0.905982905982906
Cross-validation AUC:
[0.75       0.81854839 0.84821429]
Average AUC:
0.8055875576036865
F1 average:
0.7039627039627039


The tuned neural network does not appear to perform any differently from the default configuration. This is not surprising, since most of the values the grid search returned (relu activation, adam solver, a single hidden layer of 100 units) are the same as the defaults; only alpha and batch_size differ.
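
One way to check which of the returned values actually differ from the defaults is to compare them against `MLPClassifier().get_params()`. The sketch below copies the best parameters from the grid search output above; the names `best_params`, `defaults` and `changed` are hypothetical, and the defaults shown are those of the installed scikit-learn version.

```python
from sklearn.neural_network import MLPClassifier

# Best parameters as printed by the grid search above.
best_params = {'mlp__activation': 'relu', 'mlp__alpha': 1e-05,
               'mlp__batch_size': 30, 'mlp__hidden_layer_sizes': (100,),
               'mlp__solver': 'adam'}

# Default constructor arguments of an untouched MLPClassifier.
defaults = MLPClassifier().get_params()

# Keep only the parameters whose tuned value differs from the default,
# stripping the 'mlp__' pipeline prefix to look them up.
changed = {}
for prefixed_name, tuned_value in best_params.items():
    name = prefixed_name.split('__', 1)[1]
    if defaults[name] != tuned_value:
        changed[name] = (defaults[name], tuned_value)

for name, (default_value, tuned_value) in changed.items():
    print("{}: default={!r}, tuned={!r}".format(name, default_value, tuned_value))
```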



---

## Parameter Tuning for SVM

In [None]:
# Repeating grid search process for support vector machine
from sklearn.svm import SVC

# gamma only applies to the rbf, poly and sigmoid kernels, so the linear kernel gets its own grid
svc_param_grid = [{'svm__C': [0.001, 0.01, 0.1, 1, 10, 100],
                   'svm__kernel': ['rbf', 'poly', 'sigmoid'],
                   'svm__gamma': [0.001, 0.01, 0.1, 1, 10, 100, 'auto', 'scale']},
                  {'svm__C': [0.001, 0.01, 0.1, 1, 10, 100],
                   'svm__kernel': ['linear']
                  }]
svc_pipe_cv = Pipeline([('preprocessing', ct), ('svm', SVC(random_state=42))])
# running one grid search per metric and reporting the best parameters and score for each
metric_lst = ['accuracy', 'roc_auc', 'f1']
for metric in metric_lst:
  svc_grid = GridSearchCV(svc_pipe_cv, svc_param_grid, scoring=metric, cv=kfold, refit=True)
  svc_grid.fit(X_train, y_train)
  print("Best parameters({}):\n{}".format(metric, svc_grid.best_params_))
  print("Best score({}):\n{}".format(metric, svc_grid.best_score_))
  print('')

Best parameters(accuracy):
{'svm__C': 1, 'svm__gamma': 0.1, 'svm__kernel': 'rbf'}
Best score(accuracy):
0.8974358974358975

Best parameters(roc_auc):
{'svm__C': 1, 'svm__gamma': 1, 'svm__kernel': 'rbf'}
Best score(roc_auc):
0.8590629800307218

Best parameters(f1):
{'svm__C': 100, 'svm__gamma': 0.01, 'svm__kernel': 'sigmoid'}
Best score(f1):
0.6868686868686869



In [None]:
# testing the performance of the support vector machine using the parameters that returned the greatest f1 average (using cross validation on the training set)
svc_pipe_cv = Pipeline([('preprocessing', ct), ('svm', SVC(C=100, gamma=0.01, kernel='sigmoid', random_state=42))])
kfold = KFold(n_splits=3, shuffle=True, random_state=42)
# computing each set of fold scores once and reusing it for the average
acc_scores = cross_val_score(svc_pipe_cv, X_train, y_train, cv=kfold)
auc_scores = cross_val_score(svc_pipe_cv, X_train, y_train, cv=kfold, scoring='roc_auc')
f1_scores = cross_val_score(svc_pipe_cv, X_train, y_train, cv=kfold, scoring='f1')
print("Cross-validation scores:\n{}".format(acc_scores))
print('Average score:\n{}'.format(acc_scores.mean()))
print("Cross-validation AUC:\n{}".format(auc_scores))
print('Average AUC:\n{}'.format(auc_scores.mean()))
print('F1 average:\n{}'.format(f1_scores.mean()))

Cross-validation scores:
[0.8974359  0.87179487 0.92307692]
Average score:
0.8974358974358975
Cross-validation AUC:
[0.80803571 0.8266129  0.86607143]
Average AUC:
0.8335733486943164
F1 average:
0.6868686868686869


The SVC displayed significant improvement with the selection of optimal parameters (C=100, gamma=0.01, sigmoid kernel), especially in its F1 score, which is the most important metric for this task.
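
As a recap, the best grid-search scores printed in this section can be lined up per metric. The sketch below only copies those numbers (rounded to four decimals) into a dictionary and ranks them; the names `best_scores` and `leaders` are hypothetical, and the comparison assumes all searches used the same folds, which the section does not confirm.

```python
# Best grid-search scores reported above, rounded to four decimals.
# The neural network was searched with F1 scoring only.
best_scores = {
    'logistic regression': {'accuracy': 0.8889, 'roc_auc': 0.8698, 'f1': 0.6496},
    'gradient boosting':   {'accuracy': 0.8632, 'roc_auc': 0.8785, 'f1': 0.6071},
    'neural network':      {'f1': 0.7040},
    'SVM':                 {'accuracy': 0.8974, 'roc_auc': 0.8591, 'f1': 0.6869},
}

leaders = {}
for metric in ['accuracy', 'roc_auc', 'f1']:
    # Rank the models that were actually scored on this metric, best first.
    ranked = sorted(((scores[metric], name)
                     for name, scores in best_scores.items() if metric in scores),
                    reverse=True)
    leaders[metric] = ranked[0][1]
    print(metric)
    for score, name in ranked:
        print("  {:<20} {:.4f}".format(name, score))

print(leaders)
# {'accuracy': 'SVM', 'roc_auc': 'gradient boosting', 'f1': 'neural network'}
```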