# MS RCC Biomarker Hyperparameter Tuning

Author: Olatomiwa Bifarin<br>
Department of Biochemistry and Molecular Biology<br>
University of Georgia<br>
Edison Lab<br>

Last edited: 03NOV2020 

**Notes**: Grid-search hyperparameter tuning of four classifiers (random forest, SVM-RBF, linear SVM, kNN) on the top 10 discriminating metabolites (MS features) in the study.


<a id="0"></a>

## Notebook Content

1.  [Grid Search: Random Forest](#1)
2.  [Grid Search: SVM-RBF](#2)
3.  [Grid Search: Lin-SVM](#3)
4.  [Grid Search: kNN](#4)

In [7]:
# Global seed
import random  
random.seed(42)

#import os
#os.environ['PYTHONHASHSEED']=str(42)

import pandas as pd
import numpy as np
np.random.seed(42)


# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

# Sharper, more legible graphics
%config InlineBackend.figure_format = 'retina'

# scikit-learn modules
from sklearn.ensemble import RandomForestClassifier
from sklearn import svm
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
from sklearn import model_selection
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold

In [8]:
#import model cohort
modelcohort = pd.read_excel('data/modelcohort.xlsx', index_col=0)

NMRMS = modelcohort.drop(['Sample ID', 'Patient ID', 'Collection', 'Gender',
                         'Race', 'BMI', 'Smoker', 'Age'], axis=1)


NMRMS.rename(columns={720: '2-Phenylacetamide', 1481: 'Lys-Ile',
                      2102: 'Dibutylamine', 3141: 'm/z 343.11',
                      3675: 'm/z 87.06', 3804: 'Tromethamine phosphate',
                      3872: 'm/z 973.6', 4080: 'm/z 406.05',
                      6261: 'm/z 314.12', 6262: '2-Hydroxyhippuric acid/mannitol'},
             inplace=True)

In [9]:
final_features = {720, 1481, 2102, 3141, 3675, 3804, 3872, 4080, 6261, 6262}
final_features_ID = {'2-Phenylacetamide', 'Lys-Ile','Dibutylamine', 'm/z 343.11',
                     'm/z 87.06','Tromethamine phosphate', 'm/z 973.6','m/z 406.05',
                     'm/z 314.12','2-Hydroxyhippuric acid/mannitol'}

In [10]:
final_features

{720, 1481, 2102, 3141, 3675, 3804, 3872, 4080, 6261, 6262}

In [11]:
# Import MS_labels
MS_labels = pd.read_excel('data/MS_labels.xlsx', index_col=0)

In [12]:
MS_labels[MS_labels.ID.isin(final_features)]

| | ID | Mode | RT [min] | Name | Formula |
|---|---|---|---|---|---|
| 719 | 720 | positive | 2.562 | 2-Aminoacetophenone;O-Acetylaniline | C8 H9 N O |
| 1480 | 1481 | positive | 6.29 | 1481 | |
| 2101 | 2102 | positive | 3.449 | N,N-Diisopropylethylamine (DIPEA) | C8 H19 N |
| 3140 | 3141 | positive | 1.133 | 3141 | C7 H18 N8 O6 S |
| 3674 | 3675 | positive | 1.184 | 3675 | |
| 3803 | 3804 | positive | 2.595 | 3804 | C4 H12 N O6 P |
| 3871 | 3872 | positive | 4.049 | 3872 | |
| 4079 | 4080 | positive | 0.821 | 4080 | C10 H21 N3 O8 P2 S |
| 6260 | 6261 | negative | 2.591 | 6261 | C9 H18 N9 O2 P |
| 6261 | 6262 | negative | 2.667 | 6262 | C10 H20 N9 O5 P |


In [13]:
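# Sorted column order keeps the feature matrix reproducible (set iteration order can vary)
MLfeatures = NMRMS[sorted(final_features_ID)]
# Autoscaling: mean-center each feature and divide by its sample standard deviation
MLfeatures = (MLfeatures - MLfeatures.mean(axis=0)) / MLfeatures.std(axis=0)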
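
As a quick check of the autoscaling step, the sketch below applies the same formula to a small made-up table (the column names and values are illustrative, not study data). It should show that each scaled column ends up with mean 0 and sample standard deviation 1. Note that pandas `std` defaults to `ddof=1`, whereas NumPy's `std` defaults to `ddof=0`.

```python
import numpy as np
import pandas as pd

# Toy data: hypothetical feature names and values, for illustration only
toy = pd.DataFrame({
    "feat_a": [1.0, 2.0, 3.0, 4.0],
    "feat_b": [10.0, 20.0, 40.0, 50.0],
})

# Same autoscaling formula as above: mean-center, divide by the sample std (ddof=1)
scaled = (toy - toy.mean(axis=0)) / toy.std(axis=0)

col_means = scaled.mean(axis=0).to_numpy()
col_stds_sample = scaled.std(axis=0).to_numpy()      # pandas default: ddof=1
col_stds_population = scaled.to_numpy().std(axis=0)  # NumPy default: ddof=0

print(col_means)            # approximately 0 for every column
print(col_stds_sample)      # approximately 1 for every column
print(col_stds_population)  # slightly below 1, because of the ddof difference
```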

Define features and labels.

In [14]:
dfgrp = NMRMS.filter(['Groups'], axis=1)
#convert strings (RCC, Control) to integers
dfgroup = dfgrp['Groups'].map({'Control': 0, 'RCC': 1}) 
X = MLfeatures.values
y = dfgroup.values
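
`Series.map` with a dictionary returns `NaN` for any value that is not a key, so a typo or an unexpected group name would silently produce missing labels. A minimal sketch of a guard, using a made-up `Groups` column (the values below are illustrative only):

```python
import pandas as pd

# Hypothetical labels; 'rcc' (lower case) is deliberately not in the mapping
groups = pd.Series(["Control", "RCC", "rcc", "Control"], name="Groups")
label_map = {"Control": 0, "RCC": 1}

encoded = groups.map(label_map)

# Values missing from the mapping come back as NaN
unmapped = groups[encoded.isna()].unique().tolist()
print(unmapped)  # ['rcc']
```

Checking that `unmapped` is empty before building `y` would catch such cases early.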

### Grid Search: Random Forest
<a id="1"></a>

[Method Reference: towardsdatascience.com](https://towardsdatascience.com/hyperparameter-tuning-the-random-forest-in-python-using-scikit-learn-28d2aa77dd74) <br>
[GridSearchCV sklearn Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)

In [15]:
from sklearn.model_selection import GridSearchCV
# Create the parameter grid based on the results of random search 
param_grid = {
    'bootstrap': [True],
    'max_depth': [10, 20, 30],
    'max_features': ['auto', 'sqrt', 'log2'],  # 'auto' equals 'sqrt' for classifiers; removed in scikit-learn 1.3
    'min_samples_leaf': [1, 2, 3, 4, 5],
    'min_samples_split': [2, 4, 6, 8],
    'n_estimators': [50, 100, 150, 200]
}
# Create a base model
rf = RandomForestClassifier(random_state=42)

# Stratified 5-fold CV. Without shuffling the folds are deterministic, so no seed is
# needed (recent scikit-learn versions reject random_state unless shuffle=True).
rsk = model_selection.StratifiedKFold(n_splits=5)

# Instantiate the grid search model
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid,
                           cv=rsk, scoring='accuracy', n_jobs=4, verbose=2)

In [16]:
# Fit the grid search to the data
grid_search.fit(X, y)
grid_search.best_params_

Fitting 5 folds for each of 720 candidates, totalling 3600 fits


[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  33 tasks      | elapsed:    2.3s
[Parallel(n_jobs=4)]: Done 154 tasks      | elapsed:    7.2s
[Parallel(n_jobs=4)]: Done 357 tasks      | elapsed:   14.9s
[Parallel(n_jobs=4)]: Done 640 tasks      | elapsed:   26.3s
[Parallel(n_jobs=4)]: Done 1005 tasks      | elapsed:   41.3s
[Parallel(n_jobs=4)]: Done 1450 tasks      | elapsed:  1.0min
[Parallel(n_jobs=4)]: Done 1977 tasks      | elapsed:  1.4min
[Parallel(n_jobs=4)]: Done 2584 tasks      | elapsed:  1.7min
[Parallel(n_jobs=4)]: Done 3273 tasks      | elapsed:  2.2min
[Parallel(n_jobs=4)]: Done 3600 out of 3600 | elapsed:  2.4min finished


{'bootstrap': True,
 'max_depth': 10,
 'max_features': 'auto',
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'n_estimators': 100}
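
The "720 candidates, totalling 3600 fits" in the log follows directly from the grid: the number of candidates is the product of the lengths of all value lists, and each candidate is fitted once per fold. A small sketch of that arithmetic using only the standard library:

```python
from itertools import product

# Same grid as the random forest search above
param_grid = {
    'bootstrap': [True],
    'max_depth': [10, 20, 30],
    'max_features': ['auto', 'sqrt', 'log2'],
    'min_samples_leaf': [1, 2, 3, 4, 5],
    'min_samples_split': [2, 4, 6, 8],
    'n_estimators': [50, 100, 150, 200],
}
n_splits = 5

# One candidate per element of the Cartesian product of all value lists
candidates = list(product(*param_grid.values()))
n_candidates = len(candidates)
n_fits = n_candidates * n_splits

print(n_candidates, n_fits)  # 720 3600
```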

In [17]:
grid_search.best_score_

0.9371794871794872
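
`best_score_` is the mean of the per-fold accuracies of the winning candidate. The sketch below recomputes it from the rounded fold scores in the `cv_results_` table of this section, and converts the scores to simple fractions. The fractions are an inference from the printed values, not something the notebook reports: they are consistent with test folds of about 12 to 13 samples, so a single misclassification moves a fold's accuracy by roughly 8 percentage points.

```python
from fractions import Fraction
from statistics import mean

# Per-fold accuracies of the best random forest candidate (row 1 of cv_results_)
fold_scores = [1.0, 0.769231, 1.0, 0.916667, 1.0]

# best_score_ is the unweighted mean of the fold scores
mean_score = mean(fold_scores)
print(round(mean_score, 4))  # 0.9372

# Recover simple fractions to see how few samples each fold holds
as_fractions = [Fraction(s).limit_denominator(20) for s in fold_scores]
print(as_fractions)
```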

In [18]:
pd.DataFrame(grid_search.cv_results_)

| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_bootstrap | param_max_depth | param_max_features | param_min_samples_leaf | param_min_samples_split | param_n_estimators | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.054526 | 0.000675 | 0.004745 | 0.001335 | True | 10 | auto | 1 | 2 | 50 | `{'bootstrap': True, 'max_depth': 10, 'max_feat...` | 1.0 | 0.692308 | 1.0 | 0.916667 | 1.000000 | 0.921795 | 0.119196 | 208 |
| 1 | 0.112083 | 0.006238 | 0.008264 | 0.000863 | True | 10 | auto | 1 | 2 | 100 | `{'bootstrap': True, 'max_depth': 10, 'max_feat...` | 1.0 | 0.769231 | 1.0 | 0.916667 | 1.000000 | 0.937179 | 0.089963 | 1 |
| 2 | 0.196593 | 0.010366 | 0.010946 | 0.000375 | True | 10 | auto | 1 | 2 | 150 | `{'bootstrap': True, 'max_depth': 10, 'max_feat...` | 1.0 | 0.769231 | 1.0 | 0.916667 | 1.000000 | 0.937179 | 0.089963 | 1 |
| 3 | 0.221959 | 0.005444 | 0.014038 | 0.000187 | True | 10 | auto | 1 | 2 | 200 | `{'bootstrap': True, 'max_depth': 10, 'max_feat...` | 1.0 | 0.769231 | 1.0 | 0.916667 | 1.000000 | 0.937179 | 0.089963 | 1 |
| 4 | 0.055478 | 0.002615 | 0.003804 | 0.000029 | True | 10 | auto | 1 | 4 | 50 | `{'bootstrap': True, 'max_depth': 10, 'max_feat...` | 1.0 | 0.769231 | 1.0 | 0.916667 | 1.000000 | 0.937179 | 0.089963 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 715 | 0.258317 | 0.012334 | 0.017238 | 0.002927 | True | 30 | log2 | 5 | 6 | 200 | `{'bootstrap': True, 'max_depth': 30, 'max_feat...` | 1.0 | 0.769231 | 1.0 | 0.916667 | 0.916667 | 0.920513 | 0.084324 | 217 |
| 716 | 0.059146 | 0.001557 | 0.003912 | 0.000055 | True | 30 | log2 | 5 | 8 | 50 | `{'bootstrap': True, 'max_depth': 30, 'max_feat...` | 1.0 | 0.769231 | 1.0 | 0.916667 | 1.000000 | 0.937179 | 0.089963 | 1 |
| 717 | 0.136424 | 0.008772 | 0.009492 | 0.001663 | True | 30 | log2 | 5 | 8 | 100 | `{'bootstrap': True, 'max_depth': 30, 'max_feat...` | 1.0 | 0.769231 | 1.0 | 0.916667 | 1.000000 | 0.937179 | 0.089963 | 1 |
| 718 | 0.179081 | 0.010589 | 0.011601 | 0.001403 | True | 30 | log2 | 5 | 8 | 150 | `{'bootstrap': True, 'max_depth': 30, 'max_feat...` | 1.0 | 0.769231 | 1.0 | 0.916667 | 0.916667 | 0.920513 | 0.084324 | 217 |
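
Many candidates in the table share rank 1, and the reported `best_params_` corresponds to row 1, the first of them in grid order (row 0 is ranked 208). This suggests that ties are resolved in favor of the earliest candidate. The sketch below shows one way to condense `cv_results_` and count such ties; it runs on synthetic data from `make_classification` with a deliberately tiny grid, so its numbers say nothing about the study cohort.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in data (not the study cohort)
X_toy, y_toy = make_classification(n_samples=60, n_features=10, random_state=0)

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={'n_estimators': [10, 20], 'max_depth': [2, 4]},
    cv=StratifiedKFold(n_splits=5),
    scoring='accuracy',
)
search.fit(X_toy, y_toy)

results = pd.DataFrame(search.cv_results_)

# Keep only the columns needed to compare candidates, best rank first
summary = (
    results[['params', 'mean_test_score', 'std_test_score', 'rank_test_score']]
    .sort_values('rank_test_score', kind='stable')
    .reset_index(drop=True)
)

# Number of candidates that share the top rank
n_tied_for_best = int((results['rank_test_score'] == 1).sum())

print(summary)
print(n_tied_for_best)
```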


### Grid Search: SVM-RBF
<a id="2"></a>

In [19]:
from sklearn.model_selection import GridSearchCV

param_grid = {'kernel': ['rbf'], 'C': [0.1, 1, 10, 100],
              'gamma': [0.01, 0.03, 0.1, 0.3, 1.0]}

svm_cls = svm.SVC(random_state=42)

# Stratified 5-fold CV. Without shuffling the folds are deterministic, so no seed is
# needed (recent scikit-learn versions reject random_state unless shuffle=True).
rsk = model_selection.StratifiedKFold(n_splits=5)

grid_search = GridSearchCV(svm_cls, param_grid, cv=rsk, scoring='accuracy', verbose=2, n_jobs=4)

In [20]:
# Fit the grid search to the data
grid_search.fit(X, y)
grid_search.best_params_

[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.


Fitting 5 folds for each of 20 candidates, totalling 100 fits


[Parallel(n_jobs=4)]: Done  93 tasks      | elapsed:    0.1s
[Parallel(n_jobs=4)]: Done 100 out of 100 | elapsed:    0.1s finished


{'C': 10, 'gamma': 0.1, 'kernel': 'rbf'}

In [21]:
grid_search.best_score_

0.9512820512820512
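
The features in this notebook are autoscaled once on the full data set before cross-validation, so every validation fold contributes to the scaling statistics. One possible refinement, sketched here on synthetic data, is to place the scaler inside a `Pipeline` so that it is refit on the training part of each fold. This is an assumption about an alternative setup, not the procedure used above; note also that `StandardScaler` divides by the population standard deviation (`ddof=0`), which differs slightly from the pandas autoscaling.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (not the study cohort)
X_toy, y_toy = make_classification(n_samples=60, n_features=10, random_state=0)

pipe = Pipeline([
    ('scale', StandardScaler()),  # refit on the training part of every fold
    ('svc', SVC(kernel='rbf')),
])

# Pipeline parameters are addressed as <step name>__<parameter>
param_grid = {
    'svc__C': [0.1, 1, 10, 100],
    'svc__gamma': [0.01, 0.03, 0.1, 0.3, 1.0],
}

search = GridSearchCV(pipe, param_grid, cv=StratifiedKFold(n_splits=5),
                      scoring='accuracy')
search.fit(X_toy, y_toy)

print(search.best_params_)
print(search.best_score_)
```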

In [22]:
pd.DataFrame(grid_search.cv_results_)

| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_C | param_gamma | param_kernel | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.000889 | 0.000115 | 0.000322 | 1.9e-05 | 0.1 | 0.01 | rbf | `{'C': 0.1, 'gamma': 0.01, 'kernel': 'rbf'}` | 0.461538 | 0.461538 | 0.916667 | 0.916667 | 0.75 | 0.701282 | 0.204992 | 19 |
| 1 | 0.000781 | 4.5e-05 | 0.000327 | 9.5e-05 | 0.1 | 0.03 | rbf | `{'C': 0.1, 'gamma': 0.03, 'kernel': 'rbf'}` | 0.923077 | 0.692308 | 0.916667 | 0.833333 | 0.833333 | 0.839744 | 0.083284 | 17 |
| 2 | 0.000852 | 0.000113 | 0.000337 | 5.3e-05 | 0.1 | 0.1 | rbf | `{'C': 0.1, 'gamma': 0.1, 'kernel': 'rbf'}` | 0.846154 | 0.769231 | 1.0 | 0.916667 | 1.0 | 0.90641 | 0.089524 | 6 |
| 3 | 0.000874 | 0.000149 | 0.000355 | 7.6e-05 | 0.1 | 0.3 | rbf | `{'C': 0.1, 'gamma': 0.3, 'kernel': 'rbf'}` | 0.461538 | 0.692308 | 0.833333 | 0.916667 | 0.833333 | 0.747436 | 0.160108 | 18 |
| 4 | 0.001137 | 0.00047 | 0.000332 | 3.7e-05 | 0.1 | 1.0 | rbf | `{'C': 0.1, 'gamma': 1.0, 'kernel': 'rbf'}` | 0.461538 | 0.461538 | 0.833333 | 0.833333 | 0.666667 | 0.651282 | 0.16645 | 20 |
| 5 | 0.000789 | 8.9e-05 | 0.000291 | 1.1e-05 | 1.0 | 0.01 | rbf | `{'C': 1, 'gamma': 0.01, 'kernel': 'rbf'}` | 0.923077 | 0.692308 | 1.0 | 0.833333 | 0.916667 | 0.873077 | 0.104658 | 13 |
| 6 | 0.000939 | 0.000131 | 0.000395 | 6.2e-05 | 1.0 | 0.03 | rbf | `{'C': 1, 'gamma': 0.03, 'kernel': 'rbf'}` | 0.923077 | 0.769231 | 1.0 | 0.916667 | 0.916667 | 0.905128 | 0.07491 | 7 |
| 7 | 0.001086 | 0.000735 | 0.000394 | 0.000127 | 1.0 | 0.1 | rbf | `{'C': 1, 'gamma': 0.1, 'kernel': 'rbf'}` | 0.923077 | 0.769231 | 0.916667 | 0.916667 | 1.0 | 0.905128 | 0.07491 | 7 |
| 8 | 0.000858 | 8.4e-05 | 0.000354 | 7.2e-05 | 1.0 | 0.3 | rbf | `{'C': 1, 'gamma': 0.3, 'kernel': 'rbf'}` | 0.923077 | 0.923077 | 0.833333 | 0.916667 | 1.0 | 0.919231 | 0.052798 | 5 |
| 9 | 0.000812 | 0.000107 | 0.000341 | 6e-05 | 1.0 | 1.0 | rbf | `{'C': 1, 'gamma': 1.0, 'kernel': 'rbf'}` | 0.923077 | 0.769231 | 0.833333 | 0.916667 | 0.833333 | 0.855128 | 0.057849 | 15 |


### Grid Search: Lin-SVM
<a id="3"></a>

In [23]:
from sklearn.model_selection import GridSearchCV

param_grid = {'kernel': ['linear'], 'C': [0.001, 0.01, 0.1, 1, 5, 10.]}

svm_cls = svm.SVC(random_state=42)

# Stratified 5-fold CV. Without shuffling the folds are deterministic, so no seed is
# needed (recent scikit-learn versions reject random_state unless shuffle=True).
rsk = model_selection.StratifiedKFold(n_splits=5)

grid_search = GridSearchCV(svm_cls, param_grid, cv=rsk, scoring='accuracy', verbose=2, n_jobs=4)

In [24]:
# Fit the grid search to the data
grid_search.fit(X, y)
grid_search.best_params_

Fitting 5 folds for each of 6 candidates, totalling 30 fits


[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  23 out of  30 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=4)]: Done  30 out of  30 | elapsed:    0.0s finished


{'C': 0.1, 'kernel': 'linear'}

In [25]:
grid_search.best_score_

0.8897435897435898

In [26]:
pd.DataFrame(grid_search.cv_results_)

| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_C | param_kernel | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00086 | 7.3e-05 | 0.000341 | 6.5e-05 | 0.001 | linear | `{'C': 0.001, 'kernel': 'linear'}` | 0.461538 | 0.461538 | 0.833333 | 0.833333 | 0.75 | 0.667949 | 0.171258 | 6 |
| 1 | 0.00212 | 0.002437 | 0.000449 | 0.000148 | 0.01 | linear | `{'C': 0.01, 'kernel': 'linear'}` | 0.923077 | 0.692308 | 0.916667 | 0.833333 | 0.916667 | 0.85641 | 0.088508 | 4 |
| 2 | 0.000975 | 0.000253 | 0.000613 | 0.000487 | 0.1 | linear | `{'C': 0.1, 'kernel': 'linear'}` | 0.846154 | 0.769231 | 0.916667 | 0.916667 | 1.0 | 0.889744 | 0.077498 | 1 |
| 3 | 0.000838 | 0.000135 | 0.000309 | 1.8e-05 | 1.0 | linear | `{'C': 1, 'kernel': 'linear'}` | 0.923077 | 0.846154 | 0.833333 | 0.916667 | 0.833333 | 0.870513 | 0.040623 | 2 |
| 4 | 0.00105 | 0.000156 | 0.000379 | 7.2e-05 | 5.0 | linear | `{'C': 5, 'kernel': 'linear'}` | 0.923077 | 0.846154 | 0.833333 | 0.916667 | 0.833333 | 0.870513 | 0.040623 | 2 |
| 5 | 0.000971 | 8.9e-05 | 0.000409 | 0.000114 | 10.0 | linear | `{'C': 10.0, 'kernel': 'linear'}` | 0.769231 | 0.846154 | 0.916667 | 0.916667 | 0.833333 | 0.85641 | 0.055677 | 4 |
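
The mean accuracies for the linear SVM lie close together relative to their fold-to-fold spread. As a rough illustration, the sketch below takes the values from the table above and lists the `C` values whose mean accuracy falls within one `std_test_score` of the best candidate. The one-standard-deviation threshold is an arbitrary choice made for this example, not a criterion used in the notebook.

```python
# (C, mean_test_score, std_test_score) copied from the linear SVM results table
results = [
    (0.001, 0.667949, 0.171258),
    (0.01,  0.856410, 0.088508),
    (0.1,   0.889744, 0.077498),
    (1.0,   0.870513, 0.040623),
    (5.0,   0.870513, 0.040623),
    (10.0,  0.856410, 0.055677),
]

# Candidate with the highest mean accuracy
best_C, best_mean, best_std = max(results, key=lambda row: row[1])

# Candidates whose mean accuracy lies within one fold-to-fold std of the best
threshold = best_mean - best_std
close_to_best = [C for C, mean_score, _ in results if mean_score >= threshold]

print(best_C)         # 0.1
print(close_to_best)  # [0.01, 0.1, 1.0, 5.0, 10.0]
```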


### Grid Search: kNN
<a id="4"></a>

In [27]:
from sklearn.model_selection import GridSearchCV

param_grid = {'n_neighbors': list(range(3,30)), 'p': [1,2]}

knn_cls = KNeighborsClassifier()

# Stratified 5-fold CV. Without shuffling the folds are deterministic, so no seed is
# needed (recent scikit-learn versions reject random_state unless shuffle=True).
rsk = model_selection.StratifiedKFold(n_splits=5)

grid_search = GridSearchCV(knn_cls, param_grid, cv=rsk, scoring='accuracy', verbose=2, n_jobs=4)

In [28]:
# Fit the grid search to the data
grid_search.fit(X, y)
grid_search.best_params_

[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.


Fitting 5 folds for each of 54 candidates, totalling 270 fits


[Parallel(n_jobs=4)]: Done 136 tasks      | elapsed:    0.1s
[Parallel(n_jobs=4)]: Done 270 out of 270 | elapsed:    0.2s finished


{'n_neighbors': 4, 'p': 1}
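
In `KNeighborsClassifier`, `p` is the power parameter of the default Minkowski metric: `p=1` corresponds to Manhattan distance and `p=2` to Euclidean distance. A minimal sketch of the formula, using two made-up points (the helper `minkowski_distance` is written for this example and is not part of scikit-learn):

```python
import numpy as np

def minkowski_distance(a, b, p):
    """Minkowski distance of order p between two 1-D points."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))

# Two made-up points in a 2-D feature space
point_a = [0.0, 0.0]
point_b = [3.0, 4.0]

manhattan = minkowski_distance(point_a, point_b, p=1)  # |3| + |4|
euclidean = minkowski_distance(point_a, point_b, p=2)  # sqrt(3^2 + 4^2)

print(manhattan)  # 7.0
print(euclidean)  # 5.0
```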

In [29]:
grid_search.best_score_

0.9512820512820512

In [30]:
pd.DataFrame(grid_search.cv_results_)

| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_n_neighbors | param_p | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.001139 | 0.000875 | 0.001343 | 0.000145 | 3 | 1 | `{'n_neighbors': 3, 'p': 1}` | 1.0 | 0.846154 | 0.916667 | 0.916667 | 0.916667 | 0.919231 | 0.048752 | 5 |
| 1 | 0.0006 | 0.000116 | 0.001132 | 7.3e-05 | 3 | 2 | `{'n_neighbors': 3, 'p': 2}` | 0.923077 | 0.846154 | 1.0 | 0.916667 | 0.916667 | 0.920513 | 0.048752 | 2 |
| 2 | 0.000498 | 6e-05 | 0.001144 | 0.000113 | 4 | 1 | `{'n_neighbors': 4, 'p': 1}` | 1.0 | 0.923077 | 0.916667 | 0.916667 | 1.0 | 0.951282 | 0.039847 | 1 |
| 3 | 0.00065 | 0.000201 | 0.001819 | 0.001036 | 4 | 2 | `{'n_neighbors': 4, 'p': 2}` | 0.923077 | 0.923077 | 0.833333 | 0.916667 | 1.0 | 0.919231 | 0.052798 | 4 |
| 4 | 0.000602 | 0.00014 | 0.001103 | 0.000103 | 5 | 1 | `{'n_neighbors': 5, 'p': 1}` | 1.0 | 0.692308 | 1.0 | 0.916667 | 0.916667 | 0.905128 | 0.112748 | 6 |
| 5 | 0.000455 | 2.1e-05 | 0.001202 | 0.000131 | 5 | 2 | `{'n_neighbors': 5, 'p': 2}` | 1.0 | 0.769231 | 0.916667 | 0.916667 | 1.0 | 0.920513 | 0.084324 | 2 |
| 6 | 0.000528 | 8.8e-05 | 0.001132 | 6.2e-05 | 6 | 1 | `{'n_neighbors': 6, 'p': 1}` | 0.923077 | 0.692308 | 0.916667 | 0.916667 | 0.916667 | 0.873077 | 0.090419 | 8 |
| 7 | 0.000629 | 0.000169 | 0.001579 | 0.000772 | 6 | 2 | `{'n_neighbors': 6, 'p': 2}` | 0.846154 | 0.769231 | 0.833333 | 0.916667 | 1.0 | 0.873077 | 0.078864 | 8 |
| 8 | 0.00064 | 0.000332 | 0.002764 | 0.00177 | 7 | 1 | `{'n_neighbors': 7, 'p': 1}` | 1.0 | 0.692308 | 0.916667 | 0.916667 | 0.833333 | 0.871795 | 0.104075 | 13 |
| 9 | 0.002075 | 0.003229 | 0.001297 | 0.000192 | 7 | 2 | `{'n_neighbors': 7, 'p': 2}` | 0.923077 | 0.692308 | 0.916667 | 0.916667 | 1.0 | 0.889744 | 0.103632 | 7 |
