In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

pd.set_option("display.max_columns", 85)

# Nested CV and Logistic Regression

This notebook implements three different approaches to hyperparameter tuning and model scoring. Optimization and scoring are both performed using the **ROC-AUC score.**

**A. GridSearch without Cross-Validation**

1. Split data set into training, validation, and testing data. ($ 64 \% : 16 \% : 20 \% $)
2. Perform a grid-search, training on the training set and scoring on the validation set. 
3. Score the best model on the testing set. 
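
The 64 : 16 : 20 proportions follow from applying an 80/20 split twice: once to hold out the test set, and once more to carve a validation set out of the remaining data. A minimal sketch on placeholder data (the array `X_demo` and its size are assumptions for illustration only):

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit, train_test_split

# Placeholder data: 1000 samples, 3 features (illustrative only)
X_demo = np.arange(3000).reshape(1000, 3)
y_demo = np.arange(1000) % 2

# First 80/20 split: training+validation vs. testing
X_trval, X_te, y_trval, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Second 80/20 split on the remaining 80%: training vs. validation
split = ShuffleSplit(n_splits=1, test_size=0.2, random_state=2)
train_idx, val_idx = next(split.split(X_trval))

n = len(X_demo)
fractions = (len(train_idx) / n, len(val_idx) / n, len(X_te) / n)
print(fractions)
```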

**B. GridSearch with Cross-Validation**

1. Split the data into training and testing data.
2. Perform a 3-fold gridsearch on the hyperparameters.
3. Score the best model on the testing set.

**C. Nested CV**

1. Split the data into training and testing data.
2. Define an outer CV loop (k=10). Within each outer fold, perform a 3-fold grid search on the training portion to find the best model, then score that model on the held-out outer fold to obtain an estimate of the generalization error.
3. Average the scores over the outer folds.

# Results
- Method A (GridSearch with a single validation split): $ROCAUC=0.659199$
- Method B (GridSearch CV): $ROCAUC=0.659229$
- Method C, manual loop (Nested CV): $ROCAUC=0.670565 \ (std=0.018)$
- Method C, `cross_validate` (Nested CV): $ROCAUC=0.667420 \ (std=0.019)$

Does the high uncertainty of the nested CV mean anything? **Try: Increasing k of inner CV**

## Overfitting During Model Selection

> *On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation*, Cawley and Talbot (2010).

The paper discusses two separate but equally important goals: model selection and model evaluation. 

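In the model selection phase, care must be taken not to overfit the hyperparameters to the validation set: a score that was used to choose the model is an optimistically biased estimate of how that model will perform on new data.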
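
This selection bias can be illustrated with a small simulation. It is a sketch under simplifying assumptions, not the paper's own experiment: every candidate configuration is given the same true score, and each validation score is that true score plus independent Gaussian noise. The values of `true_score`, `noise_sd`, and `n_candidates` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

true_score = 0.66      # assumed true ROC-AUC shared by all candidates
noise_sd = 0.01        # assumed noise of a single validation estimate
n_candidates = 56      # illustrative grid size
n_repeats = 2000       # number of simulated model-selection runs

# Each row is one simulated grid search: noisy validation scores per candidate
val_scores = true_score + noise_sd * rng.standard_normal((n_repeats, n_candidates))

# Model selection keeps the best validation score in every run
best_val = val_scores.max(axis=1)

print(f"true score:               {true_score:.4f}")
print(f"mean selected val. score: {best_val.mean():.4f}")
print(f"optimistic bias:          {best_val.mean() - true_score:.4f}")
```

The selected validation score is biased upward even though no candidate is actually better than the others, which is why a separate test set or an outer CV loop is needed for evaluation.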

***
### Load the data

In [2]:
# Read Data from CSV (NEW DATA, NOT SCALED)
df = pd.read_csv("../data/abnormal_writeout_noscale.data.csv", index_col=0)

# Drop the columns from ACC to UVM
start_drop = df.columns.get_loc("ACC")
end_drop = df.columns.get_loc("UVM")
cols = np.arange(start_drop, end_drop + 1)
df.drop(df.columns[cols], axis=1, inplace=True)

# Drop a few more columns
df.drop("TTT_freq", axis=1, inplace=True)
df.drop("oldest_phylostratum_factor", axis=1, inplace=True)

# Drop NaNs
df.dropna(inplace=True)

# Sort features
resp = df["response"]
occ = df["occ_total_sum"]
age = df["oldest_phylostratum"]
conf = df.drop(labels=["response", "occ_total_sum", "oldest_phylostratum"], axis=1)

# Collect Features and Labels
features_df = pd.DataFrame()
features_df["occ_total_sum"] = occ
features_df["oldest_phylostratum"] = age
features_df = pd.concat([features_df, conf], axis=1)

X = features_df.to_numpy()
y = df["response"].to_numpy()

features_df.head(10)

Unnamed: 0,occ_total_sum,oldest_phylostratum,cds_length,gc_cds,dnase_gene,dnase_cds,H3k4me1_gene,H3k4me3_gene,H3k27ac_gene,H3k4me1_cds,H3k4me3_cds,H3k27ac_cds,lamin_gene,repli_gene,nsome_gene,nsome_cds,transcription_gene,repeat_gene,repeat_cds,recomb_gene,AAA_freq,AAC_freq,AAG_freq,AAT_freq,ACA_freq,ACC_freq,ACG_freq,ACT_freq,AGA_freq,AGC_freq,AGG_freq,AGT_freq,ATA_freq,ATC_freq,ATG_freq,ATT_freq,CAA_freq,CAC_freq,CAG_freq,CAT_freq,CCA_freq,CCC_freq,CCG_freq,CCT_freq,CGA_freq,CGC_freq,CGG_freq,CGT_freq,CTA_freq,CTC_freq,CTG_freq,CTT_freq,GAA_freq,GAC_freq,GAG_freq,GAT_freq,GCA_freq,GCC_freq,GCG_freq,GCT_freq,GGA_freq,GGC_freq,GGG_freq,GGT_freq,GTA_freq,GTC_freq,GTG_freq,GTT_freq,TAA_freq,TAC_freq,TAG_freq,TAT_freq,TCA_freq,TCC_freq,TCG_freq,TCT_freq,TGA_freq,TGC_freq,TGG_freq,TGT_freq,TTA_freq,TTC_freq,TTG_freq
1,33,12.0,1488,0.657258,0.61223,0.758065,0.561429,1.0,0.216855,0.66129,1.0,0.198925,0.0,0.041809,0.809254,0.706453,6.798234,0.040516,0.0,0.0,0.004755,0.008152,0.007473,0.002717,0.011549,0.026495,0.01087,0.008152,0.01019,0.028533,0.019701,0.009511,0.000679,0.006114,0.01087,0.002038,0.009511,0.019022,0.028533,0.007473,0.027174,0.03125,0.025136,0.029891,0.015625,0.027174,0.019701,0.009511,0.007473,0.017663,0.044837,0.013587,0.008832,0.021739,0.03125,0.008152,0.016984,0.033967,0.027853,0.034647,0.023777,0.030571,0.029212,0.013587,0.000679,0.012908,0.027174,0.003397,0.0,0.008152,0.0,0.001359,0.008832,0.021739,0.009511,0.01019,0.02038,0.027174,0.029212,0.01087,0.000679,0.013587,0.005435
10,28,1.0,873,0.42268,0.086769,0.195876,0.657839,0.0,0.0,0.0,0.0,0.0,1.0,-0.007148,0.828752,1.097018,0.061963,0.002809,0.0,2.04335,0.025258,0.019518,0.021814,0.02411,0.025258,0.01837,0.003444,0.012629,0.035591,0.009185,0.016073,0.006889,0.016073,0.017222,0.010333,0.033295,0.019518,0.011481,0.020666,0.022962,0.017222,0.008037,0.002296,0.021814,0.003444,0.001148,0.004592,0.002296,0.008037,0.019518,0.022962,0.019518,0.033295,0.013777,0.019518,0.011481,0.014925,0.006889,0.0,0.012629,0.01837,0.011481,0.017222,0.01837,0.005741,0.008037,0.012629,0.012629,0.012629,0.014925,0.006889,0.017222,0.017222,0.016073,0.005741,0.022962,0.020666,0.012629,0.027555,0.011481,0.021814,0.017222,0.026406
100,36,1.0,1092,0.572344,0.479295,0.611722,0.851369,0.354628,0.618954,0.754579,0.03022,0.086996,0.0,0.040463,1.2496,1.354306,6.08162,0.028404,0.0,0.868383,0.018727,0.012172,0.023408,0.003745,0.01779,0.024345,0.007491,0.014981,0.024345,0.020599,0.025281,0.011236,0.003745,0.013109,0.019663,0.004682,0.01779,0.016854,0.029963,0.01779,0.034644,0.022472,0.0103,0.02809,0.005618,0.0103,0.014045,0.003745,0.015918,0.015918,0.033708,0.011236,0.014981,0.022472,0.026217,0.009363,0.015918,0.031835,0.007491,0.025281,0.02809,0.029026,0.021536,0.013109,0.008427,0.0103,0.016854,0.003745,0.006554,0.012172,0.005618,0.008427,0.014981,0.016854,0.009363,0.008427,0.014981,0.019663,0.029026,0.0103,0.004682,0.0103,0.004682
1000,126,1.0,2800,0.46,0.171524,0.280357,0.554023,0.05242,0.278492,0.270357,0.021429,0.151429,0.0,-0.022495,0.92142,1.382249,2.254471,0.01452,0.0,1.14306,0.022054,0.014823,0.022415,0.024946,0.022054,0.0141,0.006146,0.015546,0.024946,0.016992,0.012292,0.015907,0.013377,0.02133,0.026392,0.017715,0.026392,0.011931,0.027477,0.017354,0.023861,0.016992,0.006508,0.019161,0.005785,0.003977,0.007954,0.003977,0.006146,0.010846,0.025307,0.015907,0.022415,0.022777,0.016269,0.0188,0.015184,0.016992,0.0047,0.014461,0.017354,0.010484,0.010123,0.011931,0.0094,0.007231,0.020607,0.011931,0.013738,0.008315,0.006146,0.016631,0.022054,0.018077,0.0047,0.009038,0.031092,0.019523,0.019523,0.016992,0.016269,0.0141,0.015907
10000,55,1.0,1484,0.401617,0.143843,0.030997,0.400789,0.106455,0.457949,0.708221,0.030997,0.659704,0.0,-0.000387,0.960747,1.196871,1.080241,0.009545,0.0,4.217,0.039835,0.015797,0.03022,0.025412,0.024038,0.012363,0.002747,0.019918,0.048077,0.006868,0.015797,0.009615,0.020604,0.009615,0.03228,0.023352,0.019918,0.012363,0.021978,0.015797,0.01511,0.003434,0.004121,0.013049,0.005495,0.001374,0.00206,0.00206,0.013736,0.014423,0.014423,0.013736,0.034341,0.017857,0.024725,0.024725,0.016484,0.006868,0.002747,0.006181,0.022665,0.013049,0.010302,0.008242,0.009615,0.004808,0.013736,0.011676,0.018544,0.012363,0.008242,0.019231,0.01511,0.012363,0.00206,0.015797,0.024038,0.010989,0.026099,0.018544,0.014423,0.015797,0.019231
100009676,0,15.0,267,0.670412,0.648707,0.891386,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.029824,0.549569,0.148308,18.763257,0.0,0.0,0.878711,0.0,0.003774,0.003774,0.003774,0.007547,0.015094,0.015094,0.0,0.003774,0.018868,0.022642,0.007547,0.007547,0.0,0.011321,0.007547,0.0,0.015094,0.030189,0.007547,0.015094,0.022642,0.05283,0.015094,0.007547,0.064151,0.041509,0.015094,0.0,0.030189,0.011321,0.018868,0.003774,0.007547,0.011321,0.011321,0.026415,0.045283,0.030189,0.041509,0.011321,0.026415,0.007547,0.037736,0.015094,0.018868,0.026415,0.003774,0.007547,0.011321,0.007547,0.0,0.003774,0.022642,0.030189,0.003774,0.015094,0.033962,0.011321,0.003774,0.003774,0.011321,0.015094
10001,14,2.0,1153,0.416305,0.120222,0.306158,0.17318,0.231617,1.0,0.019081,0.157849,1.0,1.0,0.038151,0.848328,1.028834,12.872106,0.006818,0.0,0.877603,0.03543,0.016829,0.03543,0.008857,0.016829,0.011515,0.001771,0.021258,0.033658,0.018601,0.011515,0.022143,0.013286,0.024801,0.019486,0.018601,0.023029,0.010629,0.031001,0.019486,0.015058,0.010629,0.000886,0.015943,0.004429,0.0,0.004429,0.000886,0.017715,0.005314,0.018601,0.023915,0.025686,0.009743,0.017715,0.020372,0.019486,0.005314,0.002657,0.015943,0.015943,0.010629,0.0062,0.014172,0.014172,0.0124,0.018601,0.005314,0.014172,0.013286,0.007086,0.025686,0.032772,0.015058,0.004429,0.009743,0.021258,0.014172,0.024801,0.013286,0.015058,0.018601,0.015943
10003,99,1.0,2706,0.396896,0.071271,0.209904,0.052603,0.038094,0.005366,0.105322,0.071693,0.030303,1.0,-0.016515,0.664585,0.914735,0.328743,0.033453,0.0,0.43413,0.036439,0.015778,0.023666,0.028174,0.016905,0.01127,0.002254,0.019159,0.029677,0.013899,0.01127,0.01127,0.022164,0.020661,0.025545,0.026672,0.020285,0.009391,0.021037,0.016905,0.019159,0.010894,0.00263,0.01127,0.003005,0.001127,0.00263,0.001503,0.012772,0.011645,0.022164,0.019534,0.034185,0.010518,0.018783,0.018032,0.012021,0.009391,0.001127,0.017656,0.025169,0.006386,0.009016,0.009391,0.012021,0.00263,0.009391,0.013524,0.012772,0.013148,0.007137,0.030804,0.019534,0.012397,0.003005,0.017656,0.023666,0.016529,0.027047,0.015026,0.016905,0.01728,0.02592
100037417,3,1.0,405,0.614815,0.330596,0.298765,0.351564,0.056264,0.028482,0.298765,0.0,0.0,0.0,0.049459,0.785799,0.368035,57.925186,0.011008,0.0,1.32662,0.005013,0.012531,0.012531,0.007519,0.007519,0.02005,0.015038,0.005013,0.015038,0.025063,0.015038,0.010025,0.007519,0.010025,0.017544,0.007519,0.010025,0.017544,0.017544,0.02005,0.037594,0.027569,0.030075,0.025063,0.012531,0.030075,0.02005,0.010025,0.005013,0.017544,0.035088,0.010025,0.017544,0.015038,0.027569,0.005013,0.010025,0.050125,0.025063,0.02005,0.022556,0.025063,0.017544,0.012531,0.007519,0.010025,0.012531,0.010025,0.005013,0.005013,0.007519,0.007519,0.010025,0.022556,0.002506,0.017544,0.017544,0.025063,0.025063,0.005013,0.005013,0.015038,0.007519
10004,49,1.0,2250,0.607556,0.482872,0.707556,1.0,0.391545,0.147303,1.0,0.454667,0.211111,0.0,0.048559,1.1061,1.052286,2.206688,0.002915,0.0,0.915263,0.003617,0.014014,0.01085,0.003165,0.017179,0.023056,0.011754,0.014014,0.014014,0.026221,0.018083,0.008137,0.001808,0.012658,0.011302,0.005425,0.015371,0.018083,0.03481,0.011754,0.025769,0.030289,0.014467,0.037071,0.006781,0.007685,0.014467,0.009946,0.012658,0.023508,0.039783,0.016727,0.012206,0.023056,0.020344,0.008137,0.022604,0.027577,0.006781,0.027125,0.027577,0.026673,0.035262,0.011302,0.004973,0.009494,0.019892,0.004973,0.000904,0.01085,0.002712,0.006329,0.014467,0.024864,0.006329,0.014014,0.015823,0.022152,0.034358,0.009946,0.000904,0.014014,0.012206


***
### Custom PCA

In [3]:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.decomposition import PCA

# Columns of confounder variables (highly collinear)
conf_index = 2
conf_cols = np.arange(conf_index, X.shape[1])  

class ConfounderPCA(BaseEstimator, TransformerMixin):
    """ 
    Custom PCA transformer for this dataset.
    Applies PCA only to the many collinear confounder 
    variables.
    """

    def __init__(self, n_components=0.95, apply_PCA=True):
        # Store hyperparameters only, so that clone() and set_params() work
        self.n_components = n_components
        self.apply_PCA = apply_PCA

    def fit(self, X, y=None):
        if self.apply_PCA:
            # Build the PCA at fit time so that values set by the grid search are used
            self.pca_ = PCA(n_components=self.n_components)
            self.pca_.fit(X[:, conf_cols])
        return self

    def transform(self, X, y=None):
        if self.apply_PCA:
            X_conf_pca = self.pca_.transform(X[:, conf_cols])
            return np.c_[X[:, :conf_index], X_conf_pca]
        return X
        
print(X.shape[1], "total features.")
print("Confounder columns start from index", conf_index, "of feature matrix.")
print("Non-confounders:", features_df.iloc[:, 0:conf_index].columns.tolist())
features_df

83 total features.
Confounder columns start from index 2 of feature matrix.
Non-confounders: ['occ_total_sum', 'oldest_phylostratum']


Unnamed: 0,occ_total_sum,oldest_phylostratum,cds_length,gc_cds,dnase_gene,dnase_cds,H3k4me1_gene,H3k4me3_gene,H3k27ac_gene,H3k4me1_cds,H3k4me3_cds,H3k27ac_cds,lamin_gene,repli_gene,nsome_gene,nsome_cds,transcription_gene,repeat_gene,repeat_cds,recomb_gene,AAA_freq,AAC_freq,AAG_freq,AAT_freq,ACA_freq,ACC_freq,ACG_freq,ACT_freq,AGA_freq,AGC_freq,AGG_freq,AGT_freq,ATA_freq,ATC_freq,ATG_freq,ATT_freq,CAA_freq,CAC_freq,CAG_freq,CAT_freq,CCA_freq,CCC_freq,CCG_freq,CCT_freq,CGA_freq,CGC_freq,CGG_freq,CGT_freq,CTA_freq,CTC_freq,CTG_freq,CTT_freq,GAA_freq,GAC_freq,GAG_freq,GAT_freq,GCA_freq,GCC_freq,GCG_freq,GCT_freq,GGA_freq,GGC_freq,GGG_freq,GGT_freq,GTA_freq,GTC_freq,GTG_freq,GTT_freq,TAA_freq,TAC_freq,TAG_freq,TAT_freq,TCA_freq,TCC_freq,TCG_freq,TCT_freq,TGA_freq,TGC_freq,TGG_freq,TGT_freq,TTA_freq,TTC_freq,TTG_freq
1,33,12.0,1488,0.657258,0.612230,0.758065,0.561429,1.000000,0.216855,0.661290,1.000000,0.198925,0.0,0.041809,0.809254,0.706453,6.798234,0.040516,0.0,0.000000,0.004755,0.008152,0.007473,0.002717,0.011549,0.026495,0.010870,0.008152,0.010190,0.028533,0.019701,0.009511,0.000679,0.006114,0.010870,0.002038,0.009511,0.019022,0.028533,0.007473,0.027174,0.031250,0.025136,0.029891,0.015625,0.027174,0.019701,0.009511,0.007473,0.017663,0.044837,0.013587,0.008832,0.021739,0.031250,0.008152,0.016984,0.033967,0.027853,0.034647,0.023777,0.030571,0.029212,0.013587,0.000679,0.012908,0.027174,0.003397,0.000000,0.008152,0.000000,0.001359,0.008832,0.021739,0.009511,0.010190,0.020380,0.027174,0.029212,0.010870,0.000679,0.013587,0.005435
10,28,1.0,873,0.422680,0.086769,0.195876,0.657839,0.000000,0.000000,0.000000,0.000000,0.000000,1.0,-0.007148,0.828752,1.097018,0.061963,0.002809,0.0,2.043350,0.025258,0.019518,0.021814,0.024110,0.025258,0.018370,0.003444,0.012629,0.035591,0.009185,0.016073,0.006889,0.016073,0.017222,0.010333,0.033295,0.019518,0.011481,0.020666,0.022962,0.017222,0.008037,0.002296,0.021814,0.003444,0.001148,0.004592,0.002296,0.008037,0.019518,0.022962,0.019518,0.033295,0.013777,0.019518,0.011481,0.014925,0.006889,0.000000,0.012629,0.018370,0.011481,0.017222,0.018370,0.005741,0.008037,0.012629,0.012629,0.012629,0.014925,0.006889,0.017222,0.017222,0.016073,0.005741,0.022962,0.020666,0.012629,0.027555,0.011481,0.021814,0.017222,0.026406
100,36,1.0,1092,0.572344,0.479295,0.611722,0.851369,0.354628,0.618954,0.754579,0.030220,0.086996,0.0,0.040463,1.249600,1.354306,6.081620,0.028404,0.0,0.868383,0.018727,0.012172,0.023408,0.003745,0.017790,0.024345,0.007491,0.014981,0.024345,0.020599,0.025281,0.011236,0.003745,0.013109,0.019663,0.004682,0.017790,0.016854,0.029963,0.017790,0.034644,0.022472,0.010300,0.028090,0.005618,0.010300,0.014045,0.003745,0.015918,0.015918,0.033708,0.011236,0.014981,0.022472,0.026217,0.009363,0.015918,0.031835,0.007491,0.025281,0.028090,0.029026,0.021536,0.013109,0.008427,0.010300,0.016854,0.003745,0.006554,0.012172,0.005618,0.008427,0.014981,0.016854,0.009363,0.008427,0.014981,0.019663,0.029026,0.010300,0.004682,0.010300,0.004682
1000,126,1.0,2800,0.460000,0.171524,0.280357,0.554023,0.052420,0.278492,0.270357,0.021429,0.151429,0.0,-0.022495,0.921420,1.382249,2.254471,0.014520,0.0,1.143060,0.022054,0.014823,0.022415,0.024946,0.022054,0.014100,0.006146,0.015546,0.024946,0.016992,0.012292,0.015907,0.013377,0.021330,0.026392,0.017715,0.026392,0.011931,0.027477,0.017354,0.023861,0.016992,0.006508,0.019161,0.005785,0.003977,0.007954,0.003977,0.006146,0.010846,0.025307,0.015907,0.022415,0.022777,0.016269,0.018800,0.015184,0.016992,0.004700,0.014461,0.017354,0.010484,0.010123,0.011931,0.009400,0.007231,0.020607,0.011931,0.013738,0.008315,0.006146,0.016631,0.022054,0.018077,0.004700,0.009038,0.031092,0.019523,0.019523,0.016992,0.016269,0.014100,0.015907
10000,55,1.0,1484,0.401617,0.143843,0.030997,0.400789,0.106455,0.457949,0.708221,0.030997,0.659704,0.0,-0.000387,0.960747,1.196871,1.080241,0.009545,0.0,4.217000,0.039835,0.015797,0.030220,0.025412,0.024038,0.012363,0.002747,0.019918,0.048077,0.006868,0.015797,0.009615,0.020604,0.009615,0.032280,0.023352,0.019918,0.012363,0.021978,0.015797,0.015110,0.003434,0.004121,0.013049,0.005495,0.001374,0.002060,0.002060,0.013736,0.014423,0.014423,0.013736,0.034341,0.017857,0.024725,0.024725,0.016484,0.006868,0.002747,0.006181,0.022665,0.013049,0.010302,0.008242,0.009615,0.004808,0.013736,0.011676,0.018544,0.012363,0.008242,0.019231,0.015110,0.012363,0.002060,0.015797,0.024038,0.010989,0.026099,0.018544,0.014423,0.015797,0.019231
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
999,208,1.0,2649,0.514911,0.313496,0.427709,0.721323,0.380132,0.560000,0.371461,0.147603,0.375613,0.0,0.051321,1.156640,1.677763,12.624956,0.019776,0.0,0.549588,0.016813,0.015285,0.020634,0.015667,0.029041,0.018342,0.012228,0.012992,0.022163,0.018724,0.019488,0.009171,0.007642,0.014903,0.015667,0.015667,0.020634,0.025984,0.026366,0.012992,0.026748,0.023691,0.009935,0.021016,0.009171,0.007642,0.008789,0.007642,0.009935,0.017577,0.030187,0.012610,0.024838,0.019870,0.022545,0.016431,0.010699,0.021781,0.005732,0.018724,0.022927,0.014520,0.011846,0.012992,0.004968,0.011846,0.017577,0.007642,0.006878,0.011846,0.003057,0.007642,0.019488,0.017195,0.005732,0.017577,0.028659,0.015667,0.021781,0.011464,0.006496,0.015667,0.014138
9990,88,1.0,4035,0.486245,0.159518,0.305328,0.618466,1.000000,0.379258,0.538290,1.000000,0.578686,0.0,0.032907,0.952004,1.596068,4.338614,0.013269,0.0,2.271970,0.019613,0.013327,0.021624,0.012824,0.021624,0.018607,0.004275,0.016093,0.018607,0.016847,0.015841,0.013830,0.007292,0.016595,0.025396,0.016847,0.019361,0.019864,0.020619,0.023133,0.024642,0.014835,0.009052,0.018104,0.009303,0.004275,0.007040,0.003017,0.012069,0.015590,0.021121,0.022630,0.020116,0.014332,0.019361,0.016595,0.016595,0.015590,0.004275,0.018356,0.019613,0.015841,0.016093,0.014835,0.008046,0.011064,0.019361,0.010561,0.008801,0.011567,0.006035,0.012321,0.019864,0.018104,0.006035,0.018858,0.022630,0.017098,0.027910,0.016093,0.011315,0.019613,0.018858
9991,37,2.0,2043,0.443465,0.164623,0.025453,0.748995,0.710461,0.872609,0.785120,0.786099,1.000000,0.0,0.045040,0.865913,1.245576,7.591840,0.014049,0.0,2.458350,0.024463,0.007988,0.020469,0.024463,0.015477,0.016975,0.002996,0.011982,0.017474,0.017474,0.008987,0.014978,0.012481,0.019970,0.024463,0.019471,0.017474,0.015976,0.024963,0.018472,0.021468,0.012481,0.003994,0.025462,0.004493,0.004493,0.003994,0.003495,0.014978,0.020969,0.024963,0.031952,0.021468,0.008987,0.010484,0.014978,0.013979,0.013979,0.001498,0.021468,0.014978,0.011982,0.009486,0.007489,0.004993,0.010484,0.012981,0.012981,0.013979,0.014478,0.005991,0.016475,0.025961,0.019970,0.006490,0.033450,0.018972,0.015976,0.021468,0.014978,0.018472,0.034448,0.011483
9992,14,12.0,372,0.462366,0.166620,0.572581,0.857123,0.861899,1.000000,1.000000,1.000000,1.000000,0.0,0.017871,1.277585,1.767925,0.136402,0.020090,0.0,2.001840,0.018919,0.013514,0.027027,0.024324,0.027027,0.013514,0.008108,0.018919,0.032432,0.013514,0.013514,0.008108,0.002703,0.027027,0.027027,0.018919,0.029730,0.018919,0.013514,0.024324,0.035135,0.016216,0.002703,0.016216,0.005405,0.005405,0.005405,0.005405,0.013514,0.010811,0.024324,0.016216,0.035135,0.016216,0.021622,0.010811,0.008108,0.016216,0.005405,0.010811,0.018919,0.016216,0.005405,0.010811,0.010811,0.010811,0.013514,0.008108,0.000000,0.018919,0.005405,0.013514,0.016216,0.024324,0.005405,0.018919,0.029730,0.005405,0.027027,0.018919,0.010811,0.016216,0.016216
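
As a point of comparison, and not the approach used in this notebook, a similar effect can be sketched with scikit-learn's `ColumnTransformer`: the first two columns pass through unchanged and PCA is applied to the rest. The random `X_demo` matrix below is a stand-in for the real feature matrix.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Stand-in data: 2 non-confounder columns followed by 10 strongly correlated columns
base = rng.standard_normal((200, 3))
confounders = base @ rng.standard_normal((3, 10)) + 0.01 * rng.standard_normal((200, 10))
X_demo = np.hstack([rng.standard_normal((200, 2)), confounders])

ct = ColumnTransformer([
    ("keep", "passthrough", [0, 1]),                  # leave the first two columns as-is
    ("pca", PCA(n_components=0.95), slice(2, None)),  # compress the remaining columns
])
X_reduced = ct.fit_transform(X_demo)

print(X_demo.shape, "->", X_reduced.shape)
```

The custom transformer keeps one advantage in this notebook: its `apply_PCA` flag lets the grid search switch the PCA step on and off as a hyperparameter.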


### Custom Scoring

In [4]:
from sklearn.metrics import auc, make_scorer, precision_recall_curve


def auprc(y_true, y_scores, **kwargs):
    """Area under the precision-recall curve.

    Remember to use make_scorer(auprc, needs_proba=True); recent scikit-learn
    releases replace needs_proba with response_method="predict_proba".
    """
    precisions, recalls, _ = precision_recall_curve(y_true, y_scores)
    # Integrate precision over recall (trapezoidal rule)
    return auc(recalls, precisions)
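
For reference, the area under the precision-recall curve is conventionally obtained by integrating precision over recall. A minimal sketch on toy labels and scores (both made up for illustration), shown next to scikit-learn's related `average_precision_score`:

```python
import numpy as np
from sklearn.metrics import auc, average_precision_score, precision_recall_curve

# Toy data: the positives receive the highest scores, so the ranking is perfect
y_true = np.array([0, 0, 1, 1])
y_scores = np.array([0.1, 0.2, 0.8, 0.9])

precisions, recalls, _ = precision_recall_curve(y_true, y_scores)

# Trapezoidal area with recall on the x-axis and precision on the y-axis
pr_auc = auc(recalls, precisions)
ap = average_precision_score(y_true, y_scores)

print(f"PR-AUC (trapezoid): {pr_auc:.3f}")
print(f"Average precision:  {ap:.3f}")
```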

### The Model and its Parameter Space

In [5]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Define a parameter space to search
param_grid = {
    "lr__C": np.logspace(-3, 4, 7),
    "lr__class_weight": [None, "balanced"],
    "pca__apply_PCA": [True, False],
    "pca__n_components": [0.95, None],
}

# Define the model to be tuned
lr_clf = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", ConfounderPCA()),
    ("lr", LogisticRegression(max_iter=2000,)),
])
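
The size of the search can be checked with `ParameterGrid`. The sketch below also shows a possible refinement that is not part of the original notebook: since `pca__n_components` has no effect when `pca__apply_PCA` is `False`, passing a list of two grids avoids fitting those redundant combinations.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

C_values = np.logspace(-3, 4, 7)

# Full Cartesian grid, as defined above: 7 * 2 * 2 * 2 candidates
full_grid = {
    "lr__C": C_values,
    "lr__class_weight": [None, "balanced"],
    "pca__apply_PCA": [True, False],
    "pca__n_components": [0.95, None],
}

# Alternative (an assumption, not the notebook's grid):
# only vary n_components when PCA is actually applied
split_grid = [
    {
        "lr__C": C_values,
        "lr__class_weight": [None, "balanced"],
        "pca__apply_PCA": [True],
        "pca__n_components": [0.95, None],
    },
    {
        "lr__C": C_values,
        "lr__class_weight": [None, "balanced"],
        "pca__apply_PCA": [False],
    },
]

n_full = len(ParameterGrid(full_grid))
n_split = len(ParameterGrid(split_grid))
print(n_full, n_split)
```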

### Train-Test Split

In [6]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Training set shape:", X_train.shape, y_train.shape)
print("Testing set shape:", X_test.shape, y_test.shape)

Training set shape: (14536, 83) (14536,)
Testing set shape: (3634, 83) (3634,)


# Grid-search without Cross-validation (~k=1)

In this section: GridSearch using a standard validation set to determine optimal hyperparameters.

Using `ShuffleSplit(n_splits=1, test_size=0.2)` I create the train-validation split to be performed on `X_train`. 

Then I run `GridSearchCV`, which will perform the grid search on hyperparameters as if $k=1$. 

The best model is then fit on `X_train` and tested on `X_test`.

**NB:** $C = 1 / \lambda$, meaning a higher $C$ results in less regularization.
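
The effect of $C$ can be seen on synthetic data (generated with `make_classification`, an assumption for illustration): a small $C$ shrinks the coefficients towards zero, while a large $C$ leaves them almost unregularized.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
X_demo = StandardScaler().fit_transform(X_demo)

coef_norms = {}
for C in (1e-3, 1.0, 1e3):
    clf = LogisticRegression(C=C, max_iter=2000).fit(X_demo, y_demo)
    # L2 norm of the coefficient vector: smaller means stronger regularization
    coef_norms[C] = np.linalg.norm(clf.coef_)

for C, norm in coef_norms.items():
    print(f"C={C:g}: ||w|| = {norm:.3f}")
```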

In [7]:
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, ShuffleSplit

split = ShuffleSplit(n_splits=1, test_size=0.2, random_state=2) # Train / Validation split
search = GridSearchCV(estimator=lr_clf, param_grid=param_grid, scoring="roc_auc", cv=split, n_jobs=-1, verbose=1)
gs_result = search.fit(X_train, y_train)

model = gs_result.best_estimator_
print("\nBest Params:")
print(gs_result.best_params_)

model.fit(X_train, y_train)
pred_proba = model.predict_proba(X_test)[:, 1]
print(f"\nROC-AUC generalization score w/o CV: {roc_auc_score(y_test, pred_proba):.6f}")

Fitting 1 folds for each of 56 candidates, totalling 56 fits

Best Params:
{'lr__C': 681.2920690579622, 'lr__class_weight': 'balanced', 'pca__apply_PCA': False, 'pca__explained_variance': 0.95}

ROC-AUC generalization score w/o CV: 0.659199


***
# GridSearchCV for hyperparameter tuning

In this section: GridSearch with 3-fold cross-validation to determine optimal model parameters.

Here, $k=3$, for the usual `GridSearchCV` on the training set. The best estimator is then trained on `X_train` and evaluated on `X_test`.

In [8]:
# Define the grid search object
gscv = GridSearchCV(
    estimator=lr_clf,
    param_grid=param_grid,
    cv=3,
    n_jobs=-1,
    verbose=1,
    scoring="roc_auc", 
)

# Search
gscv_result = gscv.fit(X_train, y_train)

Fitting 3 folds for each of 56 candidates, totalling 168 fits


In [9]:
model = gscv_result.best_estimator_
print("Best Params:")
print(gscv_result.best_params_)
model.fit(X_train, y_train)
pred_proba = model.predict_proba(X_test)[:, 1]
print(f"\nROC-AUC generalization score with 3-fold CV: {roc_auc_score(y_test, pred_proba):.6f}")

Best Params:
{'lr__C': 3.1622776601683795, 'lr__class_weight': 'balanced', 'pca__apply_PCA': False, 'pca__explained_variance': 0.95}

ROC-AUC generalization score with 3-fold CV: 0.659229


*** 
## Nested CV on Logistic Regression, Returning Information on the Best Parameter Configurations

In this section: Nested CV (outer-fold=10, inner-fold=3). 

The outer cross-validation is defined by `KFold(n_splits=10)`, the inner by `KFold(n_splits=3)`. 

The nested cross-validation is performed as follows. Within each outer fold:
1. The data is split into training and testing. (9 folds for training, 1 for testing.)
2. A hyperparameter grid search is performed on the training set using 3-fold CV. The score of the best model is printed along with its parameters.
3. This best model is scored on the reserved testing set. This outer-cv score is printed. 

In [19]:
from sklearn.exceptions import ConvergenceWarning, FitFailedWarning
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score, RepeatedKFold
from sklearn.utils._testing import ignore_warnings

# configure the cross-validation procedure
k_outer = 10
k_inner = 3
cv_outer = KFold(n_splits=k_outer, shuffle=True, random_state=1)
cv_inner = KFold(n_splits=k_inner, shuffle=True, random_state=3)

# To store results
roc_results = list()
# auprc_results = list()
# prec_results = list()
# rec_results = list()
# f1_results = list()
found_params = list()

print(f"Performing nested-cv with {k_outer} outer-folds and {k_inner} inner-folds.\n")
print("OUTER CV | BEST OF INNER CV | CHOSEN PARAMS")

for train_ix, test_ix in cv_outer.split(X_train):

    # split data
    X_tr, X_te = X_train[train_ix, :], X_train[test_ix, :]
    y_tr, y_te = y_train[train_ix], y_train[test_ix]

    # with ignore_warnings(category=[ConvergenceWarning, FitFailedWarning]):
    # define search
    search = GridSearchCV(estimator=lr_clf, param_grid=param_grid, scoring="roc_auc", cv=cv_inner, n_jobs=-1)
    # execute search
    result = search.fit(X_tr, y_tr)
        
    # get the best performing model fit on the whole training set
    best_model = result.best_estimator_

    # evaluate model on the hold out dataset
    # yhat = best_model.predict(X_te)
    yhat = best_model.predict_log_proba(X_te)[:,1]

    # evaluate the model
    roc_auc = roc_auc_score(y_te, yhat)
    
    # store the result
    roc_results.append(roc_auc)
    # auprc_results.append(auprc(y_te, yhat))
    # prec_results.append(precision_score(y_te, yhat))
    # rec_results.append(recall_score(y_te, yhat))
    # f1_results.append(f1_score(y_te, yhat))
    found_params.append(result.best_params_)

    # report progress
    print("roc-auc=%.3f, est=%.3f, params=%s" % (roc_auc, result.best_score_, result.best_params_))

# summarize the estimated performance of the model
print("ROC-AUC: %.3f (std = %.3f)" % (np.mean(roc_results), np.std(roc_results)))
print("(other scores stored in ncv_df)")

Performing nested-cv with 15 outer-folds and 3 inner-folds.

OUTER CV | BEST OF INNER CV | CHOSEN PARAMS
roc-auc=0.679, est=0.663, params={'lr__C': 0.21544346900318845, 'lr__class_weight': 'balanced', 'pca__apply_PCA': False, 'pca__explained_variance': 0.95}
roc-auc=0.684, est=0.664, params={'lr__C': 3.1622776601683795, 'lr__class_weight': 'balanced', 'pca__apply_PCA': False, 'pca__explained_variance': 0.95}
roc-auc=0.682, est=0.662, params={'lr__C': 0.21544346900318845, 'lr__class_weight': 'balanced', 'pca__apply_PCA': False, 'pca__explained_variance': 0.95}
roc-auc=0.669, est=0.668, params={'lr__C': 3.1622776601683795, 'lr__class_weight': 'balanced', 'pca__apply_PCA': False, 'pca__explained_variance': 0.95}
roc-auc=0.659, est=0.670, params={'lr__C': 3.1622776601683795, 'lr__class_weight': 'balanced', 'pca__apply_PCA': False, 'pca__explained_variance': 0.95}
roc-auc=0.612, est=0.670, params={'lr__C': 0.21544346900318845, 'lr__class_weight': 'balanced', 'pca__apply_PCA': False, 'pca__e
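
The loop above scores log-probabilities rather than probabilities. ROC-AUC depends only on how the samples are ranked, so any strictly increasing transformation of the scores, such as the logarithm, leaves it unchanged. A quick check on made-up labels and scores:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Made-up labels and predicted probabilities in (0, 1)
y_demo = rng.integers(0, 2, size=200)
proba = np.clip(0.5 + 0.2 * (y_demo - 0.5) + 0.25 * rng.standard_normal(200), 0.01, 0.99)

auc_proba = roc_auc_score(y_demo, proba)
auc_log = roc_auc_score(y_demo, np.log(proba))  # same ranking, same ROC-AUC

print(auc_proba, auc_log)
```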

In [11]:
ncv_df = pd.DataFrame()
ncv_df["ROC-AUC"] = roc_results
# ncv_df["AUPRC"] = auprc_results
# ncv_df["Precision"] = prec_results
# ncv_df["Recall"] = rec_results
# ncv_df["f1-score"] = f1_results
ncv_df = pd.concat([ncv_df, pd.DataFrame(found_params)], axis=1)
ncv_df

Unnamed: 0,ROC-AUC,lr__C,lr__class_weight,pca__apply_PCA,pca__explained_variance
0,0.67681,0.215443,balanced,False,0.95
1,0.684902,3.162278,balanced,False,0.95
2,0.663116,46.415888,balanced,False,0.95
3,0.633348,3.162278,balanced,False,0.95
4,0.650987,3.162278,balanced,False,0.95
5,0.660139,0.215443,balanced,False,0.95
6,0.676674,3.162278,balanced,False,0.95
7,0.6968,3.162278,balanced,False,0.95
8,0.68701,3.162278,balanced,False,0.95
9,0.675871,3.162278,balanced,False,0.95


In [12]:
ncv_df["ROC-AUC"].mean()

0.6705656035796703
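
The spread of the outer-fold scores can be summarized from the ROC-AUC column of the table above. The standard error computed here treats the ten folds as independent, an assumption that does not strictly hold because the training sets overlap, so it should be read as a rough guide only.

```python
import numpy as np

# Outer-fold ROC-AUC values copied from the table above
fold_scores = np.array([
    0.67681, 0.684902, 0.663116, 0.633348, 0.650987,
    0.660139, 0.676674, 0.6968, 0.68701, 0.675871,
])

mean_score = fold_scores.mean()
std_score = fold_scores.std()  # population std, as np.std reports by default
sem_score = fold_scores.std(ddof=1) / np.sqrt(len(fold_scores))  # rough standard error

print(f"mean = {mean_score:.4f}, std = {std_score:.4f}, approx. SEM = {sem_score:.4f}")
```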

***
## Scoring the classifier with Nested CV

In [13]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Choose cross-validation techniques for the inner and outer loops
outer_cv = KFold(n_splits=k_outer, shuffle=True, random_state=1)
inner_cv = KFold(n_splits=k_inner, shuffle=True, random_state=2)

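# Define GSCV, then perform Nested CV within cross_validate
# scoring="roc_auc" makes the inner search optimize the same metric as the outer loop;
# without it, GridSearchCV falls back to the estimator's default score (accuracy)
clf = GridSearchCV(estimator=lr_clf, param_grid=param_grid, scoring="roc_auc", cv=inner_cv, verbose=0)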

nested_score = cross_validate(
    clf, X=X_train, y=y_train, cv=outer_cv, scoring="roc_auc", n_jobs=-1, error_score="raise", return_estimator=True
)  # Nested CV

In [14]:
print("The repr below shows only the search setup; the chosen parameters are in the fitted object's best_params_.")
nested_score["estimator"][0]

The repr below shows only the search setup; the chosen parameters are in the fitted object's best_params_.


GridSearchCV(cv=KFold(n_splits=3, random_state=2, shuffle=True),
             estimator=Pipeline(steps=[('scaler', StandardScaler()),
                                       ('pca', ConfounderPCA()),
                                       ('lr',
                                        LogisticRegression(max_iter=2000))]),
             param_grid={'lr__C': array([1.00000000e-03, 1.46779927e-02, 2.15443469e-01, 3.16227766e+00,
       4.64158883e+01, 6.81292069e+02, 1.00000000e+04]),
                         'lr__class_weight': [None, 'balanced'],
                         'pca__apply_PCA': [True, False],
                         'pca__explained_variance': [0.95, 1]})
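
With `return_estimator=True`, each entry of `nested_score["estimator"]` is a fitted `GridSearchCV`, so the configuration chosen in every outer fold can be read from its `best_params_` attribute. A self-contained sketch on synthetic data (the data set and the small grid are assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold, cross_validate

X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

inner = KFold(n_splits=3, shuffle=True, random_state=2)
outer = KFold(n_splits=5, shuffle=True, random_state=1)

search = GridSearchCV(
    LogisticRegression(max_iter=2000),
    param_grid={"C": [0.01, 1.0, 100.0]},
    scoring="roc_auc",
    cv=inner,
)
demo_scores = cross_validate(
    search, X_demo, y_demo, cv=outer, scoring="roc_auc", return_estimator=True
)

# One fitted GridSearchCV per outer fold, each with its own best_params_
chosen = [est.best_params_ for est in demo_scores["estimator"]]
for score, params in zip(demo_scores["test_score"], chosen):
    print(f"outer roc-auc={score:.3f}, params={params}")
```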

In [15]:
pd.DataFrame(nested_score)

Unnamed: 0,fit_time,score_time,estimator,test_score
0,75.714898,0.006126,"GridSearchCV(cv=KFold(n_splits=3, random_state...",0.66944
1,86.273319,0.003089,"GridSearchCV(cv=KFold(n_splits=3, random_state...",0.684435
2,86.812161,0.006231,"GridSearchCV(cv=KFold(n_splits=3, random_state...",0.656482
3,89.430038,0.003019,"GridSearchCV(cv=KFold(n_splits=3, random_state...",0.632464
4,82.242674,0.002966,"GridSearchCV(cv=KFold(n_splits=3, random_state...",0.642849
5,78.770849,0.003235,"GridSearchCV(cv=KFold(n_splits=3, random_state...",0.655038
6,79.912749,0.003374,"GridSearchCV(cv=KFold(n_splits=3, random_state...",0.680601
7,82.88646,0.003237,"GridSearchCV(cv=KFold(n_splits=3, random_state...",0.692401
8,52.986907,0.001928,"GridSearchCV(cv=KFold(n_splits=3, random_state...",0.684556
9,50.394512,0.00172,"GridSearchCV(cv=KFold(n_splits=3, random_state...",0.675933


In [16]:
np.mean(nested_score["test_score"])

0.6674198311326174

In [17]:
np.std(nested_score["test_score"])

0.018857615697145648