# ELN tuning for different preprocessing methods - HEGs

## Outline

The **MLAging - preprocessing** workflow consists of the following sections:

`00 preprocessing.R` Data preprocessing and preparation in Seurat.

`011 Preprocessing HEG ELN Tuning` ELN model tuning using highly expressed genes (HEGs) and hyperparameter selection using `GridSearchCV` -- **this notebook**:

#### [HEG ELN Model Tuning](#1.-HEG)
- [HEG-lognorm](#2.-heg_lognorm)
- [HEG-std](#3.-heg_std)
- [HEG-integration](#4.-heg_integrated)
- [HEG-binarization](#5.-heg_bin)
    
`012 Preprocessing HVG ELN Tuning` ELN model tuning using highly variable genes (HVGs) and hyperparameter selection using `GridSearchCV`.

`02 Preprocessing ELN Result 10x` Run the best ELN model over 10 random seeds.

`03 Preprocessing ELN Result Viz` Result visualization.

In [8]:
import warnings
warnings.filterwarnings('ignore')

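# Project helpers (customized_cv_index, binarize_data, pr_auc_score,
# ML_pipeline_GridSearchCV) come from these star imports.
from src.data_processing import *
from src.grid_search import *
import os
import pickle

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer
from sklearn.utils import shuffle
from tqdm import tqdm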

data_type = 'float32'

In [2]:
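def assign_target(input_df):
    """Load an expression CSV and label each row from its animal ID."""
    input_df = pd.read_csv(input_df, index_col=0)
    # The last character of each row index is the animal ID.
    input_df['animal'] = input_df.index.str[-1]
    # Animals 3 and 4 form the positive class (target = 1).
    input_df['target'] = input_df['animal'].isin(['3', '4']).astype(int)
    return input_df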
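
The labeling rule in `assign_target` can be checked on a small in-memory frame. The sketch below is illustrative only: the barcode strings and the `GeneA` column are made up, and it assumes the row index ends in a single-character animal ID, as the function above does.

```python
import pandas as pd

# Hypothetical barcodes: the trailing character is the animal ID.
toy = pd.DataFrame(
    {"GeneA": [0.0, 1.2, 0.4, 2.1]},
    index=["AAAC-1", "AAAG-2", "AAAT-3", "AACA-4"],
)

# Same rule as assign_target: animals 3 and 4 are the positive class.
toy["animal"] = toy.index.str[-1]
toy["target"] = toy["animal"].isin(["3", "4"]).astype(int)
print(toy)
```

The two helper columns end up last in the frame, which is why the feature matrices are later taken with `iloc[:, :-2]`.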

In [3]:
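def train_test_split(input_train, input_test, binarization=False):
    """Load labeled train/test CSVs and return features, targets, and CV folds."""
    # Test set: drop the trailing 'animal' and 'target' columns, then shuffle.
    df_test = assign_target(input_test)
    test_X = df_test.iloc[:,:-2]
    test_y = df_test.target
    test_X, test_y = shuffle(test_X, test_y, random_state=42)

    # Training set: reset the index so the CV folds use positional row indices.
    df_train = assign_target(input_train)
    train = df_train.reset_index()
    custom_cv = customized_cv_index(train)

    # Column 0 is the old index; the last two columns are 'animal' and 'target'.
    train_X = train.iloc[:,1:-2]
    train_y = train.target

    if binarization:
        test_X = binarize_data(test_X)
        train_X = binarize_data(train_X)

    return train_X, train_y, test_X, test_y, custom_cv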
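
`customized_cv_index` lives in `src` and is not shown in this notebook. One plausible reading, and only an assumption, is that it builds folds by animal so that cells from one animal never sit on both sides of a split, while each validation fold still contains both classes. A minimal sketch of that idea on made-up data (the pairing scheme and all names are hypothetical):

```python
from itertools import product

import numpy as np

# Made-up layout: 8 cells from 4 animals, two cells per animal.
animal = np.array(["1", "1", "2", "2", "3", "3", "4", "4"])
neg_animals, pos_animals = ["1", "2"], ["3", "4"]  # target 0 vs. target 1

# Hold out one animal from each class per fold (assumed scheme).
custom_cv = []
for neg, pos in product(neg_animals, pos_animals):
    val_mask = np.isin(animal, [neg, pos])
    train_idx = np.where(~val_mask)[0]
    val_idx = np.where(val_mask)[0]
    custom_cv.append((train_idx, val_idx))

for train_idx, val_idx in custom_cv:
    print("train rows:", train_idx, "validation rows:", val_idx)
```

A list of `(train_indices, validation_indices)` tuples like this can be passed directly as the `cv` argument of `GridSearchCV`.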

In [4]:
pr_auc_scorer = make_scorer(pr_auc_score, greater_is_better=True,
                            needs_proba=True)
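
`pr_auc_score` is imported from `src` and its body is not shown here. A common way to compute the area under the precision-recall curve, offered as an assumption about what the helper does rather than its actual code, is to integrate scikit-learn's `precision_recall_curve` with `auc`. The name `pr_auc_sketch` is hypothetical.

```python
from sklearn.metrics import auc, precision_recall_curve


def pr_auc_sketch(y_true, y_score):
    """Area under the precision-recall curve via trapezoidal integration."""
    precision, recall, _ = precision_recall_curve(y_true, y_score)
    return auc(recall, precision)


y_true = [0, 0, 1, 1]
perfect = pr_auc_sketch(y_true, [0.1, 0.2, 0.8, 0.9])  # positives ranked first
mixed = pr_auc_sketch(y_true, [0.1, 0.8, 0.2, 0.9])    # one negative ranked high
print(perfect, mixed)
```

Recent scikit-learn releases deprecate the `needs_proba` argument of `make_scorer` in favor of `response_method`, so the scorer cell above may need adjusting depending on the installed version.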

## ELN model tuning for the top2k HEGs <a name="1.-HEG"></a>

### 1) HEG - log-normalized <a name="2.-heg_lognorm"></a>

In [5]:
input_train = '../data/train_heg2k_lognorm_intersect.csv'
input_test = '../data/test_heg2k_lognorm_intersect.csv'

In [6]:
train_X, train_y, test_X, test_y, custom_cv = train_test_split(input_train, input_test, binarization=False)

In [7]:
eln = LogisticRegression(penalty='elasticnet', solver='saga', max_iter=10000000)
# Other hyperparameter grids tested:
# param_grid = {'logisticregression__C': np.logspace(-4, 4, 10),
#              'logisticregression__l1_ratio': np.logspace(-3, 0, 5)}
# [0.05, 0.1, 0.2, 0.35]

param_grid = {'logisticregression__C': np.logspace(-2, 2, 10),
             'logisticregression__l1_ratio': np.logspace(-3, 0, 6)}
models_eln = []
for i in tqdm(range(3)):
    grid, test_score = ML_pipeline_GridSearchCV(train_X, train_y, test_X, test_y, 
                                                eln, param_grid, i, custom_cv, pr_auc_scorer)
    
    print(grid.best_params_)
    print('best CV score:', grid.best_score_)
    print('test score:',test_score)
    models_eln.append(grid)

 33%|███▎      | 1/3 [2:42:06<5:24:13, 9726.80s/it]

{'logisticregression__C': 0.01, 'logisticregression__l1_ratio': 1.0}
best CV score: 0.690676182858998
test score: 0.6712988646865272


 67%|██████▋   | 2/3 [6:39:27<3:26:21, 12381.89s/it]

{'logisticregression__C': 0.01, 'logisticregression__l1_ratio': 1.0}
best CV score: 0.6906209515897587
test score: 0.6712998613287319


100%|██████████| 3/3 [10:42:11<00:00, 12843.90s/it] 

{'logisticregression__C': 0.01, 'logisticregression__l1_ratio': 1.0}
best CV score: 0.6906515079431652
test score: 0.6712992375756468
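
The `logisticregression__` prefix in `param_grid` implies that `ML_pipeline_GridSearchCV` wraps the estimator in a scikit-learn `Pipeline` whose step is named `logisticregression`, which is the name `make_pipeline` assigns automatically. The helper itself is not shown, so the following is only a sketch of that pattern on synthetic data. The `StandardScaler` step, the built-in `average_precision` scorer, and the 3-fold CV are assumptions, not the project's settings.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the expression matrix.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# make_pipeline names each step after its lowercased class name.
pipe = make_pipeline(
    StandardScaler(),  # assumed step; the real pipeline may differ
    LogisticRegression(penalty="elasticnet", solver="saga",
                       max_iter=5000, random_state=0),
)

# Keys follow the "<step name>__<parameter>" convention.
param_grid = {
    "logisticregression__C": [0.01, 1.0],
    "logisticregression__l1_ratio": [0.1, 1.0],
}

grid = GridSearchCV(pipe, param_grid, scoring="average_precision", cv=3)
grid.fit(X, y)
print(grid.best_params_)
print("best CV score:", grid.best_score_)
```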





### 2) HEG - log-normalized + scaled<a name="3.-heg_std"></a>

In [9]:
input_train = '../data/train_heg2k_std_intersect.csv'
input_test = '../data/test_heg2k_std_intersect.csv'

In [10]:
train_X, train_y, test_X, test_y, custom_cv = train_test_split(input_train, input_test, binarization=False)

In [11]:
eln = LogisticRegression(penalty='elasticnet', solver='saga', max_iter=10000000)
# param_grid = {'logisticregression__C': np.logspace(-4, 4, 10),
#              'logisticregression__l1_ratio': np.logspace(-3, 0, 5)}
param_grid = {'logisticregression__C': np.logspace(-2, 2, 10),
             'logisticregression__l1_ratio': np.logspace(-3, 0, 6)}
# [0.05, 0.1, 0.2, 0.35]
models_eln = []
for i in tqdm(range(3)):
    grid, test_score = ML_pipeline_GridSearchCV(train_X, train_y, test_X, test_y, 
                                                eln, param_grid, i, custom_cv, pr_auc_scorer)
    
    print(grid.best_params_)
    print('best CV score:', grid.best_score_)
    print('test score:',test_score)
    models_eln.append(grid)

 33%|███▎      | 1/3 [1:15:48<2:31:37, 4548.98s/it]

{'logisticregression__C': 0.01, 'logisticregression__l1_ratio': 1.0}
best CV score: 0.6703216336692734
test score: 0.6728107688645167


 67%|██████▋   | 2/3 [2:31:42<1:15:51, 4551.44s/it]

{'logisticregression__C': 0.01, 'logisticregression__l1_ratio': 1.0}
best CV score: 0.6703241709145165
test score: 0.6728116909753354


100%|██████████| 3/3 [3:47:33<00:00, 4551.13s/it]  

{'logisticregression__C': 0.01, 'logisticregression__l1_ratio': 1.0}
best CV score: 0.6703240555084935
test score: 0.6728204947208725
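
In the runs above, every seed selects `C = 0.01` and `l1_ratio = 1.0`. Listing the grid values shows that both sit on an edge of the search space: the smallest `C` (strongest regularization) and the largest `l1_ratio` (pure L1 penalty). This small check uses only the grid definitions and the printed best parameters; the variable names are illustrative.

```python
import numpy as np

C_grid = np.logspace(-2, 2, 10)
l1_grid = np.logspace(-3, 0, 6)
best = {"C": 0.01, "l1_ratio": 1.0}  # as printed by the runs above

print("C grid:       ", np.round(C_grid, 4))
print("l1_ratio grid:", np.round(l1_grid, 4))
print("combinations: ", C_grid.size * l1_grid.size)

# Does the selected point sit on a boundary of the grid?
at_lower_C = bool(np.isclose(best["C"], C_grid.min()))
at_upper_l1 = bool(np.isclose(best["l1_ratio"], l1_grid.max()))
print("C at lower edge:", at_lower_C, "| l1_ratio at upper edge:", at_upper_l1)
```

A boundary optimum usually suggests extending the grid in that direction. The commented-out `np.logspace(-4, 4, 10)` grid for `C` covers smaller values and is listed among the alternatives that were tested.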





### 3) HEG - log-normalized + scaled + integrated <a name="4.-heg_integrated"></a>

In [12]:
input_train = '../data/train_heg2k_std_integrated.csv'
input_test = '../data/test_heg2k_std_integrated.csv'

In [13]:
train_X, train_y, test_X, test_y, custom_cv = train_test_split(input_train, input_test, binarization=False)

In [14]:
eln = LogisticRegression(penalty='elasticnet', solver='saga', max_iter=10000000)
# param_grid = {'logisticregression__C': np.logspace(-4, 4, 10),
#              'logisticregression__l1_ratio': np.logspace(-3, 0, 5)}
param_grid = {'logisticregression__C': np.logspace(-2, 2, 10),
             'logisticregression__l1_ratio': np.logspace(-3, 0, 6)}
# [0.05, 0.1, 0.2, 0.35]
models_eln = []
for i in tqdm(range(3)):
    grid, test_score = ML_pipeline_GridSearchCV(train_X, train_y, test_X, test_y, 
                                                eln, param_grid, i, custom_cv, pr_auc_scorer)
    
    print(grid.best_params_)
    print('best CV score:', grid.best_score_)
    print('test score:',test_score)
    models_eln.append(grid)

 33%|███▎      | 1/3 [05:36<11:12, 336.11s/it]

{'logisticregression__C': 0.01, 'logisticregression__l1_ratio': 1.0}
best CV score: 0.5265876990684155
test score: 0.6888078006294425


 67%|██████▋   | 2/3 [11:13<05:36, 336.76s/it]

{'logisticregression__C': 0.01, 'logisticregression__l1_ratio': 1.0}
best CV score: 0.5265869236657875
test score: 0.6888062319276776


100%|██████████| 3/3 [16:43<00:00, 334.45s/it]

{'logisticregression__C': 0.01, 'logisticregression__l1_ratio': 1.0}
best CV score: 0.5265870582412444
test score: 0.6888062161805402





### 4) HEG - log-normalized + scaled + integrated + binarized <a name="5.-heg_bin"></a>

In [16]:
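# No new input paths are set in this section as exported: input_train and
# input_test still point to the integrated files above; only binarization differs.
train_X, train_y, test_X, test_y, custom_cv = train_test_split(input_train, input_test, binarization=True)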

In [17]:
eln = LogisticRegression(penalty='elasticnet', solver='saga', max_iter=10000000)
# param_grid = {'logisticregression__C': np.logspace(-4, 4, 10),
#              'logisticregression__l1_ratio': np.logspace(-3, 0, 5)}
param_grid = {'logisticregression__C': np.logspace(-2, 2, 10),
             'logisticregression__l1_ratio': np.logspace(-3, 0, 6)}
# [0.05, 0.1, 0.2, 0.35]
models_eln = []
for i in tqdm(range(3)):
    grid, test_score = ML_pipeline_GridSearchCV(train_X, train_y, test_X, test_y, 
                                                eln, param_grid, i, custom_cv, pr_auc_scorer)
    
    print(grid.best_params_)
    print('best CV score:', grid.best_score_)
    print('test score:',test_score)
    models_eln.append(grid)

 33%|███▎      | 1/3 [12:58<25:56, 778.04s/it]

{'logisticregression__C': 0.01, 'logisticregression__l1_ratio': 1.0}
best CV score: 0.5202947064158397
test score: 0.607951176367699


 67%|██████▋   | 2/3 [26:01<13:01, 781.15s/it]

{'logisticregression__C': 0.01, 'logisticregression__l1_ratio': 1.0}
best CV score: 0.5202909914637522
test score: 0.6079509180657812


100%|██████████| 3/3 [39:03<00:00, 781.31s/it]

{'logisticregression__C': 0.01, 'logisticregression__l1_ratio': 1.0}
best CV score: 0.520293493011841
test score: 0.6079511119121208
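
To compare the four preprocessing variants side by side, the printed scores can be collected into one table. The numbers below are the first-seed values from the outputs above, rounded to four decimals. The `gap` column (test minus CV) is an added convenience, not something the notebook computes.

```python
import pandas as pd

# First-seed scores copied from the cell outputs above (rounded).
summary = pd.DataFrame(
    {
        "best_cv": [0.6907, 0.6703, 0.5266, 0.5203],
        "test": [0.6713, 0.6728, 0.6888, 0.6080],
    },
    index=["HEG-lognorm", "HEG-std", "HEG-integration", "HEG-binarization"],
)
summary["gap"] = (summary["test"] - summary["best_cv"]).round(4)
print(summary)
```

The integrated variant has the highest test score but also the largest gap between CV and test scores, so its CV estimate is the least consistent with test performance among the four.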



