# ELN tuning for different preprocessing methods - HVGs

## Outline

The **MLAging - preprocessing** workflow consists of the following sections:

`00 preprocessing.R` Data preprocessing and preparation in Seurat.

`011 Preprocessing HEG ELN Tuning` ELN model tuning using highly expressed genes (HEGs) and hyperparameter selection using `GridSearchCV`.

`012 Preprocessing HVG ELN Tuning` ELN model tuning using highly variable genes (HVGs) and hyperparameter selection using `GridSearchCV` -- **this notebook**:

#### [HVG ELN Model Tuning](#6.-HVG)
 - [HVG-lognorm](#7.-hvg_lognorm)
 - [HVG-std](#8.-hvg_std)
 - [HVG-integration](#9-hvg_integrated)
 - [HVG-binarization](#10.-hvg_bin)
 
 
`02 Preprocessing ELN Result 10x` Run the best ELN model over 10 random seeds.

`03 Preprocessing ELN Result Viz` Result visualization.

In [1]:
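import warnings
warnings.filterwarnings('ignore')

import os
import pickle

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer
from sklearn.utils import shuffle
from tqdm import tqdm

# Project helpers used below (e.g. customized_cv_index, binarize_data,
# pr_auc_score, ML_pipeline_GridSearchCV) come from these star imports.
from src.data_processing import *
from src.grid_search import *

data_type = 'float32'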

In [2]:
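def assign_target(input_df):
    """Load an expression CSV and derive a binary target from the cell index."""
    input_df = pd.read_csv(input_df, index_col=0)
    # The last character of each index label is taken as the animal ID.
    input_df['animal'] = input_df.index.str[-1]
    # Animals '3' and '4' form the positive class (target = 1).
    input_df['target'] = input_df['animal'].isin(['3', '4']).astype(int)
    return input_df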
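
To make the labeling rule in `assign_target` concrete, here is a minimal sketch on a made-up table. The cell names and gene columns are hypothetical; the only thing taken from the notebook is the rule that an index label ending in `'3'` or `'4'` receives `target = 1`.

```python
import pandas as pd

# Hypothetical toy expression table. The index labels mimic cell names whose
# last character encodes the animal ID, as assign_target assumes.
toy = pd.DataFrame(
    {"GeneA": [0.0, 1.2, 3.4, 0.5], "GeneB": [2.1, 0.0, 0.7, 1.9]},
    index=["AAAC_1", "AAAG_2", "AACT_3", "AAGT_4"],
)

# Same two steps as assign_target, minus the CSV loading.
toy["animal"] = toy.index.str[-1]
toy["target"] = toy["animal"].isin(["3", "4"]).astype(int)

# The two helper columns are appended last, which is why the notebook
# selects features with iloc[:, :-2].
features = toy.iloc[:, :-2]
print(toy)
```

Because `animal` and `target` are always the last two columns, slicing with `:-2` recovers the gene columns only.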

In [3]:
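def train_test_split(input_train, input_test, binarization=False):
    # NOTE: shares its name with sklearn.model_selection.train_test_split,
    # but only loads the two pre-split CSV files.

    # Test set: drop the trailing 'animal' and 'target' columns, then shuffle.
    df_test = assign_target(input_test)
    test_X = df_test.iloc[:, :-2]
    test_y = df_test.target
    test_X, test_y = shuffle(test_X, test_y, random_state=42)

    # Training set: reset_index() moves the cell index into column 0,
    # so the features span columns 1 to -2.
    df_train = assign_target(input_train)
    train = df_train.reset_index()
    custom_cv = customized_cv_index(train)

    train_X = train.iloc[:, 1:-2]
    train_y = train.target

    if binarization:
        test_X = binarize_data(test_X)
        train_X = binarize_data(train_X)

    return train_X, train_y, test_X, test_y, custom_cv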
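
`binarize_data` comes from `src.data_processing` and its implementation is not shown in this notebook. As a rough illustration only, the sketch below assumes binarization means thresholding each expression value at zero; the function name `binarize_sketch` and the `threshold` parameter are hypothetical and may differ from the project's actual helper.

```python
import pandas as pd

def binarize_sketch(df: pd.DataFrame, threshold: float = 0.0) -> pd.DataFrame:
    """Hypothetical stand-in for binarize_data: 1 where value > threshold, else 0."""
    return (df > threshold).astype("int8")

# Made-up scaled expression values for two cells and three genes.
expr = pd.DataFrame(
    {"GeneA": [-0.7, 1.3], "GeneB": [0.0, 2.4], "GeneC": [0.2, -1.1]},
    index=["cell_1", "cell_2"],
)
binary = binarize_sketch(expr)
print(binary)
```

Whatever rule the real helper uses, it is applied to the training and test features alike, so both sets end up on the same 0/1 scale.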

In [4]:
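# Scorer on predicted probabilities; a higher PR AUC is better.
# Note: scikit-learn 1.4+ deprecates `needs_proba` in favor of response_method='predict_proba'.
pr_auc_scorer = make_scorer(pr_auc_score, greater_is_better=True,
                            needs_proba=True)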
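
`pr_auc_score` is provided by the project's star imports and is not defined in this notebook. A minimal sketch of how such a metric is commonly computed, assuming it is the area under the precision-recall curve, could look as follows; the name `pr_auc_score_sketch` is hypothetical and the project's version may differ.

```python
from sklearn.metrics import auc, precision_recall_curve

def pr_auc_score_sketch(y_true, y_score):
    """Area under the precision-recall curve (trapezoidal rule)."""
    precision, recall, _ = precision_recall_curve(y_true, y_score)
    return auc(recall, precision)

# A perfectly separable toy example: both positives outrank both negatives.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.2, 0.8, 0.9]
perfect = pr_auc_score_sketch(y_true, y_score)
print(perfect)
```

The metric needs continuous scores rather than hard class labels, which is why the scorer is configured to pass predicted probabilities.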


## ELN model tuning for the top 2k highly variable genes <a name="6.-HVG"></a>

### 1) HVG - log-normalized  <a name="7.-hvg_lognorm"></a>

In [6]:
input_train = '../data/train_hvg2k_lognorm_intersect.csv'
input_test = '../data/test_hvg2k_lognorm_intersect.csv'

In [7]:
train_X, train_y, test_X, test_y, custom_cv = train_test_split(input_train, input_test, binarization=False)

In [14]:
eln = LogisticRegression(penalty='elasticnet', solver='saga', max_iter=10000000)
param_grid = {'logisticregression__C': np.logspace(-2, 2, 10),
             'logisticregression__l1_ratio': np.logspace(-3, 0, 6)}
models_eln = []
for i in tqdm(range(3)):
    grid, test_score = ML_pipeline_GridSearchCV(train_X, train_y, test_X, test_y, 
                                                eln, param_grid, i, custom_cv, pr_auc_scorer)
    
    print(grid.best_params_)
    print('best CV score:', grid.best_score_)
    print('test score:',test_score)
    models_eln.append(grid)

 33%|███▎      | 1/3 [1:18:58<2:37:57, 4738.85s/it]

{'logisticregression__C': 100.0, 'logisticregression__l1_ratio': 1.0}
best CV score: 0.570585246467588
test score: 0.7207469897663866


 67%|██████▋   | 2/3 [2:38:01<1:19:01, 4741.26s/it]

{'logisticregression__C': 100.0, 'logisticregression__l1_ratio': 1.0}
best CV score: 0.5705849468730947
test score: 0.7207472743691319


100%|██████████| 3/3 [3:56:55<00:00, 4738.40s/it]  

{'logisticregression__C': 100.0, 'logisticregression__l1_ratio': 1.0}
best CV score: 0.5705849196511922
test score: 0.7207473414535435
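
`ML_pipeline_GridSearchCV` is a project helper whose body is not shown here. The `logisticregression__` prefix in the parameter grid matches the step name that `make_pipeline` assigns to a `LogisticRegression`, so the sketch below assumes a pipeline wrapped in `GridSearchCV`. The `StandardScaler` step, the leave-one-animal-out splits, the synthetic data, and the reduced grid are all assumptions made for illustration, not the project's implementation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, LeaveOneGroupOut
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the expression matrix and labels.
X, y = make_classification(n_samples=120, n_features=20, random_state=0)
groups = np.repeat([1, 2, 3, 4], 30)  # hypothetical animal IDs

# GridSearchCV accepts an iterable of (train_idx, val_idx) pairs as `cv`;
# here each validation fold holds out one hypothetical animal.
cv_splits = list(LeaveOneGroupOut().split(X, y, groups))

pipe = make_pipeline(
    StandardScaler(),  # assumed preprocessing step
    LogisticRegression(penalty="elasticnet", solver="saga",
                       max_iter=5000, random_state=0),
)
# Deliberately small grid so the sketch runs in seconds.
param_grid = {
    "logisticregression__C": np.logspace(-2, 2, 3),
    "logisticregression__l1_ratio": [0.1, 0.5, 1.0],
}
search = GridSearchCV(pipe, param_grid, scoring="average_precision", cv=cv_splits)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

Passing explicit index pairs keeps the validation folds under the caller's control, which is presumably the role of `custom_cv` in the cells above.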





### 2) HVG - log-normalized  + scaled <a name="8.-hvg_std"></a>

In [15]:
input_train = '../data/train_hvg2k_std_intersect.csv'
input_test = '../data/test_hvg2k_std_intersect.csv'

In [16]:
train_X, train_y, test_X, test_y, custom_cv = train_test_split(input_train, input_test, binarization=False)

In [17]:
eln = LogisticRegression(penalty='elasticnet', solver='saga', max_iter=10000000)

param_grid = {'logisticregression__C': np.logspace(-2, 2, 10),
             'logisticregression__l1_ratio': np.logspace(-3, 0, 6)}
models_eln = []
for i in tqdm(range(3)):
    grid, test_score = ML_pipeline_GridSearchCV(train_X, train_y, test_X, test_y, 
                                                eln, param_grid, i, custom_cv, pr_auc_scorer)
    
    print(grid.best_params_)
    print('best CV score:', grid.best_score_)
    print('test score:',test_score)
    models_eln.append(grid)

 33%|███▎      | 1/3 [41:18<1:22:37, 2478.86s/it]

{'logisticregression__C': 0.01, 'logisticregression__l1_ratio': 0.0630957344480193}
best CV score: 0.5719506435814674
test score: 0.7285211656167588


 67%|██████▋   | 2/3 [1:22:37<41:18, 2478.65s/it]

{'logisticregression__C': 0.01, 'logisticregression__l1_ratio': 0.0630957344480193}
best CV score: 0.5719522738697693
test score: 0.7285214181381353


100%|██████████| 3/3 [2:03:58<00:00, 2479.53s/it]

{'logisticregression__C': 0.01, 'logisticregression__l1_ratio': 0.0630957344480193}
best CV score: 0.5719510537064791
test score: 0.7285205797900898
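
The values printed in `best_params_` are points on the two logarithmic grids defined in `param_grid`. The short sketch below simply enumerates those grids; the variable names are illustrative.

```python
import numpy as np

# Same grids as in param_grid.
C_grid = np.logspace(-2, 2, 10)   # inverse regularization strength, 0.01 to 100
l1_grid = np.logspace(-3, 0, 6)   # elastic-net mixing, 0.001 to 1 (1 = pure L1)

n_candidates = len(C_grid) * len(l1_grid)

print(np.round(C_grid, 4))
print(np.round(l1_grid, 4))
print("candidates per search:", n_candidates)
```

In scikit-learn, a smaller `C` means stronger regularization, and `l1_ratio = 1` corresponds to a pure L1 penalty, so a selection at either end of a grid indicates that the optimum may lie on or beyond the grid boundary.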





### 3) HVG - log-normalized  + scaled + integrated <a name="9-hvg_integrated"></a>

In [18]:
input_train = '../../../MLAging/data/train_hvg2k_std_integrated.csv'
input_test = '../../../MLAging/data/test_hvg2k_std_integrated.csv'

In [19]:
train_X, train_y, test_X, test_y, custom_cv = train_test_split(input_train, input_test, binarization=False)

In [20]:
eln = LogisticRegression(penalty='elasticnet', solver='saga', max_iter=10000000)

param_grid = {'logisticregression__C': np.logspace(-2, 2, 10),
             'logisticregression__l1_ratio': np.logspace(-3, 0, 6)}
models_eln = []
for i in tqdm(range(3)):
    grid, test_score = ML_pipeline_GridSearchCV(train_X, train_y, test_X, test_y, 
                                                eln, param_grid, i, custom_cv, pr_auc_scorer)
    
    print(grid.best_params_)
    print('best CV score:', grid.best_score_)
    print('test score:',test_score)
    models_eln.append(grid)

 33%|███▎      | 1/3 [14:43<29:26, 883.38s/it]

{'logisticregression__C': 0.01, 'logisticregression__l1_ratio': 1.0}
best CV score: 0.5277449441700237
test score: 0.6977447142194114


 67%|██████▋   | 2/3 [29:28<14:44, 884.20s/it]

{'logisticregression__C': 0.01, 'logisticregression__l1_ratio': 1.0}
best CV score: 0.5277448927049289
test score: 0.6977438552916828


100%|██████████| 3/3 [44:17<00:00, 885.76s/it]

{'logisticregression__C': 0.01, 'logisticregression__l1_ratio': 1.0}
best CV score: 0.5277443321006587
test score: 0.6977452605024097





### 4) HVG - log-normalized + scaled + integrated + binarized <a name="10.-hvg_bin"></a>

In [23]:
# Reuses input_train / input_test from section 3 (integrated data), now binarized.
train_X, train_y, test_X, test_y, custom_cv = train_test_split(input_train, input_test, binarization=True)

In [24]:
eln = LogisticRegression(penalty='elasticnet', solver='saga', max_iter=10000000)

param_grid = {'logisticregression__C': np.logspace(-2, 2, 10),
             'logisticregression__l1_ratio': np.logspace(-3, 0, 6)}
models_eln = []
for i in tqdm(range(3)):
    grid, test_score = ML_pipeline_GridSearchCV(train_X, train_y, test_X, test_y, 
                                                eln, param_grid, i, custom_cv, pr_auc_scorer)
    
    print(grid.best_params_)
    print('best CV score:', grid.best_score_)
    print('test score:',test_score)
    models_eln.append(grid)

 33%|███▎      | 1/3 [1:05:03<2:10:07, 3903.51s/it]

{'logisticregression__C': 0.027825594022071243, 'logisticregression__l1_ratio': 0.015848931924611134}
best CV score: 0.9755697526833953
test score: 0.9674268126301244


 67%|██████▋   | 2/3 [2:10:13<1:05:07, 3907.32s/it]

{'logisticregression__C': 0.027825594022071243, 'logisticregression__l1_ratio': 0.015848931924611134}
best CV score: 0.9755700544530364
test score: 0.9674311401834192


100%|██████████| 3/3 [3:17:15<00:00, 3945.08s/it]  

{'logisticregression__C': 0.027825594022071243, 'logisticregression__l1_ratio': 0.015848931924611134}
best CV score: 0.9755704713033182
test score: 0.9674272931931669



