# ELN Tuning After Binarization

## Outline

The **MLAging - SVZ all-cell** workflow consists of the following sections:

`30 SVZpreprocessing.R` Data preprocessing and preparation in Seurat.

`311 SVZ All-cell ELN Tuning - Before Binarization` ML model tuning using *non-binarized* HVGs and hyperparameter selection using `GridSearchCV`.

`312 SVZ All-cell ELN Tuning - After Binarization` ML model tuning using *binarized* HVGs and hyperparameter selection using `GridSearchCV` (**this notebook**).

`321 SVZ All-cell ELN 10x` Run the best ELN model for both binarized and non-binarized HVGs over 10 random seeds.

`322 SVZ All-cell MLP 10x - Before Binarization` Run the best MLP model for *non-binarized* HVGs over 10 random seeds.

`323 SVZ All-cell MLP 10x - After Binarization` Run the best MLP model for *binarized* HVGs over 10 random seeds.
 
`33 SVZ All-cell Model Result Viz` Result visualization.

`34 SVZ All-cell Stat` Statistical test of whether exercise rejuvenates cells.

In [1]:
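import warnings
# Silence all warnings (this also hides solver convergence warnings)
warnings.filterwarnings('ignore')

# Project helpers: data_prep, pr_auc_score and ML_pipeline_GridSearchCV,
# used below, are pulled in by these wildcard imports
from src.data_processing import *
from src.grid_search import *
import os
import numpy as np
from sklearn.metrics import make_scorer
from sklearn.linear_model import LogisticRegression
from tqdm import tqdm
import pickle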

data_type = 'float32'

In [2]:
input_train = '../data/svz_processed/svz_ctl_train_cell_sep3integ_batch1.csv'
input_test = '../data/svz_processed/svz_ctl_test_cell_sep3integ_batch2.csv'
cell_type = 'All'

In [3]:
# binarization=True requests binarized (0/1) HVG features
train_X, train_y, test_X, test_y, custom_cv = data_prep(input_test, input_train,
                                                        cell_type, binarization=True)

Finished data prepration for All
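
The call above requests binarized features, and the `train_X` and `test_X` previews contain only 0.0 and 1.0. The internals of `data_prep` are not shown in this notebook, so the following is only a minimal sketch of one common way to binarize an expression matrix, assuming a gene counts as "on" when its value exceeds a threshold of zero. The helper name `binarize_expression`, the threshold, and the toy values are all hypothetical; the project's own rule may differ.

```python
import pandas as pd


def binarize_expression(expr: pd.DataFrame, threshold: float = 0.0) -> pd.DataFrame:
    """Hypothetical helper: 1.0 where a value exceeds `threshold`, else 0.0."""
    # The comparison yields booleans; casting gives the 0.0/1.0 floats seen in the previews
    return (expr > threshold).astype("float32")


# Toy expression values, made up for illustration (rows are cells, columns are genes)
toy = pd.DataFrame({"Acta2": [0.0, 2.3, 0.0], "Lyz2": [1.1, 0.0, 4.0]})
toy_bin = binarize_expression(toy)
print(toy_bin)
```

Binarizing this way discards expression magnitude and keeps only whether a gene is detected, which is the kind of input the model in this notebook is tuned on.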


In [4]:
train_X

Unnamed: 0,Acta2,Tagln,Lhfpl3,Tnr,Cd74,Tpm2,Myh11,Vtn,Lyz2,Myl9,...,Brip1os,Has2os,Dennd4a,Rasgrp2,Gabrg1,Etnppl,Galnt6,Fgfbp1,P2ry13,Man2b1
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
27375,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
27376,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
27377,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
27378,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0


In [5]:
test_X

Unnamed: 0,Acta2,Tagln,Lhfpl3,Tnr,Cd74,Tpm2,Myh11,Vtn,Lyz2,Myl9,...,Brip1os,Has2os,Dennd4a,Rasgrp2,Gabrg1,Etnppl,Galnt6,Fgfbp1,P2ry13,Man2b1
8769,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0
7881,1.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
563,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4711,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0
8247,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5735,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
5192,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
5391,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
861,0.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0


In [6]:
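# Wrap the project's PR-AUC metric as a scikit-learn scorer that is fed predicted
# probabilities. Newer scikit-learn releases replace needs_proba=True with
# response_method='predict_proba'.
pr_auc_scorer = make_scorer(pr_auc_score, greater_is_better=True,
                            needs_proba=True)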
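
`pr_auc_score` comes from the project's `src` modules and its definition is not shown here. As a rough sketch of what such a metric typically computes, the area under the precision-recall curve can be obtained from scikit-learn's `precision_recall_curve` and `auc`. The function name `pr_auc_sketch` and the toy labels are assumptions for illustration; the project's implementation may differ.

```python
import numpy as np
from sklearn.metrics import auc, precision_recall_curve


def pr_auc_sketch(y_true, y_score):
    """Hypothetical stand-in for a PR-AUC metric (not the project's pr_auc_score)."""
    # Precision and recall at every score threshold
    precision, recall, _ = precision_recall_curve(y_true, y_score)
    # Trapezoidal area with recall on the x-axis and precision on the y-axis
    return auc(recall, precision)


# Toy example: the positive class receives strictly higher scores than the negative class
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.2, 0.8, 0.9])
perfect = pr_auc_sketch(y_true, y_score)
print(perfect)
```

A metric of this kind needs predicted probabilities rather than hard labels, which is why the scorer above is built with `needs_proba=True`.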

In [7]:
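# Elastic-net logistic regression; saga is the scikit-learn solver that supports this penalty
eln = LogisticRegression(penalty='elasticnet', solver='saga', max_iter=10000000)
# 10 values of C x 6 values of l1_ratio (five log-spaced values plus 0, i.e. pure L2)
param_grid = {'logisticregression__C': np.logspace(-4, 1, 10),
              'logisticregression__l1_ratio': np.append(np.logspace(-3, 0, 5), 0)}
# [0.05, 0.1, 0.2, 0.35]
models_eln = []
# Repeat the grid search 10 times; i is passed to the helper, presumably as the random seed
for i in tqdm(range(10)):
    grid, test_score = ML_pipeline_GridSearchCV(train_X, train_y, test_X, test_y,
                                                eln, param_grid, i, custom_cv, pr_auc_scorer)

    print(grid.best_params_)
    print('best CV score:', grid.best_score_)
    print('test score:', test_score)
    models_eln.append(grid)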

 10%|█         | 1/10 [1:44:49<15:43:21, 6289.08s/it]

{'logisticregression__C': 0.016681005372000592, 'logisticregression__l1_ratio': 0.0}
best CV score: 0.8803214726312977
test score: 0.8770979766781516


 20%|██        | 2/10 [3:26:23<13:43:18, 6174.86s/it]

{'logisticregression__C': 0.016681005372000592, 'logisticregression__l1_ratio': 0.0}
best CV score: 0.880319287394568
test score: 0.8770974467661885


 30%|███       | 3/10 [5:13:05<12:12:27, 6278.22s/it]

{'logisticregression__C': 0.016681005372000592, 'logisticregression__l1_ratio': 0.0}
best CV score: 0.8803239160209193
test score: 0.8770988271804703


 40%|████      | 4/10 [7:00:54<10:35:21, 6353.58s/it]

{'logisticregression__C': 0.016681005372000592, 'logisticregression__l1_ratio': 0.0}
best CV score: 0.8803233228548004
test score: 0.8770986835769321


 50%|█████     | 5/10 [8:44:25<8:45:11, 6302.26s/it] 

{'logisticregression__C': 0.016681005372000592, 'logisticregression__l1_ratio': 0.0}
best CV score: 0.8803179766223558
test score: 0.8770983144824592


 60%|██████    | 6/10 [10:11:34<6:35:48, 5937.24s/it]

{'logisticregression__C': 0.016681005372000592, 'logisticregression__l1_ratio': 0.0}
best CV score: 0.8803206282127058
test score: 0.8770975234424655


 60%|██████    | 6/10 [10:38:52<7:05:55, 6388.81s/it]


KeyboardInterrupt: 
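
All six completed runs above select the same point of the grid. As a quick check of how large that grid is and where the selected values sit in it, the sketch below rebuilds the two arrays from the tuning cell and enumerates the combinations with scikit-learn's `ParameterGrid`. The variable names `c_values`, `l1_values`, and `n_candidates` are introduced here for illustration only.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Same arrays as in the tuning cell
c_values = np.logspace(-4, 1, 10)                 # 10 log-spaced values from 1e-4 to 1e1
l1_values = np.append(np.logspace(-3, 0, 5), 0)   # 5 log-spaced values plus 0

param_grid = {'logisticregression__C': c_values,
              'logisticregression__l1_ratio': l1_values}

# Number of hyperparameter combinations evaluated per grid search
n_candidates = len(ParameterGrid(param_grid))
print('candidates per search:', n_candidates)

# The fifth C value is the one reported in best_params_ above
print('C grid index 4:', c_values[4])
print('l1_ratio grid:', l1_values)
```

With `l1_ratio` equal to 0 the elastic-net penalty reduces to a pure L2 penalty, so the configuration selected in every completed run is an L2-regularized logistic regression.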