# Model Tuning Before Binarization

## Outline

The **MLAging - SVZ all-cell** workflow consists of the following sections:

`30 SVZpreprocessing.R` Data preprocessing and preparation in Seurat.

`311 SVZ All-cell ELN Tuning - Before Binarization` ML model tuning using *non-binarized* HVGs, with hyperparameter selection via `GridSearchCV` (**this notebook**).

`312 SVZ All-cell ELN Tuning - After Binarization` ML model tuning using *binarized* HVGs, with hyperparameter selection via `GridSearchCV`.

`321 SVZ All-cell ELN 10x` Run the best ELN model for both binarized and non-binarized HVGs over 10 random seeds.

`322 SVZ All-cell MLP 10x - Before Binarization` Run the best MLP model for *non-binarized* HVGs over 10 random seeds.

`323 SVZ All-cell MLP 10x - After Binarization` Run the best MLP model for *binarized* HVGs over 10 random seeds.
 
`33 SVZ All-cell Model Result Viz` Result visualization.

`34 SVZ All-cell Stat` Statistical test of whether exercise rejuvenates cells.
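
The tuning step in this notebook relies on the project helper `ML_pipeline_GridSearchCV`, whose source is not shown here. As a rough orientation only, the sketch below runs a small `GridSearchCV` over an elastic-net logistic regression wrapped in a scikit-learn pipeline. The synthetic data, the `StandardScaler` step, the reduced grid, and the built-in `'average_precision'` scoring are all assumptions made for illustration; they are not taken from the project code.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the HVG expression matrix (assumption, not the SVZ data).
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# make_pipeline names each step after its lowercased class name, which is why
# grid keys take the form 'logisticregression__<parameter>'.
pipe = make_pipeline(
    StandardScaler(),  # assumed preprocessing step, for illustration only
    LogisticRegression(penalty='elasticnet', solver='saga', max_iter=5000),
)

# A deliberately small grid so the example runs in seconds.
param_grid = {
    'logisticregression__C': np.logspace(-2, 1, 4),
    'logisticregression__l1_ratio': [0.0, 0.5, 1.0],
}

# 'average_precision' is a built-in stand-in for a precision-recall based score.
grid = GridSearchCV(pipe, param_grid, scoring='average_precision', cv=3)
grid.fit(X, y)

print(grid.best_params_)
print('best CV score:', grid.best_score_)
```

After fitting, `grid.cv_results_` holds one entry per parameter combination (12 here), which is useful for checking whether the selected values sit on the edge of the grid.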

In [1]:
import warnings
warnings.filterwarnings('ignore')

from src.data_processing import *
from src.grid_search import *
import os
import numpy as np
from sklearn.metrics import make_scorer
from sklearn.linear_model import LogisticRegression
from tqdm import tqdm
import pickle

data_type = 'float32'

In [2]:
input_train = '../data/svz_processed/svz_ctl_train_cell_sep3integ_batch1.csv'
input_test = '../data/svz_processed/svz_ctl_test_cell_sep3integ_batch2.csv'
cell_type = 'All'

In [3]:
train_X, train_y, test_X, test_y, custom_cv = data_prep(input_test, input_train,
                                                        cell_type, binarization=False)

Finished data prepration for All


In [4]:
train_X

Unnamed: 0,Acta2,Tagln,Lhfpl3,Tnr,Cd74,Tpm2,Myh11,Vtn,Lyz2,Myl9,...,Brip1os,Has2os,Dennd4a,Rasgrp2,Gabrg1,Etnppl,Galnt6,Fgfbp1,P2ry13,Man2b1
0,-0.139283,-0.144205,-0.160735,-0.180525,-0.163505,-0.204255,-0.146405,-0.289789,-0.325618,-0.218009,...,-0.778954,-0.087277,-1.141469,0.281625,-0.402966,-0.209533,-0.461319,-0.187218,-0.700793,-0.607787
1,-0.161499,-0.152532,-0.264351,-0.131902,-0.436693,-0.144216,-0.215989,-0.192377,0.680627,-0.242194,...,-0.671902,-0.087277,2.104323,-0.328176,-0.400630,-0.182378,-0.392737,-0.038585,1.391814,1.987571
2,-0.170037,-0.139211,-0.194593,-0.208957,-0.163505,-0.212109,-0.146405,-0.230138,0.542996,-0.217335,...,-0.669474,-0.087277,-0.070949,-0.423777,-0.315105,-0.182378,2.008814,-0.038585,-0.496443,-0.554762
3,-0.162868,-0.152532,-0.212335,-0.208957,-0.163505,-0.214269,-0.146405,-0.233218,2.482844,-0.255055,...,-0.584547,-0.087277,0.799594,-0.328176,-0.400630,-0.182378,-0.463811,-0.038585,1.239413,2.136096
4,-0.054546,-0.152532,-0.293807,-0.075603,-0.086520,-0.214269,-0.146405,-0.233218,-0.412121,-0.275808,...,2.358951,-0.087277,-0.593119,-0.271644,2.837719,-2.300888,-0.463811,-0.038585,-0.531021,-0.499732
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
27375,-0.153835,-0.151189,-0.265675,-0.138482,-0.155428,-0.222874,-0.269623,-0.097962,-0.476642,-0.056789,...,-0.137673,-0.087277,-0.873983,1.653014,-0.405477,-0.172411,-0.480619,-0.365729,-0.548349,-0.741556
27376,-0.127015,-0.112593,-0.310051,-0.071968,-0.098989,-0.189908,-0.144402,-0.380448,-0.419913,-0.225337,...,0.013533,-0.061346,-1.005832,-0.147108,-0.412994,2.059286,-0.476231,-0.038585,-0.481077,-0.828175
27377,-0.230289,-0.149019,-0.286573,-0.225392,-0.160687,-0.214269,-0.126135,-0.233218,-0.387934,-0.237072,...,-0.596195,-0.087277,-0.813082,-0.328176,-0.389171,-0.182378,1.940034,-0.038585,-0.516300,-0.517082
27378,-0.162868,-0.152532,-0.212335,-0.208957,-0.163505,-0.214269,-0.146404,-0.233218,-0.426164,-0.255055,...,-0.584547,-0.087277,-0.370983,-0.328176,-0.400630,-0.182378,2.289674,-0.038585,-0.534859,-0.667789


In [5]:
test_X

Unnamed: 0,Acta2,Tagln,Lhfpl3,Tnr,Cd74,Tpm2,Myh11,Vtn,Lyz2,Myl9,...,Brip1os,Has2os,Dennd4a,Rasgrp2,Gabrg1,Etnppl,Galnt6,Fgfbp1,P2ry13,Man2b1
8769,-0.135944,-0.140658,-0.221113,-0.215306,-0.169317,-0.207980,-0.153488,-0.241173,-0.336650,-0.436925,...,-0.999266,-0.099914,0.021423,-0.659265,-0.416445,-0.191825,1.294599,0.178888,-0.388868,0.070537
7881,5.649084,7.028369,-0.303604,-0.164583,-0.162139,5.968231,5.268714,3.503660,-0.423255,4.628437,...,-0.846380,-0.130980,-0.855237,3.043438,-0.406753,-0.175493,-0.469741,-0.053084,-0.503853,-0.536721
563,-0.156415,-0.141813,-0.163067,-0.295101,-0.295051,-0.206570,-0.152707,-0.224450,-0.481059,-0.245889,...,-0.807466,-0.099441,-0.189739,-0.309177,-0.496836,-0.180507,-0.156066,-0.053084,-0.533541,-0.600322
4711,-0.127834,-0.143007,-0.036520,-0.229437,-0.134526,-0.363637,-0.140756,-0.230797,-0.428729,0.641183,...,0.899364,-0.482796,-0.323725,-0.269985,1.282087,4.720711,-0.439624,-0.053084,-0.475611,-0.644790
8247,-0.160208,-0.154213,-0.256902,-0.197361,0.032763,-0.166417,-0.153995,-0.226586,0.265912,-0.231923,...,1.715125,-0.101158,1.829308,-0.337822,-0.337326,-0.180146,-0.466813,-0.053084,1.675312,1.726863
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5735,-0.175128,-0.130475,-0.245290,-0.228210,-0.089603,-0.218539,-0.154105,-0.238089,-0.396905,-0.256435,...,0.261115,-0.100597,-0.291516,-0.343673,-0.390277,-0.175493,1.889772,-0.053084,-0.507538,-0.651051
5192,-0.208008,-0.189768,-0.176190,-0.258886,-0.129250,-0.134111,-0.154105,-0.202624,-0.408454,-0.273241,...,1.481983,-0.100597,-0.209636,-0.550286,-0.318888,0.050814,-0.474867,-0.053084,-0.528556,-0.692622
5391,-0.195319,-0.130083,-0.159926,-0.132716,-0.122992,-0.221522,-0.154024,-0.210064,-0.358639,-0.230875,...,-0.361645,-0.072208,-0.691857,-0.349666,-0.358995,-0.168079,0.586303,-0.053084,-0.519017,-0.586517
861,-0.536029,0.641512,-0.202156,-0.167233,-0.156645,3.692186,2.545490,4.755052,-0.444426,3.079664,...,-0.677007,-0.100597,-0.827437,1.999966,-0.390277,-0.175493,-0.397458,-0.053084,-0.450433,-0.634520


In [6]:
pr_auc_scorer = make_scorer(pr_auc_score, greater_is_better=True,
                            needs_proba=True)
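
`pr_auc_score` is imported from the project modules and its definition is not shown in this notebook. One common way to compute an area under the precision-recall curve is sketched below, assuming trapezoidal integration over the curve returned by `precision_recall_curve`. The name `pr_auc_sketch` is hypothetical, and the project's own function may differ.

```python
import numpy as np
from sklearn.metrics import auc, precision_recall_curve


def pr_auc_sketch(y_true, y_score):
    """Area under the precision-recall curve by trapezoidal integration.

    Hypothetical stand-in for the project's pr_auc_score.
    y_score should hold scores or probabilities for the positive class.
    """
    precision, recall, _ = precision_recall_curve(y_true, y_score)
    # auc() integrates y over x; recall is monotonically decreasing, which auc() accepts.
    return auc(recall, precision)


# Toy labels and positive-class scores.
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])
print('PR AUC:', pr_auc_sketch(y_true, y_score))
```

A function of this shape takes `(y_true, y_score)`, so it can be wrapped with `make_scorer` as in the cell above, provided the scorer is configured to pass predicted probabilities rather than class labels.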

In [7]:
from sklearn.linear_model import LogisticRegression

In [8]:
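# Elastic-net logistic regression; 'saga' is the scikit-learn solver that supports this penalty.
eln = LogisticRegression(penalty='elasticnet', solver='saga', max_iter=10000000)
# Grid: 10 values of C (inverse regularization strength) log-spaced from 1e-4 to 10,
# and 6 values of l1_ratio (5 log-spaced from 1e-3 to 1, plus 0 for a pure L2 penalty).
param_grid = {'logisticregression__C': np.logspace(-4, 1, 10),
              'logisticregression__l1_ratio': np.append(np.logspace(-3, 0, 5), 0)}
# [0.05, 0.1, 0.2, 0.35]
# Repeat the grid search 10 times; i is passed to ML_pipeline_GridSearchCV,
# presumably as the random seed.
models_eln = []
for i in tqdm(range(10)):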
    grid, test_score = ML_pipeline_GridSearchCV(train_X, train_y, test_X, test_y, 
                                                eln, param_grid, i, custom_cv, pr_auc_scorer)
    
    print(grid.best_params_)
    print('best CV score:', grid.best_score_)
    print('test score:',test_score)
    models_eln.append(grid)

 10%|█         | 1/10 [1:07:25<10:06:47, 4045.23s/it]

{'logisticregression__C': 0.0001, 'logisticregression__l1_ratio': 0.1778279410038923}
best CV score: 0.7331900283196932
test score: 0.3984236017816682


 20%|██        | 2/10 [2:18:15<9:15:28, 4166.12s/it] 

{'logisticregression__C': 0.0001, 'logisticregression__l1_ratio': 0.1778279410038923}
best CV score: 0.7331900283196932
test score: 0.3984241455905371


 30%|███       | 3/10 [3:26:45<8:03:02, 4140.37s/it]

{'logisticregression__C': 0.0001, 'logisticregression__l1_ratio': 0.1778279410038923}
best CV score: 0.7331900283196932
test score: 0.3984245922566181


 40%|████      | 4/10 [4:35:58<6:54:30, 4145.09s/it]

{'logisticregression__C': 0.0001, 'logisticregression__l1_ratio': 0.1778279410038923}
best CV score: 0.7331900283196932
test score: 0.3984241184474084


 50%|█████     | 5/10 [5:42:40<5:41:08, 4093.68s/it]

{'logisticregression__C': 0.0001, 'logisticregression__l1_ratio': 0.1778279410038923}
best CV score: 0.7331900283196932
test score: 0.39842561335758575


 60%|██████    | 6/10 [6:51:05<4:33:10, 4097.65s/it]

{'logisticregression__C': 0.0001, 'logisticregression__l1_ratio': 0.1778279410038923}
best CV score: 0.7331900283196932
test score: 0.39842418774703486


 70%|███████   | 7/10 [7:58:01<3:23:31, 4070.66s/it]

{'logisticregression__C': 0.0001, 'logisticregression__l1_ratio': 0.1778279410038923}
best CV score: 0.7331900283196932
test score: 0.3984239610909226


 80%|████████  | 8/10 [9:04:46<2:15:00, 4050.01s/it]

{'logisticregression__C': 0.0001, 'logisticregression__l1_ratio': 0.1778279410038923}
best CV score: 0.7331900283196932
test score: 0.3984241417004951


 80%|████████  | 8/10 [9:51:18<2:27:49, 4434.80s/it]


KeyboardInterrupt: 
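
The run above was interrupted during the ninth of ten iterations, each of which took over an hour, and `models_eln` exists only in memory. A minimal sketch of checkpointing the accumulated results after every iteration is shown below. The helper `save_checkpoint`, the file name, and the placeholder result dictionaries are hypothetical; in the loop above, the fitted `grid` object would be appended instead.

```python
import os
import pickle
import tempfile


def save_checkpoint(obj, path):
    """Pickle obj to path via a temporary file, then rename it into place.

    Writing to a temporary file first means an interrupt during the write
    does not corrupt a previously saved checkpoint.
    """
    tmp_path = path + '.tmp'
    with open(tmp_path, 'wb') as fh:
        pickle.dump(obj, fh)
    os.replace(tmp_path, path)


# Hypothetical location; a temporary directory is used so the example is self-contained.
checkpoint_path = os.path.join(tempfile.mkdtemp(), 'models_eln_checkpoint.pkl')

completed = []
for seed in range(3):
    result = {'seed': seed}  # placeholder for a fitted grid-search object
    completed.append(result)
    save_checkpoint(completed, checkpoint_path)  # persist progress after each iteration

# Reload the checkpoint, as one would after an interrupted run.
with open(checkpoint_path, 'rb') as fh:
    restored = pickle.load(fh)
print(len(restored), 'results restored')
```

Pickling fitted search objects that reference a custom scorer generally requires the scorer's underlying function to be importable when the file is loaded.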