# The SVZ ElasticNet tuning (cell type-specific models)

## Outline

The **MLAging - SVZ cell type** workflow consists of the following sections:

`30 SVZpreprocessing.R` Data preprocessing and preparation in Seurat.

`41 SVZ Cell Type ELN Tuning` ELN model tuning using *non-binarized* and *binarized* HVGs, with hyperparameter selection using `GridSearchCV` (**this notebook**):

1. [Data Preparation](#1.-prep)
2. [Cell Types](#2.-celltypes)
    - [Microglia](#3.-Microglia)
    - [Astrocyte_qNSC](#4.-Astrocyte_qNSC)
    - [Endothelial](#5.-Endothelial)
    - [Neuroblast](#6.-Neuroblast)
    - [Oligodendro](#7.-Oligodendro)
    - [aNSC_NPC](#8.-aNSC_NPC)
    - [Mural](#9.-Mural)

`42 SVZ Cell Type ELN 10x` Run the best ELN model for both binarized and non-binarized HVGs over 10 random seeds.

`43 SVZ Cell Type ELN Result Viz` Result visualization.

`44 SVZ Cell Type Stat` Statistical test of whether exercise rejuvenates cells.

In [2]:
import warnings
warnings.filterwarnings('ignore')

from src.data_processing import *
from src.grid_search import *
import os
import numpy as np
from sklearn.metrics import make_scorer
from sklearn.linear_model import LogisticRegression
from tqdm import tqdm
import pickle

data_type = 'float32'

## 1. Data Preparation <a name="1.-prep"></a>
### Load training and testing batches

In [3]:
input_train = '../data/svz_processed/svz_ctl_train_cell_sep3integ_batch1.csv'
input_test = '../data/svz_processed/svz_ctl_test_cell_sep3integ_batch2.csv'

In [4]:
pr_auc_scorer = make_scorer(pr_auc_score, greater_is_better=True,
                            needs_proba=True)
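
`pr_auc_score` is not defined in this notebook; it presumably comes from `src.grid_search`. As a rough illustration of what such a metric could look like, the sketch below computes the area under the precision-recall curve with the trapezoidal rule. The function name `pr_auc_score_sketch` and the choice of trapezoidal integration are assumptions, not the project's actual implementation.

```python
import numpy as np
from sklearn.metrics import auc, precision_recall_curve


def pr_auc_score_sketch(y_true, y_score):
    """Hypothetical stand-in for `pr_auc_score`: trapezoidal area under the PR curve."""
    precision, recall, _ = precision_recall_curve(y_true, y_score)
    # `recall` is monotonically decreasing, which `auc` accepts as the x-axis.
    return auc(recall, precision)


# A perfectly separating score vector yields an area of 1.0.
y_true = np.array([0, 0, 1, 1])
perfect_scores = np.array([0.1, 0.2, 0.8, 0.9])
perfect_area = pr_auc_score_sketch(y_true, perfect_scores)

# A noisier score vector yields an area between 0 and 1.
noisy_area = pr_auc_score_sketch(np.array([0, 1, 1, 0, 1]),
                                 np.array([0.6, 0.4, 0.9, 0.2, 0.7]))
print(perfect_area, noisy_area)
```

Because the metric needs class probabilities rather than hard labels, the scorer above is built with `needs_proba=True`. Depending on the installed scikit-learn version, that argument may need to be replaced by `response_method='predict_proba'`.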

## 2. Cell type-specific ELN models<a name="2.-celltypes"></a>

### a) Microglia<a name="3.-Microglia"></a>

In [5]:
cell_type = 'Microglia'

train_X, train_y, test_X, test_y, custom_cv = data_prep(input_test, input_train,
                                                        cell_type, binarization=True)

Finished data prepration for Microglia


In [6]:
eln = LogisticRegression(penalty='elasticnet', solver='saga', max_iter=10000000)
param_grid = {'logisticregression__C': np.logspace(-4, 2, 10),
             'logisticregression__l1_ratio': np.append(np.logspace(-3, 0, 10), 0)}
# [0.05, 0.1, 0.2, 0.35]
models_eln = []
for i in tqdm(range(5)):
    grid, test_score = ML_pipeline_GridSearchCV(train_X, train_y, test_X, test_y, 
                                                eln, param_grid, i, custom_cv, pr_auc_scorer)
    print(grid.best_params_)
    print('best CV score:', grid.best_score_)
    print('test score:',test_score)
    models_eln.append(grid)

 20%|██        | 1/5 [33:14<2:12:57, 1994.41s/it]

{'logisticregression__C': 1.0, 'logisticregression__l1_ratio': 0.0}
best CV score: 0.9136604283844914
test score: 0.8761702811825283


 40%|████      | 2/5 [1:05:45<1:38:26, 1968.81s/it]

{'logisticregression__C': 1.0, 'logisticregression__l1_ratio': 0.0}
best CV score: 0.9136828418906111
test score: 0.8762039489579435


 60%|██████    | 3/5 [1:38:16<1:05:21, 1960.89s/it]

{'logisticregression__C': 1.0, 'logisticregression__l1_ratio': 0.0}
best CV score: 0.913638420719382
test score: 0.876197433095826


 80%|████████  | 4/5 [2:10:59<32:41, 1961.42s/it]  

{'logisticregression__C': 1.0, 'logisticregression__l1_ratio': 0.0}
best CV score: 0.9136896879264818
test score: 0.8761890701427045


100%|██████████| 5/5 [2:44:54<00:00, 1978.94s/it]

{'logisticregression__C': 21.54434690031882, 'logisticregression__l1_ratio': 0.0}
best CV score: 0.9137353668481561
test score: 0.871684917535416
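
The `logisticregression__` prefix on the parameter names suggests that `ML_pipeline_GridSearchCV` wraps the estimator in a scikit-learn pipeline before running `GridSearchCV`. The function itself lives in `src.grid_search` and is not shown here, so the following is only a minimal sketch under stated assumptions: the name `grid_search_sketch`, the `StandardScaler` step, the use of the loop index as `random_state`, the built-in `"average_precision"` scorer, and the synthetic data are all hypothetical stand-ins.

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def grid_search_sketch(train_X, train_y, test_X, test_y,
                       estimator, param_grid, seed, cv, scoring):
    """Hypothetical analogue of ML_pipeline_GridSearchCV (assumed behaviour)."""
    # Assumption: the loop index seeds the estimator; clone to avoid mutating the input.
    est = clone(estimator).set_params(random_state=seed)
    # make_pipeline names the step 'logisticregression', matching the grid keys.
    pipe = make_pipeline(StandardScaler(), est)
    grid = GridSearchCV(pipe, param_grid, scoring=scoring, cv=cv)
    grid.fit(train_X, train_y)
    # With a single scorer, GridSearchCV.score evaluates the refit best model.
    test_score = grid.score(test_X, test_y)
    return grid, test_score


# Small synthetic problem so the sketch runs in seconds (not the SVZ data).
X, y = make_classification(n_samples=200, n_features=20, random_state=0)
train_X, test_X, train_y, test_y = X[:150], X[150:], y[:150], y[150:]

eln_demo = LogisticRegression(penalty='elasticnet', solver='saga',
                              l1_ratio=0.5, max_iter=5000)
small_grid = {'logisticregression__C': [0.1, 1.0],
              'logisticregression__l1_ratio': [0.0, 0.5, 1.0]}
cv_demo = StratifiedKFold(n_splits=3)  # stand-in for `custom_cv`

grid, test_score = grid_search_sketch(train_X, train_y, test_X, test_y,
                                      eln_demo, small_grid, 0, cv_demo,
                                      "average_precision")
print(grid.best_params_, grid.best_score_, test_score)
```

If the seed only affects the solver, as assumed here, near-identical scores across seeds would be expected, which matches the small run-to-run differences printed above.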





### b) Astrocyte_qNSC <a name="4.-Astrocyte_qNSC"></a>

In [7]:
cell_type = 'Astrocyte_qNSC'

train_X, train_y, test_X, test_y, custom_cv = data_prep(input_test, input_train,
                                                        cell_type, binarization=True)

Finished data prepration for Astrocyte_qNSC


In [8]:
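# `eln` and `param_grid` are reused from the Microglia cell above;
# the alternative definitions below are left disabled.
# eln = LogisticRegression(penalty='elasticnet', solver='saga', max_iter=10000000)
# param_grid = {'logisticregression__C': np.logspace(-2, 1, 10),
#              'logisticregression__l1_ratio': [0.001, 0.01, 0.05, 0.5, 0.65, 0.8]}
# [0.05, 0.1, 0.2, 0.35]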
models_eln = []
for i in tqdm(range(5)):
    grid, test_score = ML_pipeline_GridSearchCV(train_X, train_y, test_X, test_y, 
                                                eln, param_grid, i, custom_cv, pr_auc_scorer)
    print(grid.best_params_)
    print('best CV score:', grid.best_score_)
    print('test score:',test_score)
    models_eln.append(grid)

 20%|██        | 1/5 [26:22<1:45:29, 1582.32s/it]

{'logisticregression__C': 0.21544346900318823, 'logisticregression__l1_ratio': 0.0}
best CV score: 0.9730987490908909
test score: 0.9555509053798459


 40%|████      | 2/5 [52:50<1:19:17, 1585.85s/it]

{'logisticregression__C': 0.21544346900318823, 'logisticregression__l1_ratio': 0.0}
best CV score: 0.9730770540805702
test score: 0.9554610909794564


 60%|██████    | 3/5 [1:18:56<52:33, 1576.54s/it]

{'logisticregression__C': 0.21544346900318823, 'logisticregression__l1_ratio': 0.0}
best CV score: 0.9730659328483255
test score: 0.9555073667940187


 80%|████████  | 4/5 [1:45:28<26:22, 1582.87s/it]

{'logisticregression__C': 0.21544346900318823, 'logisticregression__l1_ratio': 0.0}
best CV score: 0.9730641371344253
test score: 0.9555061297542334


100%|██████████| 5/5 [2:11:48<00:00, 1581.68s/it]

{'logisticregression__C': 0.21544346900318823, 'logisticregression__l1_ratio': 0.0}
best CV score: 0.9730624332746889
test score: 0.9554862466211005





### c) Endothelial <a name="5.-Endothelial"></a>

In [9]:
cell_type = 'Endothelial'

train_X, train_y, test_X, test_y, custom_cv = data_prep(input_test, input_train,
                                                        cell_type, binarization=True)

Finished data prepration for Endothelial


In [10]:
eln = LogisticRegression(penalty='elasticnet', solver='saga', max_iter=10000000)
# `param_grid` is reused from the Microglia cell above; the definition below is left disabled.
# param_grid = {'logisticregression__C': np.logspace(-4, 2, 10),
#              'logisticregression__l1_ratio': np.append(np.logspace(-3, 0, 10), 0)}
# [0.05, 0.1, 0.2, 0.35]
models_eln = []
for i in tqdm(range(10)):
    grid, test_score = ML_pipeline_GridSearchCV(train_X, train_y, test_X, test_y, 
                                                eln, param_grid, i, custom_cv, pr_auc_scorer)
    print(grid.best_params_)
    print('best CV score:', grid.best_score_)
    print('test score:',test_score)
    models_eln.append(grid)

 10%|█         | 1/10 [14:37<2:11:34, 877.13s/it]

{'logisticregression__C': 1.0, 'logisticregression__l1_ratio': 1.0}
best CV score: 0.7634654181271505
test score: 0.8321033115797022


 20%|██        | 2/10 [29:14<1:56:56, 877.07s/it]

{'logisticregression__C': 1.0, 'logisticregression__l1_ratio': 1.0}
best CV score: 0.7634523411228401
test score: 0.8321253752298888


 30%|███       | 3/10 [43:56<1:42:36, 879.51s/it]

{'logisticregression__C': 1.0, 'logisticregression__l1_ratio': 1.0}
best CV score: 0.7635075743475641
test score: 0.8321856951095539


 40%|████      | 4/10 [58:28<1:27:39, 876.52s/it]

{'logisticregression__C': 1.0, 'logisticregression__l1_ratio': 1.0}
best CV score: 0.763483929833543
test score: 0.8321603131222713


 50%|█████     | 5/10 [1:13:04<1:13:02, 876.49s/it]

{'logisticregression__C': 1.0, 'logisticregression__l1_ratio': 1.0}
best CV score: 0.763481336943802
test score: 0.8321436584251849


 60%|██████    | 6/10 [1:27:52<58:41, 880.30s/it]  

{'logisticregression__C': 1.0, 'logisticregression__l1_ratio': 1.0}
best CV score: 0.7634657994120635
test score: 0.8321183534558554


 70%|███████   | 7/10 [1:42:33<44:01, 880.59s/it]

{'logisticregression__C': 1.0, 'logisticregression__l1_ratio': 1.0}
best CV score: 0.7634456294597916
test score: 0.8320413407966536


 80%|████████  | 8/10 [1:57:22<29:26, 883.03s/it]

{'logisticregression__C': 1.0, 'logisticregression__l1_ratio': 1.0}
best CV score: 0.7635330828285647
test score: 0.8321118775007887


 90%|█████████ | 9/10 [2:12:12<14:45, 885.32s/it]

{'logisticregression__C': 1.0, 'logisticregression__l1_ratio': 1.0}
best CV score: 0.7634941654924507
test score: 0.832176131554705


100%|██████████| 10/10 [2:26:55<00:00, 881.58s/it]

{'logisticregression__C': 1.0, 'logisticregression__l1_ratio': 1.0}
best CV score: 0.7634598280890901
test score: 0.8321117632669636
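
Every Endothelial run selects `l1_ratio = 1.0`, the opposite end of the range from the `0.0` selected for Microglia and Astrocyte_qNSC. In scikit-learn's elastic-net parameterization, `l1_ratio=0` corresponds to a pure L2 penalty and `l1_ratio=1` to a pure L1 penalty. To make the search space concrete, the snippet below enumerates the grid defined in the Microglia cell, assuming the cells were run in order so that this `param_grid` is still the active one. The variable names `C_grid`, `l1_ratio_grid`, and `n_candidates` are introduced here for illustration only.

```python
import numpy as np

# Same expressions as in the Microglia cell's param_grid.
C_grid = np.logspace(-4, 2, 10)
l1_ratio_grid = np.append(np.logspace(-3, 0, 10), 0)

# Number of (C, l1_ratio) combinations evaluated per grid search.
n_candidates = len(C_grid) * len(l1_ratio_grid)

print("C values:       ", np.round(C_grid, 5))
print("l1_ratio values:", np.round(l1_ratio_grid, 5))
print("candidates per search:", n_candidates)
```

The selected values reported in the outputs, such as `C = 1.0` and `C = 21.54434690031882`, are points of this logarithmic grid, and both `0.0` and `1.0` are explicit `l1_ratio` candidates.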





### d) Neuroblast <a name="6.-Neuroblast"></a>

In [11]:
cell_type = 'Neuroblast'

train_X, train_y, test_X, test_y, custom_cv = data_prep(input_test, input_train,
                                                        cell_type, binarization=True)

Finished data prepration for Neuroblast


In [12]:
eln = LogisticRegression(penalty='elasticnet', solver='saga', max_iter=10000000)
# `param_grid` is reused from the Microglia cell above; the definition below is left disabled.
# param_grid = {'logisticregression__C': np.logspace(-4, 2, 10),
#              'logisticregression__l1_ratio': np.append(np.logspace(-3, 0, 5), 0)}
models_eln = []
for i in tqdm(range(10)):
    grid, test_score = ML_pipeline_GridSearchCV(train_X, train_y, test_X, test_y, 
                                                eln, param_grid, i, custom_cv, pr_auc_scorer)
    print(grid.best_params_)
    print('best CV score:', grid.best_score_)
    print('test score:',test_score)
    models_eln.append(grid)

 10%|█         | 1/10 [13:14<1:59:11, 794.56s/it]

{'logisticregression__C': 0.01, 'logisticregression__l1_ratio': 0.0}
best CV score: 0.9227376938412275
test score: 0.9144239340255389


 20%|██        | 2/10 [26:18<1:45:07, 788.43s/it]

{'logisticregression__C': 0.01, 'logisticregression__l1_ratio': 0.0}
best CV score: 0.9227367793579466
test score: 0.9144522806180575


 30%|███       | 3/10 [39:20<1:31:36, 785.25s/it]

{'logisticregression__C': 0.01, 'logisticregression__l1_ratio': 0.0}
best CV score: 0.9227448858291674
test score: 0.9144618443460053


 40%|████      | 4/10 [52:21<1:18:21, 783.56s/it]

{'logisticregression__C': 0.01, 'logisticregression__l1_ratio': 0.0}
best CV score: 0.9227307311857995
test score: 0.9144539417269276


 50%|█████     | 5/10 [1:05:43<1:05:51, 790.23s/it]

{'logisticregression__C': 0.01, 'logisticregression__l1_ratio': 0.0}
best CV score: 0.9227921143399843
test score: 0.9144443779989799


 60%|██████    | 6/10 [1:19:16<53:12, 798.17s/it]  

{'logisticregression__C': 0.01, 'logisticregression__l1_ratio': 0.0}
best CV score: 0.9227855644412404
test score: 0.9144446097095901


 70%|███████   | 7/10 [1:32:34<39:54, 798.17s/it]

{'logisticregression__C': 0.01, 'logisticregression__l1_ratio': 0.0}
best CV score: 0.9227515833476907
test score: 0.9144306256333055


 80%|████████  | 8/10 [1:45:55<26:37, 798.84s/it]

{'logisticregression__C': 0.01, 'logisticregression__l1_ratio': 0.0}
best CV score: 0.9227783343903382
test score: 0.9144306256333055


 90%|█████████ | 9/10 [1:59:00<13:14, 794.45s/it]

{'logisticregression__C': 0.01, 'logisticregression__l1_ratio': 0.0}
best CV score: 0.9227504999305479
test score: 0.9144306256333055


100%|██████████| 10/10 [2:12:16<00:00, 793.62s/it]

{'logisticregression__C': 0.01, 'logisticregression__l1_ratio': 0.0}
best CV score: 0.9228082078919067
test score: 0.9145818381863459





### e) Oligodendro <a name="7.-Oligodendro"></a>

In [13]:
cell_type = 'Oligodendro'

train_X, train_y, test_X, test_y, custom_cv = data_prep(input_test, input_train,
                                                        cell_type, binarization=True)

Finished data prepration for Oligodendro


In [14]:
eln = LogisticRegression(penalty='elasticnet', solver='saga', max_iter=10000000)
# `param_grid` is reused from the Microglia cell above; the definition below is left disabled.
# param_grid = {'logisticregression__C': np.logspace(-4, 1, 10),
#              'logisticregression__l1_ratio': np.append(np.logspace(-3, 0, 5), 0)}
models_eln = []
for i in tqdm(range(5)):
    grid, test_score = ML_pipeline_GridSearchCV(train_X, train_y, test_X, test_y, 
                                                eln, param_grid, i, custom_cv, pr_auc_scorer)
    print(grid.best_params_)
    print('best CV score:', grid.best_score_)
    print('test score:',test_score)
    models_eln.append(grid)

 20%|██        | 1/5 [24:10<1:36:42, 1450.56s/it]

{'logisticregression__C': 0.046415888336127774, 'logisticregression__l1_ratio': 0.0}
best CV score: 0.9366364304408744
test score: 0.9166648003913993


 40%|████      | 2/5 [48:20<1:12:30, 1450.33s/it]

{'logisticregression__C': 0.046415888336127774, 'logisticregression__l1_ratio': 0.0}
best CV score: 0.9366274717605518
test score: 0.9166412593323694


 60%|██████    | 3/5 [1:12:44<48:32, 1456.35s/it]

{'logisticregression__C': 0.046415888336127774, 'logisticregression__l1_ratio': 0.0}
best CV score: 0.9366496649404834
test score: 0.9166587349274489


 80%|████████  | 4/5 [1:36:59<24:16, 1456.11s/it]

{'logisticregression__C': 0.046415888336127774, 'logisticregression__l1_ratio': 0.0}
best CV score: 0.9366457229513531
test score: 0.9166438979962127


100%|██████████| 5/5 [2:01:10<00:00, 1454.09s/it]

{'logisticregression__C': 0.046415888336127774, 'logisticregression__l1_ratio': 0.0}
best CV score: 0.9366433226859627
test score: 0.9166569997239429





### f) aNSC_NPC <a name="8.-aNSC_NPC"></a>

In [15]:
cell_type = 'aNSC_NPC'

train_X, train_y, test_X, test_y, custom_cv = data_prep(input_test, input_train,
                                                        cell_type, binarization=True)

Finished data prepration for aNSC_NPC


In [16]:
eln = LogisticRegression(penalty='elasticnet', solver='saga', max_iter=10000000)
# `param_grid` is reused from the Microglia cell above; the definition below is left disabled.
# param_grid = {'logisticregression__C': np.logspace(-4, 1, 10),
#              'logisticregression__l1_ratio': np.append(np.logspace(-3, 0, 5), 0)}
models_eln = []
for i in tqdm(range(10)):
    grid, test_score = ML_pipeline_GridSearchCV(train_X, train_y, test_X, test_y, 
                                                eln, param_grid, i, custom_cv, pr_auc_scorer)
    print(grid.best_params_)
    print('best CV score:', grid.best_score_)
    print('test score:',test_score)
    models_eln.append(grid)

 10%|█         | 1/10 [09:26<1:24:57, 566.39s/it]

{'logisticregression__C': 0.046415888336127774, 'logisticregression__l1_ratio': 0.0}
best CV score: 0.8442464008956284
test score: 0.8517405377794788


 20%|██        | 2/10 [18:54<1:15:39, 567.50s/it]

{'logisticregression__C': 0.046415888336127774, 'logisticregression__l1_ratio': 0.0}
best CV score: 0.8442031335776572
test score: 0.8517412625736818


 30%|███       | 3/10 [28:18<1:05:59, 565.66s/it]

{'logisticregression__C': 0.046415888336127774, 'logisticregression__l1_ratio': 0.0}
best CV score: 0.8442695830660417
test score: 0.8519005626618357


 40%|████      | 4/10 [37:43<56:33, 565.63s/it]  

{'logisticregression__C': 0.046415888336127774, 'logisticregression__l1_ratio': 0.0}
best CV score: 0.8441089718362459
test score: 0.8519005626618357


 50%|█████     | 5/10 [47:20<47:28, 569.76s/it]

{'logisticregression__C': 0.046415888336127774, 'logisticregression__l1_ratio': 0.001}
best CV score: 0.8440976628062196
test score: 0.8532791511662958


 60%|██████    | 6/10 [56:45<37:52, 568.11s/it]

{'logisticregression__C': 0.046415888336127774, 'logisticregression__l1_ratio': 0.0}
best CV score: 0.8443697572249688
test score: 0.8519710218501044


 70%|███████   | 7/10 [1:06:10<28:21, 567.04s/it]

{'logisticregression__C': 0.046415888336127774, 'logisticregression__l1_ratio': 0.0}
best CV score: 0.8444015616366116
test score: 0.8513204148376787


 80%|████████  | 8/10 [1:15:34<18:52, 566.10s/it]

{'logisticregression__C': 0.046415888336127774, 'logisticregression__l1_ratio': 0.0}
best CV score: 0.8443557002145363
test score: 0.8519005626618357


 90%|█████████ | 9/10 [1:24:57<09:25, 565.13s/it]

{'logisticregression__C': 0.046415888336127774, 'logisticregression__l1_ratio': 0.0}
best CV score: 0.8444229568009414
test score: 0.8519710218501044


100%|██████████| 10/10 [1:34:22<00:00, 566.29s/it]

{'logisticregression__C': 0.046415888336127774, 'logisticregression__l1_ratio': 0.0}
best CV score: 0.8444979440815714
test score: 0.8519005626618357





### g) Mural <a name="9.-Mural"></a>

In [17]:
cell_type = 'Mural'

train_X, train_y, test_X, test_y, custom_cv = data_prep(input_test, input_train,
                                                        cell_type, binarization=True)

Finished data prepration for Mural


In [18]:
eln = LogisticRegression(penalty='elasticnet', solver='saga', max_iter=10000000)
# `param_grid` is reused from the Microglia cell above; the definition below is left disabled.
# param_grid = {'logisticregression__C': np.logspace(-4, 1, 10),
#              'logisticregression__l1_ratio': np.append(np.logspace(-3, 0, 5), 0)}
models_eln = []
for i in tqdm(range(10)):
    grid, test_score = ML_pipeline_GridSearchCV(train_X, train_y, test_X, test_y, 
                                                eln, param_grid, i, custom_cv, pr_auc_scorer)
    print(grid.best_params_)
    print('best CV score:', grid.best_score_)
    print('test score:',test_score)
    models_eln.append(grid)

 10%|█         | 1/10 [05:58<53:48, 358.72s/it]

{'logisticregression__C': 0.046415888336127774, 'logisticregression__l1_ratio': 0.0}
best CV score: 0.8971336026856878
test score: 0.9639287219372886


 20%|██        | 2/10 [11:56<47:47, 358.41s/it]

{'logisticregression__C': 0.046415888336127774, 'logisticregression__l1_ratio': 0.0}
best CV score: 0.8970579342621527
test score: 0.9639287219372886


 30%|███       | 3/10 [17:56<41:51, 358.74s/it]

{'logisticregression__C': 0.046415888336127774, 'logisticregression__l1_ratio': 0.0}
best CV score: 0.8970968136913157
test score: 0.9639390962441781


 40%|████      | 4/10 [23:54<35:52, 358.71s/it]

{'logisticregression__C': 0.046415888336127774, 'logisticregression__l1_ratio': 0.0}
best CV score: 0.8971001752853326
test score: 0.9639358094616334


 50%|█████     | 5/10 [29:51<29:50, 358.13s/it]

{'logisticregression__C': 0.046415888336127774, 'logisticregression__l1_ratio': 0.0}
best CV score: 0.8971371350776924
test score: 0.9639358094616334


 60%|██████    | 6/10 [35:51<23:54, 358.70s/it]

{'logisticregression__C': 0.046415888336127774, 'logisticregression__l1_ratio': 0.0}
best CV score: 0.8970680951643372
test score: 0.9639390962441781


 70%|███████   | 7/10 [41:51<17:57, 359.18s/it]

{'logisticregression__C': 0.046415888336127774, 'logisticregression__l1_ratio': 0.0}
best CV score: 0.8971046136630412
test score: 0.9639357643420605


 80%|████████  | 8/10 [47:49<11:57, 358.87s/it]

{'logisticregression__C': 0.046415888336127774, 'logisticregression__l1_ratio': 0.0}
best CV score: 0.897117879308527
test score: 0.9639358094616334


 90%|█████████ | 9/10 [53:47<05:58, 358.51s/it]

{'logisticregression__C': 0.046415888336127774, 'logisticregression__l1_ratio': 0.0}
best CV score: 0.897128830407301
test score: 0.9639358094616334


100%|██████████| 10/10 [59:42<00:00, 358.27s/it]

{'logisticregression__C': 0.046415888336127774, 'logisticregression__l1_ratio': 0.0}
best CV score: 0.8970885195215789
test score: 0.9639287219372886
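
Each cell type's loop stores the fitted searches in `models_eln` but only prints the test scores, and `models_eln` is reset at the start of the next cell. One possible way to condense the per-seed output is sketched below. The helper `summarize_runs` is hypothetical, it assumes the test scores are collected into a list alongside the searches, and `SimpleNamespace` objects stand in for fitted `GridSearchCV` instances. The numbers are the five Microglia runs printed in section a).

```python
from collections import Counter
from statistics import mean
from types import SimpleNamespace


def summarize_runs(models, test_scores):
    """Hypothetical helper: most frequent best_params_ plus mean CV and test scores."""
    counts = Counter(tuple(sorted(m.best_params_.items())) for m in models)
    most_common_params, n_runs = counts.most_common(1)[0]
    return {
        "most_common_params": dict(most_common_params),
        "n_runs_selecting": n_runs,
        "mean_cv_score": mean(m.best_score_ for m in models),
        "mean_test_score": mean(test_scores),
    }


def fake_grid(C, l1_ratio, cv_score):
    # Stand-in exposing the two GridSearchCV attributes the helper reads.
    return SimpleNamespace(
        best_params_={'logisticregression__C': C,
                      'logisticregression__l1_ratio': l1_ratio},
        best_score_=cv_score,
    )


# Values copied from the Microglia output in section a).
microglia_models = [
    fake_grid(1.0, 0.0, 0.9136604283844914),
    fake_grid(1.0, 0.0, 0.9136828418906111),
    fake_grid(1.0, 0.0, 0.913638420719382),
    fake_grid(1.0, 0.0, 0.9136896879264818),
    fake_grid(21.54434690031882, 0.0, 0.9137353668481561),
]
microglia_test_scores = [0.8761702811825283, 0.8762039489579435,
                         0.876197433095826, 0.8761890701427045,
                         0.871684917535416]

summary = summarize_runs(microglia_models, microglia_test_scores)
print(summary)
```

With real results, the same call would take `models_eln` and a list of the `test_score` values appended inside the loop.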



