# ElasticNet tuning (cell-type-specific models)

## Outline

The **MLAging Cell-type** workflow consists of four sections:

I. Data Preprocessing in Seurat ```preprocessing.R```

II. ElasticNet Tuning (hyperparameter selection with ```GridSearchCV```) -- **this notebook**:

1. [Data Preparation](#1.-prep)
2. [Cell Types](#2.-celltypes)
    - [Neuron](#3.-neuron)
    - [Oligodendrocyte](#4.-Oligo)
    - [Astrocyte](#5.-astro)
    - [Microglia](#6.-micro)

III. Final Models Over 10 Random States

IV. Results and Interpretations

In [1]:
import warnings
warnings.filterwarnings('ignore')

from src.data_processing import *
from src.grid_search import *
import os
import numpy as np
from sklearn.metrics import make_scorer
from sklearn.linear_model import LogisticRegression
from tqdm import tqdm
import pickle

data_type = 'float32'

## 1. Data Preparation <a name="1.-prep"></a>
### Load training and testing batches

In [2]:
input_test = '../data/test_final_group_info.csv'
input_train = '../data/train_final_group_info.csv'

Finished data prepration


In [19]:
pr_auc_scorer = make_scorer(pr_auc_score, greater_is_better=True,
                            needs_proba=True)
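
The `pr_auc_score` function arrives through the wildcard import of `src.grid_search`, so its definition is not visible in this notebook. A minimal sketch of what such a metric might look like, assuming it is the trapezoidal area under the precision-recall curve computed from predicted probabilities (the function body below is an assumption, not the project's actual implementation):

```python
import numpy as np
from sklearn.metrics import auc, precision_recall_curve


def pr_auc_score(y_true, y_score):
    """Area under the precision-recall curve (trapezoidal rule).

    Hypothetical stand-in for the helper imported from src.grid_search.
    """
    precision, recall, _ = precision_recall_curve(y_true, y_score)
    # recall is monotonically decreasing, which auc() accepts as the x-axis
    return auc(recall, precision)


# Toy labels and predicted probabilities for the positive class
y_true = np.array([0, 0, 1, 1])
y_perfect = np.array([0.1, 0.2, 0.8, 0.9])  # positives ranked above negatives
y_mixed = np.array([0.1, 0.4, 0.35, 0.8])   # one positive ranked below a negative

print(pr_auc_score(y_true, y_perfect))
print(pr_auc_score(y_true, y_mixed))
```

Because the metric needs probabilities rather than hard labels, the scorer is built with `needs_proba=True`; newer scikit-learn releases express the same request with `response_method='predict_proba'`.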

## 2. Cell-type-specific models <a name="2.-celltypes"></a>
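
`ML_pipeline_GridSearchCV` is imported from `src.grid_search` and its body is not shown here. The `logisticregression__` prefix on the grid keys suggests that the estimator sits inside a scikit-learn pipeline created with `make_pipeline`. A minimal sketch of how such a helper could work, assuming a `StandardScaler` step, a seed passed as the model's `random_state`, and the hypothetical name `grid_search_sketch` (all three are assumptions, not the project's code); the built-in `'average_precision'` scorer and synthetic data stand in for `pr_auc_scorer` and the real batches:

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def grid_search_sketch(train_X, train_y, test_X, test_y,
                       model, param_grid, random_state, cv, scoring):
    """Hypothetical stand-in for ML_pipeline_GridSearchCV."""
    # Seed a fresh copy of the model so each run is reproducible
    seeded = clone(model).set_params(random_state=random_state)
    # make_pipeline names the step 'logisticregression', matching the
    # 'logisticregression__C' keys in param_grid (the scaling step is assumed)
    pipe = make_pipeline(StandardScaler(), seeded)
    grid = GridSearchCV(pipe, param_grid, scoring=scoring, cv=cv)
    grid.fit(train_X, train_y)
    # Score the refit best estimator on the held-out test batch
    return grid, grid.score(test_X, test_y)


# Synthetic data in place of the real training and testing batches
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

eln = LogisticRegression(penalty='elasticnet', solver='saga', max_iter=5000)
param_grid = {'logisticregression__C': [0.1, 1.0],
              'logisticregression__l1_ratio': [0.0, 0.5]}

grid, test_score = grid_search_sketch(X_tr, y_tr, X_te, y_te, eln, param_grid,
                                      0, StratifiedKFold(n_splits=3),
                                      'average_precision')
print(grid.best_params_, test_score)
```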

### a) Neuron <a name="3.-neuron"></a>

In [None]:
cell_type = 'Neuron'

train_X, train_y, test_X, test_y, custom_cv = data_prep(input_test, input_train,
                                                        cell_type, binarization=True)

In [None]:
eln = LogisticRegression(penalty='elasticnet', solver='saga', max_iter=10000000)
param_grid = {'logisticregression__C': np.logspace(-2, 1, 10),
             'logisticregression__l1_ratio': [0.001, 0.01, 0.05, 0.5, 0.65, 0.8]}
# Alternative l1_ratio grid: [0.05, 0.1, 0.2, 0.35]
models_eln = []
for i in tqdm(range(10)):
    grid, test_score = ML_pipeline_GridSearchCV(train_X, train_y, test_X, test_y, 
                                                eln, param_grid, i, custom_cv, pr_auc_scorer)
    print(grid.best_params_)
    print('best CV score:', grid.best_score_)
    print('test score:',test_score)
    models_eln.append(grid)

 10%|█         | 1/10 [46:59<7:02:58, 2819.83s/it]

{'logisticregression__C': 0.21544346900318834, 'logisticregression__l1_ratio': 0.05}
best CV score: 0.9738372302135103
test score: 0.9667233392013964


 20%|██        | 2/10 [2:00:36<8:21:12, 3759.04s/it]

{'logisticregression__C': 0.21544346900318834, 'logisticregression__l1_ratio': 0.05}
best CV score: 0.9738373990074987
test score: 0.9667239967503731


 30%|███       | 3/10 [3:12:55<7:49:29, 4024.16s/it]

{'logisticregression__C': 0.21544346900318834, 'logisticregression__l1_ratio': 0.05}
best CV score: 0.973837767258314
test score: 0.9667220366700146


 40%|████      | 4/10 [3:59:51<5:54:42, 3547.13s/it]

{'logisticregression__C': 0.21544346900318834, 'logisticregression__l1_ratio': 0.05}
best CV score: 0.9738381720626543
test score: 0.9667234148777877


 50%|█████     | 5/10 [4:46:40<4:33:24, 3280.87s/it]

{'logisticregression__C': 0.21544346900318834, 'logisticregression__l1_ratio': 0.05}
best CV score: 0.9738379600306608
test score: 0.9667218265108759


 60%|██████    | 6/10 [5:32:53<3:27:12, 3108.13s/it]

{'logisticregression__C': 0.21544346900318834, 'logisticregression__l1_ratio': 0.05}
best CV score: 0.9738382557212457
test score: 0.9667211608360007


 70%|███████   | 7/10 [6:19:44<2:30:33, 3011.15s/it]

{'logisticregression__C': 0.21544346900318834, 'logisticregression__l1_ratio': 0.05}
best CV score: 0.9738382341542344
test score: 0.9667211132379622


 80%|████████  | 8/10 [7:06:35<1:38:14, 2947.31s/it]

{'logisticregression__C': 0.21544346900318834, 'logisticregression__l1_ratio': 0.05}
best CV score: 0.9738376129036088
test score: 0.9667226250496714


 90%|█████████ | 9/10 [7:53:54<48:33, 2913.34s/it]  

{'logisticregression__C': 0.21544346900318834, 'logisticregression__l1_ratio': 0.05}
best CV score: 0.9738378012807429
test score: 0.9667228278444705


In [10]:
with open('../results/cell_type/eln_' + str(cell_type) + '_models_10_finer.save', 'wb') as file:
    pickle.dump(models_eln, file)
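
The saved list can be read back later with `pickle.load`. A minimal round-trip sketch, assuming a hypothetical helper `load_models`; a temporary file and a plain dictionary stand in for the real results path and the fitted grids:

```python
import os
import pickle
import tempfile


def load_models(path):
    """Read back a list of objects written with pickle.dump."""
    with open(path, 'rb') as fh:
        return pickle.load(fh)


# Round trip on a throwaway file; the dictionary stands in for fitted grids
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'eln_Neuron_models_10_finer.save')
    with open(path, 'wb') as fh:
        pickle.dump([{'run': 0}], fh)
    restored = load_models(path)

print(restored)
```

Unpickling fitted estimators generally requires a scikit-learn version compatible with the one that saved them, so it may be worth recording the version alongside the file.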

### b) Oligodendrocyte <a name="4.-Oligo"></a>

In [None]:
cell_type = 'Oligodendrocyte'

train_X, train_y, test_X, test_y, custom_cv = data_prep(input_test, input_train,
                                                        cell_type, binarization=True)

In [None]:
eln = LogisticRegression(penalty='elasticnet', solver='saga', max_iter=10000000)
param_grid = {'logisticregression__C': np.logspace(-2, 1, 10),
             'logisticregression__l1_ratio': [0.001, 0.01, 0.05, 0.5, 0.65, 0.8]}
# Alternative l1_ratio grid: [0.05, 0.1, 0.2, 0.35]
models_eln = []
for i in tqdm(range(10)):
    grid, test_score = ML_pipeline_GridSearchCV(train_X, train_y, test_X, test_y, 
                                                eln, param_grid, i, custom_cv, pr_auc_scorer)
    print(grid.best_params_)
    print('best CV score:', grid.best_score_)
    print('test score:',test_score)
    models_eln.append(grid)

In [None]:
with open('../results/cell_type/eln_' + str(cell_type) + '_models_10_finer.save', 'wb') as file:
    pickle.dump(models_eln, file)

### c) Astrocyte <a name="5.-astro"></a>

In [3]:
cell_type = 'Astrocyte'

train_X, train_y, test_X, test_y, custom_cv = data_prep(input_test, input_train,
                                                        cell_type, binarization=True)

Finished data prepration


In [14]:
l1_ratio_list = np.append(np.logspace(-3, 0, 10), 0)
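
In scikit-learn's `LogisticRegression`, `l1_ratio=0` corresponds to a pure L2 penalty and `l1_ratio=1` to a pure L1 penalty, so this grid spans both extremes as well as the mixtures in between. A short sketch that inspects the values the expression above produces:

```python
import numpy as np

# Ten log-spaced mixing values from 0.001 to 1, with 0 appended at the end
l1_ratio_list = np.append(np.logspace(-3, 0, 10), 0)

print(len(l1_ratio_list))
print(l1_ratio_list.min(), l1_ratio_list.max())
print(l1_ratio_list)
```

The second entry of the grid is approximately 0.00215, which is the non-zero `l1_ratio` that appears in some of the selected parameter sets.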

In [23]:
eln = LogisticRegression(penalty='elasticnet', solver='saga', max_iter=10000000)
param_grid = {'logisticregression__C': np.logspace(-4, 2, 10),
             'logisticregression__l1_ratio': l1_ratio_list}
# Alternative l1_ratio grid: [0.05, 0.1, 0.2, 0.35]
models_eln = []
for i in tqdm(range(10)):
    grid, test_score = ML_pipeline_GridSearchCV(train_X, train_y, test_X, test_y, 
                                                eln, param_grid, i, custom_cv, pr_auc_scorer)
    print(grid.best_params_)
    print('best CV score:', grid.best_score_)
    print('test score:',test_score)
    models_eln.append(grid)

 10%|█         | 1/10 [07:27<1:07:05, 447.23s/it]

{'logisticregression__C': 1.0, 'logisticregression__l1_ratio': 0.0}
best CV score: 0.9791592799911145
test score: 0.9593097922545175


 20%|██        | 2/10 [14:03<55:36, 417.00s/it]  

{'logisticregression__C': 1.0, 'logisticregression__l1_ratio': 0.0}
best CV score: 0.9791672217468737
test score: 0.9591400595622753


 30%|███       | 3/10 [20:35<47:19, 405.68s/it]

{'logisticregression__C': 1.0, 'logisticregression__l1_ratio': 0.0}
best CV score: 0.9791217819393039
test score: 0.9594909571849615


 40%|████      | 4/10 [27:15<40:20, 403.34s/it]

{'logisticregression__C': 4.641588833612772, 'logisticregression__l1_ratio': 0.0021544346900318843}
best CV score: 0.9795257539453458
test score: 0.9583835743128594


 50%|█████     | 5/10 [34:08<33:55, 407.01s/it]

{'logisticregression__C': 1.0, 'logisticregression__l1_ratio': 0.0}
best CV score: 0.9792295727283665
test score: 0.9596210756806711


 60%|██████    | 6/10 [41:14<27:33, 413.41s/it]

{'logisticregression__C': 1.0, 'logisticregression__l1_ratio': 0.0}
best CV score: 0.9792167910541458
test score: 0.9591682545246464


 70%|███████   | 7/10 [48:22<20:54, 418.08s/it]

{'logisticregression__C': 1.0, 'logisticregression__l1_ratio': 0.0}
best CV score: 0.9792903724815281
test score: 0.9595950267082924


 80%|████████  | 8/10 [55:22<13:57, 418.94s/it]

{'logisticregression__C': 1.0, 'logisticregression__l1_ratio': 0.0}
best CV score: 0.9792650694843053
test score: 0.9594818418492603


 90%|█████████ | 9/10 [1:02:44<07:06, 426.15s/it]

{'logisticregression__C': 100.0, 'logisticregression__l1_ratio': 0.0021544346900318843}
best CV score: 0.9800145355964683
test score: 0.958463252544377


100%|██████████| 10/10 [1:10:01<00:00, 420.19s/it]

{'logisticregression__C': 21.54434690031882, 'logisticregression__l1_ratio': 0.0}
best CV score: 0.9794331256095796
test score: 0.958488059330731





In [24]:
with open('../results/cell_type/eln_' + str(cell_type) + '_models_10_finer.save', 'wb') as file:
    pickle.dump(models_eln, file)

### d) Microglia <a name="6.-micro"></a>

In [26]:
cell_type = 'Microglia'

train_X, train_y, test_X, test_y, custom_cv = data_prep(input_test, input_train,
                                                        cell_type, binarization=True)

Finished data prepration


In [None]:
eln = LogisticRegression(penalty='elasticnet', solver='saga', max_iter=10000000)
param_grid = {'logisticregression__C': np.logspace(-4, 1, 10),
             'logisticregression__l1_ratio': np.append(np.logspace(-3, 0, 5), 0)}
models_eln = []
for i in tqdm(range(10)):
    grid, test_score = ML_pipeline_GridSearchCV(train_X, train_y, test_X, test_y, 
                                                eln, param_grid, i, custom_cv, pr_auc_scorer)
    print(grid.best_params_)
    print('best CV score:', grid.best_score_)
    print('test score:',test_score)
    models_eln.append(grid)

  0%|          | 0/10 [00:00<?, ?it/s]

In [None]:
with open('../results/cell_type/eln_' + str(cell_type) + '_models_10_finer.save', 'wb') as file:
    pickle.dump(models_eln, file)