# ElasticNet tuning (cell type-specific models)

## Outline

The **MLAging - cell type** workflow consists of the following sections:

`00 preprocessing.R` Data preprocessing and preparation in Seurat.

`21 Cell Type ELN Tuning` ELN model tuning using *binarized* HVGs, with hyperparameter selection via `GridSearchCV` - **this notebook**:
1. [Data Preparation](#1.-prep)
2. [Cell Types](#2.-celltypes)
    - [Neuron](#3.-neuron)
    - [Oligodendrocyte](#4.-oligo)
    - [Astrocyte](#5.-astro)
    - [OPC](#6.-opc)
    - [Microglia](#7.-micro)

`22 Cell Type ELN Result 10x` Run the best models for *binarized* HVGs over 10 random seeds.

`23 Cell Type Result Viz` Result visualization.

`24 Cell Type Interpretations` Result interpretation. 

In [None]:
import warnings
warnings.filterwarnings('ignore')

from src.data_processing import *
from src.grid_search import *
import os
import numpy as np
from sklearn.metrics import make_scorer
from sklearn.linear_model import LogisticRegression
from tqdm import tqdm
import pickle

data_type = 'float32'

## 1. Data Preparation <a name="1.-prep"></a>
### Load training and testing batches

In [None]:
input_test = '../data/test_final_group_info.csv'
input_train = '../data/train_final_group_info.csv'

In [None]:
pr_auc_scorer = make_scorer(pr_auc_score, greater_is_better=True,
                            needs_proba=True)
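`pr_auc_score` comes from `src.grid_search` and its definition is not shown in this notebook. As a rough illustration only, a precision-recall AUC metric could look like the sketch below; the function name `pr_auc_score_sketch` and the toy labels are hypothetical, and the real helper may differ.

```python
import numpy as np
from sklearn.metrics import auc, precision_recall_curve


def pr_auc_score_sketch(y_true, y_score):
    """Area under the precision-recall curve.

    Hypothetical stand-in for pr_auc_score; not the project's implementation.
    """
    precision, recall, _ = precision_recall_curve(y_true, y_score)
    # recall comes back in decreasing order; auc() accepts any monotonic x
    return auc(recall, precision)


# Toy labels and scores, for illustration only
y_true = np.array([0, 0, 1, 1])
perfect = pr_auc_score_sketch(y_true, np.array([0.1, 0.2, 0.8, 0.9]))
mixed = pr_auc_score_sketch(y_true, np.array([0.1, 0.9, 0.2, 0.8]))
print(perfect, mixed)
```

Note that `needs_proba=True` was deprecated in scikit-learn 1.4 in favor of `response_method`, so the `make_scorer` call above may need adjusting on newer versions.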

## 2. Cell type-specific models <a name="2.-celltypes"></a>

### a) Neuron <a name="3.-neuron"></a>

In [None]:
cell_type = 'Neuron'

train_X, train_y, test_X, test_y, custom_cv = data_prep(input_test, input_train,
                                                        cell_type, binarization=True)
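`data_prep` is defined in `src.data_processing` and is not shown here, so what `binarization=True` does internally is not visible from this notebook. One common reading of "binarized" expression is detected versus not detected; the sketch below assumes a threshold of zero, and both the helper name `binarize_expression` and the toy matrix are hypothetical.

```python
import numpy as np


def binarize_expression(X, threshold=0.0, dtype='float32'):
    """Map expression values to 1 (above threshold) or 0 (otherwise).

    Assumption: a gene counts as "on" when its value exceeds `threshold`.
    """
    return (np.asarray(X) > threshold).astype(dtype)


# Toy cells-by-genes matrix, for illustration only
counts = np.array([[0.0, 2.3, 0.0],
                   [1.1, 0.0, 4.0]])
binary = binarize_expression(counts)
print(binary)
```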

In [None]:
eln = LogisticRegression(penalty='elasticnet', solver='saga', max_iter=10000000)
param_grid = {'logisticregression__C': np.logspace(-2, 1, 10),
             'logisticregression__l1_ratio': [0.001, 0.01, 0.05, 0.5, 0.65, 0.8]}
# [0.05, 0.1, 0.2, 0.35]
for i in tqdm(range(4)):
    grid, test_score = ML_pipeline_GridSearchCV(train_X, train_y, test_X, test_y, 
                                                eln, param_grid, i, custom_cv, pr_auc_scorer)
    print(grid.best_params_)
    print('best CV score:', grid.best_score_)
    print('test score:',test_score)
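The `logisticregression__` prefix in `param_grid` suggests that `ML_pipeline_GridSearchCV` wraps the estimator in a scikit-learn pipeline, where `make_pipeline` names each step after its lowercased class. The helper itself is not shown, so the following is only a minimal sketch under assumptions: synthetic data from `make_classification`, a `StandardScaler` step, a plain `StratifiedKFold` in place of `custom_cv`, and the built-in `'average_precision'` scorer as a stand-in for `pr_auc_scorer`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training data (assumption)
X, y = make_classification(n_samples=200, n_features=20,
                           n_informative=5, random_state=0)

eln = LogisticRegression(penalty='elasticnet', solver='saga',
                         max_iter=5000, random_state=0)
# make_pipeline names the step 'logisticregression', matching the grid keys
pipe = make_pipeline(StandardScaler(), eln)

# Smaller grid than in the notebook, to keep the sketch fast
param_grid = {'logisticregression__C': np.logspace(-2, 1, 4),
              'logisticregression__l1_ratio': [0.05, 0.5, 0.8]}

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
grid = GridSearchCV(pipe, param_grid, scoring='average_precision', cv=cv)
grid.fit(X, y)

print(grid.best_params_)
print('best CV score:', grid.best_score_)
```

Each combination of `C` and `l1_ratio` is fitted once per CV fold, so the full notebook grid of 10 x 6 values is considerably more expensive than this reduced one.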

### b) Oligodendrocyte <a name="4.-oligo"></a>

In [None]:
cell_type = 'Oligodendrocyte'

train_X, train_y, test_X, test_y, custom_cv = data_prep(input_test, input_train,
                                                        cell_type, binarization=True)

In [None]:
eln = LogisticRegression(penalty='elasticnet', solver='saga', max_iter=10000000)
param_grid = {'logisticregression__C': np.logspace(-2, 1, 10),
             'logisticregression__l1_ratio': [0.001, 0.01, 0.05, 0.5, 0.65, 0.8]}
# [0.05, 0.1, 0.2, 0.35]
for i in tqdm(range(4)):
    grid, test_score = ML_pipeline_GridSearchCV(train_X, train_y, test_X, test_y, 
                                                eln, param_grid, i, custom_cv, pr_auc_scorer)
    print(grid.best_params_)
    print('best CV score:', grid.best_score_)
    print('test score:',test_score)

### c) Astrocyte <a name="5.-astro"></a>

In [None]:
cell_type = 'Astrocyte'

train_X, train_y, test_X, test_y, custom_cv = data_prep(input_test, input_train,
                                                        cell_type, binarization=True)

In [None]:
l1_ratio_list = np.append(np.logspace(-3, 0, 10), 0)
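For reference, the list built above holds ten log-spaced mixing values from 0.001 to 1 plus an explicit 0. In scikit-learn's elastic-net penalty, `l1_ratio=0` corresponds to a pure L2 penalty and `l1_ratio=1` to a pure L1 penalty, so this grid spans both extremes. A quick check of its contents:

```python
import numpy as np

# Same expression as in the cell above
l1_ratio_list = np.append(np.logspace(-3, 0, 10), 0)

print(len(l1_ratio_list))                        # number of candidate ratios
print(l1_ratio_list.min(), l1_ratio_list.max())  # range covered by the grid
```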

In [None]:
eln = LogisticRegression(penalty='elasticnet', solver='saga', max_iter=10000000)
param_grid = {'logisticregression__C': np.logspace(-4, 2, 10),
             'logisticregression__l1_ratio': l1_ratio_list}
# [0.05, 0.1, 0.2, 0.35]
for i in tqdm(range(4)):
    grid, test_score = ML_pipeline_GridSearchCV(train_X, train_y, test_X, test_y, 
                                                eln, param_grid, i, custom_cv, pr_auc_scorer)
    print(grid.best_params_)
    print('best CV score:', grid.best_score_)
    print('test score:',test_score)

### d) OPC <a name="6.-opc"></a>

In [None]:
cell_type = 'OPC'

train_X, train_y, test_X, test_y, custom_cv = data_prep(input_test, input_train,
                                                        cell_type, binarization=True)

In [None]:
eln = LogisticRegression(penalty='elasticnet', solver='saga', max_iter=10000000)
param_grid = {'logisticregression__C': np.logspace(-4, 2, 10),
             'logisticregression__l1_ratio': np.append(np.logspace(-3, 0, 5), 0)}
for i in tqdm(range(10)):
    grid, test_score = ML_pipeline_GridSearchCV(train_X, train_y, test_X, test_y, 
                                                eln, param_grid, i, custom_cv, pr_auc_scorer)
    print(grid.best_params_)
    print('best CV score:', grid.best_score_)
    print('test score:',test_score)

### e) Microglia <a name="7.-micro"></a>

In [None]:
cell_type = 'Microglia'

train_X, train_y, test_X, test_y, custom_cv = data_prep(input_test, input_train,
                                                        cell_type, binarization=True)

In [None]:
eln = LogisticRegression(penalty='elasticnet', solver='saga', max_iter=10000000)
param_grid = {'logisticregression__C': np.logspace(-4, 1, 10),
             'logisticregression__l1_ratio': np.append(np.logspace(-3, 0, 5), 0)}
for i in tqdm(range(10)):
    grid, test_score = ML_pipeline_GridSearchCV(train_X, train_y, test_X, test_y, 
                                                eln, param_grid, i, custom_cv, pr_auc_scorer)
    print(grid.best_params_)
    print('best CV score:', grid.best_score_)
    print('test score:',test_score)