# Final ELN models for different preprocessing methods

## Outline

The **MLAging - preprocessing** workflow consists of the following sections:

`00 preprocessing.R` Data preprocessing and preparation in Seurat.

`011 Preprocessing HEG ELN Tuning` ELN model tuning using highly expressed genes (HEGs) and hyperparameter selection using `GridSearchCV`.

`012 Preprocessing HVG ELN Tuning` ELN model tuning using highly variable genes (HVGs) and hyperparameter selection using `GridSearchCV`.
 
`02 Preprocessing ELN Result 10x` Runs the best ELN model over 10 random seeds (**this notebook**):
1. [HEG ELN Final Model](#1.-HEG)
    - [HEG-lognorm](#2.-heg_lognorm)
    - [HEG-std](#3.-heg_std)
    - [HEG-integration](#4.-heg_integrated)
    - [HEG-binarization](#5.-heg_bin)
    
    
2. [HVG ELN Final Model](#6.-HVG)
    - [HVG-lognorm](#7.-hvg_lognorm)
    - [HVG-std](#8.-hvg_std)
    - [HVG-integration](#9-hvg_integrated)
    - [HVG-binarization](#10.-hvg_bin)

`03 Preprocessing ELN Result Viz` Result visualization.

In [1]:
import warnings
warnings.filterwarnings('ignore')

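# Project helpers used below (train_test_split, runs_10, pr_auc_score).
# Note: this train_test_split is the project's own loader, not scikit-learn's.
from src.preprocessing_eln import *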
import os
import numpy as np
from sklearn.metrics import make_scorer

data_type = 'float32'

In [2]:
pr_auc_scorer = make_scorer(pr_auc_score, greater_is_better=True,
                            needs_proba=True)
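
The scorer above wraps `pr_auc_score`, which is imported from `src.preprocessing_eln` and not shown in this notebook. As a rough illustration only, the sketch below assumes it computes the area under the precision-recall curve with the trapezoidal rule; the name `pr_auc_score_sketch` and the toy labels are hypothetical. Note also that `needs_proba` was deprecated in scikit-learn 1.4 in favour of `response_method="predict_proba"`, so the cell above may need that argument on newer versions.

```python
import numpy as np
from sklearn.metrics import auc, precision_recall_curve


def pr_auc_score_sketch(y_true, y_score):
    """Hypothetical stand-in for pr_auc_score: area under the PR curve."""
    # precision_recall_curve returns recall in decreasing order;
    # sklearn's auc accepts any monotonic x, so no re-sorting is needed.
    precision, recall, _ = precision_recall_curve(y_true, y_score)
    return auc(recall, precision)


# Toy example (made-up labels and scores, not project data).
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])
print(pr_auc_score_sketch(y_true, y_score))
```

With `needs_proba=True`, `make_scorer` passes predicted probabilities to the metric rather than hard class labels, which is what a precision-recall curve requires.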

## 1. ELN models for the top 2k highly expressed genes <a name="1.-HEG"></a>

### 1.1) HEG - log-normalized <a name="2.-heg_lognorm"></a>

In [3]:
input_train = '../data/train_heg2k_lognorm_intersect.csv'
input_test = '../data/test_heg2k_lognorm_intersect.csv'

In [4]:
train_X, train_y, test_X, test_y = train_test_split(input_train, input_test, binarization=False)

In [5]:
runs_10(train_X, train_y, test_X, test_y, 0.01, 1, 'heg_lognorm')

100%|██████████| 10/10 [46:14<00:00, 277.47s/it]


auprc: 0.6712991948323103 ± 1.2969912151835466e-06
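
`runs_10` is defined in `src.preprocessing_eln`, so its internals are not visible here. The sketch below shows one plausible shape for such a helper, under loudly stated assumptions: that ELN refers to an elastic-net-penalized logistic regression, that the two numeric arguments are the inverse regularization strength `C` and the `l1_ratio`, and that the reported value is the mean ± standard deviation of the test AUPRC across seeds. The function name `runs_over_seeds`, the synthetic data, and the use of `average_precision_score` as the metric are all illustrative stand-ins, not the project's code.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score


def runs_over_seeds(train_X, train_y, test_X, test_y, C, l1_ratio, n_runs=10):
    """Fit the same elastic-net model under several seeds; return mean and std of test AUPRC."""
    scores = []
    for seed in range(n_runs):
        # saga is the scikit-learn solver that supports the elastic-net penalty.
        model = LogisticRegression(penalty="elasticnet", solver="saga",
                                   C=C, l1_ratio=l1_ratio,
                                   max_iter=500, random_state=seed)
        model.fit(train_X, train_y)
        proba = model.predict_proba(test_X)[:, 1]  # positive-class probability
        scores.append(average_precision_score(test_y, proba))
    return float(np.mean(scores)), float(np.std(scores))


# Synthetic stand-in data (not the expression matrices used in this notebook).
X, y = make_classification(n_samples=200, n_features=20, random_state=0)
mean_ap, std_ap = runs_over_seeds(X[:150], y[:150], X[150:], y[150:],
                                  C=1.0, l1_ratio=0.5)
print(f"auprc: {mean_ap} ± {std_ap}")
```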


### 1.2) HEG - log-normalized + scaled <a name="3.-heg_std"></a>

In [9]:
input_train = '../data/train_heg2k_std_intersect.csv'
input_test = '../data/test_heg2k_std_intersect.csv'

In [11]:
train_X, train_y, test_X, test_y = train_test_split(input_train, input_test, binarization=False)

In [12]:
runs_10(train_X, train_y, test_X, test_y, 0.01, 1, 'heg_lognorm_std')

100%|██████████| 10/10 [01:05<00:00,  6.54s/it]


auprc: 0.6728152675629776 ± 2.61064408511419e-06


### 1.3) HEG - log-normalized + scaled + integrated <a name="4.-heg_integrated"></a>

In [13]:
input_train = '../data/train_heg2k_std_integrated.csv'
input_test = '../data/test_heg2k_std_integrated.csv'

In [14]:
train_X, train_y, test_X, test_y = train_test_split(input_train, input_test, binarization=False)

In [15]:
runs_10(train_X, train_y, test_X, test_y, 0.01, 1, 'heg_lognorm_std_int')

100%|██████████| 10/10 [01:13<00:00,  7.34s/it]


auprc: 0.6888074655089317 ± 9.284846311665915e-07


### 1.4) HEG - log-normalized + scaled + integrated + binarized <a name="5.-heg_bin"></a>

In [17]:
train_X, train_y, test_X, test_y = train_test_split(input_train, input_test, binarization=True)

In [18]:
runs_10(train_X, train_y, test_X, test_y, 0.01, 1, 'heg_lognorm_std_int_bin')

100%|██████████| 10/10 [01:48<00:00, 10.83s/it]


auprc: 0.6079509873818921 ± 1.2191563448131428e-07
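
The `binarization=True` flag in this subsection is handled inside the project's `train_test_split`, whose implementation is not shown. As a minimal sketch, assuming binarization means thresholding each expression value into 0 or 1, it might look like the following; the function name `binarize_expression` and the threshold of 0 are hypothetical choices made for illustration.

```python
import numpy as np


def binarize_expression(X, threshold=0.0):
    """Map every value above `threshold` to 1 and everything else to 0."""
    # float32 matches the data_type set at the top of this notebook.
    return (np.asarray(X) > threshold).astype(np.float32)


# Toy matrix (made-up values): rows are cells, columns are genes.
toy = np.array([[0.0, 1.5],
                [-0.3, 2.0]])
print(binarize_expression(toy))
```

Binarization discards expression magnitude and keeps only whether a value exceeds the threshold, so the appropriate threshold depends on whether the input has been scaled or integrated beforehand.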


## 2. ELN models for the top 2k highly variable genes <a name="6.-HVG"></a>

### 2.1) HVG - log-normalized <a name="7.-hvg_lognorm"></a>

In [19]:
input_train = '../data/train_hvg2k_lognorm_intersect.csv'
input_test = '../data/test_hvg2k_lognorm_intersect.csv'

In [20]:
train_X, train_y, test_X, test_y = train_test_split(input_train, input_test, binarization=False)

In [21]:
runs_10(train_X, train_y, test_X, test_y, 100, 1, 'hvg_lognorm')

100%|██████████| 10/10 [25:18<00:00, 151.86s/it]


auprc: 0.7207472300849239 ± 1.9599708976034896e-07


### 2.2) HVG - log-normalized + scaled <a name="8.-hvg_std"></a>

In [22]:
input_train = '../data/train_hvg2k_std_intersect.csv'
input_test = '../data/test_hvg2k_std_intersect.csv'

In [23]:
train_X, train_y, test_X, test_y = train_test_split(input_train, input_test, binarization=False)

In [25]:
runs_10(train_X, train_y, test_X, test_y, 0.01, 0.0630957344480193, 'hvg_lognorm_std')

100%|██████████| 10/10 [05:22<00:00, 32.25s/it]


auprc: 0.7285212402127392 ± 2.329283217320429e-07


### 2.3) HVG - log-normalized + scaled + integrated <a name="9-hvg_integrated"></a>

In [26]:
input_train = '../data/train_hvg2k_std_integrated.csv'
input_test = '../data/test_hvg2k_std_integrated.csv'

In [27]:
train_X, train_y, test_X, test_y = train_test_split(input_train, input_test, binarization=False)

In [28]:
runs_10(train_X, train_y, test_X, test_y, 0.01, 1, 'hvg_lognorm_std_int')

100%|██████████| 10/10 [02:28<00:00, 14.82s/it]


auprc: 0.6977453556133097 ± 1.0085031102552479e-06


### 2.4) HVG - log-normalized + scaled + integrated + binarized <a name="10.-hvg_bin"></a>

In [29]:
train_X, train_y, test_X, test_y = train_test_split(input_train, input_test, binarization=True)

In [30]:
runs_10(train_X, train_y, test_X, test_y, 0.027825594022071243, 0.015848931924611134, 'hvg_lognorm_std_int_bin')

100%|██████████| 10/10 [23:16<00:00, 139.69s/it]


auprc: 0.9674282498451758 ± 1.5072149155038359e-06
