# Final ELN model for switched training and test sets

## Outline

The **MLAging - all-cell** workflow consists of the following sections:

`00 preprocessing.R` Data preprocessing and preparation in Seurat.

`111 All-cell Model Tuning - Before Binarization` ML model tuning using *non-binarized* HVGs and hyperparameter selection using `GridSearchCV`.

`112 All-cell Model Tuning - After Binarization` ML model tuning using *binarized* HVGs.

`121 All-cell Model 10x - Before Binarization` Run the best models for *non-binarized* HVGs over 10 random seeds.

`122 All-cell Model 10x - After Binarization` Run the best models for *binarized* HVGs over 10 random seeds.
 
`123 All-cell Model 10x Swapped Train-Test` Run the best models over 10 random seeds with the training and test sets switched, to check that sequencing throughput did not affect model performance -- **this notebook**:

- [1. ELN before binarization](#1.-before)
- [2. ELN after binarization](#2.-after)

`13 All-cell Model Result Viz` Result visualization.

`14 All-cell ELN Interpretation` Result interpretation. 

In [1]:
import warnings
warnings.filterwarnings('ignore')

# Star imports supply the project helpers used below (data_prep, pr_auc_score, ...).
from src.preprocessing_eln import *
from src.data_processing import *
from src.grid_search import *
import os
import pickle
import numpy as np
from sklearn.metrics import make_scorer
from sklearn.linear_model import LogisticRegression
from tqdm import tqdm
from statistics import mean, stdev

data_type = 'float32'

In [2]:
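# AUPRC scorer for model selection; needs_proba=True feeds predict_proba output
# to the metric (scikit-learn 1.4+ deprecates it in favor of response_method).
pr_auc_scorer = make_scorer(pr_auc_score, greater_is_better=True,
                            needs_proba=True)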
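
`pr_auc_score` comes from the project's `src` helpers, and its definition is not shown in this notebook. One common way to compute the area under the precision-recall curve is trapezoidal integration over the curve returned by scikit-learn. The sketch below takes that approach; it is an assumption about what the helper does, and `pr_auc_sketch` and the toy arrays are illustrative names only.

```python
import numpy as np
from sklearn.metrics import auc, precision_recall_curve


def pr_auc_sketch(y_true, y_score):
    """Area under the precision-recall curve (trapezoidal rule).

    Illustrative stand-in for the project's pr_auc_score, whose exact
    definition is not shown in this notebook.
    """
    precision, recall, _ = precision_recall_curve(y_true, y_score)
    return auc(recall, precision)


# Toy labels and predicted probabilities for the positive class.
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])
score = pr_auc_sketch(y_true, y_score)

# A classifier that ranks every positive above every negative.
perfect = pr_auc_sketch(y_true, np.array([0.1, 0.2, 0.8, 0.9]))
print(f'toy: {score:.4f}, perfect ranking: {perfect:.4f}')
```

Note that `sklearn.metrics.average_precision_score` summarizes the same curve with a step-wise sum instead of trapezoids, so the two values can differ slightly.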

In [3]:
input_test = '../data/test_final_group_info.csv'
input_train = '../data/train_final_group_info.csv'

cell_type = 'All'

# swapped train and test sets
test_X, test_y, train_X, train_y, custom_cv = data_prep(input_test, input_train,
                                                        cell_type, binarization=False)

Finished data prepration for All
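
The general idea of a swapped evaluation can be illustrated on synthetic data: fit on one split and score on the other, then exchange the two roles and compare. This is only a sketch. The arrays, the `make_split` and `fit_and_score` helpers, the hyperparameters, and the use of `average_precision_score` are all assumptions for illustration and stand in for the notebook's `data_prep` and `pr_auc_score`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(0)


def make_split(n_rows):
    """Hypothetical data: 5 features, label driven mostly by the first one."""
    X = rng.normal(size=(n_rows, 5)).astype('float32')
    y = (X[:, 0] + 0.5 * rng.normal(size=n_rows) > 0).astype(int)
    return X, y


def fit_and_score(fit_X, fit_y, eval_X, eval_y):
    """Fit an elastic net logistic regression, then score it on the other split."""
    model = LogisticRegression(penalty='elasticnet', C=1.0, l1_ratio=0.5,
                               solver='saga', max_iter=5000)
    model.fit(fit_X, fit_y)
    return average_precision_score(eval_y, model.predict_proba(eval_X)[:, 1])


split_a = make_split(400)  # stands in for one of the two data sets
split_b = make_split(100)  # stands in for the other

original = fit_and_score(*split_a, *split_b)  # fit on A, evaluate on B
swapped = fit_and_score(*split_b, *split_a)   # roles exchanged
print(f'original: {original:.3f}, swapped: {swapped:.3f}')
```

If the two scores are close, the result does not depend much on which split was used for fitting.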


### 1. ELN before binarization <a name="1.-before"></a>

In [4]:
from sklearn.linear_model import LogisticRegression

In [5]:
scores = []
final_test = []  # never appended to in this loop, so it is pickled as an empty list
final_models = []
for i in tqdm(range(10)):
    # Each run reshuffles the rows of both sets with its own seed.
    random_state = 42 * i
    X_test, y_test = shuffle(test_X, test_y, random_state=random_state)
    X_train, y_train = shuffle(train_X, train_y, random_state=random_state)

    # Elastic net logistic regression with the selected hyperparameters.
    eln = LogisticRegression(penalty='elasticnet', C=0.046415888336127774, l1_ratio=0.01,
                             solver='saga', max_iter=10000000)
    eln.fit(X_train, y_train)

    # Score the predicted probabilities of the positive class with AUPRC.
    y_pred = eln.predict_proba(X_test)[:, 1]
    auprc = pr_auc_score(y_test, y_pred)

    final_models.append(eln)
    scores.append(auprc)
print(f'auprc: {mean(scores)} ± {stdev(scores)}')

100%|██████████| 10/10 [06:14<00:00, 37.41s/it]

auprc: 0.5659642836172514 ± 1.9810013813039692e-07
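
The very small standard deviation is expected if `shuffle` only permutes rows. The elastic net logistic loss is convex, so a converged fit should land on nearly the same coefficients whatever the row order. The sketch below assumes `shuffle` is `sklearn.utils.shuffle` (in this notebook it arrives through a star import, so that is a guess). On toy arrays it shows that rows and labels stay paired and that the same seed reproduces the same order.

```python
import numpy as np
from sklearn.utils import shuffle

X = np.arange(12).reshape(6, 2)  # toy feature matrix, row i = [2i, 2i + 1]
y = np.arange(6)                 # toy labels, y[i] = i

X_a, y_a = shuffle(X, y, random_state=42)
X_b, y_b = shuffle(X, y, random_state=42)
X_c, y_c = shuffle(X, y, random_state=84)

# Rows and labels are permuted together: the row with label k is still [2k, 2k + 1].
pairs_intact = bool(np.all(X_a[:, 0] == 2 * y_a))
# The same seed gives the same permutation.
same_seed_same_order = bool(np.array_equal(y_a, y_b))
# Shuffling never adds or drops rows.
same_rows = sorted(y_c.tolist()) == y.tolist()
print(pairs_intact, same_seed_same_order, same_rows)
```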





In [6]:
# Save the per-seed scores, test sets, and fitted models.
with open('../results/results_swapped/eln_n_model_test_scores.save', 'wb') as file:
    pickle.dump(scores, file)

with open('../results/results_swapped/eln_n_model_test_sets.save', 'wb') as file:
    pickle.dump(final_test, file)

with open('../results/results_swapped/eln_n_model_test_models.save', 'wb') as file:
    pickle.dump(final_models, file)
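
To reuse these results in a later notebook, the saved files can be read back with `pickle.load`. The sketch below round-trips a made-up score list through a temporary directory. The file name and the values are placeholders, not the notebook's actual outputs.

```python
import os
import pickle
import tempfile
from statistics import mean, stdev

example_scores = [0.91, 0.93, 0.92]  # placeholder values, not real results

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'example_scores.save')  # hypothetical file name

    # Write the list the same way the notebook does.
    with open(path, 'wb') as f:
        pickle.dump(example_scores, f)

    # Read it back in binary mode.
    with open(path, 'rb') as f:
        loaded_scores = pickle.load(f)

print(loaded_scores == example_scores)
print(f'auprc: {mean(loaded_scores):.3f} ± {stdev(loaded_scores):.3f}')
```

Only unpickle files you created yourself or otherwise trust, since `pickle.load` can execute arbitrary code.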

### 2. ELN after binarization <a name="2.-after"></a>

In [7]:
test_X, test_y, train_X, train_y, custom_cv = data_prep(input_test, input_train,
                                                        cell_type, binarization=True)

Finished data prepration for All


In [8]:
scores = []
final_test = []  # never appended to in this loop, so it is pickled as an empty list
final_models = []
for i in tqdm(range(10)):
    # Each run reshuffles the rows of both sets with its own seed.
    random_state = 42 * i
    X_test, y_test = shuffle(test_X, test_y, random_state=random_state)
    X_train, y_train = shuffle(train_X, train_y, random_state=random_state)

    # Elastic net logistic regression with the selected hyperparameters.
    eln = LogisticRegression(penalty='elasticnet', C=0.046415888336127774, l1_ratio=0.01,
                             solver='saga', max_iter=10000000)
    eln.fit(X_train, y_train)

    # Score the predicted probabilities of the positive class with AUPRC.
    y_pred = eln.predict_proba(X_test)[:, 1]
    auprc = pr_auc_score(y_test, y_pred)

    final_models.append(eln)
    scores.append(auprc)
print(f'auprc: {mean(scores)} ± {stdev(scores)}')

100%|██████████| 10/10 [16:43<00:00, 100.40s/it]

auprc: 0.9550785748878134 ± 3.679179010928998e-06





In [9]:
# Save the per-seed scores, test sets, and fitted models.
with open('../results/results_swapped/eln_model_test_scores.save', 'wb') as file:
    pickle.dump(scores, file)

with open('../results/results_swapped/eln_model_test_sets.save', 'wb') as file:
    pickle.dump(final_test, file)

with open('../results/results_swapped/eln_model_test_models.save', 'wb') as file:
    pickle.dump(final_models, file)