# AUROC Results

## Outline

The **MLAging - batch integration and misc** workflow consists of the following sections:

`60 preprocessing_batch.R` Data preprocessing and preparation in Seurat.

`61 Batch Integration Scheme ELN Tuning` Scheme: batch effects within training or test sets. Elastic net (ELN) model tuning using highly variable genes (HVGs) and hyperparameter selection using `GridSearchCV`.

`62 Batch Integration Scheme ELN Result 10x` Run the best ELN model over 10 random seeds.

`63 HVG and Cell Type` Clustering and heatmap showing that HVGs are cell type-specific.

`64 AUROC Results` ELN 10x results reported as AUROC -- **this notebook**.

`65 age_genes.R` Aging database queries.

In [1]:
from src.preprocessing_eln import *
from src.data_processing import *
from src.grid_search import *
from src.packages import *

from sklearn.metrics import roc_auc_score

In [2]:
pr_auc_scorer = make_scorer(pr_auc_score, greater_is_better=True,
                            needs_proba=True)

In [3]:
input_test = '../data/test_final_group_info.csv'
input_train = '../data/train_final_group_info.csv'

cell_type = 'All'

train_X, train_y, test_X, test_y, custom_cv = data_prep(input_test, input_train,
                                                        cell_type, binarization=False)

Finished data prepration for All


In [4]:
scores = []
final_test = []
final_models = []
for i in tqdm(range(10)):
    # Seeds 0, 42, ..., 378 shuffle the row order of the train and test sets
    random_state = 42 * i
    X_test, y_test = shuffle(test_X, test_y, random_state=random_state)
    X_train, y_train = shuffle(train_X, train_y, random_state=random_state)

    # Elastic net logistic regression fitted to the non-binarized data
    eln = LogisticRegression(penalty='elasticnet', C=0.001, l1_ratio=0.35,
                             solver='saga', max_iter=10000000)

    eln.fit(X_train, y_train)

    # Score the test set with the predicted probability of class eln.classes_[1]
    y_pred = eln.predict_proba(X_test)[:, 1]
    auroc = roc_auc_score(y_test, y_pred)

    final_test.append((X_test, y_test))
    final_models.append(eln)
    scores.append(auroc)
print(f'auroc: {mean(scores)} ± {stdev(scores)}')

100%|██████████| 10/10 [01:02<00:00,  6.25s/it]

auroc: 0.5225134538061271 ± 1.9655594411774403e-06
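
An AUROC of about 0.52 is close to the 0.5 that an uninformative ranking would give. As a reminder of how `roc_auc_score` behaves, here is a minimal sketch on made-up labels and scores (the toy values are illustrative only and are not taken from the MLAging data):

```python
from sklearn.metrics import roc_auc_score

# Toy labels: two negatives followed by two positives (illustrative values)
y_true = [0, 0, 1, 1]

# Every positive is ranked above every negative
perfect = roc_auc_score(y_true, [0.1, 0.2, 0.8, 0.9])

# One positive (0.35) is ranked below one negative (0.4),
# so 3 of the 4 positive/negative pairs are ordered correctly
partial = roc_auc_score(y_true, [0.1, 0.4, 0.35, 0.8])

# Identical scores carry no ranking information
uninformative = roc_auc_score(y_true, [0.5, 0.5, 0.5, 0.5])

print(perfect, partial, uninformative)
```

AUROC depends only on how the predicted probabilities rank the samples, which is why the cells in this notebook pass `predict_proba(...)[:, 1]` rather than hard class labels.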





In [5]:
# Save the scores, shuffled test sets, and fitted models for the non-binarized run
with open('../results/revision/eln_model_test_scores_before.save', 'wb') as file:
    pickle.dump(scores, file)

with open('../results/revision/eln_model_test_sets_before.save', 'wb') as file:
    pickle.dump(final_test, file)

with open('../results/revision/eln_model_test_models_before.save', 'wb') as file:
    pickle.dump(final_models, file)
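
The saved objects can be read back with `pickle.load`. The sketch below shows the round trip in a temporary directory with placeholder scores; `load_pickle` is a hypothetical helper name and is not part of the `src` modules imported above.

```python
import os
import pickle
import tempfile


def load_pickle(path):
    """Read back one object written with pickle.dump (hypothetical helper)."""
    with open(path, 'rb') as file:
        return pickle.load(file)


# Placeholder values standing in for the list of per-seed AUROC scores
example_scores = [0.5, 0.6, 0.7]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'example_scores.save')
    with open(path, 'wb') as file:
        pickle.dump(example_scores, file)
    restored = load_pickle(path)

print(restored)
```

In this notebook, `load_pickle('../results/revision/eln_model_test_scores_before.save')` would return the list of ten AUROC values written above. Only unpickle files from a trusted source, since `pickle.load` can execute arbitrary code.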

In [6]:
train_X, train_y, test_X, test_y, custom_cv = data_prep(input_test, input_train,
                                                        cell_type, binarization=True)

Finished data prepration for All
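
The implementation of `data_prep` is not shown in this notebook, so what `binarization=True` does exactly is not visible here. One common meaning for expression data is to replace each value with 1 if the gene is detected and 0 otherwise. The sketch below illustrates that assumption only; `binarize_expression` and its `threshold` parameter are hypothetical and may differ from what `data_prep` actually does.

```python
import numpy as np


def binarize_expression(X, threshold=0):
    """Return 1 where a value exceeds threshold, else 0.

    Hypothetical helper: assumes binarization means detected vs. not detected.
    """
    return (np.asarray(X) > threshold).astype(int)


# Toy matrix: 2 cells x 3 genes (illustrative values)
counts = np.array([[0, 3, 0],
                   [5, 0, 1]])
binary = binarize_expression(counts)

print(binary)
```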


In [7]:
scores = []
final_test = []
final_models = []
for i in tqdm(range(10)):
    # Seeds 0, 42, ..., 378 shuffle the row order of the train and test sets
    random_state = 42 * i
    X_test, y_test = shuffle(test_X, test_y, random_state=random_state)
    X_train, y_train = shuffle(train_X, train_y, random_state=random_state)

    # Elastic net logistic regression fitted to the binarized data
    eln = LogisticRegression(penalty='elasticnet', C=0.046415888336127774, l1_ratio=0.01,
                             solver='saga', max_iter=10000000)

    eln.fit(X_train, y_train)

    # Score the test set with the predicted probability of class eln.classes_[1]
    y_pred = eln.predict_proba(X_test)[:, 1]
    auroc = roc_auc_score(y_test, y_pred)

    final_test.append((X_test, y_test))
    final_models.append(eln)
    scores.append(auroc)
print(f'auroc: {mean(scores)} ± {stdev(scores)}')

100%|██████████| 10/10 [20:16<00:00, 121.61s/it]

auroc: 0.9312506448003369 ± 2.663004245953596e-06





In [8]:
# Save the scores, shuffled test sets, and fitted models for the binarized run
with open('../results/revision/eln_model_test_scores.save', 'wb') as file:
    pickle.dump(scores, file)

with open('../results/revision/eln_model_test_sets.save', 'wb') as file:
    pickle.dump(final_test, file)

with open('../results/revision/eln_model_test_models.save', 'wb') as file:
    pickle.dump(final_models, file)