# ElasticNet results with the best hyperparameters (cell-type-specific models)

## Outline

The **MLAging Cell-type** workflow consists of four sections:

I. Data Preprocessing in Seurat ```preprocessing.R```

II. ElasticNet tuning (hyperparameter selection with ```GridSearchCV```)

III. Final Models Over 10 Random States -- **this notebook**:
1. [Data Preparation](#1.-prep)
2. [Cell Types](#2.-celltypes)
    - [Neuron](#3.-neuron)
    - [Oligodendrocyte](#4.-Oligo)
    - [Astrocyte](#5.-astro)
    - [Microglia](#6.-micro)

IV. Results and Interpretations
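
Section II lives in a separate notebook and is not reproduced here. Purely as orientation, a grid search of that kind might look like the sketch below. The synthetic data, the grid bounds, the `l1_ratio` candidates, the 3-fold split, and the `average_precision` scorer are all assumptions made for illustration. The `C` values used later in this notebook (0.2154..., 4.6415..., 0.0046...) are consistent with a log-spaced grid of three points per decade, but the actual grid is not shown here. In the real workflow the `custom_cv` object returned by `data_prep` would presumably be passed as `cv`.

```python
import warnings

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

warnings.filterwarnings('ignore')  # mirrors the notebook; hides convergence notices

# Synthetic stand-in for one cell type's training matrix (assumption, not the real data)
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

# Hypothetical grid: three values of C per decade and a few l1_ratio candidates
param_grid = {
    'C': np.logspace(-3, 1, 13),
    'l1_ratio': [0, 0.05, 0.5],
}

base_model = LogisticRegression(penalty='elasticnet', solver='saga', max_iter=5000)
search = GridSearchCV(base_model, param_grid, scoring='average_precision',
                      cv=StratifiedKFold(n_splits=3), n_jobs=1)
search.fit(X_demo, y_demo)

best_params = search.best_params_
print(best_params, round(search.best_score_, 3))
```

The selected `C` and `l1_ratio` would then be hard-coded into the final models, as done for each cell type below.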

In [1]:
import warnings
warnings.filterwarnings('ignore')

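# Project helpers: data_prep, shuffle and pr_auc_score are not imported explicitly,
# so they are expected to come from these star imports
from src.data_processing import *
from src.grid_search import *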
import os
import numpy as np
from sklearn.metrics import make_scorer
from tqdm import tqdm
import pickle
from sklearn.linear_model import LogisticRegression
from statistics import mean, stdev

data_type = 'float32'

## 1. Data Preparation <a name="1.-prep"></a>
### Load training and testing batches

In [2]:
input_test = '../data/test_final_group_info.csv'
input_train = '../data/train_final_group_info.csv'

## 2. Cell-type-specific models <a name="2.-celltypes"></a>

### a) Neuron <a name="3.-neuron"></a>

In [3]:
cell_type = 'Neuron'

train_X, train_y, test_X, test_y, custom_cv = data_prep(input_test, input_train,
                                                        cell_type, binarization=True)

Finished data prepration for Neuron


In [4]:
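scores = []        # test-set AUPRC, one value per repeat
final_test = []    # shuffled (X_test, y_test) pair used in each repeat
final_models = []  # fitted model from each repeat
for i in tqdm(range(10)):
    # The seed changes with each repeat (0, 42, ..., 378) and drives the row shuffling
    random_state = 42 * i
    X_test, y_test = shuffle(test_X, test_y, random_state=random_state)
    X_train, y_train = shuffle(train_X, train_y, random_state=random_state)

    # Best hyperparameters for Neuron; the solver itself is left unseeded
    eln = LogisticRegression(penalty='elasticnet', C=0.21544346900318834, l1_ratio=0.05,
                             solver='saga', max_iter=10000000)
    eln.fit(X_train, y_train)

    # Score the predicted probability of the second class (classes_[1])
    y_pred = eln.predict_proba(X_test)[:, 1]
    auprc = pr_auc_score(y_test, y_pred)

    final_test.append((X_test, y_test))
    final_models.append(eln)
    scores.append(auprc)
print(f'auprc: {mean(scores)} ± {stdev(scores)}')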

100%|██████████| 10/10 [35:51<00:00, 215.10s/it]

auprc: 0.9667214532669696 ± 1.9092280532864465e-06
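
The `pr_auc_score` helper comes from the project's `src` package and its definition is not shown in this notebook. A common way to compute an area under the precision-recall curve is trapezoidal integration over the curve returned by scikit-learn, sketched below. The function name `pr_auc_sketch` and the toy labels are hypothetical, and the project's helper may differ (for example, it could use `average_precision_score` instead).

```python
import numpy as np
from sklearn.metrics import auc, precision_recall_curve


def pr_auc_sketch(y_true, y_score):
    """Area under the precision-recall curve via trapezoidal integration."""
    precision, recall, _ = precision_recall_curve(y_true, y_score)
    return auc(recall, precision)


# Toy labels and scores (illustrative only, not data from this notebook)
y_true = np.array([0, 0, 1, 1])
perfect = pr_auc_sketch(y_true, np.array([0.1, 0.2, 0.8, 0.9]))   # classes fully separated
mixed = pr_auc_sketch(y_true, np.array([0.1, 0.4, 0.35, 0.8]))    # one pair out of order
print(perfect, mixed)
```

A perfectly separating score vector reaches the maximum of 1, and any misordering between positives and negatives lowers the area.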





In [5]:
# Persist the per-repeat scores, shuffled test sets, and fitted models for this cell type
out_prefix = '../results/cell_type_best/' + str(cell_type) + '_eln_model_test_'
for suffix, obj in [('scores', scores), ('sets', final_test), ('models', final_models)]:
    with open(out_prefix + suffix + '.save', 'wb') as file:
        pickle.dump(obj, file)
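
The saved files can be read back in the results notebook with `pickle.load`. The sketch below shows one way to do that. The helper name `load_saved` is hypothetical, and the demonstration writes made-up placeholder scores to a temporary directory so that it runs without the real `../results/cell_type_best/` files. Note that unpickling the fitted models requires a scikit-learn version compatible with the one used for saving, and pickles should only be loaded from trusted sources.

```python
import os
import pickle
import tempfile
from statistics import mean, stdev


def load_saved(results_dir, cell_type, kind):
    """Load one '<cell_type>_eln_model_test_<kind>.save' pickle."""
    path = os.path.join(results_dir, f'{cell_type}_eln_model_test_{kind}.save')
    with open(path, 'rb') as handle:
        return pickle.load(handle)


# Demonstration with placeholder values in a temporary directory
with tempfile.TemporaryDirectory() as tmp_dir:
    placeholder_scores = [0.90, 0.91, 0.92]  # made-up values, not results
    demo_path = os.path.join(tmp_dir, 'Neuron_eln_model_test_scores.save')
    with open(demo_path, 'wb') as handle:
        pickle.dump(placeholder_scores, handle)
    reloaded = load_saved(tmp_dir, 'Neuron', 'scores')

print(f'auprc: {mean(reloaded):.3f} ± {stdev(reloaded):.3f}')
```

With the real files, `kind` would be `'scores'`, `'sets'`, or `'models'`, matching the three suffixes written above.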

### b) Oligodendrocyte <a name="4.-Oligo"></a>

In [6]:
cell_type = 'Oligodendrocyte'

train_X, train_y, test_X, test_y, custom_cv = data_prep(input_test, input_train,
                                                        cell_type, binarization=True)

Finished data prepration for Oligodendrocyte


In [7]:
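scores = []        # test-set AUPRC, one value per repeat
final_test = []    # shuffled (X_test, y_test) pair used in each repeat
final_models = []  # fitted model from each repeat
for i in tqdm(range(10)):
    # The seed changes with each repeat (0, 42, ..., 378) and drives the row shuffling
    random_state = 42 * i
    X_test, y_test = shuffle(test_X, test_y, random_state=random_state)
    X_train, y_train = shuffle(train_X, train_y, random_state=random_state)

    # Best hyperparameters for Oligodendrocyte; the solver itself is left unseeded
    eln = LogisticRegression(penalty='elasticnet', C=4.6415888336127775, l1_ratio=0.05,
                             solver='saga', max_iter=10000000)
    eln.fit(X_train, y_train)

    # Score the predicted probability of the second class (classes_[1])
    y_pred = eln.predict_proba(X_test)[:, 1]
    auprc = pr_auc_score(y_test, y_pred)

    final_test.append((X_test, y_test))
    final_models.append(eln)
    scores.append(auprc)
print(f'auprc: {mean(scores)} ± {stdev(scores)}')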

100%|██████████| 10/10 [02:20<00:00, 14.09s/it]

auprc: 0.9852303669579788 ± 0.00022924765090377315





In [9]:
# Persist the per-repeat scores, shuffled test sets, and fitted models for this cell type
out_prefix = '../results/cell_type_best/' + str(cell_type) + '_eln_model_test_'
for suffix, obj in [('scores', scores), ('sets', final_test), ('models', final_models)]:
    with open(out_prefix + suffix + '.save', 'wb') as file:
        pickle.dump(obj, file)

### c) Astrocyte <a name="5.-astro"></a>

In [10]:
cell_type = 'Astrocyte'

train_X, train_y, test_X, test_y, custom_cv = data_prep(input_test, input_train,
                                                        cell_type, binarization=True)

Finished data prepration for Astrocyte


In [11]:
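scores = []        # test-set AUPRC, one value per repeat
final_test = []    # shuffled (X_test, y_test) pair used in each repeat
final_models = []  # fitted model from each repeat
for i in tqdm(range(10)):
    # The seed changes with each repeat (0, 42, ..., 378) and drives the row shuffling
    random_state = 42 * i
    X_test, y_test = shuffle(test_X, test_y, random_state=random_state)
    X_train, y_train = shuffle(train_X, train_y, random_state=random_state)

    # Best hyperparameters for Astrocyte; l1_ratio=0 drops the L1 term (pure L2 penalty)
    eln = LogisticRegression(penalty='elasticnet', C=1, l1_ratio=0,
                             solver='saga', max_iter=10000000)
    eln.fit(X_train, y_train)

    # Score the predicted probability of the second class (classes_[1])
    y_pred = eln.predict_proba(X_test)[:, 1]
    auprc = pr_auc_score(y_test, y_pred)

    final_test.append((X_test, y_test))
    final_models.append(eln)
    scores.append(auprc)
print(f'auprc: {mean(scores)} ± {stdev(scores)}')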

100%|██████████| 10/10 [00:56<00:00,  5.67s/it]

auprc: 0.9594332552035916 ± 0.00012088829201947426
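
For Astrocyte (and Microglia below) the selected `l1_ratio` is 0. In scikit-learn's parameterization this removes the L1 term, so the elastic-net model reduces to L2-regularized logistic regression and no coefficients are driven exactly to zero by the penalty. The sketch below checks that equivalence on synthetic data; the dataset, the tolerance, and the fixed solver seed are assumptions chosen only to make the comparison deterministic.

```python
import warnings

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

warnings.filterwarnings('ignore')  # mirrors the notebook

# Synthetic stand-in data (assumption, not the notebook's expression matrix)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Same settings for both fits, including the solver seed
common = dict(C=1, solver='saga', max_iter=10000, tol=1e-8, random_state=0)
enet = LogisticRegression(penalty='elasticnet', l1_ratio=0, **common).fit(X_demo, y_demo)
ridge = LogisticRegression(penalty='l2', **common).fit(X_demo, y_demo)

# Largest absolute difference between the two coefficient vectors
max_gap = np.max(np.abs(enet.coef_ - ridge.coef_))
print(max_gap)
```

Keeping `penalty='elasticnet'` with `l1_ratio=0`, as the notebook does, has the practical benefit that all four cell types share one code path.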





In [12]:
# Persist the per-repeat scores, shuffled test sets, and fitted models for this cell type
out_prefix = '../results/cell_type_best/' + str(cell_type) + '_eln_model_test_'
for suffix, obj in [('scores', scores), ('sets', final_test), ('models', final_models)]:
    with open(out_prefix + suffix + '.save', 'wb') as file:
        pickle.dump(obj, file)

### d) Microglia <a name="6.-micro"></a>

In [13]:
cell_type = 'Microglia'

train_X, train_y, test_X, test_y, custom_cv = data_prep(input_test, input_train,
                                                        cell_type, binarization=True)

Finished data prepration for Microglia


In [14]:
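scores = []        # test-set AUPRC, one value per repeat
final_test = []    # shuffled (X_test, y_test) pair used in each repeat
final_models = []  # fitted model from each repeat
for i in tqdm(range(10)):
    # The seed changes with each repeat (0, 42, ..., 378) and drives the row shuffling
    random_state = 42 * i
    X_test, y_test = shuffle(test_X, test_y, random_state=random_state)
    X_train, y_train = shuffle(train_X, train_y, random_state=random_state)

    # Best hyperparameters for Microglia; l1_ratio=0 drops the L1 term (pure L2 penalty)
    eln = LogisticRegression(penalty='elasticnet', C=0.004641588833612782, l1_ratio=0,
                             solver='saga', max_iter=10000000)
    eln.fit(X_train, y_train)

    # Score the predicted probability of the second class (classes_[1])
    y_pred = eln.predict_proba(X_test)[:, 1]
    auprc = pr_auc_score(y_test, y_pred)

    final_test.append((X_test, y_test))
    final_models.append(eln)
    scores.append(auprc)
print(f'auprc: {mean(scores)} ± {stdev(scores)}')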

100%|██████████| 10/10 [00:13<00:00,  1.35s/it]

auprc: 0.9968936862401703 ± 2.709511039178658e-06





In [15]:
# Persist the per-repeat scores, shuffled test sets, and fitted models for this cell type
out_prefix = '../results/cell_type_best/' + str(cell_type) + '_eln_model_test_'
for suffix, obj in [('scores', scores), ('sets', final_test), ('models', final_models)]:
    with open(out_prefix + suffix + '.save', 'wb') as file:
        pickle.dump(obj, file)
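
The four cell-type sections above differ only in `cell_type`, `C`, and `l1_ratio`, so the repeated loop could be folded into one helper driven by a parameter table. The sketch below is one possible shape for that refactor, not the notebook's actual code. The names `BEST_PARAMS` and `fit_repeats` are hypothetical, the trapezoidal AUPRC is a stand-in for the project's `pr_auc_score`, `shuffle` is assumed to be `sklearn.utils.shuffle`, and the demonstration runs on synthetic data with fewer repeats and a smaller `max_iter`.

```python
import warnings
from statistics import mean, stdev

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, precision_recall_curve
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle

warnings.filterwarnings('ignore')  # mirrors the notebook

# Hyperparameters copied from the four cells above
BEST_PARAMS = {
    'Neuron': {'C': 0.21544346900318834, 'l1_ratio': 0.05},
    'Oligodendrocyte': {'C': 4.6415888336127775, 'l1_ratio': 0.05},
    'Astrocyte': {'C': 1, 'l1_ratio': 0},
    'Microglia': {'C': 0.004641588833612782, 'l1_ratio': 0},
}


def fit_repeats(train_X, train_y, test_X, test_y, params, n_repeats=10, max_iter=10000000):
    """Fit one model per shuffle seed and return test AUPRC values, test sets, and models."""
    scores, test_sets, models = [], [], []
    for i in range(n_repeats):
        seed = 42 * i
        X_tr, y_tr = shuffle(train_X, train_y, random_state=seed)
        X_te, y_te = shuffle(test_X, test_y, random_state=seed)
        model = LogisticRegression(penalty='elasticnet', solver='saga',
                                   max_iter=max_iter, **params)
        model.fit(X_tr, y_tr)
        # Stand-in for pr_auc_score: trapezoidal area under the precision-recall curve
        precision, recall, _ = precision_recall_curve(y_te, model.predict_proba(X_te)[:, 1])
        scores.append(auc(recall, precision))
        test_sets.append((X_te, y_te))
        models.append(model)
    return scores, test_sets, models


# Synthetic stand-in data (assumption); the notebook would call data_prep per cell type
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0, stratify=y)
demo_scores, demo_sets, demo_models = fit_repeats(
    X_tr, y_tr, X_te, y_te, BEST_PARAMS['Astrocyte'], n_repeats=3, max_iter=5000)
print(f'auprc: {mean(demo_scores):.3f} ± {stdev(demo_scores):.3f}')
```

In the notebook itself, a loop over `BEST_PARAMS.items()` would call `data_prep` for each cell type, pass the result to the helper, and then pickle the three returned lists.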