# Batch Integration Scheme ELN Result 10x

## Outline

The **MLAging - batch integration and misc** workflow consists of the following sections:

`60 preprocessing_batch.R` Data preprocessing and preparation in Seurat.

`61 Batch Integration Scheme ELN Tuning` Scheme: batch effects within training or test sets. ELN model tuning using highly variable genes (HVGs), with hyperparameter selection using `GridSearchCV`.

`62 Batch Integration Scheme ELN Result 10x` Runs the best ELN model with 10 random seeds (**this notebook**).

`63 HVG and Cell Type` Clustering and heatmap showing that HVGs are cell type-specific.

`64 AUROC Results` ELN 10x results, reported as AUPRC.

`65 age_genes.R` Aging database queries.

In [1]:
from src.data_processing import *
from src.grid_search import *
from src.packages import *

In [2]:
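def assign_target(input_df):
    """Read a CSV of cells and add 'animal' and 'target' columns."""
    input_df = pd.read_csv(input_df, index_col=0)
    # The last character of each cell's index is the animal ID.
    input_df['animal'] = input_df.index.str[-1]
    # Animals 3 and 4 form the positive class (target = 1).
    input_df['target'] = input_df['animal'].isin(['3', '4']).astype(int)
    return input_df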
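
To make the labeling rule concrete, here is a minimal sketch on a toy frame. The cell barcodes and gene names are invented for illustration; only the rule itself (the last character of the index is the animal ID, and animals 3 and 4 get target 1) comes from `assign_target`.

```python
import pandas as pd

# Hypothetical barcodes: the trailing character stands in for the animal ID.
toy = pd.DataFrame(
    {"gene_a": [0.1, 1.2, -0.3, 0.8], "gene_b": [0.0, 0.5, 2.1, -1.0]},
    index=["AAAC-1", "AAAG-2", "AAAT-3", "AACA-4"],
)

# Same two steps as assign_target, minus the CSV read.
toy["animal"] = toy.index.str[-1]
toy["target"] = toy["animal"].isin(["3", "4"]).astype(int)

print(toy[["animal", "target"]])
```

Because both helper columns are appended at the end of the frame, later code can drop them by position.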

In [3]:
def train_test_split(input_train, input_test, binarization=False):
    # Note: this name shadows sklearn.model_selection.train_test_split if that is imported.
    df_test = assign_target(input_test)
    test_X = df_test.iloc[:, :-2]  # drop the trailing 'animal' and 'target' columns
    test_y = df_test.target
    test_X, test_y = shuffle(test_X, test_y, random_state=42)

    df_train = assign_target(input_train)
    train = df_train.reset_index()  # the old index becomes column 0
    custom_cv = customized_cv_index(train)

    train_X = train.iloc[:, 1:-2]  # skip the index column, 'animal', and 'target'
    train_y = train.target

    if binarization:
        test_X = binarize_data(test_X)
        train_X = binarize_data(train_X)

    return train_X, train_y, test_X, test_y, custom_cv
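
The two `iloc` slices in this function differ because `reset_index()` moves the barcode index into a new first column. A small sketch with a made-up frame shows why the training slice starts at column 1 while the test slice starts at column 0; both drop the trailing `animal` and `target` columns.

```python
import pandas as pd

# Made-up frame laid out like the output of assign_target:
# gene columns first, then 'animal' and 'target' appended at the end.
df = pd.DataFrame(
    {"gene_a": [0.1, 1.2], "gene_b": [0.0, 0.5],
     "animal": ["1", "3"], "target": [0, 1]},
    index=["AAAC-1", "AAAT-3"],
)

# Test-style slice: the index is untouched, so dropping the last two columns is enough.
test_X = df.iloc[:, :-2]

# Train-style slice: reset_index() inserts the old index as column 0,
# so the slice must also skip that first column.
train = df.reset_index()
train_X = train.iloc[:, 1:-2]

print(list(test_X.columns), list(train_X.columns))
```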

In [4]:
input_train = '../../MLAging/data/batch3_train_hvg2k_std_integrated.csv'
input_test = '../../MLAging/data/batch3_test_hvg2k_std_integrated.csv'

In [5]:
train_X, train_y, test_X, test_y, custom_cv = train_test_split(input_train, input_test, binarization=False)

### 1. L1 <a name="1.-l1"></a>

In [7]:
from sklearn.linear_model import LogisticRegression

In [10]:
scores = []
final_test = []
final_models = []
for i in tqdm(range(10)):
    # The seed only controls row order; LogisticRegression gets no random_state here.
    random_state = 42 * i
    X_test, y_test = shuffle(test_X, test_y, random_state=random_state)
    X_train, y_train = shuffle(train_X, train_y, random_state=random_state)

    eln = LogisticRegression(penalty='elasticnet', C=0.01, l1_ratio=0.25118864315095796,
                             solver='saga', max_iter=10000000)
    eln.fit(X_train, y_train)

    # Score the positive-class probabilities with AUPRC.
    y_pred = eln.predict_proba(X_test)[:, 1]
    auprc = pr_auc_score(y_test, y_pred)

    final_test.append((X_test, y_test))
    final_models.append(eln)
    scores.append(auprc)
print(f'auprc: {mean(scores)} ± {stdev(scores)}')

100%|██████████| 10/10 [04:28<00:00, 26.86s/it]

auprc: 0.7449221718789035 ± 2.003299713783117e-07
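
`pr_auc_score` comes from the project's `src` package, and its definition is not shown in this notebook. As an assumption, one common way to compute the area under the precision-recall curve with scikit-learn is sketched below; the project's helper may differ (for example, it could use `average_precision_score` instead). The name `pr_auc_sketch` and the toy inputs are hypothetical.

```python
from sklearn.metrics import auc, precision_recall_curve


def pr_auc_sketch(y_true, y_score):
    """Illustrative stand-in for pr_auc_score (assumed, not the project's code)."""
    precision, recall, _ = precision_recall_curve(y_true, y_score)
    # Trapezoidal area under the precision-recall curve.
    return auc(recall, precision)


# Toy labels and scores, invented for the example.
y_true = [0, 0, 1, 1]
print(pr_auc_sketch(y_true, [0.1, 0.2, 0.8, 0.9]))   # perfectly ranked scores
print(pr_auc_sketch(y_true, [0.1, 0.4, 0.35, 0.8]))  # one positive ranked too low
```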

In [11]:
# Persist the scores, fitted models, and shuffled test sets from the run above.
with open('../results/results_batch/before_eln_model_test_scores.save', 'wb') as file:
    pickle.dump(scores, file)

with open('../results/results_batch/before_eln_model_test_models.save', 'wb') as file:
    pickle.dump(final_models, file)

with open('../results/results_batch/before_eln_model_test_sets.save', 'wb') as file:
    pickle.dump(final_test, file)

In [12]:
train_X, train_y, test_X, test_y, custom_cv = train_test_split(input_train, input_test, binarization=True)

In [13]:
scores = []
final_test = []
final_models = []
for i in tqdm(range(10)):
    # The seed only controls row order; LogisticRegression gets no random_state here.
    random_state = 42 * i
    X_test, y_test = shuffle(test_X, test_y, random_state=random_state)
    X_train, y_train = shuffle(train_X, train_y, random_state=random_state)

    # l1_ratio=1.0 makes the elastic-net penalty a pure L1 penalty.
    eln = LogisticRegression(penalty='elasticnet', C=4.6415888336127775, l1_ratio=1.0,
                             solver='saga', max_iter=10000000)
    eln.fit(X_train, y_train)

    # Score the positive-class probabilities with AUPRC.
    y_pred = eln.predict_proba(X_test)[:, 1]
    auprc = pr_auc_score(y_test, y_pred)

    final_test.append((X_test, y_test))
    final_models.append(eln)
    scores.append(auprc)
print(f'auprc: {mean(scores)} ± {stdev(scores)}')

100%|██████████| 10/10 [2:03:09<00:00, 738.96s/it] 

auprc: 0.9694553113194376 ± 1.2216231460646335e-05
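
The summary line uses `mean` and `stdev`, which are presumably the standard-library `statistics` functions re-exported by `src.packages` (an assumption, since that module is not shown). Under that assumption, the seed schedule and the summary format can be reproduced as below; the scores are placeholders, not results from this notebook.

```python
from statistics import mean, stdev

# Seeds used by the loop: 42 * i for i in 0..9.
seeds = [42 * i for i in range(10)]
print(seeds)

# Placeholder scores, invented for illustration only.
scores = [0.90, 0.92, 0.91, 0.93]

# stdev is the sample standard deviation (it divides by n - 1).
print(f'auprc: {mean(scores)} ± {stdev(scores)}')
```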

In [14]:
# Persist the scores, fitted models, and shuffled test sets from the run above.
with open('../results/results_batch/after_eln_model_test_scores.save', 'wb') as file:
    pickle.dump(scores, file)

with open('../results/results_batch/after_eln_model_test_models.save', 'wb') as file:
    pickle.dump(final_models, file)

with open('../results/results_batch/after_eln_model_test_sets.save', 'wb') as file:
    pickle.dump(final_test, file)
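
To reuse these results in a later notebook, the saved objects can be read back with `pickle.load`. The sketch below round-trips a list of scores through a temporary file so that it runs anywhere; the file name mirrors the ones used here, and the scores are placeholders rather than notebook results.

```python
import os
import pickle
import tempfile

scores = [0.90, 0.92, 0.91]  # placeholder values, not notebook results

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'after_eln_model_test_scores.save')

    # Write with a context manager so the file is closed even on error.
    with open(path, 'wb') as file:
        pickle.dump(scores, file)

    # Read the object back in binary mode.
    with open(path, 'rb') as file:
        loaded_scores = pickle.load(file)

print(loaded_scores)
```

Pickled scikit-learn models are generally only safe to load with the same library version that wrote them, and pickle files should only be loaded from trusted sources.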