# All-cell Model Tuning -- before count binarization

## Outline

The **MLAging - all-cell** workflow consists of the following sections:

`00 preprocessing.R` Data preprocessing and preparation in Seurat.

`111 All-cell Model Tuning - Before Binarization` ML model tuning using *non-binarized* HVGs, with hyperparameter selection using `GridSearchCV` -- **this notebook:**
1. [Data Preparation](#1.-prep)
2. [Model Tuning](#2.-tunning)
    - [Lasso](#3.-l1)
    - [Ridge](#4.-l2)
    - [ElasticNet](#5.-eln)
    
    - [Random Forest](#6.-rfc)
    - [XGBoost](#7.-xgbc)
    
    - [Support Vector Machine with RBF kernel](#8.-svc)

`112 All-cell Model Tuning - After Binarization` ML model tuning using *binarized* HVGs.

`121 All-cell Model 10x - Before Binarization` Run the best models for *non-binarized* HVGs over 10 random seeds.

`122 All-cell Model 10x - After Binarization` Run the best models for *binarized* HVGs over 10 random seeds.
 
`123 All-cell Model 10x Swapped Train-Test` Run the best models for *binarized* HVGs over 10 random seeds, with the training and test sets swapped to check that sequencing throughput does not affect model performance.

`13 All-cell Model Result Viz` Result visualization.

`14 All-cell ELN Interpretation` Result interpretation. 

In [1]:
import warnings
warnings.filterwarnings('ignore')

from src.data_processing import *
from src.grid_search import *
import os
import numpy as np
from sklearn.metrics import make_scorer
from sklearn.linear_model import LogisticRegression
from tqdm import tqdm
import pickle

data_type = 'float32'

## 1. Data Preparation <a name="1.-prep"></a>
### Load training, testing batch

In [2]:
input_test = '../data/test_final_group_info.csv'
input_train = '../data/train_final_group_info.csv'

cell_type = 'All'

In [3]:
train_X, train_y, test_X, test_y, custom_cv = data_prep(input_test, input_train,
                                                        cell_type, binarization=False)

Finished data prepration for All
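
The internals of `data_prep` are not shown in this notebook, so how `custom_cv` is built is unknown here. As a hedged illustration only, one common way to produce a custom CV object that `GridSearchCV` accepts is a list of `(train_idx, val_idx)` pairs that keep each group (for example an animal or batch) entirely on one side of the split. The group labels and the name `custom_cv_sketch` below are hypothetical.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Toy data standing in for the real expression matrix (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(12, 3))
y = np.array([0, 1] * 6)
groups = np.repeat([0, 1, 2, 3], 3)  # hypothetical animal/batch IDs

# Materialize the splits as a list of (train_idx, val_idx) pairs;
# GridSearchCV accepts such an iterable through its `cv` argument.
custom_cv_sketch = list(GroupKFold(n_splits=4).split(X, y, groups))

for train_idx, val_idx in custom_cv_sketch:
    # No group appears in both the training and validation folds.
    assert set(groups[train_idx]).isdisjoint(groups[val_idx])
    print('train groups:', sorted(set(groups[train_idx])),
          'val groups:', sorted(set(groups[val_idx])))
```

Splitting by group rather than by cell avoids scoring a model on cells from a sample it has already seen during fitting.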


In [4]:
pr_auc_scorer = make_scorer(pr_auc_score, greater_is_better=True,
                            needs_proba=True)
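
`pr_auc_score` comes from `src.grid_search` and its definition is not shown here. A minimal sketch of what such a metric might look like, assuming it is the area under the precision-recall curve computed from predicted probabilities (the name `pr_auc_score_sketch` is hypothetical):

```python
import numpy as np
from sklearn.metrics import auc, precision_recall_curve

def pr_auc_score_sketch(y_true, y_score):
    """Area under the precision-recall curve (trapezoidal rule)."""
    precision, recall, _ = precision_recall_curve(y_true, y_score)
    # recall is monotonically decreasing, which auc() accepts.
    return auc(recall, precision)

# Toy labels and predicted probabilities of the positive class.
y_true = np.array([0, 0, 1, 1])
y_prob = np.array([0.1, 0.2, 0.8, 0.9])

perfect = pr_auc_score_sketch(y_true, y_prob)
print('PR-AUC on perfectly separated scores:', perfect)
```

Note that scikit-learn 1.4 deprecated the `needs_proba` argument of `make_scorer` in favor of `response_method='predict_proba'`, so the call above may need adjusting on newer versions.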

## 2. Model tuning<a name="2.-tunning"></a>

### 1) Logistic regression -- l1<a name="3.-l1"></a>

In [5]:
from sklearn.linear_model import LogisticRegression

In [6]:
l1 = LogisticRegression(penalty='l1', solver='saga', max_iter=10000000)
# 0.01, 0.05, 0.1, 0.5, 1, 5, 8, 10, 20, 50, 100 
# 12.5, 15, 17.5, 20, 25, 30, 35, 40
param_grid = {'logisticregression__C': np.logspace(-3, 2, 10)}

for i in tqdm(range(4)):
    grid, test_score = ML_pipeline_GridSearchCV(train_X, train_y, test_X, test_y, 
                                                l1, param_grid, i, custom_cv, pr_auc_scorer)
    print(grid.best_params_)
    print('best CV score:', grid.best_score_)
    print('test score:',test_score)

 25%|██▌       | 1/4 [05:15<15:47, 315.79s/it]

{'logisticregression__C': 0.001}
best CV score: 0.7563551215370854
test score: 0.6485430254049704


 50%|█████     | 2/4 [10:38<10:39, 319.62s/it]

{'logisticregression__C': 0.001}
best CV score: 0.7563551215370854
test score: 0.6485430254049704


 75%|███████▌  | 3/4 [16:10<05:25, 325.41s/it]

{'logisticregression__C': 0.001}
best CV score: 0.7563551215370854
test score: 0.6485430254049704


100%|██████████| 4/4 [21:52<00:00, 328.21s/it]

{'logisticregression__C': 0.001}
best CV score: 0.7563551215370854
test score: 0.6485430254049704
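
The `logisticregression__C` key in the grid above follows scikit-learn's pipeline convention: `make_pipeline` names each step after its lowercased class name, and `step__param` addresses a parameter of that step. `ML_pipeline_GridSearchCV` itself lives in `src.grid_search` and is not shown, so the function below is only a sketch of what such a helper might do. The `StandardScaler` step, the `'average_precision'` scoring string, the synthetic data, and the name `grid_search_sketch` are all assumptions.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def grid_search_sketch(X_train, y_train, X_test, y_test,
                       estimator, param_grid, seed, cv):
    """Hypothetical stand-in for ML_pipeline_GridSearchCV."""
    # Copy the estimator so the caller's object is not mutated.
    est = clone(estimator).set_params(random_state=seed)
    pipe = make_pipeline(StandardScaler(), est)
    grid = GridSearchCV(pipe, param_grid, scoring='average_precision', cv=cv)
    grid.fit(X_train, y_train)
    # Score the refit best pipeline on the held-out test set.
    test_score = grid.score(X_test, y_test)
    return grid, test_score

# Small synthetic problem so the sketch runs in seconds.
X, y = make_classification(n_samples=300, n_features=20, random_state=0)
X_tr, y_tr, X_te, y_te = X[:200], y[:200], X[200:], y[200:]

param_grid = {'logisticregression__C': np.logspace(-3, 2, 10)}
grid, test_score = grid_search_sketch(
    X_tr, y_tr, X_te, y_te,
    LogisticRegression(max_iter=1000), param_grid,
    seed=0, cv=StratifiedKFold(n_splits=3))

print(list(grid.best_estimator_.named_steps))
print(grid.best_params_, grid.best_score_, test_score)
```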





### 2) Logistic regression -- l2<a name="4.-l2"></a>

In [10]:
l2 = LogisticRegression(penalty='l2', solver='saga', max_iter=10000000)
param_grid = {'logisticregression__C': np.logspace(-3, 2, 10)}

for i in tqdm(range(4)):
    grid, test_score = ML_pipeline_GridSearchCV(train_X, train_y, test_X, test_y, 
                                                l2, param_grid, i, custom_cv, pr_auc_scorer) 
    print(grid.best_params_)
    print('best CV score:', grid.best_score_)
    print('test score:',test_score)

 25%|██▌       | 1/4 [03:35<10:46, 215.62s/it]

{'logisticregression__C': 0.001}
best CV score: 0.5226013565858811
test score: 0.6969598755704333


 50%|█████     | 2/4 [07:09<07:09, 214.72s/it]

{'logisticregression__C': 0.001}
best CV score: 0.5226015946461939
test score: 0.6969601798254196


 75%|███████▌  | 3/4 [10:34<03:30, 210.19s/it]

{'logisticregression__C': 0.001}
best CV score: 0.5226008220123401
test score: 0.6969601259286972


100%|██████████| 4/4 [14:16<00:00, 214.09s/it]

{'logisticregression__C': 0.001}
best CV score: 0.5226018591962115
test score: 0.6969597984390901
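
In both the l1 and l2 runs above, the selected `C` is 0.001, which is the smallest value in `np.logspace(-3, 2, 10)`. When the optimum sits on the edge of a grid, the true optimum may lie outside it. The snippet below checks for that condition and builds a wider grid; the extended range `C_grid_wide` is a hypothetical choice, not something used in this notebook.

```python
import numpy as np

# The grid used above: 10 log-spaced values from 1e-3 to 1e2.
C_grid = np.logspace(-3, 2, 10)
print(C_grid)

best_C = 0.001  # value reported by grid.best_params_ in the runs above
on_lower_edge = np.isclose(best_C, C_grid.min())
print('best C on lower edge of grid:', on_lower_edge)

# Hypothetical wider grid reaching two decades lower (half-decade steps).
C_grid_wide = np.logspace(-5, 2, 15)
print(C_grid_wide)
```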





### 3) Logistic regression -- ElasticNet<a name="5.-eln"></a>

In [11]:
eln = LogisticRegression(penalty='elasticnet', solver='saga', max_iter=10000000)
param_grid = {'logisticregression__C':  np.logspace(-3, 2, 10),
             'logisticregression__l1_ratio': [0.05, 0.1, 0.2, 0.35]}

for i in tqdm(range(4)):
    grid, test_score = ML_pipeline_GridSearchCV(train_X, train_y, test_X, test_y, 
                                                eln, param_grid, i, custom_cv, pr_auc_scorer)
    
    print(grid.best_params_)
    print('best CV score:', grid.best_score_)
    print('test score:',test_score)

 25%|██▌       | 1/4 [21:26<1:04:20, 1286.91s/it]

{'logisticregression__C': 0.001, 'logisticregression__l1_ratio': 0.35}
best CV score: 0.5363368197271268
test score: 0.6668600137364138


 50%|█████     | 2/4 [40:21<39:54, 1197.45s/it]  

{'logisticregression__C': 0.001, 'logisticregression__l1_ratio': 0.35}
best CV score: 0.5363361898255219
test score: 0.6668619514605549


 75%|███████▌  | 3/4 [52:29<16:22, 982.94s/it] 

{'logisticregression__C': 0.001, 'logisticregression__l1_ratio': 0.35}
best CV score: 0.5363366807473459
test score: 0.6668614142927457


100%|██████████| 4/4 [1:11:19<00:00, 1069.84s/it]

{'logisticregression__C': 0.001, 'logisticregression__l1_ratio': 0.35}
best CV score: 0.5363370134458996
test score: 0.6668626931111988





### 4) Random Forest Classifier<a name="6.-rfc"></a>

In [12]:
from sklearn.ensemble import RandomForestClassifier

In [13]:
rfc = RandomForestClassifier()
param_grid = {'randomforestclassifier__max_features': [10, 15, 20, 25, 50, None],
              'randomforestclassifier__max_depth': [10, 20, 30, 50, 100, None],
              'randomforestclassifier__min_samples_split': [2, 5, 10, 20]}

for i in tqdm(range(4)):
    grid, test_score = ML_pipeline_GridSearchCV(train_X, train_y, test_X, test_y,
                                                rfc, param_grid, i, custom_cv, pr_auc_scorer)

    print(grid.best_params_)
    print('best CV score:', grid.best_score_)
    print('test score:',test_score)

 25%|██▌       | 1/4 [3:34:18<10:42:54, 12858.05s/it]

{'randomforestclassifier__max_depth': 20, 'randomforestclassifier__max_features': 20, 'randomforestclassifier__min_samples_split': 10}
best CV score: 0.9501298298018654
test score: 0.5535345169903746


 50%|█████     | 2/4 [7:01:48<7:00:36, 12618.33s/it] 

{'randomforestclassifier__max_depth': 30, 'randomforestclassifier__max_features': 50, 'randomforestclassifier__min_samples_split': 10}
best CV score: 0.951880274511343
test score: 0.5326231548073438


 75%|███████▌  | 3/4 [9:09:41<2:52:40, 10360.01s/it]

{'randomforestclassifier__max_depth': 10, 'randomforestclassifier__max_features': 50, 'randomforestclassifier__min_samples_split': 5}
best CV score: 0.9543131058740284
test score: 0.5737193567438823


100%|██████████| 4/4 [11:04:46<00:00, 9971.58s/it]  

{'randomforestclassifier__max_depth': 10, 'randomforestclassifier__max_features': 50, 'randomforestclassifier__min_samples_split': 5}
best CV score: 0.9521794368837239
test score: 0.5510854190687913





### 5) XGBoost Classifier<a name="7.-xgbc"></a>

In [14]:
import xgboost
from xgboost import XGBClassifier

xgbc = XGBClassifier(use_label_encoder=False)
param_grid = {'xgbclassifier__max_depth': [1, 3, 5, 10, 20, 30, 100],
              "xgbclassifier__learning_rate": [0.03],
              #'xgbclassifier__min_child_weight': [1, 3, 5, 7],
              #'xgbclassifier__gamma': [0, 0.1, 0.2 , 0.3, 0.4],
              'xgbclassifier__colsample_bytree': [0.9],
              'xgbclassifier__subsample': [0.66],
              'xgbclassifier__eval_metric': ['logloss']}

for i in tqdm(range(10)):
    grid, test_score = ML_pipeline_GridSearchCV(train_X, train_y, test_X, test_y, 
                                                xgbc, param_grid, i, custom_cv, pr_auc_scorer,
                                                xgbc=True)
    print(grid.best_params_)
    print('best CV score:', grid.best_score_)
    print('test score:',test_score)

 10%|█         | 1/10 [21:35<3:14:17, 1295.31s/it]

{'xgbclassifier__colsample_bytree': 0.9, 'xgbclassifier__eval_metric': 'logloss', 'xgbclassifier__learning_rate': 0.03, 'xgbclassifier__max_depth': 3, 'xgbclassifier__subsample': 0.66}
best CV score: 0.8766080288522291
test score: 0.5933629849465569


 20%|██        | 2/10 [42:45<2:50:46, 1280.77s/it]

{'xgbclassifier__colsample_bytree': 0.9, 'xgbclassifier__eval_metric': 'logloss', 'xgbclassifier__learning_rate': 0.03, 'xgbclassifier__max_depth': 3, 'xgbclassifier__subsample': 0.66}
best CV score: 0.8734098616096132
test score: 0.598334735493387


 30%|███       | 3/10 [1:03:57<2:28:57, 1276.78s/it]

{'xgbclassifier__colsample_bytree': 0.9, 'xgbclassifier__eval_metric': 'logloss', 'xgbclassifier__learning_rate': 0.03, 'xgbclassifier__max_depth': 3, 'xgbclassifier__subsample': 0.66}
best CV score: 0.8700772352690267
test score: 0.5703041037310352


 40%|████      | 4/10 [1:25:11<2:07:34, 1275.69s/it]

{'xgbclassifier__colsample_bytree': 0.9, 'xgbclassifier__eval_metric': 'logloss', 'xgbclassifier__learning_rate': 0.03, 'xgbclassifier__max_depth': 3, 'xgbclassifier__subsample': 0.66}
best CV score: 0.8748046045553552
test score: 0.581108803198559


 50%|█████     | 5/10 [1:46:29<1:46:21, 1276.31s/it]

{'xgbclassifier__colsample_bytree': 0.9, 'xgbclassifier__eval_metric': 'logloss', 'xgbclassifier__learning_rate': 0.03, 'xgbclassifier__max_depth': 3, 'xgbclassifier__subsample': 0.66}
best CV score: 0.87086885541217
test score: 0.5932668866557252


 60%|██████    | 6/10 [2:07:44<1:25:04, 1276.05s/it]

{'xgbclassifier__colsample_bytree': 0.9, 'xgbclassifier__eval_metric': 'logloss', 'xgbclassifier__learning_rate': 0.03, 'xgbclassifier__max_depth': 3, 'xgbclassifier__subsample': 0.66}
best CV score: 0.8736294847835803
test score: 0.590594976399925


 70%|███████   | 7/10 [2:29:07<1:03:54, 1278.21s/it]

{'xgbclassifier__colsample_bytree': 0.9, 'xgbclassifier__eval_metric': 'logloss', 'xgbclassifier__learning_rate': 0.03, 'xgbclassifier__max_depth': 3, 'xgbclassifier__subsample': 0.66}
best CV score: 0.8773555574359218
test score: 0.6128357583793621


 80%|████████  | 8/10 [2:50:20<42:33, 1276.69s/it]  

{'xgbclassifier__colsample_bytree': 0.9, 'xgbclassifier__eval_metric': 'logloss', 'xgbclassifier__learning_rate': 0.03, 'xgbclassifier__max_depth': 3, 'xgbclassifier__subsample': 0.66}
best CV score: 0.8641742232549473
test score: 0.5633546566925081


 90%|█████████ | 9/10 [3:11:34<21:15, 1275.74s/it]

{'xgbclassifier__colsample_bytree': 0.9, 'xgbclassifier__eval_metric': 'logloss', 'xgbclassifier__learning_rate': 0.03, 'xgbclassifier__max_depth': 3, 'xgbclassifier__subsample': 0.66}
best CV score: 0.8778473545131605
test score: 0.578121363070095


100%|██████████| 10/10 [3:32:39<00:00, 1275.96s/it]

{'xgbclassifier__colsample_bytree': 0.9, 'xgbclassifier__eval_metric': 'logloss', 'xgbclassifier__learning_rate': 0.03, 'xgbclassifier__max_depth': 3, 'xgbclassifier__subsample': 0.66}
best CV score: 0.8781023280107005
test score: 0.6010914382366007
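
All ten XGBoost runs above select `max_depth = 3`, so the scores can be summarized directly. A small sketch, using the values copied from the printed output, that reports the mean and spread of the CV and test scores and the gap between them:

```python
import numpy as np

# Scores copied from the ten XGBoost runs printed above (i = 0..9).
cv_scores = np.array([
    0.8766080288522291, 0.8734098616096132, 0.8700772352690267,
    0.8748046045553552, 0.87086885541217, 0.8736294847835803,
    0.8773555574359218, 0.8641742232549473, 0.8778473545131605,
    0.8781023280107005,
])
test_scores = np.array([
    0.5933629849465569, 0.598334735493387, 0.5703041037310352,
    0.581108803198559, 0.5932668866557252, 0.590594976399925,
    0.6128357583793621, 0.5633546566925081, 0.578121363070095,
    0.6010914382366007,
])

# Sample standard deviation (ddof=1) across the ten runs.
print(f'CV   score: {cv_scores.mean():.3f} +/- {cv_scores.std(ddof=1):.3f}')
print(f'test score: {test_scores.mean():.3f} +/- {test_scores.std(ddof=1):.3f}')

# A large CV-to-test gap suggests the CV folds are easier than the test batch.
gap = cv_scores.mean() - test_scores.mean()
print(f'mean CV - test gap: {gap:.3f}')
```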





### 6) SVC<a name="8.-svc"></a>

In [None]:
from sklearn.svm import SVC
svc = SVC(probability=True)
# 1e-8, 1e-7, 1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1
param_grid = {'svc__gamma': np.logspace(-3, 2, 6),
              'svc__C': np.logspace(-3, 2, 6)}

for i in tqdm(range(10)):
    grid, test_score = ML_pipeline_GridSearchCV(train_X, train_y, test_X, test_y, 
                                                svc, param_grid, i, custom_cv, pr_auc_scorer)
    print(grid.best_params_)
    print('best CV score:', grid.best_score_)
    print('test score:',test_score)

 10%|█         | 1/10 [5:23:50<48:34:36, 19430.74s/it]

{'svc__C': 10.0, 'svc__gamma': 0.1}
best CV score: 0.7563617383099294
test score: 0.8286780851998243


 20%|██        | 2/10 [9:26:52<36:50:27, 16578.48s/it]

{'svc__C': 0.001, 'svc__gamma': 0.1}
best CV score: 0.7563551215370854
test score: 0.8286780851998243


 30%|███       | 3/10 [14:52:08<34:50:37, 17919.64s/it]

{'svc__C': 10.0, 'svc__gamma': 0.1}
best CV score: 0.7563676090148831
test score: 0.8286780851998243
