# All-cell Model Tuning -- after count binarization

## Outline

The **MLAging - all-cell** workflow consists of the following sections:

`00 preprocessing.R` Data preprocessing and preparation in Seurat.

`111 All-cell Model Tuning - Before Binarization` ML model tuning using *non-binarized* HVGs and hyperparameter selection using `GridSearchCV`.

`112 All-cell Model Tuning - After Binarization` ML model tuning using *binarized* HVGs -- **this notebook**:

1. [Data Preparation](#1.-prep)
2. [Model Tuning](#2.-tunning)
    - [Lasso](#3.-l1)
    - [Ridge](#4.-l2)
    - [ElasticNet](#5.-eln)
    
    - [Random Forest](#6.-rfc)
    - [XGBoost](#7.-xgbc)
    
    - [Support Vector Machine with RBF kernel](#8.-svc)

`121 All-cell Model 10x - Before Binarization` Run the best models for *non-binarized* HVGs over 10 random seeds.

`122 All-cell Model 10x - After Binarization` Run the best models for *binarized* HVGs over 10 random seeds.
 
`123 All-cell Model 10x Swapped Train-Test` Run the best models for *binarized* HVGs over 10 random seeds, with the training and test sets swapped to check that sequencing throughput does not affect model performance.

`13 All-cell Model Result Viz` Result visualization.

`14 All-cell ELN Interpretation` Result interpretation. 

In [1]:
import warnings
warnings.filterwarnings('ignore')

from src.data_processing import *
from src.grid_search import *
import os
import numpy as np
from sklearn.metrics import make_scorer
from sklearn.linear_model import LogisticRegression
from tqdm import tqdm
import pickle

data_type = 'float32'

## 1. Data Preparation <a name="1.-prep"></a>
### Load training and testing batches

In [2]:
input_test = '../data/test_final_group_info.csv'
input_train = '../data/train_final_group_info.csv'

cell_type = 'All'

In [3]:
train_X, train_y, test_X, test_y, custom_cv = data_prep(input_test, input_train,
                                                        cell_type, binarization=True)

Finished data prepration for All
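
`data_prep` is imported from `src.data_processing` and its body is not shown in this notebook. Assuming `binarization=True` means that every nonzero expression value becomes 1 and zeros stay 0 (an assumption, not confirmed here), the transformation could be sketched as follows. The name `binarize_counts` and the toy matrix are hypothetical.

```python
import numpy as np

def binarize_counts(X, dtype="float32"):
    """Hypothetical helper: map every nonzero value to 1 and keep zeros as 0."""
    X = np.asarray(X)
    return (X > 0).astype(dtype)

# Toy count matrix (2 cells x 3 genes), for illustration only.
counts = np.array([[0, 3, 1],
                   [7, 0, 0]])
binary = binarize_counts(counts)
print(binary)
```

After this step only the presence or absence of a transcript is kept, so differences in sequencing depth between batches no longer change the feature values directly.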


In [4]:
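# PR-AUC scorer evaluated on predicted probabilities rather than hard labels.
# Note: scikit-learn >= 1.4 deprecates needs_proba in favor of
# response_method='predict_proba'.
pr_auc_scorer = make_scorer(pr_auc_score, greater_is_better=True,
                            needs_proba=True)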
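
`pr_auc_score` comes from `src.grid_search` and is not defined in this notebook. A minimal sketch of a metric with that name, assuming it is the trapezoidal area under the precision-recall curve, might look like the following. The function name `pr_auc_score_sketch` and the toy labels are hypothetical.

```python
import numpy as np
from sklearn.metrics import auc, precision_recall_curve

def pr_auc_score_sketch(y_true, y_score):
    """Area under the precision-recall curve (trapezoidal rule).

    y_score is expected to hold scores or probabilities for the positive class.
    """
    precision, recall, _ = precision_recall_curve(y_true, y_score)
    return auc(recall, precision)

# Toy example: the two positives are ranked above the two negatives.
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.2, 0.8, 0.9])
score = pr_auc_score_sketch(y_true, y_score)
print(score)
```

scikit-learn's built-in `average_precision_score` summarizes the same curve with a step-wise sum instead of the trapezoidal rule, so the two values can differ slightly on real data.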

## 2. Model tuning <a name="2.-tunning"></a>

### 1) Logistic regression -- l1<a name="3.-l1"></a>

In [5]:
from sklearn.linear_model import LogisticRegression

In [6]:
l1 = LogisticRegression(penalty='l1', solver='saga', max_iter=10000000)
# 0.01, 0.05, 0.1, 0.5, 1, 5, 8, 10, 20, 50, 100 
# 12.5, 15, 17.5, 20, 25, 30, 35, 40
param_grid = {'logisticregression__C': [18, 19, 20, 21, 22, 23]}

for i in tqdm(range(4)):
    grid, test_score = ML_pipeline_GridSearchCV(train_X, train_y, test_X, test_y, 
                                                l1, param_grid, i, custom_cv, pr_auc_scorer)
    print(grid.best_params_)
    print('best CV score:', grid.best_score_)
    print('test score:',test_score)

 25%|██▌       | 1/4 [32:20<1:37:01, 1940.58s/it]

{'logisticregression__C': 21}
best CV score: 0.9645329809120251
test score: 0.9607151436507586


 50%|█████     | 2/4 [1:05:04<1:05:08, 1954.33s/it]

{'logisticregression__C': 19}
best CV score: 0.9647309276427187
test score: 0.9607902888704182


 75%|███████▌  | 3/4 [1:37:07<32:19, 1940.00s/it]  

{'logisticregression__C': 20}
best CV score: 0.964419295822672
test score: 0.9607495728651902


100%|██████████| 4/4 [2:09:30<00:00, 1942.73s/it]

{'logisticregression__C': 18}
best CV score: 0.9645240269585771
test score: 0.9608373502661964
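
`ML_pipeline_GridSearchCV` is imported from `src.grid_search` and is not shown in this notebook. The `logisticregression__C` key suggests that the estimator is wrapped in a scikit-learn pipeline built with `make_pipeline`, which names each step after its lowercased class. The sketch below is one plausible shape for such a helper, not the project's implementation: the `StandardScaler` step, the way the seed is applied, the built-in `average_precision` scorer, the synthetic data, and the name `grid_search_sketch` are all assumptions.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def grid_search_sketch(train_X, train_y, test_X, test_y,
                       model, param_grid, seed, cv, scoring):
    """Hypothetical helper: tune `model` by CV, then score it on the test set."""
    model = clone(model)  # leave the caller's estimator untouched
    if "random_state" in model.get_params():
        model.set_params(random_state=seed)
    # make_pipeline names the final step 'logisticregression', which is why
    # the grid keys use the 'logisticregression__' prefix.
    pipe = make_pipeline(StandardScaler(), model)
    grid = GridSearchCV(pipe, param_grid, scoring=scoring, cv=cv)
    grid.fit(train_X, train_y)
    # GridSearchCV.score applies the same scorer to the refit best estimator.
    return grid, grid.score(test_X, test_y)

# Synthetic binarized data standing in for the real HVG matrix.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)
X = (X > 0).astype("float32")
train_X, test_X, train_y, test_y = X[:150], X[150:], y[:150], y[150:]

l1 = LogisticRegression(penalty="l1", solver="saga", max_iter=5000)
param_grid = {"logisticregression__C": [0.1, 1, 10]}
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

grid, test_score = grid_search_sketch(train_X, train_y, test_X, test_y,
                                      l1, param_grid, 0, cv, "average_precision")
print(grid.best_params_)
print("test score:", test_score)
```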





### 2) Logistic regression -- l2<a name="4.-l2"></a>

In [7]:
l2 = LogisticRegression(penalty='l2', solver='saga', max_iter=10000000)
param_grid = {'logisticregression__C': np.logspace(-2, 2, 10)}

for i in tqdm(range(4)):
    grid, test_score = ML_pipeline_GridSearchCV(train_X, train_y, test_X, test_y, 
                                                l2, param_grid, i, custom_cv, pr_auc_scorer)
    print(grid.best_params_)
    print('best CV score:', grid.best_score_)
    print('test score:',test_score)

 25%|██▌       | 1/4 [10:02<30:07, 602.39s/it]

{'logisticregression__C': 0.0774263682681127}
best CV score: 0.9752339606733502
test score: 0.9669253582815398


 50%|█████     | 2/4 [19:56<19:54, 597.36s/it]

{'logisticregression__C': 0.0774263682681127}
best CV score: 0.9752318134393316
test score: 0.9669285311511667


 75%|███████▌  | 3/4 [30:45<10:20, 620.98s/it]

{'logisticregression__C': 0.0774263682681127}
best CV score: 0.9752322822811742
test score: 0.9669277905646708


100%|██████████| 4/4 [41:40<00:00, 625.21s/it]

{'logisticregression__C': 0.0774263682681127}
best CV score: 0.9752321610156784
test score: 0.9669268928317597





### 3) Logistic regression -- ElasticNet<a name="5.-eln"></a>

In [8]:
eln = LogisticRegression(penalty='elasticnet', solver='saga', max_iter=10000000)
param_grid = {'logisticregression__C': np.logspace(-2, 1, 10),
              'logisticregression__l1_ratio': [0.001, 0.01, 0.05, 0.5, 0.65, 0.8]}

for i in tqdm(range(4)):
    grid, test_score = ML_pipeline_GridSearchCV(train_X, train_y, test_X, test_y, 
                                                eln, param_grid, i, custom_cv, pr_auc_scorer)
    
    print(grid.best_params_)
    print('best CV score:', grid.best_score_)
    print('test score:',test_score)

 25%|██▌       | 1/4 [1:14:26<3:43:20, 4466.70s/it]

{'logisticregression__C': 0.046415888336127774, 'logisticregression__l1_ratio': 0.01}
best CV score: 0.9755470672779081
test score: 0.9671617154646567


 50%|█████     | 2/4 [2:28:16<2:28:10, 4445.22s/it]

{'logisticregression__C': 0.046415888336127774, 'logisticregression__l1_ratio': 0.01}
best CV score: 0.975547351794952
test score: 0.9671610221966493


 75%|███████▌  | 3/4 [3:43:55<1:14:48, 4488.05s/it]

{'logisticregression__C': 0.046415888336127774, 'logisticregression__l1_ratio': 0.01}
best CV score: 0.9755480501504383
test score: 0.967161551325769


100%|██████████| 4/4 [4:59:07<00:00, 4486.85s/it]  

{'logisticregression__C': 0.046415888336127774, 'logisticregression__l1_ratio': 0.01}
best CV score: 0.9755460582716937
test score: 0.9671608627239665
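
The long decimal values reported for `C` in the Ridge and ElasticNet runs are simply points on the `np.logspace` grids. The short check below, which only uses NumPy, prints both grids and locates the selected values; the variable names are chosen for illustration.

```python
import numpy as np

# Same grids as in the Ridge and ElasticNet cells above.
l2_grid = np.logspace(-2, 2, 10)   # 10 values from 1e-2 to 1e2
eln_grid = np.logspace(-2, 1, 10)  # 10 values from 1e-2 to 1e1

print(np.round(l2_grid, 4))
print(np.round(eln_grid, 4))

# Positions of the selected C values within each grid.
l2_idx = int(np.argmin(np.abs(l2_grid - 0.0774263682681127)))
eln_idx = int(np.argmin(np.abs(eln_grid - 0.046415888336127774)))
print(l2_idx, eln_idx)
```

In both cases the selected `C` is the third grid point, so it does not sit on the edge of its search range.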





### 4) Random Forest Classifier<a name="6.-rfc"></a>

In [9]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
rfc = RandomForestClassifier()
param_grid = {'randomforestclassifier__max_features': [10, 15, 20, 25, 50, None],
              'randomforestclassifier__max_depth': [10, 20, 30, 50, 100, None],
              'randomforestclassifier__min_samples_split': [2, 5, 10, 20]}

for i in tqdm(range(4)):
    grid, test_score = ML_pipeline_GridSearchCV(train_X, train_y, test_X, test_y, 
                                                rfc, param_grid, i, custom_cv, pr_auc_scorer)

    print(grid.best_params_)
    print('best CV score:', grid.best_score_)
    print('test score:',test_score)

  0%|          | 0/4 [00:00<?, ?it/s]

### 5) XGBoost Classifier<a name="7.-xgbc"></a>

In [None]:
import xgboost
from xgboost import XGBClassifier

xgbc = XGBClassifier(use_label_encoder=False)
param_grid = {'xgbclassifier__max_depth': [1, 3, 5, 10, 20, 30, 100],
              "xgbclassifier__learning_rate": [0.03],
              #'xgbclassifier__min_child_weight': [1, 3, 5, 7],
              #'xgbclassifier__gamma': [0, 0.1, 0.2 , 0.3, 0.4],
              'xgbclassifier__colsample_bytree': [0.9],
              'xgbclassifier__subsample': [0.66],
              'xgbclassifier__eval_metric': ['logloss']}

for i in tqdm(range(4)):
    grid, test_score = ML_pipeline_GridSearchCV(train_X, train_y, test_X, test_y, 
                                                xgbc, param_grid, i, custom_cv, pr_auc_scorer, xgbc=True)
    print(grid.best_params_)
    print('best CV score:', grid.best_score_)
    print('test score:',test_score)

### 6) SVC<a name="8.-svc"></a>

In [None]:
from sklearn.svm import SVC
svc = SVC(probability=True)
# 1e-8, 1e-7, 1e-6, 1e-5, 1e-4, 
param_grid = {'svc__gamma': [1e-3, 1e-2, 1e-1],
              'svc__C': np.logspace(-3, 2, 5)}

for i in tqdm(range(4)):
    grid, test_score = ML_pipeline_GridSearchCV(train_X, train_y, test_X, test_y, 
                                                svc, param_grid, i, custom_cv, pr_auc_scorer)
    print(grid.best_params_)
    print('best CV score:', grid.best_score_)
    print('test score:',test_score)