## AD ML binary classification

"Supervised Machine Learning (ML) algorithms were used to identify the discriminative power of the top ten differentially expressed genes in distinguishing between the two categories. Support vector machines using both linear and radial basis function kernels were selected for the binary classification in addition to random forest and quadratic Bayes algorithms [36]. The data was divided into a training set and a test set. The training set consisted of approximately 70% of the samples (114 samples) and the remaining 30% were used for testing to test and approximate how the classifiers generalize to unknown data. There was no overlap in the subjects and samples between the training and testing data to avoid any correlation between samples in both sets. The genes were used as features for each sample with all different combinations of N genes (2 ≤ N ≤ 10) out of 10 genes. For every combination of N genes, the combination with the maximum accuracy on the training data was chosen and the corresponding accuracy on the test data was reported on those N genes to avoid data snooping. Precision and recall scores in addition to the F1 score were reported for the testing accuracy that corresponded to the maximum training accuracy over all combinations."

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7717689/
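
The subset search described in the quoted passage can be sketched roughly as follows. This is a minimal illustration, not the paper's code: `best_subset_per_size` and `toy_train_score` are hypothetical names, the gene labels are placeholders, and the toy score only stands in for the training accuracy of a fitted classifier.

```python
from itertools import combinations
from math import comb


def best_subset_per_size(features, train_score, min_size=2):
    """For each subset size N, keep the subset with the highest training score."""
    best = {}
    for n in range(min_size, len(features) + 1):
        best[n] = max(combinations(features, n), key=train_score)
    return best


genes = [f"gene_{i}" for i in range(10)]  # placeholder gene names

# Number of candidate subsets with 2 <= N <= 10 out of 10 genes
n_subsets = sum(comb(10, n) for n in range(2, 11))
print(n_subsets)  # 1013

# Hypothetical stand-in for "training accuracy of a classifier fit on this subset"
weights = {g: i for i, g in enumerate(genes)}


def toy_train_score(subset):
    return sum(weights[g] for g in subset) / len(subset)


best = best_subset_per_size(genes, toy_train_score)
print(best[2])  # ('gene_8', 'gene_9')
```

In a real run, `train_score` would fit a classifier on the training split for each subset, and only the winning subset per N would then be scored on the held-out data.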



This version keeps **ALL** patients and samples and does **NOT** exclude patients with traumatic brain injury (TBI).

In [278]:
import os
import pickle as pkl
import pandas as pd
import numpy as np
import random
from sklearn.dummy import DummyClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import average_precision_score
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import RFE


## Data Prep

In [7]:
samples = pd.read_csv('../data/raw/gene_expression_matrix_2016-03-03/columns-samples.csv')
donor_info = pd.read_csv('../data/raw/DonorInformation.csv')


In [8]:
# Process donor info to segregate control group
control_group_df = donor_info[donor_info['act_demented'] == 'No Dementia']
dementia_group_df = donor_info[donor_info['act_demented'] != 'No Dementia']

# Get donor ids
control_ids = control_group_df['donor_id']
dementia_ids = dementia_group_df['donor_id']

# Assign a condition label to each sample based on its donor
# (any donor not in the control group is labelled 'dementia')
samples['Condition'] = np.where(samples['donor_id'].isin(control_ids), 'control', 'dementia')
samples

Unnamed: 0,rnaseq_profile_id,donor_id,donor_name,specimen_id,specimen_name,rna_well_id,polygon_id,structure_id,structure_acronym,structure_color,structure_name,hemisphere,Condition
0,488395315,309335467,H14.09.030,309357843,H14.09.030.TCx.01,395325172,320817998,10235,TCx,#ebbfd0,temporal neocortex,left,control
1,496100277,309335441,H14.09.004,309357624,H14.09.004.PCx.01,320630866,310967169,10557,FWM,#f2f1f0,white matter of forebrain,right,control
2,496100278,309335438,H14.09.001,309357596,H14.09.001.PCx.01,320630834,310790571,10557,FWM,#f2f1f0,white matter of forebrain,left,control
3,496100279,309335438,H14.09.001,309357599,H14.09.001.TCx.01,320630838,310790522,10235,TCx,#ebbfd0,temporal neocortex,left,control
4,496100281,309335439,H14.09.002,309357603,H14.09.002.HIP.01,320630842,310790372,10294,HIP,#bfb5d5,hippocampus (hippocampal formation),right,dementia
...,...,...,...,...,...,...,...,...,...,...,...,...,...
372,496100667,467056391,H15.09.103,467179071,H15.09.103.TCx.01,482655826,480366830,10235,TCx,#ebbfd0,temporal neocortex,right,control
373,496100669,467056391,H15.09.103,467179068,H15.09.103.PCx.01,482655822,480363830,10557,FWM,#f2f1f0,white matter of forebrain,right,control
374,496100670,467056406,H15.09.107,467179104,H15.09.107.TCx.01,482655780,480363840,10235,TCx,#ebbfd0,temporal neocortex,right,dementia
375,496100671,467056391,H15.09.103,467179065,H15.09.103.HIP.01,482655820,480366825,10294,HIP,#bfb5d5,hippocampus (hippocampal formation),right,control


### Grab unique donor ids for proper data split

In [53]:
donor_ids = samples.donor_id.unique()

### 70-15-15 Train-Validate-Test donor_id split


In [90]:
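# Intended 70/15/15 train/validate/test split at the donor level.
# Note: np.random.choice samples WITH replacement by default, so train_ids and
# validate_ids can contain repeated donors and the realized proportions can
# differ from 70/15/15. Passing replace=False would give an exact split.
np.random.seed(42)
train_ids = np.random.choice(donor_ids, int(np.ceil(len(donor_ids) * 0.7)))
test_ids = np.setdiff1d(donor_ids, train_ids)
validate_ids = np.random.choice(test_ids, int(np.ceil(len(test_ids) * 0.5)))
test_ids = np.setdiff1d(test_ids, validate_ids)
# Lengths are reported as (train, test, validate) and count repeated IDs
len(train_ids), len(test_ids), len(validate_ids)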

(75, 30, 27)
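
`np.random.choice` samples with replacement unless `replace=False` is passed, so the ID arrays drawn above may contain repeated donors. A donor-level split without replacement could look like the sketch below. It uses only the standard library; `split_donors` and the placeholder IDs are hypothetical and are not part of the notebook.

```python
import math
import random


def split_donors(donor_ids, train_frac=0.7, seed=42):
    """Shuffle unique donor IDs and cut them into disjoint train/validate/test lists."""
    ids = sorted(set(donor_ids))        # de-duplicate and fix the starting order
    random.Random(seed).shuffle(ids)    # seeded shuffle = sampling without replacement
    n_train = math.ceil(len(ids) * train_frac)
    n_val = math.ceil((len(ids) - n_train) * 0.5)
    train = ids[:n_train]
    validate = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]
    return train, validate, test


placeholder_ids = list(range(1000, 1107))  # 107 made-up donor IDs
train, validate, test = split_donors(placeholder_ids)
print(len(train), len(validate), len(test))  # 75 16 16

# No donor appears in more than one split
assert not (set(train) & set(validate))
assert not (set(train) & set(test))
assert not (set(validate) & set(test))
```

Because each donor lands in exactly one list, all samples from that donor stay together and cannot leak between training and evaluation data.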

### Check condition distribution

In [95]:
samples[samples['donor_id'].isin(train_ids)]['Condition'].value_counts()

Condition
control     107
dementia     83
Name: count, dtype: int64

In [96]:
samples[samples['donor_id'].isin(validate_ids)]['Condition'].value_counts()

Condition
dementia    48
control     32
Name: count, dtype: int64

In [97]:
samples[samples['donor_id'].isin(test_ids)]['Condition'].value_counts()

Condition
control     58
dementia    49
Name: count, dtype: int64

### Create ML dataframe

In [98]:
counts = pd.read_pickle('../data/interim/PyDeseq2/ct_matrix.pkl')
top_genes = pd.read_pickle('../data/interim/PyDeseq2/top_bottom_ten_sigs.pkl')

In [165]:
# Filter by top genes (top and bottom 10)
ml_df = counts[counts.index.isin(top_genes.gene_id)].copy()
ml_df

gene_id_mapped,488395315,496100277,496100278,496100279,496100281,496100283,496100284,496100285,496100287,496100288,...,496100661,496100663,496100664,496100665,496100666,496100667,496100669,496100670,496100671,496100672
499309627,21,85,83,44,40,34,39,32,26,27,...,43,45,24,20,17,53,52,26,56,40
499315537,13,22,14,49,28,27,48,44,14,48,...,70,43,51,9,24,103,103,11,124,41
499315843,18,13,6,12,5,25,18,15,20,19,...,26,12,11,39,20,23,29,35,11,10
499317231,31,15,30,42,55,38,35,62,37,39,...,64,89,47,23,22,73,42,48,46,41
499317260,4,3,6,10,4,4,10,10,6,4,...,14,13,5,5,6,8,4,0,11,7
499319747,7,5,2,13,3,9,11,16,5,6,...,39,24,8,5,3,39,26,6,21,10
499324505,11,7,3,108,43,68,59,119,45,117,...,88,111,101,44,17,114,86,39,245,27
499326780,64,25,412,250,87,75,73,86,46,48,...,99,123,110,62,75,117,71,59,114,185
499329195,90,155,168,114,165,115,120,54,146,186,...,50,62,262,233,333,83,61,177,115,129
499334295,168,174,111,136,200,190,163,133,224,285,...,181,154,228,316,348,178,174,258,166,208


In [166]:
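# Transpose so rows are samples, then attach the condition label (y)
# by matching on rnaseq_profile_id rather than relying on row order
ml_df = ml_df.T.reset_index().rename_axis(None, axis = 1)
ml_df['Condition'] = ml_df['rnaseq_profile_id'].map(samples.set_index('rnaseq_profile_id')['Condition'])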
ml_df


Unnamed: 0,rnaseq_profile_id,499309627,499315537,499315843,499317231,499317260,499319747,499324505,499326780,499329195,...,499335939,499336206,499336992,499343767,499343769,499347240,499348654,499350441,499352783,Condition
0,488395315,21,13,18,31,4,7,11,64,90,...,4,11,548,14,30,23,29,60,30,control
1,496100277,85,22,13,15,3,5,7,25,155,...,0,3,374,2,3,24,34,100,12,control
2,496100278,83,14,6,30,6,2,3,412,168,...,1,1,274,1,5,7,32,54,13,control
3,496100279,44,49,12,42,10,13,108,250,114,...,1,27,457,8,6,5,16,31,70,control
4,496100281,40,28,5,55,4,3,43,87,165,...,1,12,651,9,11,13,30,37,16,dementia
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
372,496100667,53,103,23,73,8,39,114,117,83,...,6,34,302,5,7,6,39,32,87,control
373,496100669,52,103,29,42,4,26,86,71,61,...,11,26,333,4,6,9,24,42,82,control
374,496100670,26,11,35,48,0,6,39,59,177,...,2,8,538,19,9,16,55,54,28,dementia
375,496100671,56,124,11,46,11,21,245,114,115,...,4,17,308,2,6,10,26,53,23,control


In [177]:
# Samples (rnaseq_profile_id values) belonging to each donor-level split
train_samples = samples[samples['donor_id'].isin(train_ids)]['rnaseq_profile_id']
val_samples = samples[samples['donor_id'].isin(validate_ids)]['rnaseq_profile_id']
test_samples = samples[samples['donor_id'].isin(test_ids)]['rnaseq_profile_id']

len(train_samples), len(val_samples), len(test_samples)

(190, 80, 107)

In [182]:
# Filter the ML dataframe into train, validation, and test sets
train_df = ml_df[ml_df['rnaseq_profile_id'].isin(train_samples)].drop(columns = 'rnaseq_profile_id')
val_df = ml_df[ml_df['rnaseq_profile_id'].isin(val_samples)].drop(columns = 'rnaseq_profile_id')
test_df = ml_df[ml_df['rnaseq_profile_id'].isin(test_samples)].drop(columns = 'rnaseq_profile_id')


In [194]:
# quick check
train_df

Unnamed: 0,499309627,499315537,499315843,499317231,499317260,499319747,499324505,499326780,499329195,499334295,...,499335939,499336206,499336992,499343767,499343769,499347240,499348654,499350441,499352783,Condition
1,85,22,13,15,3,5,7,25,155,174,...,0,3,374,2,3,24,34,100,12,control
2,83,14,6,30,6,2,3,412,168,111,...,1,1,274,1,5,7,32,54,13,control
3,44,49,12,42,10,13,108,250,114,136,...,1,27,457,8,6,5,16,31,70,control
4,40,28,5,55,4,3,43,87,165,200,...,1,12,651,9,11,13,30,37,16,dementia
5,34,27,25,38,4,9,68,75,115,190,...,2,15,541,10,21,17,44,49,55,dementia
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
369,24,51,11,47,5,8,101,110,262,228,...,3,14,508,4,10,12,62,72,41,dementia
372,53,103,23,73,8,39,114,117,83,178,...,6,34,302,5,7,6,39,32,87,control
373,52,103,29,42,4,26,86,71,61,174,...,11,26,333,4,6,9,24,42,82,control
375,56,124,11,46,11,21,245,114,115,166,...,4,17,308,2,6,10,26,53,23,control


## Base Models

In [259]:
X_train = train_df.drop(columns='Condition')
y_train = train_df['Condition']

X_val = val_df.drop(columns='Condition')
y_val = val_df['Condition']

# Fit the scaler on the training data only, then apply it to both sets
scaler = StandardScaler()
scaler.fit(X_train)
X_train_ = scaler.transform(X_train)
X_val_ = scaler.transform(X_val)

# Encode labels: dementia = 1 (positive class), control = 0
y_train = (y_train == 'dementia').astype(int)
y_val = (y_val == 'dementia').astype(int)
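
What the scaling step does for a single feature can be sketched with the standard library alone. `StandardScaler` subtracts the training mean and divides by the training population standard deviation, and the validation data reuses those training statistics. The helper names and the numbers below are illustrative assumptions, not values from the dataset.

```python
from statistics import fmean, pstdev


def fit_standardizer(column):
    """Return the mean and population standard deviation of a training column."""
    mean = fmean(column)
    std = pstdev(column) or 1.0   # constant columns are left unscaled
    return mean, std


def standardize(column, mean, std):
    """Apply z-scoring with statistics learned elsewhere (here: on the training data)."""
    return [(x - mean) / std for x in column]


train_col = [10.0, 20.0, 30.0, 40.0]   # made-up training values for one gene
val_col = [25.0, 50.0]                 # made-up validation values

mean, std = fit_standardizer(train_col)
train_scaled = standardize(train_col, mean, std)
val_scaled = standardize(val_col, mean, std)   # uses the TRAINING mean and std

print(train_scaled)
print(val_scaled)
```

Fitting on the training split only keeps information about the validation data out of the preprocessing step.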

In [233]:
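dummy = DummyClassifier(strategy='most_frequent', random_state=42)
dummy.fit(X_train_, y_train)
preds = dummy.predict(X_val_)
# zero_division=0 reports 0.00 (without warnings) for the class that is never predicted
score = classification_report(y_val, preds, target_names=['control', 'dementia'], zero_division=0)
print(score)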

              precision    recall  f1-score   support

     control       0.40      1.00      0.57        32
    dementia       0.00      0.00      0.00        48

    accuracy                           0.40        80
   macro avg       0.20      0.50      0.29        80
weighted avg       0.16      0.40      0.23        80
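
The 0.40 baseline follows directly from the class counts reported earlier: the majority class in the training samples is control, and control makes up 32 of the 80 validation samples. A small check of that arithmetic, using only the counts printed above:

```python
# Class counts copied from the value_counts() outputs above
train_counts = {"control": 107, "dementia": 83}
val_counts = {"dementia": 48, "control": 32}

# 'most_frequent' always predicts the majority class of the training labels
majority = max(train_counts, key=train_counts.get)

# Validation accuracy of that constant prediction
baseline_accuracy = val_counts[majority] / sum(val_counts.values())
print(majority, baseline_accuracy)  # control 0.4
```

Any model worth keeping should clearly beat this number on the validation set.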


In [227]:
reg = LogisticRegression(random_state=42)
reg.fit(X_train_, y_train)
preds = reg.predict(X_val_)
score = classification_report(y_val, preds, target_names=['control', 'dementia'])
print(score)

              precision    recall  f1-score   support

     control       0.63      0.84      0.72        32
    dementia       0.86      0.67      0.75        48

    accuracy                           0.74        80
   macro avg       0.75      0.76      0.74        80
weighted avg       0.77      0.74      0.74        80
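
To make the report columns concrete, the sketch below recomputes precision, recall, and F1 from confusion counts. The counts are an assumption: they are inferred from the rounded logistic regression report above (27 of 32 controls and 32 of 48 dementia samples classified correctly) and were not printed by the notebook.

```python
def precision_recall_f1(tp, fp, fn):
    """Per-class precision, recall, and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Inferred counts (assumption): 5 controls predicted as dementia,
# 16 dementia samples predicted as control
control = precision_recall_f1(tp=27, fp=16, fn=5)
dementia = precision_recall_f1(tp=32, fp=5, fn=16)
accuracy = (27 + 32) / 80

print("control ", control)
print("dementia", dementia)
print("accuracy", accuracy)
```

Recall is the fraction of a class that was found, precision is the fraction of predictions for that class that were right, and F1 is their harmonic mean.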



In [216]:
tree = DecisionTreeClassifier(random_state = 42)
tree.fit(X_train_, y_train)
preds = tree.predict(X_val_)
score = classification_report(y_val, preds, target_names=['control', 'dementia'])
print(score)

              precision    recall  f1-score   support

     control       0.61      0.62      0.62        32
    dementia       0.74      0.73      0.74        48

    accuracy                           0.69        80
   macro avg       0.68      0.68      0.68        80
weighted avg       0.69      0.69      0.69        80



In [217]:
forest = RandomForestClassifier(random_state= 42)
forest.fit(X_train_, y_train)
preds = forest.predict(X_val_)
score = classification_report(y_val, preds, target_names=['control', 'dementia'])
print(score)

              precision    recall  f1-score   support

     control       0.65      0.88      0.75        32
    dementia       0.89      0.69      0.78        48

    accuracy                           0.76        80
   macro avg       0.77      0.78      0.76        80
weighted avg       0.80      0.76      0.76        80



In [218]:
linear_svm = SVC(kernel='linear', random_state=42)
linear_svm.fit(X_train_, y_train)
preds = linear_svm.predict(X_val_)
score = classification_report(y_val, preds, target_names=['control', 'dementia'])
print(score)

              precision    recall  f1-score   support

     control       0.60      0.81      0.69        32
    dementia       0.84      0.65      0.73        48

    accuracy                           0.71        80
   macro avg       0.72      0.73      0.71        80
weighted avg       0.74      0.71      0.71        80



In [219]:
rbf_svm = SVC(kernel='rbf', random_state=42)
rbf_svm.fit(X_train_, y_train)
preds = rbf_svm.predict(X_val_)
score = classification_report(y_val, preds, target_names=['control', 'dementia'])
print(score)

              precision    recall  f1-score   support

     control       0.68      0.88      0.77        32
    dementia       0.90      0.73      0.80        48

    accuracy                           0.79        80
   macro avg       0.79      0.80      0.79        80
weighted avg       0.81      0.79      0.79        80



In [222]:
boost = GradientBoostingClassifier(random_state=42)
boost.fit(X_train_, y_train)
preds = boost.predict(X_val_)
score = classification_report(y_val, preds, target_names=['control', 'dementia'])
print(score)

              precision    recall  f1-score   support

     control       0.53      0.78      0.63        32
    dementia       0.79      0.54      0.64        48

    accuracy                           0.64        80
   macro avg       0.66      0.66      0.64        80
weighted avg       0.69      0.64      0.64        80



In [235]:

gauss = GaussianNB()
gauss.fit(X_train_, y_train)
preds = gauss.predict(X_val_)
score = classification_report(y_val, preds, target_names=['control', 'dementia'])
print(score)


              precision    recall  f1-score   support

     control       0.77      0.75      0.76        32
    dementia       0.84      0.85      0.85        48

    accuracy                           0.81        80
   macro avg       0.81      0.80      0.80        80
weighted avg       0.81      0.81      0.81        80



In [261]:
rbf_svm = SVC(kernel='rbf', random_state=42, probability= True)
rbf_svm.fit(X_train_, y_train)
preds = rbf_svm.predict_proba(X_val_)
score = average_precision_score(y_val, preds[:,1])
print(score)


0.8579185462776736


In [264]:
forest = RandomForestClassifier(random_state= 42)
forest.fit(X_train_, y_train)
preds = forest.predict_proba(X_val_)
score = average_precision_score(y_val, preds[:,1])
print(score)

0.8217226270723974


In [290]:
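# Compare all models on the validation set by average precision,
# computed from the predicted probability of the positive class (dementia = 1)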
models = [
    LogisticRegression(random_state=42),
    DecisionTreeClassifier(random_state=42),
    RandomForestClassifier(random_state=42),
    SVC(kernel="linear", random_state=42, probability=True),
    SVC(kernel="rbf", random_state=42, probability=True),
    GaussianNB(),
    GradientBoostingClassifier(random_state=42),
    DummyClassifier(strategy= 'most_frequent', random_state=42),
]
model_names = ["Log_Reg", "DT", "RF", 'SVM_linear', "SVM_radial", "GaussianNB", 'Gradient_boosted', 'Dummy_most_freq']

scores = {}
saved_models = {}
for name, model in zip(model_names, models):
    model.fit(X_train_, y_train)
    saved_models[name] = model
    preds = model.predict_proba(X_val_)
    score = average_precision_score(y_val, preds[:, 1])
    scores[name] = score

pd.Series(scores).sort_values(ascending=False)

GaussianNB          0.957345
SVM_radial          0.857919
Log_Reg             0.852079
RF                  0.821723
SVM_linear          0.821285
Gradient_boosted    0.793011
DT                  0.705496
Dummy_most_freq     0.600000
dtype: float64
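
Average precision sums precision over the ranked predictions, weighting each step by the gain in recall: AP = Σ (Rₙ − Rₙ₋₁) · Pₙ. The sketch below is a plain-Python illustration of that formula, not scikit-learn's implementation; `average_precision` is a hypothetical helper. It also shows why the dummy model scores 0.60: with one constant score for every sample, AP reduces to the share of positives (48 of 80).

```python
def average_precision(y_true, scores):
    """AP = sum over distinct score thresholds of (recall gain) * precision."""
    n_pos = sum(y_true)
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ap, tp, prev_recall = 0.0, 0, 0.0
    i = 0
    while i < len(order):
        # Samples with tied scores enter the ranking together
        j = i
        while j < len(order) and scores[order[j]] == scores[order[i]]:
            tp += y_true[order[j]]
            j += 1
        precision = tp / j        # j samples are predicted positive so far
        recall = tp / n_pos
        ap += (recall - prev_recall) * precision
        prev_recall = recall
        i = j
    return ap


# Small made-up example
toy_ap = average_precision([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(toy_ap)

# Constant scores, 48 positives out of 80: AP equals the positive rate
dummy_ap = average_precision([1] * 48 + [0] * 32, [0.0] * 80)
print(dummy_ap)
```

This makes 0.60 the floor to compare against in the ranking above, rather than 0.50.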

In [283]:
# Quick look at the decision tree's feature importances (one value per gene)
saved_models['DT'].feature_importances_

array([0.0222854 , 0.01727976, 0.01604549, 0.        , 0.        ,
       0.        , 0.05521354, 0.14013153, 0.02062992, 0.02108836,
       0.06222948, 0.04558957, 0.03085671, 0.102719  , 0.00875209,
       0.07852476, 0.04819838, 0.04931979, 0.22134175, 0.05979448])

In [285]:
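# Recursive feature elimination (RFE) with the decision tree.
# With default settings RFE keeps half of the features (10 of 20 here)
# and removes one feature per iteration.
estimator = models[1]
selector = RFE(estimator)
selector = selector.fit(X_train_, y_train)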
print(selector.support_)
print(selector.ranking_)

[False False False False False False False  True False  True  True  True
 False  True False  True  True  True  True  True]
[ 9  5  7 11 10  8  6  1  3  1  1  1  2  1  4  1  1  1  1  1]
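
In the output above, rank 1 marks the ten features that were kept, and the highest rank (11) marks the feature that was eliminated first. The loop behind that ranking can be sketched as below. This is a simplified stand-in, not scikit-learn's code: `rfe_ranking` and `toy_importance` are hypothetical, and the toy importances are fixed, whereas real RFE refits the estimator on the remaining features in every round.

```python
def rfe_ranking(n_features, importance_fn, n_keep):
    """Drop the least important remaining feature until n_keep are left; return ranks."""
    remaining = list(range(n_features))
    eliminated = []
    while len(remaining) > n_keep:
        scores = importance_fn(remaining)   # one score per remaining feature
        worst_pos = min(range(len(remaining)), key=scores.__getitem__)
        eliminated.append(remaining.pop(worst_pos))
    ranking = [1] * n_features              # kept features get rank 1
    for position, feature in enumerate(eliminated):
        # Features removed earlier receive higher (worse) ranks
        ranking[feature] = len(eliminated) - position + 1
    return ranking


toy = [0.5, 0.1, 0.3, 0.05]                 # made-up importances


def toy_importance(remaining):
    return [toy[i] for i in remaining]


ranking = rfe_ranking(4, toy_importance, n_keep=2)
support = [rank == 1 for rank in ranking]
print(ranking)  # [1, 2, 1, 3]
print(support)  # [True, False, True, False]
```

The boolean mask plays the role of `support_` and the rank list the role of `ranking_` in the cell above.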
