# Phenotyping Demo
<hr>

<div style=" float: right;">
<img align="right" src="https://ndownloader.figshare.com/files/34052981" width="25%"/>
<img align="right" src="https://www.cs.cmu.edu/~chiragn/auton_logo.png" width="25%"/>
</div>

# Contents


### 1. [Introduction](#introduction) 

### 2. [SUPPORT Dataset](#supportdata) 
####    &nbsp;&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp;  2.1 [SUPPORT Dataset Description.](#suppdatadesc)
####    &nbsp;&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp;  2.2 [Loading and Visualizing the Dataset.](#vissyndata)
####    &nbsp;&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp;  2.3 [Process Dataset Features.](#vissyndata)
####    &nbsp;&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp;  2.4 [Split Dataset into Train and Test.](#splitdata)
####    &nbsp;&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp;  2.5 [Process Train and Test Data.](#processdata)


### 3. [Intersectional Phenotyper](#syndata) 

       
### 4. [Clustering Phenotyper](#phenotyping)

####   &nbsp;&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp;   4.1 [Dimensionality Reduction](#phenocmhe)

####   &nbsp;&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp;   4.2 [Clustering](#clustering)



### 5. [Deep Cox Mixtures](#DCM)

#### &nbsp;&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp;   5.1 [Fit DCM model for phenotypes](#regcmhe)


####   &nbsp;&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp;   5.2 [Evaluate DCM model for phenotypes](#deepcph)

<hr>


*For examples of counterfactual phenotyping with Deep Cox Mixtures with Heterogeneous Effects (CMHE), please refer to the following paper*:

[Nagpal, C., Goswami, M., Dufendach, K., and Dubrawski, A. (2022). Counterfactual phenotyping with censored time-to-events. arXiv preprint arXiv:2202.11089.](https://arxiv.org/abs/2202.11089)

*For full details on Deep Cox Mixtures (DCM), please refer to the following paper*:

[Nagpal, C., Yadlowsky, S., Rostamzadeh, N., and Heller, K. (2021). Deep Cox mixtures for survival regression. In Machine Learning for Healthcare Conference, pages 674–708. PMLR.](https://arxiv.org/abs/2101.06536)

<a id="supportdata"></a>

## 2. SUPPORT Dataset

In [1]:
import pandas as pd
import torch
from tqdm import tqdm 
import sys
sys.path.append('../')

from auton_survival.datasets import load_dataset

<a id="gensyndata"></a>
### 2.1. SUPPORT Dataset Description

*For the original datasource, please refer to the following [website](https://biostat.app.vumc.org/wiki/Main/SupportDesc).*

Data features $x$ are stored in a pandas DataFrame, with rows corresponding to individual samples and columns to covariates. The outcomes consist of 'time', $t$, the time to the event or to censoring, and 'event', $e$, the event indicator ($e = 1$ if the event was observed, $e = 0$ if the sample was censored).



<a id="gensyndata"></a>
### 2.2. Loading and Visualizing the Dataset

In [2]:
# Load the SUPPORT dataset
outcomes, features = load_dataset(dataset='SUPPORT')

# Let's take a look at the dataset
display(features.head(5))
display(outcomes.head(5))

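| | sex | dzgroup | dzclass | income | race | ca | age | num.co | meanbp | wblc | ... | alb | bili | crea | sod | ph | glucose | bun | urine | adlp | adls |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | male | Lung Cancer | Cancer | \$11-\$25k | other | metastatic | 62.84998 | 0 | 97.0 | 6.0 | ... | 1.799805 | 0.199982 | 1.199951 | 141.0 | 7.459961 | NaN | NaN | NaN | 7.0 | 7.0 |
| 1 | female | Cirrhosis | COPD/CHF/Cirrhosis | \$11-\$25k | white | no | 60.33899 | 2 | 43.0 | 17.097656 | ... | NaN | NaN | 5.5 | 132.0 | 7.25 | NaN | NaN | NaN | NaN | 1.0 |
| 2 | female | Cirrhosis | COPD/CHF/Cirrhosis | under \$11k | white | no | 52.74698 | 2 | 70.0 | 8.5 | ... | NaN | 2.199707 | 2.0 | 134.0 | 7.459961 | NaN | NaN | NaN | 1.0 | 0.0 |
| 3 | female | Lung Cancer | Cancer | under \$11k | white | metastatic | 42.38498 | 2 | 75.0 | 9.099609 | ... | NaN | NaN | 0.799927 | 139.0 | NaN | NaN | NaN | NaN | 0.0 | 0.0 |
| 4 | female | ARF/MOSF w/Sepsis | ARF/MOSF | NaN | white | no | 79.88495 | 1 | 59.0 | 13.5 | ... | NaN | NaN | 0.799927 | 143.0 | 7.509766 | NaN | NaN | NaN | NaN | 2.0 |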


| | event | time |
|---|---|---|
| 0 | 0 | 2029 |
| 1 | 1 | 4 |
| 2 | 1 | 47 |
| 3 | 1 | 133 |
| 4 | 0 | 2029 |
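
As a quick sanity check on the outcome format, the sketch below summarizes censoring for the five outcome rows displayed above. It assumes the usual convention that `event` equals 1 when the event was observed and 0 when the sample was censored; the variable names are illustrative and not part of `auton_survival`.

```python
import numpy as np

# First five outcome rows, copied from the displayed output
time = np.array([2029, 4, 47, 133, 2029], dtype=float)
event = np.array([0, 1, 1, 1, 0])  # assumed: 1 = event observed, 0 = censored

observed = event == 1

# Fraction of samples whose event time is unknown (censored)
censored_fraction = 1.0 - observed.mean()

# Median time, computed only over samples with an observed event
median_event_time = np.median(time[observed])

print(f'Censored fraction: {censored_fraction:.2f}')
print(f'Median time among observed events: {median_event_time:.0f}')
```

On the full dataset the same two lines could be applied to `outcomes['time']` and `outcomes['event']`.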


<a id="gensyndata"></a>
### 2.3. Process Dataset Features

In [3]:
from auton_survival.preprocessing import Preprocessor
cat_feats = ['sex', 'dzgroup', 'dzclass', 'income', 'race', 'ca']
num_feats = ['age', 'num.co', 'meanbp', 'wblc', 'hrt', 'resp',
             'temp', 'pafi', 'alb', 'bili', 'crea', 'sod', 'ph',
             'glucose', 'bun', 'urine', 'adlp', 'adls']

features = Preprocessor().fit_transform(features, cat_feats=cat_feats, num_feats=num_feats)

# Let's take a look at the processed dataset
display(features.head(5))

| | age | num.co | meanbp | wblc | hrt | resp | temp | pafi | alb | bili | ... | dzclass_Coma | income_\$25-\$50k | income_>\$50k | income_under \$11k | race_black | race_hispanic | race_other | race_white | ca_no | ca_yes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.012772 | -1.390013 | 0.449837 | -0.693182 | -0.892283 | -0.138967 | -0.881504 | 1.569019 | -1.655686 | -0.5238337 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | -0.148262 | 0.097711 | -1.500702 | 0.51871 | 0.470382 | 1.114591 | -2.005013 | -1.495658 | -6.389701e-16 | 9.880260000000001e-17 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
| 2 | -0.635153 | 0.097711 | -0.525432 | -0.420176 | -0.290175 | 0.487812 | 0.235766 | -0.0831988 | -6.389701e-16 | -0.0789274 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 |
| 3 | -1.299688 | 0.097711 | -0.344827 | -0.354697 | -0.290175 | 0.905665 | -1.680444 | -3.003564e-16 | -6.389701e-16 | 9.880260000000001e-17 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
| 4 | 1.105258 | -0.646151 | -0.922764 | 0.125837 | 0.470382 | -0.347893 | 0.635237 | -0.699767 | -6.389701e-16 | 9.880260000000001e-17 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
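
The processed output is consistent with three steps: imputing missing numerical values, standardizing numerical columns, and one-hot encoding categorical columns with one level dropped. Entries such as `-6.389701e-16` are what a mean-imputed value looks like after standardization (zero up to floating-point error). The following is a rough pandas sketch of those steps on a hypothetical toy frame; it is an approximation for illustration and not the `Preprocessor` implementation.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one numerical and one categorical covariate
toy = pd.DataFrame({
    'alb': [1.8, np.nan, 3.0, 2.4],
    'ca': ['metastatic', 'no', 'yes', 'no'],
})

# 1. Impute missing numerical values with the column mean
num = toy[['alb']].fillna(toy[['alb']].mean())

# 2. Standardize to zero mean and unit variance
num = (num - num.mean()) / num.std(ddof=0)

# 3. One-hot encode the categorical column, dropping the first level
cat = pd.get_dummies(toy[['ca']], drop_first=True).astype(int)

processed = pd.concat([num, cat], axis=1)
print(processed)
```

In this toy example the imputed row ends up at (approximately) zero in the `alb` column, and `ca` is represented by `ca_no` and `ca_yes` only, with `metastatic` encoded as both indicators being zero.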


<a id="gensyndata"></a>
### 2.4. Split Dataset into Train and Test

In [4]:
# Hyper-parameters
random_seed = 0
test_size = 0.25

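# Split the SUPPORT data into training and testing data
import numpy as np

np.random.seed(random_seed)
n = features.shape[0] 

# Mark test rows with a boolean mask. Note: np.random.randint samples indices
# with replacement, so repeated draws collapse in the mask and the test set
# ends up smaller than int(n*test_size) rows.
test_idx = np.zeros(n).astype('bool')
test_idx[np.random.randint(n, size=int(n*test_size))] = True 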

features_tr = features.iloc[~test_idx] 
outcomes_tr = outcomes.iloc[~test_idx]
print(f'Number of training data points: {len(features_tr)}')

features_te = features.iloc[test_idx] 
outcomes_te = outcomes.iloc[test_idx]
print(f'Number of test data points: {len(features_te)}')

x_tr = features_tr.values.astype('float32')
t_tr = outcomes_tr['time'].values.astype('float32')
e_tr = outcomes_tr['event'].values.astype('float32')

x_te = features_te.values.astype('float32')
t_te = outcomes_te['time'].values.astype('float32')
e_te = outcomes_te['event'].values.astype('float32')

print('Training Data Statistics:')
print(f'Shape of covariates: {x_tr.shape} | times: {t_tr.shape} | events: {e_tr.shape}')

Number of training data points: 7094
Number of test data points: 2011
Training Data Statistics:
Shape of covariates: (7094, 38) | times: (7094,) | events: (7094,)
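
The printed counts show a test set of 2011 rows, fewer than 25% of the 9105 samples (2276), because the test indices are drawn with replacement. The sketch below contrasts that with drawing without replacement. It uses NumPy's `default_rng` rather than the global seed used in the notebook, so the with-replacement count will not reproduce 2011 exactly; the variable names are illustrative.

```python
import numpy as np

# n = 7094 + 2011 rows, taken from the counts printed above
n, test_size, random_seed = 9105, 0.25, 0
n_test = int(n * test_size)

# Sampling with replacement: duplicate indices collapse in the boolean mask
rng = np.random.default_rng(random_seed)
mask_with_repl = np.zeros(n, dtype=bool)
mask_with_repl[rng.integers(n, size=n_test)] = True

# Sampling without replacement: exactly n_test distinct rows are selected
rng = np.random.default_rng(random_seed)
mask_no_repl = np.zeros(n, dtype=bool)
mask_no_repl[rng.choice(n, size=n_test, replace=False)] = True

print(f'With replacement:    {mask_with_repl.sum()} test rows')
print(f'Without replacement: {mask_no_repl.sum()} test rows')
```

Either mask can be used with `features.iloc[mask]` and `features.iloc[~mask]`; the without-replacement variant gives a test fraction that matches `test_size`.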


<a id="clustering"></a>
## 3. Intersectional Phenotyper

In [6]:
from auton_survival.phenotyping import IntersectionalPhenotyper

phenotyper = IntersectionalPhenotyper(cat_vars=['ca_yes', 'ca_no'], num_vars=['age'],
                                      num_vars_quantiles=(0, .5, 1.0), random_seed=0)
intersectonal_phenotypes = phenotyper.fit_phenotype(features_tr)

intersectonal_phenotypes 

array(['ca_yes:0 & ca_no:1 & age:(-2.8569999999999998, 0.149]',
       'ca_yes:0 & ca_no:1 & age:(-2.8569999999999998, 0.149]',
       'ca_yes:0 & ca_no:0 & age:(-2.8569999999999998, 0.149]', ...,
       'ca_yes:0 & ca_no:1 & age:(-2.8569999999999998, 0.149]',
       'ca_yes:0 & ca_no:1 & age:(0.149, 2.514]',
       'ca_yes:0 & ca_no:1 & age:(0.149, 2.514]'], dtype='<U53')
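
Each label above combines the values of the categorical variables with the quantile bin of the numerical variable. A minimal sketch of that idea, using `pandas.qcut` on a hypothetical toy frame, is shown below. It illustrates how such labels can be built and is not the `IntersectionalPhenotyper` implementation.

```python
import pandas as pd

# Hypothetical toy data: one binary indicator and one standardized numerical variable
toy = pd.DataFrame({
    'ca_no': [1, 1, 0, 0, 1, 0],
    'age': [-1.2, 0.4, -0.3, 1.5, 0.9, -0.8],
})

# Bin the numerical variable at the 0, 0.5 and 1.0 quantiles (below/above the median)
age_bins = pd.qcut(toy['age'], q=[0, .5, 1.0])

# Combine the categorical value and the bin into one label per row
labels = 'ca_no:' + toy['ca_no'].astype(str) + ' & age:' + age_bins.astype(str)

print(labels.value_counts())
```

With one binary variable and two quantile bins there are at most four distinct phenotypes; adding variables or quantiles multiplies the number of groups, so group sizes shrink quickly.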

In [7]:
features_tr['age']

1      -0.148262
2      -0.635153
3      -1.299688
4       1.105258
5       1.947377
          ...   
9098    0.498443
9099    0.382556
9101   -0.480786
9102    0.495813
9104    1.211332
Name: age, Length: 7094, dtype: float64

<a id="clustering"></a>
## 4. Clustering Phenotyper

We first perform dimensionality reduction of the input covariates, $\mathbf{x}$, followed by clustering. 
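
The two stages can be sketched directly with scikit-learn on synthetic data: PCA for the reduction, then a Gaussian mixture with diagonal covariance for the clustering. The data, sizes, and variable names below are hypothetical; this is an illustration of the pipeline, not the code inside `ClusteringPhenotyper`.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

# Hypothetical stand-in for the covariate matrix: 200 samples, 10 features
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 10))

# Stage 1: project the covariates onto 3 principal components
z = PCA(n_components=3, random_state=0).fit_transform(x)

# Stage 2: fit a Gaussian mixture on the reduced representation
gmm = GaussianMixture(n_components=3, covariance_type='diag',
                      random_state=0).fit(z)

# Soft assignments: one row per sample, one column per cluster
probs = gmm.predict_proba(z)
print(probs.shape)
```

Clustering in the reduced space keeps the mixture model small: with a diagonal covariance, each component needs only a mean and a variance per retained dimension.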

In [19]:
from auton_survival.phenotyping import ClusteringPhenotyper

clustering_method = 'gmm'
dim_red_method = 'pca' # Reduce the covariates with PCA before clustering
n_components = 3 # Number of principal components to keep
n_clusters = 3 # Number of underlying phenotypes

# Running the phenotyper
clustering_phenotypes = ClusteringPhenotyper(clustering_method=clustering_method,
                                             dim_red_method=dim_red_method,
                                             n_components=n_components,
                                             n_clusters=n_clusters).fit_phenotype(features)

clustering_phenotypes

Fitting the following Dimensionality Reduction Model:
 PCA(n_components=3)
Fitting the following Clustering Model:
 GaussianMixture(covariance_type='diag', n_components=3)


array([[8.33133928e-01, 2.26219408e-08, 1.66866049e-01],
       [1.53273542e-04, 3.53382686e-01, 6.46464040e-01],
       [4.04053316e-01, 5.67402068e-01, 2.85446161e-02],
       ...,
       [2.78876574e-01, 6.75832413e-01, 4.52910134e-02],
       [1.90432597e-02, 8.71261819e-01, 1.09694922e-01],
       [9.04663995e-01, 7.04396569e-02, 2.48963481e-02]])
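
The array above holds soft assignments: each row gives one sample's membership probability for each cluster. When a single phenotype per sample is needed, a common choice is to take the most probable cluster. The sketch below does this for the six rows displayed above; the variable names are illustrative.

```python
import numpy as np

# Rows copied from the displayed output
probs = np.array([
    [8.33133928e-01, 2.26219408e-08, 1.66866049e-01],
    [1.53273542e-04, 3.53382686e-01, 6.46464040e-01],
    [4.04053316e-01, 5.67402068e-01, 2.85446161e-02],
    [2.78876574e-01, 6.75832413e-01, 4.52910134e-02],
    [1.90432597e-02, 8.71261819e-01, 1.09694922e-01],
    [9.04663995e-01, 7.04396569e-02, 2.48963481e-02],
])

# Hard label: index of the most probable cluster for each sample
hard_labels = probs.argmax(axis=1)

# Probability of the chosen cluster, a rough measure of assignment certainty
confidence = probs.max(axis=1)

print(hard_labels)  # [0 2 1 1 1 0]
print(confidence.round(3))
```

Samples with a low maximum probability, such as the third row here, sit between clusters, so their hard labels should be read with caution.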

<a id="DCM"></a>
## 5. Deep Cox Mixtures (DCM)