# Phenotyping Demo
<hr>

<div style=" float: right;">
<img align="right" src="https://ndownloader.figshare.com/files/34052981" width="25%"/>
<img align="right" src="https://www.cs.cmu.edu/~chiragn/auton_logo.png" width="25%"/>
</div>

# Contents


### 1. [Introduction](#introduction) 

### 2. [SUPPORT Dataset](#syndata) 
####    &nbsp;&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp;  2.1 [SUPPORT Dataset Description.](#suppdatadesc)
####    &nbsp;&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp;  2.2 [Loading and Visualizing the Dataset.](#vissyndata)
####    &nbsp;&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp;  2.3 [Split Dataset into Train and Test.](#splitdata)
####    &nbsp;&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp;  2.3 [Process Train and Test Data.](#processdata)


### 3. [Intersectional Phenotyper](#syndata) 

       
### 4. [Clustering Phenotyper](#phenotyping)

####   &nbsp;&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp;   4.1 [Dimensionality Reduction](#phenocmhe)

####   &nbsp;&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp;   4.2 [Clustering](#clustering)



### 5. [Deep Cox Mixtures](#regression)

#### &nbsp;&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp;   5.1 [Fit DCM model for phenotypes](#regcmhe)


####   &nbsp;&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp;   5.2 [Evaluate DCM model for phenotypes](#deepcph)

<hr>


*For examples of counterfactual phenotyping with Deep Cox Mixtures with Heterogeneous Effects (CMHE), please refer to the following paper*:

[Nagpal, C., Yadlowsky, S., Rostamzadeh, N., and Heller, K. (2021c). Deep cox mixtures for survival regression. In
Machine Learning for Healthcare Conference, pages 674–708. PMLR.](https://arxiv.org/abs/2101.06536)

*For full details on Deep Cox Mixtures (DCM), please refer to the following paper*:

[Nagpal, C., Yadlowsky, S., Rostamzadeh, N., and Heller, K. (2021c). Deep cox mixtures for survival regression. In
Machine Learning for Healthcare Conference, pages 674–708. PMLR.](https://arxiv.org/abs/2101.06536)

<a id="supportdata"></a>

## 2. SUPPORT Dataset

In [1]:
import pandas as pd
import torch
from tqdm import tqdm 
import sys
sys.path.append('../')

from auton_survival.datasets import load_dataset

<a id="gensyndata"></a>
### 2.1. SUPPORT Dataset Description

*For the original datasource, please refer to the following [website](https://biostat.app.vumc.org/wiki/Main/SupportDesc).*

Data features $x$ are stored in a pandas dataframe with rows corresponding to individual samples and columns as covariates.

Sample 'time' and 'event'. 



<a id="gensyndata"></a>
### 2.2. Loading and Visualizing the Dataset

In [2]:
# Load the synthetic dataset
outcomes, features = load_dataset(dataset='SUPPORT')

# Let's take a look at take the dataset
features.head(5)

Unnamed: 0,sex,dzgroup,dzclass,income,race,ca,age,num.co,meanbp,wblc,...,alb,bili,crea,sod,ph,glucose,bun,urine,adlp,adls
0,male,Lung Cancer,Cancer,$11-$25k,other,metastatic,62.84998,0,97.0,6.0,...,1.799805,0.199982,1.199951,141.0,7.459961,,,,7.0,7.0
1,female,Cirrhosis,COPD/CHF/Cirrhosis,$11-$25k,white,no,60.33899,2,43.0,17.097656,...,,,5.5,132.0,7.25,,,,,1.0
2,female,Cirrhosis,COPD/CHF/Cirrhosis,under $11k,white,no,52.74698,2,70.0,8.5,...,,2.199707,2.0,134.0,7.459961,,,,1.0,0.0
3,female,Lung Cancer,Cancer,under $11k,white,metastatic,42.38498,2,75.0,9.099609,...,,,0.799927,139.0,,,,,0.0,0.0
4,female,ARF/MOSF w/Sepsis,ARF/MOSF,,white,no,79.88495,1,59.0,13.5,...,,,0.799927,143.0,7.509766,,,,,2.0


In [None]:
<a id="gensyndata"></a>
### 2.3. Split Dataset into Train and Test

In [10]:
# Hyper-parameters
random_seed = 0
test_size = 0.25
cat_var = ['sex', ]

# Split the synthetic data into training and testing data
import numpy as np

np.random.seed(random_seed)
n = features.shape[0] 

test_idx = np.zeros(n).astype('bool')
test_idx[np.random.randint(n, size=int(n*test_size))] = True 

features_tr = features.iloc[~test_idx] 
outcomes_tr = outcomes.iloc[~test_idx]
print(f'Number of training data points: {len(features_tr)}')

features_te = features.iloc[test_idx] 
outcomes_te = outcomes.iloc[test_idx]
print(f'Number of test data points: {len(features_te)}')

x_tr = features_tr.values.astype('float32')
t_tr = outcomes_tr['time'].values.astype('float32')
e_tr = outcomes_tr['event'].values.astype('float32')

x_te = features_te.values.astype('float32')
t_te = outcomes_te['time'].values.astype('float32')
e_te = outcomes_te['event'].values.astype('float32')

print('Training Data Statistics:')
print(f'Shape of covariates: {x_tr.shape} | times: {t_tr.shape} | events: {e_tr.shape} | interventions: {a_tr.shape}')

Number of training data points: 7094
Number of test data points: 2011


ValueError: could not convert string to float: 'female'

<a id="clustering"></a>
## 3. Intersectional phenotyper

In [4]:
from auton_survival.phenotyping import IntersectionalPhenotyper

intersectonal_phenotypes = IntersectionalPhenotyper(cat_vars=['race'], 
                                                    num_vars=['age']).fit_phenotype(features)

intersectonal_phenotypes 

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  features[num_var][features[num_var]>=var_max] = var_max
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  features[num_var][features[num_var]<=var_min] = var_min


array(['race:other & age:(18.041, 64.857]',
       'race:white & age:(18.041, 64.857]',
       'race:white & age:(18.041, 64.857]', ...,
       'race:white & age:(64.857, 101.848]',
       'race:white & age:(18.041, 64.857]',
       'race:white & age:(64.857, 101.848]'], dtype='<U37')

<a id="clustering"></a>
## 4. Clustering phenotyper

We first perform dimensionality reduction of the input covariates, $\mathbf{x}$, followed by clustering. 

In [8]:
from auton_survival.phenotyping import ClusteringPhenotyper
from sklearn.metrics import auc

clustering_method = 'gmm'
dim_red_method = None # We would not perform dimensionality reduction for the synthetic dataset
n_components = None 
n_clusters = 2 # Number of underlying treatment effect phenotypes

# Running the phenotyper
clustering_phenotypes = ClusteringPhenotyper(clustering_method=clustering_method, 
                                  dim_red_method=dim_red_method, 
                                  n_components=n_components, 
                                  n_clusters=n_clusters).fit_phenotype(features)

clustering_phenotypes

No Dimensionaity reduction specified...
 Proceeding to learn clusters with the raw features...
Fitting the following Clustering Model:
 GaussianMixture(covariance_type='diag', n_components=3)


ValueError: could not convert string to float: 'male'

In [None]:
from auton_survival import reporting
reporting.plot_kaplanmeier(outcomes, clustering_phenotypes)

<a id="DCM"></a>
## 5. Deep Cox Mixtures (DCM)