# CS598 Deep Learning for HealthCare
## Final Project sample notebook
Adam Walsh and Donghyun Lee

### Libraries

To run this Jupyter notebook, you will need:
* pandas
* tables
    
    --> pandas uses tables (PyTables) to read and write HDF files. It is never imported explicitly in this notebook, but it must be installed.
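
Because `tables` is never imported directly, a missing install only shows up when pandas first opens an HDF file. A minimal sketch for checking the dependencies up front, using only the standard library; the helper name `missing_modules` is hypothetical and not part of the original notebook:

```python
import importlib.util

def missing_modules(names):
    """Return the module names from `names` that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# 'tables' is the import name of PyTables, which pandas relies on for HDF files.
missing = missing_modules(['pandas', 'tables'])
if missing:
    print('Missing packages:', ', '.join(missing))
else:
    print('All required packages are installed.')
```

Running this before the first `pd.HDFStore` call gives a clearer message than the import error pandas would otherwise raise.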

In [1]:
import pandas as pd
from IPython.display import display

In [2]:
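def display_features(df, name):
    # Print the index level names and the flattened column names of a table.
    index_names = [str(level) for level in df.index.names]
    # MultiIndex columns arrive as tuples; join their levels with "_".
    cols = [elem if isinstance(elem, str) else "_".join(elem) for elem in df.columns]
    print(f'For table {name},\nindex: {", ".join(index_names)}\ncolumns: {", ".join(cols)}\n')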

In [3]:
import inspect

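def retrieve_name_in_fn(var):
    # Look two frames up (the caller of the function that called this helper)
    # and return the first local name bound to the same object as `var`.
    callers_local_vars = inspect.currentframe().f_back.f_back.f_locals.items()
    out = [var_name for var_name, var_val in callers_local_vars if var_val is var]
    return out[0] if out else '<unknown>'

# When True, blind_display shows only the column headers (no data rows).
BLINDED = True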
def blind_display(*dfs):
    for df in dfs:
        print(f"{retrieve_name_in_fn(df)}.shape: ", df.shape)
        if BLINDED:
            display(df.head(0))
        else:
            display(df.head())

### Data

We implemented an ablation of the original codebase that removes the unit conversion step.
This notebook presents a statistical comparison of the two outputs: one produced with unit conversion and one produced without it.
Both outputs were generated with a population size of 1,000.

In [4]:

path_1000_unit_converted = './output/population-1000/all_hourly_data_1000.h5'
path_1000_unit_not_converted = './output/population-1000-no-unit-conversion/all_hourly_data_1000.h5'


In [5]:
with pd.HDFStore(path_1000_unit_converted) as hdf:
    table_keys = hdf.keys()
    display(f'List of tables: {", ".join(table_keys)}')

'List of tables: /codes, /interventions, /patients, /vitals_labs, /vitals_labs_mean, /patients/meta/values_block_6/meta, /patients/meta/values_block_5/meta, /patients/meta/values_block_4/meta, /patients/meta/values_block_0/meta'

The dataset includes the tables listed above.
We will use:
* vitals_labs
* vitals_labs_mean
* interventions
* patients

In [6]:
keys = ['vitals_labs', 'vitals_labs_mean', 'interventions', 'patients']

In [7]:
with pd.HDFStore(path_1000_unit_converted) as hdf:
    X_converted = pd.read_hdf(hdf, keys[0])
    X_mean_converted = pd.read_hdf(hdf, keys[1])
    Y_converted = pd.read_hdf(hdf, keys[2])
    S_converted = pd.read_hdf(hdf, keys[3])

In [8]:
with pd.HDFStore(path_1000_unit_not_converted) as hdf:
    X_not_converted = pd.read_hdf(hdf, keys[0])
    X_mean_not_converted = pd.read_hdf(hdf, keys[1])
    Y_not_converted = pd.read_hdf(hdf, keys[2])
    S_not_converted = pd.read_hdf(hdf, keys[3])

The features of each table are:

In [9]:
for table, name in zip([X_converted, X_mean_converted, Y_converted, S_converted], keys):
    display_features(table, name)

For table vitals_labs,
index: subject_id, hadm_id, icustay_id, hours_in
columns: alanine aminotransferase_count, alanine aminotransferase_mean, alanine aminotransferase_std, albumin_count, albumin_mean, albumin_std, albumin ascites_count, albumin ascites_mean, albumin ascites_std, albumin pleural_count, albumin pleural_mean, albumin pleural_std, albumin urine_count, albumin urine_mean, albumin urine_std, alkaline phosphate_count, alkaline phosphate_mean, alkaline phosphate_std, anion gap_count, anion gap_mean, anion gap_std, asparate aminotransferase_count, asparate aminotransferase_mean, asparate aminotransferase_std, basophils_count, basophils_mean, basophils_std, bicarbonate_count, bicarbonate_mean, bicarbonate_std, bilirubin_count, bilirubin_mean, bilirubin_std, blood urea nitrogen_count, blood urea nitrogen_mean, blood urea nitrogen_std, co2_count, co2_mean, co2_std, co2 (etco2, pco2, etc.)_count, co2 (etco2, pco2, etc.)_mean, co2 (etco2, pco2, etc.)_std, calcium_count, calcium_me

### Demographics

In [10]:
def categorize_age(age):
    # Buckets are right-inclusive: <=30, 31-50, 51-70, >70.
    # Ages of 10 or below now fall into '<31' instead of the '>70' fallback.
    if age <= 30:
        cat = '<31'
    elif age <= 50:
        cat = '31-50'
    elif age <= 70:
        cat = '51-70'
    else:
        cat = '>70'
    return cat

def categorize_ethnicity(ethnicity):
    if 'ASIAN' in ethnicity:
        ethnicity = 'ASIAN'
    elif 'WHITE' in ethnicity:
        ethnicity = 'WHITE'
    elif 'HISPANIC' in ethnicity:
        ethnicity = 'HISPANIC/LATINO'
    elif 'BLACK' in ethnicity:
        ethnicity = 'BLACK'
    else: 
        ethnicity = 'OTHER'
    return ethnicity
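
The age bucketing can also be expressed with `pandas.cut`, which is a swapped-in alternative to the row-wise `apply` used in this notebook, not the method the notebook itself uses. A minimal sketch on toy ages (the values are illustrative, not taken from the dataset), assuming every age of 30 or below belongs in the `<31` bucket:

```python
import pandas as pd

# Toy ages for illustration only; not taken from the dataset.
ages = pd.Series([18, 30, 31, 50, 51, 70, 71, 89])

# Right-inclusive bins: (-inf, 30], (30, 50], (50, 70], (70, inf)
age_bucket = pd.cut(ages,
                    bins=[float('-inf'), 30, 50, 70, float('inf')],
                    labels=['<31', '31-50', '51-70', '>70'])
print(age_bucket.tolist())
```

`pd.cut` returns a categorical column, so the bucket order is preserved when the result is later used as a pivot index.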

In [11]:
S_converted['age_bucket'] = S_converted['age'].apply(categorize_age)
S_converted['ethnicity'] = S_converted['ethnicity'].apply(categorize_ethnicity)

In [12]:
def get_patient_stat(S_level):
    # Note: this adds/overwrites 'age_bucket' and 'ethnicity' on S_level in place.
    S_level['age_bucket'] = S_level['age'].apply(categorize_age)
    S_level['ethnicity'] = S_level['ethnicity'].apply(categorize_ethnicity)

    flat = S_level.reset_index()
    # (column to group by, label used in the 'item' index level)
    groupings = [('ethnicity', 'ethnicity'),
                 ('age_bucket', 'age'),
                 ('insurance', 'insurance'),
                 ('admission_type', 'admission_type'),
                 ('first_careunit', 'first_careunit')]

    tables = []
    for column, label in groupings:
        table = flat.pivot_table(index=column,
                                 columns='gender',
                                 values=['icustay_id', 'mort_icu', 'mort_hosp', 'max_hours'],
                                 aggfunc={'icustay_id': 'count',
                                          'mort_icu': 'mean',
                                          'mort_hosp': 'mean',
                                          'max_hours': 'mean'},
                                 margins=True)
        # Order the groups by their total number of ICU stays.
        table = table.sort_values(by=('icustay_id', 'All'))
        tables.append(pd.concat([table], keys=[label], names=['item']))

    demographics = pd.concat(tables, axis=0)
    demographics.index.names = ['item', 'values']
    return demographics

In [13]:
print("Patient information for output with unit conversion")
demographics_converted = get_patient_stat(S_converted)
display(demographics_converted)

print()
print("Patient information for output without unit conversion")
demographics_not_converted = get_patient_stat(S_not_converted)
display(demographics_not_converted)

Patient information for output with unit conversion


item,values,icustay_id (F),icustay_id (M),icustay_id (All),max_hours (F),max_hours (M),max_hours (All),mort_hosp (F),mort_hosp (M),mort_hosp (All),mort_icu (F),mort_icu (M),mort_icu (All)
ethnicity,ASIAN,7,11,18,61.857143,73.363636,68.888889,0.0,0.181818,0.111111,0.0,0.090909,0.055556
ethnicity,HISPANIC/LATINO,11,21,32,63.454545,56.238095,58.71875,0.272727,0.0,0.09375,0.181818,0.0,0.0625
ethnicity,BLACK,41,32,73,80.317073,76.6875,78.726027,0.073171,0.125,0.09589,0.04878,0.03125,0.041096
ethnicity,OTHER,70,133,203,68.057143,70.902256,69.921182,0.085714,0.142857,0.123153,0.057143,0.105263,0.08867
ethnicity,WHITE,281,393,674,61.441281,63.600509,62.700297,0.096085,0.076336,0.08457,0.049822,0.035623,0.041543
ethnicity,All,410,590,1000,64.519512,65.876271,65.32,0.095122,0.09322,0.094,0.053659,0.050847,0.052
age,<31,14,30,44,59.214286,60.833333,60.318182,0.071429,0.066667,0.068182,0.071429,0.066667,0.068182
age,31-50,73,114,187,61.547945,54.087719,57.0,0.054795,0.035088,0.042781,0.027397,0.017544,0.02139
age,51-70,131,213,344,60.473282,67.342723,64.726744,0.083969,0.056338,0.06686,0.045802,0.042254,0.043605
age,>70,192,233,425,68.796875,70.95279,69.978824,0.119792,0.158798,0.141176,0.067708,0.072961,0.070588



Patient information for output without unit conversion


item,values,icustay_id (F),icustay_id (M),icustay_id (All),max_hours (F),max_hours (M),max_hours (All),mort_hosp (F),mort_hosp (M),mort_hosp (All),mort_icu (F),mort_icu (M),mort_icu (All)
ethnicity,ASIAN,7,11,18,61.857143,73.363636,68.888889,0.0,0.181818,0.111111,0.0,0.090909,0.055556
ethnicity,HISPANIC/LATINO,11,21,32,63.454545,56.238095,58.71875,0.272727,0.0,0.09375,0.181818,0.0,0.0625
ethnicity,BLACK,41,32,73,80.317073,76.6875,78.726027,0.073171,0.125,0.09589,0.04878,0.03125,0.041096
ethnicity,OTHER,70,133,203,68.057143,70.902256,69.921182,0.085714,0.142857,0.123153,0.057143,0.105263,0.08867
ethnicity,WHITE,281,393,674,61.441281,63.600509,62.700297,0.096085,0.076336,0.08457,0.049822,0.035623,0.041543
ethnicity,All,410,590,1000,64.519512,65.876271,65.32,0.095122,0.09322,0.094,0.053659,0.050847,0.052
age,<31,14,30,44,59.214286,60.833333,60.318182,0.071429,0.066667,0.068182,0.071429,0.066667,0.068182
age,31-50,73,114,187,61.547945,54.087719,57.0,0.054795,0.035088,0.042781,0.027397,0.017544,0.02139
age,51-70,131,213,344,60.473282,67.342723,64.726744,0.083969,0.056338,0.06686,0.045802,0.042254,0.043605
age,>70,192,233,425,68.796875,70.95279,69.978824,0.119792,0.158798,0.141176,0.067708,0.072961,0.070588


As shown, unit conversion applies only to vitals and lab values, so the patient information is identical with and without the ablation.
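
The claim that the patient information is unchanged can also be checked programmatically rather than by eye. A minimal sketch on toy data; in this notebook the two tables would presumably be `demographics_converted` and `demographics_not_converted`, while `left` and `right` below are illustrative stand-ins:

```python
import pandas as pd

# Toy stand-ins for the two demographics tables.
left = pd.DataFrame({'icustay_id': [18, 32], 'mort_icu': [0.055556, 0.0625]},
                    index=['ASIAN', 'HISPANIC/LATINO'])
right = left.copy()

# equals() returns one boolean: True when shape, labels, and values all match.
print(left.equals(right))

# assert_frame_equal raises an AssertionError describing the mismatch, if any.
pd.testing.assert_frame_equal(left, right)
```

`equals` is convenient for a quick yes/no answer, while `assert_frame_equal` is more useful when the tables differ and the location of the difference matters.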

### Vitals and Labs

In [14]:
def get_vitals_stat(X_mean):
    vitals_mean = pd.DataFrame(X_mean.mean(),columns=['mean'])
    vitals_std = pd.DataFrame(X_mean.std(),columns=['stdev'])
    vitals_missing = pd.DataFrame(X_mean.isnull().sum()/X_mean.shape[0]*100,columns=['missing percent'])

    vitals_summary = pd.concat([vitals_mean,vitals_std,vitals_missing],axis=1)
    vitals_summary.index = vitals_summary.index.droplevel(1)
    vitals_summary.sort_values(by='missing percent', ascending=True,inplace=True)
    return vitals_summary

In [15]:
print("Vitals and Labs information for output with unit conversion")
vitals_summary_converted = get_vitals_stat(X_mean_converted)
display(vitals_summary_converted)

print()
print("Vitals and Labs information for output without unit conversion")
vitals_summary_not_converted = get_vitals_stat(X_mean_not_converted)
display(vitals_summary_not_converted)

print()
print("Two datasets are compared")
display(vitals_summary_converted.compare(vitals_summary_not_converted))

Vitals and Labs information for output with unit conversion


LEVEL2,mean,stdev,missing percent
heart rate,84.357351,16.919556,13.692702
systolic blood pressure,122.032772,22.191260,15.928830
diastolic blood pressure,59.293845,13.078821,15.946924
respiratory rate,18.992608,5.753864,16.070567
mean blood pressure,80.259101,15.046637,16.767189
...,...,...,...
albumin ascites,2.675000,1.824600,99.993969
lymphocytes atypical csl,1.500000,1.000000,99.993969
creatinine ascites,20.900000,28.425693,99.996984
height,172.860000,14.255273,99.996984



Vitals and Labs information for output without unit conversion


LEVEL2,mean,stdev,missing percent
heart rate,84.357351,16.919556,13.692702
systolic blood pressure,122.032772,22.191260,15.928830
diastolic blood pressure,59.293845,13.078821,15.946924
respiratory rate,18.992608,5.753864,16.070567
mean blood pressure,80.259101,15.046637,16.767189
...,...,...,...
albumin ascites,2.675000,1.824600,99.993969
lymphocytes atypical csl,1.500000,1.000000,99.993969
creatinine ascites,20.900000,28.425693,99.996984
height,120.500000,9.899495,99.996984



Two datasets are compared


LEVEL2,mean (self),mean (other),stdev (self),stdev (other),missing percent (self),missing percent (other)
temperature,37.046891,37.045315,0.790544,0.792633,68.692702,68.88269
weight,84.785557,85.706772,23.117308,23.592627,,
fraction inspired oxygen,0.567505,0.567331,0.199905,0.183687,99.538601,99.613993
height,172.86,120.5,14.255273,9.899495,,


Most values are identical or nearly identical in the two datasets.

However, a few variables changed: temperature, weight, fraction inspired oxygen, and height.

In the table above, self refers to the original output with unit conversion, whereas other refers to the output without unit conversion (the ablation). An empty cell means the two values are equal for that variable.
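
`DataFrame.compare` only accepts two tables with identical row and column labels in the same order, and it keeps only the rows and columns where something differs. A minimal sketch on toy summaries (the numbers are illustrative, not from the dataset) that aligns the second table to the first before comparing; the variable names are hypothetical:

```python
import pandas as pd

# Toy summaries for illustration; the rows are deliberately in a different order.
with_conversion = pd.DataFrame({'mean': [84.0, 170.0], 'missing percent': [13.0, 99.0]},
                               index=['heart rate', 'height'])
without_conversion = pd.DataFrame({'mean': [120.0, 84.0], 'missing percent': [99.0, 13.0]},
                                  index=['height', 'heart rate'])

# Align the second table to the first so that both carry identical labels.
aligned = without_conversion.reindex(index=with_conversion.index,
                                     columns=with_conversion.columns)

# Only differing rows and columns are kept, under 'self' and 'other' sub-columns.
diff = with_conversion.compare(aligned)
print(diff)
```

Aligning first is a defensive choice: both summaries here are sorted by missing percent, so their row order could diverge whenever that column differs between the two outputs.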

### Interventions


In [16]:
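def get_mean_duration(Y_table):
    # Sum every column per ICU stay, then average those totals across stays.
    per_stay_totals = Y_table.reset_index().groupby('icustay_id').sum()
    # The first three entries are leftover index columns, not interventions; skip them.
    mean_duration = per_stay_totals.mean().iloc[3:].to_frame('hours')
    return mean_duration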

In [17]:
print("Interventions information for output with unit conversion")
mean_duration_converted = get_mean_duration(Y_converted)
display(mean_duration_converted)

print()
print("Interventions information for output without unit conversion")
mean_duration_not_converted = get_mean_duration(Y_not_converted)
display(mean_duration_not_converted)

Interventions information for output with unit conversion


intervention,hours
vent,10.038
vaso,8.356
adenosine,0.0
dobutamine,0.877
dopamine,0.959
epinephrine,0.51
isuprel,0.0
milrinone,0.973
norepinephrine,2.424
phenylephrine,4.123



Interventions information for output without unit conversion


intervention,hours
vent,10.038
vaso,8.356
adenosine,0.0
dobutamine,0.877
dopamine,0.959
epinephrine,0.51
isuprel,0.0
milrinone,0.973
norepinephrine,2.424
phenylephrine,4.123


### Overall

In a nutshell, we did not find major differences between the output with unit conversion and the output without it.

The values were exactly the same for the patient and intervention information.

However, a few variables in the vitals and labs table changed: temperature, weight, fraction inspired oxygen, and height.

The limited number of differences could be due to the small population size, since the comparison covered a population of only 1,000.

The data and the extraction technique appear to be robust.

In conclusion, unit conversion does not affect the patient or intervention information, while the vitals and labs data are affected by it.
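
To put the vitals and labs differences in perspective, the reported means can be turned into relative changes. A minimal sketch using the means from the comparison table in the Vitals and Labs section; `relative_change` and `reported_means` are hypothetical names introduced for illustration:

```python
# (mean with conversion, mean without conversion), copied from the comparison table.
reported_means = {
    'temperature': (37.046891, 37.045315),
    'weight': (84.785557, 85.706772),
    'fraction inspired oxygen': (0.567505, 0.567331),
    'height': (172.86, 120.5),
}

def relative_change(with_conversion, without_conversion):
    """Percent change of the unconverted mean relative to the converted mean."""
    return (without_conversion - with_conversion) / with_conversion * 100

relative_changes = {name: relative_change(*pair) for name, pair in reported_means.items()}
for name, pct in relative_changes.items():
    print(f'{name}: {pct:+.2f}%')
```

On these numbers, the height mean shifts by roughly 30%, the weight mean by about 1%, and the temperature and fraction inspired oxygen means by well under 0.1%.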