<h1><center>CS598 Deep Learning for Healthcare Spring 2023<br>Paper Reproduction Project</center></h1>

<h3><center>Gilberto Ramirez and Jay Kakwani<br><span style="font-family:monospace;">{ger6, kakwani2}@illinois.edu</span><br><font color="lightgrey">Group ID: 27 | Paper ID: 181</font></center></h3>

In this project, we aim to reproduce the paper [*Learning Tasks for Multitask Learning: Heterogeneous Patient Populations in the ICU* (Suresh et al., 2018)](https://arxiv.org/abs/1806.02878). In this paper, the authors propose a novel two-step pipeline to predict in-hospital mortality across patient populations with different characteristics. The first step of the pipeline divides patients into relevant non-overlapping cohorts in an unsupervised way using a long short-term memory (LSTM) autoencoder followed by a Gaussian Mixture Model (GMM). The second step of the pipeline predicts in-hospital mortality for each patient cohort identified in the previous step using an LSTM-based multi-task learning model where every cohort is considered a different task.
The paper claims that this pipeline yields better predictive performance than a single-task learning model applied to the aggregate population. According to the authors, the improvement comes from two factors: a multi-task learning model that leverages shared knowledge across distinct patient groups, and the way those groups are created, i.e., identified with a data-driven method rather than relying on domain knowledge or auxiliary labels.

## 1. Data

This paper uses the publicly available [MIMIC-III database](https://www.nature.com/articles/sdata201635) which contains clinical data in a critical care setting. After reviewing the paper in detail, we decided to use [MIMIC-Extract](https://arxiv.org/abs/1907.08322), an open source pipeline by (Wang et al., 2020) for transforming the raw EHR data into usable [Pandas](https://pandas.pydata.org/) dataframes containing hourly time series of vitals and laboratory measurements after performing unit conversion, outlier handling, and aggregation of semantically similar features.

Unfortunately, the MIMIC-Extract pipeline lacks two features that the [paper code](https://github.com/mit-caml/multitask-patients) uses:
* `timecmo_chart`, which gives the timestamp at which a patient was declared to be in CMO (Comfort Measures Only) status. This feature comes from a MIMIC-III concept table called `code_status`.
* `sapsii`, which contains the SAPS (Simplified Acute Physiology Score) II for the patient. This feature comes from another MIMIC-III concept table called `sapsii`.

As a result, there are three data files needed to run this notebook:
* `all_hourly_data.h5`, an HDF file resulting from running the MIMIC-Extract pipeline which is publicly available in GCP using [this link](https://console.cloud.google.com/storage/browser/mimic_extract) and referenced in the [MIMIC-Extract github repo](https://github.com/MLforHealth/MIMIC_Extract).
* `code_status.csv`, a CSV file holding the MIMIC concept table `CODE_STATUS` that can be generated following the instructions in [this link within the MIT-LCP github repo](https://github.com/MIT-LCP/mimic-code/tree/main/mimic-iii/concepts#generating-the-concepts-in-postgresql).
* `sapsii.csv`, a CSV file holding the MIMIC concept table `SAPSII` that can be generated following the instructions in [this link within the MIT-LCP github repo](https://github.com/MIT-LCP/mimic-code/tree/main/mimic-iii/concepts#generating-the-concepts-in-postgresql).

The functions used in this notebook assume by default that the three files listed above are in the folder `../data/`. The location can be changed through arguments to the functions that process the data.

All code needed to replicate the paper is in [our github repo](https://github.com/ger6-illini/dl4h-sp23-team27-project) inside a Python module called `mtl_patients`.

The first function we will use from that module is `get_summaries()`. This function returns three summaries as dataframes:
1. A summary providing some statistics of all patients broken down by careunit.
2. A summary providing some statistics of all patients broken down by SAPS-II score quartile.
3. A summary providing some statistics of the 29 distinct physiological measurements used in the paper.

These summaries need two arguments to be created:
* `cutoff_hours` (default 24) which is the minimum number of hours a patient needs to stay in the ICU to be considered part of a cohort.
* `gap_hours` (default 12) which is the number of hours between the end of `cutoff_hours` and the moment a model can start making a mortality prediction.

These two arguments matter because they drive the exclusion criteria used in the paper. In particular, the paper:
1. Excludes all patients that met the in-hospital mortality criteria before `cutoff_hours` + `gap_hours`.
2. Excludes patients that were discharged before `cutoff_hours` + `gap_hours`.

The in-hospital mortality criteria used in the paper are extended: they cover not only patients who died but also patients with a CMO (Comfort Measures Only) note. The summaries take this into account as well.
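
The two exclusion rules above can be sketched with pandas. This is a minimal illustration, not the code in `mtl_patients`: the column names `hours_to_mortality` (hours from ICU admission to the earliest event in the extended mortality definition, `NaN` if none) and `los_hours` (length of ICU stay in hours) are hypothetical.

```python
import numpy as np
import pandas as pd

def apply_exclusions(patients: pd.DataFrame, cutoff_hours: int = 24, gap_hours: int = 12) -> pd.DataFrame:
    """Drop patients who met the mortality criteria or left before cutoff + gap."""
    threshold = cutoff_hours + gap_hours
    # NaN compares as False, so patients without a mortality event are never flagged here
    early_mortality = patients["hours_to_mortality"] < threshold
    early_discharge = patients["los_hours"] < threshold
    return patients.loc[~(early_mortality | early_discharge)]

# Toy data with hypothetical column names
toy = pd.DataFrame({
    "subject_id": [1, 2, 3, 4],
    "hours_to_mortality": [np.nan, 20.0, 90.0, np.nan],
    "los_hours": [100.0, 20.0, 90.0, 30.0],
})
kept = apply_exclusions(toy)
print(kept["subject_id"].tolist())
```

With the default 24 + 12 hours, patient 2 is dropped for early mortality and patient 4 for early discharge.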

### 1.1. Summaries, 24 hours (cutoff period) + 12 hours (gap period)

Now let's run the `get_summaries()` function with `cutoff_hours` = 24 and `gap_hours` = 12:

In [1]:
%load_ext autoreload
%autoreload 2

import numpy as np
import pandas as pd

import sys
pathname = "../code/"
if pathname not in sys.path:
    sys.path.append(pathname)
pd.options.display.max_rows = 999

from mtl_patients import get_summaries



In [None]:
%%time
pat_summ_24_by_cu_df, pat_summ_24_by_sapsiiq_df, vitals_labs_summ_24_df = get_summaries(cutoff_hours=24, gap_hours=12)

Let's now display the resulting summaries one at a time.

#### 1.1.1. Data summary by patients in each intensive care unit (ICU)

In [None]:
pat_summ_24_by_cu_df

In the previous summary, patients are grouped by the careunit where they were first admitted, which is one of five:
* CCU: Coronary Care Unit
* CSRU: Cardiac Surgery Recovery Unit
* MICU: Medical Intensive Care Unit
* SICU: Surgical Intensive Care Unit
* TSICU: Trauma Surgical Intensive Care Unit

In addition, an overall group was also added. The statistics provided by the summary are:
* `N`: The number of samples (patients) in the group.
* `n`: The number of samples (patients) meeting the in-hospital mortality criteria defined in the paper: patient died or had a note of "Do Not Resuscitate" (DNR) or had a note of "Comfort Measures Only" (CMO).
* `Class Imbalance`: Ratio of patients meeting the in-hospital mortality criteria defined in the paper, i.e., $\dfrac{\text{n}}{\text{N}}$.
* `Age (Mean)`: Mean age of patients for each group in years.
* `Gender (Male)`: Ratio of patients that are males.

This summary was prepared to match Table 1 in the original paper. The differences between the two can be attributed to how MIMIC-Extract preprocesses the data compared with the preprocessing done by the authors back in 2018, before MIMIC-Extract became available. The authors did not include that preprocessing in [their code](https://github.com/mit-caml/multitask-patients).

#### 1.1.2. Data summary by patients in each SAPS-II score quartile

In [None]:
pat_summ_24_by_sapsiiq_df

In the previous summary, patients are grouped by the quartile of the SAPS-II score assigned to them. As can be seen, the four quartiles have the ranges $[0, 22], [23, 32], [33, 41], [42, 118]$. This breakdown is included in the authors' code but not in the paper; class imbalance might have been the primary reason. As is evident from the summary, most of the patients are in quartile $3$ since they are in an ICU and their values are expected to be on the high side.

#### 1.1.3. Data summary for physiological measurements

In [None]:
vitals_labs_summ_24_df

In the previous summary, all vitals and lab measurements selected in the paper (29 in total) are listed with the following statistics:
* `min` representing the minimum of the measurement observed in the vitals/labs.
* `avg` representing the average of the measurement observed in the vitals/labs.
* `max` representing the maximum of the measurement observed in the vitals/labs.
* `std` representing the standard deviation of the measurement observed in the vitals/labs.
* `N` representing the number of non `NaN` samples for the specific vital/lab measurement.
* `pres.` representing the portion of all possible hours across all patients, admissions, and ICU stays where at least one of the 104 vitals/labs measurements in the original MIMIC-Extract pipeline was taken.

All these measurements are based on the `vitals_labs_mean` dataframe in the MIMIC-Extract pipeline, which provides the hourly average of each vital/lab for each patient after going into an ICU.
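
As a rough illustration of how the `min`, `avg`, `max`, `std`, and `N` columns can be derived from hourly averages, here is a sketch on a toy dataframe. The data and column names are made up, and the `pres.` column is left out because it depends on details of the MIMIC-Extract pipeline.

```python
import numpy as np
import pandas as pd

# Toy stand-in for hourly averages: one row per patient-hour, one column per measurement
hourly = pd.DataFrame({
    "heart rate": [80.0, np.nan, 95.0, 70.0],
    "glucose": [np.nan, 110.0, np.nan, 140.0],
})

# pandas skips NaN in these aggregations, and "count" returns the number of non-NaN samples
summary = hourly.agg(["min", "mean", "max", "std", "count"]).T
summary.columns = ["min", "avg", "max", "std", "N"]
print(summary)
```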

### 1.2. Summaries, 48 hours (cutoff period) + 24 hours (gap period)

Now let's run the `get_summaries()` function with `cutoff_hours` = 48 and `gap_hours` = 24:

In [None]:
%%time
pat_summ_48_by_cu_df, pat_summ_48_by_sapsiiq_df, vitals_labs_summ_48_df = get_summaries(cutoff_hours=48, gap_hours=24)

#### 1.2.1. Data summary by patients in each intensive care unit (ICU)

In [None]:
pat_summ_48_by_cu_df

#### 1.2.2. Data summary by patients in each SAPS-II score quartile

In [None]:
pat_summ_48_by_sapsiiq_df

#### 1.2.3. Data summary for physiological measurements

In [None]:
vitals_labs_summ_48_df

## 2. Discovering Patient Cohorts

The paper uses a two-step pipeline to: 1) identify relevant patient cohorts, and 2) use those relevant cohorts as separate tasks in a multi-task learning framework to predict in-hospital mortality. In this section, we will focus on the first step of the pipeline, i.e., patient cohort discovery.

In order to identify meaningful patient cohorts, the paper proposes to process the raw patient data in such a way that the result is a 3D matrix of shape $(P \times T \times F)$ where $P$ represents the number of patients, $T$ the number of timesteps, and $F$ the number of features as depicted in the figure below (in blue) which is partially based on Figure 2 of the original paper. All numbers shown in the figure below correspond to a specific experiment published in the paper in which the observation window is limited to the first $24$ hours (cutoff period) after a patient goes into a careunit and there is a gap of $12$ hours (gap period) between the end of the observation window and the beginning of the prediction window where the prediction task is in-hospital mortality.

The data for the 3D (blue) matrix is prepared by a function called `prepare_data()` inside the `mtl_patients.py` module. This preparation consists of the following transformations, taken from the paper and the authors' reference implementation:
1. Calculation of the mortality flag (prediction label) and mortality time for every patient in the dataset using an *extended* definition of mortality: death, a note of "Do Not Resuscitate" (DNR), or a note of "Comfort Measures Only" (CMO). If any of these conditions is met for a patient, the mortality label is set to *True* and the mortality time is the earliest time of any of the three conditions. After reviewing the authors' code in detail, it seems the mortality label is based on death time and a CMO note but not on DNR, while the mortality time is still the earliest time of the three conditions.
2. Data used for prediction is limited to the first hours after a patient goes into the ICU. The code calls this number of hours the "cutoff period" (observation window); it defines the period of data used to train all models. The code also defines a "gap period", the time between the end of the observation window and the beginning of the prediction window, which prevents label leakage. All patients that died under the *extended* definition before the cutoff period plus the gap period, or stayed less than the cutoff period, are excluded from the experiment as part of this step. All patients under the age of 15 are excluded as well (this is already part of the exclusion criteria of the MIMIC-Extract pipeline).
3. The paper selects 29 vitals/labs time series. Only vitals/labs data within the cutoff period is kept and the rest is removed. This data is used for the rest of the machine learning pipeline.
4. All vitals/labs values are converted to z-scores so they all have zero mean and unit standard deviation. Those z-scores are rounded to the closest integer and clipped between $-4$ and $4$, or set to $9$ in case of `NaN`. This maps every vital/lab measurement (a float) to one of ten possible values $[-4, -3, -2, -1, 0, 1, 2, 3, 4, 9]$, so they can be converted to dummy variables. After dummifying the vitals/labs, the column for the value $9$ (`NaN`) is removed, and the resulting matrix is sparse and contains only $0$s and $1$s.
5. Every patient is padded with rows of zeroes for the hours that are missing. For example, if a patient only has vitals/labs for the first ten hours and the cutoff period is 24, the code adds fourteen hours (rows) of zeroes for that patient. In the end, the matrix has a size of $P \times T \times F$ as expected by the subsequent models.
6. Static data (gender, age, and ethnicity) is converted to integers representing categories and dummified. Age is split into four buckets, $(10, 30), (30, 50), (50, 70), (70, \infty)$, while ethnicity is split into five buckets (asian, white, hispanic, black, other).
7. Finally, cohort assignments based on first careunit or Simplified Acute Physiology Score (SAPS) II quartile are calculated for each patient and returned as well.
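
Steps 4 and 5 above can be sketched for a single measurement of a single patient. This is an illustrative approximation rather than the code in `prepare_data()`: the helper names `discretize` and `pad_hours` are hypothetical, the z-score here is computed from the toy series itself, and the resulting column layout is not meant to reproduce the feature count of the real pipeline.

```python
import numpy as np
import pandas as pd

LEVELS = list(range(-4, 5)) + [9]  # nine z-score bins plus 9 as the NaN marker

def discretize(values: pd.Series) -> pd.DataFrame:
    """Z-score, round, clip to [-4, 4], mark NaN as 9, one-hot encode, drop the NaN column."""
    z = (values - values.mean()) / values.std()
    codes = z.round().clip(-4, 4).fillna(9).astype(int)
    # A fixed category list keeps the same columns even when a bin is unobserved
    categorical = pd.Series(pd.Categorical(codes, categories=LEVELS), index=values.index)
    dummies = pd.get_dummies(categorical, prefix=values.name).astype(int)
    return dummies.drop(columns=[f"{values.name}_9"])

def pad_hours(patient_rows: np.ndarray, cutoff_hours: int) -> np.ndarray:
    """Truncate to the cutoff and append all-zero rows for the missing hours."""
    rows = patient_rows[:cutoff_hours]
    missing = cutoff_hours - rows.shape[0]
    return np.pad(rows, ((0, missing), (0, 0)), mode="constant")

hr = pd.Series([60.0, 80.0, np.nan, 100.0, 80.0], name="hr")
onehot = discretize(hr)
padded = pad_hours(onehot.to_numpy(), cutoff_hours=24)
print(onehot.shape, padded.shape)
```

Note that the hour with a missing value ends up as an all-zero row once the `NaN` column is dropped, which makes it indistinguishable from a padded hour.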

![Figure 1](../img/paper-181-fig-1.png)

The `discover_cohorts()` function inside the `mtl_patients.py` module implements the pipeline shown in the figure above, calling the `prepare_data()` function detailed previously as its first step. Once the data has been processed, the function splits it into training, validation, and test sets in a $70\%/10\%/20\%$ proportion.
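
A $70\%/10\%/20\%$ split can be sketched with a plain random shuffle of patient indices. The helper `split_indices` is hypothetical, and whether `discover_cohorts()` shuffles, stratifies, or fixes a seed in the same way is not something this sketch assumes.

```python
import numpy as np

def split_indices(num_patients: int, fractions=(0.7, 0.1, 0.2), seed: int = 0):
    """Return shuffled train/validation/test index arrays in the given proportions."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(num_patients)
    n_train = int(round(fractions[0] * num_patients))
    n_val = int(round(fractions[1] * num_patients))
    # The test set takes whatever remains, so the three parts always cover every patient
    return order[:n_train], order[n_train:n_train + n_val], order[n_train + n_val:]

train_idx, val_idx, test_idx = split_indices(1000)
print(len(train_idx), len(val_idx), len(test_idx))
```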

The training data is used to train an LSTM autoencoder. The main purpose of the LSTM autoencoder is to generate a fixed-length dense representation (embedding) of the sparse inputs trying to retain the most important parts of the inputs. The paper selected embeddings of size $50$ as the optimal dimension (hyperparameter). The purple box in the middle of the diagram above (a 2D matrix) represents the embeddings after the LSTM autoencoder learned the representation of the original 3D matrix of shape $(32537 \times 24 \times 232)$ where every row corresponds to a patient.
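
To make the idea concrete, here is a minimal Keras sketch of an LSTM autoencoder that compresses each patient's $(T \times F)$ matrix into a 50-dimensional embedding. The layer layout, loss, and optimizer are assumptions chosen for brevity, and the input is random toy data; the architecture in the paper and in `mtl_patients` may differ.

```python
import numpy as np
from tensorflow import keras

def build_lstm_autoencoder(timesteps: int, n_features: int, latent_dim: int = 50):
    """Encoder LSTM -> fixed-length embedding -> decoder LSTM reconstructing the sequence."""
    inputs = keras.Input(shape=(timesteps, n_features))
    embedding = keras.layers.LSTM(latent_dim, name="embedding")(inputs)
    # Repeat the embedding once per timestep so the decoder can unroll it
    repeated = keras.layers.RepeatVector(timesteps)(embedding)
    decoded = keras.layers.LSTM(n_features, return_sequences=True)(repeated)
    autoencoder = keras.Model(inputs, decoded)
    encoder = keras.Model(inputs, embedding)
    autoencoder.compile(optimizer="adam", loss="mse")
    return autoencoder, encoder

# Toy sparse binary input: 8 patients, 24 hours, 232 features
X = (np.random.default_rng(0).random((8, 24, 232)) > 0.9).astype("float32")
autoencoder, encoder = build_lstm_autoencoder(timesteps=24, n_features=232)
autoencoder.fit(X, X, epochs=1, batch_size=4, verbose=0)
embeddings = encoder.predict(X, verbose=0)
print(embeddings.shape)
```

The `encoder` model shares its weights with the autoencoder, so after training it returns one embedding row per patient.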

Once the embeddings are calculated, a Gaussian Mixture Model is applied using $3$ clusters (the value the authors considered optimal). The result is the three green boxes, which represent three cohorts discovered in an unsupervised way that group similar patients based on the three static features and the 29 time-varying vitals/labs selected from the MIMIC-III database.
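
The clustering step can be sketched with scikit-learn's `GaussianMixture`. The embeddings below are synthetic (three well-separated blobs in 50 dimensions), so the example only shows the mechanics of assigning a cohort to each patient; the covariance type and seed are defaults picked for the sketch.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic stand-in for the autoencoder embeddings: 300 patients, 50 dimensions
embeddings = np.vstack([
    rng.normal(loc=center, scale=0.5, size=(100, 50)) for center in (-3.0, 0.0, 3.0)
])

gmm = GaussianMixture(n_components=3, random_state=0)
cohorts = gmm.fit_predict(embeddings)  # one cohort id per patient
print(np.bincount(cohorts))
```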

### 2.1. Cohort statistics at 24 hours and 48 hours

The paper runs two experiments. The first experiment uses a cutoff period of 24 hours, a gap period of 12 hours, and three clusters. Let's run this first experiment using the `discover_cohorts()` function and determine the corresponding cohort assignment for every patient that does not meet the exclusion criteria:

In [None]:
from mtl_patients import discover_cohorts

In [None]:
%%time
cohort_unsupervised_24 = discover_cohorts(cutoff_hours=24, gap_hours=12, cohort_unsupervised_filename='../data/unsupervised_clusters_24.npy')

The second experiment uses a cutoff period of 48 hours, a gap period of 24 hours, and two clusters instead of three. The reduction in the number of clusters is proposed by the authors (Table 3 of the paper). Let's now run this second experiment using the `discover_cohorts()` function and determine the corresponding cohort assignment for every patient that does not meet the exclusion criteria:

In [None]:
%%time
cohort_unsupervised_48 = discover_cohorts(cutoff_hours=48, gap_hours=24, num_clusters=2, cohort_unsupervised_filename='../data/unsupervised_clusters_48.npy')

Let's summarize the results of the 24 hour and 48 hour experiments similar to what Table 3 of the paper shows.

In [None]:
#----------------------------------------------------
# Let's create the summary for the 24 hour experiment

cohort_unsupervised_24 = np.load('../data/unsupervised_clusters_24.npy')

from mtl_patients import prepare_data
_, Y, _, _, subject_ids = prepare_data(cutoff_hours=24, gap_hours=12)

In [None]:
subject_ids = np.array(subject_ids.tolist())
cohort_unsupervised_24_df = pd.DataFrame({'subject_id': subject_ids, 'Y': Y, 'Group': cohort_unsupervised_24}, dtype=int)

# calculate summaries per cohort (24 hours)
table3_a_df = cohort_unsupervised_24_df.groupby('Group').agg(
    N=('Y', 'size'),
    n=('Y', 'sum'),
)
table3_a_df.loc[:, 'Experiment'] = '24 hours'
table3_a_df.loc[:, 'Cohort Type'] = 'Unsupervised'

# calculate overall summary (24 hours)
table3_a_overall_df = table3_a_df.groupby(['*'] * len(table3_a_df)).agg(
    N=('N', 'sum'),
    n=('n', 'sum'),
)
table3_a_overall_df.index.name = 'Group'
table3_a_overall_df.loc[:, 'Experiment'] = '24 hours'
table3_a_overall_df.loc[:, 'Cohort Type'] = 'Global'

# merge 24 hour tables and make cosmetic changes
table3_a_df = pd.concat([table3_a_df, table3_a_overall_df], axis=0)
table3_a_df.reset_index(inplace=True)
table3_a_df.set_index(['Experiment', 'Cohort Type', 'Group'], inplace=True)

In [None]:
#----------------------------------------------------
# Let's create the summary for the 48 hour experiment

cohort_unsupervised_48 = np.load('../data/unsupervised_clusters_48.npy')

_, Y, _, _, subject_ids = prepare_data(cutoff_hours=48, gap_hours=24)

In [None]:
subject_ids = np.array(subject_ids.tolist())
cohort_unsupervised_48_df = pd.DataFrame({'subject_id': subject_ids, 'Y': Y, 'Group': cohort_unsupervised_48}, dtype=int)

# calculate summaries per cohort (48 hours)
table3_b_df = cohort_unsupervised_48_df.groupby('Group').agg(
    N=('Y', 'size'),
    n=('Y', 'sum'),
)
table3_b_df.loc[:, 'Experiment'] = '48 hours'
table3_b_df.loc[:, 'Cohort Type'] = 'Unsupervised'

# calculate overall summary (48 hours)
table3_b_overall_df = table3_b_df.groupby(['*'] * len(table3_b_df)).agg(
    N=('N', 'sum'),
    n=('n', 'sum'),
)
table3_b_overall_df.index.name = 'Group'
table3_b_overall_df.loc[:, 'Experiment'] = '48 hours'
table3_b_overall_df.loc[:, 'Cohort Type'] = 'Global'

# merge 48 hour tables and make cosmetic changes
table3_b_df = pd.concat([table3_b_df, table3_b_overall_df], axis=0)
table3_b_df.reset_index(inplace=True)
table3_b_df.set_index(['Experiment', 'Cohort Type', 'Group'], inplace=True)

In [None]:
#--------------------------------------------------------------------
# Now let's merge results for 24 and 48 hour experiments in one table

# merge 24 hour and 48 hour tables
table3_df = pd.concat([table3_a_df, table3_b_df], axis=0)

# calculate class imbalance
table3_df.loc[:, 'Class Imbalance'] = table3_df.loc[:, 'n'] / table3_df.loc[:, 'N']
table3_df.loc[:, 'Class Imbalance'] = table3_df.loc[:, 'Class Imbalance'].round(4)

In [None]:
table3_df

The table above is the equivalent of Table 3 in the paper. The results differ, possibly because the data from MIMIC-Extract differs from the data used by the authors.

For the 24 hour experiment, both the cluster sizes and the mortality rates differ across clusters. For instance, clusters 0 and 1 are smaller than cluster 2 but their mortality rates are significantly higher (9.27% and 10.15% versus 4.49%). For the 48 hour experiment, the sizes and mortality rates of the two resulting clusters are closer to each other than in the 24 hour experiment.

### 2.2. Visualization of selected lab test and vital signs features

In this section, we will try to reproduce the results from Figure 4 (section 6.1.1) in the paper. In Figure 4, data from experiment 1 (24 hours) is used to create heatmap plots to determine if patients from different cohorts are physiologically distinct. To do that, we added the function `get_heatmap_data()` to get the mean of all z-scores by patient, by hour in the ICU, by cohort.
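
The kind of aggregation behind such a heatmap can be sketched as a group-by over cohort and hour. The long-format toy dataframe and its column names are hypothetical, and the sketch averages z-scores across the patients of each cohort for every hour, which is one reading of what `get_heatmap_data()` computes.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_patients, n_hours = 6, 4

# Toy long-format data: one row per patient-hour with a z-scored measurement
toy = pd.DataFrame({
    "subject_id": np.repeat(np.arange(n_patients), n_hours),
    "hour": np.tile(np.arange(n_hours), n_patients),
    "cohort": np.repeat([0, 0, 1, 1, 2, 2], n_hours),
    "glucose_z": rng.normal(size=n_patients * n_hours),
})

# Mean z-score per cohort and hour, with hours spread across the columns
heat = toy.groupby(["cohort", "hour"])["glucose_z"].mean().unstack("hour")
print(heat.shape)
```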

#### 2.2.1. Heatmap plots for experiment 1 (24 hours)

Let's run the `get_heatmap_data()` function using the cohorts discovered in experiment 1 (24 hour cutoff period and 12 hour gap period):

In [None]:
%%time

from mtl_patients import get_heatmap_data
import seaborn as sns
import matplotlib.pyplot as plt

labs_df, vitals_df = get_heatmap_data(cutoff_hours=24, gap_hours=12,
                                      cohort_unsupervised_filename='../data/unsupervised_clusters_24.npy')

Let's plot the heatmaps for the selected lab tests and vitals:

In [None]:
num_cohorts = len(np.unique(labs_df.index.get_level_values(0)))
fig, axs = plt.subplots(nrows=2, ncols=num_cohorts)
fig.set_size_inches(24, 12)
for i in np.arange(num_cohorts):
    plot_df = labs_df.unstack(1).stack(1).query(f'cohort == {i}').droplevel(0)
    sns.heatmap(plot_df, ax=axs[0, i], yticklabels=True if i==0 else False).set(title=f'Lab Tests – Cohort {i}')
for i in np.arange(num_cohorts):
    plot_df = vitals_df.unstack(1).stack(1).query(f'cohort == {i}').droplevel(0)
    sns.heatmap(plot_df, ax=axs[1, i], yticklabels=True if i==0 else False).set(title=f'Vitals – Cohort {i}')
plt.savefig('../img/heatmap_24')

The heatmap plots show some trends in the physiological data that seem to distinguish the cohorts. Among the labs, glucose and blood urea nitrogen seem to differ between the three cohorts. Some of the vitals show clearer differences. Blood pressure, for example, differs across cohorts: while the diastolic blood pressure of cohort 0 patients tends to drop the longer the patient stays, that of cohort 2 patients tends to increase.

#### 2.2.2. Heatmap plots for experiment 2 (48 hours)

Let's run the `get_heatmap_data()` function using the cohorts discovered in experiment 2 (48 hour cutoff period and 24 hour gap period):

In [None]:
%%time
labs_df, vitals_df = get_heatmap_data(cutoff_hours=48, gap_hours=24,
                                      cohort_unsupervised_filename='../data/unsupervised_clusters_48.npy')

Let's plot the heatmaps for the selected lab tests and vitals:

In [None]:
num_cohorts = len(np.unique(labs_df.index.get_level_values(0)))
fig, axs = plt.subplots(nrows=2, ncols=num_cohorts)
fig.set_size_inches(24, 12)
for i in np.arange(num_cohorts):
    plot_df = labs_df.unstack(1).stack(1).query(f'cohort == {i}').droplevel(0)
    sns.heatmap(plot_df, ax=axs[0, i], yticklabels=True if i==0 else False).set(title=f'Lab Tests – Cohort {i}')
for i in np.arange(num_cohorts):
    plot_df = vitals_df.unstack(1).stack(1).query(f'cohort == {i}').droplevel(0)
    sns.heatmap(plot_df, ax=axs[1, i], yticklabels=True if i==0 else False).set(title=f'Vitals – Cohort {i}')
plt.savefig('../img/heatmap_48')

In the 48 hour experiment, labs do not show significant differences between the two cohorts. However, the vitals show differences in blood pressure, which trends higher in cohort 1 than in cohort 0.

## 3. Predicting In-Hospital Mortality

As mentioned in the previous section, the paper uses a two-step pipeline to: 1) identify relevant patient cohorts, and 2) use those relevant cohorts as separate tasks in a multi-task learning framework to predict in-hospital mortality. In this section, we will focus on the second step of the pipeline, i.e., using multi-task learning to make in-hospital mortality predictions for different patient cohorts.

The second step uses as input the result from the first step which is a series of 3D matrices, one per discovered cohort, of shape $(P \times T \times F)$ where $P$ represents the number of patients, $T$ the number of timesteps, and $F$ the number of features. As an example, the 24 hour experiment described by the authors in the paper and reproduced in the previous section resulted in three cohorts (clusters) called group 0, group 1, and group 2 where the shapes of the corresponding 3D matrices are:
* $14120 \times 24 \times 232$ for group 0,
* $10841 \times 24 \times 232$ for group 1, and
* $7752 \times 24 \times 232$ for group 2.

To convert these matrices into predictions, the paper proposes an LSTM for all model configurations, including the baseline. In particular, the paper shows results for two models: a single-task learning baseline called *global*, and the multi-task learning model the authors claim is superior to the baseline.

A diagram of the baseline (*global*) model proposed by the authors is shown below. This model consists of an LSTM layer of 16 cells using a ReLU activation function followed by a *single* dense layer with a sigmoid activation function. The output of the dense (fully-connected) layer is an estimate of the probability of in-hospital mortality for a given patient. This baseline model is trained with all patient samples regardless of cohort, hence the name *global*, and is then used for per-cohort predictions.

![Figure 2](../img/paper-181-fig-2.png)

The second model, the one the authors claim improves on the baseline, is the so-called *multi-task learning model*. This model consists of an LSTM layer with the same number of cells (16) as the baseline model, to keep the comparison fair, connected to as many dense layers as there are population subgroups (cohorts). Each cohort is considered a *task*, and the authors propose training the model on multiple tasks simultaneously, in contrast to the baseline model with just one dense layer. According to the authors, the benefit of this approach is the ability to share knowledge learned from one task (cohort) with the rest of the tasks, under the assumption that the subpopulations are distinct enough with respect to the outcome learned (mortality) that such shared knowledge truly exists. A representation of the multi-task learning model is shown below:

![Figure 3](../img/paper-181-fig-3.png)
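
The shared-LSTM layout with one sigmoid head per cohort can be sketched in Keras as follows. This is an assumed minimal architecture built from the description above (16 LSTM cells with ReLU, one dense sigmoid layer per task); the helper name is hypothetical, and the way the real implementation routes each patient's loss to the head of their own cohort is not covered here.

```python
import numpy as np
from tensorflow import keras

def build_multitask_model(timesteps: int, n_features: int, n_tasks: int, lstm_units: int = 16):
    """Shared LSTM followed by one sigmoid output head per cohort (task)."""
    inputs = keras.Input(shape=(timesteps, n_features))
    shared = keras.layers.LSTM(lstm_units, activation="relu", name="shared_lstm")(inputs)
    outputs = [
        keras.layers.Dense(1, activation="sigmoid", name=f"task_{k}")(shared)
        for k in range(n_tasks)
    ]
    return keras.Model(inputs, outputs)

model = build_multitask_model(timesteps=24, n_features=232, n_tasks=3)
X = np.zeros((4, 24, 232), dtype="float32")  # toy batch of 4 patients
probs = model.predict(X, verbose=0)          # list with one array per task
print(len(probs), probs[0].shape)
```

Each head produces a mortality probability for every patient in the batch; at evaluation time only the head matching the patient's cohort would be read.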

For benchmarking purposes of the entire pipeline, the authors compared the results from running the pipeline using unsupervised cohort discovery (step one) against cohorts created using the first careunit the patient went into which can be considered an engineered feature. We will show those results in the next subsections.

The overall performance of this model is measured using both macro and micro metrics (section 4.3 in the paper) where:
* In *micro* metrics, the predicted probabilities for all patients are treated as if they came from a single classifier: $\text{Metric}_\text{Micro} = \text{Metric}([\hat{y}_0, \ldots, \hat{y}_{K-1}], [y_0, \ldots, y_{K-1}])$, where $\hat{y}_k$ and $y_k$ are the predictions and labels of cohort $k$ and $K$ is the number of cohorts.
* In *macro* metrics, probabilities are evaluated on a *per cohort* basis and then averaged: $\text{Metric}_\text{Macro} = \dfrac{1}{K} \displaystyle\sum_{k=0}^{K-1} \text{Metric}(\hat{y}_k, y_k)$.

The paper suggests that, although micro metrics are the ones typically chosen in the literature, evaluating performance on different subpopulations benefits from macro metrics instead, especially when there is class imbalance in every cohort. All results show macro and micro versions of the metrics for the aggregate performance of the models.

The paper compares models using three metrics:
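
The difference between the two aggregations can be illustrated with a small NumPy sketch. The `auc` helper below computes AUC as the fraction of positive/negative pairs ranked correctly (ties count half); it is written for clarity on toy data, not as the evaluation code used in this project.

```python
import numpy as np

def auc(y_true, y_score) -> float:
    """AUC as the probability that a random positive outranks a random negative."""
    y_true = np.asarray(y_true).astype(bool)
    y_score = np.asarray(y_score, dtype=float)
    pos, neg = y_score[y_true], y_score[~y_true]
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0) + 0.5 * (diff == 0)).mean())

def micro_macro(y_true, y_score, cohort, metric=auc):
    """Micro pools all patients; macro averages the per-cohort metric."""
    y_true, y_score, cohort = map(np.asarray, (y_true, y_score, cohort))
    micro = metric(y_true, y_score)
    per_cohort = [metric(y_true[cohort == k], y_score[cohort == k]) for k in np.unique(cohort)]
    return micro, float(np.mean(per_cohort))

y_true = [0, 1, 0, 1, 0, 0, 1, 1]
y_score = [0.1, 0.9, 0.8, 0.7, 0.2, 0.3, 0.6, 0.4]
cohort = [0, 0, 0, 0, 1, 1, 1, 1]
micro, macro = micro_macro(y_true, y_score, cohort)
print(micro, macro)
```

In this toy case the second cohort is ranked perfectly, which lifts the macro value above the micro value.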
* AUC (Area Under the ROC Curve) for every cohort and, for the aggregate performance, macro and micro.
* PPV (Positive Predictive Value which is same as Precision) for every cohort and, for the aggregate performance, macro and micro. This PPV is calculated at a sensitivity of 80%, a value selected by the paper authors.
* Specificity for every cohort and, for the aggregate performance, macro and micro. This specificity is calculated at a sensitivity of 80%, a value selected by the paper authors.
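
PPV and specificity at a fixed sensitivity can be sketched by scanning thresholds from strictest to loosest and stopping at the first one that reaches the target. The helper below is a hypothetical illustration on toy data; the project code may pick the operating point differently (for example, by interpolating on the ROC curve).

```python
import numpy as np

def ppv_spec_at_sensitivity(y_true, y_score, target_sensitivity: float = 0.8):
    """Return (PPV, specificity, threshold) at the strictest threshold reaching the target."""
    y_true = np.asarray(y_true).astype(bool)
    y_score = np.asarray(y_score, dtype=float)
    for threshold in np.sort(np.unique(y_score))[::-1]:
        pred = y_score >= threshold
        tp = np.sum(pred & y_true)
        if tp / y_true.sum() >= target_sensitivity:
            fp = np.sum(pred & ~y_true)
            tn = np.sum(~pred & ~y_true)
            return tp / (tp + fp), tn / (tn + fp), threshold
    raise ValueError("target sensitivity cannot be reached")

y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_score = [0.9, 0.8, 0.7, 0.6, 0.2, 0.65, 0.3, 0.1, 0.05, 0.4]
ppv, spec, thr = ppv_spec_at_sensitivity(y_true, y_score)
print(ppv, spec, thr)
```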

All in-hospital mortality prediction tasks are implemented using the function `run_mortality_prediction_task()`. This function calls other functions to prepare the data, split the data into training/validation/test sets, train the corresponding model, predict using the resulting model, and calculate the metrics of the model.

### 3.1. Predictions without Bootstrapping

In this section, all in-hospital mortality predictions across the two models (global and multi-task learning) and the two experiments (24 hours and 48 hours) are evaluated on the test set (20% of the original dataset) using the three metrics: AUC, PPV (precision) at 80% sensitivity, and specificity at 80% sensitivity. No bootstrapping is done.

#### 3.1.1. In-Hospital Mortality Prediction – Baseline (*Global*) Model at 24 Hours

Let's first run the mortality prediction task using the *global* model (baseline) in the 24 hour experiment setting. In this experiment, the cutoff period is 24 hours and the gap period is 12 hours, meaning the model can only use patient data collected during the first 24 hours of the ICU stay and predicts mortality starting 36 hours after the patient goes into the ICU, which avoids label leakage. For the cohort type, let's start with careunits:

In [None]:
%%time

from mtl_patients import run_mortality_prediction_task

metrics_global_24_careunits_df = run_mortality_prediction_task(model_type='global', cutoff_hours=24, gap_hours=12, cohort_criteria_to_select='careunits')

In [None]:
metrics_global_24_careunits_df.round(3)

Now, let's repeat the prediction task but this time using the groups fetched in an unsupervised way from step 1 proposed by the authors in the paper:

In [None]:
%%time
metrics_global_24_unsupervised_df = run_mortality_prediction_task(model_type='global', cutoff_hours=24, gap_hours=12,
                                                                  cohort_criteria_to_select='unsupervised', cohort_unsupervised_filename='../data/unsupervised_clusters_24.npy')

In [None]:
metrics_global_24_unsupervised_df.round(3)

#### 3.1.2. In-Hospital Mortality Prediction – Baseline (*Global*) Model at 48 Hours

Let's now run the mortality prediction task using the *global* model (baseline) in the 48 hour experiment setting. In this experiment, the cutoff period is 48 hours and the gap period is 24 hours, meaning the model can only use patient data collected during the first 48 hours of the ICU stay and predicts mortality starting 72 hours after the patient goes into the ICU, which avoids label leakage. For the cohort type, let's start with careunits:

In [None]:
%%time
metrics_global_48_careunits_df = run_mortality_prediction_task(model_type='global', cutoff_hours=48, gap_hours=24, cohort_criteria_to_select='careunits')

In [None]:
metrics_global_48_careunits_df.round(3)

Now, let's repeat the prediction task but this time using the groups fetched in an unsupervised way from step 1 proposed by the authors in the paper:

In [None]:
%%time
metrics_global_48_unsupervised_df = run_mortality_prediction_task(model_type='global', cutoff_hours=48, gap_hours=24,
                                                                  cohort_criteria_to_select='unsupervised', cohort_unsupervised_filename='../data/unsupervised_clusters_48.npy')

In [None]:
metrics_global_48_unsupervised_df.round(3)

#### 3.1.3. In-Hospital Mortality Prediction – Multi-Task Learning Model at 24 Hours

Let's now run the mortality prediction task using the multi-task learning model in the 24-hour experiment setting. In this experiment, the cutoff period is 24 hours and the gap period is 12 hours, meaning the model can only use patient data collected during the first 24 hours of the ICU stay and predicts mortality 36 hours after the patient enters the ICU, which avoids label leakage. In terms of cohort type, let's go with careunits first:

In [None]:
%%time
metrics_mtl_24_careunits_df = run_mortality_prediction_task(model_type='multitask', cutoff_hours=24, gap_hours=12, cohort_criteria_to_select='careunits')

In [None]:
metrics_mtl_24_careunits_df.round(3)

Now, let's repeat the prediction task, this time using the cohorts identified in an unsupervised way in step 1 of the pipeline proposed by the authors:

In [None]:
%%time
metrics_mtl_24_unsupervised_df = run_mortality_prediction_task(model_type='multitask', cutoff_hours=24, gap_hours=12,
                                                               cohort_criteria_to_select='unsupervised', cohort_unsupervised_filename='../data/unsupervised_clusters_24.npy')

In [None]:
metrics_mtl_24_unsupervised_df.round(3)

#### 3.1.4. In-Hospital Mortality Prediction – Multi-Task Learning Model at 48 Hours

Let's now run the mortality prediction task using the multi-task learning model in the 48-hour experiment setting. In this experiment, the cutoff period is 48 hours and the gap period is 24 hours, meaning the model can only use patient data collected during the first 48 hours of the ICU stay and predicts mortality 72 hours after the patient enters the ICU, which avoids label leakage. In terms of cohort type, let's go with careunits first:

In [None]:
%%time
metrics_mtl_48_careunits_df = run_mortality_prediction_task(model_type='multitask', cutoff_hours=48, gap_hours=24, cohort_criteria_to_select='careunits')

In [None]:
metrics_mtl_48_careunits_df.round(3)

Now, let's repeat the prediction task, this time using the cohorts identified in an unsupervised way in step 1 of the pipeline proposed by the authors:

In [None]:
%%time
metrics_mtl_48_unsupervised_df = run_mortality_prediction_task(model_type='multitask', cutoff_hours=48, gap_hours=24,
                                                               cohort_criteria_to_select='unsupervised', cohort_unsupervised_filename='../data/unsupervised_clusters_48.npy')

In [None]:
metrics_mtl_48_unsupervised_df.round(3)

#### 3.1.5. Summary

Similar to Table 4 in the paper, the dataframe below summarizes all results.

In [None]:
summary_a_df = metrics_global_24_unsupervised_df.reset_index()
summary_a_df.rename(columns={'index': 'Cohort'}, inplace=True)
summary_a_df['Cohort type'] = 'Unsupervised'
summary_a_df['Model'] = 'Global'
summary_a_df['Experiment'] = '24 hours'

summary_b_df = metrics_global_24_careunits_df.reset_index()
summary_b_df.rename(columns={'index': 'Cohort'}, inplace=True)
summary_b_df['Cohort type'] = 'Careunits'
summary_b_df['Model'] = 'Global'
summary_b_df['Experiment'] = '24 hours'

summary_c_df = metrics_mtl_24_unsupervised_df.reset_index()
summary_c_df.rename(columns={'index': 'Cohort'}, inplace=True)
summary_c_df['Cohort type'] = 'Unsupervised'
summary_c_df['Model'] = 'Multi-task'
summary_c_df['Experiment'] = '24 hours'

summary_d_df = metrics_mtl_24_careunits_df.reset_index()
summary_d_df.rename(columns={'index': 'Cohort'}, inplace=True)
summary_d_df['Cohort type'] = 'Careunits'
summary_d_df['Model'] = 'Multi-task'
summary_d_df['Experiment'] = '24 hours'

summary_24_df = pd.concat([summary_a_df, summary_b_df, summary_c_df, summary_d_df])

summary_e_df = metrics_global_48_unsupervised_df.reset_index()
summary_e_df.rename(columns={'index': 'Cohort'}, inplace=True)
summary_e_df['Cohort type'] = 'Unsupervised'
summary_e_df['Model'] = 'Global'
summary_e_df['Experiment'] = '48 hours'

summary_f_df = metrics_global_48_careunits_df.reset_index()
summary_f_df.rename(columns={'index': 'Cohort'}, inplace=True)
summary_f_df['Cohort type'] = 'Careunits'
summary_f_df['Model'] = 'Global'
summary_f_df['Experiment'] = '48 hours'

summary_g_df = metrics_mtl_48_unsupervised_df.reset_index()
summary_g_df.rename(columns={'index': 'Cohort'}, inplace=True)
summary_g_df['Cohort type'] = 'Unsupervised'
summary_g_df['Model'] = 'Multi-task'
summary_g_df['Experiment'] = '48 hours'

summary_h_df = metrics_mtl_48_careunits_df.reset_index()
summary_h_df.rename(columns={'index': 'Cohort'}, inplace=True)
summary_h_df['Cohort type'] = 'Careunits'
summary_h_df['Model'] = 'Multi-task'
summary_h_df['Experiment'] = '48 hours'

summary_48_df = pd.concat([summary_e_df, summary_f_df, summary_g_df, summary_h_df])

summary_df = pd.concat([summary_24_df, summary_48_df])

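# Use an ordered categorical dtype on the 'Cohort' column (not 'Cohort type') so that
# Macro and Micro are displayed after the cohort names; this mirrors the bootstrapped summary
from pandas.api.types import CategoricalDtype
cohort = CategoricalDtype(['0', '1', '2', 'CCU', 'CSRU', 'MICU', 'SICU', 'TSICU', 'Macro', 'Micro'], ordered=True)
summary_df['Cohort'] = summary_df['Cohort'].astype(cohort)
summary_df = summary_df.dropna()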

summary_df = pd.melt(summary_df, id_vars=['Cohort', 'Cohort type', 'Model', 'Experiment'], var_name='Metric')
summary_df = summary_df.set_index(['Experiment', 'Cohort type', 'Cohort'])
summary_df = summary_df.pivot(columns=['Metric', 'Model'], values='value')
summary_df = summary_df.round(3)
summary_df
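
The reshaping at the end of the cell above (`pd.melt` followed by `pivot`) can be illustrated on a small toy frame. This is only a sketch: the metric values below are made up for illustration and are not results from this project, and the index is passed to `pivot` directly rather than through `set_index`.

```python
import pandas as pd

# Made-up values, one row per (cohort, model) pair
toy = pd.DataFrame({
    "Cohort": ["CCU", "CCU", "MICU", "MICU"],
    "Model": ["Global", "Multi-task", "Global", "Multi-task"],
    "AUC": [0.80, 0.82, 0.78, 0.79],
    "PPV": [0.20, 0.22, 0.18, 0.19],
})

# Long format: one row per (cohort, model, metric) with the number in 'value'
long_df = pd.melt(toy, id_vars=["Cohort", "Model"], var_name="Metric")

# Wide format: one row per cohort, columns indexed by (Metric, Model)
wide_df = long_df.pivot(index="Cohort", columns=["Metric", "Model"], values="value")
print(wide_df)
```

The resulting two-level column index is what places the global and multi-task values of each metric side by side, as in Table 4 of the paper.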

### 3.2. Predictions with Bootstrapping

In this section, all in-hospital mortality predictions across the two models (global and multi-task learning) and the two experiments (24 hours and 48 hours) are evaluated on three metrics: AUC, PPV (precision) at 80% sensitivity, and specificity at 80% sensitivity. Each metric is computed on 100 bootstrapped samples of the test set (20% of the original dataset), yielding one value per metric per bootstrapped sample. This allows the global model and the multi-task learning model to be compared using the Wilcoxon signed-rank test, as indicated in the paper.
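
As a rough illustration of the procedure described above, the sketch below resamples a test set with replacement and computes the three metrics on each resample. It is not the `mtl_patients` implementation: the helper names `metrics_at_sensitivity` and `bootstrap_metrics`, the choice of operating point (the first ROC threshold that reaches the target sensitivity), and the synthetic labels and scores are all assumptions made for this example.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

def metrics_at_sensitivity(y_true, y_score, sensitivity=0.8):
    """AUC, plus PPV and specificity at the first ROC point reaching the target sensitivity."""
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    idx = np.argmax(tpr >= sensitivity)      # first operating point with TPR >= target
    y_pred = y_score >= thresholds[idx]      # positive predictions at that threshold
    tp = np.sum(y_pred & (y_true == 1))
    fp = np.sum(y_pred & (y_true == 0))
    return {
        "AUC": roc_auc_score(y_true, y_score),
        "PPV": tp / (tp + fp),
        "Specificity": 1.0 - fpr[idx],
    }

def bootstrap_metrics(y_true, y_score, n_samples=100, seed=0):
    """Compute the metrics on n_samples resamples (with replacement) of the test set.

    Assumes every resample contains both classes, which holds for the toy data below.
    """
    rng = np.random.default_rng(seed)
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    results = []
    for _ in range(n_samples):
        idx = rng.integers(0, len(y_true), size=len(y_true))
        results.append(metrics_at_sensitivity(y_true[idx], y_score[idx]))
    return results

# Synthetic test set: scores are shifted upwards for positive labels
rng = np.random.default_rng(42)
y_true = rng.integers(0, 2, size=500)
y_score = np.clip(0.3 * y_true + rng.normal(0.4, 0.2, size=500), 0.0, 1.0)

samples = bootstrap_metrics(y_true, y_score)
print(len(samples), samples[0])
```

Each entry in `samples` corresponds to one bootstrapped sample, which is the shape of data the Wilcoxon signed-rank test needs when two models are scored on the same resamples.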

#### 3.2.1. In-Hospital Mortality Prediction – Baseline (*Global*) Model at 24 Hours

Let's first run the mortality prediction task using the *global* model (baseline) in the 24-hour experiment setting. In this experiment, the cutoff period is 24 hours and the gap period is 12 hours, meaning the model can only use patient data collected during the first 24 hours of the ICU stay and predicts mortality 36 hours after the patient enters the ICU, which avoids label leakage. Since we are enabling bootstrapping, we will repeat the same experiment with 100 bootstrapped samples from the test dataset. In terms of cohort type, let's go with careunits first:

In [None]:
%%time

from mtl_patients import run_mortality_prediction_task

metrics_global_24_careunits_btstrp_df = run_mortality_prediction_task(model_type='global', cutoff_hours=24, gap_hours=12, cohort_criteria_to_select='careunits', bootstrap=True)

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Preparing the data
--------------------------------------------------------------------------------
    Loading data from MIMIC-Extract pipeline...
    Adding SAPS II score to static dataset...
    Adding mortality columns to static dataset...
    Discretizing X...
        X.shape: (2200954, 33), X.subject_id.nunique(): 34472
        X_discrete.shape: (2200954, 225), X_discrete.subject_id.nunique(): 34472
    Keep only X_discrete[X_discrete.hours_in < 24]...
        New X_discrete.shape: (808539, 223), new X_discrete.subject_id.nunique(): 34472
    Padding patients with less than 24 hours of data...
    Merging dataframes to create X_full...
    Mortality per careunit...
        MICU: 1138 out of 11403
        SICU: 409 out of 5187
        CCU: 344 out of 4907
        CSRU: 139 out of 6971
        TSICU: 291 out of 4245
    Final shape of X: (32713, 24, 232)
    Number of positive samples: 2321
    Done!

2023-04-08 20:08:05.528535: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.


Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
INFO:tensorflow:Assets written to: ../data/models/model_global_24+12_careunits/assets
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    Predicting using 'global' model...
    Bootstrap prediction for task "CCU"...
    Bootstrap prediction for task "CSRU"...
    Bootstrap prediction for task "MICU"...
    Bootstrap prediction for task "SICU"...
    Bootstrap prediction for task "TSICU"...
    Bootstrap prediction for task "all"...

In [None]:
pd.options.display.max_rows = 20
metrics_global_24_careunits_btstrp_df.round(3)

Now, let's repeat the prediction task, this time using the cohorts identified in an unsupervised way in step 1 of the pipeline proposed by the authors:

In [None]:
%%time
metrics_global_24_unsupervised_btstrp_df = run_mortality_prediction_task(model_type='global', cutoff_hours=24, gap_hours=12, bootstrap=True,
                                                                         cohort_criteria_to_select='unsupervised', cohort_unsupervised_filename='../data/unsupervised_clusters_24.npy')

In [None]:
metrics_global_24_unsupervised_btstrp_df.round(3)

#### 3.2.2. In-Hospital Mortality Prediction – Baseline (*Global*) Model at 48 Hours

Let's now run the mortality prediction task using the *global* model (baseline) in the 48-hour experiment setting. In this experiment, the cutoff period is 48 hours and the gap period is 24 hours, meaning the model can only use patient data collected during the first 48 hours of the ICU stay and predicts mortality 72 hours after the patient enters the ICU, which avoids label leakage. Since we are enabling bootstrapping, we will repeat the same experiment with 100 bootstrapped samples from the test dataset. In terms of cohort type, let's go with careunits first:

In [None]:
%%time

from mtl_patients import run_mortality_prediction_task

metrics_global_48_careunits_btstrp_df = run_mortality_prediction_task(model_type='global', cutoff_hours=48, gap_hours=24, cohort_criteria_to_select='careunits', bootstrap=True)

In [None]:
metrics_global_48_careunits_btstrp_df.round(3)

Now, let's repeat the prediction task, this time using the cohorts identified in an unsupervised way in step 1 of the pipeline proposed by the authors:

In [None]:
%%time
metrics_global_48_unsupervised_btstrp_df = run_mortality_prediction_task(model_type='global', cutoff_hours=48, gap_hours=24, bootstrap=True,
                                                                         cohort_criteria_to_select='unsupervised', cohort_unsupervised_filename='../data/unsupervised_clusters_48.npy')

In [None]:
metrics_global_48_unsupervised_btstrp_df.round(3)

#### 3.2.3. In-Hospital Mortality Prediction – Multi-Task Learning Model at 24 Hours

Let's now run the mortality prediction task using the multi-task learning model in the 24-hour experiment setting. In this experiment, the cutoff period is 24 hours and the gap period is 12 hours, meaning the model can only use patient data collected during the first 24 hours of the ICU stay and predicts mortality 36 hours after the patient enters the ICU, which avoids label leakage. Since we are enabling bootstrapping, we will repeat the same experiment with 100 bootstrapped samples from the test dataset. In terms of cohort type, let's go with careunits first:

In [None]:
%%time

from mtl_patients import run_mortality_prediction_task

metrics_mtl_24_careunits_btstrp_df = run_mortality_prediction_task(model_type='multitask', cutoff_hours=24, gap_hours=12, cohort_criteria_to_select='careunits', bootstrap=True)

In [None]:
metrics_mtl_24_careunits_btstrp_df.round(3)

Now, let's repeat the prediction task, this time using the cohorts identified in an unsupervised way in step 1 of the pipeline proposed by the authors:

In [None]:
%%time
metrics_mtl_24_unsupervised_btstrp_df = run_mortality_prediction_task(model_type='multitask', cutoff_hours=24, gap_hours=12, bootstrap=True,
                                                                      cohort_criteria_to_select='unsupervised', cohort_unsupervised_filename='../data/unsupervised_clusters_24.npy')

In [None]:
metrics_mtl_24_unsupervised_btstrp_df.round(3)

#### 3.2.4. In-Hospital Mortality Prediction – Multi-Task Learning Model at 48 Hours

Let's now run the mortality prediction task using the multi-task learning model in the 48-hour experiment setting. In this experiment, the cutoff period is 48 hours and the gap period is 24 hours, meaning the model can only use patient data collected during the first 48 hours of the ICU stay and predicts mortality 72 hours after the patient enters the ICU, which avoids label leakage. Since we are enabling bootstrapping, we will repeat the same experiment with 100 bootstrapped samples from the test dataset. In terms of cohort type, let's go with careunits first:

In [None]:
%%time

from mtl_patients import run_mortality_prediction_task

metrics_mtl_48_careunits_btstrp_df = run_mortality_prediction_task(model_type='multitask', cutoff_hours=48, gap_hours=24, cohort_criteria_to_select='careunits', bootstrap=True)

In [None]:
metrics_mtl_48_careunits_btstrp_df.round(3)

Now, let's repeat the prediction task, this time using the cohorts identified in an unsupervised way in step 1 of the pipeline proposed by the authors:

In [None]:
%%time
metrics_mtl_48_unsupervised_btstrp_df = run_mortality_prediction_task(model_type='multitask', cutoff_hours=48, gap_hours=24, bootstrap=True,
                                                                      cohort_criteria_to_select='unsupervised', cohort_unsupervised_filename='../data/unsupervised_clusters_48.npy')

In [None]:
metrics_mtl_48_unsupervised_btstrp_df.round(3)

#### 3.2.5. Summary

Similar to Table 4 in the paper, the dataframe below summarizes all results. Due to bootstrapping, we get 100 values of each metric (AUC, PPV, or Specificity) for every combination of experiment (24 hours or 48 hours), cohort type (careunits or unsupervised), and model type (global or multi-task). We will reduce that table in the next step.

In [None]:
summary_a_btstrp_df = metrics_global_24_unsupervised_btstrp_df.reset_index()
summary_a_btstrp_df.rename(columns={'index': 'Cohort'}, inplace=True)
summary_a_btstrp_df['Cohort type'] = 'Unsupervised'
summary_a_btstrp_df['Model'] = 'Global'
summary_a_btstrp_df['Experiment'] = '24 hours'

summary_b_btstrp_df = metrics_global_24_careunits_btstrp_df.reset_index()
summary_b_btstrp_df.rename(columns={'index': 'Cohort'}, inplace=True)
summary_b_btstrp_df['Cohort type'] = 'Careunits'
summary_b_btstrp_df['Model'] = 'Global'
summary_b_btstrp_df['Experiment'] = '24 hours'

summary_c_btstrp_df = metrics_mtl_24_unsupervised_btstrp_df.reset_index()
summary_c_btstrp_df.rename(columns={'index': 'Cohort'}, inplace=True)
summary_c_btstrp_df['Cohort type'] = 'Unsupervised'
summary_c_btstrp_df['Model'] = 'Multi-task'
summary_c_btstrp_df['Experiment'] = '24 hours'

summary_d_btstrp_df = metrics_mtl_24_careunits_btstrp_df.reset_index()
summary_d_btstrp_df.rename(columns={'index': 'Cohort'}, inplace=True)
summary_d_btstrp_df['Cohort type'] = 'Careunits'
summary_d_btstrp_df['Model'] = 'Multi-task'
summary_d_btstrp_df['Experiment'] = '24 hours'

summary_24_btstrp_df = pd.concat([summary_a_btstrp_df, summary_b_btstrp_df, summary_c_btstrp_df, summary_d_btstrp_df])

summary_e_btstrp_df = metrics_global_48_unsupervised_btstrp_df.reset_index()
summary_e_btstrp_df.rename(columns={'index': 'Cohort'}, inplace=True)
summary_e_btstrp_df['Cohort type'] = 'Unsupervised'
summary_e_btstrp_df['Model'] = 'Global'
summary_e_btstrp_df['Experiment'] = '48 hours'

summary_f_btstrp_df = metrics_global_48_careunits_btstrp_df.reset_index()
summary_f_btstrp_df.rename(columns={'index': 'Cohort'}, inplace=True)
summary_f_btstrp_df['Cohort type'] = 'Careunits'
summary_f_btstrp_df['Model'] = 'Global'
summary_f_btstrp_df['Experiment'] = '48 hours'

summary_g_btstrp_df = metrics_mtl_48_unsupervised_btstrp_df.reset_index()
summary_g_btstrp_df.rename(columns={'index': 'Cohort'}, inplace=True)
summary_g_btstrp_df['Cohort type'] = 'Unsupervised'
summary_g_btstrp_df['Model'] = 'Multi-task'
summary_g_btstrp_df['Experiment'] = '48 hours'

summary_h_btstrp_df = metrics_mtl_48_careunits_btstrp_df.reset_index()
summary_h_btstrp_df.rename(columns={'index': 'Cohort'}, inplace=True)
summary_h_btstrp_df['Cohort type'] = 'Careunits'
summary_h_btstrp_df['Model'] = 'Multi-task'
summary_h_btstrp_df['Experiment'] = '48 hours'

summary_48_btstrp_df = pd.concat([summary_e_btstrp_df, summary_f_btstrp_df, summary_g_btstrp_df, summary_h_btstrp_df])

summary_btstrp_df = pd.concat([summary_24_btstrp_df, summary_48_btstrp_df])

# This is a trick using a categorical data type to have Macro and Micro after Cohort names while displaying
from pandas.api.types import CategoricalDtype
cohort = CategoricalDtype(['0', '1', '2', 'CCU', 'CSRU', 'MICU', 'SICU', 'TSICU', 'Macro', 'Micro'], ordered=True)
summary_btstrp_df['Cohort'] = summary_btstrp_df['Cohort'].astype(cohort)
summary_btstrp_df = summary_btstrp_df.dropna()

summary_btstrp_df = pd.melt(summary_btstrp_df, id_vars=['Cohort', 'Sample', 'Cohort type', 'Model', 'Experiment'], var_name='Metric')
summary_btstrp_df = summary_btstrp_df.set_index(['Experiment', 'Cohort type', 'Cohort', 'Sample'])
summary_btstrp_df = summary_btstrp_df.pivot(columns=['Metric', 'Model'], values='value')
summary_btstrp_df = summary_btstrp_df.round(3)
# Now summary_btstrp_df has all bootstrapped samples with right multi-indices for rows and columns!
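
The categorical ordering trick used in the cell above can be seen in isolation on a toy series. This is a sketch with a shortened category list; the label `'all'` is included on the assumption that labels outside the category list may be present, in which case pandas turns them into `NaN` and `dropna()` removes them.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

labels = pd.Series(["Micro", "TSICU", "all", "Macro", "CCU"])

# Ordered categories define the display order: cohort names first, then Macro and Micro
cohort = CategoricalDtype(["CCU", "TSICU", "Macro", "Micro"], ordered=True)
as_cat = labels.astype(cohort)  # 'all' is not a category, so it becomes NaN

ordered = as_cat.dropna().sort_values().tolist()
alphabetical = sorted(labels[labels != "all"])

print(ordered)       # ['CCU', 'TSICU', 'Macro', 'Micro']
print(alphabetical)  # ['CCU', 'Macro', 'Micro', 'TSICU']
```

A plain string sort would place `Macro` and `Micro` before `TSICU`, which is why the ordered categorical dtype is used for display.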

##### 3.2.5.1. Mean values of metrics from bootstrapped samples

Let's get the mean values of the 100 bootstrapped samples:

In [None]:
pd.options.display.max_rows = 999
summary_df = summary_btstrp_df.groupby(['Experiment', 'Cohort type', 'Cohort']).mean().round(3).dropna()
summary_df

##### 3.2.5.2. Wilcoxon Signed-Rank Test

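Now it is time to apply the Wilcoxon Signed-Rank Test. [This video](https://www.youtube.com/watch?v=v4ZHlTbTOK8) gives a detailed explanation of the test, which is a non-parametric alternative to the paired t-test. It is typically used when the paired differences cannot be assumed to be normally distributed or when there are few samples. Here, the pairs are the metric values obtained by the global and multi-task models on each bootstrapped sample.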
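
Before applying the test to the real results, here is a small sketch of `scipy.stats.wilcoxon` on synthetic paired data. The AUC values below are randomly generated for illustration only and do not come from this project.

```python
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)

# Synthetic paired AUCs for 100 bootstrapped samples (made-up numbers)
global_auc = rng.normal(0.80, 0.02, size=100)
mtl_auc = global_auc + rng.normal(0.01, 0.01, size=100)  # small systematic improvement

# Two-sided test on the paired differences global_auc - mtl_auc
statistic, pvalue = wilcoxon(global_auc, mtl_auc)
print(f"statistic={statistic:.1f}, p-value={pvalue:.3g}")
```

A small p-value indicates that the paired differences are unlikely to be symmetric around zero, i.e., that one model tends to score higher than the other on the same samples.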

In [None]:
from scipy.stats import wilcoxon

def calc_wilcoxon(grp_df, what):
    """Return the Wilcoxon signed-rank p-value comparing Global vs. Multi-task for one metric."""
    # Map the `what` argument to the metric name used in the column MultiIndex
    metric_names = {'auc': 'AUC', 'ppv': 'PPV', 'specificity': 'Specificity'}
    if what not in metric_names:
        raise ValueError(f"Unknown metric: {what!r}")

    # Values are paired by bootstrapped sample index
    x = grp_df[(metric_names[what], 'Global')]
    y = grp_df[(metric_names[what], 'Multi-task')]
    _, pvalue = wilcoxon(x, y)

    return pvalue

In [None]:
summary_df.loc[:, ('AUC', 'p-value')] = summary_btstrp_df.groupby(['Experiment', 'Cohort type', 'Cohort']).apply(calc_wilcoxon, what='auc')
summary_df.loc[:, ('PPV', 'p-value')] = summary_btstrp_df.groupby(['Experiment', 'Cohort type', 'Cohort']).apply(calc_wilcoxon, what='ppv')
summary_df.loc[:, ('Specificity', 'p-value')] = summary_btstrp_df.groupby(['Experiment', 'Cohort type', 'Cohort']).apply(calc_wilcoxon, what='specificity')
cols = [('AUC', 'Global'), ('AUC', 'Multi-task'), ('AUC', 'p-value'),
        ('PPV', 'Global'), ('PPV', 'Multi-task'), ('PPV', 'p-value'),
        ('Specificity', 'Global'), ('Specificity', 'Multi-task'), ('Specificity', 'p-value')]
summary_df = summary_df[cols]
summary_df