# DL4H Project

**Topic:** Combining Structured and Unstructured Data for Predictive Models: A Deep Learning Approach

**Team 105:**
*  Mariano Ovalle (lo22@illinois.edu)
*  Mochammad Dikra Prasetya (mdp9@illinois.edu)
*  Zhang Jiang (zhangj3@illinois.edu)



# Introduction

## Objective

Reproducing the "Clinical Fusion" research paper:

> Zhang, D., Yin, C., Zeng, J. et al. Combining structured and unstructured data for predictive models: a deep learning approach. BMC Medical Informatics and Decision Making 20, 280 (2020). https://doi.org/10.1186/s12911-020-01297-6



## Background

The widespread adoption of electronic health records (EHRs) has opened new avenues for health care research. Machine learning and deep learning methods have become popular in medical informatics, but many studies build predictive models on structured data alone and overlook the information contained in unstructured clinical notes, which may limit predictive accuracy. Integrating the different data types found in EHRs with deep learning techniques may therefore improve the performance of predictive models.


## Paper's Proposal

The paper introduces two general-purpose multi-modal neural network architectures that combine sequential unstructured notes with structured data for enhanced representation learning and prediction on patient-level medical data. The models use Doc2Vec [2] document embeddings for clinical notes, and either convolutional neural networks (Fusion-CNN) or long short-term memory networks (Fusion-LSTM) to model the sequential clinical notes and temporal signals. The resulting representations are concatenated into a final patient representation, which is used to make predictions.

The proposed Fusion-CNN and Fusion-LSTM models are evaluated on three MIMIC-III tasks: in-hospital mortality prediction, long length of stay prediction, and 30-day readmission prediction. The paper reports that the new approaches improve performance on these tasks.

![fusion-cnn.png](https://camo.githubusercontent.com/5e8da03117089410e20fc85e46138a9c92428243184f5e2c1fc13b5e11d4899d/68747470733a2f2f696d6775722e636f6d2f6e4b68414f724d2e706e67)

<br>
<br>

![fusion-lstm.png](https://camo.githubusercontent.com/84168f9e94bcfe582069c3fef295bfea99ab157bdb2fe643503c23e197a1bd68/68747470733a2f2f696d6775722e636f6d2f416772496b6c362e706e67)

# Scope of Reproducibility

## Hypotheses

The paper poses two hypotheses that we test:
1. Adding unstructured data (clinical notes) improves overall model performance, and the fusion models outperform traditional methods such as Random Forest and Logistic Regression.

2. Given the inherent advantage of RNNs in processing sequential or temporal data, Fusion-LSTM should outperform Fusion-CNN. (The statistics reported in the paper suggest this is not the case, but we still plan to verify it.)


## Ablations Planned

The paper already includes ablation studies, which we plan to replicate. With U, T, and S denoting the unstructured clinical notes, temporal signals, and static information, we will evaluate the two proposed models under two settings:

1. Excluding each component (U, T, S) individually to assess its contribution.
2. Including only one component at a time (i.e., models using exclusively U, exclusively T, or exclusively S).

This methodology allows us to dissect the relative importance and added value of each type of data in enhancing the model's accuracy and reliability.
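
The `--inputs` codes used later in this report (`3` for T + S, `4` for U, `7` for all three) suggest a bit-flag encoding of the components. The sketch below enumerates the ablation settings under that reading. Which of the two low bits stands for T and which for S is an assumption made here for illustration, as are the names `FLAGS` and `encode`.

```python
from itertools import combinations

# Assumed bit flags: U = 4 matches `--inputs 4` for notes; the assignment of
# the two low bits to T and S is a guess made for this example.
FLAGS = {"T": 1, "S": 2, "U": 4}

def encode(components):
    """Combine component names into a single integer code."""
    return sum(FLAGS[c] for c in components)

full = encode("UTS")
# Setting 1: drop one component at a time (keep the other two)
leave_one_out = {"".join(c): encode(c) for c in combinations("UTS", 2)}
# Setting 2: keep a single component
only_one = {c: encode(c) for c in "UTS"}

print(full)
print(leave_one_out)
print(only_one)
```

Together with the full model, the six ablation settings cover every non-empty combination of U, T, and S.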


# Methodology

The full code is available at https://github.com/dkrprasetya/clinical-fusion/tree/dev, a fork of the paper authors' [repository](https://github.com/onlyzdd/clinical-fusion).

Below are step-by-step explanations with excerpts of the code. The "Data" section must be followed first to prepare the processed data needed to run `baselines.py` and `main.py` at the later stages.

For convenience, we assume that the code is available under the following Google Drive mount:

In [None]:
from google.colab import drive
drive.mount('/content/drive')

!ls "/content/drive/MyDrive/CS598-project"

## Data

The raw data used in this project is from the Medical Information Mart for Intensive Care III (MIMIC-III v1.4), available at https://physionet.org/content/mimiciii/1.4/. The tables utilized include ADMISSIONS, CHARTEVENTS, DIAGNOSES_ICD, ICUSTAYS, LABEVENTS, NOTEEVENTS, and PATIENTS. The compressed database is approximately 6.16 GB; after unzipping, the required tables alone exceed 40 GB. Because of this size and for data-security reasons, only some of the processed data is uploaded to Google Drive and mounted in this Colab.

For data processing:

1. Data loading is first performed using PostgreSQL, following the steps outlined at https://github.com/MIT-LCP/mimic-code/tree/main/mimic-iii/buildmimic/postgres. Because the CHARTEVENTS table is larger than 30 GB, loading it can take 4 to 6 hours. We then run the SQL queries `adm_details.sql`, `pivoted_lab.sql`, and `pivoted_vital.sql` (found at 'https://github.com/dkrprasetya/clinical-fusion/tree/dev/sql_query'), derived from the original paper, to generate the corresponding processed tables. These are stored as CSV files in `<code_root>/data/mimic` (found at 'https://github.com/dkrprasetya/clinical-fusion/tree/dev/data/mimic'). `adm_details.csv` results from joining the ADMISSIONS and PATIENTS tables with the required features. `pivoted_lab.csv` and `pivoted_vital.csv` are generated from the ADMISSIONS, ICUSTAYS, LABEVENTS, and CHARTEVENTS tables and contain lab test results and vital sign measurements, respectively, based on ICU and hospital stay intervals.

2. We then use the data preprocessing code from the original repository to generate the processed data (available at 'https://github.com/dkrprasetya/clinical-fusion/tree/dev/data/processed'). The original preprocessing code contains minor bugs. For instance, in the latest MIMIC-III release, most timestamps in the ADMISSIONS, ICUSTAYS, and PATIENTS tables are synthetic (post-2100), yet some patient DOBs fall in the 1800s. When calculating patients' ages at admission (`admittime - dob`), the resulting ages of more than 300 years are outliers and can cause a `RuntimeError`. To mitigate such outliers, we impose a lower limit on the DOB, setting the minimal acceptable DOB to 2100-01-01. We also identified and corrected various other minor issues in the original paper's code. To replicate our results, ensure you have the previously generated `adm_details.csv`, `pivoted_lab.csv`, and `pivoted_vital.csv`, along with the unzipped MIMIC-III tables (ADMISSIONS.csv, CHARTEVENTS.csv, DIAGNOSES_ICD.csv, ICUSTAYS.csv, LABEVENTS.csv, NOTEEVENTS.csv, and PATIENTS.csv), under the folder 'https://github.com/dkrprasetya/clinical-fusion/tree/dev/data/mimic', then run the following commands:

```
$ python 00_define_cohort.py
$ python 01_get_signals.py
$ python 02_extract_notes.py --firstday
$ python 03_merge_ids.py
$ python 04_statistics.py
$ python 05_preprocess.py
```
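
The DOB limit described in step 2 can be sketched as follows. This is a minimal illustration rather than the repository's preprocessing code: the column names `dob` and `admittime` follow the expression quoted above, while `MIN_DOB`, `age_at_admission`, and the toy rows are assumptions made for the example.

```python
import pandas as pd

MIN_DOB = pd.Timestamp("2100-01-01")  # lower limit on the DOB described above

def age_at_admission(df):
    """Return the age in years at admission, with DOBs clipped to MIN_DOB."""
    dob = pd.to_datetime(df["dob"]).clip(lower=MIN_DOB)
    admittime = pd.to_datetime(df["admittime"])
    return (admittime - dob).dt.days / 365.25

# Toy rows (not real MIMIC-III data); the first DOB is an outlier
toy = pd.DataFrame({
    "dob": ["1850-03-01", "2120-06-01"],
    "admittime": ["2150-03-01", "2150-06-01"],
})
ages = age_at_admission(toy)
print(ages.round(1).tolist())
```

Clipping keeps every age difference within a plausible range, so the outlier row no longer produces an age of about 300 years.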

We've also uploaded the processed data under `/content/drive/MyDrive/CS598-project/data/processed`.

In [None]:
import json
import os

import numpy as np
import pandas as pd
from gensim.models.doc2vec import TaggedDocument
from utils import text2words

processed_data_dir = '/content/drive/MyDrive/CS598-project/data/processed'

def load_processed_data(data_dir):
    """Load the first-day notes and return the training split as tagged documents."""
    # Tokenize each note into a list of words
    df = pd.read_csv(os.path.join(data_dir, 'earlynotes.csv'))
    df['text'] = df['text'].astype(str).apply(text2words)

    # splits.json holds 10 folds of file names; folds 0-6 form the training set
    with open(os.path.join(data_dir, 'files', 'splits.json')) as fp:
        splits = json.load(fp)
    train_files = np.hstack([splits[i] for i in range(7)])
    # Recover the admission ID (hadm_id) from the tail of each file name
    train_ids = [int(float(name[-12:-4])) for name in train_files]
    print("train_ids len:", len(train_ids))

    train = df[df['hadm_id'].isin(train_ids)]['text'].tolist()
    print("train len:", len(train))

    # Doc2Vec expects TaggedDocument objects, each with its own tag
    train_tagged = [TaggedDocument(text, tags=[str(idx)]) for idx, text in enumerate(train)]
    print("train_tagged size:", len(train_tagged))
    return train_tagged

train_tagged = load_processed_data(processed_data_dir)

## Model

This project involves multiple models: the Doc2Vec model from `gensim.models.doc2vec` and two models implemented with PyTorch (from the original repository), LSTM and CNN.

LSTM ('https://github.com/dkrprasetya/clinical-fusion/blob/dev/lstm.py'): This model combines document embeddings, a BiLSTM layer, and a max-pooling layer to process sequential clinical notes. Two LSTM layers (`torch.nn.LSTM`) are employed, the first dedicated to the unstructured text data and the second to the structured clinical event sequences. Adaptive max pooling compacts each sequence representation into a fixed-size vector. Several linear and nonlinear mappings (with ReLU and Dropout) then transform the embeddings and LSTM outputs into feature spaces suitable for the final prediction.
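
The description above can be illustrated with a small PyTorch module. This is a minimal sketch of the fusion idea, not the repository's `lstm.py`: the class name `TinyFusionLSTM`, the layer sizes, and the input dimensions are all placeholders chosen for the example.

```python
import torch
import torch.nn as nn

class TinyFusionLSTM(nn.Module):
    """Hypothetical fusion model: one BiLSTM per sequence input plus static features."""

    def __init__(self, note_dim=200, signal_dim=32, static_dim=8, hidden=64, n_out=1):
        super().__init__()
        self.note_lstm = nn.LSTM(note_dim, hidden, batch_first=True, bidirectional=True)
        self.signal_lstm = nn.LSTM(signal_dim, hidden, batch_first=True, bidirectional=True)
        self.pool = nn.AdaptiveMaxPool1d(1)  # collapses the time axis to length 1
        self.static_fc = nn.Sequential(nn.Linear(static_dim, hidden), nn.ReLU(), nn.Dropout(0.1))
        self.head = nn.Linear(2 * hidden + 2 * hidden + hidden, n_out)

    def forward(self, notes, signals, static):
        n, _ = self.note_lstm(notes)        # [batch, n_notes, 2 * hidden]
        s, _ = self.signal_lstm(signals)    # [batch, n_steps, 2 * hidden]
        # Max-pool over time to get one fixed-size vector per sequence
        n = self.pool(n.transpose(1, 2)).squeeze(-1)
        s = self.pool(s.transpose(1, 2)).squeeze(-1)
        d = self.static_fc(static)
        # Concatenate the three representations into the patient representation
        return self.head(torch.cat([n, s, d], dim=1))

model = TinyFusionLSTM()
logits = model(torch.randn(4, 5, 200), torch.randn(4, 24, 32), torch.randn(4, 8))
print(logits.shape)
```

The output holds one raw score per patient; a sigmoid would turn it into a probability for a binary task.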

CNN ('https://github.com/dkrprasetya/clinical-fusion/blob/dev/cnn.py'): The CNN model also uses document embeddings and employs a 2-layer CNN architecture followed by max-pooling to process sequential clinical notes. The convolutional layers are designed to capture local patterns in the data, with the max-pooling layer reducing the dimensionality of the output for subsequent layers. The architecture is further enhanced with residual blocks, which ease the training of deeper networks by mitigating the vanishing gradient problem. Each block includes convolutional layers with ReLU activation and Batch Normalization (`nn.BatchNorm1d`).
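
A residual block of the kind described above can be sketched as follows. This is an illustrative example under assumed sizes, not the block defined in the repository's `cnn.py`; the name `ResidualBlock1d` and the channel and kernel settings are hypothetical.

```python
import torch
import torch.nn as nn

class ResidualBlock1d(nn.Module):
    """Hypothetical residual block: two Conv1d + BatchNorm1d layers with a skip connection."""

    def __init__(self, channels, kernel_size=3):
        super().__init__()
        padding = kernel_size // 2  # keeps the sequence length unchanged for odd kernels
        self.conv1 = nn.Conv1d(channels, channels, kernel_size, padding=padding)
        self.bn1 = nn.BatchNorm1d(channels)
        self.conv2 = nn.Conv1d(channels, channels, kernel_size, padding=padding)
        self.bn2 = nn.BatchNorm1d(channels)
        self.relu = nn.ReLU()

    def forward(self, x):  # x: [batch, channels, seq_len]
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)  # the skip connection adds the input back

block = ResidualBlock1d(channels=32)
pool = nn.AdaptiveMaxPool1d(1)
x = torch.randn(4, 32, 12)
features = pool(block(x)).squeeze(-1)  # max-pool over the sequence axis
print(features.shape)
```

Because the block preserves the input shape, several of them can be stacked before the pooling step.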

The loss function is defined in a separate file, 'https://github.com/dkrprasetya/clinical-fusion/blob/dev/myloss.py'. In short, the `Loss` class applies a sigmoid activation (`nn.Sigmoid`) to the model's outputs and then computes the binary cross-entropy loss (`nn.BCELoss`).
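
A sigmoid followed by binary cross-entropy can be written in a few lines of PyTorch. The class below is a hypothetical stand-in for illustration, not the contents of `myloss.py`; the name `SigmoidBCELoss` and the toy tensors are assumptions.

```python
import torch
import torch.nn as nn

class SigmoidBCELoss(nn.Module):
    """Hypothetical stand-in: sigmoid activation followed by binary cross-entropy."""

    def __init__(self):
        super().__init__()
        self.sigmoid = nn.Sigmoid()
        self.bce = nn.BCELoss()

    def forward(self, logits, labels):
        probs = self.sigmoid(logits)  # map raw scores to (0, 1)
        return self.bce(probs, labels.float())

logits = torch.tensor([[2.0], [-1.0], [0.5]])
labels = torch.tensor([[1.0], [0.0], [1.0]])
loss_value = SigmoidBCELoss()(logits, labels)
# The same quantity computed in a single step by a built-in PyTorch loss
reference = nn.BCEWithLogitsLoss()(logits, labels)
print(float(loss_value), float(reference))
```

`nn.BCEWithLogitsLoss` fuses the two steps and is the numerically more stable option when only the loss value is needed.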

Optimizer: the Adam optimizer (`torch.optim.Adam`) adjusts the weights of the network, with the learning rate set at the start of each training epoch by `get_lr(epoch)` in `train_eval`.
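
The per-epoch learning-rate update can be sketched as below. The schedule itself is an assumption: `get_lr_sketch` is a hypothetical step decay written for this example, and the actual `get_lr` in the repository may follow a different rule.

```python
import torch

def get_lr_sketch(epoch, base_lr=1e-3, decay=0.5, step=10):
    """Hypothetical step schedule: halve the learning rate every `step` epochs."""
    return base_lr * (decay ** (epoch // step))

params = [torch.nn.Parameter(torch.zeros(3))]
optimizer = torch.optim.Adam(params, lr=get_lr_sketch(0))

for epoch in range(25):
    lr = get_lr_sketch(epoch)
    # Write the new rate into every parameter group, as train_eval does
    for param_group in optimizer.param_groups:
        param_group['lr'] = lr
    # ... one epoch of training would run here ...

print(optimizer.param_groups[0]['lr'])
```

Setting `param_group['lr']` directly keeps the schedule in plain Python, at the cost of not using the schedulers in `torch.optim.lr_scheduler`.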


The models are not pretrained.

In [None]:
import torch

import cnn
import lstm
import myloss
import parse

# Pending update to accommodate this notebook
args = parse.args

# model = cnn.CNN(args)
model = lstm.LSTM(args)

# Loss and optimizer are created the same way as in main()
loss_func = myloss.MultiClassLoss(0)
optimizer = torch.optim.Adam(model.parameters(), args.lr)

# def train_model_one_iter(model, loss_func, optimizer):
#   pass

# num_epoch = 10
# # model training loop: it is better to print the training/validation losses during the training
# for i in range(num_epoch):
#   train_model_one_iter(model, loss_func, optimizer)
#   train_loss, valid_loss = None, None
#   print("Train Loss: %.2f, Validation Loss: %.2f" % (train_loss, valid_loss))


## Training and Evaluation

We use the function `train_eval` in `main.py` for training and evaluation by setting the argument `phase` to `train` or `test` respectively.

The function takes the following arguments:


`data_loader:` Data for training or evaluation.

`net:` The model, an instance of either LSTM or CNN, as described above.

`loss:` Loss function instance (built on `nn.BCELoss`).

`epoch:` Index of the current epoch, used to look up the learning rate.

`optimizer:` In this case, an instance of the Adam optimizer.

`best_metric:` Best validation metric so far; the validation pass returns the updated value.

`phase:` One of `train`, `valid`, or `test`.


The following excerpt from `main()` prepares the arguments passed to `train_eval`:

In [None]:
import json
import os
import time

import torch
from torch.utils.data import DataLoader

import cnn
import data_loader
import lstm
import myloss
import parse

args = parse.args
# _cuda and train_eval are helpers defined elsewhere in main.py

def main():
    # splits.json holds 10 folds of file names: 0-6 train, 7 validation, 8-9 test
    with open(os.path.join(args.files_dir, 'splits.json'), 'r') as fp:
        data_splits = json.load(fp)
    train_files = [f for idx in range(7) for f in data_splits[idx]]
    valid_files = list(data_splits[7])
    test_files = [f for idx in [8, 9] for f in data_splits[idx]]

    # In the test phase every dataset is read in 'test' mode and nothing is shuffled
    if args.phase == 'test':
        train_phase, valid_phase, test_phase, train_shuffle = 'test', 'test', 'test', False
    else:
        train_phase, valid_phase, test_phase, train_shuffle = 'train', 'valid', 'test', True
    train_dataset = data_loader.DataBowl(args, train_files, phase=train_phase)
    valid_dataset = data_loader.DataBowl(args, valid_files, phase=valid_phase)
    test_dataset = data_loader.DataBowl(args, test_files, phase=test_phase)
    train_loader = DataLoader(train_dataset, batch_size=args.batch_size, shuffle=train_shuffle, num_workers=args.workers, pin_memory=True)
    valid_loader = DataLoader(valid_dataset, batch_size=args.batch_size, shuffle=False, num_workers=args.workers, pin_memory=True)
    test_loader = DataLoader(test_dataset, batch_size=args.batch_size, shuffle=False, num_workers=args.workers, pin_memory=True)

    # Select the architecture from the --model flag
    if args.model == 'cnn':
        net = cnn.CNN(args)
    else:
        net = lstm.LSTM(args)
    # loss = myloss.Loss(0)
    loss = myloss.MultiClassLoss(0)

    net = _cuda(net, 0)
    loss = _cuda(loss, 0)

    best_metric = [0, 0]
    start_epoch = 0

    # Optionally resume from a previously saved checkpoint
    if args.resume:
        net.load_state_dict(torch.load('./models/{}.model'.format(args.model)))

    optimizer = torch.optim.Adam(net.parameters(), args.lr)

When training, the loop calls `train_eval` once per epoch on the training set and then once on the validation set:

In [None]:
    if args.phase == 'train':
        for epoch in range(start_epoch, args.epochs):
            print('start epoch :', epoch)
            t0 = time.time()
            train_eval(train_loader, net, loss, epoch, optimizer, best_metric)
            t1 = time.time()
            print('Running time:', t1 - t0)
            best_metric = train_eval(valid_loader, net, loss, epoch, optimizer, best_metric, phase='valid')
        print('best metric', best_metric)

In [None]:
def train_eval(data_loader, net, loss, epoch, optimizer, best_metric, phase='train'):
    print(phase)
    lr = get_lr(epoch)
    if phase == 'train':
        net.train()
        # Apply this epoch's learning rate to every parameter group
        for param_group in optimizer.param_groups:
            param_group['lr'] = lr
    else:
        net.eval()

    loss_list, pred_list, label_list = [], [], []
    for b, data_list in enumerate(tqdm(data_loader)):
        data, dtime, demo, content, label, files = data_list
        if args.value_embedding == 'no':
            data = Variable(_cuda(data))
        else:
            data = index_value(data)

        # Variable is a legacy wrapper; in current PyTorch it returns a plain tensor
        dtime = Variable(_cuda(dtime))
        demo = Variable(_cuda(demo))
        content = Variable(_cuda(content))
        label = Variable(_cuda(label))
        output = net(data, dtime, demo, content)  # [bs, 1]

        # The first element of the loss output is the value used for backpropagation
        loss_output = loss(output, label)
        pred_list.append(output.data.cpu().numpy())
        loss_list.append(loss_output[0].data.cpu().numpy())
        label_list.append(label.data.cpu().numpy())

        if phase == 'train':
            optimizer.zero_grad()
            loss_output[0].backward()
            optimizer.step()
    # Excerpt ends here; the rest of the function is not shown.

In the test phase, the model is instead evaluated once on the test set:

In [None]:
    elif args.phase == 'test':
        train_eval(test_loader, net, loss, 0, optimizer, best_metric, 'test')

To train the models, we use the following command:

In [None]:
$ python main.py --model [model] --task [task] --inputs [inputs]

Where:

`[model]` Options:  `cnn` or `lstm`

`[task]` Options: `mortality`, `readmit`, or `llos`

`[inputs]` Options: `3` for temporal signals + structured data, `4` for unstructured data, `7` for all three combined

To test a model, we use the same command with the `--phase` flag set to `test`:

In [None]:
$ python main.py --model [model] --task [task] --inputs [inputs] --phase test
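
The flags above could be parsed with `argparse` roughly as follows. This is a hypothetical sketch that mirrors the documented options, not the repository's `parse.py`: the defaults, the accepted range for `--inputs`, and the name `build_parser` are assumptions.

```python
import argparse

def build_parser():
    """Hypothetical parser mirroring the flags shown above; defaults are assumptions."""
    parser = argparse.ArgumentParser(description="Train or test a fusion model")
    parser.add_argument("--model", choices=["cnn", "lstm"], default="lstm")
    parser.add_argument("--task", choices=["mortality", "readmit", "llos"], default="mortality")
    # Integer code for the input combination, e.g. 3, 4, or 7 as listed above
    parser.add_argument("--inputs", type=int, choices=range(1, 8), default=7)
    parser.add_argument("--phase", choices=["train", "test"], default="train")
    return parser

# Parse an explicit argument list instead of sys.argv so the example runs in a notebook
cli_args = build_parser().parse_args(["--model", "cnn", "--task", "readmit", "--inputs", "4"])
print(cli_args.model, cli_args.task, cli_args.inputs, cli_args.phase)
```

Passing the argument list explicitly avoids conflicts with the arguments that Jupyter itself places in `sys.argv`.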

## Model comparison


To measure model performance, the paper's code uses the `sklearn.metrics` module:

[`metrics.average_precision_score`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.average_precision_score.html#sklearn.metrics.average_precision_score)

[`metrics.f1_score`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html#sklearn.metrics.f1_score)

[`metrics.roc_auc_score`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html#sklearn.metrics.roc_auc_score)

In [None]:
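import numpy as np
from sklearn import metrics

def cal_metric(y_true, probs):
    fpr, tpr, thresholds = metrics.roc_curve(y_true, probs)
    # Pick the threshold that maximizes the geometric mean of sensitivity and specificity
    optimal_idx = np.argmax(np.sqrt(tpr * (1-fpr)))
    optimal_threshold = thresholds[optimal_idx]
    # roc_curve counts scores >= threshold as positive, so use >= here as well
    preds = (probs >= optimal_threshold).astype(int)
    auc = metrics.roc_auc_score(y_true, probs)
    auprc = metrics.average_precision_score(y_true, probs)
    f1 = metrics.f1_score(y_true, preds)
    return f1, auc, auprc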
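
To make the threshold selection concrete, the sketch below applies the same idea to a handful of made-up scores: it picks the ROC threshold that maximizes the geometric mean of sensitivity and specificity and then computes F1 at that threshold. The helper name `gmean_threshold` and the toy labels and scores are assumptions for illustration.

```python
import numpy as np
from sklearn import metrics

def gmean_threshold(y_true, probs):
    """Return the ROC threshold maximizing sqrt(TPR * (1 - FPR))."""
    fpr, tpr, thresholds = metrics.roc_curve(y_true, probs)
    return thresholds[np.argmax(np.sqrt(tpr * (1 - fpr)))]

# Made-up labels and scores, not model outputs
y_true = np.array([0, 0, 0, 1, 0, 1, 1])
probs = np.array([0.05, 0.1, 0.2, 0.3, 0.6, 0.7, 0.9])

threshold = gmean_threshold(y_true, probs)
preds = (probs >= threshold).astype(int)  # scores at the threshold count as positive

f1 = metrics.f1_score(y_true, preds)
auc = metrics.roc_auc_score(y_true, probs)
auprc = metrics.average_precision_score(y_true, probs)
print(threshold, round(f1, 3), round(auc, 3), round(auprc, 3))
```

AUC and AUPRC are computed from the raw scores, so only F1 depends on the chosen threshold.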

Model performance is evaluated against two baseline models, [Logistic Regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) and [Random Forest](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html), using the implementations in the `sklearn.linear_model` and `sklearn.ensemble` modules:

In [None]:
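import time

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.multiclass import OneVsRestClassifier

def train_test_base(X_train, X_test, y_train, y_test, name):
    mtl = y_test.shape[1] > 1  # multi-label task
    if name == 'lr':
        print('Start training Logistic Regression:')
        # liblinear supports both the l1 and l2 penalties searched below
        model = LogisticRegression(solver='liblinear')
        param_grid = {
            'penalty': ['l1', 'l2']
        }
    else:
        print('Start training Random Forest:')
        model = RandomForestClassifier()
        param_grid = {
            'n_estimators': list(range(20, 40, 5)),
            'max_depth': [None, 20, 40, 60, 80, 100]
        }
    if mtl:
        model = OneVsRestClassifier(model)
        # Parameters of the wrapped model are addressed as estimator__<name>
        param_grid = {'estimator__' + k: v for k, v in param_grid.items()}
    else:
        y_train, y_test = y_train[:, 0], y_test[:, 0]
    t0 = time.time()
    # 5-fold cross-validated grid search, selecting by ROC AUC
    gridsearch = GridSearchCV(model, param_grid, scoring='roc_auc', cv=5)
    gridsearch.fit(X_train, y_train)
    model = gridsearch.best_estimator_
    t1 = time.time()
    print('Running time:', t1 - t0)
    probs = model.predict_proba(X_test)
    results = []
    if mtl:
        # Report the metrics per label, then their average
        for idx in range(y_test.shape[1]):
            metric = cal_metric(y_test[:, idx], probs[:, idx])
            print(idx + 1, metric)
            results.append(metric)
        print('Avg', np.mean(results, axis=0).tolist())
    else:
        # Column 1 holds the probability of the positive class
        metric = cal_metric(y_test, probs[:, 1])
        print(metric)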

Baselines are generated using the following command (the arguments are identical to those of `main.py`):

In [None]:
$ python baselines.py --model [model] --task [task] --inputs [inputs]

# Results

At this stage, we have only been able to experiment with a limited set of parameters because of issues discovered in the original source code.

We managed to run the Logistic Regression baseline and compare it with our trained Fusion-LSTM model on the *Mortality* task, using unstructured notes (U) and temporal data (T) as training features.

<p>

|                   | F1    | AUC   | AUPRC |
|-------------------|-------|-------|-------|
| Logistic Regression | 0.474 | 0.896 | 0.518 |
| Fusion-LSTM       | 0.485 | 0.899 | 0.573 |

<p>

Disclaimer: we cannot yet claim that these results are accurate. Further iterations and deeper analysis are required to confirm that everything is implemented correctly.

Screenshot of local run:

![training-screenshot.png](https://github.com/dkrprasetya/clinical-fusion/blob/dev/imgs/training_screenshot.png?raw=true)



# Discussion

## Discoveries

1. The original source code does not run out of the box. Some fixes were necessary to run the entire flow successfully (data processing -> train -> test).

2. Some parameters were missing from the code (e.g., the number of layers and the number of iterations), so we had to investigate which values were used in the paper.

3. Training currently takes about 10 minutes **per epoch**. We may need to find a way to speed it up.

4. In general, we feel the paper is likely reproducible, although more iterations are needed to confirm that everything is currently running correctly.


## Next step

1. Fix the remaining issues in the source code to ensure all functionality is working.

2. Run training experiments and decide whether to use all or a subset of the samples; most likely, we will need to train the model ahead of time and save it to Google Drive.

3. Generate baseline results with the same data used to train the model, either the full set or the subset.

4. Compare the performance of the models using different sets of inputs (Unstructured Data, Temporal Signals, and Structured Data).

5. [Optional] Clean-up code and contribute to the parent repo by creating a pull-request for the improvements.

6. [Optional] Consider introducing the fusion models into https://github.com/sunlabuiuc/PyHealth.


# References

1. Zhang, D., Yin, C., Zeng, J. et al. Combining structured and unstructured data for predictive models: a deep learning approach. BMC Med Inform Decis Mak 20, 280 (2020). https://doi.org/10.1186/s12911-020-01297-6

2. Le, Q., Mikolov, T. Distributed representations of sentences and documents. In: International Conference on Machine Learning, pp. 1188–1196 (2014).

