# After querying patient data, we move on to model practice

## Reference

* https://github.com/YerevaNN/mimic3-benchmarks

## Citations

> Johnson, Alistair EW, David J. Stone, Leo A. Celi, and Tom J. Pollard. "The MIMIC Code Repository: enabling reproducibility in critical care research." Journal of the American Medical Informatics Association (2017): ocx084.
* Github: https://github.com/MIT-LCP/mimic-code
* Zenodo: https://doi.org/10.5281/zenodo.821872
* Structure tested: multitask RNN architecture: https://arxiv.org/abs/1703.07771 

* Hrayr Harutyunyan, Hrant Khachatrian, David C. Kale, and Aram Galstyan. Multitask Learning and Benchmarking with Clinical Time Series Data. arXiv:1703.07771: https://arxiv.org/abs/1703.07771
<br>

* Mimic3 benchmarks for machine learning: https://github.com/YerevaNN/mimic3-benchmarks
    1. early triage and risk assessment, i.e., mortality prediction
    2. prediction of physiologic decompensation
    3. identification of high cost patients, i.e. length of stay forecasting
    4. characterization of complex, multi-system diseases, i.e., acute care phenotyping

## Tools used

In [8]:
import numpy as np
import pandas as pd
import sklearn                              # generic machine learning tool kit
import keras                                # for LSTM model to handle time series

Using TensorFlow backend.


In [23]:
from packages import directory_tree         # note this is a package I wrote to display directory trees
import os
CURRENT_DIR = os.path.abspath('')
CURRENT_DIR

'C:\\Users\\ericx\\Jupyter Projects\\MGH_medical_AI'

## Getting data prepared for ML

Note: The whole process might take a few hours...

1\. Clone the repo. The steps below create a `data` folder inside it to hold the train/test/validation sets for ML.

`git clone https://github.com/YerevaNN/mimic3-benchmarks/`<br>
`cd mimic3-benchmarks/`

2\. The following command takes the MIMIC-III CSVs as input and needs around 3 hours to run. It:
- generates one directory per `SUBJECT_ID` (33798 folders in total under `data/root/`)
- writes ICU stay information to `data/{SUBJECT_ID}/stays.csv`
- writes diagnoses to `data/{SUBJECT_ID}/diagnoses.csv`
- writes events to `data/{SUBJECT_ID}/events.csv`

`python -m mimic3benchmark.scripts.extract_subjects {PATH TO MIMIC-III CSVs} data/root/`
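Once the extraction has finished, the per-subject folders can be inspected from the notebook. The following is a minimal sketch, assuming the subject folders are the ones whose names are purely numeric; the helper names `list_subject_dirs` and `load_subject` are hypothetical and not part of the benchmark repo.

```python
from pathlib import Path

import pandas as pd


def list_subject_dirs(root):
    """Return subject directories under root (directories whose name is all digits)."""
    return sorted(
        (p for p in Path(root).iterdir() if p.is_dir() and p.name.isdigit()),
        key=lambda p: int(p.name),
    )


def load_subject(subject_dir):
    """Load whichever of the three per-subject CSVs exist into DataFrames."""
    tables = {}
    for name in ("stays", "diagnoses", "events"):
        csv_path = Path(subject_dir) / f"{name}.csv"
        if csv_path.exists():
            tables[name] = pd.read_csv(csv_path)
    return tables
```

For example, `len(list_subject_dirs("data/root/"))` gives the number of subject folders, and `load_subject(...)` on one of them returns a dict keyed by `stays`, `diagnoses`, and `events`.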

3\. The following command attempts to fix some issues (such as a missing ICU stay ID) and removes events that have missing information. About 80% of events remain after all suspicious rows are removed.

`python -m mimic3benchmark.scripts.validate_events data/root/`
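To get a feel for how much data such filtering removes, the sketch below drops event rows that lack required identifiers and reports the fraction kept. It is a simplified illustration, not the repository's validation logic: it only removes rows and makes no attempt to repair them. The column names `HADM_ID` and `ICUSTAY_ID` are assumptions about the layout of `events.csv`.

```python
import pandas as pd


def drop_incomplete_events(events, required=("HADM_ID", "ICUSTAY_ID")):
    """Drop rows with a missing value in any required column.

    Returns the filtered frame and the fraction of rows kept.
    The default column names are assumptions, not taken from the repo.
    """
    kept = events.dropna(subset=list(required))
    fraction = len(kept) / len(events) if len(events) else 0.0
    return kept, fraction
```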

4\. The next command breaks up per-subject data into separate episodes (pertaining to ICU stays). 
- Time series of events are stored in `{SUBJECT_ID}/episode{#}_timeseries.csv` (where # counts distinct episodes) 
- Episode-level information (patient age, gender, ethnicity, height, weight) and outcomes (mortality, length of stay, diagnoses) are stored in `{SUBJECT_ID}/episode{#}.csv`.
- This script requires two files, one that maps event `ITEMIDs` to clinical variables and another that defines valid ranges for clinical variables (for detecting outliers, etc.). 
- Outlier detection is disabled in the current version.

`python -m mimic3benchmark.scripts.extract_episodes_from_subjects data/root/`
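The episode files for one subject can be paired up by their number. This is a small sketch based only on the file naming pattern described above; `list_episodes` is a hypothetical helper name.

```python
import re
from pathlib import Path

# File name pattern for the per-episode time series described above.
EPISODE_TS = re.compile(r"^episode(\d+)_timeseries\.csv$")


def list_episodes(subject_dir):
    """Return (episode_number, summary_path, timeseries_path) tuples for one subject."""
    episodes = []
    for path in Path(subject_dir).iterdir():
        match = EPISODE_TS.match(path.name)
        if match:
            number = int(match.group(1))
            summary = path.with_name(f"episode{number}.csv")
            episodes.append((number, summary, path))
    return sorted(episodes, key=lambda episode: episode[0])
```

The summary path is constructed from the naming pattern and is not checked for existence, so callers may want to test `summary.exists()` before reading it.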

5\. The next command splits the whole dataset into training and testing sets. Note that the train/test split is the same for all tasks.

`python -m mimic3benchmark.scripts.split_train_and_test data/root/`
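Using one split for every task means a given subject never appears in the training set of one task and the test set of another. The sketch below shows one way to obtain such a fixed, subject-level assignment by hashing the subject ID. This is an illustration of the idea only and is not how the repository chooses its split; the function name and the `test_fraction` value are hypothetical.

```python
import hashlib


def assign_split(subject_id, test_fraction=0.15):
    """Deterministically map a subject ID to 'train' or 'test'.

    The same ID always lands in the same split, regardless of task.
    """
    digest = hashlib.md5(str(subject_id).encode("utf-8")).hexdigest()
    bucket = (int(digest, 16) % 10_000) / 10_000  # value in [0, 1)
    return "test" if bucket < test_fraction else "train"
```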

6\. The following commands generate task-specific datasets, which can later be used in models. These commands are independent: if you are going to work on only one benchmark task, you can run only the corresponding command.


1. early triage and risk assessment, i.e., mortality prediction <br>
`python -m mimic3benchmark.scripts.create_in_hospital_mortality data/root/ data/in-hospital-mortality/`
<br>
    
2. prediction of physiologic decompensation<br>
`python -m mimic3benchmark.scripts.create_decompensation data/root/ data/decompensation/`
<br>

3. identification of high cost patients, i.e. length of stay forecasting<br>
`python -m mimic3benchmark.scripts.create_length_of_stay data/root/ data/length-of-stay/`
<br>

4. characterization of complex, multi-system diseases, i.e., acute care phenotyping<br>
`python -m mimic3benchmark.scripts.create_phenotyping data/root/ data/phenotyping/`<br>
`python -m mimic3benchmark.scripts.create_multitask data/root/ data/multitask/`
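Because the commands are independent, they can also be launched from Python for a chosen subset of tasks. The following is a sketch that only rebuilds the command lines listed above; `TASKS`, `build_command`, and `run_tasks` are hypothetical names, and the code is assumed to run from the root of the cloned repo.

```python
import subprocess
import sys

# Output directory name -> script name, copied from the commands above.
TASKS = {
    "in-hospital-mortality": "create_in_hospital_mortality",
    "decompensation": "create_decompensation",
    "length-of-stay": "create_length_of_stay",
    "phenotyping": "create_phenotyping",
    "multitask": "create_multitask",
}


def build_command(task, root="data/root/", python=sys.executable):
    """Build the argument list for one task's dataset-creation script."""
    module = f"mimic3benchmark.scripts.{TASKS[task]}"
    return [python, "-m", module, root, f"data/{task}/"]


def run_tasks(tasks, root="data/root/"):
    """Run the selected tasks one after another, stopping at the first failure."""
    for task in tasks:
        subprocess.run(build_command(task, root), check=True)
```

For instance, `run_tasks(["in-hospital-mortality"])` would generate only the mortality dataset.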





7\. After the above commands are done, there will be a directory `data/{task}` for each created benchmark task. Each of these directories has two sub-directories: `train` and `test`. Each sub-directory contains a number of ICU stays and a file named `listfile.csv`, which lists all samples in that particular set.

Each row of `listfile.csv` has the following form: `icu_stay`, `period_length`, `label(s)`. A row specifies a sample whose input is the collection of ICU events of `icu_stay` that occurred in the first `period_length` hours of the stay and whose target is the given label(s).

In the in-hospital mortality prediction task, `period_length` is always 48 hours, so it is not listed in the corresponding listfiles.
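The listfile format can be turned into model inputs roughly as follows. This is a sketch under stated assumptions: the column names `stay` and `period_length` in the listfile and `Hours` in the time series files are guesses about the generated CSV headers and are exposed as parameters so they can be changed; `load_sample` and `iter_samples` are hypothetical helper names.

```python
from pathlib import Path

import pandas as pd


def load_sample(split_dir, row, stay_col="stay", period_col="period_length",
                hours_col="Hours"):
    """Load the events of one listfile row, truncated to its observation period."""
    series = pd.read_csv(Path(split_dir) / row[stay_col])
    # In-hospital mortality listfiles omit period_length; it is always 48 hours.
    period = row[period_col] if period_col in row.index else 48.0
    return series[series[hours_col] <= period]


def iter_samples(split_dir):
    """Yield (events, row) for every sample listed in listfile.csv."""
    listfile = pd.read_csv(Path(split_dir) / "listfile.csv")
    for _, row in listfile.iterrows():
        yield load_sample(split_dir, row), row
```

The remaining columns of each `row` hold the label(s) for that sample.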

8\. Some records appear to be missing from the mortality data.<br>
Note: the remaining tasks also have missing data; their logs are not all reproduced here.

`(PY36) $ python -m mimic3benchmark.scripts.create_in_hospital_mortality data/root/ data/in-hospital-mortality/
processed 5000 / 5070 patients
 3236
(length of stay is missing) 10128 episode1_timeseries.csv
processed 100 / 28728 patients
(length of stay is missing) 10168 episode1_timeseries.csv
processed 2400 / 28728 patients
(no events in ICU)  14219 episode1_timeseries.csv
processed 2600 / 28728 patients
(no events in ICU)  14469 episode1_timeseries.csv
processed 4700 / 28728 patients
(no events in ICU)  18350 episode1_timeseries.csv
processed 5200 / 28728 patients
(no events in ICU)  19097 episode1_timeseries.csv
processed 5600 / 28728 patients
(no events in ICU)  19872 episode1_timeseries.csv
processed 16100 / 28728 patients
(length of stay is missing) 499 episode1_timeseries.csv
processed 28700 / 28728 patients
 17903`
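The skipped episodes in a log like the one above can be tallied by reason. The sketch below assumes each skip line has the shape `(reason) SUBJECT_ID file_name`, as in the pasted output; `summarize_skips` is a hypothetical helper name.

```python
import re
from collections import Counter

# Matches lines such as "(no events in ICU)  14219 episode1_timeseries.csv".
SKIP_LINE = re.compile(r"^\((?P<reason>[^)]+)\)\s+(?P<subject>\d+)\s+(?P<file>\S+)$")


def summarize_skips(log_text):
    """Count skipped episodes per reason and collect the affected subject IDs."""
    counts = Counter()
    subjects = {}
    for line in log_text.splitlines():
        match = SKIP_LINE.match(line.strip())
        if match:
            reason = match.group("reason")
            counts[reason] += 1
            subjects.setdefault(reason, []).append(int(match.group("subject")))
    return counts, subjects
```

Applied to the log above, this would count three episodes with a missing length of stay and five with no events in the ICU.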


In [24]:
dirlist = directory_tree.get_dir_list('./')
print(dirlist)

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)
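
With one folder per subject, printing the full tree produces more output than the notebook will accept. A hedged alternative, using only the standard library rather than the `directory_tree` package, is to limit both the depth and the number of entries shown. `summarize_tree` and its default limits are hypothetical choices.

```python
import os


def summarize_tree(root, max_depth=2, max_entries=20):
    """Return up to max_entries directory paths, at most max_depth levels below root."""
    root = os.path.abspath(root)
    base_depth = root.rstrip(os.sep).count(os.sep)
    lines = []
    for current, dirs, _files in os.walk(root):
        depth = current.rstrip(os.sep).count(os.sep) - base_depth
        if depth >= max_depth:
            dirs[:] = []  # prune: do not descend below max_depth
        dirs.sort()  # visit sub-directories in a stable order
        lines.append(os.path.relpath(current, root))
        if len(lines) >= max_entries:
            break
    return lines
```

Calling `print("\n".join(summarize_tree("./")))` then shows only the top of the tree instead of every subject folder.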



## Baseline models