# Build cohort: MIMIC BASIC

In [1]:
import warnings
import os
warnings.filterwarnings('ignore')
os.getcwd()
os.chdir('/share/pi/boussard/eroosli_work/benchmarking')

## Step 1: Extract subjects

In [3]:
%run benchmarks/mimic/scripts/extract_subjects.py ../mimic-iii-clinical-database-1.4 data/mimic/basic

new in extract_subjects.py
1. NUMBER OF UNIQUE SAMPLES:                (1) ICUSTAY_ID: 61532  (2) HADM_ID: 57786  (3) SUBJECT_ID: 46476
2. AFTER REMOVING ICU TRANSFERS:            (1) ICUSTAY_ID: 55830  (2) HADM_ID: 52834  (3) SUBJECT_ID: 43277
3. AFTER REMOVING MULTIPLE STAYS PER ADMIT: (1) ICUSTAY_ID: 50186  (2) HADM_ID: 50186  (3) SUBJECT_ID: 41587
4. AFTER REMOVING PATIENTS AGE < 18:        (1) ICUSTAY_ID: 42276  (2) HADM_ID: 42276  (3) SUBJECT_ID: 33798
INCLUDED VARIABLES IN STAYS TABLE:
 ['SUBJECT_ID', 'HADM_ID', 'ICUSTAY_ID', 'LAST_CAREUNIT', 'DBSOURCE', 'INTIME', 'OUTTIME', 'LOS', 'ADMITTIME', 'DISCHTIME', 'DEATHTIME', 'ETHNICITY', 'DIAGNOSIS', 'GENDER', 'DOB', 'DOD', 'AGE', 'MORTALITY_INUNIT', 'MORTALITY', 'MORTALITY_INHOSPITAL']
5. WRITE STAYS TO CSV
6. GET DIAGNOSIS DATA AND CREATE PHENOTYPES FROM IT
8. PREPARE SUBDIRECTORIES AND INDIVIDUAL DATA FILES
BREAKING UP STAYS BY SUBJECT:
SUBJECT 33798 of 33798...DONE!
BREAKING UP DIAGNOSES BY SUBJECT:
SUBJECT 33798 of 33798...DONE!
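
The three exclusion steps logged above can be sketched on a toy table. This is only an illustration under stated assumptions, not the code in extract_subjects.py: the transfer criterion (FIRST_CAREUNIT differing from LAST_CAREUNIT) is assumed, and all values in the table are invented.

```python
import pandas as pd

# Toy stand-in for the stays table; every value here is invented for illustration.
stays = pd.DataFrame({
    "SUBJECT_ID":     [1, 1, 2, 3, 3, 4],
    "HADM_ID":        [10, 11, 20, 30, 30, 40],
    "ICUSTAY_ID":     [100, 101, 200, 300, 301, 400],
    "FIRST_CAREUNIT": ["MICU", "MICU", "CCU", "SICU", "SICU", "MICU"],
    "LAST_CAREUNIT":  ["MICU", "SICU", "CCU", "SICU", "SICU", "MICU"],
    "AGE":            [64, 64, 17, 71, 71, 45],
})

def report(label, df):
    """Print and return the number of unique stays, admissions, and subjects."""
    counts = {c: int(df[c].nunique()) for c in ("ICUSTAY_ID", "HADM_ID", "SUBJECT_ID")}
    print(label, counts)
    return counts

report("all stays", stays)

# Step 2: drop stays involving a transfer between care units (assumed criterion).
no_transfers = stays[stays["FIRST_CAREUNIT"] == stays["LAST_CAREUNIT"]]
report("after removing ICU transfers", no_transfers)

# Step 3: keep only admissions that have exactly one remaining ICU stay.
stays_per_admit = no_transfers.groupby("HADM_ID")["ICUSTAY_ID"].transform("count")
single_stay = no_transfers[stays_per_admit == 1]
report("after removing multiple stays per admit", single_stay)

# Step 4: keep adult patients only.
adults = single_stay[single_stay["AGE"] >= 18]
final_counts = report("after removing patients age < 18", adults)
```

After step 3 the stay and admission counts coincide by construction, which matches the pattern in the log (50186 for both ICUSTAY_ID and HADM_ID).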

## Step 2: Clinical events processing

### Step 2a: Filter events with no matches

In [4]:
%run benchmarks/mimic/scripts/validate_events.py data/mimic/basic

finished processing 33798 / 33798
STATISTICS OF EVENT VALIDATION:
n_events: 503462876
empty_hadm: 10312392
no_hadm_in_stay: 64157325
no_icustay: 31453293
recovered: 31453293
could_not_recover: 0
icustay_missing_in_stays: 14148515
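
A quick sanity check of these counters may help when reading them. The sketch below assumes that empty_hadm, no_hadm_in_stay, and icustay_missing_in_stays each count discarded events and do not overlap; that interpretation is an assumption and is not confirmed by the output itself.

```python
# Counters copied from the validation output above.
stats = {
    "n_events": 503462876,
    "empty_hadm": 10312392,
    "no_hadm_in_stay": 64157325,
    "no_icustay": 31453293,
    "recovered": 31453293,
    "could_not_recover": 0,
    "icustay_missing_in_stays": 14148515,
}

# Assumption: these three counters are disjoint sets of discarded events.
dropped_keys = ("empty_hadm", "no_hadm_in_stay", "icustay_missing_in_stays")
dropped = sum(stats[key] for key in dropped_keys)
remaining = stats["n_events"] - dropped

for key in dropped_keys:
    print(f"{key}: {stats[key] / stats['n_events']:.2%} of all events")
print("events remaining under this assumption:", remaining)

# Every event that lacked an ICU stay ID was recovered, according to the counters.
all_recovered = (
    stats["recovered"] == stats["no_icustay"] and stats["could_not_recover"] == 0
)
print("all events without ICUSTAY_ID recovered:", all_recovered)
```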


### Step 2b: Compile timeseries of events
Included events are restricted to 17 physiological variables. A separate timeseries is constructed for each episode (ICU stay) within a hospitalization.
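
As a rough illustration of what one timeseries per episode can look like, the sketch below pivots a long events table into one wide table per ICU stay, indexed by hours since ICU admission. The column names CHARTTIME, VARIABLE, and VALUE, the `intime` lookup, and all values are assumptions made for this example; the script itself may organise the data differently.

```python
import pandas as pd

# Toy long-format events; column names and values are assumed for illustration.
events = pd.DataFrame({
    "ICUSTAY_ID": [100, 100, 100, 200],
    "CHARTTIME": pd.to_datetime([
        "2100-01-01 01:00", "2100-01-01 01:00",
        "2100-01-01 03:00", "2100-02-01 00:30",
    ]),
    "VARIABLE": ["Heart Rate", "Glucose", "Heart Rate", "Heart Rate"],
    "VALUE": [88.0, 110.0, 92.0, 75.0],
})

# Hypothetical lookup of ICU admission time per stay.
intime = {
    100: pd.Timestamp("2100-01-01 00:00"),
    200: pd.Timestamp("2100-02-01 00:00"),
}

episodes = {}
for icustay_id, group in events.groupby("ICUSTAY_ID"):
    # Express each measurement time as hours since ICU admission.
    hours = (group["CHARTTIME"] - intime[icustay_id]).dt.total_seconds() / 3600.0
    # One row per timestamp, one column per variable; keep the last value on ties.
    wide = (
        group.assign(Hours=hours)
        .pivot_table(index="Hours", columns="VARIABLE", values="VALUE", aggfunc="last")
        .sort_index()
    )
    episodes[icustay_id] = wide

print(episodes[100])
```

Variables that were not measured at a given timestamp appear as NaN in the wide table, so imputation is left to later steps.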

In [5]:
%run benchmarks/mimic/scripts/extract_episodes_from_subjects.py data/mimic/basic

VARIABLES USED FOR EVENTS DATA: ['Capillary refill rate' 'Diastolic blood pressure'
 'Fraction inspired oxygen' 'Glascow coma scale eye opening'
 'Glascow coma scale motor response' 'Glascow coma scale total'
 'Glascow coma scale verbal response' 'Glucose' 'Heart Rate' 'Height'
 'Mean blood pressure' 'Oxygen saturation' 'pH' 'Respiratory rate'
 'Systolic blood pressure' 'Temperature' 'Weight']
SUBJECT 33802 of 33802...

In [8]:
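def is_subject_folder(x):
    # Subject directories are named by their numeric SUBJECT_ID; anything else (e.g. CSV files) is skipped
    return str.isdigit(x)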

In [16]:
from argparse import Namespace
args = Namespace()  # simple container for the two paths below
args.subjects_root_path_train = 'data/data_basic/train'
args.subjects_root_path_test = 'data/data_basic/test'

In [18]:
subdirectories_train = os.listdir(args.subjects_root_path_train)
subjects_train = list(filter(is_subject_folder, subdirectories_train))

subdirectories_test = os.listdir(args.subjects_root_path_test)
subjects_test = list(filter(is_subject_folder, subdirectories_test))

nb_patients_train = len(subjects_train)
nb_patients_test = len(subjects_test)
print('number subjects in train:', nb_patients_train)
print('number subjects in test:', nb_patients_test)
print('total number subjects:', nb_patients_test+nb_patients_train)

number subjects in train: 28728
number subjects in test: 5070
total number subjects: 33798


Note: the subject count printed by the initial script (33802) is too high by four, because the script also counts four CSV files with overall statistics that are not patient subdirectories. The correct total is 33798.
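
The difference between the two ways of counting can be reproduced in a temporary directory. This is a minimal sketch: the helper `count_subject_dirs` and the CSV file names are hypothetical and chosen only for the demonstration.

```python
import os
import tempfile

def count_subject_dirs(root):
    """Count entries under root that are directories with an all-digit name."""
    return sum(
        1
        for name in os.listdir(root)
        if name.isdigit() and os.path.isdir(os.path.join(root, name))
    )

with tempfile.TemporaryDirectory() as root:
    # Three fake subject folders named by numeric ID.
    for subject_id in ("3", "17", "42"):
        os.mkdir(os.path.join(root, subject_id))
    # Two fake summary files (hypothetical names) stored next to the folders.
    for filename in ("summary_a.csv", "summary_b.csv"):
        with open(os.path.join(root, filename), "w"):
            pass

    naive_count = len(os.listdir(root))       # counts folders and CSV files
    subject_count = count_subject_dirs(root)  # counts subject folders only

print("naive count:", naive_count)
print("subject folders:", subject_count)
```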

## Step 3: Split into train and test sets

In [6]:
%run benchmarks/mimic/scripts/split_train_and_test.py data/mimic/basic
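
The counts reported above (28728 train, 5070 test) correspond to roughly an 85/15 split at the subject level. The sketch below shows one way such a split could be drawn so that no subject appears in both partitions. It is not necessarily how split_train_and_test.py works, which may rely on a fixed list of test subjects; the function name, the seed, and the 0.15 fraction are assumptions.

```python
import random

def split_subjects(subject_ids, test_fraction=0.15, seed=0):
    """Randomly assign whole subjects to train or test (hypothetical helper)."""
    ids = sorted(subject_ids)          # sort first so the result is reproducible
    rng = random.Random(seed)
    rng.shuffle(ids)
    n_test = round(len(ids) * test_fraction)
    test_ids = set(ids[:n_test])
    train_ids = set(ids[n_test:])
    return train_ids, test_ids

# Toy example with 100 made-up subject IDs.
train_ids, test_ids = split_subjects(range(1, 101))
print(len(train_ids), "train subjects,", len(test_ids), "test subjects")
```

Splitting by subject rather than by ICU stay keeps all stays of one patient in the same partition, which avoids leakage between train and test.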

## Step 4: Create specific benchmark datasets

### Step 4a: In-hospital mortality

#### Prepare task-specific train and test set

In [1]:
%run benchmarks/mimic/scripts/create_in_hospital_mortality.py data/mimic/basic data/mimic/basic/mortality

PARTITION FOR: test
SUBJECT 5070 of 5070...
TOTAL NUMBER ICU STAYS: 3236
TOTAL NUMBER PATIENTS: 2763
GENDER DISTRIBUTION: {1: 1229, 2: 1534, 3: 0, 0: 0}
RATIO FEMALES: 0.4448
RACE DISTRIBUTION: {0: 419, 1: 54, 2: 226, 3: 95, 4: 1969}
AGE DISTRIBUTION: {18: 7, 19: 9, 20: 8, 21: 11, 22: 12, 23: 7, 24: 8, 25: 12, 26: 10, 27: 10, 28: 16, 29: 10, 30: 12, 31: 7, 32: 10, 33: 7, 34: 20, 35: 11, 36: 18, 37: 22, 38: 14, 39: 19, 40: 21, 41: 24, 42: 18, 43: 30, 44: 29, 45: 27, 46: 30, 47: 40, 48: 27, 49: 33, 50: 30, 51: 43, 52: 40, 53: 44, 54: 40, 55: 47, 56: 50, 57: 46, 58: 57, 59: 45, 60: 51, 61: 55, 62: 52, 63: 45, 64: 52, 65: 59, 66: 60, 67: 46, 68: 45, 69: 59, 70: 47, 71: 69, 72: 65, 73: 58, 74: 58, 75: 52, 76: 69, 77: 82, 78: 53, 79: 80, 80: 61, 81: 62, 82: 57, 83: 47, 84: 55, 85: 58, 86: 47, 87: 41, 88: 33, 89: 4, 90: 160}
PARTITION FOR: train
SUBJECT 28728 of 28728...
TOTAL NUMBER ICU STAYS: 17903
TOTAL NUMBER PATIENTS: 15331
GENDER DISTRIBUTION: {1: 6861, 2: 8470, 3: 0, 0: 0}
RATIO FEMALE
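
The summary figures for the test partition can be cross-checked against each other. The sketch below assumes that gender code 1 denotes female patients, which is inferred only from the fact that 1229 / 2763 reproduces the reported ratio of 0.4448.

```python
# Distributions copied from the test-partition output above.
gender_distribution = {1: 1229, 2: 1534, 3: 0, 0: 0}
race_distribution = {0: 419, 1: 54, 2: 226, 3: 95, 4: 1969}
reported_patients = 2763

total_by_gender = sum(gender_distribution.values())
total_by_race = sum(race_distribution.values())

# Assumption: code 1 is the female category.
ratio_code_1 = gender_distribution[1] / total_by_gender

print("patients by gender:", total_by_gender)
print("patients by race:", total_by_race)
print("share of gender code 1:", round(ratio_code_1, 4))
print("totals consistent:", total_by_gender == total_by_race == reported_patients)
```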

#### Split into train and validation set

In [3]:
%run benchmarks/mimic/split_train_val.py data/mimic/basic/mortality