# MIMIC-III Preparation

This tutorial walks through downloading the publicly available MIMIC-III dataset (assuming access has been granted), selecting a cohort, and converting the data to a format compatible with our software.

<a name="outline"></a>

## Outline

- [1](#sec1) Downloading MIMIC-III
- [2](#sec2) Cohort Selection
- [3](#sec3) Export


In [1]:
import pandas as pd
import os
from collections import defaultdict

<a name="sec1"></a>

## 1 Downloading MIMIC-III  [^](#outline)

We assume that you have been granted access to the [MIMIC-III dataset](https://physionet.org/content/mimiciii/1.4/), a process that often takes two weeks from the access request to the approval.

On the dataset page [https://physionet.org/content/mimiciii/1.4/](https://physionet.org/content/mimiciii/1.4/), use the file table at the end of the page to download the following files:

1. [`ADMISSIONS.csv.gz`](https://physionet.org/files/mimiciii/1.4/ADMISSIONS.csv.gz?download)
2. [`DIAGNOSES_ICD.csv.gz`](https://physionet.org/files/mimiciii/1.4/DIAGNOSES_ICD.csv.gz?download)
3. [`PROCEDURES_ICD.csv.gz`](https://physionet.org/files/mimiciii/1.4/PROCEDURES_ICD.csv.gz?download)
4. [`PATIENTS.csv.gz`](https://physionet.org/files/mimiciii/1.4/PATIENTS.csv.gz?download)


Copy these four files into a directory of your choice and assign that directory path to the variable `mimic3_dir`.
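
Before loading anything, it can help to confirm that all four files are in place. The following is a minimal sketch and not part of the original notebook; the helper name `missing_mimic3_files` is hypothetical, and the file names are the ones listed above.

```python
from pathlib import Path

# File names taken from the download list above.
REQUIRED_FILES = (
    'ADMISSIONS.csv.gz',
    'DIAGNOSES_ICD.csv.gz',
    'PROCEDURES_ICD.csv.gz',
    'PATIENTS.csv.gz',
)


def missing_mimic3_files(directory):
    """Return the required file names that are absent from `directory`."""
    directory = Path(directory)
    return [name for name in REQUIRED_FILES if not (directory / name).is_file()]
```

Once `mimic3_dir` is defined, `missing_mimic3_files(mimic3_dir)` should return an empty list if the download step is complete.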


In [2]:
# HOME and DATA_STORE are arbitrary, change as appropriate.
HOME = os.environ.get('HOME')
DATA_STORE = f'{HOME}/GP/ehr-data'


mimic3_dir = f'{DATA_STORE}/mimic3-cohort'
# Load the admissions and patients files
admissions_df = pd.read_csv(f'{mimic3_dir}/ADMISSIONS.csv.gz')
static_df = pd.read_csv(f'{mimic3_dir}/PATIENTS.csv.gz')
# Count of all subjects in MIMIC-III
print(f'#Subjects: {admissions_df.SUBJECT_ID.nunique()}')

admissions_df

#Subjects: 46520


|  | ROW_ID | SUBJECT_ID | HADM_ID | ADMITTIME | DISCHTIME | DEATHTIME | ADMISSION_TYPE | ADMISSION_LOCATION | DISCHARGE_LOCATION | INSURANCE | LANGUAGE | RELIGION | MARITAL_STATUS | ETHNICITY | EDREGTIME | EDOUTTIME | DIAGNOSIS | HOSPITAL_EXPIRE_FLAG | HAS_CHARTEVENTS_DATA |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 21 | 22 | 165315 | 2196-04-09 12:26:00 | 2196-04-10 15:54:00 |  | EMERGENCY | EMERGENCY ROOM ADMIT | DISC-TRAN CANCER/CHLDRN H | Private |  | UNOBTAINABLE | MARRIED | WHITE | 2196-04-09 10:06:00 | 2196-04-09 13:24:00 | BENZODIAZEPINE OVERDOSE | 0 | 1 |
| 1 | 22 | 23 | 152223 | 2153-09-03 07:15:00 | 2153-09-08 19:10:00 |  | ELECTIVE | PHYS REFERRAL/NORMAL DELI | HOME HEALTH CARE | Medicare |  | CATHOLIC | MARRIED | WHITE |  |  | CORONARY ARTERY DISEASE\CORONARY ARTERY BYPASS... | 0 | 1 |
| 2 | 23 | 23 | 124321 | 2157-10-18 19:34:00 | 2157-10-25 14:00:00 |  | EMERGENCY | TRANSFER FROM HOSP/EXTRAM | HOME HEALTH CARE | Medicare | ENGL | CATHOLIC | MARRIED | WHITE |  |  | BRAIN MASS | 0 | 1 |
| 3 | 24 | 24 | 161859 | 2139-06-06 16:14:00 | 2139-06-09 12:48:00 |  | EMERGENCY | TRANSFER FROM HOSP/EXTRAM | HOME | Private |  | PROTESTANT QUAKER | SINGLE | WHITE |  |  | INTERIOR MYOCARDIAL INFARCTION | 0 | 1 |
| 4 | 25 | 25 | 129635 | 2160-11-02 02:06:00 | 2160-11-05 14:55:00 |  | EMERGENCY | EMERGENCY ROOM ADMIT | HOME | Private |  | UNOBTAINABLE | MARRIED | WHITE | 2160-11-02 01:01:00 | 2160-11-02 04:27:00 | ACUTE CORONARY SYNDROME | 0 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 58971 | 58594 | 98800 | 191113 | 2131-03-30 21:13:00 | 2131-04-02 15:02:00 |  | EMERGENCY | CLINIC REFERRAL/PREMATURE | HOME | Private | ENGL | NOT SPECIFIED | SINGLE | WHITE | 2131-03-30 19:44:00 | 2131-03-30 22:41:00 | TRAUMA | 0 | 1 |
| 58972 | 58595 | 98802 | 101071 | 2151-03-05 20:00:00 | 2151-03-06 09:10:00 | 2151-03-06 09:10:00 | EMERGENCY | CLINIC REFERRAL/PREMATURE | DEAD/EXPIRED | Medicare | ENGL | CATHOLIC | WIDOWED | WHITE | 2151-03-05 17:23:00 | 2151-03-05 21:06:00 | SAH | 1 | 1 |
| 58973 | 58596 | 98805 | 122631 | 2200-09-12 07:15:00 | 2200-09-20 12:08:00 |  | ELECTIVE | PHYS REFERRAL/NORMAL DELI | HOME HEALTH CARE | Private | ENGL | NOT SPECIFIED | MARRIED | WHITE |  |  | RENAL CANCER/SDA | 0 | 1 |
| 58974 | 58597 | 98813 | 170407 | 2128-11-11 02:29:00 | 2128-12-22 13:11:00 |  | EMERGENCY | EMERGENCY ROOM ADMIT | SNF | Private | ENGL | CATHOLIC | MARRIED | WHITE | 2128-11-10 23:48:00 | 2128-11-11 03:16:00 | S/P FALL | 0 | 0 |


In [3]:
# Load Diagnosis file
diag_df = pd.read_csv(f'{mimic3_dir}/DIAGNOSES_ICD.csv.gz', dtype = {'ICD9_CODE': str})


diag_df

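|  | ROW_ID | SUBJECT_ID | HADM_ID | SEQ_NUM | ICD9_CODE |
| --- | --- | --- | --- | --- | --- |
| 0 | 1297 | 109 | 172335 | 1.0 | 40301 |
| 1 | 1298 | 109 | 172335 | 2.0 | 486 |
| 2 | 1299 | 109 | 172335 | 3.0 | 58281 |
| 3 | 1300 | 109 | 172335 | 4.0 | 5855 |
| 4 | 1301 | 109 | 172335 | 5.0 | 4254 |
| ... | ... | ... | ... | ... | ... |
| 651042 | 639798 | 97503 | 188195 | 2.0 | 20280 |
| 651043 | 639799 | 97503 | 188195 | 3.0 | V5869 |
| 651044 | 639800 | 97503 | 188195 | 4.0 | V1279 |
| 651045 | 639801 | 97503 | 188195 | 5.0 | 5275 |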


In [4]:
# Load Procedures file
proc_df = pd.read_csv(f'{mimic3_dir}/PROCEDURES_ICD.csv.gz', dtype = {'ICD9_CODE': str})


proc_df

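|  | ROW_ID | SUBJECT_ID | HADM_ID | SEQ_NUM | ICD9_CODE |
| --- | --- | --- | --- | --- | --- |
| 0 | 944 | 62641 | 154460 | 3 | 3404 |
| 1 | 945 | 2592 | 130856 | 1 | 9671 |
| 2 | 946 | 2592 | 130856 | 2 | 3893 |
| 3 | 947 | 55357 | 119355 | 1 | 9672 |
| 4 | 948 | 55357 | 119355 | 2 | 0331 |
| ... | ... | ... | ... | ... | ... |
| 240090 | 228330 | 67415 | 150871 | 5 | 3736 |
| 240091 | 228331 | 67415 | 150871 | 6 | 3893 |
| 240092 | 228332 | 67415 | 150871 | 7 | 8872 |
| 240093 | 228333 | 67415 | 150871 | 8 | 3893 |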
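
Both code files are read with `dtype = {'ICD9_CODE': str}`. ICD-9 codes are identifiers rather than numbers: some start with a zero (such as `0331` in the procedures output) and some carry a letter prefix (such as `V5869` in the diagnoses output). The sketch below uses a small made-up CSV, not MIMIC-III data, to show what can happen without the dtype hint.

```python
import io

import pandas as pd

# Toy CSV: one code with a leading zero and one with a letter prefix.
csv_text = 'HADM_ID,ICD9_CODE\n1,0331\n2,3893\n3,V5869\n'
as_str = pd.read_csv(io.StringIO(csv_text), dtype={'ICD9_CODE': str})

# Without the dtype hint, a column of purely numeric codes is parsed as integers,
# which silently drops the leading zero.
numeric_only = 'HADM_ID,ICD9_CODE\n1,0331\n2,3893\n'
as_inferred = pd.read_csv(io.StringIO(numeric_only))

print(as_str.ICD9_CODE.tolist())       # ['0331', '3893', 'V5869']
print(as_inferred.ICD9_CODE.tolist())  # [331, 3893]
```

The same hint is worth repeating whenever an exported code file is read back from disk.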


<a name="sec2"></a>

## 2 Cohort Selection  [^](#outline)

### 2.A Patient Selection: Minimum of Two Visits

Patients with only one admission (i.e. a single timestamp for the diagnosis codes) are not useful for training, validation, or testing, so we keep only patients with at least two admissions.

In [5]:
patient_admissions = defaultdict(set)

for row in admissions_df.itertuples():
    patient_admissions[row.SUBJECT_ID].add(row.HADM_ID)
    
patients_admissions_df = pd.DataFrame({
    'patient': patient_admissions.keys(), 
    'n_admissions': map(len, patient_admissions.values())
})


selected_patients_A = set(patients_admissions_df[patients_admissions_df.n_admissions > 1].patient.tolist())

len(selected_patients_A)

7537
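
The same selection can also be expressed with a pandas `groupby` instead of an explicit loop. This is a sketch of an alternative formulation on a made-up toy table, not the code that produced the count above.

```python
import pandas as pd

# Toy admissions table (made-up IDs): subjects 1 and 3 have several admissions, subject 2 has one.
toy_adm = pd.DataFrame({
    'SUBJECT_ID': [1, 1, 2, 3, 3, 3],
    'HADM_ID': [10, 11, 20, 30, 31, 32],
})

# Number of distinct admissions per subject.
n_admissions = toy_adm.groupby('SUBJECT_ID')['HADM_ID'].nunique()

# Keep subjects with more than one admission.
selected = sorted(n_admissions[n_admissions > 1].index.tolist())
print(selected)  # [1, 3]
```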

Apply the filter to all tables:

In [6]:
admissions_A_df = admissions_df[admissions_df.SUBJECT_ID.isin(selected_patients_A)].reset_index(drop=True)
static_A_df = static_df[static_df.SUBJECT_ID.isin(selected_patients_A)].reset_index(drop=True)
diag_A_df =  diag_df[diag_df.HADM_ID.isin(admissions_A_df.HADM_ID)].reset_index(drop=True)
diag_A_df = diag_A_df[diag_A_df.ICD9_CODE.notnull()].reset_index(drop=True)

proc_A_df =  proc_df[proc_df.HADM_ID.isin(admissions_A_df.HADM_ID)].reset_index(drop=True)
proc_A_df = proc_A_df[proc_A_df.ICD9_CODE.notnull()].reset_index(drop=True)


admissions_A_df.SUBJECT_ID.nunique(), static_A_df.SUBJECT_ID.nunique(), len(admissions_A_df), len(diag_A_df)

(7537, 7537, 19993, 260282)

### 2.B Patient Selection: Maximum Hospital Stay is Two Weeks

In [7]:
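# Truncate both timestamps to midnight so that the difference counts calendar days.
# (`infer_datetime_format` is deprecated since pandas 2.0 and has no effect there, so it is omitted.)
admit = pd.to_datetime(admissions_A_df['ADMITTIME']).dt.normalize()
disch = pd.to_datetime(admissions_A_df['DISCHTIME']).dt.normalize()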
admissions_A_df['days'] = (disch - admit).dt.days
admissions_A_df.head()

|  | ROW_ID | SUBJECT_ID | HADM_ID | ADMITTIME | DISCHTIME | DEATHTIME | ADMISSION_TYPE | ADMISSION_LOCATION | DISCHARGE_LOCATION | INSURANCE | LANGUAGE | RELIGION | MARITAL_STATUS | ETHNICITY | EDREGTIME | EDOUTTIME | DIAGNOSIS | HOSPITAL_EXPIRE_FLAG | HAS_CHARTEVENTS_DATA | days |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 22 | 23 | 152223 | 2153-09-03 07:15:00 | 2153-09-08 19:10:00 |  | ELECTIVE | PHYS REFERRAL/NORMAL DELI | HOME HEALTH CARE | Medicare |  | CATHOLIC | MARRIED | WHITE |  |  | CORONARY ARTERY DISEASE\CORONARY ARTERY BYPASS... | 0 | 1 | 5 |
| 1 | 23 | 23 | 124321 | 2157-10-18 19:34:00 | 2157-10-25 14:00:00 |  | EMERGENCY | TRANSFER FROM HOSP/EXTRAM | HOME HEALTH CARE | Medicare | ENGL | CATHOLIC | MARRIED | WHITE |  |  | BRAIN MASS | 0 | 1 | 7 |
| 2 | 33 | 34 | 115799 | 2186-07-18 16:46:00 | 2186-07-20 16:00:00 |  | EMERGENCY | TRANSFER FROM HOSP/EXTRAM | HOME | Medicare | ENGL | CATHOLIC | MARRIED | WHITE |  |  | CHEST PAIN\CATH | 0 | 1 | 2 |
| 3 | 34 | 34 | 144319 | 2191-02-23 05:23:00 | 2191-02-25 20:20:00 |  | EMERGENCY | CLINIC REFERRAL/PREMATURE | HOME HEALTH CARE | Medicare | ENGL | CATHOLIC | MARRIED | WHITE | 2191-02-23 04:23:00 | 2191-02-23 07:25:00 | BRADYCARDIA | 0 | 1 | 2 |
| 4 | 36 | 36 | 182104 | 2131-04-30 07:15:00 | 2131-05-08 14:00:00 |  | EMERGENCY | CLINIC REFERRAL/PREMATURE | HOME HEALTH CARE | Medicare | ENGL | NOT SPECIFIED | MARRIED | WHITE |  |  | CORONARY ARTERY DISEASE\CORONARY ARTERY BYPASS... | 0 | 1 | 8 |
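
Note that `days` counts calendar days rather than elapsed 24-hour periods, because both timestamps are truncated to midnight with `.dt.normalize()` before the subtraction. A minimal sketch with made-up timestamps illustrates the difference:

```python
import pandas as pd

# Made-up timestamps: a two-hour stay that crosses midnight.
admit = pd.Series(pd.to_datetime(['2150-01-01 23:00:00']))
disch = pd.Series(pd.to_datetime(['2150-01-02 01:00:00']))

# Elapsed time, rounded down to whole days.
elapsed_days = (disch - admit).dt.days

# Calendar days, as computed in the cell above.
calendar_days = (disch.dt.normalize() - admit.dt.normalize()).dt.days

print(elapsed_days.tolist())   # [0]
print(calendar_days.tolist())  # [1]
```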


In [8]:
longest_admission = {}
for subject_id, subject_df in admissions_A_df.groupby('SUBJECT_ID'):
    longest_admission[subject_id] = subject_df.days.max()
    
admissions_A_df['max_days'] = admissions_A_df.SUBJECT_ID.map(longest_admission)
selected_patients_B = set(admissions_A_df[admissions_A_df.max_days <= 14].SUBJECT_ID)
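
The cell above excludes a patient entirely if any one of their admissions is longer than 14 days. As a sketch of an equivalent formulation (on a made-up toy table, not the notebook's data), the per-patient maximum can be computed with `groupby(...).transform('max')`:

```python
import pandas as pd

# Toy table: subject 1 has one 20-day stay, subject 2 never exceeds 14 days.
toy = pd.DataFrame({
    'SUBJECT_ID': [1, 1, 2, 2],
    'days': [3, 20, 5, 14],
})

# Broadcast each subject's longest stay to all of that subject's rows.
toy['max_days'] = toy.groupby('SUBJECT_ID')['days'].transform('max')

# Subject 1 is dropped even though one of its stays was only 3 days.
selected = sorted(set(toy.loc[toy.max_days <= 14, 'SUBJECT_ID'].tolist()))
print(selected)  # [2]
```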

Apply the filter to all tables:

In [9]:
admissions_B_df = admissions_A_df[admissions_A_df.SUBJECT_ID.isin(selected_patients_B)].reset_index(drop=True)
static_B_df = static_A_df[static_A_df.SUBJECT_ID.isin(selected_patients_B)].reset_index(drop=True)

diag_B_df =  diag_A_df[diag_A_df.HADM_ID.isin(admissions_B_df.HADM_ID)].reset_index(drop=True)
diag_B_df = diag_B_df[diag_B_df.ICD9_CODE.notnull()].reset_index(drop=True)

proc_B_df =  proc_A_df[proc_A_df.HADM_ID.isin(admissions_B_df.HADM_ID)].reset_index(drop=True)
proc_B_df = proc_B_df[proc_B_df.ICD9_CODE.notnull()].reset_index(drop=True)

admissions_B_df.SUBJECT_ID.nunique(), static_B_df.SUBJECT_ID.nunique(), len(admissions_B_df), len(diag_B_df)

(4434, 4434, 10954, 127227)
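
After two rounds of filtering it may be worth checking that the tables still agree with each other. The helper below is a hypothetical sketch, not part of the original notebook; it encodes the properties the counts above suggest (matching subject sets, codes only for retained admissions, at least two admissions per retained subject).

```python
import pandas as pd


def check_cohort_consistency(adm_df, static_df, code_df):
    """Raise AssertionError if the filtered tables disagree with each other."""
    # Every subject with admissions has a static record, and vice versa.
    assert set(adm_df.SUBJECT_ID) == set(static_df.SUBJECT_ID)
    # Every coded event refers to a retained admission.
    assert set(code_df.HADM_ID) <= set(adm_df.HADM_ID)
    # Every retained subject still has at least two admissions.
    assert adm_df.groupby('SUBJECT_ID')['HADM_ID'].nunique().min() >= 2
```

For example, `check_cohort_consistency(admissions_B_df, static_B_df, diag_B_df)` and the same call with `proc_B_df` should both pass silently.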

<a name="sec3"></a>

## 3 Export  [^](#outline)

Select the relevant columns from `admissions_B_df`, `diag_B_df`, and `proc_B_df`, then write them to disk together with `static_B_df`.

In [10]:
admissions_selected_df = admissions_B_df[['SUBJECT_ID', 'HADM_ID', 'ADMITTIME', 'DISCHTIME']].copy()

In [11]:
# admissions_selected_df is an explicit copy, so these assignments do not trigger SettingWithCopyWarning.
admissions_selected_df['ADMITTIME'] = pd.to_datetime(admissions_selected_df['ADMITTIME']).dt.normalize()
admissions_selected_df['DISCHTIME'] = pd.to_datetime(admissions_selected_df['DISCHTIME']).dt.normalize()


In [12]:
diag_selected_df = diag_B_df[['SUBJECT_ID', 'HADM_ID', 'ICD9_CODE']]
diag_selected_df = diag_selected_df[diag_selected_df.ICD9_CODE.notnull()]

In [13]:
proc_selected_df = proc_B_df[['SUBJECT_ID', 'HADM_ID', 'ICD9_CODE']]
proc_selected_df = proc_selected_df[proc_selected_df.ICD9_CODE.notnull()]

In [14]:
admissions_selected_df.to_csv(f'{mimic3_dir}/adm_df.csv.gz', compression='gzip', index=False)
static_B_df.to_csv(f'{mimic3_dir}/static_df.csv.gz', compression='gzip', index=False)
diag_selected_df.to_csv(f'{mimic3_dir}/dx_df.csv.gz', compression='gzip', index=False)
proc_selected_df.to_csv(f'{mimic3_dir}/pr_df.csv.gz', compression='gzip', index=False)

### Generate a Synthetic Sample

Generate a small sample of patients with the ICD-9 codes shuffled across events, in case you want to share a public sample for testing.

In [15]:
adm_syn_df = admissions_selected_df.copy()
stat_syn_df = static_B_df.copy()
diag_syn_df = diag_selected_df.copy()
proc_syn_df = proc_selected_df.copy()

In [16]:
subjects = set(adm_syn_df.SUBJECT_ID)

In [17]:
len(subjects)

4434

In [18]:
import random
import numpy as np
random.seed(42)
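# random.sample needs a sequence (sets are rejected since Python 3.11);
# sorting also gives a deterministic order for the seeded draw.
syn_subjects = random.sample(sorted(subjects), 150)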

In [19]:
adm_syn_df = adm_syn_df[adm_syn_df.SUBJECT_ID.isin(syn_subjects)]
stat_syn_df = stat_syn_df[stat_syn_df.SUBJECT_ID.isin(syn_subjects)]

diag_syn_df = diag_syn_df[diag_syn_df.SUBJECT_ID.isin(syn_subjects)]
proc_syn_df = proc_syn_df[proc_syn_df.SUBJECT_ID.isin(syn_subjects)]

In [20]:
diag_syn_df.head()

|  | SUBJECT_ID | HADM_ID | ICD9_CODE |
| --- | --- | --- | --- |
| 335 | 321 | 192097 | 73382 |
| 336 | 321 | 192097 | 42731 |
| 337 | 321 | 192097 | 29281 |
| 338 | 321 | 192097 | 2851 |
| 339 | 321 | 192097 | 25000 |


In [21]:
diag_syn_df['ICD9_CODE'] =  np.random.permutation(list(diag_syn_df['ICD9_CODE']))
proc_syn_df['ICD9_CODE'] =  np.random.permutation(list(proc_syn_df['ICD9_CODE']))

In [22]:
diag_syn_df.head()

|  | SUBJECT_ID | HADM_ID | ICD9_CODE |
| --- | --- | --- | --- |
| 335 | 321 | 192097 | 4168 |
| 336 | 321 | 192097 | 42731 |
| 337 | 321 | 192097 | 25000 |
| 338 | 321 | 192097 | V667 |
| 339 | 321 | 192097 | E8798 |
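
The shuffle above draws from NumPy's global random state. `random.seed(42)` only seeds Python's `random` module, so the exact permutation can differ between runs. If a reproducible sample matters, one option is a seeded NumPy generator. The sketch below uses a made-up toy table and an arbitrary seed, and is an assumption rather than the notebook's own method.

```python
import numpy as np
import pandas as pd

# Toy diagnosis table with made-up admission IDs.
toy_diag = pd.DataFrame({
    'HADM_ID': [10, 10, 11, 12],
    'ICD9_CODE': ['40301', '486', 'V5869', '5855'],
})

rng = np.random.default_rng(42)  # seed chosen arbitrarily

shuffled = toy_diag.copy()
shuffled['ICD9_CODE'] = rng.permutation(toy_diag['ICD9_CODE'].to_numpy())

# The overall set of codes and the admissions are unchanged;
# only the assignment of codes to admissions is permuted.
assert sorted(shuffled.ICD9_CODE) == sorted(toy_diag.ICD9_CODE)
assert shuffled.HADM_ID.tolist() == toy_diag.HADM_ID.tolist()
```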


In [23]:
adm_syn_df.to_csv(f'{mimic3_dir}/syn_adm_df.csv.gz', compression='gzip', index=False)
stat_syn_df.to_csv(f'{mimic3_dir}/syn_static_df.csv.gz', compression='gzip', index=False)

diag_syn_df.to_csv(f'{mimic3_dir}/syn_dx_df.csv.gz', compression='gzip', index=False)
proc_syn_df.to_csv(f'{mimic3_dir}/syn_pr_df.csv.gz', compression='gzip', index=False)



In [24]:
stat_syn_df

|  | ROW_ID | SUBJECT_ID | GENDER | DOB | DOD | DOD_HOSP | DOD_SSN | EXPIRE_FLAG |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2 | 642 | 679 | F | 2059-11-04 00:00:00 | 2145-03-19 00:00:00 | 2145-03-19 00:00:00 | 2145-03-19 00:00:00 | 1 |
| 50 | 301 | 321 | F | 2113-12-13 00:00:00 |  |  |  | 0 |
| 64 | 476 | 505 | M | 2097-11-23 00:00:00 | 2154-08-29 00:00:00 | 2154-08-29 00:00:00 | 2154-08-29 00:00:00 | 1 |
| 65 | 481 | 510 | F | 2099-05-17 00:00:00 |  |  |  | 0 |
| 111 | 2292 | 2420 | F | 2136-05-20 00:00:00 |  |  |  | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4307 | 38738 | 69912 | F | 2040-04-05 00:00:00 | 2120-03-27 00:00:00 | 2120-03-27 00:00:00 |  | 1 |
| 4308 | 38753 | 69995 | F | 2143-01-30 00:00:00 | 2194-09-24 00:00:00 | 2194-09-24 00:00:00 | 2194-09-24 00:00:00 | 1 |
| 4312 | 37813 | 66499 | M | 2041-06-06 00:00:00 |  |  |  | 0 |
| 4347 | 40549 | 77037 | F | 2164-11-10 00:00:00 |  |  |  | 0 |
