# MIMIC-IV v2.0 Preparation

This tutorial provides the steps of downloading the publicly available MIMIC-IV dataset (assuming granted access), followed by cohort selection, and transformation to a compatible format with out software.

<a name="outline"></a>

## Outline

- [1](#sec1) Downloading MIMIC-IV
- [2](#sec2) Cohort Selection
- [3](#sec3) Export


In [1]:
import pandas as pd
from collections import defaultdict
import os

<a name="sec1"></a>

## 1 Downloading MIMIC-IV  [^](#outline)

We assume a granted access to [MIMIC-IV dataset](https://physionet.org/content/mimiciv/2.0/), a process that often takes two weeks from the access request to the approval. From our experience, a granted access to MIMIC-III automatically grants access to MIMIC-IV.

From this page [https://physionet.org/content/mimiciv/1.0/](https://physionet.org/content/mimiciv/1.0/), consult the online file browser at the end of the page to download the following files:

1. [`hosp/admissions.csv.gz`](https://physionet.org/files/mimiciv/2.0/hosp/admissions.csv.gz?download)
2. [`hosp/diagnoses_icd.csv.gz`](https://physionet.org/files/mimiciv/2.0/hosp/diagnoses_icd.csv.gz?download)
3. [`hosp/procedures_icd.csv.gz`](https://physionet.org/files/mimiciv/2.0/hosp/procedures_icd.csv.gz?download)


copy/paste these two files into an arbitrary location of your choice and assign that dirctory path to the variable `mimic4_dir`.


In [2]:
# HOME and DATA_STORE are arbitrary, change as appropriate.
HOME = os.environ.get('HOME')
DATA_STORE = f'{HOME}/GP/ehr-data'


mimic4_dir = f'{DATA_STORE}/mimic4v2.0-cohort'

# Load admission file
admissions_df = pd.read_csv(f'{mimic4_dir}/admissions.csv.gz')

# Count of all subjects in MIMIC-III
print(f'#Subjects: {admissions_df.subject_id.nunique()}')

admissions_df

#Subjects: 190279


Unnamed: 0,subject_id,hadm_id,admittime,dischtime,deathtime,admission_type,admission_location,discharge_location,insurance,language,marital_status,race,edregtime,edouttime,hospital_expire_flag
0,10000032,22595853,2180-05-06 22:23:00,2180-05-07 17:15:00,,URGENT,TRANSFER FROM HOSPITAL,HOME,Other,ENGLISH,WIDOWED,WHITE,2180-05-06 19:17:00,2180-05-06 23:30:00,0
1,10000032,22841357,2180-06-26 18:27:00,2180-06-27 18:49:00,,EW EMER.,EMERGENCY ROOM,HOME,Medicaid,ENGLISH,WIDOWED,WHITE,2180-06-26 15:54:00,2180-06-26 21:31:00,0
2,10000032,25742920,2180-08-05 23:44:00,2180-08-07 17:50:00,,EW EMER.,EMERGENCY ROOM,HOSPICE,Medicaid,ENGLISH,WIDOWED,WHITE,2180-08-05 20:58:00,2180-08-06 01:44:00,0
3,10000032,29079034,2180-07-23 12:35:00,2180-07-25 17:55:00,,EW EMER.,EMERGENCY ROOM,HOME,Medicaid,ENGLISH,WIDOWED,WHITE,2180-07-23 05:54:00,2180-07-23 14:00:00,0
4,10000068,25022803,2160-03-03 23:16:00,2160-03-04 06:26:00,,EU OBSERVATION,EMERGENCY ROOM,,Other,ENGLISH,SINGLE,WHITE,2160-03-03 21:55:00,2160-03-04 06:26:00,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
454319,19999828,25744818,2149-01-08 16:44:00,2149-01-18 17:00:00,,EW EMER.,TRANSFER FROM HOSPITAL,HOME HEALTH CARE,Other,ENGLISH,SINGLE,WHITE,2149-01-08 09:11:00,2149-01-08 18:12:00,0
454320,19999828,29734428,2147-07-18 16:23:00,2147-08-04 18:10:00,,EW EMER.,PHYSICIAN REFERRAL,HOME HEALTH CARE,Other,ENGLISH,SINGLE,WHITE,2147-07-17 17:18:00,2147-07-18 17:34:00,0
454321,19999840,21033226,2164-09-10 13:47:00,2164-09-17 13:42:00,2164-09-17 13:42:00,EW EMER.,EMERGENCY ROOM,DIED,Other,ENGLISH,WIDOWED,WHITE,2164-09-10 11:09:00,2164-09-10 14:46:00,1
454322,19999840,26071774,2164-07-25 00:27:00,2164-07-28 12:15:00,,EW EMER.,EMERGENCY ROOM,HOME,Other,ENGLISH,WIDOWED,WHITE,2164-07-24 21:16:00,2164-07-25 01:20:00,0


In [3]:
# Load Diagnosis file
diag_df = pd.read_csv(f'{mimic4_dir}/diagnoses_icd.csv.gz', dtype = {'icd_code': str})
diag_df

Unnamed: 0,subject_id,hadm_id,seq_num,icd_code,icd_version
0,10000032,22595853,1,5723,9
1,10000032,22595853,2,78959,9
2,10000032,22595853,3,5715,9
3,10000032,22595853,4,07070,9
4,10000032,22595853,5,496,9
...,...,...,...,...,...
5006879,19999987,23865745,7,41401,9
5006880,19999987,23865745,8,78039,9
5006881,19999987,23865745,9,0413,9
5006882,19999987,23865745,10,36846,9


In [4]:
# Load Procedures file
proc_df = pd.read_csv(f'{mimic4_dir}/procedures_icd.csv.gz', dtype = {'icd_code': str})
proc_df

Unnamed: 0,subject_id,hadm_id,seq_num,chartdate,icd_code,icd_version
0,10000032,22595853,1,2180-05-07,5491,9
1,10000032,22841357,1,2180-06-27,5491,9
2,10000032,25742920,1,2180-08-06,5491,9
3,10000068,25022803,1,2160-03-03,8938,9
4,10000117,27988844,1,2183-09-19,0QS734Z,10
...,...,...,...,...,...,...
704119,19999840,21033226,5,2164-09-16,0331,9
704120,19999840,26071774,1,2164-07-25,8891,9
704121,19999840,26071774,2,2164-07-25,8841,9
704122,19999987,23865745,1,2145-11-07,8841,9


<a name="sec2"></a>

## 2 Cohort Selection  [^](#outline)

### 2.A Patient Selection: Minimum of Two Visits

Patients with only one admission (i.e. single timestamp for the diagnosis codes) are not useful in training/validation/testing.

In [5]:
patient_admissions = defaultdict(set)

for row in admissions_df.itertuples():
    patient_admissions[row.subject_id].add(row.hadm_id)
    
patients_admissions_df = pd.DataFrame({
    'patient': patient_admissions.keys(), 
    'n_admissions': map(len, patient_admissions.values())
})


selected_patients_A = set(patients_admissions_df[patients_admissions_df.n_admissions > 1].patient.tolist())

len(selected_patients_A)

83811

Apply the filtration

In [6]:
admissions_A_df = admissions_df[admissions_df.subject_id.isin(selected_patients_A)].reset_index(drop=True)
diag_A_df =  diag_df[diag_df.hadm_id.isin(admissions_A_df.hadm_id)].reset_index(drop=True)
diag_A_df = diag_A_df[diag_A_df.icd_code.notnull()].reset_index(drop=True)

proc_A_df =  proc_df[proc_df.hadm_id.isin(admissions_A_df.hadm_id)].reset_index(drop=True)
proc_A_df = proc_A_df[proc_A_df.icd_code.notnull()].reset_index(drop=True)

admissions_A_df.subject_id.nunique(), len(admissions_A_df), len(diag_A_df), len(proc_A_df)

(83811, 347856, 4063303, 525578)

### 2.B Patient Selection: Maximum Hospital Stay is Two Weeks

In [7]:
admit = pd.to_datetime(admissions_A_df['admittime'], infer_datetime_format=True).dt.normalize() 
disch = pd.to_datetime(admissions_A_df['dischtime'], infer_datetime_format=True).dt.normalize()
admissions_A_df['days'] = (disch - admit).dt.days
admissions_A_df.head()

Unnamed: 0,subject_id,hadm_id,admittime,dischtime,deathtime,admission_type,admission_location,discharge_location,insurance,language,marital_status,race,edregtime,edouttime,hospital_expire_flag,days
0,10000032,22595853,2180-05-06 22:23:00,2180-05-07 17:15:00,,URGENT,TRANSFER FROM HOSPITAL,HOME,Other,ENGLISH,WIDOWED,WHITE,2180-05-06 19:17:00,2180-05-06 23:30:00,0,1
1,10000032,22841357,2180-06-26 18:27:00,2180-06-27 18:49:00,,EW EMER.,EMERGENCY ROOM,HOME,Medicaid,ENGLISH,WIDOWED,WHITE,2180-06-26 15:54:00,2180-06-26 21:31:00,0,1
2,10000032,25742920,2180-08-05 23:44:00,2180-08-07 17:50:00,,EW EMER.,EMERGENCY ROOM,HOSPICE,Medicaid,ENGLISH,WIDOWED,WHITE,2180-08-05 20:58:00,2180-08-06 01:44:00,0,2
3,10000032,29079034,2180-07-23 12:35:00,2180-07-25 17:55:00,,EW EMER.,EMERGENCY ROOM,HOME,Medicaid,ENGLISH,WIDOWED,WHITE,2180-07-23 05:54:00,2180-07-23 14:00:00,0,2
4,10000084,23052089,2160-11-21 01:56:00,2160-11-25 14:52:00,,EW EMER.,WALK-IN/SELF REFERRAL,HOME HEALTH CARE,Medicare,ENGLISH,MARRIED,WHITE,2160-11-20 20:36:00,2160-11-21 03:20:00,0,4


In [8]:
longest_admission = {}
for subject_id, subject_df in admissions_A_df.groupby('subject_id'):
    longest_admission[subject_id] = subject_df.days.max()
    
admissions_A_df['max_days'] = admissions_A_df.subject_id.map(longest_admission)
selected_patients_B = set(admissions_A_df[admissions_A_df.max_days <= 14].subject_id)

Apply the filtration

In [9]:
admissions_B_df = admissions_A_df[admissions_A_df.subject_id.isin(selected_patients_B)].reset_index(drop=True)
diag_B_df =  diag_A_df[diag_A_df.hadm_id.isin(admissions_B_df.hadm_id)].reset_index(drop=True)
diag_B_df = diag_B_df[diag_B_df.icd_code.notnull()].reset_index(drop=True)

proc_B_df =  proc_A_df[proc_A_df.hadm_id.isin(admissions_B_df.hadm_id)].reset_index(drop=True)
proc_B_df = proc_B_df[proc_B_df.icd_code.notnull()].reset_index(drop=True)

admissions_B_df.subject_id.nunique(), len(admissions_B_df), len(diag_B_df), len(proc_B_df)

(70830, 261404, 2737543, 331755)

<a name="sec3"></a>

## 3 Export  [^](#outline)

Select relevant columns from `admissions_B_df` and `diag_B_df` then write to disk.

In [10]:
admissions_selected_df = admissions_B_df[['subject_id', 'hadm_id', 'admittime', 'dischtime']]

In [11]:
admissions_selected_df.admittime = pd.to_datetime(admissions_selected_df.admittime, 
                                                  infer_datetime_format=True).dt.normalize()
admissions_selected_df.dischtime = pd.to_datetime(admissions_selected_df.dischtime, 
                                                  infer_datetime_format=True).dt.normalize()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self[name] = value


In [12]:
admissions_selected_df.head()

Unnamed: 0,subject_id,hadm_id,admittime,dischtime
0,10000032,22595853,2180-05-06,2180-05-07
1,10000032,22841357,2180-06-26,2180-06-27
2,10000032,25742920,2180-08-05,2180-08-07
3,10000032,29079034,2180-07-23,2180-07-25
4,10000084,23052089,2160-11-21,2160-11-25


In [13]:
diag_selected_df = diag_B_df[['subject_id', 'hadm_id', 'icd_code', 'icd_version']]
diag_selected_df = diag_selected_df[diag_selected_df.icd_code.notnull()]

proc_selected_df = proc_B_df[['subject_id', 'hadm_id', 'icd_code', 'icd_version']]
proc_selected_df = proc_selected_df[proc_selected_df.icd_code.notnull()]

In [14]:
admissions_selected_df.to_csv(f'{mimic4_dir}/adm_df.csv.gz', compression='gzip', index=False)
diag_selected_df.to_csv(f'{mimic4_dir}/dx_df.csv.gz', compression='gzip', index=False)
proc_selected_df.to_csv(f'{mimic4_dir}/pr_df.csv.gz', compression='gzip', index=False)

### Generate Synthetic

Generate a sample with shuffled event types in case you are interested to share a public sample for testing.

In [15]:
adm_syn_df = admissions_selected_df.copy()
diag_syn_df = diag_selected_df.copy()
proc_syn_df = proc_selected_df.copy()

In [16]:
subjects = set(adm_syn_df.subject_id)

In [17]:
len(subjects)

70830

In [18]:
import random
import numpy as np
random.seed(42)
syn_subjects = random.sample(subjects, 400)

In [19]:
adm_syn_df = adm_syn_df[adm_syn_df.subject_id.isin(syn_subjects)]
diag_syn_df = diag_syn_df[diag_syn_df.subject_id.isin(syn_subjects)]
proc_syn_df = proc_syn_df[proc_syn_df.subject_id.isin(syn_subjects)]

In [20]:
subject_permute = dict(zip(syn_subjects, np.random.permutation(syn_subjects)))
subject_shift = {i: np.random.randint(5, 30) for i in syn_subjects}
     

In [21]:
adm_syn_df['admittime'] = adm_syn_df.apply(lambda r: r['admittime'] +pd.DateOffset(months=subject_shift[r['subject_id']]) , axis=1)
adm_syn_df['dischtime'] = adm_syn_df.apply(lambda r: r['dischtime'] +pd.DateOffset(months=subject_shift[r['subject_id']]), axis=1)
adm_syn_df.head()

Unnamed: 0,subject_id,hadm_id,admittime,dischtime
429,10014765,26650343,2200-07-18,2200-07-23
430,10014765,29840268,2193-09-27,2193-09-28
1076,10043305,23614590,2191-01-17,2191-01-24
1077,10043305,27496213,2191-01-16,2191-01-17
1734,10071281,23232447,2159-07-04,2159-07-06


In [22]:
adm_syn_df['subject_id'] = adm_syn_df['subject_id'].map(subject_permute)
diag_syn_df['subject_id'] = diag_syn_df['subject_id'].map(subject_permute)
proc_syn_df['subject_id'] = proc_syn_df['subject_id'].map(subject_permute)

diag_syn_df.head()

Unnamed: 0,subject_id,hadm_id,icd_code,icd_version
4966,10596508,26650343,C801,10
4967,10596508,26650343,J910,10
4968,10596508,26650343,J9811,10
4969,10596508,26650343,I495,10
4970,10596508,26650343,J939,10


In [23]:
proc_syn_df.head()

Unnamed: 0,subject_id,hadm_id,icd_code,icd_version
519,10596508,26650343,0W9930Z,10
2152,19531565,23232447,0066,9
2153,19531565,23232447,3606,9
2154,19531565,23232447,0045,9
2155,19531565,23232447,0040,9


In [24]:
diag_syn_df.icd_version.value_counts()

9     8932
10    5329
Name: icd_version, dtype: int64

In [25]:
diag_syn_df.head()

Unnamed: 0,subject_id,hadm_id,icd_code,icd_version
4966,10596508,26650343,C801,10
4967,10596508,26650343,J910,10
4968,10596508,26650343,J9811,10
4969,10596508,26650343,I495,10
4970,10596508,26650343,J939,10


In [26]:
adm_syn_df.to_csv(f'{mimic4_dir}/syn_adm_df.csv.gz', compression='gzip', index=False)
diag_syn_df.to_csv(f'{mimic4_dir}/syn_dx_df.csv.gz', compression='gzip', index=False)
proc_syn_df.to_csv(f'{mimic4_dir}/syn_pr_df.csv.gz', compression='gzip', index=False)
