# MIMIC-III Preparation

This tutorial walks through downloading the publicly available MIMIC-III dataset (assuming access has been granted), selecting a cohort, and transforming the data into a format compatible with our software.

<a name="outline"></a>

## Outline

- [1](#sec1) Downloading MIMIC-III
- [2](#sec2) Cohort Selection
- [3](#sec3) Export


In [1]:
import os
import pandas as pd
from collections import defaultdict

<a name="sec1"></a>

## 1 Downloading MIMIC-III  [^](#outline)

We assume that access to the [MIMIC-III dataset](https://physionet.org/content/mimiciii/1.4/) has already been granted; the process often takes two weeks from the access request to the approval.

On the dataset page [https://physionet.org/content/mimiciii/1.4/](https://physionet.org/content/mimiciii/1.4/), consult the file table at the end of the page and download the following files:

1. [`ADMISSIONS.csv.gz`](https://physionet.org/files/mimiciii/1.4/ADMISSIONS.csv.gz?download)
2. [`DIAGNOSES_ICD.csv.gz`](https://physionet.org/files/mimiciii/1.4/DIAGNOSES_ICD.csv.gz?download)


Copy these two files to a location of your choice and assign that directory path to the variable `mimic3_dir`.
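
Before loading anything, it can help to confirm that both files are where the notebook expects them. The following is a minimal sketch, not part of the original notebook; the helper name `missing_mimic3_files` is hypothetical, and the file names are taken from the download list above.

```python
from pathlib import Path

# The two files this tutorial reads (names from the download list above).
REQUIRED_FILES = ("ADMISSIONS.csv.gz", "DIAGNOSES_ICD.csv.gz")


def missing_mimic3_files(directory, required=REQUIRED_FILES):
    """Return the required file names that are not present in `directory`.

    Hypothetical helper for illustration only.
    """
    directory = Path(directory)
    return [name for name in required if not (directory / name).is_file()]
```

Once `mimic3_dir` is defined, `missing_mimic3_files(mimic3_dir)` should return an empty list if both downloads are in place.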


In [2]:
# HOME and DATA_STORE are arbitrary, change as appropriate.
HOME = os.environ.get('HOME')
DATA_STORE = f'{HOME}/GP/ehr-cohort'


mimic3_dir = f'{DATA_STORE}/mimic3-transforms'
# Load admission file
admissions_df = pd.read_csv(f'{mimic3_dir}/ADMISSIONS.csv.gz')

# Count of distinct subjects in the loaded admissions table
print(f'#Subjects: {admissions_df.SUBJECT_ID.nunique()}')

admissions_df

#Subjects: 4434


Unnamed: 0,SUBJECT_ID,HADM_ID,ADMITTIME,DISCHTIME,ADMISSION_TYPE,ADMISSION_LOCATION,ETHNICITY,DIAGNOSIS,DAYS,MAX_DAYS
0,23,152223,2153-09-03,2153-09-08,ELECTIVE,PHYS REFERRAL/NORMAL DELI,WHITE,CORONARY ARTERY DISEASE\CORONARY ARTERY BYPASS...,5,7
1,23,124321,2157-10-18,2157-10-25,EMERGENCY,TRANSFER FROM HOSP/EXTRAM,WHITE,BRAIN MASS,7,7
2,34,115799,2186-07-18,2186-07-20,EMERGENCY,TRANSFER FROM HOSP/EXTRAM,WHITE,CHEST PAIN\CATH,2,2
3,34,144319,2191-02-23,2191-02-25,EMERGENCY,CLINIC REFERRAL/PREMATURE,WHITE,BRADYCARDIA,2,2
4,36,182104,2131-04-30,2131-05-08,EMERGENCY,CLINIC REFERRAL/PREMATURE,WHITE,CORONARY ARTERY DISEASE\CORONARY ARTERY BYPASS...,8,13
...,...,...,...,...,...,...,...,...,...,...
10949,98759,109836,2165-05-05,2165-05-08,EMERGENCY,CLINIC REFERRAL/PREMATURE,WHITE,BRAIN ANEURYSM,3,3
10950,98759,175386,2165-06-05,2165-06-07,ELECTIVE,PHYS REFERRAL/NORMAL DELI,WHITE,BRAIN ANEURYSM/SDA,2,3
10951,98761,184477,2186-01-16,2186-01-16,ELECTIVE,PHYS REFERRAL/NORMAL DELI,WHITE,GASTROPARESIS\PLACEMENT OF G-TUBE **REMOTE WES...,0,7
10952,98761,182540,2186-02-08,2186-02-08,ELECTIVE,PHYS REFERRAL/NORMAL DELI,WHITE,SHORT GUT SYNDROME/SDA,0,7


In [3]:
# Load Diagnosis file
diag_df = pd.read_csv(f'{mimic3_dir}/DIAGNOSES_ICD.csv.gz', dtype = {'ICD9_CODE': str})


diag_df

Unnamed: 0,SUBJECT_ID,HADM_ID,ICD9_CODE
0,112,174105,53100
1,112,174105,41071
2,112,174105,2859
3,112,174105,41401
4,112,174105,725
...,...,...,...
127261,97488,161999,0414
127262,97488,161999,30391
127263,97488,161999,E8798
127264,97488,161999,78791
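
The `dtype={'ICD9_CODE': str}` argument in the cell above matters because some ICD-9 codes start with a zero (for example `0414` in the output). The sketch below illustrates the effect on a tiny CSV with made-up rows (not MIMIC-III data) in which every code consists of digits only, so pandas would otherwise infer an integer column.

```python
import io

import pandas as pd

# Toy CSV with made-up rows; every code here consists of digits only.
toy_csv = "HADM_ID,ICD9_CODE\n1,0414\n1,2859\n2,0389\n"

# Without a dtype hint, the digit-only column is parsed as integers.
inferred = pd.read_csv(io.StringIO(toy_csv))

# With the dtype hint, the codes are kept as text.
as_text = pd.read_csv(io.StringIO(toy_csv), dtype={"ICD9_CODE": str})

codes_inferred = inferred["ICD9_CODE"].tolist()  # leading zeros are lost
codes_as_text = as_text["ICD9_CODE"].tolist()    # leading zeros are kept
```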


<a name="sec2"></a>

## 2 Cohort Selection  [^](#outline)

### 2.A Patient Selection: Minimum of Two Visits

Patients with only one admission (i.e., a single timestamp for all of their diagnosis codes) are not useful for training, validation, or testing, so only patients with at least two admissions are kept.

In [4]:
patient_admissions = defaultdict(set)

for row in admissions_df.itertuples():
    patient_admissions[row.SUBJECT_ID].add(row.HADM_ID)
    
patients_admissions_df = pd.DataFrame({
    'patient': patient_admissions.keys(), 
    'n_admissions': map(len, patient_admissions.values())
})


selected_patients_A = set(patients_admissions_df[patients_admissions_df.n_admissions > 1].patient.tolist())

len(selected_patients_A)

4434
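
The per-patient counting loop above can also be expressed with a pandas `groupby`. The following is a minimal sketch on a made-up toy table (the identifiers are hypothetical, not MIMIC-III data); it is an alternative formulation, not the notebook's own implementation.

```python
import pandas as pd

# Toy admissions table with made-up identifiers.
toy_admissions = pd.DataFrame({
    "SUBJECT_ID": [1, 1, 2, 3, 3, 3],
    "HADM_ID": [10, 11, 20, 30, 31, 31],
})

# Number of distinct admissions per patient.
n_admissions = toy_admissions.groupby("SUBJECT_ID")["HADM_ID"].nunique()

# Patients with at least two distinct admissions.
toy_selected = set(n_admissions[n_admissions > 1].index)
```

Using `nunique` mirrors the `set` in the loop: a repeated `HADM_ID` for the same patient is counted once.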

Apply the filter:

In [5]:
admissions_A_df = admissions_df[admissions_df.SUBJECT_ID.isin(selected_patients_A)].reset_index(drop=True)
diag_A_df =  diag_df[diag_df.HADM_ID.isin(admissions_A_df.HADM_ID)].reset_index(drop=True)
diag_A_df = diag_A_df[diag_A_df.ICD9_CODE.notnull()].reset_index(drop=True)
admissions_A_df.SUBJECT_ID.nunique(), len(admissions_A_df), len(diag_A_df)

(4434, 10954, 127227)

### 2.B Patient Selection: Maximum Hospital Stay is Two Weeks

In [6]:
# `infer_datetime_format` is deprecated since pandas 2.0 (format inference is now the default), so it is omitted.
admit = pd.to_datetime(admissions_A_df['ADMITTIME']).dt.normalize()
disch = pd.to_datetime(admissions_A_df['DISCHTIME']).dt.normalize()
admissions_A_df['days'] = (disch - admit).dt.days
admissions_A_df.head()

Unnamed: 0,SUBJECT_ID,HADM_ID,ADMITTIME,DISCHTIME,ADMISSION_TYPE,ADMISSION_LOCATION,ETHNICITY,DIAGNOSIS,DAYS,MAX_DAYS,days
0,23,152223,2153-09-03,2153-09-08,ELECTIVE,PHYS REFERRAL/NORMAL DELI,WHITE,CORONARY ARTERY DISEASE\CORONARY ARTERY BYPASS...,5,7,5
1,23,124321,2157-10-18,2157-10-25,EMERGENCY,TRANSFER FROM HOSP/EXTRAM,WHITE,BRAIN MASS,7,7,7
2,34,115799,2186-07-18,2186-07-20,EMERGENCY,TRANSFER FROM HOSP/EXTRAM,WHITE,CHEST PAIN\CATH,2,2,2
3,34,144319,2191-02-23,2191-02-25,EMERGENCY,CLINIC REFERRAL/PREMATURE,WHITE,BRADYCARDIA,2,2,2
4,36,182104,2131-04-30,2131-05-08,EMERGENCY,CLINIC REFERRAL/PREMATURE,WHITE,CORONARY ARTERY DISEASE\CORONARY ARTERY BYPASS...,8,13,8


In [7]:
longest_admission = {}
for subject_id, subject_df in admissions_A_df.groupby('SUBJECT_ID'):
    longest_admission[subject_id] = subject_df.days.max()
    
admissions_A_df['max_days'] = admissions_A_df.SUBJECT_ID.map(longest_admission)
selected_patients_B = set(admissions_A_df[admissions_A_df.max_days <= 14].SUBJECT_ID)
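
The two cells above can be illustrated end to end on a toy table. The sketch below uses made-up admissions (not MIMIC-III data) and replaces the explicit loop with `groupby(...).transform('max')`, which is an alternative formulation rather than the notebook's own code. Note how normalizing to midnight makes a stay that crosses midnight count as one day, regardless of the hours involved.

```python
import pandas as pd

# Toy admissions with made-up identifiers and timestamps.
toy = pd.DataFrame({
    "SUBJECT_ID": [1, 1, 2, 2],
    "ADMITTIME": ["2150-01-01 23:30:00", "2150-03-01 08:00:00",
                  "2151-05-01 10:00:00", "2151-07-01 09:00:00"],
    "DISCHTIME": ["2150-01-02 01:00:00", "2150-03-10 12:00:00",
                  "2151-05-20 10:00:00", "2151-07-03 18:00:00"],
})

# Length of stay in whole calendar days (time of day is dropped by normalize()).
admit = pd.to_datetime(toy["ADMITTIME"]).dt.normalize()
disch = pd.to_datetime(toy["DISCHTIME"]).dt.normalize()
toy["days"] = (disch - admit).dt.days

# Longest stay per patient, broadcast back to every row of that patient.
toy["max_days"] = toy.groupby("SUBJECT_ID")["days"].transform("max")

# Keep patients whose longest stay is at most two weeks.
kept = set(toy.loc[toy["max_days"] <= 14, "SUBJECT_ID"])
```

Here patient 2 is dropped entirely because one of their stays lasts 19 days, even though their other stay is short.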

Apply the filter:

In [8]:
admissions_B_df = admissions_A_df[admissions_A_df.SUBJECT_ID.isin(selected_patients_B)].reset_index(drop=True)
diag_B_df =  diag_A_df[diag_A_df.HADM_ID.isin(admissions_B_df.HADM_ID)].reset_index(drop=True)
diag_B_df = diag_B_df[diag_B_df.ICD9_CODE.notnull()].reset_index(drop=True)
admissions_B_df.SUBJECT_ID.nunique(), len(admissions_B_df), len(diag_B_df)

(4434, 10954, 127227)

<a name="sec3"></a>

## 3 Export  [^](#outline)

Select relevant columns from `admissions_B_df` and `diag_B_df` then write to disk.

In [9]:
# Copy the column subset so the assignments below do not write to a slice of admissions_B_df
# (this avoids pandas' SettingWithCopyWarning).
admissions_selected_df = admissions_B_df[['SUBJECT_ID', 'HADM_ID', 'ADMITTIME', 'DISCHTIME']].copy()

In [10]:
admissions_selected_df['ADMITTIME'] = pd.to_datetime(admissions_selected_df['ADMITTIME']).dt.normalize()
admissions_selected_df['DISCHTIME'] = pd.to_datetime(admissions_selected_df['DISCHTIME']).dt.normalize()


In [11]:
diag_selected_df = diag_B_df[['SUBJECT_ID', 'HADM_ID', 'ICD9_CODE']]
diag_selected_df = diag_selected_df[diag_selected_df.ICD9_CODE.notnull()]

In [16]:
admissions_selected_df.to_csv(f'{mimic3_dir}/adm_df.csv.gz', compression='gzip', index=False)
diag_selected_df.to_csv(f'{mimic3_dir}/dx_df.csv.gz', compression='gzip', index=False)
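
When the exported files are read back, the column types are not stored in the CSV, so they have to be requested again. The following is a minimal round-trip sketch on made-up toy tables written to a temporary directory (the rows are hypothetical, not MIMIC-III data); it reuses the export file names only for illustration.

```python
import os
import tempfile

import pandas as pd

# Toy tables with made-up rows.
toy_adm = pd.DataFrame({
    "SUBJECT_ID": [1],
    "HADM_ID": [10],
    "ADMITTIME": pd.to_datetime(["2150-01-01"]),
    "DISCHTIME": pd.to_datetime(["2150-01-05"]),
})
toy_dx = pd.DataFrame({
    "SUBJECT_ID": [1, 1],
    "HADM_ID": [10, 10],
    "ICD9_CODE": ["0414", "E8798"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    adm_path = os.path.join(tmp_dir, "adm_df.csv.gz")
    dx_path = os.path.join(tmp_dir, "dx_df.csv.gz")
    toy_adm.to_csv(adm_path, compression="gzip", index=False)
    toy_dx.to_csv(dx_path, compression="gzip", index=False)

    # Ask for dates and text codes explicitly when reading back.
    adm_back = pd.read_csv(adm_path, parse_dates=["ADMITTIME", "DISCHTIME"])
    dx_back = pd.read_csv(dx_path, dtype={"ICD9_CODE": str})

stay_days = (adm_back["DISCHTIME"] - adm_back["ADMITTIME"]).dt.days.tolist()
codes_back = dx_back["ICD9_CODE"].tolist()
```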

### Generate a Synthetic Sample

Generate a sample with shuffled event types (here, the ICD-9 diagnosis codes) in case you want to share a public sample for testing.

In [13]:
adm_syn_df = admissions_selected_df.copy()
diag_syn_df = diag_selected_df.copy()

In [14]:
subjects = set(adm_syn_df.SUBJECT_ID)

In [15]:
len(subjects)

4434

In [32]:
import random
import numpy as np
random.seed(42)
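# random.sample requires a sequence (sets are rejected since Python 3.11); sorting gives a reproducible order.
syn_subjects = random.sample(sorted(subjects), 150)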

In [33]:
# Copy the filtered rows so that later column assignments act on independent frames.
adm_syn_df = adm_syn_df[adm_syn_df.SUBJECT_ID.isin(syn_subjects)].copy()
diag_syn_df = diag_syn_df[diag_syn_df.SUBJECT_ID.isin(syn_subjects)].copy()

In [34]:
diag_syn_df.head()

Unnamed: 0,SUBJECT_ID,HADM_ID,ICD9_CODE
335,321,192097,73382
336,321,192097,42731
337,321,192097,29281
338,321,192097,2851
339,321,192097,25000


In [35]:
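# Shuffle the codes across all rows of the sample.
# Note: random.seed(...) does not seed NumPy, so this permutation differs between runs
# unless NumPy's generator is seeded separately.
diag_syn_df['ICD9_CODE'] = np.random.permutation(list(diag_syn_df['ICD9_CODE']))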

In [36]:
diag_syn_df.head()

Unnamed: 0,SUBJECT_ID,HADM_ID,ICD9_CODE
335,321,192097,43320
336,321,192097,V103
337,321,192097,4254
338,321,192097,25000
339,321,192097,42731
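
The shuffle keeps the same multiset of codes and only reassigns them to different rows. A minimal sketch of a reproducible variant is given below, assuming NumPy's `default_rng` generator with an arbitrarily chosen seed; the toy table is made up and the approach is an illustration, not the notebook's own code.

```python
import numpy as np
import pandas as pd

# Toy diagnosis table with made-up admission identifiers.
toy_dx = pd.DataFrame({
    "HADM_ID": [10, 10, 11, 12, 12],
    "ICD9_CODE": ["0414", "25000", "42731", "V103", "4254"],
})

# A seeded generator makes the permutation repeatable (seed value is arbitrary).
rng = np.random.default_rng(42)

shuffled = toy_dx.copy()
shuffled["ICD9_CODE"] = rng.permutation(toy_dx["ICD9_CODE"].to_numpy())
```

Re-creating the generator with the same seed and repeating the call should yield the same assignment of codes to rows.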


In [17]:
adm_syn_df.to_csv(f'{mimic3_dir}/syn_adm_df.csv.gz', compression='gzip', index=False)
diag_syn_df.to_csv(f'{mimic3_dir}/syn_dx_df.csv.gz', compression='gzip', index=False)
