# MIMIC-III Preparation

This tutorial walks through downloading the publicly available MIMIC-III dataset (assuming access has been granted), selecting a cohort, and converting the data to a format compatible with our software.

<a name="outline"></a>

## Outline

- [1](#sec1) Downloading MIMIC-III
- [2](#sec2) Cohort Selection
- [3](#sec3) Export


In [None]:
import pandas as pd
from collections import defaultdict

<a name="sec1"></a>

## 1 Downloading MIMIC-III  [^](#outline)

We assume you have already been granted access to the [MIMIC-III dataset](https://physionet.org/content/mimiciii/1.4/); approval often takes two weeks from the date of the access request.

On the dataset page, [https://physionet.org/content/mimiciii/1.4/](https://physionet.org/content/mimiciii/1.4/), use the file table at the end of the page to download the following files:

1. [`ADMISSIONS.csv.gz`](https://physionet.org/files/mimiciii/1.4/ADMISSIONS.csv.gz?download)
2. [`DIAGNOSES_ICD.csv.gz`](https://physionet.org/files/mimiciii/1.4/DIAGNOSES_ICD.csv.gz?download)


Copy these two files into a `data` folder located in the same folder as this notebook.
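Before loading anything, it can help to confirm that both downloads are where the notebook expects them. The following is a minimal sketch: the helper `missing_mimic_files` is a hypothetical name introduced here for illustration, and the `data` path is the folder described above.

```python
from pathlib import Path

# File names come from the download list above; the helper is hypothetical.
REQUIRED_FILES = ('ADMISSIONS.csv.gz', 'DIAGNOSES_ICD.csv.gz')


def missing_mimic_files(data_dir='data', required=REQUIRED_FILES):
    """Return the names of the required files that are absent from data_dir."""
    data_dir = Path(data_dir)
    return [name for name in required if not (data_dir / name).is_file()]


missing = missing_mimic_files('data')
if missing:
    print(f'Missing from data/: {missing}')
else:
    print('Both MIMIC-III files are in place.')
```

An empty list means the cells that follow should find both files.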

In [None]:

# Load admission file
admissions_df = pd.read_csv('data/ADMISSIONS.csv.gz')

# Count of all subjects in MIMIC-III
print(f'#Subjects: {admissions_df.SUBJECT_ID.nunique()}')

admissions_df

In [None]:
# Load Diagnosis file
diag_df = pd.read_csv('data/DIAGNOSES_ICD.csv.gz', dtype = {'ICD9_CODE': str})


diag_df
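The `dtype = {'ICD9_CODE': str}` argument above keeps the diagnosis codes as text. As a small illustration of why this can matter, the sketch below reads a toy CSV (made-up rows, only the column names mirror `DIAGNOSES_ICD`) with and without that argument; when a column holds only digits, pandas parses it as integers and leading zeros are lost.

```python
import io

import pandas as pd

# Toy CSV with made-up rows; not taken from MIMIC-III.
toy_csv = 'HADM_ID,ICD9_CODE\n1,0389\n2,4019\n'

# Default parsing: an all-digit column is read as integers.
as_default = pd.read_csv(io.StringIO(toy_csv))

# Forcing the column to str keeps each code exactly as written.
as_string = pd.read_csv(io.StringIO(toy_csv), dtype={'ICD9_CODE': str})

print(as_default.ICD9_CODE.tolist())  # [389, 4019]
print(as_string.ICD9_CODE.tolist())   # ['0389', '4019']
```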

<a name="sec2"></a>

## 2 Cohort Selection  [^](#outline)

### 2.A Patient Selection: Minimum of Two Visits

Patients with only one admission (i.e. a single timestamp for their diagnosis codes) are not useful for training, validation, or testing, so we keep only patients with at least two admissions.

In [None]:
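# Collect the set of distinct admission IDs (HADM_ID) of each patient (SUBJECT_ID)
patient_admissions = defaultdict(set)

for row in admissions_df.itertuples():
    patient_admissions[row.SUBJECT_ID].add(row.HADM_ID)

# One row per patient with the number of distinct admissions
patients_admissions_df = pd.DataFrame({
    'patient': list(patient_admissions.keys()),
    'n_admissions': [len(hadm_ids) for hadm_ids in patient_admissions.values()]
})

# Keep patients with at least two admissions
selected_patients_A = set(patients_admissions_df[patients_admissions_df.n_admissions > 1].patient.tolist())

len(selected_patients_A)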
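The same count can also be expressed with a pandas `groupby`. The sketch below is an alternative formulation on a toy table (the rows are made up; only the column names match `ADMISSIONS`), not the method used in the cell above.

```python
import pandas as pd

# Toy admissions table: patient 2 has a single admission.
toy_admissions = pd.DataFrame({
    'SUBJECT_ID': [1, 1, 2, 3, 3, 3],
    'HADM_ID': [10, 11, 20, 30, 31, 32],
})

# Number of distinct admissions per patient
n_admissions = toy_admissions.groupby('SUBJECT_ID').HADM_ID.nunique()

# Patients with at least two admissions
toy_selected = set(n_admissions[n_admissions > 1].index)
print(toy_selected)  # {1, 3}
```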

Apply the filter to the admissions and diagnoses tables, and drop diagnosis rows that have no ICD-9 code.

In [None]:
admissions_A_df = admissions_df[admissions_df.SUBJECT_ID.isin(selected_patients_A)].reset_index(drop=True)
diag_A_df =  diag_df[diag_df.HADM_ID.isin(admissions_A_df.HADM_ID)].reset_index(drop=True)
diag_A_df = diag_A_df[diag_A_df.ICD9_CODE.notnull()].reset_index(drop=True)
admissions_A_df.SUBJECT_ID.nunique(), len(admissions_A_df), len(diag_A_df)

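### 2.B Patient Selection: Maximum Hospital Stay is Two Weeks

We keep only patients whose longest admission lasts at most 14 days, measured between the calendar dates of admission and discharge.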

In [None]:
# Compare calendar dates only: normalize() resets each timestamp to midnight
admit = pd.to_datetime(admissions_A_df['ADMITTIME']).dt.normalize()
disch = pd.to_datetime(admissions_A_df['DISCHTIME']).dt.normalize()
admissions_A_df['days'] = (disch - admit).dt.days
admissions_A_df.head()
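Because both timestamps are normalized to midnight before subtraction, `days` counts calendar-date boundaries crossed rather than elapsed 24-hour periods. The sketch below contrasts the two on made-up timestamps; the variable names are illustrative only.

```python
import pandas as pd

# Made-up stays: one crosses midnight after 20 minutes, one lasts 12 hours within a day.
toy = pd.DataFrame({
    'ADMITTIME': ['2101-01-01 23:50:00', '2101-03-01 08:00:00'],
    'DISCHTIME': ['2101-01-02 00:10:00', '2101-03-01 20:00:00'],
})
admit = pd.to_datetime(toy.ADMITTIME)
disch = pd.to_datetime(toy.DISCHTIME)

# Whole 24-hour periods elapsed between admission and discharge
elapsed_days = (disch - admit).dt.days

# Difference between calendar dates, as computed in the cell above
calendar_days = (disch.dt.normalize() - admit.dt.normalize()).dt.days

print(elapsed_days.tolist())   # [0, 0]
print(calendar_days.tolist())  # [1, 0]
```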

In [None]:
# Longest stay (in days) across all admissions of each patient
admissions_A_df['max_days'] = admissions_A_df.groupby('SUBJECT_ID').days.transform('max')

# Keep patients whose longest stay is at most two weeks
selected_patients_B = set(admissions_A_df[admissions_A_df.max_days <= 14].SUBJECT_ID)

Apply the filter to the admissions and diagnoses tables.

In [None]:
admissions_B_df = admissions_A_df[admissions_A_df.SUBJECT_ID.isin(selected_patients_B)].reset_index(drop=True)
diag_B_df =  diag_A_df[diag_A_df.HADM_ID.isin(admissions_B_df.HADM_ID)].reset_index(drop=True)
diag_B_df = diag_B_df[diag_B_df.ICD9_CODE.notnull()].reset_index(drop=True)
admissions_B_df.SUBJECT_ID.nunique(), len(admissions_B_df), len(diag_B_df)
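Note that this criterion works at the patient level: a single admission longer than 14 days removes all of that patient's admissions, not just the long one. A minimal sketch on made-up rows, using a `groupby(...).transform('max')` formulation that is assumed here for brevity:

```python
import pandas as pd

# Made-up stays: patient 1 has one 20-day admission, patient 2 never exceeds 14 days.
toy = pd.DataFrame({
    'SUBJECT_ID': [1, 1, 2, 2],
    'HADM_ID': [10, 11, 20, 21],
    'days': [3, 20, 5, 14],
})

# Broadcast each patient's longest stay to all of that patient's rows
toy['max_days'] = toy.groupby('SUBJECT_ID').days.transform('max')

# Patient 1 is dropped entirely, including the short 3-day admission
kept = toy[toy.max_days <= 14].reset_index(drop=True)
print(kept.HADM_ID.tolist())  # [20, 21]
```

Since whole patients are removed, every remaining patient still has at least two admissions after this step.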

<a name="sec3"></a>

## 3 Export  [^](#outline)

Select the relevant columns from `admissions_B_df` and `diag_B_df`, then write them to disk.

In [None]:
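# Copy the selection so that later column assignments do not act on a view of admissions_B_df
admissions_selected_df = admissions_B_df[['SUBJECT_ID', 'HADM_ID', 'ADMITTIME', 'DISCHTIME']].copy()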

In [None]:
# Keep only the calendar date of each timestamp
admissions_selected_df['ADMITTIME'] = pd.to_datetime(admissions_selected_df['ADMITTIME']).dt.normalize()
admissions_selected_df['DISCHTIME'] = pd.to_datetime(admissions_selected_df['DISCHTIME']).dt.normalize()

In [None]:
diag_selected_df = diag_B_df[['SUBJECT_ID', 'HADM_ID', 'ICD9_CODE']]
diag_selected_df = diag_selected_df[diag_selected_df.ICD9_CODE.notnull()]

In [None]:
admissions_selected_df.to_csv('data/mimic3_adm_df.csv.gz', compression='gzip', index=False)
diag_selected_df.to_csv('data/mimic3_diag_df.csv.gz', compression='gzip', index=False)
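When the exported files are read back, the column types are not stored in the CSV, so dates and ICD-9 codes need to be requested explicitly. The sketch below round-trips toy tables (made-up rows, hypothetical file names in a temporary folder) to show one way of doing this; how our software actually loads the files is not shown here.

```python
import os
import tempfile

import pandas as pd

# Toy tables with the same columns as the exported files; rows are made up.
toy_adm = pd.DataFrame({
    'SUBJECT_ID': [1],
    'HADM_ID': [10],
    'ADMITTIME': pd.to_datetime(['2101-01-01']),
    'DISCHTIME': pd.to_datetime(['2101-01-04']),
})
toy_diag = pd.DataFrame({'SUBJECT_ID': [1], 'HADM_ID': [10], 'ICD9_CODE': ['0389']})

with tempfile.TemporaryDirectory() as tmp_dir:
    adm_path = os.path.join(tmp_dir, 'toy_adm_df.csv.gz')
    diag_path = os.path.join(tmp_dir, 'toy_diag_df.csv.gz')
    toy_adm.to_csv(adm_path, compression='gzip', index=False)
    toy_diag.to_csv(diag_path, compression='gzip', index=False)

    # Compression is inferred from the .gz extension when reading.
    adm_back = pd.read_csv(adm_path, parse_dates=['ADMITTIME', 'DISCHTIME'])
    diag_back = pd.read_csv(diag_path, dtype={'ICD9_CODE': str})

print(adm_back.dtypes)
print(diag_back.ICD9_CODE.tolist())
```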