# MIMIC-IV Preparation

This tutorial walks through downloading the publicly available MIMIC-IV dataset (assuming access has been granted), selecting a cohort, and transforming the data into a format compatible with our software.

<a name="outline"></a>

## Outline

- [1](#sec1) Downloading MIMIC-IV
- [2](#sec2) Cohort Selection
- [3](#sec3) Export


In [2]:
import os
from collections import defaultdict

import pandas as pd

<a name="sec1"></a>

## 1 Downloading MIMIC-IV  [^](#outline)

We assume that you have been granted access to the [MIMIC-IV dataset](https://physionet.org/content/mimiciv/1.0/), a process that often takes about two weeks from the access request to the approval. In our experience, access to MIMIC-III automatically grants access to MIMIC-IV.

From the dataset page [https://physionet.org/content/mimiciv/1.0/](https://physionet.org/content/mimiciv/1.0/), use the online file browser at the bottom of the page to download the following files:

1. [`core/admissions.csv.gz`](https://physionet.org/files/mimiciv/1.0/core/admissions.csv.gz?download)
2. [`hosp/diagnoses_icd.csv.gz`](https://physionet.org/files/mimiciv/1.0/hosp/diagnoses_icd.csv.gz?download)


Copy these two files into a directory of your choice and assign that directory path to the variable `mimic4_dir`.
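Before loading anything, it can help to confirm that both files are where the notebook expects them. The helper below is a minimal sketch and not part of the original notebook; the name `missing_mimic4_files` is hypothetical, and the file names are the two listed above.

```python
from pathlib import Path

# The two files this tutorial reads from `mimic4_dir`.
REQUIRED_FILES = ('admissions.csv.gz', 'diagnoses_icd.csv.gz')


def missing_mimic4_files(directory):
    """Return the required file names that are not present in `directory`.

    Hypothetical helper for illustration; it only checks that the files exist.
    """
    directory = Path(directory)
    return [name for name in REQUIRED_FILES if not (directory / name).is_file()]
```

Calling `missing_mimic4_files(mimic4_dir)` once `mimic4_dir` is defined returns an empty list when both files are in place.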


In [4]:
# HOME and DATA_STORE are arbitrary, change as appropriate.
HOME = os.environ.get('HOME')
DATA_STORE = f'{HOME}/GP/ehr-data'


mimic4_dir = f'{DATA_STORE}/mimic4-cohort'

# Load admission file
admissions_df = pd.read_csv(f'{mimic4_dir}/admissions.csv.gz')

# Count of all subjects in MIMIC-IV
print(f'#Subjects: {admissions_df.subject_id.nunique()}')

admissions_df

#Subjects: 256878


Unnamed: 0,subject_id,hadm_id,admittime,dischtime,deathtime,admission_type,admission_location,discharge_location,insurance,language,marital_status,ethnicity,edregtime,edouttime,hospital_expire_flag
0,14679932,21038362,2139-09-26 14:16:00,2139-09-28 11:30:00,,ELECTIVE,,HOME,Other,ENGLISH,SINGLE,UNKNOWN,,,0
1,15585972,24941086,2123-10-07 23:56:00,2123-10-12 11:22:00,,ELECTIVE,,HOME,Other,ENGLISH,,WHITE,,,0
2,11989120,21965160,2147-01-14 09:00:00,2147-01-17 14:25:00,,ELECTIVE,,HOME,Other,ENGLISH,,UNKNOWN,,,0
3,17817079,24709883,2165-12-27 17:33:00,2165-12-31 21:18:00,,ELECTIVE,,HOME,Other,ENGLISH,,OTHER,,,0
4,15078341,23272159,2122-08-28 08:48:00,2122-08-30 12:32:00,,ELECTIVE,,HOME,Other,ENGLISH,,BLACK/AFRICAN AMERICAN,,,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
523735,17892964,20786062,2180-09-17 00:00:00,2180-09-18 13:37:00,,SURGICAL SAME DAY ADMISSION,PHYSICIAN REFERRAL,HOME,Medicare,ENGLISH,SINGLE,WHITE,,,0
523736,17137572,20943099,2147-08-01 17:41:00,2147-08-02 17:30:00,,EW EMER.,TRANSFER FROM HOSPITAL,HOME,Other,ENGLISH,DIVORCED,HISPANIC/LATINO,2147-07-31 23:55:00,2147-08-01 19:37:00,0
523737,19389857,23176714,2189-03-01 00:58:00,2189-03-02 15:22:00,,SURGICAL SAME DAY ADMISSION,PHYSICIAN REFERRAL,HOME,Other,ENGLISH,MARRIED,WHITE,,,0
523738,12298845,22347500,2138-05-31 00:00:00,2138-06-04 16:50:00,,SURGICAL SAME DAY ADMISSION,PHYSICIAN REFERRAL,HOME HEALTH CARE,Other,ENGLISH,MARRIED,WHITE,,,0


In [6]:
# Load Diagnosis file
diag_df = pd.read_csv(f'{mimic4_dir}/diagnoses_icd.csv.gz', dtype = {'icd_code': str})
diag_df

Unnamed: 0,subject_id,hadm_id,seq_num,icd_code,icd_version
0,15734973,20475282,3,2825,9
1,15734973,20475282,2,V0251,9
2,15734973,20475282,5,V270,9
3,15734973,20475282,1,64891,9
4,15734973,20475282,4,66481,9
...,...,...,...,...,...
5280346,13747041,25594844,6,R531,10
5280347,13747041,25594844,8,R0902,10
5280348,13747041,25594844,4,F1120,10
5280349,13747041,25594844,2,J189,10


<a name="sec2"></a>

## 2 Cohort Selection  [^](#outline)

### 2.A Patient Selection: Minimum of Two Visits

Patients with only one admission (i.e. a single timestamp for the diagnosis codes) are not useful for training, validation, or testing.

In [7]:
patient_admissions = defaultdict(set)

for row in admissions_df.itertuples():
    patient_admissions[row.subject_id].add(row.hadm_id)
    
patients_admissions_df = pd.DataFrame({
    'patient': patient_admissions.keys(), 
    'n_admissions': map(len, patient_admissions.values())
})


selected_patients_A = set(patients_admissions_df[patients_admissions_df.n_admissions > 1].patient.tolist())

len(selected_patients_A)

85798
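The same selection can probably be expressed without an explicit Python loop. The sketch below uses `groupby(...).nunique()` on a toy table; the IDs are made up for illustration, and the column names are assumed to match `admissions_df`.

```python
import pandas as pd

# Toy stand-in for admissions_df; the IDs are made up for illustration.
toy_admissions = pd.DataFrame({
    'subject_id': [1, 1, 2, 3, 3, 3],
    'hadm_id': [10, 11, 20, 30, 31, 32],
})

# Number of distinct admissions per patient.
n_admissions = toy_admissions.groupby('subject_id')['hadm_id'].nunique()

# Keep patients with at least two admissions (patient 2 has only one).
selected = set(n_admissions[n_admissions > 1].index)
```

On the full table this avoids iterating over every row with `itertuples`, which may matter for speed, but the resulting set of patients should be the same.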

Apply the filter:

In [8]:
admissions_A_df = admissions_df[admissions_df.subject_id.isin(selected_patients_A)].reset_index(drop=True)
diag_A_df =  diag_df[diag_df.hadm_id.isin(admissions_A_df.hadm_id)].reset_index(drop=True)
diag_A_df = diag_A_df[diag_A_df.icd_code.notnull()].reset_index(drop=True)
admissions_A_df.subject_id.nunique(), len(admissions_A_df), len(diag_A_df)

(85798, 352660, 4086854)

### 2.B Patient Selection: Maximum Hospital Stay is Two Weeks

In [9]:
# Normalize to midnight so that the stay length is counted in calendar days.
# (`infer_datetime_format` is deprecated since pandas 2.0, where format inference is the default.)
admit = pd.to_datetime(admissions_A_df['admittime']).dt.normalize()
disch = pd.to_datetime(admissions_A_df['dischtime']).dt.normalize()
admissions_A_df['days'] = (disch - admit).dt.days
admissions_A_df.head()

Unnamed: 0,subject_id,hadm_id,admittime,dischtime,deathtime,admission_type,admission_location,discharge_location,insurance,language,marital_status,ethnicity,edregtime,edouttime,hospital_expire_flag,days
0,10292548,26653546,2120-01-07 05:51:00,2120-01-12 13:45:00,,URGENT,PHYSICIAN REFERRAL,HOME,Other,ENGLISH,MARRIED,ASIAN,,,0,5
1,19120008,24459786,2185-09-18 11:15:00,2185-09-20 15:30:00,,SURGICAL SAME DAY ADMISSION,PHYSICIAN REFERRAL,HOME,Medicare,?,MARRIED,ASIAN,,,0,2
2,11735820,24560424,2151-10-24 20:32:00,2151-10-25 12:25:00,,EU OBSERVATION,EMERGENCY ROOM,,Medicaid,?,MARRIED,HISPANIC/LATINO,2151-10-24 13:45:00,2151-10-25 12:25:00,0,1
3,16261811,26233676,2145-12-08 18:41:00,2145-12-09 19:40:00,,EU OBSERVATION,EMERGENCY ROOM,,Medicare,ENGLISH,SINGLE,WHITE,2145-12-08 14:44:00,2145-12-08 19:48:00,0,1
4,12988422,25192155,2132-05-24 07:10:00,2132-05-24 13:50:00,,EU OBSERVATION,EMERGENCY ROOM,,Medicare,ENGLISH,SINGLE,WHITE,2132-05-23 22:09:00,2132-05-24 13:50:00,0,0


In [10]:
longest_admission = {}
for subject_id, subject_df in admissions_A_df.groupby('subject_id'):
    longest_admission[subject_id] = subject_df.days.max()
    
admissions_A_df['max_days'] = admissions_A_df.subject_id.map(longest_admission)
selected_patients_B = set(admissions_A_df[admissions_A_df.max_days <= 14].subject_id)

Apply the filter:

In [11]:
admissions_B_df = admissions_A_df[admissions_A_df.subject_id.isin(selected_patients_B)].reset_index(drop=True)
diag_B_df =  diag_A_df[diag_A_df.hadm_id.isin(admissions_B_df.hadm_id)].reset_index(drop=True)
diag_B_df = diag_B_df[diag_B_df.icd_code.notnull()].reset_index(drop=True)
admissions_B_df.subject_id.nunique(), len(admissions_B_df), len(diag_B_df)

(72625, 265637, 2755053)
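As a small illustration of the two-week rule, the sketch below computes the stay length in calendar days and each patient's longest stay with `groupby(...).transform('max')` instead of a dictionary. The toy rows are made up; only the column names follow the admissions table.

```python
import pandas as pd

# Toy admissions; the IDs and timestamps are made up for illustration.
toy = pd.DataFrame({
    'subject_id': [1, 1, 2, 2],
    'admittime': ['2120-01-07 05:51:00', '2121-03-01 10:00:00',
                  '2130-05-01 08:00:00', '2131-06-01 09:00:00'],
    'dischtime': ['2120-01-12 13:45:00', '2121-03-03 09:00:00',
                  '2130-05-20 08:00:00', '2131-06-02 09:00:00'],
})

# Drop the time of day, then count whole calendar days between the two dates.
admit = pd.to_datetime(toy['admittime']).dt.normalize()
disch = pd.to_datetime(toy['dischtime']).dt.normalize()
toy['days'] = (disch - admit).dt.days

# Longest stay per patient, broadcast back to every row of that patient.
toy['max_days'] = toy.groupby('subject_id')['days'].transform('max')

# A patient is dropped entirely if any one of their stays exceeds 14 days.
kept = toy[toy['max_days'] <= 14]
```

Here patient 2 has one 19-day stay, so both of that patient's admissions are removed, including the short one.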

<a name="sec3"></a>

## 3 Export  [^](#outline)

Select relevant columns from `admissions_B_df` and `diag_B_df` then write to disk.

In [12]:
# Take an explicit copy so that the column assignments below do not trigger SettingWithCopyWarning.
admissions_selected_df = admissions_B_df[['subject_id', 'hadm_id', 'admittime', 'dischtime']].copy()

In [13]:
# Keep only the date part of the admission and discharge timestamps.
admissions_selected_df['admittime'] = pd.to_datetime(admissions_selected_df['admittime']).dt.normalize()
admissions_selected_df['dischtime'] = pd.to_datetime(admissions_selected_df['dischtime']).dt.normalize()


In [14]:
admissions_selected_df.head()

Unnamed: 0,subject_id,hadm_id,admittime,dischtime
0,10292548,26653546,2120-01-07,2120-01-12
1,11735820,24560424,2151-10-24,2151-10-25
2,16261811,26233676,2145-12-08,2145-12-09
3,12988422,25192155,2132-05-24,2132-05-24
4,10945838,20090853,2166-05-29,2166-05-30


In [15]:
diag_selected_df = diag_B_df[['subject_id', 'hadm_id', 'icd_code', 'icd_version']]
diag_selected_df = diag_selected_df[diag_selected_df.icd_code.notnull()]

### Convert ICD10 to ICD9

MIMIC-IV uses both the ICD9 and ICD10 coding schemes, whereas MIMIC-III uses only ICD9.
In the current research, all codes are mapped to the CCS coding scheme.
The library supports mapping from ICD9 to CCS, while mapping from ICD10 is not yet supported.
For MIMIC-IV, the following cells therefore first convert all ICD10 codes to ICD9, so that the same pipeline can then be applied to both MIMIC-III and MIMIC-IV.
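Because one ICD10 code may correspond to several ICD9 codes, the conversion is a 1:N expansion: each ICD10 diagnosis row becomes one row per mapped ICD9 code, and codes without a mapping are dropped. The sketch below illustrates the idea with a `merge` on toy tables. The codes are placeholders, not real ICD entries, and this is only one possible way to express the expansion.

```python
import pandas as pd

# Toy conversion table with placeholder codes (not real ICD entries).
# Code 'A' maps to two targets, code 'B' to one.
toy_conv = pd.DataFrame({
    'icd10cm': ['A', 'A', 'B'],
    'icd9cm': ['x1', 'x2', 'y1'],
}).drop_duplicates()

# Toy diagnoses; code 'C' has no entry in the conversion table.
toy_diag = pd.DataFrame({
    'hadm_id': [100, 100, 101],
    'icd_code': ['A', 'C', 'B'],
})

# An inner join yields one output row per (diagnosis, mapped code) pair
# and silently drops diagnoses that have no mapping.
converted = (
    toy_diag
    .merge(toy_conv, left_on='icd_code', right_on='icd10cm', how='inner')
    [['hadm_id', 'icd9cm']]
    .rename(columns={'icd9cm': 'icd_code'})
    .reset_index(drop=True)
)
```

Admission 100 ends up with two converted codes for `A` and loses `C`, which is worth keeping in mind when comparing diagnosis counts before and after the conversion.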


In [16]:
icd_conv = pd.read_csv('../icenode/ehr/resources/2018_gem_pcs_I10I9.txt.gz', dtype = str)
icd_conv.head()

Unnamed: 0,icd10cm,icd9cm,flags,approximate,no_map,combination,scenario,choice_list
0,16070,231,10000,1,0,0,0,0
1,16071,231,10000,1,0,0,0,0
2,16072,232,10000,1,0,0,0,0
3,16073,232,10000,1,0,0,0,0
4,16074,233,10000,1,0,0,0,0


In [17]:
# Conversion dictionary (1:N map)
icd_conv_dict = defaultdict(set)

# All columns were read as strings (dtype=str), so compare `no_map` against '0', not 0.
for row in icd_conv[icd_conv.no_map == '0'].itertuples():
    icd_conv_dict[row.icd10cm].add(row.icd9cm)
    
diagnoses_icd_10 = diag_selected_df[diag_selected_df.icd_version == 10]

diagnoses_icd9_converted = {'subject_id': [],
                            'hadm_id': [],
                            'icd_code': [],
                            'icd_version': []}

for row in diagnoses_icd_10.itertuples():
    for icd9 in icd_conv_dict.get(row.icd_code, {}):
        diagnoses_icd9_converted['subject_id'].append(row.subject_id)
        diagnoses_icd9_converted['hadm_id'].append(row.hadm_id)
        diagnoses_icd9_converted['icd_code'].append(icd9)
        diagnoses_icd9_converted['icd_version'].append(9)
diagnoses_icd9_converted = pd.DataFrame(diagnoses_icd9_converted)

# The original rows with ICD9
diagnoses_icd9 = diag_selected_df[diag_selected_df.icd_version == 9]



# Merge the original ICD9 rows with the converted ones
# (`DataFrame.append` was removed in pandas 2.0; `pd.concat` is the replacement).
diagnoses_icd9 = pd.concat([diagnoses_icd9, diagnoses_icd9_converted])

# Remove the column 'icd_version' (everything is version 9 now)
diagnoses_icd9 = diagnoses_icd9[['subject_id', 'hadm_id', 'icd_code']]

# Capitalize all column names (to follow the MIMIC-III convention and to be compatible with the library)
diagnoses_icd9.columns = ['SUBJECT_ID', 'HADM_ID', 'ICD9_CODE']
admissions_selected_df.columns = ['SUBJECT_ID', 'HADM_ID', 'ADMITTIME', 'DISCHTIME']

In [18]:
diagnoses_icd9.head()

Unnamed: 0,SUBJECT_ID,HADM_ID,ICD9_CODE
0,15734973,20475282,2825
1,15734973,20475282,V0251
2,15734973,20475282,V270
3,15734973,20475282,64891
4,15734973,20475282,66481


In [21]:
admissions_selected_df.to_csv(f'{mimic4_dir}/adm_df.csv.gz', compression='gzip', index=False)
diagnoses_icd9.to_csv(f'{mimic4_dir}/dx_df.csv.gz', compression='gzip', index=False)
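When reading the exported files back, it is probably worth loading the ICD9 codes as strings, as was done for `icd_code` above. The sketch below shows why on a toy file written to a temporary directory: with default type inference, an all-numeric code column is parsed as integers and leading zeros are lost. The toy rows are made up, and how the downstream library loads these files is not shown here.

```python
import os
import tempfile

import pandas as pd

# Toy diagnosis table; the IDs are made up and both codes are purely numeric.
toy_dx = pd.DataFrame({
    'SUBJECT_ID': [1, 1],
    'HADM_ID': [10, 10],
    'ICD9_CODE': ['0389', '2825'],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'dx_df.csv.gz')
    toy_dx.to_csv(path, compression='gzip', index=False)

    # Default inference parses the codes as integers: '0389' becomes 389.
    as_default = pd.read_csv(path)

    # Forcing the column to str keeps the codes exactly as written.
    as_text = pd.read_csv(path, dtype={'ICD9_CODE': str})
```

The same `dtype` argument can be passed when loading `dx_df.csv.gz` from `mimic4_dir`.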