<div style="text-align: center; font-weight: bold;">
    <h1>Pipeline for Research ready EHR Datasets</h1>
    <h2>Part 2: Cleaning, Normalizing and Rolling up the EHR Data</h2>
    <h4>Author: Vidul Ayakulangara Panickan</h4>
</div>



## Step 4: Cleaning!

MIMIC data has already been processed for data analysis, so it is not a typical dataset you might encounter in a hospital system setting. Generally, EHR data is stored in databases and each observation is timestamped. In MIMIC-IV, however, certain data comes with only the hospital admission identifier, and we have to get the admission time from a different table; the diagnoses_icd data is one example. Lab, procedure and microbiology events, by contrast, have timestamps included.

In [1]:
import pandas as pd

diagnoses_icd_file = "/n/data1/hsph/biostat/celehs/lab/va67/MIMIC/mimic-iv-codified-3.1/files/mimiciv/3.1/hosp/diagnoses_icd.csv"
admissions_file ="/n/data1/hsph/biostat/celehs/lab/va67/MIMIC/mimic-iv-codified-3.1/files/mimiciv/3.1/hosp/admissions.csv"

diagnoses_icd = pd.read_csv(diagnoses_icd_file,dtype=str)

admissions = pd.read_csv(admissions_file,dtype=str)

print(diagnoses_icd.columns)
print(admissions.columns)

Index(['subject_id', 'hadm_id', 'seq_num', 'icd_code', 'icd_version'], dtype='object')
Index(['subject_id', 'hadm_id', 'admittime', 'dischtime', 'deathtime',
       'admission_type', 'admit_provider_id', 'admission_location',
       'discharge_location', 'insurance', 'language', 'marital_status', 'race',
       'edregtime', 'edouttime', 'hospital_expire_flag'],
      dtype='object')


In [2]:
# We only need subject_id, hadm_id and admittime from the admissions table for this operation

diagnoses_icd = pd.merge(diagnoses_icd,admissions[['subject_id','hadm_id','admittime']], how='left', on=['subject_id','hadm_id'])
diagnoses_icd

Unnamed: 0,subject_id,hadm_id,seq_num,icd_code,icd_version,admittime
0,10000032,22595853,1,5723,9,2180-05-06 22:23:00
1,10000032,22595853,2,78959,9,2180-05-06 22:23:00
2,10000032,22595853,3,5715,9,2180-05-06 22:23:00
3,10000032,22595853,4,07070,9,2180-05-06 22:23:00
4,10000032,22595853,5,496,9,2180-05-06 22:23:00
...,...,...,...,...,...,...
6364483,19999987,23865745,7,41401,9,2145-11-02 21:38:00
6364484,19999987,23865745,8,78039,9,2145-11-02 21:38:00
6364485,19999987,23865745,9,0413,9,2145-11-02 21:38:00
6364486,19999987,23865745,10,36846,9,2145-11-02 21:38:00
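
A left merge silently keeps diagnosis rows that have no matching admission and silently multiplies rows if an admission key appears more than once in the right-hand table. The following is a minimal sketch, on made-up toy tables rather than the MIMIC files, of how the `validate` and `indicator` parameters of `pd.merge` could be used to check both conditions. The names `toy_diagnoses`, `toy_admissions` and `checked` are illustrative only.

```python
import pandas as pd

# Toy stand-ins for diagnoses_icd and admissions (all values are made up)
toy_diagnoses = pd.DataFrame({
    "subject_id": ["1", "1", "2"],
    "hadm_id": ["100", "100", "200"],
    "icd_code": ["5723", "496", "41401"],
})
toy_admissions = pd.DataFrame({
    "subject_id": ["1"],
    "hadm_id": ["100"],
    "admittime": ["2180-05-06 22:23:00"],
})

# validate="many_to_one" raises pandas.errors.MergeError if a
# (subject_id, hadm_id) key is duplicated in the right-hand table.
# indicator=True adds a "_merge" column telling where each row came from.
checked = pd.merge(
    toy_diagnoses,
    toy_admissions,
    how="left",
    on=["subject_id", "hadm_id"],
    validate="many_to_one",
    indicator=True,
)

# Rows marked "left_only" found no admission and therefore have no admittime
n_unmatched = int((checked["_merge"] == "left_only").sum())
print("Diagnosis rows without a matching admission:", n_unmatched)
```

If the check passes, the `_merge` column can be dropped before continuing with the pipeline.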


In [3]:
# Check for records without any admission time

print("Number of records with no hospital admission time:", diagnoses_icd['admittime'].isna().sum())

# If you have records without dates, remove them
diagnoses_icd = diagnoses_icd.dropna(subset=['admittime'])

diagnoses_icd

Number of records with no hospital admission time: 0


Unnamed: 0,subject_id,hadm_id,seq_num,icd_code,icd_version,admittime
0,10000032,22595853,1,5723,9,2180-05-06 22:23:00
1,10000032,22595853,2,78959,9,2180-05-06 22:23:00
2,10000032,22595853,3,5715,9,2180-05-06 22:23:00
3,10000032,22595853,4,07070,9,2180-05-06 22:23:00
4,10000032,22595853,5,496,9,2180-05-06 22:23:00
...,...,...,...,...,...,...
6364483,19999987,23865745,7,41401,9,2145-11-02 21:38:00
6364484,19999987,23865745,8,78039,9,2145-11-02 21:38:00
6364485,19999987,23865745,9,0413,9,2145-11-02 21:38:00
6364486,19999987,23865745,10,36846,9,2145-11-02 21:38:00


In [4]:
# For a typical analysis the time component is not needed; keep it if your analysis requires it

diagnoses_icd['admittime'] = diagnoses_icd['admittime'].str[:10]

diagnoses_icd

Unnamed: 0,subject_id,hadm_id,seq_num,icd_code,icd_version,admittime
0,10000032,22595853,1,5723,9,2180-05-06
1,10000032,22595853,2,78959,9,2180-05-06
2,10000032,22595853,3,5715,9,2180-05-06
3,10000032,22595853,4,07070,9,2180-05-06
4,10000032,22595853,5,496,9,2180-05-06
...,...,...,...,...,...,...
6364483,19999987,23865745,7,41401,9,2145-11-02
6364484,19999987,23865745,8,78039,9,2145-11-02
6364485,19999987,23865745,9,0413,9,2145-11-02
6364486,19999987,23865745,10,36846,9,2145-11-02


In [5]:
# For diagnosis data, we typically keep subject_id, icd_code, icd_version and date. You can retain other columns as needed
diagnoses_icd = diagnoses_icd[['subject_id', 'icd_code', 'icd_version', 'admittime']]

# Further, we rename admittime to date
diagnoses_icd = diagnoses_icd.rename(columns={'admittime': 'date'})

# Display the updated DataFrame
print(diagnoses_icd)

        subject_id icd_code icd_version        date
0         10000032     5723           9  2180-05-06
1         10000032    78959           9  2180-05-06
2         10000032     5715           9  2180-05-06
3         10000032    07070           9  2180-05-06
4         10000032      496           9  2180-05-06
...            ...      ...         ...         ...
6364483   19999987    41401           9  2145-11-02
6364484   19999987    78039           9  2145-11-02
6364485   19999987     0413           9  2145-11-02
6364486   19999987    36846           9  2145-11-02
6364487   19999987     7810           9  2145-11-02

[6364488 rows x 4 columns]


#### Establish date range for the data (Not applicable to MIMIC-IV)
MIMIC-IV has adjusted dates, but in real-world datasets it is vital to ensure the dates make sense. We remove records whose dates fall outside a plausible range, e.g. records dated before 1980 or after the present year. The code for that is provided below; for MIMIC-IV, since the dates are adjusted, we cannot perform this cleaning operation.

In [None]:
# diagnoses_icd = diagnoses_icd[(diagnoses_icd['date'].str[:4].astype(int) >= 1980) & (diagnoses_icd['date'].str[:4].astype(int) <= 2024)]
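
The commented-out filter above converts the first four characters of each date to an integer, which fails if a date is malformed. As a hedged alternative, the sketch below parses the dates first and treats anything unparseable as out of range. It runs on a made-up toy table; the bounds `min_year` and `max_year` and the variable names are assumptions chosen for illustration.

```python
import pandas as pd

# Toy data: one date too early, one plausible, one in the future, one malformed
toy = pd.DataFrame({
    "subject_id": ["1", "2", "3", "4"],
    "date": ["1975-03-01", "2010-07-15", "2150-01-01", "not a date"],
})

# errors="coerce" turns values that cannot be parsed into NaT instead of raising
parsed = pd.to_datetime(toy["date"], format="%Y-%m-%d", errors="coerce")

min_year = 1980                        # assumed lower bound
max_year = pd.Timestamp.today().year   # present year as upper bound

# between() is inclusive on both ends; NaT rows evaluate to False and are dropped
in_range = parsed.dt.year.between(min_year, max_year)
cleaned = toy[in_range]

print("Rows removed:", len(toy) - len(cleaned))
print(cleaned)
```

Counting the removed rows before discarding them makes it easier to notice when a date column is systematically broken rather than occasionally wrong.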

In [6]:
# Check for empty cells and duplicated rows in your data

# Check for empty cells
if diagnoses_icd.isnull().values.any():
    print("Empty cells found. Removing rows with empty cells...")
    diagnoses_icd = diagnoses_icd.dropna()  # Drop rows with any null values
    print("DataFrame after removing empty cells:")
    print(diagnoses_icd)
else:
    print("No empty cells found.")

# Check for duplicates
if diagnoses_icd.duplicated().sum() > 0:
    print("Duplicate rows found. Removing duplicates...")
    diagnoses_icd = diagnoses_icd.drop_duplicates()  # Remove duplicate rows
    print("DataFrame after removing duplicates:")
    print(diagnoses_icd)
else:
    print("No duplicate rows found.")

No empty cells found.
Duplicate rows found. Removing duplicates...
DataFrame after removing duplicates:
        subject_id icd_code icd_version        date
0         10000032     5723           9  2180-05-06
1         10000032    78959           9  2180-05-06
2         10000032     5715           9  2180-05-06
3         10000032    07070           9  2180-05-06
4         10000032      496           9  2180-05-06
...            ...      ...         ...         ...
6364483   19999987    41401           9  2145-11-02
6364484   19999987    78039           9  2145-11-02
6364485   19999987     0413           9  2145-11-02
6364486   19999987    36846           9  2145-11-02
6364487   19999987     7810           9  2145-11-02

[6356481 rows x 4 columns]


## Step 5: Normalization

Normalization is typically done for laboratory observations. We need to ensure the observed values are on the same scale; sometimes the same lab is recorded with multiple units, so it is important to normalize the values so that they are suitable for analysis.
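
As a minimal sketch of what unit normalization could look like, the example below converts a made-up hemoglobin table recorded in both g/dL and g/L to a single unit (1 g/L = 0.1 g/dL). The table, the column names and the conversion dictionary `to_g_per_dl` are assumptions for illustration; a real pipeline would need a curated conversion table for each lab.

```python
import pandas as pd

# Toy lab table (values are made up); the same lab appears with two units
toy_labs = pd.DataFrame({
    "subject_id": ["1", "2", "3"],
    "lab": ["hemoglobin", "hemoglobin", "hemoglobin"],
    "value": [13.5, 140.0, 12.0],
    "unit": ["g/dL", "g/L", "g/dL"],
})

# Multiplicative factor that converts each known unit to the target unit g/dL
to_g_per_dl = {"g/dL": 1.0, "g/L": 0.1}

# Units missing from the dictionary map to NaN, so they are easy to spot
factor = toy_labs["unit"].map(to_g_per_dl)
unknown_units = toy_labs.loc[factor.isna(), "unit"].unique()

toy_labs["value_normalized"] = toy_labs["value"] * factor
toy_labs["unit_normalized"] = "g/dL"

print("Units without a conversion factor:", list(unknown_units))
print(toy_labs)
```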

For diagnoses, medications and procedures, the observed values are categorical, so we don't have to perform normalization. However, to enhance data quality and usability, we will map data recorded under different coding systems to a common parent code through a process called rolling up.

## Step 6: Rolling up Data

For data to be compared and analysed, it needs to be standardized. For example, EHR codes of the same type, say medications, can come from different coding systems such as NDC, RxNorm, or codes local to an institution. Rolling up data from different coding systems to a common parent coding system ensures standardization.

There are further reasons to perform rollup:

1) The raw EHR codes are often too specific to be feasible for analysis
2) Rolling up helps to harmonize data across different institutions and perform analysis at a larger scale

In [1]:
import os

os.makedirs("Rollup_Mappings", exist_ok=True)
os.makedirs("Intermediate_Data", exist_ok=True)
os.makedirs("Rolledup_Data", exist_ok=True)

## Creating Rollup Dictionary

To roll up EHR codes to a parent-level code, we need to decide which coding system we will be rolling up to:

- Diagnoses: we will be rolling up ICD and other diagnosis codes to PheCodes
- Medication: we will be rolling up standard codes like RxNorm and NDC, and local medication codes, to RxNorm ingredient-level codes
- Lab: we will be rolling up local lab codes and LOINC codes to LOINC components
- Procedures: we will be rolling up procedure codes like ICD-PCS and CPT-4 codes to CCS codes

Creating these rollup dictionaries requires a lot of manual processing and quality checks to ensure the resulting mapping dictionary is accurate.
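
Some of those quality checks can be automated. The sketch below, on a made-up mapping table, removes incomplete and duplicated rows and then looks for child codes that map to more than one parent. Duplicate or ambiguous rows in a mapping matter because a left merge against it would multiply the patient records. The table contents and variable names are illustrative assumptions, not a real dictionary.

```python
import pandas as pd

# Toy mapping (illustrative rows only): one exact duplicate, one missing child code
toy_mapping = pd.DataFrame({
    "icd_code": ["5715", "5723", "5723", "496", None],
    "icd_version": ["9", "9", "9", "9", "9"],
    "PheCode": ["571.51", "571.81", "571.81", "496", "008"],
})

# 1) Drop rows with a missing child or parent code, then exact duplicates
clean_mapping = toy_mapping.dropna().drop_duplicates()

# 2) Count distinct parents per child code; more than one means the
#    mapping is ambiguous and needs a manual decision
parents_per_child = (
    clean_mapping.groupby(["icd_code", "icd_version"])["PheCode"].nunique()
)
ambiguous = parents_per_child[parents_per_child > 1]

print("Rows kept:", len(clean_mapping))
print("Child codes with more than one parent:", len(ambiguous))
```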

In [23]:
# Write code to create rollup dictionaries



### Diagnoses Data
ICD codes are too detailed to be used directly for many research purposes. PheCodes solve this problem by grouping related ICD codes into clinically meaningful phenotypes.
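
The ICD codes in MIMIC-IV are stored without a decimal point (for example `5723`), so a mapping source that writes them with a dot would not join. The sketch below shows one way such a source could be reshaped into the `icd_code`, `PheCode`, `icd_version` layout used in this notebook. The input table `raw_map`, its column names `ICD`, `Flag` and `Phecode`, and the assumption that it uses dotted codes are all hypothetical; check the actual file you download.

```python
import pandas as pd

# Hypothetical raw mapping with dotted ICD codes (rows are illustrative only)
raw_map = pd.DataFrame({
    "ICD": ["571.5", "572.3", "496"],
    "Flag": ["9", "9", "9"],          # assumed to hold the ICD version
    "Phecode": ["571.51", "571.81", "496"],
})

# Strip the dot and surrounding whitespace so codes match the MIMIC-IV format,
# and rename the columns to the names used in the rest of this notebook
icd_to_phecode_sketch = pd.DataFrame({
    "icd_code": raw_map["ICD"].str.replace(".", "", regex=False).str.strip(),
    "PheCode": raw_map["Phecode"],
    "icd_version": raw_map["Flag"],
}).drop_duplicates()

print(icd_to_phecode_sketch)
```

A table in this shape could then be written with `to_csv(..., index=False)` and read back with `dtype=str`, which preserves leading zeros in codes such as `07070`.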



In [None]:
# Write code to create the icd-phecode dictionary and save it to the rollup file




In [7]:
icd_to_phecode_file = "./Rollup_Mappings/icd_to_phecode.csv"

icd_to_phecode = pd.read_csv(icd_to_phecode_file, dtype=str)

icd_to_phecode 

Unnamed: 0,icd_code,PheCode,icd_version
0,001,008,9
1,0010,008,9
2,0011,008,9
3,0019,008,9
4,002,008,9
...,...,...,...
98544,T524X3D,981,10
98545,T532X4D,981,10
98546,T533X4S,981,10
98547,T521X3S,981,10


In [10]:
# Rolling up icd codes to PheCodes

diagnoses_phecode = pd.merge(diagnoses_icd,icd_to_phecode, how='left', on=['icd_code','icd_version'])
diagnoses_phecode

Unnamed: 0,subject_id,icd_code,icd_version,date,PheCode
0,10000032,5723,9,2180-05-06,571.81
1,10000032,78959,9,2180-05-06,572
2,10000032,5715,9,2180-05-06,571.51
3,10000032,07070,9,2180-05-06,070.3
4,10000032,496,9,2180-05-06,496
...,...,...,...,...,...
6356476,19999987,41401,9,2145-11-02,411.4
6356477,19999987,78039,9,2145-11-02,345.3
6356478,19999987,0413,9,2145-11-02,041
6356479,19999987,36846,9,2145-11-02,368.4


In [11]:
# Flag each row as mapped ('1') or unmapped ('0') so the full table can be saved as intermediate data. In future, if you make
# the rollup mapping more comprehensive or want to look at the codes that are unmapped, you can come back to it.

diagnoses_phecode['Rollup_Status'] = diagnoses_phecode['PheCode'].notna().replace({True: '1', False: '0'})

diagnoses_phecode

Unnamed: 0,subject_id,icd_code,icd_version,date,PheCode,Rollup_Status
0,10000032,5723,9,2180-05-06,571.81,1
1,10000032,78959,9,2180-05-06,572,1
2,10000032,5715,9,2180-05-06,571.51,1
3,10000032,07070,9,2180-05-06,070.3,1
4,10000032,496,9,2180-05-06,496,1
...,...,...,...,...,...,...
6356476,19999987,41401,9,2145-11-02,411.4,1
6356477,19999987,78039,9,2145-11-02,345.3,1
6356478,19999987,0413,9,2145-11-02,041,1
6356479,19999987,36846,9,2145-11-02,368.4,1


In [12]:
# Examine the unmapped rows

diagnoses_phecode_unmapped = diagnoses_phecode[diagnoses_phecode["Rollup_Status"]=="0"]
diagnoses_phecode_unmapped

Unnamed: 0,subject_id,icd_code,icd_version,date,PheCode,Rollup_Status
34,10000032,V4986,9,2180-07-23,,0
64,10000117,W010XXA,10,2183-09-18,,0
65,10000117,Y93K1,10,2183-09-18,,0
66,10000117,Y92480,10,2183-09-18,,0
77,10000161,R519,10,2163-08-20,,0
...,...,...,...,...,...,...
6356392,19999784,Y92239,10,2119-10-17,,0
6356410,19999828,T8141XA,10,2149-01-08,,0
6356417,19999828,Y929,10,2149-01-08,,0
6356435,19999828,Y92018,10,2147-07-18,,0


In [13]:
# Summarize the codes that have not been rolled up: count the distinct patients per unmapped code
unique_subject_icd_pairs = diagnoses_phecode_unmapped[['subject_id', 'icd_code', 'icd_version']].drop_duplicates()

icdcode_frequencies = unique_subject_icd_pairs[['icd_code', 'icd_version']].value_counts().reset_index(name='counts')

sorted_icdcode_frequencies = icdcode_frequencies.sort_values(by='counts', ascending=False)

sorted_icdcode_frequencies.head(10)

Unnamed: 0,icd_code,icd_version,counts
0,Z20822,10,23629
1,Y929,10,15945
2,Y92230,10,9019
3,V270,9,8375
4,V4986,9,8006
5,Y92239,10,7539
6,Y92009,10,7023
7,E8497,9,6867
8,E8490,9,5832
9,E8788,9,5377
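
Whether a rollup is "good enough" is easier to judge with a few coverage numbers next to the list of unmapped codes. The sketch below computes the share of rows, distinct codes and patients that have at least one mapped code, using a made-up table that follows the `Rollup_Status` convention of this notebook. The variable names and the choice of these three metrics are assumptions for illustration.

```python
import pandas as pd

# Toy rolled-up table (values are made up): two mapped rows, two unmapped rows
toy_rolled = pd.DataFrame({
    "subject_id": ["1", "1", "2", "2"],
    "icd_code": ["5723", "V4986", "496", "Y929"],
    "PheCode": ["571.81", None, "496", None],
})
toy_rolled["Rollup_Status"] = toy_rolled["PheCode"].notna().map({True: "1", False: "0"})

mapped = toy_rolled["Rollup_Status"] == "1"

# Share of rows that received a parent code
row_coverage = mapped.mean()

# Share of distinct raw codes that received a parent code
code_coverage = (
    toy_rolled.loc[mapped, "icd_code"].nunique() / toy_rolled["icd_code"].nunique()
)

# Share of patients with at least one mapped code
patient_coverage = (
    toy_rolled.loc[mapped, "subject_id"].nunique() / toy_rolled["subject_id"].nunique()
)

print(f"rows: {row_coverage:.1%}, codes: {code_coverage:.1%}, patients: {patient_coverage:.1%}")
```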


Once the data looks reasonable, with good enough rollup coverage, you can save the data.

In [15]:
diagnoses_phecode.to_csv("./Intermediate_Data/diagnoses_phecode_comprehensive.csv", index=None)

We don't really need all the columns after rollup is performed. Below we just keep the data we need.

In [64]:
print(diagnoses_phecode.columns)
diagnoses_phecode_filtered = diagnoses_phecode[diagnoses_phecode['Rollup_Status']=="1"]
print (diagnoses_phecode_filtered)
diagnoses_phecode_filtered = diagnoses_phecode_filtered[['subject_id','PheCode','date']]
diagnoses_phecode_filtered


In [23]:
if diagnoses_phecode_filtered.duplicated().sum() > 0:
    print("Duplicate rows found. Removing duplicates...")
    diagnoses_phecode_filtered = diagnoses_phecode_filtered.drop_duplicates()  # Remove duplicate rows
    print("DataFrame after removing duplicates:")
else:
    print("No duplicate rows found.")

diagnoses_phecode_filtered

No duplicate rows found.


Unnamed: 0,subject_id,PheCode,date
0,10000032,571.81,2180-05-06
1,10000032,572,2180-05-06
2,10000032,571.51,2180-05-06
3,10000032,070.3,2180-05-06
4,10000032,496,2180-05-06
...,...,...,...
6356476,19999987,411.4,2145-11-02
6356477,19999987,345.3,2145-11-02
6356478,19999987,041,2145-11-02
6356479,19999987,368.4,2145-11-02


In [25]:
# Once cleaned, we can save the rollup data
diagnoses_phecode_filtered.to_csv("./Rolledup_Data/diagnoses_phecode_rolled.csv", index=None)

In [None]:
## Defining functions we can reuse for other types of data

In [66]:
def rollup(raw_level_data, mapping_data, join_columns, parent_column):
    """Left-join raw codes to their parent codes and flag each row as mapped ('1') or unmapped ('0')."""
    rolledup_data = pd.merge(raw_level_data, mapping_data, how='left', on=join_columns)

    # A row counts as mapped when the parent code is present after the join
    rolledup_data['Rollup_Status'] = rolledup_data[parent_column].notna().replace({True: '1', False: '0'})

    return rolledup_data


def summarize_unmapped(rolledup_data, child_column):
    """Count, for each unmapped child code, the number of distinct patients who have it."""
    rolledup_data_unmapped = rolledup_data[rolledup_data["Rollup_Status"] == "0"]

    # Count each patient at most once per code
    unique_patient_code_pairs = rolledup_data_unmapped[['subject_id', child_column]].drop_duplicates()

    unmapped_code_frequencies = unique_patient_code_pairs[[child_column]].value_counts().reset_index(name='counts')

    return unmapped_code_frequencies.sort_values(by='counts', ascending=False)


def filter_rolledup_data(rolledup_data, parent_column):
    """Keep only the mapped rows and the subject_id, parent code and date columns."""
    filtered = rolledup_data[rolledup_data['Rollup_Status'] == "1"]

    # Assumes the input has 'subject_id' and 'date' columns
    filtered = filtered[['subject_id', parent_column, 'date']]

    return filtered
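
Note that `summarize_unmapped` groups by a single code column, so an ICD-9 and an ICD-10 code that happen to share the same string would be counted together. The sketch below is a hypothetical variant, `summarize_unmapped_by`, that accepts several code columns; it is not part of the original pipeline and is shown on made-up data.

```python
import pandas as pd


def summarize_unmapped_by(rolledup_data, code_columns, patient_column="subject_id"):
    """Count distinct patients per unmapped code, grouping by one or more columns."""
    code_columns = list(code_columns)
    unmapped = rolledup_data[rolledup_data["Rollup_Status"] == "0"]

    # Count each patient at most once per code combination
    unique_pairs = unmapped[[patient_column] + code_columns].drop_duplicates()

    return (
        unique_pairs.groupby(code_columns)
        .size()
        .reset_index(name="counts")
        .sort_values("counts", ascending=False, ignore_index=True)
    )


# Toy rolled-up table (values are made up)
toy = pd.DataFrame({
    "subject_id": ["1", "1", "2", "3", "3"],
    "icd_code": ["Y929", "Y929", "Y929", "V270", "5723"],
    "icd_version": ["10", "10", "10", "9", "9"],
    "Rollup_Status": ["0", "0", "0", "0", "1"],
})

summary = summarize_unmapped_by(toy, ["icd_code", "icd_version"])
print(summary)
```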

In [29]:
join_columns = ['icd_code', 'icd_version']
res = rollup(diagnoses_icd, icd_to_phecode, join_columns, "PheCode")
print(res)
print(summarize_unmapped(res, 'icd_code'))

        subject_id icd_code icd_version        date PheCode Rollup_Status
0         10000032     5723           9  2180-05-06  571.81             1
1         10000032    78959           9  2180-05-06     572             1
2         10000032     5715           9  2180-05-06  571.51             1
3         10000032    07070           9  2180-05-06   070.3             1
4         10000032      496           9  2180-05-06     496             1
...            ...      ...         ...         ...     ...           ...
6356476   19999987    41401           9  2145-11-02   411.4             1
6356477   19999987    78039           9  2145-11-02   345.3             1
6356478   19999987     0413           9  2145-11-02     041             1
6356479   19999987    36846           9  2145-11-02   368.4             1
6356480   19999987     7810           9  2145-11-02   350.1             1

[6356481 rows x 6 columns]
     icd_code  counts
0      Z20822   23629
1        Y929   15945
2      Y92230    9

### Procedures Data

In MIMIC, procedure data comes from two sources: hcpcsevents.csv and procedures_icd.csv.
In hcpcsevents, procedures are recorded as CPT codes, and in procedures_icd, procedures are recorded as ICD-9/ICD-10 procedure codes.
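
Because both sources are rolled up to the same parent system, their filtered outputs share the columns `subject_id`, `ccs` and `date` and can be stacked into one procedures table. The sketch below illustrates that final step on made-up rows; the frames `ccs_from_hcpcs` and `ccs_from_icd` stand in for the two filtered tables, and the values are illustrative only.

```python
import pandas as pd

# Toy stand-ins for the two filtered, rolled-up procedure tables
ccs_from_hcpcs = pd.DataFrame({
    "subject_id": ["10000068", "10000117"],
    "ccs": ["227", "70"],
    "date": ["2160-03-04", "2181-11-15"],
})
ccs_from_icd = pd.DataFrame({
    "subject_id": ["10000117", "10000200"],
    "ccs": ["70", "54"],
    "date": ["2181-11-15", "2170-01-02"],
})

# Stack both sources, then drop rows that describe the same patient,
# parent code and date in both sources
procedures_ccs = (
    pd.concat([ccs_from_hcpcs, ccs_from_icd], ignore_index=True)
    .drop_duplicates()
    .reset_index(drop=True)
)

print(procedures_ccs)
```

Whether identical rows from the two sources should be collapsed depends on the analysis; keeping a source column instead of dropping duplicates is a reasonable alternative.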

In [43]:
import pandas as pd

hcpcenvents_file = "/n/data1/hsph/biostat/celehs/lab/va67/MIMIC/mimic-iv-codified-3.1/files/mimiciv/3.1/hosp/hcpcsevents.csv"

hcpcenvents = pd.read_csv(hcpcenvents_file, dtype=str)

hcpcenvents

Unnamed: 0,subject_id,hadm_id,chartdate,hcpcs_cd,seq_num,short_description
0,10000068,25022803,2160-03-04,99218,1,Hospital observation services
1,10000084,29888819,2160-12-28,G0378,1,Hospital observation per hr
2,10000108,27250926,2163-09-27,99219,1,Hospital observation services
3,10000117,22927623,2181-11-15,43239,1,Digestive system
4,10000117,22927623,2181-11-15,G0378,2,Hospital observation per hr
...,...,...,...,...,...,...
186069,19999379,26008899,2174-11-04,G0378,1,Hospital observation per hr
186070,19999466,21397174,2116-08-30,G0378,1,Hospital observation per hr
186071,19999733,27674281,2152-07-09,99219,1,Hospital observation services
186072,19999784,24935234,2119-07-09,99219,1,Hospital observation services


In [68]:
hcpcs_to_ccs_file = "./Rollup_Mappings/cpt2ccs_rollup.csv"

hcpcs_to_ccs= pd.read_csv(hcpcs_to_ccs_file, dtype=str)

hcpcs_to_ccs.columns =['hcpcs_cd','ccs']

hcpcs_to_ccs.head()

Unnamed: 0,hcpcs_cd,ccs
0,61000,1
1,61001,1
2,61020,1
3,61026,1
4,61050,1


In [69]:
join_columns=['hcpcs_cd']

hcpcs_procedures_ccs= rollup(hcpcenvents ,hcpcs_to_ccs, join_columns ,"ccs")

hcpcs_procedures_ccs.to_csv("./Rolledup_Data/hcpcs_procedures_ccs.csv", index=None)

print(hcpcs_procedures_ccs)

# Rename chartdate to date so the table has the column names expected by filter_rolledup_data
hcpcs_procedures_ccs = hcpcs_procedures_ccs.rename(columns={'chartdate': 'date'})

hcpcs_procedures_ccs.head()

       subject_id   hadm_id   chartdate hcpcs_cd seq_num  \
0        10000068  25022803  2160-03-04    99218       1   
1        10000084  29888819  2160-12-28    G0378       1   
2        10000108  27250926  2163-09-27    99219       1   
3        10000117  22927623  2181-11-15    43239       1   
4        10000117  22927623  2181-11-15    G0378       2   
...           ...       ...         ...      ...     ...   
186069   19999379  26008899  2174-11-04    G0378       1   
186070   19999466  21397174  2116-08-30    G0378       1   
186071   19999733  27674281  2152-07-09    99219       1   
186072   19999784  24935234  2119-07-09    99219       1   
186073   19999784  24935234  2119-07-10    62270       2   

                    short_description  ccs Rollup_Status  
0       Hospital observation services  227             1  
1         Hospital observation per hr  227             1  
2       Hospital observation services  227             1  
3                    Digestive system   70 

Unnamed: 0,subject_id,hadm_id,date,hcpcs_cd,seq_num,short_description,ccs,Rollup_Status
0,10000068,25022803,2160-03-04,99218,1,Hospital observation services,227,1
1,10000084,29888819,2160-12-28,G0378,1,Hospital observation per hr,227,1
2,10000108,27250926,2163-09-27,99219,1,Hospital observation services,227,1
3,10000117,22927623,2181-11-15,43239,1,Digestive system,70,1
4,10000117,22927623,2181-11-15,G0378,2,Hospital observation per hr,227,1


In [70]:
summarize_unmapped(hcpcs_procedures_ccs,'hcpcs_cd').head()

Unnamed: 0,hcpcs_cd,counts
0,92980,323
1,93545,273
2,43268,252
3,43269,196
4,43271,139


In [71]:
filtered_hcpcs_procedures_ccs= filter_rolledup_data(hcpcs_procedures_ccs,'ccs')
filtered_hcpcs_procedures_ccs.head()

Unnamed: 0,subject_id,ccs,date
0,10000068,227,2160-03-04
1,10000084,227,2160-12-28
2,10000108,227,2163-09-27
3,10000117,70,2181-11-15
4,10000117,227,2181-11-15


In [72]:
filtered_hcpcs_procedures_ccs.to_csv("./Rolledup_Data/procedures_hcpcs_rolled.csv", index=None)