<div style="text-align: center; font-weight: bold;">
    <h1>Generating Research Ready EHR Datasets</h1>
    <h2>Part 2: Cleaning, Organizing and Rolling Up EHR Data</h2>
    <h4>Author: Vidul Ayakulangara Panickan</h4>
</div>



## Step 4: Cleaning!

MIMIC data has been processed for analysis, so it's different from the typical data you might encounter at a hospital system. In general, Electronic Health Record (EHR) data is stored in databases, and each observation is usually marked with a time of observation.

In MIMIC-IV, events are also timestamped, but the timestamp is not always stored in the same table as the event. For example, the diagnoses_icd table doesn't include timestamps, while the tables for lab results, procedures, and microbiology events do.

In [1]:
# Importing required libraries.

import os
import sys
import time
import logging
import pandas as pd
from tqdm import tqdm
from IPython.display import clear_output
from IPython.display import display

# Set pandas options to expand all data within rows
pd.set_option('display.max_columns', None)      
pd.set_option('display.max_colwidth', None) 

In [2]:
# Setting up Directory to save Cleaned Data

base_directory = os.path.dirname(os.getcwd())
cleaned_rawdata_directory = os.path.join(base_directory, 'processed_data', 'step3_cleaned_rawdata')
os.makedirs(cleaned_rawdata_directory, exist_ok=True)

print(f"Directory created at: {cleaned_rawdata_directory}")

Directory created at: /n/data1/hsph/biostat/celehs/lab/va67/EHR_TUTORIAL_WORKSPACE/processed_data/step3_cleaned_rawdata


### Cleaning Diagnoses Data as an example
First, we will clean the diagnosis data to illustrate the steps involved and the order in which they are executed. Then we'll wrap these steps in a function so we can reuse them for other types of data such as medications, procedures, and labs.

In [3]:
# The diagnosis data is available in diagnoses_icd_file and the time of recording is available in admissions_file

base_directory = os.path.dirname(os.getcwd())
diagnoses_icd_file = os.path.join(base_directory, "raw_data", "physionet.org", "files", "mimiciv", "3.1", "hosp","diagnoses_icd.csv")
admissions_file  = os.path.join(base_directory, "raw_data", "physionet.org", "files", "mimiciv", "3.1", "hosp","admissions.csv")


diagnoses_icd = pd.read_csv(diagnoses_icd_file, dtype=str)
admissions = pd.read_csv(admissions_file, dtype=str)

# Listing all columns
display(diagnoses_icd.columns)

display(admissions.columns)

Index(['subject_id', 'hadm_id', 'seq_num', 'icd_code', 'icd_version'], dtype='object')

Index(['subject_id', 'hadm_id', 'admittime', 'dischtime', 'deathtime',
       'admission_type', 'admit_provider_id', 'admission_location',
       'discharge_location', 'insurance', 'language', 'marital_status', 'race',
       'edregtime', 'edouttime', 'hospital_expire_flag'],
      dtype='object')

In [4]:
# Before combining the tables, check for missing data in the raw files.
# The same check should be repeated after any join operation.

print("Diagnoses Table - Missing Values Count")
diagnoses_icd_missing_df = pd.DataFrame({'Column': diagnoses_icd.columns,'Missing_Values': diagnoses_icd.isna().sum()})
display(diagnoses_icd_missing_df)

print("Admissions Table - Missing Values Count")
admissions_missing_df = pd.DataFrame({'Column': admissions.columns,'Missing_Values': admissions.isna().sum()})
display(admissions_missing_df )

Diagnoses Table - Missing Values Count


Unnamed: 0,Column,Missing_Values
subject_id,subject_id,0
hadm_id,hadm_id,0
seq_num,seq_num,0
icd_code,icd_code,0
icd_version,icd_version,0


Admissions Table - Missing Values Count


Unnamed: 0,Column,Missing_Values
subject_id,subject_id,0
hadm_id,hadm_id,0
admittime,admittime,0
dischtime,dischtime,0
deathtime,deathtime,534238
admission_type,admission_type,0
admit_provider_id,admit_provider_id,4
admission_location,admission_location,1
discharge_location,discharge_location,149818
insurance,insurance,9355


### Keeping only the required fields

We only require subject_id, hadm_id, and admittime from the admissions table to create the timestamped diagnosis dataset. Since none of these columns has missing values, we can proceed with joining the tables.

Note: If any rows have missing dates or ICD codes, you can either remove them or impute them. When the dataset is large enough, it's common practice to remove such rows if the number of missing values is low.
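As a rough illustration of the note above, the sketch below measures the share of rows with a missing value in the key columns and drops them only when that share is small. The toy table and the 5% threshold are assumptions made for this example, not values taken from MIMIC-IV.

```python
import pandas as pd

# Toy data standing in for a timestamped diagnosis table (values are made up).
toy = pd.DataFrame({
    "subject_id": [str(1000 + i) for i in range(40)],
    "icd_code": ["4019"] * 39 + [None],
    "admittime": ["2180-05-06 22:23:00"] * 40,
})

key_cols = ["subject_id", "icd_code", "admittime"]
max_missing_share = 0.05  # hypothetical threshold; choose one that suits your study

# Share of rows with at least one missing value in the key columns.
missing_share = toy[key_cols].isna().any(axis=1).mean()
print(f"Rows with a missing key value: {missing_share:.1%}")

if missing_share <= max_missing_share:
    # Few enough gaps: drop the affected rows.
    toy_clean = toy.dropna(subset=key_cols)
else:
    # Too many gaps to drop silently: investigate or impute instead.
    toy_clean = toy

print(toy_clean.shape)
```

Reporting the share before dropping makes the decision visible, which is useful when the same check is later run on other tables.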

In [24]:
# Merging diagnoses_icd and admissions tables on 'subject_id' and 'hadm_id' columns

diagnoses_icd = pd.read_csv(diagnoses_icd_file, dtype=str)
admissions = pd.read_csv(admissions_file, dtype=str)
timed_diagnoses_icd = pd.merge(
    diagnoses_icd,
    admissions[["subject_id", "hadm_id", "admittime"]],
    how="left",
    on=["subject_id", "hadm_id"],
)


display(timed_diagnoses_icd.head())

Unnamed: 0,subject_id,hadm_id,seq_num,icd_code,icd_version,admittime
0,10000032,22595853,1,5723,9,2180-05-06 22:23:00
1,10000032,22595853,2,78959,9,2180-05-06 22:23:00
2,10000032,22595853,3,5715,9,2180-05-06 22:23:00
3,10000032,22595853,4,7070,9,2180-05-06 22:23:00
4,10000032,22595853,5,496,9,2180-05-06 22:23:00


After a merge operation, it's vital to check whether the resulting table has any missing values. If there is a significant number of missing values, you should investigate further to identify the underlying cause. For example, the join columns might have different data types (such as 'subject_id' being an integer in one table and a string in the other), which can cause the join operation to essentially fail.
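One way to make this check explicit is to ask pandas to flag unmatched rows during the merge. The sketch below uses toy tables with made-up values; the `indicator` and `validate` arguments are standard `pd.merge` options, while the table contents are assumptions for illustration only.

```python
import pandas as pd

# Toy tables (made-up values). hadm_id is a string on one side and an
# integer on the other, so the keys are cast to a common type before merging.
diag = pd.DataFrame({"hadm_id": ["101", "102", "103"], "icd_code": ["5723", "496", "4019"]})
adm = pd.DataFrame({"hadm_id": [101, 102], "admittime": ["2180-05-06", "2180-06-26"]})

adm["hadm_id"] = adm["hadm_id"].astype(str)

merged = pd.merge(
    diag,
    adm,
    how="left",
    on="hadm_id",
    validate="many_to_one",  # raises an error if hadm_id repeats in adm
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
)

# Rows marked 'left_only' found no matching admission record.
unmatched = merged[merged["_merge"] == "left_only"]
print(f"{len(unmatched)} of {len(merged)} rows found no admission record")
```

A large number of `left_only` rows usually points to a key problem (type, formatting, or missing records) rather than to genuinely missing data.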


In [25]:
# Missing-value counts for the merged table
timed_diagnoses_missing_df = pd.DataFrame({'Column': timed_diagnoses_icd.columns,'Missing_Values': timed_diagnoses_icd.isna().sum()})
display(timed_diagnoses_missing_df)

Unnamed: 0,Column,Missing_Values
subject_id,subject_id,0
hadm_id,hadm_id,0
seq_num,seq_num,0
icd_code,icd_code,0
icd_version,icd_version,0
admittime,admittime,0


In [26]:
# Check the columns of interest. If any records have a null date, icd_code, or subject_id, remove them.

print(f"Table size before null records have been removed {timed_diagnoses_icd.shape}")

timed_diagnoses_icd.dropna(subset=['admittime', 'icd_code', 'icd_version', 'subject_id'], how="any", inplace=True)

print(f"Table size after null records have been removed {timed_diagnoses_icd.shape}")

Table size before null records have been removed (6364488, 6)
Table size after null records have been removed (6364488, 6)


In [27]:
# Removing the time component from the 'admittime' column to keep only the date (YYYY-MM-DD). This is typically done in cases where only the
# date component is relevant for the analysis.

timed_diagnoses_icd["admittime"] = timed_diagnoses_icd["admittime"].str[:10]
display(timed_diagnoses_icd.head())

Unnamed: 0,subject_id,hadm_id,seq_num,icd_code,icd_version,admittime
0,10000032,22595853,1,5723,9,2180-05-06
1,10000032,22595853,2,78959,9,2180-05-06
2,10000032,22595853,3,5715,9,2180-05-06
3,10000032,22595853,4,7070,9,2180-05-06
4,10000032,22595853,5,496,9,2180-05-06


In [28]:
# For the diagnosis, we typically keep 'subject_id', 'icd_code', 'icd_version', and 'admittime'. 
# You can retain other columns as needed for your analysis.

timed_diagnoses_icd=timed_diagnoses_icd[['subject_id','icd_code','icd_version','admittime']]

# Cleaning a dataset also involves renaming and restructuring so that the cleaned datasets are consistent.
# Here we rename 'admittime' to 'date' to match the other datasets that will be created later for meds, labs and procedures.

timed_diagnoses_icd = timed_diagnoses_icd.rename(columns={'admittime': 'date'})
display(timed_diagnoses_icd.head(5))

Unnamed: 0,subject_id,icd_code,icd_version,date
0,10000032,5723,9,2180-05-06
1,10000032,78959,9,2180-05-06
2,10000032,5715,9,2180-05-06
3,10000032,7070,9,2180-05-06
4,10000032,496,9,2180-05-06


#### Define a date range for the data (Not applicable to MIMIC-IV).

While MIMIC-IV uses adjusted dates, it’s important to ensure that dates in real-world datasets are reasonable. We will filter out records with dates that fall outside a specified range, such as those before the 1980s or after the current year. The code for this operation is provided below. However, for MIMIC-IV, since the dates are already adjusted, this cleaning step will not be applied.

In [None]:
# Adjust the year bounds to suit your own data.
# timed_diagnoses_icd = timed_diagnoses_icd[
#     (timed_diagnoses_icd["date"].str[:4].astype(int) >= 1980)
#     & (timed_diagnoses_icd["date"].str[:4].astype(int) <= 2024)
# ]
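
The commented-out filter above assumes every date string starts with a valid four-digit year. A slightly more defensive variant, sketched here on made-up records, parses the dates first so that malformed values are dropped along with out-of-range ones. The bounds `min_year` and `max_year` are assumptions for illustration.

```python
import pandas as pd

# Toy records with made-up dates, including an implausible and a malformed one.
records = pd.DataFrame({
    "subject_id": ["1", "2", "3", "4"],
    "date": ["1975-02-11", "2011-08-30", "2019-13-45", "2020-01-15"],
})

min_year = 1980                       # assumed lower bound for plausible records
max_year = pd.Timestamp.today().year  # avoids hard-coding the current year

# errors="coerce" turns unparseable dates into NaT instead of raising.
parsed = pd.to_datetime(records["date"], format="%Y-%m-%d", errors="coerce")

# NaT has no year, so malformed dates fail the range test and are removed.
in_range = parsed.dt.year.between(min_year, max_year)

filtered = records[in_range]
print(filtered)
```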

In [30]:
# Check for duplicated rows in your data

if timed_diagnoses_icd.duplicated().sum() > 0:
    initial_row_count = timed_diagnoses_icd.shape[0]
    print(f"Initial table size {initial_row_count}")
    print("Duplicate rows found. Removing duplicates:")
    timed_diagnoses_icd = timed_diagnoses_icd.drop_duplicates()  # Remove duplicate rows
    final_row_count = timed_diagnoses_icd.shape[0]
    print(f"Records deleted: {initial_row_count - final_row_count}")
    print(f"Table size after removing duplicates: {final_row_count}")
else:
    print("No duplicate rows found.")

Initial table size 6364488
Duplicate rows found. Removing duplicates:
Table size after removing duplicates(6356481, 4)
Records deleted: 8007


## Defining Cleaning Functions 

timed_diagnoses_icd now contains the cleaned ICD diagnoses data. Next, we will define cleaning functions based on the code above so that they can be reused for other datasets.

In [5]:
# You can skim the logging helpers below; they only exist so that progress messages are saved to a log file.

import os
import sys
import time
import logging
import pandas as pd
from tqdm import tqdm
from IPython.display import clear_output
from IPython.display import display

# Set pandas options to expand all data within rows
pd.set_option('display.max_columns', None)      
pd.set_option('display.max_colwidth', None) 


def setup_logger(log_type,log_file):

    log_folder = os.path.join("log_folder", log_type)
    
    # Ensure the log folder exists
    os.makedirs(log_folder, exist_ok=True)
    
    # Define the full path for the log file
    log_filepath = os.path.join(log_folder, log_file)

    # Delete the log file if it exists
    if os.path.exists(log_filepath):
        os.remove(log_filepath)
    
    for handler in logging.root.handlers[:]:
        logging.root.removeHandler(handler)

    logging.basicConfig(level=logging.DEBUG, format='%(message)s', handlers=[logging.FileHandler(log_filepath)])

# Write a message to the log file configured by setup_logger.
# Note: despite its name, this function does not print to the notebook.
def print_and_log_cleaning(message):
    logging.info(message)



def missing_values_summary(df):
    missing_values = df.isna().sum()
    missing_df = pd.DataFrame({'Column Name': missing_values.index, 'Missing Values Count': missing_values.values})
    print_and_log_cleaning("Missing Values Count:")
    print_and_log_cleaning(missing_df)



def clean_data(df, cols_of_interest, time_col):

    missing_values_summary(df)

    print_and_log_cleaning(f"Initial number of rows: {df.shape[0]}")

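    print_and_log_cleaning("Keeping only the essential columns")
    # Take a copy so the column assignment below does not act on a view of the input (avoids SettingWithCopyWarning)
    df = df[cols_of_interest].copy()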

    if df.isna().sum().any():
        df = df.dropna()
        print_and_log_cleaning(f"Number of rows after dropping na rows: {df.shape[0]}")
    else:
        print_and_log_cleaning("No rows with missing values to drop.")

    print_and_log_cleaning("Extracting only the date info")
    df[time_col] = df[time_col].str[:10]

    print_and_log_cleaning("Renaming the columns")
    df = df.rename(columns={time_col: "date"})

    print_and_log_cleaning("Checking for duplicate rows")
    if df.duplicated().sum() > 0:
        print_and_log_cleaning("Duplicate rows found. Removing duplicates:")
        df = df.drop_duplicates()
        return df
    else:
        print_and_log_cleaning("No duplicate rows found.")
        return df


def clean_data_batch_supportfunc(df, cols_of_interest):

    missing_values_summary(df)

    print_and_log_cleaning(f"Initial number of rows: {df.shape[0]}")

    if df.isna().sum().any():
        df.dropna(inplace=True)
        print_and_log_cleaning(f"Number of rows after dropping na rows: {df.shape[0]}")
    else:
        print_and_log_cleaning("No rows with missing values to drop.")

    print_and_log_cleaning("Extracting only the date info")
    df[cols_of_interest['date']] = df[cols_of_interest['date']].str[:10]
  

    print_and_log_cleaning("Renaming the columns")
    df = df.rename(columns={cols_of_interest['date']: "date"})
    df = df.rename(columns={cols_of_interest['code']: "code"})

    print_and_log_cleaning("Checking for duplicate rows")
    if df.duplicated().sum() > 0:
        print_and_log_cleaning("Duplicate rows found. Removing duplicates:")
        df = df.drop_duplicates()
        return df
    else:
        print_and_log_cleaning("No duplicate rows found.")
        return df


def file_line_count(file_path):
    count = 0
    with open(file_path, 'r') as file:
        for line in file:
            count += 1
    return count


def clean_data_by_batch(input_file_path, cleaned_output_dir ,patient_ids, cols_of_interest, data_name, num_rows_to_load=1500000):
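    # cols_of_interest maps the standard field names (patient_id, date, code, code_version, coding_system)
    # to the column names in the input file. Set code_version to None when the file has no version column;
    # coding_system is then used as a constant label instead of a prefix for the version.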

    filename = os.path.splitext(os.path.basename(input_file_path))[0]

    # Set up logger for cleaning function
    setup_logger("cleaning",f"{data_name}_{filename}.txt")  # Create log folder if necessary and set up logging

 
    # Create a dictionary for batch_num lookup based on subject_id
    batch_lookup = dict(zip(patient_ids['subject_id'], patient_ids['batch_num']))
    
    # Get the list of unique batch numbers from patient_ids
    unique_batch_nums = patient_ids['batch_num'].unique()

    output_dir = f'{cleaned_output_dir}/{data_name}'
    os.makedirs(output_dir, exist_ok=True)

    # Iterate over each batch number
    for batch_num in tqdm(unique_batch_nums, desc="Processing batches", unit="batch"):
        
        print_and_log_cleaning(f"\nProcessing batch {batch_num}:")

        collected_data = []

        non_none_cols_of_interest = [
            item for key, item in cols_of_interest.items()
            if item is not None and key != "coding_system"
        ]
        
        chunk_iter = pd.read_csv(input_file_path, chunksize=num_rows_to_load, usecols=non_none_cols_of_interest , dtype=str)
        
        # Process each chunk and collect data for the current batch
        for chunk_idx, chunk in enumerate(chunk_iter):
            print_and_log_cleaning(f"------ Processing  batch {batch_num} chunk{chunk_idx + 1}")

            batch_data = chunk[chunk['subject_id'].map(batch_lookup) == batch_num]
            
            if not batch_data.empty:
                collected_data.append(batch_data)
        
        # If we found data for this batch, clean and save it
        if collected_data:

            final_batch_data = pd.concat(collected_data, ignore_index=True)

            print_and_log_cleaning(final_batch_data)

            if cols_of_interest.get('code_version'):
                cleaned_batch_data = clean_data_batch_supportfunc(final_batch_data, cols_of_interest)
                cleaned_batch_data['coding_system'] = cols_of_interest['coding_system'] + cleaned_batch_data[cols_of_interest['code_version']]

                
            else:
                cleaned_batch_data = clean_data_batch_supportfunc(final_batch_data, cols_of_interest)
                cleaned_batch_data['coding_system'] = cols_of_interest['coding_system']

            cleaned_batch_data=cleaned_batch_data[['subject_id','date','code','coding_system']]
            
            # Prepare the output file path dynamically using the data_name and batch_num
            output_file = os.path.join(output_dir, f"{data_name.lower()}_batch{batch_num}_{filename}.csv")

            
            # Save the cleaned data to the file
            cleaned_batch_data.to_csv(output_file, index=False)

            clear_output(wait=True)
            display(cleaned_batch_data.head())
            print_and_log_cleaning(f"Batch {batch_num} cleaned data saved to {output_file}.")
        else:
            print_and_log_cleaning(f"Warning: No data found for batch {batch_num}.")
 
    
    
    display(f"\nCleaning complete. All batches processed and saved in the '{output_dir}' directory.")

## Handling Large-Scale Data
Real-world EHR datasets are typically too large to load into memory at once, so we process them in batches of patients.
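
The idea can be sketched on a tiny example: assign each patient to a batch, then stream the file in chunks and keep only the rows for the current batch. The in-memory CSV, the chunk size of 2, and the two batches are assumptions chosen to keep the example small.

```python
import io
import pandas as pd

# A small in-memory CSV standing in for a file too large to load at once (made-up rows).
csv_text = """subject_id,code
1,A
2,B
1,A
3,C
2,D
4,E
"""

num_batches = 2
patients = pd.DataFrame({"subject_id": ["1", "2", "3", "4"]})
patients["batch_num"] = (patients.index % num_batches) + 1
batch_lookup = dict(zip(patients["subject_id"], patients["batch_num"]))

batches = {}
for batch_num in range(1, num_batches + 1):
    parts = []
    # Read the file a few rows at a time and keep only this batch's patients.
    for chunk in pd.read_csv(io.StringIO(csv_text), dtype=str, chunksize=2):
        parts.append(chunk[chunk["subject_id"].map(batch_lookup) == batch_num])
    # All rows for a given patient are now together, so duplicates can be removed safely.
    batches[batch_num] = pd.concat(parts, ignore_index=True).drop_duplicates()

for batch_num, frame in batches.items():
    print(batch_num, frame.values.tolist())
```

Note that the two identical rows for patient 1 arrive in different chunks; grouping by patient rather than by chunk is what allows them to be recognized as duplicates.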


In [6]:
# First we need to identify the individual patients, and then we will assign them to batches. If you do not have an admissions file,
# you can simply use the diagnosis file to get the unique patient ids.

base_directory = os.path.dirname(os.getcwd())
admissions_file  = os.path.join(base_directory, "raw_data", "physionet.org", "files", "mimiciv", "3.1", "hosp","admissions.csv")

admissions = pd.read_csv(admissions_file, dtype=str)
display(admissions.head(5))

# Getting unique patient ids and sorting them
patient_ids = admissions[['subject_id']].drop_duplicates()
patient_ids = patient_ids.sort_values(by='subject_id', ascending=True)
patient_ids = patient_ids.reset_index(drop=True)
display(patient_ids.head())

# specify the number of batches you want to have. The larger the data, the more batches you need to have
num_of_batches = 8

# Assigning batch number from 1 to 8
patient_ids['batch_num'] = (patient_ids.index % num_of_batches) + 1
display(patient_ids)

patient_count_per_batch = patient_ids.groupby('batch_num')['subject_id'].count().reset_index().rename(columns={'subject_id': 'patient_count'})
display(patient_count_per_batch)

Unnamed: 0,subject_id,hadm_id,admittime,dischtime,deathtime,admission_type,admit_provider_id,admission_location,discharge_location,insurance,language,marital_status,race,edregtime,edouttime,hospital_expire_flag
0,10000032,22595853,2180-05-06 22:23:00,2180-05-07 17:15:00,,URGENT,P49AFC,TRANSFER FROM HOSPITAL,HOME,Medicaid,English,WIDOWED,WHITE,2180-05-06 19:17:00,2180-05-06 23:30:00,0
1,10000032,22841357,2180-06-26 18:27:00,2180-06-27 18:49:00,,EW EMER.,P784FA,EMERGENCY ROOM,HOME,Medicaid,English,WIDOWED,WHITE,2180-06-26 15:54:00,2180-06-26 21:31:00,0
2,10000032,25742920,2180-08-05 23:44:00,2180-08-07 17:50:00,,EW EMER.,P19UTS,EMERGENCY ROOM,HOSPICE,Medicaid,English,WIDOWED,WHITE,2180-08-05 20:58:00,2180-08-06 01:44:00,0
3,10000032,29079034,2180-07-23 12:35:00,2180-07-25 17:55:00,,EW EMER.,P06OTX,EMERGENCY ROOM,HOME,Medicaid,English,WIDOWED,WHITE,2180-07-23 05:54:00,2180-07-23 14:00:00,0
4,10000068,25022803,2160-03-03 23:16:00,2160-03-04 06:26:00,,EU OBSERVATION,P39NWO,EMERGENCY ROOM,,,English,SINGLE,WHITE,2160-03-03 21:55:00,2160-03-04 06:26:00,0


Unnamed: 0,subject_id
0,10000032
1,10000068
2,10000084
3,10000108
4,10000117


Unnamed: 0,subject_id,batch_num
0,10000032,1
1,10000068,2
2,10000084,3
3,10000108,4
4,10000117,5
...,...,...
223447,19999733,8
223448,19999784,1
223449,19999828,2
223450,19999840,3


Unnamed: 0,batch_num,patient_count
0,1,27932
1,2,27932
2,3,27932
3,4,27932
4,5,27931
5,6,27931
6,7,27931
7,8,27931


### 4.1 Cleaning Diagnoses Data in batches

In [7]:
base_directory = os.path.dirname(os.getcwd())
diagnoses_icd_file = os.path.join(base_directory, "raw_data", "physionet.org", "files", "mimiciv", "3.1", "hosp","diagnoses_icd.csv")
admissions_file  = os.path.join(base_directory, "raw_data", "physionet.org", "files", "mimiciv", "3.1", "hosp","admissions.csv")


diagnoses_icd = pd.read_csv(diagnoses_icd_file, dtype=str)
admissions = pd.read_csv(admissions_file, dtype=str)

timed_diagnoses_icd = pd.merge(
    diagnoses_icd,
    admissions[["subject_id", "hadm_id", "admittime"]],
    how="left",
    on=["subject_id", "hadm_id"],
)

display(timed_diagnoses_icd.head())

tmp_directory = os.path.join(base_directory, 'scripts', 'tmp')
print(f"Creating temp directory here {tmp_directory}")
os.makedirs(tmp_directory, exist_ok=True)

timed_diagnoses_icd_file = os.path.join(tmp_directory, "timed_diagnoses_icd.csv")
timed_diagnoses_icd.to_csv(timed_diagnoses_icd_file, index=False)

Unnamed: 0,subject_id,hadm_id,seq_num,icd_code,icd_version,admittime
0,10000032,22595853,1,5723,9,2180-05-06 22:23:00
1,10000032,22595853,2,78959,9,2180-05-06 22:23:00
2,10000032,22595853,3,5715,9,2180-05-06 22:23:00
3,10000032,22595853,4,7070,9,2180-05-06 22:23:00
4,10000032,22595853,5,496,9,2180-05-06 22:23:00


Creating temp directory here /n/data1/hsph/biostat/celehs/lab/va67/EHR_TUTORIAL_WORKSPACE/scripts/tmp


In [8]:
cols_of_interest = {
    "patient_id": "subject_id",
    "date": "admittime",
    "code": "icd_code",
    "code_version": "icd_version",
    "coding_system": "ICD"
}

clean_data_by_batch(
    timed_diagnoses_icd_file,
    cleaned_rawdata_directory,
    patient_ids,
    cols_of_interest,
    "Diagnoses",
    num_rows_to_load=15000000
)

Unnamed: 0,subject_id,date,code,coding_system
0,10000280,2151-03-18,6820,ICD9
1,10000886,2178-05-08,30500,ICD9
2,10001217,2157-11-18,3240,ICD9
3,10001217,2157-11-18,3484,ICD9
4,10001217,2157-11-18,3485,ICD9


Processing batches: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [01:09<00:00,  8.74s/batch]


### 4.2 Cleaning Procedures Data

In MIMIC, procedure data come from two sources:
1. hcpcsevents.csv, where procedures are recorded as HCPCS codes (which include CPT codes)
2. procedures_icd.csv, where procedures are recorded as ICD-9/ICD-10 procedure codes

We will need to clean them both and concatenate them.
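
Because the cleaning step gives every source the same four columns, the concatenation itself is straightforward. A minimal sketch, using a few illustrative rows rather than the real cleaned files:

```python
import pandas as pd

# Toy cleaned outputs from the two procedure sources (small illustrative rows).
hcpcs_clean = pd.DataFrame({
    "subject_id": ["10000068", "10000084"],
    "date": ["2160-03-04", "2160-12-28"],
    "code": ["99218", "G0378"],
    "coding_system": ["HCPCS", "HCPCS"],
})
icd_proc_clean = pd.DataFrame({
    "subject_id": ["10000032", "10000068"],
    "date": ["2180-05-07", "2160-03-03"],
    "code": ["5491", "8938"],
    "coding_system": ["ICDPROC9", "ICDPROC9"],
})

# Both tables share the same columns, so they can be stacked directly.
procedures = pd.concat([hcpcs_clean, icd_proc_clean], ignore_index=True)

# Sorting by patient and date gives one chronological record per patient.
procedures = procedures.sort_values(["subject_id", "date"]).reset_index(drop=True)
print(procedures)
```

The coding_system column keeps track of where each code came from, so the two sources remain distinguishable after they are combined.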

In [18]:
# Cleaning HCPCS events
base_directory = os.path.dirname(os.getcwd())
hcpcsevents_file = os.path.join(base_directory, "raw_data", "physionet.org", "files", "mimiciv", "3.1", "hosp","hcpcsevents.csv")

hcpcsevents_sample = pd.read_csv(hcpcsevents_file ,nrows=5,dtype=str)
display(hcpcsevents_sample)

Unnamed: 0,subject_id,hadm_id,chartdate,hcpcs_cd,seq_num,short_description
0,10000068,25022803,2160-03-04,99218,1,Hospital observation services
1,10000084,29888819,2160-12-28,G0378,1,Hospital observation per hr
2,10000108,27250926,2163-09-27,99219,1,Hospital observation services
3,10000117,22927623,2181-11-15,43239,1,Digestive system
4,10000117,22927623,2181-11-15,G0378,2,Hospital observation per hr


In [19]:

cols_of_interest = {
    "patient_id": "subject_id",
    "date": "chartdate",
    "code": "hcpcs_cd",
    "code_version": None,
    "coding_system": "HCPCS"
}

clean_data_by_batch(
    hcpcsevents_file,
    cleaned_rawdata_directory,
    patient_ids,
    cols_of_interest,
    "Procedures",
    num_rows_to_load=15000000
)

Unnamed: 0,subject_id,date,code,coding_system
0,10000280,2151-03-18,99219,HCPCS
1,10000886,2178-05-08,99219,HCPCS
2,10002425,2153-01-07,27339,HCPCS
4,10002425,2153-01-07,G0378,HCPCS
5,10002807,2152-03-30,G0378,HCPCS


Processing batches: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:02<00:00,  3.31batch/s]


In [20]:
# Processing procedures_icd.csv

base_directory = os.path.dirname(os.getcwd())
procedures_icd_file = os.path.join(base_directory, "raw_data", "physionet.org", "files", "mimiciv", "3.1", "hosp","procedures_icd.csv")

procedures_icd_sample = pd.read_csv(procedures_icd_file, dtype=str, nrows=5)

display(procedures_icd_sample)

Unnamed: 0,subject_id,hadm_id,seq_num,chartdate,icd_code,icd_version
0,10000032,22595853,1,2180-05-07,5491,9
1,10000032,22841357,1,2180-06-27,5491,9
2,10000032,25742920,1,2180-08-06,5491,9
3,10000068,25022803,1,2160-03-03,8938,9
4,10000117,27988844,1,2183-09-19,0QS734Z,10


In [21]:
cols_of_interest = {
    "patient_id": "subject_id",
    "date": "chartdate",
    "code": "icd_code",
    "code_version": "icd_version",
    "coding_system": "ICDPROC"
}

clean_data_by_batch(
    procedures_icd_file,
    cleaned_rawdata_directory,
    patient_ids,
    cols_of_interest,
    "Procedures",
    num_rows_to_load=15000000
)

Unnamed: 0,subject_id,date,code,coding_system
0,10000280,2151-03-18,8938,ICDPROC9
1,10000886,2178-05-08,8938,ICDPROC9
2,10001217,2157-11-20,139,ICDPROC9
3,10001217,2157-11-19,331,ICDPROC9
4,10001217,2157-11-22,3897,ICDPROC9


Processing batches: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:08<00:00,  1.00s/batch]


### 4.3 Cleaning Medications Data

In [22]:
base_directory = os.path.dirname(os.getcwd())
prescriptions_file = os.path.join(base_directory, "raw_data", "physionet.org", "files", "mimiciv", "3.1", "hosp","prescriptions.csv")

# Loading the top 5 rows to identify columns of interest
prescriptions_sample = pd.read_csv(prescriptions_file,dtype=str,nrows=5)

display(prescriptions_sample)

Unnamed: 0,subject_id,hadm_id,pharmacy_id,poe_id,poe_seq,order_provider_id,starttime,stoptime,drug_type,drug,formulary_drug_cd,gsn,ndc,prod_strength,form_rx,dose_val_rx,dose_unit_rx,form_val_disp,form_unit_disp,doses_per_24_hrs,route
0,10000032,22595853,12775705,10000032-55,55,P85UQ1,2180-05-08 08:00:00,2180-05-07 22:00:00,MAIN,Furosemide,FURO40,8209.0,51079007320,40mg Tablet,,40,mg,1.0,TAB,1,PO/NG
1,10000032,22595853,18415984,10000032-42,42,P23SJA,2180-05-07 02:00:00,2180-05-07 22:00:00,MAIN,Ipratropium Bromide Neb,IPRA2H,21700.0,487980125,2.5mL Vial,,1,NEB,1.0,VIAL,4,IH
2,10000032,22595853,23637373,10000032-35,35,P23SJA,2180-05-07 01:00:00,2180-05-07 09:00:00,MAIN,Furosemide,FURO20,8208.0,51079007220,20mg Tablet,,20,mg,1.0,TAB,1,PO/NG
3,10000032,22595853,26862314,10000032-41,41,P23SJA,2180-05-07 01:00:00,2180-05-07 01:00:00,MAIN,Potassium Chloride,MICROK10,1275.0,245004101,10mEq ER Tablet,,40,mEq,4.0,TAB,1,PO
4,10000032,22595853,30740602,10000032-27,27,P23SJA,2180-05-07 00:00:00,2180-05-07 22:00:00,MAIN,Sodium Chloride 0.9% Flush,NACLFLUSH,,0,10 mL Syringe,,3,mL,0.3,SYR,3,IV


In [24]:
cols_of_interest = {
    "patient_id": "subject_id",
    "date": "starttime",
    "code": "ndc",
    "code_version": None,
    "coding_system": "NDC"
}

clean_data_by_batch(
    prescriptions_file,
    cleaned_rawdata_directory,
    patient_ids,
    cols_of_interest,
    "Medication",
    num_rows_to_load=15000000
)

Unnamed: 0,subject_id,date,code,coding_system
0,10001217,2157-11-22,8290036005,NDC
1,10001217,2157-11-21,641608025,NDC
2,10001217,2157-11-22,338001702,NDC
3,10001217,2157-11-22,409433201,NDC
4,10001217,2157-11-21,0,NDC


Processing batches: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [04:09<00:00, 31.17s/batch]


### 4.4 Cleaning Lab Data: Handling Large-Scale Data

In [9]:
base_directory = os.path.dirname(os.getcwd())
labitems_file = os.path.join(base_directory, "raw_data", "physionet.org", "files", "mimiciv", "3.1", "hosp", "labevents.csv")

labs_sample= pd.read_csv(labitems_file,dtype=str,nrows=5)

display(labs_sample)

Unnamed: 0,labevent_id,subject_id,hadm_id,specimen_id,itemid,order_provider_id,charttime,storetime,value,valuenum,valueuom,ref_range_lower,ref_range_upper,flag,priority,comments
0,1,10000032,,2704548,50931,P69FQC,2180-03-23 11:51:00,2180-03-23 15:56:00,___,95.0,mg/dL,70.0,100.0,,ROUTINE,"IF FASTING, 70-100 NORMAL, >125 PROVISIONAL DIABETES."
1,2,10000032,,36092842,51071,P69FQC,2180-03-23 11:51:00,2180-03-23 16:00:00,NEG,,,,,,ROUTINE,
2,3,10000032,,36092842,51074,P69FQC,2180-03-23 11:51:00,2180-03-23 16:00:00,NEG,,,,,,ROUTINE,
3,4,10000032,,36092842,51075,P69FQC,2180-03-23 11:51:00,2180-03-23 16:00:00,NEG,,,,,,ROUTINE,"BENZODIAZEPINE IMMUNOASSAY SCREEN DOES NOT DETECT SOME DRUGS,;INCLUDING LORAZEPAM, CLONAZEPAM, AND FLUNITRAZEPAM."
4,5,10000032,,36092842,51079,P69FQC,2180-03-23 11:51:00,2180-03-23 16:00:00,NEG,,,,,,ROUTINE,


In [26]:
# Counting the number of lines in the lab file

file_line_count(labitems_file)

158374765

Lab data is one of the largest components of an EHR because lab events are recorded so frequently. In MIMIC-IV, there are over 158 million lab observations, so loading the entire dataset into memory at once is inefficient and may cause your program to crash. Therefore, we will process the data in batches, working with one batch of patients at a time.

In [10]:
# Each batch takes at least 3-4 minutes to process, and the entire lab dataset takes around 30 minutes

cols_of_interest = {
    "patient_id": "subject_id",
    "date": "charttime",
    "code": "itemid",
    "code_version": None,
    "coding_system": "ITEMID"
}

clean_data_by_batch(
    labitems_file,
    cleaned_rawdata_directory,
    patient_ids,
    cols_of_interest,
    "Labs",
    num_rows_to_load=15000000
)


# We process data in batches of patients, rather than batches of data, to efficiently handle duplicates, as we can't load the entire dataset into memory at once.
# If it takes too long to process a chunk, try reducing num_rows_to_load

Unnamed: 0,subject_id,date,code,coding_system
0,10000280,2151-03-18,51146,ITEMID
1,10000280,2151-03-18,51200,ITEMID
2,10000280,2151-03-18,51221,ITEMID
3,10000280,2151-03-18,51222,ITEMID
4,10000280,2151-03-18,51244,ITEMID


Processing batches: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [32:11<00:00, 241.40s/batch]


## Step 4: Rolling up Data

For data to be effectively compared and analyzed, it must be standardized. For example, medical codes in electronic health records (EHR), such as medication codes, may come from different coding systems like NDC, RxNorm, or institution-specific local codes. "Rolling up" data from these various coding systems to a common parent coding system ensures standardization.

There are additional reasons to perform the rollup:

1. The raw EHR codes may be too specific, making analysis difficult or impractical.
2. Rolling up helps harmonize data across different institutions, enabling analysis on a larger scale.
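
Both points can be seen in a small sketch. The codes below are placeholders invented for this example (they are not real NDC, RxNorm, or local codes), and `parent_code` is a hypothetical column name.

```python
import pandas as pd

# Placeholder codes only: these are not real NDC, RxNorm, or local codes.
events = pd.DataFrame({
    "subject_id": ["1", "1", "2", "3"],
    "code": ["NDC_A1", "NDC_A2", "LOCAL_A", "NDC_B1"],
    "coding_system": ["NDC", "NDC", "LOCAL", "NDC"],
})
rollup = pd.DataFrame({
    "code": ["NDC_A1", "NDC_A2", "LOCAL_A", "NDC_B1"],
    "coding_system": ["NDC", "NDC", "LOCAL", "NDC"],
    "parent_code": ["INGREDIENT_A", "INGREDIENT_A", "INGREDIENT_A", "INGREDIENT_B"],
})

# Attach the parent code to every event.
rolled = events.merge(rollup, how="left", on=["code", "coding_system"])

print("Distinct raw codes:   ", rolled["code"].nunique())
print("Distinct parent codes:", rolled["parent_code"].nunique())

# Patients per parent code, regardless of which coding system recorded the event.
print(rolled.groupby("parent_code")["subject_id"].nunique())
```

After the rollup, events recorded under different source systems count toward the same parent concept, which is what makes cross-institution comparisons possible.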

In [1]:
import os
import sys
import time
import logging
import pandas as pd
from tqdm import tqdm
from IPython.display import clear_output
from IPython.display import display

# Set pandas options to expand all data within rows
pd.set_option('display.max_columns', None)      
pd.set_option('display.max_colwidth', None) 

# Setting up Directory to save Rolledup Data

base_directory = os.path.dirname(os.getcwd())

rolledup_intermediatedata_directory = os.path.join(base_directory, 'processed_data', 'step4_rolledup_intermediatedata')
os.makedirs(rolledup_intermediatedata_directory, exist_ok=True)
print(f"Directory created at: {rolledup_intermediatedata_directory}")

rolledup_finaldata_directory = os.path.join(base_directory, 'processed_data', 'step4_rolledup_finaldata')
os.makedirs(rolledup_finaldata_directory, exist_ok=True)
print(f"Directory created at: {rolledup_finaldata_directory}")


Directory created at: /n/data1/hsph/biostat/celehs/lab/va67/EHR_TUTORIAL_WORKSPACE/processed_data/step4_rolledup_intermediatedata
Directory created at: /n/data1/hsph/biostat/celehs/lab/va67/EHR_TUTORIAL_WORKSPACE/processed_data/step4_rolledup_finaldata


## Creating Rollup Dictionary

To roll up EHR codes to a parent-level code, we first need to decide which coding system to roll up to:
1. Diagnoses - We will roll up ICD and other diagnosis codes to PheCodes
2. Medication - We will roll up standard codes such as RxNorm and NDC, as well as local medication codes, to RxNorm ingredient-level codes
3. Lab - We will roll up local lab codes and LOINC codes to LOINC component codes
4. Procedures - We will roll up procedure codes such as ICDPCS/CPT4 codes to CCS codes

Creating these rollup dictionaries requires a lot of manual processing and quality checks to ensure that the resulting mapping is accurate.
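
Some of these quality checks can be automated. The sketch below runs three basic checks on a placeholder mapping table; the codes and the `parent_code` column name are invented for illustration, and a real dictionary would need further domain-specific review.

```python
import pandas as pd

# Placeholder mapping rows (not real codes) with three deliberate problems.
mapping = pd.DataFrame({
    "code": ["C1", "C1", "C2", "C2", "C3"],
    "coding_system": ["SYS", "SYS", "SYS", "SYS", "SYS"],
    "parent_code": ["P1", "P1", "P2", "P3", None],
})

key = ["code", "coding_system"]

# 1. Exact duplicate rows would duplicate events after a join.
exact_duplicates = mapping[mapping.duplicated(keep=False)]

# 2. A source code that maps to more than one parent is ambiguous.
parents_per_code = mapping.dropna(subset=["parent_code"]).groupby(key)["parent_code"].nunique()
ambiguous = parents_per_code[parents_per_code > 1]

# 3. Rows without a parent code carry no mapping at all.
missing_parent = mapping[mapping["parent_code"].isna()]

print(len(exact_duplicates), len(ambiguous), len(missing_parent))
```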

In [23]:
# Write code to create rollup dictionaries



### Rolling up Diagnoses Data
ICD codes are often too detailed to be used directly for research purposes. PheCodes solve this problem by grouping related ICD codes into clinically meaningful phenotypes.

In [33]:
base_directory = os.path.dirname(os.getcwd())

icd_to_phecode_file = os.path.join(base_directory, 'scripts', 'rollup_mappings',"ICD_to_PheCode.csv")
icd_to_phecode = pd.read_csv(icd_to_phecode_file, dtype=str)
display(icd_to_phecode.head())

Unnamed: 0,code,PheCode,coding_system
0,1,8,ICD9
1,10,8,ICD9
2,11,8,ICD9
3,19,8,ICD9
4,2,8,ICD9


In [34]:
# We will select a sample diagnoses file for rolling up

base_directory = os.path.dirname(os.getcwd())

diagnoses_cleaned_rawdata = os.path.join(base_directory, 'processed_data', 'step3_cleaned_rawdata',"Diagnoses")
diagnoses_files = os.listdir(diagnoses_cleaned_rawdata)

sample_diagnoses_filepath= os.path.join(diagnoses_cleaned_rawdata, diagnoses_files[0])
sample_diagnoses = pd.read_csv(sample_diagnoses_filepath, dtype=str)
display(sample_diagnoses.head())

Unnamed: 0,subject_id,date,code,coding_system
0,10000032,2180-05-06,5723,ICD9
1,10000032,2180-05-06,78959,ICD9
2,10000032,2180-05-06,5715,ICD9
3,10000032,2180-05-06,7070,ICD9
4,10000032,2180-05-06,496,ICD9


Now compare the column names in the rollup mapping file with those in the diagnoses file. The rollup is performed with a join operation, so the join keys (here code and coding_system) must have the same names in both tables.
If the names differ, make them consistent by renaming the columns in the rollup file before merging.
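
As a minimal sketch of that renaming step, assume a hypothetical mapping file whose columns are called ICD, Phecode and flag instead of code, PheCode and coding_system. These names and the flag encoding are assumptions for illustration; the mapping file used in this tutorial already has matching names.

```python
import pandas as pd

# Hypothetical mapping table whose column names differ from the patient data
mapping = pd.DataFrame({
    "ICD":     ["5723", "496"],
    "Phecode": ["571.81", "496"],
    "flag":    ["9", "9"],   # assumed encoding: "9" = ICD-9, "10" = ICD-10
})

diagnoses = pd.DataFrame({
    "subject_id":    ["1", "1", "2"],
    "code":          ["5723", "496", "V270"],
    "coding_system": ["ICD9", "ICD9", "ICD9"],
})

# Rename the mapping columns so that the join keys match the patient data
mapping = mapping.rename(columns={"ICD": "code", "Phecode": "PheCode"})
mapping["coding_system"] = "ICD" + mapping["flag"]
mapping = mapping[["code", "PheCode", "coding_system"]]

# validate="many_to_one" makes pandas raise an error if a (code, coding_system)
# pair appears more than once in the mapping, which would duplicate patient rows
merged = pd.merge(
    diagnoses, mapping,
    how="left", on=["code", "coding_system"],
    validate="many_to_one",
)
print(merged)
```

The left join keeps the unmapped V270 row with a missing PheCode, which is what allows unmapped codes to be inspected later.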

In [35]:
# Merge the two tables to roll up (map) ICD codes to PheCodes. Save this rolled-up data in the intermediate folder.
# If you later make the rollup mapping more comprehensive, or want to look at the unmapped codes, you can always come back to it.
    
sample_diagnoses_phecode = pd.merge(sample_diagnoses, icd_to_phecode, how='left', on=['code','coding_system'])

sample_diagnoses_phecode['Rollup_Status'] = sample_diagnoses_phecode['PheCode'].notna().replace({True: '1', False: '0'})

display(sample_diagnoses_phecode.head())
print(sample_diagnoses_phecode.shape)

Unnamed: 0,subject_id,date,code,coding_system,PheCode,Rollup_Status
0,10000032,2180-05-06,5723,ICD9,571.81,1
1,10000032,2180-05-06,78959,ICD9,572.0,1
2,10000032,2180-05-06,5715,ICD9,571.51,1
3,10000032,2180-05-06,7070,ICD9,70.3,1
4,10000032,2180-05-06,496,ICD9,496.0,1


(799778, 6)


In [36]:
# Rows whose ICD codes could not be rolled up

sample_diagnoses_unrolled = sample_diagnoses_phecode[sample_diagnoses_phecode["Rollup_Status"]=="0"]
display(sample_diagnoses_unrolled.head())
print(sample_diagnoses_unrolled.shape)

Unnamed: 0,subject_id,date,code,coding_system,PheCode,Rollup_Status
34,10000032,2180-07-23,V4986,ICD9,,0
44,10001319,2135-07-20,V270,ICD9,,0
46,10001319,2138-11-09,V270,ICD9,,0
48,10001319,2134-04-15,V270,ICD9,,0
52,10001843,2131-11-09,Y840,ICD10,,0


(52416, 6)


In [38]:
# Summarize the codes that have not been rolled up

# Count each unmapped code at most once per patient
unique_subject_icd_pairs = sample_diagnoses_unrolled[['subject_id', 'code','coding_system']].drop_duplicates()

# Number of patients with each unmapped (code, coding_system) pair
icdcode_frequencies = unique_subject_icd_pairs[['code','coding_system']].value_counts().reset_index(name='counts')

sorted_icdcode_frequencies = icdcode_frequencies.sort_values(by='counts', ascending=False)

display(sorted_icdcode_frequencies.head(10))

Unnamed: 0,code,coding_system,counts
0,Z20822,ICD10,3009
1,Y929,ICD10,1956
2,Y92230,ICD10,1115
3,V270,ICD9,1086
4,V4986,ICD9,1057
5,Y92009,ICD10,901
6,Y92239,ICD10,898
7,E8497,ICD9,867
8,E8490,ICD9,743
9,E8788,ICD9,694


Once the data looks reasonable and enough of the codes have been rolled up, you can save the data. Save the comprehensive data, containing both mapped and unmapped rows, in the step4_rolledup_intermediatedata folder so that you can come back to it if you need to check anything in the future.
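
What counts as a good enough rollup is a judgment call, but it helps to put a number on it. In the sample above, 747362 of the 799778 rows were mapped, roughly 93%. The sketch below computes row-level and patient-level coverage from the Rollup_Status column on toy data; the function name rollup_coverage is hypothetical.

```python
import pandas as pd

# Toy rolled-up data with the same Rollup_Status convention as above
rolled = pd.DataFrame({
    "subject_id":    ["1", "1", "2", "3"],
    "code":          ["5723", "V4986", "496", "V270"],
    "Rollup_Status": ["1", "0", "1", "0"],
})

def rollup_coverage(rolledup_data):
    """Share of rows, and of patients, with at least one mapped code."""
    mapped = rolledup_data["Rollup_Status"] == "1"

    # Fraction of rows that received a parent code
    row_coverage = mapped.mean()

    # Fraction of patients who keep at least one row after the rollup
    patients_total = rolledup_data["subject_id"].nunique()
    patients_mapped = rolledup_data.loc[mapped, "subject_id"].nunique()

    return {
        "row_coverage": row_coverage,
        "patient_coverage": patients_mapped / patients_total,
    }

coverage = rollup_coverage(rolled)
print(coverage)
```

Patient-level coverage is worth checking alongside row-level coverage, because patients whose codes are all unmapped disappear entirely from the final data.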

Save the final rolled-up file in the step4_rolledup_finaldata folder.

We don't need all the columns once the rollup has been performed, so below we keep only the ones we need.

In [31]:
# Keep only the rows that were successfully rolled up
sample_diagnoses_phecode_filtered = sample_diagnoses_phecode[sample_diagnoses_phecode['Rollup_Status']=="1"]

print(sample_diagnoses_phecode_filtered)

# Keep only the patient, parent code and date columns
sample_diagnoses_phecode_filtered = sample_diagnoses_phecode_filtered[['subject_id','PheCode','date']]
sample_diagnoses_phecode_filtered

       subject_id        date    code coding_system PheCode Rollup_Status
0        10000032  2180-05-06    5723          ICD9  571.81             1
1        10000032  2180-05-06   78959          ICD9     572             1
2        10000032  2180-05-06    5715          ICD9  571.51             1
3        10000032  2180-05-06   07070          ICD9   070.3             1
4        10000032  2180-05-06     496          ICD9     496             1
...           ...         ...     ...           ...     ...           ...
799773   19999784  2121-01-31   Z5111         ICD10    1010             1
799774   19999784  2121-01-31   C8589         ICD10   202.2             1
799775   19999784  2121-01-31    E876         ICD10  276.14             1
799776   19999784  2121-01-31  Z87891         ICD10     318             1
799777   19999784  2121-01-31   Z8619         ICD10     136             1

[747362 rows x 6 columns]


Unnamed: 0,subject_id,PheCode,date
0,10000032,571.81,2180-05-06
1,10000032,572,2180-05-06
2,10000032,571.51,2180-05-06
3,10000032,070.3,2180-05-06
4,10000032,496,2180-05-06
...,...,...,...
799773,19999784,1010,2121-01-31
799774,19999784,202.2,2121-01-31
799775,19999784,276.14,2121-01-31
799776,19999784,318,2121-01-31


In [32]:
if sample_diagnoses_phecode_filtered.duplicated().sum() > 0:
    print("Duplicate rows found. Removing duplicates...")
    sample_diagnoses_phecode_filtered = sample_diagnoses_phecode_filtered.drop_duplicates()  # Remove duplicate rows
    print("DataFrame after removing duplicates:")
else:
    print("No duplicate rows found.")

display(sample_diagnoses_phecode_filtered)

Duplicate rows found. Removing duplicates...
DataFrame after removing duplicates:


Unnamed: 0,subject_id,PheCode,date
0,10000032,571.81,2180-05-06
1,10000032,572,2180-05-06
2,10000032,571.51,2180-05-06
3,10000032,070.3,2180-05-06
4,10000032,496,2180-05-06
...,...,...,...
799773,19999784,1010,2121-01-31
799774,19999784,202.2,2121-01-31
799775,19999784,276.14,2121-01-31
799776,19999784,318,2121-01-31


## Defining Functions 

Since we will be performing operations similar to those we used to roll up ICD codes, it's better to define these operations
as functions so we can reuse them.

In [8]:
import os
import sys
import time
import logging
import pandas as pd
from tqdm import tqdm
from IPython.display import clear_output
from IPython.display import display

# Set pandas options to expand all data within rows
pd.set_option('display.max_columns', None)      
pd.set_option('display.max_colwidth', None) 

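def rollup(raw_level_data, rollup_mapping, join_columns, parent_column):
    # Left join keeps every raw row, including codes that have no parent in the mapping
    rolledup_data = pd.merge(raw_level_data, rollup_mapping, how='left', on=join_columns)

    # Flag each row: '1' if a parent code was found, '0' otherwise
    rolledup_data['Rollup_Status'] = rolledup_data[parent_column].notna().replace({True: '1', False: '0'})

    return rolledup_data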


def summarize_unmapped(rolledup_data, child_column):
    # Rows whose code could not be rolled up
    rolledup_data_unmapped = rolledup_data[rolledup_data["Rollup_Status"]=="0"]

    # Count each unmapped code at most once per patient
    unique_patient_code_pairs = rolledup_data_unmapped[['subject_id', child_column]].drop_duplicates()

    # Number of patients with each unmapped code
    unmapped_code_frequencies = unique_patient_code_pairs[[child_column]].value_counts().reset_index(name='counts')

    return unmapped_code_frequencies


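def filter_rolledup_data(rolledup_data, cols_of_interest):
    # Keep only the rows that were successfully rolled up
    filtered = rolledup_data[rolledup_data['Rollup_Status']=="1"]

    # Select the columns of interest and drop duplicates.
    # Reassigning (instead of inplace=True on a slice) avoids pandas' SettingWithCopyWarning.
    filtered = filtered[cols_of_interest].drop_duplicates()

    return filtered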
    

def rollup_data_by_batch(config, unrolled_data_folder):

    base_directory = os.path.dirname(os.getcwd())
    summary_directory = os.path.join(base_directory, 'processed_data', 'Summary')
    intermediate_dir = os.path.join(base_directory, 'processed_data', 'step4_rolledup_intermediatedata')
    rolledup_dir =  os.path.join(base_directory, 'processed_data', 'step4_rolledup_finaldata')


    # Output sub-folders are named after the input folder (e.g. Diagnoses)
    folder_name = os.path.basename(unrolled_data_folder)
    intermediate_dir = os.path.join(intermediate_dir, folder_name)
    rolledup_dir = os.path.join(rolledup_dir, folder_name)

    unrolled_summary_dir = os.path.join(summary_directory, "Unrolled_Summary")
    files = os.listdir(unrolled_data_folder)


    os.makedirs(intermediate_dir, exist_ok=True)
    os.makedirs(rolledup_dir, exist_ok=True)
    os.makedirs(unrolled_summary_dir, exist_ok=True)

    unrolled_freq_list = []

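    # The second underscore-separated token of each file name identifies its batch
    batches = [file.split("_")[1] for file in files]
    unique_batches = sorted(set(batches))

    for batch in tqdm(unique_batches, desc="Processing batches", unit="batch"):
        # Match the batch token exactly so that, e.g., batch "1" does not also pick up batch "10"
        load_files = [file for file in files if file.split("_")[1] == batch]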
 

        # Load every file that belongs to this batch and stack them into one DataFrame
        df_list = [pd.read_csv(os.path.join(unrolled_data_folder, file), dtype=str) for file in load_files]
        file_df = pd.concat(df_list)

        
        intermediate_df = rollup(file_df, config["rollup_mapping"], [config['child_column'],"coding_system"], config['parent_column'])
        intermediate_file = os.path.join(intermediate_dir, f"intermediate_{batch}.csv")
        intermediate_df.to_csv(intermediate_file, index=False)
        
        unrolled_freq_list.append(summarize_unmapped(intermediate_df, config['child_column']))

        cols_of_interest = [config['patient_id'],config['date'],config['parent_column']]
        
        only_rolledup_df = filter_rolledup_data(intermediate_df, cols_of_interest)
        only_rolledup_filepath = os.path.join(rolledup_dir, f"rolledup_{batch}.csv")
        only_rolledup_df.to_csv(only_rolledup_filepath, index=False)

        clear_output(wait=True)
        display(only_rolledup_df)


    
    unrolled_summary_file = os.path.join(unrolled_summary_dir , "unrolled_" + folder_name + "_code_counts.csv")

    unrolled_summary =(
        pd.concat(unrolled_freq_list)  # Concatenate list of DataFrames
        .groupby(config['child_column'], as_index=False)["counts"]
        .sum()
        .sort_values(by="counts", ascending=False)
        )
    
    display(unrolled_summary)
    unrolled_summary.to_csv(unrolled_summary_file, index=False)
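
The batch loop depends on how files are assigned to batches. Below is a minimal sketch of grouping files by their exact batch token, assuming hypothetical file names of the form prefix_batch_suffix.csv; the actual naming scheme comes from the cleaning step and may differ. A substring test such as `batch in file` would place diagnoses_10_part1.csv in batch 1 as well, which is why the sketch compares tokens exactly.

```python
from collections import defaultdict

def group_files_by_batch(filenames):
    """Group file names by their second underscore-separated token (hypothetical helper)."""
    groups = defaultdict(list)
    for name in filenames:
        parts = name.split("_")
        if len(parts) < 2:
            # Skip files that do not follow the assumed naming pattern
            continue
        groups[parts[1]].append(name)
    return dict(groups)

# Made-up file names for illustration
files = [
    "diagnoses_1_part1.csv",
    "diagnoses_1_part2.csv",
    "diagnoses_10_part1.csv",
    "README.txt",
]

groups = group_files_by_batch(files)
for batch, names in groups.items():
    print(batch, names)
```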


In [None]:
### Diagnosis Data


In [13]:
base_directory = os.path.dirname(os.getcwd())

diagnoses_cleaned_rawdata = os.path.join(base_directory, 'processed_data', 'step3_cleaned_rawdata',"Diagnoses")
diagnoses_files = os.listdir(diagnoses_cleaned_rawdata)
sample_diagnoses_filepath = os.path.join(diagnoses_cleaned_rawdata, diagnoses_files[0])
sample_diagnoses = pd.read_csv(sample_diagnoses_filepath, dtype=str)
display(sample_diagnoses.head())

icd_to_phecode_file = os.path.join(base_directory, 'scripts', 'rollup_mappings',"ICD_to_PheCode.csv")
diagnoses_to_phecode = pd.read_csv(icd_to_phecode_file , dtype=str)
display(diagnoses_to_phecode.head())

Unnamed: 0,subject_id,date,code,coding_system
0,10000032,2180-05-06,5723,ICD9
1,10000032,2180-05-06,78959,ICD9
2,10000032,2180-05-06,5715,ICD9
3,10000032,2180-05-06,7070,ICD9
4,10000032,2180-05-06,496,ICD9


Unnamed: 0,code,PheCode,coding_system
0,1,8,ICD9
1,10,8,ICD9
2,11,8,ICD9
3,19,8,ICD9
4,2,8,ICD9


In [16]:
config = {
    "patient_id":     "subject_id",
    "parent_column":  "PheCode",
    "child_column":   "code",
    "date":           "date",
    "rollup_mapping": diagnoses_to_phecode
}

rollup_data_by_batch(
    config,
    os.path.join(base_directory, 'processed_data', 'step3_cleaned_rawdata','Diagnoses')
)

Unnamed: 0,subject_id,date,PheCode
0,10000068,2160-03-03,317.1
1,10000635,2143-12-23,418
2,10000635,2143-12-23,772.6
3,10000635,2143-12-23,687.4
4,10000635,2143-12-23,401.1
...,...,...,...
799483,19999828,2147-07-18,1090
799484,19999828,2147-07-18,318
799485,19999828,2147-07-18,041.4
799486,19999828,2147-07-18,041.9


Processing batches: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:32<00:00,  4.08s/batch]


Unnamed: 0,code,counts
2461,Z20822,23629
2381,Y929,15945
2309,Y92230,9019
1335,V270,8375
1484,V4986,8006
...,...,...
1869,W12XXXS,1
1871,W130XXD,1
1872,W130XXS,1
860,Q9382,1


### Procedures Data

As mentioned before, procedure data comes from two sources: hcpcsevents.csv and procedures_icd.csv.
Our objective is to roll both of them up to CCS codes and merge them.
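
The idea can be sketched on toy data as follows. Each source is tagged with its coding system, the sources are stacked, and the rollup joins on both code and coding_system. The CCS values and the ICDPROC9 row are placeholders for illustration, not entries taken from the real mapping file.

```python
import pandas as pd

# Toy extracts from the two sources
hcpcs = pd.DataFrame({"subject_id": ["1"], "date": ["2180-10-09"], "code": ["99218"]})
icd_proc = pd.DataFrame({"subject_id": ["2"], "date": ["2160-03-04"], "code": ["00800ZZ"]})

# Tag each source with its coding system before stacking them
hcpcs["coding_system"] = "HCPCS"
icd_proc["coding_system"] = "ICDPROC10"
procedures = pd.concat([hcpcs, icd_proc], ignore_index=True)

# Placeholder mapping: the same code string appears under two coding systems
procedure_to_ccs = pd.DataFrame({
    "code":          ["99218", "00800ZZ", "99218"],
    "CCS":           ["227", "1", "5"],
    "coding_system": ["HCPCS", "ICDPROC10", "ICDPROC9"],
})

# Joining on both keys picks the parent that belongs to the right coding system
rolled = procedures.merge(procedure_to_ccs, how="left", on=["code", "coding_system"])

# Joining on the code alone would duplicate the HCPCS row
rolled_on_code_only = procedures.merge(procedure_to_ccs, how="left", on="code")

print(len(rolled), len(rolled_on_code_only))
```

Including coding_system in the join keys is what makes it safe to combine sources whose code strings might collide.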

In [19]:
# Rolling up procedure data
base_directory = os.path.dirname(os.getcwd())

procedure_cleaned_rawdata = os.path.join(base_directory, 'processed_data', 'step3_cleaned_rawdata',"Procedures")
procedure_files = os.listdir(procedure_cleaned_rawdata)
sample_procedure_filepath= os.path.join(procedure_cleaned_rawdata, procedure_files[0])
sample_procedure = pd.read_csv(sample_procedure_filepath, dtype=str)
display(sample_procedure.head())


procedure_rollup_file = os.path.join(base_directory, 'scripts', 'rollup_mappings',"HCPCS_ICDPROC_to_CCS.csv")
procedure_to_ccs = pd.read_csv(procedure_rollup_file , dtype=str)

display(procedure_to_ccs.head())

Unnamed: 0,subject_id,date,code,coding_system
0,10000904,2180-10-09,99218,HCPCS
1,10002131,2123-06-25,G0378,HCPCS
2,10002428,2155-07-14,G0378,HCPCS
3,10002428,2160-07-15,G0378,HCPCS
4,10002428,2157-07-16,27235,HCPCS


Unnamed: 0,code,CCS,coding_system
0,00800ZZ,1,ICDPROC10
1,00803ZZ,1,ICDPROC10
2,00804ZZ,1,ICDPROC10
3,00870ZZ,1,ICDPROC10
4,00873ZZ,1,ICDPROC10


In [21]:

config={
    "patient_id":"subject_id",
    "parent_column":"CCS",
    "child_column":"code",
    "date":"date",
    "rollup_mapping":procedure_to_ccs
}

rollup_data_by_batch(
    config,
    os.path.join(base_directory, 'processed_data', 'step3_cleaned_rawdata','Procedures')
)

Unnamed: 0,subject_id,date,CCS
0,10000068,2160-03-04,227
1,10000635,2143-12-23,227
2,10000635,2136-06-19,227
3,10000935,2187-07-11,227
4,10000935,2183-11-07,227
...,...,...,...
129547,19999828,2149-01-10,54
129548,19999828,2147-07-27,172
129549,19999828,2147-07-27,170
129551,19999828,2147-07-18,54


Processing batches: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:05<00:00,  1.34batch/s]


Unnamed: 0,code,counts
212,XW033E5,563
149,92980,323
166,93545,273
108,43268,252
109,43269,196
...,...,...
177,BF532Z0,1
84,35721,1
85,35741,1
90,36148,1


### Medication Data

In MIMIC, medication data comes from two sources:
1. prescriptions.csv, where prescribed medications are recorded
2. emar.csv, where medication administrations are recorded as NDC codes (highly granular information)

Information is often duplicated across sources. Some of these events can also be found in inputevents in the ICU module, which we can get into later.
Our objective is to map the codes to RxNorm and then to RxNorm ingredient codes.

In [12]:
base_directory = os.path.dirname(os.getcwd())

medication_cleaned_rawdata = os.path.join(base_directory, 'processed_data', 'step3_cleaned_rawdata', "Medication")
medication_files = os.listdir(medication_cleaned_rawdata)
sample_medication_file = os.path.join(medication_cleaned_rawdata, medication_files[0])
sample_medication = pd.read_csv(sample_medication_file, dtype=str)
display(sample_medication.head())


medication_rollup_file = os.path.join(base_directory, 'scripts', 'rollup_mappings',"NDC_to_RxNorm.csv")
medication_to_rxnorm = pd.read_csv(medication_rollup_file, dtype=str)
display(medication_to_rxnorm.head())

Unnamed: 0,subject_id,date,code,coding_system
0,10000032,2180-05-08,51079007320,NDC
1,10000032,2180-05-07,487980125,NDC
2,10000032,2180-05-07,51079007220,NDC
3,10000032,2180-05-07,245004101,NDC
4,10000032,2180-05-07,0,NDC


Unnamed: 0,code,RxNorm,coding_system
0,00295117904,5499,NDC
1,00295117916,5499,NDC
2,00363026816,5499,NDC
3,00363026832,5499,NDC
4,00363087143,5499,NDC
...,...,...,...
490356,81371017567,2586250,NDC
490357,15455956901,2586262,NDC
490358,51552160103,2587012,NDC
490359,51552160105,2587012,NDC


In [15]:
# Processing prescription.
config={
    "patient_id":"subject_id",
    "parent_column":"RxNorm", # Must exactly match the parent column name in the rollup mapping file
    "child_column":"code",
    "date":"date",
    "rollup_mapping":medication_to_rxnorm
}

rollup_data_by_batch(
    config ,
    os.path.join(base_directory, 'processed_data', 'step3_cleaned_rawdata','Medication')
)

Unnamed: 0,subject_id,date,RxNorm
0,10000635,2136-06-19,8591
1,10000635,2136-06-19,1547585
2,10000635,2136-06-19,4850
3,10000635,2136-06-19,4832
4,10000635,2136-06-19,10763_5487
...,...,...,...
2056286,19999828,2147-07-18,6470
2056288,19999828,2147-07-28,2599
2056289,19999828,2147-07-28,7804
2056290,19999828,2147-07-29,3443


Processing batches: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [01:27<00:00, 10.88s/batch]


Unnamed: 0,code,counts
0,0,188786
63,49281041688,19499
40,19515090941,11213
86,66689036430,9253
80,63323029766,4491
...,...,...
18,08881200441,2
32,16590023730,1
88,68258912401,1
39,19515089241,1
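
The unmapped summary above is dominated by the code 0, which looks like a placeholder for a missing NDC rather than a real product code. That interpretation is an assumption and should be checked against the source tables. Under that assumption, the sketch below separates placeholder codes from codes that are genuinely unmapped, so the unmapped summary reflects real gaps in the mapping; the set PLACEHOLDER_CODES is hypothetical.

```python
import pandas as pd

# Toy medication data in the same layout as the cleaned files
medication = pd.DataFrame({
    "subject_id":    ["1", "1", "2", "3"],
    "code":          ["51079007320", "0", "0", "00295117904"],
    "coding_system": ["NDC"] * 4,
})

# Assumed placeholder values that do not identify a product
PLACEHOLDER_CODES = {"0", ""}

is_placeholder = medication["code"].isna() | medication["code"].isin(PLACEHOLDER_CODES)

# Rows without a usable code: report them separately instead of counting them as unmapped
placeholder_rows = medication[is_placeholder]

# Rows that can meaningfully be sent through the rollup
usable_rows = medication[~is_placeholder]

print(len(placeholder_rows), len(usable_rows))
```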


### Laboratory Data

In MIMIC, lab observations are stored in labevents.csv. Lab observations are generally recorded as LOINC codes in US EHR systems;
however, the latest MIMIC data doesn't contain them.

In [3]:
base_directory = os.path.dirname(os.getcwd())

labs_cleaned_rawdata = os.path.join(base_directory, 'processed_data', 'step3_cleaned_rawdata',"Labs")
lab_files = os.listdir(labs_cleaned_rawdata)
sample_lab_file= os.path.join(labs_cleaned_rawdata, lab_files[0])
sample_lab = pd.read_csv(sample_lab_file, dtype=str)
display(sample_lab.head())

Unnamed: 0,subject_id,date,code,coding_system
0,10000032,2180-03-23,50931,ITEMID
1,10000032,2180-03-23,51071,ITEMID
2,10000032,2180-03-23,51074,ITEMID
3,10000032,2180-03-23,51075,ITEMID
4,10000032,2180-03-23,51079,ITEMID


Notice that the coding_system is ITEMID, which is not a standard coding system. MIMIC-IV provides a mapping from ITEMID to LOINC codes, so we will first map ITEMID to LOINC and then roll up the LOINC codes to LOINC components. Similarly, in real-world EHR data, you will often have to combine mappings from different sources.

In [4]:
itemid_to_loinc_file = os.path.join(base_directory, 'scripts', 'meta_files',"lab_itemid_to_loinc.csv")
itemid_to_loinc = pd.read_csv(itemid_to_loinc_file, dtype=str)
itemid_to_loinc.dropna(subset=['loinc'],inplace=True)
itemid_to_loinc.head(5)

Unnamed: 0,itemid,label,fluid,category,valueuom,loinc,loinc_version,notes
5,50903,Cholesterol Ratio (Total/HDL),Blood,Chemistry,Ratio,9830-1,2.71,Mass ratio is more common than molar ratio in the US
6,50911,"Creatine Kinase, MB Isoenzyme",Blood,Chemistry,ng/mL,13969-1,2.71,This is the LOINC code US labs use
8,50937,Hepatitis A Virus Antibody,Blood,Chemistry,N/A|Pos/Neg,13951-9,2.71,More specific method
10,50941,Hepatitis B Surface Antigen,Blood,Chemistry,,5196-1,2.71,More specific method
11,50942,Hepatitis B Virus Core Antibody,Blood,Chemistry,,13952-7,2.71,More specific method


In [5]:

loinc_hierarchy_file = os.path.join(base_directory, 'scripts', 'meta_files',"LOINC_Hierarchy_v2.73_version4.csv")
loinc_hierarchy =  pd.read_csv(loinc_hierarchy_file, dtype=str)

# Select relevant columns and rename for clarity
loinc_rollup = loinc_hierarchy[['LOINC', 'PARENT_LOINC']].rename(columns={'LOINC': 'loinc', 'PARENT_LOINC': 'LoincComponent'})
itemid_to_loinc = itemid_to_loinc[['itemid', 'loinc']]

# Merge the mappings to associate item IDs with LOINC components
lab_to_loinc_component = pd.merge(itemid_to_loinc, loinc_rollup, on="loinc", how="left")

#  rename columns to match the lab data columns
lab_to_loinc_component = lab_to_loinc_component.rename(columns={"itemid": "code"})
lab_to_loinc_component['coding_system'] = 'ITEMID'
lab_to_loinc_component = lab_to_loinc_component[['code', 'LoincComponent', 'coding_system']]

# Display the first few rows of the resulting DataFrame
display(lab_to_loinc_component.head(5))

Unnamed: 0,code,LoincComponent,coding_system
0,50903,LP307370-9,ITEMID
1,50911,LP15513-2,ITEMID
2,50937,LP38316-3,ITEMID
3,50941,LP38331-2,ITEMID
4,50942,LP38323-9,ITEMID


In [9]:
# Processing labs.
config={
    "patient_id":"subject_id",
    "parent_column":"LoincComponent", # Must exactly match the parent column name in the rollup mapping file
    "child_column":"code",
    "date":"date",
    "rollup_mapping":lab_to_loinc_component
}

rollup_data_by_batch(
    config,
    os.path.join(base_directory, 'processed_data', 'step3_cleaned_rawdata','Labs')
)

Unnamed: 0,subject_id,date,LoincComponent
13,10000084,2160-11-20,LP14635-4
34,10000084,2160-11-20,LP15957-1
36,10000084,2160-11-20,LP14328-6
37,10000084,2160-11-20,LP14539-8
40,10000084,2160-11-20,LP14540-6
...,...,...,...
16609260,19999840,2164-09-15,LP14267-6
16609275,19999840,2164-09-15,LP14635-4
16609302,19999840,2164-09-16,LP14635-4
16609328,19999840,2164-09-17,LP14635-4


Processing batches: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [08:36<00:00, 64.55s/batch]


Unnamed: 0,code,counts
362,51221,218775
402,51265,218717
363,51222,218647
433,51301,218646
389,51250,218644
...,...,...
787,52055,1
783,52032,1
564,51459,1
747,51927,1


In [None]:
# TODO: load the data and process it by batches; write a function that keeps appending the resulting data
# Also import tqdm
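
One way to keep appending batch results to a single file is to open the CSV in append mode and write the header only when the file does not exist yet. This is a sketch under that assumption, with a hypothetical helper name append_batch and a temporary directory standing in for the real output folder.

```python
import os
import tempfile
import pandas as pd

def append_batch(df, output_file):
    """Append one batch to a CSV file, writing the header only once (hypothetical helper)."""
    write_header = not os.path.exists(output_file)
    df.to_csv(output_file, mode="a", header=write_header, index=False)

# Two toy batches with the same columns
batches = [
    pd.DataFrame({"subject_id": ["1", "2"], "date": ["2180-05-06", "2180-05-07"], "PheCode": ["496", "318"]}),
    pd.DataFrame({"subject_id": ["3"], "date": ["2121-01-31"], "PheCode": ["202.2"]}),
]

with tempfile.TemporaryDirectory() as tmp_dir:
    output_file = os.path.join(tmp_dir, "rolledup_all.csv")
    for batch_df in batches:
        append_batch(batch_df, output_file)

    # Read the combined file back to confirm that the header appears only once
    combined = pd.read_csv(output_file, dtype=str)

print(combined)
```

Because the file is opened in append mode, re-running the loop adds the same rows again, so the output file should be removed before a fresh run. The loop can be wrapped in tqdm to show progress, as in rollup_data_by_batch.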