<div style="text-align: center; font-weight: bold;">
    <h1>Pipeline for Research ready EHR Datasets</h1>
    <h2>Part 2: Cleaning, Organizing and Rolling up the EHR Data</h2>
    <h4>Author: Vidul Ayakulangara Panickan</h4>
</div>



## Step 4: Cleaning!

MIMIC data has been processed for analysis, so it's different from the typical data you might encounter at a hospital system. In general, Electronic Health Record (EHR) data is stored in databases, and each observation is usually marked with a time of observation.

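In MIMIC-IV, events are also timestamped, but the timestamp is not always stored in the same table as the event. For example, the diagnoses_icd table doesn't include timestamps, while the tables for lab results, procedures, and microbiology events do. To date a diagnosis, we therefore have to look up the admission time in the admissions table.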

In [1]:
# It's good practice to load all the libraries you will need beforehand

import os
import sys
import time
import logging
import pandas as pd
from tqdm import tqdm
from IPython.display import clear_output

### Cleaning Diagnoses Data
First, we will clean the diagnoses data to illustrate the steps involved and the order in which they are executed. Then we'll wrap these steps in a function so we can reuse them for other types of data, such as medications, procedures, and labs.

In [2]:
# The diagnosis data is available in diagnoses_icd_file and the time of recording is available in admissions_file

diagnoses_icd_file = "/n/data1/hsph/biostat/celehs/lab/va67/MIMIC/mimic-iv-codified-3.1/files/mimiciv/3.1/hosp/diagnoses_icd.csv"
admissions_file = "/n/data1/hsph/biostat/celehs/lab/va67/MIMIC/mimic-iv-codified-3.1/files/mimiciv/3.1/hosp/admissions.csv"


diagnoses_icd = pd.read_csv(diagnoses_icd_file, dtype=str)
admissions = pd.read_csv(admissions_file, dtype=str)

# Listing all columns
print(diagnoses_icd.columns)
print(admissions.columns)

Index(['subject_id', 'hadm_id', 'seq_num', 'icd_code', 'icd_version'], dtype='object')
Index(['subject_id', 'hadm_id', 'admittime', 'dischtime', 'deathtime',
       'admission_type', 'admit_provider_id', 'admission_location',
       'discharge_location', 'insurance', 'language', 'marital_status', 'race',
       'edregtime', 'edouttime', 'hospital_expire_flag'],
      dtype='object')


In [4]:
# Before combining the tables, we check for missing data in the raw files.
# We will repeat this check after the join operation.

print("Diagnoses - Missing Values Count")
print(diagnoses_icd.isna().sum(), "\n\n")

print("Admissions - Missing Values Count")
print(admissions.isna().sum())

Diagnoses - Missing Values Count
subject_id     0
hadm_id        0
seq_num        0
icd_code       0
icd_version    0
dtype: int64 


Admissions - Missing Values Count
subject_id                   0
hadm_id                      0
admittime                    0
dischtime                    0
deathtime               534238
admission_type               0
admit_provider_id            4
admission_location           1
discharge_location      149818
insurance                 9355
language                   775
marital_status           13619
race                         0
edregtime               166788
edouttime               166788
hospital_expire_flag         0
dtype: int64


### Keeping only the required fields

We only require subject_id, hadm_id, and admittime from the admissions table to create the timestamped diagnosis dataset. Since none of these columns, and none of the columns in diagnoses_icd, have missing values, we can proceed with joining the tables.

Note: If any rows have missing dates or ICD codes, you can either remove them or impute them. When the dataset is large enough, it's common practice to remove such rows if the number of missing values is low.
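
One way to make the "remove rows if few are missing" rule explicit is to drop incomplete rows only when their share is below a threshold, and to stop otherwise. The sketch below is illustrative: the helper name `drop_if_few_missing` and the 1% default threshold are assumptions, not part of this pipeline.

```python
import pandas as pd


def drop_if_few_missing(df, cols, max_missing_frac=0.01):
    """Drop rows with missing values in `cols` if they are rare; otherwise raise.

    `max_missing_frac` is a hypothetical threshold; choose one that suits your data.
    """
    incomplete = df[cols].isna().any(axis=1)  # True for rows missing any required field
    frac = incomplete.mean()
    if frac > max_missing_frac:
        raise ValueError(f"{frac:.1%} of rows are incomplete; investigate before dropping.")
    return df.loc[~incomplete].copy()


# Toy data: 1 of 200 rows has a missing code (0.5%, below the 1% threshold)
toy = pd.DataFrame({
    "subject_id": [str(i) for i in range(200)],
    "icd_code": ["5723"] * 199 + [None],
})
cleaned = drop_if_few_missing(toy, ["subject_id", "icd_code"])
print(len(toy), "->", len(cleaned))
```

Raising an error when too many rows are incomplete forces a decision (impute, fix the source, or raise the threshold) instead of silently discarding a large part of the data.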

In [5]:
# Merging diagnoses_icd and admissions tables on 'subject_id' and 'hadm_id' columns

timed_diagnoses_icd = pd.merge(
    diagnoses_icd,
    admissions[["subject_id", "hadm_id", "admittime"]],
    how="left",
    on=["subject_id", "hadm_id"],
)


timed_diagnoses_icd

Unnamed: 0,subject_id,hadm_id,seq_num,icd_code,icd_version,admittime
0,10000032,22595853,1,5723,9,2180-05-06 22:23:00
1,10000032,22595853,2,78959,9,2180-05-06 22:23:00
2,10000032,22595853,3,5715,9,2180-05-06 22:23:00
3,10000032,22595853,4,07070,9,2180-05-06 22:23:00
4,10000032,22595853,5,496,9,2180-05-06 22:23:00
...,...,...,...,...,...,...
6364483,19999987,23865745,7,41401,9,2145-11-02 21:38:00
6364484,19999987,23865745,8,78039,9,2145-11-02 21:38:00
6364485,19999987,23865745,9,0413,9,2145-11-02 21:38:00
6364486,19999987,23865745,10,36846,9,2145-11-02 21:38:00


After a merge operation, it's vital to check whether the resulting table has any missing values. If there are a significant number of missing values, you should investigate further to identify the underlying cause. For example, the join columns might have different data types ('subject_id' can be an integer in one table and a string in another), which can cause the join to fail or to match no rows.


In [7]:
timed_diagnoses_icd.isna().sum()

subject_id     0
hadm_id        0
seq_num        0
icd_code       0
icd_version    0
admittime      0
dtype: int64
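
The post-merge check described above can also be built into the merge itself. The sketch below uses toy tables (not MIMIC data) and two options of `pd.merge`: `validate`, which checks the expected key relationship, and `indicator`, which records whether each row found a match.

```python
import pandas as pd

# Toy tables. The IDs are strings in one table and integers in the other,
# which is the dtype mismatch described above.
diagnoses = pd.DataFrame({"hadm_id": ["100", "100", "200"], "icd_code": ["5723", "496", "4019"]})
admissions = pd.DataFrame({"hadm_id": [100, 200], "admittime": ["2180-05-06 22:23:00", "2180-06-26 18:27:00"]})

# Align the key dtype before joining; mismatched key dtypes can raise an error
# or silently produce no matches.
admissions["hadm_id"] = admissions["hadm_id"].astype(str)

merged = pd.merge(
    diagnoses,
    admissions,
    how="left",
    on="hadm_id",
    validate="many_to_one",  # raises MergeError if hadm_id is duplicated in admissions
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
)

unmatched = (merged["_merge"] == "left_only").sum()
print(f"Rows without a matching admission: {unmatched}")
```

Reading every file with `dtype=str`, as this notebook does, avoids the mismatch in the first place.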

In [9]:
# Check columns of interest. If you have records where dates, icd_code or subject_ids are null, remove them.

timed_diagnoses_icd = timed_diagnoses_icd.dropna(subset=['admittime', 'icd_code', 'icd_version', 'subject_id'], how="any")

# We are saving a copy which we will use later for batch processing
timed_diagnoses_icd.to_csv("uncleaned_timed_diagnoses_icd.csv", index=False)

In [7]:
# Removing the time component from the 'admittime' column to keep only the date (YYYY-MM-DD). This is typically done in cases where only the
# date component is relevant for the analysis.

timed_diagnoses_icd["admittime"] = timed_diagnoses_icd["admittime"].str[:10]

timed_diagnoses_icd

Unnamed: 0,subject_id,hadm_id,seq_num,icd_code,icd_version,admittime
0,10000032,22595853,1,5723,9,2180-05-06
1,10000032,22595853,2,78959,9,2180-05-06
2,10000032,22595853,3,5715,9,2180-05-06
3,10000032,22595853,4,07070,9,2180-05-06
4,10000032,22595853,5,496,9,2180-05-06
...,...,...,...,...,...,...
6364483,19999987,23865745,7,41401,9,2145-11-02
6364484,19999987,23865745,8,78039,9,2145-11-02
6364485,19999987,23865745,9,0413,9,2145-11-02
6364486,19999987,23865745,10,36846,9,2145-11-02
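
Slicing the first 10 characters assumes every timestamp is already in `YYYY-MM-DD HH:MM:SS` form. For data where that is not guaranteed, a hedged alternative is to parse the column and treat anything that fails to parse as missing. The sketch below uses toy values; the last one is deliberately malformed.

```python
import pandas as pd

# Toy timestamps; the last value is deliberately malformed
raw = pd.Series(["2180-05-06 22:23:00", "2145-11-02 21:38:00", "06/05/2180"])

# Parse with an explicit format; values that do not match become NaT instead of raising
parsed = pd.to_datetime(raw, format="%Y-%m-%d %H:%M:%S", errors="coerce")

# Keep the date part as a YYYY-MM-DD string, matching the result of .str[:10]
dates = parsed.dt.strftime("%Y-%m-%d")

print(dates.tolist())
print("Unparseable values:", parsed.isna().sum())
```

The unparseable rows can then be inspected or dropped together with the other missing values.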


In [9]:
# For the diagnosis data, we typically keep 'subject_id', 'icd_code', 'icd_version', and 'admittime'.
# You can retain other columns as needed for your analysis.

timed_diagnoses_icd = timed_diagnoses_icd[['subject_id', 'icd_code', 'icd_version', 'admittime']]

# Next, we rename 'admittime' to 'date' to ensure consistency with other datasets that will be created later for meds, labs and procedures.

timed_diagnoses_icd = timed_diagnoses_icd.rename(columns={'admittime': 'date'})
timed_diagnoses_icd

Unnamed: 0,subject_id,icd_code,icd_version,date
0,10000032,5723,9,2180-05-06
1,10000032,78959,9,2180-05-06
2,10000032,5715,9,2180-05-06
3,10000032,07070,9,2180-05-06
4,10000032,496,9,2180-05-06
...,...,...,...,...
6364483,19999987,41401,9,2145-11-02
6364484,19999987,78039,9,2145-11-02
6364485,19999987,0413,9,2145-11-02
6364486,19999987,36846,9,2145-11-02


#### Define a date range for the data (Not applicable to MIMIC-IV).

MIMIC-IV dates have been deliberately adjusted (the years shown above, such as 2180, lie in the future), so a plausibility check on dates does not apply here. In real-world datasets, however, it is important to make sure that dates are reasonable. A common check is to filter out records whose dates fall outside a specified range, such as before 1980 or after the current year. The code for this operation is shown below, commented out because we do not apply it to MIMIC-IV.

In [None]:
# Replace 2024 with the current year when you run this on your own data.
# timed_diagnoses_icd = timed_diagnoses_icd[
#     (timed_diagnoses_icd["date"].str[:4].astype(int) >= 1980)
#     & (timed_diagnoses_icd["date"].str[:4].astype(int) <= 2024)
# ]
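
For reference, here is a runnable version of the same date-range filter on toy data. The lower bound of 1980 is the example value from the text; taking the upper bound from the system clock is an assumption you may want to replace with your data extraction date.

```python
import pandas as pd
from datetime import date

# Toy data with one implausibly early and one implausibly late record
records = pd.DataFrame({
    "subject_id": ["1", "2", "3", "4"],
    "date": ["1975-03-01", "1999-12-31", "2015-07-04", "2999-01-01"],
})

min_year = 1980               # assumed lower bound; adjust to your institution's EHR history
max_year = date.today().year  # upper bound: the current year

# Extract the year once and reuse it for both comparisons
years = records["date"].str[:4].astype(int)
in_range = years.between(min_year, max_year)  # inclusive on both ends

filtered = records[in_range]
print(f"Dropped {(~in_range).sum()} of {len(records)} records outside {min_year}-{max_year}")
```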

In [11]:
# Check for duplicated rows in your data

if timed_diagnoses_icd.duplicated().sum() > 0:
    print("Duplicate rows found. Removing duplicates:")
    timed_diagnoses_icd = timed_diagnoses_icd.drop_duplicates()  # Remove duplicate rows
    print(timed_diagnoses_icd)
else:
    print("No duplicate rows found.")

Duplicate rows found. Removing duplicates:
        subject_id icd_code icd_version        date
0         10000032     5723           9  2180-05-06
1         10000032    78959           9  2180-05-06
2         10000032     5715           9  2180-05-06
3         10000032    07070           9  2180-05-06
4         10000032      496           9  2180-05-06
...            ...      ...         ...         ...
6364483   19999987    41401           9  2145-11-02
6364484   19999987    78039           9  2145-11-02
6364485   19999987     0413           9  2145-11-02
6364486   19999987    36846           9  2145-11-02
6364487   19999987     7810           9  2145-11-02

[6356481 rows x 4 columns]
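
Duplicates can appear at this stage even if the raw table had none, because the columns that distinguished the rows have been dropped. The toy example below shows one plausible way this happens (it is an illustration, not an analysis of the MIMIC-IV duplicates): the same code recorded for two admissions of one patient that start on the same date.

```python
import pandas as pd

# Toy data: the same code recorded for two admissions of one patient on the same date
dx = pd.DataFrame({
    "subject_id": ["1", "1"],
    "hadm_id": ["100", "101"],
    "icd_code": ["496", "496"],
    "icd_version": ["9", "9"],
    "date": ["2180-05-06", "2180-05-06"],
})

print("Duplicates with hadm_id:", dx.duplicated().sum())

# Dropping hadm_id makes the two rows identical
slim = dx[["subject_id", "icd_code", "icd_version", "date"]]
print("Duplicates without hadm_id:", slim.duplicated().sum())

# keep=False marks every member of a duplicate group, which is handy for inspection
print(slim[slim.duplicated(keep=False)])
```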


## Defining Cleaning Functions 

timed_diagnoses_icd now contains the cleaned ICD diagnoses data. Next, we will define cleaning functions based on the code above so they can be reused for other datasets.

In [2]:
# The two helpers below only set up logging, so that progress messages are saved to a log file. You can skim past them.

def setup_logger(log_type,log_file):

    log_folder = os.path.join("log_folder", log_type)
    
    # Ensure the log folder exists
    os.makedirs(log_folder, exist_ok=True)
    
    # Define the full path for the log file
    log_filepath = os.path.join(log_folder, log_file)

    # Delete the log file if it exists
    if os.path.exists(log_filepath):
        os.remove(log_filepath)
    
    for handler in logging.root.handlers[:]:
        logging.root.removeHandler(handler)

    logging.basicConfig(level=logging.DEBUG, format='%(message)s', handlers=[logging.FileHandler(log_filepath)])

# Write a message to the current log file (despite the name, nothing is printed to the notebook)
def print_and_log_cleaning(message):
    logging.info(message)  # Log message to the specific log file
    
#################################################################################################


def missing_values_summary(df):
    missing_values = df.isna().sum()
    missing_df = pd.DataFrame({'Column Name': missing_values.index, 'Missing Values Count': missing_values.values})
    print_and_log_cleaning("Missing Values Count:")
    print_and_log_cleaning(missing_df)



def clean_data(df, cols_of_interest, time_col):
    # cols_of_interest is a list of column names to keep; time_col is the timestamp column among them

    missing_values_summary(df)

    print_and_log_cleaning(f"Initial number of rows: {df.shape[0]}")

    print_and_log_cleaning("Keeping only the essential columns")
    # Copy the selection so the column assignment below does not write into a view of the caller's DataFrame
    df = df[cols_of_interest].copy()

    if df.isna().sum().any():
        df = df.dropna()
        print_and_log_cleaning(f"Number of rows after dropping na rows: {df.shape[0]}")
    else:
        print_and_log_cleaning("No rows with missing values to drop.")

    print_and_log_cleaning("Extracting only the date info")
    df[time_col] = df[time_col].str[:10]

    print_and_log_cleaning("Renaming the columns")
    df = df.rename(columns={time_col: "date"})

    print_and_log_cleaning("Checking for duplicate rows")
    if df.duplicated().sum() > 0:
        print_and_log_cleaning("Duplicate rows found. Removing duplicates:")
        df = df.drop_duplicates()
    else:
        print_and_log_cleaning("No duplicate rows found.")

    return df


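def clean_data_batch_supportfunc(df, cols_of_interest):
    # Here cols_of_interest is a dict whose 'date' and 'code' keys hold the source column names;
    # clean_data above takes a plain list of columns instead.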

    missing_values_summary(df)

    print_and_log_cleaning(f"Initial number of rows: {df.shape[0]}")

    if df.isna().sum().any():
        df.dropna(inplace=True)
        print_and_log_cleaning(f"Number of rows after dropping na rows: {df.shape[0]}")
    else:
        print_and_log_cleaning("No rows with missing values to drop.")

    print_and_log_cleaning("Extracting only the date info")
    df[cols_of_interest['date']] = df[cols_of_interest['date']].str[:10]
  

    print_and_log_cleaning("Renaming the columns")
    df = df.rename(columns={cols_of_interest['date']: "date"})
    df = df.rename(columns={cols_of_interest['code']: "code"})

    print_and_log_cleaning("Checking for duplicate rows")
    if df.duplicated().sum() > 0:
        print_and_log_cleaning("Duplicate rows found. Removing duplicates:")
        df = df.drop_duplicates()
        return df
    else:
        print_and_log_cleaning("No duplicate rows found.")
        return df


def file_line_count(file_path):
    count = 0
    with open(file_path, 'r') as file:
        for line in file:
            count += 1
    return count


def clean_data_by_batch(input_file_path, patient_ids, cols_of_interest, data_name, num_rows_to_load=1500000):
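    # cols_of_interest is a dict mapping the roles 'patient_id', 'date', 'code' and 'code_version' to column names
    # in the input file ('code_version' may be None). 'coding_system' is a literal label (e.g. "ICD"), not a column name.
    # Note: the patient ID column is expected to be named 'subject_id', as in MIMIC-IV.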

    filename = os.path.splitext(os.path.basename(input_file_path))[0]

    # Set up logger for cleaning function
    setup_logger("cleaning",f"{data_name}_{filename}.txt")  # Create log folder if necessary and set up logging

 
    # Create a dictionary for batch_num lookup based on subject_id
    batch_lookup = dict(zip(patient_ids['subject_id'], patient_ids['batch_num']))
    
    # Get the list of unique batch numbers from patient_ids
    unique_batch_nums = patient_ids['batch_num'].unique()

    output_dir = f'Cleaned_RawData/{data_name}'
    os.makedirs(output_dir, exist_ok=True)

    # Iterate over each batch number
    for batch_num in tqdm(unique_batch_nums, desc="Processing batches", unit="batch"):
        
        print_and_log_cleaning(f"\nProcessing batch {batch_num}:")

        collected_data = []

        non_none_cols_of_interest = [
            item for key, item in cols_of_interest.items()
            if item is not None and key != "coding_system"
        ]

        chunk_iter = pd.read_csv(input_file_path, chunksize=num_rows_to_load, usecols=non_none_cols_of_interest, dtype=str)
        
        # Process each chunk and collect data for the current batch
        for chunk_idx, chunk in enumerate(chunk_iter):
            print_and_log_cleaning(f"------ Processing batch {batch_num}, chunk {chunk_idx + 1}")

            batch_data = chunk[chunk['subject_id'].map(batch_lookup) == batch_num]
            
            if not batch_data.empty:
                collected_data.append(batch_data)
        
        # If we found data for this batch, clean and save it
        if collected_data:

            final_batch_data = pd.concat(collected_data, ignore_index=True)

            print_and_log_cleaning(final_batch_data)

            cleaned_batch_data = clean_data_batch_supportfunc(final_batch_data, cols_of_interest)

            if cols_of_interest.get('code_version'):
                # Append the version to the label, e.g. "ICD" + "9" -> "ICD9"
                cleaned_batch_data['coding_system'] = cols_of_interest['coding_system'] + cleaned_batch_data[cols_of_interest['code_version']]
            else:
                cleaned_batch_data['coding_system'] = cols_of_interest['coding_system']

            cleaned_batch_data=cleaned_batch_data[['subject_id','date','code','coding_system']]
            
            # Prepare the output file path dynamically using the data_name and batch_num
            output_file = os.path.join(output_dir, f"{data_name.lower()}_batch{batch_num}_{filename}.csv")

            
            # Save the cleaned data to the file
            cleaned_batch_data.to_csv(output_file, index=False)
            print_and_log_cleaning(f"Batch {batch_num} cleaned data saved to {output_file}.")
        else:
            print_and_log_cleaning(f"Warning: No data found for batch {batch_num}.")
 
    
    
    print_and_log_cleaning(f"\nCleaning complete. All batches processed and saved in the '{output_dir}' directory.")
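
The core idea of `clean_data_by_batch` is to stream the file in chunks and keep only the rows of patients assigned to the batch being processed. The sketch below isolates that step on a toy in-memory CSV (the data and the batch assignment are made up for illustration), without the logging and file output.

```python
import io
import pandas as pd

# Toy CSV standing in for a large file on disk
csv_text = """subject_id,icd_code
1,5723
2,496
1,78959
3,4019
2,5715
1,07070
"""

# Toy patient-to-batch lookup (same structure as batch_lookup above)
batch_lookup = {"1": 1, "2": 2, "3": 1}
target_batch = 1

collected = []
# chunksize makes read_csv return an iterator of DataFrames instead of one large DataFrame
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2, dtype=str):
    # Keep only rows whose patient belongs to the batch being processed
    keep = chunk["subject_id"].map(batch_lookup) == target_batch
    collected.append(chunk[keep])

batch_df = pd.concat(collected, ignore_index=True)
print(batch_df)
```

Only one chunk plus the rows collected for the current batch are in memory at any time. The price is that the file is read once per batch.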

## Handling Large-Scale Data
In real-world settings, EHR datasets are typically too big to load into memory at once, so we process them in batches.


In [3]:
# First we need to identify the individual patients, and then we assign them to batches. If you do not have an admissions file,
# you can take the unique patient IDs from the diagnoses file instead.

admissions_file = "/n/data1/hsph/biostat/celehs/lab/va67/MIMIC/mimic-iv-codified-3.1/files/mimiciv/3.1/hosp/admissions.csv"

admissions = pd.read_csv(admissions_file, dtype=str)

admissions.head(5)

Unnamed: 0,subject_id,hadm_id,admittime,dischtime,deathtime,admission_type,admit_provider_id,admission_location,discharge_location,insurance,language,marital_status,race,edregtime,edouttime,hospital_expire_flag
0,10000032,22595853,2180-05-06 22:23:00,2180-05-07 17:15:00,,URGENT,P49AFC,TRANSFER FROM HOSPITAL,HOME,Medicaid,English,WIDOWED,WHITE,2180-05-06 19:17:00,2180-05-06 23:30:00,0
1,10000032,22841357,2180-06-26 18:27:00,2180-06-27 18:49:00,,EW EMER.,P784FA,EMERGENCY ROOM,HOME,Medicaid,English,WIDOWED,WHITE,2180-06-26 15:54:00,2180-06-26 21:31:00,0
2,10000032,25742920,2180-08-05 23:44:00,2180-08-07 17:50:00,,EW EMER.,P19UTS,EMERGENCY ROOM,HOSPICE,Medicaid,English,WIDOWED,WHITE,2180-08-05 20:58:00,2180-08-06 01:44:00,0
3,10000032,29079034,2180-07-23 12:35:00,2180-07-25 17:55:00,,EW EMER.,P06OTX,EMERGENCY ROOM,HOME,Medicaid,English,WIDOWED,WHITE,2180-07-23 05:54:00,2180-07-23 14:00:00,0
4,10000068,25022803,2160-03-03 23:16:00,2160-03-04 06:26:00,,EU OBSERVATION,P39NWO,EMERGENCY ROOM,,,English,SINGLE,WHITE,2160-03-03 21:55:00,2160-03-04 06:26:00,0


In [4]:
patient_ids = admissions[['subject_id']].drop_duplicates()
patient_ids = patient_ids.sort_values(by='subject_id', ascending=True)
patient_ids = patient_ids.reset_index(drop=True)
print(patient_ids)


# specify the number of batches 
num_of_batches = 8

patient_ids['batch_num'] = (patient_ids.index % num_of_batches) + 1

print(patient_ids)

patient_ids.groupby('batch_num')['subject_id'].count().reset_index().rename(columns={'subject_id': 'patient_count'})

       subject_id
0        10000032
1        10000068
2        10000084
3        10000108
4        10000117
...           ...
223447   19999733
223448   19999784
223449   19999828
223450   19999840
223451   19999987

[223452 rows x 1 columns]
       subject_id  batch_num
0        10000032          1
1        10000068          2
2        10000084          3
3        10000108          4
4        10000117          5
...           ...        ...
223447   19999733          8
223448   19999784          1
223449   19999828          2
223450   19999840          3
223451   19999987          4

[223452 rows x 2 columns]


Unnamed: 0,batch_num,patient_count
0,1,27932
1,2,27932
2,3,27932
3,4,27932
4,5,27931
5,6,27931
6,7,27931
7,8,27931


### 4.1 Cleaning Diagnoses Data in batches

In [12]:
uncleaned_timed_diagnoses_icd_file = "uncleaned_timed_diagnoses_icd.csv"

uncleaned_timed_diagnoses_icd = pd.read_csv(uncleaned_timed_diagnoses_icd_file,nrows=5, dtype=str)
uncleaned_timed_diagnoses_icd

Unnamed: 0,subject_id,hadm_id,seq_num,icd_code,icd_version,admittime
0,10000032,22595853,1,5723,9,2180-05-06 22:23:00
1,10000032,22595853,2,78959,9,2180-05-06 22:23:00
2,10000032,22595853,3,5715,9,2180-05-06 22:23:00
3,10000032,22595853,4,07070,9,2180-05-06 22:23:00
4,10000032,22595853,5,496,9,2180-05-06 22:23:00


In [13]:
cols_of_interest = {
    "patient_id":"subject_id",
    "date":"admittime",
    "code":"icd_code",
    "code_version":"icd_version",  # if there is no code version, specify it as None: "code_version":None
    "coding_system":"ICD" # We specify the coding system, the others are column names
}


clean_data_by_batch(uncleaned_timed_diagnoses_icd_file, patient_ids, cols_of_interest,"Diagnoses",num_rows_to_load=15000000)

Processing batches: 100%|██████████████████████| 8/8 [00:51<00:00,  6.38s/batch]


### 4.2 Cleaning Procedures Data

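In MIMIC-IV, procedure data comes from two sources:
1. hcpcsevents.csv, where procedures are recorded as HCPCS codes (a code set that includes CPT codes)
2. procedures_icd.csv, where procedures are recorded as ICD-9/ICD-10 procedure codes

We will need to clean them both and concatenate them. Because both runs use the same data name ("Procedures"), their per-batch output files are written to the same folder.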

In [14]:
# Cleaning HCPCS events

import pandas as pd 

hcpcenvents_file = "/n/data1/hsph/biostat/celehs/lab/va67/MIMIC/mimic-iv-codified-3.1/files/mimiciv/3.1/hosp/hcpcsevents.csv"

hcpcenvents_top5rows = pd.read_csv(hcpcenvents_file, nrows=5,dtype=str)

hcpcenvents_top5rows

Unnamed: 0,subject_id,hadm_id,chartdate,hcpcs_cd,seq_num,short_description
0,10000068,25022803,2160-03-04,99218,1,Hospital observation services
1,10000084,29888819,2160-12-28,G0378,1,Hospital observation per hr
2,10000108,27250926,2163-09-27,99219,1,Hospital observation services
3,10000117,22927623,2181-11-15,43239,1,Digestive system
4,10000117,22927623,2181-11-15,G0378,2,Hospital observation per hr


In [15]:

cols_of_interest = {
    "patient_id":"subject_id",
    "date":"chartdate",
    "code":"hcpcs_cd",
    "code_version":None,  # if there is no code version, specify it as None: "code_version":None
    "coding_system":"HCPCS" # We specify the coding system, the others are column names
}


clean_data_by_batch(hcpcenvents_file, patient_ids, cols_of_interest,"Procedures",num_rows_to_load=15000000)

Processing batches: 100%|██████████████████████| 8/8 [00:02<00:00,  3.21batch/s]


In [16]:
# Processing procedures_icd.csv

procedures_icd_file = "/n/data1/hsph/biostat/celehs/lab/va67/MIMIC/mimic-iv-codified-3.1/files/mimiciv/3.1/hosp/procedures_icd.csv"

procedures_icd_top5rows = pd.read_csv(procedures_icd_file, dtype=str, nrows=5)

print(procedures_icd_top5rows)

  subject_id   hadm_id seq_num   chartdate icd_code icd_version
0   10000032  22595853       1  2180-05-07     5491           9
1   10000032  22841357       1  2180-06-27     5491           9
2   10000032  25742920       1  2180-08-06     5491           9
3   10000068  25022803       1  2160-03-03     8938           9
4   10000117  27988844       1  2183-09-19  0QS734Z          10


In [17]:

cols_of_interest = {
    "patient_id":"subject_id",
    "date":"chartdate",
    "code":"icd_code",
    "code_version":"icd_version",  # if there is no code version, specify it as None: "code_version":None
    "coding_system":"ICDPROC" # We specify the coding system, the others are column names
}


clean_data_by_batch(procedures_icd_file, patient_ids, cols_of_interest,"Procedures",num_rows_to_load=15000000)

Processing batches: 100%|██████████████████████| 8/8 [00:08<00:00,  1.01s/batch]
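
Both procedure sources now have one cleaned file per batch in the same folder. A minimal sketch of the concatenation step for a single batch is shown below. It writes two toy files to a temporary directory; the file names follow the `{data_name}_batch{n}_{filename}.csv` pattern used by `clean_data_by_batch`, and the rows are taken from the previews above.

```python
import glob
import os
import tempfile

import pandas as pd

with tempfile.TemporaryDirectory() as tmp_dir:
    # Toy stand-ins for the two per-batch files written for batch 1
    pd.DataFrame({
        "subject_id": ["10000068"], "date": ["2160-03-04"], "code": ["99218"], "coding_system": ["HCPCS"],
    }).to_csv(os.path.join(tmp_dir, "procedures_batch1_hcpcsevents.csv"), index=False)
    pd.DataFrame({
        "subject_id": ["10000032"], "date": ["2180-05-07"], "code": ["5491"], "coding_system": ["ICDPROC9"],
    }).to_csv(os.path.join(tmp_dir, "procedures_batch1_procedures_icd.csv"), index=False)

    # Collect every file for batch 1, whichever source it came from
    batch_files = sorted(glob.glob(os.path.join(tmp_dir, "procedures_batch1_*.csv")))
    procedures_batch1 = pd.concat(
        (pd.read_csv(path, dtype=str) for path in batch_files),
        ignore_index=True,
    ).drop_duplicates()

print(procedures_batch1)
```

Because the `coding_system` column travels with each row, HCPCS and ICD procedure codes stay distinguishable after concatenation.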


### 4.3 Cleaning Medications Data

In [6]:
prescriptions_file = "/n/data1/hsph/biostat/celehs/lab/va67/MIMIC/mimic-iv-codified-3.1/files/mimiciv/3.1/hosp/prescriptions.csv"

# Loading the top 5 rows to identify columns of interest
prescriptions_top5rows = pd.read_csv(prescriptions_file,dtype=str,nrows=5)

prescriptions_top5rows

Unnamed: 0,subject_id,hadm_id,pharmacy_id,poe_id,poe_seq,order_provider_id,starttime,stoptime,drug_type,drug,...,gsn,ndc,prod_strength,form_rx,dose_val_rx,dose_unit_rx,form_val_disp,form_unit_disp,doses_per_24_hrs,route
0,10000032,22595853,12775705,10000032-55,55,P85UQ1,2180-05-08 08:00:00,2180-05-07 22:00:00,MAIN,Furosemide,...,8209.0,51079007320,40mg Tablet,,40,mg,1.0,TAB,1,PO/NG
1,10000032,22595853,18415984,10000032-42,42,P23SJA,2180-05-07 02:00:00,2180-05-07 22:00:00,MAIN,Ipratropium Bromide Neb,...,21700.0,487980125,2.5mL Vial,,1,NEB,1.0,VIAL,4,IH
2,10000032,22595853,23637373,10000032-35,35,P23SJA,2180-05-07 01:00:00,2180-05-07 09:00:00,MAIN,Furosemide,...,8208.0,51079007220,20mg Tablet,,20,mg,1.0,TAB,1,PO/NG
3,10000032,22595853,26862314,10000032-41,41,P23SJA,2180-05-07 01:00:00,2180-05-07 01:00:00,MAIN,Potassium Chloride,...,1275.0,245004101,10mEq ER Tablet,,40,mEq,4.0,TAB,1,PO
4,10000032,22595853,30740602,10000032-27,27,P23SJA,2180-05-07 00:00:00,2180-05-07 22:00:00,MAIN,Sodium Chloride 0.9% Flush,...,,0,10 mL Syringe,,3,mL,0.3,SYR,3,IV


In [10]:
cols_of_interest = {
    "patient_id":"subject_id",
    "date":"starttime",
    "code":"ndc",
    "code_version":None,  # if there is no code version, specify it as None: "code_version":None
    "coding_system":"NDC" # We specify the coding system, the others are column names
}


clean_data_by_batch(prescriptions_file, patient_ids, cols_of_interest,"Medication",num_rows_to_load=15000000)

Processing batches: 100%|██████████████████████| 8/8 [06:01<00:00, 45.16s/batch]
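
In the prescriptions preview above, the last row has an `ndc` of 0. Because the file is read with `dtype=str`, this is the string "0" rather than a missing value, so the cleaning step keeps it. Assuming that 0 is a placeholder for "no NDC recorded" (an assumption worth checking against the MIMIC-IV documentation), such rows can be counted and set aside as in the sketch below, which uses toy rows modeled on the preview.

```python
import pandas as pd

# Toy cleaned medication rows; the code "0" mirrors the value seen in the preview
meds = pd.DataFrame({
    "subject_id": ["10000032", "10000032", "10000032"],
    "date": ["2180-05-08", "2180-05-07", "2180-05-07"],
    "code": ["51079007320", "487980125", "0"],
    "coding_system": ["NDC"] * 3,
})

# Assumption: a code consisting only of zeros carries no drug information
is_placeholder = meds["code"].str.fullmatch(r"0+")

print(f"Placeholder NDC rows: {is_placeholder.sum()} of {len(meds)}")
meds_with_ndc = meds[~is_placeholder]
```

Counting these rows before removing them shows how much medication data would be lost at the rollup stage.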


### 4.4 Cleaning Lab Data: Handling Large-Scale Data

In [6]:
labs_file = "/n/data1/hsph/biostat/celehs/lab/va67/MIMIC/mimic-iv-codified-3.1/files/mimiciv/3.1/hosp/labevents.csv"

labs_5 = pd.read_csv(labs_file,dtype=str,nrows=5)

labs_5

Unnamed: 0,labevent_id,subject_id,hadm_id,specimen_id,itemid,order_provider_id,charttime,storetime,value,valuenum,valueuom,ref_range_lower,ref_range_upper,flag,priority,comments
0,1,10000032,,2704548,50931,P69FQC,2180-03-23 11:51:00,2180-03-23 15:56:00,___,95.0,mg/dL,70.0,100.0,,ROUTINE,"IF FASTING, 70-100 NORMAL, >125 PROVISIONAL DI..."
1,2,10000032,,36092842,51071,P69FQC,2180-03-23 11:51:00,2180-03-23 16:00:00,NEG,,,,,,ROUTINE,
2,3,10000032,,36092842,51074,P69FQC,2180-03-23 11:51:00,2180-03-23 16:00:00,NEG,,,,,,ROUTINE,
3,4,10000032,,36092842,51075,P69FQC,2180-03-23 11:51:00,2180-03-23 16:00:00,NEG,,,,,,ROUTINE,BENZODIAZEPINE IMMUNOASSAY SCREEN DOES NOT DET...
4,5,10000032,,36092842,51079,P69FQC,2180-03-23 11:51:00,2180-03-23 16:00:00,NEG,,,,,,ROUTINE,


In [26]:
# Counting the number of lines in the lab file

file_line_count(labs_file)

158374765

Lab data is one of the largest subdatasets in EHR due to the high frequency of recorded lab events. In MIMIC-IV, there are over 158 million lab observations, and loading the entire dataset into memory at once is inefficient and may cause your program to crash. Therefore, we will process the data in batches, working with one batch of patients at a time.

In [7]:
# Each batch takes several minutes to process (about 3 minutes per batch in the run shown below); you can check the log folder for detailed progress

cols_of_interest = {
    "patient_id":"subject_id",
    "date":"charttime",
    "code":"itemid",
    "code_version":None,  # if there is no code version, specify it as None: "code_version":None
    "coding_system":"ITEMID" # We specify the coding system, the others are column names
}


clean_data_by_batch(labs_file , patient_ids, cols_of_interest,"Labs",num_rows_to_load=15000000)


# We process data in batches of patients, rather than batches of rows, so that all records of a patient are de-duplicated together.
# This lets us handle duplicates correctly even though we can't load the entire dataset into memory at once.
# If it takes too long to process a chunk, try reducing num_rows_to_load

Processing batches: 100%|███████████████████████████████████████████████████████████████████████████████████████| 8/8 [23:39<00:00, 177.42s/batch]
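
The toy example below illustrates why batches are formed by patient rather than by row position. The data and the batch assignment are made up: the same observation for patient 1 appears in the first and the last row, so it falls into two different row chunks.

```python
import pandas as pd

# Toy data: the same (patient, date, code) observation appears in rows 0 and 3
rows = pd.DataFrame({
    "subject_id": ["1", "2", "3", "1"],
    "date": ["2180-05-06"] * 4,
    "code": ["50931", "50931", "51071", "50931"],
})

# Option 1: de-duplicate fixed-size row chunks independently
chunk_size = 2
row_chunks = [rows.iloc[i:i + chunk_size].drop_duplicates() for i in range(0, len(rows), chunk_size)]
by_rows = pd.concat(row_chunks)

# Option 2: de-duplicate patient batches (all rows of a patient land in the same batch)
batch_of = {"1": 1, "2": 2, "3": 1}
patient_batches = [group.drop_duplicates() for _, group in rows.groupby(rows["subject_id"].map(batch_of))]
by_patients = pd.concat(patient_batches)

print("Rows left after row-chunk de-duplication:", len(by_rows))
print("Rows left after patient-batch de-duplication:", len(by_patients))
```

Row chunks miss the duplicate because its two copies never meet in the same chunk. Patient batches catch it because a duplicate always involves a single patient.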


## Step 5: Rolling up Data

For data to be effectively compared and analyzed, it must be standardized. For example, medical codes in electronic health records (EHR), such as medication codes, may come from different coding systems like NDC, RxNorm, or institution-specific local codes. "Rolling up" data from these various coding systems to a common parent coding system ensures standardization.

There are additional reasons to perform the rollup:

1. The raw EHR codes may be too specific, making analysis difficult or impractical.
2. Rolling up helps harmonize data across different institutions, enabling analysis on a larger scale.

In [12]:
import os
import pandas as pd

os.makedirs("Rollup_Mappings", exist_ok=True)
os.makedirs("Intermediate_Data", exist_ok=True)
os.makedirs("Rolledup_Data", exist_ok=True)

## Creating Rollup Dictionary

To roll up EHR codes to a parent-level code, we need to decide which coding system we will roll up to:
1. Diagnoses - We will be rolling up ICD and other diagnosis codes to PheCodes
2. Medication - We will be rolling up standard codes like RxNorm and NDC, as well as local medication codes, to RxNorm Ingredient level codes
3. Lab - We will be rolling up local lab codes and LOINC codes to LOINC Component codes
4. Procedures - We will be rolling up procedure codes like ICD-PCS/CPT-4 codes to CCS codes

Creating these rollup dictionaries requires a lot of manual processing and quality checks to ensure that the resulting mapping is accurate.

In [23]:
# TODO: write code to create rollup dictionaries



### Rolling up Diagnoses Data
ICD codes are often too granular to be used directly for research purposes. PheCodes solve this problem by grouping related ICD codes into clinically meaningful phenotypes.

In [26]:
icd_to_phecode_file = "./Rollup_Mappings/ICD_to_PheCode.csv"

icd_to_phecode = pd.read_csv(icd_to_phecode_file, dtype=str)

icd_to_phecode 

Unnamed: 0,code,PheCode,coding_system
0,001,008,ICD9
1,0010,008,ICD9
2,0011,008,ICD9
3,0019,008,ICD9
4,002,008,ICD9
...,...,...,...
98544,T524X3D,981,ICD10
98545,T532X4D,981,ICD10
98546,T533X4S,981,ICD10
98547,T521X3S,981,ICD10


In [27]:
# We will select a sample diagnoses file for rolling up

diagnoses_raw_data_cleaned = "./Cleaned_RawData/Diagnoses" 

diagnoses_files = os.listdir(diagnoses_raw_data_cleaned)

sample_diagnoses_filepath= os.path.join(diagnoses_raw_data_cleaned, diagnoses_files[0])
    
sample_diagnoses = pd.read_csv(sample_diagnoses_filepath, dtype=str)

sample_diagnoses

Unnamed: 0,subject_id,date,code,coding_system
0,10000032,2180-05-06,5723,ICD9
1,10000032,2180-05-06,78959,ICD9
2,10000032,2180-05-06,5715,ICD9
3,10000032,2180-05-06,07070,ICD9
4,10000032,2180-05-06,496,ICD9
...,...,...,...,...
799773,19999784,2121-01-31,Z5111,ICD10
799774,19999784,2121-01-31,C8589,ICD10
799775,19999784,2121-01-31,E876,ICD10
799776,19999784,2121-01-31,Z87891,ICD10


To perform the rollup using a join operation, the join columns must have the same names in the rollup mapping file and in the diagnoses file. Here both tables already use 'code' and 'coding_system'. If your mapping file uses different column names, rename them before merging.

In [29]:
# Merging the two tables to rollup/map icd code to phecode. Save this rolled up data in intermediate folder. 
# In future if you update rollup mapping to be more comprehensive or if you want to look at codes that are unmapped, you can always come back.
    
sample_diagnoses_phecode = pd.merge(sample_diagnoses, icd_to_phecode, how='left', on=['code','coding_system'])
    
sample_diagnoses_phecode['Rollup_Status'] = sample_diagnoses_phecode['PheCode'].notna().replace({True: '1', False: '0'})
sample_diagnoses_phecode

Unnamed: 0,subject_id,date,code,coding_system,PheCode,Rollup_Status
0,10000032,2180-05-06,5723,ICD9,571.81,1
1,10000032,2180-05-06,78959,ICD9,572,1
2,10000032,2180-05-06,5715,ICD9,571.51,1
3,10000032,2180-05-06,07070,ICD9,070.3,1
4,10000032,2180-05-06,496,ICD9,496,1
...,...,...,...,...,...,...
799773,19999784,2121-01-31,Z5111,ICD10,1010,1
799774,19999784,2121-01-31,C8589,ICD10,202.2,1
799775,19999784,2121-01-31,E876,ICD10,276.14,1
799776,19999784,2121-01-31,Z87891,ICD10,318,1


In [30]:
# Rows whose ICD codes could not be rolled up to a PheCode

sample_diagnoses_unrolled = sample_diagnoses_phecode[sample_diagnoses_phecode["Rollup_Status"]=="0"]
sample_diagnoses_unrolled

Unnamed: 0,subject_id,date,code,coding_system,PheCode,Rollup_Status
34,10000032,2180-07-23,V4986,ICD9,,0
44,10001319,2135-07-20,V270,ICD9,,0
46,10001319,2138-11-09,V270,ICD9,,0
48,10001319,2134-04-15,V270,ICD9,,0
52,10001843,2131-11-09,Y840,ICD10,,0
...,...,...,...,...,...,...
799680,19999784,2123-04-14,Y92239,ICD10,,0
799702,19999784,2123-04-28,R7401,ICD10,,0
799708,19999784,2123-04-28,Y92230,ICD10,,0
799737,19999784,2119-09-19,Y92239,ICD10,,0


In [32]:
# Summarize the codes that have not been rolled up, counting each code once per patient

unique_subject_icd_pairs = sample_diagnoses_unrolled[['subject_id', 'code','coding_system']].drop_duplicates()

icdcode_frequencies = unique_subject_icd_pairs[['code','coding_system']].value_counts().reset_index(name='counts')

sorted_icdcode_frequencies = icdcode_frequencies.rename(columns={'index': 'code'}).sort_values(by='counts', ascending=False)

sorted_icdcode_frequencies.head(10)

Unnamed: 0,code,coding_system,counts
0,Z20822,ICD10,3009
1,Y929,ICD10,1956
2,Y92230,ICD10,1115
3,V270,ICD9,1086
4,V4986,ICD9,1057
5,Y92009,ICD10,901
6,Y92239,ICD10,898
7,E8497,ICD9,867
8,E8490,ICD9,743
9,E8788,ICD9,694
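
Alongside the list of the most frequent unmapped codes, a single coverage number per coding system makes it easier to judge whether the rollup is good enough. The sketch below computes it on toy data laid out like `sample_diagnoses_phecode`; the codes and PheCode values are placeholders.

```python
import pandas as pd

# Toy rolled-up data in the same layout as sample_diagnoses_phecode
rolled = pd.DataFrame({
    "subject_id": ["1", "1", "2", "2", "3"],
    "code": ["A", "B", "C", "D", "E"],
    "coding_system": ["ICD9", "ICD9", "ICD10", "ICD10", "ICD10"],
    "PheCode": ["P1", None, "P2", "P3", None],
})
rolled["Rollup_Status"] = rolled["PheCode"].notna().map({True: "1", False: "0"})

# Share of rows that were mapped, overall and per coding system
mapped = rolled["Rollup_Status"] == "1"
overall_rate = mapped.mean()
rate_by_system = mapped.groupby(rolled["coding_system"]).mean()

print(f"Overall mapped: {overall_rate:.1%}")
print(rate_by_system)
```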


Once the data looks reasonable, with good enough rollup done, you can save the data. You can save the comprehensive data with rolled and unrolled info into the intermidate_data folder. You can come back to this if you need to check anything in the future. 

You can save the rolled-up file under Rolledup_Data.

We don't need all the columns once the rollup is done. Below we keep only the columns we need.

In [37]:
sample_diagnoses_phecode_filtered = sample_diagnoses_phecode[sample_diagnoses_phecode['Rollup_Status']=="1"]

print(sample_diagnoses_phecode_filtered )
sample_diagnoses_phecode_filtered  = sample_diagnoses_phecode_filtered [['subject_id','PheCode','date']]
sample_diagnoses_phecode_filtered 

       subject_id        date    code coding_system PheCode Rollup_Status
0        10000032  2180-05-06    5723          ICD9  571.81             1
1        10000032  2180-05-06   78959          ICD9     572             1
2        10000032  2180-05-06    5715          ICD9  571.51             1
3        10000032  2180-05-06   07070          ICD9   070.3             1
4        10000032  2180-05-06     496          ICD9     496             1
...           ...         ...     ...           ...     ...           ...
799773   19999784  2121-01-31   Z5111         ICD10    1010             1
799774   19999784  2121-01-31   C8589         ICD10   202.2             1
799775   19999784  2121-01-31    E876         ICD10  276.14             1
799776   19999784  2121-01-31  Z87891         ICD10     318             1
799777   19999784  2121-01-31   Z8619         ICD10     136             1

[747362 rows x 6 columns]


Unnamed: 0,subject_id,PheCode,date
0,10000032,571.81,2180-05-06
1,10000032,572,2180-05-06
2,10000032,571.51,2180-05-06
3,10000032,070.3,2180-05-06
4,10000032,496,2180-05-06
...,...,...,...
799773,19999784,1010,2121-01-31
799774,19999784,202.2,2121-01-31
799775,19999784,276.14,2121-01-31
799776,19999784,318,2121-01-31


In [38]:
if sample_diagnoses_phecode_filtered.duplicated().sum() > 0:
    print("Duplicate rows found. Removing duplicates...")
    sample_diagnoses_phecode_filtered = sample_diagnoses_phecode_filtered.drop_duplicates()  # Remove duplicate rows
    print("DataFrame after removing duplicates:")
else:
    print("No duplicate rows found.")

sample_diagnoses_phecode_filtered

Duplicate rows found. Removing duplicates...
DataFrame after removing duplicates:


Unnamed: 0,subject_id,PheCode,date
0,10000032,571.81,2180-05-06
1,10000032,572,2180-05-06
2,10000032,571.51,2180-05-06
3,10000032,070.3,2180-05-06
4,10000032,496,2180-05-06
...,...,...,...
799773,19999784,1010,2121-01-31
799774,19999784,202.2,2121-01-31
799775,19999784,276.14,2121-01-31
799776,19999784,318,2121-01-31


In [39]:
# Once cleaned, we can save the rollup data
sample_diagnoses_phecode_filtered.to_csv("./Rolledup_Data/diagnoses_phecode_rolled.csv", index=None)


## Defining Functions 

Since we will be performing operations similar to the ICD rollup on other data types, it's better to define them
as functions so we can reuse them.

In [2]:
def rollup(raw_level_data, rollup_mapping , join_columns, parent_column):
    
    rolledup_data = pd.merge(raw_level_data, rollup_mapping, how='left', on=join_columns)
    
    rolledup_data['Rollup_Status'] = rolledup_data[parent_column].notna().replace({True: '1', False: '0'})
    
    return rolledup_data


def summarize_unmapped(rolledup_data, child_column):
    
    rolledup_data_unmapped = rolledup_data[rolledup_data["Rollup_Status"]=="0"]
    
    unique_patient_code_pairs = rolledup_data_unmapped[['subject_id', child_column]].drop_duplicates()

    unmapped_code_frequencies = unique_patient_code_pairs[[child_column]].value_counts().reset_index(name='counts')

    unmapped_code_frequencies = unmapped_code_frequencies.rename(columns={'index': child_column})
    
    return unmapped_code_frequencies


def filter_rolledup_data(rolledup_data,cols_of_interest):
    
    filtered = rolledup_data[rolledup_data['Rollup_Status']=="1"]
    
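    # Select the columns and drop duplicates in one step. This returns a new
    # DataFrame and avoids an in-place edit on a slice (SettingWithCopyWarning).
    filtered = filtered[cols_of_interest].drop_duplicates()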
    
    return filtered
    

def rollup_data_by_batch(config, unrolled_data_folder):

    folder_name = os.path.basename(unrolled_data_folder)
    intermediate_dir = f"Intermediate_Data/{folder_name}"
    rolledup_dir = f"Rolledup_Data/{folder_name}"
    unrolled_summary_dir = "Summary/Unrolled"
    

    os.makedirs(intermediate_dir, exist_ok=True)
    os.makedirs(rolledup_dir, exist_ok=True)
    os.makedirs(unrolled_summary_dir, exist_ok=True)

    unrolled_freq_list = []

    files = os.listdir(unrolled_data_folder)

    batches = [file.split("_")[1] for file in files]
    unique_batches = set(batches)

    for batch in unique_batches:
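        # Match the batch token exactly; a substring test would also pick up
        # "batch10" files when looking for "batch1"
        load_files = [file for file in files if file.split("_")[1] == batch]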
        print(load_files)

        if len(load_files)==1:
            filepath = os.path.join(unrolled_data_folder, load_files[0])
            file_df = pd.read_csv(filepath, dtype=str)
            
        else:
            df_list = []
            for file in load_files:
                filepath = os.path.join(unrolled_data_folder, file)
                df_list.append(pd.read_csv(filepath, dtype=str))
            file_df = pd.concat(df_list)
            
        print(f"processing Batch: {batch}")
        
        intermediate_df = rollup(file_df, config["rollup_mapping"], [config['child_column'],"coding_system"], config['parent_column'])
        intermediate_file = os.path.join(intermediate_dir, f"intermediate_{batch}.csv")
        intermediate_df.to_csv(intermediate_file, index=False)
        
        unrolled_freq_list.append(summarize_unmapped(intermediate_df, config['child_column']))

        cols_of_interest = [config['patient_id'],config['date'],config['parent_column']]
        
        only_rolledup_df = filter_rolledup_data(intermediate_df, cols_of_interest)
        only_rolledup_filepath = os.path.join(rolledup_dir, f"rolledup_{batch}.csv")
        only_rolledup_df.to_csv(only_rolledup_filepath, index=False)
        
        print(only_rolledup_df)


    
    unrolled_summary_file = os.path.join(unrolled_summary_dir , "unrolled_" + folder_name + "_code_counts.csv")

    unrolled_summary =(
        pd.concat(unrolled_freq_list)  # Concatenate list of DataFrames
        .groupby(config['child_column'], as_index=False)["counts"]
        .sum()
        .sort_values(by="counts", ascending=False)
        )

    print(unrolled_summary)
    unrolled_summary.to_csv(unrolled_summary_file, index=False)
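
The `rollup` function above flags a row as rolled up when the parent column is not missing after the left join. A possible alternative, sketched here on toy data, is to let pandas report whether each row found a match through the `indicator` argument of `pd.merge`. The DataFrames, codes, and the `parent` column name below are invented for illustration and are not part of the pipeline.

```python
import pandas as pd

# Toy raw data and mapping, invented for illustration
raw = pd.DataFrame({
    "subject_id": ["1", "1", "2"],
    "code": ["A1", "B2", "A1"],
    "coding_system": ["ICD10", "ICD10", "ICD9"],
})
mapping = pd.DataFrame({
    "code": ["A1"],
    "coding_system": ["ICD10"],
    "parent": ["P100"],
})

# indicator=True adds a "_merge" column with "both" for matched rows
# and "left_only" for rows that found no match in the mapping
merged = pd.merge(raw, mapping, how="left", on=["code", "coding_system"], indicator=True)
merged["Rollup_Status"] = (merged["_merge"] == "both").map({True: "1", False: "0"})
merged = merged.drop(columns="_merge")

print(merged)
```

The two approaches differ in one case: if the mapping file itself contains a row whose parent value is missing, the indicator marks it as matched while the `notna()` check marks it as unmapped. Which behaviour is preferable depends on how the mapping file was built.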


### Diagnosis Data


In [33]:
unrolled_diagnoses_data = "./Cleaned_RawData/Diagnoses" 

diagnoses_files = os.listdir(unrolled_diagnoses_data)

sample_diagnoses_filepath = os.path.join(unrolled_diagnoses_data, diagnoses_files[0])

# Keep the file path and the loaded DataFrame in separate variables
sample_diagnoses = pd.read_csv(sample_diagnoses_filepath, dtype=str)

print(sample_diagnoses)



diagnoses_rollup_file = "./Rollup_Mappings/ICD_to_PheCode.csv"

diagnoses_to_phecode = pd.read_csv(diagnoses_rollup_file , dtype=str)

diagnoses_to_phecode

       subject_id        date    code coding_system
0        10000032  2180-05-06    5723          ICD9
1        10000032  2180-05-06   78959          ICD9
2        10000032  2180-05-06    5715          ICD9
3        10000032  2180-05-06   07070          ICD9
4        10000032  2180-05-06     496          ICD9
...           ...         ...     ...           ...
799773   19999784  2121-01-31   Z5111         ICD10
799774   19999784  2121-01-31   C8589         ICD10
799775   19999784  2121-01-31    E876         ICD10
799776   19999784  2121-01-31  Z87891         ICD10
799777   19999784  2121-01-31   Z8619         ICD10

[799778 rows x 4 columns]


Unnamed: 0,code,PheCode,coding_system
0,001,008,ICD9
1,0010,008,ICD9
2,0011,008,ICD9
3,0019,008,ICD9
4,002,008,ICD9
...,...,...,...
98544,T524X3D,981,ICD10
98545,T532X4D,981,ICD10
98546,T533X4S,981,ICD10
98547,T521X3S,981,ICD10


In [34]:
config={
    "patient_id":"subject_id",
    "parent_column":"PheCode", # The parent column should be exactly as the parent column in the rollup mapping file
    "child_column":"code",
    "date":"date",
    "rollup_mapping":diagnoses_to_phecode
}

rollup_data_by_batch(config ,unrolled_data_folder ="./Cleaned_RawData/Diagnoses")

['diagnoses_batch8_uncleaned_timed_diagnoses_icd.csv']
processing Batch: batch8
       subject_id        date PheCode
0        10000280  2151-03-18   681.2
1        10000886  2178-05-08   317.1
2        10001217  2157-11-18     324
3        10001217  2157-11-18   348.2
4        10001217  2157-11-18   348.2
...           ...         ...     ...
788003   19998878  2132-08-17     318
788004   19999298  2177-02-10   317.1
788005   19999298  2177-02-10   741.3
788006   19999733  2152-07-08     949
788007   19999733  2152-07-08     979

[735799 rows x 3 columns]
['diagnoses_batch1_uncleaned_timed_diagnoses_icd.csv']
processing Batch: batch1
       subject_id        date PheCode
0        10000032  2180-05-06  571.81
1        10000032  2180-05-06     572
2        10000032  2180-05-06  571.51
3        10000032  2180-05-06   070.3
4        10000032  2180-05-06     496
...           ...         ...     ...
799773   19999784  2121-01-31    1010
799774   19999784  2121-01-31   202.2
799775   199997

### Procedures Data

As mentioned before, procedure data comes from two sources: hcpcsevents.csv and procedures_icd.csv.
Our objective is to roll both up to CCS codes and merge them.

In [35]:
# Rolling up procedure data

import pandas as pd

unrolled_procedure_data = "./Cleaned_RawData/Procedures" 

procedure_files = os.listdir(unrolled_procedure_data)

# Use the folder variable defined above (procedure_raw_data_cleaned is not defined in this notebook)
sample_procedure_filepath = os.path.join(unrolled_procedure_data, procedure_files[0])
    
sample_procedure = pd.read_csv(sample_procedure_filepath, dtype=str)
print(sample_procedure)

procedure_rollup_file = "./Rollup_Mappings/HCPCS_ICDPROC_to_CCS.csv"

procedure_to_ccs = pd.read_csv(procedure_rollup_file , dtype=str)

procedure_to_ccs

      subject_id        date   code coding_system
0       10000904  2180-10-09  99218         HCPCS
1       10002131  2123-06-25  G0378         HCPCS
2       10002428  2155-07-14  G0378         HCPCS
3       10002428  2160-07-15  G0378         HCPCS
4       10002428  2157-07-16  27235         HCPCS
...          ...         ...    ...           ...
22547   19996651  2179-12-28  99219         HCPCS
22548   19997911  2193-09-02  G0378         HCPCS
22549   19999043  2165-06-03  G0378         HCPCS
22550   19999784  2119-07-09  99219         HCPCS
22551   19999784  2119-07-10  62270         HCPCS

[22552 rows x 4 columns]


Unnamed: 0,code,CCS,coding_system
0,00800ZZ,1,ICDPROC10
1,00803ZZ,1,ICDPROC10
2,00804ZZ,1,ICDPROC10
3,00870ZZ,1,ICDPROC10
4,00873ZZ,1,ICDPROC10
...,...,...,...
102208,M1146,999,HCPCS
102209,M1147,999,HCPCS
102210,M1148,999,HCPCS
102211,M1149,999,HCPCS


In [36]:
config={
    "patient_id":"subject_id",
    "parent_column":"CCS",
    "child_column":"code",
    "date":"date",
    "rollup_mapping":procedure_to_ccs
}

rollup_data_by_batch(config ,unrolled_data_folder ="./Cleaned_RawData/Procedures")

['procedures_batch8_hcpcsevents.csv', 'procedures_batch8_procedures_icd.csv']
processing Batch: batch8
       subject_id        date  CCS
0        10000280  2151-03-18  227
1        10000886  2178-05-08  227
2        10002425  2153-01-07  160
3        10002425  2153-01-07  227
4        10002807  2152-03-30  227
...           ...         ...  ...
128949   19998878  2132-06-19   61
128950   19998878  2132-03-31   70
128951   19998878  2132-05-26   61
128952   19998878  2132-05-26   61
128953   19998878  2132-05-26  191

[128451 rows x 3 columns]
['procedures_batch1_hcpcsevents.csv', 'procedures_batch1_procedures_icd.csv']
processing Batch: batch1
       subject_id        date  CCS
0        10000904  2180-10-09  227
1        10002131  2123-06-25  227
2        10002428  2155-07-14  227
3        10002428  2160-07-15  227
4        10002428  2157-07-16  146
...           ...         ...  ...
129324   19999784  2119-12-05  224
129325   19999784  2120-05-28  224
129326   19999784  2119-10-17  2

The above table is a bit different from the diagnoses_icd table we encountered before because it comes with a date. Since this table contains
all the information we need, we can go ahead with the rollup.

Now that the hcpcsevents file has been rolled up, we will roll up the procedures_icd file.

Since we have all the required data (subject_id, chartdate, icd_code and icd_version), we can go ahead with the rollup.

**Note**: When working with real-world data, if you don't have the required information, you will need to find the source and merge it in, as we did for diagnoses_icd.

In [72]:
#filtered_hcpcs_procedures_ccs.to_csv("./Rolledup_Data/procedures_hcpcs_rolled.csv", index=None)

Now that we have both procedure tables rolled up, we can concatenate them to create our final rolled-up procedures data.
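
A minimal sketch of that concatenation step is shown below. The two small DataFrames stand in for the rolled-up hcpcsevents and procedures_icd tables; their names and rows are invented for illustration and only the column layout (subject_id, date, CCS) follows the outputs above.

```python
import pandas as pd

# Stand-ins for the two rolled-up procedure tables (toy rows, not MIMIC data)
hcpcs_rolled = pd.DataFrame({
    "subject_id": ["1", "2"],
    "date": ["2150-01-01", "2150-02-01"],
    "CCS": ["227", "146"],
})
icd_rolled = pd.DataFrame({
    "subject_id": ["1", "3"],
    "date": ["2150-01-01", "2150-03-01"],
    "CCS": ["227", "61"],
})

# Stack both sources, then drop rows that describe the same
# patient, date and CCS category in both tables
procedures_rolled = (
    pd.concat([hcpcs_rolled, icd_rolled], ignore_index=True)
    .drop_duplicates()
    .reset_index(drop=True)
)

print(procedures_rolled)
```

Note that `rollup_data_by_batch` already concatenates all files that share a batch token before rolling up, so when both procedure files sit in the same folder this step happens inside the function.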

### Medication Data

In MIMIC, medication data comes from two sources:
1. prescriptions.csv, where prescribed medications are recorded
2. emar.csv, where medication administrations are recorded as NDC codes (highly granular information)

Information is often duplicated across sources. Some of these events can also be found in inputevents in the ICU module, which we can get into later.
Our objective is to map the codes to RxNorm and then to RxNorm ingredient codes.
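
The two-step mapping described above can be sketched as two chained left joins. Everything in this example is a placeholder: the NDC values, the RxNorm identifiers, and the `ingredient_rxcui` column name are made up for illustration, and the mapping file used later in this notebook may already combine both steps.

```python
import pandas as pd

# Toy prescriptions with placeholder NDC codes (not real products)
prescriptions = pd.DataFrame({
    "subject_id": ["1", "1", "2"],
    "date": ["2150-01-01", "2150-01-02", "2150-01-03"],
    "ndc": ["00000000001", "00000000002", "00000000009"],
})

# Step 1 mapping: NDC -> RxNorm concept (placeholder identifiers)
ndc_to_rxnorm = pd.DataFrame({
    "ndc": ["00000000001", "00000000002"],
    "rxcui": ["111", "222"],
})

# Step 2 mapping: RxNorm concept -> RxNorm ingredient (placeholder identifiers)
rxcui_to_ingredient = pd.DataFrame({
    "rxcui": ["111", "222"],
    "ingredient_rxcui": ["900", "900"],
})

# Chain the two left joins; rows without a match keep NaN in the new columns
step1 = prescriptions.merge(ndc_to_rxnorm, on="ndc", how="left")
step2 = step1.merge(rxcui_to_ingredient, on="rxcui", how="left")
step2["Rollup_Status"] = step2["ingredient_rxcui"].notna().map({True: "1", False: "0"})

print(step2)
```

Here two different products roll up to the same ingredient, which is the point of the second step: analyses at the ingredient level are less fragmented than analyses at the package level.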

In [7]:
unrolled_medication_data = "./Cleaned_RawData/Medication"
# Load a sample CSV file, as we did for procedures
medication_files = os.listdir(unrolled_medication_data)
sample_medication_filepath = os.path.join(unrolled_medication_data, medication_files[0])
sample_medication = pd.read_csv(sample_medication_filepath, dtype=str)

# Print the sample data to check what it looks like
print(sample_medication)

# Read the medication mapping file (NDC to RxNorm)
medication_rollup_file = "./Rollup_Mappings/NDC_to_RxNorm.csv"
medication_to_rxnorm = pd.read_csv(medication_rollup_file, dtype=str)

# Print the NDC-to-RxNorm mapping to check its structure
print(medication_to_rxnorm)

        subject_id        date         code coding_system
0         10000032  2180-05-08  51079007320           NDC
1         10000032  2180-05-07  00487980125           NDC
2         10000032  2180-05-07  51079007220           NDC
3         10000032  2180-05-07  00245004101           NDC
4         10000032  2180-05-07            0           NDC
...            ...         ...          ...           ...
2055122   19999784  2121-01-31  00338001711           NDC
2055123   19999784  2121-01-31  00143989001           NDC
2055124   19999784  2121-01-31  57896079101           NDC
2055125   19999784  2121-01-31  00703854023           NDC
2055126   19999784  2121-01-31  08290306510           NDC

[2055127 rows x 4 columns]
               code   RxNorm coding_system
0       00295117904     5499           NDC
1       00295117916     5499           NDC
2       00363026816     5499           NDC
3       00363026832     5499           NDC
4       00363087143     5499           NDC
...             ..

In [6]:
# Processing prescriptions.
config={
    "patient_id":"subject_id",
    "parent_column":"RxNorm", # The parent column should be exactly as the parent column in the rollup mapping file
    "child_column":"code",
    "date":"date",
    "rollup_mapping":medication_to_rxnorm
}

rollup_data_by_batch(config ,unrolled_data_folder ="./Cleaned_RawData/Medication")

['medication_batch8_prescriptions.csv']
processing Batch: batch8
        subject_id        date       RxNorm
0         10001217  2157-11-22      1721265
1         10001217  2157-11-21        26225
2         10001217  2157-11-22         4850
3         10001217  2157-11-22        11124
5         10001217  2157-11-23         9863
...            ...         ...          ...
2020156   19998878  2132-08-19         4850
2020157   19998878  2132-08-19  34322_55018
2020158   19998878  2132-08-22        18631
2020159   19998878  2132-08-18         6218
2020160   19998878  2132-08-17         4511

[1869640 rows x 3 columns]
['medication_batch2_prescriptions.csv']
processing Batch: batch2
        subject_id        date      RxNorm
0         10000635  2136-06-19        8591
1         10000635  2136-06-19     1547585
2         10000635  2136-06-19        4850
3         10000635  2136-06-19        4832
4         10000635  2136-06-19  10763_5487
...            ...         ...         ...
2056287   199

In [None]:
# Following the same pipeline as before


# loading only a subset of required columns helps save memory

cols=['subject_id', 'starttime', 'ndc']
prescriptions = pd.read_csv(prescriptions_file ,dtype=str,usecols=cols)
prescriptions['starttime'] = prescriptions['starttime'].str[:10]
prescriptions.drop_duplicates(inplace=True)
print(prescriptions)



# #Loading the medication Rollup file
# ndc_to_rxnorm_file = "./Rollup_Mappings/ndc_to_rxnorm.csv"
# ndc_to_rxnorm = pd.read_csv(ndc_to_rxnorm_file, dtype=str)
# print(ndc_to_rxnorm.head())

# # Performing rollup
# join_columns=['ndc']

# ndc_prescription_rxnorm= rollup(prescriptions ,ndc_to_rxnorm , join_columns ,"rxcui")
# ndc_prescription_rxnorm.to_csv("./Intermediate_Data/ndc_prescription_rxnorm.csv", index=None)
# print(ndc_prescription_rxnorm.head())

# # Summarizing unmapped codes
# print(summarize_unmapped(ndc_prescription_rxnorm,'ndc').head())

# # Extracting only relevant information
# extract_columns =['subject_id','rxcui','starttime']
# filtered_ndc_prescription_rxnorm= filter_rolledup_data(ndc_prescription_rxnorm, extract_columns)
# filtered_ndc_prescription_rxnorm.head()

# Check for rows with empty values
print(prescriptions.isna().sum())
print(prescriptions[prescriptions.isna().any(axis=1)])

### Laboratory Data

In MIMIC, lab observations are saved in labevents.csv. Lab observations are generally recorded as LOINC codes in US EHR systems;
however, the latest MIMIC data doesn't include them.

In [None]:
# Define the path to the folder containing lab data
unrolled_lab_data = "./Cleaned_RawData/Labs"

# List all the files in the lab data directory
lab_files = os.listdir(unrolled_lab_data)

# Load a sample lab file
sample_lab_filepath = os.path.join(unrolled_lab_data, lab_files[0])
sample_lab = pd.read_csv(sample_lab_filepath, dtype=str)

# Print the sample lab data to check its structure
print(sample_lab)

If you look at the coding_system column, it's ITEMID, which is not a standard coding system. MIMIC-IV provides a mapping from ITEMID to LOINC codes. We will roll up LOINC codes to LOINC components. Similarly, in real-world EHR data, you will have to source and merge data from different sources.

In [5]:
itemid_to_loinc = pd.read_csv("./Metadata/lab_itemid_to_loinc.csv", dtype=str)
itemid_to_loinc.dropna(subset=['loinc'],inplace=True)
itemid_to_loinc.head(5)

Unnamed: 0,itemid,label,fluid,category,valueuom,loinc,loinc_version,notes
5,50903,Cholesterol Ratio (Total/HDL),Blood,Chemistry,Ratio,9830-1,2.71,Mass ratio is more common than molar ratio in ...
6,50911,"Creatine Kinase, MB Isoenzyme",Blood,Chemistry,ng/mL,13969-1,2.71,This is the LOINC code US labs use
8,50937,Hepatitis A Virus Antibody,Blood,Chemistry,N/A|Pos/Neg,13951-9,2.71,More specific method
10,50941,Hepatitis B Surface Antigen,Blood,Chemistry,,5196-1,2.71,More specific method
11,50942,Hepatitis B Virus Core Antibody,Blood,Chemistry,,13952-7,2.71,More specific method


In [13]:
loinc_hierarchy = pd.read_csv("./Metadata/LOINC_Hierarchy_v2.73_version4.csv",dtype=str)
loinc_hierarchy.head(5)

mapping = loinc_hierarchy[['LOINC','LOINC_DESCRIPTION']]
print(mapping)
# drop_duplicates returns a new DataFrame, so nothing is edited in place on a slice
mapping = mapping.drop_duplicates(subset=['LOINC'])
print(mapping)
mapping.to_csv("loinc_desc.csv",index=None)

           LOINC               LOINC_DESCRIPTION
0         1000-9             DBG Ab SerPl BPU Ql
1       100019-9          ALK gene Mut Anl Bld/T
2       100020-7        GNA11 gene Mut Anl Bld/T
3       100021-5         GNAQ gene Mut Anl Bld/T
4       100022-3         IDH1 gene Mut Anl Bld/T
...          ...                             ...
172158   96976-6  pCO2 baseline-SubMx Exer delta
172159   98501-0           L eye Near VA--uncorr
172160   98502-8           R eye Near VA--uncorr
172161   98508-5             L eye Near VA--corr
172162   98509-3             R eye Near VA--corr

[172163 rows x 2 columns]
           LOINC               LOINC_DESCRIPTION
0         1000-9             DBG Ab SerPl BPU Ql
1       100019-9          ALK gene Mut Anl Bld/T
2       100020-7        GNA11 gene Mut Anl Bld/T
3       100021-5         GNAQ gene Mut Anl Bld/T
4       100022-3         IDH1 gene Mut Anl Bld/T
...          ...                             ...
172158   96976-6  pCO2 baseline-SubMx Exer



In [7]:
loinc_rollup = loinc_hierarchy[['LOINC','PARENT_LOINC']]
loinc_rollup = loinc_rollup.rename(columns={'LOINC':'loinc','PARENT_LOINC':'LoincComponent'})
itemid_to_loinc=itemid_to_loinc[['itemid','loinc']]

lab_to_LoincComponent = pd.merge(itemid_to_loinc,loinc_rollup, on ="loinc", how="left")
print(lab_to_LoincComponent.head(5))

lab_to_LoincComponent=lab_to_LoincComponent.rename(columns={"itemid":'code'})
lab_to_LoincComponent['coding_system']='ITEMID'

lab_to_LoincComponent=lab_to_LoincComponent[['code','LoincComponent','coding_system']]
lab_to_LoincComponent.head(5)

  itemid    loinc LoincComponent
0  50903   9830-1     LP307370-9
1  50911  13969-1      LP15513-2
2  50937  13951-9      LP38316-3
3  50941   5196-1      LP38331-2
4  50942  13952-7      LP38323-9


Unnamed: 0,code,LoincComponent,coding_system
0,50903,LP307370-9,ITEMID
1,50911,LP15513-2,ITEMID
2,50937,LP38316-3,ITEMID
3,50941,LP38331-2,ITEMID
4,50942,LP38323-9,ITEMID
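
Because the mapping above is built by joining two files, it is worth checking that each code appears only once before using it in a left join; if a code maps to several parents, the join silently multiplies the patient rows. The sketch below shows one way to check this, assuming a mapping with the same three columns as above. The helper name `duplicated_keys` and the toy component labels are invented for illustration.

```python
import pandas as pd


def duplicated_keys(mapping, key_cols):
    """Return the mapping rows whose key combination appears more than once."""
    return mapping[mapping.duplicated(subset=key_cols, keep=False)]


# Toy mapping in which one itemid has two parents (labels are made up)
toy_mapping = pd.DataFrame({
    "code": ["50903", "50911", "50911"],
    "LoincComponent": ["LP-A", "LP-B", "LP-C"],
    "coding_system": ["ITEMID"] * 3,
})
dupes = duplicated_keys(toy_mapping, ["code", "coding_system"])
print(dupes)

# pandas can also enforce uniqueness at merge time: validate="many_to_one"
# raises MergeError when the right-hand keys are not unique
toy_labs = pd.DataFrame({
    "subject_id": ["1"],
    "date": ["2150-01-01"],
    "code": ["50911"],
    "coding_system": ["ITEMID"],
})
try:
    pd.merge(toy_labs, toy_mapping, how="left",
             on=["code", "coding_system"], validate="many_to_one")
    merge_ok = True
except pd.errors.MergeError as err:
    merge_ok = False
    print(f"Mapping is not unique per key: {err}")
```

If duplicates do turn up, how to resolve them (keep one parent, keep all, or review by hand) is a study-specific decision rather than something the code can settle.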


In [9]:
# Processing labs.
config={
    "patient_id":"subject_id",
    "parent_column":"LoincComponent", # The parent column should be exactly as the parent column in the rollup mapping file
    "child_column":"code",
    "date":"date",
    "rollup_mapping":lab_to_LoincComponent
}

rollup_data_by_batch(config ,unrolled_data_folder ="./Cleaned_RawData/Labs")

['labs_batch6_labevents.csv']
processing Batch: batch6
         subject_id        date LoincComponent
5          10000161  2163-05-03      LP15957-1
7          10000161  2163-05-03      LP14328-6
8          10000161  2163-05-03      LP14539-8
12         10000161  2163-05-03      LP14540-6
16         10000161  2163-05-03      LP14313-8
...             ...         ...            ...
16452202   19999565  2185-06-04      LP14267-6
16452209   19999565  2185-06-04      LP15957-1
16452210   19999565  2185-06-04      LP14082-9
16452233   19999565  2185-06-04      LP14635-4
16452254   19999565  2185-06-05      LP14635-4

[1708560 rows x 3 columns]
['labs_batch4_labevents.csv']
processing Batch: batch4
         subject_id        date LoincComponent
1          10000108  2163-09-27      LP14328-6
2          10000108  2163-09-27      LP14539-8
5          10000108  2163-09-27      LP14540-6
9          10000108  2163-09-27      LP14313-8
10         10000108  2163-09-27      LP14267-6
...             

In [None]:
# loading only a subset of required columns helps save memory

cols=['subject_id', 'charttime', 'itemid']
labs = pd.read_csv(labs_file ,dtype=str,usecols=cols)
labs

In [None]:
# labevents has a charttime column (not starttime); keep only the date part
labs['charttime'] = labs['charttime'].str[:10]
labs.drop_duplicates(inplace=True)
print(labs)

In [None]:
import pandas as pd
import os

# Parameters for chunk processing
chunk_size = 5000000  # Adjust based on your system's memory

# Input and output file paths
labs_file = "/n/data1/hsph/biostat/celehs/lab/va67/MIMIC/mimic-iv-codified-3.1/files/mimiciv/3.1/hosp/labevents.csv"  # Replace with your file path
itemid_to_loinc_file = "./Metadata/lab_itemid_to_loinc.csv"
output_file = "./Intermediate_Data/loinc_labs.csv"

# Columns to read
cols=['subject_id', 'charttime', 'itemid']
join_columns=['itemid']

# Loading the ITEMID-to-LOINC mapping file
print("Loading loinc mapping...")
itemid_to_loinc = pd.read_csv(itemid_to_loinc_file , dtype=str)
itemid_to_loinc=itemid_to_loinc[['itemid','loinc']]


print("Mapping loaded:")
print(itemid_to_loinc.head())

# Ensure the output file is initialized
if os.path.exists(output_file):
    os.remove(output_file)

# Process the labs file in chunks
print("Processing prescriptions file in batches...")
for i, chunk in enumerate(pd.read_csv(labs_file, dtype=str, usecols=cols, chunksize=chunk_size)):
    print(f"Processing batch {i+1}...")
    
    # Process the chunk
    chunk['charttime'] = chunk['charttime'].str[:10]
    chunk.drop_duplicates(inplace=True)

    print(chunk)
    
    # Perform the rollup for this chunk
    labs_chunk_rolled = rollup(chunk, itemid_to_loinc, join_columns,"loinc")

    print(labs_chunk_rolled)
    
    # Append the chunk to the output file
    labs_chunk_rolled.to_csv(output_file, mode='a', index=False, header=not os.path.exists(output_file))
    
    print(f"Batch {i+1} appended to {output_file}")


Loading loinc mapping...
Mapping loaded:
  itemid loinc
0  50856   NaN
1  50867   NaN
2  50873   NaN
3  50874   NaN
4  50893   NaN
Processing prescriptions file in batches...
Processing batch 1...
        subject_id itemid   charttime
0         10000032  50931  2180-03-23
1         10000032  51071  2180-03-23
2         10000032  51074  2180-03-23
3         10000032  51075  2180-03-23
4         10000032  51079  2180-03-23
...            ...    ...         ...
4999995   10327918  51678  2146-09-20
4999996   10327918  51221  2146-09-20
4999997   10327918  51222  2146-09-20
4999998   10327918  51248  2146-09-20
4999999   10327918  51249  2146-09-20

[4403744 rows x 3 columns]
        subject_id itemid   charttime    loinc Rollup_Status
0         10000032  50931  2180-03-23   2345-7             1
1         10000032  51071  2180-03-23      NaN             0
2         10000032  51074  2180-03-23  19270-8             1
3         10000032  51075  2180-03-23      NaN             0
4         1000

In [5]:
import pandas as pd
import os

labs_file = "./Intermediate_Data/loinc_labs.csv"
output_file = "./Intermediate_Data/filtered_labs_loinc.csv"

# Columns to extract
extract_columns = ['subject_id', 'charttime', 'loinc']

# Initialize output file
if os.path.exists(output_file):
    os.remove(output_file)

# Define a chunk size for reading
chunk_size = 5000000   # Adjust based on your system's memory

# Process the labs file in chunks
print("Processing labs file in chunks...")
for i, chunk in enumerate(pd.read_csv(labs_file, dtype=str, chunksize=chunk_size)):
    print(f"Processing batch {i+1}...")
    print(chunk)

    # Filter the chunk to include only relevant columns
    filtered_chunk =  filter_rolledup_data(chunk, extract_columns)
    print(filtered_chunk)
    
    # Append the filtered chunk to the output file
    filtered_chunk.to_csv(output_file, mode='a', index=False, header=not os.path.exists(output_file))
    
    print(f"Batch {i+1} appended to {output_file}")

print("Processing complete!")

Processing labs file in chunks...
Processing batch 1...
        subject_id itemid   charttime    loinc Rollup_Status
0         10000032  50931  2180-03-23   2345-7             1
1         10000032  51071  2180-03-23      NaN             0
2         10000032  51074  2180-03-23  19270-8             1
3         10000032  51075  2180-03-23      NaN             0
4         10000032  51079  2180-03-23      NaN             0
...            ...    ...         ...      ...           ...
4999995   10375741  50971  2147-02-10      NaN             0
4999996   10375741  50983  2147-02-10      NaN             0
4999997   10375741  51006  2147-02-10      NaN             0
4999998   10375741  51678  2147-02-10      NaN             0
4999999   10375741  51221  2147-02-10      NaN             0

[5000000 rows x 5 columns]
        subject_id   charttime    loinc
0         10000032  2180-03-23   2345-7
2         10000032  2180-03-23  19270-8
21        10000032  2180-03-23   9830-1
22        10000032  2180

In [None]:
# Load data and process it by batches; write a function that keeps appending the resulting data
# Also import tqdm
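
The note above could be implemented along the following lines. This is a sketch rather than the pipeline's actual function: `process_csv_in_chunks` and `truncate_to_date` are hypothetical names, the demo file is a tiny temporary CSV created on the spot, and tqdm is treated as optional so the code still runs if it is not installed.

```python
import os
import tempfile

import pandas as pd

try:
    from tqdm import tqdm
except ImportError:  # fall back to a plain iterator when tqdm is unavailable
    def tqdm(iterable, **kwargs):
        return iterable


def process_csv_in_chunks(input_file, output_file, transform,
                          chunk_size=1_000_000, **read_kwargs):
    """Read a CSV in chunks, apply `transform` to each chunk, append results to one file."""
    if os.path.exists(output_file):
        os.remove(output_file)  # start from a clean output file

    total_rows = 0
    reader = pd.read_csv(input_file, dtype=str, chunksize=chunk_size, **read_kwargs)
    for i, chunk in enumerate(tqdm(reader, desc="chunks")):
        result = transform(chunk)
        # Write the header only with the first chunk, then append
        result.to_csv(output_file, mode="a", index=False, header=(i == 0))
        total_rows += len(result)
    return total_rows


def truncate_to_date(chunk):
    """Keep only the date part of charttime and drop duplicates within the chunk."""
    chunk = chunk.copy()
    chunk["charttime"] = chunk["charttime"].str[:10]
    return chunk.drop_duplicates()


# Demo on a tiny temporary file (toy rows, not MIMIC data)
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "toy_labevents.csv")
    dst = os.path.join(tmp, "toy_output.csv")
    pd.DataFrame({
        "subject_id": ["1", "1", "2", "2"],
        "charttime": ["2150-01-01 08:00:00", "2150-01-01 09:00:00",
                      "2150-01-02 10:00:00", "2150-01-03 10:00:00"],
        "itemid": ["50931", "50931", "51071", "51071"],
    }).to_csv(src, index=False)

    rows_written = process_csv_in_chunks(src, dst, truncate_to_date, chunk_size=2)
    output = pd.read_csv(dst, dtype=str)

print(rows_written)
print(output)
```

One limitation to keep in mind: duplicates are removed within each chunk only, so identical rows that fall into different chunks survive and would need a final de-duplication pass over the output.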