<div style="text-align: center; font-weight: bold;">
    <h1>Pipeline for Research ready EHR Datasets</h1>
    <h2>Part 2: Cleaning, Normalizing and Rolling up the EHR Data</h2>
    <h4>Author: Vidul Ayakulangara Panickan</h4>
</div>



## Step 4: Cleaning!

MIMIC data has already been processed for analysis, so it is not a typical dataset you would encounter in a hospital system. Generally, EHR data is stored in databases and each observation is timestamped. In MIMIC-IV, however, certain tables, such as diagnoses_icd, carry only the hospital admission identifier, so we have to get the admission time from a different table (admissions). Lab, procedure, and microbiology events, in contrast, include their own timestamps.

In [1]:
import pandas as pd

diagnoses_icd_file = "/n/data1/hsph/biostat/celehs/lab/va67/MIMIC/mimic-iv-codified-3.1/files/mimiciv/3.1/hosp/diagnoses_icd.csv"
admissions_file ="/n/data1/hsph/biostat/celehs/lab/va67/MIMIC/mimic-iv-codified-3.1/files/mimiciv/3.1/hosp/admissions.csv"

diagnoses_icd = pd.read_csv(diagnoses_icd_file,dtype=str)

admissions = pd.read_csv(admissions_file,dtype=str)

print(diagnoses_icd.columns)
print(admissions.columns)

Index(['subject_id', 'hadm_id', 'seq_num', 'icd_code', 'icd_version'], dtype='object')
Index(['subject_id', 'hadm_id', 'admittime', 'dischtime', 'deathtime',
       'admission_type', 'admit_provider_id', 'admission_location',
       'discharge_location', 'insurance', 'language', 'marital_status', 'race',
       'edregtime', 'edouttime', 'hospital_expire_flag'],
      dtype='object')


In [2]:
# We only need subject_id, hadm_id and admittime from the admissions table for this operation

diagnoses_icd = pd.merge(diagnoses_icd,admissions[['subject_id','hadm_id','admittime']], how='left', on=['subject_id','hadm_id'])
diagnoses_icd

Unnamed: 0,subject_id,hadm_id,seq_num,icd_code,icd_version,admittime
0,10000032,22595853,1,5723,9,2180-05-06 22:23:00
1,10000032,22595853,2,78959,9,2180-05-06 22:23:00
2,10000032,22595853,3,5715,9,2180-05-06 22:23:00
3,10000032,22595853,4,07070,9,2180-05-06 22:23:00
4,10000032,22595853,5,496,9,2180-05-06 22:23:00
...,...,...,...,...,...,...
6364483,19999987,23865745,7,41401,9,2145-11-02 21:38:00
6364484,19999987,23865745,8,78039,9,2145-11-02 21:38:00
6364485,19999987,23865745,9,0413,9,2145-11-02 21:38:00
6364486,19999987,23865745,10,36846,9,2145-11-02 21:38:00
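
A left merge silently multiplies rows if the right-hand table holds more than one row per key, so it can help to confirm that assumption before relying on the result. The sketch below uses made-up toy rows (the `toy_*` names and values are illustrative, not MIMIC data) and the `validate` argument of `pd.merge` to assert that each (`subject_id`, `hadm_id`) pair appears at most once on the admissions side.

```python
import pandas as pd

# Toy stand-ins for diagnoses_icd and admissions (values are made up)
toy_diagnoses = pd.DataFrame({
    "subject_id": ["1", "1", "2"],
    "hadm_id": ["10", "10", "20"],
    "icd_code": ["5723", "496", "41401"],
})
toy_admissions = pd.DataFrame({
    "subject_id": ["1", "2"],
    "hadm_id": ["10", "20"],
    "admittime": ["2180-05-06 22:23:00", "2145-11-02 21:38:00"],
})

# validate="many_to_one" raises pandas.errors.MergeError if a key is repeated on the right
toy_merged = pd.merge(
    toy_diagnoses,
    toy_admissions,
    how="left",
    on=["subject_id", "hadm_id"],
    validate="many_to_one",
)

# A left merge against unique keys should leave the row count unchanged
assert len(toy_merged) == len(toy_diagnoses)
print(toy_merged)
```

If the check fails on real data, the duplicated keys in the right-hand table are worth inspecting before deciding which row to keep.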


In [3]:
# Check for records without any admission time

print("Number of records with no hospital admission time:", diagnoses_icd['admittime'].isna().sum())

# If you have records without dates, remove them
diagnoses_icd = diagnoses_icd.dropna(subset=['admittime'])

diagnoses_icd

Number of records with no hospital admission time: 0


Unnamed: 0,subject_id,hadm_id,seq_num,icd_code,icd_version,admittime
0,10000032,22595853,1,5723,9,2180-05-06 22:23:00
1,10000032,22595853,2,78959,9,2180-05-06 22:23:00
2,10000032,22595853,3,5715,9,2180-05-06 22:23:00
3,10000032,22595853,4,07070,9,2180-05-06 22:23:00
4,10000032,22595853,5,496,9,2180-05-06 22:23:00
...,...,...,...,...,...,...
6364483,19999987,23865745,7,41401,9,2145-11-02 21:38:00
6364484,19999987,23865745,8,78039,9,2145-11-02 21:38:00
6364485,19999987,23865745,9,0413,9,2145-11-02 21:38:00
6364486,19999987,23865745,10,36846,9,2145-11-02 21:38:00


In [4]:
# For a typical analysis the time component is not needed, so we keep only the date (YYYY-MM-DD).
# This may change depending on the needs of the analysis.

diagnoses_icd['admittime'] = diagnoses_icd['admittime'].str[:10]

diagnoses_icd

Unnamed: 0,subject_id,hadm_id,seq_num,icd_code,icd_version,admittime
0,10000032,22595853,1,5723,9,2180-05-06
1,10000032,22595853,2,78959,9,2180-05-06
2,10000032,22595853,3,5715,9,2180-05-06
3,10000032,22595853,4,07070,9,2180-05-06
4,10000032,22595853,5,496,9,2180-05-06
...,...,...,...,...,...,...
6364483,19999987,23865745,7,41401,9,2145-11-02
6364484,19999987,23865745,8,78039,9,2145-11-02
6364485,19999987,23865745,9,0413,9,2145-11-02
6364486,19999987,23865745,10,36846,9,2145-11-02


In [5]:
# For diagnosis data, we typically keep subject_id, icd_code, icd_version and the date. You can retain other columns as needed
diagnoses_icd=diagnoses_icd[['subject_id','icd_code','icd_version','admittime']]

# Further we rename admittime to date
diagnoses_icd = diagnoses_icd.rename(columns={'admittime': 'date'})

# Display the updated DataFrame
print(diagnoses_icd)

        subject_id icd_code icd_version        date
0         10000032     5723           9  2180-05-06
1         10000032    78959           9  2180-05-06
2         10000032     5715           9  2180-05-06
3         10000032    07070           9  2180-05-06
4         10000032      496           9  2180-05-06
...            ...      ...         ...         ...
6364483   19999987    41401           9  2145-11-02
6364484   19999987    78039           9  2145-11-02
6364485   19999987     0413           9  2145-11-02
6364486   19999987    36846           9  2145-11-02
6364487   19999987     7810           9  2145-11-02

[6364488 rows x 4 columns]


#### Establish date range for the data (Not applicable to MIMIC-IV)
MIMIC-IV has adjusted dates, but in real-world datasets it is vital to ensure the dates make sense. We remove records whose dates fall outside a plausible range, e.g. records dated before 1980 or after the present year. The code for that is provided below; however, since the dates in MIMIC-IV are adjusted, we can't perform this cleaning operation here.

In [None]:
# diagnoses_icd = diagnoses_icd[(diagnoses_icd['date'].str[:4].astype(int) >= 1980) & (diagnoses_icd['date'].str[:4].astype(int) <= 2024)]
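
For a real-world dataset, the range check could look roughly like the following. This is a minimal sketch on made-up rows; the 1980 lower bound comes from the text above, while the use of the current year as the upper bound and the `toy` frame are assumptions for illustration.

```python
import pandas as pd

# Toy records with one date that is too early and one that is in the future
toy = pd.DataFrame({
    "subject_id": ["1", "2", "3", "4"],
    "date": ["1975-03-01", "2001-07-15", "2019-12-31", "2090-01-01"],
})

min_year = 1980
max_year = pd.Timestamp.today().year

# errors="coerce" turns unparseable dates into NaT, which then fail the range check
years = pd.to_datetime(toy["date"], errors="coerce").dt.year

toy_in_range = toy[years.between(min_year, max_year)]
print(toy_in_range)
```

Parsing the dates first, rather than slicing the year out of the string, also catches malformed values that would otherwise raise during the integer conversion.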

In [6]:
# Check for empty cells and duplicated rows in your data

# Check for empty cells
if diagnoses_icd.isnull().values.any():
    print("Empty cells found. Removing rows with empty cells...")
    diagnoses_icd = diagnoses_icd.dropna()  # Drop rows with any null values
    print("DataFrame after removing empty cells:")
    print(diagnoses_icd)
else:
    print("No empty cells found.")

# Check for duplicates
if diagnoses_icd.duplicated().sum() > 0:
    print("Duplicate rows found. Removing duplicates...")
    diagnoses_icd = diagnoses_icd.drop_duplicates()  # Remove duplicate rows
    print("DataFrame after removing duplicates:")
    print(diagnoses_icd)
else:
    print("No duplicate rows found.")

No empty cells found.
Duplicate rows found. Removing duplicates...
DataFrame after removing duplicates:
        subject_id icd_code icd_version        date
0         10000032     5723           9  2180-05-06
1         10000032    78959           9  2180-05-06
2         10000032     5715           9  2180-05-06
3         10000032    07070           9  2180-05-06
4         10000032      496           9  2180-05-06
...            ...      ...         ...         ...
6364483   19999987    41401           9  2145-11-02
6364484   19999987    78039           9  2145-11-02
6364485   19999987     0413           9  2145-11-02
6364486   19999987    36846           9  2145-11-02
6364487   19999987     7810           9  2145-11-02

[6356481 rows x 4 columns]


## Step 5: Normalization

Normalization is typically done for laboratory observations. We need to ensure the observed values are on the same scale. Sometimes the same lab is recorded with multiple units, so it is important to normalize the values to make them suitable for analysis.
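
As a rough illustration of unit normalization, the sketch below converts toy glucose results recorded in two units to a single target unit through a small conversion table. The column names, the `to_target` table, and the values are hypothetical; a real pipeline would need a reviewed conversion table for every lab and unit pair.

```python
import pandas as pd

# Toy lab results; values and column names are illustrative
toy_labs = pd.DataFrame({
    "lab": ["glucose", "glucose", "glucose"],
    "value": [90.0, 5.0, 126.0],
    "unit": ["mg/dL", "mmol/L", "mg/dL"],
})

# Multiplicative factors that convert each (lab, unit) pair to the target unit.
# 1 mmol/L of glucose is about 18.016 mg/dL (molar mass of glucose ~180.16 g/mol).
to_target = pd.DataFrame({
    "lab": ["glucose", "glucose"],
    "unit": ["mg/dL", "mmol/L"],
    "factor": [1.0, 18.016],
    "target_unit": ["mg/dL", "mg/dL"],
})

normalized = toy_labs.merge(to_target, how="left", on=["lab", "unit"])
normalized["value_normalized"] = normalized["value"] * normalized["factor"]
print(normalized[["lab", "value", "unit", "value_normalized", "target_unit"]])
```

Rows whose (lab, unit) pair is missing from the conversion table end up with an empty `value_normalized`, which makes unhandled units easy to find.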

For Diagnoses, Medications, and Procedures the observed values are categorical, so we don't have to normalize them. However, to enhance data quality and usability, we will map data recorded under different coding systems to a common parent code through a process called rolling up.

## Step 6: Rolling up Data

For data to be compared and analysed, it needs to be standardized. For example, EHR codes of the same type, say medications, may come from different coding systems such as NDC, RxNorm, or codes local to an institution. Rolling up data from different coding systems to a common parent coding system ensures standardization.

There are further reasons to perform rollup:

1) The raw EHR codes are often too specific for analysis to be feasible
2) Rolling up helps to harmonize data across different institutions and to perform analysis at a larger scale

In [1]:
import os

os.makedirs("Rollup_Mappings", exist_ok=True)
os.makedirs("Intermediate_Data", exist_ok=True)
os.makedirs("Rolledup_Data", exist_ok=True)

## Creating Rollup Dictionary

To roll up EHR codes to a parent-level code, we need to decide which coding system we will roll up to:

1. Diagnoses - We will roll up ICD and other diagnosis codes to PheCodes
2. Medication - We will roll up standard codes like RxNorm and NDC, as well as local medication codes, to RxNorm ingredient-level codes
3. Lab - We will roll up lab codes and LOINC codes to LOINC Component
4. Procedures - We will roll up procedure codes like ICD-PCS/CPT-4 to CCS codes

Creating these rollup dictionaries requires a lot of manual processing and quality checks to ensure the mapping dictionary is accurate.

In [23]:
# Write code to create rollup dictionaries
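
Whatever process produces a rollup dictionary, a few automated checks can catch common problems before it is used. The following is a minimal sketch on a made-up mapping; the column names `child_code` and `parent_code` and all codes are hypothetical.

```python
import pandas as pd

# Toy mapping with deliberate problems: an exact duplicate row, a child code
# ("B2") that points to two different parents, and a row with no parent.
toy_mapping = pd.DataFrame({
    "child_code": ["A1", "A1", "B2", "B2", "C3"],
    "parent_code": ["P1", "P1", "P2", "P9", None],
})

# 1) Drop exact duplicates and rows with no parent
clean = toy_mapping.drop_duplicates().dropna(subset=["parent_code"])

# 2) Flag child codes that still map to more than one parent
parents_per_child = clean.groupby("child_code")["parent_code"].nunique()
ambiguous_children = parents_per_child[parents_per_child > 1].index.tolist()

print("Ambiguous child codes:", ambiguous_children)
```

Ambiguous codes usually need a manual decision, since a left merge against such a mapping would duplicate the affected patient records.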



### Diagnoses Data
ICD codes are often too detailed to be used directly for research purposes. PheCodes solve this problem by grouping related ICD codes into clinically meaningful phenotypes.



In [None]:
# Write code to create the ICD-to-PheCode dictionary and save it to the rollup file
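
One possible shape for that step is sketched below. It assumes the source maps come as separate ICD-9 and ICD-10 tables with dotted codes; the column names `icd9`, `icd10cm`, and `phecode`, and the helper `to_rollup_format`, are assumptions for illustration and should be checked against the files actually downloaded. The three example pairs are taken from the mapping and data shown in this notebook.

```python
import pandas as pd

# Toy source maps; column names and the dotted code format are assumptions
icd9_map = pd.DataFrame({"icd9": ["001.0", "571.5"], "phecode": ["008", "571.51"]})
icd10_map = pd.DataFrame({"icd10cm": ["T52.4X3D"], "phecode": ["981"]})


def to_rollup_format(source_map, code_column, version):
    """Return a frame with columns icd_code, PheCode, icd_version."""
    out = pd.DataFrame({
        # The ICD codes in diagnoses_icd carry no dot, so remove it here
        "icd_code": source_map[code_column].str.replace(".", "", regex=False),
        "PheCode": source_map["phecode"],
    })
    out["icd_version"] = version
    return out


toy_icd_to_phecode = pd.concat(
    [to_rollup_format(icd9_map, "icd9", "9"),
     to_rollup_format(icd10_map, "icd10cm", "10")],
    ignore_index=True,
).drop_duplicates()

print(toy_icd_to_phecode)
# toy_icd_to_phecode.to_csv("./Rollup_Mappings/icd_to_phecode.csv", index=False)
```

Keeping `icd_version` as a string matters because the data is read with `dtype=str`, and the later merge joins on both `icd_code` and `icd_version`.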




In [7]:
icd_to_phecode_file = "./Rollup_Mappings/icd_to_phecode.csv"

icd_to_phecode = pd.read_csv(icd_to_phecode_file, dtype=str)

icd_to_phecode 

Unnamed: 0,icd_code,PheCode,icd_version
0,001,008,9
1,0010,008,9
2,0011,008,9
3,0019,008,9
4,002,008,9
...,...,...,...
98544,T524X3D,981,10
98545,T532X4D,981,10
98546,T533X4S,981,10
98547,T521X3S,981,10


In [10]:
# Rolling up icd codes to PheCodes

diagnoses_phecode = pd.merge(diagnoses_icd,icd_to_phecode, how='left', on=['icd_code','icd_version'])
diagnoses_phecode

Unnamed: 0,subject_id,icd_code,icd_version,date,PheCode
0,10000032,5723,9,2180-05-06,571.81
1,10000032,78959,9,2180-05-06,572
2,10000032,5715,9,2180-05-06,571.51
3,10000032,07070,9,2180-05-06,070.3
4,10000032,496,9,2180-05-06,496
...,...,...,...,...,...
6356476,19999987,41401,9,2145-11-02,411.4
6356477,19999987,78039,9,2145-11-02,345.3
6356478,19999987,0413,9,2145-11-02,041
6356479,19999987,36846,9,2145-11-02,368.4


In [11]:
# Flag each row as mapped ('1') or unmapped ('0'). We keep this flag in the intermediate files so that, if you later
# make the rollup mapping more comprehensive or want to look at the unmapped codes, you can come back to it.

diagnoses_phecode['Rollup_Status'] = diagnoses_phecode['PheCode'].notna().replace({True: '1', False: '0'})

diagnoses_phecode

Unnamed: 0,subject_id,icd_code,icd_version,date,PheCode,Rollup_Status
0,10000032,5723,9,2180-05-06,571.81,1
1,10000032,78959,9,2180-05-06,572,1
2,10000032,5715,9,2180-05-06,571.51,1
3,10000032,07070,9,2180-05-06,070.3,1
4,10000032,496,9,2180-05-06,496,1
...,...,...,...,...,...,...
6356476,19999987,41401,9,2145-11-02,411.4,1
6356477,19999987,78039,9,2145-11-02,345.3,1
6356478,19999987,0413,9,2145-11-02,041,1
6356479,19999987,36846,9,2145-11-02,368.4,1


In [12]:
# Examine the unmapped rows

diagnoses_phecode_unmapped = diagnoses_phecode[diagnoses_phecode["Rollup_Status"]=="0"]
diagnoses_phecode_unmapped

Unnamed: 0,subject_id,icd_code,icd_version,date,PheCode,Rollup_Status
34,10000032,V4986,9,2180-07-23,,0
64,10000117,W010XXA,10,2183-09-18,,0
65,10000117,Y93K1,10,2183-09-18,,0
66,10000117,Y92480,10,2183-09-18,,0
77,10000161,R519,10,2163-08-20,,0
...,...,...,...,...,...,...
6356392,19999784,Y92239,10,2119-10-17,,0
6356410,19999828,T8141XA,10,2149-01-08,,0
6356417,19999828,Y929,10,2149-01-08,,0
6356435,19999828,Y92018,10,2147-07-18,,0


In [13]:
# Summarize the codes that have not been rolled up, counting each code once per patient
unique_subject_icd_pairs = diagnoses_phecode_unmapped[['subject_id', 'icd_code', 'icd_version']].drop_duplicates()

icdcode_frequencies = unique_subject_icd_pairs[['icd_code', 'icd_version']].value_counts().reset_index(name='counts')

sorted_icdcode_frequencies = icdcode_frequencies.sort_values(by='counts', ascending=False)

sorted_icdcode_frequencies.head(10)

Unnamed: 0,icd_code,icd_version,counts
0,Z20822,10,23629
1,Y929,10,15945
2,Y92230,10,9019
3,V270,9,8375
4,V4986,9,8006
5,Y92239,10,7539
6,Y92009,10,7023
7,E8497,9,6867
8,E8490,9,5832
9,E8788,9,5377


Once the data looks reasonable and enough of the codes have been rolled up, you can save the data.
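
One way to judge whether the rollup is "good enough" is to compute the share of rows and the share of distinct codes that received a parent code. The sketch below does this on a made-up table shaped like `diagnoses_phecode`; the `toy_rolled` frame and both metric names are illustrative, and what counts as acceptable coverage is a project decision.

```python
import pandas as pd

# Toy rolled-up table in the same shape as diagnoses_phecode (values are made up)
toy_rolled = pd.DataFrame({
    "subject_id": ["1", "1", "2", "2", "3"],
    "icd_code": ["5723", "V4986", "5723", "Y929", "496"],
    "PheCode": ["571.81", None, "571.81", None, "496"],
})
toy_rolled["Rollup_Status"] = toy_rolled["PheCode"].notna().map({True: "1", False: "0"})

mapped = toy_rolled["Rollup_Status"] == "1"

# Share of rows that received a parent code
row_coverage = mapped.mean()

# Share of distinct raw codes that received a parent code
code_coverage = toy_rolled.loc[mapped, "icd_code"].nunique() / toy_rolled["icd_code"].nunique()

print(f"Row coverage:  {row_coverage:.1%}")
print(f"Code coverage: {code_coverage:.1%}")
```

The two numbers can differ a lot: a few very frequent unmapped codes lower row coverage sharply while barely moving code coverage.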

In [15]:
diagnoses_phecode.to_csv("./Intermediate_Data/diagnoses_phecode_comprehensive.csv", index=None)

We don't really need all the columns after rollup is performed. Below we just keep the data we need.

In [64]:
print(diagnoses_phecode.columns)
diagnoses_phecode_filtered = diagnoses_phecode[diagnoses_phecode['Rollup_Status']=="1"]
print (diagnoses_phecode_filtered)
diagnoses_phecode_filtered = diagnoses_phecode_filtered[['subject_id','PheCode','date']]
diagnoses_phecode_filtered


In [23]:
if diagnoses_phecode_filtered.duplicated().sum() > 0:
    print("Duplicate rows found. Removing duplicates...")
    diagnoses_phecode_filtered = diagnoses_phecode_filtered.drop_duplicates()  # Remove duplicate rows
    print("DataFrame after removing duplicates:")
else:
    print("No duplicate rows found.")

diagnoses_phecode_filtered

No duplicate rows found.


Unnamed: 0,subject_id,PheCode,date
0,10000032,571.81,2180-05-06
1,10000032,572,2180-05-06
2,10000032,571.51,2180-05-06
3,10000032,070.3,2180-05-06
4,10000032,496,2180-05-06
...,...,...,...
6356476,19999987,411.4,2145-11-02
6356477,19999987,345.3,2145-11-02
6356478,19999987,041,2145-11-02
6356479,19999987,368.4,2145-11-02


In [25]:
# Once cleaned, we can save the rollup data
diagnoses_phecode_filtered.to_csv("./Rolledup_Data/diagnoses_phecode_rolled.csv", index=None)

## Defining Functions 

As we will be performing operations similar to what we did to roll up ICD codes, it is better to define these operations
as functions so we can reuse them.

In [1]:
import pandas as pd

def rollup(raw_level_data, mapping_data, join_columns, parent_column):
    """Left-join raw codes to their parent codes and flag each row as mapped ('1') or unmapped ('0')."""

    rolledup_data = pd.merge(raw_level_data, mapping_data, how='left', on=join_columns)

    rolledup_data['Rollup_Status'] = rolledup_data[parent_column].notna().replace({True: '1', False: '0'})

    return rolledup_data


def summarize_unmapped(rolledup_data, child_column):
    """Count, for each unmapped raw code, the number of distinct patients who have it."""

    rolledup_data_unmapped = rolledup_data[rolledup_data["Rollup_Status"]=="0"]

    # Count each code at most once per patient
    unique_patient_code_pairs = rolledup_data_unmapped[['subject_id', child_column]].drop_duplicates()

    unmapped_code_frequencies = unique_patient_code_pairs[[child_column]].value_counts().reset_index(name='counts')

    sorted_code_frequencies = unmapped_code_frequencies.sort_values(by='counts', ascending=False)

    return sorted_code_frequencies


def filter_rolledup_data(rolledup_data, extract_columns):
    """Keep only the mapped rows and the columns listed in extract_columns."""

    filtered = rolledup_data[rolledup_data['Rollup_Status']=="1"]

    filtered = filtered[extract_columns]

    return filtered

    

In [2]:
# join_columns=['icd_code', 'icd_version']
# res = rollup(diagnoses_icd,icd_to_phecode, join_columns ,"PheCode")
# print(res)
# print(summarize_unmapped(res,'icd_code'))

### Procedures Data

In MIMIC Procedure data come from two sources: 
1. hcpcsevents.csv where procedures are recorded as CPT codes 
2. procedures_icd.csv where procedures are recorded as ICD9/ICD10 Procedure codes

Our objective is to roll them both up to CCS codes and merge them.

In [41]:
# Processing hcpcsevents.csv

import pandas as pd

hcpcenvents_file = "/n/data1/hsph/biostat/celehs/lab/va67/MIMIC/mimic-iv-codified-3.1/files/mimiciv/3.1/hosp/hcpcsevents.csv"

hcpcenvents = pd.read_csv(hcpcenvents_file, dtype=str)

hcpcenvents.head()

Unnamed: 0,subject_id,hadm_id,chartdate,hcpcs_cd,seq_num,short_description
0,10000068,25022803,2160-03-04,99218,1,Hospital observation services
1,10000084,29888819,2160-12-28,G0378,1,Hospital observation per hr
2,10000108,27250926,2163-09-27,99219,1,Hospital observation services
3,10000117,22927623,2181-11-15,43239,1,Digestive system
4,10000117,22927623,2181-11-15,G0378,2,Hospital observation per hr


The above table is a bit different from the diagnoses_icd table we encountered before, as it already comes with a date. Since this table contains
all the information we need, we can go ahead with the rollup.

In [25]:
hcpcs_to_ccs_file = "./Rollup_Mappings/cpt2ccs_rollup.csv"

hcpcs_to_ccs= pd.read_csv(hcpcs_to_ccs_file, dtype=str)

hcpcs_to_ccs.columns =['hcpcs_cd','ccs']

hcpcs_to_ccs.head()

Unnamed: 0,hcpcs_cd,ccs
0,61000,1
1,61001,1
2,61020,1
3,61026,1
4,61050,1


In [27]:
join_columns=['hcpcs_cd']

hcpcs_procedures_ccs= rollup(hcpcenvents ,hcpcs_to_ccs, join_columns ,"ccs")

hcpcs_procedures_ccs.to_csv("./Intermediate_Data/hcpcs_procedures_ccs.csv", index=None)

hcpcs_procedures_ccs.head()

Unnamed: 0,subject_id,hadm_id,chartdate,hcpcs_cd,seq_num,short_description,ccs,Rollup_Status
0,10000068,25022803,2160-03-04,99218,1,Hospital observation services,227,1
1,10000084,29888819,2160-12-28,G0378,1,Hospital observation per hr,227,1
2,10000108,27250926,2163-09-27,99219,1,Hospital observation services,227,1
3,10000117,22927623,2181-11-15,43239,1,Digestive system,70,1
4,10000117,22927623,2181-11-15,G0378,2,Hospital observation per hr,227,1


In [28]:
summarize_unmapped(hcpcs_procedures_ccs,'hcpcs_cd').head()

Unnamed: 0,hcpcs_cd,counts
0,92980,323
1,93545,273
2,43268,252
3,43269,196
4,43271,139


In [29]:
extract_columns =['subject_id','ccs','chartdate']
filtered_hcpcs_procedures_ccs= filter_rolledup_data(hcpcs_procedures_ccs, extract_columns)
filtered_hcpcs_procedures_ccs.head()

Unnamed: 0,subject_id,ccs,chartdate
0,10000068,227,2160-03-04
1,10000084,227,2160-12-28
2,10000108,227,2163-09-27
3,10000117,70,2181-11-15
4,10000117,227,2181-11-15


Now that the hcpcsevents file has been rolled up, we will go ahead and roll up the procedures_icd file.

In [42]:
# Processing procedures_icd.csv

import pandas as pd

procedures_icd_file = "/n/data1/hsph/biostat/celehs/lab/va67/MIMIC/mimic-iv-codified-3.1/files/mimiciv/3.1/hosp/procedures_icd.csv"

procedures_icd = pd.read_csv(procedures_icd_file, dtype=str)

procedures_icd.head()

Unnamed: 0,subject_id,hadm_id,seq_num,chartdate,icd_code,icd_version
0,10000032,22595853,1,2180-05-07,5491,9
1,10000032,22841357,1,2180-06-27,5491,9
2,10000032,25742920,1,2180-08-06,5491,9
3,10000068,25022803,1,2160-03-03,8938,9
4,10000117,27988844,1,2183-09-19,0QS734Z,10


Since we have all the required data - subject_id, chartdate, icd_code and icd_version, we can go ahead with the rollup.

**Note**: While working with real world data, in case you don't have the required information, you will need to find the source and merge them as we did in the case of diagnoses_icd

In [31]:
# ICD Procedure rollup file

icdproc_to_ccs_file = "./Rollup_Mappings/icdproc_to_ccs.csv"

icdproc_to_ccs = pd.read_csv(icdproc_to_ccs_file, dtype=str)

icdproc_to_ccs.head()

Unnamed: 0,icd_code,ccs,icd_version
0,00800ZZ,1,10
1,00803ZZ,1,10
2,00804ZZ,1,10
3,00870ZZ,1,10
4,00873ZZ,1,10


In [32]:
join_columns=['icd_code','icd_version']

icd_procedures_ccs= rollup(procedures_icd ,icdproc_to_ccs , join_columns ,"ccs")

# Use a separate file name so the HCPCS intermediate file saved earlier is not overwritten
icd_procedures_ccs.to_csv("./Intermediate_Data/icd_procedures_ccs.csv", index=None)

icd_procedures_ccs.head()

Unnamed: 0,subject_id,hadm_id,seq_num,chartdate,icd_code,icd_version,ccs,Rollup_Status
0,10000032,22595853,1,2180-05-07,5491,9,88,1
1,10000032,22841357,1,2180-06-27,5491,9,88,1
2,10000032,25742920,1,2180-08-06,5491,9,88,1
3,10000068,25022803,1,2160-03-03,8938,9,227,1
4,10000117,27988844,1,2183-09-19,0QS734Z,10,146,1


In [33]:
summarize_unmapped(icd_procedures_ccs,'icd_code').head()

Unnamed: 0,icd_code,counts
0,XW033E5,563
1,XW033H5,54
2,XW0DXF5,50
3,XW043E5,36
4,XW0G886,31


In [34]:
icd_procedures_ccs.describe()

Unnamed: 0,subject_id,hadm_id,seq_num,chartdate,icd_code,icd_version,ccs,Rollup_Status
count,859655,859655,859655,859655,859655,859655,858486,859655
unique,150711,287504,41,35709,14911,2,231,2
top,17295976,25434637,1,2138-08-03,3893,9,54,1
freq,350,41,287506,75,14644,469209,45580,858486


In [72]:
#filtered_hcpcs_procedures_ccs.to_csv("./Rolledup_Data/procedures_hcpcs_rolled.csv", index=None)

In [35]:
extract_columns =['subject_id','ccs','chartdate']
filtered_icd_procedures_ccs= filter_rolledup_data(icd_procedures_ccs, extract_columns)
filtered_icd_procedures_ccs.head()

Unnamed: 0,subject_id,ccs,chartdate
0,10000032,88,2180-05-07
1,10000032,88,2180-06-27
2,10000032,88,2180-08-06
3,10000068,227,2160-03-03
4,10000117,146,2183-09-19


Now that we have both procedure tables rolled up, we can concatenate them to create our final rolled-up procedures data.

In [37]:
procedures_ccs_rolled = pd.concat([filtered_hcpcs_procedures_ccs ,filtered_icd_procedures_ccs])

procedures_ccs_rolled

Unnamed: 0,subject_id,ccs,chartdate
0,10000068,227,2160-03-04
1,10000084,227,2160-12-28
2,10000108,227,2163-09-27
3,10000117,70,2181-11-15
4,10000117,227,2181-11-15
...,...,...,...
859650,19999840,4,2164-09-16
859651,19999840,198,2164-07-25
859652,19999840,188,2164-07-25
859653,19999987,188,2145-11-07


In [44]:
# Rename columns to be consistent with the other rolled-up data

procedures_ccs_rolled.columns=['subject_id','ccs','date']

procedures_ccs_rolled.drop_duplicates(inplace=True)

procedures_ccs_rolled.to_csv("./Rolledup_Data/procedures_ccs_rolled.csv",index=None)

procedures_ccs_rolled

Unnamed: 0,subject_id,ccs,date
0,10000068,227,2160-03-04
1,10000084,227,2160-12-28
2,10000108,227,2163-09-27
3,10000117,70,2181-11-15
4,10000117,227,2181-11-15
...,...,...,...
859650,19999840,4,2164-09-16
859651,19999840,198,2164-07-25
859652,19999840,188,2164-07-25
859653,19999987,188,2145-11-07


### Medication Data

In MIMIC, medication data comes from two sources: 
1. prescriptions.csv where prescribed medications are recorded 
2. emar.csv where medication administrations are recorded as NDC codes - highly granular information

Information is often duplicated among sources. Some of these events can also be found in inputevents in the ICU module, which we can get into later.

Our objective is to map the codes to RxNorm and then map those to RxNorm ingredient codes.
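
The two-step mapping can be sketched as two chained left merges. Everything below is illustrative: the mapping tables are hypothetical, the RxNorm and ingredient identifiers are placeholders rather than real codes, and treating an NDC of `0` as a missing value is an assumption that should be verified against the data.

```python
import pandas as pd

# Toy prescriptions; the NDC "0" is assumed to be a placeholder for a missing code
toy_prescriptions = pd.DataFrame({
    "subject_id": ["1", "1", "2"],
    "date": ["2180-05-08", "2180-05-07", "2145-11-03"],
    "ndc": ["51079007320", "0", "51079007220"],
})

# Hypothetical mapping tables; RXN_* and ING_* are placeholders, not real codes
ndc_to_rxnorm = pd.DataFrame({
    "ndc": ["51079007320", "51079007220"],
    "rxnorm": ["RXN_A", "RXN_B"],
})
rxnorm_to_ingredient = pd.DataFrame({
    "rxnorm": ["RXN_A", "RXN_B"],
    "ingredient": ["ING_1", "ING_1"],
})

# Blank out the placeholder so it is reported as unmapped rather than matched
toy_prescriptions["ndc"] = toy_prescriptions["ndc"].mask(toy_prescriptions["ndc"] == "0")

# Step 1: NDC -> RxNorm, Step 2: RxNorm -> ingredient
med_rolled = (
    toy_prescriptions
    .merge(ndc_to_rxnorm, how="left", on="ndc")
    .merge(rxnorm_to_ingredient, how="left", on="rxnorm")
)
med_rolled["Rollup_Status"] = med_rolled["ingredient"].notna().map({True: "1", False: "0"})
print(med_rolled)
```

Keeping the intermediate `rxnorm` column makes it possible to tell whether an unmapped row failed at the first or the second step.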

In [None]:
# Processing prescriptions.csv

In [2]:
import pandas as pd

prescriptions_file = "/n/data1/hsph/biostat/celehs/lab/va67/MIMIC/mimic-iv-codified-3.1/files/mimiciv/3.1/hosp/prescriptions.csv"

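# Preview the first 10,000 rows to inspect the available columns before loading the full file
pres_preview = pd.read_csv(prescriptions_file, dtype=str, nrows=10000)

pres_preview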

Unnamed: 0,subject_id,hadm_id,pharmacy_id,poe_id,poe_seq,order_provider_id,starttime,stoptime,drug_type,drug,...,gsn,ndc,prod_strength,form_rx,dose_val_rx,dose_unit_rx,form_val_disp,form_unit_disp,doses_per_24_hrs,route
0,10000032,22595853,12775705,10000032-55,55,P85UQ1,2180-05-08 08:00:00,2180-05-07 22:00:00,MAIN,Furosemide,...,008209,51079007320,40mg Tablet,,40,mg,1,TAB,1,PO/NG
1,10000032,22595853,18415984,10000032-42,42,P23SJA,2180-05-07 02:00:00,2180-05-07 22:00:00,MAIN,Ipratropium Bromide Neb,...,021700,00487980125,2.5mL Vial,,1,NEB,1,VIAL,4,IH
2,10000032,22595853,23637373,10000032-35,35,P23SJA,2180-05-07 01:00:00,2180-05-07 09:00:00,MAIN,Furosemide,...,008208,51079007220,20mg Tablet,,20,mg,1,TAB,1,PO/NG
3,10000032,22595853,26862314,10000032-41,41,P23SJA,2180-05-07 01:00:00,2180-05-07 01:00:00,MAIN,Potassium Chloride,...,001275,00245004101,10mEq ER Tablet,,40,mEq,4,TAB,1,PO
4,10000032,22595853,30740602,10000032-27,27,P23SJA,2180-05-07 00:00:00,2180-05-07 22:00:00,MAIN,Sodium Chloride 0.9% Flush,...,,0,10 mL Syringe,,3,mL,0.3,SYR,3,IV
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,10004606,28731738,68195424,10004606-490,490,P81HRQ,2159-04-05 08:00:00,2159-04-07 15:00:00,BASE,Iso-Osmotic Dextrose,...,,0,100 mL Bag,,100,mL,100,mL,1,IV
9996,10004606,28731738,68195424,10004606-490,490,P81HRQ,2159-04-05 08:00:00,2159-04-07 15:00:00,MAIN,Vancomycin,...,020611,00338355148,500 mg / 100 mL Premix Bag,,500,mg,1,BAG,1,IV
9997,10004606,28731738,68693978,10004606-542,542,P81HRQ,2159-04-10 13:00:00,2159-04-11 23:00:00,MAIN,Polyethylene Glycol,...,034313,11523726808,17g Packet,,17,g,1,PKT,,PO/NG
9998,10004606,28731738,70277807,10004606-524,524,P597U8,2159-04-08 20:00:00,2159-04-09 14:00:00,MAIN,Labetalol,...,005099,00172436560,200mg Tablet,,200,mg,1,TAB,2,PO/NG


In [4]:
# loading only a subset of required columns helps save memory

cols=['subject_id', 'starttime', 'ndc']

prescriptions = pd.read_csv(prescriptions_file ,dtype=str,usecols=cols)

prescriptions

Unnamed: 0,subject_id,starttime,ndc
0,10000032,2180-05-08 08:00:00,51079007320
1,10000032,2180-05-07 02:00:00,00487980125
2,10000032,2180-05-07 01:00:00,51079007220
3,10000032,2180-05-07 01:00:00,00245004101
4,10000032,2180-05-07 00:00:00,0
...,...,...,...
20292606,19999987,2145-11-03 15:00:00,63323029766
20292607,19999987,2145-11-03 00:00:00,00338004938
20292608,19999987,2145-11-03 00:00:00,14789040005
20292609,19999987,2145-11-08 16:00:00,51079000220


In [5]:
prescriptions['starttime'] = prescriptions['starttime'].str[:10]
prescriptions

Unnamed: 0,subject_id,starttime,ndc
0,10000032,2180-05-08,51079007320
1,10000032,2180-05-07,00487980125
2,10000032,2180-05-07,51079007220
3,10000032,2180-05-07,00245004101
4,10000032,2180-05-07,0
...,...,...,...
20292606,19999987,2145-11-03,63323029766
20292607,19999987,2145-11-03,00338004938
20292608,19999987,2145-11-03,14789040005
20292609,19999987,2145-11-08,51079000220


In [None]:
# Load the data and process it in batches - write a function that keeps appending the resulting data
# Also import tqdm to show progress
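
A possible shape for such a batch function is sketched below, using the `chunksize` option of `pd.read_csv` and appending each processed batch to the output file. The function names `process_in_batches` and `keep_date_only`, the default chunk size, and the temporary toy file are all assumptions for illustration; `tqdm` is treated as optional.

```python
import os
import tempfile

import pandas as pd

try:
    from tqdm import tqdm  # optional progress bar
except ImportError:
    def tqdm(iterable, **kwargs):
        return iterable


def process_in_batches(input_file, output_file, process_batch, usecols=None, chunksize=100_000):
    """Read input_file in chunks, apply process_batch, and append results to output_file."""
    if os.path.exists(output_file):
        os.remove(output_file)  # start from a clean file so reruns do not append twice
    reader = pd.read_csv(input_file, dtype=str, usecols=usecols, chunksize=chunksize)
    for i, batch in enumerate(tqdm(reader, desc="batches")):
        result = process_batch(batch)
        # Write the header only for the first batch, then append
        result.to_csv(output_file, mode="a", header=(i == 0), index=False)


def keep_date_only(batch):
    """Example batch step: truncate starttime to YYYY-MM-DD and drop duplicates."""
    batch = batch.copy()
    batch["starttime"] = batch["starttime"].str[:10]
    return batch.drop_duplicates()


# Small demonstration on a temporary file with made-up rows
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "toy_prescriptions.csv")
    dst = os.path.join(tmp, "toy_prescriptions_dates.csv")
    pd.DataFrame({
        "subject_id": ["1", "1", "2"],
        "starttime": ["2180-05-08 08:00:00", "2180-05-08 09:00:00", "2145-11-03 15:00:00"],
        "ndc": ["51079007320", "51079007320", "63323029766"],
    }).to_csv(src, index=False)

    process_in_batches(src, dst, keep_date_only, chunksize=2)
    batch_result = pd.read_csv(dst, dtype=str)

print(batch_result)
```

Note that duplicates are only removed within each batch here, so a final `drop_duplicates` over the combined output may still be needed.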