## Exploring Healthcare Insights and Outcomes in Critical Care Patients Data
### Data Cleaning & Preparation
##### Project Author: Bruno Ferreira    
##### Date: March-April 2024

In this notebook, we will focus on cleaning and transforming the MIMIC-IV demo dataset into a suitable format for exploratory data analysis and predictive modeling.

#### Loading the dataset into a SQLite database, incorporating all the database's CSV files as tables

In [1822]:
import sqlite3 # Importing necessary libraries
import pandas as pd
from pathlib import Path

csv_directory = Path('C:/Users/bruno/Desktop/mimic') # Path to the directory containing CSV files

csv_files = csv_directory.glob('*.csv') # List of CSV files

conn = sqlite3.connect('C:/Users/bruno/Desktop/mimic/mimic_iv.db') # Create and connect to SQLite database

for csv_file in csv_files: # Iterate over CSV files and create corresponding tables in SQLite database
    table_name = csv_file.stem  # Extract table name from CSV file name
    df = pd.read_csv(csv_file, dtype=str)  # Read CSV file into DataFrame with all columns as strings
    df.to_sql(table_name, conn, if_exists='replace', index=False)  # Create SQLite table

conn.commit() # Save changes to the database
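
The loading step above can be wrapped in a small reusable function. The sketch below is a minimal illustration under stated assumptions: `load_csvs_to_sqlite` is a hypothetical helper name, the data is a throwaway toy file rather than MIMIC-IV, and the helper closes its own connection, whereas the notebook keeps `conn` open for later cells.

```python
import sqlite3
import tempfile
from pathlib import Path

import pandas as pd


def load_csvs_to_sqlite(csv_dir, db_path):
    """Load every CSV in csv_dir into a SQLite table named after the file (hypothetical helper)."""
    loaded = []
    conn = sqlite3.connect(db_path)
    try:
        for csv_file in sorted(Path(csv_dir).glob('*.csv')):
            df = pd.read_csv(csv_file, dtype=str)  # Read everything as text, as in the cell above
            df.to_sql(csv_file.stem, conn, if_exists='replace', index=False)
            loaded.append(csv_file.stem)
        conn.commit()
    finally:
        conn.close()  # Always release the database file, even if a CSV fails to load
    return loaded


# Toy demonstration in a temporary directory (made-up rows, not MIMIC-IV data)
with tempfile.TemporaryDirectory() as tmp:
    toy = pd.DataFrame({'subject_id': ['1', '2'], 'gender': ['F', 'M']})
    toy.to_csv(Path(tmp) / 'patients.csv', index=False)
    tables = load_csvs_to_sqlite(tmp, Path(tmp) / 'demo.db')

    check_conn = sqlite3.connect(Path(tmp) / 'demo.db')
    n_rows = check_conn.execute('SELECT COUNT(*) FROM patients').fetchone()[0]
    check_conn.close()
```

Sorting the file list only makes the load order reproducible; it does not affect the resulting tables.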

- Configuring the ipython-sql environment for queries and retrieving the total number of patients from the "patients" table as a quick test

In [None]:
%load_ext sql
%sql sqlite:///C:/Users/bruno/Desktop/mimic/mimic_iv.db

In [1824]:
%sql SELECT COUNT(*) AS Total_Patients FROM patients

 * sqlite:///C:/Users/bruno/Desktop/mimic/mimic_iv.db
Done.


Total_Patients
100


- Selecting database tables with the most relevant information for the 3 subprojects

##### Relevant Tables:

`Admissions`: Provides information about each patient's hospital admission, including admission and discharge timestamps and the patient outcome (in-hospital death or not), which helps in understanding the patient's hospitalization journey.

`Patients`: Contains demographic information about patients, including their ID, age and gender, which can be used for patient stratification and risk assessment.

`Diagnoses_ICD`: Includes diagnostic codes assigned to patients during their hospital stay, which are valuable for identifying underlying health conditions and comorbidities that may influence patient outcomes.

`Procedures_ICD`: Provides information about procedures performed on patients during their hospitalization, which can be relevant for understanding the severity of illness and predicting outcomes.

`Prescriptions`: Contains data about medications prescribed to patients, including dosage, frequency, and route of administration, which can be important for assessing treatment regimens and potential medication-related adverse events.

`omr`: Offers miscellaneous health information such as blood pressure, height, weight, and body mass index, which can be useful for risk assessment and outcome prediction.

`labevents`: Includes laboratory test results, which are crucial for assessing patient health status, monitoring disease progression, and identifying abnormalities that may indicate adverse events or mortality risk.

#### Data wrangling

- Let's first inspect the raw data in each table: the first rows, the total number of entries (rows) and columns, the name and data type of each column, and the number of non-null values in each column:

In [1825]:
import numpy as np 

relevant_tables = ['admissions', 'patients', 'diagnoses_icd', 'procedures_icd', 'prescriptions', 'omr', 'labevents']

# Define a function to inspect each table
def inspect_tables(table_name):
 
    query = f"SELECT * FROM {table_name};" # Fetch data from the table
    data = %sql $query

    df = data.DataFrame() # Convert data to a pandas DataFrame

    print(f"\n{'='*20} Inspecting {table_name} {'='*20}") # Print header for table being processed

    print(f"\nGlimpse of {table_name} DataFrame:") # See dataframe's first 5 rows
    print(df.head())

    print(f"\nSummary information of {table_name} DataFrame:") # Display number of rows and columns, missing values and variable data types
    df.info() # info() prints its report directly and returns None, so it is not wrapped in print()

for table_name in relevant_tables:
    inspect_tables(table_name)

 * sqlite:///C:/Users/bruno/Desktop/mimic/mimic_iv.db
Done.


Glimpse of admissions DataFrame:
  subject_id   hadm_id            admittime            dischtime  \
0   10004235  24181354  2196-02-24 14:38:00  2196-03-04 14:02:00   
1   10009628  25926192  2153-09-17 17:08:00  2153-09-25 13:20:00   
2   10018081  23983182  2134-08-18 02:02:00  2134-08-23 19:35:00   
3   10006053  22942076  2111-11-13 23:39:00  2111-11-15 17:20:00   
4   10031404  21606243  2113-08-04 18:46:00  2113-08-06 20:57:00   

             deathtime admission_type admit_provider_id  \
0                 None         URGENT            P03YMR   
1                 None         URGENT            P41R5N   
2                 None         URGENT            P233F6   
3  2111-11-15 17:20:00         URGENT            P38TI6   
4                 None         URGENT            P07HDB   

       admission_location        discharge_location insurance language  \
0  TRANSFER FROM HOSPITAL  SKILLED NURSING FACILITY  Medicaid  ENGL

- Let's load the tables into pandas DataFrames to make them easier to handle

In [1826]:
for table in relevant_tables:
    query = f"SELECT * FROM {table}"
    globals()[table] = pd.read_sql_query(query, conn)

- Cleaning and pre-processing the raw tables

`Admissions table`

In [1827]:
admissions_del = ['hadm_id','admission_location','admit_provider_id','insurance','language','marital_status','edregtime','edouttime','deathtime','discharge_location']
admissions.drop(columns=admissions_del, inplace=True) # Drop unnecessary columns

admissions['admittime'] = pd.to_datetime(admissions['admittime']) # Convert variables to correct data type
admissions['dischtime'] = pd.to_datetime(admissions['dischtime']) # P.S.: Years were shifted to de-identify patients but stay consistent across the database's tables for each patient.
admissions['hospital_expire_flag'] = admissions['hospital_expire_flag'].astype(int)

headers = ['subject_id','admittime','dischtime','admission_type','race','deceased'] # Assign new column names
admissions.columns = headers

if admissions.isnull().sum().sum() == 0: # Check for missing values
    print("Relevant columns from this table have no missing values.")
else:
    print("One or more relevant columns from this table have missing values.")

Relevant columns from this table have no missing values.


`Patients table`

In [1828]:
patients_del = ['anchor_year', 'anchor_year_group'] # Drop unnecessary columns
patients.drop(columns=patients_del, inplace=True) 

patients['anchor_age'] = patients['anchor_age'].astype(int) # Convert variables to correct data type
patients['dod'] = pd.to_datetime(patients['dod'])

headers = ['subject_id', 'gender', 'age', 'dateofdeath'] # Assign new column names
patients.columns = headers

# Check for missing values. P.S.: Nulls in 'dateofdeath' mean "no recorded death" rather than missing data, so the second message below is expected.
if patients.isnull().sum().sum() == 0:
    print("Relevant columns from this table have no missing values.")
else:
    print("One or more relevant columns from this table have missing values.")

One or more relevant columns from this table have missing values.
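
Because nulls in `dateofdeath` are meaningful, a check that skips such columns reports only genuinely missing data. The sketch below is one possible way to do this, not the notebook's own method; `missing_report` and its `expected_null` parameter are hypothetical names, and the rows are made up.

```python
import pandas as pd


def missing_report(df, expected_null=()):
    """Return null counts per column, skipping columns where nulls carry meaning (hypothetical helper)."""
    counts = df.drop(columns=list(expected_null)).isnull().sum()
    return counts[counts > 0]  # Keep only columns that actually contain nulls


# Toy stand-in for the cleaned patients table
toy_patients = pd.DataFrame({
    'subject_id': [1, 2, 3],
    'age': [52, 55, 46],
    'dateofdeath': [pd.NaT, pd.Timestamp('2180-09-09'), pd.NaT],
})

report_all = missing_report(toy_patients)                                    # Flags 'dateofdeath'
report_filtered = missing_report(toy_patients, expected_null=['dateofdeath'])  # Nothing left to flag
```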


`Diagnoses_ICD table`

In [1829]:
diagnoses_icd.drop(columns=['hadm_id'], inplace=True)  # Drop unnecessary columns

diagnoses_icd['seq_num'] = diagnoses_icd['seq_num'].astype(int) # Convert variable to correct data type

diagnoses_icd.rename(columns={'seq_num': 'diagnosis_order'}, inplace=True) # Rename 'seq_num' column

if diagnoses_icd.isnull().sum().sum() == 0: # Check for missing values
    print("Relevant columns from this table have no missing values.")
else:
    print("One or more relevant columns from this table have missing values.")

Relevant columns from this table have no missing values.


`Procedures_ICD table`

In [1830]:
procedures_icd.drop(columns=['hadm_id'], inplace=True) # Drop unnecessary columns

procedures_icd.rename(columns={'chartdate': 'procedure_date', 'seq_num': 'procedure_order'}, inplace=True) # Rename columns

procedures_icd['procedure_order'] = procedures_icd['procedure_order'].astype(int) # Convert variable to correct data type

if procedures_icd.isnull().sum().sum() == 0: # Check for missing values
    print("Relevant columns from this table have no missing values.")
else:
    print("One or more relevant columns from this table have missing values.")

Relevant columns from this table have no missing values.


`Prescriptions table`

In [1831]:
prescriptions = prescriptions[['subject_id', 'drug']].copy() # Keep most relevant columns (copy so later in-place edits don't act on a slice)

# The 'drug' column keeps its name here: it is counted and then dropped in the feature engineering step

if prescriptions.isnull().sum().sum() == 0: # Check for missing values
    print("Relevant columns from this table have no missing values.")
else:
    print("One or more relevant columns from this table have missing values.")

Relevant columns from this table have no missing values.


`OMR (Miscellaneous measurements) table`

In [1832]:
omr.drop(columns=['seq_num'], inplace=True) # Drop unnecessary column

omr.rename(columns={'chartdate': 'date', 'result_name': 'measurement', 'result_value': 'value'}, inplace=True) # Rename columns

omr['date'] = pd.to_datetime(omr['date']) # Convert variable to correct data type

if omr.isnull().sum().sum() == 0: # Check for missing values
    print("Relevant columns from this table have no missing values.")
else:
    print("One or more relevant columns from this table have missing values.")

Relevant columns from this table have no missing values.


`Labevents (Laboratory Results) table`

In [1833]:
labevents = labevents[['subject_id', 'valuenum', 'flag']].copy() # Keep only the required columns (copy so later in-place edits don't act on a slice)

labevents.rename(columns={'flag': 'labresult'}, inplace=True) # Rename column

labevents.dropna(subset=['valuenum'], inplace=True) # Keep only rows that have a numeric lab result value

labevents.drop(columns=['valuenum'], inplace=True) # Remove lab result values column

labevents.isnull().sum() # Check for missing values

subject_id        0
labresult     56340
dtype: int64

- Missing values on 'labresult' mean "normal" lab results, so let's replace them accordingly:

In [1834]:
labevents.loc[labevents['labresult'].isnull(), 'labresult'] = 'normal'

- Lastly, we check for and remove duplicated rows that don't add value to our research. We skip tables where repeated rows, such as multiple abnormal lab results or multiple admissions for the same patient, may give us valuable insights:

In [1835]:
duplicates_from = ['patients', 'omr'] # Remove duplicated patients and identical measurements on the same day

for table_name in duplicates_from:
    df = globals()[table_name]  # Get DataFrame by its name
    
    if df.duplicated().any(): # Check for duplicates
        print(f"Duplicates found in {table_name}. Removing duplicates...")
        df.drop_duplicates(inplace=True)  # Remove duplicates
        print(f"Duplicates removed from {table_name}.")
    else:
        print(f"No duplicates found in {table_name}.")

No duplicates found in patients.
Duplicates found in omr. Removing duplicates...
Duplicates removed from omr.


#### Feature Engineering
- Now we can transform existing variables into new useful ones and rearrange the tables

`Admissions table`

In [1836]:
admissions['admission_type'].unique() # Check all types of admissions

array(['URGENT', 'ELECTIVE', 'EW EMER.', 'DIRECT EMER.', 'EU OBSERVATION',
       'OBSERVATION ADMIT', 'DIRECT OBSERVATION',
       'AMBULATORY OBSERVATION', 'SURGICAL SAME DAY ADMISSION'],
      dtype=object)

Patients will be grouped based on the urgency and purpose of their hospital admissions. Planned admissions include scheduled procedures: ELECTIVE and SURGICAL SAME DAY ADMISSION. Emergency admissions cover urgent medical cases: URGENT, EW EMER., and DIRECT EMER. Observation admissions cover EU OBSERVATION, OBSERVATION ADMIT, DIRECT OBSERVATION, and AMBULATORY OBSERVATION, where patients are monitored and evaluated before further decisions on their care.
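
The grouping described above can be written as a group-to-types dictionary and inverted into a lookup, which also makes it easy to spot admission types that no group covers. This is a minimal sketch on a toy Series; `admission_groups`, `type_to_group`, and `unmapped` are illustrative names, not part of the notebook.

```python
import pandas as pd

# Groups as described in the paragraph above
admission_groups = {
    'PLANNED': ['ELECTIVE', 'SURGICAL SAME DAY ADMISSION'],
    'EMERGENCY': ['URGENT', 'EW EMER.', 'DIRECT EMER.'],
    'OBSERVATION': ['EU OBSERVATION', 'OBSERVATION ADMIT',
                    'DIRECT OBSERVATION', 'AMBULATORY OBSERVATION'],
}

# Invert to a raw-type -> group lookup
type_to_group = {raw: group for group, raws in admission_groups.items() for raw in raws}

# Toy admission types (made-up sample, not the real column)
toy_types = pd.Series(['URGENT', 'ELECTIVE', 'AMBULATORY OBSERVATION', 'EW EMER.'])

grouped = toy_types.map(type_to_group)                # Unmapped types become NaN
unmapped = toy_types[grouped.isna()].unique().tolist()  # Types missing from every group
```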

In [1837]:
admissions['stay_length'] = (admissions['dischtime'] - admissions['admittime']) / np.timedelta64(1, 'D') # Calculating stay length

grouped_admissions = admissions.groupby('subject_id')
race = grouped_admissions['race'].last()  # Last race value listed for each subject
deceased = grouped_admissions['deceased'].max().astype(int)  # 1 if any of the subject's admissions ended in in-hospital death
last_stay_length = (grouped_admissions['dischtime'].max() - grouped_admissions['admittime'].max()) / np.timedelta64(1, 'D')  # Length of last stay for each subject
avg_stay_length = admissions.groupby('subject_id')['stay_length'].mean().astype(float).round(1)  # Average stay length for each subject
total_admissions = grouped_admissions.size().astype(int)  # Total admissions for each subject
admission_sorted = admissions.sort_values(by=['subject_id', 'admittime'], ascending=[True, False]) 
last_admission_type = admission_sorted.groupby('subject_id').first()['admission_type'] # Type of latest admission of a patient

# Recreate DataFrame
admissions = pd.DataFrame({ 
    'race': race,
    'deceased': deceased, # 1 = In-hospital death 
    'last_stay': last_stay_length.round(1), # In days
    'avg_stay': avg_stay_length, # In days
    'total_admissions': total_admissions,
    'admission_type': last_admission_type
}).reset_index()

# Normalizing 'race' column so values are consistent
admissions['race'] = admissions['race'].str.split('-', expand=True)[0].str.split('/', expand=True)[0].str.upper().str.strip() # Keep the main group and strip whitespace before matching values
admissions['race'] = admissions['race'].replace(['OTHER','UNKNOWN', 'UNABLE TO OBTAIN', 'PATIENT DECLINED TO ANSWER','PORTUGUESE'], 'OTHER')
admissions['race'] = admissions['race'].replace('HISPANIC OR LATINO', 'HISPANIC')
replacement_map = {
    'ELECTIVE': 'PLANNED',
    'SURGICAL SAME DAY ADMISSION': 'PLANNED',
    'URGENT': 'EMERGENCY',
    'EW EMER.': 'EMERGENCY',
    'DIRECT EMER.': 'EMERGENCY',
    'EU OBSERVATION': 'OBSERVATION',
    'OBSERVATION ADMIT': 'OBSERVATION',
    'DIRECT OBSERVATION': 'OBSERVATION',
    'AMBULATORY OBSERVATION': 'OBSERVATION'
}
admissions['admission_type'] = admissions['admission_type'].replace(replacement_map) # Replacing values in 'admission_type' using the replacement_map
admissions.tail()

Unnamed: 0,subject_id,race,deceased,last_stay,avg_stay,total_admissions,admission_type
95,10038999,WHITE,0,5.6,9.1,2,EMERGENCY
96,10039708,BLACK,0,2.4,7.5,10,OBSERVATION
97,10039831,OTHER,0,5.3,5.3,1,PLANNED
98,10039997,BLACK,0,2.8,2.3,3,OBSERVATION
99,10040025,WHITE,0,12.4,6.6,10,OBSERVATION


`Diagnoses_icd table`

In [1838]:
# Sort the DataFrame by 'subject_id' and 'diagnosis_order' to ensure correct ordering
diagnoses_icd_sorted = diagnoses_icd.sort_values(['subject_id', 'diagnosis_order'])

# Define the list of starting characters for ICD codes covered by our mappings below
allowed_start_chars = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'S', 'T'] + [str(i) for i in range(10)]

# Filter rows where 'icd_code' starts with any of the allowed characters
filtered_diagnoses_icd = diagnoses_icd_sorted[
    diagnoses_icd_sorted['icd_code'].str[0].isin(allowed_start_chars)
].reset_index(drop=True)

# Add a column indicating the total number of diagnoses for each patient
filtered_diagnoses_icd['total_diagnoses'] = filtered_diagnoses_icd.groupby('subject_id')['subject_id'].transform('size')

# Group by 'subject_id' and keep the last 10 rows for each subject after sorting, i.e. the 10 diagnoses with the highest 'diagnosis_order' (taken here as the most relevant to their final outcome)
last_diagnoses = filtered_diagnoses_icd.groupby('subject_id').tail(10)

# Reset the index of the resulting DataFrame and drop the 'diagnosis_order' column
diagnoses_icd = last_diagnoses.reset_index(drop=True).drop(columns=['diagnosis_order'])

# Define the mappings for ICD-9 and ICD-10
icd9_mapping = {
    "infectious_diseases": ["001", "139"],
    "cancer": ["140", "239"],
    "endocrine_disorders": ["240", "279"],
    "blood_disorders": ["280", "289"],
    "mental_disorders": ["290", "319"],
    "nervous_disorders": ["320", "389"],
    "cardiovascular_disorders": ["390", "459"],
    "respiratory_disorders": ["460", "519"],
    "digestive_disorders": ["520", "579"],
    "genitourinary_disorders": ["580", "629"],
    "pregnancy_complications": ["630", "679"],
    "skin_disorders": ["680", "709"],
    "musculoskeletal_disorders": ["710", "739"],
    "injuries_&_poisonings": ["800", "999"]
}

icd10_mapping = {
    "infectious_diseases": ["A00", "B99"],
    "cancer": ["C00", "D48"],
    "endocrine_disorders": ["E00", "E99"],
    "blood_disorders": ["D50", "D89"],
    "mental_disorders": ["F01", "F99"],
    "nervous_disorders": ["G00", "G99"],
    "cardiovascular_disorders": ["I00", "I99"],
    "respiratory_disorders": ["J00", "J99"],
    "digestive_disorders": ["K00", "K93"],
    "genitourinary_disorders": ["N00", "N99"],
    "pregnancy_complications": ["O00", "O99"],
    "skin_disorders": ["L00", "L99"],
    "musculoskeletal_disorders": ["M00", "M99"],
    "injuries_&_poisonings": ["S00", "T98"]
}

def map_icd_to_category(icd_code, icd9_mapping, icd10_mapping):
    for mapping in [icd9_mapping, icd10_mapping]:
        for category, code_range in mapping.items():
            start, end = code_range[0], code_range[1]
            if icd_code >= start and icd_code <= end:
                return category
    return "other"  # If code doesn't match any category, classify as 'other'

# Apply mapping function to classify diagnoses into disease categories
diagnoses_icd['disease_category'] = diagnoses_icd['icd_code'].apply(
    lambda x: map_icd_to_category(x[:3], icd9_mapping, icd10_mapping)
)

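# Rows categorized as 'other' are kept and appear as the 'other' indicator column in the output below.
# (To drop them instead, assign the filter back: diagnoses_icd = diagnoses_icd[diagnoses_icd['disease_category'] != 'other'])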

# Pivot the DataFrame to create indicator columns
df_pivot = pd.pivot_table(diagnoses_icd, values='icd_code', index='subject_id', columns='disease_category', aggfunc=lambda x: 1, fill_value=0)

# Reset index to make subject_id a column again
df_pivot.reset_index(inplace=True)

# Merge back with the original DataFrame to get additional columns if needed
diagnoses_icd = diagnoses_icd.merge(df_pivot, on='subject_id', how='left')

# Drop unnecessary columns like icd_code, icd_version, and disease_category
diagnoses_icd.drop(['icd_code', 'icd_version', 'disease_category'], axis=1, inplace=True)

diagnoses_icd = diagnoses_icd.groupby('subject_id').first().reset_index()
diagnoses_icd.tail()

Unnamed: 0,subject_id,total_diagnoses,blood_disorders,cancer,cardiovascular_disorders,digestive_disorders,endocrine_disorders,genitourinary_disorders,infectious_diseases,injuries_&_poisonings,mental_disorders,musculoskeletal_disorders,nervous_disorders,other,respiratory_disorders,skin_disorders
95,10038999,16,1,0,1,0,1,0,0,0,0,0,1,0,1,0
96,10039708,152,1,1,0,1,1,0,0,1,0,0,0,0,1,0
97,10039831,5,0,1,1,0,1,0,0,1,1,0,0,0,0,0
98,10039997,26,1,0,1,0,1,1,0,0,1,1,0,0,0,0
99,10040025,196,0,0,1,1,1,0,0,1,1,1,0,0,1,1
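
Note that `map_icd_to_category` tries the ICD-9 ranges and then the ICD-10 ranges for every code, regardless of `icd_version`. An ICD-9 external-cause code that starts with `E` would therefore fall inside the ICD-10 range `E00` to `E99` and be counted as an endocrine disorder. The sketch below shows one possible version-aware variant; it is an assumption about how this could be handled, not the method used to produce the output above. `categorize` is a hypothetical name, the range tables are abbreviated, and the codes are toy examples.

```python
import pandas as pd

# Abbreviated range tables for illustration (subset of the mappings defined above)
icd9_ranges = {
    'endocrine_disorders': ('240', '279'),
    'cardiovascular_disorders': ('390', '459'),
    'injuries_&_poisonings': ('800', '999'),
}
icd10_ranges = {
    'endocrine_disorders': ('E00', 'E99'),
    'cardiovascular_disorders': ('I00', 'I99'),
}


def categorize(icd_code, icd_version):
    """Pick the range table from the ICD version, then match the 3-character prefix (hypothetical helper)."""
    ranges = icd9_ranges if str(icd_version) == '9' else icd10_ranges
    prefix = str(icd_code)[:3]
    for category, (start, end) in ranges.items():
        if start <= prefix <= end:
            return category
    return 'other'


# Toy rows: versions are strings because the tables were loaded with dtype=str
toy_dx = pd.DataFrame({
    'icd_code': ['4019', 'E119', 'E8490', 'I10'],
    'icd_version': ['9', '10', '9', '10'],
})
toy_dx['disease_category'] = [
    categorize(code, version)
    for code, version in zip(toy_dx['icd_code'], toy_dx['icd_version'])
]
```

With the version taken into account, the ICD-9 code `E8490` lands in `other` instead of `endocrine_disorders`, while the ICD-10 code `E119` is still classified as endocrine.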


`Labevents table`

In [1839]:
total_lab_results = labevents.groupby('subject_id').size().reset_index(name='total_lab_results') # Calculate the total number of lab results for each subject

abnormal_lab_results = labevents[labevents['labresult'] == 'abnormal'].groupby('subject_id').size().reset_index(name='abnormal_lab_results') # Calculate the number of abnormal lab results for each subject

merged_df = pd.merge(total_lab_results, abnormal_lab_results, on='subject_id', how='left') # Merge the 2 DataFrames

merged_df['abnormal_lab_results'] = merged_df['abnormal_lab_results'].fillna(0) # Subjects with no abnormal results have no match in the merge, so set their count to 0

merged_df['abnorm_labresults_ratio'] = (merged_df['abnormal_lab_results'] / merged_df['total_lab_results']).round(3) # Calculate ratio of abnormal results
 
labevents = pd.merge(labevents.drop_duplicates(subset=['subject_id']), merged_df, on='subject_id', how='left') # Merge ratio of abnormal results with original DataFrame

labevents.drop(columns=['total_lab_results', 'abnormal_lab_results','labresult'], inplace=True) # Drop unnecessary columns

labevents.sort_values(by='subject_id', inplace=True) # Sort the DataFrame by subject_id 
labevents.reset_index(drop=True, inplace=True) 
labevents.tail()

Unnamed: 0,subject_id,abnorm_labresults_ratio
95,10038999,0.353
96,10039708,0.516
97,10039831,0.5
98,10039997,0.126
99,10040025,0.44
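
The same ratio can also be computed in one step, since the mean of a boolean "is abnormal" flag per patient equals the share of abnormal results. A minimal sketch on toy rows, offered as an alternative formulation rather than the notebook's approach:

```python
import pandas as pd

# Toy lab results (made-up rows, not MIMIC-IV data)
toy_labs = pd.DataFrame({
    'subject_id': ['1', '1', '1', '2', '2'],
    'labresult': ['abnormal', 'normal', 'abnormal', 'normal', 'normal'],
})

# Mean of the boolean flag per subject = abnormal results / total results
ratio = (
    toy_labs['labresult'].eq('abnormal')
    .groupby(toy_labs['subject_id'])
    .mean()
    .round(3)
    .rename('abnorm_labresults_ratio')
    .reset_index()
)
```

Patients without any abnormal result get a ratio of 0 directly, so no merge or `fillna` step is needed.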


`OMR table`

In [1840]:
omr_sorted = omr.sort_values(by='date', ascending=False) # Sort the dataframe by 'date' in descending order

# Filter the dataframe to keep only rows with measurements 'BMI (kg/m2)' and 'Blood Pressure'
omr_sorted = omr_sorted[omr_sorted['measurement'].isin(['BMI (kg/m2)', 'Blood Pressure'])]

# Group by 'subject_id' and 'measurement', and keep only the first row (the most recent measurements, closer to patient outcome)
omr = omr_sorted.groupby(['subject_id', 'measurement']).first().reset_index()

omr = omr.drop(columns=['date']) # Drop 'date' column

omr_new = pd.DataFrame() # Initialize the DataFrame to store extracted values

# Extract BMI, systolic BP, and diastolic BP
omr_new['subject_id'] = omr['subject_id']

bmi_values = omr.loc[omr['measurement'] == 'BMI (kg/m2)', ['subject_id', 'value']] # Filter rows for BMI
omr_new = omr_new.merge(bmi_values, on='subject_id', how='left')
omr_new.rename(columns={'value': 'bmi_index'}, inplace=True)

bp_values = omr.loc[omr['measurement'] == 'Blood Pressure', ['subject_id', 'value']] # Filter rows for Blood Pressure and split into systolic/diastolic
bp_values[['systolic_bp', 'diastolic_bp']] = bp_values['value'].str.split('/', expand=True)
bp_values.drop(columns=['value'], inplace=True)
omr_new = omr_new.merge(bp_values, on='subject_id', how='left')

omr_new.drop_duplicates(subset='subject_id', inplace=True) # Drop duplicate subject_id entries
omr = omr_new.reset_index(drop=True) 
omr.tail()

Unnamed: 0,subject_id,bmi_index,systolic_bp,diastolic_bp
74,10038992,31.9,105.0,73.0
75,10038999,33.4,,
76,10039708,26.5,110.0,76.0
77,10039997,33.3,141.0,82.0
78,10040025,30.3,112.0,60.0


- In this table, some patients have no measurements at all. We add a row of NaN values for each of them so that every patient appears in the table; these values can be replaced later with nearest-neighbor estimates.

In [1841]:
missing_subject_ids = labevents[~labevents['subject_id'].isin(omr['subject_id'])]['subject_id'] # Find missing subject_id's in omr

# Create new rows for missing subject_ids in omr
missing_rows = pd.DataFrame({'subject_id': missing_subject_ids, 'bmi_index': np.nan,'systolic_bp': np.nan,'diastolic_bp': np.nan})

omr = pd.concat([omr, missing_rows], ignore_index=True) # Concatenate missing_rows with omr
omr.sort_values(by='subject_id', inplace=True)
omr.reset_index(drop=True, inplace=True)
omr.tail()

Unnamed: 0,subject_id,bmi_index,systolic_bp,diastolic_bp
95,10038999,33.4,,
96,10039708,26.5,110.0,76.0
97,10039831,,,
98,10039997,33.3,141.0,82.0
99,10040025,30.3,112.0,60.0
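
An alternative to building and concatenating the missing rows by hand is a left merge from the full list of patient IDs, which leaves NaN wherever a patient has no measurement. The sketch below uses toy frames; `all_ids` and `toy_omr` are illustrative names.

```python
import pandas as pd

# Full list of patients and a measurements table that lacks patient '2' (toy data)
all_ids = pd.DataFrame({'subject_id': ['1', '2', '3']})
toy_omr = pd.DataFrame({'subject_id': ['1', '3'], 'bmi_index': ['31.9', '26.5']})

# Left merge keeps every patient; unmatched patients get NaN in the measurement columns
completed = all_ids.merge(toy_omr, on='subject_id', how='left')
```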


`Patients table`

In [1842]:
patients['1yr_death'] = patients['dateofdeath'].notnull().astype(float) # Create '1yr_death' column (death within 1 year of hospital discharge)

patients.drop(columns=['dateofdeath'], inplace=True) # Drop 'dateofdeath' column

patients.sort_values(by='subject_id', inplace=True) # Sort the DataFrame by 'subject_id' and reset the index
patients.reset_index(drop=True, inplace=True)
patients.tail()

Unnamed: 0,subject_id,gender,age,1yr_death
95,10038999,M,45,0.0
96,10039708,F,46,0.0
97,10039831,F,57,0.0
98,10039997,F,67,0.0
99,10040025,F,64,1.0


`Prescriptions table`

In [1843]:
prescriptions['prescriptions_count'] = prescriptions.groupby('subject_id')['drug'].transform('count') # Count amount of prescriptions for each patient

prescriptions.drop(columns='drug', inplace=True) # Drop drug column

prescriptions.drop_duplicates(subset='subject_id', keep='first', inplace=True) # Drop duplicate rows

prescriptions.sort_values(by='subject_id', inplace=True) # Sort by subject_id and reset the index
prescriptions.reset_index(drop=True, inplace=True)
prescriptions.tail() 

Unnamed: 0,subject_id,prescriptions_count
95,10038999,121
96,10039708,615
97,10039831,66
98,10039997,59
99,10040025,715


`Procedures_icd table`

In [1844]:
procedures_count = procedures_icd.groupby('subject_id')['procedure_order'].size() # Count amount of procedures for each patient

procedures_count_df = procedures_count.reset_index() # Create a DataFrame with subject_id and procedures_count

procedures_count_df.rename(columns={'procedure_order': 'procedures_count'}, inplace=True) # Rename the column to procedures_count

procedures_icd = procedures_count_df[['subject_id', 'procedures_count']] # Save final DataFrame with columns subject_id and procedures_count
procedures_icd.reset_index(drop=True, inplace=True)
procedures_icd.tail()

Unnamed: 0,subject_id,procedures_count
87,10038999,10
88,10039708,34
89,10039831,2
90,10039997,2
91,10040025,16


- In this table, some patients have no recorded procedures. We add a row with a NaN count for each of them so that every patient appears in the table; these values can be replaced later with nearest-neighbor estimates.

In [1845]:
# Find missing subject_id's in procedures_icd
missing_subject_ids = prescriptions[~prescriptions['subject_id'].isin(procedures_icd['subject_id'])]['subject_id']

# Create new rows for missing subject_ids in procedures_icd
missing_rows = pd.DataFrame({'subject_id': missing_subject_ids, 'procedures_count': np.nan})

# Concatenate missing_rows with procedures_icd
procedures_icd = pd.concat([procedures_icd, missing_rows], ignore_index=True)
procedures_icd.sort_values(by='subject_id', inplace=True)
procedures_icd.reset_index(drop=True, inplace=True)
procedures_icd.tail()

Unnamed: 0,subject_id,procedures_count
95,10038999,10.0
96,10039708,34.0
97,10039831,2.0
98,10039997,2.0
99,10040025,16.0


#### Combining Tables (into 1 Dataframe)

- Join all tables into a main dataframe

In [1846]:
dataframes = {
    'admissions': admissions,
    'patients': patients,
    'diagnoses_icd': diagnoses_icd,
    'procedures_icd': procedures_icd,
    'prescriptions': prescriptions,
    'omr': omr,
    'labevents': labevents
}

dataframe_names = ['admissions', 'patients', 'diagnoses_icd', 'procedures_icd', 'prescriptions', 'omr', 'labevents']

mimic_iv = dataframes[dataframe_names[0]] # Start with first dataframe

# Iterate over the remaining dataframes and merge on 'subject_id'
for name in dataframe_names[1:]:
    df_to_merge = dataframes[name]
    
    # Ensure 'subject_id' column is of the same type in both dataframes
    mimic_iv['subject_id'] = mimic_iv['subject_id'].astype(str)
    df_to_merge['subject_id'] = df_to_merge['subject_id'].astype(str)
    
    mimic_iv = pd.merge(mimic_iv, df_to_merge, on='subject_id', how='left') 

columns = mimic_iv.columns.tolist() # Identify the column names in mimic_iv

# Move '1yr_death' column to the second to last position
if '1yr_death' in columns:
    columns.remove('1yr_death')
    columns.append('1yr_death') 

# Move 'deceased' column to the last position (target variable)
if 'deceased' in columns:
    columns.remove('deceased') 
    columns.append('deceased') 

mimic_iv = mimic_iv[columns] # Reorder the columns in mimic_iv
mimic_iv.head()

Unnamed: 0,subject_id,race,last_stay,avg_stay,total_admissions,admission_type,gender,age,total_diagnoses,blood_disorders,...,respiratory_disorders,skin_disorders,procedures_count,prescriptions_count,bmi_index,systolic_bp,diastolic_bp,abnorm_labresults_ratio,1yr_death,deceased
0,10000032,WHITE,1.8,1.4,4,EMERGENCY,F,52,32,0,...,1,0,3.0,81,18.2,98,66,0.546,1.0,0
1,10001217,WHITE,5.9,6.4,2,EMERGENCY,F,55,15,0,...,1,0,4.0,94,24.3,134,84,0.123,0.0,0
2,10001725,WHITE,3.0,3.0,1,EMERGENCY,F,46,16,0,...,0,0,3.0,70,26.5,114,70,0.088,0.0,0
3,10002428,WHITE,0.8,5.6,7,OBSERVATION,F,80,99,1,...,1,0,17.0,320,18.2,110,70,0.365,0.0,0
4,10002495,OTHER,6.9,6.9,1,EMERGENCY,M,81,15,0,...,0,0,7.0,113,25.8,159,59,0.453,0.0,0
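
Since every table now holds one row per patient, the chain of merges can also be expressed with `functools.reduce`, and `validate='one_to_one'` makes pandas raise an error if any table still contains a duplicated `subject_id`. This is a sketch on toy frames; `merge_on_subject` is a hypothetical helper name.

```python
from functools import reduce

import pandas as pd

# Toy per-patient tables (made-up values)
toy_a = pd.DataFrame({'subject_id': ['1', '2'], 'age': [52, 55]})
toy_b = pd.DataFrame({'subject_id': ['1', '2'], 'total_admissions': [4, 2]})
toy_c = pd.DataFrame({'subject_id': ['1'], 'procedures_count': [3]})


def merge_on_subject(left, right):
    """Left-merge on subject_id and fail loudly if keys are duplicated on either side."""
    return left.merge(right, on='subject_id', how='left', validate='one_to_one')


merged = reduce(merge_on_subject, [toy_a, toy_b, toy_c])
```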


- Recheck missing values and data types

In [1847]:
mimic_iv.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 31 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   subject_id                 100 non-null    object 
 1   race                       100 non-null    object 
 2   last_stay                  100 non-null    float64
 3   avg_stay                   100 non-null    float64
 4   total_admissions           100 non-null    int32  
 5   admission_type             100 non-null    object 
 6   gender                     100 non-null    object 
 7   age                        100 non-null    int32  
 8   total_diagnoses            100 non-null    int64  
 9   blood_disorders            100 non-null    int64  
 10  cancer                     100 non-null    int64  
 11  cardiovascular_disorders   100 non-null    int64  
 12  digestive_disorders        100 non-null    int64  
 13  endocrine_disorders        100 non-null    int64  


- Correct data type of some of the features:

In [1848]:
mimic_iv['subject_id'] = mimic_iv['subject_id'].astype(int)
mimic_iv['bmi_index'] = pd.to_numeric(mimic_iv['bmi_index'], errors='coerce')
mimic_iv['systolic_bp'] = pd.to_numeric(mimic_iv['systolic_bp'], errors='coerce')
mimic_iv['diastolic_bp'] = pd.to_numeric(mimic_iv['diastolic_bp'], errors='coerce')
mimic_iv['1yr_death'] = pd.to_numeric(mimic_iv['1yr_death'], errors='coerce').astype(int)

- Use K-nearest neighbors to fill missing values. We avoid dropping rows with missing data because the dataset is already small (100 patients).

In [1849]:
from sklearn.impute import KNNImputer  

mimic_iv = pd.get_dummies(mimic_iv, columns=['race', 'gender','admission_type']) # One-hot encoding 'race', 'gender' and 'admission_type'

imputer = KNNImputer(n_neighbors=5) # Initialize KNNImputer

cols_with_missing = ['procedures_count', 'bmi_index', 'systolic_bp', 'diastolic_bp'] # Specify columns with missing values

mimic_iv[cols_with_missing] = imputer.fit_transform(mimic_iv[cols_with_missing]) # Impute missing values
mimic_iv.head(1)

Unnamed: 0,subject_id,last_stay,avg_stay,total_admissions,age,total_diagnoses,blood_disorders,cancer,cardiovascular_disorders,digestive_disorders,...,deceased,race_BLACK,race_HISPANIC,race_OTHER,race_WHITE,gender_F,gender_M,admission_type_EMERGENCY,admission_type_OBSERVATION,admission_type_PLANNED
0,10000032,1.8,1.4,4,52,32,0,0,0,1,...,0,False,False,False,True,True,False,True,False,False
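
`KNNImputer` finds neighbors by distance, so columns with larger numeric ranges (such as blood pressure) weigh more than columns with smaller ranges (such as BMI). One option, shown here only as a sketch under that assumption and not as the approach used above, is to standardize the columns before imputing and then convert back to the original units. The values are toy data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

# Toy measurements with gaps (made-up values)
toy_vitals = pd.DataFrame({
    'bmi_index':    [18.2, 24.3, 26.5, np.nan, 25.8],
    'systolic_bp':  [98.0, 134.0, 114.0, 110.0, np.nan],
    'diastolic_bp': [66.0, 84.0, 70.0, 70.0, 59.0],
})

scaler = StandardScaler()
scaled = scaler.fit_transform(toy_vitals)  # NaNs are ignored when fitting and kept in the output

imputed_scaled = KNNImputer(n_neighbors=2).fit_transform(scaled)  # Impute in standardized space

# Convert back to the original units
imputed = pd.DataFrame(scaler.inverse_transform(imputed_scaled), columns=toy_vitals.columns)
```

Observed values pass through unchanged; only the NaN cells are replaced by the average of the nearest neighbors.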


- Converting newly created variables after one-hot encoding from boolean to integer.

In [1850]:
encoded_columns = ['race_BLACK', 'race_HISPANIC', 'race_OTHER', 'race_WHITE', 'gender_F', 'gender_M','admission_type_EMERGENCY',
                   'admission_type_OBSERVATION','admission_type_PLANNED']
mimic_iv[encoded_columns] = mimic_iv[encoded_columns].astype(int)

- Export our final master dataframe to use later for exploratory data analysis and predictive modeling:

In [1851]:
mimic_iv.to_csv('mimic_iv.csv', index=False)