# Data Cleaning & Preparation (MIMIC-IV Demo Dataset)
##### Project Author: Bruno Ferreira    
##### Date: March-April 2024

This first notebook prepares and transforms the MIMIC-IV demo dataset into a format suitable for the planned sub-projects.

#### Loading the dataset into a SQLite database, incorporating all the database's CSV files as tables

In [625]:
import sqlite3
import pandas as pd
from pathlib import Path

csv_directory = Path('C:/Users/bruno/Desktop/mimic') # Path to the directory containing CSV files

csv_files = csv_directory.glob('*.csv') # List of CSV files

conn = sqlite3.connect('C:/Users/bruno/Desktop/mimic/mimic_iv.db') # Create and connect to SQLite database

for csv_file in csv_files: # Iterate over CSV files and create corresponding tables in SQLite database
    table_name = csv_file.stem  # Extract table name from CSV file name
    df = pd.read_csv(csv_file, dtype=str)  # Read CSV file into DataFrame with all columns as strings
    df.to_sql(table_name, conn, if_exists='replace', index=False)  # Create SQLite table

conn.commit()

- Configuring the iPython-SQL environment for queries, and retrieving the total number of patients from the "patients" table for testing purposes

In [None]:
%load_ext sql
%sql sqlite:///C:/Users/bruno/Desktop/mimic/mimic_iv.db

In [627]:
%sql SELECT COUNT(*) AS Total_Patients FROM patients

 * sqlite:///C:/Users/bruno/Desktop/mimic/mimic_iv.db
Done.


Total_Patients
100


- Selecting database tables with the most relevant information for the 3 subprojects

##### Relevant Tables:

`Admissions`: Provides information about each patient's hospital admission, including admission and discharge timestamps and the patient outcome (in-hospital death or not), which helps in understanding the patient's hospitalization journey.

`Patients`: Contains demographic information about patients, including their ID, age and gender, which can be used for patient stratification and risk assessment.

`Diagnoses_ICD`: Includes diagnostic codes assigned to patients during their hospital stay, which are valuable for identifying underlying health conditions and comorbidities that may influence patient outcomes.

`Procedures_ICD`: Provides information about procedures performed on patients during their hospitalization, which can be relevant for understanding the severity of illness and predicting outcomes.

`Prescriptions`: Contains data about medications prescribed to patients, including dosage, frequency, and route of administration, which can be important for assessing treatment regimens and potential medication-related adverse events.

`omr`: Offers miscellaneous health information such as blood pressure, height, weight, and body mass index, which can be useful for risk assessment and outcome prediction.

`labevents`: Includes laboratory test results, which are crucial for assessing patient health status, monitoring disease progression, and identifying abnormalities that may indicate adverse events or mortality risk.

#### Data wrangling

- Let's first inspect the raw data in each table: the total number of entries (rows) and columns, the name and data type of each column, and the number of non-null values in each column:

In [628]:
import numpy as np 

relevant_tables = ['admissions', 'patients', 'diagnoses_icd', 'procedures_icd', 'prescriptions', 'omr', 'labevents']

# Define a function to inspect each table
def inspect_tables(table_name):
 
    query = f"SELECT * FROM {table_name};" # Fetch data from the table
    data = %sql $query

    df = data.DataFrame() # Convert data to a pandas DataFrame

    print(f"\n{'='*20} Inspecting {table_name} {'='*20}") # Print header for table being processed

    print(f"\nGlimpse of {table_name} DataFrame:") # Display glimpse of DataFrame
    print(df.head())

    print(f"\nSummary information of {table_name} DataFrame:") # Display number of rows and columns, missing values and variable data types
    df.info() # info() prints directly and returns None, so wrapping it in print() would add a stray "None"

# Iterate over relevant tables and inspect each one
for table_name in relevant_tables:
    inspect_tables(table_name)

 * sqlite:///C:/Users/bruno/Desktop/mimic/mimic_iv.db
Done.


Glimpse of admissions DataFrame:
  subject_id   hadm_id            admittime            dischtime  \
0   10004235  24181354  2196-02-24 14:38:00  2196-03-04 14:02:00   
1   10009628  25926192  2153-09-17 17:08:00  2153-09-25 13:20:00   
2   10018081  23983182  2134-08-18 02:02:00  2134-08-23 19:35:00   
3   10006053  22942076  2111-11-13 23:39:00  2111-11-15 17:20:00   
4   10031404  21606243  2113-08-04 18:46:00  2113-08-06 20:57:00   

             deathtime admission_type admit_provider_id  \
0                 None         URGENT            P03YMR   
1                 None         URGENT            P41R5N   
2                 None         URGENT            P233F6   
3  2111-11-15 17:20:00         URGENT            P38TI6   
4                 None         URGENT            P07HDB   

       admission_location        discharge_location insurance language  \
0  TRANSFER FROM HOSPITAL  SKILLED NURSING FACILITY  Medicaid  ENGL

- Let's load the tables into pandas DataFrames to make them easier to handle

In [629]:
for table in relevant_tables:
    query = f"SELECT * FROM {table}"
    globals()[table] = pd.read_sql_query(query, conn)
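
The loop above places each table in `globals()`, which works in a notebook but hides where the names come from. As a minimal sketch of an alternative, the tables can be kept in a dictionary keyed by table name. The in-memory database, the two toy tables and the names `demo_conn`, `demo_tables` and `frames` are assumptions made for illustration; they are not part of the original notebook.

```python
import sqlite3
import pandas as pd

# Toy stand-in for the real database: an in-memory SQLite database with two small tables
demo_conn = sqlite3.connect(":memory:")
pd.DataFrame({"subject_id": ["1", "2"], "gender": ["F", "M"]}).to_sql(
    "patients", demo_conn, index=False
)
pd.DataFrame({"subject_id": ["1", "1", "2"], "hadm_id": ["10", "11", "12"]}).to_sql(
    "admissions", demo_conn, index=False
)

demo_tables = ["patients", "admissions"]  # hypothetical subset of relevant_tables

# One DataFrame per table, keyed by table name instead of injected into globals()
frames = {
    name: pd.read_sql_query(f'SELECT * FROM "{name}"', demo_conn)
    for name in demo_tables
}

demo_conn.close()
print({name: df.shape for name, df in frames.items()})
```

With this layout a table is accessed as `frames["patients"]`, and iterating over all tables no longer needs a lookup by variable name.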

- Cleaning and pre-processing the raw tables

`Admissions table`

In [630]:
admissions_del = ['hadm_id','admission_location','admit_provider_id','admission_type','insurance','language','marital_status','edregtime','edouttime','deathtime','discharge_location']
admissions.drop(columns=admissions_del, inplace=True) # Drops unnecessary columns

admissions['admittime'] = pd.to_datetime(admissions['admittime']) # Converts variables to correct data type
admissions['dischtime'] = pd.to_datetime(admissions['dischtime']) # Note: years were shifted to de-identify patients but remain consistent across the database's tables for each patient.
admissions['hospital_expire_flag'] = admissions['hospital_expire_flag'].astype(int)

headers = ['subject_id','admittime','dischtime','race','deceased'] # Assigns new column names
admissions.columns = headers

if admissions.isnull().sum().sum() == 0: # Checks for missing values
    print("Relevant columns from this table have no missing values.")
else:
    print("One or more relevant columns from this table have missing values.")

Relevant columns from this table have no missing values.


`Patients table`

In [631]:
patients_del = ['anchor_year', 'anchor_year_group'] # Drops unnecessary columns
patients.drop(columns=patients_del, inplace=True) 

patients['anchor_age'] = patients['anchor_age'].astype(int) # Converts variables to correct data type
patients['dod'] = pd.to_datetime(patients['dod'])

headers = ['subject_id', 'gender', 'age', 'dateofdeath'] # Assigns new column names
patients.columns = headers

if patients.isnull().sum().sum() == 0: # Checks for missing values (P.S.: Null values in 'dateofdeath' mean "No death", so none are missing values.)
    print("Relevant columns from this table have no missing values.")
else:
    print("One or more relevant columns from this table have missing values.")

One or more relevant columns from this table have missing values.
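
The check above reports missing values for `patients` even though the nulls in 'dateofdeath' are expected. One way to make the check reflect that is to exclude columns where nulls carry meaning. This is a sketch under assumptions: the helper `report_missing`, its `expected_null` parameter and the toy `demo_patients` frame are hypothetical names introduced here.

```python
import pandas as pd

def report_missing(df, expected_null=()):
    """Return per-column null counts, skipping columns where nulls are expected."""
    counts = df.isnull().sum()
    counts = counts.drop(labels=list(expected_null), errors="ignore")
    return counts[counts > 0]

# Toy data: a null date of death means the patient has no recorded death
demo_patients = pd.DataFrame({
    "subject_id": ["1", "2", "3"],
    "age": [52, 67, 80],
    "dateofdeath": pd.to_datetime(["2180-09-09", None, None]),
})

unexpected = report_missing(demo_patients, expected_null=["dateofdeath"])
if unexpected.empty:
    print("No unexpected missing values.")
else:
    print(unexpected)
```

Returning the counts per column, rather than a single yes/no message, also shows which column is responsible when something is missing.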


`Diagnoses_ICD table`

In [632]:
diagnoses_icd.drop(columns=['hadm_id'], inplace=True)  # Drops unnecessary columns

diagnoses_icd['seq_num'] = diagnoses_icd['seq_num'].astype(int) # Converts variable to correct data type

diagnoses_icd.rename(columns={'seq_num': 'diagnosis_order'}, inplace=True) # Renames 'seq_num' column

if diagnoses_icd.isnull().sum().sum() == 0: # Checks for missing values
    print("Relevant columns from this table have no missing values.")
else:
    print("One or more relevant columns from this table have missing values.")

Relevant columns from this table have no missing values.


`Procedures_ICD table`

In [633]:
procedures_icd.drop(columns=['hadm_id'], inplace=True) # Drops unnecessary columns

procedures_icd.rename(columns={'chartdate': 'procedure_date', 'seq_num': 'procedure_order'}, inplace=True) # Renames columns

procedures_icd['procedure_order'] = procedures_icd['procedure_order'].astype(int) # Converts variable to correct data type

if procedures_icd.isnull().sum().sum() == 0: # Checks for missing values
    print("Relevant columns from this table have no missing values.")
else:
    print("One or more relevant columns from this table have missing values.")

Relevant columns from this table have no missing values.


`Prescriptions table`

In [634]:
prescriptions = prescriptions[['subject_id', 'drug']].copy() # Keeps most relevant columns (copy avoids SettingWithCopyWarning on later in-place edits)

# The 'drug' column keeps its name here: it is counted and then dropped during feature engineering

if prescriptions.isnull().sum().sum() == 0: # Checks for missing values
    print("Relevant columns from this table have no missing values.")
else:
    print("One or more relevant columns from this table have missing values.")

Relevant columns from this table have no missing values.


`OMR (Miscellaneous measurements) table`

In [635]:
omr.drop(columns=['seq_num'], inplace=True) # Drops unnecessary column

omr.rename(columns={'chartdate': 'date', 'result_name': 'measurement', 'result_value': 'value'}, inplace=True) # Renames columns

omr['date'] = pd.to_datetime(omr['date']) # Converts variable to correct data type

if omr.isnull().sum().sum() == 0: # Checks for missing values
    print("Relevant columns from this table have no missing values.")
else:
    print("One or more relevant columns from this table have missing values.")

Relevant columns from this table have no missing values.


`Labevents (Laboratory Results) table`

In [636]:
labevents = labevents[['subject_id', 'valuenum', 'flag']].copy() # Keeps only the required columns (copy avoids SettingWithCopyWarning on later in-place edits)

labevents.rename(columns={'flag': 'labresult'}, inplace=True) # Renames column

labevents.dropna(subset=['valuenum'], inplace=True) # Drops rows with a missing numeric lab value

labevents.drop(columns=['valuenum'], inplace=True) # Removes lab result values column (no longer necessary)

labevents.isnull().sum() # Checks for missing values

subject_id        0
labresult     56340
dtype: int64

- Missing values in 'labresult' mean "normal" lab results, so let's replace them accordingly:

In [637]:
labevents.loc[labevents['labresult'].isnull(), 'labresult'] = 'normal'

- Lastly, we check for and remove duplicated rows that add no value to our research. Tables where repeated rows can carry useful information (such as multiple abnormal lab results or multiple admissions for the same patient) are excluded from this step:

In [638]:
# Remove duplicated patients and identical measurements on the same day
duplicates_from = ['patients', 'omr']

for table_name in duplicates_from:
    df = globals()[table_name]  # Get DataFrame by its name
    
    if df.duplicated().any(): # Check for duplicates
        print(f"Duplicates found in {table_name}. Removing duplicates...")
        df.drop_duplicates(inplace=True)  # Remove duplicates
        print(f"Duplicates removed from {table_name}.")
    else:
        print(f"No duplicates found in {table_name}.")

No duplicates found in patients.
Duplicates found in omr. Removing duplicates...
Duplicates removed from omr.


#### Feature Engineering
- Now we can transform existing variables into new useful ones and rearrange the tables

`Admissions table`

In [639]:
admissions['stay_length'] = (admissions['dischtime'] - admissions['admittime']) / np.timedelta64(1, 'D') # Calculating stay length

grouped_admissions = admissions.groupby('subject_id')
race = grouped_admissions['race'].last()  # Last race recorded for each subject
deceased = grouped_admissions['deceased'].max().astype(int)  # 1 if any of the subject's admissions ended in in-hospital death
last_stay_length = (grouped_admissions['dischtime'].max() - grouped_admissions['admittime'].max()) / np.timedelta64(1, 'D')  # Length of last stay for each subject
avg_stay_length = admissions.groupby('subject_id')['stay_length'].mean().astype(float).round(1)  # Average stay length for each subject
total_admissions = grouped_admissions.size().astype(int)  # Total admissions for each subject

# Recreate DataFrame
admissions = pd.DataFrame({ 
    'race': race,
    'deceased': deceased, # 1 = In-hospital death 
    'last_stay': last_stay_length.round(1), # In days
    'avg_stay': avg_stay_length, # In days
    'total_admissions': total_admissions
}).reset_index()

# Normalizing 'race' column so values are consistent
admissions['race'] = admissions['race'].str.split('-', expand=True)[0].str.split('/', expand=True)[0].str.upper()
admissions['race'] = admissions['race'].replace(['OTHER','UNKNOWN', 'UNABLE TO OBTAIN', 'PATIENT DECLINED TO ANSWER'], 'OTHER')
admissions['race'] = admissions['race'].replace('HISPANIC OR LATINO', 'HISPANIC')
admissions.tail()

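| | subject_id | race | deceased | last_stay | avg_stay | total_admissions |
|---|---|---|---|---|---|---|
| 95 | 10038999 | WHITE | 0 | 5.6 | 9.1 | 2 |
| 96 | 10039708 | BLACK | 0 | 2.4 | 7.5 | 10 |
| 97 | 10039831 | OTHER | 0 | 5.3 | 5.3 | 1 |
| 98 | 10039997 | BLACK | 0 | 2.8 | 2.3 | 3 |
| 99 | 10040025 | WHITE | 0 | 12.4 | 6.6 | 10 |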
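
The per-patient summary built above can also be expressed as a single named aggregation. The sketch below sorts by 'admittime' first so that "last" refers to the most recent admission rather than to row order. The toy frame `demo_adm` and the result name `per_patient` are assumptions for illustration, and this is one possible formulation, not the notebook's own method.

```python
import pandas as pd

# Toy admissions: patient "1" has two stays listed out of chronological order
demo_adm = pd.DataFrame({
    "subject_id": ["1", "1", "2"],
    "admittime": pd.to_datetime(["2150-01-10 08:00", "2150-01-01 08:00", "2160-05-01 12:00"]),
    "dischtime": pd.to_datetime(["2150-01-12 08:00", "2150-01-05 08:00", "2160-05-02 00:00"]),
    "race": ["WHITE", "WHITE", "OTHER"],
    "deceased": [0, 0, 1],
})

# Stay length in days
demo_adm["stay_length"] = (
    demo_adm["dischtime"] - demo_adm["admittime"]
).dt.total_seconds() / 86400

per_patient = (
    demo_adm.sort_values("admittime")          # make "last" mean "most recent"
    .groupby("subject_id")
    .agg(
        race=("race", "last"),
        deceased=("deceased", "max"),          # 1 if any stay ended in death
        last_stay=("stay_length", "last"),
        avg_stay=("stay_length", "mean"),
        total_admissions=("stay_length", "count"),
    )
    .round({"last_stay": 1, "avg_stay": 1})
    .reset_index()
)
print(per_patient)
```

Keeping every derived column in one `agg` call makes it easier to see which source column and which statistic feed each feature.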


`Diagnoses_icd table`

In [640]:
# Save a 'diagnoses' DataFrame with 'subject_id' from 'diagnoses_icd'
diagnoses = pd.DataFrame(diagnoses_icd['subject_id'].unique(), columns=['subject_id'])

# Count the occurrences of each 'subject_id' in 'diagnoses_icd' and create 'diagnosis_count' column
diagnoses['diagnosis_count'] = diagnoses['subject_id'].map(diagnoses_icd['subject_id'].value_counts())
diagnoses.sort_values(by='subject_id', inplace=True)

# Define the icd code mapping dictionaries
icd9_mapping = {
    "infectious_diseases": ["001", "139"],
    "cancer": ["140", "239"],
    "endocrine_disorders": ["240", "279"],
    "blood_disorders": ["280", "289"],
    "mental_disorders": ["290", "319"],
    "nervous_disorders": ["320", "389"],
    "cardiovascular_disorders": ["390", "459"],
    "respiratory_disorders": ["460", "519"],
    "digestive_disorders": ["520", "579"],
    "genitourinary_disorders": ["580", "629"],
    "pregnancy_complications": ["630", "679"],
    "skin_disorders": ["680", "709"],
    "musculoskeletal_disorders": ["710", "739"],
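    "pregnancy_complications": ["740", "779"], # NOTE: duplicate key; this range (congenital anomalies and perinatal conditions) overrides 630-679 above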
    "injuries_&_poisonings": ["800", "999"]
}
icd10_mapping = {
    "infectious_diseases": ["A00", "B99"],
    "cancer": ["C00", "D48"],
    "endocrine_disorders": ["E00", "E90"],
    "blood_disorders": ["D50", "D89"],
    "mental_disorders": ["F01", "F99"],
    "nervous_disorders": ["G00", "G99"],
    "cardiovascular_disorders": ["I00", "I99"],
    "respiratory_disorders": ["J00", "J99"],
    "digestive_disorders": ["K00", "K93"],
    "genitourinary_disorders": ["N00", "N99"],
    "pregnancy_complications": ["O00", "O99"],
    "skin_disorders": ["L00", "L99"],
    "musculoskeletal_disorders": ["M00", "M99"],
    "pregnancy_complications": ["Q00", "P96"], # NOTE: duplicate key overriding O00-O99 above; since "Q00" > "P96", this range never matches
    "injuries_&_poisonings": ["S00", "T98"]
}
# Define functions to map ICD codes to disease categories
def map_icd9(icd_code):
    for disease, code_range in icd9_mapping.items():
        if icd_code.isdigit() and code_range[0] <= icd_code <= code_range[1]:
            return disease
    return 'Other'

def map_icd10(icd_code):
    for disease, code_range in icd10_mapping.items():
        if code_range[0] <= icd_code <= code_range[1]:
            return disease
    return 'Other'

# Helper that maps a row to its disease category (not called below; update_diagnosis performs the same lookup)
def map_diagnosis(row):
    if row['icd_version'] == '9':
        return map_icd9(row['icd_code'])
    else:
        return map_icd10(row['icd_code'])

# Create columns for each type of disease
for disease in icd9_mapping.keys():
    diagnoses_icd[disease.replace(" ", "_")] = 0

# Update values in the columns based on diagnosis
def update_diagnosis(row):
    if row['icd_version'] == '9':
        disease = map_icd9(row['icd_code'])
    else:
        disease = map_icd10(row['icd_code'])
    if disease != 'Other':
        column_name = disease.replace(" ", "_")
        diagnoses_icd.at[row.name, column_name] = 1

diagnoses_icd.apply(update_diagnosis, axis=1) # Apply the update_diagnosis function row-wise
diagnoses_icd.drop(columns=['icd_code', 'icd_version','diagnosis_order'], inplace=True) # Drop unnecessary columns

aggregated_diagnoses = diagnoses_icd.groupby('subject_id').agg(lambda x: any(x)).reset_index() # Aggregate diagnoses for each patient

# Merge aggregated diagnoses with the original DataFrame to include subject_id
diagnoses_icd = pd.merge(diagnoses_icd['subject_id'], aggregated_diagnoses, on='subject_id', how='left')

diagnoses_icd.drop_duplicates(inplace=True) # Drop duplicates
diagnoses_icd.reset_index(drop=True, inplace=True) # Reset the index to start from 0

diagnoses_icd = diagnoses_icd.astype(int) # Convert all values to int

diagnoses_icd['diagnosis_count'] = diagnoses['diagnosis_count'] # Add column with diagnosis count per patient

diagnoses_icd.sort_values(by='subject_id', inplace=True) # Sort rows by ascending order of subject_id
diagnoses_icd.reset_index(drop=True, inplace=True) # Reset the index again
diagnoses_icd.tail()

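| | subject_id | infectious_diseases | cancer | endocrine_disorders | blood_disorders | mental_disorders | nervous_disorders | cardiovascular_disorders | respiratory_disorders | digestive_disorders | genitourinary_disorders | pregnancy_complications | skin_disorders | musculoskeletal_disorders | injuries_&_poisonings | diagnosis_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 95 | 10038999 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 23 |
| 96 | 10039708 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 198 |
| 97 | 10039831 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 7 |
| 98 | 10039997 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 29 |
| 99 | 10040025 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 259 |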


`Labevents table`

In [641]:
total_lab_results = labevents.groupby('subject_id').size().reset_index(name='total_lab_results') # Calculate the total number of lab results for each subject

abnormal_lab_results = labevents[labevents['labresult'] == 'abnormal'].groupby('subject_id').size().reset_index(name='abnormal_lab_results') # Calculate the number of abnormal lab results for each subject

merged_df = pd.merge(total_lab_results, abnormal_lab_results, on='subject_id', how='left') # Merge the 2 DataFrames

merged_df['abnormal_lab_results'] = merged_df['abnormal_lab_results'].fillna(0) # Subjects with no abnormal results get a count of 0

merged_df['abnorm_labresults_ratio'] = (merged_df['abnormal_lab_results'] / merged_df['total_lab_results']).round(3) # Calculate ratio of abnormal results
 
labevents = pd.merge(labevents.drop_duplicates(subset=['subject_id']), merged_df, on='subject_id', how='left') # Merge ratio of abnormal results with original DataFrame

labevents.drop(columns=['total_lab_results', 'abnormal_lab_results','labresult'], inplace=True) # Drop unnecessary columns

labevents.sort_values(by='subject_id', inplace=True) # Sort the DataFrame by subject_id 
labevents.reset_index(drop=True, inplace=True) 
labevents.tail()

| | subject_id | abnorm_labresults_ratio |
|---|---|---|
| 95 | 10038999 | 0.353 |
| 96 | 10039708 | 0.516 |
| 97 | 10039831 | 0.5 |
| 98 | 10039997 | 0.126 |
| 99 | 10040025 | 0.44 |
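
The ratio computed above is the share of a patient's lab results flagged as abnormal. Because the mean of a boolean column is exactly that share, the same quantity can be sketched without the intermediate counts and merges. The toy frame `demo_labs` and the result name `abnormal_ratio` are assumptions for illustration.

```python
import pandas as pd

# Toy lab results: patient "1" has 2 abnormal out of 4, patient "2" has none
demo_labs = pd.DataFrame({
    "subject_id": ["1", "1", "1", "1", "2", "2"],
    "labresult": ["abnormal", "normal", "normal", "abnormal", "normal", "normal"],
})

abnormal_ratio = (
    demo_labs["labresult"].eq("abnormal")   # True where the result is abnormal
    .groupby(demo_labs["subject_id"])
    .mean()                                 # mean of booleans = fraction of True
    .round(3)
    .rename("abnorm_labresults_ratio")
    .reset_index()
)
print(abnormal_ratio)
```

Patients with no abnormal results come out as 0.0 directly, so no fill step is needed afterwards.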


`OMR table`

In [642]:
# Sort the dataframe by 'date' in descending order
omr_sorted = omr.sort_values(by='date', ascending=False)

# Filter the dataframe to keep only rows with measurements 'BMI (kg/m2)' and 'Blood Pressure'
omr_sorted = omr_sorted[omr_sorted['measurement'].isin(['BMI (kg/m2)', 'Blood Pressure'])]

# Group by 'subject_id' and 'measurement', and keep only the first row (the most recent measurements, closer to patient outcome)
omr = omr_sorted.groupby(['subject_id', 'measurement']).first().reset_index()

# Drop the 'date' column
omr = omr.drop(columns=['date'])

# Initialize the DataFrame to store extracted values
omr_new = pd.DataFrame()

# Extract BMI, systolic BP, and diastolic BP
omr_new['subject_id'] = omr['subject_id']

# Filter rows for BMI
bmi_values = omr.loc[omr['measurement'] == 'BMI (kg/m2)', ['subject_id', 'value']]
omr_new = omr_new.merge(bmi_values, on='subject_id', how='left')
omr_new.rename(columns={'value': 'bmi_index'}, inplace=True)

# Filter rows for Blood Pressure and split into systolic and diastolic
bp_values = omr.loc[omr['measurement'] == 'Blood Pressure', ['subject_id', 'value']].copy() # copy avoids SettingWithCopyWarning when adding columns
bp_values[['systolic_bp', 'diastolic_bp']] = bp_values['value'].str.split('/', expand=True)
bp_values.drop(columns=['value'], inplace=True)
omr_new = omr_new.merge(bp_values, on='subject_id', how='left')

# Drop duplicate subject_id entries
omr_new.drop_duplicates(subset='subject_id', inplace=True)
omr = omr_new.reset_index(drop=True) 
# Display the DataFrame
omr.tail()

| | subject_id | bmi_index | systolic_bp | diastolic_bp |
|---|---|---|---|---|
| 74 | 10038992 | 31.9 | 105.0 | 73.0 |
| 75 | 10038999 | 33.4 | NaN | NaN |
| 76 | 10039708 | 26.5 | 110.0 | 76.0 |
| 77 | 10039997 | 33.3 | 141.0 | 82.0 |
| 78 | 10040025 | 30.3 | 112.0 | 60.0 |


- In this table, data is missing for some patients. We add those patients with NaN values so that the gaps can be imputed later with k-nearest-neighbors estimates.

In [643]:
# Find missing subject_ids in omr compared to labevents
missing_subject_ids = labevents[~labevents['subject_id'].isin(omr['subject_id'])]['subject_id']

# Create new rows for missing subject_ids in omr
missing_rows = pd.DataFrame({'subject_id': missing_subject_ids, 'bmi_index': np.nan,'systolic_bp': np.nan,'diastolic_bp': np.nan})

# Concatenate missing_rows with omr
omr = pd.concat([omr, missing_rows], ignore_index=True)
omr.sort_values(by='subject_id', inplace=True)
omr.reset_index(drop=True, inplace=True)
omr.tail()

| | subject_id | bmi_index | systolic_bp | diastolic_bp |
|---|---|---|---|---|
| 95 | 10038999 | 33.4 | NaN | NaN |
| 96 | 10039708 | 26.5 | 110.0 | 76.0 |
| 97 | 10039831 | NaN | NaN | NaN |
| 98 | 10039997 | 33.3 | 141.0 | 82.0 |
| 99 | 10040025 | 30.3 | 112.0 | 60.0 |
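
The OMR steps above (latest value per measurement, splitting blood pressure, then adding patients with no measurements) can be sketched compactly with `unstack` and `reindex`. The toy frame `demo_omr`, the list `all_ids` and the intermediate names are assumptions for illustration; the column labels follow the ones used in this notebook.

```python
import pandas as pd

all_ids = ["1", "2", "3"]  # hypothetical full patient list; "2" has no OMR rows

demo_omr = pd.DataFrame({
    "subject_id": ["1", "1", "1", "3"],
    "date": pd.to_datetime(["2180-01-01", "2180-06-01", "2180-06-01", "2181-02-01"]),
    "measurement": ["BMI (kg/m2)", "BMI (kg/m2)", "Blood Pressure", "BMI (kg/m2)"],
    "value": ["18.0", "18.2", "98/66", "30.3"],
})

# Most recent value per patient and measurement, one column per measurement
latest = (
    demo_omr.sort_values("date")
    .groupby(["subject_id", "measurement"])["value"]
    .last()
    .unstack()
)

# Split "systolic/diastolic" strings into two columns
bp = latest["Blood Pressure"].str.split("/", expand=True)

wide = pd.DataFrame({
    "bmi_index": pd.to_numeric(latest["BMI (kg/m2)"]),
    "systolic_bp": pd.to_numeric(bp[0]),
    "diastolic_bp": pd.to_numeric(bp[1]),
})

# reindex adds all-NaN rows for patients without any measurement
completed = wide.reindex(all_ids).rename_axis("subject_id").reset_index()
print(completed)
```

Converting to numeric at this point means the later type-correction step would find these columns already in the right dtype.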


`Patients table`

In [644]:
patients['1yr_death'] = patients['dateofdeath'].notnull().astype(float) # Create '1yr_death' column (death within 1 year of hospital discharge)

patients.drop(columns=['dateofdeath'], inplace=True) # Drop the 'dateofdeath' column

patients.sort_values(by='subject_id', inplace=True) # Sort the DataFrame by 'subject_id' and reset the index
patients.reset_index(drop=True, inplace=True)
patients.tail()

| | subject_id | gender | age | 1yr_death |
|---|---|---|---|---|
| 95 | 10038999 | M | 45 | 0.0 |
| 96 | 10039708 | F | 46 | 0.0 |
| 97 | 10039831 | F | 57 | 0.0 |
| 98 | 10039997 | F | 67 | 0.0 |
| 99 | 10040025 | F | 64 | 1.0 |


`Prescriptions table`

In [645]:
prescriptions['prescriptions_count'] = prescriptions.groupby('subject_id')['drug'].transform('count') # Count amount of prescriptions for each patient

prescriptions.drop(columns='drug', inplace=True) # Dropping the drug column

prescriptions.drop_duplicates(subset='subject_id', keep='first', inplace=True) # Dropping duplicate rows

prescriptions.sort_values(by='subject_id', inplace=True) # Sorting by subject_id and resetting the index
prescriptions.reset_index(drop=True, inplace=True)
prescriptions.tail() 

| | subject_id | prescriptions_count |
|---|---|---|
| 95 | 10038999 | 121 |
| 96 | 10039708 | 615 |
| 97 | 10039831 | 66 |
| 98 | 10039997 | 59 |
| 99 | 10040025 | 715 |


`Procedures_icd table`

In [646]:
procedures_count = procedures_icd.groupby('subject_id')['procedure_order'].size() # Count amount of procedures for each patient

procedures_count_df = procedures_count.reset_index() # Create a DataFrame with subject_id and procedures_count

procedures_count_df.rename(columns={'procedure_order': 'procedures_count'}, inplace=True) # Rename the column to procedures_count

# Display the final DataFrame with only subject_id and procedures_count
procedures_icd = procedures_count_df[['subject_id', 'procedures_count']]
procedures_icd.reset_index(drop=True, inplace=True)
procedures_icd.tail()

| | subject_id | procedures_count |
|---|---|---|
| 87 | 10038999 | 10 |
| 88 | 10039708 | 34 |
| 89 | 10039831 | 2 |
| 90 | 10039997 | 2 |
| 91 | 10040025 | 16 |


- In this table, data is missing for some patients. We add those patients with NaN values so that the gaps can be imputed later with k-nearest-neighbors estimates.

In [647]:
# Find missing subject_ids in procedures_icd compared to prescriptions
missing_subject_ids = prescriptions[~prescriptions['subject_id'].isin(procedures_icd['subject_id'])]['subject_id']

# Create new rows for missing subject_ids in procedures_icd
missing_rows = pd.DataFrame({'subject_id': missing_subject_ids, 'procedures_count': np.nan})

# Concatenate missing_rows with procedures_icd
procedures_icd = pd.concat([procedures_icd, missing_rows], ignore_index=True)
procedures_icd.sort_values(by='subject_id', inplace=True)
procedures_icd.reset_index(drop=True, inplace=True)
procedures_icd.tail()

| | subject_id | procedures_count |
|---|---|---|
| 95 | 10038999 | 10.0 |
| 96 | 10039708 | 34.0 |
| 97 | 10039831 | 2.0 |
| 98 | 10039997 | 2.0 |
| 99 | 10040025 | 16.0 |


#### Combining Tables (into 1 Dataframe)

- Join all tables into a main dataframe

In [648]:
# Map each table name to its DataFrame
dataframes = {
    'admissions': admissions,
    'patients': patients,
    'diagnoses_icd': diagnoses_icd,
    'procedures_icd': procedures_icd,
    'prescriptions': prescriptions,
    'omr': omr,
    'labevents': labevents
}

# List of dataframe names in the desired order of merging
dataframe_names = ['admissions', 'patients', 'diagnoses_icd', 'procedures_icd', 'prescriptions', 'omr', 'labevents']

# Start with the first dataframe
mimic_iv = dataframes[dataframe_names[0]]

# Iterate over the remaining dataframes and merge on 'subject_id'
for name in dataframe_names[1:]:
    df_to_merge = dataframes[name]
    
    # Ensure 'subject_id' column is of the same type (e.g., str) in both dataframes
    mimic_iv['subject_id'] = mimic_iv['subject_id'].astype(str)
    df_to_merge['subject_id'] = df_to_merge['subject_id'].astype(str)
    
    # Perform the merge
    mimic_iv = pd.merge(mimic_iv, df_to_merge, on='subject_id', how='left')

# Identify the column names in mimic_iv
columns = mimic_iv.columns.tolist()

# Move '1yr_death' column to the second to last position
if '1yr_death' in columns:
    columns.remove('1yr_death')  # Remove '1yr_death' from the list of columns
    columns.append('1yr_death')  # Append '1yr_death'; it becomes second to last once 'deceased' is appended below

# Move 'deceased' column to the last position (target variable)
if 'deceased' in columns:
    columns.remove('deceased')  # Remove 'deceased' from the list of columns
    columns.append('deceased')  # Append 'deceased' to the end of the list of columns

# Reorder the columns in mimic_iv
mimic_iv = mimic_iv[columns]

mimic_iv.head()

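| | subject_id | race | last_stay | avg_stay | total_admissions | gender | age | infectious_diseases | cancer | endocrine_disorders | ... | injuries_&_poisonings | diagnosis_count | procedures_count | prescriptions_count | bmi_index | systolic_bp | diastolic_bp | abnorm_labresults_ratio | 1yr_death | deceased |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10000032 | WHITE | 1.8 | 1.4 | 4 | F | 52 | 1 | 0 | 1 | ... | 0 | 39 | 3.0 | 81 | 18.2 | 98 | 66 | 0.546 | 1.0 | 0 |
| 1 | 10001217 | WHITE | 5.9 | 6.4 | 2 | F | 55 | 1 | 0 | 0 | ... | 0 | 17 | 4.0 | 94 | 24.3 | 134 | 84 | 0.123 | 0.0 | 0 |
| 2 | 10001725 | WHITE | 3.0 | 3.0 | 1 | F | 46 | 0 | 0 | 0 | ... | 1 | 18 | 3.0 | 70 | 26.5 | 114 | 70 | 0.088 | 0.0 | 0 |
| 3 | 10002428 | WHITE | 0.8 | 5.6 | 7 | F | 80 | 1 | 0 | 1 | ... | 1 | 114 | 17.0 | 320 | 18.2 | 110 | 70 | 0.365 | 0.0 | 0 |
| 4 | 10002495 | OTHER | 6.9 | 6.9 | 1 | M | 81 | 1 | 0 | 1 | ... | 1 | 26 | 7.0 | 113 | 25.8 | 159 | 59 | 0.453 | 0.0 | 0 |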
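
The merge loop above is a left fold over the list of tables, so it can also be sketched with `functools.reduce`. Passing `validate="one_to_one"` makes pandas raise an error if any table still has more than one row per 'subject_id', which guards against silently duplicated patients. The three toy frames and the name `merged` are assumptions for illustration.

```python
from functools import reduce
import pandas as pd

# Toy per-patient tables; the last one has no row for patient 2
adm = pd.DataFrame({"subject_id": [1, 2], "deceased": [0, 1]})
pat = pd.DataFrame({"subject_id": [1, 2], "age": [52, 67]})
proc = pd.DataFrame({"subject_id": [1], "procedures_count": [3]})

merged = reduce(
    lambda left, right: left.merge(
        right, on="subject_id", how="left", validate="one_to_one"
    ),
    [adm, pat, proc],
)

# Move the target column to the end
merged = merged[[c for c in merged.columns if c != "deceased"] + ["deceased"]]
print(merged)
```

Because the join is a left join on the first table, patients missing from a later table keep their row and receive NaN in that table's columns.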


- Recheck missing values and data types

In [649]:
mimic_iv.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 30 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   subject_id                 100 non-null    object 
 1   race                       100 non-null    object 
 2   last_stay                  100 non-null    float64
 3   avg_stay                   100 non-null    float64
 4   total_admissions           100 non-null    int32  
 5   gender                     100 non-null    object 
 6   age                        100 non-null    int32  
 7   infectious_diseases        100 non-null    int32  
 8   cancer                     100 non-null    int32  
 9   endocrine_disorders        100 non-null    int32  
 10  blood_disorders            100 non-null    int32  
 11  mental_disorders           100 non-null    int32  
 12  nervous_disorders          100 non-null    int32  
 13  cardiovascular_disorders   100 non-null    int32  


- Correct data type of some of the features:

In [650]:
mimic_iv['subject_id'] = mimic_iv['subject_id'].astype(int)
mimic_iv['bmi_index'] = pd.to_numeric(mimic_iv['bmi_index'], errors='coerce')
mimic_iv['systolic_bp'] = pd.to_numeric(mimic_iv['systolic_bp'], errors='coerce')
mimic_iv['diastolic_bp'] = pd.to_numeric(mimic_iv['diastolic_bp'], errors='coerce')
mimic_iv['1yr_death'] = pd.to_numeric(mimic_iv['1yr_death'], errors='coerce').astype(int)

- Use k-nearest neighbors to fill missing values. With only 100 patients, the dataset is too small to afford dropping rows with missing data.

In [651]:
from sklearn.impute import KNNImputer  # Import necessary libraries
from sklearn.preprocessing import LabelEncoder

# Encoding categorical variables 'race' and 'gender'
le = LabelEncoder()
mimic_iv['race'] = le.fit_transform(mimic_iv['race']) # 0 = BLACK; 1 = HISPANIC; 2 = OTHER; 3 = PORTUGUESE; 4 = WHITE
mimic_iv['gender'] = le.fit_transform(mimic_iv['gender']) # F = 0; M = 1

imputer = KNNImputer(n_neighbors=5) # Initialize KNNImputer

cols_with_missing = ['procedures_count', 'bmi_index', 'systolic_bp', 'diastolic_bp'] # Specify columns with missing values

mimic_iv[cols_with_missing] = imputer.fit_transform(mimic_iv[cols_with_missing]) # Impute missing values
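
To make the imputation step concrete, here is a small sketch of what `KNNImputer` does on a toy frame: each missing entry is replaced by the mean of that column over the nearest rows, where distance is computed on the features both rows have observed. The toy values and the names `demo` and `imputed` are assumptions for illustration, and `n_neighbors=2` is chosen only because the toy frame has four rows.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy frame with one gap in the first column and one in the last
demo = pd.DataFrame(
    [[1, 2, np.nan], [3, 4, 3], [np.nan, 6, 5], [8, 8, 7]],
    columns=["procedures_count", "bmi_index", "systolic_bp"],
    dtype=float,
)

imputer = KNNImputer(n_neighbors=2)
imputed = pd.DataFrame(imputer.fit_transform(demo), columns=demo.columns)

print(imputed)
print("Remaining NaN:", int(imputed.isna().sum().sum()))
```

Since the neighbours are found by distance on the raw columns, features on larger scales weigh more heavily; whether to scale the columns before imputing is a design choice worth considering.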

- Export our final master dataframe to use later for exploratory data analysis and predictive modeling:

In [652]:
mimic_iv.to_csv('mimic_iv.csv', index=False)