In [1]:
# this notebook is based on the following study
# https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6687414/

In [2]:
# make sure changes in code take place immediately
# this is a developer setting, don't need it for production

%load_ext autoreload
%autoreload 2

%reload_ext chapyter

In [3]:
import os
os.environ["NOTEBOOK_NAME"] = "Harutyunyan_4tests_Sept29.ipynb"

## Let the study begin!

In [14]:
%%mimicSQL

I want to do the following task - 
In-hospital mortality prediction – predicting in-hospital mortality based on the first 48 hours of an ICU stay. 
I will use different clinical variables to do this prediction task - specifically from the chartevents table, capillary refill rate, diastolic blood pressure, and fraction inspired oxygen.
Can you help me retrieve the relevant data from MIMIC-III?
I only want the first 100 rows

Sure, I can help you with that. However, we need to know the ITEMIDs for 'capillary refill rate', 'diastolic blood pressure', and 'fraction inspired oxygen' in the chartevents table. 

Once we have the ITEMIDs, we can join the chartevents table with the admissions table to get the in-hospital mortality status. 

Here is a sample query to get the first 100 rows of data for these variables. Please replace 'item_id_for_cap_refill', 'item_id_for_dias_bp', and 'item_id_for_frac_insp_oxygen' with the actual ITEMIDs.

This query will return the first 100 rows of data for the specified clinical variables, along with the in-hospital mortality status. The mortality status is determined based on whether the patient's death time falls within their hospital admission and discharge times.
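
The 48-hour restriction described above can also be sketched in pandas. This is a minimal illustration on made-up rows, not the notebook's actual data; the frames `icustays` and `chartevents` here are toy stand-ins, and joining on `icustay_id` is an assumption of this sketch.

```python
import pandas as pd

# Toy stand-ins for the icustays and chartevents tables (made-up values).
icustays = pd.DataFrame({
    "icustay_id": [1, 2],
    "intime": pd.to_datetime(["2100-01-01 00:00", "2100-02-01 12:00"]),
})
chartevents = pd.DataFrame({
    "icustay_id": [1, 1, 2, 2],
    "charttime": pd.to_datetime([
        "2100-01-01 06:00",  # 6 h after the start of stay 1
        "2100-01-04 00:00",  # 72 h after the start of stay 1 (outside window)
        "2100-02-01 11:00",  # 1 h before the start of stay 2 (outside window)
        "2100-02-03 12:00",  # exactly 48 h after the start of stay 2
    ]),
    "valuenum": [60.0, 62.0, 58.0, 59.0],
})

# Attach each event to its ICU stay, then keep events 0 to 48 h after intime.
events = chartevents.merge(icustays, on="icustay_id", how="inner")
elapsed = events["charttime"] - events["intime"]
in_window = (elapsed >= pd.Timedelta(0)) & (elapsed <= pd.Timedelta(hours=48))
first_48h = events[in_window].reset_index(drop=True)
print(first_48h[["icustay_id", "charttime", "valuenum"]])
```

Joining on `icustay_id` ties each measurement to a single stay, whereas a join on `subject_id` can pair a measurement with every stay of the same patient. Whether that matters here depends on how `icustay_id` is populated in the source table.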


In [27]:
%%runSQL 


SELECT 
    icustays.subject_id,
    icustays.hadm_id,
    icustays.icustay_id,
    admissions.hospital_expire_flag,
    chartevents.charttime,
    chartevents.itemid,
    chartevents.valuenum
FROM 
    icustays
JOIN 
    admissions ON icustays.hadm_id = admissions.hadm_id
JOIN 
    chartevents ON icustays.subject_id = chartevents.subject_id
WHERE 
    chartevents.itemid IN (3348, 220051, 223835) AND
    DATE_DIFF('hour', icustays.intime, chartevents.charttime) BETWEEN 0 AND 48
LIMIT 100


Unnamed: 0,subject_id,hadm_id,icustay_id,hospital_expire_flag,charttime,itemid,valuenum
0,331,159710,296409,0,2166-04-16 05:30:00.000,3348,
1,122,135750,203019,0,2154-06-18 17:00:00.000,3348,
2,122,135750,203019,0,2154-06-18 15:00:00.000,3348,
3,122,135750,203019,0,2154-06-18 01:00:00.000,3348,
4,122,135750,203019,0,2154-06-17 23:00:00.000,3348,


In [28]:
%%mimicPython

From the above table, let's group the rows so that each row of a new table is a distinct icustay_id. Let's also add a categorical column indicating whether the icustay_id involved a capillary refill measurement.

In [29]:
##AI-generated-code

import pandas as pd

# Assuming the dataframe from the SQL query is stored in a variable named 'df'
# First, let's identify the ITEMID for capillary refill rate
cap_refill_itemid = 3348  # replace this with the actual ITEMID for capillary refill rate

# Create a new column 'cap_refill_involved' which is True if the ITEMID is for capillary refill rate, and False otherwise
df['cap_refill_involved'] = df['itemid'] == cap_refill_itemid

# Group by 'icustay_id' and aggregate such that we get a True in 'cap_refill_involved' if any row in the group has a True
grouped_df = df.groupby('icustay_id').agg({'cap_refill_involved': 'any'}).reset_index()

print(grouped_df)


  icustay_id  cap_refill_involved
0     203019                False
1     224250                False
2     235292                False
3     260558                False
4     274968                False
5     280225                False
6     288232                False
7     288409                False
8     296409                False
9     298146                False


In [31]:
%%mimicPython

Isn't the above result weird? The table shows many rows with itemid = 3348, yet every row in the resulting dataframe has a False indicator.

It seems like there might be a data type mismatch issue. The 'itemid' column in the dataframe might be of type string, while we are comparing it with an integer. 

Let's modify the code to convert the 'itemid' column to integer before the comparison.

Please replace the dataframe 'df' with your actual dataframe name.
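
The suspected type mismatch can be reproduced on a toy column. This is a small sketch under the assumption that `itemid` arrived as strings; the frame `df_demo` is hypothetical.

```python
import pandas as pd

# Toy frame where itemid is stored as strings (an assumption about how the
# query result was loaded; the real cause may differ).
df_demo = pd.DataFrame(
    {"itemid": pd.Series(["3348", "220051", "3348"], dtype=object)}
)

# Comparing strings with an integer is False for every row.
mismatch = df_demo["itemid"] == 3348
print(mismatch.tolist())

# After converting to a numeric dtype the comparison behaves as intended.
df_demo["itemid"] = pd.to_numeric(df_demo["itemid"], errors="raise")
match = df_demo["itemid"] == 3348
print(match.tolist())
```

Checking `df.dtypes` right after loading query results is a cheap way to catch this kind of silent mismatch before any grouping is done.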


In [32]:
##AI-generated-code


# Convert 'itemid' column to integer
df['itemid'] = df['itemid'].astype(int)

# Create a new column 'cap_refill_involved' which is True if the ITEMID is for capillary refill rate, and False otherwise
df['cap_refill_involved'] = df['itemid'] == cap_refill_itemid

# Group by 'icustay_id' and aggregate such that we get a True in 'cap_refill_involved' if any row in the group has a True
grouped_df = df.groupby('icustay_id').agg({'cap_refill_involved': 'any'}).reset_index()

print(grouped_df)


  icustay_id  cap_refill_involved
0     203019                 True
1     224250                 True
2     235292                False
3     260558                 True
4     274968                 True
5     280225                 True
6     288232                 True
7     288409                 True
8     296409                 True
9     298146                 True


In [33]:
%%mimicSQL

Great! Now I want to expand the analysis to other categorical variables I can get from MIMIC-III.
Let's run the same SQL query as before, but add the following variables: Glasgow coma scale eye opening, Glasgow coma scale motor response, Glasgow coma scale total, and Glasgow coma scale verbal response.

Sure, we can add these variables to the query. However, we need to know the ITEMIDs for 'Glasgow coma scale eye opening', 'Glasgow coma scale motor response', 'Glasgow coma scale total', and 'Glasgow coma scale verbal response' in the chartevents table.

Once we have the ITEMIDs, we can modify the query to include these variables. Here is a sample query. Please replace 'item_id_for_gcs_eye', 'item_id_for_gcs_motor', 'item_id_for_gcs_total', and 'item_id_for_gcs_verbal' with the actual ITEMIDs.

This query will return the first 100 rows of data for the specified clinical variables, along with the in-hospital mortality status. The mortality status is determined based on whether the patient's death time falls within their hospital admission and discharge times.


In [None]:
%%runSQL 


SELECT 
    icustays.subject_id,
    icustays.hadm_id,
    icustays.icustay_id,
    admissions.hospital_expire_flag,
    chartevents.charttime,
    chartevents.itemid,
    chartevents.valuenum
FROM 
    icustays
JOIN 
    admissions ON icustays.hadm_id = admissions.hadm_id
JOIN 
    chartevents ON icustays.subject_id = chartevents.subject_id
WHERE 
    chartevents.itemid IN (3348, 220051, 223835, item_id_for_gcs_eye, item_id_for_gcs_motor, item_id_for_gcs_total, item_id_for_gcs_verbal) AND
    DATE_DIFF('hour', icustays.intime, chartevents.charttime) BETWEEN 0 AND 48
LIMIT 100


In [34]:
%%mimicSQL

You don't know the itemIDs for those things? You knew them for the first three!

I apologize for the confusion. As an AI, I don't have direct access to the MIMIC-III database to retrieve the ITEMIDs. The ITEMIDs for the first three variables were provided in your query. 

For the Glasgow Coma Scale variables, you would need to look up the ITEMIDs in the D_ITEMS table in the MIMIC-III database. The ITEMIDs can vary depending on the specific implementation of the MIMIC-III database.

Here is a sample query to get the ITEMIDs for the Glasgow Coma Scale variables from the D_ITEMS table. Please replace 'Glasgow Coma Scale eye opening', 'Glasgow Coma Scale motor response', 'Glasgow Coma Scale total', and 'Glasgow Coma Scale verbal response' with the exact names of these variables in the D_ITEMS table.

Once you have the ITEMIDs, you can use them in the main query to get the data for these variables.


In [35]:
%%runSQL 


SELECT 
    itemid,
    label
FROM 
    d_items
WHERE 
    label IN ('Glasgow Coma Scale eye opening', 'Glasgow Coma Scale motor response', 'Glasgow Coma Scale total', 'Glasgow Coma Scale verbal response')


Unnamed: 0,itemid,label


In [37]:
%%mimicSQL

That didn't reveal anything. New ideas?

I apologize for the confusion. The labels for Glasgow Coma Scale variables might be slightly different in the database. 

You can try using a wildcard search to find the correct labels. Here is a sample query to search for labels that contain 'Glasgow Coma Scale'. 

Please replace 'd_items' with the actual name of the D_ITEMS table in your database.

This query will return all labels that contain 'Glasgow Coma Scale', along with their corresponding ITEMIDs. You can then identify the correct labels for the Glasgow Coma Scale variables and use their ITEMIDs in the main query.
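
A wildcard search only helps if the pattern matches how the labels are actually written. As a rough sketch, the same idea in pandas with a case-insensitive pattern covering several spellings; the table `d_items_demo`, its itemid values, and its labels are invented for illustration and are not taken from the real D_ITEMS table.

```python
import pandas as pd

# Toy stand-in for d_items; itemids and labels are illustrative only.
d_items_demo = pd.DataFrame({
    "itemid": [1, 2, 3, 4],
    "label": ["GCS - Eye Opening", "Heart Rate", "glasgow coma total", None],
})

# Match abbreviations and full names regardless of case; skip missing labels.
pattern = r"gcs|glasgow|coma"
hits = d_items_demo[
    d_items_demo["label"].str.contains(pattern, case=False, regex=True, na=False)
]
print(hits)
```

A broad pattern like this tends to return false positives as well, so the matched labels still need to be read by eye before their ITEMIDs are used.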


In [38]:
%%runSQL 


SELECT 
    itemid,
    label
FROM 
    d_items
WHERE 
    label LIKE '%Glasgow Coma Scale%'


Unnamed: 0,itemid,label


In [39]:
%%mimicSQL

Still no ...

I apologize for the confusion. The Glasgow Coma Scale variables might not be directly available in the chartevents table. They might be calculated from other variables or stored in a different table.

In the MIMIC-III database, the Glasgow Coma Scale variables are often found in the CHARTEVENTS table under the ITEMIDs 184, 185, 220739 for 'GCS - Eye Opening', 'GCS - Motor Response', and 'GCS - Verbal Response' respectively. The total GCS score is usually calculated by summing these three components.

Please try using these ITEMIDs in your query. If these ITEMIDs do not work, it might be necessary to calculate the Glasgow Coma Scale variables from other variables in the database.


In [64]:
%%mimicSQL

There are some other chartevents I'd like to know the itemID of. Any idea of:
1. Capillary refill rate
2. Diastolic blood pressure
3. Fraction inspired oxygen
4. Glasgow coma scale eye opening
5. Glasgow coma scale motor response
6. Glasgow coma scale total
7. Glasgow coma scale verbal response
8. Glucose
9. Heart Rate
10. Height
11. Mean blood pressure
12. Oxygen saturation
13. Respiratory rate
14. Systolic blood pressure
15. Temperature
16. Weight
17. pH

I'm sorry for the confusion, but as an AI, I don't have direct access to the MIMIC-III database to retrieve the ITEMIDs. The ITEMIDs can vary depending on the specific implementation of the MIMIC-III database.

However, you can use the D_ITEMS table in the MIMIC-III database to find the ITEMIDs for these variables. Here is a sample query to get the ITEMIDs for these variables from the D_ITEMS table. Please replace 'd_items' with the actual name of the D_ITEMS table in your database.

This query will return the ITEMIDs for these variables, if they exist in the D_ITEMS table. If the query does not return any results, the labels for these variables might be slightly different in the database. You can try using a wildcard search to find the correct labels.


In [65]:
%%runSQL 


SELECT 
    itemid,
    label
FROM 
    d_items
WHERE 
    label IN (
        'Capillary refill rate',
        'Diastolic blood pressure',
        'Fraction inspired oxygen',
        'Glasgow Coma Scale eye opening',
        'Glasgow Coma Scale motor response',
        'Glasgow Coma Scale total',
        'Glasgow Coma Scale verbal response',
        'Glucose',
        'Heart Rate',
        'Height',
        'Mean blood pressure',
        'Oxygen saturation',
        'Respiratory rate',
        'Systolic blood pressure',
        'Temperature',
        'Weight',
        'pH'
    )


Unnamed: 0,itemid,label
0,211,Heart Rate
1,1529,Glucose
2,220045,Heart Rate
3,226707,Height


In [67]:
%%mimicSQL

OK, you found 4. Without using a SQL query, can you give me your best guess for the other itemids?

I apologize for the confusion, but as an AI, I don't have direct access to the MIMIC-III database to retrieve the ITEMIDs. However, based on common usage in the MIMIC-III database, here are some possible ITEMIDs for the variables you mentioned:

1. Capillary refill rate: This is typically a clinical observation and may not have a specific ITEMID. It might be included in the NOTEEVENTS table as part of the clinical notes.

2. Diastolic blood pressure: 220051

3. Fraction inspired oxygen: 223835

4. Glasgow Coma Scale eye opening: 184

5. Glasgow Coma Scale motor response: 185

6. Glasgow Coma Scale total: This is usually calculated by summing the eye, motor, and verbal response scores.

7. Glasgow Coma Scale verbal response: 220739

8. Glucose: 1529 (already found)

9. Heart Rate: 211 or 220045 (already found)

10. Height: 226707 (already found)

11. Mean blood pressure: 220052

12. Oxygen saturation: 220277

13. Respiratory rate: 220210

14. Systolic blood pressure: 220050

15. Tempera
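
Because the IDs above are best guesses, it may be worth checking them against whatever lookup rows are available before using them. A minimal sketch: `d_items_found` copies the four rows returned by the d_items query earlier in this notebook, while `candidate_ids` and `report` are hypothetical names introduced here.

```python
import pandas as pd

# Lookup rows copied from the d_items query result earlier in this notebook.
d_items_found = pd.DataFrame({
    "itemid": [211, 1529, 220045, 226707],
    "label": ["Heart Rate", "Glucose", "Heart Rate", "Height"],
})

# A few of the suggested IDs (hypothetical structure for this sketch).
candidate_ids = {
    "Heart Rate": 211,
    "Glucose": 1529,
    "Height": 226707,
    "Diastolic blood pressure": 220051,
}

# Mark an ID as confirmed only if the lookup has it with the same label.
known = dict(zip(d_items_found["itemid"], d_items_found["label"]))
report = {
    name: ("confirmed" if known.get(itemid) == name else "unverified")
    for name, itemid in candidate_ids.items()
}
for name, status in report.items():
    print(f"{name}: {status}")
```

Here "unverified" only means the ID is absent from this small lookup, not that it is wrong; a fuller check would query d_items by itemid directly.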

In [68]:
%%mimicSQL

Great! I want a dataset of ICU stays, with each of these 17 elements above. Include whether the patient died in the hospital.

Sure, you can use the following SQL query to retrieve the data. This query will join the ICUSTAYS, ADMISSIONS, and CHARTEVENTS tables to get the desired data. It will filter the CHARTEVENTS data to include only the first 48 hours of each ICU stay and only the specified ITEMIDs. 

Please replace 'icustays', 'admissions', and 'chartevents' with the actual names of these tables in your database.

This query will return a dataset of ICU stays, with each row representing a measurement of one of the specified variables during the first 48 hours of the ICU stay. The 'hospital_expire_flag' column indicates whether the patient died in the hospital.


In [70]:
%%runSQL 


SELECT 
    icustays.subject_id,
    icustays.hadm_id,
    icustays.icustay_id,
    admissions.hospital_expire_flag,
    chartevents.itemid,
    chartevents.valuenum
FROM 
    icustays
JOIN 
    admissions ON icustays.hadm_id = admissions.hadm_id
JOIN 
    chartevents ON icustays.subject_id = chartevents.subject_id
WHERE 
    chartevents.itemid IN (220051, 223835, 184, 185, 220739, 1529, 211, 220045, 226707, 220052, 220277, 220210, 220050, 223762, 226512, 220274) AND
    DATE_DIFF('hour', icustays.intime, chartevents.charttime) BETWEEN 0 AND 48
LIMIT 1000

Unnamed: 0,subject_id,hadm_id,icustay_id,hospital_expire_flag,itemid,valuenum
0,305,133059,232719,0,211,64.0
1,305,133059,232719,0,184,4.0
2,305,133059,232719,0,211,74.0
3,305,133059,232719,0,184,4.0
4,305,133059,232719,0,211,72.0


In [71]:
%%mimicPython

Let's turn this table into one where, for each ICU stay, we can try to predict patient death. The Glasgow variables should be categorical (whether a measurement occurred during that icustay), capillary refill rate should also be categorical, and the rest continuous.

Sure, we can do that. First, we need to identify the ITEMIDs for the categorical variables. Then, we can create a new column for each categorical variable which is True if the ITEMID is for that variable, and False otherwise. For the continuous variables, we can simply take the mean value across all measurements during the ICU stay.

Here is a sample Python code to do this. Please replace 'df' with the actual name of your dataframe.

This will give you a dataframe where each row represents an ICU stay, with columns for the categorical variables indicating whether they occurred during the stay, and a column for the mean value of the continuous variables. The 'hospital_expire_flag' column indicates whether the patient died in the hospital.


In [72]:
##AI-generated-code


import pandas as pd

# Convert 'itemid' column to integer
df['itemid'] = df['itemid'].astype(int)

# Identify the ITEMIDs for the categorical variables
categorical_itemids = [184, 185, 220739, 3348]  # replace these with the actual ITEMIDs

# Create a new column for each categorical variable
for itemid in categorical_itemids:
    df['variable_' + str(itemid)] = df['itemid'] == itemid

# Group by 'icustay_id' and aggregate
grouped_df = df.groupby('icustay_id').agg({
    'hospital_expire_flag': 'first',  # this assumes all rows for an ICU stay have the same value
    'variable_184': 'any',
    'variable_185': 'any',
    'variable_220739': 'any',
    'variable_3348': 'any',
    'valuenum': 'mean'  # caution: this averages valuenum across all ITEMIDs together, not per variable
}).reset_index()

print(grouped_df)


TypeError: agg function failed [how->mean,dtype->object]

In [74]:
%%mimicPython

The above failed. Something about TypeError: agg function failed [how->mean,dtype->object]?

The error message suggests that the 'valuenum' column in your dataframe is of type object, which cannot be processed by the 'mean' function. We need to convert this column to a numeric type before calculating the mean.

Here is the modified code. Please replace 'df' with the actual name of your dataframe.
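
The effect of the conversion can be seen on a toy column. This is a small sketch with made-up values, showing how `pd.to_numeric` with `errors='coerce'` turns unparseable entries into NaN so that the mean can be computed.

```python
import pandas as pd

# Object-dtype column mixing numeric strings, a missing value, and text.
values = pd.Series(["64.0", "72", None, "n/a"], dtype=object)

# Unparseable or missing entries become NaN instead of raising an error.
numeric = pd.to_numeric(values, errors="coerce")
print(numeric.tolist())
print("mean:", numeric.mean())  # NaN entries are skipped by default
print("unusable entries:", int(numeric.isna().sum()))
```

Coercion hides bad values silently, so counting the NaN entries afterwards is a reasonable sanity check.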




In [91]:
##AI-generated-code


import pandas as pd

# Convert 'itemid' and 'valuenum' columns to appropriate types
df['itemid'] = df['itemid'].astype(int)
df['valuenum'] = pd.to_numeric(df['valuenum'], errors='coerce')

# Identify the ITEMIDs for the categorical variables
categorical_itemids = [184, 185, 220739, 3348]  # replace these with the actual ITEMIDs

# Create a new column for each categorical variable
for itemid in categorical_itemids:
    df['variable_' + str(itemid)] = df['itemid'] == itemid

# Group by 'icustay_id' and aggregate
grouped_df = df.groupby('icustay_id').agg({
    'hospital_expire_flag': 'first',  # this assumes all rows for an ICU stay have the same value
    'variable_184': 'any',
    'variable_185': 'any',
    'variable_220739': 'any',
    'variable_3348': 'any',
    'valuenum': 'mean'  # caution: this averages valuenum across all ITEMIDs together, not per variable
}).reset_index()

# print(grouped_df)


KeyError: 'itemid'

In [77]:
%%mimicPython

That worked! But what about all the other continuous variables I want to include from the original 17? Can you add those to the grouped_df?

Sure, we can add the other continuous variables to the dataframe. We need to create a separate column for each continuous variable in the original dataframe, and then calculate the mean value for each ICU stay.

Here is a sample Python code to do this. Please replace 'df' with the actual name of your dataframe.

This will give you a dataframe where each row represents an ICU stay, with columns for the mean value of each continuous variable. The 'hospital_expire_flag' column indicates whether the patient died in the hospital.
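
One alternative way to get a column per variable is to pivot the long table. The following is a sketch on made-up rows, not the notebook's own code; the names `long_df`, `continuous_ids`, `presence_ids`, and `wide` are hypothetical, and the split of ITEMIDs into the two lists is only for illustration.

```python
import pandas as pd

# Toy long-format table: one row per measurement (made-up values).
long_df = pd.DataFrame({
    "icustay_id": [10, 10, 10, 20, 20],
    "hospital_expire_flag": [0, 0, 0, 1, 1],
    "itemid": [211, 211, 220277, 211, 184],
    "valuenum": [60.0, 80.0, 97.0, 100.0, 4.0],
})
continuous_ids = [211, 220277]  # averaged per stay
presence_ids = [184]            # reduced to "was it ever recorded?"

# Mean of each continuous variable per stay, one column per ITEMID.
means = (
    long_df[long_df["itemid"].isin(continuous_ids)]
    .pivot_table(index="icustay_id", columns="itemid",
                 values="valuenum", aggfunc="mean")
    .reindex(columns=continuous_ids)
    .add_prefix("mean_")
)

# Presence flag per stay for each categorical ITEMID.
present = pd.DataFrame({
    f"has_{i}": long_df["itemid"].eq(i).groupby(long_df["icustay_id"]).any()
    for i in presence_ids
})

# Outcome label per stay (assumed constant within a stay).
label = long_df.groupby("icustay_id")["hospital_expire_flag"].first()

wide = (
    pd.concat([label, present, means], axis=1)
    .rename_axis("icustay_id")
    .reset_index()
)
print(wide)
```

Stays without any measurement of a variable end up with NaN in that column, which a downstream model would need to handle, for example by imputation.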




In [90]:
##AI-generated-code


import pandas as pd

# Convert 'itemid' and 'valuenum' columns to appropriate types
df['itemid'] = df['itemid'].astype(int)
df['valuenum'] = pd.to_numeric(df['valuenum'], errors='coerce')

# Identify the ITEMIDs for the categorical variables
categorical_itemids = [184, 185, 220739, 3348]  # replace these with the actual ITEMIDs

# Create a new column for each categorical variable
for itemid in categorical_itemids:
    df['variable_' + str(itemid)] = df['itemid'] == itemid

# Identify the ITEMIDs for the continuous variables
continuous_itemids = [220051, 223835, 1529, 211, 220045, 226707, 220052, 220277, 220210, 220050, 223762, 226512, 220274]  # replace these with the actual ITEMIDs

# Create a new column for each continuous variable
for itemid in continuous_itemids:
    df['variable_' + str(itemid)] = df['valuenum'].where(df['itemid'] == itemid)

# Group by 'icustay_id' and aggregate
agg_dict = {'hospital_expire_flag': 'first'}  # this assumes all rows for an ICU stay have the same value
agg_dict.update({col: 'any' for col in df.columns if col.startswith('variable_') and col.endswith(tuple(map(str, categorical_itemids)))})
agg_dict.update({col: 'mean' for col in df.columns if col.startswith('variable_') and col.endswith(tuple(map(str, continuous_itemids)))})

grouped_df = df.groupby('icustay_id').agg(agg_dict).reset_index()

# print(grouped_df.head(2))

KeyError: 'itemid'

In [87]:
grouped_df.shape
df = grouped_df
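
Reassigning `df` to the aggregated table is a likely (though unconfirmed) explanation for the `KeyError: 'itemid'` seen when the earlier cells are re-run, since the aggregated table has no `itemid` column. A minimal sketch of one way to avoid this, using hypothetical names (`require_columns`, `raw_events`, `per_stay`) and toy data:

```python
import pandas as pd

def require_columns(frame: pd.DataFrame, needed: list[str]) -> None:
    """Raise a descriptive error if any expected column is absent."""
    missing = [c for c in needed if c not in frame.columns]
    if missing:
        raise KeyError(f"missing columns {missing}; got {list(frame.columns)}")

# Toy per-measurement table (made-up values).
raw_events = pd.DataFrame({
    "icustay_id": [1, 1, 2],
    "itemid": [211, 184, 211],
    "valuenum": [64.0, 4.0, 72.0],
})

# Aggregate into a new name instead of overwriting the raw frame.
per_stay = raw_events.groupby("icustay_id", as_index=False)["valuenum"].mean()

require_columns(raw_events, ["icustay_id", "itemid", "valuenum"])  # passes
try:
    require_columns(per_stay, ["itemid"])
except KeyError as err:
    print("guard caught:", err)
```

Keeping the raw and aggregated tables under separate names makes cells safe to re-run in any order.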

In [54]:
%%mimicPython

Train a logistic regression on this dataframe, using the four categorical variables above, to predict death. Don't worry about overfitting, you can train and test on the entire dataset.

Sure, we can use the LogisticRegression function from the sklearn library to train a logistic regression model. Here is the Python code to do that:

This code will train a logistic regression model to predict the 'hospital_expire_flag' using the four categorical variables. The model's intercept and coefficients for the variables are printed at the end.


In [88]:
##AI-generated-code


from sklearn.linear_model import LogisticRegression

# Convert 'hospital_expire_flag' to integer
df['hospital_expire_flag'] = df['hospital_expire_flag'].astype(int)

# Define the feature columns
feature_cols = ['cap_refill_involved', 'gcs_eye_involved', 'gcs_motor_involved', 'gcs_verbal_involved']

# Define the target column
target_col = 'hospital_expire_flag'

# Create the features and target dataframes
X = df[feature_cols]
y = df[target_col]

# Initialize the Logistic Regression model
model = LogisticRegression()

# Train the model
model.fit(X, y)

# Print the model's coefficients
print('Intercept:', model.intercept_)
print('Coefficients:', dict(zip(feature_cols, model.coef_[0])))


KeyError: "None of [Index(['cap_refill_involved', 'gcs_eye_involved', 'gcs_motor_involved',\n       'gcs_verbal_involved'],\n      dtype='object')] are in the [columns]"
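
This error says that none of the four feature names exist in the dataframe. The aggregation cells earlier in this notebook name their columns `variable_<itemid>`, so one possible fix is to rename those columns first. The sketch below uses a toy table (`stays`) and follows the ITEMID-to-variable pairing suggested earlier in the conversation, which has not been verified against d_items.

```python
import pandas as pd

# Toy per-stay table using the variable_<itemid> naming (made-up values).
stays = pd.DataFrame({
    "icustay_id": [1, 2],
    "hospital_expire_flag": [0, 1],
    "variable_3348": [True, False],
    "variable_184": [True, True],
    "variable_185": [False, True],
    "variable_220739": [True, False],
})

# Pairing taken from this notebook's conversation; treat it as unverified.
rename_map = {
    "variable_3348": "cap_refill_involved",
    "variable_184": "gcs_eye_involved",
    "variable_185": "gcs_motor_involved",
    "variable_220739": "gcs_verbal_involved",
}
stays = stays.rename(columns=rename_map)

# Fail early with a readable message if a feature column is still missing.
feature_cols = list(rename_map.values())
missing = [c for c in feature_cols if c not in stays.columns]
assert not missing, f"missing feature columns: {missing}"

X = stays[feature_cols].astype(int)  # booleans as 0/1 for the model
print(X)
```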

In [58]:
%%mimicPython

I want to know the accuracy on the task!!

Sure, we can calculate the accuracy of the model on the training data using the score method of the LogisticRegression model. Here is the Python code to do that:

This code will print the accuracy of the model on the training data.


##AI-generated-code

# Calculate the accuracy of the model on the training data
accuracy = model.score(X, y)
print('Accuracy:', accuracy)


In [59]:
accuracy = model.score(X, y)
print(accuracy)

0.8923076923076924
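
An accuracy figure is easier to interpret next to the accuracy of always predicting the more common outcome. The sketch below computes that baseline on made-up labels (`y_demo` is hypothetical); it says nothing about the class balance in this notebook's data.

```python
import numpy as np

# Made-up labels: 1 = died in hospital, 0 = survived.
y_demo = np.array([0] * 9 + [1])

# Accuracy obtained by always predicting the majority class.
positive_rate = y_demo.mean()
majority_baseline = max(positive_rate, 1 - positive_rate)
print(f"positive rate: {positive_rate:.2f}")
print(f"majority-class accuracy: {majority_baseline:.2f}")
```

If the model's accuracy is close to this baseline, the features are adding little beyond the outcome frequency, especially when training and testing on the same rows.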


In [60]:
%%mimicPython

Which attribute is most important in this classification task?

The importance of an attribute in a logistic regression model can be determined by the absolute value of its coefficient. The larger the absolute value of the coefficient, the more important the attribute is in the model.
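
As a rough sketch of this ranking idea, the snippet below sorts all features by absolute coefficient rather than reporting only the largest. The coefficient values are hypothetical placeholders standing in for `model.coef_[0]`; they are not fitted results from this notebook.

```python
import numpy as np

feature_names = ["cap_refill_involved", "gcs_eye_involved",
                 "gcs_motor_involved", "gcs_verbal_involved"]
# Hypothetical coefficients for illustration only.
coefs = np.array([0.10, -0.35, 0.80, -0.05])

# Sort features from largest to smallest absolute coefficient.
order = np.argsort(-np.abs(coefs))
ranking = [(feature_names[i], float(coefs[i])) for i in order]
for name, value in ranking:
    print(f"{name}: {value:+.2f}")
```

Comparing magnitudes this way is only meaningful when the features share a scale, as binary indicators do. The sign shows the direction of the association, not its cause.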

Here is the Python code to find the most important attribute:

This code will print the name of the most important attribute in the model.




In [61]:
##AI-generated-code


# Get the absolute values of the coefficients
abs_coefficients = abs(model.coef_[0])

# Get the index of the maximum absolute coefficient
max_index = abs_coefficients.argmax()

# Get the name of the most important attribute
most_important_attribute = feature_cols[max_index]

print('Most important attribute:', most_important_attribute)


Most important attribute: gcs_motor_involved


In [62]:
%%mimicPython

Any idea why this might be the most important variable?

The Glasgow Coma Scale (GCS) motor response is a measure of a patient's motor response to stimuli, which is a key indicator of their neurological function. A lower score (indicating a poorer motor response) can be a sign of serious neurological damage or impairment, which could increase the risk of in-hospital mortality. 

Therefore, it's not surprising that the GCS motor response is an important variable in predicting in-hospital mortality. However, this is a general interpretation and the specific reason might vary depending on the context and the specific dataset.
