In [1]:
# this notebook is based on the following study
# https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6687414/

In [1]:
# make sure changes in code take place immediately
# this is a developer setting, don't need it for production

%load_ext autoreload
%autoreload 2

%reload_ext chapyter

In [4]:
import os
os.environ["NOTEBOOK_NAME"] = "Harutyunyan_4tests_Sept29-Copy1.ipynb"

## Let the study begin!

In [20]:
%%mimicSQL

How can I join patients and chartevents tables in MIMIC-III?

You can join the `patients` and `chartevents` tables on the `subject_id` column, which is common to both tables. Here is a basic example of how you can do this:

This query will return the `subject_id`, `itemid`, `charttime`, `value`, and `valuenum` from the `chartevents` table for each patient in the `patients` table. Please modify the columns in the SELECT statement to suit your specific needs.
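
The join described above can be sketched with Python's standard-library `sqlite3` module. This is only an illustration: the two in-memory tables are toy stand-ins with invented rows, and the real MIMIC-III tables carry many more columns.

```python
import sqlite3

# Toy stand-ins for the MIMIC-III tables; all rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE patients (subject_id INTEGER PRIMARY KEY, gender TEXT);
    CREATE TABLE chartevents (
        subject_id INTEGER, itemid INTEGER, charttime TEXT,
        value TEXT, valuenum REAL
    );
    INSERT INTO patients VALUES (1, 'F'), (2, 'M');
    INSERT INTO chartevents VALUES
        (1, 211, '2149-02-21 03:00:00', '80', 80.0),
        (1, 618, '2149-02-21 03:00:00', '18', 18.0),
        (2, 211, '2149-02-22 04:00:00', '95', 95.0);
""")

# Inner join on the shared subject_id key: one output row per chart event.
join_rows = conn.execute("""
    SELECT p.subject_id, c.itemid, c.charttime, c.value, c.valuenum
    FROM patients p
    JOIN chartevents c ON p.subject_id = c.subject_id
    ORDER BY p.subject_id, c.itemid
""").fetchall()

for row in join_rows:
    print(row)
```

Because this is an inner join, patients without any chart events would not appear in the result.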


In [25]:
%%mimicSQL

Great, let's create a table that joins them.

Sure, you can create a new table that joins the `patients` and `chartevents` tables. Here is an example of how you can do this:

This will create a new table named `patients_chartevents` that contains the `subject_id`, `itemid`, `charttime`, `value`, and `valuenum` from the `chartevents` table for each patient in the `patients` table. Please modify the columns in the SELECT statement to suit your specific needs.
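
One way to materialize the join as a table is `CREATE TABLE ... AS SELECT`. The sketch below assumes that syntax is available in the target database (it is in SQLite, used here through the standard-library `sqlite3` module) and runs against toy tables with invented rows.

```python
import sqlite3

# Toy stand-ins for the MIMIC-III tables; all rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE patients (subject_id INTEGER PRIMARY KEY);
    CREATE TABLE chartevents (
        subject_id INTEGER, itemid INTEGER, charttime TEXT,
        value TEXT, valuenum REAL
    );
    INSERT INTO patients VALUES (1), (2);
    INSERT INTO chartevents VALUES
        (1, 211, '2149-02-21 03:00:00', '80', 80.0),
        (2, 211, '2149-02-22 04:00:00', '95', 95.0);

    -- Store the result of the join in a new table.
    CREATE TABLE patients_chartevents AS
    SELECT p.subject_id, c.itemid, c.charttime, c.value, c.valuenum
    FROM patients p
    JOIN chartevents c ON p.subject_id = c.subject_id;
""")

# The new table can now be queried like any other.
created_count = conn.execute(
    "SELECT COUNT(*) FROM patients_chartevents"
).fetchone()[0]
print(created_count)
```

On the full dataset such a table would be very large, so restricting the `SELECT` to the needed `itemid` values before materializing is usually worth considering.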


In [29]:
%%runSQL 

SELECT p.subject_id, c.itemid, c.charttime, c.value, c.valuenum
FROM patients p
JOIN chartevents c
ON p.subject_id = c.subject_id
LIMIT 100

Unnamed: 0,subject_id,itemid,charttime,value,valuenum
0,265,526,2149-02-21 03:00:00.000,PO Med,
1,265,492,2149-02-21 03:00:00.000,37,37.0
2,265,58,2149-02-21 03:00:00.000,1,1.0
3,265,56,2149-02-21 03:00:00.000,No,
4,265,51,2149-02-21 03:00:00.000,128,128.0


In [32]:
%%mimicSQL

same thing, but include icustay, as well as chart events for

1. Capillary refill rate
2. Diastolic blood pressure
3. Fraction inspired oxygen
4. Glasgow coma scale eye opening
5. Glasgow coma scale motor response
6. Glasgow coma scale total
7. Glasgow coma scale verbal response
8. Glucose
9. Heart Rate
10. Height
11. Mean blood pressure
12. Oxygen saturation
13. Respiratory rate
14. Systolic blood pressure
15. Temperature
16. Weight
17. pH

To include `icustays` and specific chart events, you would need to join the `icustays` table with the `patients` and `chartevents` tables. The `itemid` values for the chart events you mentioned are typically as follows:

1. Capillary refill rate: 3348

2. Diastolic blood pressure: 8368

3. Fraction inspired oxygen: 3420

4. Glasgow coma scale eye opening: 184

5. Glasgow coma scale motor response: 223901

6. Glasgow coma scale total: 198

7. Glasgow coma scale verbal response: 223900

8. Glucose: 807

9. Heart Rate: 211

10. Height: 920

11. Mean blood pressure: 456

12. Oxygen saturation: 646

13. Respiratory rate: 618

14. Systolic blood pressure: 51

15. Temperature: 678

16. Weight: 763

17. pH: 780

Here is the SQL query:

Please note that the `itemid` values are best guesses and may vary. You should verify these `itemid` values from the `d_items` table in your MIMIC-III dataset.
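
The verification suggested above can be sketched as a label search against `d_items`. The helper `find_items` is a hypothetical name introduced here, and the table is a toy copy with a handful of illustrative rows. In the real table the `label`, `dbsource`, and `linksto` columns serve the same purpose.

```python
import sqlite3

# Toy copy of d_items; the rows are illustrative, not an extract of the dataset.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE d_items (itemid INTEGER, label TEXT, dbsource TEXT, linksto TEXT);
    INSERT INTO d_items VALUES
        (211, 'Heart Rate', 'carevue', 'chartevents'),
        (220045, 'Heart Rate', 'metavision', 'chartevents'),
        (618, 'Respiratory Rate', 'carevue', 'chartevents');
""")


def find_items(connection, pattern):
    """Return (itemid, label, dbsource) rows whose label matches a LIKE pattern."""
    return connection.execute(
        "SELECT itemid, label, dbsource FROM d_items "
        "WHERE linksto = 'chartevents' AND LOWER(label) LIKE LOWER(?) "
        "ORDER BY itemid",
        (pattern,),
    ).fetchall()


heart_rate_items = find_items(conn, "%heart rate%")
for item in heart_rate_items:
    print(item)
```

A single concept can map to more than one `itemid`, for example one per source system, so a list with one ID per measurement may miss part of the records.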


In [53]:
%%runSQL 


SELECT p.subject_id, i.icustay_id, c.itemid, c.charttime, c.value, c.valuenum
FROM patients p
JOIN icustays i
ON p.subject_id = i.subject_id
JOIN chartevents c
ON p.subject_id = c.subject_id
WHERE c.itemid IN (3348, 8368, 3420, 184, 223901, 198, 223900, 807, 211, 920, 456, 646, 618, 51, 678, 763, 780)
LIMIT 5000


Unnamed: 0,subject_id,icustay_id,itemid,charttime,value,valuenum
0,253,272631,8368,2174-01-22 13:00:00.000,52,52.0
1,253,272631,8368,2174-01-22 12:00:00.000,49,49.0
2,253,272631,8368,2174-01-22 11:00:00.000,50,50.0
3,253,272631,8368,2174-01-22 10:00:00.000,53,53.0
4,253,272631,8368,2174-01-22 09:00:00.000,43,43.0


In [61]:
%%mimicSQL

Same table, except also include whether that subject died in the hospital

To include whether the patient died in the hospital, you would need to join the `admissions` table with the `patients`, `icustays`, and `chartevents` tables. The `hospital_expire_flag` in the `admissions` table indicates whether the patient died in the hospital.

Here is the SQL query:

This will return the `subject_id`, `icustay_id`, `itemid`, `charttime`, `value`, `valuenum`, and `hospital_expire_flag` for each patient. The `hospital_expire_flag` is 1 if the patient died in the hospital and 0 otherwise.


In [79]:
%%runSQL 


SELECT p.subject_id, i.icustay_id, c.itemid, c.charttime, c.value, c.valuenum, a.hospital_expire_flag
FROM patients p
JOIN icustays i
ON p.subject_id = i.subject_id
JOIN chartevents c
ON p.subject_id = c.subject_id
JOIN admissions a
ON p.subject_id = a.subject_id
WHERE c.itemid IN (3348, 8368, 3420, 184, 223901, 198, 223900, 807, 211, 920, 456, 646, 618, 51, 678, 763, 780)
LIMIT 50000


Unnamed: 0,subject_id,icustay_id,itemid,charttime,value,valuenum,hospital_expire_flag
0,249,263882,51,2149-12-20 11:00:00.000,100,100.0,0
1,249,263882,51,2149-12-20 11:00:00.000,100,100.0,0
2,249,263882,51,2149-12-20 11:00:00.000,100,100.0,0
3,249,263882,51,2149-12-20 10:00:00.000,117,117.0,0
4,249,263882,51,2149-12-20 10:00:00.000,117,117.0,0
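
The repeated rows in the output above are what one would expect when every table is joined on `subject_id` alone: a patient with several admissions or ICU stays has each chart event paired with every one of them. The following sketch, on toy tables with invented rows, compares that join with one that follows the `icustay_id` and `hadm_id` keys instead. It assumes those keys are populated for the events of interest.

```python
import sqlite3

# Toy tables: one patient, two admissions (one fatal), one ICU stay per admission,
# one chart event per stay. All values are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE patients (subject_id INTEGER);
    CREATE TABLE admissions (subject_id INTEGER, hadm_id INTEGER,
                             hospital_expire_flag INTEGER);
    CREATE TABLE icustays (subject_id INTEGER, hadm_id INTEGER, icustay_id INTEGER);
    CREATE TABLE chartevents (subject_id INTEGER, hadm_id INTEGER, icustay_id INTEGER,
                              itemid INTEGER, charttime TEXT, valuenum REAL);
    INSERT INTO patients VALUES (1);
    INSERT INTO admissions VALUES (1, 10, 0), (1, 11, 1);
    INSERT INTO icustays VALUES (1, 10, 100), (1, 11, 101);
    INSERT INTO chartevents VALUES
        (1, 10, 100, 51, '2149-12-20 10:00:00', 117.0),
        (1, 11, 101, 51, '2150-03-01 08:00:00', 100.0);
""")

# Joining everything on subject_id pairs each event with every stay and admission.
by_subject = conn.execute("""
    SELECT p.subject_id, i.icustay_id, c.itemid, c.charttime, c.valuenum,
           a.hospital_expire_flag
    FROM patients p
    JOIN icustays i ON p.subject_id = i.subject_id
    JOIN chartevents c ON p.subject_id = c.subject_id
    JOIN admissions a ON p.subject_id = a.subject_id
""").fetchall()

# Following the stay and admission keys keeps one row per chart event.
by_keys = conn.execute("""
    SELECT c.subject_id, c.icustay_id, c.itemid, c.charttime, c.valuenum,
           a.hospital_expire_flag
    FROM chartevents c
    JOIN icustays i ON c.icustay_id = i.icustay_id
    JOIN admissions a ON i.hadm_id = a.hadm_id
    ORDER BY c.icustay_id
""").fetchall()

print(len(by_subject), len(by_keys))
```

With the key-based join each event also carries the `hospital_expire_flag` of its own admission, so a single ICU stay cannot end up labelled both 0 and 1.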


In [42]:
%%mimicPython

Can you take this table, and replace itemid with the relevant string-name for that item?

Sure, you can replace the `itemid` with the relevant string-name for that item by creating a dictionary where the keys are the `itemid` values and the values are the corresponding string-names. Then, you can use the `replace()` method of the column to substitute the names in the dataframe. Here is an example:

This will replace the `itemid` in the dataframe with the corresponding string-names. Please note that the `itemid` values are best guesses and may vary. You should verify these `itemid` values from the `d_items` table in your MIMIC-III dataset.


In [80]:
##AI-generated-code


# Create a dictionary where the keys are the itemid and the values are the corresponding string-names
item_dict = {3348: 'Capillary refill rate', 8368: 'Diastolic blood pressure', 3420: 'Fraction inspired oxygen', 
             184: 'Glascow coma scale eye opening', 223901: 'Glascow coma scale motor response', 198: 'Glascow coma scale total', 
             223900: 'Glascow coma scale verbal response', 807: 'Glucose', 211: 'Heart Rate', 920: 'Height', 
             456: 'Mean blood pressure', 646: 'Oxygen saturation', 618: 'Respiratory rate', 51: 'Systolic blood pressure', 
             678: 'Temperature', 763: 'Weight', 780: 'pH'}

# Replace the itemid in the dataframe with the corresponding string-names
df['itemid'] = df['itemid'].replace(item_dict)

print(df.head())


  subject_id icustay_id itemid                charttime value valuenum  \
0        249     263882     51  2149-12-20 11:00:00.000   100    100.0   
1        249     263882     51  2149-12-20 11:00:00.000   100    100.0   
2        249     263882     51  2149-12-20 11:00:00.000   100    100.0   
3        249     263882     51  2149-12-20 10:00:00.000   117    117.0   
4        249     263882     51  2149-12-20 10:00:00.000   117    117.0   

  hospital_expire_flag  
0                    0  
1                    0  
2                    0  
3                    0  
4                    0  
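
The printed frame above still shows `51` in the `itemid` column, so the replacement had no visible effect. One likely explanation, assuming the SQL step returns `itemid` as strings, is that the integer keys of the dictionary never match the string values. A minimal sketch of the mismatch and of one possible fix, using a toy frame with invented rows:

```python
import pandas as pd

# Toy frame in which itemid holds strings, as it might after a SQL export.
df_demo = pd.DataFrame({"itemid": pd.Series(["51", "211", "51"], dtype=object)})
item_names = {51: "Systolic blood pressure", 211: "Heart Rate"}

# Integer keys do not match string values, so the column comes back unchanged.
unchanged = df_demo["itemid"].replace(item_names)

# Casting to int first lets the lookup succeed; map() gives NaN for unmapped ids.
named = df_demo["itemid"].astype(int).map(item_names)

print(unchanged.tolist())
print(named.tolist())
```

Checking `df['itemid'].dtype` before the replacement is a quick way to tell which case applies.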


In [47]:
%%mimicPython

Great! What are the different itemids we have now?

You can use the `unique()` function to get the unique values in the 'itemid' column. Here is how you can do it:




In [81]:
##AI-generated-code


# Get the unique values in the 'itemid' column
unique_itemids = df['itemid'].unique()

print(unique_itemids)


['51' '8368' '223901' '223900' '618' '211' '198' '184' '646' '678' '780'
 '807' '456' '920' '763' '3420' '3348']


In [71]:
%%mimicPython

Great - let's take the df above, and group by icustay_id and hospital_expire_flag, and make new columns for each itemid

You can use the `pivot_table()` function to create a new dataframe where each `itemid` is a new column. The `pivot_table()` function will group the data by `icustay_id` and `hospital_expire_flag`, and create new columns for each unique `itemid`. Here is how you can do it:

Please note that this will create NaN values for any `itemid` that a particular `icustay_id` does not have. You may want to fill these NaN values with a specific value or use some form of imputation.


In [82]:
##AI-generated-code


# Pivot the dataframe to create new columns for each itemid
df_pivot = df.pivot_table(index=['icustay_id', 'hospital_expire_flag'], columns='itemid', values='valuenum', aggfunc='mean')

# Reset the index
df_pivot.reset_index(inplace=True)

print(df_pivot.head())


TypeError: agg function failed [how->mean,dtype->object]
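
The `dtype->object` part of the error indicates that `valuenum` is stored as Python objects (for example strings) rather than numbers, so a mean cannot be computed. A minimal sketch of one fix, converting the column with `pd.to_numeric` before pivoting, on a toy frame with invented rows:

```python
import pandas as pd

# Toy long-format frame; valuenum holds strings, including one unparseable entry.
df_demo = pd.DataFrame({
    "icustay_id": [100, 100, 100, 101, 101],
    "hospital_expire_flag": [0, 0, 0, 1, 1],
    "itemid": ["Heart Rate", "Heart Rate", "Glucose", "Heart Rate", "Glucose"],
    "valuenum": ["80", "90", "110.5", "75", "error"],
})

# Convert to float; values that cannot be parsed become NaN instead of raising.
df_demo["valuenum"] = pd.to_numeric(df_demo["valuenum"], errors="coerce")

# One row per ICU stay, one column per item, averaging repeated measurements.
pivot_demo = df_demo.pivot_table(
    index=["icustay_id", "hospital_expire_flag"],
    columns="itemid",
    values="valuenum",
    aggfunc="mean",
).reset_index()

print(pivot_demo)
```

Averaging over the whole stay is only one possible aggregation. Restricting to a time window before averaging is a design choice that depends on the prediction task.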

In [73]:
df_pivot

itemid,icustay_id,hospital_expire_flag,Diastolic blood pressure,Glascow coma scale eye opening,Glascow coma scale total,Glucose,Heart Rate,Height,Mean blood pressure,Oxygen saturation,Respiratory rate,Systolic blood pressure,Temperature,Weight,pH
0,200779,0,70.230769,4.0,15.0,113.727273,61.103774,73.0,77.478875,97.181818,17.317308,116.769231,98.604,96.300003,
1,223660,0,71.574468,3.15,12.179487,98.6,122.8,64.0,77.179487,99.208955,17.860465,129.851064,101.489655,72.999997,7.442222
2,224522,0,,3.933333,14.933333,,86.362069,,66.207555,96.18,19.857143,,96.5,79.799999,
3,224522,1,,3.933333,14.933333,,86.362069,,66.207555,96.18,19.857143,,96.5,79.799999,
4,234479,0,82.0,3.25,12.916667,,75.16129,,92.577767,96.517241,22.1,150.0,98.085715,72.800003,7.46
5,254153,0,,4.0,15.0,148.692308,99.181818,,76.507931,98.935484,17.970588,,99.050001,78.300003,
6,269462,0,64.147059,3.063291,6.512821,,96.354015,69.0,75.534799,98.049669,17.738806,97.014493,99.954386,72.0,7.426538
7,272890,0,,3.941176,10.6,,64.908397,,68.833333,98.8125,16.861538,,97.123529,37.0,
8,276575,0,,,,,134.0,,,,,,,,
9,284305,0,,3.875,14.875,,89.016667,,80.197535,97.421053,16.05,,99.107143,,


In [74]:
%%mimicPython

Now, I want to make a simple model that predicts whether the person dies in the hospital. Help me decide on a model and train it.

Sure, a simple and commonly used model for binary classification tasks like this is Logistic Regression. It's a good starting point for this kind of problem. 

Before we train the model, we need to handle missing values because Logistic Regression cannot handle them. A simple strategy is to fill missing values with the mean of the column. 

Then, we will split the data into a training set and a test set. We will use the training set to train the model and the test set to evaluate its performance.

Here is how you can do it:

Please note that this is a simple model and there is a lot of room for improvement. You could try different imputation strategies, feature selection, different models, hyperparameter tuning, etc. to improve the model's performance.


In [75]:
##AI-generated-code


from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer

# Handle missing values
imputer = SimpleImputer(strategy='mean')
df_pivot_imputed = imputer.fit_transform(df_pivot)

# Split the data into features and target
X = df_pivot_imputed[:, :-1]  # Features
y = df_pivot_imputed[:, -1]  # Target

# Standardize the features
scaler = StandardScaler()
X = scaler.fit_transform(X)

# Split the data into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a Logistic Regression model
model = LogisticRegression()

# Train the model
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Print the accuracy of the model
print('Accuracy:', accuracy_score(y_test, y_pred))

# Print the confusion matrix
print('Confusion Matrix:\n', confusion_matrix(y_test, y_pred))

# Print the classification report
print('Classification Report:\n', classification_report(y_test, y_pred))


ValueError: Unknown label type: continuous. Maybe you are trying to fit a classifier, which expects discrete classes on a regression target with continuous values.

In [76]:
%%mimicPython

ValueError: Unknown label type: continuous. Maybe you are trying to fit a classifier, which expects discrete classes on a regression target with continuous values.

I apologize for the confusion. The previous code took the last column of the imputed array as the target. In `df_pivot` that column is `pH`, a continuous measurement, so the classifier rejected it. The target should be `hospital_expire_flag`, which is binary (0 or 1).

Let's correct the code by selecting `hospital_expire_flag` by name and casting it to an integer type. Here is the corrected code:

Please run this corrected code.


In [77]:
##AI-generated-code


import pandas as pd

# Handle missing values; wrap the result in a DataFrame so columns can be selected by name
imputer = SimpleImputer(strategy='mean')
df_pivot_imputed = pd.DataFrame(imputer.fit_transform(df_pivot), columns=df_pivot.columns)

# Convert 'hospital_expire_flag' to int type
df_pivot_imputed['hospital_expire_flag'] = df_pivot_imputed['hospital_expire_flag'].astype(int)

# Split the data into features and target
X = df_pivot_imputed.drop('hospital_expire_flag', axis=1)  # Features
y = df_pivot_imputed['hospital_expire_flag']  # Target

# Standardize the features
scaler = StandardScaler()
X = scaler.fit_transform(X)

# Split the data into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a Logistic Regression model
model = LogisticRegression()

# Train the model
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Print the accuracy of the model
print('Accuracy:', accuracy_score(y_test, y_pred))

# Print the confusion matrix
print('Confusion Matrix:\n', confusion_matrix(y_test, y_pred))

# Print the classification report
print('Classification Report:\n', classification_report(y_test, y_pred))


Accuracy: 0.6666666666666666
Confusion Matrix:
 [[2 1]
 [0 0]]
Classification Report:
               precision    recall  f1-score   support

           0       1.00      0.67      0.80         3
           1       0.00      0.00      0.00         0

    accuracy                           0.67         3
   macro avg       0.50      0.33      0.40         3
weighted avg       1.00      0.67      0.80         3



  _warn_prf(average, modifier, msg_start, len(result))


In [78]:
print(df_pivot.shape)

(14, 15)
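
With only 14 rows, a 20% hold-out leaves 3 test samples, none of them positive in the run above, so the reported accuracy says little about the model. A common alternative on small data is stratified cross-validation with the imputer and scaler inside a pipeline, so that both are fitted on the training folds only. The sketch below uses synthetic data as a stand-in (the feature matrix, labels, and missingness rate are invented), and it assumes identifier columns such as `icustay_id` have been dropped from the features.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the pivoted features and the binary outcome.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 5))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=60) > 0).astype(int)
X_demo[rng.random(X_demo.shape) < 0.1] = np.nan  # inject missing values

# Imputation and scaling live inside the pipeline, so each fold fits them
# on its own training portion and no test information leaks in.
clf = make_pipeline(
    SimpleImputer(strategy="mean"),
    StandardScaler(),
    LogisticRegression(),
)

# Stratification keeps both classes present in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(clf, X_demo, y_demo, cv=cv, scoring="balanced_accuracy")

print(scores.mean())
```

Balanced accuracy is used here because plain accuracy can look good on imbalanced outcomes by predicting the majority class. Any other metric suited to the study could be substituted.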
