
# Self Learning Tutorial
Vital signs: 
systolic blood pressure (SBP),
mean arterial pressure (MAP), 
respiratory rate (RR), 
oxygen saturation (PulOx), 
heart rate (HR), 
temperature (Temp)

Lab results: 
white blood cell count (WBC), 
bilirubin (Bili), 
blood urea nitrogen (BUN), 
lactate (Lac), 
creatinine (Creat), 
platelet count (Plat), 
neutrophils (Bands)

Intervention: fraction of inspired oxygen (FiO2)

The CHARTEVENTS table in MIMIC-III contains patient vitals. Each vital sign is identified by its itemid; the label in parentheses is the one listed in D_ITEMS. These are MetaVision itemids (dbsource = metavision in D_ITEMS); CareVue-era records use different itemids and are not covered here. For example:

Systolic Blood Pressure (SBP): itemid = 220050 (Arterial Blood Pressure systolic)

Mean Arterial Pressure (MAP): itemid = 220052 (Arterial Blood Pressure mean)

Respiratory Rate (RR): itemid = 220210 (Respiratory Rate)

Oxygen Saturation (PulOx): itemid = 220277 (O2 saturation pulseoxymetry)

Heart Rate (HR): itemid = 220045 (Heart Rate)

Temperature (Temp): itemid = 223761 (Temperature Fahrenheit)

In [15]:
# File paths for the MIMIC-III tables used in this tutorial
dataset_labevents = 'mimic/LABEVENTS.csv'
dataset_chartevents = 'mimic/CHARTEVENTS.csv'
dataset_labitems = 'mimic/D_LABITEMS.csv'
#'mimic/DIAGNOSES_ICD.csv'
#dataset_path3 = 'mimic/D_ICD_DIAGNOSES.csv'

In [12]:
# Look up the vital-sign itemids in the D_ITEMS dictionary table
import pandas as pd
dataset_chartitems = 'mimic/D_ITEMS.csv'
df_chartitems = pd.read_csv(dataset_chartitems)

df_chartitems.columns = df_chartitems.columns.str.lower()
result = df_chartitems.loc[df_chartitems['itemid'].isin([ 220050, 220052, 220210, 220277, 220045, 223761])]
print(result)

       row_id  itemid                             label   abbreviation  \
11498   12712  220045                        Heart Rate             HR   
11502   12716  220050  Arterial Blood Pressure systolic           ABPs   
11504   12718  220052      Arterial Blood Pressure mean           ABPm   
11524   12738  220210                  Respiratory Rate             RR   
12355   12746  220277       O2 saturation pulseoxymetry           SpO2   
12366   12757  223761            Temperature Fahrenheit  Temperature F   

         dbsource      linksto             category  unitname param_type  \
11498  metavision  chartevents  Routine Vital Signs       bpm    Numeric   
11502  metavision  chartevents  Routine Vital Signs      mmHg    Numeric   
11504  metavision  chartevents  Routine Vital Signs      mmHg    Numeric   
11524  metavision  chartevents          Respiratory  insp/min    Numeric   
12355  metavision  chartevents          Respiratory         %    Numeric   
12366  metavision  charte

Lab Results:

The LABEVENTS table stores the laboratory measurements. The label in parentheses is the one listed in D_LABITEMS.

White Blood Cell Count (WBC): itemid = 51300 (WBC Count)

Bilirubin (Bili): itemid = 50885 (Bilirubin, Total)

Blood Urea Nitrogen (BUN): itemid = 51006 (Urea Nitrogen)

Lactate (Lac): itemid = 50813 (Lactate)

Creatinine (Creat): itemid = 50912 (Creatinine)

Platelet Count (Plat): itemid = 51265 (Platelet Count)

Neutrophils (Bands): itemid = 51146. Note that D_LABITEMS labels 51146 as Basophils, not Bands (see the lookup output below), so this itemid should be checked against D_LABITEMS before the column is interpreted as band neutrophils.

In [5]:
# Get data from the LabEvents table in the MIMIC-III database

In [None]:
# show the lab items
import pandas as pd
df_labitems = pd.read_csv(dataset_labitems)

df_labitems.columns = df_labitems.columns.str.lower()
result = df_labitems.loc[df_labitems['itemid'].isin([51300, 50885, 51006, 50813, 50912, 51265, 51146])]
print(result)


     row_id  itemid             label  fluid    category loinc_code
140      14   50813           Lactate  Blood   Blood Gas    32693-4
212      86   50885  Bilirubin, Total  Blood   Chemistry     1975-2
239     113   50912        Creatinine  Blood   Chemistry     2160-0
332     206   51006     Urea Nitrogen  Blood   Chemistry     3094-0
472     346   51146         Basophils  Blood  Hematology      704-7
591     465   51265    Platelet Count  Blood  Hematology      777-3
626     500   51300         WBC Count  Blood  Hematology    26464-8


Interventions:

The INPUTEVENTS_MV and INPUTEVENTS_CV tables record administered inputs, but FiO2 is a charted value, so this tutorial reads it from CHARTEVENTS.

Fraction of Inspired Oxygen (FiO2): itemid = 223835 (Inspired O2 Fraction, in the CHARTEVENTS table)

In [6]:
# FiO2 is read from CHARTEVENTS, so INPUTEVENTS_MV is not loaded here

In [13]:
# show FiO2 item
result = df_chartitems.loc[df_chartitems['itemid'].isin([223835])]
print(result)


       row_id  itemid                 label abbreviation    dbsource  \
12413   12804  223835  Inspired O2 Fraction         FiO2  metavision   

           linksto     category unitname param_type  conceptid  
12413  chartevents  Respiratory      NaN    Numeric        NaN  


ICD-9 Codes for Sepsis-Related Identification:
Admissions with sepsis or septic shock are identified by the ICD-9 codes recorded in the DIAGNOSES_ICD table; the D_ICD_DIAGNOSES table holds the description of each code. Relevant codes include:

Sepsis: 99591 (Sepsis), 99592 (Severe Sepsis)
Septic Shock: 78552 (Septic Shock)

In [15]:
# Look up the sepsis-related ICD-9 codes in the D_ICD_DIAGNOSES dictionary table
dataset_icd9_code = 'mimic/D_ICD_DIAGNOSES.csv'
df_icd9_code = pd.read_csv(dataset_icd9_code)

df_icd9_code.columns = df_icd9_code.columns.str.lower()
result = df_icd9_code.loc[df_icd9_code['icd9_code'].isin(['99591', '99592', '78552'])]
print(result)

       row_id icd9_code    short_title     long_title
10304   11403     99591         Sepsis         Sepsis
10305   11404     99592  Severe sepsis  Severe sepsis
13142   12991     78552   Septic shock   Septic shock


Load the diagnoses, then read CHARTEVENTS in chunks to avoid exhausting memory

In [12]:
dataset_diagnoses = 'mimic/DIAGNOSES_ICD.csv'
diagnoses_icd = pd.read_csv(dataset_diagnoses)
print(diagnoses_icd.head())

   ROW_ID  SUBJECT_ID  HADM_ID  SEQ_NUM ICD9_CODE
0    1297         109   172335      1.0     40301
1    1298         109   172335      2.0       486
2    1299         109   172335      3.0     58281
3    1300         109   172335      4.0      5855
4    1301         109   172335      5.0      4254


In [13]:
# Define the relevant itemids for vital signs and FiO2
vital_signs_itemids = [220050, 220052, 220210, 220277, 220045, 223761]  # SBP, MAP, RR, PulOx, HR, Temp
fio2_itemid = [223835]  # FiO2
relevant_itemids = vital_signs_itemids + fio2_itemid

# Load diagnoses to identify sepsis patients
diagnoses_icd = pd.read_csv(dataset_diagnoses, usecols=['SUBJECT_ID', 'HADM_ID', 'ICD9_CODE'])
sepsis_icd9_codes = ['99591', '99592', '78552']  # Sepsis-related ICD-9 codes
# NOTE: the ICD-9 filter is commented out, so sepsis_patients holds every diagnosis row,
# not only sepsis admissions. Re-enable it to restrict to the codes above.
sepsis_patients = diagnoses_icd #[diagnoses_icd['ICD9_CODE'].isin(sepsis_icd9_codes)]

# Create a set of hadm_ids for fast lookup (all admissions while the filter is disabled)
sepsis_hadm_ids = set(sepsis_patients['HADM_ID'].unique())
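
As a minimal sketch of what the ICD-9 filter does when it is applied, the snippet below runs it on a small made-up table. The rows, the variable names `diagnoses`, `sepsis_rows` and `sepsis_admissions`, and the `astype(str)` cast are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for DIAGNOSES_ICD (values are made up for illustration).
diagnoses = pd.DataFrame({
    "SUBJECT_ID": [1, 1, 2, 3],
    "HADM_ID": [100, 100, 200, 300],
    "ICD9_CODE": ["99592", "486", "4254", "78552"],
})

sepsis_codes = ["99591", "99592", "78552"]

# Compare codes as strings so that a numeric-looking code still matches.
sepsis_rows = diagnoses[diagnoses["ICD9_CODE"].astype(str).isin(sepsis_codes)]

# Admissions with at least one sepsis-related code.
sepsis_admissions = set(sepsis_rows["HADM_ID"].tolist())
print(sorted(sepsis_admissions))
```

With the filter applied, only admissions carrying one of the three codes end up in the set; without it, every admission in the table does.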


In [16]:
lab_results_itemids = [51300, 50885, 51006, 50813, 50912, 51265, 51146] # WBC, Bili, BUN, Lac, Creat, Plat, 51146 (labeled Basophils in D_LABITEMS)
labevents = pd.read_csv(dataset_labevents, usecols=['SUBJECT_ID', 'HADM_ID', 'CHARTTIME', 'ITEMID', 'VALUENUM'])
labevents = labevents[labevents['ITEMID'].isin(lab_results_itemids)]
lab_hadm_ids = set(labevents['HADM_ID'].unique())
# add lab hadm_ids to sepsis_hadm_ids; this set is only used to filter CHARTEVENTS, not to label admissions
sepsis_hadm_ids.update(lab_hadm_ids)

In [17]:
# Initialize an empty list to store filtered data
filtered_chartevents = []

# Define the chunk size (number of rows to read at a time)
chunk_size = 100000  # 100k rows per chunk

# Read CHARTEVENTS in chunks
chartevents_chunks = pd.read_csv(dataset_chartevents, usecols=['SUBJECT_ID', 'HADM_ID', 'CHARTTIME', 'ITEMID', 'VALUENUM'], chunksize=chunk_size)

# Process each chunk
for chunk in chartevents_chunks:
    # Filter the chunk based on relevant itemid and sepsis patients' hadm_id
    filtered_chunk = chunk[(chunk['ITEMID'].isin(relevant_itemids)) & (chunk['HADM_ID'].isin(sepsis_hadm_ids))]
    
    # Append the filtered data to the list
    filtered_chartevents.append(filtered_chunk)

# Concatenate all the filtered chunks into a single DataFrame
filtered_chartevents_df = pd.concat(filtered_chartevents, ignore_index=True)
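
The chunked filtering pattern above can be tried without the full CHARTEVENTS file. The sketch below is a toy version that reads a small in-memory CSV two rows at a time; the rows and the names `csv_text`, `wanted_itemids`, `wanted_hadm_ids` and `small_df` are made up for illustration.

```python
import io
import pandas as pd

# Tiny in-memory CSV standing in for CHARTEVENTS (made-up rows).
csv_text = """SUBJECT_ID,HADM_ID,CHARTTIME,ITEMID,VALUENUM
1,100,2100-01-01 00:00:00,220045,80
1,100,2100-01-01 00:00:00,999999,1
2,200,2100-01-02 00:00:00,220210,18
3,300,2100-01-03 00:00:00,220045,95
"""

wanted_itemids = [220045, 220210]
wanted_hadm_ids = {100, 200}

kept = []
# With chunksize set, read_csv returns an iterator of DataFrames.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    mask = chunk["ITEMID"].isin(wanted_itemids) & chunk["HADM_ID"].isin(wanted_hadm_ids)
    kept.append(chunk[mask])

# Only the filtered rows are ever held in memory together.
small_df = pd.concat(kept, ignore_index=True)
print(small_df)
```

Peak memory is bounded by one chunk plus the rows that pass the filter, which is why the chunk size can be tuned independently of the file size.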

In [19]:
# write the filtered chartevents data to a CSV file
filtered_chartevents_df.to_csv('filtered_chartevents.csv', index=False)
# write the filtered labevents data to a CSV file
labevents.to_csv('filtered_labevents.csv', index=False)
# write the sepsis patients data to a CSV file
sepsis_patients.to_csv('sepsis_patients.csv', index=False)

In [10]:
# Load the filtered data csv files and merge them using subject id and hadm id
import pandas as pd
filtered_chartevents = pd.read_csv('filtered_chartevents.csv')
filtered_labevents = pd.read_csv('filtered_labevents.csv')
sepsis_patients = pd.read_csv('sepsis_patients.csv')
# merge the filtered chartevents, sepsis patients and labevents data
#merge_charts = pd.merge(filtered_chartevents,sepsis_patients, on=['SUBJECT_ID', 'HADM_ID'])


In [11]:
print(filtered_chartevents.head())
print(filtered_labevents.head())
print(sepsis_patients.head())


   SUBJECT_ID  HADM_ID  ITEMID            CHARTTIME  VALUENUM
0          36   165660  223835  2134-05-12 12:00:00     100.0
1          36   165660  220045  2134-05-12 13:00:00      86.0
2          36   165660  220210  2134-05-12 13:00:00      21.0
3          36   165660  220277  2134-05-12 13:00:00      93.0
4          36   165660  220045  2134-05-12 14:00:00      85.0
   SUBJECT_ID  HADM_ID  ITEMID            CHARTTIME  VALUENUM
0           3      NaN   50813  2101-10-12 18:17:00       1.8
1           3      NaN   50912  2101-10-13 03:00:00       1.7
2           3      NaN   51006  2101-10-13 03:00:00      33.0
3           3      NaN   50912  2101-10-13 15:47:00       1.5
4           3      NaN   51006  2101-10-13 15:47:00      32.0
   SUBJECT_ID  HADM_ID ICD9_CODE
0         109   172335     40301
1         109   172335       486
2         109   172335     58281
3         109   172335      5855
4         109   172335      4254


In [12]:
# find empty values in the filtered labevents data
print(filtered_labevents.isnull().sum())

SUBJECT_ID         0
HADM_ID       669625
ITEMID             0
CHARTTIME          0
VALUENUM         556
dtype: int64


In [13]:
# find size of filtered labevents data
print(filtered_labevents.shape)

(2967466, 5)


In [14]:
# drop rows with any missing value in the filtered labevents data (mostly lab rows with no HADM_ID)
filtered_labevents = filtered_labevents.dropna()

In [15]:
# find empty values in the filtered chartevents data
print(filtered_chartevents.isnull().sum())

SUBJECT_ID    0
HADM_ID       0
ITEMID        0
CHARTTIME     0
VALUENUM      0
dtype: int64


In [16]:
# find empty values in the sepsis patients data
print(sepsis_patients.isnull().sum())

SUBJECT_ID     0
HADM_ID        0
ICD9_CODE     47
dtype: int64


In [17]:
# drop rows with empty values in the sepsis patients data
sepsis_patients = sepsis_patients.dropna()

In [21]:
# pivot filtered_chartevents data by itemid
filtered_chartevents_pivot = filtered_chartevents.pivot_table(index=['SUBJECT_ID', 'HADM_ID', 'CHARTTIME'], columns='ITEMID', values='VALUENUM', aggfunc='first').reset_index()
filtered_chartevents_pivot.columns

Index(['SUBJECT_ID',    'HADM_ID',  'CHARTTIME',       220045,       220050,
             220052,       220210,       220277,       223761,       223835],
      dtype='object', name='ITEMID')
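
The pivoted columns are raw itemids, which are hard to read later on. One possible way to give them readable names is sketched below; the `ITEM_NAMES` dictionary simply restates the itemid list from the start of this tutorial, and the toy rows are made up for illustration.

```python
import pandas as pd

# itemid -> short name, restating the vital-sign list from this tutorial.
ITEM_NAMES = {
    220045: "HR", 220050: "SBP", 220052: "MAP", 220210: "RR",
    220277: "PulOx", 223761: "Temp", 223835: "FiO2",
}

# Toy long-format frame (made-up values).
long_df = pd.DataFrame({
    "SUBJECT_ID": [1, 1, 1],
    "HADM_ID": [100, 100, 100],
    "CHARTTIME": ["2100-01-01 00:00:00", "2100-01-01 00:00:00", "2100-01-01 01:00:00"],
    "ITEMID": [220045, 220210, 220045],
    "VALUENUM": [80.0, 18.0, 82.0],
})

# One row per (patient, admission, time), one column per itemid.
wide = long_df.pivot_table(
    index=["SUBJECT_ID", "HADM_ID", "CHARTTIME"],
    columns="ITEMID", values="VALUENUM", aggfunc="first",
).reset_index()

# Replace numeric itemid columns with readable names.
wide = wide.rename(columns=ITEM_NAMES)
wide.columns.name = None
print(wide)
```

Measurements that were not charted at a given time appear as NaN in the wide table, which is the source of the large missing-value counts seen after merging.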

In [22]:
# write the filtered_chartevents_pivot data to a CSV file
filtered_chartevents_pivot.to_csv('filtered_chartevents_pivot.csv', index=False)

In [23]:
# pivot filtered_labevents data by itemid
filtered_labevents_pivot = filtered_labevents.pivot_table(index=['SUBJECT_ID', 'HADM_ID', 'CHARTTIME'], columns='ITEMID', values='VALUENUM', aggfunc='first').reset_index()

In [24]:
filtered_labevents_pivot.columns

Index(['SUBJECT_ID',    'HADM_ID',  'CHARTTIME',        50813,        50885,
              50912,        51006,        51146,        51265,        51300],
      dtype='object', name='ITEMID')

In [25]:
# write the filtered_labevents_pivot data to a CSV file
filtered_labevents_pivot.to_csv('filtered_labevents_pivot.csv', index=False)

In [26]:
# count filtered_labevents_pivot data and filtered_chartevents_pivot data
print(filtered_labevents_pivot.shape)
print(filtered_chartevents_pivot.shape)

(905872, 10)
(3036046, 10)


In [27]:
# combine the lab and chart pivots with an outer join on subject id, hadm id and charttime
combined_data = pd.merge(filtered_labevents_pivot, filtered_chartevents_pivot, on=['SUBJECT_ID', 'HADM_ID', 'CHARTTIME'], how='outer')
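
Because labs and vitals are rarely recorded at exactly the same timestamp, an outer join on CHARTTIME leaves most rows filled on one side only. The toy sketch below uses the `indicator=True` option of `pd.merge` to show where each row came from; the frames `labs` and `vitals` and their values are made up for illustration.

```python
import pandas as pd

keys = ["SUBJECT_ID", "HADM_ID", "CHARTTIME"]

# Made-up lab and vital-sign rows for one admission.
labs = pd.DataFrame({
    "SUBJECT_ID": [1, 1], "HADM_ID": [100, 100],
    "CHARTTIME": ["2100-01-01 00:00:00", "2100-01-01 06:00:00"],
    "Lac": [1.8, 2.1],
})
vitals = pd.DataFrame({
    "SUBJECT_ID": [1, 1], "HADM_ID": [100, 100],
    "CHARTTIME": ["2100-01-01 00:00:00", "2100-01-01 01:00:00"],
    "HR": [80.0, 82.0],
})

# indicator=True adds a _merge column: both, left_only or right_only.
merged = pd.merge(labs, vitals, on=keys, how="outer", indicator=True)
print(merged["_merge"].value_counts())
```

Only the timestamp shared by both tables is marked `both`; the other rows carry NaN for the side that had no measurement at that time.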


In [28]:
# write the combined data to a CSV file
combined_data.to_csv('combined_data.csv', index=False)

In [31]:
# count the empty values in the combined data
print(combined_data.isnull().sum())


ITEMID
SUBJECT_ID          0
HADM_ID             0
CHARTTIME           0
50813         3753181
50885         3757307
50912         3311643
51006         3313886
51146         3821674
51265         3332301
51300         3930540
220045        1168913
220050        2781350
220052        2774965
220210        1194033
220277        1259322
223761        3408995
223835        3373077
SEPSIS              0
dtype: int64


In [34]:
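icd9_code_sepsis = ['99591', '99592', '78552']  # defined but not applied below

# get unique (SUBJECT_ID, HADM_ID) pairs from sepsis_patients
# NOTE: sepsis_patients was saved without the ICD-9 filter, so nearly every admission is flagged
sepsis_hadm_ids = sepsis_patients[['SUBJECT_ID', 'HADM_ID']].drop_duplicates()

# label each row 1 if its (SUBJECT_ID, HADM_ID) pair appears in sepsis_hadm_ids, else 0
combined_data['SEPSIS'] = combined_data[['SUBJECT_ID', 'HADM_ID']].apply(tuple, axis=1).isin(sepsis_hadm_ids.apply(tuple, axis=1)).astype(int)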
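
Building a tuple per row with `apply` can be slow on millions of rows. A possible alternative, sketched here on made-up data, attaches the label with a left merge; the frames `combined` and `flagged` are hypothetical stand-ins for `combined_data` and the list of flagged admissions.

```python
import pandas as pd

# Made-up measurement rows for two admissions.
combined = pd.DataFrame({
    "SUBJECT_ID": [1, 1, 2],
    "HADM_ID": [100, 100, 200],
    "HR": [80.0, 82.0, 90.0],
})
# Admissions flagged as sepsis (toy values).
flagged = pd.DataFrame({"SUBJECT_ID": [1], "HADM_ID": [100]})

# Give every flagged admission the label 1, then left-join it onto the rows.
flagged = flagged.drop_duplicates().assign(SEPSIS=1)
labeled = combined.merge(flagged, on=["SUBJECT_ID", "HADM_ID"], how="left")

# Rows with no match get NaN from the join; treat them as label 0.
labeled["SEPSIS"] = labeled["SEPSIS"].fillna(0).astype(int)
print(labeled)
```

The left join keeps every measurement row and yields the same 0/1 labels as the tuple comparison, provided the flagged table has one row per admission.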


In [35]:
# count rows per SEPSIS label (these are measurement rows, not patients)
print(combined_data['SEPSIS'].value_counts())

SEPSIS
1    3930988
0        150
Name: count, dtype: int64


In [36]:
# write the combined data to a CSV file
combined_data.to_csv('combined_data_with_labels.csv', index=False)  

In [39]:
# convert combined data to training data for below model
combined_data.columns.size
combined_data.columns

Index(['SUBJECT_ID',    'HADM_ID',  'CHARTTIME',        50813,        50885,
              50912,        51006,        51146,        51265,        51300,
             220045,       220050,       220052,       220210,       220277,
             223761,       223835,     'SEPSIS'],
      dtype='object', name='ITEMID')

In [40]:
# load combined data with labels
combined_data = pd.read_csv('combined_data_with_labels.csv')
# Forward-fill missing data; fillna(method='ffill') is deprecated in recent pandas, so ffill() is called directly.
# As written this fills down the whole table, so values can carry over from one admission to the next.
combined_data.ffill(inplace=True)
combined_data.fillna(0, inplace=True)  # Fill remaining NaNs with 0
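
A forward-fill over the whole table lets the last value of one admission fill the first rows of the next. The sketch below contrasts that with a per-admission fill on a made-up frame; the column names `HR_global` and `HR_grouped` are introduced only for this comparison.

```python
import numpy as np
import pandas as pd

# Two admissions; the second one starts with a missing heart rate.
df = pd.DataFrame({
    "HADM_ID": [100, 100, 200, 200],
    "CHARTTIME": ["2100-01-01 00:00:00", "2100-01-01 01:00:00",
                  "2100-02-01 00:00:00", "2100-02-01 01:00:00"],
    "HR": [80.0, np.nan, np.nan, 95.0],
})

df = df.sort_values(["HADM_ID", "CHARTTIME"])

# Table-wide fill: admission 200 inherits 80.0 from admission 100.
df["HR_global"] = df["HR"].ffill()

# Per-admission fill: the gap at the start of admission 200 stays NaN.
df["HR_grouped"] = df.groupby("HADM_ID")["HR"].ffill()
print(df)
```

Whether the remaining NaN values should then become 0, a population median, or an explicit missing-value indicator is a modeling choice that this sketch leaves open.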


In [66]:
# find count of empty values in the combined data
print(combined_data.isnull().sum())

SUBJECT_ID    0
HADM_ID       0
CHARTTIME     0
50813         0
50885         0
50912         0
51006         0
51146         0
51265         0
51300         0
220045        0
220050        0
220052        0
220210        0
220277        0
223761        0
223835        0
SEPSIS        0
dtype: int64


In [41]:
import torch
from torch.nn.utils.rnn import pad_sequence
from sklearn.model_selection import train_test_split
import pandas as pd
grouped = combined_data.groupby(['SUBJECT_ID', 'HADM_ID'])

# Prepare the sequences for the LSTM model
X_sequences = []
y_labels = []

for (subject_id, hadm_id), group in grouped:
    # Drop 'SUBJECT_ID', 'HADM_ID', 'CHARTTIME' to keep only the feature columns (ITEMIDs)
    X_seq = torch.tensor(group.drop(columns=['SUBJECT_ID', 'HADM_ID', 'CHARTTIME', 'SEPSIS']).values, dtype=torch.float32)
    
    # Append the sequence to the list
    X_sequences.append(X_seq)
    
    # Get the label for this admission (assuming 'SEPSIS' is the same for the entire group)
    y_label = group['SEPSIS'].values[0]  # Assumes SEPSIS is consistent within an HADM_ID
    y_labels.append(torch.tensor(y_label, dtype=torch.float32))

In [63]:
print(X_sequences[0].shape)

print(len(X_sequences))
print(len(y_labels))
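print(y_labels.shape)  # raises AttributeError: y_labels is still a Python list (it is converted to a tensor in the next cell)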

torch.Size([3, 14])


AttributeError: 'list' object has no attribute 'shape'

In [45]:
# Stack the list of 0-d label tensors into one 1-D float tensor
y_labels = torch.stack(y_labels).to(torch.float32)

In [69]:
sequence_lengths = [len(seq) for seq in X_sequences]
print(f"Max sequence length: {max(sequence_lengths)}")
print(f"Min sequence length: {min(sequence_lengths)}")
print(f"Average sequence length: {sum(sequence_lengths) / len(sequence_lengths)}")

Max sequence length: 44020
Min sequence length: 1
Average sequence length: 67.64062768849581


In [68]:
# Pad the sequences
max_seq_length = 100  # Adjust as needed

# Truncate sequences to the maximum length
X_sequences_truncated = [seq[:max_seq_length] for seq in X_sequences]

# Pad the truncated sequences
X_padded = pad_sequence(X_sequences_truncated, batch_first=True)
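
After zero-padding, the last time step of a short sequence is a padded position rather than a real measurement. One common way to read the LSTM state at each sequence's true last step is `pack_padded_sequence`, sketched below on random data. The tensor sizes, the single-layer LSTM and `hidden_size=8` are arbitrary choices for illustration, not the settings used in this tutorial.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence

torch.manual_seed(0)

# Three made-up sequences of different lengths, 14 features each.
seqs = [torch.randn(3, 14), torch.randn(5, 14), torch.randn(1, 14)]
lengths = torch.tensor([len(s) for s in seqs])

padded = pad_sequence(seqs, batch_first=True)  # shape (3, 5, 14), zero-padded

lstm = nn.LSTM(input_size=14, hidden_size=8, batch_first=True)

# Packing tells the LSTM where each sequence really ends.
packed = pack_padded_sequence(padded, lengths, batch_first=True, enforce_sorted=False)
_, (h_n, _) = lstm(packed)

# h_n[-1] is the top layer's hidden state at each sequence's last real step.
last_hidden = h_n[-1]
print(padded.shape, last_hidden.shape)
```

The vector in `last_hidden` could be passed to a linear layer in place of `out[:, -1, :]`, so that padding does not influence the prediction.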


In [46]:
# define LSTM model
import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split
from torch.utils.data import DataLoader, TensorDataset

class LSTMModel(nn.Module):
    def __init__(self, input_size, hidden_size, output_size, num_layers=2):
        super(LSTMModel, self).__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, output_size)
    
    def forward(self, x):
        # Initialize hidden state and cell state
        h0 = torch.zeros(self.num_layers, x.size(0), self.hidden_size).to(x.device)
        c0 = torch.zeros(self.num_layers, x.size(0), self.hidden_size).to(x.device)

        # Forward propagate LSTM
        out, _ = self.lstm(x, (h0, c0))

        # Decode the output at the last time step (for zero-padded sequences this can be a padded position)
        out = self.fc(out[:, -1, :])
        return out

In [47]:
# Define the hyperparameters
input_size = 14  # 7 lab itemids + 6 vital-sign itemids + FiO2
hidden_size = 64
output_size = 1  # Binary classification (SEPSIS label: 1 or 0)
num_layers = 2
num_epochs = 20
batch_size = 64
learning_rate = 0.001

In [71]:
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X_padded, y_labels, test_size=0.2, random_state=42)

In [56]:
y_train

tensor([1., 1., 1.,  ..., 1., 1., 1.])

In [72]:

# Create DataLoader for training and testing

train_dataset = TensorDataset(X_train, y_train)
test_dataset = TensorDataset(X_test, y_test)

train_loader = DataLoader(dataset=train_dataset, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(dataset=test_dataset, batch_size=batch_size, shuffle=False)

# Initialize the model, loss function, and optimizer
model = LSTMModel(input_size, hidden_size, output_size, num_layers)
criterion = nn.BCEWithLogitsLoss()  # Binary Cross-Entropy with logits
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

# Select the device and move the model to it
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)



LSTMModel(
  (lstm): LSTM(14, 64, num_layers=2, batch_first=True)
  (fc): Linear(in_features=64, out_features=1, bias=True)
)

In [74]:
for epoch in range(num_epochs):
    model.train()
    running_loss = 0.0
    for X_batch, y_batch in train_loader:
        X_batch, y_batch = X_batch.to(device), y_batch.to(device)

        # Forward pass
        outputs = model(X_batch)
        outputs = outputs.squeeze(dim=1)
        loss = criterion(outputs, y_batch)

        # Backward pass and optimization
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        running_loss += loss.item()

# This print sits outside the loop, so only the final epoch's loss is reported
print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {running_loss/len(train_loader):.4f}')


Epoch [20/20], Loss: 0.0040


In [75]:
# Evaluate the model on the held-out test set

model.eval()
y_true = []
y_pred = []

with torch.no_grad():
    for X_batch, y_batch in test_loader:
        X_batch = X_batch.to(device)
        outputs = model(X_batch)
        preds = torch.sigmoid(outputs).squeeze(dim=1).cpu().numpy()  # Sigmoid gives probabilities; squeeze to shape (batch,)
        y_pred.extend(preds)
        y_true.extend(y_batch.numpy())

y_pred = [1 if p > 0.5 else 0 for p in y_pred]  # Convert probabilities to binary labels
accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
# NOTE: AUC is computed from the thresholded labels here; it is normally computed from the probabilities
auc = roc_auc_score(y_true, y_pred)

print(f'Accuracy: {accuracy:.4f}')
print(f'Precision: {precision:.4f}')
print(f'Recall: {recall:.4f}')
print(f'F1 Score: {f1:.4f}')
print(f'AUC: {auc:.4f}')

Accuracy: 0.9995
Precision: 0.9995
Recall: 1.0000
F1 Score: 0.9997
AUC: 0.5000
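
A recall of 1.0 together with an AUC of 0.5 is the pattern produced when every test row is predicted positive and the AUC is computed from those hard labels. The toy sketch below, with made-up labels and probabilities, shows how the same predictions give a different AUC once the probabilities are passed to `roc_auc_score` instead.

```python
from sklearn.metrics import roc_auc_score

# Made-up example: three positives, one negative, all scored above 0.5.
y_true = [1, 1, 1, 0]
probs = [0.99, 0.97, 0.95, 0.90]

# Thresholding at 0.5 turns every prediction into 1.
hard = [1 if p > 0.5 else 0 for p in probs]

# Constant hard labels cannot rank anything, so AUC falls to chance level.
auc_hard = roc_auc_score(y_true, hard)

# The probabilities still rank the negative below every positive.
auc_prob = roc_auc_score(y_true, probs)
print(auc_hard, auc_prob)
```

With labels as imbalanced as those counted earlier in this tutorial, accuracy and F1 say little on their own, so a ranking metric computed from probabilities is a more informative check.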
