# PREDICTING SEPSIS RISK DURING IN-PATIENT ADMISSIONS 
*Client: Royal Perth Hospital*

*Team: Group 7*

# Feature Selection



In this study, we aim to identify which medical tests, markers, and parameters commonly encountered in clinical settings are causally related to sepsis. Comorbidities, that is, conditions correlated with sepsis, are identified through a review of the medical literature and through centrality analysis in Neo4j. These conditions are then validated using the local large language model (LLM) hosted by Royal Perth Hospital (RPH).

Conditions identified can have the following relation with sepsis:
- Causal relation to sepsis
- Consequence of sepsis 
- Risk factor for sepsis
- Type of infection that can lead to sepsis
- Indirect relationship with sepsis

For comorbidities that have a causal relation with sepsis, we identify the associated laboratory tests (predictors) and again confirm their relationship with sepsis via the LLM. We then attempt to match the predictors documented in the medical literature with the data in our collection. A challenge is that a single predictor label is often tied to several ITEM_IDs within our dataset. Take 'lymphocytes', a critical predictor of immune health, as a case in point: we identify five distinct ITEM_IDs for it within the lab events description table. The reason is that lymphocytes can be measured in various fluids, such as blood or urine, and each fluid and method corresponds to a different ITEM_ID in our system.

Given the importance of blood and urine in sepsis prediction, we filter for tests whose fluid type is blood or urine. However, for some vague labels (for example, 'Haemoglobin'), it is still not clear which value we should rely upon: we have ITEM_IDs related to both 'Haemoglobin, total' and 'Haemoglobin'. We verify the relevance of these labels and their fluid types in three ways: against the language model, manually based on their count in the dataset, and through confirmation with clinical experts.
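As a minimal sketch of the fluid filter described above, the snippet below narrows the candidate ITEM_IDs for a single label to blood and urine. The four rows are copied from the pH lookup shown later in this notebook; the helper name `candidate_item_ids` and its `fluids` parameter are illustrative assumptions, not part of the project code.

```python
import pandas as pd

# Toy copy of the lab item description rows for the label 'pH' (values from this notebook)
df_ph = pd.DataFrame({
    'ITEMID': [50820, 50831, 51094, 51491],
    'LABEL': ['pH', 'pH', 'pH', 'pH'],
    'FLUID': ['Blood', 'Other Body Fluid', 'Urine', 'Urine'],
    'CATEGORY': ['Blood Gas', 'Blood Gas', 'Chemistry', 'Hematology'],
})

def candidate_item_ids(df_desc, label, fluids=('BLOOD', 'URINE')):
    """Hypothetical helper: ITEMIDs for `label`, restricted to the given fluids."""
    label_match = df_desc['LABEL'].str.upper() == label.upper()
    fluid_match = df_desc['FLUID'].str.upper().isin(fluids)
    return df_desc.loc[label_match & fluid_match, 'ITEMID'].tolist()

ph_ids = candidate_item_ids(df_ph, 'pH')
print(ph_ids)
```

Three candidates remain for pH even after filtering, which is why the remaining ambiguity still has to be resolved through the LLM, event counts, and clinical review.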

## 1. Initiating the LLM

We use the RPH local LLM to establish causality. Specifically, we are interested in identifying which symptoms cause sepsis.

In this section, we:
- Load data and libraries
- Initialise functions to connect to the OpenAI language model, for testing prompts when we cannot access the RPH server
- Initialise functions to connect to the RPH LLM
- Engineer prompts, largely through trial and error 
- Test the LLM with symptoms and get their relation to sepsis

### 1.1 Data Preparation

In [8]:
import pandas as pd

import warnings
warnings.filterwarnings('ignore')

from importlib import reload
from pathlib import Path
import os
import sys
sys.path.append('..')  # Replace with the actual path to ROOT

sys.path.append(os.path.abspath(os.path.join('../src')))
from src import utils as utils
reload(utils)

# Set OS agnostic pathnames
ROOT_DIR = Path('..')

In [2]:
# Load relevant lab events data and descriptions 
df_lab = pd.read_csv(Path(ROOT_DIR / 'data' / 'LABEVENTS.csv'))
df_lab_desc = pd.read_csv(Path(ROOT_DIR / 'data' / 'D_LABITEMS.csv'))
df_items_desc = pd.read_csv(Path(ROOT_DIR / 'data' / 'D_ITEMS.csv'))

### 1.2 Functions to connect to OpenAI language model 

In [4]:
# Initialise OpenAI language functions
import openai
from dotenv import load_dotenv

# Load .env file into the script environment
load_dotenv()

# Retrieve the OpenAI API key from the environment variable
openai.api_key = os.getenv("OPENAI_API_KEY")

def run(prompt, test):
    """Fill the prompt template with `test` and return the OpenAI completion text."""
    response = openai.Completion.create(
        model="gpt-3.5-turbo-instruct",
        top_p=0.1,
        frequency_penalty=0,
        presence_penalty=0,
        seed=0,
        prompt=prompt.replace("{test}", test),
        max_tokens=250
    )
    text = response.choices[0].text.strip()
    return text

### 1.3 Functions to connect to RPH LLM

In [3]:
# Initialise requirements to connect to RPH LLM
import requests

# The local API is served over plain HTTP (no SSL), hence http://
HOST = 'localhost:5000'
URI = f'http://{HOST}/api/v1/generate'

def run(prompt, test):
    """Fill the prompt template with `test` and return the RPH LLM response text."""
    request = {
        'prompt': prompt.replace("{test}", test),
        'max_new_tokens': 250,
        'temperature': 0.0,
        'top_p': 0.1,
        'length_penalty': 5,
        'early_stopping': True,
        'seed': 0,
    }

    response = requests.post(URI, json=request)
    text = response.json()['results'][0]['text']
    print(text)

    return text
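Because the request needs a running server, it can help to build the payload separately so that the prompt substitution can be checked offline. This is only a sketch: `build_request` is a hypothetical helper, and the parameter names and values are simply those used in `run` above.

```python
def build_request(prompt, test, max_new_tokens=250):
    """Hypothetical helper: build the JSON payload sent to the RPH LLM."""
    return {
        'prompt': prompt.replace('{test}', test),
        'max_new_tokens': max_new_tokens,
        'temperature': 0.0,
        'top_p': 0.1,
        'length_penalty': 5,
        'early_stopping': True,
        'seed': 0,
    }

# Inspect the payload without contacting the server
example_prompt = '[INST]\nInput\n{test}\n[/INST]'
payload = build_request(example_prompt, 'A: uti, B: sepsis')
print(payload['prompt'])
```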

### 1.4 Engineer Prompts 

In [5]:
prompt_causality = '''
[INST]
You are a medical expert. You are tasked to identify causality between two conditions. 
Return yes if A causes B. If B causes A or if you are not sure, return no. 
Keep your responses short and succinct. 
Input
{test}
[/INST]
'''

In [9]:
prompt_comorbidities = '''
[INST]
You are a medical expert. You are tasked to identify comorbidities or existing health conditions that could increase the risk of developing sepsis.
Return Yes if A is a comorbidity of sepsis. Return Yes if A increases the risk of developing sepsis. If A is not a comorbidity of sepsis, if A does not increase the risk of developing sepsis, or if you are not sure, return no.
Keep your responses short and succinct.
Input
{test}
[/INST]
'''

### 1.5 Testing the LLM with symptoms

In [5]:
# Initialise a testing dataframe with symptoms
symptom = ['uti', 
           'low oxygen', 
           'low neutrophils', 
           'low BP', 
           'high BP', 
           'age>50', 
           'pregnancy', 
           'age>70']
df_test = pd.DataFrame(symptom, columns=['symptom'])

In [16]:
# Get relation by checking symptoms against model
df_test['relation'] = df_test.apply(lambda x: run(prompt_causality, f"A: {x['symptom']}, B: sepsis"), axis=1)

In [7]:
# Create short result of relation
df_test['relation_short'] = df_test['relation'].apply(lambda x: 'yes' if 'yes' in x.lower() else 'no')

In [133]:
df_test

Unnamed: 0,symptom,relation
0,uti,Yes
1,low oxygen,No
2,low neutrophils,Yes
3,low BP,No.
4,high BP,No
5,age>50,Yes
6,pregnancy,No
7,age>70,Yes
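The substring check used for `relation_short` labels any response containing the letters 'yes' as positive, including a reply such as 'No, but yes in rare cases'. A stricter alternative, sketched below under the assumption that the model leads with its verdict, reads only the first word. The function name `parse_yes_no` is hypothetical.

```python
import re

def parse_yes_no(response):
    """Hypothetical helper: return 'yes' only if the reply starts with 'yes'."""
    # Grab the first run of letters, ignoring leading whitespace and punctuation
    match = re.search(r'[A-Za-z]+', response)
    first_word = match.group(0).lower() if match else ''
    return 'yes' if first_word == 'yes' else 'no'

for reply in ['Yes', 'No.', ' yes, A can cause B', 'No, but yes in rare cases']:
    print(repr(reply), '->', parse_yes_no(reply))
```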


## 2. Identifying Comorbidities with Sepsis

For patients with sepsis, we first identify which other diseases are correlated with sepsis. We then check whether each of these comorbidities is attributed to causing sepsis. If it is, we use it as an input disease to identify the biological indicators related to that causal disease.

Comorbidities are identified in three ways:

1. With RPH LLM
2. With prior research references
3. Through centrality using Neo4j

### 2.1 Comorbidities Identified with RPH LLM

In [110]:
prompt_test = '''
[INST]
You are a medical researcher with expertise in diagnosing diseases. 
Your task is to identify what biological markers, demographic data, or symptoms are most crucial for diagnosing a specific disease.
List only the most crucial medical tests, markers, or parameters that are essential for diagnosing A. Please keep only the most crucial indicators. 
Keep your response short and succinct. Return your response in an array format.
[/INST]
'''

In [90]:
prompt_crucial_indicators = '''
[INST]
Your task is to identify what laboratory test measurements are most crucial for diagnosing a specific disease.
List only the most crucial medical laboratory test measurements that are essential for diagnosing A.
Return only the short label or name for the lab test.
Return your response in an array format.
Input
{test}
[/INST]
'''

### 2.2 Comorbidities Identified through Existing Research

The following comorbidities were identified in this Targeted Real-time Early Warning System (TREWS) for Sepsis article:

[Prospective, multi-site study of patient outcomes after implementation of the TREWS machine learning-based early warning system for sepsis](https://www.nature.com/articles/s41591-022-01894-0)

In [76]:
comorbidities = ['metastatic cancer', 
                 'end-stage renal disease', 
                 'congestive heart failure', 
                 'acute liver disease', 
                 'gastrointestinal bleeding', 
                 'chronic obstructive pulmonary disease', 
                 'diabetes', 
                 'urinary tract infection']

df_comorb_research = pd.DataFrame(comorbidities, columns=['comorbidities'])

#### 2.2.1 Indicators for comorbidities: Identified through existing research

In [120]:
df_comorb_research['indicators'] = df_comorb_research.apply(lambda x: run(prompt_crucial_indicators, f"A: {x['comorbidities']}"), axis=1)

In [102]:
import re

def ensure_list_format(result):
    """
    Ensure the result is a list. If it's not, try to convert it into a list.

    Args:
    result: The result to check or convert.

    Returns:
    list: A list, either the original result (if it was already a list) or the result converted to a list.
    """
    if isinstance(result, list):
        return result
    elif isinstance(result, str):
        # If the string is a representation of a list
        if result.startswith("[") and result.endswith("]"):
            # Remove brackets and quotes, then split into a list
            result = result.strip('[]')
            result_list = [item.strip(' "').strip("'") for item in result.split(',')]
            return result_list
        else:
            # If the string is not a representation of a list,
            # split by newlines and remove numbers at the start of each item
            items = result.split('\n')
            items = [re.sub(r'^\d+\.\s*', '', item).strip() for item in items]
            return items
    else:
        return [] # handle error

In [121]:
# Ensure the 'indicators' column is a list or convert strings to lists
df_comorb_research['indicators'] = df_comorb_research['indicators'].apply(ensure_list_format)

In [122]:
df_comorb_research

Unnamed: 0,comorbidities,indicators
0,metastatic cancer,"[biopsy, CT scan, MRI, PET scan, blood test, u..."
1,end-stage renal disease,"[BUN, Creatinine, GFR, Urine albumin, Urine pr..."
2,congestive heart failure,"[BNP, Troponin, Echocardiogram, Chest X-ray, E..."
3,acute liver disease,"[ALT, AST, Bilirubin, Albumin, Prothrombin Time]"
4,gastrointestinal bleeding,"[Complete Blood Count (CBC), Stool Occult Bloo..."
5,chronic obstructive pulmonary disease,"[FEV1, FVC, FEV1/FVC, DLCO, ABG, Chest X-ray]"
6,diabetes,"[Glucose, Hemoglobin A1c, C-peptide, Insulin, ..."
7,urinary tract infection,"[urinalysis, urine culture]"
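LLM output does not always follow the requested array format, so parsing benefits from a fallback. The sketch below is one possible variant of the approach above, not the project's implementation: it first tries `json.loads` and, failing that, splits on newlines and strips list numbering. The name `parse_llm_list` is hypothetical.

```python
import json
import re

def parse_llm_list(text):
    """Hypothetical helper: turn an LLM reply into a list of strings."""
    text = text.strip()
    try:
        parsed = json.loads(text)
        if isinstance(parsed, list):
            return [str(item).strip() for item in parsed]
    except json.JSONDecodeError:
        pass  # not valid JSON, fall back to line-based parsing
    # Split on newlines, then drop numbering such as '1. ' or bullets such as '- '
    lines = [re.sub(r'^\s*(\d+\.|[-*])\s*', '', line).strip() for line in text.split('\n')]
    return [line for line in lines if line]

print(parse_llm_list('["ALT", "AST", "Bilirubin"]'))
print(parse_llm_list('1. BUN\n2. Creatinine\n\n3. GFR'))
```

Single-quoted lists are not valid JSON, so a reply in that style would still need the bracket-stripping branch of `ensure_list_format`.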


### 2.3 Comorbidities Identified via Neo4j Centrality Results

The conditions listed, represented by their ICD-9 codes and short titles, are either potential risk factors for developing sepsis, consequences of sepsis, or, in some cases, different manifestations or coding of sepsis itself.

In [123]:
# Load comorbidities identified in Neo4j
df_comorb_neo4j = pd.read_csv(ROOT_DIR / 'data' / 'report_commorbs_long.csv')

In [None]:
# Get relation for first 50 rows
df_comorb_neo4j['relation'] = df_comorb_neo4j.iloc[:50].apply(lambda x: run(prompt_causality, f"A: {x['LONG_TITLE']}, B: Sepsis"), axis=1)

In [None]:
# Create short result of relation
df_comorb_neo4j['relation'] = df_comorb_neo4j.iloc[:50]['relation'].apply(lambda x: 'yes' if 'yes' in x.lower() else 'no')

In [134]:
# Get risk factor indicator
df_comorb_neo4j['risk_factor'] = df_comorb_neo4j.iloc[:50].apply(lambda x: run(prompt_comorbidities, f"A: {x['LONG_TITLE']}, B: Sepsis"), axis=1)

In [135]:
df_comorb_neo4j.iloc[:50]

Unnamed: 0,ICD9_CODE,LONG_TITLE,SEPSIS_COUNT,SEPSIS_percentage,NONSEPSIS_COUNT,NON_SEPSIS_percentage,relation,risk_factor
0,0389,Unspecified septicemia,3102,59.93,604,1.32,yes,Yes
1,78552,Septic shock,2442,47.18,143,0.31,yes,Yes
2,5849,"Acute kidney failure, unspecified",2106,40.69,7006,15.37,yes,Yes
3,51881,Acute respiratory failure,1963,37.93,5529,12.13,yes,Yes
4,4019,Unspecified essential hypertension,1750,33.81,18932,41.53,no,Yes
5,4280,"Congestive heart failure, unspecified",1745,33.71,11352,24.9,yes,Yes
6,42731,Atrial fibrillation,1605,31.01,11281,24.74,no,Yes
7,5990,"Urinary tract infection, site not specified",1329,25.68,5218,11.45,yes,Yes
8,2762,Acidosis,1168,22.57,3343,7.33,yes,Yes
9,25000,Diabetes mellitus without mention of complicat...,1069,20.65,7988,17.52,no,Yes
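One way to read the table above is to compare how often each code appears in sepsis versus non-sepsis admissions. The sketch below computes that ratio for a few rows copied from the output; the column name `prevalence_ratio` and the idea of ranking by it are illustrative assumptions rather than part of the Neo4j analysis.

```python
import pandas as pd

# A few rows copied from the output above
df_rows = pd.DataFrame({
    'LONG_TITLE': ['Unspecified septicemia', 'Septic shock',
                   'Unspecified essential hypertension', 'Acidosis'],
    'SEPSIS_percentage': [59.93, 47.18, 33.81, 22.57],
    'NON_SEPSIS_percentage': [1.32, 0.31, 41.53, 7.33],
})

# A ratio above 1 means the code is more common among sepsis admissions
df_rows['prevalence_ratio'] = (
    df_rows['SEPSIS_percentage'] / df_rows['NON_SEPSIS_percentage']
)
df_ranked = df_rows.sort_values('prevalence_ratio', ascending=False)
print(df_ranked[['LONG_TITLE', 'prevalence_ratio']])
```

Hypertension is common in both groups, so its ratio falls below 1 even though about a third of sepsis admissions carry the code.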


## 3. Causal Relationships Identified in Existing Literature

In this section, we locate in our data the ITEM_IDs that correspond to the predictors reported in the literature.

Variables that are engineered are labeled as "Derived".

### 3.1 First Reference - Ref001

The first paper analysed is: 

[Comparison of different machine learning algorithms to classify patients suspected of having sepsis infection in the intensive care unit](https://doi.org/10.1016/j.imu.2023.101236)

In [8]:
data_ref001 = {
    'Predictor': [
        'ICULOS', 
        'HospAdmTime', 
        'Age', 
        'Gender',
        'O2Sat', 
        'Resp', 
        'HR', 
        'Temp', 
        'DBP', 
        'MAP', 
        'SBP',
        'FiO2', 
        'pH', 
        'SIRS_Score', 
        'Base Excess', 
        'Glucose (serum)', 
        'PCO2', 
        'WBC', 
        'OXYGEN SATURATION',
        'Potassium', 
        'Calcium', 
        'Hematocrit', 
        'Hemoglobin', 
        'Magnesium', 
        'blood urea nitrogen', 
        'Phosphate', 
        'Bicarbonate',
        'Creatinine', 
        'PLATELET COUNT', 
        'Chloride'
    ],
    'Description': [
        'ICU length of stay (hours since ICU admission)', 
        'The time between hospital and ICU admission', 
        'Age (years)', 
        'Female (0) or male (1)',
        'Pulse oximetry (%)', 
        'Respiration rate (breaths per minute)', 
        'Heart rate (beats per minute)', 
        'Temperature (deg C)',
        'Diastolic BP (mm Hg)', 
        'Mean arterial pressure (mm Hg)', 
        'Systolic BP (mm Hg)',
        'Fraction of inspired oxygen (%)', 
        'A blood pH test is a normal part of a blood gas test or arterial blood gas (ABG) test. It measures how much oxygen and carbon dioxide are in your blood',
        'Four SIRS criteria were defined, namely tachycardia (heart rate >90 beats/min), tachypnea (respiratory rate >20 breaths/min), fever or hypothermia (temperature >38 or <36 °C), and leukocytosis, leukopenia, or bandemia (white blood cells >12,000/mm3, <4,000/mm3 or bandemia ≥10%)',
        'Excess bicarbonate (mmol/L)', 
        'Serum glucose (mg/dL)', 
        'The partial pressure of carbon dioxide from arterial blood (mm Hg). Ranges 35 to 45 mmHg, or 4.7 to 6.0 kPa.', 
        'Leukocyte count (count/L)',
        'Oxygen saturation from arterial blood (%) SaO2', 
        'Potassium (mmol/L)', 
        'Calcium (mg/dL)', 
        'Hematocrit (%)', 
        'Hemoglobin (g/dL)',
        'Magnesium (mmol/dL)', 
        'Blood urea nitrogen (mg/dL)', 
        'Phosphate (mg/dL)', 
        'Bicarbonate (mmol/L)', 
        'Creatinine (mg/dL)',
        'Platelet count (count/mL)', 
        'Chloride (mmol/L)'
    ],
    'Type': [
        'Numeric', 
        'Date', 
        'Numeric', 
        'Categorical',
        'Numeric', 'Numeric', 'Numeric', 'Numeric', 'Numeric', 'Numeric', 'Numeric',
        'Numeric', 'Numeric', 'Numeric', 'Numeric', 'Numeric', 'Numeric', 'Numeric',
        'Numeric', 'Numeric', 'Numeric', 'Numeric', 'Numeric', 'Numeric', 'Numeric',
        'Numeric', 'Numeric', 'Numeric', 'Numeric', 'Numeric'
    ],
    'Table': [
        'ICUSTAYS', 
        'ADMISSIONS', 
        'Derived: Admissions', 
        'ADMISSIONS',
        'CHARTEVENTS', 
        'CHARTEVENTS', 
        'CHARTEVENTS', 
        'CHARTEVENTS', 
        'CHARTEVENTS', 
        'CHARTEVENTS', 
        'CHARTEVENTS',
        'CHARTEVENTS', 
        'LABEVENTS', 
        'Derived', 
        'LABEVENTS', 
        'CHARTEVENTS', 
        'LABEVENTS', 
        'LABEVENTS', 
        'LABEVENTS',
        'LABEVENTS', 
        'LABEVENTS', 
        'LABEVENTS', 
        'LABEVENTS', 
        'LABEVENTS', 
        'LABEVENTS', 
        'LABEVENTS', 
        'LABEVENTS',
        'LABEVENTS', 
        'LABEVENTS', 
        'LABEVENTS'
    ],
    'ITEMID':[
        '', '', '', '', 
        '[646, 220277]', '[618, 220210, 3603, 224689, 614, 651, 224422, 615, 224690]', '[211, 220045]', '', '', '', '',
        '', '50820', '', '', '220621', '', '', '',
        '', '', '', '', '', '', '', '',
        '50912', '', ''
        
    ]
}

df_ref001 = pd.DataFrame(data_ref001)
df_ref001

Unnamed: 0,Predictor,Description,Category,Type,Table,ITEMID
0,ICULOS,ICU length of stay (hours since ICU admission),Demographic data,Numeric,ICUSTAYS,
1,HospAdmTime,The time between hospital and ICU admission,Demographic data,Date,ADMISSIONS,
2,Age,Age (years),Demographic data,Numeric,PATIENTS,
3,Gender,Female (0) or male (1),Demographic data,Categorical,PATIENTS,
4,O2Sat,Pulse oximetry (%),Clinical time series data,Numeric,CHARTEVENTS,"[646, 220277]"
5,Resp,Respiration rate (breaths per minute),Clinical time series data,Numeric,CHARTEVENTS,"[618, 220210, 3603, 224689, 614, 651, 224422, ..."
6,HR,Heart rate (beats per minute),Clinical time series data,Numeric,CHARTEVENTS,"[211, 220045]"
7,Temp,Temperature (deg C),Clinical time series data,Numeric,CHARTEVENTS,
8,DBP,Diastolic BP (mm Hg),Clinical time series data,Numeric,CHARTEVENTS,
9,MAP,Mean arterial pressure (mm Hg),Clinical time series data,Numeric,CHARTEVENTS,
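`SIRS_Score` is marked as 'Derived' because it is computed from other predictors rather than read from a table. A minimal sketch of that derivation follows, using the conventional SIRS thresholds (heart rate above 90 beats/min, respiratory rate above 20 breaths/min, temperature above 38 or below 36 °C, and white cells above 12,000/mm3, below 4,000/mm3, or bands of at least 10%). The function and argument names are assumptions, and missing-value handling is omitted.

```python
def sirs_score(heart_rate, resp_rate, temp_c, wbc_per_mm3, bands_pct=0.0):
    """Count how many of the four SIRS criteria are met (0 to 4)."""
    criteria = [
        heart_rate > 90,                  # tachycardia
        resp_rate > 20,                   # tachypnea
        temp_c > 38.0 or temp_c < 36.0,   # fever or hypothermia
        # leukocytosis, leukopenia, or bandemia
        wbc_per_mm3 > 12000 or wbc_per_mm3 < 4000 or bands_pct >= 10,
    ]
    return sum(criteria)

print(sirs_score(heart_rate=95, resp_rate=22, temp_c=38.5, wbc_per_mm3=13000))
print(sirs_score(heart_rate=80, resp_rate=16, temp_c=37.0, wbc_per_mm3=8000))
```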


In [None]:
# Confirm relation
df_ref001['relation'] = df_ref001.apply(lambda x: run(prompt_causality, f"A: {x['Predictor']}, B: sepsis"), axis=1)
# Create short result of relation
df_ref001['relation_short'] = df_ref001['relation'].apply(lambda x: 'yes' if 'yes' in x.lower() else 'no')

Certain labels from the literature do not appear among the labels in our data; however, they may simply be coded differently. These labels were identified and manually converted to match the corresponding labels in the data.

For labels that do exist, there are often multiple associated ITEMID values, because the same test label is recorded separately for each fluid and category. We therefore need to confirm which fluid and category combination is relevant for each label.
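One simple way to check which of several ITEMIDs carries most of the data is to count how often each appears in the lab events. The sketch below uses a small made-up events table (the ITEMIDs are the four pH codes listed in this notebook, but the rows and counts are invented for illustration), so only the pattern, not the numbers, should be relied on.

```python
import pandas as pd

# Invented lab events: only the ITEMID column matters for this count
df_events = pd.DataFrame({
    'ITEMID': [50820, 50820, 50820, 51491, 51491, 51094, 50831],
})
candidate_ids = [50820, 50831, 51094, 51491]

# Number of recorded events per candidate ITEMID, most frequent first
event_counts = (
    df_events[df_events['ITEMID'].isin(candidate_ids)]['ITEMID']
    .value_counts()
)
print(event_counts)

# ITEMID with the most recorded events among the candidates
most_used = event_counts.idxmax()
```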

In [44]:
# pH has four ITEMIDs, one per fluid and category combination
df_lab_desc[df_lab_desc['LABEL'].str.upper() == 'PH']

Unnamed: 0,ROW_ID,ITEMID,LABEL,FLUID,CATEGORY,LOINC_CODE
147,21,50820,pH,Blood,Blood Gas,11558-4
158,32,50831,pH,Other Body Fluid,Blood Gas,2748-2
420,294,51094,pH,Urine,Chemistry,2756-5
690,691,51491,pH,Urine,Hematology,5803-2


In [299]:
# multiple labels containing 'Platelet'
df_lab_desc[df_lab_desc['LABEL'].str.contains('Platelet', case=False)]

Unnamed: 0,ROW_ID,ITEMID,LABEL,FLUID,CATEGORY,LOINC_CODE,Exists,ref002
566,440,51240,LARGE PLATELETS,Blood,Hematology,34167-7,yes,yes
590,464,51264,PLATELET CLUMPS,Blood,Hematology,40741-1,yes,yes
591,465,51265,PLATELET COUNT,Blood,Hematology,777-3,yes,yes
592,466,51266,PLATELET SMEAR,Blood,Hematology,778-1,yes,yes


In [None]:
df_items_desc[df_items_desc['ITEMID'].isin([618, 220210, 3603, 224689, 614, 651, 224422, 615, 224690])]

### 3.2 Second Reference - Ref002

[Development and Evaluation of a Machine Learning Model for the Early Identification of Patients at Risk for Sepsis](https://doi.org/10.1016/j.annemergmed.2018.11.036)

The features below are those identified in Table 1 of the paper's supplementary data.



In [192]:
data_ref002 = {
    'Predictor': [
        'AGE', 'DOBUTAMINE  (Y/N)', 'DOPAMINE  (Y/N)', 'EPINEPHRINE  (Y/N)', 'NOREPINEPHRINE  (Y/N)',
        'SYSTOLIC BP x AGE', 'SHOCK INDEX (HR/SYSTOLIC BP)', 'SHOCK INDEX x AGE',
        'ANION GAP', 'ALBUMIN', 'ALKALINE PHOSPHATASE', 'ASPARATE AMINOTRANSFERASE', 'BANDS', 'BILIRUBIN',
        'BLOOD UREA NITROGEN', 'CALCIUM', 'CREATININE', 'ESTIMATED GFR', 'GLUCOSE', 'LACTATE',
        'LYMPHOCYTE', 'MONOCYTE', 'NEUTROPHILS', 'PLATELET COUNT', 'WBC',
        'GLASGOW COMA SCALE', 'SUPPLEMENTAL OXYGEN (Y/N)', 'VENTILATOR', 'ALTERED MENTAL STATUS',
        'ABSCESS', 'ACUTE', 'ALTERED', 'BACTEREMIA', 'CELLULITIS', 'CYSTITIS', 'DIABETES',
        'FAILURE', 'LACTIC', 'LEUKOCYTOSIS', 'PNA', 'PNEUMONIA', 'PYELONEPHRITIS', 'RESPIRATORY',
        'SEPSIS', 'SEPTIC', 'UROSEPSIS', 'UTI',
        'HEART RATE', 'MEAN ARTERIAL PRESSURE', 'RESPIRATORY RATE', 'SYSTOLIC BLOOD PRESSURE',
        'OXYGEN SATURATION', 'TEMPERATURE', 'WEIGHT', 'HEIGHT', 'BMI'
    ],
    'Category': [
        'Demographic', 'Medication', 'Medication', 'Medication', 'Medication',
        'Engineered Feature', 'Engineered', 'Engineered',
        'Laboratory Result', 'Laboratory Result', 'Laboratory Result', 'Laboratory Result', 'Laboratory Result', 'Laboratory Result',
        'Laboratory Result', 'Laboratory Result', 'Laboratory Result', 'Laboratory Result', 'Laboratory Result',
        'Laboratory Result', 'Laboratory Result', 'Laboratory Result', 'Laboratory Result', 'Laboratory Result', 'Laboratory Result',
        'Neurological Evaluation', 'Nursing Documentation', 'Nursing Documentation', 'Nursing Documentation',
        'Text-Keyword (ED chief complaint)', 'Text-Keyword (ED chief complaint)', 'Text-Keyword (ED chief complaint)', 'Text-Keyword (ED chief complaint)', 'Text-Keyword (ED chief complaint)', 'Text-Keyword (ED chief complaint)', 'Text-Keyword (ED chief complaint)',
        'Text-Keyword (ED chief complaint)', 'Text-Keyword (ED chief complaint)', 'Text-Keyword (ED chief complaint)', 'Text-Keyword (ED chief complaint)', 'Text-Keyword (ED chief complaint)', 'Text-Keyword (ED chief complaint)', 'Text-Keyword (ED chief complaint)',
        'Text-Keyword (ED chief complaint)', 'Text-Keyword (ED chief complaint)', 'Text-Keyword (ED chief complaint)', 'Text-Keyword (ED chief complaint)',
        'Vital Sign', 'Vital Sign', 'Vital Sign', 'Vital Sign',
        'Vital Sign', 'Vital Sign', 'Physiological', 'Physiological', 'Physiological'
    ],
    'Table': [
        'PATIENTS', 'PRESCRIPTIONS', 'PRESCRIPTIONS', 'PRESCRIPTIONS', 'PRESCRIPTIONS',
        'Derived', 'Derived', 'Derived',
        'LABEVENTS', 'LABEVENTS', 'LABEVENTS', 'LABEVENTS', 'LABEVENTS', 'LABEVENTS',
        'LABEVENTS', 'LABEVENTS', 'LABEVENTS', 'LABEVENTS', 'LABEVENTS', 'LABEVENTS',
        'LABEVENTS', 'LABEVENTS', 'LABEVENTS', 'LABEVENTS', 'LABEVENTS',
        'CHARTEVENTS', 'CHARTEVENTS', 'CHARTEVENTS', 'CHARTEVENTS',
        'NOTEEVENTS', 'NOTEEVENTS', 'NOTEEVENTS', 'NOTEEVENTS', 'NOTEEVENTS', 'NOTEEVENTS', 'NOTEEVENTS',
        'NOTEEVENTS', 'NOTEEVENTS', 'NOTEEVENTS', 'NOTEEVENTS', 'NOTEEVENTS', 'NOTEEVENTS', 'NOTEEVENTS',
        'NOTEEVENTS', 'NOTEEVENTS', 'NOTEEVENTS', 'NOTEEVENTS',
        'CHARTEVENTS', 'CHARTEVENTS', 'CHARTEVENTS', 'CHARTEVENTS',
        'CHARTEVENTS', 'CHARTEVENTS', 'ADMISSIONS', 'ADMISSIONS', 'Derived'
    ]
}

df_ref002 = pd.DataFrame(data_ref002)
df_ref002

Unnamed: 0,Predictor,Category,Table
0,AGE,Demographic,PATIENTS
1,DOBUTAMINE (Y/N),Medication,PRESCRIPTIONS
2,DOPAMINE (Y/N),Medication,PRESCRIPTIONS
3,EPINEPHRINE (Y/N),Medication,PRESCRIPTIONS
4,NOREPINEPHRINE (Y/N),Medication,PRESCRIPTIONS
5,SYSTOLIC BP x AGE,Engineered Feature,Derived
6,SHOCK INDEX (HR/SYSTOLIC BP),Engineered,Derived
7,SHOCK INDEX x AGE,Engineered,Derived
8,ANION GAP,Laboratory Result,LABEVENTS
9,ALBUMIN,Laboratory Result,LABEVENTS
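The three 'Derived' predictors in this reference are simple combinations of age, heart rate, and systolic blood pressure. As a sketch (the function name, argument names, and example values are assumptions; units are years, beats/min, and mm Hg):

```python
def engineered_features(age, heart_rate, systolic_bp):
    """Hypothetical helper: the three derived predictors listed for Ref002."""
    shock_index = heart_rate / systolic_bp
    return {
        'SYSTOLIC BP x AGE': systolic_bp * age,
        'SHOCK INDEX (HR/SYSTOLIC BP)': shock_index,
        'SHOCK INDEX x AGE': shock_index * age,
    }

features = engineered_features(age=70, heart_rate=90, systolic_bp=120)
print(features)
```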


### 3.3 Third Reference - Ref003

These sepsis predictors were scraped from various sources.

In [191]:
data_ref003 = {
'Predictor': [
    'Lactate',
    'Systolic BP', 
    'Diastolic BP',
    'Glasgow Coma Scale',
    'Heart Rate',
    'Respiratory Rate',
    'WBC',
    'WBC',
    'C-Reactive Protein',
    'Blood Culture',
    'Temperature',
    'Platelet Count',
    'Bilirubin',
    'Creatinine',
    'Bicarbonate',
    'Blood Glucose',
    'Urine Culture',
    'ASPARATE AMINOTRANSFERASE', 
    'ALANINE AMINOTRANSFERASE',
    'Troponin',
    'Hemoglobin',
    'INR',
    'PTT',
    'Albumin',
    'D-Dimer',
    'Ferritin',
    'LACTATE DEHYDROGENASE',
    'BLOOD UREA NITROGEN',
    'PaO2', 
    'FiO2', 
    'Age',
    'Comorbidities',
    'Surgery History',
    'Immunosuppression Status',
    'Alcohol Abuse History',
    'Intravenous Drug Use History',
    'Recent Infection or Antibiotic Use',
    'Mechanical Ventilation Status',
    'Central Venous Catheter Use',
    'Urinary Catheter Use',
    'Recent Hospitalization',
    'Chronic Renal Failure',
    'Chronic Liver Disease',
    'Pregnancy Status',
    'Burns or Trauma',
    'NEUTROPHILS',
    'Nutritional Status',
    'BMI',
    'Infection Source',
    'Skin Mottling',
    'Skin Color',
    'Urine Output'
],
'Description': [
    "High Lactate Levels",
    "Low Blood Pressure (Hypotension) ('Systolic BP', 'Diastolic BP')",
    "Low Blood Pressure (Hypotension) ('Systolic BP', 'Diastolic BP')",
    "Altered Mental Status",
    "High Heart Rate (Tachycardia)",
    "Rapid Breathing (Tachypnea)",
    "High WBC Count",
    "Low WBC Count",
    "Elevated CRP",
    "Positive Blood Culture",
    "Fever or Hypothermia",
    "Low Platelet Count",
    "Elevated Bilirubin",
    "Elevated Creatinine",
    "Low Bicarbonate Levels",
    "High Blood Glucose",
    "Positive Urine Culture",
    "Elevated Liver Enzymes (AST, ALT)",
    "Elevated Liver Enzymes (AST, ALT)",
    "Elevated Troponin",
    "Low Hemoglobin",
    "High INR or PTT",
    "High INR or PTT",
    "Low Albumin",
    "Elevated D-Dimer",
    "Elevated Ferritin",
    "Elevated LDH (LACTATE DEHYDROGENASE)",
    "High Blood Urea Nitrogen",
    "Low PaO2/FiO2 Ratio (Partial pressure of oxygen)",
    "Low PaO2/FiO2 Ratio (fraction of inspired oxygen)",
    "Age > 65 years",
    "Comorbidities (e.g., Diabetes, Heart Disease)",
    "Recent Surgery",
    "Immunosuppression",
    "Chronic Alcohol Abuse",
    "Intravenous Drug Use",
    "Recent Infection or Antibiotic Use",
    "Mechanical Ventilation",
    "Central Venous Catheter Use",
    "Urinary Catheter Use",
    "Recent Hospitalization",
    "Chronic Renal Failure",
    "Chronic Liver Disease",
    "Pregnancy",
    "Burns or Trauma",
    "Neutropenia",
    "Malnutrition",
    "Obesity",
    "Community-Acquired vs Hospital-Acquired Infection",
    "Skin Mottling",
    "Cyanosis (Bluish Skin)",
    "Oliguria (Low Urine Output)"
],
'Table': [
    "labevents",  # High Lactate Levels
    "chartevents",  # Low Blood Pressure (Hypotension)
    "chartevents",  # Low Blood Pressure (Hypotension)
    "chartevents",  # Altered Mental Status
    "chartevents",  # High Heart Rate (Tachycardia)
    "chartevents",  # Rapid Breathing (Tachypnea)
    "labevents",  # High WBC Count
    "labevents",  # Low WBC Count
    "labevents",  # Elevated CRP
    "microbiologyevents",  # Positive Blood Culture
    "chartevents",  # Fever or Hypothermia
    "labevents",  # Low Platelet Count
    "labevents",  # Elevated Bilirubin
    "labevents",  # Elevated Creatinine
    "labevents",  # Low Bicarbonate Levels
    "labevents",  # High Blood Glucose
    "microbiologyevents",  # Positive Urine Culture
    "labevents",  # Elevated Liver Enzymes (AST, ALT)
    "labevents",  # Elevated Liver Enzymes (AST, ALT)
    "labevents",  # Elevated Troponin
    "labevents",  # Low Hemoglobin
    "labevents",  # High INR or PTT
    "labevents",  # High INR or PTT
    "labevents",  # Low Albumin
    "labevents",  # Elevated D-Dimer
    "labevents",  # Elevated Ferritin
    "labevents",  # Elevated LDH
    "labevents",  # High BUN
    "chartevents",  # Low PaO2/FiO2 Ratio
    "chartevents",  # Low PaO2/FiO2 Ratio (fraction of inspired oxygen)
    "patients",  # Age > 65 years
    "admissions",  # Comorbidities (e.g., Diabetes, Heart Disease)
    "admissions",  # Recent Surgery
    "admissions",  # Immunosuppression
    "admissions",  # Chronic Alcohol Abuse
    "admissions",  # Intravenous Drug Use
    "admissions",  # Recent Infection or Antibiotic Use
    "chartevents",  # Mechanical Ventilation
    "procedures_icd",  # Central Venous Catheter Use
    "procedures_icd",  # Urinary Catheter Use
    "admissions",  # Recent Hospitalization
    "admissions",  # Chronic Renal Failure
    "admissions",  # Chronic Liver Disease
    "admissions",  # Pregnancy
    "admissions",  # Burns or Trauma
    "labevents",  # Neutropenia
    "admissions",  # Malnutrition
    "admissions",  # Obesity
    "admissions",  # Community-Acquired vs Hospital-Acquired Infection
    "chartevents",  # Skin Mottling
    "chartevents",  # Cyanosis (Bluish Skin)
    "outputevents"  # Oliguria (Low Urine Output)
]
}

df_ref003 = pd.DataFrame(data_ref003)
df_ref003['Table'] = df_ref003['Table'].str.upper()
df_ref003

Unnamed: 0,Predictor,Description,Table
0,Lactate,High Lactate Levels,LABEVENTS
1,Systolic BP,Low Blood Pressure (Hypotension) ('Systolic BP...,CHARTEVENTS
2,Diastolic BP,Low Blood Pressure (Hypotension) ('Systolic BP...,CHARTEVENTS
3,Glasgow Coma Scale,Altered Mental Status,CHARTEVENTS
4,Heart Rate,High Heart Rate (Tachycardia),CHARTEVENTS
5,Respiratory Rate,Rapid Breathing (Tachypnea),CHARTEVENTS
6,WBC,High WBC Count,LABEVENTS
7,WBC,Low WBC Count,LABEVENTS
8,C-Reactive Protein,Elevated CRP,LABEVENTS
9,Blood Culture,Positive Blood Culture,MICROBIOLOGYEVENTS


##### Save references to CSV

In [136]:
reference_directory = ROOT_DIR / 'data' / 'reference'

In [61]:
# Store dataframes in a list
reference_dataframes = [df_ref001, df_ref002, df_ref003]

# Create the directory if it doesn't exist
reference_directory.mkdir(parents=True, exist_ok=True)

In [None]:
# Save each reference DataFrame to its own CSV file (ref_001.csv, ref_002.csv, ...)
for i, df_ref in enumerate(reference_dataframes, start=1):
    csv_file_path = reference_directory / f'ref_{i:03d}.csv'
    df_ref.to_csv(csv_file_path, index=False)

## 4. Filter References by Labevents

Because chart event data (received from ward monitors) is unavailable at RPH, this section investigates only the predictors drawn from lab events.

This involves: 
- Locating lab event ITEM_IDs by their labels within our data
- Verifying the ITEM_IDs found, from a list of potential ITEM_IDs
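The lookup step can be sketched as a case-insensitive match between a predictor name and the `LABEL` column. The rows below are the four platelet entries shown earlier in this notebook; `find_item_ids` and its `exact` flag are hypothetical names, and the project's actual matching may differ.

```python
import pandas as pd

df_desc = pd.DataFrame({
    'ITEMID': [51240, 51264, 51265, 51266],
    'LABEL': ['LARGE PLATELETS', 'PLATELET CLUMPS', 'PLATELET COUNT', 'PLATELET SMEAR'],
})

def find_item_ids(df_desc, predictor, exact=True):
    """Hypothetical helper: ITEMIDs whose LABEL equals (or contains) the predictor."""
    labels = df_desc['LABEL'].str.upper()
    target = predictor.upper()
    if exact:
        mask = labels == target
    else:
        # regex=False so characters such as '(' are matched literally
        mask = labels.str.contains(target, regex=False)
    return df_desc.loc[mask, 'ITEMID'].tolist()

print(find_item_ids(df_desc, 'Platelet Count'))         # exact label match
print(find_item_ids(df_desc, 'Platelet', exact=False))  # every label containing the word
```

The looser match returns candidates that still need the verification described in the second step.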

In [None]:
# Convert all labevent strings to uppercase
df_lab_desc = df_lab_desc.applymap(lambda x: x.upper() if isinstance(x, str) else x)


def get_filtered_reference(table_query, reference):
    """
    Filter a sepsis reference by source table and flag predictors found in the lab item labels.

    Parameters:
    - table_query [str]: the data table name in uppercase (e.g. 'LABEVENTS', 'CHARTEVENTS', 'ADMISSIONS')
    - reference [DataFrame]: sepsis reference with the columns 'Predictor' and 'Table'

    Returns:
    - reference_filtered [DataFrame]: the reference, filtered by the table query. The column 'Exists' represents the predictor's presence in the data.
    - predictors_missing [array]: predictors not in the data.
    """
    # Convert reference string data to uppercase
    reference = reference.applymap(lambda x: x.upper() if isinstance(x, str) else x)

    # Filter by name of table (input); copy so the new column is not written to a view
    reference_filtered = reference[reference['Table'] == table_query].copy()

    # Check whether each predictor exists as a label in the lab item descriptions
    known_labels = set(df_lab_desc['LABEL'])
    reference_filtered['Exists'] = reference_filtered['Predictor'].apply(
        lambda x: 'yes' if x in known_labels else 'no'
    )

    # Predictors that were not found, for debugging
    predictors_missing = reference_filtered.loc[reference_filtered['Exists'] == 'no', 'Predictor'].values

    return reference_filtered, predictors_missing


In [194]:
df_ref001_lab, df_ref001_lab_missing = get_filtered_reference('LABEVENTS', df_ref001)
df_ref002_lab, df_ref002_lab_missing = get_filtered_reference('LABEVENTS', df_ref002)
df_ref003_lab, df_ref003_lab_missing = get_filtered_reference('LABEVENTS', df_ref003)

In [195]:
print(df_ref001_lab_missing)
print(df_ref002_lab_missing)
print(df_ref003_lab_missing)

['CALCIUM' 'BLOOD UREA NITROGEN']
['ASPARATE AMINOTRANSFERASE' 'BLOOD UREA NITROGEN' 'CALCIUM'
 'ESTIMATED GFR' 'LYMPHOCYTE' 'MONOCYTE']
['BLOOD GLUCOSE' 'ASPARATE AMINOTRANSFERASE' 'ALANINE AMINOTRANSFERASE'
 'TROPONIN' 'INR' 'LACTATE DEHYDROGENASE' 'BLOOD UREA NITROGEN']
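Several of the predictors reported as missing are present in the data under a different label; the filtered Ref002 output, for instance, lists 'UREA NITROGEN' in place of 'BLOOD UREA NITROGEN'. A small alias table, sketched below, is one way to record such manual conversions. Only the urea nitrogen entry is taken from this notebook; `LABEL_ALIASES` and `normalise_label` are hypothetical names, and further entries would need clinical confirmation.

```python
# Literature label -> label used in the lab item descriptions (both uppercase)
LABEL_ALIASES = {
    'BLOOD UREA NITROGEN': 'UREA NITROGEN',
}

def normalise_label(predictor):
    """Hypothetical helper: uppercase a predictor and apply any known alias."""
    key = predictor.strip().upper()
    return LABEL_ALIASES.get(key, key)

print(normalise_label('Blood Urea Nitrogen'))
print(normalise_label('Calcium'))
```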


In [125]:
df_ref001_lab

Unnamed: 0,Predictor,Description,Category,Type,Table,ITEMID,Exists,Matching_ITEMID
12,PH,A BLOOD PH TEST IS A NORMAL PART OF A BLOOD GA...,LABORATORY VALUES,NUMERIC,LABEVENTS,,yes,"[50820, 50831, 51094, 51491]"
14,BASE EXCESS,EXCESS BICARBONATE (MMOL/L),LABORATORY VALUES,NUMERIC,LABEVENTS,,yes,[50802]
15,GLUCOSE,SERUM GLUCOSE (MG/DL),LABORATORY VALUES,NUMERIC,LABEVENTS,,yes,"[50809, 50842, 50931, 51014, 51022, 51034, 510..."
16,PCO2,THE PARTIAL PRESSURE OF CARBON DIOXIDE FROM AR...,LABORATORY VALUES,NUMERIC,LABEVENTS,,yes,"[50818, 50830]"
17,WBC,LEUKOCYTE COUNT (COUNT/L),LABORATORY VALUES,NUMERIC,LABEVENTS,,yes,"[51363, 51384, 51439, 51458, 51128, 51300, 515..."
18,OXYGEN SATURATION,OXYGEN SATURATION FROM ARTERIAL BLOOD (%) SAO2,LABORATORY VALUES,NUMERIC,LABEVENTS,,yes,[50817]
19,POTASSIUM,POTASSIAM (MMOL/L),LABORATORY VALUES,NUMERIC,LABEVENTS,,yes,"[50822, 50833, 50847, 50971, 51041, 51057, 510..."
20,CALCIUM,CALCIUM (MG/DL),LABORATORY VALUES,NUMERIC,LABEVENTS,,no,"[51468, 51469, 51470, 50808, 50893, 51029, 510..."
21,HEMATOCRIT,HEMATOCRIT (%),LABORATORY VALUES,NUMERIC,LABEVENTS,,yes,"[51348, 51369, 51422, 51445, 50810, 51115, 512..."
22,HEMOGLOBIN,HEMOGLOBIN (G/DL),LABORATORY VALUES,NUMERIC,LABEVENTS,,yes,"[50805, 50811, 50814, 50852, 50855, 51212, 512..."


In [124]:
df_ref002_lab

Unnamed: 0,Predictor,Category,Table,Exists,Matching_ITEMID
8,ANION GAP,LABORATORY RESULT,LABEVENTS,yes,[50868]
9,ALBUMIN,LABORATORY RESULT,LABEVENTS,yes,"[50835, 50862, 51011, 51019, 51025, 51046, 510..."
10,ALKALINE PHOSPHATASE,LABORATORY RESULT,LABEVENTS,yes,[50863]
11,ASPARATE AMINOTRANSFERASE,LABORATORY RESULT,LABEVENTS,no,[50878]
12,BANDS,LABORATORY RESULT,LABEVENTS,yes,"[51366, 51386, 51441, 51111, 51144, 51344]"
13,BILIRUBIN,LABORATORY RESULT,LABEVENTS,yes,"[51464, 51465, 50838, 50883, 50884, 50885, 510..."
14,UREA NITROGEN,LABORATORY RESULT,LABEVENTS,yes,"[50851, 51006, 51045, 51104]"
15,CALCIUM,LABORATORY RESULT,LABEVENTS,no,"[51468, 51469, 51470, 50808, 50893, 51029, 510..."
16,CREATININE,LABORATORY RESULT,LABEVENTS,yes,"[50841, 50912, 51021, 51032, 51052, 51067, 510..."
17,ESTIMATED GFR,LABORATORY RESULT,LABEVENTS,no,[50920]


In [123]:
df_ref003_lab

Unnamed: 0,Predictor,Description,Table,Exists,Matching_ITEMID
0,LACTATE,HIGH LACTATE LEVELS,LABEVENTS,yes,"[50813, 50843, 50954, 51015, 51054]"
6,WBC,HIGH WBC COUNT,LABEVENTS,yes,"[51363, 51384, 51439, 51458, 51128, 51300, 515..."
7,WBC,LOW WBC COUNT,LABEVENTS,yes,"[51363, 51384, 51439, 51458, 51128, 51300, 515..."
8,C-REACTIVE PROTEIN,ELEVATED CRP,LABEVENTS,yes,[50889]
11,PLATELET COUNT,LOW PLATELET COUNT,LABEVENTS,yes,[51265]
12,BILIRUBIN,ELEVATED BILIRUBIN,LABEVENTS,yes,"[51464, 51465, 50838, 50883, 50884, 50885, 510..."
13,CREATININE,ELEVATED CREATININE,LABEVENTS,yes,"[50841, 50912, 51021, 51032, 51052, 51067, 510..."
14,BICARBONATE,LOW BICARBONATE LEVELS,LABEVENTS,yes,"[50803, 50837, 50882, 51027, 51048, 51061, 51076]"
15,BLOOD GLUCOSE,HIGH BLOOD GLUCOSE,LABEVENTS,no,
17,ASPARATE AMINOTRANSFERASE,"ELEVATED LIVER ENZYMES (AST, ALT)",LABEVENTS,no,[50878]


In [None]:
suspicious_predictors = ['LYMPHOCYTE', 'MONOCYTE', 'PLATE', 'NEUTROPHIL', 'CALCIUM', 'LACTATE', 'AMINOTRANSFERASE', 'ALBUMIN']

def verify_labels(df, suspicious_predictors):
    """Return the rows of df whose LABEL contains any of the suspicious predictors."""
    filtered_dfs = []
    for predictor in suspicious_predictors:
        # substring match of the predictor within each label
        filtered_df = df[df['LABEL'].apply(lambda x: predictor in x)]
        filtered_dfs.append(filtered_df)
    df_query_labels = pd.concat(filtered_dfs).drop_duplicates()
    return df_query_labels

print("Words related to ref002:")
print("Input:", len(suspicious_predictors))
print("Related labels:", verify_labels(df_lab_desc, suspicious_predictors).shape[0])
verify_labels(df_lab_desc, suspicious_predictors)

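The following predictors have multiple associated ITEM_IDs in lab events. `verified_predictors` maps each predictor to the single ITEM_ID selected for it, while `potential_predictors` lists the candidate ITEM_IDs still under consideration.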

In [202]:
verified_predictors = {
    'PH': 50820, 
    'NEUTROPHILS': 51256,
    'ALKALINE PHOSPHATASE': 50863,
    'CREATININE': 50912,
    'BANDS': 51144,
    'BILIRUBIN': 50885,
    'HEMATOCRIT': 51221,            # total
    'LACTATE DEHYDROGENASE': 50954,  # (LD)
    'LACTATE': 50813,
    'WBC': 51301,
    'BLOOD UREA NITROGEN': 51006, # blood urea nitrogen
    'MONOCYTES': 51254
}

potential_predictors = {
    'HEMOGLOBIN':[51222, 50811, 50852, 50855],    # All from blood fluid
    'D-DIMER': [51196, 50915],                    # Both from blood fluid
    'BICARBONATE': [50882, 50803],                # Bicarbonate in blood fluid, Bicarbonate total
    'CALCIUM': [50893, 50808],                    # Both from blood fluid
    'WBC': [51300, 51301],                        # Both from blood fluid
    'BLOOD GLUCOSE': [50809, 50931, 51529],
    'ALBUMIN': [50862, 51069, 51070],
    'LYMPHOCYTE': [51244, 51143, 51245]
}

In [None]:
# Function to match 'Predictor' to 'LABEL' and return 'ITEMID'
def match_predictor_to_itemid(predictor):
    
    if predictor in verified_predictors.keys():
        # Look up the verified ITEMID for this predictor
        itemid = verified_predictors[predictor]
        
        # Confirm the verified ITEMID is present in the lab description table
        exact_match = df_lab_desc[df_lab_desc['ITEMID'] == itemid]
        if not exact_match.empty:
            return exact_match['ITEMID'].tolist()
        return None
    
    else:
        # Otherwise fall back to a case-insensitive substring match on LABEL
        matching_rows = df_lab_desc[df_lab_desc['LABEL'].str.contains(predictor, case=False, na=False)]
        if matching_rows.empty:
            return None
        else:
            return matching_rows['ITEMID'].tolist()

In [200]:
df_ref003_lab

Unnamed: 0,Predictor,Description,Table,Exists,Matching_ITEMID
0,LACTATE,HIGH LACTATE LEVELS,LABEVENTS,yes,[50813]
6,WBC,HIGH WBC COUNT,LABEVENTS,yes,[51301]
7,WBC,LOW WBC COUNT,LABEVENTS,yes,[51301]
8,C-REACTIVE PROTEIN,ELEVATED CRP,LABEVENTS,yes,[50889]
11,PLATELET COUNT,LOW PLATELET COUNT,LABEVENTS,yes,[51265]
12,BILIRUBIN,ELEVATED BILIRUBIN,LABEVENTS,yes,[50885]
13,CREATININE,ELEVATED CREATININE,LABEVENTS,yes,[50912]
14,BICARBONATE,LOW BICARBONATE LEVELS,LABEVENTS,yes,"[50803, 50837, 50882, 51027, 51048, 51061, 51076]"
15,BLOOD GLUCOSE,HIGH BLOOD GLUCOSE,LABEVENTS,no,
17,ASPARATE AMINOTRANSFERASE,"ELEVATED LIVER ENZYMES (AST, ALT)",LABEVENTS,no,[50878]


In [None]:
# Apply the function to the 'Predictor' column and create a new column called 'Matching_ITEMID'
df_ref001_lab['Matching_ITEMID'] = df_ref001_lab['Predictor'].apply(match_predictor_to_itemid)
df_ref002_lab['Matching_ITEMID'] = df_ref002_lab['Predictor'].apply(match_predictor_to_itemid)
df_ref003_lab['Matching_ITEMID'] = df_ref003_lab['Predictor'].apply(match_predictor_to_itemid)

### 4.1 Locating lab event ITEM_IDs from existing literature labels


In [198]:
references = [df_ref001, df_ref002, df_ref003]

table_query = "LABEVENTS"
refcount = 0

# Check whether any reference predictor matches the label:
# verified predictors must equal the label exactly, all others match as substrings
def check_label_in_predictor(label):
  for predictor in df_reference['Predictor']:
    if predictor in verified_predictors:
      if predictor == label:
        return 'yes'
      continue
    if predictor in label:
      return 'yes'
  return 'no'


for reference in references: 
  # Get unique reference name
  refcount+=1
  reference.name = "ref" + str(refcount)
  
  # Filter reference by table
  df_reference = get_filtered_reference(table_query, reference)[0]
  
  # Apply the function to the 'LABEL' column and store the result in a column named after the reference (ref1, ref2, ref3)
  df_lab_desc[reference.name] = df_lab_desc['LABEL'].apply(check_label_in_predictor)


In [199]:
# Subset of item_IDs related to sepsis
df_potential_lab = df_lab_desc[(df_lab_desc['ref1']=='yes') | 
                               (df_lab_desc['ref2']=='yes') | 
                               (df_lab_desc['ref3']=='yes')]
df_potential_lab

Unnamed: 0,ROW_ID,ITEMID,LABEL,FLUID,CATEGORY,LOINC_CODE,ref1,ref2,ref3
2,548,51348,"HEMATOCRIT, CSF",CEREBROSPINAL FLUID (CSF),HEMATOLOGY,30398-2,yes,no,no
9,555,51355,MONOCYTES,CEREBROSPINAL FLUID (CSF),HEMATOLOGY,26486-1,no,yes,no
17,563,51363,"WBC, CSF",CEREBROSPINAL FLUID (CSF),HEMATOLOGY,26465-5,yes,yes,yes
19,565,51365,ATYPICAL LYMPHOCYTES,JOINT FLUID,HEMATOLOGY,33371-6,no,yes,no
20,566,51366,BANDS,JOINT FLUID,HEMATOLOGY,33361-7,no,yes,no
...,...,...,...,...,...,...,...,...,...
716,717,51517,WBC CASTS,URINE,HEMATOLOGY,5820-6,yes,yes,yes
717,718,51518,WBC CLUMPS,URINE,HEMATOLOGY,,yes,yes,yes
728,729,51529,ESTIMATED ACTUAL GLUCOSE,BLOOD,CHEMISTRY,,no,yes,no
732,733,51533,WBCP,BLOOD,HEMATOLOGY,,yes,yes,yes


### 4.2 Verifying ITEM_IDs with Potential Labels


This section investigates the predictor labels that map to multiple ITEM_IDs.

We find:

- 4 items related to platelets
- 5 items related to monocytes, 2 of which are from blood
- 4 items related to neutrophils
- 8 items related to bilirubin; Bilirubin Total [50885], measured in blood, is confirmed with the clinical expert
- 13 items related to lymphocytes, 3 of which are from blood
- 15 items related to calcium, 2 of which are from blood [50893, 50808]
- 4 items related to hemoglobin, all from blood
- 3 items related to albumin: 2 from urine and 1 from blood
- 2 items related to bicarbonate, both from blood
- 2 items related to d-dimer, both from blood
- 3 items related to glucose in blood

- Lactic acid is available in chart events but is labelled as lactate in lab events
- AMINOTRANSFERASE matches both aspartate aminotransferase (AST) and alanine aminotransferase (ALT)
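
Counts of this kind can be reproduced by grouping the matching labels by fluid. The sketch below is one possible way to do it, assuming a table with `ITEMID`, `LABEL` and `FLUID` columns; `toy_lab_desc` and the helper `count_items_by_fluid` are hypothetical, and the rows are illustrative rather than taken from the dataset.

```python
import pandas as pd

# Hypothetical stand-in for df_lab_desc; rows are illustrative only
toy_lab_desc = pd.DataFrame({
    "ITEMID": [101, 102, 103, 104],
    "LABEL": ["MONOCYTES", "MONOCYTE COUNT", "MONOCYTES", "BILIRUBIN, TOTAL"],
    "FLUID": ["BLOOD", "BLOOD", "ASCITES", "BLOOD"],
})

def count_items_by_fluid(lab_desc, keyword):
    """Count distinct ITEMIDs whose LABEL contains keyword, grouped by FLUID."""
    mask = lab_desc["LABEL"].str.contains(keyword, case=False, regex=False, na=False)
    return lab_desc.loc[mask].groupby("FLUID")["ITEMID"].nunique().to_dict()

print(count_items_by_fluid(toy_lab_desc, "MONOCYTE"))
```

Applied to the real label table, the same call per keyword would give the per-fluid breakdown summarised in the list above.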


In [None]:
df_lab_desc[df_lab_desc['LABEL'].str.contains('PLATE', case=False)]

In [None]:
df_lab_desc[df_lab_desc['LABEL'].str.contains('MONOCYTE', case=False)]

In [None]:
df_lab_desc[df_lab_desc['LABEL'].str.contains('GLUCOSE', case=False)]

In [None]:
df_lab_desc[df_lab_desc['LABEL'].str.contains('BILIRUBIN', case=False)]

In [None]:
df_lab_desc[df_lab_desc['LABEL'].str.contains('LYMPHOCYTE', case=False)]

In [None]:
df_lab_desc[df_lab_desc['LABEL'].str.contains('AMINOTRANSFERASE', case=False)]

In [None]:
df_lab_desc[df_lab_desc['LABEL'].str.contains('CALCIUM', case=False)]

##### Save references (lab filtered) to CSV

In [None]:
# Create the directory if it doesn't exist
reference_path = Path(ROOT_DIR / 'data' / 'reference')
reference_path.mkdir(parents=True, exist_ok=True)

ref_lab_dfs = [df_ref001_lab, df_ref002_lab, df_ref003_lab]

# save labevent filtered data
for i, ref_lab_df in enumerate(ref_lab_dfs, start=1):
  ref_name = f'ref_00{i}_labevents.csv'
  ref_lab_df.to_csv(reference_path / ref_name, index=False)

# save final potential lab
df_potential_lab.to_csv(Path(reference_path / "potential_labevents.csv"), index=False)

## 5. Post Confirmation Revision with Medical Expert

After reviewing our 159 features with the RPH clinician, we reduce the potential features to 69. We then query these with the LLM to assess whether the feature set can be reduced further.

In [143]:
reference_path = Path(ROOT_DIR / 'data' / 'reference')
df_potential_lab_true = pd.read_csv(Path(reference_path / "potential_labevents_clinician.csv"))
df_potential_lab_true.shape

In [161]:
df_potential_lab_true['relation'] = df_potential_lab_true.apply(lambda x: run(prompt_causality, f"A: {x['LABEL']}, B: Sepsis"), axis=1)

In [145]:
df_potential_lab_true['relation_fluid'] = df_potential_lab_true.apply(lambda x: run(prompt_causality, f"A: {x['LABEL']} from fluid: {x['FLUID']}, B: Sepsis"), axis=1)

In [158]:
pd.set_option('display.max_rows', None)
df_potential_lab_true[(df_potential_lab_true['relation_fluid'] == 'Yes')]

Unnamed: 0,ROW_ID,ITEMID,LABEL,FLUID,CATEGORY,LOINC_CODE,ref1,ref2,ref3,relation
7,14,50813,LACTATE,BLOOD,BLOOD GAS,32693-4,no,yes,yes,Yes
21,69,50868,ANION GAP,BLOOD,CHEMISTRY,1863-0,no,yes,no,Yes
25,90,50889,C-REACTIVE PROTEIN,BLOOD,CHEMISTRY,1988-5,no,no,yes,Yes
28,113,50912,CREATININE,BLOOD,CHEMISTRY,2160-0,yes,yes,yes,Yes
29,116,50915,D-DIMER,BLOOD,CHEMISTRY,,no,no,yes,Yes
33,155,50954,LACTATE DEHYDROGENASE (LD),BLOOD,CHEMISTRY,2532-0,no,yes,yes,Yes
38,203,51003,TROPONIN T,BLOOD,CHEMISTRY,6598-7,no,no,yes,Yes
52,396,51196,D-DIMER,BLOOD,HEMATOLOGY,48065-7,no,no,yes,Yes
61,453,51253,MONOCYTE COUNT,BLOOD,HEMATOLOGY,26484-6,no,yes,no,Yes
63,456,51256,NEUTROPHILS,BLOOD,HEMATOLOGY,761-7,no,yes,yes,Yes


In [162]:
df_potential_lab_true[(df_potential_lab_true['relation'] == 'Yes')]

Unnamed: 0,ROW_ID,ITEMID,LABEL,FLUID,CATEGORY,LOINC_CODE,ref1,ref2,ref3,relation_fluid,relation
7,14,50813,LACTATE,BLOOD,BLOOD GAS,32693-4,no,yes,yes,Yes,Yes
28,113,50912,CREATININE,BLOOD,CHEMISTRY,2160-0,yes,yes,yes,Yes,Yes
29,116,50915,D-DIMER,BLOOD,CHEMISTRY,,no,no,yes,Yes,Yes
33,155,50954,LACTATE DEHYDROGENASE (LD),BLOOD,CHEMISTRY,2532-0,no,yes,yes,Yes,Yes
52,396,51196,D-DIMER,BLOOD,HEMATOLOGY,48065-7,no,no,yes,Yes,Yes
62,454,51254,MONOCYTES,BLOOD,HEMATOLOGY,742-7,no,yes,no,No,Yes
63,456,51256,NEUTROPHILS,BLOOD,HEMATOLOGY,761-7,no,yes,yes,Yes,Yes
65,475,51275,PTT,BLOOD,HEMATOLOGY,3173-2,no,no,yes,Yes,Yes
66,500,51300,WBC COUNT,BLOOD,HEMATOLOGY,26464-8,yes,yes,yes,Yes,Yes
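
The two prompts (label only, and label plus fluid) do not always agree; MONOCYTES is one such case in the output above. A minimal sketch for listing the disagreements, assuming the answers are plain `Yes`/`No` strings; `toy` and `is_yes` are hypothetical names, and the two rows mirror values shown in the table above.

```python
import pandas as pd

# Two rows in the shape of df_potential_lab_true, for illustration only
toy = pd.DataFrame({
    "LABEL": ["LACTATE", "MONOCYTES"],
    "relation": ["Yes", "Yes"],
    "relation_fluid": ["Yes", "No"],
})

def is_yes(answer):
    """True when the answer, ignoring case and surrounding whitespace, is 'yes'."""
    return str(answer).strip().lower() == "yes"

# Rows where the label-only and label-plus-fluid prompts disagree
disagree = toy[toy["relation"].map(is_yes) != toy["relation_fluid"].map(is_yes)]
print(disagree["LABEL"].tolist())
```

Rows flagged this way are natural candidates for a second look with the clinician before they are dropped or kept.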


## 6. Centrality and LLM relations

After revising the important features identified through Neo4j centrality, we query them against the LLM to obtain their relation to sepsis and assess their significance for prediction.

In [6]:
df_potential_neo4j = pd.read_csv(Path(ROOT_DIR / "data" / "report_features.csv"))
df_potential_neo4j.shape

(30, 11)

In [10]:
# Add causal relation
df_potential_neo4j['LLM_causal_relation'] = df_potential_neo4j.apply(lambda x: run(prompt_causality, f"A: {x['LABEL']}, B: Sepsis"), axis=1)
df_potential_neo4j['LLM_causal_relation'] = df_potential_neo4j['LLM_causal_relation'].apply(lambda x: 'yes' if 'yes' in x.lower() else 'no')

In [None]:
prompt_importance = '''
[INST]
You are a medical expert. You are tasked to assess the importance of A in predicting the onset of B in patients.
Return yes if A is important in predicting the onset of B. If A is not important in predicting the onset of B or if you are not sure, return no. 
Keep your responses short and succinct. 
Input
{test}
[/INST]
'''

In [16]:
df_potential_neo4j['LLM_important_in_onset'] = df_potential_neo4j.apply(lambda x: run(prompt_importance, f"A: {x['LABEL']}, B: Sepsis"), axis=1)
df_potential_neo4j['LLM_important_in_onset'] = df_potential_neo4j['LLM_important_in_onset'].apply(lambda x: 'yes' if 'yes' in x.lower() else 'no')
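
The substring test `'yes' in x.lower()` marks any reply containing "yes" as positive, including a reply that starts with "No". A stricter alternative is sketched below; it assumes the LLM places its verdict in the first word, which is not confirmed by the outputs shown here, and `parse_yes_no` is a hypothetical helper.

```python
import re

def parse_yes_no(response):
    """Map a free-text reply to 'yes' or 'no' using only its first word."""
    match = re.match(r"\s*(yes|no)\b", response, flags=re.IGNORECASE)
    if match is None:
        # Replies without a leading verdict fall back to 'no',
        # in line with the prompt's instruction for uncertain cases
        return "no"
    return match.group(1).lower()

print(parse_yes_no("Yes."))
print(parse_yes_no("No, although yes in rare cases."))
print(parse_yes_no("Yesterday's result is unrelated"))
```

It could replace the lambda in the `.apply(...)` calls if replies turn out to be longer than a single word.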

In [None]:
prompt_crucial = '''
[INST]
You are a medical expert. Your task is to determine whether a specific lab event or test measurement is crucial in predicting sepsis.
Return yes if A is crucial in predicting the onset of B. If A is not crucial in predicting the onset of B or if you are not sure, return no. 
Keep your responses short and succinct. 
Input
{test}
[/INST]
'''

In [28]:
df_potential_neo4j['LLM_crucial_in_onset'] = df_potential_neo4j.apply(lambda x: run(prompt_crucial, f"A: {x['LABEL']}, B: Sepsis"), axis=1)
df_potential_neo4j['LLM_crucial_in_onset'] = df_potential_neo4j['LLM_crucial_in_onset'].apply(lambda x: 'yes' if 'yes' in x.lower() else 'no')

Unnamed: 0,ITEMID,LABEL,FLUID,ref1,ref2,chatgpt,neo4j,Sepsis_abnormal,SEPSIS_PROP,NonSepsis_abnormal,NONSEPSIS_PROP,relation,important_in_onset,crucial_in_onset
0,51279,RED BLOOD CELLS,BLOOD,no,no,no,yes,5109,98.705564,43107,94.55363,no,yes,no
1,51222,HEMOGLOBIN,BLOOD,yes,no,yes,yes,5114,98.802164,42535,93.298969,no,yes,yes
2,51221,HEMATOCRIT,BLOOD,yes,no,no,yes,5094,98.415765,42724,93.713534,no,yes,yes
3,50931,GLUCOSE,BLOOD,no,yes,no,yes,5108,98.686244,43091,94.518535,no,yes,yes
4,51006,UREA NITROGEN,BLOOD,no,no,yes,yes,4663,90.088872,31697,69.526212,no,yes,yes
5,51244,LYMPHOCYTES,BLOOD,no,yes,no,yes,4811,92.948223,25470,55.867515,no,yes,yes
6,51256,NEUTROPHILS,BLOOD,no,yes,yes,yes,4827,93.257342,27690,60.737004,yes,yes,yes
7,51301,WHITE BLOOD CELLS,BLOOD,no,no,yes,yes,4824,93.199382,34151,74.908971,yes,yes,yes
8,51274,PT,BLOOD,no,no,yes,yes,4703,90.861669,32398,71.06383,no,yes,no
9,50912,CREATININE,BLOOD,yes,yes,yes,yes,4154,80.255023,21021,46.108796,yes,yes,yes


In [30]:
df_potential_neo4j

Unnamed: 0,ITEMID,LABEL,FLUID,ref1,ref2,chatgpt,neo4j,Sepsis_abnormal,SEPSIS_PROP,NonSepsis_abnormal,NONSEPSIS_PROP,LLM_causal_relation,LLM_importance_in_onset,LLM_crucial_in_onset
0,51279,RED BLOOD CELLS,BLOOD,no,no,no,yes,5109,98.705564,43107,94.55363,no,yes,no
1,51222,HEMOGLOBIN,BLOOD,yes,no,yes,yes,5114,98.802164,42535,93.298969,no,yes,yes
2,51221,HEMATOCRIT,BLOOD,yes,no,no,yes,5094,98.415765,42724,93.713534,no,yes,yes
3,50931,GLUCOSE,BLOOD,no,yes,no,yes,5108,98.686244,43091,94.518535,no,yes,yes
4,51006,UREA NITROGEN,BLOOD,no,no,yes,yes,4663,90.088872,31697,69.526212,no,yes,yes
5,51244,LYMPHOCYTES,BLOOD,no,yes,no,yes,4811,92.948223,25470,55.867515,no,yes,yes
6,51256,NEUTROPHILS,BLOOD,no,yes,yes,yes,4827,93.257342,27690,60.737004,yes,yes,yes
7,51301,WHITE BLOOD CELLS,BLOOD,no,no,yes,yes,4824,93.199382,34151,74.908971,yes,yes,yes
8,51274,PT,BLOOD,no,no,yes,yes,4703,90.861669,32398,71.06383,no,yes,no
9,50912,CREATININE,BLOOD,yes,yes,yes,yes,4154,80.255023,21021,46.108796,yes,yes,yes
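
One possible way to read the three LLM columns together is to count how many prompts answered `yes` for each feature. The sketch below uses three rows copied from the table above; `LLM_yes_count` is a hypothetical column that is not part of the saved CSV.

```python
import pandas as pd

# Three rows copied from the table above (LLM columns only)
toy = pd.DataFrame({
    "LABEL": ["RED BLOOD CELLS", "HEMOGLOBIN", "CREATININE"],
    "LLM_causal_relation": ["no", "no", "yes"],
    "LLM_importance_in_onset": ["yes", "yes", "yes"],
    "LLM_crucial_in_onset": ["no", "yes", "yes"],
})

llm_cols = ["LLM_causal_relation", "LLM_importance_in_onset", "LLM_crucial_in_onset"]

# Number of prompts that answered 'yes' for each feature
toy["LLM_yes_count"] = toy[llm_cols].eq("yes").sum(axis=1)
print(toy[["LABEL", "LLM_yes_count"]])
```

A count like this only summarises agreement between prompts; it is not a substitute for the clinician's review.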


In [31]:
utils.save_csv(df_potential_neo4j, ROOT_DIR / 'data' / 'report_features_relation.csv')