Overview of the problem to solve with this model
- Initially train a simple model on a few features, using a training pipeline, to predict patients who 1) are septic or 2) are at risk of becoming septic
- Develop heuristics to filter out patients who are already septic, avoiding the cost of model inference for them
- Train more complex models to infer patients at risk of becoming septic, based on more complex data sources and features

Overview of Sepsis Indicators

Sepsis - Systemic inflammatory response syndrome (SIRS), two or more of the following are met:
1. Body temperature > 38.5°C or < 35.0°C
1. Heart rate > 90 beats per minute
1. Respiratory rate > 20 breaths per minute or arterial CO2 tension < 32 mm Hg or need for mechanical ventilation
1. White blood cell count > 12,000/mm3 or < 4,000/mm3 or immature forms > 10%
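
The SIRS criteria above translate directly into a rule-based check, which is the kind of heuristic that could screen patients before model inference. The following is a minimal sketch and not part of the original notebook: the function names `count_sirs_criteria` and `meets_sirs` and all parameter names are hypothetical, the mechanical-ventilation criterion is omitted, and a missing measurement (`None`) is treated as not meeting a criterion.

```python
from typing import Optional


def count_sirs_criteria(
    temp_c: Optional[float] = None,
    heart_rate: Optional[float] = None,
    resp_rate: Optional[float] = None,
    paco2_mmhg: Optional[float] = None,
    wbc_per_mm3: Optional[float] = None,
    immature_pct: Optional[float] = None,
) -> int:
    """Count how many of the four SIRS criteria listed above are met.

    Hypothetical helper for illustration; a missing value (None) never
    satisfies a criterion.
    """
    def gt(value, limit):
        return value is not None and value > limit

    def lt(value, limit):
        return value is not None and value < limit

    criteria = [
        gt(temp_c, 38.5) or lt(temp_c, 35.0),        # body temperature
        gt(heart_rate, 90),                          # heart rate
        gt(resp_rate, 20) or lt(paco2_mmhg, 32),     # respiration / arterial CO2
        gt(wbc_per_mm3, 12_000) or lt(wbc_per_mm3, 4_000) or gt(immature_pct, 10),
    ]
    return sum(criteria)


def meets_sirs(**measurements) -> bool:
    """Return True when two or more criteria are met, per the definition above."""
    return count_sirs_criteria(**measurements) >= 2
```

For example, a record with a temperature of 39.3, a heart rate of 242 and a respiration rate of 33 meets three criteria, so `meets_sirs(temp_c=39.3, heart_rate=242, resp_rate=33)` returns `True`.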

Severe sepsis - Sepsis and at least one sign of organ hypoperfusion or organ dysfunction:
1. Areas of mottled skin
1. Capillary refilling time ≥ 3 s
1. Urinary output < 0.5 mL/kg for at least 1 h or renal replacement therapy
1. Lactates > 2 mmol/L
1. Abrupt change in mental status or abnormal electroencephalogram
1. Platelet counts < 100,000/mL or disseminated intravascular coagulation
1. Acute lung injury—acute respiratory distress syndrome
1. Cardiac dysfunction (echocardiography) 

Septic shock - Severe sepsis and one of:
1. Systemic mean blood pressure of < 60 mm Hg (< 80 mm Hg if previous hypertension) after 20–30 mL/kg starch or 40–60 mL/kg serum saline, or pulmonary capillary wedge pressure between 12 and 20 mm Hg
1. Need for dopamine > 5 μg/kg per min or norepinephrine or epinephrine < 0.25 μg/kg per min to maintain mean blood pressure above 60 mm Hg (> 80 mm Hg if previous hypertension)

Refractory septic shock:
1. Need for dopamine > 15 μg/kg per min or norepinephrine or epinephrine > 0.25 μg/kg per min to maintain mean blood pressure above 60 mm Hg (> 80 mm Hg if previous hypertension)

Overview of this project's current goal for data engineering
- use only the Patient Vital Signs (pat_vitals_labeled-dataSepsis.csv) to identify predictive signals (columns)
- generate a data preprocessing pipeline for feeding data to the model

Overview of the data for this project
- Data was originally based on a Kaggle project https://www.kaggle.com/maxskoryk/datasepsis
- Major changes were made because demographic bias in the data influenced the sepsis indicator and because the sepsis indicators were not accurate
- Patient ID, record date and record time were added
- HR, Temp and RR were generated to reflect realistic values and the believed real-world proportions of patients
- Data was split into 3 separate labeled data files
    - Patient Demographics (pat_demog_labeled_dataSepsis.csv)
    - Patient Laboratory Values (pat_labs_labeled_dataSepsis.csv)
    - Patient Vital Signs (pat_vitals_labeled_dataSepsis.csv)

Overview of steps in the notebook
- Fetch and write the data for updates using urllib, zipfile, and os for OS agnostic handling
- Load the data as a DataFrame using Pandas
- Explore the DataFrame with Pandas
- Split the data into train and test sets with Scikit-Learn
- Visualize the train data with Matplotlib and Seaborn
- Explore correlation among features
- Feature down selection
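
The split and correlation steps listed above can be sketched on a toy frame. This is an illustration under stated assumptions: the values are randomly generated and are not the project data, and `test_size=0.2`, `random_state=42` and stratification on `isSepsis` are choices made here, not settings confirmed by the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the vitals table (NOT the project data).
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "HR": rng.normal(85, 15, n),
    "Temp": rng.normal(37.5, 1.0, n),
    "Resp": rng.normal(18, 4, n),
})
# Label loosely tied to HR so that a correlation is visible.
toy["isSepsis"] = (toy["HR"] + rng.normal(0, 10, n) > 100).astype(int)

# Hold out a test set; stratify keeps the class ratio similar in both parts.
train_set, test_set = train_test_split(
    toy, test_size=0.2, random_state=42, stratify=toy["isSepsis"]
)

# Correlation of each feature with the label, computed on the train set only.
corr_with_label = (
    train_set.corr()["isSepsis"].drop("isSepsis").sort_values(ascending=False)
)
print(corr_with_label)
```

Computing the correlations on the train set only keeps the test set untouched for later evaluation; features with correlations near zero are natural candidates for down-selection.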

Import Packages

In [1]:
# data ingestion
import urllib.request
import os
import zipfile

# data manipulation
import pandas as pd
import numpy as np

# data visualization
import seaborn as sns
import matplotlib.pyplot as plt
# add the directory above the current directory to the import path
import sys
sys.path.insert(0, '..')

# submodules that may be removed
#from submodules.fetch_data import fetch_data
#from submodules.load_data import load_data

from pandas.plotting import scatter_matrix
from IPython.display import Image


# data splitting
from sklearn.model_selection import train_test_split

Fetch the data

In [2]:
# fetch the data using a Python function; commented out because it cannot be used with the Kaggle source
#fetch_data()
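
For a source that does allow direct download, the fetch step could look roughly like the following. This is a sketch under assumptions and not the commented-out `fetch_data` from `submodules`: the function name `fetch_zip`, its parameters and the default archive name `data.zip` are hypothetical, and no real download URL is implied.

```python
import os
import urllib.request
import zipfile


def fetch_zip(url: str, dest_dir: str, archive_name: str = "data.zip") -> list:
    """Download a zip archive from `url` and extract it into `dest_dir`.

    Hypothetical helper; returns the names of the extracted members.
    """
    os.makedirs(dest_dir, exist_ok=True)                  # create the target folder if needed
    archive_path = os.path.join(dest_dir, archive_name)   # OS-agnostic path join
    urllib.request.urlretrieve(url, archive_path)         # download the archive to disk
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(path=dest_dir)                 # unpack next to the archive
        return archive.namelist()
```

Using `os.path.join` and `os.makedirs` rather than hard-coded separators is what keeps the helper OS agnostic.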

Load the data

In [8]:
# load the data using a python function
#data = load_data()

# without using a python function
# set for the Signal definitions
attr_path = "../../data/dataSepsis/csv_format/attribute_definitions.csv"
attr = pd.read_csv(attr_path, sep=",")
# set for the Patient Vital Signs
csv_path = "../../data/dataSepsis/csv_format/pat_vitals_labeled-dataSepsis.csv"
data = pd.read_csv(csv_path, sep=",")
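
If the commented-out `load_data` helper is retired, the inline loading could be wrapped in a small function that fails early when a path is wrong. A minimal sketch, assuming a hypothetical helper name `load_csv` that is not part of the original notebook:

```python
import os

import pandas as pd


def load_csv(path: str) -> pd.DataFrame:
    """Read a comma-separated file, raising a clear error if it is missing."""
    if not os.path.isfile(path):
        raise FileNotFoundError(f"Expected data file not found: {path}")
    return pd.read_csv(path)
```

In the notebook this would be called as `data = load_csv(csv_path)` and `attr = load_csv(attr_path)`.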

Review of Signal Definitions available from the data source

In [13]:
# list the attributes definition file for the Patient Vital Signs
attr.head(13)

Unnamed: 0,target_file,attribute_name,attribute_definition
0,,List all the attributes in the dataset. Label ...,
1,,,
2,pat_vitals_labeled.csv,Vital signs (columns 1-8),"Doctors order, basis every 4 hours, least inva..."
3,pat_vitals_labeled.csv,HR,Heart rate (beats per minute)
4,pat_vitals_labeled.csv,O2Sat,Pulse oximetry (%)
5,pat_vitals_labeled.csv,Temp,Temperature (Deg C)
6,pat_vitals_labeled.csv,SBP,Systolic BP (mm Hg)
7,pat_vitals_labeled.csv,MAP,Mean arterial pressure (mm Hg)
8,pat_vitals_labeled.csv,DBP,Diastolic BP (mm Hg)
9,pat_vitals_labeled.csv,Resp,Respiration rate (breaths per minute)


First glance at the raw data: the first 10 rows.

In [15]:
data.head(10)

Unnamed: 0,patient_id,record_date,record_time,HR,O2Sat,Temp,SBP,MAP,DBP,Resp,EtCO2,isSepsis
0,1,,,63,90.0,40.3,,,,17,,0
1,2,,,79,95.0,39.2,143.0,77.0,47.0,13,,0
2,3,,,87,94.0,40.3,133.0,74.0,48.0,20,,0
3,4,,,71,100.0,42.1,,,,15,,0
4,5,,,68,94.5,39.7,147.5,102.0,,20,,0
5,6,,,78,99.0,39.6,100.0,67.0,49.5,18,,0
6,7,,,242,,39.3,,,,33,,1
7,8,,,81,100.0,40.3,112.0,79.5,63.0,18,,0
8,9,,,178,100.0,39.22,141.0,85.0,57.0,22,,1
9,10,,,81,95.0,39.2,121.0,20.0,,17,,0
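
The preview shows empty `record_date`, `record_time` and `EtCO2` cells and gaps in the blood-pressure columns. One way to quantify this is the fraction of missing values per column. The sketch below re-enters the first three previewed rows by hand so that it runs on its own; the names `preview`, `missing_fraction` and `empty_columns` are illustrative, and in the notebook the same calls would be applied to `data`.

```python
import numpy as np
import pandas as pd

# First three rows of the preview above, re-entered by hand for illustration.
preview = pd.DataFrame({
    "patient_id": [1, 2, 3],
    "record_date": [np.nan, np.nan, np.nan],
    "record_time": [np.nan, np.nan, np.nan],
    "HR": [63, 79, 87],
    "O2Sat": [90.0, 95.0, 94.0],
    "Temp": [40.3, 39.2, 40.3],
    "SBP": [np.nan, 143.0, 133.0],
    "MAP": [np.nan, 77.0, 74.0],
    "DBP": [np.nan, 47.0, 48.0],
    "Resp": [17, 13, 20],
    "EtCO2": [np.nan, np.nan, np.nan],
    "isSepsis": [0, 0, 0],
})

# Fraction of missing values per column, largest first.
missing_fraction = preview.isna().mean().sort_values(ascending=False)
print(missing_fraction)

# Columns with no observed values carry no signal and are candidates to drop.
empty_columns = missing_fraction[missing_fraction == 1.0].index.tolist()
```

Whether a column that is empty in the first rows is empty throughout can only be confirmed by running the same check on the full `data` frame.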
