# Sepsis-3 evaluation in the MIMIC-III database

This notebook evaluates the new Sepsis-3 guidelines in the MIMIC-III database. The goals of this analysis are:

1. Evaluating the Sepsis-3 guidelines in MIMIC using the same methodology as in the research paper
2. Evaluating the Sepsis-3 guidelines against the Angus criteria
3. Assessing whether the criteria miss any interesting subgroups

In [21]:
from __future__ import print_function

# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from collections import OrderedDict

from sepsis_utils import sepsis_utils as su
from sepsis_utils import roc_utils as ru

# used to calculate AUROC
from sklearn import metrics

# default colours for prettier plots
col = [[0.9047, 0.1918, 0.1988],
    [0.2941, 0.5447, 0.7494],
    [0.3718, 0.7176, 0.3612],
    [1.0000, 0.5482, 0.1000],
    [0.4550, 0.4946, 0.4722],
    [0.6859, 0.4035, 0.2412],
    [0.9718, 0.5553, 0.7741],
    [0.5313, 0.3359, 0.6523]];
marker = ['v','o','d','^','s','o','+']
ls = ['-','-','-','-','-','-','--','--']

%matplotlib inline

In [2]:
# load data
df = pd.read_csv('sepsis3-df.csv',sep=',')
df_mdl = pd.read_csv('sepsis3-design-matrix.csv',sep=',')

# define outcome
target_header = "angus"
y = df[target_header].values == 1

# define the covariates to be added in the MFP model (used for table of AUROCs)
preds_header = ['sirs','qsofa','sofa','mlods']

# Study questions

1. How well do the guidelines detect sepsis (Angus criteria) in the antibiotics/culture subset?
2. How well do the guidelines predict mortality (in-hospital) in the antibiotics/culture subset?
3. What factors would improve the sensitivity of the guidelines?
4. What factors would improve the specificity of the guidelines?

## Angus criteria evaluation

In [3]:
yhat = df.sepsis3.values
print('\n SEPSIS-3 guidelines for Angus criteria sepsis \n')
print('Accuracy = {}'.format(metrics.accuracy_score(y, yhat)))
su.print_cm(y, yhat, header1='ang',header2='sep3') # print confusion matrix


 SEPSIS-3 guidelines for Angus criteria sepsis 

Accuracy = 0.595044978617

Confusion matrix
      	ang=0 	ang=1 
sep3=0	  2060	  1273	NPV=61.81
sep3=1	  1473	  1975	PPV=57.28
   	58.31	60.81	Acc=59.50
   	Spec	Sens
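
The margins of the confusion matrix follow directly from its four cells. As a minimal sketch (the helper names `confusion_counts` and `margins` are hypothetical and are not part of `sepsis_utils`), the percentages above can be recomputed from the printed counts:

```python
def confusion_counts(y_true, y_pred):
    """Tally (TN, FP, FN, TP) from two equal-length sequences of booleans."""
    tn = fp = fn = tp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth and pred:
            tp += 1
        elif truth:
            fn += 1
        elif pred:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp


def margins(tn, fp, fn, tp):
    """Return the percentages printed around the confusion matrix."""
    return {
        'Sens': 100.0 * tp / (tp + fn),   # column ang=1
        'Spec': 100.0 * tn / (tn + fp),   # column ang=0
        'PPV': 100.0 * tp / (tp + fp),    # row sep3=1
        'NPV': 100.0 * tn / (tn + fn),    # row sep3=0
        'Acc': 100.0 * (tp + tn) / (tn + fp + fn + tp),
    }


# Counts copied from the confusion matrix printed above.
m = margins(tn=2060, fp=1473, fn=1273, tp=1975)
for name in ['Sens', 'Spec', 'PPV', 'NPV', 'Acc']:
    print('{} = {:.2f}'.format(name, m[name]))
```

With real data, `confusion_counts(y, yhat)` would supply the four counts instead of the hard-coded values.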


Predictions using various levels of confounder adjustment are calculated by the function `calc_predictions`:

* `model=None` - the severity scores on their own
* `model='baseline'` - the severity scores in a vanilla regression
* `model='mfp'` - the severity scores in a fractional polynomial regression (calls an R script)

For the Angus criteria, we do not adjust for other factors when presenting the AUROCs.

In [4]:
preds = su.calc_predictions(df, preds_header, target_header, model=None)

In [5]:
# reproduce the AUC table
su.print_auc_table(preds, y, preds_header)
su.print_auc_table_to_file(preds, y, preds_header=preds_header,
                           filename='auc-table.csv')

     	sirs                	qsofa               	sofa                	mlods               	
sirs 	0.607 [0.594, 0.620]	0.436 [0.413, 0.458]	0.179 [0.161, 0.196]	0.228 [0.207, 0.246]	
qsofa	0.347               	0.600 [0.587, 0.612]	0.271 [0.260, 0.282]	0.356 [0.343, 0.368]	
sofa 	< 0.001               	< 0.001               	0.682 [0.669, 0.694]	0.872 [0.866, 0.877]	
mlods	< 0.001               	< 0.001               	0.552               	0.685 [0.672, 0.697]	
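
The diagonal entries of the table read as an AUROC with an interval for each score. The sketch below shows one way such a figure could be computed; it is not the implementation inside `su.print_auc_table`, whose interval method is not shown here. The percentile bootstrap, the function names, and the toy data are all assumptions made for illustration. The off-diagonal entries are not reproduced.

```python
import random


def auroc(y_true, scores):
    """AUROC as the probability that a random positive case outscores a
    random negative case, with ties counted as one half."""
    pos = [s for t, s in zip(y_true, scores) if t]
    neg = [s for t, s in zip(y_true, scores) if not t]
    if not pos or not neg:
        raise ValueError('need at least one positive and one negative case')
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def bootstrap_ci(y_true, scores, n_boot=200, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the AUROC (assumed method)."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [y_true[i] for i in idx]
        if all(ys) or not any(ys):
            continue  # resample contains a single class; AUROC is undefined
        stats.append(auroc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Toy data, invented for illustration only.
y_toy = [0, 0, 0, 1, 1, 1, 0, 1]
s_toy = [0, 1, 1, 2, 3, 1, 2, 3]
print(auroc(y_toy, s_toy))  # 0.84375
lower, upper = bootstrap_ci(y_toy, s_toy)
```

The pairwise loop is quadratic in the number of cases, so on the full cohort `metrics.roc_auc_score(y, score)` from scikit-learn, which computes the same quantity, is the practical choice.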


## Operating point statistics

This section evaluates the standard operating point statistics:

* sensitivity (% of patients with the outcome who are correctly classified as positive)
* specificity (% of patients without the outcome who are correctly classified as negative)
* positive predictive value (given a positive prediction is made, what % are correct)
* negative predictive value (given a negative prediction is made, what % are correct)
* F1 score (harmonic mean of sensitivity and PPV)

In addition, we evaluate the number of false positives per 100 cases, or NFP/100. We feel this gives helpful perspective when interpreting the positive predictive value of a prediction and its relationship to the prevalence of the outcome. In this context, the measure can be summarized as: given 100 patients with suspected infection, how many will each algorithm incorrectly flag as positive?
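
As a worked example of these rates, the sketch below recomputes F1, NTP/100 and NFP/100 from the four counts reported for SOFA >= 2 in the operating point table of this section. The function names are hypothetical and are not taken from `sepsis_utils`.

```python
def per_100(count, total):
    """Express a cell of the confusion matrix as a rate per 100 patients."""
    return 100.0 * count / total


def f1_percent(tp, fp, fn):
    """F1 score (harmonic mean of sensitivity and PPV), as a percentage."""
    return 100.0 * 2.0 * tp / (2.0 * tp + fp + fn)


# Counts for SOFA >= 2, copied from the operating point table.
tn, fp, fn, tp = 790, 2743, 264, 2984
total = tn + fp + fn + tp

f1 = f1_percent(tp, fp, fn)
ntp = per_100(tp, total)   # true positives per 100 patients
nfp = per_100(fp, total)   # false positives per 100 patients
print('F1 = {:.2f}, NTP = {:.2f}, NFP = {:.2f}'.format(f1, ntp, nfp))
```

Unlike PPV, which conditions on a positive prediction, NFP/100 is normalized by the whole cohort, so it shows directly how many patients out of every 100 screened would be flagged in error.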

In [54]:
# sepsis3 defined as qSOFA >= 2 and SOFA >= 2
yhat_dict = OrderedDict([['SOFA', df.sofa.values >= 2],
                        ['SIRS', df.sirs.values >= 2],
                        ['mLODS', df.mlods.values >= 2],
                        ['qSOFA', df.qsofa.values >= 2],
                        ['seps3', df.sepsis3.values]])

stats_all = su.get_op_stats(yhat_dict, y)

su.print_op_stats(stats_all)

Metric

     	SOFA    	SIRS    	mLODS   	qSOFA   	seps3   
TN   	  790		  992		 1636		 1877		 2060
FP   	 2743		 2541		 1897		 1656		 1473
FN   	  264		  530		  677		 1173		 1273
TP   	 2984		 2718		 2571		 2075		 1975
Sens 	92 [91, 93]	84 [82, 85]	79 [78, 81]	64 [62, 66]	61 [59, 62]
Spec 	22 [21, 24]	28 [27, 30]	46 [45, 48]	53 [51, 55]	58 [57, 60]
PPV  	52 [51, 53]	52 [50, 53]	58 [56, 59]	56 [54, 57]	57 [56, 59]
NPV  	75 [72, 78]	65 [63, 68]	71 [69, 73]	62 [60, 63]	62 [60, 63]
F1   	 66.50   	 63.90   	 66.64   	 59.46   	 58.99   
NTP  	 44.01   	 40.08   	 37.91   	 30.60   	 29.13   
NFP  	 40.45   	 37.47   	 27.98   	 24.42   	 21.72   
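
The `seps3` column has the lowest sensitivity and the highest specificity in the table. That is what one would expect from a rule requiring both qSOFA >= 2 and SOFA >= 2, since its positives are a subset of the positives of each component. A minimal sketch with invented scores (the function name and the toy values are hypothetical):

```python
def sepsis3_flag(qsofa, sofa, threshold=2):
    """Composite rule from the code comment: qSOFA >= 2 and SOFA >= 2."""
    return qsofa >= threshold and sofa >= threshold


# (qsofa, sofa) pairs, invented for illustration only.
toy = [(0, 1), (2, 1), (1, 4), (2, 3), (3, 6)]

qsofa_pos = [q >= 2 for q, _ in toy]
sofa_pos = [s >= 2 for _, s in toy]
seps3_pos = [sepsis3_flag(q, s) for q, s in toy]

# Every composite positive is also positive under each component rule.
subset_ok = all((not c) or (a and b)
                for a, b, c in zip(qsofa_pos, sofa_pos, seps3_pos))
print(seps3_pos, subset_ok)
```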
