# Physionet 2019 Sepsis prediction challenge

The [Physionet 2019](https://physionet.org/content/challenge-2019/1.0.0/) challenge was to predict the onset of sepsis in ICU patients using vital signs and lab measurements.

The outcome column is `SepsisLabel`. For sepsis patients, `SepsisLabel` is 1 if `t >= t_sepsis - 6` and 0 if `t < t_sepsis - 6`, where `t_sepsis` is the hour of sepsis onset. For non-sepsis patients, `SepsisLabel` is always 0. So every row from 6 hours before sepsis onset onward belongs to the positive class.
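
As a rough illustration of the labelling rule above, the sketch below recomputes the labels for a single patient. The function name `sepsis_labels` and the onset hour `t_sepsis=10` are hypothetical; the challenge files ship the finished `SepsisLabel` column, not the onset time.

```python
import numpy as np

def sepsis_labels(iculos, t_sepsis=None, lead_hours=6):
    """Return one 0/1 label per hourly row (hypothetical helper).

    iculos     : hours since ICU admission, one entry per row
    t_sepsis   : assumed onset hour, or None for a non-sepsis patient
    lead_hours : rows from this many hours before onset are positive
    """
    iculos = np.asarray(iculos)
    if t_sepsis is None:
        # Non-sepsis patients are labelled 0 on every row.
        return np.zeros(len(iculos), dtype=int)
    # Positive from t_sepsis - lead_hours onward.
    return (iculos >= t_sepsis - lead_hours).astype(int)

hours = np.arange(1, 13)                   # ICULOS = 1..12
labels = sepsis_labels(hours, t_sepsis=10)
print(labels.tolist())                     # [0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]
```

With onset at hour 10, the first positive row is hour 4, so the positive class includes rows recorded before sepsis is clinically present.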

The time column is `ICULOS`: the number of hours since ICU admission. Each row therefore covers one hour of a patient's record.

The raw data are in pipe-separated (`.psv`) files, one per patient:

```
HR|O2Sat|Temp|SBP
NaN|NaN|NaN|NaN
97|95|NaN|98
89|99|NaN|122
```
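
A minimal sketch of how such a file can be parsed with pandas, using the four-column excerpt above as an in-memory string instead of a real patient file. The variable names `sample` and `vitals` are illustrative only.

```python
import io

import pandas as pd

# The excerpt shown above, held in memory for the example.
sample = """HR|O2Sat|Temp|SBP
NaN|NaN|NaN|NaN
97|95|NaN|98
89|99|NaN|122
"""

# sep="|" tells pandas the fields are pipe-separated;
# the literal string "NaN" is parsed as a missing value by default.
vitals = pd.read_csv(io.StringIO(sample), sep="|")

print(vitals.shape)          # (3, 4)
print(vitals.isna().sum())   # missing values per column
```

Columns that contain missing values are read as floats, which is why the vitals appear as `80.0` rather than `80` once the files are loaded.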

In [20]:
import numpy as np
import pandas as pd
import os
import urllib.error
import urllib.request
import tarfile
import zipfile
from pathlib import Path

import matplotlib.pyplot as plt
import seaborn as sns
sns.reset_defaults()
%matplotlib inline

In [45]:
# constants
ROOT = Path('./data/physionet2019')
ROOT.mkdir(parents=True, exist_ok=True)

WIDE = ROOT / "physionet2019_timeseries_wide.parquet"

Download data

In [19]:
sources = [
    "https://archive.physionet.org/users/shared/challenge-2019/training_setA.zip",
    "https://archive.physionet.org/users/shared/challenge-2019/training_setB.zip"
]

def download_url(url, root, filename=None):
    """Download `url` into `root`, retrying once over http if https fails."""
    if not filename:
        filename = os.path.basename(url)
    fpath = os.path.join(root, filename)
    os.makedirs(root, exist_ok=True)
    try:
        urllib.request.urlretrieve(url, fpath)
    except (urllib.error.URLError, IOError):
        if not url.startswith('https:'):
            # no fallback available, so surface the error instead of hiding it
            raise
        url = url.replace('https:', 'http:', 1)
        urllib.request.urlretrieve(url, fpath)


def unzip(file, root):
    if file.endswith("tar.gz"):
        tar = tarfile.open(file, "r:gz")
        tar.extractall(path=root)
        tar.close()
    if file.endswith("tar"):
        tar = tarfile.open(file, "r:")
        tar.extractall(path=root)
        tar.close()
    if file.endswith("zip"):
        with zipfile.ZipFile(file, 'r') as z:
            z.extractall(root)

In [24]:
for url in sources:
    download_url(url, ROOT)
    unzip(str(ROOT / os.path.basename(url)), ROOT)

In [26]:
# setA folder is named training so rename it to training_setA
os.rename(ROOT / 'training', ROOT / 'training_setA')

In [27]:
!ls {ROOT}

[1m[36mtraining_setA[m[m     training_setA.zip [1m[36mtraining_setB[m[m     training_setB.zip


Merge all patient records into a single dataframe

In [42]:
datasets = ['training_setA','training_setB']
id_var = 'RecordID'

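def load_dataset(root, name):
    """Read every .psv file in `root` and stack them into one dataframe."""
    frames = []
    for file in root.glob('*.psv'):
        d = pd.read_csv(file, sep="|")
        # the file name (e.g. p014977.psv) is the only patient identifier
        d.loc[:, id_var] = file.name.split('.')[0]
        frames.append(d)
    df = pd.concat(frames)
    df.loc[:, 'Dataset'] = name
    return df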

In [43]:
df = pd.concat([load_dataset(ROOT / name, name) for name in datasets])

In [46]:
df.shape

(1552210, 43)

In [47]:
df.head()

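| | HR | O2Sat | Temp | SBP | MAP | DBP | Resp | EtCO2 | BaseExcess | HCO3 | ... | Platelets | Age | Gender | Unit1 | Unit2 | HospAdmTime | ICULOS | SepsisLabel | RecordID | Dataset |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 80.0 | 100.0 | 36.5 | 121.0 | 58.0 | 41.0 | 13.5 | NaN | 1.0 | 25.0 | ... | 160.0 | 77.27 | 1 | 0.0 | 1.0 | -69.14 | 3 | 0 | p014977 | training_setA |
| 1 | 76.0 | 100.0 | 36.25 | 113.25 | 61.0 | 41.5 | 12.0 | NaN | 1.0 | 25.0 | ... | NaN | 77.27 | 1 | 0.0 | 1.0 | -69.14 | 4 | 0 | p014977 | training_setA |
| 2 | 80.0 | 100.0 | 36.25 | 132.75 | 71.5 | 46.25 | 12.0 | NaN | NaN | NaN | ... | NaN | 77.27 | 1 | 0.0 | 1.0 | -69.14 | 5 | 0 | p014977 | training_setA |
| 3 | 78.0 | 100.0 | 36.1 | 103.5 | 58.0 | 43.0 | 12.0 | NaN | -3.0 | NaN | ... | NaN | 77.27 | 1 | 0.0 | 1.0 | -69.14 | 6 | 0 | p014977 | training_setA |
| 4 | 74.0 | 100.0 | 36.0 | 128.75 | 69.5 | 44.5 | 12.5 | NaN | -3.0 | NaN | ... | NaN | 77.27 | 1 | 0.0 | 1.0 | -69.14 | 7 | 0 | p014977 | training_setA |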


Save as a parquet file partitioned by Dataset.

In [48]:
df.to_parquet(WIDE, index=False, engine="pyarrow", partition_cols=["Dataset"])

Cleanup

In [54]:
import shutil
def delete(file) -> None:
    if os.path.isdir(file):
        shutil.rmtree(file)
    else:
        if os.path.exists(file):
            os.remove(file)
            
keep = [
    WIDE,
]
for f in ROOT.glob("*"):
    if f not in keep:
        print(f)
        delete(f)

data/physionet2019/.DS_Store
data/physionet2019/training_setA
data/physionet2019/training_setB.zip
data/physionet2019/training_setB
data/physionet2019/training_setA.zip
