# Exploratory Data Analysis - MIMIC-IV Dataset in PostgreSQL

The notebook has been implemented using Python 3.10.11.
We suggest creating a virtual environment for this notebook.
You need to install the following packages to run this notebook:

| Package Name | License                                                                                                                 | Documentation                           |
|--------------|-------------------------------------------------------------------------------------------------------------------------|-----------------------------------------|
| psycopg2     | [![License: LGPL v3](https://img.shields.io/badge/License-LGPL_v3-blue.svg)](https://www.gnu.org/licenses/lgpl-3.0)     | [Docs](https://www.psycopg.org/)        |
| pandas       | [![License](https://img.shields.io/badge/License-BSD_3--Clause-blue.svg)](https://opensource.org/licenses/BSD-3-Clause) | [Docs](https://pandas.pydata.org/)      |
| numpy        | [![License](https://img.shields.io/badge/License-BSD_3--Clause-blue.svg)](https://opensource.org/licenses/BSD-3-Clause) | [Docs](https://numpy.org/)              |
| seaborn      | [![License](https://img.shields.io/badge/License-BSD_3--Clause-blue.svg)](https://opensource.org/licenses/BSD-3-Clause) | [Docs](https://seaborn.pydata.org/)     |
| scipy        | [![License](https://img.shields.io/badge/License-BSD_3--Clause-blue.svg)](https://opensource.org/licenses/BSD-3-Clause) | [Docs](https://scipy.org/)              |
| tomli        | [![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](https://opensource.org/licenses/MIT)               | [Docs](https://github.com/hukkin/tomli) |
| matplotlib   | [(BSD-compatible, PSF-based)](https://matplotlib.org/stable/users/project/license.html)                                 | [Docs](https://matplotlib.org/)         |


In [None]:
"""Update pip and install requirements."""
%pip install --upgrade pip
%pip install -r requirements.txt

In [None]:
"""Relevant imports for EDA; setup and styling."""

# data manipulation
import numpy as np
import pandas as pd

# data visualization
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib import rcParams

# default styling for plots
plt.style.use("ggplot")  # ggplot style
rcParams["figure.figsize"] = 12, 6  # figure size
from matplotlib.colors import ListedColormap

# hls colormap for sns styled pie charts using matplotlib
hls = ListedColormap(sns.color_palette("hls").as_hex())

In [None]:
"""Functions for database connection, query execution, dataframe plotting."""

import tomli as toml
import psycopg2 as pg
from typing import Any


def read_config(path: str) -> dict:
    """Read config file and return config dict."""
    with open(path, "rb") as f:
        config = toml.load(f)["database"]
    return config


def connect_to_db(config: dict) -> tuple[Any, Any]:
    """Connect to database and return the connection and a cursor."""
    conn = pg.connect(**config)
    cur = conn.cursor()
    return conn, cur


def read_sql(path: str) -> str:
    """Read SQL file and return its contents as a string."""
    with open(path, "r") as f:
        sql = f.read()
    return sql


def sql_to_df(path: str) -> pd.DataFrame:
    """Read SQL file, execute query and return pandas DataFrame."""
    conn, cur = connect_to_db(read_config("./config.toml"))
    cur.execute(read_sql(path))
    df = pd.DataFrame(cur.fetchall(), columns=[desc[0] for desc in cur.description])
    conn.close()
    return df


def plot_corr_matrix(
    df: pd.DataFrame,
    title: str="",
    figsize=(10, 5),
    linewidth=0.3,
    fmt=".2f",
    annot_kws={"size": 10},
    cmap="Spectral_r",
    cbar=True,
    ax=None,
    cbar_kws={"shrink": 0.8},
) -> None:
    """Plot heatmap of correlation matrix."""
    # create a new figure unless an axes object was passed in
    standalone = ax is None
    if standalone:
        _, ax = plt.subplots(figsize=figsize)
    corr = df.corr()
    sns.heatmap(
        corr,
        cbar=cbar,  # show color bar? yes/no
        annot=True,  # show numbers in cells? yes/no
        square=True,  # square cells? yes/no
        linewidths=linewidth,  # linewidth between cells
        fmt=fmt,  # precision
        annot_kws=annot_kws,  # size of numbers in cells
        yticklabels=corr.columns,  # y-axis labels (columns kept by corr())
        xticklabels=corr.columns,  # x-axis labels (columns kept by corr())
        cmap=cmap,  # color palette
        ax=ax,  # axes object
        cbar_kws=cbar_kws,  # shrink color bar
    )
    if title:
        # title the heatmap's own axes, not whichever axes is current
        ax.set_title(title)
    if standalone:
        plt.show()


def plot_boxplot_grid(df: pd.DataFrame, target: str) -> None:
    """Plot boxplots of multiple columns against a single target variable."""
    # calculate number of rows and columns
    n_cols = int(np.ceil(np.sqrt(len(df.columns) - 1)))
    n_rows = int(np.ceil((len(df.columns) - 1) / n_cols))
    # create figure and axes
    fig, axes = plt.subplots(
        nrows=n_rows, ncols=n_cols, figsize=(n_cols * 6, n_rows * 5)
    )
    # iterate over columns, rows and create boxplots
    for col, ax in zip(df.columns.drop(target), axes.flatten()):
        sns.boxplot(x=target, y=col, data=df, ax=ax)
        # set title to column name vs. target
        ax.set_title(f"{col} vs. {target}")
    plt.show()


def plot_corr_matrix_diff(
    df_one: pd.DataFrame,
    df_two: pd.DataFrame,
    figsize=(10, 5),
    cmap="vlag",
    title="",
    ax=None,
) -> None:
    """Plot heatmap of difference of correlation matrices."""
    # calculate difference of correlation matrices
    corr_diff = df_one.corr() - df_two.corr()
    # create a new figure unless an axes object was passed in
    standalone = ax is None
    if standalone:
        _, ax = plt.subplots(figsize=figsize)
    # draw arrows in cells according to correlation difference?
    sns.heatmap(
        corr_diff,
        annot=True,
        annot_kws={"size": 10},
        cbar=True,
        cmap=cmap,
        fmt=".2f",
        square=True,
        center=0,
        ax=ax,
    )
    if title:
        ax.set_title(title)
    if standalone:
        plt.show()


def plot_pie_chart(df, col="race", title="", ax=None, cmap=hls, explode=.1):
    """Plot pie chart for a given column in a dataframe."""
    explode = [explode] * len(df[col].value_counts())
    df[col].value_counts().plot.pie(
        shadow=True,
        autopct="%1.1f%%",
        startangle=90,
        title=title,
        cmap=cmap,
        ax=ax,
        labeldistance=1.1,
        pctdistance=0.5,
        explode=explode,
    )
    if ax is None:
        plt.show()

### Patient Age Distribution - General Hospital Population

In [None]:
gen_pop_age = sql_to_df("./sql/demographics_age_patients.sql")

In [None]:
fig, axes = plt.subplots(ncols=2, figsize=(18, 6))
fig.suptitle("Age Distribution of Hospital Patients", fontsize=16)
# create color palette
palette = dict(zip(["F", "M"], sns.color_palette()))
# plot histograms
a1 = sns.histplot(
    data=gen_pop_age,
    x="age",
    hue="gender",
    kde=True,
    multiple="layer",
    ax=axes[0],
    palette=palette,
)
a2 = sns.histplot(
    data=gen_pop_age,
    x="age",
    hue="gender",
    multiple="fill",
    ax=axes[1],
    palette=palette,
    stat="percent",
)
# set title, x- and y-labels for subplots
a1.set(xlabel="Age", ylabel="Count", title="Age Distribution")
a2.set(xlabel="Age", ylabel="Percent", title="Gender Distribution")
# add labels to bars
# for container in a1.containers:
# a1.bar_label(container)

### Patient Age Distribution - ICU Population

In [None]:
icu_pop_age = sql_to_df("./sql/demographics_age_icu.sql")

In [None]:
fig, axes = plt.subplots(ncols=2, figsize=(18, 6), sharex=True)
# set title
fig.suptitle("Age Distribution of Patients in ICU", fontsize=16)
# create color palette
palette = dict(zip(["F", "M"], sns.color_palette()))
# plot histograms
a1 = sns.histplot(
    data=icu_pop_age.copy(),
    x="age",
    hue="gender",
    kde=True,
    multiple="layer",
    ax=axes[0],
    palette=palette,
)
a2 = sns.histplot(
    data=icu_pop_age.copy(),
    x="age",
    hue="gender",
    multiple="fill",
    ax=axes[1],
    palette=palette,
    stat="percent",
)
# set title, x- and y-labels for subplots
a1.set(xlabel="Age", ylabel="Count", title="Age Distribution")
a2.set(xlabel="Age", ylabel="Percent", title="Gender Distribution")

### Age Distribution of Patients with Sepsis-3

In [None]:
sepsis_pop_age = sql_to_df("./sql/demographics_age_sepsis.sql")

In [None]:
# create color palette
palette = dict(zip(["F", "M"], sns.color_palette()))
# plot histograms
g = sns.displot(
    sepsis_pop_age,
    x="age",
    col="sepsis",
    hue="gender",
    facet_kws=dict(margin_titles=True),
    binwidth=2,
    height=6,
    kde=True,
    palette=palette,
)
# set title
g.set_titles(col_template="Sepsis={col_name}")
g.fig.subplots_adjust(top=0.9)
g.fig.suptitle("Age Distribution of Patients with and without Sepsis", fontsize=16)

### Ethnicity of Hospital Patients

In [None]:
asian_map = {
    "ASIAN": "Asian",
    "ASIAN - ASIAN INDIAN": "Asian",
    "ASIAN - CHINESE": "Asian",
    "ASIAN - KOREAN": "Asian",
    "ASIAN - SOUTH EAST ASIAN": "Asian",
}
black_map = {
    "BLACK": "Black",
    "BLACK/AFRICAN AMERICAN": "Black",
    "BLACK/AFRICAN": "Black",
    "BLACK/CAPE VERDEAN": "Black",
    "BLACK/CARIBBEAN ISLAND": "Black",
    "BLACK/HAITIAN": "Black",
}
white_map = {
    "WHITE": "White",
    "WHITE - BRAZILIAN": "White",
    "WHITE - EASTERN EUROPEAN": "White",
    "WHITE - OTHER EUROPEAN": "White",
    "WHITE - RUSSIAN": "White",
}
hispanic_map = {
    "HISPANIC": "Hispanic",
    "HISPANIC/LATINO - CENTRAL AMERICAN": "Hispanic",
    "HISPANIC/LATINO - COLOMBIAN": "Hispanic",
    "HISPANIC/LATINO - CUBAN": "Hispanic",
    "HISPANIC/LATINO - DOMINICAN": "Hispanic",
    "HISPANIC/LATINO - GUATEMALAN": "Hispanic",
    "HISPANIC/LATINO - HONDURAN": "Hispanic",
    "HISPANIC/LATINO - MEXICAN": "Hispanic",
    "HISPANIC/LATINO - PUERTO RICAN": "Hispanic",
    "HISPANIC/LATINO - SALVADORAN": "Hispanic",
    "HISPANIC OR LATINO": "Hispanic",
    "SOUTH AMERICAN": "Hispanic",
}
other_map = {
    "OTHER": "Other",
    "PORTUGUESE": "Other",
    "MULTIPLE RACE/ETHNICITY": "Other",
    "AMERICAN INDIAN/ALASKA NATIVE": "Other",
}
unknown_map = {
    "UNABLE TO OBTAIN": "Unknown",
    "UNKNOWN": "Unknown",
    "PATIENT DECLINED TO ANSWER": "Unknown",
}
all_map = asian_map | black_map | white_map | hispanic_map | other_map | unknown_map

In [None]:
# run query
df = sql_to_df("./sql/race.sql")
# apply general ethnicity mapping
df["mapped_race"] = df["race"].map(all_map)

In [None]:
ax = sns.histplot(data=df, x="mapped_race", hue="gender", multiple="dodge")
# add bar labels
for container in ax.containers:
    ax.bar_label(container, rotation=90)
# rotate x-tick labels
plt.xticks(rotation=90)

In [None]:
from matplotlib.gridspec import GridSpec

asian = df[df["race"].isin(asian_map)]
black = df[df["race"].isin(black_map)]
white = df[df["race"].isin(white_map)]
hispanic = df[df["race"].isin(hispanic_map)]
other = df[df["race"].isin(other_map)]
unknown = df[df["race"].isin(unknown_map)]

fig = plt.figure(figsize=(20, 20))
gs = GridSpec(3, 3, figure=fig)

ax1 = fig.add_subplot(gs[0, :])
ax2 = fig.add_subplot(gs[1, 0])
ax3 = fig.add_subplot(gs[1, 1])
ax4 = fig.add_subplot(gs[1, 2])
ax5 = fig.add_subplot(gs[2, 0])
ax6 = fig.add_subplot(gs[2, 1])
ax7 = fig.add_subplot(gs[2, 2])

plot_pie_chart(df, "mapped_race", title="All Patients", ax=ax1)
plot_pie_chart(asian, title="Asian", ax=ax2, explode=0.05)
plot_pie_chart(black, title="Black", ax=ax3, explode=0.05)
plot_pie_chart(white, title="White", ax=ax4, explode=0.05)
plot_pie_chart(hispanic, title="Hispanic", ax=ax5, explode=0.05)
plot_pie_chart(other, title="Other", ax=ax6, explode=0.05)
plot_pie_chart(unknown, title="Unknown", ax=ax7, explode=0.05)

In [None]:
# create color palette
palette = dict(zip(["F", "M"], sns.color_palette()))
# plot histograms
g = sns.displot(
    df,
    x="race",
    col="sepsis",
    hue="gender",
    facet_kws=dict(margin_titles=True),
    binwidth=2,
    height=6,
    kde=True,
    palette=palette,
)
g.set_titles(col_template="Sepsis={col_name}")
g.fig.subplots_adjust(top=0.9)
g.fig.suptitle("Ethnicity", fontsize=16)
# rotate x-tick labels
for ax in g.axes.flat:
    ax.set_xticklabels(ax.get_xticklabels(), rotation=90)

### Time to Infection - How long does it take for a patient to get infected after admission to the ICU?

In [None]:
tti = sql_to_df("./sql/time_to_infection.sql")

In [None]:
graph = sns.boxplot(
    x=tti["time_to_infection"],
)
graph.axvline(8, color="green", linestyle="--")
graph.axvline(24, color="green", linestyle="--")
graph.axvline(72, color="green", linestyle="--")
graph.text(
    9.5, 0.48, "8 hour ICU stay", va="bottom", ha="left", color="green", rotation=90
)
graph.text(
    25.5, 0.48, "24 hour ICU stay", va="bottom", ha="left", color="green", rotation=90
)
graph.text(
    310,
    0.45,
    str(tti["time_to_infection"].describe()),
    va="bottom",
    color="black",
    bbox=dict(facecolor="white", alpha=0.5),
)
plt.title("Sepsis Onset Time")
plt.xlabel("Time to infection (hours)")
plt.show()

In [None]:
sns.histplot(data=tti, x="time_to_infection", stat="percent", bins=100)
sns.rugplot(data=tti, x="time_to_infection", height=-0.02, clip_on=False, lw=1)

#### Notes on Sepsis Onset Time

The MIMIC-IV authors note the following in their sepsis3 concept:
> As many variables used in SOFA are only collected in the ICU, this query can only define sepsis-3 onset within the ICU.
 
This partly explains the distribution of sepsis-3 onset times in the figure above, which is heavily skewed towards the time of ICU admission. Patients who already have sepsis on arrival are only identified by the sepsis-3 criteria once the SOFA-relevant variables have been collected, which usually happens within the first few hours of the ICU stay. Consequently, most patients who meet the sepsis-3 criteria in the ICU appear to do so within the first 24 hours: 31,408 patients have their sepsis onset in the first 24 hours of their ICU stay, and the mean onset time is 6.9 h. For 39 patients, the recorded onset time lies well before ICU admission, probably because their SOFA score variables had already been collected in the ED or on the ward.
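
The figures quoted above can be derived from the query result with a few lines of pandas. The following is a minimal sketch on a small made-up series, not on the real `tti` dataframe: the column name `time_to_infection` is taken from the plots above, the values are purely illustrative, and counting the "first 24 hours" as the closed interval from 0 to 24 hours is an assumption.

```python
import pandas as pd

# Made-up onset times in hours relative to ICU admission
# (negative values mean the onset was recorded before admission).
tti_example = pd.DataFrame(
    {"time_to_infection": [-3.0, 0.5, 2.0, 6.0, 12.0, 30.0, 96.0]}
)
onset = tti_example["time_to_infection"]

n_before_admission = int((onset < 0).sum())  # onset recorded before ICU admission
n_first_24h = int(onset.between(0, 24).sum())  # onset within 0..24 h (assumed inclusive)
mean_onset = float(onset.mean())  # mean onset time in hours
median_onset = float(onset.median())  # less sensitive to the long right tail

print(n_before_admission, n_first_24h, round(mean_onset, 2), median_onset)
```

Reporting the median next to the mean is a design choice for this sketch: with a long right tail, a few late onsets can pull the mean well away from the typical onset time.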

### Blood Gas Analysis - First 24 Hours in the ICU



In [None]:
bg_24h = sql_to_df("./sql/blood_gas_analysis_corr.sql")

In [None]:
bg_mean_std_24h = bg_24h.filter(regex="(_mean|_std|sepsis)$")
bg_max_min_24h = bg_24h.filter(regex="(_max|_min|sepsis)$")

# plot correlation matrices side by side
# fig, axes = plt.subplots(ncols=2, figsize=(25, 10))
plot_corr_matrix(
    bg_max_min_24h,
    title="Blood Gas Analysis Correlation Matrix - Max/Min [24h ICU]",
    figsize=(25, 20),
    linewidth=0,
    # ax=axes[0],
    cbar=True,
    cbar_kws={"shrink": 0.8},
)
plot_corr_matrix(
    bg_mean_std_24h,
    title="Blood Gas Analysis Correlation Matrix - Mean/Std [24h ICU]",
    figsize=(25, 20),
    linewidth=0,
    # ax=axes[1],
    cbar=True,
    cbar_kws={"shrink": 0.8},
)

### Pre Septic Blood Gas Analysis - 8 Hours before Sepsis Onset

For patients without sepsis, the measurement interval runs from ICU admission to 8 hours after ICU admission.
Whether this is a valid comparison window for the pre-septic interval remains an open question.
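
The window definition described above can be sketched in plain Python. This is an illustration of the rule only, not the logic of `pre_septic_bg_8h.sql`; the function name `measurement_window` and its parameters are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional, Tuple


def measurement_window(
    icu_intime: datetime,
    sepsis_onset: Optional[datetime],
    hours: int = 8,
) -> Tuple[datetime, datetime]:
    """Return (start, end) of the measurement interval for one ICU stay."""
    span = timedelta(hours=hours)
    if sepsis_onset is not None:
        # septic stay: the window covers the hours leading up to sepsis onset
        return sepsis_onset - span, sepsis_onset
    # non-septic stay: the window starts at ICU admission
    return icu_intime, icu_intime + span


admit = datetime(2020, 1, 1, 10, 0)
onset = datetime(2020, 1, 2, 4, 0)
print(measurement_window(admit, onset))  # septic stay
print(measurement_window(admit, None))  # non-septic stay
```

Note that the two windows sit at different points of the stay: the non-septic window always covers the first hours after admission, while the septic window can lie anywhere in the stay.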

In [None]:
bg_pre_septic_8h = sql_to_df("./sql/pre_septic_bg_8h.sql")

In [None]:
bg_mean_std_pre_septic_8h = bg_pre_septic_8h.filter(regex="(_mean|_std|sepsis)$")
bg_max_min_pre_septic_8h = bg_pre_septic_8h.filter(regex="(_max|_min|sepsis)$")

# plot correlation matrices side by side
# fig, axes = plt.subplots(ncols=2, figsize=(25, 10))
plot_corr_matrix(
    bg_max_min_pre_septic_8h,
    title="Blood Gas Analysis Correlation Matrix - Max/Min [8h Pre-Septic]",
    figsize=(25, 20),
    linewidth=0,
    # ax=axes[0],
    cbar=True,
    cbar_kws={"shrink": 0.8},
)
plot_corr_matrix(
    bg_mean_std_pre_septic_8h,
    title="Blood Gas Analysis Correlation Matrix - Mean/Std [8h Pre-Septic]",
    figsize=(25, 20),
    linewidth=0,
    # ax=axes[1],
    cbar=True,
    cbar_kws={"shrink": 0.8},
)

plot_corr_matrix_diff(bg_max_min_24h, bg_max_min_pre_septic_8h, figsize=(25, 20), title="Blood Gas Analysis Correlation Difference - Max/Min [24h ICU - 8h Pre-Septic]")
plot_corr_matrix_diff(bg_mean_std_24h, bg_mean_std_pre_septic_8h, figsize=(25, 20), title="Blood Gas Analysis Correlation Difference - Mean/Std [24h ICU - 8h Pre-Septic]")

In [None]:
bg_pre_septic_2h = sql_to_df("./sql/pre_septic_bg_2h.sql")
bg_pre_septic_4h = sql_to_df("./sql/pre_septic_bg_4h.sql")

In [None]:
# create a dataframe that records the correlation between each blood gas component and sepsis
# use only the last row of the correlation matrix (assumes "sepsis" is the last column)
# .copy() guards against a SettingWithCopyWarning when the column is added below
bg_corr_2h = bg_pre_septic_2h.corr().iloc[-1:].copy()
bg_corr_4h = bg_pre_septic_4h.corr().iloc[-1:].copy()
bg_corr_8h = bg_pre_septic_8h.corr().iloc[-1:].copy()

# add a column to record the time before sepsis onset
bg_corr_2h["h_before_onset"] = -2
bg_corr_4h["h_before_onset"] = -4
bg_corr_8h["h_before_onset"] = -8
# combine the dataframes into one, row-wise
combined = pd.concat([bg_corr_2h, bg_corr_4h, bg_corr_8h])
# reformat the dataframe such that it has 3 columns: vital sign, correlation, time before sepsis onset
# combined = combined.reset_index()
combined = combined.melt(
    id_vars=["sepsis", "h_before_onset"], value_vars=combined.columns[:-2]
)
# plot as facetgrid
g = sns.relplot(
    data=combined,
    y="value",
    x="h_before_onset",
    col="variable",
    hue="sepsis",
    kind="line",
    col_wrap=11,
    height=2.5,
    facet_kws=dict(margin_titles=True),
    marker="o",
    estimator="mean"
)
g.fig.tight_layout()
g.fig.suptitle("Correlation between Blood Gas Components and Sepsis Onset Time")

### Vital Signs - 24 hours after ICU admission

Technically, the window is -6 to +24 hours relative to ICU admission. The first 6 hours of this window are included because they are used to calculate the SOFA score.

In [None]:
vs_24h = sql_to_df("./sql/vital_signs_corr.sql")

In [None]:
plot_corr_matrix(
    vs_24h, "Vital Signs Correlation Matrix", figsize=(25, 15), linewidth=0
)

In [None]:
vs_mean_std_24h = vs_24h.filter(regex="(_mean|_std|sepsis)$")
vs_max_min_24h = vs_24h.filter(regex="(_max|_min|sepsis)$")

# plot correlation matrices side by side
plot_corr_matrix(
    vs_max_min_24h,
    title="Vital Signs Correlation Matrix - Max/Min [24h ICU]",
    figsize=(15, 10),
    linewidth=0,
    # ax=axes[0],
    cbar=True,
    cbar_kws={"shrink": 0.8},
)
plot_corr_matrix(
    vs_mean_std_24h,
    title="Vital Signs Correlation Matrix - Mean/Std [24h ICU]",
    figsize=(15, 10),
    linewidth=0,
    # ax=axes[1],
    cbar=True,
    cbar_kws={"shrink": 0.8},
)

### Vital Signs - 8/4/2 hours before Sepsis Onset

In [None]:
vs_pre_septic_8h = sql_to_df("./sql/pre_septic_vitalsign_8h.sql")

In [None]:
vs_mean_std_pre_septic_8h = vs_pre_septic_8h.filter(regex="(_mean|_std|sepsis)$")
vs_max_min_pre_septic_8h = vs_pre_septic_8h.filter(regex="(_max|_min|sepsis)$")

# plot correlation matrices side by side
plot_corr_matrix(
    vs_max_min_pre_septic_8h,
    title="Vital Signs Correlation Matrix - Max/Min [8h Pre-Septic]",
    figsize=(15, 10),
    linewidth=0,
    # ax=axes[0],
    cbar=True,
    cbar_kws={"shrink": 0.8},
)
plot_corr_matrix(
    vs_mean_std_pre_septic_8h,
    title="Vital Signs Correlation Matrix - Mean/Std [8h Pre-Septic]",
    figsize=(15, 10),
    linewidth=0,
    # ax=axes[1],
    cbar=True,
    cbar_kws={"shrink": 0.8},
)

plot_corr_matrix_diff(vs_max_min_24h, vs_max_min_pre_septic_8h, figsize=(15, 10), title="Vital Sign Correlation Difference - Max/Min [24h ICU - 8h Pre-Septic]")
plot_corr_matrix_diff(vs_mean_std_24h, vs_mean_std_pre_septic_8h, figsize=(15, 10), title="Vital Sign Correlation Difference - Mean/Std [24h ICU - 8h Pre-Septic]")

In [None]:
# filter columns by mean, std, max, min, sepsis
vs_pre_septic_8h_mean_std = vs_pre_septic_8h.filter(regex="(_mean|_std|sepsis)$")
vs_pre_septic_8h_min_max = vs_pre_septic_8h.filter(regex="(_max|_min|sepsis)$")

# plot correlation matrices side by side
fig, axes = plt.subplots(ncols=2, figsize=(25, 10))
plot_corr_matrix(
    vs_pre_septic_8h_min_max,
    "Vital Signs - Correlation Matrix",
    figsize=(15, 10),
    linewidth=0,
    ax=axes[0],
    cbar=False,
    cbar_kws={"shrink": 0.8},
)
plot_corr_matrix(
    vs_pre_septic_8h_mean_std,
    "Vital Signs - Correlation Matrix",
    figsize=(15, 10),
    linewidth=0,
    ax=axes[1],
    cbar=True,
    cbar_kws={"shrink": 0.8},
)

In [None]:
# NOTE: broken plot, NaN values are weird in boxplots
vs_pre_septic_8h_mean_std = vs_pre_septic_8h.filter(regex="(_mean|_std|sepsis)$")
vs_pre_septic_8h_min_max = vs_pre_septic_8h.filter(regex="(_max|_min|sepsis)$")

# melt dataframes to have one column for all vital signs and one column for the values and one column for sepsis
vs_pre_septic_8h_mean_std = vs_pre_septic_8h_mean_std.melt(id_vars=["sepsis"], value_vars=vs_pre_septic_8h_mean_std.columns[:-1])
vs_pre_septic_8h_min_max = vs_pre_septic_8h_min_max.melt(id_vars=["sepsis"], value_vars=vs_pre_septic_8h_min_max.columns[:-1])

# NOTE: long form dataframes are the way to go!
sns.catplot(data=vs_pre_septic_8h_min_max, x="sepsis", y="value", col="variable", kind="box", col_wrap=5)

# plot_boxplot_grid(vs_mean_std_ps, "sepsis")
# plot_boxplot_grid(vs_max_min_ps, "sepsis")

### Pre Septic Vital Signs - 8/4/2 Hours before Sepsis Onset

In [None]:
plot_corr_matrix(
    vs_pre_septic_8h,
    "Vital Signs Correlation Matrix - 8h before Sepsis Onset",
    figsize=(20, 15),
    linewidth=0,
)

In [None]:
vs_pre_septic_4h = sql_to_df("./sql/pre_septic_vitalsign_4h.sql")

In [None]:
plot_corr_matrix(vs_pre_septic_4h, "Vital Signs Correlation Matrix - 4h before Sepsis Onset", figsize=(25, 15))

In [None]:
vs_pre_septic_2h = sql_to_df("./sql/pre_septic_vitalsign_2h.sql")

In [None]:
plot_corr_matrix(vs_pre_septic_2h, "Vital Signs Correlation Matrix - 2h before Sepsis Onset", figsize=(25, 15))

In [None]:
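# create a dataframe that records the correlation between each vital sign and sepsis
# use only the last row of the correlation matrix (assumes "sepsis" is the last column)
# .copy() guards against a SettingWithCopyWarning when the column is added below
vs_corr_2h = vs_pre_septic_2h.corr().iloc[-1:].copy()
vs_corr_4h = vs_pre_septic_4h.corr().iloc[-1:].copy()
vs_corr_8h = vs_pre_septic_8h.corr().iloc[-1:].copy()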

# add a column to record the time before sepsis onset
vs_corr_2h["h_before_onset"] = -2
vs_corr_4h["h_before_onset"] = -4
vs_corr_8h["h_before_onset"] = -8
# combine the dataframes into one, row-wise
combined = pd.concat([vs_corr_2h, vs_corr_4h, vs_corr_8h])
# reformat the dataframe such that it has 3 columns: vital sign, correlation, time before sepsis onset
# combined = combined.reset_index()
combined = combined.melt(
    id_vars=["sepsis", "h_before_onset"], value_vars=combined.columns[:-2]
)
# plot as facetgrid
g = sns.relplot(
    data=combined,
    y="value",
    x="h_before_onset",
    col="variable",
    hue="sepsis",
    kind="line",
    col_wrap=10,
    height=2.5,
    facet_kws=dict(margin_titles=True),
    marker="o",
    estimator="mean"
)
g.fig.tight_layout()
g.fig.suptitle("Correlation between Vital Sign Measurements and Sepsis Onset")

### Dialysis and Urine Output

In [None]:
df_dia = sql_to_df("./sql/dialysis_urine_corr.sql")

In [None]:
plot_corr_matrix(
    df_dia,
    "Dialysis and Urine Output - Correlation Matrix",
    figsize=(10, 5),
    linewidth=0,
)

### Lab Results

In [None]:
lab_results = sql_to_df("./sql/lab_results_corr.sql")

In [None]:
plot_corr_matrix(lab_results, "Lab Results - Correlation Matrix", figsize=(40, 30), linewidth=0)

In [None]:
plot_corr_matrix(df, "Correlation Matrix")

### Glasgow Coma Scale (GCS)

The Glasgow Coma Scale (GCS) is a neurological scale which aims to give a reliable and objective way of recording the conscious state of a person for initial as well as subsequent assessment. A patient is assessed against the criteria of the scale, and the resulting points give a patient score between 3 (indicating deep unconsciousness) and either 14 (original scale) or 15 (more widely used modified or revised scale). The coma scale has three components: eye response (up to 4 points), verbal response (up to 5 points) and motor response (up to 6 points), which are summed to give the final score. The three values separately as well as their sum are considered in the evaluation of the patient's condition.
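
The scoring rule described above can be written down in a few lines. This is a minimal sketch of the revised 15-point scale; the function name `gcs_total` is hypothetical and not part of the MIMIC-IV concepts.

```python
def gcs_total(eye: int, verbal: int, motor: int) -> int:
    """Sum the three GCS components after checking their valid ranges."""
    # each component starts at 1; the upper bounds are 4, 5 and 6
    limits = {"eye": (eye, 4), "verbal": (verbal, 5), "motor": (motor, 6)}
    for name, (score, upper) in limits.items():
        if not 1 <= score <= upper:
            raise ValueError(
                f"{name} response must be between 1 and {upper}, got {score}"
            )
    return eye + verbal + motor


print(gcs_total(eye=4, verbal=5, motor=6))  # fully conscious patient
print(gcs_total(eye=1, verbal=1, motor=1))  # deep unconsciousness
```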

The MIMIC-IV concept table `first_day_gcs` contains the first-day GCS score for patients in the ICU. Sedated patients are assigned a GCS score of 15. The authors of the concept recommend using the last GCS score recorded before sedation for these patients (compare with the SAPS II score). SAPS II is a severity score for ICU patients that is calculated from the worst values of 12 physiological variables during the first 24 hours after admission; the GCS score is one of these 12 variables. Patients who are ventilated through a tracheal tube are marked in the `gcs_unable` column.
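
The recommendation to use the last GCS score recorded before sedation could look roughly like the sketch below. The data layout (a list of `(hours_since_icu_admission, gcs_total)` pairs and a single sedation start time) and the function name are assumptions made for illustration; they do not reflect the structure of `first_day_gcs`.

```python
from typing import List, Optional, Tuple


def last_gcs_before_sedation(
    gcs_records: List[Tuple[float, int]],
    sedation_start: Optional[float],
) -> Optional[int]:
    """Return the most recent GCS total charted before sedation began.

    If the patient was never sedated (sedation_start is None),
    the most recent record overall is returned.
    """
    eligible = [
        (t, score)
        for t, score in gcs_records
        if sedation_start is None or t < sedation_start
    ]
    if not eligible:
        return None  # nothing was charted before sedation
    # max() compares the time first, so it picks the latest record
    return max(eligible)[1]


records = [(1.0, 14), (3.0, 13), (5.0, 15)]  # made-up values
print(last_gcs_before_sedation(records, sedation_start=4.0))
```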

In [None]:
df = sql_to_df("./sql/gcs.sql")

In [None]:
plot_corr_matrix(df, "Correlation Matrix - Glasgow Coma Scale")

#### Notes on the GCS Heatmap
The only positive correlation is between the `gcs_unable` column and the sepsis target variable. This is somewhat expected, since patients marked as unable are those who are ventilated through a tracheal tube. Ventilation is probably an indicator of the severity of the patient's condition, as it is (probably) only necessary in more severe cases. The other correlations are negative because a lower GCS score indicates a more severe condition. Among the GCS subscores, the verbal subscore has the highest correlation with the sepsis target variable.

Note that the query is skewed by design, since sedated patients receive a GCS of 15. This should be seen as a limitation of this specific query. The heatmap is still useful to get a general idea of the correlations between the GCS subscores and the sepsis target variable.

In [None]:
# show distribution of gcs values
sns.displot(df, x="gcs_total", kind="kde", hue="sepsis", fill=True)

#### Notes on the Distribution of the GCS Score
The distribution of the GCS score is skewed to the left (negatively skewed): the majority of patients have the maximum score of 15, with a tail towards lower scores. Sepsis patients have a higher density of lower GCS scores than non-sepsis patients, but the overall shape of the distribution is similar for both groups. This suggests that the distribution of the GCS score alone may not be a good indicator for sepsis.
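
The direction of the skew can be checked numerically with the sample skewness: a negative value means the longer tail points towards lower scores, a positive value means it points towards higher scores. The sketch below uses the Fisher-Pearson coefficient on made-up scores that merely mimic a pile-up at 15; it is not computed from the query result.

```python
from statistics import fmean


def sample_skewness(values):
    """Fisher-Pearson coefficient of skewness: m3 / m2**1.5."""
    mean = fmean(values)
    m2 = fmean([(v - mean) ** 2 for v in values])  # second central moment
    m3 = fmean([(v - mean) ** 3 for v in values])  # third central moment
    return m3 / m2 ** 1.5  # undefined (division by zero) if all values are equal


# Made-up GCS totals: most patients at the maximum, a few much lower.
scores = [15] * 12 + [14] * 3 + [10, 8, 6, 3]
print(sample_skewness(scores))
```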

### Ventilation

In [None]:
df_ventilation = sql_to_df("./sql/ventilation_corr.sql")

In [None]:
plot_corr_matrix(df_ventilation, "Correlation Matrix - Ventilation")

#### Notes on the Ventilation Heatmap
The GCS heatmap shows a positive correlation between the `gcs_unable` column and the sepsis target variable. This heatmap shows a similar relation between the `non_invasive_vent` and `invasive_vent` columns and the sepsis target variable. This supports the hypothesis that the `gcs_unable` correlation reflects the link between ventilation and sepsis rather than the GCS score itself.
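
One way to probe this further would be to correlate `gcs_unable` with `invasive_vent` directly. That would require joining the two query results per ICU stay, which this notebook does not do. For two binary flags, the Pearson correlation reduces to the phi coefficient, sketched below on made-up values; the function name `phi_coefficient` is hypothetical.

```python
from math import sqrt


def phi_coefficient(a, b):
    """Pearson correlation of two equally long 0/1 sequences (phi coefficient)."""
    pairs = list(zip(a, b))
    n11 = sum(1 for x, y in pairs if x == 1 and y == 1)
    n10 = sum(1 for x, y in pairs if x == 1 and y == 0)
    n01 = sum(1 for x, y in pairs if x == 0 and y == 1)
    n00 = sum(1 for x, y in pairs if x == 0 and y == 0)
    # product of the four marginal totals of the 2x2 table
    denom = sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    if denom == 0:
        raise ValueError("phi is undefined when one flag is constant")
    return (n11 * n00 - n10 * n01) / denom


# Made-up flags for eight ICU stays.
gcs_unable = [1, 1, 1, 0, 0, 0, 0, 1]
invasive_vent = [1, 1, 1, 0, 0, 0, 1, 0]
print(phi_coefficient(gcs_unable, invasive_vent))
```

A value close to 1 on the real data would be consistent with `gcs_unable` acting mostly as a proxy for invasive ventilation.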

We should note the following limitations for the query used to generate this heatmap:
- The query includes the whole ICU stay of the patient. This means that the patient may have been ventilated at some point during their ICU stay, but not necessarily during the first 24 hours.
- As the query contains the whole ICU stay, the patient may have been non-invasively ventilated at some point and then invasively ventilated at a later point. 
- The sepsis onset time is not considered in the query. This means that the query does not consider the ventilation status of the patient at the time of sepsis onset. This may be a useful insight to have, as it may be an indicator for sepsis onset.
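
The last limitation could be addressed by looking up the ventilation status at the time of sepsis onset. The following is a minimal sketch under stated assumptions: ventilation episodes are given as half-open intervals in hours relative to ICU admission, the episodes of one stay do not overlap, and the function name `ventilation_at` is hypothetical. The labels reuse the column names `non_invasive_vent` and `invasive_vent` from the heatmap.

```python
from typing import List, Optional, Tuple

# (start_hour, end_hour, ventilation_type), hours relative to ICU admission
VentInterval = Tuple[float, float, str]


def ventilation_at(onset_hour: float, intervals: List[VentInterval]) -> Optional[str]:
    """Return the ventilation type active at onset_hour, or None if unventilated."""
    for start, end, vent_type in intervals:
        if start <= onset_hour < end:  # half-open interval [start, end)
            return vent_type
    return None


# Made-up stay: non-invasive ventilation first, invasive ventilation later.
stay = [(2.0, 10.0, "non_invasive_vent"), (10.0, 40.0, "invasive_vent")]
print(ventilation_at(6.0, stay))
print(ventilation_at(1.0, stay))
```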

### Patient Age Distribution

In [None]:
all_ages = sql_to_df("./sql/all_ages.sql")

In [None]:
# sql_to_df names columns after the query result, so select them by position
plt.plot(all_ages.iloc[:, 0], all_ages.iloc[:, 1])
plt.title('Age of all patients')
plt.xlabel('Age')
plt.ylabel('Count')
plt.show()

In [None]:
icu_ages = sql_to_df("./sql/all_icu_ages.sql")

In [None]:
# sql_to_df names columns after the query result, so select them by position
plt.plot(icu_ages.iloc[:, 0], icu_ages.iloc[:, 1])
plt.title('Age of all ICU patients')
plt.xlabel('Age')
plt.ylabel('Count')
plt.show()