# In-hospital Mortality Prediction

This notebook demonstrates in-hospital mortality prediction for heart failure patients on a subset of the MIMIC-III dataset using Cyclops.

## Import Libraries

In [None]:
import plotly.express as px
import plotly.graph_objects as go
from datasets import Dataset
from datasets.features import ClassLabel
from kaggle.api.kaggle_api_extended import KaggleApi
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

from cyclops.data.slicer import SliceSpec
from cyclops.evaluate.fairness import FairnessConfig  # noqa: E402
from cyclops.evaluate.metrics import MetricCollection, create_metric
from cyclops.models.catalog import create_model
from cyclops.process.feature.feature import TabularFeatures
from cyclops.tasks.mortality_prediction import MortalityPredictionTask
from cyclops.utils.file import join, load_dataframe

## Constants

In [None]:
DATA_DIR = "./data"
RANDOM_SEED = 85
NAN_THRESHOLD = 0.75
TRAIN_SIZE = 0.8

## Data Loading

Before starting, install the Kaggle API by running `pip install kaggle`. To use it, sign up for a Kaggle account at https://www.kaggle.com, go to the 'Account' tab of your user profile (https://www.kaggle.com/<username>/account), and select 'Create API Token'. This triggers the download of `kaggle.json`, a file containing your API credentials. Place this file at `~/.kaggle/kaggle.json` on your machine.
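
Before authenticating, it can help to confirm that the credentials file is where the Kaggle API looks for it. The following is a minimal sketch using only the standard library; the helper names `kaggle_credentials_path` and `credentials_present` are hypothetical and are not part of the Kaggle package.

```python
import os
from pathlib import Path


def kaggle_credentials_path(home: Path) -> Path:
    """Return the conventional kaggle.json location under `home` (hypothetical helper)."""
    return home / ".kaggle" / "kaggle.json"


def credentials_present(home: Path) -> bool:
    """Return True when kaggle.json exists as a regular file under `home`."""
    return kaggle_credentials_path(home).is_file()


# Resolve "~" for the current user and report whether the token file was found.
user_home = Path(os.path.expanduser("~"))
print(f"kaggle.json found: {credentials_present(user_home)}")
```

If this prints `False`, `api.authenticate()` is unlikely to find the token at the default location.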

In [None]:
api = KaggleApi()
api.authenticate()
api.dataset_download_files(
    "saurabhshahane/in-hospital-mortality-prediction", path=DATA_DIR, unzip=True
)

In [None]:
df = load_dataframe(join(DATA_DIR, "data01.csv"), file_format="csv")
df

## Data Inspection and Preprocessing

#### Drop NaNs based on the `NAN_THRESHOLD`

In [None]:
null_counts = df.isnull().sum()
null_counts = null_counts[null_counts > 0]
fig = go.Figure(data=[go.Bar(x=null_counts.index, y=null_counts.values)])

fig.update_layout(
    title="Number of Null Values per Column",
    xaxis_title="Columns",
    yaxis_title="Number of Null Values",
    height=600,
)

fig.show()

In [None]:
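# Keep only columns in which at least a NAN_THRESHOLD fraction of rows is non-null.
thresh_nan = int(NAN_THRESHOLD * len(df))
df = df.dropna(axis=1, thresh=thresh_nan)
# Drop rows whose target label is missing.
df = df.dropna(axis=0, subset=["outcome"])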
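
The `thresh` argument of `dropna` is a minimum count of non-null values, not a maximum count of nulls, which is easy to misread. The sketch below reproduces that rule in plain Python on a toy table; the column values are made up for illustration and `non_null_count` is a hypothetical helper.

```python
# Toy table: column name -> values, with None marking a missing entry.
table = {
    "age": [72, 65, 80, 58],
    "PH": [7.3, None, None, None],
    "BMI": [30.1, None, 27.4, 25.0],
}

NAN_THRESHOLD = 0.75
n_rows = 4
# Same arithmetic as in the notebook: minimum number of non-null values required.
min_non_null = int(NAN_THRESHOLD * n_rows)


def non_null_count(values):
    """Count the entries that are not missing."""
    return sum(value is not None for value in values)


# A column survives when it has at least `min_non_null` observed values.
kept = [name for name, values in table.items() if non_null_count(values) >= min_non_null]
print(min_non_null)  # 3
print(kept)  # ['age', 'BMI']
```

With `NAN_THRESHOLD = 0.75`, a column is therefore dropped once more than about a quarter of its values are missing.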

#### Gender values

In [None]:
df["gendera"] = df["gendera"].replace({1: 0, 2: 1})
fig = px.pie(df, names="gendera")

fig.update_layout(
    title="Gender Distribution",
)

fig.show()

#### Age distribution

In [None]:
fig = px.histogram(df, x="age")
fig.update_layout(
    title="Age Distribution",
    xaxis_title="Age",
    yaxis_title="Count",
    bargap=0.2,
)

fig.show()

#### Outcome distribution

In [None]:
df["outcome"] = df["outcome"].astype("int")

In [None]:
fig = px.pie(df, names="outcome")
fig.update_traces(textinfo="percent+label")
fig.update_layout(title_text="Outcome Distribution")
fig.update_traces(
    hovertemplate="Outcome: %{label}<br>Count: %{value}<br>Percent: %{percent}"
)
fig.show()

In [None]:
class_counts = df["outcome"].value_counts()
class_ratio = class_counts[0] / class_counts[1]
class_ratio

From all the features in the dataset, we select the 20 that [Li et al.](https://pubmed.ncbi.nlm.nih.gov/34301649/) reported to be the most important for this classification task.

In [None]:
features_list = [
    "Anion gap",
    "Lactic acid",
    "Blood calcium",
    "Lymphocyte",
    "Leucocyte",
    "heart rate",
    "Blood sodium",
    "Urine output",
    "Platelets",
    "Urea nitrogen",
    "age",
    "MCH",
    "RBC",
    "Creatine kinase",
    "PCO2",
    "Blood potassium",
    "Diastolic blood pressure",
    "Respiratory rate",
    "Renal failure",
    "NT-proBNP",
]
features_list = sorted(features_list)

#### Identifying feature types

The Cyclops `TabularFeatures` class helps identify feature types, an essential step before preprocessing the data. Knowing each feature's type (numerical, categorical, or binary) allows us to apply the appropriate preprocessing steps.

In [None]:
tab_features = TabularFeatures(
    data=df.reset_index(),
    features=features_list,
    by="ID",
    targets="outcome",
)
tab_features.types

#### Creating data preprocessors

We create a data preprocessor using sklearn's ColumnTransformer. This helps in applying different preprocessing steps to different columns in the dataframe. For instance, binary features might be processed differently from numeric features.

In [None]:
numeric_transformer = Pipeline(
    steps=[("imputer", SimpleImputer(strategy="mean")), ("scaler", MinMaxScaler())]
)

binary_transformer = Pipeline(
    steps=[("imputer", SimpleImputer(strategy="most_frequent"))]
)

In [None]:
numeric_features = sorted(tab_features.features_by_type("numeric"))
numeric_indices = [
    df[features_list].columns.get_loc(column) for column in numeric_features
]
numeric_features

In [None]:
binary_features = sorted(tab_features.features_by_type("binary"))
binary_features.remove("outcome")
binary_indices = [
    df[features_list].columns.get_loc(column) for column in binary_features
]
binary_features

In [None]:
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_indices),
        ("bin", binary_transformer, binary_indices),
    ],
    remainder="passthrough",
)
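
To make the numeric pipeline concrete, here is a plain-Python sketch of what mean imputation followed by min-max scaling computes for a single column. It is an illustration, not scikit-learn's implementation: the function names and the toy heart-rate values are assumptions, and the statistics are computed on the same column they are applied to, whereas the fitted transformers reuse training-set statistics at prediction time.

```python
def mean_impute(values):
    """Replace missing entries (None) with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    low, high = min(values), max(values)
    if high == low:
        # A constant column carries no spread; map every entry to 0.
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


heart_rate = [60.0, None, 100.0, 80.0]  # toy values, one missing
imputed = mean_impute(heart_rate)
scaled = min_max_scale(imputed)
print(imputed)  # [60.0, 80.0, 100.0, 80.0]
print(scaled)  # [0.0, 0.5, 1.0, 0.5]
```

Binary features skip the scaling step because they already lie in {0, 1}; only their missing values are filled, using the most frequent value.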

## Creating Hugging Face Dataset

We convert our processed Pandas dataframe into a Hugging Face dataset, a powerful and easy-to-use data format that is also compatible with Cyclops models and evaluator modules. The dataset is then split into training and test sets.

In [None]:
dataset = Dataset.from_pandas(df)
dataset.cleanup_cache_files()
dataset

In [None]:
dataset = dataset.cast_column("outcome", ClassLabel(num_classes=2))
dataset = dataset.train_test_split(
    train_size=TRAIN_SIZE, stratify_by_column="outcome", seed=RANDOM_SEED
)
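
Stratifying on `outcome` keeps the class proportions roughly the same in both splits, which matters for an imbalanced target. The sketch below shows the idea in plain Python; `stratified_split` is a hypothetical helper written for illustration and is not how the `datasets` library implements `train_test_split`.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_size, seed):
    """Split row indices so each class contributes `train_size` of its rows to train."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # shuffle within each class before cutting
        cut = round(train_size * len(indices))
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return sorted(train), sorted(test)


labels = [0] * 80 + [1] * 20  # toy labels with a 4:1 class ratio
train_idx, test_idx = stratified_split(labels, train_size=0.8, seed=85)

print(len(train_idx), len(test_idx))  # 80 20
print(sum(labels[i] for i in train_idx))  # 16 positives in train
print(sum(labels[i] for i in test_idx))  # 4 positives in test
```

Both splits keep the 4:1 ratio of the full toy label list, which a purely random split does not guarantee.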

## Model Creation

The Cyclops model registry allows for straightforward creation and selection of models. It maintains a list of pre-configured models, each of which can be instantiated with a single line of code. Here we use an SGD classifier to fit a logistic regression model. Model configurations can be passed to `create_model` using the scikit-learn parameters for `SGDClassifier`.

In [None]:
model_name = "sgd_classifier"
model = create_model(model_name, random_state=123, verbose=0, class_weight="balanced")
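
The `class_weight="balanced"` setting compensates for the class imbalance seen in the outcome distribution. In scikit-learn, balanced weights are computed as `n_samples / (n_classes * count_of_class)`. The sketch below applies that formula in plain Python to toy labels; `balanced_class_weights` is a hypothetical helper, not a library function.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


labels = [0] * 8 + [1] * 2  # toy labels: 8 survivors, 2 deaths
weights = balanced_class_weights(labels)
print(weights)  # {0: 0.625, 1: 2.5}
```

The minority class receives the larger weight, so errors on it contribute more to the training loss.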

## Task Creation

We use Cyclops tasks to define our model's task (in this case, MortalityPrediction), train the model, make predictions, and evaluate performance. Cyclops task classes encapsulate the entire ML pipeline into a single, cohesive structure, making the process smooth and easy to manage.

In [None]:
mortality_task = MortalityPredictionTask(
    {model_name: model}, task_features=features_list, task_target="outcome"
)

In [None]:
mortality_task.list_models()

## Training

If `best_model_params` is passed to the `train` method, a hyperparameter search is run and the best model is selected. The entries in `best_model_params` give the candidate values used to build the parameter grid.

Note that the data preprocessor needs to be passed to the task's methods if the Hugging Face dataset is not already preprocessed.

In [None]:
best_model_params = {
    "alpha": [0.0001, 0.001, 0.01, 0.1, 1, 10, 100],
    "learning_rate": ["constant", "optimal", "invscaling", "adaptive"],
    "eta0": [0.1, 0.01, 0.001, 0.0001],
    "metric": "roc_auc",
    "method": "grid",
}

mortality_task.train(
    dataset["train"],
    model_name=model_name,
    transforms=preprocessor,
    best_model_params=best_model_params,
)
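
To get a feel for the cost of the search, the sketch below enumerates the candidate configurations. It assumes that `"metric"` and `"method"` are control keys rather than hyperparameters and that the `"grid"` method evaluates the full Cartesian product of the remaining lists; neither assumption is confirmed here against the Cyclops source.

```python
from itertools import product

# Hyperparameter lists copied from `best_model_params`, without the control keys.
search_space = {
    "alpha": [0.0001, 0.001, 0.01, 0.1, 1, 10, 100],
    "learning_rate": ["constant", "optimal", "invscaling", "adaptive"],
    "eta0": [0.1, 0.01, 0.001, 0.0001],
}

names = list(search_space)
# One dict per combination of values, in the order the lists are written.
candidates = [dict(zip(names, combo)) for combo in product(*search_space.values())]

print(len(candidates))  # 112
print(candidates[0])  # {'alpha': 0.0001, 'learning_rate': 'constant', 'eta0': 0.1}
```

Under these assumptions the search fits 7 x 4 x 4 = 112 configurations, each possibly several times if cross-validation is used.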

## Prediction

The prediction output can be either the whole Hugging Face dataset with the prediction columns added to it or the single column containing the predicted values.

In [None]:
y_pred = mortality_task.predict(
    dataset["test"],
    model_name=model_name,
    transforms=preprocessor,
    proba=False,
    only_predictions=True,
)
len(y_pred)

## Evaluation

Evaluation uses several metrics that give different perspectives on the model's predictive abilities, namely standard performance metrics and fairness metrics.

The standard performance metrics can be created using the `MetricCollection` object.

In [None]:
metric_names = ["accuracy", "precision", "recall", "f1_score", "auroc", "roc_curve"]
metrics = [create_metric(metric_name, task="binary") for metric_name in metric_names]
metric_collection = MetricCollection(metrics)

In addition to overall metrics, it is useful to see how the model performs on specific subpopulations. We can define these subpopulations using `SliceSpec` objects.

In [None]:
spec_list = [
    {
        "age": {
            "min_value": 30,
            "max_value": 50,
            "min_inclusive": True,
            "max_inclusive": False,
        }
    },
    {
        "age": {
            "min_value": 50,
            "max_value": 80,
            "min_inclusive": True,
            "max_inclusive": False,
        }
    },
    {"gendera": {"value": 1}},
    {"gendera": {"value": 0}},
    {
        "Anion gap": {
            "min_value": 14.73,
            "min_inclusive": False,
        }
    },
]
slice_spec = SliceSpec(spec_list)
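
The inclusivity flags decide which slice a boundary value such as age 50 falls into. The following plain-Python sketch mirrors the range semantics suggested by the spec above; it is not the Cyclops implementation, and the `in_range` helper with its default arguments is an assumption made for illustration.

```python
def in_range(value, min_value=float("-inf"), max_value=float("inf"),
             min_inclusive=True, max_inclusive=True):
    """Return True when `value` lies in the interval described by the arguments."""
    above = value >= min_value if min_inclusive else value > min_value
    below = value <= max_value if max_inclusive else value < max_value
    return above and below


ages = [29, 30, 49, 50, 79, 80]  # toy ages chosen to sit on the boundaries

# [30, 50): lower bound included, upper bound excluded.
first_slice = [a for a in ages if in_range(a, 30, 50, True, False)]
# [50, 80): age 50 belongs here, not in the first slice.
second_slice = [a for a in ages if in_range(a, 50, 80, True, False)]

print(first_slice)  # [30, 49]
print(second_slice)  # [50, 79]
```

Because the upper bound is excluded and the lower bound included, the two age slices do not overlap.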

A `MetricCollection` can also be defined for the fairness metrics.

In [None]:
specificity = create_metric(
    metric_name="specificity",
    task="binary",
)
sensitivity = create_metric(
    metric_name="sensitivity",
    task="binary",
)

fpr = 1 - specificity
fnr = 1 - sensitivity

ber = (fpr + fnr) / 2

fairness_metric_collection = MetricCollection(
    {
        "Sensitivity": sensitivity,
        "Specificity": specificity,
        "FPR": fpr,
        "FNR": fnr,
        "BER": ber,
    }
)
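
The derived metrics follow directly from sensitivity and specificity: FPR = 1 - specificity, FNR = 1 - sensitivity, and the balanced error rate (BER) is their mean. The sketch below computes them from confusion-matrix counts in plain Python; the counts are toy values and `error_rates` is a hypothetical helper.

```python
def error_rates(tp, fp, tn, fn):
    """Compute sensitivity, specificity, FPR, FNR, and BER from confusion counts."""
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    fpr = 1 - specificity
    fnr = 1 - sensitivity
    ber = (fpr + fnr) / 2
    return {
        "Sensitivity": sensitivity,
        "Specificity": specificity,
        "FPR": fpr,
        "FNR": fnr,
        "BER": ber,
    }


rates = error_rates(tp=30, fp=50, tn=50, fn=10)  # toy counts
print(rates)
# {'Sensitivity': 0.75, 'Specificity': 0.5, 'FPR': 0.5, 'FNR': 0.25, 'BER': 0.375}
```

A BER of 0.5 corresponds to chance-level performance on a binary task, and 0 to a perfect classifier.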

The FairnessConfig helps in setting up and evaluating the fairness of the model predictions.

In [None]:
fairness_config = FairnessConfig(
    metrics=fairness_metric_collection,
    dataset=None,  # dataset is passed from the evaluator
    target_columns=None,  # target columns are passed from the evaluator
    groups=["gendera", "age"],
    group_bins={"age": [60, 70, 80]},
    group_base_values={"age": 20, "gendera": 0},
    thresholds=[0.5],
)

The `evaluate` method outputs the evaluation results and the Hugging Face dataset with the predictions added to it.

In [None]:
results, dataset_with_preds = mortality_task.evaluate(
    dataset["test"],
    metric_collection,
    model_names=model_name,
    transforms=preprocessor,
    prediction_column_prefix="preds",
    slice_spec=slice_spec,
    batch_size=64,
    fairness_config=fairness_config,
    override_fairness_metrics=False,
)
dataset_with_preds

In [None]:
results[model_name].keys()

In [None]:
results[model_name]["overall"].keys()

In [None]:
fpr, tpr, _ = results[model_name]["age:[50 - 80)"]["BinaryROCCurve"]
aurocs = results[model_name]["age:[50 - 80)"]["BinaryAUROC"]


trace0 = go.Scatter(x=fpr, y=tpr, mode="lines", name=f"ROC curve (area = {aurocs:.2f})")
trace1 = go.Scatter(
    x=[0, 1], y=[0, 1], mode="lines", name="Random", line=dict(dash="dash")
)

fig = go.Figure(data=[trace0, trace1])

fig.update_layout(
    title="ROC Curve. age:[50 - 80)",
    xaxis_title="False Positive Rate",
    yaxis_title="True Positive Rate",
    showlegend=True,
)
fig.show()

In [None]:
fpr, tpr, _ = results[model_name]["gendera:1"]["BinaryROCCurve"]
aurocs = results[model_name]["gendera:1"]["BinaryAUROC"]


trace0 = go.Scatter(x=fpr, y=tpr, mode="lines", name=f"ROC curve (area = {aurocs:.2f})")
trace1 = go.Scatter(
    x=[0, 1], y=[0, 1], mode="lines", name="Random", line=dict(dash="dash")
)

fig = go.Figure(data=[trace0, trace1])

fig.update_layout(
    title="ROC Curve. gender: female",
    xaxis_title="False Positive Rate",
    yaxis_title="True Positive Rate",
    showlegend=True,
)
fig.show()
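
The area shown in the legend is the area under the (FPR, TPR) curve. As a rough cross-check, it can be approximated from the curve's points with the trapezoidal rule. This is a sketch with toy points; `trapezoid_auc` is a hypothetical helper, it assumes the FPR values are sorted in ascending order, and it may differ slightly from how `BinaryAUROC` computes the value.

```python
def trapezoid_auc(fpr_points, tpr_points):
    """Approximate the area under a ROC curve with the trapezoidal rule."""
    area = 0.0
    for i in range(1, len(fpr_points)):
        width = fpr_points[i] - fpr_points[i - 1]
        mean_height = (tpr_points[i] + tpr_points[i - 1]) / 2
        area += width * mean_height
    return area


# Toy ROC points, from (0, 0) to (1, 1).
fpr_points = [0.0, 0.25, 0.5, 1.0]
tpr_points = [0.0, 0.5, 0.75, 1.0]

print(trapezoid_auc(fpr_points, tpr_points))  # 0.65625
print(trapezoid_auc([0.0, 1.0], [0.0, 1.0]))  # 0.5, the "Random" diagonal
```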

In [None]:
results["fairness"]