# Mortality Prediction using Tabular Data

This notebook presents the use case of predicting the risk of patient mortality on the MIMIC-IV dataset.

In [None]:
import yaml
from datasets import load_dataset
from datasets.splits import Split
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

from cyclops.datasets.slicer import SliceSpec
from cyclops.evaluate.metrics import MetricCollection, create_metric
from cyclops.models.catalog import create_model
from cyclops.models.constants import CONFIG_ROOT
from cyclops.process.column_names import AGE, SEX
from cyclops.tasks.mortality_prediction import MortalityPrediction
from cyclops.utils.file import join, process_dir_save_path
from use_cases.util import get_pandas_df

## Constants

In [None]:
DATASET = "mimiciv"
CONST_NAME = "mortality_decompensation"

USECASE_ROOT_DIR = join(
    "/mnt/data",
    "cyclops",
    "use_cases",
    DATASET,
    CONST_NAME,
)
DATA_DIR = process_dir_save_path(join(USECASE_ROOT_DIR, "./data"))
ENCOUNTERS_FILE = join(DATA_DIR, "encounters.parquet")

OUTCOME_DEATH = "outcome_death"
TARGET = [OUTCOME_DEATH]
FEATURES = [
    AGE,
    SEX,
    "admission_type",
    "admission_location",
]

SPLIT_FRACTIONS = [0.8, 0.1, 0.1]

## Data Loading

Construct a Hugging Face Dataset from the encounters data queried from the MIMIC-IV dataset.

In [None]:
encounters_ds = load_dataset(
    "parquet", data_files=ENCOUNTERS_FILE, split=Split.ALL, keep_in_memory=True
)
encounters_ds.cleanup_cache_files()
encounters_ds

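The dataset is split into train, validation, and test subsets in two stages: 80% of the rows go to training, and the remaining 20% is divided equally between validation and test.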

In [None]:
encounters_ds = encounters_ds.train_test_split(train_size=SPLIT_FRACTIONS[0], seed=42)
encounters_ds_ = encounters_ds["test"].train_test_split(test_size=0.5, seed=42)
encounters_ds["validation"] = encounters_ds_.pop("train")
encounters_ds["test"] = encounters_ds_.pop("test")
encounters_ds
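
The two-stage split above can be illustrated without any dependencies. The sketch below is not how `train_test_split` is implemented; the helper name `three_way_split` and the rounding rule are assumptions made for illustration. It derives the second-stage share from `SPLIT_FRACTIONS` instead of hard-coding `0.5`, which keeps the split consistent if the fractions change.

```python
import random

SPLIT_FRACTIONS = [0.8, 0.1, 0.1]  # train, validation, test (same values as the constants above)


def three_way_split(n_rows, fractions, seed=42):
    """Return shuffled row indices split into train, validation, and test lists."""
    train_frac, val_frac, test_frac = fractions
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)

    # Stage 1: carve off the training rows.
    n_train = round(n_rows * train_frac)
    train, holdout = indices[:n_train], indices[n_train:]

    # Stage 2: divide the holdout between validation and test.
    # With fractions 0.1 and 0.1 the validation share of the holdout is 0.5.
    val_share = val_frac / (val_frac + test_frac)
    n_val = round(len(holdout) * val_share)
    return train, holdout[:n_val], holdout[n_val:]


train_idx, val_idx, test_idx = three_way_split(1000, SPLIT_FRACTIONS)
print(len(train_idx), len(val_idx), len(test_idx))
```

Fixing the seed makes the split reproducible, which is the same reason `seed=42` is passed in the cell above.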

## Data Preprocessing

In the preprocessing step, Scikit-learn transformations are applied to the training dataset after converting it to a Pandas DataFrame. This conversion should only be attempted if the dataset fits in memory.

In [None]:
encounters_train_df = get_pandas_df(
    encounters_ds["train"], feature_cols=FEATURES, label_cols=TARGET
)
encounters_train_df

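Numeric and categorical features are identified by column dtype. `get_pandas_df` returns the features and labels separately, so `encounters_train_df[0]` is the feature DataFrame and `encounters_train_df[1]` holds the labels.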

In [None]:
numeric_features = (
    encounters_train_df[0]
    .loc[:, FEATURES]
    .select_dtypes(include=["int", "float"])
    .columns.tolist()
)
numeric_features

In [None]:
categorical_features = (
    encounters_train_df[0]
    .loc[:, FEATURES]
    .select_dtypes(include=["object"])
    .columns.tolist()
)
categorical_features

In [None]:
# Pre-processing pipeline: refer to columns by position rather than by name
numeric_features = [
    encounters_train_df[0].columns.get_loc(col) for col in numeric_features
]
categorical_features = [
    encounters_train_df[0].columns.get_loc(col) for col in categorical_features
]

numeric_transformer = Pipeline(
    steps=[("imputer", SimpleImputer(strategy="median")), ("scaler", StandardScaler())]
)

categorical_transformer = OneHotEncoder(handle_unknown="ignore")

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ]
)

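# Fit the preprocessor on the training features and densify the sparse output
X_train = preprocessor.fit_transform(encounters_train_df[0]).toarray()
# Multiplying by 1 turns boolean labels into integers (numeric labels are unchanged)
y_train = encounters_train_df[1].to_numpy() * 1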
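
To make the effect of the preprocessor concrete, here is a dependency-free sketch of the three transformations it chains: median imputation, standard scaling, and one-hot encoding that ignores unseen categories. The function names and the sample values are hypothetical; in the notebook, Scikit-learn performs these steps.

```python
from statistics import mean, median, pstdev


def fit_numeric(values):
    """Learn the median fill value, then the mean and population std of the filled data."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    filled = [fill if v is None else v for v in values]
    # Fall back to 1.0 so a constant column does not cause division by zero.
    return {"fill": fill, "mean": mean(filled), "std": pstdev(filled) or 1.0}


def transform_numeric(values, params):
    """Impute missing values with the learned median, then standardize."""
    filled = [params["fill"] if v is None else v for v in values]
    return [(v - params["mean"]) / params["std"] for v in filled]


def fit_categories(values):
    """Record the sorted set of categories seen during fitting."""
    return sorted(set(values))


def one_hot(value, categories):
    """Encode one value; a category unseen during fitting becomes all zeros."""
    return [1.0 if value == c else 0.0 for c in categories]


ages = [20, None, 40, 60]  # hypothetical ages with one missing entry
age_params = fit_numeric(ages)
scaled_ages = transform_numeric(ages, age_params)

admission_types = fit_categories(["URGENT", "EW EMER.", "URGENT"])
seen = one_hot("URGENT", admission_types)
unseen = one_hot("ELECTIVE", admission_types)
print(scaled_ages, seen, unseen)
```

The statistics are learned from the training split only and then reused for validation and test data, which is why the notebook calls `fit_transform` on the training frame and plain `transform` later.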

## Model Creation

The CyclOps Model API is used to create models using estimators from the Scikit-learn package. The configuration of the model is based on the corresponding config files, which include the necessary parameters for instantiating the Scikit-learn estimators, as well as optional parameters for hyperparameter search.

In [None]:
mlp_name = "mlp"
config_path = join(CONFIG_ROOT, mlp_name + ".yaml")
with open(config_path, "r") as f:
    mlp_config = yaml.safe_load(f)

best_mlp_params = mlp_config["best_model_params"]
mlp_model = create_model(mlp_name, **mlp_config["model_params"])

In [None]:
xgb_name = "xgb_classifier"
config_path = join(CONFIG_ROOT, xgb_name + ".yaml")
with open(config_path, "r") as f:
    xgb_config = yaml.safe_load(f)

best_xgb_params = xgb_config["best_model_params"]
xgb_model = create_model(xgb_name, **xgb_config["model_params"])
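
The sketch below shows one plausible shape for such a config and how a search space could be expanded into candidate settings. Only the keys `model_params` and `best_model_params` come from the cells above. The parameter names, their values, the assumption that `best_model_params` maps each name to a list of candidates, and the exhaustive grid strategy are all illustrative; the real YAML files under `CONFIG_ROOT` may differ.

```python
from itertools import product

# Hypothetical config contents, mirroring what yaml.safe_load might return.
example_config = {
    "model_params": {"n_estimators": 100, "max_depth": 3},
    "best_model_params": {
        "n_estimators": [100, 250],
        "max_depth": [2, 5, 10],
    },
}


def expand_grid(search_space):
    """List every combination of candidate values as a dict of parameters."""
    names = list(search_space)
    value_lists = [search_space[name] for name in names]
    return [dict(zip(names, combo)) for combo in product(*value_lists)]


candidates = expand_grid(example_config["best_model_params"])
print(len(candidates))
print(candidates[0])
```

The number of candidates is the product of the list lengths, so the cost of an exhaustive search grows quickly as parameters are added.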

## Mortality Prediction Task

The CyclOps Task API is used to create a Mortality Prediction Task based on the available models and dataset. The task can contain multiple models that can be trained and used for prediction individually. This is particularly useful when comparing the performance of multiple models during the evaluation step.

In [None]:
mortality_task = MortalityPrediction(
    {mlp_name: mlp_model}, task_features=FEATURES, task_target=TARGET
)

In [None]:
mortality_task.add_model(xgb_name)
mortality_task.list_models()

In [None]:
mortality_task.list_models_params()

### Training

There are two methods to train models for mortality prediction: `train` and `train_on_hf_dataset`.

The `train` method is used when the training features and labels are provided separately, either as NumPy arrays or as DataFrames containing only the relevant columns. It is suitable when the entire dataset fits in memory, partial fitting is not required, and hyperparameter search is desired. Passing `best_model_params` to `train` enables the hyperparameter search.

The `train_on_hf_dataset` method is used when the data is in the Hugging Face dataset format, especially when it is too large to fit in memory. The training dataset passed to this method includes both the features and the labels.

If the data is not already preprocessed, pass a `ColumnTransformer` as the `preprocessor` argument so that it is applied before training.

In [None]:
mortality_task.train(
    X_train,
    y_train,
    model_name=xgb_name,
    best_model_params=best_xgb_params,
)

In [None]:
mortality_task.train(
    encounters_ds["train"],
    model_name=mlp_name,
    preprocessor=preprocessor,
    batch_size=128,
)
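
The idea behind batched training on a dataset that does not fit in memory can be sketched with a toy estimator that is updated one batch at a time. This is an illustration of partial fitting in general, not of the CyclOps internals; `iter_batches` and `RunningMeanModel` are hypothetical names, and the labels are synthetic.

```python
def iter_batches(rows, batch_size):
    """Yield consecutive slices of at most batch_size rows."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


class RunningMeanModel:
    """Toy estimator that tracks the positive-label rate seen so far."""

    def __init__(self):
        self.n_seen = 0
        self.label_sum = 0.0

    def partial_fit(self, labels):
        # Update the running totals using only the current batch.
        self.n_seen += len(labels)
        self.label_sum += sum(labels)
        return self

    def positive_rate(self):
        return self.label_sum / self.n_seen if self.n_seen else 0.0


# Synthetic labels: one positive outcome in every ten rows.
labels = [1 if i % 10 == 0 else 0 for i in range(1000)]

model = RunningMeanModel()
n_batches = 0
for batch in iter_batches(labels, batch_size=128):
    model.partial_fit(batch)
    n_batches += 1

print(n_batches, model.positive_rate())
```

Only one batch needs to be held in memory at a time, so `batch_size` trades memory use against the number of update steps.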

### Prediction

In the prediction phase, the task object allows for a variety of data inputs, including numpy arrays, pandas dataframes, and Hugging Face Datasets.

With a Hugging Face dataset as input, `predict` can return the entire dataset with an added prediction column (`only_predictions=False`). This is particularly useful for large datasets that cannot fit into memory or when batched prediction is desired.

In [None]:
encounters_ds_test = encounters_ds["test"]
encounters_df_test = get_pandas_df(
    encounters_ds_test, feature_cols=FEATURES, label_cols=TARGET
)
X_test = preprocessor.transform(encounters_df_test[0].to_numpy()).toarray()
Y_test = encounters_df_test[1].to_numpy()

In [None]:
mortality_task.predict(
    X_test,
    model_name=xgb_name,
    proba=False,
)

In [None]:
ds_with_mlp_preds = mortality_task.predict(
    encounters_ds_test,
    model_name=mlp_name,
    prediction_column_prefix="preds",
    preprocessor=preprocessor,
    batch_size=5000,
    only_predictions=False,
)
ds_with_mlp_preds.to_pandas()
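
Returning the dataset with an extra prediction column can be pictured as a batched map over the rows. The sketch below uses plain dictionaries instead of a Hugging Face dataset; `predict_in_batches`, `toy_predict`, and the column name `preds.toy` are hypothetical and only illustrate the pattern.

```python
def predict_in_batches(rows, predict_fn, column_name, batch_size):
    """Return copies of the rows with one prediction per row stored under column_name."""
    output = []
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        predictions = predict_fn(batch)
        for row, prediction in zip(batch, predictions):
            # Copy the row so the input data is left unchanged.
            output.append({**row, column_name: prediction})
    return output


def toy_predict(batch):
    """Hypothetical stand-in for a trained model: flag patients aged 80 or older."""
    return [1 if row["age"] >= 80 else 0 for row in batch]


rows = [{"age": age} for age in (34, 81, 67, 90, 52)]
with_preds = predict_in_batches(rows, toy_predict, "preds.toy", batch_size=2)
print(with_preds)
```

Keeping the predictions next to the original features is what later allows the evaluation step to slice results by feature values.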

### Evaluation

Evaluation is typically performed on a Hugging Face dataset. To evaluate the models, you can also provide a slice specification to see how well they perform for different slices of data based on the feature values.

In addition to the dataset and slice specification, you need to specify the desired evaluation metrics. This can be done by providing a `MetricCollection` object, a list of metrics, or metric names.


In [None]:
spec_list = [
    {"sex": {"value": "M"}},  # feature value is M
    {
        "age": {
            "min_value": 18,
            "max_value": 65,
            "min_inclusive": True,
            "max_inclusive": False,
        }
    },  # feature value is between 18 and 65, inclusive of 18, exclusive of 65
    {
        "admission_type": {"value": ["EW EMER.", "DIRECT EMER.", "URGENT"]}
    },  # feature value is in the list
    {
        "admission_location": {
            "value": ["PHYSICIAN REFERRAL", "CLINIC REFERRAL", "WALK-IN/SELF REFERRAL"],
            "negate": True,
        }
    },  # feature value is NOT in the list
    {
        "dod": {"max_value": "2019-12-01", "keep_nulls": False}
    },  # possibly before COVID-19
    {
        "dod": {"max_value": "2019-12-01", "negate": True, "keep_nulls": False}
    },  # possibly during COVID-19
    {"admit_timestamp": {"month": [6, 7, 8, 9], "keep_nulls": False}},  # admitted in June through September
    {
        "sex": {"value": "F"},
        "race": {
            "value": [
                "BLACK/AFRICAN AMERICAN",
                "BLACK/CARIBBEAN ISLAND",
                "BLACK/CAPE VERDEAN",
                "BLACK/AFRICAN",
            ]
        },
        "age": {"min_value": 25, "max_value": 40},
    },  # compound slice
]


# create the slice functions
slice_spec = SliceSpec(spec_list)
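
The semantics of the simpler slice entries can be expressed as plain predicates. The sketch below is not the `SliceSpec` implementation: `matches` and `in_slice` are hypothetical helpers, it covers only `value`, range bounds, and `negate`, and it assumes bounds are inclusive unless a flag says otherwise.

```python
def matches(value, rule):
    """Return True if one feature value satisfies one rule from a slice entry."""
    if "value" in rule:
        allowed = rule["value"]
        if not isinstance(allowed, list):
            allowed = [allowed]
        hit = value in allowed
    else:
        low = rule.get("min_value", float("-inf"))
        high = rule.get("max_value", float("inf"))
        # Assumption: both bounds are inclusive by default.
        above = value >= low if rule.get("min_inclusive", True) else value > low
        below = value <= high if rule.get("max_inclusive", True) else value < high
        hit = above and below
    return (not hit) if rule.get("negate", False) else hit


def in_slice(row, entry):
    """A row belongs to a slice only if every feature rule in the entry holds."""
    return all(matches(row[feature], rule) for feature, rule in entry.items())


adults = {
    "age": {"min_value": 18, "max_value": 65, "min_inclusive": True, "max_inclusive": False}
}
not_referred = {"admission_location": {"value": ["PHYSICIAN REFERRAL"], "negate": True}}
compound = {"sex": {"value": "F"}, "age": {"min_value": 25, "max_value": 40}}

print(in_slice({"age": 18}, adults))
print(in_slice({"age": 65}, adults))
print(in_slice({"admission_location": "EMERGENCY ROOM"}, not_referred))
print(in_slice({"sex": "F", "age": 30}, compound))
```

A compound entry behaves like a logical AND over its features, which is why the last entry in `spec_list` selects a much narrower group than any single-feature slice.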

In [None]:
metric_names = ["accuracy", "precision", "recall", "f1_score", "auroc"]
metrics = [create_metric(metric_name, task="binary") for metric_name in metric_names]
metric_collection = MetricCollection(metrics)


results = mortality_task.evaluate(
    encounters_ds_test,
    metric_collection,
    preprocessor=preprocessor,
    prediction_column_prefix="preds",
    slice_spec=slice_spec,
    batch_size=5000,
)
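
For reference, four of the requested metrics follow directly from the confusion counts of a binary classifier. The sketch below computes them in plain Python on made-up labels; it illustrates the standard definitions and is not the CyclOps metric implementation. AUROC is left out because it needs predicted scores rather than hard labels.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 from 0/1 labels and predictions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    # Guard against empty denominators, for example when nothing is predicted positive.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1_score": f1}


# Made-up labels and predictions for eight patients.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
scores = binary_metrics(y_true, y_pred)
print(scores)
```

Because mortality is usually a rare outcome, accuracy alone can look high for a model that never predicts death, so precision and recall are worth reading alongside it.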

In [None]:
results[mlp_name]

In [None]:
results[xgb_name]