# Readmission Prediction Detectron Implementation

This notebook showcases readmission prediction on the [Diabetes 130-US Hospitals for Years 1999-2008](https://archive.ics.uci.edu/dataset/296/diabetes+130-us+hospitals+for+years+1999-2008) dataset using CyclOps. The task is formulated as binary classification: we predict the probability that a patient is readmitted within 30 days of discharge. Model health is then evaluated on a held-out test set using the [Detectron](https://github.com/rgklab/detectron) method.

## Install Libraries

In [None]:
!pip install pycyclops[xgboost]
!pip install ucimlrepo

## Import Libraries

In [None]:
"""Readmission prediction."""

# ruff: noqa: E402


import numpy as np
import pandas as pd
import plotly.graph_objects as go
from datasets import Dataset, DatasetDict
from datasets.features import ClassLabel
from plotly.subplots import make_subplots
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from ucimlrepo import fetch_ucirepo

from cyclops.data.df.feature import TabularFeatures
from cyclops.evaluate.metrics import create_metric
from cyclops.evaluate.metrics.experimental.metric_dict import MetricDict
from cyclops.models.catalog import create_model
from cyclops.monitor.tester import Detectron
from cyclops.report.utils import flatten_results_dict
from cyclops.tasks import BinaryTabularClassificationTask

## Constants

In [None]:
RANDOM_SEED = 85
NAN_THRESHOLD = 0.75
TRAIN_SIZE = 0.05
EVAL_NUM = 3

## Data Loading

In [None]:
diabetes_130_data = fetch_ucirepo(id=296)
features = diabetes_130_data["data"]["features"]
targets = diabetes_130_data["data"]["targets"]
metadata = diabetes_130_data["metadata"]
variables = diabetes_130_data["variables"]

In [None]:
metadata

In [None]:
def transform_label(value):
    """Transform string labels of readmission into 0/1 binary labels.

    Parameters
    ----------
    value: str
        Input value

    Returns
    -------
    int
        0 if not readmitted or if greater than 30 days, 1 if less than 30 days

    """
    if value in ["NO", ">30"]:
        return 0
    if value == "<30":
        return 1

    raise ValueError("Unexpected value for readmission!")


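# Work on a copy so the fetched feature frame is left untouched, and assign
# the binarized target as a Series rather than a whole DataFrame.
df = features.copy()
df["readmitted"] = targets["readmitted"].apply(transform_label)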

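Choose a subset for modelling. The slice bound below exceeds the roughly 100,000 encounters in this dataset, so all rows are kept; lower it to work with a smaller sample.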

In [None]:
df = df[0:1000000].copy()  # copy so later column assignments do not target a slice view

Rename the target column to `outcome`, then remove features that are entirely NaN or have just a single unique value.

In [None]:
df["outcome"] = df["readmitted"].astype("int")
df = df.drop(columns=["readmitted"])

In [None]:
features_to_remove = []
for col in df:
    if len(df[col].value_counts()) <= 1:
        features_to_remove.append(col)
df = df.drop(columns=features_to_remove)
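
The filter above relies on `value_counts()` ignoring missing values by default. A minimal sketch on a toy frame (hypothetical data, not taken from the diabetes dataset) shows why both all-NaN and constant columns are caught by the same `<= 1` check:

```python
import numpy as np
import pandas as pd

# Toy frame with made-up columns, used only to illustrate the filter.
toy = pd.DataFrame(
    {
        "all_nan": [np.nan, np.nan, np.nan],
        "constant": ["a", "a", "a"],
        "varied": [1, 2, 2],
    }
)

# value_counts() drops NaN by default, so an all-NaN column reports zero
# distinct values and a constant column reports exactly one.
to_drop = [col for col in toy.columns if len(toy[col].value_counts()) <= 1]
toy_clean = toy.drop(columns=to_drop)

print(to_drop)
print(list(toy_clean.columns))
```

Only the `varied` column survives, since it is the only one with more than one distinct non-missing value.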

In [None]:
class_counts = df["outcome"].value_counts()
class_ratio = class_counts[0] / class_counts[1]
print(class_ratio, class_counts)
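
The class ratio matters because it is reused later as the `scale_pos_weight` candidate in the hyperparameter grid. A small sketch with hypothetical labels (the real counts come from the cell above) shows the computation and the effect of truncating to an integer:

```python
from collections import Counter

# Hypothetical labels: nine negatives for every positive.
labels = [0] * 90 + [1] * 10

counts = Counter(labels)
ratio = counts[0] / counts[1]  # negatives per positive

# Truncating with int() mirrors how the ratio is used as a weight later on;
# any fractional part of the ratio is discarded.
scale_pos_weight = int(ratio)

print(ratio, scale_pos_weight)
```

Weighting the positive class by the negative-to-positive ratio is a common starting point for imbalanced binary problems with XGBoost.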

We use all remaining features in the dataset to train the model.

In [None]:
features_list = list(df.columns)
features_list.remove("outcome")
features_list = sorted(features_list)

### Identifying feature types

The CyclOps `TabularFeatures` class identifies feature types, an essential step before preprocessing the data. Knowing each feature's type (numeric, binary, or ordinal/categorical) lets us apply the appropriate preprocessing steps to it.

In [None]:
tab_features = TabularFeatures(
    data=df.reset_index(),
    features=features_list,
    by="index",
    targets="outcome",
)
print(tab_features.types)
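
To give a feel for what type identification involves, here is a deliberately simple heuristic. It is an illustrative assumption, not the logic CyclOps uses; the function name `infer_type` and the returned labels are hypothetical.

```python
def infer_type(values):
    """Guess a feature type from its non-missing values (illustrative only)."""
    present = [v for v in values if v is not None]
    unique = set(present)

    # Exactly two distinct values: treat the feature as binary.
    if len(unique) == 2:
        return "binary"

    # All remaining values are real numbers (bools excluded): numeric.
    if all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
    ):
        return "numeric"

    # Anything else is treated as categorical.
    return "categorical"


print(infer_type([0, 1, 1, None]))
print(infer_type([1.5, 2.0, 3.2]))
print(infer_type(["low", "mid", "high"]))
```

A real implementation has to handle more cases (for example, numeric codes that are really categories), which is why the inferred types are printed and checked above before building the preprocessors.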

### Creating data preprocessors

We create a data preprocessor using sklearn's `ColumnTransformer`, which applies different preprocessing steps to different columns of the dataframe. Here, numeric features are imputed and scaled, while binary and ordinal features are one-hot encoded.

In [None]:
numeric_transformer = Pipeline(
    steps=[("imputer", SimpleImputer(strategy="mean")), ("scaler", MinMaxScaler())],
)

binary_transformer = Pipeline(
    steps=[("imputer", SimpleImputer(strategy="most_frequent"))],
)

In [None]:
numeric_features = sorted(tab_features.features_by_type("numeric"))
numeric_indices = [
    df[features_list].columns.get_loc(column) for column in numeric_features
]
print(numeric_features)

In [None]:
binary_features = sorted(tab_features.features_by_type("binary"))
binary_features.remove("outcome")
ordinal_features = sorted(
    tab_features.features_by_type("ordinal")
    + ["medical_specialty", "diag_1", "diag_2", "diag_3"]
)
binary_indices = [
    df[features_list].columns.get_loc(column) for column in binary_features
]
ordinal_indices = [
    df[features_list].columns.get_loc(column) for column in ordinal_features
]
print(binary_features, ordinal_features)

In [None]:
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_indices),
        (
            "onehot",
            OneHotEncoder(handle_unknown="ignore", sparse_output=False),
            binary_indices + ordinal_indices,
        ),
    ],
    remainder="passthrough",
)
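
The behaviour of this preprocessor can be checked on a toy frame. The sketch below uses hypothetical data and column names; `sparse_threshold=0` is added only to force a dense result so the output is easy to inspect.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Toy data: column 0 is numeric with a missing value, column 1 is categorical.
toy = pd.DataFrame({"num": [1.0, np.nan, 3.0], "cat": ["a", "b", "a"]})

numeric = Pipeline(
    steps=[("imputer", SimpleImputer(strategy="mean")), ("scaler", MinMaxScaler())]
)
toy_preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric, [0]),
        ("onehot", OneHotEncoder(handle_unknown="ignore"), [1]),
    ],
    remainder="passthrough",
    sparse_threshold=0,  # always return a dense array
)

# The missing value is imputed with the mean (2.0) and then scaled to 0.5.
out = toy_preprocessor.fit_transform(toy)
print(out)

# A category unseen during fit is encoded as all zeros instead of raising.
unseen = toy_preprocessor.transform(pd.DataFrame({"num": [2.0], "cat": ["z"]}))
print(unseen)
```

The `handle_unknown="ignore"` setting is what keeps the pipeline from failing when evaluation data contains a category that was absent from the training split.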

## Creating Hugging Face Dataset

We convert our Pandas dataframe into a Hugging Face dataset, a data format that is compatible with CyclOps models and evaluator modules. The dataset is then split into train and test sets, stratified by the outcome.

In [None]:
dataset = Dataset.from_pandas(df)
dataset.cleanup_cache_files()
print(dataset)

In [None]:
dataset = dataset.cast_column("outcome", ClassLabel(num_classes=2))
dataset = dataset.train_test_split(
    train_size=TRAIN_SIZE,
    stratify_by_column="outcome",
    seed=RANDOM_SEED,
)
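
With `TRAIN_SIZE = 0.05`, most of the data ends up in the test split. The arithmetic below sketches the resulting fractions, also accounting for the 80/20 train/validation split applied to the training portion in the Training section; the variable names are illustrative.

```python
TRAIN_SIZE = 0.05  # first split: train vs. test
INNER_TRAIN_SIZE = 0.8  # later split of the train portion: train vs. validation

train_frac = TRAIN_SIZE * INNER_TRAIN_SIZE
val_frac = TRAIN_SIZE * (1 - INNER_TRAIN_SIZE)
test_frac = 1 - TRAIN_SIZE

print(f"train={train_frac:.2%}, validation={val_frac:.2%}, test={test_frac:.2%}")
```

Roughly 4% of the rows are used for fitting, 1% for validation, and 95% are held out, which leaves plenty of test data to split into bins for the model health analysis.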

## Model Creation

The CyclOps model registry allows for straightforward creation and selection of models. This registry maintains a list of pre-configured models, which can be instantiated with a single line of code. Here we use an XGBoost classifier. Model configurations can be passed to `create_model` based on the parameters of XGBoost's `XGBClassifier`.

In [None]:
model_name = "xgb_classifier"
model = create_model(model_name, random_state=123)

## Task Creation

We use CyclOps tasks to define our model's task (in this case, readmission prediction), train the model, make predictions, and evaluate performance. CyclOps task classes encapsulate the entire ML pipeline in a single structure, which keeps the process easy to manage.

In [None]:
readmission_prediction_task = BinaryTabularClassificationTask(
    {model_name: model},
    task_features=features_list,
    task_target="outcome",
)

In [None]:
readmission_prediction_task.list_models()

## Training

If `best_model_params` is passed to the `train` method, a hyperparameter search is run and the best model is selected. The values in `best_model_params` define the parameter grid to search.

Note that the data preprocessor needs to be passed to the task methods if the Hugging Face dataset is not already preprocessed.

In [None]:
best_model_params = {
    "n_estimators": [100, 250, 500],
    "learning_rate": [0.1, 0.01],
    "max_depth": [2, 5],
    "reg_lambda": [0, 1, 10],
    "colsample_bytree": [0.7, 0.8, 1],
    "gamma": [0, 1, 2, 10],
    "method": "random",
    "scale_pos_weight": [int(class_ratio)],
}
dataset["train"] = dataset["train"].train_test_split(train_size=0.8, seed=RANDOM_SEED)

train_dataset = dataset["train"]
val = train_dataset.pop("test")
train_dataset["validation"] = val

readmission_prediction_task.train(
    train_dataset,
    model_name=model_name,
    transforms=preprocessor,
    best_model_params=best_model_params,
)
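
To get a sense of the search space, the sketch below counts the combinations in the grid above. The `"method"` entry is left out on the assumption that it selects the search strategy rather than a model hyperparameter, and the `scale_pos_weight` value is a placeholder for `int(class_ratio)`.

```python
from math import prod

# Same candidate lists as the grid above, minus the "method" entry.
grid = {
    "n_estimators": [100, 250, 500],
    "learning_rate": [0.1, 0.01],
    "max_depth": [2, 5],
    "reg_lambda": [0, 1, 10],
    "colsample_bytree": [0.7, 0.8, 1],
    "gamma": [0, 1, 2, 10],
    "scale_pos_weight": [7],  # placeholder for int(class_ratio)
}

# The full grid is the Cartesian product of all candidate lists.
n_combinations = prod(len(candidates) for candidates in grid.values())
print(n_combinations)
```

With several hundred combinations, an exhaustive search would be expensive on each fold, which is presumably why a randomized search is requested here.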

In [None]:
model_params = readmission_prediction_task.list_models_params()[model_name]
print(model_params)

Initialize the Detectron tester with the trained base model and the training/validation data.

In [None]:
tester = Detectron(
    X_s=dataset["train"],
    base_model=readmission_prediction_task.models["xgb_classifier"],
    feature_column=features_list,
    transforms=preprocessor,
    splits_mapping={"train": "train", "test": "validation"},
    sample_size=50,
    num_runs=5,
    ensemble_size=5,
    task="binary",
    save_dir="detectron",
)

Get model health using the training data and all the test data.

In [None]:
results = tester.predict(
    X_t=DatasetDict({"train": dataset["train"]["train"], "validation": dataset["test"]})
)
print(results["data"]["model_health"])

Split the test data into multiple bins and plot the model health and performance metrics for each bin.

In [None]:
test_data = dataset["test"]
test_data_list = []

indices = np.arange(0, len(test_data))

bins = np.array_split(indices, 20)

for b in bins:
    test_data_list.append(test_data.select(b))
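
`np.array_split` does not require the number of rows to be divisible by the number of bins. A short sketch with a hypothetical test-set size shows how the remainder is distributed:

```python
import numpy as np

indices = np.arange(103)  # hypothetical test-set size
bins = np.array_split(indices, 20)

# The first (103 % 20) bins receive one extra element each.
sizes = [len(b) for b in bins]
print(sizes)

# Bins are contiguous, ordered slices that together cover every index once.
print(bins[0], bins[-1])
```

Because the bins are consecutive slices rather than random samples, each bin reflects whatever ordering the test split already has.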

In [None]:
metric_names = [
    "binary_accuracy",
    "binary_precision",
    "binary_recall",
    "binary_f1_score",
    "binary_auroc",
    "binary_average_precision",
    "binary_roc_curve",
    "binary_precision_recall_curve",
]
metrics = [
    create_metric(metric_name, experimental=True) for metric_name in metric_names
]
metric_collection = MetricDict(metrics)
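
As a reminder of what the threshold-based metrics measure, here is a plain-Python sketch of precision, recall, and F1 on hypothetical predictions. It is not the CyclOps implementation; `binary_scores` is an illustrative helper.

```python
def binary_scores(y_true, y_pred):
    """Return (precision, recall, f1) for 0/1 labels and predictions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return precision, recall, f1


# Hypothetical labels and predictions: 2 true positives, 1 false positive,
# 1 false negative.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
precision, recall, f1 = binary_scores(y_true, y_pred)
print(precision, recall, f1)
```

On an imbalanced task such as early readmission, these metrics are usually more informative than accuracy, which can look high even when few positives are found.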

In [None]:
results_list = []
for data in test_data_list:
    results, dataset_with_preds = readmission_prediction_task.evaluate(
        dataset=data,
        metrics=metric_collection,
        model_names=model_name,
        transforms=preprocessor,
        prediction_column_prefix="preds",
        batch_size=-1,
        override_fairness_metrics=False,
    )
    results_list.append(flatten_results_dict(results)["model_for_preds.xgb_classifier"])

In [None]:
model_health = []
for data in test_data_list:
    results = tester.predict(
        X_t=DatasetDict({"train": dataset["train"]["train"], "validation": data})
    )
    model_health.append(results["data"]["model_health"])

In [None]:
f1_score = [result["overall/BinaryF1Score"] for result in results_list]
precision = [result["overall/BinaryPrecision"] for result in results_list]
recall = [result["overall/BinaryRecall"] for result in results_list]
auroc = [result["overall/BinaryAUROC"] for result in results_list]
average_precision = [
    result["overall/BinaryAveragePrecision"] for result in results_list
]

In [None]:
model_health_df = pd.DataFrame(
    {
        "bin": np.arange(0, len(model_health)),
        "model_health": model_health,
        "f1_score": f1_score,
        "precision": precision,
        "recall": recall,
        "auroc": auroc,
        "average_precision": average_precision,
    }
)
model_health_df = model_health_df.astype(float)

# Define metrics to plot
metrics = ["f1_score", "precision", "recall", "auroc", "average_precision"]

# Define a color palette for metrics
metric_colors = {
    "f1_score": "red",
    "precision": "green",
    "recall": "purple",
    "auroc": "orange",
    "average_precision": "brown",
}

# Create subplots with secondary_y set to True for all subplots
fig = make_subplots(
    rows=len(metrics),
    cols=1,
    shared_xaxes=True,
    vertical_spacing=0.05,
    subplot_titles=[
        f"Model Health and {metric.replace('_', ' ').title()}" for metric in metrics
    ],
    specs=[[{"secondary_y": True}] for _ in metrics],
)

# Add traces for each metric
for i, metric in enumerate(metrics, start=1):
    fig.add_trace(
        go.Scatter(
            x=model_health_df["bin"],
            y=model_health_df["model_health"],
            mode="lines",
            name="Model Health",
            line={"color": "blue"},
        ),
        row=i,
        col=1,
        secondary_y=False,
    )

    fig.add_trace(
        go.Scatter(
            x=model_health_df["bin"],
            y=model_health_df[metric],
            mode="lines",
            name=metric.replace("_", " ").title(),
            line={"color": metric_colors[metric]},
        ),
        row=i,
        col=1,
        secondary_y=True,
    )

    # Update y-axes titles
    fig.update_yaxes(title_text="Model Health", secondary_y=False, row=i, col=1)
    fig.update_yaxes(
        title_text=metric.replace("_", " ").title(), secondary_y=True, row=i, col=1
    )

# Update layout
fig.update_layout(
    title_text="Model Health and Metrics on Test Data",
    height=300 * len(metrics),  # Adjust height based on number of subplots
    legend_tracegroupgap=5,
)

# Update x-axis title for the bottom subplot only
fig.update_xaxes(title_text="Bin", row=len(metrics), col=1)

# Show the figure
fig.show()