# Readmission Prediction

This notebook showcases readmission prediction on the [Diabetes 130-US Hospitals for Years 1999-2008](https://archive.ics.uci.edu/dataset/296/diabetes+130-us+hospitals+for+years+1999-2008) dataset using CyclOps. The task is formulated as binary classification: we predict the probability that a patient is readmitted early, within 30 days of discharge.

## Install libraries

In [1]:
!pip install pycyclops[xgboost]
!pip install ucimlrepo

## Import Libraries

In [2]:
"""Readmission prediction."""

# ruff: noqa: E402

import copy
import inspect
from datetime import date

import numpy as np
import pandas as pd
import plotly.express as px
from datasets import Dataset
from datasets.features import ClassLabel
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from ucimlrepo import fetch_ucirepo

from cyclops.data.df.feature import TabularFeatures
from cyclops.data.slicer import SliceSpec
from cyclops.evaluate.fairness import FairnessConfig  # noqa: E402
from cyclops.evaluate.metrics import create_metric
from cyclops.evaluate.metrics.experimental.functional import (
    binary_npv,
    binary_ppv,
    binary_roc,
)
from cyclops.evaluate.metrics.experimental.metric_dict import MetricDict
from cyclops.models.catalog import create_model
from cyclops.report import ModelCardReport
from cyclops.report.plot.classification import ClassificationPlotter
from cyclops.report.utils import flatten_results_dict
from cyclops.tasks import BinaryTabularClassificationTask

from cyclops.monitor.tester import Detectron
from datasets import DatasetDict



## Constants

In [3]:
RANDOM_SEED = 85
NAN_THRESHOLD = 0.75
TRAIN_SIZE = 0.05
EVAL_NUM = 3

## Data Loading

In [4]:
diabetes_130_data = fetch_ucirepo(id=296)
features = diabetes_130_data["data"]["features"]
targets = diabetes_130_data["data"]["targets"]
metadata = diabetes_130_data["metadata"]
variables = diabetes_130_data["variables"]


Columns (10) have mixed types. Specify dtype option on import or set low_memory=False.



In [5]:
metadata

{'uci_id': 296,
 'name': 'Diabetes 130-US Hospitals for Years 1999-2008',
 'repository_url': 'https://archive.ics.uci.edu/dataset/296/diabetes+130-us+hospitals+for+years+1999-2008',
 'data_url': 'https://archive.ics.uci.edu/static/public/296/data.csv',
 'abstract': 'The dataset represents ten years (1999-2008) of clinical care at 130 US hospitals and integrated delivery networks. Each row concerns hospital records of patients diagnosed with diabetes, who underwent laboratory, medications, and stayed up to 14 days. The goal is to determine the early readmission of the patient within 30 days of discharge.\nThe problem is important for the following reasons. Despite high-quality evidence showing improved clinical outcomes for diabetic patients who receive various preventive and therapeutic interventions, many patients do not receive them. This can be partially attributed to arbitrary diabetes management in hospital environments, which fail to attend to glycemic control. Failure to provide

In [6]:
def transform_label(value):
    """Transform string labels of readmission into 0/1 binary labels.

    Parameters
    ----------
    value: str
        Input value

    Returns
    -------
    int
        0 if not readmitted or if greater than 30 days, 1 if less than 30 days

    """
    if value in ["NO", ">30"]:
        return 0
    if value == "<30":
        return 1

    raise ValueError("Unexpected value for readmission!")


df = features
targets["readmitted"] = targets["readmitted"].apply(transform_label)
df["readmitted"] = targets

Choose a subset for modelling. The slice bound below exceeds the 101,766 rows in the dataset, so every row is kept.

In [7]:
df = df[0:1000000]

Rename the target column to `outcome`, then remove features that are entirely NaN or have just a single unique value

In [8]:
df["outcome"] = df["readmitted"].astype("int")
df = df.drop(columns=["readmitted"])

In [9]:
features_to_remove = []
for col in df:
    if len(df[col].value_counts()) <= 1:
        features_to_remove.append(col)
df = df.drop(columns=features_to_remove)
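The filter above relies on `value_counts()` ignoring NaN by default, so a column that is entirely NaN yields zero entries and a constant column yields exactly one. A minimal sketch on a toy dataframe (the column names and values are hypothetical, not taken from the diabetes dataset):

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical data) with one all-NaN column, one constant
# column, and one column that actually varies.
toy = pd.DataFrame(
    {
        "all_nan": [np.nan, np.nan, np.nan],
        "constant": ["a", "a", "a"],
        "varied": [1, 2, 3],
    }
)

# value_counts() drops NaN by default: 0 entries for an all-NaN column,
# 1 entry for a constant column.
to_drop = [col for col in toy if len(toy[col].value_counts()) <= 1]
toy_clean = toy.drop(columns=to_drop)
print(to_drop, list(toy_clean.columns))
```

Only the `varied` column survives, which mirrors what the loop over `df` does on the full dataset.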

In [10]:
class_counts = df["outcome"].value_counts()
class_ratio = class_counts[0] / class_counts[1]
print(class_ratio, class_counts)

7.960641014352381 outcome
0    90409
1    11357
Name: count, dtype: int64
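As a quick check of the numbers printed above, the arithmetic can be reproduced by hand. This sketch only reuses the two counts from the output; the variable names are illustrative. It also shows the truncated integer that is later passed as `scale_pos_weight` in the hyperparameter grid.

```python
# Counts copied from the output above.
n_negative = 90409
n_positive = 11357

class_ratio = n_negative / n_positive
# int() truncates toward zero, so the fractional part is discarded.
scale_pos_weight = int(class_ratio)
# Share of encounters labelled as early readmissions.
positive_rate = n_positive / (n_negative + n_positive)

print(round(class_ratio, 4), scale_pos_weight, round(positive_rate, 4))
```

Roughly one encounter in nine is a positive case, so plain accuracy would be a misleading metric on its own.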


We use all of the remaining features in the dataset to train the model.

In [11]:
features_list = list(df.columns)
features_list.remove("outcome")
features_list = sorted(features_list)

### Identifying feature types

The CyclOps `TabularFeatures` class helps identify feature types, an essential step before preprocessing the data. Knowing each feature's type (numeric, ordinal, binary, or string) lets us apply the appropriate preprocessing steps to it.

In [12]:
tab_features = TabularFeatures(
    data=df.reset_index(),
    features=features_list,
    by="index",
    targets="outcome",
)
print(tab_features.types)

{'A1Cresult': 'ordinal', 'age': 'ordinal', 'pioglitazone': 'ordinal', 'num_medications': 'numeric', 'metformin-rosiglitazone': 'binary', 'tolazamide': 'ordinal', 'glipizide': 'ordinal', 'number_inpatient': 'numeric', 'troglitazone': 'binary', 'acarbose': 'ordinal', 'glyburide-metformin': 'ordinal', 'acetohexamide': 'binary', 'chlorpropamide': 'ordinal', 'medical_specialty': 'string', 'max_glu_serum': 'ordinal', 'repaglinide': 'ordinal', 'rosiglitazone': 'ordinal', 'admission_type_id': 'ordinal', 'glimepiride': 'ordinal', 'gender': 'ordinal', 'glipizide-metformin': 'binary', 'num_lab_procedures': 'numeric', 'number_emergency': 'numeric', 'glimepiride-pioglitazone': 'binary', 'nateglinide': 'ordinal', 'discharge_disposition_id': 'numeric', 'payer_code': 'ordinal', 'num_procedures': 'ordinal', 'number_outpatient': 'numeric', 'diag_3': 'string', 'change': 'binary', 'diabetesMed': 'binary', 'miglitol': 'ordinal', 'race': 'ordinal', 'diag_1': 'string', 'outcome': 'binary', 'diag_2': 'string'

### Creating data preprocessors

We create a data preprocessor using sklearn's `ColumnTransformer`, which applies different preprocessing steps to different columns of the dataframe. Here, numeric features are mean-imputed and min-max scaled, while binary and ordinal features are one-hot encoded.

In [13]:
numeric_transformer = Pipeline(
    steps=[("imputer", SimpleImputer(strategy="mean")), ("scaler", MinMaxScaler())],
)

binary_transformer = Pipeline(
    steps=[("imputer", SimpleImputer(strategy="most_frequent"))],
)

In [14]:
numeric_features = sorted(tab_features.features_by_type("numeric"))
numeric_indices = [
    df[features_list].columns.get_loc(column) for column in numeric_features
]
print(numeric_features)

['discharge_disposition_id', 'num_lab_procedures', 'num_medications', 'number_emergency', 'number_inpatient', 'number_outpatient']


In [15]:
binary_features = sorted(tab_features.features_by_type("binary"))
binary_features.remove("outcome")
ordinal_features = sorted(
    tab_features.features_by_type("ordinal")
    + ["medical_specialty", "diag_1", "diag_2", "diag_3"]
)
binary_indices = [
    df[features_list].columns.get_loc(column) for column in binary_features
]
ordinal_indices = [
    df[features_list].columns.get_loc(column) for column in ordinal_features
]
print(binary_features, ordinal_features)

['acetohexamide', 'change', 'diabetesMed', 'glimepiride-pioglitazone', 'glipizide-metformin', 'metformin-pioglitazone', 'metformin-rosiglitazone', 'tolbutamide', 'troglitazone'] ['A1Cresult', 'acarbose', 'admission_source_id', 'admission_type_id', 'age', 'chlorpropamide', 'diag_1', 'diag_2', 'diag_3', 'gender', 'glimepiride', 'glipizide', 'glyburide', 'glyburide-metformin', 'insulin', 'max_glu_serum', 'medical_specialty', 'metformin', 'miglitol', 'nateglinide', 'num_procedures', 'number_diagnoses', 'payer_code', 'pioglitazone', 'race', 'repaglinide', 'rosiglitazone', 'time_in_hospital', 'tolazamide', 'weight']


In [16]:
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_indices),
        (
            "onehot",
            OneHotEncoder(handle_unknown="ignore", sparse_output=False),
            binary_indices + ordinal_indices,
        ),
    ],
    remainder="passthrough",
)

## Creating Hugging Face Dataset

We convert the Pandas dataframe into a Hugging Face dataset, a powerful and easy-to-use data format that is also compatible with CyclOps models and evaluator modules. The dataset is then split into train and test sets.

In [17]:
# upsample the minority class
from sklearn.utils import resample

# df_majority = df[df.outcome == 0]

# df_minority = df[df.outcome == 1]

# df_minority_upsampled = resample(
#     df_minority,
#     replace=True,
#     n_samples=len(df_majority),
#     random_state=RANDOM_SEED,
# )

# df_upsampled = pd.concat([df_majority, df_minority_upsampled])
# df_upsampled = df_upsampled.sample(frac=1, random_state=RANDOM_SEED)
# print(df_upsampled.outcome.value_counts())
# dataset = Dataset.from_pandas(df_upsampled)

dataset = Dataset.from_pandas(df)
dataset.cleanup_cache_files()
print(dataset)

Dataset({
    features: ['race', 'gender', 'age', 'weight', 'admission_type_id', 'discharge_disposition_id', 'admission_source_id', 'time_in_hospital', 'payer_code', 'medical_specialty', 'num_lab_procedures', 'num_procedures', 'num_medications', 'number_outpatient', 'number_emergency', 'number_inpatient', 'diag_1', 'diag_2', 'diag_3', 'number_diagnoses', 'max_glu_serum', 'A1Cresult', 'metformin', 'repaglinide', 'nateglinide', 'chlorpropamide', 'glimepiride', 'acetohexamide', 'glipizide', 'glyburide', 'tolbutamide', 'pioglitazone', 'rosiglitazone', 'acarbose', 'miglitol', 'troglitazone', 'tolazamide', 'insulin', 'glyburide-metformin', 'glipizide-metformin', 'glimepiride-pioglitazone', 'metformin-rosiglitazone', 'metformin-pioglitazone', 'change', 'diabetesMed', 'outcome'],
    num_rows: 101766
})


In [18]:
dataset = dataset.cast_column("outcome", ClassLabel(num_classes=2))
dataset = dataset.train_test_split(
    train_size=TRAIN_SIZE,
    stratify_by_column="outcome",
    seed=RANDOM_SEED,
)
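Stratifying on `outcome` keeps the class proportions of the full dataset in both splits, which matters with so few positive cases. The idea can be sketched in plain Python. This is a hand-rolled illustration on hypothetical toy labels, not how the `datasets` library implements `stratify_by_column`.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_size, seed):
    """Return (train_idx, test_idx), splitting each class separately."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Take the same fraction from every class.
        n_train = round(train_size * len(indices))
        train_idx.extend(indices[:n_train])
        test_idx.extend(indices[n_train:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels (hypothetical) with an 8:1 imbalance.
labels = [0] * 800 + [1] * 100
train_idx, test_idx = stratified_split(labels, train_size=0.05, seed=85)
train_pos = sum(labels[i] for i in train_idx)
print(len(train_idx), train_pos, len(test_idx))
```

Because each class is split on its own, the small training split keeps the same 8:1 ratio as the toy label list.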

## Model Creation

The CyclOps model registry allows for straightforward creation and selection of models. This registry maintains a list of pre-configured models, which can be instantiated with a single line of code. Here we use an XGBoost classifier (`xgb_classifier`). Model configuration can be passed to `create_model` as keyword arguments that follow the XGBoost classifier's parameters, such as `random_state` below.

In [19]:
model_name = "xgb_classifier"
model = create_model(model_name, random_state=123)

## Task Creation

We use CyclOps tasks to define our model's task (in this case, readmission prediction), train the model, make predictions, and evaluate performance. CyclOps task classes encapsulate the entire ML pipeline in a single, cohesive structure, making the process smooth and easy to manage.

In [20]:
readmission_prediction_task = BinaryTabularClassificationTask(
    {model_name: model},
    task_features=features_list,
    task_target="outcome",
)

In [21]:
readmission_prediction_task.list_models()

['xgb_classifier']

## Training

If `best_model_params` is passed to the `train` method, a hyperparameter search is run and the best model is selected. The lists in `best_model_params` give the candidate values that make up the parameter grid.

Note that the data preprocessor needs to be passed to the task's methods if the Hugging Face dataset is not already preprocessed.

In [22]:
best_model_params = {
    "n_estimators": [100, 250, 500],
    "learning_rate": [0.1, 0.01],
    "max_depth": [2, 5],
    "reg_lambda": [0, 1, 10],
    "colsample_bytree": [0.7, 0.8, 1],
    "gamma": [0, 1, 2, 10],
    "method": "random",
    "scale_pos_weight": [int(class_ratio)],
}
dataset["train"] = dataset["train"].train_test_split(train_size=0.8, seed=RANDOM_SEED)

readmission_prediction_task.train(
    dataset["train"],
    model_name=model_name,
    transforms=preprocessor,
    best_model_params=best_model_params,
)

2024-07-16 09:18:47,811 INFO cyclops.models.wrappers.sk_model - No validation split was found.
2024-07-16 09:20:34,989 INFO cyclops.models.wrappers.sk_model - Best scale_pos_weight: 7
2024-07-16 09:20:34,990 INFO cyclops.models.wrappers.sk_model - Best reg_lambda: 0
2024-07-16 09:20:34,990 INFO cyclops.models.wrappers.sk_model - Best n_estimators: 100
2024-07-16 09:20:34,991 INFO cyclops.models.wrappers.sk_model - Best max_depth: 5
2024-07-16 09:20:34,991 INFO cyclops.models.wrappers.sk_model - Best learning_rate: 0.1
2024-07-16 09:20:34,991 INFO cyclops.models.wrappers.sk_model - Best gamma: 0
2024-07-16 09:20:34,992 INFO cyclops.models.wrappers.sk_model - Best colsample_bytree: 1
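To get a feel for the size of the search space in `best_model_params`, the grid can be enumerated with the standard library. This sketch is not the CyclOps search implementation. It leaves out the `"method"` entry on the assumption that it selects the search strategy rather than a model hyperparameter (the log above reports no best value for it). The sample size of 10 and the name `sampled` are also assumptions for illustration.

```python
import itertools
import random

# Candidate values copied from the cell above, without the "method" entry.
param_grid = {
    "n_estimators": [100, 250, 500],
    "learning_rate": [0.1, 0.01],
    "max_depth": [2, 5],
    "reg_lambda": [0, 1, 10],
    "colsample_bytree": [0.7, 0.8, 1],
    "gamma": [0, 1, 2, 10],
    "scale_pos_weight": [7],
}

names = list(param_grid)
all_combinations = list(itertools.product(*param_grid.values()))

# A random search evaluates only a sample of the full grid.
# The sample size of 10 is an assumption for this sketch.
rng = random.Random(85)
sampled = [dict(zip(names, combo)) for combo in rng.sample(all_combinations, 10)]

print(len(all_combinations))
print(sampled[0])
```

An exhaustive grid search would have to fit every one of these combinations, which is why a random search is the cheaper choice here.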


In [23]:
model_params = readmission_prediction_task.list_models_params()[model_name]
print(model_params)

{'objective': 'binary:logistic', 'use_label_encoder': None, 'base_score': None, 'booster': None, 'callbacks': None, 'colsample_bylevel': None, 'colsample_bynode': None, 'colsample_bytree': 1, 'early_stopping_rounds': None, 'enable_categorical': False, 'eval_metric': 'logloss', 'feature_types': None, 'gamma': 0, 'gpu_id': None, 'grow_policy': None, 'importance_type': None, 'interaction_constraints': None, 'learning_rate': 0.1, 'max_bin': None, 'max_cat_threshold': None, 'max_cat_to_onehot': None, 'max_delta_step': None, 'max_depth': 5, 'max_leaves': None, 'min_child_weight': 3, 'missing': nan, 'monotone_constraints': None, 'n_estimators': 100, 'n_jobs': None, 'num_parallel_tree': None, 'predictor': None, 'random_state': 123, 'reg_alpha': None, 'reg_lambda': 0, 'sampling_method': None, 'scale_pos_weight': 7, 'subsample': None, 'tree_method': None, 'validate_parameters': None, 'verbosity': None, 'seed': 123}


## Prediction

The prediction output can be either the whole Hugging Face dataset with the prediction columns added to it or the single column containing the predicted values.

In [24]:
# y_pred = readmission_prediction_task.predict(
#     dataset["test"],
#     model_name=model_name,
#     transforms=preprocessor,
#     proba=True,
#     only_predictions=True,
# )
# prediction_df = pd.DataFrame(
#     {
#         "y_prob": [y_pred_i[1] for y_pred_i in y_pred],
#         "y_true": dataset["test"]["outcome"],
#     }
# )

## Evaluation

Evaluation uses several metrics that give different perspectives on the model's predictive abilities: standard performance metrics and fairness metrics.

The standard performance metrics can be created using the `MetricDict` object.

In [25]:
metric_names = [
    "binary_accuracy",
    "binary_precision",
    "binary_recall",
    "binary_f1_score",
    "binary_auroc",
    "binary_average_precision",
    "binary_roc_curve",
    "binary_precision_recall_curve",
]
metrics = [
    create_metric(metric_name, experimental=True) for metric_name in metric_names
]
metric_collection = MetricDict(metrics)

In [26]:
specificity = create_metric(metric_name="binary_specificity", experimental=True)
sensitivity = create_metric(metric_name="binary_sensitivity", experimental=True)

fpr = -specificity + 1
fnr = -sensitivity + 1

ber = (fpr + fnr) / 2

fairness_metric_collection = MetricDict(
    {
        "Sensitivity": sensitivity,
        "Specificity": specificity,
        "BER": ber,
    },
)
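The balanced error rate (BER) composed above is the mean of the false positive rate and the false negative rate, which equals one minus the mean of sensitivity and specificity. A plain-Python sketch with hypothetical confusion-matrix counts makes the arithmetic explicit. The function name and the counts are illustrative only.

```python
def balanced_error_rate(tp, fp, tn, fn):
    """Mean of the false positive and false negative rates."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    fnr = 1 - sensitivity
    fpr = 1 - specificity
    return (fpr + fnr) / 2


# Hypothetical counts: 100 positives and 800 negatives in total.
ber = balanced_error_rate(tp=60, fp=100, tn=700, fn=40)
print(ber)
```

Unlike plain accuracy, BER weights errors on both classes equally, which is useful when positives are rare.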

The `evaluate` method outputs the evaluation results and the Hugging Face dataset with the predictions added to it. The cells below use the `Detectron` tester to estimate model health on the test data.

In [27]:
tester = Detectron(
    X_s=dataset["train"],
    base_model=readmission_prediction_task.models["xgb_classifier"],
    feature_column=features_list,
    transforms=preprocessor,
    splits_mapping={"train": "train", "test": "test"},
    sample_size=250,
    num_runs=5,
    ensemble_size=5,
    task="binary",
    save_dir="detectron",
)

In [28]:
# get model health on all test data
results = tester.predict(
    X_t=DatasetDict({"train": dataset["train"]["train"], "test": dataset["test"]}),
)
print(results["model_health"])

0.916030534351145


In [29]:
# split test data into 20 bins
test_data = dataset["test"]
test_data_list = []

indices = np.arange(0, len(test_data))

bins = np.array_split(indices, 20)

for bin_indices in bins:  # avoid shadowing the built-in `bin`
    test_data_list.append(test_data.select(bin_indices))
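`np.array_split` does not require the number of rows to be divisible by the number of bins: the leading bins simply receive one extra element each. A small sketch with a hypothetical index range of 103 items shows the resulting bin sizes.

```python
import numpy as np

# Hypothetical size that is not divisible by 20.
indices = np.arange(103)
bins = np.array_split(indices, 20)
sizes = [len(b) for b in bins]

print(len(bins), sizes)
```

The bins are contiguous and non-overlapping, so each test row lands in exactly one bin.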

In [30]:
# get model health on all test data bins
model_health = []
for data in test_data_list:
    results = tester.predict(X_t=DatasetDict({"train": dataset["train"]["train"], "test": data}))
    model_health.append(results["model_health"])

In [31]:
# use plotly to visualize the model health over the bins
model_health_df = pd.DataFrame(model_health, columns=["model_health"])

model_health_df["bin"] = np.arange(0, len(model_health_df))

fig = px.line(model_health_df, x="bin", y="model_health", title="Model Health")
fig.show()