# Exercise 01 - Getting Started

Welcome to the first hands-on exercise to get started with CyclOps!

We will go over the installation of CyclOps and introduce the core packages and APIs in this exercise. At the end of this exercise, you will be able to:

1. Install the Python package for CyclOps
2. Understand the core packages within CyclOps and their purpose
3. Navigate the CyclOps API documentation and tutorials
4. Explore an example clinical dataset and understand an important clinical prediction task, which later exercises will tackle using ``CyclOps`` APIs

## Step 01 - Install CyclOps

CyclOps is available as a [Python package](https://pypi.org/project/pycyclops/) and can be installed using ``pip``. We install ``CyclOps`` with the extra ``xgboost`` dependency, since we will be using the [xgboost](https://xgboost.readthedocs.io/en/stable/python/python_api.html) library.

``Colab`` will ask you to restart the session, which is normal. Click on ``Restart Session`` and re-run the cell to install ``CyclOps``.

**NOTE**: We uninstall ``cupy`` from the Colab runtime because ``CyclOps`` attempts to use ``cupy`` whenever it is installed, and this runtime does not support GPUs.

In [None]:
!pip uninstall cupy-cuda12x -y
!pip install 'pycyclops[xgboost]'
!pip install ucimlrepo

## Step 02 - ``CyclOps`` core packages

``CyclOps`` has a few core packages that support functionality used for evaluation and monitoring. We will learn a bit more about the packages which we will use in today's workshop.

1. ``data`` - The ``data`` package supports loading and processing data into features for your ML model. More importantly, it supports slicing of data across sub-groups which can be pretty useful for evaluating your ML model across patient subpopulations.
2. ``evaluate`` - The ``evaluate`` package supports evaluation of your ML model. The package contains a ``metrics`` sub-package which supports common ML performance metrics such as ``Accuracy``, ``Sensitivity`` and ``Specificity``. Furthermore, the ``evaluate`` package also allows calculating fairness metrics which can be used to compare performance of sub-groups with respect to a reference group.
3. ``report`` - The ``report`` package supports the creation of model monitoring reports. The package allows users to customize the reports to their use case.

There are a few packages that support ML model development and benchmarking:

4. ``models`` - The ``models`` package contains baseline model implementations using `scikit-learn`, `xgboost` and `pytorch` libraries. The package allows the user to easily train and evaluate models.
5. ``tasks`` - The ``tasks`` package contains a few classes that implement classification tasks. These can be used for classification using tabular or image data, and are used to demonstrate example use cases.
6. ``utils`` - The ``utils`` package contains useful utility functions for logging, development, saving and loading data.

In [None]:
import pkgutil
import cyclops

for package in pkgutil.iter_modules(cyclops.__path__):
    print(package.name)

## Step 03 - Explore CyclOps user guide, API documentation and tutorials

CyclOps provides detailed documentation, hosted via its GitHub repository. Start from the [landing page](https://vectorinstitute.github.io/cyclops/); from there, you can navigate to the [API documentation](https://vectorinstitute.github.io/cyclops/api/).

The API documentation starts with user guides that cover:

1. Installation
2. Evaluation
3. Model Report
4. Monitoring

We will be covering all of the above tasks in today's workshop. You can also refer to the user guides later, when you wish to use CyclOps on your own.

## Step 04 - Explore [Diabetes 130-US Hospitals for Years 1999-2008](https://archive.ics.uci.edu/dataset/296/diabetes+130-us+hospitals+for+years+1999-2008) dataset

The [UC Irvine Machine Learning Repository](https://archive.ics.uci.edu/) provides several public datasets for research. It also provides a handy Python package called [ucimlrepo](https://github.com/uci-ml-repo/ucimlrepo) for downloading datasets. We already installed this package, and will now use it to fetch the [Diabetes 130-US Hospitals for Years 1999-2008](https://archive.ics.uci.edu/dataset/296/diabetes+130-us+hospitals+for+years+1999-2008) dataset.

### Readmission prediction

The dataset represents ten years (1999-2008) of clinical care at 130 US hospitals and integrated delivery networks. It includes over 50 features representing patient and hospital outcomes. Information was extracted from the database for encounters that satisfied the following criteria:

1. It is an inpatient encounter (a hospital admission).
2. It is a diabetic encounter, that is, one during which any kind of diabetes was entered into the system as a diagnosis.
3. The length of stay was at least 1 day and at most 14 days.
4. Laboratory tests were performed during the encounter.
5. Medications were administered during the encounter.

The dataset also records the days to inpatient readmission in the ``readmitted`` variable, which takes the values ``<30`` (the patient was readmitted in less than 30 days), ``>30`` (the patient was readmitted after more than 30 days) and ``NO`` (no record of readmission).

Using this information we could predict early readmission of the patient within 30 days of discharge. This problem is important for the following reasons:

1. Despite high-quality evidence showing improved clinical outcomes for diabetic patients who receive various preventive and therapeutic interventions, many patients do not receive them. This can be partially attributed to arbitrary diabetes management in hospital environments, which fail to attend to glycemic control.
2. Failure to provide proper diabetes care not only increases the managing costs for the hospitals (as the patients are readmitted) but also impacts the morbidity and mortality of the patients, who may face complications associated with diabetes.

In [None]:
from ucimlrepo import fetch_ucirepo

diabetes_130_data = fetch_ucirepo(
    id=296
)  # This ID specifically corresponds to the Diabetes 130 dataset
features = diabetes_130_data["data"]["features"]
targets = diabetes_130_data["data"]["targets"]
metadata = diabetes_130_data["metadata"]
variables = diabetes_130_data["variables"]

In [None]:
metadata

Let's visualize the distribution of the data with respect to a few key variables such as Age, Gender and the prediction outcome of interest. We will use the popular [plotly](https://plotly.com/python/) library to achieve this.

In [None]:
import plotly.express as px
import plotly.graph_objects as go

### Distribution across gender

We see a pretty balanced distribution across Male and Female genders. There is a small number of samples which seem to have missing/invalid values.

In [None]:
fig = px.pie(features, names="gender")
fig.update_layout(
    title="Gender Distribution",
)
fig.show()

###  Distribution across age

We see a slightly skewed, roughly normal distribution across age brackets. The 70-80 age group contains the largest number of patients.

In [None]:
fig = px.histogram(features, x="age")
fig.update_layout(
    title="Age Distribution",
    xaxis_title="Age",
    yaxis_title="Count",
    bargap=0.2,
)
fig.show()

###  Distribution across race

We see a very unbalanced distribution across races, with very few samples for Asian and Hispanic populations. This distribution partly reflects the patient population and hence the demographics of the region. However, it may also indicate a bias in the data that stems from socio-demographic inequalities (e.g. unequal access to healthcare).

In [None]:
fig = px.histogram(features, x="race")
fig.update_layout(
    title="Race Distribution",
    xaxis_title="Race",
    yaxis_title="Count",
    bargap=0.2,
)
fig.show()

### Missing values

Let's see how much missing data there is, and which variables have the most missing values.

In [None]:
null_counts = features.isnull().sum()  # number of nulls per column
null_counts = null_counts[null_counts > 0]  # keep columns with missing values
fig = go.Figure(data=[go.Bar(x=null_counts.index, y=null_counts.values)])
fig.update_layout(
    title="Number of Null Values per Column",
    xaxis_title="Columns",
    yaxis_title="Number of Null Values",
)
fig.show()
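
Raw null counts can be easier to interpret as a fraction of rows. The sketch below computes the missing-value rate per column on a small made-up DataFrame (the `toy` frame and its values are invented for illustration and are not taken from the dataset); the same expression should also work on `features`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `features`; all values are invented for illustration.
toy = pd.DataFrame(
    {
        "weight": [np.nan, np.nan, np.nan, "[75-100)"],
        "race": ["Caucasian", None, "Asian", "Hispanic"],
        "age": ["[70-80)", "[60-70)", "[50-60)", "[70-80)"],
    }
)

# The mean of the boolean null mask is the fraction of missing rows per column.
missing_rate = toy.isnull().mean()
# Keep only columns with at least one missing value, most-missing first.
missing_rate = missing_rate[missing_rate > 0].sort_values(ascending=False)
print(missing_rate)
```

A column that is missing in most rows is often a candidate for removal rather than imputation.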

###  Distribution across outcome (readmission)

In [None]:
fig = px.pie(targets, names="readmitted")
fig.update_traces(textinfo="percent+label")
fig.update_layout(title_text="Outcome Distribution")
fig.update_traces(
    # Single-line template avoids embedding stray indentation in the hover text.
    hovertemplate="Outcome: %{label}<br>Count: %{value}<br>Percent: %{percent}",
)
fig.show()

That's the end of the first exercise!

A summary of what we learnt:
1. Installed the CyclOps Python package
2. Learnt about the core packages within CyclOps
3. Learnt about where to find the API documentation and tutorials
4. Explored a clinical dataset to understand the distribution of data across variables

# Exercise 02 - Training an ML model

Welcome to the second hands-on exercise!

We will use the dataset introduced in the first exercise to train an ML model! At the end of this exercise, you will be able to:

1. Create training and validation datasets using the [🤗 Datasets](https://github.com/huggingface/datasets) library
2. Train a baseline ML model using CyclOps

First, we will transform the readmitted variable into binary 0/1 labels!

In [None]:
def transform_label(value):
    """Transform string labels of readmission into 0/1 binary labels.

    Parameters
    ----------
    value: str
        Input value

    Returns
    -------
    int
        0 if not readmitted or if greater than 30 days, 1 if less than 30 days

    """
    if value in ["NO", ">30"]:
        return 0
    if value == "<30":
        return 1

    raise ValueError("Unexpected value for readmission!")


df = features
targets.loc[:, "readmitted"] = targets["readmitted"].apply(transform_label)
df.loc[:, "readmitted"] = targets["readmitted"]

Due to the large size of the dataset (around 100k examples), we will choose a small subset for training an ML model!

In [None]:
df = df[0:1000]

We previously looked at the missingness in the data. Let's remove features that are entirely NaN or have just a single unique value, since they carry no information for the model!

In [None]:
features_to_remove = []
for col in df:
    if len(df[col].value_counts()) <= 1:
        features_to_remove.append(col)
df = df.drop(columns=features_to_remove)
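
To see why this filter catches both cases, here is a minimal sketch on an invented DataFrame (the `toy` frame and its column names are hypothetical). It uses `nunique()`, which, like `value_counts()`, ignores NaN by default, so an all-NaN column reports zero unique values.

```python
import numpy as np
import pandas as pd

# Invented data: one all-NaN column, one constant column, one useful column.
toy = pd.DataFrame(
    {
        "all_nan": [np.nan, np.nan, np.nan],
        "constant": ["No", "No", "No"],
        "varied": [1, 2, 2],
    }
)

# nunique() skips NaN, so "all_nan" has 0 unique values and "constant" has 1.
n_unique = toy.nunique()
uninformative = n_unique[n_unique <= 1].index.tolist()
cleaned = toy.drop(columns=uninformative)
print(uninformative, list(cleaned.columns))
```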

It is also important to understand the class imbalance. We compute the ratio of negative to positive examples and later use it when training our binary classifier, so that the class with fewer examples is weighted accordingly.

In [None]:
class_counts = df["readmitted"].value_counts()
class_ratio = class_counts[0] / class_counts[1]
print(class_ratio, class_counts)
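
As a worked example of this ratio, the sketch below uses invented labels with nine negatives for every positive. The XGBoost documentation suggests the ratio of negative to positive instances as a typical value for `scale_pos_weight`, which is how the ratio is used later in this exercise.

```python
import pandas as pd

# Invented labels: 900 negatives (0) and 100 positives (1).
labels = pd.Series([0] * 900 + [1] * 100, name="readmitted")

counts = labels.value_counts()
# .loc makes the lookup explicitly label-based (class 0 and class 1).
ratio = counts.loc[0] / counts.loc[1]
print(ratio)
```

With this ratio, each positive example would count roughly as much as nine negative examples during training.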

From the features in the dataset, we select all of them except the label to train the model!

In [None]:
features_list = list(df.columns)
features_list.remove("readmitted")
features_list = sorted(features_list)

### Identifying feature types

The CyclOps `TabularFeatures` class helps to identify feature types, an essential step before preprocessing the data. Understanding feature types (numerical/categorical/binary) allows us to apply appropriate preprocessing steps for each type.

In [None]:
from cyclops.data.df.feature import TabularFeatures

In [None]:
tab_features = TabularFeatures(
    data=df.reset_index(),
    features=features_list,
    by="index",
    targets="readmitted",
)
print(tab_features.types)
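
To build intuition for what feature typing involves, here is a deliberately simple heuristic written with plain pandas. It is a hypothetical sketch and not how `TabularFeatures` is implemented; the function name `guess_feature_type`, the type labels and the toy values are all assumptions made for illustration.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def guess_feature_type(series: pd.Series) -> str:
    """Hypothetical heuristic; NOT the CyclOps implementation."""
    n_unique = series.dropna().nunique()
    if n_unique == 2:
        return "binary"  # exactly two distinct non-null values
    if is_numeric_dtype(series):
        return "numeric"  # numeric dtype with more than two values
    return "categorical"  # everything else (e.g. strings)


# Invented values, loosely modelled on columns in the dataset.
toy = pd.DataFrame(
    {
        "num_medications": [3, 12, 7, 20],
        "change": ["Ch", "No", "No", "Ch"],
        "race": ["Caucasian", "Asian", "Hispanic", "Caucasian"],
    }
)
guessed = {col: guess_feature_type(toy[col]) for col in toy.columns}
print(guessed)
```

A real implementation has to handle more cases (for example, numeric codes that are really categories), which is why it helps to inspect the printed types before building the preprocessor.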

### Creating data preprocessors

We create a data preprocessor using sklearn's `ColumnTransformer`. This helps in applying different preprocessing steps to different columns in the dataframe. For instance, binary features might be processed differently from numeric features.

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

In [None]:
numeric_transformer = Pipeline(
    steps=[("imputer", SimpleImputer(strategy="mean")), ("scaler", MinMaxScaler())],
)

binary_transformer = Pipeline(
    steps=[("imputer", SimpleImputer(strategy="most_frequent"))],
)

In [None]:
numeric_features = sorted(tab_features.features_by_type("numeric"))
numeric_indices = [
    df[features_list].columns.get_loc(column) for column in numeric_features
]
print(numeric_features)

In [None]:
binary_features = sorted(tab_features.features_by_type("binary"))
binary_features.remove("readmitted")
ordinal_features = sorted(
    tab_features.features_by_type("ordinal")
    + ["medical_specialty", "diag_1", "diag_2", "diag_3"]
)
binary_indices = [
    df[features_list].columns.get_loc(column) for column in binary_features
]
ordinal_indices = [
    df[features_list].columns.get_loc(column) for column in ordinal_features
]
print(binary_features, ordinal_features)

In [None]:
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_indices),
        (
            "onehot",
            OneHotEncoder(handle_unknown="ignore", sparse_output=False),
            binary_indices + ordinal_indices,
        ),
    ],
    remainder="passthrough",
)

### Creating Hugging Face Dataset

We convert our processed Pandas dataframe into a Hugging Face dataset, a powerful and easy-to-use data format which is also compatible with CyclOps modules. The dataset is then split into train and test sets (80:20 split).

In [None]:
from datasets import Dataset
from datasets.features import ClassLabel

In [None]:
RANDOM_SEED = 85
TRAIN_SIZE = 0.8

In [None]:
dataset = Dataset.from_pandas(df)
dataset.cleanup_cache_files()
print(dataset)

In [None]:
dataset = dataset.cast_column("readmitted", ClassLabel(num_classes=2))
dataset = dataset.train_test_split(
    train_size=TRAIN_SIZE,
    stratify_by_column="readmitted",
    seed=RANDOM_SEED,
)
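
Stratifying by the label keeps the class ratio the same in both splits, which matters for an imbalanced outcome. The sketch below illustrates the idea with scikit-learn's `train_test_split` instead of the Hugging Face API used above, on invented labels, so it is an analogy and not a description of how `datasets` performs the split.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Invented labels: 80 negatives and 20 positives.
labels = [0] * 80 + [1] * 20
rows = list(range(len(labels)))

train_rows, test_rows, train_labels, test_labels = train_test_split(
    rows,
    labels,
    train_size=0.8,
    stratify=labels,  # preserve the 80:20 class ratio in each split
    random_state=85,
)
print(Counter(train_labels), Counter(test_labels))
```

Both splits keep four negatives for every positive, so evaluation on the test split reflects the same class balance the model was trained on.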

## Step 03 - Create model and train

### Model Creation

The CyclOps model registry allows for straightforward creation and selection of models. This registry maintains a list of pre-configured models, which can be instantiated with a single line of code. Here we use an [XGBoost classifier](https://xgboost.readthedocs.io/en/stable/python/python_api.html) to fit a binary classification model. Model configurations can be passed to `create_model` using the parameters of the ``XGBClassifier``.

In [None]:
from cyclops.models.catalog import create_model

In [None]:
model_name = "xgb_classifier"
model = create_model(model_name, random_state=123)

### Task Creation

We use CyclOps tasks to define our model's task (in this case, readmission prediction), train the model, make predictions, and evaluate performance. CyclOps task classes encapsulate the entire ML pipeline into a single, cohesive structure, making the process smooth and easy to manage.

In [None]:
from cyclops.tasks import BinaryTabularClassificationTask

In [None]:
readmission_prediction_task = BinaryTabularClassificationTask(
    {model_name: model},
    task_features=features_list,
    task_target="readmitted",
)

In [None]:
readmission_prediction_task.list_models()

### Training

If `best_model_params` is passed to the `train` method, a hyperparameter search is run and the best model is selected. Each entry in `best_model_params` lists the candidate values used to build the parameter grid.

Note that the data preprocessor needs to be passed to the task's methods if the Hugging Face dataset is not already preprocessed.

In [None]:
best_model_params = {
    "n_estimators": [250, 500],
    "learning_rate": [0.1],
    "max_depth": [5],
    "reg_lambda": [0, 1, 10],
    "colsample_bytree": [0.8],
    "gamma": [0, 1],
    "method": "random",
    "scale_pos_weight": [int(class_ratio)],
}
readmission_prediction_task.train(
    dataset["train"],
    model_name=model_name,
    transforms=preprocessor,
    best_model_params=best_model_params,
)
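
To get a feel for the size of this search space, the sketch below enumerates the same candidate values with scikit-learn's `ParameterGrid`. Two assumptions are made here: the `"method"` key is left out because it appears to select the search strategy and is not itself a hyperparameter, and `int(class_ratio)` is replaced by a placeholder value of 8.

```python
from sklearn.model_selection import ParameterGrid

# Candidate values copied from the cell above, without the "method" key.
# The scale_pos_weight value of 8 is a placeholder for int(class_ratio).
grid = {
    "n_estimators": [250, 500],
    "learning_rate": [0.1],
    "max_depth": [5],
    "reg_lambda": [0, 1, 10],
    "colsample_bytree": [0.8],
    "gamma": [0, 1],
    "scale_pos_weight": [8],
}

# Every combination of one value per hyperparameter.
combinations = list(ParameterGrid(grid))
print(len(combinations))
print(combinations[0])
```

Only `n_estimators`, `reg_lambda` and `gamma` have more than one candidate, so the full grid has 2 x 3 x 2 = 12 combinations. A random search would typically sample a subset of these instead of trying them all.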

This is the end of the second exercise!

A summary of what we learnt:
1. Created training and validation datasets using the [🤗 Datasets](https://github.com/huggingface/datasets) library
2. Trained a baseline ML model using CyclOps