# Exercise 01 - Getting Started

Welcome to the first hands-on exercise to get started with CyclOps!

In this exercise, we will go over the installation of CyclOps and introduce its core packages and APIs. At the end of this exercise, you will be able to:

1. Install the Python package for CyclOps
2. Understand the core packages within CyclOps and their purpose
3. Navigate the CyclOps API documentation and tutorials
4. Explore an example clinical dataset which we will use in later exercises

## Step 01 - Install CyclOps

CyclOps is available as a [Python package](https://pypi.org/project/pycyclops/) and can be installed using ``pip``.

After the installation, ``Colab`` may ask you to restart the session, which is normal. Click on ``Restart Session`` and re-run the cell to install ``CyclOps``.

**NOTE**: We uninstall ``cupy`` from the Colab runtime to avoid conflicts: ``CyclOps`` attempts to use ``cupy`` whenever it is installed, and the runtime used here does not support GPUs.

In [None]:
!pip uninstall cupy-cuda12x -y
!pip install pycyclops
!pip install ucimlrepo
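
After restarting the session, it can be useful to confirm what actually ended up in the environment. The following is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical, and the distribution names are taken from the `pip` commands above.

```python
import importlib.util
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Distribution names as used in the pip commands of this step.
for dist in ("pycyclops", "ucimlrepo", "cupy-cuda12x"):
    print(f"{dist}: {installed_version(dist) or 'not installed'}")

# The importable module is named `cupy`; find_spec returns None when it is absent.
print("cupy importable:", importlib.util.find_spec("cupy") is not None)
```

If `cupy` is still reported as importable, re-running the uninstall command and restarting the session is a reasonable next step.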

## Step 02 - ``CyclOps`` core packages

``CyclOps`` has a few core packages that support functionality used for evaluation and monitoring. We will learn a bit more about the packages which we will use in today's workshop.

1. ``data`` - The ``data`` package supports loading and processing data into features for your ML model. More importantly, it supports slicing of data across sub-groups which can be pretty useful for evaluating your ML model across patient subpopulations.
2. ``evaluate`` - The ``evaluate`` package supports evaluation of your ML model. The package contains a ``metrics`` sub-package which supports common ML performance metrics such as ``Accuracy``, ``Sensitivity`` and ``Specificity``. Furthermore, the ``evaluate`` package also allows calculating fairness metrics which can be used to compare performance of sub-groups with respect to a reference group.
3. ``report`` - The ``report`` package supports the creation of model monitoring reports. The package allows users to customize the reports to their use case.
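
To make the idea of sliced evaluation concrete, here is a small sketch in plain Python, written without CyclOps and not representing its API. It computes sensitivity and specificity per sub-group on invented toy data and then compares each group's sensitivity to a reference group. The function names, the toy labels and the choice of `"Male"` as the reference group are all assumptions made for illustration.

```python
from collections import defaultdict


def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels encoded as 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


def metrics_by_group(y_true, y_pred, groups):
    """Compute the metrics separately for every value found in `groups`."""
    buckets = defaultdict(lambda: ([], []))
    for t, p, g in zip(y_true, y_pred, groups):
        buckets[g][0].append(t)
        buckets[g][1].append(p)
    return {g: sensitivity_specificity(ts, ps) for g, (ts, ps) in buckets.items()}


# Invented toy data: four samples per sub-group.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 0, 1, 1, 1, 0]
groups = ["Female"] * 4 + ["Male"] * 4

per_group = metrics_by_group(y_true, y_pred, groups)

# Compare each group's sensitivity with that of a chosen reference group.
reference = "Male"
ref_sensitivity, _ = per_group[reference]
for group, (sens, spec) in per_group.items():
    ratio = sens / ref_sensitivity
    print(f"{group}: sensitivity={sens:.2f}, specificity={spec:.2f}, ratio={ratio:.2f}")
```

A ratio far from 1 signals that the model performs differently for that sub-group than for the reference group, which is the kind of disparity the fairness metrics are meant to surface.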

There are a few packages that support ML model development and benchmarking:

4. ``models`` - The ``models`` package contains baseline model implementations using `scikit-learn`, `xgboost` and `pytorch` libraries. The package allows the user to easily train and evaluate models.
5. ``tasks`` - The ``tasks`` package contains a few classes that implement classification tasks. These can be used for classification using tabular or image data, and are used to demonstrate example use cases.
6. ``utils`` - The ``utils`` package contains useful utility functions for logging, development, saving and loading data.

In [None]:
import pkgutil
import cyclops

for package in pkgutil.iter_modules(cyclops.__path__):
    print(package.name)
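
The same `pkgutil` approach can also tell sub-packages apart from plain modules, since each entry carries an `ispkg` flag. The sketch below demonstrates this on the standard-library `email` package so that it runs without CyclOps installed; the helper name `list_submodules` is hypothetical.

```python
import email
import pkgutil


def list_submodules(package):
    """Return (name, is_package) pairs for the direct children of a package."""
    return [(m.name, m.ispkg) for m in pkgutil.iter_modules(package.__path__)]


for name, is_pkg in list_submodules(email):
    kind = "package" if is_pkg else "module"
    print(f"{name} ({kind})")
```

Passing `cyclops` instead of `email` should list the CyclOps packages described in this step.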

## Step 03 - Explore CyclOps user guide, API documentation and tutorials

CyclOps provides detailed documentation, hosted through its GitHub repository. Start from the [landing page](https://vectorinstitute.github.io/cyclops/), from which you can navigate to the [API documentation](https://vectorinstitute.github.io/cyclops/api/).

The API documentation starts with user guides that cover:

1. Installation
2. Evaluation
3. Model Report
4. Monitoring

We will cover all of the above topics in today's workshop; however, you can refer to the user guides whenever you wish to use CyclOps on your own.

## Step 04 - Explore [Diabetes 130-US Hospitals for Years 1999-2008](https://archive.ics.uci.edu/dataset/296/diabetes+130-us+hospitals+for+years+1999-2008) dataset

The [UC Irvine Machine Learning Repository](https://archive.ics.uci.edu/) provides several public datasets for research. It also provides a handy Python package called [ucimlrepo](https://github.com/uci-ml-repo/ucimlrepo) for downloading datasets. We already installed this package, and will now use it to fetch the [Diabetes 130-US Hospitals for Years 1999-2008](https://archive.ics.uci.edu/dataset/296/diabetes+130-us+hospitals+for+years+1999-2008) dataset.

In [None]:
from ucimlrepo import fetch_ucirepo

diabetes_130_data = fetch_ucirepo(
    id=296
)  # This ID specifically corresponds to the Diabetes 130 dataset
features = diabetes_130_data["data"]["features"]
targets = diabetes_130_data["data"]["targets"]
metadata = diabetes_130_data["metadata"]
variables = diabetes_130_data["variables"]

In [None]:
metadata

Let's visualize the distribution of the data with respect to a few key variables such as Age, Gender and the prediction outcome of interest. We will use the popular [plotly](https://plotly.com/python/) library to achieve this.

In [None]:
import plotly.express as px
import plotly.graph_objects as go

### Distribution across gender

We see a fairly balanced distribution across Male and Female genders. A small number of samples appear to have missing or invalid values.

In [None]:
fig = px.pie(features, names="gender")
fig.update_layout(
    title="Gender Distribution",
)
fig.show()

### Distribution across age

We see a slightly skewed, roughly normal distribution across age brackets. The largest group of patients is in the 70-80 age bracket.

In [None]:
fig = px.histogram(features, x="age")
fig.update_layout(
    title="Age Distribution",
    xaxis_title="Age",
    yaxis_title="Count",
    bargap=0.2,
)
fig.show()
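
Because age is recorded as brackets rather than numbers, the bars are not guaranteed to appear in ascending order. The sketch below sorts bracket labels by their lower bound and finds the most frequent bracket. It assumes labels of the form `[70-80)`, which is not stated in this notebook and should be checked against `features["age"]`; the sample values and the helper name are invented for illustration.

```python
import re
from collections import Counter


def bracket_lower_bound(bracket):
    """Extract the lower bound from a bracket label such as '[70-80)'."""
    match = re.match(r"\[(\d+)-(\d+)\)", bracket)
    if match is None:
        raise ValueError(f"Unexpected age bracket: {bracket!r}")
    return int(match.group(1))


# Invented toy sample; in the notebook this would be features["age"].
ages = ["[70-80)", "[50-60)", "[70-80)", "[0-10)", "[60-70)", "[70-80)", "[50-60)"]

counts = Counter(ages)
ordered = sorted(counts, key=bracket_lower_bound)
for bracket in ordered:
    print(bracket, counts[bracket])

most_common_bracket, _ = counts.most_common(1)[0]
print("Most common bracket:", most_common_bracket)
```

The sorted list can then be passed to plotly, for example as `category_orders={"age": ordered}` in `px.histogram`, to fix the order of the x-axis.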

### Distribution across race

We see a very unbalanced distribution across races, with very few samples for Asian and Hispanic populations. This distribution partially reflects the patient population and hence the demographics of the region. However, it may also indicate a bias in the data stemming from socio-demographic inequalities (e.g. unequal access to healthcare).

In [None]:
fig = px.histogram(features, x="race")
fig.update_layout(
    title="Race Distribution",
    xaxis_title="Race",
    yaxis_title="Count",
    bargap=0.2,
)
fig.show()

### Missing values

Let's see how much missing data there is, and which variables have the most missing values.

In [None]:
# Count null values per column, then keep only columns with at least one
null_counts = features.isnull().sum()
null_counts = null_counts[null_counts > 0]
fig = go.Figure(data=[go.Bar(x=null_counts.index, y=null_counts.values)])
fig.update_layout(
    title="Number of Null Values per Column",
    xaxis_title="Columns",
    yaxis_title="Number of Null Values",
)
fig.show()
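
Raw counts are easier to interpret as a share of all rows. The following sketch shows the calculation in plain Python on a few invented rows; the column names are borrowed from the dataset, but the values and the helper names are made up for illustration.

```python
import math


def is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def missing_share(rows):
    """Return the fraction of missing values for each column that has any."""
    n_rows = len(rows)
    shares = {
        column: sum(is_missing(row[column]) for row in rows) / n_rows
        for column in rows[0]
    }
    return {column: share for column, share in shares.items() if share > 0}


# Invented toy rows; the real data lives in the `features` DataFrame.
rows = [
    {"race": "Caucasian", "weight": None, "age": "[70-80)"},
    {"race": None, "weight": None, "age": "[60-70)"},
    {"race": "AfricanAmerican", "weight": "[75-100)", "age": "[50-60)"},
    {"race": "Caucasian", "weight": float("nan"), "age": "[70-80)"},
]

shares = missing_share(rows)
for column, share in shares.items():
    print(f"{column}: {share:.0%} missing")
```

On the actual DataFrame, `features.isnull().mean()` yields the same per-column fraction in a single call.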

### Readmission distribution

The dataset represents ten years (1999-2008) of clinical care at 130 US hospitals and integrated delivery networks. It includes over 50 features representing patient and hospital outcomes. Information was extracted from the database for encounters that satisfied the following criteria:

1. It is an inpatient encounter (a hospital admission).
2. It is a diabetic encounter, that is, one during which any kind of diabetes was entered into the system as a diagnosis.
3. The length of stay was at least 1 day and at most 14 days.
4. Laboratory tests were performed during the encounter.
5. Medications were administered during the encounter.

The dataset also records ``Days to inpatient readmission`` in the ``readmitted`` target column. Values: ``<30`` if the patient was readmitted in less than 30 days, ``>30`` if the patient was readmitted in more than 30 days, and ``No`` for no record of readmission.

Using this information we could predict early readmission of the patient within 30 days of discharge. This problem is important for the following reasons:

1. Despite high-quality evidence showing improved clinical outcomes for diabetic patients who receive various preventive and therapeutic interventions, many patients do not receive them. This can be partially attributed to arbitrary diabetes management in hospital environments, which fail to attend to glycemic control.
2. Failure to provide proper diabetes care not only increases the managing costs for the hospitals (as the patients are readmitted) but also impacts the morbidity and mortality of the patients, who may face complications associated with diabetes.
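
Framing early readmission as a prediction problem means collapsing the three outcome values into a binary label. The sketch below shows one way to do this in plain Python; the helper name is hypothetical, the toy labels are invented, and the exact spelling of the "no readmission" value is an assumption that should be checked against `targets["readmitted"]`.

```python
from collections import Counter


def is_early_readmission(label):
    """Map a readmission label to 1 (readmitted within 30 days) or 0 (otherwise)."""
    return 1 if label.strip() == "<30" else 0


# Invented toy labels; in the notebook these would come from targets["readmitted"].
labels = ["NO", ">30", "<30", "NO", "<30", ">30", "NO", "NO"]
binary = [is_early_readmission(label) for label in labels]

positive_rate = sum(binary) / len(binary)
print(Counter(binary))
print(f"Early readmission rate: {positive_rate:.1%}")
```

With pandas, the equivalent would be along the lines of `(targets["readmitted"] == "<30").astype(int)`. Since only the `<30` value is matched, the result does not depend on how the other two values are spelled.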

In [None]:
fig = px.pie(targets, names="readmitted")
fig.update_traces(textinfo="percent+label")
fig.update_layout(title_text="Outcome Distribution")
fig.update_traces(
    # Single-line template avoids embedding indentation whitespace in the hover text
    hovertemplate="Outcome: %{label}<br>Count: %{value}<br>Percent: %{percent}",
)
fig.show()