# Exercise 04 - Model Report

Welcome to the fourth hands-on exercise!

In this exercise, you will learn about the CyclOps model report, a report designed for clinicians, data scientists, and anyone else who wants to review information about a clinical ML model. By the end of this exercise, you will be able to:

1. Create, customize and use model reports for your clinical ML model
2. Use the backend report API to log ML model information to the model report

## Step 01 - Install CyclOps

CyclOps is available as a [Python package](https://pypi.org/project/pycyclops/) and can be installed using ``pip``. Note that we now install ``CyclOps`` with an extra dependency, ``xgboost``, since we will be using the [xgboost](https://xgboost.readthedocs.io/en/stable/python/python_api.html) library.

``Colab`` will ask you to restart the session, which is normal. Click ``Restart Session``, then re-run the cell to install ``CyclOps``.

**NOTE**: We uninstall ``cupy`` from the Colab runtime to avoid conflicts: ``CyclOps`` attempts to use ``cupy`` whenever it is installed, and this runtime does not support GPUs.

In [None]:
# ruff: noqa: E402
!pip uninstall cupy-cuda12x -y
!pip install 'pycyclops[xgboost]'
!pip install ucimlrepo
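
As a quick sanity check after the restart, the sketch below reports whether a `cupy` module can still be found in the environment. It uses only the standard library; the helper name `is_installed` is our own and is not part of CyclOps.

```python
from importlib.util import find_spec


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found, without importing it."""
    return find_spec(module_name) is not None


# On a CPU-only runtime where the uninstall step succeeded,
# this should report that cupy is no longer importable.
print("cupy importable:", is_installed("cupy"))
```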

## Step 02 - Learn about the different sections of the model report!

CyclOps offers a package for documenting the model through a model report. The `ModelCardReport` class is used to populate and generate the model report as an HTML file. The model report has the following sections:

#### Overview
Provides a high-level overview of how the model is doing (a quick glance at important metrics) and how it is doing over time (performance on several metrics and subgroups over time).

#### Datasets
High-level statistics of the training data, including changes in distribution over time.

#### Quantitative Analysis
This section contains additional detailed performance metrics of the model for different sets of the data and subpopulations.

#### Fairness Analysis
This section contains the fairness metrics of the model.

#### Model Details
This section contains descriptive metadata about the model such as the owners, version, license, etc.

#### Model Parameters
This section contains the technical details of the model such as the model architecture, training parameters, etc.

#### Considerations
This section contains descriptions of the considerations involved in developing and using the model such as the intended use, limitations, etc.
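
The `section_name` values used in this notebook (`datasets`, `model_details`, `considerations`) are lower-case, underscore-separated versions of the section headings. The sketch below applies that naming pattern to all seven headings. Treating the other four results as valid `section_name` values is an assumption, so check the API documentation before relying on them. `to_section_name` is a hypothetical helper, not a CyclOps function.

```python
SECTION_TITLES = [
    "Overview",
    "Datasets",
    "Quantitative Analysis",
    "Fairness Analysis",
    "Model Details",
    "Model Parameters",
    "Considerations",
]


def to_section_name(title: str) -> str:
    """Convert a heading such as 'Model Details' to 'model_details'."""
    return "_".join(title.lower().split())


for title in SECTION_TITLES:
    print(f"{title:<22} -> {to_section_name(title)}")
```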


Let's now create a model report object.

## Step 03 - Create a model report object and learn about the API methods.

In [None]:
from cyclops.report import ModelCardReport
from cyclops.report.plot.classification import ClassificationPlotter

In [None]:
report = ModelCardReport()

Let's take a look at the [API documentation](https://vectorinstitute.github.io/cyclops/api/reference/api/_autosummary/cyclops.report.report.ModelCardReport.html#cyclops.report.report.ModelCardReport) to learn more about the different API methods!

Now that we know about a few API methods, let's use one to log information about the dataset. We will use the same dataset as in the previous exercises.

In [None]:
import copy
import inspect
from datetime import date

import numpy as np
import pandas as pd
import plotly.express as px
from ucimlrepo import fetch_ucirepo

In [None]:
diabetes_130_data = fetch_ucirepo(id=296)
features = diabetes_130_data["data"]["features"]
targets = diabetes_130_data["data"]["targets"]
metadata = diabetes_130_data["metadata"]
variables = diabetes_130_data["variables"]

In [None]:
features

Let's document the dataset using the `log_dataset` method, which takes the following arguments:
- description: A description of the dataset.
- citation: The citation for the dataset.
- link: A link to a resource for the dataset.
- license_id: The SPDX license identifier for the dataset.
- version: The version of the dataset.
- features: A list of features in the dataset.
- split: The split of the dataset (train, test, validation, etc.).
- sensitive_features: A list of sensitive features used to train/evaluate the model.
- sensitive_feature_justification: A justification for the sensitive features used to train/evaluate the model.
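
Before logging, it can help to confirm that every name you plan to pass as `sensitive_features` is actually a column in the dataset. The sketch below is one way to do that in plain Python. `missing_features` is a hypothetical helper, and the short column list is a toy stand-in for `list(features.columns)`.

```python
def missing_features(columns, sensitive_features):
    """Return the sensitive features that are absent from the column list."""
    available = set(columns)
    return [name for name in sensitive_features if name not in available]


# Toy column list for illustration; in the notebook, pass list(features.columns).
columns = ["race", "gender", "age", "time_in_hospital"]

print(missing_features(columns, ["gender", "age", "race"]))  # []
print(missing_features(columns, ["gender", "sex"]))  # ['sex']
```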

In [None]:
report.log_dataset(
    description=metadata["abstract"],
    citation=inspect.cleandoc(
        """
        @article{strack2014impact,
          title={Impact of HbA1c measurement on hospital readmission rates: analysis of 70,000 clinical database patient records},
          author={Strack, Beata and DeShazo, Jonathan P and Gennings, Chris and Olmo, Juan L and Ventura, Sebastian and Cios, Krzysztof J and Clore, John N and others},
          journal={BioMed research international},
          volume={2014},
          year={2014},
          publisher={Hindawi}
        }
    """,
    ),
    link=metadata["repository_url"],
    license_id="CC0-1.0",
    version="Version 1",
    features=list(features.columns),
    sensitive_features=["gender", "age", "race"],
    sensitive_feature_justification="Demographic information like age and gender \
        often have a strong correlation with health outcomes. For example, older \
        patients are more likely to have a higher risk of readmission.",
)
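
A note on the long strings in the cell above: a backslash line continuation inside a string literal keeps the indentation of the continued line as part of the text. HTML rendering usually collapses the extra spaces, but if you reuse the text elsewhere, you may want to normalize it first. The sketch below shows one way to do that. `collapse_whitespace` is a hypothetical helper and the sentence is shortened for illustration.

```python
def collapse_whitespace(text: str) -> str:
    """Replace every run of whitespace (spaces, tabs, newlines) with one space."""
    return " ".join(text.split())


# The backslash joins the two source lines, but the leading spaces of the
# second line stay inside the string.
justification = "Older patients are more likely \
        to have a higher risk of readmission."

print(repr(justification))
print(repr(collapse_whitespace(justification)))
```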

### Gender distribution

In [None]:
fig = px.pie(features, names="gender")

fig.update_layout(
    title="Gender Distribution",
)

fig.show()

**Add the figure to the report**

In [None]:
report.log_plotly_figure(
    fig=fig,
    caption="Gender Distribution",
    section_name="datasets",
)

### Age distribution

In [None]:
fig = px.histogram(features, x="age")
fig.update_layout(
    title="Age Distribution",
    xaxis_title="Age",
    yaxis_title="Count",
    bargap=0.2,
)

fig.show()

**Add the figure to the report**

In [None]:
report.log_plotly_figure(
    fig=fig,
    caption="Age Distribution",
    section_name="datasets",
)
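
The age histogram treats `age` as a categorical axis. Assuming the column holds ten-year bracket labels such as `[70-80)`, the bars appear in whatever order the plotting library chooses unless an explicit order is supplied. The sketch below sorts bracket labels by their lower bound. The helper names and the sample list are illustrative only.

```python
def bracket_lower_bound(label: str) -> int:
    """Extract the lower bound from a label such as '[70-80)'."""
    return int(label.strip("[)").split("-")[0])


def ordered_brackets(labels):
    """Return the unique bracket labels sorted by their lower bound."""
    return sorted(set(labels), key=bracket_lower_bound)


# Toy sample; in the notebook you would pass features["age"].
sample = ["[70-80)", "[0-10)", "[50-60)", "[70-80)"]
age_order = ordered_brackets(sample)
print(age_order)  # ['[0-10)', '[50-60)', '[70-80)']

# The ordered list could then be passed to Plotly Express, for example:
# px.histogram(features, x="age", category_orders={"age": age_order})
```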

Now that we have logged information about the dataset, we can also log information about the model we trained in the previous exercise.


Let's start with populating the model details section, which includes the following fields by default:
- description: A high-level description of the model and its usage for a general audience.
- version: The version of the model.
- owners: The individuals or organizations that own the model.
- license: The license under which the model is made available.
- citation: The citation for the model.
- references: Links to resources that are relevant to the model.
- path: The path to where the model is stored.
- regulatory_requirements: The regulatory requirements that are relevant to the model.

We can add additional fields to the model details section by passing a dictionary to the `log_from_dict` method and specifying the section name as `model_details`. You can also use the `log_descriptor` method to add a new field object with a `description` attribute to any section of the model card.

In [None]:
report.log_from_dict(
    data={
        "name": "Readmission Prediction Model",
        "description": "The model was trained on the Diabetes 130-US Hospitals for Years 1999-2008 \
        dataset to predict risk of readmission within 30 days of discharge.",
    },
    section_name="model_details",
)

report.log_version(
    version_str="0.0.1",
    date=str(date.today()),
    description="Initial Release",
)
report.log_owner(
    name="CyclOps Team",
    contact="vectorinstitute.github.io/cyclops/",
    email="cyclops@vectorinstitute.ai",
)
report.log_license(identifier="Apache-2.0")
report.log_reference(
    link="https://xgboost.readthedocs.io/en/stable/python/python_api.html",  # noqa: E501
)
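
The `date=str(date.today())` argument above passes the date as text. Converting a `datetime.date` with `str` gives the ISO 8601 form `YYYY-MM-DD`, the same as `isoformat()`. The sketch below uses a fixed, illustrative date so the result is reproducible.

```python
from datetime import date

# Fixed example date for illustration; it is not a real release date.
release_date = date(2024, 6, 23)

as_text = str(release_date)
print(as_text)  # 2024-06-23

# str() and isoformat() agree, and the text parses back to the same date.
same_as_iso = as_text == release_date.isoformat()
round_trip = date.fromisoformat(as_text)
print(same_as_iso, round_trip == release_date)  # True True
```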

Next, let's populate the considerations section, which includes the following fields by default:
- users: The intended users of the model.
- use_cases: The use cases for the model. These could be primary, downstream or out-of-scope use cases.
- fairness_assessment: A description of the benefits and harms of the model for different groups as well as the steps taken to mitigate the harms.
- ethical_considerations: The risks associated with using the model and the steps taken to mitigate them. This can be populated using the `log_risk` method.



In [None]:
report.log_from_dict(
    data={
        "users": [
            {"description": "Hospitals"},
            {"description": "Clinicians"},
        ],
    },
    section_name="considerations",
)
report.log_user(description="ML Engineers")
report.log_use_case(
    description="Predicting risk of readmission.",
    kind="primary",
)
report.log_use_case(
    description="Predicting risk of pathologies and conditions other\
    than risk of readmission.",
    kind="out-of-scope",
)
report.log_fairness_assessment(
    affected_group="sex, age",
    benefit="Improved health outcomes for patients.",
    harm="Biased predictions for patients in certain groups (e.g. older patients) \
        may lead to worse health outcomes.",
    mitigation_strategy="We will monitor the performance of the model on these groups \
        and retrain the model if the performance drops below a certain threshold.",
)
report.log_risk(
    risk="The model may be used to make decisions that affect the health of patients.",
    mitigation_strategy="The model should be continuously monitored for performance \
        and retrained if the performance drops below a certain threshold.",
)
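
The mitigation strategies above mention retraining when performance "drops below a certain threshold". The sketch below shows what such a check could look like in its simplest form. It is not part of CyclOps. The function name, the metric values and the threshold are all made up for illustration.

```python
def needs_retraining(metric_history, threshold):
    """Return True if the most recent metric value is below the threshold."""
    if not metric_history:
        # Nothing has been measured yet, so there is no evidence of a drop.
        return False
    return metric_history[-1] < threshold


# Made-up evaluation scores, oldest first.
scores = [0.81, 0.80, 0.74]

print(needs_retraining(scores, threshold=0.75))  # True
print(needs_retraining(scores[:2], threshold=0.75))  # False
```

A real monitoring setup would likely track each subgroup separately and look at more than the latest value, but the comparison at its core is the same.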

## Step 04 - Export the report and view it in a browser tab.

In [None]:
report_path = report.export(
    output_filename="readmission_report_periodic.html",
    synthetic_timestamp="2024-06-23",
)
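
To view the exported file, you need a location the browser can open. Assuming `report_path` is the path of the generated HTML file, the sketch below converts a path into a `file://` URI using only the standard library. `report_uri` is a hypothetical helper, and the demonstration uses a throwaway file rather than a real report.

```python
import tempfile
from pathlib import Path


def report_uri(report_path: str) -> str:
    """Return a file:// URI for an existing report file."""
    path = Path(report_path).resolve()
    if not path.is_file():
        raise FileNotFoundError(f"No report found at {path}")
    return path.as_uri()


# Demonstration with a temporary stand-in for the exported report.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_file = Path(tmp_dir) / "demo_report.html"
    demo_file.write_text("<html><body>demo</body></html>")
    demo_uri = report_uri(str(demo_file))
    print(demo_uri)
```

On a local machine, `webbrowser.open(report_uri(report_path))` opens the file in a browser tab. In a hosted notebook such as Colab, the file lives on the remote runtime, so you would typically download it first.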