# AI Fairness in Medicine: Integrated and Interactive Workshop

Let's take off our medical-staff hats and put on a new one, that of an AI developer, for a few hours.
By the end of this session, you will understand the basics of developing an AI solution for medical
use through a simple example. You will also understand what biases in AI development are, what
effects they have, and, in general terms, how to develop a fair AI solution.

There is no need to freak out even if you have zero experience in programming or AI! This material
(which data scientists usually call a *notebook*) gives you a quick tutorial on the machine learning
lifecycle and already does all the programming for you.

## Part I: What does it mean to train an AI model?

There is no single perfect way to solve a problem using AI. There are many frameworks and
methodologies that differ in the details, but the starting point is always to understand what the
problem is, what data we have, and what goal we want to achieve using AI.

For example, in medicine, the problem could be to diagnose patients given a set of symptoms, to
screen for diseases or disorders given a patient's imaging data, or to discover a new vaccine when
the next pandemic arrives.

The diagram below depicts one viewpoint of the machine learning lifecycle. It explains briefly the
process of building an AI model to solve the predefined problem and to achieve the predefined goal.

<img src="https://towardsdatascience.com/wp-content/uploads/2024/11/1_dlG-Cju5ke-DKp8DQ9hiA@2x.jpeg"
alt="ml-lifecycle" width="400"/>

*Source: https://towardsdatascience.com/wp-content/uploads/2024/11/1_dlG-Cju5ke-DKp8DQ9hiA@2x.jpeg*

1. **Data Collection**

   We start by gathering relevant medical information like patient records, lab results, and imaging
   scans. AI developers must understand the problem well, especially in the medical context, and
   should find relevant data, or the resources, to build the AI solution efficiently.

2. **Data Cleaning**

   The raw data often contains errors, missing values, or inconsistencies that need fixing. We
   carefully review and correct these issues to ensure the information is accurate and reliable for
   analysis.

3. **Feature Engineering**

   This is when *inductive bias* first comes in. Here we derive informative features that do not
   exist in the original data and add them to help the AI become more accurate. For example, we
   might calculate BMI from height and weight measurements, or track changes in lab values over time
   instead of giving the AI the raw values.

4. **Model Training**

    The AI model learns to capture patterns between the collected data and the task's goal after AI
    developers provide a set of constraints or rules. Examples include capturing the hidden
    relationship between a protein and its docking site, or recognizing the pattern between apneic
    episodes and the SpO2 signal. For some specific tasks, the AI model can also be a large language
    model (LLM), the kind we hear about every day!

5. **Evaluation**

    After the model has learned from the data, we rigorously test its performance using metrics
    doctors understand, like sensitivity and specificity. We also check for biases to ensure that
    the patterns the model learned are fair and accurate across different patient groups before
    clinical use.

6. **Deployment**

    Once validated, we integrate the model into hospital systems where it can assist with tasks like
    flagging abnormal test results. This is done carefully with proper staff training and monitoring
    protocols.

7. **Monitoring**

    After launch, we continuously track the model's performance in real-world use. Just like medical
    guidelines evolve, we update the models as we get new data or discover ways to improve them.

This ongoing cycle helps create AI tools that truly support clinical work while maintaining safety
and reliability. Your expertise remains essential for interpreting results and making final
decisions!
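
To make the feature engineering step more concrete, here is a minimal sketch of deriving BMI and a lab-value change with pandas. The table, its column names (`height_cm`, `weight_kg`, `creatinine_first`, `creatinine_last`), and its values are all made up for illustration; they are not taken from the workshop dataset.

```python
import pandas as pd

# Hypothetical patient records; column names and values are illustrative only.
patients = pd.DataFrame(
    {
        "height_cm": [170.0, 160.0, 180.0],
        "weight_kg": [70.0, 80.0, 65.0],
        "creatinine_first": [1.0, 1.2, 0.9],
        "creatinine_last": [1.4, 1.1, 0.9],
    }
)

# BMI = weight (kg) / height (m) squared
height_m = patients["height_cm"] / 100
patients["bmi"] = patients["weight_kg"] / height_m**2

# Change in a lab value over time instead of the two raw values
patients["creatinine_change"] = (
    patients["creatinine_last"] - patients["creatinine_first"]
)

print(patients[["bmi", "creatinine_change"]].round(2))
```

The two new columns carry clinical knowledge (what BMI means, that a rising lab value matters) that the model would otherwise have to discover on its own.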

## Part II: Let's build your first medical AI model!

In this workshop, we will use a real-world medical dataset from the
[WiDS Datathon 2020](https://www.kaggle.com/competitions/widsdatathon2020/data), which contains
anonymized patient records from intensive care units (ICUs) around the world. The dataset includes a
wide range of clinical features such as demographics, vital signs, laboratory results, and
comorbidities collected during the first 24 hours of a patient's ICU stay.

While the original competition focused on predicting in-hospital mortality (`hospital_death`), our
goal will be to develop a model that predicts whether a patient has cirrhosis, using the `cirrhosis`
column as our target variable. This shift allows us to explore the challenges and considerations
involved in building AI models for different clinical outcomes, while practicing essential steps in
the machine learning workflow.

Let's first load the data and visualize it to understand it better.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

from scipy.stats import ttest_ind
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, confusion_matrix

sns.set_theme(style="white", font_scale=1.2)

In [None]:
df = pd.read_csv("./data/training_v2.csv")

The plots below show the distribution of two demographic traits: age and gender.

In [None]:
_, axes = plt.subplots(1, 2, figsize=(16, 4))
sns.histplot(data=df, x="age", ax=axes[0])
sns.histplot(data=df, x="gender", ax=axes[1])

This dataset also records the presence of 6 diseases: leukemia, hepatic failure,
immunosuppression, lymphoma, cirrhosis, and AIDS. Let's take a quick look at how the prevalence of
each disease is distributed across gender and ethnicity.

In [None]:
DISEASES = ["leukemia", "hepatic_failure", "immunosuppression", "lymphoma", "cirrhosis", "aids"]
long_df = pd.melt(
    df,
    id_vars=["patient_id", "gender", "ethnicity"],
    value_vars=DISEASES,
    value_name="presence",
    var_name="disease",
)

sns.catplot(
    data=long_df[long_df["presence"] == 1],
    x="disease",
    col="gender",
    hue="disease",
    kind="count",
    aspect=2,
    legend=False,
)

In [None]:
g = sns.catplot(
    data=long_df[long_df["presence"] == 1],
    x="disease",
    col="ethnicity",
    col_wrap=3,
    hue="disease",
    kind="count",
    legend=False,
    aspect=1.5,
)

for ax in g.axes.flatten():
    ax.tick_params(axis="x", rotation=30)

**🔍 Findings**

- **Mean age is 62**, left-skewed. Younger people are underrepresented.
- **Similar numbers of male and female patients**
    - **Men are more likely to have a disease**.
    - The difference is significant in cirrhosis, hepatic failure, AIDS, and leukemia. An educated
      guess (a hypothesis, not something this dataset can confirm) is that **cirrhosis is more
      common in men because of higher alcohol consumption**, and that **AIDS is more common because
      men who have sex with men, a group with higher HIV risk, are recorded as male**.
- **77% of patients are Caucasian**.
    - Caucasian vs Native American shows significant difference.
    - Caucasian vs African American shows notable difference.
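
A claim such as "the difference is significant" can be checked with a statistical test. The sketch below uses `ttest_ind` from the imports above on synthetic data, so the group sizes and prevalences (3% vs 1%) are invented for illustration. It is one plausible way to run such a check, not necessarily the analysis behind the findings listed here.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)

# Synthetic stand-in for the real data: 1 = has cirrhosis, 0 = does not.
# The prevalences (3% for men, 1% for women) are made-up numbers.
toy = pd.DataFrame(
    {
        "gender": ["M"] * 5000 + ["F"] * 5000,
        "cirrhosis": np.concatenate(
            [rng.binomial(1, 0.03, 5000), rng.binomial(1, 0.01, 5000)]
        ),
    }
)

men = toy.loc[toy["gender"] == "M", "cirrhosis"]
women = toy.loc[toy["gender"] == "F", "cirrhosis"]

# Welch's t-test: does not assume the two groups have equal variance.
stat, p_value = ttest_ind(men, women, equal_var=False)

print(f"prevalence: men={men.mean():.3f}, women={women.mean():.3f}")
print(f"p-value: {p_value:.3g}")
```

A small p-value only says the difference is unlikely to be chance. It says nothing about *why* the groups differ, which is where clinical judgment comes in.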

Now let's practice what we have discussed. From the list below, choose **two** or **three** problems
that you are familiar with, then proceed.

Given a problem and a goal, which types of patient data should we collect? How would they help
construct an accurate model? What other use cases in ICU care or medical practice could we apply
this solution to? Do they have the same data format? Can we apply the same data cleaning method? If
you cannot think of any, what about applying the same modeling technique to the following problems?

1. Estimating ST-elevation from ECG signal
2. Detecting lung cancer from CXR images
3. Adjusting insulin dose from CGM data
4. Sepsis prediction in the ICU
5. Early warning system (EWS) for deterioration
6. Data extraction from clinician notes for flowsheets
7. Bed management

Discuss your thoughts with your group!
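
The modeling step that this part builds toward can be sketched with the scikit-learn classes imported at the top of the notebook. To stay self-contained, the sketch runs on synthetic data: the feature names (`age`, `bilirubin`, `albumin`) and the rule that generates the label are hypothetical and stand in for the real `cirrhosis` column.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(42)
n = 1000

# Synthetic stand-in features (hypothetical names and distributions).
X = pd.DataFrame(
    {
        "age": rng.integers(18, 90, n).astype(float),
        "bilirubin": rng.normal(1.0, 0.5, n),
        "albumin": rng.normal(3.5, 0.6, n),
    }
)
# Toy label rule, used only so the model has a pattern to learn.
y = ((X["bilirubin"] > 1.5) & (X["albumin"] < 3.5)).astype(int)

# Blank out about 10% of one column to mimic missing lab results.
X.loc[rng.random(n) < 0.1, "bilirubin"] = np.nan

# Hold out 20% of patients; the model never sees them during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = Pipeline(
    [
        ("impute", SimpleImputer(strategy="median")),  # fill missing values
        ("forest", RandomForestClassifier(n_estimators=100, random_state=42)),
    ]
)
model.fit(X_train, y_train)

print(f"accuracy on held-out patients: {model.score(X_test, y_test):.2f}")
```

Putting the imputer and the classifier in one `Pipeline` means the median used to fill missing values is computed from the training patients only, so no information leaks from the held-out set.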

## Part III: Monitoring your AI model

TODO: Define the metrics and evaluate the model produced in the previous part.

Based on the problems you selected earlier, what should be the evaluation metrics for those problems?

1. Sepsis prediction in the ICU
2. Early warning system (EWS)
3. Data extraction from clinician notes for flowsheets
4. Bed management
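
As a starting point for choosing metrics, the sketch below computes sensitivity and specificity, the two metrics mentioned in Part I, from a confusion matrix. The labels and predictions are hand-written toy values, not output from any model in this notebook.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical values: 1 = disease present, 0 = disease absent.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

sensitivity = tp / (tp + fn)  # share of sick patients the model catches
specificity = tn / (tn + fp)  # share of healthy patients correctly cleared

print(f"sensitivity = {sensitivity:.2f}, specificity = {specificity:.2f}")
```

Which of the two matters more depends on the problem: a screening tool usually favors sensitivity, while a tool that triggers an invasive follow-up may need high specificity.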

TODO: Reveal the data bias problem with handpicked evaluation (choose specific groups and compare)
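
One way such a group-wise comparison could look is sketched below: the same metric (sensitivity, called recall in scikit-learn) is computed once overall and once per patient group. The groups `A` and `B` and all values are invented for illustration.

```python
import pandas as pd
from sklearn.metrics import recall_score

# Hypothetical evaluation results for two patient groups.
results = pd.DataFrame(
    {
        "group": ["A"] * 6 + ["B"] * 6,
        "y_true": [1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0],
        "y_pred": [1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0],
    }
)

# A single overall number can hide a large gap between groups.
overall = recall_score(results["y_true"], results["y_pred"])

per_group = {
    name: recall_score(part["y_true"], part["y_pred"])
    for name, part in results.groupby("group")
}

print(f"overall sensitivity: {overall:.2f}")
for name, value in per_group.items():
    print(f"group {name}: {value:.2f}")
```

In this toy example the overall sensitivity looks moderate, yet the model catches every sick patient in group A and misses most of them in group B.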

What did we do wrong?

## Part IV: Removing the bias

TODO: Reconstruct the model by removing the bias in the features

TODO: Explain that there are other ways to remove the bias, e.g. through penalization,
sampling techniques, special loss functions, diverse data retrieval, etc.
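
As one illustration of the sampling-style techniques, the sketch below gives each patient a weight that is inversely proportional to the size of their group, so that a small group counts as much as a large one in total. The toy table and the 8-to-2 split are assumptions made for this example, and this is only one of several possible reweighting schemes.

```python
import pandas as pd

# Toy training set with an imbalanced group distribution (made-up counts).
train = pd.DataFrame(
    {"ethnicity": ["Caucasian"] * 8 + ["African American"] * 2}
)

group_counts = train["ethnicity"].value_counts()
n_groups = len(group_counts)

# weight = total patients / (number of groups * patients in this group)
weight_per_group = len(train) / (n_groups * group_counts)
train["sample_weight"] = train["ethnicity"].map(weight_per_group)

# After reweighting, each group contributes the same total weight.
print(train.groupby("ethnicity")["sample_weight"].sum())
```

Many scikit-learn estimators, including `RandomForestClassifier`, accept such weights through the `sample_weight` argument of `fit`.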

At the end of this material, let's take a step further and discuss this issue with your team.

- Can you think of any example where bias would help train a more accurate model? And why would
  the bias help?
- Stick to that example. How do you know it is a fair bias? Or alternatively, how do you know if it
  is unfair?
- How can we prevent bias from entering the machine learning lifecycle?
- How do we know if ChatGPT is safe from bias? What questions should we ask in order to know the
  answer?