# Bias in AI Solutions

## Learning Objectives

#### Key Concepts:
1. **AI in Clinical Practice**: AI solutions are increasingly being integrated into healthcare, but concerns exist about their generalizability and fairness across diverse populations.

2. **Generalization Issues**: Many AI models are trained on historical data from academic medical centers, which often lack representation of diverse populations. This creates models that may fail to demonstrate external validity when applied to broader patient communities.

3. **Perpetuating Disparities**: AI systems trained on non-representative data risk reinforcing existing disparities, particularly in marginalized groups like African Americans in the USA.

4. **Sources of Bias**: 
   - Training data may not represent all populations, leading to biased AI decisions.
   - Poor understanding of how diseases manifest differently across populations.
   - Limited awareness of potential biases in AI models.

5. **Bias in Healthcare**: Demographic health disparities are pronounced in countries like the USA, where access to care and outcomes differ significantly by race, ethnicity, and socioeconomic status. AI solutions could worsen these disparities if not carefully monitored.

6. **Total Product Life Cycle of AI Solutions**: 
   - **Design and Development**: Involves creating the AI model.
   - **Evaluation and Validation**: Ensuring that the AI performs accurately and fairly.
   - **Deployment**: Introducing the AI model into healthcare settings.
   - **Downstream Evaluation**: Ongoing monitoring of the AI's impact post-deployment, with a focus on ensuring equity across diverse populations.

7. **Focus of the Lecture**: 
   - **Downstream Evaluation**: Monitoring the AI once deployed to ensure it produces equitable and unbiased outcomes.
   - **Learning Objectives**:
     - Overview of bias and fairness in AI healthcare solutions.
     - Types of bias in AI models.
     - Algorithmic fairness.
     - Solutions to address bias in AI healthcare models.

#### Key Points for Study:
- The problem of **data representation** in AI training, especially with historical data.
- **Disparities** in AI outcomes for non-white populations, specifically African Americans in the USA.
- The **impact of demographic bias** in healthcare and the importance of fairness across all stages of AI model development.
- The importance of **downstream evaluation** for ensuring equitable patient outcomes.


## Real World Examples of AI Bias

#### Key Concepts:
1. **Heart Failure and Gender Bias**:
   - **Clinical Trials and Guidelines**: Historically, heart failure treatment guidelines were based on male symptoms, which differ from female symptoms.
   - **Modeling Bias**: AI models trained on male heart attack symptoms are highly accurate for men but less accurate for women, leading to **gender bias** in predictive outcomes.
   - **Framingham Study**: The study highlighted biases in cardiovascular disease treatment due to underrepresentation of women’s symptoms.

2. **Genomic Databases and Racial Bias**:
   - **Genomic Bias**: Publicly available genomic databases, which are foundational to precision medicine, are biased towards individuals of **European descent**.
   - **Underrepresentation of Other Populations**: 81% of genome mapping participants in over 2,500 studies were of European descent, leading to **less accurate genetic test results** for people of African, Asian, Hispanic, or Middle Eastern ancestry.
   - **Real-World Impact**: AI models built on such databases are likely to be more beneficial for people of European ancestry, creating challenges in implementing genetic solutions across diverse populations.

3. **AI in Dermatology and Skin Color Bias**:
   - **Skin Cancer Detection**: AI-driven dermatology solutions aim to improve skin cancer detection but are mostly trained on fair-skinned populations.
   - **Disparity in Outcomes**: Patients with **darker skin**, who already face higher mortality rates from advanced skin diseases, may not benefit from AI solutions due to lack of inclusion in the training data.
   - **Key Issues**: The AI models may struggle with diagnosing lesions in patients of color, leading to potential **misdiagnosis** or delayed diagnosis.

#### Key Points for Study:
- **Heart Attack Symptom Bias**: Gender differences in heart attack symptoms leading to biased AI predictions.
- **Genomic Database Representation**: The heavy bias towards European genomic data and its impact on accuracy for non-European populations.
- **Skin Lesion Detection Bias**: AI models in dermatology may fail to diagnose skin cancer in people with darker skin due to biased training data.
- **Broader Implications**: These examples illustrate how **data underrepresentation** in AI training can lead to **systemic bias** in healthcare, negatively affecting diverse populations.


# Types of Bias

## Introduction - Types of Bias

#### Key Concepts:
Bias in AI can occur at multiple stages, from **data collection** to **model development** and **deployment**. A common phrase, **"garbage in, garbage out"**, reflects how poor data quality or representation leads to biased models.

#### Types of Bias:

1. **Historical Bias**:
   - **Definition**: Bias that exists in the data due to historical inequalities, norms, or prejudices.
   - **Phase**: Arises during **data collection**. 
   - **Example**: Using historical healthcare data that reflects past discriminatory practices.

2. **Representation Bias**:
   - **Definition**: Occurs when certain groups are underrepresented in the dataset.
   - **Phase**: Occurs in **data collection** and **sampling**.
   - **Example**: AI models trained on datasets that primarily consist of data from certain racial or gender groups, leading to poor performance for underrepresented groups.

3. **Measurement Bias**:
   - **Definition**: Happens when the tools or metrics used to collect data are biased.
   - **Phase**: Arises during **data collection** and **feature engineering**.
   - **Example**: Biased medical instruments that produce different results based on skin color or gender.

4. **Aggregation Bias**:
   - **Definition**: Occurs when data from different groups are aggregated in ways that overlook group-specific patterns.
   - **Phase**: Appears during **model training**.
   - **Example**: An AI model that combines data from multiple subgroups (e.g., different races or genders) without accounting for group differences, leading to poor accuracy for minority groups.

5. **Evaluation Bias**:
   - **Definition**: Happens when the performance of a model is evaluated using biased benchmarks or metrics that do not reflect diverse populations.
   - **Phase**: Occurs during **model evaluation**.
   - **Example**: Using a test set that lacks diversity, which makes the model seem more accurate than it actually is for underrepresented groups.

6. **Deployment Bias**:
   - **Definition**: Bias that arises when the model is used in real-world settings that differ from the training environment.
   - **Phase**: Happens during **model deployment**.
   - **Example**: An AI system designed for a specific hospital setting may not perform well when deployed in rural or underserved healthcare settings.

#### Key Points for Study:
- **Historical Bias**: Bias due to pre-existing societal inequalities in the data.
- **Representation Bias**: Lack of diverse representation in training data.
- **Measurement Bias**: Biased data collection tools or methods.
- **Aggregation Bias**: Generalizing across diverse groups without acknowledging specific needs or differences.
- **Evaluation Bias**: Using non-diverse evaluation metrics or datasets.
- **Deployment Bias**: Real-world settings differ from the conditions for which the model was trained.


## Historical Bias

#### Key Concepts:
**Historical bias** occurs when the present or past state of the world, influenced by societal norms or prejudices, affects a model's predictions, making them **unfair**. This type of bias is deeply rooted in **preconceived human judgments** and is difficult to eliminate, even with perfect sampling or feature selection.

#### Important Points:

1. **Definition**:
   - Historical bias reflects judgments based on societal **prejudices** or norms that are encoded in **real-world data**.
   - AI models, being data-driven, inherit these biases, making predictions unfair to certain groups.

2. **Impact on Healthcare**:
   - **Historical healthcare data** is often dominated by **male** and **white** populations, creating biased models that fail to generalize well to **diverse populations**.
   
3. **Examples**:
   - **Heart Attack Symptoms**: Early AI models for heart attack triage were trained using **male-specific symptoms**. This led to models that performed poorly for women since the **difference in symptoms** between men and women was previously unknown.
   - **Lung Cancer Screening**: Guidelines based on a study where **96%** of participants were **white** resulted in inaccurate screening for **African Americans**, who have different **smoking habits** (e.g., use of **menthol cigarettes**).

4. **Broader Implications**:
   - Historical bias in healthcare models arises from **clinical trials** and data collection processes that were **not inclusive** of diverse populations.
   - This bias disproportionately affects groups that carry a greater **disease burden**, because models built on past data collection practices are better suited to **white males**.

#### Key Points for Study:
- **Historical Bias**: Arises from societal prejudices encoded in data.
- **Male and White Dominated Healthcare Data**: Leads to models that underperform for diverse populations.
- **Examples of Historical Bias**: 
   - Heart attack symptoms trained on male data.
   - Lung cancer screening guidelines based on studies with predominantly white participants.
- **Impact on Model Generalization**: Historical bias limits the model’s **external validity**, particularly for underrepresented groups.

## Representation Bias

#### Key Concepts:
**Representation bias**, also known as **sampling bias**, occurs when the **training data** used for an AI model does not reflect the **actual distribution** of the population it is intended to serve. This leads to underrepresentation of certain demographic groups, resulting in **biased predictions**.

#### Important Points:

1. **Definition**:
   - Representation bias happens when **certain parts of the final population** are **underrepresented** in the **training data** used to build an AI solution.
   - This bias directly affects the model’s ability to generalize well across diverse populations.

2. **Impact on Healthcare**:
   - Models trained with underrepresentation of certain demographic groups (e.g., minorities) may perform poorly in **real-world applications**.
   - This can lead to **disparities** in the accuracy and effectiveness of healthcare solutions for these underrepresented populations.

3. **Example of Representation Bias**:
   - A **Stanford systematic review** analyzed machine learning models for **Type 2 diabetes** using **electronic health records**.
   - Findings:
     - **Only 25%** of the reviewed papers included **population demographics**.
     - **White** and **Black** populations were adequately represented in the training data, matching their real-world prevalence.
     - **Hispanics** were **not included** in the training data, despite their **higher prevalence** and **complication rates** for Type 2 diabetes.
     - This omission raises concerns about how **AI models** trained without **Hispanic** representation would perform in broader populations, particularly in **nationwide healthcare systems**.

4. **Broader Implications**:
   - Representation bias can result in models that **fail** to deliver equitable healthcare outcomes for **underrepresented groups**.
   - Addressing this bias is critical for ensuring **fairness** and **generalizability** of AI healthcare models across **diverse populations**.
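
The comparison described in the example above (training-set demographics versus the population the model will serve) can be sketched as a simple check. This is a minimal illustration, not the method used in the Stanford review; the function names, the 5-point tolerance, and all counts and shares below are hypothetical.

```python
def representation_gaps(train_counts, population_share):
    """Return, per group, the training share minus the population share."""
    total = sum(train_counts.values())
    gaps = {}
    for group, pop_share in population_share.items():
        train_share = train_counts.get(group, 0) / total
        gaps[group] = train_share - pop_share
    return gaps


def flag_underrepresented(gaps, tolerance=0.05):
    """List groups whose training share falls short by more than `tolerance`."""
    return sorted(g for g, gap in gaps.items() if gap < -tolerance)


# Hypothetical numbers for illustration only (not taken from any study).
train_counts = {"White": 700, "Black": 300, "Hispanic": 0}
population_share = {"White": 0.60, "Black": 0.20, "Hispanic": 0.20}

gaps = representation_gaps(train_counts, population_share)
print(flag_underrepresented(gaps))
```

A check like this only detects missing representation; it says nothing about whether the model performs well for the groups that are present.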

#### Key Points for Study:
- **Representation Bias**: Occurs when training data does not reflect the full population.
- **Underrepresentation**: Leads to poor model performance for certain groups.
- **Stanford Example**: Type 2 diabetes models lacked **Hispanic** representation, despite their high prevalence of the disease.
- **Impact on AI Models**: Affects the **generalization** of AI solutions across nationwide systems and diverse populations.

## Measurement Bias

#### Key Concepts:
**Measurement bias** arises when **noise** in the features or labels is **unevenly distributed** across different groups, leading to **differential performance** and skewed predictions. This occurs when the **available data** is a **noisy proxy** for the actual variables of interest, and the **measurement process** introduces bias.

#### Important Points:

1. **Definition**:
   - **Measurement bias** occurs when the **inaccuracy** in measuring a variable differs between groups.
   - It results in **differential outcomes** for various populations, affecting the fairness of the AI model.

2. **Impact in Healthcare**:
   - Measurement bias can lead to **incorrect predictions** that disproportionately affect certain groups, leading to **inequitable allocation of resources** and **misdiagnosis**.

3. **Example: Obermeyer Study**:
   - Dr. Obermeyer and colleagues studied a widely used algorithm for **identifying patients** in need of **additional care support**.
   - They found **racial bias**:
     - The algorithm required **Black patients** to be **sicker with more co-morbidities** to achieve the same predictive score as **White patients** for accessing **beneficial care programs**.
     - This was due to the algorithm **using healthcare costs** as a **proxy** for healthcare need.
   - The bias was introduced because **healthcare costs** are not a reliable measure of **medical illness**, leading to **inaccurate predictions** and **unequal resource allocation**.

4. **Causes**:
   - **Noisy proxies**: When the available features (like healthcare costs) are only a **noisy proxy** for the real variables of interest (like illness severity).
   - **Measurement errors**: These errors are often not random but reflect **systemic biases** in how different groups are measured, further **exacerbating inequalities**.

5. **Mitigation Strategies**:
   - **Awareness** of measurement bias is critical to identify and address the issue.
   - Implementing **pre- and post-processing** actions can help **mitigate the bias**:
     - **Pre-processing**: Correcting data before training the model.
     - **Post-processing**: Adjusting model predictions after they are made to ensure fairness.
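
The proxy problem from the Obermeyer example can be illustrated with a toy cohort. This is a sketch under stated assumptions, not a reconstruction of the studied algorithm: the two groups, the cost-per-condition figures, and the threshold are all hypothetical, chosen so that both groups are equally ill but one generates less cost per condition.

```python
# Hypothetical assumption: each chronic condition generates 1000 units of
# cost for group A but only 600 for group B (for example, due to access barriers).
COST_PER_CONDITION = {"A": 1000, "B": 600}


def make_patients():
    """Build a toy cohort in which both groups have identical illness burdens."""
    patients = []
    for group in ("A", "B"):
        for conditions in range(0, 11):  # 0 to 10 chronic conditions
            cost = conditions * COST_PER_CONDITION[group]
            patients.append({"group": group, "conditions": conditions, "cost": cost})
    return patients


def min_conditions_to_qualify(patients, cost_threshold):
    """Smallest illness burden that clears a cost-based threshold, per group."""
    result = {}
    for group in ("A", "B"):
        qualifying = [
            p["conditions"]
            for p in patients
            if p["group"] == group and p["cost"] >= cost_threshold
        ]
        result[group] = min(qualifying) if qualifying else None
    return result


patients = make_patients()
needed = min_conditions_to_qualify(patients, cost_threshold=3000)
print(needed)  # {'A': 3, 'B': 5}
```

Because the score is cost rather than illness, a group B patient in this toy setup needs five conditions to qualify where a group A patient needs three, even though the illness distributions are identical.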

#### Key Points for Study:
- **Measurement Bias**: Occurs when the noise in data features or labels is not evenly distributed across groups, leading to biased predictions.
- **Obermeyer Study**: The algorithm used **healthcare costs** as a flawed **proxy** for healthcare needs, leading to **racial bias** in care support allocation.
- **Impact**: Black patients were required to be **sicker** to achieve the same scores as White patients, affecting healthcare resource distribution.
- **Mitigation**: Awareness and **pre- and post-processing** strategies can help reduce the impact of measurement bias.

## Aggregation Bias

#### Key Concepts:
**Aggregation bias** arises when **data from distinct populations** with differing underlying distributions is combined to develop a **single model**, leading to inaccurate predictions. This occurs because a **"one-size-fits-all" approach** fails to account for **systematic differences** between groups, requiring either **separate models** or the inclusion of demographic variables.

#### Important Points:

1. **Definition**:
   - **Aggregation bias** occurs when **different populations** with **varying distributions** of the outcome under study are **aggregated** into one model.
   - This issue is closely related to **infra-marginality**: when underlying risk distributions differ between groups and those differences are **not properly accounted for**, the result is **inaccurate predictions** or misleading comparisons.

2. **Problem in Model Development**:
   - Healthcare data often contains **distinct stratifications** based on factors like **ethnicity, age, gender**, etc.
   - Developing a **single model** for all groups can result in **biased outcomes**, as certain groups might have **different base rates** for diseases or conditions.

3. **Example: Diabetes and Ethnic Variations**:
   - **Type 2 diabetes** has **differential prevalence** and **complication rates** across **racial and ethnic groups**.
     - **US Hispanics** have higher rates (~17%) compared to **non-Hispanic Whites** (~8%).
     - Within the Hispanic community, different subgroups (e.g., Puerto Rican vs. South American) have **varying rates** of diabetes.
   - A model that **does not account** for these differences could produce **inaccurate predictions** or lead to **inequitable healthcare** for certain subgroups.

4. **Mitigating Aggregation Bias**:
   - **Developing separate models** for distinct populations can help improve accuracy.
   - Alternatively, **demographic variables** (e.g., ethnicity) should be included in the model to account for these **systematic differences**.
   - Without such adjustments, models may disproportionately **benefit or harm certain groups**, leading to **misdiagnosis** or **inequitable treatment**.

5. **Gender Bias Example**:
   - Similar to ethnic variation in diabetes, **gender bias** in heart attack symptoms showed that **AI solutions** worked well for **men** but were **inaccurate for women**.
   - This highlights the need for **tailored models** to address **gender-specific differences**.
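
A small arithmetic sketch shows why a single group-blind base rate misfits both groups. The prevalence figures are the approximate rates quoted above; the group sizes and function names are hypothetical.

```python
def pooled_rate(groups):
    """Prevalence a single group-blind model would learn as its base rate."""
    total_cases = sum(g["n"] * g["rate"] for g in groups.values())
    total_n = sum(g["n"] for g in groups.values())
    return total_cases / total_n


def base_rate_error(groups):
    """Pooled base rate minus each group's own rate (positive = overestimate)."""
    pooled = pooled_rate(groups)
    return {name: pooled - g["rate"] for name, g in groups.items()}


# Rates are the approximate figures from the text; sizes are hypothetical.
groups = {
    "Hispanic": {"n": 2000, "rate": 0.17},
    "Non-Hispanic White": {"n": 8000, "rate": 0.08},
}

print(round(pooled_rate(groups), 3))
for name, err in base_rate_error(groups).items():
    print(name, round(err, 3))
```

With these inputs the pooled rate sits close to the majority group's rate, so the smaller group's risk is underestimated by a wider margin. Including the demographic variable, or fitting separate models, removes this particular error.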

#### Key Points for Study:
- **Aggregation Bias**: Occurs when **data from different populations** is combined without accounting for differences, a problem closely related to **infra-marginality**.
- **Type 2 Diabetes Example**: US **Hispanics** have **higher rates** of diabetes, but these rates vary between subgroups (e.g., Puerto Rican vs. South American).
- **Mitigation**: Use **separate models** or include **demographic variables** to address systematic differences across populations.
- **Gender Bias in Healthcare**: Similar issues can arise when models **overfit** to one group (e.g., men) while being **inaccurate for others** (e.g., women).


## Evaluation Bias

#### Key Concepts:
**Evaluation bias** occurs during the **validation and tuning** of AI models when the **testing data** or **performance metrics** used are **not representative** of the population or the problem the model is intended to address. It can lead to **misleading conclusions** about the model's real-world performance.

#### Important Points:

1. **Testing Data and External Benchmarks**:
   - **Evaluation bias** can arise if the **testing data** used to validate the model does not represent the **target population**.
   - Developers often use **benchmark** or **synthetic datasets** to train and test models, and these may not align with **real-world conditions**.
   - This mismatch leads to models that perform well on test data but poorly in **real-world deployment**.
   - **Solution**: Use **external validation** with **unseen data** drawn from the **target population** to ensure the model generalizes effectively.

2. **Performance Metrics**:
   - Inappropriate or **incomplete metrics** can also introduce **evaluation bias**.
   - For example, using only **Area Under the Curve (AUC)** in a dataset with **class imbalance** might **overestimate performance** if **other metrics** such as **precision, recall, or F1 score** are not considered.
   - Similarly, focusing only on overall performance may hide issues with **specific subgroups**.

3. **Granular and Comprehensive Metrics**:
   - To avoid **evaluation bias**, models should be assessed using a **range of metrics** that capture the model’s performance across all important aspects.
   - Metrics should also be **granular** enough to measure performance across **different subgroups** (e.g., by gender, age, ethnicity) to ensure that the model **performs equitably**.

4. **Example**:
   - If a model is developed for a **medical condition** but the test data predominantly represents a specific group (e.g., **men**), the model might **underperform** when deployed on a **different group** (e.g., **women**).
   - In cases of **class imbalance**, using **only AUC** might mask poor performance in predicting **rare events** (e.g., disease outbreaks or fraud).
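
The point about AUC under class imbalance can be checked on a toy dataset. This is a minimal sketch with hand-written helpers and made-up scores (2 positives, 20 negatives); it is not tied to any particular library or study.

```python
def auc(y_true, scores):
    """Pairwise AUC: share of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def precision_recall(y_true, scores, threshold):
    """Precision and recall when scores at or above `threshold` are flagged."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(y_true, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(y_true, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, preds) if y == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical imbalanced data: 2 positives followed by 20 negatives.
y_true = [1, 1] + [0] * 20
scores = [0.9, 0.55, 0.8, 0.7, 0.6] + [0.1] * 17

print(round(auc(y_true, scores), 3))          # 0.925
print(precision_recall(y_true, scores, 0.5))  # (0.4, 1.0)
```

Here the AUC looks strong, yet three of every five flagged patients are false alarms at the chosen threshold. Reporting precision and recall alongside AUC, and repeating the calculation per subgroup, makes that visible.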

#### Key Points for Study:
- **Evaluation Bias**: Can arise from **non-representative test data** or using **inappropriate metrics**.
- **External Validation**: Models should be tested on **unseen, real-world datasets** from the target population to ensure **generalizability**.
- **Performance Metrics**: Use a **comprehensive set of metrics** that address **class imbalance** and **subgroup-specific performance**.
- **Mitigation**: Evaluate models **granularly** across **different subgroups** to avoid overlooking bias and ensure **fair outcomes**.


## Deployment Bias

#### Key Concepts:
**Deployment bias** occurs when an AI model is **used inappropriately** or its **results are misinterpreted**, leading to a mismatch between the **model's intended use** and its **actual application**. This type of bias typically arises during the **implementation** phase of AI systems.

#### Important Points:

1. **Misapplication of Model Predictions**:
   - Deployment bias happens when the **intended purpose** of the model is different from how it is actually **deployed** or used in practice.
   - For instance, a model developed to predict **costs** may be incorrectly used to predict **healthcare needs**, resulting in **inaccurate decision-making**.

2. **Example: Healthcare Cost vs. Need**:
   - In the **Obermeyer study**, a model that accurately predicted **healthcare costs** was misused to predict **healthcare needs**.
   - This misapplication of the model led to **racial disparities** in healthcare access, where **Black patients** had to be nearly **twice as sick** as **White patients** to qualify for the same care programs.
   - The deployment bias here resulted from assuming that **higher costs** were directly correlated with **higher healthcare need**, which was not the model's original intent.

3. **Impact on Healthcare**:
   - **Deployment bias** is particularly dangerous in healthcare, where **misinterpretation** of an AI model’s results can lead to **inequitable treatment** of patients.
   - This can reinforce **existing societal inequalities**, especially when decisions impact **access to care**, **resource allocation**, or **treatment recommendations**.

4. **Intersection of AI and Society**:
   - Deployment bias occurs at the **intersection of AI** and **society**, highlighting how **societal understanding** and the **medical community's** use of AI can influence outcomes.
   - A **well-developed model** can still lead to biased results if it is **misused** during deployment.

#### Key Points for Study:
- **Deployment Bias**: Occurs when an AI model is **misused** or its **results** are **misinterpreted**.
- **Example**: In the **Obermeyer study**, a model predicting **healthcare costs** was used to predict **healthcare needs**, leading to **inequitable care**.
- **Impact**: Can lead to **unfair outcomes** in **resource allocation** and **patient treatment**.
- **Mitigation**: Ensuring that models are **deployed appropriately** for their **intended purpose** is critical to preventing deployment bias, especially in sensitive domains like **healthcare**.


# Algorithmic Fairness

## What is Algorithmic Fairness?

#### Key Concepts:
**Algorithmic fairness** addresses the ethical implications of AI systems in healthcare, emphasizing the importance of **justice** and **equity** in health outcomes rather than merely focusing on algorithmic outputs. The historical context of medicine often reflects **white normativity**, which has influenced clinical practices and research inclusivity.

#### Importance of Diversity in Healthcare:
- Traditional clinical signs, such as **blue lips** as an indicator of **oxygen deprivation**, highlight biases in medical education. This sign is predominantly observable in individuals with **light skin**, demonstrating a lack of consideration for patients with **darker skin tones**.
- The concept of **epistemic privilege** emphasizes that individuals from diverse backgrounds possess knowledge and experiences that can improve the accuracy and comprehensiveness of medical knowledge and AI models.

### Classes of Algorithmic Fairness:

1. **Anti-classification**:
   - This approach aims to **exclude sensitive attributes** (such as race or gender) from the decision-making process to prevent discrimination.
   - The underlying assumption is that if these attributes are not considered, then the resulting algorithm will be fair.
   - **Limitations**: This method can overlook historical injustices and systemic biases inherent in the data, leading to **unintended discriminatory outcomes**.

2. **Classification Parity**:
   - This principle requires **equal predictive performance across different groups** (e.g., equal true positive rates or false positive rates).
   - The goal is to ensure that the algorithm performs similarly for all demographic groups.
   - **Limitations**: While it addresses disparity in outcomes, it does not necessarily account for the underlying **causes of disparity** or the contextual differences among groups.

3. **Calibration**:
   - Calibration ensures that the predicted probabilities of outcomes are **accurate across different groups**.
   - For instance, if an algorithm assigns a 70% probability of an event to patients in two different groups, then roughly 70% of the patients given that score in each group should actually experience the event.
   - **Limitations**: Calibration can be challenging in practice and may not adequately address **root causes of bias** or the broader social implications of its predictions.

### Challenges and Considerations:
- There is **no universal definition** of fairness, and each framework has its **shortcomings**.
- Researchers must be **cautious** and **aware** of these limitations when applying fairness definitions in **model evaluation**.
- The conversation around algorithmic fairness must involve **diverse perspectives** to enhance the quality of scientific inquiry and ensure equitable healthcare outcomes.

#### Key Points for Study:
- **Algorithmic Fairness**: A critical aspect of AI in healthcare focusing on **justice** and **equity**.
- **Diversity Importance**: Lack of representation can lead to biased clinical signs and algorithms.
- **Three Forms of Fairness**: 
  - **Anti-classification**: Excludes sensitive attributes.
  - **Classification Parity**: Ensures equal predictive performance for different groups.
  - **Calibration**: Ensures accurate predicted probabilities across groups.
- **Limitations**: Each fairness concept has statistical shortcomings and requires careful application.


## Anti-classification

#### Definition of Anti-classification:
**Anti-classification**—or **fairness through unawareness**—is a framework that posits the exclusion of protected attributes (such as gender, race, or ethnicity) from outcome modeling in AI algorithms. This approach aims to prevent discrimination by ensuring that sensitive characteristics do not influence the model's decisions.

#### Key Concepts:
- **Protected Attributes**: Characteristics like gender and race that are safeguarded by anti-discrimination laws. The strict interpretation of anti-classification suggests omitting any unprotected characteristics that might act as proxies for these protected attributes.
- **Proxies**: Features that may indirectly correlate with protected attributes (e.g., attendance at a single-sex school, height, or weight), making the identification of true proxies challenging without further data collection.

#### Challenges of Anti-classification:
1. **Practicality Issues**:
   - Removing all potential proxies may significantly reduce the number of useful features, making it difficult to develop effective predictive models.
   - Identifying which variables are true proxies for protected attributes requires thorough analysis, which may not always be feasible.

2. **Indirect Bias**:
   - Even if an algorithm does not directly consider protected attributes, it can still produce biased outcomes based on proxies. This means the exclusion of sensitive attributes does not guarantee fairness.

3. **Infra-marginality Problem**:
   - Some clinical risk models need to include protected attributes to ensure equitable outcomes, particularly when the risk distributions differ across sub-populations.
   - For instance, diabetes diagnosis and monitoring protocols often require different approaches for various ethnicities and genders, necessitating the inclusion of these attributes in models.
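
One rough way to probe the proxy problem described above is to measure how strongly each candidate feature correlates with the protected attribute. The sketch below is only a first-pass screen under stated assumptions: the helper names, the 0.5 cutoff, and the data are hypothetical, and a linear correlation will miss non-linear or combined proxies.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either variable is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def flag_proxies(features, protected, threshold=0.5):
    """Names of features whose |correlation| with the attribute meets the cutoff."""
    return sorted(
        name
        for name, values in features.items()
        if abs(pearson(values, protected)) >= threshold
    )


# Hypothetical data: `protected` is a binary-coded attribute (e.g., sex).
protected = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "height_cm": [160, 165, 158, 170, 175, 180, 172, 185],
    "visits_last_year": [2, 5, 3, 4, 3, 4, 2, 5],
}

print(flag_proxies(features, protected))  # ['height_cm']
```

Dropping every flagged feature is the strict anti-classification response, but as noted above it can strip out clinically useful information.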

#### Strategies for Mitigating Bias:
- **Focus on Reducing Disparities**:
  - Developers should aim to reduce differences in model performance between groups without sacrificing overall effectiveness. Identifying and addressing trade-offs is crucial in this process.

- **Enhancing Dataset Representation**:
  - AI solutions trained on datasets that underrepresent certain groups may require additional training data. This can help improve accuracy and fairness in decision-making, ensuring that algorithms do not yield unfair results.

- **Equitable Model Design**:
  - When aiming for fairness, models must learn to leverage protected attributes judiciously to avoid reinforcing existing inequalities while still addressing the nuances of different populations.

### Conclusion:
Anti-classification as a fairness framework offers a starting point for addressing bias in AI algorithms; however, its practical limitations and challenges highlight the complexity of achieving true fairness in healthcare applications. Developers must balance the exclusion of sensitive attributes with the need for comprehensive models that account for the diverse experiences of patients. Ultimately, a nuanced approach that combines equitable modeling practices with robust data representation strategies is essential for mitigating potential biases in AI healthcare solutions.

## Classification Parity

#### Definition of Classification Parity:
**Classification parity** is a fairness framework that advocates for equal predictive performance across different protected groups. This means that the performance metrics of a model should not vary based on sensitive attributes such as race, gender, or ethnicity. The focus is on ensuring that predictions made by the model are equally accurate across all demographic groups.

#### Key Concepts:
- **Relevant Performance Metrics**: The selection of performance metrics should align with the actionable insights derived from the model outputs. Depending on the specific problem at hand, certain metrics may be more pertinent than others.

- **Critical Metrics in Classification Parity**:
  1. **False Positive Rate (FPR)**: This refers to the probability of incorrectly predicting a positive outcome when the true outcome is negative. Classification parity requires that the FPR remains consistent across all demographic groups.
  2. **False Negative Rate (FNR)**: In some contexts, the FNR, which indicates the probability of failing to predict a positive outcome when it is true, may be more significant to consider, depending on the nature of the application.
  3. **Proportion of Positive Decisions**: Also known as **demographic parity**, this measure requires that the probability of predicting a positive outcome be the same across demographic groups. A conditional variant requires this only after controlling for other relevant variables.
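
The three metrics above can be computed per group from labels and binary predictions. This is a minimal sketch; the function name and the toy data are hypothetical.

```python
def group_rates(y_true, y_pred, groups):
    """Per-group false positive rate, false negative rate, and positive-decision rate."""
    stats = {}
    for g in sorted(set(groups)):
        rows = [(y, p) for y, p, gg in zip(y_true, y_pred, groups) if gg == g]
        fp = sum(1 for y, p in rows if y == 0 and p == 1)
        fn = sum(1 for y, p in rows if y == 1 and p == 0)
        negatives = sum(1 for y, _ in rows if y == 0)
        positives = sum(1 for y, _ in rows if y == 1)
        stats[g] = {
            "fpr": fp / negatives if negatives else None,
            "fnr": fn / positives if positives else None,
            "positive_rate": sum(p for _, p in rows) / len(rows),
        }
    return stats


# Hypothetical labels and predictions for two groups of five patients each.
y_true = [1, 1, 0, 0, 0, 1, 1, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1, 0, 0, 0, 0]
groups = ["A"] * 5 + ["B"] * 5

for group, rates in group_rates(y_true, y_pred, groups).items():
    print(group, rates)
```

In this toy data group A has the higher false positive rate and group B the higher false negative rate, so which gap matters more depends on the clinical cost of each kind of error.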

#### Challenges of Classification Parity:
- **Infra-marginality Issue**: 
  - Classification parity metrics can mislead when the underlying risk distributions differ among groups, which is common in healthcare, where populations often have different risk profiles.
  - Error rates such as the FPR depend on the whole risk distribution of a group, not only on how patients near the decision threshold are treated. When risk distributions differ, error rates can differ across groups even if the same decision rule is applied to everyone, so enforcing parity can lead to misleading conclusions or ineffective interventions that do not address the actual needs of each group.

- **Threshold-based Decision Making**: 
  - Classification models often rely on threshold-based decisions, which can produce error metrics that differ across demographic groups. This discrepancy can complicate the achievement of true classification parity, as it may lead to unequal outcomes.
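
The two challenges above can be illustrated with a small calculation. The sketch below assumes, purely for illustration, that every risk score equals the patient's true event probability (perfect calibration) and that one threshold is applied to both groups. The risk values and the function name `expected_error_rates` are hypothetical.

```python
def expected_error_rates(risks, threshold):
    """Expected FPR and FNR when each score equals the true event probability."""
    flagged = [r for r in risks if r >= threshold]
    unflagged = [r for r in risks if r < threshold]
    expected_negatives = sum(1 - r for r in risks)  # expected count of true negatives-to-be
    expected_positives = sum(risks)                 # expected count of true positives-to-be
    fpr = sum(1 - r for r in flagged) / expected_negatives
    fnr = sum(unflagged) / expected_positives
    return fpr, fnr


# Hypothetical groups: group B carries more risk overall than group A.
group_a = [0.1, 0.1, 0.2, 0.2, 0.6, 0.7]
group_b = [0.2, 0.4, 0.6, 0.6, 0.7, 0.8]
threshold = 0.5  # the same decision rule for everyone

fpr_a, fnr_a = expected_error_rates(group_a, threshold)
fpr_b, fnr_b = expected_error_rates(group_b, threshold)
print(f"Group A: FPR={fpr_a:.2f}, FNR={fnr_a:.2f}")
print(f"Group B: FPR={fpr_b:.2f}, FNR={fnr_b:.2f}")
```

Under these assumptions the expected FPR is roughly 0.17 for group A and 0.48 for group B, even though the scores are accurate for every patient and the threshold is identical. The gap comes entirely from the different risk distributions, which is the infra-marginality problem.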

#### Implications in Healthcare:
- **Clinical Relevance**: In medicine, comparable predictive performance across demographic groups is central to equitable healthcare delivery. Disparities in predictive performance can translate into unequal access to care or higher rates of misdiagnosis for some groups.
  
- **Need for Tailored Approaches**: Given the complexities associated with differing risk distributions, it is vital for developers to consider tailored modeling approaches that incorporate the unique characteristics of various populations while striving for fairness.

### Conclusion:
Classification parity serves as a valuable framework for evaluating fairness in AI models, emphasizing the importance of equal predictive performance across protected groups. However, challenges such as infra-marginality and the implications of threshold-based decision-making necessitate careful consideration and adaptation in healthcare applications. To promote equitable outcomes, developers must balance the principles of classification parity with the realities of diverse patient populations and their specific needs.

## Calibration

#### Definition of Calibration:
**Calibration** refers to the alignment between predicted probabilities and observed outcomes in a predictive model. A well-calibrated model ensures that, for a given risk score, the proportion of actual positive outcomes aligns with the predicted probabilities. For instance, if a model assigns a predicted risk of 25% to a group of patients, we would expect approximately 25 out of every 100 patients with that score to experience the event of interest.

#### Calibration in Fairness Analysis:
In the context of algorithmic fairness, calibration emphasizes that outcomes should be independent of protected attributes when conditioning on risk estimates. This means that a model should accurately reflect probabilities for all demographic groups without bias. Key points include:

- **Risk Prediction and Outcome Agreement**: Calibration seeks to ensure that the predicted risk scores correspond closely with the actual probabilities of outcomes across different demographic groups.

- **Comparison of Outputs**: Calibration can be viewed as a comparison between the actual outcomes and the expected probabilities produced by the model. The goal is to refine the model such that the predicted risk distribution mirrors the observed outcomes in the training data.
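
The fairness reading of calibration can be written compactly. The notation is introduced here for illustration and is not taken from the lecture: $S$ is the risk estimate on a probability scale, $Y$ is the binary outcome, and $A$ is the protected attribute.

```latex
\Pr(Y = 1 \mid S = s,\, A = a) \;=\; \Pr(Y = 1 \mid S = s) \;=\; s
\qquad \text{for every group } a \text{ and every score } s
```

The first equality says that, once the risk estimate is known, the outcome does not depend on the protected attribute. The second says that the estimate itself matches the observed event rate.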

#### Challenges with Calibration:
- **Manipulation of Risk Distributions**: Calibration can be susceptible to manipulation, where changes in data aggregation or feature representation can skew the outcomes. For example, modifying how data is grouped or which features are included can lead to misleading trends in predictions.

- **Consequences of Poor Calibration**: If an AI model is poorly calibrated, it can result in false expectations for users, whether patients or healthcare professionals. This misalignment can lead to poor decision-making, such as inappropriate treatment plans or mismanagement of care.

#### Clinical Implications:
- **In Vitro Fertilization (IVF) Example**: In the context of IVF treatment, accurate calibration is critical. An overestimation of the likelihood of live birth could provide false hope to couples, potentially leading to emotional distress and exposing women to unnecessary medical risks, such as ovarian hyperstimulation syndrome.

- **Importance of Calibration Evaluation**: Ensuring proper calibration is vital before deploying models in clinical settings. Poorly calibrated models can lead to misguided expectations and decisions, highlighting the need for rigorous evaluation during the development phase.

### Conclusion:
Calibration is a fundamental aspect of algorithmic fairness, ensuring that predictive models deliver accurate risk assessments that align with observed outcomes across diverse demographic groups. The importance of calibration is underscored by its direct implications for patient care and decision-making in healthcare settings. To maintain ethical standards in AI applications, developers must prioritize calibration and thoroughly assess models before deployment to prevent adverse consequences.

## Applying Fairness Measures

#### Introduction
Algorithmic fairness concepts, including **anti-classification**, **classification parity**, and **calibration**, can be applied to assess whether model predictions are equitable across different demographic groups. This section outlines how each principle can be evaluated in practice.

#### 1. Anti-Classification
- **Definition**: Anti-classification requires the exclusion of protected attributes (e.g., race, gender) in model decision rules. A stricter interpretation prohibits the use of proxies for these attributes, such as geographic indicators (e.g., zip codes) that correlate with race.

- **Implementation**: 
  - Evaluating anti-classification is straightforward when the requirement is only to exclude protected attributes: inspect the model inputs. Identifying and excluding proxies is harder, because their correlations with protected attributes can be subtle.

#### 2. Classification Parity
- **Definition**: Classification parity requires equal predictive performance across demographic groups, typically assessed with metrics such as the false positive rate and false negative rate.

- **Implementation**: 
  - Select the performance metrics that matter for the decision the model supports.
  - Once the metrics are selected, assess differences in model performance across demographic groups on a test set.
  - **Empirical Confidence Intervals**: Compute empirical confidence intervals, for example by bootstrapping, to reduce dependence on one particular test set. Non-overlapping intervals point to a statistically significant performance difference between groups; overlapping intervals do not by themselves rule one out.
  - Analyze the equality of predictive performance to confirm whether the model treats all demographic groups fairly.
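
The bootstrap comparison described above can be sketched as follows. This is one simple variant (a percentile bootstrap on the false positive rate), not necessarily the procedure used in the lecture. The function names, the resample count, and the toy test-set data are assumptions made for illustration.

```python
import random


def false_positive_rate(y_true, y_pred):
    """Share of true negatives that were predicted positive."""
    preds_for_negatives = [p for t, p in zip(y_true, y_pred) if t == 0]
    if not preds_for_negatives:
        return float("nan")
    return sum(preds_for_negatives) / len(preds_for_negatives)


def bootstrap_ci(y_true, y_pred, metric, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a metric on one group."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # resample patients with replacement
        value = metric([y_true[i] for i in idx], [y_pred[i] for i in idx])
        if value == value:  # skip NaN from resamples without negatives
            stats.append(value)
    stats.sort()
    lower = stats[int((alpha / 2) * len(stats))]
    upper = stats[min(len(stats) - 1, int((1 - alpha / 2) * len(stats)))]
    return lower, upper


# Hypothetical test-set slices: 200 true negatives and 50 true positives per group.
y_true_a = [0] * 200 + [1] * 50
y_pred_a = [1] * 20 + [0] * 180 + [1] * 50  # 20 false positives
y_true_b = [0] * 200 + [1] * 50
y_pred_b = [1] * 60 + [0] * 140 + [1] * 50  # 60 false positives

ci_a = bootstrap_ci(y_true_a, y_pred_a, false_positive_rate)
ci_b = bootstrap_ci(y_true_b, y_pred_b, false_positive_rate)
print("Group A FPR 95% CI:", ci_a)
print("Group B FPR 95% CI:", ci_b)
print("Intervals overlap:", ci_a[1] >= ci_b[0] and ci_b[1] >= ci_a[0])
```

The same `bootstrap_ci` helper accepts any metric function with the signature `(y_true, y_pred)`, so the false negative rate or the share of positive decisions can be compared in the same way.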

#### 3. Calibration
- **Definition**: Calibration measures the agreement between predicted probabilities and observed outcomes, ensuring that predictions accurately reflect the likelihood of events across demographic groups.

- **Implementation**: 
  - Assess calibration by examining overall predicted and observed outcomes across different demographic groups.
  - **Bootstrapping** can be utilized to verify the consistency of observed disparities between prediction and outcome rates, revealing systematic under- or overestimations of risk.

- **Challenges**:
  - Calibration is harder to assess when the outcome is binary but the model outputs a continuous risk, because no individual outcome reveals that patient's true risk. Patients must therefore be grouped by predicted risk so that the observed incidence in each group can be compared with the predicted risk.
  - The choice of grouping method, such as fixed risk-score intervals, percentiles, or deciles, can significantly affect the calibration analysis. A common approach is to create ten groups (deciles) and compute a chi-square statistic over them, as in the Hosmer-Lemeshow test.

- **Visualization**: 
  - The calibration curve can be plotted, with observed rates against predicted risks. A well-calibrated model will show points close to the diagonal line. If the calibration curves differ significantly across demographic groups, it indicates a violation of fairness criteria.
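
The decile-based analysis described above can be sketched as follows. The code builds the points of a calibration curve for each group and a Hosmer-Lemeshow-style chi-square statistic, which is one common choice rather than a test confirmed by the source. The data are simulated under stated assumptions: scores for hypothetical group A are calibrated, while the model overestimates risk for hypothetical group B.

```python
import random


def calibration_table(scores, outcomes, n_bins=10):
    """Split patients into equal-sized risk bins; compare predicted and observed rates."""
    ordered = sorted(zip(scores, outcomes))
    n = len(ordered)
    table = []
    for b in range(n_bins):
        chunk = ordered[b * n // n_bins:(b + 1) * n // n_bins]
        if not chunk:
            continue
        size = len(chunk)
        table.append({
            "n": size,
            "mean_predicted": sum(s for s, _ in chunk) / size,
            "observed_rate": sum(y for _, y in chunk) / size,
        })
    return table


def hosmer_lemeshow_statistic(table):
    """Sum of (observed - expected)^2 / variance over the risk bins."""
    total = 0.0
    for row in table:
        expected = row["n"] * row["mean_predicted"]
        observed = row["n"] * row["observed_rate"]
        variance = expected * (1 - row["mean_predicted"])
        if variance > 0:
            total += (observed - expected) ** 2 / variance
    return total


rng = random.Random(0)


def simulate(n, risk_multiplier):
    """Simulated scores; the true event probability is score * risk_multiplier."""
    scores = [rng.uniform(0.05, 0.95) for _ in range(n)]
    outcomes = [1 if rng.random() < s * risk_multiplier else 0 for s in scores]
    return scores, outcomes


table_a = calibration_table(*simulate(2000, 1.0))  # calibrated by construction
table_b = calibration_table(*simulate(2000, 0.6))  # risk overestimated by construction
hl_a = hosmer_lemeshow_statistic(table_a)
hl_b = hosmer_lemeshow_statistic(table_b)

for name, table in (("A", table_a), ("B", table_b)):
    for row in table:
        print(name, round(row["mean_predicted"], 2), round(row["observed_rate"], 2))
print("Chi-square statistic A:", round(hl_a, 1), "B:", round(hl_b, 1))
```

Plotting `observed_rate` against `mean_predicted` for each group gives the calibration curves. In this simulation the points for group B fall below the diagonal, and its chi-square statistic is much larger than that of group A.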

#### Conclusion
In summary, the three fairness concepts (anti-classification, classification parity, and calibration) provide frameworks for evaluating whether model predictions are equitable across demographic groups. By applying these principles, developers and researchers can enhance the fairness and effectiveness of AI solutions, ultimately leading to more equitable outcomes in healthcare.

# Transparency

## Lack of Transparency

#### Introduction
The use of AI solutions in healthcare, particularly those trained on electronic health records (EHRs), raises concerns about potential discrimination, especially if these models are based on historic or retrospective data. A critical aspect of addressing these concerns is ensuring that the populations represented in EHR systems are reflective of the broader community.

#### Population Representativeness
- **Concerns**: Many AI solutions are developed using data from academic medical centers, which may not accurately represent the diversity of the broader patient population. If the training data lacks diversity, the effectiveness and applicability of AI models for various patient groups become questionable.

#### Transparency and Reporting
- **Importance of Reporting**: Transparency in reporting demographic breakdowns of training data is essential for evaluating the bias and fairness of AI solutions. Without this information, stakeholders—including healthcare professionals, patients, and insurers—cannot understand or mitigate potential biases inherent in the models.

- **Findings from Research**:
  - A study conducted by the author’s team assessed the reporting and demographics of machine learning models developed using EHRs.
  - Results revealed significant inconsistencies in the reporting of demographic variables:
    - Race and ethnicity were omitted in 64% of articles.
    - Gender and age were not reported in 75% of studies.
    - Socioeconomic status was missing in over 90% of the studies.
  - Even in studies that reported these variables, it was often unclear whether they were used as model inputs.
  - Among the studies that reported demographics, White and Black individuals made up a comparatively high proportion of the study populations and Hispanic individuals a comparatively low one, highlighting potential biases in the training data.

#### Implications for AI Deployment
- **Need for Detailed Information**: To ensure the unbiased deployment of AI models in healthcare, detailed information about the data used in model development and training is crucial. Without this transparency, it is challenging to understand the potential limitations and biases of the models.
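
When demographic breakdowns are reported, a simple representativeness check becomes possible. The sketch below compares group shares in a training cohort with the shares in the population where the model would be deployed. All counts, shares, group labels, the tolerance, and the function name `representation_gaps` are hypothetical placeholders, not figures from the study above.

```python
def representation_gaps(training_counts, target_shares):
    """Compare each group's share of the training data with its target-population share."""
    total = sum(training_counts.values())
    gaps = {}
    for group, target in target_shares.items():
        training_share = training_counts.get(group, 0) / total
        gaps[group] = {
            "training_share": training_share,
            "target_share": target,
            "difference": training_share - target,
        }
    return gaps


# Hypothetical numbers for illustration only.
training_counts = {"Group 1": 700, "Group 2": 200, "Group 3": 100}
target_shares = {"Group 1": 0.60, "Group 2": 0.15, "Group 3": 0.25}
tolerance = 0.05  # assumed cut-off for flagging under-representation

gaps = representation_gaps(training_counts, target_shares)
under_represented = [g for g, v in gaps.items() if v["difference"] < -tolerance]
print("Under-represented groups:", under_represented)
```

A check like this is only as good as the reported demographics, which is why missing race, ethnicity, or socioeconomic data makes the comparison impossible.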

#### Conclusion
The findings emphasize a significant gap in the transparency and representativeness of training data used in AI solutions for healthcare. Addressing these issues is vital for fostering trust and ensuring equitable outcomes when implementing AI technologies in clinical settings. Enhanced reporting practices and demographic inclusivity are necessary steps to improve the applicability and fairness of AI solutions across diverse patient populations.

## Minimal Reporting Standards

#### Introduction
To promote transparency and identify best practices for designing machine learning models that account for biases and fairness, we propose the MINIMAR template (Minimum Information for Medical AI Reporting). This checklist aims to assist researchers in the thorough reporting of AI solutions in healthcare.

#### MINIMAR Requirements
1. **Population Information**:
   - Include details about the data sources and cohort selection that provided the training data.
   
2. **Training Data Demographics**:
   - Provide demographic information on the training data, enabling comparisons with the population to which the model will be applied.
   
3. **Model Architecture and Development**:
   - Offer comprehensive details about the model architecture and its development, facilitating interpretation of the model’s intent and enabling replication.
   
4. **Model Evaluation, Optimization, and Validation**:
   - Report model evaluation, optimization, and validation processes transparently to clarify how local model optimization can be achieved and to support replication and resource sharing.

#### Example Study Application
To illustrate the application of the MINIMAR criteria, we present a study from our lab focusing on predicting postoperative pain in patients taking certain antidepressants and prodrug opioids.

- **Study Overview**:
  - **Hypothesis**: Certain antidepressants inhibit the activation of common prodrug opioids.
  - **Target Population**: Patients undergoing elective surgery.
  - **Data Source**: Electronic Health Records (EHRs) from an academic hospital.
  - **Cohort Selection**: Included only adult patients, excluding those who died during hospitalization.

- **Demographic Reporting**:
  - **Variables**: Age, gender, race, and ethnicity breakdown reported.
  - **Proxy for Socioeconomic Status**: Insurance types were used as a proxy.

- **Model Output**:
  - **Target**: Predicting pain scores to aid healthcare workers in triaging patients at risk of uncontrolled pain post-surgery.

- **Model Development**:
  - **Train-Test Split**: Utilized tenfold data splitting.
  - **Gold Standard**: 100 manually annotated notes, alongside structured pain scores recorded in the EHRs.
  - **Model Task**: Regression model for prediction.
  - **Features**: Detailed reporting of features used, transformations applied, and methods for handling missing data.

- **Model Optimization and Validation**:
  - Provided parameters for model optimization and validation details.
  - Code and de-identified data shared via GitHub for transparency.

#### Transparency Gaps
While the study met many MINIMAR criteria, a notable gap was the lack of external validation using unseen data separate from the training set. This limitation prevents verification of the model's generalizability.
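
A reporting checklist of this kind can be treated as a small data structure and checked automatically. The sketch below is an assumption-laden illustration: the class `MinimarReport` and its field names are hypothetical and are not the official MINIMAR schema. The values summarize the example study as described above.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class MinimarReport:
    """Hypothetical record of MINIMAR-style reporting items; None means not reported."""
    data_source: Optional[str] = None
    cohort_selection: Optional[str] = None
    training_demographics: Optional[str] = None
    model_architecture: Optional[str] = None
    model_output: Optional[str] = None
    internal_validation: Optional[str] = None
    external_validation: Optional[str] = None
    code_and_data_sharing: Optional[str] = None


def missing_items(report):
    """Return the names of reporting items that were left empty."""
    return [f.name for f in fields(report) if not getattr(report, f.name)]


study = MinimarReport(
    data_source="EHRs from an academic hospital",
    cohort_selection="Adult elective-surgery patients; in-hospital deaths excluded",
    training_demographics="Age, gender, race, ethnicity; insurance type as socioeconomic proxy",
    model_architecture="Regression model",
    model_output="Predicted postoperative pain scores",
    internal_validation="Tenfold data splitting",
    external_validation=None,  # the gap noted above
    code_and_data_sharing="Code and de-identified data on GitHub",
)

print("Missing reporting items:", missing_items(study))
```

Keeping the checklist in a structured form makes gaps such as the missing external validation visible at a glance and easy to track across studies.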

#### Conclusion
The MINIMAR template provides a structured approach to ensure transparency in reporting AI solutions in healthcare. By adhering to these guidelines, researchers can enhance the understanding of model deployment and applicability across diverse patient populations. The example study demonstrates the potential benefits of thorough reporting, while also highlighting areas where improvements are needed, such as external validation, to establish model generalizability.

# Downstream Evaluations

## Opportunities and Challenges

#### Introduction
The lecture addressed the complexities surrounding bias and algorithmic fairness in AI, emphasizing the ongoing challenges and opportunities in this evolving field. As researchers navigate the intricacies of AI fairness, a deeper understanding of its definitions, implications, and potential pathways for improvement is essential.

#### Challenges in Defining Fairness
1. **Varied Definitions**: 
   - Numerous definitions of AI fairness exist in the literature, each reflecting different use cases and perspectives. This diversity complicates the identification of a single, universally applicable fairness solution.
   - The debate continues around the best definition of fairness, ranging from **equality** to **equity**.

2. **Equality vs. Equity**:
   - **Equality**: Providing each individual or group with the same resources, attention, or outcomes. This concept is illustrated by examples like the Obermeyer case.
   - **Equity**: Ensuring that groups receive the resources necessary to achieve similar outcomes, highlighting the importance of tailoring support based on individual needs.

3. **Complex Implementation**: 
   - Developing models that balance both equality and equity requires a paradigm shift in healthcare delivery, presenting a significant area of ongoing research focused on identifying and mitigating biases in datasets.

4. **Systemic Biases**:
   - Many biases in AI are systemic and often go unnoticed, complicating efforts to address them. For instance, the blue lips example illustrates how biases can manifest in unexpected ways.
   - While biases can be identified, systematically mitigating them remains a significant challenge, necessitating the development of tools for healthcare professionals to address biases at the point of care.

#### Opportunities for Advancement
1. **Growing Research and Awareness**: 
   - The increasing number of publications on bias in healthcare AI indicates a proactive movement towards creating fair and unbiased data-driven healthcare solutions.
   - This growing body of work reflects a collective commitment to addressing bias and enhancing fairness in AI systems.

2. **Transparency and Accountability**:
   - Collecting and reporting AI outputs, clinical recommendations, and patient decisions—alongside eventual outcomes—are crucial for accountability within healthcare institutions and among clinicians.
   - Transparency is particularly vital for populations historically distrustful of the medical establishment, especially in contexts involving AI.

3. **Benefit for All Groups**:
   - To leverage AI effectively, it is crucial to ensure that it benefits all demographic groups. This requires thoughtful evaluations and careful human interpretations of AI outputs.
   - Fostering an environment of inclusivity and trust will enhance the acceptance and effectiveness of AI solutions in healthcare.

#### Conclusion
The challenges surrounding bias and algorithmic fairness in AI highlight the complexity of defining and implementing fair solutions in healthcare. However, the opportunities presented by increasing research, transparency, and a commitment to equity signal a promising direction for future advancements. As the field continues to evolve, it is essential to prioritize the development of inclusive and accountable AI systems that genuinely benefit all populations, ensuring a more equitable healthcare landscape.