<a href="https://colab.research.google.com/github/alheliou/Bias_mitigation/blob/main/TD_intro_data_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# TD 1: Fairness notion and data analysis

In this first TD, we will manipulate some data and observe how the different fairness metrics behave.

## Objectives


 1. Study the data, the distribution of each feature and its relation to the target.

 2. Highlight some of the biases present in the data.


## Installation of the environment

We strongly recommend following these steps, so that every student works in an environment as similar as possible to the one used during testing.

### Colab Settings
  The next two code cells are to be executed only once per Colab environment.


#### 1. Python env creation

        ```
        ! python -m pip install numpy fairlearn plotly nbformat ipykernel aif360["inFairness"] aif360['AdversarialDebiasing'] causal-learn BlackBoxAuditing cvxpy dice-ml lime shapkit
        ```

#### 2. Download the MEPS dataset (for Part 2); this can take several minutes

        ```
        ! Rscript /usr/local/lib/python3.12/dist-packages/aif360/data/raw/meps/generate_data.R
        ! mv h181.csv /usr/local/lib/python3.12/dist-packages/aif360/data/raw/meps/
        ! mv h192.csv /usr/local/lib/python3.12/dist-packages/aif360/data/raw/meps/
        ```

  
### Local Settings

#### 1. Uv installation


        https://docs.astral.sh/uv/getting-started/installation/


        `curl -LsSf https://astral.sh/uv/install.sh | sh`

        Python version 3.12 installation (highly recommended)
        `uv python install 3.12`

#### 2. R installation (only needed for the data download/pre-processing of Part 2)

        If the `Rscript` command fails with 'command not found':

        `sudo apt install r-base-core`

#### 3. Python env creation

        ```
        mkdir TD_bias_mitigation
        cd TD_bias_mitigation
        uv python pin 3.12
        uv venv
        uv pip install numpy fairlearn plotly nbformat ipykernel aif360["inFairness"] aif360['AdversarialDebiasing'] causal-learn BlackBoxAuditing cvxpy dice-ml lime shapkit
        ```

#### 4. Download the MEPS dataset (for Part 2); this can take several minutes

        ```
        cd TD_bias_mitigation/.venv/lib/python3.12/site-packages/aif360/data/raw/meps/
        Rscript generate_data.R
        ```



In [1]:
! python -m pip install numpy fairlearn plotly nbformat ipykernel aif360["inFairness"] aif360['AdversarialDebiasing'] causal-learn BlackBoxAuditing cvxpy dice-ml lime shapkit

Collecting fairlearn
  Downloading fairlearn-0.12.0-py3-none-any.whl.metadata (7.0 kB)
Collecting causal-learn
  Downloading causal_learn-0.1.4.3-py3-none-any.whl.metadata (4.6 kB)
Collecting BlackBoxAuditing
  Downloading BlackBoxAuditing-0.1.54.tar.gz (2.6 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.6/2.6 MB 37.6 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Collecting dice-ml
  Downloading dice_ml-0.12-py3-none-any.whl.metadata (20 kB)
Collecting lime
  Downloading lime-0.2.0.1.tar.gz (275 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 275.7/275.7 kB 9.0 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Collecting shapkit
  Downloading shapkit-0.0.4-py3-none-any.whl.metadata (7.2 kB)
Collecting aif360[inFairness]
  Downloading aif360-0.6.1-py3-none-any.whl.metadata (5.0 kB)
Collecting skorch (from aif360[inFairness])
  Downloading skorch-1

In [3]:
! Rscript /usr/local/lib/python3.12/dist-packages/aif360/data/raw/meps/generate_data.R


By using this script you acknowledge the responsibility for reading and
abiding by any copyright/usage rules and restrictions as stated on the
MEPS web site (https://meps.ahrq.gov/data_stats/data_use.jsp).

Continue [y/n]? > y
Loading required package: foreign
trying URL 'https://meps.ahrq.gov/mepsweb/data_files/pufs/h181ssp.zip'
Content type 'application/zip' length 13303652 bytes (12.7 MB)
downloaded 12.7 MB

Loading dataframe from file: h181.ssp
Exporting dataframe to file: h181.csv
trying URL 'https://meps.ahrq.gov/mepsweb/data_files/pufs/h192ssp.zip'
Content type 'application/zip' length 15505898 bytes (14.8 MB)
downloaded 14.8 MB

Loading dataframe from file: h192.ssp
Exporting dataframe to file: h192.csv


In [5]:
! mv h181.csv /usr/local/lib/python3.12/dist-packages/aif360/data/raw/meps/
! mv h192.csv /usr/local/lib/python3.12/dist-packages/aif360/data/raw/meps/

## Dataset: Diabetes 130-Hospitals


https://fairlearn.org/main/api_reference/generated/fairlearn.datasets.fetch_diabetes_hospital.html


This dataset contains 101,766 rows, each one corresponding to a patient hospitalized for diabetes for a duration ranging from 1 to 14 days. The data was collected over 10 years and across 130 different hospitals. Each data point has 25 features, including medical and demographic information. The 'readmitted' column indicates whether the patient was readmitted, and if so, whether it was within 30 days or after. This column is further binarized into two other columns: 'readmit_30_days' (True if readmitted within 30 days, False otherwise) and 'readmit_binary' (True if readmitted at all, False otherwise).

We will use the 'readmit_30_days' column as the label/ground truth.

We will simplify the analysis by considering only a subset of 14 provided features:
age, gender, race, time_in_hospital, num_lab_procedures, num_procedures, num_medications, number_diagnoses, max_glu_serum, A1Cresult, insulin, had_emergency, had_inpatient_days, had_outpatient_days.
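
The binarization described above can be sketched on toy data. This is only an illustration: the raw codes `"<30"`, `">30"` and `"NO"` are an assumption based on the original UCI dataset, and the column name `readmit_any` is a hypothetical name chosen for this example.

```python
import pandas as pd

# Toy values; the raw codes "<30", ">30" and "NO" are an assumption,
# they are not stated in this notebook.
readmitted = pd.Series(["NO", ">30", "<30", "NO", "<30"], name="readmitted")

toy = pd.DataFrame({
    "readmitted": readmitted,
    # True only when the patient was readmitted within 30 days
    "readmit_30_days": readmitted.eq("<30"),
    # True for any readmission, whatever the delay (hypothetical column name)
    "readmit_any": readmitted.ne("NO"),
})
print(toy)
```

Only the first binary column is used as the label in this TD.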

## Dataset: Meps

We recommend consulting the following pages for a better understanding of the dataset: [MEPSDataset19](https://aif360.readthedocs.io/en/latest/modules/generated/aif360.datasets.MEPSDataset19.html) and the [AIF360 tutorial](https://github.com/Trusted-AI/AIF360/blob/main/examples/tutorial_medical_expenditure.ipynb)

What you need to have read
- **The sensitive attribute is 'RACE': 1 is privileged, 0 is unprivileged.** It is constructed as follows: 'Whites' (the privileged class) are defined by the features RACEV2X = 1 (White) and HISPANX = 2 (non-Hispanic); 'Non-Whites' includes everyone else.
(The features 'RACEV2X', 'HISPANX', etc. are removed and replaced by 'RACE'.)
- **'UTILIZATION' is the outcome (the label to predict for an ML model): 0 is positive, 1 is negative.** It is a binary composite feature, created to measure the total number of trips requiring some sort of medical care. It sums the following features (which are removed from the data):
    * OBTOTV15(16), the number of office based visits
    * OPTOTV15(16), the number of outpatient visits
    * ERTOT15(16), the number of ER visits
    * IPNGTD15(16), the number of inpatient nights
    * HHTOTD16, the number of home health visits
UTILIZATION is set to 1 when the sum is greater than or equal to 10; otherwise it is set to 0.
- **The dataset is weighted.** The dataset comes with an 'instance_weights' attribute that corresponds to the feature perwt15f. These weights are meant to produce estimates that are representative of the United States (US) population in 2015.
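
The construction of 'UTILIZATION' described above can be sketched as follows. The visit counts are made up, and this is not the AIF360 pre-processing code, only a minimal illustration of the "sum, then threshold at 10" rule.

```python
import pandas as pd

# Toy visit counts for three hypothetical respondents (values are made up).
visits = pd.DataFrame({
    "OBTOTV15": [2, 8, 0],   # office-based visits
    "OPTOTV15": [0, 1, 0],   # outpatient visits
    "ERTOT15":  [1, 0, 0],   # ER visits
    "IPNGTD15": [0, 3, 0],   # inpatient nights
    "HHTOTD16": [0, 0, 9],   # home health visits
})

# Sum the five counts per respondent, then threshold at 10.
total_visits = visits.sum(axis=1)
utilization = (total_visits >= 10).astype(int)

print(total_visits.tolist())  # [3, 12, 9]
print(utilization.tolist())   # [0, 1, 0]
```

Only the second respondent reaches the threshold, so only that row gets UTILIZATION = 1.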


Summary to remember
- **The sensitive attribute is 'RACE': 1 is privileged, 0 is unprivileged**
- **'UTILIZATION' is the outcome (the label to predict for an ML model): 0 is positive, 1 is negative**
- **The dataset is weighted**


In [6]:
# Code to compute fairness metrics using aif360

from aif360.sklearn.metrics import *
from sklearn.metrics import  balanced_accuracy_score


# This method takes lists
def get_metrics(
    y_true, # list or np.array of truth values
    y_pred=None,  # list or np.array of predictions
    prot_attr=None, # list or np.array of protected/sensitive attribute values
    priv_group=1, # value taken by the privileged group
    pos_label=1, # value taken by the positive truth/prediction
    sample_weight=None,  # list or np.array of weight values
):
    group_metrics = {}
    group_metrics["base_rate_truth"] = base_rate(
        y_true=y_true, pos_label=pos_label, sample_weight=sample_weight
    )
    group_metrics["statistical_parity_difference"] = statistical_parity_difference(
        y_true=y_true, y_pred=y_pred, prot_attr=prot_attr, priv_group=priv_group, pos_label=pos_label, sample_weight=sample_weight
    )
    group_metrics["disparate_impact_ratio"] = disparate_impact_ratio(
        y_true=y_true, y_pred=y_pred, prot_attr=prot_attr, priv_group=priv_group, pos_label=pos_label, sample_weight=sample_weight
    )
    # The metrics below compare predictions with the truth, so they need y_pred
    if y_pred is not None:
        group_metrics["base_rate_preds"] = base_rate(
            y_true=y_pred, pos_label=pos_label, sample_weight=sample_weight
        )
        group_metrics["equal_opportunity_difference"] = equal_opportunity_difference(
            y_true=y_true, y_pred=y_pred, prot_attr=prot_attr, priv_group=priv_group, pos_label=pos_label, sample_weight=sample_weight
        )
        group_metrics["average_odds_difference"] = average_odds_difference(
            y_true=y_true, y_pred=y_pred, prot_attr=prot_attr, priv_group=priv_group, pos_label=pos_label, sample_weight=sample_weight
        )
        # Only computed when the predictions contain more than one distinct value
        if len(set(y_pred)) > 1:
            group_metrics["conditional_demographic_disparity"] = conditional_demographic_disparity(
                y_true=y_true, y_pred=y_pred, prot_attr=prot_attr, pos_label=pos_label, sample_weight=sample_weight
            )
        else:
            group_metrics["conditional_demographic_disparity"] = None
        group_metrics["smoothed_edf"] = smoothed_edf(
            y_true=y_true, y_pred=y_pred, prot_attr=prot_attr, pos_label=pos_label, sample_weight=sample_weight
        )
        group_metrics["df_bias_amplification"] = df_bias_amplification(
            y_true=y_true, y_pred=y_pred, prot_attr=prot_attr, pos_label=pos_label, sample_weight=sample_weight
        )
        group_metrics["balanced_accuracy_score"] = balanced_accuracy_score(
            y_true=y_true, y_pred=y_pred, sample_weight=sample_weight
        )
    return group_metrics



## Part 1: Dataset Diabetes analysis

###  Download and simplify the dataset

In [7]:
import numpy as np
import pandas as pd
import fairlearn
np.__version__, fairlearn.__version__

('2.0.2', '0.12.0')

In [8]:
from fairlearn.datasets import fetch_diabetes_hospital
dataset = fetch_diabetes_hospital()

In [9]:
selection = [
    "age",
    "gender",
    "race",
    "time_in_hospital",
    "num_lab_procedures",
    "num_procedures",
    "num_medications",
    "number_diagnoses",
    "max_glu_serum",
    "A1Cresult",
    "insulin",
    "had_emergency",
    "had_inpatient_days",
    "had_outpatient_days"]

categorical_features = [
    "had_emergency",
    "had_inpatient_days",
    "had_outpatient_days",
    "age",
    "gender",
    "race",
    "max_glu_serum",
    "A1Cresult",
    "insulin"]

numerical_features = list(set(selection) - set(categorical_features))

df_diabetes = dataset.data[selection].copy(deep=True)

label = 'readmit_30_days'

df_diabetes[label] = dataset.target

for categorical_feature in categorical_features:
    df_diabetes[categorical_feature] = df_diabetes[categorical_feature].astype('category')
df_diabetes.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 101766 entries, 0 to 101765
Data columns (total 15 columns):
 #   Column               Non-Null Count   Dtype   
---  ------               --------------   -----   
 0   age                  101766 non-null  category
 1   gender               101766 non-null  category
 2   race                 101766 non-null  category
 3   time_in_hospital     101766 non-null  int64   
 4   num_lab_procedures   101766 non-null  int64   
 5   num_procedures       101766 non-null  int64   
 6   num_medications      101766 non-null  int64   
 7   number_diagnoses     101766 non-null  int64   
 8   max_glu_serum        101766 non-null  category
 9   A1Cresult            101766 non-null  category
 10  insulin              101766 non-null  category
 11  had_emergency        101766 non-null  category
 12  had_inpatient_days   101766 non-null  category
 13  had_outpatient_days  101766 non-null  category
 14  readmit_30_days      101766 non-null  int64   
dtype

### Question 1: Count the number of positive and negative labels

### Question 2: Display the distribution of the numerical features and compute their correlation with the target

### Question 3: Display a histogram of the distribution by label for each categorical feature

### Question 4: Compute base rate metrics for a binary sensitive attribute (gender, race, etc.)

## Part 2: MEPS dataset analysis

###  Download and simplify the dataset

In [10]:
# imports
import numpy as np
import pandas as pd
import plotly.express as px
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', append=True, category=UserWarning)


In [11]:
# Datasets
from aif360.datasets import MEPSDataset19
from aif360.datasets import MEPSDataset20
from aif360.datasets import MEPSDataset21

MEPSDataset19_data = MEPSDataset19()

In [12]:
instance_weights = MEPSDataset19_data.instance_weights
f"Dataset len {len(instance_weights)}, total weight of the dataset {instance_weights.sum()}."

'Dataset len 15830, total weight of the dataset 141367240.546316.'

### First overview of the dataset

The AIF360 library provides a wrapper around the dataset. This makes the data a bit less intuitive to use (for example, to study or visualize the attributes one by one), but it allows fairness metrics to be computed in a single line of code.

In [13]:
from aif360.metrics import BinaryLabelDatasetMetric
from aif360.metrics import ClassificationMetric

metric_orig_panel19_train = BinaryLabelDatasetMetric(
        MEPSDataset19_data,
        unprivileged_groups=[{'RACE': 0}],
        privileged_groups=[{'RACE': 1}])

print(metric_orig_panel19_train.disparate_impact())

0.49826823461176517


However, since the aim of this lab is still to manipulate and analyze the data, we will return to working with the data in the form of a dataframe.

Note: To calculate fairness metrics without having to re-implement them for the weighted case (instance weights), you can use the methods implemented in AIF360 here: [Fairness Metrics Implementation](https://aif360.readthedocs.io/en/latest/modules/sklearn.html#module-aif360.sklearn.metrics)
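
As a rough illustration of what such weighted metrics compute, here is a minimal NumPy sketch on made-up data. It assumes the usual convention of "unprivileged rate minus privileged rate" for the difference and "unprivileged rate over privileged rate" for the ratio; the helper `weighted_rate` is a hypothetical name, not part of AIF360.

```python
import numpy as np

def weighted_rate(y, mask, weights, pos_label):
    """Weighted share of `pos_label` among the rows selected by `mask`."""
    return np.average(y[mask] == pos_label, weights=weights[mask])

# Toy data (made up): label, sensitive attribute and instance weights.
y = np.array([0, 0, 1, 0, 1, 1, 0, 1])
race = np.array([1, 1, 1, 1, 0, 0, 0, 0])
w = np.array([2.0, 1.0, 1.0, 1.0, 1.0, 3.0, 2.0, 2.0])

pos_label, priv_group = 0, 1
rate_priv = weighted_rate(y, race == priv_group, w, pos_label)
rate_unpriv = weighted_rate(y, race != priv_group, w, pos_label)

spd = rate_unpriv - rate_priv  # statistical parity difference
di = rate_unpriv / rate_priv   # disparate impact ratio
print(rate_priv, rate_unpriv, spd, di)
```

Setting all weights to 1 gives the unweighted versions of the same quantities.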

### Conversion to a DataFrame

We have seen that the sum of the weights is large, over 141 million, so we cannot reasonably duplicate each row as many times as its weight.

We will store the weighting and take it into account later in our analysis.
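
To see why storing the weights is enough, the sketch below compares the two approaches on a tiny made-up example with integer weights: physically duplicating rows and computing a weighted average give the same rate.

```python
import numpy as np

# Toy labels with small integer weights (made up for illustration).
y = np.array([1, 0, 0, 1])
w = np.array([3, 1, 2, 4])

# Option 1: physically repeat each row `w` times (impractical at MEPS scale).
duplicated = np.repeat(y, w)
rate_duplicated = duplicated.mean()

# Option 2: keep one row per respondent and weight it.
rate_weighted = np.average(y, weights=w)

print(len(duplicated), rate_duplicated, rate_weighted)
```

The weighted version also works with non-integer weights, which duplication cannot handle.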

In [14]:
def get_df(MepsDataset):
    data = MepsDataset.convert_to_dataframe()
    # data is a tuple: the DataFrame and a dictionary with all the metadata (weights, sensitive attributes, etc.)
    df = data[0]
    df['WEIGHT'] = data[1]['instance_weights']
    return df

df_meps = get_df(MEPSDataset19_data)

In [15]:
df_meps.columns

Index(['AGE', 'RACE', 'PCS42', 'MCS42', 'K6SUM42', 'REGION=1', 'REGION=2',
       'REGION=3', 'REGION=4', 'SEX=1',
       ...
       'POVCAT=1', 'POVCAT=2', 'POVCAT=3', 'POVCAT=4', 'POVCAT=5', 'INSCOV=1',
       'INSCOV=2', 'INSCOV=3', 'UTILIZATION', 'WEIGHT'],
      dtype='object', length=140)

In [16]:
get_metrics(
   y_true=df_meps.UTILIZATION, # list or np.array of truth values
   y_pred=None,  # list or np.array of predictions
   prot_attr=df_meps.RACE, # list or np.array of protected/sensitive attribute values
   priv_group=1, # value taken by the privileged group
   pos_label=0, # value taken by the positive truth/prediction
   sample_weight=None # list or np.array of weights value
)

{'base_rate_truth': np.float64(0.8283006948831333),
 'statistical_parity_difference': np.float64(0.13008294988278912),
 'disparate_impact_ratio': 1.1746792888264614}

In [17]:
get_metrics(
    y_true=df_meps.UTILIZATION, # list or np.array of truth values
    y_pred=None,  # list or np.array of predictions
    prot_attr=df_meps.RACE, # list or np.array of protected/sensitive attribute values
    priv_group=1, # value taken by the privileged group
    pos_label=0, # value taken by the positive truth/prediction
    sample_weight=df_meps.WEIGHT # list or np.array of weights value
)

{'base_rate_truth': np.float64(0.7849286063696154),
 'statistical_parity_difference': np.float64(0.1350744772647814),
 'disparate_impact_ratio': 1.1848351529675123}

### Question 5.1 - Carry out a univariate descriptive study of skin colour ('RACE') (counts, frequencies, mode)

### Question 5.2 - Plot charts describing skin colour (pie chart, bar chart)

### Question 5.3 - Carry out a bivariate analysis between skin colour and the other quantitative explanatory variables (box plots of the variables by 'RACE' group, density/histogram by 'RACE' group, correlation ratio)
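
The correlation ratio mentioned in Question 5.3 measures the share of the variance of a quantitative variable that is explained by a categorical one (between-group variance over total variance). Below is a minimal unweighted sketch on made-up data; the function name `correlation_ratio` is hypothetical, and a weighted variant would be needed to take the instance weights into account.

```python
import numpy as np

def correlation_ratio(groups, values):
    """Eta squared: share of the variance of `values` explained by `groups`."""
    groups = np.asarray(groups)
    values = np.asarray(values, dtype=float)
    overall_mean = values.mean()
    # Between-group sum of squares: group size times squared mean deviation
    between = sum(
        (groups == g).sum() * (values[groups == g].mean() - overall_mean) ** 2
        for g in np.unique(groups)
    )
    # Total sum of squares around the overall mean
    total = ((values - overall_mean) ** 2).sum()
    return between / total if total > 0 else 0.0

# Toy example (made-up ages for two groups).
race = [0, 0, 0, 1, 1, 1]
age = [30, 32, 34, 50, 52, 54]
print(correlation_ratio(race, age))
```

A value close to 1 means the groups differ strongly on that variable; a value close to 0 means the group means are almost identical.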

### Question 5.4 - Carry out a bivariate analysis between skin colour and other qualitative explanatory variables (contingency table, bar charts of the row profiles and of the column profiles, mosaic plot)

### Question 5.5 - Carry out a bivariate analysis between skin colour and the 'UTILIZATION' column

### Question 6 - Repeat the analysis of Question 5 with another sensitive column (sex, age, etc.)