🏗️ The Synthetic Data Structure (SCM)

The data is organized into three main types of features, plus the target variable. They are interconnected to simulate a realistic scenario, like our loan approval task.

1. The Target Outcome: $Y$

Definition: $Y$ is the dependent variable that your Logistic Regression model tries to predict.

Analogy: In the loan project, $Y$ is the Loan Approval Decision (1 for approved, 0 for denied).

How it's Determined: $Y$ is primarily a function of the Predictive Features ($R$), the Sensitive Feature ($Q$) (if bias is introduced), and some Noise ($\epsilon_y$, controlled by the sy parameter).

$$\mathbf{Y} = f(\mathbf{R}, \mathbf{Q}) + \epsilon_y$$

2. The Sensitive Feature: $Q$

Definition: $Q$ is the protected attribute that you use to test for fairness disparities.

Analogy: In the loan project, $Q$ is the Group ID (e.g., Group A or Group B), which is binary.

Relation to $Y$ (Bias): In the unbiased baseline ($D_{base}$), the parameter l_q is set to $0$, meaning $Q$ has no direct influence on $Y$. To introduce bias (H1), you increase the l_q value, creating an unfair causal link between $Q$ and $Y$.

3. The Predictive Features: $R$

Definition: $R$ represents the set of features that should legitimately determine the outcome $Y$. These are the non-sensitive features.

Analogy: In the loan project, $R$ includes features like Credit Score and Annual Income.

Relation to $Y$ (Signal): $R$ has the strongest legitimate causal influence on $Y$. The model is expected to learn this relationship. The parameter l_y (historical bias on $Y$) often determines the baseline strength of the signal from $R$ to $Y$.

4. Other Variables

$A$: This is often used interchangeably with the Sensitive Feature $Q$ in bias literature (for "Protected Attribute"). The p_u (undersampling) parameter uses $A=1$ to denote the group being undersampled.

$\epsilon$ (Epsilon): Represents various forms of random noise. $\epsilon_y$ is the noise added to the target $Y$, controlled by the sy parameter. $\epsilon_R$ and $\epsilon_Q$ represent noise/randomness in the features themselves, which are generated in the background.

Key Assumption: We assume $Q$ represents our Sensitive Feature (e.g., Group A/B) and $R$ represents the Predictive Features (e.g., Credit Score, Income). The setting l_q = 0.0 ensures that the sensitive feature $Q$ has no direct causal influence on the outcome $Y$, making the baseline data inherently fair.
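To make the structure above concrete, here is a minimal NumPy sketch of a toy SCM. It is not biasondemand's actual generating code: the function name `toy_scm`, the linear form of $f$, the sign convention (a positive `l_q` lowers the score for $Q=1$), and the threshold at zero are all assumptions chosen for illustration. Only the parameter names `l_q` and `sy` mirror the library.

```python
import numpy as np

def toy_scm(n, l_q=0.0, sy=0.0, seed=0):
    """Toy stand-in for the SCM described above (hypothetical, for illustration)."""
    rng = np.random.default_rng(seed)
    q = rng.integers(0, 2, size=n)        # binary sensitive feature Q
    r = rng.standard_normal(n)            # a single predictive feature R
    eps_y = sy * rng.standard_normal(n)   # target noise with standard deviation sy
    score = r - l_q * q + eps_y           # assumed linear form of f(R, Q) + eps_y
    y = (score > 0).astype(int)           # binarize: 1 = approved, 0 = denied
    return q, r, y

# With l_q = 0 and sy = 0, Y is a function of R alone.
q, r, y = toy_scm(10_000, l_q=0.0)
print(np.array_equal(y, (r > 0).astype(int)))  # True
```

Raising `l_q` in this sketch shifts the score for the $Q=1$ group only, which is the kind of direct $Q \to Y$ link the baseline is meant to exclude.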

In [1]:
import biasondemand

# The Unbiased, Clean Baseline Dataset (D_base)
biasondemand.generate_dataset(
    path="/baseline",
    dim=10, 
)

Correlation between R and P:  [[1.         0.89342576]
 [0.89342576 1.        ]]

:)
:) The dataset has been generated and saved in the directory datasets//baseline/
:)


We will create four distinct functions to generate our experimental datasets ($D_1$ to $D_4$) by systematically varying one parameter at a time.

**Bias**
We will introduce label bias by increasing the importance of the sensitive feature $Q$ on the outcome $Y$. This simulates a world where the outcome is unfairly dependent on the protected attribute.
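One simple way to check whether a given l_q value has produced a disparity is to compare positive-outcome rates between the two groups (often called the demographic parity difference). The sketch below assumes the labels and the sensitive feature are already available as arrays; `positive_rate_gap` is a hypothetical helper, not part of biasondemand, and the demo arrays are hand-made rather than generated data.

```python
import numpy as np

def positive_rate_gap(y, q):
    """Difference in positive-outcome rate between group Q=0 and group Q=1."""
    y = np.asarray(y)
    q = np.asarray(q)
    return float(y[q == 0].mean() - y[q == 1].mean())

# Hand-made example: 3 of 4 approved in group 0, 1 of 4 approved in group 1.
y_demo = np.array([1, 1, 0, 1, 0, 0, 1, 0])
q_demo = np.array([0, 0, 0, 0, 1, 1, 1, 1])
print(positive_rate_gap(y_demo, q_demo))  # 0.5
```

A gap near zero is what an unbiased dataset should show; the gap would be expected to grow as l_q increases.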

In [2]:
def generate_bias_series(bias_levels):
    """Generate one dataset per l_q value; all other parameters stay fixed."""
    for l_q_val in bias_levels:
        biasondemand.generate_dataset(
            path=f"/bias_lq_{l_q_val:.1f}",
            dim=10000,
            sy=0.0,
            l_q=l_q_val,  # VARYING THIS: Introduces bias
            l_r_q=0.0,
            thr_supp=1.0
        )

generate_bias_series([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])

Correlation between R and P:  [[1.         0.90583194]
 [0.90583194 1.        ]]

:)
:) The dataset has been generated and saved in the directory datasets//bias_lq_0.1/
:)
Correlation between R and P:  [[1.         0.90583194]
 [0.90583194 1.        ]]

:)
:) The dataset has been generated and saved in the directory datasets//bias_lq_0.2/
:)
Correlation between R and P:  [[1.         0.90583194]
 [0.90583194 1.        ]]

:)
:) The dataset has been generated and saved in the directory datasets//bias_lq_0.3/
:)
Correlation between R and P:  [[1.         0.90583194]
 [0.90583194 1.        ]]

:)
:) The dataset has been generated and saved in the directory datasets//bias_lq_0.4/
:)
Correlation between R and P:  [[1.         0.90583194]
 [0.90583194 1.        ]]

:)
:) The dataset has been generated and saved in the directory datasets//bias_lq_0.5/
:)
Correlation between R and P:  [[1.         0.90583194]
 [0.90583194 1.        ]]

:)
:) The dataset has been generated and saved in the dire

**Noise**
We will introduce noise directly into the outcome variable $Y$. This simulates mislabeled or corrupted target data.
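To get a feel for what a given noise level does to the labels, the sketch below adds Gaussian noise with standard deviation `sy` to a latent score and counts how many thresholded labels change. This assumes the noise is added before the outcome is binarized, which may not match biasondemand's internals; `label_flip_rate` is a hypothetical helper written for this illustration.

```python
import numpy as np

def label_flip_rate(sy, n=20_000, seed=0):
    """Fraction of binary labels that change when N(0, sy^2) noise hits the score."""
    rng = np.random.default_rng(seed)
    r = rng.standard_normal(n)                    # clean latent score
    clean = r > 0                                 # labels without noise
    noisy = (r + sy * rng.standard_normal(n)) > 0 # labels after adding noise
    return float(np.mean(clean != noisy))

for sy in (0.0, 0.1, 0.5, 0.9):
    print(sy, round(label_flip_rate(sy), 3))
```

The flips concentrate on instances whose clean score lies close to the threshold, so moderate noise corrupts borderline cases first.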

In [3]:
def generate_noise_series(noise_levels):
    """Generate one dataset per sy value; all other parameters stay fixed."""
    for sy_val in noise_levels:
        biasondemand.generate_dataset(
            path=f"/noise_sy_{sy_val:.1f}",
            dim=10000,
            sy=sy_val,  # VARYING THIS: Introduces label noise
            l_q=0.0,
            l_r_q=0.0,
            thr_supp=1.0
        )

generate_noise_series([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])

Correlation between R and P:  [[1.         0.90583194]
 [0.90583194 1.        ]]

:)
:) The dataset has been generated and saved in the directory datasets//noise_sy_0.1/
:)
Correlation between R and P:  [[1.         0.90583194]
 [0.90583194 1.        ]]

:)
:) The dataset has been generated and saved in the directory datasets//noise_sy_0.2/
:)
Correlation between R and P:  [[1.         0.90583194]
 [0.90583194 1.        ]]

:)
:) The dataset has been generated and saved in the directory datasets//noise_sy_0.3/
:)
Correlation between R and P:  [[1.         0.90583194]
 [0.90583194 1.        ]]

:)
:) The dataset has been generated and saved in the directory datasets//noise_sy_0.4/
:)
Correlation between R and P:  [[1.         0.90583194]
 [0.90583194 1.        ]]

:)
:) The dataset has been generated and saved in the directory datasets//noise_sy_0.5/
:)
Correlation between R and P:  [[1.         0.90583194]
 [0.90583194 1.        ]]

:)
:) The dataset has been generated and saved in the

**Imbalance**
p_u = 1.0 means 100% of samples in $A=1$ are kept (balanced). Decreasing p_u from $0.9$ down to $0.1$ removes progressively more instances from the $A=1$ group, creating progressively stronger imbalance and directly testing H2.
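The effect of p_u on group composition can be sketched with a simple keep-mask. This version keeps every $A=0$ row and keeps each $A=1$ row independently with probability `p_u`; biasondemand may implement undersampling differently (for example, by drawing an exact fraction), so treat `undersample_group` as a hypothetical stand-in.

```python
import numpy as np

def undersample_group(a, p_u, seed=0):
    """Boolean keep-mask: all A=0 rows kept, each A=1 row kept with probability p_u."""
    rng = np.random.default_rng(seed)
    a = np.asarray(a)
    return (a == 0) | (rng.random(a.shape[0]) < p_u)

# Share of the A=1 group among kept rows for a few p_u values.
rng = np.random.default_rng(1)
a = rng.integers(0, 2, size=15_000)
for p_u in (0.9, 0.5, 0.1):
    keep = undersample_group(a, p_u)
    print(p_u, round(float(a[keep].mean()), 3))
```

Starting from two equally sized groups, the expected share of $A=1$ after undersampling is $p_u / (1 + p_u)$, so p_u = 0.1 leaves that group at roughly 9% of the data.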

In [4]:
def generate_imbalance_series(undersampling_percentages):
    """Generate one dataset per p_u value; all other parameters stay fixed."""
    for p_u_val in undersampling_percentages:
        biasondemand.generate_dataset(
            path=f"/imbalance_pu_{p_u_val:.1f}",
            dim=15000,
            sy=0,
            l_q=0,
            p_u=p_u_val,  # VARYING THIS: fraction of the A=1 group that is kept
            l_r_q=0,
            l_y=4,
            l_h_r=1.5,
            l_h_q=1,
        )

# Run from mildest (0.9) to strongest (0.1) undersampling.
generate_imbalance_series([0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1])

Correlation between R and P:  [[1.         0.90457101]
 [0.90457101 1.        ]]

:)
:) The dataset has been generated and saved in the directory datasets//imbalance_pu_0.9/
:)
Correlation between R and P:  [[1.         0.90457101]
 [0.90457101 1.        ]]

:)
:) The dataset has been generated and saved in the directory datasets//imbalance_pu_0.8/
:)
Correlation between R and P:  [[1.         0.90457101]
 [0.90457101 1.        ]]

:)
:) The dataset has been generated and saved in the directory datasets//imbalance_pu_0.7/
:)
Correlation between R and P:  [[1.         0.90457101]
 [0.90457101 1.        ]]

:)
:) The dataset has been generated and saved in the directory datasets//imbalance_pu_0.6/
:)
Correlation between R and P:  [[1.         0.90457101]
 [0.90457101 1.        ]]

:)
:) The dataset has been generated and saved in the directory datasets//imbalance_pu_0.5/
:)
Correlation between R and P:  [[1.         0.90457101]
 [0.90457101 1.        ]]

:)
:) The dataset has been genera