# Example: Fair Partitioning of Synthetic Data

This notebook demonstrates two fair partitioning algorithms, FairGroups and FairKMeans, on synthetic data. We create a dataset with a known ground-truth partition and compare how each algorithm performs in identifying fair groups.

We suppose that $L \sim \mathcal{U}(0,100)$ and that, given $L$, $Y \sim \mathrm{Bernoulli}(p(L))$, where

$$p(L) = 0.1\,\mathbf{1}_{\{L \leq 20\}} + 0.3\,\mathbf{1}_{\{20 < L \leq 30\}} + 0.5\,\mathbf{1}_{\{30 < L \leq 55\}} + 0.7\,\mathbf{1}_{\{55 < L \leq 88\}} + 0.9\,\mathbf{1}_{\{88 < L \leq 100\}}.$$

We sample $N = 50000$ pairs of observations $(L, Y)$ from this joint distribution $\mathcal{D}$.

We apply the FairKMeans and FairGroups methods to find a partition $\mathcal{P} = \{\mathcal{P}_k\}_{k=1}^K$ of the range of $L$, using $\Phi(S^\mathcal{P}) = \mathbb{P}(Y = 1 \mid S^\mathcal{P}) - \mathbb{P}(Y = 1)$, where $S^\mathcal{P} = k \iff L \in \mathcal{P}_k$ for $k=1,\dots,K$.
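
To make the definition of $\Phi$ concrete, here is a minimal sketch that estimates it empirically for a fixed partition given by its boundaries. The helper name `empirical_phi` and the demo values are assumptions made for illustration; this is not the `fair_partition` implementation.

```python
import numpy as np

def empirical_phi(l, y, edges):
    """Estimate P(Y=1 | S=k) - P(Y=1) for each cell k of a partition of L.

    edges: increasing boundaries [e_0, ..., e_K]; cell k is (e_{k-1}, e_k],
    and the first cell also contains e_0.
    """
    l = np.asarray(l, dtype=float)
    y = np.asarray(y, dtype=float)
    edges = np.asarray(edges, dtype=float)
    # Only interior edges are needed; right=True assigns a value equal to an edge to the lower cell
    group = np.digitize(l, edges[1:-1], right=True)
    overall = y.mean()  # estimate of P(Y = 1)
    return np.array([y[group == k].mean() - overall for k in range(len(edges) - 1)])

# Tiny hand-checkable example (values chosen for illustration only)
l_demo = np.array([5, 15, 25, 35, 60, 95])
y_demo = np.array([0, 0, 1, 0, 1, 1])
phi_demo = empirical_phi(l_demo, y_demo, [0, 30, 100])
print(phi_demo)
```

In the demo, the overall positive rate is 1/2, the cell $(0, 30]$ has rate 1/3 and the cell $(30, 100]$ has rate 2/3, so the two values of $\Phi$ are $-1/6$ and $+1/6$.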

In [1]:
import numpy as np

# Import fair grouping algorithms and utility functions
from fair_partition.partition_estimation import FairGroups, FairKMeans
from fair_partition.fairness_metrics import get_conditional_positive_y_proba
from fair_partition.visualization import plot_partition, plot_partition_with_ci, plot_conditional_proba

In [2]:
# Set random seed for reproducibility
np.random.seed(13)

In [3]:
# Generate synthetic data
nb_groups = 5  # Number of groups to partition into
nb_obs = 10000  # Average number of observations per group (total sample: nb_obs * nb_groups)

# Define ground truth partition boundaries
gt_partition = np.array([0, 20, 30, 55, 88, 100])

# Generate the sensitive variable L (stored in `s`), uniformly distributed on [s_min, s_max]
s_min = 0  # Minimum value of L
s_max = 100  # Maximum value of L
s = np.random.uniform(s_min, s_max, nb_obs * nb_groups)

# Generate binary labels Y with different probabilities for each group
y_probs = np.linspace(0.1, 0.9, nb_groups)  # Linearly spaced probabilities from 0.1 to 0.9
y = np.zeros(len(s))

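# Assign labels based on the ground truth partition.
# Intervals are (lower, upper], matching p(L); the first one also includes its lower bound.
for i in range(len(gt_partition) - 1):
    above_lower = (s >= gt_partition[i]) if i == 0 else (s > gt_partition[i])
    mask = above_lower & (s <= gt_partition[i + 1])
    y[mask] = np.random.binomial(1, y_probs[i], np.sum(mask))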
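
Before fitting anything, it can help to check that the simulated labels match the target probabilities. The sketch below is a self-contained sanity check under the same distribution; it uses its own generator and its own variable names (`l_chk`, `y_chk`, both assumptions), so it does not reproduce the exact draws of the cell above.

```python
import numpy as np

rng = np.random.default_rng(13)  # separate generator; not the notebook's random stream
edges = np.array([0, 20, 30, 55, 88, 100])
probs = np.linspace(0.1, 0.9, 5)

l_chk = rng.uniform(0, 100, 50_000)
# Index of the interval containing each value: (lo, hi], lowest interval includes 0
idx = np.digitize(l_chk, edges[1:-1], right=True)
# One Bernoulli draw per observation, with the probability of its interval
y_chk = rng.binomial(1, probs[idx])

# Empirical P(Y = 1 | interval) next to the target probabilities
rates = np.array([y_chk[idx == k].mean() for k in range(len(probs))])
for k, (target, rate) in enumerate(zip(probs, rates)):
    print(f"interval {k}: target={target:.1f}, empirical={rate:.3f}")
```

With several thousand observations in every interval, each empirical rate should fall close to its target.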

In [None]:
# Visualize the conditional probability of a positive outcome given the sensitive variable L (stored in `s`)
s_bins, y_s_proba = get_conditional_positive_y_proba(s, y)
plot_conditional_proba(s_bins, y_s_proba, 'L')

## FairGroups Partition of $L$

FairGroups is an algorithm that aims to create groups with similar positive outcome rates while maintaining reasonable group sizes. Let's see how it performs on our synthetic data.

In [None]:
# Initialize and fit the FairGroups algorithm
fair_groups = FairGroups(nb_groups)
fair_groups.fit(s, y)

In [None]:
# Display the positive outcome rates (phi) for each group
fair_groups.phi_by_group

In [None]:
# Display confidence intervals for the positive outcome rates
fair_groups.phi_by_group_ci
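
This notebook does not show how `phi_by_group_ci` is computed. As a point of comparison, the sketch below builds intervals for $\Phi$ with a generic percentile bootstrap. The function `bootstrap_phi_ci` and its parameters are hypothetical, and the method may differ from the one used by the library.

```python
import numpy as np

def bootstrap_phi_ci(y, group, n_groups, n_boot=500, alpha=0.05, seed=0):
    """Percentile bootstrap interval for P(Y=1 | group k) - P(Y=1), per group."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    group = np.asarray(group)
    n = len(y)
    draws = np.empty((n_boot, n_groups))
    for b in range(n_boot):
        i = rng.integers(0, n, n)  # resample observations with replacement
        yb, gb = y[i], group[i]
        counts = np.bincount(gb, minlength=n_groups)
        positives = np.bincount(gb, weights=yb, minlength=n_groups)
        # A group missing from a resample yields NaN and is skipped by nanpercentile
        with np.errstate(invalid="ignore", divide="ignore"):
            draws[b] = positives / counts - yb.mean()
    lower = np.nanpercentile(draws, 100 * alpha / 2, axis=0)
    upper = np.nanpercentile(draws, 100 * (1 - alpha / 2), axis=0)
    return np.column_stack([lower, upper])

# Illustrative data: two groups with positive rates 0.2 and 0.8
rng = np.random.default_rng(1)
g_demo = rng.integers(0, 2, 2000)
y_demo = rng.binomial(1, np.where(g_demo == 0, 0.2, 0.8))
ci = bootstrap_phi_ci(y_demo, g_demo, n_groups=2)
print(ci)  # one [lower, upper] row per group
```

Resampling whole observations keeps the dependence between the group rate and the overall rate, which a separate interval for each term would ignore.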

In [None]:
# Visualize the partition and group-wise positive outcome rates
plot_partition(fair_groups.partition, fair_groups.phi_by_group, 'L')

In [None]:
# Visualize the partition with confidence intervals
plot_partition_with_ci(fair_groups.partition, fair_groups.phi_by_group_ci, 'L')

## FairKMeans Partition of $L$

FairKMeans is an alternative algorithm that uses a k-means-like approach to create fair groups. It aims to minimize the variance in positive outcome rates while maintaining reasonable group sizes.

In [None]:
# Initialize and fit the FairKMeans algorithm
fair_kmeans = FairKMeans(nb_groups)
fair_kmeans.fit(s, y)

In [None]:
# Display the positive outcome rates (phi) for each group
fair_kmeans.phi_by_group

In [None]:
# Visualize the partition and group-wise positive outcome rates
plot_partition(fair_kmeans.partition, fair_kmeans.phi_by_group, 'L')

In [None]:
# Visualize the partition with confidence intervals
plot_partition_with_ci(fair_kmeans.partition, fair_kmeans.phi_by_group_ci, 'L')
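
Since the ground-truth boundaries are known, the two fitted partitions can also be compared numerically. The sketch below assumes that a fitted `partition` attribute is an array of $K+1$ boundaries in the same format as `gt_partition`, which this notebook does not confirm. The helper `boundary_errors` is hypothetical, and the example boundaries are placeholders, not algorithm output.

```python
import numpy as np

def boundary_errors(estimated, truth):
    """Absolute gap between each estimated boundary and the matching true one."""
    estimated = np.sort(np.asarray(estimated, dtype=float))
    truth = np.sort(np.asarray(truth, dtype=float))
    if estimated.shape != truth.shape:
        raise ValueError("partitions must have the same number of boundaries")
    return np.abs(estimated - truth)

gt_partition = np.array([0, 20, 30, 55, 88, 100])
# Placeholder boundaries for illustration only; NOT output from FairGroups or FairKMeans
example_partition = np.array([0, 19.5, 31.0, 55.0, 87.0, 100])
errors = boundary_errors(example_partition, gt_partition)
print(errors, errors.max())
```

Under the stated assumption, replacing `example_partition` with `fair_groups.partition` or `fair_kmeans.partition` would give a per-boundary error for each method.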