# Data Generation

You’ve been tasked by a multinational company to implement a scalable automated application screening process to select potential employees from a large pool of applicants.

Here we simulate unit record data for the following features:

* years-of-experience - a measure of suitability for a role that, for societal reasons, also depends on gender.

* is_male - a flag indicating gender.

* was_hired - a *proxy* label for whether a candidate is suitable based on historical outcomes.

For the data to have realistic and interesting properties in the analysis, we assume that:

* suitability and gender are statistically independent (there is no inherent difference in suitability across genders).

* we don't have access to suitability itself, only to its proxies.

* experience conflates suitability and gender, representing gender-dependent feature distributions.

* the was_hired targets also conflate suitability and gender, representing historical labeling bias.

We begin by defining our configuration parameters:


In [1]:
n = 10000               # number of unit records 
frac_male = 0.65        # fraction that are male
frac_hired = 0.2        # fraction that were hired (historically)
seed = 0                # reproducible results 

We then simulate (draw) one particular table of unit records:

In [2]:
# Simulate unit record data
import numpy as np
from scipy.stats import norm
import pandas as pd
np.random.seed(seed)
show_rows = np.sort(np.random.permutation(n)[:5].astype(int))

# Gender is male with probability frac_male 
is_male = np.random.rand(n) < frac_male

# The latent suitability is independently random
_suitability = np.random.rand(n)

# Experience encodes a weighted combination of gender,
# suitability, and noise, seen through a non-linear transform.
# The weights control the degree of conflation.
_exp = .6 * is_male + 2. * _suitability + .5 * np.random.randn(n)
_exp = (_exp - _exp.mean()) / _exp.std()  # Normalise
experience = 20. * norm.cdf(0.5*_exp)  # Transform

# The label is a similarly transformed, weighted combination.
# This time we give a smaller weight to gender, and the transform
# is a thresholding that selects the top frac_hired of candidates.
_label = 0.05 * is_male + 1. * _suitability + 0.1 * np.random.randn(n) 
threshold = np.sort(_label)[int(n * (1 - frac_hired))]
label = _label >= threshold  # Selection

# Pack the data into a tabular format:
data = pd.DataFrame()
data['Experience'] = np.round(experience,0).astype(int)
data['Gender'] = np.array(['Female', 'Male'])[is_male.astype(int)]
data["Hired"] = np.array(['No', 'Yes'])[label.astype(int)]

# Display a preview of the data:
display(data.loc[show_rows])

# Save the full data to disk:
data.to_csv("unit_records.csv", index=False)

Row,Experience,Gender,Hired
898,14,Male,No
2343,6,Male,No
2398,16,Female,Yes
5906,16,Male,Yes
9394,15,Male,No
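
To see how experience conflates suitability and gender, it can help to compare group means. The following is a minimal sketch, not part of the original notebook: it re-draws the experience feature with the same weights as the cell above, but the helper name `simulate_experience`, the use of `np.random.default_rng`, and the extra `Suitability` column are assumptions made for illustration, so the exact numbers will differ from the saved table.

```python
import numpy as np
import pandas as pd
from scipy.stats import norm

def simulate_experience(n=10000, frac_male=0.65, seed=0):
    """Hypothetical helper: re-draw gender, suitability and experience."""
    rng = np.random.default_rng(seed)  # assumption: Generator API, not np.random.seed
    is_male = rng.random(n) < frac_male
    suitability = rng.random(n)        # latent, independent of gender
    # Same weights as the notebook cell: gender, suitability, noise.
    raw = 0.6 * is_male + 2.0 * suitability + 0.5 * rng.standard_normal(n)
    z = (raw - raw.mean()) / raw.std()         # normalise
    experience = 20.0 * norm.cdf(0.5 * z)      # squash into (0, 20) years
    return pd.DataFrame({
        "Experience": experience,
        "Gender": np.where(is_male, "Male", "Female"),
        "Suitability": suitability,
    })

sim = simulate_experience()
# Mean experience should differ by gender; mean suitability should not.
group_means = sim.groupby("Gender")[["Experience", "Suitability"]].mean()
print(group_means)
```

If the assumptions hold, mean suitability should be close to 0.5 for both groups, while mean experience should be visibly higher for males, which is the gender-dependent feature distribution described earlier.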


Before we run the scenario, let's verify the base rates in the generated data:

In [3]:
print("Male base rate:   ", int(100*label[is_male].mean() + 0.5), "%")
print("Female base rate: ", int(100*label[~is_male].mean() + 0.5), "%")
print("Overall base rate:", int(100*label.mean() + 0.5), "%")

Male base rate:    22 %
Female base rate:  17 %
Overall base rate: 20 %


These base rates indicate a subtle labeling bias (22% of males have been historically selected vs 17% of females).
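
The same comparison can be made directly from a table with `Gender` and `Hired` columns, such as the one saved to `unit_records.csv`. The sketch below is illustrative only: the helper `hire_rates_by_gender` is a hypothetical name, and the tiny `toy` table is hand-made so the arithmetic is easy to follow; it is not the simulated data.

```python
import pandas as pd

def hire_rates_by_gender(df: pd.DataFrame) -> pd.Series:
    """Fraction of 'Yes' in the Hired column for each Gender value."""
    return (df["Hired"] == "Yes").groupby(df["Gender"]).mean()

# Tiny hand-made table, for illustration only (not the simulated data).
toy = pd.DataFrame({
    "Gender": ["Male", "Male", "Male", "Male",
               "Female", "Female", "Female", "Female"],
    "Hired":  ["Yes", "Yes", "No", "No",
               "Yes", "No", "No", "No"],
})

rates = hire_rates_by_gender(toy)
gap = rates["Male"] - rates["Female"]    # difference in base rates
ratio = rates["Female"] / rates["Male"]  # ratio of base rates
print(rates)
print(f"gap = {gap:.2f}, ratio = {ratio:.2f}")
```

On the toy table, 2 of 4 males and 1 of 4 females are hired, so the gap is 0.25 and the ratio is 0.5. Applied to the simulated records, passing `pd.read_csv("unit_records.csv")` to the helper should reproduce the per-gender base rates printed above.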