
## To induce SP in rates

The Berkeley admissions version of SP is what we will call rate-based. This requires two grouping variables and a binary variable:

 - a binary `decision` variable whose rate flips (e.g. admission decision)
 - one `protected class` grouping variable across which the rates are compared, e.g. gender (likely 2 levels, can possibly generalize)
 - one `explanatory` grouping variable, e.g. department, which has class imbalance
 
An open problem is to determine whether we can define a mixed case where continuous variables are used to define the decision variable.




In [1]:
import numpy as np
import pandas as pd
import string
import matplotlib.pylab as plt


The causal explanation for rate-based SP is that the application rates by gender varied in a way that was correlated with the department size and/or acceptance rate. Here we create a model that will, for most samples, have SP for some departments. We set only the proportion of applications to each department, the rate of each gender applying to each department, and each department's acceptance rate.

In [2]:
p_dept = [.15,.15,.1,.6]
p_gender_dept = [[.7, .3],[.8,.2],[.85,.15],[.2,.8]]
gender_list = ['F','M']
p_admit_dept = [[.15,.85], [.18,.82], [.25,.75],[.3,.7]]
# need to have higher accept in the larger subgroup, larger subgroup should have opposite protected class balance
N = 1000

d = np.random.choice(list(range(len(p_dept))), size=N, p =p_dept)
g = [np.random.choice(gender_list, p =p_gender_dept[d_i]) for d_i in d]
a = [np.random.choice([1,0],p = p_admit_dept[d_i]) for d_i in d]
data = [[d_i,g_i,a_i] for d_i, g_i,a_i in zip(d,g,a)]

df = pd.DataFrame(data = data, columns=['department','gender','decision'])

p_race_dept = [[.8,.1,.1], [.7,.13,.17],[.5,.2,.3],[.85,.07,.08]]
race_list = ['W','B','H']

r = [np.random.choice(race_list, p =p_race_dept[d_i]) for d_i in d]
df['race'] = r
df.head()

Unnamed: 0,department,gender,decision,race
0,2,F,0,H
1,1,F,1,W
2,2,F,1,H
3,2,F,1,W
4,3,F,0,W


We can check that the probabilities match what we set.  First the per-department admission rate.

In [3]:
# select the numeric column explicitly; gender and race are strings
actual_admit_dept = df.groupby('department')[['decision']].mean()
expected_admit_dept = [p[0] for p in p_admit_dept]
actual_admit_dept['expected'] = expected_admit_dept
actual_admit_dept

Unnamed: 0_level_0,decision,expected
department,Unnamed: 1_level_1,Unnamed: 2_level_1
0,0.152318,0.15
1,0.263158,0.18
2,0.240964,0.25
3,0.291531,0.3


In [4]:
actual_app_dept = df.groupby(['department'])['gender'].value_counts().unstack()
actual_app_dept

gender,F,M
department,Unnamed: 1_level_1,Unnamed: 2_level_1
0,107,44
1,127,25
2,68,15
3,118,496


In [5]:
expected_app_dept_dat = [[p_d[0]*n_d, p_d[1]*n_d] for p_d,n_d in zip(p_gender_dept,[p*N for p in p_dept])]
expected_app_dept = pd.DataFrame(data = expected_app_dept_dat, columns = gender_list)
expected_app_dept

Unnamed: 0,F,M
0,105.0,45.0
1,120.0,30.0
2,85.0,15.0
3,120.0,480.0


Next we can look for SP by comparing the overall admission rate by gender with the per-department rates.

In [6]:
df.groupby('gender')['decision'].mean()

gender
F    0.238095
M    0.279310
Name: decision, dtype: float64

In [7]:
df.groupby(['gender','department'])[['decision']].mean().unstack()

Unnamed: 0_level_0,decision,decision,decision,decision
department,0,1,2,3
gender,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
F,0.158879,0.267717,0.25,0.271186
M,0.136364,0.24,0.2,0.296371


To detect SP, we compare which group has the highest rate overall to which group has the highest rate in each of the departments.

In [8]:
overall_dat = df.groupby('gender')['decision'].mean()
per_dept_dat = df.groupby(['gender','department'])[['decision']].mean().unstack()

# F/M admission ratio in each department, relative to the overall F/M ratio
overall_ratio = overall_dat.values[0]/overall_dat.values[1]
per_dept_dat.values[0]/per_dept_dat.values[1] / overall_ratio

array([1.36679343, 1.30857996, 1.46637931, 1.07341738])

In [10]:
df.to_csv('synthetic_admission_high_sp.csv',index=False)

In [11]:
# now detect and compare the direction of the trends
overall = df.groupby('gender')['decision'].mean().idxmax()
print('overall more admitted: ', overall)

per_dept = [gender_list[g] for g in np.argmax(df.groupby(['gender','department'])['decision'].mean().unstack().values,axis=0)]
print('per dept more admitted:', per_dept)
# it's SP for each dept whose top group is not the same as the overall top group
[not(dept==overall) for dept in per_dept]

overall more admitted:  M
per dept more admitted: ['F', 'F', 'F', 'M']


[True, True, True, False]
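
The comparison above can be wrapped in a small helper. This is only a sketch: the function name `detect_rate_sp` and the hand-built `toy` data are illustrative assumptions, not part of the notebook. It flags every level of the explanatory variable whose top group differs from the overall top group.

```python
import pandas as pd


def detect_rate_sp(df, group_col, explanatory_col, decision_col):
    """Flag each level of explanatory_col whose top group differs from the overall top group.

    Hypothetical helper, written for illustration only.
    """
    # group with the highest rate when the explanatory variable is ignored
    overall_top = df.groupby(group_col)[decision_col].mean().idxmax()
    # rows: explanatory levels, columns: group levels
    rates = df.groupby([explanatory_col, group_col])[decision_col].mean().unstack()
    per_level_top = rates.idxmax(axis=1)
    return per_level_top != overall_top


def make_rows(dept, gender, n_admit, n_reject):
    # one row per applicant: (department, gender, decision)
    return [(dept, gender, 1)] * n_admit + [(dept, gender, 0)] * n_reject


# Toy data: F has the higher rate in both departments, M has the higher rate overall
rows = (make_rows(0, 'F', 2, 6) + make_rows(0, 'M', 0, 2)
        + make_rows(1, 'F', 2, 0) + make_rows(1, 'M', 6, 2))
toy = pd.DataFrame(rows, columns=['department', 'gender', 'decision'])

sp_flags = detect_rate_sp(toy, 'gender', 'department', 'decision')
print(sp_flags)
```

In the toy data F is admitted at 4/10 overall and M at 6/10, while F leads in both departments (2/8 vs 0/2 and 2/2 vs 6/8), so both departments are flagged.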

Now we can run it many more times and count how many departments show SP in each trial, to see how reliable it is.

In [12]:
sp_occ = []

# the settings are identical in every trial, so define them once
p_dept = [.15,.15,.1,.6]
p_gender_dept = [[.7, .3],[.8,.2],[.85,.15],[.2,.8]]
gender_list = ['F','M']
p_admit_dept = [[.15,.85], [.18,.82], [.25,.75],[.3,.7]]
# need to have higher accept in the larger subgroup, larger subgroup should have opposite protected class balance
N = 1000

for i in range(20):
    d = np.random.choice(list(range(len(p_dept))), size=N, p =p_dept)
    g = [np.random.choice(gender_list, p =p_gender_dept[d_i]) for d_i in d]
    a = [np.random.choice([1,0],p = p_admit_dept[d_i]) for d_i in d]
    data = [[d_i,g_i,a_i] for d_i, g_i,a_i in zip(d,g,a)]

    df = pd.DataFrame(data = data, columns=['department','gender','decision'])

    overall = df.groupby('gender')['decision'].mean().idxmax()
    per_dept = [gender_list[g] for g in np.argmax(df.groupby(['gender','department'])['decision'].mean().unstack().values,axis=0)]

    # number of departments whose top group differs from the overall top group
    sp_occ.append(sum([not(dept==overall) for dept in per_dept]))
    
sp_occ

[2, 4, 2, 1, 1, 1, 3, 0, 2, 1, 3, 2, 3, 4, 1, 0, 0, 2, 1, 2]

We see from above that this generates SP in most trials, but in a variable number of departments for the exact same settings. So this sampling method does not reliably induce SP; it does show, however, that SP can occur with no ill intent.
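
One way to see why the count varies is to compute the rates the settings imply in expectation. The sketch below reuses the probabilities from the cells above; the array names `joint` and `expected_overall` are introduced here for illustration. Within each department both genders share the same admission probability, so which gender comes out ahead in a given department is sampling noise, while the overall expected rates differ because of where each gender applies.

```python
import numpy as np

p_dept = np.array([.15, .15, .1, .6])                                  # P(dept)
p_gender_dept = np.array([[.7, .3], [.8, .2], [.85, .15], [.2, .8]])   # P(gender | dept), columns F, M
p_admit = np.array([.15, .18, .25, .3])                                # P(admit | dept), same for F and M

# P(dept, gender): one row per department, one column per gender
joint = p_dept[:, None] * p_gender_dept

# expected overall admission rate per gender: P(admit, gender) / P(gender)
expected_overall = (joint * p_admit[:, None]).sum(axis=0) / joint.sum(axis=0)
print(dict(zip(['F', 'M'], expected_overall.round(4))))
```

The expected overall rate is about 0.22 for F and about 0.28 for M, consistent with the sampled values above, even though the per-department probabilities are identical for the two genders.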

# intentional SP

To induce rate-based SP more reliably, we should sample less independently: here the admission probability depends on both department and gender.


In [13]:
p_dept = [.15,.2,.1,.55]
p_gender_dept = [[.7, .3],[.8,.2],[.85,.15],[.2,.8]]
gender_list = ['F','M']
p_admit_dept_gender = [{'F':.18,'M':.12},{'F':.17,'M':.1},
                       {'F':.30,'M':.27},{'F':.35,'M':.30}] 
# need to have higher accept in the larger subgroup, larger subgroup should have opposite protected class balance
N = 1000

d = np.random.choice(list(range(len(p_dept))), size=N, p =p_dept)
g = [np.random.choice(gender_list, p =p_gender_dept[d_i]) for d_i in d]
# look up each applicant's admission probability from their department and gender
p_admit = [p_admit_dept_gender[d_i][g_i] for d_i,g_i in zip(d,g)]
a = [np.random.choice([1,0], p = [p,1-p]) for p in p_admit]
data = [[d_i,g_i,a_i] for d_i, g_i,a_i in zip(d,g,a)]

df = pd.DataFrame(data = data, columns=['department','gender','decision'])
df.head()

Unnamed: 0,department,gender,decision
0,0,F,0
1,3,M,1
2,0,F,0
3,1,F,0
4,0,F,0


In [14]:
p_admit_dept_gender[1]['M']

0.1

In [15]:
df.groupby('gender')['decision'].mean()

gender
F    0.189066
M    0.270945
Name: decision, dtype: float64

In [16]:
df.groupby(['gender','department']).mean().unstack()

Unnamed: 0_level_0,decision,decision,decision,decision
department,0,1,2,3
gender,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
F,0.166667,0.126437,0.213115,0.3125
M,0.16,0.114286,0.277778,0.29476


In [17]:
overall_dat = df.groupby('gender')['decision'].mean()
per_dept_dat = df.groupby(['gender','department']).mean().unstack()

overall_dat.values[0]/overall_dat.values[1]
per_dept_dat.values[0]/per_dept_dat.values[1] /(overall_dat.values[0]/overall_dat.values[1])

array([1.49278039, 1.58543573, 1.09946947, 1.51931871])

In [18]:
overall = df.groupby('gender')['decision'].mean().idxmax()

per_dept = [gender_list[g] for g in np.argmax(df.groupby(['gender','department']).mean().unstack().values,axis=0)]


[not(dept==overall) for dept in per_dept]

[True, True, False, True]
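
As a sanity check on the settings themselves, the expected rates can be computed directly, without sampling. This is a sketch using the probabilities from In [13]; the array layout (columns ordered F, M) and the names `joint` and `expected_overall` are assumptions made for illustration.

```python
import numpy as np

p_dept = np.array([.15, .2, .1, .55])                                  # P(dept)
p_gender_dept = np.array([[.7, .3], [.8, .2], [.85, .15], [.2, .8]])   # P(gender | dept), columns F, M
p_admit_dept_gender = np.array([[.18, .12], [.17, .10],
                                [.30, .27], [.35, .30]])               # P(admit | dept, gender)

# P(dept, gender)
joint = p_dept[:, None] * p_gender_dept

# expected overall admission rate per gender: P(admit, gender) / P(gender)
expected_overall = (joint * p_admit_dept_gender).sum(axis=0) / joint.sum(axis=0)

# in expectation F has the higher rate in every department
f_higher_in_every_dept = bool((p_admit_dept_gender[:, 0] > p_admit_dept_gender[:, 1]).all())

print(dict(zip(['F', 'M'], expected_overall.round(4))), f_higher_in_every_dept)
```

In expectation F is ahead in every department while M is ahead overall (about 0.27 vs 0.24), so these settings produce SP by construction. A single sample can still miss a department, as department 2 does above, because only about 15 of the 1000 applicants are expected to be M in that department.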

We can also try a grouping variable with 3 categories and detect swaps in the ordering of the groups, maybe?

In [19]:

p_race_dept = [[.8,.1,.1], [.7,.13,.17],[.5,.2,.3],[.85,.07,.08]]
race_list = ['W','B','H']

r = [np.random.choice(race_list, p =p_race_dept[d_i]) for d_i in d]
df['race'] = r

In [20]:
df.groupby('race')['decision'].mean()

race
B    0.235955
H    0.196970
W    0.241335
Name: decision, dtype: float64

In [21]:
df.groupby(['race','department'])[['decision']].mean().unstack()

Unnamed: 0_level_0,decision,decision,decision,decision
department,0,1,2,3
race,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
B,0.133333,0.086957,0.266667,0.361111
H,0.066667,0.095238,0.173913,0.326923
W,0.179688,0.138889,0.243902,0.2897
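
One possible way to make the ordering-swap idea concrete is to rank the groups by rate overall and within each department, then flag the departments whose ranking differs. The sketch below types in the rates printed above rather than recomputing them; the helper name `ranking` and the `swapped` dictionary are hypothetical, and comparing full rankings is only one of several definitions one could choose.

```python
import pandas as pd

# rates copied from the outputs of In [20] and In [21]
overall = pd.Series({'B': 0.235955, 'H': 0.196970, 'W': 0.241335})
per_dept = pd.DataFrame(
    {0: [0.133333, 0.066667, 0.179688],
     1: [0.086957, 0.095238, 0.138889],
     2: [0.266667, 0.173913, 0.243902],
     3: [0.361111, 0.326923, 0.2897]},
    index=['B', 'H', 'W'])


def ranking(rates):
    # group labels ordered from highest to lowest rate
    return list(rates.sort_values(ascending=False).index)


overall_order = ranking(overall)

# a department is flagged when its group ordering differs from the overall ordering
swapped = {dept: ranking(per_dept[dept]) != overall_order for dept in per_dept.columns}
print(overall_order, swapped)
```

With these rates the overall ordering is W, B, H; department 0 matches it and departments 1, 2 and 3 do not.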


In [22]:
df.head()

Unnamed: 0,department,gender,decision,race
0,0,F,0,W
1,3,M,1,W
2,0,F,0,W
3,1,F,0,W
4,0,F,0,W


# Developing the detection framework

In [23]:
RESULTS_DF_HEADER = ['allCorr','attr1','attr2','reverseCorr','groupbyAttr','subgroup']
results_df = pd.DataFrame(columns=RESULTS_DF_HEADER)

In [24]:
results_df

Unnamed: 0,allCorr,attr1,attr2,reverseCorr,groupbyAttr,subgroup


In [25]:
mat = np.asarray([[4,5,6,7],[7,7,2,7],[5,8,3,9],[6,8,2,1]])
mat

array([[4, 5, 6, 7],
       [7, 7, 2, 7],
       [5, 8, 3, 9],
       [6, 8, 2, 1]])

In [38]:
regression_vars = ['A','B','C','D']
mat_df = pd.DataFrame(data=mat,columns = regression_vars)
mat_df

Unnamed: 0,A,B,C,D
0,4,5,6,7
1,7,7,2,7
2,5,8,3,9
3,6,8,2,1


In [47]:
triu_mat = np.triu(mat_df.corr())

In [49]:
triu_indices = np.triu_indices(4)
triu_vals = triu_mat[triu_indices]

In [52]:
pd.DataFrame(data=[[regression_vars[x],regression_vars[y],val] for x,y,val in zip(*triu_indices,triu_vals)])

Unnamed: 0,0,1,2
0,A,A,1.0
1,A,B,0.547723
2,A,C,-0.886593
3,A,D,-0.298142
4,B,B,1.0
5,B,C,-0.871602
6,B,D,-0.272166
7,C,C,1.0
8,C,D,0.35583
9,D,D,1.0
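
The table above includes each variable paired with itself (correlation 1.0). A possible variant, sketched here, passes `k=1` to `np.triu_indices` to skip the diagonal and names the columns after the first three entries of `RESULTS_DF_HEADER`; treating `allCorr` as the whole-dataset correlation is an assumption about how that header is meant to be used.

```python
import numpy as np
import pandas as pd

mat = np.asarray([[4, 5, 6, 7], [7, 7, 2, 7], [5, 8, 3, 9], [6, 8, 2, 1]])
regression_vars = ['A', 'B', 'C', 'D']
mat_df = pd.DataFrame(data=mat, columns=regression_vars)

corr = mat_df.corr().values

# k=1 starts one step above the diagonal, so self-pairs are excluded
rows, cols = np.triu_indices(len(regression_vars), k=1)

pairs_df = pd.DataFrame({
    'attr1': [regression_vars[i] for i in rows],
    'attr2': [regression_vars[j] for j in cols],
    'allCorr': corr[rows, cols],
})
print(pairs_df)
```

This yields one row per unordered pair of distinct variables, six rows for four variables.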


In [54]:
view = ('x1','x2','x3')
name = 'spectral'

In [58]:
'_'.join(['_'.join(view),name])

'x1_x2_x3_spectral'