# Fairness in Recidivism Risk Scores

adapted from [BPDM 2017 Tutorial by Caitlin Kuhlman et al](https://github.com/caitlinkuhlman/bpdmtutorial)

__Tools:__ Analysis will be done in Python, using a number of open source packages commonly used for data science tasks:
- __NumPy__: scientific computing. http://www.numpy.org/
- __Pandas__: data analysis and manipulation. http://pandas.pydata.org/
- __Scikit-learn__: machine learning. http://scikit-learn.org/stable/
- __Matplotlib__: plotting. https://matplotlib.org/

__Material:__ *Disclaimer*: The analysis presented here is directly inspired by the following references:

[1] ProPublica, *“Machine Bias,”* https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing, May 2016.

[2] A. Chouldechova. *"Fair prediction with disparate impact: A study of bias in recidivism prediction instruments."* arXiv preprint arXiv:1703.00056 (2017).

[3] F. P. Calmon, D. Wei, K. Natesan Ramamurthy, and K. R. Varshney, *“Optimized Data Pre-Processing for Discrimination Prevention,”* arXiv preprint arXiv:1704.03354 (2017).

In [44]:
%%writefile code/tools.py
import numpy as np
import pandas as pd
import scipy

import matplotlib.pyplot as plt
import seaborn as sns
import itertools
from sklearn.metrics import roc_curve
%matplotlib inline
sns.set(font_scale=1.5)

Overwriting code/tools.py


In [45]:
%%writefile code/loadcompas
df = pd.read_csv('data/compas-scores-two-years-clean.csv')

Writing code/loadcompas


Here is an explanation of the data:

* `age`: defendant's age
* `c_charge_degree`: degree charged (Misdemeanor or Felony)
* `race`: defendant's race
* `age_cat`: defendant's age quantized in "less than 25", "25-45", or "over 45"
* `score_text`: COMPAS score category: 'low' (1 to 4), 'medium' (5 to 7), and 'high' (8 to 10).
* `sex`: defendant's gender
* `priors_count`: number of prior charges
* `days_b_screening_arrest`: number of days between the charge date and the arrest at which the defendant was screened for a COMPAS score
* `decile_score`: COMPAS score from 1 to 10
* `is_recid`: if the defendant recidivized
* `two_year_recid`: if the defendant recidivized within two years
* `c_jail_in`: date defendant was imprisoned
* `c_jail_out`: date defendant was released from jail
* `length_of_stay`: length of jail stay

Next we look at the first few rows of the dataset.

In [46]:
df.head()

Unnamed: 0,id,age,c_charge_degree,race,age_cat,score_text,sex,priors_count,days_b_screening_arrest,decile_score,is_recid,two_year_recid,c_jail_in,c_jail_out,length_of_stay,score_text_quant,norm_score
1,3,34,F,African-American,25 - 45,Low,Male,0,-1.0,3,1,1,2013-01-26 03:45:27,2013-02-05 05:36:53,10,0,0.222222
2,4,24,F,African-American,Less than 25,Low,Male,4,-1.0,4,1,1,2013-04-13 04:58:34,2013-04-14 07:02:04,1,0,0.333333
4,8,41,F,Caucasian,25 - 45,Medium,Male,14,-1.0,6,1,1,2014-02-18 05:08:24,2014-02-24 12:18:30,6,1,0.555556
6,10,39,M,Caucasian,25 - 45,Low,Female,0,-1.0,1,0,0,2014-03-15 05:35:34,2014-03-18 04:28:46,2,0,0.0
7,14,27,F,Caucasian,25 - 45,Low,Male,0,-1.0,4,0,0,2013-11-25 06:31:06,2013-11-26 08:26:57,1,0,0.333333


Since we want to look at race, we first look at the number of defendants in each racial group.

In [6]:
df['race'].value_counts()

African-American    3175
Caucasian           2103
Hispanic             509
Other                343
Asian                 31
Native American       11
Name: race, dtype: int64

In [47]:
%%writefile code/filter
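# keep only the two largest groups; copy so later column assignments
# do not trigger a SettingWithCopyWarning
df = df.loc[df['race'].isin(['African-American','Caucasian'])].copy()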

Writing code/filter


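We can now look at the scores. We first quantize `score_text` into a binary label: 'Low' maps to 0, while 'Medium' and 'High' both map to 1.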

In [49]:
%%writefile code/quant.py
score_quantization = {'Low':0, 'Medium':1, 'High':1}
df['score_text_quant'] = df['score_text'].map(score_quantization)

Writing code/quant.py


In [50]:
%%writefile code/recidcorr
# Correlation between COMPAS score and 2-year recidivism

# measure with high-low score
print(df[['two_year_recid','score_text_quant']].corr())

# measure with decile_score
print(df[['two_year_recid','decile_score']].corr())

Writing code/recidcorr


The correlation is not that high. Let's measure the disparate impact of the quantized COMPAS score ($\leq 4$ is low, everything else is high) according to the EEOC rule that the rates of "high" scores for each protected group should be within 80% of each other. Of course, the interpretation here is not the same, but it's a good starting point.

reference: https://en.wikipedia.org/wiki/Disparate_impact#The_80.25_rule
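
As a rough illustration of the 80% rule, the sketch below computes the rate of "high" labels per group, the ratio of the lower rate to the higher one, and the percentage difference between the two rates. The data is a toy example invented for illustration (it is not COMPAS data), and the helper `high_score_rates` is a hypothetical name.

```python
import pandas as pd

def high_score_rates(frame, group_col, flag_col):
    """Fraction of each group whose binary flag equals 1."""
    return frame.groupby(group_col)[flag_col].mean()

# Toy data, invented for illustration only (not COMPAS values)
toy = pd.DataFrame({
    'group': ['A'] * 10 + ['B'] * 10,
    'high':  [1] * 6 + [0] * 4 + [1] * 3 + [0] * 7,
})

rates = high_score_rates(toy, 'group', 'high')
ratio = rates.min() / rates.max()               # the 80% rule compares this to 0.8
pct_diff = 100 * (rates['A'] / rates['B'] - 1)  # relative difference in percent

print(rates)
print('ratio of rates: %.2f, within 80%% rule: %s' % (ratio, ratio >= 0.8))
print('percentage difference: %.1f%%' % pct_diff)
```

In this toy case group A receives the "high" label twice as often as group B, so the ratio of rates is well below the 0.8 threshold.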

In [51]:
%%writefile code/scoremeans.py
# count defendants per (quantized score, race), then normalize each
# race column so that it holds fractions of that group
means_scores = df.groupby(['score_text_quant','race']).size().unstack()
means_scores = means_scores/means_scores.sum()
print(means_scores)
# compute disparate impact: fraction of each group labeled "high" (1)
AA_with_high_score_scores = means_scores.loc[1,'African-American']
C_with_high_score_scores = means_scores.loc[1,'Caucasian']

Writing code/scoremeans.py


In [52]:
%%writefile code/diff.py
percentage_diff_scores = 100*(__/__ -1)
print('Percentage difference: %f%%' %percentage_diff_scores)

Writing code/diff.py


In [53]:
%%writefile code/recidmeans.py
means_recid = df.groupby(['two_year_recid','race']).size().unstack()
means_recid = means_recid/means_recid.sum()
print(means_recid)
# compute disparate impact (here: relative difference in recidivism rates)
AA_with_high_score_recid = means_recid.loc[1,'African-American']
C_with_high_score_recid = means_recid.loc[1,'Caucasian']
percentage_diff_recid = 100*(AA_with_high_score_recid/C_with_high_score_recid -1)
print(percentage_diff_recid)


Writing code/recidmeans.py


There is a difference in recidivism rates between the two groups, but it is not as large as the difference in COMPAS scores.

Now let's measure the difference in scores when we consider both the COMPAS output and true recidivism.

We will consider a few different metrics. Further explanation can be found in Northpointe's response to the ProPublica article (https://assets.documentcloud.org/documents/2998391/ProPublica-Commentary-Final-070616.pdf) and in Alexandra Chouldechova’s paper (listed above). Both discuss error rates and calibration.
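
Calibration, in the sense used in these references, asks whether defendants with the same score reoffend at the same rate regardless of group. A minimal sketch of such a check on toy data (the values below are invented for illustration and the column name `group` is a stand-in, not a column of the COMPAS file):

```python
import pandas as pd

# Toy data, invented for illustration only (not COMPAS values)
toy = pd.DataFrame({
    'group':          ['A'] * 8 + ['B'] * 8,
    'decile_score':   [3, 3, 3, 3, 8, 8, 8, 8] * 2,
    'two_year_recid': [0, 0, 1, 0, 1, 1, 0, 1,   # group A
                       0, 1, 0, 0, 1, 0, 1, 1],  # group B
})

# Observed recidivism rate for each (score, group) cell
calibration = (toy.groupby(['decile_score', 'group'])['two_year_recid']
                  .mean()
                  .unstack())
print(calibration)
```

In this toy example both groups reoffend at the same rate within each score, so the score would be considered calibrated across groups. Applying the same `groupby` to `df` with `race` in place of `group` would give the corresponding table for the real data.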

In [54]:
%%writefile code/normalize.py
# normalize decile score
max_score = df['decile_score'].max()
min_score = df['decile_score'].min()
df['norm_score'] = (df['decile_score']-min_score)/(max_score-min_score)


plt.figure(figsize=[10,10])
#plot ROC curve for African-Americans
y = df.loc[df['race']=='African-American',['two_year_recid','norm_score']].values
fpr1,tpr1,thresh1 = roc_curve(y_true = y[:,0],y_score=y[:,1])
plt.plot(fpr1,tpr1)

#plot ROC curve for Caucasian
y = df.loc[df['race']=='Caucasian',['two_year_recid','norm_score']].values
fpr2,tpr2,thresh2 = roc_curve(y_true = y[:,0],y_score=y[:,1])
plt.plot(fpr2,tpr2)
l = np.linspace(0,1,10)
plt.plot(l,l,'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC')
plt.legend(['African-American','Caucasian'])

Writing code/normalize.py


For each group, each point on the ROC curve corresponds to a $$(\mbox{false positive rate, true positive rate})$$ pair for a given threshold. In order to capture the difference in error rates, we map the points to $$\left(\frac{\mbox{false positive rate Afr.-American}}{\mbox{false positive rate Cauc.}},s \right)$$
and similarly for *true positive* rates for different thresholds s.
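
To make the mapping concrete, the sketch below computes the false positive rate and true positive rate of each group at a shared threshold and then takes their ratios. The labels and scores are toy values invented for illustration, and `error_rates` is a hypothetical helper, not part of the original notebook.

```python
import numpy as np

def error_rates(y_true, score, threshold):
    """Return (FPR, TPR) when predicting positive for score >= threshold."""
    y_true = np.asarray(y_true).astype(bool)
    pred = np.asarray(score) >= threshold
    fpr = np.sum(pred & ~y_true) / np.sum(~y_true)
    tpr = np.sum(pred & y_true) / np.sum(y_true)
    return fpr, tpr

# Toy labels and scores, invented for illustration only
y_a, s_a = [0, 0, 0, 0, 1, 1, 1, 1], [2, 6, 7, 3, 8, 9, 4, 6]
y_b, s_b = [0, 0, 0, 0, 1, 1, 1, 1], [1, 2, 6, 3, 8, 5, 4, 7]

ratios = {}
for s in (5, 6):
    fpr_a, tpr_a = error_rates(y_a, s_a, s)
    fpr_b, tpr_b = error_rates(y_b, s_b, s)
    ratios[s] = (fpr_a / fpr_b, tpr_a / tpr_b)
    print('threshold %d: FPR ratio %.2f, TPR ratio %.2f' % ((s,) + ratios[s]))
```

Evaluating both groups at the same explicit thresholds is a deliberate choice in this sketch: `roc_curve` is run separately per group, and its threshold arrays may not line up element by element if the groups have different sets of distinct scores.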

In [55]:
%%writefile code/fpr.py
fpr_ratio = fpr1[1:]/fpr2[1:]
tpr_ratio = (tpr1[1:])/(tpr2[1:])
plt.figure(figsize=[10,10])
plt.plot(thresh1[1:],fpr_ratio)
plt.plot(thresh1[1:],tpr_ratio)
plt.xlabel('Normalized score threshold')
plt.ylabel('Ratio')

plt.legend(['False positive rate','True positive rate'])
plt.title('Ratio between African-American and Caucasian error rates\n for different score thresholds')

Writing code/fpr.py


The difference is once again stark. This graph is particularly concerning due to the significantly higher false positive rates for African Americans across all thresholds.

# What other differences are there?

In [56]:
%%writefile code/decilesdist.py
table = df.groupby(['race','decile_score']).size().reset_index().pivot(index='decile_score',columns='race',values=0)

# percentage of defendants in each score category
100*table/table.sum()

Writing code/decilesdist.py


In [57]:
%%writefile code/decileplot.py
# now in visual form
x = df.loc[df['race']=='African-American','decile_score'].values
y = df.loc[df['race']=='Caucasian','decile_score'].values
plt.figure(figsize=[10,8])
# density=True replaces the `normed` argument removed from recent Matplotlib
plt.hist([x,y],density=True)
plt.legend(['African-American','Caucasian'])
plt.title('COMPAS score distribution')
plt.xlabel('Score')
plt.ylabel('Fraction of population')

Writing code/decileplot.py


In [58]:
%%writefile code/priors_dist.py
df_2priors = df.loc[df['priors_count']>=2]
x = df_2priors.loc[df_2priors['race']=='African-American','decile_score'].values
y = df_2priors.loc[df_2priors['race']=='Caucasian','decile_score'].values
plt.figure(figsize=[12,7])
# density=True replaces the `normed` argument removed from recent Matplotlib
plt.hist([x,y],density=True)
plt.legend(['African-American','Caucasian'])
plt.title('COMPAS score distribution for defendants with 2 or more priors')
plt.xlabel('Score')
plt.ylabel('Fraction of population')

Writing code/priors_dist.py
