# Introduction to Data Science, Lab 4 (10/7)
- Understand COMPAS and ProPublica's data;
- Assess type I/II errors of COMPAS predictions.

The material in this notebook is based on the Responsible Data Science course taught by Julia Stoyanovich in Spring 2020. The data and its analysis are partly adapted from [(Larson et al., 2016)](https://www.propublica.org/article/how-we-analyzed-the-compas-recidivism-algorithm), published by ProPublica.
### *COMPAS*
COMPAS (Correctional Offender Management Profiling for Alternative Sanctions) is a software tool that assesses the risk of a defendant becoming a recidivist based on their profile information and a questionnaire. It has been used in jurisdictions across several states, including New York and California. COMPAS is often used as a pretrial assessment tool that suggests whether a defendant should be released or held in jail during the pretrial period.
### *Understanding Data*
The dataset was collected using records from Broward County, Florida. Originally, ProPublica received records of 18,610 people scored with COMPAS between 2013 and 2014. Records of defendants who were not assessed during their pretrial period (i.e., assessed at parole, probation, etc.) were then discarded. The authors of the dataset define recidivism as the occurrence of an arrest within two years of the COMPAS screening date, which the COMPAS developers suggest as the prediction horizon of their software. Traffic violations were not counted as instances of recidivism, nor were arrests associated with offenses that occurred prior to the COMPAS screening.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
data=pd.read_csv("compas-scores-two-years.csv")
print(data.shape,data.columns)

Columns represent profile characteristics of defendants. The prefix 'c' seemingly stands for *current*, so the corresponding feature describes the current offense. The prefixes 'r' and 'vr' stand for *recidivism* and *violent recidivism*, respectively (the authors define *violent recidivism* as murder, manslaughter, forcible rape, robbery, or aggravated assault).
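
To see which columns fall under each prefix, one option is to match column names against the prefix followed by an underscore. The sketch below uses a hypothetical handful of column names written in the style of the file; the helper `columns_with_prefix` is an illustrative name, not part of the original analysis.

```python
import pandas as pd

# Toy, empty frame: only the column names matter here (illustrative subset).
toy = pd.DataFrame(columns=["c_charge_degree", "c_jail_in", "r_charge_degree",
                            "vr_charge_degree", "age", "sex"])

def columns_with_prefix(df, prefix):
    """Return the column names that start with `prefix` followed by '_'."""
    return [col for col in df.columns if col.startswith(prefix + "_")]

current_cols = columns_with_prefix(toy, "c")    # current offense
recid_cols = columns_with_prefix(toy, "r")      # recidivism
violent_cols = columns_with_prefix(toy, "vr")   # violent recidivism
print(current_cols, recid_cols, violent_cols)
```

Requiring the underscore keeps 'r' from also matching unrelated names such as `race`.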

Let's focus on recidivism (rather than violent recidivism) and filter out features unnecessary for our analysis:

1. `age` - age (numeric);
2. `age_cat` - age bracket: less than 25, 25-45, or greater than 45 (categorical);
3. `sex` - sex: male or female (categorical);
4. `race` - race: african-american, caucasian, hispanic, asian, or other (categorical);
5. `c_charge_degree` - crime degree: misdemeanor (M), felony (F), or not causing jail time (O) (categorical);
6. `priors_count` - count of prior crimes committed (numeric);
7. `days_b_screening_arrest` - days between the arrest and COMPAS screening (numeric);
8. **`decile_score` - the COMPAS score predicted by the system: 1 to 10 (numeric);**
9. **`score_text` - category of decile score: low (1-4), medium (5-7), or high (8-10) (categorical);**
10. `is_recid` - indicator of recidivism after screening: 0,1 (categorical);
11. `two_year_recid` - indicator of recidivism within two years after screening (categorical);
12. `c_jail_in` - date of imprisonment (timestamp, read as a string);
13. `c_jail_out` - date of release (timestamp, read as a string).
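
The two jail dates are only useful numerically after parsing. Below is a minimal sketch, assuming the timestamps are stored as strings of the form `YYYY-MM-DD HH:MM:SS` (an assumption worth checking against the actual file); the toy values and the name `length_of_stay_days` are made up for illustration.

```python
import pandas as pd

# Toy timestamps in the assumed string format (not taken from the real data).
toy = pd.DataFrame({
    "c_jail_in":  ["2013-08-13 06:03:42", "2014-01-01 12:00:00"],
    "c_jail_out": ["2013-08-14 05:41:20", "2014-01-11 12:00:00"],
})

jail_in = pd.to_datetime(toy["c_jail_in"])
jail_out = pd.to_datetime(toy["c_jail_out"])

# Whole days spent in jail; stays shorter than 24 hours count as 0 days.
length_of_stay_days = (jail_out - jail_in).dt.days
print(length_of_stay_days.tolist())
```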

In [None]:
# Remove unnecessary columns:
to_save=['sex','age','age_cat','race','c_charge_degree','priors_count','days_b_screening_arrest','decile_score','score_text','is_recid','two_year_recid','c_jail_in','c_jail_out']
data=data[to_save].copy()

In [None]:
def column_type(col_name):
    return data.loc[:,col_name].apply(lambda x: type(x)).unique()
def column_range(col_name):
    return data.loc[:,col_name].unique()
def column_nan(col_name):
    return data.loc[:,col_name].isna().sum()
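
pandas also provides frame-level summaries that cover similar ground to the helpers above in a single call each. A small sketch on a toy stand-in for `data` (the values are made up):

```python
import numpy as np
import pandas as pd

# Toy stand-in for `data`; values are invented for illustration.
toy = pd.DataFrame({
    "age": [23, 41, 67, 35],
    "sex": ["Male", "Female", "Male", "Male"],
    "days_b_screening_arrest": [-1.0, np.nan, 0.0, 12.0],
})

dtypes = toy.dtypes          # storage type of each column
missing = toy.isna().sum()   # number of missing values per column
n_unique = toy.nunique()     # number of distinct non-missing values per column
print(pd.DataFrame({"dtype": dtypes, "missing": missing, "unique": n_unique}))
```

Note that `dtypes` reports the column's storage type (e.g. `object` for strings), whereas `column_type` above inspects the Python type of each individual value.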

In [None]:
# Understand the type of each feature:
for col in data.columns:
    print(col,":",column_type(col))

In [None]:
# Understand the range of each categorical feature:
categorical=["age_cat","race","sex","c_charge_degree","score_text","is_recid","two_year_recid"]
for col in categorical:
    print(col,":",column_range(col))

In [None]:
# Understand the missing data:
for col in data.columns:
    print(col,":",column_nan(col))

In [None]:
# Delete observations with missing data:
data=data.dropna()
for col in data.columns:
    print(col,":",column_nan(col))

In [None]:
# Delete observations with arrest/assessment mismatch
data=data[(data.days_b_screening_arrest<=30)&(data.days_b_screening_arrest>=-30)]
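
The same window filter can be written with `Series.between`, which is inclusive on both ends by default. A sketch on toy values (not from the real data) that also shows how boundary and missing values behave:

```python
import numpy as np
import pandas as pd

# Toy column with values outside, on, and inside the +/-30 day window.
toy = pd.DataFrame({"days_b_screening_arrest": [-45.0, -30.0, 0.0, 30.0, 31.0, np.nan]})

# Equivalent to (x >= -30) & (x <= 30); a missing value compares as False
# and is therefore dropped by the mask.
kept = toy[toy["days_b_screening_arrest"].between(-30, 30)]
print(kept["days_b_screening_arrest"].tolist())
```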

### *Visualizing Data*
First, understand the data distribution for different groups by race, sex, and age:

In [None]:
fig,axes=plt.subplots(nrows=1,ncols=2,figsize=(11,5))
data.loc[:,"age"].hist(ax=axes[0],bins=10)
axes[0].set_xlabel("age")
axes[0].set_ylabel("counts")
data["race"].value_counts().plot(kind='bar',ax=axes[1])
plt.xticks(rotation=45)
axes[1].set_ylabel("counts")
males=len(data[data.sex=="Male"])/len(data)
females=1-males
# Format the fractions as percentages:
print(f"male: {males:.1%}, female: {females:.1%}")

For different groups by race, sex, and age, plot the frequency of COMPAS scores:

In [None]:
# Show decile score histograms by sex:
fig,axes=plt.subplots(nrows=1,ncols=2,figsize=(11,5))
data_f=data[(data.sex=='Female')]
data_f["decile_score"].hist(ax=axes[0],label='total')
axes[0].set_xlabel("Decile Score (1-10)")
axes[0].set_ylabel("Frequency")
axes[0].set_title("Female Criminal Cases by Decile Score")
data_f[data_f.two_year_recid==1]["decile_score"].hist(ax=axes[0],label='two year recidivist',color='pink')
axes[0].legend()
data_m=data[(data.sex=='Male')]
data_m["decile_score"].hist(ax=axes[1],label='total')
axes[1].set_xlabel("Decile Score (1-10)")
axes[1].set_ylabel("Frequency")
axes[1].set_title("Male Criminal Cases by Decile Score")
data_m[data_m.two_year_recid==1]["decile_score"].hist(ax=axes[1],label='two year recidivist',color='pink')
axes[1].legend()

In [None]:
# Compute the proportion of recidivists among all defendants in each decile group:
proportions_f=[len(data_f[(data_f.two_year_recid==1)&(data_f.decile_score==i+1)])/len(data_f[data_f.decile_score==i+1]) for i in range(10)]
proportions_m=[len(data_m[(data_m.two_year_recid==1)&(data_m.decile_score==i+1)])/len(data_m[data_m.decile_score==i+1]) for i in range(10)]
plt.plot([0,10],[0,1],linestyle='dashed',color='k')
plt.plot(range(1,11),proportions_m,color='blue',label='male')
plt.plot(range(1,11),proportions_f,color='red',label='female')
plt.title("Proportion of recidivists by Decile Score")
plt.xlabel("Decile Score (1-10)")
plt.ylabel("proportion")
plt.legend()
plt.grid()
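
The per-decile proportion is the mean of the 0/1 recidivism indicator within each decile, so it can also be computed with a `groupby`. A sketch on toy data (made-up values), which additionally avoids the `ZeroDivisionError` that the list comprehension above would raise if some decile had no defendants:

```python
import pandas as pd

# Toy scores and outcomes; only deciles 1, 5 and 10 are populated.
toy = pd.DataFrame({
    "decile_score":   [1, 1, 1, 1, 5, 5, 10, 10],
    "two_year_recid": [0, 0, 0, 1, 0, 1, 1, 1],
})

# The mean of a 0/1 indicator is the share of ones, i.e. the recidivism rate.
rate_by_decile = toy.groupby("decile_score")["two_year_recid"].mean()

# Reindex to all ten deciles; empty deciles become NaN instead of raising.
rate_by_decile = rate_by_decile.reindex(range(1, 11))
print(rate_by_decile)
```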

In [None]:
fig,axes=plt.subplots(nrows=1,ncols=2,figsize=(11,5))
data_aa=data[(data.race=='African-American')]
data_aa["decile_score"].hist(ax=axes[0],label='total')
axes[0].set_xlabel("Decile Score (1-10)")
axes[0].set_ylabel("Frequency")
axes[0].set_title("African-American by Decile Score")
data_aa[data_aa.two_year_recid==1]["decile_score"].hist(ax=axes[0],label='two year recidivist',color='pink')
axes[0].legend()
data_w=data[(data.race=='Caucasian')]
data_w["decile_score"].hist(ax=axes[1],label='total')
axes[1].set_xlabel("Decile Score (1-10)")
axes[1].set_ylabel("Frequency")
axes[1].set_title("Caucasian by Decile Score")
data_w[data_w.two_year_recid==1]["decile_score"].hist(ax=axes[1],label='two year recidivist',color='pink')
axes[1].legend()

In [None]:
# Compute the proportion of recidivists among all defendants in each decile group:
proportions_aa=[len(data_aa[(data_aa.two_year_recid==1)&(data_aa.decile_score==i+1)])/len(data_aa[data_aa.decile_score==i+1]) for i in range(10)]
proportions_w=[len(data_w[(data_w.two_year_recid==1)&(data_w.decile_score==i+1)])/len(data_w[data_w.decile_score==i+1]) for i in range(10)]
plt.plot([0,10],[0,1],linestyle='dashed',color='k')
plt.plot(range(1,11),proportions_aa,color='blue',label='african-american')
plt.plot(range(1,11),proportions_w,color='red',label='caucasian')
plt.title("Proportion of recidivists by Decile Score")
plt.xlabel("Decile Score (1-10)")
plt.ylabel("proportion")
plt.legend()
plt.grid()

In [None]:
fig,axes=plt.subplots(nrows=1,ncols=2,figsize=(11,5))
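# value_counts() sorts by frequency, so reuse the category order of the totals
# to make sure overlaid bars refer to the same risk level:
counts_aa=data_aa["score_text"].value_counts()
counts_w=data_w["score_text"].value_counts()
counts_aa.plot(kind='bar',ax=axes[0],label="total")
counts_w.plot(kind='bar',ax=axes[1],label="total")
data_aa[data_aa.two_year_recid==1]["score_text"].value_counts().reindex(counts_aa.index,fill_value=0).plot(kind='bar',ax=axes[0],label="two year recidivist",color='pink')
data_w[data_w.two_year_recid==1]["score_text"].value_counts().reindex(counts_w.index,fill_value=0).plot(kind='bar',ax=axes[1],label="two year recidivist",color='pink')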
axes[0].set_title('COMPAS risk for African-American')
axes[1].set_title('COMPAS risk for Caucasians')
axes[0].legend()
axes[1].legend()

In [None]:
fig,axes=plt.subplots(nrows=1,ncols=3,figsize=(11,5))
data_1=data[(data.age_cat=='Less than 25')]
data_1["decile_score"].hist(ax=axes[0],label='total')
axes[0].set_xlabel("Decile Score (1-10)")
axes[0].set_ylabel("Frequency")
axes[0].set_title("Younger than 25 by Decile Score")
data_1[data_1.two_year_recid==1]["decile_score"].hist(ax=axes[0],label='two year recidivist',color='pink')
axes[0].legend()
data_2=data[(data.age_cat=='25 - 45')]
data_2["decile_score"].hist(ax=axes[1],label='total')
axes[1].set_xlabel("Decile Score (1-10)")
axes[1].set_ylabel("Frequency")
axes[1].set_title("Age 25-45 by Decile Score")
data_2[data_2.two_year_recid==1]["decile_score"].hist(ax=axes[1],label='two year recidivist',color='pink')
axes[1].legend()
data_3=data[(data.age_cat=='Greater than 45')]
data_3["decile_score"].hist(ax=axes[2],label='total')
axes[2].set_xlabel("Decile Score (1-10)")
axes[2].set_ylabel("Frequency")
axes[2].set_title("Older than 45 by Decile Score")
data_3[data_3.two_year_recid==1]["decile_score"].hist(ax=axes[2],label='two year recidivist',color='pink')
axes[2].legend()
plt.tight_layout()

In [None]:
# Compute the proportion of recidivists among all defendants in each decile group:
proportions_1=[len(data_1[(data_1.two_year_recid==1)&(data_1.decile_score==i+1)])/len(data_1[data_1.decile_score==i+1]) for i in range(10)]
proportions_2=[len(data_2[(data_2.two_year_recid==1)&(data_2.decile_score==i+1)])/len(data_2[data_2.decile_score==i+1]) for i in range(10)]
proportions_3=[len(data_3[(data_3.two_year_recid==1)&(data_3.decile_score==i+1)])/len(data_3[data_3.decile_score==i+1]) for i in range(10)]
plt.plot([0,10],[0,1],linestyle='dashed',color='k')
plt.plot(range(1,11),proportions_1,color='blue',label='younger than 25')
plt.plot(range(1,11),proportions_2,color='paleturquoise',label='25-45')
plt.plot(range(1,11),proportions_3,color='pink',label='older than 45')
plt.title("Proportion of recidivists by Decile Score")
plt.xlabel("Decile Score (1-10)")
plt.ylabel("proportion")
plt.legend()
plt.grid()

Exercise: what does the spike at decile 1 for defendants younger than 25 imply? Is it evidence that COMPAS predictions are too conservative or too liberal?
### *Assessing Type I/II Errors of COMPAS*
- *Type I errors* (false positives) are incorrect predictions of the positive class when the truth is negative (in hypothesis-testing terms, rejecting a true null hypothesis);
- *Type II errors* (false negatives) are incorrect predictions of the negative class when the truth is positive (failing to reject a false null hypothesis).
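
To make the two definitions concrete, the sketch below counts both error types on toy binary labels and predictions. The arrays are made up for illustration; 1 stands for "recidivated" in `y_true` and "predicted high risk" in `y_pred`.

```python
import numpy as np

# Toy ground truth: 1 = recidivated within two years, 0 = did not.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])
# Toy binary predictions: 1 = predicted high risk, 0 = predicted low risk.
y_pred = np.array([1, 0, 0, 0, 1, 1, 0, 1, 0, 1])

false_pos = int(np.sum((y_true == 0) & (y_pred == 1)))  # type I errors
false_neg = int(np.sum((y_true == 1) & (y_pred == 0)))  # type II errors
true_pos = int(np.sum((y_true == 1) & (y_pred == 1)))
true_neg = int(np.sum((y_true == 0) & (y_pred == 0)))
print(false_pos, false_neg, true_pos, true_neg)
```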

Here, the action space $\mathcal{A}=[10]=\{1,\dots,10\}$ of the model (the COMPAS decile score) does not coincide with the output space $\mathcal{Y}=\{0,1\}$ (the binary *two_year_recid* variable). We thus need a mapping $\mathcal{A}\rightarrow\mathcal{Y}$ to be able to compute the error rates. Consider the natural choice $a\mapsto \mathbb{1}_{a>5}$, i.e., split the scale in half; the threshold then falls inside the "medium" risk category (scores 5-7), between scores 5 and 6.
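
The mapping $a\mapsto \mathbb{1}_{a>5}$ amounts to a single comparison. A sketch on made-up scores; `predicted_recid` is an illustrative name:

```python
import pandas as pd

# Made-up decile scores spanning the 1-10 range.
scores = pd.Series([1, 4, 5, 6, 7, 10], name="decile_score")
cutoff = 5

# Scores strictly above the cutoff map to 1 (predicted recidivist), others to 0.
predicted_recid = (scores > cutoff).astype(int)
print(predicted_recid.tolist())
```

Note that a score of exactly 5 maps to 0 under this choice.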

Since the race, sex, and age distributions in the data are not uniform, absolute counts of type I/II errors are not comparable across groups. Instead, we want type I/II error *rates*: among all non-recidivists (recidivists) within a particular demographic group, what proportion is mistakenly predicted as risky (non-risky)?
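
These group-wise rates can be wrapped in one helper and applied per group. The following is a sketch under stated assumptions: `error_rates` is a hypothetical helper, and the toy frame with its `group` column is made up, standing in for a demographic attribute such as `race`.

```python
import pandas as pd

def error_rates(df, cutoff=5):
    """Return (type I rate, type II rate) for one group at the given cutoff."""
    predicted_high = df["decile_score"] > cutoff
    negatives = df["two_year_recid"] == 0   # true non-recidivists
    positives = df["two_year_recid"] == 1   # true recidivists
    type_1 = (negatives & predicted_high).sum() / negatives.sum()
    type_2 = (positives & ~predicted_high).sum() / positives.sum()
    return type_1, type_2

# Toy data with two groups of four defendants each.
toy = pd.DataFrame({
    "group":          ["a", "a", "a", "a", "b", "b", "b", "b"],
    "decile_score":   [8, 7, 7, 2, 6, 2, 4, 3],
    "two_year_recid": [0, 0, 1, 1, 0, 0, 1, 1],
})

rates = {name: error_rates(part) for name, part in toy.groupby("group")}
table = pd.DataFrame(rates, index=["type I", "type II"])
print(table)
```

The denominators are the non-recidivists (type I) and the recidivists (type II) within each group, not the group size.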

In [None]:
data_h=data[data.race=="Hispanic"]
data_aa=data[data.race=="African-American"]
data_w=data[data.race=="Caucasian"]

In [None]:
# Compute type I/II error rates within racial groups:
cutoff=5
type_1_aa=len(data_aa[(data_aa.two_year_recid==0)&(data_aa.decile_score>cutoff)])/len(data_aa[(data_aa.two_year_recid==0)])
type_2_aa=len(data_aa[(data_aa.two_year_recid==1)&(data_aa.decile_score<=cutoff)])/len(data_aa[(data_aa.two_year_recid==1)])
type_1_w=len(data_w[(data_w.two_year_recid==0)&(data_w.decile_score>cutoff)])/len(data_w[(data_w.two_year_recid==0)])
type_2_w=len(data_w[(data_w.two_year_recid==1)&(data_w.decile_score<=cutoff)])/len(data_w[(data_w.two_year_recid==1)])
type_1_h=len(data_h[(data_h.two_year_recid==0)&(data_h.decile_score>cutoff)])/len(data_h[(data_h.two_year_recid==0)])
type_2_h=len(data_h[(data_h.two_year_recid==1)&(data_h.decile_score<=cutoff)])/len(data_h[(data_h.two_year_recid==1)])
races=pd.DataFrame([[type_1_aa,type_2_aa],[type_1_w,type_2_w],[type_1_h,type_2_h]],index=["African-Americans","Caucasians","Hispanics"],columns=["type I","type II"])
display(races.T)

Note that the type II error rate is considerably lower for the African-American group than for Caucasians and Hispanics. In other words, a defendant who will become a recidivist within the next two years is much more likely to be predicted as low-risk if they are not African-American. Similarly, the type I error rate is much larger for African-Americans: a non-recidivist African-American defendant is more likely to be mistakenly predicted as high-risk than a non-recidivist defendant from either of the other two racial groups.

In [None]:
data_1=data[data.age_cat=="Less than 25"]
data_2=data[data.age_cat=="25 - 45"]
data_3=data[data.age_cat=="Greater than 45"]

In [None]:
# Compute type I/II error rates within age groups:
cutoff=5
type_1_1=len(data_1[(data_1.two_year_recid==0)&(data_1.decile_score>cutoff)])/len(data_1[(data_1.two_year_recid==0)])
type_2_1=len(data_1[(data_1.two_year_recid==1)&(data_1.decile_score<=cutoff)])/len(data_1[(data_1.two_year_recid==1)])
type_1_2=len(data_2[(data_2.two_year_recid==0)&(data_2.decile_score>cutoff)])/len(data_2[(data_2.two_year_recid==0)])
type_2_2=len(data_2[(data_2.two_year_recid==1)&(data_2.decile_score<=cutoff)])/len(data_2[(data_2.two_year_recid==1)])
type_1_3=len(data_3[(data_3.two_year_recid==0)&(data_3.decile_score>cutoff)])/len(data_3[(data_3.two_year_recid==0)])
type_2_3=len(data_3[(data_3.two_year_recid==1)&(data_3.decile_score<=cutoff)])/len(data_3[(data_3.two_year_recid==1)])
ages=pd.DataFrame([[type_1_1,type_2_1],[type_1_2,type_2_2],[type_1_3,type_2_3]],index=["Younger than 25","25-45","Older than 45"],columns=["type I","type II"])
display(ages.T)

Note that the type I error rate is the largest for defendants younger than 25; i.e., non-recidivist defendants in this group are more likely to be mistakenly predicted as high-risk than those in the other age brackets. Similarly, the type II error rate is the largest for defendants older than 45, suggesting that two-year recidivists from this age group are more likely to be treated as low-risk than those from other age groups.

In [None]:
data_f=data[data.sex=="Female"]
data_m=data[data.sex=="Male"]

In [None]:
# Compute type I/II error rates within sex groups:
cutoff=5
type_1_f=len(data_f[(data_f.two_year_recid==0)&(data_f.decile_score>cutoff)])/len(data_f[(data_f.two_year_recid==0)])
type_2_f=len(data_f[(data_f.two_year_recid==1)&(data_f.decile_score<=cutoff)])/len(data_f[(data_f.two_year_recid==1)])
type_1_m=len(data_m[(data_m.two_year_recid==0)&(data_m.decile_score>cutoff)])/len(data_m[(data_m.two_year_recid==0)])
type_2_m=len(data_m[(data_m.two_year_recid==1)&(data_m.decile_score<=cutoff)])/len(data_m[(data_m.two_year_recid==1)])
sex=pd.DataFrame([[type_1_f,type_2_f],[type_1_m,type_2_m]],index=["Female","Male"],columns=["type I","type II"])
display(sex.T)

Now, iterate through different cutoff (threshold) values to observe whether the racial groups with the largest type I/II error rates change with the threshold.

In [None]:
types_1_aa,types_2_aa,types_1_w,types_2_w,types_1_h,types_2_h=[],[],[],[],[],[]
for cutoff in range(2,10):
    types_1_aa.append(len(data_aa[(data_aa.two_year_recid==0)&(data_aa.decile_score>cutoff)])/len(data_aa[(data_aa.two_year_recid==0)]))
    types_2_aa.append(len(data_aa[(data_aa.two_year_recid==1)&(data_aa.decile_score<=cutoff)])/len(data_aa[(data_aa.two_year_recid==1)]))
    types_1_w.append(len(data_w[(data_w.two_year_recid==0)&(data_w.decile_score>cutoff)])/len(data_w[(data_w.two_year_recid==0)]))
    types_2_w.append(len(data_w[(data_w.two_year_recid==1)&(data_w.decile_score<=cutoff)])/len(data_w[(data_w.two_year_recid==1)]))
    types_1_h.append(len(data_h[(data_h.two_year_recid==0)&(data_h.decile_score>cutoff)])/len(data_h[(data_h.two_year_recid==0)]))
    types_2_h.append(len(data_h[(data_h.two_year_recid==1)&(data_h.decile_score<=cutoff)])/len(data_h[(data_h.two_year_recid==1)]))
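
The same sweep can be written more compactly by splitting each group into true non-recidivists and true recidivists once, then taking the mean of a boolean comparison per cutoff. A sketch for a single toy group (made-up values):

```python
import pandas as pd

# Toy group of ten defendants, one per decile score.
toy = pd.DataFrame({
    "decile_score":   [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "two_year_recid": [0, 0, 1, 0, 0, 1, 0, 1, 1, 1],
})
negatives = toy[toy.two_year_recid == 0]
positives = toy[toy.two_year_recid == 1]

cutoffs = list(range(2, 10))
# Share of non-recidivists scored above the cutoff (type I rate).
type_1 = [(negatives.decile_score > c).mean() for c in cutoffs]
# Share of recidivists scored at or below the cutoff (type II rate).
type_2 = [(positives.decile_score <= c).mean() for c in cutoffs]
print(pd.DataFrame({"type I": type_1, "type II": type_2}, index=cutoffs))
```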

In [None]:
fig,axes=plt.subplots(nrows=1,ncols=2,figsize=(11,5))
axes[0].plot(range(2,10),types_1_aa,label='african-american',color='blue')
axes[0].plot(range(2,10),types_1_w,label='caucasian',color='pink')
axes[0].plot(range(2,10),types_1_h,label='hispanic',color='paleturquoise')
axes[0].set_xlabel("cutoff value")
axes[0].set_ylabel("type I error rate")
axes[0].legend()
axes[0].grid()
axes[1].plot(range(2,10),types_2_aa,label='african-american',color='blue')
axes[1].plot(range(2,10),types_2_w,label='caucasian',color='pink')
axes[1].plot(range(2,10),types_2_h,label='hispanic',color='paleturquoise')
axes[1].set_xlabel("cutoff value")
axes[1].set_ylabel("type II error rate")
axes[1].legend()
axes[1].grid()

Interpret these graphs. Why does the type I (II) error rate decrease (increase) as the cutoff value grows?