# Preprocess the COMPAS Dataset
In this notebook, we preprocess the COMPAS dataset for binary classification with a neural network. The COMPAS data can be found here: https://github.com/propublica/compas-analysis. We follow much of ProPublica's preprocessing methodology, described here: https://www.propublica.org/article/how-we-analyzed-the-compas-recidivism-algorithm

In [0]:
import pandas as pd
import csv

In [0]:
# Run this cell to mount your Google Drive.
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Open and Explore COMPAS Dataset
We load the COMPAS dataset from `compas-scores-two-years.csv`, found in the following GitHub repo, made public by ProPublica: https://github.com/propublica/compas-analysis

In [0]:
df = pd.read_csv('/content/drive/My Drive/colab/fairness_checking/raw_data/compas-scores-two-years.csv')

In [0]:
df.head(5)

Unnamed: 0,id,name,first,last,compas_screening_date,sex,dob,age,age_cat,race,juv_fel_count,decile_score,juv_misd_count,juv_other_count,priors_count,days_b_screening_arrest,c_jail_in,c_jail_out,c_case_number,c_offense_date,c_arrest_date,c_days_from_compas,c_charge_degree,c_charge_desc,is_recid,r_case_number,r_charge_degree,r_days_from_arrest,r_offense_date,r_charge_desc,r_jail_in,r_jail_out,violent_recid,is_violent_recid,vr_case_number,vr_charge_degree,vr_offense_date,vr_charge_desc,type_of_assessment,decile_score.1,score_text,screening_date,v_type_of_assessment,v_decile_score,v_score_text,v_screening_date,in_custody,out_custody,priors_count.1,start,end,event,two_year_recid
0,1,miguel hernandez,miguel,hernandez,2013-08-14,Male,1947-04-18,69,Greater than 45,Other,0,1,0,0,0,-1.0,2013-08-13 06:03:42,2013-08-14 05:41:20,13011352CF10A,2013-08-13,,1.0,F,Aggravated Assault w/Firearm,0,,,,,,,,,0,,,,,Risk of Recidivism,1,Low,2013-08-14,Risk of Violence,1,Low,2013-08-14,2014-07-07,2014-07-14,0,0,327,0,0
1,3,kevon dixon,kevon,dixon,2013-01-27,Male,1982-01-22,34,25 - 45,African-American,0,3,0,0,0,-1.0,2013-01-26 03:45:27,2013-02-05 05:36:53,13001275CF10A,2013-01-26,,1.0,F,Felony Battery w/Prior Convict,1,13009779CF10A,(F3),,2013-07-05,Felony Battery (Dom Strang),,,,1,13009779CF10A,(F3),2013-07-05,Felony Battery (Dom Strang),Risk of Recidivism,3,Low,2013-01-27,Risk of Violence,1,Low,2013-01-27,2013-01-26,2013-02-05,0,9,159,1,1
2,4,ed philo,ed,philo,2013-04-14,Male,1991-05-14,24,Less than 25,African-American,0,4,0,1,4,-1.0,2013-04-13 04:58:34,2013-04-14 07:02:04,13005330CF10A,2013-04-13,,1.0,F,Possession of Cocaine,1,13011511MM10A,(M1),0.0,2013-06-16,Driving Under The Influence,2013-06-16,2013-06-16,,0,,,,,Risk of Recidivism,4,Low,2013-04-14,Risk of Violence,3,Low,2013-04-14,2013-06-16,2013-06-16,4,0,63,0,1
3,5,marcu brown,marcu,brown,2013-01-13,Male,1993-01-21,23,Less than 25,African-American,0,8,1,0,1,,,,13000570CF10A,2013-01-12,,1.0,F,Possession of Cannabis,0,,,,,,,,,0,,,,,Risk of Recidivism,8,High,2013-01-13,Risk of Violence,6,Medium,2013-01-13,,,1,0,1174,0,0
4,6,bouthy pierrelouis,bouthy,pierrelouis,2013-03-26,Male,1973-01-22,43,25 - 45,Other,0,1,0,0,2,,,,12014130CF10A,,2013-01-09,76.0,F,arrest case no charge,0,,,,,,,,,0,,,,,Risk of Recidivism,1,Low,2013-03-26,Risk of Violence,1,Low,2013-03-26,,,2,0,1102,0,0


In [0]:
len(df)

7214

In [0]:
df.columns.values

array(['id', 'name', 'first', 'last', 'compas_screening_date', 'sex',
       'dob', 'age', 'age_cat', 'race', 'juv_fel_count', 'decile_score',
       'juv_misd_count', 'juv_other_count', 'priors_count',
       'days_b_screening_arrest', 'c_jail_in', 'c_jail_out',
       'c_case_number', 'c_offense_date', 'c_arrest_date',
       'c_days_from_compas', 'c_charge_degree', 'c_charge_desc',
       'is_recid', 'r_case_number', 'r_charge_degree',
       'r_days_from_arrest', 'r_offense_date', 'r_charge_desc',
       'r_jail_in', 'r_jail_out', 'violent_recid', 'is_violent_recid',
       'vr_case_number', 'vr_charge_degree', 'vr_offense_date',
       'vr_charge_desc', 'type_of_assessment', 'decile_score.1',
       'score_text', 'screening_date', 'v_type_of_assessment',
       'v_decile_score', 'v_score_text', 'v_screening_date', 'in_custody',
       'out_custody', 'priors_count.1', 'start', 'end', 'event',
       'two_year_recid'], dtype=object)

Notice that we have 53 variables for each individual. We will use two_year_recid as the label for our binary classifier, so we will predict whether an individual recidivates (commits another crime) within two years.

## Preprocess Features
We try to follow ProPublica's procedure for preprocessing and classifying the data as closely as possible, as described in this article: https://www.propublica.org/article/how-we-analyzed-the-compas-recidivism-algorithm

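First, following ProPublica's procedure, we keep only the cases whose arrest or charge date falls *within* 30 days of the COMPAS assessment, and filter out all others. Note that the strict inequalities used below also drop cases that are exactly 30 days away, as well as rows where `days_b_screening_arrest` is missing.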

In ProPublica's words: "It was not always clear, however, which criminal case was associated with an individual’s COMPAS score. To match COMPAS scores with accompanying cases, we considered cases with arrest dates or charge dates within 30 days of a COMPAS assessment being conducted."

In [0]:
df = df[df['days_b_screening_arrest'] < 30]
df = df[df['days_b_screening_arrest'] > -30]

In [0]:
len(df)

6159
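As a minimal sketch of what the filter above keeps and drops, the following applies the same strict bounds to a toy column. The values are made up for illustration and do not come from the COMPAS data.

```python
import numpy as np
import pandas as pd

# Toy values chosen for illustration only (not COMPAS records).
toy = pd.DataFrame(
    {'days_b_screening_arrest': [-45.0, -30.0, -1.0, 0.0, 29.0, 30.0, np.nan]}
)

# Same strict bounds as the two filters above, combined into one mask.
days = toy['days_b_screening_arrest']
mask = (days > -30) & (days < 30)
kept = toy[mask]

# Comparisons against NaN are False, so the missing row is dropped as well.
print(kept['days_b_screening_arrest'].tolist())  # [-1.0, 0.0, 29.0]
```

The boundary values -30 and 30 are excluded, and so is the row with a missing value.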

Instead of using the 'score_text' variable like ProPublica did, in this preprocessing method we label each training sample *directly* with whether that individual recidivated within two years. This is recorded in the 'two_year_recid' column. If 'two_year_recid' = 1, the individual recidivated within two years. If 'two_year_recid' = 0, the individual did not.

In [0]:
df['two_year_recid'].head(5)

0    0
1    1
2    1
5    0
6    1
Name: two_year_recid, dtype: int64

Now, we try to choose columns that are similar to ProPublica's chosen columns. In their analysis, ProPublica used "race, age, criminal history, future recidivism, charge degree, gender and age." 

To match this, we choose 9 features:
- sex
- age
- race
- juv_fel_count
- juv_misd_count
- juv_other_count
- priors_count
- c_charge_degree
- score_text

We use 'two_year_recid' as the label.

We drop all other columns.

In [0]:
df = df[['sex', 
         'age', 
         'race', 
         'juv_fel_count', 
         'juv_misd_count', 
         'juv_other_count', 
         'priors_count', 
         'c_charge_degree', 
         'score_text',
         'two_year_recid']]

In [0]:
df.head(5)

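| | sex | age | race | juv_fel_count | juv_misd_count | juv_other_count | priors_count | c_charge_degree | score_text | two_year_recid |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Male | 69 | Other | 0 | 0 | 0 | 0 | F | Low | 0 |
| 1 | Male | 34 | African-American | 0 | 0 | 0 | 0 | F | Low | 1 |
| 2 | Male | 24 | African-American | 0 | 0 | 1 | 4 | F | Low | 1 |
| 5 | Male | 44 | Other | 0 | 0 | 0 | 0 | M | Low | 0 |
| 6 | Male | 41 | Caucasian | 0 | 0 | 0 | 14 | F | Medium | 1 |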


We do some final preprocessing so that sex, race, c_charge_degree, and score_text are encoded numerically. For sex, we label female as 0 and male as 1. For race (our sensitive variable), we label African-American as 1 and all other races as 0. For c_charge_degree, we split into two indicator columns: c_charge_degree_F and c_charge_degree_M. For score_text, we split into three indicator columns: score_text_High, score_text_Low, and score_text_Medium.

In [0]:
df['sex'] = df['sex'].map({'Female': 0, 'Male': 1}) # Replace sex: 0 = Female, 1 = Male (assigned back instead of inplace on a column selection)

In [0]:
def label_race(row):
    if row['race'] == 'African-American':
        return 1
    else:
        return 0

In [0]:
df['race'] = df.apply(label_race, axis=1) # Replace race: 1 = African-American, 0 = all other races
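The row-wise labeling above can also be written as a single vectorized comparison. The sketch below, on made-up toy rows, checks that both approaches give the same labels; the names `toy`, `row_wise`, and `vectorized` are illustrative only.

```python
import pandas as pd

# Toy rows for illustration only (not COMPAS records).
toy = pd.DataFrame(
    {'race': ['African-American', 'Caucasian', 'Other', 'African-American']}
)

def label_race(row):
    # Same rule as above: 1 for African-American, 0 otherwise.
    return 1 if row['race'] == 'African-American' else 0

# Row-wise version, as used in this notebook.
row_wise = toy.apply(label_race, axis=1)

# Vectorized alternative: compare the whole column at once, then cast to int.
vectorized = (toy['race'] == 'African-American').astype(int)

print(row_wise.tolist())    # [1, 0, 0, 1]
print(vectorized.tolist())  # [1, 0, 0, 1]
```

The vectorized form avoids calling a Python function once per row, which tends to matter more as the number of rows grows.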

In [0]:
df = pd.get_dummies(df, prefix=['c_charge_degree'], columns=['c_charge_degree'], dtype=int) # Replace c_charge_degree; dtype=int keeps 0/1 values on newer pandas versions, which otherwise return booleans

In [0]:
df = pd.get_dummies(df, prefix=['score_text'], columns=['score_text'], dtype=int) # Replace score_text; dtype=int keeps 0/1 values on newer pandas versions, which otherwise return booleans

In [0]:
df.head(5)

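| | sex | age | race | juv_fel_count | juv_misd_count | juv_other_count | priors_count | two_year_recid | c_charge_degree_F | c_charge_degree_M | score_text_High | score_text_Low | score_text_Medium |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 69 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 1 | 1 | 34 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 |
| 2 | 1 | 24 | 1 | 0 | 0 | 1 | 4 | 1 | 1 | 0 | 0 | 1 | 0 |
| 5 | 1 | 44 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
| 6 | 1 | 41 | 0 | 0 | 0 | 0 | 14 | 1 | 1 | 0 | 0 | 0 | 1 |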
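To illustrate how the one-hot encoding above produces its column names, here is a small sketch on made-up rows. It encodes both categorical columns in a single `pd.get_dummies` call, which is an alternative to the two separate calls used in this notebook.

```python
import pandas as pd

# Toy rows for illustration only (not COMPAS records).
toy = pd.DataFrame({
    'c_charge_degree': ['F', 'M', 'F'],
    'score_text': ['Low', 'High', 'Medium'],
})

# Each listed column is replaced by one indicator column per category,
# named <column>_<category>, with categories in sorted order.
encoded = pd.get_dummies(toy, columns=['c_charge_degree', 'score_text'], dtype=int)

print(list(encoded.columns))
# ['c_charge_degree_F', 'c_charge_degree_M',
#  'score_text_High', 'score_text_Low', 'score_text_Medium']
```

Within each group exactly one indicator is 1 per row, so one column per group is redundant. Whether to drop it (for example with `drop_first=True`) depends on the downstream model; this notebook keeps all of them.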


In [0]:
df = df[[col for col in df if col not in ['two_year_recid']] + ['two_year_recid']] # Move the label (two_year_recid) to the last col

In [0]:
df.head(5)

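| | sex | age | race | juv_fel_count | juv_misd_count | juv_other_count | priors_count | c_charge_degree_F | c_charge_degree_M | score_text_High | score_text_Low | score_text_Medium | two_year_recid |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 69 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| 1 | 1 | 34 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 |
| 2 | 1 | 24 | 1 | 0 | 0 | 1 | 4 | 1 | 0 | 0 | 1 | 0 | 1 |
| 5 | 1 | 44 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
| 6 | 1 | 41 | 0 | 0 | 0 | 0 | 14 | 1 | 0 | 0 | 0 | 1 | 1 |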
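Placing the label in the last column makes it easy to split features from the label by position later on. A minimal sketch on a made-up frame, where the names `toy`, `X`, and `y` are illustrative and not part of this notebook:

```python
import pandas as pd

# Toy frame for illustration only, with the label deliberately not last.
toy = pd.DataFrame({'sex': [1, 0], 'two_year_recid': [0, 1], 'age': [69, 34]})

label = 'two_year_recid'

# Same reordering idea as above: all non-label columns first, label last.
toy = toy[[col for col in toy.columns if col != label] + [label]]

# With the label last, a positional split separates features from the label.
X = toy.iloc[:, :-1]
y = toy.iloc[:, -1]

print(list(X.columns))  # ['sex', 'age']
print(y.name)           # two_year_recid
```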


In [0]:
len(df)

6159

## Save the Preprocessed Dataset to CSV
Our final preprocessed dataset has 6,159 entries, 12 feature columns (the 9 chosen features, with c_charge_degree and score_text one-hot encoded), and a binary label (two_year_recid). We save it to CSV and use it for binary classification in another notebook.

In [0]:
df.to_csv('/content/drive/My Drive/colab/fairness_checking/preprocessed_compas_data_two_years.csv')
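Because `to_csv` is called above without `index=False`, the row index is written as an unnamed first column. The sketch below shows what this means when reading the file back; it uses an in-memory buffer and a made-up frame rather than the actual Drive path.

```python
import io
import pandas as pd

# Toy frame for illustration only; the index mimics the gaps left by filtering.
toy = pd.DataFrame({'sex': [1, 0], 'two_year_recid': [0, 1]}, index=[0, 5])

buffer = io.StringIO()
toy.to_csv(buffer)  # the index is written as an unnamed first column

# Reading back naively turns the saved index into an extra column.
buffer.seek(0)
naive = pd.read_csv(buffer)
print(list(naive.columns))  # ['Unnamed: 0', 'sex', 'two_year_recid']

# Passing index_col=0 restores it as the index instead.
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)
print(list(restored.columns))  # ['sex', 'two_year_recid']
```

When loading the saved file in the classification notebook, reading with `index_col=0` is one way to keep the old index from being treated as a feature.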