<a href="https://colab.research.google.com/github/clairecoffey/project/blob/master/claire_project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fairness and the Bias/Variance Tradeoff 

## Claire Coffey

## May 2020

In this notebook we are studying bias and variance errors in the context of fairness, by exploring recidivism data. 

## Imports and Setup

Imports: first import the relevant libraries used throughout. 

In [0]:
# imports
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn
import numpy as np
import pandas as pd

# Read in recidivism data 

In this notebook we are studying recidivism data. We utilise the COMPAS recidivism dataset, which uses recidivism data from Broward County jail and has been explored in the following studies:

"The accuracy, fairness, and limits of predicting recidivism", paper available at:
https://advances.sciencemag.org/content/4/1/eaao5580#corresp-1

"Machine Bias" ProPublica article, available at:
https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing

The dataset used can be found at:
https://github.com/propublica/compas-analysis


Here we import and read in the recidivism data. Currently, we are using a selection of 500 samples from this dataset for our predictions.

We use a selection 
of fields from this dataset to predict recidivism classification (0 = will not reoffend; 1 = will reoffend).

In [0]:

def load_file():
  full_data = False
  print("loading data")
  if full_data:
    # full dataset
    file_path = "https://raw.githubusercontent.com/clairecoffey/project/master/mphilproject/compas-scores-two-years%20-%20compas-scores-two-years.csv?token=ABPC6VNTFTGQBANNUJY2O4C6XGJGY"
  else:
    # small subset of first 500/1000 people
    # file_path = "https://raw.githubusercontent.com/clairecoffey/project/master/mphilproject/500-compas-scores-two-years%20-%20Sheet1%20(1).csv?token=ABPC6VOW7CBEIIGZVE6ZJYS6YKNHO"
    file_path = "https://raw.githubusercontent.com/clairecoffey/project/master/mphilproject/1000-compas-scores-two-years%20-%20Sheet1.csv"

  # load CSV contents
  all_data = pd.read_csv(file_path, delimiter=',', dtype={'sex': 'category', 
                                                          'age_cat': 'category',
                                                          'race': 'category',
                                                          'c_charge_degree': 'category',
                                                          'c_charge_desc': 'category',
                                                          'r_charge_degree': 'category',
                                                          'r_charge_desc': 'category',
                                                          'vr_charge_degree': 'category',
                                                          'vr_charge_desc': 'category'
                                                          })
  return all_data


In [857]:
all_data = load_file()
all_data

loading data


Unnamed: 0,id,name,first,last,sex,dob,age,age_cat,race,juv_fel_count,juv_misd_count,juv_other_count,priors_count,days_b_screening_arrest,c_jail_in,c_jail_out,c_case_number,c_offense_date,c_arrest_date,c_charge_degree,c_charge_desc,is_recid,r_case_number,r_charge_degree,r_days_from_arrest,r_offense_date,r_charge_desc,r_jail_in,r_jail_out,violent_recid,is_violent_recid,vr_case_number,vr_charge_degree,vr_offense_date,vr_charge_desc,in_custody,out_custody,start,end,event,two_year_recid
0,1,miguel hernandez,miguel,hernandez,Male,1947-04-18,69,Greater than 45,Other,0,0,0,0,-1.0,2013-08-13 06:03:42,2013-08-14 05:41:20,13011352CF10A,2013-08-13,,F,Aggravated Assault w/Firearm,0,,,,,,,,,0,,,,,2014-07-07,2014-07-14,0,327,0,0
1,3,kevon dixon,kevon,dixon,Male,1982-01-22,34,25 - 45,African-American,0,0,0,0,-1.0,2013-01-26 03:45:27,2013-02-05 05:36:53,13001275CF10A,2013-01-26,,F,Felony Battery w/Prior Convict,1,13009779CF10A,(F3),,2013-07-05,Felony Battery (Dom Strang),,,,1,13009779CF10A,(F3),2013-07-05,Felony Battery (Dom Strang),2013-01-26,2013-02-05,9,159,1,1
2,4,ed philo,ed,philo,Male,1991-05-14,24,Less than 25,African-American,0,0,1,4,-1.0,2013-04-13 04:58:34,2013-04-14 07:02:04,13005330CF10A,2013-04-13,,F,Possession of Cocaine,1,13011511MM10A,(M1),0.0,2013-06-16,Driving Under The Influence,2013-06-16,2013-06-16,,0,,,,,2013-06-16,2013-06-16,0,63,0,1
3,5,marcu brown,marcu,brown,Male,1993-01-21,23,Less than 25,African-American,0,1,0,1,,,,13000570CF10A,2013-01-12,,F,Possession of Cannabis,0,,,,,,,,,0,,,,,,,0,1174,0,0
4,6,bouthy pierrelouis,bouthy,pierrelouis,Male,1973-01-22,43,25 - 45,Other,0,0,0,2,,,,12014130CF10A,,2013-01-09,F,arrest case no charge,0,,,,,,,,,0,,,,,,,0,1102,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,1531,jeremy torres,jeremy,torres,Male,1995-03-29,21,Less than 25,Hispanic,0,0,1,1,-4.0,2013-11-14 09:10:59,2013-11-16 08:34:03,13013996CF10A,,2013-11-14,F,arrest case no charge,1,15026209TC40A,(M2),,2015-04-30,Unlaw LicTag/Sticker Attach,,,,0,,,,,2013-11-14,2013-11-16,0,528,1,1
996,1532,leonardo collazos,leonardo,collazos,Male,1956-07-21,59,Greater than 45,Hispanic,0,0,0,1,-1.0,2013-04-26 10:39:22,2013-04-27 01:07:17,92078761TC20A,1992-07-02,,M,License Suspended Revoked,0,,,,,,,,,0,,,,,2013-04-26,2013-04-27,0,1070,0,0
997,1533,isaac smith,isaac,smith,Male,1976-05-09,39,25 - 45,African-American,0,0,0,1,-9.0,2014-01-05 09:54:57,2014-01-10 07:58:25,12016467CF10A,,2014-01-05,F,arrest case no charge,0,,,,,,,,,0,,,,,2014-01-05,2014-01-10,0,808,0,0
998,1534,trevor parker,trevor,parker,Male,1986-04-23,29,25 - 45,African-American,0,0,0,2,-1.0,2013-05-05 10:38:59,2013-06-12 05:40:10,13006444CF10A,2013-05-05,,F,Burglary Conveyance Armed,0,,,,,,,,,0,,,,,2013-05-05,2013-06-12,37,1061,0,0


## Import and process data


We import the data into a pandas DataFrame. We begin by cleaning the data, so the crime descriptions are simplified, removing duplicate catefories. Then,  the categorical data is  split into different fields for each category, and encoded as 0 or 1. For example, an individual with characteristic "sex: male" would be encoded as "male: 1, female: 0". The sex category is then removed. 

We then consider which fields to use for prediction. This includes the removal of any fields/columns which contain many NaN values, since these cannot be handled by the classifiers. We choose to remove the columns with many NaNs rather than using an alternative approach such as replacing them with the average so as not to introduce other types of bias. We also then remove rows/individuals containing any further NaN values so there is no longer any NaN values present in the data. 

We then normalise all of the data in the dataframe, so that when fed into the classifier, the predicitons are not skewed (and potentially different forms of bias introduced).  We do this by using the StandardScaler in the sklearn preprocessing library, and we normalise the data to have a variance of 1.

Finally, we define the number of testing/training samples desired and split the data into these two sets appropriately.


In [0]:
from sklearn import preprocessing

def clean_descriptions(description):
  description = description.replace(' and ', ' ')
  description = description.replace(' / ', ' ')
  description = description.replace('possession', 'posess')
  description = description.replace('possessing', 'posess')
  description = description.replace('with', 'w/')
  description = description.replace('w/ ', 'w/')
  description = description.replace('attempted', 'att')
  description = description.replace('attempt', 'att')
  description = description.replace('aggravated', 'agg')
  description = description.replace('aggrav', 'agg') 
  return description

def import_data(all_data):

  split_by_sex = False
  num_testing_samples = 201

  encoded_sex = (pd.get_dummies(all_data['sex']))
  all_data = all_data.drop(columns=['sex'])
  all_data = all_data.join(encoded_sex)

  encoded_age_cat = (pd.get_dummies(all_data['age_cat']))
  all_data = all_data.drop(columns=['age_cat'])
  all_data = all_data.join(encoded_age_cat)

  encoded_race = (pd.get_dummies(all_data['race']))
  all_data = all_data.drop(columns=['race'])
  all_data = all_data.join(encoded_race)

  encoded_c_charge_degree = (pd.get_dummies(all_data['c_charge_degree']))
  all_data = all_data.drop(columns=['c_charge_degree'])
  all_data = all_data.join(encoded_c_charge_degree, rsuffix='_c')

  #these are joined with suffixes because otherwise columns overlap 
  all_data['c_charge_desc'] = all_data['c_charge_desc'].astype(str).str.lower()
  all_data['c_charge_desc'] = all_data['c_charge_desc'].apply(clean_descriptions)
  encoded_c_charge_desc = (pd.get_dummies(all_data['c_charge_desc']))
  all_data = all_data.drop(columns=['c_charge_desc'])
  all_data = all_data.join(encoded_c_charge_desc, rsuffix='_c')

  encoded_r_charge_degree = (pd.get_dummies(all_data['r_charge_degree']))
  all_data = all_data.drop(columns=['r_charge_degree'])
  all_data = all_data.join(encoded_r_charge_degree, rsuffix='_r')

  all_data['r_charge_desc'] = all_data['r_charge_desc'].astype(str).str.lower()
  all_data['r_charge_desc'] = all_data['r_charge_desc'].apply(clean_descriptions)
  encoded_r_charge_desc = (pd.get_dummies(all_data['r_charge_desc']))
  all_data = all_data.drop(columns=['r_charge_desc'])
  all_data = all_data.join(encoded_r_charge_desc, rsuffix='_r')

  encoded_vr_charge_degree = (pd.get_dummies(all_data['vr_charge_degree']))
  all_data = all_data.drop(columns=['vr_charge_degree'])
  all_data = all_data.join(encoded_vr_charge_degree, rsuffix='_vr')

  all_data['vr_charge_desc'] = all_data['vr_charge_desc'].astype(str).str.lower()
  all_data['vr_charge_desc'] = all_data['vr_charge_desc'].apply(clean_descriptions)
  encoded_vr_charge_desc = (pd.get_dummies(all_data['vr_charge_desc']))
  all_data = all_data.drop(columns=['vr_charge_desc'])
  all_data = all_data.join(encoded_vr_charge_desc, rsuffix='_vr')

  all_data = all_data.drop(columns=['nan'])
  all_data = all_data.drop(columns=['nan_vr'])
  all_data = all_data.drop(columns=['nan_r'])

  #drop columns not used for predictions, including info such as names, and coluns with many NaN values 
  all_data_simplified = all_data.drop(columns=['two_year_recid', 'r_days_from_arrest', 'id','name','first','last','dob','days_b_screening_arrest','c_jail_in','c_jail_out','c_case_number','c_offense_date','c_arrest_date','r_case_number','r_offense_date','r_jail_in','r_jail_out','vr_case_number','vr_offense_date','in_custody','out_custody','start','end','violent_recid', 'age'])

  #drop demographic info such as age, gender, race; keep only criminal records for predictions
  # all_data_simplified = all_data.drop(columns=['two_year_recid', 'r_days_from_arrest', 'id','name','first','last','dob','days_b_screening_arrest','c_jail_in','c_jail_out','c_case_number','c_offense_date','c_arrest_date','r_case_number','r_offense_date','r_jail_in','r_jail_out','vr_case_number','vr_offense_date','in_custody','out_custody','start','end','violent_recid', 'age','Female','Male',	'25 - 45',	'Greater than 45',	'Less than 25',	'African-American',	'Caucasian',	'Hispanic',	'Other',	'F',	'M'])

  #remove rows containing NaN values 
  all_data_simplified = all_data_simplified.dropna()

  #Renormalise the data so we have unit variance and mean 0 using built-in preprocessing method in sklearn
  scaler = preprocessing.StandardScaler()
  all_data_scaled = pd.DataFrame(scaler.fit_transform(all_data_simplified),columns=all_data_simplified.columns)

  all_data_and_labels = all_data_scaled.join(all_data[['two_year_recid']])
  all_data_and_labels.columns = map(str.lower, all_data_and_labels.columns)

  #split into training and testing with specific number of testing samples
  #for now just set testing set to be first num_testing_samples samples in table 
  testing_data_and_labels = all_data_and_labels[:num_testing_samples]
  #and training set to be the remainder
  #this is also then consistent which is good for seeing patterns etc 
  training_data_and_labels = all_data_and_labels[num_testing_samples:]

  if(split_by_sex):
    testing_data_and_labels = pd.DataFrame.reset_index(testing_data_and_labels[testing_data_and_labels['Female']>0], drop=True)

  # print("normalised")
  # print(training_data_and_labels)

  # print(testing_data_and_labels)
  return training_data_and_labels, testing_data_and_labels

 

In [859]:
training_data_and_labels, testing_data_and_labels = import_data(all_data)
training_data_and_labels

Unnamed: 0,juv_fel_count,juv_misd_count,juv_other_count,priors_count,is_recid,is_violent_recid,event,female,male,25 - 45,greater than 45,less than 25,african-american,asian,caucasian,hispanic,native american,other,f,m,agg assault,agg assault w/dead weap,agg assault w/firearm,agg battery,agg battery (firearm/actual posess),agg battery grt/bod/harm,agg battery pregnant,agg battery w/deadly weapon,agg fleeing eluding,agg fleeing/eluding high speed,agg stalking after injunctn,arrest case no charge,arson ii (vehicle),assault,att armed burglary dwell,att burg/struct/unocc,att burgl conv occp,att burgl unoccupied dwel,att robbery no weapon,att robbery firearm,...,(m1)_vr,(m2)_vr,(mo3)_vr,agg assault law enforc officer,agg assault w/dead weap_vr,agg assault w/firearm_vr,agg battery_vr,agg battery grt/bod/harm_vr,agg battery law enforc officer,agg battery pregnant_vr,agg battery w/deadly weapon_vr,agg flee/eluding (injury/prop damage),agg fleeing/eluding high speed_vr,assault_vr,att murder in the first degree_vr,battery_vr,battery on a person over 65_vr,battery on law enforc officer_vr,burglary dwelling assault/batt_vr,burglary w/assault/battery_vr,carjacking,child abuse_vr,doc/fighting/threatening words_vr,felony battery_vr,felony battery (dom strang)_vr,felony battery w/prior convict_vr,kidnapping (facilitate felony),manslaughter w/weapon,murder in the first degree_vr,robbery no weapon_vr,robbery sudd snatch no weapon_vr,robbery sudd snatch w/weapon_vr,robbery w/firearm_vr,robbery weapon_vr,stalking (agg)_vr,strong armed robbery_vr,threat public servant_vr,threaten throw destruct device_vr,throw deadly missile into veh_vr,two_year_recid
201,-0.133977,-0.170543,-0.250302,-0.700393,-0.949284,-0.386556,-0.811403,2.071474,-2.071474,0.904534,-0.529537,-0.548079,-1.010051,-0.054855,1.415275,-0.323994,-0.054855,-0.254878,-1.371803,1.371803,-0.031639,-0.114766,-0.095298,-0.031639,-0.031639,-0.063372,-0.100504,-0.08396,-0.054855,-0.031639,-0.044766,-0.442913,-0.031639,-0.070888,-0.031639,-0.031639,-0.031639,-0.031639,-0.031639,-0.031639,...,-0.231821,-0.063372,-0.031639,-0.044766,-0.077693,-0.044766,-0.031639,-0.070888,-0.031639,-0.063372,-0.070888,-0.044766,-0.031639,-0.063372,-0.044766,-0.226991,-0.054855,-0.070888,-0.031639,-0.044766,-0.031639,-0.031639,-0.031639,-0.031639,-0.100504,-0.054855,-0.031639,-0.031639,-0.031639,-0.054855,-0.031639,-0.031639,-0.063372,-0.031639,-0.031639,-0.031639,-0.031639,-0.031639,-0.031639,0
202,-0.133977,-0.170543,-0.250302,0.098458,-0.949284,-0.386556,-0.811403,-0.482748,0.482748,0.904534,-0.529537,-0.548079,-1.010051,-0.054855,1.415275,-0.323994,-0.054855,-0.254878,-1.371803,1.371803,-0.031639,-0.114766,-0.095298,-0.031639,-0.031639,-0.063372,-0.100504,-0.08396,-0.054855,-0.031639,-0.044766,-0.442913,-0.031639,-0.070888,-0.031639,-0.031639,-0.031639,-0.031639,-0.031639,-0.031639,...,-0.231821,-0.063372,-0.031639,-0.044766,-0.077693,-0.044766,-0.031639,-0.070888,-0.031639,-0.063372,-0.070888,-0.044766,-0.031639,-0.063372,-0.044766,-0.226991,-0.054855,-0.070888,-0.031639,-0.044766,-0.031639,-0.031639,-0.031639,-0.031639,-0.100504,-0.054855,-0.031639,-0.031639,-0.031639,-0.054855,-0.031639,-0.031639,-0.063372,-0.031639,-0.031639,-0.031639,-0.031639,-0.031639,-0.031639,0
203,-0.133977,1.703553,-0.250302,3.893002,1.053425,-0.386556,1.232433,2.071474,-2.071474,0.904534,-0.529537,-0.548079,0.990050,-0.054855,-0.706577,-0.323994,-0.054855,-0.254878,0.728967,-0.728967,-0.031639,-0.114766,-0.095298,-0.031639,-0.031639,-0.063372,-0.100504,-0.08396,-0.054855,-0.031639,-0.044766,-0.442913,-0.031639,-0.070888,-0.031639,-0.031639,-0.031639,-0.031639,-0.031639,-0.031639,...,-0.231821,-0.063372,-0.031639,-0.044766,-0.077693,-0.044766,-0.031639,-0.070888,-0.031639,-0.063372,-0.070888,-0.044766,-0.031639,-0.063372,-0.044766,-0.226991,-0.054855,-0.070888,-0.031639,-0.044766,-0.031639,-0.031639,-0.031639,-0.031639,-0.100504,-0.054855,-0.031639,-0.031639,-0.031639,-0.054855,-0.031639,-0.031639,-0.063372,-0.031639,-0.031639,-0.031639,-0.031639,-0.031639,-0.031639,1
204,-0.133977,-0.170543,-0.250302,0.298171,1.053425,-0.386556,1.232433,-0.482748,0.482748,-1.105542,-0.529537,1.824556,-1.010051,-0.054855,-0.706577,-0.323994,-0.054855,3.923448,0.728967,-0.728967,-0.031639,-0.114766,-0.095298,-0.031639,-0.031639,-0.063372,-0.100504,-0.08396,-0.054855,-0.031639,-0.044766,2.257778,-0.031639,-0.070888,-0.031639,-0.031639,-0.031639,-0.031639,-0.031639,-0.031639,...,-0.231821,-0.063372,-0.031639,-0.044766,-0.077693,-0.044766,-0.031639,-0.070888,-0.031639,-0.063372,-0.070888,-0.044766,-0.031639,-0.063372,-0.044766,-0.226991,-0.054855,-0.070888,-0.031639,-0.044766,-0.031639,-0.031639,-0.031639,-0.031639,-0.100504,-0.054855,-0.031639,-0.031639,-0.031639,-0.054855,-0.031639,-0.031639,-0.063372,-0.031639,-0.031639,-0.031639,-0.031639,-0.031639,-0.031639,1
205,-0.133977,-0.170543,-0.250302,-0.300967,1.053425,-0.386556,1.232433,-0.482748,0.482748,0.904534,-0.529537,-0.548079,0.990050,-0.054855,-0.706577,-0.323994,-0.054855,-0.254878,0.728967,-0.728967,-0.031639,-0.114766,-0.095298,-0.031639,-0.031639,-0.063372,-0.100504,-0.08396,-0.054855,-0.031639,-0.044766,-0.442913,-0.031639,-0.070888,-0.031639,-0.031639,-0.031639,-0.031639,-0.031639,-0.031639,...,-0.231821,-0.063372,-0.031639,-0.044766,-0.077693,-0.044766,-0.031639,-0.070888,-0.031639,-0.063372,-0.070888,-0.044766,-0.031639,-0.063372,-0.044766,-0.226991,-0.054855,-0.070888,-0.031639,-0.044766,-0.031639,-0.031639,-0.031639,-0.031639,-0.100504,-0.054855,-0.031639,-0.031639,-0.031639,-0.054855,-0.031639,-0.031639,-0.063372,-0.031639,-0.031639,-0.031639,-0.031639,-0.031639,-0.031639,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,-0.133977,-0.170543,2.133524,-0.500680,1.053425,-0.386556,1.232433,-0.482748,0.482748,-1.105542,-0.529537,1.824556,-1.010051,-0.054855,-0.706577,3.086473,-0.054855,-0.254878,0.728967,-0.728967,-0.031639,-0.114766,-0.095298,-0.031639,-0.031639,-0.063372,-0.100504,-0.08396,-0.054855,-0.031639,-0.044766,2.257778,-0.031639,-0.070888,-0.031639,-0.031639,-0.031639,-0.031639,-0.031639,-0.031639,...,-0.231821,-0.063372,-0.031639,-0.044766,-0.077693,-0.044766,-0.031639,-0.070888,-0.031639,-0.063372,-0.070888,-0.044766,-0.031639,-0.063372,-0.044766,-0.226991,-0.054855,-0.070888,-0.031639,-0.044766,-0.031639,-0.031639,-0.031639,-0.031639,-0.100504,-0.054855,-0.031639,-0.031639,-0.031639,-0.054855,-0.031639,-0.031639,-0.063372,-0.031639,-0.031639,-0.031639,-0.031639,-0.031639,-0.031639,1
996,-0.133977,-0.170543,-0.250302,-0.500680,-0.949284,-0.386556,-0.811403,-0.482748,0.482748,-1.105542,1.888441,-0.548079,-1.010051,-0.054855,-0.706577,3.086473,-0.054855,-0.254878,-1.371803,1.371803,-0.031639,-0.114766,-0.095298,-0.031639,-0.031639,-0.063372,-0.100504,-0.08396,-0.054855,-0.031639,-0.044766,-0.442913,-0.031639,-0.070888,-0.031639,-0.031639,-0.031639,-0.031639,-0.031639,-0.031639,...,-0.231821,-0.063372,-0.031639,-0.044766,-0.077693,-0.044766,-0.031639,-0.070888,-0.031639,-0.063372,-0.070888,-0.044766,-0.031639,-0.063372,-0.044766,-0.226991,-0.054855,-0.070888,-0.031639,-0.044766,-0.031639,-0.031639,-0.031639,-0.031639,-0.100504,-0.054855,-0.031639,-0.031639,-0.031639,-0.054855,-0.031639,-0.031639,-0.063372,-0.031639,-0.031639,-0.031639,-0.031639,-0.031639,-0.031639,0
997,-0.133977,-0.170543,-0.250302,-0.500680,-0.949284,-0.386556,-0.811403,-0.482748,0.482748,0.904534,-0.529537,-0.548079,0.990050,-0.054855,-0.706577,-0.323994,-0.054855,-0.254878,0.728967,-0.728967,-0.031639,-0.114766,-0.095298,-0.031639,-0.031639,-0.063372,-0.100504,-0.08396,-0.054855,-0.031639,-0.044766,2.257778,-0.031639,-0.070888,-0.031639,-0.031639,-0.031639,-0.031639,-0.031639,-0.031639,...,-0.231821,-0.063372,-0.031639,-0.044766,-0.077693,-0.044766,-0.031639,-0.070888,-0.031639,-0.063372,-0.070888,-0.044766,-0.031639,-0.063372,-0.044766,-0.226991,-0.054855,-0.070888,-0.031639,-0.044766,-0.031639,-0.031639,-0.031639,-0.031639,-0.100504,-0.054855,-0.031639,-0.031639,-0.031639,-0.054855,-0.031639,-0.031639,-0.063372,-0.031639,-0.031639,-0.031639,-0.031639,-0.031639,-0.031639,0
998,-0.133977,-0.170543,-0.250302,-0.300967,-0.949284,-0.386556,-0.811403,-0.482748,0.482748,0.904534,-0.529537,-0.548079,0.990050,-0.054855,-0.706577,-0.323994,-0.054855,-0.254878,0.728967,-0.728967,-0.031639,-0.114766,-0.095298,-0.031639,-0.031639,-0.063372,-0.100504,-0.08396,-0.054855,-0.031639,-0.044766,-0.442913,-0.031639,-0.070888,-0.031639,-0.031639,-0.031639,-0.031639,-0.031639,-0.031639,...,-0.231821,-0.063372,-0.031639,-0.044766,-0.077693,-0.044766,-0.031639,-0.070888,-0.031639,-0.063372,-0.070888,-0.044766,-0.031639,-0.063372,-0.044766,-0.226991,-0.054855,-0.070888,-0.031639,-0.044766,-0.031639,-0.031639,-0.031639,-0.031639,-0.100504,-0.054855,-0.031639,-0.031639,-0.031639,-0.054855,-0.031639,-0.031639,-0.063372,-0.031639,-0.031639,-0.031639,-0.031639,-0.031639,-0.031639,0


# Classification

##Selecting Classifiers

Here we select the classification model to use. We are using a selection of built-in classifiers in scikit-learn. 

Currently, we are using RBF SVM models (https://scikit-learn.org/stable/auto_examples/svm/plot_rbf_parameters.html). 

We define the boolean values ```vary_gamma ``` and  ```vary_c``` to define whether we are varying the gamma or C value in the classifiers. 




In [0]:
from sklearn import model_selection, neighbors, svm, gaussian_process, tree, ensemble, neural_network, metrics

def define_classifiers():

  vary_gamma = True
  vary_c = False 
  gammas = []
  cs = []
  classifiers = []

  if vary_gamma:
    # gammas = [0.0001, 0.001, 0.01, 0.1, 1.0, 10, 100, 1000, 10000]
    gammas = [0.001]
    # gammas = [1]
    c_val = 1000
    #fix size of C if varying gamma
    for gamma_val in gammas:
      classifiers.append(svm.SVC(gamma=gamma_val,C=c_val))

  if vary_c:
    cs = [10, 100, 1000, 10000, 100000, 1000000, 10000000, 10000000]
    gamma_val = 1
    #fix size of gamma if varying C
    for c_val in cs:
      classifiers.append(svm.SVC(gamma=gamma_val,C=c_val))

  return classifiers, gammas, cs


In [0]:
# define_classifiers()

## Bootstrapping 

The classification process then uses a bootstrapping procedure with the chosen model, to generate predictions of recidivism classifications (1 = will reoffend; 0 = will not reoffend).

Bootstrapping (https://link.springer.com/chapter/10.1007/978-1-4612-4380-9_41) is a sampling with replacement procedure. Here, the sample size is the same as the size of the (training) dataset. The bootstrapping procedure is run many times to generate different training datasets, which will then be used for classification. In turn, the classification results will be used to calculate and study the bias and variance errors. 

In [0]:
def do_bootstrap(training_data_and_labels):
  # this is one bootstrap sample 
  indices = np.random.randint(0,training_data_and_labels.shape[0] , training_data_and_labels.shape[0])
  indices.sort()
  data_points = []

  for i in indices:
    data_points.append(training_data_and_labels.iloc[i])

  b_sample = pd.DataFrame(data_points)

  return b_sample

In [0]:
# b_sample = do_bootstrap(training_data_and_labels)

### Calculate average prediction for each individual over all bootstrap samples 

In [0]:
def calculate_avg_prediction(predictions):
  #each row is bootstrap sample, each column an individual
  return majority_predictions

## Perform classification


Fit the model on the training data (which is one bootstrap data sample as defined above)


In [0]:
def fit_model(clf, b_sample, testing_data_and_labels):

    #training data is everything apart from two year recid 0/1 label from the bootstrap sample
    X_train = b_sample.drop(columns=['two_year_recid'])
    y_train = b_sample['two_year_recid']
    test_aa = testing_data_and_labels.loc[testing_data_and_labels['african-american'] > 0]
    X_test = test_aa.drop(columns=['two_year_recid'])
    y_test = test_aa['two_year_recid']

    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    y_true = y_test

    return y_pred, y_true

Perform classification for each bootstrap sample separately, and store these in a DataFrame, to be passed into the bias/variance calculations.

In [0]:
def classify(training_data_and_labels, testing_data_and_labels, clf):
    count = 0

    num_bootstraps = len(training_data_and_labels);
    while count <= num_bootstraps:
      b_sample = do_bootstrap(training_data_and_labels)
      y_pred, y_true = fit_model(clf, b_sample, testing_data_and_labels)
      if(count == 0):
        predictions = pd.DataFrame(pd.Series(y_pred)).transpose()
        #true labels are the same for every sample so we only need 1 row in df
        true_labels = pd.DataFrame(pd.Series(y_true)).transpose()
      else:
        predictions = predictions.append(pd.DataFrame(pd.Series(y_pred)).transpose())
      count += 1
      
    return predictions, true_labels

In [0]:
# classify(training_data_and_labels, testing_data_and_labels, clf)


## Correcting for Fairness


#Fairness

The definition of fairness is disputed, and there is not a single correct approach to ensuring fairness in machine learning. In general, as stated in https://arxiv.org/pdf/1711.08513.pdf, fairness in machine learning can be approached in two ways: fairness of the dataset itself; fairness of the model.

Since we cannot control the process by which the data is collected, and the recidivism dataset already exists (likely with human and societal biases built-in), we will not be focusing on the former category. Although, there have been recent trends within the fairness and machine learning communities to argue the importance of the fairness in data collection, for example, in http://papers.nips.cc/paper/7613-why-is-my-classifier-discriminatory.pdf, the authors discuss the necessity of correcting for bias in the dataset, an approach which may actually increase the accuracy of the predictions, in contrast to approaches that exclusively focus on correcting for fairness in the models, at the expense of accuracy. Another area in which recent trends in fairness research have addressed is the importance of developing context-aware fairness measurements (https://arxiv.org/pdf/1805.05859.pdf). However, in our project we will focus on model-based fairness correction - ensuring the machine learning models are not perpetuating existing biases, or introducing new biases. We do this by using a widely used and accepted fairness measurement which is context-independent, known as **Equalised Odds** (http://papers.nips.cc/paper/6374-equality-of-opportunity-in-supervised-learning.pdf). This approach is not without criticism, however it provides a clear and well-motivated approach to achieving fair predictions across subgroups with different protected characteristics. We attempt to correct for fairness in relation to the protected characteristics found in the recidivism dataset (sex, race, age). Once our models are 'fair' in relation to this description, we can explore the relationship between bias and variance errors and the potential discovery of discrimination against new categories. 



##Equalised Odds

As stated above, we are considering fairness in relation to the equalised odds metric (http://papers.nips.cc/paper/6374-equality-of-opportunity-in-supervised-learning.pdf). The definition as stated in this paper is as follows: \\
We say that a predictor $\hat{Y}$ satisfies equalized odds with respect to
protected attribute $A$ and outcome $Y$, if $\hat{Y}$ and $A$ are independent conditional on $Y$. Therefore, if the classification labels are $Y$ and $\hat{Y}$, for an outcome $ y=1 $, $\hat{Y}$ has equal true positive rates across all demographic groups, for example, the categories not female and female will have equal true positive rates. For an outcome  $ y=0 $, $\hat{Y}$ has equal false positive rates across all demographic groups. This enforces equal bias and accuracy in all demographics. This can formally be stated as:
$$ Pr \left\{ \hat{Y}=1 | A = 0, Y = y \right\} = Pr \left\{ \hat{Y}=1 | A = 0, Y = y \right\} , y \in \left\{ 0,1 \right\}$$

This approach punishes models that only perform well on the majority demographics.

In [0]:
from sklearn.metrics import confusion_matrix

#this method doesn't actually make anything fair, it just calculates what we need in order to configure the models to be fair 
def make_fair(y_true, y_pred):
  # get confusion matrix and compute tn,fp,fn,tp
  tn, fp, fn, tp = confusion_matrix(y_true.iloc[0].to_numpy(), y_pred.iloc[0].to_numpy()).ravel()
  print("true negatives:", tn, "false positives:", fp,"false negatives:", fn, "true positives:",tp)
  #reconfigure model based on these results

# Compute bias/variance errors

Using all these bootstrap predictions, we calculate the average misclassification error. This is done by first calculating the overall misclassification loss across bootstrap sample predictions, finding the average misclassification error for each datapoint (individual). As described in:
http://www.cems.uwe.ac.uk/~irjohnso/coursenotes/uqc832/tr-bias.pdf

We can then decompose the error into the errors due to bias, and the errors due to variance, in order to study the behaviour of the model and the bias/variance tradeoff. This decomposition for classification is described in: 
https://homes.cs.washington.edu/~pedrod/bvd.pdf
https://pdfs.semanticscholar.org/9253/f3e13bca7e845e60394d85ddaec0d4cfc6d6.pdf https://www.stat.berkeley.edu/users/breiman/arcall96.pdf. 

The error is also comprised of an error due to noise (in addition to bias and variance). However, as stated in http://papers.nips.cc/paper/7613-why-is-my-classifier-discriminatory.pdf, the noise is dependent on the data, not the model, so comparing the discrimination level in the form of bias and variance errors, the noise terms cancel since they are independent of the model. Therefore, differences in bias can be explored even without knowing the underlying noise of the data. 

We calculate the bias and variance errors for each individual, following zero-one loss rules under misclassification loss, as described in https://homes.cs.washington.edu/~pedrod/bvd.pdf. We can then calculate the overall average bias error and variance error for the prediction. 


In [0]:
def compute_bias_variance(predictions, true_labels):

  # print("predictions: ")
  # print(predictions)
  # print("true labels: ")
  # print(true_labels)

  biases = []
  variances = []
  avg_errors = []
  misclassified_individuals = []

  # calculate the bias and variance for each value of X,y
  # for misclassification loss
  
  #find whether each element is misclassified for each bootstrap sample 
  predictions_misclassified = predictions.apply(lambda x : x != true_labels.iloc[0], axis=1)

  #count number of times misclassified for each datapoint across all bootstrap samples 
  counts = predictions_misclassified.apply(np.sum)

  #average misclassification error for each individual/datapoint 
  avg_errors = counts.apply(lambda y : np.divide(y,len(predictions)))

  index = 0

  for avg_error in avg_errors:
    (bias, variance) = (0, avg_error) if (avg_error <= 0.5) else (1, (1-avg_error))
    # print(bias)
    # print(variance)
    biases.append(bias)
    variances.append(variance)
    if avg_error > 0.5:
      misclassified_individuals.append(index)
    index+= 1

  avg_bias = np.mean(biases)
  # avg_var = abs(np.mean(avg_errors) - avg_bias)
  avg_var = abs(np.mean(variances))
  avg_error = np.mean(avg_errors)

  print("average error:")
  print(avg_error)
  print("average bias:")
  print(avg_bias)
  print("average variance:")
  print(avg_var)

  return avg_bias, avg_var, avg_error, misclassified_individuals

## Identifying Categories of Discrimination 

We hope to address the question: Are models that exhibit high bias errors likely to introduce new categories of discrimination? 

We can therefore look at the bias and variance errors for different models.


We want to see that if the variance is low and bias high, is it consistently discriminating against a certain subgroup, potentially introducing a new type of discrimination? Unlike other work such as http://papers.nips.cc/paper/7613-why-is-my-classifier-discriminatory.pdf, it doesn't have to be a protected characteristic.

# Plots

Creating the appropriate plots to visualise our results. We plot: 

1.   Bias error vs Variance error
2.   Gamma value of RBF SVM vs Variance error
3.   Gamma value of RBF SVM vs Bias error


In [0]:
import matplotlib.pyplot as plt                                  
def plot_bias_variance(biases, variances, gammas, cs, errors):   
  print("plotting bias/var") 
  plt.scatter(biases, variances)                                              
  plt.title('bias vs variance errors')                                     
  plt.xlabel('bias')                                                       
  plt.ylabel('variance')                                                   
  plt.show()

  # plt.scatter(gammas, variances)
  # plt.xscale('log')                                              
  # plt.title('RBF SVM, C = 1 \n gamma size vs variance errors')                                     
  # plt.xlabel('gamma')                                                       
  # plt.ylabel('variance')                                                   
  # plt.show()                                                            

  # plt.scatter(gammas, biases)                               
  # plt.xscale('log')                                                             
  # plt.title('RBF SVM, C = 1 \n gamma size vs bias errors')                                     
  # plt.xlabel('gamma')                                                       
  # plt.ylabel('bias')                                                   
  # plt.show()            

  # plt.scatter(gammas, errors)                               
  # plt.xscale('log')                                                             
  # plt.title('RBF SVM, C = 1 \n gamma size vs total error')                                     
  # plt.xlabel('gamma')                                                       
  # plt.ylabel('error')                                                   
  # plt.show()   

  plt.scatter(cs, biases)                               
  plt.xscale('log')                                                             
  plt.title('RBF SVM, gamma=0.001, C value vs bias errors')                                     
  plt.xlabel('C value')                                                       
  plt.ylabel('bias')                                                   
  plt.show()       

  plt.scatter(cs, variances)                               
  plt.xscale('log')                                                             
  plt.title('RBF SVM, gamma=0.001, C value vs variance errors')                                     
  plt.xlabel('C value')                                                       
  plt.ylabel('variance')                                                   
  plt.show()       

#just an example of if we want to plot the misclassified individuals against a characteristic from the dataframe 
#might help to look for patterns 
def plot_misclassified(misclassified):
  misclassified.reset_index().plot(kind='scatter', x='index', y='age') 
  plt.show()

In [0]:
#download CSV file containing all the info for the individuals who are consistently misclassified (i.e. >50% of the time, resulting in bias errors)
def download_misclassified(misclassified):
  misclassified.reset_index().to_csv('misclassified.csv', index=False)
  from google.colab import files
  files.download('misclassified.csv')

# Main method (execute code)

Main method to run the system, executing methods in appropriate sequence. 

In [0]:
def main():
  all_data = load_file()
  training_data_and_labels, testing_data_and_labels = import_data(all_data)
  biases = []
  variances = []
  total_errors = []
  classifiers, gammas, cs = define_classifiers()
  misclassified = []
  equalised_odds = True

  for classifier in classifiers:
    print(classifier)
    # clf = classifiers[classifier_names.index(classifier)]
    clf = classifier
    predictions, true_labels = classify(training_data_and_labels, testing_data_and_labels, clf)  
    majority_predictions = predictions.mode().astype('int64')
    if(equalised_odds):
      make_fair(true_labels, majority_predictions)
    bias, variance, total_error, misclassified_individuals = compute_bias_variance(predictions, true_labels)
    biases.append(bias)
    variances.append(variance)
    total_errors.append(total_error)
    #get the individuals which are misclassified on average (hence contributing to bias errors)
    print(misclassified_individuals)
    misclassified = testing_data_and_labels.iloc[misclassified_individuals]
    # download_misclassified(misclassified)
    # plot_misclassified(misclassified)

  plot_bias_variance(biases, variances, gammas, cs, total_errors)

In [873]:
main()

loading data
SVC(C=1000, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma=0.001, kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)
true negatives: 30 false positives: 4 false negatives: 2 true positives: 56


ValueError: ignored