<a href="https://colab.research.google.com/github/clairecoffey/project/blob/master/claire_project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fairness and the Bias/Variance Tradeoff

## Claire Coffey

## 24th April 2020

In this notebook we are studying bias and variance errors in the context of recidivism data. 

## Imports and Setup

Imports: first import the relevant libraries used throughout. 

In [0]:
# imports
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn
import numpy as np
import pandas as pd

Setup: define a global boolean that selects the recidivism data; it is here in case we want to switch between different datasets.

In [0]:
# setup
recidivism_data = True

# Read in recidivism data 

In this notebook we are studying recidivism data. We utilise the COMPAS recidivism dataset, which uses recidivism data from Broward County jail and has been explored in the following studies:

"The accuracy, fairness, and limits of predicting recidivism", paper available at:
https://advances.sciencemag.org/content/4/1/eaao5580#corresp-1

"Machine Bias" ProPublica article, available at:
https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing

The dataset used can be found at:
https://github.com/propublica/compas-analysis


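Here we import and read in the recidivism data.

We currently use a selection of fields from this dataset to predict the recidivism classification (0 = will not reoffend; 1 = will reoffend). In `import_data` below, the fields used are the columns at zero-based positions 6 to 11 of the CSV: `age`, `age_cat`, `race`, `juv_fel_count`, `juv_misd_count` and `juv_other_count`. This selection can easily be altered.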

In [0]:

def load_file():
  # set full_data to True to load the full dataset instead of the small subset
  full_data = False
  print("importing data")
  if not recidivism_data:
    # no other dataset is wired up yet; fail early rather than reach read_csv with file_path undefined
    raise ValueError("only the recidivism dataset is currently supported")
  if full_data:
    # full 2 year compas scores dataset
    file_path = "https://raw.githubusercontent.com/clairecoffey/project/master/mphilproject/compas-scores-two-years%20-%20compas-scores-two-years.csv?token=ABPC6VJE3BXQDQ25BHIL7DK6SWGT2"
  else:
    # small subset of first 500 people
    file_path = "https://raw.githubusercontent.com/clairecoffey/project/master/mphilproject/500-compas-scores-two-years%20-%20Sheet1%20(1).csv?token=ABPC6VLXB3JLKFHUXEV5Y226VQCNE"

  # load CSV contents
  # pandas reads blank fields as NaN; in the 'category' columns below (e.g. the charge degrees)
  # NaN marks a missing value and is not itself a category
  all_data = pd.read_csv(file_path, delimiter=',', dtype={'sex': 'category', 
                                                          'age_cat': 'category',
                                                          'race': 'category',
                                                          'c_charge_degree': 'category',
                                                          'c_charge_desc': 'category',
                                                          'r_charge_degree': 'category',
                                                          'r_charge_desc': 'category',
                                                          'vr_charge_degree': 'category',
                                                          'vr_charge_desc': 'category'
                                                          })
  return all_data
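
The question in the comment above about missing values can be checked on a tiny in-memory CSV. This is a minimal sketch: the three rows are made up for illustration and only borrow two column names from the dataset.

```python
import io
import pandas as pd

# Hypothetical three-row extract; the second row has a blank r_charge_degree.
csv_text = "id,sex,r_charge_degree\n1,Male,(F3)\n2,Female,\n3,Male,(M1)\n"
sample = pd.read_csv(io.StringIO(csv_text),
                     dtype={'sex': 'category', 'r_charge_degree': 'category'})

# The blank field is read as NaN and is not added to the categories.
print(sample['r_charge_degree'].isna().sum())
print(list(sample['r_charge_degree'].cat.categories))
```

If blank charge fields should be treated as their own category instead, `pd.get_dummies` accepts `dummy_na=True`, which adds an indicator column for missing values.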


In [25]:
all_data = load_file()

importing data


Unnamed: 0,id,name,first,last,sex,dob,age,age_cat,race,juv_fel_count,juv_misd_count,juv_other_count,priors_count,days_b_screening_arrest,c_jail_in,c_jail_out,c_case_number,c_offense_date,c_arrest_date,c_charge_degree,c_charge_desc,is_recid,r_case_number,r_charge_degree,r_days_from_arrest,r_offense_date,r_charge_desc,r_jail_in,r_jail_out,violent_recid,is_violent_recid,vr_case_number,vr_charge_degree,vr_offense_date,vr_charge_desc,in_custody,out_custody,start,end,event,two_year_recid
0,1,miguel hernandez,miguel,hernandez,Male,1947-04-18,69,Greater than 45,Other,0,0,0,0,-1.0,2013-08-13 06:03:42,2013-08-14 05:41:20,13011352CF10A,2013-08-13,,F,Aggravated Assault w/Firearm,0,,,,,,,,,0,,,,,2014-07-07,2014-07-14,0,327,0,0
1,3,kevon dixon,kevon,dixon,Male,1982-01-22,34,25 - 45,African-American,0,0,0,0,-1.0,2013-01-26 03:45:27,2013-02-05 05:36:53,13001275CF10A,2013-01-26,,F,Felony Battery w/Prior Convict,1,13009779CF10A,(F3),,2013-07-05,Felony Battery (Dom Strang),,,,1,13009779CF10A,(F3),2013-07-05,Felony Battery (Dom Strang),2013-01-26,2013-02-05,9,159,1,1
2,4,ed philo,ed,philo,Male,1991-05-14,24,Less than 25,African-American,0,0,1,4,-1.0,2013-04-13 04:58:34,2013-04-14 07:02:04,13005330CF10A,2013-04-13,,F,Possession of Cocaine,1,13011511MM10A,(M1),0.0,2013-06-16,Driving Under The Influence,2013-06-16,2013-06-16,,0,,,,,2013-06-16,2013-06-16,0,63,0,1
3,5,marcu brown,marcu,brown,Male,1993-01-21,23,Less than 25,African-American,0,1,0,1,,,,13000570CF10A,2013-01-12,,F,Possession of Cannabis,0,,,,,,,,,0,,,,,,,0,1174,0,0
4,6,bouthy pierrelouis,bouthy,pierrelouis,Male,1973-01-22,43,25 - 45,Other,0,0,0,2,,,,12014130CF10A,,2013-01-09,F,arrest case no charge,0,,,,,,,,,0,,,,,,,0,1102,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
332,495,kia rodriquez,kia,rodriquez,Female,1979-07-02,36,25 - 45,African-American,0,0,0,0,-1.0,2013-05-05 08:31:15,2013-05-17 08:35:57,13006433CF10A,2013-05-05,,F,Aggrav Battery w/Deadly Weapon,0,,,,,,,,,0,,,,,2013-05-05,2013-05-17,11,1061,0,0
333,496,jodie sena,jodie,sena,Female,1991-01-26,25,25 - 45,Caucasian,0,0,0,2,-2.0,2013-01-09 08:33:15,2013-01-10 09:40:18,13000518MM10A,2013-01-09,,M,Petit Theft,1,13007257CF10A,(F3),0.0,2013-05-21,Possession of Cocaine,2013-05-21,2013-07-13,,0,,,,,2013-03-06,2013-03-15,0,54,0,1
334,497,helen carrillo,helen,carrillo,Female,1992-06-09,23,Less than 25,Hispanic,0,0,0,1,0.0,2013-04-04 01:13:56,2013-04-04 11:28:44,13006418MM10A,2013-04-03,,M,Operating W/O Valid License,1,15022411TC10A,(M2),0.0,2015-07-07,Operating W/O Valid License,2015-07-07,2015-07-17,,0,,,,,2015-07-07,2015-07-17,0,824,1,0
335,499,justin knoll,justin,knoll,Male,1988-02-24,28,25 - 45,Caucasian,0,0,0,1,-4.0,2014-02-27 04:28:02,2014-02-28 08:49:55,14002778CF10A,2014-02-27,,F,Pos Cannabis W/Intent Sel/Del,0,,,,,,,,,0,,,,,2014-07-16,2014-07-17,0,135,0,0


In [0]:
# An ordinal encoding maps ['small', 'medium', 'large'] to [0, 1, 2].
# A one-hot encoding maps ['cocaine', 'marijuana', 'assault'] to [[0, 0, 1], [0, 1, 0], [1, 0, 0]].
# get_dummies below produces the one-hot form.

# print(all_data['c_charge_desc'].unique())

# is get_dummies the best way to do this?
# Each entry is (column, suffix). The suffix is only applied when a dummy column
# name overlaps with a column that already exists, e.g. a charge description
# that appears in both c_charge_desc and r_charge_desc.
categorical_columns = [('sex', ''), ('age_cat', ''), ('race', ''),
                       ('c_charge_degree', '_c'), ('c_charge_desc', '_c'),
                       ('r_charge_degree', '_r'), ('r_charge_desc', '_r'),
                       ('vr_charge_degree', '_vr'), ('vr_charge_desc', '_vr')]
for column, suffix in categorical_columns:
  encoded = pd.get_dummies(all_data[column])
  all_data = all_data.drop(columns=[column]).join(encoded, rsuffix=suffix)

# drop identifiers, dates and other fields that are not used as features
# note: is_recid and two_year_recid are still present here and must be split off as labels before fitting
training_data = all_data.drop(columns=['id', 'name', 'first', 'last', 'dob', 'days_b_screening_arrest',
                                       'c_jail_in', 'c_jail_out', 'c_case_number', 'c_offense_date',
                                       'c_arrest_date', 'r_case_number', 'r_offense_date', 'r_jail_in',
                                       'r_jail_out', 'vr_case_number', 'vr_offense_date', 'in_custody',
                                       'out_custody', 'start', 'end'])
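
The encoded table below has some columns with a `_r` or `_vr` suffix and some without. A small sketch of why, using a made-up two-column frame: `DataFrame.join` only applies `rsuffix` to column names that overlap with the left-hand frame.

```python
import pandas as pd

# Hypothetical toy frame; 'Battery' appears in both description columns.
toy = pd.DataFrame({
    'c_charge_desc': ['Battery', 'Petit Theft', 'Battery'],
    'r_charge_desc': ['Battery', None, 'Assault'],
})

encoded_c = pd.get_dummies(toy['c_charge_desc'])
encoded_r = pd.get_dummies(toy['r_charge_desc'])

# Only the overlapping name gets the suffix; 'Assault' is unique and keeps its name.
joined = encoded_c.join(encoded_r, rsuffix='_r')
print(list(joined.columns))
```

The row with a missing `r_charge_desc` gets zeros in every `r_charge_desc` indicator column, which is how people who did not reoffend are represented after encoding.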


In [27]:

# all_data
training_data

Unnamed: 0,age,juv_fel_count,juv_misd_count,juv_other_count,priors_count,is_recid,r_days_from_arrest,violent_recid,is_violent_recid,event,two_year_recid,Female,Male,25 - 45,Greater than 45,Less than 25,African-American,Caucasian,Hispanic,Other,F,M,Agg Battery Grt/Bod/Harm,Agg Fleeing and Eluding,Agg Fleeing/Eluding High Speed,Aggrav Battery w/Deadly Weapon,Aggrav Stalking After Injunctn,Aggravated Assault W/Dead Weap,Aggravated Assault W/dead Weap,Aggravated Assault w/Firearm,Aggravated Battery (Firearm/Actual Possession),Aggravated Battery / Pregnant,Assault,Att Tamper w/Physical Evidence,Attempt Armed Burglary Dwell,Attempted Burg/struct/unocc,Attempted Robbery No Weapon,Battery,Battery On Parking Enfor Speci,Battery on Law Enforc Officer,...,Prostitution/Lewd Act Assignation,Prowling/Loitering_r,Resist/Obstruct W/O Violence_r,Robbery / No Weapon_r,Robbery W/Firearm,Susp Drivers Lic 1st Offense_r,Tamper With Witness/Victim/CI,Theft/To Deprive,Trespass After Warning,Trespass Other Struct/Conve,Trespass Struct/Conveyance_r,Unlaw LicTag/Sticker Attach_r,Unlaw Use False Name/Identity_r,Uttering a Forged Instrument_r,Viol Injunct Domestic Violence_r,Viol Injunction Protect Dom Violence,Viol Pretrial Release Dom Viol,Wear Mask w/Commit Offense,(F1),(F2)_vr,(F3)_vr,(M1)_vr,(M2)_vr,(MO3)_vr,Agg Battery Grt/Bod/Harm_vr,Aggrav Battery w/Deadly Weapon_vr,Aggravated Assault W/Dead Weap_vr,Aggravated Battery / Pregnant_vr,Assault_vr,Battery_vr,Battery on Law Enforc Officer_vr,Battery on a Person Over 65,Child Abuse_vr,DOC/Fighting/Threatening Words_vr,Felony Battery,Felony Battery (Dom Strang)_vr,Felony Battery w/Prior Convict_vr,Kidnapping (Facilitate Felony),Manslaughter with Weapon,Robbery W/Firearm_vr
0,69,0,0,0,0,0,,,0,0,0,0,1,0,1,0,0,0,0,1,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,34,0,0,0,0,1,,,1,1,1,0,1,1,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0
2,24,0,0,1,4,1,0.0,,0,0,1,0,1,0,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,23,0,1,0,1,0,,,0,0,0,0,1,0,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,43,0,0,0,2,0,,,0,0,0,0,1,1,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
332,36,0,0,0,0,0,,,0,0,0,1,0,1,0,0,1,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
333,25,0,0,0,2,1,0.0,,0,0,1,1,0,1,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
334,23,0,0,0,1,1,0.0,,0,1,0,1,0,0,0,1,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
335,28,0,0,0,1,0,,,0,0,0,0,1,1,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [0]:
# import data from git repo
def import_data():
    full_data = False
    print("importing data")
    if(recidivism_data):
        if full_data:
            # full 2 year compas scores dataset
            file_path = "https://raw.githubusercontent.com/clairecoffey/project/master/mphilproject/compas-scores-two-years%20-%20compas-scores-two-years.csv?token=ABPC6VJE3BXQDQ25BHIL7DK6SWGT2"
        else:
            # small subset of first 500 people
            file_path = "https://raw.githubusercontent.com/clairecoffey/project/master/mphilproject/500-compas-scores-two-years%20-%20Sheet1%20(1).csv?token=ABPC6VLXB3JLKFHUXEV5Y226VQCNE"
        
        # load CSV contents
        all_data = (pd.read_csv(file_path, delimiter=',').values)

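        # We also preprocess relevant data (i.e. convert strings to ints)
        # For all of these, -1 means missing or not a valid category
        # sexes, age_cats and races are views into all_data, so the assignments
        # below also update the columns later sliced into training_data

        # convert to integer categories where 0 is female, 1 is male, -1 is anything else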
        sexes = all_data[:, 4]
        for i, sex in enumerate(sexes):
            if sexes[i] == 'Female':
                sexes[i] = 0
            elif sexes[i] == 'Male':
                sexes[i] = 1
            else:
                sexes[i] = -1

        dobs = all_data[:, 5]
        ages = all_data[:, 6]

        # convert to integer categories where < 25 = 0; 25-45 = 1; >45 = 2
        age_cats = all_data[:, 7]
        for i, age_cat in enumerate(age_cats):
            if age_cats[i] == 'Less than 25':
                age_cats[i] = 0
            elif age_cats[i] == '25 - 45':
                age_cats[i] = 1
            elif age_cats[i] == 'Greater than 45':
                age_cats[i] = 2
            else:
                age_cats[i] = -1

        races = all_data[:, 8]
        for i, race in enumerate(races):
            if races[i] == 'African-American':
                races[i] = 0
            elif races[i] == 'Asian':
                races[i] = 1
            elif races[i] == 'Caucasian':
                races[i] = 2
            elif races[i] == 'Hispanic':
                races[i] = 3
            elif races[i] == 'Native American':
                races[i] = 4
            elif races[i] == 'Other':
                races[i] = 5
            else:
                races[i] = -1

        priors_counts = all_data[:, 12]
        two_year_recids = all_data[:, 40]

        # normal recidivism - this is what we are predicting
        labels = two_year_recids.astype(int)
        # therefore make this the classification label
        labels_list = [0, 1]

        # columns 6 to 11: age, age_cat, race, juv_fel_count, juv_misd_count, juv_other_count
        training_data = all_data[:, 6:12].astype(int)
        training_data_and_labels = []

        print("training data")
        print(training_data)
        print("labels")
        print(labels)

        for i, individual in enumerate(training_data):
            data_label_tuple = individual, labels[i]
            training_data_and_labels.append(data_label_tuple)

        num_datapoints = len(training_data_and_labels)

    return training_data, labels, labels_list, training_data_and_labels, num_datapoints
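
The string-to-integer loops in `import_data` could also be written as a dictionary lookup. This is a sketch of that alternative, not what the notebook currently does; the name `age_cat_codes` is introduced here for illustration.

```python
import pandas as pd

# Same coding as in import_data: < 25 = 0; 25-45 = 1; > 45 = 2; anything else = -1
age_cat_codes = {'Less than 25': 0, '25 - 45': 1, 'Greater than 45': 2}

age_cat = pd.Series(['Greater than 45', '25 - 45', 'Less than 25', None])

# map() leaves unknown or missing values as NaN, which fillna turns into -1
encoded = age_cat.map(age_cat_codes).fillna(-1).astype(int)
print(encoded.tolist())  # [2, 1, 0, -1]
```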



In [12]:
import_data()

importing data
training data
[[69  2  5  0  0  0]
 [34  1  0  0  0  0]
 [24  0  0  0  0  1]
 ...
 [23  0  3  0  0  0]
 [28  1  2  0  0  0]
 [21  0  0  0  0  0]]
labels
[0 1 1 0 0 0 1 0 0 1 0 1 0 0 1 1 0 0 1 1 1 1 1 0 1 0 0 1 0 1 1 1 1 0 1 0 1
 1 1 0 0 0 1 0 1 1 1 0 1 0 1 0 1 0 0 1 0 0 0 0 1 1 1 1 1 1 0 1 0 0 1 0 0 1
 1 0 1 0 1 1 0 0 1 1 1 1 1 1 0 1 1 1 1 0 0 1 0 0 0 1 0 0 1 1 0 0 0 1 0 1 1
 0 1 0 1 1 1 1 0 0 1 1 0 0 1 0 1 0 0 0 0 0 1 0 1 1 1 1 1 0 1 0 0 0 1 1 1 1
 1 1 0 0 1 1 0 1 0 0 1 0 0 1 0 0 0 1 0 0 1 0 0 0 0 0 0 1 1 0 0 1 0 0 1 1 1
 1 0 1 0 1 1 1 1 1 1 0 1 0 0 1 0 0 0 1 1 1 0 0 1 1 0 0 0 1 0 0 0 0 0 1 0 1
 0 1 0 0 1 0 1 0 0 1 0 0 1 0 0 1 0 0 0 0 1 1 0 0 0 1 1 0 1 0 0 1 0 1 1 1 0
 1 1 0 0 1 1 0 0 0 0 0 0 0 1 0 1 0 1 1 1 0 0 0 0 0 0 1 0 0 1 0 1 0 1 1 1 0
 1 0 1 1 0 1 0 1 0 0 0 1 0 1 0 1 0 0 0 1 1 0 1 0 1 0 0 0 1 0 0 1 1 1 0 0 0
 1 0 0 1]


(array([[69,  2,  5,  0,  0,  0],
        [34,  1,  0,  0,  0,  0],
        [24,  0,  0,  0,  0,  1],
        ...,
        [23,  0,  3,  0,  0,  0],
        [28,  1,  2,  0,  0,  0],
        [21,  0,  0,  0,  0,  0]]),
 array([0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1,
        1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0,
        1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1,
        0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1,
        0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1,
        1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0,
        1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1,
        0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1,
        1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0,
        0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1,
        0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 

# Classification - choose model and perform predictions using bootstrapping

Here we select the classification model to use. We are currently using a selection of built-in classifiers in scikit-learn. (Currently, the parameters of the models are not optimal and are for testing purposes only. In order to select optimal parameters, a procedure such as grid search should be used.)
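
As a rough illustration of the grid search mentioned above, the sketch below tunes two random forest parameters with scikit-learn's `GridSearchCV`. The synthetic data from `make_classification` and the candidate values in `param_grid` are assumptions chosen only to keep the example small; they are not tuned for the recidivism data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data with six features (not the COMPAS data)
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

# Candidate values are illustrative only
param_grid = {'max_depth': [3, 10], 'n_estimators': [10, 30]}

# 3-fold cross-validation over every combination in the grid
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)
```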

The classification process then uses a bootstrapping procedure with the chosen model to generate predictions of recidivism classifications. Bootstrapping is a sampling-with-replacement procedure. We run it many times to generate different training and testing datasets, and hence many classification predictions for the given dataset. The training and testing datasets are called the "boot" and "out of bag" examples respectively.
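
To give a feel for the split sizes: when n rows are drawn with replacement from n rows, a given row is never drawn with probability (1 - 1/n)^n, which approaches 1/e (about 0.368) for large n. The sketch below checks this by row position on a hypothetical dataset size; the seed and `n` are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily for repeatability
n = 10_000                      # hypothetical dataset size

# draw n row positions with replacement: the "boot" sample
boot_idx = rng.integers(0, n, size=n)
# every position that was never drawn is "out of bag"
oob_idx = np.setdiff1d(np.arange(n), boot_idx)

oob_fraction = len(oob_idx) / n
print(oob_fraction, (1 - 1 / n) ** n)
```

Drawing fewer than n rows leaves a larger share of the data out of bag.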

In [0]:
from sklearn import model_selection, neighbors, svm, gaussian_process, tree, ensemble, neural_network, metrics

def define_classifiers():
    print("defining classifiers")
    # random classifiers to test
    classifier_names = ["Nearest Neighbours", "Linear SVM", "RBF SVM", "Gaussian Process",
                        "Decision Tree", "Random Forest", "Neural Network"]
    classifiers = [
        neighbors.KNeighborsClassifier(5),
        svm.SVC(kernel="linear", C=0.025),
        svm.SVC(gamma=2, C=2),
        gaussian_process.GaussianProcessClassifier(1.0 * gaussian_process.kernels.RBF(1.0), multi_class='one_vs_one'),
        tree.DecisionTreeClassifier(max_depth=10),
        ensemble.RandomForestClassifier(max_depth=10, n_estimators=100, max_features=2),
        neural_network.MLPClassifier(alpha=0.01, max_iter=1000)]

    # choose classifier
    classifier = "Random Forest"
    print("Classifer used: ", classifier)
    return classifier, classifiers, classifier_names
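
`define_classifiers` keeps the names and the estimators in two parallel lists, which must stay in the same order. One possible alternative, sketched here with a subset of the same estimators, is a single dictionary keyed by name; `classifiers_by_name` is a name introduced for this example.

```python
from sklearn import neighbors, tree, ensemble

# hypothetical alternative to the parallel lists: one dict keyed by display name
classifiers_by_name = {
    "Nearest Neighbours": neighbors.KNeighborsClassifier(5),
    "Decision Tree": tree.DecisionTreeClassifier(max_depth=10),
    "Random Forest": ensemble.RandomForestClassifier(max_depth=10, n_estimators=100, max_features=2),
}

# look up the chosen model directly by name
clf = classifiers_by_name["Random Forest"]
print(type(clf).__name__)
```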


In [21]:
define_classifiers()

defining classifiers
Classifer used:  Random Forest


('Random Forest',
 [KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                       metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                       weights='uniform'),
  SVC(C=0.025, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
      decision_function_shape='ovr', degree=3, gamma='scale', kernel='linear',
      max_iter=-1, probability=False, random_state=None, shrinking=True,
      tol=0.001, verbose=False),
  SVC(C=2, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
      decision_function_shape='ovr', degree=3, gamma=2, kernel='rbf', max_iter=-1,
      probability=False, random_state=None, shrinking=True, tol=0.001,
      verbose=False),
  GaussianProcessClassifier(copy_X_train=True, kernel=1**2 * RBF(length_scale=1),
                            max_iter_predict=100, multi_class='one_vs_one',
                            n_jobs=None, n_restarts_optimizer=0,
                            optimizer='fmin_l_bfgs

In [0]:
# this is one bootstrap sample: draw row positions with replacement, then sort them
indices = np.random.randint(0, all_data.shape[0], all_data.shape[0])
indices.sort()
# iloc accepts the whole array of positions, so no explicit loop is needed
b_data = all_data.iloc[indices]
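
A single bootstrap sample and its out-of-bag rows can also be taken with pandas alone. This is a sketch on a five-row toy frame (ages and labels copied from the first rows shown earlier); `random_state=0` is an arbitrary seed.

```python
import pandas as pd

# Five-row toy frame standing in for all_data
frame = pd.DataFrame({'age': [69, 34, 24, 23, 43],
                      'two_year_recid': [0, 1, 1, 0, 0]})

# "boot": as many rows as the frame has, drawn with replacement
boot = frame.sample(n=len(frame), replace=True, random_state=0).sort_index()

# "out of bag": rows whose index label never appears in boot
oob = frame.loc[~frame.index.isin(boot.index)]
print(sorted(set(boot.index)), list(oob.index))
```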


In [0]:
b_data

Unnamed: 0,id,name,first,last,sex,dob,age,age_cat,race,juv_fel_count,juv_misd_count,juv_other_count,priors_count,days_b_screening_arrest,c_jail_in,c_jail_out,c_case_number,c_offense_date,c_arrest_date,c_charge_degree,c_charge_desc,is_recid,r_case_number,r_charge_degree,r_days_from_arrest,r_offense_date,r_charge_desc,r_jail_in,r_jail_out,violent_recid,is_violent_recid,vr_case_number,vr_charge_degree,vr_offense_date,vr_charge_desc,in_custody,out_custody,start,end,event,two_year_recid
0,1,miguel hernandez,miguel,hernandez,Male,1947-04-18,69,Greater than 45,Other,0,0,0,0,-1.0,2013-08-13 06:03:42,2013-08-14 05:41:20,13011352CF10A,2013-08-13,,F,Aggravated Assault w/Firearm,0,,,,,,,,,0,,,,,2014-07-07,2014-07-14,0,327,0,0
1,3,kevon dixon,kevon,dixon,Male,1982-01-22,34,25 - 45,African-American,0,0,0,0,-1.0,2013-01-26 03:45:27,2013-02-05 05:36:53,13001275CF10A,2013-01-26,,F,Felony Battery w/Prior Convict,1,13009779CF10A,(F3),,2013-07-05,Felony Battery (Dom Strang),,,,1,13009779CF10A,(F3),2013-07-05,Felony Battery (Dom Strang),2013-01-26,2013-02-05,9,159,1,1
1,3,kevon dixon,kevon,dixon,Male,1982-01-22,34,25 - 45,African-American,0,0,0,0,-1.0,2013-01-26 03:45:27,2013-02-05 05:36:53,13001275CF10A,2013-01-26,,F,Felony Battery w/Prior Convict,1,13009779CF10A,(F3),,2013-07-05,Felony Battery (Dom Strang),,,,1,13009779CF10A,(F3),2013-07-05,Felony Battery (Dom Strang),2013-01-26,2013-02-05,9,159,1,1
2,4,ed philo,ed,philo,Male,1991-05-14,24,Less than 25,African-American,0,0,1,4,-1.0,2013-04-13 04:58:34,2013-04-14 07:02:04,13005330CF10A,2013-04-13,,F,Possession of Cocaine,1,13011511MM10A,(M1),0.0,2013-06-16,Driving Under The Influence,2013-06-16,2013-06-16,,0,,,,,2013-06-16,2013-06-16,0,63,0,1
2,4,ed philo,ed,philo,Male,1991-05-14,24,Less than 25,African-American,0,0,1,4,-1.0,2013-04-13 04:58:34,2013-04-14 07:02:04,13005330CF10A,2013-04-13,,F,Possession of Cocaine,1,13011511MM10A,(M1),0.0,2013-06-16,Driving Under The Influence,2013-06-16,2013-06-16,,0,,,,,2013-06-16,2013-06-16,0,63,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
330,493,michael peter,michael,peter,Male,1971-01-06,45,Greater than 45,African-American,0,0,0,1,,,,13003680CF10A,2013-03-12,,F,Crim Use of Personal ID Info,0,,,,,,,,,0,,,,,,,0,1115,0,0
331,494,carlos vasquez,carlos,vasquez,Male,1995-01-17,21,Less than 25,Hispanic,0,1,0,3,-22.0,2013-07-11 04:30:33,2013-08-02 06:07:24,13009721CF10A,2013-07-11,,F,Grand Theft in the 3rd Degree,1,15011166MM10A,(M1),,2015-10-24,Possess Cannabis/20 Grams Or Less,,,,0,,,,,2014-12-16,2015-01-09,0,501,0,0
333,496,jodie sena,jodie,sena,Female,1991-01-26,25,25 - 45,Caucasian,0,0,0,2,-2.0,2013-01-09 08:33:15,2013-01-10 09:40:18,13000518MM10A,2013-01-09,,M,Petit Theft,1,13007257CF10A,(F3),0.0,2013-05-21,Possession of Cocaine,2013-05-21,2013-07-13,,0,,,,,2013-03-06,2013-03-15,0,54,0,1
334,497,helen carrillo,helen,carrillo,Female,1992-06-09,23,Less than 25,Hispanic,0,0,0,1,0.0,2013-04-04 01:13:56,2013-04-04 11:28:44,13006418MM10A,2013-04-03,,M,Operating W/O Valid License,1,15022411TC10A,(M2),0.0,2015-07-07,Operating W/O Valid License,2015-07-07,2015-07-17,,0,,,,,2015-07-07,2015-07-17,0,824,1,0


In [0]:
from sklearn.utils import resample

def bootstrap(clf, training_data_and_labels, num_datapoints, random_state=None):

    # Use the full dataset, not just the training data: boot is used to fit the
    # model and the out-of-bag samples are used for testing.
    # random_state defaults to None so that each call draws a different sample;
    # a fixed seed would make every bootstrap iteration draw identical rows.
    boot = resample(training_data_and_labels, replace=True,
                    n_samples=round(num_datapoints*0.5), random_state=random_state)

    # An example is out of bag (oob) if its feature vector (the first element of
    # the tuple) does not equal the feature vector of any element in boot.
    # The comparison is on feature values, so an individual whose features match
    # a drawn individual is excluded from oob as well.
    oob = [data_and_label for data_and_label in training_data_and_labels
           if not any(np.array_equal(data_and_label[0], element[0]) for element in boot)]

    # model is fit on the drawn sample and evaluated on the out-of-bag sample
    X_train = [data_and_label[0] for data_and_label in boot]
    y_train = [data_and_label[1] for data_and_label in boot]
    X_test = [data_and_label[0] for data_and_label in oob]
    y_true = [data_and_label[1] for data_and_label in oob]

    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)

    return clf, y_pred, y_true
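
The function above returns predictions without recording which datapoints they belong to. A possible variant, sketched here under the assumption that the features and labels are held in NumPy arrays `X` and `y`, resamples row positions instead and returns the out-of-bag positions alongside the predictions. `bootstrap_by_index` is a hypothetical name, and the data is synthetic.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

def bootstrap_by_index(clf, X, y, random_state=None):
    """Hypothetical variant of bootstrap() that tracks row positions."""
    positions = np.arange(len(X))
    # draw row positions with replacement
    boot_idx = resample(positions, replace=True, n_samples=len(X),
                        random_state=random_state)
    # positions that were never drawn are out of bag
    oob_idx = np.setdiff1d(positions, boot_idx)
    clf.fit(X[boot_idx], y[boot_idx])
    return oob_idx, clf.predict(X[oob_idx])

# synthetic data purely to exercise the function
rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(40, 3))
y = rng.integers(0, 2, size=40)
oob_idx, y_pred = bootstrap_by_index(DecisionTreeClassifier(max_depth=3), X, y, random_state=0)
print(len(oob_idx), len(y_pred))
```

Keeping the positions makes it possible to group predictions by datapoint across many bootstrap runs.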


NameError: ignored

In [0]:
def classify(training_data_and_labels, num_datapoints):

    bootstrapping = True
    classifier, classifiers, classifier_names = define_classifiers()
    # look up the chosen classifier by its position in the list of names
    clf = classifiers[classifier_names.index(classifier)]
    predictions = []
    true_labels = []

    if bootstrapping:
        num_bootstraps = 50
        # run exactly num_bootstraps rounds; each fits on a boot sample and
        # predicts the labels of its out-of-bag examples
        for _ in range(num_bootstraps):
            clf, y_pred, y_true = bootstrap(clf, training_data_and_labels, num_datapoints)
            predictions.append(y_pred)
            true_labels.append(y_true)
    return predictions, true_labels

In [26]:
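# classify needs the data and the number of datapoints returned by import_data
_, _, _, training_data_and_labels, num_datapoints = import_data()
predictions, true_labels = classify(training_data_and_labels, num_datapoints)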

# Compute bias/variance errors

Using all these bootstrap predictions, we calculate the average misclassification error, as described in this paper: 
http://www.cems.uwe.ac.uk/~irjohnso/coursenotes/uqc832/tr-bias.pdf

We can then decompose the error into the errors due to bias, and the errors due to variance, in order to study the behaviour of the model and the bias/variance tradeoff. This decomposition for classification is described in the following paper:
https://homes.cs.washington.edu/~pedrod/bvd.pdf

We define bias and variance in this context as in:
 http://www.cems.uwe.ac.uk/~irjohnso/coursenotes/uqc832/tr-bias.pdf

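Currently, the bias error is always `0.0`, because every average misclassification error that `compute_bias_variance` computes is `<0.5`, so each case counts as classified correctly on average and the errors are therefore all attributed to variance. Note that the function below averages over the out-of-bag predictions within each bootstrap run, rather than over the runs for each datapoint. This is not the result wanted or expected.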

Perhaps using the mean squared error (as in https://pdfs.semanticscholar.org/9253/f3e13bca7e845e60394d85ddaec0d4cfc6d6.pdf) instead of the misclassification error would give different results.
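
For squared error the decomposition is an exact identity: for one datapoint with true label y and predictions p over the bootstrap runs, the mean squared error equals (mean(p) - y)^2 plus the variance of p. The sketch below checks this numerically. The predicted probabilities are made-up values for a single datapoint, not output of the notebook's models.

```python
import numpy as np

# Hypothetical predicted probabilities of class 1 for one datapoint,
# one value per bootstrap run
probs = np.array([0.2, 0.6, 0.7, 0.9])
y = 1  # true label of that datapoint

mean_prob = probs.mean()
bias_sq = (mean_prob - y) ** 2     # squared bias of the average prediction
variance = probs.var()             # spread of the predictions around their mean
mse = np.mean((probs - y) ** 2)    # mean squared error over the runs

print(mse, bias_sq + variance)
```

Unlike the 0/1 bias used with misclassification error, the squared bias here is non-zero whenever the average prediction differs from the label, so it would not collapse to `0.0` in the same way.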

In [31]:
def compute_bias_variance(predictions, true_labels):
    # calculate bias and variance from the bootstrap predictions,
    # using misclassification (0-1) loss
    biases = []
    variances = []
    avg_errors = []

    # Each entry of predictions holds the predicted labels of one bootstrap run
    # on its out-of-bag examples; true_labels holds the matching true labels.
    # avg_error is therefore the misclassification rate of one run, not the
    # error of one datapoint averaged over runs.
    for pred_labels, labels in zip(predictions, true_labels):
        total_misclassified = sum(
            1 for pred_label, true_label in zip(pred_labels, labels)
            if pred_label != true_label)
        avg_error = total_misclassified / len(pred_labels)
        avg_errors.append(avg_error)

    # print(avg_errors)

    # bias is 1 when the average error exceeds 0.5 and 0 otherwise;
    # variance is the remaining error (average error minus bias)
    for avg_error in avg_errors:
        if avg_error <= 0.5:
            biases.append(0)
            variances.append(avg_error)
        else:
            biases.append(1)
            variances.append(avg_error - 1)

    avg_bias = (1/len(biases)) * sum(biases)
    avg_var = abs((sum(avg_errors)/(len(avg_errors))) - avg_bias)

    print("Average bias:")
    print(avg_bias)
    print("Average variance:")
    print(avg_var)

    # print(biases)
    # print(variances)

    return avg_bias, avg_var
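
For comparison, here is a sketch of the same bias and variance definitions applied per datapoint rather than per bootstrap run. It assumes each run also records which row positions were out of bag, which the notebook's `bootstrap` does not currently return. The function name and the three toy runs are made up for illustration.

```python
import numpy as np

def per_datapoint_bias_variance(oob_indices, predictions, labels):
    """Hypothetical per-datapoint version using the 0/1 bias defined above."""
    labels = np.asarray(labels)
    wrong = np.zeros(len(labels))   # misclassifications per datapoint
    seen = np.zeros(len(labels))    # number of runs in which it was out of bag
    for idx, preds in zip(oob_indices, predictions):
        idx = np.asarray(idx)       # positions must be unique within one run
        wrong[idx] += (np.asarray(preds) != labels[idx])
        seen[idx] += 1
    evaluated = seen > 0            # skip datapoints that were never out of bag
    avg_error = wrong[evaluated] / seen[evaluated]
    bias = (avg_error > 0.5).astype(int)
    variance = avg_error - bias
    return avg_error, bias, variance

# three toy runs over three datapoints with true labels [0, 1, 1]
avg_error, bias, variance = per_datapoint_bias_variance(
    oob_indices=[[0, 1], [1, 2], [0, 1]],
    predictions=[[0, 0], [0, 1], [1, 1]],
    labels=[0, 1, 1])
print(avg_error, bias, variance)
```

In this toy example the second datapoint is misclassified in two of its three runs, so its bias is 1 even though every individual run has an error rate of at most 0.5.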

SyntaxError: ignored

# Plots

Plot the bias/variance errors for different classifiers; only one classifier at a time for now.

In [0]:
import matplotlib.pyplot as plt

def plot_bias_variance(bias, variance):
    plt.scatter(bias, variance)
    plt.title('bias vs variance errors')
    plt.xlabel('bias')
    plt.ylabel('variance')
    plt.show()
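
To compare several classifiers on one figure, the plot could take a dictionary of results instead of a single pair. This is only a sketch: `plot_bias_variance_by_classifier` is a hypothetical helper, and the numbers passed to it are placeholders to exercise the function, not results from this notebook.

```python
import matplotlib.pyplot as plt

def plot_bias_variance_by_classifier(results):
    """results maps a classifier name to its (bias, variance) pair."""
    fig, ax = plt.subplots()
    for name, (bias, variance) in results.items():
        # one labelled point per classifier
        ax.scatter(bias, variance, label=name)
    ax.set_title('bias vs variance errors')
    ax.set_xlabel('bias')
    ax.set_ylabel('variance')
    ax.legend()
    return fig, ax

# placeholder values only, not measured results
fig, ax = plot_bias_variance_by_classifier({
    'Classifier A': (0.0, 0.3),
    'Classifier B': (0.1, 0.2),
})
```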
                                                                             

# Main method (execute code)

Main method to run the system


In [30]:
def main():
    training_data, labels, labels_list, training_data_and_labels, num_datapoints = import_data()
    predictions, true_labels = classify(training_data_and_labels, num_datapoints)
    bias, variance = compute_bias_variance(predictions, true_labels)
    plot_bias_variance(bias, variance)

main()

importing data
training data
[[69  2  5  0  0  0]
 [34  1  0  0  0  0]
 [24  0  0  0  0  1]
 ...
 [23  0  3  0  0  0]
 [28  1  2  0  0  0]
 [21  0  0  0  0  0]]
labels
[0 1 1 0 0 0 1 0 0 1 0 1 0 0 1 1 0 0 1 1 1 1 1 0 1 0 0 1 0 1 1 1 1 0 1 0 1
 1 1 0 0 0 1 0 1 1 1 0 1 0 1 0 1 0 0 1 0 0 0 0 1 1 1 1 1 1 0 1 0 0 1 0 0 1
 1 0 1 0 1 1 0 0 1 1 1 1 1 1 0 1 1 1 1 0 0 1 0 0 0 1 0 0 1 1 0 0 0 1 0 1 1
 0 1 0 1 1 1 1 0 0 1 1 0 0 1 0 1 0 0 0 0 0 1 0 1 1 1 1 1 0 1 0 0 0 1 1 1 1
 1 1 0 0 1 1 0 1 0 0 1 0 0 1 0 0 0 1 0 0 1 0 0 0 0 0 0 1 1 0 0 1 0 0 1 1 1
 1 0 1 0 1 1 1 1 1 1 0 1 0 0 1 0 0 0 1 1 1 0 0 1 1 0 0 0 1 0 0 0 0 0 1 0 1
 0 1 0 0 1 0 1 0 0 1 0 0 1 0 0 1 0 0 0 0 1 1 0 0 0 1 1 0 1 0 0 1 0 1 1 1 0
 1 1 0 0 1 1 0 0 0 0 0 0 0 1 0 1 0 1 1 1 0 0 0 0 0 0 1 0 0 1 0 1 0 1 1 1 0
 1 0 1 1 0 1 0 1 0 0 0 1 0 1 0 1 0 0 0 1 1 0 1 0 1 0 0 0 1 0 0 1 1 1 0 0 0
 1 0 0 1]
defining classifiers
Classifer used:  Random Forest


NameError: ignored