# Bounty Hunting Project Notebook

< Insert group member names here and change the name of the file so it has your group ID in it.>

## Set Up

First, we load in some helper files.

In [1]:
import sys
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
sys.path.append('dontlook')
from dontlook import bountyHuntData
from dontlook import bountyHuntWrapper
import pandas as pd
import numpy as np

Next, we load in the data. You should use `train_x` and `train_y` to train your models. The second set of data (`validation_x` and `validation_y`) is for testing your models, to ensure that you aren't overfitting. It is also what will be passed to the updater in order to determine if a proposed update should be accepted and if repairs are needed. Since you have access to this data, you could overfit to it and get a bunch of updates accepted. However, a) we'll be able to tell you did this and b) your updates will fail on the holdout set that only we have access to, so doing this is not in your best interest.

In [2]:
[train_x, train_y, validation_x, validation_y] = bountyHuntData.get_data()

# Preprocessing
Some features contain nominal data, and their underlying distributions are highly skewed. Preprocess those features before moving forward.

In [3]:
'''Execute this cell if data preprocessing is included. Binary data with value set {1, 2} will be mapped to {0, 1}.'''
# AGEP(age) - Numerical data, no preprocessing is needed
# SCHL(Educational attainment) - Categorical data, but it's okay to preserve the natural order here

# MAR(Marital status) - Categorical data, 85% of data is 1 or 5.
# TODO: Need one hot encoding

# SEX(Sex) - binary data. Return 1 if Male, otherwise return 0.
def preprocess_SEX(x):
    if(x == 1):
        return 1
    return 0
train_x['SEX'] = train_x['SEX'].apply(preprocess_SEX)

# DIS(Disability recode) - binary data. Return 1 if with a disability, otherwise return 0.
def preprocess_DIS(x):
    if(x == 1):
        return 1
    return 0
train_x['DIS'] = train_x['DIS'].apply(preprocess_DIS)

# ESP(Employment status of parents) - 82% is 0, 8.3% is 1, all the other data ranges from 2 - 8
# TODO: Need one hot encoding
def preprocess_ESP(x):
    if(x != 0 and x != 1):
        return 2
    return x
train_x['ESP'] = train_x['ESP'].apply(preprocess_ESP)

# MIG(Mobility status) - 88.5% is 1, 9.9% is 3. The remaining values make up under 2% of the data set,
# so they are merged with category 3, which makes this a binary feature.
# Return 1 if the original value is 1, otherwise return 0.
def preprocess_MIG(x):
    if(x == 1):
        return 1
    return 0
train_x['MIG'] = train_x['MIG'].apply(preprocess_MIG)

# CIT(Citizenship status) - 78% data is 1, 12.2% data is 4. Other data will be merged.
# return 1 -> 0, 4 -> 1, others -> 2
# TODO: Need one hot encoding
def preprocess_CIT(x):
    if(x == 1):
        return 0
    elif(x == 4):
        return 1
    return 2
train_x['CIT'] = train_x['CIT'].apply(preprocess_CIT)

# MIL(Military service) - 76.7% data is 4, 17.8% data is 0. Other data will be merged.
# return 0 -> 0, 4 -> 1, others -> 2
# TODO: Need one hot encoding
def preprocess_MIL(x):
    if(x == 0):
        return 0
    elif(x == 4):
        return 1
    return 2
train_x['MIL'] = train_x['MIL'].apply(preprocess_MIL)

# ANC(Ancestry recode) - 1.6% data is 3, will be merged into 4
# TODO: Need one hot encoding
def preprocess_ANC(x):
    if (x == 3):
        return 4
    return x
train_x['ANC'] = train_x['ANC'].apply(preprocess_ANC)

# NATIVITY(Nativity) - binary data. Return 0 if native (1), otherwise return 1.
def preprocess_NATIVITY(x):
    if(x == 1):
        return 0
    return 1
train_x['NATIVITY'] = train_x['NATIVITY'].apply(preprocess_NATIVITY)

# RELP(Relationship) - 38.3% is 0, 24.5% is 2, 18.5% is 1, all the other values are below 3.5%. Merge them into
# category 3
# TODO: Need one hot encoding
def preprocess_RELP(x):
    if(x not in [0, 1, 2]):
        return 3
    return x
train_x['RELP'] = train_x['RELP'].apply(preprocess_RELP)

# DEAR, DEYE - binary data. Return 1 if Yes(1), otherwise 0.
def preprocess_DEAR(x):
    if (x == 1):
        return 1
    return 0

def preprocess_DEYE(x):
    return preprocess_DEAR(x)
train_x['DEAR'] = train_x['DEAR'].apply(preprocess_DEAR)
train_x['DEYE'] = train_x['DEYE'].apply(preprocess_DEYE)


# DREM(Cognitive difficulty) - No preprocess needed
# TODO: Need one hot encoding

# RAC1P(Recoded detailed race code) - category 1, 2, 6, 8 includes 95% of the data. So all the other data will be
# categorized into one category
# TODO: Need one hot encoding
def preprocess_RAC1P(x):
    if(x in [9, 3, 5, 7, 4]):
        return 0
    return x
train_x['RAC1P'] = train_x['RAC1P'].apply(preprocess_RAC1P)
train_x = pd.DataFrame(train_x, dtype=int)
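
The recodes above are applied to `train_x` only, one Python-level `apply` per column. Below is a minimal sketch of the same rules written as a single vectorized helper that could be applied to any frame with these columns. It assumes the raw codes described in the comments above; `recode_features` and the toy frame are hypothetical and not part of the provided codebase.

```python
import pandas as pd

def recode_features(df):
    """Return a recoded copy of df (hypothetical helper mirroring the rules above)."""
    out = df.copy()
    # Binary-style features: 1 -> 1, anything else -> 0.
    for col in ["SEX", "DIS", "DEAR", "DEYE", "MIG"]:
        out[col] = (out[col] == 1).astype(int)
    # NATIVITY: 1 (native) -> 0, anything else -> 1.
    out["NATIVITY"] = (out["NATIVITY"] != 1).astype(int)
    # ESP: keep 0 and 1, merge everything else into 2.
    out["ESP"] = out["ESP"].where(out["ESP"].isin([0, 1]), 2)
    # CIT: 1 -> 0, 4 -> 1, others -> 2.
    out["CIT"] = out["CIT"].map({1: 0, 4: 1}).fillna(2).astype(int)
    # MIL: 0 -> 0, 4 -> 1, others -> 2.
    out["MIL"] = out["MIL"].map({0: 0, 4: 1}).fillna(2).astype(int)
    # ANC: merge 3 into 4.
    out["ANC"] = out["ANC"].replace(3, 4)
    # RELP: keep 0, 1, 2; merge everything else into 3.
    out["RELP"] = out["RELP"].where(out["RELP"].isin([0, 1, 2]), 3)
    # RAC1P: merge the rare codes into 0.
    out["RAC1P"] = out["RAC1P"].where(~out["RAC1P"].isin([3, 4, 5, 7, 9]), 0)
    return out

# Toy frame with made-up rows, only to exercise the helper.
toy = pd.DataFrame({
    "SEX": [1, 2], "DIS": [2, 1], "DEAR": [2, 2], "DEYE": [1, 2], "MIG": [1, 3],
    "NATIVITY": [1, 2], "ESP": [0, 5], "CIT": [1, 5], "MIL": [4, 2],
    "ANC": [3, 1], "RELP": [0, 7], "RAC1P": [1, 9],
})
recoded = recode_features(toy)
```

Because the helper returns a copy, the same function could be called on the training and validation frames without one call affecting the other.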


In [4]:
from sklearn.preprocessing import OneHotEncoder

features_to_encode = ['MAR', 'ESP', 'CIT', 'MIL', 'ANC', 'RELP', 'DREM', 'RAC1P']
encoder = OneHotEncoder(handle_unknown='ignore')
train_x_encoded = pd.DataFrame(encoder.fit_transform(train_x[features_to_encode]).toarray(),
                               columns=encoder.get_feature_names_out(features_to_encode),
                               index=train_x.index)  # reuse the index so join() aligns rows correctly
train_x = train_x.drop(columns=features_to_encode)
train_x = train_x.join(train_x_encoded)
train_x = pd.DataFrame(train_x, dtype = int)
train_x

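|  | AGEP | SCHL | DIS | MIG | NATIVITY | DEAR | DEYE | SEX | MAR_1 | MAR_2 | ... | RELP_2 | RELP_3 | DREM_0 | DREM_1 | DREM_2 | RAC1P_0 | RAC1P_1 | RAC1P_2 | RAC1P_6 | RAC1P_8 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 37 | 21 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| 1 | 31 | 21 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| 2 | 5 | 3 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| 3 | 74 | 20 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 4 | 12 | 10 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 137871 | 80 | 15 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | ... | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| 137872 | 81 | 16 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| 137873 | 47 | 22 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| 137874 | 27 | 9 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |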


In [9]:
train_x.corr().style.background_gradient(cmap='coolwarm')

|  | AGEP | SCHL | DIS | MIG | NATIVITY | DEAR | DEYE | SEX | MAR_1 | MAR_2 | MAR_3 | MAR_4 | MAR_5 | ESP_0 | ESP_1 | ESP_2 | CIT_0 | CIT_1 | CIT_2 | MIL_0 | MIL_1 | MIL_2 | ANC_1 | ANC_2 | ANC_4 | RELP_0 | RELP_1 | RELP_2 | RELP_3 | DREM_0 | DREM_1 | DREM_2 | RAC1P_0 | RAC1P_1 | RAC1P_2 | RAC1P_6 | RAC1P_8 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| AGEP | 1.0 | 0.492387 | 0.309317 | 0.170233 | 0.143274 | 0.224795 | 0.122407 | -0.049627 | 0.420713 | 0.343093 | 0.195542 | 0.059027 | -0.69011 | 0.656652 | -0.423288 | -0.457985 | -0.14876 | 0.181582 | 0.006749 | -0.663654 | 0.479216 | 0.230473 | 0.131009 | -0.060275 | -0.097008 | 0.432297 | 0.24145 | -0.631811 | -0.082436 | -0.38131 | 0.105129 | 0.193866 | -0.092893 | 0.121866 | -0.030338 | -0.041215 | -0.076649 |
| SCHL | 0.492387 | 1.0 | -0.006166 | 0.051445 | 0.059881 | 0.022178 | -0.008433 | -0.032774 | 0.345681 | 0.027335 | 0.102242 | 0.021811 | -0.413979 | 0.727264 | -0.468187 | -0.507812 | -0.051685 | 0.08279 | -0.019536 | -0.757229 | 0.637566 | 0.092689 | 0.050232 | 0.035948 | -0.103563 | 0.343894 | 0.193049 | -0.498691 | -0.070865 | -0.556219 | -0.05487 | 0.436961 | -0.076516 | 0.111547 | -0.03878 | -0.007847 | -0.097699 |
| DIS | 0.309317 | -0.006166 | 1.0 | 0.013266 | -0.0281 | 0.471077 | 0.380455 | -0.008487 | -0.042972 | 0.213182 | 0.069689 | 0.036577 | -0.100411 | 0.131516 | -0.095408 | -0.081774 | 0.011737 | 0.005758 | -0.022838 | -0.126357 | 0.056923 | 0.108252 | 0.021651 | -0.033609 | 0.010542 | 0.061678 | -0.035451 | -0.131465 | 0.103319 | -0.086688 | 0.593793 | -0.377874 | -0.007777 | 0.026243 | 0.026254 | -0.060069 | -0.010299 |
| MIG | 0.170233 | 0.051445 | 0.013266 | 1.0 | 0.000609 | 0.025548 | 0.008752 | -0.008067 | 0.1221 | 0.035869 | 0.014332 | -0.011937 | -0.141427 | 0.020224 | 0.005538 | -0.031495 | 0.003268 | 0.049782 | -0.059867 | -0.038155 | 0.018625 | 0.029994 | 0.026495 | 0.009671 | -0.044184 | 0.066152 | 0.080565 | 0.017256 | -0.181611 | -0.128134 | -0.015155 | 0.102523 | -0.021552 | 0.040661 | -0.014578 | -0.027482 | -0.009723 |
| NATIVITY | 0.143274 | 0.059881 | -0.0281 | 0.000609 | 1.0 | -0.024357 | -0.000643 | -0.021294 | 0.162878 | 0.0147 | 0.009904 | 0.048079 | -0.185299 | 0.180302 | -0.123845 | -0.118619 | -0.933828 | 0.75577 | 0.468943 | -0.179191 | 0.206017 | -0.0815 | 0.257157 | -0.240325 | -0.053307 | 0.044948 | 0.083026 | -0.179602 | 0.059399 | -0.100479 | -0.03376 | 0.096593 | 0.014446 | -0.381631 | 0.088015 | 0.383419 | 0.154415 |
| DEAR | 0.224795 | 0.022178 | 0.471077 | 0.025548 | -0.024357 | 1.0 | 0.213517 | 0.027127 | 0.025986 | 0.151827 | 0.022001 | 0.007602 | -0.107988 | 0.079268 | -0.052835 | -0.05366 | 0.01895 | -0.004435 | -0.021623 | -0.078504 | -0.005067 | 0.143097 | 0.010354 | -0.009096 | -0.002799 | 0.071266 | -0.000644 | -0.089512 | 0.010496 | -0.039421 | 0.162808 | -0.092454 | -0.016212 | 0.055244 | -0.028264 | -0.029755 | -0.020619 |
| DEYE | 0.122407 | -0.008433 | 0.380455 | 0.008752 | -0.000643 | 0.213517 | 1.0 | -0.014189 | -0.025057 | 0.094514 | 0.034962 | 0.016754 | -0.040883 | 0.050953 | -0.036284 | -0.032318 | -0.010243 | 0.008371 | 0.005053 | -0.050353 | 0.031209 | 0.027147 | 0.013652 | -0.0237 | 0.009467 | 0.026098 | -0.019931 | -0.052275 | 0.044926 | -0.030286 | 0.197827 | -0.124891 | 0.00362 | -0.007755 | 0.018771 | -0.020243 | 0.010571 |
| SEX | -0.049627 | -0.032774 | -0.008487 | -0.008067 | -0.021294 | 0.027127 | -0.014189 | 1.0 | 0.033317 | -0.125404 | -0.045902 | -0.026529 | 0.055025 | -0.029632 | 0.019804 | 0.02001 | 0.025159 | -0.027205 | -0.005034 | 0.029103 | -0.137629 | 0.208629 | 0.003366 | -0.008798 | 0.005654 | 0.012492 | -0.083818 | 0.048304 | 0.014546 | 0.009138 | 0.00941 | -0.013479 | -0.005496 | 0.016629 | -0.018609 | -0.002772 | 0.00118 |
| MAR_1 | 0.420713 | 0.345681 | -0.042972 | 0.1221 | 0.162878 | 0.025986 | -0.025057 | 0.033317 | 1.0 | -0.196903 | -0.237057 | -0.108261 | -0.745299 | 0.387695 | -0.2497 | -0.270601 | -0.149654 | 0.150463 | 0.042562 | -0.385883 | 0.293419 | 0.106291 | 0.081497 | -0.022101 | -0.077645 | 0.156469 | 0.573848 | -0.447783 | -0.272139 | -0.186864 | -0.086837 | 0.197444 | -0.057491 | 0.098132 | -0.116528 | 0.054645 | -0.050922 |
| MAR_2 | 0.343093 | 0.027335 | 0.213182 | 0.035869 | 0.0147 | 0.151827 | 0.094514 | -0.125404 | -0.196903 | 1.0 | -0.067892 | -0.031005 | -0.213448 | 0.111033 | -0.071512 | -0.077498 | -0.022048 | 0.034376 | -0.007289 | -0.110746 | 0.07472 | 0.048304 | 0.028094 | -0.028625 | -0.003161 | 0.172876 | -0.112992 | -0.132465 | 0.042948 | -0.053517 | 0.096174 | -0.033074 | -0.023176 | 0.042393 | -0.002537 | -0.027224 | -0.029508 |


The model that you'll be building off of is a decision stump, i.e. a very stupid decision list with only one node. **Warning: do not rerun the next code block unless you want to completely restart building your PDL, as it will re-initialize it to just the decision stump!**

In [None]:
initial_model = DecisionTreeClassifier(max_depth = 1, random_state=0)
initial_model.fit(train_x, train_y)
f = bountyHuntWrapper.build_initial_pdl(initial_model, train_x, train_y, validation_x, validation_y)

# Bounty Hunting

Here's where the bulk of the work you'll be doing will live. Your job is to generate groups g such that there is some h that does better than the current model f on that group. Here, we generate an example group function, which identifies African American individuals.

In [None]:
# def g1(x):
#     """
#     Given an x, g should return either 0 or 1.
#     :param x: input vector
#     :return: 0 or 1; 1 if x belongs to group, 0 otherwise
#     """

#     # here is how to make a function g that returns 1 for all African American individuals
#     if x['RAC1P'] == 2:
#         return 1
#     else:
#         return 0

You might also imagine making a group function that learns, in an adaptive way, which regions the current algorithm performs poorly on, instead of just guessing ad hoc that it will do poorly on a particular subgroup. To generate such a g, you will need a constructor that takes as input the current model and the training data and outputs a function g. A template for doing this is provided below. The example version returns a very silly function g: it looks at the predictions the current model makes and returns a group function where the group it has learned is all the points that the PDL labels as a 1. It completely ignores the true labels (train_y), so it is probably not a very good group function.

In [None]:
# Don't execute this cell unless you are trying to build complicated groups

# import numpy as np


# def g_(f,train_x, train_y):
#     # f is the current PDL
#     preds = train_x.apply(f.predict, axis=1)
#     xs = train_x[preds == 1]
#     ys = train_y[preds == 1]
#     dt = DecisionTreeClassifier(max_depth = 1, random_state=0)
#     dt.fit(xs, ys)
#     def g(x):
#         # g should take as input a SINGLE x and return 0 or 1 for it.
#         # if we call dt.predict on x it will break because the dimensions of x are wrong, so we have to reshape it and reshape the output.
#         # this is not particularly efficient, so if you have better ways of doing this go for it. :)
#         y = dt.predict(np.array(x).reshape(1,-1))
#         return y[0]
#     return g

# # if you wanted to build a particular g using the above, you could use the following line.
# g = g_(f, train_x, train_y)
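
The text above notes that a group built only from the model's predictions is unlikely to be useful. One possible alternative, sketched here under assumptions, is to train a shallow tree to predict where the current model is wrong and use that tree's output as the group. `StubModel` and `make_error_group` are hypothetical; the stub only stands in for a PDL whose `predict` takes a single row, as described above.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

class StubModel:
    """Hypothetical stand-in for the PDL: predicts 1 for every row."""
    def predict(self, x):
        return 1

def make_error_group(f, train_x, train_y, max_depth=2):
    """Build g that flags rows resembling those the current model f gets wrong."""
    preds = train_x.apply(f.predict, axis=1)
    mistakes = (preds != train_y).astype(int)  # 1 where f is wrong
    dt = DecisionTreeClassifier(max_depth=max_depth, random_state=0)
    dt.fit(train_x, mistakes)
    columns = list(train_x.columns)

    def g(x):
        # g takes a SINGLE row; wrap it in a one-row frame for scikit-learn.
        row = pd.DataFrame([x[columns].to_numpy()], columns=columns)
        return int(dt.predict(row)[0])
    return g

# Toy data: the stub is wrong exactly on the three youngest rows.
toy_x = pd.DataFrame({"AGEP": [10, 12, 15, 40, 50, 60], "SEX": [0, 1, 0, 1, 0, 1]})
toy_y = pd.Series([0, 0, 0, 1, 1, 1])
g = make_error_group(StubModel(), toy_x, toy_y)
membership = toy_x.apply(g, axis=1)
```

A group learned this way is fitted to mistakes on the training data, so it can overfit; whether it helps is still decided by the validation checks.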

In the following cell(s), generate group functions that you think will make improvements, and then try to run the updates as explained in the subsequent section. In the final version of your code that you turn in, the groups that you generated and their corresponding models h, and the order in which you did updates, should be obvious and re-generating your final PDL should be completely reproducible just by running the code blocks in this notebook.

In [None]:
# # Add your code defining groups (and possibly hs) here.
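
If the preprocessing cells above were run, categorical columns such as `MAR` and `RAC1P` no longer exist in `train_x`; the output shown earlier lists indicator columns such as `MAR_1` and `RAC1P_2` instead. The sketch below shows what group functions against those encoded columns could look like. The function names and the toy frame are hypothetical.

```python
import pandas as pd

def g_not_mar1(x):
    """1 if the row's marital status is anything other than code 1."""
    return int(x["MAR_1"] == 0)

def g_rac1p2_older(x):
    """1 if the RAC1P code is 2 and age is above 60."""
    return int(x["RAC1P_2"] == 1 and x["AGEP"] > 60)

# Toy frame using a few of the encoded column names from the output above.
toy = pd.DataFrame({
    "AGEP": [37, 74, 45],
    "MAR_1": [0, 1, 0],
    "RAC1P_2": [0, 1, 1],
})

# Group sizes: worth checking before running an update, since an empty group cannot pass.
sizes = {g.__name__: int(toy.apply(g, axis=1).sum()) for g in (g_not_mar1, g_rac1p2_older)}
```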

Once you've found a promising group g, you can run the following updater code. Here, we define two different update functions. The first, `simple_updater`, only requires that you find some group g that you think f might do poorly on. It then automatically trains a decision tree of depth 10 on the training data restricted to your g, and it passes that model and g along to the updater.

You might want to do something a bit fancier than a decision tree to make your model, in which case you can run the second updater, which takes as input a group g and model h, and then updates f accordingly.

Every time you run the update function, it will tell you if your (g,h) passed the validation checks, i.e. if a) your group existed in the validation data and b) it made an improvement compared to f. If it did pass the validation checks, then the model f is updated to include your g and h. **Note that this means that as you run updates, it will be increasingly difficult to find groups that make improvements.**
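
To make the two checks concrete, here is a rough sketch of the logic as described above. It is not the actual `bountyHuntWrapper.run_checks` implementation: `group_error` and `passes_checks` are hypothetical, models are plain callables on a single row, and "improvement" is assumed to mean strictly lower zero-one error on the group.

```python
def group_error(predict, g, xs, ys):
    """Zero-one error of `predict` on rows where g(x) == 1, or None if the group is empty."""
    members = [(x, y) for x, y in zip(xs, ys) if g(x) == 1]
    if not members:
        return None
    wrong = sum(1 for x, y in members if predict(x) != y)
    return wrong / len(members)

def passes_checks(f_predict, h_predict, g, xs, ys):
    """True if the group is non-empty and h has strictly lower error than f on it."""
    err_f = group_error(f_predict, g, xs, ys)
    err_h = group_error(h_predict, g, xs, ys)
    return err_f is not None and err_h < err_f

# Toy validation data: rows are dicts so x['AGEP'] works as in the group functions.
val_x = [{"AGEP": 10}, {"AGEP": 14}, {"AGEP": 40}, {"AGEP": 70}]
val_y = [0, 0, 1, 0]

f_predict = lambda x: 1                         # current model: always predicts 1
h_predict = lambda x: 0                         # candidate model for the group
g_young = lambda x: 1 if x["AGEP"] < 16 else 0  # non-empty group
g_empty = lambda x: 1 if x["AGEP"] > 100 else 0 # matches no validation row

ok_young = passes_checks(f_predict, h_predict, g_young, val_x, val_y)
ok_empty = passes_checks(f_predict, h_predict, g_empty, val_x, val_y)
```

On this toy data the young group passes because h is right where f is wrong, while the empty group fails the existence check.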

In [None]:
def simple_updater(g,group_name = "g"):
    # if you want to change how h is trained, you can edit the below line.
    h = bountyHuntWrapper.build_model(train_x, train_y, g, dt_depth=10)
    # do not change anything beyond this point.
    if bountyHuntWrapper.run_checks(f, validation_x, validation_y, g, h, train_x=train_x, train_y=train_y):
        print("Running Update")
        bountyHuntWrapper.run_updates(f, g, h, train_x, train_y, validation_x, validation_y, group_name=group_name)
        
def rf_updater(g, group_name = "g", dt_depth = 5):
    print("building h")
    # Get indices first
    indices = train_x.apply(g, axis=1) == 1
    # Find training set
    training_xs = train_x[indices]
    training_ys = train_y[indices]

    clf = RandomForestClassifier(max_depth=dt_depth, random_state=0)  # setting random state for replicability
    clf.fit(training_xs, training_ys)
    print("finished building h")
    h = clf.predict
    # do not change anything beyond this point.
    if bountyHuntWrapper.run_checks(f, validation_x, validation_y, g, h, train_x=train_x, train_y=train_y):
        print("Running Update")
        bountyHuntWrapper.run_updates(f, g, h, train_x, train_y, validation_x, validation_y, group_name=group_name)

def updater(g, h, group_name="g"):
    # do not alter this code
    if bountyHuntWrapper.run_checks(f, validation_x, validation_y, g, h, train_x=train_x, train_y=train_y):
        print("Running Update")
        bountyHuntWrapper.run_updates(f, g, h, train_x, train_y, validation_x, validation_y, group_name=group_name)

In the block below, provide a script that builds *the entire final PDL that you come up with*. We will run this on the initial version of f (the decision stump) in order to evaluate your code. (Note: it is fine for the group functions g and the hs to be defined in the code blocks above; just make sure that everything runs as you expect if you run everything from a clean kernel.)

In [None]:
def g1(x):
    """
    Given an x, g should return either 0 or 1.
    :param x: input vector
    :return: 0 or 1; 1 if x belongs to group, 0 otherwise
    """

    # this group contains all individuals whose RAC1P code is 4, 5, or 7
    if x['RAC1P'] == 4 or x['RAC1P'] == 5 or x['RAC1P'] == 7:
        return 1
    else:
        return 0
simple_updater(g1,group_name="g1")
# rf_updater(g1,group_name="g1")

In [None]:
def g2(x):
    if x['AGEP'] < 16 or x['AGEP'] > 60:
        return 1
    else:
        return 0
simple_updater(g2,group_name="g2")
# rf_updater(g2,group_name="g2")

In [None]:
def g3(x):
    if x['MAR'] == 1:
        return 0
    else:
        return 1
simple_updater(g3,group_name="g3")

In [None]:
def g4(x):
    if x['SCHL'] < 16:
        return 1
    else:
        return 0
simple_updater(g4,group_name="g4")

In [None]:
def g5(x):
    if x['MIL'] != 1 and x['MIL'] != 5 and x['SEX'] == 2:
        return 1
    else:
        return 0
simple_updater(g5,group_name="g5")

In [None]:
def g6(x):
    if x['DIS'] == 1:
        return 1
    else:
        return 0
simple_updater(g6,group_name="g6")

In [None]:
def g7(x):
    if x['MIL'] == 4:
        return 0
    else:
        return 1
simple_updater(g7,group_name="g7")

In [None]:
# def g8(x):
#     if x['RAC1P'] == 2 and x['SEX'] == 2:
#         return 1
#     else:
#         return 0
    
def g8(x):
    if (x['AGEP'] < 22 or x['AGEP'] > 45) and x['ESP'] == 1:
        return 1
    else:
        return 0
# simple_updater(g8,group_name="g8")
rf_updater(g8,group_name="g8")

In [None]:
train_x

In [None]:
train_x['ESP'].value_counts()

# Saving Your Model

We'd like to output the PDL to some permanent location for grading purposes. The lines below do this.

In [None]:
import dill as pickle  # dill is not in the standard library; install it from the command line with `pip install dill`
with open('pdl.pkl', 'wb') as pickle_file:
    pickle.dump(f, pickle_file)

If you saved your PDL to pdl.pkl and you want to reload it, you can do so as follows (instead of re-building it from scratch every time you shut down your kernel). Just be sure that your final PDL is fully replicable in the final version of your code, so that we can re-build it given only your gs and hs.

In [None]:
with open('pdl.pkl', 'rb') as pickle_file:
    content = pickle.load(pickle_file)

# Analysis of Your Final Model

1. How does your final model perform? On both the validation set and the training data, calculate f's error rate on each of the groups you identified, calculate the initial model's error rate on each of those groups, and compare the two by taking their difference. Hint: you can use the helper function `bountyHuntWrapper.measure_group_error(model, g, x, y)` to get the error of f on x and y restricted to just the datapoints in a group g, and you can use `metrics.zero_one_loss` for the initial model (which is a plain scikit-learn decision tree, so you can use the scikit-learn functions on it directly).
2. Say instead you used bootstrapped fairness to postprocess the initial model so that it had equal error rates over the groups you discovered (assuming you had a way to identify those groups ahead of time). How much would you need to inflate each group's error to make them equal?
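
The arithmetic behind both questions can be sketched on made-up numbers. The error rates below are invented for illustration, and the reading of question 2 is an assumption: equalizing by inflation is taken to mean raising every group's error to that of the worst-off group under the initial model.

```python
# Hypothetical per-group error rates; replace with values measured on your data.
initial_err = {"g1": 0.30, "g2": 0.25, "g3": 0.40}
final_err = {"g1": 0.20, "g2": 0.25, "g3": 0.28}

# Question 1: improvement of the final model over the initial one, per group.
improvement = {name: round(initial_err[name] - final_err[name], 4) for name in initial_err}

# Question 2 (assumed reading): errors can only be made worse, so every group
# is raised to the error of the worst-off group under the initial model.
worst = max(initial_err.values())
inflation = {name: round(worst - err, 4) for name, err in initial_err.items()}
```

In the real analysis, the dictionaries would be filled from `bountyHuntWrapper.measure_group_error` for f and from the zero-one loss of the initial model on each group.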

# Data Exploration

In order to find promising groups, you may find it helpful to do some data exploration. Please include any code or visualizations that you did to do so here. To get you started, here are some things you may find useful:

1. How to grab the predictions of the current PDL on the training data: because f.predict takes a single value as input, you have to use an apply function for this.

In [None]:
preds = train_x.apply(f.predict, axis=1)

2. Getting the zero-one loss of a model restricted to a group you have defined.

In [None]:
g = lambda x: 1 #here we define a group that just is all the data, replace as you see fit.
bountyHuntWrapper.measure_group_error(f, g, train_x, train_y)

3. You can view the training data by calling `train_x`. If you want to only view the data for a single group defined by your group function, you can run the following:

In [None]:
# replace g with whatever your group is
indices = train_x.apply(g, axis=1) == 1
xs = train_x[indices]
ys = train_y[indices]

4. Inspecting the existing PDL: The PDL is stored as an object that tracks its training errors, its validation set errors, and the group functions used so far. The errors are stored in lists where the ith element holds the group errors of all groups discovered so far at the ith node of the PDL. If you are curious about the implementation, you can look at the model.py file in the codebase; it doesn't contain anything you can use to adaptively modify your code, and it lives in the same folder as the rest of the codebase just to make importing things easier.

In [None]:
# f is the current model
# print(f.train_errors) # group errors on training set.
# print(f.train_errors[0]) # this is the group error of each group on the initial PDL. The ith element of f.train_errors is the group error of each group on the ith version of the PDL.
# print(f.test_errors) # group errors on validation set
# print(f.predicates) # all of the group functions that have been appended so far
# print(f.leaves) # all of the h functions appended so far
# print(f.pred_names) # the names you passed in for each of the group functions, to more easily understand which are which.
for element in f.test_errors:
    print(element)

5. Looking at the group error of the ith group over each round of updates: Say you found a group at round 5 and you want to know how its group error looked at previous or subsequent rounds. To do so, you can pull `f.train_errors` or `f.test_errors` and look at the ith element of each list as follows:

In [None]:
target_group = 0  # index of the group to track across rounds: 0 is the initial model, 1 would be the first group introduced, and so on.
group_errs = [f.train_errors[i][target_group] for i in range(len(f.train_errors))]
group_errs
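
The description of `f.train_errors` leaves open whether earlier rounds contain an entry for groups that were introduced later. If the lists turn out to be ragged, indexing every round with the same `target_group` would raise an `IndexError`. The sketch below is a defensive variant; `group_trajectory` and `toy_errors` are hypothetical, and the ragged shape is an assumption rather than documented behavior.

```python
def group_trajectory(errors_by_round, target_group):
    """Error of one group at every round; None where the round has no entry for it."""
    return [
        round_errors[target_group] if target_group < len(round_errors) else None
        for round_errors in errors_by_round
    ]

# Hypothetical stand-in for f.train_errors after two accepted updates.
toy_errors = [
    [0.30],              # initial model: a single entry
    [0.27, 0.22],        # after the first update
    [0.25, 0.21, 0.18],  # after the second update
]

traj_first = group_trajectory(toy_errors, 0)
traj_last = group_trajectory(toy_errors, 2)
```

If every round has an entry for every group, this returns the same values as the list comprehension above.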