# Starter Notebook

To help minimize startup difficulties, we have provided a basic ML workflow for this project, along with a few possible avenues to explore.

## Section 1: ML Workflow for Submitting *(g,h)* pairs

### 1.0 Pip Installs and Imports

We will be using the package *dill*, a variant of *pickle* that supports more expressive byte-code serialization. This package is essential for saving your *(g,h)* pairs!

In [1]:
!pip install dill
!pip install xgboost
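To illustrate why *dill* is used instead of *pickle*, the short sketch below relies only on the standard library. It shows that *pickle* stores functions by reference to an importable name, so a lambda cannot be serialized; *dill* is designed to handle such objects by serializing the function itself. The helper `can_pickle` and the lambda `is_senior` are hypothetical names introduced for illustration.

```python
import pickle


def can_pickle(obj):
    """Return True if the standard-library pickle can serialize obj."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


# hypothetical group-like function written as a lambda
is_senior = lambda age: age >= 60

print(can_pickle(len))        # built-in function, stored by reference to its name
print(can_pickle(is_senior))  # lambda has no importable name for pickle to look up
```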



Here is a non-exhaustive list of packages you may find helpful:

In [2]:
# Imports
import pandas as pd
import numpy as np
import sklearn as sk
from sklearn import *
import dill as pkl
import xgboost as xgb

### 1.1 Download/Load Data

Navigate to the project [webpage](https://declancharrison.github.io/CIS_5230_Bias_Bounty_2023/) and click "Download Training Data". Extract the .zip files in the folder where this notebook is located, then run the cell below.

In [3]:
x_train = pd.read_csv('training_data.csv') 
y_train = np.genfromtxt('training_labels.csv', delimiter=',', dtype = float)

In [4]:
x_train

Unnamed: 0,ST,AGEP,CIT,COW,DDRS,DEAR,DEYE,DOUT,DRAT,DREM,...,FER,JWTRNS,LANX,MAR,MIL,SCHL,SEX,WKHP,OCCP,RAC1P
0,22.0,35.0,1.0,1.0,2.0,2.0,2.0,2.0,0.0,2.0,...,2.0,0.0,2.0,5.0,4.0,16.0,2.0,40.0,4020.0,2.0
1,2.0,23.0,1.0,1.0,2.0,2.0,2.0,2.0,0.0,2.0,...,0.0,1.0,2.0,5.0,4.0,16.0,1.0,40.0,6230.0,1.0
2,24.0,71.0,1.0,1.0,2.0,2.0,2.0,2.0,0.0,2.0,...,0.0,1.0,2.0,2.0,4.0,19.0,2.0,36.0,3602.0,2.0
3,28.0,22.0,1.0,1.0,2.0,2.0,2.0,2.0,0.0,2.0,...,2.0,1.0,2.0,5.0,4.0,19.0,2.0,32.0,3255.0,1.0
4,40.0,37.0,1.0,1.0,2.0,2.0,2.0,2.0,0.0,2.0,...,2.0,11.0,2.0,3.0,4.0,19.0,2.0,40.0,3645.0,3.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
340129,1.0,71.0,1.0,1.0,2.0,1.0,2.0,2.0,0.0,2.0,...,0.0,1.0,2.0,2.0,4.0,19.0,1.0,30.0,3930.0,1.0
340130,12.0,45.0,4.0,1.0,2.0,2.0,2.0,2.0,0.0,2.0,...,2.0,1.0,1.0,1.0,4.0,19.0,2.0,40.0,4700.0,8.0
340131,47.0,63.0,1.0,1.0,2.0,2.0,2.0,2.0,0.0,2.0,...,0.0,1.0,2.0,3.0,2.0,16.0,2.0,40.0,350.0,1.0
340132,48.0,25.0,1.0,1.0,2.0,2.0,2.0,2.0,0.0,2.0,...,0.0,0.0,2.0,5.0,4.0,21.0,1.0,40.0,7330.0,1.0


In [5]:
x_train.columns

Index(['ST', 'AGEP', 'CIT', 'COW', 'DDRS', 'DEAR', 'DEYE', 'DOUT', 'DRAT',
       'DREM', 'ENG', 'FER', 'JWTRNS', 'LANX', 'MAR', 'MIL', 'SCHL', 'SEX',
       'WKHP', 'OCCP', 'RAC1P'],
      dtype='object')

In [5]:
(x_train[(x_train['ST'] == 36) | (x_train['ENG'] == 1) | (x_train['DEYE'] == 2) | (x_train['SEX'] == 2)]).shape[0]/(x_train.shape[0]) * 100

99.14533683783449
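The coverage computation above can be wrapped in a small helper so that it is easy to repeat for any candidate group. This is a minimal sketch on toy data: the names `coverage_percent`, `toy`, and `toy_group` are hypothetical, and only the column names `SEX` and `AGEP` come from the dataset.

```python
import pandas as pd


def coverage_percent(group_fn, X):
    """Percentage of rows in X for which group_fn returns True."""
    mask = group_fn(X)
    return float(mask.mean() * 100)


# toy data: four rows, using two of the column names from the dataset
toy = pd.DataFrame({"SEX": [1, 2, 2, 1], "AGEP": [25, 40, 71, 63]})


def toy_group(X):
    # union of two conditions, in the same style as the cell above
    return (X["SEX"] == 2) | (X["AGEP"] >= 60)


print(coverage_percent(toy_group, toy))
```

A group that covers nearly all rows behaves much like a global model, so checking coverage first may help when searching for narrower groups.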

### 1.2 Define a (g,h) pair

Below is an example of training a Gradient Boosting Regressor on the individuals in the dataset who belong to the group *g* defined in the cell.

In [62]:
# define group function
def g(X):
    return (X['ST'] == 36) | (X['ENG'] == 1) | (X['DEYE'] == 2) | (X['SEX'] == 2)

# initialize ML hypothesis class
clf = sk.ensemble.GradientBoostingRegressor(max_depth = 3, n_estimators = 1100, learning_rate= 0.5, random_state = 42)
# clf = xgb.XGBRegressor(max_depth = 5, n_estimators = 100, learning_rate= 0.5, random_state = 42)

# find group indices on data
indices = g(x_train)

# fit model specifically to group
clf.fit(x_train[indices], y_train[indices])

# define hypothesis function as bound clf.predict
h = clf.predict

### 1.3 Save Objects

The following cell will save your group function *g* with filename *g.pkl*, and your hypothesis function *h* with filename *h.pkl*.

In [63]:
# save group function to g.pkl
with open('g.pkl', 'wb') as file:
    pkl.dump(g, file)

# save hypothesis function to h.pkl
with open('h.pkl', 'wb') as file:
    pkl.dump(h, file)

### 1.4 Upload Models to Google Drive and Submit a Pull Request with Links

Follow the instructions in the GitHub repository to submit a *(g,h)* pair update request!

## Section 2: Reducing Workflow Time Requirements by Creating a Local PDL

As you have probably noticed, submitting a *(g,h)* pair to the GitHub repository can take a long time depending on the current workload of the server. To approximate whether or not an update will be accepted, we have provided you with the PDL architecture file and a workflow that mimics your team's private PDL maintained by the server.

**NOTE: One major caveat is the validation data this workflow uses is a cut from the training data, meaning you will want to refrain from training on it to prevent overfitting.**

To get around this without wasting any training data, we suggest training a *(g,h)* pair on the subset of the training data that excludes the validation set, then attempting the *(g,h)* pair update on the local PDL. If the pair is rejected, you can continue tuning hyperparameters or searching for new groups. If the pair is accepted, you can retrain a new *(g,h)* pair over ALL the training data and submit this pair to the server for an update. This lets you "squeeze all the juice" from your training data and test potential updates much more quickly.
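The suggested loop can be sketched end to end on synthetic data. In this minimal sketch the data, the group-mean "model", the helper names `fit_group_mean` and `group_mse`, and the acceptance rule (lower squared error on the group's validation points) are all assumptions made for illustration; the actual rule applied by `PDL.update` is not described here and may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# synthetic stand-in data: group members have a shifted target
x = rng.uniform(0, 1, size=n)
in_group = x > 0.5
y = np.where(in_group, 10.0, 0.0) + rng.normal(0, 0.1, size=n)

# hold out 15% of the rows as a local validation set
perm = rng.permutation(n)
val_idx, train_idx = perm[:30], perm[30:]

# stand-in for the current model: the mean label of the training subset
current_pred = np.full(n, y[train_idx].mean())


def fit_group_mean(idx):
    """Fit the simplest possible h: the mean label of group members in idx."""
    members = idx[in_group[idx]]
    return float(y[members].mean())


def group_mse(pred, idx):
    """Mean squared error restricted to group members in idx."""
    members = idx[in_group[idx]]
    return float(np.mean((y[members] - pred[members]) ** 2))


# step 1: train the candidate on the training subset only
candidate = fit_group_mean(train_idx)
candidate_pred = np.where(in_group, candidate, current_pred)

# step 2: hypothetical acceptance rule, evaluated on the held-out rows
accepted = group_mse(candidate_pred, val_idx) < group_mse(current_pred, val_idx)

# step 3: if accepted, refit on ALL rows before submitting
final = None
if accepted:
    final = fit_group_mean(np.arange(n))

print(accepted, final)
```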

In [24]:
### DON'T CHANGE THIS CELL ###
# from pdl import PointerDecisionList

x_train_subset, x_val, y_train_subset, y_val = sk.model_selection.train_test_split(x_train, y_train, test_size = .15, random_state = 42)
# base_clf = sk.tree.DecisionTreeRegressor(max_depth = 1, random_state = 42)
# base_clf.fit(x_train_subset, y_train_subset)
# PDL = PointerDecisionList(base_clf, x_train_subset, y_train_subset, x_val, y_val, 1, 1)

In [11]:
param_grid = {
    'n_estimators': [50, 100, 200],
    'learning_rate': [0.01, 0.1, 0.5],
    'max_depth': [3, 5, 7]
}
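The notebook does not show how this grid is consumed, so the following is only one possible way to enumerate it. The sketch expands the grid into every hyperparameter combination using the standard library; the names `names` and `combinations` are hypothetical.

```python
from itertools import product

param_grid = {
    'n_estimators': [50, 100, 200],
    'learning_rate': [0.01, 0.1, 0.5],
    'max_depth': [3, 5, 7],
}

# build one dict per combination, e.g. {'n_estimators': 50, 'learning_rate': 0.01, 'max_depth': 3}
names = list(param_grid)
combinations = [dict(zip(names, values)) for values in product(*param_grid.values())]

print(len(combinations))
print(combinations[0])
```

Each dictionary could then be unpacked into a model constructor that accepts these keyword arguments, with every candidate fit on the training subset and checked against the local PDL.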

In [12]:
# open PDL structure
with open('PDL/model.pkl', 'rb') as file:
    PDL = pkl.load(file)

# reload group/hypothesis functions to PDL
PDL.reload_functions()

In [13]:
(x_train[(x_train['COW'] == 7) | (x_train['COW'] == 5) | (x_train['AGEP'] >= 60) | (x_train['DEYE'] == 1) |  (x_train['RAC1P'] == 6) | (x_train['SEX'] == 1) | (x_train['SEX'] == 2)]).shape[0]/(x_train.shape[0]) * 100

100.0

Train your *(g,h)* pair on the subset of the training data below. If the update is accepted, you can then retrain the same pair on the whole training dataset:

In [61]:
# define group function
def g(X):
    return (X['RAC1P'] == 1) | (X['JWTRNS'] == 1) | (X['JWTRNS'] == 11) | (X['COW'] == 8) | (X['AGEP'] >= 75) | (X['SEX'] == 1)

In [62]:
# find group indices on training subset
indices = g(x_train_subset)

indices

251689     True
157025     True
265797     True
158173     True
61427      True
          ...  
119879     True
259178     True
131932     True
146867    False
121958     True
Length: 289113, dtype: bool

In [40]:
# initialize ML hypothesis class
# clf = sk.ensemble.RandomForestRegressor(max_depth = 4, n_estimators = 1000, random_state = 42)
# clf = sk.ensemble.GradientBoostingRegressor(max_depth = 7, n_estimators = 800, learning_rate= 0.1, random_state = 42)
# clf = xgb.XGBRegressor(max_depth = 5, n_estimators = 900, learning_rate= 0.1, random_state = 42)
clf = sk.ensemble.HistGradientBoostingRegressor(max_depth = 6, max_iter = 1200, learning_rate= 0.1, random_state = 42)

# fit model specifically to group subset
clf.fit(x_train_subset[indices], y_train_subset[indices])

# define hypothesis function as bound clf.predict
h = clf.predict

In [41]:
# compute PDL update
update_flag = PDL.update(g, h, x_train_subset, y_train_subset, x_val, y_val)

Update Rejected!


In [192]:
if update_flag:

    # recompute indices over whole training dataset
    indices = g(x_train)

    # refit classifier to full group
    clf.fit(x_train[indices], y_train[indices])

    # define hypothesis function as bound clf.predict
    h = clf.predict    

In [75]:
# load global predictions
global_predictions = np.genfromtxt('training_predictions.csv', delimiter=',', dtype = float)

# set value hyperparameter
value = 45000

# set epsilon hyperparameter
epsilon = 5000

# define 0,1 labels where current predictions fall within epsilon of value
binary_labels = abs(global_predictions - value) < epsilon

# define group model (a regressor fit to the 0/1 labels, so it outputs scores)
g_clf = xgb.XGBRegressor(max_depth = 7, n_estimators = 500, random_state = 42)

# fit classifier to binary labels
g_clf.fit(x_train, binary_labels)

# define g
g = g_clf.predict

g(x_train)

array([0.99999917, 0.99999917, 0.99999917, ..., 0.99999917, 0.99999917,
       0.99999917], dtype=float32)

In [77]:
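# threshold the regressor scores at 0.5 (an assumed cutoff); .astype(bool) would mark every nonzero score True
indices = g(x_train) > 0.5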

indices

array([ True,  True,  True, ...,  True,  True,  True])
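When a regressor's scores are turned into a boolean group mask, the conversion method matters. The sketch below uses made-up scores to compare a plain boolean cast, which treats every nonzero value as a member, with an explicit threshold; the 0.5 cutoff and the variable names are assumptions for illustration.

```python
import numpy as np

# made-up regressor scores for five rows
scores = np.array([0.02, 0.0, 0.97, -0.01, 0.51], dtype=np.float32)

# boolean cast: anything other than exactly 0 becomes True
as_bool = scores.astype(bool)

# explicit threshold: only scores above the cutoff become True
thresholded = scores > 0.5

print(as_bool)
print(thresholded)
```

If a mask comes out almost entirely True, comparing these two conversions is a quick way to check whether the group is really that large or the cast is hiding low scores.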

In [74]:
# initialize ML hypothesis class
# clf = sk.ensemble.RandomForestRegressor(max_depth = 4, n_estimators = 1000, random_state = 42)
# clf = sk.ensemble.GradientBoostingRegressor(max_depth = 7, n_estimators = 800, learning_rate= 0.1, random_state = 42)
h_clf = xgb.XGBRegressor(max_depth = 7, n_estimators = 900, learning_rate= 0.1, random_state = 42)
# clf = sk.ensemble.HistGradientBoostingRegressor(max_depth = 6, max_iter = 1200, learning_rate= 0.1, random_state = 42)

h_clf.fit(x_train[indices], y_train[indices])

h = h_clf.predict

KeyError: "None of [Int64Index([0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n            ...\n            0, 0, 0, 0, 0, 0, 0, 0, 0, 0],\n           dtype='int64', length=289113)] are in the [columns]"
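A `KeyError` of this shape is what pandas raises when a DataFrame is indexed with an array of integers, because the integers are looked up as column labels. One plausible cause, assuming `indices` held 0/1 integers rather than booleans when the cell ran, can be reproduced on toy data; the names `toy`, `bool_mask`, and `int_mask` are hypothetical.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"AGEP": [25, 40, 71], "SEX": [1, 2, 2]})

bool_mask = np.array([False, True, True])
int_mask = bool_mask.astype(int)  # array([0, 1, 1])

# a boolean array selects ROWS where the mask is True
rows = toy[bool_mask]

# an integer array is treated as a list of COLUMN labels, which do not exist here
try:
    toy[int_mask]
    raised_key_error = False
except KeyError:
    raised_key_error = True

print(len(rows), raised_key_error)
```

Making sure the mask has a boolean dtype before indexing avoids this failure mode.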

In [45]:
update_flag = PDL.update(g, h, x_train, y_train, x_val, y_val)

Update Rejected!


Submit *(g,h)* pair to GitHub!

**NOTE: You can save your PDL, but this requires that your validation set does not change! Thus, you should not change the random state used to split your training data once you create your PDL.**
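The effect of the random state can be illustrated without any ML library. This sketch uses the standard-library `random` module rather than `train_test_split`, and the helper `split_indices` is a hypothetical stand-in; it shows that the same seed reproduces the same validation rows, while a different seed does not.

```python
import random


def split_indices(n_rows, test_fraction, seed):
    """Shuffle row indices with a seeded generator and cut off a validation set."""
    order = list(range(n_rows))
    random.Random(seed).shuffle(order)
    n_val = int(n_rows * test_fraction)
    return order[n_val:], order[:n_val]


train_a, val_a = split_indices(100, 0.15, seed=42)
train_b, val_b = split_indices(100, 0.15, seed=42)
train_c, val_c = split_indices(100, 0.15, seed=7)

print(val_a == val_b)  # same seed, same validation rows
print(val_a == val_c)  # different seed, different validation rows
```

A saved PDL that was evaluated on one validation set would be compared against different rows after a reseeded split, which is why the seed should stay fixed.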

In [193]:
# save group function to g.pkl
with open('g.pkl', 'wb') as file:
    pkl.dump(g, file)

# save hypothesis function to h.pkl
with open('h.pkl', 'wb') as file:
    pkl.dump(h, file)

# save PDL
PDL.save_model()