# GerryFair Tutorial

## Required Data Format

To test on a custom dataset, two files are needed: the dataset itself and a file listing the type of each attribute in the dataset. The only requirement on the dataset is that the label column takes values in {0, 1}. Our cleaning step automatically one-hot encodes the categorical variables and, if desired, centers the data.

In the attributes file, each column of the dataset should have a corresponding type code: 0 (unprotected attribute), 1 (protected attribute), or 2 (label). See *communities_protected_formatted.csv* for an example.
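To make the type codes concrete, here is a minimal sketch of how a dataset could be split according to them. It is not the code in *clean.py*. The toy column names, the one-row layout of the attributes file, the helper name `split_by_attribute_type`, and the choice to keep protected columns in `X` as well as in `X_prime` are all assumptions made for illustration; check *communities_protected_formatted.csv* for the layout the library actually expects.

```python
import csv
import io

# Hypothetical toy files; column names and values are made up for illustration.
dataset_csv = """age,race,income,label
25,1,30000,0
40,0,52000,1
31,1,41000,1
"""
# Assumed layout: a header row matching the dataset, then one row of type codes
# (0 = unprotected attribute, 1 = protected attribute, 2 = label).
attributes_csv = """age,race,income,label
0,1,0,2
"""


def split_by_attribute_type(dataset_text, attributes_text):
    """Split rows into all features X, protected features X_prime, and labels y."""
    data_rows = list(csv.DictReader(io.StringIO(dataset_text)))
    codes = next(csv.DictReader(io.StringIO(attributes_text)))

    label_cols = [col for col, code in codes.items() if code == "2"]
    if len(label_cols) != 1:
        raise ValueError("expected exactly one column with type code 2")
    label_col = label_cols[0]

    protected_cols = [col for col, code in codes.items() if code == "1"]
    feature_cols = [col for col in codes if col != label_col]

    X = [[float(row[col]) for col in feature_cols] for row in data_rows]
    X_prime = [[float(row[col]) for col in protected_cols] for row in data_rows]
    y = [int(row[label_col]) for row in data_rows]

    # The label column must only contain 0 and 1.
    if set(y) - {0, 1}:
        raise ValueError("label column must take values in {0, 1}")
    return X, X_prime, y


X, X_prime, y = split_by_attribute_type(dataset_csv, attributes_csv)
```

In this toy example, `race` is the only protected column, so `X_prime` has a single column while `X` has three.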

#### Cleaning Data
If your data is not in that format, it needs to be cleaned. We provide a method, `clean_dataset` in *clean.py*, that you can use to convert your data into the accepted format.

The variable `dataset` should hold the path to the file containing the dataset. The variable `attributes` should hold the path to the file listing the attribute types. Set `centered` to `True` if you want the data to be centered.

In [None]:
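import clean  # clean.py from this repository

dataset = "./dataset/communities.csv"  # path to the dataset file
attributes = "./dataset/communities_protected.csv"  # path to the attribute types file
centered = True  # center the data
X, X_prime, y = clean.clean_dataset(dataset, attributes, centered)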
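For intuition about the one-hot encoding and centering that the cleaning step performs, the following sketch shows both transformations on toy data. It is an illustration only, not the implementation in *clean.py*; the helper names `one_hot` and `center` and the example values are hypothetical.

```python
def one_hot(values):
    """Turn a list of categorical values into 0/1 indicator rows."""
    categories = sorted(set(values))
    rows = [[1.0 if value == cat else 0.0 for cat in categories] for value in values]
    return categories, rows


def center(column):
    """Subtract the column mean so the result has mean zero."""
    mean = sum(column) / len(column)
    return [value - mean for value in column]


# Hypothetical categorical column: one indicator column per distinct value.
colors = ["red", "blue", "red", "green"]
categories, encoded = one_hot(colors)

# Hypothetical numeric column with mean 35.
ages = [20.0, 30.0, 40.0, 50.0]
centered_ages = center(ages)
```

Here `categories` is the sorted list of distinct values, each row of `encoded` contains a single 1.0 in the position of its category, and `centered_ages` sums to zero.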

#### Loading Clean Data

If you have already cleaned the data, or it is already in the required form, we provide a tool to load it. The two files should be in XXX form...

In [None]:
dataset = "communities"
X, X_prime, y = clean.get_data(dataset)

## Using tools to train a model

Once the data is cleaned, we can run the fictitious play algorithm using the `fictitious_play` function we provide. It takes `X`, `X_prime`, and `y` as required arguments and has a variety of options for output and for controlling the run, including an option to print heatmaps and an option to change the number of iterations. Here, we train a model on the communities data, printing heatmaps and setting the number of iterations to 15.

In [None]:
C = 15
printflag = True
heatmapflag = True
heatmap_iter = 2
max_iters = 15
gamma = .01
errors_t, fp_diff_t = fictitious_play(X, X_prime, y, C, printflag, heatmapflag, heatmap_iter, max_iters, gamma)

## Using tools to evaluate a generic model

Once we have a model, whether it is a fictitious play model or any other model, we can use our tools to evaluate its fairness in several ways.

#### Auditing Predictions

You can audit for subgroup fairness of your predictions using the `audit` functionality. These predictions can come from any arbitrary model. Auditing the predictions returns the gamma unfairness of the predictions.

In [None]:
predictions = ...
gamma_unfairness = audit(predictions, X, X_prime, y)
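As a rough picture of what a gamma unfairness value measures, the sketch below computes a size-weighted false-positive disparity for one explicitly given subgroup. It assumes the false-positive variant of the subgroup fairness definition: the fraction of the data that is in the group and has a negative label, multiplied by the gap between the group's false-positive rate and the overall rate. It does not reproduce `audit`, which works over subgroups defined on `X_prime` rather than a hand-picked mask; the helper names and toy data here are hypothetical.

```python
def fp_rate(predictions, y, mask=None):
    """False-positive rate among negatives (y == 0), optionally within a subgroup."""
    negatives = [i for i in range(len(y)) if y[i] == 0 and (mask is None or mask[i])]
    if not negatives:
        return 0.0
    return sum(predictions[i] for i in negatives) / len(negatives)


def gamma_unfairness_for_group(predictions, y, group_mask):
    """Size-weighted false-positive disparity for a single, fixed subgroup."""
    n = len(y)
    # alpha: fraction of all points that are in the group and have label 0.
    alpha = sum(1 for i in range(n) if group_mask[i] and y[i] == 0) / n
    # beta: gap between the overall and the in-group false-positive rates.
    beta = abs(fp_rate(predictions, y) - fp_rate(predictions, y, group_mask))
    return alpha * beta


# Hypothetical toy data: 0/1 labels, 0/1 predictions, and a subgroup membership mask.
y_toy = [0, 0, 0, 0, 1, 1, 0, 0]
predictions_toy = [1, 1, 0, 0, 1, 0, 0, 0]
group_mask = [True, True, False, False, True, False, False, False]

gamma_toy = gamma_unfairness_for_group(predictions_toy, y_toy, group_mask)
```

In this toy data the overall false-positive rate is 2/6, the group's is 2/2, and the group's negatives make up 2/8 of the points, so the disparity is weighted down by the group's small size.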

#### Plotting Errors
You can also plot the errors of the model during training using the `plot_single` function in *fairness_plots.py*. Please note that these errors are returned by our fictitious play algorithm, so this is specifically for analyzing the effectiveness of our model.

In [None]:
fairness_plots.plot_single(errors_t, fp_diff_t, max_iters, gamma, C)