# Explore Feldman Preprocessing

Run the cell below twice. The first run may warn that Orange is missing; this is only a warning, and it goes away when you run the cell again.

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from fairness import results
from fairness.data.objects.list import DATASETS, get_dataset_names
from fairness.data.objects.ProcessedData import ProcessedData
from fairness.algorithms.list import ALGORITHMS
from fairness.metrics.list import get_metrics

%matplotlib inline

First, we'll look at the list of datasets

In [None]:
dataset_names = get_dataset_names()
dataset_names

We are going to choose just one of them to explore

In [None]:
dataset_name = 'ricci'
dataset_idx = dataset_names.index(dataset_name)

> You should now rename this notebook to `explore-<dataset name>`. Ideally, make a copy and rename the copy, so that you can reuse the original to re-run on a different dataset easily.

Now, we can select the dataset object that we want.

In [None]:
procesed_dataset = ProcessedData(DATASETS[dataset_idx])
single_sensitive = procesed_dataset.data.sensitive_attrs[0]
class_attr = procesed_dataset.data.class_attr

In [None]:
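# Index 11 is assumed to be the Feldman algorithm in this version of ALGORITHMS;
# confirm with the output of this cell, since the position depends on the list order.
feldman = ALGORITHMS[11]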
feldman

In [None]:
processed_df = procesed_dataset.dfs['original']
repaired_df = feldman.repair(processed_df, single_sensitive, class_attr, 1)

We have to add the original protected attribute back to the repaired data to be able to compare the two.

In [None]:
repaired_df['orig_' + single_sensitive] = processed_df[single_sensitive]
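
Before plotting, a quick numeric check can show which columns the repair actually touched. The sketch below is a minimal illustration on toy data, not output from the `fairness` library: the frames `original` and `repaired` are hypothetical stand-ins for `processed_df` and `repaired_df`, and the values are made up.

```python
import pandas as pd

# Toy stand-ins for processed_df and repaired_df (hypothetical values).
original = pd.DataFrame({
    "Race": ["W", "W", "B", "B"],
    "Oral": [85.0, 75.0, 85.0, 75.0],
    "Combine": [80.0, 70.0, 60.0, 50.0],
})
repaired = original.copy()
repaired["Combine"] = [70.0, 60.0, 70.0, 60.0]

# Mean absolute change per numeric column; rows are aligned by index.
numeric_cols = original.select_dtypes(include="number").columns
mean_abs_change = (repaired[numeric_cols] - original[numeric_cols]).abs().mean()

# Columns with any change at all.
changed_cols = mean_abs_change[mean_abs_change > 0].index.tolist()
print(mean_abs_change)
print(changed_cols)
```

With the notebook's own frames, the same idea would apply to the numeric columns that `processed_df` and `repaired_df` share, assuming both keep the same row index.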

Next, we can explore what has changed.

We can plot the data before and after the repair to see how the values are distributed. We will set `hue` to the protected class, so that each group gets its own bars, set `col` to the target value of the prediction, and plot a histogram of one of the features.

In [None]:
# Original data: one panel per Class value, one color per Race group
g = sns.FacetGrid(processed_df, col="Class", hue="Race", palette="Set1", col_wrap=2)
g.map(plt.hist, 'Combine', ec="k")
g.axes[-1].legend();  # semicolon prevents extra output

In [None]:
# Repaired data: same layout, colored by the original protected attribute
g = sns.FacetGrid(repaired_df, col="Class", hue="orig_Race", palette="Set1", col_wrap=2)
g.map(plt.hist, 'Combine', ec="k")
g.axes[-1].legend();  # semicolon prevents extra output
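
The histograms give a visual impression; a single number per feature can complement them. The sketch below is one possible summary, assuming that the gap between group means is a reasonable proxy for how far apart the groups sit. The helper `group_mean_gap` and the toy frames are hypothetical and are not part of the `fairness` library.

```python
import pandas as pd


def group_mean_gap(df, feature, group_col):
    """Return the largest difference between per-group means of `feature`."""
    means = df.groupby(group_col)[feature].mean()
    return means.max() - means.min()


# Toy data standing in for the original and repaired frames (made-up values).
before = pd.DataFrame({
    "Race": ["W", "W", "B", "B"],
    "Combine": [80.0, 70.0, 60.0, 50.0],
})
after = pd.DataFrame({
    "orig_Race": ["W", "W", "B", "B"],
    "Combine": [70.0, 60.0, 70.0, 60.0],
})

gap_before = group_mean_gap(before, "Combine", "Race")
gap_after = group_mean_gap(after, "Combine", "orig_Race")
print(gap_before, gap_after)
```

A gap that shrinks after the repair suggests the group distributions have moved closer together, although matching means does not by itself imply matching distributions.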

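It can be useful to see how balanced (or imbalanced) the data is, so we will group by protected class and target variable and count the number of items for (any) one variable. `unstack` changes the orientation of the tabulated data: it moves the inner index level (here `Class`) into the columns, giving one row per group and one column per target value.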

In [None]:
processed_df.groupby(['Race','Class'])['Position'].count().unstack()
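
To make the effect of `unstack` concrete, here is a small sketch on toy data (the rows are made up and only borrow the column names used above). It also shows `pd.crosstab` as an alternative way to build the same table; note that combinations with no rows appear as `NaN` after a plain `unstack`, while `fill_value=0` or `crosstab` reports them as 0.

```python
import pandas as pd

# Toy data with the same column names as the notebook (values are made up).
toy = pd.DataFrame({
    "Race": ["W", "W", "W", "B", "B", "H"],
    "Class": [1, 0, 1, 0, 0, 1],
    "Position": ["Captain", "Lieutenant", "Captain",
                 "Captain", "Lieutenant", "Captain"],
})

# Count rows per (Race, Class) pair, then pivot Class into the columns.
counts = toy.groupby(["Race", "Class"])["Position"].count().unstack()

# Same table, but empty combinations are reported as 0 instead of NaN.
counts_filled = (
    toy.groupby(["Race", "Class"])["Position"].count().unstack(fill_value=0)
)

# pd.crosstab builds the same contingency table directly.
table = pd.crosstab(toy["Race"], toy["Class"])

print(counts)
print(table)
```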

What other ways can you analyze and try to understand what this fairness intervention is doing?