Implement individual fairness metrics #48

Open
saleiro opened this issue Sep 26, 2018 · 4 comments
saleiro commented Sep 26, 2018

This issue is about creating a new class, perhaps named "Individual", that implements individual notions of fairness based on label differences (impurities) among similar individuals. Each method of the class just needs a list of dataframes as input (let's consider that in the future we might want to compare the labels of multiple train/test sets); it finds similar data points and then looks at the label distribution of the pair/cluster.

  1. Cynthia Dwork's notion of individual fairness (the Lipschitz condition).
    sub-methods:
  • create a pairwise distance metric in feature space
  • create a pairwise distance metric in output space
  • some sort of aggregator
    e.g. count the number of times the Lipschitz condition is not met for each point, normalize, and average (see the sketch after this list)
  2. Matching methods to find similar data points and then calculate label purity.
    sub-methods:
  • create clusters (start with k-means)
  • calculate a purity metric for the labels within each cluster (output k metrics)
  • visualize the clusters (if not 2-d, use principal components?)
  • visualize the purity metric per cluster
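To make both directions concrete, here is a minimal sketch of what the two aggregators could look like. Everything here is an assumption about the eventual API: the function names, the fixed Lipschitz constant, the Euclidean distance metric, and the use of plain numpy arrays as input are all placeholders, not part of aequitas.

```python
# Hypothetical sketch only: names, metrics, and the Lipschitz constant
# are placeholders for whatever we settle on.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances


def lipschitz_violation_rates(X, y_hat, lipschitz_constant=1.0):
    """Per-point fraction of pairs violating d_out <= L * d_feat, plus the mean."""
    d_feat = pairwise_distances(X)                                 # distances in feature space
    d_out = pairwise_distances(np.asarray(y_hat).reshape(-1, 1))   # distances in output space
    violations = d_out > lipschitz_constant * d_feat               # pairwise violation mask
    np.fill_diagonal(violations, False)                            # ignore self-pairs
    per_point = violations.mean(axis=1)                            # normalize per point
    return per_point, per_point.mean()                             # ...and average


def cluster_label_purity(X, labels, k=10, random_state=0):
    """k-means the feature space; within each cluster, report the majority-label share."""
    assignments = KMeans(n_clusters=k, random_state=random_state).fit_predict(X)
    purities = {}
    for c in range(k):
        cluster_labels = np.asarray(labels)[assignments == c]
        if cluster_labels.size == 0:
            continue
        _, counts = np.unique(cluster_labels, return_counts=True)
        purities[c] = counts.max() / counts.sum()                  # purity of cluster c
    return purities
```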
saleiro commented Sep 26, 2018

The sample datasets can be used for implementing this.

anisfeld added a commit that referenced this issue Oct 1, 2018
anisfeld commented Oct 1, 2018

I'm unclear on what the output should be.

For example, Dwork et al. focus on using the Lipschitz condition as a constraint in the optimization problem. We want to take the predictions from an unconstrained algorithm and determine how far they are from satisfying the Lipschitz condition.

As a first step, would we want two matrices: one with distance measured in the feature space and a second with distance measured in the outcome space (from which we could derive a score such as the fraction of pairs of points that fail the condition)?

Similarly, would the input be the whole feature space for a prediction? This seems necessary, but undesirable, since we would require all the data.
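For concreteness, that first step might look something like the following. This is only a sketch: the Euclidean metric and the Lipschitz constant of 1.0 are placeholders, and the function names are made up for illustration.

```python
import numpy as np
from sklearn.metrics import pairwise_distances


def lipschitz_matrices(X, y_hat):
    """The two matrices: pairwise distances in feature space and in outcome space."""
    d_feat = pairwise_distances(X)
    d_out = pairwise_distances(np.asarray(y_hat).reshape(-1, 1))
    return d_feat, d_out


def fraction_of_failing_pairs(d_feat, d_out, lipschitz_constant=1.0):
    """Fraction of distinct pairs with d_out > L * d_feat."""
    iu = np.triu_indices_from(d_feat, k=1)   # each unordered pair counted once
    return float(np.mean(d_out[iu] > lipschitz_constant * d_feat[iu]))
```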

anisfeld commented
@saleiro I've outlined how I think this should work in the issue ticket. Does that outline of functions seem reasonable? Shall I start with the clustering?

saleiro commented Oct 12, 2018

@anisfeld I suggest that you create a README file within the individual module and outline exactly what you are going to implement. Let's abstract away whether the features are passed in the dataframe or not; it's up to the end user to decide what representation she wants to use. Users can even pass different dataframes based on different representations, train/test splits over time, etc.
