ENH Add ranking metrics #974
base: main
@@ -194,6 +194,24 @@ group loss primarily seeks to mitigate quality-of-service harms. Equalized
odds and equal opportunity can be used as a diagnostic for both allocation
harms as well as quality-of-service harms.

Review comment: We might want to move this paragraph further down and add the ranking metrics as representing allocation / quality-of-service harm.
*Ranking*:
Fairlearn includes two constraints for rankings, based on exposure: a measure of the amount of
attention an instance is expected to receive, based on its position in the ranking. Exposure is
computed as a logarithmic discount :math:`\frac{1}{\log_2(1+i)}` for each position :math:`i`, as used
in discounted cumulative gain (DCG).

Review comment: Is this meant to be agnostic to the use case for ranking? How would the expectations for how much attention an instance might receive change in situations where rankings are bounded at particular intervals (e.g., a limited number of search results returned per page)?
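As an illustration of this discount (plain Python, not part of the Fairlearn API), the exposure of each position can be computed directly:

```python
import math

def position_exposure(i):
    """Exposure (logarithmic position discount) for 1-based rank position i, as in DCG."""
    return 1.0 / math.log2(1 + i)

# Exposure decays quickly: the top position gets 1.0, position 6 only ~0.36.
print([round(position_exposure(i), 3) for i in range(1, 7)])
# [1.0, 0.631, 0.5, 0.431, 0.387, 0.356]
```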
* *Exposure*: We try to allocate the exposure that each item gets fairly across the groups.
  A ranking :math:`\tau` has a fair exposure allocation under a distribution over :math:`(X,A,Y)`,
  if its ranking for :math:`\tau(X)` is statistically independent over sensitive feature :math:`A`.
  [#6]_
* *Proportional exposure*: We try to keep the exposure that each item gets proportional to its
  "ground-truth" relevance. Otherwise, small differences in relevance can lead to huge differences
  in exposure. A ranking :math:`\tau` satisfies parity in quality-of-service under
  a distribution over :math:`(X,A,Y)`, if its ranking for :math:`\tau(X)` is statistically
  proportional to :math:`Y`, independent over sensitive feature :math:`A`. [#6]_

Review comment: Are you able to say anything more about how this ground-truth relevance is typically determined? I.e., is this something that data scientists would have access to a priori, or if not, is there guidance in the paper this is adapted from on how to determine this?
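The two constraints above can be sketched for a concrete ranking as follows (illustrative pure Python under the definitions above, not the Fairlearn API; the group names and relevance values are hypothetical):

```python
import math

def group_exposure_stats(positions, groups, relevance):
    """Average exposure and exposure-to-relevance ratio per group."""
    per_group = {}
    for pos, g, y in zip(positions, groups, relevance):
        s = per_group.setdefault(g, {"exposure": [], "relevance": []})
        s["exposure"].append(1.0 / math.log2(1 + pos))  # DCG position discount
        s["relevance"].append(y)
    return {
        g: {
            "exposure": sum(s["exposure"]) / len(s["exposure"]),
            # average exposure divided by average relevance ("proportional exposure")
            "proportional_exposure": (sum(s["exposure"]) / len(s["exposure"]))
                                     / (sum(s["relevance"]) / len(s["relevance"])),
        }
        for g, s in per_group.items()
    }

stats = group_exposure_stats(
    positions=[1, 2, 3, 4, 5, 6],
    groups=["A", "A", "A", "B", "B", "B"],
    relevance=[0.82, 0.81, 0.80, 0.79, 0.78, 0.77],
)
# Group A gets almost twice the average exposure of group B despite near-equal relevance.
print(stats)
```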
Disparity metrics, group metrics
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -420,4 +438,7 @@ the algorithm may impact the intended outcomes of a given model.
   <https://arxiv.org/pdf/1912.05511.pdf>`_, FAccT, 2021.

.. [#5] Obermeyer, Powers, Vogeli, Mullainathan `"Dissecting racial bias in an algorithm used to manage the health of populations"
   <https://science.sciencemag.org/content/366/6464/447>`_, Science, 2019.

.. [#6] Singh, Joachims `"Fairness of Exposure in Rankings"
   <https://dl.acm.org/doi/10.1145/3219819.3220088>`_, KDD, 2018.
@@ -0,0 +1,145 @@
# Copyright (c) Fairlearn contributors.
# Licensed under the MIT License.

Review comment: I'm a bit surprised we're adding an example, but not a user guide section. Sometimes the latter can borrow from the former through

Review comment: This example looks quite okay to me, we can just make sure we have links to it from the right places in the user guide and the API guides.
""" | ||||||
========================================= | ||||||
Ranking | ||||||
========================================= | ||||||
""" | ||||||
|
||||||
from fairlearn.metrics import exposure, utility, proportional_exposure | ||||||
from fairlearn.metrics import MetricFrame | ||||||
|
||||||
# %%
# This notebook shows how to apply functionalities of :mod:`fairlearn.metrics` to ranking problems.
# We showcase the example "Fairly Allocating Economic Opportunity" from the paper
# `"Fairness of Exposure in Rankings" <https://dl.acm.org/doi/10.1145/3219819.3220088>`_
# by Singh and Joachims (2018).
# The example demonstrates how small differences in item relevance can lead to large differences
# in exposure. Differences in exposure can be harmful when we rank individuals, such as in hiring.
# Groups of people can also be indirectly affected by rankings of items such as books, music, or
# products in online retail. For example, products by authors of color could be
# consistently ranked lower in search results.
#
# We reproduce the example of the paper, which for simplicity uses a binary gender category
# as sensitive attribute. However, the metric can also be applied to multi-categorical sensitive
# attributes.
#
# Consider a web-service that connects employers ("users") to potential employees ("items").
# The web-service uses a ranking-based system to present a set of 6 applicants, of whom 3 are
# labelled man and 3 are labelled woman. The men have relevances of 0.82, 0.81, and 0.80
# for the employer, while the women have relevances of 0.79, 0.78, and 0.77.
# In this setting, a relevance of 0.77 means that 77% of all employers issuing the query
# considered the applicant relevant.
#
# A simple way to rank the applicants is in decreasing order of relevance. What does this mean
# for the exposure between the two groups?
#
# NOTE: The data used should raise questions about
# :ref:`construct validity <fairness_in_machine_learning.construct-validity>`, since we
# should consider whether the "relevance" in the data is a good measurement of the actual
# relevance. Hindsight bias is also a concern, since we cannot know upfront whether an applicant
# will be a successful employee in the future. These concerns are important, but out of scope for
# this use case example.
ranking_pred = [1, 2, 3, 4, 5, 6]  # ranking positions
gender = ['Man', 'Man', 'Man', 'Woman', 'Woman', 'Woman']
y_true = [0.82, 0.81, 0.80, 0.79, 0.78, 0.77]  # continuous relevance scores
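# %%
# As a quick aside (illustrative pure Python, not a Fairlearn API), we can compute the
# exposure attached to each ranking position by hand to see how steep the drop-off is:
import math

positions = [1, 2, 3, 4, 5, 6]  # same positions as ranking_pred above
position_exposure = [1 / math.log2(1 + i) for i in positions]
print([round(e, 3) for e in position_exposure])  # [1.0, 0.631, 0.5, 0.431, 0.387, 0.356]
# The top three positions (held by the men) receive roughly 64% of the total exposure.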
# %%
# Here we define which metrics we want to analyze.
#
# - The :func:`fairlearn.metrics.exposure` metric measures the average exposure of a group of
#   items, based on their positions in the ranking. Exposure is the value that we assign to every
#   place in the ranking, calculated with a
#   standard exposure drop-off of :math:`1/\log_2(1+j)` as used in Discounted Cumulative Gain
#   (`DCG <https://scikit-learn.org/stable/modules/generated/sklearn.metrics.dcg_score.html>`_)
#   to account for position bias. Large differences in exposure can be an
#   indication of allocation harm, i.e., men are on average ranked much higher than women by the
#   web-service.
#
# - The :func:`fairlearn.metrics.utility` metric indicates the average "ground-truth" relevance of
#   a group.
#
# - The :func:`fairlearn.metrics.proportional_exposure` metric computes the average exposure of a
#   group, divided by its utility (i.e., average relevance). Differences between groups indicate
#   that the exposure of some groups is not proportional to their ground-truth utility, which can
#   be seen as a measure of quality-of-service harm.
#
# We can compare ranking metrics across groups using :class:`fairlearn.metrics.MetricFrame`.
metrics = {
    'exposure (allocation harm)': exposure,
    'average utility': utility,
    'proportional exposure (quality-of-service)': proportional_exposure
}

mf = MetricFrame(metrics=metrics,
                 y_true=y_true,
                 y_pred=ranking_pred,
                 sensitive_features={'gender': gender})

# Plot the metrics per group side by side
mf.by_group.plot(
    kind="bar",
    subplots=True,
    layout=[1, 3],
    legend=False,
    figsize=(12, 4)
)
# %%
# We can compute the minimum ratio between groups using
# :func:`fairlearn.metrics.MetricFrame.ratio`. The closer the ratio is to 1, the fairer
# the ranking.
mf.ratio()
# %%
# The first plot shows that in the web-service's ranking men get significantly more exposure
# than women, although the second plot shows that the average utility of women is comparable
# to that of men. We can therefore say that the ranking contains quality-of-service harm
# against women, since the proportional exposure is not equal (plot 3).
# %%
# How can we fix this? A simple solution is to rerank the items in such a way that women get
# more exposure and men get less exposure. For example, we can swap the top man with the top
# woman and remeasure the quality-of-service harm.

ranking_pred = [1, 2, 3, 4, 5, 6]  # ranking positions
gender = ['Woman', 'Man', 'Man', 'Man', 'Woman', 'Woman']
y_true = [0.79, 0.81, 0.80, 0.82, 0.78, 0.77]  # continuous relevance scores
# Analyze the metrics using MetricFrame.
# Note that, in contrast to a classification problem, y_pred now requires a ranking.
metrics = {
    'exposure (allocation harm)': exposure,
    'average utility': utility,
    'proportional exposure (quality-of-service)': proportional_exposure
}

mf = MetricFrame(metrics=metrics,
                 y_true=y_true,
                 y_pred=ranking_pred,
                 sensitive_features={'gender': gender})

# Plot the metrics per group side by side
mf.by_group.plot(
    kind="bar",
    subplots=True,
    layout=[1, 3],
    legend=False,
    figsize=(12, 4)
)

mf.ratio()
# %%
# The new plots show that the exposure and proportional exposure are now much more equal. The
# difference in exposure allocation is much smaller, and the exposure of each group is better
# in proportion to the group's average utility.
#
# This was a simple example using fabricated data, just to show what the exposure metrics are
# capable of measuring. Manually swapping people in a ranking is, however, not advisable for
# larger real-world data sets. There we would recommend mitigation techniques, which are
# unfortunately not yet implemented in Fairlearn.
Review comment: `utility` seems a bit too generic. (I have a feeling @MiroDudik feels the same way...) Perhaps `ranking_utility`? To some extent I'm wondering if these should be grouped with the other metrics at all, or whether this deserves its own section with ranking metrics.

Review comment: Since exposure and utility only work for rankings, I also think it is better to call them `ranking_exposure` and `ranking_utility`.

Review comment: I second the proposed names.

Review comment: I'm fine with `ranking_exposure`. Another option would be `dcg_exposure`. This would allow us to introduce, for example, `rbp_exposure` in future, see Eq. (2) here:
My impression is that `utility` or even `ranking_utility` is not the best naming choice for the objective we are calculating, and it is also not very standard, because in most contexts `utility` is just a synonym for `score`, so I would expect that it refers to things like `dcg_score` or `ndcg_score`. That is actually how they use the word utility even in the "Fairness of Exposure in Rankings" paper. So, I'd be in favor of using some variation of `relevance`. Maybe `average_relevance` or `mean_relevance`?