ENH Add ranking metrics #974
base: main
Changes from 9 commits
Original file line number | Diff line number | Diff line change |
@@ -194,6 +194,24 @@ group loss primarily seeks to mitigate quality-of-service harms. Equalized
odds and equal opportunity can be used as a diagnostic for both allocation
harms as well as quality-of-service harms.
Review comment: We might want to move this paragraph further down and add the ranking metrics as representing allocation / quality-of-service harm.
*Ranking*:

Fairlearn includes two constraints for rankings, based on exposure: a measure of the amount of
attention an item is expected to receive, based on its position in the ranking.

Review comment: Is this meant to be agnostic to the use case for ranking? How would the expectations for how much attention an instance might receive change in situations where rankings are bounded at particular intervals (e.g., a limited number of search results returned per page)?

Exposure is computed as a logarithmic discount :math:`\frac{1}{\log_2(1+i)}` for each position
:math:`i`, as used in discounted cumulative gain (DCG).
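As a quick worked example of the discount (using the base-2 logarithm, matching the DCG convention), the exposure assigned to the first three positions is:

```latex
v(1) = \frac{1}{\log_2(1+1)} = 1, \qquad
v(2) = \frac{1}{\log_2(1+2)} \approx 0.63, \qquad
v(3) = \frac{1}{\log_2(1+3)} = 0.5
```

so an item in the top position receives exactly twice the exposure of an item in third position.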
* *Allocation harm*: We try to allocate the exposure that each item gets fairly across the groups.
  A ranking :math:`\tau` has a fair exposure allocation under a distribution over :math:`(X,A,Y)`
  if the exposure in its ranking :math:`\tau(X)` is statistically independent of the sensitive
  feature :math:`A`. [#6]_
* *Quality-of-service harm*: We try to keep the exposure that each item gets proportional to its
  "ground-truth" relevance. Otherwise, small differences in relevance can lead to huge differences
  in exposure. A ranking :math:`\tau` satisfies parity in quality-of-service under a distribution
  over :math:`(X,A,Y)` if the exposure in its ranking :math:`\tau(X)` is statistically
  proportional to :math:`Y`, independently of the sensitive feature :math:`A`. [#6]_

Review comment: Are you able to say anything more about how this ground truth relevance is typically determined? i.e., is this something that data scientists would have access to a priori, or if not, is there guidance in the paper this is adapted from on how to determine this?
Disparity metrics, group metrics
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -420,4 +438,7 @@ the algorithm may impact the intended outcomes of a given model.
   <https://arxiv.org/pdf/1912.05511.pdf>`_, FAccT, 2021.
.. [#5] Obermeyer, Powers, Vogeli, Mullainathan `"Dissecting racial bias in an algorithm used to manage the health of populations"
   <https://science.sciencemag.org/content/366/6464/447>`_, Science, 2019.

.. [#6] Singh, Joachims `"Fairness of Exposure in Rankings"
   <https://dl.acm.org/doi/10.1145/3219819.3220088>`_, KDD, 2018.
@@ -0,0 +1,117 @@
# Copyright (c) Fairlearn contributors.
# Licensed under the MIT License.

Review comment: I'm a bit surprised we're adding an example, but not a user guide section. Sometimes the latter can borrow from the former through

Review comment: This example looks quite okay to me, we can just make sure we have links to it from the right places in the user guide and the API guides.
""" | ||||||
========================================= | ||||||
Ranking | ||||||
========================================= | ||||||
""" | ||||||
|
||||||
from fairlearn.metrics import exposure, utility, exposure_utility_ratio | ||||||
from fairlearn.metrics import MetricFrame
# %%
# This example shows how to use Fairlearn with rankings. We showcase the example "Fairly
# Allocating Economic Opportunity" from the paper "Fairness of Exposure in Rankings" by Singh and
# Joachims (2018).
# The example demonstrates how small differences in item relevance can lead to large differences
# in exposure.
#
# Consider a web-service that connects employers (users) to potential employees (items).
# The web-service uses a ranking-based system to present a set of 6 applicants, of which 3 are
# male and 3 are female. Male applicants have relevances of 0.82, 0.81, and 0.80 respectively
# for the employer, while female applicants have relevances of 0.79, 0.78, and 0.77.
# In this setting a relevance of 0.77 means that 77% of all employers issuing the query found
# the applicant relevant.
#
# The Probability Ranking Principle suggests ranking the applicants in decreasing order of
# relevance. What does this mean for the exposure of the two groups?
ranking_pred = [1, 2, 3, 4, 5, 6]  # ranking positions
sex = ['Male', 'Male', 'Male', 'Female', 'Female', 'Female']
y_true = [0.82, 0.81, 0.80, 0.79, 0.78, 0.77]  # continuous relevance scores
# %%
# Here we define which metrics we want to analyze.
#
# - The `exposure` metric shows the average exposure that each group gets, based on position
#   bias. Exposure is the value that we assign to every position in the ranking, computed with
#   the standard exposure drop-off of :math:`1/\log_2(1+j)` used in Discounted Cumulative Gain
#   (DCG) to account for position bias. If there are big differences in exposure, we could say
#   that there is allocation harm in the data, i.e. males are on average ranked much higher
#   than females by the web-service.
#
# - The `utility` metric shows the average relevance that each group has.
#
# - The `exposure_utility_ratio` metric shows quality-of-service harms in the data, since it
#   shows what the average exposure of each group is compared to its relevance. If there are
#   big differences in this metric, we could say that the exposure of some sensitive groups is
#   not proportional to their utility.
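The group exposures described above can be checked by hand (a sketch using only the standard library; the exact values returned by the proposed `exposure` metric may be normalized differently):

```python
import math

# Logarithmic-discount exposure 1/log2(1 + i) for 1-indexed positions 1..6.
position_exposure = [1 / math.log2(1 + i) for i in range(1, 7)]

# In this example males occupy positions 1-3 and females positions 4-6.
male_exposure = sum(position_exposure[:3]) / 3
female_exposure = sum(position_exposure[3:]) / 3

print(round(male_exposure, 3), round(female_exposure, 3))
```

Despite the near-identical relevances, the male group's average exposure (about 0.71) is nearly double the female group's (about 0.39).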
metrics = {
    'exposure (allocation harm)': exposure,
    'average utility': utility,
    'exposure/utility (quality-of-service)': exposure_utility_ratio
}

mf = MetricFrame(metrics=metrics,
                 y_true=y_true,
                 y_pred=ranking_pred,
                 sensitive_features={'sex': sex})
# Customize the plot
mf.by_group.plot(
    kind="bar",
    subplots=True,
    layout=[1, 3],
    legend=False,
    figsize=(12, 4)
)
# Show the ratio of the metrics, where 0 means unfair and 1 means fair.
mf.ratio()
# %%
# The first plot shows that men get significantly more exposure from the web-service than
# women, although the second plot shows that the utility of the female group is comparable to
# that of the male group.
# Therefore we can say that the ranking contains quality-of-service harm against women, since
# the exposure/utility ratio is not equal across groups (plot 3).
# %%
# How can we fix this? A simple solution is to rerank the items in such a way that females get
# more exposure and males get less exposure. For example, we can switch the top male with the
# top female applicant and remeasure the quality-of-service harm.

ranking_pred = [1, 2, 3, 4, 5, 6]  # ranking positions
sex = ['Female', 'Male', 'Male', 'Male', 'Female', 'Female']
y_true = [0.79, 0.81, 0.80, 0.82, 0.78, 0.77]  # continuous relevance scores
# Analyze the metrics using MetricFrame.
# Note that, in contrast to the classification setting, y_pred now requires a ranking.
metrics = {
    'exposure (allocation harm)': exposure,
    'average utility': utility,
    'exposure/utility (quality-of-service)': exposure_utility_ratio
}

mf = MetricFrame(metrics=metrics,
                 y_true=y_true,
                 y_pred=ranking_pred,
                 sensitive_features={'sex': sex})
# Customize the plot
mf.by_group.plot(
    kind="bar",
    subplots=True,
    layout=[1, 3],
    legend=False,
    figsize=(12, 4)
)
# Show the ratio of the metrics, where 0 means unfair and 1 means fair.
mf.ratio()
# %%
# The new plots show that the exposure and the exposure/utility ratio are now much more equal.
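The effect of the swap can also be verified by hand (a sketch under the same logarithmic-discount assumption, reporting the min/max ratio of the two groups' average exposures, where 1 means perfectly equal):

```python
import math

def avg_exposure(positions):
    """Average logarithmic-discount exposure 1/log2(1 + i) over a group's positions."""
    return sum(1 / math.log2(1 + i) for i in positions) / len(positions)

# Before the swap: males held positions 1-3, females positions 4-6.
before = avg_exposure([4, 5, 6]) / avg_exposure([1, 2, 3])

# After swapping the top male and top female: females at 1, 5, 6; males at 2, 3, 4.
male, female = avg_exposure([2, 3, 4]), avg_exposure([1, 5, 6])
after = min(male, female) / max(male, female)

print(round(before, 3), round(after, 3))  # -> 0.551 0.896
```

A single swap moves the exposure ratio from roughly 0.55 to roughly 0.90, which matches the "much more equal" plots described above.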
@@ -48,6 +48,16 @@
    _mean_underprediction,
    count)

from ._exposure import (  # noqa: F401
    exposure,
    utility,
    exposure_utility_ratio,
    allocation_harm_in_ranking_difference,
    allocation_harm_in_ranking_ratio,
    quality_of_service_harm_in_ranking_difference,
    quality_of_service_harm_in_ranking_ratio
)

Review comment: Perhaps we can come up with a shorter name for these! How about simply

Review comment: I also don't like the names yet. exposure_utility_ratio_ratio? 😂

Review comment: @fairlearn/fairlearn-maintainers do you have any ideas for naming these?

Review comment: the doc says

# Add the generated metrics of the form
# `<metric>_{difference,ratio,group_min,group_max}`
@@ -67,7 +77,11 @@
    "demographic_parity_difference",
    "demographic_parity_ratio",
    "equalized_odds_difference",
    "equalized_odds_ratio",
    "allocation_harm_in_ranking_difference",
    "allocation_harm_in_ranking_ratio",
    "quality_of_service_harm_in_ranking_difference",
    "quality_of_service_harm_in_ranking_ratio"
]

_extra_metrics = [
@@ -80,4 +94,11 @@
    "count"
]

_ranking_metrics = [
    "exposure",
    "utility",
    "exposure_utility_ratio"
]

__all__ = _core + _disparities + _extra_metrics + _ranking_metrics \
    + list(sorted(_generated_metric_dict.keys()))
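For context on what these exports are expected to look like: a metric used with `MetricFrame` is a callable taking `(y_true, y_pred)` arrays. A minimal sketch of what the three base ranking metrics could look like (hypothetical — the PR's actual `_exposure.py` implementation is not shown in this diff, and its normalization may differ):

```python
import math

def exposure(y_true, y_pred):
    """Mean logarithmic-discount exposure 1/log2(1 + position) over the items.

    y_pred holds 1-indexed ranking positions; y_true (relevance) is accepted
    but unused, since MetricFrame passes both arguments to every metric.
    """
    positions = list(y_pred)
    return sum(1 / math.log2(1 + p) for p in positions) / len(positions)

def utility(y_true, y_pred):
    """Mean "ground-truth" relevance of the items in the group."""
    relevances = list(y_true)
    return sum(relevances) / len(relevances)

def exposure_utility_ratio(y_true, y_pred):
    """Exposure relative to utility; divergence across groups signals
    quality-of-service harm."""
    return exposure(y_true, y_pred) / utility(y_true, y_pred)
```

Under this sketch, `MetricFrame` would call each metric once per sensitive-feature group and once overall, which is what produces the per-group bars in the example plots.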
Review comment: `utility` seems a bit too generic. (I have a feeling @MiroDudik feels the same way...) Perhaps `ranking_utility`? To some extent I'm wondering if these should be grouped with the other metrics at all, or whether this deserves its own section with ranking metrics.

Review comment: Since exposure and utility only work for exposure. I also think it is better to call them `ranking_exposure` and `ranking_utility`

Review comment: I second the proposed names.

Review comment: I'm fine with `ranking_exposure`. Another option would be `dcg_exposure`. This would allow us to introduce, for example, `rbp_exposure` in future, see Eq. (2) here:

My impression is that `utility` or even `ranking_utility` is not the best naming choice for the objective we are calculating and it is also not very standard, because in most contexts, `utility` is just a synonym for `score`, so I would expect that it refers to things like `dcg_score` or `ndcg_score`. That is actually how they use the word utility even in the "Fairness of Exposure in Rankings" paper. So, I'd be in favor of using some variation of `relevance`. Maybe, `average_relevance` or `mean_relevance`?