ENH Add ranking metrics #974

Open
wants to merge 19 commits into base: main
Changes from 9 commits
3 changes: 3 additions & 0 deletions docs/user_guide/assessment.rst
@@ -277,6 +277,9 @@ Base metric :code:`group_min` :code:`group_m
:func:`.selection_rate` . . Y Y
:func:`.true_negative_rate` . . Y Y
:func:`.true_positive_rate` . . Y Y
:func:`.exposure` . . Y Y
:func:`.utility` . . Y Y
Member
utility seems a bit too generic. (I have a feeling @MiroDudik feels the same way...)
Perhaps ranking_utility?
To some extent I'm wondering if these should be grouped with the other metrics at all, or whether this deserves its own section with ranking metrics.

Contributor Author
Since exposure and utility only work for rankings, I also think it is better to call them ranking_exposure and ranking_utility.

Member
I second the proposed names.

Member (@MiroDudik), May 11, 2022
I'm fine with ranking_exposure. Another option would be dcg_exposure. This would allow us to introduce, for example, rbp_exposure in future, see Eq. (2) here:

My impression is that utility or even ranking_utility is not the best naming choice for the objective we are calculating and it is also not very standard, because in most contexts, utility is just a synonym for score so I would expect that it refers to things like dcg_score or ndcg_score. That is actually how they use the word utility even in the "Fairness of Exposure in Rankings" paper. So, I'd be in favor of using some variation of relevance. Maybe, average_relevance or mean_relevance?
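
A small sketch of the distinction being drawn here (it reuses the data from the example added later in this PR; `ndcg_score` is scikit-learn's ranking score, while treating utility as a plain per-group mean relevance follows this PR's description of the metric):

```python
import numpy as np
from sklearn.metrics import ndcg_score

relevance = np.array([0.82, 0.81, 0.80, 0.79, 0.78, 0.77])
ranking = np.array([1, 2, 3, 4, 5, 6])  # position of each item, 1 = top
sex = np.array(['Male', 'Male', 'Male', 'Female', 'Female', 'Female'])

# "utility" in the usual ranking-metrics sense: a score of the whole ranking
print(ndcg_score(relevance.reshape(1, -1), (-ranking).reshape(1, -1)))

# "utility" as used in this PR: the mean relevance of the items in each group
for group in np.unique(sex):
    print(group, relevance[sex == group].mean())
```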

:func:`.exposure_utility_ratio` . . Y Y
:func:`sklearn.metrics.accuracy_score` Y . Y Y
:func:`sklearn.metrics.balanced_accuracy_score` Y . . .
:func:`sklearn.metrics.f1_score` Y . . .
23 changes: 22 additions & 1 deletion docs/user_guide/fairness_in_machine_learning.rst
@@ -194,6 +194,24 @@ group loss primarily seeks to mitigate quality-of-service harms. Equalized
odds and equal opportunity can be used as a diagnostic for both allocation
harms as well as quality-of-service harms.
Contributor
We might want to move this paragraph further down and add the ranking metrics as representing allocation / quality-of-service harm.


*Ranking*:

Fairlearn includes two constraints for rankings, both based on exposure: a measure of the amount
of attention an item is expected to receive, given its position in the ranking. Exposure is
Contributor
Is this meant to be agnostic to the use case for ranking? How would the expectations for how much attention an instance might receive change in situations where rankings are bounded at particular intervals (e.g., a limited number of search results returned per page)?

computed as a logarithmic discount :math:`\frac{1}{\log_2(1+i)}` for each position :math:`i`, as
used in discounted cumulative gain (DCG).
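
A minimal sketch of this per-position computation (plain ``numpy``, for illustration only):

.. code-block:: python

    import numpy as np

    # exposure of ranking positions 1..6 under the DCG discount above
    positions = np.arange(1, 7)
    exposure_per_position = 1 / np.log2(1 + positions)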

* *Allocation harm*: We try to allocate the exposure that the items receive fairly across the
  groups. A ranking :math:`\tau` has a fair exposure allocation under a distribution over
  :math:`(X,A,Y)` if the exposure in :math:`\tau(X)` is statistically independent of the
  sensitive feature :math:`A`. [#6]_

* *Quality-of-service harm*: We try to keep the exposure that each item receives proportional to its
  "ground-truth" relevance. Otherwise, small differences in relevance can lead to large differences
Contributor
Are you able to say anything more about how this ground truth relevance is typically determined? i.e., is this something that data scientists would have access to a priori, or if not, is there guidance in the paper this is adapted from on how to determine this?

  in exposure. A ranking :math:`\tau` satisfies parity in quality-of-service under a distribution
  over :math:`(X,A,Y)` if the exposure in :math:`\tau(X)` is proportional to :math:`Y`,
  independently of the sensitive feature :math:`A`. [#6]_
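
Written out under the definitions above (a paraphrase rather than the exact notation of [#6]_),
with :math:`\mathrm{Exp}(a)` denoting the average exposure of the items in group :math:`a` and
:math:`U(a)` their average relevance, fair exposure allocation requires
:math:`\mathrm{Exp}(a) = \mathrm{Exp}(b)` for all groups :math:`a` and :math:`b`, while parity in
quality-of-service requires :math:`\mathrm{Exp}(a)/U(a) = \mathrm{Exp}(b)/U(b)`.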

Disparity metrics, group metrics
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

@@ -420,4 +438,7 @@ the algorithm may impact the intended outcomes of a given model.
<https://arxiv.org/pdf/1912.05511.pdf>`_, FAccT, 2021.

.. [#5] Obermeyer, Powers, Vogeli, Mullainathan `"Dissecting racial bias in an algorithm used to manage the health of populations"
<https://science.sciencemag.org/content/366/6464/447>`_, Science, 2019.
<https://science.sciencemag.org/content/366/6464/447>`_, Science, 2019.

.. [#6] Singh, Joachims `"Fairness of Exposure in Rankings"
<https://dl.acm.org/doi/10.1145/3219819.3220088>`_, KDD, 2018.
117 changes: 117 additions & 0 deletions examples/plot_ranking.py
@@ -0,0 +1,117 @@
# Copyright (c) Fairlearn contributors.
Member
I'm a bit surprised we're adding an example, but not a user guide section. Sometimes the latter can borrow from the former through literalinclude, but in my mind the first step is always the user guide. That said, I don't think we're making that distinction particularly clear at the moment, which is probably something to discuss on the community call again (topic: "structure of documentation" with a particular focus on user guide vs. examples)

Member
This example looks quite okay to me, we can just make sure we have links to it from the right places in the user guide and the API guides.

# Licensed under the MIT License.

"""
=========================================
Ranking
=========================================
"""

from fairlearn.metrics import exposure, utility, exposure_utility_ratio
from fairlearn.metrics import MetricFrame

# %%
# This notebook shows how to use Fairlearn with rankings. We showcase the example "Fairly
# Allocating Economic Opportunity" from the paper "Fairness of Exposure in Rankings" by Singh and
# Joachims (2018).
# The example demonstrates how small differences in item relevance can lead to large differences
# in exposure.
#
# Consider a web-service that connects employers (users) to potential employees (items).
# The web-service uses a ranking-based system to present a set of 6 applicants, of which 3 are
# male and 3 are female. The male applicants have relevances of 0.82, 0.81 and 0.80 for the
# employer, while the female applicants have relevances of 0.79, 0.78 and 0.77.
# In this setting, a relevance of 0.77 means that 77% of all employers issuing the query found
# the applicant relevant.
#
# The Probability Ranking Principle suggests ranking the applicants in decreasing order of
# relevance. What does this mean for the exposure of the two groups?

ranking_pred = [1, 2, 3, 4, 5, 6] # ranking
sex = ['Male', 'Male', 'Male', 'Female', 'Female', 'Female']
y_true = [0.82, 0.81, 0.80, 0.79, 0.78, 0.77]

# %%
# Here we define which metrics we want to analyze.
#
# - The `exposure` metric shows the average exposure that each group receives, based on the
#   positions its items occupy. Exposure is the value assigned to every position in the ranking,
#   computed with the standard drop-off of :math:`1/\log_2(1+j)` for position :math:`j`, as used
#   in Discounted Cumulative Gain (DCG), to account for position bias. If there are large
#   differences in exposure we could say that there is allocation harm in the data, i.e. males
#   are on average ranked much higher than females by the web-service.
#
# - The `utility` metric shows the average relevance of each group.
#
# - The `exposure_utility_ratio` metric shows quality-of-service harm in the data, since it
#   compares the average exposure of each group to its average relevance. If there are large
#   differences in this metric we could say that the exposure of some sensitive groups is not
#   proportional to their utility.
#
# A short check of the per-position exposure values follows below.
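
# %%
# As a quick, illustrative check of the drop-off described above, we can compute the
# per-position exposure values by hand with plain numpy; the group-level `exposure` values
# reported by `MetricFrame` below should correspond to the averages of these values within
# each group.

import numpy as np

positions = np.arange(1, 7)  # ranking positions 1 through 6
print(1 / np.log2(1 + positions))  # exposure decreases as we move down the ranking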

metrics = {
    'exposure (allocation harm)': exposure,
    'average utility': utility,
    'exposure/utility (quality-of-service)': exposure_utility_ratio
}

mf = MetricFrame(metrics=metrics,
                 y_true=y_true,
                 y_pred=ranking_pred,
                 sensitive_features={'sex': sex})

# Customize the plot
mf.by_group.plot(
kind="bar",
subplots=True,
layout=[1, 3],
legend=False,
figsize=(12, 4)
)

# Show the between-group ratio of each metric: 1 indicates parity, values close to 0 indicate a large disparity.
mf.ratio()

# %%
# The first plot shows that men receive significantly more exposure than women in this ranking,
# even though the second plot shows that the utility of the female group is comparable to that
# of the male group.
# Therefor we can say that the ranking contains quality-of-service harm against women, since the
Contributor
Suggested change:
- # Therefor we can say that the ranking contains quality-of-service harm against women, since the
+ # Therefore, we can say that the ranking shows evidence of quality-of-service harm against women, since the

# exposure/utility ratio is not equal (plot 3)

# %%
# How can we fix this? A simple solution is to rerank the items, in such a way that females get
Contributor
Suggested change:
- # How can we fix this? A simple solution is to rerank the items, in such a way that females get
+ # How can we fix this? A simple solution is to rerank the items, in such a way that women get

# more exposure and males get less exposure. For example we can switch the top male with the top
# female applicant and remeasure the quality-of-service harm.
Contributor
Suggested change:
- # female applicant and remeasure the quality-of-service harm.
+ # woman and remeasure the quality-of-service harm.


ranking_pred = [1, 2, 3, 4, 5, 6] # ranking
sex = ['Female', 'Male', 'Male', 'Male', 'Female', 'Female']
y_true = [0.79, 0.81, 0.80, 0.82, 0.78, 0.77] # Continuous relevance score

# Sanity check: the ranking, sensitive feature and relevance lists must have equal lengths
print(len(ranking_pred), len(sex), len(y_true))

# Analyze metrics using MetricFrame
# Note that, in contrast to classification problems, y_pred now requires a ranking
metrics = {
    'exposure (allocation harm)': exposure,
    'average utility': utility,
    'exposure/utility (quality-of-service)': exposure_utility_ratio
}

mf = MetricFrame(metrics=metrics,
                 y_true=y_true,
                 y_pred=ranking_pred,
                 sensitive_features={'sex': sex})

# Customize the plot
mf.by_group.plot(
kind="bar",
subplots=True,
layout=[1, 3],
legend=False,
figsize=(12, 4)
)

# Show the between-group ratio of each metric: 1 indicates parity, values close to 0 indicate a large disparity.
mf.ratio()

# %%
# The new plots show that the exposure and the exposure/utility ratio are now much more equal
# between the two groups.
25 changes: 23 additions & 2 deletions fairlearn/metrics/__init__.py
@@ -48,6 +48,16 @@
    _mean_underprediction,
    count)

from ._exposure import ( # noqa: F401
    exposure,
    utility,
    exposure_utility_ratio,
    allocation_harm_in_ranking_difference,
Contributor
Perhaps we can come up with a shorter name for these! How about simply exposure_difference/exposure_ratio for exposure? I'm not sure yet what to call exposure_utility though...

Contributor Author
I also don't like the names yet. exposure_utility_ratio_ratio? 😂

Contributor
@fairlearn/fairlearn-maintainers do you have any ideas for naming these?

Member
The doc says "Calculate the difference in exposure allocation"; based on that, the name for me would be exposure_allocation_difference.

    allocation_harm_in_ranking_ratio,
    quality_of_service_harm_in_ranking_difference,
    quality_of_service_harm_in_ranking_ratio
)


# Add the generated metrics of the form
# `<metric>_{difference,ratio,group_min,group_max}`
@@ -67,7 +77,11 @@
"demographic_parity_difference",
"demographic_parity_ratio",
"equalized_odds_difference",
"equalized_odds_ratio"
"equalized_odds_ratio",
"allocation_harm_in_ranking_difference",
"allocation_harm_in_ranking_ratio",
"quality_of_service_harm_in_ranking_difference",
"quality_of_service_harm_in_ranking_ratio"
]

_extra_metrics = [
@@ -80,4 +94,11 @@
"count"
]

__all__ = _core + _disparities + _extra_metrics + list(sorted(_generated_metric_dict.keys()))
_ranking_metrics = [
"exposure",
"utility",
"exposure_utility_ratio"
]

__all__ = _core + _disparities + _extra_metrics + _ranking_metrics \
    + list(sorted(_generated_metric_dict.keys()))