ENH Add ranking metrics #974

Open · wants to merge 19 commits into base: main
3 changes: 3 additions & 0 deletions docs/user_guide/assessment.rst
@@ -277,6 +277,9 @@
:func:`.selection_rate` . . Y Y
:func:`.true_negative_rate` . . Y Y
:func:`.true_positive_rate` . . Y Y
:func:`.exposure` . . Y Y
:func:`.utility` . . Y Y
Member:
utility seems a bit too generic. (I have a feeling @MiroDudik feels the same way...)
Perhaps ranking_utility?
To some extent I'm wondering if these should be grouped with the other metrics at all, or whether this deserves its own section with ranking metrics.

Contributor Author:
Since exposure and utility only apply to rankings, I also think it is better to call them ranking_exposure and ranking_utility.

Member:
I second the proposed names.

Member (@MiroDudik, May 11, 2022):
I'm fine with ranking_exposure. Another option would be dcg_exposure, which would allow us to introduce, for example, rbp_exposure in the future; see Eq. (2) here:

My impression is that utility, or even ranking_utility, is not the best naming choice for the objective we are calculating, and it is also not very standard: in most contexts, utility is just a synonym for score, so I would expect it to refer to things like dcg_score or ndcg_score. That is actually how the word utility is used in the "Fairness of Exposure in Rankings" paper. So I'd be in favor of using some variation of relevance. Maybe average_relevance or mean_relevance?

:func:`.proportional_exposure` . . Y Y
:func:`sklearn.metrics.accuracy_score` Y . Y Y
:func:`sklearn.metrics.balanced_accuracy_score` Y . . .
:func:`sklearn.metrics.f1_score` Y . . .
23 changes: 22 additions & 1 deletion docs/user_guide/fairness_in_machine_learning.rst
@@ -194,6 +194,24 @@ group loss primarily seeks to mitigate quality-of-service harms. Equalized
odds and equal opportunity can be used as a diagnostic for both allocation
harms as well as quality-of-service harms.
Contributor:
We might want to move this paragraph further down and add the ranking metrics as representing allocation / quality-of-service harm.


*Ranking*:

Fairlearn includes two constraints for rankings, both based on exposure: a measure of the amount
of attention an item is expected to receive, given its position in the ranking. Exposure is
Contributor:
Is this meant to be agnostic to the use case for ranking? How would the expectations for how much attention an instance might receive change in situations where rankings are bounded at particular intervals (e.g., a limited number of search results returned per page)?

computed as a logarithmic discount :math:`\frac{1}{\log_2(1+i)}` for each position :math:`i`, as used
in discounted cumulative gain (DCG).
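For example, the first three positions receive exposures

.. math::

    \frac{1}{\log_2(2)} = 1, \qquad
    \frac{1}{\log_2(3)} \approx 0.63, \qquad
    \frac{1}{\log_2(4)} = 0.5,

so the top positions capture a disproportionately large share of the attention.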

* *Exposure*: We try to allocate the exposure that each item receives fairly across the groups.
  A ranking :math:`\tau` has a fair exposure allocation under a distribution over :math:`(X, A, Y)`
  if the exposure assigned by :math:`\tau(X)` is statistically independent of the sensitive
  feature :math:`A`. [#6]_

* *Proportional exposure*: We try to keep the exposure that each item receives proportional to its
  "ground-truth" relevance; otherwise, small differences in relevance can lead to large differences
Contributor:
Are you able to say anything more about how this ground truth relevance is typically determined? i.e., is this something that data scientists would have access to a priori, or if not, is there guidance in the paper this is adapted from on how to determine this?

  in exposure. A ranking :math:`\tau` satisfies parity in quality-of-service under a distribution
  over :math:`(X, A, Y)` if the exposure assigned by :math:`\tau(X)` is proportional to :math:`Y`,
  independently of the sensitive feature :math:`A`. [#6]_ A numerical sketch of both criteria
  follows below.
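
The following minimal sketch (plain Python and NumPy, not the Fairlearn API) illustrates both
criteria on a small toy ranking: exposure parity compares the groups' average exposures, while
proportional exposure compares exposure relative to ground-truth relevance.

.. code-block:: python

    import numpy as np

    # Toy ranking: positions 1..6, a group label and a ground-truth relevance per item
    positions = np.arange(1, 7)
    groups = np.array(["Man", "Man", "Man", "Woman", "Woman", "Woman"])
    relevance = np.array([0.82, 0.81, 0.80, 0.79, 0.78, 0.77])

    exposure = 1 / np.log2(1 + positions)  # logarithmic position discount

    for g in ("Man", "Woman"):
        mask = groups == g
        avg_exposure = exposure[mask].mean()    # equal across groups -> fair exposure allocation
        avg_relevance = relevance[mask].mean()
        # roughly equal ratios across groups -> exposure proportional to relevance
        print(g, avg_exposure, avg_exposure / avg_relevance)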

Disparity metrics, group metrics
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

@@ -420,4 +438,7 @@ the algorithm may impact the intended outcomes of a given model.
<https://arxiv.org/pdf/1912.05511.pdf>`_, FAccT, 2021.

.. [#5] Obermeyer, Powers, Vogeli, Mullainathan `"Dissecting racial bias in an algorithm used to manage the health of populations"
<https://science.sciencemag.org/content/366/6464/447>`_, Science, 2019.

.. [#6] Singh, Joachims `"Fairness of Exposure in Rankings"
<https://dl.acm.org/doi/10.1145/3219819.3220088>`_, KDD, 2018.
145 changes: 145 additions & 0 deletions examples/plot_ranking.py
@@ -0,0 +1,145 @@
# Copyright (c) Fairlearn contributors.
Member:
I'm a bit surprised we're adding an example, but not a user guide section. Sometimes the latter can borrow from the former through literalinclude, but in my mind the first step is always the user guide. That said, I don't think we're making that distinction particularly clear at the moment, which is probably something to discuss on the community call again (topic: "structure of documentation" with a particular focus on user guide vs. examples)

Member:
This example looks quite okay to me, we can just make sure we have links to it from the right places in the user guide and the API guides.

# Licensed under the MIT License.

"""
=========================================
Ranking
=========================================
"""

from fairlearn.metrics import exposure, utility, proportional_exposure
from fairlearn.metrics import MetricFrame

# %%
# This notebook shows how to apply the functionality of :mod:`fairlearn.metrics` to ranking problems.
# We showcase the example "Fairly Allocating Economic Opportunity" from the paper
# `"Fairness of Exposure in Ranking" <https://dl.acm.org/doi/10.1145/3219819.3220088>`_
# by Singh and Joachims (2018).
# The example demonstrates how small differences in item relevance can lead to large differences
# in exposure. Differences in exposure can be harmful when we rank individuals, such as in hiring.
# Groups of people can also be indirectly affected by rankings of items such as books, music, or
# products in online retail. For example, it could be that products by authors of color are
# consistently ranked lower in search results.
#
# We reproduce the example from the paper, which for simplicity uses a binary gender category
# as the sensitive attribute. However, the metrics can also be applied to sensitive attributes
# with more than two categories.
#
# Consider a web-service that connects employers ("users") to potential employees ("items").
# The web-service uses a ranking-based system to present a set of 6 applicants, of which 3 are
# labelled man and 3 are labelled woman. The men have relevances of 0.82, 0.81, and 0.80 for
# the employer, while the women have relevances of 0.79, 0.78, and 0.77.
# In this setting, a relevance of 0.77 means that 77% of all employers issuing the query
# considered the applicant relevant.
#
# A simple way to rank the applicants is in decreasing order of relevance. What does this mean
# for the difference in exposure between the two groups?
#
# NOTE: The data used here should raise questions about
# :ref:`construct validity <fairness_in_machine_learning.construct-validity>`, since we
# should consider whether the "relevance" recorded in the data is a good measurement of actual
# relevance. Hindsight bias is also a concern: how can you know upfront whether an applicant
# will be a successful employee in the future? These concerns are important, but out of scope for
# this example.


ranking_pred = [1, 2, 3, 4, 5, 6] # ranking
gender = ['Man', 'Man', 'Man', 'Woman', 'Woman', 'Woman']
y_true = [0.82, 0.81, 0.80, 0.79, 0.78, 0.77]

# %%
# Here we define what metrics we want to analyze.
#
# - The :func:`fairlearn.metrics.exposure` metric measures the average exposure of a group of
# items, based on their positions in the ranking. Exposure is the value we assign to every
# position in the ranking, calculated with the standard exposure drop-off of
# :math:`1/\log_2(1+j)` used in Discounted Cumulative Gain
# (`DCG <https://scikit-learn.org/stable/modules/generated/sklearn.metrics.dcg_score.html>`_)
# to account for position bias. Large differences in exposure can be an indication of
# allocation harm, i.e., men being on average ranked much higher than women by the
# web-service.
#
# - The :func:`fairlearn.metrics.utility` metric indicates the average "ground-truth" relevance of
# a group.
#
# - The :func:`fairlearn.metrics.proportional_exposure` metric computes the average exposure of a
# group, divided by its utility (i.e., average relevance). Differences between groups indicate
# that the exposure of some groups is not proportional to their ground-truth utility, which can
# be seen as a measure of quality-of-service harm.
#
# We can compare ranking metrics across groups using :class:`fairlearn.metrics.MetricFrame`.

metrics = {
'exposure (allocation harm)': exposure,
'average utility': utility,
'proportional exposure (quality-of-service)': proportional_exposure
}

mf = MetricFrame(metrics=metrics,
y_true=y_true,
y_pred=ranking_pred,
sensitive_features={'gender': gender})

# Plot the per-group metrics as bar charts
mf.by_group.plot(
kind="bar",
subplots=True,
layout=[1, 3],
legend=False,
figsize=(12, 4)
)
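
# %%
# As a sanity check, we can recompute the per-group values by hand. This is only a sketch: it
# assumes that ``exposure`` is the group average of :math:`1/\log_2(1+j)` over the group's
# positions, ``utility`` the group's average relevance, and ``proportional_exposure`` their
# ratio, which may differ in detail from the implementation in :mod:`fairlearn.metrics`.

import numpy as np

position_exposure = 1 / np.log2(1 + np.asarray(ranking_pred))  # exposure assigned to each rank
for g in ('Man', 'Woman'):
    mask = np.asarray(gender) == g
    avg_exposure = position_exposure[mask].mean()   # average exposure of the group
    avg_utility = np.asarray(y_true)[mask].mean()   # average "ground-truth" relevance
    print(g, avg_exposure, avg_utility, avg_exposure / avg_utility)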

# %%
# We can compute the minimum ratio between groups using
# :meth:`fairlearn.metrics.MetricFrame.ratio`. The closer the ratio is to 1, the fairer
# the ranking.
mf.ratio()
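
# %%
# For intuition, the same ratios can be derived from ``mf.by_group`` directly. This sketch
# assumes that :meth:`fairlearn.metrics.MetricFrame.ratio` reports, for each metric, the
# smallest group value divided by the largest one; see the API documentation for the exact
# semantics.
by_group = mf.by_group
print(by_group.min() / by_group.max())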

# %%
# The first plot shows that, in the web-service's ranking, men receive significantly more exposure
# than women. The second plot, however, shows that the average utility of women is comparable to
# that of men. Therefore, we can say that the ranking shows evidence of quality-of-service harm
# against women, since the
Contributor:
Suggested change:
- # Therefor we can say that the ranking contains quality-of-service harm against women, since the
+ # Therefore, we can say that the ranking shows evidence of quality-of-service harm against women, since the

# proportional exposure is not equal (plot 3).

# %%
# How can we fix this? A simple solution is to rerank the items so that women get more exposure
# and men get less. For example, we can swap the top-ranked man with the top-ranked woman and
# remeasure the quality-of-service harm.

ranking_pred = [1, 2, 3, 4, 5, 6] # ranking
gender = ['Woman', 'Man', 'Man', 'Man', 'Woman', 'Woman']
y_true = [0.79, 0.81, 0.80, 0.82, 0.78, 0.77] # Continuous relevance score

# Analyze metrics using MetricFrame
# Note that, in contrast to the classification setting, y_pred now needs to be a ranking
metrics = {
'exposure (allocation harm)': exposure,
'average utility': utility,
'proportional exposure (quality-of-service)': proportional_exposure
}

mf = MetricFrame(metrics=metrics,
y_true=y_true,
y_pred=ranking_pred,
sensitive_features={'gender': gender})

# Plot the per-group metrics as bar charts
mf.by_group.plot(
kind="bar",
subplots=True,
layout=[1, 3],
legend=False,
figsize=(12, 4)
)

mf.ratio()

# %%
# The new plots show that exposure and proportional exposure are now much more equal across
# groups: the difference in exposure allocation is much smaller, and each group's exposure is
# better in proportion to its average utility.
#
# This was a highly stylized example using synthetic data, which served to show what exposure metrics are
Contributor:
Suggested change:
- # This was a simple example using fabricated data, just to show what the exposure metrics are
+ # This was a highly stylized example using synthetic data, which served to show what exposure metrics are

# capable of measuring. Manually swapping people in a ranking is, however, not recommendable with
Contributor:
Suggested change:
- # capable of measuring. Manually switching of people in a ranking is however not recommendable with
+ # capable of measuring. Manually swapping people in a ranking is however not recommendable with

# larger real-world data sets. There we would recommend mitigation techniques, which are
# unfortunately not yet implemented in Fairlearn.
29 changes: 27 additions & 2 deletions fairlearn/metrics/__init__.py
@@ -48,6 +48,16 @@
_mean_underprediction,
count)

from ._exposure import ( # noqa: F401
exposure,
utility,
proportional_exposure,
exposure_difference,
exposure_ratio,
proportional_exposure_difference,
proportional_exposure_ratio
)


# Add the generated metrics of the form
# `<metric>_{difference,ratio,group_min,group_max}`
@@ -67,7 +77,15 @@
"demographic_parity_difference",
"demographic_parity_ratio",
"equalized_odds_difference",
"equalized_odds_ratio"
"equalized_odds_ratio",
"allocation_harm_in_ranking_difference",
"allocation_harm_in_ranking_ratio",
"quality_of_service_harm_in_ranking_difference",
"quality_of_service_harm_in_ranking_ratio",
"exposure_difference",
"exposure_ratio",
"proportional_exposure_difference",
"proportional_exposure_ratio"
]

_extra_metrics = [
@@ -80,4 +98,11 @@
"count"
]

_ranking_metrics = [
"exposure",
"utility",
"proportional_exposure"
]

__all__ = _core + _disparities + _extra_metrics + _ranking_metrics \
+ list(sorted(_generated_metric_dict.keys()))