From 794a07d00c1b6ca0ddfce37abd41e00ce9153051 Mon Sep 17 00:00:00 2001
From: Miro Dudik
Date: Fri, 28 Feb 2020 09:47:04 -0500
Subject: [PATCH 1/5] add metrics API proposal

Signed-off-by: Miro Dudik
---
 api/METRICS.md | 91 ++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 91 insertions(+)
 create mode 100644 api/METRICS.md

diff --git a/api/METRICS.md b/api/METRICS.md
new file mode 100644
index 0000000..f5c77b0
--- /dev/null
+++ b/api/METRICS.md
@@ -0,0 +1,91 @@
+# API proposal for metrics
+
+## Example
+
+```python
+# For most sklearn metrics, we would have their group version that returns a Bunch with fields
+# * overall: overall metric value
+# * by_group: a dictionary that maps sensitive feature values to metric values
+
+summary = accuracy_score_by_group(y_true, y_pred, sensitive_features=sf, **other_kwargs)
+
+# Exporting into pd.Series or pd.DataFrame is not too complicated
+
+series = pd.Series({**summary.by_group, 'overall': summary.overall})
+df = pd.DataFrame({"model accuracy": {**summary.by_group, 'overall': summary.overall}})
+
+# Several types of scalar metrics for group fairness can be obtained from `summary` via transformation functions
+
+acc_difference = difference_from_summary(summary)
+acc_ratio = ratio_from_summary(summary)
+acc_group_min = group_min_from_summary(summary)
+
+# Most common disparity metrics should be predefined
+
+demo_parity_difference = demographic_parity_difference(y_true, y_pred, sensitive_features=sf, **other_kwargs)
+demo_parity_ratio = demographic_parity_ratio(y_true, y_pred, sensitive_features=sf, **other_kwargs)
+eq_odds_difference = equalized_odds_difference(y_true, y_pred, sensitive_features=sf, **other_kwargs)
+
+# For predefined disparities based on sklearn metrics, we adopt a consistent naming convention
+
+acc_difference = accuracy_score_difference(y_true, y_pred, sensitive_features=sf, **other_kwargs)
+acc_ratio = accuracy_score_ratio(y_true, y_pred, sensitive_features=sf, **other_kwargs)
+acc_group_min = accuracy_score_group_min(y_true, y_pred, sensitive_features=sf, **other_kwargs)
+```
+
+## Functions
+
+*Function signatures*
+
+```python
+metric_by_group(metric, y_true, y_pred, *, sensitive_features, **other_kwargs)
+# return the summary for the provided metrics
+
+make_metric_by_group(metric)
+# return a callable object <metric>_by_group:
+# <metric>_by_group(...) = metric_by_group(<metric>, ...)
+
+# Transformation functions returning scalars
+difference_from_summary(summary)
+ratio_from_summary(summary)
+group_min_from_summary(summary)
+group_max_from_summary(summary)
+
+# Metric-specific functions returing summary and scalars
+<metric>_by_group(y_true, y_pred, *, sensitive_features, **other_kwargs)
+<metric>_difference(y_true, y_pred, *, sensitive_features, **other_kwargs)
+<metric>_ratio(y_true, y_pred, *, sensitive_features, **other_kwargs)
+<metric>_group_min(y_true, y_pred, *, sensitive_features, **other_kwargs)
+<metric>_group_max(y_true, y_pred, *, sensitive_features, **other_kwargs)
+```
+
+*Summary of transformations*
+
+|transformation function|output|metric-specific function|code|aif360|
+|-----------------------|------|------------------------|----|------|
+|`difference_from_summary`|max - min|`<metric>_difference`|D|unprivileged - privileged|
+|`ratio_from_summary`|min / max|`<metric>_ratio`|R| unprivileged / privileged|
+|`group_min_from_summary`|min|`<metric>_group_min`|Min| N/A |
+|`group_max_from_summary`|max|`<metric>_group_max`|Max| N/A |
+
+*Supported metric-specific functions*
+
+|metric|variants|task|notes|aif360|
+|------|--------|-----|----|------|
+|`selection_rate`| G,D,R,Min | class | | ✓ |
+|`demographic_parity`| D,R | class | `selection_rate_difference`, `selection_rate_ratio` | `statistical_parity_difference`, `disparate_impact`|
+|`accuracy_score`| G,D,R,Min | class | sklearn | `accuracy` |
+|`balanced_accuracy_score` | G | class | sklearn | - |
+|`mean_absolute_error` | G,D,R,Max | class,reg | sklearn | class only: `error_rate`
+|`false_positive_rate` | G,D,R | class | | ✓ |
+|`false_negative_rate` | G | class | | ✓ |
+|`true_positive_rate` | G,D,R | class | | ✓ |
+|`true_negative_rate` | G | class | | ✓ |
+|`equalized_odds` | D,R | class | max of difference or ratio under `true_positive_rate`, `false_positive_rate` | - |
+|`precision_score`| G | class | sklearn | ✓ |
+|`recall_score`| G | class | sklearn | ✓ |
+|`f1_score`| G | class | sklearn | - |
+|`roc_auc_score`| G | prob | sklearn | - |
+|`log_loss`| G | prob | sklearn | - |
+|`mean_squared_error`| G | prob,reg | sklearn | - |
+|`r2_score`| G | reg | sklearn | - |
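
The signatures in the patch above only fix names and return values. As a minimal illustrative sketch (not part of any patch in this series), the group summary and the transformation functions could behave as follows; it assumes `sklearn.utils.Bunch` as the summary container used in the example.

```python
# Minimal sketch of the semantics described in PATCH 1/5; illustrative only.
# Assumes sklearn.utils.Bunch as the container for the summary.
import numpy as np
from sklearn.utils import Bunch


def metric_by_group(metric, y_true, y_pred, *, sensitive_features, **other_kwargs):
    """Evaluate `metric` overall and separately for each sensitive-feature value."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    sensitive_features = np.asarray(sensitive_features)
    by_group = {}
    for group in np.unique(sensitive_features):
        mask = sensitive_features == group
        by_group[group] = metric(y_true[mask], y_pred[mask], **other_kwargs)
    return Bunch(overall=metric(y_true, y_pred, **other_kwargs), by_group=by_group)


def group_min_from_summary(summary):
    return min(summary.by_group.values())


def group_max_from_summary(summary):
    return max(summary.by_group.values())


def difference_from_summary(summary):
    # "D" transformation: max - min across groups
    return group_max_from_summary(summary) - group_min_from_summary(summary)


def ratio_from_summary(summary):
    # "R" transformation: min / max across groups (nan if the max is zero)
    group_max = group_max_from_summary(summary)
    return group_min_from_summary(summary) / group_max if group_max != 0 else float("nan")
```

Under this reading, `accuracy_score_difference(y_true, y_pred, sensitive_features=sf)` is shorthand for `difference_from_summary(accuracy_score_by_group(y_true, y_pred, sensitive_features=sf))`.
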
From 3b93629f7088fe0f2f7be254411a908a1c8b72ad Mon Sep 17 00:00:00 2001
From: Miro Dudik
Date: Tue, 3 Mar 2020 18:14:58 -0500
Subject: [PATCH 2/5] add clarifications and confusion_matrix

Signed-off-by: Miro Dudik
---
 api/METRICS.md | 35 ++++++++++++++++++++++++++++-------
 1 file changed, 28 insertions(+), 7 deletions(-)

diff --git a/api/METRICS.md b/api/METRICS.md
index f5c77b0..391a660 100644
--- a/api/METRICS.md
+++ b/api/METRICS.md
@@ -33,13 +33,14 @@ acc_ratio = accuracy_score_ratio(y_true, y_pred, sensitive_features=sf, **other_
 acc_group_min = accuracy_score_group_min(y_true, y_pred, sensitive_features=sf, **other_kwargs)
 ```
 
-## Functions
+## Proposal
 
 *Function signatures*
 
 ```python
 metric_by_group(metric, y_true, y_pred, *, sensitive_features, **other_kwargs)
-# return the summary for the provided metrics
+# return the summary for the provided `metric`, where `metric` has the signature
+# metric(y_true, y_pred, **other_kwargs)
 
 make_metric_by_group(metric)
 # return a callable object <metric>_by_group:
@@ -51,7 +52,7 @@ ratio_from_summary(summary)
 group_min_from_summary(summary)
 group_max_from_summary(summary)
 
-# Metric-specific functions returing summary and scalars
+# Metric-specific functions returning summary and scalars
 <metric>_by_group(y_true, y_pred, *, sensitive_features, **other_kwargs)
 <metric>_difference(y_true, y_pred, *, sensitive_features, **other_kwargs)
 <metric>_ratio(y_true, y_pred, *, sensitive_features, **other_kwargs)
@@ -59,7 +60,7 @@ group_max_from_summary(summary)
 <metric>_group_max(y_true, y_pred, *, sensitive_features, **other_kwargs)
 ```
 
-*Summary of transformations*
+*Summary of transformations and transformation codes*
 
 |transformation function|output|metric-specific function|code|aif360|
 |-----------------------|------|------------------------|----|------|
@@ -68,7 +69,18 @@ group_max_from_summary(summary)
 |`group_min_from_summary`|min|`<metric>_group_min`|Min| N/A |
 |`group_max_from_summary`|max|`<metric>_group_max`|Max| N/A |
 
-*Supported metric-specific functions*
+*Summary of tasks and task codes*
+
+|task|definition|code|
+|----|----------|----|
+|binary classification|labels and predictions are in {0,1}|class|
+|probabilistic binary classification|labels are in {0,1}, predictions are in [0,1] and correspond to estimates of P(y\|x)|prob|
+|randomized binary classification|labels are in {0,1}, predictions are in [0,1] and represent the probability of drawing y=1 in a randomized decision|class-rand|
+|regression|labels and predictions are real-valued|reg|
+
+*Predefined metric-specific functions*
+
+* variants: D, R, Min, Max refer to the transformations from the table above; G refers to `<metric>_by_group`.
 
 |metric|variants|task|notes|aif360|
 |------|--------|-----|----|------|
@@ -76,7 +88,8 @@ group_max_from_summary(summary)
 |`demographic_parity`| D,R | class | `selection_rate_difference`, `selection_rate_ratio` | `statistical_parity_difference`, `disparate_impact`|
 |`accuracy_score`| G,D,R,Min | class | sklearn | `accuracy` |
 |`balanced_accuracy_score` | G | class | sklearn | - |
-|`mean_absolute_error` | G,D,R,Max | class,reg | sklearn | class only: `error_rate`
+|`mean_absolute_error` | G,D,R,Max | class, reg | sklearn | class only: `error_rate` |
+|`confusion_matrix` | G | class | sklearn | `binary_confusion_matrix` |
 |`false_positive_rate` | G,D,R | class | | ✓ |
 |`false_negative_rate` | G | class | | ✓ |
 |`true_positive_rate` | G,D,R | class | | ✓ |
@@ -87,5 +100,13 @@ group_max_from_summary(summary)
 |`f1_score`| G | class | sklearn | - |
 |`roc_auc_score`| G | prob | sklearn | - |
 |`log_loss`| G | prob | sklearn | - |
-|`mean_squared_error`| G | prob,reg | sklearn | - |
+|`mean_squared_error`| G | prob, reg | sklearn | - |
 |`r2_score`| G | reg | sklearn | - |
+
+## Dashboard questions
+
+1. Should we enable regression metrics for probabilistic classification?
+ * `mean_absolute_error`, `mean_squared_error`, `mean_squared_error(...,squared=False)`
+1. Should we introduce balanced error metrics for probabilistic classification?
+ * `balanced_mean_{squared,absolute}_error`, `balanced_log_loss`
+1. Do we keep `mean_prediction` and `mean_{over,under}prediction`?
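
The notes column above defines the two disparity metrics that are not plain transformations of a single sklearn metric: `demographic_parity_difference` is the difference transformation applied to `selection_rate`, and `equalized_odds_difference` takes the larger of the `true_positive_rate` and `false_positive_rate` differences. A self-contained illustrative sketch follows (not part of the patches; it assumes 0/1 labels and predictions and that every group contains both label values):

```python
# Illustrative sketch of the two derived disparity metrics described above.
import numpy as np


def demographic_parity_difference(y_true, y_pred, *, sensitive_features):
    """Largest gap in selection rate between any two sensitive-feature groups."""
    # y_true is unused, but kept to match the common metric signature
    y_pred = np.asarray(y_pred)
    sensitive_features = np.asarray(sensitive_features)
    rates = [np.mean(y_pred[sensitive_features == group] == 1)
             for group in np.unique(sensitive_features)]
    return max(rates) - min(rates)


def equalized_odds_difference(y_true, y_pred, *, sensitive_features):
    """Larger of the across-group gaps in true-positive rate and false-positive rate."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    sensitive_features = np.asarray(sensitive_features)
    gaps = []
    for label in (1, 0):  # label == 1 gives the TPR gap, label == 0 gives the FPR gap
        rates = [np.mean(y_pred[(sensitive_features == group) & (y_true == label)] == 1)
                 for group in np.unique(sensitive_features)]
        gaps.append(max(rates) - min(rates))
    return max(gaps)
```
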
From 9359f135a372c90d093f36bc8a7c76ca144ddfe8 Mon Sep 17 00:00:00 2001
From: Miro Dudik
Date: Tue, 3 Mar 2020 18:21:43 -0500
Subject: [PATCH 3/5] fix list markdown

Signed-off-by: Miro Dudik
---
 api/METRICS.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/api/METRICS.md b/api/METRICS.md
index 391a660..ec11657 100644
--- a/api/METRICS.md
+++ b/api/METRICS.md
@@ -106,7 +106,7 @@ group_max_from_summary(summary)
 ## Dashboard questions
 
 1. Should we enable regression metrics for probabilistic classification?
- * `mean_absolute_error`, `mean_squared_error`, `mean_squared_error(...,squared=False)`
+   * `mean_absolute_error`, `mean_squared_error`, `mean_squared_error(...,squared=False)`
 1. Should we introduce balanced error metrics for probabilistic classification?
- * `balanced_mean_{squared,absolute}_error`, `balanced_log_loss`
+   * `balanced_mean_{squared,absolute}_error`, `balanced_log_loss`
 1. Do we keep `mean_prediction` and `mean_{over,under}prediction`?
From ddde2ff751d2a9aee17ba42d0090a9e4c182283e Mon Sep 17 00:00:00 2001
From: Miro Dudik
Date: Thu, 12 Mar 2020 11:48:43 -0400
Subject: [PATCH 4/5] rename _by_group to _group_summary for consistency

Signed-off-by: Miro Dudik
---
 api/METRICS.md | 28 +++++++++++++++-------------
 1 file changed, 15 insertions(+), 13 deletions(-)

diff --git a/api/METRICS.md b/api/METRICS.md
index ec11657..e497e1d 100644
--- a/api/METRICS.md
+++ b/api/METRICS.md
@@ -3,18 +3,20 @@
 ## Example
 
 ```python
-# For most sklearn metrics, we would have their group version that returns a Bunch with fields
+# For most sklearn metrics, we will have their group version that returns
+# the summary of its performance across groups as well as the overall
+# performance, represented as a Bunch object with fields
 # * overall: overall metric value
 # * by_group: a dictionary that maps sensitive feature values to metric values
 
-summary = accuracy_score_by_group(y_true, y_pred, sensitive_features=sf, **other_kwargs)
+summary = accuracy_score_group_summary(y_true, y_pred, sensitive_features=sf, **other_kwargs)
 
 # Exporting into pd.Series or pd.DataFrame is not too complicated
 
 series = pd.Series({**summary.by_group, 'overall': summary.overall})
 df = pd.DataFrame({"model accuracy": {**summary.by_group, 'overall': summary.overall}})
 
-# Several types of scalar metrics for group fairness can be obtained from `summary` via transformation functions
+# Several types of scalar metrics for group fairness can be obtained from the group summary via transformation functions
 
 acc_difference = difference_from_summary(summary)
 acc_ratio = ratio_from_summary(summary)
@@ -38,13 +40,13 @@ acc_group_min = accuracy_score_group_min(y_true, y_pred, sensitive_features=sf,
 *Function signatures*
 
 ```python
-metric_by_group(metric, y_true, y_pred, *, sensitive_features, **other_kwargs)
-# return the summary for the provided `metric`, where `metric` has the signature
+group_summary(metric, y_true, y_pred, *, sensitive_features, **other_kwargs)
+# return the group summary for the provided `metric`, where `metric` has the signature
 # metric(y_true, y_pred, **other_kwargs)
 
-make_metric_by_group(metric)
-# return a callable object <metric>_by_group:
-# <metric>_by_group(...) = metric_by_group(<metric>, ...)
+make_metric_group_summary(metric)
+# return a callable object <metric>_group_summary:
+# <metric>_group_summary(...) = group_summary(<metric>, ...)
 
 # Transformation functions returning scalars
 difference_from_summary(summary)
@@ -52,15 +54,15 @@ ratio_from_summary(summary)
 group_min_from_summary(summary)
 group_max_from_summary(summary)
 
-# Metric-specific functions returning summary and scalars
-<metric>_by_group(y_true, y_pred, *, sensitive_features, **other_kwargs)
+# Metric-specific functions returning group summary and scalars
+<metric>_group_summary(y_true, y_pred, *, sensitive_features, **other_kwargs)
 <metric>_difference(y_true, y_pred, *, sensitive_features, **other_kwargs)
 <metric>_ratio(y_true, y_pred, *, sensitive_features, **other_kwargs)
 <metric>_group_min(y_true, y_pred, *, sensitive_features, **other_kwargs)
 <metric>_group_max(y_true, y_pred, *, sensitive_features, **other_kwargs)
 ```
 
-*Summary of transformations and transformation codes*
+*Transformations and transformation codes*
 
 |transformation function|output|metric-specific function|code|aif360|
 |-----------------------|------|------------------------|----|------|
@@ -69,7 +71,7 @@ group_max_from_summary(summary)
 |`group_min_from_summary`|min|`<metric>_group_min`|Min| N/A |
 |`group_max_from_summary`|max|`<metric>_group_max`|Max| N/A |
 
-*Summary of tasks and task codes*
+*Tasks and task codes*
 
 |task|definition|code|
 |----|----------|----|
@@ -80,7 +82,7 @@ group_max_from_summary(summary)
 
 *Predefined metric-specific functions*
 
-* variants: D, R, Min, Max refer to the transformations from the table above; G refers to `<metric>_by_group`.
+* variants: D, R, Min, Max refer to the transformations from the table above; G refers to `<metric>_group_summary`.
 
 |metric|variants|task|notes|aif360|
 |------|--------|-----|----|------|
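
With the rename in place, the factory behaviour and the naming convention can be sketched end to end. The sketch below is illustrative only: `group_summary` and `difference_from_summary` are re-implemented compactly here, and the helper `_derive_scalar_metric` is an assumption about how the predefined `<metric>_difference`-style functions could be generated, not part of the patch.

```python
# Illustrative end-to-end sketch of the renamed API; not part of the patches.
import functools

import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.utils import Bunch


def group_summary(metric, y_true, y_pred, *, sensitive_features, **other_kwargs):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    sensitive_features = np.asarray(sensitive_features)
    by_group = {}
    for group in np.unique(sensitive_features):
        mask = sensitive_features == group
        by_group[group] = metric(y_true[mask], y_pred[mask], **other_kwargs)
    return Bunch(overall=metric(y_true, y_pred, **other_kwargs), by_group=by_group)


def difference_from_summary(summary):
    return max(summary.by_group.values()) - min(summary.by_group.values())


def make_metric_group_summary(metric):
    # <metric>_group_summary(...) = group_summary(<metric>, ...)
    return functools.partial(group_summary, metric)


def _derive_scalar_metric(metric, transform):
    # hypothetical helper: <metric>_difference(...) = transform(group_summary(<metric>, ...))
    def derived(y_true, y_pred, *, sensitive_features, **other_kwargs):
        summary = group_summary(metric, y_true, y_pred,
                                sensitive_features=sensitive_features, **other_kwargs)
        return transform(summary)
    return derived


accuracy_score_group_summary = make_metric_group_summary(accuracy_score)
accuracy_score_difference = _derive_scalar_metric(accuracy_score, difference_from_summary)

# Usage on toy data, mirroring the Example section
y_true = [0, 1, 1, 0, 1, 1]
y_pred = [0, 1, 0, 0, 1, 1]
sf = ["a", "a", "a", "b", "b", "b"]

summary = accuracy_score_group_summary(y_true, y_pred, sensitive_features=sf)
print(pd.Series({**summary.by_group, 'overall': summary.overall}))
print(accuracy_score_difference(y_true, y_pred, sensitive_features=sf))
```
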
From 0b86e6d333bcc76d04c6cdce51c3256684e47944 Mon Sep 17 00:00:00 2001
From: Miro Dudik
Date: Mon, 16 Mar 2020 11:36:06 -0400
Subject: [PATCH 5/5] remove dashboard questions

Signed-off-by: Miro Dudik
---
 api/METRICS.md | 8 --------
 1 file changed, 8 deletions(-)

diff --git a/api/METRICS.md b/api/METRICS.md
index e497e1d..98a2be2 100644
--- a/api/METRICS.md
+++ b/api/METRICS.md
@@ -104,11 +104,3 @@ group_max_from_summary(summary)
 |`log_loss`| G | prob | sklearn | - |
 |`mean_squared_error`| G | prob, reg | sklearn | - |
 |`r2_score`| G | reg | sklearn | - |
-
-## Dashboard questions
-
-1. Should we enable regression metrics for probabilistic classification?
-   * `mean_absolute_error`, `mean_squared_error`, `mean_squared_error(...,squared=False)`
-1. Should we introduce balanced error metrics for probabilistic classification?
-   * `balanced_mean_{squared,absolute}_error`, `balanced_log_loss`
-1. Do we keep `mean_prediction` and `mean_{over,under}prediction`?