
MetricFrame should support metrics that don't require y_true and y_pred #756

MiroDudik opened this issue Apr 22, 2021 · 74 comments

@MiroDudik
Member

MiroDudik commented Apr 22, 2021

Right now MetricFrame only works with metrics with the signature metric(y_true, y_pred), but its disaggregation functionality should be much more broadly applicable. Two use cases of interest:

  1. Dataset-only metrics or prediction-only metrics. These are the metrics of the form metric(y_true) and metric(y_pred). While in principle these could be handled by the current API, it's a bit confusing.
  2. Metrics for settings beyond classification / regression. For example, to evaluate a contextual bandit algorithm, the metrics have the signature metric(actions, rewards, propensities). This use case comes up in settings with partial observations, e.g., in lending: we only observe whether the person repays a loan (reward) for the specific loan type we provide (action), including "no loan".
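For illustration, an inverse-propensity-scoring metric of that shape might look roughly like this (a sketch; the function name, argument names, and estimator are placeholders rather than an existing Fairlearn API):

import numpy as np

def inverse_propensity_score(actions, rewards, propensities, target_actions):
    # Keep each observed reward only when the logged action matches the
    # action the evaluated policy would have taken, reweighted by the
    # inverse of the logging propensity.
    actions = np.asarray(actions)
    rewards = np.asarray(rewards)
    propensities = np.asarray(propensities)
    target_actions = np.asarray(target_actions)
    matches = (actions == target_actions).astype(float)
    return np.mean(matches * rewards / propensities)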

I think regardless of how the API is tweaked, there's an obvious first step. And then hopefully a not-too-controversial second step.

Step 1: Make MetricFrame arguments keyword-only

The suggestion is to change the current API, which is this:

MetricFrame(metric, y_true, y_pred, *, sensitive_features, control_features=None, sample_params=None)

To something like this:

MetricFrame(*, metrics, y_true=None, y_pred=None, sensitive_features, control_features=None, sample_params=None)

So all arguments would be keyword-only and metric would become metrics (for consistency with other arguments and with columns in pandas). The functionality wouldn't change.

Step 2: Allow flexible names of shared sample parameters

Change the signature to:

MetricFrame(*, metrics, sensitive_features, control_features=None, sample_params=None, **shared_sample_params)

The idea is that any of the shared_sample_params are passed on to all metrics, whereas sample_params is (as before) a dictionary. With the new signature, it would still be possible to write things like

mf = MetricFrame(metrics=skm.accuracy_score, y_true=y_true, y_pred=y_pred, sensitive_features=sf)

But it would be simple to use other kinds of metrics that do not work with y_true and y_pred.
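To make the first use case concrete, here is roughly what a prediction-only metric looks like against the current signature; the data and the metric are made up for illustration, and the metric has to accept (and ignore) y_true just to fit the API:

import numpy as np
from fairlearn.metrics import MetricFrame

y_true = np.array([0, 1, 1, 0, 1, 0])   # never used by the metric below
y_pred = np.array([0, 1, 0, 0, 1, 1])
sf = np.array(["a", "a", "b", "b", "a", "b"])

def fraction_predicted_positive(y_true, y_pred):
    # A prediction-only metric still has to accept (and ignore) y_true.
    return np.mean(y_pred == 1)

mf = MetricFrame(fraction_predicted_positive, y_true, y_pred,
                 sensitive_features=sf)
print(mf.by_group)

Under the proposed signature, the unused y_true could simply be omitted.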

@MiroDudik
Member Author

Resolving this issue would also enable #676 .

@MiroDudik added the API label on Apr 22, 2021
@riedgar-ms
Member

riedgar-ms commented Apr 30, 2021

We definitely should be figuring out how to offer metrics for other scenarios. However, I'm a little concerned about a few points. Doing (1) is going to break all existing code. Now, while we hardly have the usage of (say) sklearn, that's still not a good thing to do. And I think that doing (2) sort of does the opposite - once you've accepted **kwargs, you can't add any more arguments of your own, since you might clash with something a user has.

For (2), we could add a shared_sample_params dictionary and explode it when we invoke the metric functions.

For the rest, would it be wiser to say "MetricFrame is for tabular data; we also offer X, Y, Z ....."

@MiroDudik
Member Author

MiroDudik commented May 3, 2021

If we want to implement (1), we would not break everything right away. We would start off by issuing deprecation warnings. Would that work, @riedgar-ms @romanlutz @hildeweerts @adrinjalali ?

I just think that our current API is really limiting us, so I'd rather fix this sooner rather than commit to a future of docs that describe all kinds of workarounds.

I want to hold off discussing (2), because I acknowledge that there are a variety of considerations... I think that we have several options there. It would be a non-breaking change, so there's less urgency than with (1), which, if we decide to go that route, would be nice to accomplish before the SciPy tutorial (at least the version where we have a new format, and issue warnings if somebody is using the old format).

@riedgar-ms
Member

Is there a way of detecting whether an argument was passed by position rather than by keyword?

@hildeweerts
Contributor

I am generally not opposed to Step 1 (although I don't really see the need for changing metric into metrics).

When we get to Step 2 things get a bit tricky IMO. I am afraid that trying to fit all different kinds of learning tasks into one MetricFrame will become extremely confusing for novice users. An alternative might be to leave MetricFrame as is and create a new data structure for other types of tasks (e.g., differentiate between things like SupervisedMetricFrame, UnsupervisedMetricFrame, ReinforcementMetricFrame).

@MiroDudik
Member Author

Re. metric -> metrics--the rationale is that we have sensitive_features and control_features. I've also noticed that in the tutorials I keep writing things like:

# Define metrics
metrics = {
    'accuracy': skm.accuracy_score,
    'precision': skm.precision_score,
    'recall': skm.recall_score,
    'false_positive_rate': flm.false_positive_rate,
}

mf = MetricFrame(metrics, ...)

So it seems that metrics is the more appropriate name--it makes no sense to change the name in isolation, but if we decide to implement (1), then it would be an opportunity to change this as well.

@riedgar-ms : the transitional API (i.e., for the purposes of deprecating the old API) would be something like:
MetricFrame(*args, metrics, y_true, y_pred, sensitive_features, control_features, sample_params)

@romanlutz
Member

I'm not terribly opinionated on this issue, but if we move to something like MetricFrame(*, metrics, sensitive_features, control_features, sample_params, **shared_sample_params) then we better have extremely good documentation. Reading this in an API documentation I would have no clue what to do with that. In contrast, the current signature is very intuitive to me. Obviously that can be fixed with a couple of nice examples right in the API doc and updating the user guide, but I want to make sure this is on our radar (i.e., no code PR without updating the docs).

I would also suggest providing a few examples of what that would look like if we have different kinds of metrics (e.g. accuracy needs y_true and y_pred, selection rate doesn't need a second set of ys) in the same set of metrics passed in.

Finally, I'm wondering whether the right thing to do here would be to add a warning that MetricFrame will have keyword-only args starting with the next version so people should adjust their scripts. [If so, we should make the switch to metrics instead of metric right away, otherwise it breaks people again.]

@MiroDudik
Member Author

@romanlutz -- I agree that step (2) requires some care, but what do you think about carrying out step (1), with the intermediate API of the form:

MetricFrame(*args, metrics=None, y_true=None, y_pred=None,
            sensitive_features, control_features=None, sample_params=None)

This would support the current behavior when 1-3 positional arguments are provided, but in those cases it would give a warning that this calling format is being deprecated.
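A minimal sketch of how that transitional constructor could detect positional use and warn (illustrative only, not the actual implementation):

import warnings

class TransitionalMetricFrame:
    # Sketch of the deprecation shim; the real class would go on to compute
    # the disaggregated metrics exactly as today.
    def __init__(self, *args, metrics=None, y_true=None, y_pred=None,
                 sensitive_features=None, control_features=None,
                 sample_params=None):
        if args:
            if len(args) > 3:
                raise TypeError("expected at most 3 positional arguments")
            warnings.warn(
                "Passing metrics, y_true and y_pred positionally is "
                "deprecated; please switch to keyword arguments.",
                FutureWarning)
            # Fold the positional values onto the old parameter order.
            positional = dict(zip(("metrics", "y_true", "y_pred"), args))
            metrics = positional.get("metrics", metrics)
            y_true = positional.get("y_true", y_true)
            y_pred = positional.get("y_pred", y_pred)
        self.metrics = metrics
        self.y_true = y_true
        self.y_pred = y_pred
        self.sensitive_features = sensitive_features
        self.control_features = control_features
        self.sample_params = sample_params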

@riedgar-ms
Member

I think that there are a few different things going on here, and I'd like to separate the different threads.

Firstly, I think that the idea behind shared_sample_params is a good one. Indeed, I considered putting something like that in last autumn when I was writing MetricFrame. I didn't because it wasn't a 'must have' feature. However, as I said above, I think it should be a Dict[str,vector] and not **kwargs. Going with the latter makes it harder for us to add any other named arguments of our own (since whatever name we pick, we're bound to break someone) at a later date. We 'explode' the Dict when the metric functions themselves are invoked.
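Concretely, the 'explode' step could look something like this (a sketch that assumes every metric accepts y_true and y_pred positionally):

def evaluate_metrics(metrics, y_true, y_pred,
                     sample_params=None, shared_sample_params=None):
    # metrics: dict of name -> callable
    # sample_params: dict of name -> per-metric keyword arguments
    # shared_sample_params: keyword arguments passed to every metric
    sample_params = sample_params or {}
    shared_sample_params = shared_sample_params or {}
    results = {}
    for name, fn in metrics.items():
        kwargs = dict(shared_sample_params)         # shared values first
        kwargs.update(sample_params.get(name, {}))  # per-metric values win
        results[name] = fn(y_true, y_pred, **kwargs)
    return results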

Secondly, I certainly don't think that MetricFrame is the last word on these sort of metrics. Indeed, I'm sure that areas like text or vision may well need similar but different functionality (I don't know enough about them to say). But like @hildeweerts , I think that it may well be better to take each by itself, and figure out what works best for that domain, rather than trying to shoehorn them into MetricFrame.

I can see the logic in renaming the metric argument to metrics (it is a potential break, but only if people are naming their arguments). I'm not sure I understand the motivation behind forcing named arguments. I think that would make the interface much less sklearn-like.

@MiroDudik
Member Author

@riedgar-ms : one of the reasons to have keyword-only parameters is so that you can skip them when they're not needed. For example, if we went with the signature:

MetricFrame(*, metrics, y_true=None, y_pred=None, sensitive_features, control_features=None, sample_params=None)

Then it would be okay to drop y_true and y_pred and provide all the metric parameters via sample_params.

@MiroDudik
Member Author

Re. shared_sample_params, I think that **shared_sample_params would make the pattern more similar to what we do right now with y_true and y_pred. We could take advantage of inspect.signature() and throw a warning if there's a conflict. In those rare cases, the users could use sample_params as a workaround.

But I also don't mind if y_true=None and y_pred=None are grandfathered in and anything else needs to be provided via an actual dictionary shared_sample_params rather than **shared_sample_params.
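Roughly, the signature-based routing I have in mind would be (a sketch, not Fairlearn code):

import inspect
import warnings

def route_shared_params(metrics, shared_params):
    # metrics: dict of name -> callable
    # shared_params: dict of parameter name -> array-like value
    per_metric_kwargs = {}
    used = set()
    for name, fn in metrics.items():
        accepted = set(inspect.signature(fn).parameters)
        kwargs = {p: v for p, v in shared_params.items() if p in accepted}
        used.update(kwargs)
        per_metric_kwargs[name] = kwargs
    unused = set(shared_params) - used
    if unused:
        # Probably a misspelled parameter name; warn rather than drop silently.
        warnings.warn(f"Parameters not accepted by any metric: {sorted(unused)}")
    return per_metric_kwargs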

@riedgar-ms
Member

@riedgar-ms : one of the reasons to have keyword-only parameters is so that you can skip them when they're not needed. For example, if we went with the signature:

MetricFrame(*, metrics, y_true=None, y_pred=None, sensitive_features, control_features=None, sample_params=None)

Then it would be okay to drop y_true and y_pred and provide all the metric parameters via sample_params.

If the use case is "we can drop the current parameters, and have a different set of parameters instead" then we should probably be creating a new class to handle the second set of parameters. Right now, I feel that MetricFrame does the fairly simple task of taking an sklearn-style (y_true, y_pred) metric, and turning it into one with grouping (via the sensitive features). I'd prefer to keep it that way, and work out the best API for a new use case from a clean sheet.

It may be that there is a more general API hiding within, and that we end up changing MetricFrame to wrap (or subclass) that. But I think it would still be better to keep the existing MetricFrame API around, doing its simple job.

@romanlutz
Member

I feel like this discussion could benefit from some concrete examples. Like @riedgar-ms (and @hildeweerts I think) I'm currently struggling to see the benefit of complicating the currently very intuitive API, but perhaps with 3-5 examples of what this would look like for actual metrics (say, one of the existing ones we support, and then whichever new ones we might be able to accommodate through this) we might see the benefits (?)

@hildeweerts
Contributor

It may be that there is a more general API hiding within, and that we end up changing MetricFrame to wrap (or subclass) that. But I think it would still be better to keep the existing MetricFrame API around, doing its simple job.

I was thinking the same!

I think the majority of users will be looking for classification/regression metrics and may not even be familiar with reinforcement learning. Having to look at examples to understand how to use an API, even in the 'simplest' scenario, signals that it is not intuitive.

I like @romanlutz's idea of writing out some examples!

@MiroDudik
Member Author

Is there resistance even to Step 1, which would just modify the current API to:

MetricFrame(*, metrics, y_true=None, y_pred=None, sensitive_features, control_features=None, sample_params=None)

I don't think that it would make the API more confusing, would it?

I think that there are good use cases that have nothing to do with reinforcement learning and that we have already run into (dataset-only metrics like demographic parity, streaming metrics which are currently being implemented, and word-error-rate-like settings which have been requested as well). But let me create some short examples to demonstrate.

@riedgar-ms
Member

It would break existing code, and require us to bump the version to v0.7.0. If we do that, I think we should release v0.7.0 as soon as possible after making the change. The long gap between v0.4.6 and v0.5.0 (the latter introducing the MetricFrame itself) has given us quite a lot of trouble, since the things in the repo were incompatible with the package on PyPI.

I would not be overjoyed at defaulting y_true and y_pred to None, though. That is not how sklearn metrics work.

At the same time, you could also add the shared_sample_params dictionary, since I think that is a great thing to have.

@riedgar-ms
Member

riedgar-ms commented May 7, 2021

Based on the discussion in the community call, I'm happy with the following being done:

  1. Renaming metric to metrics in the argument list
  2. Switching to keyword arguments
  3. Adding shared_sample_params (as a dictionary, not **kwargs)

My only concern is orchestrating a swift release, since these are breaking changes.

When it comes to metrics for other problem types, I would like to know more before deciding to support those by extending MetricFrame (rather than creating a new class).

@hildeweerts
Contributor

The reasoning behind shared_sample_params is a bit unclear to me still. It seems like a pretty niche situation but I might be missing something?

It seems to me that generally speaking the sample_params will be similar for different metrics of the same scenario (e.g., either classification or reinforcement learning). And in the niche cases that it's not, you might as well just create two MetricFrame objects.

So for me at least some examples would still be appreciated, @MiroDudik :D

@MiroDudik
Member Author

I know... examples... soonish?

@riedgar-ms : re. swift release, my suggestion re. staging is

1. backward compatible implementation with deprecation warning:

MetricFrame(*args, metrics, y_true=None, y_pred=None, sensitive_features, control_features=None, sample_params=None)

2. move to keyword-only arguments:

MetricFrame(*, metrics, y_true=None, y_pred=None, sensitive_features, control_features=None, sample_params=None)

3. addition of any other arguments like shared_sample_params (but that could happen later or together with 2)

Examples are coming next week... promise!

@riedgar-ms
Member

I'm not keen on allowing None for y_true and y_pred; not until we're really sure we want to expand on MetricFrame in this way. Otherwise, I'm OK with those two steps. I think that 3 could come before 2 actually. Right now, I don't think that (2) should happen before v0.8.0.

@MiroDudik
Member Author

@riedgar-ms: I believe that (1) must have y_true=None and y_pred=None, otherwise we cannot be both backward compatible and able to issue a deprecation warning... so in a sense (1) might be forced. We can discuss whether (2) should have None or not after I show those promised examples...

@riedgar-ms
Member

Hmmm.... good point.

@MiroDudik
Member Author

Finally, some examples... in those examples I'm considering two alternative proposals for the API (they slightly differ from the original proposal in my issue at the top):

Alternative A: Optional y_true / y_pred / sensitive_features

MetricFrame(*, metrics, y_true=None, y_pred=None, sensitive_features=None, control_features=None, sample_params=None)

Alternative B: Also allow any keyword arguments and internally use inspect.signature to decide who gets what

MetricFrame(*, metrics, y_true=None, y_pred=None, sensitive_features=None, control_features=None,
            sample_params=None, **direct_sample_params)

The idea is that **direct_sample_params would contain sample parameters and each of them would be passed to all the metrics whose signature includes that parameter name. We retain the nested dictionary sample_params as a way to deal with naming conflicts (but there's also a work-around using lambda expressions).

Now the scenarios. I've written them down purely technically, but Scenarios 1 & 2 came up in various notebooks, Scenario 3 is already under a PR, and Scenario 4 comes up in settings with censored data (e.g., many allocation settings).

Scenario 1: Classification metrics and scoring metrics in the same metric frame

For example, let's say you train a logistic model and would like to evaluate both its classification accuracy as well as the area under the ROC curve.

Say we have:

y_true = # vector of binary classification labels
sf = # vector of sensitive features
y_pred = estimator.predict(X) # binary predictions
y_score = estimator.predict_proba(X)[1] # scores

Current approach is something like this:

metrics = {
    'accuracy': skm.accuracy_score,
    'auc': lambda y_true, y_pred, y_score: skm.roc_auc_score(y_true, y_score)
}
mf = MetricFrame(metrics, y_true, y_pred, sensitive_features=sf, 
                 sample_params={
                     'auc': {'y_score': y_score}})

Alternative 1A: Optional y_true / y_pred

metrics = {
    'accuracy': skm.accuracy_score,
    'auc': skm.roc_auc_score
}
mf = MetricFrame(metrics=metrics, y_true=y_true, sensitive_features=sf, 
                 sample_params={
                     'accuracy': {'y_pred': y_pred},
                     'auc': {'y_score': y_score}})

Alternative 1B: Also allow any keyword arguments and internally use inspect.signature to decide who gets what

metrics = {
    'accuracy': skm.accuracy_score,
    'auc': skm.roc_auc_score
}
mf = MetricFrame(metrics=metrics, y_true=y_true, y_pred=y_pred, y_score=y_score, sensitive_features=sf)

Scenario 2: Multiple models in the same metric frame

Say we have three models giving predictions y_pred1, y_pred2, y_pred3 that we wish to compare. The current approach is something like:

metrics = {
    'model1_accuracy': skm.accuracy_score,
    'model2_accuracy': lambda y_true, y_pred, y_pred2: skm.accuracy_score(y_true, y_pred2),
    'model3_accuracy': lambda y_true, y_pred, y_pred3: skm.accuracy_score(y_true, y_pred3),
}
mf = MetricFrame(metrics, y_true, y_pred, sensitive_features=sf, 
                 sample_params={
                     'model2_accuracy': {'y_pred2': y_pred2},
                     'model3_accuracy': {'y_pred3': y_pred3}})

Alternative 2A: Optional y_true / y_pred

metrics = {
    'model1_accuracy': skm.accuracy_score,
    'model2_accuracy': skm.accuracy_score,
    'model3_accuracy': skm.accuracy_score
}
mf = MetricFrame(metrics=metrics, y_true=y_true, sensitive_features=sf, 
                 sample_params={
                     'model1_accuracy': {'y_pred': y_pred1},
                     'model2_accuracy': {'y_pred': y_pred2},
                     'model3_accuracy': {'y_pred': y_pred3}})

Alternative 2B: Also allow any keyword arguments and internally use inspect.signature to decide who gets what

metrics = {
    'model1_accuracy': lambda y_true, y_pred1: skm.accuracy_score(y_true, y_pred1),
    'model2_accuracy': lambda y_true, y_pred2: skm.accuracy_score(y_true, y_pred2),
    'model3_accuracy': lambda y_true, y_pred3: skm.accuracy_score(y_true, y_pred3),
}
mf = MetricFrame(metrics=metrics, y_true=y_true,
                 y_pred1=y_pred1, y_pred2=y_pred2, y_pred3=y_pred3, sensitive_features=sf)

(Note that the solution from 2A would still be allowed. I think that 2A is cleaner than 2B.)

Scenario 3: Streaming metrics

This is the scenario from #670 (and #655). The idea is to support adding data and updating a metric frame in a streaming fashion.

The currently proposed initialization is this:

mf = MetricFrame(skm.recall_score, y_true=[], y_pred=[], sensitive_features=[], streaming=True)

Alternative 3A: Optional y_true / y_pred / sensitive_features

mf = MetricFrame(skm.recall_score, streaming=True)

Scenario 4: Metrics that don't use y_true and y_pred

In cost-sensitive learning, the metric parameters would be y_pred and costs (for each example, providing a vector of costs for predicting each of the classes). In contextual bandits, the metric parameters would be actions_taken, rewards, propensities, actions_target (or something like that).

Below, assume that there are metrics defined for the above two settings, say:

cost_sensitive_loss(*, costs, y_pred)  # for cost-sensitive learning
inverse_propensity_score(*, actions_taken, rewards, propensities, actions_target) # for contextual bandits

Current approach:

def cost_sensitive_loss_wrapper(y_true, y_pred, *, costs):
    return cost_sensitive_loss(costs=costs, y_pred=y_pred)

def ips_wrapper(y_true, y_pred, *, actions_taken, rewards, propensities, actions_target):
    return inverse_propensity_score(actions_taken=actions_taken, rewards=rewards,
                                    propensities=propensities, actions_target=actions_target)

cost_sensitive_loss_mf = MetricFrame(cost_sensitive_loss_wrapper,
    dummy_y_true, y_pred, sensitive_features=sf,
    sample_params = {'costs': costs})

ips_mf = MetricFrame(ips_wrapper,
    dummy_y_true, dummy_y_pred, sensitive_features=sf,
    sample_params = {'actions_taken': actions_taken, 'rewards': rewards,
                     'propensities': propensities, 'actions_target': actions_target})

Alternative 4A: Optional y_true / y_pred

cost_sensitive_loss_mf = MetricFrame(metrics=cost_sensitive_loss,   # wrapper not needed
    y_pred=y_pred, sensitive_features=sf,
    sample_params = {'costs': costs})

ips_mf = MetricFrame(metrics=inverse_propensity_score,   # wrapper not needed
    sensitive_features=sf,
    sample_params = {'actions_taken': actions_taken, 'rewards': rewards,
                     'propensities': propensities, 'actions_target': actions_target})

Alternative 4B: Also allow any keyword arguments and internally use inspect.signature to decide who gets what

cost_sensitive_loss_mf = MetricFrame(metrics=cost_sensitive_loss,
    y_pred=y_pred, sensitive_features=sf,
    costs=costs)

ips_mf = MetricFrame(metrics=inverse_propensity_score,
    sensitive_features=sf,
    actions_taken=actions_taken, rewards=rewards,
    propensities=propensities, actions_target=actions_target)

@MiroDudik
Member Author

@romanlutz , @riedgar-ms , @hildeweerts : please take a look at the examples above, and let me know what you think about Alternative A vs Alternative B (which just extends Alternative A). In particular, do any of you see issues with Alternative A?

@riedgar-ms
Member

I'm still not enamoured of accepting **kwargs in a function signature. It means that if we ever want to add new arguments to the signature, we're going to break someone. That is why I've been arguing for using a regular dictionary there. I can see that further using inspect.signature is something which can give a 'wow factor' in a demo. It is also a way of enabling users to elevate typos into really subtle bugs.

I still feel that we should leave MetricFrame largely as-is (with the exception of converting to keyword-only arguments.... note that this does not involve making y_true and y_pred optional in the above proposal, since there is now a transition API which accepts the positional arguments with a warning). It does a fairly simple thing well, so keep it that way.

I would be happy to start up a fresh discussion group about what a more general disaggregated metric should look like (and as I mentioned above, it's entirely possible that the existing MetricFrame would end up being refactored to use that feature).

@hildeweerts
Contributor

Thanks a lot for these examples - this makes things a lot clearer for me! Given these examples, I find Alternative A pretty intuitive. If I understand correctly, the most basic scenario with just y_true and y_pred doesn't really change apart from having to name the arguments, right?

At a first glance, I am a bit concerned that Alternative B contains too much "magic" behind the scenes which makes it more difficult to read (and debug) code. I do understand that writing all the dictionaries can be tedious. Adding subclasses for specific types of problems (e.g., RL) would already go a long way I think (and is much easier to properly document without overwhelming novice users).

@MiroDudik
Member Author

MiroDudik commented May 17, 2021

@riedgar-ms : what do you think about Alternative A? I think it would go a long way towards addressing the main issues and it doesn't introduce **kwargs.

@hildeweerts : correct, our current functionality is supported by both Alternative A and Alternative B by just requiring y_true, y_pred, and metrics to be named.

@riedgar-ms
Member

@hildeweerts
Contributor

Re. MetricFrameFromValues: from a user perspective I think I would prefer a from_frame (or maybe from_precomputed?) function over a subclass, because it seems more natural considering the familiar pandas API that has 'normal' initialization and from_csv() etc. I think the add_data() problem should be pretty easy to solve by adding an attribute precomputed_ (or whatever you'd like to call it) that determines which add_data() is used.

Re. function factory suggested by @riedgar-ms : this is for a different problem with varying columns y_pred/y_true, right? Or also for the precomputed metrics? In that case, could you expand a bit on what that would look like?

@riedgar-ms
Member

riedgar-ms commented Jun 17, 2021

The factory function would be something like:

import functools

def expand_y_pred(y_true, y_pred, f):
    # y_true is accepted only so that MetricFrame can dispatch (y_true, y_pred);
    # only y_pred is forwarded to the wrapped function f.
    assert len(y_true) == len(y_pred)
    return f(y_pred)

def make_metric_from_y_pred_function(function):
    return functools.partial(expand_y_pred, f=function)

in which case we could define:

selection_rate = make_metric_from_y_pred_function(np.mean)

(note I have only written the code above, not actually tried it). There could be a similar make_metric_from_y_true_function() as well.

Does that make more sense @hildeweerts ?

@hildeweerts
Contributor

Ah, yes, I see. Thanks for the explanation!

I can see the added value of the factory function, but I wonder how easy it would be for (novice) Fairlearn users to identify the need for it - particularly for those who skip the user guide.

I still think there's added value in having specific data structures for different types of ML tasks, rather than having to wrap functions to fit into the supervised learning paradigm - if that makes sense? E.g., with named arguments it seems like we should be able to handle y_true/y_pred for supervised problems, and perhaps labels/labels_true/labels_pred for unsupervised problems, etc. Just to make things easier for people who just want to use more standard scikit-learn-like metrics.

But I am not sure what's the best way forward here... looking forward to hearing other people's thoughts :)

@riedgar-ms
Member

Can you expand on labels / labels_true / labels_pred ? What's the third category for - I've generally been thinking of it as 'y_pred is anything coming from the model, and y_true is anything coming from the original data.'

If we did start allowing different names, how would we handle routing arguments? I'm concerned that if we try looking at our list of arguments, and using reflection on the supplied metric functions, we'll end up creating a source of subtle bugs. Think sample_weight vs sample_weights.

@hildeweerts
Contributor

Can you expand on labels / labels_true / labels_pred ? What's the third category for - I've generally been thinking of it as 'y_pred is anything coming from the model, and y_true is anything coming from the original data.'

That is correct! labels is the equivalent y for clustering problems. For some reason there exist both supervised and unsupervised metrics to evaluate clustering results (https://scikit-learn.org/stable/modules/classes.html#clustering-metrics), hence the three categories.

@hildeweerts
Contributor

If we did start allowing different names, how would me handle routing arguments? I'm concerned that if we try looking at our list of arguments, and using reflection on the supplied metric functions, we'll end up creating a source of subtle bugs. Think sample_weight vs sample_weights.

I was thinking we could have a separate SupervisedMetricFrame that directly routes y_true and y_pred (if they're supplied) and UnsupervisedMetricFrame that directs labels, labels_pred, labels_true. In this way people who have specific use cases don't need to work with function factories. We could still have the generic MetricFrame + function factories to facilitate people who require more flexibility.

@kstohr

kstohr commented Jun 18, 2021

My issue with MetricFrame was that it computed the metric on instantiation which meant that the other methods could not be accessed unless I passed a metric. So you'll see in my attempt at a roc_curves class that there is a pseudo metric (tks @MiroDudik, for that solution) to enable that class to take advantage of MetricFrame's by_group method.

What do you think of introducing static methods within the MetricFrame class for helper functions like splitting the data by sensitive feature?
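For example, such a helper could be little more than a thin wrapper around a pandas groupby (a sketch, not an existing Fairlearn function):

import numpy as np
import pandas as pd

def split_by_sensitive_feature(sensitive_features, **columns):
    # Return a dict mapping each sensitive-feature value to a DataFrame
    # containing the matching rows of the supplied columns.
    df = pd.DataFrame(columns)
    df["_sf"] = np.asarray(sensitive_features)
    return {g: frame.drop(columns="_sf") for g, frame in df.groupby("_sf")}

# Inspect group sizes (or plot ROC curves per group) before computing metrics.
splits = split_by_sensitive_feature(
    ["a", "a", "b", "b"], y_true=[0, 1, 1, 0], y_score=[0.2, 0.7, 0.4, 0.1])
for group, frame in splits.items():
    print(group, len(frame))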

@riedgar-ms
Member

riedgar-ms commented Jun 18, 2021

@kstohr I'm a bit confused by what you mean by:

Note: MetricFrame requires y_pred (clf.predict). However, ROC curves and AUC scores are generated using y_score (clf.predict_proba). This method substitutes y_score (type:float) for y_pred (type:int) to conform to the MetricFrame required params.

The argument is obviously called y_pred but its type is certainly not required to be int.

@riedgar-ms
Member

riedgar-ms commented Jun 18, 2021

Can you expand on labels / labels_true / labels_pred ? What's the third category for - I've generally been thinking of it as 'y_pred is anything coming from the model, and y_true is anything coming from the original data.'

That is correct! labels is the equivalent y for clustering problems. For some reason there exist both supervised and unsupervised metrics to evaluate clustering results (https://scikit-learn.org/stable/modules/classes.html#clustering-metrics), hence the three categories.

@hildeweerts that sort of leads to the question "Then what is y as opposed to y_true and y_pred?". Scanning through the link to sklearn it appears that it's used for cases where the ground truth isn't known? If so, then would you always have either y_true and y_pred (s/y/labels/ as appropriate) OR just y ?

@hildeweerts
Contributor

@hildeweerts that sort of leads to the question "Then what is y as opposed to y_true and y_pred?". Scanning through the link to sklearn it appears that it's used for cases where the ground truth isn't known? If so, then would you always have either y_true and y_pred (s/y/labels/ as appropriate) OR just y ?

I'm not sure if I understand your question correctly, so sorry in advance if this is all obvious to you. Generally y is used for the ground truth during training; my guess is that the authors decided to not call it y_true here because there is no explicit counterpart y_pred. For evaluation we need to be more specific, hence y_true and y_pred (or y_score). Similarly, they use labels to refer to cluster labels and only append _true and _pred in cases where there is an explicit ground truth.

@riedgar-ms
Member

@hildeweerts it might be that I'm just not fully used to the conventions - based on your description, it sounds like y and y_pred (similarly labels and labels_pred) are interchangeable - both refer to things coming from a model. Is that correct?

@hildeweerts
Contributor

@hildeweerts it might be that I'm just not fully used to the conventions - based on your description, it sounds like y and y_pred (similarly labels and labels_pred) are interchangeable - both refer to things coming from a model. Is that correct?

To make things more confusing for you: generally speaking, y/y_true are interchangeable, as are labels/labels_pred. The reason for this is that by default clustering problems don't really have a "ground truth" (otherwise we would probably be better off doing classification). The "Evaluation and assessment" section of this Wikipedia page describes some of the issues with evaluating clusterings pretty well I think.

@riedgar-ms
Member

@hildeweerts coming back to the API, though, are there metrics which would accept three parameters - labels, labels_true and labels_pred ? It sounds like the answer is that there aren't, but I'd like to make sure I'm understanding things correctly.

@hildeweerts
Contributor

@hildeweerts coming back to the API, though, are there metrics which would accept three parameters - labels, labels_true and labels_pred ? It sounds like the answer is that there aren't, but I'd like to make sure I'm understanding things correctly.

You're right, there aren't. So that's a similar issue as metrics that don't use y_true (only y_pred) or that use y_score instead of y_pred.

@kstohr

kstohr commented Jun 18, 2021

@riedgar-ms

@kstohr I'm a bit confused by what you mean by:

Note: MetricFrame requires y_pred (clf.predict). However, ROC curves and AUC scores are generated using y_score (clf.predict_proba). This method substitutes y_score (type:float) for y_pred (type:int) to conform to the MetricFrame required params.

The argument is obviously called y_pred but its type is certainly not required to be int.

Yes. It's a nomenclature issue. When I see y_pred I think it's the predicted labels. In fact, MetricFrame also allows you to pass other things in its place. Yes, it works. It's just not clear for the user.

And, in fact, all I needed to do was split the data by sensitive feature. I strongly feel like that should be a class/module on its own. Then you can use it in combination or with custom classes to compute metrics.

I totally agree with @riedgar-ms but would frame it a little differently:

We've got two things going on here:

  • Where we have a precomputed metric where we just need to slice it up by the sensitive features
  • Where there are multiple 'columns' in y_pred and/or y_true

Another way to think about it is we have two tasks to perform:
a) splitting the data by sensitive feature (universally needed and MetricFrame currently has a great approach)
b) passing split data to various metrics.

I think Fairlearn will be easier to maintain if we don't have to worry about how the data should be split and how it should be passed to a given metric in a single module/class. I suggest we just offer a class/module for splitting the data. And then worry about applying metrics to the split data in separate modules, which may import/depend on the module that splits the data such that we always handle that consistently. I feel the current module is trying to do too much in one class.

@riedgar-ms
Member

Yes. It's a nomenclature issue. When I see y_pred I think it's the predicted labels. In fact, MetricFrame also allows you to pass other things in its place. Yes, it works. It's just not clear for the user.

I'm not really sure about how best to cope with that. One solution (which would of course break everything) would be to have our own names for the inputs - say from_data and from_model - which should avoid the problem of existing conventions. There's then the question of how to route them to the underlying metric function. Currently we do that positionally - the first argument is the 'true' column, the second is the 'predicted' column. Now since @MiroDudik 's recent PR moved MetricFrame itself in the direction of keyword-only arguments, one could criticise this as inconsistent too. But having to match arbitrary argument names passed to MetricFrame to arbitrary argument names in the underlying functions is... not going to be trivial or unconfusing.

@hildeweerts
Contributor

I think this is an important discussion and I'd love to hear thoughts from other @fairlearn/fairlearn-maintainers as well here!

@adrinjalali
Member

It's quite a long conversation here; could you maybe summarize what we've got so far, for the rest of us to be able to understand the conversation?

@riedgar-ms
Member

Summary

Per @adrinjalali 's request :-)

I think that we've got a few things going on here:

  1. It seems that when users see y_true and y_pred they may think that MetricFrame only works for classification metrics, rather than more generally (really, they are column_from_data and column_from_model and even that meaning is only really imposed by the underlying metric function)
  2. Sometimes the two 'columns' might actually need to be multiple columns
  3. Some metrics only take one of these (e.g. selection_rate()) and users might not want to write a wrapper for this

@kstohr has also suggested that the 'split up by sensitive and conditional features' part of MetricFrame may well work better as a separate piece of functionality.

@riedgar-ms
Member

@kstohr

kstohr commented Jun 28, 2021

@riedgar-ms Yep. That sums it up.

My reason for splitting out the group_by sensitive feature function is so that it can be used for descriptive stats and exploration. If you are running a metric and you are getting an error, it may be because a class is underrepresented, not represented at all, or is 'all true.' In these cases, we may need to know which class is impacted in order to handle it. In addition, it is quite common to inspect the data you are working with before passing it to a metric.

Note: It is ok, even preferred, if it is still a method in the class. My issue is that when you instantiate the class, it currently forces you to run a metric. It would be great to be able to perform the group_by function independently of running the metric.

@riedgar-ms
Member

Summary

This is a second attempt to summarise the discussion to date, with the goal of reaching a conclusion. The previous summary was overly summarised.

Current Status

As proposed at the beginning of the thread, we are in the process of moving to requiring named arguments for MetricFrame. We have just released v0.7.0, which issues a warning if positional arguments are used; with v0.8.0 we plan to make named arguments mandatory. At that point, a call to MetricFrame should look like:

metrics = {
  'recall': skm.recall_score,
  'accuracy': skm.accuracy_score
}

s_p = {
  'recall' : { 'sample_weight' : s_w }
  # No sample params for accuracy
}

mf = MetricFrame(metrics=metrics, y_true=y_t, y_pred=y_p,
                 sensitive_features=s_f,
                 control_features=c_f,
                 sample_params=s_p)

Note that the metric functions themselves are dispatched using positional arguments - the basic signature is assumed to be metric_fn(y_true, y_pred). Now that we are requiring named arguments, this is an inconsistency.

Functionality Gaps

The feedback above (and from other sources) has highlighted several gaps in the functionality of MetricFrame. Specifically:

  • Some metrics don't take y_pred but y_score instead (often y_pred implies a label from a classification)
  • Some metrics take more than two arguments, e.g. inverse_propensity_score(*, actions_taken, rewards, propensities, actions_target)
  • We may have a precomputed metric (such as word error rate) and we just need it aggregated
  • Some metrics only need one argument (such as selection_rate for which we have a custom function which ignores the y_true argument)
  • A user may have multiple models which they wish to compare in a single MetricFrame

There are actually workarounds for all of these using the existing MetricFrame. These generally involve some combination of custom metric functions, making y_true and/or y_pred into a list of dictionaries, and filling extra arguments out via sample_params. The question is whether we can make the process easier.

A note on precomputed metrics

In the above, one proposed solution for the precomputed metric problem was to have a static factory method MetricFrame.from_precomputed(). However, I don't think this is needed - the precomputed metrics are still per-sample, and need to be aggregated somehow. The aggregation function (such as mean() or min()) is then the 'metric'. We should make sure this is documented, though.
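Concretely, the precomputed per-sample values can be passed as y_pred with a tiny aggregating wrapper as the 'metric' (a sketch against the v0.7.0 keyword signature; the word-error-rate values are made up):

import numpy as np
from fairlearn.metrics import MetricFrame

word_error_rates = np.array([0.10, 0.25, 0.05, 0.40])  # precomputed per sample
sensitive = np.array(["a", "a", "b", "b"])

def mean_of_precomputed(y_true, y_pred):
    # MetricFrame dispatches metric(y_true, y_pred); y_true is a placeholder.
    return np.mean(y_pred)

mf = MetricFrame(metrics=mean_of_precomputed,
                 y_true=np.zeros_like(word_error_rates),  # unused placeholder
                 y_pred=word_error_rates,
                 sensitive_features=sensitive)
print(mf.by_group)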

Proposed solutions

There are two basic ways to provide other columns - as **kwargs or a separate dictionary. In the following, we assume that the existing y_true and y_pred arguments get folded in eventually, although there would have to be intermediate releases (similar to the switch to named arguments).

Providing **kwargs

In this case, the signature of MetricFrame would become

mf = MetricFrame(metrics, 
                 sensitive_features=s_f,
                 control_features=c_f,
                 sample_params=s_p,
                 **direct_sample_params
)

and it would be invoked like:

mf = MetricFrame(metrics=metrics,
                 sensitive_features=s_f,
                 control_features=c_f,
                 sample_params=s_p,
                 y_true=y_t, y_score=y_s
)

However, **kwargs has the downside that it is hard to add other required arguments in future, since we can be sure that whatever name we pick, we will break someone. Personally, I also think it makes documentation more difficult to read.

Providing a dictionary

Rather than provide individual arguments, we can have a direct_sample_params dictionary. The signature would then be:

mf = MetricFrame(metrics, 
                 sensitive_features=s_f,
                 control_features=c_f,
                 sample_params=s_p,
                 direct_sample_params=None
)

and it would be invoked like:

mf = MetricFrame(metrics=metrics,
                 sensitive_features=s_f,
                 control_features=c_f,
                 sample_params=s_p,
                 direct_sample_params={ 'y_true': y_t, 'y_score': y_s }
)

This is somewhat clunkier, but does avoid the problems associated with adding extra arguments in future.

Dispatching the Metric Functions

Regardless of how we accept the non-y_true/y_pred arguments, we then have to decide how to call the individual metric functions.

The main question is whether we are going to

  1. Expect all of the metric functions to accept all of the arguments passed in direct_sample_params (or **direct_sample_params); or
  2. Use reflection (inspect.signature() I believe) to examine which functions want which arguments

Option (2) is obviously much more flexible; for example, one could do something like:

metrics = {
  'metric_a' : metric_using_y_pred,
  'metric_b' : metric_using_y_score,
  'metric_c' : metric_using_y_true_y_pred,
  'metric_d' : metric_using_y_true_y_score
}

mf = MetricFrame(
                 metrics=metrics,
                 sensitive_features=s_f,
                 control_features=c_f,
                 sample_params=s_p,
                 y_true=y_t, y_pred=y_p, y_score=y_s
)

(or similarly for the non-**kwargs option) and expect things to Just Work.

There are two problems with using reflection in this way. First, consider the example of a precomputed metric such as word_error_rate. We might want to use numpy.mean() as the aggregator.... but the argument for that is called a and not word_error_rate. Users would still have to write a wrapper function to make the names match.

Secondly, using reflection like this opens up a huge trap - a typo in the argument name can cause really subtle problems. If the metric function considers an argument to be mandatory (likely for something like y_true or y_score), then an error will be thrown. However, for an optional argument (think sample_weight vs sample_weights if a user were trying to pass the weights to every metric, rather than using sample_params), the typo will simply stop the argument from being passed, which might not be noticed for a long time (ask me about the value of IMPLICIT NONE and why zero initialisation is a travesty).
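To spell the trap out (a sketch, not Fairlearn code):

import inspect

def recall_like_metric(y_true, y_pred, sample_weight=None):
    # Stand-in for an sklearn-style metric with an optional sample_weight.
    return "weighted" if sample_weight is not None else "unweighted"

supplied = {"y_true": [0, 1], "y_pred": [0, 1], "sample_weights": [1, 2]}  # typo!

accepted = set(inspect.signature(recall_like_metric).parameters)
routed = {name: value for name, value in supplied.items() if name in accepted}

# 'sample_weights' is silently discarded, the metric runs unweighted,
# and no error is ever raised.
print(recall_like_metric(**routed))  # -> unweighted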

Working with Streaming

PR #670 is working to add streaming to MetricFrame. I think that all of the above should be compatible with that, but we should keep it in mind.

Tagging @adrinjalali @hildeweerts @kstohr @michaelamoako @MiroDudik

@adrinjalali
Member

Note that the metric functions themselves are dispatched using positional arguments - the basic signature is assumed to be metric_fn(y_true, y_pred). Now that we are requiring named arguments, this is an inconsistency.

I personally don't mind this inconsistency.

As for the rest of the proposal, I find it way too complicated, not very user friendly, and hard to explain to users.

I would rather settle for a middle ground where we leave MetricFrame unchanged, and have a DatasetMetricFrame which accepts only one y. An API which supports all possible cases is an overly complicated one designed to handle 100% of the cases; we could settle for 95% and have a much simpler API, and have the other 5% of use cases rely on appropriate workarounds.

We should of course accept extra input required by the metric. For example, in the case of inverse_propensity_score, if I understand correctly, we can have something like:

inverse_propensity_score(
    y_pred=actions_taken,
    y_true=actions_target,
    rewards=rewards,
    propensities=propensities
)

And with SLEP006 sklearn could accept something like:

make_scorer(inverse_propensity_score, score_params=["rewards", "propensities"])

to get a scorer. Note that this has removed the need for us to do any introspection, but we have a scorer object which is a callable but also has a state.

An alternative here, since we don't actually do slicing to re-call a scorer, is to have an object which only accepts precomputed scores, like

PrecomputedMetricFrame(
    auc=auc_score(...),
    propensity=propensity_score(....),
    ...,
    sensitive_attributes=...
)

Or something like that.

@riedgar-ms
Member

First, my apologies @adrinjalali for not responding sooner.

For MetricFrame itself, I'm very happy with leaving it in its current form (as I mentioned above).

I'm not quite sure what you're describing with make_scorer(). Are you saying that users would make a call along the lines of:

my_metric_fn = make_scorer(inverse_propensity_score, score_params=["rewards", "propensities"])

and then pass my_metric_fn to FlexibleMetricFrame (or whatever we call it... I'm not a huge fan of that name). This object would provide the get_metadata_request() method which is mentioned in SLEP006, and which FlexibleMetricFrame could then use to route the arguments. So the user's invocation would be something like:

func_dict = { 'recall' : recall_score,
            'inv_propensity': my_metric_fn }

flex_mf = FlexibleMetricFrame(metrics=func_dict,
                              y_true=y_t, y_pred=y_p, rewards=rewards, propensities=propensities,
                              sensitive_features=sf)

Since recall_score won't have get_metadata_request() it will only be passed y_true and y_pred (positionally). In contrast, my_metric_fn will respond that it wants rewards and propensities as well, and be passed them. We avoid the 'hole' of dropped arguments due to spelling errors because make_scorer() can check that its score_params all appear in the signature of the underlying function, and similarly FlexibleMetricFrame can check its **kwargs list against the results of get_metadata_request().

Presumably, FlexibleMetricFrame (or its improved cognomen) would not accept the sample_params= dictionaries of MetricFrame because that would be taken care of through get_metadata_request() ?

@riedgar-ms
Member

Pinging @adrinjalali .... did I get what you meant by make_scorer() ?

@adrinjalali
Member

So what I was trying to say is that handling the metadata routing is really not that easy, and most APIs we've tried to design are messy. The API in sklearn applies to scorers, i.e. scorer(estimator, X, y), rather than scoring functions, i.e. score(y_true, y_pred). Adding get_metadata_request() is certainly a way forward, but it's not an easy one.

Alternatively, we could let AlternativeMetricFrame accept a scorer instead of scoring functions, and then it will be exactly in the scope of SLEP006, which would make things much easier.

I guess for now, we could have something like:

metrics = AlternativeMetricFrame(
    metrics={'roc': roc_func, 'inverse_propensity': inverse_propensity_score},
    y_true=y_true,
    y_pred=y_pred,
    rewards=rewards,
    propensities=propensities,
    routing={'inverse_propensity': ['rewards', 'propensities']}
)

We could also accept a scorer instead of the scoring method, but the user would also need to pass the estimator. I kinda would prefer that if we're gonna do routing.
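For concreteness, the routing dictionary could be applied along these lines when dispatching (a rough sketch; the argument handling is simplified and the names are placeholders):

def evaluate_with_routing(metrics, routing, y_true, y_pred, **extra_params):
    # metrics: dict of name -> scoring function
    # routing: dict of name -> list of extra parameter names that metric needs
    results = {}
    for name, fn in metrics.items():
        extra = {p: extra_params[p] for p in routing.get(name, [])}
        results[name] = fn(y_true, y_pred, **extra)
    return results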

@hildeweerts
Contributor

I don't have a lot to add to this discussion, but I agree that leaving MetricFrame as is and adding an alternative more flexible version would be preferable.

pinging @fairlearn/fairlearn-maintainers to move the discussion forward
