creating a Keras metric from metrics/performance.py #3
@mikkokotila I can try to work on this but that second link is broken! Would be good to get a better understanding of what is required before I do anything.
@x94carbone I think that would be great. I've fixed the link above. Right now performance.py implements a modified version of the F1 score that is "better" than the regular F1 score in two ways:
The question would be whether this can reasonably be converted into a Keras metric with the same logic. Or is it perhaps better to take the Keras fmeasure_acc here as a base and make the modifications to it based on performance.py? Then the second step would be to identify a similar "best of class" objective measure for continuous prediction tasks, and have that in both Python (as we have performance.py now) and Keras versions (for callbacks). What do you think?
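For orientation, here is a minimal sketch of what an epoch-level F1 metric written against the Keras backend could look like, loosely in the spirit of the fbeta_score that was removed in Keras 2. This is not performance.py's actual logic; the function name `f1_metric` and the binary, batch-wise rounding are illustrative assumptions only.

```python
from keras import backend as K

def f1_metric(y_true, y_pred):
    # Round predicted probabilities to hard 0/1 labels
    y_pred = K.round(K.clip(y_pred, 0, 1))

    true_positives = K.sum(y_true * y_pred)
    predicted_positives = K.sum(y_pred)
    possible_positives = K.sum(y_true)

    precision = true_positives / (predicted_positives + K.epsilon())
    recall = true_positives / (possible_positives + K.epsilon())

    return 2 * precision * recall / (precision + recall + K.epsilon())
```

Passed via `metrics=[f1_metric]` in `model.compile(...)`, something like this would be evaluated batch-wise each epoch and so become visible to the history object and callbacks, which is what the rest of the thread is about.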
Yeah that sounds good. I would also at some point like to give the user some freedom to choose a custom metric, although that will be a challenge. I'm not sure I know what you mean by best of class though. Do you mean implementing a multi-class version of precision/recall/f1? I'll definitely look into this when I have the time!
Ah, my bad with the choice of words. Basically, right now in the world of statistics it seems that F1 gets us close to the best possible objective measure, but not quite there yet (because of the corner cases). With the corner cases fixed, this seems to be the best possible way to handle classification tasks. This would naturally then lead to the question of what the "gold standard" measure for continuous prediction tasks is. The current modified F1 works for all kinds of classification tasks; it's modified to solve that problem. I have to test it a lot more to validate, but as far as I can gather, the way I built it leads to a situation where binary-class and multi-label are both, objectively speaking, measured in the same way. As for the part about users choosing their own metric, that's possible already now by using any metric in model.compile, and that will come into the log. With performance.py my focus has been on creating a metric we can "trust" in terms of a completely automated pipeline later (where for example evolutionary algos help handle some of the parts that are still manual).
@mikkokotila sorry for the late response! Ok, so I'm still confused about something. As for a trustworthy metric, that is something I could possibly help with, since I understand that right off the bat. For multi-class classification, we will want to look into the macro F1 score. It is something of an intensive quantity in the sense that it averages each class's F1 score and treats the classes equally, as opposed to what's going on now. This ensures that classes with small numbers of entries have equal weight compared with classes that have many entries. Isn't it a somewhat impossible task to create a totally generalized trustworthy metric that works for any system? I'm sorry for all the questions here but I still don't totally understand the direction you want to go in with this 👍
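To make the macro-averaging point concrete, here is a small illustration (the numbers are made up) of how macro F1 differs from micro F1 on an imbalanced problem, using sklearn.metrics:

```python
from sklearn.metrics import f1_score

# An imbalanced 3-class toy problem and a model that only ever predicts class 0
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 2]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

# micro-F1 is dominated by the majority class; macro-F1 averages the
# per-class F1 scores with equal weight, so the ignored classes drag it down
# (sklearn warns about the classes that are never predicted and scores them 0)
print(f1_score(y_true, y_pred, average='micro'))  # 0.8
print(f1_score(y_true, y_pred, average='macro'))  # ~0.30
```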
Bumping this. @mikkokotila let me know if you can clarify when you have the time! Thanks! 👌
There are two sides to this. First is "why is it important to have as unified a metric across all experiments as possible?". My goal would be to one day have a database of millions of experiments, all measured against a single objective metric. Currently this lives in the master log, which users could then choose to contribute to the open database. It's important to note that here we are just talking about what is stored to the master log; otherwise of course users can use any metric they like. That said, at this point it does seem reasonable to accept that we will have two metrics: one for category predictions (single, multi-class, multi-label) and one for continuous. performance.py is an attempt to be the first one. A little bit more about this in the FAQ.

Then the question of how performance.py is different from standard F1. It handles some of the corner cases as "null labels" instead of giving a misleading numeric result. The way it's currently done is just focused 100% on the above purpose. If we want to have a Keras F1 without modifications, we can use the one that is already available in keras_metrics, which is more or less directly from Keras (they removed it from Keras 2, strangely enough). Of course, when we talk about an objective metric, ideally this would be something we also use as the base metric for automatic optimization / callbacks, etc.
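For illustration, one way the "null result instead of a misleading number" idea could look in plain Python (a hypothetical sketch, not performance.py's actual implementation; `guarded_f1` and the specific guard conditions are assumptions):

```python
import numpy as np
from sklearn.metrics import f1_score

def guarded_f1(y_true, y_pred):
    """Return NaN (a 'null' score) instead of a misleading F1 when the
    predictions or the ground truth collapse to a single class."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)

    # Degenerate corner cases: the model predicts only one class, or the
    # ground truth contains only one class. A plain F1 can look good here
    # even though the model has learned nothing useful.
    if np.unique(y_pred).size == 1 or np.unique(y_true).size == 1:
        return np.nan

    return f1_score(y_true, y_pred)
```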
This is interesting, but why? Millions of experiments for the purpose of what? Are these all related experiments? Sorry for all the questions - I'm really interested. Although I think I'm beginning to understand a bit more what you mean by a unified metric. F1 is currently the gold standard I guess since it accounts for class imbalances...
Yup, precision and recall were removed. Very strange considering they're such important metrics. Their reason:
found here. Although from the look of it you've already implemented all the relevant functions in terms of Keras backends. Isn't that all that is necessary to implement them at epoch level?
@x94carbone sorry, missed this. The reason for having the results of many experiments (regardless of the prediction type) logged has to do with the potential value such data will have for better understanding a) the hyperparameter optimization problem and b) the optimization of the hyperparameter optimization process itself. Yes, I've already implemented the fscore etc. from the Keras backend (old version), but this does not quite do what Performance does, i.e. treat single-class, multi-class, and multi-label prediction tasks all in the same way (i.e. an objective metric across all those kinds of tasks), nor does it deal with the fscore corner cases, e.g. when there are many positives and few negatives and all are predicted as positives.
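To put numbers on that last corner case (the figures below are made up for illustration): with 95 positives and 5 negatives, a model that blindly predicts "positive" for every sample still gets a high standard F1.

```python
from sklearn.metrics import f1_score

y_true = [1] * 95 + [0] * 5   # 95 positives, 5 negatives
y_pred = [1] * 100            # predict everything as positive

# precision = 0.95, recall = 1.0, so F1 is roughly 0.97 even though
# the model never distinguishes between the classes at all
print(f1_score(y_true, y_pred))
```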
Closing this to make way for the soon-to-happen inclusion of sklearn.metrics, and then gradually making the most important ones available on the Keras backend level for epoch-by-epoch evaluation.
Right now performance.py works on the level of the mainline Hyperio program, outside of Keras. This means it's not available at the epoch level and therefore is not included in Keras reporting (the history object), nor is it available to be used for EarlyStopping or other callbacks. Once the exact same scoring functionality works as a Keras metric, this is resolved :)
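As a sketch of what becomes possible once that exists: assuming a backend-level metric like the `f1_metric` shown earlier, plus an already-built model and training data (the variable names below are placeholders), the score would appear in the history object and could drive callbacks.

```python
from keras.callbacks import EarlyStopping

# The custom metric is registered at compile time, just like built-in metrics
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=[f1_metric])

# Epoch-level availability means it can also drive early stopping
early_stop = EarlyStopping(monitor='val_f1_metric', mode='max', patience=5)

history = model.fit(x_train, y_train,
                    validation_data=(x_val, y_val),
                    epochs=100,
                    callbacks=[early_stop])

print(history.history['val_f1_metric'])  # now logged epoch by epoch
```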
Keras Metrics
Examples of working Keras metrics code