
creating a Keras metric from metrics/performance.py #3

Closed
mikkokotila opened this issue May 11, 2018 · 10 comments

@mikkokotila (Contributor) commented May 11, 2018

Right now performance.py works at the level of the mainline Hyperio program, outside of Keras. This means it's not available at the epoch level and therefore is not included in Keras reporting (the history object), nor is it available for use with EarlyStopping or other callbacks. Once the exact same scoring functionality works as a Keras metric, this is resolved :)

Keras Metrics

Examples of working Keras metric code

@matthewcarbone (Collaborator)

@mikkokotila I can try to work on this but that second link is broken! Would be good to get a better understanding of what is required before I do anything.

@mikkokotila (Contributor, Author)

@x94carbone I think that would be great. I've fixed the link above.

Right now performance.py implements a modified version of the F1 score that is "better" than the regular F1 score in two ways:

  • it avoids giving a high score in corner cases that actually warrant a poor score
  • it works exactly the same for both binary and multi-class tasks (by handling both kinds of prediction outputs in the same manner)

The question is whether this can reasonably be converted into a Keras metric with the same logic, or whether it's better to take the Keras fmeasure_acc here as a base and apply the modifications from performance.py to it.

The second step would then be to identify a similar "best of class" objective measure for continuous prediction tasks, and have that in both a Python version (as we have with performance.py now) and a Keras version (for callbacks).
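For reference, a minimal sketch of the kind of backend-based metric being discussed, roughly along the lines of the old Keras fbeta_score (computed batch-wise); the performance.py corner-case logic is not included here and would still need to be layered on top:

```python
from keras import backend as K

def fbeta_score(y_true, y_pred, beta=1):
    # rough sketch of the old Keras-style fbeta, computed batch-wise
    y_pred = K.round(K.clip(y_pred, 0, 1))
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    predicted_positives = K.sum(y_pred)
    possible_positives = K.sum(K.round(K.clip(y_true, 0, 1)))
    precision = true_positives / (predicted_positives + K.epsilon())
    recall = true_positives / (possible_positives + K.epsilon())
    bb = beta ** 2
    return (1 + bb) * (precision * recall) / (bb * precision + recall + K.epsilon())
```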

What do you think?

@matthewcarbone (Collaborator)

Yeah, that sounds good. I would also at some point like to give the user some freedom to choose a custom metric, although that will be a challenge. I'm not sure I know what you mean by "best of class" though. Do you mean implementing a multi-class version of precision/recall/F1?

I'll definitely look into this when I have the time!

@mikkokotila (Contributor, Author)

Ah, my bad with the choice of words. Basically, right now in the world of statistics it seems that F1 gets us close to the best possible objective measure, but not quite there yet (because of the corner cases). With the corner cases fixed, this seems to be the best possible way to handle classification tasks. That naturally leads to the question of what the "gold standard" measure for continuous prediction tasks would be.

The current modified F1 works for all kinds of classification tasks; it's modified precisely to solve that problem. I still have to test it a lot more to validate it, but as far as I can gather, the way I built it leads to a situation where binary-class and multi-label tasks are both measured objectively in the same way.

As for the user choosing their own metric, that's already possible by passing any metric to model.compile, and it will come into the log (see the sketch below).
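For example (a minimal sketch, not current Hyperio code), any backend-based function can be passed to model.compile and will then be reported per epoch in the history object:

```python
from keras import backend as K
from keras.models import Sequential
from keras.layers import Dense

# any backend-based function of (y_true, y_pred) works as a custom metric
def recall(y_true, y_pred):
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    possible_positives = K.sum(K.round(K.clip(y_true, 0, 1)))
    return true_positives / (possible_positives + K.epsilon())

# toy model just to show the compile call
model = Sequential([Dense(1, activation='sigmoid', input_dim=10)])

# the custom function goes straight into compile(); Keras then reports
# 'recall' and 'val_recall' per epoch in the history object (and the log)
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['acc', recall])
```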

With performance.py my focus has been on creating a metric we can "trust" in a completely automated pipeline later on (where, for example, evolutionary algorithms help handle some of the parts that are still manual).

@matthewcarbone (Collaborator)

@mikkokotila sorry for the late response!

Ok, so I'm still confused about something. In performance.py it appears that what's being used are just the usual F1 and F-beta scores. What's modified about them? I could just be missing it.

As for a trustworthy metric, that is something I could possibly help with, since I understand it right off the bat. For multi-class classification, we will want to look into the macro F1 score. It is something of an intensive quantity in the sense that it averages each class's F1 score and treats them all equally, as opposed to what's going on now. This ensures that classes with few entries carry equal weight to classes with many entries (small example below).
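A small made-up example of the difference, using sklearn just for illustration:

```python
from sklearn.metrics import f1_score

# macro averaging computes F1 per class and takes the unweighted mean,
# so a rare class counts as much as a common one
y_true = [0, 0, 0, 0, 1, 2]
y_pred = [0, 0, 0, 0, 2, 2]

print(f1_score(y_true, y_pred, average='macro'))  # ~0.56, dragged down by class 1
print(f1_score(y_true, y_pred, average='micro'))  # ~0.83, dominated by class 0
```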

Isn't creating a totally generalized, trustworthy metric that works for any system a somewhat impossible task? I'm sorry for all the questions here, but I still don't totally understand the direction you want to go in with this 👍

@matthewcarbone (Collaborator)

Bumping this.

@mikkokotila let me know if you can clarify when you have the time! Thanks! 👌

@mikkokotila (Contributor, Author)

There are two sides to this. The first is "why is it important to have as unified a metric as possible across all experiments?" My goal is to one day have a database of millions of experiments, all measured against a single objective metric. Currently this lives in the master log, which users could then choose to contribute to the open database. It's important to note that we're only talking about what is stored in the master log; otherwise users can of course use any metric they like.

That said, at this point it does seem reasonable to accept that we will have two metrics: one for category predictions (single-class, multi-class, multi-label) and one for continuous predictions. performance.py is an attempt at the first one. A little bit more about this in the FAQ.

Then there's the question of how performance.py differs from standard F1: it handles some of the corner cases as "null labels" instead of giving a misleading numeric result. The way it's currently done is focused 100% on the purpose above. If we want a Keras F1 without modifications, we can use the one already available in keras_metrics, which is more or less directly from Keras (they removed it from Keras 2, strangely enough).
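Purely as a hypothetical illustration of the "null label" idea (this is not the actual performance.py code), a wrapper could refuse to report a number when the prediction is degenerate:

```python
import numpy as np
from sklearn.metrics import f1_score

def trusted_f1(y_true, y_pred):
    """Hypothetical sketch only (binary case): return None, a "null" result,
    instead of a misleading number when the model predicts a single class
    for every sample."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    if len(np.unique(y_pred)) == 1:
        return None
    return f1_score(y_true, y_pred)
```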

Of course, when we talk about an objective metric, ideally it's also something we would use as the base metric for automatic optimization, callbacks, etc.

@matthewcarbone (Collaborator)

> My goal is to one day have a database of millions of experiments, all measured against a single objective metric.

This is interesting, but why? Millions of experiments for the purpose of what? Are these all related experiments? Sorry for all the questions - I'm really interested. Although I think I'm beginning to understand a bit more what you mean by a unified metric. F1 is currently the gold standard I guess since it accounts for class imbalances...

> If we want a Keras F1 without modifications, we can use the one already available in keras_metrics, which is more or less directly from Keras (they removed it from Keras 2, strangely enough).

Yup, precision and recall were removed. Very strange considering they're such important metrics. Their reason:

> Basically these are all global metrics that were approximated batch-wise, which is more misleading than helpful. This was mentioned in the docs but it's much cleaner to remove them altogether. It was a mistake to merge them in the first place.

found here.

Although from the look of it, you've already implemented all the relevant functions in terms of the Keras backend. Isn't that all that's necessary to implement them at the epoch level?
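One common workaround (sketched here with sklearn for the scoring; not part of the current codebase) is a callback that scores the full validation set once per epoch, so nothing is approximated batch-wise and the values land in logs where e.g. EarlyStopping can monitor them:

```python
from keras.callbacks import Callback
from sklearn.metrics import f1_score, precision_score, recall_score

class EpochMetrics(Callback):
    """Compute F1 / precision / recall on the full validation set once per
    epoch, instead of averaging batch-wise approximations."""

    def __init__(self, x_val, y_val):
        super().__init__()
        self.x_val = x_val
        self.y_val = y_val

    def on_epoch_end(self, epoch, logs=None):
        logs = logs or {}
        y_pred = (self.model.predict(self.x_val) > 0.5).astype(int).ravel()
        logs['val_f1'] = f1_score(self.y_val, y_pred)
        logs['val_precision'] = precision_score(self.y_val, y_pred)
        logs['val_recall'] = recall_score(self.y_val, y_pred)

# hypothetical usage: list EpochMetrics before EarlyStopping(monitor='val_f1')
# model.fit(x, y, validation_data=(x_val, y_val),
#           callbacks=[EpochMetrics(x_val, y_val)])
```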

@mikkokotila (Contributor, Author)

@x94carbone sorry, missed this.

The reason for logging the results of many experiments (regardless of prediction type) has to do with the potential value such data will have for better understanding a) the hyperparameter optimization problem and b) the optimization of the hyperparameter optimization process.

Yes, I've already implemented the F-score etc. from the Keras backend (old version), but it does not quite do what Performance does, i.e. treat single-class, multi-class, and multi-label prediction tasks all in the same way (an objective metric across all those kinds of tasks), nor does it deal with the F-score corner cases, e.g. when there are many positives and few negatives and everything is predicted as positive.
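That corner case is easy to demonstrate with plain F1 (sklearn here, just for illustration):

```python
from sklearn.metrics import f1_score

# many positives, few negatives, and the model predicts positive for everything
y_true = [1] * 95 + [0] * 5
y_pred = [1] * 100

print(f1_score(y_true, y_pred))  # ~0.97, a high score for a model that learned nothing
```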

@mikkokotila (Contributor, Author)

Closing this to make way for the soon-to-happen inclusion of sklearn.metrics, and then gradually making the most important ones available at the Keras backend level for epoch-by-epoch evaluation.
