creating a Keras metric from metrics/performance.py #3
@mikkokotila I can try to work on this but that second link is broken! Would be good to get a better understanding of what is required before I do anything.
@x94carbone I think that would be great. I've fixed the link above. Right now performance.py implements a modified version of the F1 score that is "better" than the regular F1 score in two ways:
The question would be whether this can reasonably be converted into a Keras metric with the same logic. Or is it perhaps better to take the Keras fmeasure_acc here as a base and make the modifications to it based on performance.py? Then the second step would be to identify a similar "best of class" objective measure for continuous prediction tasks, and have that in both Python (as we have performance.py now) and Keras versions (for callbacks). What do you think?
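For orientation, here is a minimal sketch of what an epoch-level F1 metric written against the Keras backend could look like, loosely in the spirit of the fbeta_score that was removed in Keras 2. This is not performance.py's actual logic; the function name `f1_metric` and the binary, batch-wise rounding are illustrative assumptions only.

```python
from keras import backend as K

def f1_metric(y_true, y_pred):
    # Round predicted probabilities to hard 0/1 labels
    y_pred = K.round(K.clip(y_pred, 0, 1))

    true_positives = K.sum(y_true * y_pred)
    predicted_positives = K.sum(y_pred)
    possible_positives = K.sum(y_true)

    precision = true_positives / (predicted_positives + K.epsilon())
    recall = true_positives / (possible_positives + K.epsilon())

    return 2 * precision * recall / (precision + recall + K.epsilon())
```

Passed via `metrics=[f1_metric]` in `model.compile(...)`, something like this would be evaluated batch-wise each epoch and so become visible to the history object and callbacks, which is what the rest of the thread is about.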
Yeah that sounds good. I would also at some point like to give the user some freedom to choose a custom metric, although that will be a challenge. I'm not sure I know what you mean by best of class though. Do you mean implementing a multi-class version of precision/recall/f1? I'll definitely look into this when I have the time!
Ah, my bad with the choice of words. Basically, right now in the world of statistics it seems that F1 gets us close to the best possible objective measure, but not quite there yet (because of the corner cases). With the corner cases fixed, this seems to be the best possible way to handle classification tasks. This would naturally then lead to the question of what the "gold standard" measure for continuous prediction tasks is. The current modified F1 works for all kinds of classification tasks; it's modified to solve that problem. I have to test it a lot more to validate, but as far as I can gather, the way I built it leads to a situation where binary-class and multi-label are both, objectively speaking, measured in the same way. As for the part about users choosing their own metric, that's possible already now by using any metric in model.compile, and that will come into the log. With performance.py my focus has been on creating a metric we can "trust" in terms of a completely automated pipeline later (where for example evolutionary algos help handle some of the parts that are still manual).
@mikkokotila sorry for the late response! Ok, so I'm still confused about something. As for a trustworthy metric, that is something I could possibly help with, since I understand that right off the bat. For multi-class classification, we will want to look into the macro F1 score. It is something of an intensive quantity in the sense that it averages each class's F1 score and treats the classes equally, as opposed to what's going on now. This ensures that classes with small numbers of entries have equal weight compared with classes that have many entries. Isn't it a somewhat impossible task to create a totally generalized trustworthy metric that works for any system? I'm sorry for all the questions here but I still don't totally understand the direction you want to go in with this 👍
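To make the macro-averaging point concrete, here is a small illustration (the numbers are made up) of how macro F1 differs from micro F1 on an imbalanced problem, using sklearn.metrics:

```python
from sklearn.metrics import f1_score

# An imbalanced 3-class toy problem and a model that only ever predicts class 0
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 2]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

# micro-F1 is dominated by the majority class; macro-F1 averages the
# per-class F1 scores with equal weight, so the ignored classes drag it down
# (sklearn warns about the classes that are never predicted and scores them 0)
print(f1_score(y_true, y_pred, average='micro'))  # 0.8
print(f1_score(y_true, y_pred, average='macro'))  # ~0.30
```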
Bumping this. @mikkokotila let me know if you can clarify when you have the time! Thanks! 👌
There are two sides to this. First is "why is it important to have as unified a metric across all experiments as possible?". My goal would be to one day have a database of millions of experiments, all measured against a single objective metric. Currently this lives in the master log, which users could then choose to contribute to the open database. It's important to note that here we are just talking about what is stored to the master log; otherwise of course users can use any metric they like. That said, at this point it does seem reasonable to accept that we will have two metrics: one for category predictions (single, multi-class, multi-label) and one for continuous. performance.py is an attempt to be the first one. A little bit more about this in the FAQ.

Then the question of how performance.py is different from standard F1. It handles some of the corner cases as "null labels" instead of giving a misleading numeric result. The way it's currently done is just focused 100% on the above purpose. If we want to have a Keras F1 without modifications, we can use the one that is already available in keras_metrics, which is more or less directly from Keras (they removed it from Keras 2, strangely enough). Of course, when we talk about an objective metric, ideally this would be something we also use as the base metric for automatic optimization / callbacks, etc.
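For illustration, one way the "null result instead of a misleading number" idea could look in plain Python (a hypothetical sketch, not performance.py's actual implementation; `guarded_f1` and the specific guard conditions are assumptions):

```python
import numpy as np
from sklearn.metrics import f1_score

def guarded_f1(y_true, y_pred):
    """Return NaN (a 'null' score) instead of a misleading F1 when the
    predictions or the ground truth collapse to a single class."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)

    # Degenerate corner cases: the model predicts only one class, or the
    # ground truth contains only one class. A plain F1 can look good here
    # even though the model has learned nothing useful.
    if np.unique(y_pred).size == 1 or np.unique(y_true).size == 1:
        return np.nan

    return f1_score(y_true, y_pred)
```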
This is interesting, but why? Millions of experiments for the purpose of what? Are these all related experiments? Sorry for all the questions - I'm really interested. Although I think I'm beginning to understand a bit more what you mean by a unified metric. F1 is currently the gold standard I guess since it accounts for class imbalances...
Yup, precision and recall were removed. Very strange considering they're such important metrics. Their reason:
found here. Although from the look of it you've already implemented all the relevant functions in terms of Keras backends. Isn't that all that is necessary to implement them at epoch level?
@x94carbone sorry, missed this. The reason for having the results of many experiments (regardless of the prediction type) logged has to do with the potential value such data will have for better understanding a) the hyperparameter optimization problem and b) the optimization of the hyperparameter optimization process itself. Yes, I've already implemented the fscore etc. from the Keras backend (old version), but this does not quite do what Performance does, i.e. treat single-class, multi-class, and multi-label prediction tasks all in the same way (i.e. an objective metric across all those kinds of tasks), nor does it deal with the fscore corner cases, e.g. when there are many positives and few negatives and all are predicted as positives.
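To put numbers on that last corner case (the figures below are made up for illustration): with 95 positives and 5 negatives, a model that blindly predicts "positive" for every sample still gets a high standard F1.

```python
from sklearn.metrics import f1_score

y_true = [1] * 95 + [0] * 5   # 95 positives, 5 negatives
y_pred = [1] * 100            # predict everything as positive

# precision = 0.95, recall = 1.0, so F1 is roughly 0.97 even though
# the model never distinguishes between the classes at all
print(f1_score(y_true, y_pred))
```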
Closing this to make way for the soon-to-happen inclusion of sklearn.metrics, and then gradually making the most important ones available on the Keras backend level for epoch-by-epoch evaluation.
Right now performance.py works on the level of the mainline Hyperio program, outside of Keras. This means it's not available at the epoch level and therefore is not included in Keras reporting (the history object), nor is it available to be used for EarlyStopping or other callbacks. Once the exact same scoring functionality works as a Keras metric, this is resolved :)
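As a sketch of what becomes possible once that exists: assuming a backend-level metric like the `f1_metric` shown earlier, plus an already-built model and training data (the variable names below are placeholders), the score would appear in the history object and could drive callbacks.

```python
from keras.callbacks import EarlyStopping

# The custom metric is registered at compile time, just like built-in metrics
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=[f1_metric])

# Epoch-level availability means it can also drive early stopping
early_stop = EarlyStopping(monitor='val_f1_metric', mode='max', patience=5)

history = model.fit(x_train, y_train,
                    validation_data=(x_val, y_val),
                    epochs=100,
                    callbacks=[early_stop])

print(history.history['val_f1_metric'])  # now logged epoch by epoch
```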
Keras Metrics
Examples of working Keras metrics code