What Metrics do we want to have for training/testing here? #4

Open · 1 of 5 tasks
karalets opened this issue Mar 27, 2020 · 5 comments

@karalets (Collaborator) commented Mar 27, 2020

Initially, we can have something like log-likelihood, just to have a reasonable quantitative measure.

Over time, however, we may want more informative metrics for the performance of the deep net on the task at hand, for instance downstream metrics for a chemistry application.

While this is not pressing at first, I am opening this issue so we can collect ideas for:

  1. metrics that make sense to collect for training and testing
  2. plots we may want to see down the line that would be reasonable to have

Both of those can and should also take into account the evaluation chosen in https://pubs.rsc.org/en/content/articlepdf/2019/sc/c9sc00616h, as ultimately we will need to compare against it.

My first pitch is as follows:

  • compute training likelihood and test likelihood on a hold-out dataset
  • visualize this as a function of the number of training datapoints
  • visualize this as a function of the number of background datapoints (for the semi-supervised setting)
  • compare different strategies for building the background dataset with respect to likelihood, e.g. whether more diverse data leads to better representations
  • ... more ways?

The nice thing about this is that we can rerun the same evaluation protocols with any metric, not just log-likelihood (LLK).
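
To make this concrete, here is a minimal sketch of that evaluation loop, assuming a hypothetical `train_and_predict(x_sub, y_sub, x_test)` callable (not part of this repo) that fits a model on a training subset and returns a Gaussian predictive mean and standard deviation for each test point:

```python
import numpy as np

def gaussian_log_likelihood(y_true, y_pred_mean, y_pred_std):
    """Per-datapoint log-likelihood under a Gaussian predictive distribution."""
    var = y_pred_std ** 2
    return -0.5 * (np.log(2 * np.pi * var) + (y_true - y_pred_mean) ** 2 / var)

def learning_curve(train_and_predict, x_train, y_train, x_test, y_test,
                   train_sizes=(10, 30, 100, 300)):
    """Mean held-out log-likelihood as a function of training-set size.

    `train_and_predict(x_sub, y_sub, x_test)` is a hypothetical callable that
    fits a model on the subset and returns (mean, std) predictions for x_test.
    """
    curve = []
    for n in train_sizes:
        # random subset of the training pool of size n
        idx = np.random.choice(len(x_train), size=n, replace=False)
        mean, std = train_and_predict(x_train[idx], y_train[idx], x_test)
        curve.append((n, gaussian_log_likelihood(y_test, mean, std).mean()))
    return curve
```

The same loop works with any per-datapoint metric in place of the Gaussian log-likelihood, and with the background-set size as the swept variable for the semi-supervised case.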

@maxentile (Member)

This sounds good to me. I guess one other basic question is whether we want to have metrics for how "well-calibrated" the predictive uncertainties are, and if so, what those should look like. If this is in-scope, perhaps @karalets can provide some references / pointers?
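
For reference, one simple notion of calibration for regression is whether central predictive intervals achieve their nominal coverage on held-out data. The sketch below is only an illustration of that idea under an assumed Gaussian predictive distribution, not a proposal for this repo's API:

```python
import numpy as np
from scipy.stats import norm

def coverage_calibration(y_true, y_pred_mean, y_pred_std,
                         levels=np.linspace(0.1, 0.9, 9)):
    """Compare nominal vs. empirical coverage of central predictive intervals.

    Returns a list of (nominal, empirical) pairs; a well-calibrated model
    gives points close to the diagonal.
    """
    out = []
    for level in levels:
        z = norm.ppf(0.5 + level / 2)          # half-width in std units
        lo = y_pred_mean - z * y_pred_std
        hi = y_pred_mean + z * y_pred_std
        empirical = np.mean((y_true >= lo) & (y_true <= hi))
        out.append((level, empirical))
    return out
```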

@karalets (Collaborator, Author)

> This sounds good to me. I guess one other basic question is whether we want to have metrics for how "well-calibrated" the predictive uncertainties are, and if so, what those should look like. If this is in-scope, perhaps @karalets can provide some references / pointers?

Great point; I am happy to take point on that with some references once we start having results, so we can discuss how to evaluate such things.

The general idea is the following:
we need both in-distribution and out-of-distribution datasets, relative to the training data, in order to evaluate such things.

So once we have some experiments lined up where the molecules for both categories are chosen well, and we start plotting results, we can discuss that subtlety.
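
As a placeholder until those experiments exist, the protocol could look roughly like the sketch below: split the held-out molecules into in-distribution and out-of-distribution subsets by some criterion and report the same metric on each. The scalar `split_feature` and its threshold are purely hypothetical stand-ins for a chemically meaningful split:

```python
import numpy as np

def in_vs_out_of_distribution(metric_fn, y_true, y_pred_mean, y_pred_std,
                              split_feature, threshold):
    """Report a scalar metric separately on ID and OOD portions of the test set.

    `split_feature` is a per-molecule scalar (hypothetical example: molecular
    weight); points above `threshold` are treated as out-of-distribution.
    `metric_fn(y_true, mean, std)` returns a scalar, e.g. mean log-likelihood.
    """
    ood = split_feature > threshold
    return {
        "in_distribution": metric_fn(y_true[~ood], y_pred_mean[~ood], y_pred_std[~ood]),
        "out_of_distribution": metric_fn(y_true[ood], y_pred_mean[ood], y_pred_std[ood]),
    }
```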

@karalets (Collaborator, Author)

But I would still like the chemists here to add some more informative and concrete metrics that relate to practical usefulness. Examples are the quantities that the Cambridge-group paper evaluates.

@yuanqing-wang (Member)

Personally, I'd imagine real chemists care about false positive rates at a given cutoff and so on. But at this stage let's just suppose they're no different.
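
To sketch what such a downstream metric might look like, here is a false-positive rate computed by binarizing continuous activities and predictions at a cutoff; the cutoff itself is a hypothetical, application-dependent choice:

```python
import numpy as np

def false_positive_rate(y_true, y_pred, cutoff):
    """FPR after binarizing labels and predictions at an activity cutoff.

    'Positive' means predicted activity above the cutoff; the cutoff is
    application-dependent and used here only for illustration.
    """
    pred_pos = y_pred > cutoff
    true_neg = y_true <= cutoff
    false_pos = pred_pos & true_neg
    return false_pos.sum() / max(true_neg.sum(), 1)
```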

@karalets (Collaborator, Author)

I thought more about downstream quantities of interest as metrics, but I still have no clear picture of what people actually care about here.
