
When the number of real samples is smaller than 10K, does the metric still produce a reliable score? #3

Closed
HelenMao opened this issue Aug 11, 2020 · 5 comments

Comments

@HelenMao

Hi, thanks for sharing the great work!
I would like to use this metric to evaluate results on image-to-image translation tasks. However, in I2I datasets the number of real samples is always less than 10K, and most datasets have only around 1K. In this case, does the metric still produce a reliable score? Can I use it directly?
Looking forward to your reply, thanks a lot!

@coallaoh
Collaborator

Thank you for your interest in our work 👍
We did a bit of analysis around this for the real==fake case, where the D&C metrics should give values close to 1.0.

[Screenshots: Density and Coverage plotted against the number of samples for the real==fake case.]

Coverage is very stable against #samples. The problem is Density's sensitivity to #samples. However, even for Density, I would say 1K samples already give a stable result (low variance around the mean of 1.0).

So my answer to your question is a qualified yes. Please go ahead with 1K samples, but keep in mind that the metrics are based on samples and are therefore not completely free from sample variance.
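
If you want to get a feel for that variance on your own data, here is a minimal sketch of the repeated-subsampling idea (assuming the `compute_prdc` helper shipped with this repository; the wrapper function and its defaults are only illustrative):

```python
import numpy as np
from prdc import compute_prdc  # assuming the compute_prdc helper from this repository

def dc_subsample_stats(real_features, fake_features, n=1000, repeats=10, nearest_k=5, seed=0):
    """Repeatedly subsample n real/fake features and collect Density & Coverage,
    giving a rough estimate of the sample variance at this sample size."""
    rng = np.random.default_rng(seed)
    densities, coverages = [], []
    for _ in range(repeats):
        r_idx = rng.choice(len(real_features), size=n, replace=False)
        f_idx = rng.choice(len(fake_features), size=n, replace=False)
        m = compute_prdc(real_features[r_idx], fake_features[f_idx], nearest_k=nearest_k)
        densities.append(m["density"])
        coverages.append(m["coverage"])
    return {"density": (np.mean(densities), np.std(densities)),
            "coverage": (np.mean(coverages), np.std(coverages))}
```

If the standard deviations are small relative to the means, your sample size is probably large enough for the comparison you have in mind.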

@HelenMao
Author

Thanks for your quick response; it helps me a lot!

@HelenMao
Author

HelenMao commented Sep 25, 2020

Hi, I used this metric and P&R to compare several generative models.
I found:

  1. Although FID_A < FID_B, the coverage and density of B are better than those of A.
  2. The trends of P&R and C&D are not consistent. For example, recall is smaller while the coverage score is larger. (This is not the outlier case reported in the paper.)
  3. When I choose a different K, the ranking of coverage and density among these models changes. For example, with K=3, C_A is much better than the other methods, but with K=5, C_A is the worst (a sketch of the K sweep I mean follows below).
    Moreover, in the conditional setting, neither P&R nor C&D seems very consistent with FID. Namely, there are many cases where FID is good but C&D is worse.
    Do you have any experience with that?
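
For reference, this is roughly the K sweep I mean (a minimal sketch, assuming the `compute_prdc` helper from this repository; the feature arrays and model names are placeholders):

```python
from prdc import compute_prdc  # assuming the compute_prdc helper from this repository

def rank_by_coverage(real_features, fake_features_by_model, ks=(3, 5, 10)):
    """For each K (nearest_k), rank the candidate models by their coverage score."""
    for k in ks:
        scores = {name: compute_prdc(real_features, fake, nearest_k=k)["coverage"]
                  for name, fake in fake_features_by_model.items()}
        ranking = sorted(scores, key=scores.get, reverse=True)
        print(f"K={k}: " + " > ".join(f"{name} ({scores[name]:.3f})" for name in ranking))
```

Running this with K=3 versus K=5 is where I see the ranking flip described above.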

@HelenMao HelenMao reopened this Sep 25, 2020
@coallaoh
Collaborator

Yes, we have also experienced certain inconsistencies in model rankings across different metrics, and I personally understand your pain. Unfortunately, there is no quick solution to the problems you are having - they are deeply rooted in the difficulty of evaluating generative models.

Maybe you have already thought about this, but let's think a bit about how to judge whether an evaluation metric is doing the right job. It's not so easy ;) I believe there are two ways.

  1. The metric is by definition what we want. For example, the accuracy metric is defined as the proportion of correct predictions among all predictions made. This metric, by definition, is an exact representation of what humans generally want from a model. However, it is not always easy to build such a fully "interpretable" evaluation metric - e.g. an evaluation metric for generative models. How do you algorithmically quantify fidelity and diversity? There is no easy way. Thus, we and previous researchers have come up with proxies like CNN embeddings and KDE-like density estimators. But among many proxy metrics, how do we judge whether metric A is better than metric B? This question leads to the second way of "evaluating an evaluation metric".

  2. Build a few test cases where you know how the metrics should behave & see if the metrics pass these tests. This is the method we adopted in our paper - with a handful of test cases where FID and P&R fail while D&C thrive. We cannot say that we have covered all meaningful test cases for evaluating generative models, but we did our best, seeking advice from researchers who have been working with generative models for years. And yet, it is very likely that D&C still fail in certain cases - we hope future researchers will find them and propose improved metrics.
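
As a concrete example of such a test case, the real==fake check discussed above should give D&C values near 1.0. A minimal sketch (assuming the `compute_prdc` helper from this repository; the Gaussian features are only a toy stand-in for real embeddings):

```python
import numpy as np
from prdc import compute_prdc  # assuming the compute_prdc helper from this repository

rng = np.random.default_rng(0)
features = rng.standard_normal((2000, 64))      # toy stand-in for real embeddings
real, fake = features[:1000], features[1000:]   # two samples from the same distribution

metrics = compute_prdc(real, fake, nearest_k=5)
# For identically distributed samples, density and coverage should both land near 1.0.
print(metrics["density"], metrics["coverage"])
```

A metric that drifts far from 1.0 on this check would fail the test case.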

In this context, we can only say D&C are remedying key shortcomings of FID and P&R, rather than saying that D&C are the evaluation metrics to be used.

For your cases 1 and 2, we can't say whether D&C are failing, because they may in fact be rectifying wrong evaluation results given by FID or P&R. They are not "test cases" as in point 2 above, where the desired metric values or rankings are known. They are intriguing inconsistencies, but they are inconclusive as to which metric is doing the right job.

For problem case 3, the ranking's dependence on K is definitely a shortcoming for D&C. It is unfortunate, but partly expected because there is no guarantee that D&C are perfect metrics.

Sorry that my answers do not really solve any of your issues. But I can tell you that we have the same kind of issues, and they are deeply rooted in the inherent difficulty of evaluating generative models.

@HelenMao
Author

Thanks for your detailed reply :)
Yes, you are right. It is really painful when I use all the metrics to evaluate the models and get inconsistent results, since I cannot draw any conclusions from them.
But just as you say, evaluating generative models is indeed difficult when there is no ground truth for evaluating the metric itself.
Thanks again for the discussions 👍
