
When the number of real samples is smaller than 10K, does the metric still produce a reliable score? #3

Closed
HelenMao opened this issue Aug 11, 2020 · 5 comments

Comments

@HelenMao

Hi, thanks for sharing the great work!
I would like to use this metric to evaluate results on image-to-image translation tasks. However, in I2I datasets the number of real samples is always less than 10K, and most datasets have only around 1K. In this case, does the metric still produce a reliable score? Can I use it directly?
Looking forward to your reply, thanks a lot!

@coallaoh
Collaborator

Thank you for your interest in our work 👍
We did a bit of analysis around this for the real==fake case, where the D&C metrics should give values close to 1.0.

[Screenshots: Density and Coverage plotted against the number of samples for the real==fake case.]

Coverage is very stable against #samples. The problem is Density's sensitivity to #samples. However, even for Density, I would say 1K samples already give a stable result (low variance around the mean of 1.0).

So my answer to your question is a qualified yes. Please go ahead with 1K samples, but keep in mind that the metrics are based on samples and are therefore not completely free from sample variance.
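
If you want to get a feel for that variance on your own data, here is a minimal sketch of the repeated-subsampling idea (assuming the `compute_prdc` helper shipped with this repository; the wrapper function and its defaults are only illustrative):

```python
import numpy as np
from prdc import compute_prdc  # assuming the compute_prdc helper from this repository

def dc_subsample_stats(real_features, fake_features, n=1000, repeats=10, nearest_k=5, seed=0):
    """Repeatedly subsample n real/fake features and collect Density & Coverage,
    giving a rough estimate of the sample variance at this sample size."""
    rng = np.random.default_rng(seed)
    densities, coverages = [], []
    for _ in range(repeats):
        r_idx = rng.choice(len(real_features), size=n, replace=False)
        f_idx = rng.choice(len(fake_features), size=n, replace=False)
        m = compute_prdc(real_features[r_idx], fake_features[f_idx], nearest_k=nearest_k)
        densities.append(m["density"])
        coverages.append(m["coverage"])
    return {"density": (np.mean(densities), np.std(densities)),
            "coverage": (np.mean(coverages), np.std(coverages))}
```

If the standard deviations are small relative to the means, your sample size is probably large enough for the comparison you have in mind.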

@HelenMao
Author

Thanks for your quick response; it helps me a lot!

@HelenMao
Author

HelenMao commented Sep 25, 2020

Hi, I used this metric and P&R to compare several generative models.
I found:

  1. Although FID_A < FID_B, the coverage and density of B are better than those of A.
  2. The trends of P&R and C&D are not consistent. For example, recall is smaller while the coverage score is larger. (This is not the outlier case reported in the paper.)
  3. When I choose a different K, the ranking of coverage and density among these models changes. For example, with K=3, C_A is much better than the other methods, but with K=5, C_A is the worst (a sketch of the K sweep I mean follows below).
    Moreover, in the conditional setting, neither P&R nor C&D seems very consistent with FID. Namely, there are many cases where FID is good but C&D is worse.
    Do you have any experience with that?
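
For reference, this is roughly the K sweep I mean (a minimal sketch, assuming the `compute_prdc` helper from this repository; the feature arrays and model names are placeholders):

```python
from prdc import compute_prdc  # assuming the compute_prdc helper from this repository

def rank_by_coverage(real_features, fake_features_by_model, ks=(3, 5, 10)):
    """For each K (nearest_k), rank the candidate models by their coverage score."""
    for k in ks:
        scores = {name: compute_prdc(real_features, fake, nearest_k=k)["coverage"]
                  for name, fake in fake_features_by_model.items()}
        ranking = sorted(scores, key=scores.get, reverse=True)
        print(f"K={k}: " + " > ".join(f"{name} ({scores[name]:.3f})" for name in ranking))
```

Running this with K=3 versus K=5 is where I see the ranking flip described above.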

@HelenMao HelenMao reopened this Sep 25, 2020
@coallaoh
Collaborator

Yes, we have also experienced certain inconsistencies in model rankings across different metrics, and I personally understand your pain. Unfortunately, there is no quick solution to the problems you are having - they are deeply rooted in the difficulty of evaluating generative models.

Maybe you have already thought about this, but let's think a bit about how to judge whether an evaluation metric is doing the right job. It's not so easy ;) I believe there are two ways.

  1. The metric is by definition what we want. For example, the accuracy metric is defined as the proportion of correct predictions among all predictions made. This metric, by definition, is an exact representation of what humans generally want from a model. However, it is not always easy to build such a fully "interpretable" evaluation metric - e.g. an evaluation metric for generative models. How do you algorithmically quantify fidelity and diversity? There is no easy way. Thus, we and previous researchers have come up with proxies like CNN embeddings and KDE-like density estimators. But among many proxy metrics, how do we judge whether metric A is better than metric B? This question leads to the second way of "evaluating an evaluation metric".

  2. Build a few test cases where you know how the metrics should behave & see if the metrics pass these tests. This is the method we adopted in our paper - with a handful of test cases where FID and P&R fail while D&C thrive. We cannot say that we have covered all meaningful test cases for evaluating generative models, but we did our best, seeking advice from researchers who have been working with generative models for years. And yet, it is very likely that D&C still fail in certain cases - we hope future researchers will find them and propose improved metrics.
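
As a concrete example of such a test case, the real==fake check discussed above should give D&C values near 1.0. A minimal sketch (assuming the `compute_prdc` helper from this repository; the Gaussian features are only a toy stand-in for real embeddings):

```python
import numpy as np
from prdc import compute_prdc  # assuming the compute_prdc helper from this repository

rng = np.random.default_rng(0)
features = rng.standard_normal((2000, 64))      # toy stand-in for real embeddings
real, fake = features[:1000], features[1000:]   # two samples from the same distribution

metrics = compute_prdc(real, fake, nearest_k=5)
# For identically distributed samples, density and coverage should both land near 1.0.
print(metrics["density"], metrics["coverage"])
```

A metric that drifts far from 1.0 on this check would fail the test case.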

In this context, we can only say D&C are remedying key shortcomings of FID and P&R, rather than saying that D&C are the evaluation metrics to be used.

For your cases 1 and 2, we can't say whether D&C are failing, because they may in fact be rectifying wrong evaluation results given by FID or P&R. They are not "test cases" as in point 2 above, where the desired metric values or rankings are known. They are intriguing inconsistencies, but they are inconclusive as to which metric is doing the right job.

For problem case 3, the ranking's dependence on K is definitely a shortcoming for D&C. It is unfortunate, but partly expected because there is no guarantee that D&C are perfect metrics.

Sorry that my answers do not really solve any of your issues. But I can tell you that we have the same kind of issues, and they are deeply rooted in the inherent difficulty of evaluating generative models.

@HelenMao
Author

Thanks for your detailed reply :)
Yes, you are right. It is really painful when I use all the metrics to evaluate the models and get inconsistent results, since I cannot draw any conclusions from them.
But just as you say, evaluating generative models is indeed difficult when there is no ground truth for evaluating the metric itself.
Thanks again for the discussions 👍
