When the number of real samples is smaller than 10K, does the metric still produce a reliable score? #3
Comments
Thank you for your interest in our work 👍 Coverage is very stable against the number of samples. The problem is Density's sensitivity to the number of samples. However, even for Density, I would say 1k samples already give a stable result (low variance around the mean 1.0). So my answer to your question is a reserved yes. Please go ahead with 1k samples, but keep in mind that the metrics are based on samples and are therefore not completely free from sample variance.
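For reference, Density and Coverage can be sketched in a few lines on pre-extracted feature vectors. This is a minimal illustrative reimplementation, not the official `prdc` code; function names, sample sizes, and feature dimensions are made up for the toy check:

```python
import numpy as np
from scipy.spatial.distance import cdist

def density_coverage(real, fake, k=5):
    """Sketch of Density and Coverage on feature arrays of shape (n, dim)."""
    # k-th nearest-neighbour distance within the real set
    # (column 0 of the sorted distances is the self-distance, so index k is the k-th NN).
    radii = np.sort(cdist(real, real), axis=1)[:, k]
    # inside[j, i] is True iff fake_j lies inside the k-NN ball around real_i.
    inside = cdist(fake, real) <= radii[None, :]
    density = inside.sum() / (k * len(fake))    # ~1.0 when fake matches real
    coverage = inside.any(axis=0).mean()        # fraction of real balls hit by any fake
    return density, coverage

# Toy check: two independent samples from the same distribution
# should give density near 1 and coverage near 1.
rng = np.random.default_rng(0)
real = rng.normal(size=(1000, 16))
fake = rng.normal(size=(1000, 16))
d, c = density_coverage(real, fake)
print(f"density={d:.2f}, coverage={c:.2f}")
```

With 1k samples as discussed above, scores like these stay close to their large-sample values, but rerunning with a different seed will still move them slightly.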
Thanks for your quick response; it helps me a lot!
Hi, I use this metric and P&R to compare several generative models.
Yes, we have also experienced certain inconsistencies in the model rankings across different metrics, and I personally understand your pain. Unfortunately, there is no quick solution to the problems you are having; I would say they are deeply rooted in the difficulty of evaluating generative models. Maybe you have already thought about this, but let's think a bit about how to judge whether an evaluation metric is doing the right job. It's not so easy ;) I believe there are two ways.
In this context, we can only say that D&C remedy key shortcomings of FID and P&R, rather than saying that D&C are the evaluation metrics to be used. For your problem cases 1 and 2, we can't say that D&C are failing, because they may be rectifying wrong evaluation results given by FID or P&R. These are not "test cases" as in point 2 above, where the desired metric values or rankings are known. They are intriguing inconsistencies, but they are inconclusive about which metric is doing the right job. For problem case 3, the ranking's dependence on K is definitely a shortcoming of D&C. It is unfortunate, but partly expected, because there is no guarantee that D&C are perfect metrics. Sorry that my answers do not really solve any of your issues, but I can tell you that we face the same kind of issues, and they are deeply rooted in the inherent difficulty of evaluating generative models.
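The dependence on K mentioned above is easy to observe directly by recomputing a score for several values of `nearest_k`. A small self-contained sketch on synthetic features (the `coverage` helper, the mean shift, and all sizes here are illustrative, not the official `prdc` implementation):

```python
import numpy as np
from scipy.spatial.distance import cdist

def coverage(real, fake, k):
    # Fraction of real samples whose k-NN ball contains at least one fake sample.
    radii = np.sort(cdist(real, real), axis=1)[:, k]  # k-th NN distance (index 0 = self)
    return (cdist(fake, real) <= radii[None, :]).any(axis=0).mean()

rng = np.random.default_rng(0)
real = rng.normal(size=(1000, 16))
# A deliberately mismatched "model": the mean is shifted, so coverage is imperfect.
fake = rng.normal(loc=1.0, size=(1000, 16))

for k in (3, 5, 10):
    print(f"nearest_k={k:2d}  coverage={coverage(real, fake, k):.3f}")
```

Because the k-NN radii can only grow with k, coverage is non-decreasing in k; absolute scores (and, across several models, potentially their ranking) therefore depend on the chosen K, which is why K should be fixed and reported alongside the scores.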
Thanks for your detailed reply :)
Hi, thanks for sharing the great work!
I would like to use this metric to evaluate results in the image-to-image translation task. However, in I2I datasets the number of real samples is almost always less than 10K, and often only around 1k. In this case, does the metric still produce a reliable score? Can I use it directly?
Looking forward to your reply, thanks a lot!