Comparison on PACS and VLCS #22

Closed
vihari opened this issue Apr 16, 2020 · 4 comments

Comments

@vihari

vihari commented Apr 16, 2020

In Table 1 and Table 2 of your paper, you show the performance of DeepAll alongside the performance of the related work. DeepAll is the baseline number that should have been the same across different methods had the dataset and implementation been standardized, is that correct? My question is: how are you comfortable making comparisons with methods from different implementations when they have such diverging baseline numbers? I mean, how can you be sure whether the improvements come from a better implementation or from better generalization? One could take the improvement over DeepAll as indicative of domain generalization, but deltas over the baseline need not be linear; that is, it might be harder to push DeepAll further when it is already doing well. I am at a loss trying to make sense of the PACS and VLCS evaluations. What am I missing?

Thanks

@silvia1993
Collaborator

silvia1993 commented Apr 16, 2020

"DeepAll is the baseline number that should have been the same across different methods had the dataset, implementation are standardized, is that correct?"
Yes, it is correct.

About your question: there can be several differences in the way each method is implemented that lead to a different DeepAll, even if all the methods start from the same backbone. They may use a different learning rate, batch size, or data augmentation. So, in an ideal world every method that uses the same backbone would have the same DeepAll, but since that is not possible (also because not all the algorithms provide code, so it is not always possible to see the implementation choices in detail), we think it is fairer to report the DeepAll for each method.

Furthermore, from Tables 1 and 2 of our work you can see that our DeepAll is, in almost all cases, higher than the others: we tried to compare our method against the most powerful version of DeepAll in order to see the actual gain. The methods in the literature (with which we compare) show what happens in settings where DeepAll is quite low, but you don't know whether those methods would still work once DeepAll has been raised.

I hope that answers your questions!

@vihari
Author

vihari commented Apr 16, 2020

Thanks a lot for the quick response.
I really appreciate that you report DeepAll for all the methods. This brought much-needed clarity, since many other DG papers that use these datasets compare directly, without any indication that some or much of the improvement comes from a better implementation.
Just a quick follow up question:

The methods in the literature (with which we compare) show what happens in settings where DeepAll is quite low, but you don't know whether those methods would still work once DeepAll has been raised.

Although I see your point, I feel we cannot be sure of it. MLDG and DeepC (referring to Table 1 of your paper) improve over their DeepAll by 65.27 -> 69.26 (+3.99) and 67.24 -> 70.01 (+2.77) respectively, compared to 71.52 -> 73.38 (+1.86) for JiGen. The improvements of MLDG and DeepC may shrink when built on JiGen's stronger implementation, but it is hard to say whether they would end up better or worse than JiGen. What are your comments?
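To make the delta comparison concrete, here is a minimal sketch in Python using the numbers from Table 1 as quoted above; the "fraction of remaining error removed" column is just one possible way to normalize the gains on top of different baselines, not a metric from the paper:

```python
# Compare DG methods by their absolute gain over their own DeepAll baseline,
# and by the fraction of the remaining error (100 - DeepAll) that each removes.
# Numbers are the PACS averages quoted above (Table 1 of the JiGen paper).
results = {
    "MLDG":  {"deepall": 65.27, "method": 69.26},
    "DeepC": {"deepall": 67.24, "method": 70.01},
    "JiGen": {"deepall": 71.52, "method": 73.38},
}

for name, r in results.items():
    abs_gain = r["method"] - r["deepall"]
    err_reduction = abs_gain / (100.0 - r["deepall"])
    print(f"{name:6s} DeepAll {r['deepall']:.2f} -> {r['method']:.2f} "
          f"(+{abs_gain:.2f}, {err_reduction:.1%} of remaining error)")
```

Even with a normalization like this, whether MLDG or DeepC would beat JiGen when built on the stronger baseline remains an empirical question.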

I find this problem quite unsettling, so in our paper we refrained from comparing beyond the method whose implementation we used: JiGen -- https://arxiv.org/abs/2003.12815.

@silvia1993
Collaborator

"The improvements of MLDG and DeepC may suffer when using better implementation of JiGen but it is hard to answer if it would be better or worse than JiGen."

Yes, I get your point, but it is infeasible to reimplement all the methods on top of our baseline implementation.
So, I think reporting the DeepAll for each method is enough for a fair comparison.

@vihari
Author

vihari commented Apr 16, 2020

Thanks, that answers my questions.

vihari closed this as completed Apr 16, 2020