Clarification on Vision Backbone Architecture in SPRC (Paper Reported: ViT-L/14, Source Code: ViT-g/14) #4
Hi Sungyeon Kim, thanks for your email. We will check it and get back to you. Thanks again for your attention. If you have any questions, feel free to ask.
Best regards,
Chunmei
Thank you for your quick reply. I would appreciate it if you could leave an answer that clarifies this issue.
Hi Kim, thanks for your follow-up! We will update you when we get back to the office.
Hi Kim, thank you for your insightful comments and for bringing the discrepancy in our manuscript to our attention. We sincerely apologize for the erroneous statement regarding the vision backbone architecture. We had assumed that utilizing the default BLIP-2 'pretrain' configuration in LAVIS corresponded to the ViT-L backbone.

In response to your concerns, we have compared the results of the ViT-L and ViT-G architectures. Despite the performance difference observed, we believe it is essential to highlight that ViT-L remains competitively performant. Below is a summary of the comparison:

Different vision backbones for the CIRR test set

Different vision backbones for the F-IQ dataset

We acknowledge the importance of ensuring the accuracy and integrity of our research findings, and we deeply appreciate your efforts in bringing this matter to our attention. Your feedback will undoubtedly contribute to the refinement of our work and enhance its credibility within the research community. Additionally, to rectify the oversight, we have uploaded the ViT-L pretrained model and the corresponding code to ensure transparency and reproducibility.

Thank you once again for your diligence and understanding. Please do not hesitate to reach out if you have any further questions or concerns.

Best regards,
I want to use the 'clip_L' model, so I modified 'vit_model="clip_L"' in the blip2_qformer_cir_align_prompt.py file. However, when running the code, it always uses the 'eva_clip_g' model. Can you help me with this?
To run clip_L, you need to add this.
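As a rough sketch, this is one way to do it using only the public LAVIS loader; the model name blip2_feature_extractor and the model type pretrain_vitL below are LAVIS's own identifiers, not this repository's blip2_qformer_cir_align_prompt, so adapt them to the repo's config as needed:

```python
# Sketch only: load a BLIP-2 Q-Former with the CLIP ViT-L/14 vision tower
# via the public LAVIS API (model_type "pretrain" would load EVA-CLIP ViT-g/14).
import torch
from lavis.models import load_model_and_preprocess

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model, vis_processors, txt_processors = load_model_and_preprocess(
    name="blip2_feature_extractor",   # LAVIS model name; adapt for this repo's model
    model_type="pretrain_vitL",       # selects the CLIP ViT-L/14 vision backbone
    is_eval=True,
    device=device,
)

# Sanity check: confirm which vision tower was actually instantiated.
vit = model.visual_encoder
print(type(vit).__name__, f"{sum(p.numel() for p in vit.parameters()) / 1e6:.0f}M params")
```

If the repository builds the model from its own config files rather than through load_model_and_preprocess, the equivalent change would likely be to point the config's vit_model entry at clip_L as well, rather than only editing the default argument in the model class.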
Hello, Baiyang. Thank you for your prompt reply. It seems that using a ViT-L backbone results in a significant performance drop. Although the paper claims that the proposed method achieves state-of-the-art results, a fair comparison suggests that it is merely competitive, or does not achieve the best performance. The manuscript hasn't been updated on arXiv yet; do you have any plans to revise it?

Best regards,
I recently had the pleasure of reading your paper submitted to ICLR, which was selected as a spotlight.
The insights and methodologies discussed were both enlightening and inspiring.
However, upon examining the source code and associated checkpoint files, I discovered a significant discrepancy that could potentially impact the integrity of the reported results and the fairness of comparisons made within the paper.
The paper states that the SPRC model employs ViT-L/14 as its vision backbone architecture.
Yet, the default settings in the source code and the architecture details inferred from the checkpoint files suggest the use of the EVA-CLIP ViT-g/14 model instead. This discrepancy was confirmed by examining the vision model's weights, which correspond to a depth of 40 and a dimension of 6144, characteristics unique to the ViT-g/14 model.

The performance gap between the EVA-CLIP ViT-g/14 and CLIP ViT-L/14 models is substantial, leading to potentially unfair comparisons with existing composed image retrieval methods.
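For reference, a minimal sketch of the kind of checkpoint inspection that surfaces these numbers; the file name checkpoint.pth and the visual_encoder. key prefix are illustrative assumptions, not the repository's exact names:

```python
# Sketch: infer the vision backbone from a released checkpoint's weight names and shapes.
# "checkpoint.pth" and the "visual_encoder." prefix are assumptions for illustration.
import torch

state = torch.load("checkpoint.pth", map_location="cpu")
state = state.get("model", state)  # some checkpoints nest the weights under a "model" key

# Count the distinct transformer blocks in the vision tower to get its depth.
block_ids = {k.split(".")[2] for k in state if k.startswith("visual_encoder.blocks.")}
print("vision depth:", len(block_ids))  # 40 points to ViT-g/14; CLIP ViT-L/14 has 24 layers

# Check an MLP weight shape for the hidden dimension (6144 for EVA-CLIP ViT-g/14).
for name, tensor in state.items():
    if name.startswith("visual_encoder.blocks.0.mlp.") and name.endswith(".weight"):
        print(name, tuple(tensor.shape))
```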
I believe this reporting error was not intentional. My guess is that the default declaration function for BLIP-2 in the LAVIS library, load_model_and_preprocess(name=args.blip_model_name, model_type="pretrain"), might have been used without recognizing that the 'pretrain' argument specifies the ViT-g/14 model. Given the significant performance improvements and the influence your paper has already had, this oversight could lead to misunderstandings and inadvertently set a misleading benchmark for subsequent research.

In light of the above, I respectfully suggest that the experiments be re-conducted using ViT-L as initially reported, and that the findings be updated accordingly. Should the performance with ViT-L/14 exhibit a significant drop or fail to outperform existing methods, it would be crucial to address this in the spirit of scientific accuracy and fairness.
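For completeness, the backbone selected by that default can be verified directly; a rough sketch assuming the public LAVIS API, with blip2_feature_extractor standing in for args.blip_model_name and the vision weights living under a visual_encoder.blocks. prefix (both assumptions):

```python
# Sketch: confirm which vision tower the default model_type="pretrain" actually resolves to.
from lavis.models import load_model

# "blip2_feature_extractor" is an assumed stand-in for args.blip_model_name.
model = load_model(name="blip2_feature_extractor", model_type="pretrain",
                   is_eval=True, device="cpu")

depth = len({k.split(".")[2] for k in model.state_dict()
             if k.startswith("visual_encoder.blocks.")})
print("vision depth:", depth)  # 40 blocks here would indicate EVA-CLIP ViT-g/14, not ViT-L/14
```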
I want to emphasize that my intention is not to criticize but to ensure the integrity and reliability of influential research within our community. Correcting this discrepancy is not only in the best interest of maintaining scientific accuracy but also serves as a constructive step towards enhancing the credibility and utility of the findings for future explorations.
If there has been any misunderstanding on my part regarding the architecture used, I am open to correction and deeply apologize for any confusion caused.
Thank you for your attention to this matter. I look forward to your response and any corrective actions you deem appropriate.