
Clarification on Vision Backbone Architecture in SPRC (Paper Reported: ViT-L/14, Source Code: ViT-g/14) #4

Open
tjddus9597 opened this issue Apr 3, 2024 · 8 comments


@tjddus9597

tjddus9597 commented Apr 3, 2024

I recently had the pleasure of reading your paper submitted to ICLR, which was selected as a spotlight.
The insights and methodologies discussed were both enlightening and inspiring.

However, upon examining the source code and associated checkpoint files, I discovered a significant discrepancy that could potentially impact the integrity of the reported results and the fairness of comparisons made within the paper.

The paper states that the SPRC model employs ViT-L/14 as its vision backbone.
Yet the default settings in the source code, and the architecture details inferred from the checkpoint files, suggest that the EVA-CLIP ViT-g/14 model was used instead. I confirmed this by examining the vision model's weights, which correspond to a depth of 40 and a dimension of 6144, characteristics unique to ViT-g/14. The performance gap between EVA-CLIP ViT-g/14 and CLIP ViT-L/14 is substantial, which leads to potentially unfair comparisons with existing composed image retrieval methods.
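
For reference, this is the kind of check I ran; a minimal sketch, assuming the weights sit under a "visual_encoder." prefix with a standard EVA/CLIP ViT block layout (the file name is a placeholder):

```python
# Hypothetical sketch: infer the vision backbone from a released checkpoint.
# The file name and the "visual_encoder." key prefix are assumptions; adjust
# them to the actual layout of the released checkpoint.
import torch

ckpt = torch.load("sprc_checkpoint.pth", map_location="cpu")
state = ckpt.get("model", ckpt)  # some checkpoints nest the weights under "model"

# Count the transformer blocks and read the MLP hidden width of the vision tower.
block_ids = {
    int(key.split(".")[2])
    for key in state
    if key.startswith("visual_encoder.blocks.")
}
depth = max(block_ids) + 1
mlp_dim = state["visual_encoder.blocks.0.mlp.fc1.weight"].shape[0]

# EVA-CLIP ViT-g/14 has 40 blocks with a 6144-wide MLP,
# whereas CLIP ViT-L/14 has 24 blocks with a 4096-wide MLP.
print(f"depth={depth}, mlp_dim={mlp_dim}")
```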

I believe this reporting error was not intentional. I suspect that the default BLIP-2 loading call in the LAVIS library, load_model_and_preprocess(name=args.blip_model_name, model_type="pretrain"), was used without recognizing that the "pretrain" model type selects the ViT-g/14 backbone (see the sketch below). Given the significant performance improvements reported and the influence your paper has already had, this oversight could lead to misunderstandings and inadvertently set a misleading benchmark for subsequent research.
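
As a point of reference, here is a minimal sketch with the stock LAVIS API showing how the model_type string selects the vision tower; the generic "blip2" registry name is used for illustration only, since SPRC registers its own model name:

```python
# Minimal sketch using the stock LAVIS API; the "blip2" registry name is used
# for illustration only (SPRC registers its own model name).
import torch
from lavis.models import load_model_and_preprocess

device = "cuda" if torch.cuda.is_available() else "cpu"

# model_type="pretrain" resolves to the EVA-CLIP ViT-g/14 BLIP-2 config ...
model_g, vis_proc_g, txt_proc_g = load_model_and_preprocess(
    name="blip2", model_type="pretrain", is_eval=True, device=device
)

# ... whereas model_type="pretrain_vitL" resolves to the CLIP ViT-L/14 config.
model_l, vis_proc_l, txt_proc_l = load_model_and_preprocess(
    name="blip2", model_type="pretrain_vitL", is_eval=True, device=device
)
```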

In light of the above, I respectfully suggest that the experiments be re-run with the ViT-L/14 backbone as initially reported, and that the findings be updated accordingly.

I want to emphasize that my intention is not to criticize but to ensure the integrity and reliability of influential research within our community. Correcting this discrepancy is not only in the best interest of maintaining scientific accuracy but also serves as a constructive step towards enhancing the credibility and utility of the findings for future explorations.

If there has been any misunderstanding on my part regarding the architecture used, I am open to correction and deeply apologize for any confusion caused.

Thank you for your attention to this matter. I look forward to your response and any corrective actions you deem appropriate.

@chunmeifeng
Owner

chunmeifeng commented Apr 3, 2024 via email

@tjddus9597
Author

Thank you for your quick reply. I would appreciate it if you could leave an answer that clarifies this issue.

@chunmeifeng
Owner

chunmeifeng commented Apr 14, 2024 via email

@baiyang4
Collaborator

baiyang4 commented Apr 15, 2024

Hi Kim,

Thank you for your insightful comments and for bringing the discrepancy in our manuscript to our attention. We sincerely apologize for the erroneous statement regarding the vision backbone architecture. We had assumed that utilizing model_type="pretrain" in the LAVIS BLIP-2 framework for model loading would default to the ViT-L model, which led to the misrepresentation in our manuscript. We have since revised the statement in the manuscript to reflect the accurate architecture used.

In response to your concerns, we have compared the results of the ViT-L and ViT-G architectures. Despite the observed performance difference, we believe it is essential to highlight that ViT-L remains competitive. Below is a summary of the comparison:

| Backbone | Recall@1 | Recall@5 | Recall@10 | Recall@50 | Recall_sub@1 | Recall_sub@2 | Recall_sub@3 | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ViT-L | 50.70 | 80.65 | 88.77 | 97.64 | 79.59 | 91.90 | 96.77 | 80.12 |
| ViT-G | 51.96 | 82.12 | 89.74 | 97.69 | 80.65 | 92.31 | 96.60 | 81.39 |

Different vision backbones on the CIRR test set

| Backbone | Dress R@10 | Dress R@50 | Shirt R@10 | Shirt R@50 | Toptee R@10 | Toptee R@50 | Average R@10 | Average R@50 | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ViT-L | 45.81 | 70.40 | 51.62 | 72.52 | 55.69 | 77.21 | 51.04 | 73.38 | 62.21 |
| ViT-G | 49.18 | 72.43 | 55.64 | 73.89 | 59.35 | 78.58 | 54.92 | 74.97 | 64.85 |

Different vision backbones on the FashionIQ dataset

We acknowledge the importance of ensuring the accuracy and integrity of our research findings, and we deeply appreciate your efforts in bringing this matter to our attention. Your feedback will undoubtedly contribute to the refinement of our work and enhance its credibility within the research community.

Additionally, to rectify the oversight, we've uploaded the ViT-L pretrained model and corresponding code to ensure transparency and reproducibility.

Thank you once again for your diligence and understanding. Please do not hesitate to reach out if you have any further questions or concerns.

Best regards,

@yytinykd

I want to use the 'clip_L' model, and I set vit_model="clip_L" in the blip2_qformer_cir_align_prompt.py file. However, when running the code, it still uses the 'eva_clip_g' model. Can you help me with this?

@baiyang4
Collaborator

To run clip_L, add --backbone pretrain_vitL to your training script; refer to blip_fine_tune_2.py:
parser.add_argument("--backbone", type=str, default="pretrain", help="pretrain for vit-g, pretrain_vitL for vit-l")
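
For context, the flag is presumably passed straight through as the LAVIS model_type; a hedged sketch of that wiring (only the flag names come from this thread, the rest is an assumption rather than the repository's exact code):

```python
# Hedged sketch of how --backbone presumably reaches the LAVIS loader in
# blip_fine_tune_2.py; only the flag names come from this thread, the rest
# of the wiring is an assumption.
import argparse
import torch
from lavis.models import load_model_and_preprocess

parser = argparse.ArgumentParser()
# Placeholder default; use the model name actually registered by the SPRC code.
parser.add_argument("--blip-model-name", type=str, default="blip2")
parser.add_argument("--backbone", type=str, default="pretrain",
                    help="pretrain for vit-g, pretrain_vitL for vit-l")
args = parser.parse_args()

# Passing --backbone pretrain_vitL makes the loader pick the CLIP ViT-L/14
# config instead of the default EVA-CLIP ViT-g/14.
model, vis_processors, txt_processors = load_model_and_preprocess(
    name=args.blip_model_name,
    model_type=args.backbone,
    is_eval=False,
    device="cuda" if torch.cuda.is_available() else "cpu",
)
```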

@yytinykd

yytinykd commented May 1, 2024

Hello, in your paper, you mentioned using "Ours+CLIP" and "Ours+BLIP". Could you please specify which versions of the pre-trained visual encoder models for CLIP and BLIP were used?

@tjddus9597
Author

tjddus9597 commented May 8, 2024

Hello, Baiyang. Thank you for your prompt reply.

It seems that using a ViT-L backbone results in a significant performance drop.
On FashionIQ, its performance (62.21) is only on par with TG-CIR (62.21, ViT-B/16) and Re-ranking (62.15, ViT-B/16 with 384-resolution images).
Additionally, I have confirmed that the last column of the averaged metrics reported for TG-CIR is incorrect: the reported average R@10, average R@50, and final average do not align.
On CIRR, the performance (80.12) also falls short of CoVR-BLIP (80.81, ViT-L/16) and Re-ranking (80.90, ViT-B/16 with 384-resolution images).

Although the paper claims that the proposed method achieves state-of-the-art results, a fair comparison suggests that it is merely competitive and does not achieve the best performance.

It hasn't been updated on arXiv yet. Do you have any plans to revise the manuscript?

Best regards,
