
About the choices in LLaVA+S^2 implementation #10

Open
jungle-gym-ac opened this issue May 26, 2024 · 2 comments

Comments

@jungle-gym-ac

Great work! I've read the paper, and it seems LLaVA+S^2 is implemented with the OpenCLIP vision encoder and the LLM is fine-tuned with LoRA. However, the LLaVA baseline you compare against uses the OpenAI CLIP vision encoder, and its LLM is fully fine-tuned (without LoRA).

If I'm right, I wonder whether you have tried using the same vision encoder, or fully fine-tuning the LLM, and what the results are in those settings? Thank you.
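
For reference, a minimal sketch of the two LLM-tuning regimes discussed here, using HuggingFace Transformers and PEFT; the base model name, LoRA rank/alpha, and target modules are illustrative assumptions, not necessarily what either setup used:

```python
# Sketch only: contrasts full fine-tuning with LoRA fine-tuning of the LLM.
# Model name and LoRA hyperparameters are assumptions for illustration.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

llm = AutoModelForCausalLM.from_pretrained("lmsys/vicuna-7b-v1.5")

# Full fine-tuning: every LLM parameter receives gradients.
for p in llm.parameters():
    p.requires_grad = True

# LoRA fine-tuning: freeze the base weights and train low-rank adapters only.
lora_cfg = LoraConfig(
    r=128,
    lora_alpha=256,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed targets
    task_type="CAUSAL_LM",
)
llm_lora = get_peft_model(llm, lora_cfg)
llm_lora.print_trainable_parameters()  # only adapter weights are trainable
```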

bfshi (Owner) commented May 26, 2024

Hi @jungle-gym-ac, yeah, good question. In the scaling experiment on LLaVA (Fig. 3 in the paper), all the models, including the baselines, use OpenCLIP. The experiment comparing LLaVA-S^2 to official LLaVA (Table 11 in the Appendix) uses OpenAI CLIP.

And you are right: all the models we trained on LLaVA use LoRA, while the official LLaVA checkpoint we compare to uses full fine-tuning. According to the official LLaVA repo, LLaVA's performance with full fine-tuning vs. LoRA doesn't differ much on average, but yeah, comparing against an official LoRA checkpoint would be fairer. We will include this in a later version of the paper. We didn't try LLaVA-S^2 with full fine-tuning.
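
A rough sketch of the multiscale forward that S^2 wraps around a vision tower, based on the s2wrapper README; the encoder choice (OpenAI CLIP here), the scales, and the shapes in the comments are illustrative assumptions, not the exact LLaVA-S^2 training configuration:

```python
# Sketch only: extract S^2 multiscale features from a CLIP vision tower.
import torch
from transformers import CLIPVisionModel
from s2wrapper import forward as multiscale_forward

vision_tower = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14-336")
vision_tower.eval()

def patch_tokens(x):
    # Return patch-token features (dropping the CLS token) as (B, N, C).
    return vision_tower(x).last_hidden_state[:, 1:]

images = torch.randn(1, 3, 336, 336)

with torch.no_grad():
    # Single-scale baseline: (1, 576, 1024) for a 336px, patch-14 ViT-L.
    base = patch_tokens(images)

    # S^2 with scales [1, 2]: the 672px version is split into 336px crops,
    # each crop goes through the same encoder, and the merged large-scale
    # features are concatenated channel-wise with the small-scale ones,
    # so the token count stays 576 while channels double to 2048.
    feats = multiscale_forward(patch_tokens, images, scales=[1, 2])

print(base.shape, feats.shape)
```

The point of the split-and-merge is that the encoder only ever sees inputs at its pretraining resolution, which is why S^2 can reuse an off-the-shelf CLIP or OpenCLIP tower without retraining it for larger images.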

bfshi (Owner) commented May 28, 2024 via email
