
Add BLIP and image-to-text serving #181

Merged
merged 4 commits into main from jk-blip on Mar 28, 2023

Conversation

jonatanklosko
Member

Closes #179.


Similarly to CLIP, BLIP is composed of two separate sub-models (vision and text). There are a couple of BLIP architectures; for now I only added :for_conditional_generation, because I'm not sure whether the other ones generalise into servings. As for the :base model (which is like CLIP), I found no checkpoint to test against, so it's not particularly useful.
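
For illustration, here's a minimal sketch of how the added image-to-text serving could be used end to end. The checkpoint name, the Bumblebee.Vision namespace (the very thing debated below), and the exact loading steps are assumptions for the example, not code taken from this PR:

```elixir
# Hypothetical checkpoint; any BLIP captioning checkpoint using the
# :for_conditional_generation architecture should work the same way.
repo = {:hf, "Salesforce/blip-image-captioning-base"}

{:ok, model_info} = Bumblebee.load_model(repo)
{:ok, featurizer} = Bumblebee.load_featurizer(repo)
{:ok, tokenizer} = Bumblebee.load_tokenizer(repo)
{:ok, generation_config} = Bumblebee.load_generation_config(repo)

# The namespace is an assumption here (Vision vs. Multimodal is discussed below).
serving =
  Bumblebee.Vision.image_to_text(model_info, featurizer, tokenizer, generation_config)

image = StbImage.read_file!("cats.jpg")
Nx.Serving.run(serving, image)
#=> roughly %{results: [%{text: "two cats sleeping on a couch"}]}
```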

@seanmor5 it hit me that speech-to-text lives under audio and not multimodal, so it feels weird that image-to-text would go under multimodal and not vision. Let me know if you have any opinion on this :)

@seanmor5
Contributor

seanmor5 commented Mar 27, 2023

@jonatanklosko No issues with switching this out to not be multi-modal! Nice one!

Edit: I thought you had made the switch. But yeah, I think we should be consistent with Whisper. I guess it is strange though considering we'd keep the multimodal namespace for models and move the task to vision.

@seanmor5
Contributor

Is it possible to just use the base of the conditional generation model checkpoint to test the base model?

@jonatanklosko
Member Author

jonatanklosko commented Mar 27, 2023

> I guess it is strange though considering we'd keep the multimodal namespace for models and move the task to vision.

Yeah, I'm on the fence here. Keeping BLIP under multimodal makes sense since CLIP is already there and it joins two models, but having the task there feels inconsistent (unless we move speech-to-text there too).

> Is it possible to just use the base of the conditional generation model checkpoint to test the base model?

Not really. The base model is like CLIP: it has final projections for the visual and text states, but those projections are not used in any other variant. Even the naming in hf/transformers differs across those variants (self.text_model, self.text_encoder, self.text_decoder), so loading the checkpoint directly is not possible; we would have to manipulate the state dict manually. If we really wanted we could make a deterministic test, but I'm not sure it's worth it if there's no checkpoint.
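
To make the "manipulate the state dict manually" part concrete, here's a generic sketch of the kind of parameter-name remapping that would be involved; the prefixes just mirror the hf/transformers attribute names mentioned above, and none of this is actual Bumblebee loading code:

```elixir
# Illustrative only: rename a parameter prefix so a checkpoint saved with one
# attribute layout (e.g. "text_decoder.*") lines up with the layout another
# variant expects (e.g. "text_model.*").
rename_prefix = fn params, from, to ->
  Map.new(params, fn {name, tensor} ->
    {String.replace_prefix(name, from, to), tensor}
  end)
end

# params = rename_prefix.(params, "text_decoder.", "text_model.")
```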

@jonatanklosko
Member Author

Well, "Zero-shot image classification" intuitively belongs under vision, right? And it will use CLIP which is under multimodal. At this point I think multimodal should be misc 😄

@jonatanklosko jonatanklosko merged commit 48c26db into main Mar 28, 2023
@jonatanklosko jonatanklosko deleted the jk-blip branch March 28, 2023 12:52