
Add BLIP and image-to-text serving #181

Merged
merged 4 commits into main from jk-blip on Mar 28, 2023

Conversation

jonatanklosko
Member

Closes #179.


Similarly to CLIP, BLIP is composed of two separate sub-models (vision and text). There are a couple of BLIP architectures; for now I only added :for_conditional_generation, because I'm not sure whether the other ones generalise into servings. As for the :base model (which is like CLIP), I found no checkpoint to test against, so it's not particularly useful.
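
For illustration, here's a minimal sketch of how the added image-to-text serving could be used end to end. The checkpoint name, the Bumblebee.Vision namespace (the very thing debated below), and the exact loading steps are assumptions for the example, not code taken from this PR:

```elixir
# Hypothetical checkpoint; any BLIP captioning checkpoint using the
# :for_conditional_generation architecture should work the same way.
repo = {:hf, "Salesforce/blip-image-captioning-base"}

{:ok, model_info} = Bumblebee.load_model(repo)
{:ok, featurizer} = Bumblebee.load_featurizer(repo)
{:ok, tokenizer} = Bumblebee.load_tokenizer(repo)
{:ok, generation_config} = Bumblebee.load_generation_config(repo)

# The namespace is an assumption here (Vision vs. Multimodal is discussed below).
serving =
  Bumblebee.Vision.image_to_text(model_info, featurizer, tokenizer, generation_config)

image = StbImage.read_file!("cats.jpg")
Nx.Serving.run(serving, image)
#=> roughly %{results: [%{text: "two cats sleeping on a couch"}]}
```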

@seanmor5 it hit me that speech-to-text lives under audio and not multimodal, so it feels weird that image-to-text would go under multimodal and not vision. Let me know if you have any opinion on this :)

@seanmor5
Contributor

seanmor5 commented Mar 27, 2023

@jonatanklosko No issues with switching this out to not be multi-modal! Nice one!

Edit: I thought you had made the switch. But yeah, I think we should be consistent with Whisper. I guess it is strange though considering we'd keep the multimodal namespace for models and move the task to vision.

@seanmor5
Contributor

Is it possible to just use the base of the conditional generation model checkpoint to test the base model?

@jonatanklosko
Member Author

jonatanklosko commented Mar 27, 2023

> I guess it is strange though considering we'd keep the multimodal namespace for models and move the task to vision.

Yeah, I'm on the fence here. Keeping BLIP under multimodal makes sense since CLIP is already there and it joins two models, but having the task there feels inconsistent (unless we move speech-to-text there too).

> Is it possible to just use the base of the conditional generation model checkpoint to test the base model?

Not really. The base model is like CLIP: it has final projections for the visual and text states, but those projections are not used in any other variant. Even the naming in hf/transformers differs across those variants (self.text_model, self.text_encoder, self.text_decoder), so loading the checkpoint directly is not possible; we would have to manipulate the state dict manually. If we really wanted we could make a deterministic test, but I'm not sure it's worth it if there's no checkpoint.
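
To make the "manipulate the state dict manually" part concrete, here's a generic sketch of the kind of parameter-name remapping that would be involved; the prefixes just mirror the hf/transformers attribute names mentioned above, and none of this is actual Bumblebee loading code:

```elixir
# Illustrative only: rename a parameter prefix so a checkpoint saved with one
# attribute layout (e.g. "text_decoder.*") lines up with the layout another
# variant expects (e.g. "text_model.*").
rename_prefix = fn params, from, to ->
  Map.new(params, fn {name, tensor} ->
    {String.replace_prefix(name, from, to), tensor}
  end)
end

# params = rename_prefix.(params, "text_decoder.", "text_model.")
```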

@jonatanklosko
Member Author

Well, "Zero-shot image classification" intuitively belongs under vision, right? And it will use CLIP which is under multimodal. At this point I think multimodal should be misc 😄

@jonatanklosko jonatanklosko merged commit 48c26db into main Mar 28, 2023
@jonatanklosko jonatanklosko deleted the jk-blip branch March 28, 2023 12:52