
How to generate image from Image+Text? #4

Closed
bakachan19 opened this issue May 23, 2023 · 5 comments

@bakachan19

Hi.
Thanks for the great work.
In the README I saw that there are several supported tasks:

Audio to Image
Audio+Text to Image
Audio+Image to Image
Image to Image
Text to Image
Thermal to Image
Depth to Image: Coming soon.

I am new to this type of application, so I was wondering: is it possible to generate an image from image+text? For example, given an image of a dog and the text "pink flowers", I would like to generate an image that contains a dog and pink flowers.
If so, could you provide the code for an example? I was looking at the code in api.py and I am a bit confused about the use of prompt and text. Moreover, do I need to normalize the embeddings of the image and text before summing them together, or should I normalize the summed embedding?

I greatly appreciate your help.
Thanks.

@Zeqiang-Lai
Owner

I don't have time to implement it now, but you could refer to the branch

    elif audio is not None and text is not None:

to implement it yourself. The normalization is already handled there. In a nutshell, the text and image embeddings should not be normalized; the audio embeddings should.

The stable-diffusion-unclip model we use takes two conditions: (1) a prompt and (2) a CLIP image embedding.

When we replace the CLIP image embedding with an ImageBind embedding, we can achieve anything2image.

The prompt in api.py refers to the prompt mentioned above. The text refers to the ImageBind text embedding, which replaces the image embedding and is fed into the diffusion model.
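
A minimal sketch of what such a branch could look like, modeled on the existing branches in api.py; the imagebind.load_and_transform_text helper and the pipe(..., image_embeds=...) call signature are assumptions based on how the other branches work:

    elif image is not None and text is not None:
        # Hypothetical image+text branch; not part of the current api.py.
        # ImageBind vision embedding, not normalized (per the note above).
        Image.fromarray(image).save('tmp.png')
        embeddings = model.forward({
            imagebind.ModalityType.VISION: imagebind.load_and_transform_vision_data(['tmp.png'], device),
        }, normalize=False)
        image_embeddings = embeddings[imagebind.ModalityType.VISION]
        os.remove('tmp.png')
        # ImageBind text embedding, also not normalized.
        embeddings = model.forward({
            imagebind.ModalityType.TEXT: imagebind.load_and_transform_text([text], device),
        }, normalize=False)
        text_embeddings = embeddings[imagebind.ModalityType.TEXT]
        # Sum the two embeddings and feed the result to the unclip pipeline
        # in place of the CLIP image embedding.
        embeddings = image_embeddings + text_embeddings
        images = pipe(prompt=prompt, image_embeds=embeddings).images

Whether to sum or average the two embeddings is a design choice; the key point is that neither the image nor the text embedding is normalized before combining.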

@bakachan19
Author

Thanks!

@bakachan19
Author

Sorry for bothering you again.
I was going through the original ImageBind code and it looks like the image embeddings are L2-normalized:
https://github.com/facebookresearch/ImageBind/blob/38a9132636f6ca2acdd6bb3d3c10be5859488f59/models/imagebind_model.py#L421

modality_postprocessors[ModalityType.VISION] = Normalize(dim=-1)

but not temperature-scaled.
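
For context, Normalize(dim=-1) in imagebind_model.py is plain L2 normalization along the feature dimension, roughly equivalent to:

    import torch
    import torch.nn.functional as F

    x = torch.randn(1, 1024)       # stand-in for a raw vision embedding
    x_l2 = F.normalize(x, dim=-1)  # what Normalize(dim=-1) applies
    print(x_l2.norm(dim=-1))       # tensor([1.]): unit L2 norm

so every vision embedding ends up with unit length.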
Is there a reason why you skip normalization in your implementation?

        if image is not None:
            Image.fromarray(image).save('tmp.png')
            embeddings = model.forward({
                imagebind.ModalityType.VISION: imagebind.load_and_transform_vision_data(['tmp.png'], device),
            }, normalize=False)
            image_embeddings = embeddings[imagebind.ModalityType.VISION]
            os.remove('tmp.png')

Thank you for your time!

@Zeqiang-Lai
Owner

It was obtained via trial and error. I didn't dive into the theory too much due to time constraints.

@bakachan19
Author

Oh, I see.
Thanks.
