
How to generate image from Image+Text? #4

Closed
bakachan19 opened this issue May 23, 2023 · 5 comments

@bakachan19

Hi.
Thanks for the great work.
In the README I saw that there are several supported tasks:

Audio to Image
Audio+Text to Image
Audio+Image to Image
Image to Image
Text to Image
Thermal to Image
Depth to Image: Coming soon.

I am new to this type of application, so I was wondering: is it possible to generate an image from image+text? For example, given an image of a dog and the text "pink flowers", I would like to generate an image that contains a dog and pink flowers.
If so, could you provide the code for an example? I was looking at the code in api.py and I am a bit confused about the use of prompt and text. Moreover, do I need to normalize the embeddings of the image and text before summing them together, or should I normalize the summed embedding?

I greatly appreciate your help.
Thanks.

@Zeqiang-Lai
Owner

I don't have time to implement it now, but you could refer to the branch

    elif audio is not None and text is not None:

to implement it yourself. The normalization is already handled there. In a nutshell, the text and image embeddings should not be normalized; the audio embeddings should.

The stable-diffusion-unclip model we use takes two conditions: (1) a prompt and (2) a CLIP image embedding.

When we replace the CLIP image embedding with an ImageBind embedding, we can achieve anything2image.

The prompt in api.py refers to the prompt mentioned above. The text refers to the ImageBind text embedding, which replaces the image embedding and is fed into the diffusion model.
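
A minimal sketch of what such a branch could look like, modeled on the existing branches in api.py; the imagebind.load_and_transform_text helper and the pipe(..., image_embeds=...) call signature are assumptions based on how the other branches work:

    elif image is not None and text is not None:
        # Hypothetical image+text branch; not part of the current api.py.
        # ImageBind vision embedding, not normalized (per the note above).
        Image.fromarray(image).save('tmp.png')
        embeddings = model.forward({
            imagebind.ModalityType.VISION: imagebind.load_and_transform_vision_data(['tmp.png'], device),
        }, normalize=False)
        image_embeddings = embeddings[imagebind.ModalityType.VISION]
        os.remove('tmp.png')
        # ImageBind text embedding, also not normalized.
        embeddings = model.forward({
            imagebind.ModalityType.TEXT: imagebind.load_and_transform_text([text], device),
        }, normalize=False)
        text_embeddings = embeddings[imagebind.ModalityType.TEXT]
        # Sum the two embeddings and feed the result to the unclip pipeline
        # in place of the CLIP image embedding.
        embeddings = image_embeddings + text_embeddings
        images = pipe(prompt=prompt, image_embeds=embeddings).images

Whether to sum or average the two embeddings is a design choice; the key point is that neither the image nor the text embedding is normalized before combining.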

@bakachan19
Author

Thanks!

@bakachan19
Author

Sorry for bothering you again.
I was going through the original ImageBind code and it looks like the image embeddings are L2-normalized:
https://github.com/facebookresearch/ImageBind/blob/38a9132636f6ca2acdd6bb3d3c10be5859488f59/models/imagebind_model.py#L421

modality_postprocessors[ModalityType.VISION] = Normalize(dim=-1)

but not temperature-scaled.
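
For context, Normalize(dim=-1) in imagebind_model.py is plain L2 normalization along the feature dimension, roughly equivalent to:

    import torch
    import torch.nn.functional as F

    x = torch.randn(1, 1024)       # stand-in for a raw vision embedding
    x_l2 = F.normalize(x, dim=-1)  # what Normalize(dim=-1) applies
    print(x_l2.norm(dim=-1))       # tensor([1.]): unit L2 norm

so every vision embedding ends up with unit length.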
Is there a reason why you skip normalization in your implementation?

        if image is not None:
            Image.fromarray(image).save('tmp.png')
            embeddings = model.forward({
                imagebind.ModalityType.VISION: imagebind.load_and_transform_vision_data(['tmp.png'], device),
            }, normalize=False)
            image_embeddings = embeddings[imagebind.ModalityType.VISION]
            os.remove('tmp.png')

Thank you for your time!

@Zeqiang-Lai
Owner

It was obtained via trial and error. I didn't dive into the theory too much due to time constraints.

@bakachan19
Author

Oh, I see.
Thanks.
