
Potential bug in sd3_clip : encode_token_weights (depending on how Stability trained it) #3772

Closed
CodeExplode opened this issue Jun 18, 2024 · 1 comment

Comments


CodeExplode commented Jun 18, 2024

In the method encode_token_weights in sd3_clip.py, the empty conditioning is generated as:

out = torch.zeros((1, 77, 4096), device=comfy.model_management.intermediate_device())
pooled = torch.zeros((1, 768 + 1280), device=comfy.model_management.intermediate_device())

This seems correct for pooled, but 'out' would presumably need a sequence length of 77 * 2 = 154 (77 CLIP tokens plus 77 T5 tokens), as per the SD3 model diagram. Similarly, not zero-padding the CLIP or T5 section when one of those encoders is missing seems wrong.

Whether this is a problem depends on how Stability originally trained the model: either the unconditional generated this way (presumably from blank prompts) won't match the tensor size of a conditional that uses all three text encoders (154 tokens long), or the unconditional would need to be created as zeros matching the length of the conditional.
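
For example, something along these lines (just a rough sketch on my part, assuming the 77 + 77 = 154 layout from the diagram; the helper name is made up and this is not the actual ComfyUI code) would keep the empty conditioning the same length as a full conditional:

import torch

def empty_sd3_cond(device="cpu"):
    # Hypothetical sketch: 77 CLIP tokens (L+G concatenated on the channel
    # dim and zero-padded from 2048 to 4096) followed by 77 T5-XXL tokens.
    clip_part = torch.zeros((1, 77, 4096), device=device)
    t5_part = torch.zeros((1, 77, 4096), device=device)
    out = torch.cat([clip_part, t5_part], dim=1)           # (1, 154, 4096)
    pooled = torch.zeros((1, 768 + 1280), device=device)   # pooled CLIP-L + CLIP-G
    return out, pooled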

CodeExplode (Author) commented

On reflection, it seems you may not actually need to pad the CLIP or T5 sections when they aren't used: each section uses its own positional embeddings within its sub-range, so the model shouldn't be looking for them at specific indices.
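
To illustrate what I mean (again just a hypothetical sketch, not the actual ComfyUI code): simply concatenating whichever sections are present, without zero-padding the missing one, would look something like:

import torch

def concat_sd3_cond(clip_embed=None, t5_embed=None):
    # clip_embed: (1, 77, 4096) CLIP-L+G tokens (channel-padded to 4096), or None
    # t5_embed:   (1, 77, 4096) T5-XXL tokens, or None
    parts = [e for e in (clip_embed, t5_embed) if e is not None]
    if not parts:
        # nothing supplied: fall back to an all-zero unconditional
        return torch.zeros((1, 77, 4096))
    return torch.cat(parts, dim=1)  # sequence length 77 or 154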
