[Feature Request] Separate merge ratio, downscaling and stride per module. #44

Open
vidiotgameboss opened this issue Jun 1, 2023 · 5 comments

vidiotgameboss commented Jun 1, 2023

Currently, token merging is applied globally with a single slider. From what I've noticed, merging at different ratios per module would give more optimal results: you get another small speed boost and more control over your generations, and it can even improve them, since merging these modules sometimes has a positive effect on details such as hands and faces, or just produces more interesting results.

An example would be:
Cross-Attention merged at 0.9 (yes, it works perfectly at that ratio in my testing), downscale 2.
Attention merged at 0.3-0.6, downscale 1.
MLP merged at 0.1-0.3, downscale 1.

Downscaling could likely be pushed further if set individually per module; I haven't tested it much.
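
Purely as an illustration, the kind of per-module control being asked for could be described with a small config object like the sketch below (hypothetical: ModuleMergeConfig and its field names are not part of tomesd's current API, which only exposes a single global ratio; "downscale" is interpreted here as the max-downsample level merging is applied at):

```python
from dataclasses import dataclass

# Hypothetical per-module ToMe settings; none of these names exist in tomesd today.
@dataclass
class ModuleMergeConfig:
    ratio: float            # fraction of tokens merged away in this module
    max_downsample: int     # largest U-Net downsample level merging is applied at
    stride: tuple = (2, 2)  # sx, sy partitioning stride

# The example values from this request (mid-points of the suggested ranges):
per_module = {
    "cross_attn": ModuleMergeConfig(ratio=0.9, max_downsample=2),
    "attn":       ModuleMergeConfig(ratio=0.5, max_downsample=1),
    "mlp":        ModuleMergeConfig(ratio=0.2, max_downsample=1),
}
```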

dbolya commented Jun 2, 2023

Thank you for the suggestion! I never thought about applying different ratios to the different modules.

I wonder though, do you get any speed benefit from doing this? In my testing, the MLP and Cross-Attention contributed almost nothing to the overall time taken, so merging them wasn't useful. Though, I only tested this at 512x512, so I don't know about higher resolutions.

vidiotgameboss commented Jun 2, 2023

In my testing it did have a small performance benefit, we're talking about a 0.xx it/s to maybe 1 it/s difference here, but I think every small boost matters, no matter how insignificant.

From what I've noticed, attention has the biggest impact, MLP has the second most, and cross-attention has very little.

The main benefit would be that sometimes merging the MLP/cross-attention along with the attention can fix hands and faces, or just produce interesting variations.

So yeah, a small boost to performance I'd say, but the additional control is good to have for experimenting with generations and possibly fixing them up a bit without needing inpainting/ADetailer/tile resample.

The reason I thought of this is that cross-attention has no downsides when merged at 0.9, so you might as well merge it, whereas attention starts getting weird at about 0.6-0.7 and MLP even sooner. That, and because I've gotten very nice generations just by toying with the ratios, downscaling and stride. I thought it would be nice to be able to compress as much as I can and have individual control over each module so that I can explore further.

vidiotgameboss commented Jun 2, 2023

To update on this:

I've done a bunch more testing, and it does produce very nice generations, fixing random details, improving compositions, or simply improving faces/hands. As for the limits: not only is it possible to merge cross-attention at 0.9, you can also downscale it by x8 if desired, and the image quality degradation is often unnoticeable; in fact, it is good for variations and possibly fixing details.
MLP can be at ratio 0.1 with x8 downscale, or ratio 0.2 with x4 downscale, and still be good while once again fixing details or providing very good variations.
Attention can be at ratio 0.6 with x2 downscale, or ratio 0.7 with x1.
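
For reference, those tested presets could be written out as a plain mapping (hypothetical structure, same caveat as the earlier sketch: "downscale" is meant as the max-downsample level merging is applied at, not part of the current API):

```python
# Tested presets from this comment, expressed as a plain mapping (illustrative only).
presets = {
    "cross_attn": {"ratio": 0.9, "downscale": 8},
    "mlp":        {"ratio": 0.1, "downscale": 8},  # or ratio 0.2, downscale 4
    "attn":       {"ratio": 0.6, "downscale": 2},  # or ratio 0.7, downscale 1
}
```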

I often find success with these values, but naturally it is seed-, model-, VAE-, optimization- and GPU-dependent. If I don't like the result, I tweak the ratio/downscale a bit or generate a couple of variations, since the merging process makes generations on the same seed somewhat non-deterministic. It's too bad that these modules currently cannot be merged separately.

From what I've noticed, the degradation from higher attention compression can be fixed or mitigated by merging MLP and/or cross-attention along with it. Even with the limited control from the merging ratio and downscaling being global, I often see what would be a highly degraded generation (if only one module were merged) get fixed, and with some luck even improved, by merging the other modules; sometimes downscaling can help or produce an even better result.

I hope this can eventually be added. Not necessarily for performance, though it could help there too: merging the MLP/cross-attention (very small performance boosts on their own) can allow for higher attention merging (which is the main performance booster), since higher ratios don't degrade generations as much when the modules are merged together at the right ratios and downscales.

And this is all without even mentioning Stride. The variation possibilities are very interesting.

@vidiotgameboss vidiotgameboss changed the title [Feature Request] Separate merge control. [Feature Request] Separate merge ratio, downscaling and stride per module. Jun 2, 2023

dbolya commented Jun 5, 2023

I see, thanks for all the testing! I'm totally on board with adding finer control, but the issue is the interface. Currently, there are a lot of variables even without being able to tune parameters specific to the modules.

Actually, I think if I were to add something like this, I'd let you change the ratios at a per-block level. In the Stable Diffusion network, there are 4 layers at downsample 1, 4 layers at downsample 2, etc., and currently there's no granularity to say, "ok, the first 3 downsample-1 layers should have x% ratio, but the last should have y%", etc. If I'm going to add a "do anything" option, then might as well go all the way, right?

What do you think about an optional "config" string that would let you set the parameters as granularly as you want? For instance:

1-50%-2x2-a, 1-90%-4x4-x, 1-10%-2x2-m, ...

would set layer 1's attn (a) at r=50% with a 2x2 kernel, the cross attn (x) at r=90% with a 4x4 kernel, and the mlp (m) at r=10% with a 2x2 kernel. Then you could continue to specify layer 2, 3, 4, etc. the same way. It would be verbose, but it would be very granular.

There's an option for reducing verbosity by allowing you to specify multiple layers / modules with the same "operation":

0_12_13_14-50%-2x2-ax

could set the merging ratio to r=50% for the attn and cross-attn of blocks 0, 12, 13, and 14 (which are the layers at "downsample 1").
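
A minimal sketch of how such a config string could be parsed (illustrative only; the field order and module letters follow the examples above, and all names are hypothetical):

```python
# Parse entries of the form "<layers>-<ratio%>-<SxS kernel>-<modules>",
# e.g. "0_12_13_14-50%-2x2-ax, 1-10%-2x2-m".
def parse_tome_config(cfg: str):
    module_names = {"a": "attn", "x": "cross_attn", "m": "mlp"}
    settings = {}  # (block_index, module) -> {"ratio": float, "sx": int, "sy": int}
    for entry in cfg.split(","):
        entry = entry.strip()
        if not entry:
            continue
        layers, ratio, kernel, modules = entry.split("-")
        sx, sy = (int(k) for k in kernel.lower().split("x"))
        r = float(ratio.rstrip("%")) / 100.0
        for layer in layers.split("_"):
            for mod in modules:
                settings[(int(layer), module_names[mod])] = {"ratio": r, "sx": sx, "sy": sy}
    return settings

# Example from above: attn and cross-attn of blocks 0, 12, 13, 14 at r=0.5 with a 2x2 kernel.
print(parse_tome_config("0_12_13_14-50%-2x2-ax, 1-10%-2x2-m"))
```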

Would that be useful? Idk how many people would use that lol.

vidiotgameboss commented

Yeah, that would be great if implemented, and I think a lot of people who experiment with settings in general would use it. I personally tend to do that because I like finding optimal configurations.
