[Feature Request] Separate merge ratio, downscaling and stride per module. #44

Open
vidiotgameboss opened this issue Jun 1, 2023 · 5 comments

vidiotgameboss commented Jun 1, 2023

Currently, token merging is applied globally with a single slider. From what I've noticed, merging at different ratios per module would give more optimal results: you get another small speed boost and more control over your generations, and it can even improve them, since merging these modules sometimes has a positive effect on details such as hands and faces, or just produces more interesting results.

An example would be:
Cross-Attention merged at 0.9 (yes, it works perfectly at that ratio in my testing), downscale 2.
Attention merged at 0.3-0.6, downscale 1.
MLP merged at 0.1-0.3, downscale 1.

Downscaling could likely be pushed further if set individually per module; I haven't tested it much.
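
Purely as an illustration, the kind of per-module control being asked for could be described with a small config object like the sketch below (hypothetical: ModuleMergeConfig and its field names are not part of tomesd's current API, which only exposes a single global ratio; "downscale" is interpreted here as the max-downsample level merging is applied at):

```python
from dataclasses import dataclass

# Hypothetical per-module ToMe settings; none of these names exist in tomesd today.
@dataclass
class ModuleMergeConfig:
    ratio: float            # fraction of tokens merged away in this module
    max_downsample: int     # largest U-Net downsample level merging is applied at
    stride: tuple = (2, 2)  # sx, sy partitioning stride

# The example values from this request (mid-points of the suggested ranges):
per_module = {
    "cross_attn": ModuleMergeConfig(ratio=0.9, max_downsample=2),
    "attn":       ModuleMergeConfig(ratio=0.5, max_downsample=1),
    "mlp":        ModuleMergeConfig(ratio=0.2, max_downsample=1),
}
```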

dbolya commented Jun 2, 2023

Thank you for the suggestion! I never thought about applying different ratios to the different modules.

I wonder though, do you get any speed benefit from doing this? In my testing, the MLP and Cross-Attention contributed almost nothing to the overall time taken, so merging them wasn't useful. Though, I only tested this at 512x512, so I don't know about higher resolutions.

vidiotgameboss commented Jun 2, 2023

In my testing it did have a small performance benefit, we're talking about a 0.xx it/s to maybe 1 it/s difference here, but I think every small boost matters, no matter how insignificant.

From what I've noticed, attention has the biggest impact, MLP has the second most, and cross-attention has very little.

The main benefit would be that sometimes merging the MLP/cross-attention along with the attention can fix hands and faces, or just produce interesting variations.

So yeah, a small boost to performance I'd say, but the additional control is good to have for experimenting with generations and possibly fixing them up a bit without needing inpainting/ADetailer/tile resample.

The reason I thought of this is that cross-attention has no downsides when merged at 0.9, so you might as well merge it, whereas attention starts getting weird at about 0.6-0.7 and MLP even sooner. That, and because I've gotten very nice generations just by toying with the ratios, downscaling and stride. I thought it would be nice to be able to compress as much as I can and have individual control over each module so that I can explore further.

vidiotgameboss commented Jun 2, 2023

To update on this:

I've done a bunch more testing, and it does produce very nice generations, fixing random details, improving compositions, or simply improving faces/hands. As for the limits: not only is it possible to merge cross-attention at 0.9, you can also downscale it by x8 if desired, and the image quality degradation is often unnoticeable; in fact, it is good for variations and possibly fixing details.
MLP can be at ratio 0.1 with x8 downscale, or ratio 0.2 with x4 downscale, and still be good while once again fixing details or providing very good variations.
Attention can be at ratio 0.6 with x2 downscale, or ratio 0.7 with x1.
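
For reference, those tested presets could be written out as a plain mapping (hypothetical structure, same caveat as the earlier sketch: "downscale" is meant as the max-downsample level merging is applied at, not part of the current API):

```python
# Tested presets from this comment, expressed as a plain mapping (illustrative only).
presets = {
    "cross_attn": {"ratio": 0.9, "downscale": 8},
    "mlp":        {"ratio": 0.1, "downscale": 8},  # or ratio 0.2, downscale 4
    "attn":       {"ratio": 0.6, "downscale": 2},  # or ratio 0.7, downscale 1
}
```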

I often find success with these values, but naturally it is seed-, model-, VAE-, optimization- and GPU-dependent. If I don't like the result, I tweak the ratio/downscale a bit or generate a couple of variations, since the merging process makes generations on the same seed somewhat non-deterministic. It's too bad that these modules currently cannot be merged separately.

From what I've noticed, the degradation from higher attention compression can be fixed or mitigated by merging MLP and/or cross-attention along with it. Even with the limited control from the merging ratio and downscaling being global, I often see what would be a highly degraded generation (if only one module were merged) get fixed, and with some luck even improved, by merging the other modules; sometimes downscaling can help or produce an even better result.

I hope this can eventually be added. Not necessarily for performance, though it could help there too: merging the MLP/cross-attention (very small performance boosts on their own) can allow for higher attention merging (which is the main performance booster), since higher ratios don't degrade generations as much when the modules are merged together at the right ratios and downscales.

And this is all without even mentioning Stride. The variation possibilities are very interesting.

@vidiotgameboss vidiotgameboss changed the title [Feature Request] Separate merge control. [Feature Request] Separate merge ratio, downscaling and stride per module. Jun 2, 2023

dbolya commented Jun 5, 2023

I see, thanks for all the testing! I'm totally on board with adding finer control, but the issue is the interface. Currently, there are a lot of variables even without being able to tune parameters specific to the modules.

Actually, I think if I were to add something like this, I'd let you change the ratios at a per-block level. In the Stable Diffusion network, there are 4 layers at downsample 1, 4 layers at downsample 2, etc., and currently there's no granularity to say, "ok, the first 3 downsample-1 layers should have x% ratio, but the last should have y%", etc. If I'm going to add a "do anything" option, then might as well go all the way, right?

What do you think about an optional "config" string that would let you set the parameters as granularly as you want? For instance:

1-50%-2x2-a, 1-90%-4x4-x, 1-10%-2x2-m, ...

would set layer 1's attn (a) at r=50% with a 2x2 kernel, the cross attn (x) at r=90% with a 4x4 kernel, and the mlp (m) at r=10% with a 2x2 kernel. Then you could continue to specify layer 2, 3, 4, etc. the same way. It would be verbose, but it would be very granular.

There's an option for reducing verbosity by allowing you to specify multiple layers / modules with the same "operation":

0_12_13_14-50%-2x2-ax

could set the merging ratio to r=50% for the attn and cross-attn of blocks 0, 12, 13, and 14 (which are the layers at "downsample 1").
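
A minimal sketch of how such a config string could be parsed (illustrative only; the field order and module letters follow the examples above, and all names are hypothetical):

```python
# Parse entries of the form "<layers>-<ratio%>-<SxS kernel>-<modules>",
# e.g. "0_12_13_14-50%-2x2-ax, 1-10%-2x2-m".
def parse_tome_config(cfg: str):
    module_names = {"a": "attn", "x": "cross_attn", "m": "mlp"}
    settings = {}  # (block_index, module) -> {"ratio": float, "sx": int, "sy": int}
    for entry in cfg.split(","):
        entry = entry.strip()
        if not entry:
            continue
        layers, ratio, kernel, modules = entry.split("-")
        sx, sy = (int(k) for k in kernel.lower().split("x"))
        r = float(ratio.rstrip("%")) / 100.0
        for layer in layers.split("_"):
            for mod in modules:
                settings[(int(layer), module_names[mod])] = {"ratio": r, "sx": sx, "sy": sy}
    return settings

# Example from above: attn and cross-attn of blocks 0, 12, 13, 14 at r=0.5 with a 2x2 kernel.
print(parse_tome_config("0_12_13_14-50%-2x2-ax, 1-10%-2x2-m"))
```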

Would that be useful? Idk how many people would use that lol.

vidiotgameboss commented

Yeah, that would be great if implemented, and I think a lot of people who experiment with settings in general would use it. I personally tend to do that because I like finding optimal configurations.
