Out of curiosity, what's the performance compared to torch.compile? #2

Closed · Chillee opened this issue Oct 17, 2023 · 8 comments

@Chillee commented Oct 17, 2023

I think torch.compile should work on HF diffusers: https://huggingface.co/docs/diffusers/optimization/torch2.0#a100-batch-size-1
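For reference, the pattern from that guide looks roughly like this (an illustrative sketch; the model id and options below follow the usual HF docs example, not something benchmarked in this thread):

```python
# Sketch of compiling the UNet of a diffusers pipeline with torch.compile,
# following the pattern in the linked HF guide (illustrative only).
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Most of the compute is in the UNet, so compiling just that module
# captures most of the benefit.
pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)

image = pipe("a photo of an astronaut riding a horse").images[0]
```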

@isidentical

I have made an attempt at reproducing the results, but always take these with a grain of salt since there might be differences in CUDA/library versions/hardware (especially the memory bandwidth of the 40G and 80G variants being different), etc.

On one of our A100 40G cards, using model=SD1.5, batch_size=1, steps=100, I get:

-> SD1.5 out of the box with torch 2.1 SDPA is ~32 it/s
-> SD1.5 + torch.compile is ~51 it/s
-> SD1.5 + stable-fast is ~55 it/s
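
For context on how an it/s figure like this can be measured, here is a minimal timing sketch (illustrative; the numbers above may simply come from the diffusers progress bar, and the model id is an assumption):

```python
# Rough it/s measurement sketch (illustrative, not the exact script used above).
import time
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = "a photo of an astronaut riding a horse"
steps = 100

# Warmup run: triggers lazy initialization / any compilation first.
pipe(prompt, num_inference_steps=10)

torch.cuda.synchronize()
start = time.perf_counter()
pipe(prompt, num_inference_steps=steps)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

# Rough denoising iterations per second (includes text-encoder/VAE overhead).
print(f"~{steps / elapsed:.1f} it/s")
```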

@chengzeyi (Owner)

> I have made an attempt at reproducing the results, but always take these with a grain of salt since there might be differences in CUDA/library versions/hardware (especially the memory bandwidth of the 40G and 80G variants being different), etc.
>
> On one of our A100 40G cards, using model=SD1.5, batch_size=1, steps=100, I get:
>
> -> SD1.5 out of the box with torch 2.1 SDPA is ~32 it/s
> -> SD1.5 + torch.compile is ~51 it/s
> -> SD1.5 + stable-fast is ~55 it/s

Yes, the performance of torch.compile on HF diffusers is discussed in detail in this doc:

https://huggingface.co/docs/diffusers/optimization/torch2.0

@Chillee (Author) commented Oct 18, 2023

So it's about on par with TensorRT but a bit slower than OneFlow and this repo?

@isidentical

> So it's about on par with TensorRT but a bit slower than OneFlow and this repo?

Since I have not tested either TensorRT or OneFlow on the same hardware, it's hard to compare the reported it/s numbers here directly against mine. But as a relative performance point, torch.compile on the A100 40G is about 8% slower than stable-fast (~51 vs ~55 it/s), which puts it in the same ballpark as OneFlow, which is reported as 9% slower than stable-fast.

@chengzeyi (Owner)

> So it's about on par with TensorRT but a bit slower than OneFlow and this repo?
>
> Since I have not tested either TensorRT or OneFlow on the same hardware, it's hard to compare the reported it/s numbers here directly against mine. But as a relative performance point, torch.compile on the A100 40G is about 8% slower than stable-fast, which puts it in the same ballpark as OneFlow, which is reported as 9% slower than stable-fast.

Commercial GPUs are expensive and hard to get in my country, and we also have strict limitations on international Internet connections here.

So testing cutting-edge ML models is not an easy task for me; please be patient.

@chengzeyi (Owner) commented Oct 20, 2023

> So it's about on par with TensorRT but a bit slower than OneFlow and this repo?
>
> Since I have not tested either TensorRT or OneFlow on the same hardware, it's hard to compare the reported it/s numbers here directly against mine. But as a relative performance point, torch.compile on the A100 40G is about 8% slower than stable-fast, which puts it in the same ballpark as OneFlow, which is reported as 9% slower than stable-fast.

I ran accurate benchmarks on the RTX 4090 and RTX 3090 myself today.

Performance varies greatly across different hardware/software/platform/driver configurations, so it is very hard to benchmark accurately, and preparing the environment for benchmarking is also a lot of work. I have tested on some platforms before, but those results may still be inaccurate.

Currently the A100 is hard and expensive to rent from cloud providers in my region, so A100 benchmark results will be available when I have access to one again.

RTX 4090 (512x512, batch size 1, fp16, tcmalloc enabled)

| Framework | SD 1.5 | SD 2.1 | SD 1.5 ControlNet |
| --- | --- | --- | --- |
| Vanilla PyTorch (2.1.0+cu118) | 24.9 it/s | 27.1 it/s | 18.9 it/s |
| torch.compile (2.1.0+cu118, NHWC UNet) | 33.5 it/s | 38.2 it/s | 22.7 it/s |
| AITemplate | 65.7 it/s | 71.6 it/s | untested |
| OneFlow | 60.1 it/s | 12.9 it/s (??) | untested |
| TensorRT | untested | untested | untested |
| Stable Fast (with xformers & triton) | 61.8 it/s | 61.6 it/s | 42.3 it/s |

RTX 3090 (512x512, batch size 1, fp16, tcmalloc enabled)

| Framework | SD 1.5 |
| --- | --- |
| Vanilla PyTorch (2.1.0+cu118) | 22.5 it/s |
| torch.compile (2.1.0+cu118, NHWC UNet) | 25.3 it/s |
| AITemplate | 34.6 it/s |
| OneFlow | 38.8 it/s |
| TensorRT | untested |
| Stable Fast (with xformers & triton) | 31.5 it/s |
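
For readers wondering what "Stable Fast (with xformers & triton)" refers to in the tables: roughly, the pipeline is compiled once with those backends enabled. The sketch below follows my recollection of the stable-fast README of this era; the module path and config attribute names are assumptions and may not match the current API exactly.

```python
# Hedged sketch only: the sfast module path and CompilationConfig attribute
# names below are assumptions based on the project's README, not verified here.
import torch
from diffusers import StableDiffusionPipeline
from sfast.compilers.stable_diffusion_pipeline_compiler import (
    CompilationConfig,
    compile,
)

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

config = CompilationConfig.Default()
config.enable_xformers = True    # use xformers attention kernels
config.enable_triton = True      # use Triton-generated kernels
config.enable_cuda_graph = True  # capture with CUDA graphs to cut launch overhead

pipe = compile(pipe, config)
image = pipe("a photo of an astronaut riding a horse", num_inference_steps=30).images[0]
```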

@Chillee (Author) commented Oct 20, 2023

Thanks for the benchmarks!

I wonder if the tile sizes for torch.compile are just very untuned for consumer hardware 🤔

If you happen to have some more time, could you try:

  1. Making sure you turn on mode="reduce-overhead" for torch.compile.
  2. Running with TORCHINDUCTOR_COORDINATE_DESCENT_TUNING=1 (see the sketch below).

Thanks a lot for the benchmarks! I don't happen to have any consumer cards immediately available, so it's good to see torch.compile performance on consumer hardware.
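
For concreteness, the two suggestions above translate to roughly the following (an illustrative sketch, not something run in this thread):

```python
# Sketch of the two torch.compile tweaks suggested above (illustrative).
import os

# 2. Enable Inductor's coordinate-descent tuning. Set the env var before
#    torch/inductor is imported so it is picked up at compile time
#    (alternatively, export it in the shell).
os.environ["TORCHINDUCTOR_COORDINATE_DESCENT_TUNING"] = "1"

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# 1. Use the CUDA-graph-backed "reduce-overhead" mode.
pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead")

image = pipe("a photo of an astronaut riding a horse").images[0]
```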

@chengzeyi (Owner) commented Oct 29, 2023

> Thanks for the benchmarks!
>
> I wonder if the tile sizes for torch.compile are just very untuned for consumer hardware 🤔
>
> If you happen to have some more time, could you try:
>
> 1. Making sure you turn on mode="reduce-overhead" for torch.compile.
> 2. Running with TORCHINDUCTOR_COORDINATE_DESCENT_TUNING=1
>
> Thanks a lot for the benchmarks! I don't happen to have any consumer cards immediately available, so it's good to see torch.compile performance on consumer hardware.

In my own development environment, with 'reduce-overhead', the model just generates buggy outputs...
