Out of curiosity, what's the performance compared to torch.compile? #2
I have made an attempt at reproducing the results, but always take these with a grain of salt, since there may be differences in CUDA/library versions and hardware (especially the memory bandwidth difference between the 40G and 80G A100s). On one of our A100 40G cards, using model=SD1.5, batch_size=1, steps=100, I get:
- SD1.5 out of the box with torch 2.1 SDPA: ~32 it/s
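For reference, a minimal timing sketch along these lines (the model ID, prompt, and warm-up scheme are my assumptions; actual it/s will vary with hardware and library versions, as noted above):

```python
import time
import torch
from diffusers import StableDiffusionPipeline

# SD1.5 in fp16; torch >= 2.0 uses SDPA attention out of the box.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = "a photo of an astronaut riding a horse on mars"

# Warm-up run so one-time CUDA/kernel setup is excluded from the measurement.
pipe(prompt, num_inference_steps=10)

steps = 100
torch.cuda.synchronize()
start = time.perf_counter()
pipe(prompt, num_inference_steps=steps)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start
print(f"{steps / elapsed:.1f} it/s")
```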
Yes, in this doc the performance of torch.compile on HF diffusers is discussed in detail.
So it's about on par with TensorRT but a bit slower than OneFlow and this repo?
Since I have not tested either TensorRT or OneFlow on the same hardware, it's hard to compare the it/s numbers reported here directly against mine. But as a relative performance point: torch.compile on the A100 40G is about 8% slower than stable-fast, which puts it in the same ballpark as OneFlow, which is 9% slower than stable-fast.
Commercial GPUs are expensive and hard to get in my country, so testing cutting-edge ML models is not an easy task for me. Please wait.
I ran an accurate benchmark on the 4090 and 3090 myself today. Performance varies greatly across different hardware/software/platform/driver configurations. Currently the A100 is hard and expensive to rent from cloud providers in my region, so A100 benchmark results will be available when I have access to one again.

RTX 4090 (512x512, batch size 1, fp16, tcmalloc enabled)

RTX 3090 (512x512, batch size 1, fp16, tcmalloc enabled)
Thanks for the benchmarks! I wonder if the tile sizes for the generated kernels are well tuned for consumer cards. If you happen to have some more time, could you try torch.compile with mode="reduce-overhead" or mode="max-autotune"?
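For concreteness, the mode switch under discussion looks roughly like this (a sketch; the pipeline setup is an assumption on my part):

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# "reduce-overhead" uses CUDA graphs to cut kernel-launch overhead;
# "max-autotune" additionally benchmarks Triton kernel configurations
# (including tile sizes), which may matter on consumer GPUs.
pipe.unet = torch.compile(pipe.unet, mode="max-autotune", fullgraph=True)

image = pipe("a photo of an astronaut riding a horse on mars").images[0]
```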
Thanks a lot for the benchmarks! I don't happen to have any consumer cards immediately available, so it's good to see numbers on them.
In my own development environment, with 'reduce-overhead', the model just generates buggy outputs... |
I think torch.compile should work on HF diffusers: https://huggingface.co/docs/diffusers/optimization/torch2.0#a100-batch-size-1
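If I read that doc correctly, the recipe it gives is essentially a one-line change (sketch; assumes a pipeline already loaded on the GPU in fp16):

```python
# Compile only the UNet, which dominates the per-step cost.
pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
```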