Additional SYCL USM (device pointer explicit copy) and CUDA tuning for DOT #122
Conversation
Thanks for this @lfmeadow - it's really interesting that the buffers/accessors add a small overhead. I'm a bit worried about that! Did you try the SYCL 2020 reduction API with USM for the dot kernel? I'm asking because I'd like to use that API instead of the 1.2.1-style manual implementation if we can.
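For reference, a minimal sketch of what a SYCL 2020 reduction-based dot with USM might look like (this is an illustration, not the BabelStream code; `q`, `d_a`, `d_b`, and `array_size` are placeholder names):

```cpp
#include <sycl/sycl.hpp>

// Sketch only: assumes d_a and d_b are USM device allocations of length
// array_size, and q is the queue they were allocated against.
double dot(sycl::queue &q, const double *d_a, const double *d_b,
           size_t array_size)
{
  // Shared USM scalar to receive the reduced value on the host.
  double *sum = sycl::malloc_shared<double>(1, q);
  *sum = 0.0; // sycl::reduction combines with the existing value by default

  q.submit([&](sycl::handler &h) {
    h.parallel_for(sycl::range<1>{array_size},
                   sycl::reduction(sum, sycl::plus<double>()),
                   [=](sycl::id<1> i, auto &acc) {
                     acc += d_a[i] * d_b[i];
                   });
  }).wait();

  const double result = *sum;
  sycl::free(sum, q);
  return result;
}
```

With this style the runtime, not the programmer, picks the work-group size and the reduction strategy, which is exactly why the comparison against the manual 1.2.1-style implementation matters.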
I just eyeballed the PTX; I should look more carefully to see where the extra instructions are coming from. There were definitely a lot of parameters - maybe there's some dead-argument elimination to be done (which is not enabled for NVPTX, as I recall). I'll look at the reduction API implementation today.
I ran on AMD MI100, both on "Spock" at ORNL and on one of the nodes at ANL JLSE. The SYCLUSM penalty vs. HIP is similar to CUDA: under 1%, except for Dot, which is 3.08% worse. I'll try the SYCL reduction next. Here are the numbers for HIP and SYCLUSM on Spock. HIP uses a TBSIZE of 256 for everything (it doesn't affect anything but Dot). SYCLUSM uses 256 only for Dot; the others use the default, which turns out to be 1024.
I ran SYCL2020 on A100, just the vanilla version with no USM changes.
SYCL2020 with a redone dot kernel (but not USM) doesn't do quite as well as the original SYCL version on dot on A100:
Thanks for all of this @lfmeadow - this is great stuff. It's good to see which values for work-group size are working well in general, too. I'm worried the SYCL 2020 version was slower than the 1.2.1 version: there shouldn't be any major changes beyond syntactic sugar (the CTAD accessors, etc.), so it should not affect performance…

Glad to see you got the … working. I'd hope that DPC++ would be able to incorporate these heuristics without requiring programmers to use …. Hopefully there will be a feature one day where we can suggest a work-group size for ….
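For context, the portable way to pin a work-group size today is to spell out an `nd_range` explicitly. A rough sketch, assuming `array_size` is a multiple of the chosen size and `d_a`, `d_b`, `d_c` are placeholder device pointers:

```cpp
// Sketch only: pins the work-group size for a simple kernel. 256 is the
// value that worked well for Dot in the experiments above.
constexpr size_t wgsize = 256;
q.submit([&](sycl::handler &h) {
  h.parallel_for(
      sycl::nd_range<1>{sycl::range<1>{array_size}, sycl::range<1>{wgsize}},
      [=](sycl::nd_item<1> item) {
        const size_t i = item.get_global_id(0);
        d_c[i] = d_a[i] * d_b[i]; // placeholder kernel body
      });
});
```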
Yes, I need to revisit SYCL 2020 vs. the previous version without the USM and be a little more rigorous.

On the reduction, apparently it uses this: …

On SYCL, SYCLUSM, and CUDA: …

If the kernels were fatter, the overhead wouldn't be so bad; picking a smaller number of grid blocks (like we did for Dot) and looping over them in the kernel would probably measurably decrease the overhead.
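The "fewer blocks, loop inside the kernel" idea is the classic grid-stride pattern. A sketch in CUDA, mirroring the shape of a shared-memory dot kernel (`TBSIZE` and the launch grid size are the tuning knobs; none of this is the exact BabelStream source):

```cpp
// Sketch of a grid-stride dot kernel: launch a small, fixed grid and let
// each thread loop over the array, so per-block overhead is amortised.
#define TBSIZE 256

__global__ void dot_kernel(const double *a, const double *b,
                           double *block_sums, int n)
{
  __shared__ double smem[TBSIZE];
  double sum = 0.0;

  // Each thread strides across the array; any problem size is covered
  // by however many blocks we chose to launch.
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
       i += gridDim.x * blockDim.x)
    sum += a[i] * b[i];
  smem[threadIdx.x] = sum;

  // Standard shared-memory tree reduction within the block; the host
  // sums the per-block partials afterwards.
  for (int offset = blockDim.x / 2; offset > 0; offset /= 2) {
    __syncthreads();
    if (threadIdx.x < offset)
      smem[threadIdx.x] += smem[threadIdx.x + offset];
  }
  if (threadIdx.x == 0)
    block_sums[blockIdx.x] = smem[0];
}
```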
If you know that the indices are 32-bit, you can try the ….
The CUDA update here is somewhat fixed by 092ee67, which sets the number of thread blocks to 1024. If we think the approach here of querying the device produces better CUDA code, we can use that instead. @jeffhammond - what would you recommend?

The HIP code has been updated by AMD in #139. They took a different approach, but as it came directly from AMD I'm inclined to trust their heuristic for selecting a good size.

The USM SYCL version is still interesting, and we need to add a version that does that.
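For concreteness, a hedged sketch of the device-query alternative to hard-coding 1024 blocks; the blocks-per-SM multiplier here is an assumed placeholder heuristic, not a value from this PR:

```cpp
// Derive the dot grid size from the SM count instead of a fixed 1024.
int device = 0;
cudaGetDevice(&device);
int num_sms = 0;
cudaDeviceGetAttribute(&num_sms, cudaDevAttrMultiProcessorCount, device);
// 4 blocks per SM is a placeholder; the right multiplier is per-device.
const int dot_num_blocks = num_sms * 4;
```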
I'll run on a bunch of devices and see what the sensitivity is.
I've got a SYCL2020 USM version implemented in 3f7bb63; will try to merge at some point.
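For anyone following along, the "device pointer explicit copy" style in the PR title amounts to something like this rough sketch (explicit allocation and copies in place of the buffer/accessor machinery; `h_a` is assumed to be a host-side `std::vector<double>`):

```cpp
// Sketch of the USM device-pointer style: explicit allocation, explicit
// host<->device copies, no buffers or accessors.
double *d_a = sycl::malloc_device<double>(array_size, q);
q.memcpy(d_a, h_a.data(), array_size * sizeof(double)).wait();

// ... kernels take the raw device pointer directly ...

q.memcpy(h_a.data(), d_a, array_size * sizeof(double)).wait();
sycl::free(d_a, q);
```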
This aligns with the approach implemented in other models (SYCL 1.2.1 and HIP). Cherry-picks the CUDA updates from lmeadows in #122.
The CUDA change has now been merged manually into ….
Next stop is AMD :)