Additional SYCL USM (device pointer explicit copy) and CUDA tuning for DOT #122

Closed
wants to merge 7 commits

Conversation

@lfmeadow commented Feb 16, 2022

1. SYCL performance on the NVIDIA A100 is currently 2-3% worse than native CUDA. Inspection of the PTX generated by SYCL shows extra parameters and instructions due to accessors and buffers. I added `syclusm`, which uses `malloc_device` and explicit `queue->memcpy`; this reduces the performance gap to around 1%, with the worst case being Dot at 1.19% (see the sketch after this list).
2. The CUDA implementation of Dot is not optimal; the number of thread blocks should be a multiple of the number of SMs (note that the SYCL versions already do this). I picked the same number as SYCL (4 * the number of SMs) and CUDA Dot performance improves by 7.93%.

Next stop is AMD :)
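
For concreteness, here is a minimal sketch of the pattern item 1 describes, assuming illustrative names (`USMStream`, `d_a`, `dot_num_groups`) rather than the PR's actual code:

```cpp
#include <sycl/sycl.hpp>
#include <vector>

// Sketch: device USM allocation with explicit copies, so kernels receive
// plain pointers instead of buffer/accessor pairs.
template <class T>
struct USMStream
{
  sycl::queue q;   // default-selected device
  size_t n;
  T *d_a;

  explicit USMStream(size_t array_size) : n(array_size)
  {
    d_a = sycl::malloc_device<T>(n, q);             // device-only allocation
  }
  ~USMStream() { sycl::free(d_a, q); }

  void write(const std::vector<T> &a)
  {
    q.memcpy(d_a, a.data(), n * sizeof(T)).wait();  // explicit host-to-device copy
  }

  void read(std::vector<T> &a)
  {
    q.memcpy(a.data(), d_a, n * sizeof(T)).wait();  // explicit device-to-host copy
  }

  // Grid-size heuristic from item 2: work-groups = 4 x compute units (SMs/CUs).
  size_t dot_num_groups() const
  {
    return 4 * q.get_device().get_info<sycl::info::device::max_compute_units>();
  }
};
```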

@tomdeakin (Contributor)

Thanks for this @lfmeadow - it's really interesting that buffers/accessors add a small overhead. I'm a bit worried about that!

Did you try the SYCL 2020 reduction API with USM for the dot kernel? I'm asking because I'd like to use that API instead of the 1.2.1-style manual implementation if we can.

@tomdeakin changed the base branch from main to develop on February 16, 2022 at 14:40
@lfmeadow (Author)

I just eyeballed the PTX; I should look more carefully to see where the extra instructions are coming from. There were definitely a lot of parameters; maybe there's some dead-argument elimination to be done (which is not enabled for NVPTX, as I recall).

I'll look at the reduction API implementation today.

@lfmeadow (Author)

I ran on AMD MI100, both "spock" at ORNL and one of the nodes at ANL JLSE.
(As an aside, the JLSE nodes are a tiny bit faster on this and other benchmarks; no idea why.)
Initially Dot was lagging the others by quite a lot, even after adjusting the number of blocks to be a multiple of the number of CUs: specifically, 14.7% slower with HIP and 17.2% slower with SYCLUSM.
Then I reread some older AMD training slides (https://www.olcf.ornl.gov/wp-content/uploads/2019/10/ORNL_Application_Readiness_Workshop-AMD_GPU_Basics.pdf), which have a good discussion of occupancy. The GCN GPUs can have up to 40 active wavefronts per CU (10 per SIMD unit), so one condition for maximizing occupancy is that the number of (64-lane) waves in a workgroup must divide 40. Both HIP and SYCL were using 1024. I tried 512, which didn't help much, and then 256, and voila! Dot bandwidth is now only 2.34% below the lowest non-reduction bandwidth for HIP.
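
To spell out the occupancy arithmetic: a 1024-thread workgroup is 16 wavefronts, and since 16 does not divide 40, at most two such groups (32 of the 40 wave slots) can be resident per CU; a 256-thread workgroup is 4 wavefronts, so 10 resident groups fill all 40 slots.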

The SYCLUSM penalty vs. HIP is similar to CUDA: under 1%, except for Dot, which is 3.08% worse.

I'll try the SYCL reduction next.

Here are the numbers for HIP and SYCLUSM on Spock. HIP uses a TBSIZE of 256 for everything (this only affects Dot). SYCLUSM uses 256 only for Dot; the others use the default, which turns out to be 1024.
(Pretty funny: I cut and pasted this from O365 Excel and it came out as HTML.)

hip

| function | num_times | n_elements | sizeof | max_mbytes_per_sec | min_runtime | max_runtime | avg_runtime |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Copy | 1000 | 33554432 | 8 | 1.01E+06 | 0.000532927 | 0.000634115 | 0.000534976 |
| Mul | 1000 | 33554432 | 8 | 1.01E+06 | 0.000531393 | 0.00058859 | 0.000533475 |
| Add | 1000 | 33554432 | 8 | 986756 | 0.000816115 | 0.000865828 | 0.000819944 |
| Triad | 1000 | 33554432 | 8 | 986670 | 0.000816186 | 0.000867171 | 0.000819913 |
| Dot | 1000 | 33554432 | 8 | 963618 | 0.000557141 | 0.0005667 | 0.000561678 |

syclusm

| function | num_times | n_elements | sizeof | max_mbytes_per_sec | min_runtime | max_runtime | avg_runtime |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Copy | 1000 | 33554432 | 8 | 1.00E+06 | 0.000536222 | 0.000595593 | 0.000537737 |
| Mul | 1000 | 33554432 | 8 | 1.00E+06 | 0.0005346 | 0.000582999 | 0.00053611 |
| Add | 1000 | 33554432 | 8 | 982390 | 0.000819742 | 0.000875025 | 0.000826572 |
| Triad | 1000 | 33554432 | 8 | 979027 | 0.000822558 | 0.000877611 | 0.00082778 |
| Dot | 1000 | 33554432 | 8 | 933974 | 0.000574824 | 0.00060448 | 0.000583949 |

@lfmeadow (Author)

I ran SYCL2020 on the A100, just the vanilla version with no USM changes.
Dot is only 1198 GB/sec; I was getting 1290 GB/sec with the SYCL version (not USM), and 1340 with my USM version.
I have not converted SYCL2020 to USM.
I can't see how to use the SYCL2020 reduction together with specifying the number of workgroups and the workgroup size. I suppose I could specify an nd_range and then have a loop like the older version, adding to the reduction variable, but I'm not sure the reduction machinery will honor this.
I'm asking around :)

@lfmeadow (Author) commented Feb 21, 2022

SYCL2020 with a redone dot kernel (but not USM) doesn't do quite as well as the original SYCL version for dot on the A100: 1235 GB/sec vs. 1292 GB/sec, and 1339 GB/sec with the tuned SYCLUSM version.
This will require some investigation.
Here's the rewritten SYCL2020 dot function:

```cpp
template <class T>
T SYCLStream<T>::dot()
{
  queue->submit([&](sycl::handler &cgh)
  {
    sycl::accessor ka {d_a, cgh, sycl::read_only};
    sycl::accessor kb {d_b, cgh, sycl::read_only};

    size_t N = array_size;
    cgh.parallel_for(
      sycl::nd_range<1>(dot_num_groups*dot_wgsize, dot_wgsize),
      // Reduction object, to perform summation - initialises the result to zero
      sycl::reduction(d_sum, cgh, std::plus<T>(), sycl::property::reduction::initialize_to_identity{}),
      [=](sycl::nd_item<1> item, auto& sum)
      {
        size_t i = item.get_global_id(0);
        size_t global_size = item.get_global_range()[0];
        for (; i < N; i += global_size)
          sum += ka[i] * kb[i];
      });
  });

  // Get access on the host, and return a copy of the data (single number).
  // This will block until the result is available, so no need to wait on the queue.
  sycl::host_accessor result {d_sum, sycl::read_only};
  return result[0];
}
```

@tomdeakin (Contributor)

Thanks for all of this @lfmeadow - this is great stuff.

It's good to see what values for work-group size are working well in general too.

I'm worried that the SYCL 2020 version was slower than the 1.2.1 version. There shouldn't be any major changes beyond syntactic sugar (the CTAD accessors, etc.), so it should not affect performance...

Glad to see you got the sycl::reduction working with nd_range.
Hopefully we'll be able to get sycl::reduction just as fast as a manual implementation of the reduction - the implementation could just use this algorithm. Do you know what algorithm it's using?

I'd hope that DPC++ would be able to incorporate these heuristics without requiring programmers to use nd_range when they don't need it - @Pennycook mentioned this earlier: #83 (comment)

Hopefully there will one day be a feature where we can suggest a work-group size for range without switching to nd_range, which has a different model of parallelism.

@lfmeadow (Author) commented Feb 28, 2022

Yes, I need to revisit SYCL-2020 vs. the previous version without the USM and be a little more rigorous.

On the reduction, apparently it uses this:
https://github.com/intel/llvm/blob/8213321ebb90110bf4f3d04fa0dc8e131a464a19/libclc/ptx-nvidiacl/libspirv/group/collectives.cl#L263
I note that the subgroup reduction does not use a binary-tree reduction; I suppose that isn't valid for all possible reduction operators (not sure).
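
For reference, a binary-tree sub-group reduction looks something like the sketch below (my own sketch, not the libclc code; as written it relies on the operator being associative and commutative, which is why it can't be used unconditionally):

```cpp
#include <sycl/sycl.hpp>

// Sketch: binary-tree reduction within a sub-group using SYCL shuffles.
// Halving the offset each step gives log2(width) adds instead of a linear scan.
template <typename T>
T subgroup_tree_sum(sycl::sub_group sg, T val)
{
  for (size_t offset = sg.get_local_range()[0] / 2; offset > 0; offset /= 2)
    val += sycl::shift_group_left(sg, val, offset);
  return val;  // lane 0 now holds the sub-group sum
}
```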

On SYCL, SYCLUSM, and CUDA:
I compared the PTX code for Add from SYCL, SYCLUSM, and CUDA.
The SYCL kernel has 6 parameters vs. the 3 in SYCLUSM and CUDA.
CUDA has 14 SASS instructions, SYCLUSM has 19, and SYCL has 31.
The extra SYCLUSM instructions are due to worse codegen and also to converting the global index to 64 bits (arguably semantically necessary in SYCL but not in CUDA).
There are 3 extra SYCL instructions per accessor: ld.param, shift, and add.s64.

If the kernels were fatter, the overhead wouldn't be so bad; picking a smaller number of grid blocks (as we did for Dot) and looping over the elements in the kernel would probably measurably decrease the overhead (see the sketch below).
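
As an illustration, an Add kernel written that way, in the same style and with the same member names as the dot() above; `num_groups` and `wgsize` are assumed tuning parameters, not the PR's actual code:

```cpp
// Sketch: launch a small, fixed number of work-groups and stride over the
// array, amortising per-work-item setup cost over many elements.
queue->submit([&](sycl::handler &cgh)
{
  sycl::accessor ka {d_a, cgh, sycl::read_only};
  sycl::accessor kb {d_b, cgh, sycl::read_only};
  sycl::accessor kc {d_c, cgh, sycl::write_only};

  size_t N = array_size;
  cgh.parallel_for(
    sycl::nd_range<1>(num_groups * wgsize, wgsize),
    [=](sycl::nd_item<1> item)
    {
      size_t stride = item.get_global_range()[0];
      for (size_t i = item.get_global_id(0); i < N; i += stride)
        kc[i] = ka[i] + kb[i];
    });
});
```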

@Pennycook

> The extra SYCLUSM instructions are due to worse codegen and also to converting the global index to 64 bits (arguably semantically necessary in SYCL but not in CUDA).

If you know that the indices are 32-bit, you can try the -fsycl-id-queries-fit-in-int option. It basically wraps all of the functions in the sycl::id class that would return a 64-bit index with an assumption that the value fits in an int.
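
(Illustrative usage, not taken from this PR: with DPC++ the option is passed at compile time alongside the usual SYCL flags, e.g. `clang++ -fsycl -fsycl-id-queries-fit-in-int main.cpp`.)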

@tomdeakin mentioned this pull request on Aug 5, 2022
@tomdeakin (Contributor)

The CUDA update here is somewhat addressed by 092ee67, which sets the number of thread blocks to 1024. If we think the approach here of querying the device produces better CUDA code, we can use that instead. @jeffhammond - what would you recommend?

The HIP code has been updated by AMD in #139. They took a different approach, but as it came directly from AMD, I'm inclined to trust the heuristic for selecting a good size.

The USM SYCL version is still interesting, and we need to add a version that does that.

@jeffhammond (Contributor)

I'll run on a bunch of devices and see what the sensitivity is.

@tom91136 (Member) commented Sep 5, 2023

I've got a SYCL2020 USM version implemented in 3f7bb63, and will try to merge it at some point.

tomdeakin added a commit that referenced this pull request on Oct 6, 2023:
This aligns with the approach implemented in other models (SYCL 1.2.1 and HIP)

Cherry-picks the CUDA updates from lmeadows in #122
@tomdeakin (Contributor)

The CUDA change has now been merged manually into develop as 9954b7d.
AMD have updated the HIP version themselves with similar changes to this.
And we now have a SYCL version with USM in develop.
Therefore I'll close this, as we now have everything included, albeit in a piecemeal way. Thanks all for the contributions!

@tomdeakin closed this on Oct 6, 2023
pranav-sivaraman pushed a commit to hpcgroup/BabelStream that referenced this pull request on Dec 7, 2023