Add support for reduce intents and expressions on GPU #24787

Merged on Apr 25, 2024 (61 commits)

@e-kayrakli e-kayrakli commented Apr 5, 2024

This PR adds support for +, min, and max reductions in GPU-eligible loops. Both the intent-based and the expression-based forms work:

var sum: int;
forall i in 1..10 with (+ reduce sum) { }

and

+ reduce MyGpuArr

Both now execute as kernels on the GPU.

Background

In Chapel, reduce expressions are lowered to foralls with reduce intents early in compilation. As a result, by the time GPU transformations run, there are only foralls with reduce intents. The implementation here therefore targets forall loops with reduce intents, and support for reduce expressions comes for free.

Implementation

Overview

We have been using CUB/hipCUB as the backend for device-wide reductions for a couple of releases now. The implementation here relies on the same libraries' support for block-wide reductions. For a reduction variable redVar, the generated kernel looks like:

void kernel(..., int redVar, int* redVarBuffer, ...) {
  // usual stuff

  chpl_gpu_dev_breduce(redVar, redVarBuffer);
}

  • redVarBuffer is allocated and maintained by the runtime.
    • At the end of kernel execution, it stores one element per block: the reduction of all the values contributed by the threads within that block.
    • It is then reduced by the already-existing device-wide reductions to produce the final result.
  • The newly added device runtime function chpl_gpu_dev_breduce uses CUB/hipCUB to perform the block-wide reduction at the end of the kernel.
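
The two-stage scheme described above can be modeled on the host. This is only a minimal sketch under stated assumptions, not the runtime's code: `blockReduceSum` and `twoStageSum` are hypothetical names, plain loops stand in for GPU threads and blocks, and it assumes that threads whose index falls outside the loop bounds contribute the identity value (0 for +) rather than skipping the block-wide reduction.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical host-side model of a block-wide sum: every simulated thread of
// one block contributes a value, and the block stores a single partial result
// in its own slot of the buffer.
void blockReduceSum(const std::vector<long>& threadVals,
                    std::size_t blockIdx,
                    std::vector<long>& redVarBuffer) {
  assert(blockIdx < redVarBuffer.size());
  redVarBuffer[blockIdx] =
      std::accumulate(threadVals.begin(), threadVals.end(), 0L);
}

// Hypothetical model of the whole pipeline for a `+` reduction over `data`.
long twoStageSum(const std::vector<long>& data, std::size_t blockSize) {
  if (data.empty() || blockSize == 0) return 0;

  const std::size_t numBlocks = (data.size() + blockSize - 1) / blockSize;
  std::vector<long> redVarBuffer(numBlocks, 0);  // one element per block

  // Stage 1: each block reduces the values of its own threads.
  for (std::size_t b = 0; b < numBlocks; ++b) {
    std::vector<long> threadVals(blockSize, 0);  // 0 is the identity for +
    for (std::size_t t = 0; t < blockSize; ++t) {
      const std::size_t i = b * blockSize + t;
      if (i < data.size()) threadVals[t] = data[i];  // in-bounds threads only
    }
    // Every thread of the block takes part, in bounds or not (assumption).
    blockReduceSum(threadVals, b, redVarBuffer);
  }

  // Stage 2: reduce the per-block partials to the final result.
  return std::accumulate(redVarBuffer.begin(), redVarBuffer.end(), 0L);
}
```

With ten elements and a block size of 4, the model uses three blocks, and the last block has two out-of-bounds threads that only contribute zeros.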

Details

  • Changes the kernel structure to avoid "early returns". Now a kernel looks like:
void kernel() {
  // index computation
  if (in_bounds) {
    // this block is `userBody_`
  }
  {
    // this block is `postBody_` (note that "epilogue" 
    // has a more general meaning in the compiler already)
  }
  return;  // epilogue
}
  • This is because every thread in a block needs to call chpl_gpu_dev_breduce*. In an earlier implementation, I tried duplicating the calls at the end of the kernel body and in the early-return block, but hipCUB couldn't handle the resulting thread divergence, so I had to steer away from early returns. Now, chpl_gpu_dev_breduce calls are added to postBody_.
  • Adds a class KernelArg in gpuTransforms for easier representation of kernel actual/formal pairs.
  • Following from previous efforts, this PR leans more towards using the kernel_cfg type in the runtime. As such, it:
    • passes the number of reduction variables into the kernel configuration
    • removes the 1D vs. 3D distinction of the kernel from the compiler/runtime interface:
      • The runtime's kernel launch interface is now just chpl_gpu_kernel_launch. In other words, there's no chpl_gpu_kernel_launch_flat anymore.
      • chpl_gpu_kernel_launch takes only a kernel_cfg argument, so kernel_cfg should contain all the information necessary to launch a kernel.
      • Moves the compiler to use a single PRIM_GPU_KERNEL_LAUNCH primitive for all launches. In other words, there's no PRIM_GPU_KERNEL_LAUNCH_FLAT anymore. PRIM_GPU_KERNEL_LAUNCH takes a single argument, the kernel_cfg.
      • Instead, we now have PRIM_GPU_INIT_KERNEL_CFG_3D for 3D launches.
  • PRIM_GPU_BLOCK_REDUCE implements the chpl_gpu_dev_breduce in the snippet above.
  • This is how we pass a reduction variable to the kernel/runtime:
    • We use PRIM_GPU_ARG with a new argument kind.
    • For each reduction variable, the compiler generates a "final reduction wrapper": a compiler-generated function that wraps the runtime call for the final reduction. We need this function so that we can pass it to the runtime. Because the runtime is type-agnostic, it needs to be told which function to call to finalize a reduction.
    • This function is passed to PRIM_GPU_ARG, and then is passed as a function pointer to the runtime.
    • We also add an argument for a buffer per reduction variable. This is redVarBuffer in the snippet above.
  • Device functions for block-wide reductions are in runtime/include/gpu/{nvidia,amd}/chpl-gpu-dev-reduce.h.
  • The kernel_cfg type has been extended to handle the number of threads, grid dimensions, kernel names and, of course, reduction variables.
  • reduce_var is the type that represents reduction arguments to kernels.
  • Among other things, chpl_gpu_arg_reduce is where a reduction variable is added to a kernel_cfg, and cfg_finalize_reductions is what's called after the kernel itself to perform the final reduction(s).
  • We now compile the application with --std=c++14. This was necessary for hipCUB.
  • test/gpu/native/basics/reductionNoGpuize.chpl is removed as it doesn't make much sense anymore.
  • __primitive based tests are adjusted for the new kernel launch interface.
  • More tests are added to gpu/native/reduction
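
The function-pointer handoff described in the list above can be sketched as follows. This is a hedged illustration, not the actual runtime interface: `FinalReduceFn`, `ReduceVar`, `KernelCfg`, `addReduceArg`, `finalizeReductions` and the two wrappers are hypothetical stand-ins for reduce_var, kernel_cfg, chpl_gpu_arg_reduce, cfg_finalize_reductions and the compiler-generated final reduction wrappers, and their real signatures may differ. The point it shows is that the type-agnostic side only stores and calls pointers, while each wrapper knows the element type and the operator.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical shape of a "final reduction wrapper": it knows the element
// type and the operator, so the code that calls it does not have to.
using FinalReduceFn = void (*)(const void* buffer, std::size_t numBlocks,
                               void* outer);

// Hypothetical stand-in for the runtime's reduce_var.
struct ReduceVar {
  void* outer;            // variable that receives the final result
  void* buffer;           // one partial result per block
  FinalReduceFn wrapper;  // type-specific finalizer
};

// Hypothetical stand-in for the reduction-related part of kernel_cfg.
struct KernelCfg {
  std::size_t numBlocks;
  std::vector<ReduceVar> reduceVars;
};

// Registers a reduction variable with the configuration.
void addReduceArg(KernelCfg& cfg, void* outer, void* buffer,
                  FinalReduceFn wrapper) {
  assert(wrapper != nullptr);
  cfg.reduceVars.push_back(ReduceVar{outer, buffer, wrapper});
}

// Runs after the kernel: type-agnostic, it only forwards untyped pointers.
void finalizeReductions(const KernelCfg& cfg) {
  for (const ReduceVar& rv : cfg.reduceVars)
    rv.wrapper(rv.buffer, cfg.numBlocks, rv.outer);
}

// Example wrapper for a `+` reduction over long values.
void sumLongWrapper(const void* buffer, std::size_t numBlocks, void* outer) {
  const long* partials = static_cast<const long*>(buffer);
  long total = 0;
  for (std::size_t b = 0; b < numBlocks; ++b) total += partials[b];
  *static_cast<long*>(outer) = total;
}

// Example wrapper for a `max` reduction over double values.
void maxDoubleWrapper(const void* buffer, std::size_t numBlocks, void* outer) {
  if (numBlocks == 0) return;  // nothing to reduce
  const double* partials = static_cast<const double*>(buffer);
  double best = partials[0];
  for (std::size_t b = 1; b < numBlocks; ++b)
    if (partials[b] > best) best = partials[b];
  *static_cast<double*>(outer) = best;
}
```

A caller would register one entry per reduction variable and call `finalizeReductions` once after the kernel completes, which is why the number of reduction variables has to be part of the configuration.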

Future work

Test

  • nvidia
  • amd
  • cpu

@e-kayrakli e-kayrakli requested a review from stonea April 9, 2024 23:38
Signed-off-by: Engin Kayraklioglu <e-kayrakli@users.noreply.github.com>
@e-kayrakli e-kayrakli merged commit f0797be into chapel-lang:main Apr 25, 2024
7 checks passed
@e-kayrakli e-kayrakli deleted the gpu-reduce-intent branch April 25, 2024 18:09
@bradcray
Member

Very exciting! I had no idea this was this close to being accomplished/able!

e-kayrakli added a commit that referenced this pull request May 28, 2024
Arkouda+GPU compilation testing suffered as fallout from
#24787. The issues were
twofold:

- Compiler segfault: In Arkouda, some order-independent loops are
subject to what we call "late gpuization failure". In such cases, we
generate a kernel, but then remove it from the AST. Removing a function
from AST doesn't remove it from `gFnSymbols`. Scrubbing `gFnSymbols` et
al. happens in between passes and removes such functions. However,
`gpuTransforms` are technically part of LICM. After `gpuTransforms`
there are some LICM operations (that also seem unrelated to LICM,
unfortunately) that iterate over `gFnSymbols`. One such operation would
get the removed function from the AST that would cause segfaults down
the road. As a solution, this PR uses `for_alive_in_Vec` in the
problematic section of the code to avoid the issue. Note that I couldn't
reproduce this issue outside of Arkouda.

- Linkage issues: fixing that exposed other linkage issues b/c Arkouda
uses reductions on bools. It felt safe to just support those. So, this
PR adds the ability to do reduction on bools.

[Reviewed by @DanilaFe]

#### Test:
- [x] arkouda compiles locally (well, there is another issue, but that
needs to be fixed on Arkouda side. See
Bears-R-Us/arkouda#3236)
- [x] nvidia
- [x] amd
- [x] standard linux64