Add support for reduce intents and expressions on GPU #24787

e-kayrakli · 2024-04-05T22:06:07Z

This PR enables support for +, min and max reductions on GPU-eligible loops. Both intent-based and expression-based forms are supported:

var sum: int;
forall i in 1..10 with (+ reduce sum) { }

and

+ reduce MyGpuArr

are now supported and executed as kernels on the GPU.

Background

In Chapel, reduce expressions lower to foralls with reduce intents early in compilation. As a result, during GPU transformations, there are only foralls with reduce intents. Therefore, the implementation here is all about supporting forall loops with reduce intents, and support for reduce expressions comes for free.

Implementation

Overview

We have been using CUB/hipCUB as the backend implementation for device-wide reductions for a couple of releases now. The implementation here relies on the same libraries' support for block-wide reductions. For a reduction variable redVar, the generated kernel looks like:

void kernel(..., int redVar, int* redVarBuffer, ...) {
  // usual stuff

  chpl_gpu_dev_breduce(redVar, redVarBuffer);
}

redVarBuffer is allocated/maintained by the runtime.
- At the end of the kernel execution, it'll store one element per block, which is essentially the reduction of all the values contributed by the threads within that block.
- It will be reduced by the already-existing device-wide reductions to generate the final result.
The newly added device runtime function chpl_gpu_dev_breduce will use CUB/hipCUB to perform the block-wide reduction at the end of the kernel.

Details

Changes the kernel structure to avoid "early returns". Now a kernel looks like:

void kernel() {
  // index computation
  if (in_bounds) {
    // this block is `userBody_`
  }
  {
    // this block is `postBody_` (note that "epilogue" 
    // has a more general meaning in the compiler already)
  }
  return;  // epilogue
}

This is because we need each thread in a block to call chpl_gpu_dev_breduce*. In an earlier implementation, I tried to duplicate calls at the end of the kernel body and in the early return block. But hipCUB couldn't handle the thread divergence. So, I had to steer away from early return. Now, chpl_gpu_dev_breduce calls are added to postBody_.
Adds a class KernelArg in gpuTransforms for easier representation of kernel actual/formal pairs.
Following from previous efforts, this PR leans more towards using the kernel_cfg type in the runtime. As such, it:
- passes number of reduction variables into kernel configuration
- removes 1D vs 3D-ness of the kernel from the compiler/runtime interface:
  - The runtime's kernel launch interface is just chpl_gpu_kernel_launch, now. IOW, there's no chpl_gpu_kernel_launch_flat anymore.
  - chpl_gpu_kernel_launch takes only a kernel_cfg argument. So, kernel_cfg should contain all the info necessary to launch a kernel.
  - Moves the compiler to use a single PRIM_GPU_KERNEL_LAUNCH primitive for all launches. IOW, there's no PRIM_GPU_KERNEL_LAUNCH_FLAT anymore. PRIM_GPU_KERNEL_LAUNCH only takes a single argument, that is the kernel_cfg.
  - Instead, we now have PRIM_GPU_INIT_KERNEL_CFG_3D for 3D launches.
PRIM_GPU_BLOCK_REDUCE implements the chpl_gpu_dev_breduce in the snippet above.
This is how we pass a reduction variable to the kernel/runtime:
- We use PRIM_GPU_ARG with a new argument kind.
- For each reduction variable, the compiler generates a "final reduction wrapper". This is a compiler-generated function that wraps around the runtime call for the final reduction. We need to generate this function so that we can pass it to the runtime. As the runtime is type-agnostic, it needs to know which function to call to finalize a reduction.
- This function is passed to PRIM_GPU_ARG, and then is passed as a function pointer to the runtime.
- We also add an argument for a buffer per reduction variable. This is redVarBuffer in the snippet above.
Device functions for block-wide reductions are in runtime/include/gpu/{nvidia,amd}/chpl-gpu-dev-reduce.h
kernel_cfg type has been extended to handle number of threads, grid dimensions, kernel names and of course reduction variables.
reduce_var is the type that represents reduction arguments to kernels.
Among other things, chpl_gpu_arg_reduce is where a reduction variable is added to a kernel_cfg, and cfg_finalize_reductions is what's called after the kernel itself to perform the final reduction(s).
We now compile the application with --std=c++14. This was necessary for hipCUB.
test/gpu/native/basics/reductionNoGpuize.chpl is removed as it doesn't make much sense anymore.
__primitive based tests are adjusted for the new kernel launch interface.
More tests are added to gpu/native/reduction

Future work

Support all/more kinds of reductions on GPU #24932

Test

nvidia
amd
cpu

compiler/optimizations/gpuTransforms.cpp

test/gpu/native/reduction/basic.chpl

test/gpu/native/reduction/combos.chpl

test/gpu/native/reduction/reduceThroughput.chpl

compiler/codegen/cg-expr.cpp


Signed-off-by: Engin Kayraklioglu <e-kayrakli@users.noreply.github.com>

bradcray · 2024-04-25T18:42:44Z

Very exciting! I had no idea this was this close to being accomplished/able!

@DanilaFe

Arkouda+GPU compilation testing suffered as fallout from #24787. The issues were twofold:
- Compiler segfault: In Arkouda, some order-independent loops are subject to what we call "late gpuization failure". In such cases, we generate a kernel, but then remove it from the AST. Removing a function from the AST doesn't remove it from `gFnSymbols`. Scrubbing `gFnSymbols` et al. happens in between passes and removes such functions. However, `gpuTransforms` are technically part of LICM. After `gpuTransforms` there are some LICM operations (that also seem unrelated to LICM, unfortunately) that iterate over `gFnSymbols`. One such operation would pick up the function that had been removed from the AST, which would cause segfaults down the road. As a solution, this PR uses `for_alive_in_Vec` in the problematic section of the code to avoid the issue. Note that I couldn't reproduce this issue outside of Arkouda.
- Linkage issues: Fixing that exposed other linkage issues because Arkouda uses reductions on bools. It felt safe to just support those. So, this PR adds the ability to do reductions on bools.

[Reviewed by @DanilaFe]

Test:
- [x] arkouda compiles locally (well, there is another issue, but that needs to be fixed on the Arkouda side. See Bears-R-Us/arkouda#3236)
- [x] nvidia
- [x] amd
- [x] standard linux64

e-kayrakli requested a review from stonea April 9, 2024 23:38

stonea reviewed Apr 12, 2024

e-kayrakli commented Apr 12, 2024

compiler/codegen/cg-expr.cpp (resolved)

e-kayrakli commented Apr 12, 2024

compiler/optimizations/gpuTransforms.cpp (outdated, resolved)

e-kayrakli commented Apr 12, 2024

compiler/optimizations/gpuTransforms.cpp (outdated, resolved)

stonea approved these changes Apr 18, 2024

e-kayrakli force-pushed the gpu-reduce-intent branch from 6c3274f to 64df191 Compare April 19, 2024 23:27

e-kayrakli added 23 commits April 25, 2024 10:12

- Initial wiring for calling a device function for reduction variables (f9ed782)
- Generate a temporary for interim result and pass it to the reduce hlper (dc7c922)
- Allocate and thread interim buffer through to the kernel (70a34b2)
- Initial attempt to actually use CUB (002fa25)
- Fix a bug in device-side CUB usage (b1c92f2)
- Wire into device-wide reduction support as well (a7acb9a)
- Get the final result out into the Chapel variable, too (aad7d8c)
- Wire function name, num threads and block size into runtime (c649502)
- Wire up blk_dim_x into reduction buffer size (8a5a637)
- Remove a debugging output (db0f02b)
- Small fixes to get TeaLeaf working (4112d6f)
- Remove commented-out code (f11c7a9)
- Wire static block size into codegen (d61e3af)
- Avoid adding a new field to GpuKernel (f4d3804)
- Pass blockSize into codegen no matter what (44359da)
- Wire type name into runtime (965d733)
- Start stamping out some versions of device functions (a31901b)
- Start working on adding reduction wrappers (f155504)
- Not-working attempt to generate a wrapper for final reduction (a38e1a0)
- Make the top level reduction interface generic to avoid casts (5296681)
- Start working on reduce kind deduction (1fee0da)
- Progress on reduction kinds (8dc2160)
- Add support for min and max reductions (1d64424)

e-kayrakli added 22 commits April 25, 2024 10:12

- Restructure the kernel to be able to call block-reduce just once (5ebd9c3)
- Quick cleanup in the compiler (bfdfdf1)
- Quick cleanup (61f8003)
- Add the missing file (20807f7)
- Move some more files around, bump up to C++14 for GPU (9f63ee6)
- Minor refactor in the runtime (6dd1956)
- Clean up a comment (7ad6729)
- Adjust some pound-defines (1050e52)
- Adjustments for ROCm 4 (1252842)
- Adjust tests and skipifs (983b08e)
- Close a memory leak coming from reduction buffers (ac9fbb1)
- Address the first batch of Andy's feedback (5f32a7b)
- Update the wrapper prefix to avoid name conflicts (e5ce625)
- Add a test locking minloc/maxloc behavior (0cfe645)
- Lock in GPU transform failure behavior (d3d360f)
- Cleanup (e8584f0)
- Tighten up pattern matching a bit (f9a4bc0)
- Fix for cpu-as-device mode (fa79a4c)
- Remove some trailing whitespaces (027c028)
- Adjust a dyno test for a change in GPU primitives (8639cc6)
- Make a helper data strucutre local static (80bffd1)
- Handle primitives block early to fix a bug (bc76764)

e-kayrakli force-pushed the gpu-reduce-intent branch from 84fffbb to bc76764 Compare April 25, 2024 17:12

e-kayrakli mentioned this pull request Apr 25, 2024

Support all/more kinds of reductions on GPU #24932

Open

- Adjust some good files for cpu-as-device (c4c7ffa)

e-kayrakli merged commit f0797be into chapel-lang:main Apr 25, 2024
7 checks passed

e-kayrakli deleted the gpu-reduce-intent branch April 25, 2024 18:09

e-kayrakli mentioned this pull request May 23, 2024

Fixes for Arkouda following GPU-based reduction #25108

Merged

4 tasks
