
Proposed language changes for GPU programming features #5323

Closed
daniellowell opened this issue Feb 8, 2017 · 7 comments
@daniellowell
Contributor

daniellowell commented Feb 8, 2017

AMD Research is proposing changes to the Chapel language to enable support for the GPU programming model.

This issue has been created to open discussion with Chapel language developers and users on what these proposed changes should look like and which features might be needed, or are superfluous.

The changes are described in Chapel Improvement Proposal (CHIP) 17.

This will give GPU programmers the ability to:
1. allocate and access GPU local scratchpad memory
2. access GPU primitives such as get_local_id()
3. enforce proper execution by use of workgroup-scope synchronization
4. specify the size of workgroups
5. specify the number of workitems in a kernel launch
6. specify the dimensions of the global workitems and workgroups

The goal is to provide Chapel programmers the tools to create diverse and more efficient programs on a GPU. This CHIP, however, does not cover data movement between a GPU locale and other locales, and assumes all required data and logic is available to the GPU at runtime.

Please feel free to comment.

Current collaborators are:

Michael Ferguson, Cray
Daniel Lowell, AMD
Mike Chu, AMD
Ashwin Aji, AMD

#5319

@mppf
Member

mppf commented Feb 9, 2017

I'm particularly interested in the question of how a forall loop might possibly be combined with the sense of a workgroup. In particular, I have a hypothesis that the concept of a workgroup is useful even for parallel programs on CPUs. Certainly it is critical for GPU programming, but if it's also useful for CPU programming, we might want to consider ways to modify the Chapel language that make something like workgroup an optional component of all forall loops.

One of the features I've wished for with Chapel's forall loops is a way to provide a manually vectorized kernel (written in assembly, possibly) that should execute inside a forall loop. Including a sense of a workgroup that maps down to the vector width on a CPU is one way to do that.

Let's consider a simple STREAM-like example to understand the issue and the idea.

config const n: int;
var D = {1..n}; // possibly dmapped
var A:[D] int;
var B:[D] int;
var C:[D] int;

forall (a,b,c) in zip(A,B,C) {
  a = b + c;
}

Now suppose I have manually created an 8-wide vector addition routine, add8, in C or assembly that is available to Chapel as an extern proc taking in addresses for the first of 8 elements. Then I might write the loop as:

// for now, ignoring left-over elements
forall i in 1..n by 8 {
  add8( 8, A[i], B[i], C[i] );
}
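As a concrete sketch of what such an extern add8 routine might look like in C (the signature here is an assumption, since the original routine is not shown; the comment says only that it takes addresses of the first of 8 elements):

```c
#include <stdint.h>

/* Hypothetical 8-wide addition kernel: adds b[i] + c[i] into a[i]
 * for n consecutive elements starting at the given addresses.
 * A real version would use SIMD intrinsics or assembly; this
 * portable loop is written so a vectorizing compiler can turn it
 * into a single vector add. */
void add8(int64_t n, int64_t *a, const int64_t *b, const int64_t *c) {
    for (int64_t i = 0; i < n; i++)
        a[i] = b[i] + c[i];
}
```

From Chapel, something like this could plausibly be declared as an extern proc with ref formals so that element addresses are passed, matching the `add8( 8, A[i], B[i], C[i] )` call above.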

But this approach has the drawback of only applying to forall loops over ranges with strided iteration. In particular, I'd like the forall loop to be part of high-level code - such as a promoted function call, and my optimized 8-way vectorized version to be usable as a function called within the high-level code. Even if I had a way to get the next 8 array elements (say, by address arithmetic), there's no way to know that the forall loop will process the next 8 elements on the same task. This problem is more severe if we imagine that A, B, C are distributed arrays; in that case, using the range as-written would no longer match the distribution; one would have to instead write a version with domain slicing to preserve the locality:

forall i in D[1..n by 8] {
  add8( 8, A[i], B[i], C[i] );  // taking in addresses of these elements
}

but now the risk that groups of 8 elements might not be stored in the same locale is even more severe. (E.g. if the arrays were Block distributed, communication would be required for some boundary elements, but the add8 method would not do that communication since it only accesses elements beyond the first from C/assembly code).

What if a workgroup size was part of the nature of a forall loop? e.g. as a strawman, the following might be equivalent to the previous examples:

forall (a,b,c) in zip(A,B,C) with (groupSize=8) {
  // run add8 only for the first of each 8 iterations handled in a workgroup
  if kernel.taskWithinGroup == 0 then
     add8(kernel.iterationsThisGroup,  a, b, c );
     // iterationsThisGroup allows boundaries, such as if D is block distributed,
     // to be handled correctly
}
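One way to read the strawman's semantics is as a hypothetical CPU lowering (a sketch, not an actual implementation): the iteration space is chunked into groups of groupSize, only the first task of each group runs the wide kernel, and iterationsThisGroup is clipped at boundaries. The add8 stand-in below is an assumption, as the original routine is not shown:

```c
#include <stdint.h>

/* Hypothetical element-wise kernel standing in for add8: adds
 * iters consecutive elements of b and c into a. */
static void add8(int64_t iters, int64_t *a, const int64_t *b, const int64_t *c) {
    for (int64_t i = 0; i < iters; i++)
        a[i] = b[i] + c[i];
}

/* Sketch of lowering "forall ... with (groupSize=8)" to a CPU:
 * one add8 call per group of 8 iterations, with iterationsThisGroup
 * shrinking on the final, partial group (the boundary case the
 * comment in the Chapel snippet refers to). */
static void group_lowering(int64_t n, int64_t *a, const int64_t *b, const int64_t *c) {
    const int64_t groupSize = 8;
    for (int64_t start = 0; start < n; start += groupSize) {
        int64_t iterationsThisGroup =
            (n - start < groupSize) ? n - start : groupSize;
        /* the "taskWithinGroup == 0" branch: one wide call per group */
        add8(iterationsThisGroup, &a[start], &b[start], &c[start]);
    }
}
```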

Assuming that we had a reasonable default workgroup size, we could even write

proc add(a, b, c) {
  if kernel.taskWithinGroup == 0 then
     add8(kernel.iterationsThisGroup,  a, b, c );
}

add(A, B, C); // promoted operation on arrays

I think such a strategy would enable vectorization on CPUs, multi-resolution design (using manually vectorized kernels), and also serve as a starting point for the GPU functionality desired.

(Additionally, I think it would be very interesting for iterators and/or the Chapel compiler to perform communication for an entire work-group at a time).

Note: I'm continually confused about the difference between workgroup and wavefront and it might be that the GPU programming would need to be able to associate two numbers with a forall loop...

@mppf
Member

mppf commented May 24, 2017

I think that the concept of a workgroup size within forall loops might be important to implementing vectorizable reductions for the CPU.

@ct-clmsn
Contributor

@mppf would C++ style executors fit into this story?

@mppf
Member

mppf commented May 24, 2017

@ct-clmsn it's interesting but I don't think it's solving the same problems.

@dmk42
Contributor

dmk42 commented May 25, 2017

It occurs to me that it might be worth approaching the workgroup size from two directions: the hardware and the programmer. The hardware has its own optimum, and the programmer may or may not know anything about it.

From the hardware side, and for the case that the programmer doesn't know what is optimal, it would be useful to have some sort of "machine description" in the compiler that would say "it's best to have forall loops be a multiple of this width."

However, maybe the program is currently being compiled for a machine where the optimal width is any multiple of 64, but the programmer knows he will eventually be running on hardware where the optimal width is any multiple of 128. For cases like that, it would also be useful to extend the syntax of forall loops.

Both enhancements would also be useful for CPUs, where it would help to make forall loops a multiple of the vector register width.
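To make the "multiple of 64 now, multiple of 128 later" case concrete: a width satisfying both a machine-description constraint and a programmer-specified one is their least common multiple (here lcm(64, 128) = 128, so the programmer's choice also satisfies the current machine). A small helper sketching this arithmetic, purely illustrative and not part of any Chapel machinery:

```c
#include <stdint.h>

/* Greatest common divisor, by the Euclidean algorithm. */
static int64_t gcd(int64_t a, int64_t b) {
    while (b != 0) {
        int64_t t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* Least common multiple: the smallest width that is a multiple of
 * both the machine-description width and the programmer's width. */
static int64_t combined_width(int64_t machine, int64_t programmer) {
    return machine / gcd(machine, programmer) * programmer;
}
```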

@mppf
Member

mppf commented Nov 18, 2019

We haven't discussed separate GPU memories a whole lot, but one idea there is that, e.g., arrays Block-distributed across the GPU memories could go somewhere.

I wonder, though, if there is a place for in, out, and inout intents on tasks, foralls, and even on statements to indicate that some particular data should be copied to/from GPU memory. In particular, I think that one might want an inout task intent on an array slice to copy that portion to a local array and back again. This idea would bring us more toward having the compiler/language develop an idea of "GPU tasks" and not just "GPU loops".

@stonea
Contributor

stonea commented Oct 12, 2021

The Chapel team appreciates AMD Research's efforts on this front. Their code contributions are archived in this repository: https://github.com/rocmarchive/chapel/tree/chpl-hsa-master

Since then, GPU support has been added to Chapel as of the 1.25.0 release, as documented here: https://chapel-lang.org/docs/technotes/gpu.html.

We plan on adding support for AMD GPUs in future releases.

For general discussion about the future of GPU support in Chapel, feel free to comment on this issue: #18554.

@stonea stonea closed this as completed Oct 12, 2021