Proposed language changes for GPU programming features #5323
I'm particularly interested in this question; it overlaps with one of the features I've long wished for in Chapel. Let's consider a simple STREAM-like example to understand the issue and the idea:

```chapel
config const n: int;
var D = {1..n}; // possibly dmapped
var A: [D] int;
var B: [D] int;
var C: [D] int;

forall (a, b, c) in zip(A, B, C) {
  a = b + c;
}
```

Now suppose I have manually created an 8-wide vector addition routine, `add8`. I could call it like this:

```chapel
// for now, ignoring left-over elements
forall i in 1..n by 8 {
  add8(8, A[i], B[i], C[i]);
}
```

But this approach has the drawback of only applying to a non-distributed `1..n`. I could instead iterate over the (possibly distributed) domain:

```chapel
forall i in D[1..n by 8] {
  add8(8, A[i], B[i], C[i]); // taking in addresses of these elements
}
```

but now the risk that groups of 8 elements might not be stored in the same locale is even more severe. (E.g., if the arrays were Block distributed, communication would be required for some boundary elements.) What if a `forall` loop could be given a workgroup size directly?

```chapel
forall (a, b, c) in zip(A, B, C) with (groupSize=8) {
  // run add8 only for the first of each 8 iterations handled in a workgroup
  if kernel.taskWithinGroup == 0 then
    add8(kernel.iterationsThisGroup, a, b, c);
  // iterationsThisGroup allows boundaries, such as if D is block distributed,
  // to be handled correctly
}
```

Assuming that we had a reasonable default workgroup size, we could even write:

```chapel
proc add(a, b, c) {
  if kernel.taskWithinGroup == 0 then
    add8(kernel.iterationsThisGroup, a, b, c);
}

add(A, B, C); // promoted operation on arrays
```

I think such a strategy would enable vectorization on CPUs, multi-resolution design (using manually vectorized kernels), and also serve as a starting point for the desired GPU functionality. (Additionally, I think it would be very interesting for iterators and/or the Chapel compiler to perform communication for an entire work-group at a time.)
I think that the concept of a workgroup size within forall loops might be important to implementing vectorizable reductions for the CPU.
@mppf would C++ style executors fit into this story?
@ct-clmsn it's interesting but I don't think it's solving the same problems.
It occurs to me that it might be worth approaching the workgroup size from two directions: the hardware and the programmer. The hardware has its own optimum, and the programmer may or may not know anything about it.

From the hardware side, and for the case where the programmer doesn't know what is optimal, it would be useful to have some sort of "machine description" in the compiler that would say "it's best to have forall loops be a multiple of this width." However, maybe the program is currently being compiled for a machine where the optimal width is any multiple of 64, but the programmer knows he will eventually be running on hardware where the optimal width is any multiple of 128. For cases like that, it would also be useful to extend the syntax of forall loops.

Both enhancements would also be useful for CPUs, where it would help to make forall loops a multiple of the vector register width.
We haven't discussed separate GPU memories a whole lot, but one idea there is that an array could be, e.g., Block distributed across the GPU memories.
The Chapel team appreciates AMD Research's efforts on this work. Their code contributions are archived in this repository: https://github.com/rocmarchive/chapel/tree/chpl-hsa-master. Since then, GPU support has been added to Chapel as of the 1.25.0 release, as documented here: https://chapel-lang.org/docs/technotes/gpu.html. We plan on adding support for AMD GPUs in future releases. For general discussion about the future of GPU support in Chapel, feel free to comment on this issue: #18554.
AMD Research is proposing changes to the Chapel language to enable support for the GPU programming model.
This issue has been created to open up discussion with Chapel language developers and users about what these proposed changes should look like and which features might be needed, or are superfluous.
The changes are described in the Chapel Improvement Proposal (CHIP) 17.
This will give GPU programmers the ability to:
1. allocate and access GPU local scratchpad memory
2. access GPU primitives such as `get_local_id()`
3. enforce proper execution by use of workgroup-scope synchronization
4. specify the size of workgroups
5. specify the number of workitems in a kernel launch
6. specify the dimensions of the global workitems and workgroups
The goal is to provide Chapel programmers with the tools to create diverse and more efficient programs on a GPU. This CHIP, however, does not cover data movement between a GPU locale and other locales, and assumes all required data and logic are available to the GPU at runtime.
Please feel free to comment.
Current collaborators are:
Michael Ferguson, Cray
Daniel Lowell, AMD
Mike Chu, AMD
Ashwin Aji, AMD
#5319