Redesign of GPU locale model #18529
Comments
Arguably, what we have today is the NUMA-aware flat configuration in the degenerate case of just one NUMA domain.
Independent of how we number or otherwise identify sublocales, we should strive for them to accurately reflect the underlying node architecture. Current Milan-based EX compute blades have 2 sockets and 4 NUMA nodes per socket, for a total of 8 NUMA nodes and 3 different latencies to memory: the NUMA node with affinity to your core has the lowest latency, the other NUMA nodes with affinity to your core's socket have a slightly higher latency, and finally the NUMA nodes with affinity to the "other" socket have the highest latency. This can be seen in the relative latency matrix printed by the system's NUMA tooling.
Just focusing on CPU-only compute nodes for a moment, isn't the fact that we use the
Put another way, it seems like the best way to get performance on such deeply hierarchical node architectures may be to head down the "process per NIC/NUMA domain" road with a simpler locale model, rather than by creating locale models that more accurately reflect the node architecture?
I'm not sure. I think it may be too strong to say that we "couldn't come up with a way". I think we only really tried the one approach (multi-ddata), and convinced ourselves that it couldn't be made to work. But I sometimes wonder what performance we would have seen had we looked at augmenting
I feel pretty certain that'll be the simplest way to do it. Of course, I'm used to coding at the system level so I'm comfy solving problems there, which is what that solution will require. I suspect absolute best performance would come from sticking with a process-per-node model, because doing so avoids intra-node inter-locale "communication" (via the comm layer or whatever). But that will require work in areas where I'm less comfortable, so it looks harder to me.
FWIW, I have been expecting that a Block array distributed over NUMA sublocales for sockets (what @gbtitus called a "flat" version) will work just fine and give acceptable performance. Here I am imagining that the comm layer will use a faster implementation for comms within a node (like GASNet does with PSHM), and I don't think that is too much to ask of the comm layer implementation.

At the higher level, though, I think we need the locale model to reflect the architecture, but only to a degree. The basic reason for that is that I expect diminishing returns for performance from a more detailed model, while at the same time the additional detail adds complexity for programming and challenges to portability (across different locale models). Of course, these downsides can be somewhat addressed by careful design of:

And those points hopefully bring us back to the question of "How should users working with GPU systems interact with the locale model there?"
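For concreteness, here's a hedged sketch of the kind of Block-distributed array usage being discussed; it uses the standard Block distribution over top-level locales (under a NUMA-sublocale model, the distribution's `targetLocales` would presumably name sublocales instead). The size `n` is a placeholder, not from the discussion above:

```chapel
use BlockDist;

config const n = 1_000_000;

// Standard Block distribution over the top-level locales; under a
// NUMA-sublocale ("flat") model one could imagine targeting the sockets'
// sublocales instead, with intra-node comms handled by the comm layer
// (e.g., a PSHM-style fast path).
const D = {1..n} dmapped Block(boundingBox={1..n});
var A: [D] real;

forall i in D do      // each iteration runs on the locale owning A[i]
  A[i] = i;
```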
I think in the near and mid term, we should answer questions from the user-interface standpoint. While discussing the implementation details is definitely useful, I feel like if we didn't have

As a side note, I also wanted to say that comparing the
Hi all,

We've been having a number of internal discussions recently (at least among the developers focusing on GPU support) about locale models. In terms of the models discussed on this GitHub issue (or at least the non-NUMA-related ones), we're arriving at a consensus that we prefer the illustration shown in the upper right (locale for CPU; sublocales for GPUs). In other words, I'm suggesting we eliminate the illustration shown in the upper left (in the original post for this issue) from consideration. If you do feel that the upper-left locale model would be preferable to the upper-right one, please speak up.

So why eliminate that top-left locale model? As of today (at least in the GPU and NUMA locale models), Chapel allows a task to run on either a locale or a sublocale. What this means is it's perfectly legal to have code run

Given that, I think we should eliminate the CPUs sublocale and just use the locale itself for expressing CPU-based computation and doing CPU-local memory management. I could imagine there being value in having a separate CPUs sublocale if running on the locale itself represented something different (for example: that we should span computation across all the sublocales -- the GPUs and CPUs in tandem). But this would be pretty different from what it means to run on a locale today (and would likely cause a lot of backwards-incompatibility headaches for existing code that does things like

In cases where we would want computation to span across CPUs and GPUs, I think other mechanisms/abstractions would be a better approach (like the

For now, I'll leave aside the issue of how this all ought to glue together with NUMA-aware locale models (though I'm also considering this), but in the meantime please let me know if you do see value in having a separate CPUs sublocale as illustrated in the upper-left diagram.
As of 1.27, we've modified the locale model for GPUs to be something akin to what we have in the top-right corner of the illustration in the issue summary. That is to say: top-level locales represent CPUs, and sublocales are for GPUs. We no longer expect users to use `getChild`; instead, GPU sublocales are accessed via `here.gpus`:

```chapel
on here.gpus[0] {
  foreach i in 0..n do
    // GPUized loop
}
```

To query the number of GPUs on a given locale, use `here.gpus.size`.

We're also encouraging users not to think of sublocales as representing pieces of hardware but rather as indicating particular policies. For more details, see slides 101-108 in the "Ongoing Efforts" slides for the 1.26 release (https://chapel-lang.org/releaseNotes/1.26/06-ongoing.pdf). Note these slides are somewhat out of date in that we hadn't yet settled on using a `gpus` array at that point.

I'm going to close this issue as I think the original question of how to arrange sublocales (for GPUs) has been settled.
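To illustrate the 1.27-style model described above, here's a hedged sketch that visits each locale's GPU sublocales; the loop bound `n` and array `A` are placeholders, not from the comment itself:

```chapel
config const n = 100;

coforall loc in Locales do on loc {
  // Top-level locale: the CPU side of the node.
  writeln(here, " has ", here.gpus.size, " GPU(s)");

  // Sublocales: one per GPU, reached via the `gpus` array.
  coforall gpu in here.gpus do on gpu {
    var A: [1..n] int;       // allocated in GPU-accessible memory
    foreach i in 1..n do     // order-independent loop, eligible for GPU execution
      A[i] = i * i;
  }
}
```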
When `CHPL_LOCALE_MODEL=gpu`, Chapel will use a locale model where the GPU is accessed using `here.getChild(n)` where n > 0; `here.getChild(0)` will return the locale for the CPU. This means that there is no real difference between `on Locales[i]` and `on Locales[i].getChild(0)`, which is kind of bizarre.

Maybe we should avoid having a sublocale 0 refer to the CPU? Or at least we should change the function to something like `here.getGpu(0)`?

This also raises interesting questions if we want to have a NUMA-aware locale model as well as GPU (sub)locales. One option would be to have the locale consist of a list of CPU sublocales followed by GPU sublocales. If there's a natural association of GPUs to CPUs, we could also imagine having multiple levels to the hierarchy where GPU sublocales could be referenced from CPU sublocales.
Here's an illustration of these different options:
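To make the "no real difference" oddity concrete, here's a hedged sketch, assuming `CHPL_LOCALE_MODEL=gpu` and the `getChild`-based interface described above:

```chapel
// Under this model, sublocale 0 is just the CPU, so these two
// on-statements should land in the same place:
on Locales[0] do
  writeln("on the locale itself: ", here);

on Locales[0].getChild(0) do
  writeln("on sublocale 0:       ", here);
```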
I'm not sure why we have the current design we do, but I imagine it follows from our implementation of wide pointers (see `wide_ptr_s` in `runtime/include/chpltypes.h` and `chpl_localeID_t` in `runtime/include/localeModels/gpu/chpl-locale-model.h`), where we specify locales by a node ID and sublocale ID. Of course, we can change this design if we want to, or we could have the locale interface presented to the user abstract these details away.

**Related work to consider**
We also may want to look at / do something like this locale model for AMD's Accelerated Processing Unit (APU) (see #7152):
It might also be worth reviewing the KNL locale model (created in #5555, later removed in #15832).
Finally, we have a tech note that discusses the NUMA aware locale model (and compares it to the default flat model): https://chapel-lang.org/docs/technotes/localeModels.html