
Redesign of GPU locale model #18529

Closed · stonea opened this issue Oct 6, 2021 · 8 comments

stonea (Contributor) commented Oct 6, 2021

When CHPL_LOCALE_MODEL=gpu, Chapel uses a locale model where a GPU is accessed via here.getChild(n) for n > 0; here.getChild(0) returns the locale for the CPU. This means there is no real difference between on Locales[i] and on Locales[i].getChild(0), which is kind of bizarre.

Maybe we should avoid having sublocale 0 refer to the CPU? Or at least we should rename the function to something like here.getGpu(0)?
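To make the oddity concrete, here is a minimal sketch of the interface described above, assuming the CHPL_LOCALE_MODEL=gpu configuration as it exists today:

```chapel
// Under today's gpu locale model, sublocale 0 is the CPU, so these two
// on-statements target the same hardware:
on Locales[0]             do writeln("runs on the CPU");
on Locales[0].getChild(0) do writeln("also runs on the CPU");

// Only a nonzero child index actually targets a GPU:
on Locales[0].getChild(1) {
  // code intended for the first GPU goes here
}
```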

This also raises interesting questions if we want to have a NUMA-aware locale model as well as GPU (sub)locales. One option would be to have the locale consist of a list of CPU sublocales followed by GPU sublocales. If there's a natural association of GPUs to CPUs, we could also imagine having multiple levels to the hierarchy, where GPU sublocales would be referenced from CPU sublocales.

Here's an illustration of these different options:

[Figure: diagrams of the locale-model options discussed above. Per the later comments, the upper-left shows a separate CPUs sublocale alongside GPU sublocales, while the upper-right uses the locale itself for the CPU with sublocales only for GPUs.]

I'm not sure why we have the current design we do, but I imagine it follows from our implementation of wide pointers (see wide_ptr_s in runtime/include/chpltypes.h and chpl_localeID_t in runtime/include/localeModels/gpu/chpl-locale-model.h), where we specify locales by a node ID and a sublocale ID. Of course, we can change this design if we want to, or we could have the locale interface presented to the user abstract these details away.

Related work to consider

We also may want to look at / do something like the locale model proposed for AMD's Accelerated Processing Unit (APU) (see #7152):

```chapel
on Locale[0].GPU { /* GPU code */ }
```

It might also be worth reviewing the KNL locale model (created in #5555, later removed in #15832).

Finally, we have a tech note that discusses the NUMA aware locale model (and compares it to the default flat model): https://chapel-lang.org/docs/technotes/localeModels.html

mppf (Member) commented Oct 7, 2021

Arguably, what we have today is the NUMA-aware-flat configuration in the degenerate case of just one NUMA domain.

gbtitus (Member) commented Oct 7, 2021

Independent of how we number or otherwise identify sublocales, we should strive for them to accurately reflect the underlying node architectures. Current Milan-based EX compute blades have 2 sockets and 4 NUMA nodes per socket, meaning a total of 8 NUMA nodes and 3 different latencies to memory: the NUMA node with affinity to your core has the lowest latency, the other NUMA nodes with affinity to your core's socket have a slightly higher latency, and finally the NUMA nodes with affinity to the "other" socket have the longest latency. This can be seen in the relative latency matrix printed by lstopo -v when it is run on such nodes.

Similarly, when multi-NIC and multi-GPU compute nodes for EX systems arrive, those NICs and GPUs will be physically arranged so as to balance their memory traffic across the NUMA nodes. Thus on a dual-NIC dual-GPU node, one NIC will be "closer" to one of the sockets and its top-level NUMA domain, and the other NIC will be closer to the other socket and top-level NUMA domain, and similarly for the two GPUs. Exactly how we want to represent that in the locale model I'm not sure, but achieving best performance will require that when Chapel distributes data and execution across the sublocales of a locale, that matches up properly with the underlying system architecture.

The other thing to keep in mind is that these nodes are not that far off -- we already have multi-NIC nodes running in-house on testers, and I'm pretty sure some of the big deliverable systems for next year have dual-NIC dual-GPU nodes in them.

bradcray (Member) commented Oct 7, 2021

> Independent of how we number or otherwise identify sublocales, we should strive for them to accurately reflect the underlying node architectures.

Just focusing on CPU-only compute nodes for a moment: isn't the fact that we use the flat locale model on them today (and effectively couldn't come up with a way of running the numa locale model on them without hurting performance) an argument that it may sometimes be better for the locale model to intentionally not reflect the node architecture? Or am I misunderstanding you? Your subsequent statement I agree with wholeheartedly:

> but achieving best performance will require that when Chapel distributes data and execution across the sublocales of a locale, that matches up properly with the underlying system architecture.

Put another way, it seems like the best way to get performance on such deeply hierarchical node architectures may be to head down the 'process-per-NIC/NUMA domain' road with a simpler locale model, rather than by creating locale models that more accurately reflect the node architecture?

gbtitus (Member) commented Oct 7, 2021

> Just focusing on CPU-only compute nodes for a moment: isn't the fact that we use the flat locale model on them today (and effectively couldn't come up with a way of running the numa locale model on them without hurting performance) an argument that it may sometimes be better for the locale model to intentionally not reflect the node architecture?

I'm not sure. I think it may be too strong to say that we "couldn't come up with a way". I think we only really tried the one way (multi-ddata) of approaching this, and convinced ourselves that it couldn't be made to work.

But I sometimes wonder what performance we would have seen had we looked at augmenting Block (for example) to handle two-level blocking when the locale model was numa: first across the locales, and then across the NUMA sublocales. Then, in the iterator(s), we'd have a forall-stmt expanding into a coforall-across-locales, containing a coforall-across-sublocales, containing a coforall-across-cores, containing a for-stmt. (The truly "flat" version would leave out the coforall-across-sublocales.) In other words, try solving the problem at the domain map level rather than at the DefaultRectangular level, which is perhaps too close to the memory.

Now, perhaps Block would also have the same divide-op problem that multi-ddata did, and we've simply gotten used to coding user applications such that we don't encounter it. But perhaps there are subtleties to the Block implementation that avoid the divide problem, and I didn't pick up on that when I was wrestling with multi-ddata.
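For concreteness, here is a hedged sketch of the nested structure that expansion might take; myChunk() and work() are hypothetical placeholders, not actual Block internals:

```chapel
proc myChunk(tid: int): range { return 1..0; }  // hypothetical per-task chunking
proc work(i: int) { }                           // hypothetical work on one index

coforall loc in Locales do on loc {             // across locales
  coforall c in 0..#here.getChildCount() do     // across NUMA sublocales
    on here.getChild(c) {
      coforall tid in 0..#here.maxTaskPar do    // across cores
        for i in myChunk(tid) do                // serial per-task loop
          work(i);
    }
  // the truly "flat" version would omit the sublocale-level coforall
}
```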

> ... it seems like the best way to get performance on such deeply hierarchical node architectures may be to head down the 'process-per-NIC/NUMA domain' road with a simpler locale model, rather than by creating locale models that more accurately reflect the node architecture?

I feel pretty certain that'll be the simplest way to do it. Of course, I'm used to coding at the system level so I'm comfy solving problems there, which is what that solution will require. I suspect absolute best performance would come from sticking with a process-per-node model because doing so avoids intra-node inter-locale "communication" (via the comm layer or whatever). But that will require work in areas where I'm less comfortable, so it looks harder to me.

mppf (Member) commented Oct 8, 2021

FWIW, I have been expecting that a Block array distributed over NUMA sublocales for sockets (what @gbtitus called a "flat" version) will work just fine and give acceptable performance. Here I am imagining that the comm layer will use a faster implementation for comms within a node (like GASNet does with PSHM), and I don't think that is too much to ask of the comm layer implementation.

At the higher level, though, I think we need the locale model to reflect the architecture, but only to a degree. The basic reason is that I expect diminishing returns in performance from a more detailed model, while at the same time the additional detail adds complexity for programming and challenges to portability (across different locale models). Of course, these downsides can be somewhat addressed by careful design of:

a. a strategy that locale models always follow for breaking up resources -- e.g., we could say that sibling locales never refer to the same resources, so that dividing work among the sibling locales will never oversubscribe anything

b. calls one can make on any locale model to generically get the appropriate resources -- e.g., "get the data parallel sublocales" (see the sketch below)
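As an illustration of point (b), here is a hypothetical sketch; dataParallelSublocales() is an invented name standing in for whatever generic query we might settle on:

```chapel
// Hypothetical generic query: each locale model would return the set of
// sublocales suitable for dividing data-parallel work among.
proc locale.dataParallelSublocales() {
  return [this];  // flat model: the locale itself is the only target
}

// Generic user code that works under any locale model providing the query;
// per point (a), siblings never share resources, so this cannot oversubscribe.
coforall sub in here.dataParallelSublocales() do on sub {
  // data-parallel work for this sublocale
}
```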

And those points hopefully bring us back to the question of "How should users working with GPU systems interact with the locale model there?"

e-kayrakli (Contributor) commented

I think in the near- and mid-term, we should answer questions from the user interface standpoint. While discussing the implementation details is definitely useful, I feel like if we didn't have here.getChild(1) as the de facto interface today, we wouldn't be too worried about these details.

As a side note, I also wanted to say that comparing the gpu locale model with existing (or once-existing) locale models can be a bit misleading. At least, it has been for me at times. What I think of as a "sublocale" varies wildly between the numa and gpu locale models. This mostly has to do with the limitations/quirks of the processor (GPU) on a gpu sublocale.

stonea (Contributor, Author) commented Jan 24, 2022

Hi all,

We've been having a number of internal discussions recently (at least among the developers focusing on GPU support) about locale models.

In terms of the models discussed on this GitHub issue (or at least the non-NUMA-related ones), we're arriving at a consensus that we prefer the model illustrated in the upper right (locale for CPU; sublocales for GPUs). In other words, I'm suggesting we eliminate the model illustrated in the upper left (in the original post for this issue) from consideration. If you do feel that the upper-left locale model would be preferable to the upper-right one, please speak up.

So why eliminate that top-left locale model?

As of today (at least in the GPU and NUMA locale models), Chapel allows a task to run on either a locale or a sublocale. What this means is that it's perfectly legal to have code run on Locales[i] or on Locales[i].getChild(0), and if you query here under each you would get a different result (one returns the locale, the other the sublocale). However, as described in the original problem statement, in today's GPU locale model there's no real difference in terms of execution behavior between the two. This (at least to me, and I think to others as well) seems bizarre and confusing.

Given that, I think we should eliminate the CPUs sublocale and just use the locale itself for expressing CPU-based computation and doing CPU-local memory management.

I could imagine there being value in having a separate CPUs sublocale if running on the locale itself represented something different (for example, that we should span computation across all the sublocales: the GPUs and CPUs in tandem). But this would be pretty different from what it means to run on a locale today (and would likely cause a lot of backwards-incompatibility headaches for existing code that does things like coforall loc in Locales do on loc do ...).

In cases where we would want computation to span across CPUs and GPUs, I think other mechanisms/abstractions would be a better approach (like the GPUIterator presented in this work: https://github.com/ahayashi/chapel-gpu).

For now, I'll leave aside the issue of how this all ought to glue together with NUMA-aware locale models (though I'm also considering this), but in the meantime please let me know if you do see value in having a separate CPUs sublocale as illustrated in the upper-left diagram.

stonea (Contributor, Author) commented Aug 5, 2022

As of 1.27, we've modified the locale model for GPUs to be something akin to what's shown in the top-right corner of the illustration in the issue summary.

That is to say: top-level locales represent CPUs, and sublocales are for GPUs.

We no longer expect getChild() to be used as a user-level interface. Rather, all the GPU sublocales are gathered together in a gpus array accessible off of a locale. This means that to create a loop that does something on the first GPU of the current node, you'd do something like this:

```chapel
on here.gpus[0] {
  var A: [1..100] int;        // array and bounds are illustrative
  foreach i in 1..100 do      // eligible to be compiled as a GPU kernel
    A[i] = i;
}
```

To query the number of GPUs on a given locale, use here.gpus.size instead of here.getChildCount().
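Putting these together, a sketch of using every GPU on the current node might look like this (the array and bounds are illustrative, not from the original discussion):

```chapel
writeln("this node has ", here.gpus.size, " GPUs");

// run one task per GPU, each doing its own GPU-eligible loop
coforall gpu in here.gpus do on gpu {
  var A: [1..100] int;
  foreach i in 1..100 do
    A[i] = i;
}
```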

We're also encouraging users not to think of sublocales as representing pieces of hardware, but rather as indicating particular policies.

For more details, see slides 101-108 of the "Ongoing Efforts" slides for the 1.26 release (https://chapel-lang.org/releaseNotes/1.26/06-ongoing.pdf). Note that these slides are somewhat out of date, in that we hadn't yet settled on using a gpus array to store GPU sublocales at the time, but the broader points about getting rid of getChild and the "conceptual shift" described in slide 105 still hold.

I'm going to close this issue, as I think the original question of how to arrange sublocales (for GPUs) has been settled.
