Various benchmarks from OSU-Micro-Benchmarks/5.7.1-gompi-2021a-CUDA-11.3.1 segfault when using CUDA buffers #14801
Comments
How is this for e.g.
We don't have that toolchain, but we did have
This does not change anything for me: results are the same both for my tests with
Not sure what else should be at the
This OpenMPI FAQ seems to suggest
This is actually becoming more worrying to me. I thought there was just UCX's cuda plugin, or OpenMPI's smcuda BTL. But if 2020a +
yeah, this I also don't understand. Why would using UCX in

I tried to build an OpenMPI with CUDA the 'traditional' way, so I now have a

So, I really need someone to reproduce my initial problem. We might be chasing something down that is specific to our system/installation somehow... So if anyone with GPU Direct RDMA capable hardware can install
Nothing stops openmpi from having an evil

I have already reproduced the error. gather and alltoall crash for me with 2021a + UCX-CUDA. The other 3 work. I also played around a bit with additional flags
but the only thing I see is a warning about "gdr_copy", which I suspect is just outdated documentation? It's the uct-cuda plugin nowadays(?). I see the same warning about nonexistent gdr_copy on fosscuda/2020b where we aren't doing funny things.
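The kind of flags in question would be along these lines (illustrative only; which transports are actually usable is system-dependent):

```shell
# Illustrative sketch: force the UCX PML and explicitly list the CUDA-capable
# UCX transports. The transport names (cuda_copy, cuda_ipc, gdr_copy) are
# standard UCX TLS values, but availability depends on the system.
export OMPI_MCA_pml=ucx
export UCX_TLS=rc,sm,self,cuda_copy,cuda_ipc,gdr_copy
export UCX_LOG_LEVEL=info    # make UCX report which transports it actually selects
```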
For me, this just works:
You mean you get warnings there? That might be related to the fact that
Hm, well, all of this CUDA RDMA stuff requires kernel-level support, no?
I did some testing and built some extra modules:
My conclusion from those is that we need OpenMPI to be built with CUDA as a dependency to get an OpenMPI with CUDA support, and that UCX-CUDA alone is not sufficient here.
Those results match up with whether it has smcuda or not, which leaves the door open for an OMPI_MCA_mca_component_path + mca_btl_smcuda.so solution. However, it could also be something more intrusive, like

What I can't make sense of is:
which should have matched with OSU-Micro-Benchmarks/5.8-gompi-2021a-CUDA-11.3.1. Why does this error persist for @casparvl? Perhaps just some unrelated mistake was made?
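As a rough sketch, the mca_component_path approach mentioned above could look something like this (paths and the add-on location are hypothetical; mca_component_path is the standard MCA parameter for extra component directories):

```shell
# Hypothetical add-on layout: a separately built mca_btl_smcuda.so dropped into
# its own directory, prepended to the MCA component search path of the existing
# (non-CUDA-aware) OpenMPI installation.
export OMPI_MCA_mca_component_path="$HOME/ompi-cuda-addon/lib/openmpi:$EBROOTOPENMPI/lib/openmpi"
ompi_info | grep -i smcuda    # check that the extra BTL is now picked up
```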
@Micket absolutely right. I just had another look, and I did something silly. We RPATH everything, and therefore, I should have recompiled the OSU-benchmarks. However, I forgot to do that, and simply loaded the CUDA-aware MPI. That, for obvious reasons, wasn't enough. I'll recompile OSU and run another test, but I have no doubt it will confirm what @branfosj has found...
The thing that still puzzles me is that this FAQ really seems to suggest a CUDA-aware UCX should be enough... https://www.open-mpi.org/faq/?category=runcuda#mpi-apis-cuda-ucx

Once I've rerun the OSU tests, I'll post on the OpenMPI issue page to ask for confirmation that the
If so, I think it would be interesting to test this with

to see if we can expand the paths to add, cf. @bartoldeman's comments in
It would actually create the possibility of re-adopting the
Ok, I can confirm @branfosj's finding that

indeed works now that I recompiled my OSU benchmarks. Sorry for the confusion there...
I rather think it would be the case that we just add an OpenMPI-CUDA alongside the UCX-CUDA, and not make it a fosscuda toolchain.
That's also possible. My biggest concern is that for users it was pretty intuitive that if they wanted to use foss with CUDA support, they'd have to load

But, I have no strong feelings on this. We can always still define our own
Ok, I've put in a ticket on the OpenMPI issue page to ask the experts what level of CUDA support we should expect from our current solution (non-CUDA-aware OpenMPI + CUDA-aware UCX): open-mpi/ompi#9906. That should help us decide going forward.
For a user just using the top-level software, like GROMACS or so, they won't have to care any more than they do today.
I'm with Åke on this; even if the packaging is arguably a bit messier, I think it's a small price to pay for not having to duplicate all these dependencies, which is probably our number 1 concern in EasyBuild right now (processing of thousands of PRs).

@akesandgren This package exists already;

We also discussed whether we should just introduce an empty module that just depends on these CUDA things, but it wasn't clear what to actually call it (it can't be named just "CUDA" since it would conflict), so we just opted for depending straight on UCX-CUDA. We can probably just let

Mental note: we can consider adding a new mpi-ext.h header file in CPATH with this module, redefining the CUDA support.

But first we need to verify that the smcuda BTL or similar is what we even need here. If it's not enough, then this whole plan goes out the window anyway.
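For context, the check that mpi-ext.h enables looks roughly like this (a minimal sketch following what the Open MPI documentation describes for MPIX_CUDA_AWARE_SUPPORT / MPIX_Query_cuda_support; nothing EasyBuild-specific assumed):

```c
#include <stdio.h>
#include <mpi.h>
#include <mpi-ext.h>   /* Open MPI extensions; defines MPIX_CUDA_AWARE_SUPPORT if built */

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    /* Compile-time check: was this MPI built with CUDA support? */
#if defined(MPIX_CUDA_AWARE_SUPPORT) && MPIX_CUDA_AWARE_SUPPORT
    printf("Compile-time CUDA-aware support: yes\n");
#else
    printf("Compile-time CUDA-aware support: no\n");
#endif

    /* Run-time check (only available when the cuda extension is present). */
#if defined(MPIX_CUDA_AWARE_SUPPORT)
    printf("Run-time CUDA-aware support: %s\n",
           MPIX_Query_cuda_support() ? "yes" : "no");
#endif

    MPI_Finalize();
    return 0;
}
```

Shipping a tweaked mpi-ext.h via CPATH with the add-on module would effectively change the compile-time answer for codes built against it.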
But that's my point: if

Or am I missing something here?
It's a matter of how you define (sub)toolchain(s): a toolchain that sits on top of another would normally require a change in the hierarchy for an HMNS, and would still need hierarchical support in the framework (which fosscuda of course has already).
It is now clear from open-mpi/ompi#9906 that, indeed, OpenMPI without CUDA support + UCX with CUDA support should not be expected to work for collectives. Specifically, functions like these need to know how to handle GPU memory buffers, and thus require compiling with CUDA support on the OpenMPI side. Essentially, this means that support for CUDA-based MPI ops is (partially) broken in

We have two potential approaches to solve this:

One can test the 2nd solution relatively easily by building a
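If that test build is a fully CUDA-aware OpenMPI, a throwaway build outside EasyBuild could look roughly like this (a sketch; prefix and dependency paths are placeholders, and --with-cuda is Open MPI's standard configure option for CUDA awareness):

```shell
# Sketch of a disposable CUDA-aware Open MPI build for testing; paths are placeholders.
./configure --prefix="$HOME/openmpi-cuda-test" \
            --with-cuda="$EBROOTCUDA" \
            --with-ucx="$EBROOTUCX"
make -j 8 && make install
```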
I am doubting whether the OpenMPI build (foss/2021a) is really CUDA-aware. Based on the Open MPI FAQs, when I run,

In the dev version, it seems there is a fair amount of missing CUDA implementations in the Open MPI source

Also, shouldn't there be an mca_coll_cuda library produced? Currently,
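For reference, the FAQ check being referred to is presumably along these lines (mpi_built_with_cuda_support is the parameter the Open MPI FAQ points at):

```shell
# Reports whether this Open MPI installation was configured --with-cuda.
ompi_info --parsable --all | grep mpi_built_with_cuda_support:value
# List the coll components that were actually built (e.g. whether coll/cuda is there).
ompi_info | grep "MCA coll"
```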
Hypothetically speaking, there is one for cuda
@mustafabar yes, I don't think this has really ever been in doubt, since it's not really clear what "cuda support" entails (see e.g. open-mpi/ompi#7963, which is about checking UCX as an option).

@bartoldeman was grepping through the openmpi codebase, and found some conditionally compiled code in OPAL:

```c
#if OPAL_CUDA_SUPPORT
        MEMCPY_CUDA( iov[i].iov_base, base_pointer, iov[i].iov_len, pConv );
#else
        MEMCPY( iov[i].iov_base, base_pointer, iov[i].iov_len );
#endif
```

```c
#if OPAL_CUDA_SUPPORT
#include "opal/datatype/opal_datatype_cuda.h"
#define MEMCPY_CUDA( DST, SRC, BLENGTH, CONVERTOR ) \
    CONVERTOR->cbmemcpy( (DST), (SRC), (BLENGTH), (CONVERTOR) )
#endif
...
#if OPAL_CUDA_SUPPORT
    convertor->cbmemcpy = &opal_cuda_memcpy;
#endif
```

Ouch, not good. But I still think there is hope; none of that code actually depends on CUDA itself. It's just a question of whether we should check for CUDA support at runtime or not, so I think it would be fine to patch in/configure openmpi to always look for CUDA by enabling OPAL_CUDA_SUPPORT (and let it revert to plain memcpy at runtime), which it already does;

The actual CUDA stuff is (e.g. populating

Interestingly, this OPAL code has since actually been moved into mca/common/cuda;
Ok. Perhaps I'm missing some background here. I was alarmed when I saw that the CUDA-aware flag is true, and that mca_coll_cuda.so is available, with foss/2020a and b.
I don't understand, no they are not? foss/2021a, just like foss/2020a, foss/2020b, intel/2020b, etc. etc., are all non-CUDA-aware, just as intended. So far so good.

Then, at a later stage, add "CUDA support", at runtime, without relying on LIBRARY_PATH (since some might use RPATH), into this existing non-CUDA-aware foss/2021a.
Oops, I was actually referring to fosscuda 2020a and b, and it makes a big difference. Sorry.

I think getting CUDA support out of OpenMPI would better be done via a known, documented way (such as OpenMPI + CUDA awareness). In theory, the UCX-CUDA plugin should work, but how adventurous should EasyBuilders be with it (doing non-CUDA-aware MPI coupled with UCX + CUDA)? It would be interesting to see if it works with other MPI implementations like Intel MPI or MVAPICH. If I understand correctly, from @branfosj's test cases, UCX+CUDA only worked when MPI had CUDA awareness.

Disclaimer: I am new to this community and could say nonsense.
Disclaimer

This first section is mostly me giving some background to my relatively new coworker @mustafabar about why we are in this situation. Skip to the end for technical goodies.
A bit of background for those new to the discussion: this is pretty much strictly a packaging issue. Apart from the bug in this thread, I considered this approach a big improvement.
weeeell, we now know there were also (at least) parts of OPAL that need some additional code.
Specifically for OpenMPI, changing the packaging structure for MCA isn't very adventurous; rather, it's its intended purpose, quoting openmpi:
The issue here is that one of the components wasn't well contained and has leaked into the surrounding code via conditional compilation.
Not sure what you mean here, I'm afraid.
As far as I know, Intel MPI doesn't support CUDA at all. I had a quick check with OSU-Micro-Benchmarks-5.6.3-iimpic-2020a.eb and it seems to segfault as expected. MVAPICH probably supports CUDA, but I have no idea how they have structured their packaging.

OPAL and CUDA

I did some quick and dirty tests building OpenMPI, tweaking apart "OPAL_CUDA_SUPPORT" from the rest. So far I think it does work like I expected above; OPAL_CUDA_SUPPORT doesn't actually require CUDA (cuda.h/libcudart.so).
fixed by #15528
Issue

Since 2021a, the support for CUDA-aware MPI communication has changed: rather than using a fosscuda toolchain, we now use a non-CUDA-aware MPI with a CUDA-aware UCX (UCX-CUDA/1.10.0-GCCcore-10.3.0-CUDA-11.3.1). However, I'm seeing various (but not all) OSU tests fail with that setup with a segfault:

Working/failing tests

I haven't run all OSU benchmarks, but a few that run without issues are:

A few that produce the segfault are:

Note that for the osu_latency and osu_bw tests I get results that correspond with what can be expected from GPU Direct RDMA, so they really seem to function properly.
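For reference, the tests were run against device buffers roughly like this (a sketch; binary names come from the OSU module, and the exact collective options may differ per OSU version, so check --help):

```shell
# Point-to-point benchmarks take the buffer types as positional arguments:
# 'D' = device (CUDA) memory, 'H' = host memory.
mpirun -np 2 osu_bw D D
mpirun -np 2 osu_latency D D
# Collective benchmarks select device buffers via the -d/--accelerator option.
mpirun -np 2 osu_allgather -d cuda
```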
Summary of discussion on EasyBuild Slack

This thread on OpenMPI seems to suggest that UCX has limited support for GPU operations (see here) and might only work for pt2pt, but not for collectives (see here). The relevant parts:

The thread is from a while ago though, so it is possible things have changed, but it is unclear in which version (if anything changed at all).

If UCX indeed does not fully support operations on GPU buffers, then we might have to build OpenMPI with GPU support again (currently we build OpenMPI with UCX support, and then build UCX with CUDA support), i.e. build the smcuda BTL again. It's a bit odd, since it seemed OpenMPI wanted to move away from that (and towards using UCX). Plus, it reintroduces the original issue that we prefer not to have a fosscuda toolchain, because that causes duplication of a lot of modules (see the original discussion here). As @Micket suggested on the chat: if it proves to be needed, we could try to build the smcuda BTL separately, as an add-on, like we do for UCX-CUDA.

Open questions:

Does UCX still have this limited support?