Various benchmarks from OSU-Micro-Benchmarks/5.7.1-gompi-2021a-CUDA-11.3.1 segfault when using CUDA buffers #14801

Closed
casparvl opened this issue Jan 20, 2022 · 31 comments

@casparvl (Contributor) commented Jan 20, 2022

Issue

Since 2021a, the support for CUDA-aware MPI communication has changed: rather than using a fosscuda toolchain, we now use a non-CUDA-aware MPI with a CUDA-aware UCX (UCX-CUDA/1.10.0-GCCcore-10.3.0-CUDA-11.3.1). However, I'm seeing various (but not all) OSU tests fail with that setup with a segfault:

==== backtrace (tid:2888500) ====
 0 0x000000000002a160 ucs_debug_print_backtrace()  /tmp/jenkins/build/UCX/1.10.0/GCCcore-10.3.0/ucx-1.10.0/src/ucs/debug/debug.c:656
 1 0x0000000000012b20 .annobin_sigaction.c()  sigaction.c:0
 2 0x000000000016065c __memmove_avx_unaligned_erms()  :0
 3 0x0000000000052aeb non_overlap_copy_content_same_ddt()  opal_datatype_copy.c:0
 4 0x00000000000874b9 ompi_datatype_sndrcv()  ???:0
 5 0x00000000000e2d84 ompi_coll_base_alltoall_intra_pairwise()  ???:0
 6 0x000000000000646c ompi_coll_tuned_alltoall_intra_dec_fixed()  ???:0
 7 0x0000000000089a4f MPI_Alltoall()  ???:0
 8 0x0000000000402d40 main()  ???:0
 9 0x0000000000023493 __libc_start_main()  ???:0
10 0x000000000040318e _start()  ???:0
=================================
[gcn30:2888500] *** Process received signal ***
[gcn30:2888500] Signal: Segmentation fault (11)
[gcn30:2888500] Signal code:  (-6)
[gcn30:2888500] Failing at address: 0xb155002c1334
[gcn30:2888500] [ 0] /lib64/libpthread.so.0(+0x12b20)[0x1464e51a8b20]
[gcn30:2888500] [ 1] /lib64/libc.so.6(+0x16065c)[0x1464e4f3165c]
[gcn30:2888500] [ 2] /sw/arch/Centos8/EB_production/2021/software/OpenMPI/4.1.1-GCC-10.3.0/lib/libopen-pal.so.40(+0x52aeb)[0x1464e485daeb]
[gcn30:2888500] [ 3] /sw/arch/Centos8/EB_production/2021/software/OpenMPI/4.1.1-GCC-10.3.0/lib/libmpi.so.40(ompi_datatype_sndrcv+0x949)[0x1464e72454b9]
[gcn30:2888500] [ 4] /sw/arch/Centos8/EB_production/2021/software/OpenMPI/4.1.1-GCC-10.3.0/lib/libmpi.so.40(ompi_coll_base_alltoall_intra_pairwise+0x174)[0x1464e72a0d84]
[gcn30:2888500] [ 5] /sw/arch/Centos8/EB_production/2021/software/OpenMPI/4.1.1-GCC-10.3.0/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_alltoall_intra_dec_fixed+0x7c)[0x1464d494746c]
[gcn30:2888500] [ 6] /sw/arch/Centos8/EB_production/2021/software/OpenMPI/4.1.1-GCC-10.3.0/lib/libmpi.so.40(MPI_Alltoall+0x15f)[0x1464e7247a4f]
[gcn30:2888500] [ 7] osu_alltoall[0x402d40]
[gcn30:2888500] [ 8] /lib64/libc.so.6(__libc_start_main+0xf3)[0x1464e4df4493]
[gcn30:2888500] [ 9] osu_alltoall[0x40318e]
[gcn30:2888500] *** End of error message ***

Working/failing tests

I haven't run all OSU benchmarks, but a few that run without issues are:

mpirun -np 2 osu_latency -d cuda D D
mpirun -np 2 osu_bw -d cuda D D
mpirun -np 2 osu_bcast -d cuda D D

A few that produce the segfault are:

mpirun -np 2 osu_gather -d cuda D D
mpirun -np 2 osu_alltoall -d cuda D D

Note that for the osu_latency and osu_bw tests I get results that correspond to what can be expected from GPUDirect RDMA, so those really do seem to function properly.
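
A quick way to double-check which CUDA-related transports the UCX build actually provides is the ucx_info tool that ships with UCX (a hedged side check, assuming ucx_info is on the PATH once the UCX-CUDA module is loaded):

# List the transports UCX was built with and keep the CUDA-related ones;
# a working UCX-CUDA build should at least report cuda_copy and cuda_ipc
ucx_info -d | grep -i -e cuda -e gdr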

Summary of discussion on EasyBuild Slack

This thread on OpenMPI seems to suggest that UCX has limited support for GPU operations (see here) and might only work for pt2pt, but not for collectives (see here). The relevant parts:

OK I would like to drag in @jsquyres for confirming this, because my impression from Jeff was that for this case Open MPI could still be CUDA-aware. @bureddy Are you saying collectives won't work but point-to-point could? Interesting...

Yes. it might change in the future when UCX handle all datatypes pack/unpack and collectives

The thread is from a while ago though; things may have changed since, but it is unclear in which version (if anything changed at all).

If UCX indeed does not fully support operations on GPU buffers, then we might have to build OpenMPI with GPU support again (currently we build OpenMPI with UCX support, and then build UCX with CUDA support), i.e. build the smcuda BTL again. That would be a bit odd, since OpenMPI seemed to want to move away from that (and towards using UCX). Plus, it reintroduces the original issue that we prefer not to have a fosscuda toolchain, because that causes duplication of a lot of modules (see the original discussion here). As @Micket suggested on the chat: if it proves to be needed, we could try to build the smcuda BTL separately, as an add-on, like we do for UCX-CUDA.

Open questions:

  • Can others confirm that the tests that fail for me also fail for them?
  • Can we get confirmation from experts (e.g. on OpenMPI / UCX issue tracker) that indeed UCX still has this limited support?
@Micket (Contributor) commented Jan 20, 2022

How is this for e.g. fosscuda/2020b? Crashing by default?
Same crash if one specifies mpirun --mca pml ucx ... ? Does it work with mpirun --mca btl smcuda,....?

@casparvl (Contributor, Author)

How is this for e.g. fosscuda/2020b? Crashing by default?

We don't have that toolchain, but we did have 2020a. I just tested with that, and at least osu_latency and osu_alltoall work.

Same crash if one specifies mpirun --mca pml ucx ... ?

This does not change anything for me: results are the same both for my tests with 2021a and 2020a.

Does it work with mpirun --mca btl smcuda,....?

Not sure what else should be at the ... but: mpirun -np 2 --mca btl smcuda osu_alltoall -d cuda D D works for 2020a. It fails for 2021a because (as expected) it reports the requested component is not found.
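
As a side check (not strictly needed, since the runtime error already tells the story), one can also ask ompi_info which BTL components a given OpenMPI installation actually ships:

# List the BTL components of the currently loaded OpenMPI;
# the CUDA-aware 2020a build should list smcuda, the 2021a build should not
ompi_info | grep "MCA btl"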

@casparvl (Contributor, Author) commented Jan 20, 2022

This OpenMPI FAQ seems to suggest MPI_Alltoall should be supported by UCX.

@Micket (Contributor) commented Jan 20, 2022

This is actually becoming more worrying to me. I thought it was just either UCX's CUDA plugin or OpenMPI's smcuda BTL. But if 2020a + --mca pml ucx is different from 2021a, then I don't know what it is..

@casparvl (Contributor, Author) commented Jan 20, 2022

But if 2020a + --mca pml ucx is different from 2021a, then I don't know what it is..

Yeah, this I also don't understand. Why would using UCX in 2020a make the CUDA stuff work, while it fails in 2021a?

I tried to build an OpenMPI with CUDA the 'traditional' way, so I now have an OpenMPI/4.1.1-GCC-10.3.0-CUDA-11.3.1. I figured that should make the osu_alltoall test work again, just like it does in 2020a. But... same error!

So, I really need someone to reproduce my initial problem. We might be chasing something down that is somehow specific to our system/installation... So if anyone with GPUDirect RDMA-capable hardware can install OSU-Micro-Benchmarks/5.7.1-gompi-2021a-CUDA-11.3.1, try to run mpirun -np 2 --mca pml ucx osu_alltoall -d cuda D D, and let me know if they even see the same error, that would be great :)

@Micket (Contributor) commented Jan 20, 2022

Nothing stops OpenMPI from having an evil #ifdef CUDA inside its core code outside of any plugins, but I really, really hope that isn't the case. That would just totally screw things up for us. It also wouldn't explain why your "conventional" OpenMPI build still fails.

I have already reproduced the error: gather and alltoall crash for me with 2021a + UCX-CUDA. The other three work.
I can't easily check the older toolchains.

I also played around a bit with additional flags

mpirun -np 2 --mca pml ucx -x UCX_TLS=rc,sm,cuda_copy,gdr_copy,cuda_ipc ./osu_latency D D

but the only thing I see is the warning about "gdr_copy", which I suspect is just outdated documentation? It's the uct-cuda plugin nowadays(?). I see the same warning about the nonexistent gdr_copy transport on fosscuda/2020b, where we aren't doing anything funny.

@casparvl (Contributor, Author)

For me, this just works:

mpirun -np 2 --mca pml ucx -x UCX_TLS=rc,sm,cuda_copy,gdr_copy,cuda_ipc osu_latency D D
# OSU MPI-CUDA Latency Test v5.7.1
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size          Latency (us)
0                       0.36
1                       1.49
2                       2.33
4                       2.33
8                       2.32
16                      1.50
32                      2.33
64                      2.49
128                     2.47
256                     2.75
512                     3.00
1024                    4.57
2048                    5.63
4096                    9.19
8192                   15.48
16384                   9.30
32768                   9.33
65536                   9.86
131072                 10.82
262144                 12.38
524288                 14.97
1048576                20.56
2097152                31.49
4194304                53.87

You mean you get warnings there? That might be related to the fact that gdr_copy also requires a kernel extension; maybe you don't have that running on the cluster? Anyway, it should be unrelated to the errors with osu_alltoall and friends...

@Micket (Contributor) commented Jan 20, 2022

Hm, well, all of this CUDA RDMA stuff requires kernel-level support, no?
I run Mellanox OFED 5.4 and have nvidia_peermem (the replacement for the deprecated nv_peer_mem).

[1642700936.729058] [alvis5-01:223304:0]    ucp_context.c:770  UCX  WARN  transport 'gdr_copy' is not available, please use one or more of: cma, cuda, cuda_copy, cuda_ipc, dc, dc_mlx5, dc_x, ib, mm, posix, rc, rc_mlx5, rc_v, rc_verbs, rc_x, self, shm, sm, sysv, tcp, ud, ud_mlx5, ud_v, ud_verbs, ud_x
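
For completeness, a hedged way to check the kernel-side pieces (assuming the standard module names: nvidia_peermem or the older nv_peer_mem for GPUDirect RDMA, and gdrdrv for gdrcopy):

# Check whether the GPUDirect RDMA / gdrcopy kernel modules are loaded
lsmod | grep -E 'nvidia_peermem|nv_peer_mem|gdrdrv'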

@branfosj (Member)

I did some testing and built some extra modules:

  • OSU-Micro-Benchmarks/5.7-gompic-2020b - works
  • OSU-Micro-Benchmarks/5.7.1-gompi-2021a-CUDA-11.3.1 - broken
  • OSU-Micro-Benchmarks/5.8-gompi-2021a-CUDA-11.3.1 - broken
  • OSU-Micro-Benchmarks/5.7.1-gompi-2021b-CUDA-11.4.1 - broken
  • OSU-Micro-Benchmarks/5.8-gompi-2021a-CUDA-11.3.1-v2 - replacing UCX and UCX-CUDA with version 1.11.2 - broken
  • OSU-Micro-Benchmarks/5.8-gompi-2021aa-CUDA-11.3.1 - replacing OpenMPI with one that has CUDA and UCX-CUDA 1.10.0 dependencies - works

@branfosj (Member)

My conclusion from those is that we need OpenMPI to be built with CUDA as a dependency to get CUDA support in OpenMPI, and that a CUDA-aware UCX alone is not sufficient here.

@Micket (Contributor) commented Jan 20, 2022

Those results match up with whether the build has smcuda or not, which leaves the door open for an OMPI_MCA_mca_component_path + mca_btl_smcuda.so solution. However, it could also be something more intrusive, like
https://github.com/open-mpi/ompi/search?q=OPAL_CUDA_SUPPORT
which seems to infect a lot of code.

What I can't make sense of is:

I tried to build an OpenMPI with CUDA the 'traditional' way, so I now have an OpenMPI/4.1.1-GCC-10.3.0-CUDA-11.3.1. I figured that should make the osu_alltoall test work again, just like it does in 2020a. But... same error!

which should have matched OSU-Micro-Benchmarks/5.8-gompi-2021aa-CUDA-11.3.1, yet the error persists for @casparvl? Perhaps just some unrelated mistake was made?

@casparvl (Contributor, Author)

@Micket is absolutely right. I just had another look, and I did something silly. We RPATH everything, and therefore I should have recompiled the OSU benchmarks. However, I forgot to do that and simply loaded the CUDA-aware MPI. That, for obvious reasons, wasn't enough.

I'll recompile OSU and run another test, but I have no doubt it will confirm what @branfosj has found...

@casparvl (Contributor, Author)

The thing that still puzzles me is that this FAQ really seems to suggest a CUDA-aware UCX should be enough... https://www.open-mpi.org/faq/?category=runcuda#mpi-apis-cuda-ucx Once I've rerun the OSU tests, I'll post on the OpenMPI issue tracker to ask for confirmation that the smcuda BTL is really still needed.

@Micket (Contributor) commented Jan 21, 2022

If so, I think it would be interesting to test this with

export OMPI_MCA_mca_component_path=additional/path/to/smcuda

to see if we can expand the component search path that way, cf. @bartoldeman's comments in
#12484

@casparvl (Contributor, Author) commented Jan 21, 2022

It would actually create the possibility of re-adopting the fosscuda toolchain, but then it would be foss + the smcuda BTL. That would be nice, because it would make foss a subtoolchain of fosscuda, and foss-based installations could be reused as dependencies for fosscuda, hence solving the duplication issue we had before...

@casparvl (Contributor, Author) commented Jan 21, 2022

Ok, I can confirm @branfosj's finding that

replacing OpenMPI with one that has CUDA and UCX-CUDA 1.10.0 dependencies - works

indeed works now that I recompiled my OSU benchmarks. Sorry for the confusion there...

@akesandgren (Contributor)

I rather think it would be the case that we just add an OpenMPI-CUDA alongside the UCX-CUDA, and not make it a fosscuda toolchain.

@casparvl (Contributor, Author)

That's also possible. My biggest concern is that for users it was pretty intuitive that if they wanted to use foss with CUDA support, they had to load fosscuda. If in the future they have to use foss plus separate OpenMPI-CUDA and UCX-CUDA modules, that's less intuitive and will definitely lead to more questions on our end.

But I have no strong feelings on this. We can always still define our own fosscuda toolchain as a site, for the convenience of our users.

@casparvl (Contributor, Author)

Ok, I've put in a ticket on the OpenMPI issue tracker to ask the experts what level of CUDA support we should expect from our current solution (non-CUDA-aware OpenMPI + CUDA-aware UCX): open-mpi/ompi#9906. That should help us decide going forward.

@akesandgren (Contributor)

For a user just using the top-level software, like GROMACS, they won't have to care any more than they do today.
For users doing actual development, a "bundle" module that loads the necessary pieces will help, though.
That could be called cuda-dev, for instance, and sit on top of foss or intel or some other compiler...

@Micket (Contributor) commented Jan 21, 2022

I'm with Åke on this; Even if packaging is arguably a bit messier I think it's a small price to pay for not having to duplicate all these dependencies, which is probably our number 1 concern in easybuild right now (processing of thousands of PRs).

@akesandgren This package exists already: buildenv-default-foss-2021a-CUDA-11.3.1.eb

We also discussed whether we should just introduce an empty module that depends on these CUDA things, but it wasn't clear what to actually call it (it can't be named just "CUDA" since that would conflict), so we just opted to depend directly on UCX-CUDA. We can probably let OpenMPI-CUDA depend on UCX-CUDA as well for added convenience.

mental note: we can consider adding a new mpi-ext.h header file in CPATH with this module, redefining the cuda support
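
For reference, a minimal sketch (adapted from the check described in the Open MPI CUDA FAQ) of how an application would consume that header; MPIX_CUDA_AWARE_SUPPORT and MPIX_Query_cuda_support() come from Open MPI's mpi-ext.h:

/* cuda_aware_check.c -- compile with mpicc */
#include <stdio.h>
#include <mpi.h>
#include <mpi-ext.h>   /* provides MPIX_CUDA_AWARE_SUPPORT on Open MPI */

int main(void)
{
#if defined(MPIX_CUDA_AWARE_SUPPORT) && MPIX_CUDA_AWARE_SUPPORT
    printf("Compile time: this MPI library is CUDA-aware.\n");
#else
    printf("Compile time: no (or unknown) CUDA awareness.\n");
#endif
#if defined(MPIX_CUDA_AWARE_SUPPORT)
    /* run-time check: CUDA support can still be disabled at run time */
    printf("Run time: MPIX_Query_cuda_support() = %d\n", MPIX_Query_cuda_support());
#endif
    return 0;
}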

But first we need to verify that the smcuda BTL or similar is what we even need here. If it's not enough, then this whole plan goes out the window anyway.

@casparvl (Contributor, Author)

Even if packaging is arguably a bit messier I think it's a small price to pay for not having to duplicate all these dependencies, which is probably our number 1 concern in easybuild right now (processing of thousands of PRs).

But that's my point: if fosscuda is foss + OpenMPI-CUDA (the smcuda plugin), you don't need to duplicate any dependencies. foss is then simply a subtoolchain of fosscuda, and you can use any foss-built modules as dependencies for fosscuda builds. No duplication of dependencies needed. The only software that you need to compile with fosscuda is the software that really requires an OpenMPI with GPU support.

Or am I missing something here?

@akesandgren (Contributor)

It's a matter of how you define (sub)toolchain(s): a toolchain that sits on top of another would normally require a change in the hierarchy for an HMNS, and would still need hierarchical support in the framework (which fosscuda of course already has).

@casparvl (Contributor, Author) commented Jan 26, 2022

It is now clear from open-mpi/ompi#9906 that indeed OpenMPI without CUDA support + UCX with CUDA support should not be expected to work for collectives. Specifically, functions like these need to know how to handle GPU memory buffers, and thus require compiling with CUDA support on the OpenMPI side. Essentially, this means that support for CUDA-based MPI ops is (partially) broken in 2021a and 2021b.

We have two potential approaches to solve this:

  1. Go back to a fosscuda toolchain like we had for 2020 and earlier. This means reintroducing a lot of the duplicated modules since, if you have a dependency that requires foss (but which itself does not use CUDA, and thus doesn't require a GPU-capable OpenMPI), you still have to rebuild that dependency with fosscuda.

  2. Build a non-CUDA aware OpenMPI module and a CUDA aware OpenMPI module (something like OpenMPI-CUDA, similar to UCX-CUDA), and selectively pull in the MCA components that require CUDA-awareness by setting OMPI_MCA_mca_component_path in the OpenMPI-CUDA module. Open questions regarding this approach:

  • Does the component provided through OMPI_MCA_mca_component_path take priority over the one already present in the base installation?
  • Which components are needed exactly? (coll for sure, but maybe others too?)

One can test the second solution relatively easily by building an OpenMPI/4.1.1-GCC-10.3.0-CUDA-11.3.1 and manually setting OMPI_MCA_mca_component_path.
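
A hedged sketch of such a test (untested; the paths are placeholders for wherever the CUDA-aware build installs its MCA components, and $EBROOTOPENMPI is the usual EasyBuild root of the non-CUDA OpenMPI module):

# Prepend the component directory of the CUDA-aware build to the (colon-separated)
# component search path of the non-CUDA OpenMPI, then rerun a failing benchmark
export OMPI_MCA_mca_component_path="/path/to/OpenMPI-4.1.1-CUDA/lib/openmpi:$EBROOTOPENMPI/lib/openmpi"
ompi_info | grep "MCA btl"               # does smcuda show up now?
mpirun -np 2 osu_alltoall -d cuda D D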

@mustafabar (Contributor) commented Jan 26, 2022

I doubt that the OpenMPI build (foss/2021a) is really CUDA-aware. Based on the Open MPI FAQ, when I run
ompi_info --parsable --all | grep mpi_built_with_cuda_support:value
it returns false.

In the dev version, it seems there is a fair amount of missing CUDA implementation in the Open MPI source.

Also, shouldn't there be an mca_coll_cuda library produced? Currently,

[openmpi]$ ls mca_coll_*
mca_coll_adapt.la       mca_coll_han.la         mca_coll_libnbc.la      mca_coll_self.la        mca_coll_sync.la        
mca_coll_adapt.so       mca_coll_han.so         mca_coll_libnbc.so      mca_coll_self.so        mca_coll_sync.so        
mca_coll_basic.la       mca_coll_inter.la       mca_coll_monitoring.la  mca_coll_sm.la          mca_coll_tuned.la       
mca_coll_basic.so       mca_coll_inter.so       mca_coll_monitoring.so  mca_coll_sm.so          mca_coll_tuned.so  

Hypothetically speaking, there should be one for CUDA.

@Micket (Contributor) commented Jan 26, 2022

@mustafabar yes, I don't think this has ever really been in doubt, since it's not really clear what "CUDA support" entails (see e.g. open-mpi/ompi#7963, which discusses checking UCX as an option).

@bartoldeman was grepping through the OpenMPI codebase and found some conditionally compiled code in OPAL, in opal/datatype/opal_convertor.c:

#if OPAL_CUDA_SUPPORT
                MEMCPY_CUDA( iov[i].iov_base, base_pointer, iov[i].iov_len, pConv );
#else
                MEMCPY( iov[i].iov_base, base_pointer, iov[i].iov_len );
#endif
#if OPAL_CUDA_SUPPORT
#include "opal/datatype/opal_datatype_cuda.h"
#define MEMCPY_CUDA( DST, SRC, BLENGTH, CONVERTOR ) \
    CONVERTOR->cbmemcpy( (DST), (SRC), (BLENGTH), (CONVERTOR) )
#endif
...
#if OPAL_CUDA_SUPPORT
    convertor->cbmemcpy       = &opal_cuda_memcpy;
#endif

Ouch, not good. But I still think there is hope: none of that code actually depends on CUDA itself. It's just a question of whether to check for CUDA support at runtime or not, so I think it would be fine to patch/configure OpenMPI to always look for CUDA by enabling OPAL_CUDA_SUPPORT (and let it fall back to a plain memcpy at runtime), which it already does:

void *opal_cuda_memcpy(void *dest, const void *src, size_t size, opal_convertor_t* convertor)
{
    int res;

    if (!(convertor->flags & CONVERTOR_CUDA)) {
        return memcpy(dest, src, size);
    }

    if (convertor->flags & CONVERTOR_CUDA_ASYNC) {
        res = ftable.gpu_cu_memcpy_async(dest, (void *)src, size, convertor);
    } else {
        res = ftable.gpu_cu_memcpy(dest, (void *)src, size);
    }

    if (res != 0) {
        opal_output(0, "CUDA: Error in cuMemcpy: res=%d, dest=%p, src=%p, size=%d",
                    res, dest, src, (int)size);
        abort();
    } else {
        return dest;
    }
}

The actual CUDA stuff (e.g. populating ftable.gpu_cu_memcpy_async) seems to be done by opal/mca/common/cuda.
So... perhaps there is hope? A simple patch/change to how OpenMPI is built so that OPAL always builds its CUDA-support code, plus OMPI_MCA_mca_component_path pointing to a separate build with libmca_common_cuda.so.
Maybe there is hope? I'm not sure what counts as a "component" here though, so I might be a bit optimistic...

Interestingly, this OPAL code has since actually been moved into mca/common/cuda;
open-mpi/ompi@deb37ac

@mustafabar (Contributor)

OK, perhaps I'm missing some background here. I was alarmed when I saw that the CUDA-aware flag is true and that mca_coll_cuda.so is available with foss/2020a and b.
So I immediately thought that the current OpenMPI under 2021a is not CUDA-aware.
I am worried that even with a workaround that links to some libmca_common_cuda for the CUDA paths, another kind of segfault will happen, for example if the higher-level functions allocate the send/recv buffers on the CPU.

@Micket (Contributor) commented Jan 26, 2022

OK, perhaps I'm missing some background here. I was alarmed when I saw that the CUDA-aware flag is true and that mca_coll_cuda.so is available with foss/2020a and b.

I don't understand, no they are not?

foss/2021a, just like foss/2020b, intel/2020b, etc. etc., are all non-CUDA-aware, just as intended. So far so good.

Then, at a later stage, we add "CUDA support" at runtime, without relying on LIBRARY_PATH (since some might use RPATH), into this existing non-CUDA-aware foss/2021a.
We thought we could do it via the UCX plugin system (https://github.com/easybuilders/easybuild-easyconfigs/blob/develop/easybuild/easyconfigs/u/UCX/UCX-1.10.0-dynamic_modules.patch), but it turns out this is incomplete.
The question remains whether it's possible to do the same here.

@mustafabar (Contributor) commented Jan 26, 2022

Oops, I was actually referring to fosscuda 2020a and b, and that makes a big difference. Sorry.

I think getting CUDA support out of OpenMPI would better be done via a known documented way (such as OpenMPI + CUDA awareness). In theory, the UCX-CUDA plugin should work but how much should Easybuilders be adventurous with it (doing Non-CUDA aware MPI coupled with UCX + CUDA)?
MPI is a UCX user, and the user could mess up some buffers if it doesn't know how to create them correctly.

It is interesting to see if it works with other MPI implementations like Intel or MVAPICH. If I understand correctly, from @branfosj test cases, ucx+cuda only worked when MPI had CUDA awareness.

Disclaimer: I am new to this community and could say nonsense.

@Micket (Contributor) commented Jan 27, 2022

Disclaimer

This first section is mostly me giving some background to my relatively new coworker @mustafabar about why we are in this situation. Skip to the end for technical goodies

I think getting CUDA support out of OpenMPI would better be done via a known documented way (such as OpenMPI + CUDA awareness).

A bit of background for those new to the discussion: this is pretty much strictly a packaging issue.
The reason we went away from this approach is the major downside it brings: having to duplicate the entire software tree into gompic/fosscuda variants. Twice the PRs to review, test, and install. It also means extra work for anyone primarily focused on fosscuda, since you are likely to find all the cool stuff people already contributed to foss, forcing you to do more tedious "porting", pointlessly building the same software twice.
The most recurring problem we maintainers discuss is what we can do to diminish the sheer number of easyconfigs we have to merge (setting new records for the number of new contributions year after year), and removing this forking of toolchains is one such technique. On top of that, CUDA support commonly lags behind compiler versions, causing major delays in constructing toolchains.

Apart from the bug in this thread, I considered this approach a big improvement.
If this turns out to be unfixable, then we have a massive amount of work to do (and the fact that we have renamed CUDAcore -> CUDA is going to cause a huge headache trying to revert).

In theory, the UCX-CUDA plugin should work

Weeeell, we now know there are also (at least) parts of OPAL that need some additional code.

but how much should Easybuilders be adventurous with it (doing Non-CUDA aware MPI coupled with UCX + CUDA)?

Specifically for OpenMPI, changing the packaging structure for MCA isn't very adventurous; rather, that is its intended purpose. Quoting Open MPI:

The Modular Component Architecture (MCA) is the backbone for much of Open MPI's functionality. It is a series of frameworks, components, and modules that are assembled at run-time to create an MPI implementation.

The issue here is that one of the components wasn't well contained and has leaked into the surrounding code via conditional compilation.
Fortunately, this is a clear-cut case: either the MPI_XXX call works, or it immediately segfaults, so I don't think we need to be unsure about whether there are issues left.

MPI is a UCX user, and the user could mess up some buffers if it doesn't know how to create them correctly.

Not sure what you mean here I'm afraid.

It is interesting to see if it works with other MPI implementations like Intel or MVAPICH. If I understand correctly, from @branfosj test cases, ucx+cuda only worked when MPI had CUDA awareness.

As far as I know, Intel MPI doesn't support CUDA at all. I had a quick check with OSU-Micro-Benchmarks-5.6.3-iimpic-2020a.eb and it seems to segfault as expected. MVAPICH probably supports CUDA, but I have no idea how they have structured their packaging.


OPAL and CUDA

I did some quick and dirty tests building OpenMPI, splitting "OPAL_CUDA_SUPPORT" apart from the rest. So far I think it does work as I expected above: OPAL_CUDA_SUPPORT doesn't actually require CUDA (cuda.h/libcudart.so).
I haven't had time to do a full test yet; perhaps there is yet more missing, but otherwise this is at its core just a separation of the "monolithic" configuration flag --cuda into parts, and the rest is just adding the relevant MCA components whenever we want.
Not a single line of C code was required, just poking at the build scripts a bit.
I also gave some feedback on the ompi issue because, IMHO, the changes in OpenMPI 5 are moving further in the wrong direction (ref. open-mpi/ompi#9906 (comment)).

@branfosj (Member) commented Jun 4, 2022

fixed by #15528

branfosj closed this as completed Jun 4, 2022