
Merging CUDA and non CUDA toolchains into one #12484

Closed
Micket opened this issue Mar 29, 2021 · 21 comments

@Micket (Contributor) commented Mar 29, 2021

As has been discussed multiple times in Zoom and in chat: merge fosscuda into foss, and just use versionsuffixes for the CUDA variants of the (relatively few) easyconfigs that do have CUDA bindings.

One thing holding us back has been the CUDA support in MPI, which has kept us from moving to plain foss since MPI is part of the toolchain definition itself. But now, with UCX, it might be possible to get the best of both worlds (I'm going to just dismiss the legacy CUDA stuff in Open MPI and focus on UCX).

We want something that

  1. works regardless of whether RPATH is used or not
  2. can be opted into after foss (without CUDA) is already in place
  3. supports all the RDMA goodies we have today in fosscuda.

Can it be done? Maybe; UCX has an environment variable for the directory it loads its plugins from (ucx_info -f lists all variables).
We could introduce a UCX package (UCX-CUDA maybe?) which shadows the non-CUDA UCX, and start setting that environment variable:

UCX_MODULE_DIR='%(installdir)s/lib/ucx'

and, well, that should be it?

Example UCX-CUDA easyconfig as I envision it:

easyblock = 'ConfigureMake'

name = 'UCX-CUDA'
version = '1.9.0'
local_cudaversion = '11.1.1'
versionsuffix = '-CUDA-%s' % local_cudaversion

homepage = 'http://www.openucx.org/'
description = """Unified Communication X
An open-source production grade communication framework for data centric
and high-performance applications
"""

toolchain = {'name': 'GCCcore', 'version': '10.2.0'}
toolchainopts = {'pic': True}

source_urls = ['https://github.com/openucx/ucx/releases/download/v%(version)s']
sources = ['%(namelower)s-%(version)s.tar.gz']
checksums = ['a7a2c8841dc0d5444088a4373dc9b9cc68dbffcd917c1eba92ca8ed8e5e635fb']

builddependencies = [
    ('binutils', '2.35'),
    ('Autotools', '20200321'),
    ('pkg-config', '0.29.2'),
]

osdependencies = [OS_PKG_IBVERBS_DEV]

dependencies = [
    ('UCX', version),
    ('numactl', '2.0.13'),
    ('CUDAcore', local_cudaversion, '', True),
    ('GDRCopy', '2.1', versionsuffix),
]

configure_cmd = "contrib/configure-release"
configopts = '--enable-optimizations --enable-cma --enable-mt --with-verbs '
configopts += '--without-java --disable-doxygen-doc '
configopts += '--with-cuda=$EBROOTCUDACORE --with-gdrcopy=$EBROOTGDRCOPY '

prebuildopts = 'unset CUDA_CFLAGS && unset LIBS && '
buildopts = 'V=1'

# Not a PATH since we want to replace it, not append to it
modextravars = {
    'UCX_MODULE_DIR': '%(installdir)s/lib/ucx',
}

sanity_check_paths = {
    'files': ['bin/ucx_info', 'bin/ucx_perftest', 'bin/ucx_read_profile'],
    'dirs': ['include', 'lib', 'share']
}

sanity_check_commands = ["ucx_info -d"]

moduleclass = 'lib'

We'd then just let a TensorFlow-2.4.1-foss-2020b-CUDA-11.1.1.eb have a dependency on UCX-CUDA (at least indirectly), which is probably the ugliest thing about this approach.

This would remove the need for gcccuda, gompic and fosscuda; easyconfigs would just use versionsuffixes instead (and optionally depend on UCX-CUDA if they have some MPI parts).

@bartoldeman Is this at all close to the approach you envisioned?

@Micket Micket added the 2021a label Mar 29, 2021
@boegel boegel added this to the 4.x milestone Mar 31, 2021
@bartoldeman (Contributor):

@Micket more-or-less. But I tested it out and ran into a couple of issues:

  1. I found that UCX_MODULE_DIR does not override the default search path for modules; it merely adds a second alternative to $EBROOTUCX/lib/ucx (the first is derived at runtime from the full path of $EBROOTUCX/lib/libucs.so.0). That could of course be patched in the UCX source code.
  2. With RPATH, the single Open MPI is still linked to the main non-CUDA UCX libraries, so libucs.so.0 and co in UCX-CUDA would be there for nothing.
  3. There is however an alternative which is Open MPI centric instead of UCX centric: via OMPI_MCA_mca_component_path you could add a directory with a mca_pml_ucx.so that links to a full UCX-CUDA (see the sketch right after this list). So you'd have this:
  • gompi is a subtoolchain of gompic
  • gompic depends on OpenMPI-CUDA and gompi components
  • OpenMPI-CUDA only has the modified plugins and extends OMPI_MCA_mca_component_path, no libmpi etc.
  • OpenMPI-CUDA depends on OpenMPI (which depends on plain UCX) and UCX-CUDA
    A bit complex though...
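
For illustration, a minimal sketch of what such an OpenMPI-CUDA easyconfig could look like (the names, versions and exact variable handling here are my assumptions, not a tested config):

# Hypothetical OpenMPI-CUDA fragment: it would ship only the CUDA-aware MCA
# plugin(s); libmpi itself still comes from the plain OpenMPI it depends on.
name = 'OpenMPI-CUDA'
version = '4.0.5'
versionsuffix = '-CUDA-11.1.1'

toolchain = {'name': 'GCC', 'version': '10.2.0'}

dependencies = [
    ('OpenMPI', version),                  # plain Open MPI, built against plain UCX
    ('UCX-CUDA', '1.9.0', versionsuffix),  # CUDA-enabled UCX plugins
]

# prepend %(installdir)s/lib/openmpi (holding the CUDA-linked mca_pml_ucx.so)
# to Open MPI's component search path
modextrapaths = {
    'OMPI_MCA_mca_base_component_path': 'lib/openmpi',
}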

@bartoldeman (Contributor):

Perhaps switching the two lines here:
https://github.com/openucx/ucx/blob/0477cce66118f6c9a65b8954878c0ee3a33b5035/src/ucs/sys/module.c#L121
makes a difference, will check...

@Micket (Contributor, Author) commented Mar 31, 2021

Aren't 1. and 2. basically the same issue? Were we to patch the search order, I suspect that would solve the RPATH issue as well?
I was a bit lazy here and just reused the entire UCX build, but I'm really only after the new directory with UCX modules; so one could do the same as you suggest for OpenMPI-CUDA.

Regarding 3., redefining gompi as a subtoolchain sounds a bit scary; can we even do that without wreaking havoc on all previous toolchain versions?

@bartoldeman (Contributor):

I'm actually not even sure if the non-CUDA UCX can be convinced to see the CUDA plugins... will need to test some more.
The Open MPI approach via OMPI_MCA_mca_component_path works for sure (tested).

@akesandgren (Contributor):

Unfortunately that won't be enough in the long run, when we get other things that use UCX directly and would then need the UCX-CUDA version...

@Micket (Contributor, Author) commented Mar 31, 2021

We would definitely have a "UCX-CUDA" even with the OMPI_MCA_mca_component_path approach, and I don't think there would be an issue with depending on that directly, if necessary, with either approach.

@bartoldeman (Contributor) commented Mar 31, 2021

Upon further investigation the UCX plugin architecture is not flexible enough for our purpose. The reason is that the list of plugins to load is set at configure time: for us this is:

#define uct_MODULES ":ib:rdmacm:cma:knem"

in config.h for non-CUDA and

#define uct_MODULES ":cuda:ib:rdmacm:cma:knem"

for CUDA. UCX parses this list to figure out which plugins to load.

Open MPI's plugin architecture is more flexible, though messing about with OMPI_MCA_mca_component_path would be a novelty in modules as far as I know, so tread carefully...

@bartoldeman (Contributor):

Note about Intel MPI: this one goes via libfabric, which is flexible enough (via FI_PROVIDER_PATH).
Note that we (with RPATH) already patch it a little in a hook that modifies postinstallcmds, adding this:

patchelf --set-rpath $EBROOTUCX/lib --force-rpath %(installdir)s/intel64/libfabric/lib/prov/libmlx-fi.so

We'd need to have two libmlx-fi.so copies: one linking to the non-CUDA UCX and one to the CUDA UCX (if the latter works properly at all).
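
To make that concrete, a rough sketch of what such a hook fragment could look like (this is not the actual hook; the extra -cuda provider directory and its use via FI_PROVIDER_PATH are placeholders of mine):

# Hypothetical parse-hook sketch: keep the existing patchelf tweak for the stock
# libfabric mlx provider, and add a second copy linked against UCX-CUDA that a
# CUDA-enabled module could then expose via FI_PROVIDER_PATH.
def parse_hook(ec, *args, **kwargs):
    if ec.name == 'impi':
        prov = '%(installdir)s/intel64/libfabric/lib/prov'
        ec['postinstallcmds'].extend([
            # existing tweak: point the stock provider at the non-CUDA UCX
            "patchelf --set-rpath $EBROOTUCX/lib --force-rpath %s/libmlx-fi.so" % prov,
            # hypothetical addition: a copy linked against UCX-CUDA in a separate dir
            "mkdir -p %s-cuda && cp %s/libmlx-fi.so %s-cuda/" % (prov, prov, prov),
            "patchelf --set-rpath $EBROOTUCXMINCUDA/lib --force-rpath %s-cuda/libmlx-fi.so" % prov,
        ])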

@Micket (Contributor, Author) commented Mar 31, 2021

So

  0. expand OpenMPI to set OMPI_MCA_mca_component_path
  1. UCX-CUDA (under GCCcore) like above (minus the pointless MODULE_DIR)
  2. OpenMPI-CUDA (under gompi) that just contains one MCA library + OMPI_MCA_mca_component_path.
  3. impi-CUDA (under iimpi?) that just contains libmlx-fi.so + FI_PROVIDER_PATH

@boegel (Member) commented Apr 1, 2021

Why does OpenMPI need to set OMPI_MCA_mca_component_path? (step 0)
Maybe I'm missing something...

@bartoldeman (Contributor):

@boegel no, step 0 isn't necessary; it would only be slightly cleaner if you want OpenMPI-CUDA to prepend to an existing value.
Step 2 could set

OMPI_MCA_mca_base_component_path="$EBROOTOPENMPIMINUSCUDA/lib/openmpi:$EBROOTOPENMPI/lib/openmpi:$HOME/.openmpi/components"

or, reading the source code at https://github.com/open-mpi/ompi/blob/92389c364df669822bb6d72de616c8ccf95b891c/opal/mca/base/mca_base_component_repository.c#L215, this is also possible:

OMPI_MCA_mca_base_component_path="$EBROOTOPENMPIMINUSCUDA/lib/openmpi:SYSTEM_DEFAULT:USER_DEFAULT"

@Micket (Contributor, Author) commented Jun 28, 2021

So we have a working UCX + CUDA split now. We just need to decide on a workflow for how to put the easyconfigs on top of it. Some loose suggestions have been floating around, but nothing concrete. I can only recall hearing objections to all approaches, so I doubt we can please everyone.
The suggestions I have myself, or have seen others make, are:

  1. The simplest approach we could take is to just depend on CUDAcore, or on UCX-CUDA if the software has MPI support. The CUDA variants of software get a versionsuffix = "-CUDA-%(cudaver)s", e.g. GROMACS-2021.3-foss-2021a.eb and GROMACS-2021.3-foss-2021a-CUDA-11.3.1.eb (a concrete sketch follows below). For Intel MPI, I don't think it supports UCX-CUDA regardless, so there they could just depend on CUDAcore. The test suite would ensure we don't mix multiple different versions. Upside: simplicity; it also mirrors what we do at the GCCcore level with UCX-CUDA and NCCL. Downside: UCX-CUDA might be a bit obscure and people might not know to depend on it (perhaps a test suite check would suffice to remedy that).

  2. Add a trivial bundle "CUDA-11.3.1-GCCcore-10.3.0.eb" that depends on UCX-CUDA and maybe binutils. Then just depend on that and use a versionsuffix just like before. Upside: perhaps a simpler dependency name to remember for those who don't realize what the purpose of UCX-CUDA is. Downside: forces UCX-CUDA on everything, even software that doesn't need UCX/MPI, especially on the Intel side, which can't even use UCX-CUDA (yet?).

  3. Add a "CUDA" package that expands the module path somewhere, so it looks more like an HMNS. Of course, the whole plan was to be able to reuse the non-CUDA foss dependencies here, so it would somehow need to sit below foss, and I don't know what sort of mess we'd have to create to expand the module path depending on what other modules you have loaded. We'd have to mess about with toolchain definitions and such. Upside: adds a hierarchy for those who like that. Downside: I think it would create a fair bit of complexity.

For options 2 and 3 we should probably also watch out for how the name CUDA would interact with the naming schemes and install locations.
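
To make option 1 concrete, a hypothetical sketch of the relevant bits of a GROMACS-2021.3-foss-2021a-CUDA-11.3.1.eb (versions and dependency details are illustrative, not taken from a finished easyconfig):

name = 'GROMACS'
version = '2021.3'
versionsuffix = '-CUDA-%(cudaver)s'

toolchain = {'name': 'foss', 'version': '2021a'}

dependencies = [
    # CUDA (or CUDAcore) as an ordinary dependency provides the GPU support ...
    ('CUDA', '11.3.1', '', SYSTEM),
    # ... and UCX-CUDA adds the CUDA-aware UCX plugins, since GROMACS uses MPI
    ('UCX-CUDA', '1.10.0', versionsuffix),
    # (plus the usual GROMACS dependencies)
]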

@akesandgren (Contributor):

As of Intel MPI 2019.6 (5 actually but there it is just a tech preview) it requires UCX (https://software.intel.com/content/www/us/en/develop/articles/improve-performance-and-stability-with-intel-mpi-library-on-infiniband.html), so from that point of view we should use UCX-CUDA just to make them identical.

I think option 1 is the cleanest one, and it also makes it clearer for users which module they want when they look for CUDA enabled stuff.

I don't really like the -CUDA-x.y versionsuffix myself but I'd still go for it.

For option 2, shouldn't that be one CUDA-x.y-GCCcore-x.y which depends on CUDAcore and one CUDA-x.y-gompi-z that depends on UCX-CUDA? I.e., depending on what level the toolchain is at, it pulls in the required CUDAcore or UCX-CUDA?

@akesandgren (Contributor):

Hmmm, for option 1 we probably need to make accommodations in tools/module_naming_scheme/hierarchical_mns.py so that CUDA does not change the module path in this case...

@Micket (Contributor, Author) commented Jun 29, 2021

> As of Intel MPI 2019.6 (5 actually but there it is just a tech preview) it requires UCX (https://software.intel.com/content/www/us/en/develop/articles/improve-performance-and-stability-with-intel-mpi-library-on-infiniband.html), so from that point of view we should use UCX-CUDA just to make them identical.

Yes, we will/already do depend on UCX via impi; the question is just whether or not to add the ucx-cuda plugins and GPUDirect when you make a TensorFlow-3.4.5-intel-2021a-CUDA-11.3.1.eb, but if impi can't use those features, why add them?
Not that there is much harm in adding them, just 2 extra modules to load.

> For option 2, shouldn't that be one CUDA-x.y-GCCcore-x.y which depends on CUDAcore and one CUDA-x.y-gompi-z that depends on UCX-CUDA? I.e., depending on what level the toolchain is at, it pulls in the required CUDAcore or UCX-CUDA?

I don't think that does anything useful unless you want to expand modulepaths like in option 3 and define a bunch of toolchain stuff on top of gcc/gompi/foss (so that we can still use dependencies from these levels). Otherwise, they would all just depend on the same UCX-CUDA and nothing else.

> Hmmm, for option 1 we probably need to make accommodations in tools/module_naming_scheme/hierarchical_mns.py so that CUDA does not change the module path in this case...

When CUDA/CUDAcore is used as an ordinary dependency (not part of a toolchain), this doesn't happen. I managed to build #13282 without modifications and it ends up in modules/all/MPI/GCC/10.3.0/OpenMPI/4.1.1/OSU-Micro-Benchmarks/5.7.1-CUDA-11.3.1.lua as expected (matching option 1 presented here).

@casparvl (Contributor) commented Jul 7, 2021

Seeing #13282, solution 1 actually leads to pretty clean easyconfigs: one dep for GPU support, and another dep for GPU communication support. It's also flexible: if, for some reason, a different CUDA version is required for a specific easyconfig, that can easily be done, because CUDA is not a dependency of some very low-level toolchain. One would only have to install a new CUDAcore and (if relevant) UCX-CUDA.

But let me see if I get this right: there will then essentially be both a non-CUDA UCX (from gompi) in the path, as well as a UCX with CUDA support (from UCX-CUDA). So we essentially rely on the path order to have it pick up the relevant one? Or does UCX-CUDA now only install the plugins, but still use the UCX module for the base UCX?

I guess in the first case the RPATH issue that @bartoldeman mentioned will still be a problem, but in the second case it should essentially be resolved, right?

All in all, I'm in favour of solution 1. Solution 2 is 'convenient' but indeed a bit dirty in that UCX gets pulled in even for non-MPI software. I don't think it's hard to check PRs for sanity in scenario 1, so if all maintainers know about this approach, it should be OK. Essentially, the only check that needs to be done is: if it's an MPI-capable toolchain and includes CUDA as a dep, it has to also include UCX-CUDA. Maybe a check like that could even be automated...
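
Something along those lines could be as simple as the following sketch (entirely hypothetical: the toolchain list, helper name and easyconfig-as-dict access are assumptions of mine, not existing test-suite code):

# Hypothetical sanity check: an easyconfig using an MPI-capable toolchain that
# depends on CUDA should also depend on UCX-CUDA.
MPI_TOOLCHAINS = {'gompi', 'foss', 'iimpi', 'intel'}  # assumed list, not exhaustive

def check_ucx_cuda(ec):
    """Return a warning string if UCX-CUDA is missing, None otherwise."""
    dep_names = {dep[0] for dep in ec.get('dependencies', [])}
    uses_cuda = bool(dep_names & {'CUDA', 'CUDAcore'})
    if ec['toolchain']['name'] in MPI_TOOLCHAINS and uses_cuda and 'UCX-CUDA' not in dep_names:
        return "%s: CUDA dependency without UCX-CUDA in an MPI-capable toolchain" % ec['name']
    return None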

@Micket (Contributor, Author) commented Jul 14, 2021

@casparvl

> Or does UCX-CUDA now only install the plugins, but still use the UCX module for the base UCX?

It does this: UCX-CUDA depends on UCX, and this is actually all merged already. No RPATH problems, as it relies on UCX_MODULE_DIR.

And just because a piece of software depends on MPI and CUDA doesn't mean it necessarily needs UCX-CUDA; I think it still needs special directives to use the GPUDirect/RDMA stuff. But there is no harm in always enabling the support in UCX.

@casparvl (Contributor) commented Jul 14, 2021

I guess you're right.

As far as I know, the use of GPUDirect requires the MPI_* routines to be called with a GPU pointer as the send/receive buffer, instead of calling cudaMemcpy to copy from a GPU to a CPU buffer and then calling your MPI_* routine on the CPU buffer. But it's probably harder to find out whether software has such GPUDirect support than to just make sure the support is enabled in UCX :)
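
For illustration only (not tied to any easyconfig here): with a CUDA-aware MPI the send/receive buffer can simply live on the GPU, e.g. with mpi4py (>= 3.1) and CuPy, which can hand device buffers straight to MPI:

# Toy GPU-direct style exchange: the buffers are CuPy (device) arrays, so no
# explicit cudaMemcpy to a host staging buffer is needed. Requires a CUDA-aware MPI.
from mpi4py import MPI
import cupy as cp

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

buf = cp.full(1024, rank, dtype=cp.float64)   # buffer allocated on the GPU
cp.cuda.get_current_stream().synchronize()    # make sure the fill has completed

if rank == 0:
    comm.Send(buf, dest=1, tag=7)    # MPI is handed a device pointer directly
elif rank == 1:
    comm.Recv(buf, source=0, tag=7)  # received straight into GPU memory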

> It does this: UCX-CUDA depends on UCX, and this is actually all merged already. No RPATH problems, as it relies on UCX_MODULE_DIR.

This is cool btw. I'd seen the PR, but wasn't sure I understood correctly what was going on. I suspected this was the case though, nice solution.

@branfosj (Member) commented Aug 4, 2021

I am happy with approach 1. I've done some testing of this but with CUDA 11.2.x (due to the NVIDIA driver version I have available).

@boegel (Member) commented Aug 18, 2021

@Micket I think we can close this, since from foss/2021a onwards we now have a better approach to support software that requires a GPU-aware OpenMPI via UCX-CUDA (cfr. #13260), so we effectively don't have a fosscuda anymore?

@Micket (Contributor, Author) commented Aug 18, 2021

Sure. There will likely be some surprises going forward when we actually start adding stuff, but I think we can sort them out.

@Micket Micket closed this as completed Aug 18, 2021