
Add gcccorecuda toolchain. #3385

Closed · bartoldeman wants to merge 6 commits
Conversation

@bartoldeman (Contributor)

This is a new toolchain combining GCCcore and CUDA.
gcccuda and iccifortcuda then needed to be modified to let them
know that gcccorecuda is an optional subtoolchain.

@bartoldeman (Contributor, Author)

We need this for UCX with CUDA support, unless UCX is compiled with gcccuda or iccifortcuda instead.

@mboisson (Contributor) commented Jul 9, 2020

@Micket is probably interested in this

@Micket added the new label · Jul 10, 2020


class GccCUDA(GccToolchain, Cuda):
    """Compiler toolchain with GCC and CUDA."""
    NAME = 'gcccuda'

    COMPILER_MODULE_NAME = ['GCC', 'CUDA']
-   SUBTOOLCHAIN = GccToolchain.NAME
+   SUBTOOLCHAIN = [GccToolchain.NAME, GCCcoreCUDA.NAME]
Contributor:

Do we need to be worried about all the older toolchains, i.e. will it work if gcccorecuda is missing for 2019b?

Contributor:

Only if recipes try to use it. Unless someone adds a recipe like https://github.com/easybuilders/easybuild-easyconfigs/blob/cf729049d57ef84428077841321f23c3d5cc4131/easybuild/easyconfigs/o/OpenMPI/OpenMPI-4.0.3-gcccuda-2020a.eb for 2019b, it won't matter that the subtoolchain is not defined for that version.

@bartoldeman (Contributor, Author):

gcccorecuda.py has OPTIONAL = True, so it will be ignored for 2019b.

The same mechanism is used for e.g. golf, gmkl and iimkl already, and even for gcccore, as older toolchains (talking about 2015a and earlier) didn't have that yet.
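
For readers following along: a rough sketch of what a gcccorecuda toolchain definition along these lines might look like, mirroring the gcccuda class in the diff above. The import paths and the GCCcore toolchain class name here are assumptions for illustration, not the actual contents of the PR.

# Hypothetical sketch of easybuild/toolchains/gcccorecuda.py; the imports and
# the GCCcoreToolchain name are assumed here rather than taken from the PR.
from easybuild.toolchains.compiler.cuda import Cuda
from easybuild.toolchains.gcccore import GCCcoreToolchain


class GCCcoreCUDA(GCCcoreToolchain, Cuda):
    """Compiler toolchain with GCCcore and CUDAcore."""
    NAME = 'gcccorecuda'

    COMPILER_MODULE_NAME = ['GCCcore', 'CUDAcore']
    SUBTOOLCHAIN = GCCcoreToolchain.NAME

    # older toolchain generations (2019b and before) have no gcccorecuda,
    # so it is marked optional to keep them resolving correctly
    OPTIONAL = True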

@Micket (Contributor) commented Jul 13, 2020

I tried to use this. I made a whole clean test tree for EB and sneakily just copied over the 3 files.

== found valid index for /apps/Test/software/Core/EasyBuild/4.2.2/easybuild/easyconfigs, so using it...
ERROR: Failed to process easyconfig /local/EB/build/eb-6aeetmhc/files_pr10935/g/gompic/gompic-2020a.eb: Failed to process easyconfig /local/EB/build/eb-6aeetmhc/files_pr10935/o/OpenMPI/OpenMPI-4.0.3-gcccuda-2020a.eb: Failed to process easyconfig /local/EB/build/eb-6aeetmhc/files_pr10935/u/UCX/UCX-1.8.0-gcccorecuda-2020a.eb: Unknown set of toolchain compilers, module naming scheme needs work: dict_keys(['GCCcore', 'CUDAcore'])

did I mess up or is there still something missing?

@mboisson (Contributor)

> I tried to use this. I made a whole clean test tree for EB and sneakily just copied over the 3 files.
>
> == found valid index for /apps/Test/software/Core/EasyBuild/4.2.2/easybuild/easyconfigs, so using it...
> ERROR: Failed to process easyconfig /local/EB/build/eb-6aeetmhc/files_pr10935/g/gompic/gompic-2020a.eb: Failed to process easyconfig /local/EB/build/eb-6aeetmhc/files_pr10935/o/OpenMPI/OpenMPI-4.0.3-gcccuda-2020a.eb: Failed to process easyconfig /local/EB/build/eb-6aeetmhc/files_pr10935/u/UCX/UCX-1.8.0-gcccorecuda-2020a.eb: Unknown set of toolchain compilers, module naming scheme needs work: dict_keys(['GCCcore', 'CUDAcore'])
>
> did I mess up or is there still something missing?

Which files did you copy?

@mboisson (Contributor)

For gompic-2020a.eb, you will need :
CUDAcore-11.0.2.eb
CUDA-11.0.2-GCC-9.3.0.eb
OpenMPI-4.0.3-gcccuda-2020a.eb
UCX-1.8.0-gcccorecuda-2020a.eb
gcccuda-2020a.eb
gcccorecuda-2020a.eb

I think

@Micket (Contributor) commented Jul 13, 2020

The 3 files from this PR. This PR is missing an important change to (at least?) hierarchical_mns.py:

https://github.com/easybuilders/easybuild-framework/blob/develop/easybuild/tools/module_naming_scheme/hierarchical_mns.py#L53-L69

(GCCcore + CUDA needs to be defined there)

edit: Presumably
'CUDA,GCCcore': ('GCCcore-CUDA', '%(GCCcore)s-%(CUDA)s'),

though, this raises the question: should GCCcoreCUDA be

COMPILER_MODULE_NAME = ['GCCcore', 'CUDAcore']

or

COMPILER_MODULE_NAME = ['GCCcore', 'CUDA']

@bartoldeman (Contributor, Author)

ah good point about the HMNS. I'll work on a test case for that.

Uses:
'CUDAcore,GCCcore' : ('GCCcore-CUDAcore', '%(GCCcore)s-%(CUDAcore)s'),
so the compiler in HMNS is GCCcore-CUDAcore.
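
For context, an entry like this would slot into the compiler name/version template mapping in hierarchical_mns.py linked earlier. A minimal sketch, assuming that dict; the surrounding entries are shown for orientation only, and only the gcccorecuda entry comes from this commit:

# hierarchical_mns.py (sketch); existing entries are illustrative
COMP_NAME_VERSION_TEMPLATES = {
    'icc,ifort': ('intel', '%(icc)s'),
    'CUDA,GCC': ('GCC-CUDA', '%(GCC)s-%(CUDA)s'),
    # new entry so GCCcore + CUDAcore maps to a Compiler/GCCcore-CUDAcore level:
    'CUDAcore,GCCcore': ('GCCcore-CUDAcore', '%(GCCcore)s-%(CUDAcore)s'),
}
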
@bartoldeman (Contributor, Author)

Added support for HMNS now (using Compiler/GCCcore-CUDAcore)

@easybuilders deleted three comments from boegelbot · Jul 23, 2020
@bartoldeman (Contributor, Author)

Tests finally pass now that #3392 is merged.

@Micket (Contributor) commented Jul 27, 2020

I'm trying to think this through really, really carefully.

  1. The reason why we must have a CUDA module at the compiler levels is that it is what contains the crucial "MODULEPATH" extension for HMNS.
  2. The good thing with that is that a user can do module load GCC/9.3.0 and then module load CUDA and get to the GCC-CUDA modules for that compiler. They could also do module load gcccuda/2019b for convenience, but that module isn't actually needed at all.
  3. The bad thing with this approach is that it's really quite cryptic, and we have to add these special rules to the naming scheme based on the dependencies of a toolchain. The alternative approach, i.e. the toolchain loading its path itself, would be cleaner, though it would not let the user just "add" CUDA or OpenMPI, but rather force them to load the "gcccuda" or "foss" modules.

Side-note: personally, I would have strongly preferred it if EB had just always used the alternative approach in 3. I think it would have been much, much cleaner. We would, for example, not have had the issue with split icc + ifort modules, and we wouldn't have had to introduce these "ugly" CUDAcore modules, or any extra modules at all. There would just be 1 GCC, 1 CUDA, and 1 gcccuda (which contains the GCC-CUDA modulepath).

But now, where does GCCcoreCUDA fall? In this PR, there is no "CUDA" at the GCCcore level; it just directly depends on CUDAcore (at system level), so this can't work like 2. I can't quite tell where we define which module gets to expand the MODULEPATH, but there is only 1 module it could be placed in, and that's gcccorecuda?

So, we have 2 alternatives

  1. gcccorecuda is the exception to 2 (and effectively works like 3, though still using a custom HMNS rule, so it's something in between)
  2. we have to introduce the CUDA module at GCCcore as well, and have gcccorecuda depend on that instead.

@mboisson (Contributor)

> The good thing with that is that a user can do module load GCC/9.3.0 and then module load CUDA and get to the GCC-CUDA modules for that compiler. They could also do module load gcccuda/2019b for convenience, but that module isn't actually needed at all.

In my opinion, the gcccuda/2019b module (and other toolchain modules) is actually harmful, not convenient. There is absolutely no indication of what "2019b" means from the point of view of users. They know versions of GCC and versions of CUDA.

> The bad thing with this approach is that it's really quite cryptic, and we have to add these special rules to the naming scheme based on the dependencies of a toolchain. The alternative approach, i.e. the toolchain loading its path itself, would be cleaner, though it would not let the user just "add" CUDA or OpenMPI, but rather force them to load the "gcccuda" or "foss" modules.
>
> Side-note: personally, I would have strongly preferred it if EB had just always used the alternative approach in 3. I think it would have been much, much cleaner. We would, for example, not have had the issue with split icc + ifort modules, and we wouldn't have had to introduce these "ugly" CUDAcore modules, or any extra modules at all. There would just be 1 GCC, 1 CUDA, and 1 gcccuda (which contains the GCC-CUDA modulepath).
>
> But now, where does GCCcoreCUDA fall? In this PR, there is no "CUDA" at the GCCcore level; it just directly depends on CUDAcore (at system level), so this can't work like 2. I can't quite tell where we define which module gets to expand the MODULEPATH, but there is only 1 module it could be placed in, and that's gcccorecuda?

But it does work like 2 (users load GCC and then load CUDA). We're running this in production.

> So, we have 2 alternatives
>
>   1. gcccorecuda is the exception to 2 (and effectively works like 3, though still using a custom HMNS rule, so it's something in between)
>   2. we have to introduce the CUDA module at GCCcore as well, and have gcccorecuda depend on that instead.

I don't have a clear enough picture of the problem to comment, so I will let Bart do it, but know that it does work, as we have been using this for about a year within our stack.

@Micket (Contributor) commented Jul 27, 2020

> In my opinion, the gcccuda/2019b module (and other toolchain modules) is actually harmful, not convenient.

Sounds like a problem that can be solved with a tiny bit of documentation and a module show. And the version string isn't at all about what version of GCC or OpenMPI is used (something I suspect basically none of my users are the least bit interested in). It's a snapshot in time of what set of library and software versions we build, e.g. roughly what version of OpenFOAM, TensorFlow, Biopython, etc. one can expect to find.
And for the rest who care about some particular version, they do a module spider and just load whatever toolchain components Lmod tells them. None of them would have had any idea what version of GCC+OpenMPI+CUDA to load in order to access the latest TensorFlow (which is the only thing the user cares about).
And for those who just want the compiler? Why, it's still there; you can still just do a module load GCC/9.3.0.

> But it does work like 2 (users load GCC and then load CUDA).

Yes. That's how it works for gcccuda. But not with this PR: gcccorecuda has no CUDA module to call its own; it directly depends on CUDAcore, so there is nowhere to put the MODULEPATH extension, which is required for HMNS.

> We're running this in production.

With HMNS?

The only reason why we have to introduce a CUDAcore at all is the need to have a unique CUDA under GCC so that it can extend the MODULEPATH for HMNS when loaded. If one e.g. has a flat module naming scheme, then there are no paths to modify, so one can of course get away with anything.

@mboisson (Contributor) commented Jul 27, 2020 via email

@Micket (Contributor) commented Jul 27, 2020

So, I had a sift through the ComputeCanada naming scheme:

  1. It's not HMNS, it's a subclass of HMNS that changes almost everything, especially all the parts related to CUDA and GCCcore, for which there are several special conditions, none of which are present in this PR.
  2. One of the special hacks seems to be to not create the subdirectory when gcccorecuda is used: https://github.com/ComputeCanada/easybuild-computecanada-config/blob/master/SoftCCHierarchicalMNS.py#L146-L148
    So the gcccorecuda toolchain doesn't have its own subdirectory; it's tied to just the CUDA version, and then this directory is directly tied to CUDAcore, which is given the task of containing this (GCCcore-independent) MODULEPATH:
    https://github.com/ComputeCanada/easybuild-computecanada-config/blob/b4c723d40eccb999c8bae8a33f72261f600c1d2a/SoftCCHierarchicalMNS.py#L240-L241
    And, of course, this only works if we always have new CUDA versions for each gcccorecuda, so that there are no conflicts, which, so far, has luckily been the case: https://github.com/ComputeCanada/easybuild-easyconfigs/tree/computecanada-master/easybuild/easyconfigs/g/gcccorecuda

None of this would work here. There must be a module that contains the MODULEPATH extension.

  1. I don't think it can be at CUDAcore as has been done at ComputeCanada. It would at minimum require the introduction of the CUDAcore "toolchain", and hoping that we never ever reuse a CUDA version across 2 toolchain releases, because that would break things.
  2. We can't introduce a new CUDA under GCCcore for use with gcccorecuda, because that would mean fosscuda would need to depend on different CUDA modules from both gcccuda and gcccorecuda, and since those modules would conflict, only one of their MODULEPATHs would be added (and fosscuda should have both).
  3. gcccorecuda itself could contain the MODULEPATH extension.

@mboisson (Contributor) commented Jul 27, 2020

Sorry, I meant we use an HMNS. It is definitely more complicated (and more complete) than the upstream HMNS.

A fundamental design choice in our stack is to use subtoolchains to maximize reusability. This means having more branches of the hierarchy, as we will reuse the same GCC/Intel for multiple versions of OpenMPI, or multiple versions of CUDA. We don't necessarily pick one "2020a" and stick with it until we change everything 6 months later. This is particularly true of CUDA, since it changes more frequently than we change toolchains (roughly every 2 years). In 4 years, we have extensively used only 3 versions of GCC (5.4.0, 7.3.0, 9.3.0), Intel (2016.4, 2018.3, 2020.1) and OpenMPI (2.1.1, 3.0.2, 4.0.1), but many more versions of CUDA (7.5, 8.0, 9.0, 9.2, 10.0, 10.1, 10.2, 11.0).

I looked at the upstream HMNS; how exactly does it handle CUDA in the hierarchy? I don't see any special case for CUDA like there is for MPI.

@bartoldeman (Contributor, Author)

I'm taking time off, so I can just add some comments here and there, but here it goes:
@mboisson the upstream module path is like Compiler/GCC-CUDA/8.3.0-10.1.243 where CC would use CUDA/gcc8/cuda10.1; it looks very different, but I believe there is a one-to-one mapping between them here. What is very different though is the CC mapping of Core to a single path. To do this properly upstream, I believe that CUDAcore would need to live at the GCCcore level, so that module load GCC CUDA (where CUDA has CUDAcore as a dependency) will get the CUDAcore module to extend MODULEPATH correctly, to include Compiler/GCCcore-CUDAcore/9.3.0-11.0.2, under which UCX can live.

About CUDAcore in general: I know from past comments that @boegel is not a big fan of the *core concept, including GCCcore.
A more radical solution is to have multiple modules for a single software installation: CUDA is installed at some unique location without a module, and all CUDA modules at different locations refer to exactly that same CUDA installation; similarly, UCX can be compiled with gcccuda but still be usable with Intel, using 1 UCX installation but 2 modulefiles. But breaking the one-to-one mapping between module and software installation path is quite a radical change indeed!

@ocaisa (Member) commented Jul 28, 2020

So, there is an Lmod feature that may help us here: inherit(). That would allow us to get around the one-name rule: you would have one actual installation, with CUDA at other levels inheriting from that and only extending the modulepath in other directions.

In that case, if you have CUDA at the core level loaded, loading a compiler like GCC would cause a reload of CUDA which would also do additional extensions to the module path (one for gcccorecuda, one for gcccuda). Thinking about it, the CUDA modules that inherit from the base CUDA module should probably be created when installing the toolchain. This is pretty weird behaviour, since a single easyconfig will create two modules, one for the toolchain and one for CUDA (but only in an HMNS)...I wonder if we can make that part of the naming scheme rather than try to bake it into EB directly.

@Micket (Contributor) commented Jul 28, 2020

> I looked at the upstream HMNS; how exactly does it handle CUDA in the hierarchy? I don't see any special case for CUDA like there is for MPI.

So, part of it is here
https://github.com/easybuilders/easybuild-framework/blob/develop/easybuild/tools/module_naming_scheme/hierarchical_mns.py#L60-L66
as well as checking for the CUDA name here:
https://github.com/easybuilders/easybuild-framework/blob/develop/easybuild/tools/module_naming_scheme/hierarchical_mns.py#L180-L182
and the subsequent logic puts the MODULEPATH extension into the CUDA module here:
https://github.com/easybuilders/easybuild-framework/blob/develop/easybuild/tools/module_naming_scheme/hierarchical_mns.py#L224

@ocaisa I'm not super keen on yet more "hacks". I just feel that putting the MODULEPATH extension into gcccorecuda is the only way to go. This just solves all problems without any need for any special hacks.
We just need to have a unique module that can extend the MODULEPATH. gcccorecuda will be unique.
Of course, everything would be even simpler and way cleaner if HMNS did this for gcccuda/iccifortcuda as well. Then we could just get rid of this CUDAcore stuff, have CUDA at the core level, and just depend on it.
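
To make the rule being debated concrete, here is a heavily simplified sketch (not the framework's actual API; the function name and hardcoded versions are made up for illustration) of which module gets to prepend a new Compiler/... level to $MODULEPATH:

# Illustrative only: upstream HMNS lets the CUDA module on top of a compiler
# open the Compiler/<comp>-CUDA level; the suggestion above is to let the
# gcccorecuda toolchain module itself open the GCCcore-CUDAcore level.
def modpath_extension_sketch(ec_name, ec_version, comp_name, comp_version):
    """Return the $MODULEPATH subdirectory a module should add, or None."""
    if ec_name == 'CUDA':
        # e.g. Compiler/GCC-CUDA/8.3.0-10.1.243
        return 'Compiler/%s-CUDA/%s-%s' % (comp_name, comp_version, ec_version)
    if ec_name == 'gcccorecuda':
        # versions hardcoded purely for illustration (GCCcore 9.3.0 + CUDAcore 11.0.2)
        return 'Compiler/GCCcore-CUDAcore/9.3.0-11.0.2'
    return None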

@mboisson (Contributor) commented Jul 28, 2020

Extending the MODULEPATH in gcccorecuda is not an option for sites that hide toolchains.

Am I mistaken in thinking that, even without modifications to EasyBuild like ours, GCC, iccifort and OpenMPI are the modules which extend MODULEPATH?

If that's the case, then why should it be any different for CUDA? (i.e. it should be CUDA which extends the MODULEPATH, not gcccorecuda).

@ocaisa (Member) commented Jul 28, 2020

The way I see it, there is not just one module path extension for CUDA but many: for every toolchain that can appear in foss there is an equivalent combination that will include CUDA (gcccorecuda, gcccuda, gompic, fosscuda). That's the main reason I would suggest using the inherit() approach: it won't matter at which point you load CUDA, you will always get the right set of MODULEPATH extensions, since it will walk the tree of CUDA modules, extending the path for every relevant level.

I agree that it shouldn't be a hack. I would approach it in a similar way to how we approached modulerc: we would need CUDA easyconfigs that leverage the system-level installation for each toolchain (i.e., CUDA built with GCCcore, which together define gcccorecuda; CUDA built with gompi, which together define gompic). Such an easyconfig might look like:

easyblock = 'inherit'

name = 'CUDA'
version = '10.1.243'

homepage = 'https://developer.nvidia.com/cuda-toolkit'
description = """CUDA (formerly Compute Unified Device Architecture) is a parallel
 computing platform and programming model created by NVIDIA and implemented by the
 graphics processing units (GPUs) that they produce. CUDA gives developers access
 to the virtual instruction set and memory of the parallel computational elements in CUDA GPUs."""

toolchain = {'name': 'GCC', 'version': '8.3.0'}

builddependencies = [
    ('%(name)s', '%(version)s', '', SYSTEM),
]

moduleclass = 'lang'

Having the build dep ensures that the base installation exists and the toolchain tells us how to extend the path once we call out to the MNS.

The problem I see is that this would work fine in a hierarchical MNS where you use Lmod, but what about in the flat scheme? There the names are not identical so inherit() does nothing for you. The inherit.py would have to be clever enough to handle other scenarios.

@Micket (Contributor) commented Jul 28, 2020

> If that's the case, then why should it be any different for CUDA? (i.e. it should be CUDA which extends the MODULEPATH, not gcccorecuda).

Well just look at this PR; there is no CUDA under GCCcore. It just depends on the system level CUDAcore.
https://github.com/easybuilders/easybuild-framework/pull/3385/files#diff-37ff22840e53e6e8e5da7edf34687d6cR40-R41
You don't have a CUDA-XXX-GCCcore.eb config at ComputeCanada either.
https://github.com/ComputeCanada/easybuild-easyconfigs/tree/computecanada-master/easybuild/easyconfigs/c/CUDA
So the problem with CUDA under GCCcore is that it doesn't exist, which is not a coincidence:
Adding CUDA under GCCcore means that it would conflict with CUDA under GCC; you can't have them loaded simultaneously.

CC's customized HMNS instead makes a hack that adds the MODULEPATH directly into the system level CUDAcore (which you have fortunately chosen not to hide). Quoting Bart:

> What is very different though is the CC mapping of Core to a single path,

Though, Bart's comment gives me an idea.
We could perhaps have CUDA-xxx-iccifort and CUDA-xxx-GCC also extend the GCCcore-CUDA MODULEPATH. That *might* work, just maybe. It could lead to some funky side-effects, since it is really bending over backwards. Moving CUDAcore into GCCcore sounds like it could also work, as it would avoid the single-path problem.

@mboisson (Contributor) commented Jul 28, 2020

>> If that's the case, then why should it be any different for CUDA? (i.e. it should be CUDA which extends the MODULEPATH, not gcccorecuda).
>
> Well just look at this PR; there is no CUDA under GCCcore. It just depends on the system level CUDAcore.
> https://github.com/easybuilders/easybuild-framework/pull/3385/files#diff-37ff22840e53e6e8e5da7edf34687d6cR40-R41

It depends on the compiler toolchain (i.e. there is one for GCC and one for iccifort).

> You don't have a CUDA-XXX-GCCcore.eb config at ComputeCanada either.
> https://github.com/ComputeCanada/easybuild-easyconfigs/tree/computecanada-master/easybuild/easyconfigs/c/CUDA
> So the problem with CUDA under GCCcore is that it doesn't exist, which is not a coincidence:
> Adding CUDA under GCCcore means that it would conflict with CUDA under GCC; you can't have them loaded simultaneously.
>
> CC's customized HMNS instead makes a hack that adds the MODULEPATH directly into the system level CUDAcore (which you have fortunately chosen not to hide). Quoting Bart:

I think there is confusion here.
https://github.com/ComputeCanada/easybuild-computecanada-config/blob/b4c723d40eccb999c8bae8a33f72261f600c1d2a/SoftCCHierarchicalMNS.py#L240
this line is not instead of extending the MODULEPATH for the CUDA-XXX-GCC-9.3.0.eb recipe, it is in addition to the regular path, i.e.
https://github.com/ComputeCanada/easybuild-computecanada-config/blob/b4c723d40eccb999c8bae8a33f72261f600c1d2a/SoftCCHierarchicalMNS.py#L261

Which is, I think, exactly what you describe in the next paragraph.

Also, for us, the CUDAcore module is actually hidden, but it gets loaded when users load CUDA. It is just hidden by our hooks https://github.com/ComputeCanada/easybuild-computecanada-config/blob/master/cc_hooks_gentoo.py#L367
because it does not make sense to hide it upstream. We also hide GCCcore (which is not hidden by --hide-toolchains, as it is not a toolchain per se).

>> What is very different though is the CC mapping of Core to a single path,
>
> Though, Bart's comment gives me an idea.
> We could perhaps have CUDA-xxx-iccifort and CUDA-xxx-GCC also extend the GCCcore-CUDA MODULEPATH. That *might* work, just maybe. It could lead to some funky side-effects, since it is really bending over backwards. Moving CUDAcore into GCCcore sounds like it could also work, as it would avoid the single-path problem.

@Micket (Contributor) commented Jul 28, 2020

> I think there is confusion here.
> https://github.com/ComputeCanada/easybuild-computecanada-config/blob/b4c723d40eccb999c8bae8a33f72261f600c1d2a/SoftCCHierarchicalMNS.py#L240
> this line is not instead of extending the MODULEPATH for the CUDA-XXX-GCC-9.3.0.eb recipe,

Well, then I admit I'm utterly fooled, because that just doesn't match the code I see here. It literally checks ec['name'] == CUDACORE, i.e. whether the name of the easyconfig is CUDAcore, and if it is, it adds a path that does not contain any compiler or compiler version at all (line 241).

So, from that I would expect your system level CUDAcore module to contain a MODULEPATH extension that doesn't contain compiler information:

prepend_path("MODULEPATH",".../avx/CUDA/cuda10.2")

I would still expect CUDA@GCC to expand its own modulepath as usual.

> it is in addition to the regular path, i.e.
> https://github.com/ComputeCanada/easybuild-computecanada-config/blob/b4c723d40eccb999c8bae8a33f72261f600c1d2a/SoftCCHierarchicalMNS.py#L261

How could this possibly be in addition? That part is in a different if/else branch that can't be true simultaneously.

I actually went ahead and set up ComputeCanada's CVMFS on my computer to see what your modules look like, and after digging through a bit I found the CUDA "core" path that I suspected:
/cvmfs/soft.computecanada.ca/easybuild/modules/2020/avx2/CUDA/cuda10.2/
and looking for the ucx module, which should be under that "toolchain":

$ ml spider ucx/1.8.0
[snip]
    Additional variants of this module can also be loaded after loading the following modules:

      cudacore/.10.2.89

So, it's exactly as I described? The MODULEPATH extension is part of CUDAcore (that you also hide).

@mboisson (Contributor) commented Jul 28, 2020

>> I think there is confusion here.
>> https://github.com/ComputeCanada/easybuild-computecanada-config/blob/b4c723d40eccb999c8bae8a33f72261f600c1d2a/SoftCCHierarchicalMNS.py#L240
>> this line is not instead of extending the MODULEPATH for the CUDA-XXX-GCC-9.3.0.eb recipe,
>
> Well, then I admit I'm utterly fooled, because that just doesn't match the code I see here. It literally checks ec['name'] == CUDACORE, i.e. whether the name of the easyconfig is CUDAcore, and if it is, it adds a path that does not contain any compiler or compiler version at all (line 241).
>
> So, from that I would expect your system level CUDAcore module to contain a MODULEPATH extension that doesn't contain compiler information:
>
> prepend_path("MODULEPATH",".../avx/CUDA/cuda10.2")

That is correct. This is what Bart was referring to when he said

> What is very different though is the CC mapping of Core to a single path

For upstream HMNS, it should be dependent on the version of GCCcore. Our HMNS remaps GCCcore,X.X.X to a single path (Core), and so does our CUDAcore.

> I would still expect CUDA@GCC to expand its own modulepath as usual.
>
>> it is in addition to the regular path, i.e.
>> https://github.com/ComputeCanada/easybuild-computecanada-config/blob/b4c723d40eccb999c8bae8a33f72261f600c1d2a/SoftCCHierarchicalMNS.py#L261
>
> How could this possibly be in addition? That part is in a different if/else branch that can't be true simultaneously.

The CUDAcore module adds the path above (which would be specific to the GCCcore version upstream); the CUDA module adds the compiler-specific (i.e. gcc93 or intel2020) paths.

See files
/cvmfs/soft.computecanada.ca/easybuild/modules/2020/Core/cudacore/.11.0.2.lua

/cvmfs/soft.computecanada.ca/easybuild/modules/2020/avx2/Compiler/intel2020/cuda/11.0.lua
and
/cvmfs/soft.computecanada.ca/easybuild/modules/2020/avx2/Compiler/gcc9/cuda/11.0.lua

> I actually went ahead and set up ComputeCanada's CVMFS on my computer to see what your modules look like, and after digging through a bit I found the CUDA "core" path that I suspected:
> /cvmfs/soft.computecanada.ca/easybuild/modules/2020/avx2/CUDA/cuda10.2/
> and looking for the ucx module, which should be under that "toolchain":
>
> $ ml spider ucx/1.8.0
> [snip]
>     Additional variants of this module can also be loaded after loading the following modules:
>
>       cudacore/.10.2.89
>
> So, it's exactly as I described? The MODULEPATH extension is part of CUDAcore (that you also hide).

@mboisson (Contributor) commented Jul 28, 2020

Maybe this yields a clearer picture (I kept only the paths for the 2020a components):

$ grep -r MODULEPATH /cvmfs/soft.computecanada.ca/easybuild/modules/2020/{Core,avx2} | grep -v "HOME\|.bak\|:--\|10.2" | grep "9.3.0\|4.0.3\|cuda11.0\|2020.1.217" | tr ':' '\n'
/cvmfs/soft.computecanada.ca/easybuild/modules/2020/Core/gcc/9.3.0.lua
prepend_path("MODULEPATH", pathJoin("/cvmfs/soft.computecanada.ca/easybuild/modules/2020", os.getenv("RSNT_ARCH"), "Compiler/gcc9"))
/cvmfs/soft.computecanada.ca/easybuild/modules/2020/Core/intel/2020.1.217.lua
prepend_path("MODULEPATH", pathJoin("/cvmfs/soft.computecanada.ca/easybuild/modules/2020", os.getenv("RSNT_ARCH"), "Compiler/intel2020"))
/cvmfs/soft.computecanada.ca/easybuild/modules/2020/Core/cudacore/.11.0.2.lua
prepend_path("MODULEPATH", "/cvmfs/soft.computecanada.ca/easybuild/modules/2020/CUDA/cuda11.0")
/cvmfs/soft.computecanada.ca/easybuild/modules/2020/Core/cudacore/.11.0.2.lua
prepend_path("MODULEPATH", pathJoin("/cvmfs/soft.computecanada.ca/easybuild/modules/2020", os.getenv("RSNT_ARCH"), "CUDA/cuda11.0"))
/cvmfs/soft.computecanada.ca/easybuild/modules/2020/avx2/Compiler/gcc9/openmpi/4.0.3.lua
prepend_path("MODULEPATH", "/cvmfs/soft.computecanada.ca/easybuild/modules/2020/avx2/MPI/gcc9/openmpi4")
/cvmfs/soft.computecanada.ca/easybuild/modules/2020/avx2/Compiler/gcc9/cuda/11.0.lua
prepend_path("MODULEPATH", "/cvmfs/soft.computecanada.ca/easybuild/modules/2020/avx2/CUDA/gcc9/cuda11.0")
/cvmfs/soft.computecanada.ca/easybuild/modules/2020/avx2/Compiler/intel2020/openmpi/4.0.3.lua
prepend_path("MODULEPATH", "/cvmfs/soft.computecanada.ca/easybuild/modules/2020/avx2/MPI/intel2020/openmpi4")
/cvmfs/soft.computecanada.ca/easybuild/modules/2020/avx2/Compiler/intel2020/cuda/11.0.lua
prepend_path("MODULEPATH", "/cvmfs/soft.computecanada.ca/easybuild/modules/2020/avx2/CUDA/intel2020/cuda11.0")
/cvmfs/soft.computecanada.ca/easybuild/modules/2020/avx2/CUDA/gcc9/cuda11.0/openmpi/4.0.3.lua
prepend_path("MODULEPATH", "/cvmfs/soft.computecanada.ca/easybuild/modules/2020/avx2/MPI/gcc9/cuda11.0/openmpi4")
/cvmfs/soft.computecanada.ca/easybuild/modules/2020/avx2/CUDA/intel2020/cuda11.0/openmpi/4.0.3.lua
prepend_path("MODULEPATH", "/cvmfs/soft.computecanada.ca/easybuild/modules/2020/avx2/MPI/intel2020/cuda11.0/openmpi4")

@ocaisa (Member) commented Jul 29, 2020

In general, I think we need to agree on an approach for how we are going to deal with accelerators. I'm not such a big fan of the current direction with fosscuda, since it leads to a lot of duplication of easyconfigs. As @mboisson has pointed out elsewhere, it makes things like --try-toolchain not work so well, since we inspect the capabilities required by the toolchain used, which with fosscuda usually means CUDA even when this is not actually required by the software at all. We should have more fine-grained control over this.

Personally I would like to see accelerator packages added as overlays on top of standard toolchains (potentially shadowing packages available there). For the current (standard) hierarchy, this means additional MODULEPATH extensions at the GCCcore, Compiler and MPI levels. To act as overlays, these extensions have to occur in the appropriate order, and I currently can't see another way of doing that other than having different CUDA packages that are auto-swapped.

Dependency resolution becomes complicated in that case, but I think I could modify it such that if a CUDA-capable toolchain is used, other CUDA-enabled toolchains in the hierarchy are always preferred when resolving deps.

Having said all that, the traffic that this PR has generated clearly implies that we need to have a more complete discussion on this topic.

@mboisson (Contributor) commented Jul 29, 2020

@ocaisa, the same thing applies to MKL. I had created a PR for all of the subtoolchains we use here:
easybuilders/easybuild-easyconfigs#10954
but then, I split the CUDA parts out in this PR instead :
easybuilders/easybuild-easyconfigs#11035

only to realize that we also install MKL at the SYSTEM level, while upstream typically installs it at the MPI level. This means that the toolchains gmkl and iimkl can't be used upstream.

So I created this one
easybuilders/easybuild-easyconfigs#11036
which includes only gomkl, iomkl, iompi, and is much more limiting (in my opinion).

Installing MKL at the SYSTEM level also means that we have gcccoremkl, with which we compile R at the GCCcore level (so it is visible even with the Intel compiler loaded), while still being compiled against MKL.

@boegel added this to the 4.x milestone · Aug 5, 2020
@boegel modified the milestones: 4.x, next release (4.2.3?) · Aug 19, 2020
@Micket (Contributor) commented Aug 19, 2020

So, there are probably lots of ways we can technically go forward with this. Some promising options (in no particular order):

  1. Not use gcccorecuda at all and just duplicate UCX configs (and whatever else we might want in gcccorecuda) for each compiler. This means a few unnecessary builds, but probably not such a big deal actually.

  2. Introduce a modulepath Compiler/GCCcore-CUDA/9.3.0-11.0.2/ and put that modulepath extension into the gcccorecuda module (the reason we haven't done this for other toolchains is that they can be hidden, but it's doubtful that any user is going to need to look for stuff directly under gcccorecuda).

  3. Introduce a modulepath e.g. Compiler/GCCcore-CUDA/9.3.0-11.0.2/ and put that modulepath extension into CUDA (meaning it would extend both GCC-CUDA and GCCcore-CUDA, or iccifort-CUDA + GCCcore-CUDA on the Intel side).

  4. Move CUDAcore from the system level into the GCCcore level, and use the CUDAcore module to extend the modulepath. This means we are still stuck with a duplicate CUDA install at the system level for other stuff.

@mboisson (Contributor)

I would favor solution 1.

@boegel (Member) commented Sep 1, 2020

I've pushed this to the next milestone (likely 4.3.1)...

I haven't had time to dive into this yet, and it deserves some careful thought rather than rushing it in (and I don't want this to block the 4.3.0 release).

@boegel (Member) commented Sep 2, 2020

close/re-open to refresh CI

@boegel closed this · Sep 2, 2020
@boegel reopened this · Sep 2, 2020
@boegel added this to In progress in EasyBuild v4.3.1 via automation · Sep 16, 2020
@bartoldeman (Contributor, Author)

I'll close this for now. We can keep it around, but there are issues with this in the general HMNS, unless you install CUDAcore for GCCcore, which is a little odd since it's a binary package.

EasyBuild v4.3.1 automation moved this from In progress to Done Oct 13, 2020