Add gcccorecuda toolchain. #3385
Conversation
This is a new toolchain combining GCCcore and CUDA. gcccuda and iccifortcuda then needed to be modified to let them know that gcccorecuda is an optional subtoolchain.
We need this for UCX with CUDA support, unless UCX is compiled with gcccuda or iccifortcuda instead.
@Micket is probably interested in this
```diff
 class GccCUDA(GccToolchain, Cuda):
     """Compiler toolchain with GCC and CUDA."""
     NAME = 'gcccuda'
     COMPILER_MODULE_NAME = ['GCC', 'CUDA']
-    SUBTOOLCHAIN = GccToolchain.NAME
+    SUBTOOLCHAIN = [GccToolchain.NAME, GCCcoreCUDA.NAME]
```
Do we need to be worried about all the older toolchains, i.e. will it work if gcccorecuda is missing for 2019b?
Only if recipes try to use it. Unless someone adds a recipe like https://github.com/easybuilders/easybuild-easyconfigs/blob/cf729049d57ef84428077841321f23c3d5cc4131/easybuild/easyconfigs/o/OpenMPI/OpenMPI-4.0.3-gcccuda-2020a.eb
for 2019b, it won't matter that the subtoolchain is not defined for that version.
`gcccorecuda.py` has `OPTIONAL = True`, so it will be ignored for 2019b.
The same mechanism is used for e.g. golf, gmkl and iimkl already, and even for gcccore, as older toolchains (talking about 2015a and earlier) didn't have that yet.
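The optional-subtoolchain mechanism can be illustrated with a self-contained sketch. Note that the classes and the `resolve_subtoolchains()` helper below are stand-ins invented for illustration, not the actual EasyBuild framework code:

```python
# Stand-in sketch of the OPTIONAL subtoolchain mechanism (hypothetical
# helper, not EasyBuild's real implementation).

class Toolchain:
    NAME = None
    SUBTOOLCHAIN = None   # a single name, or a list of names
    OPTIONAL = False      # required by default

class Gcc(Toolchain):
    NAME = 'GCC'

class GCCcoreCUDA(Toolchain):
    NAME = 'gcccorecuda'
    OPTIONAL = True       # may be absent for older generations like 2019b

class GccCUDA(Toolchain):
    NAME = 'gcccuda'
    # gcccorecuda is listed as an (optional) subtoolchain alongside GCC
    SUBTOOLCHAIN = [Gcc.NAME, GCCcoreCUDA.NAME]

KNOWN = {tc.NAME: tc for tc in (Gcc, GCCcoreCUDA, GccCUDA)}

def resolve_subtoolchains(tc, available_modules):
    """Drop optional subtoolchains that have no module for this generation."""
    names = tc.SUBTOOLCHAIN if isinstance(tc.SUBTOOLCHAIN, list) else [tc.SUBTOOLCHAIN]
    result = []
    for name in names:
        exists = any(mod.startswith(name + '/') for mod in available_modules)
        if exists or not KNOWN[name].OPTIONAL:
            result.append(name)
    return result

# 2019b: no gcccorecuda module exists, so it is silently skipped
print(resolve_subtoolchains(GccCUDA, ['GCC/8.3.0', 'GCCcore/8.3.0']))  # ['GCC']
```

The point is that a required subtoolchain would still be demanded (and fail loudly if missing), whereas an `OPTIONAL` one is simply dropped for generations that predate it.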
I tried to use this. I made a whole clean test tree for EB and sneakily just copied over the 3 files. Did I mess up, or is there still something missing?
Which files did you copy?
For I think
The 3 files from this PR. This PR is missing an important change to (at least?) (GCCcore + CUDA needs to be defined there).

edit: Presumably though, this raises the question: should GCCcoreCUDA be
or
Ah, good point about the HMNS. I'll work on a test case for that.
Uses `'CUDAcore,GCCcore': ('GCCcore-CUDAcore', '%(GCCcore)s-%(CUDAcore)s')`, so the compiler in HMNS is `GCCcore-CUDAcore`.
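The quoted mapping can be exercised with a small stand-alone sketch. Only the template tuple comes from the comment above; the `det_compiler_subdir()` helper is a hypothetical stand-in for the MNS logic:

```python
# Expanding the HierarchicalMNS-style template quoted above; only the
# template tuple is from the discussion, det_compiler_subdir() is a stand-in.
COMP_NAME_VERSION_TEMPLATES = {
    'CUDAcore,GCCcore': ('GCCcore-CUDAcore', '%(GCCcore)s-%(CUDAcore)s'),
}

def det_compiler_subdir(comp_versions):
    """Map compiler name->version pairs to a module subdirectory."""
    key = ','.join(sorted(comp_versions))
    name_tmpl, version_tmpl = COMP_NAME_VERSION_TEMPLATES[key]
    return '%s/%s' % (name_tmpl, version_tmpl % comp_versions)

print(det_compiler_subdir({'GCCcore': '9.3.0', 'CUDAcore': '11.0.2'}))
# GCCcore-CUDAcore/9.3.0-11.0.2
```

So modules built with gcccorecuda would land under a `GCCcore-CUDAcore/<gcc>-<cuda>` branch of the hierarchy rather than under plain `GCCcore`.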
Added support for HMNS now (using
Tests finally pass now that #3392 is merged.
I'm really trying to think this through carefully.

Side-note: personally, I would have strongly preferred if EB had just always done alternative approach 3. I think it would have been much cleaner. We would, for example, not have had the issue with split icc + ifort modules, and we wouldn't have had to introduce these "ugly" CUDAcore modules, or any extra modules at all. There would just be 1 GCC, 1 CUDA, and 1 gcccuda (which contains the GCC-CUDA modulepath).

But now, where does GCCcoreCUDA fall? In this PR, there is no "CUDA" at the GCCcore level; it just directly depends on the CUDAcore (at system level), so this can't work like 2. I can't quite tell where we define which module gets to expand the MODULEPATH, but there is only 1 module it could be placed in, and that's

So, we have 2 alternatives
In my opinion, the `gcccuda/2019b` module (and other toolchain modules) is actually harmful, not convenient.
But it does work like 2 (users load GCC and then load CUDA). We're running this in production.
I don't have a clear enough picture of the problem to comment, so I will let Bart do it, but know that it does work, as we have been using this for about a year within our stack.
Sounds like a problem that can be solved with a tiny bit of documentation and a `module show`. And the version string isn't at all about what version of GCC or OpenMPI is used (something I suspect basically none of my users are the least bit interested in). It's a snapshot in time of what set of library and software versions we build, e.g. roughly what version of OpenFOAM, TensorFlow, Biopython, etc. one can expect to find.

And for the rest who care about some particular version, they do a `module spider` and just load whatever toolchain components Lmod tells them. None of them would have had any idea what version of GCC+OpenMPI+CUDA to load in order to access the latest TensorFlow (which is the only thing the user cares about).

And for those who just want the compiler? Why, it's still there; you can still just do a `module load GCC/9.3.0`.

> But it does work like 2 (users load GCC and then load CUDA).

Yes. That's how it works for `gcccuda`. But not with this PR: `gcccorecuda` has no CUDA module to call its own, it directly depends on CUDAcore, so there is nowhere to put the MODULEPATH extension, which is required for HMNS.

> We're running this in production.

With HMNS? The only reason why we have to introduce a `CUDAcore` at all is the need to have a unique CUDA under GCC, so that it can extend the MODULEPATH for HMNS when loaded. If one has e.g. a flat module naming scheme, then there are no paths to modify, so one can of course get away with anything.
Yes, with HMNS, we only run a HMNS.
This is our version of HMNS.
So, I had a sift through the ComputeCanada naming scheme. None of this would work here: there must be a module that contains the MODULEPATH extension.
Sorry, I meant we use a HMNS. It is definitely more complicated (and complete) than the upstream HMNS. A fundamental design choice in our stack is to use subtoolchains to maximize reusability. This means having more branches of the hierarchy, as we will reuse the same GCC/Intel for multiple versions of OpenMPI, or multiple versions of CUDA. We don't necessarily pick one "2020a" and stick with it until we change everything 6 months later.

This is particularly true of CUDA, since it changes more frequently than we change toolchains (roughly every 2 years). In 4 years, we have extensively used only 3 versions of GCC (5.4.0, 7.3.0, 9.3.0), Intel (2016.4, 2018.3, 2020.1), and Open MPI (2.1.1, 3.0.2, 4.0.1), but many more versions of CUDA (7.5, 8.0, 9.0, 9.2, 10.0, 10.1, 10.2, 11.0).

I looked at the upstream HMNS; how exactly does it consider CUDA in the hierarchy? I don't see any special case for CUDA as there is for MPI.
I'm taking time off, so I can just add some comments here and there, but here goes:

About CUDAcore in general: I know from past comments that @boegel is not a big fan of the `*core` concept, including GCCcore.
So, there is an Lmod feature that may help us here: In that case, if you have CUDA at the core level loaded, loading a compiler like
So, part of it is here. @ocaisa, I'm not super keen on yet more "hacks". I just feel that putting the
Extending the

Am I mistaken in thinking that, even without modifications to EasyBuild like ours, GCC, iccifort and OpenMPI are the modules which extend MODULEPATH? If that's the case, then why should it be any different for CUDA? (i.e. it should be CUDA which extends the MODULEPATH, not
The way I see it, there is not just one module path extension for CUDA but many: one for every toolchain that can appear in

I agree that it shouldn't be a hack. I would approach it in a similar way to how we approached modulerc: we would need CUDA easyconfigs that leverage the system-level installation for each toolchain (i.e., CUDA built with

Having the build dep ensures that the base installation exists, and the toolchain tells us how to extend the path once we call out to the MNS. The problem I see is that this would work fine in a hierarchical MNS where you use Lmod, but what about in the flat scheme? There the names are not identical, so
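The idea of per-toolchain CUDA easyconfigs backed by a system-level install could look roughly like the following. This is a hypothetical easyconfig fragment written for illustration, not one that exists in the repository; the names and versions are made up:

```python
# Hypothetical easyconfig fragment: a CUDA module at the GCCcore level whose
# actual installation comes from a system-level CUDAcore build dependency.
name = 'CUDA'
version = '11.0.2'

toolchain = {'name': 'GCCcore', 'version': '9.3.0'}

# the build dep ensures the base (system-level) installation exists; the
# GCCcore toolchain then tells the MNS how to extend the MODULEPATH for
# this level of the hierarchy
builddependencies = [('CUDAcore', version, '', SYSTEM)]
```

The same pattern would be repeated once per toolchain level at which CUDA needs to appear.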
Well, just look at this PR; there is no CUDA under GCCcore. It just depends on the system-level CUDAcore. CC's customized HMNS instead makes a hack that adds the MODULEPATH directly into the system-level CUDAcore (which you have fortunately chosen not to hide). Quoting Bart:

Though, Bart's comment gives me an idea.
I think there is confusion here. Which is, I think, exactly what you describe in the next paragraph. Also, for us, the
Well, then I admit I'm utterly fooled, because that just doesn't match the code I see here. It literally checks

So, from that I could expect your system-level
I would still expect CUDA@GCC to expand its own modulepath as usual.
How could this possibly be in addition? That part is in a different

I actually went ahead and set up ComputeCanada's CVMFS on my computer to see what your modules look like, and after digging through a bit I found the CUDA "core" path that I suspected:
So, it's exactly as I described? The MODULEPATH extension is part of CUDAcore (that you also hide).
That is correct. This is what Bart was referring to when he said
For the upstream HMNS, it should be dependent on the version of GCCcore. Our HMNS remaps `GCCcore,X.X.X` to a single path (`Core`), and so does our CUDAcore.

The CUDAcore adds the path above (which would be specific to the GCCcore version in upstream); the CUDA module adds the compiler-specific (i.e. `gcc93` or `intel2020`) paths. See files
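The remapping described above can be sketched as follows. This is a toy stand-in for Compute Canada's customized MNS, not their actual code; the function name and the `Compiler/...` layout are assumptions for illustration:

```python
# Toy sketch of the remapping described above: every GCCcore and CUDAcore
# version collapses into a single 'Core' subdirectory, whereas the upstream
# HierarchicalMNS would keep a version-specific subdirectory instead.
def det_module_subdir(name, version):
    if name in ('GCCcore', 'CUDAcore'):
        return 'Core'
    # other compilers keep a version-specific branch (layout assumed here)
    return 'Compiler/%s/%s' % (name.lower(), version)

print(det_module_subdir('GCCcore', '9.3.0'))   # Core
print(det_module_subdir('GCC', '9.3.0'))       # Compiler/gcc/9.3.0
```

Collapsing all GCCcore/CUDAcore versions into one `Core` path is what lets their system-level CUDAcore carry the MODULEPATH extension for every generation at once.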
Maybe this yields a clearer picture (I kept only the paths for the 2020a components):
In general, I think we need to agree on an approach to how we are going to deal with accelerators. I'm not such a big fan of the current direction with fosscuda, since it leads to a lot of duplication of easyconfigs. As @mboisson has pointed out elsewhere, it makes things like

Personally, I would like to see accelerator packages added as overlays on top of standard toolchains (potentially shadowing packages available there). For the current (standard) hierarchy, this means additional MODULEPATH extensions at the GCCcore, Compiler and MPI levels. To act as overlays, these extensions have to occur in the appropriate order, and I currently can't see another way of doing that other than having different CUDA packages that are auto-swapped. Dependency resolution becomes complicated in that case, but I think I could modify it such that if a CUDA-capable toolchain is used, other CUDA-enabled toolchains in the hierarchy are always preferred when resolving deps.

Having said all that, the traffic that this PR has generated clearly implies that we need to have a more complete discussion on this topic.
@ocaisa, the same thing applies to MKL. I had created a PR for all of the subtoolchains we use here:

only to realize that we also install MKL at the SYSTEM level, while upstream typically installs it at the MPI level. This means that the toolchains

So I created this one

By installing MKL at the SYSTEM level, it actually means that we also have
So, there are probably lots of ways we could technically go forward with this. Some promising options (in no particular order):
I would favor solution 1.
I've pushed this to the next milestone (likely 4.3.1)... I haven't had time to dive into this yet, and it deserves some careful thought rather than rushing it in (and I don't want this to block the 4.3.0 release). |
close/re-open to refresh CI
I'll close this for now. We can keep it around, but there are issues with this in the general HMNS, unless you install CUDAcore for GCCcore, which is a little odd since it's a binary.