
How to deal with plugins not being built for some architectures? #28576

Closed
makortel opened this issue Dec 6, 2019 · 23 comments


makortel commented Dec 6, 2019

PR #28537 introduces the first plugins depending on CUDA, and they are ignored for ppc64le. If/when we add any of these (or future) plugins to a configuration fragment used in runTheMatrix.py, the dependent workflows will fail on ppc64le (similarly to the tests in #28572), already because the fillDescriptions()-generated configuration files are not present (and even if that step succeeded, loading the plugin(s) would fail).

Some options on how to proceed:

  1. Forget fillDescriptions() for such modules and go back to hand-written cfi files, also forgoing configuration validation
  2. Provide a separate mechanism to generate and validate cfi fragments that is always present
  3. Add an additional layer of plugins, i.e. each "CUDA EDModule" would itself contain only CPU code (and thus would be present on all architectures), and would load the helper plugin with the real CUDA dependence only if CUDAService (or some other mechanism) reports that CUDA is available
  4. Build multiple versions of the plugin libraries (CPU-only, CPU+CUDA) and decide at load time which one to pick
    • This could be an interesting option for building specific libraries for vector instruction sets more modern than SSE3 (AVX(2), AVX-512)

More ideas are welcome.
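For concreteness, here is a minimal sketch (all names hypothetical, not taken from #28537) of the kind of module at stake: because the translation unit includes CUDA headers and calls the CUDA runtime, the whole plugin library, including the CPU-only fillDescriptions() and hence the generated cfi file, is simply absent on an architecture without the CUDA external.

```cpp
// Hypothetical illustration, not an actual plugin from #28537.
#include "FWCore/Framework/interface/global/EDProducer.h"
#include "FWCore/Framework/interface/Event.h"
#include "FWCore/Framework/interface/MakerMacros.h"
#include "FWCore/ParameterSet/interface/ConfigurationDescriptions.h"
#include "FWCore/ParameterSet/interface/ParameterSet.h"
#include "FWCore/ParameterSet/interface/ParameterSetDescription.h"

#include <cuda_runtime.h>  // this dependence is what prevents building without CUDA

class HypotheticalCUDAProducer : public edm::global::EDProducer<> {
public:
  explicit HypotheticalCUDAProducer(edm::ParameterSet const&) {}

  // without CUDA the library is never built, so even this CPU-only method
  // (and the cfi fragment generated from it) is unavailable
  static void fillDescriptions(edm::ConfigurationDescriptions& descriptions) {
    edm::ParameterSetDescription desc;
    desc.add<int>("blocks", 64);  // illustrative parameter only
    descriptions.addWithDefaultLabel(desc);
  }

  void produce(edm::StreamID, edm::Event&, edm::EventSetup const&) const override {
    void* buf = nullptr;
    cudaMalloc(&buf, 1024);  // any runtime call ties the library to libcudart
    cudaFree(buf);
  }
};

DEFINE_FWK_MODULE(HypotheticalCUDAProducer);
```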


cmsbuild commented Dec 6, 2019

A new Issue was created by @makortel Matti Kortelainen.

@davidlange6, @Dr15Jones, @smuzaffar, @fabiocos, @kpedro88 can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here


VinInn commented Dec 7, 2019

Does CUDA really not exist for ppc64le?
Option 4 is interesting, but (given the historical organization of CMSSW, with many algorithms in "src") still not fully satisfactory.
Option 1 is simply a no-go.


slava77 commented Dec 7, 2019

Moving fillDescriptions() to a separate file.cc is a possible solution; not as a standard, but to avoid issues in these cases. This would likely require having a plugin.h and a modules.cc file to declare and instantiate the relevant parts, while leaving some methods unresolved if the rest of plugin.cc cannot be compiled.
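A minimal sketch of that layout, with hypothetical file and class names (the block concatenates several files; framework includes are omitted for brevity): the declaration and fillDescriptions() live in translation units that always compile, so cfi generation keeps working, while the CUDA-dependent definitions are compiled only where possible.

```cpp
// HypotheticalProducer.h -- declares the full interface
class HypotheticalProducer : public edm::global::EDProducer<> {
public:
  explicit HypotheticalProducer(edm::ParameterSet const&);
  static void fillDescriptions(edm::ConfigurationDescriptions& descriptions);
  void produce(edm::StreamID, edm::Event&, edm::EventSetup const&) const override;
};

// HypotheticalProducerDescriptions.cc -- always compiled, keeps cfi generation alive
void HypotheticalProducer::fillDescriptions(edm::ConfigurationDescriptions& descriptions) {
  edm::ParameterSetDescription desc;
  descriptions.addWithDefaultLabel(desc);
}

// HypotheticalProducer.cc -- CUDA-dependent; compiled only where the CUDA
// external exists, leaving the constructor and produce() unresolved elsewhere

// modules.cc -- instantiates the plugin
DEFINE_FWK_MODULE(HypotheticalProducer);
```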


VinInn commented Dec 8, 2019

  1. I think we should simply #ifdef (possibly at a higher level)
  2. In any case, none of this is a solution: if we support a platform, all code had better be able to run on it and produce correct results


makortel commented Dec 9, 2019

Does CUDA really not exist for ppc64le?

CUDA is available for ppc64le. I was actually wondering that myself as well; maybe @smuzaffar @mrodozov could clarify whether there is anything fundamental preventing the addition of the CUDA external for ppc64le, or whether it just has not been done yet?


makortel commented Dec 9, 2019

if we support a platform, all code had better be able to run on it and produce correct results

I'm not sure we can count on the availability of all foreseen GPUs (NVIDIA, AMD, Intel) on all our current CPU architectures (x86, Arm, Power).

A possible way out could be to make the GPU part of the architecture (SCRAM_ARCH), in which case the differentiation in the configuration and in the code would be just "CPU vs. GPU" instead of "CPU vs. NVIDIA vs. AMD vs. Intel".


VinInn commented Dec 10, 2019

SCRAM_ARCH is the ubiquitous mechanism we use.
It has its downsides (notably the inability to cross-compile, which essentially makes it impossible to have platform-dependent plugins in the same release).
Fat libraries are not easy to deploy at the scale of CMSSW applications. I even doubt that we can compile and link kernels for different types of GPUs.

@makortel

I was specifically thinking about a separate library per GPU type, and then loading only the one corresponding to the GPU the machine has (if any).


VinInn commented Dec 10, 2019

And do we have a mechanism for that?

@makortel

Not today. But that's the question: would we need, or benefit from, such a mechanism?


fwyzard commented Dec 10, 2019

Going back to the original issue, I understand there are two problems:

  • generate the cfi files through fillDescriptions()
  • support a configuration that may want to run a CUDA module

The first issue (availability of fillDescriptions()) could be addressed (today) by splitting the EDProducer (for example) into multiple files, or by #ifdeffing the CUDA part. This could be made to result in a dummy module that can be constructed (and maybe run) but does not do anything.
Is this something that we would like to have? It doesn't seem like a good idea to me.

By the way, I wrote "today" because I have not given any thought to how to handle it if/when we use something like Kokkos or Alpaka.

The second issue (availability of the module itself) would be automatically addressed if we create a dummy module.
An alternative could be to "filter out" these modules at the python level.
I guess we would need some central way to declare that a module requires some condition to be set at configuration time in order to even be loaded?
And then, how would it interact with a SwitchProducer?
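To make the first idea concrete, a sketch of what "#ifdeffing the CUDA part" inside a single module could look like (the producer name and the preprocessor flag are hypothetical; the flag stands in for whatever the build system actually provides): the module always compiles and constructs, and on architectures without CUDA it degenerates into exactly the do-nothing behavior questioned above.

```cpp
#include "FWCore/Framework/interface/global/EDProducer.h"
#include "FWCore/Framework/interface/Event.h"
#include "FWCore/Framework/interface/MakerMacros.h"
#include "FWCore/ParameterSet/interface/ConfigurationDescriptions.h"
#include "FWCore/ParameterSet/interface/ParameterSet.h"
#include "FWCore/ParameterSet/interface/ParameterSetDescription.h"

class HypotheticalIfdefProducer : public edm::global::EDProducer<> {
public:
  explicit HypotheticalIfdefProducer(edm::ParameterSet const&) {}

  static void fillDescriptions(edm::ConfigurationDescriptions& descriptions) {
    edm::ParameterSetDescription desc;
    descriptions.addWithDefaultLabel(desc);  // identical on every architecture
  }

  void produce(edm::StreamID, edm::Event&, edm::EventSetup const&) const override {
#ifdef HYPOTHETICAL_CUDA_AVAILABLE  // hypothetical flag
    // ... the real CUDA work would go here ...
#else
    // dummy behavior: constructible and runnable, but does nothing
#endif
  }
};

DEFINE_FWK_MODULE(HypotheticalIfdefProducer);
```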


VinInn commented Dec 10, 2019

I am only afraid that we will start with "a few localized actions" and end up, in a few months or years, applying the pattern to most of the plugins. This is why, in the end, I always fall back to solutions based on SCRAM_ARCH.


makortel commented Jan 8, 2020

assign core


cmsbuild commented Jan 8, 2020

New categories assigned: core

@Dr15Jones, @smuzaffar, @makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks

@makortel

Here is a proposal based on an earlier discussion between @Dr15Jones and me. We concluded that all the modules specified in the configuration need to be loaded and constructed:

  • loaded in order to run configuration validation with fillDescriptions()
    • the validation can insert new parameters into the PSet of the module and thus alter the process configuration hash
  • constructed in order to get data dependencies (consumes() and produces())
    • produces() information is also used to enforce that the cases of the SwitchProducer produce exactly the same products

A possible way, with minimal impact on the existing code, would be:

  • Use #ifdef to hide code that cannot be compiled for platform X, such that the constructor, destructor, and fillDescriptions() are always available with equal behavior
    • for the constructor and destructor, equal behavior with respect to consumes+produces is sufficient
    • the constructor and destructor must not have side effects
  • Assume that all modules can be loaded within one process and one $SCRAM_ARCH
    • essentially implying options 3 or 4 of the issue description, should we ever hit a case where we must avoid loading some shared object
  • After constructing all modules, i.e. after figuring out the data dependence DAG, immediately delete all unscheduled modules (= not in any Path or EndPath) whose output is not consumed
  • Architecture-dependent initialization should be done in beginJob() or beginStream()
    • which are not called if the module is deleted because no one consumed its output
  • Add globalBeginJob() to stream modules with a GlobalCache (all other cases already have beginJob() or beginStream()); a sketch follows after this list
    • run after the constructor, but before beginStream()
    • one call (global) per module (no stream replication)
    • not called if the module is deleted
    • e.g. for ML inference (TensorFlow etc.), the global cache would be initialized here
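A sketch of the proposed hook (the cache type and producer name are hypothetical, and globalBeginJob() with this signature is the addition being proposed here, not an existing interface at the time of writing):

```cpp
#include "FWCore/Framework/interface/stream/EDProducer.h"
#include "FWCore/Framework/interface/Event.h"
#include "FWCore/Framework/interface/MakerMacros.h"
#include "FWCore/ParameterSet/interface/ParameterSet.h"

#include <memory>

struct HypotheticalMLCache {
  // e.g. a TensorFlow graph/session would live here
};

class HypotheticalMLProducer
    : public edm::stream::EDProducer<edm::GlobalCache<HypotheticalMLCache>> {
public:
  HypotheticalMLProducer(edm::ParameterSet const&, HypotheticalMLCache const*) {}

  // runs at construction time, so it must stay cheap and side-effect free
  static std::unique_ptr<HypotheticalMLCache> initializeGlobalCache(edm::ParameterSet const&) {
    return std::make_unique<HypotheticalMLCache>();
  }

  // proposed hook: one global call per module, after all constructors and
  // before any beginStream(); skipped if the module gets deleted because
  // nothing consumes its output
  static void globalBeginJob(HypotheticalMLCache* cache) {
    // expensive, architecture-dependent initialization (e.g. loading an
    // ML model) would go here instead of in the constructor
  }

  static void globalEndJob(HypotheticalMLCache*) {}

  void produce(edm::Event&, edm::EventSetup const&) override {}
};

DEFINE_FWK_MODULE(HypotheticalMLProducer);
```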

Applying the same constraints to Services and ESProducers implies:

  • Must be able to construct all Services. Architecture-dependent code in a constructor must be protected with a run-time check (or with #ifdef if it cannot even be built)
  • Must be able to construct all ESProducers.
    • In principle that would be sufficient, because the producing functions are called on demand, i.e. it would seem safe to call architecture-dependent code in produce(). An exception comes from hltGetConditions, which leads to calling the producing functions of all ESProducers.
      • After all EDModules in the HLT menu use ES-consumes, and we get concurrent ESProducers, could hltGetConditions perhaps become unnecessary? Ideally the framework would run all consumed ESProducers before and outside of the EDModule transition functions, although the exact time would be unspecified. On the other hand, already now the moment when hltGetConditions gets run with respect to other modules in the menu seems to be unspecified (unless I missed some data dependency on a quick look).
    • In the meantime, the ESProducer producing functions must not call architecture-dependent code (or must protect it at run time), and the functions must always produce a valid ESProduct.
      • The currently recommended pattern for CUDA ESProducts almost achieves this: the ESProduct is a wrapper that holds the data in CPU memory, and contains the code, called from the EDModule, to initiate the transfers. The recipe would only need to be updated to protect the CUDA-specific code with a run-time check for CUDA availability.
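A sketch of that updated recipe, with hypothetical names; the cudaGetDeviceCount() call stands in for whatever run-time availability check (e.g. via CUDAService) would be used in practice. This protects against a GPU being absent at run time; on architectures where CUDA cannot even be built, an #ifdef would still be needed around the device path.

```cpp
#include <cuda_runtime.h>

#include <vector>

// The ESProduct wrapper is always constructible and always valid on the CPU;
// only the device path touches CUDA, and only behind a run-time check.
class HypotheticalConditionsWrapper {
public:
  std::vector<float> const& cpuData() const { return data_; }

  // called from the consuming EDModule, not from the ESProducer's produce()
  float const* deviceData() const {
    int nDevices = 0;
    if (cudaGetDeviceCount(&nDevices) != cudaSuccess || nDevices == 0) {
      return nullptr;  // no usable GPU at run time: callers stay on the CPU path
    }
    // ... initiate (or reuse) the host-to-device transfer here ...
    return devicePtr_;
  }

private:
  std::vector<float> data_;     // always filled by the ESProducer
  float* devicePtr_ = nullptr;  // populated lazily on the device path
};
```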

@makortel

  • loaded in order to run configuration validation with fillDescriptions()

    • the validation can insert new parameters into the PSet of the module and thus alter the process configuration hash
  • constructed in order to get data dependencies (consumes() and produces())

    • produces() information is also used to enforce that the cases of the SwitchProducer produce exactly the same products

In the long run we could think of moving more functionality to python and imposing certain restrictions:

  • mark fillDescriptions()-generated cfi fragments as "complete", i.e. new parameters cannot be added nor existing ones removed (see Prune non-consumed EDProducers that are only in Tasks #26438 (comment))
    • or more precisely: all operations that could cause the configuration validation to change the module PSet would be forbidden
  • require that the module PSets contain all input data dependencies as InputTags, and only the ones that the module really consumes
    • while this would have other benefits as well, by making the process configuration more self-contained, it would be a rather big change and would require going over the code of nearly all modules

Then we would not need to validate the module configurations, nor call the constructors to build the data dependence DAG of modules. Neither of these changes would help to enforce that the SwitchProducer cases produce exactly the same products.

Instead of building the module dependence DAG in python, we could think of SwitchTask+SwitchSequence (see #26438 (comment)), but I'm a bit afraid those won't scale, creating a bigger mess of two nearly identical tasks for different "devices" that have to be carefully kept track of. The SwitchTask approach would also not help to enforce that the SwitchProducer cases produce exactly the same products.

@makortel

  • Assume that all modules can be loaded within one process and one $SCRAM_ARCH
    • essentially implying options 3 or 4 of the issue description, should we ever hit a case where we must avoid loading some shared object

I do see a risk here that we continue to bundle everything together, and if something not-generally-loadable appears, we may have to restructure a lot of code. On the other hand, going for complicated solutions now would carry the risk of doing possibly unnecessary work.


makortel commented Apr 7, 2020

Replying here to @fwyzard's comment #27983 (comment)

the framework needs to be able to construct all EDModules and ESProducers for all the platforms the configuration supports. I.e. the CUDA EDProducers need to exist even when we can't build CUDA, implying that the pieces of code depending on CUDA need to be protected with #ifdefs.

For what it's worth, I really don't like this approach.

I don't like it much either.

At this point, I think that the requirement to have unique hashes for the configuration is doing more harm to the process than (any?) good.

There are more issues than just the "configuration hash". If we cannot build an EDModule for some architecture:

  • We cannot generate the cfi fragments, and loading/importing any configuration that uses them would fail (also when using cms.Modifier.toReplaceWith() etc.)
    • On the other hand, the cfi fragments should be architecture-independent, so we could think of setting up a $SCRAM_ARCH-independent "repository" of cfi fragments, which would contain the fragments for all $SCRAM_ARCHs. This would work as long as there is one $SCRAM_ARCH release that can generate each fragment.
    • We could load such "possibly missing cfi" or "possibly non-functional cff" fragments with cms.Modifier.makeProcessModifier(). I would not be surprised if we ran into some problems with it (e.g. ordering), because the mechanism was not designed to be widely used, but rather as "a last resort". This approach would require giving up on the "portable configuration" goal.
  • We cannot validate the process configuration if it contains any such modules. This point is important for the "equal configuration hash across architectures".
  • With the SwitchProducer approach, the idea is to make only the modules that produce products "equal" to the CPU products "switchable", and leave their dependencies to be run on demand.
    • The "on demand" part requires knowledge of the data dependencies. Currently those are known only after the construction of all EDModules.
      • We could require that such modules declare all their ED/ES data dependencies with InputTag and ESInputTag, and do the dependence analysis on the python side (How to deal with plugins not being built for some architectures? #28576 (comment)).
      • We could introduce SwitchTask and SwitchSequence to declare explicitly when to run which module. I'm a bit afraid this approach would result in a mess when applied at large scale.
    • We could think of some other way of declaring a "portable configuration"
  • We could not enforce in all jobs that all case-producers of a SwitchProducer produce exactly the same products
    • We could relax the enforcement by doing it only for those cases that are available on the machine the job is running on. Given that the CPU module would always be available, i.e. all jobs would compare against that, this approach might be sufficient. Some error cases might be caught later than with the current enforcement.


makortel commented Sep 1, 2020

Summary of a discussion in the core software meeting today

In the short term (#31261), continue with the #ifdef approach and explore ways to make it look nicer (a sketch of the resulting pattern follows the list below):

  • When CUDA is not available at compile time, define a "dummy" producer that provides a constructor, a fillDescriptions() equivalent to the fillDescriptions() of the real producer, and a produce() function that is either empty or throws an exception
  • Accept the difference that the real producer declares consumes() and produces() while the "dummy" producer does not. This is simpler (the CUDA data formats can be left unbuilt), and forces us to discover whether we really need the consumes+produces on a SCRAM_ARCH that does not support CUDA. Currently the following consequences are known, but we believe they should not really be problematic:
    • callWhenNewProductsRegistered() behaves differently on different architectures
      • The non-built EDModule would not run properly anyway
      • callWhenNewProductsRegistered() should really not use products whose production is controlled by the SwitchProducer anyway
    • EDModule dependency tools like DependencyGraph or Tracer will give different answers on different architectures
      • We believe this to be acceptable
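A sketch of this pattern (the names and the preprocessor flag are hypothetical): both variants expose the same fillDescriptions(), so cfi generation and configuration validation are identical everywhere, while only the real variant declares any work.

```cpp
#include "FWCore/Framework/interface/global/EDProducer.h"
#include "FWCore/Framework/interface/Event.h"
#include "FWCore/Framework/interface/MakerMacros.h"
#include "FWCore/ParameterSet/interface/ConfigurationDescriptions.h"
#include "FWCore/ParameterSet/interface/ParameterSet.h"
#include "FWCore/ParameterSet/interface/ParameterSetDescription.h"
#include "FWCore/Utilities/interface/Exception.h"

namespace {
  // shared, so the real and dummy fillDescriptions() cannot drift apart
  void fillCommonDescription(edm::ConfigurationDescriptions& descriptions) {
    edm::ParameterSetDescription desc;
    descriptions.addWithDefaultLabel(desc);
  }
}  // namespace

#ifdef HYPOTHETICAL_CUDA_AVAILABLE

class HypotheticalGPUProducer : public edm::global::EDProducer<> {
public:
  explicit HypotheticalGPUProducer(edm::ParameterSet const&) {
    // real variant: consumes() and produces() declarations go here
  }
  static void fillDescriptions(edm::ConfigurationDescriptions& descriptions) {
    fillCommonDescription(descriptions);
  }
  void produce(edm::StreamID, edm::Event&, edm::EventSetup const&) const override {
    // ... the actual CUDA work ...
  }
};

#else

class HypotheticalGPUProducer : public edm::global::EDProducer<> {
public:
  explicit HypotheticalGPUProducer(edm::ParameterSet const&) {
    // dummy variant: intentionally no consumes()/produces()
  }
  static void fillDescriptions(edm::ConfigurationDescriptions& descriptions) {
    fillCommonDescription(descriptions);  // must stay equivalent to the real one
  }
  void produce(edm::StreamID, edm::Event&, edm::EventSetup const&) const override {
    throw cms::Exception("UnsupportedArchitecture")
        << "HypotheticalGPUProducer was built without CUDA support";
  }
};

#endif

DEFINE_FWK_MODULE(HypotheticalGPUProducer);
```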

For the longer term, some new thoughts:

  • Some options for cfi fragments
    • A common, architecture-independent place in the release installation area to collect the cfi files
      • What if the IB/release of the SCRAM_ARCH populating that area has build errors?
      • What if no single SCRAM_ARCH can build all EDModules?
      • Makes the testing step of all SCRAM_ARCHs of a given CMSSW branch depend on the build step of one or all SCRAM_ARCHs
    • (variation of the previous) Upon import Package.SubPackage.fragment_cfi, if scram knows that that file has "intentionally" not been generated, it could go and find the cfi fragment from the production SCRAM_ARCH's area
  • What if we adopt one of the cfi options, and just accept that some EDModules have not been built?
    • The framework would have to know that an EDModule (or any plugin) does not exist because it cannot be built for that architecture
    • SwitchProducer would not be able to fully enforce that the producer of each case produces the same products
    • Information on the consumed and produced products of non-built EDModules would not be available
      • As discussed above for the #ifdef, they are likely not necessary and we could live without them
  • What if data products cannot be built?
    • It is possible to make data products always build (std::any as an extreme example), but it is unclear whether that would really be useful
    • Easier to start by just ignoring data products (no consumes/produces declarations)
  • Alternative: deal with the "accelerator" variability within EDModules instead of exposing all of it in the configuration (with a separate EDModule class for each "accelerator"); a rough sketch follows after this list
    • With a portability technology (like Alpaka, Kokkos, or SYCL) we aim at a single source for the EDModule and algorithms anyway, so flexibility in organizing the work into EDModules could be less necessary
    • In SYCL/oneAPI the target "accelerator" is chosen at run time, so this approach would work naturally with SYCL
      • Alpaka, Kokkos, or any other solution where the target "accelerator" is chosen at compile time could probably be made to work with an additional layer of indirection
      • What if SYCL/oneAPI is not available for, or does not work with, some SCRAM_ARCH?
    • Some similarities with option 3 in the description (additional layer of plugins)
    • Would require "accelerator"-independent data formats (at the type level at least, but we'd probably want those anyway; see Issues related to GPU data structures #29297)
    • Could be worth prototyping
    • Need to support the current CUDA approach at least until everything has been migrated to a new way
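A very rough sketch (all names hypothetical) of that alternative: a single EDProducer in the configuration, with an indirection layer that picks the backend at run time; a GPU backend would be registered only where its toolchain is available.

```cpp
#include "FWCore/Framework/interface/global/EDProducer.h"
#include "FWCore/Framework/interface/Event.h"
#include "FWCore/Framework/interface/MakerMacros.h"
#include "FWCore/ParameterSet/interface/ParameterSet.h"

#include <memory>

class HypotheticalPortableProducer : public edm::global::EDProducer<> {
public:
  explicit HypotheticalPortableProducer(edm::ParameterSet const&)
      : backend_(chooseBackend()) {}  // run-time choice, in the spirit of SYCL device selection

  void produce(edm::StreamID, edm::Event&, edm::EventSetup const&) const override {
    backend_->run();  // the configuration never sees which backend this is
  }

private:
  // the extra layer of indirection mentioned above
  struct Backend {
    virtual ~Backend() = default;
    virtual void run() const = 0;
  };
  struct CpuBackend : Backend {
    void run() const override { /* serial or vectorized CPU code */ }
  };
  // a GpuBackend would be added here only where its compiler/runtime exists

  static std::unique_ptr<Backend> chooseBackend() {
    // query whatever run-time service reports available "accelerators";
    // in this sketch only the CPU backend exists, so fall back to it
    return std::make_unique<CpuBackend>();
  }

  std::unique_ptr<Backend> backend_;
};

DEFINE_FWK_MODULE(HypotheticalPortableProducer);
```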

In the specific case of CUDA 11 and gcc 10, we could work around it with fallback compiler options (use gcc 9 with nvcc, and gcc 10 elsewhere), but this was considered potentially fragile (e.g. towards C++ modules), and a more general solution was preferred.

@makortel

The problems with CUDA modules have been addressed in the Alpaka module system. The approach there (ModuleTypeResolver, #40383 and links therein) has some similarities with

  • Alternative: deal with the "accelerator" variability within EDModules instead of exposing all that into the configuration (with a separate EDModule class for each "accelerator")

but there are also some differences.

I think we can close this issue, and if some other system (like ML inference, although there we are already dealing with this problem in a different way) faces a similar challenge, open a new issue then.

@makortel

+core

@cmsbuild

cms-bot internal usage

@cmsbuild

This issue is fully signed and ready to be closed.
