
How to deal with plugins not being built for some architectures? #28576

Closed
makortel opened this issue Dec 6, 2019 · 23 comments


makortel commented Dec 6, 2019

PR #28537 introduces the first plugins depending on CUDA, and they are ignored for ppc64le. If/when we add any of these (or future) plugins to a configuration fragment used in runTheMatrix.py, the dependent workflows will fail on ppc64le (similarly to the tests in #28572), already because the fillDescriptions()-generated configuration files are not present (and even if that step succeeded, loading the plugin(s) would fail).

Some options on how to proceed:

  1. Forget fillDescriptions() for such modules and go back to hand-written cfi files, also forgoing configuration validation
  2. Provide a separate mechanism to generate and validate cfi fragments that is always present
  3. Add an additional layer of plugins, i.e. each "CUDA EDModule" would itself contain only CPU code (and thus would be present on all architectures), and would load the helper plugin with the real CUDA dependence only if CUDAService (or some other mechanism) reports that CUDA is available
  4. Build multiple versions of the plugin libraries (CPU-only, CPU+CUDA) and decide at load time which one to pick
    • This could be an interesting option for building specific libraries for vector instruction sets more modern than SSE3 (AVX(2), AVX-512)

More ideas are welcome.
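For concreteness, here is a minimal sketch (all names hypothetical, not taken from #28537) of the kind of module at stake: because the translation unit includes CUDA headers and calls the CUDA runtime, the whole plugin library, including the CPU-only fillDescriptions() and hence the generated cfi file, is simply absent on an architecture without the CUDA external.

```cpp
// Hypothetical illustration, not an actual plugin from #28537.
#include "FWCore/Framework/interface/global/EDProducer.h"
#include "FWCore/Framework/interface/Event.h"
#include "FWCore/Framework/interface/MakerMacros.h"
#include "FWCore/ParameterSet/interface/ConfigurationDescriptions.h"
#include "FWCore/ParameterSet/interface/ParameterSet.h"
#include "FWCore/ParameterSet/interface/ParameterSetDescription.h"

#include <cuda_runtime.h>  // this dependence is what prevents building without CUDA

class HypotheticalCUDAProducer : public edm::global::EDProducer<> {
public:
  explicit HypotheticalCUDAProducer(edm::ParameterSet const&) {}

  // without CUDA the library is never built, so even this CPU-only method
  // (and the cfi fragment generated from it) is unavailable
  static void fillDescriptions(edm::ConfigurationDescriptions& descriptions) {
    edm::ParameterSetDescription desc;
    desc.add<int>("blocks", 64);  // illustrative parameter only
    descriptions.addWithDefaultLabel(desc);
  }

  void produce(edm::StreamID, edm::Event&, edm::EventSetup const&) const override {
    void* buf = nullptr;
    cudaMalloc(&buf, 1024);  // any runtime call ties the library to libcudart
    cudaFree(buf);
  }
};

DEFINE_FWK_MODULE(HypotheticalCUDAProducer);
```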


cmsbuild commented Dec 6, 2019

A new Issue was created by @makortel Matti Kortelainen.

@davidlange6, @Dr15Jones, @smuzaffar, @fabiocos, @kpedro88 can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here


VinInn commented Dec 7, 2019

Does CUDA really not exist for ppc64le?
Option 4 is interesting, but (given the historical organization of CMSSW, with many algorithms in "src") still not fully satisfactory.
Option 1 is simply a no-go.


slava77 commented Dec 7, 2019

Moving fillDescriptions() to a separate file.cc is a possible solution; not as a standard, but to avoid issues in these cases. This would likely require having a plugin.h and a modules.cc file to declare and instantiate the relevant parts, while leaving some methods unresolved if the rest of plugin.cc cannot be compiled.
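A minimal sketch of that layout, with hypothetical file and class names (the block concatenates several files; framework includes are omitted for brevity): the declaration and fillDescriptions() live in translation units that always compile, so cfi generation keeps working, while the CUDA-dependent definitions are compiled only where possible.

```cpp
// HypotheticalProducer.h -- declares the full interface
class HypotheticalProducer : public edm::global::EDProducer<> {
public:
  explicit HypotheticalProducer(edm::ParameterSet const&);
  static void fillDescriptions(edm::ConfigurationDescriptions& descriptions);
  void produce(edm::StreamID, edm::Event&, edm::EventSetup const&) const override;
};

// HypotheticalProducerDescriptions.cc -- always compiled, keeps cfi generation alive
void HypotheticalProducer::fillDescriptions(edm::ConfigurationDescriptions& descriptions) {
  edm::ParameterSetDescription desc;
  descriptions.addWithDefaultLabel(desc);
}

// HypotheticalProducer.cc -- CUDA-dependent; compiled only where the CUDA
// external exists, leaving the constructor and produce() unresolved elsewhere

// modules.cc -- instantiates the plugin
DEFINE_FWK_MODULE(HypotheticalProducer);
```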


VinInn commented Dec 8, 2019

  1. I think we should simply #ifdef (possibly at a higher level)
  2. In any case, none of this is a solution: if we support a platform, all code had better be able to run on it and produce correct results


makortel commented Dec 9, 2019

Does CUDA really not exist for ppc64le?

CUDA is available for ppc64le. I was actually wondering that myself as well; maybe @smuzaffar @mrodozov could clarify whether there is anything fundamental preventing the addition of the CUDA external for ppc64le, or whether it just has not been done yet?


makortel commented Dec 9, 2019

if we support a platform, all code had better be able to run on it and produce correct results

I'm not sure we can count on the availability of all foreseen GPUs (NVIDIA, AMD, Intel) on all our current CPU architectures (x86, Arm, Power).

A possible way out could be to make the GPU part of the architecture (SCRAM_ARCH), in which case the differentiation in the configuration and in the code would be just "CPU vs. GPU" instead of "CPU vs. NVIDIA vs. AMD vs. Intel".


VinInn commented Dec 10, 2019

SCRAM_ARCH is the ubiquitous mechanism we use.
It has its downsides (notably the inability to cross-compile, which essentially makes it impossible to have platform-dependent plugins in the same release).
Fat libraries are not easy to deploy at the scale of CMSSW applications. I even doubt that we can compile and link kernels for different types of GPUs.

@makortel

I was specifically thinking about a separate library per GPU type, and then loading only the one corresponding to the GPU the machine has (if any).


VinInn commented Dec 10, 2019

And do we have a mechanism for that?

@makortel

Not today. But that's the question: would we need, or benefit from, such a mechanism?


fwyzard commented Dec 10, 2019

Going back to the original issue, I understand there are two problems:

  • generate the cfi files through fillDescriptions()
  • support a configuration that may want to run a CUDA module

The first issue (availability of fillDescriptions()) could be addressed (today) by splitting the EDProducer (for example) into multiple files, or by #ifdeffing the CUDA part. This could be made to result in a dummy module that can be constructed (and maybe run) but does not do anything.
Is this something that we would like to have? It doesn't seem like a good idea to me.

By the way, I wrote "today" because I have not given any thought to how to handle it if/when we use something like Kokkos or Alpaka.

The second issue (availability of the module itself) would be automatically addressed if we create a dummy module.
An alternative could be to "filter out" these modules at the python level.
I guess we would need some central way to declare that a module requires some condition to be set at configuration time in order to even be loaded?
And then, how would it interact with a SwitchProducer?
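To make the first idea concrete, a sketch of what "#ifdeffing the CUDA part" inside a single module could look like (the producer name and the preprocessor flag are hypothetical; the flag stands in for whatever the build system actually provides): the module always compiles and constructs, and on architectures without CUDA it degenerates into exactly the do-nothing behavior questioned above.

```cpp
#include "FWCore/Framework/interface/global/EDProducer.h"
#include "FWCore/Framework/interface/Event.h"
#include "FWCore/Framework/interface/MakerMacros.h"
#include "FWCore/ParameterSet/interface/ConfigurationDescriptions.h"
#include "FWCore/ParameterSet/interface/ParameterSet.h"
#include "FWCore/ParameterSet/interface/ParameterSetDescription.h"

class HypotheticalIfdefProducer : public edm::global::EDProducer<> {
public:
  explicit HypotheticalIfdefProducer(edm::ParameterSet const&) {}

  static void fillDescriptions(edm::ConfigurationDescriptions& descriptions) {
    edm::ParameterSetDescription desc;
    descriptions.addWithDefaultLabel(desc);  // identical on every architecture
  }

  void produce(edm::StreamID, edm::Event&, edm::EventSetup const&) const override {
#ifdef HYPOTHETICAL_CUDA_AVAILABLE  // hypothetical flag
    // ... the real CUDA work would go here ...
#else
    // dummy behavior: constructible and runnable, but does nothing
#endif
  }
};

DEFINE_FWK_MODULE(HypotheticalIfdefProducer);
```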


VinInn commented Dec 10, 2019

I am only afraid that we will start with "a few localized actions" and end up, in a few months or years, applying the pattern to most of the plugins. This is why, in the end, I always fall back to solutions based on SCRAM_ARCH.


makortel commented Jan 8, 2020

assign core


cmsbuild commented Jan 8, 2020

New categories assigned: core

@Dr15Jones, @smuzaffar, @makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks

@makortel

Here is a proposal based on an earlier discussion between @Dr15Jones and me. We concluded that all the modules specified in the configuration need to be loaded and constructed:

  • loaded in order to run configuration validation with fillDescriptions()
    • the validation can insert new parameters into the PSet of the module and thus alter the process configuration hash
  • constructed in order to get data dependencies (consumes() and produces())
    • produces() information is also used to enforce that the cases of the SwitchProducer produce exactly the same products

A possible way, with minimal impact on the existing code, would be:

  • Use #ifdef to hide code that cannot be compiled for platform X, such that the constructor, destructor, and fillDescriptions() are always available with equal behavior
    • for the constructor and destructor, equal behavior with respect to consumes+produces is sufficient
    • the constructor and destructor must not have side effects
  • Assume that all modules can be loaded within one process and one $SCRAM_ARCH
    • essentially implying options 3 or 4 of the issue description, should we ever hit a case where we must avoid loading some shared object
  • After constructing all modules, i.e. after figuring out the data dependence DAG, immediately delete all unscheduled modules (= not in any Path or EndPath) whose output is not consumed
  • Architecture-dependent initialization should be done in beginJob() or beginStream()
    • which are not called if the module is deleted because no one consumed its output
  • Add globalBeginJob() to stream modules with a GlobalCache (all other cases already have beginJob() or beginStream()); a sketch follows after this list
    • run after the constructor, but before beginStream()
    • one call (global) per module (no stream replication)
    • not called if the module is deleted
    • e.g. for ML inference (TensorFlow etc.), the global cache would be initialized here
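A sketch of the proposed hook (the cache type and producer name are hypothetical, and globalBeginJob() with this signature is the addition being proposed here, not an existing interface at the time of writing):

```cpp
#include "FWCore/Framework/interface/stream/EDProducer.h"
#include "FWCore/Framework/interface/Event.h"
#include "FWCore/Framework/interface/MakerMacros.h"
#include "FWCore/ParameterSet/interface/ParameterSet.h"

#include <memory>

struct HypotheticalMLCache {
  // e.g. a TensorFlow graph/session would live here
};

class HypotheticalMLProducer
    : public edm::stream::EDProducer<edm::GlobalCache<HypotheticalMLCache>> {
public:
  HypotheticalMLProducer(edm::ParameterSet const&, HypotheticalMLCache const*) {}

  // runs at construction time, so it must stay cheap and side-effect free
  static std::unique_ptr<HypotheticalMLCache> initializeGlobalCache(edm::ParameterSet const&) {
    return std::make_unique<HypotheticalMLCache>();
  }

  // proposed hook: one global call per module, after all constructors and
  // before any beginStream(); skipped if the module gets deleted because
  // nothing consumes its output
  static void globalBeginJob(HypotheticalMLCache* cache) {
    // expensive, architecture-dependent initialization (e.g. loading an
    // ML model) would go here instead of in the constructor
  }

  static void globalEndJob(HypotheticalMLCache*) {}

  void produce(edm::Event&, edm::EventSetup const&) override {}
};

DEFINE_FWK_MODULE(HypotheticalMLProducer);
```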

Applying the same constraints to Services and ESProducers implies:

  • Must be able to construct all Services. Architecture-dependent code in a constructor must be protected with a run-time check (or with #ifdef if it cannot even be built)
  • Must be able to construct all ESProducers.
    • In principle that would be sufficient, because the producing functions are called on demand, i.e. it would seem safe to call architecture-dependent code in produce(). An exception comes from hltGetConditions, which leads to calling the producing functions of all ESProducers.
      • After all EDModules in the HLT menu use ES-consumes, and we get concurrent ESProducers, could hltGetConditions perhaps become unnecessary? Ideally the framework would run all consumed ESProducers before and outside of the EDModule transition functions, although the exact time would be unspecified. On the other hand, already now the moment when hltGetConditions gets run with respect to other modules in the menu seems to be unspecified (unless I missed some data dependency on a quick look).
    • In the meantime, the ESProducer producing functions must not call architecture-dependent code (or must protect it at run time), and the functions must always produce a valid ESProduct.
      • The currently recommended pattern for CUDA ESProducts almost achieves this: the ESProduct is a wrapper that holds the data in CPU memory, and contains the code, called from the EDModule, to initiate the transfers. The recipe would only need to be updated to protect the CUDA-specific code with a run-time check for CUDA availability.
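A sketch of that updated recipe, with hypothetical names; the cudaGetDeviceCount() call stands in for whatever run-time availability check (e.g. via CUDAService) would be used in practice. This protects against a GPU being absent at run time; on architectures where CUDA cannot even be built, an #ifdef would still be needed around the device path.

```cpp
#include <cuda_runtime.h>

#include <vector>

// The ESProduct wrapper is always constructible and always valid on the CPU;
// only the device path touches CUDA, and only behind a run-time check.
class HypotheticalConditionsWrapper {
public:
  std::vector<float> const& cpuData() const { return data_; }

  // called from the consuming EDModule, not from the ESProducer's produce()
  float const* deviceData() const {
    int nDevices = 0;
    if (cudaGetDeviceCount(&nDevices) != cudaSuccess || nDevices == 0) {
      return nullptr;  // no usable GPU at run time: callers stay on the CPU path
    }
    // ... initiate (or reuse) the host-to-device transfer here ...
    return devicePtr_;
  }

private:
  std::vector<float> data_;     // always filled by the ESProducer
  float* devicePtr_ = nullptr;  // populated lazily on the device path
};
```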

@makortel

  • loaded in order to run configuration validation with fillDescriptions()

    • the validation can insert new parameters into the PSet of the module and thus alter the process configuration hash
  • constructed in order to get data dependencies (consumes() and produces())

    • produces() information is also used to enforce that the cases of the SwitchProducer produce exactly the same products

In the long run we could think of moving more functionality to python and imposing certain restrictions:

  • mark fillDescriptions()-generated cfi fragments as "complete", i.e. new parameters cannot be added nor existing ones removed (see Prune non-consumed EDProducers that are only in Tasks #26438 (comment))
    • or more precisely: all operations that could cause the configuration validation to change the module PSet would be forbidden
  • require that the module PSets contain all input data dependencies as InputTags, and only the ones that the module really consumes
    • while this would have other benefits as well, by making the process configuration more self-contained, it would be a rather big change and would require going over the code of nearly all modules

Then we would not need to validate the module configurations, nor call the constructors to build the data dependence DAG of modules. Neither of these changes would help to enforce that the SwitchProducer cases produce exactly the same products.

Instead of building the module dependence DAG in python, we could think of SwitchTask+SwitchSequence (see #26438 (comment)), but I'm a bit afraid those won't scale, creating a bigger mess of two nearly identical tasks for different "devices" that have to be carefully kept track of. The SwitchTask approach would also not help to enforce that the SwitchProducer cases produce exactly the same products.

@makortel

  • Assume that all modules can be loaded within one process and one $SCRAM_ARCH
    • essentially implying options 3 or 4 of the issue description, should we ever hit a case where we must avoid loading some shared object

I do see a risk here that we continue to bundle everything together, and if something not-generally-loadable appears, we may have to restructure a lot of code. On the other hand, going for complicated solutions now would carry the risk of doing possibly unnecessary work.


makortel commented Apr 7, 2020

Replying here to @fwyzard's comment #27983 (comment)

the framework needs to be able to construct all EDModules and ESProducers for all the platforms the configuration supports. I.e. the CUDA EDProducers need to exist even when we can't build CUDA, implying that the pieces of code depending on CUDA need to be protected with #ifdefs.

For what it's worth, I really don't like this approach.

I don't like it much either.

At this point, I think that the requirement to have unique hashes for the configuration is doing more harm to the process than (any?) good.

There are more issues than just the "configuration hash". If we cannot build an EDModule for some architecture:

  • We cannot generate the cfi fragments, and loading/importing any configuration that uses them would fail (also when using cms.Modifier.toReplaceWith() etc.)
    • On the other hand, the cfi fragments should be architecture-independent, so we could think of setting up a $SCRAM_ARCH-independent "repository" of cfi fragments, which would contain the fragments for all $SCRAM_ARCHs. This would work as long as there is one $SCRAM_ARCH release that can generate each fragment.
    • We could load such "possibly missing cfi" or "possibly non-functional cff" fragments with cms.Modifier.makeProcessModifier(). I would not be surprised if we ran into some problems with it (e.g. ordering), because the mechanism was not designed to be widely used, but rather as "a last resort". This approach would require giving up on the "portable configuration" goal.
  • We cannot validate the process configuration if it contains any such modules. This point is important for the "equal configuration hash across architectures".
  • With the SwitchProducer approach, the idea is to make only the modules that produce products "equal" to the CPU products "switchable", and leave their dependencies to be run on demand.
    • The "on demand" part requires knowledge of the data dependencies. Currently those are known only after the construction of all EDModules.
      • We could require that such modules declare all their ED/ES data dependencies with InputTag and ESInputTag, and do the dependence analysis on the python side (How to deal with plugins not being built for some architectures? #28576 (comment)).
      • We could introduce SwitchTask and SwitchSequence to declare explicitly when to run which module. I'm a bit afraid this approach would result in a mess when applied at large scale.
    • We could think of some other way of declaring a "portable configuration"
  • We could not enforce in all jobs that all case-producers of a SwitchProducer produce exactly the same products
    • We could relax the enforcement by doing it only for those cases that are available on the machine the job is running on. Given that the CPU module would always be available, i.e. all jobs would compare against that, this approach might be sufficient. Some error cases might be caught later than with the current enforcement.


makortel commented Sep 1, 2020

Summary of a discussion in the core software meeting today

In the short term (#31261), continue with the #ifdef approach and explore ways to make it look nicer (a sketch of the resulting pattern follows the list below):

  • When CUDA is not available at compile time, define a "dummy" producer that provides a constructor, a fillDescriptions() equivalent to the fillDescriptions() of the real producer, and a produce() function that is either empty or throws an exception
  • Accept the difference that the real producer declares consumes() and produces() while the "dummy" producer does not. This is simpler (the CUDA data formats can be left unbuilt), and forces us to discover whether we really need the consumes+produces on a SCRAM_ARCH that does not support CUDA. Currently the following consequences are known, but we believe they should not really be problematic:
    • callWhenNewProductsRegistered() behaves differently on different architectures
      • The non-built EDModule would not run properly anyway
      • callWhenNewProductsRegistered() should really not use products whose production is controlled by the SwitchProducer anyway
    • EDModule dependency tools like DependencyGraph or Tracer will give different answers on different architectures
      • We believe this to be acceptable
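A sketch of this pattern (the names and the preprocessor flag are hypothetical): both variants expose the same fillDescriptions(), so cfi generation and configuration validation are identical everywhere, while only the real variant declares any work.

```cpp
#include "FWCore/Framework/interface/global/EDProducer.h"
#include "FWCore/Framework/interface/Event.h"
#include "FWCore/Framework/interface/MakerMacros.h"
#include "FWCore/ParameterSet/interface/ConfigurationDescriptions.h"
#include "FWCore/ParameterSet/interface/ParameterSet.h"
#include "FWCore/ParameterSet/interface/ParameterSetDescription.h"
#include "FWCore/Utilities/interface/Exception.h"

namespace {
  // shared, so the real and dummy fillDescriptions() cannot drift apart
  void fillCommonDescription(edm::ConfigurationDescriptions& descriptions) {
    edm::ParameterSetDescription desc;
    descriptions.addWithDefaultLabel(desc);
  }
}  // namespace

#ifdef HYPOTHETICAL_CUDA_AVAILABLE

class HypotheticalGPUProducer : public edm::global::EDProducer<> {
public:
  explicit HypotheticalGPUProducer(edm::ParameterSet const&) {
    // real variant: consumes() and produces() declarations go here
  }
  static void fillDescriptions(edm::ConfigurationDescriptions& descriptions) {
    fillCommonDescription(descriptions);
  }
  void produce(edm::StreamID, edm::Event&, edm::EventSetup const&) const override {
    // ... the actual CUDA work ...
  }
};

#else

class HypotheticalGPUProducer : public edm::global::EDProducer<> {
public:
  explicit HypotheticalGPUProducer(edm::ParameterSet const&) {
    // dummy variant: intentionally no consumes()/produces()
  }
  static void fillDescriptions(edm::ConfigurationDescriptions& descriptions) {
    fillCommonDescription(descriptions);  // must stay equivalent to the real one
  }
  void produce(edm::StreamID, edm::Event&, edm::EventSetup const&) const override {
    throw cms::Exception("UnsupportedArchitecture")
        << "HypotheticalGPUProducer was built without CUDA support";
  }
};

#endif

DEFINE_FWK_MODULE(HypotheticalGPUProducer);
```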

For the longer term, some new thoughts:

  • Some options for cfi fragments
    • A common, architecture-independent place in the release installation area to collect the cfi files
      • What if the IB/release of the SCRAM_ARCH populating that area has build errors?
      • What if no single SCRAM_ARCH can build all EDModules?
      • Makes the testing step of all SCRAM_ARCHs of a given CMSSW branch depend on the build step of one or all SCRAM_ARCHs
    • (variation of the previous) Upon import Package.SubPackage.fragment_cfi, if scram knows that that file has "intentionally" not been generated, it could go and find the cfi fragment from the production SCRAM_ARCH's area
  • What if we adopt one of the cfi options, and just accept that some EDModules have not been built?
    • The framework would have to know that an EDModule (or any plugin) does not exist because it cannot be built for that architecture
    • SwitchProducer would not be able to fully enforce that the producer of each case produces the same products
    • Information on the consumed and produced products of non-built EDModules would not be available
      • As discussed above for the #ifdef, they are likely not necessary and we could live without them
  • What if data products cannot be built?
    • It is possible to make data products always build (std::any as an extreme example), but it is unclear whether that would really be useful
    • Easier to start by just ignoring data products (no consumes/produces declarations)
  • Alternative: deal with the "accelerator" variability within EDModules instead of exposing all of it in the configuration (with a separate EDModule class for each "accelerator"); a rough sketch follows after this list
    • With a portability technology (like Alpaka, Kokkos, or SYCL) we aim at a single source for the EDModule and algorithms anyway, so flexibility in organizing the work into EDModules could be less necessary
    • In SYCL/oneAPI the target "accelerator" is chosen at run time, so this approach would work naturally with SYCL
      • Alpaka, Kokkos, or any other solution where the target "accelerator" is chosen at compile time could probably be made to work with an additional layer of indirection
      • What if SYCL/oneAPI is not available for, or does not work with, some SCRAM_ARCH?
    • Some similarities with option 3 in the description (additional layer of plugins)
    • Would require "accelerator"-independent data formats (at the type level at least, but we'd probably want those anyway; see Issues related to GPU data structures #29297)
    • Could be worth prototyping
    • Need to support the current CUDA approach at least until everything has been migrated to a new way
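A very rough sketch (all names hypothetical) of that alternative: a single EDProducer in the configuration, with an indirection layer that picks the backend at run time; a GPU backend would be registered only where its toolchain is available.

```cpp
#include "FWCore/Framework/interface/global/EDProducer.h"
#include "FWCore/Framework/interface/Event.h"
#include "FWCore/Framework/interface/MakerMacros.h"
#include "FWCore/ParameterSet/interface/ParameterSet.h"

#include <memory>

class HypotheticalPortableProducer : public edm::global::EDProducer<> {
public:
  explicit HypotheticalPortableProducer(edm::ParameterSet const&)
      : backend_(chooseBackend()) {}  // run-time choice, in the spirit of SYCL device selection

  void produce(edm::StreamID, edm::Event&, edm::EventSetup const&) const override {
    backend_->run();  // the configuration never sees which backend this is
  }

private:
  // the extra layer of indirection mentioned above
  struct Backend {
    virtual ~Backend() = default;
    virtual void run() const = 0;
  };
  struct CpuBackend : Backend {
    void run() const override { /* serial or vectorized CPU code */ }
  };
  // a GpuBackend would be added here only where its compiler/runtime exists

  static std::unique_ptr<Backend> chooseBackend() {
    // query whatever run-time service reports available "accelerators";
    // in this sketch only the CPU backend exists, so fall back to it
    return std::make_unique<CpuBackend>();
  }

  std::unique_ptr<Backend> backend_;
};

DEFINE_FWK_MODULE(HypotheticalPortableProducer);
```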

In the specific case of CUDA 11 and gcc 10, we could work around it with fallback compiler options (use gcc 9 with nvcc, and gcc 10 elsewhere), but this was considered potentially fragile (e.g. towards C++ modules), and a more general solution was preferred.

@makortel

The problems with CUDA modules have been addressed in the Alpaka module system. The approach there (ModuleTypeResolver, #40383 and links therein) has some similarities with

  • Alternative: deal with the "accelerator" variability within EDModules instead of exposing all that into the configuration (with a separate EDModule class for each "accelerator")

but there are also some differences.

I think we can close this issue, and if some other system (like ML inference, although there we are already dealing with this problem in a different way) faces a similar challenge, open a new issue then.

@makortel

+core

@cmsbuild

cms-bot internal usage

@cmsbuild

This issue is fully signed and ready to be closed.
