
Workaround to produce exactly the same data products in Serial and CUDA backends in Alpaka modules possibly used at HLT #44698

Merged

Conversation

makortel
Contributor

PR description:

This PR is a temporary workaround for the issue discussed in #44643. In short, with an HLT menu that uses Alpaka modules, when some HLT processes use a GPU and some do not, the fact that the different backends of the Alpaka modules produce different transient data products causes the ProductIDs to differ between the CPU-only and CPU+GPU processes. At present there is not enough metadata available in the final streamer file for the framework to properly keep track of the various indices, which causes dereferencing of edm::Ref to fail in a subsequent job (for a longer description see #44643 (comment)).

This PR works around the problem by making a subset(*) of the Alpaka EDProducers on the CPU serial backend register, for each "device-side data product" they produce (which in reality is the "host-side data product" directly), the production of the corresponding CUDA backend data product as well. This hack makes the CPU serial and CUDA backend EDProducers register exactly the same set of data products, which leads to equal ProductIDs for the same data products between CPU-only and CPU+GPU jobs, circumventing the problem.

(*) ECAL, PF, and phase1 Pixel EDProducers, which are either used in the present HLT menu or planned to be used in the near future

The use of strings for the CUDA backend data products is ugly, but it avoids a compile-time dependence on the CUDA backend code in the CPU serial backend code (I actually tried that first, but ran into compilation issues; probably our present build rules prevent the use of Alpaka CUDA backend types in the CPU serial backend code). At runtime the added explicit CUDA dependence will likely break on platforms that do not support CUDA (all our code directly depending on CUDA would be broken there anyway, so the temporary loss of functionality seems acceptable).
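
To make the idea concrete, here is a minimal, self-contained C++ sketch of the principle. It does not use the actual CMSSW registration interfaces; the MockRegistry class, the ecalDigis module label, the HLT process name, and the type-name strings are all illustrative placeholders. The point is only that the serial-backend producer declares, by string, an extra branch whose identifying fields match what the CUDA backend would register, so that both backends end up with identical branch lists.

```cpp
// Illustrative sketch only -- not the CMSSW produces() / ProductRegistryHelper API.
// A "branch" is identified by the same four strings the real framework uses:
// friendly class name, module label, product instance name, and process name.
#include <iostream>
#include <string>
#include <utility>
#include <vector>

struct BranchKey {
  std::string friendlyClassName;
  std::string moduleLabel;
  std::string productInstanceName;
  std::string processName;
};

class MockRegistry {
public:
  void produces(BranchKey key) { branches_.push_back(std::move(key)); }
  const std::vector<BranchKey>& branches() const { return branches_; }

private:
  std::vector<BranchKey> branches_;
};

void registerSerialBackendProducer(MockRegistry& reg) {
  // The real host-side product of the serial backend.
  reg.produces({"HostProductType", "ecalDigis", "", "HLT"});
  // The mock-up branch, declared (by string) only so that the branch list
  // matches the CUDA backend; on the serial backend nothing is ever put into it.
  reg.produces({"CudaDeviceProductType", "ecalDigis", "", "HLT"});
}

int main() {
  MockRegistry registry;
  registerSerialBackendProducer(registry);
  for (const auto& b : registry.branches())
    std::cout << b.friendlyClassName << '_' << b.moduleLabel << '_'
              << b.productInstanceName << '_' << b.processName << '\n';
}
```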

This PR is intended to be reverted once the necessary metadata to deal with the different ProductIDs in different HLT processes (CPU-only vs. CPU+GPU) gets propagated to the framework; the details of that metadata are being discussed in #44643.

PR validation:

I ran the example HLT job in #44643 (comment) for 1 event, without and with this PR, both with and without a GPU. Then I ran edmProvDump --productIDEntry 0 -a <file> for both output files and compared the lines that contained ProductID 2 (this shows the set of data products that are either stored in the file or are ancestors of the stored data products, produced by the re-HLT process).

Without this PR the ProductIDs between CPU-only and CPU+GPU jobs show differences (#44643 (comment)).

With this PR the ProductIDs between CPU-only and CPU+GPU jobs are the same.

If this PR is a backport please specify the original PR and why you need to backport that PR. If this PR will be backported please specify to which release cycle the backport is meant for:

To be backported to 14_0_X.

…ackends in Alpaka modules possibly used at HLT

To be reverted in the near future after the necessary metadata to deal
with different ProductIDs in different HLT processes (CPU-only vs.
CPU+GPU) gets propagated to the framework.
@cmsbuild
Contributor

cmsbuild commented Apr 10, 2024

cms-bot internal usage

@cmsbuild
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-44698/39907

@cmsbuild
Contributor

A new Pull Request was created by @makortel for master.

It involves the following packages:

  • EventFilter/EcalRawToDigi (reconstruction)
  • HeterogeneousCore/AlpakaCore (heterogeneous)
  • RecoLocalCalo/EcalRecProducers (reconstruction)
  • RecoLocalTracker/SiPixelClusterizer (reconstruction)
  • RecoLocalTracker/SiPixelRecHits (reconstruction)
  • RecoParticleFlow/PFClusterProducer (reconstruction)
  • RecoParticleFlow/PFRecHitProducer (reconstruction)
  • RecoTracker/PixelSeeding (reconstruction)
  • RecoTracker/PixelVertexFinding (reconstruction)
  • RecoVertex/BeamSpotProducer (reconstruction, alca)

@mandrenguyen, @fwyzard, @perrotta, @consuegs, @saumyaphor4252, @cmsbuild, @jfernan2, @makortel can you please review it and eventually sign? Thanks.
@JanFSchulte, @mmarionncern, @GiacomoSguazzoni, @mroguljic, @tocheng, @VinInn, @threus, @rsreds, @tsusa, @thomreis, @hatakeyamak, @lgray, @rovere, @youyingli, @mmusich, @VourMa, @tvami, @ferencek, @dkotlins, @gpetruc, @argiro, @francescobrivio, @missirol, @wang0jin, @seemasharmafnal, @rchatter, @dgulhan, @sameasy, @Martin-Grunewald, @apsallid, @fabiocos, @ReyerBand, @yuanchao, @mtosi, @felicepantaleo this is something you requested to watch as well.
@rappoccio, @antoniovilela, @sextonkennedy you are the release manager for this.

cms-bot commands are listed here

@makortel
Contributor Author

enable gpu

@makortel
Contributor Author

@cmsbuild, please test

@makortel
Contributor Author

urgent

@cmsbuild
Contributor

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-47488d/38758/summary.html
COMMIT: 2d853ef
CMSSW: CMSSW_14_1_X_2024-04-10-1100/el8_amd64_gcc12
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/44698/38758/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

  • You potentially added 111 lines to the logs
  • Reco comparison results: 49 differences found in the comparisons
  • DQMHistoTests: Total files compared: 48
  • DQMHistoTests: Total histograms compared: 3316263
  • DQMHistoTests: Total failures: 3
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3316240
  • DQMHistoTests: Total skipped: 20
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB (47 files compared)
  • Checked 202 log files, 165 edm output root files, 48 DQM output files
  • TriggerResults: no differences found

GPU Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 0 differences found in the comparisons
  • DQMHistoTests: Total files compared: 3
  • DQMHistoTests: Total histograms compared: 39740
  • DQMHistoTests: Total failures: 20
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 39720
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB (2 files compared)
  • Checked 8 log files, 10 edm output root files, 3 DQM output files
  • TriggerResults: no differences found

@jfernan2
Contributor

+1

@perrotta
Contributor

+alca

@fwyzard
Contributor

fwyzard commented Apr 11, 2024

hi @makortel, a question about the workaround.

Does the branch ID depend on the branch type or name, or only on their presence?

I know it's not (even close to) a production environment, but I'm curious: what would happen if one mixes jobs running on the CPU and on AMD GPUs?

@makortel
Contributor Author

Does the branch ID depend on the branch type or name, or only on their presence?

Both the BranchID and the ProductID depend on the friendly class name, the module label, the product instance name, and the process name (BranchID being a hash of those, and ProductID being an index into a container ordered by those). I.e. these "mock-up products" for the CPU-only case must reflect the other case exactly in order to reproduce the same behavior.
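
As a rough, standalone illustration of that dependence (this is not the framework's actual hashing or ordering code, and the branch entries below are made-up placeholders), one can think of the BranchID as a hash of those four strings and of the ProductID as the position of the branch within the ordered container of all registered branches; an extra or missing branch anywhere in the job can therefore shift ProductIDs while leaving BranchIDs untouched.

```cpp
// Standalone sketch: BranchID ~ hash of a branch's four identifying strings,
// ProductID ~ 1-based index into the ordered set of all registered branches.
// Not the framework's actual implementation; branch entries are placeholders.
#include <cstddef>
#include <functional>
#include <iostream>
#include <iterator>
#include <set>
#include <string>
#include <tuple>

// (friendly class name, module label, product instance name, process name)
using Branch = std::tuple<std::string, std::string, std::string, std::string>;

std::size_t mockBranchID(const Branch& b) {
  // Hash of only this branch's own fields: adding or removing some *other*
  // branch in the job does not change this value.
  return std::hash<std::string>{}(std::get<0>(b) + '_' + std::get<1>(b) + '_' +
                                  std::get<2>(b) + '_' + std::get<3>(b));
}

int mockProductID(const std::set<Branch>& all, const Branch& b) {
  // Index into the ordered container of all branches: adding or removing *any*
  // branch can shift this value, hence the need for the mock-up products.
  return static_cast<int>(std::distance(all.begin(), all.find(b))) + 1;
}

int main() {
  const std::set<Branch> branches = {
      {"HostProductType", "ecalDigis", "", "HLT"},
      {"CudaDeviceProductType", "ecalDigis", "", "HLT"},  // mock-up entry
  };
  for (const auto& b : branches)
    std::cout << "BranchID " << mockBranchID(b) << "  ProductID "
              << mockProductID(branches, b) << '\n';
}
```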

I know it's not (even close to) a production environment, but I'm curious: what would happen if one mixes jobs running on the CPU and on AMD GPUs?

The CPU+AMD GPU job might or might not have different ProductIDs than a CPU+NVIDIA GPU job. It really depends on the other data products in the job (mainly their friendly class names) whether the NVIDIA vs. AMD difference in the type names makes a difference or not.

If one were to require the same ProductIDs for CPU-only, CPU+NVIDIA, and CPU+AMD jobs, then every backend would have to declare the production of all three flavors of data products (which is why we don't do that).

The workaround in this PR is thus very specific to the present HLT farm hardware.

@fwyzard
Contributor

fwyzard commented Apr 11, 2024

+heterogeneous

@cmsbuild
Contributor

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @rappoccio, @antoniovilela, @sextonkennedy (and backports should be raised in the release meeting by the corresponding L2)

@antoniovilela
Contributor

+1
