
Workaround to produce exactly the same data products in Serial and CUDA backends in Alpaka modules possibly used at HLT #44698

Merged

Conversation

makortel
Contributor

PR description:

This PR is a temporary workaround for the issue discussed in #44643. In short, with an HLT menu that uses Alpaka modules, when some HLT processes use a GPU and some do not, the fact that the different backends of the Alpaka modules produce different transient data products causes the ProductIDs to differ between the CPU-only and CPU+GPU processes. At present there is not enough metadata available in the final streamer file for the framework to properly keep track of the various indices, which causes dereferencing of edm::Ref to fail in a subsequent job (for a longer description see #44643 (comment)).

This PR works around the problem by making a subset(*) of the Alpaka EDProducers on the CPU serial backend register, for each "device-side data product" they produce (which in reality is the "host-side data product" directly), the production of the corresponding CUDA backend data product as well. This hack makes the CPU serial and CUDA backend EDProducers register exactly the same set of data products, which leads to equal ProductIDs for the same data products between CPU-only and CPU+GPU jobs, circumventing the problem.

(*) ECAL, PF, and phase1 Pixel EDProducers, which are either used in the present HLT menu or planned to be used in the near future

The use of strings for the CUDA backend data products is ugly, but it avoids a compile-time dependence on the CUDA backend code in the CPU serial backend code (I actually tried that first, but ran into compilation issues; probably our present build rules prevent the use of Alpaka CUDA backend types in the CPU serial backend code). At runtime the added explicit CUDA dependence will likely break on platforms that do not support CUDA (all our code directly depending on CUDA would be broken there anyway, so the temporary loss of functionality seems acceptable).
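
To make the idea concrete, here is a minimal, self-contained C++ sketch of the principle. It does not use the actual CMSSW registration interfaces; the MockRegistry class, the ecalDigis module label, the HLT process name, and the type-name strings are all illustrative placeholders. The point is only that the serial-backend producer declares, by string, an extra branch whose identifying fields match what the CUDA backend would register, so that both backends end up with identical branch lists.

```cpp
// Illustrative sketch only -- not the CMSSW produces() / ProductRegistryHelper API.
// A "branch" is identified by the same four strings the real framework uses:
// friendly class name, module label, product instance name, and process name.
#include <iostream>
#include <string>
#include <utility>
#include <vector>

struct BranchKey {
  std::string friendlyClassName;
  std::string moduleLabel;
  std::string productInstanceName;
  std::string processName;
};

class MockRegistry {
public:
  void produces(BranchKey key) { branches_.push_back(std::move(key)); }
  const std::vector<BranchKey>& branches() const { return branches_; }

private:
  std::vector<BranchKey> branches_;
};

void registerSerialBackendProducer(MockRegistry& reg) {
  // The real host-side product of the serial backend.
  reg.produces({"HostProductType", "ecalDigis", "", "HLT"});
  // The mock-up branch, declared (by string) only so that the branch list
  // matches the CUDA backend; on the serial backend nothing is ever put into it.
  reg.produces({"CudaDeviceProductType", "ecalDigis", "", "HLT"});
}

int main() {
  MockRegistry registry;
  registerSerialBackendProducer(registry);
  for (const auto& b : registry.branches())
    std::cout << b.friendlyClassName << '_' << b.moduleLabel << '_'
              << b.productInstanceName << '_' << b.processName << '\n';
}
```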

This PR is intended to be reverted once the necessary metadata to deal with the different ProductIDs in different HLT processes (CPU-only vs. CPU+GPU) gets propagated to the framework; the details of that metadata are being discussed in #44643.

PR validation:

I ran the example HLT job in #44643 (comment) for 1 event, without and with this PR, both with and without a GPU. Then I ran edmProvDump --productIDEntry 0 -a <file> for both output files and compared the lines that contained ProductID 2 (this shows the set of data products that are either stored in the file or are ancestors of the stored data products, produced by the re-HLT process).

Without this PR the ProductIDs between CPU-only and CPU+GPU jobs show differences (#44643 (comment)).

With this PR the ProductIDs between CPU-only and CPU+GPU jobs are the same.

If this PR is a backport please specify the original PR and why you need to backport that PR. If this PR will be backported please specify to which release cycle the backport is meant for:

To be backported to 14_0_X.

…ackends in Alpaka modules possibly used at HLT

To be reverted in the near future after the necessary metadata to deal
with different ProductIDs in different HLT processes (CPU-only vs.
CPU+GPU) gets propagated to the framework.
@cmsbuild
Contributor

cmsbuild commented Apr 10, 2024

cms-bot internal usage

@cmsbuild
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-44698/39907

@cmsbuild
Contributor

A new Pull Request was created by @makortel for master.

It involves the following packages:

  • EventFilter/EcalRawToDigi (reconstruction)
  • HeterogeneousCore/AlpakaCore (heterogeneous)
  • RecoLocalCalo/EcalRecProducers (reconstruction)
  • RecoLocalTracker/SiPixelClusterizer (reconstruction)
  • RecoLocalTracker/SiPixelRecHits (reconstruction)
  • RecoParticleFlow/PFClusterProducer (reconstruction)
  • RecoParticleFlow/PFRecHitProducer (reconstruction)
  • RecoTracker/PixelSeeding (reconstruction)
  • RecoTracker/PixelVertexFinding (reconstruction)
  • RecoVertex/BeamSpotProducer (reconstruction, alca)

@mandrenguyen, @fwyzard, @perrotta, @consuegs, @saumyaphor4252, @cmsbuild, @jfernan2, @makortel can you please review it and eventually sign? Thanks.
@JanFSchulte, @mmarionncern, @GiacomoSguazzoni, @mroguljic, @tocheng, @VinInn, @threus, @rsreds, @tsusa, @thomreis, @hatakeyamak, @lgray, @rovere, @youyingli, @mmusich, @VourMa, @tvami, @ferencek, @dkotlins, @gpetruc, @argiro, @francescobrivio, @missirol, @wang0jin, @seemasharmafnal, @rchatter, @dgulhan, @sameasy, @Martin-Grunewald, @apsallid, @fabiocos, @ReyerBand, @yuanchao, @mtosi, @felicepantaleo this is something you requested to watch as well.
@rappoccio, @antoniovilela, @sextonkennedy you are the release manager for this.

cms-bot commands are listed here

@makortel
Contributor Author

enable gpu

@makortel
Contributor Author

@cmsbuild, please test

@makortel
Contributor Author

urgent

@cmsbuild
Contributor

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-47488d/38758/summary.html
COMMIT: 2d853ef
CMSSW: CMSSW_14_1_X_2024-04-10-1100/el8_amd64_gcc12
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/44698/38758/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

  • You potentially added 111 lines to the logs
  • Reco comparison results: 49 differences found in the comparisons
  • DQMHistoTests: Total files compared: 48
  • DQMHistoTests: Total histograms compared: 3316263
  • DQMHistoTests: Total failures: 3
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3316240
  • DQMHistoTests: Total skipped: 20
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB (47 files compared)
  • Checked 202 log files, 165 edm output root files, 48 DQM output files
  • TriggerResults: no differences found

GPU Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 0 differences found in the comparisons
  • DQMHistoTests: Total files compared: 3
  • DQMHistoTests: Total histograms compared: 39740
  • DQMHistoTests: Total failures: 20
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 39720
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB (2 files compared)
  • Checked 8 log files, 10 edm output root files, 3 DQM output files
  • TriggerResults: no differences found

@jfernan2
Contributor

+1

@perrotta
Contributor

+alca

@fwyzard
Contributor

fwyzard commented Apr 11, 2024

hi @makortel, a question about the workaround.

Does the branch ID depend on the branch type or name, or only on their presence?

I know it's not (even close to) a production environment, but I'm curious: what would happen if one mixes jobs running on the CPU and on AMD GPUs?

@makortel
Contributor Author

Does the branch ID depend on the branch type or name, or only on their presence?

Both the BranchID and the ProductID depend on the friendly class name, the module label, the product instance name, and the process name (BranchID being a hash of those, and ProductID being an index into a container ordered by those). I.e. these "mock-up products" for the CPU-only case must reflect the other case exactly in order to reproduce the same behavior.
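
As a rough, standalone illustration of that dependence (this is not the framework's actual hashing or ordering code, and the branch entries below are made-up placeholders), one can think of the BranchID as a hash of those four strings and of the ProductID as the position of the branch within the ordered container of all registered branches; an extra or missing branch anywhere in the job can therefore shift ProductIDs while leaving BranchIDs untouched.

```cpp
// Standalone sketch: BranchID ~ hash of a branch's four identifying strings,
// ProductID ~ 1-based index into the ordered set of all registered branches.
// Not the framework's actual implementation; branch entries are placeholders.
#include <cstddef>
#include <functional>
#include <iostream>
#include <iterator>
#include <set>
#include <string>
#include <tuple>

// (friendly class name, module label, product instance name, process name)
using Branch = std::tuple<std::string, std::string, std::string, std::string>;

std::size_t mockBranchID(const Branch& b) {
  // Hash of only this branch's own fields: adding or removing some *other*
  // branch in the job does not change this value.
  return std::hash<std::string>{}(std::get<0>(b) + '_' + std::get<1>(b) + '_' +
                                  std::get<2>(b) + '_' + std::get<3>(b));
}

int mockProductID(const std::set<Branch>& all, const Branch& b) {
  // Index into the ordered container of all branches: adding or removing *any*
  // branch can shift this value, hence the need for the mock-up products.
  return static_cast<int>(std::distance(all.begin(), all.find(b))) + 1;
}

int main() {
  const std::set<Branch> branches = {
      {"HostProductType", "ecalDigis", "", "HLT"},
      {"CudaDeviceProductType", "ecalDigis", "", "HLT"},  // mock-up entry
  };
  for (const auto& b : branches)
    std::cout << "BranchID " << mockBranchID(b) << "  ProductID "
              << mockProductID(branches, b) << '\n';
}
```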

I know it's not (even close to) a production environment, but I'm curious: what would happen if one mixes jobs running on the CPU and on AMD GPUs?

The CPU+AMD GPU job might or might not have different ProductIDs than a CPU+NVIDIA GPU job. It really depends on the other data products in the job (mainly their friendly class names) whether the NVIDIA vs. AMD difference in the type names makes a difference or not.

If one were to require the same ProductIDs for CPU-only, CPU+NVIDIA, and CPU+AMD jobs, then every backend would have to declare the production of all three flavors of data products (which is why we don't do that).

The workaround in this PR is thus very specific to the present HLT farm hardware.

@fwyzard
Contributor

fwyzard commented Apr 11, 2024

+heterogeneous

@cmsbuild
Contributor

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @rappoccio, @antoniovilela, @sextonkennedy (and backports should be raised in the release meeting by the corresponding L2)

@antoniovilela
Contributor

+1
