[14_0_X] Workaround to produce exactly same data products in Serial and CUDA backends in Alpaka modules possibly used at HLT #44699

makortel · 2024-04-10T20:26:17Z

PR description:

This PR is a backport of #44698. Original description

This PR is a temporary workaround for the issue discussed in #44643. In short, with an HLT menu that uses Alpaka modules, when some HLT processes use GPU and some do not, the fact that the different backends of Alpaka modules produce different transient data products cause the ProductIDs to be different between the CPU-only and CPU+GPU processes, and presently there is not enough metadata available in the final streamer file for the framework to keep properly track of the various indices, that leads to de-referencing of edm::Ref to fail in a subsequent job (for longer description see #44643 (comment)).

This PR works around the problem by making a subset(*) of the Alpaka EDProducers on the CPU serial backend, for each "device-side data product" they produce (that in reality are the "host-side data products" directly) they register the production of the corresponding CUDA backend data product. This hack makes the CPU-serial and CUDA backend EDProducers to register the production of exactly the same data products, that leads to equal ProductIDs for the same data products between CPU-only and CPU+GPU jobs, circumventing the problem.

(*) ECAL, PF, and phase1 Pixel EDProducers, that are either being used in the present HLT menu, or are planned to be used in the near future

The use of strings for the CUDA backend data products is ugly, but avoids the need to have a compile-time dependence on the CUDA backend code in the CPU serial backend code (I actually tried that first, but ran into compilation issues; probably our present build rules prevent the use of Alpaka CUDA backend types in the CPU serial backend code). At runtime the added explicit CUDA dependence will likely break on platforms that do not support CUDA (all our code directly depending on CUDA would be broken anyway, so the temporary loss of functionality seems acceptable).

This PR is intended to be reverted when the necessary metadata to deal with the different ProductIDs in different HLT processes (CPU-only vs. CPU+GPU) gets propagated to the framework), whose details are being discussed in #44643 .

PR validation:

See #44698

If this PR is a backport please specify the original PR and why you need to backport that PR. If this PR will be backported please specify to which release cycle the backport is meant for:

Backport of #44698

…ackends in Alpaka modules possibly used at HLT To be reverted in the near future after the necessary metadata to deal with different ProductIDs in different HLT processes (CPU-only vs. CPU+GPU) gets propagated to the framework.

cmsbuild · 2024-04-10T20:26:43Z

A new Pull Request was created by @makortel for CMSSW_14_0_X.

It involves the following packages:

EventFilter/EcalRawToDigi (reconstruction)
HeterogeneousCore/AlpakaCore (heterogeneous)
RecoLocalCalo/EcalRecProducers (reconstruction)
RecoLocalTracker/SiPixelClusterizer (reconstruction)
RecoLocalTracker/SiPixelRecHits (reconstruction)
RecoParticleFlow/PFClusterProducer (reconstruction)
RecoParticleFlow/PFRecHitProducer (reconstruction)
RecoTracker/PixelSeeding (reconstruction)
RecoTracker/PixelVertexFinding (reconstruction)
RecoVertex/BeamSpotProducer (alca, reconstruction)

@perrotta, @fwyzard, @mandrenguyen, @consuegs, @cmsbuild, @jfernan2, @saumyaphor4252, @makortel can you please review it and eventually sign? Thanks.
@dkotlins, @mtosi, @tvami, @JanFSchulte, @GiacomoSguazzoni, @threus, @youyingli, @thomreis, @francescobrivio, @rovere, @sameasy, @ferencek, @tsusa, @missirol, @argiro, @VourMa, @ReyerBand, @felicepantaleo, @apsallid, @fabiocos, @Martin-Grunewald, @VinInn, @wang0jin, @gpetruc, @rchatter, @lgray, @mmarionncern, @dgulhan, @mmusich, @hatakeyamak, @yuanchao, @rsreds, @seemasharmafnal, @mroguljic, @tocheng this is something you requested to watch as well.
@antoniovilela, @sextonkennedy, @rappoccio you are the release manager for this.

cms-bot commands are listed here

Backported from Workaround to produce exactly same data products in Serial and CUDA backends in Alpaka modules possibly used at HLT #44698

cmsbuild · 2024-04-10T20:26:43Z

cms-bot internal usage

makortel · 2024-04-11T00:26:36Z

enable gpu

makortel · 2024-04-11T00:26:41Z

@cmsbuild, please test

makortel · 2024-04-11T00:26:44Z

urgent

cmsbuild · 2024-04-11T04:01:53Z

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-69312b/38763/summary.html
COMMIT: a77f7cd
CMSSW: CMSSW_14_0_X_2024-04-10-2300/el8_amd64_gcc12
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/44699/38763/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

You potentially added 84 lines to the logs
Reco comparison results: 55 differences found in the comparisons
DQMHistoTests: Total files compared: 48
DQMHistoTests: Total histograms compared: 3308451
DQMHistoTests: Total failures: 9
DQMHistoTests: Total nulls: 0
DQMHistoTests: Total successes: 3308422
DQMHistoTests: Total skipped: 20
DQMHistoTests: Total Missing objects: 0
DQMHistoSizes: Histogram memory added: 0.0 KiB( 47 files compared)
Checked 202 log files, 165 edm output root files, 48 DQM output files
TriggerResults: no differences found

GPU Comparison Summary

Summary:

No significant changes to the logs found
Reco comparison results: 32 differences found in the comparisons
DQMHistoTests: Total files compared: 3
DQMHistoTests: Total histograms compared: 39740
DQMHistoTests: Total failures: 1093
DQMHistoTests: Total nulls: 0
DQMHistoTests: Total successes: 38647
DQMHistoTests: Total skipped: 0
DQMHistoTests: Total Missing objects: 0
DQMHistoSizes: Histogram memory added: 0.0 KiB( 2 files compared)
Checked 8 log files, 10 edm output root files, 3 DQM output files
TriggerResults: no differences found

jfernan2 · 2024-04-11T06:10:28Z

+1

perrotta · 2024-04-11T07:50:14Z

backport of #44698

perrotta · 2024-04-11T07:52:00Z

+alca

Verbatim backport of Workaround to produce exactly same data products in Serial and CUDA backends in Alpaka modules possibly used at HLT #44698: once that PR is approved and successfully tested in master, for alca also the backport is fine

fwyzard · 2024-04-11T14:06:19Z

+heterogeneous

cmsbuild · 2024-04-11T14:06:46Z

This pull request is fully signed and it will be integrated in one of the next CMSSW_14_0_X IBs (tests are also fine) and once validation in the development release cycle CMSSW_14_1_X is complete. This pull request will now be reviewed by the release team before it's merged. @sextonkennedy, @antoniovilela, @rappoccio (and backports should be raised in the release meeting by the corresponding L2)

makortel · 2024-04-11T15:33:57Z

@cms-sw/orp-l2 I believe this PR would satisfy the constraints of a patch release

antoniovilela · 2024-04-11T16:22:21Z

@cms-sw/orp-l2 I believe this PR would satisfy the constraints of a patch release

Many thanks Matti. I was about to ask you this.

antoniovilela · 2024-04-12T09:19:54Z

+1

cmsbuild added this to the CMSSW_14_0_X milestone Apr 10, 2024

cmsbuild added reconstruction-pending alca-pending pending-signatures tests-pending orp-pending heterogeneous-pending tracking trk labels Apr 10, 2024

This was referenced Apr 10, 2024

ProductNotFound issue in T0 HLTMonitor processing jobs starting Run 378985 #44643

Open

Investigate feasibility of a quick workaround cms-sw/framework-team#877

Closed

cmsbuild added tests-started urgent and removed tests-pending labels Apr 11, 2024

cmsbuild added tests-approved and removed tests-started labels Apr 11, 2024

cmsbuild added reconstruction-approved and removed reconstruction-pending labels Apr 11, 2024

cmsbuild added the backport label Apr 11, 2024

cmsbuild removed the alca-pending label Apr 11, 2024

cmsbuild added the alca-approved label Apr 11, 2024

cmsbuild added fully-signed heterogeneous-approved and removed pending-signatures heterogeneous-pending labels Apr 11, 2024

cmsbuild added backport-ok and removed backport labels Apr 11, 2024

makortel mentioned this pull request Apr 11, 2024

Revert the workarounds cms-sw/framework-team#880

Open

2 tasks

cmsbuild added orp-approved and removed orp-pending labels Apr 12, 2024

cmsbuild merged commit c4c5b90 into cms-sw:CMSSW_14_0_X Apr 12, 2024
13 checks passed

antoniovilela mentioned this pull request Apr 12, 2024

Build CMSSW_14_0_5_patch1 #44722

Closed

cmsbuild mentioned this pull request Apr 12, 2024

updated parameters and model for photon XGboost MVA (HLT) [14_0_X (backport)] #44720

Merged

makortel deleted the productIDModuleTypeResolverWorkaround_140x branch April 12, 2024 15:50

mmusich mentioned this pull request May 16, 2024

Handle event meta data changes in streamer files [14_0] #44978

Open

makortel mentioned this pull request May 28, 2024

[14_0_X] Revert "Workaround to produce exactly same data products in Serial and CUDA backends in Alpaka modules possibly used at HLT" #45081

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[14_0_X] Workaround to produce exactly same data products in Serial and CUDA backends in Alpaka modules possibly used at HLT #44699

[14_0_X] Workaround to produce exactly same data products in Serial and CUDA backends in Alpaka modules possibly used at HLT #44699

makortel commented Apr 10, 2024

cmsbuild commented Apr 10, 2024 •

edited

cmsbuild commented Apr 10, 2024 •

edited

makortel commented Apr 11, 2024

makortel commented Apr 11, 2024

makortel commented Apr 11, 2024

cmsbuild commented Apr 11, 2024

jfernan2 commented Apr 11, 2024

perrotta commented Apr 11, 2024

perrotta commented Apr 11, 2024

fwyzard commented Apr 11, 2024

cmsbuild commented Apr 11, 2024

makortel commented Apr 11, 2024

antoniovilela commented Apr 11, 2024

antoniovilela commented Apr 12, 2024

[14_0_X] Workaround to produce exactly same data products in Serial and CUDA backends in Alpaka modules possibly used at HLT #44699

[14_0_X] Workaround to produce exactly same data products in Serial and CUDA backends in Alpaka modules possibly used at HLT #44699

Conversation

makortel commented Apr 10, 2024

PR description:

PR validation:

If this PR is a backport please specify the original PR and why you need to backport that PR. If this PR will be backported please specify to which release cycle the backport is meant for:

cmsbuild commented Apr 10, 2024 • edited

cmsbuild commented Apr 10, 2024 • edited

makortel commented Apr 11, 2024

makortel commented Apr 11, 2024

makortel commented Apr 11, 2024

cmsbuild commented Apr 11, 2024

Comparison Summary

GPU Comparison Summary

jfernan2 commented Apr 11, 2024

perrotta commented Apr 11, 2024

perrotta commented Apr 11, 2024

fwyzard commented Apr 11, 2024

cmsbuild commented Apr 11, 2024

makortel commented Apr 11, 2024

antoniovilela commented Apr 11, 2024

antoniovilela commented Apr 12, 2024

cmsbuild commented Apr 10, 2024 •

edited

cmsbuild commented Apr 10, 2024 •

edited