Create new ECAL DQM GpuTask to monitor and compare CPU and GPU generated ECAL Rec Hits #35946

Closed
wants to merge 2 commits into from


alejands
Contributor

@alejands alejands commented Nov 1, 2021

PR description:

PPD has requested DQM subsystems to monitor several GPU-enabled collections being introduced in CMSSW_12_1_X. We have introduced a new EcalMonitorTask called GpuTask designed to take in CPU and GPU generated rec hits and produce plots comparing several rec hit quantities for each run. We don't explicitly plot the GPU rec hit values for the sake of memory efficiency. Additional plots may be added in the future.
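As a rough illustration of the kind of comparison described above, here is a minimal pure-Python sketch. It is not the GpuTask code: rec hits are modelled as plain `{detector_id: energy}` dictionaries, and `compare_rec_hits` and all sample values are hypothetical names and numbers chosen for the example. Only the GPU-minus-CPU differences are kept, in the spirit of not storing the GPU values themselves.

```python
# Illustrative only: rec hits are modelled as {detector_id: energy_in_GeV}
# dictionaries, not as the CMSSW rec hit collection type.
def compare_rec_hits(cpu_hits, gpu_hits):
    """Return per-hit energy differences (GPU minus CPU) and unmatched ids."""
    # Hits present in both collections can be compared directly.
    common_ids = sorted(cpu_hits.keys() & gpu_hits.keys())
    energy_diffs = {det_id: gpu_hits[det_id] - cpu_hits[det_id] for det_id in common_ids}
    # Hits present in only one collection are reported separately.
    unmatched_ids = sorted(cpu_hits.keys() ^ gpu_hits.keys())
    return energy_diffs, unmatched_ids


# Made-up sample values for the sketch.
cpu_hits = {101: 1.25, 102: 0.50, 103: 3.00}
gpu_hits = {101: 1.25, 102: 0.75, 104: 2.00}

energy_diffs, unmatched_ids = compare_rec_hits(cpu_hits, gpu_hits)
print(energy_diffs)
print(unmatched_ids)
```

Passing the same collection for both arguments gives all-zero differences and no unmatched hits, which is what one would expect from a test in which both inputs are the CPU rec hits.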

This task will not run by default on the regular Online DQM workflow.

PR validation:

One caveat to this version of the code is that we have not been able to test it on actual GPU rec hits. There appears to be no workflow currently that produces both types of rec hit collections. We have been informed by Thomas Reis that CPU and GPU ECAL rec hits are the same data type, so our testing is done by changing the input tags so that both collections are the CPU rec hits.

This code was run with the runTheMatrix workflow 10842.512 and the new plots look as expected.

A backport to 12_1_X has been submitted in #35947.

@cmsbuild
Contributor

cmsbuild commented Nov 1, 2021

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-35946/26352

  • This PR adds an extra 24KB to repository

@cmsbuild
Contributor

cmsbuild commented Nov 1, 2021

A new Pull Request was created by @alejands (Alejandro Sanchez) for master.

It involves the following packages:

  • DQM/EcalMonitorTasks (dqm)

@emanueleusai, @ahmad3213, @cmsbuild, @jfernan2, @pmandrik, @pbo0, @rvenditti can you please review it and eventually sign? Thanks.
@rchatter, @simonepigazzini, @thomreis, @argiro this is something you requested to watch as well.
@perrotta, @dpiparo, @qliphy you are the release manager for this.

cms-bot commands are listed here

@emanueleusai
Member

please test

@cmsbuild
Contributor

cmsbuild commented Nov 2, 2021

-1

Failed Tests: UnitTests
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-b82a8c/20169/summary.html
COMMIT: a47505d
CMSSW: CMSSW_12_2_X_2021-11-01-1100/slc7_amd64_gcc900
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/35946/20169/install.sh to create a dev area with all the needed externals and cmssw changes.

Unit Tests

I found errors in the following unit tests:

---> test TestDQMServicesDemo had ERRORS

Comparison Summary

Summary:

  • You potentially added 3968 lines to the logs
  • Reco comparison results: 89 differences found in the comparisons
  • DQMHistoTests: Total files compared: 42
  • DQMHistoTests: Total histograms compared: 2901890
  • DQMHistoTests: Total failures: 33
  • DQMHistoTests: Total nulls: 1
  • DQMHistoTests: Total successes: 2901834
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 207.665 KiB( 41 files compared)
  • DQMHistoSizes: changed ( 1000.0,... ): 3.350 KiB EcalBarrel/EBGpuTask
  • DQMHistoSizes: changed ( 1000.0,... ): 3.350 KiB EcalEndcap/EEGpuTask
  • DQMHistoSizes: changed ( 312.0 ): -0.004 KiB MessageLogger/Warnings
  • Checked 177 log files, 37 edm output root files, 42 DQM output files
  • TriggerResults: no differences found

@alejands
Contributor Author

alejands commented Nov 2, 2021

We believe that the tests failed due to the GPU ecalRecHits not being enabled in CMSSW at the moment. The @cuda version is commented out here: RecoLocalCalo/EcalRecProducers/python/ecalRecHit_cff.py#43. For the time being, we'll modify the GPU input tags to also read from the CPU ecalRecHits. This can be quickly changed back when the GPU rec hits are available.

@jfernan2
Contributor

jfernan2 commented Nov 2, 2021

@alejands the UnitTest failed because of the issue fixed in PR #35921, not because of your code, so I would restore the code as you had it. Thanks.

@cmsbuild
Contributor

cmsbuild commented Nov 2, 2021

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-35946/26374

  • This PR adds an extra 16KB to repository

@cmsbuild
Contributor

cmsbuild commented Nov 2, 2021

Pull request #35946 was updated. @emanueleusai, @ahmad3213, @cmsbuild, @jfernan2, @pmandrik, @pbo0, @rvenditti can you please check and sign again.

@jfernan2
Contributor

jfernan2 commented Nov 3, 2021

enable gpu

@alejands
Contributor Author

alejands commented Nov 3, 2021

@jfernan2 @perrotta I have added a switch to GpuTask to prevent it from looking for GPU collections unless specified. I re-ran runTheMatrix WF 10842.512 and the warnings went away.
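The effect of such a switch can be sketched in plain Python. This is not the GpuTask implementation: `GpuComparisonSketch`, the `run_gpu_task` flag, and the dictionary-based event are hypothetical stand-ins used only to show why the warnings disappear when the GPU collection is never requested.

```python
class GpuComparisonSketch:
    """Toy task that only looks for the GPU collection when asked to."""

    def __init__(self, run_gpu_task=False):
        self.run_gpu_task = run_gpu_task  # hypothetical switch, off by default
        self.warnings = []

    def analyze(self, event):
        if not self.run_gpu_task:
            # Switch off: the GPU collection is never requested, so a
            # missing collection cannot produce a warning.
            return None
        gpu_hits = event.get("gpu_rec_hits")
        if gpu_hits is None:
            self.warnings.append("gpu_rec_hits not found")
            return None
        return len(gpu_hits)


# An event that only carries CPU rec hits (made-up values).
event = {"cpu_rec_hits": [1.25, 0.50]}

task_off = GpuComparisonSketch(run_gpu_task=False)
task_on = GpuComparisonSketch(run_gpu_task=True)
task_off.analyze(event)
task_on.analyze(event)
print(task_off.warnings)
print(task_on.warnings)
```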

@cmsbuild
Contributor

cmsbuild commented Nov 3, 2021

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-35946/26400

  • This PR adds an extra 24KB to repository

@cmsbuild
Contributor

cmsbuild commented Nov 3, 2021

Pull request #35946 was updated. @emanueleusai, @ahmad3213, @cmsbuild, @jfernan2, @pmandrik, @pbo0, @rvenditti can you please check and sign again.

@jfernan2
Contributor

jfernan2 commented Nov 3, 2021

please test

@thomreis
Contributor

thomreis commented Nov 3, 2021

@jfernan2 @perrotta I have added a switch to GpuTask to prevent it from looking for GPU collections unless specified. I re-ran runTheMatrix WF 10842.512 and the warnings went away

Could this switch be toggled automatically when running on a GPU machine with

from Configuration.ProcessModifiers.gpu_cff import gpu
gpu.toModify(ecalGpuTask, params = dict(runOnGpu = cms.untracked.bool(True)))

?

Or maybe even better modify ecalMonitorTask to only include the GpuTask worker if there is a GPU available?

@alejands
Contributor Author

alejands commented Nov 3, 2021

Could this switch be toggled automatically when running on a GPU machine with

from Configuration.ProcessModifiers.gpu_cff import gpu
gpu.toModify(ecalGpuTask, params = dict(runOnGpu = cms.untracked.bool(True)))

?

@thomreis As far as I know, your toggle example should work. We have a similar switch for running on emulated digis, which is switched OFF in most Offline DQM configs (this switch is set to ON by default)

https://github.com/cms-sw/cmssw/blob/master/DQMOffline/Ecal/python/ecal_dqm_source_offline_cff.py#L33

Or maybe even better modify ecalMonitorTask to only include the GpuTask worker if there is a GPU available?

Could you point me to an example of how to check if a GPU is available? I could take a look, see whether this is possible within ecalMonitorTask, and try to implement it in an upcoming PR.

@cmsbuild
Contributor

cmsbuild commented Nov 3, 2021

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-b82a8c/20231/summary.html
COMMIT: 5c364e9
CMSSW: CMSSW_12_2_X_2021-11-03-1100/slc7_amd64_gcc900
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/35946/20231/install.sh to create a dev area with all the needed externals and cmssw changes.

GPU Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 0 differences found in the comparisons
  • DQMHistoTests: Total files compared: 4
  • DQMHistoTests: Total histograms compared: 19782
  • DQMHistoTests: Total failures: 8
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 19774
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 3.064 KiB( 3 files compared)
  • DQMHistoSizes: changed ( 11634.512 ): 1.532 KiB EcalBarrel/EBGpuTask
  • DQMHistoSizes: changed ( 11634.512 ): 1.532 KiB EcalEndcap/EEGpuTask
  • Checked 12 log files, 9 edm output root files, 4 DQM output files
  • TriggerResults: no differences found

Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 2 differences found in the comparisons
  • DQMHistoTests: Total files compared: 42
  • DQMHistoTests: Total histograms compared: 2901890
  • DQMHistoTests: Total failures: 6
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 2901862
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 94.984 KiB( 41 files compared)
  • DQMHistoSizes: changed ( 1000.0,... ): 1.532 KiB EcalBarrel/EBGpuTask
  • DQMHistoSizes: changed ( 1000.0,... ): 1.532 KiB EcalEndcap/EEGpuTask
  • Checked 177 log files, 37 edm output root files, 42 DQM output files
  • TriggerResults: no differences found

@thomreis
Contributor

thomreis commented Nov 4, 2021

https://github.com/cms-sw/cmssw/blob/master/DQMOffline/Ecal/python/ecal_dqm_source_offline_cff.py#L33

Or maybe even better modify ecalMonitorTask to only include the GpuTask worker if there is a GPU available?

Could you point me to an example of how to check if a GPU is available? I could take a look, see whether this is possible within ecalMonitorTask, and try to implement it in an upcoming PR.

You would use the same gpu.toModify() method as for the single parameter. More details are in this twiki (it is about eras but I think the mechanism is the same for GPU modifications): https://twiki.cern.ch/twiki/bin/view/CMSPublic/SWGuideCmsDriverEras

I think something like this should work:
First revert the changes in DQM/EcalMonitorTasks/python/EcalMonitorTask_cfi.py to have a non-GPU version by default and then add something like the code below. Maybe this should be in a separate file EcalMonitorTask_cff.py that is then loaded instead of the cfi file in DQMOffline/Ecal/python/ecal_dqm_source_offline_cff.py

# customisation to run the CPU vs. GPU comparison task if the job runs on a GPU enabled machine
from Configuration.ProcessModifiers.gpu_cff import gpu
from DQM.EcalMonitorTasks.GpuTask_cfi import ecalGpuTask
gpu.toModify(ecalMonitorTask.workers, func = lambda workers: workers.append("GpuTask"))
gpu.toModify(ecalMonitorTask, workerParameters = dict(GpuTask = ecalGpuTask))

I guess in this case the runOnGpu parameter is not really necessary anymore.
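The idea behind this customisation can be illustrated without CMSSW. The sketch below uses a toy stand-in, `ModifierSketch`, which is an assumption made for the example and not the FWCore modifier class; the default worker names are placeholders. It only shows the intended behaviour: the customisation function runs when the modifier is active and is skipped otherwise, so a non-GPU job keeps the default worker list.

```python
class ModifierSketch:
    """Toy stand-in for a process modifier; not the FWCore implementation."""

    def __init__(self, chosen):
        self.chosen = chosen  # True when the job is configured for GPU

    def to_modify(self, obj, func):
        # Apply the customisation only when the modifier is active.
        if self.chosen:
            func(obj)


def build_workers(gpu_enabled):
    """Return the worker list for a job with or without the GPU modifier."""
    workers = ["ClusterTask", "EnergyTask"]  # placeholder default workers
    gpu = ModifierSketch(chosen=gpu_enabled)
    gpu.to_modify(workers, lambda w: w.append("GpuTask"))
    return workers


print(build_workers(gpu_enabled=False))
print(build_workers(gpu_enabled=True))
```

With this pattern the decision is made once, at configuration time, which is why a separate run-time flag inside the task would become redundant.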
