Create new ECAL DQM GpuTask to monitor and compare CPU and GPU generated ECAL RECO objects #36742

Merged
merged 6 commits into from Feb 22, 2022

alejands
Contributor

PR description:

PPD has requested that DQM subsystems monitor several GPU-enabled collections being introduced in CMSSW_12_1_X. We have introduced a new EcalMonitorTask called GpuTask, designed to take in CPU- and GPU-generated objects and produce plots comparing several object quantities for each run. The objects include ECAL Digis, Uncalibrated RecHits, and RecHits.

This task will not run by default on the regular Online DQM workflow.

PR validation:

Module was run with the runTheMatrix workflow 10842.512 and Digi and Uncalib RecHit plots look as expected. GPU RecHit plots showed up empty, which was expected due to GPU RecHits currently being disabled in CMSSW. To test functionality with large amounts of data, the module was run on some 2018 data (only CPU objects available) and plots look as expected.

@cmsbuild
Contributor

-code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-36742/27844

Code check has found code style and quality issues which could be resolved by applying following patch(s)

@cmsbuild
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-36742/27845

@cmsbuild
Contributor

A new Pull Request was created by @alejands (Alejandro Sanchez) for master.

It involves the following packages:

  • DQM/EcalMonitorTasks (dqm)
  • DQMOffline/Ecal (dqm)

@emanueleusai, @ahmad3213, @cmsbuild, @jfernan2, @pmandrik, @pbo0, @rvenditti can you please review it and eventually sign? Thanks.
@rchatter, @simonepigazzini, @rociovilar, @thomreis, @argiro this is something you requested to watch as well.
@perrotta, @dpiparo, @qliphy you are the release manager for this.

cms-bot commands are listed here

@alejands
Contributor Author

I may have confused cms-bot by submitting the formatting patch before it could tell me there were formatting errors

@thomreis
Contributor

Maybe squash the two commits and push again. I guess that would trigger the code checks again.

from Configuration.ProcessModifiers.gpu_cff import gpu
from DQM.EcalMonitorTasks.GpuTask_cfi import ecalGpuTask

gpu.toModify(ecalGpuTask.params, runGpuTask = cms.untracked.bool(True))
Contributor

Setting runGpuTask to True here means the GPU-to-CPU comparison will run by default in the offline DQM whenever the gpu modifier is given. I do not think this is what we want, since it implies running both the CPU and the GPU reconstruction.

As discussed in #35879 we should use a different modifier that is only given when the GPU vs. CPU comparison should be done. @jfernan2 does such a modifier exist already from DQM? In the issue mentioned it was proposed to call it gpu-validation.

@cmsbuild
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-36742/27868

  • This PR adds an extra 24KB to repository

@cmsbuild
Contributor

Pull request #36742 was updated. @emanueleusai, @ahmad3213, @cmsbuild, @jfernan2, @pmandrik, @pbo0, @rvenditti can you please check and sign again.

@alejands
Contributor Author

> Maybe squash the two commits and push again. I guess that would trigger the code checks again.

I've gone ahead and squashed the formatting commit

Contributor

@fwyzard left a comment

Please do not start variable names with an underscore _.
Such names are reserved by the C++ standard (in the global namespace) and should not be used according to the CMSSW coding rules:

2.14 Do not use “_” as first character, except for user-defined suffixes (used in user-defined literals). Only use it as the last character for class data member names, not local variable names.
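
To illustrate the rule quoted above, here is a minimal sketch. The class `SampleCounter` and the function `sumOf` are hypothetical and are not part of the PR; they only show a trailing underscore on a data member and plain names for parameters and local variables.

```cpp
// Hypothetical class illustrating the naming rule; not the actual GpuTask code.
class SampleCounter {
public:
  // The parameter has no leading underscore.
  constexpr void add(int nSamples) { total_ += nSamples; }
  constexpr int total() const { return total_; }

private:
  int total_ = 0;  // trailing underscore marks a class data member
};

// Local variables also avoid leading and trailing underscores.
constexpr int sumOf(int first, int second) {
  SampleCounter counter;
  counter.add(first);
  counter.add(second);
  return counter.total();
}
```

Under this convention a function argument such as `_cpuDigis` would be renamed to something like `cpuDigis`, while members such as `EBCpuDigis_` keep their trailing underscore.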

class GpuTask : public DQWorkerTask {
public:
  GpuTask();
  ~GpuTask() override {}
Contributor

Suggested change
- ~GpuTask() override {}
+ ~GpuTask() override = default;
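
The two spellings can be compared in a small sketch. `WorkerBase`, `EmptyBodyTask` and `DefaultedTask` are hypothetical stand-ins, not the CMSSW classes. Both forms compile and stay virtual through the base class; the defaulted form states that nothing custom happens at destruction and leaves the body to the compiler.

```cpp
#include <type_traits>

// Hypothetical stand-in for a polymorphic base such as DQWorkerTask.
struct WorkerBase {
  virtual ~WorkerBase() = default;
};

// Original style: user-provided destructor with an empty body.
struct EmptyBodyTask : WorkerBase {
  ~EmptyBodyTask() override {}
};

// Suggested style: explicitly defaulted destructor.
struct DefaultedTask : WorkerBase {
  ~DefaultedTask() override = default;
};

// Both destructors remain virtual because the base destructor is virtual.
static_assert(std::has_virtual_destructor_v<EmptyBodyTask>);
static_assert(std::has_virtual_destructor_v<DefaultedTask>);
```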

Comment on lines 39 to 42
// Static cast to EB/EEDigiCollection when using
// Defined as void pointers to make compiler happy
void const* EBCpuDigis_;
void const* EECpuDigis_;
Contributor

what's wrong with

Suggested change
- // Static cast to EB/EEDigiCollection when using
- // Defined as void pointers to make compiler happy
- void const* EBCpuDigis_;
- void const* EECpuDigis_;
+ EBDigiCollection const* EBCpuDigis_;
+ EEDigiCollection const* EECpuDigis_;

?

Contributor Author

@alejands, Jan 20, 2022

I did that initially and got a compiler complaint at these lines:

    // Static cast to EB/EEDigiCollection during use
    // Stored as void pointers to make compiler happy
    if (iSubdet == EcalBarrel)
      EBCpuDigis_ = &_cpuDigis;
    else
      EECpuDigis_ = &_cpuDigis;

EBDigiCollection and EEDigiCollection cannot be static_cast into each other. The compiler tries to protect from the situation where the function argument _cpuDigis is one digi flavor and cast into the other. Doing the following will also result in a compiler error saying you can't cast EB/EE to EE/EB:

    if (iSubdet == EcalBarrel)
      EBCpuDigis_ = static_cast<EBDigiCollection const *>(&_cpuDigis);
    else
      EECpuDigis_ = static_cast<EEDigiCollection const *>(&_cpuDigis);

The compiler appears to learn that the template type DigiCollection can be either flavor from the function calls here:

    case kEBCpuDigi:
      if (_p && runGpuTask_)
        runOnCpuDigis(*static_cast<EBDigiCollection const*>(_p), _collection);
      return runGpuTask_;
      break;
    case kEECpuDigi:
      if (_p && runGpuTask_)
        runOnCpuDigis(*static_cast<EEDigiCollection const*>(_p), _collection);
      return runGpuTask_;
      break;

and tries to ensure that every line can execute for either type.

Contributor

Why would you do EECpuDigis_ = static_cast<EBDigiCollection const *>(&_cpuDigis); and not EECpuDigis_ = static_cast<EEDigiCollection const *>(&_cpuDigis);?

Contributor

The compiler does not "learn" anything, it simply tries to compile the code as it is written.

When you call GpuTask::runOnCpuDigis() passing an EBDigiCollection as the first argument, the whole function is compiled substituting EBDigiCollection in place of the template type DigiCollection, and becomes something like

  void GpuTask::runOnCpuDigis(EBDigiCollection const& _cpuDigis, Collections _collection) {
    MESet& meDigiCpu(MEs_.at("DigiCpu"));
    MESet& meDigiCpuAmplitude(MEs_.at("DigiCpuAmplitude"));

    int iSubdet(_collection == kEBCpuDigi ? EcalBarrel : EcalEndcap);

    // Save CpuDigis for comparison with GpuDigis
    // Static cast to EB/EEDigiCollection during use
    // Stored as void pointers to make compiler happy
    if (iSubdet == EcalBarrel)
      EBCpuDigis_ = &_cpuDigis;
    else
      EECpuDigis_ = &_cpuDigis;
  ...

so both assignments, EBCpuDigis_ = &_cpuDigis and EECpuDigis_ = &_cpuDigis, must be valid.

If one is of type EBDigiCollection * and the other is of type EEDigiCollection *, it cannot compile - it doesn't matter that only one of the branches would be executed.

One way to fix the code could be to conditionally compile one of the two assignments, depending on the type of the digi collection:

    if constexpr (std::is_same_v<DigiCollection, EBDigiCollection>) {
      assert(iSubdet == EcalBarrel);
      EBCpuDigis_ = &_cpuDigis;
    } else {
      assert(iSubdet == EcalEndcap);
      EECpuDigis_ = &_cpuDigis;
    }

The use of if constexpr instead of a plain if tells the compiler to evaluate the condition at compile time instead of runtime, and compile only the corresponding branch.
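
The same pattern can be tried outside CMSSW with a self-contained sketch. `BarrelDigis`, `EndcapDigis`, `DigiCache` and `storesInMatchingSlot` are hypothetical stand-ins for the real collections and for GpuTask. The sketch also replaces the runtime `assert` with a `static_assert` in the `else` branch, which is an assumption of this example rather than part of the suggestion.

```cpp
#include <type_traits>

// Hypothetical stand-ins for EBDigiCollection and EEDigiCollection.
struct BarrelDigis { int size = 0; };
struct EndcapDigis { int size = 0; };

// Hypothetical holder with one typed pointer per collection flavor.
struct DigiCache {
  BarrelDigis const* barrel_ = nullptr;
  EndcapDigis const* endcap_ = nullptr;

  template <typename DigiCollection>
  constexpr void store(DigiCollection const& digis) {
    if constexpr (std::is_same_v<DigiCollection, BarrelDigis>) {
      barrel_ = &digis;  // compiled only when DigiCollection is BarrelDigis
    } else {
      // Any type other than the two known flavors is rejected at compile time.
      static_assert(std::is_same_v<DigiCollection, EndcapDigis>, "unsupported collection type");
      endcap_ = &digis;  // compiled only when DigiCollection is EndcapDigis
    }
  }
};

// Stores one collection of each flavor and checks that each lands in its own slot.
constexpr bool storesInMatchingSlot() {
  BarrelDigis eb{3};
  EndcapDigis ee{5};
  DigiCache cache;
  cache.store(eb);
  cache.store(ee);
  return cache.barrel_ == &eb && cache.endcap_ == &ee;
}
```

With a plain `if` in place of `if constexpr`, the instantiation for `BarrelDigis` would still have to compile `endcap_ = &digis`, which is the error described earlier in this thread.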

Contributor Author

> Why would you do EECpuDigis_ = static_cast<EBDigiCollection const *>(&_cpuDigis); and not EECpuDigis_ = static_cast<EEDigiCollection const *>(&_cpuDigis);?

@thomreis Sorry that was a typo in the comment. I did try it with the correct cast.

@fwyzard Thank you for the clarification and suggestion! I'll implement this along with the other suggestions.

Comment on lines 130 to 132
for (typename DigiCollection::const_iterator cpuItr(_cpuDigis.begin()); cpuItr != _cpuDigis.end(); ++cpuItr) {
// EcalDataFrame is not a derived class of edm::DataFrame, but can take edm::DataFrame (digis) in the constructor
EcalDataFrame cpuDataFrame(*cpuItr);
Contributor

Please use a range for loop:

Suggested change
- for (typename DigiCollection::const_iterator cpuItr(_cpuDigis.begin()); cpuItr != _cpuDigis.end(); ++cpuItr) {
-   // EcalDataFrame is not a derived class of edm::DataFrame, but can take edm::DataFrame (digis) in the constructor
-   EcalDataFrame cpuDataFrame(*cpuItr);
+ for (auto const& digis : _cpuDigis) {
+   // EcalDataFrame is not a derived class of edm::DataFrame, but can take edm::DataFrame (digis) in the constructor
+   EcalDataFrame cpuDataFrame(digis);
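
The equivalence of the two loop styles can be checked on a standard container. This is a minimal sketch: `Frame`, `kFrames` and the two functions are hypothetical and only stand in for the digi collection and its data frames.

```cpp
#include <array>

// Hypothetical frame type standing in for a digi data frame.
struct Frame {
  int adc;
};

// Sample data used to compare the two loops.
constexpr std::array<Frame, 3> kFrames{{{1}, {2}, {3}}};

// Explicit iterator loop, in the style of the original code.
constexpr int sumWithIterators(std::array<Frame, 3> const& frames) {
  int sum = 0;
  for (auto it = frames.begin(); it != frames.end(); ++it)
    sum += it->adc;
  return sum;
}

// Range-based loop, in the style of the suggestion.
constexpr int sumWithRangeFor(std::array<Frame, 3> const& frames) {
  int sum = 0;
  for (auto const& frame : frames)
    sum += frame.adc;
  return sum;
}
```

Both functions visit the same elements in the same order; the range-based form drops the iterator type and the explicit dereference.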

// EcalDataFrame is not a derived class of edm::DataFrame, but can take edm::DataFrame (digis) in the constructor
EcalDataFrame cpuDataFrame(*cpuItr);

for (int iSample = 0; iSample < 10; iSample++) {
Contributor

Is it safe to always loop from 0 to 10 ?
If not, please derive the maximum from the data themselves.
If yes, please replace 10 with a named constant, possibly defined in the data format itself, that makes it clear it is always safe.
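
Both options in this comment can be sketched as follows. The `Frame` type, its `kMaxSamples` constant and the helper functions are hypothetical; the fixed length of 10 is taken from the loop under review and is assumed here to be a property of the data format.

```cpp
#include <array>
#include <cstddef>

// Hypothetical data format: the sample count is defined once, next to the data.
struct Frame {
  static constexpr std::size_t kMaxSamples = 10;  // assumed fixed frame length
  std::array<int, kMaxSamples> samples{};
  constexpr std::size_t size() const { return samples.size(); }
};

// Option 1: the loop bound is the named constant from the data format.
constexpr int sumUsingConstant(Frame const& frame) {
  int sum = 0;
  for (std::size_t i = 0; i < Frame::kMaxSamples; ++i)
    sum += frame.samples[i];
  return sum;
}

// Option 2: the loop bound is derived from the data themselves.
constexpr int sumUsingSize(Frame const& frame) {
  int sum = 0;
  for (std::size_t i = 0; i < frame.size(); ++i)
    sum += frame.samples[i];
  return sum;
}

// Helper that builds a frame with every sample set to 1.
constexpr Frame makeUnitFrame() {
  Frame frame;
  for (auto& sample : frame.samples)
    sample = 1;
  return frame;
}
```

Either option removes the bare literal from the loop, so a future change to the frame length needs to be made in only one place.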

@alejands
Contributor Author

alejands commented Jan 21, 2022

@fwyzard @thomreis I've implemented the changes suggested and tested to make sure there were no changes to the plots.

@cmsbuild
Contributor

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-efc172/22491/summary.html
COMMIT: f19e451
CMSSW: CMSSW_12_3_X_2022-02-17-1100/slc7_amd64_gcc10
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/36742/22491/install.sh to create a dev area with all the needed externals and cmssw changes.

GPU Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 24 differences found in the comparisons
  • DQMHistoTests: Total files compared: 4
  • DQMHistoTests: Total histograms compared: 19811
  • DQMHistoTests: Total failures: 953
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 18858
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 3 files compared)
  • Checked 12 log files, 9 edm output root files, 4 DQM output files
  • TriggerResults: no differences found

Comparison Summary

@slava77 comparisons for the following workflows were not done due to missing matrix map:

  • /data/cmsbld/jenkins/workspace/compare-root-files-short-matrix/data/PR-efc172/138.4_PromptCollisions+RunMinimumBias2021+ALCARECOPROMPTR3+HARVESTDPROMPTR3
  • /data/cmsbld/jenkins/workspace/compare-root-files-short-matrix/data/PR-efc172/138.5_ExpressCollisions+RunMinimumBias2021+TIER0EXPRUN3+ALCARECOEXPR3+HARVESTDEXPR3
  • /data/cmsbld/jenkins/workspace/compare-root-files-short-matrix/data/PR-efc172/139.001_RunMinimumBias2021+RunMinimumBias2021+HLTDR3_2021+RECODR3_MinBiasOffline+HARVESTD2021MB

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 0 differences found in the comparisons
  • DQMHistoTests: Total files compared: 49
  • DQMHistoTests: Total histograms compared: 3965143
  • DQMHistoTests: Total failures: 2
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3965119
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 48 files compared)
  • Checked 204 log files, 45 edm output root files, 49 DQM output files
  • TriggerResults: no differences found

@fwyzard
Contributor

fwyzard commented Feb 18, 2022

> WFs with the .514 suffix exist already though (EcalOnlyGPU_Profiling). Not sure if they are meant for validation.

Too many workflows, too little numbers...

OK, I guess it's time to make some order (again) :-/

> The fact that the DQM task requests explicitly @cpu and @cuda collections as input forces the framework to run both branches of the SwitchProducer. For ECAL giving the gpu and the gpuValidationEcal modifiers as in this PR should set things up.

Yes, indeed.
What I meant is to set up corresponding workflows also for the other detectors, to make it easier for them to test their DQM (at least HCAL has it under development).

@thomreis
Contributor

> Yes, indeed.
> What I meant is to set up corresponding workflows also for the other detectors, to make it easier for them to test their DQM (at least HCAL has it under development).

So a WF dedicated to ECAL GPU validation. I guess that would be good. I would do it in a separate PR though to get this one in so that it can be run manually already.

@jfernan2
Contributor

So, I understand then that this PR is good to go and we will have a new one for the dedicated WF, right?

@fwyzard
Contributor

fwyzard commented Feb 20, 2022

> So, I understand then that this PR is good to go and we will have a new one for the dedicated WF, right?

Yes, that would be the idea.

@jfernan2
Contributor

+1

@thomreis
Contributor

In order for the HCAL GPU validation PR #36998 to use the gpuValidation modifierChain introduced here it would be good if this PR can be merged soon.

Contributor

@perrotta left a comment

This is a minor fix: there is no need to delay this PR for it, but please take it into account, for example when making the PR with the updates anticipated by @thomreis in #36742 (comment)

from Configuration.ProcessModifiers.gpuValidationEcal_cff import gpuValidationEcal
from DQM.EcalMonitorTasks.ecalGpuTask_cfi import ecalGpuTask

gpuValidationEcal.toModify(ecalGpuTask.params, runGpuTask = cms.untracked.bool(True))
Contributor

Suggested change
- gpuValidationEcal.toModify(ecalGpuTask.params, runGpuTask = cms.untracked.bool(True))
+ gpuValidationEcal.toModify(ecalGpuTask.params, runGpuTask = True)

@perrotta
Contributor

+operations

@cmsbuild
Contributor

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @perrotta, @dpiparo, @qliphy (and backports should be raised in the release meeting by the corresponding L2)

@perrotta
Contributor

+1
