
DQM: Disable assertLegacySafe when concurrent lumis are enabled. #28920

Merged

Conversation

schneiml
Contributor

@schneiml schneiml commented Feb 11, 2020

PR description:

This prevents the crashes in 1361.181 reported in #28622. Prevents, not fixes.

I'd like to summon @Dr15Jones and @makortel to this discussion: What happens here is that a job with VALIDATION enabled (which I am pretty sure contains some edm::EDAnalyzers, though I have not checked) is requested to run with concurrent lumisections. EDM should prevent this, since it is not safe to have concurrent lumisections with legacy modules [1]. However, the new DQMStore still detects that a new lumisection starts before the previous one is saved, and that it consequently needs to copy MEs (this triggers the assertLegacySafe assertion by default, unless it is explicitly turned off). The problem is that edm::EDAnalyzer-based DQM code could hold pointers that get freed by the DQMStore later (as the first lumisection ends). For this reason, it is only safe to disable assertLegacySafe when there are no edm::EDAnalyzer-based DQMStore users in the process.

So, with this PR, 1361.181 runs but is unsafe. What we should do instead is either make sure that EDM actually does not use concurrent lumisections at all [2] when there are legacy modules (then we can keep the assertion enabled and it will not fire), or remove all legacy modules from the jobs using concurrent lumis (that is really what we need to do, but much harder than it sounds).

We can of course also just keep the unsafe behaviour. It seems to work for now (maybe because the legacy modules involved don't do anything dangerous [3]), but I can't make any guarantees about that.

[1] Actually, the majority of DQM currently runs in edm::one::EDProducers watching lumis, which should have the same effect.
[2] I think EDM might overlap writing the current lumi and processing the next even when there are modules blocking concurrent lumis -- that would explain the behaviour.
[3] (Edit:) Since this entire story is about lumis, while legacy modules by default deal with per-job MEs, we should actually be safe as long as the legacy modules don't explicitly use Scope::LUMI. However, the (meaningful) legacy plugins do set a scope different from JOB manually (else they would not produce any output in the reco step), and technically the same problem exists with RUN MEs -- except that we don't do anything close to concurrent runs currently. So we have reasons to believe that there is no risk of use-after-free in this workflow today.

PR validation:

runTheMatrix.py  -l 1361.181  -i all --job-reports -t 4

passes. Note the -t 4, bare 1361.181 did not trigger the problem.

See concerns above.

Note that this does not actually mean it is safe to use concurrent
lumis; we just disable the check if concurrent lumis are actually
requested. In fact, we know that assertLegacySafe fails even when there
are legacy modules that should block concurrent lumis.
@cmsbuild
Contributor

The code-checks are being triggered in jenkins.

@cmsbuild
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-28920/13715

  • This PR adds an extra 36KB to repository

@cmsbuild
Contributor

A new Pull Request was created by @schneiml (Marcel Schneider) for master.

It involves the following packages:

Configuration/Applications

@silviodonato, @kpedro88, @cmsbuild, @franzoni, @fabiocos, @davidlange6 can you please review it and eventually sign? Thanks.
@makortel, @Martin-Grunewald this is something you requested to watch as well.
@davidlange6, @silviodonato, @fabiocos you are the release manager for this.

cms-bot commands are listed here

@schneiml
Contributor Author

please test 1361.181

@schneiml schneiml mentioned this pull request Feb 11, 2020
@silviodonato
Contributor

please test workflow 1361.181

@cmsbuild
Contributor

cmsbuild commented Feb 11, 2020

The tests are being triggered in jenkins.
Test Parameters:

@Dr15Jones
Contributor

What happens here is that a job with VALIDATION enabled (which I am pretty sure does contain some edm::EDAnalyzers -- I have not checked it though) is requested to run with concurrent lumisections. EDM should prevent this, since it is not safe to have concurrent lumisections with legacy modules

@schneiml I'm afraid you have the wrong mental model for how this works. Concurrent LumiBlocks are not prevented; instead, the framework still starts the new LuminosityBlock before the last ends, while still guaranteeing that such legacy and edm::one modules see the previous end LuminosityBlock call before the new begin LuminosityBlock call. So it isn't that the legacy modules prevent concurrent lumis; rather, they prevent processing the events in the next lumi before all events in the previous lumi finish. Sorry we gave you the wrong impression of what is happening.

@schneiml
Contributor Author

@Dr15Jones yes, I pretty much inferred that this is what's going on, and it is basically fine. Except for the problem explained above: the new DQMStore tries not to allocate new ME objects as long as possible, using only a single instance of each ME, which allows the legacy code holding pointers to (global) MEs to work fine (even if they are filled at the weirdest times). This obviously no longer works as soon as a new lumisection starts before the last one is saved; once we have more than a single copy of an ME, the "all legacy interactions keep working" guarantee no longer holds.

So as written above, we now have three options:

  • Change these workflows to absolutely not use concurrent lumis.
  • or Clean up and remove the legacy modules from these workflows.
  • or Merge this PR and accept the risk.

I'd prefer the first or second, but the third is obviously much easier.
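For reference, option 3 amounts to a configuration fragment along these lines (a sketch only; the exact generated code lives in ConfigBuilder.py, and the surrounding process setup depends on the workflow):

```python
# Illustrative fragment, not a complete cmsRun configuration; assumes the
# standard DQMStore service is provided by DQMServices.Core.DQMStore_cfi.
import FWCore.ParameterSet.Config as cms

process = cms.Process("RECO")
process.load("DQMServices.Core.DQMStore_cfi")

process.options.numberOfThreads = cms.untracked.uint32(4)
process.options.numberOfStreams = cms.untracked.uint32(0)
process.options.numberOfConcurrentLuminosityBlocks = cms.untracked.uint32(2)

# Accept the risk: with more than one concurrent lumi the DQMStore will
# copy MEs, so the safety assertion must be switched off explicitly.
process.DQMStore.assertLegacySafe = cms.untracked.bool(False)
```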

@cmsbuild
Contributor

+1
Tested at: 35ad7e2
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-ffb461/4598/summary.html
CMSSW: CMSSW_11_1_X_2020-02-10-2300
SCRAM_ARCH: slc7_amd64_gcc820

@cmsbuild
Contributor

Comparison job queued.

@cmsbuild
Contributor

Comparison is ready
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-ffb461/4598/summary.html

@slava77 comparisons for the following workflows were not done due to missing matrix map:

  • /data/cmsbld/jenkins/workspace/compare-root-files-short-matrix/results/JR-comparison/PR-ffb461/1361.181_GluGluHToZZTo4L_M125_Pow_py8_Evt_13UP18ml+GluGluHToZZTo4L_M125_Pow_py8_Evt_13UP18ml+DIGIUP18ml+RECOUP18ml+HARVESTUP18+NANOUP18ml

Comparison Summary:

  • No significant changes to the logs found
  • Reco comparison results: 0 differences found in the comparisons
  • DQMHistoTests: Total files compared: 34
  • DQMHistoTests: Total histograms compared: 2694005
  • DQMHistoTests: Total failures: 1
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 2693658
  • DQMHistoTests: Total skipped: 346
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 33 files compared)
  • Checked 147 log files, 16 edm output root files, 34 DQM output files

@silviodonato
Contributor

urgent

@makortel
Contributor

First of those is MuonAlignment, for which I opened an issue #28975. I don't know if there are more.

A further test reveals that the MuonAlignment is the only legacy DQM module in the step2 of 1000.0.

@cmsbuild cmsbuild mentioned this pull request Feb 24, 2020
@silviodonato
Contributor

With #29027 we removed MuonAlignment from wf 1000.0 and from CMSSW

@makortel
Contributor

Now that full matrix non-harvesting steps should be clean from legacy modules, in theory this configuration flag should no longer be needed (in full matrix). I'm left to wonder why the assert fires also for DQMOneLumiEDAnalyzer:

#28920 (comment)

Matti and I looked at the backtrace for the failing IB. It is not caused by an edm::EDAnalyzer. It is caused by a DQMOneLumiEDAnalyzer, in particular AlcaBeamMonitor.

(is it perhaps not detecting legacy modules properly?)

@makortel
Contributor

@Dr15Jones reminded me that my question was essentially answered in #28920 (comment).

Can assertLegacySafe be removed then?

@schneiml
Contributor Author

@makortel the key point is that this assert does not detect the presence of legacy modules, but simply watches the behavior of the DQMStore (which is in the end driven by EDM). That makes it less useful, but also much simpler, and much easier to understand:

  • If assertLegacySafe is set: There is only a single histogram for each ME name. No cloning, duplicating, or freeing ever happens. Holding any sort of pointer is safe.
  • Else: there may be multiple histograms for each ME. In some cases, it may not be safe to touch pointers handed out earlier. It is safe in the transitions exposed by non-legacy modules.

Ideally the assert would check whether legacy modules are present in the job, but that is quite hard, since the defining property of legacy modules is that they might spontaneously, at any point in time, get the edm::Service and call into the DQMStore. So the DQMStore cannot know whether there actually are legacy modules without asking for this information somewhere else. Therefore the assert is conservative and has to be disabled manually.
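The two regimes can be illustrated with a toy model (plain Python, not CMSSW code; `ToyDQMStore` and all names below are invented for illustration):

```python
# Toy illustration of the two DQMStore regimes described above.
# Not CMSSW code: ToyDQMStore and its methods are invented names.

class ToyDQMStore:
    def __init__(self, assert_legacy_safe=True):
        self.assert_legacy_safe = assert_legacy_safe
        self._mes = {}      # ME name -> current histogram object (a list of fills)
        self._saved = []    # (name, snapshot) pairs for finished lumisections

    def book(self, name):
        # With assertLegacySafe there is only ever one object per name,
        # so a reference obtained here stays valid forever.
        return self._mes.setdefault(name, [])

    def new_lumi_before_previous_saved(self):
        # A second lumisection starts before the previous one was saved:
        # the store must snapshot and replace the MEs. Old references now
        # point at stale objects -- exactly the legacy hazard.
        if self.assert_legacy_safe:
            raise AssertionError("assertLegacySafe: would need to copy MEs")
        for name in self._mes:
            self._saved.append((name, self._mes[name]))
            self._mes[name] = []   # fresh object for the new lumisection

# "Legacy" code caches a raw reference:
store = ToyDQMStore(assert_legacy_safe=False)
cached = store.book("h")
cached.append(1.0)
store.new_lumi_before_previous_saved()
# The cached reference no longer refers to the live ME:
print(store.book("h") is cached)   # False
```

With `assert_legacy_safe=True` the same call raises instead of silently invalidating the cached reference, which is the conservative behaviour the assertion enforces.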

@makortel
Contributor

Ok, so the change in this PR should cover all jobs whose configuration is generated with cmsDriver. Does that propagate to T0 configurations automatically or not? (I'm a bit afraid of the latter)

@davidlange6
Contributor

davidlange6 commented Feb 25, 2020 via email

@silviodonato
Contributor

@schneiml have you tried to use the tool edmCheckMultithreading to check the modules in the DQM sequence?

@schneiml
Contributor Author

schneiml commented Mar 2, 2020

@silviodonato I am at it. It gave three hot candidates: https://cmssdt.cern.ch/dxr/CMSSW/source/Validation/RecoTau/plugins/DQMHistNormalizer.cc , https://cmssdt.cern.ch/dxr/CMSSW/source/DQM/TrigXMonitorClient/src/L1ScalersClient.cc , and https://cmssdt.cern.ch/dxr/CMSSW/source/Validation/RecoTau/plugins/DQMFileLoader.cc .

Though I think these are false positives, since edmCheckMultithreading checks all modules configured in the configuration, even if they are not on any sequence. So now I am combining master...schneiml:dqm-dqmdumpsequence with the logic from edmCheckMultithreading for a more precise check.
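The filtering idea can be sketched in plain Python (toy data structures, not the real cms.Process API; the module labels and type names are invented for illustration):

```python
# Toy sketch of the "only flag legacy modules actually on a sequence" filter.
# The configuration layout and the legacy-type names are invented.

modules = {
    "muonAlignment":     "legacy::EDAnalyzer",
    "dqmHistNormalizer": "legacy::EDAnalyzer",   # configured, but on no sequence
    "trackMonitor":      "DQMEDAnalyzer",
}
sequences = {
    "DQMOffline": ["muonAlignment", "trackMonitor"],
}

LEGACY_TYPES = {"legacy::EDAnalyzer", "legacy::EDProducer"}

def legacy_on_sequences(modules, sequences):
    # Only consider modules that are scheduled via some sequence.
    scheduled = {label for seq in sequences.values() for label in seq}
    return sorted(m for m in scheduled if modules[m] in LEGACY_TYPES)

print(legacy_on_sequences(modules, sequences))   # ['muonAlignment']
```

Here `dqmHistNormalizer` is a false positive for a whole-configuration scan but is correctly dropped once sequence membership is taken into account.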

@schneiml
Contributor Author

schneiml commented Mar 2, 2020

The new combined tool [1] indicates that there are no legacy DQM modules on any DQM or VALIDATION sequences defined in autoDQM/autoValidation, at least for scenario pp and a recent era. That is a good sign. There are a few non-DQM legacy modules though, but they should not hurt.

There might be legacy modules hiding in other corners of the configuration parameter space, but chances are not very high.

[1] master...schneiml:dqm-dqmdumpsequence

@makortel
Contributor

makortel commented Mar 2, 2020

@schneiml

There are a few non-DQM legacy modules though, but they should not hurt.

Could you please list them in #25090?

@silviodonato
Contributor

@schneiml , we can try to run your tests master...schneiml:dqm-dqmdumpsequence in a special IB and check if this is sustainable or not.
@smuzaffar how do you suggest to proceed? Perhaps the easiest (quick&dirty) way is to simply integrate these tests (master...schneiml:dqm-dqmdumpsequence). If they take too much time we can revert the change. A nicer and more complex way is to have a special IB including these new tests.

@schneiml
Contributor Author

@silviodonato as you might have seen, I just updated the test with @smuzaffar's comments. Now it should not interfere with other tests, but it might take ~2h [1] to complete. Splitting it into more pieces would be better, but then we'd need to arbitrarily split the workflow numbers, and I'd prefer to avoid that complexity.

[1] Extrapolating from the fact that it took 15 min at 16 threads. I actually run it single-threaded on cmsswconfigexplore to serve the html UI, and there it takes almost 24h, but that is on the smallest VM openstack has to offer...

@silviodonato
Contributor

+operations

@cmsbuild
Contributor

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @davidlange6, @silviodonato, @fabiocos (and backports should be raised in the release meeting by the corresponding L2)

@silviodonato
Contributor

+1
See #29150

@cmsbuild cmsbuild merged commit d8696a8 into cms-sw:master Mar 16, 2020
@@ -2232,6 +2232,8 @@ def prepare(self, doChecking = False):
self.pythonCfgCode +="process.options.numberOfThreads=cms.untracked.uint32("+self._options.nThreads+")\n"
self.pythonCfgCode +="process.options.numberOfStreams=cms.untracked.uint32("+self._options.nStreams+")\n"
self.pythonCfgCode +="process.options.numberOfConcurrentLuminosityBlocks=cms.untracked.uint32("+self._options.nConcurrentLumis+")\n"
if self._options.nConcurrentLumis > 1:
    self.pythonCfgCode +="if process.DQMStore: process.DQMStore.assertLegacySafe=cms.untracked.bool(False)\n"
Contributor

This should have probably been something like

self.pythonCfgCode +="if hasattr(process, 'DQMStore'): process.DQMStore.assertLegacySafe=cms.untracked.bool(False)\n"

The latest IB (CMSSW_11_1/2020-03-16-1100) shows failures like

LHE input from article  18334
Note: this tool is DEPRECATED, use xrdfs instead.
customising the process with customiseWithTimeMemorySummary from Validation/Performance/TimeMemorySummary
Starting /data/cmsbld/jenkins/workspace/ib-run-relvals/cms-bot/monitor_workflow.py timeout --signal SIGTERM 9000  cmsRun -j JobReport1.xml  step1_NONE.py
----- Begin Fatal Exception 16-Mar-2020 12:51:11 CET-----------------------
An exception of category 'ConfigFileReadError' occurred while
   [0] Processing the python configuration file named step1_NONE.py
Exception Message:
 unknown python problem occurred.
AttributeError: 'Process' object has no attribute 'DQMStore'

At:
  step1_NONE.py(92): <module>

----- End Fatal Exception -------------------------------------------------

https://cmssdt.cern.ch/SDT/cgi-bin/logreader/slc7_amd64_gcc820/CMSSW_11_1_X_2020-03-16-1100/pyRelValMatrixLogs/run/1370.0_GluGluHToGG_M125_Pow_MINLO_NNLOPS_py8_13+GluGluHToGG_M125_Pow_MINLO_NNLOPS_py8_13+Hadronizer_TuneCUETP8M1_13TeV_powhegEmissionVeto_2p_HToGG_M125_13+DIGIUP15+RECOUP15+HARVESTUP15+NANOUP15Had/step1_GluGluHToGG_M125_Pow_MINLO_NNLOPS_py8_13+GluGluHToGG_M125_Pow_MINLO_NNLOPS_py8_13+Hadronizer_TuneCUETP8M1_13TeV_powhegEmissionVeto_2p_HToGG_M125_13+DIGIUP15+RECOUP15+HARVESTUP15+NANOUP15Had.log#/
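The failure mode is plain Python attribute semantics; a minimal reproduction (with a stand-in `Process` class, not the real cms.Process):

```python
# Minimal reproduction of the failure: truth-testing a missing attribute
# raises AttributeError, while hasattr() safely returns False.
# `Process` here is a stand-in class, not FWCore's cms.Process.

class Process:
    pass

process = Process()

try:
    if process.DQMStore:               # what the generated config did
        pass
    outcome = "ok"
except AttributeError:
    outcome = "AttributeError"

print(outcome)                         # AttributeError
print(hasattr(process, "DQMStore"))    # False -- the safe check
```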

cmsbuild added a commit that referenced this pull request Mar 17, 2020
Use hasattr to check process.DQMStore in ConfigBuilder.py (Fix of #28920)