DQMIO and DQM paths need to be serialized #15866

Dr15Jones · 2016-09-15T23:15:23Z

With the coming of running Paths and EndPaths concurrently during Lumi and Run transitions, the dqmoffine_step path and the path holding the DQMOutputModule need to be serialized. The problem is the DQM EDAnalyzers have a hidden (from the framework) data dependency with DQMOutputModule. The several ways to fix this:

put all the DQM EDAnalyzers into one EndPath and have the DQMOutputModule at its end. That would be bad for parallel processing since the EDAnalyzers would not be run in parallel ever.
put each DQM EDAnalyzer in its own endpath with a simple EDProducer at the end. The path with the DQMOutputModule would have a module at its beginning which consumes all the data products made by the EDProducers at the end of the DQM end paths.
Make the DQM EDAnalyzers into EDProducers which puts a token object into the Lumi/Run and then the DQMOutputModule would get those data object and automatically be dependent on those modules.

cmsbuild · 2016-09-15T23:15:42Z

A new Issue was created by @Dr15Jones Chris Jones.

@davidlange6, @smuzaffar, @Dr15Jones can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

Dr15Jones · 2016-09-15T23:17:51Z

assign core

Dr15Jones · 2016-09-15T23:18:05Z

assign dqm

cmsbuild · 2016-09-15T23:18:07Z

New categories assigned: core,dqm

@Dr15Jones,@smuzaffar,@vanbesien,@dmitrijus you have been requested to review this Pull request/Issue and eventually sign? Thanks

Dr15Jones · 2016-10-06T18:46:16Z

ping do people have any preference for the solution?

dmitrijus · 2016-10-11T14:24:20Z

Hello, neither of us are the experts on the framework, but I think solution #3 would be the best.
The less we touch/involve EndPaths, the better.

Dr15Jones · 2017-05-09T18:44:34Z

At the Core meeting I proposed the following
https://indico.cern.ch/event/638163/contributions/2585775/attachments/1455835/2247950/SummaryProducer.pdf

The idea is to add a new type, EDSummaryProducer, to the framework and convert all DQMEDAnalyzers to this type (done centrally). From the meeting, this would have to be done within the next week (all the work would be done by Core).

Would that be acceptable to you?

An alternative, but not well liked, idea was to change EDAnalyzers to allow them to put data into the Run and LuminosityBlock at end transitions. That touches less user code but muddles the behaviors of the module types.

Dr15Jones · 2017-05-09T18:45:00Z

@fwyzard FYI

dmitrijus · 2017-05-10T13:09:06Z

Hi, I should have stayed for the core meeting...
I don't mind these changes, but after the discussion with @rovere, the main question is "when?"

If it is before data taking, this could disastrous - DQM is famous for having "out of the loop" sequences.
Such as online DQM, PCL. Even HLT has their entire own logic of merging/producing the histograms.

Second point, is what happens to the DQMEDHarvester modules?

In official "reco" step, harvesting modules are not run (they run in a separate step), just DQMEDAnalyzers, but in online DQM and in many "custom" user sequences, two are run together.

Would DQMEDHarvester have to modified to consume the DQMToken? But it has to be run before the output module is run, but after the histograms are merged into the global histograms.

If we leave this for after the data taking, the best would to think about reworking the entire DQM to make it work with multiple containers, instead of one shared container for all the histograms.
Which was suggested here: #18574 (comment)

Dr15Jones · 2017-05-10T14:27:26Z

I don't mind these changes, but after the discussion with @rovere, the main question is "when?"

If it is before data taking, this could disastrous - DQM is famous for having "out of the loop" sequences.
Such as online DQM, PCL. Even HLT has their entire own logic of merging/producing the histograms.

In the meeting, it was stated that the changes had to go in before datataking else they couldn't happen until November because we need to easily backport. Delaying to November would completely stop further framework development until then. Upon further reflection, I think we can compromise. If the framework changes which add the new module type go in for data taking (i.e. CMSSW_9_2) then the rest of the changes can go in to the release following CMSSW_9_2. If a module were changed in the later release, it could still be back ported into CMSSW_9_2 because the framework support would be there. This would allow a more leisurely migration of the DQM code without possible disruption to data taking.

Second point, is what happens to the DQMEDHarvester modules?

In official "reco" step, harvesting modules are not run (they run in a separate step), just DQMEDAnalyzers, but in online DQM and in many "custom" user sequences, two are run together.

Would DQMEDHarvester have to modified to consume the DQMToken? But it has to be run before the output module is run, but after the histograms are merged into the global histograms.

To be completely correct, the DQMEDHarverster should become a EDSummaryProducer (since it makes histograms which ultimately must happen before the OutputModule is run) and it should consume the DQMTokens from the modules which make the histograms it pulls from the DQMStore. That would completely guarantee the framework would run all modules in the proper ordering at global end transitions.

Your question about the DQMEDHarvester made me go back and look more closely at the code and I found an interesting fact. The DQMEDAnalyzers do all their work on end stream transition (they are also called on end global transition but the DQMEDAnalyzer code does nothing). The DQMRootOutputModule and the DQMEDHarversters do their work on end global transitions. The framework guarantees that all end stream transitions happen before the end global transition. Therefore all DQMEDAnalyzers are guaranteed to have done their job filling histograms before the DQMRootOutputModule or the DQMEDHarvester ever reads those histograms. So if we guarantee that no DQMEDAnalyzer overrides the static methods globalEndRunSummary or globalEndLuminosityBlockSummary (that is true today) then the DQMEDAnalyzers can remain EDAnalyzers and everything will work just fine.

Therefore it is only the DQMEDHarvesters which need to be converted to EDSummaryProducer because they do their work at global end transition and need to be run before the DQMRootOutputModule is called.

slava77 · 2017-05-10T14:42:06Z

On 5/10/17 7:27 AM, Chris Jones wrote: In the meeting, it was stated that the changes had to go in before datataking

I have interpreted this as "for the summer 93X (previously known as 92X) release" rather than by today. Has this blocking feature been discovered in the past few days only? Recall that there was no 92X release until end of July on the schedule until other problems appeared.

Dr15Jones · 2017-05-10T14:58:48Z

@slava77 at the meeting I was told any change would have to happen within 2 weeks. Would that mean 9_3_X (which I totally missed hearing about)?

This would not be a blocking change (i.e. we would not wait for it). If it didn't make it in then I'd have to figure something else out for the framework to be able to do work for the next 9 months :).

slava77 · 2017-05-10T15:04:20Z

On 5/10/17 7:58 AM, Chris Jones wrote: @slava77 <https://github.com/slava77> at the meeting I was told any change would have to happen within 2 weeks. Would that mean 9_3_X (which I totally missed hearing about)?

I may have been myself confused with numbering. Anyways, IMHO, disruptive framework level changes should not go in to the "major bug fix" release inserted to fix one specific large problem.

This would not be a blocking change (i.e. we would not wait for it). If it didn't make it in then I'd have to figure something else out for the framework to be able to do work for the next 9 months :).

So, this is a newly discovered feature/problem, right? Otherwise I can't understand the urgency given that 92X is an exceptional release with O(week) time scale to converge.

…

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#15866 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AEdcbql0qJ6aOw7BUhPM3_9mVD9jlqwcks5r4dCpgaJpZM4J-dAL>.

davidlange6 · 2017-05-10T15:10:30Z

On May 10, 2017, at 5:04 PM, Slava Krutelyov ***@***.***> wrote: On 5/10/17 7:58 AM, Chris Jones wrote: > @slava77 <https://github.com/slava77> at the meeting I was told any > change would have to happen within 2 weeks. Would that mean 9_3_X (which > I totally missed hearing about)? I may have been myself confused with numbering. Anyways, IMHO, disruptive framework level changes should not go in to the "major bug fix" release inserted to fix one specific large problem. > > This would not be a blocking change (i.e. we would not wait for it). If > it didn't make it in then I'd have to figure something else out for the > framework to be able to do work for the next 9 months :). So, this is a newly discovered feature/problem, right? Otherwise I can't understand the urgency given that 92X is an exceptional release with O(week) time scale to converge.

was not so disruptive or risky... nor really FWK changes. Rather a small interface change that otherwise people have to back port around. If back port headaches during the run are preferred, I guess it won't affect me:)

…

> > — > You are receiving this because you were mentioned. > Reply to this email directly, view it on GitHub > <#15866 (comment)>, > or mute the thread > <https://github.com/notifications/unsubscribe-auth/AEdcbql0qJ6aOw7BUhPM3_9mVD9jlqwcks5r4dCpgaJpZM4J-dAL>. > — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.

Dr15Jones · 2017-05-10T15:13:51Z

@slava77 This issue was first openned Sep 15, 2016 so it is not 'new'.
The behavior of the framework would not change for the data taking release. The change to the DQM modules is in anticipation of a framework change to allow concurrent processing of Runs and LuminosityBlocks which wouldn't happen until end of the Summer (at earliest). The change that is proposed to happen shortly would only introduce a new module type and would change nothing else (so I wouldn't consider it a 'disruptive framework level change'). The only reason there is an urgency to get this in now is to easy POSSIBLE backporting. E.g., say we make the DQM changes in mid June and then a week later a bug is found in one of those DQM modules which is affecting data taking. The want is to be easy to backport that change with minimal other changes (i.e. not worry that the back port would have to change the type back to EDAnalyzer from EDSummaryProducer).

slava77 · 2017-05-10T15:22:16Z

On 5/10/17 8:13 AM, Chris Jones wrote: only introduce a new module type and would change nothing else

sorry, I have misunderstood the significance of the preceding discussion which was suggesting that existing modules and sequences are going to change

Dr15Jones · 2017-05-10T15:27:28Z

@dmitrijus I looked deeper into the DQMEDHarvester and see that users can not override analyze. That is actually great.
https://github.com/cms-sw/cmssw/blob/master/DQMServices/Core/interface/DQMEDHarvester.h#L27

Therefore it is impossible for a class inheriting from DQMEDHarvester to care about Events. In that case, if we changed the DQMEDHarverster to inherit from EDProducer (instead of EDAnalyzer) it would not matter if the module was removed from a Path (or EndPath) by a call to convertToUnscheduled since it never cares about the Events. Even for an unscheduled EDProducer, the framework will call its (begin|end)(Run|LuminosityBlock) method so the harvesting work will happen.

The upshot is it appears that we do not have a need for the new module EDSummaryProducer (since one of its major features was to avoid ever being accidently run unscheduled). Instead we only need to change DQMEDHarvester to be an EDProducer which produces the DQMTokens at endLuminosityBlock. [By the way, do DQMHarversters only deal with LuminosityBlock histograms since they do not have a dqmEndRun method?]

Dr15Jones · 2017-05-10T15:39:04Z

@dmitrijus @rovere Can the DQMRootOutputModule (or MEtoEDMConverter) actually work with the DQMEDHarvester? It looks like most harvesters do their work at endJob time. The only ones that appear to be able to work are ones which work on LuminosityBlock histograms.

Dr15Jones · 2017-05-10T16:15:14Z

Here is a new proposal

keep classes that inherit from DQMEDAnalyzer the same (since they work on stream transitions instead of global transitions)
change DQMEDHarvester into an EDProducer
change all configuration files for classes that inherit from DQMEDHarvester to be cms.EDProducer instead of cms.EDAnalyzer (needed because of the C++ change).

These changes can either be done before data taking or can happen in a later summer release. If we do the change 'now' it means it would be easy to backport any changes to DQMEDHarvester from the Summer development release to the data taking release. If we wait, then part of a backport of a configuration file fragment containing a DQMEDHarvester class would entail changing the type back to cms.EDAnalyzer.

Comments?

fwyzard · 2017-05-10T17:15:20Z

Just an idea: add inside FWCore.ParameterSet.Config a line like

DQMEDHarvester = EDProducer

and change all python files to use cms.DQMEDHarvester .
Just in case we need a further change down the line...

Dr15Jones · 2017-05-10T17:29:29Z

I had a similar thought but would prefer the line to be in DQMServices/Core/python/DQMEDHarvester.py since that is where the base class is from. The reason is the DQMEDHarvester is not a type known to the framework.

fwyzard · 2017-05-10T17:33:39Z

True - but it would require users to import a file that they currently do not need to import.

Maybe we could have that line in DQMServices.Core.DQMEDHarvester , and make FWCore.ParameterSet.Config import it, and in the future other similar files that declare a specialised name for the plugin type ?

For example we could add DQMEDSource or DQMEDAnalyzer for the anayzers, HLTFilter, etc.

Dr15Jones · 2017-05-10T17:43:21Z

A coulple things I don't like about a python DQMEDHarvester declaration is if one uses a fillDescription it will use the true framework type and not the redirecting python type. I would think that would be confusing. Also, if we had a python DQMEDHarvester I could see someone trying to find a method of cms.Process to return all classes that inherit from DQMEDHarvester in C++. But there isn't a method of cms.Process for that type, just ones for the true type (either EDAnalyzer or EDProducer).

fwyzard · 2017-05-10T17:48:18Z

Again, true, of course.

I can think of some workarounds, but they seems like being more work than this change would be worth...

rovere · 2017-05-11T07:01:13Z

@Dr15Jones, all
I think we also have to put this into the right perspective. IIUC the penalty of not being fully concurrent will be paid once at every beginrun and beginLumi, the latter being used rarely in DQM to book anything (hence holding the lock). I think the overall effect on timing is considerable if measured within these transactions, negligible if integrated along a typical job.
Having said that, the final proposal from Chris sounds ok to me and is relatively cheap to implement.
One final remark: besides pure technical C++ changes, I'd ask to be as conservative as possible with everything related to the DQMNet and the underlying network layer, since this has to be able to talk to CMSSW and DQMGUI at the same time: if we change the layout of the shipped objects on one side of the communication, we need to update also the other peer (DQMGUI) otherwise we won't have Online DQM. I'll check the code changes when proposed in a final PR and make sure the full chain will work.

dmitrijus · 2017-05-11T09:34:08Z

@dmitrijus @rovere Can the DQMRootOutputModule (or MEtoEDMConverter) actually work with the DQMEDHarvester? It looks like most harvesters do their work at endJob time. The only ones that appear to be able to work are ones which work on LuminosityBlock histograms.

I have no idea. They are never used together - at least I don't know anyone using them together.
However, they are used together with DQMFileSaver* modules.
Which also must be supported in multi threaded environment (online DQM, HLT).
So, another token consumer. Currently, they are EDAnalyzers, but it should be easy to port them.

As far as planning goes, the biggest worry is precisely backporting.

I just want to avoid the situation where "upstream" DQM changes are no longer trivial to backport into "production". If there no interface changes or if they are minimal, I am fine with it.
Better done sooner than later.

Dr15Jones · 2017-05-11T17:59:26Z

#18701 makes the needed interface change for the DQMEDHarvester. The actually use of the DQMToken would be done in a later step since it is not critical for allowing easy backporting.

dmitrijus · 2017-05-15T12:38:48Z

A comment for #18701...
It would be really great if instead of changing EDAnalyzer to EDProducer, we did what @fwyzard suggested.

DQMEDHarvester = EDProducer

and then use cms.DQMEDHarvserver every time they are instantiated.
This way we would mark the modules, for any future migration which will happen.

Dr15Jones · 2017-05-15T13:27:23Z

@dmitrijus There are some big drawbacks to doing that which I touch on here
#15866 (comment)

If you still want it, we can do it that way since it is your domain.

Dr15Jones · 2017-05-15T18:57:08Z

@dmitrijus #18717 and #18751 are the two alternate user interfaces. Please let us know which one you wish to support.

Dr15Jones · 2017-05-16T13:52:10Z

@vanbesien @dmitrijus it is crucial that this issue be resolved before CMSSW_9_2 is released since this is prohibiting the framework from moving forward for greater threading efficiency. We've attempted to give you two options, #18717 and #18751 with explainations about their pros and cons. Since DQM is the one who is responsible for maintaining and documenting that code the final decision is yours.

Dr15Jones · 2017-05-16T18:11:00Z

@vanbesien @dmitrijus I spoke with @davidlange6 and the timing for the decision is not this week, but within the next two weeks (i.e. for when CMSSW_9_2 closes).

Dr15Jones · 2017-05-22T17:45:31Z

@vanbesien @dmitrijus which of the two options do you prefer, #18717 or #18751 ?

dmitrijus · 2017-05-23T12:52:04Z

Answered with +1, #18751
I was on vacation, still trying to catch up to everything :)

Dr15Jones · 2017-05-23T13:03:12Z

Thanks!

schneiml · 2019-01-16T09:26:32Z

@Dr15Jones this is also done, as I understand?

Dr15Jones · 2019-01-16T14:26:58Z

+1
The framework does now run modules concurrently during Run and LuminosityBlock transitions so this issue was resolved.

cmsbuild added the pending-assignment label Sep 15, 2016

cmsbuild added core-pending dqm-pending pending-signatures and removed pending-assignment labels Sep 15, 2016

Dr15Jones mentioned this issue Oct 1, 2016

Check that DQMStore and DQMEDAnalyzer can be run concurrently in Runs and Lumis #16062

Closed

dmitrijus mentioned this issue May 15, 2017

Change DQMEDHarvester from an edm::one::EDAnalyzer to an edm::one::EDProducer #18717

Closed

schneiml mentioned this issue Jan 17, 2018

Introduce an alias for DQMEDAnalyzer #21871

Merged

cmsbuild added core-approved and removed core-pending labels Jan 16, 2019

Dr15Jones closed this as completed Jan 16, 2019

DQMIO and DQM paths need to be serialized #15866

DQMIO and DQM paths need to be serialized #15866

Comments

Dr15Jones commented Sep 15, 2016

cmsbuild commented Sep 15, 2016 • edited by smuzaffar

Dr15Jones commented Sep 15, 2016

Dr15Jones commented Sep 15, 2016

cmsbuild commented Sep 15, 2016

Dr15Jones commented Oct 6, 2016

dmitrijus commented Oct 11, 2016

Dr15Jones commented May 9, 2017

Dr15Jones commented May 9, 2017

dmitrijus commented May 10, 2017

Dr15Jones commented May 10, 2017

slava77 commented May 10, 2017 via email

Dr15Jones commented May 10, 2017

slava77 commented May 10, 2017 via email

davidlange6 commented May 10, 2017 via email

Dr15Jones commented May 10, 2017

slava77 commented May 10, 2017 via email

Dr15Jones commented May 10, 2017

Dr15Jones commented May 10, 2017

Dr15Jones commented May 10, 2017

fwyzard commented May 10, 2017

Dr15Jones commented May 10, 2017

fwyzard commented May 10, 2017

Dr15Jones commented May 10, 2017

fwyzard commented May 10, 2017

rovere commented May 11, 2017

dmitrijus commented May 11, 2017

Dr15Jones commented May 11, 2017

dmitrijus commented May 15, 2017 • edited

Dr15Jones commented May 15, 2017

Dr15Jones commented May 15, 2017

Dr15Jones commented May 16, 2017

Dr15Jones commented May 16, 2017

Dr15Jones commented May 22, 2017

dmitrijus commented May 23, 2017

Dr15Jones commented May 23, 2017

schneiml commented Jan 16, 2019

Dr15Jones commented Jan 16, 2019

cmsbuild commented Sep 15, 2016 •

edited by smuzaffar

dmitrijus commented May 15, 2017 •

edited