Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DQMIO and DQM paths need to be serialized #15866

Closed
Dr15Jones opened this issue Sep 15, 2016 · 37 comments
Closed

DQMIO and DQM paths need to be serialized #15866

Dr15Jones opened this issue Sep 15, 2016 · 37 comments

Comments

@Dr15Jones
Copy link
Contributor

With the coming of running Paths and EndPaths concurrently during Lumi and Run transitions, the dqmoffine_step path and the path holding the DQMOutputModule need to be serialized. The problem is the DQM EDAnalyzers have a hidden (from the framework) data dependency with DQMOutputModule. The several ways to fix this:

  1. put all the DQM EDAnalyzers into one EndPath and have the DQMOutputModule at its end. That would be bad for parallel processing since the EDAnalyzers would not be run in parallel ever.
  2. put each DQM EDAnalyzer in its own endpath with a simple EDProducer at the end. The path with the DQMOutputModule would have a module at its beginning which consumes all the data products made by the EDProducers at the end of the DQM end paths.
  3. Make the DQM EDAnalyzers into EDProducers which puts a token object into the Lumi/Run and then the DQMOutputModule would get those data object and automatically be dependent on those modules.
@cmsbuild
Copy link
Contributor

cmsbuild commented Sep 15, 2016

A new Issue was created by @Dr15Jones Chris Jones.

@davidlange6, @smuzaffar, @Dr15Jones can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@Dr15Jones
Copy link
Contributor Author

assign core

@Dr15Jones
Copy link
Contributor Author

assign dqm

@cmsbuild
Copy link
Contributor

New categories assigned: core,dqm

@Dr15Jones,@smuzaffar,@vanbesien,@dmitrijus you have been requested to review this Pull request/Issue and eventually sign? Thanks

@Dr15Jones
Copy link
Contributor Author

ping do people have any preference for the solution?

@dmitrijus
Copy link
Contributor

Hello, neither of us are the experts on the framework, but I think solution #3 would be the best.
The less we touch/involve EndPaths, the better.

@Dr15Jones
Copy link
Contributor Author

At the Core meeting I proposed the following
https://indico.cern.ch/event/638163/contributions/2585775/attachments/1455835/2247950/SummaryProducer.pdf

The idea is to add a new type, EDSummaryProducer, to the framework and convert all DQMEDAnalyzers to this type (done centrally). From the meeting, this would have to be done within the next week (all the work would be done by Core).

Would that be acceptable to you?

An alternative, but not well liked, idea was to change EDAnalyzers to allow them to put data into the Run and LuminosityBlock at end transitions. That touches less user code but muddles the behaviors of the module types.

@Dr15Jones
Copy link
Contributor Author

@fwyzard FYI

@dmitrijus
Copy link
Contributor

Hi, I should have stayed for the core meeting...
I don't mind these changes, but after the discussion with @rovere, the main question is "when?"

If it is before data taking, this could disastrous - DQM is famous for having "out of the loop" sequences.
Such as online DQM, PCL. Even HLT has their entire own logic of merging/producing the histograms.

Second point, is what happens to the DQMEDHarvester modules?

In official "reco" step, harvesting modules are not run (they run in a separate step), just DQMEDAnalyzers, but in online DQM and in many "custom" user sequences, two are run together.

Would DQMEDHarvester have to modified to consume the DQMToken? But it has to be run before the output module is run, but after the histograms are merged into the global histograms.

If we leave this for after the data taking, the best would to think about reworking the entire DQM to make it work with multiple containers, instead of one shared container for all the histograms.
Which was suggested here: #18574 (comment)

@Dr15Jones
Copy link
Contributor Author

I don't mind these changes, but after the discussion with @rovere, the main question is "when?"

If it is before data taking, this could disastrous - DQM is famous for having "out of the loop" sequences.
Such as online DQM, PCL. Even HLT has their entire own logic of merging/producing the histograms.

In the meeting, it was stated that the changes had to go in before datataking else they couldn't happen until November because we need to easily backport. Delaying to November would completely stop further framework development until then. Upon further reflection, I think we can compromise. If the framework changes which add the new module type go in for data taking (i.e. CMSSW_9_2) then the rest of the changes can go in to the release following CMSSW_9_2. If a module were changed in the later release, it could still be back ported into CMSSW_9_2 because the framework support would be there. This would allow a more leisurely migration of the DQM code without possible disruption to data taking.

Second point, is what happens to the DQMEDHarvester modules?

In official "reco" step, harvesting modules are not run (they run in a separate step), just DQMEDAnalyzers, but in online DQM and in many "custom" user sequences, two are run together.

Would DQMEDHarvester have to modified to consume the DQMToken? But it has to be run before the output module is run, but after the histograms are merged into the global histograms.

To be completely correct, the DQMEDHarverster should become a EDSummaryProducer (since it makes histograms which ultimately must happen before the OutputModule is run) and it should consume the DQMTokens from the modules which make the histograms it pulls from the DQMStore. That would completely guarantee the framework would run all modules in the proper ordering at global end transitions.

Your question about the DQMEDHarvester made me go back and look more closely at the code and I found an interesting fact. The DQMEDAnalyzers do all their work on end stream transition (they are also called on end global transition but the DQMEDAnalyzer code does nothing). The DQMRootOutputModule and the DQMEDHarversters do their work on end global transitions. The framework guarantees that all end stream transitions happen before the end global transition. Therefore all DQMEDAnalyzers are guaranteed to have done their job filling histograms before the DQMRootOutputModule or the DQMEDHarvester ever reads those histograms. So if we guarantee that no DQMEDAnalyzer overrides the static methods globalEndRunSummary or globalEndLuminosityBlockSummary (that is true today) then the DQMEDAnalyzers can remain EDAnalyzers and everything will work just fine.

Therefore it is only the DQMEDHarvesters which need to be converted to EDSummaryProducer because they do their work at global end transition and need to be run before the DQMRootOutputModule is called.

@slava77
Copy link
Contributor

slava77 commented May 10, 2017 via email

@Dr15Jones
Copy link
Contributor Author

@slava77 at the meeting I was told any change would have to happen within 2 weeks. Would that mean 9_3_X (which I totally missed hearing about)?

This would not be a blocking change (i.e. we would not wait for it). If it didn't make it in then I'd have to figure something else out for the framework to be able to do work for the next 9 months :).

@slava77
Copy link
Contributor

slava77 commented May 10, 2017 via email

@davidlange6
Copy link
Contributor

davidlange6 commented May 10, 2017 via email

@Dr15Jones
Copy link
Contributor Author

@slava77 This issue was first openned Sep 15, 2016 so it is not 'new'.
The behavior of the framework would not change for the data taking release. The change to the DQM modules is in anticipation of a framework change to allow concurrent processing of Runs and LuminosityBlocks which wouldn't happen until end of the Summer (at earliest). The change that is proposed to happen shortly would only introduce a new module type and would change nothing else (so I wouldn't consider it a 'disruptive framework level change'). The only reason there is an urgency to get this in now is to easy POSSIBLE backporting. E.g., say we make the DQM changes in mid June and then a week later a bug is found in one of those DQM modules which is affecting data taking. The want is to be easy to backport that change with minimal other changes (i.e. not worry that the back port would have to change the type back to EDAnalyzer from EDSummaryProducer).

@slava77
Copy link
Contributor

slava77 commented May 10, 2017 via email

@Dr15Jones
Copy link
Contributor Author

@dmitrijus I looked deeper into the DQMEDHarvester and see that users can not override analyze. That is actually great.
https://github.com/cms-sw/cmssw/blob/master/DQMServices/Core/interface/DQMEDHarvester.h#L27

Therefore it is impossible for a class inheriting from DQMEDHarvester to care about Events. In that case, if we changed the DQMEDHarverster to inherit from EDProducer (instead of EDAnalyzer) it would not matter if the module was removed from a Path (or EndPath) by a call to convertToUnscheduled since it never cares about the Events. Even for an unscheduled EDProducer, the framework will call its (begin|end)(Run|LuminosityBlock) method so the harvesting work will happen.

The upshot is it appears that we do not have a need for the new module EDSummaryProducer (since one of its major features was to avoid ever being accidently run unscheduled). Instead we only need to change DQMEDHarvester to be an EDProducer which produces the DQMTokens at endLuminosityBlock. [By the way, do DQMHarversters only deal with LuminosityBlock histograms since they do not have a dqmEndRun method?]

@Dr15Jones
Copy link
Contributor Author

@dmitrijus @rovere Can the DQMRootOutputModule (or MEtoEDMConverter) actually work with the DQMEDHarvester? It looks like most harvesters do their work at endJob time. The only ones that appear to be able to work are ones which work on LuminosityBlock histograms.

@Dr15Jones
Copy link
Contributor Author

Here is a new proposal

  1. keep classes that inherit from DQMEDAnalyzer the same (since they work on stream transitions instead of global transitions)
  2. change DQMEDHarvester into an EDProducer
  3. change all configuration files for classes that inherit from DQMEDHarvester to be cms.EDProducer instead of cms.EDAnalyzer (needed because of the C++ change).

These changes can either be done before data taking or can happen in a later summer release. If we do the change 'now' it means it would be easy to backport any changes to DQMEDHarvester from the Summer development release to the data taking release. If we wait, then part of a backport of a configuration file fragment containing a DQMEDHarvester class would entail changing the type back to cms.EDAnalyzer.

Comments?

@fwyzard
Copy link
Contributor

fwyzard commented May 10, 2017

Just an idea: add inside FWCore.ParameterSet.Config a line like

DQMEDHarvester = EDProducer

and change all python files to use cms.DQMEDHarvester .
Just in case we need a further change down the line...

@Dr15Jones
Copy link
Contributor Author

I had a similar thought but would prefer the line to be in DQMServices/Core/python/DQMEDHarvester.py since that is where the base class is from. The reason is the DQMEDHarvester is not a type known to the framework.

@fwyzard
Copy link
Contributor

fwyzard commented May 10, 2017

True - but it would require users to import a file that they currently do not need to import.

Maybe we could have that line in DQMServices.Core.DQMEDHarvester , and make FWCore.ParameterSet.Config import it, and in the future other similar files that declare a specialised name for the plugin type ?

For example we could add DQMEDSource or DQMEDAnalyzer for the anayzers, HLTFilter, etc.

@Dr15Jones
Copy link
Contributor Author

A coulple things I don't like about a python DQMEDHarvester declaration is if one uses a fillDescription it will use the true framework type and not the redirecting python type. I would think that would be confusing. Also, if we had a python DQMEDHarvester I could see someone trying to find a method of cms.Process to return all classes that inherit from DQMEDHarvester in C++. But there isn't a method of cms.Process for that type, just ones for the true type (either EDAnalyzer or EDProducer).

@fwyzard
Copy link
Contributor

fwyzard commented May 10, 2017

Again, true, of course.

I can think of some workarounds, but they seems like being more work than this change would be worth...

@rovere
Copy link
Contributor

rovere commented May 11, 2017

@Dr15Jones, all
I think we also have to put this into the right perspective. IIUC the penalty of not being fully concurrent will be paid once at every beginrun and beginLumi, the latter being used rarely in DQM to book anything (hence holding the lock). I think the overall effect on timing is considerable if measured within these transactions, negligible if integrated along a typical job.
Having said that, the final proposal from Chris sounds ok to me and is relatively cheap to implement.
One final remark: besides pure technical C++ changes, I'd ask to be as conservative as possible with everything related to the DQMNet and the underlying network layer, since this has to be able to talk to CMSSW and DQMGUI at the same time: if we change the layout of the shipped objects on one side of the communication, we need to update also the other peer (DQMGUI) otherwise we won't have Online DQM. I'll check the code changes when proposed in a final PR and make sure the full chain will work.

@dmitrijus
Copy link
Contributor

@dmitrijus @rovere Can the DQMRootOutputModule (or MEtoEDMConverter) actually work with the DQMEDHarvester? It looks like most harvesters do their work at endJob time. The only ones that appear to be able to work are ones which work on LuminosityBlock histograms.

I have no idea. They are never used together - at least I don't know anyone using them together.
However, they are used together with DQMFileSaver* modules.
Which also must be supported in multi threaded environment (online DQM, HLT).
So, another token consumer. Currently, they are EDAnalyzers, but it should be easy to port them.

As far as planning goes, the biggest worry is precisely backporting.

I just want to avoid the situation where "upstream" DQM changes are no longer trivial to backport into "production". If there no interface changes or if they are minimal, I am fine with it.
Better done sooner than later.

@Dr15Jones
Copy link
Contributor Author

#18701 makes the needed interface change for the DQMEDHarvester. The actually use of the DQMToken would be done in a later step since it is not critical for allowing easy backporting.

@dmitrijus
Copy link
Contributor

dmitrijus commented May 15, 2017

A comment for #18701...
It would be really great if instead of changing EDAnalyzer to EDProducer, we did what @fwyzard suggested.

DQMEDHarvester = EDProducer

and then use cms.DQMEDHarvserver every time they are instantiated.
This way we would mark the modules, for any future migration which will happen.

@Dr15Jones
Copy link
Contributor Author

@dmitrijus There are some big drawbacks to doing that which I touch on here
#15866 (comment)

If you still want it, we can do it that way since it is your domain.

@Dr15Jones
Copy link
Contributor Author

@dmitrijus #18717 and #18751 are the two alternate user interfaces. Please let us know which one you wish to support.

@Dr15Jones
Copy link
Contributor Author

@vanbesien @dmitrijus it is crucial that this issue be resolved before CMSSW_9_2 is released since this is prohibiting the framework from moving forward for greater threading efficiency. We've attempted to give you two options, #18717 and #18751 with explainations about their pros and cons. Since DQM is the one who is responsible for maintaining and documenting that code the final decision is yours.

@Dr15Jones
Copy link
Contributor Author

@vanbesien @dmitrijus I spoke with @davidlange6 and the timing for the decision is not this week, but within the next two weeks (i.e. for when CMSSW_9_2 closes).

@Dr15Jones
Copy link
Contributor Author

@vanbesien @dmitrijus which of the two options do you prefer, #18717 or #18751 ?

@dmitrijus
Copy link
Contributor

Answered with +1, #18751
I was on vacation, still trying to catch up to everything :)

@Dr15Jones
Copy link
Contributor Author

Thanks!

@schneiml
Copy link
Contributor

@Dr15Jones this is also done, as I understand?

@Dr15Jones
Copy link
Contributor Author

+1
The framework does now run modules concurrently during Run and LuminosityBlock transitions so this issue was resolved.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

8 participants