Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DQM-related developments, fixes and improvements (master branch) #21652

Merged
merged 8 commits into from Dec 19, 2017

Conversation

fwyzard
Copy link
Contributor

@fwyzard fwyzard commented Dec 6, 2017

Let the DQMStore delete unused MEs independently from the output modules

The multithreaded DQMStore can create a global copy of each by-lumi ME after each lumisection, in order to merge the stream MEs and write them out.
Deleting these MEs has been left to the DQMFileSaver or DQMOutputModule.
This has the disadvantage that

  • multiple file savers or output modules are not supported;
  • if a job runs any DQM modules without a file saver or output module, the MEs are never deleted, leaking memory.

This PR moves the call to deleteUnusedLumiHistograms into the DQMStore itself, via the postGlobalEndLumi transition, solving both problems.

Support global per-lumisection MEs

To save memory, it is desirable to introduce thread-safe global MonitorElements, with the only restriction that only one lumisection can be processed at a given time.

This PR introduces the code changes in order to support global, per-lumisection MEs:

  • explicitly mark the merged MEs to be deleted at the end of each lumisection, instead of relying on the global vs. stream distinction
  • reset (but not delete) global per-lumi MEs at the beginning of each lumisection
  • let the various DQM file savers write the global per-lumi MEs

Support histograms with concurrent filling

Add a ConcurrentMonitorElement wrapper around the MonitorElement class that exposes a concurrent filling interface:

  • fill() is marked as const to indicate concurrency, and uses a lock internally to protect the underlying ROOT object;
  • setTitle(), setAxis...() etc. methods are non-const, to be used only while the ConcurrentMonitorElement is being booked;
  • no direct access to the underlying ROOT object.

Add a ConcurrentBooker class that extends the IBooker interface to return ConcurrentMonitorElements instead of MonitorElements, and a bookConcurrentTransaction() call to make use of it.

Implement a base class for "global" DQM analyzers, using the ConcurrentBooker and bookConcurrentTransaction() calls.


For a quick tutorial about using these new classes, see How to migrate a DQMEDanalyzer to a DQMGlobalEDAnalyzer.

@cmsbuild cmsbuild added this to the CMSSW_10_0_X milestone Dec 6, 2017
@fwyzard
Copy link
Contributor Author

fwyzard commented Dec 6, 2017

@dmitrijus @rovere FYI

@cmsbuild
Copy link
Contributor

cmsbuild commented Dec 6, 2017

The code-checks are being triggered in jenkins.

@fwyzard
Copy link
Contributor Author

fwyzard commented Dec 6, 2017

@cmsbuild, please test

@fwyzard
Copy link
Contributor Author

fwyzard commented Dec 6, 2017

@dmitrijus, I may push more developments here as I go, so there is no need to sign this yet.
Do feel free to review and comment, of course.

@cmsbuild
Copy link
Contributor

cmsbuild commented Dec 6, 2017

-code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/PR-21652/2454

Code check has found code style and quality issues which could be resolved by applying a patch in https://cmssdt.cern.ch/SDT/code-checks/PR-21652/2454/git-diff.patch
e.g. curl https://cmssdt.cern.ch/SDT/code-checks/PR-21652/2454/git-diff.patch | patch -p1

You can run scram build code-checks to apply code checks directly

@cmsbuild
Copy link
Contributor

cmsbuild commented Dec 6, 2017

The code-checks are being triggered in jenkins.

@fwyzard
Copy link
Contributor Author

fwyzard commented Dec 6, 2017

@cmsbuild, please test

@cmsbuild
Copy link
Contributor

cmsbuild commented Dec 6, 2017

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/PR-21652/2455

@cmsbuild
Copy link
Contributor

cmsbuild commented Dec 6, 2017

The tests are being triggered in jenkins.

@cmsbuild
Copy link
Contributor

cmsbuild commented Dec 6, 2017

A new Pull Request was created by @fwyzard (Andrea Bocci) for master.

It involves the following packages:

DQMServices/Components
DQMServices/Core
DQMServices/FileIO

@vazzolini, @kmaeshima, @dmitrijus, @cmsbuild, @jfernan2, @vanbesien can you please review it and eventually sign? Thanks.
@barvic this is something you requested to watch as well.
@davidlange6, @slava77 you are the release manager for this.

cms-bot commands are listed here

@fwyzard
Copy link
Contributor Author

fwyzard commented Dec 6, 2017

More code checks.

@cmsbuild
Copy link
Contributor

cmsbuild commented Dec 8, 2017

Comparison is ready
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-21652/25001/summary.html

The workflows 1003.0, 1001.0, 1000.0, 140.53, 136.7611, 136.731, 4.22 have different files in step1_dasquery.log than the ones found in the baseline. You may want to check and retrigger the tests if necessary. You can check it in the "files" directory in the results of the comparisons

@slava77 comparisons for the following workflows were not done due to missing matrix map:

  • /build/cmsbld/jenkins-workarea/workspace/compare-root-files-short-matrix/results/JR-comparison/PR-21652/1325.7_TTbar_13_94XNanoAODINPUT+TTbar_13_94XNanoAODINPUT+NANOEDMMC2017

Comparison Summary:

  • No significant changes to the logs found
  • Reco comparison results: 0 differences found in the comparisons
  • DQMHistoTests: Total files compared: 27
  • DQMHistoTests: Total histograms compared: 2835241
  • DQMHistoTests: Total failures: 1
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 2835062
  • DQMHistoTests: Total skipped: 178
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 1.36999999987 KiB( 23 files compared)
  • Checked 113 log files, 9 edm output root files, 27 DQM output files

// MonitorElement is being booked.
void setTitle(std::string const& title)
{
me_->setTitle(title);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hi @fwyzard - ignorant question - why not add the same lock_guards in these functions (or would that not be sufficient for making them concurrent safe?)

Copy link
Contributor Author

@fwyzard fwyzard Dec 11, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that you are right, and it would indeed make these methods concurrent safe.

The reason I did not mark them as such is the DQM usage pattern, or at least the one I have in mind:

  • the ConcurrentMonitorElements (CMEs) are booked in the (global) begin run transition, where they are accessible as non-const objects;
  • the CMEs are later filled in the event loop, where they are accessible only as const objects to emphasise that they are concurrent-safe.

I can add the locks also for the operations on the titles and labels; however, I expect those to be set when the CMEs are created and available as non-const, are used only by one thread, and anyway already serialised by the big DQM lock.

An alternative approach I've been playing with - which is more flexible but slightly less clean - is to let the users create the histograms on their own via auto h = std::make_unique<TH1F>(name, ...) etc., customise the axes, titles, labels etc via access the full ROOT object interface, and when done transfer ownership to the DQMStore via a call to book1D(name, std::move(h)).
This should be just as efficient from the memory point of view (no Clone() involved), allow more flexibility, at the cost of one more line of code (per histogram) and with the small risk that the users may hold on to the initial pointer and use it later on to bypass the lock.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right - I agree its the best usage pattern to have all of these CMEs beautified in the begin run transitions - but if the code allows it elsewhere someone will do...

Your thoughts about transferring ownership of a TH1F are easier in the sense that there are a lot of functions that would need wrapping if the direct access to the TH* is removed (a good thing) - but the tradeoff is the ease of any future underlying histogram interface changes.

The second thought is beyond this PR - do you want to add locks for now?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'd rather start with the present interface, and update it as needed when we start transitioning the analyzers to use it - we may encounter some patterns I haven't anticipated yet...

@davidlange6
Copy link
Contributor

hi @fwyzard - sorry for my slow answer - that sounds ok assuming someone thinkings they are responsible for catching developers.. is that you or @dmitrijus ?

@fwyzard
Copy link
Contributor Author

fwyzard commented Dec 12, 2017

I'm updating myself the modules that run inside the HLT.
For the modules that run offline, @dmitrijus may have a simpler solution in mind, and I will work with @mtosi (when she's less busy 🤰) for the HLT ones.

@davidlange6
Copy link
Contributor

sounds good - @dmitrijus ok for you for the online/offline dqm side of this?

@dmitrijus
Copy link
Contributor

dmitrijus commented Dec 12, 2017

My "simple" / "one" solution requires DQM modules to be scheduled differently than they are now.
That is planned, but that's a longer term plan.

But I am fine with this PR. Even after we move on to "simple / one module" for rest of the DQM, DQM should still support the concurrent filling of histograms (Which is what this PR introduces.)

Once this PR / CME interface finalizes, we will move many tracker histograms to this too.
This should free up a really large chunk of memory.

@fwyzard
Copy link
Contributor Author

fwyzard commented Dec 13, 2017

@davidlange6
as I missed the ORP this week, is there anything holding this up ?

@davidlange6
Copy link
Contributor

hi @fwyzard - thanks for following up - instead of a confirmation from @dmitrijus I was told something new was coming next week that meant he didn't need to answer my original question. If thats not the plan, the question above still stands.

@fwyzard
Copy link
Contributor Author

fwyzard commented Dec 14, 2017

From my side, I am working on further updates, but they are orthogonal to the changes in this PR, so I think this should be merged, and I will open a separate one later on.

As for @dmitrijus, I don't know what they plan to push next week.

@davidlange6
Copy link
Contributor

okay - so i get confusing messages - so back to my question

who thinks they are responsible for catching developers.. is that @dmitrijus ?

@davidlange6
Copy link
Contributor

(offline dqm developers that is)

@fwyzard
Copy link
Contributor Author

fwyzard commented Dec 15, 2017

@davidlange6 , here is my understanding, based on the discussion with @dmitrijus from a couple of weeks ago.

Currently all DQM modules are edm::stream::EDAnalyzer; each stream has its own copy of its MonitorElements, plus a global MonitorElement to hold the merged results.

My proposal to reduce the memory consumption is to implement a "concurrent" interface to the MonitorElement, to allow keeping only the global ones around. The DQMStore also needed to be extended to support this approach together with the lumi-based histogram saving used at HLT.
This PR implements all these changes.
The actual DQM modules can still be stream modules (for example if they need the stream interface for reasons unrelated to the DQM), or they can be migrated to be global ones.

As I said earlier, I am already migrating the modules used within the HLT to this approach; I will look into the HLT DQM modules used offline together with the STEAM group.

The proposal from @dmitrijus is somewhat alternative, somewhat complementary: change the DQM modules from edm::stream::EDAnalyzer to edm::one::EDProducer.
The migration from stream to one should take care of the memory reduction.
The migration from EDAnalyzer to EDProducer should help the framework to schedule the DQM modules concurrently, even though they are one modules.

I assume that either the DQM Core team or the individual developers will be in charge of implementing this migration.

@fwyzard
Copy link
Contributor Author

fwyzard commented Dec 15, 2017

(bottom line, can you please merge this PR ?)

@davidlange6
Copy link
Contributor

seems difficult to get answers to simple questions - maybe @vazzolini, @kmaeshima, @jfernan2 can help - or perhaps the DQM meeting tomorrow will resolve it - I'm hoping we don't really plan two directions forward to solve this problem (I prefer @fwyzard's unless proven to have contention issues at some low # of threads [and even then, maybe])

@fwyzard
Copy link
Contributor Author

fwyzard commented Dec 18, 2017

The framework provides global, stream, one (and now limited) modules... why would it be a problem if some DQM modules are global while others are stream or one ?

@davidlange6
Copy link
Contributor

davidlange6 commented Dec 18, 2017 via email

@davidlange6
Copy link
Contributor

+1

@cmsbuild cmsbuild merged commit 3c3c015 into cms-sw:master Dec 19, 2017
fwyzard added a commit to fwyzard/cmssw that referenced this pull request Dec 20, 2017
> @Dr15Jones
> I have another theory as to what caused the problem. There are two
> different phases in the framework as a lumi is ending. There is the
> 'global end lumi' phase and the 'write lumi' phase. The 'write lumi'
> phase is only seen by the OutputModules and is when they "write the
> lumi to output'. So the change in cms-sw#21652 deletes some histograms at
> 'end of global end lumi' which is before 'write lumi' gets called.
> Therefore if the deleted histograms were supposed to be saved to the
> file, they would be missing.

A workaround for the issue is to let the `DQMStore` delete the
no-longer-needed lumi-based histograms in the "pre global begin lumi"
transition instead of the "post global end lumi".
@fwyzard fwyzard mentioned this pull request Dec 20, 2017
@fwyzard fwyzard deleted the DQMStore_developments_100x branch January 17, 2018 12:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants