Make DQMStore more thread efficient #18574

Open
Dr15Jones opened this issue May 4, 2017 · 26 comments

@Dr15Jones
Contributor

The framework now runs the stream beginRun and beginLuminosityBlock transitions concurrently for all streams, and within a stream the modules are allowed to run concurrently. We are unable to make full use of this ability because the DQMStore takes and holds a lock during the call to the DQMAnalyzer. This means we are unable to run multiple DQMAnalyzers concurrently, and since the lock is unknown to the framework, the framework is unable to schedule around the DQMAnalyzer conflict.

To that end, we've had a person looking into the possibility of modifying the DQMStore to allow true concurrent access. That person is no longer able to work on the project but was able to write up their findings, which are below:

Summary of DQMStore Work
Over the last few weeks, I've spent time analyzing how the DQMStore service can be enhanced in a multi-threaded context. Current thread-safety for the DQMStore is achieved by using an std::mutex, which guards access to the DQMStore member data.

Take-away points:

  • Access to the DQMStore is encouraged through the IBooker and IGetter classes.
  • Access to the DQMStore is also possible without using the IBooker/IGetter classes.
  • The bookTransaction interface requires a stream ID and module ID, which is already conducive to concurrency improvements—the second branch I reference below contains changes to that end.
  • Introducing a thread-safe container, keyed off the stream ID and module ID, is likely necessary for concurrent access, but it is not sufficient. For cases in which all histograms corresponding to a given path are "merged" or retrieved, an efficient means of looping through the relevant histograms is desirable. This perhaps means creating an additional container that maps the "path" to the module IDs for the given path.
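
A minimal sketch of the auxiliary index suggested in the last point; the type names here are illustrative, not existing DQMStore code:

#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

using module_id_t = std::uint32_t;  // assumed width; the real typedef may differ

// Maps a directory path to the ids of the modules that booked histograms
// under it, so per-path merging can visit only the relevant entries.
std::unordered_map<std::string, std::vector<module_id_t>> modulesByPath;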

Issues of efficiency:

The current design uses an std::set<MonitorElement> data member to manage the histograms. A strict weak ordering is enforced by comparing (via operator<) the MonitorElement::DQMNet::CoreObject members in this order:

run, lumi, streamId, moduleId, dirname, objname

I have modified that order to be:

streamId, moduleId, run, lumi, dirname, objname

with the hopes that streamId and moduleId can be removed entirely from the ordering criterion. However, the current implementation uses the following pattern:

MonitorElement proto {…};            // prototype holding only the lookup key fields
auto i = data_.lower_bound(proto);   // jump to the first element not less than proto
auto const e = data_.end();
for (; i != e; ++i) {…}              // linear scan from that point onward

where the proto/lower_bound usage is meant to "short-circuit" the lookup for a given MonitorElement. Ideally, searching in this manner can be removed in favor of a nearly-direct lookup based on stream ID and module ID.
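
For reference, a minimal sketch of the modified ordering, assuming value-type members named as above (the real CoreObject may differ, e.g. dirname is stored as a pointer):

#include <tuple>

bool operator<(CoreObject const& a, CoreObject const& b)
{
  return std::tie(a.streamId, a.moduleId, a.run, a.lumi, a.dirname, a.objname)
       < std::tie(b.streamId, b.moduleId, b.run, b.lumi, b.dirname, b.objname);
}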

My github fork:

I have two branches that include work I’ve done on the DQMStore. The second branch builds off of the first.

  • https://github.com/knoepfel/cmssw/tree/DQMStore-cleanup

  • https://github.com/knoepfel/cmssw/tree/DQMStore-checkpoint

The cleanup branch includes various improvements:

  • More reliance on modern C++ facilities

  • Each DQMStore::book* function had two signatures: one taking a char const* argument and the other taking an std::string const& argument. The duplication has been removed so that each DQMStore::book* function takes a DQMStore::char_string const& argument, which is implicitly convertible to char const* or std::string const&. This greatly reduced the number of lines of code, and it should be invisible to the user (a minimal sketch of the idea follows this list).

  • Functions with signatures like foo(void) were changed to foo().

  • clang-format was run on several of the files.
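
A minimal sketch of the char_string idea described above; the member names are illustrative:

#include <string>

class char_string {
public:
  char_string(char const* s) : data_{s} {}           // implicit: accepts C strings
  char_string(std::string const& s) : data_{s} {}    // implicit: accepts std::string
  operator char const*() const { return data_.c_str(); }
  operator std::string const&() const { return data_; }
private:
  std::string data_;
};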

The checkpoint branch is a move toward more concurrent access to the DQMStore data:

  • Introduction of a safe_data_ member of type tbb::concurrent_unordered_map<std::pair<stream_id_t, module_id_t>, MonitorElement>. As of commit 3471d24, no insertions are performed, but the code to do so is present.
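
A minimal sketch of such a member, assuming 32-bit ids and a hand-rolled hash for the pair key (tbb::concurrent_unordered_map takes the hasher as its third template parameter; MonitorElement is the class from the DQMStore sources):

#include <cstdint>
#include <functional>
#include <utility>
#include <tbb/concurrent_unordered_map.h>

using stream_id_t = std::uint32_t;  // assumed; the real typedefs may differ
using module_id_t = std::uint32_t;
using Key = std::pair<stream_id_t, module_id_t>;

struct KeyHash {
  std::size_t operator()(Key const& k) const {
    // pack both 32-bit ids into one 64-bit value and hash that
    return std::hash<std::uint64_t>{}((std::uint64_t{k.first} << 32) | k.second);
  }
};

tbb::concurrent_unordered_map<Key, MonitorElement, KeyHash> safe_data_;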

Changes yet to be made:

  • All uses of data_.lower_bound(proto) should be replaced with something more conducive to multi-threading.
  • Care should be taken to handle merging, directory removal, etc. It may be that separate data structures should be introduced to handle the merging steps.
@cmsbuild
Contributor

cmsbuild commented May 4, 2017

A new Issue was created by @Dr15Jones Chris Jones.

@davidlange6, @Dr15Jones, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@Dr15Jones
Contributor Author

assign dqm

@cmsbuild
Contributor

cmsbuild commented May 4, 2017

New categories assigned: dqm

@vanbesien,@dmitrijus you have been requested to review this Pull request/Issue and eventually sign? Thanks

@Dr15Jones
Contributor Author

Is there interest from DQM to have Core pursue this further?

@vanbesien
Contributor

I would say yes.
@dmitrijus was planning a revamp of the DQM store when manpower permits.
@dmitrijus what do you think?

@dmitrijus
Contributor

dmitrijus commented May 9, 2017

> Introducing a thread-safe container, keyed off the stream ID and module ID.

This was my "long term" plan: make each module or stream (or both) fill its own container.
There is absolutely no need for DQMEDAnalyzers to access the global container or, in fact, any shared container.

But there is simply not enough time. To make it worse (and more time consuming), DQM runs a lot of legacy code: DQM online, multi-run harvesting, PCL workflows, all of which would need to be reviewed.
Half of those are still broken when running multithreaded :(

The changes above are most welcome. I will try to review them this week and have a talk with Marco.
DQMNet and DQMStore affect the DQM GUI (which is external to CMSSW), so we will have to make sure the DQM GUI can be properly compiled with those changes...

@knoepfel
Contributor

knoepfel commented Dec 8, 2017

Are you still interested in having the core framework group pursue the above changes?

@knoepfel
Contributor

DQMStore::bookTransaction does not include the LuminosityBlock number as a function argument. In moving toward a more thread-efficient use of the DQMStore, it might make sense for this number to be provided. Other than breaking current code, is there a reason the lumi-block number should not be included in the interface?

@fwyzard
Contributor

fwyzard commented Feb 27, 2018

Let me share some comments, mostly in random order, and mostly based on things that @dmitrijus, @schneiml, @rovere and @knoepfel have said in the past days and weeks...

As far as I know, the main constraint for the DQMStore is that it should work and be efficient both within CMSSW and for its use in the DQM GUI. Any changes to how the MonitorElements are stored and indexed need to take into account both use cases, as well as the data transfer from CMSSW to the DQM GUI.

Regarding the booking interface, I think we should

  • move the transient members (current directory, module id, etc.) outside the DQMStore and into some kind of proxy (a sketch follows this list); this way we don't need to lock the DQMStore for the whole time the booking takes place in the user's code, but only during the actual booking itself; if the booking operation becomes concurrent-safe, we could get rid of the lock altogether;
  • such a proxy cannot be the IBooker as things stand, because the IBooker and the IGetter need to share the current directory; if the two could be unified, the IBooker could become the proxy;
  • it would make sense to separate the booking of per-lumi, per-run, and per-job histograms; I do not know if we really need per-job histograms, or whether they should be handled in a different way (e.g. via multirun harvesting).
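
A hedged sketch of the proxy idea from the first point above; DQMStore::bookInDir and all other names here are hypothetical, not existing API:

#include <string>

class BookingProxy {
public:
  BookingProxy(DQMStore& store, unsigned moduleId)
    : store_{store}, moduleId_{moduleId} {}

  // Navigating directories touches only proxy-local state: no store lock.
  void cd(std::string const& dir) { currentDir_ = dir; }

  // Only the actual booking call needs to synchronize with the store.
  MonitorElement* book1D(std::string const& name, std::string const& title,
                         int nbins, double lo, double hi) {
    return store_.bookInDir(currentDir_, moduleId_, name, title, nbins, lo, hi);
  }

private:
  DQMStore& store_;
  unsigned moduleId_;
  std::string currentDir_;
};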

If we want to allow concurrent lumis and runs, we need to implement a simpler interface for the global modules.
For example, we could use some way to book N copies of each histogram, where N is the number of concurrent runs or lumisections, and handle the multiple concurrent runs/lumis as automatically as possible (e.g., users call histo->fill(run, lumi, x, y) and the histo takes care of using the correct MonitorElement based on the run number or run index), or something like this...
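
A purely illustrative sketch of that idea; every name here is an assumption, including the slot-selection policy:

#include <cstddef>
#include <vector>

class ConcurrentHisto {
public:
  explicit ConcurrentHisto(std::size_t nSlots) : slots_(nSlots) {}

  // Route the fill to the copy owned by the current run/lumi slot.
  void fill(unsigned run, unsigned lumi, double x) {
    slots_[slotFor(run, lumi)].Fill(x);
  }

private:
  // Placeholder policy: a real implementation would track which run/lumi
  // each slot currently belongs to.
  std::size_t slotFor(unsigned run, unsigned lumi) const {
    return (run ^ lumi) % slots_.size();
  }
  std::vector<MonitorElement> slots_;  // one copy per concurrent run/lumi
};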

  • if we change the storage of the DQMStore, we should look for something that lets us change a MonitorElement's run or lumisection; once we move to C++17 we could do something like "take them out of the set, change the run/lumi, insert them back"; without C++17 it would be very inefficient.
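
The C++17 feature alluded to here is the node-handle interface of std::set; a sketch (setRunLumi is a hypothetical mutator):

auto node = data_.extract(it);             // detach the element without copying
node.value().setRunLumi(newRun, newLumi);  // mutate the (former) key fields
data_.insert(std::move(node));             // re-attach under the new ordering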

Regarding the migration to the DQM "tokens", if we want all the scheduling to be done via the token dependencies, we could

  • let each DQM source produce a (per-lumi or per-run) token;
  • let each DQM harvester consume the tokens produced by the modules whose histograms it is going to read;
  • let each DQM harvester produce a (per-lumi or per-run) token.
Apart from the amount of work needed by the migration itself, this would work as intended only if all the modules are part of the same job; to keep it working when the "source"s are in one job and the "harvester"s are in a separate job, we would need the various DQM file formats to also store information about the tokens, and the various DQM sources to recreate them. I guess this is one of the aspects where the DQM developers could benefit the most from the framework experts' help.
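
A hedged sketch of how such token wiring could look, using generic EDM-style registration calls; DQMToken, the transition, and the module label are assumptions, not a confirmed design:

// In the DQM source's constructor: declare a per-lumi token product.
produces<DQMToken, edm::Transition::EndLuminosityBlock>();

// In the harvester's constructor: depend on the source's token, so the
// framework schedules the harvester after the source for each lumi.
consumes<DQMToken, edm::InLumi>(edm::InputTag{"myDQMSource"});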

.A

@schneiml
Contributor

Thanks @fwyzard for this summary of our coffee conversations!

I want to add on top that the proposal from @dmitrijus in the old comment above still stands: maybe the easiest way ahead is to keep one container per DQM module. I could imagine that we can drop the indexing on run and lumi number, since the number of distinct runs and lumis in the store is at any time limited by the number of concurrent runs and lumis (±1), so a linear search there won't hurt (which then solves the "changing lumi/run on a ME" issue). (This was, afaik, always the case.)
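
A sketch of that bounded linear search; the field names are illustrative (the real MonitorElement keeps run/lumi inside its CoreObject):

#include <vector>

// With only ~(number of concurrent runs/lumis) live entries per container,
// a linear scan is cheap and needs no ordering on run or lumi.
MonitorElement* findME(std::vector<MonitorElement>& mes, unsigned run, unsigned lumi) {
  for (auto& me : mes)
    if (me.run == run && me.lumi == lumi) return &me;
  return nullptr;
}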

The interesting problem is how harvesting can work when the MEs are distributed over many containers, but that is surprisingly easy (since it effectively works like that already today -- the container is partitioned by moduleId): if the harvester knows the source, it can consume the right token, get access to the right container, and read from there (this is possible today, but not done: no harvester, afaik, queries with a non-0 moduleId). The "normal", "legacy" harvesting can still work as it does today: after storing and loading, the moduleIds are erased and one container holds all the MEs. The only big issue to be solved is how to make "dependency-based" and "legacy" harvesters coexist during the migration (there will be a migration, and it might take years).

@fwyzard
Contributor

fwyzard commented Feb 28, 2018

Let me reiterate one concern: before thinking how to rearrange the storage of the MonitorElements and how to let the various modules access them, please take the time to understand the possible impact on the DQM GUI, outside CMSSW.

@schneiml
Contributor

Yes, of course -- we have to test this, since probably nobody really knows at the moment what the DQMGUI does precisely. My suspicion is that this is covered the same way as described for harvesting above: storing and then loading will erase the moduleId in the current partitioned DQMStore, and the DQMGUI most likely sets enableMultiThread to false to force all moduleIds to 0. That would also sort of work with a distributed DQMStore (which of course has to keep pointers to all containers, including one for the "global" things, like loaded histograms).

I agree that this is not a very elegant solution -- but it is exactly what happens today, due to the creative side effects of the enableMultiThread flag.

@knoepfel
Contributor

Can you point me to the DQM GUI code so I can look at the couplings between it and DQMStore?

@fwyzard: I see discussion about how MonitorElements are stored, but not much discussion about the schema of the MonitorElement class itself. What constraints are there for making adjustments to MonitorElement?

@fwyzard
Contributor

fwyzard commented Mar 12, 2018

Hi @knoepfel ,
sorry, I'm afraid I don't really know about that.
.A

@schneiml
Contributor

@knoepfel the authoritative repo of the DQMGUI is this, if I am not mistaken:
https://github.com/rovere/dqmgui

IIRC it pulls in some CMSSW code at build time, so you won't find a DQMStore in the source code there, but the DQMStore will be compiled in when you build it. Which is nontrivial, but there is documentation: https://github.com/rovere/dqmgui/blob/index128/doc/overview/devguide.rst

@knoepfel
Contributor

@schneiml thanks. I'm not sure how successful I will be in building it since I doubt I have permissions where they may be required, but I'll give it a go.

@fwyzard regarding proxies: the concept of a "current" directory must go away from the DQMStore itself to ensure thread safety. I see no problems with (e.g.) the DQMEDAnalyzer supplying the directory for retrieving the MonitorElements, and that can be hidden from any derived classes of DQMEDAnalyzer.

The proposal would be that:

  • the bookTransaction calls made by anyone must specify the relevant directory.
  • specifying the directory would, under the covers, create a proxy that all IBooker and IGetter instances for that module would use, thus ensuring correct directory navigation.
  • For now, leave the schema of MonitorElement alone (alas), but I will need to change the container for storing MonitorElements to ensure thread-safe insertion...without breaking the DQM GUI.

Thoughts?

@fwyzard
Contributor

fwyzard commented Mar 13, 2018

> the bookTransaction calls made by anyone must specify the relevant directory.

It is perfectly fine for a module to book or access histograms in multiple directories.
So, the choice of directory should not be done at the time of the call to bookTransaction - at least, not only.
The proxy object returned by bookTransaction should still have cd() and similar methods to navigate the directory tree.

@rovere
Contributor

rovere commented Mar 13, 2018

@knoepfel could you please clarify, at least for me, what exactly your goal is regarding the CMSSW DQMStore and the DQMGUI?

@knoepfel
Contributor

The intention is to adjust the DQMStore so that it is not necessary to lock the data structures during booking, i.e. to take better advantage of multi-threading. I am hoping to make no (or at most minimal) changes to the DQMGUI as a consequence of the adjustments to the DQMStore.

@rovere
Contributor

rovere commented Mar 13, 2018

@knoepfel please ignore the DQMGUI for the time being and commit code only in CMSSW. There is, in CMSSW, a standalone compilation of DQMStore: as long as that works, I am happy, and the DQMGUI is happy too.
My long-ish-term plan is definitely to cut this link between CMSSW and the DQMGUI as well, but I still have to make a plan and fully assess the consequences.

@Dr15Jones
Contributor Author

Thanks @rovere

@knoepfel
Contributor

@rovere okay, will do. Thanks.

@knoepfel
Contributor

@fwyzard yes, calling cd(...) is still supported via the IBooker and IGetter interfaces because both objects will share the same proxy and therefore have consistent state.

@jfernan2
Contributor

Is this issue still alive? My understanding is that it is NOT, after the changes to the DQMStore by Marcel. Please let me know.

@smuzaffar
Contributor

@Dr15Jones , is it still an open issue?

@makortel
Contributor

My take is that the topic of making DQMStore more thread efficient is still open, but at this point likely would have to start from scratch.
