Make DQMStore more thread efficient #18574
Comments
A new Issue was created by @Dr15Jones Chris Jones. @davidlange6, @Dr15Jones, @smuzaffar can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here
assign dqm
New categories assigned: dqm. @vanbesien, @dmitrijus you have been requested to review this Pull request/Issue and eventually sign. Thanks.
Is there interest from DQM to have Core pursue this further?
I would say yes.
This was my "long term" plan: make each module or stream (or both) fill its own container. But there is simply not enough time. To make it worse (and more time consuming), DQM runs a lot of legacy code: DQM online, multi-run harvesting, PCL workflows, all of which would need to be reviewed. The changes above are most welcome. I will try to review them this week and have a talk with Marco.
Are you still interested in having the core framework group pursue the above changes?
Let me share some comments, mostly in random order, and mostly based on things that @dmitrijus, @schneiml, @rovere and @knoepfel have said in the past days and weeks... As far as I know, the main constraint for the
Regarding the booking interface, I think we should
If we want to allow concurrent lumis and runs, we need to move to a simpler interface for the
Regarding the migration to the DQM "tokens", if we want all the scheduling to be done via the token dependencies, we could
Thanks @fwyzard for this summary of our coffee conversations!

I want to add on top that the proposal from @dmitrijus in the old comment above still stands: maybe the easiest way ahead is to keep one container per DQM module. I could imagine that we can drop the indexing on run and lumi number, since the number of distinct runs and lumis in the store is at any time limited by the number of concurrent runs and lumis (±1), so a linear search there won't hurt (which then solves the "changing lumi/run on ME" issue). (This was (afaik) always the case.)

The interesting problem is how harvesting can work when the MEs are distributed over many containers, but that is surprisingly easy (since it works like that already today -- the container is partitioned by moduleId): if the harvester knows the source, it can consume the right token, get access to the right container and read from there (this is possible today, but not done: no harvester (afaik) does queries with a non-0 moduleId). The "normal", "legacy" harvesting can still work as today: after storing and loading, the moduleIds are erased and one container holds all the MEs.

The only big issue to be solved is how to make "dependency based" and "legacy" harvesters coexist during the migration (there will be a migration, and it might take years).
Let me reiterate one concern: before thinking how to rearrange the storage of the
Yes, of course -- we have to test this, since probably nobody really knows atm what the DQMGUI does precisely. My suspicion is that this is covered the same way as said for harvesting above: storing and then loading will erase the moduleId in the current partitioned DQMStore, and DQMGUI most likely sets
I agree that this is not a very elegant solution -- but it is exactly what happens today, due to the creative side effects of the
Can you point me to the DQM GUI code so I can look at the couplings between it and
@fwyzard: I see discussion about how
Hi @knoepfel,
@knoepfel the authoritative repo of the DQMGUI is this, if I am not mistaken: IIRC it pulls in some CMSSW code at build time, so you won't find a DQMStore in the source code there, but it will be compiled with it when you build it. Which is nontrivial, but there is documentation: https://github.com/rovere/dqmgui/blob/index128/doc/overview/devguide.rst
@schneiml thanks. I'm not sure how successful I will be in building it, since I doubt I have the permissions where they may be required, but I'll give it a go.

@fwyzard regarding proxies: the concept of "current" directory must go away in terms of the
The proposal would be that:
Thoughts?
It is perfectly fine for a module to book or access histograms in multiple directories.
@knoepfel could you please clarify, at least to me, what exactly your goal is related to CMSSW/DQMStore and the DQMGUI?
The intention is to adjust
@knoepfel please ignore the DQMGUI for the time being and commit code only in CMSSW. There is, in CMSSW, a standalone compilation of DQMStore: as long as that works, I am happy, and the DQMGUI is happy too.
Thanks @rovere
@rovere okay, will do. Thanks.
@fwyzard yes, calling
Is this issue still alive? My understanding is that it is NOT, after the changes in DQMStore by Marcel. Please let me know.
@Dr15Jones, is it still an open issue?
My take is that the topic of making DQMStore more thread efficient is still open, but at this point it likely would have to start from scratch.
The framework now runs the stream beginRun and beginLuminosityBlock transitions concurrently for all streams, and within a stream the modules are allowed to run concurrently. We are unable to make full use of this ability because the DQMStore takes and holds a lock during the call to the DQMAnalyzer. This means we are unable to run multiple DQMAnalyzers concurrently, and since the lock is unknown to the framework, the framework is unable to schedule around the DQMAnalyzer conflict.
To that end, we've had a person looking into the possibility of modifying the DQMStore to allow true concurrent access. The person is no longer able to work on the project, but was able to write up their findings (which are below):
Summary of DQMStore Work
Over the last few weeks, I've spent time analyzing how the DQMStore service can be enhanced in a multi-threaded context. Current thread-safety for the DQMStore is achieved by using an std::mutex, which guards access to the DQMStore member data.
Take-away points:
Issues of efficiency:
The current design uses an `std::set<MonitorElement>` data member as a manager of the histograms. Weak ordering is enforced by comparing (by `operator<`) the `MonitorElement::DQMNet::CoreObject` members in this order: `run, lumi, streamId, moduleId, dirname, objname`.

I have modified that order to be: `streamId, moduleId, run, lumi, dirname, objname`, with the hope that `streamId` and `moduleId` can be removed entirely from the ordering criterion. However, the current implementation relies on a proto/lower_bound pattern that is meant to "short-circuit" the lookup for a given MonitorElement. Ideally, searching in this manner can be removed in favor of a nearly-direct lookup based on stream ID and module ID.
My github fork:
I have two branches that include work I’ve done on the DQMStore. The second branch builds off of the first.
https://github.com/knoepfel/cmssw/tree/DQMStore-cleanup
https://github.com/knoepfel/cmssw/tree/DQMStore-checkpoint
The cleanup branch includes various improvements:
More reliance on modern C++ facilities
Each DQMStore::book* function had two signatures, with one taking a char const* argument, and the other taking an std::string const& argument. The doubling of function signatures has been removed so that each DQMStore::book* function takes a DQMStore::char_string const& argument, which is implicitly convertible to char const* or std::string const&. This greatly reduced the number of lines of code, and it should be invisible to the user.
Functions with signature like foo(void) were replaced with foo().
clang-format was run on several of the files.
The checkpoint branch is a move toward more concurrent access to the DQMStore data, using a `tbb::concurrent_unordered_map<std::pair<stream_id_t, module_id_t>, MonitorElement>`. As of commit 3471d24, no insertions are done, but the code to do it is present.
Changes yet to be made: