
Introduce a local cache for any caching MagneticField functions, and use them in Oscar(MT)Producer #29631

Closed
wants to merge 3 commits

Conversation

@makortel (Contributor) commented May 4, 2020

PR description:

Following the discussion in #28180, this PR proposes a local cache for the "previous volume" caching in MagGeometry, following the sketch in #28180 (comment), and introduces local::MagneticField as an easy replacement for MagneticField const*.

This PR also migrates Oscar(MT)Producer to local::MagneticField, to enable testing of the most affected use case.
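For illustration, here is a minimal standalone sketch of the pattern. The names MagneticFieldCache and local::MagneticField come from this PR, but all signatures below are illustrative rather than the actual CMSSW interfaces, and the toy cache keys on the last query point whereas the real code caches the previous volume:

```cpp
#include <optional>

struct GlobalPoint { double x = 0, y = 0, z = 0; };
struct GlobalVector { double x = 0, y = 0, z = 0; };

// Stand-in for the CMSSW MagneticField base class.
class MagneticField {
public:
  virtual ~MagneticField() = default;
  virtual GlobalVector inTesla(GlobalPoint const&) const = 0;
};

// Caller-owned cache; deliberately not thread safe, so each thread (or
// EDM stream) owns its own instance.
struct MagneticFieldCache {
  std::optional<GlobalPoint> lastPoint;
  GlobalVector lastValue;
};

namespace local {
  // Cheap value type bundling the field with a cache, meant as a drop-in
  // replacement for passing around MagneticField const*.
  class MagneticField {
  public:
    MagneticField(::MagneticField const* field, MagneticFieldCache* cache)
        : field_(field), cache_(cache) {}

    GlobalVector inTesla(GlobalPoint const& p) const {
      // Toy caching keyed on the previous query point; the real code
      // caches the "previous volume" of the volume-based engine.
      if (cache_->lastPoint && cache_->lastPoint->x == p.x &&
          cache_->lastPoint->y == p.y && cache_->lastPoint->z == p.z) {
        return cache_->lastValue;
      }
      cache_->lastPoint = p;
      cache_->lastValue = field_->inTesla(p);
      return cache_->lastValue;
    }

  private:
    ::MagneticField const* field_;
    MagneticFieldCache* cache_;
  };
}  // namespace local
```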

PR validation:

Limited matrix runs. With one workflow I also tested that the SIM step runs with multiple threads.

@cmsbuild (Contributor) commented May 4, 2020

The code-checks are being triggered in jenkins.

@cmsbuild (Contributor) commented May 4, 2020

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-29631/15010

  • This PR adds an extra 36KB to repository

@cmsbuild (Contributor) commented May 4, 2020

A new Pull Request was created by @makortel (Matti Kortelainen) for master.

It involves the following packages:

MagneticField/Engine
MagneticField/VolumeBasedEngine
SimG4Core/Application
SimG4Core/GeometryProducer
SimG4Core/MagneticField

@perrotta, @cmsbuild, @civanch, @mdhildreth, @slava77 can you please review it and eventually sign? Thanks.
@namapane, @rovere, @fabiocos, @slomeo this is something you requested to watch as well.
@silviodonato, @dpiparo you are the release manager for this.

cms-bot commands are listed here

@makortel (Contributor, Author) commented May 4, 2020

@cmsbuild, please test

@cmsbuild (Contributor) commented May 4, 2020

The tests are being triggered in jenkins.
https://cmssdt.cern.ch/jenkins/job/ib-run-pr-tests/5982/console Started: 2020/05/04 20:17

@makortel (Contributor, Author) commented May 4, 2020

@civanch Could you run the SIM scaling test also for this PR, please?

@cmsbuild (Contributor) commented May 4, 2020

+1
Tested at: b649995
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-df8d5b/5982/summary.html
CMSSW: CMSSW_11_1_X_2020-05-04-1100
SCRAM_ARCH: slc7_amd64_gcc820

@cmsbuild (Contributor) commented May 4, 2020

Comparison job queued.

@cmsbuild (Contributor) commented May 4, 2020

Comparison is ready
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-df8d5b/5982/summary.html

Comparison Summary:

  • No significant changes to the logs found
  • Reco comparison results: 0 differences found in the comparisons
  • DQMHistoTests: Total files compared: 34
  • DQMHistoTests: Total histograms compared: 2696239
  • DQMHistoTests: Total failures: 1
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 2695919
  • DQMHistoTests: Total skipped: 319
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 33 files compared)
  • Checked 147 log files, 16 edm output root files, 34 DQM output files

@civanch (Contributor) commented May 5, 2020

@makortel , @namapane ,

CPU results (events/s) for comparison:

WF "Nthreads" pre6 29561 29631

MInBias 8 1.991 2.072 2.068
MinBias 12 2.683 2.823 2.824
MinBias 16 3.020 3.044 3.187
ttbar 8 0.417 0.448 0.439
ttbar 12 0.606 0.634 0.627
ttbar 16 0.734 0.772 0.765

There is some systematic effect, but in general both #29561 and #29631 provide similar CPU improvements; #29631 is not better than #29561. What I do not understand is why in this PR the cache is used by value and not by reference.

@slava77 (Contributor) commented May 5, 2020

@namapane
please clarify if you have significant concerns about the implementation in this PR.
Thank you.

@slava77 (Contributor) commented May 5, 2020

> @namapane
> please clarify if you have significant concerns about the implementation in this PR.
> Thank you.

curiously, this was posted within seconds of #29631 (comment)

@namapane (Contributor) commented May 5, 2020

Yes but I was quicker. :-)

@namapane (Contributor) commented May 5, 2020

Just to be clear, I am perfectly fine with this implementation if core believes it is technically superior.

I just want to point out that a workaround may be needed to migrate SHP and other odd cases (although only SHP comes to my mind right away), and that it would be disturbing to discover a degradation due to cache fragmentation at a point where going back is no longer an option (although that may not necessarily be the case, and there would probably be workarounds as well).

@slava77 (Contributor) commented May 5, 2020

One thing I am not sure about, and which I mentioned a few times, is the effect of fragmenting the cache, especially for HLT. In the above example we would implicitly decouple the cache for calls from the propagator from that of the rest of the track fit. It could be that this improves the rate of cache hits, or that it degrades it. So there is some risk that is difficult to evaluate without completing the migration.

The (SteppingHelix)propagator caches the volume pointer internally and calls its own inTesla interface. Where a propagation crosses volume boundaries, calling code that also uses the field at about the same time, before calling the propagator or after getting a return value from it, would very likely reuse the same thread_local pointer; it would perhaps similarly (though somewhat less) likely hit the "global" (currently atomic) cache; and it would likely have a cache miss if it implements the caching as in this PR but does not pass the cache to the propagator call.
So, if the migration is to be done, it would have to percolate through all the calls depending on the field in the context of the muon track reco.
OTOH, caching as done in this PR is likely of rather limited benefit in this case: I expect that most of the cost is in the propagation compared to a KF update, and the propagator has already been caching locally for a long time.

@namapane (Contributor) commented May 5, 2020

I agree with your analysis for the specific case of SHP, but my point is that I am not sure the "percolation through all the calls depending on the field in the context of the muon track reco" is really possible; e.g. SteppingHelixPropagatorESProducer.cc creates a propagator with a local:: field carrying its own cache.
This is just an example; what I mean is that it is not simple to tell a priori what the overall effect on the cache hit rate will be if field = pMF.product() is simply replaced with field = local::MagneticField(pMF.product()) everywhere.
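To make the replacement concrete, here is a hedged sketch reusing the toy types from the sketch in the PR description above; the two functions are hypothetical call sites, not code from CMSSW:

```cpp
// Reuses the toy MagneticField/MagneticFieldCache/local::MagneticField
// types from the earlier sketch.
void analyzeBefore(MagneticField const* pMF) {
  MagneticField const* field = pMF;  // caching stays inside the engine
  // ... field->inTesla(...) everywhere downstream ...
}

void analyzeAfter(MagneticField const* pMF) {
  MagneticFieldCache cache;                 // caller-owned, per call site
  local::MagneticField field(pMF, &cache);  // drop-in value replacement
  // ... field.inTesla(...) downstream; cached state is now shared only as
  // far as this 'field' object is explicitly passed along ...
}
```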

@makortel (Contributor, Author) commented May 7, 2020

@namapane I certainly share the pain of the implied code changes, but the argument again comes down to "special case" vs. "generally acceptable pattern".

The local::MagneticField actually may not be placed in ESProducts (like propagators), because the MagneticFieldCache is not thread safe (by design). Instead, the cache needs to be owned e.g. in the local scope of an EDProducer's produce() function, in a member variable of a stream EDModule, or in a StreamCache of a global EDModule.
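As a hedged skeleton of those ownership options (assuming the usual CMSSW stream-module types and token names; none of this is taken from the PR's actual code):

```cpp
// Skeleton only: assumes the CMSSW framework types (edm::stream::EDProducer,
// edm::ESGetToken) and shows where the cache may live so it is never shared.
class MyFieldConsumer : public edm::stream::EDProducer<> {
public:
  void produce(edm::Event& event, edm::EventSetup const& setup) override {
    // Option 1: cache in the local scope of produce(); trivially safe,
    // but the cached volume is forgotten between events.
    MagneticFieldCache localCache;
    local::MagneticField field(&setup.getData(fieldToken_), &localCache);
    // ... use field.inTesla(...) ...
  }

private:
  // Option 2: cache as a member of a stream module; one instance per
  // stream, so still never accessed concurrently, and it survives from
  // one event of the stream to the next.
  MagneticFieldCache streamCache_;
  edm::ESGetToken<MagneticField, IdealMagneticFieldRecord> fieldToken_;
};
```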

I think it is possible to deal with the ESProducts holding a MagneticField (Propagators, NavigationSchool, TrackingRecHitPropagator, TransientTrackBuilder; did I miss any?) with the "local cache design", but it will need a bit more thought.

@namapane (Contributor) commented May 7, 2020

@makortel, I agree that the local-cache solution cannot be adopted in cases like this one, where SHP is created as an ESProduct.
In addition, what I actually wanted to point out with the second link is that we also allow the possibility of EDAnalyzers creating an SHP directly, as for example here or here.
In such a case, if the analyzer is migrated to use local::MagneticField, it will percolate down and eventually reach SHP's constructor, where the dynamic_cast will no longer work.
As I mentioned, this can clearly be worked around, but we would need no such workarounds with a TLS cache.

@namapane (Contributor) commented May 7, 2020

BTW: of course, SHP is protected against the failure of the dynamic_cast that will occur if it is constructed with a local::MagneticField instead of the field itself.
So in this kind of case we would have a change of behavior that is not immediately detectable, neither at compile time (there is no error) nor at runtime. It would lead to a performance loss through the silent deactivation of SHP's internal cache, which would then have to be spotted and debugged.
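A toy illustration of this silent deactivation; all class names below are stand-ins, not the actual SHP code:

```cpp
#include <iostream>

class MagneticField {
public:
  virtual ~MagneticField() = default;
};

// What the propagator's internal cache needs to see (stand-in for the
// volume-based engine).
class VolumeBasedMagneticField : public MagneticField {};

// Any adapter placed between the caller and the engine.
class ForwardingField : public MagneticField {};

class Propagator {  // stand-in for SteppingHelixPropagator
public:
  explicit Propagator(MagneticField const* field)
      : vbField_(dynamic_cast<VolumeBasedMagneticField const*>(field)) {}
  bool internalCacheActive() const { return vbField_ != nullptr; }

private:
  VolumeBasedMagneticField const* vbField_;
};

int main() {
  VolumeBasedMagneticField engine;
  ForwardingField adapter;
  std::cout << Propagator(&engine).internalCacheActive() << '\n';   // 1
  std::cout << Propagator(&adapter).internalCacheActive() << '\n';  // 0:
  // construction succeeds, but the internal cache is silently off.
}
```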

Explicit construction of SHP is a marginal use pattern compared to serving it as an ESProduct; but other odd use cases can of course exist.

In my simplified understanding, not having to deal with all of this, thanks to a solution that is fully internal to MF (i.e. the TLS cache), is a clear advantage. But I am completely fine if you think these complications are not significant and are worth addressing for the gain of not using TLS.

@slava77 (Contributor) commented May 7, 2020

> Explicit construction of SHP is a marginal use pattern compared to serving it as an ESProduct; but other odd use cases can of course exist.

I expect that not everything needs a migration, for lack of performance benefits.

@slava77 (Contributor) commented May 7, 2020

> As I mentioned, this can clearly be worked around, but we would need no such workarounds with a TLS cache.

IIUC, the cost benefit analysis for TLS in the MF, compared to local caching:

  • pros:
    • the development is much more straightforward and convenient and it directly applies to all use cases
  • cons:
    • while below the TLS limits, there is an expected marginal cost from using TLS
    • once we hit the TLS limit, we cannot run CMSSW at all, and either MF or whatever new external triggered the limit will have to be changed to not use TLS

Unless the last con is misrepresented, I think that it trumps all benefits of convenience.

@namapane (Contributor) commented May 7, 2020

I would agree if we had a figure for what the limit is, how far we are from hitting it, and how significant the addition of TLS for MF is with respect to hitting it. Forgive me for over-simplifying: say we are at 80% of the limit (in whatever relevant metric) and MF adds 5%? Then I agree.
If instead MF adds 0.5%? Well, then the argument seems quite weak to me.
I appreciate that the limit is architecture- and compiler-dependent, so answering the question quantitatively is not straightforward. But as we are speaking of 16 bytes, I would rather bet on the latter scenario than on the former.

Also, nothing prevents moving to external caching the day we figure out that we need to add some new external that makes extensive use of TLS. That is a very hypothetical scenario which may well never happen, so why optimize ahead for it?
And still, the contribution of MF may well be irrelevant even in this scenario: if the new external eats up 50% of the limit, bringing us to hit it, while the contribution of MF is 0.5%, then having it or not makes no difference at all.

That said, I am starting to feel uncomfortable with this discussion; as I already mentioned, if the experts say TLS must be avoided at all costs, I have no objection.

@namapane (Contributor) commented May 7, 2020

> • while below the TLS limits, there is an expected marginal cost from using TLS

For the record, if you refer to the overhead of accessing TLS variables, that is included in @civanch's profiling, which shows it is at least as small as the overhead of external caching. So I think this does not hold as an argument either.
(If instead you just refer to the fact that the TLS cache adds a small contribution towards hitting the TLS limit, my previous comment applies.)

@makortel (Contributor, Author) commented May 7, 2020

We are already hitting limits (on the number of shared objects with TLS) on ARM and POWER (albeit somewhat randomly). The risks of letting the use of TLS (or memory usage that scales with the number of worker threads) spread include missing (large) allocation opportunities at HPC sites, because we would need months of work to get our code running there. By being proactive we try hard to stay as flexible as we can for the unknown future of our computing infrastructure.

@slava77 (Contributor) commented May 7, 2020

> say we are at 80% of the limit (in whatever relevant metric) and MF adds 5%? Then I agree.
> If instead MF adds 0.5%? [...] But as we are speaking of 16 bytes

From the earlier discussion I understood that there is a limit on the count of libraries using TLS, and that is the limit we are hitting on ARM/POWER.

@makortel please clarify on this.

@namapane (Contributor) commented May 7, 2020

Why don't we instead remove the cases where TLS usage is not a sensible choice? Just looking at random places in CMSSW, I come across these (among others):

TrackingTools/TrajectoryFiltering/src/MinPtTrajectoryFilter.cc
TrackingTools/TrajectoryFiltering/src/ThresholdPtTrajectoryFilter.cc
RecoTracker/TkMSParametrization/src/MultipleScatteringParametrisation.cc
TrackingTools/TrajectoryState/interface/ChurnAllocator.h
RecoEgamma/EgammaIsolationAlgos/src/EgammaTowerIsolation.cc
TrackingTools/TrajectoryCleaning/src/TrajectoryCleanerBySharedHits.cc
RecoTracker/CkfPattern/plugins/GroupedCkfTrajectoryBuilder.cc

I find it hard to believe that TLS is appropriate in any of these, and I believe it would make much more sense to invest work in a campaign of fixing these specific pieces of code than to migrate all usage of a fundamental, ubiquitous service like MF to an external cache.
But eventually, it's your choice, I think I have expressed my opinions well enough.

@namapane (Contributor) commented May 7, 2020

> From the earlier discussion I understood that there is a limit on the count of libraries using TLS, and that is the limit we are hitting on ARM/POWER.

BTW, for my own curiosity and education:

  • Several of the thread_locals I see in CMSSW are in header files under interface/. Doesn't that mean that they can potentially spill into different libraries, ending up being counted several times (see the sketch below)? For example, there is one in CommonTools/UtilAlgos/interface/TFileService.h.
  • If the problem is the number of libraries using TLS, wouldn't it be a good idea to move all framework-related TLS variables (e.g. all those in FWCore) into a specific library with appropriate accessor functions?
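Regarding the first bullet, a minimal sketch of the mechanism in question (a hypothetical header; whether TFileService.h matches this exactly depends on the linkage of its thread_local):

```cpp
// Hypothetical header "Gadget.h". A namespace-scope thread_local with
// internal linkage is duplicated in every translation unit that includes
// this header, so every shared library built from such units carries its
// own copy, and each of those libraries acquires its own TLS segment.
#ifndef GADGET_H
#define GADGET_H

namespace gadget {
  // 'static' gives internal linkage: one independent copy per TU.
  static thread_local int callCount = 0;

  inline int nextCall() { return ++callCount; }
}

#endif
```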

@slava77 (Contributor) commented May 7, 2020

> I find it hard to believe that TLS is appropriate in any of these, and I believe it would make much more sense to invest work in a campaign of fixing these specific pieces of code than to migrate all usage of a fundamental, ubiquitous service like MF to an external cache.
> But eventually, it's your choice, I think I have expressed my opinions well enough.

Thank you for your patience in the comments.
I would like to better understand the TLS limit before committing to this solution.

@civanch (Contributor) commented May 7, 2020

Hi all, as I mentioned before, it would be good to discuss numbers; in particular, it would be good to know the HPC numbers for CMSSW.

From my memory, when we introduced Geant4 MT the TLS limit was about 4032 B, while Geant4 10.0 used roughly 10 MB of TLS. The numbers may not be exact, but we realized we had no chance to fit within the limit. Because of that, we started to use the "dynamic" build options, which bring a ~10% slowdown of the simulation. We have reduced TLS usage in Geant4, but we never set up concrete milestones. I do not know the TLS value of Geant4 10.6, but we definitely need numbers and limits for the different hardware.

@slava77 (Contributor) commented May 7, 2020

@civanch
ah, from your comment it sounds like the limit is measured in the actual bytes/memory footprint.
Then the 16 bytes seems like a non-issue.

@makortel (Contributor, Author) commented May 7, 2020

@namapane You are exactly right that we should go over all thread_local use and see how to avoid them.

@slava77

> From the earlier discussion I understood that there is a limit on the count of libraries using TLS, and that is the limit we are hitting on ARM/POWER.

This is correct.

While it is true that the library-count limit could be mitigated by hiding the TLS within the framework, such a move would send the message that it is acceptable to craft data structures whose size scales with the number of (worker) threads instead of the number of concurrent events. For this specific case the overall effect would be tiny, but if we set such a pattern as an example, no one knows how bad the situation will become until it is too late (and fixing it then would require much more work than what we are discussing now).

@civanch Essentially TLS and the stack share the same memory block: the stack grows downwards and TLS upwards. When TLS "goes over the limit", the first symptom would be stack corruption, which in the best case would lead to a crash. I am not aware of any tool that monitors the sizes of the TLS and the stack (I'd expect valgrind to be able to spot the stack corruption, though).

@makortel (Contributor, Author) commented May 7, 2020

That said, how about we look at this from a different angle. As discussed several times, the impact of this particular case on either the TLS size or the number of shared objects with TLS is tiny. Core's main concern is the spread of the use of TLS, or of data structures whose overall size depends on the number of threads instead of the number of concurrent events.

How about we accept the thread_local approach (#29561)? And regarding @slava77's #29561 (comment)

> I'd be happy to stay with thread_local, however if this is acceptable, it should be acceptable in all packages where it's reasonably needed.
> Can we go for this precedent?

would the "reasonably needed" mean something along the lines of "an existing ESProduct needs an internal cache for performance reasons, that ESProduct is used by other existing ESProducts via member data, and the use of all these ESProducts is so widespread that the cost of migrating them all is deemed very high"?

@slava77 (Contributor) commented May 7, 2020

the "reasonably needed" would mean something along "existing ESProduct needs an internal cache for performance reasons, and that ESProduct is used by other existing ESProducts via member data, and the use of all these ESProducts is so widespread that the cost to migrate them all is deemed very costly"?

perhaps this can be proposed for the coding rules.

@makortel (Contributor, Author) commented May 7, 2020

the "reasonably needed" would mean something along "existing ESProduct needs an internal cache for performance reasons, and that ESProduct is used by other existing ESProducts via member data, and the use of all these ESProducts is so widespread that the cost to migrate them all is deemed very costly"?

perhaps this can be proposed for the coding rules.

We don't explain tolerated, non-preferred violations for any other rule, but I agree that an asterisk (for "exceptional use cases where the rule may be violated with good justification") can be added alongside the "avoid thread_local" rule (when I get there).

@slava77 (Contributor) commented May 8, 2020

-1

based on #29631 (comment)

@makortel (Contributor, Author) commented:
Closing in favor of #29561 (see discussion above).

@makortel makortel closed this May 11, 2020