
Introduce a local cache for any caching MagneticField functions, and use them in Oscar(MT)Producer #29631

Closed
wants to merge 3 commits

Conversation

@makortel (Contributor) commented May 4, 2020

PR description:

Following the discussion in #28180, this PR proposes a local cache for the "previous volume" caching in MagGeometry, following the sketch in #28180 (comment), and introduces local::MagneticField as an easy replacement for MagneticField const*.

This PR also migrates Oscar(MT)Producer to local::MagneticField, to enable testing of the most affected use case.
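For illustration, here is a minimal standalone sketch of the pattern. The names MagneticFieldCache and local::MagneticField come from this PR, but all signatures below are illustrative rather than the actual CMSSW interfaces, and the toy cache keys on the last query point whereas the real code caches the previous volume:

```cpp
#include <optional>

struct GlobalPoint { double x = 0, y = 0, z = 0; };
struct GlobalVector { double x = 0, y = 0, z = 0; };

// Stand-in for the CMSSW MagneticField base class.
class MagneticField {
public:
  virtual ~MagneticField() = default;
  virtual GlobalVector inTesla(GlobalPoint const&) const = 0;
};

// Caller-owned cache; deliberately not thread safe, so each thread (or
// EDM stream) owns its own instance.
struct MagneticFieldCache {
  std::optional<GlobalPoint> lastPoint;
  GlobalVector lastValue;
};

namespace local {
  // Cheap value type bundling the field with a cache, meant as a drop-in
  // replacement for passing around MagneticField const*.
  class MagneticField {
  public:
    MagneticField(::MagneticField const* field, MagneticFieldCache* cache)
        : field_(field), cache_(cache) {}

    GlobalVector inTesla(GlobalPoint const& p) const {
      // Toy caching keyed on the previous query point; the real code
      // caches the "previous volume" of the volume-based engine.
      if (cache_->lastPoint && cache_->lastPoint->x == p.x &&
          cache_->lastPoint->y == p.y && cache_->lastPoint->z == p.z) {
        return cache_->lastValue;
      }
      cache_->lastPoint = p;
      cache_->lastValue = field_->inTesla(p);
      return cache_->lastValue;
    }

  private:
    ::MagneticField const* field_;
    MagneticFieldCache* cache_;
  };
}  // namespace local
```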

PR validation:

Limited matrix runs. With one workflow I also tested that the SIM step runs with multiple threads.

@cmsbuild (Contributor) commented May 4, 2020

The code-checks are being triggered in jenkins.

@cmsbuild (Contributor) commented May 4, 2020

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-29631/15010

  • This PR adds an extra 36KB to repository

@cmsbuild (Contributor) commented May 4, 2020

A new Pull Request was created by @makortel (Matti Kortelainen) for master.

It involves the following packages:

MagneticField/Engine
MagneticField/VolumeBasedEngine
SimG4Core/Application
SimG4Core/GeometryProducer
SimG4Core/MagneticField

@perrotta, @cmsbuild, @civanch, @mdhildreth, @slava77 can you please review it and eventually sign? Thanks.
@namapane, @rovere, @fabiocos, @slomeo this is something you requested to watch as well.
@silviodonato, @dpiparo you are the release manager for this.

cms-bot commands are listed here

@makortel (Contributor, Author) commented May 4, 2020

@cmsbuild, please test

@cmsbuild (Contributor) commented May 4, 2020

The tests are being triggered in jenkins.
https://cmssdt.cern.ch/jenkins/job/ib-run-pr-tests/5982/console Started: 2020/05/04 20:17

@makortel (Contributor, Author) commented May 4, 2020

@civanch Could you run the SIM scaling test also for this PR, please?

@cmsbuild (Contributor) commented May 4, 2020

+1
Tested at: b649995
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-df8d5b/5982/summary.html
CMSSW: CMSSW_11_1_X_2020-05-04-1100
SCRAM_ARCH: slc7_amd64_gcc820

@cmsbuild (Contributor) commented May 4, 2020

Comparison job queued.

@cmsbuild (Contributor) commented May 4, 2020

Comparison is ready
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-df8d5b/5982/summary.html

Comparison Summary:

  • No significant changes to the logs found
  • Reco comparison results: 0 differences found in the comparisons
  • DQMHistoTests: Total files compared: 34
  • DQMHistoTests: Total histograms compared: 2696239
  • DQMHistoTests: Total failures: 1
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 2695919
  • DQMHistoTests: Total skipped: 319
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 33 files compared)
  • Checked 147 log files, 16 edm output root files, 34 DQM output files

@civanch (Contributor) commented May 5, 2020

@makortel , @namapane ,

CPU results (events/s) for comparison:

WF "Nthreads" pre6 29561 29631

MInBias 8 1.991 2.072 2.068
MinBias 12 2.683 2.823 2.824
MinBias 16 3.020 3.044 3.187
ttbar 8 0.417 0.448 0.439
ttbar 12 0.606 0.634 0.627
ttbar 16 0.734 0.772 0.765

There is some systematic effect, but in general both #29561 and #29631 provide similar CPU improvements; #29631 is not better than #29561. What I do not understand is why in this PR the cache is used by value and not by reference.

@slava77 (Contributor) commented May 5, 2020

@namapane
please clarify if you have significant concerns about the implementation in this PR.
Thank you.

@slava77 (Contributor) commented May 5, 2020

> @namapane
> please clarify if you have significant concerns about the implementation in this PR.
> Thank you.

curiously, this was posted within seconds of #29631 (comment)

@namapane (Contributor) commented May 5, 2020

Yes but I was quicker. :-)

@namapane (Contributor) commented May 5, 2020

Just to be clear, I am perfectly fine with this implementation if core believes it is technically superior.

I just want to point out that a workaround may be needed to migrate SHP and other odd cases (although only SHP comes to my mind right away), and that it would be disturbing to discover a degradation due to cache fragmentation at a point where going back is no longer an option (although that may not necessarily be the case, and there would probably be workarounds as well).

@slava77 (Contributor) commented May 5, 2020

One thing I am not sure about, and which I mentioned a few times, is the effect of fragmenting the cache, especially for HLT. In the above example we would implicitly decouple the cache for calls from the propagator from that of the rest of the track fit. It could be that this improves the rate of cache hits, or that it degrades it. So there is some risk that is difficult to evaluate without completing the migration.

The (SteppingHelix)propagator caches the volume pointer internally and calls its own inTesla interface. Where a propagation crosses volume boundaries, calling code that also uses the field at about the same time, before calling the propagator or after getting a return value from it, would very likely reuse the same thread_local pointer; it would perhaps similarly (though somewhat less) likely hit the "global" (currently atomic) cache; and it would likely have a cache miss if it implements the caching as in this PR but does not pass the cache to the propagator call.
So, if the migration is to be done, it would have to percolate through all the calls depending on the field in the context of the muon track reco.
OTOH, caching as done in this PR is likely of rather limited benefit in this case: I expect that most of the cost is in the propagation compared to a KF update, and the propagator has already been caching locally for a long time.

@namapane (Contributor) commented May 5, 2020

I agree with your analysis for the specific case of SHP, but my point is that I am not sure the "percolation through all the calls depending on the field in the context of the muon track reco" is really possible; e.g. SteppingHelixPropagatorESProducer.cc creates a propagator with a local:: field carrying its own cache.
This is just an example; what I mean is that it is not simple to tell a priori what the overall effect on the cache hit rate will be if field = pMF.product() is simply replaced with field = local::MagneticField(pMF.product()) everywhere.
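To make the replacement concrete, here is a hedged sketch reusing the toy types from the sketch in the PR description above; the two functions are hypothetical call sites, not code from CMSSW:

```cpp
// Reuses the toy MagneticField/MagneticFieldCache/local::MagneticField
// types from the earlier sketch.
void analyzeBefore(MagneticField const* pMF) {
  MagneticField const* field = pMF;  // caching stays inside the engine
  // ... field->inTesla(...) everywhere downstream ...
}

void analyzeAfter(MagneticField const* pMF) {
  MagneticFieldCache cache;                 // caller-owned, per call site
  local::MagneticField field(pMF, &cache);  // drop-in value replacement
  // ... field.inTesla(...) downstream; cached state is now shared only as
  // far as this 'field' object is explicitly passed along ...
}
```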

@makortel (Contributor, Author) commented May 7, 2020

@namapane I certainly share the pain of the implied code changes, but the argument again comes down to "special case" vs. "generally acceptable pattern".

The local::MagneticField actually may not be placed in ESProducts (like propagators), because the MagneticFieldCache is not thread safe (by design). Instead, the cache needs to be owned e.g. in the local scope of an EDProducer's produce() function, in a member variable of a stream EDModule, or in a StreamCache of a global EDModule.
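As a hedged skeleton of those ownership options (assuming the usual CMSSW stream-module types and token names; none of this is taken from the PR's actual code):

```cpp
// Skeleton only: assumes the CMSSW framework types (edm::stream::EDProducer,
// edm::ESGetToken) and shows where the cache may live so it is never shared.
class MyFieldConsumer : public edm::stream::EDProducer<> {
public:
  void produce(edm::Event& event, edm::EventSetup const& setup) override {
    // Option 1: cache in the local scope of produce(); trivially safe,
    // but the cached volume is forgotten between events.
    MagneticFieldCache localCache;
    local::MagneticField field(&setup.getData(fieldToken_), &localCache);
    // ... use field.inTesla(...) ...
  }

private:
  // Option 2: cache as a member of a stream module; one instance per
  // stream, so still never accessed concurrently, and it survives from
  // one event of the stream to the next.
  MagneticFieldCache streamCache_;
  edm::ESGetToken<MagneticField, IdealMagneticFieldRecord> fieldToken_;
};
```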

I think it is possible to deal with the ESProducts holding a MagneticField (Propagators, NavigationSchool, TrackingRecHitPropagator, TransientTrackBuilder; did I miss any?) with the "local cache design", but it will need a bit more thought.

@namapane (Contributor) commented May 7, 2020

@makortel, I agree that the local-cache solution cannot be adopted in cases like this one, where SHP is created as an ESProduct.
In addition, what I actually wanted to point out with the second link is that we also allow the possibility of EDAnalyzers creating an SHP directly, as for example here or here.
In such a case, if the analyzer is migrated to use local::MagneticField, it will percolate down and eventually reach SHP's constructor, where the dynamic_cast will no longer work.
As I mentioned, this can clearly be worked around, but we would need no such workarounds with a TLS cache.

@namapane (Contributor) commented May 7, 2020

BTW: of course, SHP is protected against the failure of the dynamic_cast that will occur if it is constructed with a local::MagneticField instead of the field itself.
So in this kind of case we would have a change of behavior that is not immediately detectable, neither at compile time (there is no error) nor at runtime. It would lead to a performance loss through the silent deactivation of SHP's internal cache, which would then have to be spotted and debugged.
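A toy illustration of this silent deactivation; all class names below are stand-ins, not the actual SHP code:

```cpp
#include <iostream>

class MagneticField {
public:
  virtual ~MagneticField() = default;
};

// What the propagator's internal cache needs to see (stand-in for the
// volume-based engine).
class VolumeBasedMagneticField : public MagneticField {};

// Any adapter placed between the caller and the engine.
class ForwardingField : public MagneticField {};

class Propagator {  // stand-in for SteppingHelixPropagator
public:
  explicit Propagator(MagneticField const* field)
      : vbField_(dynamic_cast<VolumeBasedMagneticField const*>(field)) {}
  bool internalCacheActive() const { return vbField_ != nullptr; }

private:
  VolumeBasedMagneticField const* vbField_;
};

int main() {
  VolumeBasedMagneticField engine;
  ForwardingField adapter;
  std::cout << Propagator(&engine).internalCacheActive() << '\n';   // 1
  std::cout << Propagator(&adapter).internalCacheActive() << '\n';  // 0:
  // construction succeeds, but the internal cache is silently off.
}
```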

Explicit construction of SHP is a marginal use pattern compared to serving it as an ESProduct; but other odd use cases can of course exist.

In my simplified understanding, not having to deal with all of this, thanks to a solution that is fully internal to MF (i.e. the TLS cache), is a clear advantage. But I am completely fine if you think these complications are not significant and are worth addressing for the gain of not using TLS.

@slava77 (Contributor) commented May 7, 2020

> Explicit construction of SHP is a marginal use pattern compared to serving it as an ESProduct; but other odd use cases can of course exist.

I expect that not everything needs a migration, for lack of performance benefits.

@slava77 (Contributor) commented May 7, 2020

> As I mentioned, this can clearly be worked around, but we would need no such workarounds with a TLS cache.

IIUC, the cost benefit analysis for TLS in the MF, compared to local caching:

  • pros:
    • the development is much more straightforward and convenient and it directly applies to all use cases
  • cons:
    • while below the TLS limits, there is an expected marginal cost from using TLS
    • once we hit the TLS limit, we cannot run CMSSW at all, and either MF or whatever new external triggered the limit will have to be changed to not use TLS

Unless the last con is misrepresented, I think that it trumps all benefits of convenience.

@namapane (Contributor) commented May 7, 2020

I would agree if we had a figure for what the limit is, how far we are from hitting it, and how significant the addition of TLS for MF is with respect to hitting it. Forgive me for over-simplifying: say we are at 80% of the limit (in whatever relevant metric) and MF adds 5%? Then I agree.
If instead MF adds 0.5%? Well, then the argument seems quite weak to me.
I appreciate that the limit is architecture- and compiler-dependent, so answering the question quantitatively is not straightforward. But as we are speaking of 16 bytes, I would rather bet on the latter scenario than on the former.

Also, nothing prevents moving to external caching the day we figure out that we need to add some new external that makes extensive use of TLS. That is a very hypothetical scenario which may well never happen, so why optimize ahead for it?
And still, the contribution of MF may well be irrelevant even in this scenario: if the new external eats up 50% of the limit, bringing us to hit it, while the contribution of MF is 0.5%, then having it or not makes no difference at all.

That said, I am starting to feel uncomfortable with this discussion; as I already mentioned, if the experts say TLS must be avoided at all costs, I have no objection.

@namapane (Contributor) commented May 7, 2020

> • while below the TLS limits, there is an expected marginal cost from using TLS

For the record, if you refer to the overhead of accessing TLS variables, that is included in @civanch's profiling, which shows it is at least as small as the overhead of external caching. So I think this does not hold as an argument either.
(If instead you just refer to the fact that the TLS cache adds a small contribution towards hitting the TLS limit, my previous comment applies.)

@makortel (Contributor, Author) commented May 7, 2020

We are already hitting limits (on the number of shared objects with TLS) on ARM and POWER (albeit somewhat randomly). The risks of letting the use of TLS (or memory usage that scales with the number of worker threads) spread include missing (large) allocation opportunities at HPC sites, because we would need months of work to get our code running there. By being proactive we try hard to stay as flexible as we can for the unknown future of our computing infrastructure.

@slava77 (Contributor) commented May 7, 2020

> say we are at 80% of the limit (in whatever relevant metric) and MF adds 5%? Then I agree.
> If instead MF adds 0.5%? [...] But as we are speaking of 16 bytes

From the earlier discussion I understood that there is a limit on the count of libraries using TLS, and that is the limit we are hitting on ARM/POWER.

@makortel please clarify on this.

@namapane (Contributor) commented May 7, 2020

Why don't we instead remove the cases where TLS usage is not a sensible choice? Just looking at random places in CMSSW, I come across these (among others):

TrackingTools/TrajectoryFiltering/src/MinPtTrajectoryFilter.cc
TrackingTools/TrajectoryFiltering/src/ThresholdPtTrajectoryFilter.cc
RecoTracker/TkMSParametrization/src/MultipleScatteringParametrisation.cc
TrackingTools/TrajectoryState/interface/ChurnAllocator.h
RecoEgamma/EgammaIsolationAlgos/src/EgammaTowerIsolation.cc
TrackingTools/TrajectoryCleaning/src/TrajectoryCleanerBySharedHits.cc
RecoTracker/CkfPattern/plugins/GroupedCkfTrajectoryBuilder.cc

I find it hard to believe that TLS is appropriate in any of these, and I believe it would make much more sense to invest work in a campaign of fixing these specific pieces of code than to migrate all usage of a fundamental, ubiquitous service like MF to an external cache.
But eventually, it's your choice, I think I have expressed my opinions well enough.

@namapane (Contributor) commented May 7, 2020

> From the earlier discussion I understood that there is a limit on the count of libraries using TLS, and that is the limit we are hitting on ARM/POWER.

BTW, for my own curiosity and education:

  • Several of the thread_locals I see in CMSSW are in header files under interface/. Doesn't that mean that they can potentially spill into different libraries, ending up being counted several times (see the sketch below)? For example, there is one in CommonTools/UtilAlgos/interface/TFileService.h.
  • If the problem is the number of libraries using TLS, wouldn't it be a good idea to move all framework-related TLS variables (e.g. all those in FWCore) into a specific library with appropriate accessor functions?
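Regarding the first bullet, a minimal sketch of the mechanism in question (a hypothetical header; whether TFileService.h matches this exactly depends on the linkage of its thread_local):

```cpp
// Hypothetical header "Gadget.h". A namespace-scope thread_local with
// internal linkage is duplicated in every translation unit that includes
// this header, so every shared library built from such units carries its
// own copy, and each of those libraries acquires its own TLS segment.
#ifndef GADGET_H
#define GADGET_H

namespace gadget {
  // 'static' gives internal linkage: one independent copy per TU.
  static thread_local int callCount = 0;

  inline int nextCall() { return ++callCount; }
}

#endif
```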

@slava77 (Contributor) commented May 7, 2020

> I find it hard to believe that TLS is appropriate in any of these, and I believe it would make much more sense to invest work in a campaign of fixing these specific pieces of code than to migrate all usage of a fundamental, ubiquitous service like MF to an external cache.
> But eventually, it's your choice, I think I have expressed my opinions well enough.

Thank you for your patience in the comments.
I would like to better understand the TLS limit before committing to this solution.

@civanch (Contributor) commented May 7, 2020

Hi all, as I mentioned before, it would be good to discuss numbers; in particular, it would be good to know the HPC numbers for CMSSW.

From my memory, when we introduced Geant4 MT the TLS limit was about 4032 B, while Geant4 10.0 used roughly 10 MB of TLS. The numbers may not be exact, but we realized we had no chance to fit within the limit. Because of that, we started to use the "dynamic" build options, which bring a ~10% slowdown of the simulation. We have reduced TLS usage in Geant4, but we never set up concrete milestones. I do not know the TLS value of Geant4 10.6, but we definitely need numbers and limits for the different hardware.

@slava77 (Contributor) commented May 7, 2020

@civanch
ah, from your comment it sounds like the limit is measured in the actual bytes/memory footprint.
Then the 16 bytes seems like a non-issue.

@makortel (Contributor, Author) commented May 7, 2020

@namapane You are exactly right that we should go over all thread_local use and see how to avoid them.

@slava77

> From the earlier discussion I understood that there is a limit on the count of libraries using TLS, and that is the limit we are hitting on ARM/POWER.

This is correct.

While it is true that the library-count limit could be mitigated by hiding the TLS within the framework, such a move would send the message that it is acceptable to craft data structures whose size scales with the number of (worker) threads instead of the number of concurrent events. For this specific case the overall effect would be tiny, but if we set such a pattern as an example, no one knows how bad the situation will become until it is too late (and fixing it then would require much more work than what we are discussing now).

@civanch Essentially TLS and the stack share the same memory block: the stack grows downwards and TLS upwards. When TLS "goes over the limit", the first symptom would be stack corruption, which in the best case would lead to a crash. I am not aware of any tool that monitors the sizes of the TLS and the stack (I'd expect valgrind to be able to spot the stack corruption, though).

@makortel (Contributor, Author) commented May 7, 2020

That said, how about we look at this from a different angle. As discussed several times, the impact of this particular case on either the TLS size or the number of shared objects with TLS is tiny. Core's main concern is the spread of the use of TLS, or of data structures whose overall size depends on the number of threads instead of the number of concurrent events.

How about we accept the thread_local approach (#29561)? And regarding @slava77's #29561 (comment)

> I'd be happy to stay with thread_local, however if this is acceptable, it should be acceptable in all packages where it's reasonably needed.
> Can we go for this precedent?

would the "reasonably needed" mean something along the lines of "an existing ESProduct needs an internal cache for performance reasons, that ESProduct is used by other existing ESProducts via member data, and the use of all these ESProducts is so widespread that the cost of migrating them all is deemed very high"?

@slava77 (Contributor) commented May 7, 2020

the "reasonably needed" would mean something along "existing ESProduct needs an internal cache for performance reasons, and that ESProduct is used by other existing ESProducts via member data, and the use of all these ESProducts is so widespread that the cost to migrate them all is deemed very costly"?

perhaps this can be proposed for the coding rules.

@makortel (Contributor, Author) commented May 7, 2020

the "reasonably needed" would mean something along "existing ESProduct needs an internal cache for performance reasons, and that ESProduct is used by other existing ESProducts via member data, and the use of all these ESProducts is so widespread that the cost to migrate them all is deemed very costly"?

perhaps this can be proposed for the coding rules.

We don't explain tolerated, non-preferred violations for any other rule, but I agree that an asterisk (for "exceptional use cases where the rule may be violated with good justification") can be added alongside the "avoid thread_local" rule (when I get there).

@slava77 (Contributor) commented May 8, 2020

-1

based on #29631 (comment)

@makortel (Contributor, Author) commented:
Closing in favor of #29561 (see discussion above).

@makortel makortel closed this May 11, 2020