avoid redundant computations in StripCPE #14344

Merged
merged 4 commits into from May 7, 2016

Conversation

VinInn
Contributor

@VinInn VinInn commented May 3, 2016

Major speed-up in the computation of the Strip CPE during pattern recognition, obtained by avoiding redundant computations of quantities that depend only on the track and the detector.
The net result is about a 10% speed-up for tracking in TTBAR PU35.

No regression expected.
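
As a rough illustration of the idea (not the actual CMSSW code), the sketch below computes the quantities that depend only on the track and the detector once, then reuses them for each cluster. `DetParam`, `TrackDir`, `makeAlgoParam`, `stripPosition`, `positions` and the simplified geometry are all assumptions made for this example; only the name `AlgoParam` is borrowed from the PR.

```cpp
#include <cmath>
#include <vector>

// Hypothetical stand-ins for the detector and track description.
struct DetParam { float thickness; float pitch; };
struct TrackDir { float x, y, z; };

// Quantities that depend only on the detector and the track.
struct AlgoParam {
  float projection;  // track path across the sensor, in units of strips
  float corr;        // position correction shared by all clusters
};

// Computed once per (detector, track) pair.
inline AlgoParam makeAlgoParam(const DetParam& det, const TrackDir& dir) {
  // Scale the direction so that it spans the full sensor thickness.
  const float scale = -det.thickness / dir.z;
  const float projection = std::abs(dir.x * scale) / det.pitch;
  return AlgoParam{projection, 0.5f * projection};
}

// Per-cluster work: only the cheap part remains.
inline float stripPosition(float barycenter, const AlgoParam& ap) {
  return barycenter + ap.corr;
}

inline std::vector<float> positions(const DetParam& det, const TrackDir& dir,
                                    const std::vector<float>& barycenters) {
  const AlgoParam ap = makeAlgoParam(det, dir);  // hoisted out of the loop
  std::vector<float> out;
  out.reserve(barycenters.size());
  for (float b : barycenters) out.push_back(stripPosition(b, ap));
  return out;
}
```

The design point is simply that the division and the projection are paid once per track and detector rather than once per cluster.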

@cmsbuild
Contributor

cmsbuild commented May 3, 2016

A new Pull Request was created by @VinInn (Vincenzo Innocente) for CMSSW_8_1_X.

It involves the following packages:

DataFormats/Common
RecoLocalTracker/ClusterParameterEstimator
RecoLocalTracker/SiStripRecHitConverter
RecoTracker/MeasurementDet

@smuzaffar, @Dr15Jones, @cvuosalo, @cmsbuild, @slava77, @davidlange6 can you please review it and eventually sign? Thanks.
@ghellwig, @wmtan, @makortel, @forthommel, @yduhm, @GiacomoSguazzoni, @gbenelli, @rovere, @VinInn, @nickmccoll, @mschrode, @jlagram, @wddgit, @cerati, @gpetruc, @OlivierBondu, @threus, @dgulhan this is something you requested to watch as well.
@slava77, @Degano, @smuzaffar you are the release manager for this.

cms-bot commands are listed here #13028

@VinInn
Contributor Author

VinInn commented May 3, 2016

@cmsbuild , please test

@cmsbuild
Contributor

cmsbuild commented May 3, 2016

The tests are being triggered in jenkins.
https://cmssdt.cern.ch/jenkins/job/ib-any-integration/12781/console

@cmsbuild
Contributor

cmsbuild commented May 3, 2016

@@ -96,6 +96,9 @@ namespace edmNew {
return edm::Ref<typename HandleT::element_type, typename HandleT::element_type::value_type::value_type>( handle.id(), ci, ci - &(container().front()) );
}

unsigned int makeKeyOf(const_iterator ci) const {
return ci - &(container().front());
Contributor

Why not ci - container().begin();? That would seem to me to be easier for a person to understand.

Contributor

Did you have to use front because const_iterator in this class is a pointer and not a std::vector<...>::const_iterator?

Contributor Author

To make sure I am using pointers and not iterators.
It is consistent with the implementation a few lines above.
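
To make the point about pointers versus iterators concrete, here is a minimal sketch of a container whose `const_iterator` is a raw pointer. `FlatStore` is a hypothetical class written for this example, not the `edmNew` container; it only mirrors the `makeKeyOf` expression from the diff.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical miniature container: const_iterator is a raw pointer.
class FlatStore {
public:
  using value_type = int;
  using const_iterator = const value_type*;

  explicit FlatStore(std::vector<value_type> data) : data_(std::move(data)) {}

  const_iterator begin() const { return data_.data(); }
  const_iterator end() const { return data_.data() + data_.size(); }

  // Pointer minus pointer. Writing `ci - data_.begin()` instead would mix a
  // raw pointer with a std::vector iterator, which is not guaranteed to
  // compile because the two are distinct types.
  unsigned int makeKeyOf(const_iterator ci) const {
    return static_cast<unsigned int>(ci - &data_.front());
  }

private:
  std::vector<value_type> data_;
};
```

With this layout, `makeKeyOf(begin() + 2)` yields the index 2 of the element in the underlying storage.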

@cmsbuild
Contributor

cmsbuild commented May 3, 2016

@Dr15Jones
Contributor

+1

@cvuosalo
Contributor

cvuosalo commented May 4, 2016

Timing test in progress...

SiStripDetId::SubDetector loc = SiStripDetId( det.geographicalId() ).subDetector();

LocalVector track = ltp.momentum();
track *= -p.thickness/track.z();
Contributor

Is this division safe? Could track.z() be zero? We've just been fixing some bugs caused by an invalid TrajectoryStateOnSurface. There might be more.

Contributor Author

It was like this before.
In any case, things should be checked in advance, not this deep in the code.
Could you please remind me which bug you are referring to?

Contributor

@VinInn: I was thinking of #14306, which is not related to the code in this PR, but fixes a bug with a bad TrajectoryStateOnSurface. I have a faint recollection of seeing similar issues previously.

How many paths through the code lead to line 61 above? If you have an idea of how to find them all, we could check that the proper validation is being done in advance. But I would prefer bullet-proof code that never has a possibility to propagate NANs and nonsense values.

Contributor Author

@VinInn VinInn May 6, 2016

I think we need to have a more in-depth discussion about "FPE" and NaN, particularly in the context of vectorization and the future move to vector hardware. We cannot afford to protect every single operation. The currently accepted wisdom is to let NaN propagate and trap it at a very high level. Maybe we need to invite an expert to give us a lecture on how one manages this type of issue in HPC.
By the way, #14306 has nothing to do with NaN or "FPE"; it is just a trajectory not reaching the target.
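
A minimal sketch of the "let it propagate, trap it high up" approach, assuming IEEE 754 arithmetic with floating-point exceptions not trapped and no fast-math flags. `scaleFactors` and `allFinite` are hypothetical helpers for this example, not CMSSW functions.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Inner loop: no per-element protection, so it stays easy to vectorize.
// Under IEEE 754, a zero z produces an infinity rather than a crash.
inline std::vector<float> scaleFactors(float thickness,
                                       const std::vector<float>& z) {
  std::vector<float> out(z.size());
  for (std::size_t i = 0; i < z.size(); ++i) out[i] = -thickness / z[i];
  return out;
}

// High-level trap: one validation pass before the result is stored.
inline bool allFinite(const std::vector<float>& values) {
  for (float v : values) {
    if (!std::isfinite(v)) return false;
  }
  return true;
}
```

A caller would run `allFinite` once on the final product and discard or flag it on failure, instead of guarding each division.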

Contributor

On 5/6/16 8:44 AM, Vincenzo Innocente wrote:

In RecoLocalTracker/SiStripRecHitConverter/interface/StripCPE.h
#14344 (comment):

+  SiStripDetId::SubDetector loc; float afullProjection; float corr;
+  };
+  virtual StripClusterParameterEstimator::LocalValues
+  localParameters( const SiStripCluster& cl, AlgoParam const & ap) const {
+    return std::make_pair(LocalPoint(), LocalError());
+  }
+  AlgoParam getAlgoParam(const GeomDetUnit& det, const LocalTrajectoryParameters & ltp) const {
+    StripCPE::Param const & p = param(det);
+    SiStripDetId::SubDetector loc = SiStripDetId( det.geographicalId() ).subDetector();
+    LocalVector track = ltp.momentum();
+    track *= -p.thickness/track.z();

I think we need to have a more in depth discussion about "FPE" and NaN
particularly in the context of vectorization and the future move to
vector hardware. We cannot afford to protect each single operation. The
current accepted wisdom is to let NaN to propagate and trap it at very
high level.

"A very high level" better be the output of the module or at worst
output of a sequence of modules
that has consumable output downstream.

The problem is more complex for utility/tools which are used in many places.
A fraction of users of the code may not care about execution speed much.
Maybe for that case it's practical to add the checks to avoid FPE
(template to hand over FPE-safe and unsafe interface?).

Maybe we need to invite an expert and give us a lecture on
how one manage this type of issues in HPC.

Do you have a name in mind?
Maybe send a suggestion by email.



Contributor Author

@VinInn VinInn May 6, 2016

"Very high level" means before storing in the event.
Whatever happens in the meantime is irrelevant (including FPEs).

In any case this code was like this before, so this is not the place to argue about it.

We need to investigate. Somebody from Intel or NVidia or a supercomputing center...
In any case it is something that goes beyond CMS.

@cvuosalo
Contributor

cvuosalo commented May 5, 2016

@VinInn: You said no regressions expected, but my test shows numerous small differences. I ran workflow 50202.0_TTbar_13+TTbar_13INPUT+DIGIUP15_PU50 with 70 events against baseline CMSSW_8_1_X_2016-04-29-2300. The most notable differences I found are below. Are they to be expected?

[Comparison plots: tpfrac, hits]

@cvuosalo
Contributor

cvuosalo commented May 5, 2016

CPU timing measurements confirm the expected 10% improvement in timing. A test of 50202.0 as described above shows reduced times for tracking modules:

  delta/mean delta/orJob     original                   new       module name
  ---------- ------------     --------                  ----       ------------
   -0.093260      -0.01%        57.92 ms/ev ->        52.76 ms/ev jetCoreRegionalStepTrackCandidates
   -0.060842      -0.10%       611.34 ms/ev ->       575.24 ms/ev pixelPairStepTrackCandidates
   -0.057358      -0.05%       347.44 ms/ev ->       328.07 ms/ev lowPtTripletStepTrackCandidates
   -0.056272      -0.07%       482.71 ms/ev ->       456.29 ms/ev detachedTripletStepTrackCandidates
   -0.055094      -0.09%       649.80 ms/ev ->       614.96 ms/ev initialStepTrackCandidatesPreSplitting

(Times exclude the first event.)

By another measure:

    detachedTripletStepTrackCandidates   343.364 ms/ev -> 325.338 ms/ev
    lowPtTripletStepTrackCandidates      247.207 ms/ev -> 235.246 ms/ev
    pixelPairStepTrackCandidates     434.93 ms/ev -> 408.53 ms/ev
    jetCoreRegionalStepTrackCandidates   42.6031 ms/ev -> 38.6807 ms/ev
     Total in detailed printout:     1202.83 ms/ev -> 1146.98 ms/ev      delta: -55.8407
 Total times: 5536.21 ms/ev -> 5452.16 ms/ev     delta: -84.0479

@cvuosalo
Contributor

cvuosalo commented May 5, 2016

The Jenkins test results show a large number of tiny differences in both the DQM and the alternative comparisons for many workflows.

@cvuosalo
Contributor

cvuosalo commented May 6, 2016

A test of workflow 25202.0_TTbar_13 with 70 events against baseline CMSSW_8_1_X_2016-04-29-2300 also shows about 10% faster CPU time for tracking modules.

  delta/mean delta/orJob     original                   new       module name
  ---------- ------------     --------                  ----       ------------
   -0.217048      -0.03%        88.88 ms/ev ->        71.48 ms/ev jetCoreRegionalStepTrackCandidates
   -0.125561      -0.09%       418.39 ms/ev ->       368.96 ms/ev lowPtTripletStepTrackCandidates
   -0.120764      -0.11%       537.63 ms/ev ->       476.40 ms/ev detachedTripletStepTrackCandidates
   -0.118866      -0.17%       832.45 ms/ev ->       739.05 ms/ev pixelPairStepTrackCandidates
   -0.102000      -0.11%       598.20 ms/ev ->       540.15 ms/ev initialStepTrackCandidates
   -0.098286      -0.11%       615.13 ms/ev ->       557.51 ms/ev initialStepTrackCandidatesPreSplitting
   -0.085970      -0.03%       219.43 ms/ev ->       201.34 ms/ev tobTecStepTrackCandidates
   -0.080878      -0.06%       422.54 ms/ev ->       389.70 ms/ev pixelLessStepTrackCandidates
   -0.070235      -0.00%        30.78 ms/ev ->        28.69 ms/ev mixedTripletStepTrackCandidates

(Times exclude first event)

Second measure:

        initialStepTrackCandidates       422.891 ms/ev -> 381.62 ms/ev
        detachedTripletStepTrackCandidates       378.803 ms/ev -> 335.665 ms/ev
        lowPtTripletStepTrackCandidates          294.53 ms/ev -> 261.284 ms/ev
        pixelPairStepTrackCandidates     587.988 ms/ev -> 520.341 ms/ev
        pixelLessStepTrackCandidates     297.141 ms/ev -> 274.214 ms/ev
        tobTecStepTrackCandidates        153.762 ms/ev -> 141.073 ms/ev
        jetCoreRegionalStepTrackCandidates       64.3726 ms/ev -> 51.8902 ms/ev
         Total in detailed printout:     2785.71 ms/ev -> 2556.17 ms/ev          delta: -229.534
 Total times: 5966.04 ms/ev -> 5733.31 ms/ev     delta: -232.735

There are also numerous tiny differences from the baseline. Unlike workflow 50202.0, these differences seem to generally be improvements. Two examples are shown below.
[Plot: eff]

One misidentified track removed:
[Plot: misid]

@VinInn
Contributor Author

VinInn commented May 6, 2016

Regarding the fraction of true pt, I will check with @makortel.
The TOB hit difference looks like a migration between two contiguous bins.
I cannot exclude differences due to numerics or rare branches (the code in question has more than 8 possible paths...).

@VinInn
Contributor Author

VinInn commented May 6, 2016

@cvuosalo, I can reproduce the observed differences with this small change in the current IB:

diff --git a/RecoLocalTracker/SiStripRecHitConverter/src/StripCPEfromTrackAngle.cc b/RecoLocalTracker/SiStripRecHitConverter/src/StripCPEfromTrackAngle.cc
index ca02f45..9d16b5f 100644
--- a/RecoLocalTracker/SiStripRecHitConverter/src/StripCPEfromTrackAngle.cc
+++ b/RecoLocalTracker/SiStripRecHitConverter/src/StripCPEfromTrackAngle.cc
@@ -76,8 +76,9 @@ StripCPEfromTrackAngle::localParameters( const SiStripCluster& cluster, const Ge
       break;
   }

-  const float strip = cluster.barycenter() -  0.5f*(1.f-p.backplanecorrection) * fullProjection
-    + 0.5f*p.coveredStrips(track, ltp.position());
+  const float corr = -  0.5f*(1.f-p.backplanecorrection) * fullProjection
+                      + 0.5f*p.coveredStrips(track, ltp.position());
+  const float strip = cluster.barycenter() + corr;

   return std::make_pair( p.topology->localPosition(strip, ltp.vector()),
                         p.topology->localError(strip, uerr2, ltp.vector()) );

We may even claim that the modified code is more precise than the original one.
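
The reason such a regrouping can change results is that floating-point addition is not associative. The sketch below isolates the two groupings from the diff, with the two correction terms reduced to plain arguments `a` and `b` (a simplification made for this example). It assumes IEEE 754 single precision evaluated without extended precision or fast-math flags.

```cpp
// Original grouping: (barycenter - a) + b.
inline float stripOriginal(float barycenter, float a, float b) {
  return barycenter - a + b;
}

// Regrouped as in the diff: the correction is formed first, then added.
inline float stripRegrouped(float barycenter, float a, float b) {
  const float corr = -a + b;
  return barycenter + corr;
}
```

With deliberately exaggerated inputs such as `barycenter = 1e8f`, `a = 1e8f`, `b = 1.f`, the first form returns 1 and the second returns 0, because `-1e8f + 1.f` rounds back to `-1e8f`. With realistic strip values the two forms differ at most in the last bits, which matches the tiny differences seen in the comparisons.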

@cvuosalo
Contributor

cvuosalo commented May 6, 2016

The test for workflow 25202 (described above) also shows a roughly 5-10% improvement for tracking modules in the HLT step:

  delta/mean delta/orJob     original                   new       module name
  ---------- ------------     --------                  ----       ------------
   -0.101400      -0.01%        32.34 ms/ev ->        29.22 ms/ev hltMuCkfTrackCandidates
   -0.101288      -0.00%        10.09 ms/ev ->         9.11 ms/ev hltIter2PFlowCkfTrackCandidatesForBTag
   -0.099339      -0.00%        14.79 ms/ev ->        13.39 ms/ev hltIter1ElectronsCkfTrackCandidates
   -0.098992      -0.00%        12.32 ms/ev ->        11.16 ms/ev hltIter1PFlowCkfTrackCandidatesForBTag
   -0.097608      -0.01%        42.12 ms/ev ->        38.20 ms/ev hltIter1PFlowCkfTrackCandidates
   -0.090038      -0.00%        34.76 ms/ev ->        31.77 ms/ev hltIter2PFlowCkfTrackCandidates
   -0.086561      -0.00%         8.46 ms/ev ->         7.76 ms/ev hltIter0PFlowCkfTrackCandidatesForBTag
   -0.085895      -0.00%        20.45 ms/ev ->        18.77 ms/ev hltIter2PFlowCkfTrackCandidatesForTau
   -0.085860      -0.00%        15.10 ms/ev ->        13.85 ms/ev hltIter0ElectronsCkfTrackCandidates
   -0.082601      -0.00%        12.10 ms/ev ->        11.14 ms/ev hltIter2PFlowCkfTrackCandidatesForPhotons
   -0.075333      -0.00%        21.76 ms/ev ->        20.18 ms/ev hltEgammaCkfTrackCandidatesForGSF
   -0.075186      -0.00%        37.86 ms/ev ->        35.12 ms/ev hltIter0PFlowCkfTrackCandidates
   -0.072771      -0.00%        28.09 ms/ev ->        26.11 ms/ev hltIter1PFlowCkfTrackCandidatesForTau
   -0.070787      -0.00%        23.03 ms/ev ->        21.46 ms/ev hltIter0PFlowCkfTrackCandidatesForTau
   -0.068314      -0.00%        20.28 ms/ev ->        18.94 ms/ev hltIter0PFlowCkfTrackCandidatesForPhotons
   -0.064093      -0.01%        94.91 ms/ev ->        89.02 ms/ev hltIter2ElectronsCkfTrackCandidates
   -0.063713      -0.01%        70.50 ms/ev ->        66.15 ms/ev hltEgammaCkfTrackCandidatesForGSFUnseeded
   -0.053430      -0.01%        93.81 ms/ev ->        88.93 ms/ev hltDisplacedhltIter4PFlowCkfTrackCandidates

@cvuosalo
Contributor

cvuosalo commented May 6, 2016

+1

For #14344 e77be1a

Speeding up Strip CPE computations.

The code changes are satisfactory. Jenkins tests against baseline CMSSW_8_1_X_2016-05-02-2300 show no significant differences but do show numerous tiny differences expected for the code changes. Extended tests of workflows 25202 and 50202 discussed above show similar tiny changes and confirm that a 10% speed-up of affected tracking modules has been achieved.

@cmsbuild
Contributor

cmsbuild commented May 6, 2016

This pull request is fully signed and it will be integrated in one of the next CMSSW_8_1_X IBs (tests are also fine). This pull request requires discussion in the ORP meeting before it's merged. @slava77, @davidlange6, @Degano, @smuzaffar

@davidlange6
Contributor

+1

@cmsbuild cmsbuild merged commit e64655b into cms-sw:CMSSW_8_1_X May 7, 2016
cmsbuild added a commit that referenced this pull request Aug 26, 2016
backport of #13448: small performance improvement in Tracking, and #14344: avoid redundant computations in StripCPE