avoid redundant computations in StripCPE #14344

VinInn · 2016-05-03T10:13:49Z

Major speed-up of the Strip CPE computation during pattern recognition, obtained by avoiding the redundant recomputation of quantities that depend only on the track and the detector.
The net result is about a 10% speed-up for tracking in TTBAR PU35.

No regression expected.
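
The idea described above can be sketched as follows. This is a simplified illustration, not the actual CMSSW code: `AlgoParam` and `getAlgoParam` are names that appear in the reviewed header, but `DetParam`, `TrackDir`, `Cluster`, `stripPosition`, `positions` and the formula itself are hypothetical stand-ins chosen to show the hoisting of track-and-detector-only work out of the per-cluster loop.

```cpp
#include <vector>

// Hypothetical stand-ins; the real CMSSW classes are much richer.
struct DetParam { float thickness; float pitch; };  // per-detector constants
struct TrackDir { float x, y, z; };                 // local track direction
struct Cluster  { float barycenter; };              // per-cluster input

// Everything that depends only on (detector, track), computed once.
struct AlgoParam {
  float projection;  // track projection across the sensor, in units of strips
};

inline AlgoParam getAlgoParam(const DetParam& det, const TrackDir& trk) {
  // Scale the direction so that it spans the sensor thickness.
  // No guard against trk.z == 0, mirroring the code under review.
  const float scale = -det.thickness / trk.z;
  return AlgoParam{ (trk.x * scale) / det.pitch };
}

// Per-cluster work: only uses the precomputed AlgoParam (illustrative formula).
inline float stripPosition(const Cluster& cl, const AlgoParam& ap) {
  return cl.barycenter - 0.5f * ap.projection;
}

inline std::vector<float> positions(const DetParam& det, const TrackDir& trk,
                                    const std::vector<Cluster>& clusters) {
  const AlgoParam ap = getAlgoParam(det, trk);  // hoisted out of the loop
  std::vector<float> out;
  out.reserve(clusters.size());
  for (const Cluster& cl : clusters) out.push_back(stripPosition(cl, ap));
  return out;
}
```

The division and the projection are evaluated once per (detector, track) pair rather than once per cluster, which is where a saving of this kind would come from.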

cmsbuild · 2016-05-03T10:14:18Z

A new Pull Request was created by @VinInn (Vincenzo Innocente) for CMSSW_8_1_X.

It involves the following packages:

DataFormats/Common
RecoLocalTracker/ClusterParameterEstimator
RecoLocalTracker/SiStripRecHitConverter
RecoTracker/MeasurementDet

@smuzaffar, @Dr15Jones, @cvuosalo, @cmsbuild, @slava77, @davidlange6 can you please review it and eventually sign? Thanks.
@ghellwig, @wmtan, @makortel, @forthommel, @yduhm, @GiacomoSguazzoni, @gbenelli, @rovere, @VinInn, @nickmccoll, @mschrode, @jlagram, @wddgit, @cerati, @gpetruc, @OlivierBondu, @threus, @dgulhan this is something you requested to watch as well.
@slava77, @Degano, @smuzaffar you are the release manager for this.

cms-bot commands are listed here #13028

VinInn · 2016-05-03T10:14:28Z

@cmsbuild , please test

cmsbuild · 2016-05-03T10:14:38Z

The tests are being triggered in jenkins.
https://cmssdt.cern.ch/jenkins/job/ib-any-integration/12781/console

cmsbuild · 2016-05-03T13:42:27Z

+1
Tested at: e77be1a
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-14344/12781/summary.html

Dr15Jones · 2016-05-03T13:55:09Z

DataFormats/Common/interface/DetSetNew.h

@@ -96,6 +96,9 @@ namespace edmNew {
      return edm::Ref<typename HandleT::element_type, typename HandleT::element_type::value_type::value_type>( handle.id(), ci, ci - &(container().front()) );
    }

+    unsigned int makeKeyOf(const_iterator ci) const {
+      return  ci - &(container().front());


Why not ci - container().begin();? That would seem to me to be easier for a person to understand.

Did you have to use front because const_iterator in this class is a pointer and not a std::vector<...>::const_iterator?

VinInn · 2016-05-03

To make sure I am using pointers and not iterators.
It is consistent with the implementation a few lines above.
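
The point about pointers versus iterators can be illustrated with a small sketch. `FlatStore` is a hypothetical miniature, not the real `edmNew::DetSet`; it only assumes, as the exchange above suggests, that `const_iterator` is a raw pointer while the underlying storage is a `std::vector`.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical miniature of a container whose const_iterator is a raw pointer.
template <typename T>
class FlatStore {
public:
  using const_iterator = const T*;

  explicit FlatStore(std::vector<T> data) : data_(std::move(data)) {}

  const_iterator begin() const { return data_.data(); }
  const_iterator end() const { return data_.data() + data_.size(); }

  // Index of the element that ci points to, by pointer difference.
  // &data_.front() is a const T*, so both operands are raw pointers.
  // Precondition: the store is non-empty and ci points into it.
  unsigned int makeKeyOf(const_iterator ci) const {
    assert(!data_.empty() && ci >= begin() && ci < end());
    return static_cast<unsigned int>(ci - &data_.front());
  }

private:
  std::vector<T> data_;
};
```

Writing `ci - data_.begin()` instead would subtract a `std::vector` iterator from a raw pointer, which is not guaranteed to compile; taking the address of `front()` keeps both operands as pointers.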

cmsbuild · 2016-05-03T15:20:18Z

Comparison is ready
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-14344/12781/summary.html

Dr15Jones · 2016-05-03T15:28:30Z

+1

cvuosalo · 2016-05-04T22:15:09Z

Timing test in progress...

cvuosalo · 2016-05-04T22:47:29Z

RecoLocalTracker/SiStripRecHitConverter/interface/StripCPE.h

+    SiStripDetId::SubDetector loc = SiStripDetId( det.geographicalId() ).subDetector();  
+
+    LocalVector track = ltp.momentum();
+    track *= -p.thickness/track.z();


Is this division safe? Could track.z() be zero? We've just been fixing some bugs caused by an invalid TrajectoryStateOnSurface. There might be more.

VinInn · 2016-05-05

It was like this before.
In any case, things should be checked in advance, not so deep in the code.
Could you please remind me which bug you are referring to?

cvuosalo · 2016-05-05

@VinInn: I was thinking of #14306, which is not related to the code in this PR, but fixes a bug with a bad TrajectoryStateOnSurface. I have a faint recollection of seeing similar issues previously.

How many paths through the code lead to line 61 above? If you have an idea of how to find them all, we could check that the proper validation is being done in advance. But I would prefer bullet-proof code that never has a possibility to propagate NaNs and nonsense values.

VinInn · 2016-05-06

I think we need a more in-depth discussion about "FPE" and NaN, particularly in the context of vectorization and the future move to vector hardware. We cannot afford to protect every single operation. The currently accepted wisdom is to let NaN propagate and trap it at a very high level. Maybe we need to invite an expert to give us a lecture on how one manages this type of issue in HPC.
By the way, #14306 has nothing to do with NaN or "FPE"; it is just a trajectory not reaching the target.

slava77 · 2016-05-06

On 5/6/16 8:44 AM, Vincenzo Innocente wrote:

> In RecoLocalTracker/SiStripRecHitConverter/interface/StripCPE.h
> #14344 (comment):
>
>       SiStripDetId::SubDetector loc; float afullProjection; float corr;
>     };
>
>     virtual StripClusterParameterEstimator::LocalValues
>     localParameters( const SiStripCluster& cl, AlgoParam const & ap) const {
>       return std::make_pair(LocalPoint(), LocalError());
>     }
>
>     AlgoParam getAlgoParam(const GeomDetUnit& det, const LocalTrajectoryParameters & ltp) const {
>       StripCPE::Param const & p = param(det);
>       SiStripDetId::SubDetector loc = SiStripDetId( det.geographicalId() ).subDetector();
>       LocalVector track = ltp.momentum();
>       track *= -p.thickness/track.z();
>
> I think we need to have a more in depth discussion about "FPE" and NaN
> particularly in the context of vectorization and the future move to
> vector hardware. We cannot afford to protect each single operation. The
> current accepted wisdom is to let NaN to propagate and trap it at very
> high level.

"A very high level" better be the output of the module or at worst
output of a sequence of modules that has consumable output downstream.

The problem is more complex for utility/tools which are used in many places.
A fraction of users of the code may not care about execution speed much.
Maybe for that case it's practical to add the checks to avoid FPE
(template to hand over FPE-safe and unsafe interface?).

> Maybe we need to invite an expert and give us a lecture on
> how one manage this type of issues in HPC.

Do you have a name in mind?
Maybe send a suggestion by email.
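
The suggestion of a template offering both an FPE-safe and an unsafe interface could look roughly like the sketch below. `FpePolicy` and `thicknessScale` are hypothetical names invented for illustration; nothing like this is part of the PR. The sketch assumes IEEE 754 arithmetic with floating-point traps disabled, so that a division by zero yields an infinity instead of raising a signal.

```cpp
#include <cmath>

// Hypothetical compile-time switch between a fast and a guarded variant.
enum class FpePolicy { Unchecked, Checked };

// Computes -thickness / z.
// Unchecked: plain division; z == 0 gives +-inf (or NaN for 0/0) under IEEE 754.
// Checked:   returns `fallback` whenever the result is not finite.
template <FpePolicy P>
inline float thicknessScale(float thickness, float z, float fallback = 0.f) {
  const float r = -thickness / z;
  // P is a compile-time constant, so the optimizer can drop this test
  // entirely in the Unchecked instantiation.
  if (P == FpePolicy::Checked && !std::isfinite(r)) return fallback;
  return r;
}
```

Callers on a hot path would instantiate `thicknessScale<FpePolicy::Unchecked>`, while callers that care more about robustness than speed would pick `FpePolicy::Checked`.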

VinInn · 2016-05-06

"Very high level" means before storing in the event.
Whatever happens in the meantime is irrelevant (including FPEs).

In any case, this code was like this before, so this is not the place to argue about it.

We need to investigate: somebody from Intel or NVidia or a supercomputing center...
In any case, it is something that goes beyond CMS.
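
The "let NaN propagate and trap it at a high level" approach discussed in this thread can be sketched as below. All names (`Hit`, `makeHit`, `keepFinite`) and the arithmetic are hypothetical and only illustrate the pattern; the sketch assumes IEEE 754 arithmetic with floating-point traps disabled.

```cpp
#include <cmath>
#include <vector>

struct Hit { float x; float err; };  // hypothetical output product

// Inner computation: no per-operation protection, non-finite values propagate.
inline Hit makeHit(float thickness, float dirX, float dirZ) {
  const float scale = -thickness / dirZ;  // +-inf if dirZ == 0
  const float x = dirX * scale;           // 0 * inf gives NaN
  return Hit{x, std::fabs(x) * 0.1f};
}

// "High level" trap: one validation pass right before the products are stored.
inline std::vector<Hit> keepFinite(const std::vector<Hit>& hits) {
  std::vector<Hit> out;
  out.reserve(hits.size());
  for (const Hit& h : hits) {
    if (std::isfinite(h.x) && std::isfinite(h.err)) out.push_back(h);
  }
  return out;
}
```

The inner function stays branch-free and therefore friendlier to vectorization, and the cost of validation is paid once per output object instead of once per arithmetic operation.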

cvuosalo · 2016-05-05T19:52:20Z

@VinInn: You said no regressions expected, but my test shows numerous small differences. I ran workflow 50202.0_TTbar_13+TTbar_13INPUT+DIGIUP15_PU50 with 70 events against baseline CMSSW_8_1_X_2016-04-29-2300. The most notable differences I found are below. Are they to be expected?

[image: tpfrac]

cvuosalo · 2016-05-05T20:36:27Z

CPU timing measurements confirm the expected 10% improvement in timing. A test of 50202.0 as described above shows reduced times for tracking modules:

  delta/mean delta/orJob     original                   new       module name
  ---------- ------------     --------                  ----       ------------
   -0.093260      -0.01%        57.92 ms/ev ->        52.76 ms/ev jetCoreRegionalStepTrackCandidates
   -0.060842      -0.10%       611.34 ms/ev ->       575.24 ms/ev pixelPairStepTrackCandidates
   -0.057358      -0.05%       347.44 ms/ev ->       328.07 ms/ev lowPtTripletStepTrackCandidates
   -0.056272      -0.07%       482.71 ms/ev ->       456.29 ms/ev detachedTripletStepTrackCandidates
   -0.055094      -0.09%       649.80 ms/ev ->       614.96 ms/ev initialStepTrackCandidatesPreSplitting

(Times exclude the first event.)

By another measure:

    detachedTripletStepTrackCandidates   343.364 ms/ev -> 325.338 ms/ev
    lowPtTripletStepTrackCandidates      247.207 ms/ev -> 235.246 ms/ev
    pixelPairStepTrackCandidates         434.93  ms/ev -> 408.53  ms/ev
    jetCoreRegionalStepTrackCandidates   42.6031 ms/ev -> 38.6807 ms/ev
    Total in detailed printout:          1202.83 ms/ev -> 1146.98 ms/ev   delta: -55.8407
    Total times:                         5536.21 ms/ev -> 5452.16 ms/ev   delta: -84.0479

cvuosalo · 2016-05-05T22:15:11Z

The Jenkins test results show very numerous tiny differences for both the DQM and alternative comparisons for many workflows.

cvuosalo · 2016-05-06T03:02:02Z

A test of workflow 25202.0_TTbar_13 with 70 events against baseline CMSSW_8_1_X_2016-04-29-2300 also shows about 10% faster CPU time for tracking modules.

  delta/mean delta/orJob     original                   new       module name
  ---------- ------------     --------                  ----       ------------
   -0.217048      -0.03%        88.88 ms/ev ->        71.48 ms/ev jetCoreRegionalStepTrackCandidates
   -0.125561      -0.09%       418.39 ms/ev ->       368.96 ms/ev lowPtTripletStepTrackCandidates
   -0.120764      -0.11%       537.63 ms/ev ->       476.40 ms/ev detachedTripletStepTrackCandidates
   -0.118866      -0.17%       832.45 ms/ev ->       739.05 ms/ev pixelPairStepTrackCandidates
   -0.102000      -0.11%       598.20 ms/ev ->       540.15 ms/ev initialStepTrackCandidates
   -0.098286      -0.11%       615.13 ms/ev ->       557.51 ms/ev initialStepTrackCandidatesPreSplitting
   -0.085970      -0.03%       219.43 ms/ev ->       201.34 ms/ev tobTecStepTrackCandidates
   -0.080878      -0.06%       422.54 ms/ev ->       389.70 ms/ev pixelLessStepTrackCandidates
   -0.070235      -0.00%        30.78 ms/ev ->        28.69 ms/ev mixedTripletStepTrackCandidates

(Times exclude first event)

Second measure:

    initialStepTrackCandidates           422.891 ms/ev -> 381.62  ms/ev
    detachedTripletStepTrackCandidates   378.803 ms/ev -> 335.665 ms/ev
    lowPtTripletStepTrackCandidates      294.53  ms/ev -> 261.284 ms/ev
    pixelPairStepTrackCandidates         587.988 ms/ev -> 520.341 ms/ev
    pixelLessStepTrackCandidates         297.141 ms/ev -> 274.214 ms/ev
    tobTecStepTrackCandidates            153.762 ms/ev -> 141.073 ms/ev
    jetCoreRegionalStepTrackCandidates   64.3726 ms/ev -> 51.8902 ms/ev
    Total in detailed printout:          2785.71 ms/ev -> 2556.17 ms/ev   delta: -229.534
    Total times:                         5966.04 ms/ev -> 5733.31 ms/ev   delta: -232.735

There are also numerous tiny differences from the baseline. Unlike workflow 50202.0, these differences seem to generally be improvements. Two examples are shown below.

One misidentified track removed:

VinInn · 2016-05-06T06:41:13Z

Concerning the fraction of true pt, I will check with @makortel.
The TOB hit difference seems to be a migration between two contiguous bins.
I cannot exclude differences due to numerics or rare branches (the code in question has more than 8 possible paths...).

VinInn · 2016-05-06T12:01:11Z

@cvuosalo, I can reproduce the observed differences with this small change in the current IB:

diff --git a/RecoLocalTracker/SiStripRecHitConverter/src/StripCPEfromTrackAngle.cc b/RecoLocalTracker/SiStripRecHitConverter/src/StripCPEfromTrackAngle.cc
index ca02f45..9d16b5f 100644
--- a/RecoLocalTracker/SiStripRecHitConverter/src/StripCPEfromTrackAngle.cc
+++ b/RecoLocalTracker/SiStripRecHitConverter/src/StripCPEfromTrackAngle.cc
@@ -76,8 +76,9 @@ StripCPEfromTrackAngle::localParameters( const SiStripCluster& cluster, const Ge
       break;
   }

-  const float strip = cluster.barycenter() -  0.5f*(1.f-p.backplanecorrection) * fullProjection
-    + 0.5f*p.coveredStrips(track, ltp.position());
+  const float corr = -  0.5f*(1.f-p.backplanecorrection) * fullProjection
+                      + 0.5f*p.coveredStrips(track, ltp.position());
+  const float strip = cluster.barycenter() + corr;

   return std::make_pair( p.topology->localPosition(strip, ltp.vector()),
                         p.topology->localError(strip, uerr2, ltp.vector()) );

We may even claim that the modified code is more precise than the original one.
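
The diff above only regroups a floating-point sum, and floating-point addition is not associative, which is one plausible source of tiny numerical differences. A minimal sketch, assuming IEEE 754 single precision with round-to-nearest, no excess intermediate precision and no `-ffast-math`; the function names and the input values are illustrative and unrelated to real cluster data:

```cpp
// Left-to-right evaluation, as in the original expression: (b - x) + y.
inline float sumDirect(float b, float x, float y) {
  return b - x + y;
}

// Correction grouped first, as in the modified expression: b + (-x + y).
inline float sumGrouped(float b, float x, float y) {
  const float corr = -x + y;
  return b + corr;
}
```

With b = 16777216 (2^24), x = -1 and y = 1, `sumDirect` returns 16777216 while `sumGrouped` returns 16777218: adding 1 to 2^24 rounds back to 2^24 each time, whereas the grouped correction of 2 is added exactly.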

cvuosalo · 2016-05-06T16:31:40Z

The test for workflow 25202 (described above) also shows a roughly 5-10% improvement for tracking modules in the HLT step:

  delta/mean delta/orJob     original                   new       module name
  ---------- ------------     --------                  ----       ------------
   -0.101400      -0.01%        32.34 ms/ev ->        29.22 ms/ev hltMuCkfTrackCandidates
   -0.101288      -0.00%        10.09 ms/ev ->         9.11 ms/ev hltIter2PFlowCkfTrackCandidatesForBTag
   -0.099339      -0.00%        14.79 ms/ev ->        13.39 ms/ev hltIter1ElectronsCkfTrackCandidates
   -0.098992      -0.00%        12.32 ms/ev ->        11.16 ms/ev hltIter1PFlowCkfTrackCandidatesForBTag
   -0.097608      -0.01%        42.12 ms/ev ->        38.20 ms/ev hltIter1PFlowCkfTrackCandidates
   -0.090038      -0.00%        34.76 ms/ev ->        31.77 ms/ev hltIter2PFlowCkfTrackCandidates
   -0.086561      -0.00%         8.46 ms/ev ->         7.76 ms/ev hltIter0PFlowCkfTrackCandidatesForBTag
   -0.085895      -0.00%        20.45 ms/ev ->        18.77 ms/ev hltIter2PFlowCkfTrackCandidatesForTau
   -0.085860      -0.00%        15.10 ms/ev ->        13.85 ms/ev hltIter0ElectronsCkfTrackCandidates
   -0.082601      -0.00%        12.10 ms/ev ->        11.14 ms/ev hltIter2PFlowCkfTrackCandidatesForPhotons
   -0.075333      -0.00%        21.76 ms/ev ->        20.18 ms/ev hltEgammaCkfTrackCandidatesForGSF
   -0.075186      -0.00%        37.86 ms/ev ->        35.12 ms/ev hltIter0PFlowCkfTrackCandidates
   -0.072771      -0.00%        28.09 ms/ev ->        26.11 ms/ev hltIter1PFlowCkfTrackCandidatesForTau
   -0.070787      -0.00%        23.03 ms/ev ->        21.46 ms/ev hltIter0PFlowCkfTrackCandidatesForTau
   -0.068314      -0.00%        20.28 ms/ev ->        18.94 ms/ev hltIter0PFlowCkfTrackCandidatesForPhotons
   -0.064093      -0.01%        94.91 ms/ev ->        89.02 ms/ev hltIter2ElectronsCkfTrackCandidates
   -0.063713      -0.01%        70.50 ms/ev ->        66.15 ms/ev hltEgammaCkfTrackCandidatesForGSFUnseeded
   -0.053430      -0.01%        93.81 ms/ev ->        88.93 ms/ev hltDisplacedhltIter4PFlowCkfTrackCandidates

cvuosalo · 2016-05-06T20:55:08Z

+1

For #14344 e77be1a

Speeding up Strip CPE computations.

The code changes are satisfactory. Jenkins tests against baseline CMSSW_8_1_X_2016-05-02-2300 show no significant differences but do show numerous tiny differences expected for the code changes. Extended tests of workflows 25202 and 50202 discussed above show similar tiny changes and confirm that a 10% speed-up of affected tracking modules has been achieved.

cmsbuild · 2016-05-06T20:55:28Z

This pull request is fully signed and it will be integrated in one of the next CMSSW_8_1_X IBs (tests are also fine). This pull request requires discussion in the ORP meeting before it's merged. @slava77, @davidlange6, @Degano, @smuzaffar

davidlange6 · 2016-05-07T15:28:05Z

+1

backport of #13448: small performance improvement in Tracking, and #14344: avoid redundant computations in StripCPE

Vincenzo Innocente and others added 4 commits May 2, 2016 16:13

compiles, moved loop in cpe

3baab0d

fix misstype

98808d8

speed also the other methods

fefdadc

use fast cpe also for 1d hits

e77be1a

cmsbuild added this to the Next CMSSW_8_1_X milestone May 3, 2016

cmsbuild added reconstruction-pending core-pending pending-signatures tests-pending orp-pending comparison-pending labels May 3, 2016

cmsbuild added tests-started and removed tests-pending labels May 3, 2016

cmsbuild added tests-approved and removed tests-started labels May 3, 2016

Dr15Jones reviewed May 3, 2016

cmsbuild added comparison-available and removed comparison-pending labels May 3, 2016

cmsbuild added core-approved and removed core-pending labels May 3, 2016

cvuosalo reviewed May 4, 2016

cmsbuild added reconstruction-approved fully-signed and removed pending-signatures reconstruction-pending labels May 6, 2016

cmsbuild added orp-approved and removed orp-pending labels May 7, 2016

cmsbuild merged commit e64655b into cms-sw:CMSSW_8_1_X May 7, 2016

fwyzard mentioned this pull request Jul 8, 2016

backport PRs to CMSSW 8.0.x to improve HLT performance #15151

Closed

9 tasks

fwyzard mentioned this pull request Jul 26, 2016

backport of #13448: small performance improvement in Tracking, and #14344: avoid redundant computations in StripCPE #15295

Merged

cmsbuild added a commit that referenced this pull request Aug 26, 2016

Merge pull request #15295 from fwyzard/backport_13448+14344_80x

a8061f4

backport of #13448: small performance improvement in Tracking, and #14344: avoid redundant computations in StripCPE
