DQM instability in 12434.0, pixelLessStep_quickAssociatorByHits #37970

jpata · 2022-05-17T07:58:38Z

In recent PRs it seems the workflow 12434.0 has DQM changes in Tracking/TrackParameters/generalTracks/SeedMon/pixelLessStep/TrackBuilding, pixelLessStep_quickAssociatorByHits

cms-sw/cmsdist#7862
#37918
#37952

The change is not always in the same direction:

@cms-sw/tracking-pog-l2

jpata · 2022-05-17T07:58:44Z

assign reconstruction

cmsbuild · 2022-05-17T07:58:54Z

New categories assigned: reconstruction

@jpata,@slava77,@clacaputo you have been requested to review this Pull request/Issue and eventually sign? Thanks

cmsbuild · 2022-05-17T07:58:58Z

A new Issue was created by @jpata Joosep Pata.

@Dr15Jones, @perrotta, @dpiparo, @makortel, @smuzaffar, @qliphy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

jpata · 2022-05-17T08:03:00Z

As I noted here, it does not seem immediately related to TF giving different results on different CPUs.

37918 baseline: 2022-05-16 14:22:17.657780: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE4.1 SSE4.2 AVX AVX2 FMA
37918 PR:       2022-05-16 14:49:16.906255: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE4.1 SSE4.2 AVX AVX2 FMA

makortel · 2022-05-17T13:42:06Z

(speaking from distant past experience) Looking at #37918 (comment), there are differences in folder Tracking/TrackParameters/generalTracks/SeedMon/pixelLessStep like

These plots are "purely DQM", i.e. tracks (well, TrackCandidates) are not matched to MC. This hints that the difference comes in some way from reconstruction. Could there be e.g. an uninitialized variable somewhere where the (mkfit) track candidates are fed to the DNN? @leonardogiannini @slava77

slava77 · 2022-05-17T16:48:01Z

is it clear that the differences started showing up after #37878 ?

slava77 · 2022-05-17T17:10:31Z

> Could there be e.g. uninitialized variable somewhere where the (mkfit) track candidates are fed to the DNN? @leonardogiannini @slava77

there is indeed an execution path that could lead to uninitialized values (with an assumption that a default-constructed tf::Tensor is not initialized, a trackCandidate that fails a PCA would run evaluation over uninitialized values).
@leonardogiannini is investigating
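
To illustrate the execution path described above, here is a minimal standalone sketch, not the actual MkFitOutputConverter code. It assumes a batch buffer that is allocated once and reused, as the input tensors are, and every name in it (`Candidate`, `toyDnn`, `scoreAll`) is hypothetical. The buffer is zero-filled only so the sketch avoids undefined behaviour; in the real code the initial contents would be whatever the tensor allocation leaves there.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a track candidate: one input feature plus a validity flag.
struct Candidate {
  bool valid;
  float feature;
};

// Hypothetical stand-in for the DNN evaluation: returns its input unchanged.
float toyDnn(float x) { return x; }

// Scores candidates in batches through a buffer that is allocated once and reused.
std::vector<float> scoreAll(const std::vector<Candidate>& cands, std::size_t batchSize) {
  assert(batchSize > 0);
  // Zero-filled here only to keep the sketch free of undefined behaviour.
  std::vector<float> buffer(batchSize, 0.f);
  std::vector<float> scores(cands.size(), 0.f);
  for (std::size_t start = 0; start < cands.size(); start += batchSize) {
    const std::size_t n = std::min(batchSize, cands.size() - start);
    for (std::size_t row = 0; row < n; ++row) {
      const Candidate& c = cands[start + row];
      if (!c.valid)
        continue;  // row is not written: it keeps the previous batch's value
      buffer[row] = c.feature;
    }
    for (std::size_t row = 0; row < n; ++row)
      scores[start + row] = toyDnn(buffer[row]);  // evaluated for every row, valid or not
  }
  return scores;
}
```

With candidates {valid 5, valid 7, invalid, valid 9} and a batch size of 2, the invalid candidate ends up with the score of the first candidate, because its row still holds the value written in the previous batch. A result that depends on leftover memory in this way could plausibly vary from run to run.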

leonardogiannini · 2022-05-17T23:31:02Z

this should indeed be due to https://github.com/cms-sw/cmssw/blob/master/RecoTracker/MkFit/plugins/MkFitOutputConverter.cc#L637 which leaves the input undetermined in some cases.
I found 1/3510 tracks (100 evts) passed to cand DNN where this happens. In TTbar with PU this happens in about 0.1% of tracks passed to cand DNN.

A quick fix would be adding the initialization to 0, e.g. :

   if (!(tsAtClosestApproachTrackCand.isValid())) {
     edm::LogVerbatim("TrackBuilding") << "TrajectoryStateClosestToBeamLine not valid";
     for (auto i=0; i<29; i++) {
        input1.matrix<float>()(nt, i) = 0;
     }
     input2.matrix<float>()(nt, 0) = algo_;
     continue;
   }

I can open the PR. I imagine it needs a backport as well.

slava77 · 2022-05-17T23:49:10Z

> A quick fix would be adding the initialization to 0, e.g. :

does a 0 input return an output dnn value that would pass or fail the cuts?
My guess is that it's better to fail the candidate.
So, there may be a need to override dnn(0).

leonardogiannini · 2022-05-18T00:05:53Z

Unfortunately, the all-zero tensor passes the selection (though not significant).
The other possibility is to store the indices "itrack" (in a std::vector?) where it happens and then fail the cand making the output -1.

jpata · 2022-05-18T11:53:53Z

type tracking

jpata · 2022-05-18T11:56:03Z

I understand this would be something urgent to address before the final release of 12_4_0 (closed to physics changes, but bugfixes should be fine). Is this something you can address in the next week or so, @leonardogiannini?

mmusich · 2022-05-18T13:05:39Z

> The other possibility is to store the indices "itrack" (in a std::vector?) where it happens and then fail the cand making the output -1.

Would something like this work?

diff --git a/RecoTracker/MkFit/plugins/MkFitOutputConverter.cc b/RecoTracker/MkFit/plugins/MkFitOutputConverter.cc
index 16fdfad9ec1..d27575aaad2 100644
--- a/RecoTracker/MkFit/plugins/MkFitOutputConverter.cc
+++ b/RecoTracker/MkFit/plugins/MkFitOutputConverter.cc
@@ -616,6 +616,8 @@ std::vector<float> MkFitOutputConverter::computeDNNs(TrackCandidateCollection co
   tensorflow::Tensor input1(tensorflow::DT_FLOAT, {bsize_, 29});
   tensorflow::Tensor input2(tensorflow::DT_FLOAT, {bsize_, 1});
 
+  std::vector<int> toFail;
+
   for (auto nb = 0; nb < nbatches + 1; nb++) {
     for (auto nt = 0; nt < bsize_; nt++) {
       int itrack = nt + bsize_ * nb;
@@ -634,6 +636,7 @@ std::vector<float> MkFitOutputConverter::computeDNNs(TrackCandidateCollection co
 
       if (!(tsAtClosestApproachTrackCand.isValid())) {
         edm::LogVerbatim("TrackBuilding") << "TrajectoryStateClosestToBeamLine not valid";
+        toFail.push_back(itrack);
         continue;
       }
 
@@ -720,7 +723,12 @@ std::vector<float> MkFitOutputConverter::computeDNNs(TrackCandidateCollection co
         continue;
 
       float out0 = 2.0 * outputs[0].matrix<float>()(nt, 0) - 1.0;
-      output[itrack] = out0;
+
+      if (std::find(toFail.begin(), toFail.end(), itrack) != toFail.end()) {
+        output[itrack] = -1.f;
+      } else {
+        output[itrack] = out0;
+      }
     }
   }
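
The idea in the diff above can also be sketched in isolation. This is a hedged illustration rather than the code that was merged: it replaces the `std::find` lookup over `toFail` with a per-candidate flag vector, which is one possible variant, and all names (`Candidate`, `toyDnn`, `scoreAllFailInvalid`) are hypothetical stand-ins.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a track candidate: one input feature plus a validity flag.
struct Candidate {
  bool valid;
  float feature;
};

// Hypothetical stand-in for the DNN evaluation: returns its input unchanged.
float toyDnn(float x) { return x; }

// Same batching as the conversion loop, but invalid candidates are flagged and forced to -1.
std::vector<float> scoreAllFailInvalid(const std::vector<Candidate>& cands, std::size_t batchSize) {
  assert(batchSize > 0);
  std::vector<float> buffer(batchSize, 0.f);
  std::vector<float> scores(cands.size(), -1.f);
  std::vector<char> toFail(cands.size(), 0);  // one flag per candidate, constant-time lookup
  for (std::size_t start = 0; start < cands.size(); start += batchSize) {
    const std::size_t n = std::min(batchSize, cands.size() - start);
    for (std::size_t row = 0; row < n; ++row) {
      const Candidate& c = cands[start + row];
      if (!c.valid) {
        toFail[start + row] = 1;  // remember that this row was never filled
        continue;
      }
      buffer[row] = c.feature;
    }
    for (std::size_t row = 0; row < n; ++row) {
      const std::size_t itrack = start + row;
      // The stale row may still be evaluated, but its score is discarded.
      scores[itrack] = toFail[itrack] ? -1.f : toyDnn(buffer[row]);
    }
  }
  return scores;
}
```

With candidates {valid 5, valid 7, invalid, valid 9} and a batch size of 2, the invalid candidate now gets -1 regardless of what its buffer row contains, so it fails the selection deterministically.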

leonardogiannini · 2022-05-18T15:02:07Z

yes, that's it more or less. I am testing almost the same. You or I can open the PR. Let me know

mmusich · 2022-05-18T15:02:52Z

> yes, that's it more or less. I am testing almost the same. You or I can open the PR. Let me know

go ahead!

perrotta · 2022-05-20T12:55:58Z

Looks like this has actually been fixed by #38004, see also #37954 (comment)

jpata · 2022-05-20T12:59:06Z

+reconstruction

solved by Fix for issue 37970 - address undetermined output of cand DNN in mkFit #38004

cmsbuild · 2022-05-20T12:59:31Z

This issue is fully signed and ready to be closed.

cmsbuild added pending-signatures reconstruction-pending labels May 17, 2022

jpata mentioned this issue May 17, 2022

Update tensorflow to 2.6.3 cms-sw/cmsdist#7862

Merged

smuzaffar mentioned this issue May 17, 2022

Update libzmp to version 4.3.4 cms-sw/cmsdist#7875

Merged

perrotta mentioned this issue May 18, 2022

GEM Online & Offline Efficiency Update #37954

Merged

smuzaffar mentioned this issue May 18, 2022

reco comparisons: updated maps and monitoring for new products cms-sw/cms-bot#1762

Merged

cmsbuild added the tracking label May 18, 2022

leonardogiannini mentioned this issue May 18, 2022

Fix for issue 37970 - address undetermined output of cand DNN in mkFit #38004

Merged

mmusich mentioned this issue May 19, 2022

[12.4.X] address undef output of dnn - for issue #37970 #38011

Merged

jpata closed this as completed May 20, 2022

cmsbuild removed reconstruction-pending pending-signatures labels May 20, 2022

cmsbuild added fully-signed reconstruction-approved labels May 20, 2022

cmsbuild added a commit that referenced this issue May 20, 2022

Merge pull request #38011 from mmusich/FixIssue37970_12_4_X

e21da00

[12.4.X] address undef output of dnn - for issue #37970

qliphy mentioned this issue May 25, 2022

Semi-random initial value in the L1 trigger prescale counter #37506

Merged
