Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DQM instability in 12434.0, pixelLessStep_quickAssociatorByHits #37970

Closed
jpata opened this issue May 17, 2022 · 18 comments
Closed

DQM instability in 12434.0, pixelLessStep_quickAssociatorByHits #37970

jpata opened this issue May 17, 2022 · 18 comments

Comments

@jpata
Copy link
Contributor

jpata commented May 17, 2022

In recent PRs it seems the workflow 12434.0 has DQM changes in Tracking/TrackParameters/generalTracks/SeedMon/pixelLessStep/TrackBuilding, pixelLessStep_quickAssociatorByHits

cms-sw/cmsdist#7862
#37918
#37952

The change is not always in the same direction:
image
image

@cms-sw/tracking-pog-l2

@jpata
Copy link
Contributor Author

jpata commented May 17, 2022

assign reconstruction

@cmsbuild
Copy link
Contributor

New categories assigned: reconstruction

@jpata,@slava77,@clacaputo you have been requested to review this Pull request/Issue and eventually sign? Thanks

@cmsbuild
Copy link
Contributor

A new Issue was created by @jpata Joosep Pata.

@Dr15Jones, @perrotta, @dpiparo, @makortel, @smuzaffar, @qliphy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@jpata
Copy link
Contributor Author

jpata commented May 17, 2022

As I noted here, it does not seem immediately related to TF giving different results on different CPUs.

37918 baseline: 2022-05-16 14:22:17.657780: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE4.1 SSE4.2 AVX AVX2 FMA
37918 PR:       2022-05-16 14:49:16.906255: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE4.1 SSE4.2 AVX AVX2 FMA

@makortel
Copy link
Contributor

(speaking from distant past experience) Looking at #37918 (comment), there are differences in folder Tracking/TrackParameters/generalTracks/SeedMon/pixelLessStep like
image

These plots are "purely DQM", i.e. tracks (well, TrackCandidates) are not matched to MC. This hints that the difference comes in some way from reconstruction. Could there be e.g. uninitialized variable somewhere where the (mkfit) track candidates are fed to the DNN? @leonardogiannini @slava77

@slava77
Copy link
Contributor

slava77 commented May 17, 2022

is it clear that the differences started showing up after #37878 ?

@slava77
Copy link
Contributor

slava77 commented May 17, 2022

Could there be e.g. uninitialized variable somewhere where the (mkfit) track candidates are fed to the DNN? @leonardogiannini @slava77

there is indeed an execution path that could lead to uninitialized values (with an assumption that a default-constructed tf::Tensor is not initialized, a trackCandidate that fails a PCA would run evaluation over uninitialized values).
@leonardogiannini is investigating

@leonardogiannini
Copy link
Contributor

this should indeed be due to https://github.com/cms-sw/cmssw/blob/master/RecoTracker/MkFit/plugins/MkFitOutputConverter.cc#L637 which leaves the input undetermined in some cases.
I found 1/3510 tracks (100 evts) passed to cand DNN where this happens. in TTbar with PU this is happens in about 0.1% of tracks passed to cand DNN.

A quick fix would be adding the initialization to 0, e.g. :

   if (!(tsAtClosestApproachTrackCand.isValid())) {
     edm::LogVerbatim("TrackBuilding") << "TrajectoryStateClosestToBeamLine not valid";
     for (auto i=0; i<29; i++) {
        input1.matrix<float>()(nt, i) = 0;
     }
     input2.matrix<float>()(nt, 0) = algo_;
     continue;
   }

I can open the PR. I imagine it needs a backport as well.

@slava77
Copy link
Contributor

slava77 commented May 17, 2022

A quick fix would be adding the initialization to 0, e.g. :

does a 0 input return an output dnn value that would pass or fail the cuts?
My guess is that it's better to fail the candidate.
So, there may be a need to override the dnn(0) .

@leonardogiannini
Copy link
Contributor

Unfortunately, the all 0 tensor passes the selection (though not significant).
The other possibility is to store the indices "itrack" (in a std::vector?) where it happens and then fail the cand making the output -1.

@jpata
Copy link
Contributor Author

jpata commented May 18, 2022

type tracking

@jpata
Copy link
Contributor Author

jpata commented May 18, 2022

I understand this would be something urgent to address before the final release of 12_4_0 (closed to physics changes, but bugfixes should be fine). Is this something you can address in the next week or so, @leonardogiannini?

@mmusich
Copy link
Contributor

mmusich commented May 18, 2022

The other possibility is to store the indices "itrack" (in a std::vector?) where it happens and then fail the cand making the output -1.

Would something like this work?

diff --git a/RecoTracker/MkFit/plugins/MkFitOutputConverter.cc b/RecoTracker/MkFit/plugins/MkFitOutputConverter.cc
index 16fdfad9ec1..d27575aaad2 100644
--- a/RecoTracker/MkFit/plugins/MkFitOutputConverter.cc
+++ b/RecoTracker/MkFit/plugins/MkFitOutputConverter.cc
@@ -616,6 +616,8 @@ std::vector<float> MkFitOutputConverter::computeDNNs(TrackCandidateCollection co
   tensorflow::Tensor input1(tensorflow::DT_FLOAT, {bsize_, 29});
   tensorflow::Tensor input2(tensorflow::DT_FLOAT, {bsize_, 1});
 
+  std::vector<int> toFail;
+
   for (auto nb = 0; nb < nbatches + 1; nb++) {
     for (auto nt = 0; nt < bsize_; nt++) {
       int itrack = nt + bsize_ * nb;
@@ -634,6 +636,7 @@ std::vector<float> MkFitOutputConverter::computeDNNs(TrackCandidateCollection co
 
       if (!(tsAtClosestApproachTrackCand.isValid())) {
         edm::LogVerbatim("TrackBuilding") << "TrajectoryStateClosestToBeamLine not valid";
+        toFail.push_back(itrack);
         continue;
       }
 
@@ -720,7 +723,12 @@ std::vector<float> MkFitOutputConverter::computeDNNs(TrackCandidateCollection co
         continue;
 
       float out0 = 2.0 * outputs[0].matrix<float>()(nt, 0) - 1.0;
-      output[itrack] = out0;
+
+      if (std::find(toFail.begin(), toFail.end(), itrack) != toFail.end()) {
+        output[itrack] = -1.f;
+      } else {
+        output[itrack] = out0;
+      }
     }
   }

@leonardogiannini
Copy link
Contributor

yes, that's it more or less. I am testing almost the same. You or I can open the PR. Let me know

@mmusich
Copy link
Contributor

mmusich commented May 18, 2022

yes, that's it more or less. I am testing almost the same. You or I can open the PR. Let me know

go ahead!

@perrotta
Copy link
Contributor

Looks like having been actually fixed by #38004, see also #37954 (comment)

@jpata
Copy link
Contributor Author

jpata commented May 20, 2022

@jpata jpata closed this as completed May 20, 2022
@cmsbuild
Copy link
Contributor

This issue is fully signed and ready to be closed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

7 participants