CMSSW_12_4_11 did not fix GSF tracking issue in prompt reco #39987

Closed
rappoccio opened this issue Nov 4, 2022 · 36 comments

Comments

@rappoccio
Contributor

Details of the crash are spelled out here. It was hoped that CMSSW_12_4_11 would solve this, but the issue is still present.

@rappoccio
Contributor Author

assign @cms-sw/egamma-pog-l2

@cmsbuild
Contributor

cmsbuild commented Nov 4, 2022

A new Issue was created by @rappoccio .

@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@makortel
Contributor

makortel commented Nov 4, 2022

assign reconstruction, egamma-pog

@cmsbuild
Contributor

cmsbuild commented Nov 4, 2022

New categories assigned: reconstruction,egamma-pog

@mandrenguyen,@lfinco,@clacaputo,@swagata87 you have been requested to review this Pull request/Issue and eventually sign? Thanks

@mmusich
Contributor

mmusich commented Nov 4, 2022

but the issue is still present.

as a clarification, there is an issue, but it's of a different nature than the other one (previously an assertion was triggered, while here there is a segmentation fault)

@slava77
Contributor

slava77 commented Nov 7, 2022

There doesn't seem to be much room for a place to crash:

BasicMultiTrajectoryState::BasicMultiTrajectoryState(const std::vector<TSOS>& tsvec)
    : BasicTrajectoryState(tsvec.front().surface()), theStates(tsvec) {

From GaussianStateConversions::tsosFromMultiGaussianState
I see that if everything fails the input vector is empty.

I would suggest to edit GaussianStateConversions::tsosFromMultiGaussianState
so that it returns a default-constructed object if the vector is empty.

@swagata87 could you check?
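
To illustrate the suggestion with a toy model: calling `front()` on an empty `std::vector` is undefined behaviour, so a factory can check for emptiness first and return a default-constructed (invalid) object instead. The types `Surface`, `State`, `MultiState`, `StateHandle` and the function `makeMultiState` below are hypothetical stand-ins, not the CMSSW classes; this is only a sketch of the idea.

```cpp
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-ins for illustration only; not the CMSSW classes.
struct Surface {
  int id = 0;
};

struct State {
  Surface surface;
  double weight = 0.0;
};

// Mimics a multi-component state whose constructor reads tsvec.front().
class MultiState {
public:
  explicit MultiState(const std::vector<State>& tsvec)
      : surface_(tsvec.front().surface), states_(tsvec) {}  // undefined behaviour if tsvec is empty
  const Surface& surface() const { return surface_; }
  std::size_t size() const { return states_.size(); }

private:
  Surface surface_;
  std::vector<State> states_;
};

// Handle that may be empty, loosely analogous to a default-constructed state.
class StateHandle {
public:
  StateHandle() = default;
  explicit StateHandle(std::shared_ptr<const MultiState> p) : ptr_(std::move(p)) {}
  bool isValid() const { return ptr_ != nullptr; }
  const MultiState& get() const { return *ptr_; }

private:
  std::shared_ptr<const MultiState> ptr_;
};

// Guarded factory: never constructs MultiState from an empty vector.
StateHandle makeMultiState(const std::vector<State>& components) {
  if (components.empty()) {
    return StateHandle();  // invalid handle; callers must check isValid()
  }
  return StateHandle(std::make_shared<MultiState>(components));
}
```

With this shape, the burden moves to the callers, which have to test `isValid()` before using the result.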

@slava77
Contributor

slava77 commented Nov 7, 2022

or is there a good reason to make BasicMultiTrajectoryState::BasicMultiTrajectoryState(const std::vector<TSOS>& tsvec) accept empty vectors?

@mmusich
Contributor

mmusich commented Nov 7, 2022

something like:

diff --git a/TrackingTools/GsfTracking/src/TsosGaussianStateConversions.cc b/TrackingTools/GsfTracking/src/TsosGaussianStateConversions.cc
index 1692614f233..44fd569c8b4 100644
--- a/TrackingTools/GsfTracking/src/TsosGaussianStateConversions.cc
+++ b/TrackingTools/GsfTracking/src/TsosGaussianStateConversions.cc
@@ -50,6 +50,6 @@ namespace GaussianStateConversions {
                                 side);
       }
     }
-    return TrajectoryStateOnSurface((BasicTrajectoryState*)new BasicMultiTrajectoryState(components));
+    return components.empty() ? TrajectoryStateOnSurface() : TrajectoryStateOnSurface((BasicTrajectoryState*)new BasicMultiTrajectoryState(components));
   }
 }  // namespace GaussianStateConversions

?
I guess it would still be nice to be able to actually reproduce it offline

@swagata87
Contributor

Thank you Slava and Marco.

I could not reproduce the actual crash yet. So I tried to simulate the crash...

Basically, I tried to check what happens if no state passes the following check in GsfMultiStateUpdator.cc:

if (double det;
updatedTSOS.isValid() && updatedTSOS.localError().valid() && updatedTSOS.localError().posDef() &&
(det = 0., updatedTSOS.curvilinearError().matrix().Sub<AlgebraicSymMatrix22>(0, 0).Det(det) && det > 0) &&
(det = 0., updatedTSOS.curvilinearError().matrix().Sub<AlgebraicSymMatrix33>(0, 0).Det(det) && det > 0) &&
(det = 0., updatedTSOS.curvilinearError().matrix().Sub<AlgebraicSymMatrix44>(0, 0).Det(det) && det > 0) &&
(det = 0., updatedTSOS.curvilinearError().matrix().Det2(det) && det > 0)) {
result.addState(TrajectoryStateOnSurface(weights[i],

by doing the following:

--- a/TrackingTools/GsfTracking/src/GsfMultiStateUpdator.cc
+++ b/TrackingTools/GsfTracking/src/GsfMultiStateUpdator.cc
@@ -36,7 +36,7 @@ TrajectoryStateOnSurface GsfMultiStateUpdator::update(const TrajectoryStateOnSur
         (det = 0., updatedTSOS.curvilinearError().matrix().Sub<AlgebraicSymMatrix22>(0, 0).Det(det) && det > 0) &&
         (det = 0., updatedTSOS.curvilinearError().matrix().Sub<AlgebraicSymMatrix33>(0, 0).Det(det) && det > 0) &&
         (det = 0., updatedTSOS.curvilinearError().matrix().Sub<AlgebraicSymMatrix44>(0, 0).Det(det) && det > 0) &&
-        (det = 0., updatedTSOS.curvilinearError().matrix().Det2(det) && det > 0)) {
+        (det = 0., updatedTSOS.curvilinearError().matrix().Det2(det) && det > 0 && det<0)) {
       result.addState(TrajectoryStateOnSurface(weights[i],

The above condition, det > 0 && det < 0, can never be true. When I compiled and ran with this change, it did not crash. I tested on only 10 events. It did print messages like "KF updated state 0 is invalid. skipping." as expected from GsfMultiStateUpdator.
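
The effect of that modification can be sketched in isolation. In this toy version, the `Cov2` type and its `Det2` method are hypothetical stand-ins that only mimic the "return a success flag, write the determinant through a reference" calling style seen above. The extra `det < 0` makes the condition unsatisfiable, so no component is ever accepted:

```cpp
#include <vector>

// Hypothetical 2x2 symmetric matrix; only mimics the Det2(double&) calling style.
struct Cov2 {
  double xx, xy, yy;
  // Writes the determinant into det and reports success.
  bool Det2(double& det) const {
    det = xx * yy - xy * xy;
    return true;
  }
};

// Counts components that pass the positive-determinant requirement.
int countAccepted(const std::vector<Cov2>& covs) {
  int n = 0;
  for (auto const& c : covs) {
    if (double det = 0; c.Det2(det) && det > 0) {  // C++17 if-with-initializer
      ++n;
    }
  }
  return n;
}

// Same loop with the contradictory requirement used to force a failure.
int countAcceptedForcedFailure(const std::vector<Cov2>& covs) {
  int n = 0;
  for (auto const& c : covs) {
    if (double det = 0; c.Det2(det) && det > 0 && det < 0) {  // never true
      ++n;
    }
  }
  return n;
}
```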

Then I rolled back GsfMultiStateUpdator to what it was, and did the same exercise for TsosGaussianStateConversions.cc; i.e., made sure that no state passes this loop:

for (auto const& ic : singleStates) {
//require states to be positive-definite
if (double det = 0; (*ic).covariance().Det2(det) && det > 0) {
components.emplace_back((*ic).weight(),

by requiring the same det>0 && det<0:

--- a/TrackingTools/GsfTracking/src/TsosGaussianStateConversions.cc
+++ b/TrackingTools/GsfTracking/src/TsosGaussianStateConversions.cc
@@ -41,7 +41,7 @@ namespace GaussianStateConversions {
     components.reserve(singleStates.size());
     for (auto const& ic : singleStates) {
       //require states to be positive-definite
-      if (double det = 0; (*ic).covariance().Det2(det) && det > 0) {
+      if (double det = 0; (*ic).covariance().Det2(det) && det > 0 && det<0) {
         components.emplace_back((*ic).weight(),

This time it crashed at the first event.

Then I added the patch suggested by Slava and Marco to TsosGaussianStateConversions.cc

return components.empty() ? TrajectoryStateOnSurface() : TrajectoryStateOnSurface((BasicTrajectoryState*)new BasicMultiTrajectoryState(components));

and ran again. But this time also, it crashed at the first event.

@slava77
Contributor

slava77 commented Nov 7, 2022

This time it crashed at the first event.

Then I added the patch suggested by Slava and Marco to TsosGaussianStateConversions.cc

return components.empty() ? TrajectoryStateOnSurface() : TrajectoryStateOnSurface((BasicTrajectoryState*)new BasicMultiTrajectoryState(components));

and ran again. But this time also, it crashed at the first event.

where was the crash this time?

@swagata87
Contributor

where was the crash this time?

this time it crashed like this:

Thread 1 (Thread 0x7fb56a21b740 (LWP 6733) "cmsRun"):
#0  0x00007fb56c0beddd in poll () from /lib64/libc.so.6
#1  0x00007fb55f7f673f in full_read.constprop () from /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_4_11/lib/slc7_amd64_gcc10/pluginFWCoreServicesPlugins.so
#2  0x00007fb55f7f70cc in edm::service::InitRootHandlers::stacktraceFromThread() () from /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_4_11/lib/slc7_amd64_gcc10/pluginFWCoreServicesPlugins.so
#3  0x00007fb55f7f9a1b in sig_dostack_then_abort () from /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_4_11/lib/slc7_amd64_gcc10/pluginFWCoreServicesPlugins.so
#4  <signal handler called>
#5  0x00007fb51eee5801 in PixelClusterParameterEstimator::getParameters(SiPixelCluster const&, GeomDet const&, TrajectoryStateOnSurface const&) const () from /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_4_11/lib/slc7_amd64_gcc10/libRecoLocalTrackerSiPixelRecHits.so
#6  0x00007fb534dec0b9 in TkClonerImpl::makeShared(SiPixelRecHit const&, TrajectoryStateOnSurface const&) const () from /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_4_11/lib/slc7_amd64_gcc10/libRecoTrackerTransientTrackingRecHit.so
#7  0x00007fb53bf06a10 in SiPixelRecHit::cloneSH_(TkCloner const&, TrajectoryStateOnSurface const&) const () from /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_4_11/lib/slc7_amd64_gcc10/libDataFormatsTrackerRecHit2D.so
#8  0x00007fb532c37703 in GsfTrajectoryFitter::fitOne(TrajectorySeed const&, std::vector<std::shared_ptr<TrackingRecHit const>, std::allocator<std::shared_ptr<TrackingRecHit const> > > const&, TrajectoryStateOnSurface const&, TrajectoryFitter::fitType) const () from /afs/cern.ch/work/s/swmukher/GsfCrash/CMSSW_12_4_11/lib/slc7_amd64_gcc10/libTrackingToolsGsfTracking.so
#9  0x00007fb4db1b13cd in GoodSeedProducer::produce(edm::Event&, edm::EventSetup const&) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_4_11/lib/slc7_amd64_gcc10/pluginRecoParticleFlowPFTrackingPlugins.so
#10 0x00007fb56eb67703 in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_4_11/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#11 0x00007fb56eb4cc3f in edm::WorkerT<edm::stream::EDProducerAdaptorBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_4_11/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#12 0x00007fb56eaa4e65 in decltype ({parm#1}()) edm::convertException::wrap<edm::Worker::runModule<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}>(edm::Worker::runModule<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_4_11/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#13 0x00007fb56eaa515b in std::__exception_ptr::exception_ptr edm::Worker::runModuleAfterAsyncPrefetch<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(std::__exception_ptr::exception_ptr const*, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_4_11/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#14 0x00007fb56eaa7745 in edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >::execute() () from /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_4_11/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#15 0x00007fb56eca77b5 in tbb::detail::d1::function_task<edm::WaitingTaskList::announce()::{lambda()#1}>::execute(tbb::detail::d1::execution_data&) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_4_11/lib/slc7_amd64_gcc10/libFWCoreConcurrency.so
#16 0x00007fb56d1c4bec in tbb::detail::r1::task_dispatcher::local_wait_for_all<false, tbb::detail::r1::external_waiter> (waiter=..., t=0x7fb5143e3b00, this=0x7fb568b3fe00) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_4_0_pre2-slc7_amd64_gcc10/build/CMSSW_12_4_0_pre2-build/BUILD/slc7_amd64_gcc10/external/tbb/v2021.5.0-e966a5acb1e4d5fd7605074bafbb079c/tbb-v2021.5.0/src/tbb/task_dispatcher.h:322
#17 tbb::detail::r1::task_dispatcher::local_wait_for_all<tbb::detail::r1::external_waiter> (waiter=..., t=<optimized out>, this=0x7fb568b3fe00) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_4_0_pre2-slc7_amd64_gcc10/build/CMSSW_12_4_0_pre2-build/BUILD/slc7_amd64_gcc10/external/tbb/v2021.5.0-e966a5acb1e4d5fd7605074bafbb079c/tbb-v2021.5.0/src/tbb/task_dispatcher.h:463
#18 tbb::detail::r1::task_dispatcher::execute_and_wait (t=<optimized out>, wait_ctx=..., w_ctx=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_4_0_pre2-slc7_amd64_gcc10/build/CMSSW_12_4_0_pre2-build/BUILD/slc7_amd64_gcc10/external/tbb/v2021.5.0-e966a5acb1e4d5fd7605074bafbb079c/tbb-v2021.5.0/src/tbb/task_dispatcher.cpp:168
#19 0x00007fb56ea152e8 in edm::EventProcessor::processLumis(std::shared_ptr<void> const&) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_4_11/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#20 0x00007fb56ea2020b in edm::EventProcessor::runToCompletion() () from /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_4_11/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#21 0x000000000040a266 in tbb::detail::d1::task_arena_function<main::{lambda()#1}::operator()() const::{lambda()#1}, void>::operator()() const ()
#22 0x00007fb56d1b30eb in tbb::detail::r1::task_arena_impl::execute (ta=..., d=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_4_0_pre2-slc7_amd64_gcc10/build/CMSSW_12_4_0_pre2-build/BUILD/slc7_amd64_gcc10/external/tbb/v2021.5.0-e966a5acb1e4d5fd7605074bafbb079c/tbb-v2021.5.0/src/tbb/arena.cpp:698
#23 0x000000000040b094 in main::{lambda()#1}::operator()() const ()
#24 0x000000000040971c in main ()

Current Modules:

Module: GoodSeedProducer:trackerDrivenElectronSeeds (crashed)

A fatal system signal has occurred: segmentation violation
Segmentation fault (core dumped)

@swagata87
Contributor

From a technical POV, it looks like I have found a way to bypass the (simulated) crash.

If we use your patch in TsosGaussianStateConversions, plus the following change in GsfTrajectoryFitter:

--- a/TrackingTools/GsfTracking/src/GsfTrajectoryFitter.cc
+++ b/TrackingTools/GsfTracking/src/GsfTrajectoryFitter.cc
@@ -94,7 +94,7 @@ Trajectory GsfTrajectoryFitter::fitOne(const TrajectorySeed& aSeed,
     //
     // temporary protection copied from KFTrajectoryFitter.
     //
-    if ((**ihit).isValid() == false && (**ihit).det() == nullptr) {
+    if ((**ihit).isValid() == false || (**ihit).det() == nullptr) {
       LogDebug("GsfTrackFitters") << " Error: invalid hit with no GeomDet attached .... skipping";
       continue;
     }
@@ -121,6 +121,9 @@ Trajectory GsfTrajectoryFitter::fitOne(const TrajectorySeed& aSeed,
       //update
       assert((!(*ihit)->canImproveWithTrack()) | (nullptr != theHitCloner));
       assert((!(*ihit)->canImproveWithTrack()) | (nullptr != dynamic_cast<BaseTrackerRecHit const*>((*ihit).get())));
+      if (!predTsos.isValid()) {
+       return Trajectory();
+      }
       auto preciseHit = theHitCloner->makeShared(*ihit, predTsos);
       dump(*preciseHit, hitcounter, "GsfTrackFitters");
       currTsos = updator()->update(predTsos, *preciseHit);

then it runs. Currently tested on 20 events only.

We still need to check and understand the following:

  • check on more events, at least a few hundred
  • do the above changes make sense from physics POV to everyone? What do you think Slava and Marco?
  • check whether there is any physics change; for example, the e/gamma pT distributions with and without the changes can be compared, and they should be identical.
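
The change from `&&` to `||` in the first hunk of the GsfTrajectoryFitter diff matters for hits that fail only one of the two tests. A small sketch of the difference, where the `Det` and `Hit` structs and both helper functions are hypothetical and introduced only for illustration:

```cpp
// Hypothetical hit: a validity flag plus an optional detector pointer.
struct Det {};

struct Hit {
  bool valid;
  const Det* det;
};

// Original guard: skips only hits that are invalid AND have no detector.
bool skipWithAnd(const Hit& h) { return h.valid == false && h.det == nullptr; }

// Patched guard: skips hits that are invalid OR have no detector.
bool skipWithOr(const Hit& h) { return h.valid == false || h.det == nullptr; }
```

Under the original guard, an invalid hit that still has a detector attached (or a valid hit without one) is not skipped; under the patched guard both are. Whether skipping those hits is the desired behaviour is part of the open question in the list above.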

@mmusich
Contributor

mmusich commented Nov 8, 2022

@swagata87

this time it crashed like this:

I see the Module: GoodSeedProducer:trackerDrivenElectronSeeds (crashed) crash when I apply this patch (that you suggested to simulate the crash):

diff --git a/TrackingTools/GsfTracking/src/TsosGaussianStateConversions.cc b/TrackingTools/GsfTracking/src/TsosGaussianStateConversions.cc
index 1692614f233..bf77e61ae83 100644
--- a/TrackingTools/GsfTracking/src/TsosGaussianStateConversions.cc
+++ b/TrackingTools/GsfTracking/src/TsosGaussianStateConversions.cc
@@ -41,7 +41,7 @@ namespace GaussianStateConversions {
     components.reserve(singleStates.size());
     for (auto const& ic : singleStates) {
       //require states to be positive-definite
-      if (double det = 0; (*ic).covariance().Det2(det) && det > 0) {
+      if (double det = 0; (*ic).covariance().Det2(det) && det > 0 && det < 0) {
         components.emplace_back((*ic).weight(),
                                 LocalTrajectoryParameters((*ic).mean(), pzSign, charged),
                                 LocalTrajectoryError((*ic).covariance()),

as far as I can tell, it's not the same crash as the one reported by Tier-0: Module: GsfTrackProducer:lowPtGsfEleGsfTracks (crashed).

How can we be sure we're curing the right problem? (by the way I still couldn't reproduce offline the crash seen in the replay).

@swagata87
Contributor

Hi Marco,

I agree and share your concern that there is no guarantee that we are fixing the actual issue that T0 encountered.
Below one can see the differences between the output of the T0 replay crash and that of our simulated crash:

Real crash in replay (copied from T0 cmsTalk)

#3  0x00002b2d53b70a0b in sig_dostack_then_abort () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_11/lib/el8_amd64_gcc10/pluginFWCoreServicesPlugins.so
#4  <signal handler called>
#5  0x00002b2d5e1624f4 in BasicTrajectoryState::BasicTrajectoryState(Surface const&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_11/lib/el8_amd64_gcc10/libTrackingToolsTrajectoryState.so
#6  0x00002b2d7cb52f40 in BasicMultiTrajectoryState::BasicMultiTrajectoryState(std::vector<TrajectoryStateOnSurface, std::allocator<TrajectoryStateOnSurface> > const&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_11/lib/el8_amd64_gcc10/libTrackingToolsGsfTools.so
#7  0x00002b2d80e0799f in GaussianStateConversions::tsosFromMultiGaussianState(MultiGaussianState<5u> const&, TrajectoryStateOnSurface const&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_11/lib/el8_amd64_gcc10/libTrackingToolsGsfTracking.so
#8  0x00002b2d80dfdc5d in MultiTrajectoryStateMerger::merge(TrajectoryStateOnSurface const&) const () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_11/lib/el8_amd64_gcc10/libTrackingToolsGsfTracking.so
#9  0x00002b2d80dfac9f in GsfTrajectorySmoother::trajectory(Trajectory const&) const () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_11/lib/el8_amd64_gcc10/libTrackingToolsGsfTracking.so
#10 0x00002b2d80c39c1f in (anonymous namespace)::KFFittingSmoother::smoothingStep(Trajectory&&) const () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_11/lib/el8_amd64_gcc10/pluginTrackingToolsTrackFittersPlugins.so
#11 0x00002b2d80c3abe3 in (anonymous namespace)::KFFittingSmoother::fitOne(TrajectorySeed const&, std::vector<std::shared_ptr<TrackingRecHit const>, std::allocator<std::shared_ptr<TrackingRecHit const> > > const&, TrajectoryStateOnSurface const&, TrajectoryFitter::fitType) const () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_11/lib/el8_amd64_gcc10/pluginTrackingToolsTrackFittersPlugins.so
#12 0x00002b2db0625be9 in TrackProducerAlgorithm<reco::GsfTrack>::buildTrack(TrajectoryFitter const*, Propagator const*, std::vector<AlgoProductTraits<reco::GsfTrack>::AlgoProduct, std::allocator<AlgoProductTraits<reco::GsfTrack>::AlgoProduct> >&, std::vector<std::shared_ptr<TrackingRecHit const>, std::allocator<std::shared_ptr<TrackingRecHit const> > >&, TrajectoryStateOnSurface&, TrajectorySeed const&, float, reco::BeamSpot const&, edm::RefToBase<TrajectorySeed>, int, signed char) () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_11/lib/el8_amd64_gcc10/libRecoTrackerTrackProducer.so
#13 0x00002b2db04d9511 in TrackProducerAlgorithm<reco::GsfTrack>::runWithCandidate(TrackingGeometry const*, MagneticField const*, std::vector<TrackCandidate, std::allocator<TrackCandidate> > const&, TrajectoryFitter const*, Propagator const*, TransientTrackingRecHitBuilder const*, reco::BeamSpot const&, std::vector<AlgoProductTraits<reco::GsfTrack>::AlgoProduct, std::allocator<AlgoProductTraits<reco::GsfTrack>::AlgoProduct> >&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_11/lib/el8_amd64_gcc10/pluginRecoTrackerTrackProducerPlugins.so
#14 0x00002b2db04d3d63 in GsfTrackProducer::produce(edm::Event&, edm::EventSetup const&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_11/lib/el8_amd64_gcc10/pluginRecoTrackerTrackProducerPlugins.so
#15 0x00002b2d4b90a6f3 in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_11/lib/el8_amd64_gcc10/libFWCoreFramework.so
.....
.....
Current Modules:

Module: GsfTrackProducer:lowPtGsfEleGsfTracks (crashed)
Module: GsfElectronProducer:gedGsfElectronsTmp
Module: StandAloneMuonProducer:standAloneMuons
Module: SeedCreatorFromRegionConsecutiveHitsTripletOnlyEDProducer:pixelLessStepSeeds
Module: CkfTrackCandidateMaker:tobTecStepTrackCandidates
Module: PoolOutputModule:write_ALCARECO
Module: GEDPhotonProducer:photons
Module: PFClusterProducer:particleFlowClusterHF

A fatal system signal has occurred: segmentation violation

Simulated crash

#3  0x00007f6d8e5eaa1b in sig_dostack_then_abort () from /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_4_11/lib/slc7_amd64_gcc10/pluginFWCoreServicesPlugins.so
#4  <signal handler called>
#5  0x00007f6d6a7de4f4 in BasicTrajectoryState::BasicTrajectoryState(Surface const&) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_4_11/lib/slc7_amd64_gcc10/libTrackingToolsTrajectoryState.so
#6  0x00007f6d656d0f50 in BasicMultiTrajectoryState::BasicMultiTrajectoryState(std::vector<TrajectoryStateOnSurface, std::allocator<TrajectoryStateOnSurface> > const&) () from /afs/cern.ch/work/s/swmukher/GsfCrash/CMSSW_12_4_11/lib/slc7_amd64_gcc10/libTrackingToolsGsfTools.so
#7  0x00007f6d614404c8 in GaussianStateConversions::tsosFromMultiGaussianState(MultiGaussianState<5u> const&, TrajectoryStateOnSurface const&) () from /afs/cern.ch/work/s/swmukher/GsfCrash/CMSSW_12_4_11/lib/slc7_amd64_gcc10/libTrackingToolsGsfTracking.so
#8  0x00007f6d61436c3d in MultiTrajectoryStateMerger::merge(TrajectoryStateOnSurface const&) const () from /afs/cern.ch/work/s/swmukher/GsfCrash/CMSSW_12_4_11/lib/slc7_amd64_gcc10/libTrackingToolsGsfTracking.so
#9  0x00007f6d6142fdea in GsfTrajectoryFitter::fitOne(TrajectorySeed const&, std::vector<std::shared_ptr<TrackingRecHit const>, std::allocator<std::shared_ptr<TrackingRecHit const> > > const&, TrajectoryStateOnSurface const&, TrajectoryFitter::fitType) const () from /afs/cern.ch/work/s/swmukher/GsfCrash/CMSSW_12_4_11/lib/slc7_amd64_gcc10/libTrackingToolsGsfTracking.so
#10 0x00007f6d09fb13cd in GoodSeedProducer::produce(edm::Event&, edm::EventSetup const&) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_4_11/lib/slc7_amd64_gcc10/pluginRecoParticleFlowPFTrackingPlugins.so
#11 0x00007f6d9d91c703 in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_4_11/lib/slc7_amd64_gcc10/libFWCoreFramework.so
.....
.....
Current Modules:

Module: GoodSeedProducer:trackerDrivenElectronSeeds (crashed)

A fatal system signal has occurred: segmentation violation

@mmusich
Contributor

mmusich commented Nov 8, 2022

so IIUC the "whole" proposal (I tested that it avoids the crash, at least in the "simulated" setup) would be:

diff --git a/TrackingTools/GsfTracking/src/GsfTrajectoryFitter.cc b/TrackingTools/GsfTracking/src/GsfTrajectoryFitter.cc
index 7fa6e49f4e3..3b154349315 100644
--- a/TrackingTools/GsfTracking/src/GsfTrajectoryFitter.cc
+++ b/TrackingTools/GsfTracking/src/GsfTrajectoryFitter.cc
@@ -121,6 +121,9 @@ Trajectory GsfTrajectoryFitter::fitOne(const TrajectorySeed& aSeed,
       //update
       assert((!(*ihit)->canImproveWithTrack()) | (nullptr != theHitCloner));
       assert((!(*ihit)->canImproveWithTrack()) | (nullptr != dynamic_cast<BaseTrackerRecHit const*>((*ihit).get())));
+      if (!predTsos.isValid()) {
+       return Trajectory();
+      }
       auto preciseHit = theHitCloner->makeShared(*ihit, predTsos);
       dump(*preciseHit, hitcounter, "GsfTrackFitters");
       currTsos = updator()->update(predTsos, *preciseHit);
diff --git a/TrackingTools/GsfTracking/src/TsosGaussianStateConversions.cc b/TrackingTools/GsfTracking/src/TsosGaussianStateConversions.cc
index 1692614f233..44fd569c8b4 100644
--- a/TrackingTools/GsfTracking/src/TsosGaussianStateConversions.cc
+++ b/TrackingTools/GsfTracking/src/TsosGaussianStateConversions.cc
@@ -50,6 +50,6 @@ namespace GaussianStateConversions {
                                 side);
       }
     }
-    return TrajectoryStateOnSurface((BasicTrajectoryState*)new BasicMultiTrajectoryState(components));
+    return components.empty() ? TrajectoryStateOnSurface() : TrajectoryStateOnSurface((BasicTrajectoryState*)new BasicMultiTrajectoryState(components));
   }
 }  // namespace GaussianStateConversions

I guess we can open a PR to check for regressions.

@swagata87
Contributor

Thank you Marco. If you take care of preparing the PR then in the meanwhile I can do all the sanity checks I was planning to do, to make sure there is no adverse effect anywhere in e/gamma reconstruction due to these changes.

@mmusich
Contributor

mmusich commented Nov 8, 2022

I think we should try to reproduce on el8, which is the production architecture at Tier-0 (but not on standard lxplus nodes)

@swagata87
Contributor

@mpresill

mpresill commented Nov 9, 2022

Hello, it's ORM here,
I would like to bring your attention to a probably related crash for Prompt at T0:
https://cms-talk.web.cern.ch/t/abort-signal-in-promptreco-for-jetmet-in-run-361468/17307 .
We are currently trying to reproduce the crash to see the specific event.

@mmusich
Contributor

mmusich commented Nov 9, 2022

@mpresill

I would like to bring your attention to a probably related crash for Prompt at T0:

from the log, it doesn't look related.

@mmusich
Contributor

mmusich commented Nov 9, 2022

@mpresill

based on the stack trace:

Thread 1 (Thread 0x2b17ead50940 (LWP 570) "cmsRun"):
#0  0x00002b17ea01eae1 in poll () from /lib64/libc.so.6
#1  0x00002b17efb4072f in full_read.constprop () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_10/lib/el8_amd64_gcc10/pluginFWCoreServicesPlugins.so
#2  0x00002b17efb410bc in edm::service::InitRootHandlers::stacktraceFromThread() () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_10/lib/el8_amd64_gcc10/pluginFWCoreServicesPlugins.so
#3  0x00002b17efb43a0b in sig_dostack_then_abort () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_10/lib/el8_amd64_gcc10/pluginFWCoreServicesPlugins.so
#4  <signal handler called>
#5  0x00002b17e9f48a4f in raise () from /lib64/libc.so.6
#6  0x00002b17e9f1bdb5 in abort () from /lib64/libc.so.6
#7  0x00002b17e9f1bc89 in __assert_fail_base.cold.0 () from /lib64/libc.so.6
#8  0x00002b17e9f413a6 in __assert_fail () from /lib64/libc.so.6
#9  0x00002b1845109cb2 in fastjet::contrib::MeasureDefinition::get_partition(std::vector<fastjet::PseudoJet, std::allocator<fastjet::PseudoJet> > const&, std::vector<fastjet::PseudoJet, std::allocator<fastjet::PseudoJet> > const&) const () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_10/external/el8_amd64_gcc10/lib/libfastjetcontribfragile.so
#10 0x00002b18450ffc48 in fastjet::contrib::Njettiness::getTauComponents(unsigned int, std::vector<fastjet::PseudoJet, std::allocator<fastjet::PseudoJet> > const&) const () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_10/external/el8_amd64_gcc10/lib/libfastjetcontribfragile.so
#11 0x00002b189b4a25b1 in BoostedDoubleSVProducer::calcNsubjettiness(edm::RefToBase<reco::Jet> const&, float&, float&, std::vector<fastjet::PseudoJet, std::allocator<fastjet::PseudoJet> >&) const () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_10/lib/el8_amd64_gcc10/pluginRecoBTagSecondaryVertexProducer.so
#12 0x00002b189b4a3785 in BoostedDoubleSVProducer::produce(edm::Event&, edm::EventSetup const&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_10/lib/el8_amd64_gcc10/pluginRecoBTagSecondaryVertexProducer.so
#13 0x00002b17e78e86f3 in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_10/lib/el8_amd64_gcc10/libFWCoreFramework.so
#14 0x00002b17e78cdc2f in edm::WorkerT<edm::stream::EDProducerAdaptorBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_10/lib/el8_amd64_gcc10/libFWCoreFramework.so
#15 0x00002b17e7825e55 in decltype ({parm#1}()) edm::convertException::wrap<edm::Worker::runModule<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}>(edm::Worker::runModule<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}) () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_10/lib/el8_amd64_gcc10/libFWCoreFramework.so
...
Current Modules:

Module: BoostedDoubleSVProducer:pfBoostedDoubleSVAK8TagInfosSlimmedAK8DeepTags (crashed)
Module: MuonIdProducer:muons1stStep
Module: AlCaHcalIsotrkProducer:alcaHcalIsotrkProducer
Module: AlCaHcalIsotrkProducer:alcaHcalIsotrkProducer
Module: PoolOutputModule:write_AOD
Module: GsfElectronProducer:lowPtGsfElectronsPreRegression
Module: SiStripRecHitConverter:siStripMatchedRecHits
Module: LowPtGsfElectronSeedProducer:lowPtGsfElectronSeeds

A fatal system signal has occurred: abort signal
Complete

it seems it's coming from https://github.com/cms-externals/fastjet-contrib/blob/283910e44f2c3c81133fc68c8f4942b9c53da6e3/Nsubjettiness/Njettiness.cc#L179-L204

I would suggest opening a different issue (and pinging a different set of experts)

@perrotta
Contributor

#40017 is now merged in master: should we prepare a backport to the data-taking releases 12_4_X and 12_5_X, and test whether it finally fixes this issue?

@swagata87
Contributor

okay, I will do the backports soon

@perrotta
Contributor

urgent

@francescobrivio
Contributor

There was another instance of the same issue (MultiTrajectoryState mixes states with and without errors) in Run 361971.
I tried running the job in:

But I'd like the experts to test it as well.
A minimal recipe to reproduce the crash is:

cmsrel CMSSW_12_4_11
cd CMSSW_12_4_11/src
cmsenv
git cms-addpkg TrackingTools/GsfTracking
git remote add swagataRepo git@github.com:swagata87/cmssw.git
git fetch swagataRepo
git cherry-pick aec3325132ddb2d2423dc872561e6ebf35934786
scram b -j 8
cp /afs/cern.ch/user/c/cmst0/public/PausedJobs/Run2022F/LogicError/job_3520411/vocms014.cern.ch-3520411-3-log.tar.gz .
tar -zxvf vocms014.cern.ch-3520411-3-log.tar.gz
cd job/WMTaskSpace/cmsRun1/

edit the PSet.py to be:

import FWCore.ParameterSet.Config as cms
import pickle
with open('PSet.pkl', 'rb') as handle:
    process = pickle.load(handle)
    process.options.numberOfThreads=cms.untracked.uint32(1)
    process.options.numberOfStreams=cms.untracked.uint32(1)
    process.source.eventsToProcess = cms.untracked.VEventRange('361971:435223126-361971:435223128')

run:

cmsRun PSet.py

@mmusich
Contributor

mmusich commented Nov 16, 2022

@francescobrivio

again a segFault :(

Is the stack trace the same as in plain 12_4_11?
Please clarify / elaborate

@swagata87
Contributor

Good to have a reproducible crash; at least one can debug this.

The following extra protection solves this crash for me. Can anyone else confirm, please?

--- a/TrackingTools/GsfTracking/src/GsfTrajectorySmoother.cc
+++ b/TrackingTools/GsfTracking/src/GsfTrajectorySmoother.cc
@@ -129,6 +129,10 @@ Trajectory GsfTrajectorySmoother::trajectory(const Trajectory& aTraj) const {
     if (theMerger)
       predTsos = theMerger->merge(predTsos);
 
+    if (!predTsos.isValid()) {
+      return Trajectory();
+    }
+
     if ((*itm).recHit()->isValid()) {
       //update
       currTsos = updator()->update(predTsos, *(*itm).recHit());

@swagata87
Contributor

Note that the above extra protection in GsfTrajectorySmoother is very similar to the extra protection we added in GsfTrajectoryFitter some days ago, which went into PR #40017.

@francescobrivio
Contributor

@francescobrivio

again a segFault :(

Is the stack trace the same as in plain 12_4_11? Please clarify / elaborate

The stack traces are slightly different (though I'm not an expert at all):
Plain 12_4_11:

#3  0x00007f57b4e98a1b in sig_dostack_then_abort () from /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_4_11/lib/slc7_amd64_gcc10/pluginFWCoreServicesPlugins.so
#4  <signal handler called>
#5  0x00007f57a9c354f4 in BasicTrajectoryState::BasicTrajectoryState(Surface const&) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_4_11/lib/slc7_amd64_gcc10/libTrackingToolsTrajectoryState.so
#6  0x00007f578c576f50 in BasicMultiTrajectoryState::BasicMultiTrajectoryState(std::vector<TrajectoryStateOnSurface, std::allocator<TrajectoryStateOnSurface> > const&) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_4_11/lib/slc7_amd64_gcc10/libTrackingToolsGsfTools.so
#7  0x00007f57882e69af in GaussianStateConversions::tsosFromMultiGaussianState(MultiGaussianState<5u> const&, TrajectoryStateOnSurface const&) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_4_11/lib/slc7_amd64_gcc10/libTrackingToolsGsfTracking.so
#8  0x00007f57882dcc6d in MultiTrajectoryStateMerger::merge(TrajectoryStateOnSurface const&) const () from /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_4_11/lib/slc7_amd64_gcc10/libTrackingToolsGsfTracking.so
#9  0x00007f57882d9caf in GsfTrajectorySmoother::trajectory(Trajectory const&) const () from /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_4_11/lib/slc7_amd64_gcc10/libTrackingToolsGsfTracking.so
#10 0x00007f578848bc1f in (anonymous namespace)::KFFittingSmoother::smoothingStep(Trajectory&&) const () from /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_4_11/lib/slc7_amd64_gcc10/pluginTrackingToolsTrackFittersPlugins.so
#11 0x00007f578848cbe3 in (anonymous namespace)::KFFittingSmoother::fitOne(TrajectorySeed const&, std::vector<std::shared_ptr<TrackingRecHit const>, std::allocator<std::shared_ptr<TrackingRecHit const> > > const&, TrajectoryStateOnSurface const&, TrajectoryFitter::fitType) const () from /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_4_11/lib/slc7_amd64_gcc10/pluginTrackingToolsTrackFittersPlugins.so
#12 0x00007f574b18cbf9 in TrackProducerAlgorithm<reco::GsfTrack>::buildTrack(TrajectoryFitter const*, Propagator const*, std::vector<AlgoProductTraits<reco::GsfTrack>::AlgoProduct, std::allocator<AlgoProductTraits<reco::GsfTrack>::AlgoProduct> >&, std::vector<std::shared_ptr<TrackingRecHit const>, std::allocator<std::shared_ptr<TrackingRecHit const> > >&, TrajectoryStateOnSurface&, TrajectorySeed const&, float, reco::BeamSpot const&, edm::RefToBase<TrajectorySeed>, int, signed char) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_4_11/lib/slc7_amd64_gcc10/libRecoTrackerTrackProducer.so
#13 0x00007f5749e5d521 in TrackProducerAlgorithm<reco::GsfTrack>::runWithCandidate(TrackingGeometry const*, MagneticField const*, std::vector<TrackCandidate, std::allocator<TrackCandidate> > const&, TrajectoryFitter const*, Propagator const*, TransientTrackingRecHitBuilder const*, reco::BeamSpot const&, std::vector<AlgoProductTraits<reco::GsfTrack>::AlgoProduct, std::allocator<AlgoProductTraits<reco::GsfTrack>::AlgoProduct> >&) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_4_11/lib/slc7_amd64_gcc10/pluginRecoTrackerTrackProducerPlugins.so
#14 0x00007f5749e57d73 in GsfTrackProducer::produce(edm::Event&, edm::EventSetup const&) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_4_11/lib/slc7_amd64_gcc10/pluginRecoTrackerTrackProducerPlugins.so
#15 0x00007f57bdc7f703 in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_4_11/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#16 0x00007f57bdc64c3f in edm::WorkerT<edm::stream::EDProducerAdaptorBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_4_11/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#17 0x00007f57bdbbce65 in decltype ({parm#1}()) edm::convertException::wrap<edm::Worker::runModule<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}>(edm::Worker::runModule<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_4_11/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#18 0x00007f57bdbbd15b in std::__exception_ptr::exception_ptr edm::Worker::runModuleAfterAsyncPrefetch<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(std::__exception_ptr::exception_ptr const*, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_4_11/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#19 0x00007f57bdbbf745 in edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >::execute() () from /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_4_11/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#20 0x00007f57bddbf7b5 in tbb::detail::d1::function_task<edm::WaitingTaskList::announce()::{lambda()#1}>::execute(tbb::detail::d1::execution_data&) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_4_11/lib/slc7_amd64_gcc10/libFWCoreConcurrency.so
#21 0x00007f57bc2dcbec in tbb::detail::r1::task_dispatcher::local_wait_for_all<false, tbb::detail::r1::external_waiter> (waiter=..., t=0x7f5769afb900, this=0x7f57b7d3fe00) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_4_0_pre2-slc7_amd64_gcc10/build/CMSSW_12_4_0_pre2-build/BUILD/slc7_amd64_gcc10/external/tbb/v2021.5.0-e966a5acb1e4d5fd7605074bafbb079c/tbb-v2021.5.0/src/tbb/task_dispatcher.h:322
#22 tbb::detail::r1::task_dispatcher::local_wait_for_all<tbb::detail::r1::external_waiter> (waiter=..., t=<optimized out>, this=0x7f57b7d3fe00) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_4_0_pre2-slc7_amd64_gcc10/build/CMSSW_12_4_0_pre2-build/BUILD/slc7_amd64_gcc10/external/tbb/v2021.5.0-e966a5acb1e4d5fd7605074bafbb079c/tbb-v2021.5.0/src/tbb/task_dispatcher.h:463
#23 tbb::detail::r1::task_dispatcher::execute_and_wait (t=<optimized out>, wait_ctx=..., w_ctx=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_4_0_pre2-slc7_amd64_gcc10/build/CMSSW_12_4_0_pre2-build/BUILD/slc7_amd64_gcc10/external/tbb/v2021.5.0-e966a5acb1e4d5fd7605074bafbb079c/tbb-v2021.5.0/src/tbb/task_dispatcher.cpp:168
#24 0x00007f57bdb2d2e8 in edm::EventProcessor::processLumis(std::shared_ptr<void> const&) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_4_11/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#25 0x00007f57bdb3820b in edm::EventProcessor::runToCompletion() () from /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_4_11/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#26 0x000000000040a266 in tbb::detail::d1::task_arena_function<main::{lambda()#1}::operator()() const::{lambda()#1}, void>::operator()() const ()
#27 0x00007f57bc2cb0eb in tbb::detail::r1::task_arena_impl::execute (ta=..., d=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_4_0_pre2-slc7_amd64_gcc10/build/CMSSW_12_4_0_pre2-build/BUILD/slc7_amd64_gcc10/external/tbb/v2021.5.0-e966a5acb1e4d5fd7605074bafbb079c/tbb-v2021.5.0/src/tbb/arena.cpp:698
#28 0x000000000040b094 in main::{lambda()#1}::operator()() const ()
#29 0x000000000040971c in main ()

Current Modules:

Module: GsfTrackProducer:lowPtGsfEleGsfTracks (crashed)

A fatal system signal has occurred: segmentation violation

12_4_11 + #40063:

#3  0x00007f4c2c0dfa1b in sig_dostack_then_abort () from /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_4_11/lib/slc7_amd64_gcc10/pluginFWCoreServicesPlugins.so
#4  <signal handler called>
#5  0x00007f4bff51ed42 in GsfMultiStateUpdator::update(TrajectoryStateOnSurface const&, TrackingRecHit const&) const () from /afs/cern.ch/work/f/fbrivio/AlCa/replay_12_4_11/new_12_4_11/CMSSW_12_4_11/lib/slc7_amd64_gcc10/libTrackingToolsGsfTracking.so
#6  0x00007f4bff526ee7 in GsfTrajectorySmoother::trajectory(Trajectory const&) const () from /afs/cern.ch/work/f/fbrivio/AlCa/replay_12_4_11/new_12_4_11/CMSSW_12_4_11/lib/slc7_amd64_gcc10/libTrackingToolsGsfTracking.so
#7  0x00007f4bff6d8c1f in (anonymous namespace)::KFFittingSmoother::smoothingStep(Trajectory&&) const () from /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_4_11/lib/slc7_amd64_gcc10/pluginTrackingToolsTrackFittersPlugins.so
#8  0x00007f4bff6d9be3 in (anonymous namespace)::KFFittingSmoother::fitOne(TrajectorySeed const&, std::vector<std::shared_ptr<TrackingRecHit const>, std::allocator<std::shared_ptr<TrackingRecHit const> > > const&, TrajectoryStateOnSurface const&, TrajectoryFitter::fitType) const () from /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_4_11/lib/slc7_amd64_gcc10/pluginTrackingToolsTrackFittersPlugins.so
#9  0x00007f4bc23dabf9 in TrackProducerAlgorithm<reco::GsfTrack>::buildTrack(TrajectoryFitter const*, Propagator const*, std::vector<AlgoProductTraits<reco::GsfTrack>::AlgoProduct, std::allocator<AlgoProductTraits<reco::GsfTrack>::AlgoProduct> >&, std::vector<std::shared_ptr<TrackingRecHit const>, std::allocator<std::shared_ptr<TrackingRecHit const> > >&, TrajectoryStateOnSurface&, TrajectorySeed const&, float, reco::BeamSpot const&, edm::RefToBase<TrajectorySeed>, int, signed char) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_4_11/lib/slc7_amd64_gcc10/libRecoTrackerTrackProducer.so
#10 0x00007f4bc10ab521 in TrackProducerAlgorithm<reco::GsfTrack>::runWithCandidate(TrackingGeometry const*, MagneticField const*, std::vector<TrackCandidate, std::allocator<TrackCandidate> > const&, TrajectoryFitter const*, Propagator const*, TransientTrackingRecHitBuilder const*, reco::BeamSpot const&, std::vector<AlgoProductTraits<reco::GsfTrack>::AlgoProduct, std::allocator<AlgoProductTraits<reco::GsfTrack>::AlgoProduct> >&) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_4_11/lib/slc7_amd64_gcc10/pluginRecoTrackerTrackProducerPlugins.so
#11 0x00007f4bc10a5d73 in GsfTrackProducer::produce(edm::Event&, edm::EventSetup const&) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_4_11/lib/slc7_amd64_gcc10/pluginRecoTrackerTrackProducerPlugins.so
#12 0x00007f4c34ecd703 in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_4_11/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#13 0x00007f4c34eb2c3f in edm::WorkerT<edm::stream::EDProducerAdaptorBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_4_11/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#14 0x00007f4c34e0ae65 in decltype ({parm#1}()) edm::convertException::wrap<edm::Worker::runModule<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}>(edm::Worker::runModule<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_4_11/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#15 0x00007f4c34e0b15b in std::__exception_ptr::exception_ptr edm::Worker::runModuleAfterAsyncPrefetch<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(std::__exception_ptr::exception_ptr const*, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_4_11/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#16 0x00007f4c34e0d745 in edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >::execute() () from /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_4_11/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#17 0x00007f4c3500d7b5 in tbb::detail::d1::function_task<edm::WaitingTaskList::announce()::{lambda()#1}>::execute(tbb::detail::d1::execution_data&) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_4_11/lib/slc7_amd64_gcc10/libFWCoreConcurrency.so
#18 0x00007f4c3352abec in tbb::detail::r1::task_dispatcher::local_wait_for_all<false, tbb::detail::r1::external_waiter> (waiter=..., t=0x7f4be0cf3900, this=0x7f4c2ef3fe00) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_4_0_pre2-slc7_amd64_gcc10/build/CMSSW_12_4_0_pre2-build/BUILD/slc7_amd64_gcc10/external/tbb/v2021.5.0-e966a5acb1e4d5fd7605074bafbb079c/tbb-v2021.5.0/src/tbb/task_dispatcher.h:322
#19 tbb::detail::r1::task_dispatcher::local_wait_for_all<tbb::detail::r1::external_waiter> (waiter=..., t=<optimized out>, this=0x7f4c2ef3fe00) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_4_0_pre2-slc7_amd64_gcc10/build/CMSSW_12_4_0_pre2-build/BUILD/slc7_amd64_gcc10/external/tbb/v2021.5.0-e966a5acb1e4d5fd7605074bafbb079c/tbb-v2021.5.0/src/tbb/task_dispatcher.h:463
#20 tbb::detail::r1::task_dispatcher::execute_and_wait (t=<optimized out>, wait_ctx=..., w_ctx=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_4_0_pre2-slc7_amd64_gcc10/build/CMSSW_12_4_0_pre2-build/BUILD/slc7_amd64_gcc10/external/tbb/v2021.5.0-e966a5acb1e4d5fd7605074bafbb079c/tbb-v2021.5.0/src/tbb/task_dispatcher.cpp:168
#21 0x00007f4c34d7b2e8 in edm::EventProcessor::processLumis(std::shared_ptr<void> const&) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_4_11/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#22 0x00007f4c34d8620b in edm::EventProcessor::runToCompletion() () from /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_4_11/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#23 0x000000000040a266 in tbb::detail::d1::task_arena_function<main::{lambda()#1}::operator()() const::{lambda()#1}, void>::operator()() const ()
#24 0x00007f4c335190eb in tbb::detail::r1::task_arena_impl::execute (ta=..., d=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_4_0_pre2-slc7_amd64_gcc10/build/CMSSW_12_4_0_pre2-build/BUILD/slc7_amd64_gcc10/external/tbb/v2021.5.0-e966a5acb1e4d5fd7605074bafbb079c/tbb-v2021.5.0/src/tbb/arena.cpp:698
#25 0x000000000040b094 in main::{lambda()#1}::operator()() const ()
#26 0x000000000040971c in main ()

Current Modules:

Module: GsfTrackProducer:lowPtGsfEleGsfTracks (crashed)

A fatal system signal has occurred: segmentation violation
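
To compare two traces like the ones above without reading every frame, one can pull out the function names of the first few frames below the signal handler. The following is only a sketch: the helper names (function_name, frames_after_signal) are hypothetical, the regular expression assumes the gdb-style frame layout shown in this thread, and the sample strings are abbreviated copies of the frames above with the library paths shortened.

```python
import re

# Matches frames such as "#5  0x00007f... in Some::function(args) () from lib.so"
# as well as frames without an address, e.g. "#4  <signal handler called>".
FRAME_RE = re.compile(r"^#\d+\s+(?:0x[0-9a-fA-F]+ in )?(.*)$")
ANON = "(anonymous namespace)::"


def function_name(frame_body: str) -> str:
    """Return the qualified function name without its argument list."""
    body = frame_body[len(ANON):] if frame_body.startswith(ANON) else frame_body
    return body.split("(", 1)[0].strip()


def frames_after_signal(trace: str, n: int = 3) -> list[str]:
    """Names of the first n frames below '<signal handler called>'."""
    names = []
    seen_handler = False
    for line in trace.splitlines():
        match = FRAME_RE.match(line.strip())
        if not match:
            continue
        body = match.group(1)
        if body.startswith("<signal handler called>"):
            seen_handler = True
            continue
        if seen_handler:
            names.append(function_name(body))
            if len(names) == n:
                break
    return names


# Abbreviated frames (library paths shortened for the example).
PLAIN = """#4  <signal handler called>
#5  0x00007f57a9c354f4 in BasicTrajectoryState::BasicTrajectoryState(Surface const&) () from libTrackingToolsTrajectoryState.so
#6  0x00007f578c576f50 in BasicMultiTrajectoryState::BasicMultiTrajectoryState(std::vector<TrajectoryStateOnSurface> const&) () from libTrackingToolsGsfTools.so"""

PATCHED = """#4  <signal handler called>
#5  0x00007f4bff51ed42 in GsfMultiStateUpdator::update(TrajectoryStateOnSurface const&, TrackingRecHit const&) const () from libTrackingToolsGsfTracking.so
#6  0x00007f4bff526ee7 in GsfTrajectorySmoother::trajectory(Trajectory const&) const () from libTrackingToolsGsfTracking.so"""

if __name__ == "__main__":
    print(frames_after_signal(PLAIN, 2))
    print(frames_after_signal(PATCHED, 2))
```

With the two samples, the top frame differs (BasicTrajectoryState constructor versus GsfMultiStateUpdator::update), which matches what can be read off the full traces.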

@mmusich
Contributor

mmusich commented Nov 16, 2022

stack traces are slightly different

They look substantially different to me: the last one points to GsfTrajectorySmoother, which is what Swagata's proposal (#39987 (comment)) is addressing. Wondering if there's any other place to treat with the same changes.

@swagata87
Contributor

Wondering if there's any other place to treat with the same changes.

Looking at GsfTrajectorySmoother, the following part of the code seems unprotected:
currTsos is updated by the merger and then used without checking whether the updated currTsos is still valid.

if (theMerger)
  currTsos = theMerger->merge(currTsos);
dump(currTsos, "currTsos", "GsfTrackFitters");

We can add this before the dump call in L207:

if (!currTsos.isValid()) {
  return Trajectory();
}
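
The idea behind both guards (check the state right after every merge, and give up on the whole trajectory if it is invalid) can be illustrated with a toy model. This is a sketch in plain Python, not CMSSW code: State, smooth, update and the two merge functions are hypothetical stand-ins for TrajectoryStateOnSurface, GsfTrajectorySmoother::trajectory, the updator and the merger, and the empty list plays the role of the empty Trajectory().

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class State:
    """Toy stand-in for a trajectory state with a validity flag."""
    valid: bool
    value: float = 0.0


def smooth(measurements: List[float],
           merge: Callable[[State], State],
           update: Callable[[State, float], State]) -> List[State]:
    """Return the smoothed states, or [] as soon as a merged state is invalid."""
    smoothed = []
    current = State(True, 0.0)
    for m in measurements:
        predicted = merge(current)
        if not predicted.valid:  # guard before the update step
            return []
        current = merge(update(predicted, m))
        if not current.valid:    # guard before the merged state is used again
            return []
        smoothed.append(current)
    return smoothed


def update(state: State, m: float) -> State:
    # Toy update: accumulate the measurement.
    return State(True, state.value + m)


def merge_ok(state: State) -> State:
    # A merger that never fails.
    return state


def merge_failing(state: State) -> State:
    # A merger that returns an invalid state once the value exceeds 2.
    return State(False) if state.value > 2 else state


if __name__ == "__main__":
    print(len(smooth([1, 1, 1, 1], merge_ok, update)))
    print(smooth([1, 1, 1, 1], merge_failing, update))
```

Returning an empty result instead of continuing with an invalid state trades one lost track candidate for not crashing the job, which is the same trade-off as in the proposed C++ change.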

@swagata87
Contributor

Although I missed the ORP yesterday, I have heard that a new 12_4_X release is imminent.
So I will make a PR to get these extra protections in, given that the PR tests take time.

If there are objections, other requests, or a different plan to proceed, we can discuss later of course.

@slava77
Contributor

slava77 commented Nov 16, 2022

in retrospect, I'm curious why we did not see all these crashes before.
Was the low-pt ele e.g. in B-parking reco in 10_6 a representative period (IIUC, most code is old enough and low-pt ele part was backported)?

I'm not sure from the past if we had all the production crash reports percolated to cmssw issues. My reco memory for the past 10 years was that only T0 crashes were followed systematically and the production was less so.
(but then maybe this memory is biased by just having more problems at T0 while production was running on minimally selected good/better data)

@mmusich
Contributor

mmusich commented Nov 16, 2022

in retrospect, I'm curious why we did not see all these crashes before.

I have the same curiosity. In my understanding low pT electron reco is new (only in UL 2018? So perhaps only well-calibrated data). Also, from anecdotal evidence, it seems that the number of GSF warnings is proportional to the "age" of the conditions used (newer conditions seem to trigger fewer errors). I haven't studied it in detail yet, though.

@swagata87
Contributor

For the record, there was another instance of this crash. This is the 4th instance in prompt reco, afaik.

It was in run 362091, as announced in https://cms-talk.web.cern.ch/t/paused-job-in-run-361971-due-to-gsftraking-logic-error/17617/2

From private communication with German Giraldo, I got the tarball, which is here:
/afs/cern.ch/user/c/cmst0/public/PausedJobs/Run2022F/LogicError/job_3892143
and also here: /eos/cms/store/logs/prod/recent/PromptReco/PromptReco_Run362091_ParkingDoubleMuonLowMass5/Reco/vocms014.cern.ch-3892143-3-log.tar.gz

I added the following lines to PSet.py in job_3892143/job/WMTaskSpace/cmsRun1/

    process.options.numberOfThreads=cms.untracked.uint32(1)
    process.options.numberOfStreams=cms.untracked.uint32(1)
    process.source.eventsToProcess = cms.untracked.VEventRange('362091:2491263184-362091:2491263186')

I could reproduce the crash in 12_4_10 (the release T0 is currently running).

%MSG
----- Begin Fatal Exception 19-Nov-2022 13:01:28 CET-----------------------
An exception of category 'LogicError' occurred while
   [0] Processing  Event run: 362091 lumi: 1186 event: 2491263185 stream: 0
   [1] Running path 'dqmoffline_10_step'
   [2] Prefetching for module SMPDQM/'SMPDQM'
   [3] Prefetching for module MuonProducer/'muons'
   [4] Prefetching for module PFProducer/'particleFlowTmp'
   [5] Prefetching for module PFBlockProducer/'particleFlowBlock'
   [6] Prefetching for module PFElecTkProducer/'pfTrackElec'
   [7] Calling method for module GsfTrackProducer/'electronGsfTracks'
Exception Message:
MultiTrajectoryState mixes states with and without errors
----- End Fatal Exception -------------------------------------------------

In 12_4_11_patch1, the crash goes away.
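
For this job the event range passed to eventsToProcess brackets the failing event (2491263185) by one event on each side. A small helper can build that string from the run and event numbers in the exception message. This is a sketch only: event_window is a hypothetical name, and the 'run:event-run:event' format is simply copied from the strings used in this thread.

```python
def event_window(run: int, event: int, margin: int = 1) -> str:
    """Build a 'run:first-run:last' range around a single event."""
    if margin < 0:
        raise ValueError("margin must be non-negative")
    first = max(event - margin, 1)  # keep the lower bound at 1 or above
    last = event + margin
    return f"{run}:{first}-{run}:{last}"


if __name__ == "__main__":
    # Run and event taken from the exception message above.
    print(event_window(362091, 2491263185))
```

The resulting string can then be passed to cms.untracked.VEventRange(...) in PSet.py, as in the lines shown earlier in this comment.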

@rappoccio
Contributor Author

It looks like we can close this now, thanks everyone!
