segmentation violation in L1TStage2CaloLayer1 #32172

Closed
silviodonato opened this issue Nov 18, 2020 · 11 comments

Comments

@silviodonato (Contributor) commented Nov 18, 2020

202 workflows crashed in CMSSW_11_2_X_2020-11-17-2300.
For instance, step3 of workflow 136.721 gets a segmentation violation:

Begin processing the 1st record. Run 274199, Event 215390593, LumiSection 110 on stream 7 at 18-Nov-2020 12:54:00.858 CET
%MSG-e hltPrescaleTable:  PATTriggerProducer:patTrigger  18-Nov-2020 12:54:01 CET Run: 274199 Event: 215390593
No HLT prescale table found
Using default empty table with all prescales 1
%MSG
%MSG-w L1TGlobal:  L1TGlobalProducer:valGtStage2Digis 18-Nov-2020 12:54:01 CET  Run: 274199 Event: 215390593

Warning: algoBit >= prescaleFactorsAlgoTrig.size() in bx -2
%MSG
%MSG-w L1TGlobal:  L1TGlobalProducer:valGtStage2Digis 18-Nov-2020 12:54:01 CET  Run: 274199 Event: 215390593

Warning: algoBit >= triggerMaskAlgoTrig.size() in bx -2
%MSG
%MSG-w L1TGlobal:  L1TGlobalProducer:valGtStage2Digis 18-Nov-2020 12:54:01 CET  Run: 274199 Event: 215390593

Warning: algoBit >= prescaleFactorsAlgoTrig.size() in bx -1
%MSG
%MSG-w L1TGlobal:  L1TGlobalProducer:valGtStage2Digis 18-Nov-2020 12:54:01 CET  Run: 274199 Event: 215390593

Warning: algoBit >= triggerMaskAlgoTrig.size() in bx -1
%MSG
%MSG-w L1TGlobal:  L1TGlobalProducer:valGtStage2Digis 18-Nov-2020 12:54:01 CET  Run: 274199 Event: 215390593

Warning: algoBit >= prescaleFactorsAlgoTrig.size() in bx 0
%MSG
%MSG-w L1TGlobal:  L1TGlobalProducer:valGtStage2Digis 18-Nov-2020 12:54:01 CET  Run: 274199 Event: 215390593

Warning: algoBit >= triggerMaskAlgoTrig.size() in bx 0
%MSG
%MSG-w L1TGlobal:  L1TGlobalProducer:valGtStage2Digis 18-Nov-2020 12:54:01 CET  Run: 274199 Event: 215390593

Warning: algoBit >= prescaleFactorsAlgoTrig.size() in bx 1
%MSG
%MSG-w L1TGlobal:  L1TGlobalProducer:valGtStage2Digis 18-Nov-2020 12:54:01 CET  Run: 274199 Event: 215390593

Warning: algoBit >= triggerMaskAlgoTrig.size() in bx 1
%MSG
%MSG-w L1TGlobal:  L1TGlobalProducer:valGtStage2Digis 18-Nov-2020 12:54:01 CET  Run: 274199 Event: 215390593

Warning: algoBit >= prescaleFactorsAlgoTrig.size() in bx 2
%MSG
%MSG-w L1TGlobal:  L1TGlobalProducer:valGtStage2Digis 18-Nov-2020 12:54:01 CET  Run: 274199 Event: 215390593

Warning: algoBit >= triggerMaskAlgoTrig.size() in bx 2
%MSG
%MSG-w Configuration:  HLTHighLevel:ALCARECOTkAlMinBiasNOTHLT  18-Nov-2020 12:54:05 CET Run: 274199 Event: 215390593
none of the requested paths and pattern match any HLT path - no events will be selected
%MSG
%MSG-w BSFitter:  AlcaBeamMonitor:AlcaBeamMonitor@endLumi  18-Nov-2020 12:54:12 CET Run: 274199 Lumi: 110
need at least 150 tracks to run beamline fitter.
%MSG


A fatal system signal has occurred: segmentation violation
The following is the call stack containing the origin of the signal.

Wed Nov 18 12:33:28 CET 2020

[...]

Thread 9 (Thread 0x7faf107fe700 (LWP 25123)):
#0  0x00007faf4d7fec3d in poll () from /lib64/libc.so.6
#1  0x00007faf42bdc43f in full_read.constprop () from /cvmfs/cms-ib.cern.ch/nweek-02655/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-11-16-1100/lib/slc7_amd64_gcc820/pluginFWCoreServicesPlugins.so
#2  0x00007faf42bdcb7c in edm::service::InitRootHandlers::stacktraceFromThread() () from /cvmfs/cms-ib.cern.ch/nweek-02655/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-11-16-1100/lib/slc7_amd64_gcc820/pluginFWCoreServicesPlugins.so
#3  0x00007faf42bdda59 in sig_dostack_then_abort () from /cvmfs/cms-ib.cern.ch/nweek-02655/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-11-16-1100/lib/slc7_amd64_gcc820/pluginFWCoreServicesPlugins.so
#4  <signal handler called>
#5  0x00007faeefb8067b in L1TStage2CaloLayer1::mergeMismatchVectors(std::vector<std::tuple<edm::RunID, edm::LuminosityBlockID, edm::EventID, int>, std::allocator<std::tuple<edm::RunID, edm::LuminosityBlockID, edm::EventID, int> > >&, std::vector<std::tuple<edm::RunID, edm::LuminosityBlockID, edm::EventID, int>, std::allocator<std::tuple<edm::RunID, edm::LuminosityBlockID, edm::EventID, int> > >&) const () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc820/cms/cmssw-patch/CMSSW_11_2_X_2020-11-17-2300/lib/slc7_amd64_gcc820/libDQML1TMonitor.so
#6  0x00007faeefb80850 in L1TStage2CaloLayer1::streamEndRunSummary(edm::StreamID, edm::Run const&, edm::EventSetup const&, CaloL1Information::perRunSummaryMonitoringInformation*) const () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc820/cms/cmssw-patch/CMSSW_11_2_X_2020-11-17-2300/lib/slc7_amd64_gcc820/libDQML1TMonitor.so
#7  0x00007faeefb88068 in virtual thunk to edm::global::impl::RunSummaryCacheHolder<edm::global::EDProducerBase, CaloL1Information::perRunSummaryMonitoringInformation>::doStreamEndRunSummary_(edm::StreamID, edm::Run const&, edm::EventSetup const&) () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc820/cms/cmssw-patch/CMSSW_11_2_X_2020-11-17-2300/lib/slc7_amd64_gcc820/libDQML1TMonitor.so
#8  0x00007faf50387dbc in edm::global::EDProducerBase::doStreamEndRun(edm::StreamID, edm::RunTransitionInfo const&, edm::ModuleCallingContext const*) () from /cvmfs/cms-ib.cern.ch/nweek-02655/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-11-16-1100/lib/slc7_amd64_gcc820/libFWCoreFramework.so
#9  0x00007faf50377240 in edm::WorkerT<edm::global::EDProducerBase>::implDoStreamEnd(edm::StreamID, edm::RunTransitionInfo const&, edm::ModuleCallingContext const*) () from /cvmfs/cms-ib.cern.ch/nweek-02655/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-11-16-1100/lib/slc7_amd64_gcc820/libFWCoreFramework.so
#10 0x00007faf50284508 in decltype ({parm#1}()) edm::convertException::wrap<edm::Worker::runModule<edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)2> >(edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)2>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)2>::Context const*)::{lambda()#1}>(edm::Worker::runModule<edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)2> >(edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)2>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)2>::Context const*)::{lambda()#1}) () from /cvmfs/cms-ib.cern.ch/nweek-02655/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-11-16-1100/lib/slc7_amd64_gcc820/libFWCoreFramework.so
#11 0x00007faf5028471f in bool edm::Worker::runModule<edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)2> >(edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)2>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)2>::Context const*) () from /cvmfs/cms-ib.cern.ch/nweek-02655/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-11-16-1100/lib/slc7_amd64_gcc820/libFWCoreFramework.so
#12 0x00007faf50284890 in edm::Worker::doWorkNoPrefetchingAsync<edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)2> >(edm::WaitingTask*, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)2>::TransitionInfoType const&, edm::ServiceToken const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)2>::Context const*)::{lambda()#1}::operator()() const () from /cvmfs/cms-ib.cern.ch/nweek-02655/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-11-16-1100/lib/slc7_amd64_gcc820/libFWCoreFramework.so
#13 0x00007faf5028495d in edm::FunctorTask<edm::Worker::doWorkNoPrefetchingAsync<edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)2> >(edm::WaitingTask*, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)2>::TransitionInfoType const&, edm::ServiceToken const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)2>::Context const*)::{lambda()#1}>::execute() () from /cvmfs/cms-ib.cern.ch/nweek-02655/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-11-16-1100/lib/slc7_amd64_gcc820/libFWCoreFramework.so
#14 0x00007faf4ead2bfd in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::process_bypass_loop (this=this@entry=0x7faf0fdf7e00, context_guard=..., t=t@entry=0x7faf0f394340, isolation=isolation@entry=0) at ../../src/tbb/custom_scheduler.h:393
#15 0x00007faf4ead2ef5 in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all (this=0x7faf0fdf7e00, parent=..., child=<optimized out>) at ../../include/tbb/task.h:1003
#16 0x00007faf4eacc9ff in tbb::internal::arena::process (this=0x7faf4a7ab480, s=...) at ../../src/tbb/arena.cpp:196
#17 0x00007faf4eacb3d3 in tbb::internal::market::process (this=0x7faf4a7b7580, j=...) at ../../src/tbb/market.cpp:667
#18 0x00007faf4eac77dc in tbb::internal::rml::private_worker::run (this=0x7faf450a8e00) at ../../src/tbb/private_server.cpp:266
#19 0x00007faf4eac79e9 in tbb::internal::rml::private_worker::thread_routine (arg=<optimized out>) at ../../src/tbb/private_server.cpp:219
#20 0x00007faf4dae0ea5 in start_thread () from /lib64/libpthread.so.0
#21 0x00007faf4d8098dd in clone () from /lib64/libc.so.6

[...]


Current Modules:

Module: L1TStage2CaloLayer1:l1tStage2CaloLayer1 (crashed)
Module: HiggsDQM:HiggsDQM
Module: PFCandidateAnalyzerDQM:PFCandAnalyzerDQM
Module: DQMMessageLogger:DQMMessageLogger
Module: PhotonAnalyzer:photonAnalysis
Module: CaloTowersAnalyzer:AllCaloTowersDQMOffline
Module: BPhysicsOniaDQM:bphysicsOniaDQM
Module: ZToMuMuGammaAnalyzer:zmumugammaAnalysis

A fatal system signal has occurred: segmentation violation

The offending PR is #32004

@cmsbuild (Contributor)

A new Issue was created by @silviodonato (Silvio Donato).

@Dr15Jones, @dpiparo, @silviodonato, @smuzaffar, @makortel, @qliphy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@silviodonato (Contributor, Author)

I confirm that runTheMatrix.py -l 136.721 --command "-n 1" -t8 works in CMSSW_11_2_X_2020-11-17-1100 and gives a segmentation violation in CMSSW_11_2_X_2020-11-17-1100 + #32004.

@mrodozov (Contributor) commented Nov 18, 2020

Ahh ... alright :) But I was running a failing workflow earlier which passed, and this puzzles me.
Because there were frontierclient failures, I thought: OK, maybe it was a frontier failure. But not anymore.
Let's hope for the 2300 IB :/

@silviodonato (Contributor, Author)

> Ahh ... alright :) But I was running a failing workflow earlier which passed, and this puzzles me.
> Because there were frontierclient failures, I thought: OK, maybe it was a frontier failure. But not anymore.
> Let's hope for the 2300 IB :/

Shahzad started a new IB (h1600) with #32173.

@silviodonato (Contributor, Author)

For reference, this is the full list of crashing workflows:

136.721
136.722
136.723
136.724
136.725
136.726
136.727
136.728
136.729
136.73
136.731
136.732
136.734
136.735
136.736
136.737
136.738
136.739
136.74
136.741
136.742
136.743
136.744
136.745
136.746
136.747
136.748
136.749
136.75
136.751
136.752
136.753
136.754
136.755
136.756
136.757
136.758
136.759
136.76
136.761
136.762
136.763
136.764
136.765
136.766
136.767
136.768
136.769
136.77
136.771
136.772
136.773
136.774
136.775
136.776
136.777
136.778
136.779
136.792
136.793
136.794
136.795
136.796
136.797
136.798
136.799
136.8
136.801
136.802
136.803
136.804
136.805
136.806
136.807
136.808
136.809
136.81
136.811
136.812
136.813
136.814
136.815
136.816
136.817
136.818
136.819
136.82
136.821
136.822
136.823
136.824
136.825
136.826
136.827
136.828
136.829
136.83
136.831
136.832
136.833
136.834
136.835
136.836
136.837
136.838
136.839
136.8391
136.84
136.841
136.842
136.843
136.845
136.846
136.847
136.848
140.57
281.0
1001.2
1040.0
1309.0
1310.0
1311.0
1312.0
1313.0
1314.0
1324.0
1325.0
1325.2
1325.4
1325.9
1325.91
1326.0
1327.0
1328.0
1329.0
1330.0
1331.0
1332.0
1333.0
1334.0
1335.0
1336.0
1337.0
1338.0
1339.0
1340.0
1341.0
1343.0
1344.0
1345.0
1347.0
1348.0
1349.0
1350.0
1351.0
1352.0
1353.0
1354.0
1355.0
1356.0
1360.0
1361.0
1362.0
1363.0
1364.0
1365.0
1366.0
1370.0
25200.0
25202.0
25202.1
25202.15
25202.2
25203.0
25204.0
25205.0
25206.0
25207.0
25208.0
25209.0
25210.0
25211.0
25212.0
25213.0
25214.0
25215.0
139901.0
139902.0
250200.0
250202.0
250202.1
250202.2
250202.3
250202.4
250202.5
250203.0
250204.0
250205.0
250206.0
250207.0
13992501.0
13992502.0

@aloeliger (Contributor) commented Nov 19, 2020

@silviodonato I'm sorry, I'm just seeing this now. I will admit I'm confused: this passed both the build tests, and I tested it myself in the online configurations and with runTheMatrix.py on workflow 136.731 (one of the many it is now segfaulting on).

I had tested it with:

runTheMatrix.py -l 136.731

But it seems like your command has:

runTheMatrix.py -l 136.731 --command "-n 1" -t8

I assume the difference between these two is the invocation of multiple threads/streams? What do I need to do to recreate this segfault, and once I have it fixed, how should I reopen the PR to get the fix in?

Thanks,
Andrew Loeliger

@silviodonato (Contributor, Author)

Hi @aloeliger, yes, you can recreate the issue by running with multithreading (e.g. using -t4). Once you get it fixed, you have to open a new PR. You can cherry-pick this commit https://github.com/cms-sw/cmssw/pull/32182/commits as a baseline for your new PR.
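
For example, the same reproduction command quoted earlier in this thread, just run with four threads:

runTheMatrix.py -l 136.721 --command "-n 1" -t4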

@aloeliger (Contributor)

Okay, I've isolated the problem. It seems to have been a stack overflow caused by my poorly written binary search for the mismatch location in the final list at the end of a stream. I'm working on a fix now, but this may not be the only issue present.
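
For illustration, here is a minimal sketch of the safer pattern, not the actual CMSSW implementation. It assumes a simplified stand-in record type (the real code uses tuples of edm::RunID, edm::LuminosityBlockID, edm::EventID and an int, per the stack trace above, and its merge policy is not reproduced here). std::lower_bound performs the binary search iteratively, so the stack depth stays constant no matter how large the per-stream mismatch lists grow, whereas a hand-rolled recursive search with a faulty base case can exhaust the stack:

#include <algorithm>
#include <tuple>
#include <vector>

// Hypothetical stand-in for the per-event mismatch record
// (run, lumi, event, mismatch count).
using MismatchRecord = std::tuple<unsigned, unsigned, unsigned long long, int>;

// Merge one stream's sorted mismatch list into the run-summary list.
// std::lower_bound does the binary search iteratively (constant stack
// depth), unlike an unbounded recursive search.
void mergeMismatchVectors(std::vector<MismatchRecord>& summary,
                          const std::vector<MismatchRecord>& streamList) {
  for (const auto& rec : streamList) {
    auto pos = std::lower_bound(summary.begin(), summary.end(), rec);
    if (pos == summary.end() || *pos != rec)
      summary.insert(pos, rec);  // keep the summary sorted, skip duplicates
  }
}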

@aloeliger (Contributor) commented Nov 20, 2020

Hello all, I have opened a PR that I believe addresses and fixes these issues; it runs runTheMatrix.py -l 136.721 --command "-n 1" -t8 (and 136.731) successfully on multiple threads: #32205. Apologies for the bug.

Out of curiosity, and for the future, are there multi-threaded validation options for the cmsbuild bot that I could have used to avoid committing this error?

@makortel (Contributor)

> Out of curiosity, and for the future, are there multi-threaded validation options for the cmsbuild bot that I could have used to avoid committing this error?

Unfortunately no, the PR tests are single-threaded on purpose (to guarantee fully reproducible simulation results).

@qliphy (Contributor) commented Nov 24, 2020

Done by #32205
