Segmentation violation in CSCMonitorModule during Express job at T0 #45797

Open
germanfgv opened this issue Aug 25, 2024 · 11 comments

@germanfgv
Contributor

A segmentation violation is affecting a single job of workflow Express_Run384963_StreamExpress. The crash occurs in module CSCMonitorModule:dqmCSCClient. Here are the stack trace and error message:

Thread 8 (Thread 0x1484addfd700 (LWP 1097) "cmsRun"):
#0  0x00001485053baac1 in poll () from /lib64/libc.so.6
#1  0x0000148501b3443f in full_read.constprop () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_14/lib/el8_amd64_gcc12/pluginFWCoreServicesPlugins.so
#2  0x0000148501ae94bc in edm::service::InitRootHandlers::stacktraceFromThread() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_14/lib/el8_amd64_gcc12/pluginFWCoreServicesPlugins.so
#3  0x0000148501ae9640 in sig_dostack_then_abort () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_14/lib/el8_amd64_gcc12/pluginFWCoreServicesPlugins.so
#4  <signal handler called>
#5  0x00001484947afc62 in cscdqm::EventProcessor::processCSC(CSCEventData const&, int, CSCDCCExaminer const&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_14/lib/el8_amd64_gcc12/pluginDQMCSCMonitorModulePlugins.so
#6  0x00001484947c1f14 in cscdqm::EventProcessor::processDDU(CSCDDUEventData const&, CSCDCCExaminer const&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_14/lib/el8_amd64_gcc12/pluginDQMCSCMonitorModulePlugins.so
#7  0x00001484947c690a in cscdqm::EventProcessor::processEvent(edm::Event const&, edm::InputTag const&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_14/lib/el8_amd64_gcc12/pluginDQMCSCMonitorModulePlugins.so
#8  0x00001484947ad629 in cscdqm::Dispatcher::processEvent(edm::Event const&, edm::InputTag const&, cscdqm::HWStandbyType&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_14/lib/el8_amd64_gcc12/pluginDQMCSCMonitorModulePlugins.so
#9  0x00001484947defbb in CSCMonitorModule::analyze(edm::Event const&, edm::EventSetup const&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_14/lib/el8_amd64_gcc12/pluginDQMCSCMonitorModulePlugins.so
#10 0x00001484947de3f0 in non-virtual thunk to DQMOneEDAnalyzer<>::accumulate(edm::Event const&, edm::EventSetup const&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_14/lib/el8_amd64_gcc12/pluginDQMCSCMonitorModulePlugins.so
#11 0x000014850803daae in edm::one::EDProducerBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_14/lib/el8_amd64_gcc12/libFWCoreFramework.so
#12 0x00001485080281fe in edm::WorkerT<edm::one::EDProducerBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_14/lib/el8_amd64_gcc12/libFWCoreFramework.so
#13 0x0000148507fba639 in std::__exception_ptr::exception_ptr edm::Worker::runModuleAfterAsyncPrefetch<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(std::__exception_ptr::exception_ptr, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_14/lib/el8_amd64_gcc12/libFWCoreFramework.so
#14 0x0000148507fbb70f in edm::SerialTaskQueue::QueuedTask<edm::SerialTaskQueueChain::push<edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >::execute()::{lambda()#1}&>(tbb::detail::d1::task_group&, edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >::execute()::{lambda()#1}&)::{lambda()#1}>::execute() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_14/lib/el8_amd64_gcc12/libFWCoreFramework.so
#15 0x00001485082c5650 in tbb::detail::d1::function_task<edm::SerialTaskQueue::spawn(edm::SerialTaskQueue::TaskBase&)::{lambda()#1}>::execute(tbb::detail::d1::execution_data&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_14/lib/el8_amd64_gcc12/libFWCoreConcurrency.so
#16 0x000014850738b95b in tbb::detail::r1::task_dispatcher::local_wait_for_all<false, tbb::detail::r1::outermost_worker_waiter> (t=0x1484205f9b00, waiter=..., this=0x148503dc9180) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/task_dispatcher.h:322
#17 tbb::detail::r1::task_dispatcher::local_wait_for_all<tbb::detail::r1::outermost_worker_waiter> (t=0x0, waiter=..., this=0x148503dc9180) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/task_dispatcher.h:458
#18 tbb::detail::r1::arena::process (tls=..., this=<optimized out>) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/arena.cpp:137
#19 tbb::detail::r1::market::process (this=<optimized out>, j=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/market.cpp:599
#20 0x000014850738db0e in tbb::detail::r1::rml::private_worker::run (this=0x148501e9df00) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/private_server.cpp:271
#21 tbb::detail::r1::rml::private_worker::thread_routine (arg=0x148501e9df00) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/private_server.cpp:221
#22 0x00001485056661ca in start_thread () from /lib64/libpthread.so.0
#23 0x00001485052c18d3 in clone () from /lib64/libc.so.6

...
Current Modules:

Module: CSCMonitorModule:dqmCSCClient (crashed)
Module: MkFitProducer:pixelLessStepTrackCandidatesMkFit
Module: TrackingRecoMaterialAnalyser:materialDumperAnalyzer
Module: TrackProducer:pixelLessStepTracks
Module: TrackProducer:pixelPairStepTracks
Module: TrackTfClassifier:lowPtQuadStep
Module: MultiHitFromChi2EDProducer:pixelLessStepHitTriplets
Module: MkFitProducer:initialStepTrackCandidatesMkFitPreSplitting

A fatal system signal has occurred: segmentation violation
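
For anyone triaging similar logs: in a trace like the one above, the frame of interest is the first one after `<signal handler called>` (here frame #5, `cscdqm::EventProcessor::processCSC`). A minimal sketch of extracting it in Python, assuming the gdb-style frame format shown above. The helper name `crashing_frame` is hypothetical and not part of CMSSW, and the sample uses shortened library paths for readability:

```python
import re

# Matches gdb-style frame lines such as
#   "#5  0x0000... in ns::func(args) () from /path/lib.so"
# The address part is optional because inlined frames omit it.
FRAME_RE = re.compile(r"^#(\d+)\s+(?:0x[0-9a-fA-F]+ in )?(.*)$")


def crashing_frame(trace: str):
    """Return (frame_number, description) of the first frame after
    '<signal handler called>', or None if no such frame exists."""
    seen_handler = False
    for line in trace.splitlines():
        match = FRAME_RE.match(line.strip())
        if match is None:
            continue  # not a frame line (thread header, blank line, ...)
        number, rest = int(match.group(1)), match.group(2)
        if seen_handler:
            return number, rest
        if rest.startswith("<signal handler called>"):
            seen_handler = True
    return None


# Abbreviated excerpt of the trace above; "/path/" stands in for the
# full cvmfs library paths (an abbreviation made for this example).
sample = """\
#3  0x0000148501ae9640 in sig_dostack_then_abort () from /path/pluginFWCoreServicesPlugins.so
#4  <signal handler called>
#5  0x00001484947afc62 in cscdqm::EventProcessor::processCSC(CSCEventData const&, int, CSCDCCExaminer const&) () from /path/pluginDQMCSCMonitorModulePlugins.so
#6  0x00001484947c1f14 in cscdqm::EventProcessor::processDDU(CSCDDUEventData const&, CSCDCCExaminer const&) () from /path/pluginDQMCSCMonitorModulePlugins.so
"""

print(crashing_frame(sample))
```

This only locates the frame; it says nothing about why `processCSC` faulted, which still needs the tarball and a debugger.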

The job ran 4 times, on both AMD and Intel machines, always failing the same way. You can find the logs and a tarball to reproduce the error here:

/eos/user/c/cmst0/public/PausedJobs/Run2024G/dqmCSCClient/job_2218360

Can experts take a look?

Best regards

@cmsbuild
Contributor

cmsbuild commented Aug 25, 2024

cms-bot internal usage

@cmsbuild
Contributor

A new Issue was created by @germanfgv.

@Dr15Jones, @antoniovilela, @makortel, @mandrenguyen, @rappoccio, @sextonkennedy, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@mandrenguyen
Contributor

assign dqm

@cmsbuild
Contributor

New categories assigned: dqm

@rvenditti,@syuvivida,@tjavaid,@nothingface0,@antoniovagnerini you have been requested to review this Pull request/Issue and eventually sign? Thanks

@makortel
Contributor

FYI @cms-sw/csc-dpg-l2

@germanfgv
Contributor Author

We have 4 more instances of this issue, now in PromptReco jobs. All of them affect the same run, 384963. You can find the tarball of one of these PromptReco jobs here:

/eos/user/c/cmst0/public/PausedJobs/Run2024G/dqmCSCClient/job_2652227/

@mandrenguyen
Contributor

@cms-sw/csc-dpg-l2 @cms-sw/dqm-l2 Can someone please have a look?

@ptcox
Contributor

ptcox commented Aug 28, 2024 via email

@mmusich
Contributor

mmusich commented Sep 9, 2024

@ptcox

There’s very likely a corrupt event. I’ve forwarded the mail to Victor Barashko, who’s both the CSC unpacker and DQM expert. I’m on vacation so won’t be looking at it.

Do you happen to have any news about this?

@ptcox
Contributor

ptcox commented Sep 9, 2024 via email

@mmusich
Contributor

mmusich commented Sep 9, 2024

I'll remind him.

Thank you, Tim.
