Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Segmentation violation in all DQM online clients during HI run #40110

Closed
rvenditti opened this issue Nov 18, 2022 · 35 comments · Fixed by #40136
Closed

Segmentation violation in all DQM online clients during HI run #40110

rvenditti opened this issue Nov 18, 2022 · 35 comments · Fixed by #40136

Comments

@rvenditti
Copy link
Contributor

We are observing DQM Online clients crashing with following segmentation violation (this is for the csc client: https://cmsweb.cern.ch/dqm/dqm-square/tmp/tmp/content_parser_productionPARSER_run362293PARSER_job28.log but all other clients show the same behavior):

Fri Nov 18 16:45:38 CET 2022
Thread 3 (Thread 0x7fbb53f1f700 (LWP 43397) "cmsRun"):
#0 0x00007fbb76b96b43 in select () from /lib64/libc.so.6
#1 0x00007fbb6f4f9a92 in lat::IOSelectSelect::wait(long) () from /opt/offline/slc7_amd64_gcc10/cms/cmssw-patch/CMSSW_12_5_2_patch1/external/slc7_amd64_gcc10/lib/libclasslib.so
#2 0x00007fbb6f4fa55e in lat::IOSelector::wait(long) () from /opt/offline/slc7_amd64_gcc10/cms/cmssw-patch/CMSSW_12_5_2_patch1/external/slc7_amd64_gcc10/lib/libclasslib.so
#3 0x00007fbb6f4fa318 in lat::IOSelector::dispatch(long) () from /opt/offline/slc7_amd64_gcc10/cms/cmssw-patch/CMSSW_12_5_2_patch1/external/slc7_amd64_gcc10/lib/libclasslib.so
#4 0x00007fbb71379626 in DQMNet::run() () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_5_2/lib/slc7_amd64_gcc10/libDQMServicesCore.so
#5 0x00007fbb7137a0bc in communicate(void*) () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_5_2/lib/slc7_amd64_gcc10/libDQMServicesCore.so
#6 0x00007fbb76e76ea5 in start_thread () from /lib64/libpthread.so.0
#7 0x00007fbb76b9fb0d in clone () from /lib64/libc.so.6
Thread 2 (Thread 0x7fbb549f5700 (LWP 43390) "cmsRun"):
#0 0x00007fbb76e7e1d9 in waitpid () from /lib64/libpthread.so.0
#1 0x00007fbb6fcd3567 in edm::service::cmssw_stacktrace_fork() () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_5_2/lib/slc7_amd64_gcc10/pluginFWCoreServicesPlugins.so
#2 0x00007fbb6fcd40da in edm::service::InitRootHandlers::stacktraceHelperThread() () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_5_2/lib/slc7_amd64_gcc10/pluginFWCoreServicesPlugins.so
#3 0x00007fbb77476f90 in std::execute_native_thread_routine (__p=0x7fbb704d0030) at ../../../../../libstdc++-v3/src/c++11/thread.cc:80
#4 0x00007fbb76e76ea5 in start_thread () from /lib64/libpthread.so.0
#5 0x00007fbb76b9fb0d in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x7fbb74cf1740 (LWP 43345) "cmsRun"):
#0 0x00007fbb76b94ddd in poll () from /lib64/libc.so.6
#1 0x00007fbb6fcd381f in full_read.constprop () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_5_2/lib/slc7_amd64_gcc10/pluginFWCoreServicesPlugins.so
#2 0x00007fbb6fcd41ac in edm::service::InitRootHandlers::stacktraceFromThread() () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_5_2/lib/slc7_amd64_gcc10/pluginFWCoreServicesPlugins.so
#3 0x00007fbb6fcd6afb in sig_dostack_then_abort () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_5_2/lib/slc7_amd64_gcc10/pluginFWCoreServicesPlugins.so
#4
#5 0x0000000000000000 in ?? ()
#6 0x00007fbb794a8347 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::M_release() () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_5_2/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#7 0x00007fbb795863d6 in edm::DataManagingProductResolver::resetProductData
(bool) () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_5_2/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#8 0x00007fbb795720a8 in edm::Principal::clearPrincipal() () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_5_2/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#9 0x00007fbb794bcdcd in edm::EventPrincipal::clearEventPrincipal() () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_5_2/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#10 0x00007fbb794c425e in edm::FunctorWaitingTask<edm::waiting_task::detail::WaitingTaskChain<edm::waiting_task::detail::AutoExceptionHandler<edm::EventProcessor::processEventAsyncImpl(edm::WaitingTaskHolder, unsigned int)::{lambda(auto:1)#4}>, edm::waiting_task::detail::Conditional<edm::waiting_task::detail::AutoExceptionHandler<edm::EventProcessor::processEventAsyncImpl(edm::WaitingTaskHolder, unsigned int)::{lambda(auto:1)#3}> >, edm::waiting_task::detail::Conditional<edm::waiting_task::detail::AutoExceptionHandler<edm::EventProcessor::processEventAsyncImpl(edm::WaitingTaskHolder, unsigned int)::{lambda(auto:1)#2}> >, edm::waiting_task::detail::AutoExceptionHandler<edm::EventProcessor::processEventAsyncImpl(edm::WaitingTaskHolder, unsigned int)::{lambda(auto:1)#1}> >::runLast(edm::WaitingTaskHolder)::{lambda(std::__exception_ptr::exception_ptr const*)#1}>::execute() () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_5_2/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#11 0x00007fbb794a293f in tbb::detail::d1::function_taskedm::WaitingTaskHolder::doneWaiting(std::__exception_ptr::exception_ptr)::{lambda()#1}::execute(tbb::detail::d1::execution_data&) () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_5_2/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#12 0x00007fbb77c78bec in tbb::detail::r1::task_dispatcher::local_wait_for_all<false, tbb::detail::r1::external_waiter> (waiter=..., t=0x7fbb74362d00, this=0x7fbb7438f700) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_5_0_pre4-slc7_amd64_gcc10/build/CMSSW_12_5_0_pre4-build/BUILD/slc7_amd64_gcc10/external/tbb/v2021.5.0-3cd580209e999b2fb4f8344347204353/tbb-v2021.5.0/src/tbb/task_dispatcher.h:322
#13 tbb::detail::r1::task_dispatcher::local_wait_for_alltbb::detail::r1::external_waiter (waiter=..., t=, this=0x7fbb7438f700) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_5_0_pre4-slc7_amd64_gcc10/build/CMSSW_12_5_0_pre4-build/BUILD/slc7_amd64_gcc10/external/tbb/v2021.5.0-3cd580209e999b2fb4f8344347204353/tbb-v2021.5.0/src/tbb/task_dispatcher.h:463
#14 tbb::detail::r1::task_dispatcher::execute_and_wait (t=, wait_ctx=..., w_ctx=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_5_0_pre4-slc7_amd64_gcc10/build/CMSSW_12_5_0_pre4-build/BUILD/slc7_amd64_gcc10/external/tbb/v2021.5.0-3cd580209e999b2fb4f8344347204353/tbb-v2021.5.0/src/tbb/task_dispatcher.cpp:168
#15 0x00007fbb794e1ffd in edm::FinalWaitingTask::wait() () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_5_2/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#16 0x00007fbb794cd98c in edm::EventProcessor::processLumis(std::shared_ptr const&) () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_5_2/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#17 0x00007fbb794d8d53 in edm::EventProcessor::runToCompletion() () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_5_2/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#18 0x000000000040a266 in tbb::detail::d1::task_arena_function<main::{lambda()#1}::operator()() const::{lambda()#1}, void>::operator()() const ()
#19 0x00007fbb77c670eb in tbb::detail::r1::task_arena_impl::execute (ta=..., d=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_5_0_pre4-slc7_amd64_gcc10/build/CMSSW_12_5_0_pre4-build/BUILD/slc7_amd64_gcc10/external/tbb/v2021.5.0-3cd580209e999b2fb4f8344347204353/tbb-v2021.5.0/src/tbb/arena.cpp:698
#20 0x000000000040b094 in main::{lambda()#1}::operator()() const ()
#21 0x000000000040971c in main ()

Current Modules:

Module: none (crashed)

A fatal system signal has occurred: segmentation violation

-- process exit: -11 --

The crash seems to be due to DQMNet, but we don't have any clue to debug it.
Could cmssw experts have a look and give us suggestions if possible?

@cmsbuild
Copy link
Contributor

A new Issue was created by @rvenditti .

@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@pmandrik
Copy link
Contributor

more info provided at #40111

@francescobrivio
Copy link
Contributor

assign core

  • I see in the stack trace after sig_dostack_then_abort:
https://github.com/cms-sw/cmssw/pull/6 0x00007fbb794a8347 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::M_release() () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_5_2/lib/slc7_amd64_gcc10/libFWCoreFramework.so
https://github.com/cms-sw/cmssw/pull/7 0x00007fbb795863d6 in edm::DataManagingProductResolver::resetProductData(bool) () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_5_2/lib/slc7_amd64_gcc10/libFWCoreFramework.so
https://github.com/cms-sw/cmssw/pull/8 0x00007fbb795720a8 in edm::Principal::clearPrincipal() () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_5_2/lib/slc7_amd64_gcc10/libFWCoreFramework.so
https://github.com/cms-sw/cmssw/pull/9 0x00007fbb794bcdcd in edm::EventPrincipal::clearEventPrincipal() () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_5_2/lib/slc7_amd64_gcc10/libFWCoreFramework.so
....
  • I'm nt sure if the assignment is correct, if not sorry!

@cmsbuild
Copy link
Contributor

New categories assigned: core

@Dr15Jones,@smuzaffar,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks

@Dr15Jones
Copy link
Contributor

You could try to run the job under valgrind as it is good at finding memory related problems, .e.g.

valgrind cmsRun yourConfig_cfg.py

Be forewarned, it will be very slow (like 60x slower than running cmsRun normally).

@pmandrik
Copy link
Contributor

valgrind latest output:

++++ finished: processing event : stream = 0 run = 362290 lumi = 5 event = 306555 time = 7167369916654472192
==7975== Invalid read of size 8
==7975==    at 0x4C6A029: std::_Sp_counted_deleter<edm::WrapperBase*, std::default_delete<edm::WrapperBase>, std::allocator<void>, (__gnu_cxx::_Lock_policy)2>::_M_dispose() (in /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_5_2/lib/slc7_amd64_gcc10/libFWCoreFramework.so)
==7975==    by 0x4B89346: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() (in /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_5_2/lib/slc7_amd64_gcc10/libFWCoreFramework.so)
==7975==    by 0x4C673D5: edm::DataManagingProductResolver::resetProductData_(bool) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_5_2/lib/slc7_amd64_gcc10/libFWCoreFramework.so)
==7975==    by 0x4C530A7: edm::Principal::clearPrincipal() (in /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_5_2/lib/slc7_amd64_gcc10/libFWCoreFramework.so)
==7975==    by 0x4B9DDCC: edm::EventPrincipal::clearEventPrincipal() (in /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_5_2/lib/slc7_amd64_gcc10/libFWCoreFramework.so)
==7975==    by 0x4BA525D: edm::FunctorWaitingTask<edm::waiting_task::detail::WaitingTaskChain<edm::waiting_task::detail::AutoExceptionHandler<edm::EventProcessor::processEventAsyncImpl(edm::WaitingTaskHolder, unsigned int)::{lambda(auto:1)#4}>, edm::waiting_task::detail::Conditional<edm::waiting_task::detail::AutoExceptionHandler<edm::EventProcessor::processEventAsyncImpl(edm::WaitingTaskHolder, unsigned int)::{lambda(auto:1)#3}> >, edm::waiting_task::detail::AutoExceptionHandler<edm::EventProcessor::processEventAsyncImpl(edm::WaitingTaskHolder, unsigned int)::{lambda(auto:1)#4}><edm::waiting_task::detail::AutoExceptionHandler<edm::EventProcessor::processEventAsyncImpl(edm::WaitingTaskHolder, unsigned int)::{lambda(auto:1)#2}> >, edm::waiting_task::detail::AutoExceptionHandler<edm::EventProcessor::processEventAsyncImpl(edm::WaitingTaskHolder, unsigned int)::{lambda(auto:1)#1}> >::runLast(edm::WaitingTaskHolder)::{lambda(std::__exception_ptr::exception_ptr const*)#1}>::execute() (in /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_5_2/lib/slc7_amd64_gcc10/libFWCoreFramework.so)
==7975==    by 0x4B8393E: tbb::detail::d1::function_task<edm::WaitingTaskHolder::doneWaiting(std::__exception_ptr::exception_ptr)::{lambda()#1}>::execute(tbb::detail::d1::execution_data&) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_5_2/lib/slc7_amd64_gcc10/libFWCoreFramework.so)
==7975==    by 0x64BBBEB: tbb::detail::r1::task_dispatcher::execute_and_wait(tbb::detail::d1::task*, tbb::detail::d1::wait_context&, tbb::detail::d1::task_group_context&) (task_dispatcher.h:322)
==7975==    by 0x4BC2FFC: edm::FinalWaitingTask::wait() (in /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_5_2/lib/slc7_amd64_gcc10/libFWCoreFramework.so)
==7975==    by 0x4BAE98B: edm::EventProcessor::processLumis(std::shared_ptr<void> const&) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_5_2/lib/slc7_amd64_gcc10/libFWCoreFramework.so)
==7975==    by 0x4BB9D52: edm::EventProcessor::runToCompletion() (in /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_5_2/lib/slc7_amd64_gcc10/libFWCoreFramework.so)
==7975==    by 0x40A265: tbb::detail::d1::task_arena_function<main::{lambda()#1}::operator()() const::{lambda()#1}, void>::operator()() const (in /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_5_2/bin/slc7_amd64_gcc10/cmsRun)
==7975==  Address 0x5ff81ad0 is 0 bytes inside a block of size 40 free'd
==7975==    at 0x4031721: operator delete(void*, unsigned long) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc10/external/valgrind/3.17.0-12ba960b086b024cce391ad228855798/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==7975==    by 0x4B89346: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() (in /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_5_2/lib/slc7_amd64_gcc10/libFWCoreFramework.so)
==7975==    by 0x4C673D5: edm::DataManagingProductResolver::resetProductData_(bool) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_5_2/lib/slc7_amd64_gcc10/libFWCoreFramework.so)
==7975==    by 0x4C530A7: edm::Principal::clearPrincipal() (in /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_5_2/lib/slc7_amd64_gcc10/libFWCoreFramework.so)
==7975==    by 0x4B9DDCC: edm::EventPrincipal::clearEventPrincipal() (in /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_5_2/lib/slc7_amd64_gcc10/libFWCoreFramework.so)
==7975==    by 0x4BA525D: edm::FunctorWaitingTask<edm::waiting_task::detail::WaitingTaskChain<edm::waiting_task::detail::AutoExceptionHandler<edm::EventProcessor::processEventAsyncImpl(edm::WaitingTaskHolder, unsigned int)::{lambda(auto:1)#4}>, edm::waiting_task::detail::Conditional<edm::waiting_task::detail::AutoExceptionHandler<edm::EventProcessor::processEventAsyncImpl(edm::WaitingTaskHolder, unsigned int)::{lambda(auto:1)#3}> >, edm::waiting_task::detail::AutoExceptionHandler<edm::EventProcessor::processEventAsyncImpl(edm::WaitingTaskHolder, unsigned int)::{lambda(auto:1)#4}><edm::waiting_task::detail::AutoExceptionHandler<edm::EventProcessor::processEventAsyncImpl(edm::WaitingTaskHolder, unsigned int)::{lambda(auto:1)#2}> >, edm::waiting_task::detail::AutoExceptionHandler<edm::EventProcessor::processEventAsyncImpl(edm::WaitingTaskHolder, unsigned int)::{lambda(auto:1)#1}> >::runLast(edm::WaitingTaskHolder)::{lambda(std::__exception_ptr::exception_ptr const*)#1}>::execute() (in /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_5_2/lib/slc7_amd64_gcc10/libFWCoreFramework.so)
==7975==    by 0x4B8393E: tbb::detail::d1::function_task<edm::WaitingTaskHolder::doneWaiting(std::__exception_ptr::exception_ptr)::{lambda()#1}>::execute(tbb::detail::d1::execution_data&) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_5_2/lib/slc7_amd64_gcc10/libFWCoreFramework.so)
==7975==    by 0x64BBBEB: tbb::detail::r1::task_dispatcher::execute_and_wait(tbb::detail::d1::task*, tbb::detail::d1::wait_context&, tbb::detail::d1::task_group_context&) (task_dispatcher.h:322)
==7975==    by 0x4BC2FFC: edm::FinalWaitingTask::wait() (in /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_5_2/lib/slc7_amd64_gcc10/libFWCoreFramework.so)
==7975==    by 0x4BAE98B: edm::EventProcessor::processLumis(std::shared_ptr<void> const&) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_5_2/lib/slc7_amd64_gcc10/libFWCoreFramework.so)
==7975==    by 0x4BB9D52: edm::EventProcessor::runToCompletion() (in /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_5_2/lib/slc7_amd64_gcc10/libFWCoreFramework.so)
==7975==    by 0x40A265: tbb::detail::d1::task_arena_function<main::{lambda()#1}::operator()() const::{lambda()#1}, void>::operator()() const (in /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_5_2/bin/slc7_amd64_gcc10/cmsRun)
==7975==  Block was alloc'd at
==7975==    at 0x402EF11: operator new(unsigned long) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc10/external/valgrind/3.17.0-12ba960b086b024cce391ad228855798/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==7975==    by 0x32C53E58: ROOT::new_edmcLcLWrapperlEedmcLcLSortedCollectionlEHBHERecHitcOedmcLcLStrictWeakOrderinglEHBHERecHitgRsPgRsPgR(void*) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_5_2/lib/slc7_amd64_gcc10/libDataFormatsHcalRecHit.so)
==7975==    by 0x5BCA3AA: TClass::NewObject(TClass::ENewType, bool) const (in /cvmfs/cms.cern.ch/slc7_amd64_gcc10/lcg/root/6.24.07-a31cbfc28a0c92b3c007615905b5b9b2/lib/libCore.so)
==7975==    by 0x5BCA7EF: TClass::New(TClass::ENewType, bool) const (in /cvmfs/cms.cern.ch/slc7_amd64_gcc10/lcg/root/6.24.07-a31cbfc28a0c92b3c007615905b5b9b2/lib/libCore.so)
==7975==    by 0x55DBE07: TBufferFile::ReadObjectAny(TClass const*) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc10/lcg/root/6.24.07-a31cbfc28a0c92b3c007615905b5b9b2/lib/libRIO.so)
==7975==    by 0x55D43C3: TBufferFile::ReadFastArray(void**, TClass const*, int, bool, TMemberStreamer*, TClass const*) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc10/lcg/root/6.24.07-a31cbfc28a0c92b3c007615905b5b9b2/lib/libRIO.so)
==7975==    by 0x5825DE3: int TStreamerInfo::ReadBuffer<char**>(TBuffer&, char** const&, TStreamerInfo::TCompInfo* const*, int, int, int, int, int) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc10/lcg/root/6.24.07-a31cbfc28a0c92b3c007615905b5b9b2/lib/libRIO.so)
==7975==    by 0x56DA3BB: TStreamerInfoActions::VectorLooper::GenericRead(TBuffer&, void*, void const*, TStreamerInfoActions::TLoopConfiguration const*, TStreamerInfoActions::TConfiguration const*) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc10/lcg/root/6.24.07-a31cbfc28a0c92b3c007615905b5b9b2/lib/libRIO.so)
==7975==    by 0x55D2BB3: TBufferFile::ApplySequence(TStreamerInfoActions::TActionSequence const&, void*, void*) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc10/lcg/root/6.24.07-a31cbfc28a0c92b3c007615905b5b9b2/lib/libRIO.so)
==7975==    by 0x5726311: int TStreamerInfoActions::ReadSTL<&TStreamerInfoActions::ReadSTLMemberWiseSameClass, &TStreamerInfoActions::ReadSTLObjectWiseFastArray>(TBuffer&, void*, TStreamerInfoActions::TConfiguration const*) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc10/lcg/root/6.24.07-a31cbfc28a0c92b3c007615905b5b9b2/lib/libRIO.so)
==7975==    by 0x55DB2E4: TBufferFile::ReadClassBuffer(TClass const*, void*, TClass const*) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc10/lcg/root/6.24.07-a31cbfc28a0c92b3c007615905b5b9b2/lib/libRIO.so)
==7975==    by 0x55DBE4F: TBufferFile::ReadObjectAny(TClass const*) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc10/lcg/root/6.24.07-a31cbfc28a0c92b3c007615905b5b9b2/lib/libRIO.so)
==7975== 
==7975== Jump to the invalid address stated on the next line
==7975==    at 0x0: ???
==7975==    by 0x4B89346: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() (in /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_5_2/lib/slc7_amd64_gcc10/libFWCoreFramework.so)
==7975==    by 0x4C673D5: edm::DataManagingProductResolver::resetProductData_(bool) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_5_2/lib/slc7_amd64_gcc10/libFWCoreFramework.so)
==7975==    by 0x4C530A7: edm::Principal::clearPrincipal() (in /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_5_2/lib/slc7_amd64_gcc10/libFWCoreFramework.so)
==7975==    by 0x4B9DDCC: edm::EventPrincipal::clearEventPrincipal() (in /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_5_2/lib/slc7_amd64_gcc10/libFWCoreFramework.so)
==7975==    by 0x4BA525D: edm::FunctorWaitingTask<edm::waiting_task::detail::WaitingTaskChain<edm::waiting_task::detail::AutoExceptionHandler<edm::EventProcessor::processEventAsyncImpl(edm::WaitingTaskHolder, unsigned int)::{lambda(auto:1)#4}>, edm::waiting_task::detail::Conditional<edm::waiting_task::detail::AutoExceptionHandler<edm::EventProcessor::processEventAsyncImpl(edm::WaitingTaskHolder, unsigned int)::{lambda(auto:1)#3}> >, edm::waiting_task::detail::AutoExceptionHandler<edm::EventProcessor::processEventAsyncImpl(edm::WaitingTaskHolder, unsigned int)::{lambda(auto:1)#4}><edm::waiting_task::detail::AutoExceptionHandler<edm::EventProcessor::processEventAsyncImpl(edm::WaitingTaskHolder, unsigned int)::{lambda(auto:1)#2}> >, edm::waiting_task::detail::AutoExceptionHandler<edm::EventProcessor::processEventAsyncImpl(edm::WaitingTaskHolder, unsigned int)::{lambda(auto:1)#1}> >::runLast(edm::WaitingTaskHolder)::{lambda(std::__exception_ptr::exception_ptr const*)#1}>::execute() (in /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_5_2/lib/slc7_amd64_gcc10/libFWCoreFramework.so)
==7975==    by 0x4B8393E: tbb::detail::d1::function_task<edm::WaitingTaskHolder::doneWaiting(std::__exception_ptr::exception_ptr)::{lambda()#1}>::execute(tbb::detail::d1::execution_data&) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_5_2/lib/slc7_amd64_gcc10/libFWCoreFramework.so)
==7975==    by 0x64BBBEB: tbb::detail::r1::task_dispatcher::execute_and_wait(tbb::detail::d1::task*, tbb::detail::d1::wait_context&, tbb::detail::d1::task_group_context&) (task_dispatcher.h:322)
==7975==    by 0x4BC2FFC: edm::FinalWaitingTask::wait() (in /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_5_2/lib/slc7_amd64_gcc10/libFWCoreFramework.so)
==7975==    by 0x4BAE98B: edm::EventProcessor::processLumis(std::shared_ptr<void> const&) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_5_2/lib/slc7_amd64_gcc10/libFWCoreFramework.so)
==7975==    by 0x4BB9D52: edm::EventProcessor::runToCompletion() (in /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_5_2/lib/slc7_amd64_gcc10/libFWCoreFramework.so)
==7975==    by 0x40A265: tbb::detail::d1::task_arena_function<main::{lambda()#1}::operator()() const::{lambda()#1}, void>::operator()() const (in /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_5_2/bin/slc7_amd64_gcc10/cmsRun)
==7975==  Address 0x0 is not stack'd, malloc'd or (recently) free'd
==7975== 


A fatal system signal has occurred: segmentation violation
The following is the call stack containing the origin of the signal.

==7975== Unsupported clone() flags: 0x311
==7975== 
==7975== The only supported clone() uses are:
==7975==  - via a threads library (LinuxThreads or NPTL)
==7975==  - via the implementation of fork or vfork
==7975== 
==7975== Valgrind detected that your program requires
==7975== the following unimplemented functionality:
==7975==    Valgrind does not support general clone().
==7975== This may be because the functionality is hard to implement,
==7975== or because no reasonable program would behave this way,
==7975== or because nobody has yet needed it.  In any case, let us know at
==7975== www.valgrind.org and/or try to work around the problem, if you can.
==7975== 
==7975== Valgrind has to exit now.  Sorry.  Bye!
==7975== 

sched status:
  running_tid=2

Thread 1: status = VgTs_WaitSys syscall 7 (lwpid 7975)
==7975==    at 0x73E3DDD: ??? (in /usr/lib64/libc-2.17.so)
==7975==    by 0x1136F81E: full_read.constprop.0 (in /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_5_2/lib/slc7_amd64_gcc10/pluginFWCoreServicesPlugins.so)
==7975==    by 0x113701AB: edm::service::InitRootHandlers::stacktraceFromThread() (in /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_5_2/lib/slc7_amd64_gcc10/pluginFWCoreServicesPlugins.so)
==7975==    by 0x11372AFA: sig_dostack_then_abort (in /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_5_2/lib/slc7_amd64_gcc10/pluginFWCoreServicesPlugins.so)
==7975==    by 0x70E262F: ??? (in /usr/lib64/libpthread-2.17.so)

Thread 2: status = VgTs_Runnable syscall 56 (lwpid 8682)
==7975==    at 0x73EEAD1: clone (in /usr/lib64/libc-2.17.so)
==7975==    by 0x1136F553: edm::service::cmssw_stacktrace_fork() (in /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_5_2/lib/slc7_amd64_gcc10/pluginFWCoreServicesPlugins.so)
==7975==    by 0x113700D9: edm::service::InitRootHandlers::stacktraceHelperThread() (in /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_5_2/lib/slc7_amd64_gcc10/pluginFWCoreServicesPlugins.so)
==7975==    by 0x6CB6F8F: execute_native_thread_routine (thread.cc:80)
==7975==    by 0x70DAEA4: start_thread (in /usr/lib64/libpthread-2.17.so)
==7975==    by 0x73EEB0C: clone (in /usr/lib64/libc-2.17.so)

Thread 3: status = VgTs_WaitSys syscall 23 (lwpid 8690)
==7975==    at 0x73E5B43: ??? (in /usr/lib64/libc-2.17.so)
==7975==    by 0x11BB5A91: lat::IOSelectSelect::wait(long) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc10/external/classlib/3.1.3-7a805383ac3cf7b871d22a6c66bd8f6f/lib/libclasslib.so)
==7975==    by 0x11BB655D: lat::IOSelector::wait(long) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc10/external/classlib/3.1.3-7a805383ac3cf7b871d22a6c66bd8f6f/lib/libclasslib.so)
==7975==    by 0x11BB6317: lat::IOSelector::dispatch(long) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc10/external/classlib/3.1.3-7a805383ac3cf7b871d22a6c66bd8f6f/lib/libclasslib.so)
==7975==    by 0x109D8625: DQMNet::run() (in /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_5_2/lib/slc7_amd64_gcc10/libDQMServicesCore.so)
==7975==    by 0x109D90BB: communicate(void*) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_5_2/lib/slc7_amd64_gcc10/libDQMServicesCore.so)
==7975==    by 0x70DAEA4: start_thread (in /usr/lib64/libpthread-2.17.so)
==7975==    by 0x73EEB0C: clone (in /usr/lib64/libc-2.17.so)

@Dr15Jones
Copy link
Contributor

The valgrind output is saying that memory is being shared between to Event data products, which is not something that we allow. The question now is how that happened. To figure that out, it would help to figure out what is the lookup info for the products (i.e. module label, product instance label and process name as well as the C++ type).

What is the commands needed to reproduce?

@Dr15Jones
Copy link
Contributor

Were there any other warnings issued by valgrind?

@pmandrik
Copy link
Contributor

pmandrik commented Nov 18, 2022

Hi, @Dr15Jones , valgrind produced lot of output. Please check following logs (limited by my console buffer size):
val.log

To reproduce:

scram project CMSSW_12_5_2_patch1
cd CMSSW_12_5_2_patch1/src; cmsenv
git cms-addpkg DQM/Integration
git fetch official-cmssw refs/pull/37737/*:refs/gh-remotes/pull/39597/*; git cms-merge-topic 39597
git fetch official-cmssw refs/pull/37802/*:refs/gh-remotes/pull/40071/*; git cms-merge-topic 40071
git fetch official-cmssw refs/pull/37741/*:refs/gh-remotes/pull/40073/*; git cms-merge-topic 40073
mkdir data
cp -r /afs/cern.ch/user/p/pmandrik/public/temp/run362290 ./data/.

Put process.source.minEventsPerLumi = 99999999 to the csc_dqm_sourceclient-live_cfg.py and run:

cmsRun DQM/Integration/python/clients/csc_dqm_sourceclient-live_cfg.py runInputDir=./data runNumber=362290 runkey=hi_run scanOnce=True

@petermic
Copy link

petermic commented Nov 18, 2022

Running the following on architecture el8_amd64_gcc10:

cmsrel CMSSW_12_5_2_patch1
cd CMSSW_12_5_2_patch1; cmsenv
git cms-addpkg DQM/Integration
scram b -j24; cd src
cmsRun DQM/Integration/python/clients/sistrip_dqm_sourceclient-live_cfg.py runInputDir=/eos/cms/store/group/phys_heavyions_ops/chenyi/Temp/fff/ramdisk runNumber=362294 runkey=hi_run scanOnce=True

but with MessageLogger cout threshold = 'DEBUG', reproduces the crash and produces the following log file:

DQMStreamLog_run362294.txt

There are a few warnings emitted early in the process:

Begin processing the 1st record. Run 362294, Event 94175, LumiSection 1 on stream 0 at 18-Nov-2022 20:33:43.036 CET
%MSG-w SiStripRawToDigi:  SiStripRawToDigiModule:siStripDigisNoZS  18-Nov-2022 20:33:43 CET Run: 362294 Event: 94175
NULL pointer to FEDRawData for FED: id 434
Note: further warnings of this type will be suppressed (this can be changed by enabling debugging printout)
%MSG
%MSG-w SiStripRawToDigi:  SiStripRawToDigiModule:siStripDigis  18-Nov-2022 20:33:43 CET Run: 362294 Event: 94175
NULL pointer to FEDRawData for FED: id 434
%MSG
%MSG-w SiStripRawToDigi:  SiStripRawToDigiModule:siStripDigis  18-Nov-2022 20:33:43 CET Run: 362294 Event: 94175
[sistrip::RawToDigiUnpacker::createDigis] Problems were found in data and 171 channels could not be unpacked. See output of FED Hardware monitoring for more information.
%MSG

I've been told these warnings are unusual. The following info messages are also apparently unusual:

%MSG-i Configuration:  HLTHighLevel:hltHighLevel 18-Nov-2022 20:33:49 CET  Run: 362294 Event: 94175
requested pattern "HLT_ZeroBias_*" does not match any HLT paths
%MSG
%MSG-i Configuration:  HLTHighLevel:hltHighLevel 18-Nov-2022 20:33:49 CET  Run: 362294 Event: 94175
requested pattern "HLT_ZeroBias1_*" does not match any HLT paths
%MSG
%MSG-i Configuration:  HLTHighLevel:hltHighLevel 18-Nov-2022 20:33:49 CET  Run: 362294 Event: 94175
requested pattern "HLT_PAZeroBias_*" does not match any HLT paths
%MSG
%MSG-i Configuration:  HLTHighLevel:hltHighLevel 18-Nov-2022 20:33:49 CET  Run: 362294 Event: 94175
requested pattern "HLT_PAZeroBias1_*" does not match any HLT paths
%MSG
%MSG-i Configuration:  HLTHighLevel:hltHighLevel 18-Nov-2022 20:33:49 CET  Run: 362294 Event: 94175
requested pattern "HLT_PAL1MinimumBiasHF_OR_SinglePixelTrack_*" does not match any HLT paths
%MSG

@Dr15Jones
Copy link
Contributor

The data product having problems is

type: edm::SortedCollection<HBHERecHit,edm::StrictWeakOrdering<HBHERecHit> >
module : hltHbherecoFromGPU
productInstance :""
process: "HLT"

given this comes from the DQMStreamerReader and that is not code maintained by the Core, it would be best to inform DQM.

@Dr15Jones
Copy link
Contributor

assign dqm

@cmsbuild
Copy link
Contributor

New categories assigned: dqm

@jfernan2,@ahmad3213,@micsucmed,@rvenditti,@emanueleusai,@syuvivida,@pmandrik you have been requested to review this Pull request/Issue and eventually sign? Thanks

@Dr15Jones
Copy link
Contributor

Given this is a non ROOT file, is it possible that there was a schema change on the data stored in the file relative to what code is checked out? If so, then it is highly probably that is the cause as only the ROOT I/O based systems can handle any sort of schema change.

@syuvivida
Copy link
Contributor

We are asking the HLT expert to remove this product hltHbherecoFromGPU from the standard DQM stream.

@syuvivida
Copy link
Contributor

@Dr15Jones Now with the new run, the DQM clients are running without crashes. Thank you for your help!!

@mmusich
Copy link
Contributor

mmusich commented Nov 18, 2022

@pmandrik

I've been told these warnings are unusual.

For the record, these are not unusual. It's just the Strip unpacker reporting a FED is out of DAQ, and that happens all the time.

@Dr15Jones
Copy link
Contributor

So I put some print statements in the source. The event with the crash is the first event which contains a edm::SortedCollection<HBHERecHit,edm::StrictWeakOrdering<HBHERecHit> >. That is true for both the data product from hltHbherecoFromGPU and for the data product from hltHbhereco.

@Dr15Jones
Copy link
Contributor

In the HLT process, is either the hltHbhereco or hltHbherecoFromGPU and EDAlias of the other? If so, this could explain the problem. In PoolOutputModule one can't write the data product and its alias into the same output as the pointers are the same. It is possible the Streamer output module doesn't prohibit the same behavior which would cause ROOT to share the pointed to objects between the two data products.

@missirol
Copy link
Contributor

missirol commented Nov 18, 2022

We are asking the HLT expert to remove this product hltHbherecoFromGPU from the standard DQM stream.

..and in fact, the collection hltHbherecoFromGPU should not have been in the DQM stream (in the standard HIon menu, it is not); this addition, which was then reverted, was maybe a failed attempt to address #40115.

Back to the issue, I don't know what's causing it. It doesn't look to me like DataFormats changed. We do write the product hltHbherecoFromGPU to another stream (DQMGPUvsCPU) and that was not creating issues. I tried to repack the DQM streamer file in 12_5_2_patch1 (same release that produced the streamer file at HLT) and this alone creates the crash; if I try to repack a streamer of DQMGPUvsCPU from the same run, that works normally.

The only thing that caught my eye is that, in the failed configuration, the event content (of the DQM stream) had both *_hltHbherecoFromGPU_*_* and *_hltHbhereco_*_*, and those are really the same, because the latter is a SwitchProducerCUDA whose cuda branch is the former module (and the HLT was running the cuda branch when those streamer files were produced). Should one expect any issues when trying to save two collections this way, when they are really the same one?

cmsRun /afs/cern.ch/work/m/missirol/public/fog/cmssw40110/repack_cfg.py /afs/cern.ch/work/m/missirol/public/fog/cmssw40110/data/run362294/run362294_ls0050_streamDQM_sm-c2a11-43-01.dat tmp.root

@missirol
Copy link
Contributor

missirol commented Nov 18, 2022

Just read #40110 (comment), and would say that the answer is yes.

This config is representative of the menu that was running online (modulo the fact that online we also had the extra keep hltHbherecoFromGPU statement that led to this issue).

@fwyzard
Copy link
Contributor

fwyzard commented Nov 22, 2022

@Dr15Jones @makortel how does the PoolOutputModule behave if one tries to keep both a collection and an EDAlias that points to it ?

@smorovic it's possible we may need to improve the DAQ output modules to behave in the same way.

@fwyzard
Copy link
Contributor

fwyzard commented Nov 22, 2022

assign daq

@cmsbuild
Copy link
Contributor

New categories assigned: daq

@smorovic,@emeschi you have been requested to review this Pull request/Issue and eventually sign? Thanks

@Dr15Jones
Copy link
Contributor

how does the PoolOutputModule behave if one tries to keep both a collection and an EDAlias that points to it ?

It throws an exception at the beginning of the job.

@smorovic
Copy link
Contributor

Hi @Dr15Jones , where is the exception condition, is it related to isAlias() condition in this function?

  void PoolOutputModule::beginJob() {
    Service<ConstProductRegistry> reg;
    for (auto const& prod : reg->productList()) {
      BranchDescription const& desc = prod.second;
      if (desc.produced() && desc.branchType() == InEvent && !desc.isAlias()) {
        producedBranches_.emplace_back(desc.branchID());
      }
    }
  }

@Dr15Jones
Copy link
Contributor

The exception seems to be generated here:

auto iter = trueBranchIDToKeptBranchDesc.find(branchID);
if (iter != trueBranchIDToKeptBranchDesc.end()) {
throw edm::Exception(errors::Configuration, "Duplicate Output Selection")
<< "Two (or more) equivalent branches have been selected for output.\n"
<< "#1: " << BranchKey(desc) << "\n"
<< "#2: " << BranchKey(*iter->second) << "\n"
<< "Please drop at least one of them.\n";
}

It is possible to ask for the true BranchID from the BranchDescription and for an EDAlias it will return the id of the branch to which it points.

@makortel
Copy link
Contributor

As far as I can tell, the ProductSelector::checkForDuplicateKeptBranch() gets called from Schedule::finishSetup() via OutputModuleCore::selectProducts(). As such, I'd expect the functionality applies to all OutputModules. Would it be feasible to see the exact configuration that was used online? (the one in #40110 (comment) uses PoolOutputModule, so it approximates too much for this question)

@missirol
Copy link
Contributor

Would it be feasible to see the exact configuration that was used online?

I copied in [1] the output of [2]. This should be very close to what was used online [*].

[1] /afs/cern.ch/work/m/missirol/public/cmssw40110/hlt.py
[2] hltConfigFromDB --run 362293 > hlt.py
[*] Even though, afaik, [1] is not literally what ran online (I think there are some customisations DAQ can add, and the db->python converter could in principle be different from what is used offline); but in this case I think [1] approximates enough.

@smorovic
Copy link
Contributor

I also tried to create a version that at doesn't throw an exception very earlty due to missing parameters:
/afs/cern.ch/user/s/smorovic/public/hlt_362293.py

@makortel
Copy link
Contributor

Thanks @missirol, I was able to reproduce the issue with your configuration (even without input it ran enough). I see now the check in ProductSelector is generally insufficient for the case where an EDAlias is used as a case in SwitchProducer (although the check catches the situation in some cases).

I should be able to come up with a fix shortly. Given the situation with the data taking, is it useful to be backported to 12_4_X?

@fwyzard
Copy link
Contributor

fwyzard commented Nov 22, 2022

IMHO we can live with 12.4.x as it is, and have the fix only in master.

@missirol
Copy link
Contributor

Thanks for checking, Matti. Fwiw, I agree with Andrea.

@makortel
Copy link
Contributor

The fix is in #40136.

@makortel
Copy link
Contributor

+core

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.