Thread safety issue in HGCalCLUEAlgo #26452

Closed
Dr15Jones opened this issue Apr 14, 2019 · 10 comments

Comments

@cmsbuild
Contributor

A new Issue was created by @Dr15Jones Chris Jones.

@davidlange6, @Dr15Jones, @smuzaffar, @fabiocos, @kpedro88 can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@Dr15Jones
Contributor Author

The relevant part of the job log is

%MSG-w MemoryCheck:  GenJetMatcher:tauGenJetMatchBoosted  13-Apr-2019 18:52:02 CEST Run: 1 Event: 5
MemoryCheck: module GenJetMatcher:tauGenJetMatchBoosted VSIZE 6280.86 0 RSS 3989.25 0.0117188
%MSG


A fatal system signal has occurred: external termination request
The following is the call stack containing the origin of the signal.

Sat Apr 13 20:45:12 CEST 2019
Thread 6 (Thread 0x7f0743bfc700 (LWP 223512)):
#0  0x00007f083c09b965 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f083c67c0ac in __gthread_cond_wait (__mutex=<optimized out>, __cond=<optimized out>) at /build/cmsbld/jenkins/workspace/build-any-ib/w/BUILD/slc7_amd64_gcc700/external/gcc/7.0.0-lafael/gcc-branches_gcc-7-branch-268351/obj/x86_64-unknown-linux-gnu/libstdc++-v3/include/x86_64-unknown-linux-gnu/bits/gthr-default.h:864
#2  std::condition_variable::wait (this=<optimized out>, __lock=...) at ../../../../../libstdc++-v3/src/c++11/condition_variable.cc:53
#3  0x00007f07ee151bba in Eigen::ThreadPoolTempl<tensorflow::thread::EigenEnvironment>::WorkerLoop(int) () from /cvmfs/cms-ib.cern.ch/nweek-02571/slc7_amd64_gcc700/cms/cmssw/CMSSW_10_6_X_2019-04-13-1100/external/slc7_amd64_gcc700/lib/libtensorflow_framework.so
#4  0x00007f07ee14e687 in std::_Function_handler<void (), tensorflow::thread::EigenEnvironment::CreateThread(std::function<void ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) () from /cvmfs/cms-ib.cern.ch/nweek-02571/slc7_amd64_gcc700/cms/cmssw/CMSSW_10_6_X_2019-04-13-1100/external/slc7_amd64_gcc700/lib/libtensorflow_framework.so
#5  0x00007f083c681cbf in std::execute_native_thread_routine (__p=0x7f07cfd1b9b0) at ../../../../../libstdc++-v3/src/c++11/thread.cc:83
#6  0x00007f083c097dd5 in start_thread () from /lib64/libpthread.so.0
#7  0x00007f083bdc0ead in clone () from /lib64/libc.so.6
Thread 5 (Thread 0x7f0800bff700 (LWP 223510)):
#0  0x00007f083bdb620d in poll () from /lib64/libc.so.6
#1  0x00007f08312c0797 in full_read.constprop () from /cvmfs/cms-ib.cern.ch/nweek-02571/slc7_amd64_gcc700/cms/cmssw/CMSSW_10_6_X_2019-04-13-1100/lib/slc7_amd64_gcc700/pluginFWCoreServicesPlugins.so
#2  0x00007f08312c0e2c in edm::service::InitRootHandlers::stacktraceFromThread() () from /cvmfs/cms-ib.cern.ch/nweek-02571/slc7_amd64_gcc700/cms/cmssw/CMSSW_10_6_X_2019-04-13-1100/lib/slc7_amd64_gcc700/pluginFWCoreServicesPlugins.so
#3  0x00007f08312c1e98 in sig_dostack_then_abort () from /cvmfs/cms-ib.cern.ch/nweek-02571/slc7_amd64_gcc700/cms/cmssw/CMSSW_10_6_X_2019-04-13-1100/lib/slc7_amd64_gcc700/pluginFWCoreServicesPlugins.so
#4  <signal handler called>
#5  0x00007f08312c0020 in sig_pause_for_stacktrace () from /cvmfs/cms-ib.cern.ch/nweek-02571/slc7_amd64_gcc700/cms/cmssw/CMSSW_10_6_X_2019-04-13-1100/lib/slc7_amd64_gcc700/pluginFWCoreServicesPlugins.so
#6  <signal handler called>
#7  0x00007f07f3c84d4c in HGCalCLUEAlgo::setDensity(std::vector<KDTreeNodeInfoT<HGCalCLUEAlgo::Hexel, 2u>, std::allocator<KDTreeNodeInfoT<HGCalCLUEAlgo::Hexel, 2u> > > const&) () from /cvmfs/cms-ib.cern.ch/nweek-02571/slc7_amd64_gcc700/cms/cmssw/CMSSW_10_6_X_2019-04-13-1100/lib/slc7_amd64_gcc700/pluginRecoLocalCaloHGCalRecProducersPlugins.so
#8  0x00007f07f3c86910 in tbb::interface9::internal::start_for<tbb::blocked_range<unsigned long>, tbb::internal::parallel_for_body<HGCalCLUEAlgo::makeClusters()::{lambda()#1}::operator()() const::{lambda(unsigned long)#1}, unsigned long>, tbb::auto_partitioner const>::execute() () from /cvmfs/cms-ib.cern.ch/nweek-02571/slc7_amd64_gcc700/cms/cmssw/CMSSW_10_6_X_2019-04-13-1100/lib/slc7_amd64_gcc700/pluginRecoLocalCaloHGCalRecProducersPlugins.so
#9  0x00007f083d3d8931 in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all (this=0x7f0839643e00, parent=..., child=<optimized out>) at ../../src/tbb/custom_scheduler.h:521
#10 0x00007f083d3d6040 in tbb::internal::generic_scheduler::local_spawn_root_and_wait (this=0x7f0839643e00, first=0x7f0839647540, next=@0x7f0839647538: 0x7f06fc4a8c40) at ../../src/tbb/scheduler.cpp:703
#11 0x00007f07f3c83ce7 in tbb::interface7::internal::delegated_function<HGCalCLUEAlgo::makeClusters()::{lambda()#1} const, void>::operator()() const () from /cvmfs/cms-ib.cern.ch/nweek-02571/slc7_amd64_gcc700/cms/cmssw/CMSSW_10_6_X_2019-04-13-1100/lib/slc7_amd64_gcc700/pluginRecoLocalCaloHGCalRecProducersPlugins.so
#12 0x00007f083d3d26b3 in tbb::interface7::internal::isolate_within_arena (d=..., reserved=<optimized out>) at ../../src/tbb/arena.cpp:1042
#13 0x00007f07f3c83e70 in HGCalCLUEAlgo::makeClusters() () from /cvmfs/cms-ib.cern.ch/nweek-02571/slc7_amd64_gcc700/cms/cmssw/CMSSW_10_6_X_2019-04-13-1100/lib/slc7_amd64_gcc700/pluginRecoLocalCaloHGCalRecProducersPlugins.so
#14 0x00007f07f3c9e9ac in HGCalLayerClusterProducer::produce(edm::Event&, edm::EventSetup const&) () from /cvmfs/cms-ib.cern.ch/nweek-02571/slc7_amd64_gcc700/cms/cmssw/CMSSW_10_6_X_2019-04-13-1100/lib/slc7_amd64_gcc700/pluginRecoLocalCaloHGCalRecProducersPlugins.so
#15 0x00007f083eb9ea7d in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventPrincipal const&, edm::EventSetupImpl const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /cvmfs/cms-ib.cern.ch/nweek-02571/slc7_amd64_gcc700/cms/cmssw/CMSSW_10_6_X_2019-04-13-1100/lib/slc7_amd64_gcc700/libFWCoreFramework.so
#16 0x00007f083eb5e8b2 in edm::WorkerT<edm::stream::EDProducerAdaptorBase>::implDo(edm::EventPrincipal const&, edm::EventSetupImpl const&, edm::ModuleCallingContext const*) () from /cvmfs/cms-ib.cern.ch/nweek-02571/slc7_amd64_gcc700/cms/cmssw/CMSSW_10_6_X_2019-04-13-1100/lib/slc7_amd64_gcc700/libFWCoreFramework.so
#17 0x00007f083ea6826a in decltype ({parm#1}()) edm::convertException::wrap<bool edm::Worker::runModule<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::MyPrincipal const&, edm::EventSetupImpl const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}>(bool edm::Worker::runModule<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::MyPrincipal const&, edm::EventSetupImpl const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}) () from /cvmfs/cms-ib.cern.ch/nweek-02571/slc7_amd64_gcc700/cms/cmssw/CMSSW_10_6_X_2019-04-13-1100/lib/slc7_amd64_gcc700/libFWCoreFramework.so
#18 0x00007f083ea6842d in bool edm::Worker::runModule<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::MyPrincipal const&, edm::EventSetupImpl const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*) () from /cvmfs/cms-ib.cern.ch/nweek-02571/slc7_amd64_gcc700/cms/cmssw/CMSSW_10_6_X_2019-04-13-1100/lib/slc7_amd64_gcc700/libFWCoreFramework.so
#19 0x00007f083ea69abb in std::__exception_ptr::exception_ptr edm::Worker::runModuleAfterAsyncPrefetch<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(std::__exception_ptr::exception_ptr const*, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::MyPrincipal const&, edm::EventSetupImpl const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*) () from /cvmfs/cms-ib.cern.ch/nweek-02571/slc7_amd64_gcc700/cms/cmssw/CMSSW_10_6_X_2019-04-13-1100/lib/slc7_amd64_gcc700/libFWCoreFramework.so
#20 0x00007f083ea6aab0 in edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >::execute() () from /cvmfs/cms-ib.cern.ch/nweek-02571/slc7_amd64_gcc700/cms/cmssw/CMSSW_10_6_X_2019-04-13-1100/lib/slc7_amd64_gcc700/libFWCoreFramework.so
#21 0x00007f083d3d8931 in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all (this=0x7f0839643e00, parent=..., child=<optimized out>) at ../../src/tbb/custom_scheduler.h:521
#22 0x00007f083d3d23d8 in tbb::internal::arena::process (this=0x7f083977ad00, s=...) at ../../src/tbb/arena.cpp:160
#23 0x00007f083d3d0d53 in tbb::internal::market::process (this=0x7f083977b580, j=...) at ../../src/tbb/market.cpp:710
#24 0x00007f083d3cd011 in tbb::internal::rml::private_worker::run (this=0x7f083959e080) at ../../src/tbb/private_server.cpp:270
#25 0x00007f083d3cd259 in tbb::internal::rml::private_worker::thread_routine (arg=<optimized out>) at ../../src/tbb/private_server.cpp:223
#26 0x00007f083c097dd5 in start_thread () from /lib64/libpthread.so.0
#27 0x00007f083bdc0ead in clone () from /lib64/libc.so.6
Thread 4 (Thread 0x7f0801f14700 (LWP 223509)):
#0  0x00007f083bd87e2d in nanosleep () from /lib64/libc.so.6
#1  0x00007f083bd87cc4 in sleep () from /lib64/libc.so.6
#2  0x00007f08312c0061 in sig_pause_for_stacktrace () from /cvmfs/cms-ib.cern.ch/nweek-02571/slc7_amd64_gcc700/cms/cmssw/CMSSW_10_6_X_2019-04-13-1100/lib/slc7_amd64_gcc700/pluginFWCoreServicesPlugins.so
#3  <signal handler called>
#4  0x00007f083bdbb1c9 in syscall () from /lib64/libc.so.6
#5  0x00007f083d3cd215 in tbb::internal::futex_wait (comparand=2, futex=0x7f083959e1ac) at ../../include/tbb/machine/linux_common.h:85
#6  tbb::internal::binary_semaphore::P (this=0x7f083959e1ac) at ../../src/tbb/semaphore.h:209
#7  rml::internal::thread_monitor::commit_wait (c=..., this=0x7f083959e1a0) at ../../src/tbb/../rml/server/thread_monitor.h:258
#8  tbb::internal::rml::private_worker::run (this=0x7f083959e180) at ../../src/tbb/private_server.cpp:277
#9  0x00007f083d3cd259 in tbb::internal::rml::private_worker::thread_routine (arg=<optimized out>) at ../../src/tbb/private_server.cpp:223
#10 0x00007f083c097dd5 in start_thread () from /lib64/libpthread.so.0
#11 0x00007f083bdc0ead in clone () from /lib64/libc.so.6
Thread 3 (Thread 0x7f0802915700 (LWP 223508)):
#0  0x00007f083bd87e2d in nanosleep () from /lib64/libc.so.6
#1  0x00007f083bd87cc4 in sleep () from /lib64/libc.so.6
#2  0x00007f08312c0061 in sig_pause_for_stacktrace () from /cvmfs/cms-ib.cern.ch/nweek-02571/slc7_amd64_gcc700/cms/cmssw/CMSSW_10_6_X_2019-04-13-1100/lib/slc7_amd64_gcc700/pluginFWCoreServicesPlugins.so
#3  <signal handler called>
#4  0x00007f083bdbb1c9 in syscall () from /lib64/libc.so.6
#5  0x00007f083d3cd215 in tbb::internal::futex_wait (comparand=2, futex=0x7f083959e12c) at ../../include/tbb/machine/linux_common.h:85
#6  tbb::internal::binary_semaphore::P (this=0x7f083959e12c) at ../../src/tbb/semaphore.h:209
#7  rml::internal::thread_monitor::commit_wait (c=..., this=0x7f083959e120) at ../../src/tbb/../rml/server/thread_monitor.h:258
#8  tbb::internal::rml::private_worker::run (this=0x7f083959e100) at ../../src/tbb/private_server.cpp:277
#9  0x00007f083d3cd259 in tbb::internal::rml::private_worker::thread_routine (arg=<optimized out>) at ../../src/tbb/private_server.cpp:223
#10 0x00007f083c097dd5 in start_thread () from /lib64/libpthread.so.0
#11 0x00007f083bdc0ead in clone () from /lib64/libc.so.6
Thread 2 (Thread 0x7f081ad8c700 (LWP 223494)):
#0  0x00007f083c09f179 in waitpid () from /lib64/libpthread.so.0
#1  0x00007f08312c0267 in edm::service::cmssw_stacktrace_fork() () from /cvmfs/cms-ib.cern.ch/nweek-02571/slc7_amd64_gcc700/cms/cmssw/CMSSW_10_6_X_2019-04-13-1100/lib/slc7_amd64_gcc700/pluginFWCoreServicesPlugins.so
#2  0x00007f08312c0d4a in edm::service::InitRootHandlers::stacktraceHelperThread() () from /cvmfs/cms-ib.cern.ch/nweek-02571/slc7_amd64_gcc700/cms/cmssw/CMSSW_10_6_X_2019-04-13-1100/lib/slc7_amd64_gcc700/pluginFWCoreServicesPlugins.so
#3  0x00007f083c681cbf in std::execute_native_thread_routine (__p=0x7f08343966c0) at ../../../../../libstdc++-v3/src/c++11/thread.cc:83
#4  0x00007f083c097dd5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007f083bdc0ead in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x7f083a487480 (LWP 223143)):
#0  0x00007f083bd87e2d in nanosleep () from /lib64/libc.so.6
#1  0x00007f083bd87cc4 in sleep () from /lib64/libc.so.6
#2  0x00007f08312c0061 in sig_pause_for_stacktrace () from /cvmfs/cms-ib.cern.ch/nweek-02571/slc7_amd64_gcc700/cms/cmssw/CMSSW_10_6_X_2019-04-13-1100/lib/slc7_amd64_gcc700/pluginFWCoreServicesPlugins.so
#3  <signal handler called>
#4  0x00007f083ec4a6f5 in do_lookup_x () from /lib64/ld-linux-x86-64.so.2
#5  0x00007f083ec4afcf in _dl_lookup_symbol_x () from /lib64/ld-linux-x86-64.so.2
#6  0x00007f083ec4fd1e in _dl_fixup () from /lib64/ld-linux-x86-64.so.2
#7  0x00007f083ec5791a in _dl_runtime_resolve_xsave () from /lib64/ld-linux-x86-64.so.2
#8  0x00007f08312c01f8 in full_write () from /cvmfs/cms-ib.cern.ch/nweek-02571/slc7_amd64_gcc700/cms/cmssw/CMSSW_10_6_X_2019-04-13-1100/lib/slc7_amd64_gcc700/pluginFWCoreServicesPlugins.so
#9  0x00007f08312c1e73 in sig_dostack_then_abort () from /cvmfs/cms-ib.cern.ch/nweek-02571/slc7_amd64_gcc700/cms/cmssw/CMSSW_10_6_X_2019-04-13-1100/lib/slc7_amd64_gcc700/pluginFWCoreServicesPlugins.so
#10 <signal handler called>
#11 0x00007f083bda5d47 in sched_yield () from /lib64/libc.so.6
#12 0x00007f083d3d7715 in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::receive_or_steal_task (this=0x7f083976a600, completion_ref_count=<optimized out>, isolation=0) at ../../src/tbb/custom_scheduler.h:289
#13 0x00007f083d3d8508 in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all (this=0x7f083976a600, parent=..., child=<optimized out>) at ../../src/tbb/custom_scheduler.h:633
#14 0x00007f083eb0f780 in edm::EventProcessor::processLumis(std::shared_ptr<void> const&) () from /cvmfs/cms-ib.cern.ch/nweek-02571/slc7_amd64_gcc700/cms/cmssw/CMSSW_10_6_X_2019-04-13-1100/lib/slc7_amd64_gcc700/libFWCoreFramework.so
#15 0x00007f083eb190e1 in edm::EventProcessor::runToCompletion() () from /cvmfs/cms-ib.cern.ch/nweek-02571/slc7_amd64_gcc700/cms/cmssw/CMSSW_10_6_X_2019-04-13-1100/lib/slc7_amd64_gcc700/libFWCoreFramework.so
#16 0x000000000040fe78 in main::{lambda()#1}::operator()() const ()
#17 0x000000000040e142 in main ()

Current Modules:

Module: HGCalLayerClusterProducer:hgcalLayerClusters (crashed)
Module: none
Module: none
Module: none
  1. Note that the time between the last normal event printout (18:52:02) and the timestamp of the stack dump (20:45:12) is nearly 2 hours.
  2. Only one thread is still doing any work, which means all the other streams have finished processing their events and have nothing left to do. In addition, HGCalCLUEAlgo uses multiple TBB tasks (via a parallel_for), but only one of those tasks is still running.

Those two items imply to me that the job was probably stuck in one call to HGCalCLUEAlgo::setDensity for nearly two hours.

@Dr15Jones
Contributor Author

assign upgrade

@cmsbuild
Contributor

New categories assigned: upgrade

@kpedro88 you have been requested to review this Pull request/Issue and eventually sign? Thanks

@kpedro88
Contributor

@rovere @felicepantaleo please take a look

it seems like density_ may need to become a local variable inside the loop to avoid data races
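
A minimal sketch of the "local variable" idea, under stated assumptions: `DetIdKey`, `Hit`, `densityForLayer`, and `densitiesPerLayer` are hypothetical names, not the actual HGCalCLUEAlgo code, and `std::thread` stands in for the `tbb::parallel_for` seen in the stack trace. Each task fills a map that it alone owns, so no two tasks ever modify the same `std::map`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <thread>
#include <vector>

// Hypothetical stand-ins for the real types; only the shape matters here.
using DetIdKey = std::uint32_t;
using Density = std::map<DetIdKey, float>;

struct Hit {
  DetIdKey id;
  float energy;
};

// Accumulates a per-hit quantity into a map local to this call.
// Nothing outside the function can see the map while it is being filled.
Density densityForLayer(const std::vector<Hit>& layerHits) {
  Density local;
  for (const Hit& h : layerHits) {
    local[h.id] += h.energy;
  }
  return local;
}

// Runs one task per layer. Each task writes only to its own slot of
// `results`, which is sized up front and never resized while tasks run.
std::vector<Density> densitiesPerLayer(const std::vector<std::vector<Hit>>& layers) {
  std::vector<Density> results(layers.size());
  std::vector<std::thread> workers;
  workers.reserve(layers.size());
  for (std::size_t i = 0; i < layers.size(); ++i) {
    workers.emplace_back([&layers, &results, i] {
      results[i] = densityForLayer(layers[i]);
    });
  }
  assert(workers.size() == layers.size());
  for (std::thread& w : workers) {
    w.join();
  }
  return results;
}
```

Calling `densitiesPerLayer` with two layers returns two independent maps, one per layer. The design choice is to avoid sharing rather than to synchronize: with no shared mutable container there is nothing to lock.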

@Dr15Jones
Contributor Author

Yes, density_ is not thread safe. Its type is a std::map:

typedef std::map< DetId, float > Density;
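
For context, `std::map` gives no guarantees when several threads modify the same instance without synchronization: that is a data race and therefore undefined behavior. If the map really had to stay shared, one alternative to per-task maps would be to guard every access with a mutex. The sketch below is illustrative only; `GuardedDensity`, `fillConcurrently`, and `DetIdKey` are hypothetical names, and `std::thread` stands in for TBB tasks.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <mutex>
#include <thread>
#include <vector>

using DetIdKey = std::uint32_t;  // hypothetical stand-in for DetId
using Density = std::map<DetIdKey, float>;

// Wraps the shared map so that every read and write holds the same mutex.
class GuardedDensity {
public:
  void add(DetIdKey id, float value) {
    std::lock_guard<std::mutex> lock(mutex_);
    density_[id] += value;
  }

  // Returns a copy taken under the lock, safe to read without further locking.
  Density snapshot() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return density_;
  }

private:
  mutable std::mutex mutex_;
  Density density_;
};

// Several threads add to the same keys; the mutex serializes the updates.
Density fillConcurrently(unsigned nThreads, unsigned nKeys) {
  assert(nThreads > 0);
  GuardedDensity shared;
  std::vector<std::thread> workers;
  workers.reserve(nThreads);
  for (unsigned t = 0; t < nThreads; ++t) {
    workers.emplace_back([&shared, nKeys] {
      for (DetIdKey k = 0; k < nKeys; ++k) {
        shared.add(k, 1.0f);
      }
    });
  }
  for (std::thread& w : workers) {
    w.join();
  }
  return shared.snapshot();
}
```

The lock makes the updates correct but also serializes them, so in a parallel loop a per-task local map is usually the cheaper option.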

@kpedro88
Contributor

@rovere @felicepantaleo this needs to be addressed sooner rather than later

@felicepantaleo
Contributor

OK, I'm starting to work on it; I'll make a PR ASAP.

@kpedro88
Contributor

+upgrade

@cmsbuild
Contributor

This issue is fully signed and ready to be closed.
