Offline crashes in {HLT,L1}TriggerJSONMonitoring in CMSSW_14_0_6_MULTIARCHS #44975

Closed
mmusich opened this issue May 15, 2024 · 23 comments

@mmusich
Contributor

mmusich commented May 15, 2024

@silviodonato reported a crash in CMSSW_14_0_6_MULTIARCHS when running:

ssh lxplus8.cern.ch
export SCRAM_ARCH=el8_amd64_gcc12
cmsrel CMSSW_14_0_6_MULTIARCHS
cd CMSSW_14_0_6_MULTIARCHS/src
cmsenv
hltGetConfiguration run:380647 --globaltag  140X_dataRun3_HLT_v3  --input file:/eos/cms/tier0/store/data/Run2024D/EphemeralHLTPhysics0/RAW/v1/000/380/647/00000/a8bb2f4f-008c-454b-8a8c-f77ff51e8fcf.root

which produces the following stack trace:

Thread 1 (Thread 0x7fe7ae29d640 (LWP 1513991) "cmsRun"):
#0  0x00007fe7aee6a301 in poll () from /lib64/libc.so.6
#1  0x00007fe7a26f62ff in full_read.constprop () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/pluginFWCoreServicesPlugins.so
#2  0x00007fe7a26a9afc in edm::service::InitRootHandlers::stacktraceFromThread() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/pluginFWCoreServicesPlugins.so
#3  0x00007fe7a26aa460 in sig_dostack_then_abort () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/pluginFWCoreServicesPlugins.so
#4  <signal handler called>
#5  0x00007fe7aee14e41 in __memmove_avx_unaligned_erms () from /lib64/libc.so.6
#6  0x00007fe7af8117ab in std::char_traits<char>::copy (__n=49, __s2=<optimized out>, __s1=<optimized out>) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_13_2_0_pre2-el8_amd64_gcc12/build/CMSSW_13_2_0_pre2-build/BUILD/el8_amd64_gcc12/external/gcc/12.3.1-40d504be6370b5a30e3947a6e575ca28/gcc-12.3.1/obj/x86_64-redhat-linux-gnu/libstdc++-v3/include/bits/char_traits.h:435
#7  std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_S_copy (__n=49, __s=<optimized out>, __d=<optimized out>) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_13_2_0_pre2-el8_amd64_gcc12/build/CMSSW_13_2_0_pre2-build/BUILD/el8_amd64_gcc12/external/gcc/12.3.1-40d504be6370b5a30e3947a6e575ca28/gcc-12.3.1/obj/x86_64-redhat-linux-gnu/libstdc++-v3/include/bits/basic_string.h:431
#8  std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_S_copy (__n=49, __s=<optimized out>, __d=<optimized out>) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_13_2_0_pre2-el8_amd64_gcc12/build/CMSSW_13_2_0_pre2-build/BUILD/el8_amd64_gcc12/external/gcc/12.3.1-40d504be6370b5a30e3947a6e575ca28/gcc-12.3.1/obj/x86_64-redhat-linux-gnu/libstdc++-v3/include/bits/basic_string.h:426
#9  std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_assign (this=0x7fffc5118a40, __str=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_13_2_0_pre2-el8_amd64_gcc12/build/CMSSW_13_2_0_pre2-build/BUILD/el8_amd64_gcc12/external/gcc/12.3.1-40d504be6370b5a30e3947a6e575ca28/gcc-12.3.1/obj/x86_64-redhat-linux-gnu/libstdc++-v3/include/bits/basic_string.tcc:291
#10 0x00007fe711d1caf6 in L1TriggerJSONMonitoring::globalEndLuminosityBlockSummary(edm::LuminosityBlock const&, edm::EventSetup const&, L1TriggerJSONMonitoringData::lumisection*) const () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/pluginHLTriggerJSONMonitoringPlugins.so
#11 0x00007fe711d1d8c8 in virtual thunk to edm::global::impl::LuminosityBlockSummaryCacheHolder<edm::global::EDAnalyzerBase, L1TriggerJSONMonitoringData::lumisection>::doEndLuminosityBlockSummary_(edm::LuminosityBlock const&, edm::EventSetup const&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/pluginHLTriggerJSONMonitoringPlugins.so
#12 0x00007fe7b18c1ff5 in edm::global::EDAnalyzerBase::doEndLuminosityBlock(edm::LumiTransitionInfo const&, edm::ModuleCallingContext const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#13 0x00007fe7b18b9da0 in edm::WorkerT<edm::global::EDAnalyzerBase>::implDoEnd(edm::LumiTransitionInfo const&, edm::ModuleCallingContext const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#14 0x00007fe7b1807a7f in std::__exception_ptr::exception_ptr edm::Worker::runModuleAfterAsyncPrefetch<edm::OccurrenceTraits<edm::LuminosityBlockPrincipal, (edm::BranchActionType)3> >(std::__exception_ptr::exception_ptr, edm::OccurrenceTraits<edm::LuminosityBlockPrincipal, (edm::BranchActionType)3>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::LuminosityBlockPrincipal, (edm::BranchActionType)3>::Context const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#15 0x00007fe7b17f5ef8 in edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::LuminosityBlockPrincipal, (edm::BranchActionType)3> >::execute() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#16 0x00007fe7b17b8bae in tbb::detail::d1::function_task<edm::WaitingTaskHolder::doneWaiting(std::__exception_ptr::exception_ptr)::{lambda()#1}>::execute(tbb::detail::d1::execution_data&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#17 0x00007fe7afff3281 in tbb::detail::r1::task_dispatcher::local_wait_for_all<false, tbb::detail::r1::external_waiter> (waiter=..., t=<optimized out>, this=0x7fe7acc99380) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/task_dispatcher.h:322
#18 tbb::detail::r1::task_dispatcher::local_wait_for_all<tbb::detail::r1::external_waiter> (waiter=..., t=<optimized out>, this=0x7fe7acc99380) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/task_dispatcher.h:458
#19 tbb::detail::r1::task_dispatcher::execute_and_wait (t=<optimized out>, wait_ctx=..., w_ctx=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/task_dispatcher.cpp:168
#20 0x00007fe7b17c941b in edm::FinalWaitingTask::wait() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#21 0x00007fe7b17d324d in edm::EventProcessor::processRuns() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#22 0x00007fe7b17d37b1 in edm::EventProcessor::runToCompletion() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#23 0x00000000004074ef in tbb::detail::d1::task_arena_function<main::{lambda()#1}::operator()() const::{lambda()#1}, void>::operator()() const ()
#24 0x00007fe7affdf9ad in tbb::detail::r1::task_arena_impl::execute (ta=..., d=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/arena.cpp:688
#25 0x0000000000408ed2 in main::{lambda()#1}::operator()() const ()
#26 0x000000000040517c in main ()

Current Modules:

Module: L1TriggerJSONMonitoring:hltL1TriggerJSONMonitoring (crashed)
Segmentation fault (core dumped)

Trying to reproduce with a slightly different setup (e.g. the script below)

#!/bin/bash -ex

# CMSSW_14_0_6_MULTIARCHS

hltGetConfiguration run:380647 \
            --globaltag 140X_dataRun3_HLT_v3 \
            --input file:/eos/cms/tier0/store/data/Run2024D/EphemeralHLTPhysics0/RAW/v1/000/380/647/00000/a8bb2f4f-008c-454b-8a8c-f77ff51e8fcf.root > hlt_run380647.py

cat <<@EOF >> hlt_run380647.py
process.options.wantSummary = False
process.options.numberOfThreads = 1
process.options.numberOfStreams = 0
@EOF

cmsRun hlt_run380647.py &> hlt.log

I get a different crash (also on CPU-only) involving

Thread 1 (Thread 0x7fed272ac640 (LWP 2328682) "cmsRun"):
#0  0x00007fed27e79301 in poll () from /lib64/libc.so.6
#1  0x00007fed1b72f2ff in full_read.constprop () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/pluginFWCoreServicesPlugins.so
#2  0x00007fed1b6e2afc in edm::service::InitRootHandlers::stacktraceFromThread() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/pluginFWCoreServicesPlugins.so
#3  0x00007fed1b6e3460 in sig_dostack_then_abort () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/pluginFWCoreServicesPlugins.so
#4  <signal handler called>
#5  0x00007fed27e23e37 in __memmove_avx_unaligned_erms () from /lib64/libc.so.6
#6  0x00007fed27de7009 in __GI__IO_file_xsputn () from /lib64/libc.so.6
#7  0x00007fed27ddc19c in fwrite () from /lib64/libc.so.6
#8  0x00007fed2881127d in std::basic_streambuf<char, std::char_traits<char> >::sputn (__n=50, __s=0x0, this=<optimized out>) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_13_2_0_pre2-el8_amd64_gcc12/build/CMSSW_13_2_0_pre2-build/BUILD/el8_amd64_gcc12/external/gcc/12.3.1-40d504be6370b5a30e3947a6e575ca28/gcc-12.3.1/obj/x86_64-redhat-linux-gnu/libstdc++-v3/include/streambuf:455
#9  std::__ostream_write<char, std::char_traits<char> > (__n=50, __s=0x0, __out=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_13_2_0_pre2-el8_amd64_gcc12/build/CMSSW_13_2_0_pre2-build/BUILD/el8_amd64_gcc12/external/gcc/12.3.1-40d504be6370b5a30e3947a6e575ca28/gcc-12.3.1/obj/x86_64-redhat-linux-gnu/libstdc++-v3/include/bits/ostream_insert.h:51
#10 std::__ostream_insert<char, std::char_traits<char> > (__out=..., __s=0x0, __n=50) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_13_2_0_pre2-el8_amd64_gcc12/build/CMSSW_13_2_0_pre2-build/BUILD/el8_amd64_gcc12/external/gcc/12.3.1-40d504be6370b5a30e3947a6e575ca28/gcc-12.3.1/obj/x86_64-redhat-linux-gnu/libstdc++-v3/include/bits/ostream_insert.h:102
#11 0x00007fec9955a16b in HLTriggerJSONMonitoring::globalEndLuminosityBlockSummary(edm::LuminosityBlock const&, edm::EventSetup const&, HLTriggerJSONMonitoringData::lumisection*) const () from /tmp/musich/hltL1TriggerJSONMonitoring/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/pluginHLTriggerJSONMonitoringPlugins.so
#12 0x00007fec9955f0c8 in virtual thunk to edm::global::impl::LuminosityBlockSummaryCacheHolder<edm::global::EDAnalyzerBase, HLTriggerJSONMonitoringData::lumisection>::doEndLuminosityBlockSummary_(edm::LuminosityBlock const&, edm::EventSetup const&) () from /tmp/musich/hltL1TriggerJSONMonitoring/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/pluginHLTriggerJSONMonitoringPlugins.so
#13 0x00007fed2a8d0ff5 in edm::global::EDAnalyzerBase::doEndLuminosityBlock(edm::LumiTransitionInfo const&, edm::ModuleCallingContext const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#14 0x00007fed2a8c8da0 in edm::WorkerT<edm::global::EDAnalyzerBase>::implDoEnd(edm::LumiTransitionInfo const&, edm::ModuleCallingContext const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#15 0x00007fed2a816a7f in std::__exception_ptr::exception_ptr edm::Worker::runModuleAfterAsyncPrefetch<edm::OccurrenceTraits<edm::LuminosityBlockPrincipal, (edm::BranchActionType)3> >(std::__exception_ptr::exception_ptr, edm::OccurrenceTraits<edm::LuminosityBlockPrincipal, (edm::BranchActionType)3>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::LuminosityBlockPrincipal, (edm::BranchActionType)3>::Context const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#16 0x00007fed2a804ef8 in edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::LuminosityBlockPrincipal, (edm::BranchActionType)3> >::execute() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#17 0x00007fed2a7c7bae in tbb::detail::d1::function_task<edm::WaitingTaskHolder::doneWaiting(std::__exception_ptr::exception_ptr)::{lambda()#1}>::execute(tbb::detail::d1::execution_data&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#18 0x00007fed29002281 in tbb::detail::r1::task_dispatcher::local_wait_for_all<false, tbb::detail::r1::external_waiter> (waiter=..., t=<optimized out>, this=0x7fed25c83e00) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/task_dispatcher.h:322
#19 tbb::detail::r1::task_dispatcher::local_wait_for_all<tbb::detail::r1::external_waiter> (waiter=..., t=<optimized out>, this=0x7fed25c83e00) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/task_dispatcher.h:458
#20 tbb::detail::r1::task_dispatcher::execute_and_wait (t=<optimized out>, wait_ctx=..., w_ctx=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/task_dispatcher.cpp:168
#21 0x00007fed2a7d841b in edm::FinalWaitingTask::wait() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#22 0x00007fed2a7e224d in edm::EventProcessor::processRuns() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#23 0x00007fed2a7e27b1 in edm::EventProcessor::runToCompletion() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#24 0x00000000004074ef in tbb::detail::d1::task_arena_function<main::{lambda()#1}::operator()() const::{lambda()#1}, void>::operator()() const ()
#25 0x00007fed28fee9ad in tbb::detail::r1::task_arena_impl::execute (ta=..., d=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/arena.cpp:688
#26 0x0000000000408ed2 in main::{lambda()#1}::operator()() const ()
#27 0x000000000040517c in main ()

Current Modules:

Module: HLTriggerJSONMonitoring:hltHLTriggerJSONMonitoring (crashed)
Module: none

A fatal system signal has occurred: segmentation violation

As additional information, it looks like it depends on the output configuration.
Setting:

it runs without problems, whereas setting:

  • --output all

it crashes as reported above.

FYI @missirol @fwyzard @cms-sw/hlt-l2

@cmsbuild
Contributor

cmsbuild commented May 15, 2024

cms-bot internal usage

@cmsbuild
Contributor

A new Issue was created by @mmusich.

@antoniovilela, @sextonkennedy, @rappoccio, @Dr15Jones, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@fwyzard
Contributor

fwyzard commented May 15, 2024

Does the same crash happen in plain CMSSW_14_0_6 ?

@mmusich
Contributor Author

mmusich commented May 15, 2024

Does the same crash happen in plain CMSSW_14_0_6 ?

yes.

@makortel
Contributor

assign hlt

@cmsbuild
Contributor

New categories assigned: hlt

@Martin-Grunewald,@mmusich you have been requested to review this Pull request/Issue and eventually sign? Thanks

@Martin-Grunewald
Contributor

assign daq

@cmsbuild
Contributor

New categories assigned: daq

@emeschi,@smorovic you have been requested to review this Pull request/Issue and eventually sign? Thanks

@smorovic
Contributor

It doesn't crash if this is appended:

process.FastMonitoringService = cms.Service( "FastMonitoringService")

process.EvFDaqDirector = cms.Service( "EvFDaqDirector",
    baseDir = cms.untracked.string( "." ),
    buBaseDir = cms.untracked.string( "." ),
    buBaseDirsAll = cms.untracked.vstring(  ),
    buBaseDirsNumStreams = cms.untracked.vint32(  ),
    runNumber = cms.untracked.uint32( 380647 ),
    useFileBroker = cms.untracked.bool( False ),
    fileBrokerHostFromCfg = cms.untracked.bool( True ),
    fileBrokerHost = cms.untracked.string( "" ),
    fileBrokerPort = cms.untracked.string( "8080" ),
    fileBrokerKeepAlive = cms.untracked.bool( True ),
    fileBrokerUseLocalLock = cms.untracked.bool( True ),
    fuLockPollInterval = cms.untracked.uint32( 2000 ),
    outputAdler32Recheck = cms.untracked.bool( False ),
    directorIsBU = cms.untracked.bool( False ),
    hltSourceDirectory = cms.untracked.string( "" ),
    mergingPset = cms.untracked.string( "" )
)

along with mkdir run380647.

From the code it is not clear why it would crash.
Maybe it's the cast from a MicroStateService pointer to a FastMonitoringService pointer (in case a dummy MSS is inserted somehow). We are planning to finally remove the MicroStateService base class; that will happen in 14_1_X (soon).

@smorovic
Contributor

From the code it is not clear why it would crash. Maybe it's the cast from a MicroStateService pointer to a FastMonitoringService pointer (in case a dummy MSS is inserted somehow). We are planning to finally remove the MicroStateService base class; that will happen in 14_1_X (soon).

It is not that: even after removing the check for the FMS service, there is still a crash.

@Martin-Grunewald
Contributor

Indeed, hltGetConfiguration removes these (see https://github.com/cms-sw/cmssw/blob/master/HLTrigger/Configuration/python/Tools/confdb.py#L809)

    # remove the DAQ modules and the online definition of the DQMStore and DQMFileSaver
    # unless a hilton-like configuration has been requested
    if not self.config.hilton:
      self.options['services'].append( "-EvFDaqDirector" )
      self.options['services'].append( "-FastMonitoringService" )
      self.options['services'].append( "-DQMStore" )
      self.options['modules'].append( "-hltDQMFileSaver" )
      self.options['modules'].append( "-hltDQMFileSaverPB" )

It is recommended to use minimal or none output in the hltGetConfiguration command, or at least explicitly remove the "-RatesMonitoring" path.

@smorovic
Contributor

I have now run this on an FU machine (with a GPU) and I'm getting a slightly different stack trace with more information:

Thread 1 (Thread 0x7f5639321640 (LWP 2990918) "cmsRun"):
#0  0x00007f5639eeb0e1 in poll () from /lib64/libc.so.6
#1  0x00007f5630bbe6af in full_read.constprop () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6/lib/el8_amd64_gcc12/pluginFWCoreServicesPlugins.so
#2  0x00007f5630b72dbc in edm::service::InitRootHandlers::stacktraceFromThread() () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6/lib/el8_amd64_gcc12/pluginFWCoreServicesPlugins.so
#3  0x00007f5630b73720 in sig_dostack_then_abort () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6/lib/el8_amd64_gcc12/pluginFWCoreServicesPlugins.so
#4  <signal handler called>
#5  0x00007f5639e95f7b in __memmove_avx_unaligned () from /lib64/libc.so.6
#6  0x00007f55b54723ba in Json::duplicateAndPrefixStringValue(char const*, unsigned int) () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6/external/el8_amd64_gcc12/lib/libtensorflow_framework.so.2
#7  0x00007f55b5472582 in Json::Value::Value(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6/external/el8_amd64_gcc12/lib/libtensorflow_framework.so.2
#8  0x00007f55874cf94e in HLTriggerJSONMonitoring::globalEndLuminosityBlockSummary(edm::LuminosityBlock const&, edm::EventSetup const&, HLTriggerJSONMonitoringData::lumisection*) const () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6/lib/el8_amd64_gcc12/pluginHLTriggerJSONMonitoringPlugins.so
#9  0x00007f55874d4388 in virtual thunk to edm::global::impl::LuminosityBlockSummaryCacheHolder<edm::global::EDAnalyzerBase, HLTriggerJSONMonitoringData::lumisection>::doEndLuminosityBlockSummary_(edm::LuminosityBlock const&, edm::EventSetup const&) () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6/lib/el8_amd64_gcc12/pluginHLTriggerJSONMonitoringPlugins.so

Here it seems that it tries to use Json::Value from the tensorflow library, while we have an integrated (older and modified for thread safety) version in EventFilter/Utilities.
https://github.com/cms-sw/cmssw/blob/master/EventFilter/Utilities/interface/json.h
https://github.com/cms-sw/cmssw/blob/master/EventFilter/Utilities/interface/value.h
That version doesn't actually include a Json::duplicateAndPrefixStringValue function anywhere.

I think what happens is that, when the EventFilter/Utilities library is loaded because the services are used, the correct version is used and there is no crash.
If not, we end up using the tensorflow version and, since the code is compiled with the headers from EventFilter/Utilities, this causes memory corruption and a crash either here or later in the module. On lxplus I also noticed that the crash happens on a simple string size() call made after a Json::Value is defined in the code, and that removing the Json::Value removes the crash.
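To illustrate why such a mismatch can corrupt memory, here is a minimal sketch. Both `Value` layouts are hypothetical stand-ins invented for this example (they are not the real EventFilter/Utilities or tensorflow definitions), and `layoutsAgree` is a made-up helper. The point is only that two classes sharing one name can disagree on size and member layout, so a function built for one layout would read and write the wrong bytes when handed an object built for the other.

```cpp
#include <cassert>
#include <string>

// Hypothetical layout that the calling code was compiled against
// (assumption: chosen only for illustration).
namespace headers_view {
  struct Value {
    int type;
    std::string text;
  };
}  // namespace headers_view

// Hypothetical layout assumed by the library function that the dynamic
// linker actually resolved (assumption: chosen only for illustration).
namespace library_view {
  struct Value {
    char* prefixed;
    unsigned short bits;
  };
}  // namespace library_view

// True only if both views have the same object size. A real mismatch check
// would also have to compare member offsets and types.
constexpr bool layoutsAgree() {
  return sizeof(headers_view::Value) == sizeof(library_view::Value);
}

// Self-check: the two hypothetical layouts are not interchangeable.
inline void checkLayoutsDiffer() { assert(!layoutsAgree()); }
```

In a real one-definition-rule violation both classes carry the same qualified name, so neither the compiler nor the linker reports the disagreement; the separate namespaces here exist only so that the sketch compiles as a single file.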

@makortel
Contributor

Sounds like a one-definition rule violation. If the copy in EventFilter/Utilities still needs to be kept, I'd recommend moving all the relevant code into a CMS-specific namespace.

@smorovic
Contributor

Sounds like a one-definition rule violation. If the copy in EventFilter/Utilities still needs to be kept, I'd recommend moving all the relevant code into a CMS-specific namespace.

I wouldn't dare to change to a different version in the short term, and in the long term we were already thinking of evaluating different json implementations.
Using a namespace seems fine (it could be "evf", which is used for most of the EventFilter/Utilities code). I'll work on those changes.

@fwyzard
Contributor

fwyzard commented May 16, 2024

in the long term we were already thinking of evaluating different json implementations.

In other CMSSW packages we've been using https://github.com/nlohmann/json , which is available as an external via <use name="json"/>.

@smorovic
Contributor

Updated in: #44989
I used the jsoncollector namespace.
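As a rough sketch of why an enclosing namespace avoids the clash (this is not the code from the PR): the classes below are hypothetical stand-ins, and the nesting `jsoncollector::Json::Value` as well as the helpers `originSeenInside` and `checkBothCopies` are assumptions made for illustration. Once the bundled copy lives in its own namespace, its fully qualified names differ from those of any external `Json::Value`, so the two can be loaded into the same process without one silently replacing the other.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Stand-in for a JSON class exported by an unrelated external library.
// Hypothetical: not the real jsoncpp or tensorflow code.
namespace Json {
  class Value {
  public:
    explicit Value(std::string text) : text_(std::move(text)) {}
    const std::string& text() const { return text_; }
    std::string origin() const { return "external"; }

  private:
    std::string text_;
  };
}  // namespace Json

// Stand-in for the bundled copy, wrapped in an enclosing namespace so that
// its fully qualified names differ from the external ones above.
namespace jsoncollector {
  namespace Json {
    class Value {
    public:
      explicit Value(std::string text) : text_(std::move(text)) {}
      const std::string& text() const { return text_; }
      std::string origin() const { return "bundled"; }

    private:
      std::string text_;
    };
  }  // namespace Json

  // Inside jsoncollector, the name Json resolves to jsoncollector::Json,
  // so code in this namespace uses the bundled copy.
  inline std::string originSeenInside() {
    Json::Value v("lumisection");
    return v.origin();
  }
}  // namespace jsoncollector

// Self-check: both copies can be used side by side in one program.
inline void checkBothCopies() {
  assert(::Json::Value("a").origin() == "external");
  assert(jsoncollector::Json::Value("a").origin() == "bundled");
}
```

Code outside the namespace has to spell out `jsoncollector::Json::Value` (or add a using-declaration) to get the bundled copy, which makes the choice of implementation explicit at each call site.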

@smorovic
Contributor

The crash is gone in 14_0_6 with the backport applied. I'll open a backport PR as well.

@mmusich
Contributor Author

mmusich commented May 22, 2024

proposed fixes are merged:

@mmusich
Contributor Author

mmusich commented May 22, 2024

+hlt

@mmusich
Contributor Author

mmusich commented May 25, 2024

@cms-sw/daq-l2 this issue could be closed, right?

@smorovic
Contributor

+1
yes

@cmsbuild
Contributor

This issue is fully signed and ready to be closed.

@mmusich
Contributor Author

mmusich commented May 28, 2024

@cmsbuild, please close
