
Replay test with CMSSW_11_3_1 #4574

Closed
wants to merge 1 commit into from

Conversation

@qliphy
Contributor

qliphy commented May 28, 2021

Test of CMSSW_11_3_1 for the MWGR#4 (2-4 June).

Configuration follows #4573.

cms-sw/cmssw#33867 should fix the issue reported in:
https://hypernews.cern.ch/HyperNews/CMS/get/tier0-Ops/2213/1/2/1/1/1/1/1.html

@qliphy
Contributor Author

qliphy commented May 28, 2021

run replay please

@cmsdmwmbot

Container Tests
No container is available now

@cmsdmwmbot

Host name : vocms047.cern.ch
Container ID : 1
Pull request url : #4574
Current build at jenkins : https://cmssdt.cern.ch/dmwm-jenkins/job/DMWM-T0-PR-test-job/282/.
Jira Issue : https://its.cern.ch/jira/browse/CMSTZDEV-658

@cmsdmwmbot

Container Tests
JIRA URL : https://its.cern.ch/jira/browse/CMSTZDEV-658
Tier0_REPLAY v282 DMWM-T0-PR-test-job on vocms047.cern.ch. Replay test with CMSSW_11_3_1
There is NO fileset and job.
Replay might be empty.
All filesets were closed.
There was NO paused job in the replay.
End.
Replay was empty..

@germanfgv
Contributor

Hi @qliphy, at the moment we are experiencing an issue with an Oracle account after the recent upgrade, and we are unable to run replays. It should be fixed during the day, at which point I'll start the replay. I'm sorry for the delay.

@cmsdmwmbot

Host name : vocms047.cern.ch
Container ID : 1
Pull request url : #4574
Current build at jenkins : https://cmssdt.cern.ch/dmwm-jenkins/job/DMWM-T0-PR-test-job/283/.
Jira Issue : https://its.cern.ch/jira/browse/CMSTZDEV-659

@germanfgv
Contributor

I relaunched the replay and it's running right now. I'll let you know how it goes.

@cmsdmwmbot

!!! Couldn't read commit file !!!

@silviodonato

Hi @germanfgv, has the replay test failed?

@germanfgv
Contributor

germanfgv commented May 31, 2021

@silviodonato Hi Silvio. The test was interrupted due to an issue accessing /cvmfs/ on our testing node. @jhonatanamado manually started an additional test using the new candidate GTs. There was a paused job in that test and I'm about to report it in Hypernews.

Also, yesterday I manually started this exact configuration file (with your proposed GTs) on another machine. This test is about to finish.

@germanfgv
Contributor

Hi @germanfgv, has the replay test failed?

Here you can find a detailed description of the error: https://hypernews.cern.ch/HyperNews/CMS/get/tier0-Ops/2223.html

@francescobrivio
Contributor

francescobrivio commented May 31, 2021

@silviodonato Hi Silvio. The test was interrupted due to an issue accessing /cvmfs/ on our testing node. @jhonatanamado manually started an additional test using the new candidate GTs. There was a paused job in that test and I'm about to report it in Hypernews.

Also, yesterday I manually started this exact configuration file (with your proposed GTs) on another machine. This test is about to finish.

Hi @germanfgv, can you confirm the GTs used in your manual test? The HN post you linked here is the wrong (old) one. The correct (new) post is: https://hypernews.cern.ch/HyperNews/CMS/get/calibrations/4408.html

@silviodonato

@silviodonato Hi Silvio. The test was interrupted due to an issue accessing /cvmfs/ on our testing node. [...]

Hi @germanfgv, can you confirm the GTs used in your manual test? [...]

@francescobrivio

In /afs/cern.ch/user/c/cmst0/public/PausedJobs/MWGR4-2021/tarballs/job_11195/75dbd874-62ca-48fa-a4c7-5ec8f51b7451-Periodic-Harvest-1-6-logArchive.tar.gz, specifically in job/WMTaskSpace/cmsRun1/PSet.py, I see 113X_dataRun3_Express_Candidate_2021_05_28_11_56_38

@francescobrivio
Contributor

113X_dataRun3_Express_Candidate_2021_05_28_11_56_38

Thanks @silviodonato! The GT seems to be the correct one!
I saw your message on the HN about the EOS glitch, and I am also running the PSet locally (lxplus) to see if it works fine.

@germanfgv
Contributor

Yes @francescobrivio, we used the new GTs. I provided the wrong link in my comment, I apologize. The HN thread has the correct link.

@silviodonato

alca: @francescobrivio @christopheralanwest @malbouis @pohsun @yuanchao @tlampen
dqm: @jfernan2 @kmaeshima @rvenditti @andrius-k @ErnestaP @ahmad3213
ecal-dpg: @thomreis

Reproducing the replay test, I got this failure related to EcalDQMonitorClient. It is rather urgent because we haven't yet managed to pass the replay test and the MWGR is scheduled for next week...
Do you know the reason for this error?

31-May-2021 10:49:10 CESTSuccessfully opened file root://eoscms.cern.ch//eos/cms/store/backfill/1/express/Tier0_REPLAY_2021/StreamExpressCosmics/DQMIO/Express-v2105300635/000/341/169/00000/FBD2E3DC-C11C-11EB-B90F-DD22B8BCBEEF.root
----- Begin Fatal Exception 31-May-2021 10:56:39 CEST-----------------------
An exception of category 'FatalRootError' occurred while
   [0] Calling InputSource::readRun_
   Additional Info:
      [a] Fatal Root Error: @SUB=TBasket::Streamer
The value of fKeylen is incorrect (-3385) ; trying to recover by setting it to zero

----- End Fatal Exception -------------------------------------------------
%MSG-e EcalDQM:  EcalDQMonitorClient:ecalMonitorClient@endProcessBlock  31-May-2021 10:56:39 CEST post-events
Ecal Monitor Client: Exception in bookMEs @ IntegrityClient
%MSG


A fatal system signal has occurred: segmentation violation
The following is the call stack containing the origin of the signal.

More details at https://hypernews.cern.ch/HyperNews/CMS/get/tier0-Ops/2223/1/1.html

@francescobrivio
Contributor

Reproducing the replay test, I got this failure related to EcalDQMonitorClient. [...]

I am running a few tests:

  1. Use the "old" GT (113X_dataRun3_Express_v1) --> this gives the same crash
  2. Change the ECAL flag that was updated especially for this MWGR (Test ECAL pedestals in next MWGR cms-sw/cmssw#33765) to test ECAL PCL pedestals --> this also gives the same crash

I'm now running all the possible combinations of options 1 and 2...but any input from ECAL or DQM is welcome!

@francescobrivio
Contributor

Reproducing the replay test, I got this failure related to EcalDQMonitorClient. [...]

BTW, just to give everyone a full picture, after the stack trace (related to ECAL I think) I see:

Current Modules:

Module: SiStripBadComponentInfo:siStripBadComponentInfo (crashed)

A fatal system signal has occurred: segmentation violation
Segmentation fault (core dumped)

@silviodonato

Thanks @francescobrivio. Indeed I've just tried to remove the EcalDQMonitorClient by overwriting

process.DQMOfflineCosmics_SecondStepEcal = cms.Sequence(process.es_dqm_client_offline)

and I got the same error. So the error probably comes from SiStripBadComponentInfo.
@mmusich perhaps it is something like cms-sw/cmssw#33867?
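A quick way to test that suspicion would be to drop just that harvester; a minimal sketch, assuming the standard cms.Sequence API, appended to the job's PSet.py:

# Hypothetical check: drop only the suspect harvester from the client sequence
if hasattr(process, 'siStripBadComponentInfo'):
    process.SiStripCosmicDQMClient.remove(process.siStripBadComponentInfo)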

31-May-2021 16:41:55 CESTSuccessfully opened file root://eoscms.cern.ch//eos/cms/store/backfill/1/express/Tier0_REPLAY_2021/StreamExpressCosmics/DQMIO/Express-v2105300635/000/341/169/00000/FBD2E3DC-C11C-11EB-B90F-DD22B8BCBEEF.root
[2021-05-31 16:50:53.006480 +0200][Error  ][PostMaster        ] [p06636710w24543.cern.ch:1095 #0] Forcing error on disconnect: [ERROR] Operation interrupted.
----- Begin Fatal Exception 31-May-2021 16:58:02 CEST-----------------------
An exception of category 'FatalRootError' occurred while
   [0] Calling InputSource::readRun_
   Additional Info:
      [a] Fatal Root Error: @SUB=TBasket::Streamer
The value of fKeylen is incorrect (-7886) ; trying to recover by setting it to zero

----- End Fatal Exception -------------------------------------------------


A fatal system signal has occurred: segmentation violation
The following is the call stack containing the origin of the signal.

Mon May 31 16:58:02 CEST 2021
Thread 7 (Thread 0x7fd179123700 (LWP 22098)):
#0  0x00007fd1c2ff7b3b in do_futex_wait.constprop () from /lib64/libpthread.so.0
#1  0x00007fd1c2ff7bcf in __new_sem_wait_slow.constprop.0 () from /lib64/libpthread.so.0
#2  0x00007fd1c2ff7c6b in sem_wait@@GLIBC_2.2.5 () from /lib64/libpthread.so.0
#3  0x00007fd1bcac9816 in XrdCl::JobManager::RunJobs() () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_0_patch1/external/slc7_amd64_gcc900/lib/libXrdCl.so.2
#4  0x00007fd1bcac98c9 in RunRunnerThread () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_0_patch1/external/slc7_amd64_gcc900/lib/libXrdCl.so.2
#5  0x00007fd1c2ff1ea5 in start_thread () from /lib64/libpthread.so.0
#6  0x00007fd1c2d1a9fd in clone () from /lib64/libc.so.6
Thread 6 (Thread 0x7fd179924700 (LWP 22097)):
#0  0x00007fd1c2ff7b3b in do_futex_wait.constprop () from /lib64/libpthread.so.0
#1  0x00007fd1c2ff7bcf in __new_sem_wait_slow.constprop.0 () from /lib64/libpthread.so.0
#2  0x00007fd1c2ff7c6b in sem_wait@@GLIBC_2.2.5 () from /lib64/libpthread.so.0
#3  0x00007fd1bcac9816 in XrdCl::JobManager::RunJobs() () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_0_patch1/external/slc7_amd64_gcc900/lib/libXrdCl.so.2
#4  0x00007fd1bcac98c9 in RunRunnerThread () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_0_patch1/external/slc7_amd64_gcc900/lib/libXrdCl.so.2
#5  0x00007fd1c2ff1ea5 in start_thread () from /lib64/libpthread.so.0
#6  0x00007fd1c2d1a9fd in clone () from /lib64/libc.so.6
Thread 5 (Thread 0x7fd17a125700 (LWP 22096)):
#0  0x00007fd1c2ff7b3b in do_futex_wait.constprop () from /lib64/libpthread.so.0
#1  0x00007fd1c2ff7bcf in __new_sem_wait_slow.constprop.0 () from /lib64/libpthread.so.0
#2  0x00007fd1c2ff7c6b in sem_wait@@GLIBC_2.2.5 () from /lib64/libpthread.so.0
#3  0x00007fd1bcac9816 in XrdCl::JobManager::RunJobs() () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_0_patch1/external/slc7_amd64_gcc900/lib/libXrdCl.so.2
#4  0x00007fd1bcac98c9 in RunRunnerThread () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_0_patch1/external/slc7_amd64_gcc900/lib/libXrdCl.so.2
#5  0x00007fd1c2ff1ea5 in start_thread () from /lib64/libpthread.so.0
#6  0x00007fd1c2d1a9fd in clone () from /lib64/libc.so.6
Thread 4 (Thread 0x7fd17a926700 (LWP 22095)):
#0  0x00007fd1c2ff8e9d in nanosleep () from /lib64/libpthread.so.0
#1  0x00007fd1bcee96ff in XrdSysTimer::Wait(int) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_0_patch1/external/slc7_amd64_gcc900/lib/libXrdUtils.so.2
#2  0x00007fd1bca6bb0a in XrdCl::TaskManager::RunTasks() () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_0_patch1/external/slc7_amd64_gcc900/lib/libXrdCl.so.2
#3  0x00007fd1bca6bc69 in RunRunnerThread () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_0_patch1/external/slc7_amd64_gcc900/lib/libXrdCl.so.2
#4  0x00007fd1c2ff1ea5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007fd1c2d1a9fd in clone () from /lib64/libc.so.6
Thread 3 (Thread 0x7fd17b127700 (LWP 22094)):
#0  0x00007fd1c2d1afd3 in epoll_wait () from /lib64/libc.so.6
#1  0x00007fd1bcee3ed2 in XrdSys::IOEvents::PollE::Begin(XrdSysSemaphore*, int&, char const**) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_0_patch1/external/slc7_amd64_gcc900/lib/libXrdUtils.so.2
#2  0x00007fd1bcee078d in XrdSys::IOEvents::BootStrap::Start(void*) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_0_patch1/external/slc7_amd64_gcc900/lib/libXrdUtils.so.2
#3  0x00007fd1bcee8f68 in XrdSysThread_Xeq () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_0_patch1/external/slc7_amd64_gcc900/lib/libXrdUtils.so.2
#4  0x00007fd1c2ff1ea5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007fd1c2d1a9fd in clone () from /lib64/libc.so.6
Thread 2 (Thread 0x7fd19fb24700 (LWP 21767)):
#0  0x00007fd1c2ff91d9 in waitpid () from /lib64/libpthread.so.0
#1  0x00007fd1ba1918d7 in edm::service::cmssw_stacktrace_fork() () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_0/lib/slc7_amd64_gcc900/pluginFWCoreServicesPlugins.so
#2  0x00007fd1ba19249a in edm::service::InitRootHandlers::stacktraceHelperThread() () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_0/lib/slc7_amd64_gcc900/pluginFWCoreServicesPlugins.so
#3  0x00007fd1c35f1af0 in std::execute_native_thread_routine (__p=0x7fd1a5e7f500) at ../../../../../libstdc++-v3/src/c++11/thread.cc:80
#4  0x00007fd1c2ff1ea5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007fd1c2d1a9fd in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x7fd1c12f3540 (LWP 21655)):
#0  0x00007fd1c2d0fccd in poll () from /lib64/libc.so.6
#1  0x00007fd1ba191cd7 in full_read.constprop () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_0/lib/slc7_amd64_gcc900/pluginFWCoreServicesPlugins.so
#2  0x00007fd1ba19256c in edm::service::InitRootHandlers::stacktraceFromThread() () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_0/lib/slc7_amd64_gcc900/pluginFWCoreServicesPlugins.so
#3  0x00007fd1ba193922 in sig_dostack_then_abort () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_0/lib/slc7_amd64_gcc900/pluginFWCoreServicesPlugins.so
#4  <signal handler called>
#5  0x00007fd1c58e7cc6 in std::vector<unsigned int, std::allocator<unsigned int> >::operator=(std::vector<unsigned int, std::allocator<unsigned int> > const&) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_0/lib/slc7_amd64_gcc900/libFWCoreFramework.so
#6  0x00007fd198b5a72a in SiStripQuality::SiStripQuality(SiStripQuality const&) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_0/lib/slc7_amd64_gcc900/libCalibFormatsSiStripObjects.so
#7  0x00007fd18ebc15ae in SiStripBadComponentInfo::dqmEndJob(dqm::implementation::IBooker&, dqm::implementation::IGetter&) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_0/lib/slc7_amd64_gcc900/pluginDQMSiStripMonitorClientPlugins.so
#8  0x00007fd18ebc2fd4 in non-virtual thunk to DQMEDHarvester::endProcessBlockProduce(edm::ProcessBlock&) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_0/lib/slc7_amd64_gcc900/pluginDQMSiStripMonitorClientPlugins.so
#9  0x00007fd1c58f4518 in edm::one::EDProducerBase::doEndProcessBlock(edm::ProcessBlockPrincipal const&, edm::ModuleCallingContext const*) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_0/lib/slc7_amd64_gcc900/libFWCoreFramework.so
#10 0x00007fd1c58d1b90 in edm::WorkerT<edm::one::EDProducerBase>::implDoEndProcessBlock(edm::ProcessBlockPrincipal const&, edm::ModuleCallingContext const*) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_0/lib/slc7_amd64_gcc900/libFWCoreFramework.so
#11 0x00007fd1c57dea67 in decltype ({parm#1}()) edm::convertException::wrap<edm::Worker::runModule<edm::OccurrenceTraits<edm::ProcessBlockPrincipal, (edm::BranchActionType)3> >(edm::OccurrenceTraits<edm::ProcessBlockPrincipal, (edm::BranchActionType)3>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::ProcessBlockPrincipal, (edm::BranchActionType)3>::Context const*)::{lambda()#1}>(edm::Worker::runModule<edm::OccurrenceTraits<edm::ProcessBlockPrincipal, (edm::BranchActionType)3> >(edm::OccurrenceTraits<edm::ProcessBlockPrincipal, (edm::BranchActionType)3>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::ProcessBlockPrincipal, (edm::BranchActionType)3>::Context const*)::{lambda()#1}) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_0/lib/slc7_amd64_gcc900/libFWCoreFramework.so
#12 0x00007fd1c57dec6d in bool edm::Worker::runModule<edm::OccurrenceTraits<edm::ProcessBlockPrincipal, (edm::BranchActionType)3> >(edm::OccurrenceTraits<edm::ProcessBlockPrincipal, (edm::BranchActionType)3>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::ProcessBlockPrincipal, (edm::BranchActionType)3>::Context const*) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_0/lib/slc7_amd64_gcc900/libFWCoreFramework.so
#13 0x00007fd1c57def16 in std::__exception_ptr::exception_ptr edm::Worker::runModuleAfterAsyncPrefetch<edm::OccurrenceTraits<edm::ProcessBlockPrincipal, (edm::BranchActionType)3> >(std::__exception_ptr::exception_ptr const*, edm::OccurrenceTraits<edm::ProcessBlockPrincipal, (edm::BranchActionType)3>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::ProcessBlockPrincipal, (edm::BranchActionType)3>::Context const*) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_0/lib/slc7_amd64_gcc900/libFWCoreFramework.so
#14 0x00007fd1c57df330 in void edm::SerialTaskQueueChain::actionToRun<edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::ProcessBlockPrincipal, (edm::BranchActionType)3> >::execute()::{lambda()#1}&>(edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::ProcessBlockPrincipal, (edm::BranchActionType)3> >::execute()::{lambda()#1}&) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_0/lib/slc7_amd64_gcc900/libFWCoreFramework.so
#15 0x00007fd1c57df4e1 in edm::SerialTaskQueue::QueuedTask<edm::SerialTaskQueueChain::push<edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::ProcessBlockPrincipal, (edm::BranchActionType)3> >::execute()::{lambda()#1}&>(tbb::task_group&, edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::ProcessBlockPrincipal, (edm::BranchActionType)3> >::execute()::{lambda()#1}&)::{lambda()#1}>::execute() () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_0/lib/slc7_amd64_gcc900/libFWCoreFramework.so
#16 0x00007fd1c5a77f49 in tbb::internal::function_task<edm::SerialTaskQueue::spawn(edm::SerialTaskQueue::TaskBase&)::{lambda()#1}>::execute() () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_0/lib/slc7_amd64_gcc900/libFWCoreConcurrency.so
#17 0x00007fd1c404126d in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::process_bypass_loop (this=this@entry=0x7fd1bfd53200, context_guard=..., t=0x7fd17b4e5540, isolation=isolation@entry=0) at ../../include/tbb/task.h:1003
#18 0x00007fd1c4041756 in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all (this=0x7fd1bfd53200, parent=..., child=<optimized out>) at ../../src/tbb/scheduler.h:650
#19 0x00007fd1c57ac618 in edm::EventProcessor::endProcessBlock(bool, bool) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_0/lib/slc7_amd64_gcc900/libFWCoreFramework.so
#20 0x00007fd1c576dc44 in edm::EventProcessor::runToCompletion() [clone .cold] () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_0/lib/slc7_amd64_gcc900/libFWCoreFramework.so
#21 0x000000000040bae6 in tbb::interface7::internal::delegated_function<main::{lambda()#1}::operator()() const::{lambda()#1} const, void>::operator()() const ()
#22 0x00007fd1c403a552 in tbb::interface7::internal::task_arena_base::internal_execute (this=0x7ffc036ac590, d=...) at ../../src/tbb/arena.cpp:1105
#23 0x000000000040ca13 in main::{lambda()#1}::operator()() const ()
#24 0x000000000040b62c in main ()

Current Modules:

Module: SiStripBadComponentInfo:siStripBadComponentInfo (crashed)

A fatal system signal has occurred: segmentation violation

@mmusich
Contributor

mmusich commented May 31, 2021

@silviodonato @francescobrivio
Please post a full recipe to reproduce the error at #4574 (comment)

@silviodonato

Thanks @mmusich! Let me also point to PR cms-sw/cmssw#32677 to keep track of the issue.

@francescobrivio
Contributor

@silviodonato @francescobrivio
Please post a full recipe to reproduce the error at #4574 (comment)

Here is the recipe:

cmsrel CMSSW_11_3_1
cd CMSSW_11_3_1/src
cmsenv
scram b -j 8
cp /afs/cern.ch/user/c/cmst0/public/PausedJobs/MWGR4-2021/tarballs/job_11195/75dbd874-62ca-48fa-a4c7-5ec8f51b7451-Periodic-Harvest-1-6-logArchive.tar.gz . 
tar -zxvf 75dbd874-62ca-48fa-a4c7-5ec8f51b7451-Periodic-Harvest-1-6-logArchive.tar.gz 
cd job/WMTaskSpace/cmsRun1/
cmsRun -e PSet.py

@germanfgv
Contributor

germanfgv commented May 31, 2021

My replay, using the GTs in this PR, finished with a similar error. More details can be found in HN: https://hypernews.cern.ch/HyperNews/CMS/get/tier0-Ops/2223/1/1/1.html

@francescobrivio
Contributor

francescobrivio commented May 31, 2021

My replay, using the GTs in this PR, finished with a similar error. More details can be found in HN: https://hypernews.cern.ch/HyperNews/CMS/get/tier0-Ops/2223/1/1/1.html

Hi German, thanks for the additional info!
I just want to point out that the file you report in the HN (FBD2E3DC-C11C-11EB-B90F-DD22B8BCBEEF.root) is the last one in the fileList of the PSet.py, so perhaps that's why it always happens after that file? Not sure if this helps somehow...

@pieterdavid

With the recipe from #4574 (comment) I could reproduce the problem in #4574 (comment); I'm investigating.

@mmusich
Contributor

mmusich commented May 31, 2021

Placing some random couts in the code, I am actually getting a different (ROOT-related) exception:

%MSG
----- Begin Fatal Exception 31-May-2021 18:37:19 CEST-----------------------
An exception of category 'FatalRootError' occurred while
   [0] Calling InputSource::readLuminosityBlock_
   Additional Info:
      [a] Fatal Root Error: @SUB=TStorageFactoryFile::ReadBuffer
read from Storage::xread returned 0. Asked to read n bytes: 531035723 from offset: 1248034286 with file size: 1248034286

----- End Fatal Exception -------------------------------------------------

@cmsdmwmbot

Host name : vocms047.cern.ch
Container ID : None
Pull request url : #4574
Current build at jenkins : https://cmssdt.cern.ch/dmwm-jenkins/job/DMWM-T0-PR-test-job/285/.
Jira Issue : https://its.cern.ch/jira/browse/CMSTZDEV-660

@silviodonato

I tried to remove process.siStripBadComponentInfo by overwriting

process.SiStripCosmicDQMClient = cms.Sequence(process.siStripQTester+process.siStripOfflineAnalyser)

and I've got

31-May-2021 18:35:09 CESTSuccessfully opened file root://eoscms.cern.ch//eos/cms/store/backfill/1/express/Tier0_REPLAY_2021/StreamExpressCosmics/DQMIO/Express-v2105300635/000/341/169/00000/F4114984-C10F-11EB-9983-00B5B8BCBEEF.root
%MSG-w XrdAdaptor:  AfterModBeginStream 31-May-2021 18:35:11 CEST BeforeEvents
Data is served from cern.ch instead of original site eoscms
%MSG
31-May-2021 18:35:11 CESTSuccessfully opened file root://eoscms.cern.ch//eos/cms/store/backfill/1/express/Tier0_REPLAY_2021/StreamExpressCosmics/DQMIO/Express-v2105300635/000/341/169/00000/FBD2E3DC-C11C-11EB-B90F-DD22B8BCBEEF.root
%MSG-e HLTConfigProvider:  EgHLTOfflineClient:egHLTOffDQMClient@beginRun  31-May-2021 18:55:44 CEST Run: 341169
Falling back to ProcessName-only init using ProcessName 'HLT' !
%MSG
%MSG-e HLTConfigProvider:  EgHLTOfflineClient:egHLTOffDQMClient@beginRun  31-May-2021 18:55:45 CEST Run: 341169
 Process name 'HLT' not found in registry!
%MSG
%MSG-e HLTConfigProvider:   EgHLTOfflineSummaryClient:egHLTOffDQMSummaryClient@beginRun  31-May-2021 18:55:45 CEST Run: 341169
Falling back to ProcessName-only init using ProcessName 'HLT' !
%MSG
%MSG-e HLTConfigProvider:   EgHLTOfflineSummaryClient:egHLTOffDQMSummaryClient@beginRun  31-May-2021 18:55:45 CEST Run: 341169
 Process name 'HLT' not found in registry!
%MSG


A fatal system signal has occurred: segmentation violation
The following is the call stack containing the origin of the signal.

Mon May 31 18:59:03 CEST 2021
Thread 7 (Thread 0x7f791c263700 (LWP 31374)):
#0  0x00007f796613ab3b in do_futex_wait.constprop () from /lib64/libpthread.so.0
#1  0x00007f796613abcf in __new_sem_wait_slow.constprop.0 () from /lib64/libpthread.so.0
#2  0x00007f796613ac6b in sem_wait@@GLIBC_2.2.5 () from /lib64/libpthread.so.0
#3  0x00007f79606cb816 in XrdCl::JobManager::RunJobs() () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_1/external/slc7_amd64_gcc900/lib/libXrdCl.so.2
#4  0x00007f79606cb8c9 in RunRunnerThread () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_1/external/slc7_amd64_gcc900/lib/libXrdCl.so.2
#5  0x00007f7966134ea5 in start_thread () from /lib64/libpthread.so.0
#6  0x00007f7965e5d9fd in clone () from /lib64/libc.so.6
Thread 6 (Thread 0x7f791ca64700 (LWP 31373)):
#0  0x00007f796613ab3b in do_futex_wait.constprop () from /lib64/libpthread.so.0
#1  0x00007f796613abcf in __new_sem_wait_slow.constprop.0 () from /lib64/libpthread.so.0
#2  0x00007f796613ac6b in sem_wait@@GLIBC_2.2.5 () from /lib64/libpthread.so.0
#3  0x00007f79606cb816 in XrdCl::JobManager::RunJobs() () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_1/external/slc7_amd64_gcc900/lib/libXrdCl.so.2
#4  0x00007f79606cb8c9 in RunRunnerThread () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_1/external/slc7_amd64_gcc900/lib/libXrdCl.so.2
#5  0x00007f7966134ea5 in start_thread () from /lib64/libpthread.so.0
#6  0x00007f7965e5d9fd in clone () from /lib64/libc.so.6
Thread 5 (Thread 0x7f791d265700 (LWP 31372)):
#0  0x00007f796613ab3b in do_futex_wait.constprop () from /lib64/libpthread.so.0
#1  0x00007f796613abcf in __new_sem_wait_slow.constprop.0 () from /lib64/libpthread.so.0
#2  0x00007f796613ac6b in sem_wait@@GLIBC_2.2.5 () from /lib64/libpthread.so.0
#3  0x00007f79606cb816 in XrdCl::JobManager::RunJobs() () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_1/external/slc7_amd64_gcc900/lib/libXrdCl.so.2
#4  0x00007f79606cb8c9 in RunRunnerThread () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_1/external/slc7_amd64_gcc900/lib/libXrdCl.so.2
#5  0x00007f7966134ea5 in start_thread () from /lib64/libpthread.so.0
#6  0x00007f7965e5d9fd in clone () from /lib64/libc.so.6
Thread 4 (Thread 0x7f791da66700 (LWP 31371)):
#0  0x00007f796613be9d in nanosleep () from /lib64/libpthread.so.0
#1  0x00007f79602f16ff in XrdSysTimer::Wait(int) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_1/external/slc7_amd64_gcc900/lib/libXrdUtils.so.2
#2  0x00007f796066db0a in XrdCl::TaskManager::RunTasks() () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_1/external/slc7_amd64_gcc900/lib/libXrdCl.so.2
#3  0x00007f796066dc69 in RunRunnerThread () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_1/external/slc7_amd64_gcc900/lib/libXrdCl.so.2
#4  0x00007f7966134ea5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007f7965e5d9fd in clone () from /lib64/libc.so.6
Thread 3 (Thread 0x7f791e267700 (LWP 31370)):
#0  0x00007f7965e5dfd3 in epoll_wait () from /lib64/libc.so.6
#1  0x00007f79602ebed2 in XrdSys::IOEvents::PollE::Begin(XrdSysSemaphore*, int&, char const**) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_1/external/slc7_amd64_gcc900/lib/libXrdUtils.so.2
#2  0x00007f79602e878d in XrdSys::IOEvents::BootStrap::Start(void*) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_1/external/slc7_amd64_gcc900/lib/libXrdUtils.so.2
#3  0x00007f79602f0f68 in XrdSysThread_Xeq () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_1/external/slc7_amd64_gcc900/lib/libXrdUtils.so.2
#4  0x00007f7966134ea5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007f7965e5d9fd in clone () from /lib64/libc.so.6
Thread 2 (Thread 0x7f79433d2700 (LWP 31346)):
#0  0x00007f796613c1d9 in waitpid () from /lib64/libpthread.so.0
#1  0x00007f795f1568d7 in edm::service::cmssw_stacktrace_fork() () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_1/lib/slc7_amd64_gcc900/pluginFWCoreServicesPlugins.so
#2  0x00007f795f15749a in edm::service::InitRootHandlers::stacktraceHelperThread() () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_1/lib/slc7_amd64_gcc900/pluginFWCoreServicesPlugins.so
#3  0x00007f7966734af0 in std::execute_native_thread_routine (__p=0x7f79498a84e0) at ../../../../../libstdc++-v3/src/c++11/thread.cc:80
#4  0x00007f7966134ea5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007f7965e5d9fd in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x7f7964437540 (LWP 31150)):
#0  0x00007f7965e52ccd in poll () from /lib64/libc.so.6
#1  0x00007f795f156cd7 in full_read.constprop () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_1/lib/slc7_amd64_gcc900/pluginFWCoreServicesPlugins.so
#2  0x00007f795f15756c in edm::service::InitRootHandlers::stacktraceFromThread() () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_1/lib/slc7_amd64_gcc900/pluginFWCoreServicesPlugins.so
#3  0x00007f795f158922 in sig_dostack_then_abort () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_1/lib/slc7_amd64_gcc900/pluginFWCoreServicesPlugins.so
#4  <signal handler called>
#5  0x00007f79325ec761 in DQMRootSource::readElements() () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_1/lib/slc7_amd64_gcc900/pluginDQMServicesFwkIOPlugins.so
#6  0x00007f79325ece20 in DQMRootSource::readLuminosityBlock_(edm::LuminosityBlockPrincipal&) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_1/lib/slc7_amd64_gcc900/pluginDQMServicesFwkIOPlugins.so
#7  0x00007f796895df2b in decltype ({parm#1}()) edm::convertException::wrap<edm::callWithTryCatchAndPrint<void>(std::function<void ()>, char const*, bool)::{lambda()#1}>(edm::callWithTryCatchAndPrint<void>(std::function<void ()>, char const*, bool)::{lambda()#1}) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_1/lib/slc7_amd64_gcc900/libFWCoreFramework.so
#8  0x00007f796895e03c in void edm::callWithTryCatchAndPrint<void>(std::function<void ()>, char const*, bool) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_1/lib/slc7_amd64_gcc900/libFWCoreFramework.so
#9  0x00007f796895a826 in edm::InputSource::readLuminosityBlock(edm::LuminosityBlockPrincipal&, edm::HistoryAppender&) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_1/lib/slc7_amd64_gcc900/libFWCoreFramework.so
#10 0x00007f79688ed76a in edm::EventProcessor::readLuminosityBlock(edm::LuminosityBlockProcessingStatus&) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_1/lib/slc7_amd64_gcc900/libFWCoreFramework.so
#11 0x00007f79688f58d2 in edm::EventProcessor::beginLumiAsync(edm::IOVSyncValue const&, std::shared_ptr<void> const&, edm::WaitingTaskHolder)::{lambda(edm::LimitedTaskQueue::Resumer)#1}::operator()(edm::LimitedTaskQueue::Resumer)::{lambda()#1}::operator()() () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_1/lib/slc7_amd64_gcc900/libFWCoreFramework.so
#12 0x00007f79688f5f01 in edm::SerialTaskQueue::QueuedTask<edm::SerialTaskQueueChain::push<edm::EventProcessor::beginLumiAsync(edm::IOVSyncValue const&, std::shared_ptr<void> const&, edm::WaitingTaskHolder)::{lambda(edm::LimitedTaskQueue::Resumer)#1}::operator()(edm::LimitedTaskQueue::Resumer)::{lambda()#1}>(tbb::task_group&, edm::EventProcessor::beginLumiAsync(edm::IOVSyncValue const&, std::shared_ptr<void> const&, edm::WaitingTaskHolder)::{lambda(edm::LimitedTaskQueue::Resumer)#1}::operator()(edm::LimitedTaskQueue::Resumer)::{lambda()#1}&&)::{lambda()#1}>::execute() () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_1/lib/slc7_amd64_gcc900/libFWCoreFramework.so
#13 0x00007f7968bbbf49 in tbb::internal::function_task<edm::SerialTaskQueue::spawn(edm::SerialTaskQueue::TaskBase&)::{lambda()#1}>::execute() () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_1/lib/slc7_amd64_gcc900/libFWCoreConcurrency.so
#14 0x00007f796718426d in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::process_bypass_loop (this=this@entry=0x7f7962ebf200, context_guard=..., t=0x7f791e5ead40, isolation=isolation@entry=0) at ../../include/tbb/task.h:1003
#15 0x00007f7967184756 in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all (this=0x7f7962ebf200, parent=..., child=<optimized out>) at ../../src/tbb/scheduler.h:650
#16 0x00007f79688ec058 in edm::EventProcessor::processLumis(std::shared_ptr<void> const&) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_1/lib/slc7_amd64_gcc900/libFWCoreFramework.so
#17 0x00007f79688f5035 in edm::EventProcessor::runToCompletion() () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_1/lib/slc7_amd64_gcc900/libFWCoreFramework.so
#18 0x000000000040bae6 in tbb::interface7::internal::delegated_function<main::{lambda()#1}::operator()() const::{lambda()#1} const, void>::operator()() const ()
#19 0x00007f796717d552 in tbb::interface7::internal::task_arena_base::internal_execute (this=0x7ffedd405130, d=...) at ../../src/tbb/arena.cpp:1105
#20 0x000000000040ca13 in main::{lambda()#1}::operator()() const ()
#21 0x000000000040b62c in main ()

Current Modules:

Module: none (crashed)

A fatal system signal has occurred: segmentation violation

@cmsdmwmbot

Container Tests
JIRA URL : https://its.cern.ch/jira/browse/CMSTZDEV-660
Tier0_REPLAY v285 DMWM-T0-PR-test-job on vocms047.cern.ch. Replay test with CMSSW_11_3_1

@silviodonato

Placing some random couts in the code, I am actually getting a different (ROOT-related) exception: [...]

@Dr15Jones @smuzaffar @makortel do you have any idea about this?

@silviodonato

@jfernan2 @kmaeshima @rvenditti @andrius-k @ErnestaP @ahmad3213
Is it normal that this step
cmsDriver.py singleTest --conditions 113X_dataRun3_Express_Candidate_2021_05_28_11_56_38 --filein=root://eoscms.cern.ch//eos/cms/store/backfill/1/express/Tier0_REPLAY_2021/StreamExpressCosmics/DQMIO/Express-v2105300635/000/341/169/00000/FBD2E3DC-C11C-11EB-B90F-DD22B8BCBEEF.root --scenario cosmics --filetype DQM -s HARVESTING:dqmHarvestingFakeHLT --data --era Run3 -n 100
takes up to 2GB of memory?
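For reference, one quick way to measure the peak memory of such a job (a sketch; GNU time is assumed available on the node, and PSet.py is the harvesting config from the tarball above):

# prints "Maximum resident set size" (peak RSS) at the end of the job
/usr/bin/time -v cmsRun PSet.py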

@jfernan2

jfernan2 commented Jun 1, 2021

@jfernan2 @kmaeshima @rvenditti @andrius-k @ErnestaP @ahmad3213
Is it normal that this step [...] takes up to 2GB of memory?

I do not know if it is normal, since we have not monitored memory at Production level for Cosmics Harvesting. All I can say is that, despite being Cosmics, almost all DPGs/POGs (except Trigger, due to the FakeHLT menu) are involved in this Harvesting process:
https://github.com/cms-sw/cmssw/blob/6d2f66057131baacc2fcbdd203588c41c885b42c/DQMOffline/Configuration/python/DQMOfflineCosmics_SecondStep_cff.py#L60-L92

We can do a check with igprof to see if there is any module out of control.
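A sketch of such a check, assuming the usual igprof memory-profiling workflow (igprof available in the CMSSW environment; PSet.py is the harvesting config from the tarball above):

# collect a memory profile of the harvesting job
igprof -d -mp -z -o igprof.mp.gz cmsRun PSet.py
# rank modules/functions by live memory to spot any module out of control
igprof-analyse -d -v -g -r MEM_LIVE igprof.mp.gz > igreport_mem.txt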

@silviodonato

I made some tests reducing the number of files. Running cmsDriver.py singleTest --conditions 113X_dataRun3_Express_Candidate_2021_05_28_11_56_38 --filein=root://eoscms.cern.ch//eos/cms/store/backfill/1/express/Tier0_REPLAY_2021/StreamExpressCosmics/DQMIO/Express-v2105300635/000/341/169/00000/FBD2E3DC-C11C-11EB-B90F-DD22B8BCBEEF.root --scenario cosmics --filetype DQM -s HARVESTING:dqmHarvestingFakeHLT --data --era Run3 on all files contained in /eos/cms/store/backfill/1/express/Tier0_REPLAY_2021/StreamExpressCosmics/DQMIO/Express-v2105300635/000/341/169/00000/ (198 files) reproduces the error.

However, if you run on a subset of ~10 files the job is OK. If you run on ~50 files, the job is sometimes OK and sometimes crashes.

@germanfgv is there a possibility to split the job into ~5 parts?
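For the manual subset tests above, a minimal sketch of how the input list can be sliced by hand in the PSet.py (chunk size and index are illustrative; cms is the usual FWCore.ParameterSet.Config import already present in the PSet):

# run only one chunk of the input files, emulating a split of the job
chunkSize = 50   # illustrative: ~5 chunks of the 198 files
chunkIndex = 0   # which chunk to run (0, 1, 2, ...)
allFiles = list(process.source.fileNames)
process.source.fileNames = cms.untracked.vstring(
    allFiles[chunkIndex * chunkSize:(chunkIndex + 1) * chunkSize])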

@germanfgv
Contributor

However, if you run on a subset of ~10 files the job is OK. If you run on ~50 files, the job is sometimes OK and sometimes crashes.

@germanfgv is there a possibility to split the job into ~5 parts?

To be honest, that's not something I have ever done with our regular workflows. I guess I'll have to mess with our JobSplitter logic. I don't think it is something we can modify that easily. Let me check with the rest of the team to see what we can do.

@silviodonato

To be honest, that's not something I have ever done with our regular workflows. [...] Let me check with the rest of the team to see what we can do.

Thanks German, this would be only an "emergency" solution. Of course we want to understand and fix the bug.

@germanfgv
Contributor

@silviodonato could this be an issue with EOS? We have recently had problems reading files. Are we sure that's not what's happening here?

@silviodonato

@silviodonato could this be an issue with EOS? We have recently had problems reading files. Are we sure that's not what's happening here?

Yes, they seem to be problems with EOS, but I thought it was OK now. I'll try again with a local test.

@silviodonato

silviodonato commented Jun 1, 2021

@germanfgv and all: Yes, I can confirm this is a problem with EOS.
I copied all the files to my AFS area, then added the following to the PSet.py

for i, f in enumerate(process.source.fileNames): 
    process.source.fileNames[i] = process.source.fileNames[i].replace("/store/backfill/1/express/Tier0_REPLAY_2021/StreamExpressCosmics/DQMIO/Express-v2105300635/000/341/169/00000/","file:///afs/cern.ch/user/s/sdonato/AFSwork/public/Tier0_REPLAY_2021_StreamExpressCosmics_DQMIO/")

to pick the local files instead of the files on EOS.
The job worked properly, and I also got the output file DQM_V0001_R000341169__StreamExpressCosmics__Tier0_REPLAY_2021-Express-v2105300635__DQMIO.root
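For reference, the copy step itself can be sketched as follows (xrdfs/xrdcp assumed available; the AFS target is the same directory used in the snippet above):

SRC=/eos/cms/store/backfill/1/express/Tier0_REPLAY_2021/StreamExpressCosmics/DQMIO/Express-v2105300635/000/341/169/00000
DST=/afs/cern.ch/user/s/sdonato/AFSwork/public/Tier0_REPLAY_2021_StreamExpressCosmics_DQMIO
# list the run's DQMIO files on EOS and copy each one to AFS
for f in $(xrdfs root://eoscms.cern.ch ls $SRC); do
    xrdcp root://eoscms.cern.ch/$f $DST/
done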

I'll send a message to the HN. Thanks everybody for the help!

@dmwm deleted a comment from cmsdmwmbot Jun 1, 2021
@germanfgv
Contributor

@silviodonato We have an open ticket with EOS discussing problems that we had during MWGR#3. I'm reporting this in that ticket.

The configuration of the T2 backfill area was changed recently, and that may be the cause of the problem. I'll retry the replay using our new T0 storage site, where this should not be a problem.

@Dr15Jones

do you have any idea about this?

All the fatal exception messages are saying that the PoolSource requested ROOT to read some more of the file and what ROOT got back was basically 'broken'. That completely fits with an EOS problem where the remote file reading was giving back garbage.
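One way to check that hypothesis directly is to compare the checksum of the bytes read back over xrootd with the checksum EOS has on record; a sketch, using one of the files from the failing job:

F=/eos/cms/store/backfill/1/express/Tier0_REPLAY_2021/StreamExpressCosmics/DQMIO/Express-v2105300635/000/341/169/00000/FBD2E3DC-C11C-11EB-B90F-DD22B8BCBEEF.root
# adler32 computed over the bytes actually streamed back
xrdadler32 root://eoscms.cern.ch/$F
# adler32 stored by EOS for the file; a mismatch means corrupted reads
xrdfs root://eoscms.cern.ch query checksum $F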

@cmsdmwmbot

Container Tests
JIRA URL : https://its.cern.ch/jira/browse/CMSTZDEV-661
Tier0_REPLAY v286 DMWM-T0-PR-test-job on vocms047.cern.ch. Replay test with CMSSW_11_3_1
Filesets had no change for {} hours..
Filesets had no change for {} hours..
Filesets had no change for {} hours..
Filesets had no change for {} hours..
Filesets had no change for {} hours..
Filesets had no change for {} hours..
Filesets had no change for {} hours..
Filesets had no change for {} hours.. Replay will be aborted by time out.
Workflow failed
ORA-00942: table or view does not exist

@germanfgv
Contributor

I'll close this PR, as testing for the MWGR has moved to #4577.

@germanfgv closed this Jun 8, 2021