
Replay test with CMSSW_11_3_1 #4574

Closed
wants to merge 1 commit into from

Conversation

@qliphy
Contributor

qliphy commented May 28, 2021

Test of CMSSW_11_3_1 for the MWGR#4 (2-4 June).

Configuration follows #4573.

cms-sw/cmssw#33867 should fix the issue reported in:
https://hypernews.cern.ch/HyperNews/CMS/get/tier0-Ops/2213/1/2/1/1/1/1/1.html

@qliphy
Contributor Author

qliphy commented May 28, 2021

run replay please

@cmsdmwmbot

Container Tests
No container is available now

@cmsdmwmbot

Host name : vocms047.cern.ch
Container ID : 1
Pull request url : #4574
Current build at jenkins : https://cmssdt.cern.ch/dmwm-jenkins/job/DMWM-T0-PR-test-job/282/.
Jira Issue : https://its.cern.ch/jira/browse/CMSTZDEV-658

@cmsdmwmbot

Container Tests
JIRA URL : https://its.cern.ch/jira/browse/CMSTZDEV-658
Tier0_REPLAY v282 DMWM-T0-PR-test-job on vocms047.cern.ch. Replay test with CMSSW_11_3_1
There is NO fileset and job.
Replay might be empty.
All filesets were closed.
There was NO paused job in the replay.
End.
Replay was empty..

@germanfgv
Contributor

Hi @qliphy, at the moment we are experiencing an issue with an Oracle account after the recent upgrade, and we are unable to run replays. It should be fixed during the day, at which point I'll start the replay. I'm sorry for the delay.

@cmsdmwmbot

Host name : vocms047.cern.ch
Container ID : 1
Pull request url : #4574
Current build at jenkins : https://cmssdt.cern.ch/dmwm-jenkins/job/DMWM-T0-PR-test-job/283/.
Jira Issue : https://its.cern.ch/jira/browse/CMSTZDEV-659

@germanfgv
Contributor

I relaunched the replay and it's running right now. I'll let you know how it goes.

@cmsdmwmbot

!!! Couldn't read commit file !!!

@silviodonato

Hi @germanfgv, has the replay test failed?

@germanfgv
Contributor

germanfgv commented May 31, 2021

@silviodonato Hi Silvio. The test was interrupted due to an issue accessing /cvmfs/ on our testing node. @jhonatanamado manually started an additional test using the new candidate GTs. There was a paused job in that test and I'm about to report it in Hypernews.

Also, yesterday I manually started this exact configuration file (with your proposed GTs) on another machine. This test is about to finish.

@germanfgv
Contributor

Hi @germanfgv, has the replay test failed?

Here you can find a detailed description of the error: https://hypernews.cern.ch/HyperNews/CMS/get/tier0-Ops/2223.html

@francescobrivio
Contributor

francescobrivio commented May 31, 2021

@silviodonato Hi Silvio. The test was interrupted due to an issue accessing /cvmfs/ on our testing node. @jhonatanamado manually started an additional test using the new candidate GTs. There was a paused job in that test and I'm about to report it in Hypernews.

Also, yesterday I manually started this exact configuration file (with your proposed GTs) on another machine. This test is about to finish.

Hi @germanfgv, can you confirm the GTs used in your manual test? The HN post you linked here is the wrong (old) one. The correct (new) post is: https://hypernews.cern.ch/HyperNews/CMS/get/calibrations/4408.html

@silviodonato

@silviodonato Hi Silvio. The test was interrupted due to an issue accessing /cvmfs/ on our testing node. [...]

Hi @germanfgv, can you confirm the GTs used in your manual test? [...]

@francescobrivio

In /afs/cern.ch/user/c/cmst0/public/PausedJobs/MWGR4-2021/tarballs/job_11195/75dbd874-62ca-48fa-a4c7-5ec8f51b7451-Periodic-Harvest-1-6-logArchive.tar.gz, specifically in job/WMTaskSpace/cmsRun1/PSet.py, I see 113X_dataRun3_Express_Candidate_2021_05_28_11_56_38

@francescobrivio
Contributor

113X_dataRun3_Express_Candidate_2021_05_28_11_56_38

Thanks @silviodonato! The GT seems to be the correct one!
I saw your message on the HN about the EOS glitch, and I am also running the PSet locally (lxplus) to see if it works fine.

@germanfgv
Contributor

Yes @francescobrivio, we used the new GTs. I provided the wrong link in my comment, I apologize. The HN thread has the correct link.

@silviodonato

alca: @francescobrivio @christopheralanwest @malbouis @pohsun @yuanchao @tlampen
dqm: @jfernan2 @kmaeshima @rvenditti @andrius-k @ErnestaP @ahmad3213
ecal-dpg: @thomreis

Reproducing the replay test, I got this failure related to EcalDQMonitorClient. It is rather urgent because we haven't yet managed to pass the replay test and the MWGR is scheduled for next week...
Do you know the reason for this error?

31-May-2021 10:49:10 CESTSuccessfully opened file root://eoscms.cern.ch//eos/cms/store/backfill/1/express/Tier0_REPLAY_2021/StreamExpressCosmics/DQMIO/Express-v2105300635/000/341/169/00000/FBD2E3DC-C11C-11EB-B90F-DD22B8BCBEEF.root
----- Begin Fatal Exception 31-May-2021 10:56:39 CEST-----------------------
An exception of category 'FatalRootError' occurred while
   [0] Calling InputSource::readRun_
   Additional Info:
      [a] Fatal Root Error: @SUB=TBasket::Streamer
The value of fKeylen is incorrect (-3385) ; trying to recover by setting it to zero

----- End Fatal Exception -------------------------------------------------
%MSG-e EcalDQM:  EcalDQMonitorClient:ecalMonitorClient@endProcessBlock  31-May-2021 10:56:39 CEST post-events
Ecal Monitor Client: Exception in bookMEs @ IntegrityClient
%MSG


A fatal system signal has occurred: segmentation violation
The following is the call stack containing the origin of the signal.

More details at https://hypernews.cern.ch/HyperNews/CMS/get/tier0-Ops/2223/1/1.html

@francescobrivio
Contributor

Reproducing the replay test, I got this failure related to EcalDQMonitorClient. [...]

I am running a few tests:

  1. Use the "old" GT (113X_dataRun3_Express_v1) --> this gives the same crash
  2. Change the ECAL flag that was updated especially for this MWGR (Test ECAL pedestals in next MWGR cms-sw/cmssw#33765) to test ECAL PCL pedestals --> this also gives the same crash

I'm now running all the possible combinations of options 1 and 2...but any input from ECAL or DQM is welcome!

@francescobrivio
Contributor

Reproducing the replay test, I got this failure related to EcalDQMonitorClient. [...]

BTW, just to give everyone a full picture, after the stack trace (related to ECAL I think) I see:

Current Modules:

Module: SiStripBadComponentInfo:siStripBadComponentInfo (crashed)

A fatal system signal has occurred: segmentation violation
Segmentation fault (core dumped)

@silviodonato

Thanks @francescobrivio. Indeed I've just tried to remove the EcalDQMonitorClient by overwriting

process.DQMOfflineCosmics_SecondStepEcal = cms.Sequence(process.es_dqm_client_offline)

and I got the same error. So the error probably comes from SiStripBadComponentInfo.
@mmusich perhaps it is something like cms-sw/cmssw#33867?
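A quick way to test that suspicion would be to drop just that harvester; a minimal sketch, assuming the standard cms.Sequence API, appended to the job's PSet.py:

# Hypothetical check: drop only the suspect harvester from the client sequence
if hasattr(process, 'siStripBadComponentInfo'):
    process.SiStripCosmicDQMClient.remove(process.siStripBadComponentInfo)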

31-May-2021 16:41:55 CESTSuccessfully opened file root://eoscms.cern.ch//eos/cms/store/backfill/1/express/Tier0_REPLAY_2021/StreamExpressCosmics/DQMIO/Express-v2105300635/000/341/169/00000/FBD2E3DC-C11C-11EB-B90F-DD22B8BCBEEF.root
[2021-05-31 16:50:53.006480 +0200][Error  ][PostMaster        ] [p06636710w24543.cern.ch:1095 #0] Forcing error on disconnect: [ERROR] Operation interrupted.
----- Begin Fatal Exception 31-May-2021 16:58:02 CEST-----------------------
An exception of category 'FatalRootError' occurred while
   [0] Calling InputSource::readRun_
   Additional Info:
      [a] Fatal Root Error: @SUB=TBasket::Streamer
The value of fKeylen is incorrect (-7886) ; trying to recover by setting it to zero

----- End Fatal Exception -------------------------------------------------


A fatal system signal has occurred: segmentation violation
The following is the call stack containing the origin of the signal.

Mon May 31 16:58:02 CEST 2021
Thread 7 (Thread 0x7fd179123700 (LWP 22098)):
#0  0x00007fd1c2ff7b3b in do_futex_wait.constprop () from /lib64/libpthread.so.0
#1  0x00007fd1c2ff7bcf in __new_sem_wait_slow.constprop.0 () from /lib64/libpthread.so.0
#2  0x00007fd1c2ff7c6b in sem_wait@@GLIBC_2.2.5 () from /lib64/libpthread.so.0
#3  0x00007fd1bcac9816 in XrdCl::JobManager::RunJobs() () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_0_patch1/external/slc7_amd64_gcc900/lib/libXrdCl.so.2
#4  0x00007fd1bcac98c9 in RunRunnerThread () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_0_patch1/external/slc7_amd64_gcc900/lib/libXrdCl.so.2
#5  0x00007fd1c2ff1ea5 in start_thread () from /lib64/libpthread.so.0
#6  0x00007fd1c2d1a9fd in clone () from /lib64/libc.so.6
Thread 6 (Thread 0x7fd179924700 (LWP 22097)):
#0  0x00007fd1c2ff7b3b in do_futex_wait.constprop () from /lib64/libpthread.so.0
#1  0x00007fd1c2ff7bcf in __new_sem_wait_slow.constprop.0 () from /lib64/libpthread.so.0
#2  0x00007fd1c2ff7c6b in sem_wait@@GLIBC_2.2.5 () from /lib64/libpthread.so.0
#3  0x00007fd1bcac9816 in XrdCl::JobManager::RunJobs() () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_0_patch1/external/slc7_amd64_gcc900/lib/libXrdCl.so.2
#4  0x00007fd1bcac98c9 in RunRunnerThread () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_0_patch1/external/slc7_amd64_gcc900/lib/libXrdCl.so.2
#5  0x00007fd1c2ff1ea5 in start_thread () from /lib64/libpthread.so.0
#6  0x00007fd1c2d1a9fd in clone () from /lib64/libc.so.6
Thread 5 (Thread 0x7fd17a125700 (LWP 22096)):
#0  0x00007fd1c2ff7b3b in do_futex_wait.constprop () from /lib64/libpthread.so.0
#1  0x00007fd1c2ff7bcf in __new_sem_wait_slow.constprop.0 () from /lib64/libpthread.so.0
#2  0x00007fd1c2ff7c6b in sem_wait@@GLIBC_2.2.5 () from /lib64/libpthread.so.0
#3  0x00007fd1bcac9816 in XrdCl::JobManager::RunJobs() () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_0_patch1/external/slc7_amd64_gcc900/lib/libXrdCl.so.2
#4  0x00007fd1bcac98c9 in RunRunnerThread () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_0_patch1/external/slc7_amd64_gcc900/lib/libXrdCl.so.2
#5  0x00007fd1c2ff1ea5 in start_thread () from /lib64/libpthread.so.0
#6  0x00007fd1c2d1a9fd in clone () from /lib64/libc.so.6
Thread 4 (Thread 0x7fd17a926700 (LWP 22095)):
#0  0x00007fd1c2ff8e9d in nanosleep () from /lib64/libpthread.so.0
#1  0x00007fd1bcee96ff in XrdSysTimer::Wait(int) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_0_patch1/external/slc7_amd64_gcc900/lib/libXrdUtils.so.2
#2  0x00007fd1bca6bb0a in XrdCl::TaskManager::RunTasks() () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_0_patch1/external/slc7_amd64_gcc900/lib/libXrdCl.so.2
#3  0x00007fd1bca6bc69 in RunRunnerThread () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_0_patch1/external/slc7_amd64_gcc900/lib/libXrdCl.so.2
#4  0x00007fd1c2ff1ea5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007fd1c2d1a9fd in clone () from /lib64/libc.so.6
Thread 3 (Thread 0x7fd17b127700 (LWP 22094)):
#0  0x00007fd1c2d1afd3 in epoll_wait () from /lib64/libc.so.6
#1  0x00007fd1bcee3ed2 in XrdSys::IOEvents::PollE::Begin(XrdSysSemaphore*, int&, char const**) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_0_patch1/external/slc7_amd64_gcc900/lib/libXrdUtils.so.2
#2  0x00007fd1bcee078d in XrdSys::IOEvents::BootStrap::Start(void*) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_0_patch1/external/slc7_amd64_gcc900/lib/libXrdUtils.so.2
#3  0x00007fd1bcee8f68 in XrdSysThread_Xeq () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_0_patch1/external/slc7_amd64_gcc900/lib/libXrdUtils.so.2
#4  0x00007fd1c2ff1ea5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007fd1c2d1a9fd in clone () from /lib64/libc.so.6
Thread 2 (Thread 0x7fd19fb24700 (LWP 21767)):
#0  0x00007fd1c2ff91d9 in waitpid () from /lib64/libpthread.so.0
#1  0x00007fd1ba1918d7 in edm::service::cmssw_stacktrace_fork() () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_0/lib/slc7_amd64_gcc900/pluginFWCoreServicesPlugins.so
#2  0x00007fd1ba19249a in edm::service::InitRootHandlers::stacktraceHelperThread() () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_0/lib/slc7_amd64_gcc900/pluginFWCoreServicesPlugins.so
#3  0x00007fd1c35f1af0 in std::execute_native_thread_routine (__p=0x7fd1a5e7f500) at ../../../../../libstdc++-v3/src/c++11/thread.cc:80
#4  0x00007fd1c2ff1ea5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007fd1c2d1a9fd in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x7fd1c12f3540 (LWP 21655)):
#0  0x00007fd1c2d0fccd in poll () from /lib64/libc.so.6
#1  0x00007fd1ba191cd7 in full_read.constprop () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_0/lib/slc7_amd64_gcc900/pluginFWCoreServicesPlugins.so
#2  0x00007fd1ba19256c in edm::service::InitRootHandlers::stacktraceFromThread() () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_0/lib/slc7_amd64_gcc900/pluginFWCoreServicesPlugins.so
#3  0x00007fd1ba193922 in sig_dostack_then_abort () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_0/lib/slc7_amd64_gcc900/pluginFWCoreServicesPlugins.so
#4  <signal handler called>
#5  0x00007fd1c58e7cc6 in std::vector<unsigned int, std::allocator<unsigned int> >::operator=(std::vector<unsigned int, std::allocator<unsigned int> > const&) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_0/lib/slc7_amd64_gcc900/libFWCoreFramework.so
#6  0x00007fd198b5a72a in SiStripQuality::SiStripQuality(SiStripQuality const&) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_0/lib/slc7_amd64_gcc900/libCalibFormatsSiStripObjects.so
#7  0x00007fd18ebc15ae in SiStripBadComponentInfo::dqmEndJob(dqm::implementation::IBooker&, dqm::implementation::IGetter&) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_0/lib/slc7_amd64_gcc900/pluginDQMSiStripMonitorClientPlugins.so
#8  0x00007fd18ebc2fd4 in non-virtual thunk to DQMEDHarvester::endProcessBlockProduce(edm::ProcessBlock&) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_0/lib/slc7_amd64_gcc900/pluginDQMSiStripMonitorClientPlugins.so
#9  0x00007fd1c58f4518 in edm::one::EDProducerBase::doEndProcessBlock(edm::ProcessBlockPrincipal const&, edm::ModuleCallingContext const*) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_0/lib/slc7_amd64_gcc900/libFWCoreFramework.so
#10 0x00007fd1c58d1b90 in edm::WorkerT<edm::one::EDProducerBase>::implDoEndProcessBlock(edm::ProcessBlockPrincipal const&, edm::ModuleCallingContext const*) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_0/lib/slc7_amd64_gcc900/libFWCoreFramework.so
#11 0x00007fd1c57dea67 in decltype ({parm#1}()) edm::convertException::wrap<edm::Worker::runModule<edm::OccurrenceTraits<edm::ProcessBlockPrincipal, (edm::BranchActionType)3> >(edm::OccurrenceTraits<edm::ProcessBlockPrincipal, (edm::BranchActionType)3>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::ProcessBlockPrincipal, (edm::BranchActionType)3>::Context const*)::{lambda()#1}>(edm::Worker::runModule<edm::OccurrenceTraits<edm::ProcessBlockPrincipal, (edm::BranchActionType)3> >(edm::OccurrenceTraits<edm::ProcessBlockPrincipal, (edm::BranchActionType)3>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::ProcessBlockPrincipal, (edm::BranchActionType)3>::Context const*)::{lambda()#1}) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_0/lib/slc7_amd64_gcc900/libFWCoreFramework.so
#12 0x00007fd1c57dec6d in bool edm::Worker::runModule<edm::OccurrenceTraits<edm::ProcessBlockPrincipal, (edm::BranchActionType)3> >(edm::OccurrenceTraits<edm::ProcessBlockPrincipal, (edm::BranchActionType)3>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::ProcessBlockPrincipal, (edm::BranchActionType)3>::Context const*) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_0/lib/slc7_amd64_gcc900/libFWCoreFramework.so
#13 0x00007fd1c57def16 in std::__exception_ptr::exception_ptr edm::Worker::runModuleAfterAsyncPrefetch<edm::OccurrenceTraits<edm::ProcessBlockPrincipal, (edm::BranchActionType)3> >(std::__exception_ptr::exception_ptr const*, edm::OccurrenceTraits<edm::ProcessBlockPrincipal, (edm::BranchActionType)3>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::ProcessBlockPrincipal, (edm::BranchActionType)3>::Context const*) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_0/lib/slc7_amd64_gcc900/libFWCoreFramework.so
#14 0x00007fd1c57df330 in void edm::SerialTaskQueueChain::actionToRun<edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::ProcessBlockPrincipal, (edm::BranchActionType)3> >::execute()::{lambda()#1}&>(edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::ProcessBlockPrincipal, (edm::BranchActionType)3> >::execute()::{lambda()#1}&) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_0/lib/slc7_amd64_gcc900/libFWCoreFramework.so
#15 0x00007fd1c57df4e1 in edm::SerialTaskQueue::QueuedTask<edm::SerialTaskQueueChain::push<edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::ProcessBlockPrincipal, (edm::BranchActionType)3> >::execute()::{lambda()#1}&>(tbb::task_group&, edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::ProcessBlockPrincipal, (edm::BranchActionType)3> >::execute()::{lambda()#1}&)::{lambda()#1}>::execute() () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_0/lib/slc7_amd64_gcc900/libFWCoreFramework.so
#16 0x00007fd1c5a77f49 in tbb::internal::function_task<edm::SerialTaskQueue::spawn(edm::SerialTaskQueue::TaskBase&)::{lambda()#1}>::execute() () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_0/lib/slc7_amd64_gcc900/libFWCoreConcurrency.so
#17 0x00007fd1c404126d in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::process_bypass_loop (this=this@entry=0x7fd1bfd53200, context_guard=..., t=0x7fd17b4e5540, isolation=isolation@entry=0) at ../../include/tbb/task.h:1003
#18 0x00007fd1c4041756 in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all (this=0x7fd1bfd53200, parent=..., child=<optimized out>) at ../../src/tbb/scheduler.h:650
#19 0x00007fd1c57ac618 in edm::EventProcessor::endProcessBlock(bool, bool) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_0/lib/slc7_amd64_gcc900/libFWCoreFramework.so
#20 0x00007fd1c576dc44 in edm::EventProcessor::runToCompletion() [clone .cold] () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_0/lib/slc7_amd64_gcc900/libFWCoreFramework.so
#21 0x000000000040bae6 in tbb::interface7::internal::delegated_function<main::{lambda()#1}::operator()() const::{lambda()#1} const, void>::operator()() const ()
#22 0x00007fd1c403a552 in tbb::interface7::internal::task_arena_base::internal_execute (this=0x7ffc036ac590, d=...) at ../../src/tbb/arena.cpp:1105
#23 0x000000000040ca13 in main::{lambda()#1}::operator()() const ()
#24 0x000000000040b62c in main ()

Current Modules:

Module: SiStripBadComponentInfo:siStripBadComponentInfo (crashed)

A fatal system signal has occurred: segmentation violation

@mmusich
Contributor

mmusich commented May 31, 2021

@silviodonato @francescobrivio
Please post a full recipe to reproduce the error at #4574 (comment)

@silviodonato

Thanks @mmusich! Let me also point to PR cms-sw/cmssw#32677 to keep track of the issue.

@francescobrivio
Contributor

@silviodonato @francescobrivio
Please post a full recipe to reproduce the error at #4574 (comment)

Here is the recipe:

cmsrel CMSSW_11_3_1
cd CMSSW_11_3_1/src
cmsenv
scram b -j 8
cp /afs/cern.ch/user/c/cmst0/public/PausedJobs/MWGR4-2021/tarballs/job_11195/75dbd874-62ca-48fa-a4c7-5ec8f51b7451-Periodic-Harvest-1-6-logArchive.tar.gz . 
tar -zxvf 75dbd874-62ca-48fa-a4c7-5ec8f51b7451-Periodic-Harvest-1-6-logArchive.tar.gz 
cd job/WMTaskSpace/cmsRun1/
cmsRun -e PSet.py

@germanfgv
Contributor

germanfgv commented May 31, 2021

My replay, using the GTs in this PR, finished with a similar error. More details can be found in HN: https://hypernews.cern.ch/HyperNews/CMS/get/tier0-Ops/2223/1/1/1.html

@francescobrivio
Contributor

francescobrivio commented May 31, 2021

My replay, using the GTs in this PR, finished with a similar error. More details can be found in HN: https://hypernews.cern.ch/HyperNews/CMS/get/tier0-Ops/2223/1/1/1.html

Hi German, thanks for the additional info!
I just want to point out that the file you report in the HN (FBD2E3DC-C11C-11EB-B90F-DD22B8BCBEEF.root) is the last one in the fileList of the PSet.py, so perhaps that's why it always happens after that file? Not sure if this helps somehow...

@pieterdavid

With the recipe from #4574 (comment) I could reproduce the problem in #4574 (comment); I'm investigating.

@mmusich
Contributor

mmusich commented May 31, 2021

Placing some random couts in the code, I am actually getting a different (ROOT-related) exception:

%MSG
----- Begin Fatal Exception 31-May-2021 18:37:19 CEST-----------------------
An exception of category 'FatalRootError' occurred while
   [0] Calling InputSource::readLuminosityBlock_
   Additional Info:
      [a] Fatal Root Error: @SUB=TStorageFactoryFile::ReadBuffer
read from Storage::xread returned 0. Asked to read n bytes: 531035723 from offset: 1248034286 with file size: 1248034286

----- End Fatal Exception -------------------------------------------------

@cmsdmwmbot

Host name : vocms047.cern.ch
Container ID : None
Pull request url : #4574
Current build at jenkins : https://cmssdt.cern.ch/dmwm-jenkins/job/DMWM-T0-PR-test-job/285/.
Jira Issue : https://its.cern.ch/jira/browse/CMSTZDEV-660

@silviodonato

I tried to remove process.siStripBadComponentInfo by overwriting

process.SiStripCosmicDQMClient = cms.Sequence(process.siStripQTester+process.siStripOfflineAnalyser)

and I've got

31-May-2021 18:35:09 CESTSuccessfully opened file root://eoscms.cern.ch//eos/cms/store/backfill/1/express/Tier0_REPLAY_2021/StreamExpressCosmics/DQMIO/Express-v2105300635/000/341/169/00000/F4114984-C10F-11EB-9983-00B5B8BCBEEF.root
%MSG-w XrdAdaptor:  AfterModBeginStream 31-May-2021 18:35:11 CEST BeforeEvents
Data is served from cern.ch instead of original site eoscms
%MSG
31-May-2021 18:35:11 CESTSuccessfully opened file root://eoscms.cern.ch//eos/cms/store/backfill/1/express/Tier0_REPLAY_2021/StreamExpressCosmics/DQMIO/Express-v2105300635/000/341/169/00000/FBD2E3DC-C11C-11EB-B90F-DD22B8BCBEEF.root
%MSG-e HLTConfigProvider:  EgHLTOfflineClient:egHLTOffDQMClient@beginRun  31-May-2021 18:55:44 CEST Run: 341169
Falling back to ProcessName-only init using ProcessName 'HLT' !
%MSG
%MSG-e HLTConfigProvider:  EgHLTOfflineClient:egHLTOffDQMClient@beginRun  31-May-2021 18:55:45 CEST Run: 341169
 Process name 'HLT' not found in registry!
%MSG
%MSG-e HLTConfigProvider:   EgHLTOfflineSummaryClient:egHLTOffDQMSummaryClient@beginRun  31-May-2021 18:55:45 CEST Run: 341169
Falling back to ProcessName-only init using ProcessName 'HLT' !
%MSG
%MSG-e HLTConfigProvider:   EgHLTOfflineSummaryClient:egHLTOffDQMSummaryClient@beginRun  31-May-2021 18:55:45 CEST Run: 341169
 Process name 'HLT' not found in registry!
%MSG


A fatal system signal has occurred: segmentation violation
The following is the call stack containing the origin of the signal.

Mon May 31 18:59:03 CEST 2021
Thread 7 (Thread 0x7f791c263700 (LWP 31374)):
#0  0x00007f796613ab3b in do_futex_wait.constprop () from /lib64/libpthread.so.0
#1  0x00007f796613abcf in __new_sem_wait_slow.constprop.0 () from /lib64/libpthread.so.0
#2  0x00007f796613ac6b in sem_wait@@GLIBC_2.2.5 () from /lib64/libpthread.so.0
#3  0x00007f79606cb816 in XrdCl::JobManager::RunJobs() () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_1/external/slc7_amd64_gcc900/lib/libXrdCl.so.2
#4  0x00007f79606cb8c9 in RunRunnerThread () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_1/external/slc7_amd64_gcc900/lib/libXrdCl.so.2
#5  0x00007f7966134ea5 in start_thread () from /lib64/libpthread.so.0
#6  0x00007f7965e5d9fd in clone () from /lib64/libc.so.6
Thread 6 (Thread 0x7f791ca64700 (LWP 31373)):
#0  0x00007f796613ab3b in do_futex_wait.constprop () from /lib64/libpthread.so.0
#1  0x00007f796613abcf in __new_sem_wait_slow.constprop.0 () from /lib64/libpthread.so.0
#2  0x00007f796613ac6b in sem_wait@@GLIBC_2.2.5 () from /lib64/libpthread.so.0
#3  0x00007f79606cb816 in XrdCl::JobManager::RunJobs() () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_1/external/slc7_amd64_gcc900/lib/libXrdCl.so.2
#4  0x00007f79606cb8c9 in RunRunnerThread () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_1/external/slc7_amd64_gcc900/lib/libXrdCl.so.2
#5  0x00007f7966134ea5 in start_thread () from /lib64/libpthread.so.0
#6  0x00007f7965e5d9fd in clone () from /lib64/libc.so.6
Thread 5 (Thread 0x7f791d265700 (LWP 31372)):
#0  0x00007f796613ab3b in do_futex_wait.constprop () from /lib64/libpthread.so.0
#1  0x00007f796613abcf in __new_sem_wait_slow.constprop.0 () from /lib64/libpthread.so.0
#2  0x00007f796613ac6b in sem_wait@@GLIBC_2.2.5 () from /lib64/libpthread.so.0
#3  0x00007f79606cb816 in XrdCl::JobManager::RunJobs() () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_1/external/slc7_amd64_gcc900/lib/libXrdCl.so.2
#4  0x00007f79606cb8c9 in RunRunnerThread () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_1/external/slc7_amd64_gcc900/lib/libXrdCl.so.2
#5  0x00007f7966134ea5 in start_thread () from /lib64/libpthread.so.0
#6  0x00007f7965e5d9fd in clone () from /lib64/libc.so.6
Thread 4 (Thread 0x7f791da66700 (LWP 31371)):
#0  0x00007f796613be9d in nanosleep () from /lib64/libpthread.so.0
#1  0x00007f79602f16ff in XrdSysTimer::Wait(int) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_1/external/slc7_amd64_gcc900/lib/libXrdUtils.so.2
#2  0x00007f796066db0a in XrdCl::TaskManager::RunTasks() () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_1/external/slc7_amd64_gcc900/lib/libXrdCl.so.2
#3  0x00007f796066dc69 in RunRunnerThread () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_1/external/slc7_amd64_gcc900/lib/libXrdCl.so.2
#4  0x00007f7966134ea5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007f7965e5d9fd in clone () from /lib64/libc.so.6
Thread 3 (Thread 0x7f791e267700 (LWP 31370)):
#0  0x00007f7965e5dfd3 in epoll_wait () from /lib64/libc.so.6
#1  0x00007f79602ebed2 in XrdSys::IOEvents::PollE::Begin(XrdSysSemaphore*, int&, char const**) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_1/external/slc7_amd64_gcc900/lib/libXrdUtils.so.2
#2  0x00007f79602e878d in XrdSys::IOEvents::BootStrap::Start(void*) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_1/external/slc7_amd64_gcc900/lib/libXrdUtils.so.2
#3  0x00007f79602f0f68 in XrdSysThread_Xeq () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_1/external/slc7_amd64_gcc900/lib/libXrdUtils.so.2
#4  0x00007f7966134ea5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007f7965e5d9fd in clone () from /lib64/libc.so.6
Thread 2 (Thread 0x7f79433d2700 (LWP 31346)):
#0  0x00007f796613c1d9 in waitpid () from /lib64/libpthread.so.0
#1  0x00007f795f1568d7 in edm::service::cmssw_stacktrace_fork() () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_1/lib/slc7_amd64_gcc900/pluginFWCoreServicesPlugins.so
#2  0x00007f795f15749a in edm::service::InitRootHandlers::stacktraceHelperThread() () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_1/lib/slc7_amd64_gcc900/pluginFWCoreServicesPlugins.so
#3  0x00007f7966734af0 in std::execute_native_thread_routine (__p=0x7f79498a84e0) at ../../../../../libstdc++-v3/src/c++11/thread.cc:80
#4  0x00007f7966134ea5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007f7965e5d9fd in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x7f7964437540 (LWP 31150)):
#0  0x00007f7965e52ccd in poll () from /lib64/libc.so.6
#1  0x00007f795f156cd7 in full_read.constprop () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_1/lib/slc7_amd64_gcc900/pluginFWCoreServicesPlugins.so
#2  0x00007f795f15756c in edm::service::InitRootHandlers::stacktraceFromThread() () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_1/lib/slc7_amd64_gcc900/pluginFWCoreServicesPlugins.so
#3  0x00007f795f158922 in sig_dostack_then_abort () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_1/lib/slc7_amd64_gcc900/pluginFWCoreServicesPlugins.so
#4  <signal handler called>
#5  0x00007f79325ec761 in DQMRootSource::readElements() () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_1/lib/slc7_amd64_gcc900/pluginDQMServicesFwkIOPlugins.so
#6  0x00007f79325ece20 in DQMRootSource::readLuminosityBlock_(edm::LuminosityBlockPrincipal&) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_1/lib/slc7_amd64_gcc900/pluginDQMServicesFwkIOPlugins.so
#7  0x00007f796895df2b in decltype ({parm#1}()) edm::convertException::wrap<edm::callWithTryCatchAndPrint<void>(std::function<void ()>, char const*, bool)::{lambda()#1}>(edm::callWithTryCatchAndPrint<void>(std::function<void ()>, char const*, bool)::{lambda()#1}) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_1/lib/slc7_amd64_gcc900/libFWCoreFramework.so
#8  0x00007f796895e03c in void edm::callWithTryCatchAndPrint<void>(std::function<void ()>, char const*, bool) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_1/lib/slc7_amd64_gcc900/libFWCoreFramework.so
#9  0x00007f796895a826 in edm::InputSource::readLuminosityBlock(edm::LuminosityBlockPrincipal&, edm::HistoryAppender&) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_1/lib/slc7_amd64_gcc900/libFWCoreFramework.so
#10 0x00007f79688ed76a in edm::EventProcessor::readLuminosityBlock(edm::LuminosityBlockProcessingStatus&) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_1/lib/slc7_amd64_gcc900/libFWCoreFramework.so
#11 0x00007f79688f58d2 in edm::EventProcessor::beginLumiAsync(edm::IOVSyncValue const&, std::shared_ptr<void> const&, edm::WaitingTaskHolder)::{lambda(edm::LimitedTaskQueue::Resumer)#1}::operator()(edm::LimitedTaskQueue::Resumer)::{lambda()#1}::operator()() () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_1/lib/slc7_amd64_gcc900/libFWCoreFramework.so
#12 0x00007f79688f5f01 in edm::SerialTaskQueue::QueuedTask<edm::SerialTaskQueueChain::push<edm::EventProcessor::beginLumiAsync(edm::IOVSyncValue const&, std::shared_ptr<void> const&, edm::WaitingTaskHolder)::{lambda(edm::LimitedTaskQueue::Resumer)#1}::operator()(edm::LimitedTaskQueue::Resumer)::{lambda()#1}>(tbb::task_group&, edm::EventProcessor::beginLumiAsync(edm::IOVSyncValue const&, std::shared_ptr<void> const&, edm::WaitingTaskHolder)::{lambda(edm::LimitedTaskQueue::Resumer)#1}::operator()(edm::LimitedTaskQueue::Resumer)::{lambda()#1}&&)::{lambda()#1}>::execute() () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_1/lib/slc7_amd64_gcc900/libFWCoreFramework.so
#13 0x00007f7968bbbf49 in tbb::internal::function_task<edm::SerialTaskQueue::spawn(edm::SerialTaskQueue::TaskBase&)::{lambda()#1}>::execute() () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_1/lib/slc7_amd64_gcc900/libFWCoreConcurrency.so
#14 0x00007f796718426d in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::process_bypass_loop (this=this@entry=0x7f7962ebf200, context_guard=..., t=0x7f791e5ead40, isolation=isolation@entry=0) at ../../include/tbb/task.h:1003
#15 0x00007f7967184756 in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all (this=0x7f7962ebf200, parent=..., child=<optimized out>) at ../../src/tbb/scheduler.h:650
#16 0x00007f79688ec058 in edm::EventProcessor::processLumis(std::shared_ptr<void> const&) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_1/lib/slc7_amd64_gcc900/libFWCoreFramework.so
#17 0x00007f79688f5035 in edm::EventProcessor::runToCompletion() () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_1/lib/slc7_amd64_gcc900/libFWCoreFramework.so
#18 0x000000000040bae6 in tbb::interface7::internal::delegated_function<main::{lambda()#1}::operator()() const::{lambda()#1} const, void>::operator()() const ()
#19 0x00007f796717d552 in tbb::interface7::internal::task_arena_base::internal_execute (this=0x7ffedd405130, d=...) at ../../src/tbb/arena.cpp:1105
#20 0x000000000040ca13 in main::{lambda()#1}::operator()() const ()
#21 0x000000000040b62c in main ()

Current Modules:

Module: none (crashed)

A fatal system signal has occurred: segmentation violation

@cmsdmwmbot

Container Tests
JIRA URL : https://its.cern.ch/jira/browse/CMSTZDEV-660
Tier0_REPLAY v285 DMWM-T0-PR-test-job on vocms047.cern.ch. Replay test with CMSSW_11_3_1

@silviodonato

Placing some random couts in the code, I am actually getting a different (ROOT-related) exception: [...]

@Dr15Jones @smuzaffar @makortel do you have any idea about this?

@silviodonato

@jfernan2 @kmaeshima @rvenditti @andrius-k @ErnestaP @ahmad3213
Is it normal that this step
cmsDriver.py singleTest --conditions 113X_dataRun3_Express_Candidate_2021_05_28_11_56_38 --filein=root://eoscms.cern.ch//eos/cms/store/backfill/1/express/Tier0_REPLAY_2021/StreamExpressCosmics/DQMIO/Express-v2105300635/000/341/169/00000/FBD2E3DC-C11C-11EB-B90F-DD22B8BCBEEF.root --scenario cosmics --filetype DQM -s HARVESTING:dqmHarvestingFakeHLT --data --era Run3 -n 100
takes up to 2GB of memory?
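For reference, one quick way to measure the peak memory of such a job (a sketch; GNU time is assumed available on the node, and PSet.py is the harvesting config from the tarball above):

# prints "Maximum resident set size" (peak RSS) at the end of the job
/usr/bin/time -v cmsRun PSet.py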

@jfernan2

jfernan2 commented Jun 1, 2021

@jfernan2 @kmaeshima @rvenditti @andrius-k @ErnestaP @ahmad3213
Is it normal that this step [...] takes up to 2GB of memory?

I do not know if it is normal, since we have not monitored memory at Production level for Cosmics Harvesting. All I can say is that, despite being Cosmics, almost all DPGs/POGs (except Trigger, due to the FakeHLT menu) are involved in this Harvesting process:
https://github.com/cms-sw/cmssw/blob/6d2f66057131baacc2fcbdd203588c41c885b42c/DQMOffline/Configuration/python/DQMOfflineCosmics_SecondStep_cff.py#L60-L92

We can do a check with igprof to see if there is any module out of control.
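A sketch of such a check, assuming the usual igprof memory-profiling workflow (igprof available in the CMSSW environment; PSet.py is the harvesting config from the tarball above):

# collect a memory profile of the harvesting job
igprof -d -mp -z -o igprof.mp.gz cmsRun PSet.py
# rank modules/functions by live memory to spot any module out of control
igprof-analyse -d -v -g -r MEM_LIVE igprof.mp.gz > igreport_mem.txt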

@silviodonato

I made some tests reducing the number of files. Running cmsDriver.py singleTest --conditions 113X_dataRun3_Express_Candidate_2021_05_28_11_56_38 --filein=root://eoscms.cern.ch//eos/cms/store/backfill/1/express/Tier0_REPLAY_2021/StreamExpressCosmics/DQMIO/Express-v2105300635/000/341/169/00000/FBD2E3DC-C11C-11EB-B90F-DD22B8BCBEEF.root --scenario cosmics --filetype DQM -s HARVESTING:dqmHarvestingFakeHLT --data --era Run3 on all files contained in /eos/cms/store/backfill/1/express/Tier0_REPLAY_2021/StreamExpressCosmics/DQMIO/Express-v2105300635/000/341/169/00000/ (198 files) reproduces the error.

However, if you run on a subset of ~10 files the job is OK. If you run on ~50 files, the job is sometimes OK and sometimes crashes.

@germanfgv is there a possibility to split the job into ~5 parts?
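For the manual subset tests above, a minimal sketch of how the input list can be sliced by hand in the PSet.py (chunk size and index are illustrative; cms is the usual FWCore.ParameterSet.Config import already present in the PSet):

# run only one chunk of the input files, emulating a split of the job
chunkSize = 50   # illustrative: ~5 chunks of the 198 files
chunkIndex = 0   # which chunk to run (0, 1, 2, ...)
allFiles = list(process.source.fileNames)
process.source.fileNames = cms.untracked.vstring(
    allFiles[chunkIndex * chunkSize:(chunkIndex + 1) * chunkSize])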

@germanfgv
Contributor

However, if you run on a subset of ~10 files the job is OK. If you run on ~50 files, the job is sometimes OK and sometimes crashes.

@germanfgv is there a possibility to split the job into ~5 parts?

To be honest, that's not something I have ever done with our regular workflows. I guess I'll have to mess with our JobSplitter logic. I don't think it is something we can modify that easily. Let me check with the rest of the team to see what we can do.

@silviodonato

To be honest, that's not something I have ever done with our regular workflows. [...] Let me check with the rest of the team to see what we can do.

Thanks German, this would be only an "emergency" solution. Of course we want to understand and fix the bug.

@germanfgv
Contributor

@silviodonato could this be an issue with EOS? We have recently had problems reading files. Are we sure that's not what's happening here?

@silviodonato

@silviodonato could this be an issue with EOS? We have recently had problems reading files. Are we sure that's not what's happening here?

Yes, they seem to be problems with EOS, but I thought it was OK now. I'll try again with a local test.

@silviodonato

silviodonato commented Jun 1, 2021

@germanfgv and all: Yes, I can confirm this is a problem with EOS.
I copied all the files to my AFS area, then added the following to the PSet.py

for i, f in enumerate(process.source.fileNames): 
    process.source.fileNames[i] = process.source.fileNames[i].replace("/store/backfill/1/express/Tier0_REPLAY_2021/StreamExpressCosmics/DQMIO/Express-v2105300635/000/341/169/00000/","file:///afs/cern.ch/user/s/sdonato/AFSwork/public/Tier0_REPLAY_2021_StreamExpressCosmics_DQMIO/")

to pick the local files instead of the files on EOS.
The job worked properly, and I also got the output file DQM_V0001_R000341169__StreamExpressCosmics__Tier0_REPLAY_2021-Express-v2105300635__DQMIO.root
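For reference, the copy step itself can be sketched as follows (xrdfs/xrdcp assumed available; the AFS target is the same directory used in the snippet above):

SRC=/eos/cms/store/backfill/1/express/Tier0_REPLAY_2021/StreamExpressCosmics/DQMIO/Express-v2105300635/000/341/169/00000
DST=/afs/cern.ch/user/s/sdonato/AFSwork/public/Tier0_REPLAY_2021_StreamExpressCosmics_DQMIO
# list the run's DQMIO files on EOS and copy each one to AFS
for f in $(xrdfs root://eoscms.cern.ch ls $SRC); do
    xrdcp root://eoscms.cern.ch/$f $DST/
done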

I'll send a message to the HN. Thanks everybody for the help!

@dmwm deleted a comment from cmsdmwmbot Jun 1, 2021
@germanfgv
Contributor

@silviodonato We have an open ticket with EOS discussing problems that we had during MWGR#3. I'm reporting this in that ticket.

The configuration of the T2 backfill area was changed recently, and that may be the cause of the problem. I'll retry the replay using our new T0 storage site, where this should not be a problem.

@Dr15Jones

do you have any idea about this?

All the fatal exception messages are saying that the PoolSource requested ROOT to read some more of the file and what ROOT got back was basically 'broken'. That completely fits with an EOS problem where the remote file reading was giving back garbage.
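One way to check that hypothesis directly is to compare the checksum of the bytes read back over xrootd with the checksum EOS has on record; a sketch, using one of the files from the failing job:

F=/eos/cms/store/backfill/1/express/Tier0_REPLAY_2021/StreamExpressCosmics/DQMIO/Express-v2105300635/000/341/169/00000/FBD2E3DC-C11C-11EB-B90F-DD22B8BCBEEF.root
# adler32 computed over the bytes actually streamed back
xrdadler32 root://eoscms.cern.ch/$F
# adler32 stored by EOS for the file; a mismatch means corrupted reads
xrdfs root://eoscms.cern.ch query checksum $F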

@cmsdmwmbot

Container Tests
JIRA URL : https://its.cern.ch/jira/browse/CMSTZDEV-661
Tier0_REPLAY v286 DMWM-T0-PR-test-job on vocms047.cern.ch. Replay test with CMSSW_11_3_1
Filesets had no change for {} hours..
Filesets had no change for {} hours..
Filesets had no change for {} hours..
Filesets had no change for {} hours..
Filesets had no change for {} hours..
Filesets had no change for {} hours..
Filesets had no change for {} hours..
Filesets had no change for {} hours.. Replay will be aborted by time out.
Workflow failed
ORA-00942: table or view does not exist

@germanfgv
Contributor

I'll close this PR, as testing for the MWGR has moved to #4577.

@germanfgv closed this Jun 8, 2021