Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Segmentation fault in DQMStore #29743

Closed
silviodonato opened this issue May 6, 2020 · 16 comments · Fixed by #29745
Closed

Segmentation fault in DQMStore #29743

silviodonato opened this issue May 6, 2020 · 16 comments · Fixed by #29745

Comments

@silviodonato
Copy link
Contributor

We are getting segmentation fault in wf 4.52, 122.0, 136.732, 136.87, 1362.18, 11650.0, 12634.0, 25212.17

#3  <signal handler called>
#4  0x00002b58fb2a350d in __lll_lock_wait () from /lib64/libpthread.so.0
#5  0x00002b58fb29ee76 in _L_lock_941 () from /lib64/libpthread.so.0
#6  0x00002b58fb29ed6f in pthread_mutex_lock () from /lib64/libpthread.so.0
#7  0x00002b5911af24cc in dqm::implementation::DQMStore::enterLumi(unsigned int, unsigned int, unsigned long) () from /cvmfs/cms-ib.cern.ch/nweek-02627/slc7_amd64_gcc820/cms/cmssw-patch/CMSSW_11_1_X_2020-05-05-2300/lib/slc7_amd64_gcc820/libDQMServicesCore.so
#8  0x00002b5948ca7640 in DQMEDAnalyzer::beginRun(edm::Run const&amp, edm::EventSetup const&amp) () from /cvmfs/cms-ib.cern.ch/nweek-02627/slc7_amd64_gcc820/cms/cmssw-patch/CMSSW_11_1_X_2020-05-05-2300/lib/slc7_amd64_gcc820/pluginDQMTrackingMonitor.so
#9  0x00002b58f8dd286d in edm::stream::ProducingModuleAdaptorBase<edm::stream::EDProducerBase>::doStreamBeginRun(edm::StreamID, edm::RunPrincipal const&amp, edm::EventSetupImpl const&amp, edm::ModuleCallingContext const*) () from /cvmfs/cms-ib.cern.ch/nweek-02627/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_1_X_2020-05-03-0000/lib/slc7_amd64_gcc820/libFWCoreFramework.so
#10 0x00002b58f8cdeea0 in edm::WorkerT<edm::stream::EDProducerAdaptorBase>::implDoStreamBegin(edm::StreamID, edm::RunPrincipal const&amp, edm::EventSetupImpl const&amp, edm::ModuleCallingContext const*) () from /cvmfs/cms-ib.cern.ch/nweek-02627/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_1_X_2020-05-03-0000/lib/slc7_amd64_gcc820/libFWCoreFramework.so

Please note also the assertion fail
cmsRun: /data/cmsbld/jenkins/workspace/build-any-ib/w/tmp/BUILDROOT/f57771e3fd2a453450b4e62a76c14438/opt/cmssw/slc7_amd64_gcc820/cms/cmssw-patch/CMSSW_11_1_X_2020-05-05-2300/src/DQMServices/Core/src/DQMStore.cc:480: void dqm::implementation::DQMStore::enterLumi(edm::RunNumber_t, edm::LuminosityBlockNumber_t, uint64_t): Assertion `anyme &amp&amp checkScope(anyme->getScope()) == false' failed.

It seems related to #29738 (and issue #29605). Regarding the warnings %MSG-e HLTConfigProvider: METAnalyzer:pfMetDQMAnalyzerMiniAOD@beginRun 06-May-2020 06:04:04 CEST Run: 194533 Process name 'RECO' not found in registry! %MSG, they might be related to #29254

https://cmssdt.cern.ch/SDT/cgi-bin/buildlogs/raw/slc7_amd64_gcc820/CMSSW_11_1_X_2020-05-05-2300/pyRelValMatrixLogs/run/4.52_RunMu2012B+RunMu2012B+HLTD+RECODR1reHLT+HARVESTDR1reHLT/step3_RunMu2012B+RunMu2012B+HLTD+RECODR1reHLT+HARVESTDR1reHLT.log

@silviodonato
Copy link
Contributor Author

assign dqm

@cmsbuild
Copy link
Contributor

cmsbuild commented May 6, 2020

New categories assigned: dqm

@jfernan2,@andrius-k,@schneiml,@fioriNTU,@kmaeshima you have been requested to review this Pull request/Issue and eventually sign? Thanks

@cmsbuild
Copy link
Contributor

cmsbuild commented May 6, 2020

A new Issue was created by @silviodonato Silvio Donato.

@Dr15Jones, @silviodonato, @dpiparo, @smuzaffar, @makortel can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@silviodonato
Copy link
Contributor Author

silviodonato commented May 6, 2020

In 4008.0 and 134.811 we have the same error
cmsRun: /data/cmsbld/jenkins/workspace/build-any-ib/w/tmp/BUILDROOT/f57771e3fd2a453450b4e62a76c14438/opt/cmssw/slc7_amd64_gcc820/cms/cmssw-patch/CMSSW_11_1_X_2020-05-05-2300/src/DQMServices/Core/src/DQMStore.cc:480: void dqm::implementation::DQMStore::enterLumi(edm::RunNumber_t, edm::LuminosityBlockNumber_t, uint64_t): Assertion anyme &amp&amp checkScope(anyme->getScope()) == false' failed.`
https://cmssdt.cern.ch/SDT/cgi-bin/logreader/slc7_amd64_gcc820/CMSSW_11_1_X_2020-05-05-2300/pyRelValMatrixLogs/run/4008.0_QCD_Pt_80_170_BCtoE_8TeV+QCD_Pt_80_170_BCtoE_8TeV+DIGIPU1+RECOPUDBG/step3_QCD_Pt_80_170_BCtoE_8TeV+QCD_Pt_80_170_BCtoE_8TeV+DIGIPU1+RECOPUDBG.log#/

@schneiml
Copy link
Contributor

schneiml commented May 6, 2020

Well ok, that means that #29738 did not catch all cases. I'll have a look; in the worst case there is a simple workaround but I'd prefer to get it right.

@schneiml
Copy link
Contributor

schneiml commented May 6, 2020

4.52 passes for me locally in CMSSW_11_1_X_2020-05-05-2300 IB. Let me look at the IB logs...

@silviodonato
Copy link
Contributor Author

@schneiml I'm not sure, but it might be related to the multithreading

@silviodonato
Copy link
Contributor Author

@schneiml it's weird. We are getting fewer and different errors in slc7_amd64_gcc900 (wf 1010.0 and 1370.18)

@schneiml
Copy link
Contributor

schneiml commented May 6, 2020

Yes, that is my guess as well. Can you point me to the precise commands used?

@schneiml
Copy link
Contributor

schneiml commented May 6, 2020

It could well be a race, maybe there is no synchronization between framework callbacks where I expected one. In the worst case we can just run initLumi as needed (that is safe andwill fix this problem), but it would be better to know why it does not work as designed.

@silviodonato
Copy link
Contributor Author

The exact command should be runTheMatrix.py -i all --job-reports -t 4 --ibeos (see #28622 (comment) )
I let @smuzaffar @makortel confirm it.

@schneiml The weird thing is that even about the issue #29744, we see fewer and different errors in slc7_amd64_gcc900 wrt slc7_amd64_gcc820

@smuzaffar
Copy link
Contributor

yes, you should use -t 4 to run it in threaded mode.

@schneiml
Copy link
Contributor

schneiml commented May 6, 2020

@silviodonato I have a proposed fix. I have no idea if it actually fixes the problem, but it's a very obvious bug and looks suspicious. It should also be not very dangerous to merge.

I tried to reproduce the problem, with varying number of threads, but never succeeded. I suspect this is a race that is sort-of hard to win, so it only fails in a handful times out of 100's (1000's?) of IB tests, even though it probably affects ~all workflows.

@Dr15Jones
Copy link
Contributor

@schneiml

It could well be a race, maybe there is no synchronization between framework callbacks where I expected one. In the worst case we can just run initLumi as needed (that is safe andwill fix this problem), but it would be better to know why it does not work as designed.

Can you explain what expectations you had for the callback?

@silviodonato
Copy link
Contributor Author

#29745 solved the issue. We don't see anymore the errors in CMSSW_11_1_X_2020-05-06-1300!

@schneiml
Copy link
Contributor

schneiml commented May 7, 2020

@Dr15Jones The assumption is "All begin run things will be done before the first begin lumi starts".

If this is violated, the per-lumi plots (which are still booked in beginRun, via bookHistograms) could be initialized via the global hook before they exist, and, effectively, not initialized. This is a problem I was afraid of with the new change.

However, the crashes were completely unrelated to that, and simply due to a stupid copy-paste mistake. So, I think this condition holds, and we also need it do hold, else the DQMStore needs another change (We could simpliy initialize on demand, like before -- but then we risk loosing a set of lumi MEs if the module never calls enterLumi (that was the very original issue here), and also I think it is architecturally cleaner to not do things on-demand that can be done ahead of time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants