
Crash in Tier0 replay of Run-2 data using 12_0_0 #35304

Closed

francescobrivio opened this issue Sep 16, 2021 · 17 comments

Comments

@francescobrivio
Contributor

A crash was observed while running a Tier0 replay on Run-2 data (dmwm/T0#4602, more info in this HN message) to test the PCL workflows and the new AlCaRecos: SiPixelCalSingleMuonLoose, SiPixelCalSingleMuonTight, TkAlDiMuonAndVertex.

The crash was reported in https://hypernews.cern.ch/HyperNews/CMS/get/tier0-Ops/2261/1.html as:

Current Modules:

Module: L1TTauOffline:l1tTauOfflineDQMEmu (crashed)
Module: METMonitor:PFMET110_PFMHT110_IDTight_METmonitoring
Module: LowPtGsfElectronSeedProducer:lowPtGsfElectronSeeds
Module: PATPackedCandidateProducer:packedPFCandidates
Module: HTMonitor:hltHT_HT650_DisplacedDijet60_Inclusive_Prommonitoring
Module: FastjetJetProducer:ak8PFJetsPuppi
Module: CandSecondaryVertexProducer:pfInclusiveSecondaryVertexFinderCvsLTagInfosPuppi
Module: CandSecondaryVertexProducer:pfInclusiveSecondaryVertexFinderTagInfos

A fatal system signal has occurred: segmentation violation
Segmentation fault (core dumped)

@tvami spotted the exact event at which the crash occurs: Run 317696, Event 59331484, LumiSection 63.
The crash can therefore be easily reproduced with:

cmsrel CMSSW_12_0_0
cd CMSSW_12_0_0/src
cmsenv
cp /afs/cern.ch/work/f/fbrivio/public/ALCA/replay_Run2data_PR4602/job_config.py .
cmsRun job_config.py
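To iterate faster on the crashing event, the job can be restricted to that single event. A hedged sketch, assuming job_config.py defines a standard `process` with a PoolSource (the exact source setup inside job_config.py is not shown in this thread); append something like:

```python
# Restrict the cmsRun job to the single crashing event
# (Run 317696, LumiSection 63, Event 59331484).
# Assumes `process` and its PoolSource are already defined in job_config.py.
import FWCore.ParameterSet.Config as cms

process.source.eventsToProcess = cms.untracked.VEventRange(
    '317696:59331484-317696:59331484'
)
process.maxEvents = cms.untracked.PSet(input=cms.untracked.int32(1))
```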

@makortel reported in https://hypernews.cern.ch/HyperNews/CMS/get/tier0-Ops/2261/1/1/1.html the stack trace of the crashing module:

#5  0x00002b4a0a4e17b2 in L1TTauOffline::getProbeTaus(edm::Event const&, edm::Handle<std::vector<reco::PFTau, std::allocator<reco::PFTau> > > const&, edm::Handle<std::vector<reco::Muon, std::allocator<reco::Muon> > > const&, reco::Vertex const&) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw/CMSSW_12_0_0/lib/slc7_amd64_gcc900/pluginDQMOfflineL1Trigger.so
#6  0x00002b4a0a4e2032 in L1TTauOffline::analyze(edm::Event const&, edm::EventSetup const&) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw/CMSSW_12_0_0/lib/slc7_amd64_gcc900/pluginDQMOfflineL1Trigger.so
#7  0x00002b49973221cc in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc900/cms/cmssw/CMSSW_12_0_0/lib/slc7_amd64_gcc900/libFWCoreFramework.so

which seems to be related to the L1T code.

@cmsbuild
Contributor

A new Issue was created by @francescobrivio .

@Dr15Jones, @perrotta, @dpiparo, @makortel, @smuzaffar, @qliphy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@francescobrivio
Contributor Author

assign l1

@cmsbuild
Contributor

New categories assigned: l1

@rekovic,@cecilecaillol you have been requested to review this Pull request/Issue and eventually sign? Thanks

@francescobrivio
Contributor Author

FYI @cms-sw/alca-l2

@mmusich
Contributor

mmusich commented Sep 16, 2021

@francescobrivio I think the crashing code is actually in the L1T DQM:

getProbeTaus(e, taus, muons, primaryVertex);

perhaps it should be assigned to DQM as well.

@francescobrivio
Contributor Author

assign dqm

@cmsbuild
Contributor

New categories assigned: dqm

@emanueleusai,@jfernan2,@pbo0,@rvenditti,@ahmad3213,@pmandrik you have been requested to review this Pull request/Issue and eventually sign? Thanks

@jfernan2
Contributor

@davignon @cecilecaillol @kreczko @benkrikler @vukasinmilosevic @dinyar @thomreis @rekovic @stahlleiton @astakia @cobby319 @ly615
As L1T DQM Offline contacts, could you please have a look and provide a fix?

@boudoul
Contributor

boudoul commented Sep 16, 2021

[attracting the attention of @dpiparo and James Letts (I don't know his github name) ]

@lathomas
Contributor

https://github.com/cms-sw/cmssw/blob/master/DQMOffline/L1Trigger/src/L1TTauOffline.cc#L709
This is the problematic line (the antiele part) that triggers the crash.
A quick-and-dirty fix seems to be to simply make sure the anti-electron discriminators were computed:
if ((*antiele)[tauCandidate].workingPoints.size() == 0) continue;

@tvami
Contributor

tvami commented Sep 16, 2021

> https://github.com/cms-sw/cmssw/blob/master/DQMOffline/L1Trigger/src/L1TTauOffline.cc#L709
> This is the problematic line (the antiele part) that triggers the crash.
> A quick and dirty fix seems to simply make sure the antielectron discriminators were computed
> if((*antiele)[tauCandidate].workingPoints.size() == 0) continue;

Hi all, since we are somewhat pressed for time, AlCa agrees with this plan, so please go ahead with the PR! Maybe on top of this, a protection against all discriminators should be added?

@tvami
Contributor

tvami commented Sep 16, 2021

@lathomas has somebody already been identified to make the PR? We need to merge it to master first, then do the backport, and then cut a release, so the schedule is somewhat tight. (Sorry for being pushy.)

@lathomas
Contributor

OK, I will make the PR.

@tvami
Contributor

tvami commented Sep 17, 2021

I made the backport here
#35322

Thanks again @lathomas for the quick reaction!

@tvami
Contributor

tvami commented Sep 20, 2021

@francescobrivio since the PR has been merged, please close this issue.

@francescobrivio
Contributor Author

Solved by #35322

@jfernan2
Contributor

+1
For the record.


7 participants