GEM Det ID error in Online and Offline DQM #39456

Closed
rvenditti opened this issue Sep 20, 2022 · 19 comments

@rvenditti

rvenditti commented Sep 20, 2022

We observed the following error message:
Exception Message:
GEMDetId ctor: Invalid parameters: region 1 ring 1 station 1 layer 0 chamber 5 ieta -99
in:

The error seems to be produced by:

throw cms::Exception("InvalidDetId")
<< "GEMDetId ctor: Invalid parameters: region " << region << " ring " << ring << " station " << station
<< " layer " << layer << " chamber " << chamber << " ieta " << ieta << std::endl;
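To make the failure mode concrete, the check behind this message can be pictured as a plain range validation that throws as soon as any field is out of range. The following is a minimal, self-contained sketch under stated assumptions, not the CMSSW implementation: the function checkGEMDetIdParams, the sketch namespace, and every numeric bound are hypothetical, and std::runtime_error stands in for cms::Exception.

```cpp
#include <cassert>
#include <sstream>
#include <stdexcept>
#include <string>

// HYPOTHETICAL bounds, for illustration only. They are NOT the real GEMDetId
// constants; consult the actual GEMDetId header for those.
namespace sketch {
constexpr int kRing = 1;  // assume a single valid ring
constexpr int kMinStation = 0, kMaxStation = 2;
constexpr int kMinLayer = 0, kMaxLayer = 6;
constexpr int kMinChamber = 0, kMaxChamber = 36;
constexpr int kMinIEta = 0, kMaxIEta = 16;
}  // namespace sketch

// True if lo <= value <= hi (inclusive on both ends).
inline bool inRange(int value, int lo, int hi) {
  assert(lo <= hi);  // guard against mis-ordered bounds
  return value >= lo && value <= hi;
}

// Hypothetical stand-in for the validation in the GEMDetId constructor.
// Throws std::runtime_error (instead of cms::Exception) with a message shaped
// like the one in the error report.
inline void checkGEMDetIdParams(int region, int ring, int station, int layer, int chamber, int ieta) {
  using namespace sketch;
  const bool valid = (region == -1 || region == 1) && ring == kRing &&
                     inRange(station, kMinStation, kMaxStation) &&
                     inRange(layer, kMinLayer, kMaxLayer) &&
                     inRange(chamber, kMinChamber, kMaxChamber) &&
                     inRange(ieta, kMinIEta, kMaxIEta);
  if (!valid) {
    std::ostringstream msg;
    msg << "GEMDetId ctor: Invalid parameters: region " << region << " ring " << ring << " station " << station
        << " layer " << layer << " chamber " << chamber << " ieta " << ieta;
    throw std::runtime_error(msg.str());
  }
}
```

Under these assumed bounds, the parameters from the report (region 1, ring 1, station 1, layer 0, chamber 5, ieta -99) throw because only ieta is out of range, while the same chamber with ieta 1 passes.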

First investigations:

  • Express reco does not give an error in 12_4_8, but it crashes with 12_4_8 + PR #39389 and with 12_4_9 (thanks @francescobrivio for checking)
  • The online system does not crash with 12_4_8, but it crashes with 12_4_8 + PR #39389 and with 12_4_9 (thanks @syuvivida for checking)

@cmsbuild

A new Issue was created by @rvenditti .

@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@francescobrivio

francescobrivio commented Sep 20, 2022

Just to add a quick recipe for reproducing the Express crash:
12_4_9 - crash:

cmsrel CMSSW_12_4_9
cd CMSSW_12_4_9/src
cmsenv
cp /afs/cern.ch/user/c/cmst0/public/PausedJobs/Run2022E/InvalidDetIdError/job_1924/Express-5bfd16d5-50da-4ab0-9650-a2f3dae3bc11-3-logArchive.tar.gz .
tar -zxvf Express-5bfd16d5-50da-4ab0-9650-a2f3dae3bc11-3-logArchive.tar.gz 
cd job/WMTaskSpace/cmsRun1
cmsRun PSet.py

12_4_8 - no crash:

cmsrel CMSSW_12_4_8
cd CMSSW_12_4_8/src
cmsenv
cp /afs/cern.ch/user/c/cmst0/public/PausedJobs/Run2022E/InvalidDetIdError/job_1924/Express-5bfd16d5-50da-4ab0-9650-a2f3dae3bc11-3-logArchive.tar.gz .
tar -zxvf Express-5bfd16d5-50da-4ab0-9650-a2f3dae3bc11-3-logArchive.tar.gz 
cd job/WMTaskSpace/cmsRun1
cp /afs/cern.ch/work/f/fbrivio/public/tier0_issue_run359045/my_config.py .
cmsRun my_config.py

Note 1: in 12_4_8 you need to copy my_config.py, in which I disabled ParticleNetJetTagMonitor and SiPixelCalSingleMuonAnalyzer, since those were introduced in 12_4_9.

Note 2: to reproduce the crash in 12_4_8, do:

cd CMSSW_12_4_8/src
cmsenv
git cms-addpkg EventFilter/L1TRawToDigi
git cherry-pick 65ea079ee92db467fbe026058220aa78fa0a7cf0
git cherry-pick 195accc73a10b8749fa1a6fd3fd15d6687afd5ce
scram b -j 8

and then run my_config.py again.

@francescobrivio

Let me add from the @cms-sw/alca-l2 point of view:

@eyigitba you are the author of the original PR: any idea?

@francescobrivio

Let me also tag @hyunyong @quark2 @jshlee @watson-ij as GEM experts.

@rappoccio

urgent

@makortel

assign dqm

@cmsbuild

New categories assigned: dqm

@jfernan2,@ahmad3213,@micsucmed,@rvenditti,@emanueleusai,@syuvivida,@pmandrik you have been requested to review this Pull request/Issue and eventually sign? Thanks

@makortel

assign l1

@cmsbuild

New categories assigned: l1

@epalencia,@rekovic,@cecilecaillol you have been requested to review this Pull request/Issue and eventually sign? Thanks

@francescobrivio

To add even more info:
as pointed out by @davidlange6, the run used in the Tier0 replay (https://cmsoms.cern.ch/cms/runs/report?cms_run=356824&cms_run_sequence=GLOBAL-RUN) did not have GEM included in the global run, which is probably why the crash did not appear in the replay.

@eyigitba

Hi, I think I saw this error in some other workflow. I don't know why this wasn't caught in my tests for this PR. I'll check and submit a PR to fix this today.

@eyigitba

I found the bug; it was a stupid mistake on my part. It exists in all 3 PRs. I'll submit fixes now. Sorry about that.

@perrotta

@eyigitba prepared the fix in #39460, backported as #39461 for 12_5_X and #39462 for 12_4_X.

The fix is quite logical, so it must be applied. Can anybody please verify whether it fully fixes the issue, and whether there are other similar bugs hidden elsewhere in the code?

@francescobrivio

francescobrivio commented Sep 20, 2022

I'm testing it offline with the recipe I provided above, but it's better if it also gets tested in DQM online (@cms-sw/dqm-l2). I'm not sure it can be tested at Tier0 without a release.

@syuvivida

I am building and testing PR #39462 now and will let you know the results.

@francescobrivio

OK, I've re-run the tarball posted by the Tier0 experts for the problematic job:

  • plain CMSSW_12_4_9 --> An exception of category 'InvalidDetId' occurred
  • CMSSW_12_4_9 + 7ebacc9 --> no crash

So the fix seems correct at least from the offline point of view.

@syuvivida

syuvivida commented Sep 20, 2022

We tested PR #39462 in playback using run 359045, which had crashed (as reported by the DQM group earlier). The run finished OK without errors (25 LS tested). The test was done with CMSSW_12_4_9 + PR #39462.

@perrotta

Fixed by #39460, backported as #39461 for 12_5_X and #39462 for 12_4_X.

@perrotta

@rvenditti I think you can close this issue
