
crash in HBHEPhase1Reconstructor in PromptReco_Run345915_ZeroBias19 #35785

Closed
slava77 opened this issue Oct 22, 2021 · 16 comments

slava77 (Contributor) commented Oct 22, 2021

https://hypernews.cern.ch/HyperNews/CMS/get/tier0-Ops/2294.html

/afs/cern.ch/user/c/cmst0/public/PausedJobs/PilotBeam/job_128876/tarball

The stack trace is pointing to

#5 HcalDeterministicFit::phase1Apply(HBHEChannelInfo const&, float&, float&, HcalTimeSlew const*)
#6 SimpleHBHEPhase1Algo::reconstruct(HBHEChannelInfo const&, ...
#7 void HBHEPhase1Reconstructor::processData<QIE11DataFrame,
#8 HBHEPhase1Reconstructor::produce(

run details https://cmsoms.cern.ch/cms/runs/report?cms_run=345915&cms_run_sequence=GLOBAL-RUN

Since I have not seen crashes in HBHEPhase1Reconstructor in a while, it would be nice to have the HCAL experts take a look at this.

@cms-sw/hcal-dpg-l2

cmsbuild (Contributor) commented

A new Issue was created by @slava77 Slava Krutelyov.

@Dr15Jones, @perrotta, @dpiparo, @makortel, @smuzaffar, @qliphy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

slava77 (Contributor, Author) commented Oct 22, 2021

assign reconstruction

cmsbuild (Contributor) commented

New categories assigned: reconstruction

@slava77,@jpata you have been requested to review this Pull request/Issue and eventually sign? Thanks

abdoulline commented Oct 24, 2021

This is just to explicitly add @igv4321 and @mariadalfonso.
As Maria has pointed out in the HCAL thread:

  • HcalDeterministicFit (seemingly provoking the crash) = the M3 fit, which is not supposed to be used anywhere in Run 3;
  • it didn't get any attention in the context of Run 3 preparations.

As far as I can see, M3 stays activated in several places, the major one being HBHEPhase1Reconstructor:
https://cmssdt.cern.ch/lxr/search?%21&_filestring=&_string=useM3&_casesensitive=1
So we may consider switching it off (both in 12_1_X and 12_0_X) right away as a precautionary measure.
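A minimal sketch of what such a precautionary switch-off could look like as a cmsRun customisation (assuming the hbheprereco/hbhereco module names and the useM3 flag inside the algorithm PSet of HBHEPhase1Reconstructor_cfi.py; the actual change made in the release may differ):

import FWCore.ParameterSet.Config as cms

def disableM3(process):
    # Hypothetical helper: switch off the M3 (HcalDeterministicFit) branch of the
    # Phase1 HBHE reconstruction wherever the module is present in the process.
    for name in ("hbheprereco", "hbhereco"):
        if hasattr(process, name):
            getattr(process, name).algorithm.useM3 = cms.bool(False)
    return process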

abdoulline commented Oct 24, 2021

The recipe to reproduce the issue (for our colleagues' eventual investigation; see also the info in https://hypernews.cern.ch/HyperNews/CMS/get/tier0-Ops/2294.html):

cmsrel CMSSW_12_0_2_patch2
cd CMSSW_12_0_2_patch2/src
cmsenv
cp /eos/cms/store/logs/prod/recent/PromptReco/PromptReco_Run345915_ZeroBias19/Reco/vocms0314.cern.ch-128876-3-log.tar.gz .
tar -xzvf vocms0314.cern.ch-128876-3-log.tar.gz
cd job/WMTaskSpace/cmsRun1/
cmsRun PSet.py

NB: it would be good to go directly to the 26299th record (Run 345915, Event 12287281); it is not clear how to do this with the config in the tarball...
Otherwise it takes ~40 min on lxplus (in the default 8-process mode) to reach the crash point, producing a ~126 MB log file...
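One way to jump directly to that event, assuming the tarball's PSet.py uses a standard PoolSource (untested against this particular config), is to append an event selection at the end of PSet.py:

import FWCore.ParameterSet.Config as cms

# Hypothetical snippet: restrict the PoolSource to the single event of interest
# and stop right after it.
process.source.eventsToProcess = cms.untracked.VEventRange("345915:12287281")
process.maxEvents.input = cms.untracked.int32(1)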

abdoulline commented Oct 25, 2021

NB: when I run the simplest ("minimum-minimorum" = HCAL RAW unpacking + "hbheprereco" [1])
job [2] in the regular CMSSW_12_0_2_patch2 on this particular event of interest, nothing bad happens.
The warning message about a QIE10 (HF/ZDC) digi, which pops up only in the very first processed event, can be ignored: it is a known, temporary, harmless issue and irrelevant in our context. (A rough sketch of such a minimal job is given after the footnotes below.)


[1]
https://cmssdt.cern.ch/lxr/source/RecoLocalCalo/HcalRecProducers/python/HBHEPhase1Reconstructor_cfi.py#0009

[2]
/afs/cern.ch/user/a/abdullin/public/reading_345915-12287281/test_on_data.py
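A rough, untested sketch of what such a minimal RAW-unpacking + hbheprereco job could look like (the global tag key and the input file name are placeholders, and the actual test_on_data.py may differ):

import FWCore.ParameterSet.Config as cms

process = cms.Process("HCALONLY")

# Conditions and geometry from the database; "auto:run3_data" is a placeholder,
# the real job should use the global tag from the T0 config.
process.load("Configuration.StandardSequences.GeometryRecoDB_cff")
process.load("Configuration.StandardSequences.MagneticField_cff")
process.load("Configuration.StandardSequences.FrontierConditions_GlobalTag_cff")
from Configuration.AlCa.GlobalTag import GlobalTag
process.GlobalTag = GlobalTag(process.GlobalTag, "auto:run3_data", "")

# HCAL RAW unpacking (hcalDigis) and the Phase1 HBHE reconstruction (hbheprereco)
process.load("EventFilter.HcalRawToDigi.HcalRawToDigi_cfi")
process.load("RecoLocalCalo.HcalRecProducers.HBHEPhase1Reconstructor_cfi")

process.source = cms.Source("PoolSource",
    fileNames = cms.untracked.vstring("file:run345915_ZeroBias19_RAW.root"),  # placeholder input
    eventsToProcess = cms.untracked.VEventRange("345915:12287281"),
)
process.maxEvents = cms.untracked.PSet(input = cms.untracked.int32(-1))

process.p = cms.Path(process.hcalDigis * process.hbheprereco)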

rappoccio (Contributor) commented

Hi, @abdoulline, we should try this in CMSSW_12_0_3, which is what is now deployed at T0 and elsewhere.

We should make a call on what to do with this. Do we want to switch it off and make a new release?

abdoulline commented Oct 25, 2021

@rappoccio
Hi Salvatore, I guess there are some 12_0_X PRs in the queue, so a release after 12_0_3 will be cut anyway?
As you might have seen from my previous post, in a single CPU thread a "standalone" HCAL reconstruction doesn't crash on the event of interest, so I'm not sure what really happened at T0 for this particular event in run 345915 (which is not a very good run per se). An investigation may take quite some time...
Can this particular run be excluded from T0 processing for the time being?

rappoccio (Contributor) commented

Hi @abdoulline, we can of course exclude it, but we want to make sure this doesn't happen during collisions. Is this expected to recur?

abdoulline commented Oct 25, 2021

@rappoccio
I don't know the answer right away.
As Slava has mentioned above, HBHEPhase1Reconstructor hasn't crashed in a long while; I don't remember since when exactly, 2017 or even longer(?).

With M3 (which right now looks like the culprit) switched off in #35807, a reappearance of the issue is much less probable, but I don't have a 100% guarantee at the moment.

abdoulline commented Oct 25, 2021

With help from @rappoccio (who provided the T0 config) and @mariadalfonso (whose eagle eye caught remaining mis-settings in the aforementioned config), it looks like the useM3=False workaround works in 12_0_X.

slava77 (Contributor, Author) commented Oct 25, 2021

> With help from @rappoccio (who provided the T0 config) and @mariadalfonso (whose eagle eye caught remaining mis-settings in the aforementioned config), it looks like the useM3=False workaround works in 12_0_X.

Now that the crash is reproducible, it would be nice to fix the M3 code, in case it needs to run again (or in case these crashes start appearing in the Run 2 setup, where M3 is still enabled).

abdoulline commented Oct 25, 2021

@slava77 sure, even though we won't get back to M3 in Run 3, the issue needs to be fully tracked down and fixed.
It may take some time, though...

abdoulline commented

@slava77 Slava, I suppose we can close the issue?

slava77 (Contributor, Author) commented Nov 3, 2021

+reconstruction

via #35944 (and the mitigation in #35806)

slava77 closed this as completed Nov 3, 2021
cmsbuild (Contributor) commented Nov 3, 2021

This issue is fully signed and ready to be closed.
