New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Protect TH1::Add() with a catch #26722
Conversation
We saw failing merge jobs due to Add() throwing on incompatible axis. Specifically, if bins have 0 width the axis checks can never succeed, while our tests accepted them. Instead of trying to replicate the ROOT behaviour 1:1, we resort to catching.
The code-checks are being triggered in jenkins. |
+code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-26722/9708
|
A new Pull Request was created by @schneiml (Marcel Schneider) for master. It involves the following packages: DQMServices/FwkIO @andrius-k, @kmaeshima, @schneiml, @cmsbuild, @jfernan2, @fioriNTU can you please review it and eventually sign? Thanks. cms-bot commands are listed here |
please test |
The tests are being triggered in jenkins. |
so the idea is hide problems in the log file that no one will look at (eg, like the one root emits when booking the histogram you refer too...)? much preferable to add a test that crashes when wrong histogram definitions are added at the PR/IB level and not the release/relval level so that developers more quickly learn about and fix problems. |
@davidlange6 Inconsistent axis range bugs where always hidden in the log files. The problem here is that the check did not catch equal-but-illegal axis ranges. I agree that Nevertheless, DQM should be able to handle whatever subsystems throw at it without crashing (esp. crashing in the merge step). So even if we fix HCAL DQM, we should fix this as well. Checking and failing on weird axis ranges in booking is a good suggestion, but not something I'd want to sneak into a production release past the deadlines. |
Nevertheless, DQM should be able to handle whatever subsystems throw at it without crashing (esp. crashing in the merge step). So even if we fix HCAL DQM, we should fix this as well.
so i guess thats where we disagree
|
Indeed, there is a bug in [4]. Thank you for tracking it down. https://cmssdt.cern.ch/dxr/CMSSW/source/DQMOffline/Hcal/src/HcalRecHitsAnalyzer.cc#80 |
Comparison job queued. |
I would say that using exceptions in See https://root.cern.ch/doc/v614/TH1_8cxx_source.html#l01610 . |
// TODO: we might want to switch to Merge() which should be smarter. It | ||
// seems that Add() would call Merge() if needed (not identical bins | ||
// etc.) but that seems to not work, see below. | ||
iOriginal->Add(iToAdd); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have any idea if TH1::Add is exception safe? That is, after an exception is thrown, is the TH1 left in a valid state?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is a very good point. I'd think so, but we should check what ROOT has to say. Actually, I am no longer fully certain that Add
is supposed to throw at all...
We should certainly check the return value now. One moment please...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ROOT doesn't cause the throw, CMS does. By default, we convert ROOT error messages into exceptions (by plugging into ROOT's error handling system). We've done this for more than a decade as it has been the only reasonable way to quickly spot problems originating in ROOT.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, that explains the mysterious step that was not clear to me. Reading the error message actually points to a specific place in TH1Merger
, let me check what is going on there.
The code-checks are being triggered in jenkins. |
I just made the check more proper -- ROOT does actually not throw exceptions here, and we still need to check some things manually. Also, the exception thrown by |
+code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-26722/9715
|
Pull request #26722 was updated. @andrius-k, @kmaeshima, @schneiml, @cmsbuild, @jfernan2, @fioriNTU can you please check and sign again. |
@Dr15Jones does the ROOT error conversion also turn warnings into exceptions? Assuming that, what we see is https://root.cern.ch/doc/v612/TH1Merger_8cxx_source.html#l00546 , which indicates that |
@schneiml The code which converts a ROOT message to an exception is here: https://github.com/cms-sw/cmssw/blob/master/FWCore/Services/plugins/InitRootHandlers.cc#L184 That code does two parts. One is convert ROOT messages into MessageLogger messages. The second is for items ROOT (or CMS) believes are fatal, it throws an exception. It looks like by default we throw for anything ROOT has marked as kWarning or above. |
Comparison is ready Comparison Summary:
|
@Dr15Jones regarding exception safety: So when ROOT calls I'd assume a "do nothing" is not very dangerous in this case and the TH1 is left in a sane state, but if we hijack control flow, ROOT could be rightfully confused and mess up things, so this if probably not safe in general. We should figure out why that |
Closing this in favour of #26738, which does not rely on catching exceptions. Continuing after a |
We saw failing merge jobs [1] due to Add() throwing on incompatible axis. Specifically, if bins have 0 width the axis checks can never succeed, while our tests accepted them. Instead of trying to replicate the ROOT behaviour 1:1, we resort to catching.
Closer inspection shows this is a cascade of four potential bugs.
DQMRootSource.cc
is not identical to the check done by ROOT. This allowsAdd
to fail when we expect it to succeed.CheckAxisLimits()
[2] check in ROOT always fails if there is a zero-width bin, because it then uses a 0 epsilon inAreEqualAbs
[3], which uses a less-than comparison. One or even both of these things can be considered bugs.CheckConsistency()
should catch the exceptions thrown byCheckAxisLimits()
, it seems aFatalRootError
still makes it up to CMSSW. I don't see how that happens, but I have not checked the details.N_HE
from HCAL [4] uses some math to compute the axis range, which can lead to a 0-0 range (probably only in Phase2). This is most likely also a bug, and it is what triggers the chain of events.To resolve this issue, I resorted to catching exceptions. This is not what we usually do, but the alternatives seem not much better:
HcalRecHitsAnalyzer
, but that does not resolve the underlying problem.N_HE
(and probably others) histogram will produce an ~empty result).I am open for better suggestions. Also cross-posting to the ROOT forum for help [5].
[1] https://cms-unified.web.cern.ch/cms-unified//joblogs/ssawant_RVCMSSW_10_6_0_pre4ElectronGunPt2To100__2023D41noPU_190424_005155_567/8022/RecoFullGlobal_2023D41MergeDQMoutput/a97935fb-c204-4a88-b18c-8984a68cca65-0-4-logArchive/job/WMTaskSpace/cmsRun1/
[2] https://root.cern.ch/doc/master/TH1_8cxx_source.html#l01532
[3] https://root.cern.ch/doc/master/TMath_8h_source.html#l00412
[4] https://cmssdt.cern.ch/dxr/CMSSW/source/DQMOffline/Hcal/src/HcalRecHitsAnalyzer.cc#140-143
[5] https://root-forum.cern.ch/t/exceptions-on-th1-add/33973
PR Vaildation:
Fixes the problem in question, using the affected input files. No further validation so far.