Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Protect TH1::Add() with a catch #26722

Closed
wants to merge 2 commits into from

Conversation

schneiml
Copy link
Contributor

@schneiml schneiml commented May 9, 2019

We saw failing merge jobs [1] due to Add() throwing on incompatible axis. Specifically, if bins have 0 width the axis checks can never succeed, while our tests accepted them. Instead of trying to replicate the ROOT behaviour 1:1, we resort to catching.

Closer inspection shows this is a cascade of four potential bugs.

  • The check for axis compatibility in DQMRootSource.cc is not identical to the check done by ROOT. This allows Add to fail when we expect it to succeed.
  • The CheckAxisLimits() [2] check in ROOT always fails if there is a zero-width bin, because it then uses a 0 epsilon in AreEqualAbs [3], which uses a less-than comparison. One or even both of these things can be considered bugs.
  • While CheckConsistency() should catch the exceptions thrown by CheckAxisLimits(), it seems a FatalRootError still makes it up to CMSSW. I don't see how that happens, but I have not checked the details.
  • The histogram N_HE from HCAL [4] uses some math to compute the axis range, which can lead to a 0-0 range (probably only in Phase2). This is most likely also a bug, and it is what triggers the chain of events.

To resolve this issue, I resorted to catching exceptions. This is not what we usually do, but the alternatives seem not much better:

  • We could copy-paste the checks from ROOT to make sure an exception never happens. But this is prone to diverge over time (maybe this is what happend here, I did not check ROOT version history).
  • We could simply fix HcalRecHitsAnalyzer, but that does not resolve the underlying problem.
  • We could wait for the ROOT team to resolve the issue, but that might take its time, while these failures are catastrophic to the DQMIO merge step (any merge job touching the problematic N_HE (and probably others) histogram will produce an ~empty result).

I am open for better suggestions. Also cross-posting to the ROOT forum for help [5].

[1] https://cms-unified.web.cern.ch/cms-unified//joblogs/ssawant_RVCMSSW_10_6_0_pre4ElectronGunPt2To100__2023D41noPU_190424_005155_567/8022/RecoFullGlobal_2023D41MergeDQMoutput/a97935fb-c204-4a88-b18c-8984a68cca65-0-4-logArchive/job/WMTaskSpace/cmsRun1/
[2] https://root.cern.ch/doc/master/TH1_8cxx_source.html#l01532
[3] https://root.cern.ch/doc/master/TMath_8h_source.html#l00412
[4] https://cmssdt.cern.ch/dxr/CMSSW/source/DQMOffline/Hcal/src/HcalRecHitsAnalyzer.cc#140-143
[5] https://root-forum.cern.ch/t/exceptions-on-th1-add/33973

PR Vaildation:

Fixes the problem in question, using the affected input files. No further validation so far.

We saw failing merge jobs due to Add() throwing on incompatible axis. Specifically, if bins have 0 width the axis checks can never succeed, while our tests accepted them. Instead of trying to replicate the ROOT behaviour 1:1, we resort to catching.
@cmsbuild
Copy link
Contributor

cmsbuild commented May 9, 2019

The code-checks are being triggered in jenkins.

@cmsbuild
Copy link
Contributor

cmsbuild commented May 9, 2019

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-26722/9708

  • This PR adds an extra 24KB to repository

@cmsbuild
Copy link
Contributor

cmsbuild commented May 9, 2019

A new Pull Request was created by @schneiml (Marcel Schneider) for master.

It involves the following packages:

DQMServices/FwkIO

@andrius-k, @kmaeshima, @schneiml, @cmsbuild, @jfernan2, @fioriNTU can you please review it and eventually sign? Thanks.
@barvic this is something you requested to watch as well.
@davidlange6, @slava77, @fabiocos you are the release manager for this.

cms-bot commands are listed here

@schneiml
Copy link
Contributor Author

schneiml commented May 9, 2019

please test

@cmsbuild
Copy link
Contributor

cmsbuild commented May 9, 2019

The tests are being triggered in jenkins.
https://cmssdt.cern.ch/jenkins/job/ib-run-pr-tests/119/console Started: 2019/05/09 12:00

@davidlange6
Copy link
Contributor

so the idea is hide problems in the log file that no one will look at (eg, like the one root emits when booking the histogram you refer too...)?

much preferable to add a test that crashes when wrong histogram definitions are added at the PR/IB level and not the release/relval level so that developers more quickly learn about and fix problems.

@schneiml
Copy link
Contributor Author

schneiml commented May 9, 2019

@davidlange6 Inconsistent axis range bugs where always hidden in the log files. The problem here is that the check did not catch equal-but-illegal axis ranges.

I agree that HcalRecHitsAnalyzer should be fixed, but I can't really do that without coordination with the subsystem. I am not sure how urgent this is, but I have to assume that no Phase2 workflow will produce any DQM output, which sounds bad.

Nevertheless, DQM should be able to handle whatever subsystems throw at it without crashing (esp. crashing in the merge step). So even if we fix HCAL DQM, we should fix this as well.

Checking and failing on weird axis ranges in booking is a good suggestion, but not something I'd want to sneak into a production release past the deadlines.

@davidlange6
Copy link
Contributor

davidlange6 commented May 9, 2019 via email

@abdoulline
Copy link

abdoulline commented May 9, 2019

Indeed, there is a bug in [4]. Thank you for tracking it down.

https://cmssdt.cern.ch/dxr/CMSSW/source/DQMOffline/Hcal/src/HcalRecHitsAnalyzer.cc#80
gets 0 for Phase2 (as there is no HE subdetector)
We have to fix it (sooner than later). Hopefully tomorrow.

@cmsbuild
Copy link
Contributor

cmsbuild commented May 9, 2019

@cmsbuild
Copy link
Contributor

cmsbuild commented May 9, 2019

Comparison job queued.

@fwyzard
Copy link
Contributor

fwyzard commented May 9, 2019

I would say that using exceptions in TH1::CheckConsistency() is at least counterintuitive... what is the point of returning false if it is prepended by a throw ?

See https://root.cern.ch/doc/v614/TH1_8cxx_source.html#l01610 .

// TODO: we might want to switch to Merge() which should be smarter. It
// seems that Add() would call Merge() if needed (not identical bins
// etc.) but that seems to not work, see below.
iOriginal->Add(iToAdd);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have any idea if TH1::Add is exception safe? That is, after an exception is thrown, is the TH1 left in a valid state?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is a very good point. I'd think so, but we should check what ROOT has to say. Actually, I am no longer fully certain that Add is supposed to throw at all...

We should certainly check the return value now. One moment please...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ROOT doesn't cause the throw, CMS does. By default, we convert ROOT error messages into exceptions (by plugging into ROOT's error handling system). We've done this for more than a decade as it has been the only reasonable way to quickly spot problems originating in ROOT.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, that explains the mysterious step that was not clear to me. Reading the error message actually points to a specific place in TH1Merger, let me check what is going on there.

@cmsbuild
Copy link
Contributor

cmsbuild commented May 9, 2019

The code-checks are being triggered in jenkins.

@schneiml
Copy link
Contributor Author

schneiml commented May 9, 2019

I just made the check more proper -- ROOT does actually not throw exceptions here, and we still need to check some things manually.

Also, the exception thrown by CheckAxisLimits might have been a false lead -- gdb stopped there, but it does not immediatly cause the merge failure.

@cmsbuild
Copy link
Contributor

cmsbuild commented May 9, 2019

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-26722/9715

  • This PR adds an extra 24KB to repository

@cmsbuild
Copy link
Contributor

cmsbuild commented May 9, 2019

Pull request #26722 was updated. @andrius-k, @kmaeshima, @schneiml, @cmsbuild, @jfernan2, @fioriNTU can you please check and sign again.

@schneiml
Copy link
Contributor Author

schneiml commented May 9, 2019

@Dr15Jones does the ROOT error conversion also turn warnings into exceptions? Assuming that, what we see is https://root.cern.ch/doc/v612/TH1Merger_8cxx_source.html#l00546 , which indicates that hdes->fBuffer is null. Not sure what to make of this. Either the range checks somewhere in Merge got confused or going down the Merge path is a mistake in the first place and does not work in general for whatever reason.

@Dr15Jones
Copy link
Contributor

@schneiml The code which converts a ROOT message to an exception is here:

https://github.com/cms-sw/cmssw/blob/master/FWCore/Services/plugins/InitRootHandlers.cc#L184

That code does two parts. One is convert ROOT messages into MessageLogger messages. The second is for items ROOT (or CMS) believes are fatal, it throws an exception. It looks like by default we throw for anything ROOT has marked as kWarning or above.

@cmsbuild
Copy link
Contributor

cmsbuild commented May 9, 2019

Comparison is ready
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-6a99bb/119/summary.html

Comparison Summary:

  • No significant changes to the logs found
  • Reco comparison results: 4 differences found in the comparisons
  • DQMHistoTests: Total files compared: 33
  • DQMHistoTests: Total histograms compared: 3211964
  • DQMHistoTests: Total failures: 1
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3211759
  • DQMHistoTests: Total skipped: 204
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 32 files compared)
  • Checked 137 log files, 14 edm output root files, 33 DQM output files

@schneiml
Copy link
Contributor Author

schneiml commented May 9, 2019

@Dr15Jones regarding exception safety: So when ROOT calls Warning, that call never returns because CMSSW transfers control via an exception?

I'd assume a "do nothing" is not very dangerous in this case and the TH1 is left in a sane state, but if we hijack control flow, ROOT could be rightfully confused and mess up things, so this if probably not safe in general.

We should figure out why that BufferMerge fails, maybe we can just "correctly" (as good as ROOT can) merge these corner cases...

@schneiml
Copy link
Contributor Author

Closing this in favour of #26738, which does not rely on catching exceptions. Continuing after a Warning-turned-exception in TH1Merger seems way more dangerous than the risk of the checks diverging again, risking another merge job failure.

@schneiml schneiml closed this May 10, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants