DQM merge failure in CMSSW_11_0_0_pre3 #27528

Closed
fabiocos opened this issue Jul 16, 2019 · 17 comments

Comments

@fabiocos
Contributor

The PdmV team reports a massive failure of the validation in CMSSW_11_0_0_pre3 in the DQMIO merge step, with errors like:

Fatal Exception (Exit code: 8022) 
An exception of category 'FatalRootError' occurred while
[0] Calling InputSource::readRun_
Additional Info:
[a] Fatal Root Error: @SUB=TProfileHelper::Merge
Cannot merge profiles 1 dim - limits are inconsistent:
first: (31, -9.000000, 15.000000), second: (31, -15.000000, 9.000000)

seen for instance in

https://cms-unified.web.cern.ch/cms-unified/report/muahmad_RVCMSSW_11_0_0_pre3Higgs200ChargedTaus_13_190711_065022_9820

or

https://cms-unified.web.cern.ch/cms-unified/report/chayanit_RVCMSSW_11_0_0_pre3RunCharmonium2018D__RelVal_2018D_190710_083031_7155

(but the problem is general).

The set of IB workflows now includes test 137.8, which performs DQM harvesting on two separate files from two periods, and it does not fail. I have also tested the DQMIO merge for it (this step is missing from the test, I would add it), and it does not fail either. But if I split 136.85 into two pieces of 50 events each and then try the merge, I get the failure.

This looks like a show-stopper for a meaningful validation of the release, and should be solved asap.

@cmsbuild
Contributor

A new Issue was created by @fabiocos Fabio Cossutti.

@davidlange6, @Dr15Jones, @smuzaffar, @fabiocos, @kpedro88 can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@fabiocos
Contributor Author

assign dqm

for the problem itself

@cmsbuild
Contributor

New categories assigned: dqm

@jfernan2,@andrius-k,@schneiml,@fioriNTU,@kmaeshima you have been requested to review this Pull request/Issue and eventually sign? Thanks

@fabiocos
Contributor Author

assign pdmv

for the extension of release tests

@cmsbuild
Contributor

New categories assigned: pdmv

@pgunnell,@prebello,@zhenhu you have been requested to review this Pull request/Issue and eventually sign? Thanks

@fabiocos
Contributor Author

@jfernan2 this looks like the continuation of #26727. I will close that one, keeping the reference to it here.

@fabiocos
Contributor Author

Repeating the test with workflow 136.85 in the recent IB, using the addition by @Dr15Jones (#27473) 👍:

----- Begin Fatal Exception 16-Jul-2019 13:13:29 CEST-----------------------
An exception of category 'FatalRootError' occurred while
   [0] Calling InputSource::readRun_
   [1] While reading element Tracking/TrackParameters/generalTracks/GeneralProperties/DistanceOfClosestApproachToBSVsEta_GenTk
   Additional Info:
      [a] Fatal Root Error: @SUB=TProfileHelper::Merge
Cannot merge profiles 1 dim - limits are inconsistent:
 first: (31, -15.000000, 9.000000), second: (31, -9.000000, 15.000000)

----- End Fatal Exception -------------------------------------------------

@fabiocos
Contributor Author

I wonder whether the problem is caused by https://github.com/cms-sw/cmssw/blob/master/DQM/TrackingMonitor/src/TrackAnalyzer.cc#L866, where the axis range is dynamically extended.

@fabiocos
Contributor Author

fabiocos commented Jul 16, 2019

Indeed, removing all the instances of SetCanExtend from that module lets the test with 136.85 split in two complete without any crash.
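
To illustrate why an extendable axis can make two otherwise identical jobs disagree, here is a toy model in Python. It is not ROOT or CMSSW code: it simply assumes that an extendable axis doubles its range toward any out-of-range value, and that a merge requires identical limits. The starting range (-3, 3), the fill values and all function names are hypothetical, chosen only for illustration.

```python
def extend_axis(xmin, xmax, value):
    """Toy model of an extendable axis (assumption, not ROOT's code):
    double the range toward `value` until it lies in [xmin, xmax)."""
    width = xmax - xmin
    while value < xmin:
        xmin -= width
        width *= 2
    while value >= xmax:
        xmax += width
        width *= 2
    return xmin, xmax


def fill_all(limits, values):
    """Apply extend_axis for each filled value, in the order given."""
    xmin, xmax = limits
    for v in values:
        xmin, xmax = extend_axis(xmin, xmax, v)
    return xmin, xmax


def can_merge(a, b):
    """In this simplified model, merging requires identical axis limits."""
    return a == b


# Hypothetical starting range and fill values, for illustration only.
initial = (-3.0, 3.0)
job_a = fill_all(initial, [4.0, -4.0])  # overflow first, then underflow
job_b = fill_all(initial, [-4.0, 4.0])  # underflow first, then overflow

print(job_a, job_b, can_merge(job_a, job_b))
# prints: (-15.0, 9.0) (-9.0, 15.0) False
```

Under these assumptions the two jobs end up with the same bin count and the same axis width but shifted limits, which is the pattern in the error message above: the final limits depend on the order in which out-of-range values arrive, so they differ between the two halves of a split workflow.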

@fabiocos
Contributor Author

I see that this method is used in many places, and just one instance has been added by @hbecerri in #27330 (the one that has caused the failure). I wonder whether the others are all innocent...

@jfernan2
Contributor

@fabiocos I reached the same conclusion... but SetCanExtend is used 190 times across CMSSW, so I believe there must be something else in the code from #27330.
Should I create a PR commenting out these instances in Tracker until the real solution to the problem is found?

@jfernan2
Contributor

Oh sorry, I see you already created it... GitHub had not refreshed the status.
Thanks

@fioriNTU
Contributor

@fabiocos @jfernan2 @mtosi @schneiml just to add a comment: I believe the Tracker is the main user of the SetCanExtend flag. The use case there is to have time profiles defined with a fixed number of bins but with a maximum value that gets adjusted automatically if the run gets too long. I have used this functionality privately for years, and I can assure you that CanExtend+hadd usually does the right thing; it was one of the most robust (and useful) features of ROOT.

In this case I see that the extendable axes are "All", while usually we need only the x-axis to be extendable. I have no idea in which case extending the y-axis of a histogram can be useful, but maybe it is this feature that confuses ROOT. Sorry for the long comment, but I believe that informing ROOT developers about this issue, instead of changing DQM code, would be better.

@Dr15Jones
Contributor

@fioriNTU

> Sorry for the long comment, but I believe that informing ROOT developers about this issue, instead of changing DQM code, would be better.

The exception is from CMSSW, not from ROOT. In CMSSW we 'trap' ROOT Error and Warning messages and turn them into exceptions. This is because way too many of the ROOT messages are actually fatal, but we do not discover the problems until way too late (e.g. missing dictionaries produce just a warning message but cause data to not be stored). This is done here:

https://github.com/cms-sw/cmssw/blob/master/FWCore/Services/plugins/InitRootHandlers.cc#L179

We can add special handling for certain messages; however, we really try to avoid this, since we have found that ROOT error and warning messages are almost invariably saying there is a serious problem happening that a human is supposed to do something about (i.e. ROOT really is set up to work interactively and is not super great for batch processing).
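
As a rough illustration of the pattern described here (not the code in InitRootHandlers.cc), the sketch below shows a message handler that turns warnings and errors from a library into exceptions, with an optional list of messages that get special handling. It is written in Python for brevity; the class names, severity constants and ignore-pattern mechanism are all hypothetical.

```python
import re

# Hypothetical severity levels, for illustration only.
INFO, WARNING, ERROR = 0, 1, 2


class FatalLibraryError(RuntimeError):
    """Raised when a trapped library message is treated as fatal."""


class EscalatingHandler:
    """Turn library messages at or above `threshold` into exceptions.

    Messages matching one of `ignore_patterns` are only recorded;
    this stands in for the 'special handling' of known-harmless messages.
    """

    def __init__(self, threshold=WARNING, ignore_patterns=()):
        self.threshold = threshold
        self.ignore = [re.compile(p) for p in ignore_patterns]
        self.recorded = []

    def __call__(self, level, location, message):
        harmless = any(p.search(message) for p in self.ignore)
        if level >= self.threshold and not harmless:
            raise FatalLibraryError(f"@SUB={location}\n{message}")
        self.recorded.append((level, location, message))


handler = EscalatingHandler(ignore_patterns=[r"known harmless"])
handler(INFO, "SomeClass::Method", "just informational")  # recorded only
handler(WARNING, "SomeClass::Method", "known harmless condition")  # recorded only

try:
    handler(ERROR, "TProfileHelper::Merge",
            "Cannot merge profiles 1 dim - limits are inconsistent")
except FatalLibraryError as exc:
    print("job aborted:", exc)
```

The design choice in this sketch is that escalation is the default and exemptions are explicit, so a new, unknown warning stops the batch job immediately instead of silently producing incomplete output.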

@fabiocos
Contributor Author

#27535 solves this issue (tested with the workflow 136.85 split into two parts)

@fabiocos
Contributor Author

As a closing remark, I think we need to extend test 137.8 so that it also probes the DQMIO file merge (although it would not detect this specific problem).
