Outstanding issues in migrating to ROOT 6.14 #25919

This is a collection of items that need to be resolved before CMSSW can be migrated to ROOT 6.14.
Comments
A new Issue was created by @Dr15Jones Chris Jones. @davidlange6, @Dr15Jones, @smuzaffar, @fabiocos, @kpedro88 can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here
When running merge jobs on release validation jobs for CMSSW_10_5_0_pre1 using ROOT 6.14, we experienced segmentation faults because the default main-thread stack size was too small for the ROOT IO code to read a particular branch. When the crash occurred, the stack was of order 16k frames deep. Increasing the stack size to 10 MB (up from the default of 8 MB) allowed the jobs to complete.
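For reference, raising the stack at the process level looks roughly like the sketch below; this is not the actual CMSSW code, and the helper name and the 10 MiB target are illustrative (the number mirrors the one quoted above). It checks the soft RLIMIT_STACK early in main() and raises it as far as the hard limit allows; on Linux the main-thread stack grows on demand up to the soft limit, so this must run before the deep ROOT IO recursion.

```cpp
#include <sys/resource.h>
#include <cstdio>

// Sketch only: raise the soft stack limit to `target` bytes if it is lower.
static bool raiseStackLimit(rlim_t target) {
  struct rlimit rl;
  if (getrlimit(RLIMIT_STACK, &rl) != 0)
    return false;
  if (rl.rlim_cur == RLIM_INFINITY || rl.rlim_cur >= target)
    return true;  // already unlimited or big enough
  // An unprivileged process may only raise the soft limit,
  // and only up to the hard limit.
  if (rl.rlim_max != RLIM_INFINITY && rl.rlim_max < target)
    target = rl.rlim_max;
  rl.rlim_cur = target;
  return setrlimit(RLIMIT_STACK, &rl) == 0;
}

int main() {
  if (!raiseStackLimit(10 * 1024 * 1024))  // 10 MiB, as in the report above
    std::fprintf(stderr, "warning: could not raise RLIMIT_STACK\n");
  // ... run the job ...
  return 0;
}
```

Note that this only moves the soft limit; a batch system that imposes a hard 8 MB cap would still win.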
@slava77 could you add the ROOT 6.14-related issue you were referring to during the meeting?
@amaltaro as mentioned privately, this is the issue we see while running ROOT 6.14-based production jobs [a].
Let me test these things in one of the test agents.
@slava77 regarding the problems observed in the past, I believe they are the same as those seen in the test of cms-sw/cmsdist#4678.
The unit test failures in cms-sw/cmsdist#4678 look as weird as the unit test failures e.g. in #25907.
Yes, it refers to the differences in the DQM comparisons in 136.7611 and 136.8311.
@smuzaffar Shahzad, I cloned that workflow into my testbed setup (so using CMSSW_10_5_0_pre1_ROOT614) and it had a 100% success rate.
I asked Burt Holzman about the FNAL worker nodes, and he determined they have a soft stack size limit of 8 MB, which is too small for those jobs.
This is what I got from the FNAL and Nebraska worker nodes (just one sample of each, so I'm not sure it's valid for all site resources):
Printing resources available to this process with ulimit:
core file size (blocks, -c) unlimited
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 515318
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 1048576
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) unlimited
cpu time (seconds, -t) unlimited
max user processes (-u) unlimited
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
Site name: T1_US_FNAL
Hostname: cmsgli-4912632-0-cmswn2097.fnal.gov
Printing resources available to this process with ulimit:
core file size (blocks, -c) unlimited
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 514008
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 65536
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) unlimited
cpu time (seconds, -t) unlimited
max user processes (-u) unlimited
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
Site name: T2_US_Nebraska
Hostname: cmsprod-2998778.0-red-c1810.unl.edu
Hi Alan,
which options for ulimit are you using?
So then the FNAL number should correspond to the soft limit suggested by @Dr15Jones (but it does not).
On Feb 19, 2019, at 1:02 PM, Alan Malta Rodrigues wrote:
ulimit -a
Yes, that's why I provided the hostname as well. Perhaps Chris can ping Dave Mason to check that node?
This commit from four years ago, e3a98af, sets the TBB thread stack size to 10 MB, so the stack requirement doesn't seem to be new. Could the new factor just be a WN setting an 8 MB stack limit? And if we're asking for 10 MB of stack for the TBB threads, maybe we should also try to bump the main thread's ulimit if it is lower than that (as long as it is just the soft limit)?
@holzman FYI
@dan131riley the 10 MB limit for non-main threads is there because the default of 2 MB from Linux was too small for Geant4 jobs once we started running multi-threaded. I picked 10 MB because it was a nice 'round' number, not from any attempt to find a particular minimum.
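For context, the TBB worker stack size is passed in when the scheduler is initialized. Below is a sketch using the classic tbb::task_scheduler_init interface that was current at the time (since deprecated; in oneTBB the equivalent knob is tbb::global_control::thread_stack_size). I have not verified that e3a98af does exactly this, and the constant name is illustrative.

```cpp
#include <tbb/task_scheduler_init.h>
#include <cstddef>

int main() {
  // Worker threads created by TBB get this stack size. It does not
  // affect the main thread, whose stack is governed by RLIMIT_STACK.
  const std::size_t kStackSize = 10 * 1024 * 1024;  // 10 MB
  tbb::task_scheduler_init init(tbb::task_scheduler_init::automatic,
                                kStackSize);
  // ... spawn TBB work ...
  return 0;
}
```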
Sorry for the confusion. The default soft limit set by the kernel is indeed 8 MB, but unless the job itself declares a StackSize attribute, HTCondor changes it to unlimited. This does mean that running interactively vs. batch will give different values of RLIMIT_STACK. Details for the curious:
https://github.com/htcondor/htcondor/blob/master/src/condor_sysapi/resource_limits.cpp#L31-L53
https://github.com/htcondor/htcondor/blob/master/src/condor_starter.V6.1/baseStarter.cpp#L249
https://github.com/htcondor/htcondor/blob/master/src/condor_starter.V6.1/job_info_communicator.cpp#L520
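Since batch and interactive runs can see different RLIMIT_STACK values, one cheap diagnostic (a sketch, not an existing CMSSW feature) is to log the soft limit at job start so the environment is visible in the logs:

```cpp
#include <sys/resource.h>
#include <cstdio>

int main() {
  struct rlimit rl;
  if (getrlimit(RLIMIT_STACK, &rl) == 0) {
    if (rl.rlim_cur == RLIM_INFINITY)
      std::printf("stack soft limit: unlimited\n");
    else
      std::printf("stack soft limit: %llu kB\n",
                  (unsigned long long)(rl.rlim_cur / 1024));
  }
  // ... run the job ...
  return 0;
}
```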
So, to circle back to the original problem: is there a message that should show up in the condor/cmssw logs for jobs killed in this way? I had poked around the logs without really understanding what actually caused the crash.
When we hit the stack limit, it causes a segmentation fault (since the operating system has protected memory beyond the stack).
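This also bears on the question above about log messages: an ordinary SIGSEGV handler cannot run on a stack-overflow fault, because there is no stack left to run it on, so the job can die with no message at all. A generic sketch (I have not checked what CMSSW itself installs) of using sigaltstack so a handler can at least emit one line first:

```cpp
#include <signal.h>
#include <unistd.h>

static void segvHandler(int) {
  // Only async-signal-safe calls are allowed in a signal handler.
  static const char msg[] = "fatal: SIGSEGV (possible stack overflow)\n";
  write(STDERR_FILENO, msg, sizeof(msg) - 1);
  _exit(139);  // conventional 128 + SIGSEGV(11)
}

int main() {
  // Give the handler its own stack; without SA_ONSTACK it would fault
  // again while trying to push a frame onto the exhausted stack.
  static char altStack[64 * 1024];  // comfortably >= SIGSTKSZ
  stack_t ss = {};
  ss.ss_sp = altStack;
  ss.ss_size = sizeof(altStack);
  sigaltstack(&ss, nullptr);

  struct sigaction sa = {};
  sa.sa_handler = segvHandler;
  sa.sa_flags = SA_ONSTACK;
  sigaction(SIGSEGV, &sa, nullptr);

  // ... run the job ...
  return 0;
}
```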
@amaltaro do we know what sites were seeing failures?
You can see from the Unified links above - the examples were FNAL (the normal site for relvals).
@Dr15Jones @slava77 so far the official PdmV validation has not reported problems (@prebello @zhenhu please confirm). I have looked at the RelMon output and made some small-scale tests (1000 events) myself with a subselection of workflows. For 136.878 (RunMuonEG2018C) I see just 14 differences out of 156022 histograms, all concentrated in HLT/EXO and JetMET, for histograms that show a different number of entries; here are two examples.
Looking at how the histograms are filled in https://cmssdt.cern.ch/lxr/source/DQMOffline/Trigger/plugins/METMonitor.cc#0268, it looks strange to see agreement in MET but a difference in deltaPhi, which looks correct in ROOT 6.14.
Just to be clear, the ROOT 6.12 histograms look inconsistent and the 6.14 look consistent?
Sorry, I need to rectify: CMSSW_10_6_0_pre3 (ROOT 6.12) looks consistent, while ROOT 6.14 does not. For the central ValDB comparison one can see https://cms-pdmv.cern.ch/ReleaseMonitoring/CMSSW_10_6_0_pre3_ROOT614vsCMSSW_10_6_0_pre3/DataReport_HLT/MuonEGmuEG2018C-v1_105X_dataRun2_v8_resub_RelVal_muEG2018C_319450/b3da40809c.html
BTW, the whole setup for my checks with output files and histograms is in cmsdev25.cern.ch:/build/fabiocos/106X/ROOT614; in the work subdirectories I have all the results for a number of workflows.
@fabiocos could you try running the ROOT 6.14 workflow again and see if it produces the same results?
Most likely it's due to two DQM modules writing to the same spot:
src/DQMOffline/Trigger/python/HTMonitor_cff.py: MonoCentralPFJet80_PFMETNoMu90_PFMHTNoMu90_HTmonitoring.FolderName = cms.string('HLT/EXO/MET/MonoCentralPFJet80_PFMETNoMu90/')
src/DQMOffline/Trigger/python/METMonitor_cff.py: MonoCentralPFJet80_PFMETNoMu90_PFMHTNoMu90_METmonitoring.FolderName = cms.string('HLT/EXO/MET/MonoCentralPFJet80_PFMETNoMu90/')
and both have a histogram with this title. It's not totally obvious to me why it would be just this one directory that is affected, if my understanding is correct.
Ah - perhaps because the jet selection is different only in the case of MonoCentralPFJet80_PFMETNoMu90_PFMHTNoMu90_HTmonitoring vs MonoCentralPFJet80_PFMETNoMu90_PFMHTNoMu90_METmonitoring, and not for the other similarly named pairs of modules.

cool

race condition?
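If the race-condition hypothesis is right, the underlying hazard is generic: two threads filling the same booked histogram without synchronization is a data race in ROOT, and both the bin contents and the entry count become non-reproducible. A standalone illustration (histogram name, binning, and fill values are made up) of the problem and the minimal cure:

```cpp
#include <TH1F.h>
#include <mutex>
#include <thread>

int main() {
  // Stand-in for one histogram booked in a shared folder by two modules.
  TH1F h("deltaphi", "#Delta#phi", 32, 0., 3.2);
  std::mutex m;

  auto fill = [&](double offset) {
    for (int i = 0; i < 100000; ++i) {
      // Without this lock, concurrent TH1::Fill calls race on the bin
      // contents and entry count, giving run-to-run differences.
      std::lock_guard<std::mutex> lock(m);
      h.Fill(offset + 1e-5 * i);
    }
  };

  std::thread t1(fill, 0.0);
  std::thread t2(fill, 1.0);
  t1.join();
  t2.join();
  return 0;
}
```

In the DQM case the two fillers are the HT and MET monitoring modules sharing one FolderName, so giving them distinct folders (or a single owner for the histogram) would remove the collision.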
@fabiocos so far mostly green-lighted, but Jet FullSim is still in progress (the validator has been contacted to update the status).
@davidlange6 ok, this looks more like an issue in the DQM code than in ROOT itself...
@Dr15Jones @fabiocos, any reason to keep it open?
I'll let @Dr15Jones comment; I would say that we are approaching "Outstanding issues in migrating to ROOT 6.18"...
This can be closed.