Outstanding issues in migrating to ROOT 6.14 #25919

This is a collection of items that need to be resolved before CMSSW can be migrated to ROOT 6.14.
Comments
A new Issue was created by @Dr15Jones Chris Jones. @davidlange6, @Dr15Jones, @smuzaffar, @fabiocos, @kpedro88 can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here
When running merge jobs on release validation jobs for CMSSW_10_5_0_pre1 using ROOT 6.14, we experienced segmentation faults because the default main-thread stack size was too small for the ROOT IO code to read a particular branch. When the crash occurred, the stack was of order 16k frames deep. Increasing the stack size to 10 MB (up from the default of 8 MB) allowed the jobs to complete.
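For reference, raising the stack at the process level looks roughly like the sketch below; this is not the actual CMSSW code, and the helper name and the 10 MiB target are illustrative (the number mirrors the one quoted above). It checks the soft RLIMIT_STACK early in main() and raises it as far as the hard limit allows; on Linux the main-thread stack grows on demand up to the soft limit, so this must run before the deep ROOT IO recursion.

```cpp
#include <sys/resource.h>
#include <cstdio>

// Sketch only: raise the soft stack limit to `target` bytes if it is lower.
static bool raiseStackLimit(rlim_t target) {
  struct rlimit rl;
  if (getrlimit(RLIMIT_STACK, &rl) != 0)
    return false;
  if (rl.rlim_cur == RLIM_INFINITY || rl.rlim_cur >= target)
    return true;  // already unlimited or big enough
  // An unprivileged process may only raise the soft limit,
  // and only up to the hard limit.
  if (rl.rlim_max != RLIM_INFINITY && rl.rlim_max < target)
    target = rl.rlim_max;
  rl.rlim_cur = target;
  return setrlimit(RLIMIT_STACK, &rl) == 0;
}

int main() {
  if (!raiseStackLimit(10 * 1024 * 1024))  // 10 MiB, as in the report above
    std::fprintf(stderr, "warning: could not raise RLIMIT_STACK\n");
  // ... run the job ...
  return 0;
}
```

Note that this only moves the soft limit; a batch system that imposes a hard 8 MB cap would still win.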
@slava77 could you add the ROOT 6.14-related issue you were referring to during the meeting?
@amaltaro as mentioned privately, this is the issue we see while running ROOT 6.14-based production jobs [a].
Let me test these things in one of the test agents.
@slava77 regarding the problems observed in the past, I believe they are the same as those seen in the test of cms-sw/cmsdist#4678.
The unit test failures in cms-sw/cmsdist#4678 look as weird as the unit test failures e.g. in #25907.
Yes, it refers to the differences in the DQM comparisons in 136.7611 and 136.8311.
@smuzaffar Shahzad, I cloned that workflow into my testbed setup (so using CMSSW_10_5_0_pre1_ROOT614) and it had a 100% success rate.
I asked Burt Holzman about the FNAL worker nodes, and he determined they have a soft stack size limit of 8 MB, which is too small for those jobs.
This is what I got from the FNAL and Nebraska worker nodes (just one sample of each, so I'm not sure it's valid for all site resources):
Printing resources available to this process with ulimit:
core file size (blocks, -c) unlimited
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 515318
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 1048576
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) unlimited
cpu time (seconds, -t) unlimited
max user processes (-u) unlimited
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
Site name: T1_US_FNAL
Hostname: cmsgli-4912632-0-cmswn2097.fnal.gov
Printing resources available to this process with ulimit:
core file size (blocks, -c) unlimited
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 514008
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 65536
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) unlimited
cpu time (seconds, -t) unlimited
max user processes (-u) unlimited
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
Site name: T2_US_Nebraska
Hostname: cmsprod-2998778.0-red-c1810.unl.edu
Hi Alan,
which options for ulimit are you using?
So then the FNAL number should correspond to the soft limit suggested by @Dr15Jones (but it does not).
On Feb 19, 2019, at 1:02 PM, Alan Malta Rodrigues wrote:
ulimit -a
Yes, that's why I provided the hostname as well. Perhaps Chris can ping Dave Mason to check that node?
This commit from four years ago, e3a98af, sets the TBB thread stack size to 10 MB, so the stack requirement doesn't seem to be new. Could the new factor just be a WN setting an 8 MB stack limit? And if we're asking for 10 MB of stack for the TBB threads, maybe we should also try to bump the main thread's ulimit if it is lower than that (as long as it is just the soft limit)?
@holzman FYI
@dan131riley the 10 MB limit for non-main threads is there because the default of 2 MB from Linux was too small for Geant4 jobs once we started running multi-threaded. I picked 10 MB because it was a nice 'round' number, not from any attempt to find a particular minimum.
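For context, the TBB worker stack size is passed in when the scheduler is initialized. Below is a sketch using the classic tbb::task_scheduler_init interface that was current at the time (since deprecated; in oneTBB the equivalent knob is tbb::global_control::thread_stack_size). I have not verified that e3a98af does exactly this, and the constant name is illustrative.

```cpp
#include <tbb/task_scheduler_init.h>
#include <cstddef>

int main() {
  // Worker threads created by TBB get this stack size. It does not
  // affect the main thread, whose stack is governed by RLIMIT_STACK.
  const std::size_t kStackSize = 10 * 1024 * 1024;  // 10 MB
  tbb::task_scheduler_init init(tbb::task_scheduler_init::automatic,
                                kStackSize);
  // ... spawn TBB work ...
  return 0;
}
```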
Sorry for the confusion. The default soft limit set by the kernel is indeed 8 MB, but unless the job itself declares a StackSize attribute, HTCondor changes it to unlimited. This does mean that running interactively vs. batch will give different values of RLIMIT_STACK. Details for the curious:
https://github.com/htcondor/htcondor/blob/master/src/condor_sysapi/resource_limits.cpp#L31-L53
https://github.com/htcondor/htcondor/blob/master/src/condor_starter.V6.1/baseStarter.cpp#L249
https://github.com/htcondor/htcondor/blob/master/src/condor_starter.V6.1/job_info_communicator.cpp#L520
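Since batch and interactive runs can see different RLIMIT_STACK values, one cheap diagnostic (a sketch, not an existing CMSSW feature) is to log the soft limit at job start so the environment is visible in the logs:

```cpp
#include <sys/resource.h>
#include <cstdio>

int main() {
  struct rlimit rl;
  if (getrlimit(RLIMIT_STACK, &rl) == 0) {
    if (rl.rlim_cur == RLIM_INFINITY)
      std::printf("stack soft limit: unlimited\n");
    else
      std::printf("stack soft limit: %llu kB\n",
                  (unsigned long long)(rl.rlim_cur / 1024));
  }
  // ... run the job ...
  return 0;
}
```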
So, to circle back to the original problem: is there a message that should show up in the condor/cmssw logs for jobs killed in this way? I had poked around the logs without really understanding what actually caused the crash.
When we hit the stack limit, it causes a segmentation fault (since the operating system has protected memory beyond the stack).
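This also bears on the question above about log messages: an ordinary SIGSEGV handler cannot run on a stack-overflow fault, because there is no stack left to run it on, so the job can die with no message at all. A generic sketch (I have not checked what CMSSW itself installs) of using sigaltstack so a handler can at least emit one line first:

```cpp
#include <signal.h>
#include <unistd.h>

static void segvHandler(int) {
  // Only async-signal-safe calls are allowed in a signal handler.
  static const char msg[] = "fatal: SIGSEGV (possible stack overflow)\n";
  write(STDERR_FILENO, msg, sizeof(msg) - 1);
  _exit(139);  // conventional 128 + SIGSEGV(11)
}

int main() {
  // Give the handler its own stack; without SA_ONSTACK it would fault
  // again while trying to push a frame onto the exhausted stack.
  static char altStack[64 * 1024];  // comfortably >= SIGSTKSZ
  stack_t ss = {};
  ss.ss_sp = altStack;
  ss.ss_size = sizeof(altStack);
  sigaltstack(&ss, nullptr);

  struct sigaction sa = {};
  sa.sa_handler = segvHandler;
  sa.sa_flags = SA_ONSTACK;
  sigaction(SIGSEGV, &sa, nullptr);

  // ... run the job ...
  return 0;
}
```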
@amaltaro do we know what sites were seeing failures?
You can see from the Unified links above - the examples were FNAL (the normal site for relvals).
@Dr15Jones @slava77 so far the official PdmV validation has not reported problems (@prebello @zhenhu please confirm). I have looked at the RelMon output and made some small-scale tests (1000 events) myself with a subselection of workflows. For 136.878 (RunMuonEG2018C) I see just 14 differences out of 156022 histograms, all concentrated in HLT/EXO and JetMET, for histograms that show a different number of entries; here are two examples.
Looking at how the histograms are filled in https://cmssdt.cern.ch/lxr/source/DQMOffline/Trigger/plugins/METMonitor.cc#0268, it looks strange to see agreement in MET but a difference in deltaPhi, which looks correct in ROOT 6.14.
Just to be clear, the ROOT 6.12 histograms look inconsistent and the 6.14 look consistent?
Sorry, I need to rectify: CMSSW_10_6_0_pre3 (ROOT 6.12) looks consistent, while ROOT 6.14 does not. For the central ValDB comparison one can see https://cms-pdmv.cern.ch/ReleaseMonitoring/CMSSW_10_6_0_pre3_ROOT614vsCMSSW_10_6_0_pre3/DataReport_HLT/MuonEGmuEG2018C-v1_105X_dataRun2_v8_resub_RelVal_muEG2018C_319450/b3da40809c.html
BTW, the whole setup for my checks with output files and histograms is in cmsdev25.cern.ch:/build/fabiocos/106X/ROOT614; in the work subdirectories I have all the results for a number of workflows.
@fabiocos could you try running the ROOT 6.14 workflow again and see if it produces the same results?
Most likely it's due to two DQM modules writing to the same spot:
src/DQMOffline/Trigger/python/HTMonitor_cff.py: MonoCentralPFJet80_PFMETNoMu90_PFMHTNoMu90_HTmonitoring.FolderName = cms.string('HLT/EXO/MET/MonoCentralPFJet80_PFMETNoMu90/')
src/DQMOffline/Trigger/python/METMonitor_cff.py: MonoCentralPFJet80_PFMETNoMu90_PFMHTNoMu90_METmonitoring.FolderName = cms.string('HLT/EXO/MET/MonoCentralPFJet80_PFMETNoMu90/')
and both have a histogram with this title. It's not totally obvious to me why it would be just this one directory that is affected, if my understanding is correct.
Ah - perhaps because the jet selection is different only in the case of MonoCentralPFJet80_PFMETNoMu90_PFMHTNoMu90_HTmonitoring vs MonoCentralPFJet80_PFMETNoMu90_PFMHTNoMu90_METmonitoring, and not for the other similarly named pairs of modules.

cool

race condition?
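If the race-condition hypothesis is right, the underlying hazard is generic: two threads filling the same booked histogram without synchronization is a data race in ROOT, and both the bin contents and the entry count become non-reproducible. A standalone illustration (histogram name, binning, and fill values are made up) of the problem and the minimal cure:

```cpp
#include <TH1F.h>
#include <mutex>
#include <thread>

int main() {
  // Stand-in for one histogram booked in a shared folder by two modules.
  TH1F h("deltaphi", "#Delta#phi", 32, 0., 3.2);
  std::mutex m;

  auto fill = [&](double offset) {
    for (int i = 0; i < 100000; ++i) {
      // Without this lock, concurrent TH1::Fill calls race on the bin
      // contents and entry count, giving run-to-run differences.
      std::lock_guard<std::mutex> lock(m);
      h.Fill(offset + 1e-5 * i);
    }
  };

  std::thread t1(fill, 0.0);
  std::thread t2(fill, 1.0);
  t1.join();
  t2.join();
  return 0;
}
```

In the DQM case the two fillers are the HT and MET monitoring modules sharing one FolderName, so giving them distinct folders (or a single owner for the histogram) would remove the collision.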
@fabiocos so far mostly green-lighted, but Jet FullSim is still in progress (the validator has been contacted to update the status).
@davidlange6 ok, this looks more like an issue in the DQM code than in ROOT itself...
@Dr15Jones @fabiocos, any reason to keep it open?
I'll let @Dr15Jones comment; I would say that we are approaching "Outstanding issues in migrating to ROOT 6.18"...
This can be closed.