
Outstanding issues in migrating to ROOT 6.14 #25919

Closed
Dr15Jones opened this issue Feb 12, 2019 · 38 comments

Comments

@Dr15Jones
Contributor

This is a collection of items that need to be resolved before CMSSW can be migrated to ROOT 6.14

@Dr15Jones
Contributor Author

FYI @slava77 @pcanal

@cmsbuild
Contributor

A new Issue was created by @Dr15Jones Chris Jones.

@davidlange6, @Dr15Jones, @smuzaffar, @fabiocos, @kpedro88 can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@Dr15Jones
Contributor Author

When running merge jobs for the CMSSW_10_5_0_pre1 release validation using ROOT 6.14, we experienced segmentation faults because the default main-thread stack size was too small for the ROOT I/O code to read a particular branch. When the crash occurred, the call stack was on the order of 16k frames deep.

Increasing the stack size to 10 MB (up from the default of 8 MB) allowed the jobs to complete.
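
In wrapper-script terms this amounts to something like the minimal sketch below (assuming a bash wrapper around the job; merge_cfg.py is just a placeholder configuration name):

ulimit -s 10240        # set the stack size limit to 10 MB (the value is in kB)
cmsRun merge_cfg.py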

@Dr15Jones
Contributor Author

@slava77 could you add the ROOT 6.14 related issue you were referring to during the meeting?

@smuzaffar
Contributor

@amaltaro as mentioned privately, this is the issue we see while running ROOT 6.14-based production jobs [a].
@Dr15Jones suggested that we can change this by running ulimit -s 10240 in the wrapper script. Can you please try adding ulimit -a to see what limits are set on the worker nodes, and how we can change them for ROOT 6.14-based RelVals?

[a]
https://cms-unified.web.cern.ch/cms-unified/report/chayanit_RVCMSSW_10_5_0_pre1_ROOT614MinBias_13_190207_084242_5475

https://cms-unified.web.cern.ch/cms-unified/report/nwickram_RVCMSSW_10_5_0_pre1_ROOT614H125GGgluonfusion_13_UP18__FastSim_190205_031513_2301
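
A rough sketch of the requested diagnostic, assuming it is added to the job wrapper script (SITE_NAME stands in for however the wrapper knows the site; the echo lines just mirror the output format shown in the replies below):

echo "Printing resources available to this process with ulimit:"
ulimit -a
echo "Site name:  ${SITE_NAME:-unknown}"
echo "Hostname:   $(hostname -f)"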

@amaltaro

Let me test these things in one of the test agents.

@fabiocos
Contributor

@slava77 regarding the problems observed in the past, I believe they are the same as those seen in the tests of cms-sw/cmsdist#4678

@makortel
Contributor

The unit test failures in cms-sw/cmsdist#4678 look as weird as the unit test failures e.g. in #25907.

@slava77
Contributor

slava77 commented Feb 14, 2019

@slava77 regarding the problems observed in the past, I believe they are the same as those seen in the tests of cms-sw/cmsdist#4678

yes, it refers to the differences in the DQM comparisons in 136.7611 and 136.8311.

DQMHistoTests: Total failures: 15
https://cmssdt.cern.ch/SDT/jenkins-artifacts/baseLineComparisons/CMSSW_10_5_X_2019-02-06-2300+4678/30351/136.7611_RunJetHT2016E_reminiaod+RunJetHT2016E_reminiaod+REMINIAOD_data2016_HIPM+HARVESTDR2_REMINIAOD_data2016_HIPM/Tracking_PackedCandidate.html

@amaltaro

@smuzaffar Shahzad, I cloned that workflow into my testbed setup (so using CMSSW_10_5_0_pre1_ROOT614) and it had a 100% success rate.
Looking at the wrapper log for a CERN worker node, I see that no stack size limit is actually set for those processes, see:

Printing resources available to this process with ulimit:
core file size          (blocks, -c) unlimited
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 117400
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) 117400
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
Site name:  T2_CH_CERN
Hostname:   b6eb86d52a.cern.ch

@Dr15Jones
Contributor Author

I asked Burt Holzman about FNAL worker nodes, and he determined they have a soft stack size limit of 8 MB, which is too small for those jobs.

@amaltaro

This is what I got from FNAL and Nebraska worker nodes (just one sample of each, so I'm not sure it's valid for all site resources).

Printing resources available to this process with ulimit:
core file size          (blocks, -c) unlimited
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 515318
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1048576
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) unlimited
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
Site name:  T1_US_FNAL
Hostname:   cmsgli-4912632-0-cmswn2097.fnal.gov
Printing resources available to this process with ulimit:
core file size          (blocks, -c) unlimited
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 514008
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 65536
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) unlimited
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
Site name:  T2_US_Nebraska
Hostname:   cmsprod-2998778.0-red-c1810.unl.edu

@davidlange6
Contributor

davidlange6 commented Feb 19, 2019 via email

@amaltaro

ulimit -a

@davidlange6
Contributor

davidlange6 commented Feb 19, 2019 via email

@amaltaro

Yes, that's why I provided the hostname as well. Perhaps Chris can ping Dave Mason to check that node?

@dan131riley

This commit from four years ago, e3a98af, sets the TBB thread stack size to 10 MB, so the stack requirement doesn't seem to be new. Could the new factor just be a worker node setting an 8 MB stack limit? And, if we're asking for 10 MB of stack for the TBB threads, maybe we should also try to bump the main thread's ulimit if it is lower than that (as long as it is just the soft limit)?
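
As a hedged bash sketch of that last suggestion (touch only the soft limit, and only when the hard limit permits raising it):

soft=$(ulimit -Ss)   # soft stack limit in kB, or "unlimited"
hard=$(ulimit -Hs)   # hard stack limit; an unprivileged process cannot go above this
if [ "$soft" != "unlimited" ] && [ "$soft" -lt 10240 ]; then
    if [ "$hard" = "unlimited" ] || [ "$hard" -ge 10240 ]; then
        ulimit -S -s 10240   # match the 10 MB already requested for the TBB threads
    fi
fi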

@Dr15Jones
Contributor Author

@holzman FYI

@Dr15Jones
Contributor Author

@dan131riley the 10 MB limit for non-main threads is there because the Linux default of 2 MB was too small for Geant4 jobs once we started running multi-threaded. I picked 10 MB because it was a nice 'round' number, not from any attempt to find a particular minimum.

@holzman

holzman commented Feb 19, 2019

Sorry for the confusion. The default soft limit set by the kernel is indeed 8 MB, but unless the job itself declares a StackSize attribute, HTCondor changes it to unlimited. This does mean that running interactively vs. batch will give different values of RLIMIT_STACK.

Details for the curious:

https://github.com/htcondor/htcondor/blob/master/src/condor_sysapi/resource_limits.cpp#L31-L53
https://github.com/htcondor/htcondor/blob/master/src/condor_starter.V6.1/baseStarter.cpp#L249
https://github.com/htcondor/htcondor/blob/master/src/condor_starter.V6.1/job_info_communicator.cpp#L520
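
A quick way to see the difference is to run the same check both interactively and from inside a batch job and compare (plain bash, not tied to any particular wrapper):

echo "host: $(hostname -f)"
echo "RLIMIT_STACK: soft=$(ulimit -Ss)  hard=$(ulimit -Hs)   (kB, or 'unlimited')"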

@davidlange6
Contributor

davidlange6 commented Feb 19, 2019 via email

@Dr15Jones
Contributor Author

@davidlange6

so to circle back to the original problem - is there a message that should show up in the condor/cmssw logs for jobs killed in this way? I had poked around the logs without really understanding what actually caused the crash

When we hit the stack limit, it causes a segmentation fault (since the operating system has protected memory beyond the stack).
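
A self-contained illustration of that failure mode, unrelated to CMSSW itself, is to exhaust a deliberately small stack limit and watch the process die with SIGSEGV:

(
  ulimit -S -s 512          # shrink the soft stack limit to 512 kB, in this subshell only
  f() { f; }                # unbounded recursion eventually overruns the shell's own stack
  f
)
echo "exit status: $?"      # typically 139 = 128 + 11 (SIGSEGV)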

@Dr15Jones
Contributor Author

@amaltaro do we know what sites were seeing failures?

@davidlange6
Contributor

davidlange6 commented Feb 19, 2019 via email

@fabiocos
Contributor

fabiocos commented May 7, 2019

@Dr15Jones @slava77 so far the official PdmV validation has not reported problems (@prebello @zhenhu please confirm). I have looked at the RelMon output and made some small-scale tests (1000 events) myself with a subselection of workflows. For 136.878 (RunMuonEG2018C) I see just 14 differences out of 156022 histograms, all concentrated in HLT/EXO and JetMET, for the histograms
https://cmssdt.cern.ch/lxr/source/DQMOffline/Trigger/plugins/METMonitor.cc#0127

that show a different number of entries; here are two examples:

[two example histogram comparisons attached as screenshots]

@fabiocos
Contributor

fabiocos commented May 7, 2019

Looking at how the histograms are filled in https://cmssdt.cern.ch/lxr/source/DQMOffline/Trigger/plugins/METMonitor.cc#0268 , the agreement in MET combined with the difference in deltaPhi looks strange; the deltaPhi looks correct in ROOT 6.14.

@Dr15Jones
Contributor Author

Just to be clear, the ROOT 6.12 histograms look inconsistent and the ROOT 6.14 ones look consistent?

@fabiocos
Contributor

fabiocos commented May 7, 2019

BTW, the whole setup for my checks, with output files and histograms, is in cmsdev25.cern.ch:/build/fabiocos/106X/ROOT614 ; in the work subdirectories I have all the results for a number of workflows.

@Dr15Jones
Contributor Author

@fabiocos could you try running the ROOT 6.14 tests again and see if you get the same results?

@davidlange6
Contributor

davidlange6 commented May 7, 2019 via email

@davidlange6
Contributor

davidlange6 commented May 7, 2019 via email

@prebello
Contributor

prebello commented May 7, 2019

@fabiocos so far mostly green-lighted, but Jet FullSim is still in progress (the validator has been contacted to update the status)
https://hypernews.cern.ch/HyperNews/CMS/get/relval/12889/44.html

@fabiocos
Contributor

fabiocos commented May 8, 2019

@davidlange6 ok, this looks more like an issue in the DQM code than in ROOT itself...

@smuzaffar
Contributor

@Dr15Jones @fabiocos, any reason to keep this open?

@fabiocos
Contributor

fabiocos commented Sep 9, 2019

I'll let @Dr15Jones comment; I would say that we are approaching "Outstanding issues in migrating to ROOT 6.18"...

@Dr15Jones
Contributor Author

This can be closed.
