Visualization plugin hangs on some multi-node setups #749
I'd love to see a backtrace to see where it hangs. Do you know how to do that? The idea is that you wait until it hangs, then log onto one of the nodes on which the job runs, start the debugger, and attach it to one of the processes that run the program. In essence, it's like running the program in a debugger, except that you attach the debugger to an already running program. Once you're attached, you can call `backtrace`. The location where it hangs may be different for different processes.
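The attach-and-backtrace procedure described above can be sketched as the following session (the program name and PID are illustrative, and gdb must be available on the compute node):

```shell
# on the compute node: find a PID of one of the running processes
ps aux | grep aspect

# attach the debugger to that already-running process (PID illustrative)
gdb -p 12345

# inside gdb: print a backtrace for every thread of this process
(gdb) thread apply all bt

# detach so the process keeps running, then quit
(gdb) detach
(gdb) quit
```

Repeating this on a few different processes shows whether they are all stuck in the same place.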
@bangerth you implemented the "write to tmp and mv" scheme for graphical output, right? Did you have evidence that this improved performance? I assume it was faster for you because /tmp is local, whereas you wanted to write to an NFS file system? My suggestion would be to not do this any more, at least by default. We had several people report problems with this (hangs, etc.), and I sometimes see these "WARNING: could not create temporary ..." messages appear too. It is also easy to write files into a local directory instead of the NFS system (I create output/ as a symlink to a local directory). If files are big enough to be a problem, one would want to use MPI IO (i.e., grouping>0) instead anyway. Thoughts?
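For reference, the "write to tmp and mv" scheme under discussion can be sketched as follows. This is not ASPECT's actual implementation; all paths and file names are illustrative, and the fallback branch mirrors the "could not create temporary ..." case mentioned above.

```shell
#!/bin/sh
# Sketch of the "write to tmp and mv" scheme: write the output file to
# node-local storage first, then publish it to the (possibly NFS-mounted)
# output directory with a single mv.
set -e

outdir="output"                      # final destination (may live on NFS)
mkdir -p "$outdir"
final="$outdir/solution-0000.vtu"

# Try to create a temporary file on node-local storage; if that fails,
# fall back to writing directly to the final destination.
if tmpfile=$(mktemp "${TMPDIR:-/tmp}/solution-XXXXXX" 2>/dev/null); then
    :
else
    tmpfile="$final"
fi

printf '<VTKFile/>\n' > "$tmpfile"   # stand-in for the real VTU writer

# Publish with one mv instead of many small writes over NFS.
if [ "$tmpfile" != "$final" ]; then
    mv "$tmpfile" "$final"
fi
cat "$final"
```

The appeal of the scheme is that the slow file system sees only a single rename-like operation; the reported hangs suggest the extra thread and temporary-file machinery are where things go wrong.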
I think I've done as you ask; I've no experience of gdb or MPI debugging! I've ssh-ed into each node that's running the job and attached the debugger. Every time I attach, I get a lot of warnings that debug information is not found for many of the libraries. Results are attached. The error file is empty, and the output file holds:

Number of active cells: 256 (on 5 levels)
*** Timestep 0: t=0 seconds
Postprocessing:
*** Timestep 1: t=0.0123322 seconds

I'm not sure this is too helpful, as I can't see any process that's not stuck in the preconditioner stage; please advise if I'm doing something wrong! I will also try again and see if I can get it to hang before it prints out Timestep 1.
I had a spare moment. Running it again, it hangs at

*** Timestep 4: t=0.0194467 seconds
Postprocessing:
*** Timestep 5: t=0.0204739 seconds
Postprocessing:

and the output is here:
I agree that we should make the default behaviour more resistant to system-specific problems. Maybe we can make "Write in background" and "Temporary file location" input parameters, with the default behaviour being to write to the final destination directly and without using an additional thread? Then we could also drop all these fallback options in https://github.com/geodynamics/aspect/blob/master/source/postprocess/visualization.cc#L534 and simply fail with a useful error message in case something does not work.
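In a `.prm` input file, such parameters could look roughly like this (a sketch only: the section layout follows ASPECT's usual conventions, but the exact parameter names were still to be decided at the time of this discussion):

```
subsection Postprocess
  subsection Visualization
    # default: write directly to the final destination,
    # without a background thread
    set Write in background thread = false

    # empty: do not stage output through a temporary location
    set Temporary output location =
  end
end
```

With these defaults, the fragile tmp-and-mv path would only ever run when a user explicitly opts in.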
I'll admit that I'm confused. All processes seem to hang in the preconditioner. Does the problem reproduce if you run only two processes, but have them run on different machines? (Most schedulers allow you to specify that you want to run only one process per node.) In other words, is the problem that you're running on multiple nodes, or that you run on more than 16 cores?
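The one-process-per-node test suggested above can look like the following batch-script fragment (a sketch for SLURM; directive names differ between schedulers, and the executable and input file names are illustrative):

```
#SBATCH --nodes=2            # two different machines
#SBATCH --ntasks-per-node=1  # only one MPI process on each of them

srun ./aspect cookbook.prm
```

If two processes on two nodes still hang, the node-to-node communication path is implicated rather than the core count.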
On 02/10/2016 10:50 AM, Timo Heister wrote:
Yes. I don't think I have the data any more, but it turned out that having …
How do you find a local directory? Or do you want to suggest that users set …
On 02/10/2016 12:27 PM, Rene Gassmöller wrote:
Do you want to investigate the problem further and create the PR yourself, or should I go ahead and just create the input parameters?
The latter, of course. On every large machine I have been on, there are guidelines regarding storage, and using the NFS for large IO is never advertised, for good reasons. :-) Anyway, if you have 1000+ processes you had better know where you are writing and whether you can use MPI IO.
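The MPI IO path mentioned here is selected through the file-grouping setting referred to earlier as "grouping>0"; as a sketch, a positive number of grouped files switches the visualization writer to collective MPI IO:

```
subsection Postprocess
  subsection Visualization
    # >0: combine output from all processes into this many
    # files, written collectively via MPI IO
    set Number of grouped files = 1
  end
end
```

At 1000+ processes this avoids both the per-process small-file writes and the temporary-file staging entirely.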
@gassmoeller -- if you have time, please go ahead. I won't get to it within the next few days for sure :-( |
Running on a multi-node setup on a cluster can hang in the visualization plugin, interrupted only by the cluster's walltime limits.
On my cluster:
Typical error output, for what it's worth:
anode120:UCM:746e:cb233700: 2081437717 us(2081437717 us!!!): dapl async_event CQ (0x10dfdf0) ERR 0
anode120:UCM:746e:cb233700: 2081437754 us(37 us): -- dapl_evd_cq_async_error_callback (0x107e170, 0x1080e90, 0x2aaacb232c90, 0x10dfdf0)
anode120:UCM:746e:cb233700: 2081437784 us(30 us): dapl async_event QP (0x10fa3b0) Event 1
anode120:UCM:746c:cb233700: -287395973 us(-287395973 us): dapl async_event CQ (0x10dfdf0) ERR 0
anode120:UCM:746c:cb233700: -287395932 us(41 us): -- dapl_evd_cq_async_error_callback (0x107e170, 0x1080e90, 0x2aaacb232c90, 0x10dfdf0)
anode120:UCM:746c:cb233700: -287395902 us(30 us): dapl async_event QP (0x10fa3b0) Event 1
anode120:UCM:7470:cb233700: -279412729 us(-279412729 us): dapl async_event CQ (0x10dfdf0) ERR 0
anode120:UCM:7470:cb233700: -279412691 us(38 us): -- dapl_evd_cq_async_error_callback (0x107e170, 0x1080e90, 0x2aaacb232c90, 0x10dfdf0)
anode120:UCM:7470:cb233700: -279412605 us(86 us): dapl async_event QP (0x10fa3b0) Event 1
Using Intel MPI (impi) 4.1.3.048
See discussion on Aspect-devel mailing list in early August 2015 for more details.