Visualization plugin hangs on some multi-node setups #749

Closed
spco opened this issue Feb 10, 2016 · 12 comments

@spco
Contributor

spco commented Feb 10, 2016

Running on multiple nodes of a cluster, the program can hang in the visualization plugin; the hang is only ended when the job hits the cluster's walltime limit.

On my cluster:

  • Disabling the visualization plugin allows the program to run.
  • Alternatively, using 'set Number of grouped files = 1' in the Visualization subsection lets the run continue and produce output (see the parameter snippet after this list).
  • I couldn't get changing TMP or TMPDIR to have any effect.
  • Running on up to 16 cores on one node works fine.
  • Running step-32 across multiple nodes works fine, with visualization output.
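
For reference, a minimal sketch of that workaround as it would look in the parameter file, assuming the usual Postprocess / Visualization nesting of an ASPECT .prm file:

  subsection Postprocess
    subsection Visualization
      # group output from all ranks into a single file written with MPI I/O,
      # instead of one file per process
      set Number of grouped files = 1
    end
  end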

Typical error output, for what it's worth:

anode120:UCM:746e:cb233700: 2081437717 us(2081437717 us!!!): dapl async_event CQ (0x10dfdf0) ERR 0
anode120:UCM:746e:cb233700: 2081437754 us(37 us): -- dapl_evd_cq_async_error_callback (0x107e170, 0x1080e90, 0x2aaacb232c90, 0x10dfdf0)
anode120:UCM:746e:cb233700: 2081437784 us(30 us): dapl async_event QP (0x10fa3b0) Event 1
anode120:UCM:746c:cb233700: -287395973 us(-287395973 us): dapl async_event CQ (0x10dfdf0) ERR 0
anode120:UCM:746c:cb233700: -287395932 us(41 us): -- dapl_evd_cq_async_error_callback (0x107e170, 0x1080e90, 0x2aaacb232c90, 0x10dfdf0)
anode120:UCM:746c:cb233700: -287395902 us(30 us): dapl async_event QP (0x10fa3b0) Event 1
anode120:UCM:7470:cb233700: -279412729 us(-279412729 us): dapl async_event CQ (0x10dfdf0) ERR 0
anode120:UCM:7470:cb233700: -279412691 us(38 us): -- dapl_evd_cq_async_error_callback (0x107e170, 0x1080e90, 0x2aaacb232c90, 0x10dfdf0)
anode120:UCM:7470:cb233700: -279412605 us(86 us): dapl async_event QP (0x10fa3b0) Event 1

Using Intel MPI (impi) 4.1.3.048.

See the discussion on the aspect-devel mailing list from early August 2015 for more details.

@bangerth
Contributor

I'd love to see a backtrace to see where it hangs.

Do you know how to do that? The idea is that you wait until it hangs, then log onto one of the nodes on which the job runs, start the debugger, and attach it to one of the processes running the program. In essence, it's like running the program in a debugger, except that you attach the debugger to an already running program. Once you're attached, you can call backtrace to see where it hangs.

The location where it hangs may be different for different processes.
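
A minimal sketch of that procedure, assuming gdb is available on the compute nodes and the binary is called aspect (the node name and PID below are placeholders):

  ssh anode120                  # one of the compute nodes the hung job runs on
  pgrep aspect                  # list the PIDs of the ASPECT ranks on this node

  # attach to one rank, print a backtrace for all of its threads, then detach;
  # repeat for each PID (and each node)
  gdb -p <PID> -batch -ex "thread apply all bt"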

@tjhei
Member

tjhei commented Feb 10, 2016

@bangerth you implemented the "write to tmp and mv" scheme for graphical output, right? Did you have evidence that this improved performance? I assume it was faster for you because /tmp is local whereas you wanted to write to an NFS file system? My suggestion would be not to do this anymore, at least by default. We had several people report problems with this (hangs, etc.) and I sometimes see these "WARNING: could not create temporary ..." messages appear too.

It is also easy to write files into a local directory instead of the NFS system (I create output/ as a symlink to a local directory). If files are big enough to be a problem, one would want to use MPI IO (so use grouping>0) instead anyways.

Thoughts?
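
A minimal sketch of that symlink setup, assuming the cluster has a fast scratch or node-local file system (the /scratch path is only an example):

  # from the directory containing the .prm file, before starting the job:
  mkdir -p /scratch/$USER/aspect-run      # some file system suitable for large output
  ln -s /scratch/$USER/aspect-run output  # ASPECT then writes into output/ as usual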

@spco
Contributor Author

spco commented Feb 10, 2016

I think I've done as you asked - I've no experience of gdb or MPI debugging! I ssh-ed into each node that's running the job, and then ran
ps ax | grep aspect
followed by
gdb -p <firstPID>
backtrace
detach
attach <nextPID>
and so on for each PID.

Every time I attach, I get a lot of warnings that debug information is not found for lots of libraries.

Results are attached.
debug_output.txt

The error file is empty, and the output file holds:
-----------------------------------------------------------------------------
-- This is ASPECT, the Advanced Solver for Problems in Earth's ConvecTion.
-- . version 1.4.0-pre
-- . running in DEBUG mode
-- . running with 17 MPI processes
-- . using Trilinos
-----------------------------------------------------------------------------

Number of active cells: 256 (on 5 levels)
Number of degrees of freedom: 3,556 (2,178+289+1,089)

*** Timestep 0: t=0 seconds
Solving temperature system... 0 iterations.
Rebuilding Stokes preconditioner...
Solving Stokes system... 27 iterations.

Postprocessing:
RMS, max velocity: 1.79 m/s, 2.53 m/s
Temperature min/avg/max: 0 K, 0.5 K, 1 K
Heat fluxes through boundary parts: 4.724e-06 W, -4.724e-06 W, -1 W, 1 W
Writing graphical output: output/solution-00000

*** Timestep 1: t=0.0123322 seconds
Solving temperature system...

I'm not sure this is too helpful, as I can't see any process that isn't stuck in the preconditioner stage - please advise if I'm doing something wrong! I will also try again and see if I can get it to hang before it prints Timestep 1.

@spco
Contributor Author

spco commented Feb 10, 2016

I had a spare moment - running it again, it hangs at

*** Timestep 4: t=0.0194467 seconds
Solving temperature system... 16 iterations.
Solving Stokes system... 25 iterations.

Postprocessing:
RMS, max velocity: 21.5 m/s, 30.4 m/s
Temperature min/avg/max: 0 K, 0.5 K, 1 K
Heat fluxes through boundary parts: 6.103e-05 W, -5.949e-05 W, -1.087 W, 1.087 W

*** Timestep 5: t=0.0204739 seconds
Solving temperature system... 14 iterations.
Solving Stokes system... 24 iterations.

Postprocessing:
RMS, max velocity: 26.6 m/s, 37.8 m/s
Temperature min/avg/max: 0 K, 0.5 K, 1 K
Heat fluxes through boundary parts: 7.658e-05 W, -7.511e-05 W, -1.132 W, 1.132 W
Writing graphical output: output/solution-00002

and output is here:
debug_output2.txt

@gassmoeller
Member

I agree that we should make the default behaviour more resistant against system specific problems. Maybe we can make "Write in background" and "Temporary file location" input parameters, and the default behaviour is to write to the final destination directly and without using an additional thread? Then we could also dump all these fallback options in https://github.com/geodynamics/aspect/blob/master/source/postprocess/visualization.cc#L534 and simply fail with a useful error message in case something does not work.
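
To illustrate, a sketch of how a user could explicitly opt back in to the current behaviour once such parameters exist; both names are just the ones proposed above, not existing parameters, and the defaults would be to write directly to the final destination without an extra thread:

  subsection Postprocess
    subsection Visualization
      # hypothetical parameters following the proposal above
      set Write in background = true
      set Temporary file location = /tmp
    end
  end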

@bangerth
Contributor

I'll admit that I'm confused. All processes seem to hang in Epetra_BlockMap::SameAs but that makes no sense. There must be one process that is stuck somewhere else.

Does the problem reproduce if you run only two processes, but have them run on different machines? (Most schedulers allow you to specify that you want to run only one process per node.) In other words, is the problem that you're running on multiple nodes, or that you run on more than 16 cores?
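
As an illustration, two ways to do that; the exact syntax depends on the scheduler and MPI stack (SLURM and Intel MPI are shown here purely as examples, and the .prm file name is a placeholder):

  # SLURM: two ranks forced onto two different nodes
  srun --nodes=2 --ntasks-per-node=1 ./aspect model.prm

  # Intel MPI (impi), inside a two-node job allocation: one process per node, two in total
  mpirun -ppn 1 -n 2 ./aspect model.prm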

@bangerth
Contributor

On 02/10/2016 10:50 AM, Timo Heister wrote:

> @bangerth you implemented the "write to tmp and mv" scheme for graphical output, right? Did you have evidence that this improved performance? I assume it was faster for you because /tmp is local whereas you wanted to write to an NFS file system?

Yes. I don't think I have the data any more, but it turned out that having 1000 processes write into the same directory of some NFS file server really brought down the system. I think this was back on the brazos cluster.

> It is also easy to write files into a local directory instead of the NFS system (I create output/ as a symlink to a local directory). If files are big enough to be a problem, one would want to use MPI IO (so use grouping>0) instead anyways.

How do you find a local directory? Or do you want to suggest that users set things up that way themselves?

@bangerth
Contributor

On 02/10/2016 12:27 PM, Rene Gassmöller wrote:

> I agree that we should make the default behaviour more resistant against system specific problems. Maybe we can make "Write in background" and "Temporary file location" input parameters, and the default behaviour is to write to the final destination directly and without using an additional thread? Then we could also dump all these fallback options in https://github.com/geodynamics/aspect/blob/master/source/postprocess/visualization.cc#L534 and simply fail with a useful error message in case something does not work.

Yes, that would certainly be feasible without too much trouble.

@gassmoeller
Member

Do you want to investigate the problem further and create the PR yourself, or should I go ahead and just create the input parameters?

@tjhei
Member

tjhei commented Feb 11, 2016

> How do you find a local directory? Or do you want to suggest that users set things up that way themselves?

The latter, of course. On every large machine I have been on there are guidelines regarding storage, and using the NFS for large IO is never advertised, for good reasons. :-) Anyway, if you have 1000+ processes you had better know where you are writing and whether you can use MPI IO.

@bangerth
Contributor

@gassmoeller -- if you have time, please go ahead. I won't get to it within the next few days for sure :-(

@gassmoeller
Member

Fixed by #752. @spco, if you agree please close this issue. I just do not want to steal your issue 😉

spco closed this as completed Feb 20, 2016