Visualization plugin hangs on some multi-node setups #749

Closed
spco opened this issue Feb 10, 2016 · 12 comments

@spco
Contributor

spco commented Feb 10, 2016

Running on multiple nodes of a cluster, the program can hang in the visualization plugin; the hang is only ended when the job hits the cluster's walltime limit.

On my cluster:

  • Disabling the visualization plugin allows the program to run.
  • Alternatively, using 'set Number of grouped files = 1' in the Visualization subsection lets the run continue and produce output (see the parameter snippet after this list).
  • I couldn't get changing TMP or TMPDIR to have any effect.
  • Running on up to 16 cores on one node works fine.
  • Running step-32 across multiple nodes works fine, with visualization output.
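
For reference, a minimal sketch of that workaround as it would look in the parameter file, assuming the usual Postprocess / Visualization nesting of an ASPECT .prm file:

  subsection Postprocess
    subsection Visualization
      # group output from all ranks into a single file written with MPI I/O,
      # instead of one file per process
      set Number of grouped files = 1
    end
  end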

Typical error output, for what it's worth:

anode120:UCM:746e:cb233700: 2081437717 us(2081437717 us!!!): dapl async_event CQ (0x10dfdf0) ERR 0
anode120:UCM:746e:cb233700: 2081437754 us(37 us): -- dapl_evd_cq_async_error_callback (0x107e170, 0x1080e90, 0x2aaacb232c90, 0x10dfdf0)
anode120:UCM:746e:cb233700: 2081437784 us(30 us): dapl async_event QP (0x10fa3b0) Event 1
anode120:UCM:746c:cb233700: -287395973 us(-287395973 us): dapl async_event CQ (0x10dfdf0) ERR 0
anode120:UCM:746c:cb233700: -287395932 us(41 us): -- dapl_evd_cq_async_error_callback (0x107e170, 0x1080e90, 0x2aaacb232c90, 0x10dfdf0)
anode120:UCM:746c:cb233700: -287395902 us(30 us): dapl async_event QP (0x10fa3b0) Event 1
anode120:UCM:7470:cb233700: -279412729 us(-279412729 us): dapl async_event CQ (0x10dfdf0) ERR 0
anode120:UCM:7470:cb233700: -279412691 us(38 us): -- dapl_evd_cq_async_error_callback (0x107e170, 0x1080e90, 0x2aaacb232c90, 0x10dfdf0)
anode120:UCM:7470:cb233700: -279412605 us(86 us): dapl async_event QP (0x10fa3b0) Event 1

Using Intel MPI (impi) 4.1.3.048.

See the discussion on the aspect-devel mailing list from early August 2015 for more details.

@bangerth
Contributor

I'd love to see a backtrace to see where it hangs.

Do you know how to do that? The idea is that you wait until it hangs, then log onto one of the nodes on which the job runs, start the debugger, and attach it to one of the processes running the program. In essence, it's like running the program in a debugger, except that you attach the debugger to an already running program. Once you're attached, you can call backtrace to see where it hangs.

The location where it hangs may be different for different processes.
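
A minimal sketch of that procedure, assuming gdb is available on the compute nodes and the binary is called aspect (the node name and PID below are placeholders):

  ssh anode120                  # one of the compute nodes the hung job runs on
  pgrep aspect                  # list the PIDs of the ASPECT ranks on this node

  # attach to one rank, print a backtrace for all of its threads, then detach;
  # repeat for each PID (and each node)
  gdb -p <PID> -batch -ex "thread apply all bt"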

@tjhei
Member

tjhei commented Feb 10, 2016

@bangerth you implemented the "write to tmp and mv" scheme for graphical output, right? Did you have evidence that this improved performance? I assume it was faster for you because /tmp is local whereas you wanted to write to an NFS file system? My suggestion would be not to do this anymore, at least by default. We had several people report problems with this (hangs, etc.) and I sometimes see these "WARNING: could not create temporary ..." messages appear too.

It is also easy to write files into a local directory instead of the NFS system (I create output/ as a symlink to a local directory). If files are big enough to be a problem, one would want to use MPI IO (so use grouping>0) instead anyways.

Thoughts?
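
A minimal sketch of that symlink setup, assuming the cluster has a fast scratch or node-local file system (the /scratch path is only an example):

  # from the directory containing the .prm file, before starting the job:
  mkdir -p /scratch/$USER/aspect-run      # some file system suitable for large output
  ln -s /scratch/$USER/aspect-run output  # ASPECT then writes into output/ as usual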

@spco
Contributor Author

spco commented Feb 10, 2016

I think I've done as you asked - I've no experience of gdb or MPI debugging! I ssh-ed into each node that's running the job, and then ran
ps ax | grep aspect
followed by
gdb -p <firstPID>
backtrace
detach
attach <nextPID>
and so on for each PID.

Every time I attach, I get a lot of warnings that debug information is not found for lots of libraries.

Results are attached.
debug_output.txt

The error file is empty, and the output file holds:
-----------------------------------------------------------------------------
-- This is ASPECT, the Advanced Solver for Problems in Earth's ConvecTion.
-- . version 1.4.0-pre
-- . running in DEBUG mode
-- . running with 17 MPI processes
-- . using Trilinos
-----------------------------------------------------------------------------

Number of active cells: 256 (on 5 levels)
Number of degrees of freedom: 3,556 (2,178+289+1,089)

*** Timestep 0: t=0 seconds
Solving temperature system... 0 iterations.
Rebuilding Stokes preconditioner...
Solving Stokes system... 27 iterations.

Postprocessing:
RMS, max velocity: 1.79 m/s, 2.53 m/s
Temperature min/avg/max: 0 K, 0.5 K, 1 K
Heat fluxes through boundary parts: 4.724e-06 W, -4.724e-06 W, -1 W, 1 W
Writing graphical output: output/solution-00000

*** Timestep 1: t=0.0123322 seconds
Solving temperature system...

I'm not sure this is too helpful, as I can't see any process that isn't stuck in the preconditioner stage - please advise if I'm doing something wrong! I will also try again and see if I can get it to hang before it prints Timestep 1.

@spco
Contributor Author

spco commented Feb 10, 2016

I had a spare moment - running it again, it hangs at

*** Timestep 4: t=0.0194467 seconds
Solving temperature system... 16 iterations.
Solving Stokes system... 25 iterations.

Postprocessing:
RMS, max velocity: 21.5 m/s, 30.4 m/s
Temperature min/avg/max: 0 K, 0.5 K, 1 K
Heat fluxes through boundary parts: 6.103e-05 W, -5.949e-05 W, -1.087 W, 1.087 W

*** Timestep 5: t=0.0204739 seconds
Solving temperature system... 14 iterations.
Solving Stokes system... 24 iterations.

Postprocessing:
RMS, max velocity: 26.6 m/s, 37.8 m/s
Temperature min/avg/max: 0 K, 0.5 K, 1 K
Heat fluxes through boundary parts: 7.658e-05 W, -7.511e-05 W, -1.132 W, 1.132 W
Writing graphical output: output/solution-00002

and output is here:
debug_output2.txt

@gassmoeller
Member

I agree that we should make the default behaviour more resistant against system specific problems. Maybe we can make "Write in background" and "Temporary file location" input parameters, and the default behaviour is to write to the final destination directly and without using an additional thread? Then we could also dump all these fallback options in https://github.com/geodynamics/aspect/blob/master/source/postprocess/visualization.cc#L534 and simply fail with a useful error message in case something does not work.
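
To illustrate, a sketch of how a user could explicitly opt back in to the current behaviour once such parameters exist; both names are just the ones proposed above, not existing parameters, and the defaults would be to write directly to the final destination without an extra thread:

  subsection Postprocess
    subsection Visualization
      # hypothetical parameters following the proposal above
      set Write in background = true
      set Temporary file location = /tmp
    end
  end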

@bangerth
Contributor

I'll admit that I'm confused. All processes seem to hang in Epetra_BlockMap::SameAs but that makes no sense. There must be one process that is stuck somewhere else.

Does the problem reproduce if you run only two processes, but have them run on different machines? (Most schedulers allow you to specify that you want to run only one process per node.) In other words, is the problem that you're running on multiple nodes, or that you run on more than 16 cores?
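
As an illustration, two ways to do that; the exact syntax depends on the scheduler and MPI stack (SLURM and Intel MPI are shown here purely as examples, and the .prm file name is a placeholder):

  # SLURM: two ranks forced onto two different nodes
  srun --nodes=2 --ntasks-per-node=1 ./aspect model.prm

  # Intel MPI (impi), inside a two-node job allocation: one process per node, two in total
  mpirun -ppn 1 -n 2 ./aspect model.prm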

@bangerth
Contributor

On 02/10/2016 10:50 AM, Timo Heister wrote:

> @bangerth you implemented the "write to tmp and mv" scheme for graphical output, right? Did you have evidence that this improved performance? I assume it was faster for you because /tmp is local whereas you wanted to write to an NFS file system?

Yes. I don't think I have the data any more, but it turned out that having 1000 processes write into the same directory of some NFS file server really brought down the system. I think this was back on the brazos cluster.

> It is also easy to write files into a local directory instead of the NFS system (I create output/ as a symlink to a local directory). If files are big enough to be a problem, one would want to use MPI IO (so use grouping>0) instead anyways.

How do you find a local directory? Or do you want to suggest that users set things up that way themselves?

@bangerth
Contributor

On 02/10/2016 12:27 PM, Rene Gassmöller wrote:

> I agree that we should make the default behaviour more resistant against system specific problems. Maybe we can make "Write in background" and "Temporary file location" input parameters, and the default behaviour is to write to the final destination directly and without using an additional thread? Then we could also dump all these fallback options in https://github.com/geodynamics/aspect/blob/master/source/postprocess/visualization.cc#L534 and simply fail with a useful error message in case something does not work.

Yes, that would certainly be feasible without too much trouble.

@gassmoeller
Member

Do you want to investigate the problem further and create the PR yourself, or should I go ahead and just create the input parameters?

@tjhei
Member

tjhei commented Feb 11, 2016

> How do you find a local directory? Or do you want to suggest that users set things up that way themselves?

The latter, of course. On every large machine I have been on there are guidelines regarding storage, and using the NFS for large IO is never advertised, for good reasons. :-) Anyway, if you have 1000+ processes you had better know where you are writing and whether you can use MPI IO.

@bangerth
Contributor

@gassmoeller -- if you have time, please go ahead. I won't get to it within the next few days for sure :-(

@gassmoeller
Member

Fixed by #752. @spco, if you agree please close this issue. I just do not want to steal your issue 😉

spco closed this as completed Feb 20, 2016