ASPECT hangs right after checkpoint when using >1 node #4046

Closed
dansand opened this issue May 28, 2021 · 6 comments

@dansand
Contributor

dansand commented May 28, 2021

Issue occurs on NCI "GADI": CentOS 8, Intel Cascade Lake processors, Lustre filesystem, 48 processes per node.

Right after a checkpointing step, models hang when run on more than one node.

  • The problem occurs (variably) on the first, second, or third checkpoint step (i.e. models can successfully checkpoint 1 or 2 times)
  • It seems that the checkpoint step completes (i.e. the restart.* files get written), after which 1 or 2 lines of output are written to the log file, and then the model hangs
  • Happens with and without visualization
  • Affects all input models I've tried: 2d and 3d, different material models, geometry models, etc.
  • Affects multiple versions of ASPECT from the last 6 months (haven't tried earlier)
  • Affects ASPECT built against deal.II 9.2 and 9.3-pre
  • Debug mode doesn't catch the problem (models hang in debug mode)

Example output when model hangs (log.txt):

*** Snapshot created!
*** Timestep 50:  t=637745 years, dt=20000 years
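A minimal sketch of how the "1 or 2 lines after the snapshot" symptom could be pulled out of log.txt automatically. It assumes only that the marker line reads exactly `*** Snapshot created!`, as in the output above; the function name is hypothetical and not part of ASPECT.

```python
from typing import List

# Marker string as it appears in the log.txt excerpts in this thread.
SNAPSHOT_MARKER = "*** Snapshot created!"


def lines_after_last_snapshot(log_text: str) -> List[str]:
    """Return the non-empty log lines written after the last snapshot marker.

    Returns an empty list if the log contains no snapshot marker.
    """
    lines = log_text.splitlines()
    last = None
    for i, line in enumerate(lines):
        if line.strip() == SNAPSHOT_MARKER:
            last = i  # remember the most recent marker seen so far
    if last is None:
        return []
    # Keep only lines with content, dropping trailing whitespace.
    return [ln.rstrip() for ln in lines[last + 1:] if ln.strip()]
```

Calling it on the contents of log.txt from a hung run shows how far the model got after its final checkpoint, which makes it easier to compare several hung runs side by side.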
@tjhei
Member

tjhei commented May 28, 2021

Can you start with an empty output directory and post the output of ls -al please? I want to make sure the output looks sensible.
Also, can you resume from this checkpoint? It looks like the checkpointing completed successfully.

@dansand
Contributor Author

dansand commented May 31, 2021

Thanks Timo. The computer is currently undergoing maintenance; I'll update you when it comes back online.

@dansand
Contributor Author

dansand commented Jun 2, 2021

When the model hangs, log.txt records:

*** Snapshot created!

*** Timestep 15:  t=2.58538e+08 years, dt=6.95585e+06 years
   Solving temperature system... 19 iterations.
   Rebuilding Stokes preconditioner...

Output of ls -al:

     33280 Jun  2 11:22 .
     33280 Jun  2 11:14 ..
      9304 Jun  2 11:22 depth_average.vtu
     10952 Jun  2 11:22 log.txt
      2013 Jun  2 11:14 original.prm
    851437 Jun  2 11:14 parameters.json
    612914 Jun  2 11:14 parameters.prm
   1128544 Jun  2 11:14 parameters.tex
   2119856 Jun  2 11:22 restart.mesh
 365197340 Jun  2 11:22 restart.mesh_fixed.data
 365197340 Jun  2 11:19 restart.mesh_fixed.data.old
        98 Jun  2 11:22 restart.mesh.info
        98 Jun  2 11:19 restart.mesh.info.old
   2119856 Jun  2 11:19 restart.mesh.old
      8481 Jun  2 11:22 restart.resume.z
      5964 Jun  2 11:19 restart.resume.z.old
     41472 Jun  2 11:21 solution
      1608 Jun  2 11:21 solution.pvd
       765 Jun  2 11:21 solution.visit
      4823 Jun  2 11:22 statistics

The model will restart from checkpoint, but again hangs right after a checkpoint step:

*** Snapshot created!

*** Timestep 45:  t=3.17522e+08 years, dt=721807 years
   Solving temperature system... 19 iterations.
   Rebuilding Stokes preconditioner...
   Solving Stokes system...

@tjhei
Member

tjhei commented Jun 2, 2021

Indeed, it looks like the checkpointing completes correctly. Are you saying this problem only occurs when checkpointing is enabled, though? Tricky problem. I have to admit that I have no idea what the problem could be. Maybe the file system is doing something unexpected?

The easiest next step would probably be to run in debug mode, wait until the model hangs, then ssh into one of the compute nodes and print the call stack of a hung process. You can do this with something like the following (find the process ids using top):

gdb
attach <process id>
bt

You might need more than one backtrace. Is that something you can do?
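Since more than one backtrace may be needed, the interactive gdb session above could be scripted per node. This is a sketch under assumptions: the executable is assumed to be named `aspect`, the helper names are hypothetical, and attaching with gdb may be restricted by the system's ptrace settings. It uses gdb's `-p`, `-batch`, and `-ex` options and `pgrep -x`.

```python
import subprocess
from typing import List


def gdb_backtrace_command(pid: int) -> List[str]:
    """Build a non-interactive gdb invocation that prints all thread backtraces."""
    return [
        "gdb",
        "-p", str(pid),                # attach to the running process
        "-batch",                      # run the -ex commands, then exit
        "-ex", "thread apply all bt",  # backtrace of every thread
    ]


def find_pids(process_name: str) -> List[int]:
    """Return PIDs on this node whose process name matches exactly (via pgrep)."""
    result = subprocess.run(
        ["pgrep", "-x", process_name],
        capture_output=True, text=True, check=False,
    )
    return [int(tok) for tok in result.stdout.split()]


def collect_backtraces(process_name: str = "aspect") -> None:
    """Write one backtrace file per matching process (name is an assumption)."""
    for pid in find_pids(process_name):
        out = subprocess.run(
            gdb_backtrace_command(pid),
            capture_output=True, text=True, check=False,
        )
        with open(f"backtrace.{pid}.txt", "w") as f:
            f.write(out.stdout)
            f.write(out.stderr)
```

Running `collect_backtraces()` on each compute node while the job is hung would leave one `backtrace.<pid>.txt` file per rank, so the stacks can be compared to see which ranks are waiting in which call.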

@dansand
Contributor Author

dansand commented Jun 15, 2021

Switching from Open MPI 4.0.1 to 4.1.0 has fixed this problem. Thanks for the assistance.
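For anyone wanting to check which Open MPI version a job environment picks up, here is a small sketch. It assumes the first line of `mpirun --version` has the form `mpirun (Open MPI) X.Y.Z`; the function names are hypothetical. Only 4.0.1 (hanging) and 4.1.0 (working) were tested in this thread, so the comparison is a hint, not a statement about every older release.

```python
import re
from typing import Optional, Tuple

# Matches e.g. "mpirun (Open MPI) 4.0.1"
_VERSION_RE = re.compile(r"\(Open MPI\)\s+(\d+)\.(\d+)\.(\d+)")


def parse_open_mpi_version(text: str) -> Optional[Tuple[int, ...]]:
    """Extract (major, minor, patch) from `mpirun --version` output, or None."""
    match = _VERSION_RE.search(text)
    if match is None:
        return None  # not Open MPI, or an unexpected output format
    return tuple(int(group) for group in match.groups())


def older_than_working_version(
    version: Tuple[int, ...],
    working: Tuple[int, ...] = (4, 1, 0),  # version reported to work in this thread
) -> bool:
    """True if `version` sorts before the version that worked for the reporter."""
    return version < working
```

The input text could come from `subprocess.run(["mpirun", "--version"], capture_output=True, text=True).stdout` inside the job script, so that the version in use is recorded alongside the model output.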

@dansand dansand closed this as completed Jun 15, 2021
@tjhei
Member

tjhei commented Jun 15, 2021

Thanks for letting us know! 👍
