ASPECT hangs right after checkpoint when using >1 node #4046
Comments
Can you start with an empty output directory and post the output of
Thanks Timo. The computer is currently undergoing maintenance; I'll update you when it comes back online.
When the model hangs, log.txt records:
Output of
The model will restart from checkpoint, but again hangs right after a checkpoint step:
Indeed, it looks like the checkpointing completes correctly. Are you saying this problem only occurs when checkpointing is enabled, though? Tricky problem. I have to admit that I have no idea what the problem could be. Maybe the file system is doing something unexpected? The easiest next step would probably be to run in debug mode, wait until the model hangs, then ssh into one of the compute nodes and print the call stack of a hung process. You can do this with something like (find the process ids using top):
You might need more than one backtrace. Is that something you can do?
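As a rough sketch of how several backtraces could be gathered in one go on a compute node: the helper below is illustrative only and not part of ASPECT. It assumes `gdb` and `pgrep` are installed on the node, that you are allowed to attach to your own processes, and that the executable is named `aspect` (adjust the name to match your build). All function names are hypothetical.

```python
import shutil
import subprocess


def build_gdb_command(pid):
    """Return the argv list asking gdb for backtraces of all threads of `pid`."""
    # -p attaches to a running process, -batch exits after the commands run,
    # -ex executes a single gdb command.
    return ["gdb", "-p", str(int(pid)), "-batch", "-ex", "thread apply all bt"]


def find_pids(process_name="aspect"):
    """Return PIDs whose process name matches `process_name` exactly (pgrep -x).

    The default name "aspect" is an assumption about how the binary is called.
    """
    result = subprocess.run(
        ["pgrep", "-x", process_name], capture_output=True, text=True
    )
    return [int(token) for token in result.stdout.split()]


def collect_backtraces(pids):
    """Run gdb once per PID and return a dict mapping PID to gdb's text output."""
    traces = {}
    for pid in pids:
        result = subprocess.run(
            build_gdb_command(pid), capture_output=True, text=True
        )
        traces[pid] = result.stdout + result.stderr
    return traces


if __name__ == "__main__":
    if shutil.which("gdb") is None or shutil.which("pgrep") is None:
        print("gdb and pgrep are both required on this node")
    else:
        for pid, trace in collect_backtraces(find_pids()).items():
            print(f"===== PID {pid} =====")
            print(trace)
```

Running this on two or three different nodes while the model is hung gives backtraces from several ranks, which makes it easier to see whether they are all waiting in the same call.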
Switching from Open MPI 4.0.1 to 4.1.0 has fixed this problem. Thanks for the assistance.
Thanks for letting us know! 👍
The issue occurs on NCI "Gadi": CentOS 8, Intel Cascade Lake processors, Lustre filesystem, 48 processes per node.
Right after a checkpointing step, models hang when run on more than one node.
Example output when model hangs (log.txt):