
3D Spherical Errors #918

Closed
GEuen opened this issue Jun 21, 2016 · 7 comments

Comments

@GEuen
Contributor

GEuen commented Jun 21, 2016

Convergence.txt
Segmentation.txt
ToRow.txt

Hello all. Dr. Scott King and I have been trying to run some simple 3D spherical models to get a sense of runtime on our cluster, but many cases end with one of three errors. An example of each is attached. The 'ToRow' error especially is strange. Shangxin Liu has more information on the tests that have been run and their errors, and will have this information at the Hackathon. Any help or insight is greatly appreciated.

@tjhei
Member

tjhei commented Jun 22, 2016

I think this is the bug Scott reported a while ago on the mailing list. I am tracking it here: dealii/dealii#2613

@tjhei
Member

tjhei commented Jun 25, 2016

This is a simplified test that triggers the bug for me:
A1_ref2T.prm.txt
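
(For anyone who cannot open the attachment: a minimal 3D spherical-shell parameter file of this kind might look roughly like the sketch below. The values are illustrative placeholders, not the contents of A1_ref2T.prm.txt, and exact parameter names can differ between ASPECT versions.)

set Dimension                              = 3
set End time                               = 1e9
set Use years in output instead of seconds = true

subsection Geometry model
  set Model name = spherical shell
  subsection Spherical shell
    set Inner radius = 3481000
    set Outer radius = 6336000
  end
end

subsection Mesh refinement
  set Initial global refinement   = 2
  set Initial adaptive refinement = 0
end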

@Shangxin-Liu
Contributor

Shangxin-Liu commented Jun 25, 2016

To track Grant's errors, I'm now running the deal.II step-32 3D spherical shell example with a long end time on Hokiespeed (4 initial global refinements, 0 AMR refinements), to see whether it also crashes with similar errors. I ran three identical jobs. One ran fine to timestep 105332 before I killed it. The other two crashed with the same "Bus" error, one at timestep 59639 and the other at only timestep 65. I suspect this may be a cluster problem with Torque, so I will run it again on Blueridge or Newriver. But I remember running into this "Bus" error before, so I'm curious what the reason might be. I attach the error message here:

An error has occurred processing your job, see below.
Post job file processing error; job 84419.master.cluster on host hs050/0+hs050/1+hs050/2+hs050/3+hs050/4+hs050/5+hs050/6+hs050/7+hs050/8+hs050/9+hs050/10+hs050/11+hs059/0+hs059/1+hs059/2+hs059/3+hs059/4+hs059/5+hs059/6+hs059/7+hs059/8+hs059/9+hs059/10+hs059/11+hs118/0+hs118/1+hs118/2+hs118/3+hs118/4+hs118/5+hs118/6+hs118/7+hs118/8+hs118/9+hs118/10+hs118/11+hs048/0+hs048/1+hs048/2+hs048/3+hs048/4+hs048/5+hs048/6+hs048/7+hs048/8+hs048/9+hs048/10+hs048/11+hs023/0+hs023/1+hs023/2+hs023/3+hs023/4+hs023/5+hs023/6+hs023/7+hs023/8+hs023/9+hs023/10+hs023/11+hs077/0+hs077/1+hs077/2+hs077/3+hs077/4+hs077/5+hs077/6+hs077/7+hs077/8+hs077/9+hs077/10+hs077/11+hs129/0+hs129/1+hs129/2+hs129/3+hs129/4+hs129/5+hs129/6+hs129/7+hs129/8+hs129/9+hs129/10+hs129/11+hs034/0+hs034/1+hs034/2+hs034/3+hs034/4+hs034/5+hs034/6+hs034/7+hs034/8+hs034/9+hs034/10+hs034/11

Unable to copy file /opt/torque/torque/spool/spool/84419.master.cluster.OU to /work/hokiespeed/shangxin/dealii_step-32/dealii_step_32.sh.o84419, error 1
*** error from copy
/bin/cp: cannot stat `/opt/torque/torque/spool/spool/84419.master.cluster.OU': No such file or directory
*** end error output

Unable to copy file /opt/torque/torque/spool/spool/84419.master.cluster.ER to /work/hokiespeed/shangxin/dealii_step-32/dealii_step_32.sh.e84419, error 1
*** error from copy
/bin/cp: cannot stat `/opt/torque/torque/spool/spool/84419.master.cluster.ER': No such file or directory
*** end error output

[hs129:15576] *** Process received signal ***
[hs129:15584] *** Process received signal ***

mpirun noticed that process rank 73 with PID 15576 on node hs129 exited on signal 7 (Bus error).

Any comments? @tjhei @gassmoeller
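
(For context, a Torque/PBS submission script for a run like this might look roughly like the sketch below; the resource request, job name, and executable path are hypothetical and would need to be adapted to the cluster.)

#!/bin/bash
#PBS -N dealii_step_32
#PBS -l nodes=8:ppn=12
#PBS -l walltime=48:00:00
#PBS -j oe

# 8 nodes x 12 cores = 96 MPI ranks, consistent with rank 73 appearing in the log above.
# Run from the directory the job was submitted from.
cd $PBS_O_WORKDIR

# Launch the deal.II step-32 executable with MPI.
mpirun -np 96 ./step-32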

@bangerth
Contributor

This is typically a problem with the file system on the cluster. It may be that because the job died, all machines try to move or copy their status and log files to a central location, and the file system is simply overwhelmed and gives up. I've seen this on some machines as well, but it's a problem with the administration of the machine, not with the program you are running (here, ASPECT).

@Shangxin-Liu
Contributor

Yes. I'm running my jobs on a different cluster than Grant's, so this is indeed a cluster problem. I'm now running the same job on Blueridge, the cluster where Grant ran into his errors.

@gassmoeller
Member

@GEuen: Is this issue resolved? If so please close it. Otherwise, any progress or questions about the issue?

@Shangxin-Liu
Contributor

Hi Rene,

We haven't observed these errors since we updated our ASPECT code to the current GitHub version this year, so we think this issue can be closed now. @gassmoeller

@GEuen GEuen closed this as completed Dec 16, 2016