
3D Spherical Errors #918

Closed
GEuen opened this issue Jun 21, 2016 · 7 comments

Comments

@GEuen
Contributor

GEuen commented Jun 21, 2016

Convergence.txt
Segmentation.txt
ToRow.txt

Hello all. Dr. Scott King and I have been trying to run some simple 3D spherical models to get a sense of runtime on our cluster, but many cases end with one of three errors. An example of each is attached. The 'ToRow' error especially is strange. Shangxin Liu has more information on the tests that have been run and their errors, and will have this information at the Hackathon. Any help or insight is greatly appreciated.

@tjhei
Member

tjhei commented Jun 22, 2016

I think this is the bug Scott reported a while ago on the mailing list. I am tracking it here: dealii/dealii#2613

@tjhei
Member

tjhei commented Jun 25, 2016

This is a simplified test that triggers the bug for me:
A1_ref2T.prm.txt
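
(For anyone who cannot open the attachment: a minimal 3D spherical-shell parameter file of this kind might look roughly like the sketch below. The values are illustrative placeholders, not the contents of A1_ref2T.prm.txt, and exact parameter names can differ between ASPECT versions.)

set Dimension                              = 3
set End time                               = 1e9
set Use years in output instead of seconds = true

subsection Geometry model
  set Model name = spherical shell
  subsection Spherical shell
    set Inner radius = 3481000
    set Outer radius = 6336000
  end
end

subsection Mesh refinement
  set Initial global refinement   = 2
  set Initial adaptive refinement = 0
end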

@Shangxin-Liu
Contributor

Shangxin-Liu commented Jun 25, 2016

To track Grant's errors, I'm now running the deal.II step-32 3D spherical shell example with a long end time on Hokiespeed (4 initial global refinements, 0 AMR refinements), to see whether it also crashes with similar errors. I ran three identical jobs. One ran fine to timestep 105332 before I killed it. The other two crashed with the same "Bus" error, one at timestep 59639 and the other at only timestep 65. I suspect this may be a cluster problem with Torque, so I will run it again on Blueridge or Newriver. But I remember running into this "Bus" error before, so I'm curious what the reason might be. I attach the error message here:

An error has occurred processing your job, see below.
Post job file processing error; job 84419.master.cluster on host hs050/0+hs050/1+hs050/2+hs050/3+hs050/4+hs050/5+hs050/6+hs050/7+hs050/8+hs050/9+hs050/10+hs050/11+hs059/0+hs059/1+hs059/2+hs059/3+hs059/4+hs059/5+hs059/6+hs059/7+hs059/8+hs059/9+hs059/10+hs059/11+hs118/0+hs118/1+hs118/2+hs118/3+hs118/4+hs118/5+hs118/6+hs118/7+hs118/8+hs118/9+hs118/10+hs118/11+hs048/0+hs048/1+hs048/2+hs048/3+hs048/4+hs048/5+hs048/6+hs048/7+hs048/8+hs048/9+hs048/10+hs048/11+hs023/0+hs023/1+hs023/2+hs023/3+hs023/4+hs023/5+hs023/6+hs023/7+hs023/8+hs023/9+hs023/10+hs023/11+hs077/0+hs077/1+hs077/2+hs077/3+hs077/4+hs077/5+hs077/6+hs077/7+hs077/8+hs077/9+hs077/10+hs077/11+hs129/0+hs129/1+hs129/2+hs129/3+hs129/4+hs129/5+hs129/6+hs129/7+hs129/8+hs129/9+hs129/10+hs129/11+hs034/0+hs034/1+hs034/2+hs034/3+hs034/4+hs034/5+hs034/6+hs034/7+hs034/8+hs034/9+hs034/10+hs034/11

Unable to copy file /opt/torque/torque/spool/spool/84419.master.cluster.OU to /work/hokiespeed/shangxin/dealii_step-32/dealii_step_32.sh.o84419, error 1
*** error from copy
/bin/cp: cannot stat `/opt/torque/torque/spool/spool/84419.master.cluster.OU': No such file or directory
*** end error output

Unable to copy file /opt/torque/torque/spool/spool/84419.master.cluster.ER to /work/hokiespeed/shangxin/dealii_step-32/dealii_step_32.sh.e84419, error 1
*** error from copy
/bin/cp: cannot stat `/opt/torque/torque/spool/spool/84419.master.cluster.ER': No such file or directory
*** end error output

[hs129:15576] *** Process received signal ***
[hs129:15584] *** Process received signal ***

mpirun noticed that process rank 73 with PID 15576 on node hs129 exited on signal 7 (Bus error).

Any comments? @tjhei @gassmoeller
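
(For context, a Torque/PBS submission script for a run like this might look roughly like the sketch below; the resource request, job name, and executable path are hypothetical and would need to be adapted to the cluster.)

#!/bin/bash
#PBS -N dealii_step_32
#PBS -l nodes=8:ppn=12
#PBS -l walltime=48:00:00
#PBS -j oe

# 8 nodes x 12 cores = 96 MPI ranks, consistent with rank 73 appearing in the log above.
# Run from the directory the job was submitted from.
cd $PBS_O_WORKDIR

# Launch the deal.II step-32 executable with MPI.
mpirun -np 96 ./step-32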

@bangerth
Contributor

This is typically a problem with the file system on the cluster. It may be that because the job died, all machines try to move or copy their status and log files to a central location, and the file system is simply overwhelmed and gives up. I've seen this on some machines as well, but it's a problem with the administration of the machine, not with the program you are running (here, ASPECT).

@Shangxin-Liu
Contributor

Yes. I'm running my jobs on a different cluster than Grant's, so this is indeed a cluster problem. I'm now running the same job on Blueridge, the cluster where Grant ran into his errors.

@gassmoeller
Member

@GEuen: Is this issue resolved? If so please close it. Otherwise, any progress or questions about the issue?

@Shangxin-Liu
Contributor

Hi Rene,

We haven't observed these errors since we updated our ASPECT code to the current GitHub version this year, so we think this issue can be closed now. @gassmoeller

@GEuen GEuen closed this as completed Dec 16, 2016