3D Spherical Errors #918
Comments
I think this is the bug Scott reported a while ago on the mailing list. I am tracking it here: dealii/dealii#2613
This is a simplified test that triggers the bug for me:
To track down Grant's errors, I'm now running the dealii step-32 3D spherical shell example with a long end time on Hokiespeed (4 initial global refinements, 0 AMR refinements), to see whether it also crashes with similar errors. I ran three identical jobs. One ran fine up to timestep 105332, at which point I killed it. The other two crashed with the same "Bus" error, one at timestep 59639 and the other as early as timestep 65. I strongly suspect this is a cluster problem with Torque, and I will run it again on Blueridge or Newriver. But I remember running into this "Bus" error before, so I'm curious what the reason might be. I attach the error message here:
An error has occurred processing your job, see below.
Unable to copy file /opt/torque/torque/spool/spool/84419.master.cluster.OU to /work/hokiespeed/shangxin/dealii_step-32/dealii_step_32.sh.o84419, error 1
Unable to copy file /opt/torque/torque/spool/spool/84419.master.cluster.ER to /work/hokiespeed/shangxin/dealii_step-32/dealii_step_32.sh.e84419, error 1
[hs129:15576] *** Process received signal ***
mpirun noticed that process rank 73 with PID 15576 on node hs129 exited on signal 7 (Bus error).
Any comments? @tjhei @gassmoeller
This is typically a problem with the file system on the cluster. It may be that, because the job died, all machines try to move or copy their status and log files to a central location, and the file system is simply overwhelmed and gives up. I've seen this on some machines as well, but it's a problem with the administration of the machine, not with the program you are running (here, ASPECT).
Yes. I'm running the jobs on a different cluster than Grant's, so this is indeed a cluster problem. I'm now running the same job on Blueridge, the cluster where Grant encountered his errors.
@GEuen: Is this issue resolved? If so, please close it. Otherwise, any progress or questions about the issue?
Hi Rene, we haven't observed these errors since we updated our ASPECT code to the new GitHub version this year, so we think this issue can be closed now. @gassmoeller
Convergence.txt
Segmentation.txt
ToRow.txt
Hello all. Dr. Scott King and I have been trying to run some simple 3D spherical models to get a sense of runtime on our cluster, but many cases end with one of three errors. An example of each is attached. The 'ToRow' error especially is strange. Shangxin Liu has more information on the tests that have been run and their errors, and will have this information at the Hackathon. Any help or insight is greatly appreciated.