Heisenbug #59

Closed
kwschultz opened this issue Feb 18, 2015 · 25 comments

Comments

@kwschultz (Contributor)

When running VQ on multiple processors with a model more complex than a few faults and a few thousand elements, the simulation gets stuck and cannot continue. The error is randomly reproducible, occurring at different points in multiple runs of the same simulation.

We must fix this bug before we can reliably run simulations on multiple processors. Right now we are effectively dead in the water with respect to a full CA simulation or even including aftershocks on smaller simulations.

Eric's summary of the bug when it occurred on a 3-processor run of a 6-fault subset from the UCERF2 California model:
"So, here's my Sherlock Holmes take:
- From the backtrace, we know processes 0 and 1 are stuck in distributeUpdateField(), while process 2 is in MPI_Recv() in processBlocksSecondaryFailures().
- Since the processes are in order, this means the MPI_Recv that is stuck must correspond to the solution of Ax=b being sent back from the root (process 0) to process 2.
- The only way this could happen is if the number of MPI_Send() calls from root does not match the number of MPI_Recv() calls in the other processes.
- The only way this mismatch could happen is if the total number of entries in global_id_list is not equal to the sum of the number of entries in local_id_list for each process, or if processes have different understandings of the assignment of blocks to each process.
- Since my laptop run is already at 4300 events with no problems, it seems more likely this is a bug caused by bad memory writing, such that one of these structures is being corrupted by something overwriting the existing data.
- So the question is how do we check whether this corruption is happening."
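
A minimal sketch (plain MPI, not VQ code) of the mismatch Eric describes: the root derives its send count from one structure while the worker derives its receive count from another, so if corruption makes the two disagree, the worker's extra MPI_Recv() blocks forever, which is exactly what a rank stuck in MPI_Recv() looks like in the backtrace. The counts, ranks, and tags here are made up for illustration.

```cpp
#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // Hypothetical disagreement: root believes worker 1 owns 3 blocks,
    // worker 1 believes it owns 4 (e.g. because one list was corrupted).
    const int root_view_of_worker_blocks = 3;
    const int worker_view_of_own_blocks  = 4;

    if (rank == 0) {
        double solution_piece = 1.0;
        for (int i = 0; i < root_view_of_worker_blocks; ++i)
            MPI_Send(&solution_piece, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        // Root then moves on to its next phase...
    } else if (rank == 1) {
        double solution_piece;
        for (int i = 0; i < worker_view_of_own_blocks; ++i)
            MPI_Recv(&solution_piece, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        // ...while the 4th MPI_Recv() above never completes: run with
        // "mpirun -np 2" and this rank hangs here by design.
        std::printf("never reached\n");
    }

    MPI_Finalize();
    return 0;
}
```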

@markyoder (Contributor)

I think I'm observing the same bug on several Linux machines (Ubuntu 14.04 and Mint 17). The simulation seems to run fine on a single processor (SPP) -- though the actual output data still need to be verified; for multiple processors (MPP), the Greens functions calculate successfully, then the whole thing quits. I get an error message like this:

# note: this error occurs after the Greens function calcs.
[Umbasa:04607] Signal: Segmentation fault (11)
[Umbasa:04607] Signal code: Address not mapped (1)
[Umbasa:04607] Failing at address: 0xffffffffffffffe8
[Umbasa:04607] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x36d40) [0x7f4a75654d40]
[Umbasa:04607] [ 1] ../../build/src/vq(_ZN10GreensInit4initEP12SimFramework+0x6af) [0x43c4df]
[Umbasa:04607] [ 2] ../../build/src/vq(_ZN12SimFramework4initEv+0x59e) [0x45370e]
[Umbasa:04607] [ 3] ../../build/src/vq(_ZN10Simulation4initEv+0x29) [0x467a49]
[Umbasa:04607] [ 4] ../../build/src/vq(main+0x109b) [0x42a42b]
[Umbasa:04607] [ 5] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5) [0x7f4a7563fec5]
[Umbasa:04607] [ 6] ../../build/src/vq() [0x42b732]

[Umbasa:04607] *** End of error message ***

mpirun noticed that process rank 1 with PID 4607 on node Umbasa exited on signal 11 (Segmentation fault).

Looking at "make test", it seems that the single processor tests go quite well; there is a significant failure rate for multi-processor tests (and note some sort of error at the end as well):
80% tests passed, 47 tests failed out of 238

Total Test time (real) = 69.30 sec

The following tests FAILED:
82 - run_P2_none_6000 (Failed)
84 - test_slip_P2_none_6000 (Failed)
85 - test_interevent_P2_none_6000 (Failed)
88 - run_P2_none_4000 (Failed)
90 - test_slip_P2_none_4000 (Failed)
91 - test_interevent_P2_none_4000 (Failed)
94 - run_P2_none_3000 (Failed)
96 - test_slip_P2_none_3000 (Failed)
97 - test_interevent_P2_none_3000 (Failed)
100 - run_P2_none_2000 (Failed)
102 - test_slip_P2_none_2000 (Failed)
103 - test_interevent_P2_none_2000 (Failed)
106 - run_P2_taper_6000 (Failed)
110 - run_P2_taper_4000 (Failed)
114 - run_P2_taper_3000 (Failed)
118 - run_P2_taper_2000 (Failed)
122 - run_P2_taper_renorm_6000 (Failed)
126 - run_P2_taper_renorm_4000 (Failed)
130 - run_P2_taper_renorm_3000 (Failed)
134 - run_P2_taper_renorm_2000 (Failed)
138 - run_P4_none_6000 (Failed)
140 - test_slip_P4_none_6000 (Failed)
141 - test_interevent_P4_none_6000 (Failed)
144 - run_P4_none_4000 (Failed)
146 - test_slip_P4_none_4000 (Failed)
147 - test_interevent_P4_none_4000 (Failed)
150 - run_P4_none_3000 (Failed)
152 - test_slip_P4_none_3000 (Failed)
153 - test_interevent_P4_none_3000 (Failed)
156 - run_P4_none_2000 (Failed)
158 - test_slip_P4_none_2000 (Failed)
159 - test_interevent_P4_none_2000 (Failed)
162 - run_P4_taper_6000 (Failed)
166 - run_P4_taper_4000 (Failed)
170 - run_P4_taper_3000 (Failed)
174 - run_P4_taper_2000 (Failed)
178 - run_P4_taper_renorm_6000 (Failed)
182 - run_P4_taper_renorm_4000 (Failed)
186 - run_P4_taper_renorm_3000 (Failed)
190 - run_P4_taper_renorm_2000 (Failed)
222 - check_sum_P1_green_3000 (Failed)
228 - run_gen_P2_green_3000 (Failed)
229 - check_sum_P2_green_3000 (Failed)
230 - run_full_P2_green_3000 (Failed)
235 - run_gen_P4_green_3000 (Failed)
236 - check_sum_P4_green_3000 (Failed)
237 - run_full_P4_green_3000 (Failed)
Errors while running CTest
make: *** [test] Error 8

@markyoder (Contributor)

... and the "failing at" address:
Failing at address: 0xffffffffffffffe8

appears to be consistent across at least two runs (seemingly, at the very end of the address range),
AND:
the exception appears to occur after the Greens functions are calculated, but before they are written to file (in the event that the run is configured to save them); at least, when I did an MPP run to pre-calc the Greens functions for L=3000 m, the Greens functions finished, the simulation failed (like the message above), and the Greens-function HDF5 file was never created.

@markyoder (Contributor)

... and then, if I run vq in MPP mode using the pre-calculated Greens functions, I get the same error (and note, the same failing address: Failing at address: 0xffffffffffffffe8):

*******************************

*** Virtual Quake ***

*** Version 1.2.0 ***

*** Git revision ID 5289436 ***

*** QuakeLib 1.2.0 Git revision 5289436 ***

*** MPI process count : 2 ***

*** OpenMP not enabled ***

*******************************

Initializing blocks.

To gracefully quit, create the file quit_vq in the run directory or use a SIGINT (Control-C).

Reading Greens function data from file all_cal_greens_5000.h5....

Greens function took 3.95872 seconds.

Greens shear matrix takes 45.5098 megabytes

Greens normal matrix takes 45.5098 megabytes

[yodubuntu:15069] *** Process received signal ***

Global Greens shear matrix takes 91.0195 megabytes.

Global Greens normal matrix takes 91.0195 megabytes.

[yodubuntu:15069] Signal: Segmentation fault (11)
[yodubuntu:15069] Signal code: Address not mapped (1)
[yodubuntu:15069] Failing at address: 0xffffffffffffffe8
[yodubuntu:15069] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x36d40) [0x7f4fed504d40]
[yodubuntu:15069] [ 1] ../../build/src/vq(_ZN10GreensInit4initEP12SimFramework+0x6af) [0x43c4df]
[yodubuntu:15069] [ 2] ../../build/src/vq(_ZN12SimFramework4initEv+0x59e) [0x45370e]
[yodubuntu:15069] [ 3] ../../build/src/vq(_ZN10Simulation4initEv+0x29) [0x467a49]
[yodubuntu:15069] [ 4] ../../build/src/vq(main+0x109b) [0x42a42b]
[yodubuntu:15069] [ 5] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5) [0x7f4fed4efec5]
[yodubuntu:15069] [ 6] ../../build/src/vq() [0x42b732]
[yodubuntu:15069] *** End of error message ***


mpirun noticed that process rank 1 with PID 15069 on node yodubuntu.physics.ucdavis.edu exited on signal 11 (Segmentation fault).

@eheien (Contributor) commented Feb 21, 2015

I'm guessing it's in the Green's function calculation then. Can you try commenting out the call to symmetrizeMatrix (misc/GreensFunctions.cpp:81), recompile, and see if it still crashes? And if it still crashes, then try removing the whole assignment loop (lines 83-89) after that?

@eheien (Contributor) commented Feb 21, 2015

FYI Kasey, this is also a common debugging technique - keep removing code until it runs, then figure out what was wrong about the code you removed.

@markyoder (Contributor)

So I tried commenting out misc/GreensFunctions.cpp:81 and 83-89, but no joy. The Greens functions calculate, but they don't write to file (or otherwise proceed to the next step in the sim).

I also see this in the error output:
vq: /home/myoder/Documents/Research/yoder/VC/vq/src/io/GreensFileOutput.cpp:31: virtual void GreensFileOutput::initDesc(const SimFramework*) const: Assertion `false' failed.

which points to this bit in GreensFileOutput.cpp:

    #ifndef HDF5_IS_PARALLEL
        if (sim->getWorldSize() > 1) {
            assertThrow(false, "# ERROR: Greens HDF5 output in parallel only allowed if using HDF5 parallel library.");
        }
    #endif

Is this a parallel vs. serial HDF5 problem?

@eheien (Contributor) commented Feb 21, 2015

So when you say they don't write to file, does it crash in the same way as before or does it just have the assertion failure?

I don't think it's the parallel vs. serial HDF5 problem, because that would manifest as a different sort of error.

@markyoder (Contributor)

The sim breaks before the write to file; always the same error. I've added some debugging code to SimFramework.cpp (amongst other places). In particular, in the initialization function void SimFramework::init(void), I added some debugging lines to the 'plugin' initialization loop:
for (it=ordered_plugins.begin(); it!=ordered_plugins.end(); ++it) { ... }

In MPI mode with 2 processors:
On the first process (in this case 6407), I get to the 5th (index=4) plugin; the initDesc() statement (the first of 3) runs; the init() statement is started, and then we get the "Signal: Segmentation fault..." bit.

... but the second process (6408) has only reached the 4th (i=3) plugin. initDesc() completes, init() has started, and then we get a segmentation fault reported by 6408. Output pasted below.

So I'm not sure yet which plugins those are.


myoder@Umbasa ~/Documents/Research/yoder/VC/vq/examples/ca_model $ mpirun -np 2 ../../build/src/vq params.d
Debug(SimFramework::SimFramework()): run mpi initilizations. 6407..
Debug(SimFramework::SimFramework()): run mpi initilizations. 6408..
Debug(SimFramework::SimFramework()): mpi initializations finished.6407..
Debug(SimFramework::SimFramework()): mpi initializations finished.6408..
Debug: Initialize SimFramework...
Debug: Initialize SimFramework...
Debug: SimFramework::init(), 'dry run'/normal initialization loop??6408..

*******************************

Debug: plugin_init 0, initDesc() pid: 6408

*** Virtual Quake ***

*** Version 1.2.0 ***

*** Git revision ID dde492c ***

Debug: plugin_init 0, init() pid: 6408

*** QuakeLib 1.2.0 Git revision dde492c ***

*** MPI process count : 2 ***

*** OpenMP not enabled ***

*******************************

Debug: SimFramework::init(), 'dry run'/normal initialization loop??6407..
Debug: plugin_init 0, initDesc() pid: 6407
Debug: plugin_init 0, init() pid: 6407
Debug: plugin_init 0, timer bit... pid: 6408
plugin cycle finished 0/6408
Debug: plugin_init 1, initDesc() pid: 6408
Debug: plugin_init 1, init() pid: 6408
Debug: plugin_init 0, timer bit... pid: 6407
plugin cycle finished 0/6407
Debug: plugin_init 1, initDesc() pid: 6407

Initializing blocks.

Debug: plugin_init 1, init() pid: 6407
Debug: plugin_init 1, timer bit... pid: 6407
plugin cycle finished 1/6407
Debug: plugin_init 2, initDesc() pid: 6407

To gracefully quit, create the file quit_vq in the run directory or use a SIGINT (Control-C).

Debug: plugin_init 2, init() pid: 6407
Debug: plugin_init 2, timer bit... pid: 6407
plugin cycle finished 2/6407
Debug: plugin_init 3, initDesc() pid: 6407
Debug: plugin_init 3, init() pid: 6407

Reading Greens function data from file greens_15000.h5.Debug: plugin_init 1, timer bit... pid: 6408

plugin cycle finished 1/6408
Debug: plugin_init 2, initDesc() pid: 6408
Debug: plugin_init 2, init() pid: 6408
Debug: plugin_init 2, timer bit... pid: 6408
plugin cycle finished 2/6408
Debug: plugin_init 3, initDesc() pid: 6408
Debug: plugin_init 3, init() pid: 6408

Greens function took 0.012284 seconds.

Greens shear matrix takes 1014 kilobytes

Greens normal matrix takes 1014 kilobytes

[Umbasa:06408] *** Process received signal ***

Global Greens shear matrix takes 1.98047 megabytes.

Global Greens normal matrix takes 1.98047 megabytes.

Debug: plugin_init 3, timer bit... pid: 6407
plugin cycle finished 3/6407
Debug: plugin_init 4, initDesc() pid: 6407
Debug: plugin_init 4, init() pid: 6407
[Umbasa:06408] Signal: Segmentation fault (11)
[Umbasa:06408] Signal code: Address not mapped (1)
[Umbasa:06408] Failing at address: 0xffffffffffffffe8
[Umbasa:06408] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x36d40) [0x7f56ab074d40]
[Umbasa:06408] [ 1] ../../build/src/vq(_ZN10GreensInit4initEP12SimFramework+0x6af) [0x43c59f]
[Umbasa:06408] [ 2] ../../build/src/vq(_ZN12SimFramework4initEv+0x606) [0x453896]
[Umbasa:06408] [ 3] ../../build/src/vq(_ZN10Simulation4initEv+0x36) [0x469196]
[Umbasa:06408] [ 4] ../../build/src/vq(main+0x109b) [0x42a4db]
[Umbasa:06408] [ 5] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5) [0x7f56ab05fec5]
[Umbasa:06408] [ 6] ../../build/src/vq() [0x42b7f2]
[Umbasa:06408] *** End of error message ***


mpirun noticed that process rank 1 with PID 6408 on node Umbasa exited on signal 11 (Segmentation fault).


@eheien (Contributor) commented Feb 22, 2015

To make sure it's not HDF5 related, can you recompile with HDF5 disabled and run again?

@kwschultz (Contributor, Author)

I'm not sure if it's the same error, but here is an error that the possibly-Iranian graduate student is getting on multiprocessor runs (this is the only output he gave me, I'll ask for full output):

vq: /home/user/Desktop/vq-master3.3/src/core/SimDataBlocks.h:47: Block& VCSimDataBlocks::getBlock(const BlockID&): Assertion `block_num<blocks.size()' failed.
[PHP-06:07000] *** Process received signal ***
[PHP-06:07000] Signal: Aborted (6)
[PHP-06:07000] Signal code: (-6)
[PHP-06:07000] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x36c30) [0x7f726bf67c30]
[PHP-06:07000] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x39) [0x7f726bf67bb9]
[PHP-06:07000] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x148) [0x7f726bf6afc8]
[PHP-06:07000] [ 3] /lib/x86_64-linux-gnu/libc.so.6(+0x2fa76) [0x7f726bf60a76]
[PHP-06:07000] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x2fb22) [0x7f726bf60b22]
[PHP-06:07000] [ 5] ./vq() [0x4292ea]
[PHP-06:07000] [ 6] ./vq(_ZN17UpdateBlockStress4initEP12SimFramework+0x51d) [0x448e5d]
[PHP-06:07000] [ 7] ./vq(_ZN12SimFramework4initEv+0x51e) [0x45379e]
[PHP-06:07000] [ 8] ./vq(_ZN10Simulation4initEv+0x29) [0x467bf9]
[PHP-06:07000] [ 9] ./vq(main+0x109b) [0x42a47b]
[PHP-06:07000] [10] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5) [0x7f726bf52ec5]
[PHP-06:07000] [11] ./vq() [0x42b782]

[PHP-06:07000] *** End of error message ***

mpirun noticed that process rank 0 with PID 7000 on node PHP-06 exited on signal 6 (Aborted).

@markyoder (Contributor)

... and the winner is (I think):
It looks like this all comes down to the sim->console() call reporting the global Greens data, in GreensInit.cpp, void GreensInit::init(SimFramework *_sim) {...}.
In the "#ifdef MPI_C_FOUND" block, at the end of all things, we get two lines like:

sim->console() << "# Global Greens shear matrix takes " << abbr_global_shear_bytes << " " << space_vals[global_shear_ind] << "." << std::endl;

and the string array space_vals[] is not declared in the scope of the child nodes. What is the right way to share scope? The sim appears to run with these lines commented out. We'll run a big sim to improve confidence.

@ericheien

From what I can tell the space_vals[] is in the scope of the child nodes since it's created at function start. You can try running with this, but I don't think it's the source of the problem. However, it might "solve" the problem by creating extra space that absorbs whatever memory overwriting is normally causing a crash.

@ericheien

The message the Iranian student is getting is really weird, because there should never be accesses past the number of blocks. My best guess is that this would be related to the same memory corruption that's caused problems on our side, just manifesting itself differently.

@markyoder (Contributor)

Sadly, Eric, with my limited understanding of MPI and the VQ architecture, I was hoping you'd have something different to say. I can still reproduce the segmentation fault using 8 processors, (non-)fix in place and all. Once the sim gets past the initialization, however, it seems to be stable. That said, the manifestation in the Mac environment may be less forgiving.

... and on another machine, where I think vq was running before, I'm getting an error for all runs (MPI and SPP modes, so maybe this is a good thing). In this case, however, I'm getting something more like:
vq: /home/myoder/Documents/Research/yoder/VC/vq/src/core/SimDataBlocks.h:47: Block& VCSimDataBlocks::getBlock(const BlockID&): Assertion `block_num<blocks.size()' failed.
[Umbasa:09437] *** Process received signal ***
[Umbasa:09437] Signal: Aborted (6)
[Umbasa:09437] Signal code: (-6)
[Umbasa:09437] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x36d40) [0x7f74124b8d40]
[Umbasa:09437] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x39) [0x7f74124b8cc9]
[Umbasa:09437] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x148) [0x7f74124bc0d8]
[Umbasa:09437] [ 3] /lib/x86_64-linux-gnu/libc.so.6(+0x2fb86) [0x7f74124b1b86]
[Umbasa:09437] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x2fc32) [0x7f74124b1c32]
[Umbasa:09437] [ 5] ../../build/src/vq(_ZN10Simulation15partitionBlocksEv+0xcc5) [0x467115]

Looking at Simulation.cpp::Simulation::partitionBlocks(), I'm not clear on the scope of the arrays {local_block_ids, global_block_ids, block_node_map}. local_block_ids is declared with "new" in Simulation::distributeBlocks(), and these arrays are declared in "core/CommPartition.h"; CommPartition is inherited by Simulation. However, is the re-declaration of local_block_ids in distributeBlocks() correct? Also, since these arrays are declared in CommPartition.h, where are they allocated?
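
On the re-declaration question, here is a hypothetical illustration (made-up class and names, not the actual VQ declarations) of why it matters: if a method writes `int *local_block_ids = new int[n];` rather than `local_block_ids = new int[n];`, the `int *` creates a local that shadows the member, so the member that other methods later read is never allocated. Whether this matches the real distributeBlocks() code is exactly the question above.

```cpp
#include <cstddef>

// Hypothetical stand-in for CommPartition/Simulation -- not VQ code.
class CommPartitionSketch {
  protected:
    int *local_block_ids = nullptr;     // member that other methods rely on

  public:
    void distributeBlocksShadowed(std::size_t n) {
        // BUG pattern: "int *" declares a NEW local variable that shadows
        // the member; the member stays nullptr for everyone else.
        int *local_block_ids = new int[n];
        for (std::size_t i = 0; i < n; ++i) local_block_ids[i] = (int)i;
        delete [] local_block_ids;      // only the local is cleaned up
    }

    void distributeBlocksCorrect(std::size_t n) {
        // Correct pattern: assign to the member (and pair new[] with delete[]).
        delete [] local_block_ids;
        local_block_ids = new int[n];
        for (std::size_t i = 0; i < n; ++i) local_block_ids[i] = (int)i;
    }

    ~CommPartitionSketch() { delete [] local_block_ids; }
};

int main() {
    CommPartitionSketch sketch;
    sketch.distributeBlocksCorrect(16);
    return 0;
}
```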

@kwschultz (Contributor, Author)

I made the change and successfully ran a 10 kyr sim on Kapalua with 4 processes (the fault model is all CA traces from VQ/examples/fault_traces/ca_traces/, actually a copy of Mark's 5 km model). This is not necessarily evidence that it works, though: before this I ran an AllCal 3 km sim (Michael's model file, but meshed with the VQ mesher) on multiple processes, and it had some successful runs, but most of them would get hung up after 4 kyr+. I'll try a few more, like a 50 kyr run with aftershocks on Kapalua and a few more with newly meshed 3 km models on my Mac, though I never got explicit memory errors like on Linux.

@kwschultz (Contributor, Author)

Full output from the graduate student in some Middle East country not to be named:

user@PHP-06:~/Desktop/NW_3_1_50000_kasey$ mpirun -np 1 ./vq ./params.prm

*******************************

*** Virtual Quake ***

*** Version 1.2.0 ***

*** QuakeLib 1.2.0 ***

*** MPI process count : 1 ***

*** OpenMP not enabled ***

*******************************

Initializing blocks.

To gracefully quit, create the file quit_vq in the run directory or use a SIGINT (Control-C).

Reading Greens function data from file Green_NW_Iran.h5.

Greens function took 0.22002 seconds.

Greens shear matrix takes 61.6003 megabytes

Greens normal matrix takes 61.6003 megabytes

vq: /home/user/Desktop/vq-master3.3/src/core/SimDataBlocks.h:47: Block& VCSimDataBlocks::getBlock(const BlockID&): Assertion `block_num<blocks.size()' failed.
[PHP-06:14746] *** Process received signal ***
[PHP-06:14746] Signal: Aborted (6)
[PHP-06:14746] Signal code: (-6)
[PHP-06:14746] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x36c30) [0x7f760c8c4c30]
[PHP-06:14746] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x39) [0x7f760c8c4bb9]
[PHP-06:14746] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x148) [0x7f760c8c7fc8]
[PHP-06:14746] [ 3] /lib/x86_64-linux-gnu/libc.so.6(+0x2fa76) [0x7f760c8bda76]
[PHP-06:14746] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x2fb22) [0x7f760c8bdb22]
[PHP-06:14746] [ 5] ./vq() [0x4292ea]
[PHP-06:14746] [ 6] ./vq(_ZN17UpdateBlockStress4initEP12SimFramework+0x51d) [0x448e5d]
[PHP-06:14746] [ 7] ./vq(_ZN12SimFramework4initEv+0x51e) [0x45379e]
[PHP-06:14746] [ 8] ./vq(_ZN10Simulation4initEv+0x29) [0x467bf9]
[PHP-06:14746] [ 9] ./vq(main+0x109b) [0x42a47b]
[PHP-06:14746] [10] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5) [0x7f760c8afec5]
[PHP-06:14746] [11] ./vq() [0x42b782]

[PHP-06:14746] *** End of error message ***

mpirun noticed that process rank 0 with PID 14746 on node PHP-06 exited on signal 6 (Aborted).


I also checked the test, which is included in the VQ source.

The following tests FAILED:
12 - run_P1_none_12000 (OTHER_FAULT)
14 - test_slip_P1_none_12000 (Failed)
15 - test_interevent_P1_none_12000 (Failed)
18 - run_P1_none_6000 (OTHER_FAULT)
20 - test_slip_P1_none_6000 (Failed)
21 - test_interevent_P1_none_6000 (Failed)
24 - run_P1_none_4000 (OTHER_FAULT)
26 - test_slip_P1_none_4000 (Failed)
27 - test_interevent_P1_none_4000 (Failed)
30 - run_P1_none_3000 (OTHER_FAULT)
32 - test_slip_P1_none_3000 (Failed)
33 - test_interevent_P1_none_3000 (Failed)
36 - run_P1_none_2000 (OTHER_FAULT)
38 - test_slip_P1_none_2000 (Failed)
39 - test_interevent_P1_none_2000 (Failed)
42 - run_P1_taper_12000 (OTHER_FAULT)
46 - run_P1_taper_6000 (OTHER_FAULT)
50 - run_P1_taper_4000 (OTHER_FAULT)
54 - run_P1_taper_3000 (OTHER_FAULT)
58 - run_P1_taper_2000 (OTHER_FAULT)
62 - run_P1_taper_renorm_12000 (OTHER_FAULT)
66 - run_P1_taper_renorm_6000 (OTHER_FAULT)
70 - run_P1_taper_renorm_4000 (OTHER_FAULT)
74 - run_P1_taper_renorm_3000 (OTHER_FAULT)
78 - run_P1_taper_renorm_2000 (OTHER_FAULT)
82 - run_P2_none_6000 (Failed)
84 - test_slip_P2_none_6000 (Failed)
85 - test_interevent_P2_none_6000 (Failed)
88 - run_P2_none_4000 (Failed)
90 - test_slip_P2_none_4000 (Failed)
91 - test_interevent_P2_none_4000 (Failed)
94 - run_P2_none_3000 (Failed)
96 - test_slip_P2_none_3000 (Failed)
97 - test_interevent_P2_none_3000 (Failed)
100 - run_P2_none_2000 (Failed)
102 - test_slip_P2_none_2000 (Failed)
103 - test_interevent_P2_none_2000 (Failed)
106 - run_P2_taper_6000 (Failed)
110 - run_P2_taper_4000 (Failed)
114 - run_P2_taper_3000 (Failed)
118 - run_P2_taper_2000 (Failed)
122 - run_P2_taper_renorm_6000 (Failed)
126 - run_P2_taper_renorm_4000 (Failed)
130 - run_P2_taper_renorm_3000 (Failed)
134 - run_P2_taper_renorm_2000 (Failed)
138 - run_P4_none_6000 (Failed)
140 - test_slip_P4_none_6000 (Failed)
141 - test_interevent_P4_none_6000 (Failed)
144 - run_P4_none_4000 (Failed)
146 - test_slip_P4_none_4000 (Failed)
147 - test_interevent_P4_none_4000 (Failed)
150 - run_P4_none_3000 (Failed)
152 - test_slip_P4_none_3000 (Failed)
153 - test_interevent_P4_none_3000 (Failed)
156 - run_P4_none_2000 (Failed)
158 - test_slip_P4_none_2000 (Failed)
159 - test_interevent_P4_none_2000 (Failed)
162 - run_P4_taper_6000 (Failed)
166 - run_P4_taper_4000 (Failed)
170 - run_P4_taper_3000 (Failed)
174 - run_P4_taper_2000 (Failed)
178 - run_P4_taper_renorm_6000 (Failed)
182 - run_P4_taper_renorm_4000 (Failed)
186 - run_P4_taper_renorm_3000 (Failed)
190 - run_P4_taper_renorm_2000 (Failed)
194 - run_two_none_6000 (OTHER_FAULT)
196 - test_two_slip_none_6000 (Failed)
199 - run_two_none_3000 (OTHER_FAULT)
201 - test_two_slip_none_3000 (Failed)
204 - run_two_taper_6000 (OTHER_FAULT)
208 - run_two_taper_3000 (OTHER_FAULT)
212 - run_two_taper_renorm_6000 (OTHER_FAULT)
216 - run_two_taper_renorm_3000 (OTHER_FAULT)
221 - run_gen_P1_green_3000 (Failed)
222 - run_full_P1_green_3000 (Failed)
227 - run_gen_P2_green_3000 (Failed)
228 - run_full_P2_green_3000 (Failed)
233 - run_gen_P4_green_3000 (Failed)
234 - run_full_P4_green_3000 (Failed)
Errors while running CTest

@markyoder (Contributor)

I get the same error. It looks like the ::init() function in the VCInitBlocks() class object is not executing. I'm hoping to have this narrowed down in the next day or two.


@kwschultz (Contributor, Author)

It looks like the other grad student is still getting an error:

user@PHP-06:~/Desktop/Fault_NW_3_1/RUN$ mpirun -np 1 ./vq ./params_G.d

*******************************

*** Virtual Quake ***

*** Version 1.2.0 ***

*** QuakeLib 1.2.0 ***

*** MPI process count : 1 ***

*** OpenMP not enabled ***

*******************************

Initializing blocks.

To gracefully quit, create the file quit_vq in the run directory or use a SIGINT (Control-C).

Calculating Greens function with the standard Okada class....0%....2%....3%....4%....5%....6%....7%....8%....10%....11%....12%....13%....14%....15%....16%....18%....19%....20%....21%....22%....23%....24%....25%....27%....28%....29%....30%....31%....32%....33%....35%....36%....37%....38%....39%....40%....41%....43%....44%....45%....46%....47%....48%....49%....51%....52%....53%....54%....55%....56%....57%....59%....60%....61%....62%....63%....64%....65%....67%....68%....69%....70%....71%....72%....73%....75%....76%....77%....78%....79%....80%....81%....83%....84%....85%....86%....88%....89%....90%....91%....92%....94%....95%....96%....97%....98%....99%

Greens function took 468.306 seconds.

Greens shear matrix takes 71.9531 megabytes

Greens normal matrix takes 71.9531 megabytes

Greens output file: Green_NW_Iran.h5

vq: /home/user/Desktop/vq-master_3_10/src/core/SimDataBlocks.h:47: Block& VCSimDataBlocks::getBlock(const BlockID&): Assertion `block_num<blocks.size()' failed.
[PHP-06:04232] *** Process received signal ***
[PHP-06:04232] Signal: Aborted (6)
[PHP-06:04232] Signal code: (-6)
[PHP-06:04232] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x36c30) [0x7fc9ae17fc30]
[PHP-06:04232] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x39) [0x7fc9ae17fbb9]
[PHP-06:04232] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x148) [0x7fc9ae182fc8]
[PHP-06:04232] [ 3] /lib/x86_64-linux-gnu/libc.so.6(+0x2fa76) [0x7fc9ae178a76]
[PHP-06:04232] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x2fb22) [0x7fc9ae178b22]
[PHP-06:04232] [ 5] ./vq() [0x42980a]
[PHP-06:04232] [ 6] ./vq(_ZN17UpdateBlockStress4initEP12SimFramework+0x51d) [0x44937d]
[PHP-06:04232] [ 7] ./vq(_ZN12SimFramework4initEv+0x51e) [0x453cee]
[PHP-06:04232] [ 8] ./vq(_ZN10Simulation4initEv+0x29) [0x468149]
[PHP-06:04232] [ 9] ./vq(main+0x109b) [0x42a99b]
[PHP-06:04232] [10] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5) [0x7fc9ae16aec5]
[PHP-06:04232] [11] ./vq() [0x42bca2]

[PHP-06:04232] *** End of error message ***

mpirun noticed that process rank 0 with PID 4232 on node PHP-06 exited on signal 6 (Aborted).

@markyoder (Contributor)

This is a different error from what I was getting. I'd been seeing, basically, an error that no blocks had been created; this one is telling us that the sim is looking for a block off the end of the blocks array (block_num >= blocks.size(), which could conceivably be size = 0, but I don't think so in this case). This could also be due to recycling a fault model -- did he recreate his fault model before running? Otherwise, it might be related to the general seg-fault problem. Standby... can we get a copy of those fault traces?


@markyoder (Contributor)

OK, so I think I've been making progress on this. Hopefully Eric can clarify some syntax for those of us less awesome at C++. Valgrind tells me that we are not properly deallocating memory in a bunch of places: we need to use "delete [] ary" when we declare and allocate an array like "int *X = new int[n];"... which we don't do in about a thousand places. Similarly, when we use malloc() or valloc(), we need to use free(), though I'm not sure if something more needs to be done for valloc(); I only found articles for malloc(). This seems to be clearing up the valgrind complaints one by one.
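
A tiny example of the pairing rule valgrind is enforcing (generic code, not VQ): anything created with new[] must be released with delete [], and anything from malloc() or valloc() goes back through free() (as far as I can tell, valloc() is just a page-aligned malloc in this respect, so free() is all that's needed).

```cpp
#include <cstdlib>

int main() {
    int *ids = new int[1000];                                    // new[] ...
    double *buf = (double *)std::malloc(1000 * sizeof(double));  // malloc() ...

    // ... use ids and buf ...

    delete [] ids;       // ... must be delete [], not "delete ids"
    std::free(buf);      // ... must be free(), never delete
    return 0;
}
```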

One area of concern:
In Simulation.cpp, Simulation::collectEventSweep(quakelib::ModelSweeps &sweeps) {...} (line 746 or so), I'm a bit confused about how to handle the sweep_counts, sweep_offsets, and all_sweeps arrays. They appear to be declared/allocated differently for root vs. child nodes, so nominally they need to be deallocated differently. For the child nodes, we set the pointer int *sweep_offsets to NULL, and it is subsequently involved in an MPI_Gather() call, so I presume the NULL value (actually a NULL address; should it be a value?) is handled on the other end.
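
On the NULL question: the receive-side arguments to MPI_Gather()/MPI_Gatherv() (the buffer, counts, and displacements) are only read on the root rank, so passing NULL for them on the child nodes is legal, and each rank only has to delete what it actually allocated. A generic sketch of that pattern (not the collectEventSweep() code):

```cpp
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int my_count = rank + 1;              // each rank contributes a different amount
    int *my_data = new int[my_count];
    for (int i = 0; i < my_count; ++i) my_data[i] = rank;

    // Receive-side arrays exist only on the root; NULL elsewhere is fine.
    int *counts = NULL, *displs = NULL, *all_data = NULL;
    if (rank == 0) counts = new int[size];

    MPI_Gather(&my_count, 1, MPI_INT, counts, 1, MPI_INT, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        displs = new int[size];
        int total = 0;
        for (int i = 0; i < size; ++i) { displs[i] = total; total += counts[i]; }
        all_data = new int[total];
    }

    MPI_Gatherv(my_data, my_count, MPI_INT,
                all_data, counts, displs, MPI_INT, 0, MPI_COMM_WORLD);

    // Each rank frees only what it allocated (new[] paired with delete []).
    delete [] my_data;
    if (rank == 0) { delete [] counts; delete [] displs; delete [] all_data; }

    MPI_Finalize();
    return 0;
}
```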

Is this it???:
Having corrected (??) a bunch of memory-allocation new/delete bits, we're back to GreensInit.cpp, somewhere around line 139, where we write "Global Greens shear matrix takes...", etc. It looks like this MPI call might not be happening correctly. MPI_Reduce is supposed to collect the numbers for global_shear/normal_bytes, right? But it does not seem to collect those values correctly. To make memcheck happy, I initialize these upon declaration to NaN; I get NaN again for those values on non-root nodes -- or is that by design?
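
As far as I can tell, that is by design: MPI_Reduce() only defines the result on the root rank, so the NaN the other ranks were initialized with is never overwritten. If every rank needs the total (for example, to print it), the usual options are MPI_Allreduce() or a follow-up MPI_Bcast() from root. A quick sketch (generic, not the GreensInit code):

```cpp
#include <mpi.h>
#include <cmath>
#include <cstdio>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local_bytes  = 1024.0 * (rank + 1);  // stand-in for the per-node matrix size
    double global_bytes = std::nan("");         // initialized to NaN, as described above

    // After this call global_bytes is meaningful on rank 0 only;
    // every other rank still sees its original NaN.
    MPI_Reduce(&local_bytes, &global_bytes, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    // If all ranks need the total, reduce-to-everyone instead:
    double global_bytes_everywhere = 0.0;
    MPI_Allreduce(&local_bytes, &global_bytes_everywhere, 1, MPI_DOUBLE, MPI_SUM,
                  MPI_COMM_WORLD);

    std::printf("rank %d: MPI_Reduce=%g, MPI_Allreduce=%g\n",
                rank, global_bytes, global_bytes_everywhere);

    MPI_Finalize();
    return 0;
}
```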

@markyoder (Contributor)

And it seemed that heisenbug was fixed, but it's back. Almost all of the bits above are addressed in the most recent pull request. BUT, we still get heisenbug, at least for big models, long runs, and MPP mode.

The most likely candidate at this point is RunEvent.cpp::RunEvent::processBlocksSecondaryFailures(). Note the MPI_Send() / MPI_Recv() bits in the middle of this code block. Basically, we have an if-root / not-root separation, and both the root and child nodes do a bit of both sending and receiving. It seems to me that if one node gets ahead of the others in this send/receive + if-loop process, we could get a hang-up.

... and a quick note: at least on my secondary Ubuntu platform, heisen_hang appears to occur at the same place, when the events.h5 output file reaches 221.9 MB; as of this moment it has not been modified for a little more than 2.5 hours.

@markyoder (Contributor)

... and to make this more and more confusing, it may be the case that there is no heisen_hang. The main parameters needed to observe this phenomenon are: block_length <= 3 km, MPP or SPP, full CA model, HDF5 output mode; and it seems to occur after about 220 MB of data are collected.

So it looked like this might be in the HDF5 write code, but I just finished a run with all of the above parameters, including HDF5 output, and it finished. It DID appear to hang (on what turns out to be the last (few) event(s)), but as any good scientist does, I went to have a swim and some lunch, and when I came back it was finished. Similarly, I finished a run on kapalua with np=8 in text mode. Of course, the np=8 kapalua run using HDF5 mode has remained hung for at least the last 6 hours or so, so there might still be a problem. So, the plan at this point:

  1. There are a couple of places in the write-data-out code that could, on some platforms, cause the system to hang during a big HDF5 write. Namely, check whether child nodes might be trying to write to the output file; all HDF5 writing should be done by the root node... at this time... I think. Is the "write" process blocking correctly? (See the sketch after this list.)
  2. There may be no problem. Large events take a LOT of time to process, so Exploding California might look like a hang in some cases when it's really just processing a massive event. Let's address the Exploding California problem; maybe heisen_hang will go away.
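
A minimal sketch of item 1, assuming the events file is written with the serial (non-parallel) HDF5 library: every HDF5 call is gated on the root rank, and the other ranks just wait at a barrier. The file name and dataset here are placeholders, not the actual VQ output layout.

```cpp
#include <mpi.h>
#include <hdf5.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        // Only the root rank ever touches the (serial) HDF5 file.
        hid_t file = H5Fcreate("events_sketch.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);

        double event_data[4] = {1.0, 2.0, 3.0, 4.0};   // placeholder payload
        hsize_t dims[1] = {4};
        hid_t space = H5Screate_simple(1, dims, NULL);
        hid_t dset  = H5Dcreate2(file, "events", H5T_NATIVE_DOUBLE, space,
                                 H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
        H5Dwrite(dset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT, event_data);

        H5Dclose(dset);
        H5Sclose(space);
        H5Fclose(file);
    }

    // Child nodes never open the file; everyone re-synchronizes here.
    MPI_Barrier(MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}
```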

@markyoder (Contributor)

Update: I'm not sure heisen_hang and Exploding California are actually related. Mac systems appear to hang independent of large events. As per the "m>10 events" ticket, we've implemented code to mitigate exploding_california (basically, impose max/min constraints on GF values), but Mac OS still hangs... at least in MPI, sometimes. It hangs in both HDF5 and text output modes.

Summary: the system appears to hang during processStaticFailure(), in the MPI_Allgatherv() call in sim->distributeUpdateField().

This appears to be utterly stable on Linux platforms. Note that during installation (compile time) on Linux (Mint 17 and the corresponding Ubuntu (14.x?) distros), we do NOT get the "OpenMPI not found" errors that we see during the MacOS installations.

using the "lldb" debugger, backtrace on all "active" processes produces something like:
(lldb) thread backtrace

* thread #1: tid = 0x5f2b8e3, 0x00007fff879bcafe libsystem_kernel.dylib`swtch_pri + 10, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
    frame #0:  0x00007fff879bcafe libsystem_kernel.dylib`swtch_pri + 10
    frame #1:  0x00007fff8b85fc56 libsystem_pthread.dylib`sched_yield + 11
    frame #2:  0x0000000101fb6395 libmpi.1.dylib`ompi_request_default_wait_all + 277
    frame #3:  0x00000001027b0471 mca_coll_tuned.so`ompi_coll_tuned_sendrecv_actual + 161
    frame #4:  0x00000001027b8beb mca_coll_tuned.so`ompi_coll_tuned_allgatherv_intra_neighborexchange + 859
    frame #5:  0x0000000101fc59f3 libmpi.1.dylib`MPI_Allgatherv + 307
    frame #6:  0x0000000101eaf830 vq`Simulation::distributeUpdateField() + 144
    frame #7:  0x0000000101e94b7b vq`RunEvent::processStaticFailure(Simulation*) + 1723
    frame #8:  0x0000000101e962d3 vq`RunEvent::run(SimFramework*) + 227
    frame #9:  0x0000000101ea19b2 vq`SimFramework::run() + 1794
    frame #10: 0x0000000101e82c5f vq`main + 2063
    frame #11: 0x00007fff8ad2d5fd libdyld.dylib`start + 1

a "thread list" command usually produces:
Process 62302 stopped

* thread #1: tid = 0x5f2b8df, 0x00007fff879bcafe libsystem_kernel.dylib`swtch_pri + 10, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
  thread #2: tid = 0x5f2b8e9, 0x00007fff879c09aa libsystem_kernel.dylib`__select + 10

and maybe on one process:
Process 62301 stopped

* thread #1: tid = 0x5f2b8de, 0x000000010fd47031 mca_btl_vader.so`mca_btl_vader_component_progress + 1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
  thread #2: tid = 0x5f2b8ea, 0x00007fff879c09aa libsystem_kernel.dylib`__select + 10

I think the state is cycling on the processes, so the "different" message can be hard to catch.

@markyoder (Contributor)

... and it looks like the problem is, or may be anyway, that there are nested MPI calls in the primary-secondary block failure model. In other words, it may occur that a secondary-failure loop on one process starts making calls for blocking, waiting, distributing, etc. at the same time that another process is doing the same for primary-failure events. For now, let's try more MPI_Barrier() calls, and we'll see if we can clean it up better later...
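
A sketch of that stop-gap (the two phase functions below are placeholders for the real RunEvent routines, not VQ API): a barrier between the primary-failure exchange and the secondary-failure exchange keeps any rank from starting the second phase's sends and receives while another rank is still finishing the first.

```cpp
#include <mpi.h>

// Placeholders for the real exchanges in RunEvent; each would do its own
// point-to-point communication internally.
static void processPrimaryFailures()   { /* MPI_Send/MPI_Recv for primary failures */ }
static void processSecondaryFailures() { /* MPI_Send/MPI_Recv for secondary failures */ }

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    processPrimaryFailures();
    // No rank may begin the secondary exchange until every rank has left the
    // primary one, so messages from the two phases cannot interleave.
    MPI_Barrier(MPI_COMM_WORLD);
    processSecondaryFailures();

    MPI_Finalize();
    return 0;
}
```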

@markyoder (Contributor)

Still hanging on a child-node (I think) MPI_Recv() call. The solution may be as simple as using MPI_Ssend() instead of MPI_Send(), the former being the "synchronous" version of the latter. See:
http://stackoverflow.com/questions/17582900/difference-between-mpi-send-and-mpi-ssend

This article describes cases where MPI_Send() might be happy to move on when the corresponding MPI_Recv() has not actually done whatever the hell it's supposed to do.
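
A minimal illustration of that difference (generic MPI, not VQ code): a small MPI_Send() is usually buffered eagerly, so the sender returns even if the matching receive is never posted, and the hang surfaces somewhere else much later; MPI_Ssend() does not complete until the receiver has started the matching receive, so a missing or mis-ordered MPI_Recv() deadlocks right at the send, where it's easy to spot. Run with two ranks; this example hangs on purpose.

```cpp
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int payload = 42;
    if (rank == 0) {
        // With MPI_Send this would likely return immediately (eager buffering)
        // and the missing receive would go unnoticed here.  MPI_Ssend blocks
        // until rank 1 posts the matching MPI_Recv, so the bug shows up at
        // exactly this line.
        MPI_Ssend(&payload, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        // Bug being illustrated: the matching receive was forgotten/mis-ordered.
        // MPI_Recv(&payload, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}
```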
