Heisenbug #59
Comments
I think I'm observing the same bug on several Linux machines (Ubuntu 14.04 and Mint 17). The simulation seems to run fine on a single processor (SPP) -- though the actual output data still need to be verified. For multiple processors (MPP), the Greens functions calculate successfully, and then the whole thing quits. I get an error message like this (note: this error occurs after the Greens function calcs):

[Umbasa:04607] *** End of error message ***
mpirun noticed that process rank 1 with PID 4607 on node Umbasa exited on signal 11 (Segmentation fault).

Looking at "make test", the single-processor tests go quite well, but there is a significant failure rate for the multi-processor tests (and note some sort of error at the end as well):

Total Test time (real) = 69.30 sec
The following tests FAILED:
... and the "failing at" address appears to be consistent across at least two runs (seemingly at the end of the register).
... and then, if I run vq in MPP mode using the pre-calculated Greens functions, I get the same error (and note, the same failure address at the end of the register: Failing at address: 0xffffffffffffffe8):

**********************************
*** Virtual Quake ***
*** Version 1.2.0 ***
*** Git revision ID 5289436 ***
*** QuakeLib 1.2.0 Git revision 5289436 ***
*** MPI process count : 2 ***
*** OpenMP not enabled ***
**********************************
Initializing blocks.
To gracefully quit, create the file quit_vq in the run directory or use a SIGINT (Control-C).
Reading Greens function data from file all_cal_greens_5000.h5....
Greens function took 3.95872 seconds.
Greens shear matrix takes 45.5098 megabytes
Greens normal matrix takes 45.5098 megabytes
[yodubuntu:15069] *** Process received signal ***
Global Greens shear matrix takes 91.0195 megabytes.
Global Greens normal matrix takes 91.0195 megabytes.
[yodubuntu:15069] Signal: Segmentation fault (11)
mpirun noticed that process rank 1 with PID 15069 on node yodubuntu.physics.ucdavis.edu exited on signal 11 (Segmentation fault).
I'm guessing it's in the Green's function calculation then. Can you try commenting out the call to symmetrizeMatrix (misc/GreensFunctions.cpp:81), recompile, and see if it still crashes? And if it still crashes, then try removing the whole assignment loop (lines 83-89) after that?
FYI Kasey, this is also a common debugging technique - keep removing code until it runs, then figure out what was wrong with the code you removed.
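For illustration, a convenient way to "remove" a suspect block without deleting it is a preprocessor guard that can be toggled while bisecting. This is a minimal sketch with a made-up stand-in function, not the actual VQ routine in GreensFunctions.cpp:

```cpp
#include <cstddef>
#include <iostream>
#include <vector>

// Stand-in for the suspect code being bisected; not the real VQ routine.
void suspectAssignmentLoop(std::vector<double> &m) {
    for (std::size_t i = 0; i < m.size(); ++i) m[i] *= 2.0;
}

int main() {
    std::vector<double> mat(10, 1.0);
#if 0   // flip to 1 to re-enable the suspect block while bisecting
    suspectAssignmentLoop(mat);
#endif
    std::cout << "mat[0] = " << mat[0] << std::endl;  // runs with or without the block
    return 0;
}
```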
So I tried commenting out misc/GreensFunctions.cpp:81 and lines 83-89, but no joy. The Greens functions calculate, but they don't write to file (or otherwise proceed to the next step in the sim). I also see this in the error output:

which points to this bit in GreensFileOutput.cpp:

Is this a parallel vs. serial HDF5 problem?
So when you say they don't write to file, does it crash in the same way as before or does it just have the assertion failure? I don't think it's the parallel vs. serial HDF5 problem, because that would manifest as a different sort of error.
The sim breaks before the write to file; always the same error. I've added some debugging code to SimFramework.cpp (amongst other places), in particular in the initialization function: ... but the second process (6408) has only reached the 4th (i=3) plugin. initDesc() completes; init() has started, then we get a segmentation fault reported by 6408. Output pasted below. So I'm not sure yet which plugins those are.

myoder@Umbasa ~/Documents/Research/yoder/VC/vq/examples/ca_model $ mpirun -np 2 ../../build/src/vq params.d
*******************************
Debug: plugin_init 0, initDesc() pid: 6408
*** Virtual Quake ***
*** Version 1.2.0 ***
*** Git revision ID dde492c ***
Debug: plugin_init 0, init() pid: 6408
*** QuakeLib 1.2.0 Git revision dde492c ***
*** MPI process count : 2 ***
*** OpenMP not enabled ***
**********************************
Debug: SimFramework::init(), 'dry run'/normal initialization loop??6407..
Initializing blocks.
Debug: plugin_init 1, init() pid: 6407
To gracefully quit, create the file quit_vq in the run directory or use a SIGINT (Control-C).
Debug: plugin_init 2, init() pid: 6407
Reading Greens function data from file greens_15000.h5.
Debug: plugin_init 1, timer bit... pid: 6408
plugin cycle finished 1/6408
Greens function took 0.012284 seconds.
Greens shear matrix takes 1014 kilobytes
Greens normal matrix takes 1014 kilobytes
[Umbasa:06408] *** Process received signal ***
Global Greens shear matrix takes 1.98047 megabytes.
Global Greens normal matrix takes 1.98047 megabytes.
Debug: plugin_init 3, timer bit... pid: 6407
mpirun noticed that process rank 1 with PID 6408 on node Umbasa exited on signal 11 (Segmentation fault).
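For reference, the instrumentation described above amounts to printing the loop index and process ID around each plugin call, so interleaved output from the two MPI processes can be attributed. A minimal sketch of that pattern (the Plugin interface and initPlugins() are stand-ins, not the actual SimFramework code):

```cpp
#include <unistd.h>   // getpid()
#include <cstddef>
#include <iostream>
#include <vector>

// Stand-in plugin interface; the real SimFramework plugins differ.
struct Plugin {
    virtual void initDesc() {}
    virtual void init() {}
    virtual ~Plugin() {}
};

// Tag each initialization step with the plugin index and the PID so that
// interleaved output from multiple MPI processes can be attributed.
void initPlugins(std::vector<Plugin*> &plugins) {
    for (std::size_t i = 0; i < plugins.size(); ++i) {
        std::cerr << "Debug: plugin_init " << i << ", initDesc() pid: " << getpid() << std::endl;
        plugins[i]->initDesc();
        std::cerr << "Debug: plugin_init " << i << ", init() pid: " << getpid() << std::endl;
        plugins[i]->init();
    }
}

int main() {
    std::vector<Plugin*> plugins = { new Plugin, new Plugin };
    initPlugins(plugins);
    for (Plugin *p : plugins) delete p;
    return 0;
}
```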
To make sure it's not HDF5 related, can you recompile with HDF5 disabled and run again?
I'm not sure if it's the same error, but here is an error that another graduate student is getting on multiprocessor runs (this is the only output he gave me; I'll ask for the full output):

vq: /home/user/Desktop/vq-master3.3/src/core/SimDataBlocks.h:47: Block& VCSimDataBlocks::getBlock(const BlockID&): Assertion `block_num<blocks.size()' failed.
[PHP-06:07000] *** End of error message ***
mpirun noticed that process rank 0 with PID 7000 on node PHP-06 exited on signal 6 (Aborted).
... and the winner is (I think):

sim->console() << "# Global Greens shear matrix takes " << abbr_global_shear_bytes << " " << space_vals[global_shear_ind] << "." << std::endl;

The string array space_vals[] is not declared in the scope of the child nodes. What is the right way to share scope? The sim appears to run with these lines commented out; we'll run a big sim to improve confidence.
From what I can tell, space_vals[] is in the scope of the child nodes, since it's created at function start. You can try running with this, but I don't think it's the source of the problem. However, it might "solve" the problem by creating extra space that absorbs whatever memory overwriting is normally causing the crash.
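To make the scope point concrete, the reporting pattern presumably looks something like the sketch below (a hedged reconstruction, not the VQ source): the unit-string array is local to the function, so every rank has it in scope, and what would actually crash is an index pushed out of range, e.g. by earlier memory corruption.

```cpp
#include <iostream>
#include <string>

// Sketch of the byte-size reporting pattern in question (not the VQ source):
// space_vals is function-local, so every rank has it in scope; what segfaults
// is an index that has drifted out of range.
void printGreensMatrixSize(double bytes) {
    const std::string space_vals[] = {"bytes", "kilobytes", "megabytes", "gigabytes"};
    int ind = 0;
    double abbr = bytes;
    while (abbr >= 1024.0 && ind < 3) {
        abbr /= 1024.0;
        ++ind;
    }
    // If ind were corrupted to a huge value, space_vals[ind] would read far
    // past the array -- undefined behavior that can crash on one rank only.
    std::cout << "# Global Greens shear matrix takes " << abbr << " "
              << space_vals[ind] << "." << std::endl;
}

int main() {
    printGreensMatrixSize(95441060.0);   // roughly 91.02 megabytes, as in the logs above
    return 0;
}
```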
The message the other student is getting is really weird, because there should never be accesses past the number of blocks. My best guess is that this is related to the same memory corruption that's caused problems on our side, just manifesting itself differently.
Sadly, Eric, with my limited understanding of MPI and the VQ architecture, I was hoping you'd have something different to say. I can still reproduce the segmentation fault using 8 processors, (not-)fix in place and all. Once the sim gets past the initialization, however, it seems to be stable. That said, the manifestation in the Mac environment may be less forgiving.

... And on another machine, where I think vq was running before, I'm getting an error for all runs (MPI and SPP modes, so maybe this is a good thing). In this case, however, I'm getting something more like:

Looking at Simulation.cpp::Simulation::partitionBlocks(), I'm not clear on the scope of the arrays {local_block_ids, global_block_ids, block_node_map}. local_block_ids is declared with "new" in Simulation::distributeBlocks(). These arrays are declared in core/CommPartition.h, and CommPartition is inherited by Simulation. However, is the re-declaration of local_block_ids in distributeBlocks() correct? Also, if these arrays are declared in CommPartition.h, where are they allocated?
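To make the shadowing question concrete, here is a minimal sketch (the class and member names echo the discussion but stand in for the real CommPartition/Simulation code): writing `int *local_block_ids = new int[n];` inside a method declares a new local pointer that shadows the inherited member, which then never gets allocated; assigning the member directly (no type in front) is the version that shares the allocation with the rest of the class.

```cpp
#include <cstddef>

// Stand-in for the arrays declared in core/CommPartition.h.
class CommPartition {
  protected:
    int *local_block_ids = nullptr;   // declared here, allocated elsewhere
};

class Simulation : public CommPartition {
  public:
    void distributeBlocksShadowed(std::size_t n) {
        // Bug pattern: the leading "int *" declares a *new local* pointer that
        // shadows the member; the member stays nullptr after this returns.
        int *local_block_ids = new int[n];
        local_block_ids[0] = 42;
        delete[] local_block_ids;     // local allocation, gone at scope exit
    }

    void distributeBlocksMember(std::size_t n) {
        // Correct pattern: assign the inherited member, so later code
        // (e.g. partitionBlocks()) sees the same allocation.
        local_block_ids = new int[n];
        local_block_ids[0] = 42;
    }

    ~Simulation() { delete[] local_block_ids; }   // pair new[] with delete[]
};

int main() {
    Simulation sim;
    sim.distributeBlocksMember(8);
    return 0;
}
```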
I made the change and successfully ran a 10 kyr sim on Kapalua (the fault model is all the CA traces from VQ/examples/fault_traces/ca_traces/, actually a copy of Mark's 5 km model) with 4 processes. This is not necessarily evidence that it works, though: before this I ran an AllCal 3 km sim (Michael's model file, but meshed with the VQ mesher) on multiple processes, and while it had some successful runs, most of them would get hung up after 4 kyr+. I'll try a few more, like a 50 kyr run with aftershocks on Kapalua and a few more with newly meshed 3 km models on my Mac, though I never got explicit memory errors like on Linux.
Full output from the graduate student:

user@PHP-06:~/Desktop/NW_3_1_50000_kasey$ mpirun -np 1 ./vq ./params.prm
**********************************
*** Virtual Quake ***
*** Version 1.2.0 ***
*** QuakeLib 1.2.0 ***
*** MPI process count : 1 ***
*** OpenMP not enabled ***
**********************************
Initializing blocks.
To gracefully quit, create the file quit_vq in the run directory or use a SIGINT (Control-C).
Reading Greens function data from file Green_NW_Iran.h5.
Greens function took 0.22002 seconds.
Greens shear matrix takes 61.6003 megabytes
Greens normal matrix takes 61.6003 megabytes
vq: /home/user/Desktop/vq-master3.3/src/core/SimDataBlocks.h:47: Block& VCSimDataBlocks::getBlock(const BlockID&): Assertion `block_num<blocks.size()' failed.
[PHP-06:14746] *** End of error message ***
mpirun noticed that process rank 0 with PID 14746 on node PHP-06 exited on signal 6 (Aborted).

I also checked the tests included in the VQ source. The following tests FAILED:
I get the same error. It looks like the ::init() function in the ...
It looks like the other grad student is still getting an error:

user@PHP-06:~/Desktop/Fault_NW_3_1/RUN$ mpirun -np 1 ./vq ./params_G.d
**********************************
*** Virtual Quake ***
*** Version 1.2.0 ***
*** QuakeLib 1.2.0 ***
*** MPI process count : 1 ***
*** OpenMP not enabled ***
**********************************
Initializing blocks.
To gracefully quit, create the file quit_vq in the run directory or use a SIGINT (Control-C).
Calculating Greens function with the standard Okada class.
....0%....2%....3%....4%....5% ... ....97%....98%....99%
Greens function took 468.306 seconds.
Greens shear matrix takes 71.9531 megabytes
Greens normal matrix takes 71.9531 megabytes
Greens output file: Green_NW_Iran.h5
vq: /home/user/Desktop/vq-master_3_10/src/core/SimDataBlocks.h:47: Block& VCSimDataBlocks::getBlock(const BlockID&): Assertion `block_num<blocks.size()' failed.
[PHP-06:04232] *** End of error message ***
mpirun noticed that process rank 0 with PID 4232 on node PHP-06 exited on signal 6 (Aborted).
This is a different error than what I was getting. I'd been seeing, basically, an ...
OK, so I think I've been making progress on this; hopefully Eric can clarify some syntax for those of us less awesome at C++. Valgrind tells me that we are not properly deallocating memory in a bunch of places: we need to use "delete [] ary" when we declare and allocate an array like "int *X = new int[n];" ... which we don't do in about a thousand places. Similarly, when we use malloc() or valloc(), we need to use free(), though I'm not sure whether something more needs to be done for valloc(); I only found articles covering malloc(). This seems to be clearing up the valgrind complaints one by one. One area of concern:

Is this it???:
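For reference, here is a minimal sketch of the pairing rules valgrind is enforcing (generic C++/C allocation rules, not VQ-specific code); the note about valloc() reflects glibc behavior and is an assumption worth double-checking on other platforms.

```cpp
#include <cstdlib>   // malloc, free (valloc is obsolescent and mentioned only in comments)

int main() {
    // Array new must be paired with array delete; plain delete here would be
    // undefined behavior and is what valgrind flags as a mismatched free.
    int *X = new int[100];
    // ... use X ...
    delete [] X;            // correct: delete [] for new []

    // Single-object new pairs with plain delete.
    double *y = new double(3.14);
    delete y;

    // C-style allocations pair with free(). On glibc, valloc()'d memory is
    // also released with free() (posix_memalign() is the modern replacement),
    // but that is an assumption to verify on other platforms.
    void *m = std::malloc(4096);
    std::free(m);

    return 0;
}
```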
And it seemed that the heisenbug was fixed, but it's back. Almost all of the bits above are addressed in the most recent pull request, BUT we still get the heisenbug, at least for big models, long runs, and MPP mode. The most likely candidate at this point is RunEvent.cpp::RunEvent::processBlocksSecondaryFailures(); note the MPI_Send()/MPI_Recv() bits in the middle of that code block. Basically, we have an is-root/not-root separation, and both the root and child nodes do a bit of both sending and receiving. It seems to me that if one node gets ahead of the others in this send/receive + if-loop process, we could get a hang-up.

... And a quick note: at least on my secondary Ubuntu platform, heisen_hang appears to occur at the same place, when the events.h5 output file reaches 221.9 MB; at this moment the file has not been modified for a little more than 2.5 hours.
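To illustrate the kind of hang being described, here is a schematic of a root/worker exchange (made-up counts and tags, not the real processBlocksSecondaryFailures()): if the two sides ever disagree about how many items will be exchanged, one of them blocks in MPI_Recv() forever.

```cpp
// Schematic of the root/worker exchange pattern suspected above -- not the
// real processBlocksSecondaryFailures(); block counts and tags are made up.
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int items_per_rank = 4;   // both sides MUST agree on this count

    if (rank == 0) {
        // Root: receive one value per item from each worker, send back a result.
        for (int src = 1; src < size; ++src) {
            for (int i = 0; i < items_per_rank; ++i) {
                double x, result;
                MPI_Recv(&x, 1, MPI_DOUBLE, src, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                result = 2.0 * x;   // stand-in for the Ax=b solve on the root
                MPI_Send(&result, 1, MPI_DOUBLE, src, 1, MPI_COMM_WORLD);
            }
        }
    } else {
        // Worker: if its local count disagreed with the root's count,
        // one of these calls would block forever -- the "heisen_hang".
        for (int i = 0; i < items_per_rank; ++i) {
            double x = rank + 0.1 * i, result;
            MPI_Send(&x, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
            MPI_Recv(&result, 1, MPI_DOUBLE, 0, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
    }

    MPI_Finalize();
    return 0;
}
```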
... And to make this more and more confusing, it may be the case that there is no heisen_hang. The main parameters needed to observe the phenomenon are: block_length <= 3 km, MPP or SPP, the full CA model, and HDF5 output mode; it seems to occur after about 220 MB of data are collected. So it looked like this might be in the HDF5 write code, but I just finished a run with all of the above parameters, including HDF5 output, and it finished. It DID appear to hang (on what turns out to be the last few events), but as any good scientist does, I went to have a swim and some lunch, and when I came back it was finished. Similarly, I finished a run on Kapalua with np=8 in text mode. Of course, the np=8 Kapalua run using HDF5 mode has remained hung for at least the last 6 hours or so, so there might still be a problem. So, the plan at this point:
Update: I'm not sure if heisen_hang and exploding_california are actually related; Mac systems appear to hang independent of large events. As per the "m>10 events" ticket, we've implemented code to mitigate exploding_california (basically, impose max/min constraints on GF values), but Mac OS still hangs... at least in MPI, sometimes. It hangs in both HDF5 and text output modes.

Summary: the system appears to hang during processStaticFailure(), in the MPI_Allgatherv() call in sim->distributeUpdateField() (see the sketch after the backtrace notes below). This appears to be utterly stable on Linux platforms. Note that at installation (compile) time on Linux (Mint 17 and the corresponding Ubuntu 14.x distro), we do NOT get the "OpenMPI not found" errors that we see during the MacOS installations. Using the lldb debugger, a backtrace on all "active" processes produces something like:
A "thread list" command usually produces:

... and maybe on one process:
I think the state is cycling on the processes, so the "different" message can be hard to catch.
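For reference, the distributeUpdateField() step named in these backtraces is described in this thread as an MPI_Allgatherv() of each rank's local update values into a global array; the sketch below shows that collective with made-up buffer names and counts, not the VQ ones. The relevant property is that MPI_Allgatherv() blocks until every rank enters it, so one rank stuck elsewhere (e.g. in a point-to-point receive) hangs all the others here.

```cpp
// Schematic MPI_Allgatherv, the collective named in the backtraces above.
// Local/global buffer names are illustrative, not from the VQ source.
#include <mpi.h>
#include <vector>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Each rank contributes a different number of values (rank+1 here).
    std::vector<double> local_update(rank + 1, double(rank));

    // Every rank must agree on everyone's counts and displacements;
    // a disagreement here is exactly the kind of thing that hangs the call.
    std::vector<int> counts(size), displs(size);
    for (int r = 0; r < size; ++r) counts[r] = r + 1;
    displs[0] = 0;
    for (int r = 1; r < size; ++r) displs[r] = displs[r - 1] + counts[r - 1];

    std::vector<double> global_update(displs[size - 1] + counts[size - 1]);
    MPI_Allgatherv(local_update.data(), (int)local_update.size(), MPI_DOUBLE,
                   global_update.data(), counts.data(), displs.data(), MPI_DOUBLE,
                   MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}
```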
... And it looks like the problem is, or may be anyway, that there are nested MPI calls in the primary-secondary block failure model. In other words, it may occur that a secondary-failure loop on one process starts making calls for blocking, waiting, distributing, etc. at the same time that another process is doing the same for primary failure events. For now, let's try more MPI_Barrier() calls, and we'll see if we can clean it up better later...
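The stop-gap proposed here is just a barrier between the two failure-processing phases, so no rank can run ahead into the secondary-failure exchange while another is still in the primary one. A schematic (the phase functions are placeholders, not the real RunEvent methods):

```cpp
// Schematic of separating two MPI phases with a barrier; the phase
// functions are placeholders for the primary/secondary failure steps.
#include <mpi.h>

void processPrimaryFailures()   { /* point-to-point exchanges for primary failures */ }
void processSecondaryFailures() { /* point-to-point exchanges for secondary failures */ }

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    processPrimaryFailures();
    // Fence: no rank may start the secondary-failure exchange until every
    // rank has finished the primary one, so their Send/Recv calls can't mix.
    MPI_Barrier(MPI_COMM_WORLD);
    processSecondaryFailures();
    MPI_Barrier(MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}
```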
Still hanging on a child-node (I think) MPI_Recv() call. The solution may be as simple as using MPI_Ssend() instead of MPI_Send(), the former being the "synchronous" version of the latter. See the linked article, which describes cases where MPI_Send() might be happy to move on when the corresponding MPI_Recv() has not actually done what it's supposed to do.
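A hedged illustration of the difference (generic MPI behavior, not VQ code): for small messages MPI_Send() may copy the data into an internal buffer and return before any matching receive is posted, which can hide a missing MPI_Recv() until much later; MPI_Ssend() does not complete until the receiver has started the matching receive, so a mismatch shows up immediately as a hang at the offending send.

```cpp
// Minimal send/recv pair showing where MPI_Ssend differs from MPI_Send.
// (Generic MPI illustration -- no VQ-specific logic.)
#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size >= 2) {
        double payload = 3.14;
        if (rank == 0) {
            // MPI_Send may return as soon as the message is buffered internally,
            // even if rank 1 never posts a receive.  MPI_Ssend only returns once
            // the matching receive has started, so an unmatched send hangs *here*,
            // pointing directly at the mismatch.
            MPI_Ssend(&payload, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&payload, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            std::printf("rank 1 received %f\n", payload);
        }
    }

    MPI_Finalize();
    return 0;
}
```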
Error when running VQ on multiple processors with a model more complex than a few faults and a few thousand elements: the simulation gets stuck and cannot continue. The error is only randomly reproducible, occurring at different points in multiple runs of the same simulation.
We must fix this bug before we can reliably run simulations on multiple processors. Right now we are effectively dead in the water with respect to a full CA simulation or even including aftershocks on smaller simulations.
Eric's summary of the bug when it occurred on a 3-processor run of a 6-fault subset of the UCERF2 California model:
"So, here’s my Sherlock Holmes take:
From the backtrace, we know process 0 and 1 are stuck in distributeUpdateField(), while process 2 is in MPI_Recv() in processBlocksSecondaryFailures()
Since the processes are in order, this means the MPI_Recv that is stuck must correspond to the solution of Ax=b being sent back from the root (process 0) to process 2
The only way this could happen is if the number of MPI_Send() calls from root does not match the number of MPI_Recv() calls in the other processes
The only way this mismatch could happen is if the total number of entries in global_id_list is not equal to the sum of the number of entries in local_id_list for each process
or if processes have different understandings of the assignment of blocks to each process
Since my laptop run is already at 4300 events with no problems, it seems more likely this is a bug caused by bad memory writing
Such that one of these structures is being corrupted by something overwriting the existing data
So the question is how do we check whether this corruption is happening"
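Following that reasoning, one cheap way to test the count-mismatch hypothesis would be a consistency check before the exchange: every rank reports the size of its local_id_list, and the sum is compared against the root's global_id_list size. This is a suggested diagnostic, not existing VQ code; the list names echo Eric's description and the function is hypothetical.

```cpp
// Hypothetical diagnostic for the mismatch described above: verify that the
// root's global count equals the sum of every rank's local count before the
// Send/Recv exchange begins.  List names follow the discussion above.
#include <mpi.h>
#include <cstdio>

void checkIdListConsistency(int local_id_count, int global_id_count_on_root) {
    int rank, total_local = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // Sum the per-rank local counts onto the root.
    MPI_Reduce(&local_id_count, &total_local, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0 && total_local != global_id_count_on_root) {
        std::fprintf(stderr,
                     "ID list mismatch: global=%d, sum(local)=%d -- Send/Recv counts will diverge\n",
                     global_id_count_on_root, total_local);
        MPI_Abort(MPI_COMM_WORLD, 1);   // fail loudly instead of hanging
    }
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Toy values: each rank "owns" 3 IDs, so the root expects 3 * size total.
    int local_id_count = 3;
    int global_id_count_on_root = 3 * size;
    checkIdListConsistency(local_id_count, global_id_count_on_root);

    MPI_Finalize();
    return 0;
}
```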