mpiflute terminates with std::bad_alloc #2
Comments
I don't think you can run the whole USA on a laptop. I needed about 30-40 cores for that (but that was 10 years ago). Maybe try MPI with a smaller population that fits in RAM, like Los Angeles.
Hmm, I tried running with config-minimal and config-laiv-vs-tiv (e.g. $ mpiexec -n 1 ./mpiflute config-minimal), and I still get a segfault with mpiflute, while serial flute works fine for both of those config files. Here's the full output (that's sent to stdout, anyway) from the mpiflute run on config-minimal:
FluTE version 1.17
mpiflute:7761 terminated with signal 11 at PC=5649f2f2b324 SP=7ffdac80c0f8. Backtrace:
I would have thought that my MPI installation isn't working correctly or somehow isn't linked correctly with mpiflute, but the fact that it correctly reads R0 and interpolates beta indicates that there may be another problem.
I'm currently chasing the bug down, and I am certain that the segfault happens between the line 'int agegroups[TAG] = {0,0,0,0,0};' (around line 1000 of epimodel.cpp) and the close of the big if-else tree that follows it. I'm pretty sure the problem is in the random number generator; is there any reason FluTE uses its own Mersenne twister implementation rather than an std::mersenne_twister_engine, or one of the existing parallel random number generators (e.g. http://www.sprng.org/Version5.0/simple-reference.html)?
I chose to include code for the random number generator so that there would be no dependencies and to ensure that results will always be the same across platforms. I must admit that I have not used the parallel version of FluTE in about 9 years, so I don't know what the bug could be. |
Hi @DiffeoInvariant,
@DiffeoInvariant I figured it out. mpiflute tries to keep all the census tracts of a county on a single processor, so you can only test mpiflute with a population that includes several counties. If a processor does not have a county to simulate, it does something bad. My code to detect that problem did not work, and it should now exit if mpiflute has trouble assigning the population across processors. I have a new population "kingsnohomishpierce" that has three counties where you can test mpiflute with 2 cores.
@dlchao Thank you! I was just starting to write a simple wrapper around a PRNG from the C++ standard library (I could still do that if you want, to test whether it offers any minor portability or performance advantages), but after pulling your changes and recompiling, mpiflute appears to work correctly on my laptop. Looks like it's time to get everything working on Summit...
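For reference, a wrapper like the one mentioned could be sketched as below. All names here (Rng, rand_double, rand_uint) are illustrative, not FluTE's actual API; FluTE ships its own bundled Mersenne twister code.

```cpp
// Sketch of a thin wrapper around the standard-library Mersenne twister.
// Names are illustrative, not FluTE's actual API.
#include <cstdint>
#include <random>

class Rng {
public:
    explicit Rng(std::uint32_t seed) : engine_(seed) {}

    // Uniform double in [0, 1).
    double rand_double() {
        return std::uniform_real_distribution<double>(0.0, 1.0)(engine_);
    }

    // Uniform integer in [0, n); n must be > 0.
    std::uint32_t rand_uint(std::uint32_t n) {
        return std::uniform_int_distribution<std::uint32_t>(0, n - 1)(engine_);
    }

private:
    std::mt19937 engine_;  // std::mersenne_twister_engine with standard MT19937 parameters
};
```

One caveat: the raw std::mt19937 stream is fully specified by the standard, but the distribution classes are not, so results can differ across standard-library implementations. That is exactly the cross-platform reproducibility concern raised earlier in the thread.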
Hi FluTE devs,
I'm currently trying to get mpiflute working on my local machine (before I try to get it running on a cluster) with either MPICH or Intel MPI. When I compile with either (editing the makefile appropriately) and run with config-usa, the R0 calculation and interpolated beta come out correctly, but then the application terminates with a segmentation fault, propagating a std::bad_alloc exception to stderr when run with one MPI process (the exception never makes it to stderr when run with more than one MPI process). Does this config file just require a lot of RAM? I should have about 8 GB of RAM available when running, so I'd be surprised if there isn't something else going on.
I edited my makefile to use the lines
MPICFLAGS = -Wall -I/opt/intel/compilers_and_libraries_2020.0.166/linux/mpi/intel64/include -pthread -DDSFMT_MEXP=19937
MPILDFLAGS = -L. -L/opt/intel/compilers_and_libraries_2020.0.166/linux/mpi/intel64/lib -lm -lmpi_ilp64 -lutil -lnsl -ldl -lrt
Is that correct? And if so, am I just running out of memory (which isn't a problem, since I have access to a cluster for real runs), or is there something else I'm doing wrong? Thanks!
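If the hand-copied include and library paths above turn out to be the problem, one alternative worth trying is to let the MPI compiler wrapper supply those paths itself. The fragment below is a sketch, not FluTE's shipped makefile; the variable names mirror the lines quoted above, and the wrapper name depends on the MPI distribution (mpicxx with MPICH; Intel MPI also ships wrapper compilers such as mpicxx/mpiicpc).

```make
# Sketch: let the MPI wrapper compiler supply the MPI -I/-L/-l flags,
# instead of hard-coding installation paths in the makefile.
MPICXX     = mpicxx
MPICFLAGS  = -Wall -pthread -DDSFMT_MEXP=19937
MPILDFLAGS = -L. -lm
```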