mpiflute terminates with std::bad_alloc #2
Comments
I don't think you can run the whole USA on a laptop. I needed about 30-40 cores for that (but that was 10 years ago). Maybe try MPI with a smaller population that fits in RAM, like Los Angeles.
Hmm, I tried running with config-minimal and config-laiv-vs-tiv (e.g. $ mpiexec -n 1 ./mpiflute config-minimal), and I still get a segfault with mpiflute, while serial flute works fine for both of those config files. Here's the full output (that's sent to stdout, anyway) from the mpiflute run on config-minimal:
FluTE version 1.17
mpiflute:7761 terminated with signal 11 at PC=5649f2f2b324 SP=7ffdac80c0f8. Backtrace:
I would have thought that my MPI installation isn't working correctly or somehow isn't linked correctly with mpiflute, but the fact that it correctly reads R0 and interpolates beta indicates that there may be another problem.
I'm currently chasing the bug down, and I am certain that the segfault happens between the line 'int agegroups[TAG] = {0,0,0,0,0};' (around line 1000 of epimodel.cpp) and the close of the big if-else tree that follows it. I'm pretty sure the problem is in the random number generator; is there any reason FluTE uses its own Mersenne twister implementation rather than an std::mersenne_twister_engine, or one of the existing parallel random number generators (e.g. http://www.sprng.org/Version5.0/simple-reference.html)?
I chose to include code for the random number generator so that there would be no dependencies and to ensure that results will always be the same across platforms. I must admit that I have not used the parallel version of FluTE in about 9 years, so I don't know what the bug could be. |
Hi @DiffeoInvariant,
@DiffeoInvariant I figured it out. mpiflute tries to keep all the census tracts of a county on a single processor, so you can only test mpiflute with a population that includes several counties. If a processor does not have a county to simulate, it does something bad. My code to detect that problem did not work, and it should now exit if mpiflute has trouble assigning the population across processors. I have a new population "kingsnohomishpierce" that has three counties where you can test mpiflute with 2 cores.
@dlchao Thank you! I was just starting to write a simple wrapper around a PRNG from the C++ standard library (I could still do that if you want, to test whether it offers any minor portability or performance advantages), but after pulling your changes and recompiling, mpiflute appears to work correctly on my laptop. Looks like it's time to get everything working on Summit...
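For reference, a wrapper like the one mentioned could be sketched as below. All names here (Rng, rand_double, rand_uint) are illustrative, not FluTE's actual API; FluTE ships its own bundled Mersenne twister code.

```cpp
// Sketch of a thin wrapper around the standard-library Mersenne twister.
// Names are illustrative, not FluTE's actual API.
#include <cstdint>
#include <random>

class Rng {
public:
    explicit Rng(std::uint32_t seed) : engine_(seed) {}

    // Uniform double in [0, 1).
    double rand_double() {
        return std::uniform_real_distribution<double>(0.0, 1.0)(engine_);
    }

    // Uniform integer in [0, n); n must be > 0.
    std::uint32_t rand_uint(std::uint32_t n) {
        return std::uniform_int_distribution<std::uint32_t>(0, n - 1)(engine_);
    }

private:
    std::mt19937 engine_;  // std::mersenne_twister_engine with standard MT19937 parameters
};
```

One caveat: the raw std::mt19937 stream is fully specified by the standard, but the distribution classes are not, so results can differ across standard-library implementations. That is exactly the cross-platform reproducibility concern raised earlier in the thread.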
Hi FluTE devs,
I'm currently trying to get mpiflute working on my local machine (before I try to get it running on a cluster) with either MPICH or Intel MPI. When I compile with either (editing the makefile appropriately) and run with config-usa, the R0 calculation and interpolated beta come out correctly, but then the application terminates with a segmentation fault, propagating a std::bad_alloc exception to stderr when run with one MPI process (the exception never makes it to stderr when run with more than one MPI process). Does this config file just require a lot of RAM? I should have about 8 GB of RAM available when running, so I'd be surprised if there isn't something else going on.
I edited my makefile to use the lines
MPICFLAGS = -Wall -I/opt/intel/compilers_and_libraries_2020.0.166/linux/mpi/intel64/include -pthread -DDSFMT_MEXP=19937
MPILDFLAGS = -L. -L/opt/intel/compilers_and_libraries_2020.0.166/linux/mpi/intel64/lib -lm -lmpi_ilp64 -lutil -lnsl -ldl -lrt
Is that correct? And if so, am I just running out of memory (which isn't a problem, since I have access to a cluster for real runs), or is there something else I'm doing wrong? Thanks!
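If the hand-copied include and library paths above turn out to be the problem, one alternative worth trying is to let the MPI compiler wrapper supply those paths itself. The fragment below is a sketch, not FluTE's shipped makefile; the variable names mirror the lines quoted above, and the wrapper name depends on the MPI distribution (mpicxx with MPICH; Intel MPI also ships wrapper compilers such as mpicxx/mpiicpc).

```make
# Sketch: let the MPI wrapper compiler supply the MPI -I/-L/-l flags,
# instead of hard-coding installation paths in the makefile.
MPICXX     = mpicxx
MPICFLAGS  = -Wall -pthread -DDSFMT_MEXP=19937
MPILDFLAGS = -L. -lm
```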