mpiflute terminates with std::bad_alloc #2

Closed

DiffeoInvariant opened this issue Mar 20, 2020 · 8 comments

@DiffeoInvariant (Contributor)

Hi FluTE devs,

I'm trying to get mpiflute working on my local machine (before attempting a cluster run) with either MPICH or Intel MPI. With either one, after editing the makefile appropriately, I compile and run with config-usa: the R0 value is read in and the interpolated beta is computed correctly, but then the application terminates with a segmentation fault. Run with one MPI process, a std::bad_alloc exception propagates to stderr; with more than one process, the exception never reaches stderr. Does this config file just require a lot of RAM? I should have about 8 GB available when running, so I'd be surprised if nothing else were going on.

I edited my makefile to use the lines
MPICFLAGS = -Wall -I/opt/intel/compilers_and_libraries_2020.0.166/linux/mpi/intel64/include -pthread -DDSFMT_MEXP=19937
MPILDFLAGS = -L. -L/opt/intel/compilers_and_libraries_2020.0.166/linux/mpi/intel64/lib -lm -lmpi_ilp64 -lutil -lnsl -ldl -lrt

Is that correct? And if so, am I just running out of memory (which isn't a problem, since I have access to a cluster for real runs), or is there something else I'm doing wrong? Thanks!
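
(A way to double-check flags like these, assuming the MPI distribution's compiler wrappers are on the PATH: both MPICH and Intel MPI wrappers accept -show (e.g. mpicxx -show), which prints the full compile-and-link line the wrapper would use; those flags can then be copied into MPICFLAGS and MPILDFLAGS.)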

@dlchao (Owner) commented Mar 20, 2020

I don't think you can run the whole USA on a laptop. I needed about 30-40 cores for that (but that was 10 years ago). Maybe try MPI with a smaller population that fits in RAM, like Los Angeles.

@DiffeoInvariant (Contributor, Author)

Hmm, I tried running with config-minimal and config-laiv-vs-tiv (e.g. $ mpiexec -n 1 ./mpiflute config-minimal), and I still get a segfault with mpiflute, while serial flute works fine with both of those config files. Here's the full output from the mpiflute run on config-minimal (everything that reaches stdout, anyway):

FluTE version 1.17
Parameter set: example-minimal
one population and workflow data
1.6 read in for R0
interpolated beta is 0.281215

mpiflute:7761 terminated with signal 11 at PC=5649f2f2b324 SP=7ffdac80c0f8. Backtrace:
./mpiflute(+0x23324)[0x5649f2f2b324]
./mpiflute(+0xa39d)[0x5649f2f1239d]
./mpiflute(+0xe7c3)[0x5649f2f167c3]
./mpiflute(+0x1850d)[0x5649f2f2050d]
./mpiflute(+0x198c7)[0x5649f2f218c7]
./mpiflute(+0x5b6d)[0x5649f2f0db6d]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7f1b1defc1e3]
./mpiflute(+0x5cde)[0x5649f2f0dcde]
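
(The frames above are unsymbolized offsets; assuming mpiflute is rebuilt with -g, something like addr2line -f -e ./mpiflute 0x23324 should map an offset such as +0x23324 back to a function and source line.)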

@DiffeoInvariant (Contributor, Author) commented Mar 20, 2020

My first guess would have been that my MPI installation is broken or isn't linked correctly with mpiflute, but the fact that it correctly reads R0 and interpolates beta suggests the problem lies elsewhere.
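
One way to separate an MPI installation or linkage problem from a FluTE bug is a minimal program that only initializes MPI and has each rank report in; this is a standalone sketch, not FluTE code. Built with the same wrapper and flags as mpiflute and run under mpiexec, it should print one line per rank:

// minimal MPI sanity check: one "alive" line per rank
#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);               // start the MPI runtime
    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank); // this process's id
    MPI_Comm_size(MPI_COMM_WORLD, &size); // total number of ranks
    std::printf("rank %d of %d is alive\n", rank, size);
    MPI_Finalize();
    return 0;
}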

@DiffeoInvariant (Contributor, Author)

I'm chasing the bug down now, and I'm certain the segfault happens in epimodel.cpp between the line 'int agegroups[TAG] = {0,0,0,0,0};' (around line 1000) and the line that closes the big if-else tree following it. I'm pretty sure the problem is in the random number generator; is there a reason FluTE uses its own Mersenne twister implementation rather than std::mersenne_twister_engine or one of the existing parallel random number generators (e.g. http://www.sprng.org/Version5.0/simple-reference.html)?

@dlchao (Owner) commented Mar 20, 2020

I chose to include code for the random number generator so that there would be no dependencies and to ensure that results will always be the same across platforms. I must admit that I have not used the parallel version of FluTE in about 9 years, so I don't know what the bug could be.

@pbentkowski

Hi @DiffeoInvariant,
These days the C++ standard library's RNG facilities work consistently across platforms (nine years ago things were a bit different). Some time ago I extracted an RNG procedure from an ABM model I was developing for use in multi-threaded programs, where each thread has its own RNG instance. Maybe you can just replace the problematic code with something more up to today's standards?
https://github.com/pbentkowski/Random-Numbers-and-Multithreading-in-C-11
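
For illustration, a minimal sketch of that pattern using only the C++11 standard library (this is not code from FluTE or the repository above): each thread owns its own engine with a distinct seed, so no locking is needed.

#include <cstdio>
#include <random>
#include <thread>
#include <vector>

// each thread gets its own engine, so there is no shared state to lock
void worker(unsigned seed) {
    std::mt19937 engine(seed);                    // one engine per thread
    std::uniform_real_distribution<double> unif(0.0, 1.0);
    std::printf("seed %u -> %f\n", seed, unif(engine));
}

int main() {
    std::vector<std::thread> threads;
    for (unsigned t = 0; t < 4; ++t)
        threads.emplace_back(worker, 12345u + t); // distinct seed per thread
    for (auto &th : threads)
        th.join();
    return 0;
}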

@dlchao (Owner) commented Mar 21, 2020

@DiffeoInvariant I figured it out. mpiflute tries to keep all the census tracts of a county on a single processor, so you can only test mpiflute with a population that includes several counties. If a processor has no county to simulate, it does something bad. My code to detect that problem did not work; I've fixed it, so mpiflute should now exit if it has trouble assigning the population across processors. I've added a new population, "kingsnohomishpierce", that covers three counties, so you can test mpiflute with 2 cores.
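
For readers hitting the same symptom, the guard described above amounts to something like the following sketch; the names are hypothetical and this is not the actual FluTE code. Each rank checks that the assignment step gave it at least one county and aborts the whole run cleanly instead of segfaulting later:

#include <mpi.h>
#include <cstdio>

// hypothetical guard: numCountiesOnThisRank stands in for whatever
// mpiflute computes when it distributes counties across processors
void checkAssignment(int numCountiesOnThisRank) {
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (numCountiesOnThisRank < 1) {
        std::fprintf(stderr, "rank %d was assigned no counties; "
                     "use fewer ranks or a larger population\n", rank);
        MPI_Abort(MPI_COMM_WORLD, 1); // exit all ranks instead of crashing
    }
}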

@DiffeoInvariant (Contributor, Author)

@dlchao Thank you! I was just starting to write a simple wrapper around a PRNG from the C++ standard library (I could still do that if you want, to test for any minor portability or performance advantages it might offer), but after pulling your changes and recompiling, mpiflute appears to work correctly on my laptop. Looks like it's time to get everything working on Summit...
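
Such a wrapper might look like the sketch below; the class and method names are illustrative, not FluTE's actual API (FluTE bundles dSFMT). One engine per MPI rank, seeded from a base seed offset by the rank, keeps the streams distinct across processes:

#include <random>

// hypothetical wrapper around a standard-library engine; not FluTE's API
class RngWrapper {
public:
    RngWrapper(unsigned baseSeed, int mpiRank)
        : engine_(baseSeed + static_cast<unsigned>(mpiRank)) {} // per-rank stream

    double uniform() { // uniform double in [0, 1)
        return unif_(engine_);
    }

private:
    std::mt19937_64 engine_;
    std::uniform_real_distribution<double> unif_{0.0, 1.0};
};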

@dlchao closed this as completed Mar 21, 2020