
Undefined behavior from static global variables #4856

Closed
jngrad opened this issue Jan 30, 2024 · 12 comments · Fixed by #4858


jngrad commented Jan 30, 2024

TL;DR: ESPResSo manages the lifetime of MPI global variables in a manner that is subject to the static destruction order fiasco.

Problem statement

ESPResSo 4.2 and 4.3-dev manage the following static globals:

static auto const &mpi_datatype_cache =
boost::mpi::detail::mpi_datatype_cache();
static std::shared_ptr<boost::mpi::environment> mpi_env;

The MPI datatype cache is a Meyers singleton, i.e. it is managed by reference rather than by pointer (which has pros and cons). This makes its lifetime difficult to control.
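
For illustration, here is a minimal sketch of the Meyers singleton pattern (schematic, not the actual Boost.MPI source):

struct Cache {
  // owns MPI datatype handles; frees them in its destructor
};

Cache &get_cache() {
  // constructed on first call, destroyed automatically during static
  // destruction after main() returns; callers never own the object
  static Cache cache;
  return cache;
}

Binding the returned reference to a namespace-scope static, as done above, does not transfer ownership, but it forces the singleton to be constructed during the dynamic initialization of that translation unit, which changes its position in the reverse-order destruction sequence.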

The destructor of boost::mpi::environment depends on the mpi_datatype_cache in such a way that one isn't allowed to keep a handle to the boost::mpi::environment in a static global (for details, see boostorg/mpi#92). In addition, storing the mpi_datatype_cache by reference in a static global alters its lifetime, which can trigger undefined behavior during normal program termination.

Reproducing the bug

Reproducing undefined behavior caused by the static initialization order fiasco is notoriously difficult. Fortunately for us, Boost 1.84.0 alters the order of static initialization such that the issue can be reproduced reliably in ESPResSo 4.2 and 4.3-dev:

mkdir /tmp/boost
cd /tmp/boost
curl -sL https://boostorg.jfrog.io/artifactory/main/release/1.84.0/source/boost_1_84_0.tar.bz2 \
    | tar xj --strip-components=1
mkdir opt
echo 'using mpi ;' > tools/build/src/user-config.jam
./bootstrap.sh --with-libraries=filesystem,system,mpi,serialization,test
./b2 -j $(nproc) install --prefix=$(realpath opt) variant=debug inlining=off debug-symbols=on
export BOOST_ROOT=$(realpath opt)
export LD_LIBRARY_PATH="${LD_LIBRARY_PATH:+$LD_LIBRARY_PATH:}${BOOST_ROOT}"
git clone --depth=20 --recursive -b python https://github.com/espressomd/espresso.git
cd espresso
mkdir build
cd build
cmake .. -DCMAKE_BUILD_TYPE=Debug
make -j $(nproc)
make -j $(nproc) check_unit_tests

Output:

[...]
 96/120 Test  #96: specfunc_test ............................***Failed    0.07 sec
Running 3 test cases...

*** No errors detected
*** The MPI_Type_contiguous() function was called before MPI_INIT was invoked.
*** This is disallowed by the MPI standard.
*** Your MPI job will now abort.
[lama:2299701] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
91% tests passed, 11 tests failed out of 120
[...]
Total Test time (real) =  41.95 sec

The following tests FAILED:
	 63 - SingleReaction_test (Failed)
	 66 - reaction_methods_utils_test (Failed)
	 75 - p3m_test (Failed)
	 79 - rotation_test (Failed)
	 89 - VerletCriterion_test (Failed)
	 94 - bonded_interactions_map_test (Failed)
	 96 - specfunc_test (Failed)
	 99 - ObjectHandle_test (Failed)
	100 - AutoParameters_test (Failed)
	114 - Constraints_test (Failed)
	115 - Actors_test (Failed)

With the recipe above, you may get false positives due to incorrect linkage of Boost::serialization, and debug symbols may be missing in GDB. For an improved GDB experience, please use this Dockerfile instead:

FROM fedora:36 as base
RUN yum -y install \
  gcc gcc-c++ make \
  cmake \
  gdb \
  git \
  zlib-devel \
  bzip2 \
  vim \
  which \
  openmpi-devel \
  python3 \
  python3-devel \
  python3-Cython \
  python3-numpy \
  python3-scipy \
  python3-setuptools \
  && yum clean all

RUN cd /tmp \
 && mkdir boost \
 && for f in /etc/profile.d/*module*.sh; do . "${f}"; done; module load mpi \
 && cd boost \
 && curl -sL https://boostorg.jfrog.io/artifactory/main/release/1.84.0/source/boost_1_84_0.tar.bz2 | tar xj \
 && cd boost_1_84_0 \
 && echo 'using mpi ;' > tools/build/src/user-config.jam \
 && ./bootstrap.sh --with-libraries=filesystem,system,mpi,serialization,test \
 && ./b2 -j $(nproc) install --prefix=/opt/boost variant=debug inlining=off debug-symbols=on \
 && cd \
 && rm -r /tmp/boost

ENV BOOST_ROOT=/opt/boost LD_LIBRARY_PATH="/opt/boost/lib:${LD_LIBRARY_PATH}"
RUN useradd -m espresso
USER 1000
WORKDIR /home/espresso
Build and run the image:

docker build --tag fed:mwe -f Dockerfile .
docker run --user espresso -it fed:mwe bash

Then, inside the container:

git clone --depth=20 --recursive -b 4.2 https://github.com/espressomd/espresso.git
cd espresso
mkdir build
cd build
module load mpi
cmake .. -DCMAKE_BUILD_TYPE=Debug
make -j $(nproc)
make -j $(nproc) check_unit_tests
gdb src/core/unit_tests/specfunc_test
(gdb) b backend_fatal
(gdb) r

GDB trace:

#0  backend_fatal (type=0x7f4c29b07340 "communicator", comm=0x0, name=0x0, error_code=0x0, 
    arglist=0x7fffbfc1e408) at errhandler/errhandler_predefined.c:380
#1  0x00007f4c299f760a in ompi_mpi_errors_are_fatal_comm_handler (comm=0x0, error_code=0x0)
    at errhandler/errhandler_predefined.c:70
#2  0x00007f4c29a823a5 in PMPI_Type_contiguous (count=1, oldtype=0x7f4c29b6c100 <ompi_mpi_byte>, 
    newtype=0x7fffbfc1e548) at ptype_contiguous.c:55
#3  0x00007f4c29c66134 in boost::mpi::detail::build_mpi_datatype_for_bool () at ./boost/mpi/datatype.hpp:325
#4  0x00007f4c29c6618b in boost::mpi::get_mpi_datatype<bool> () at ./boost/mpi/datatype.hpp:336
#5  0x00007f4c29c65e50 in boost::mpi::detail::mpi_datatype_map::clear (
    this=0x7f4c29c834a8 <boost::mpi::detail::mpi_datatype_cache()::cache>)
    at libs/mpi/src/mpi_datatype_cache.cpp:39
#6  0x00007f4c29c65e9d in boost::mpi::detail::mpi_datatype_map::~mpi_datatype_map (
    this=0x7f4c29c834a8 <boost::mpi::detail::mpi_datatype_cache()::cache>, __in_chrg=<optimized out>)
    at libs/mpi/src/mpi_datatype_cache.cpp:47
#7  0x00007f4c294bc507 in __cxa_finalize () from /usr/lib64/libc.so.6
#8  0x00007f4c29c54747 in __do_global_dtors_aux () from /opt/boost/lib/libboost_mpi.so.1.84.0
#9  0x00007fffbfc1e7b0 in ?? ()
#10 0x00007f4c2a05dd3e in _dl_fini () at dl-fini.c:142

Outlook

Because Boost 1.84.0 alters the order of static initialization, ESPResSo is now unusable in most environments. The bug was successfully reproduced on Ubuntu, Fedora and openSUSE.

The openSUSE package maintainers have already started the formal process of removing ESPResSo from their repositories (request 1142685). Fedora will probably do the same as soon as the Boost version gets bumped, presumably after the Rawhide fork. It is not clear yet how EasyBuild and EESSI will react.


jngrad commented Jan 30, 2024

One can influence the destruction order in every failing test by adding #define BOOST_TEST_NO_MAIN before #include <boost/test/unit_test.hpp>, adding the custom main function shown below after it, and linking the test against Espresso::core and Boost::mpi via CMake (e.g. with a target_link_libraries(<test> PRIVATE Espresso::core Boost::mpi) call):

#define BOOST_TEST_NO_MAIN
#include "communication.hpp"
int main(int argc, char **argv) {
  auto mpi_env = std::make_shared<boost::mpi::environment>(argc, argv);
  Communication::init(mpi_env);
  return boost::unit_test::unit_test_main(init_unit_test, argc, argv);
}

This is not a sustainable solution, and it will break as soon as the order of instantiation of static globals changes, for example when ESPResSo features that introduce global variables are disabled, or when a different Boost release is used.


jngrad commented Jan 31, 2024

After fixing

export LD_LIBRARY_PATH="${LD_LIBRARY_PATH:+$LD_LIBRARY_PATH:}${BOOST_ROOT}"

to

export LD_LIBRARY_PATH="${LD_LIBRARY_PATH:+$LD_LIBRARY_PATH:}${BOOST_ROOT}/lib"

the bug can be properly investigated with GDB on Ubuntu 22.04.

It's not clear to me why ESPResSo needs to manage the lifetime of the boost::mpi::detail::mpi_datatype_map singleton. When we don't extend its lifetime, several tests experience segmentation faults or timeouts during normal program termination. Here is the trace when the static global that keeps a reference to the singleton is removed:

Thread 1 "ReactionAlgorit" received signal SIGSEGV, Segmentation fault.
0x00001555544c6504 in std::_Rb_tree_increment(std::_Rb_tree_node_base*) () from /lib/x86_64-linux-gnu/libstdc++.so.6
(gdb) bt
#0  0x00001555544c6504 in std::_Rb_tree_increment(std::_Rb_tree_node_base*) () from /lib/x86_64-linux-gnu/libstdc++.so.6
#1  0x00001555549e1b45 in std::_Rb_tree_iterator<std::pair<std::type_info const* const, ompi_datatype_t*> >::operator++ (
    this=0x7fffffffd320) at /usr/include/c++/10/bits/stl_tree.h:287
#2  0x00001555549e155f in boost::mpi::detail::mpi_datatype_map::clear (
    this=0x1555549ff568 <boost::mpi::detail::mpi_datatype_cache()::cache>) at libs/mpi/src/mpi_datatype_cache.cpp:36
#3  0x00001555549dd6f8 in boost::mpi::environment::~environment (this=0x5555555ca930, __in_chrg=<optimized out>)
    at libs/mpi/src/environment.cpp:184
#4  0x0000155554fa621c in __gnu_cxx::new_allocator<boost::mpi::environment>::destroy<boost::mpi::environment> (
    this=0x5555555ca930, __p=0x5555555ca930) at /usr/include/c++/10/ext/new_allocator.h:162
#5  0x0000155554fa61e7 in std::allocator_traits<std::allocator<boost::mpi::environment> >::destroy<boost::mpi::environment> (
    __a=..., __p=0x5555555ca930) at /usr/include/c++/10/bits/alloc_traits.h:531
#6  0x0000155554fa60a1 in std::_Sp_counted_ptr_inplace<boost::mpi::environment, std::allocator<boost::mpi::environment>,
    (__gnu_cxx::_Lock_policy)2>::_M_dispose (this=0x5555555ca920)
    at /usr/include/c++/10/bits/shared_ptr_base.h:560
#7  0x0000555555588bd7 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x5555555ca920)
    at /usr/include/c++/10/bits/shared_ptr_base.h:158
#8  0x0000555555584321 in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (
    this=0x15555539cd38 <Communication::mpi_env+8>, __in_chrg=<optimized out>) at /usr/include/c++/10/bits/shared_ptr_base.h:736
#9  0x0000155554f9eb16 in std::__shared_ptr<boost::mpi::environment, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr (
    this=0x15555539cd30 <Communication::mpi_env>, __in_chrg=<optimized out>) at /usr/include/c++/10/bits/shared_ptr_base.h:1188
#10 0x0000155554f9ec86 in std::shared_ptr<boost::mpi::environment>::~shared_ptr (this=0x15555539cd30 <Communication::mpi_env>,
    __in_chrg=<optimized out>) at /usr/include/c++/10/bits/shared_ptr.h:121
#11 0x0000155554045a56 in __cxa_finalize (d=0x15555539c560) at ./stdlib/cxa_finalize.c:83
#12 0x0000155554f71d67 in __do_global_dtors_aux () from /tmp/boost/espresso/build/src/core/espresso_core.so
#13 0x00007fffffffd8c0 in ?? ()
#14 0x000015555552024e in _dl_fini () at ./elf/dl-fini.c:142
Backtrace stopped: frame did not save the PC
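
The crash inside _Rb_tree_increment during mpi_datatype_map::clear() suggests the environment destructor is touching a function-local static map that has already been destroyed. A minimal model of that pattern (a schematic, assuming this reading of the trace is correct; not the Boost.MPI source):

#include <map>

std::map<int, int> &get_cache() {
  static std::map<int, int> cache; // destroyed during static destruction
  return cache;
}

struct Environment {
  ~Environment() {
    // if this destructor runs after 'cache' was destroyed, get_cache()
    // returns a reference to a dead object, and iterating it is undefined
    // behavior (typically a crash in the red-black tree iterator)
    for (auto const &entry : get_cache()) {
      (void)entry;
    }
  }
};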

Running the Python testsuite shows three outcomes: no error, a segmentation fault, or a timeout. The timeouts most likely happen during atexit, because the tests themselves report success. This behavior is quite similar to what EESSI maintainers observed on ARM architectures, although they didn't report segmentation faults (EESSI/software-layer#363) and, to my knowledge, didn't remove the static reference. Here is an excerpt of the Python testsuite log:

187/191 Test #100: dipolar_p3m .......................................................***Timeout 300.10 sec
.
----------------------------------------------------------------------
Ran 1 test in 10.268s

OK

188/191 Test  #55: observables .......................................................***Timeout 300.24 sec
.......s.........
----------------------------------------------------------------------
Ran 17 tests in 5.649s

OK (skipped=1)

189/191 Test  #10: test_checkpoint__therm_lb__p3m_gpu__lj__lb_walberla_cpu_binary ....***Timeout 300.24 sec
.......s.sss.s.sssss.ss....s..ss..ssssss....
----------------------------------------------------------------------
Ran 44 tests in 0.394s

OK (skipped=21)


jngrad commented Jan 31, 2024

I took a different angle by creating a struct MpiContainer to encapsulate all three globals and manage their lifetime and destruction order through a static smart pointer:

namespace Communication {
//static auto const &mpi_datatype_cache = boost::mpi::detail::mpi_datatype_cache();
static std::shared_ptr<boost::mpi::environment> mpi_env;
static std::shared_ptr<MpiCallbacks> m_callbacks;
} // namespace Communication

struct MpiContainer {
    // keep a handle to the MPI datatype cache singleton
    // (forces its construction when the container is constructed)
    boost::mpi::detail::mpi_datatype_map const &mpi_datatype_cache =
        boost::mpi::detail::mpi_datatype_cache();
    std::shared_ptr<boost::mpi::environment> mpi_env;
    std::shared_ptr<Communication::MpiCallbacks> m_callbacks;
    MpiContainer() {
        // share ownership of the handles stored in the Communication globals
        mpi_env = Communication::mpi_env;
        m_callbacks = Communication::m_callbacks;
    }
    ~MpiContainer() {
        // release the callbacks before the environment, in both owners
        m_callbacks.reset();
        Communication::m_callbacks.reset();
        mpi_env.reset();
        Communication::mpi_env.reset();
    }
};

static std::unique_ptr<MpiContainer> mpi_container;

void atexit_handler() {
    if (Communication::m_callbacks) {
        mpi_container->m_callbacks.reset();
        Communication::m_callbacks.reset();
    }
}

The atexit event handler frees the MpiCallbacks handle, which in turn frees the boost::mpi::environment handle. However, this introduces more issues, because the MpiCallbacks object must remain alive until all dependent Context objects from the ESPResSo ScriptInterface have gone through their destructors. In C++, static variables are destroyed during program teardown in reverse order of static initialization and atexit registration. The atexit handler was registered right after MpiCallbacks was initialized in Communication::init(), so that the handler would run before ~MpiCallbacks(). However, the script interface objects live in a different translation unit, and thus we cannot control their order of destruction relative to the handler. That order actually changes between two simulations, making this new bug difficult to reproduce.
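
For reference, a minimal standalone sketch of the teardown ordering rule (destructors of objects with static storage duration and atexit() handlers run in reverse order of registration):

#include <cstdio>
#include <cstdlib>

struct Tracer {
  const char *name;
  ~Tracer() { std::printf("~Tracer(%s)\n", name); }
};

static Tracer a{"a"}; // destructor registered first, runs last

int main() {
  std::atexit([] { std::printf("atexit handler\n"); }); // registered second
  static Tracer b{"b"}; // destructor registered third, runs first
  std::printf("main() returns\n");
}

Running this prints: main() returns, ~Tracer(b), atexit handler, ~Tracer(a).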


jngrad commented Feb 1, 2024

I think the way to solve this issue is to:

  1. stop tampering with the MPI datatype cache lifetime, i.e. never manually call the singleton
  2. stop keeping the boost::mpi::environment alive until atexit

Resolving 1. is easy: never call boost::mpi::detail::mpi_datatype_cache() anywhere; it's an implementation detail. Resolving 2. is a lot harder: the MpiCallbacks handle needs the MPI environment handle during Python atexit. We can rewrite MpiCallbacks to keep the MPI environment handle alive. Here is how we need to adapt MPI initialization (see the ownership sketch after this list):

  1. Python interface: the MPI environment is kept alive by _init.pyx (inside a shared pointer) and the MpiCallbacks handle is kept alive by both script_interface.pyx (inside the global context shared pointer) and communication.cpp (inside the static global); a cleanup function is registered to delete all 3 shared pointers during Python atexit
  2. C++ unit tests: the main function contains a MpiContainer handle that keeps the MpiCallbacks handle alive, and it expires as soon as the testsuite ends
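
A sketch of the intended ownership chain (hypothetical member names; the real MpiCallbacks class carries more machinery):

#include <boost/mpi/communicator.hpp>
#include <boost/mpi/environment.hpp>

#include <memory>
#include <utility>

class MpiCallbacks {
public:
  MpiCallbacks(boost::mpi::communicator comm,
               std::shared_ptr<boost::mpi::environment> env)
      : m_comm(std::move(comm)), m_env(std::move(env)) {}

private:
  boost::mpi::communicator m_comm;
  // shared ownership: the MPI environment cannot expire (and call
  // MPI_Finalize) while any MpiCallbacks handle is still alive
  std::shared_ptr<boost::mpi::environment> m_env;
};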

A proof-of-concept is available in jngrad/espresso@boost_mpi_bugfix. I get the desired behavior on the python branch compiled with the default config using Boost 1.74, 1.82 and 1.84, with the notable exception of ek_eof.py. The fix can be backported to ESPResSo 4.2, but all LB tests fail due to the LB actor calling the lb_lbfluid_set_lattice_switch() MPI callback after the MPI environment is destroyed.

Here I'm assuming all Python atexit functions run before the first C++ atexit function; please correct me if I'm wrong!

To help with debugging, source the following GDB script in your session; it logs calls to the relevant symbols without interrupting the flow of the GDB session with user prompts:

set breakpoint pending on
set pagination off

define handler
break $arg0
commands
cont
end
end

handler MPI_Type_contiguous
handler boost::mpi::environment::environment
handler boost::mpi::environment::~environment
handler boost::mpi::detail::mpi_datatype_cache
handler boost::mpi::detail::mpi_datatype_map::~mpi_datatype_map
handler boost::mpi::detail::mpi_datatype_map::clear

run


jngrad commented Feb 2, 2024

Bugfix backported to 4.2.1 and submitted to openSUSE Tumbleweed (request 1143707) and Factory (request 1143710).

The python branch bugfix is a bit more delicate due to FFTW and HDF5 dependencies, and might take a few more days.


junghans commented Feb 2, 2024

On Fedora there was no issue?


jngrad commented Feb 2, 2024

I'm still running the testsuite locally in a Docker image and will make a bugfix ASAP. Fedora is currently still shipping Boost 1.83.0 in f40 and Rawhide (https://src.fedoraproject.org/rpms/boost), so I gave priority to openSUSE.


jngrad commented Feb 2, 2024

The MpiCallbacks_test unit test fails in a Koji scratch build, but not on my workstation in a Docker image... I'll look into it next week.

bmwiedemann pushed a commit to bmwiedemann/openSUSE that referenced this issue Feb 4, 2024
https://build.opensuse.org/request/show/1143680
by user cjunghans + anag+factory
- Drop python3-espressomd testing dependency, revert once
  gh#espressomd/espresso#4856 is fixed.
- Dropped 1093.patch, merged upstream
  - fix links in README and doc ([gh#votca/votca#1091])
  - fix python shebang to python3 ([gh#votca/votca#1093])
  - Clean-up CI ([gh#votca/votca#1092], [gh#votca/votca#1095])
  - remove reference to old webpage ([gh#votca/votca#1094])
  - fix doc generation without pyxtp ([gh#votca/votca#1097])
  - Do not run gmx tests without libgromacs ([gh#votca/votca#1099])
  - KS-QMMM for single-particle states ([gh#votca/votca#1100])
  - Spin-orbitals from ORCA in QMMM ([gh#votca/votca#1101])
  - better unit test for DFT embedding ([gh#votca/votca#1102])
- Update to 2024

jngrad commented Feb 21, 2024

The Homebrew formula for boost-mpi (link) was recently bumped to Boost 1.84.0. Homebrew does not provide pinned formulae for older Boost versions. All ESPResSo releases since 4.0.0, as well as the development version, are now unusable on macOS computers with up-to-date dependencies.

kodiakhq bot added a commit that referenced this issue Feb 26, 2024
Fixes #4859, fixes #4855

Description of changes:
- bugfix:
   - LB boundaries are now properly communicated in the ghost layer
   - see details in #4859
- performance:
   - the LB flag field is no longer communicated at every time step
   - the LB UBB field is no longer recalculated at every time step
   - LB boundary setters (node, slice, shape) now always trigger a full ghost communication
- maintainability:
   - the waLBerla header files are no longer visible in the ESPResSo core and script interface
   - the Boost dependency is now checked at the CMake level to prevent building broken ESPResSo shared libraries (see details in #4856)

jngrad commented Feb 26, 2024

Progress report:

  • the bugfix seems to work on the macOS GitHub Action with Boost 1.84
  • the undefined behavior revealed by the bugfix still persists on Fedora

Fedora Rawhide still hasn't updated to Boost 1.84 (bugzilla 2178871). Fedora 40 will enter Beta on February 27 (timeline).

On f40 with all architectures, I get random failures of the MpiCallbacks_test and ParallelExceptionHandler_test unit tests on Koji, with the error message Communicator (handle=44000000) being freed has 1 unmatched message(s) (see below). This happens even though the boost::mpi::environment shared pointer lifetime is tied to the lifetime of the main function (plus 2 weak pointers with static linkage), and the MpiCallbacks shared pointer lifetime is tied to the lifetime of the individual test functions. The error appeared last week and is reproducible locally in a Docker image with MPICH 4.1.2 (and with docker run --shm-size 8G, since otherwise a signal 7 fatal error is triggered by OpenMPI).

The MPI deadlocks in Python tests also disappeared last week. However, on the Power9 architecture, when building with C++ assertions, the Boost multi_array underlying the histogram triggers an assertion in the observable_cylindricalLB.py test (see below). It happens every time in the CylindricalLBObservableCPU.test_cylindrical_lb_profile_interface test when attempting to set the value of the velocity vector at array[1, 0, 0, :].

MpiCallbacks MPICH error message:
Entering test module "MpiCallbacks test"
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(43): Entering test case "invoke_test"
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(57): info: check f(i, j) == (invoke<decltype(f), int, unsigned>(f, ia)) has passed
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(43): Leaving test case "invoke_test"; testing time: 369us
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(65): Entering test case "callback_model_t"
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(82): info: check 537 == i has passed
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(83): info: check 3.4 == d has passed
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(93): info: check called has passed
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(100): info: check 19 == state has passed
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(101): info: check 537 == i has passed
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(102): info: check 3.4 == d has passed
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(110): info: check called has passed
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(65): Leaving test case "callback_model_t"; testing time: 420us
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(114): Entering test case "adding_function_ptr_cb"
Test case adding_function_ptr_cb did not check any assertions
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(114): Leaving test case "adding_function_ptr_cb"; testing time: 611us
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(137): Entering test case "RegisterCallback"
Test case RegisterCallback did not check any assertions
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(137): Leaving test case "RegisterCallback"; testing time: 393us
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(160): Entering test case "CallbackHandle"
Test case CallbackHandle did not check any assertions
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(160): Leaving test case "CallbackHandle"; testing time: 497us
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(181): Entering test case "reduce_callback"
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(194): info: check ret == (n * (n - 1)) / 2 has passed
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(181): Leaving test case "reduce_callback"; testing time: 455us
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(200): Entering test case "ignore_callback"
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(217): info: check called has passed
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(200): Leaving test case "ignore_callback"; testing time: 374us
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(220): Entering test case "one_rank_callback"
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(238): info: check cbs.call(Communication::Result::one_rank, fp) == world.size() - 1 has passed
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(220): Leaving test case "one_rank_callback"; testing time: 378us
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(245): Entering test case "main_rank_callback"
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(263): info: check cbs.call(Communication::Result::main_rank, fp) == world.size() has passed
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(245): Leaving test case "main_rank_callback"; testing time: 354us
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(270): Entering test case "call_all"
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(287): info: check called has passed
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(270): Leaving test case "call_all"; testing time: 375us
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(290): Entering test case "check_exceptions"
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(304): info: check 'exception "std::out_of_range" raised as expected' has passed
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(290): Leaving test case "check_exceptions"; testing time: 400us
Leaving test module "MpiCallbacks test"; testing time: 5020us

*** No errors detected
Running 11 test cases...
Entering test module "MpiCallbacks test"
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(43): Entering test case "invoke_test"
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(57): info: check f(i, j) == (invoke<decltype(f), int, unsigned>(f, ia)) has passed
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(43): Leaving test case "invoke_test"; testing time: 369us
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(65): Entering test case "callback_model_t"
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(82): info: check 537 == i has passed
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(83): info: check 3.4 == d has passed
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(93): info: check called has passed
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(100): info: check 19 == state has passed
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(101): info: check 537 == i has passed
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(102): info: check 3.4 == d has passed
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(110): info: check called has passed
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(65): Leaving test case "callback_model_t"; testing time: 420us
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(114): Entering test case "adding_function_ptr_cb"
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(119): info: check 537 == i has passed
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(120): info: check "adding_function_ptr_cb" == s has passed
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(133): info: check called has passed
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(114): Leaving test case "adding_function_ptr_cb"; testing time: 614us
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(137): Entering test case "RegisterCallback"
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(139): info: check 537 == i has passed
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(140): info: check "2nd" == s has passed
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(156): info: check called has passed
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(137): Leaving test case "RegisterCallback"; testing time: 398us
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(160): Entering test case "CallbackHandle"
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(168): info: check "CallbackHandle" == s has passed
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(177): info: check called has passed
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(160): Leaving test case "CallbackHandle"; testing time: 504us
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(181): Entering test case "reduce_callback"
Test case reduce_callback did not check any assertions
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(181): Leaving test case "reduce_callback"; testing time: 461us
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(200): Entering test case "ignore_callback"
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(217): info: check called has passed
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(200): Leaving test case "ignore_callback"; testing time: 369us
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(220): Entering test case "one_rank_callback"
Test case one_rank_callback did not check any assertions
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(220): Leaving test case "one_rank_callback"; testing time: 377us
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(245): Entering test case "main_rank_callback"
Test case main_rank_callback did not check any assertions
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(245): Leaving test case "main_rank_callback"; testing time: 346us
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(270): Entering test case "call_all"
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(287): info: check called has passed
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(270): Leaving test case "call_all"; testing time: 369us
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(290): Entering test case "check_exceptions"
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(307): info: check 'exception "std::logic_error" raised as expected' has passed
/home/user/espresso/espresso/src/core/unit_tests/MpiCallbacks_test.cpp(290): Leaving test case "check_exceptions"; testing time: 400us
Leaving test module "MpiCallbacks test"; testing time: 5020us

*** No errors detected
terminate called after throwing an instance of 'boost::wrapexcept<boost::mpi::exception>'
  what():  MPI_Finalize: Other MPI error, error stack:
internal_Finalize(50)...........: MPI_Finalize failed
MPII_Finalize(394)..............: 
MPIR_finalize_builtin_comms(154): 
MPIR_Comm_release_always(1250)..: 
MPIR_Comm_delete_internal(1224).: Communicator (handle=44000000) being freed has 1 unmatched message(s)

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 19812 RUNNING AT 3ae1d6cd6b37
=   EXIT CODE: 134
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Aborted (signal 6)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions
Cylindrical LB velocity profile observable error message:
111/201 Test #133: observable_cylindricalLB ......................................***Failed    1.58 sec
test_cylindrical_lb_flux_density_obs (__main__.CylindricalLBObservableCPU.test_cylindrical_lb_flux_density_obs)
Check that the result from the observable (in its own frame) ...
M = 3 N = 3 ; array[1, 1, 3, 0] = 0.024000; array[1, 1, 3, 1] = 0.048000; array[1, 1, 3, 2] = 0.036000
M = 3 N = 3 ; array[1, 1, 3, 0] = 0.024000; array[1, 1, 3, 1] = 0.048000; array[1, 1, 3, 2] = 0.036000
M = 3 N = 3 ; array[1, 1, 3, 0] = 0.024000; array[1, 1, 3, 1] = 0.048000; array[1, 1, 3, 2] = 0.036000
M = 3 N = 3 ; array[2, 1, 3, 0] = 0.024000; array[2, 1, 3, 1] = 0.048000; array[2, 1, 3, 2] = 0.036000
M = 3 N = 3 ; array[1, 1, 3, 0] = 0.024000; array[1, 1, 3, 1] = 0.048000; array[1, 1, 3, 2] = 0.036000
M = 3 N = 3 ; array[1, 1, 3, 0] = 0.024000; array[1, 1, 3, 1] = 0.048000; array[1, 1, 3, 2] = 0.036000
M = 3 N = 3 ; array[1, 1, 3, 0] = 0.024000; array[1, 1, 3, 1] = 0.048000; array[1, 1, 3, 2] = 0.036000
M = 3 N = 3 ; array[2, 1, 3, 0] = 0.024000; array[2, 1, 3, 1] = 0.048000; array[2, 1, 3, 2] = 0.036000
M = 3 N = 3 ; array[1, 1, 3, 0] = 0.024000; array[1, 1, 3, 1] = 0.048000; array[1, 1, 3, 2] = 0.036000
M = 3 N = 3 ; array[1, 1, 3, 0] = 0.024000; array[1, 1, 3, 1] = 0.048000; array[1, 1, 3, 2] = 0.036000
M = 3 N = 3 ; array[1, 1, 3, 0] = 0.024000; array[1, 1, 3, 1] = 0.048000; array[1, 1, 3, 2] = 0.036000
M = 3 N = 3 ; array[2, 1, 3, 0] = 0.024000; array[2, 1, 3, 1] = 0.048000; array[2, 1, 3, 2] = 0.036000
M = 3 N = 3 ; array[1, 1, 3, 0] = 0.024000; array[1, 1, 3, 1] = 0.048000; array[1, 1, 3, 2] = 0.036000
M = 3 N = 3 ; array[1, 1, 3, 0] = 0.024000; array[1, 1, 3, 1] = 0.048000; array[1, 1, 3, 2] = 0.036000
M = 3 N = 3 ; array[1, 1, 3, 0] = 0.024000; array[1, 1, 3, 1] = 0.048000; array[1, 1, 3, 2] = 0.036000
M = 3 N = 3 ; array[2, 1, 3, 0] = 0.024000; array[2, 1, 3, 1] = 0.048000; array[2, 1, 3, 2] = 0.036000
M = 3 N = 3 ; array[1, 1, 3, 0] = 0.024000; array[1, 1, 3, 1] = 0.048000; array[1, 1, 3, 2] = 0.036000
M = 3 N = 3 ; array[1, 1, 3, 0] = 0.024000; array[1, 1, 3, 1] = 0.048000; array[1, 1, 3, 2] = 0.036000
M = 3 N = 3 ; array[1, 1, 3, 0] = 0.024000; array[1, 1, 3, 1] = 0.048000; array[1, 1, 3, 2] = 0.036000
M = 3 N = 3 ; array[2, 1, 3, 0] = 0.024000; array[2, 1, 3, 1] = 0.048000; array[2, 1, 3, 2] = 0.036000
ok
test_cylindrical_lb_profile_interface (__main__.CylindricalLBObservableCPU.test_cylindrical_lb_profile_interface)
Test setters and getters of the script interface ...
M = 3 N = 3 ; array[0, 3, 0, 0] = -0.000000; array[0, 3, 0, 1] = 0.000000; array[0, 3, 0, 2] = 0.000000
M = 3 N = 3 ; array[0, 4, 0, 0] = -0.000000; array[0, 4, 0, 1] = 0.000000; array[0, 4, 0, 2] = 0.000000
M = 3 N = 3 ; array[0, 5, 0, 0] = 0.000000; array[0, 5, 0, 1] = 0.000000; array[0, 5, 0, 2] = 0.000000
M = 3 N = 3 ; array[0, 0, 0, 0] = 0.000000; array[0, 0, 0, 1] = 0.000000; array[0, 0, 0, 2] = 0.000000
M = 3 N = 3 ; array[0, 1, 0, 0] = 0.000000; array[0, 1, 0, 1] = 0.000000; array[0, 1, 0, 2] = 0.000000
M = 3 N = 3 ; array[0, 2, 0, 0] = 0.000000; array[0, 2, 0, 1] = -0.000000; array[0, 2, 0, 2] = 0.000000
M = 3 N = 3 ; array[0, 3, 1, 0] = -0.000000; array[0, 3, 1, 1] = 0.000000; array[0, 3, 1, 2] = 0.000000
M = 3 N = 3 ; array[0, 4, 1, 0] = -0.000000; array[0, 4, 1, 1] = 0.000000; array[0, 4, 1, 2] = 0.000000
M = 3 N = 3 ; array[0, 5, 1, 0] = 0.000000; array[0, 5, 1, 1] = 0.000000; array[0, 5, 1, 2] = 0.000000
M = 3 N = 3 ; array[0, 0, 1, 0] = 0.000000; array[0, 0, 1, 1] = 0.000000; array[0, 0, 1, 2] = 0.000000
M = 3 N = 3 ; array[0, 1, 1, 0] = 0.000000; array[0, 1, 1, 1] = 0.000000; array[0, 1, 1, 2] = 0.000000
M = 3 N = 3 ; array[0, 2, 1, 0] = 0.000000; array[0, 2, 1, 1] = -0.000000; array[0, 2, 1, 2] = 0.000000
M = 3 N = 3 ; array[0, 3, 2, 0] = -0.000000; array[0, 3, 2, 1] = 0.000000; array[0, 3, 2, 2] = 0.000000
M = 3 N = 3 ; array[0, 4, 2, 0] = -0.000000; array[0, 4, 2, 1] = 0.000000; array[0, 4, 2, 2] = 0.000000
M = 3 N = 3 ; array[0, 5, 2, 0] = 0.000000; array[0, 5, 2, 1] = 0.000000; array[0, 5, 2, 2] = 0.000000
M = 3 N = 3 ; array[0, 0, 2, 0] = 0.000000; array[0, 0, 2, 1] = 0.000000; array[0, 0, 2, 2] = 0.000000
M = 3 N = 3 ; array[0, 1, 2, 0] = 0.000000; array[0, 1, 2, 1] = 0.000000; array[0, 1, 2, 2] = 0.000000
M = 3 N = 3 ; array[0, 2, 2, 0] = 0.000000; array[0, 2, 2, 1] = -0.000000; array[0, 2, 2, 2] = 0.000000
M = 3 N = 3 ; array[0, 3, 3, 0] = -0.000000; array[0, 3, 3, 1] = 0.000000; array[0, 3, 3, 2] = 0.000000
M = 3 N = 3 ; array[0, 4, 3, 0] = -0.000000; array[0, 4, 3, 1] = 0.000000; array[0, 4, 3, 2] = 0.000000
M = 3 N = 3 ; array[0, 5, 3, 0] = 0.000000; array[0, 5, 3, 1] = 0.000000; array[0, 5, 3, 2] = 0.000000
M = 3 N = 3 ; array[0, 0, 3, 0] = 0.000000; array[0, 0, 3, 1] = 0.000000; array[0, 0, 3, 2] = 0.000000
M = 3 N = 3 ; array[0, 1, 3, 0] = 0.000000; array[0, 1, 3, 1] = 0.000000; array[0, 1, 3, 2] = 0.000000
M = 3 N = 3 ; array[0, 2, 3, 0] = 0.000000; array[0, 2, 3, 1] = -0.000000; array[0, 2, 3, 2] = 0.000000
M = 3 N = 3 ; array[0, 3, 4, 0] = -0.000000; array[0, 3, 4, 1] = 0.000000; array[0, 3, 4, 2] = 0.000000
M = 3 N = 3 ; array[0, 4, 4, 0] = -0.000000; array[0, 4, 4, 1] = 0.000000; array[0, 4, 4, 2] = 0.000000
M = 3 N = 3 ; array[0, 5, 4, 0] = 0.000000; array[0, 5, 4, 1] = 0.000000; array[0, 5, 4, 2] = 0.000000
M = 3 N = 3 ; array[0, 0, 4, 0] = 0.000000; array[0, 0, 4, 1] = 0.000000; array[0, 0, 4, 2] = 0.000000
M = 3 N = 3 ; array[0, 1, 4, 0] = 0.000000; array[0, 1, 4, 1] = 0.000000; array[0, 1, 4, 2] = 0.000000
M = 3 N = 3 ; array[0, 2, 4, 0] = 0.000000; array[0, 2, 4, 1] = -0.000000; array[0, 2, 4, 2] = 0.000000
M = 3 N = 3 ; array[0, 3, 5, 0] = -0.000000; array[0, 3, 5, 1] = 0.000000; array[0, 3, 5, 2] = 0.000000
M = 3 N = 3 ; array[0, 4, 5, 0] = -0.000000; array[0, 4, 5, 1] = 0.000000; array[0, 4, 5, 2] = 0.000000
M = 3 N = 3 ; array[0, 5, 5, 0] = 0.000000; array[0, 5, 5, 1] = 0.000000; array[0, 5, 5, 2] = 0.000000
M = 3 N = 3 ; array[0, 0, 5, 0] = 0.000000; array[0, 0, 5, 1] = 0.000000; array[0, 0, 5, 2] = 0.000000
M = 3 N = 3 ; array[0, 1, 5, 0] = 0.000000; array[0, 1, 5, 1] = 0.000000; array[0, 1, 5, 2] = 0.000000
M = 3 N = 3 ; array[0, 2, 5, 0] = 0.000000; array[0, 2, 5, 1] = -0.000000; array[0, 2, 5, 2] = 0.000000
M = 3 N = 3 ; array[0, 3, 6, 0] = -0.000000; array[0, 3, 6, 1] = 0.000000; array[0, 3, 6, 2] = 0.000000
M = 3 N = 3 ; array[0, 4, 6, 0] = -0.000000; array[0, 4, 6, 1] = 0.000000; array[0, 4, 6, 2] = 0.000000
M = 3 N = 3 ; array[0, 5, 6, 0] = 0.000000; array[0, 5, 6, 1] = 0.000000; array[0, 5, 6, 2] = 0.000000
M = 3 N = 3 ; array[0, 0, 6, 0] = 0.000000; array[0, 0, 6, 1] = 0.000000; array[0, 0, 6, 2] = 0.000000
M = 3 N = 3 ; array[0, 1, 6, 0] = 0.000000; array[0, 1, 6, 1] = 0.000000; array[0, 1, 6, 2] = 0.000000
M = 3 N = 3 ; array[0, 2, 6, 0] = 0.000000; array[0, 2, 6, 1] = -0.000000; array[0, 2, 6, 2] = 0.000000
M = 3 N = 3 ; array[0, 3, 7, 0] = -0.000000; array[0, 3, 7, 1] = 0.000000; array[0, 3, 7, 2] = 0.000000
M = 3 N = 3 ; array[0, 4, 7, 0] = -0.000000; array[0, 4, 7, 1] = 0.000000; array[0, 4, 7, 2] = 0.000000
M = 3 N = 3 ; array[0, 5, 7, 0] = 0.000000; array[0, 5, 7, 1] = 0.000000; array[0, 5, 7, 2] = 0.000000
M = 3 N = 3 ; array[0, 0, 7, 0] = 0.000000; array[0, 0, 7, 1] = 0.000000; array[0, 0, 7, 2] = 0.000000
M = 3 N = 3 ; array[0, 1, 7, 0] = 0.000000; array[0, 1, 7, 1] = 0.000000; array[0, 1, 7, 2] = 0.000000
M = 3 N = 3 ; array[0, 2, 7, 0] = 0.000000; array[0, 2, 7, 1] = -0.000000; array[0, 2, 7, 2] = 0.000000
M = 3 N = 3 ; array[1, 3, 0, 0] = 0.000000; array[1, 3, 0, 1] = -0.000000; array[1, 3, 0, 2] = 0.000000
M = 3 N = 3 ; array[1, 3, 0, 0] = -0.000000; array[1, 3, 0, 1] = 0.000000; array[1, 3, 0, 2] = 0.000000
M = 3 N = 3 ; array[1, 4, 0, 0] = -0.000000; array[1, 4, 0, 1] = 0.000000; array[1, 4, 0, 2] = 0.000000
M = 3 N = 3 ; array[1, 4, 0, 0] = 0.000000; array[1, 4, 0, 1] = 0.000000; array[1, 4, 0, 2] = 0.000000
M = 3 N = 3 ; array[1, 5, 0, 0] = 0.000000; array[1, 5, 0, 1] = 0.000000; array[1, 5, 0, 2] = 0.000000
M = 3 N = 3 ; array[1, 5, 0, 0] = 0.000000; array[1, 5, 0, 1] = 0.000000; array[1, 5, 0, 2] = 0.000000
python3: /usr/include/boost/multi_array/base.hpp:312: Reference boost::detail::multi_array::multi_array_impl_base<T, NumDims>::access_element(boost::type<Reference>, const IndexList&, TPtr, const size_type*, const index*, const index*) const [with Reference = double&; IndexList = boost::array<long int, 4>; TPtr = double*; T = double; long unsigned int NumDims = 4; size_type = long unsigned int; index = long int]: Assertion `size_type(indices[i] - index_bases[i]) < extents[i]' failed.
===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 11887 RUNNING AT 92f33c5094034595b63af25b7528b0b9
=   EXIT CODE: 134
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Aborted (signal 6)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions


jngrad commented Feb 26, 2024

Adapting the code that generates the error message (raffenet/mpich@v4.1.2:src/mpi/comm/commutil.c#L1125-L1147) in the unit test helped me find out that the issue was due to a missing cbs.loop(); on the worker nodes. The worker nodes receive a LOOP_ABORT message from ~MpiCallbacks() but cannot process it without a blocking MpiCallbacks::loop(), so the receive queue is not empty when the communicator is destroyed, which is a fatal error in MPICH 4.1+.

int main(int argc, char **argv) {
  auto const mpi_env = std::make_shared<boost::mpi::environment>(argc, argv);
  ::mpi_env = mpi_env;
  auto const retval = boost::unit_test::unit_test_main(init_unit_test, argc, argv);
  {
    // adapted from MPIR_Comm_delete_internal(): probe for unmatched
    // messages left in the receive queue of the world communicator
    boost::mpi::communicator world;
    int flag;
    int unmatched_messages = 0;
    MPI_Comm comm = world;
    MPI_Status status;
    do {
      int mpi_errno = MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, comm, &flag, &status);
      printf("rank %d has mpi_errno=%d\n", world.rank(), mpi_errno);
      char buffer[10] = {0};
      if (flag) {
        // drain the pending message and report its payload and metadata
        int count = 0;
        MPI_Get_count(&status, MPI_CHAR, &count);
        MPI_Recv(buffer, count, MPI_CHAR, status.MPI_SOURCE, status.MPI_TAG, comm, MPI_STATUS_IGNORE);
        unmatched_messages++;
        printf("rank %d received values {%d,%d,%d,%d} from rank %d, with tag %d, size %d Bytes and error code %d.\n",
               world.rank(), (int)(buffer[0]), (int)(buffer[1]), (int)(buffer[2]), (int)(buffer[3]),
               status.MPI_SOURCE, status.MPI_TAG, count, status.MPI_ERROR);
      }
    } while (false); // a single probe suffices to detect the stray message
    printf("rank %d has %d unmatched messages\n", world.rank(), unmatched_messages);
  }
  return retval;
}

Output:

Running 1 test case...
0: ~MpiCallbacks()
call(0)
Running 1 test case...
1: ~MpiCallbacks()


*** No errors detected
*** No errors detected
rank 0 has mpi_errno=0
rank 0 has 0 unmatched messages
rank 1 has mpi_errno=0
rank 1 received values {0,0,0,0} from rank 0, with tag 2147483647, size 4 Bytes and error code -33873408.
rank 1 has 1 unmatched messages
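
The fix in the unit tests then amounts to keeping the worker ranks in the blocking callbacks loop until the head rank destroys its MpiCallbacks handle. A sketch, assuming the handle cbs and the communicator world from the test scaffolding above (run_test_body is a hypothetical placeholder):

if (world.rank() == 0) {
  run_test_body(cbs); // head rank drives the test and, on destruction of
                      // its MpiCallbacks handle, broadcasts LOOP_ABORT
} else {
  cbs.loop();         // worker ranks process callbacks until LOOP_ABORT
}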


jngrad commented Feb 28, 2024

The bugfix is now in Fedora 41 stable as release espresso-4.2.1-11.fc41 (rpms/espresso).

jngrad added this to the ESPResSo 4.2.2 milestone Feb 28, 2024
kodiakhq bot closed this as completed in #4858 Feb 29, 2024
kodiakhq bot added a commit that referenced this issue Feb 29, 2024
Fixes #4856

Description of changes:
- fix multiple bugs caused by undefined behavior due to the static initialization order of MPI global objects
- ESPResSo is now compatible with Boost 1.84+
jngrad pushed a commit to jngrad/espresso that referenced this issue Feb 29, 2024
Fixes espressomd#4856

Description of changes:
- fix multiple bugs caused by undefined behavior due to the static initialization order of MPI global objects
- ESPResSo is now compatible with Boost 1.84+