-
Notifications
You must be signed in to change notification settings - Fork 183
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Undefined behavior from static global variables #4856
Comments
One can influence the destruction order by adding the following two code snippets before and after #define BOOST_TEST_NO_MAIN #include "communication.hpp"
int main(int argc, char **argv) {
auto mpi_env = std::make_shared<boost::mpi::environment>(argc, argv);
Communication::init(mpi_env);
return boost::unit_test::unit_test_main(init_unit_test, argc, argv);
} This is not a sustainable solution, and it will break as soon as the order of instantiation of static globals changes, for example when ESPResSo features that introduce global variables are disabled, or when a different Boost release is used. |
After fixing export LD_LIBRARY_PATH="${LD_LIBRARY_PATH:+$LD_LIBRARY_PATH:}${BOOST_ROOT}" to export LD_LIBRARY_PATH="${LD_LIBRARY_PATH:+$LD_LIBRARY_PATH:}${BOOST_ROOT}/lib" the bug can be properly investigated with GDB on Ubuntu 22.04. It's not clear to me, why ESPResSo needs to manage the lifetime of the
Running the Python testsuite shows three outcomes: no error, a segmentation fault, or a timeout. The latter most likely happens during
|
I took a different angle by creating a namespace Communication {
//static auto const &mpi_datatype_cache = boost::mpi::detail::mpi_datatype_cache();
static std::shared_ptr<boost::mpi::environment> mpi_env;
static std::shared_ptr<MpiCallbacks> m_callbacks;
} // namespace Communication
struct MpiContainer {
boost::mpi::detail::mpi_datatype_map const &mpi_datatype_cache = boost::mpi::detail::mpi_datatype_cache();
std::shared_ptr<boost::mpi::environment> mpi_env;
std::shared_ptr<Communication::MpiCallbacks> m_callbacks;
~MpiContainer() {
m_callbacks.reset();
Communication::m_callbacks.reset();
mpi_env.reset();
Communication::mpi_env.reset();
}
MpiContainer() {
mpi_env = Communication::mpi_env;
m_callbacks = Communication::m_callbacks;
}
};
static std::unique_ptr<MpiContainer> mpi_container;
void atexit_handler() {
if (Communication::m_callbacks) {
mpi_container->m_callbacks.reset();
Communication::m_callbacks.reset();
}
} The |
I think the way to solve this issue is to:
Resolving 1. is easy: never call
A proof-of-concept is available in jngrad/espresso@ Here I'm assuming all Python atexit functions run before the first C++ atexit function, please correct me if I'm wrong! To help with debugging, source the following GDB script in your session to print out calls to relevant symbols in a way that doesn't interrupt the flow of the GDB session with user prompts:
|
On Fedora there was no issue? |
I'm still running up the testsuite locally in a Docker image and will make a bugfix ASAP. They are currently still using Boost 1.83.0 in f40 and rawhide (https://src.fedoraproject.org/rpms/boost), so I gave priority to openSUSE. |
The |
https://build.opensuse.org/request/show/1143680 by user cjunghans + anag+factory - Drop python3-espressomd testing dependency, revert once gh#espressomd/espresso#4856 is fixed. - Dropped 1093.patch, merged upstream - fix links in README and doc ([gh#votca/votca#1091]) - fix python shebang to python3 ([gh#votca/votca#1093]) - Clean-up CI ([gh#votca/votca#1092], [gh#votca/votca#1095]) - remove reference to old webpage ([gh#votca/votca#1094]) - fix doc generation without pyxtp ([gh#votca/votca#1097]) - Do not run gmx tests without libgromacs ([gh#votca/votca#1099]) - KS-QMMM for single-particle states ([gh#votca/votca#1100]) - Spin-orbitals from ORCA in QMMM ([gh#votca/votca#1101]) - better unit test for DFT embedding ([gh#votca/votca#1102]) - Update to 2024
https://build.opensuse.org/request/show/1143680 by user cjunghans + anag+factory - Drop python3-espressomd testing dependency, revert once gh#espressomd/espresso#4856 is fixed. - Dropped 1093.patch, merged upstream - fix links in README and doc ([gh#votca/votca#1091]) - fix python shebang to python3 ([gh#votca/votca#1093]) - Clean-up CI ([gh#votca/votca#1092], [gh#votca/votca#1095]) - remove reference to old webpage ([gh#votca/votca#1094]) - fix doc generation without pyxtp ([gh#votca/votca#1097]) - Do not run gmx tests without libgromacs ([gh#votca/votca#1099]) - KS-QMMM for single-particle states ([gh#votca/votca#1100]) - Spin-orbitals from ORCA in QMMM ([gh#votca/votca#1101]) - better unit test for DFT embedding ([gh#votca/votca#1102]) - Update to 2024
https://build.opensuse.org/request/show/1143680 by user cjunghans + anag+factory - Drop python3-espressomd testing dependency, revert once gh#espressomd/espresso#4856 is fixed. - Dropped 1093.patch, merged upstream - fix links in README and doc ([gh#votca/votca#1091]) - fix python shebang to python3 ([gh#votca/votca#1093]) - Clean-up CI ([gh#votca/votca#1092], [gh#votca/votca#1095]) - remove reference to old webpage ([gh#votca/votca#1094]) - fix doc generation without pyxtp ([gh#votca/votca#1097]) - Do not run gmx tests without libgromacs ([gh#votca/votca#1099]) - KS-QMMM for single-particle states ([gh#votca/votca#1100]) - Spin-orbitals from ORCA in QMMM ([gh#votca/votca#1101]) - better unit test for DFT embedding ([gh#votca/votca#1102]) - Update to 2024
The Homebrew formulae for |
Fixes #4859, fixes #4855 Description of changes: - bugfix: - LB boundaries are now properly communicated in the ghost layer - see details in #4859 - performance: - the LB flag field is no longer communicated at every time step - the LB UBB field is no longer recalculated at every time step - LB boundary setters (node, slice, shape) now always trigger a full ghost communication - maintainability: - the waLBerla header files are no longer visible in the ESPResSo core and script interface - the Boost dependency is now checked at the CMake level to prevent building broken ESPResSo shared libraries (see details in #4856)
Progress report:
Fedora Rawhide still hasn't updated to Boost 1.84 (bugzilla 2178871). Fedora 40 will enter Beta on February 27 (timeline). On f40 with all architectures, I get random failures of the The MPI deadlocks in Python tests have disappeared last week too, although on the Power9 architecture, when building with C++ assertions, the Boost histogram library triggers an assertion in the MpiCallbacks MPICH error message
cylindrical LB velocity profile observable error message
|
Adapting the code that generates the error message (raffenet/mpich@v4.1.2:src/mpi/comm/commutil.c#L1125-L1147) in the unit test helped me find out the issue was due to a missing int main(int argc, char **argv) {
auto const mpi_env = std::make_shared<boost::mpi::environment>(argc, argv);
::mpi_env = mpi_env;
auto const retval = boost::unit_test::unit_test_main(init_unit_test, argc, argv);
{
boost::mpi::communicator world;
int flag;
int unmatched_messages = 0;
MPI_Comm comm = world;
MPI_Status status;
do {
int mpi_errno = MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, comm, &flag, &status);
printf("rank %d has mpi_errno=%d\n", world.rank(), mpi_errno);
char buffer[10] = {0};
if (flag) {
int count = 0;
MPI_Get_count(&status, MPI_CHAR, &count);
MPI_Recv(buffer, count, MPI_CHAR, status.MPI_SOURCE, status.MPI_TAG, comm, MPI_STATUS_IGNORE);
unmatched_messages++;
printf("rank %d received values {%d,%d,%d,%d} from rank %d, with tag %d, size %d Bytes and error code %d.\n",
world.rank(), (int)(buffer[0]), (int)(buffer[1]), (int)(buffer[2]), (int)(buffer[3]),
status.MPI_SOURCE, status.MPI_TAG, count, status.MPI_ERROR);
}
} while (false);
printf("rank %d has %d unmatched messages\n", world.rank(), unmatched_messages);
}
return retval;
} Output:
|
The bugfix is now in Fedora 41 stable as release |
Fixes #4856 Description of changes: - fix multiple bugs caused by undefined behavior due to the static initialization order of MPI global objects - ESPResSo is now compatible with Boost 1.84+
Fixes espressomd#4856 Description of changes: - fix multiple bugs caused by undefined behavior due to the static initialization order of MPI global objects - ESPResSo is now compatible with Boost 1.84+
Fixes espressomd#4856 Description of changes: - fix multiple bugs caused by undefined behavior due to the static initialization order of MPI global objects - ESPResSo is now compatible with Boost 1.84+
Fixes espressomd#4856 Description of changes: - fix multiple bugs caused by undefined behavior due to the static initialization order of MPI global objects - ESPResSo is now compatible with Boost 1.84+
TL:DR: ESPResSo manages the lifetime of MPI global variables in a manner that is subject to destruction order fiasco.
Problem statement
ESPResSo 4.2 and 4.3-dev manage the following static globals:
espresso/src/core/communication.cpp
Lines 50 to 52 in 5cc8361
The MPI datatype cache is a Myers singleton, i.e. it is managed by reference rather than by pointer (which has pros and cons). This makes object lifetime difficult to control.
Destruction of the
boost::mpi::environment
depends on thempi_datatype_cache
in such a way that one isn't allowed to keep a handle toboost::mpi::environment
in a static global (for more details please see boostorg/mpi#92). In addition, storingmpi_datatype_cache
by reference in a static global alters its lifetime, which can trigger UB after normal program termination.Reproducing the bug
Reproducing UB caused by static initialization order fiasco is notoriously difficult. Fortunately for us, Boost 1.84.0 alters the order of static initialization, such that we can reproduce the issue in ESPResSo 4.2 and 4.3-dev in a reliable manner:
Output:
You may get false positives due to incorrect linkage of
Boost::serialization
, and may not get debug symbols in GDB. For an improved GDB experience, please use this Dockerfile instead:GDB trace:
Outlook
Due to how Boost 1.84.0 alters the order of static initialization, ESPResSo is now unusable on most environments. The bug was successfully reproduced on Ubuntu, Fedora and openSUSE.
The openSUSE package managers have already started the formal process of removing ESPResSo from their repositories (request 1142685). Fedora will probably do the same as soon as the Boost version gets bumped, probably after the rawhide fork. Not sure how EasyBuild and EESSI will react.
The text was updated successfully, but these errors were encountered: