Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

epresso-4.1.2: Many tests fail on OpenSuse tumbleweed due to IndexError: _Map_base::at #3396

Closed
junghans opened this issue Jan 4, 2020 · 32 comments

Comments

@junghans
Copy link
Member

junghans commented Jan 4, 2020

Due to https://build.opensuse.org/request/show/760741, I tried to bump espresso to v4.1.2 on OpenSUSE, but many tests fail on x86_64 with:

[ 1114s]         Start 131: lb_momentum_conservation
[ 1116s] 135/145 Test #131: lb_momentum_conservation ..........................***Failed    2.53 sec
[ 1116s] Traceback (most recent call last):
[ 1116s]   File "/home/abuild/rpmbuild/BUILD/espresso/build/testsuite/python/lb_momentum_conservation.py", line 40, in <module>
[ 1116s]     class Momentum(object):
[ 1116s]   File "/home/abuild/rpmbuild/BUILD/espresso/build/testsuite/python/lb_momentum_conservation.py", line 48, in Momentum
[ 1116s]     system = espressomd.System(box_l=[BOX_SIZE] * 3)
[ 1116s]   File "system.pyx", line 149, in espressomd.system.System.__init__
[ 1116s]   File "collision_detection.pyx", line 53, in espressomd.collision_detection.CollisionDetection.__init__
[ 1116s]   File "script_interface.pyx", line 297, in espressomd.script_interface.ScriptInterfaceHelper.__init__
[ 1116s]   File "script_interface.pyx", line 78, in espressomd.script_interface.PScriptInterface.__init__
[ 1116s] IndexError: _Map_base::at
[ 1116s] -------------------------------------------------------
[ 1116s] Primary job  terminated normally, but 1 process returned
[ 1116s] a non-zero exit code.. Per user-direction, the job has been aborted.
[ 1116s] -------------------------------------------------------
[ 1116s] --------------------------------------------------------------------------
[ 1116s] mpiexec detected that one or more processes exited with non-zero status, thus causing
[ 1116s] the job to be terminated. The first process to do so was:
[ 1116s] 
[ 1116s]   Process name: [[3280,1],0]
[ 1116s]   Exit code:    1
[ 1116s] --------------------------------------------------------------------------
[ 1116s] 

Details here.

Surprisingly on i586, only matrix_vector_product fails:

[  765s] 48/68 Test #48: matrix_vector_product ............***Failed    0.00 sec
[  765s] Running 1 test case...
[  765s] /home/abuild/rpmbuild/BUILD/espresso/src/utils/tests/matrix_vector_product.cpp(36): [1;31;49merror: in "inner_product": difference{4.31155e-17} between result[i]{30.900000000000002} and boost::inner_product(matrix[i], vector, 0.0){30.900000000000002} exceeds 2.22045e-16%[0;39;49m
[  765s] /home/abuild/rpmbuild/BUILD/espresso/src/utils/tests/matrix_vector_product.cpp(36): [1;31;49merror: in "inner_product": difference{8.41875e-17} between result[i]{73.849999999999994} and boost::inner_product(matrix[i], vector, 0.0){73.849999999999994} exceeds 2.22045e-16%[0;39;49m
[  765s] /home/abuild/rpmbuild/BUILD/espresso/src/utils/tests/matrix_vector_product.cpp(36): [1;31;49merror: in "inner_product": difference{7.41993e-17} between result[i]{341.15000000000003} and boost::inner_product(matrix[i], vector, 0.0){341.15000000000003} exceeds 2.22045e-16%[0;39;49m
[  765s] 
[  765s] [1;31;49m*** 3 failures are detected in the test module "matrix_vector_product test"
[  765s] [0;39;49m
[  765s] 

Details here

@junghans
Copy link
Member Author

junghans commented Jan 4, 2020

@mcepl saved us from OpenSUSE removal.

@RudolfWeeber
Copy link
Contributor

RudolfWeeber commented Jan 4, 2020 via email

@fweik
Copy link
Contributor

fweik commented Jan 4, 2020

This looks more like a linker/shared object problem. The exception is raised because an unknown class name is encountered by the factory, which lives in global static memory. In the place where the exception is raised the access is from a different shared object than in all the other cases (where it works). I can maybe have a look next week. The question is also how urgent this is, because some of the relevant code is also up for replacement, so the problem could also go away by itself given some time.

@mcepl
Copy link

mcepl commented Jan 4, 2020

The question is also how urgent this is,

not super urgent … if we know that somebody actually works on this issue, I can keep espresso put in openSUSE.

@junghans
Copy link
Member Author

junghans commented Jan 4, 2020

I was able to reproduce this on OpenSuse tumbleweed with:

osc co home:cjunghans:branches:devel:languages:python:numeric python3-espressomd #create account for https://build.opensuse.org/ first
cd "home:cjunghans:branches:devel:languages:python:numeric/python3-espressomd"
osc build --ccache
osc shell --ccache
cd rpmbuild/BUILD/espresso/build/
LD_LIBRARY_PATH='/home/abuild/rpmbuild/BUILDROOT/python3-espressomd-4.1.2-0.x86_64/usr/lib64/python3.7/site-packages/espressomd:/usr/lib64/mpi/gcc/openmpi2/lib64/' make check ARGS="-E all_compare_test"

(Not sure why all_compare_test fails in my VM`)

@RudolfWeeber
Copy link
Contributor

RudolfWeeber commented Jan 6, 2020 via email

@junghans
Copy link
Member Author

junghans commented Jan 6, 2020

@RudolfWeeber, if you make a clean patch, I will just apply it in the 4.1.2 rpm for OpenSUSE.

@jngrad
Copy link
Member

jngrad commented Jan 13, 2020

Could not reproduce the error in a docker container with the opensuse/tumbleweed base image on an x86_64 machine with espresso 4.1.2.

@junghans
Copy link
Member Author

Did you use osc?

@junghans
Copy link
Member Author

For me it still persist using osc.

@jngrad
Copy link
Member

jngrad commented Jan 14, 2020

In an openSUSE Docker container:

  • direct build: issue not reproducible
  • rpm build with your openSUSE spec file: issue not reproducible
  • osc build with your script: issue reproducible for Python tests

@junghans I thought osc would use rpmbuild internally. Nevermind.

@RudolfWeeber @fweik In GDB, pypresso throws at the call to make_shared("CollisionDetection::CollisionDetection", ScriptInterface::ScriptInterfaceBase::CreationPolicy::GLOBAL) in self.set_sip(make_shared(to_char_pointer(name), policy_)) when instantiating a System class. However, disabling the COLLISION_DETECTION feature doesn't help, as the error message persists with ComFixed. Removing ComFixed still doesn't help, as the error message persists with Constraints. The other System attributes defined prior to the CollisionDetection attribute don't throw, e.g. CellSystem or interactions.BondedInteractions. If you re-order the attributes of the System class judiciously, a pattern emerges: Cython classes deriving from ScriptInterfaceHelper and Python classes throw IndexError: _Map_base::at, while other Cython classes don't throw.

@mcepl
Copy link

mcepl commented Jan 14, 2020

@junghans I thought osc would use rpmbuild internally. Nevermind.

It does, but first it generates isolated environment. Would diff between output of rpmbuild -ba and osc build (or osc lbl) be any helpful?

@jngrad
Copy link
Member

jngrad commented Jan 14, 2020

@mcepl rpmbuild -ba succeeds while osc build fails due to the index error. There are several differences in the compiler and linker flags passed as argument to CMake in the osc build vs. the rpmbuild, but even when using the rpmbuild-generated CMake command in an osc shell, there are still Rpath issues, so osc must be setting environment variables as well. This means we have to open an account on OpenSUSE to be able to troubleshoot the specfile in an osc shell.

@junghans I think there is something wrong in the CMake command, either an incorrect linking flag or an Rpath error. Within osc shell, I can compile espresso and run the Python tests just fine:

cd /home/abuild/rpmbuild/BUILD/espresso/build
rm -rf *
rm -rf /home/abuild/rpmbuild/BUILDROOT/python3-espressomd-4.1.2-0.x86_64/usr/lib64/python3.7/site-packages/espressomd/*
cmake .. -DCMAKE_BUILD_TYPE=Debug -DMPIEXEC_PREFLAGS=--allow-run-as-root
make -j24 check_python ARGS=-j24

In that shell, I also tried the fully expanded CMake command used by osc build and tinkered with the linker flags, both the additional ones from your specfile and those added by osc, without success.

@junghans
Copy link
Member Author

@jngrad yeah rpm hates rpath hence %cmake injects something like -DCMAKE_SKIP_RPATH:BOOL=ON and that is why we have to explicitly use LD_LIBRARY_PATH with make check. Is there a way to compile espresso without an rpath being injected after install?
(On Fedora, which isn't as strict with rpaths, I found that even with -DCMAKE_INSTALL_RPATH="" the libraries will have some like /tmp/foo/bar in the rpath.)

@jngrad
Copy link
Member

jngrad commented Jan 15, 2020

@junghans Just had a look with objdump -a -x in the .so files and the RPATH field is indeed empty. This makes sense now. OpenSUSE forbids hardcoded Rpath values in .so files, so we have to work around it. The issue is reproducible on Ubuntu 18.04 simply by copying the osc CMake command:

cd build
rm -rf *
rm -rf /tmp/espresso-unit-tests
cp ../maintainer/configs/empty.hpp myconfig.hpp
cmake .. '-GUnix Makefiles' -DCMAKE_INSTALL_PREFIX:PATH=/tmp/espresso-unit-tests \
      -DCMAKE_INSTALL_LIBDIR:PATH=/usr/lib -DCMAKE_BUILD_TYPE=Debug \
      -DCMAKE_C_FLAGS='-Og -Wall -D_FORTIFY_SOURCE=2 -Werror=return-type -flto=auto -DNDEBUG' \
      -DCMAKE_CXX_FLAGS='-Og -Wall -D_FORTIFY_SOURCE=2 -Werror=return-type -flto=auto -DNDEBUG' \
      -DCMAKE_EXE_LINKER_FLAGS='-flto=auto -Wl,--as-needed -Wl,--no-undefined -Wl,-z,now' \
      -DCMAKE_MODULE_LINKER_FLAGS='-flto=auto -Wl,--as-needed' \
      -DCMAKE_SHARED_LINKER_FLAGS='-flto=auto -Wl,--as-needed -Wl,--no-undefined -Wl,-z,now' \
      -DCMAKE_SKIP_RPATH:BOOL=ON -DBUILD_SHARED_LIBS:BOOL=ON \
      -DCMAKE_COLOR_MAKEFILE:BOOL=OFF \
      -DCMAKE_SHARED_LINKER_FLAGS='-Wl,--as-needed -Wl,-z,now' \
      -DPYTHON_EXECUTABLE=/usr/bin/python3 -DINSTALL_PYPRESSO=OFF
make -j$(nproc) install
# trigger index error
LD_LIBRARY_PATH="/tmp/espresso-unit-tests/lib/python3.6/site-packages/espressomd/:/usr/lib/x86_64-linux-gnu/openmpi" make -j$(nproc) check_python ARGS="-R ^lb_momentum_conservation$"
# GDB backtrace
LD_LIBRARY_PATH="/tmp/espresso-unit-tests/lib/python3.6/site-packages/espressomd/:/usr/lib/x86_64-linux-gnu/openmpi" ./pypresso --gdb "testsuite/python/lb_momentum_conservation.py"
(gdb) catch throw
(gdb) run
(gdb) bt

For some reason, with the Ubuntu build the GDB backtrace doesn't allow printing the value of Cython variables (either empty or optimized out), while in the OpenSUSE build most of them can be printed.

@mcepl
Copy link

mcepl commented Jan 15, 2020

Just a note, the hate towards rpath is not only openSUSE-thing, Fedora Packaging Guidelines forbids it as well, we are just a way more thorough about actual checking it (running rpmlint on every package build).

@jngrad
Copy link
Member

jngrad commented Jan 15, 2020

@RudolfWeeber using the procedure above for Ubuntu 18:

>>> from espressomd import script_interface
>>> script_interface._python_class_by_so_name['CollisionDetection::CollisionDetection']
<class 'espressomd.collision_detection.CollisionDetection'>
>>> from espressomd import System
>>> s = System(box_l=(1,1,1))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "system.pyx", line 149, in espressomd.system.System.__init__
  File "collision_detection.pyx", line 53, in espressomd.collision_detection.CollisionDetection.__init__
  File "script_interface.pyx", line 297, in espressomd.script_interface.ScriptInterfaceHelper.__init__
  File "script_interface.pyx", line 78, in espressomd.script_interface.PScriptInterface.__init__
IndexError: _Map_base::at

When compiling espresso with the minimal config, the issue happens for the ComFixed attribute, as described earlier. I don't get why the Rpath would have any impact on ComFixed, which is defined in a plain python file. Could it be another issue? The GDB trace for both CollisionDetection() and ComFixed() actually fails before the name string is looked up, at this line:

m_cb->call(make_remote_handle);

which calls
const int id = m_func_ptr_to_id.at(reinterpret_cast<void (*)()>(fp));

with fp = (void (*)(void)) <(anonymous namespace)::make_remote_handle()> and m_func_ptr_to_id a container of type std::unordered_map<void (*)(), int>. The container throws because the fp callback is not registered. Adding the following assertion catches the mistake:

assert(m_func_ptr_to_id.find(reinterpret_cast<void (*)()>(fp))!=m_func_ptr_to_id.end());

@fweik
Copy link
Contributor

fweik commented Jan 15, 2020

@jngrad can you please try giving the function make_remote_handle external linkage? Since the lookup is by address, it's possible that the address is not the same for every point of view for weak linkage. That behavior could also be (at least in theory) be influence by whether the rpath is set (because the symbol resolution is done at link/load time).

@jngrad
Copy link
Member

jngrad commented Jan 15, 2020

Adding extern didn't help, additionally pulling make_remote_handle out of the anonymous namespace also didn't help. Adding the Rpath manually with patchelf --set-rpath didn't help. Removing successively -Wl,-z,now, -Wl,--as-needed, -Wl,--no-undefined (in that order) didn't help.

I think the root cause is the LTO (link time optimization), enabled through the -flto compiler/linker flag, introduced by default in OpenSUSE Tumbleweed (mailing list thread), and was proposed for inclusion in Fedora (proposal). LTO recompiles the objects during the linking step to reduce binary sizes.

@junghans Until espresso gets LTO'ed, a temporary fix would be to manually remove the -flto=auto flag from every CMake flag (I couldn't find a global LTO switch). Applying the following patch to espressomd.spec allowed osc build to successfully run the Python tests on my side.

--- python3-espressomd.spec     (revision fdbeb87bede60ca9aa7dac9bebb5ab7f)
+++ python3-espressomd.spec     (working copy)
@@ -81,7 +82,13 @@
   '-DCMAKE_SHARED_LINKER_FLAGS=-Wl,--as-needed -Wl,-z,now' \
   -DLIBDIR=%{_lib} \
   -DPYTHON_EXECUTABLE=%{_bindir}/python3 \
-  -DINSTALL_PYPRESSO=OFF
+  -DINSTALL_PYPRESSO=OFF \
+  '-DCMAKE_C_FLAGS=-O2 -Wall -D_FORTIFY_SOURCE=2 -fstack-protector-strong -funwind-tables -fasynchronous-unwind-tables -fstack-clash-protection -Werror=return-type -DNDEBUG' \
+  '-DCMAKE_CXX_FLAGS=-O2 -Wall -D_FORTIFY_SOURCE=2 -fstack-protector-strong -funwind-tables -fasynchronous-unwind-tables -fstack-clash-protection -Werror=return-type -DNDEBUG' \
+  '-DCMAKE_Fortran_FLAGS=-O2 -Wall -D_FORTIFY_SOURCE=2 -fstack-protector-strong -funwind-tables -fasynchronous-unwind-tables -fstack-clash-protection -Werror=return-type -DNDEBUG' \
+  '-DCMAKE_EXE_LINKER_FLAGS=-Wl,--as-needed -Wl,--no-undefined -Wl,-z,now' \
+  '-DCMAKE_MODULE_LINKER_FLAGS=-Wl,--as-needed' \
+  '-DCMAKE_SHARED_LINKER_FLAGS=-Wl,--as-needed -Wl,-z,now'
 %make_jobs
 
 %install
[  526s] RPMLINT report:
[  526s] ===============
[  530s] python3-espressomd.x86_64: W: shared-lib-without-dependency-information /usr/lib64/python3.7/site-packages/espressomd/EspressoConfig.so
[  530s] 2 packages and 0 specfiles checked; 0 errors, 2 warnings.
[  530s] 
[  530s] 
[  530s]  finished "build python3-espressomd.spec" at Wed Jan 15 18:36:15 UTC 2020.
[  530s] 

/var/tmp/build-root/openSUSE_Tumbleweed-x86_64/home/abuild/rpmbuild/SRPMS/python3-espressomd-4.1.2-0.src.rpm

/var/tmp/build-root/openSUSE_Tumbleweed-x86_64/home/abuild/rpmbuild/RPMS/x86_64/python3-espressomd-4.1.2-0.x86_64.rpm

@junghans
Copy link
Member Author

@jngrad will test this
@mcepl is there a more elegant way to do this?

@junghans
Copy link
Member Author

Ok, x86_64 is ok now, i586 still fails on matrix_vector_product.

@mcepl
Copy link

mcepl commented Jan 15, 2020

@jngrad will test this
@mcepl is there a more elegant way to do this?

Yes, add definition

%define _lto_cflags %{nil}

See https://build.opensuse.org/request/show/764818

@junghans
Copy link
Member Author

On i586 in addition:

[ 1868s] 108/145 Test  #84: collision_detection ...............................***Failed    7.01 sec
[ 1868s] .ERROR: Particle 1 moved more than one local box length in one timestep.
[ 1868s] E......
[ 1868s] ======================================================================
[ 1868s] ERROR: test_bind_at_point_of_collision (__main__.CollisionDetection)
[ 1868s] ----------------------------------------------------------------------
[ 1868s] Traceback (most recent call last):
[ 1868s]   File "/home/abuild/rpmbuild/BUILD/espresso/build/testsuite/python/collision_detection.py", line 225, in test_bind_at_point_of_collision
[ 1868s]     self.run_test_bind_at_point_of_collision_for_pos(np.array((0, 0, 0)))
[ 1868s]   File "/home/abuild/rpmbuild/BUILD/espresso/build/testsuite/python/collision_detection.py", line 140, in run_test_bind_at_point_of_collision_for_pos
[ 1868s]     self.s.integrator.run(3000)
[ 1868s]   File "integrate.pyx", line 104, in espressomd.integrate.Integrator.run
[ 1868s]   File "utils.pyx", line 264, in espressomd.utils.handle_errors
[ 1868s] Exception: Encountered errors during integrate: ERROR: Particle 1 moved more than one local box length in one timestep.
[ 1868s] 

@jngrad
Copy link
Member

jngrad commented Jan 16, 2020

On making espresso LTO-compliant: I originally thought the make_remote_handle function was either inlined or optimized out by LTO, making the corresponding function pointer either pointing to a copy of the function or dangling. Adding __attribute__((used)) to an unused function prevents its removal by LTO (tested in both gcc 7.4 and clang 6.0), yet adding this attribute to void make_remote_handle() did not help. Either such attributes are irrelevant for shared objects (looks like all symbols are still available in LTO'ed .so files), or LTO does something else than removing that function. Adding set_source_files_properties("${CMAKE_CURRENT_SOURCE_DIR}/ParallelScriptInterface.cpp" PROPERTIES COMPILE_FLAGS -fno-lto) to prevent the generation of the GIMPLE sections didn't help. Manually linking script_interface.so with the linking flag -fno-lto also didn't help.

We will not move forward with LTO-compliance for the moment. The specfile _lto_cflags variable seems to have fixed the build issue.

@junghans Errors of the type Particle X moved more than one local box length in one timestep happen from time to time in a few electrostatics tests. Don't remember seeing it in collision_detection before. Does it go away if you restart the job?

@fweik The i586 precision error reported in the first post of this issue is a recurring problem. There is no QEMU for i586 that I'm aware of, so we cannot test it in our CI infrastructure. Could we rely on GNU C Intel x86 macros in every double precision boost test? E.g., something like:

#ifdef __i586__
auto constexpr epsilon = 3 * std::numeric_limits<double>::epsilon();
#else
auto constexpr epsilon = std::numeric_limits<double>::epsilon();
#endif

@RudolfWeeber
Copy link
Contributor

RudolfWeeber commented Jan 16, 2020 via email

@fweik
Copy link
Contributor

fweik commented Jan 17, 2020

So how does one find error limits for floating point calculations? Is the error bounded? Introducing random error bounds with magic numbers does not increase confidence IMHO... In general, epsilon is also just a relevant quantity if the numbers are O(1), because the precision of floating point numbers depends on their magnitude.

@jngrad
Copy link
Member

jngrad commented Jan 17, 2020

Let’s just increase the tolerance to 4 epsilon or 8 epsilon for affected tests.

Not sure this will help. If a contribution leads to loss of precision, our CI infrastructure will not catch it, but OpenSUSE i586 will, and a ticket will be opened here again. If we can't test it in CI and there is no reliable way to detect it at compile time, we need to officially drop support for i586 and remove all epsilon prefactors added so far. Christoph can then craft an i586-specific specfile with a sed expression that adds an epsilon prefactor in all unit tests.

the precision of floating point numbers depends on their magnitude

There are adaptative epsilon check algorithms that take the magnitude of the floats into account and can deal with zero, but they're quite involved and need to be used with care, because some mathematical operations can lead to large deviations (e.g. sin(pi) != 0 because pi cannot be represented exactly) and will require an arbitrary absolute epsilon anyway, e.g. std::numeric_limits<double>::epsilon(). In my opinion we should remove the epsilon prefactors and let i586 users tweak the value of epsilon.

@mcepl
Copy link

mcepl commented Jan 17, 2020

Christoph can then craft an i586-specific specfile with a sed expression that adds an epsilon prefactor in all unit tests.

I don’t understand, what’s so complicated about adding %ifarch condition to the existing SPEC file? Also, no Linux distribution I know about uses upstream-provided SPEC files, so you don’t have to bother that much.

@RudolfWeeber
Copy link
Contributor

RudolfWeeber commented Jan 17, 2020 via email

@fweik
Copy link
Contributor

fweik commented Jan 17, 2020

observed results agreeing to 10E-16 should be convincing enough

why?

@jngrad jngrad added this to the Espresso 4.1.3 milestone Jan 21, 2020
@jngrad
Copy link
Member

jngrad commented Feb 24, 2020

Update: the epsilon value in the boost unit tests was wrong and got fixed in #3427, which is now a patch in the openSUSE package (openSUSE:Factory/python3-espressomd/3427.patch). Other failing tests were patched (#3315 (comment)), excepted collision_detection which will not be investigated further as per #3315 (comment). I'm closing this.

@jngrad jngrad closed this as completed Feb 24, 2020
@junghans
Copy link
Member Author

junghans commented Aug 4, 2020

Now we have that problem on Fedora 33 as well.
build.log.txt

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants