Barnes-Hut calculations are wrong when ESPResSo is compiled in Debug mode #4774

jngrad · 2023-08-16T15:04:11Z

When compiling CUDA code with the nvcc compiler and the -G flag (generate device debug symbols), the behavior of the Barnes-Hut algorithm changes. The octree permutation vector is wrong, which causes variables n and i to be assigned incorrect values. This affects the distance calculation in the following loop:

espresso/src/core/magnetostatics/barnes_hut_gpu_cuda.cu

Lines 790 to 794 in a770f49

    
    auto tmp = 0.0f; // compute distance squared
    for (int l = 0; l < 3; l++) {
      dr[l] = -bhpara->r[3 * n + l] + bhpara->r[3 * i + l];
      tmp += dr[l] * dr[l];
    }
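
To make the effect of a bad index concrete, here is a small NumPy sketch of the same arithmetic. It is not ESPResSo code: the helper name `dist_sq`, the positions and the indices are made up for illustration, and it only assumes the flat x,y,z layout implied by the `3 * n + l` indexing in the loop above.

```python
import numpy as np


def dist_sq(r_flat, n, i):
    """Squared distance between entries n and i of a flat xyz array.

    Mirrors the loop above: dr[l] = -r[3*n + l] + r[3*i + l].
    (Hypothetical helper, not part of ESPResSo.)
    """
    dr = r_flat[3 * i:3 * i + 3] - r_flat[3 * n:3 * n + 3]
    return float(np.dot(dr, dr))


# hypothetical positions of three bodies, stored as x0,y0,z0,x1,y1,z1,...
r_flat = np.array([0.0, 0.0, 0.0,
                   1.0, 0.0, 0.0,
                   0.0, 3.0, 4.0])

correct = dist_sq(r_flat, n=0, i=1)    # intended pair
corrupted = dist_sq(r_flat, n=0, i=2)  # i taken from a wrong permutation entry
print(correct, corrupted)
```

A single wrong index silently yields a valid but different distance, with no out-of-bounds access, which would be consistent with memory checkers staying quiet.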

In the end, the computed forces, torques and energies are wrong. The deviation from the correct value is random, and the forces on individual particles can differ by an order of magnitude. The total energy can be quite close to the real value, and a small fraction of the particles will have the correct forces (up to machine precision), so this bug is easy to overlook when tracking the total energy or the force of a lucky particle.
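
Since aggregate quantities can hide the problem, a per-particle comparison is more telling. Below is a minimal sketch, assuming two force (or torque) arrays of shape (N, 3), one from a reference method and one from Barnes-Hut. The helper name `per_particle_deviation` and the tolerance `rtol` are illustrative choices, not part of ESPResSo, and the input data is synthetic.

```python
import numpy as np


def per_particle_deviation(ref, test, rtol=1e-5):
    """Return per-particle relative error and a mask of matching particles.

    ref, test: arrays of shape (n_part, 3), e.g. forces from a reference
    solver and from the solver under test (hypothetical inputs).
    """
    ref = np.asarray(ref, dtype=float)
    test = np.asarray(test, dtype=float)
    # norm of the difference, one value per particle
    abs_err = np.linalg.norm(test - ref, axis=1)
    # scale by the reference magnitude; guard against zero-force particles
    scale = np.maximum(np.linalg.norm(ref, axis=1), np.finfo(float).tiny)
    rel_err = abs_err / scale
    return rel_err, rel_err <= rtol


# synthetic illustration: particle 0 is "lucky", particle 1 is off by 10x
ref = np.array([[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]])
test = np.array([[1.0, 0.0, 0.0], [0.0, 20.0, 0.0]])
rel_err, ok = per_particle_deviation(ref, test)
print(rel_err, ok)
```

Checking `ok.all()` instead of a single particle or the total energy avoids being misled by the particles that happen to come out right.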

Something might be fundamentally broken here; undefined behavior is probably being invoked. Running all tools from the NVIDIA Compute Sanitizer suite (memcheck, racecheck, initcheck, synccheck) didn't report any errors. The bug isn't reproducible with the nvcc -g flag (generate host debug symbols), nor with the clang --cuda-noopt-device-debug flag (generate device debug info).

Here is an MWE adapted from dawaanr-and-bh-gpu.py:

import numpy as np
import espressomd.magnetostatics

np.random.seed(42)

system = espressomd.System(box_l=[1, 1, 1])
system.box_l = [15., 15., 15.]
system.periodicity = 3 * [False]
system.time_step = 1E-4
system.cell_system.skin = 0.1

n_part = 3
part_pos = np.random.random((n_part, 3)) * system.box_l[0]
part_dip = np.random.random((n_part, 3)) * 1.3
system.part.add(pos=part_pos, dip=part_dip)

system.actors.add(espressomd.magnetostatics.DipolarDirectSumGpu(prefactor=1.))
system.integrator.run(steps=0, recalc_forces=True)
dawaanr_f = np.copy(system.part.all().f)
dawaanr_t = np.copy(system.part.all().torque_lab)
dawaanr_e = system.analysis.energy()["total"]

system.actors.clear()

system.actors.add(espressomd.magnetostatics.DipolarBarnesHutGpu(
    prefactor=1., epssq=200., itolsq=8.))
system.integrator.run(steps=0, recalc_forces=True)
bhgpu_f = np.copy(system.part.all().f)
bhgpu_t = np.copy(system.part.all().torque_lab)
bhgpu_e = system.analysis.energy()["total"]

assert np.linalg.norm(bhgpu_f - dawaanr_f) < 1e-6
assert np.linalg.norm(bhgpu_t - dawaanr_t) < 1e-6
assert np.linalg.norm(bhgpu_e - dawaanr_e) < 1e-6

ESPResSo was compiled with maxset and these CMake options:

CC=gcc-10 CXX=g++-10 CUDACXX=/usr/local/cuda-11.5/bin/nvcc /usr/bin/cmake .. \
  -D ESPRESSO_BUILD_WITH_CUDA=ON -D CUDAToolkit_ROOT=/usr/local/cuda-11.5 \
  -D ESPRESSO_BUILD_WITH_CCACHE=ON -D ESPRESSO_BUILD_WITH_STOKESIAN_DYNAMICS=ON -D ESPRESSO_BUILD_WITH_WALBERLA=ON \
  -D ESPRESSO_BUILD_WITH_WALBERLA_FFT=ON -D ESPRESSO_BUILD_WITH_WALBERLA_AVX=ON \
  -D ESPRESSO_BUILD_WITH_SCAFACOS=OFF -D ESPRESSO_BUILD_WITH_HDF5=OFF -D ESPRESSO_BUILD_WITH_GSL=ON  \
  -D CMAKE_CUDA_FLAGS="--compiler-bindir=/usr/bin/g++-10" -D CMAKE_BUILD_TYPE=Debug

jngrad · 2023-11-16T12:07:15Z

Barnes-Hut is also broken on AMD GPUs: #3895

jngrad added the Bug and Core labels Aug 16, 2023

jngrad added the Wontfix label Sep 14, 2023

jngrad closed this as completed Sep 14, 2023
