Bug: Invalid memory access occurred during AMReX::GpuDevice::streamSynchronize #5065

Closed
archermarx opened this issue Jul 19, 2024 · 10 comments

@archermarx (Contributor) commented Jul 19, 2024

Hi,

When running electrostatic simulations on NVIDIA H100 nodes (further details in #5036), I encounter the following error during the final simulation step:

amrex::Abort::3::CUDA error 700 in file /home/marksta/src/warpx/build/_deps/fetchedamrex-src/Src/Base/AMReX_GpuDevice.cpp
line 648: an illegal memory access was encountered !!!
SIGABRT

This points to the following section of code:

amrex/Src/Base/AMReX_GpuDevice.cpp::648

void
Device::streamSynchronize () noexcept
{
#ifdef AMREX_USE_SYCL
    auto& q = streamQueue();
    try {
        q.wait_and_throw();
    } catch (sycl::exception const& ex) {
        amrex::Abort(std::string("streamSynchronize: ")+ex.what()+"!!!!!");
    }
#else
    AMREX_HIP_OR_CUDA( AMREX_HIP_SAFE_CALL(hipStreamSynchronize(gpuStream()));,
                       AMREX_CUDA_SAFE_CALL(cudaStreamSynchronize(gpuStream())); ) // error is occurring here I think
#endif
}

Looking at the attached backtrace file, I can see that this call ultimately originates from the final half velocity push in WarpXEvolve.cpp

WarpXEvolve.cpp::177

  if (cur_time + dt[0] >= stop_time - 1.e-3*dt[0] || step == numsteps_max-1) {
    // At the end of last step, push p by 0.5*dt to synchronize
    FillBoundaryE(guard_cells.ng_FieldGather);
    FillBoundaryB(guard_cells.ng_FieldGather);
    if (fft_do_time_averaging)
    {
        FillBoundaryE_avg(guard_cells.ng_FieldGather);
        FillBoundaryB_avg(guard_cells.ng_FieldGather);
    }
    UpdateAuxilaryData();
    FillBoundaryAux(guard_cells.ng_UpdateAux);
    for (int lev = 0; lev <= finest_level; ++lev) {
        mypc->PushP(lev, 0.5_rt*dt[lev],   // error starts here
                    *Efield_aux[lev][0],*Efield_aux[lev][1],
                    *Efield_aux[lev][2],
                    *Bfield_aux[lev][0],*Bfield_aux[lev][1],
                    *Bfield_aux[lev][2]);
    }
    is_synchronized = true;
}

Commenting out the call to PushP fixes things. Looking into it further, I found that this block of code inside PushP was responsible; commenting out just this block also made the error go away.

Particles/PhysicalParticleContainer.cpp::2652

if (!t_do_not_gather){
    // first gather E and B to the particle positions
    doGatherShapeN(xp, yp, zp, Exp, Eyp, Ezp, Bxp, Byp, Bzp,
                   ex_arr, ey_arr, ez_arr, bx_arr, by_arr, bz_arr,
                   ex_type, ey_type, ez_type, bx_type, by_type, bz_type,
                   dinv, xyzmin, lo, n_rz_azimuthal_modes,
                   nox, galerkin_interpolation);
}

This is all fine, but it seems tangential to the underlying problem, which is that a call to amrex::Gpu::streamSynchronize is failing for some reason. However, if I insert a manual streamSynchronize into the main Evolve loop, executed every step, the simulation proceeds until about 1/3 to 1/2 of the way through (the exact point varies run to run), at which point steps begin taking a minute or more, and I do not get the same error.
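
For reference, the debugging change I mean is essentially the following (a rough sketch with placeholder names, not the actual WarpX::Evolve code):

#include <AMReX_Gpu.H>

void EvolveSketch (int numsteps_max)
{
    for (int step = 0; step < numsteps_max; ++step) {
        DoOneStep(step);                    // placeholder for the existing step body
        amrex::Gpu::streamSynchronize();    // added for debugging only: forces any
                                            // deferred kernel error to surface here
    }
}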

One last problem is that, even when the error does not occur, the simulation never actually exits, and instead just hangs after outputting AMReX profiling information. I haven't been able to figure out why this is occurring.

Some simulation details:

  • The error occurs with anywhere from 1 to 8 NVIDIA H100 GPUs
  • The simulations are electrostatic uniform-plasma runs with periodic boundary conditions
  • The workload is 128x128x128 cells and 8x8x8 particles per cell (electrons only)
  • SLURM settings are:
#!/bin/bash
#SBATCH --job-name=uniform
#SBATCH --account=###
#SBATCH --partition=###
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=8
#SBATCH --gpus-per-task=h100:1
#SBATCH --gpu-bind=single:1
#SBATCH --time=0-01:00:00
#SBATCH --mem-per-gpu=80g
#SBATCH -o output.log
#SBATCH -e error.log
#SBATCH --mail-type=END,FAIL
  • The PICMI input file is given below. It takes arguments specifying the number of processes, particles per cell, and grid size, e.g.:
python3 picmi.py --numprocs 2 2 2 --particles 8 8 8 --grid 128 128 128
max_steps = 100
diagnostic_interval = max_steps

import argparse
parser = argparse.ArgumentParser()
parser.add_argument('--grid', nargs=3, type=int, required=True)
parser.add_argument('--particles', nargs=3, type=int, required=True)
parser.add_argument('--numprocs', nargs=3, type=int)
args = parser.parse_args()

# --- Grid
nx = args.grid[0]
ny = args.grid[1]
nz = args.grid[2]

xmin = -20e-6
xmax = +20e-6
ymin = -20e-6
ymax = +20e-6
zmin = -20e-6
zmax = +20e-6

particles_per_cell_each_dim = args.particles

if args.numprocs is not None:
    numprocs = args.numprocs
else:
    numprocs = [1,1,1]

from pywarpx import picmi

plasma_density = 1.e25
plasma_thermal_velocity = 0.01*picmi.constants.c

uniform_plasma = picmi.UniformDistribution(
    density = plasma_density,
    lower_bound = [xmin, ymin, zmin],
    upper_bound = [xmax, ymax, zmax],
    rms_velocity = [plasma_thermal_velocity, plasma_thermal_velocity, plasma_thermal_velocity]
)

electrons = picmi.Species(particle_type='electron', name='electrons', initial_distribution=uniform_plasma)

grid = picmi.Cartesian3DGrid(
    number_of_cells = [nx, ny, nz],
    lower_bound = [xmin, ymin, zmin],
    upper_bound = [xmax, ymax, zmax],
    lower_boundary_conditions = ['periodic', 'periodic', 'periodic'],
    upper_boundary_conditions = ['periodic', 'periodic', 'periodic'],
)

solver = picmi.ElectrostaticSolver(grid=grid, required_precision=1e-5)

sim = picmi.Simulation(solver = solver,
    max_steps = max_steps,
    verbose = 1,
    warpx_current_deposition_algo = 'direct',
    warpx_numprocs = numprocs,
    warpx_amrex_use_gpu_aware_mpi = True,
    time_step_size = 5e-12,
)

sim.add_species(electrons,
                layout = picmi.GriddedLayout(n_macroparticle_per_cell=particles_per_cell_each_dim, grid=grid))

sim.step()
@WeiqunZhang (Member) commented

amrex::Gpu::streamSynchronize waits for async GPU kernels launched earlier to finish and then checks if there are any errors in those earlier kernels reported by the CUDA runtime.
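
As an illustration, here is a minimal standalone CUDA sketch (not WarpX or AMReX code) of this deferred-error behavior: the kernel launch itself succeeds, and the illegal access is only reported at the next synchronization point, which is exactly where amrex::Gpu::Device::streamSynchronize aborts.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void bad_write (int* p) { *p = 42; }   // will dereference a null pointer

int main ()
{
    bad_write<<<1, 1>>>(nullptr);   // asynchronous launch; returns immediately
    std::printf("after launch: %s\n", cudaGetErrorString(cudaGetLastError()));

    // The illegal access surfaces here as cudaErrorIllegalAddress (error 700),
    // far from the kernel that actually caused it.
    cudaError_t err = cudaStreamSynchronize(0);
    std::printf("after sync:   %s\n", cudaGetErrorString(err));
    return 0;
}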

Could you provide a C++ inputs file that I can test without using Python?

@archermarx (Contributor, Author) commented

Certainly, here's the inputs file generated by the Python interface, with some apparently duplicated entries removed.

max_step = 1000
warpx.verbose = 1
warpx.const_dt = 5e-12
warpx.numprocs = 2 2 2

warpx.do_electrostatic = labframe
warpx.self_fields_required_precision = 1e-05

amr.n_cell = 128 128 128
amr.max_grid_size = 32
amr.blocking_factor = 1
amr.max_level = 0
amrex.use_gpu_aware_mpi = 1

geometry.dims = 3
geometry.prob_lo = -2e-05 -2e-05 -2e-05
geometry.prob_hi = 2e-05 2e-05 2e-05
geometry.is_periodic = 1 1 1
geometry.coord_sys = 0

boundary.field_lo = periodic periodic periodic
boundary.field_hi = periodic periodic periodic
boundary.particle_lo = periodic periodic periodic
boundary.particle_hi = periodic periodic periodic

algo.current_deposition = direct
algo.particle_shape = 1
particles.species_names = electrons

electrons.mass = m_e
electrons.charge = -q_e
electrons.injection_style = nuniformpercell
electrons.initialize_self_fields = 0
electrons.num_particles_per_cell_each_dim = 8 8 8
electrons.xmin = -2e-05
electrons.xmax = 2e-05
electrons.ymin = -2e-05
electrons.ymax = 2e-05
electrons.zmin = -2e-05
electrons.zmax = 2e-05
electrons.momentum_distribution_type = gaussian
electrons.ux_m = 0.0
electrons.uy_m = 0.0
electrons.uz_m = 0.0
electrons.ux_th = 0.01
electrons.uy_th = 0.01
electrons.uz_th = 0.01
electrons.profile = constant
electrons.density = 1e+25

amrex.abort_on_out_of_gpu_memory = 1
amrex.the_arena_is_managed = 0
amrex.omp_threads = nosmt
tiny_profiler.device_synchronize_around_region = 1
particles.do_tiling = 0
amrex.v = 1
amrex.verbose = 1
amrex.max_gpu_streams = 4
device.v = 0
device.verbose = 0
device.numThreads.x = 0
device.numThreads.y = 0
device.numThreads.z = 0
device.numBlocks.x = 0
device.numBlocks.y = 0
device.numBlocks.z = 0
device.graph_init = 0
device.graph_init_nodes = 10000
amrex.regtest_reduction = 0
amrex.signal_handling = 1
amrex.throw_exception = 0
amrex.call_addr2line = 1
amrex.abort_on_unused_inputs = 0
amrex.handle_sigsegv = 1
amrex.handle_sigterm = 0
amrex.handle_sigint = 1
amrex.handle_sigabrt = 1
amrex.handle_sigfpe = 1
amrex.handle_sigill = 1
amrex.fpe_trap_invalid = 0
amrex.fpe_trap_zero = 0
amrex.fpe_trap_overflow = 0
amrex.the_arena_init_size = 63697108992
amrex.the_device_arena_init_size = 8388608
amrex.the_managed_arena_init_size = 8388608
amrex.the_pinned_arena_init_size = 8388608
amrex.the_comms_arena_init_size = 8388608
amrex.the_arena_release_threshold = 9223372036854775807
amrex.the_device_arena_release_threshold = 9223372036854775807
amrex.the_managed_arena_release_threshold = 9223372036854775807
amrex.the_pinned_arena_release_threshold = 42464739328
amrex.the_comms_arena_release_threshold = 9223372036854775807
amrex.the_async_arena_release_threshold = 9223372036854775807
fab.init_snan = 0
DistributionMapping.v = 0
DistributionMapping.verbose = 0
DistributionMapping.efficiency = 0.90000000000000002
DistributionMapping.sfc_threshold = 0
DistributionMapping.node_size = 0
DistributionMapping.verbose_mapper = 0
fab.initval = nan
fab.do_initval = 0
fabarray.maxcomp = 25
amrex.mf.alloc_single_chunk = 0
vismf.v = 0
vismf.headerversion = 1
vismf.groupsets = 0
vismf.setbuf = 1
vismf.usesingleread = 0
vismf.usesinglewrite = 0
vismf.checkfilepositions = 0
vismf.usepersistentifstreams = 0
vismf.usesynchronousreads = 0
vismf.usedynamicsetselection = 1
vismf.iobuffersize = 2097152
vismf.allowsparsewrites = 1
amrex.async_out = 0
amrex.async_out_nfiles = 64
amrex.vector_growth_factor = 1.5
machine.verbose = 0
machine.very_verbose = 0
tiny_profiler.verbose = 0
tiny_profiler.v = 0
tiny_profiler.print_threshold = 1
amrex.use_profiler_syncs = 0

amr.v = 0
amr.n_proper = 1
amr.grid_eff = 0.69999999999999996
amr.refine_grid_layout = 1
amr.refine_grid_layout_x = 1
amr.refine_grid_layout_y = 1
amr.refine_grid_layout_z = 1
amr.check_input = 1
vismf.usesingleread = 1
vismf.usesinglewrite = 1
particles.particles_nfiles = 1024
particles.use_prepost = 0
particles.do_unlink = 1
particles.do_mem_efficient_sort = 1
lattice.reverse = 0

@WeiqunZhang (Member) commented

I can see an issue. For the first multigrid solve, the min and max of the rhs are 1.8095128179727603e+17 and 1.8095128179728016e+17. Since the problem has all-periodic boundaries, the matrix is singular, and the solver is having trouble with this singular system when the rhs is almost a constant.

I don't know if this is the issue you are seeing. I only tested with a much smaller setup on a single CPU core. Could you try the following change? If the hack works, we can then discuss how to implement a real fix.

--- a/Source/ablastr/fields/PoissonSolver.H
+++ b/Source/ablastr/fields/PoissonSolver.H
@@ -265,8 +265,19 @@ computePhi (amrex::Vector<amrex::MultiFab*> const & rho,
         mlmg.setAlwaysUseBNorm(always_use_bnorm);
 
         // Solve Poisson equation at lev
-        mlmg.solve( {phi[lev]}, {rho[lev]},
-                    relative_tolerance, absolute_tolerance );
+        auto rhomin = rho[lev]->min(0);
+        auto rhomax = rho[lev]->max(0);
+        {
+            amrex::Print().SetPrecision(17) << "xxxxx " << rhomin
+                                            << ", " << rhomax
+                                            << ", " << (rhomax-rhomin)/rhomin << std::endl;
+        }
+        if (std::abs(rhomax-rhomin) <= 1.e-12_rt * std::abs(rhomax+rhomin)) {
+            phi[lev]->setVal(0.0_rt);
+        } else {
+            mlmg.solve( {phi[lev]}, {rho[lev]},
+                        relative_tolerance, absolute_tolerance );
+        }
 
         // needed for solving the levels by levels:
         // - coarser level is initial guess for finer level
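
For context, this is just the standard solvability condition for the periodic Poisson problem (general theory, nothing WarpX-specific). Integrating

    ∇²φ = -ρ/ε₀

over a fully periodic domain makes the left-hand side vanish (there is no boundary for the flux to leave through), so a solution exists only if ∫ ρ dV = 0, and even then φ is determined only up to an additive constant. Here the right-hand side is a nonzero, nearly constant field (the ~1.8e17 values above), so the system is essentially inconsistent, which is presumably why the multigrid solver struggles; setting φ = 0 when the rhs is effectively constant sidesteps that.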

@WeiqunZhang (Member) commented

In this test, there are no initial fields because the particles are uniformly distributed. But the particles have ux_th = 0.01, uy_th = 0.01 and uz_th = 0.01, and the test uses warpx.const_dt = 5e-12. So in one step the rms velocity moves a particle by about 1.5e-5 m, while the domain size is only 4e-5 m. Some particles are probably so far out of bounds that a single periodic shift will not bring them back into the domain. This results in out-of-bounds array accesses, hence the invalid memory access.

Maybe you need to use a smaller const_dt. Maybe you need to change the initial setup. Maybe WarpX needs a way to let the simulation start with a smaller dt and gradually increase it to warpx.const_dt. @RemiLehe
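
Concretely (a small standalone sketch using the numbers above, not WarpX code; the 3-sigma line is just to illustrate the tail):

#include <cstdio>

int main ()
{
    const double c  = 299792458.0;   // speed of light [m/s]
    const double v  = 0.01 * c;      // rms thermal velocity (ux_th = 0.01)
    const double dt = 5.0e-12;       // warpx.const_dt [s]
    const double L  = 4.0e-5;        // domain length [m]
    const double dx = L / 128.0;     // cell size [m]

    const double step = v * dt;      // rms displacement per step, ~1.5e-5 m
    std::printf("rms displacement per step: %.2e m = %.0f cells = %.2f domain lengths\n",
                step, step/dx, step/L);
    // Particles a few sigma out in the Maxwellian tail move more than one
    // domain length per step, so one periodic shift cannot bring them back.
    std::printf("3-sigma displacement:      %.2e m = %.2f domain lengths\n",
                3.0*step, 3.0*step/L);
    return 0;
}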

@archermarx (Contributor, Author) commented Jul 20, 2024 via email

@archermarx (Contributor, Author) commented

Looks like reducing the timestep fixed the primary issue, thanks! I didn't notice much of a change after applying your precision fix, but I have run into this sort of problem before: it's quite common to initialize a domain with a uniform plasma that is then excited by a perturbation, so it would be good if the ES solver could handle uniform plasmas gracefully. I also noticed that in the first ten steps of these uniform-plasma tests, each step took between 2 and 10 seconds, versus about 0.4 seconds per step once the simulation had progressed a bit; I suspect that may be related.

Unfortunately, I am still having issues with the simulation not finalizing: it hangs just before printing the expected "AMReX finalized" message when running the ES simulation, but not when running the EM one.

@WeiqunZhang (Member) commented

I cannot reproduce the hang before "AMReX finalized". Maybe it's in the Python part?

@archermarx (Contributor, Author) commented Jul 20, 2024

Unfortunately not. Running the binaries directly still exhibits the problem on my system. I will try running on CPU only later to see whether that is the issue.

@WeiqunZhang (Member) commented

We need to figure out where it hangs. If your job is interactive, pressing Ctrl-C might produce backtrace files. If it's a batch job, you need to send a signal to the hanging job to terminate it, and hopefully backtrace files are then produced. If it's SLURM, you can use scancel --signal=INT, or when you submit the job you can use sbatch --signal=INT@300 (where 300 means SIGINT is sent 300 seconds before the time limit).

If you are using CMake to build the code, you probably want to build with RelWithDebInfo rather than Release, so that some debug information ends up in the executable.

I don't know how Python handles signals. You may need to run the executable directly, without Python in the middle, for the signal handling to work properly.

@archermarx (Contributor, Author) commented Jul 22, 2024

OK, I found that reducing the timestep further, to 1e-14 seconds, seemingly fixes all of the problems. It might be nice to emit a warning when the user has picked a timestep that is likely to cause problems. I can make a PR for that if it seems reasonable.
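
For illustration, the kind of check I have in mind is roughly the following (a hypothetical sketch with placeholder names, not existing WarpX code):

#include <iostream>

// Warn if thermal particles would cross more than one cell per step.
void WarnIfTimeStepTooLarge (double dt, double dx, double u_th_over_c)
{
    constexpr double c = 299792458.0;                 // speed of light [m/s]
    const double displacement = u_th_over_c * c * dt; // rms displacement per step
    if (displacement > dx) {
        std::cerr << "Warning: thermal particles move ~" << displacement / dx
                  << " cells per step; consider reducing the time step.\n";
    }
}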
