Bug: Invalid memory access occurred during AMReX::GpuDevice::streamSynchronize #5065

Closed
archermarx opened this issue Jul 19, 2024 · 10 comments

@archermarx (Contributor) commented Jul 19, 2024

Hi,

When running electrostatic simulations on NVIDIA H100 nodes (further details in #5036), I encounter the following error during the final simulation step:

amrex::Abort::3::CUDA error 700 in file /home/marksta/src/warpx/build/_deps/fetchedamrex-src/Src/Base/AMReX_GpuDevice.cpp
line 648: an illegal memory access was encountered !!!
SIGABRT

This points to the following section of code:

amrex/Src/Base/AMReX_GpuDevice.cpp::648

void
Device::streamSynchronize () noexcept
{
#ifdef AMREX_USE_SYCL
    auto& q = streamQueue();
    try {
        q.wait_and_throw();
    } catch (sycl::exception const& ex) {
        amrex::Abort(std::string("streamSynchronize: ")+ex.what()+"!!!!!");
    }
#else
    AMREX_HIP_OR_CUDA( AMREX_HIP_SAFE_CALL(hipStreamSynchronize(gpuStream()));,
                       AMREX_CUDA_SAFE_CALL(cudaStreamSynchronize(gpuStream())); ) // error is occurring here I think
#endif
}

Looking at the attached backtrace file, I can see that this call ultimately originates from the final half velocity push in WarpXEvolve.cpp

WarpXEvolve.cpp::177

  if (cur_time + dt[0] >= stop_time - 1.e-3*dt[0] || step == numsteps_max-1) {
    // At the end of last step, push p by 0.5*dt to synchronize
    FillBoundaryE(guard_cells.ng_FieldGather);
    FillBoundaryB(guard_cells.ng_FieldGather);
    if (fft_do_time_averaging)
    {
        FillBoundaryE_avg(guard_cells.ng_FieldGather);
        FillBoundaryB_avg(guard_cells.ng_FieldGather);
    }
    UpdateAuxilaryData();
    FillBoundaryAux(guard_cells.ng_UpdateAux);
    for (int lev = 0; lev <= finest_level; ++lev) {
        mypc->PushP(lev, 0.5_rt*dt[lev],   // error starts here
                    *Efield_aux[lev][0],*Efield_aux[lev][1],
                    *Efield_aux[lev][2],
                    *Bfield_aux[lev][0],*Bfield_aux[lev][1],
                    *Bfield_aux[lev][2]);
    }
    is_synchronized = true;
}

Commenting out the call to PushP fixes things. Looking into it further, I found that this block of code inside PushP was responsible; commenting out just this block also made the error go away.

Particles/PhysicalParticleContainer.cpp::2652

if (!t_do_not_gather){
    // first gather E and B to the particle positions
    doGatherShapeN(xp, yp, zp, Exp, Eyp, Ezp, Bxp, Byp, Bzp,
                   ex_arr, ey_arr, ez_arr, bx_arr, by_arr, bz_arr,
                   ex_type, ey_type, ez_type, bx_type, by_type, bz_type,
                   dinv, xyzmin, lo, n_rz_azimuthal_modes,
                   nox, galerkin_interpolation);
}

This is all fine, but it seems tangential to the underlying problem, which is that a call to amrex::Gpu::streamSynchronize is failing for some reason. However, if I insert a manual streamSynchronize into the main Evolve loop, executed every step, the simulation proceeds until about 1/3 to 1/2 of the way through (the exact point varies run to run), at which point steps begin taking a minute or more, and I do not get the same error.
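
For reference, the debugging change I mean is essentially the following (a rough sketch with placeholder names, not the actual WarpX::Evolve code):

#include <AMReX_Gpu.H>

void EvolveSketch (int numsteps_max)
{
    for (int step = 0; step < numsteps_max; ++step) {
        DoOneStep(step);                    // placeholder for the existing step body
        amrex::Gpu::streamSynchronize();    // added for debugging only: forces any
                                            // deferred kernel error to surface here
    }
}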

One last problem is that, even when the error does not occur, the simulation never actually exits, and instead just hangs after outputting AMReX profiling information. I haven't been able to figure out why this is occurring.

Some simulation details:

  • The error occurs with anywhere from 1 to 8 NVIDIA H100 GPUs
  • The simulations are electrostatic uniform-plasma runs with periodic boundary conditions
  • The workload is 128x128x128 cells and 8x8x8 particles per cell (electrons only)
  • SLURM settings are:
#!/bin/bash
#SBATCH --job-name=uniform
#SBATCH --account=###
#SBATCH --partition=###
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=8
#SBATCH --gpus-per-task=h100:1
#SBATCH --gpu-bind=single:1
#SBATCH --time=0-01:00:00
#SBATCH --mem-per-gpu=80g
#SBATCH -o output.log
#SBATCH -e error.log
#SBATCH --mail-type=END,FAIL
  • The PICMI input file is given below. It takes arguments specifying the number of processes, particles per cell, and grid size, e.g.:
python3 picmi.py --numprocs 2 2 2 --particles 8 8 8 --grid 128 128 128
max_steps = 100
diagnostic_interval = max_steps

import argparse
parser = argparse.ArgumentParser()
parser.add_argument('--grid', nargs=3, type=int, required=True)
parser.add_argument('--particles', nargs=3, type=int, required=True)
parser.add_argument('--numprocs', nargs=3, type=int)
args = parser.parse_args()

# --- Grid
nx = args.grid[0]
ny = args.grid[1]
nz = args.grid[2]

xmin = -20e-6
xmax = +20e-6
ymin = -20e-6
ymax = +20e-6
zmin = -20e-6
zmax = +20e-6

particles_per_cell_each_dim = args.particles

if args.numprocs is not None:
    numprocs = args.numprocs
else:
    numprocs = [1,1,1]

from pywarpx import picmi

plasma_density = 1.e25
plasma_thermal_velocity = 0.01*picmi.constants.c

uniform_plasma = picmi.UniformDistribution(
    density = plasma_density,
    lower_bound = [xmin, ymin, zmin],
    upper_bound = [xmax, ymax, zmax],
    rms_velocity = [plasma_thermal_velocity, plasma_thermal_velocity, plasma_thermal_velocity]
)

electrons = picmi.Species(particle_type='electron', name='electrons', initial_distribution=uniform_plasma)

grid = picmi.Cartesian3DGrid(
    number_of_cells = [nx, ny, nz],
    lower_bound = [xmin, ymin, zmin],
    upper_bound = [xmax, ymax, zmax],
    lower_boundary_conditions = ['periodic', 'periodic', 'periodic'],
    upper_boundary_conditions = ['periodic', 'periodic', 'periodic'],
)

solver = picmi.ElectrostaticSolver(grid=grid, required_precision=1e-5)

sim = picmi.Simulation(solver = solver,
    max_steps = max_steps,
    verbose = 1,
    warpx_current_deposition_algo = 'direct',
    warpx_numprocs = numprocs,
    warpx_amrex_use_gpu_aware_mpi = True,
    time_step_size = 5e-12,
)

sim.add_species(electrons,
                layout = picmi.GriddedLayout(n_macroparticle_per_cell=particles_per_cell_each_dim, grid=grid))

sim.step()
@WeiqunZhang (Member) commented

amrex::Gpu::streamSynchronize waits for async GPU kernels launched earlier to finish and then checks if there are any errors in those earlier kernels reported by the CUDA runtime.
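
As an illustration, here is a minimal standalone CUDA sketch (not WarpX or AMReX code) of this deferred-error behavior: the kernel launch itself succeeds, and the illegal access is only reported at the next synchronization point, which is exactly where amrex::Gpu::Device::streamSynchronize aborts.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void bad_write (int* p) { *p = 42; }   // will dereference a null pointer

int main ()
{
    bad_write<<<1, 1>>>(nullptr);   // asynchronous launch; returns immediately
    std::printf("after launch: %s\n", cudaGetErrorString(cudaGetLastError()));

    // The illegal access surfaces here as cudaErrorIllegalAddress (error 700),
    // far from the kernel that actually caused it.
    cudaError_t err = cudaStreamSynchronize(0);
    std::printf("after sync:   %s\n", cudaGetErrorString(err));
    return 0;
}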

Could you provide a C++ inputs file that I can test without using Python?

@archermarx (Contributor, Author) commented

Certainly, here's the inputs file generated by the Python interface, with some apparently duplicated entries removed.

max_step = 1000
warpx.verbose = 1
warpx.const_dt = 5e-12
warpx.numprocs = 2 2 2

warpx.do_electrostatic = labframe
warpx.self_fields_required_precision = 1e-05

amr.n_cell = 128 128 128
amr.max_grid_size = 32
amr.blocking_factor = 1
amr.max_level = 0
amrex.use_gpu_aware_mpi = 1

geometry.dims = 3
geometry.prob_lo = -2e-05 -2e-05 -2e-05
geometry.prob_hi = 2e-05 2e-05 2e-05
geometry.is_periodic = 1 1 1
geometry.coord_sys = 0

boundary.field_lo = periodic periodic periodic
boundary.field_hi = periodic periodic periodic
boundary.particle_lo = periodic periodic periodic
boundary.particle_hi = periodic periodic periodic

algo.current_deposition = direct
algo.particle_shape = 1
particles.species_names = electrons

electrons.mass = m_e
electrons.charge = -q_e
electrons.injection_style = nuniformpercell
electrons.initialize_self_fields = 0
electrons.num_particles_per_cell_each_dim = 8 8 8
electrons.xmin = -2e-05
electrons.xmax = 2e-05
electrons.ymin = -2e-05
electrons.ymax = 2e-05
electrons.zmin = -2e-05
electrons.zmax = 2e-05
electrons.momentum_distribution_type = gaussian
electrons.ux_m = 0.0
electrons.uy_m = 0.0
electrons.uz_m = 0.0
electrons.ux_th = 0.01
electrons.uy_th = 0.01
electrons.uz_th = 0.01
electrons.profile = constant
electrons.density = 1e+25

amrex.abort_on_out_of_gpu_memory = 1
amrex.the_arena_is_managed = 0
amrex.omp_threads = nosmt
tiny_profiler.device_synchronize_around_region = 1
particles.do_tiling = 0
amrex.v = 1
amrex.verbose = 1
amrex.max_gpu_streams = 4
device.v = 0
device.verbose = 0
device.numThreads.x = 0
device.numThreads.y = 0
device.numThreads.z = 0
device.numBlocks.x = 0
device.numBlocks.y = 0
device.numBlocks.z = 0
device.graph_init = 0
device.graph_init_nodes = 10000
amrex.regtest_reduction = 0
amrex.signal_handling = 1
amrex.throw_exception = 0
amrex.call_addr2line = 1
amrex.abort_on_unused_inputs = 0
amrex.handle_sigsegv = 1
amrex.handle_sigterm = 0
amrex.handle_sigint = 1
amrex.handle_sigabrt = 1
amrex.handle_sigfpe = 1
amrex.handle_sigill = 1
amrex.fpe_trap_invalid = 0
amrex.fpe_trap_zero = 0
amrex.fpe_trap_overflow = 0
amrex.the_arena_init_size = 63697108992
amrex.the_device_arena_init_size = 8388608
amrex.the_managed_arena_init_size = 8388608
amrex.the_pinned_arena_init_size = 8388608
amrex.the_comms_arena_init_size = 8388608
amrex.the_arena_release_threshold = 9223372036854775807
amrex.the_device_arena_release_threshold = 9223372036854775807
amrex.the_managed_arena_release_threshold = 9223372036854775807
amrex.the_pinned_arena_release_threshold = 42464739328
amrex.the_comms_arena_release_threshold = 9223372036854775807
amrex.the_async_arena_release_threshold = 9223372036854775807
fab.init_snan = 0
DistributionMapping.v = 0
DistributionMapping.verbose = 0
DistributionMapping.efficiency = 0.90000000000000002
DistributionMapping.sfc_threshold = 0
DistributionMapping.node_size = 0
DistributionMapping.verbose_mapper = 0
fab.initval = nan
fab.do_initval = 0
fabarray.maxcomp = 25
amrex.mf.alloc_single_chunk = 0
vismf.v = 0
vismf.headerversion = 1
vismf.groupsets = 0
vismf.setbuf = 1
vismf.usesingleread = 0
vismf.usesinglewrite = 0
vismf.checkfilepositions = 0
vismf.usepersistentifstreams = 0
vismf.usesynchronousreads = 0
vismf.usedynamicsetselection = 1
vismf.iobuffersize = 2097152
vismf.allowsparsewrites = 1
amrex.async_out = 0
amrex.async_out_nfiles = 64
amrex.vector_growth_factor = 1.5
machine.verbose = 0
machine.very_verbose = 0
tiny_profiler.verbose = 0
tiny_profiler.v = 0
tiny_profiler.print_threshold = 1
amrex.use_profiler_syncs = 0

amr.v = 0
amr.n_proper = 1
amr.grid_eff = 0.69999999999999996
amr.refine_grid_layout = 1
amr.refine_grid_layout_x = 1
amr.refine_grid_layout_y = 1
amr.refine_grid_layout_z = 1
amr.check_input = 1
vismf.usesingleread = 1
vismf.usesinglewrite = 1
particles.particles_nfiles = 1024
particles.use_prepost = 0
particles.do_unlink = 1
particles.do_mem_efficient_sort = 1
lattice.reverse = 0

@WeiqunZhang (Member) commented

I can see an issue. For the first multigrid solve, the min and max of the rhs are 1.8095128179727603e+17 and 1.8095128179728016e+17. Since the problem has all-periodic boundaries, the matrix is singular, and the solver is having trouble with this singular system when the rhs is almost a constant.

I don't know if this is the issue you are seeing. I only tested with a much smaller setup on a single CPU core. Could you try the following change? If the hack works, we can then discuss how to implement a real fix.

--- a/Source/ablastr/fields/PoissonSolver.H
+++ b/Source/ablastr/fields/PoissonSolver.H
@@ -265,8 +265,19 @@ computePhi (amrex::Vector<amrex::MultiFab*> const & rho,
         mlmg.setAlwaysUseBNorm(always_use_bnorm);
 
         // Solve Poisson equation at lev
-        mlmg.solve( {phi[lev]}, {rho[lev]},
-                    relative_tolerance, absolute_tolerance );
+        auto rhomin = rho[lev]->min(0);
+        auto rhomax = rho[lev]->max(0);
+        {
+            amrex::Print().SetPrecision(17) << "xxxxx " << rhomin
+                                            << ", " << rhomax
+                                            << ", " << (rhomax-rhomin)/rhomin << std::endl;
+        }
+        if (std::abs(rhomax-rhomin) <= 1.e-12_rt * std::abs(rhomax+rhomin)) {
+            phi[lev]->setVal(0.0_rt);
+        } else {
+            mlmg.solve( {phi[lev]}, {rho[lev]},
+                        relative_tolerance, absolute_tolerance );
+        }
 
         // needed for solving the levels by levels:
         // - coarser level is initial guess for finer level
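
For context, this is just the standard solvability condition for the periodic Poisson problem (general theory, nothing WarpX-specific). Integrating

    ∇²φ = -ρ/ε₀

over a fully periodic domain makes the left-hand side vanish (there is no boundary for the flux to leave through), so a solution exists only if ∫ ρ dV = 0, and even then φ is determined only up to an additive constant. Here the right-hand side is a nonzero, nearly constant field (the ~1.8e17 values above), so the system is essentially inconsistent, which is presumably why the multigrid solver struggles; setting φ = 0 when the rhs is effectively constant sidesteps that.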

@WeiqunZhang (Member) commented

In this test, there are no initial fields because the particles are uniformly distributed. But the particles have ux_th = 0.01, uy_th = 0.01 and uz_th = 0.01, and the test uses warpx.const_dt = 5e-12. So in one step the rms velocity moves a particle by about 1.5e-5 m, while the domain size is only 4e-5 m. Some particles are probably so far out of bounds that a single periodic shift will not bring them back into the domain. This results in out-of-bounds array accesses, hence the invalid memory access.

Maybe you need to use a smaller const_dt. Maybe you need to change the initial setup. Maybe WarpX needs a way to let the simulation start with a smaller dt and gradually increase it to warpx.const_dt. @RemiLehe
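
Concretely (a small standalone sketch using the numbers above, not WarpX code; the 3-sigma line is just to illustrate the tail):

#include <cstdio>

int main ()
{
    const double c  = 299792458.0;   // speed of light [m/s]
    const double v  = 0.01 * c;      // rms thermal velocity (ux_th = 0.01)
    const double dt = 5.0e-12;       // warpx.const_dt [s]
    const double L  = 4.0e-5;        // domain length [m]
    const double dx = L / 128.0;     // cell size [m]

    const double step = v * dt;      // rms displacement per step, ~1.5e-5 m
    std::printf("rms displacement per step: %.2e m = %.0f cells = %.2f domain lengths\n",
                step, step/dx, step/L);
    // Particles a few sigma out in the Maxwellian tail move more than one
    // domain length per step, so one periodic shift cannot bring them back.
    std::printf("3-sigma displacement:      %.2e m = %.2f domain lengths\n",
                3.0*step, 3.0*step/L);
    return 0;
}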

@archermarx (Contributor, Author) commented Jul 20, 2024 via email

@archermarx (Contributor, Author) commented

Looks like reducing the timestep fixed the primary issue, thanks! I didn't notice much of a change after applying your precision fix, but I have run into this sort of problem before: it's quite common to initialize a domain with a uniform plasma that is then excited by a perturbation, so it would be good if the ES solver could handle uniform plasmas gracefully. I also noticed that in the first ten steps of these uniform-plasma tests, each step took between 2 and 10 seconds, versus about 0.4 seconds per step once the simulation had progressed a bit; I suspect that may be related.

Unfortunately, I am still having issues with the simulation not finalizing: it hangs just before printing the expected "AMReX finalized" message when running the ES simulation, but not when running the EM one.

@WeiqunZhang (Member) commented

I cannot reproduce the hang before "AMReX finalized". Maybe it's in the Python part?

@archermarx (Contributor, Author) commented Jul 20, 2024

Unfortunately not. Running the binaries directly still exhibits the problem on my system. I will try running on CPU only later to see whether that is the issue.

@WeiqunZhang (Member) commented

We need to figure out where it hangs. If your job is interactive, pressing Ctrl-C might produce backtrace files. If it's a batch job, you need to send a signal to the hanging job to terminate it, and hopefully backtrace files are then produced. If it's SLURM, you can use scancel --signal=INT, or when you submit the job you can use sbatch --signal=INT@300 (where 300 means SIGINT is sent 300 seconds before the time limit).

If you are using CMake to build the code, you probably want to build with RelWithDebInfo rather than Release, so that some debug information ends up in the executable.

I don't know how Python handles signals. You may need to run the executable directly, without Python in the middle, for the signal handling to work properly.

@archermarx (Contributor, Author) commented Jul 22, 2024

OK, I found that reducing the timestep further, to 1e-14 seconds, seemingly fixes all of the problems. It might be nice to emit a warning when the user has picked a timestep that is likely to cause problems. I can make a PR for that if it seems reasonable.
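
For illustration, the kind of check I have in mind is roughly the following (a hypothetical sketch with placeholder names, not existing WarpX code):

#include <iostream>

// Warn if thermal particles would cross more than one cell per step.
void WarnIfTimeStepTooLarge (double dt, double dx, double u_th_over_c)
{
    constexpr double c = 299792458.0;                 // speed of light [m/s]
    const double displacement = u_th_over_c * c * dt; // rms displacement per step
    if (displacement > dx) {
        std::cerr << "Warning: thermal particles move ~" << displacement / dx
                  << " cells per step; consider reducing the time step.\n";
    }
}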
