Bug: Invalid memory access occurred during AMReX::GpuDevice::streamSynchronize #5065
Comments
Could you provide an inputs file for C++ that I can test without using python?
Certainly, here's the file generated by the python interface, with some apparently duplicated outputs removed.
I can see an issue. For the first multigrid solver, the min and max of the rhs are … I don't know if this is the issue you are seeing. I only tested it with a much smaller setup on a single CPU core. Could you try with the following change? If the hack works, we can then discuss how to implement a real fix.
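(The suggested change itself is not reproduced above. For illustration only, here is a minimal sketch of the kind of hack this could be, assuming the problem is that a nearly-zero rhs makes a purely relative MLMG tolerance unreachable; the function, boundary conditions, and tolerance values below are hypothetical, not the actual patch.)

```cpp
// Hypothetical sketch, not the actual change from this thread: give MLMG an
// absolute tolerance scaled to the rhs so that a nearly-zero right-hand side
// (uniform plasma, essentially zero net charge) does not stall the solve.
#include <AMReX_MLPoisson.H>
#include <AMReX_MLMG.H>
#include <AMReX_MultiFab.H>
#include <algorithm>

void solve_poisson (amrex::MultiFab& phi, const amrex::MultiFab& rhs,
                    const amrex::Geometry& geom,
                    const amrex::BoxArray& ba,
                    const amrex::DistributionMapping& dm)
{
    amrex::MLPoisson linop({geom}, {ba}, {dm});
    linop.setDomainBC({AMREX_D_DECL(amrex::LinOpBCType::Periodic,
                                    amrex::LinOpBCType::Periodic,
                                    amrex::LinOpBCType::Periodic)},
                      {AMREX_D_DECL(amrex::LinOpBCType::Periodic,
                                    amrex::LinOpBCType::Periodic,
                                    amrex::LinOpBCType::Periodic)});
    linop.setLevelBC(0, nullptr);  // fully periodic: no inhomogeneous BC data

    amrex::MLMG mlmg(linop);

    // If max|rhs| is essentially zero, a pure relative tolerance may never be
    // reached; a floor on the absolute tolerance lets the solver terminate.
    const amrex::Real rhs_max = rhs.norm0();
    const amrex::Real abstol  = std::max(amrex::Real(1e-12)*rhs_max,
                                         amrex::Real(1e-30));

    mlmg.solve({&phi}, {&rhs}, amrex::Real(1e-11), abstol);
}
```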
In this test, there are no initial fields because all particles are uniformly distributed, but the particles do have … Maybe you need to use a smaller const_dt. Maybe you need to change the initial setup. Maybe WarpX needs to implement a way to allow the simulation to start with a smaller dt and then gradually increase it to the full const_dt.
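(For concreteness, a rough, hypothetical sketch of what such a dt ramp could look like; this is not an existing WarpX feature, and the function name and parameters are illustrative only.)

```cpp
// Hypothetical illustration only: grow dt geometrically from a fraction of the
// requested const_dt up to the full value over the first ramp_steps steps.
#include <cmath>

double ramped_dt (int step, double const_dt,
                  double initial_fraction = 0.1, int ramp_steps = 100)
{
    if (step >= ramp_steps) { return const_dt; }
    // Geometric interpolation: step 0 uses initial_fraction*const_dt,
    // step ramp_steps reaches const_dt.
    const double f = static_cast<double>(step) / ramp_steps;
    return const_dt * initial_fraction * std::pow(1.0/initial_fraction, f);
}
```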
Oh, interesting thoughts. I hadn't considered that last point. I'll give it a try and report back.
Looks like reducing the timestep fixed the primary issue, thanks! I didn't notice much of a change from implementing your precision fix, but I have run into problems with this sort of thing before. It's quite common to initialize domains with uniform plasmas which may then become excited by a perturbation, so it would be good if the ES solver could handle uniform plasmas gracefully. I noticed that in the first ten iterations of these uniform-plasma tests, each timestep took between 2 and 10 seconds, versus 0.4 seconds per step once the simulation had progressed a bit; I suspect this may be related. Unfortunately, I am still having issues with the simulation not finalizing: it hangs just before outputting the expected "AMReX finalized" when running the ES simulation, but not when running EM.
I cannot reproduce the hang before "amrex finalized". Maybe it's in the python part?
Unfortunately not. Running it with the binaries directly still exhibits this problem on my system. I will try running with CPU only later to see if that is the issue.
We need to figure out where it hangs. If your job is interactive, pressing … If you are using cmake to build the code, you probably want to build it with … I don't understand how python handles signals. Maybe you need to run the executable directly, without python in the middle, for the signal handling to work properly.
OK, I found that reducing the timestep further, to 1e-14 seconds, seemingly fixes all of the problems. It might be nice to emit a warning if the user has picked a timestep that is likely to result in problems. I can make a PR to do that, if that seems reasonable.
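(A rough sketch of what such a check might look like; this is hypothetical and not the actual PR. The function name, threshold, and use of a simple electron plasma frequency estimate are all assumptions made here for illustration.)

```cpp
// Hypothetical sketch of the proposed warning, not actual WarpX code: warn if
// omega_pe * dt is large, i.e. the electron plasma oscillation is under-resolved.
#include <AMReX.H>
#include <cmath>

void warn_if_dt_underresolves_plasma (double dt, double electron_density /* m^-3 */)
{
    constexpr double q_e  = 1.602176634e-19;   // elementary charge [C]
    constexpr double m_e  = 9.1093837015e-31;  // electron mass [kg]
    constexpr double eps0 = 8.8541878128e-12;  // vacuum permittivity [F/m]

    const double omega_pe = std::sqrt(electron_density * q_e*q_e / (eps0 * m_e));

    // The explicit leapfrog stability limit is omega_pe*dt < 2; the 0.2 used
    // here is an arbitrary safety margin chosen for illustration.
    if (omega_pe * dt > 0.2) {
        amrex::Warning("The chosen dt is large compared to the electron plasma "
                       "period; consider reducing const_dt.");
    }
}
```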
Original issue description
Hi,
When running electrostatic simulations on NVIDIA H100 nodes (further details described in #5036), I encounter the following error during the final simulation step:
This points to this section of code: `Amrex/Src/Base/AMReX_GpuDevice.cpp:648`. Looking at the attached backtrace file, I can see that this call ultimately originates from the final half velocity push in `WarpXEvolve.cpp:177`. Commenting out the call to PushP fixes things. Looking into that further, I found that this block of code in PushP (`Particles/PhysicalParticleContainer.cpp:2652`) was responsible, and commenting it out fixed things again.
This is all fine, but it does seem tangential to the problem, which is that a call to `amrex::Gpu::streamSynchronize` is failing for some reason. However, if I insert a manual `streamSynchronize` into the main `Evolve` loop, to be executed every step, the simulation proceeds until about 1/3 to 1/2 of the simulation duration (the amount varies run-to-run), at which point steps begin taking a minute or more, and I do not obtain the same error.

One last problem is that, even when the error does not occur, the simulation never actually exits, and instead just hangs after outputting the AMReX profiling information. I haven't been able to figure out why this is occurring.
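(For reference, a minimal sketch of the per-step synchronization described above, not the actual WarpX Evolve loop: forcing a synchronize plus error check every step makes asynchronous GPU errors surface near the step that launched the faulty kernel, rather than at a later synchronization point such as the final PushP.)

```cpp
// Minimal sketch, not WarpX code: synchronize and check for deferred GPU errors
// at the end of every step so that an asynchronous kernel failure is reported
// close to where it actually happened.
#include <AMReX_GpuDevice.H>
#include <AMReX_GpuError.H>

void evolve_loop (int nsteps)
{
    for (int step = 0; step < nsteps; ++step) {
        // ... particle push, charge/current deposition, field solve ...

        amrex::Gpu::streamSynchronize();   // wait for all GPU work queued this step
        AMREX_GPU_ERROR_CHECK();           // report any deferred GPU error here
    }
}
```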
Some simulation details: …