
Use Collection for stack allocator backend #169

Merged: 8 commits into celeritas-project:master from stack-allocator-backend, Mar 11, 2021

Conversation

@sethrj (Member) commented Mar 7, 2021

This changes the stack allocator interface to be compatible with both host and device memory using Collections.
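For context, the Collection pattern looks roughly like the minimal sketch below. The names are simplified and this is not the real Celeritas API (which also threads an OpaqueId index type through, as the backtraces below show); it only illustrates how one struct template can target host or device memory.

```cpp
#include <cstddef>
#include <vector>

// Minimal sketch of the Collection idea: simplified names, not the
// real Celeritas API.
enum class Ownership { value, reference };
enum class MemSpace { host, device };

template<class T>
struct Span
{
    T*          data = nullptr;
    std::size_t size = 0;
};

// Non-owning case: just a span into memory owned elsewhere (host or
// device); this is what gets passed by value into kernels.
template<class T, Ownership W, MemSpace M>
struct Collection
{
    Span<T> s;
};

// Owning host case: backed by a vector. (An owning device case would
// wrap a device allocation instead.)
template<class T>
struct Collection<T, Ownership::value, MemSpace::host>
{
    std::vector<T> storage;
};

// The stack allocator backend then reduces to two Collections -- the
// element pool plus a one-element "top of stack" counter -- and the
// same struct template works for host and device memory.
template<class T, Ownership W, MemSpace M>
struct StackAllocatorData
{
    Collection<T, W, M>            storage;
    Collection<unsigned int, W, M> size;
};
```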

Runtime error on CUDA 10.1

The errors below only occur on CUDA 10.1, not 10.2 or later. @jefflarkin thinks this may be a bug in NVCC since it only shows up in 10.1.


Currently I'm encountering a really bizarre error when building in debug on wildstyle (with no dependencies aside from JSON):

Thread 1 "demo-interactor" received signal CUDA_EXCEPTION_14, Warp Illegal Address.
[Switching focus to CUDA kernel 0, grid 4, block (0,0,0), thread (0,0,0), device 0, sm 0, warp 3, lane 0]
0x0000000001f54390 in celeritas::Collection<celeritas::ParticleDef, (celeritas::Ownership)2, (celeritas::MemSpace)1, celeritas::OpaqueId<celeritas::ParticleDef, unsigned int> >::storage (this=0x160, this=0x160)
    at /home/s3j/.local/src/celeritas/build-debug/../src/base/Span.hh:115
115	    CELER_CONSTEXPR_FUNCTION size_type size() const { return s_.size; }
(cuda-gdb) bt
#0  0x0000000001f54390 in celeritas::Collection<celeritas::ParticleDef, (celeritas::Ownership)2, (celeritas::MemSpace)1, celeritas::OpaqueId<celeritas::ParticleDef, unsigned int> >::storage (this=0x160, this=0x160)
    at /home/s3j/.local/src/celeritas/build-debug/../src/base/Span.hh:115
#1  celeritas::Collection<celeritas::ParticleDef, (celeritas::Ownership)2, (celeritas::MemSpace)1, celeritas::OpaqueId<celeritas::ParticleDef, unsigned int> >::size (this=0x160)
    at /home/s3j/.local/src/celeritas/build-debug/../src/base/Collection.i.hh:144
#2  celeritas::ParticleTrackView::operator= (this=0x7fffe1fffc58, other=0x7fffe1fffc10)
    at /home/s3j/.local/src/celeritas/build-debug/../src/physics/base/ParticleTrackView.i.hh:36
#3  0x0000000001ef8d40 in demo_interactor::(anonymous namespace)::initialize_kernel<<<(32,1,1),(128,1,1)>>> (
    states=..., init=...) at ../app/demo-interactor/KNDemoKernel.cu:47

Inside the ParticleTrackView, params_ apparently points to invalid memory:

#2  celeritas::ParticleTrackView::operator= (this=0x7fffe1fffc58, other=0x7fffe1fffc10)
    at /home/s3j/.local/src/celeritas/build-debug/../src/physics/base/ParticleTrackView.i.hh:36
36	    CELER_EXPECT(other.particle_id < params_.particles.size());
(cuda-gdb) print params_
$1 = (const @generic _ZN9celeritas17ParticleTrackView17ParticleParamsRefE * const @generic) 0x160
(cuda-gdb) print *params_
Error: Failed to read generic memory at address 0x160 on device 0 sm 0 warp 3 lane 0, error=CUDBG_ERROR_INVALID_MEMORY_SEGMENT(0x7).

...and at the kernel level, the params variable doesn't even exist, and the states look corrupted (shifted by one, maybe?):

(cuda-gdb) print params
No symbol "params" in current context.
(cuda-gdb) print states
$3 = {particle = {state = {storage_ = {data = {s_ = {data = 0x0, size = 0}}}}}, rng = {rng = {s_ = {data = 0x0,
        size = 0}}}, position = {s_ = {data = 0x7fffe1fffb60, size = 352}}, direction = {s_ = {data = 0x160,
      size = 140736985037592}}, time = {s_ = {data = 0x0, size = 16285011132970729471}}, alive = {s_ = {
      data = 0xe1fffc5800007fff, size = 12646614903394697215}}, secondaries = {storage = {storage_ = {data = {s_ = {
            data = 0xe1fffb6000007fff, size = 8589934592}}}}, size = {storage_ = {data = {s_ = {data = 0x1,
            size = 0}}}}}, detector = {hit_buffer = {storage = {storage_ = {data = {s_ = {data = 0x7fffe1fffc20,
              size = 140736140318208}}}}, size = {storage_ = {data = {s_ = {data = 0x1000,
              size = 140736139888128}}}}}, tally_deposition = {storage_ = {data = {s_ = {data = 0x1000,
            size = 140736140084736}}}}}}

In comparison, on emmet:

(cuda-gdb) print params
$1 = {particle = {particles = {storage_ = {data = {s_ = {data = 0x2300a60400, size = 2}}}}}, tables = {reals = {
      storage_ = {data = {s_ = {data = 0x2300a60000, size = 84}}}}, xs = {log_energy = {size = 84,
        front = -8.8813996475528896, back = 18.420680743952367, delta = 0.32894072760849707}, prime_index = 27,
      value = {begin_ = {value_ = {value_ = 0}}, end_ = {value_ = {value_ = 84}}}}}, kn_interactor = {model_id = {
      value_ = 0}, inv_electron_mass = 1.9569511984948496, electron_id = {value_ = 0}, gamma_id = {value_ = 1}},
  detector = {tally_grid = {size = 1024, front = -1, back = 254.75, delta = 0.25}}}
(cuda-gdb) print states
$2 = {particle = {state = {storage_ = {data = {s_ = {data = 0x2315040000, size = 1000000}}}}}, rng = {rng = {s_ = {
        data = 0x2300b60000, size = 1000000}}}, position = {s_ = {data = 0x2312240000, size = 1000000}}, direction = {
    s_ = {data = 0x2313940000, size = 1000000}}, time = {s_ = {data = 0x2303940000, size = 1000000}}, alive = {s_ = {
      data = 0x2300a60600, size = 1000000}}, secondaries = {storage = {storage_ = {data = {s_ = {data = 0x2315fa0000,
            size = 1000000}}}}, size = {storage_ = {data = {s_ = {data = 0x2300b54a00, size = 1}}}}}, detector = {
    hit_buffer = {storage = {storage_ = {data = {s_ = {data = 0x23185e0000, size = 1000000}}}}, size = {storage_ = {
          data = {s_ = {data = 0x2300b54c00, size = 1}}}}}, tally_deposition = {storage_ = {data = {s_ = {
            data = 0x2300b54e00, size = 1024}}}}}}

The "initialize" kernel in which it's happening should NOT have been affected by any of the changes I made at all. This sort of off-by-one thing makes me wonder if it's a similar error to the corrupted ELF data seen in #118 ?? I've spent two hours starting at this mess, and a careful review of my code, and have nothing. The changes to the stack allocator were extensive enough that it's going to be difficult or impossible to bisect my changes to see where exactly I went wrong.

This eliminates the need for the host allocator, storage class, etc. I refactored the demo interactor's detector code as part of it, as well as some of the runner code that needed to change as a consequence.

Additional changes were required in the Livermore interactor, since the stack allocator now has a template parameter on memory space. I refactored some parts of the Livermore atomic relaxation to reduce propagation of those template parameters and to improve program flow.
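As a rough illustration of that knock-on effect (the names here are illustrative, not the actual interfaces): because the allocator is templated on memory space, an interactor holding one picks up the same parameter, and helpers can take the allocator by reference instead of re-propagating the template parameter themselves.

```cpp
#include <cstddef>

// Illustrative names only, not the actual Celeritas interfaces.
enum class MemSpace { host, device };

template<class T, MemSpace M>
class StackAllocator
{
  public:
    StackAllocator(T* pool, std::size_t* size, std::size_t capacity)
        : pool_(pool), size_(size), capacity_(capacity) {}

    // Bump the shared counter (atomically, in the real device code) and
    // return a pointer into the preallocated pool, or nullptr on overflow.
    T* operator()(std::size_t count)
    {
        std::size_t start = *size_;
        if (start + count > capacity_)
            return nullptr;
        *size_ = start + count;
        return pool_ + start;
    }

  private:
    T*           pool_;
    std::size_t* size_;
    std::size_t  capacity_;
};

struct Secondary { };

// The memory-space parameter follows the allocator into the interactor;
// helpers that only need to allocate can take the allocator by reference
// rather than being templated on MemSpace themselves.
template<MemSpace M>
struct PhotoelectricInteractor
{
    StackAllocator<Secondary, M>& allocate;

    void interact()
    {
        if (Secondary* sec = allocate(1))
        {
            // ... fill in the emitted secondary ...
        }
    }
};
```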
@sethrj added the core (Software engineering infrastructure) label Mar 7, 2021
@sethrj requested a review from @amandalund March 7, 2021 00:31
@pcanal (Contributor) commented Mar 7, 2021

Another data point: it does not crash on wc.fnal.gov with nvcc V11.1.105. cuda-gdb with CUDA memcheck does not complain, and cuda-memcheck fails after printing the output JSON with the (seemingly unrelated):

========= CUDA-MEMCHECK
========= Internal Memcheck Error: Initialization failed
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/lib64/libcuda.so.1 [0x1b974b]
=========     Host Frame:/srv/software/cuda-toolkits/11.1.1/lib64/libcudart.so.11.0 [0x27bdc]
=========     Host Frame:/srv/software/cuda-toolkits/11.1.1/lib64/libcudart.so.11.0 [0x1d09e]
=========     Host Frame:/srv/software/cuda-toolkits/11.1.1/lib64/libcudart.so.11.0 [0x2e2e4]
=========     Host Frame:/srv/software/cuda-toolkits/11.1.1/lib64/libcudart.so.11.0 [0x2ff2a]
=========     Host Frame:/srv/software/cuda-toolkits/11.1.1/lib64/libcudart.so.11.0 [0x25c8e]
=========     Host Frame:/srv/software/cuda-toolkits/11.1.1/lib64/libcudart.so.11.0 [0x112ee]
=========     Host Frame:/srv/software/cuda-toolkits/11.1.1/lib64/libcudart.so.11.0 (cudaFree + 0xf2) [0x40532]
=========     Host Frame:/nashome/p/pcanal/from_lq/wclustre/geant/gcc8/builds/scalar-cuda11/celeritas/src/libceleritas.so (_ZN9celeritas15activate_deviceEONS_6DeviceE + 0x25a) [0x1ad97d]
=========     Host Frame:/nashome/p/pcanal/from_lq/wclustre/geant/gcc8/builds/scalar-cuda11/celeritas/app/demo-interactor [0x5eb3a]
=========     Host Frame:/lib64/libc.so.6 (__libc_start_main + 0xf5) [0x22555]
=========     Host Frame:/nashome/p/pcanal/from_lq/wclustre/geant/gcc8/builds/scalar-cuda11/celeritas/app/demo-interactor [0x5daa9]
=========
========= ERROR SUMMARY: 1 error

@sethrj (Member, Author) commented Mar 8, 2021

Thanks @pcanal for the check. I spent another hour progressively backing out the changes that might have affected the initialize kernel or that translation unit in general... nothing seems to fix it. :(

@amandalund (Contributor) left a comment

I can see how this became a big job, especially integrating everything with the atomic relaxation/photoelectric classes, but the changes look great! Sounds like that error is looking more and more like a compiler bug?

Review threads (all resolved):
src/base/StackAllocator.hh (outdated)
src/physics/em/LivermorePEModel.hh
src/base/StackAllocator.hh (outdated)
@sethrj merged commit 7f7dc57 into celeritas-project:master Mar 11, 2021
@sethrj deleted the stack-allocator-backend branch March 11, 2021 15:13
@sethrj mentioned this pull request Mar 15, 2021
@sethrj added the minor (Minor internal changes or fixes, including CI updates) label Nov 24, 2023