
Use Collection for stack allocator backend #169

Merged: 8 commits into celeritas-project:master from stack-allocator-backend, Mar 11, 2021

Conversation

@sethrj (Member) commented Mar 7, 2021

This changes the stack allocator interface to be compatible with both host and device memory using Collections.
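For context, the Collection pattern looks roughly like the minimal sketch below. The names are simplified and this is not the real Celeritas API (which also threads an OpaqueId index type through, as the backtraces below show); it only illustrates how one struct template can target host or device memory.

```cpp
#include <cstddef>
#include <vector>

// Minimal sketch of the Collection idea: simplified names, not the
// real Celeritas API.
enum class Ownership { value, reference };
enum class MemSpace { host, device };

template<class T>
struct Span
{
    T*          data = nullptr;
    std::size_t size = 0;
};

// Non-owning case: just a span into memory owned elsewhere (host or
// device); this is what gets passed by value into kernels.
template<class T, Ownership W, MemSpace M>
struct Collection
{
    Span<T> s;
};

// Owning host case: backed by a vector. (An owning device case would
// wrap a device allocation instead.)
template<class T>
struct Collection<T, Ownership::value, MemSpace::host>
{
    std::vector<T> storage;
};

// The stack allocator backend then reduces to two Collections -- the
// element pool plus a one-element "top of stack" counter -- and the
// same struct template works for host and device memory.
template<class T, Ownership W, MemSpace M>
struct StackAllocatorData
{
    Collection<T, W, M>            storage;
    Collection<unsigned int, W, M> size;
};
```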

Runtime error on CUDA 10.1

The errors below only occur on CUDA 10.1, not 10.2 or later. @jefflarkin thinks this may be a bug in NVCC since it only shows up in 10.1.


Currently I'm encountering a really bizarre error when building in debug on wildstyle (with no dependencies aside from JSON):

Thread 1 "demo-interactor" received signal CUDA_EXCEPTION_14, Warp Illegal Address.
[Switching focus to CUDA kernel 0, grid 4, block (0,0,0), thread (0,0,0), device 0, sm 0, warp 3, lane 0]
0x0000000001f54390 in celeritas::Collection<celeritas::ParticleDef, (celeritas::Ownership)2, (celeritas::MemSpace)1, celeritas::OpaqueId<celeritas::ParticleDef, unsigned int> >::storage (this=0x160, this=0x160)
    at /home/s3j/.local/src/celeritas/build-debug/../src/base/Span.hh:115
115	    CELER_CONSTEXPR_FUNCTION size_type size() const { return s_.size; }
(cuda-gdb) bt
#0  0x0000000001f54390 in celeritas::Collection<celeritas::ParticleDef, (celeritas::Ownership)2, (celeritas::MemSpace)1, celeritas::OpaqueId<celeritas::ParticleDef, unsigned int> >::storage (this=0x160, this=0x160)
    at /home/s3j/.local/src/celeritas/build-debug/../src/base/Span.hh:115
#1  celeritas::Collection<celeritas::ParticleDef, (celeritas::Ownership)2, (celeritas::MemSpace)1, celeritas::OpaqueId<celeritas::ParticleDef, unsigned int> >::size (this=0x160)
    at /home/s3j/.local/src/celeritas/build-debug/../src/base/Collection.i.hh:144
#2  celeritas::ParticleTrackView::operator= (this=0x7fffe1fffc58, other=0x7fffe1fffc10)
    at /home/s3j/.local/src/celeritas/build-debug/../src/physics/base/ParticleTrackView.i.hh:36
#3  0x0000000001ef8d40 in demo_interactor::(anonymous namespace)::initialize_kernel<<<(32,1,1),(128,1,1)>>> (
    states=..., init=...) at ../app/demo-interactor/KNDemoKernel.cu:47

Inside the ParticleTrackView, params_ apparently points to invalid memory:

#2  celeritas::ParticleTrackView::operator= (this=0x7fffe1fffc58, other=0x7fffe1fffc10)
    at /home/s3j/.local/src/celeritas/build-debug/../src/physics/base/ParticleTrackView.i.hh:36
36	    CELER_EXPECT(other.particle_id < params_.particles.size());
(cuda-gdb) print params_
$1 = (const @generic _ZN9celeritas17ParticleTrackView17ParticleParamsRefE * const @generic) 0x160
(cuda-gdb) print *params_
Error: Failed to read generic memory at address 0x160 on device 0 sm 0 warp 3 lane 0, error=CUDBG_ERROR_INVALID_MEMORY_SEGMENT(0x7).

...and at the kernel level, the params variable doesn't even exist, and the states look corrupted (shifted by one, maybe?):

(cuda-gdb) print params
No symbol "params" in current context.
(cuda-gdb) print states
$3 = {particle = {state = {storage_ = {data = {s_ = {data = 0x0, size = 0}}}}}, rng = {rng = {s_ = {data = 0x0,
        size = 0}}}, position = {s_ = {data = 0x7fffe1fffb60, size = 352}}, direction = {s_ = {data = 0x160,
      size = 140736985037592}}, time = {s_ = {data = 0x0, size = 16285011132970729471}}, alive = {s_ = {
      data = 0xe1fffc5800007fff, size = 12646614903394697215}}, secondaries = {storage = {storage_ = {data = {s_ = {
            data = 0xe1fffb6000007fff, size = 8589934592}}}}, size = {storage_ = {data = {s_ = {data = 0x1,
            size = 0}}}}}, detector = {hit_buffer = {storage = {storage_ = {data = {s_ = {data = 0x7fffe1fffc20,
              size = 140736140318208}}}}, size = {storage_ = {data = {s_ = {data = 0x1000,
              size = 140736139888128}}}}}, tally_deposition = {storage_ = {data = {s_ = {data = 0x1000,
            size = 140736140084736}}}}}}

In comparison, on emmet:

(cuda-gdb) print params
$1 = {particle = {particles = {storage_ = {data = {s_ = {data = 0x2300a60400, size = 2}}}}}, tables = {reals = {
      storage_ = {data = {s_ = {data = 0x2300a60000, size = 84}}}}, xs = {log_energy = {size = 84,
        front = -8.8813996475528896, back = 18.420680743952367, delta = 0.32894072760849707}, prime_index = 27,
      value = {begin_ = {value_ = {value_ = 0}}, end_ = {value_ = {value_ = 84}}}}}, kn_interactor = {model_id = {
      value_ = 0}, inv_electron_mass = 1.9569511984948496, electron_id = {value_ = 0}, gamma_id = {value_ = 1}},
  detector = {tally_grid = {size = 1024, front = -1, back = 254.75, delta = 0.25}}}
(cuda-gdb) print states
$2 = {particle = {state = {storage_ = {data = {s_ = {data = 0x2315040000, size = 1000000}}}}}, rng = {rng = {s_ = {
        data = 0x2300b60000, size = 1000000}}}, position = {s_ = {data = 0x2312240000, size = 1000000}}, direction = {
    s_ = {data = 0x2313940000, size = 1000000}}, time = {s_ = {data = 0x2303940000, size = 1000000}}, alive = {s_ = {
      data = 0x2300a60600, size = 1000000}}, secondaries = {storage = {storage_ = {data = {s_ = {data = 0x2315fa0000,
            size = 1000000}}}}, size = {storage_ = {data = {s_ = {data = 0x2300b54a00, size = 1}}}}}, detector = {
    hit_buffer = {storage = {storage_ = {data = {s_ = {data = 0x23185e0000, size = 1000000}}}}, size = {storage_ = {
          data = {s_ = {data = 0x2300b54c00, size = 1}}}}}, tally_deposition = {storage_ = {data = {s_ = {
            data = 0x2300b54e00, size = 1024}}}}}}

The "initialize" kernel in which it's happening should NOT have been affected by any of the changes I made at all. This sort of off-by-one thing makes me wonder if it's a similar error to the corrupted ELF data seen in #118 ?? I've spent two hours starting at this mess, and a careful review of my code, and have nothing. The changes to the stack allocator were extensive enough that it's going to be difficult or impossible to bisect my changes to see where exactly I went wrong.

This eliminates the need for the host allocator, storage class, etc. I refactored the demo interactor's detector code as part of it, as well as some of the runner code that needed to change as a consequence.

Additional changes were required in the Livermore interactor, since the stack allocator now has a template parameter on memory space. I refactored some parts of the Livermore atomic relaxation to reduce propagation of those template parameters and to improve program flow.
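As a rough illustration of that knock-on effect (the names here are illustrative, not the actual interfaces): because the allocator is templated on memory space, an interactor holding one picks up the same parameter, and helpers can take the allocator by reference instead of re-propagating the template parameter themselves.

```cpp
#include <cstddef>

// Illustrative names only, not the actual Celeritas interfaces.
enum class MemSpace { host, device };

template<class T, MemSpace M>
class StackAllocator
{
  public:
    StackAllocator(T* pool, std::size_t* size, std::size_t capacity)
        : pool_(pool), size_(size), capacity_(capacity) {}

    // Bump the shared counter (atomically, in the real device code) and
    // return a pointer into the preallocated pool, or nullptr on overflow.
    T* operator()(std::size_t count)
    {
        std::size_t start = *size_;
        if (start + count > capacity_)
            return nullptr;
        *size_ = start + count;
        return pool_ + start;
    }

  private:
    T*           pool_;
    std::size_t* size_;
    std::size_t  capacity_;
};

struct Secondary { };

// The memory-space parameter follows the allocator into the interactor;
// helpers that only need to allocate can take the allocator by reference
// rather than being templated on MemSpace themselves.
template<MemSpace M>
struct PhotoelectricInteractor
{
    StackAllocator<Secondary, M>& allocate;

    void interact()
    {
        if (Secondary* sec = allocate(1))
        {
            // ... fill in the emitted secondary ...
        }
    }
};
```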
@sethrj added the core (Software engineering infrastructure) label Mar 7, 2021
@sethrj requested a review from @amandalund March 7, 2021 00:31
@pcanal (Contributor) commented Mar 7, 2021

Another data point: it does not crash on wc.fnal.gov with nvcc V11.1.105. cuda-gdb with CUDA memcheck does not complain, and cuda-memcheck fails after printing the output JSON with the (seemingly unrelated):

========= CUDA-MEMCHECK
========= Internal Memcheck Error: Initialization failed
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/lib64/libcuda.so.1 [0x1b974b]
=========     Host Frame:/srv/software/cuda-toolkits/11.1.1/lib64/libcudart.so.11.0 [0x27bdc]
=========     Host Frame:/srv/software/cuda-toolkits/11.1.1/lib64/libcudart.so.11.0 [0x1d09e]
=========     Host Frame:/srv/software/cuda-toolkits/11.1.1/lib64/libcudart.so.11.0 [0x2e2e4]
=========     Host Frame:/srv/software/cuda-toolkits/11.1.1/lib64/libcudart.so.11.0 [0x2ff2a]
=========     Host Frame:/srv/software/cuda-toolkits/11.1.1/lib64/libcudart.so.11.0 [0x25c8e]
=========     Host Frame:/srv/software/cuda-toolkits/11.1.1/lib64/libcudart.so.11.0 [0x112ee]
=========     Host Frame:/srv/software/cuda-toolkits/11.1.1/lib64/libcudart.so.11.0 (cudaFree + 0xf2) [0x40532]
=========     Host Frame:/nashome/p/pcanal/from_lq/wclustre/geant/gcc8/builds/scalar-cuda11/celeritas/src/libceleritas.so (_ZN9celeritas15activate_deviceEONS_6DeviceE + 0x25a) [0x1ad97d]
=========     Host Frame:/nashome/p/pcanal/from_lq/wclustre/geant/gcc8/builds/scalar-cuda11/celeritas/app/demo-interactor [0x5eb3a]
=========     Host Frame:/lib64/libc.so.6 (__libc_start_main + 0xf5) [0x22555]
=========     Host Frame:/nashome/p/pcanal/from_lq/wclustre/geant/gcc8/builds/scalar-cuda11/celeritas/app/demo-interactor [0x5daa9]
=========
========= ERROR SUMMARY: 1 error

@sethrj (Member, Author) commented Mar 8, 2021

Thanks @pcanal for the check. I spent another hour progressively backing out the changes that might have affected the initialize kernel or that translation unit in general... nothing seems to fix it. :(

@amandalund (Contributor) left a comment

I can see how this became a big job, especially integrating everything with the atomic relaxation/photoelectric classes, but the changes look great! Sounds like that error is looking more and more like a compiler bug?

Review threads (all resolved):
src/base/StackAllocator.hh (outdated)
src/physics/em/LivermorePEModel.hh
src/base/StackAllocator.hh (outdated)
@sethrj merged commit 7f7dc57 into celeritas-project:master Mar 11, 2021
@sethrj deleted the stack-allocator-backend branch March 11, 2021 15:13
@sethrj mentioned this pull request Mar 15, 2021
@sethrj added the minor (Minor internal changes or fixes, including CI updates) label Nov 24, 2023