Silent cudaLaunch failures when compiling with clang's CUDA implementation #49

illuhad · 2019-04-18T16:52:09Z

Moving the discussion from #42 to a dedicated issue. As mentioned in #42, @psalz found out that this simple code:

#include <CL/sycl.hpp>

int main() {
    cl::sycl::queue queue;
    cl::sycl::buffer<float, 1> buf(10);

    queue.submit([&](cl::sycl::handler& cgh) {}); // The culprit

    queue.submit([&](cl::sycl::handler& cgh) {
        auto acc = buf.get_access<cl::sycl::access::mode::discard_write>(cgh);
        cgh.parallel_for<class fail>(buf.get_range(), [=](cl::sycl::item<1> item) {
            acc[item] = 1.f;
        }); 
    }); 

    return 0;
}

silently fails to run the kernel. The error is not restricted to this particular code, however, and can seemingly "strike at random". It only becomes visible when running under cuda-memcheck, which reports cudaErrorInvalidDeviceFunction (error 8, "invalid device function") on the CUDA API call to cudaLaunch.

We know:

- The problem disappears when small, seemingly unrelated bits of code are changed (in this case, either removing the first empty command group or moving it after the second command group solves the issue).
- The generated device code and the CUDA launch code on the host side are the same for both the working and non-working versions.
- It is reproducible (at least?) with clang 8.
- Everything works fine if compiled with nvcc.

Things to try:

Does adding an explicit queue.wait_and_throw() at the end change anything? Terminating a program without synchronization either via queue or by creating a host accessor is not allowed by spec, although hipSYCL historically has handled that well. The question is: Could it happen that hipSYCL in some destructor tries to run the kernel while CUDA runtime has already started shutting down? EDIT: No, explicit synchronization doesn't help


illuhad · 2019-04-23T11:37:58Z

The kokkos guys may have a similar (same?) problem: kokkos/kokkos#1547

illuhad · 2019-04-23T12:00:58Z

Here it's suggested that such errors can be caused by using data after it has left scope: kokkos/kokkos#1173

psalz · 2019-04-25T10:18:41Z

I have some news on this: We discovered that the bug appears to have been fixed in the current Clang trunk (i.e., Clang 9); however, an assertion still fires in debug builds. I've also narrowed the fix down to a particular commit, and created an issue about it in the LLVM bug tracker: https://bugs.llvm.org/show_bug.cgi?id=41597.

illuhad · 2019-04-25T10:27:20Z

Excellent, thank you!

psalz · 2019-05-17T11:47:32Z

Unfortunately, I have since encountered this issue again, using Clang 9. This means the root cause hasn't really been fixed; only the circumstances triggering the bug are different. I also fear we'll have to dig into Clang ourselves if we want to get this fixed anytime soon...

illuhad · 2019-05-18T12:09:38Z

Okay, let's try figuring this out on our own :) We know that the generated code is identical for both working/non-working versions, with the only difference being the mangled name of the kernel, right? I would propose that we first try to verify if the issue is on the host side as generated by clang:

- We know it compiles with nvcc, but nvcc likely uses a different mangled kernel name. Let's verify whether the nvcc kernel name is indeed different.
- If this is indeed the case, let's see what happens if we launch the clang-compiled PTX kernel without clang: We can launch the PTX code directly using the CUDA driver API, based on the kernel name. We need to be careful about kernel parameters (in SYCL, captured accessors), so let's see if we can reproduce the behavior using a kernel that doesn't capture anything (e.g. one that just calls printf) and use that for testing.
- If we cannot reproduce the issue with a non-capturing kernel, it may be an issue with clang's implementation of lambda captures or kernel parameters.
- Otherwise, if the issue also appears when launching the kernel directly with the driver API, it can be either
  - an issue with the generated PTX (which would be weird, because we know that the working version is the same except for the kernel name), or
  - a bug in CUDA: perhaps it just has a problem with certain mangled kernel names, which may only be triggered when compiling with clang.
- If the issue doesn't appear,
  - it is likely a problem on the host side, related to how clang invokes the kernel, and
  - since we use a kernel that doesn't capture anything, it cannot be related to kernel parameters/captures.
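
As a side note on why removing or moving an unrelated command group can change the kernel's mangled name at all: the following is a minimal plain C++ sketch (no SYCL or CUDA involved, and the function name lambda_type_names is made up for illustration). It assumes a compiler that numbers lambdas by their position within the enclosing function, as Itanium-ABI compilers such as GCC and clang do. It only illustrates the naming effect, not the bug itself.

```cpp
#include <string>
#include <typeinfo>
#include <utility>

// Hypothetical helper: returns the implementation-defined type names of two
// lambdas that have identical bodies and differ only in where they appear
// inside this function.
std::pair<std::string, std::string> lambda_type_names() {
    auto first = [] {};   // stands in for the empty command group
    auto second = [] {};  // stands in for the command group with the kernel
    // typeid(...).name() is implementation-defined; with GCC and clang it is
    // the mangled type name, which encodes the enclosing function plus a
    // per-function lambda discriminator.
    return std::make_pair(std::string(typeid(first).name()),
                          std::string(typeid(second).name()));
}
```

Printing both strings should show two names that differ only in a small discriminator. Anything declared inside the second lambda, such as a kernel name type, inherits that discriminator, so deleting or reordering the first lambda renames it.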

psalz · 2019-06-28T14:45:05Z

After blissfully ignoring this issue for a couple of weeks I ran into it again in a major way a couple of days ago. I decided to take another look and I think I have a solid lead now. It looks like it might actually be two distinct issues (albeit very closely related), with one being surfaced through pruning on hipSYCL's side, and the other being a pure Clang/LLVM bug. Needs a bit more investigation, I'll check back next week!

psalz · 2019-07-02T09:33:06Z

I've got a minimal pure CUDA test case and preliminary fix in place, see https://reviews.llvm.org/D64015. If this gets merged it'll also require a change in the Clang plugin (i.e., use getSharedMangleContext), but I'll make a PR once that happens!

illuhad · 2019-07-02T10:04:34Z

Wow, great news! Thank you!

illuhad · 2022-05-27T20:03:53Z

Kernel name mangling issues are well known by now and addressed in hipSYCL in various ways, depending on clang version.
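
For readers wondering what a position-independent kernel name could look like, here is a small plain C++ sketch. It is not hipSYCL's actual mechanism; name_of, stable_kernel_name, and tag_inside_lambda are hypothetical names used only for illustration. The idea shown is that a tag type declared at namespace scope gets a type name that does not depend on any surrounding lambda, whereas a tag declared inside a lambda body does.

```cpp
#include <string>
#include <typeinfo>

// Hypothetical helper: the implementation-defined name of a tag type.
template <class Name>
std::string name_of() {
    return typeid(Name).name();
}

// Tag declared at namespace scope: its name is independent of any lambda.
struct stable_kernel_name {};

// Tag declared inside a lambda body: on Itanium-ABI compilers its name
// includes the enclosing function and the enclosing lambda's discriminator.
std::string tag_inside_lambda() {
    auto unrelated = [] {};  // an unrelated lambda that precedes the "kernel"
    (void)unrelated;
    auto submit = [] {
        struct local_tag {};
        return name_of<local_tag>();
    };
    return submit();
}
```

Comparing name_of<stable_kernel_name>() with tag_inside_lambda() shows the difference: only the second string carries context from the enclosing function and lambda.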

illuhad mentioned this issue Apr 20, 2019

Add SYCL extension: Mechanism for automatic calls to require() for placeholder accessors #52

Merged

psalz mentioned this issue Apr 23, 2019

Fix memory leak caused by shared_ptr dependency cycle #53

Merged

psalz mentioned this issue Jul 12, 2019

Work around Clang name mangling bug by using custom kernel names #77

Merged

illuhad mentioned this issue Jul 27, 2020

[SYCL2020] Add experimental support for optional lambda kernel naming #281

Merged

abboomer mentioned this issue Oct 26, 2020

cuda clang++ compilation crash #356

Closed

illuhad closed this as completed May 27, 2022

llvmbot mentioned this issue Aug 5, 2023

Certain CUDA codes produce "invalid device function" - appears to be fixed in trunk llvm/llvm-project#40942

Closed
