Silent cudaLaunch failures when compiling with clang's CUDA implementation #49
Comments
The Kokkos guys may have a similar (same?) problem: kokkos/kokkos#1547
Here it's suggested that such errors can be caused by using data after it has left scope: kokkos/kokkos#1173
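To make the "data used after it has left scope" failure class concrete, here is a minimal sketch in plain standard C++ (no CUDA or SYCL involved). The function `make_deferred_sum` is a hypothetical stand-in for a deferred kernel submission and is not taken from Kokkos or hipSYCL; it only illustrates why capturing by value keeps deferred work valid.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Hypothetical stand-in for a deferred kernel submission: the callable is
// created here but only invoked later, after this scope has ended.
std::function<int()> make_deferred_sum() {
    std::vector<int> data{1, 2, 3, 4};
    // Capturing `data` by reference ([&data]) would leave the callable with a
    // dangling reference once this function returns (undefined behavior).
    // Capturing by value copies the vector into the callable instead.
    return [data]() {
        int sum = 0;
        for (int v : data) sum += v;
        return sum;
    };
}
```

Calling `make_deferred_sum()()` after the function has returned is safe here only because the lambda owns its own copy of the data; whether this pattern is actually related to the failure discussed in this issue is not established.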
I have some news on this: We discovered that the bug appears to have been fixed in the current Clang trunk (i.e., Clang 9); however, an assertion is still thrown in debug builds. I've also narrowed the fix down to a particular commit, and created an issue about it in the LLVM bug tracker: https://bugs.llvm.org/show_bug.cgi?id=41597.
Excellent, thank you!
Unfortunately I have since encountered this issue again, using Clang 9. This means the root cause really hasn't been fixed; only the circumstances triggering the bug are different. I also fear we'll have to dig into Clang ourselves if we want to get this fixed anytime soon...
Okay, let's try figuring this out on our own :) We know that the generated code is identical for both working/non-working versions, with the only difference being the mangled name of the kernel, right? I would propose that we first try to verify if the issue is on the host side as generated by clang:
After blissfully ignoring this issue for a couple of weeks I ran into it again in a major way a couple of days ago. I decided to take another look and I think I have a solid lead now. It looks like it might actually be two distinct issues (albeit very closely related), with one being surfaced through pruning on hipSYCL's side, and the other being a pure Clang/LLVM bug. Needs a bit more investigation, I'll check back next week!
I've got a minimal pure CUDA test case and preliminary fix in place, see https://reviews.llvm.org/D64015. If this gets merged it'll also require a change in the Clang plugin (i.e., use
Wow, great news! Thank you!
Kernel name mangling issues are well known by now and addressed in hipSYCL in various ways, depending on clang version.
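For readers who want to look at mangled names themselves, the following is a small sketch of how a type's mangled name can be inspected and demangled in standard C++ on an Itanium-ABI toolchain (GCC or Clang on Linux is assumed). The type `my_kernel_name` and the helpers `mangled_name` and `demangle` are hypothetical and for illustration only; this is not how hipSYCL or Clang handle kernel names internally.

```cpp
#include <cassert>
#include <cstdlib>
#include <cxxabi.h>
#include <memory>
#include <string>
#include <typeinfo>

// Hypothetical kernel name type, similar to what one might pass as the
// name template argument of a SYCL parallel_for.
struct my_kernel_name {};

// Returns the mangled type name as reported by the compiler's RTTI.
template <typename T>
std::string mangled_name() {
    return typeid(T).name();
}

// Demangles an Itanium-ABI name; falls back to the input on failure.
std::string demangle(const std::string& mangled) {
    int status = 0;
    std::unique_ptr<char, void (*)(void*)> result(
        abi::__cxa_demangle(mangled.c_str(), nullptr, nullptr, &status),
        std::free);
    return (status == 0 && result) ? std::string(result.get()) : mangled;
}
```

Comparing `mangled_name<T>()` for the same type across two builds is one way to check that a name stays stable. For symbols in an object file, tools such as `nm` or `c++filt` serve a similar purpose.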
Moving the discussion from #42 to a dedicated issue. As mentioned in #42, @psalz found out that this simple code:
Silently fails to run the kernel. The error is, however, not restricted to this particular code and can also seemingly "strike at random". An error can only be seen when using `cuda-memcheck`, which reveals error `cudaErrorInvalidDeviceFunction` (error 8) due to "invalid device function" on CUDA API call to `cudaLaunch`.
We know:
Things to try:
Does a `queue.wait_and_throw()` at the end change anything? Terminating a program without synchronization, either via `queue` or by creating a host accessor, is not allowed by the spec, although hipSYCL has historically handled that well. The question is: could it happen that hipSYCL tries to run the kernel in some destructor while the CUDA runtime has already started shutting down? EDIT: No, explicit synchronization doesn't help.