Throw error from device to host with a msg #2258 #2283

mehmetyusufoglu · 2024-05-31T15:05:37Z

Adds macros for signalling an error from the device side to the host code.

For the CUDA, HIP and SYCL backends, a user-defined message is printed before the abort. For the other backends, a std::runtime_error exception carrying the user message is thrown.
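The dispatch described above could be sketched roughly as follows. This is a simplified illustration and not the code from this PR: the macro name `EXAMPLE_THROW` and the helpers `failIfNegative` and `caughtMessage` are hypothetical, and only a CUDA device branch and the host fallback are shown.

```cpp
#include <cstdio>
#include <stdexcept>
#include <string>

// Hypothetical macro for illustration only; not the alpaka implementation.
#if defined(__CUDA_ARCH__)
// Device code cannot throw: print the user message, then abort the kernel.
#    define EXAMPLE_THROW(MSG)                                                \
        do                                                                    \
        {                                                                     \
            printf("%s\n", (MSG));                                            \
            __trap();                                                         \
        } while(false)
#else
// Host code: throw an exception carrying the user message.
#    define EXAMPLE_THROW(MSG) throw std::runtime_error(MSG)
#endif

// Hypothetical "kernel body" that signals an error for invalid input.
inline int failIfNegative(int value)
{
    if(value < 0)
        EXAMPLE_THROW("negative value passed to kernel");
    return value;
}

// Returns the message of the exception thrown by failIfNegative,
// or an empty string if nothing was thrown.
inline std::string caughtMessage(int value)
{
    try
    {
        failIfNegative(value);
    }
    catch(std::runtime_error const& e)
    {
        return e.what();
    }
    return {};
}
```

On a host backend, `caughtMessage(-1)` returns the user message, which mirrors how a test can catch the exception for the CPU accelerators.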

Testing:
- Tests could be written for the CpuThreads and CpuBlocks accelerators by catching the runtime_error exceptions thrown during exec.
- Aborts cannot be caught for CUDA, HIP and SYCL when we call exec (only tested by running a temporary failing test on CI). For CUDA, however, __trap triggers a runtime_error that can be caught during wait(queue).
- For SYCL, assert(false) is used. std::abort compiled without errors, but on HAL it generated a runtime error instead of reporting "aborted".
- It turned out that the OpenMP specification mandates that a std::runtime_error must be handled by the same thread that threw it; otherwise it is converted to an abort. I checked with a signal handler, and SIGABRT is fired. (Therefore the OpenMP cases could not be tested other than by a one-time failing test on CI.)
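The signal-handler check mentioned in the last point could look roughly like the sketch below. The names `g_abortSeen`, `onSigabrt` and `sigabrtWasObserved` are hypothetical. The sketch uses `std::raise(SIGABRT)` so that the example can continue after the handler runs; with a real `std::abort()` the program still terminates once the handler returns.

```cpp
#include <csignal>

namespace
{
    // Written by the handler; volatile std::sig_atomic_t is safe to set
    // from inside a signal handler.
    volatile std::sig_atomic_t g_abortSeen = 0;

    void onSigabrt(int /*signal*/)
    {
        g_abortSeen = 1;
    }
} // namespace

// Installs the handler, raises SIGABRT, restores the previous handler,
// and reports whether the handler ran.
bool sigabrtWasObserved()
{
    g_abortSeen = 0;
    auto const previous = std::signal(SIGABRT, onSigabrt);
    if(previous == SIG_ERR)
        return false; // the handler could not be installed

    // std::abort() raises the same signal but never returns to the caller.
    std::raise(SIGABRT);

    std::signal(SIGABRT, previous);
    return g_abortSeen == 1;
}
```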

Issue

fix #2258

include/alpaka/core/RuntimeMacros.hpp

test/unit/kernel/src/KernelThrow.cpp

fwyzard · 2024-06-13T13:44:43Z

include/alpaka/core/RuntimeMacros.hpp

+#    define ALPAKA_DEVICE_THROW(MSG)                                                                                  \
+        {                                                                                                             \
+            printf("%s [ALPAKA_DEVICE_THROW: Calling std::abort(). Sycl backend is enabled.]\n", (MSG));              \
+            std::abort();                                                                                             \


Are you sure we can call std::abort() from inside a SYCL kernel running e.g. on a GPU ?

We only support oneAPI, and it is tested on CI and on HAL. I wanted to use assert(false), but assert turned out to call std::abort from the kernel in SYCL (url)

mehmetyusufoglu · 2024-06-16T19:34:52Z

I've tried looking into the specification of SYCL and the implementation of oneAPI, and I suspect the only portable solution may be assert(false).

However:
* on my laptop's integrated GPU it actually does not do anything :-(

* on a Flex 170 datacenter GPU it causes the program to abort in a way that cannot be caught.
I'll bring it up with Intel at the next occasion - but for the time being, that seems better than nothing.

OK, I have added assert(false), thanks a lot. I cannot catch the exception on the HAL computer, so I did not test it in the test code.

AssertHandler::printMessage
test_abort.cpp:27: auto main()::(anonymous class)::operator()(sycl::handler &)::(anonymous class)::operator()() const: global id: [0,0,0], local id: [0,0,0] Assertion `false` failed.

fwyzard · 2024-06-17T10:44:02Z

can not catch the exception at HAL computer so i did not test it in the test code.

Unrelated to the test, but what is the HAL computer ?

fwyzard · 2024-06-17T10:47:23Z

include/alpaka/core/RuntimeMacros.hpp

+//! Therefore std::runtime_error thrown by ALPAKA_DEVICE_THROW aborts the the program for OpenMP backends. If needed
+//! the SIGABRT signal can be catched by signal handler.
+#if defined(ALPAKA_ACC_GPU_CUDA_ENABLED) && defined(__CUDA_ARCH__)
+#    define ALPAKA_DEVICE_THROW(MSG)                                                                                  \


Does this need to be a macro ?
Can we use a device function instead ?

It will not make much difference, since only two lines are inlined. On the other hand, isn't it better to call the aborts or exceptions directly?

@psychocoderHPC what do you prefer here, functions or macros ?

Functions are always preferred over macros. We should only use macros if we need __LINE__, ...
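To illustrate the `__LINE__` point: a macro expands at the call site, so it can capture the caller's file and line, while a plain function only sees what is passed to it. A minimal sketch, where `describeError`, `EXAMPLE_DESCRIBE_ERROR` and `exampleCall` are hypothetical names and not part of alpaka:

```cpp
#include <string>

// A plain function only knows the location it is given as arguments.
inline std::string describeError(char const* file, int line, char const* msg)
{
    return std::string(file) + ":" + std::to_string(line) + ": " + msg;
}

// The macro expands at the call site, so __FILE__ and __LINE__ refer to
// the code that uses the macro, not to describeError itself.
#define EXAMPLE_DESCRIBE_ERROR(MSG) describeError(__FILE__, __LINE__, (MSG))

// Usage: the returned string carries the file and line of this function body.
inline std::string exampleCall()
{
    return EXAMPLE_DESCRIBE_ERROR("index out of range");
}
```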

fwyzard · 2024-06-17T10:52:01Z

include/alpaka/core/RuntimeMacros.hpp

+#if defined(ALPAKA_ACC_GPU_CUDA_ENABLED) && defined(__CUDA_ARCH__)
+#    define ALPAKA_DEVICE_THROW(MSG)                                                                                  \
+        {                                                                                                             \
+            printf("%s [ALPAKA_DEVICE_THROW: Calling __trap(). Cuda backend is enabled.]\n", (MSG));                  \


I think a message like

%s [ALPAKA_DEVICE_THROW: Calling __trap(). Cuda backend is enabled.]\n

is fine for debugging the implementation - but once this goes in production can we print just the user-supplied message ?

Or, if we do want to add more information, something like

alpaka encountered a user-defined error condition while running on the ___ back-end:\n %s
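As a host-side sketch of the second suggestion, the message could be assembled as below. The helper name `makeUserErrorMessage` is hypothetical, and real device code would have to format the text with printf rather than std::string.

```cpp
#include <string>

// Hypothetical helper: builds the message in the format suggested above,
// with the back-end name filled in and the user message appended.
inline std::string makeUserErrorMessage(std::string const& backend, std::string const& userMsg)
{
    return "alpaka encountered a user-defined error condition while running on the " + backend
           + " back-end:\n " + userMsg;
}
```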

ok, changed. Thanks.

mehmetyusufoglu · 2024-06-17T11:48:00Z

can not catch the exception at HAL computer so i did not test it in the test code.

Unrelated to the test, but what is the HAL computer ?

An HPC system at HZDR :) It has different kinds of GPUs and is easily configurable.

psychocoderHPC · 2024-06-17T15:02:37Z

include/alpaka/core/RuntimeMacros.hpp

+//! Therefore std::runtime_error thrown by ALPAKA_DEVICE_THROW aborts the the program for OpenMP backends. If needed
+//! the SIGABRT signal can be catched by signal handler.
+#if defined(ALPAKA_ACC_GPU_CUDA_ENABLED) && defined(__CUDA_ARCH__)
+#    define ALPAKA_DEVICE_THROW(MSG)                                                                                  \


I was about to say: change it to a C++ function, and then this should be deviceThrow() within the alpaka namespace.
BUT then it should be a trait that gets the acc as an argument, with a corresponding class named after the trait, which is then specialized for each acc. Device functions always take the acc as their first argument.

Nevertheless, I would handle it the same way as ALPAKA_ASSERT_ACC and would add the line and the error message to the printf.
To follow the naming scheme, this macro should be named ALPAKA_THROW_ACC.
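For comparison, the trait-based alternative described above might look roughly like this. It is only a sketch of the pattern: the accelerator tags `AccCpuSerial` and `AccCpuThreads`, the `example` namespace and the helper `messageFor` are hypothetical stand-ins, and both specializations simply throw on the host.

```cpp
#include <stdexcept>
#include <string>

namespace example
{
    // Hypothetical accelerator tags standing in for real accelerator types.
    struct AccCpuSerial
    {
    };

    struct AccCpuThreads
    {
    };

    namespace trait
    {
        // The primary template is left undefined: each accelerator must
        // provide its own specialization.
        template<typename TAcc>
        struct DeviceThrow;

        template<>
        struct DeviceThrow<AccCpuSerial>
        {
            [[noreturn]] static void deviceThrow(AccCpuSerial const&, char const* msg)
            {
                throw std::runtime_error(msg);
            }
        };

        template<>
        struct DeviceThrow<AccCpuThreads>
        {
            [[noreturn]] static void deviceThrow(AccCpuThreads const&, char const* msg)
            {
                throw std::runtime_error(std::string("[threads] ") + msg);
            }
        };
    } // namespace trait

    // Free function taking the accelerator as its first argument and
    // forwarding to the trait specialization for that accelerator.
    template<typename TAcc>
    [[noreturn]] void deviceThrow(TAcc const& acc, char const* msg)
    {
        trait::DeviceThrow<TAcc>::deviceThrow(acc, msg);
    }

    // Helper for the usage example: returns the message of the caught exception.
    template<typename TAcc>
    std::string messageFor(TAcc const& acc, char const* msg)
    {
        try
        {
            deviceThrow(acc, msg);
        }
        catch(std::runtime_error const& e)
        {
            return e.what();
        }
        return {};
    }
} // namespace example
```

Every new accelerator needs one more specialization here, which is the extra code the macro approach avoids.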

So, according to you, it should be a macro called ALPAKA_THROW_ACC without an Acc argument?

Yes, this will follow our macro ALPAKA_ASSERT_ACC and avoid introducing a new trait. IMO this function/macro will be used only in rare cases, therefore I would like to avoid growing the code base because of this tiny throw.

OK, so now we have a macro that takes a string and throws or aborts depending on the backend.

macro name changed to ALPAKA_THROW_ACC. Thanks.

psychocoderHPC · 2024-06-19T15:15:43Z

@mehmetyusufoglu please rebase against develop branch to fix the CI issues

mehmetyusufoglu force-pushed the throwAtRuntime branch from fbffb91 to 09c787d Compare May 31, 2024 15:08

mehmetyusufoglu marked this pull request as draft May 31, 2024 15:10

mehmetyusufoglu force-pushed the throwAtRuntime branch from 09c787d to 970405d Compare May 31, 2024 15:13

psychocoderHPC reviewed Jun 4, 2024

mehmetyusufoglu force-pushed the throwAtRuntime branch 9 times, most recently from 6e2a9e5 to d58d3b2 Compare June 10, 2024 14:25

mehmetyusufoglu changed the title ~~[WIP] signal error from device to host #2258~~ Throw error from device to host #2258 Jun 10, 2024

mehmetyusufoglu marked this pull request as ready for review June 10, 2024 14:29

mehmetyusufoglu changed the title ~~Throw error from device to host #2258~~ Throw error from device to host with a msg #2258 Jun 10, 2024

mehmetyusufoglu force-pushed the throwAtRuntime branch 7 times, most recently from 9f719d1 to b03addd Compare June 13, 2024 10:57

psychocoderHPC added this to the 1.2.0 milestone Jun 13, 2024

psychocoderHPC added the Type:Enhancement label Jun 13, 2024

psychocoderHPC requested changes Jun 13, 2024

psychocoderHPC requested a review from fwyzard June 13, 2024 11:40

psychocoderHPC reviewed Jun 13, 2024

mehmetyusufoglu force-pushed the throwAtRuntime branch from b03addd to ee7d889 Compare June 13, 2024 11:59

fwyzard reviewed Jun 13, 2024

mehmetyusufoglu force-pushed the throwAtRuntime branch from 42b8049 to 0789e32 Compare June 16, 2024 19:32

mehmetyusufoglu force-pushed the throwAtRuntime branch 2 times, most recently from 332300c to ad62470 Compare June 17, 2024 09:46

fwyzard reviewed Jun 17, 2024

fwyzard reviewed Jun 17, 2024

mehmetyusufoglu force-pushed the throwAtRuntime branch 2 times, most recently from 22d1838 to 26fbae4 Compare June 17, 2024 11:33

psychocoderHPC requested changes Jun 17, 2024

mehmetyusufoglu force-pushed the throwAtRuntime branch from 26fbae4 to a5fade5 Compare June 17, 2024 17:22

fwyzard reviewed Jun 17, 2024

fwyzard reviewed Jun 17, 2024

fwyzard reviewed Jun 17, 2024

mehmetyusufoglu force-pushed the throwAtRuntime branch from a5fade5 to f73328e Compare June 18, 2024 05:23

fwyzard reviewed Jun 18, 2024

mehmetyusufoglu force-pushed the throwAtRuntime branch from 6635930 to 472b74d Compare June 18, 2024 11:11

fwyzard reviewed Jun 18, 2024

fwyzard reviewed Jun 18, 2024

mehmetyusufoglu force-pushed the throwAtRuntime branch 2 times, most recently from 9564419 to 697cea7 Compare June 19, 2024 22:10

psychocoderHPC reviewed Jun 21, 2024

mehmetyusufoglu force-pushed the throwAtRuntime branch 2 times, most recently from cfd024c to 0a6bb4b Compare June 25, 2024 09:24

fwyzard approved these changes Jun 25, 2024

Add runtime macro alpaka_throw_acc

0f221e7

mehmetyusufoglu force-pushed the throwAtRuntime branch from 0a6bb4b to 0f221e7 Compare June 25, 2024 12:11
