Throw error from device to host with a msg #2258 #2283

Open
wants to merge 1 commit into
base: develop
from

Conversation

mehmetyusufoglu
Contributor

@mehmetyusufoglu mehmetyusufoglu commented May 31, 2024

Macros for signalling an error from device-side code to the host code.

For CUDA, HIP, and SYCL, a user-defined message is printed before the abort. For the other backends, a std::runtime_error exception carrying the user message is thrown.
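
A minimal sketch of the dispatch described above, assuming a plain C++ compiler. The names `DEMO_DEVICE_THROW` and `runAndCatch` are hypothetical and are not the macro added by this PR. The `__trap()` branch only applies in the device pass of a CUDA build, so a host compiler exercises only the exception path.

```cpp
#include <cassert>
#include <cstdio>
#include <stdexcept>
#include <string>

// Hypothetical macro for illustration only (not the macro from this PR).
#if defined(__CUDA_ARCH__)
// Device pass of a CUDA build: print the user message, then trap.
#    define DEMO_DEVICE_THROW(MSG)                                            \
        do                                                                    \
        {                                                                     \
            printf("%s\n", (MSG));                                            \
            __trap();                                                         \
        } while(false)
#else
// Host code and CPU backends: throw an exception carrying the message.
#    define DEMO_DEVICE_THROW(MSG)                                            \
        do                                                                    \
        {                                                                     \
            throw std::runtime_error(MSG);                                    \
        } while(false)
#endif

// Hypothetical helper: runs a "kernel body" that signals an error and
// returns the message recovered by the caller.
inline std::string runAndCatch(char const* msg)
{
    assert(msg != nullptr);
    try
    {
        DEMO_DEVICE_THROW(msg);
    }
    catch(std::runtime_error const& e)
    {
        return e.what();
    }
    return "";
}
```

On the exception path the caller gets the user message back unchanged, which is what the CPU tests rely on.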

Testing:
- Tests could be written for the accelerators CpuThreads and CpuBlocks by catching the std::runtime_error exceptions thrown during exec.
- Aborts from CUDA, HIP, and SYCL cannot be caught around the call to exec. (Only tested by running a temporary fail test in CI.)
For CUDA, however, __trap() triggers a std::runtime_error that can be caught during wait(queue).
- For SYCL, assert(false) is used. On HAL, std::abort compiled without errors but produced a runtime error instead of an "aborted" program.
- It turned out that the OpenMP specification requires a std::runtime_error to be handled by the same thread that threw it; otherwise it is converted to an abort. I checked with a signal handler, and SIGABRT is raised. (Therefore the OpenMP cases could not be tested other than by a one-time fail test in CI.)
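
The signal-handler check mentioned in the last item can be sketched as follows. This is an illustration under assumptions, not the code used for the PR: `demoOnAbort` and `sigabrtIsObservable` are hypothetical names, and `std::raise(SIGABRT)` is used in place of a real abort so that control returns after the handler and the flag can be inspected. With `std::abort()` the process still terminates once the handler returns.

```cpp
#include <cassert>
#include <csignal>

// Flag written by the handler; sig_atomic_t is the type that is safe to
// write from a signal handler.
static volatile std::sig_atomic_t gAbortSeen = 0;

// Hypothetical handler: only records that SIGABRT was delivered.
extern "C" void demoOnAbort(int)
{
    gAbortSeen = 1;
}

// Hypothetical helper: installs the handler, raises SIGABRT, restores the
// previous handler and reports whether the handler ran.
inline bool sigabrtIsObservable()
{
    gAbortSeen = 0;
    auto const previous = std::signal(SIGABRT, demoOnAbort);
    assert(previous != SIG_ERR);
    std::raise(SIGABRT); // returns once the handler has returned
    std::signal(SIGABRT, previous);
    return gAbortSeen == 1;
}
```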

Issue

fix #2258

@mehmetyusufoglu mehmetyusufoglu force-pushed the throwAtRuntime branch 9 times, most recently from 6e2a9e5 to d58d3b2 Compare June 10, 2024 14:25
@mehmetyusufoglu mehmetyusufoglu changed the title [WIP] signal error from device to host #2258 Throw error from device to host #2258 Jun 10, 2024
@mehmetyusufoglu mehmetyusufoglu marked this pull request as ready for review June 10, 2024 14:29
@mehmetyusufoglu mehmetyusufoglu changed the title Throw error from device to host #2258 Throw error from device to host with a msg #2258 Jun 10, 2024
@mehmetyusufoglu mehmetyusufoglu force-pushed the throwAtRuntime branch 7 times, most recently from 9f719d1 to b03addd Compare June 13, 2024 10:57
@psychocoderHPC psychocoderHPC added this to the 1.2.0 milestone Jun 13, 2024
include/alpaka/core/RuntimeMacros.hpp
# define ALPAKA_DEVICE_THROW(MSG) \
{ \
printf("%s [ALPAKA_DEVICE_THROW: Calling std::abort(). Sycl backend is enabled.]\n", (MSG)); \
std::abort(); \
Contributor

Are you sure we can call std::abort() from inside a SYCL kernel running e.g. on a GPU ?

Contributor Author

We only support oneAPI, and this is tested on CI and on HAL. I wanted to use assert(false), but assert turned out to call std::abort from the kernel in SYCL (url)

@mehmetyusufoglu
Contributor Author

I've tried looking into the specification of SYCL and the implementation of oneAPI, and I suspect the only portable solution may be assert(false).

However

* on my laptop's integrated GPU that actually does not do anything :-(

* on a Flex 170 datacenter GPU it causes the program to abort in a way that cannot be caught.

I'll bring it up with Intel at the next occasion - but for the time being, that seems better than nothing.

OK, I have added assert(false), thanks a lot. I cannot catch the exception on the HAL computer, so I did not test it in the test code.

AssertHandler::printMessage
test_abort.cpp:27: auto main()::(anonymous class)::operator()(sycl::handler &)::(anonymous class)::operator()() const: global id: [0,0,0], local id: [0,0,0] Assertion `false` failed.

@mehmetyusufoglu mehmetyusufoglu force-pushed the throwAtRuntime branch 2 times, most recently from 332300c to ad62470 Compare June 17, 2024 09:46
@fwyzard
Contributor

fwyzard commented Jun 17, 2024

cannot catch the exception on the HAL computer, so I did not test it in the test code.

Unrelated to the test, but what is the HAL computer ?

//! Therefore std::runtime_error thrown by ALPAKA_DEVICE_THROW aborts the program for OpenMP backends. If needed
//! the SIGABRT signal can be caught by a signal handler.
#if defined(ALPAKA_ACC_GPU_CUDA_ENABLED) && defined(__CUDA_ARCH__)
# define ALPAKA_DEVICE_THROW(MSG) \
Contributor

Does this need to be a macro ?
Can we use a device function instead ?

Contributor Author

It will not make much difference, since only two lines are inlined. On the other hand, isn't it better to call the aborts or exceptions directly?

Contributor

@psychocoderHPC what do you prefer here, functions or macros ?

Member

Functions are always preferred over macros. We should only use macros if we need __LINE__, ...
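
One way to read this guideline, as a hedged sketch: keep the logic in an ordinary function and let a thin macro capture only the call site. The names `describeError`, `DEMO_DESCRIBE_ERROR` and `demoCallSite` are hypothetical and do not exist in alpaka.

```cpp
#include <cassert>
#include <string>

// Ordinary function: does the real work and can live in a namespace,
// be overloaded and be stepped through in a debugger.
inline std::string describeError(char const* file, int line, char const* msg)
{
    assert(file != nullptr && msg != nullptr);
    return std::string(file) + ":" + std::to_string(line) + ": " + msg;
}

// Thin macro: its only job is to capture __FILE__ and __LINE__ at the
// call site, which the function cannot do on its own.
#define DEMO_DESCRIBE_ERROR(MSG) describeError(__FILE__, __LINE__, (MSG))

// Example call site.
inline std::string demoCallSite()
{
    return DEMO_DESCRIBE_ERROR("value out of range");
}
```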

#if defined(ALPAKA_ACC_GPU_CUDA_ENABLED) && defined(__CUDA_ARCH__)
# define ALPAKA_DEVICE_THROW(MSG) \
{ \
printf("%s [ALPAKA_DEVICE_THROW: Calling __trap(). Cuda backend is enabled.]\n", (MSG)); \
Contributor

I think a message like

%s [ALPAKA_DEVICE_THROW: Calling __trap(). Cuda backend is enabled.]\n

is fine for debugging the implementation - but once this goes in production can we print just the user-supplied message ?

Or, if we do want to add more information, something like

alpaka encountered a user-defined error condition while running on the ___ back-end:\n
%s

Contributor Author

ok, changed. Thanks.

@mehmetyusufoglu mehmetyusufoglu force-pushed the throwAtRuntime branch 2 times, most recently from 22d1838 to 26fbae4 Compare June 17, 2024 11:33
@mehmetyusufoglu
Contributor Author

cannot catch the exception on the HAL computer, so I did not test it in the test code.

Unrelated to the test, but what is the HAL computer ?

An HPC system at HZDR :) It has different kinds of GPUs and is easily configurable.

//! Therefore std::runtime_error thrown by ALPAKA_DEVICE_THROW aborts the program for OpenMP backends. If needed
//! the SIGABRT signal can be caught by a signal handler.
#if defined(ALPAKA_ACC_GPU_CUDA_ENABLED) && defined(__CUDA_ARCH__)
# define ALPAKA_DEVICE_THROW(MSG) \
Member

@psychocoderHPC psychocoderHPC Jun 17, 2024

I was about to suggest changing it to a C++ function, which should then be deviceThrow() within the alpaka namespace.
BUT then it should be a trait that takes the acc as an argument, with a corresponding class named after the trait, which is then specialized for each acc. Device functions always take the acc as first argument.

Nevertheless, I would handle it like ALPAKA_ASSERT_ACC and would add the line and error message to the printf.
To follow the naming scheme, this macro should be named ALPAKA_THROW_ACC.
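
For comparison, the trait-based alternative described above (and not adopted in this PR) could look roughly like this. It is a sketch under assumptions: the namespace `demo`, the tag types `AccCpuLike` and `AccGpuLike`, and the trait `DeviceThrow` are all hypothetical stand-ins, not alpaka code.

```cpp
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <stdexcept>

namespace demo
{
    // Hypothetical accelerator tags standing in for real accelerator types.
    struct AccCpuLike {};
    struct AccGpuLike {};

    namespace trait
    {
        // Primary template: CPU-style backends throw an exception.
        template<typename TAcc>
        struct DeviceThrow
        {
            static void deviceThrow(TAcc const&, char const* msg)
            {
                throw std::runtime_error(msg);
            }
        };

        // Specialization: a backend without exceptions prints and aborts.
        template<>
        struct DeviceThrow<AccGpuLike>
        {
            [[noreturn]] static void deviceThrow(AccGpuLike const&, char const* msg)
            {
                std::printf("%s\n", msg);
                std::abort();
            }
        };
    } // namespace trait

    // Free function taking the accelerator as its first argument.
    template<typename TAcc>
    void deviceThrow(TAcc const& acc, char const* msg)
    {
        assert(msg != nullptr);
        trait::DeviceThrow<TAcc>::deviceThrow(acc, msg);
    }

    // Example: the CPU-like path surfaces as a catchable exception.
    inline bool cpuLikeThrows()
    {
        try
        {
            deviceThrow(AccCpuLike{}, "bad input");
        }
        catch(std::runtime_error const&)
        {
            return true;
        }
        return false;
    }
} // namespace demo
```

The sketch shows the cost discussed in the thread: one trait plus one specialization per accelerator, compared with a single macro.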

Contributor Author

So, according to you, it should be a macro called ALPAKA_THROW_ACC without an Acc argument?

Member

Yes, this will follow our macro ALPAKA_ASSERT_ACC and avoid introducing a new trait. IMO this function/macro will be used only in rare cases, therefore I would like to avoid increasing the code base because of this tiny throw.

Contributor Author

OK, so now we have a macro that takes a string and throws or aborts depending on the backend.

Contributor Author

macro name changed to ALPAKA_THROW_ACC. Thanks.

@psychocoderHPC
Member

@mehmetyusufoglu please rebase against develop branch to fix the CI issues

@mehmetyusufoglu mehmetyusufoglu force-pushed the throwAtRuntime branch 2 times, most recently from 9564419 to 697cea7 Compare June 19, 2024 22:10
@mehmetyusufoglu mehmetyusufoglu force-pushed the throwAtRuntime branch 2 times, most recently from cfd024c to 0a6bb4b Compare June 25, 2024 09:24
Development

Successfully merging this pull request may close these issues.

portable device-side abort() or throw ?
4 participants