
polling: low performance with CPU accelerator and async streams #368

Open
psychocoderHPC opened this issue Aug 8, 2017 · 6 comments

@psychocoderHPC
Member

If multiple asynchronous streams are used on a CPU accelerator and the user program issues many test calls to check the state of events and streams, then we lose one CPU core to the polling activity.
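
For illustration, a minimal sketch of the user-side pattern that triggers this (`event` stands for any alpaka event recorded on a CPU stream; the exact call is illustrative):

```cpp
// Hypothetical busy-polling loop: every iteration calls test(), so one CPU
// core stays fully occupied until the event has completed.
while(!alpaka::event::test(event))
{
    // no blocking wait is used here, so the thread just spins
}
```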

The overhead could be reduced if we added the possibility to assign priorities to a device and/or stream.

This is only a suggestion; we need to evaluate the possibilities.

@BenjaminW3
Member

I am not sure I understand your problem correctly. Could you please elaborate a bit more?

@BenjaminW3
Member

It may be possible to replace the mutex lock within the alpaka::event::test method for EventCpu with an atomic memory access. This should be far faster.
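
A minimal sketch of the idea (illustrative only, not the actual EventCpu internals):

```cpp
#include <atomic>

// Hypothetical event state: the worker sets the flag when the event is
// reached; test() is then a single atomic load instead of a mutex lock.
struct EventCpuState
{
    std::atomic<bool> finished{false};
};

bool test(EventCpuState const& state)
{
    return state.finished.load(std::memory_order_acquire);
}
```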

@psychocoderHPC
Member Author

The performance of the test call itself is OK. The issue is that the master thread (the thread which creates kernel calls and records events) polls all events at a very high frequency to find out which event has finished.
This is partly a user-code issue, but alpaka could mitigate it slightly if we could set the priority of the master thread. The polling thread would then be activated less often by the operating system while the compute threads have a higher priority.
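
A minimal sketch of how the polling thread could be demoted (Linux-specific; using SCHED_IDLE here is an assumption for illustration, and the helper name is hypothetical):

```cpp
#include <pthread.h>
#include <sched.h>

// Hypothetical helper: put the calling (polling) thread into the lowest
// scheduling class so the OS prefers the compute threads.
void demoteCurrentThread()
{
    sched_param param{};
    param.sched_priority = 0; // SCHED_IDLE ignores the priority value
    pthread_setschedparam(pthread_self(), SCHED_IDLE, &param);
}
```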

A clean solution would be asynchronous callbacks, but these are either not supported by CUDA or would increase the cost of starting a kernel.

This issue has no high priority, but I opened it so that we can work on it in the future.

@BenjaminW3
Member

BenjaminW3 commented Aug 10, 2017

Asynchronous callbacks are supported via cudaStreamAddCallback. It would be no problem to add this to alpaka, looking like this:

alpaka::stream::enqueue(stream, [&arbitrary](){do_something();});

CPU backends simply start a new thread with the user callback when it is reached.
The CUDA backend directly starts a thread that waits on a condition_variable. This condition_variable is notified by the CUDA callback we register internally, which allows the thread to go on and execute the user callback.
This would even work around the CUDA limitation that callbacks must not make any CUDA API calls (attempting to use CUDA APIs results in cudaErrorNotPermitted).
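
A sketch of that mechanism (the helper names are illustrative, not actual alpaka internals):

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <thread>

#include <cuda_runtime.h>

// Shared state between the waiting host thread and the CUDA callback.
struct CallbackSync
{
    std::mutex m;
    std::condition_variable cv;
    bool ready = false;
};

// Registered via cudaStreamAddCallback: only notifies, makes no CUDA calls.
void CUDART_CB notifyReached(cudaStream_t, cudaError_t, void* userData)
{
    auto* sync = static_cast<CallbackSync*>(userData);
    {
        std::lock_guard<std::mutex> lk(sync->m);
        sync->ready = true;
    }
    sync->cv.notify_one();
}

void enqueueCallback(cudaStream_t stream, std::function<void()> userCallback)
{
    auto* sync = new CallbackSync;
    std::thread(
        [sync, cb = std::move(userCallback)]()
        {
            std::unique_lock<std::mutex> lk(sync->m);
            sync->cv.wait(lk, [sync] { return sync->ready; });
            lk.unlock();
            cb(); // runs outside the CUDA callback, so it may call CUDA APIs
            delete sync;
        })
        .detach();
    cudaStreamAddCallback(stream, notifyReached, sync, 0);
}
```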

@psychocoderHPC
Member Author

The problem with CUDA callbacks is that we would then need to start a kernel, add a callback, and add an event.
This will increase the time spent within the driver.
The event is needed to build dependencies between streams.
Currently we start a kernel and then add an event for status checks and dependency management.
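
In plain CUDA runtime terms the comparison looks roughly like this (a sketch of what alpaka does under the hood; the kernel and callback names are illustrative):

```cpp
#include <cuda_runtime.h>

__global__ void someKernel() {}

void CUDART_CB onFinished(cudaStream_t, cudaError_t, void*) { /* notify host */ }

void currentPattern(cudaStream_t stream, cudaStream_t otherStream, cudaEvent_t event)
{
    // today: two driver calls per launch
    someKernel<<<1, 1, 0, stream>>>();
    cudaEventRecord(event, stream);             // status checks via cudaEventQuery
    cudaStreamWaitEvent(otherStream, event, 0); // dependency between streams
}

void callbackPattern(cudaStream_t stream, cudaEvent_t event)
{
    // with callbacks: a third driver call per launch
    someKernel<<<1, 1, 0, stream>>>();
    cudaStreamAddCallback(stream, onFinished, nullptr, 0);
    cudaEventRecord(event, stream);             // still needed for dependencies
}
```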

A driver call such as adding an event or launching a kernel costs around 14 µs, if I remember right. In the case of PIConGPU we record up to 10000 events per second.
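Back-of-the-envelope: at ~14 µs per call, 10000 events per second already amount to roughly 140 ms of driver time per second, i.e. about 14% of one CPU core, before any additional callback registrations.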

If we implement callbacks, we need to take care of HIP and OpenACC.

@BenjaminW3
Member

I will try to implement those callbacks nevertheless. There is an equivalent hipStreamAddCallback. The AccCpuOpenAcc2 backend would use the StreamCpuXxx versions. Only a possible future AccGpuOpenAcc2 could be problematic.
