
polling: low performance with CPU accelerator and async streams #368

Open
psychocoderHPC opened this issue Aug 8, 2017 · 6 comments

@psychocoderHPC
Member

If multiple asynchronous streams are used on a CPU accelerator and the user program issues many test calls to check the state of events and streams, then we lose one CPU core to the polling activity.
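
For illustration, a minimal sketch of the user-side pattern that triggers this (`event` stands for any alpaka event recorded on a CPU stream; the exact call is illustrative):

```cpp
// Hypothetical busy-polling loop: every iteration calls test(), so one CPU
// core stays fully occupied until the event has completed.
while(!alpaka::event::test(event))
{
    // no blocking wait is used here, so the thread just spins
}
```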

The overhead could be reduced if we added the possibility to assign priorities to a device and/or stream.

This is only a suggestion; we need to evaluate the possibilities.

@BenjaminW3
Member

I am not sure I understand your problem correctly. Could you please elaborate a bit more?

@BenjaminW3
Member

It may be possible to replace the mutex lock within the alpaka::event::test method for EventCpu with an atomic memory access. This should be far faster.
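
A minimal sketch of the idea (illustrative only, not the actual EventCpu internals):

```cpp
#include <atomic>

// Hypothetical event state: the worker sets the flag when the event is
// reached; test() is then a single atomic load instead of a mutex lock.
struct EventCpuState
{
    std::atomic<bool> finished{false};
};

bool test(EventCpuState const& state)
{
    return state.finished.load(std::memory_order_acquire);
}
```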

@psychocoderHPC
Member Author

The performance of the test call itself is OK. The issue is that the master thread (the thread which creates kernel calls and records events) polls all events at a very high frequency to find out which event has finished.
This is partly a user-code issue, but alpaka could mitigate it slightly if we could set the priority of the master thread. The polling thread would then be activated less often by the operating system while the compute threads have a higher priority.
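
A minimal sketch of how the polling thread could be demoted (Linux-specific; using SCHED_IDLE here is an assumption for illustration, and the helper name is hypothetical):

```cpp
#include <pthread.h>
#include <sched.h>

// Hypothetical helper: put the calling (polling) thread into the lowest
// scheduling class so the OS prefers the compute threads.
void demoteCurrentThread()
{
    sched_param param{};
    param.sched_priority = 0; // SCHED_IDLE ignores the priority value
    pthread_setschedparam(pthread_self(), SCHED_IDLE, &param);
}
```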

A clean solution would be asynchronous callbacks, but these are either not supported by CUDA or would increase the cost of starting a kernel.

This issue has no high priority, but I opened it so that we can work on it in the future.

@BenjaminW3
Member

BenjaminW3 commented Aug 10, 2017

Asynchronous callbacks are supported via cudaStreamAddCallback. It would be no problem to add this to alpaka, looking like this:

alpaka::stream::enqueue(stream, [&arbitrary](){do_something();});

CPU backends simply start a new thread with the user callback when it is reached.
The CUDA backend directly starts a thread that waits on a condition_variable. This condition_variable is notified by the CUDA callback we register internally, which allows the thread to go on and execute the user callback.
This would even work around the CUDA limitation that callbacks must not make any CUDA API calls (attempting to use CUDA APIs results in cudaErrorNotPermitted).
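
A sketch of that mechanism (the helper names are illustrative, not actual alpaka internals):

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <thread>

#include <cuda_runtime.h>

// Shared state between the waiting host thread and the CUDA callback.
struct CallbackSync
{
    std::mutex m;
    std::condition_variable cv;
    bool ready = false;
};

// Registered via cudaStreamAddCallback: only notifies, makes no CUDA calls.
void CUDART_CB notifyReached(cudaStream_t, cudaError_t, void* userData)
{
    auto* sync = static_cast<CallbackSync*>(userData);
    {
        std::lock_guard<std::mutex> lk(sync->m);
        sync->ready = true;
    }
    sync->cv.notify_one();
}

void enqueueCallback(cudaStream_t stream, std::function<void()> userCallback)
{
    auto* sync = new CallbackSync;
    std::thread(
        [sync, cb = std::move(userCallback)]()
        {
            std::unique_lock<std::mutex> lk(sync->m);
            sync->cv.wait(lk, [sync] { return sync->ready; });
            lk.unlock();
            cb(); // runs outside the CUDA callback, so it may call CUDA APIs
            delete sync;
        })
        .detach();
    cudaStreamAddCallback(stream, notifyReached, sync, 0);
}
```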

@psychocoderHPC
Member Author

The problem with CUDA callbacks is that we would then need to start a kernel, add a callback, and add an event.
This will increase the time spent within the driver.
The event is needed to build dependencies between streams.
Currently we start a kernel and then add an event for status checks and dependency management.
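
In plain CUDA runtime terms the comparison looks roughly like this (a sketch of what alpaka does under the hood; the kernel and callback names are illustrative):

```cpp
#include <cuda_runtime.h>

__global__ void someKernel() {}

void CUDART_CB onFinished(cudaStream_t, cudaError_t, void*) { /* notify host */ }

void currentPattern(cudaStream_t stream, cudaStream_t otherStream, cudaEvent_t event)
{
    // today: two driver calls per launch
    someKernel<<<1, 1, 0, stream>>>();
    cudaEventRecord(event, stream);             // status checks via cudaEventQuery
    cudaStreamWaitEvent(otherStream, event, 0); // dependency between streams
}

void callbackPattern(cudaStream_t stream, cudaEvent_t event)
{
    // with callbacks: a third driver call per launch
    someKernel<<<1, 1, 0, stream>>>();
    cudaStreamAddCallback(stream, onFinished, nullptr, 0);
    cudaEventRecord(event, stream);             // still needed for dependencies
}
```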

A driver call such as adding an event or launching a kernel costs around 14 µs, if I remember right. In the case of PIConGPU we record up to 10000 events per second.
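Back-of-the-envelope: at ~14 µs per call, 10000 events per second already amount to roughly 140 ms of driver time per second, i.e. about 14% of one CPU core, before any additional callback registrations.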

If we implement callbacks, we need to take care of HIP and OpenACC.

@BenjaminW3
Member

I will try to implement those callbacks nevertheless. There is an equivalent hipStreamAddCallback. The AccCpuOpenAcc2 backend would use the StreamCpuXxx versions. Only a possible future AccGpuOpenAcc2 could be problematic.
