add stream callback #373
force-pushed from fe54008 to 8a6d892
note: I am not available for two weeks and will check all PRs after my holidays.
force-pushed from 685e468 to 22a84a9
force-pushed from 08a7a3b to c9006ab
force-pushed from c9006ab to f1cfe26
Could you please add a short description of which concepts are new in this pull request, and maybe a code snippet (or a pointer to one) showing how to use the callbacks? How do the callbacks work (in which context is a callback executed)? Is a new thread opened to execute the callback code?
There are no new concepts.
A usage example can also be found in the new unit test. Callbacks are always executed in an independent thread. The CPU streams have always supported this (unintentionally ;-) ) because they use a thread pool to execute their tasks. Only for CUDA streams is there new code. CUDA itself calls the callback from within its own thread, which does not allow calling CUDA runtime API functions from inside the callback. This limitation is lifted by starting a new thread per callback for async CUDA streams, and by using the waiting thread itself for sync CUDA streams.
ready for review ;-)
I have a performance question
```cpp
pCallbackSynchronizationData.get(),
0u));
// ...
std::thread t(
```
Creating a thread per callback can be very expensive in time. E.g., PIConGPU spawns over 2000 kernels/memcpys per second and can have over 100 tasks waiting in streams. This means we would need to spawn 2k threads/s and keep over 100 threads active.
Is it possible to use one thread for all callbacks (waiting in the background), add callbacks to a list, and then always execute them from that callback thread?
We can also move this to a later pull request if it is not currently easy to do.
Currently, I would like to merge this as is. I will create a follow-up issue for optimizing this.
The most flexible solution would be a thread pool with a queue of ready callbacks. If there is only one thread, the latency would be highest but it would equal your single-thread solution; with multiple threads, the latency/resource trade-off can be adapted per use case.
Can we merge this and create the follow-up ticket? My upcoming event unit tests are based on some of these stream test helpers.
Sorry, I was busy. Yes, I will merge it.
possible solution for #368