[WIP] Support async response to open stream calls to avoid blocking netty threads#3593
[WIP] Support async response to open stream calls to avoid blocking netty threads#3593saurabhd336 wants to merge 3 commits intoapache:mainfrom
Conversation
| if (sorted.contains(fileId)) { | ||
| try { | ||
| return CompletableFuture.completedFuture( | ||
| resolve( |
There was a problem hiding this comment.
This should already be running on the caller thread i.e. netty thread
|
This PR is stale because it has been open 20 days with no activity. Remove stale label or comment or this will be closed in 10 days. |
zaynt4606
left a comment
There was a problem hiding this comment.
It is necessary to async response for the sorting tasks!
Would it be better to use a dedicated thread to handle callbacks?
Currently, every request is returned via the EventLoop thread, which might incur significant overhead when handling a large number of requests with long sort times.
|
Hey @zaynt4606 thanks for reviewing this. Currently occupied with some other threads, let me revisit this PR this week.
My idea was to simply make sure the eventloop thread isn't block (i.e. avoid the thread.sleep()), with the current changes the response is still sent by the eventloop threads, but the wait is eliminated. I can check if we can / should use a seperate threadpool for responding |
What changes were proposed in this pull request?
During openStream calls to workers, the worker netty threads can get blocked for
sortTimeoutmilliseconds. This could degrade other worker RPC calls. Ideally, if we introduce an async reply mechanism, we can free the netty threads thereby ensuring other RPC calls are not impacted.Eg: On a 48 core VM, if > (48 * 2) OpenStream calls with AQE are made requesting the same partition file, all the netty threads will get blocked. Even if a non AQE OpenStream call is then received, we'll not have free netty threads to respond
Why are the changes needed?
Avoid blocking worker netty threads during heavy sorting openStream calls.
Does this PR resolve a correctness bug?
No
Does this PR introduce any user-facing change?
Set worker config "celeborn.worker.async.openStream.enabled" to move to async reply mechanism for openStream calls.
How was this patch tested?
TODO