Skip to content

[WIP] Support async response to open stream calls to avoid blocking netty threads#3593

Open
saurabhd336 wants to merge 3 commits intoapache:mainfrom
saurabhd336:sortedOpenStreamAsync
Open

[WIP] Support async response to open stream calls to avoid blocking netty threads#3593
saurabhd336 wants to merge 3 commits intoapache:mainfrom
saurabhd336:sortedOpenStreamAsync

Conversation

@saurabhd336
Copy link
Contributor

@saurabhd336 saurabhd336 commented Feb 3, 2026

What changes were proposed in this pull request?

During openStream calls to workers, the worker netty threads can get blocked for sortTimeout milliseconds. This could degrade other worker RPC calls. Ideally, if we introduce an async reply mechanism, we can free the netty threads thereby ensuring other RPC calls are not impacted.

Eg: On a 48 core VM, if > (48 * 2) OpenStream calls with AQE are made requesting the same partition file, all the netty threads will get blocked. Even if a non AQE OpenStream call is then received, we'll not have free netty threads to respond

Why are the changes needed?

Avoid blocking worker netty threads during heavy sorting openStream calls.

Does this PR resolve a correctness bug?

No

Does this PR introduce any user-facing change?

Set worker config "celeborn.worker.async.openStream.enabled" to move to async reply mechanism for openStream calls.

How was this patch tested?

TODO

if (sorted.contains(fileId)) {
try {
return CompletableFuture.completedFuture(
resolve(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should already be running on the caller thread i.e. netty thread

@github-actions
Copy link

This PR is stale because it has been open 20 days with no activity. Remove stale label or comment or this will be closed in 10 days.

@github-actions github-actions bot added the stale label Feb 25, 2026
@zaynt4606 zaynt4606 removed the stale label Feb 25, 2026
Copy link
Contributor

@zaynt4606 zaynt4606 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is necessary to async response for the sorting tasks!
Would it be better to use a dedicated thread to handle callbacks?
Currently, every request is returned via the EventLoop thread, which might incur significant overhead when handling a large number of requests with long sort times.

@saurabhd336
Copy link
Contributor Author

saurabhd336 commented Feb 25, 2026

Hey @zaynt4606 thanks for reviewing this. Currently occupied with some other threads, let me revisit this PR this week.

Currently, every request is returned via the EventLoop thread, which might incur significant overhead when handling a large number of requests with long sort times.

My idea was to simply make sure the eventloop thread isn't block (i.e. avoid the thread.sleep()), with the current changes the response is still sent by the eventloop threads, but the wait is eliminated. I can check if we can / should use a seperate threadpool for responding

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants