[Rust] [DataFusion] Improve threading model #24921

Closed
asfimport opened this issue May 12, 2020 · 7 comments

Comments

@asfimport

DataFusion currently spawns one thread per partition, which performs poorly when there are more partitions than available cores/threads. It would be better to use a thread pool whose size defaults to the number of available cores.
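To make the proposal concrete, here is a minimal std-only sketch of a fixed set of worker threads pulling partitions from a shared queue instead of one thread per partition. It is not DataFusion code: `Partition`, `sum_partitions` and `default_workers` are hypothetical names, and summing stands in for whatever work a partition actually needs.

```rust
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

/// Hypothetical stand-in for a partition of data; not a DataFusion type.
type Partition = Vec<i64>;

/// Default pool size: the number of available cores, falling back to 1.
fn default_workers() -> usize {
    thread::available_parallelism().map(|n| n.get()).unwrap_or(1)
}

/// Sum each partition using at most `num_workers` threads,
/// rather than spawning one thread per partition.
fn sum_partitions(partitions: Vec<Partition>, num_workers: usize) -> Vec<i64> {
    let n = partitions.len();
    let num_workers = num_workers.max(1).min(n.max(1));

    // Shared work queue of (partition index, partition).
    let (job_tx, job_rx) = mpsc::channel::<(usize, Partition)>();
    let job_rx = Arc::new(Mutex::new(job_rx));
    let (res_tx, res_rx) = mpsc::channel::<(usize, i64)>();

    let workers: Vec<_> = (0..num_workers)
        .map(|_| {
            let job_rx = Arc::clone(&job_rx);
            let res_tx = res_tx.clone();
            thread::spawn(move || loop {
                // The lock is released at the end of this statement.
                let job = job_rx.lock().unwrap().recv();
                match job {
                    Ok((idx, part)) => {
                        let sum: i64 = part.iter().sum();
                        res_tx.send((idx, sum)).unwrap();
                    }
                    Err(_) => break, // queue closed and drained
                }
            })
        })
        .collect();
    drop(res_tx);

    for job in partitions.into_iter().enumerate() {
        job_tx.send(job).unwrap();
    }
    drop(job_tx); // lets the workers exit once the queue is empty

    // Results arrive in completion order; the index restores partition order.
    let mut out = vec![0; n];
    for (idx, sum) in res_rx {
        out[idx] = sum;
    }
    for w in workers {
        w.join().unwrap();
    }
    out
}
```

Calling `sum_partitions(partitions, default_workers())` keeps the thread count bounded by the core count no matter how many partitions the query produces.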

Here is a Google doc where we can collaborate on a design discussion.

https://docs.google.com/document/d/1_wc6diy3YrRgEIhVIGzrO5AK8yhwfjWlmKtGnvbsrrY/edit?usp=sharing

Reporter: Andy Grove / @andygrove
Assignee: Andy Grove / @andygrove

Note: This issue was originally created as ARROW-8774. Please see the migration documentation for further details.

@asfimport

Adam Lippai / @alippai:
A thing to consider in the long run (but a single default threadpool now helps a lot!):
For an optimal result, reading files should have lower priority than the computation that follows.

This is what I've found on the topic: https://users.rust-lang.org/t/dealing-with-work-priority-and-rayon/30954/2

Rayon doesn't support setting priority for tasks, but as a workaround we could create two threadpools, e.g. one with <=10 threads for file reading and one with CPU_NUM threads for the computation. If you need to fine-tune the workload (S3, HDFS and NFS behave differently; local HDD or SSD is a different topic too), you could either configure the threadpool sizes (even down to 1 thread) or set "nice" for the threadpool threads.
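A rough sketch of the two-pool idea, using only the standard library so it stays self-contained. `Pool` here is a hypothetical minimal pool, not Rayon's `ThreadPool`; `read_partition` simulates I/O, and the pool sizes are assumptions to be tuned per storage backend.

```rust
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

type Job = Box<dyn FnOnce() + Send + 'static>;

/// Minimal fixed-size pool (a std-only stand-in, not Rayon's API).
struct Pool {
    tx: Option<mpsc::Sender<Job>>,
    workers: Vec<thread::JoinHandle<()>>,
}

impl Pool {
    fn new(size: usize) -> Pool {
        let (tx, rx) = mpsc::channel::<Job>();
        let rx = Arc::new(Mutex::new(rx));
        let workers = (0..size.max(1))
            .map(|_| {
                let rx = Arc::clone(&rx);
                thread::spawn(move || loop {
                    let job = rx.lock().unwrap().recv();
                    match job {
                        Ok(job) => job(),
                        Err(_) => break, // queue closed and drained
                    }
                })
            })
            .collect();
        Pool { tx: Some(tx), workers }
    }

    fn execute<F: FnOnce() + Send + 'static>(&self, f: F) {
        self.tx.as_ref().unwrap().send(Box::new(f)).unwrap();
    }

    /// Close the queue and wait for all queued jobs to finish.
    fn join(mut self) {
        drop(self.tx.take());
        for w in self.workers.drain(..) {
            w.join().unwrap();
        }
    }
}

/// Simulated read of partition `id`; a real reader would do file or network I/O.
fn read_partition(id: i64) -> Vec<i64> {
    (0..4).map(|x| id * 10 + x).collect()
}

/// Read on a small I/O pool, compute on a separate CPU pool.
fn run_pipeline(num_partitions: i64, io_threads: usize, cpu_threads: usize) -> i64 {
    let io_pool = Pool::new(io_threads);
    let cpu_pool = Pool::new(cpu_threads);
    let (read_tx, read_rx) = mpsc::channel::<Vec<i64>>();
    let (sum_tx, sum_rx) = mpsc::channel::<i64>();

    for id in 0..num_partitions {
        let read_tx = read_tx.clone();
        io_pool.execute(move || read_tx.send(read_partition(id)).unwrap());
    }
    drop(read_tx);

    // Hand each partition to the CPU pool as soon as it has been read.
    for part in read_rx {
        let sum_tx = sum_tx.clone();
        cpu_pool.execute(move || sum_tx.send(part.iter().sum()).unwrap());
    }
    drop(sum_tx);

    io_pool.join();
    cpu_pool.join();
    sum_rx.iter().sum()
}
```

Capping `io_threads` limits how much the readers can compete with the computation, which approximates a lower priority for reads without real task priorities; setting "nice" on the I/O threads would be a platform-specific addition not shown here.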

Should I add this to the design doc or is this out of scope for a while?

@asfimport

Wes McKinney / @wesm:
FYI we're also working on thread scheduling and concurrent CPU/IO management APIs for C++, for nearly the exact same reasons you are. We might see whether there are shared learnings; we can at least link issues on JIRA to point to our design docs or work in progress.

cc @fsaintjacques @lidavidm @pitrou

@asfimport

Andy Grove / @andygrove:
I think it would be good to add to the design doc. I expect we'll end up creating more JIRAs as an outcome of the discussions there.

@asfimport

Antoine Pitrou / @pitrou:
We don't have a design doc per se, just a bunch of requirements here: https://docs.google.com/document/d/1917ilU6BHfwb5ujzDIVolZgfDN9bC15RTdgWLd7-ik0/edit?usp=sharing

@asfimport

Adam Lippai / @alippai:
@andygrove I don't have edit access, so my addition is pending as a suggestion in the doc.

@asfimport

Andy Grove / @andygrove:
I've been having some good success with async and smol and am ready to start contributing some changes back to DataFusion based on what I have learned so far. Smol is much easier to use than Tokio, co-exists with Tokio quite nicely, and would be more efficient than our current threading model.

asfimport added this to the 2.0.0 milestone Jan 11, 2023