Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Rust] [DataFusion] Use tokio and Futures instead of spawning threads #23038

Closed
Tracked by #16900
asfimport opened this issue Sep 25, 2019 · 6 comments
Closed
Tracked by #16900

Comments

@asfimport
Copy link

The current implementation of the physical query plan uses "thread::spawn" which is expensive. We should switch to using Futures, async!/await!, and tokio so that we are launching tasks in a thread pool instead and writing idiomatic Rust code with futures combinators to chain actions together.

Reporter: Andy Grove / @andygrove
Assignee: Andy Grove / @andygrove

Original Issue Attachments:

Note: This issue was originally created as ARROW-6691. Please see the migration documentation for further details.

@asfimport
Copy link
Author

Adam Lippai / @alippai:
I've tried to add this feature, but I didn't mange to get all the types right in merge.rs. I still need more experience in Rust :). Notes: 

image-2019-12-07-17-54-57-862.png

@asfimport
Copy link
Author

Ben Sully:
My hunch is also that rayon would be better suited to this part ('working' on lots of things in a threadpool, rather than 'waiting' for lots of things) than tokio + async/await, although it does sound like doing IO using gRPC might be well suited to tokio. I'm not familiar with the Datafusion codebase at all unfortunately, how approachable would this issue be for a first time Arrow contributor? :D

@asfimport
Copy link
Author

Andy Grove / @andygrove:
[~sd2k] You might be right, but given that we now have a flight server example using tokio, it seems ideal to use tokio all the way through to query execution to have a complete async implementation? I agree that this is perhaps not an ideal issue to try to take on for a new contributor though!

@asfimport
Copy link
Author

Andy Grove / @andygrove:
I spent a little time working on a PoC of this [1] but the advice from the Tokio team is that we are better off just using dedicated threads for file io and just use tokio for network io, which we already do in the flight server (thanks to Tonic) so I am closing this.

 

 [1] https://github.com/andygrove/async-query/blob/master/src/main.rs

@asfimport
Copy link
Author

Adam Lippai / @alippai:
@andygrove Skipping Tokio can be reasonable, but we still need a threadpool or limiting the number of opened files somehow. Currently it opens and read all the files in parallel, then it does all the computation in parallel (and AFAIK reading a lot of files concurrently won't produce optimal performance on linux). 

@asfimport
Copy link
Author

Andy Grove / @andygrove:
Thanks. yes. I've been looking at Rayon and threadpool so far this morning.
I think it might be worth me creating a Google doc for some collaboration
on a design here.

I also created a new JIRA: https://issues.apache.org/jira/browse/ARROW-8774

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants