ARROW-9754: [Rust] [DataFusion] Implement async in ExecutionPlan trait #8285

Closed
wants to merge 9 commits

Conversation

@andygrove (Member) commented Sep 26, 2020

This PR implements async in the ExecutionPlan trait. I ran the TPC-H benchmark and the performance is about the same.

Master branch:

Running benchmarks with the following options: TpchOpt { query: 1, debug: false, iterations: 3, concurrency: 24, batch_size: 4096, path: "/mnt/tpch/parquet/100-240", file_format: "parquet" }
Query 1 iteration 0 took 14365 ms
Query 1 iteration 1 took 14284 ms
Query 1 iteration 2 took 14269 ms

This PR:

Running benchmarks with the following options: TpchOpt { query: 1, debug: false, iterations: 3, concurrency: 24, batch_size: 4096, path: "/mnt/tpch/parquet/100-240", file_format: "parquet" }
Query 1 iteration 0 took 14305 ms
Query 1 iteration 1 took 14372 ms
Query 1 iteration 2 took 14323 ms

EDIT: I previously posted perf numbers that were marginally better but that was due to using a more recent Rust nightly during the development of this PR.
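
For readers unfamiliar with async trait methods in Rust: the sketch below (hypothetical names and a simplified return type, not the actual `ExecutionPlan` signature) shows the general shape of an async method on a trait using the `async-trait` crate, since native async trait methods were not available on the 2020 toolchain used here.

```rust
use async_trait::async_trait;

// Hypothetical, simplified stand-in for an execution plan trait; the real
// trait in DataFusion returns record batches rather than plain integers.
#[async_trait]
pub trait ExecutionPlanLike: Send + Sync {
    /// Asynchronously execute one partition and return its results.
    async fn execute(&self, partition: usize) -> Result<Vec<u64>, String>;
}

pub struct NoopExec;

#[async_trait]
impl ExecutionPlanLike for NoopExec {
    async fn execute(&self, _partition: usize) -> Result<Vec<u64>, String> {
        Ok(vec![])
    }
}
```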

@andygrove (Member Author)

Currently fails to compile due to

error[E0700]: hidden type for `impl Trait` captures lifetime that does not appear in bounds
   --> datafusion/src/execution/context.rs:352:10
    |
352 |     ) -> Result<()> {
    |          ^^^^^^^^^^
    |
note: hidden type `impl std::future::Future` captures the scope of call-site for function at 352:21
   --> datafusion/src/execution/context.rs:352:21
    |
352 |       ) -> Result<()> {
    |  _____________________^
353 | |         // create directory to contain the CSV files (one per partition)
354 | |         let path = path.to_string();
355 | |         fs::create_dir(&path)?;
...   |
377 | |         Ok(())
378 | |     }
    | |_____^

@BatmanAoD I was wondering if you might know what is causing this?
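
For context, a minimal reproduction of this class of E0700 error (hypothetical code, not the actual DataFusion function) is a function that returns `impl Future` while borrowing a `&str` argument, so the hidden future type captures a lifetime that does not appear in the return type's bounds:

```rust
use std::future::Future;

// Fails with E0700 on pre-2024 editions: the returned future captures the
// lifetime of `path`, but that lifetime does not appear in the `impl Trait` bounds.
fn write_csv_files(path: &str) -> impl Future<Output = ()> {
    async move {
        // Converting to an owned String *inside* the async block is too late:
        // the borrow of `path` is already part of the future's captured state.
        let path = path.to_string();
        println!("{}", path);
    }
}
```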

@andygrove (Member Author)

FYI @jorgecarleitao @alamb -- this is some groundwork for async that the scheduler will need.

@BatmanAoD (Contributor)

Hm, I'd guess it's probably because of the &str in the input args. Let me take a quick look...

@BatmanAoD (Contributor)

Yes, I think it's because the &str is actually alive in between calls to the future. I'm not sure why that is, though, since you reassign it to an owned type in the first line.
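
One common way around this (a hedged sketch, not necessarily the exact change made in this PR) is to take ownership of the borrowed data before constructing the future, or to declare the captured lifetime in the return type:

```rust
use std::future::Future;

// Option 1: convert to an owned String before the async block, so the future
// owns its data and no borrowed lifetime needs to appear in the bounds.
fn write_csv_files(path: &str) -> impl Future<Output = ()> {
    let path = path.to_string();
    async move {
        println!("{}", path);
    }
}

// Option 2: declare that the returned future may borrow from `path`.
fn write_csv_files_borrowing(path: &str) -> impl Future<Output = ()> + '_ {
    async move {
        println!("{}", path);
    }
}
```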

@jorgecarleitao (Member)

I agree with this: what we want from ExecutionPlan's execute() is not just to run something, but to present to a scheduler code that:

  1. can execute
  2. threads can switch in and out of that execution

which is exactly what async offers. 👍

@andygrove changed the title from "ARROW-9754: [Rust] [DataFusion] Implement async in ExecutionPlan trait" to "[WIP] ARROW-9754: [Rust] [DataFusion] Implement async in ExecutionPlan trait" Sep 27, 2020
@andygrove (Member Author)

Thanks @BatmanAoD that was it.

@andygrove marked this pull request as ready for review September 27, 2020 14:55
@alamb (Contributor) left a comment

I really like this change @andygrove -- I think it is a great step forward towards running DataFusion more efficiently. I would be happy if it were merged!

ctx.state.config.concurrency = 1;
ctx.register_table("aggregate_test_100", Box::new(mem_table));
ctx
unimplemented!()
Contributor

🤔

Member Author

working on this now ... about to push the fix

Member Author

The issue was that I couldn't call async code from criterion, so I had to create a separate tokio runtime and block on the async code. I expect there may be a cleaner way to do this.
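
For illustration, the usual shape of that workaround (placeholder names such as `run_query`; not the actual benchmark code from this PR) is to build a Tokio runtime once and `block_on` the async work inside criterion's synchronous `iter` closure:

```rust
use criterion::{criterion_group, criterion_main, Criterion};

// Stand-in for an async DataFusion query.
async fn run_query() {}

fn benchmark(c: &mut Criterion) {
    // criterion's closure is synchronous, so block on the async work
    // using an explicitly created Tokio runtime.
    let rt = tokio::runtime::Runtime::new().unwrap();
    c.bench_function("run_query", |b| b.iter(|| rt.block_on(run_query())));
}

criterion_group!(benches, benchmark);
criterion_main!(benches);
```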

for chunk in chunks {
let chunk = chunk.to_vec();
let input = self.input.clone();
let task: JoinHandle<Result<Vec<Arc<RecordBatch>>>> =
Contributor

Yeah, this is really much nicer in my opinion -- we spawn tasks (not threads), so we won't create more threads than CPUs, and it gives users better control over how the tasks are run. 👍
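
As a rough illustration of the pattern being discussed (placeholder types, not the actual MergeExec code), spawning one Tokio task per chunk looks like the sketch below; the tasks are multiplexed over the runtime's worker threads rather than each getting its own OS thread:

```rust
use tokio::task::JoinHandle;

// Stand-in for executing one partition and collecting its record batches.
async fn process_chunk(chunk: Vec<u64>) -> Vec<u64> {
    chunk
}

#[tokio::main]
async fn main() {
    let chunks: Vec<Vec<u64>> = vec![vec![1, 2], vec![3, 4]];

    // Spawn one task per chunk; the runtime schedules them over its
    // worker-thread pool instead of creating a thread per partition.
    let handles: Vec<JoinHandle<Vec<u64>>> = chunks
        .into_iter()
        .map(|chunk| tokio::spawn(process_chunk(chunk)))
        .collect();

    let mut results = Vec::new();
    for handle in handles {
        results.extend(handle.await.expect("task panicked"));
    }
    println!("{:?}", results);
}
```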

@@ -1 +1 @@
-nightly-2020-04-22
+nightly-2020-08-24
Contributor

I understand this upgrade was helpful during development, but I suggest we don't upgrade upon final merge.

@andygrove (Member Author)

@alamb @jorgecarleitao I just realized that once this PR is merged, I could go ahead and implement join support, because it should be relatively efficient now that MergeExec spawns tasks rather than threads. I'm not sure which is the higher priority for 2.0.0: implementing joins (just inner equijoins for now) or implementing the scheduler. What are your thoughts?

@andygrove (Member Author)

@alippai @vertexclique @svenwb fyi

@jorgecarleitao (Member) left a comment

LGTM. Really great, @andygrove.

@jorgecarleitao (Member)

I +1 the one that you think you will have the most fun working on :-)

If both are equally fun, I would go for the joins, just because feature-wise IMO it is one of the two major features we are missing (together with windowing). This is because the types of queries I typically work with use a lot of joins - it may not be as relevant for other folks.

@andygrove (Member Author)

@jorgecarleitao I know how to implement joins, but I am still learning on the scheduler front, so I think it would make more sense to ship join support in 2.0.0, which may make DataFusion more compelling for a larger audience. I could then focus on the scheduler for the next release.

@andygrove closed this in 75cdad4 Sep 27, 2020
@alamb (Contributor) commented Sep 27, 2020

Sounds like a good plan @andygrove -- regarding the scheduler, I may have time to help out in a few weeks as well, as it is directly applicable to what I am working on at work.

I actually think the move to async will help a lot (by partly constraining the implementation). I am also very happy to review / help out with Joins (but I will have limited time to do so as they are not directly relevant to what I am doing for work)

arw2019 pushed a commit to arw2019/arrow that referenced this pull request Sep 28, 2020
arw2019 pushed a commit to arw2019/arrow that referenced this pull request Sep 29, 2020
arw2019 pushed a commit to arw2019/arrow that referenced this pull request Sep 29, 2020
emkornfield pushed a commit to emkornfield/arrow that referenced this pull request Oct 16, 2020
GeorgeAp pushed a commit to sirensolutions/arrow that referenced this pull request Jun 7, 2021