ARROW-11879 [Rust][DataFusion] Make ExecutionContext::sql return dataframe with optimized plan #9639

Dandandan · 2021-03-05T19:31:45Z

I believe we should expect ExecutionContext::sql to return a DataFrame with an optimized logical plan (using the currently applied configuration) rather than a DataFrame with an unoptimized plan.
I believe so because

* it is a high-level function that should use the current configuration
* it is hard to optimize the logical plan afterwards, as it already returns a DataFrame
* many examples, as well as the DataFusion repl in the docs, use ExecutionContext::sql

The TPC-H benchmarks don't use ExecutionContext::sql, which I guess is why this was missed before.

FYI @alamb @andygrove

github-actions · 2021-03-05T19:32:06Z

https://issues.apache.org/jira/browse/ARROW-11879

Dandandan · 2021-03-05T19:38:25Z

rust/datafusion/src/execution/context.rs

+
+        let opt_plan1 = ctx.optimize(&plan1)?;
+
+        let plan2 = ctx.sql("SELECT * FROM (SELECT 1) WHERE TRUE AND TRUE")?;


Before this PR the test fails, because ctx.sql doesn't optimize the plan (the optimized plan for this query is the same as the plan for SELECT 1).
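
To make the expectation in this test concrete, here is a minimal self-contained sketch. It is a toy model, not DataFusion code: `Expr`, `Plan`, `fold` and `optimize` are hypothetical names, and it assumes an optimizer rule that folds constant predicates and drops filters that are always true.

```rust
// Toy model only: hypothetical stand-ins, not DataFusion's LogicalPlan or optimizer.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Literal(bool),
    Column(String),
    And(Box<Expr>, Box<Expr>),
}

#[derive(Debug, Clone, PartialEq)]
enum Plan {
    /// Stands in for `SELECT 1`.
    SelectOne,
    /// Stands in for `SELECT * FROM (input) WHERE predicate`.
    Filter { predicate: Expr, input: Box<Plan> },
}

/// Fold constant boolean sub-expressions, e.g. `TRUE AND TRUE` becomes `TRUE`.
fn fold(expr: &Expr) -> Expr {
    match expr {
        Expr::And(left, right) => match (fold(left), fold(right)) {
            (Expr::Literal(a), Expr::Literal(b)) => Expr::Literal(a && b),
            (l, r) => Expr::And(Box::new(l), Box::new(r)),
        },
        other => other.clone(),
    }
}

/// Drop filters whose predicate folds to `TRUE`; keep every other filter.
fn optimize(plan: &Plan) -> Plan {
    match plan {
        Plan::SelectOne => Plan::SelectOne,
        Plan::Filter { predicate, input } => {
            let input = optimize(input);
            match fold(predicate) {
                Expr::Literal(true) => input,
                folded => Plan::Filter {
                    predicate: folded,
                    input: Box::new(input),
                },
            }
        }
    }
}
```

In this model the plan for `SELECT * FROM (SELECT 1) WHERE TRUE AND TRUE` differs from the plan for `SELECT 1` until `optimize` is applied, which is the difference the test relies on.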

jorgecarleitao

LGTM. Well spotted. Thanks @Dandandan !

alamb

I agree -- nice catch @Dandandan. However, there appears to be a test failure on this PR.

Dandandan · 2021-03-05T22:07:00Z

Hm, it seems it's slightly more complicated:

DataFrame::collect currently also runs optimize (which makes sense, as it is a kind of final "build" function)
But not every user wants to run collect (e.g. in Ballista, the logical plan from the DataFrame is used; it is not directly collected)

Dandandan · 2021-03-05T22:09:43Z

Keeping this as a draft for now; I think what to do here is still open for discussion.

Do we want the dataframe from ExecutionContext::sql to carry an optimized plan, or should the plan only be optimized when calling .collect on that dataframe?
Someone still might want to add some filter / aggregate on the dataframe, so maybe it makes sense the optimization pass only works on collect?

alamb · 2021-03-05T22:22:22Z

> Someone still might want to add some filter / aggregate on the dataframe, so maybe it makes sense the optimization pass only works on collect?

Ideally in my mind we would be able to run the optimizations twice (so we could do it with the initial call to sql, but then if someone added more grouping or repartitioning or something, we could run the optimizer passes again).

@Dandandan something I have been thinking about recently (as I prepared for my talk next week on DataFusion, as well as talking with @NGA-TRAN on my team at Influx) was how similar the LogicalPlanBuilder and DataFrame APIs are (and in fact the DataFrameImpl basically calls the functions on LogicalPlanBuilder).

I almost wonder if we should combine the two somehow... I don't have a concrete proposal now just 🤔

ARROW-11881: [Rust][DataFusion] Fix clippy lint

A linter error has appeared on master somehow:

```
error: unnecessary parentheses around `for` iterator expression
   --> datafusion/src/physical_plan/merge.rs:124:31
    |
124 |                 for part_i in (0..input_partitions) {
    |                               ^^^^^^^^^^^^^^^^^^^^^ help: remove these parentheses
    |
    = note: `-D unused-parens` implied by `-D warnings`
```

Seen on at least #9612 and #9639:

https://github.com/apache/arrow/pull/9612/checks?check_run_id=2042047472
https://github.com/apache/arrow/pull/9639/checks?check_run_id=2042649120

Closes #9642 from alamb/fix_clippy

Authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Signed-off-by: Neville Dipale <nevilledips@gmail.com>

Dandandan · 2021-03-06T09:57:05Z

@alamb

> Ideally in my mind we would be able to run the optimizations twice (so we could do it with the initial call to sql, but then if someone added more grouping or repartitioning or something, we could run the optimizer passes again).

I removed the check and added a test for the projection push down verifying that it returns the same plan when optimizing twice. I am not sure what the check was trying to prevent? It seems all the tests pass (and they use sql + collect quite often).
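
As an illustration of the "optimizing twice returns the same plan" property, here is a minimal sketch on a toy plan type. This is not the DataFusion projection push down rule or the test added in this PR: `Plan`, `push_down` and `optimize` are hypothetical, and the scan-narrowing logic is an assumption chosen only to show idempotence.

```rust
// Toy model only: hypothetical stand-ins, not DataFusion's plan nodes or optimizer rule.
#[derive(Debug, Clone, PartialEq)]
enum Plan {
    /// A table scan; `projection` is `None` until columns have been pushed down.
    Scan { table: String, projection: Option<Vec<String>> },
    /// Selects `columns` from `input`.
    Projection { columns: Vec<String>, input: Box<Plan> },
}

/// Push the set of required columns down to the scan.
fn push_down(plan: &Plan, required: Option<&[String]>) -> Plan {
    match plan {
        Plan::Projection { columns, input } => Plan::Projection {
            columns: columns.clone(),
            input: Box::new(push_down(input, Some(columns.as_slice()))),
        },
        Plan::Scan { table, projection } => {
            let projection = match (required, projection) {
                // First pass: narrow the scan to the required columns.
                (Some(req), None) => Some(req.to_vec()),
                // Later passes: the scan is already narrowed, keep what is still required
                // instead of treating the existing projection as an error.
                (Some(req), Some(existing)) => Some(
                    existing
                        .iter()
                        .filter(|c| req.contains(*c))
                        .cloned()
                        .collect(),
                ),
                // Nothing above the scan restricts the columns.
                (None, existing) => existing.clone(),
            };
            Plan::Scan { table: table.clone(), projection }
        }
    }
}

/// Entry point: no columns are required from outside the plan.
fn optimize(plan: &Plan) -> Plan {
    push_down(plan, None)
}
```

The property to check is `optimize(&optimize(&plan)) == optimize(&plan)`, so running the pass once in sql and again in collect does not change the resulting plan in this model.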

> @Dandandan something I have been thinking recently (as I prepared for my talk next week on DataFusion as well as talking with @NGA-TRAN on my team at Influx) was how similar the LogicalPlanBuilder and DataFrame APIs were (and in fact the DataFrameImpl basically calls the functions on LogicalPlanBuilder).
>
> I almost wonder if we should combine the two somehow... I don't have a concrete proposal now just thinking

Thanks. Yeah DataFrame and LogicalPlan are pretty similar, not sure whether there is anything to change about it? As the DataFrame is just a higher level layer over the LogicalPlan.
I think maybe some methods in ExecutionContext can be changed / deprecated so users will be nudged to use DataFrames more?

For example, the public function create_logical_plan has a comment "This function is intended for internal use and should not be called directly", but both the TPC-H benchmarks and the flight_server example call it and perform further operations on the logical / physical plan themselves, where they probably should use the sql / DataFrame::collect API instead. Snippet from the flight_server example:

                let plan = ctx
                    .create_logical_plan(&sql)
                    .and_then(|plan| ctx.optimize(&plan))
                    .and_then(|plan| ctx.create_physical_plan(&plan))
                    .map_err(|e| to_tonic_err(&e))?;

                // execute the query
                let results =
                    collect(plan.clone()).await.map_err(|e| to_tonic_err(&e))?;

But this PR now runs the optimizer twice if you use sql + .collect.
I am not sure what the expected end result should be. I guess one could keep some kind of flag marking that a certain node of a plan is already optimized, so that when that node is the root a full optimization is not run again, but maybe that's not worth it.
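
A rough sketch of the flag idea, on toy types only. None of this is in the PR or in DataFusion: `Context`, `DataFrame`, `from_sql`, the `optimized` field and the run counter are all hypothetical and exist just to show when the optimizer would be skipped.

```rust
use std::cell::Cell;

// Toy model only: hypothetical stand-ins, not DataFusion's ExecutionContext or DataFrame.
#[derive(Debug, Clone, PartialEq)]
enum Plan {
    Source,
    Filter(Box<Plan>),
}

/// Counts optimizer invocations so the effect of the flag is observable.
struct Context {
    optimizer_runs: Cell<usize>,
}

impl Context {
    fn new() -> Self {
        Context { optimizer_runs: Cell::new(0) }
    }

    /// Placeholder optimizer: records the run and returns the plan unchanged.
    fn optimize(&self, plan: &Plan) -> Plan {
        self.optimizer_runs.set(self.optimizer_runs.get() + 1);
        plan.clone()
    }

    fn runs(&self) -> usize {
        self.optimizer_runs.get()
    }
}

struct DataFrame {
    plan: Plan,
    /// True only while `plan` is exactly what the optimizer returned.
    optimized: bool,
}

impl DataFrame {
    /// Models `sql`: the plan is optimized eagerly and marked as such.
    fn from_sql(ctx: &Context, plan: Plan) -> Self {
        DataFrame { plan: ctx.optimize(&plan), optimized: true }
    }

    /// Adding an operator creates a new root, so the flag is cleared.
    fn filter(self) -> Self {
        DataFrame { plan: Plan::Filter(Box::new(self.plan)), optimized: false }
    }

    /// Models `collect`: re-optimize only if the root is not already optimized.
    fn collect(&self, ctx: &Context) -> Plan {
        if self.optimized {
            self.plan.clone()
        } else {
            ctx.optimize(&self.plan)
        }
    }
}
```

With this shape, sql followed directly by collect runs the optimizer once, while sql, then filter, then collect runs it twice. The cost is an extra piece of state that every plan-modifying method has to reset.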

alamb

I think this looks good to me. Thanks again @Dandandan

I also re-ran the DataFusion tests locally on this branch after merging from master to make sure all still looks good. 👍

alamb · 2021-03-14T11:14:22Z

rust/datafusion/src/execution/context.rs

+
+        let opt_plan1 = ctx.optimize(&plan1)?;
+
+        let plan2 = ctx.sql("SELECT * FROM (SELECT 1) WHERE TRUE AND TRUE")?;


@alamb

…frame with optimized plan

I believe we should expect `ExecutionContext::sql` to return an optimized logical plan (with current applying config) rather than a `DataFrame` with an unoptimized plan.
I believe so because

* it is a high level function that should use the current configuration
* it is hard to optimize the logical plan afterwards, as it already returns a dataframe
* many examples, but also DataFusion `repl` in docs use `ExecutionContext::sql`

The TPC-H benchmarks don't use `ExecutionContext::sql` which is I guess why it was missed before.

FYI @alamb @andygrove

Closes apache#9639 from Dandandan/ctx_sql_optimize

Authored-by: Heres, Daniel <danielheres@gmail.com>
Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>

github-actions bot added Component: Rust - DataFusion Component: Rust labels Mar 5, 2021

Dandandan changed the title from "ARROW-11879 [Rust][DataFusion] ExecutionContext::sql should optimize plan" to "ARROW-11879 [Rust][DataFusion] Make ExecutionContext::sql return dataframe with optimized plan" Mar 5, 2021

Dandandan commented Mar 5, 2021

jorgecarleitao approved these changes Mar 5, 2021

alamb approved these changes Mar 5, 2021

This was referenced Mar 5, 2021

ARROW-11824: [Rust] [Parquet] Use logical types in Arrow schema conversion #9612

Closed

ARROW-11881: [Rust][DataFusion] Fix clippy lint #9642

Closed

Dandandan closed this Mar 5, 2021

Dandandan reopened this Mar 5, 2021

Dandandan marked this pull request as draft March 5, 2021 22:07

Dandandan marked this pull request as ready for review March 6, 2021 09:34

Dandandan closed this Mar 6, 2021

Dandandan reopened this Mar 6, 2021

Dandandan added 7 commits March 7, 2021 22:28

ctx.sql should optimize plan

833f8ab

fmt

ce423cf

Revert unrelated

d0f1f27

Clean up

a02d91e

Move optimize

99cca32

Remove existing projection check

3bfaae4

Add test for running optimizer twice in projection push down

42420b3

Dandandan force-pushed the ctx_sql_optimize branch from 45f2800 to 42420b3 Compare March 7, 2021 21:28

alamb approved these changes Mar 14, 2021

alamb closed this in 29feea0 Mar 14, 2021

asfimport mentioned this pull request Mar 14, 2021

[Rust][DataFusion] ExecutionContext::sql should optimize query plan #27721

Closed
