ARROW-11879 [Rust][DataFusion] Make ExecutionContext::sql return dataframe with optimized plan #9639
let opt_plan1 = ctx.optimize(&plan1)?;

let plan2 = ctx.sql("SELECT * FROM (SELECT 1) WHERE TRUE AND TRUE")?;
Before the PR, this test fails because the plan is not optimized (the optimized plan for this query is the same as the plan for `SELECT 1`).
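To make the idea behind this test concrete, here is a minimal toy sketch of why the optimized plan for the filtered query matches the plan for `SELECT 1`. Everything in it (`Expr`, `Plan`, `fold`, `optimize`, `filtered_select_one`) is hypothetical and made up for illustration; it is not DataFusion's plan representation or optimizer, only a model of a filter whose predicate folds to `TRUE` being dropped.

```rust
// Toy model only: these types are hypothetical and are NOT DataFusion's.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Literal(bool),
    And(Box<Expr>, Box<Expr>),
}

#[derive(Debug, Clone, PartialEq)]
enum Plan {
    /// Stands in for `SELECT 1`.
    SelectOne,
    /// Stands in for `SELECT * FROM (input) WHERE predicate`.
    Filter { predicate: Expr, input: Box<Plan> },
}

/// Constant-fold a boolean expression made only of literals and ANDs.
fn fold(expr: &Expr) -> bool {
    match expr {
        Expr::Literal(b) => *b,
        Expr::And(left, right) => fold(left) && fold(right),
    }
}

/// Drop every filter whose predicate folds to `true`.
fn optimize(plan: &Plan) -> Plan {
    match plan {
        Plan::SelectOne => Plan::SelectOne,
        Plan::Filter { predicate, input } => {
            let input = optimize(input);
            if fold(predicate) {
                // The filter keeps every row, so it can be removed.
                input
            } else {
                Plan::Filter {
                    predicate: predicate.clone(),
                    input: Box::new(input),
                }
            }
        }
    }
}

/// Hypothetical stand-in for `SELECT * FROM (SELECT 1) WHERE TRUE AND TRUE`.
fn filtered_select_one() -> Plan {
    Plan::Filter {
        predicate: Expr::And(
            Box::new(Expr::Literal(true)),
            Box::new(Expr::Literal(true)),
        ),
        input: Box::new(Plan::SelectOne),
    }
}
```

In this model, `optimize(&filtered_select_one())` equals `Plan::SelectOne`, while the unoptimized plan does not. That is the difference the test relies on: if `sql` returned an unoptimized plan, the comparison would fail.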
👍
LGTM. Well spotted. Thanks @Dandandan !
I agree -- nice catch @Dandandan. However, there appears to be a failure in one of the tests on this PR.
hm it seems it's slightly more complicated
Keeping this as a draft for now; I think what to do here is still open for discussion. Do we want the dataframe from
Ideally in my mind we would be able to run the optimizations twice (so we could do it with the initial call to

@Dandandan something I have been thinking about recently (as I prepared for my talk next week on DataFusion as well as talking with @NGA-TRAN on my team at Influx) was how similar the

I almost wonder if we should combine the two somehow... I don't have a concrete proposal now, just 🤔
ARROW-11881: [Rust][DataFusion] Fix clippy lint

A linter error has appeared on master somehow:

```
error: unnecessary parentheses around `for` iterator expression
   --> datafusion/src/physical_plan/merge.rs:124:31
    |
124 |                 for part_i in (0..input_partitions) {
    |                               ^^^^^^^^^^^^^^^^^^^^^ help: remove these parentheses
    |
    = note: `-D unused-parens` implied by `-D warnings`
```

Seen on at least #9612 and #9639:

https://github.com/apache/arrow/pull/9612/checks?check_run_id=2042047472
https://github.com/apache/arrow/pull/9639/checks?check_run_id=2042649120

Closes #9642 from alamb/fix_clippy

Authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Signed-off-by: Neville Dipale <nevilledips@gmail.com>
I removed the check and added a test for the projection pushdown verifying that it returns the same plan when optimizing twice. I am not sure what the check was trying to prevent. All the tests (which use sql + collect quite often) seem to pass.
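As an illustration of what "returns the same plan when optimizing twice" means, here is a small self-contained sketch of an idempotent projection pushdown on a toy plan. All names (`Plan`, `push_down`, `optimize`, `example_plan`, the table `t`, the column `a`) are hypothetical and chosen only for this example; this is not the DataFusion rule or the test added in the PR.

```rust
// Toy model only: these types are hypothetical and are NOT DataFusion's.
#[derive(Debug, Clone, PartialEq)]
enum Plan {
    /// Table scan; `projection` is the set of columns to read (None = all).
    Scan {
        table: String,
        projection: Option<Vec<String>>,
    },
    /// Keeps only `columns` from its input.
    Projection {
        columns: Vec<String>,
        input: Box<Plan>,
    },
}

/// Push the columns required by the parent down into the scan.
fn push_down(plan: &Plan, required: Option<&[String]>) -> Plan {
    match plan {
        Plan::Projection { columns, input } => Plan::Projection {
            columns: columns.clone(),
            input: Box::new(push_down(input, Some(columns.as_slice()))),
        },
        Plan::Scan { table, projection } => {
            let narrowed: Option<Vec<String>> = match (projection, required) {
                // Already narrowed: keep only columns that are still required.
                (Some(existing), Some(req)) => Some(
                    existing
                        .iter()
                        .filter(|c| req.contains(*c))
                        .cloned()
                        .collect(),
                ),
                // Not narrowed yet: read exactly the required columns.
                (None, Some(req)) => Some(req.to_vec()),
                // Nothing required by a parent: leave the scan untouched.
                (existing, None) => existing.clone(),
            };
            Plan::Scan {
                table: table.clone(),
                projection: narrowed,
            }
        }
    }
}

fn optimize(plan: &Plan) -> Plan {
    push_down(plan, None)
}

/// Hypothetical stand-in for `SELECT a FROM t`.
fn example_plan() -> Plan {
    Plan::Projection {
        columns: vec!["a".to_string()],
        input: Box::new(Plan::Scan {
            table: "t".to_string(),
            projection: None,
        }),
    }
}
```

With this rule, `optimize(&optimize(&example_plan()))` equals `optimize(&example_plan())`: the second pass finds the scan already narrowed and changes nothing. A rule with that property is safe to run twice, which is what matters if `sql` optimizes the plan and a later call optimizes it again.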
Thanks. Yeah. For example the public function
But this PR now runs the optimizer twice if you use
45f2800 to 42420b3
I think this looks good to me. Thanks again @Dandandan
I also re-ran the DataFusion tests locally on this branch after merging from master to make sure all still looks good. 👍
…frame with optimized plan

I believe we should expect `ExecutionContext::sql` to return an optimized logical plan (with current applying config) rather than a `DataFrame` with an unoptimized plan.

I believe so because
* it is a high level function that should use the current configuration
* it is hard to optimize the logical plan afterwards, as it already returns a dataframe
* many examples, but also DataFusion `repl` in docs use `ExecutionContext::sql`

The TPC-H benchmarks don't use `ExecutionContext::sql` which is I guess why it was missed before.

FYI @alamb @andygrove

Closes apache#9639 from Dandandan/ctx_sql_optimize

Authored-by: Heres, Daniel <danielheres@gmail.com>
Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
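I believe we should expect `ExecutionContext::sql` to return an optimized logical plan (with current applying config) rather than a `DataFrame` with an unoptimized plan.

I believe so because
* it is a high level function that should use the current configuration
* it is hard to optimize the logical plan afterwards, as it already returns a dataframe
* many examples, but also DataFusion `repl` in docs use `ExecutionContext::sql`

The TPC-H benchmarks don't use `ExecutionContext::sql` which is I guess why it was missed before.

FYI @alamb @andygrove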