Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-11879 [Rust][DataFusion] Make ExecutionContext::sql return dataframe with optimized plan #9639

Closed
wants to merge 7 commits into from

Conversation

Dandandan
Copy link
Contributor

@Dandandan Dandandan commented Mar 5, 2021

I believe we should expect ExecutionContext::sql to return an optimized logical plan (with current applying config) rather than a DataFrame with an unoptimized plan.
I believe so because

  • it is a high level function that should use the current configuration
  • it is hard to optimize the logical plan afterwards, as it already returns a dataframe
  • many examples, but also DataFusion repl in docs use ExecutionContext::sql

The TPC-H benchmarks don't use ExecutionContext::sql which is I guess why it was missed before.

FYI @alamb @andygrove

@github-actions
Copy link

github-actions bot commented Mar 5, 2021

@Dandandan Dandandan changed the title ARROW-11879 [Rust][DataFusion] ExecutionContext::sql should optimize plan ARROW-11879 [Rust][DataFusion] Make ExecutionContext::sql return dataframe with optimized plan Mar 5, 2021

let opt_plan1 = ctx.optimize(&plan1)?;

let plan2 = ctx.sql("SELECT * FROM (SELECT 1) WHERE TRUE AND TRUE")?;
Copy link
Contributor Author

@Dandandan Dandandan Mar 5, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before the PR the test fails, as it doesn't optimize the plan (an optimized plan just returns the same as a plan for SELECT 1).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link
Member

@jorgecarleitao jorgecarleitao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Well spotted. Thanks @Dandandan !

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree -- nice catch @Dandandan. There appears to be a test failure in one of the tests on this PR however

@Dandandan
Copy link
Contributor Author

hm it seems it's slightly more complicated

  • DataFrame::collect currently also runs optimize (makes sense, as this is a kind of a last "build" function)
  • But not every user wants to run collect (e.g. in Ballista, the logical plan from the DataFrame is used, it is not directly collected)

@Dandandan Dandandan closed this Mar 5, 2021
@Dandandan Dandandan reopened this Mar 5, 2021
@Dandandan Dandandan marked this pull request as draft March 5, 2021 22:07
@Dandandan
Copy link
Contributor Author

keeping as a draft for now, I think it's more open for discussion maybe what to do here.

Do we want the dataframe from ExecutionContext::sql to return an optimized plan or only on .collecting that dataframe?
Someone still might want to add some filter / aggregate on the dataframe, so maybe it makes sense the optimization pass only works on collect?

@alamb
Copy link
Contributor

alamb commented Mar 5, 2021

Someone still might want to add some filter / aggregate on the dataframe, so maybe it makes sense the optimization pass only works on collect?

Ideally in my mind we would be able to run the optimizations twice (so we could do it with the initial call to sql but then if someone added more grouping or reparitioning or something, we could run the optimizer passes again.

@Dandandan something I have been thinking recently (as I prepared for my talk next week on DataFusion as well as talking with @NGA-TRAN on my team at Influx) was how similar the LogicalPlanBuilder and DataFrame APIs were (and in fact the DataFrameImpl basically calls the functions on LogicalPlanBuilder.

I almost wonder if we should combine the two somehow... I don't have a concrete proposal now just 🤔

nevi-me pushed a commit that referenced this pull request Mar 6, 2021
ARROW-11881: [Rust][DataFusion] Fix clippy lint

A linter error has appeared on master somehow:

```
error: unnecessary parentheses around `for` iterator expression
   --> datafusion/src/physical_plan/merge.rs:124:31
    |
124 |                 for part_i in (0..input_partitions) {
    |                               ^^^^^^^^^^^^^^^^^^^^^ help: remove these parentheses
    |
    = note: `-D unused-parens` implied by `-D warnings`
```

Seen on at least #9612 and #9639:

https://github.com/apache/arrow/pull/9612/checks?check_run_id=2042047472

https://github.com/apache/arrow/pull/9639/checks?check_run_id=2042649120

Closes #9642 from alamb/fix_clippy

Authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Signed-off-by: Neville Dipale <nevilledips@gmail.com>
@Dandandan Dandandan marked this pull request as ready for review March 6, 2021 09:34
@Dandandan
Copy link
Contributor Author

@alamb

Ideally in my mind we would be able to run the optimizations twice (so we could do it with the initial call to sql but then if someone added more grouping or reparitioning or something, we could run the optimizer passes again.

I removed the check / added a test for the projection pushdown that it returns the same plan when optimizing twice and removed the check. I am not sure what the check was trying to prevent? It seems it passes all the tests (which use sql + collect quite often).

@Dandandan something I have been thinking recently (as I prepared for my talk next week on DataFusion as well as talking with @NGA-TRAN on my team at Influx) was how similar the LogicalPlanBuilder and DataFrame APIs were (and in fact the DataFrameImpl basically calls the functions on LogicalPlanBuilder.

I almost wonder if we should combine the two somehow... I don't have a concrete proposal now just thinking

Thanks. Yeah DataFrame and LogicalPlan are pretty similar, not sure whether there is anything to change about it? As the DataFrame is just a higher level layer over the LogicalPlan.
I think maybe some methods in ExecutionContext can be changed / deprecated so users will be nudged to use DataFrames more?

For example the public function create_logical_plan has a comment "This function is intended for internal use and should not be called directly", but both the tpc-h benchmarks and flight-server example do use more operations on the logical physical plan, but probably should use the sql/Dataframe::collect API instead. Snippet from flight_server example:

                let plan = ctx
                    .create_logical_plan(&sql)
                    .and_then(|plan| ctx.optimize(&plan))
                    .and_then(|plan| ctx.create_physical_plan(&plan))
                    .map_err(|e| to_tonic_err(&e))?;

                // execute the query
                let results =
                    collect(plan.clone()).await.map_err(|e| to_tonic_err(&e))?;

But this PR now runs the optimizer twice if you use sql + .collect.
I am not sure what would the expected end result be. I guess one could keep some kind of flag that a certain node of a plan is optimized, and when it is the root it doesn't run a full optimization again, but maybe that's not worth it.

@Dandandan Dandandan closed this Mar 6, 2021
@Dandandan Dandandan reopened this Mar 6, 2021
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this looks good to me. Thanks again @Dandandan

I also re-ran the DataFusion tests locally on this branch after merging from master to make sure all still looks good. 👍


let opt_plan1 = ctx.optimize(&plan1)?;

let plan2 = ctx.sql("SELECT * FROM (SELECT 1) WHERE TRUE AND TRUE")?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@alamb alamb closed this in 29feea0 Mar 14, 2021
GeorgeAp pushed a commit to sirensolutions/arrow that referenced this pull request Jun 7, 2021
ARROW-11881: [Rust][DataFusion] Fix clippy lint

A linter error has appeared on master somehow:

```
error: unnecessary parentheses around `for` iterator expression
   --> datafusion/src/physical_plan/merge.rs:124:31
    |
124 |                 for part_i in (0..input_partitions) {
    |                               ^^^^^^^^^^^^^^^^^^^^^ help: remove these parentheses
    |
    = note: `-D unused-parens` implied by `-D warnings`
```

Seen on at least apache#9612 and apache#9639:

https://github.com/apache/arrow/pull/9612/checks?check_run_id=2042047472

https://github.com/apache/arrow/pull/9639/checks?check_run_id=2042649120

Closes apache#9642 from alamb/fix_clippy

Authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Signed-off-by: Neville Dipale <nevilledips@gmail.com>
GeorgeAp pushed a commit to sirensolutions/arrow that referenced this pull request Jun 7, 2021
…frame with optimized plan

I believe we should expect `ExecutionContext::sql` to return an optimized logical plan (with current applying config) rather than a `DataFrame` with an unoptimized plan.
I believe so because

* it is a high level function that should use the current configuration
* it is hard to optimize the logical plan afterwards, as it already returns a dataframe
* many examples, but also DataFusion `repl`  in docs use `ExecutionContext::sql`

The TPC-H benchmarks don't use `ExecutionContext::sql` which is I guess why it was missed before.

FYI @alamb @andygrove

Closes apache#9639 from Dandandan/ctx_sql_optimize

Authored-by: Heres, Daniel <danielheres@gmail.com>
Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
michalursa pushed a commit to michalursa/arrow that referenced this pull request Jun 13, 2021
ARROW-11881: [Rust][DataFusion] Fix clippy lint

A linter error has appeared on master somehow:

```
error: unnecessary parentheses around `for` iterator expression
   --> datafusion/src/physical_plan/merge.rs:124:31
    |
124 |                 for part_i in (0..input_partitions) {
    |                               ^^^^^^^^^^^^^^^^^^^^^ help: remove these parentheses
    |
    = note: `-D unused-parens` implied by `-D warnings`
```

Seen on at least apache#9612 and apache#9639:

https://github.com/apache/arrow/pull/9612/checks?check_run_id=2042047472

https://github.com/apache/arrow/pull/9639/checks?check_run_id=2042649120

Closes apache#9642 from alamb/fix_clippy

Authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Signed-off-by: Neville Dipale <nevilledips@gmail.com>
michalursa pushed a commit to michalursa/arrow that referenced this pull request Jun 13, 2021
…frame with optimized plan

I believe we should expect `ExecutionContext::sql` to return an optimized logical plan (with current applying config) rather than a `DataFrame` with an unoptimized plan.
I believe so because

* it is a high level function that should use the current configuration
* it is hard to optimize the logical plan afterwards, as it already returns a dataframe
* many examples, but also DataFusion `repl`  in docs use `ExecutionContext::sql`

The TPC-H benchmarks don't use `ExecutionContext::sql` which is I guess why it was missed before.

FYI @alamb @andygrove

Closes apache#9639 from Dandandan/ctx_sql_optimize

Authored-by: Heres, Daniel <danielheres@gmail.com>
Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants