Skip to content

Conversation

@Dandandan
Copy link
Contributor

Which issue does this PR close?

Closes #264

Rationale for this change

SQL planning to logical plan had a wrong rule: whenever the schema fields names were equal, it applies an "optimization" to remove the projection.
I found this while working on the SELECT DISTINCT.
If we want to have an optimization pass like this, it should look at whether the expressions are equal instead and I think it would be better to include as a real optimization rule instead of while building the logical plan from the query.

What changes are included in this PR?

Skip the "optimization".

Are there any user-facing changes?

No.

Copy link
Contributor

@returnString returnString left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch! Though looks like removing the optimisation breaks a crapload of logical plan text assertions in the tests that now contain an extra top-level projection node :(

Could we retain the optimisation but only apply it when expr consists entirely of simple column refs? Not sure if that'd patch up all the tests or not.

Edit: or we could just modify the test assertions, I suppose, if we want to keep the initial planning process simple and do this kind of optimisation later?

@Dandandan
Copy link
Contributor Author

Nice catch! Though looks like removing the optimisation breaks a crapload of logical plan text assertions in the tests that now contain an extra top-level projection node :(

Could we retain the optimisation but only apply it when expr consists entirely of simple column refs? Not sure if that'd patch up all the tests or not.

Edit: or we could just modify the test assertions, I suppose, if we want to keep the initial planning process simple and do this kind of optimisation later?

Thanks @returnString !
I will have a look later. I think it would be best to update the tests instead for now.

I think the optimization also doesn't improve performance that much, as the projection with just the fields should be almost "free" in DF. I will add an issue to add the optimization as a rule.

@houqp
Copy link
Member

houqp commented May 5, 2021

+1 for moving this to a separate optimization rule instead of doing it within the SQL planner.

@codecov-commenter
Copy link

Codecov Report

Merging #268 (89463ea) into master (ea451c9) will decrease coverage by 0.00%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #268      +/-   ##
==========================================
- Coverage   76.21%   76.20%   -0.01%     
==========================================
  Files         140      140              
  Lines       23553    23556       +3     
==========================================
+ Hits        17950    17952       +2     
- Misses       5603     5604       +1     
Impacted Files Coverage Δ
datafusion/src/execution/dataframe_impl.rs 88.88% <100.00%> (ø)
datafusion/src/sql/planner.rs 83.62% <100.00%> (-0.09%) ⬇️
datafusion/tests/sql.rs 99.89% <100.00%> (+<0.01%) ⬆️
datafusion/src/physical_plan/coalesce_batches.rs 83.18% <0.00%> (-1.77%) ⬇️
datafusion/src/scalar.rs 54.61% <0.00%> (+0.24%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ea451c9...89463ea. Read the comment docs.

@Dandandan
Copy link
Contributor Author

Merging this - seems a uncontroversial change / correctness improvement. Adding an issue for bringing back an optimization to get rid of unnecessary projections.

@Dandandan Dandandan merged commit 24e248d into apache:master May 6, 2021
@alamb
Copy link
Contributor

alamb commented May 6, 2021

Thank you for this @Dandandan 👍

@houqp houqp added bug Something isn't working datafusion labels Jul 29, 2021
unkloud pushed a commit to unkloud/datafusion that referenced this pull request Mar 23, 2025
…#268)

* fix: incorrect result on Comet multiple column distinct count

* Update core/src/execution/datafusion/planner.rs

Co-authored-by: Andy Grove <andygrove73@gmail.com>

---------

Co-authored-by: Andy Grove <andygrove73@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Wrong result from projection

5 participants