Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add assertion for invariant in create_physical_expression and fix ViewTable projection #3242

Merged
merged 6 commits into from
Aug 25, 2022

Conversation

andygrove
Copy link
Member

@andygrove andygrove commented Aug 23, 2022

Which issue does this PR close?

Closes #3240

Rationale for this change

create_physical_expression accepts an Arrow schema and a DataFusion schema. They should have the same number of fields but there was no check.

What changes are included in this PR?

  • Add an assertion that both schemas have the same length
  • Fix regressions caused by this change
  • ViewTable no longer ignores projections
  • Remove some unwraps from simplify_expressions

Are there any user-facing changes?

Not sure

@github-actions github-actions bot added core Core datafusion crate optimizer Optimizer rules physical-expr Physical Expressions labels Aug 23, 2022
@andygrove
Copy link
Member Author

@matthewmturner @DaltonModlin @kmitchener Since this affects views, could you please review if you have time?

@codecov-commenter
Copy link

codecov-commenter commented Aug 23, 2022

Codecov Report

Merging #3242 (53d6e52) into master (bdc56ca) will increase coverage by 0.00%.
The diff coverage is 95.58%.

❗ Current head 53d6e52 differs from pull request most recent head 6719fc1. Consider uploading reports for the commit 6719fc1 to get more accurate results

@@           Coverage Diff           @@
##           master    #3242   +/-   ##
=======================================
  Coverage   85.84%   85.85%           
=======================================
  Files         293      293           
  Lines       53246    53310   +64     
=======================================
+ Hits        45710    45770   +60     
- Misses       7536     7540    +4     
Impacted Files Coverage Δ
datafusion/core/src/datasource/view.rs 86.54% <86.95%> (-0.03%) ⬇️
datafusion/core/tests/sql/expr.rs 99.85% <100.00%> (+<0.01%) ⬆️
datafusion/optimizer/src/expr_simplifier.rs 100.00% <100.00%> (ø)
datafusion/optimizer/src/simplify_expressions.rs 84.05% <100.00%> (+0.05%) ⬆️
datafusion/physical-expr/src/planner.rs 95.31% <100.00%> (+0.03%) ⬆️
datafusion/sql/src/planner.rs 80.59% <100.00%> (+0.07%) ⬆️
datafusion/expr/src/logical_plan/plan.rs 78.38% <0.00%> (-0.35%) ⬇️
datafusion/expr/src/window_frame.rs 93.27% <0.00%> (+0.84%) ⬆️

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@kmitchener
Copy link
Contributor

@matthewmturner @DaltonModlin @kmitchener Since this affects views, could you please review if you have time?

LGTM. If you want to reproduce, this triggered the error prior to the fix:

create view v as select 1 as a, 2 as b, 3 as c;
select * from (select b from v); -- returned column a prior to fix

@andygrove andygrove added the bug Something isn't working label Aug 23, 2022
@kmitchener
Copy link
Contributor

Given the above sql, it results in this plan. Isn't this 1 too many projections?

create view v as select 1 as a, 2 as b, 3 as c;
explain analyze select * from (select b from v);
+-------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type         | plan                                                                                                                                                                         |
+-------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Plan with Metrics | CoalescePartitionsExec, metrics=[output_rows=1, elapsed_compute=72.1µs, spill_count=0, spilled_bytes=0, mem_used=0]                                                          |
|                   |   ProjectionExec: expr=[b@0 as b], metrics=[output_rows=1, elapsed_compute=3.311µs, spill_count=0, spilled_bytes=0, mem_used=0]                                              |
|                   |     ProjectionExec: expr=[b@0 as b], metrics=[output_rows=1, elapsed_compute=5.511µs, spill_count=0, spilled_bytes=0, mem_used=0]                                            |
|                   |       ProjectionExec: expr=[2 as b], metrics=[output_rows=1, elapsed_compute=53.011µs, spill_count=0, spilled_bytes=0, mem_used=0]                                           |
|                   |         RepartitionExec: partitioning=RoundRobinBatch(12), metrics=[repart_time{inputPartition=0}=1ns, fetch_time{inputPartition=0}=27µs, send_time{inputPartition=0}=7.6µs] |
|                   |           EmptyExec: produce_one_row=true, metrics=[]                                                                                                                        |
|                   |                                                                                                                                                                              |
+-------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

@andygrove
Copy link
Member Author

Given the above sql, it results in this plan. Isn't this 1 too many projections?

create view v as select 1 as a, 2 as b, 3 as c;
explain analyze select * from (select b from v);
+-------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type         | plan                                                                                                                                                                         |
+-------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Plan with Metrics | CoalescePartitionsExec, metrics=[output_rows=1, elapsed_compute=72.1µs, spill_count=0, spilled_bytes=0, mem_used=0]                                                          |
|                   |   ProjectionExec: expr=[b@0 as b], metrics=[output_rows=1, elapsed_compute=3.311µs, spill_count=0, spilled_bytes=0, mem_used=0]                                              |
|                   |     ProjectionExec: expr=[b@0 as b], metrics=[output_rows=1, elapsed_compute=5.511µs, spill_count=0, spilled_bytes=0, mem_used=0]                                            |
|                   |       ProjectionExec: expr=[2 as b], metrics=[output_rows=1, elapsed_compute=53.011µs, spill_count=0, spilled_bytes=0, mem_used=0]                                           |
|                   |         RepartitionExec: partitioning=RoundRobinBatch(12), metrics=[repart_time{inputPartition=0}=1ns, fetch_time{inputPartition=0}=27µs, send_time{inputPartition=0}=7.6µs] |
|                   |           EmptyExec: produce_one_row=true, metrics=[]                                                                                                                        |
|                   |                                                                                                                                                                              |
+-------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

These could easily be optimized out if we could run the optimizer here, but we can't. This also means we lose projection push-down when selecting from a view. This does at least fix the correctness issue, even if this is not efficient so it might be best to file a specific issue for this. I'll take another look at this tomorrow.

@andygrove
Copy link
Member Author

These could easily be optimized out if we could run the optimizer here, but we can't. This also means we lose projection push-down when selecting from a view. This does at least fix the correctness issue, even if this is not efficient so it might be best to file a specific issue for this. I'll take another look at this tomorrow.

I was wrong about this. We are optimizing. I will add a check to avoid adding a redundant projection.

@liukun4515
Copy link
Contributor

I will review it tomorrow, it's too later for me today. @andygrove

@andygrove
Copy link
Member Author

andygrove commented Aug 24, 2022

I filed #3248 for removing the redundant projection. The physical plan being returned from ViewTable::scan does not have it, so I think this is a bug that is unrelated to this PR.

ViewScan::scan returns this:

ProjectionExec: expr=[b@0 as b]
  ProjectionExec: expr=[2 as b]
    EmptyExec: produce_one_row=true

EXPLAIN ANALYZE shows this:

+---------------+---------------------------------------+
| plan_type     | plan                                  |
+---------------+---------------------------------------+
| logical_plan  | Projection: #v.b                      |
|               |   TableScan: v projection=[b]         |
| physical_plan | ProjectionExec: expr=[b@0 as b]       |
|               |   ProjectionExec: expr=[b@0 as b]     |
|               |     ProjectionExec: expr=[2 as b]     |
|               |       EmptyExec: produce_one_row=true |
|               |                                       |
+---------------+---------------------------------------+

@andygrove andygrove requested a review from yjshen August 24, 2022 18:30
Copy link
Contributor

@jdye64 jdye64 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes look good. Like the idea of using the Arrow schema for determining the expression data types.

Copy link
Member

@yjshen yjshen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@andygrove andygrove merged commit 92110dd into apache:master Aug 25, 2022
@ursabot
Copy link

ursabot commented Aug 25, 2022

Benchmark runs are scheduled for baseline = 82da46d and contender = 92110dd. 92110dd is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@matthewmturner
Copy link
Contributor

@andygrove sorry i didnt get to this sooner, looks good.

@andygrove andygrove deleted the create-physical-expr-invariant branch January 27, 2023 18:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working core Core datafusion crate optimizer Optimizer rules physical-expr Physical Expressions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

View TableProvider ignores projections, resulting in invalid plans
8 participants