feat: add optimize rule rewrite_disjunctive_predicate #2858

Merged
merged 3 commits into from
Jul 26, 2022

Member

@xudong963 xudong963 commented Jul 9, 2022

Closes #217

Query plan for TPC-H 19; please focus on the Filter nodes.

=== Logical plan ===
Projection: #SUM(lineitem.l_extendedprice * Int64(1) - lineitem.l_discount) AS revenue
  Aggregate: groupBy=[[]], aggr=[[SUM(#lineitem.l_extendedprice * Int64(1) - #lineitem.l_discount)]]
    Filter: #part.p_partkey = #lineitem.l_partkey AND #part.p_brand = Utf8("Brand#12") AND #part.p_container IN ([Utf8("SM CASE"), Utf8("SM BOX"), Utf8("SM PACK"), Utf8("SM PKG")]) AND #lineitem.l_quantity >= Int64(1) AND #lineitem.l_quantity <= Int64(1) + Int64(10) AND #part.p_size BETWEEN Int64(1) AND Int64(5) AND #lineitem.l_shipmode IN ([Utf8("AIR"), Utf8("AIR REG")]) AND #lineitem.l_shipinstruct = Utf8("DELIVER IN PERSON") OR #part.p_partkey = #lineitem.l_partkey AND #part.p_brand = Utf8("Brand#23") AND #part.p_container IN ([Utf8("MED BAG"), Utf8("MED BOX"), Utf8("MED PKG"), Utf8("MED PACK")]) AND #lineitem.l_quantity >= Int64(10) AND #lineitem.l_quantity <= Int64(10) + Int64(10) AND #part.p_size BETWEEN Int64(1) AND Int64(10) AND #lineitem.l_shipmode IN ([Utf8("AIR"), Utf8("AIR REG")]) AND #lineitem.l_shipinstruct = Utf8("DELIVER IN PERSON") OR #part.p_partkey = #lineitem.l_partkey AND #part.p_brand = Utf8("Brand#34") AND #part.p_container IN ([Utf8("LG CASE"), Utf8("LG BOX"), Utf8("LG PACK"), Utf8("LG PKG")]) AND #lineitem.l_quantity >= Int64(20) AND #lineitem.l_quantity <= Int64(20) + Int64(10) AND #part.p_size BETWEEN Int64(1) AND Int64(15) AND #lineitem.l_shipmode IN ([Utf8("AIR"), Utf8("AIR REG")]) AND #lineitem.l_shipinstruct = Utf8("DELIVER IN PERSON")
      CrossJoin:
        TableScan: lineitem
        TableScan: part

=== Optimized logical plan ===
Projection: #SUM(lineitem.l_extendedprice * Int64(1) - lineitem.l_discount) AS revenue
  Aggregate: groupBy=[[]], aggr=[[SUM(#lineitem.l_extendedprice * Int64(1) - #lineitem.l_discount)]]
    Projection: #part.p_partkey = #lineitem.l_partkey AS BinaryExpr-=Column-lineitem.l_partkeyColumn-part.p_partkey, #lineitem.l_shipinstruct = Utf8("DELIVER IN PERSON") AS BinaryExpr-=LiteralDELIVER IN PERSONColumn-lineitem.l_shipinstruct, #lineitem.l_shipmode IN ([Utf8("AIR"), Utf8("AIR REG")]) AS InList-falseLiteralAIR REGLiteralAIRColumn-lineitem.l_shipmode, #lineitem.l_quantity, #lineitem.l_extendedprice, #lineitem.l_discount, #part.p_brand, #part.p_size, #part.p_container
      Filter: #part.p_partkey = #lineitem.l_partkey AND #part.p_brand = Utf8("Brand#12") AND #part.p_container IN ([Utf8("SM CASE"), Utf8("SM BOX"), Utf8("SM PACK"), Utf8("SM PKG")]) AND #lineitem.l_quantity >= Int64(1) AND #lineitem.l_quantity <= Int64(11) AND #part.p_size BETWEEN Int64(1) AND Int64(5) OR #part.p_brand = Utf8("Brand#23") AND #part.p_container IN ([Utf8("MED BAG"), Utf8("MED BOX"), Utf8("MED PKG"), Utf8("MED PACK")]) AND #lineitem.l_quantity >= Int64(10) AND #lineitem.l_quantity <= Int64(20) AND #part.p_size BETWEEN Int64(1) AND Int64(10) OR #part.p_brand = Utf8("Brand#34") AND #part.p_container IN ([Utf8("LG CASE"), Utf8("LG BOX"), Utf8("LG PACK"), Utf8("LG PKG")]) AND #lineitem.l_quantity >= Int64(20) AND #lineitem.l_quantity <= Int64(30) AND #part.p_size BETWEEN Int64(1) AND Int64(15)
        CrossJoin:
          Filter: #lineitem.l_shipmode IN ([Utf8("AIR"), Utf8("AIR REG")]) AND #lineitem.l_shipinstruct = Utf8("DELIVER IN PERSON")
            TableScan: lineitem projection=[l_partkey, l_quantity, l_extendedprice, l_discount, l_shipinstruct, l_shipmode], partial_filters=[#lineitem.l_shipmode IN ([Utf8("AIR"), Utf8("AIR REG")]), #lineitem.l_shipinstruct = Utf8("DELIVER IN PERSON")]
          TableScan: part projection=[p_partkey, p_brand, p_size, p_container]

We need to migrate the cross join -> inner join optimization from the planner to the optimizer, so that TPC-H 19 can be further optimized into an inner join using the predicate extracted by rewrite_disjunctive_predicate. See #2859.
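
The filter rewrite visible in the plans above relies on the identity (A AND B) OR (A AND C) = A AND (B OR C). As a small sanity check, the sketch below verifies that identity exhaustively under SQL-style three-valued logic, where NULL is modelled as `None`. The names `Tri`, `and3`, `or3`, and `factoring_is_equivalent` are made up for this illustration and are not part of DataFusion.

```rust
/// SQL-style three-valued truth value: `None` stands for NULL (unknown).
type Tri = Option<bool>;

/// Three-valued AND: FALSE dominates, then NULL, then TRUE.
fn and3(a: Tri, b: Tri) -> Tri {
    match (a, b) {
        (Some(false), _) | (_, Some(false)) => Some(false),
        (Some(true), Some(true)) => Some(true),
        _ => None,
    }
}

/// Three-valued OR: TRUE dominates, then NULL, then FALSE.
fn or3(a: Tri, b: Tri) -> Tri {
    match (a, b) {
        (Some(true), _) | (_, Some(true)) => Some(true),
        (Some(false), Some(false)) => Some(false),
        _ => None,
    }
}

/// Returns true if `(a AND b) OR (a AND c)` equals `a AND (b OR c)`
/// for every combination of TRUE, FALSE, and NULL.
fn factoring_is_equivalent() -> bool {
    let values = [Some(true), Some(false), None];
    for &a in &values {
        for &b in &values {
            for &c in &values {
                let original = or3(and3(a, b), and3(a, c));
                let factored = and3(a, or3(b, c));
                if original != factored {
                    return false;
                }
            }
        }
    }
    true
}
```

Calling `factoring_is_equivalent()` walks all 27 input combinations. The check covers only the two-arm case; the three-arm filter in the plan follows by applying the same identity repeatedly.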

@xudong963 xudong963 requested review from alamb and andygrove July 9, 2022 09:55
@github-actions github-actions bot added core Core datafusion crate optimizer Optimizer rules labels Jul 9, 2022
@xudong963 xudong963 added enhancement New feature or request and removed core Core datafusion crate labels Jul 9, 2022
@github-actions github-actions bot added the core Core datafusion crate label Jul 9, 2022

codecov-commenter commented Jul 9, 2022

Codecov Report

Merging #2858 (688571d) into master (7b0f2f8) will increase coverage by 0.06%.
The diff coverage is 97.04%.

@@            Coverage Diff             @@
##           master    #2858      +/-   ##
==========================================
+ Coverage   85.62%   85.69%   +0.06%     
==========================================
  Files         279      280       +1     
  Lines       50965    51245     +280     
==========================================
+ Hits        43641    43916     +275     
- Misses       7324     7329       +5     
Impacted Files                                              Coverage Δ
...ion/optimizer/src/rewrite_disjunctive_predicate.rs       96.75% <96.75%> (ø)
datafusion/core/src/execution/context.rs                    78.05% <100.00%> (+0.02%) ⬆️
datafusion/core/tests/sql/predicates.rs                     100.00% <100.00%> (ø)
datafusion/optimizer/src/reduce_outer_join.rs               98.79% <0.00%> (-0.61%) ⬇️
datafusion/common/src/scalar.rs                             84.73% <0.00%> (-0.14%) ⬇️
datafusion/expr/src/logical_plan/plan.rs                    77.77% <0.00%> (+0.17%) ⬆️
datafusion/expr/src/expr.rs                                 85.39% <0.00%> (+0.63%) ⬆️
datafusion/expr/src/window_frame.rs                         93.27% <0.00%> (+0.84%) ⬆️
datafusion/optimizer/src/simplify_expressions.rs            84.02% <0.00%> (+2.02%) ⬆️

Contributor

@alamb alamb left a comment

This is really cool @xudong963 - thank you

Contributor

@alamb alamb left a comment

I think this PR also needs tests --

Perhaps some unit tests in datafusion/optimizer/src/rewrite_disjunctive_predicate.rs and then an explain test for q19 showing the inner join?

I think the very nice code structure in datafusion/optimizer/src/rewrite_disjunctive_predicate.rs would make it quite easy to write unit tests.

Member Author

xudong963 commented Jul 11, 2022

> Perhaps some unit tests in datafusion/optimizer/src/rewrite_disjunctive_predicate.rs and then an explain test for q19 showing the inner join?

Currently, q19 can't be converted to an inner join, because the cross join -> inner join logic lives in the planner, not in the optimizer. See #2859.

> I think the very nice code structure in datafusion/optimizer/src/rewrite_disjunctive_predicate.rs would make it quite easy to write unit tests.

Yes, added.

Contributor

@alamb alamb left a comment

Thank you @xudong963 -- I think this PR is correct and could be merged as is. Very nice 👌

I left several suggestions on how to improve the code and I think the testing coverage could also be improved, but that could be done in a follow on PR, and I would be happy to help do so too

fn delete_duplicate_predicates(or_predicates: &[Predicate]) -> Predicate {
    let mut shortest_exprs: Vec<Predicate> = vec![];
    let mut shortest_exprs_len = 0;
    // choose the shortest AND predicate
Contributor

I don't understand the need for checking the shortest AND predicate -- is there some test case that would show why picking this is important?

Or maybe another question is "why not check all elements?" Perhaps by keeping a set of expressions that were common (and checking each element in the set for its inclusion in all the arguments)

Member Author

> I don't understand the need for checking the shortest AND predicate

The shortest AND predicate could be the common expression to be extracted if each of its elements appears in all OR predicates.

> why not check all elements

We don't need to check all elements; only the shortest could be the common expression.

> keeping a set of expressions

Yes, shortest_exprs should be a HashSet, but to avoid implementing Eq and Hash (which would spread deeply through the types), I use a Vec instead.
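
To illustrate the Vec-instead-of-HashSet trade-off mentioned above, here is a minimal sketch using a hypothetical `ToyPredicate` type (not DataFusion's `Predicate`). The toy type carries an `f64`, which is one common reason a type can derive `PartialEq` but not `Eq` or `Hash`; whether that is the actual obstacle in this PR is not stated in the thread. `insert_unique` is likewise a made-up helper name.

```rust
/// Hypothetical toy predicate; not DataFusion's `Predicate` type.
/// The `f64` payload means `Eq` and `Hash` cannot be derived,
/// but `PartialEq` can.
#[derive(Debug, Clone, PartialEq)]
enum ToyPredicate {
    Column(String),
    Literal(f64),
    Gt(Box<ToyPredicate>, Box<ToyPredicate>),
}

/// Inserts `p` only if an equal predicate is not already present.
/// Returns true if the predicate was inserted.
/// `Vec::contains` needs only `PartialEq`, at the cost of a linear scan.
fn insert_unique(set: &mut Vec<ToyPredicate>, p: ToyPredicate) -> bool {
    if set.contains(&p) {
        false
    } else {
        set.push(p);
        true
    }
}
```

The linear scan is usually acceptable here because the number of conjuncts in a single filter tends to be small.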

Contributor

> We don't need to check all elements, only the shortest could be the common expression.

I don't understand why only the shortest could be the common expression. I am probably missing something.

For example, in the following predicate, the common sub-expression (p_partkey = l_partkey OR p_partkey > 5) is not the shortest:

    (
        (p_partkey = l_partkey OR p_partkey > 5)
        and p_brand = 'Brand#12'
    )
    or
    (
        (p_partkey = l_partkey OR p_partkey > 5)
        and p_size between 1 and 10
    )
    or
    (
        (p_partkey = l_partkey OR p_partkey > 5)
        and p_size between 1 and 15
    )

and yet it could be factored out:

    (p_partkey = l_partkey OR p_partkey > 5)
    and
    (
        p_brand = 'Brand#12'
        or p_size between 1 and 10
        or p_size between 1 and 15
    )

🤔

Contributor

That being said, I don't think it is incorrect to pick the shortest predicate, but I do think it may miss potential rewrites. We can always improve it in the future perhaps

Member Author

> For example, in the following predicate, the common sub expression(p_partkey = l_partkey OR p_partkey > 5) is not the shortest

The common sub-expression is taken from the shortest AND predicate, but it is not necessarily equal to the whole shortest AND predicate. The two are equal only when every element of the shortest AND predicate appears in all the OR arguments.
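
The idea discussed in this thread can be sketched as follows: treat each OR argument as a list of conjuncts, take the shortest list as the only source of candidates, and keep the candidates that occur in every argument. This is an illustrative sketch, not the code from the PR; the `Pred` type and the functions `conjuncts`, `and_of`, and `factor_common` are hypothetical names.

```rust
/// Hypothetical, simplified predicate tree (not DataFusion's `Predicate`).
#[derive(Debug, Clone, PartialEq)]
enum Pred {
    /// Opaque leaf such as `p_brand = 'Brand#12'`.
    Leaf(String),
    And(Vec<Pred>),
    Or(Vec<Pred>),
}

/// View a predicate as a list of conjuncts.
fn conjuncts(p: &Pred) -> Vec<Pred> {
    match p {
        Pred::And(items) => items.clone(),
        other => vec![other.clone()],
    }
}

/// Build an AND, collapsing a single-element list to that element.
fn and_of(mut items: Vec<Pred>) -> Pred {
    if items.len() == 1 {
        items.pop().unwrap()
    } else {
        Pred::And(items)
    }
}

/// Factor the conjuncts shared by every OR argument out of the disjunction.
fn factor_common(or_args: &[Pred]) -> Pred {
    let args: Vec<Vec<Pred>> = or_args.iter().map(conjuncts).collect();
    // A conjunct common to all arguments must appear in the shortest one,
    // so the shortest argument is the only place to look for candidates.
    let shortest = match args.iter().min_by_key(|a| a.len()) {
        Some(a) => a,
        None => return Pred::Or(vec![]),
    };
    let common: Vec<Pred> = shortest
        .iter()
        .filter(|c| args.iter().all(|a| a.contains(*c)))
        .cloned()
        .collect();
    if common.is_empty() {
        return Pred::Or(or_args.to_vec());
    }
    let mut rest = Vec::new();
    for arg in &args {
        let left: Vec<Pred> = arg
            .iter()
            .filter(|c| !common.contains(*c))
            .cloned()
            .collect();
        if left.is_empty() {
            // This argument is exactly the common part, so by absorption
            // the whole disjunction reduces to the common part.
            return and_of(common);
        }
        rest.push(and_of(left));
    }
    let mut result = common;
    result.push(Pred::Or(rest));
    Pred::And(result)
}
```

Applied to the example from the review, where every argument has two conjuncts, the nested `(p_partkey = l_partkey OR p_partkey > 5)` is treated as one opaque conjunct of the shortest argument and is factored out, while `p_brand = 'Brand#12'` stays behind because it does not occur in the other arguments.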

Member Author

xudong963 commented Jul 13, 2022

Thanks @alamb, I learned a lot from your comments, and I'll address them in this PR later. We can keep it open.

Contributor

alamb commented Jul 15, 2022

Marking as draft to signify more work is planned on this PR

@alamb alamb marked this pull request as draft July 15, 2022 13:37
@xudong963 xudong963 marked this pull request as ready for review July 22, 2022 13:22
Member Author

Sorry for the delay in updating; I've been busy with work recently.

Contributor

@alamb alamb left a comment

LGTM -- thank you @xudong963

}))
}
_ => {
let expr = plan.expressions();
Contributor

👍 very nice

}
);
let rewritten_predicate = rewrite_predicate(predicate);
assert_eq!(
Contributor

👍

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
@xudong963 xudong963 requested a review from alamb July 26, 2022 13:44
Contributor

alamb commented Jul 26, 2022

🚀 -- thanks again @xudong963 !

@alamb alamb merged commit 4005076 into apache:master Jul 26, 2022

ursabot commented Jul 26, 2022

Benchmark runs are scheduled for baseline = 0f19990 and contender = 4005076. 4005076 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Labels
core Core datafusion crate enhancement New feature or request optimizer Optimizer rules
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Optimizer: Predicate Rewrite pass for TPCH Q19
4 participants