Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimizer: Predicate Rewrite pass for TPCH Q19 #217

Closed
Tracked by #2255
alamb opened this issue Apr 28, 2021 · 3 comments · Fixed by #2858
Closed
Tracked by #2255

Optimizer: Predicate Rewrite pass for TPCH Q19 #217

alamb opened this issue Apr 28, 2021 · 3 comments · Fixed by #2858
Assignees
Labels
datafusion Changes in the datafusion crate enhancement New feature or request performance

Comments

@alamb
Copy link
Contributor

alamb commented Apr 28, 2021

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

As @Dandandan and I were discussing on #78 (comment)

The good news is that after #78, DataFusion can run TPCH Q19 🎉 . The downside is that Q19 currently has abysmal performance(basically it will never finish) because DataFusion plans it as a CROSS JOIN followed by filter. A more optimal plan would recognize a join predicate (so INNER JOIN) can be used, as well as several "single column predicates" and "single table predicates" which could be pushed down to the scans (aka applied prior to the joins)

For reference, TPCH Q19 looks like this

select
sum(l_extendedprice * (1 - l_discount) ) as revenue
from
lineitem,
part
where
(
p_partkey = l_partkey
and p_brand = ‘[BRAND1]’
and p_container in ( ‘SM CASE’, ‘SM BOX’, ‘SM PACK’, ‘SM PKG’)
and l_quantity >= [QUANTITY1] and l_quantity <= [QUANTITY1] + 10
and p_size between 1 and 5
and l_shipmode in (‘AIR’, ‘AIR REG’)
and l_shipinstruct = ‘DELIVER IN PERSON’
)
or
(
p_partkey = l_partkey
and p_brand = ‘[BRAND2]’
and p_container in (‘MED BAG’, ‘MED BOX’, ‘MED PKG’, ‘MED PACK’)
and l_quantity >= [QUANTITY2] and l_quantity <= [QUANTITY2] + 10
and p_size between 1 and 10
and l_shipmode in (‘AIR’, ‘AIR REG’)
and l_shipinstruct = ‘DELIVER IN PERSON’
)
or
(
p_partkey = l_partkey
and p_brand = ‘[BRAND3]’
and p_container in ( ‘LG CASE’, ‘LG BOX’, ‘LG PACK’, ‘LG PKG’)
and l_quantity >= [QUANTITY3] and l_quantity <= [QUANTITY3] + 10
and p_size between 1 and 15
and l_shipmode in (‘AIR’, ‘AIR REG’)
and l_shipinstruct = ‘DELIVER IN PERSON’
);

Note that while the predicate is one big OR, it can be rewritten like:

where
  p_partkey = l_partkey
  and l_shipmode in (‘AIR’, ‘AIR REG’)
  and l_shipinstruct = ‘DELIVER IN PERSON’
  and (
    (
      and p_brand = ‘[BRAND1]’
      and p_container in ( ‘SM CASE’, ‘SM BOX’, ‘SM PACK’, ‘SM PKG’)
      and l_quantity >= [QUANTITY1] and l_quantity <= [QUANTITY1] + 10
      and p_size between 1 and 5
    )
    or
    (
      and p_brand = ‘[BRAND2]’
      and p_container in (‘MED BAG’, ‘MED BOX’, ‘MED PKG’, ‘MED PACK’)
      and l_quantity >= [QUANTITY2] and l_quantity <= [QUANTITY2] + 10
      and p_size between 1 and 10
    )
    or
    (
      and p_brand = ‘[BRAND3]’
      and p_container in ( ‘LG CASE’, ‘LG BOX’, ‘LG PACK’, ‘LG PKG’)
      and l_quantity >= [QUANTITY3] and l_quantity <= [QUANTITY3] + 10
      and p_size between 1 and 15
    )
)

in which case the input cardinality into the join would be much lower.

Note there are further rewrites possible (aka introducing additional single table predicates like p_size between 1 and 15 that can filter the input to the joins even further (although the final filter is also still needed).

Describe the solution you'd like
The "classic" way to implement this is as a "predicate rewrite" pass that rearranges predicates for further downstream operations

The goal is basically to get the predicate into a form of

good_predicate1 AND good_predicate2 AND ...

Where good_predicate means the predicate has special support in the execution engine.

Since OR is not typically handled specially, rewrites to AND are helpful. Some common rewrites:

  • Rewrite 1 (needed for TPCH 19) : (p and q1) OR (p and q2) OR (p and ..) ==> p AND (q1 or q2)
  • Rewrite 2 (not in Q19, but useful elsewhere) : (col1 = A) OR (col1 = B) OR (col1 = C) ==> col1 IN (A, B, C)

Which then the execution engine can treat like a single column predicate (push down to scan) and build a hash table for (A, B, C) and do fast filtering.

This kind of rewrite can get all sorts of fancy and sometimes needs a cost model (to estimate, for example, if redundantly applying a filter during scan and after a join is worthwhile). It probably makes sense to implement a basic rewrite pass with the single table predicate extraction first, and then make it fancier from there

Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.

Additional context
Add any other context or screenshots about the feature request here.

@alamb alamb added the enhancement New feature or request label Apr 28, 2021
@alamb alamb added the datafusion Changes in the datafusion crate label Apr 28, 2021
@Dandandan
Copy link
Contributor

Thanks @alamb for the write-up!

@alamb
Copy link
Contributor Author

alamb commented Jun 30, 2022

@xudong963 noted that this optimization was added to datafuselabs/databend#6301 👍

@xudong963
Copy link
Member

@xudong963 noted that this optimization was added to datafuselabs/databend#6301 👍

Yes, I'll add it for df on the weekend

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
datafusion Changes in the datafusion crate enhancement New feature or request performance
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants