You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
The good news is that after #78, DataFusion can run TPCH Q19 🎉 . The downside is that Q19 currently has abysmal performance(basically it will never finish) because DataFusion plans it as a CROSS JOIN followed by filter. A more optimal plan would recognize a join predicate (so INNER JOIN) can be used, as well as several "single column predicates" and "single table predicates" which could be pushed down to the scans (aka applied prior to the joins)
For reference, TPCH Q19 looks like this
selectsum(l_extendedprice * (1- l_discount) ) as revenue
from
lineitem,
part
where
(
p_partkey = l_partkey
and p_brand = ‘[BRAND1]’
and p_container in ( ‘SM CASE’, ‘SM BOX’, ‘SM PACK’, ‘SM PKG’)
and l_quantity >= [QUANTITY1] and l_quantity <= [QUANTITY1] +10and p_size between 1and5and l_shipmode in (‘AIR’, ‘AIR REG’)
and l_shipinstruct = ‘DELIVER IN PERSON’
)
or
(
p_partkey = l_partkey
and p_brand = ‘[BRAND2]’
and p_container in (‘MED BAG’, ‘MED BOX’, ‘MED PKG’, ‘MED PACK’)
and l_quantity >= [QUANTITY2] and l_quantity <= [QUANTITY2] +10and p_size between 1and10and l_shipmode in (‘AIR’, ‘AIR REG’)
and l_shipinstruct = ‘DELIVER IN PERSON’
)
or
(
p_partkey = l_partkey
and p_brand = ‘[BRAND3]’
and p_container in ( ‘LG CASE’, ‘LG BOX’, ‘LG PACK’, ‘LG PKG’)
and l_quantity >= [QUANTITY3] and l_quantity <= [QUANTITY3] +10and p_size between 1and15and l_shipmode in (‘AIR’, ‘AIR REG’)
and l_shipinstruct = ‘DELIVER IN PERSON’
);
Note that while the predicate is one big OR, it can be rewritten like:
where
p_partkey = l_partkey
and l_shipmode in (‘AIR’, ‘AIR REG’)
and l_shipinstruct = ‘DELIVER IN PERSON’
and (
(
and p_brand = ‘[BRAND1]’
and p_container in ( ‘SM CASE’, ‘SM BOX’, ‘SM PACK’, ‘SM PKG’)
and l_quantity >= [QUANTITY1] and l_quantity <= [QUANTITY1] + 10
and p_size between 1 and 5
)
or
(
and p_brand = ‘[BRAND2]’
and p_container in (‘MED BAG’, ‘MED BOX’, ‘MED PKG’, ‘MED PACK’)
and l_quantity >= [QUANTITY2] and l_quantity <= [QUANTITY2] + 10
and p_size between 1 and 10
)
or
(
and p_brand = ‘[BRAND3]’
and p_container in ( ‘LG CASE’, ‘LG BOX’, ‘LG PACK’, ‘LG PKG’)
and l_quantity >= [QUANTITY3] and l_quantity <= [QUANTITY3] + 10
and p_size between 1 and 15
)
)
in which case the input cardinality into the join would be much lower.
Note there are further rewrites possible (aka introducing additional single table predicates like p_size between 1 and 15 that can filter the input to the joins even further (although the final filter is also still needed).
Describe the solution you'd like
The "classic" way to implement this is as a "predicate rewrite" pass that rearranges predicates for further downstream operations
The goal is basically to get the predicate into a form of
good_predicate1 AND good_predicate2 AND ...
Where good_predicate means the predicate has special support in the execution engine.
Since OR is not typically handled specially, rewrites to AND are helpful. Some common rewrites:
Rewrite 1 (needed for TPCH 19) : (p and q1) OR (p and q2) OR (p and ..) ==> p AND (q1 or q2)
Rewrite 2 (not in Q19, but useful elsewhere) : (col1 = A) OR (col1 = B) OR (col1 = C) ==> col1 IN (A, B, C)
Which then the execution engine can treat like a single column predicate (push down to scan) and build a hash table for (A, B, C) and do fast filtering.
This kind of rewrite can get all sorts of fancy and sometimes needs a cost model (to estimate, for example, if redundantly applying a filter during scan and after a join is worthwhile). It probably makes sense to implement a basic rewrite pass with the single table predicate extraction first, and then make it fancier from there
Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.
Additional context
Add any other context or screenshots about the feature request here.
The text was updated successfully, but these errors were encountered:
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
As @Dandandan and I were discussing on #78 (comment)
The good news is that after #78, DataFusion can run TPCH Q19 🎉 . The downside is that Q19 currently has abysmal performance(basically it will never finish) because DataFusion plans it as a CROSS JOIN followed by filter. A more optimal plan would recognize a join predicate (so INNER JOIN) can be used, as well as several "single column predicates" and "single table predicates" which could be pushed down to the scans (aka applied prior to the joins)
For reference, TPCH Q19 looks like this
Note that while the predicate is one big
OR
, it can be rewritten like:in which case the input cardinality into the join would be much lower.
Note there are further rewrites possible (aka introducing additional single table predicates like
p_size between 1 and 15
that can filter the input to the joins even further (although the final filter is also still needed).Describe the solution you'd like
The "classic" way to implement this is as a "predicate rewrite" pass that rearranges predicates for further downstream operations
The goal is basically to get the predicate into a form of
good_predicate1
ANDgood_predicate2
AND ...Where
good_predicate
means the predicate has special support in the execution engine.Since OR is not typically handled specially, rewrites to AND are helpful. Some common rewrites:
Which then the execution engine can treat like a single column predicate (push down to scan) and build a hash table for
(A, B, C)
and do fast filtering.This kind of rewrite can get all sorts of fancy and sometimes needs a cost model (to estimate, for example, if redundantly applying a filter during scan and after a join is worthwhile). It probably makes sense to implement a basic rewrite pass with the single table predicate extraction first, and then make it fancier from there
Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.
Additional context
Add any other context or screenshots about the feature request here.
The text was updated successfully, but these errors were encountered: