-
Notifications
You must be signed in to change notification settings - Fork 3.7k
Closed
Description
Queries like below cannot finish in a acceptable time, store_sales has 2800w rows, customer_address has 5w rows, for now Doris will create only one cross join node to execute this sql,
the time of eval the where clause is about 200-300 ns, the total count of eval will be 2800w * 5w, this is extremely large, and this will cost 2800w * 5w * 250 ns = 4 billion seconds;
select avg(ss_quantity)
,avg(ss_ext_sales_price)
,avg(ss_ext_wholesale_cost)
,sum(ss_ext_wholesale_cost)
from store_sales, customer_address
where ((ss_addr_sk = ca_address_sk
and ca_country = 'United States'
and ca_state in ('CO', 'IL', 'MN')
and ss_net_profit between 100 and 200
) or
(ss_addr_sk = ca_address_sk
and ca_country = 'United States'
and ca_state in ('OH', 'MT', 'NM')
and ss_net_profit between 150 and 300
) or
(ss_addr_sk = ca_address_sk
and ca_country = 'United States'
and ca_state in ('TX', 'MO', 'MI')
and ss_net_profit between 50 and 250
))
but this sql can be rewrite to
select avg(ss_quantity)
,avg(ss_ext_sales_price)
,avg(ss_ext_wholesale_cost)
,sum(ss_ext_wholesale_cost)
from store_sales, customer_address
where ss_addr_sk = ca_address_sk
and ca_country = 'United States' and (((ca_state in ('CO', 'IL', 'MN')
and ss_net_profit between 100 and 200
) or
(ca_state in ('OH', 'MT', 'NM')
and ss_net_profit between 150 and 300
) or
(ca_state in ('TX', 'MO', 'MI')
and ss_net_profit between 50 and 250
))
)
there for we can do a hash join first and then use
(((ca_state in ('CO', 'IL', 'MN')
and ss_net_profit between 100 and 200
) or
(ca_state in ('OH', 'MT', 'NM')
and ss_net_profit between 150 and 300
) or
(ca_state in ('TX', 'MO', 'MI')
and ss_net_profit between 50 and 250
))
)
to filter the value,
in TPCDS 10g dataset, the rewritten sql only cost about 1 seconds.

so we should implements this optimize
morningman