Skip to content

Sort Merge Join is extremely slow on LeftAnti joins #18487

@comphead

Description

@comphead

Is your feature request related to a problem or challenge?

I'm testing TPCDS Q69 correctness issues with Comet(apache/datafusion-comet#2667) and found the Q69 which contains multiple anti joins performs way worse comparing to hash join

Modified query

select
  cd_gender,
  cd_marital_status,
  cd_education_status,
  count(*) cnt1,
  cd_purchase_estimate,
  count(*) cnt2,
  cd_credit_rating,
  count(*) cnt3
 from
  customer c,customer_address ca,customer_demographics
 where
  c.c_current_addr_sk = ca.ca_address_sk and
  ca_state in ('IN','VA','MS') and
  cd_demo_sk = c.c_current_cdemo_sk and
  exists (select *
          from store_sales,date_dim
          where c.c_customer_sk = ss_customer_sk and
                ss_sold_date_sk = d_date_sk and
                d_year = 2002 and
                d_moy between 2 and 2+2)   and (not exists (select *
            from web_sales,date_dim
            where c.c_customer_sk = ws_bill_customer_sk and
                  ws_sold_date_sk = d_date_sk and
                  d_year = 2002 and
                  d_moy between 2 and 2+2))
 group by cd_gender,
          cd_marital_status,
          cd_education_status,
          cd_purchase_estimate, cd_credit_rating order by 1, 2, 3, 4, 5, 6, 7, 8;

HJ Elapsed 9.879 seconds.
SMJ Elapsed 59.865 seconds.

Also it is looks like to be a reason for apache/datafusion-comet#901

Describe the solution you'd like

No response

Describe alternatives you've considered

No response

Additional context

No response

Metadata

Metadata

Assignees

Labels

enhancementNew feature or requestperformanceMake DataFusion faster

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions