[SPARK-37899][SQL] Support converting inner join to left semi join by wangyum · Pull Request #37236 · apache/spark

wangyum · 2022-07-20T09:55:11Z

What changes were proposed in this pull request?

This PR adds a new rule(EliminateInnerJoin) to support converting inner join to left semi join.
Left semi join can be further optimized by PushDownLeftSemiAntiJoin and PushLeftSemiLeftAntiThroughJoin.

For example:

CREATE TABLE t1 using parquet AS SELECT id AS a, id AS b, id AS c FROM range(10000000);
CREATE TABLE t2 using parquet AS SELECT id AS a, id AS b, id AS c FROM range(10000000);
CREATE TABLE t3 using parquet AS SELECT id AS a, id AS b, id AS c FROM range(1000);

SELECT t1.a, t2.b FROM t1 JOIN t2 ON t1.a = t2.a JOIN (SELECT DISTINCT c FROM t3) tmp ON t2.b = tmp.c;

Before this PR:

== Optimized Logical Plan ==
Project [a#13050L, b#13054L]
+- Join Inner, (b#13054L = c#13058L)
   :- Project [a#13050L, b#13054L]
   :  +- Join Inner, (a#13050L = a#13053L)
   :     :- Project [a#13050L]
   :     :  +- Filter isnotnull(a#13050L)
   :     :     +- Relation spark_catalog.t1[a#13050L,b#13051L,c#13052L] parquet
   :     +- Project [a#13053L, b#13054L]
   :        +- Filter (isnotnull(a#13053L) AND isnotnull(b#13054L))
   :           +- Relation spark_catalog.t2[a#13053L,b#13054L,c#13055L] parquet
   +- Aggregate [c#13058L], [c#13058L]
      +- Project [c#13058L]
         +- Filter isnotnull(c#13058L)
            +- Relation spark_catalog.t3[a#13056L,b#13057L,c#13058L] parquet

After this PR:

== Optimized Logical Plan ==
Project [a#13050L, b#13054L]
+- Join Inner, (a#13050L = a#13053L)
   :- Project [a#13050L]
   :  +- Filter isnotnull(a#13050L)
   :     +- Relation spark_catalog.t1[a#13050L,b#13051L,c#13052L] parquet
   +- Join LeftSemi, (b#13054L = c#13058L)
      :- Project [a#13053L, b#13054L]
      :  +- Filter (isnotnull(b#13054L) AND isnotnull(a#13053L))
      :     +- Relation spark_catalog.t2[a#13053L,b#13054L,c#13055L] parquet
      +- Aggregate [c#13058L], [c#13058L]
         +- Project [c#13058L]
            +- Filter isnotnull(c#13058L)
               +- Relation spark_catalog.t3[a#13056L,b#13057L,c#13058L] parquet

Why are the changes needed?

Improve query performance.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Unit test and TPC-DS benchmark test.

SQL	Before this PR(Seconds)	After this PR(Seconds)
q14a	187	164
q64	78	72

wangyum · 2022-07-20T09:59:46Z

A production case.

Before this PR	After this PR

wangyum · 2022-07-23T14:09:35Z

Close it because this idea needs statistics. We should not convert to a left semi-join if the left side can broadcast and the right side cannot.

Another idea is to push this join through join . But I think it's too special.

Convert inner join to left semi join

d7bbcd7

github-actions bot added the SQL label Jul 20, 2022

wangyum closed this Jul 23, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[SPARK-37899][SQL] Support converting inner join to left semi join #37236

[SPARK-37899][SQL] Support converting inner join to left semi join #37236
wangyum wants to merge 1 commit intoapache:masterfrom
wangyum:SPARK-37899

wangyum commented Jul 20, 2022 •

edited

Loading

Uh oh!

wangyum commented Jul 20, 2022

Uh oh!

wangyum commented Jul 23, 2022

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

wangyum commented Jul 20, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Uh oh!

wangyum commented Jul 20, 2022

Uh oh!

wangyum commented Jul 23, 2022

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

wangyum commented Jul 20, 2022 •

edited

Loading