Skip to content

[SPARK-37899][SQL] Support converting inner join to left semi join #37236

Closed
wangyum wants to merge 1 commit intoapache:masterfrom
wangyum:SPARK-37899
Closed

[SPARK-37899][SQL] Support converting inner join to left semi join #37236
wangyum wants to merge 1 commit intoapache:masterfrom
wangyum:SPARK-37899

Conversation

@wangyum
Copy link
Member

@wangyum wangyum commented Jul 20, 2022

What changes were proposed in this pull request?

This PR adds a new rule(EliminateInnerJoin) to support converting inner join to left semi join.
Left semi join can be further optimized by PushDownLeftSemiAntiJoin and PushLeftSemiLeftAntiThroughJoin.

For example:

CREATE TABLE t1 using parquet AS SELECT id AS a, id AS b, id AS c FROM range(10000000);
CREATE TABLE t2 using parquet AS SELECT id AS a, id AS b, id AS c FROM range(10000000);
CREATE TABLE t3 using parquet AS SELECT id AS a, id AS b, id AS c FROM range(1000);

SELECT t1.a, t2.b FROM t1 JOIN t2 ON t1.a = t2.a JOIN (SELECT DISTINCT c FROM t3) tmp ON t2.b = tmp.c;

Before this PR:

== Optimized Logical Plan ==
Project [a#13050L, b#13054L]
+- Join Inner, (b#13054L = c#13058L)
   :- Project [a#13050L, b#13054L]
   :  +- Join Inner, (a#13050L = a#13053L)
   :     :- Project [a#13050L]
   :     :  +- Filter isnotnull(a#13050L)
   :     :     +- Relation spark_catalog.t1[a#13050L,b#13051L,c#13052L] parquet
   :     +- Project [a#13053L, b#13054L]
   :        +- Filter (isnotnull(a#13053L) AND isnotnull(b#13054L))
   :           +- Relation spark_catalog.t2[a#13053L,b#13054L,c#13055L] parquet
   +- Aggregate [c#13058L], [c#13058L]
      +- Project [c#13058L]
         +- Filter isnotnull(c#13058L)
            +- Relation spark_catalog.t3[a#13056L,b#13057L,c#13058L] parquet

After this PR:

== Optimized Logical Plan ==
Project [a#13050L, b#13054L]
+- Join Inner, (a#13050L = a#13053L)
   :- Project [a#13050L]
   :  +- Filter isnotnull(a#13050L)
   :     +- Relation spark_catalog.t1[a#13050L,b#13051L,c#13052L] parquet
   +- Join LeftSemi, (b#13054L = c#13058L)
      :- Project [a#13053L, b#13054L]
      :  +- Filter (isnotnull(b#13054L) AND isnotnull(a#13053L))
      :     +- Relation spark_catalog.t2[a#13053L,b#13054L,c#13055L] parquet
      +- Aggregate [c#13058L], [c#13058L]
         +- Project [c#13058L]
            +- Filter isnotnull(c#13058L)
               +- Relation spark_catalog.t3[a#13056L,b#13057L,c#13058L] parquet

Why are the changes needed?

Improve query performance.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Unit test and TPC-DS benchmark test.

SQL Before this PR(Seconds) After this PR(Seconds)
q14a 187  164
q64 78  72

@github-actions github-actions bot added the SQL label Jul 20, 2022
@wangyum
Copy link
Member Author

wangyum commented Jul 20, 2022

A production case.

Before this PR After this PR
image image

@wangyum
Copy link
Member Author

wangyum commented Jul 23, 2022

Close it because this idea needs statistics. We should not convert to a left semi-join if the left side can broadcast and the right side cannot.

Another idea is to push this join through join . But I think it's too special.

@wangyum wangyum closed this Jul 23, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant