Skip to content

Comments

[SPARK-37899][SQL] EliminateInnerJoin to support convert inner join to left semi join#35194

Closed
wangyum wants to merge 2 commits intoapache:masterfrom
wangyum:SPARK-37899
Closed

[SPARK-37899][SQL] EliminateInnerJoin to support convert inner join to left semi join#35194
wangyum wants to merge 2 commits intoapache:masterfrom
wangyum:SPARK-37899

Conversation

@wangyum
Copy link
Member

@wangyum wangyum commented Jan 13, 2022

What changes were proposed in this pull request?

Add a new rule(EliminateInnerJoin) to support convert inner join to left semi join. It has two advantages:

  1. Statistics are more accurate and more BroadcastHashJoins can be planned.
  2. We have 2 other rules(PushDownLeftSemiAntiJoin and PushLeftSemiLeftAntiThroughJoin) to optimize left semi join.

This is a real case:

sql("CREATE TABLE t1 using parquet AS SELECT id AS a, id AS b, id AS c FROM range(10000000)")
sql("CREATE TABLE t2 using parquet AS SELECT id AS a, id AS b, id AS c FROM range(10000000)")
sql("CREATE TABLE t3 using parquet AS SELECT id AS a, id AS b, id AS c FROM range(1000)")

sql(
  """
    |SELECT tmp1.*
    |FROM   (SELECT *
    |        FROM   t1
    |        UNION
    |        SELECT *
    |        FROM   t2) tmp1
    |       INNER JOIN (SELECT DISTINCT a,
    |                                   b
    |                   FROM   t3) tmp2
    |               ON tmp1.a = tmp2.a
    |                  AND tmp1.b = tmp2.b 
  """.stripMargin).explain

Before this pr:

== Optimized Logical Plan ==
Project [a#12L, b#13L, c#14L]
+- Join Inner, ((a#12L = a#18L) AND (b#13L = b#19L))
   :- Aggregate [a#12L, b#13L, c#14L], [a#12L, b#13L, c#14L]
   :  +- Union false, false
   :     :- Filter (isnotnull(a#12L) AND isnotnull(b#13L))
   :     :  +- Relation default.t1[a#12L,b#13L,c#14L] parquet
   :     +- Filter (isnotnull(a#15L) AND isnotnull(b#16L))
   :        +- Relation default.t2[a#15L,b#16L,c#17L] parquet
   +- Aggregate [a#18L, b#19L], [a#18L, b#19L]
      +- Project [a#18L, b#19L]
         +- Filter (isnotnull(a#18L) AND isnotnull(b#19L))
            +- Relation default.t3[a#18L,b#19L,c#20L] parquet

After this pr:

Aggregate [a#12L, b#13L, c#14L], [a#12L, b#13L, c#14L]
+- Union false, false
   :- Join LeftSemi, ((a#12L = a#18L) AND (b#13L = b#19L))
   :  :- Filter (isnotnull(a#12L) AND isnotnull(b#13L))
   :  :  +- Relation default.t1[a#12L,b#13L,c#14L] parquet
   :  +- Aggregate [a#18L, b#19L], [a#18L, b#19L]
   :     +- Project [a#18L, b#19L]
   :        +- Filter (isnotnull(a#18L) AND isnotnull(b#19L))
   :           +- Relation default.t3[a#18L,b#19L,c#20L] parquet
   +- Join LeftSemi, ((a#15L = a#18L) AND (b#16L = b#19L))
      :- Filter (isnotnull(a#15L) AND isnotnull(b#16L))
      :  +- Relation default.t2[a#15L,b#16L,c#17L] parquet
      +- Aggregate [a#18L, b#19L], [a#18L, b#19L]
         +- Project [a#18L, b#19L]
            +- Filter (isnotnull(a#18L) AND isnotnull(b#19L))
               +- Relation default.t3[a#18L,b#19L,c#20L] parquet

Why are the changes needed?

Improve query performance.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Unit test and TPC-DS benchmark test.

SQL Before this PR(Seconds) After this PR(Seconds)
q14a 174  150

@github-actions github-actions bot added the SQL label Jan 13, 2022
@zhengruifeng
Copy link
Contributor

Just curious why Statistics are more accurate?

def apply(plan: LogicalPlan): LogicalPlan = {
plan.transformWithPruning(_.containsPattern(INNER_LIKE_JOIN), ruleId) {
case p @ Project(_,
j @ ExtractEquiJoinKeys(Inner, _, rightKeys, None, _, left, right: Aggregate, _))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inner join is symmetrical, can this rule also be applied to inner join with left Aggregate child?

j @ ExtractEquiJoinKeys(Inner, leftKeys, _, _, None, left: Aggregate, _, _))

This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants