[SPARK-40177][SQL] Simplify condition of form (a==b) || (a==null&&b==null) to a<=>b #37625
Merge from apache master
Merge with master
Merge apache spark master
Can one of the admins verify this patch?
Gentle ping @cloud-fan @srowen
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala
The optimization seems logically correct. I don't know this part of the code well enough to review the code change itself. My only question is how common this type of condition is, though I could believe it shows up in join conditions.
@@ -412,6 +412,16 @@ object BooleanSimplification extends Rule[LogicalPlan] with PredicateHelper {
}
}

case Or(EqualTo(l, r), And(IsNull(c1), IsNull(c2)))
Assume that we have a chain of predicates combined by OR: cond1 OR cond2 OR cond3 OR ... condN. I think we can merge condX and condY if they are EqualTo(l, r) and And(IsNull(l), IsNull(r)). This is more general than the current approach.
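The generalization suggested in this comment could look roughly like the sketch below. It is not Catalyst code: the expression classes are a hypothetical stand-alone model that only borrows Catalyst's names, structural equality stands in for whatever operand comparison the real rule would use, and `disjuncts` and `mergeNullSafe` are illustrative helper names.

```scala
// Hypothetical mini expression tree; names mirror Catalyst's, but this is not Spark code.
object OrChainSketch {
  sealed trait Expr
  case class Attr(name: String) extends Expr
  case class EqualTo(left: Expr, right: Expr) extends Expr
  case class EqualNullSafe(left: Expr, right: Expr) extends Expr
  case class IsNull(child: Expr) extends Expr
  case class And(left: Expr, right: Expr) extends Expr
  case class Or(left: Expr, right: Expr) extends Expr

  // Flatten nested ORs into a flat list: cond1 OR cond2 OR ... condN.
  def disjuncts(e: Expr): Seq[Expr] = e match {
    case Or(l, r) => disjuncts(l) ++ disjuncts(r)
    case other    => Seq(other)
  }

  // Recognizes And(IsNull(l), IsNull(r)) with the operands in either order.
  def bothNull(l: Expr, r: Expr): Expr => Boolean = {
    case And(IsNull(a), IsNull(b)) => (a == l && b == r) || (a == r && b == l)
    case _                         => false
  }

  // For every EqualTo(l, r) that has a matching null check anywhere in the
  // chain, drop the null check and replace the equality with l <=> r.
  def mergeNullSafe(e: Expr): Expr = {
    val ds = disjuncts(e)
    val merged = ds.foldLeft(ds) { (acc, d) =>
      d match {
        case EqualTo(l, r) if acc.contains(d) && acc.exists(bothNull(l, r)) =>
          acc.filterNot(bothNull(l, r))
            .map(x => if (x == d) EqualNullSafe(l, r) else x)
        case _ => acc
      }
    }
    merged.reduceLeft[Expr](Or(_, _))
  }
}
```

With this model, `a = b OR c IS NULL OR (a IS NULL AND b IS NULL)` becomes `a <=> b OR c IS NULL`, even though the two mergeable conditions are not adjacent.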
@cloud-fan I didn't generalize it because it would be tricky and would add complexity to the code. It is also probably less common for these conditions to be separated by other expressions in between.
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
What changes were proposed in this pull request?
A new case is added to BooleanSimplification to convert conditions of the form (a==b) || (a==null&&b==null) to a<=>b.
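As a rough illustration of the shape of such a case, here is a minimal sketch over a hypothetical stand-alone expression tree. It is not the code from this patch: the classes only borrow Catalyst's names, `sameOperands` and `simplify` are illustrative names, and plain structural equality is assumed where Catalyst would typically compare expressions with `semanticEquals`.

```scala
// Hypothetical mini expression tree; names mirror Catalyst's, but this is not Spark code.
object SimplifySketch {
  sealed trait Expr
  case class Attr(name: String) extends Expr
  case class EqualTo(left: Expr, right: Expr) extends Expr
  case class EqualNullSafe(left: Expr, right: Expr) extends Expr
  case class IsNull(child: Expr) extends Expr
  case class And(left: Expr, right: Expr) extends Expr
  case class Or(left: Expr, right: Expr) extends Expr

  // The two null checks must cover exactly the two sides of the equality,
  // in either order; otherwise the rewrite would change the result.
  private def sameOperands(l: Expr, r: Expr, c1: Expr, c2: Expr): Boolean =
    (l == c1 && r == c2) || (l == c2 && r == c1)

  def simplify(e: Expr): Expr = e match {
    // (l = r) OR (l IS NULL AND r IS NULL)  =>  l <=> r
    case Or(EqualTo(l, r), And(IsNull(c1), IsNull(c2))) if sameOperands(l, r, c1, c2) =>
      EqualNullSafe(l, r)
    // Same pattern with the two sides of the OR swapped.
    case Or(And(IsNull(c1), IsNull(c2)), EqualTo(l, r)) if sameOperands(l, r, c1, c2) =>
      EqualNullSafe(l, r)
    case other => other
  }
}
```

The guard matters: a condition such as `a = b OR (a IS NULL AND c IS NULL)` has the same outer shape but must be left untouched.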
Why are the changes needed?
If the join condition is of the form key1==key2 || (key1==null && key2==null), the join is executed as a Broadcast Nested Loop Join (BNLJ), because this condition does not qualify as an equi-join condition. BNLJ takes more time than a sort merge join or a broadcast hash join. The condition can be converted to key1<=>key2 so that the join executes as a broadcast hash join or sort merge join. This improves the performance of queries whose join condition matches this pattern.
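The reasoning behind the conversion can be checked with a small model of SQL three-valued logic. The sketch below is a stand-alone illustration, not Spark code: `Option` models a nullable value, `None` models NULL, and all function names are hypothetical. It checks that both forms accept the same rows when used as a join or filter condition, where only TRUE keeps a row.

```scala
// Stand-alone model of SQL three-valued logic; not Spark code.
object NullSafeEqualitySketch {
  type SqlBool = Option[Boolean] // None models SQL NULL

  // SQL `a = b`: NULL if either side is NULL.
  def eq(a: Option[Int], b: Option[Int]): SqlBool =
    for (x <- a; y <- b) yield x == y

  // SQL `a IS NULL`: always TRUE or FALSE, never NULL.
  def isNull(a: Option[Int]): SqlBool = Some(a.isEmpty)

  // Three-valued AND: FALSE dominates, otherwise NULL propagates.
  def and(p: SqlBool, q: SqlBool): SqlBool = (p, q) match {
    case (Some(false), _) | (_, Some(false)) => Some(false)
    case (Some(true), Some(true))            => Some(true)
    case _                                   => None
  }

  // Three-valued OR: TRUE dominates, otherwise NULL propagates.
  def or(p: SqlBool, q: SqlBool): SqlBool = (p, q) match {
    case (Some(true), _) | (_, Some(true)) => Some(true)
    case (Some(false), Some(false))        => Some(false)
    case _                                 => None
  }

  // SQL `a <=> b`: never NULL; two NULLs compare as equal.
  def nullSafeEq(a: Option[Int], b: Option[Int]): SqlBool = Some(a == b)

  // The original condition: (a = b) OR (a IS NULL AND b IS NULL).
  def original(a: Option[Int], b: Option[Int]): SqlBool =
    or(eq(a, b), and(isNull(a), isNull(b)))

  // A join or filter keeps a row only when the condition is TRUE.
  def accepts(p: SqlBool): Boolean = p.contains(true)

  def main(args: Array[String]): Unit = {
    val samples: Seq[Option[Int]] = Seq(None, Some(1), Some(2))
    for (a <- samples; b <- samples)
      assert(accepts(original(a, b)) == accepts(nullSafeEq(a, b)))
  }
}
```

One detail the model makes visible: when exactly one side is NULL, the original form evaluates to NULL while `<=>` evaluates to FALSE. Both reject the row in a join or filter condition, which is the context this description is concerned with.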
Sample query:
val dfAns = df.join(df1, (df("v")===df1("x") or (isnull(df("v")) and isnull(df1("x")))), "leftanti")
Plan before change
OptimizedPlan:
Join LeftAnti, ((v#1 = x#15) || (isnull(v#1) && isnull(x#15)))
:- LocalRelation [g#0, v#1, o#2, x#3]
+- LocalRelation [x#15]
dfAns.queryExecution.executedPlan
*(1) BroadcastNestedLoopJoin BuildRight, LeftAnti, ((v#256 = x#270) || (isnull(v#256) && isnull(x#270)))
:- LocalTableScan [g#255, v#256, o#257, x#258]
+- BroadcastExchange IdentityBroadcastMode, [id=#91]
   +- LocalTableScan [x#270]
Plan after change
OptimizedPlan
Join LeftAnti, (v#29 <=> x#79)
:- LocalRelation [g#28, v#29, o#30, x#31]
+- LocalRelation [x#79]
ExecutedPlan
*(1) BroadcastHashJoin [coalesce(v#29, 0), isnull(v#29)], [coalesce(x#71, 0), isnull(x#71)], LeftAnti, BuildRight
:- LocalTableScan [g#28, v#29, o#30, x#31]
+- BroadcastExchange HashedRelationBroadcastMode(ArrayBuffer(coalesce(input[0, int, false], 0), isnull(input[0, int, false]))), [id=#57]
   +- LocalTableScan [x#71]
Does this PR introduce any user-facing change?
No
How was this patch tested?
Unit tests run