[SPARK-27359] [OPTIMIZER] [SQL] Rewrite ArraysOverlap Join #24563
Conversation
Can one of the admins verify this patch?
Thanks for your work!
I have a few questions for now:
- Do you have any benchmark numbers for this optimization?
- This rewrite explodes the arrays. If the arrays are long, does that have a big impact on performance?
Btw, the PR title is currently empty. Could you write a proper title for this work?
Oops, thanks for pointing out the missing title! :) I’ve only used this technique when the size of the arrays is several orders of magnitude smaller than the number of records on the larger side of the join. I don’t have any benchmarks to back this up yet (I’ll run some experiments and post the results here). The underlying assumption is that the number of items in the largest array is several orders of magnitude smaller than the number of records on either side of the join. This feels similar to how the replication factor used to optimize skew joins is also small.
Re: benchmarks. This is only anecdotal, but I’ve used this technique at work to bring a join that ran for a day without making progress down to only a few hours. As part of the experiments mentioned above, I’ll try to generate some dummy data similar to that use case.
val (leftArray, rightArray) =
  if (isIn(left, arrA) && isIn(right, arrB)) {
    (arrA, arrB)
  } else { // other cases would be caught be the analyzer
nit: by the analyzer?
private def isIn(p: LogicalPlan, e: Expression) = p.output.map(_.expr).contains(e)

def apply(plan: LogicalPlan): LogicalPlan = plan transform {
  case Join(left, right, joinType, Some(ArraysOverlap(arrA: NamedExpression, arrB: NamedExpression))) =>
This may fail the scala style test?
val (leftAlias, rightAlias) = ("explode_larr", "explode_rarr")
val (leftPrime, leftExp) = makePrime(left, leftArray, leftAlias)
val (rightPrime, rightExp) = makePrime(right, rightArray, rightAlias)
val joined = Join(leftPrime, rightPrime, joinType, Some(leftExp === rightExp))
I remember we usually use EqualTo(leftExp, rightExp) instead of the dsl here.
We're closing this PR because it hasn't been updated in a while. If you'd like to revive this PR, please reopen it!
What changes were proposed in this pull request?
An optimization for joins whose condition is arrays_overlap. I believe this is worthwhile to integrate into Spark given the recent release of several new array functions in Spark 2.4; the optimization will allow users to make better use of the arrays_overlap function. The technique proposed in this patch can also be trivially extended to joins with a condition involving array_contains.

A join on an arrays_overlap condition currently produces a cartesian product in the physical plan. This is unacceptable for joins on large datasets.
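The original query snippet did not survive extraction, so as an illustrative sketch (not code from the patch), here is a toy Python model of what such a join computes: every row pair is compared, so the work is proportional to |left| × |right|. The names `overlap_join`, `left`, and `right` are hypothetical.

```python
# Toy model: each side is a list of (id, array) rows; the naive plan
# checks arrays_overlap on every pair, i.e. a cartesian product.
left = [(1, ["a", "b"]), (2, ["c"])]
right = [(10, ["b", "x"]), (20, ["y"])]

def overlap_join(left, right):
    out = []
    for lid, larr in left:              # |left| * |right| comparisons
        for rid, rarr in right:
            if set(larr) & set(rarr):   # models arrays_overlap(larr, rarr)
                out.append((lid, rid))
    return out

# overlap_join(left, right) -> [(1, 10)]
```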
The query can be rewritten into an equivalent equijoin by exploding the arrays on each side and joining on equality of the exploded elements (deduplicating the matched row pairs afterwards). Doing so can bring a query that might otherwise never complete down to a reasonable time.
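As a hedged sketch of the rewrite (again a toy Python model, not the patch's actual Catalyst code): each side is exploded into one row per array element, the exploded sides are hash-joined on the element, and the matched (id, id) pairs are deduplicated, since two arrays may share more than one element. All names below are hypothetical.

```python
from collections import defaultdict

left = [(1, ["a", "b"]), (2, ["c"])]
right = [(10, ["b", "x"]), (20, ["y"])]

def explode(rows):
    # one output row per array element, like Spark's explode()
    return [(rid, elem) for rid, arr in rows for elem in arr]

def rewritten_join(left, right):
    # build a hash table on the exploded right side, keyed by element
    by_elem = defaultdict(list)
    for rid, elem in explode(right):
        by_elem[elem].append(rid)
    # probe with the exploded left side; dedupe matched row pairs
    matches = set()
    for lid, elem in explode(left):
        for rid in by_elem.get(elem, []):
            matches.add((lid, rid))
    return sorted(matches)

# rewritten_join(left, right) -> [(1, 10)], same result as the naive join
```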
How was this patch tested?
This patch has only been tested via manual tests on a large dataset.
I've used the technique implemented by this patch to perform similar joins with ~300 million records on either side of the join. If you agree that this is a worthwhile optimization, I'll happily contribute some unit tests to ensure the robustness of the optimization.