[SPARK-28090][SQL] Improve `replaceAliasButKeepName` performance #35382
I think there is a build error currently on
@peter-toth it should be fixed now, can you rebase this PR? Thanks!
f3cff6c to 2c47f91
Thanks. Done.
+1, LGTM. Thank you, @peter-toth and @cloud-fan.
Merged to master for Apache Spark 3.4.
Thanks @cloud-fan, @dongjoon-hyun for the review!
### What changes were proposed in this pull request?

The SPARK-28090 ticket description contains an example query with multiple nested struct creation and field extraction. The following is an example of the query plan when the sample code range is set to only 3:

```
Project [struct(num1, numerics#23.num1, num10, numerics#23.num10, num11, numerics#23.num11, num12, numerics#23.num12, num13, numerics#23.num13, num14, numerics#23.num14, num15, numerics#23.num15, num2, numerics#23.num2, num3, numerics#23.num3, num4, numerics#23.num4, num5, numerics#23.num5, num6, numerics#23.num6, num7, numerics#23.num7, num8, numerics#23.num8, num9, numerics#23.num9, out_num1, numerics#23.out_num1, out_num2, -numerics#23.num2) AS numerics#42]
+- Project [struct(num1, numerics#5.num1, num10, numerics#5.num10, num11, numerics#5.num11, num12, numerics#5.num12, num13, numerics#5.num13, num14, numerics#5.num14, num15, numerics#5.num15, num2, numerics#5.num2, num3, numerics#5.num3, num4, numerics#5.num4, num5, numerics#5.num5, num6, numerics#5.num6, num7, numerics#5.num7, num8, numerics#5.num8, num9, numerics#5.num9, out_num1, -numerics#5.num1) AS numerics#23]
   +- LogicalRDD [numerics#5], false
```

If the level of nesting reaches 7, query plan generation becomes extremely slow on Spark 2.4. That is because both
- the `CollapseProject` rule is slow and
- some of the expression optimization rules running on the huge, not yet simplified expression tree of the single, collapsed `Project` node are slow.

On Spark 3.3, after SPARK-36718, `CollapseProject` doesn't collapse such plans, so the above issues don't occur. However, the `PhysicalOperation` extractor has an issue: it also builds up that huge expression tree and then traverses and modifies it in `AliasHelper.replaceAliasButKeepName()`. With a small change in that function we can avoid such costly operations.

### Why are the changes needed?

The suggested change reduced the plan generation time of the example query from minutes (range = 7) or hours (range = 8+) to seconds.

### Does this PR introduce _any_ user-facing change?

The example query can be executed.

### How was this patch tested?

Existing UTs + manual test of the example query in the ticket description.

Closes #35382 from peter-toth/SPARK-28090-improve-replacealiasbutkeepname.

Authored-by: Peter Toth <peter.toth@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
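To give a rough feel for why the collapsed expression tree described in the commit message gets so large, here is a minimal toy model in plain Python. It is not Spark code and does not reproduce Catalyst's data structures: the tuple encoding and the helper names (`build_levels`, `inline`, `size`, `collapsed_size`) are hypothetical, chosen only for illustration. It mimics the shape of the plan above, where each level copies every field of the previous struct and appends one negated field, and then counts the nodes that result when every reference to a lower-level struct is replaced by that struct's full definition.

```python
# Toy model only: expressions are nested tuples, not Catalyst expressions.
#   ("attr", name)                  reference to an attribute such as numerics#2
#   ("get", child, field)           field extraction, e.g. numerics#2.num1
#   ("neg", child)                  unary minus
#   ("struct", ((name, expr), ...)) struct creation

def build_levels(depth, base_fields=15):
    """Return (top attribute name, {attribute name: struct that defines it})."""
    names = [f"num{i}" for i in range(1, base_fields + 1)]
    definitions = {}
    prev = "numerics#0"  # hypothetical name for the input column
    for level in range(1, depth + 1):
        # Copy every existing field from the previous struct ...
        fields = [(n, ("get", ("attr", prev), n)) for n in names]
        # ... and append one new negated field, as in the plan above.
        new_name = f"out_num{level}"
        fields.append((new_name, ("neg", ("get", ("attr", prev), f"num{level}"))))
        names = names + [new_name]
        cur = f"numerics#{level}"
        definitions[cur] = ("struct", tuple(fields))
        prev = cur
    return prev, definitions

def inline(expr, definitions):
    """Replace every attribute that has a definition with that definition."""
    kind = expr[0]
    if kind == "attr":
        definition = definitions.get(expr[1])
        return expr if definition is None else inline(definition, definitions)
    if kind == "get":
        return ("get", inline(expr[1], definitions), expr[2])
    if kind == "neg":
        return ("neg", inline(expr[1], definitions))
    if kind == "struct":
        return ("struct", tuple((n, inline(e, definitions)) for n, e in expr[1]))
    raise ValueError(f"unknown expression kind: {kind}")

def size(expr):
    """Count the nodes of an expression tree."""
    kind = expr[0]
    if kind == "attr":
        return 1
    if kind in ("get", "neg"):
        return 1 + size(expr[1])
    if kind == "struct":
        return 1 + sum(size(e) for _, e in expr[1])
    raise ValueError(f"unknown expression kind: {kind}")

def collapsed_size(depth, base_fields=15):
    """Node count of the single expression left after inlining all levels."""
    top, definitions = build_levels(depth, base_fields)
    return size(inline(("attr", top), definitions))

if __name__ == "__main__":
    for depth in range(1, 5):
        print(depth, collapsed_size(depth))
```

In this model each extra level multiplies the node count by roughly the number of fields in the struct, so the fully inlined tree grows exponentially with nesting depth while the per-level expressions stay small. That matches the behaviour described above only qualitatively; the real cost in Spark depends on the actual rules and traversals involved.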
I've backported it to 3.3 as well, as it fixes a perf regression in 3.3: #36024
+1 for @cloud-fan's decision. Thanks!