[SPARK-28090][SQL] Improve `replaceAliasButKeepName` performance #35382

peter-toth · 2022-02-02T16:09:56Z

What changes were proposed in this pull request?

SPARK-28090 ticket description contains an example query with multiple nested struct creation and field extraction. The following is is an example of the query when the sample code range is set to only 3:

Project [struct(num1, numerics#23.num1, num10, numerics#23.num10, num11, numerics#23.num11, num12, numerics#23.num12, num13, numerics#23.num13, num14, numerics#23.num14, num15, numerics#23.num15, num2, numerics#23.num2, num3, numerics#23.num3, num4, numerics#23.num4, num5, numerics#23.num5, num6, numerics#23.num6, num7, numerics#23.num7, num8, numerics#23.num8, num9, numerics#23.num9, out_num1, numerics#23.out_num1, out_num2, -numerics#23.num2) AS numerics#42]
+- Project [struct(num1, numerics#5.num1, num10, numerics#5.num10, num11, numerics#5.num11, num12, numerics#5.num12, num13, numerics#5.num13, num14, numerics#5.num14, num15, numerics#5.num15, num2, numerics#5.num2, num3, numerics#5.num3, num4, numerics#5.num4, num5, numerics#5.num5, num6, numerics#5.num6, num7, numerics#5.num7, num8, numerics#5.num8, num9, numerics#5.num9, out_num1, -numerics#5.num1) AS numerics#23]
   +- LogicalRDD [numerics#5], false

If the level of nesting reaches 7 the query plan generation becomes extremely slow on Spark 2.4. That is because both

CollapseProject rule is slow and
some of the expression optimization rules running on the huge, not yet simplified expression tree of the single, collapsed Project node are slow.

On Spark 3.3, after SPARK-36718, CollapseProject doesn't collapse such plans so the above issues don't occur,
but PhysicalOperation extractor has an issue that it also builds up that huge expression tree and then traverses and modifies it in AliasHelper.replaceAliasButKeepName(). With a small change in that function we can avoid such costly operations.

Why are the changes needed?

The suggested change reduced the plan generation time of the example query from minutes (range = 7) or hours (range = 8+) to seconds.

Does this PR introduce any user-facing change?

The example query can be executed.

How was this patch tested?

Existing UTs + manual test of the example query in the ticket description.

peter-toth · 2022-03-31T15:16:38Z

I think there is a build error currently on master:

[error] [error] /__w/spark/spark/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/SubexpressionEliminationSuite.scala:426:22: not enough arguments for constructor TaskContextImpl: (stageId: Int, stageAttemptNumber: Int, partitionId: Int, taskAttemptId: Long, attemptNumber: Int, numPartitions: Int, taskMemoryManager: org.apache.spark.memory.TaskMemoryManager, localProperties: java.util.Properties, metricsSystem: org.apache.spark.metrics.MetricsSystem, taskMetrics: org.apache.spark.executor.TaskMetrics, cpus: Int, resources: Map[String,org.apache.spark.resource.ResourceInformation])org.apache.spark.TaskContextImpl.
[error] Unspecified value parameter metricsSystem.
[error]       val context1 = new TaskContextImpl(0, 0, 0, 0, 0, null, null, null, cpus = 0)
[error]                      ^
[error] one error found

cloud-fan · 2022-04-01T04:23:36Z

@peter-toth it should be fixed now, can you rebase this PR? thanks!

peter-toth · 2022-04-01T07:11:43Z

@peter-toth it should be fixed now, can you rebase this PR? thanks!

Thanks. Done.

dongjoon-hyun

+1, LGTM. Thank you, @peter-toth and @cloud-fan .
Merged to master for Apache Spark 3.4.

peter-toth · 2022-04-03T08:10:36Z

Thanks @cloud-fan, @dongjoon-hyun for the review!

### What changes were proposed in this pull request? SPARK-28090 ticket description contains an example query with multiple nested struct creation and field extraction. The following is is an example of the query when the sample code range is set to only 3: ``` Project [struct(num1, numerics#23.num1, num10, numerics#23.num10, num11, numerics#23.num11, num12, numerics#23.num12, num13, numerics#23.num13, num14, numerics#23.num14, num15, numerics#23.num15, num2, numerics#23.num2, num3, numerics#23.num3, num4, numerics#23.num4, num5, numerics#23.num5, num6, numerics#23.num6, num7, numerics#23.num7, num8, numerics#23.num8, num9, numerics#23.num9, out_num1, numerics#23.out_num1, out_num2, -numerics#23.num2) AS numerics#42] +- Project [struct(num1, numerics#5.num1, num10, numerics#5.num10, num11, numerics#5.num11, num12, numerics#5.num12, num13, numerics#5.num13, num14, numerics#5.num14, num15, numerics#5.num15, num2, numerics#5.num2, num3, numerics#5.num3, num4, numerics#5.num4, num5, numerics#5.num5, num6, numerics#5.num6, num7, numerics#5.num7, num8, numerics#5.num8, num9, numerics#5.num9, out_num1, -numerics#5.num1) AS numerics#23] +- LogicalRDD [numerics#5], false ``` If the level of nesting reaches 7 the query plan generation becomes extremely slow on Spark 2.4. That is because both - `CollapseProject` rule is slow and - some of the expression optimization rules running on the huge, not yet simplified expression tree of the single, collapsed `Project` node are slow. On Spark 3.3, after SPARK-36718, `CollapseProject` doesn't collapse such plans so the above issues don't occur, but `PhysicalOperation` extractor has an issue that it also builds up that huge expression tree and then traverses and modifies it in `AliasHelper.replaceAliasButKeepName()`. With a small change in that function we can avoid such costly operations. ### Why are the changes needed? The suggested change reduced the plan generation time of the example query from minutes (range = 7) or hours (range = 8+) to seconds. ### Does this PR introduce _any_ user-facing change? The example query can be executed. ### How was this patch tested? Existing UTs + manual test of the example query in the ticket description. Closes #35382 from peter-toth/SPARK-28090-improve-replacealiasbutkeepname. Authored-by: Peter Toth <peter.toth@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>

cloud-fan · 2022-04-06T06:49:50Z

I've backported it to 3.3 as well, as it fixes a perf regression in 3.3: #36024

dongjoon-hyun · 2022-04-06T17:33:58Z

+1 for @cloud-fan 's decision. Thanks!

github-actions bot added the SQL label Feb 2, 2022

peter-toth mentioned this pull request Mar 31, 2022

[SPARK-38712][SQL] Fix perf regression in PhysicalOperation #36024

Closed

cloud-fan approved these changes Mar 31, 2022

View reviewed changes

peter-toth added 2 commits April 1, 2022 09:11

[SPARK-28090][SQL] Improve replaceAliasButKeepName performance

3857f62

fix comment

2c47f91

peter-toth force-pushed the SPARK-28090-improve-replacealiasbutkeepname branch from f3cff6c to 2c47f91 Compare April 1, 2022 07:11

dongjoon-hyun changed the title ~~[SPARK-28090][SQL] Improve replaceAliasButKeepName performance~~ [SPARK-28090][SQL] Improve replaceAliasButKeepName performance Apr 3, 2022

dongjoon-hyun approved these changes Apr 3, 2022

View reviewed changes

dongjoon-hyun closed this in 61074d4 Apr 3, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[SPARK-28090][SQL] Improve `replaceAliasButKeepName` performance #35382

[SPARK-28090][SQL] Improve `replaceAliasButKeepName` performance #35382

peter-toth commented Feb 2, 2022

peter-toth commented Mar 31, 2022

cloud-fan commented Apr 1, 2022

peter-toth commented Apr 1, 2022

dongjoon-hyun left a comment

peter-toth commented Apr 3, 2022

cloud-fan commented Apr 6, 2022

dongjoon-hyun commented Apr 6, 2022

[SPARK-28090][SQL] Improve replaceAliasButKeepName performance #35382

[SPARK-28090][SQL] Improve replaceAliasButKeepName performance #35382

Conversation

peter-toth commented Feb 2, 2022

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

peter-toth commented Mar 31, 2022

cloud-fan commented Apr 1, 2022

peter-toth commented Apr 1, 2022

dongjoon-hyun left a comment

Choose a reason for hiding this comment

peter-toth commented Apr 3, 2022

cloud-fan commented Apr 6, 2022

dongjoon-hyun commented Apr 6, 2022

[SPARK-28090][SQL] Improve `replaceAliasButKeepName` performance #35382

[SPARK-28090][SQL] Improve `replaceAliasButKeepName` performance #35382