Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-28090][SQL] Improve replaceAliasButKeepName performance #35382

Conversation

peter-toth
Copy link
Contributor

What changes were proposed in this pull request?

SPARK-28090 ticket description contains an example query with multiple nested struct creation and field extraction. The following is is an example of the query when the sample code range is set to only 3:

Project [struct(num1, numerics#23.num1, num10, numerics#23.num10, num11, numerics#23.num11, num12, numerics#23.num12, num13, numerics#23.num13, num14, numerics#23.num14, num15, numerics#23.num15, num2, numerics#23.num2, num3, numerics#23.num3, num4, numerics#23.num4, num5, numerics#23.num5, num6, numerics#23.num6, num7, numerics#23.num7, num8, numerics#23.num8, num9, numerics#23.num9, out_num1, numerics#23.out_num1, out_num2, -numerics#23.num2) AS numerics#42]
+- Project [struct(num1, numerics#5.num1, num10, numerics#5.num10, num11, numerics#5.num11, num12, numerics#5.num12, num13, numerics#5.num13, num14, numerics#5.num14, num15, numerics#5.num15, num2, numerics#5.num2, num3, numerics#5.num3, num4, numerics#5.num4, num5, numerics#5.num5, num6, numerics#5.num6, num7, numerics#5.num7, num8, numerics#5.num8, num9, numerics#5.num9, out_num1, -numerics#5.num1) AS numerics#23]
   +- LogicalRDD [numerics#5], false

If the level of nesting reaches 7 the query plan generation becomes extremely slow on Spark 2.4. That is because both

  • CollapseProject rule is slow and
  • some of the expression optimization rules running on the huge, not yet simplified expression tree of the single, collapsed Project node are slow.

On Spark 3.3, after SPARK-36718, CollapseProject doesn't collapse such plans so the above issues don't occur,
but PhysicalOperation extractor has an issue that it also builds up that huge expression tree and then traverses and modifies it in AliasHelper.replaceAliasButKeepName(). With a small change in that function we can avoid such costly operations.

Why are the changes needed?

The suggested change reduced the plan generation time of the example query from minutes (range = 7) or hours (range = 8+) to seconds.

Does this PR introduce any user-facing change?

The example query can be executed.

How was this patch tested?

Existing UTs + manual test of the example query in the ticket description.

@peter-toth
Copy link
Contributor Author

I think there is a build error currently on master:

[error] [error] /__w/spark/spark/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/SubexpressionEliminationSuite.scala:426:22: not enough arguments for constructor TaskContextImpl: (stageId: Int, stageAttemptNumber: Int, partitionId: Int, taskAttemptId: Long, attemptNumber: Int, numPartitions: Int, taskMemoryManager: org.apache.spark.memory.TaskMemoryManager, localProperties: java.util.Properties, metricsSystem: org.apache.spark.metrics.MetricsSystem, taskMetrics: org.apache.spark.executor.TaskMetrics, cpus: Int, resources: Map[String,org.apache.spark.resource.ResourceInformation])org.apache.spark.TaskContextImpl.
[error] Unspecified value parameter metricsSystem.
[error]       val context1 = new TaskContextImpl(0, 0, 0, 0, 0, null, null, null, cpus = 0)
[error]                      ^
[error] one error found

@cloud-fan
Copy link
Contributor

@peter-toth it should be fixed now, can you rebase this PR? thanks!

@peter-toth peter-toth force-pushed the SPARK-28090-improve-replacealiasbutkeepname branch from f3cff6c to 2c47f91 Compare April 1, 2022 07:11
@peter-toth
Copy link
Contributor Author

@peter-toth it should be fixed now, can you rebase this PR? thanks!

Thanks. Done.

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-28090][SQL] Improve replaceAliasButKeepName performance [SPARK-28090][SQL] Improve replaceAliasButKeepName performance Apr 3, 2022
Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM. Thank you, @peter-toth and @cloud-fan .
Merged to master for Apache Spark 3.4.

@peter-toth
Copy link
Contributor Author

Thanks @cloud-fan, @dongjoon-hyun for the review!

cloud-fan pushed a commit that referenced this pull request Apr 6, 2022
### What changes were proposed in this pull request?

SPARK-28090 ticket description contains an example query with multiple nested struct creation and field extraction. The following is is an example of the query when the sample code range is set to only 3:
```
Project [struct(num1, numerics#23.num1, num10, numerics#23.num10, num11, numerics#23.num11, num12, numerics#23.num12, num13, numerics#23.num13, num14, numerics#23.num14, num15, numerics#23.num15, num2, numerics#23.num2, num3, numerics#23.num3, num4, numerics#23.num4, num5, numerics#23.num5, num6, numerics#23.num6, num7, numerics#23.num7, num8, numerics#23.num8, num9, numerics#23.num9, out_num1, numerics#23.out_num1, out_num2, -numerics#23.num2) AS numerics#42]
+- Project [struct(num1, numerics#5.num1, num10, numerics#5.num10, num11, numerics#5.num11, num12, numerics#5.num12, num13, numerics#5.num13, num14, numerics#5.num14, num15, numerics#5.num15, num2, numerics#5.num2, num3, numerics#5.num3, num4, numerics#5.num4, num5, numerics#5.num5, num6, numerics#5.num6, num7, numerics#5.num7, num8, numerics#5.num8, num9, numerics#5.num9, out_num1, -numerics#5.num1) AS numerics#23]
   +- LogicalRDD [numerics#5], false
```
If the level of nesting reaches 7 the query plan generation becomes extremely slow on Spark 2.4. That is because both
- `CollapseProject` rule is slow and
- some of the expression optimization rules running on the huge, not yet simplified expression tree of the single, collapsed `Project` node are slow.

On Spark 3.3, after SPARK-36718, `CollapseProject` doesn't collapse such plans so the above issues don't occur,
but `PhysicalOperation` extractor has an issue that it also builds up that huge expression tree and then traverses and modifies it in `AliasHelper.replaceAliasButKeepName()`. With a small change in that function we can avoid such costly operations.

### Why are the changes needed?
The suggested change reduced the plan generation time of the example query from minutes (range = 7) or hours (range = 8+) to seconds.

### Does this PR introduce _any_ user-facing change?
The example query can be executed.

### How was this patch tested?
Existing UTs + manual test of the example query in the ticket description.

Closes #35382 from peter-toth/SPARK-28090-improve-replacealiasbutkeepname.

Authored-by: Peter Toth <peter.toth@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
@cloud-fan
Copy link
Contributor

I've backported it to 3.3 as well, as it fixes a perf regression in 3.3: #36024

@dongjoon-hyun
Copy link
Member

+1 for @cloud-fan 's decision. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
3 participants