[SPARK-32941][SQL] Optimize UpdateFields expression chain and put the rule early in Analysis phase #29812

viirya · 2020-09-20T01:45:19Z

What changes were proposed in this pull request?

This patch adds further optimizations for UpdateFields expression chains and applies them early in the analysis phase.

In particular, this optimization includes:

Deduplicate WithField operations within UpdateFields.
In SimplifyExtractValueOps, respect the nullability of the input struct in GetStructField(UpdateFields(struct, ...)), and unwrap the if-else.

Why are the changes needed?

UpdateFields can manipulate complex nested data, but using it can easily create inefficient expression chains. We should optimize it further.

When manipulating a deeply nested schema, the UpdateFields expression tree can become too complex to analyze, so this change optimizes UpdateFields early in the analysis phase.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unit test.

SparkQA · 2020-09-20T07:05:02Z

Test build #128904 has finished for PR 29812 at commit 0217130.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-09-20T07:05:02Z

Test build #128905 has finished for PR 29812 at commit 74cf2dd.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2020-09-20T17:04:56Z

GitHub Actions passed.

viirya · 2020-09-20T17:05:09Z

cc @cloud-fan @dongjoon-hyun @maropu

viirya · 2020-09-20T17:05:31Z

retest this please

SparkQA · 2020-09-20T21:33:58Z

Test build #128918 has finished for PR 29812 at commit 74cf2dd.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2020-09-21T05:58:35Z

LGTM, cc @fqaiser94

fqaiser94 · 2020-09-21T10:11:06Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/WithFields.scala

+      val newNames = mutable.ArrayBuffer.empty[String]
+      val newValues = mutable.ArrayBuffer.empty[Expression]
+      names.zip(values).reverse.foreach { case (name, value) =>
+        if (!newNames.contains(name)) {


We should use the resolver here; otherwise I think we will have correctness issues.
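
To make the concern concrete, here is a minimal sketch in plain Scala (no Spark dependency) of deduplicating name/value pairs through a resolver instead of `contains`. The `Resolver` alias, the two resolver values, and the `dedup` helper are stand-ins written for this illustration, not the code in the PR; they assume a resolver is simply a name-equality predicate.

```scala
import scala.collection.mutable

object ResolverDedupSketch {
  // Stand-in for a resolver: a predicate deciding whether two names match.
  type Resolver = (String, String) => Boolean
  val caseSensitive: Resolver = (a, b) => a == b
  val caseInsensitive: Resolver = (a, b) => a.equalsIgnoreCase(b)

  // Hypothetical helper: walk the pairs from last to first and keep only the
  // first pair seen for each resolved name, so the last write wins.
  def dedup[V](names: Seq[String], values: Seq[V], resolver: Resolver): Seq[(String, V)] = {
    val kept = mutable.ArrayBuffer.empty[(String, V)]
    names.zip(values).reverse.foreach { case (name, value) =>
      if (!kept.exists { case (k, _) => resolver(k, name) }) kept += ((name, value))
    }
    kept.reverse.toSeq
  }
}
```

With the case-insensitive resolver, `a` and `A` collapse into one entry; a plain `contains` check would treat them as distinct fields regardless of the session's case-sensitivity setting.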

fqaiser94 · 2020-09-21T10:40:06Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/WithFields.scala

  def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
+    case WithFields(structExpr, names, values) if names.distinct.length != names.length =>


could this case statement be after the next case statement? So that we combine the chains first before deduplicating?

We don't run this rule just once, so the order should be fine.

fqaiser94 · 2020-09-21T22:48:25Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/WithFields.scala

+          newValues += value
+        }
+      }
+      WithFields(structExpr, names = newNames.reverse.toSeq, valExprs = newValues.reverse.toSeq)


For my understanding, can you explain how we expect to benefit from this optimization?

I ask because we do this kind of deduplication inside of WithFields already as part of the foldLeft operation here. It will only keep the last valExpr for each name. So I think the optimized logical plan will be the same with or without this optimization in all scenarios? CMIIW

You are right. The result is eventually the same. But in some cases, before we extend WithFields, the expression tree might be very complex. This comes from improving the scalability of #29587: I applied this rule while fixing the scalability issue there, and found it useful for reducing the complexity of the WithFields expression tree.

I will run these rules in #29587 to simplify expression tree before entering optimizer.

Okay, so I took a look at the PR you linked and left a related comment there. I don't think you actually need this optimization for #29587

This optimization is only useful if someone uses WithFields to update the same field multiple times. However, it would simply be better to not update the same field multiple times. At the very least, we should not do this when we re-use this Expression internally within Spark.

Unfortunately, "bad" end-users might still update the same field multiple times. Assuming we should optimize for such users (not sure), since this batch is only applied half-way through the optimization cycle anyway, I think we could just move up the Batch("ReplaceWithFieldsExpression", Once, ReplaceWithFieldsExpression) to get the same benefit (which is just simplified tree). What do you reckon?

Actually, I'd like to run these rules to simplify the WithFields tree early, in the analysis stage. While fixing the scalability issue in #29587, I found that it is very easy to write a bad WithFields tree. Once you hit that, it is very hard to debug, and the analyzer/optimizer spends a lot of time traversing the expression tree. So I think it is very useful to keep this rule to simplify the expression tree, but I don't think we want to run ReplaceWithFieldsExpression in the analysis stage.

ahh I see, yes, in the analysis stage this would likely be helpful!

Okay in that case, could this PR wait till #29795 goes in? I'm refactoring WithFields so this optimization would need to change anyway.

I'm fine to wait until #29795.

SparkQA · 2020-09-22T05:36:46Z

Test build #128955 has finished for PR 29812 at commit 00acff9.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2020-09-22T08:31:55Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/WithFields.scala

+      val newNames = mutable.ArrayBuffer.empty[String]
+      val newValues = mutable.ArrayBuffer.empty[Expression]
+      names.zip(values).reverse.foreach { case (name, value) =>
+        if (newNames.find(resolver(_, name)).isEmpty) {


this is a bit inefficient. Shall we build a set with lowercased names if case sensitivity is false?

Added a set for case-sensitive case.
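
The efficiency point can be sketched as follows: instead of scanning the kept names with a resolver for every pair, track normalized names in a hash set. This is an illustration in plain Scala, not the code that was committed; `SetDedupSketch`, `dedup`, and the `caseSensitive` flag are hypothetical names.

```scala
import java.util.Locale
import scala.collection.mutable

object SetDedupSketch {
  // Hypothetical helper: last write wins, with constant-time duplicate checks.
  def dedup[V](ops: Seq[(String, V)], caseSensitive: Boolean): Seq[(String, V)] = {
    // Normalize the name only when comparisons are case-insensitive.
    def key(name: String): String =
      if (caseSensitive) name else name.toLowerCase(Locale.ROOT)

    val seen = mutable.HashSet.empty[String]
    val kept = mutable.ArrayBuffer.empty[(String, V)]
    ops.reverse.foreach { case (name, value) =>
      // HashSet.add returns true only if the key was not already present.
      if (seen.add(key(name))) kept += ((name, value))
    }
    kept.reverse.toSeq
  }
}
```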

SparkQA · 2020-09-23T03:21:26Z

Test build #128996 has finished for PR 29812 at commit cb8872c.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-10-18T19:26:29Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34570/

SparkQA · 2020-10-18T19:55:56Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34570/

dongjoon-hyun · 2020-10-18T20:31:44Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/UpdateFields.scala

+  val optimizeUpdateFields: PartialFunction[Expression, Expression] = {
+    case UpdateFields(structExpr, fieldOps)
+      if fieldOps.forall(_.isInstanceOf[WithField]) &&
+        fieldOps.map(_.asInstanceOf[WithField].name.toLowerCase(Locale.ROOT)).distinct.length !=


In case-sensitive mode, this seems to allow unnecessary computation. Can we improve this if condition to handle the case-sensitive and case-insensitive cases together?

The if condition should cover both case-sensitive and case-insensitive cases now. I compare names in lowercase in the condition.

No, what I meant is that we don't need to execute line 39~69 at all.

For example, for case-sensitive case, fieldOps.map(_.asInstanceOf[WithField].name).distinct.length != fieldOps.length should be used.

I see. Updated. Thanks.
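
A guard along the lines discussed in this thread might look like the sketch below, written in plain Scala over a list of field names. `DuplicateGuardSketch` and `hasDuplicates` are hypothetical names, and the real rule operates on `WithField` expressions rather than strings.

```scala
import java.util.Locale

object DuplicateGuardSketch {
  // Returns true only when the rule has work to do, so that in
  // case-sensitive mode names differing only by case do not trigger it.
  def hasDuplicates(names: Seq[String], caseSensitive: Boolean): Boolean = {
    val keys = if (caseSensitive) names else names.map(_.toLowerCase(Locale.ROOT))
    keys.distinct.length != keys.length
  }
}
```

In case-sensitive mode `Seq("a", "A")` has no duplicates, so the deduplication body can be skipped entirely.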

dongjoon-hyun

+1, LGTM (except one comment). Could you consider it, @viirya ?

SparkQA · 2020-10-18T20:35:33Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34574/

SparkQA · 2020-10-18T20:56:50Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34574/

SparkQA · 2020-10-18T23:14:04Z

Test build #129964 has finished for PR 29812 at commit 82ad8c8.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-10-19T00:21:06Z

Test build #129967 has finished for PR 29812 at commit 38bdefd.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-10-19T02:15:53Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34580/

SparkQA · 2020-10-19T02:36:57Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34580/

dongjoon-hyun

Thank you for updates.

SparkQA · 2020-10-19T06:03:35Z

Test build #129972 has finished for PR 29812 at commit f41900c.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2020-10-19T17:36:03Z

Merged to master.

HyukjinKwon · 2020-10-20T04:00:18Z

@viirya, BTW, do you mind fixing the PR description to explain what this PR specifically improves?

This patch proposes to add more optimization to UpdateFields expression chain.

It seems the PR description does not say what exactly it optimizes. Is my understanding correct that this PR proposes two separate optimizations?

Deduplicates WithField at UpdateFields
Respect nullability in input struct at GetStructField(UpdateFields(..., struct)), and unwrap if-else.

HyukjinKwon · 2020-10-20T04:03:03Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/UpdateFields.scala

+      val newValues = mutable.ArrayBuffer.empty[Expression]
+
+      if (caseSensitive) {
+        names.zip(values).reverse.foreach { case (name, value) =>


I wonder if we could just do something like collection.immutable.ListMap(names.zip(values): _*), which would let the last value win here and keep the order of fields for later use. But I guess it's no big deal. Just saying.
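
A minimal sketch of that suggestion, assuming exact (case-sensitive) name matching only; `ListMapDedupSketch` and `lastWins` are hypothetical names. The position a re-inserted key ends up in is worth verifying against the Scala version in use, so the sketch relies only on the last-value-wins behavior.

```scala
import scala.collection.immutable.ListMap

object ListMapDedupSketch {
  // Later pairs overwrite earlier pairs that share the same name.
  def lastWins[V](names: Seq[String], values: Seq[V]): ListMap[String, V] =
    ListMap(names.zip(values): _*)
}
```

Because map keys compare by exact string equality, this form would not cover the case-insensitive path without normalizing the names first.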

viirya · 2020-10-20T05:46:42Z

@HyukjinKwon Thanks for the suggestion. I updated this PR description.

maropu · 2020-10-21T00:24:37Z

(late LGTM)

viirya · 2020-10-21T00:30:59Z

Thanks @maropu

…ined withField operations

### What changes were proposed in this pull request?
Modifies the UpdateFields optimizer to fix correctness issues with certain nested and chained withField operations. Examples for recreating the issue are in the new unit tests as well as the JIRA issue.

### Why are the changes needed?
Certain withField patterns can cause Exceptions or even incorrect results. It appears to be a result of the additional UpdateFields optimization added in #29812. It traverses fieldOps in reverse order to take the last one per field, but this can cause nested structs to change order which leads to mismatches between the schema and the actual data. This updates the optimization to maintain the initial ordering of nested structs to match the generated schema.

### Does this PR introduce _any_ user-facing change?
It fixes exceptions and incorrect results for valid uses in the latest Spark release.

### How was this patch tested?
Added new unit tests for these edge cases.

Closes #32338 from Kimahriman/bug/optimize-with-fields.

Authored-by: Adam Binford <adamq43@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
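
The ordering effect described in that commit message can be illustrated with plain collections. The sketch below is an assumption-laden simplification, not the Spark code or the actual fix: it contrasts a reverse traversal, which places a field at the position of its last occurrence, with a forward pass that keeps the position of the first occurrence. Both helper names are hypothetical.

```scala
import scala.collection.mutable

object FieldOrderSketch {
  // Reverse traversal: the last value wins, but the surviving field sits
  // where its LAST occurrence was.
  def dedupReverse[V](ops: Seq[(String, V)]): Seq[(String, V)] = {
    val seen = mutable.HashSet.empty[String]
    val kept = mutable.ArrayBuffer.empty[(String, V)]
    ops.reverse.foreach { case (name, value) =>
      if (seen.add(name)) kept += ((name, value))
    }
    kept.reverse.toSeq
  }

  // Forward pass: the last value still wins, but the field stays where its
  // FIRST occurrence was, because updating an existing key in a
  // LinkedHashMap does not move it.
  def dedupKeepFirstPosition[V](ops: Seq[(String, V)]): Seq[(String, V)] = {
    val byName = mutable.LinkedHashMap.empty[String, V]
    ops.foreach { case (name, value) => byName.put(name, value) }
    byName.toSeq
  }
}
```

For the operations `a -> 1, b -> 2, a -> 3`, the first helper yields `b, a` while the second yields `a, b`; a consumer that expects one order and receives the other would read values under the wrong field.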

probot-autolabeler bot added the SQL label Sep 20, 2020

viirya force-pushed the SPARK-32941 branch 2 times, most recently from 8662d76 to 0217130 Compare September 20, 2020 02:11

Optimize WithFields expression chain.

74cf2dd

viirya force-pushed the SPARK-32941 branch from 0217130 to 74cf2dd Compare September 20, 2020 02:17

viirya mentioned this pull request Sep 20, 2020

[SPARK-32376][SQL] Make unionByName null-filling behavior work with struct columns #29587

Closed

fqaiser94 reviewed Sep 21, 2020

Use resolver.

00acff9

cloud-fan reviewed Sep 22, 2020

Address comment.

cb8872c

Merge remote-tracking branch 'upstream/master' into SPARK-32941

82ad8c8

Simplify UpdateFields in analysis too.

38bdefd

viirya changed the title ~~[SPARK-32941][SQL] Optimize WithFields expression chain~~ [SPARK-32941][SQL] Optimize UpdateFields expression chain Oct 18, 2020

viirya changed the title ~~[SPARK-32941][SQL] Optimize UpdateFields expression chain~~ [SPARK-32941][SQL] Optimize UpdateFields expression chain and put the rule early in Analysis phase Oct 18, 2020

dongjoon-hyun reviewed Oct 18, 2020

dongjoon-hyun approved these changes Oct 18, 2020

Skip the rule if possible.

f41900c

dongjoon-hyun approved these changes Oct 19, 2020

dongjoon-hyun closed this in 66c5e01 Oct 19, 2020

HyukjinKwon reviewed Oct 20, 2020

Kimahriman mentioned this pull request Apr 25, 2021

[SPARK-35213][SQL] Keep the correct ordering of nested structs in chained withField operations #32338

Closed

viirya deleted the SPARK-32941 branch December 27, 2023 18:28
