[SPARK-32941][SQL] Optimize UpdateFields expression chain and put the rule early in Analysis phase #29812

Closed

viirya (Member) commented Sep 20, 2020

What changes were proposed in this pull request?

This patch adds further optimizations for UpdateFields expression chains and applies them early, in the analysis phase.

In particular, this optimization includes:

  • Deduplicates WithField at UpdateFields
  • In SimplifyExtractValueOps, respect nullability in input struct at GetStructField(UpdateFields(struct, ...)), and unwrap if-else.
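
The second bullet can be illustrated with a toy model. This is only a sketch under stated assumptions, not Catalyst code: a struct is modelled as an `Option[Map[String, Int]]`, with `None` standing in for a null struct, and every name below (`NullabilitySketch`, `updateFields`, `getField`, `getFieldSimplified`) is hypothetical. It shows why reading a field straight from the update operations still needs a null check on the input struct.

```scala
object NullabilitySketch {
  // Toy model: None stands for a null struct, Some(fields) for a non-null one.
  type Struct = Option[Map[String, Int]]

  // Unoptimized path: build the updated struct first.
  def updateFields(struct: Struct, ops: Seq[(String, Int)]): Struct =
    struct.map(fields => fields ++ ops) // later writes override earlier ones

  // Then read a single field from it.
  def getField(struct: Struct, name: String): Option[Int] =
    struct.flatMap(_.get(name))

  // Simplified path: skip building the struct, but keep the null check on the
  // input so that a null struct still yields a null field.
  def getFieldSimplified(struct: Struct, ops: Seq[(String, Int)], name: String): Option[Int] =
    struct.flatMap { fields =>
      ops.reverseIterator.collectFirst { case (`name`, v) => v }.orElse(fields.get(name))
    }
}
```

When the input struct is known to be non-null, the guard is redundant, which is roughly what "unwrap if-else" refers to in the bullet above.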

Why are the changes needed?

UpdateFields can manipulate complex nested data, but it is easy to build an inefficient expression chain with it, so we should optimize it further.

When manipulating a deeply nested schema, the UpdateFields expression tree can become too complex to analyze, so this change optimizes UpdateFields early, in the analysis phase.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unit test.

viirya force-pushed the SPARK-32941 branch 2 times, most recently from 8662d76 to 0217130, on September 20, 2020 02:11
SparkQA commented Sep 20, 2020

Test build #128904 has finished for PR 29812 at commit 0217130.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Sep 20, 2020

Test build #128905 has finished for PR 29812 at commit 74cf2dd.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

viirya commented Sep 20, 2020

GitHub Actions passed.

viirya commented Sep 20, 2020

cc @cloud-fan @dongjoon-hyun @maropu

viirya commented Sep 20, 2020

retest this please

SparkQA commented Sep 20, 2020

Test build #128918 has finished for PR 29812 at commit 74cf2dd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

LGTM, cc @fqaiser94

val newNames = mutable.ArrayBuffer.empty[String]
val newValues = mutable.ArrayBuffer.empty[Expression]
names.zip(values).reverse.foreach { case (name, value) =>
  if (!newNames.contains(name)) {
Contributor

We should use the resolver here; otherwise I think we will have correctness issues.

Member Author

ok.
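
To make the concern concrete, here is a minimal sketch of a resolver-based lookup. It assumes a resolver is simply a `(String, String) => Boolean` function that decides whether two field names refer to the same field; `ResolverSketch`, `alreadySeen` and the two resolver values are hypothetical names for illustration, not Spark's own definitions.

```scala
object ResolverSketch {
  // Assumption: a resolver decides whether two names denote the same field.
  type Resolver = (String, String) => Boolean

  val caseSensitive: Resolver = (a, b) => a == b
  val caseInsensitive: Resolver = (a, b) => a.equalsIgnoreCase(b)

  // A plain `seen.contains(name)` is always case-sensitive; going through the
  // resolver makes the check follow the session's case-sensitivity setting.
  def alreadySeen(seen: Seq[String], name: String, resolver: Resolver): Boolean =
    seen.exists(resolver(_, name))
}
```

With a case-insensitive resolver, `Name` and `name` count as the same field, so a `contains`-based deduplication would keep both writes where only the last one should survive.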

def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
  case WithFields(structExpr, names, values) if names.distinct.length != names.length =>
Contributor

Could this case statement come after the next one, so that we combine the chains first before deduplicating?

Member Author

We don't run this rule just once, so the order should be fine.
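
The point about case order can be sketched with a toy rule that is re-applied until the expression stops changing. This is an illustration under stated assumptions rather than Spark's rule executor: `Struct`, `Update`, `step` and `toFixedPoint` are hypothetical, and `step` only rewrites the root node instead of transforming the whole tree.

```scala
import scala.annotation.tailrec

object FixedPointSketch {
  sealed trait Expr
  final case class Struct(name: String) extends Expr
  final case class Update(child: Expr, ops: List[(String, Int)]) extends Expr

  // Drop every write to a name that is written again later in the list.
  def dedup(ops: List[(String, Int)]): List[(String, Int)] =
    ops.zipWithIndex.collect {
      case ((n, v), i) if !ops.drop(i + 1).exists(_._1 == n) => (n, v)
    }

  // One application of the rule; the dedup case is deliberately listed first.
  def step(e: Expr): Expr = e match {
    case Update(c, ops) if ops.map(_._1).distinct.length != ops.length =>
      Update(c, dedup(ops))
    case Update(Update(c, inner), outer) =>
      Update(c, inner ++ outer) // collapse a chain of two updates into one
    case other => other
  }

  // Re-apply the rule until nothing changes.
  @tailrec
  def toFixedPoint(e: Expr): Expr = {
    val next = step(e)
    if (next == e) e else toFixedPoint(next)
  }
}
```

For `Update(Update(Struct("s"), List("a" -> 1)), List("a" -> 2))`, the first pass collapses the chain and the second pass removes the duplicate write, so the result is the same as if the cases had been listed in the other order.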

      newValues += value
    }
  }
  WithFields(structExpr, names = newNames.reverse.toSeq, valExprs = newValues.reverse.toSeq)
Contributor

For my understanding, can you explain how we expect to benefit from this optimization?

I ask because we do this kind of deduplication inside of WithFields already as part of the foldLeft operation here. It will only keep the last valExpr for each name. So I think the optimized logical plan will be the same with or without this optimization in all scenarios? CMIIW

Member Author

viirya commented Sep 22, 2020

You are right; the result is eventually the same. But in some cases, before we expand WithFields, the expression tree can be very complex. This came out of improving the scalability of #29587: I applied it while fixing the scalability issue and found it useful for reducing the complexity of the WithFields expression tree.

I will run these rules in #29587 to simplify expression tree before entering optimizer.

Contributor

fqaiser94 commented Sep 23, 2020

Okay, so I took a look at the PR you linked and left a related comment there. I don't think you actually need this optimization for #29587

This optimization is only useful if someone uses WithFields to update the same field multiple times. However, it would simply be better to not update the same field multiple times. At the very least, we should not do this when we re-use this Expression internally within Spark.

Unfortunately, "bad" end-users might still update the same field multiple times. Assuming we should optimize for such users (not sure), since this batch is only applied half-way through the optimization cycle anyway, I think we could just move up the Batch("ReplaceWithFieldsExpression", Once, ReplaceWithFieldsExpression) to get the same benefit (which is just simplified tree). What do you reckon?

Member Author

viirya commented Sep 23, 2020

Actually, I'd like to run these rules to simplify the WithFields tree early, in the analysis stage. While fixing the scalability issue in #29587, I realized that it is very easy to write a bad WithFields tree. Once you hit that, it is very hard to debug, and the analyzer and optimizer spend a lot of time traversing the expression tree. So I think it is very useful to keep this rule to simplify the expression tree, but I don't think we want to run ReplaceWithFieldsExpression in the analysis stage.

Contributor

ahh I see, yes, in the analysis stage this would likely be helpful!

Okay in that case, could this PR wait till #29795 goes in? I'm refactoring WithFields so this optimization would need to change anyway.

Member Author

I'm fine to wait until #29795.

SparkQA commented Sep 22, 2020

Test build #128955 has finished for PR 29812 at commit 00acff9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val newNames = mutable.ArrayBuffer.empty[String]
val newValues = mutable.ArrayBuffer.empty[Expression]
names.zip(values).reverse.foreach { case (name, value) =>
  if (newNames.find(resolver(_, name)).isEmpty) {
Contributor

This is a bit inefficient. Shall we build a set of lowercased names when case sensitivity is off?

Member Author

Added a set for case-sensitive case.
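
The set-based lookup discussed in this thread could look roughly like the sketch below. It is not the code from this PR: field values are plain `Int`s instead of Catalyst expressions, and `NameSetSketch`, `dedupKeepLast` and `normalize` are hypothetical names. The idea is to replace the linear `find(resolver(_, name))` scan with a hash-set lookup on a normalized key.

```scala
import java.util.Locale
import scala.collection.mutable

object NameSetSketch {
  // Keeps the last value written to each name. Lookups go through a hash set
  // of normalized names instead of scanning the kept names with a resolver.
  def dedupKeepLast(ops: Seq[(String, Int)], caseSensitive: Boolean): Seq[(String, Int)] = {
    def normalize(name: String): String =
      if (caseSensitive) name else name.toLowerCase(Locale.ROOT)

    val seen = mutable.HashSet.empty[String]
    val kept = mutable.ArrayBuffer.empty[(String, Int)]
    ops.reverseIterator.foreach { case (name, value) =>
      // HashSet.add returns true only the first time a key is inserted.
      if (seen.add(normalize(name))) kept += ((name, value))
    }
    kept.reverse.toSeq
  }
}
```

Each lookup is then expected constant time, so deduplicating n operations costs roughly O(n) instead of O(n²).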

SparkQA commented Sep 23, 2020

Test build #128996 has finished for PR 29812 at commit cb8872c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Oct 18, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34570/

SparkQA commented Oct 18, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34570/

viirya changed the title from "[SPARK-32941][SQL] Optimize WithFields expression chain" to "[SPARK-32941][SQL] Optimize UpdateFields expression chain" on Oct 18, 2020
viirya changed the title from "[SPARK-32941][SQL] Optimize UpdateFields expression chain" to "[SPARK-32941][SQL] Optimize UpdateFields expression chain and put the rule early in Analysis phase" on Oct 18, 2020

val optimizeUpdateFields: PartialFunction[Expression, Expression] = {
  case UpdateFields(structExpr, fieldOps)
    if fieldOps.forall(_.isInstanceOf[WithField]) &&
      fieldOps.map(_.asInstanceOf[WithField].name.toLowerCase(Locale.ROOT)).distinct.length !=
Member

In case-sensitive mode, this seems to allow unnecessary computation. Can we improve this if condition to handle the case-sensitive and case-insensitive modes together?

Member Author

The if condition should cover both case-sensitive and case-insensitive cases now. I compare names in lowercase in the condition.

Member

dongjoon-hyun commented Oct 18, 2020

No, what I meant is that we don't need to execute line 39~69 at all.

Member

For example, for case-sensitive case, fieldOps.map(_.asInstanceOf[WithField].name).distinct.length != fieldOps.length should be used.

Member Author

I see. Updated. Thanks.
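
The guard requested in this thread can be sketched as a small helper that picks the comparison key from the case-sensitivity mode. This is an illustration only: `GuardSketch` and `hasDuplicates` are hypothetical, and names are plain strings rather than `WithField` operations.

```scala
import java.util.Locale

object GuardSketch {
  // True only when at least two names collide under the given mode, so the
  // deduplication branch is entered only when there is something to remove.
  def hasDuplicates(names: Seq[String], caseSensitive: Boolean): Boolean = {
    val keys = if (caseSensitive) names else names.map(_.toLowerCase(Locale.ROOT))
    keys.distinct.length != keys.length
  }
}
```

With a lowercase-only comparison, `Seq("a", "A")` would always enter the deduplication branch; in case-sensitive mode those are two distinct fields, so that work is wasted.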

Member

dongjoon-hyun left a comment

+1, LGTM (except one comment). Could you consider it, @viirya ?

SparkQA commented Oct 18, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34574/

SparkQA commented Oct 18, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34574/

SparkQA commented Oct 18, 2020

Test build #129964 has finished for PR 29812 at commit 82ad8c8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Oct 19, 2020

Test build #129967 has finished for PR 29812 at commit 38bdefd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Oct 19, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34580/

SparkQA commented Oct 19, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34580/

Member

dongjoon-hyun left a comment

Thank you for updates.

SparkQA commented Oct 19, 2020

Test build #129972 has finished for PR 29812 at commit f41900c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

Merged to master.

@HyukjinKwon
Member

@viirya, BTW, do you mind fixing the PR description to explain what this PR specifically improves?

This patch proposes to add more optimization to UpdateFields expression chain.

It seems this PR description does not say exactly what it optimizes. Is my understanding correct that this PR proposes two separate optimizations?

  • Deduplicates WithField at UpdateFields
  • Respect nullability in input struct at GetStructField(UpdateFields(..., struct)), and unwrap if-else.

val newValues = mutable.ArrayBuffer.empty[Expression]

if (caseSensitive) {
  names.zip(values).reverse.foreach { case (name, value) =>
Member

I wonder if we could just do something like collection.immutable.ListMap(names.zip(values): _*), which would let the last value win and keep the order of fields for later use. But I guess it's no big deal. Just saying.

viirya commented Oct 20, 2020

@HyukjinKwon Thanks for the suggestion. I updated this PR description.

maropu commented Oct 21, 2020

(late LGTM)

viirya commented Oct 21, 2020

Thanks @maropu

viirya pushed a commit that referenced this pull request Apr 26, 2021
…ined withField operations

### What changes were proposed in this pull request?

Modifies the UpdateFields optimizer to fix correctness issues with certain nested and chained withField operations. Examples for recreating the issue are in the new unit tests as well as the JIRA issue.

### Why are the changes needed?

Certain withField patterns can cause Exceptions or even incorrect results. It appears to be a result of the additional UpdateFields optimization added in #29812. It traverses fieldOps in reverse order to take the last one per field, but this can cause nested structs to change order which leads to mismatches between the schema and the actual data. This updates the optimization to maintain the initial ordering of nested structs to match the generated schema.

### Does this PR introduce _any_ user-facing change?

It fixes exceptions and incorrect results for valid uses in the latest Spark release.

### How was this patch tested?

Added new unit tests for these edge cases.

Closes #32338 from Kimahriman/bug/optimize-with-fields.

Authored-by: Adam Binford <adamq43@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
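
The ordering problem described in this commit message can be reproduced on a toy model. The sketch below is not the code from #32338; it assumes field operations are plain `(name, value)` pairs, and `FieldOrderSketch`, `keepLastByReverse` and `keepLastInPlace` are hypothetical names. It contrasts a reverse-traversal deduplication with one that keeps each field in the slot of its first write.

```scala
import scala.collection.mutable

object FieldOrderSketch {
  type Op = (String, Int)

  // Reverse traversal: the surviving write sits where the LAST write was, so
  // a field that is written twice can move behind fields written in between.
  def keepLastByReverse(ops: Seq[Op]): Seq[Op] =
    ops.reverse.foldLeft(List.empty[Op]) { (kept, op) =>
      if (kept.exists(_._1 == op._1)) kept else op :: kept
    }

  // Forward traversal: each name keeps the slot of its FIRST write and takes
  // the value of its last write (LinkedHashMap preserves insertion order).
  def keepLastInPlace(ops: Seq[Op]): Seq[Op] = {
    val byName = mutable.LinkedHashMap.empty[String, Int]
    ops.foreach { case (name, value) => byName(name) = value }
    byName.toSeq
  }
}
```

For the writes `a=1, b=2, a=3`, the reverse variant yields the order `b, a`, while the in-place variant yields `a, b`, which matches the order in which a schema built from the original chain would list the fields.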
viirya pushed a commit that referenced this pull request Apr 26, 2021
…ined withField operations

(cherry picked from commit 74afc68)
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
flyrain pushed a commit to flyrain/spark that referenced this pull request Sep 21, 2021
…ined withField operations

(cherry picked from commit 74afc68)
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
fishcus pushed a commit to fishcus/spark that referenced this pull request Jan 12, 2022
…ined withField operations

(cherry picked from commit 74afc68)
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
@viirya viirya deleted the SPARK-32941 branch December 27, 2023 18:28