[SPARK-32948][SQL] Optimize to_json and from_json expression chain #29828

Closed
wants to merge 14 commits into from

viirya
Member

@viirya viirya commented Sep 21, 2020

What changes were proposed in this pull request?

This patch proposes to optimize the from_json + to_json expression chain.

Why are the changes needed?

To optimize JSON expression chains that may be written manually or generated automatically during query optimization.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unit test.

@viirya viirya changed the title [SPARK-32948][]SQL] Optimize to_json and from_json expression chain [SPARK-32948][SQL] Optimize to_json and from_json expression chain Sep 21, 2020
@tanelk
Contributor

tanelk commented Sep 21, 2020

I believe you didn't add the rule to the Optimizer, only to the one in the unit test.
Also, what happens if the input string is not valid JSON?

@viirya
Member Author

viirya commented Sep 21, 2020

> I believe you didn't add the rule to the Optimizer, only to the one in the unit test.
> Also, what happens if the input string is not valid JSON?

Oops, forgot to do it.

Thanks, that's a good point. I think the optimization is currently limited to PermissiveMode. Even under PermissiveMode, though, it still seems to change the evaluation result. It seems this is only valid for from_json + to_json.
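A minimal sketch of why the direction of the chain matters, using toy stand-ins rather than Spark's own functions. `Rec`, `RoundTrip.toJson` and `RoundTrip.fromJson` are hypothetical names invented for this illustration, and the permissive behaviour is modelled simply as "malformed input yields no value", which is an assumption and not Spark's exact semantics. Parsing the output of a serializer always recovers the original value, while serializing the result of parsing an arbitrary string does not.

```scala
// Toy model of a one-field struct; not a Spark type.
final case class Rec(a: Int)

object RoundTrip {
  // Matches only strings of the exact form {"a":<integer>}.
  private val Pattern = """\{"a":(-?\d+)\}""".r

  // Toy to_json: always produces well-formed output.
  def toJson(r: Rec): String = s"""{"a":${r.a}}"""

  // Toy from_json in a permissive style: malformed input yields None
  // instead of throwing.
  def fromJson(s: String): Option[Rec] = s match {
    case Pattern(n) => scala.util.Try(n.toInt).toOption.map(Rec(_))
    case _          => None
  }
}
```

In this model, `fromJson(toJson(r))` always gives back `r`, so that pair can be dropped. The reverse pair cannot, because an invalid input string is lost by the parse step and never reproduced.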

@SparkQA

SparkQA commented Sep 22, 2020

Test build #128949 has finished for PR 29828 at commit cc8ab4d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member Author

viirya commented Sep 22, 2020

cc @maropu @cloud-fan @dongjoon-hyun

@SparkQA

SparkQA commented Sep 22, 2020

Test build #128950 has finished for PR 29828 at commit a82fb7f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 22, 2020

Test build #128951 has finished for PR 29828 at commit b9472bc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

/**
 * Simplify redundant json related expressions.
 */
object OptimizeJsonExprs extends Rule[LogicalPlan] {
Member

Json.scala looks too broad for this single optimizer. Please rename this file.

Member Author

Renamed. Thanks.

object OptimizeJsonExprs extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case p => p.transformExpressionsUp {
      case JsonToStructs(_, options1, StructsToJson(options2, child, timeZoneId2), timeZoneId1)
Member

We have a general rule RemoveNoopOperators for operators. I'm wondering if we can make a general rule RemoveNoopExpression for expressions.

Member Author

Hmm, we already have SimplifyExtractValueOps, which is somewhat close to what you mean. BTW, I will work on another PR later to add column pruning of JsonToStructs to OptimizeJsonExprs. That does not seem to fit RemoveNoopExpression exactly.

@SparkQA

SparkQA commented Sep 22, 2020

Test build #128967 has finished for PR 29828 at commit 6c3b15d.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member Author

viirya commented Sep 22, 2020

retest this please

@SparkQA

SparkQA commented Sep 22, 2020

Test build #128969 has finished for PR 29828 at commit 6c3b15d.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

  override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case p => p.transformExpressionsUp {
      case JsonToStructs(_, options1, StructsToJson(options2, child, timeZoneId2), timeZoneId1)
          if options1.isEmpty && options2.isEmpty && timeZoneId1 == timeZoneId2 =>
Contributor

How about options1 == options2? We may need to look at the JSON options and see whether the same option is symmetrical in read and write.

Member Author

Yeah, we may look at that. Currently the safest approach is to require that both options are empty, or that they are equal.
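To make the guard under discussion concrete, here is a small sketch on a toy expression tree. The case classes `Col`, `ToJson` and `FromJson` and the object `ToyOptimizeJsonExprs` are hypothetical and only mimic the shape of the excerpt above; they are not Catalyst's `StructsToJson` / `JsonToStructs`, and the schema argument is left out for brevity. The rewrite fires only when both option maps are empty and the time zones agree, which is the conservative variant.

```scala
// Toy expression tree; these classes only imitate the shape of the
// Catalyst expressions and are not the Spark classes.
sealed trait Expr
final case class Col(name: String) extends Expr
final case class ToJson(options: Map[String, String], child: Expr,
                        timeZoneId: Option[String]) extends Expr
final case class FromJson(options: Map[String, String], child: Expr,
                          timeZoneId: Option[String]) extends Expr

object ToyOptimizeJsonExprs {
  // Bottom-up rewrite: simplify the child first, then try to collapse
  // FromJson(ToJson(x)) into x when the guard holds.
  def optimize(e: Expr): Expr = e match {
    case FromJson(o1, child, tz1) =>
      optimize(child) match {
        case ToJson(o2, inner, tz2)
            if o1.isEmpty && o2.isEmpty && tz1 == tz2 => inner
        case other => FromJson(o1, other, tz1)
      }
    case ToJson(o, child, tz) => ToJson(o, optimize(child), tz)
    case c: Col               => c
  }
}
```

Relaxing the guard to `o1 == o2` would only be safe if every option behaves symmetrically on the read and write paths, which is the open question raised in this thread.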

/**
 * Simplify redundant json related expressions.
 */
object OptimizeJsonExprs extends Rule[LogicalPlan] {
Contributor

Shall we optimize CSV in this rule as well?

Member Author

Yes, I think so. We could do it in another PR.

@SparkQA

SparkQA commented Sep 22, 2020

Test build #128989 has finished for PR 29828 at commit fec7357.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@maropu maropu left a comment

Looks okay otherwise.

@SparkQA

SparkQA commented Sep 23, 2020

Test build #129003 has finished for PR 29828 at commit 5c84f65.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 24, 2020

Test build #129059 has finished for PR 29828 at commit 227c126.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 24, 2020

Test build #129066 has finished for PR 29828 at commit b537f56.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member Author

viirya commented Sep 24, 2020

retest this please

import org.apache.spark.sql.catalyst.util.DateTimeUtils.getZoneId
import org.apache.spark.sql.types._

class JsonSuite extends PlanTest with ExpressionEvalHelper {
Member

Please use OptimizeJsonExprsSuite instead of JsonSuite.

Member Author

ok.

case JsonToStructs(schema, options1,
    StructsToJson(options2, child, timeZoneId2), timeZoneId1)
    if options1.isEmpty && options2.isEmpty && timeZoneId1 == timeZoneId2 &&
      schema == child.dataType =>
Member

Do we need a test case for case sensitivity? Technically, the schema of JsonToStructs is a user-given value, isn't it? So, in the default mode (case-insensitive), users might have a different expectation. Let's make sure with test cases.

Member Author

@viirya viirya Sep 24, 2020

Ok, the safest approach is the current one: only optimize the expressions when the schema matches exactly. Technically, under case-insensitive mode, two schemas are considered the same if letter case is the only difference. But I think that if the user explicitly gives a schema, we should respect it, right? I can add a test case for this.

Member Author

Added test.

Member

Thanks.
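The exact-match guard discussed in this thread can be illustrated with a toy schema model. `Field`, `Schema` and `SchemaGuard` are hypothetical names for this sketch and are not Spark's `StructType` API. The point is only that plain equality treats `A` and `a` as different fields, so a user-supplied schema that differs in letter case blocks the rewrite.

```scala
// Toy schema types for illustration; not Spark's StructType / StructField.
final case class Field(name: String, dataType: String)
final case class Schema(fields: List[Field])

object SchemaGuard {
  // Exact comparison: field names must match including letter case.
  // This mirrors the conservative `schema == child.dataType` check.
  def canEliminate(userSchema: Schema, childType: Schema): Boolean =
    userSchema == childType

  // A case-insensitive comparison, shown only for contrast.
  def equalsIgnoringCase(a: Schema, b: Schema): Boolean =
    a.fields.length == b.fields.length &&
      a.fields.zip(b.fields).forall { case (x, y) =>
        x.name.equalsIgnoreCase(y.name) && x.dataType == y.dataType
      }
}
```

With a child type of `Schema(List(Field("a", "int")))` and a user schema of `Schema(List(Field("A", "int")))`, `canEliminate` is false even though `equalsIgnoringCase` is true, so the user-given schema is respected.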

@SparkQA

SparkQA commented Sep 24, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/33704/

@SparkQA

SparkQA commented Sep 24, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/33704/

@SparkQA

SparkQA commented Sep 24, 2020

Test build #129080 has finished for PR 29828 at commit b537f56.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 24, 2020

Test build #129083 has finished for PR 29828 at commit 28f67e9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member Author

viirya commented Sep 28, 2020

Any more comments or corner cases this misses? @dongjoon-hyun @HyukjinKwon @maropu?

@HyukjinKwon
Member

Looks okay to me too

viirya and others added 2 commits September 28, 2020 18:54
…mizer/OptimizeJsonExprs.scala

Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
@SparkQA

SparkQA commented Sep 29, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/33815/

@SparkQA

SparkQA commented Sep 29, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/33817/

@SparkQA

SparkQA commented Sep 29, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/33815/

@SparkQA

SparkQA commented Sep 29, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/33817/

Member

@dongjoon-hyun dongjoon-hyun left a comment

+1, LGTM. Thank you, @viirya. Merged to master for Apache Spark 3.1.0 in December 2020.
The last commits are comment-only changes.
Sorry for the delay, @viirya.

@SparkQA

SparkQA commented Sep 29, 2020

Test build #129201 has finished for PR 29828 at commit 6ef08c2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 29, 2020

Test build #129200 has finished for PR 29828 at commit 43a4620.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

holdenk pushed a commit to holdenk/spark that referenced this pull request Oct 27, 2020
Closes apache#29828 from viirya/SPARK-32948.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
@viirya viirya deleted the SPARK-32948 branch December 27, 2023 18:28