[SPARK-32948][SQL] Optimize to_json and from_json expression chain #29828

viirya · 2020-09-21T20:14:42Z

What changes were proposed in this pull request?

This patch proposes to optimize the from_json + to_json expression chain.

Why are the changes needed?

To optimize JSON expression chains that may be written manually or generated automatically during query optimization.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unit test.

tanelk · 2020-09-21T20:58:24Z

I believe you didn't add the rule to the Optimizer, only to the one in UT.
Also what happens if the input string is not a valid JSON?

viirya · 2020-09-21T21:11:36Z

> I believe you didn't add the rule to the Optimizer, only to the one in UT.
> Also what happens if the input string is not a valid JSON?

Oops, forgot to do it.

Thanks. That's a good point. ~~So I think this is currently limited to PermissiveMode only.~~ Even for PermissiveMode only, the optimization still seems to change the evaluation result. It seems this is only valid for from_json + to_json.
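
The asymmetry raised here can be illustrated with a toy model. The sketch below is not Spark code: `Rec`, `fromJsonToy` and `toJsonToy` are hypothetical stand-ins that only understand an object with a single integer field `a`, and `None` models a parse that fails. Under those assumptions, parsing what was just serialized always gives back the original value, while serializing what was just parsed does not always give back the original string.

```scala
object ToyJsonRoundTrip {
  // Hypothetical record type; stands in for a struct column value.
  final case class Rec(a: Int)

  // Accepts only {"a": <int>} with optional whitespace. Regex extractors
  // in a pattern match must match the whole input string.
  private val Pattern = """\s*\{\s*"a"\s*:\s*(-?\d{1,9})\s*\}\s*""".r

  // Toy "from_json": None models a malformed or unsupported input.
  def fromJsonToy(s: String): Option[Rec] = s match {
    case Pattern(n) => Some(Rec(n.toInt))
    case _          => None
  }

  // Toy "to_json": always emits the compact canonical form.
  def toJsonToy(r: Option[Rec]): Option[String] =
    r.map(x => s"""{"a":${x.a}}""")

  // from_json(to_json(v)) == v holds for every value v in this model.
  def structRoundTrip(v: Rec): Option[Rec] =
    toJsonToy(Some(v)).flatMap(fromJsonToy)

  // to_json(from_json(s)) may differ from s: whitespace is normalized
  // and malformed input is lost.
  def stringRoundTrip(s: String): Option[String] =
    toJsonToy(fromJsonToy(s))
}
```

In this model `structRoundTrip(Rec(7))` returns `Some(Rec(7))`, whereas `stringRoundTrip("""{ "a" : 1 }""")` returns the reformatted `Some("""{"a":1}""")` and `stringRoundTrip("not json")` returns `None`. That is why only the chain where parsing is applied to the output of serialization is a candidate for removal.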

SparkQA · 2020-09-22T00:42:38Z

Test build #128949 has finished for PR 29828 at commit cc8ab4d.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2020-09-22T01:18:13Z

cc @maropu @cloud-fan @dongjoon-hyun

SparkQA · 2020-09-22T01:43:23Z

Test build #128950 has finished for PR 29828 at commit a82fb7f.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-09-22T01:50:17Z

Test build #128951 has finished for PR 29828 at commit b9472bc.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2020-09-22T05:40:32Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Json.scala

+/**
+ * Simplify redundant json related expressions.
+ */
+object OptimizeJsonExprs extends Rule[LogicalPlan] {


Json.scala looks too broad for this single optimizer. Please rename this file.

Renamed. Thanks.

dongjoon-hyun · 2020-09-22T05:49:49Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Json.scala

+object OptimizeJsonExprs extends Rule[LogicalPlan] {
+  override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
+    case p => p.transformExpressionsUp {
+      case JsonToStructs(_, options1, StructsToJson(options2, child, timeZoneId2), timeZoneId1)


We have a general rule RemoveNoopOperators for operators. I'm wondering if we can make a general rule RemoveNoopExpression for expressions.

Hmm, we already have SimplifyExtractValueOps, which is somewhat close to what you meant. BTW, I will work on another PR later to add column pruning of JsonToStructs to OptimizeJsonExprs, so it does not seem to match RemoveNoopExpression exactly.

SparkQA · 2020-09-22T07:05:02Z

Test build #128967 has finished for PR 29828 at commit 6c3b15d.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2020-09-22T07:07:55Z

retest this please

SparkQA · 2020-09-22T11:50:59Z

Test build #128969 has finished for PR 29828 at commit 6c3b15d.

This patch fails SparkR unit tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2020-09-22T11:54:05Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeJsonExprs.scala

+  override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
+    case p => p.transformExpressionsUp {
+      case JsonToStructs(_, options1, StructsToJson(options2, child, timeZoneId2), timeZoneId1)
+          if options1.isEmpty && options2.isEmpty && timeZoneId1 == timeZoneId2 =>


How about options1 == options2? We may need to look at the JSON options and see whether each option is symmetric between read and write.

Yeah, we may look at that. Currently the safest approach is to require that both options are empty, or that they are equal.

cloud-fan · 2020-09-22T11:54:28Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeJsonExprs.scala

+/**
+ * Simplify redundant json related expressions.
+ */
+object OptimizeJsonExprs extends Rule[LogicalPlan] {


Shall we optimize CSV in this rule as well?

Yes, I think so. We could do it in another PR.

SparkQA · 2020-09-22T22:31:54Z

Test build #128989 has finished for PR 29828 at commit fec7357.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

maropu

Looks okay otherwise.

SparkQA · 2020-09-23T06:52:29Z

Test build #129003 has finished for PR 29828 at commit 5c84f65.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-09-24T07:05:01Z

Test build #129059 has finished for PR 29828 at commit 227c126.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-09-24T07:05:02Z

Test build #129066 has finished for PR 29828 at commit b537f56.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2020-09-24T15:43:20Z

retest this please

dongjoon-hyun · 2020-09-24T16:04:25Z

sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/JsonSuite.scala

+import org.apache.spark.sql.catalyst.util.DateTimeUtils.getZoneId
+import org.apache.spark.sql.types._
+
+class JsonSuite extends PlanTest with ExpressionEvalHelper {


Please use OptimizeJsonExprsSuite instead of JsonSuite.

dongjoon-hyun · 2020-09-24T16:23:34Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeJsonExprs.scala

+      case JsonToStructs(schema, options1,
+        StructsToJson(options2, child, timeZoneId2), timeZoneId1)
+          if options1.isEmpty && options2.isEmpty && timeZoneId1 == timeZoneId2 &&
+            schema == child.dataType =>


Do we need to have a test case for case-sensitivity? Technically, the schema of JsonToStructs is a user-given value, isn't it? So, in the default mode (case-insensitive), users might expect something different. Let's make sure with test cases.

OK, the safest approach is the current one: only optimize the expressions when the schema matches exactly. Technically, in case-insensitive mode, two schemas are considered the same if letter case is the only difference. But I think that if the user explicitly gives a schema, we should respect it, right? I can add a test case for this.

Added test.
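
The guard discussed in this thread can be sketched with a simplified model. Everything inside `ToyOptimizeJsonExprs` is a hypothetical stand-in for the Catalyst classes quoted above, not the real API; the shape of the pattern match follows the quoted diff, and the nullability check that a later commit in this PR adds is not modeled.

```scala
object ToyOptimizeJsonExprs {
  // Minimal stand-ins for data types; field names compare case-sensitively.
  sealed trait DataType
  case object StringT extends DataType
  case object IntT extends DataType
  final case class StructT(fields: List[(String, DataType)]) extends DataType

  // Minimal stand-ins for expressions.
  sealed trait Expr { def dataType: DataType }
  final case class Col(name: String, dataType: DataType) extends Expr
  final case class StructsToJson(options: Map[String, String], child: Expr,
      timeZoneId: Option[String]) extends Expr {
    val dataType: DataType = StringT
  }
  final case class JsonToStructs(schema: DataType, options: Map[String, String],
      child: Expr, timeZoneId: Option[String]) extends Expr {
    val dataType: DataType = schema
  }

  // Bottom-up rewrite: children first, then the node itself.
  def optimize(e: Expr): Expr = {
    val rewritten = e match {
      case j: JsonToStructs => j.copy(child = optimize(j.child))
      case s: StructsToJson => s.copy(child = optimize(s.child))
      case c: Col           => c
    }
    rewritten match {
      // Drop from_json(to_json(child)) only when no options are set, the
      // time zones agree, and the requested schema equals the child's type.
      case JsonToStructs(schema, o1, StructsToJson(o2, child, tz2), tz1)
          if o1.isEmpty && o2.isEmpty && tz1 == tz2 && schema == child.dataType =>
        child
      case other => other
    }
  }
}
```

With a column `c` of type `StructT(List("a" -> IntT))`, the chain collapses to `c` when the requested schema is identical, and is left untouched when the schema names the field `A` instead, which mirrors the decision to respect an explicitly given schema even when only letter case differs.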

SparkQA · 2020-09-24T19:12:18Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/33704/

SparkQA · 2020-09-24T19:29:57Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/33704/

SparkQA · 2020-09-24T20:49:07Z

Test build #129080 has finished for PR 29828 at commit b537f56.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-09-24T22:56:24Z

Test build #129083 has finished for PR 29828 at commit 28f67e9.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2020-09-28T16:25:30Z

Any more comments or corner cases this misses? @dongjoon-hyun @HyukjinKwon @maropu?

HyukjinKwon · 2020-09-29T00:49:05Z

Looks okay to me too

SparkQA · 2020-09-29T02:45:36Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/33815/

SparkQA · 2020-09-29T02:56:15Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/33817/

SparkQA · 2020-09-29T03:04:22Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/33815/

SparkQA · 2020-09-29T03:20:36Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/33817/

dongjoon-hyun

+1, LGTM. Thank you, @viirya . Merged to master for Apache Spark 3.1.0 on December 2020.
The last commits are comments-only changes.
Sorry for the delay, @viirya .

SparkQA · 2020-09-29T06:43:54Z

Test build #129201 has finished for PR 29828 at commit 6ef08c2.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-09-29T06:44:09Z

Test build #129200 has finished for PR 29828 at commit 43a4620.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

### What changes were proposed in this pull request?

This patch proposes to optimize from_json + to_json expression chain.

### Why are the changes needed?

To optimize json expression chain that could be manually generated or generated automatically during query optimization.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit test.

Closes apache#29828 from viirya/SPARK-32948.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>

probot-autolabeler bot added the SQL label Sep 21, 2020

viirya changed the title ~~[SPARK-32948][]SQL] Optimize to_json and from_json expression chain~~ [SPARK-32948][SQL] Optimize to_json and from_json expression chain Sep 21, 2020

Add json optimization rule.

a82fb7f

viirya force-pushed the SPARK-32948 branch from cc8ab4d to a82fb7f September 21, 2020 21:14

Remove to_json + from_json.

b9472bc

dongjoon-hyun reviewed Sep 22, 2020

Rename Json to OptimizeJsonExprs.

6c3b15d

cloud-fan reviewed Sep 22, 2020

cloud-fan reviewed Sep 22, 2020

MaxGekk reviewed Sep 22, 2020

For comment.

fec7357

viirya force-pushed the SPARK-32948 branch from ce1720a to fec7357 September 22, 2020 17:15

maropu reviewed Sep 23, 2020

For comment.

5c84f65

maropu approved these changes Sep 23, 2020

HyukjinKwon reviewed Sep 23, 2020

Check nullability.

b537f56

dongjoon-hyun reviewed Sep 24, 2020

dongjoon-hyun reviewed Sep 24, 2020

For review comment.

28f67e9

maropu approved these changes Sep 28, 2020

HyukjinKwon reviewed Sep 29, 2020

viirya and others added 2 commits September 28, 2020 18:54

Update sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeJsonExprs.scala

Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>

43a4620

Remove unused variable.

6ef08c2

viirya added 2 commits September 28, 2020 21:59

Add some comment.

4f76674

Rewording.

dde0126

dongjoon-hyun approved these changes Sep 29, 2020

dongjoon-hyun closed this in 202115e Sep 29, 2020

viirya deleted the SPARK-32948 branch December 27, 2023 18:28
