New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-32948][SQL] Optimize to_json and from_json expression chain #29828
Conversation
I believe you didn't add the rule to the Optimizer, only to the one in UT. |
Oops, forgot to do it. Thanks. That's good point. |
Test build #128949 has finished for PR 29828 at commit
|
Test build #128950 has finished for PR 29828 at commit
|
Test build #128951 has finished for PR 29828 at commit
|
/** | ||
* Simplify redundant json related expressions. | ||
*/ | ||
object OptimizeJsonExprs extends Rule[LogicalPlan] { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Json.scala
looks too broad for this single optimizer. Please rename this file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Renamed. Thanks.
object OptimizeJsonExprs extends Rule[LogicalPlan] { | ||
override def apply(plan: LogicalPlan): LogicalPlan = plan transform { | ||
case p => p.transformExpressionsUp { | ||
case JsonToStructs(_, options1, StructsToJson(options2, child, timeZoneId2), timeZoneId1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have a general rule RemoveNoopOperators
for operators. I'm wondering if we can make a general rule RemoveNoopExpression
for expressions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, we already have SimplifyExtractValueOps
, which is somehow close to you meant. BTW, I will later work on another PR to add column pruning of JsonToStructs
to OptimizeJsonExprs
. It sounds not exactly matching RemoveNoopExpression
, seems to me.
Test build #128967 has finished for PR 29828 at commit
|
retest this please |
Test build #128969 has finished for PR 29828 at commit
|
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/JsonSuite.scala
Outdated
Show resolved
Hide resolved
override def apply(plan: LogicalPlan): LogicalPlan = plan transform { | ||
case p => p.transformExpressionsUp { | ||
case JsonToStructs(_, options1, StructsToJson(options2, child, timeZoneId2), timeZoneId1) | ||
if options1.isEmpty && options2.isEmpty && timeZoneId1 == timeZoneId2 => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how about options1 == options2
? we may need to look at the json options and see whether the same option is symmetrical in read and write.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, we may look at that. Currently the safest way is two options both are empty, or they are equal.
/** | ||
* Simplify redundant json related expressions. | ||
*/ | ||
object OptimizeJsonExprs extends Rule[LogicalPlan] { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shall we optimize csv in this rule as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think so. We could do it in other PR.
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeJsonExprs.scala
Outdated
Show resolved
Hide resolved
This comment has been minimized.
This comment has been minimized.
Test build #128989 has finished for PR 29828 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks okay otherwise.
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeJsonExprs.scala
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeJsonExprs.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/JsonSuite.scala
Outdated
Show resolved
Hide resolved
Test build #129003 has finished for PR 29828 at commit
|
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeJsonExprs.scala
Outdated
Show resolved
Hide resolved
Test build #129059 has finished for PR 29828 at commit
|
Test build #129066 has finished for PR 29828 at commit
|
retest this please |
import org.apache.spark.sql.catalyst.util.DateTimeUtils.getZoneId | ||
import org.apache.spark.sql.types._ | ||
|
||
class JsonSuite extends PlanTest with ExpressionEvalHelper { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use OptimizeJsonExprsSuite
instead of JsonSuite
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok.
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeJsonExprs.scala
Show resolved
Hide resolved
case JsonToStructs(schema, options1, | ||
StructsToJson(options2, child, timeZoneId2), timeZoneId1) | ||
if options1.isEmpty && options2.isEmpty && timeZoneId1 == timeZoneId2 && | ||
schema == child.dataType => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to have a test case for case-sensitivity? Technically, schema
of JsonToStructs is a user-given value, isn't it? So, in the default mode(case-insensitive), the users might have a different idea. Let's make it sure with test cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, the safest is like currently, only optimize the exprs when schema is exactly matched. Technically, under case-insensitive, two schema is considered the same if letter case is the only difference. But I am thinking if user gives explicitly a schema, we should respect it, right? I can add one test case for the case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks.
Kubernetes integration test starting |
Kubernetes integration test status success |
Test build #129080 has finished for PR 29828 at commit
|
Test build #129083 has finished for PR 29828 at commit
|
Any more comments or corner cases this misses? @dongjoon-hyun @HyukjinKwon @maropu? |
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeJsonExprs.scala
Outdated
Show resolved
Hide resolved
Looks okay to me too |
…mizer/OptimizeJsonExprs.scala Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Kubernetes integration test starting |
Kubernetes integration test starting |
Kubernetes integration test status success |
Kubernetes integration test status success |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Test build #129201 has finished for PR 29828 at commit
|
Test build #129200 has finished for PR 29828 at commit
|
### What changes were proposed in this pull request? This patch proposes to optimize from_json + to_json expression chain. ### Why are the changes needed? To optimize json expression chain that could be manually generated or generated automatically during query optimization. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit test. Closes apache#29828 from viirya/SPARK-32948. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
What changes were proposed in this pull request?
This patch proposes to optimize from_json + to_json expression chain.
Why are the changes needed?
To optimize json expression chain that could be manually generated or generated automatically during query optimization.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Unit test.