[SPARK-44700][SQL] Rule OptimizeCsvJsonExprs should not be applied to expression like from_json(regexp_replace) #42376
Gentle ping @viirya, could you help me review it? Also cc @cloud-fan
cc @wangyum do you have any ideas? It seems any optimization that changes the expression shape may break common subexpression elimination (CSE). It's hard to come up with a good cost model to fix it. I think a better idea is to make CSE a plan-level optimization, so that we can find all common subexpressions before optimizing expressions, but that is hard to do. @monkeyboy123 Is it possible to rewrite your query to use a subquery alias or CTE to hold the expression result, to avoid repeated execution? Or you can disable this optimization by setting
@cloud-fan I can disable this optimization by setting
@monkeyboy123 Have you enabled
Seq("""{"a":1, "b":0.8}""").toDF("s").write.saveAsTable("t")
val df = sql(
"""
|SELECT j.*
|FROM (SELECT from_json(regexp_replace(s, 'a', 'new_a'), 'new_a INT, b DOUBLE') AS j
| FROM t) tmp
|""".stripMargin)
df.explain(true)
@wangyum Actually, I encountered this problem in Spark 3.1.1, but I can check on Spark 3.4.x.
It seems this happens in Spark 3.1.1; it has been fixed in Spark 3.4.x.
Thanks @monkeyboy123. Please upgrade your Spark to the latest version.
I think we can reuse the result of identical regexp_replace calls, for example in:
SELECT from_json(regexp_replace(s, 'a', 'x'), 'x INT, b DOUBLE').x,
       from_json(regexp_replace(s, 'a', 'x'), 'x INT, b DOUBLE').b
FROM values('{"a":1, "b":0.8}') t(s)
Filed another PR for this purpose: #42450
What changes were proposed in this pull request?
The rule OptimizeCsvJsonExprs should not be applied to expressions like from_json(regexp_replace).
Why are the changes needed?
Applying the rule to such expressions causes a performance regression.
Does this PR introduce any user-facing change?
Yes.
For SQL like this:
Before this PR:
it takes 42 minutes.
After this PR:
it takes 6 minutes.
If the rule OptimizeJsonExprs is not applied, then in the physical plan's ProjectExec, InterpretedUnsafeProjection.createProjection or GenerateUnsafeProjection.generate eliminates common subexpressions, so regexp_replace is computed only once.
If the rule OptimizeJsonExprs is applied, regexp_replace is computed as many times as there are fields in ${device_schema}.
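The call counts described above can be illustrated with a small toy model in plain Scala. This is only a sketch and does not use Spark: `expensiveReplace`, `parse`, `evalShared`, and `evalPerField` are hypothetical stand-ins (for regexp_replace, from_json, the projection with a shared subexpression, and the projection after per-field rewriting, respectively), and the `k=v,k=v` record format is an assumption made to keep the parser trivial.

```scala
object CseSketch {
  // Counts how often the stand-in for regexp_replace runs.
  var replaceCalls = 0

  // Hypothetical stand-in for regexp_replace(s, 'a', 'new_a').
  def expensiveReplace(s: String): String = {
    replaceCalls += 1
    s.replaceAll("a", "new_a")
  }

  // Hypothetical stand-in for from_json: parses "k=v,k=v" and keeps only the requested fields.
  def parse(record: String, fields: Seq[String]): Map[String, String] =
    record.split(",").map(_.split("=")).collect {
      case Array(k, v) if fields.contains(k) => k -> v
    }.toMap

  // Models the unoptimized shape: every field reads from one shared
  // parse(expensiveReplace(s)) result, so the replace runs once per row.
  def evalShared(s: String, fields: Seq[String]): Seq[String] = {
    val parsed = parse(expensiveReplace(s), fields)
    fields.map(parsed)
  }

  // Models the rewritten shape: each field gets its own parse call with a
  // one-field schema, so the replace runs once per field.
  def evalPerField(s: String, fields: Seq[String]): Seq[String] =
    fields.map(f => parse(expensiveReplace(s), Seq(f))(f))

  def main(args: Array[String]): Unit = {
    val fields = Seq("new_a", "b")

    replaceCalls = 0
    evalShared("a=1,b=0.8", fields)
    println(s"shared evaluation: $replaceCalls replace call(s)")    // 1

    replaceCalls = 0
    evalPerField("a=1,b=0.8", fields)
    println(s"per-field evaluation: $replaceCalls replace call(s)") // 2
  }
}
```

Both strategies return the same values; only the number of calls to the expensive function differs, and it grows with the number of fields in the per-field case.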
BTW, the root cause is hard to find; in this example, it took me 2 days to track it down.
How was this patch tested?
No, it is just a rule optimization for OptimizeJsonExprs.