New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP][SPARK-32939][SQL]Avoid re-compute expensive expression #29807
[WIP][SPARK-32939][SQL]Avoid re-compute expensive expression #29807
Conversation
Test build #128885 has finished for PR 29807 at commit
|
private val EXPENSIVE_EXPR_PREFIX = "expensive_col_" | ||
|
||
def extractExpensiveExprs(e: Expression): Seq[Expression] = e.collect { | ||
case gjo: GetJsonObject => gjo |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you handle only GetJsonObject
, please narrow down the PR title exactly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you handle only
GetJsonObject
, please narrow down the PR title exactly.
Sorry for forgot to add tag [WIP]
. Since I am not familiar with all expensive expr or function.
In our env, many data stored as Json, we use this a lot and always expr with get_json_object
is slow.
So I made this and hope for some advise and to see if it is reasonable enough
cc @viirya |
@dongjoon-hyun Thanks for pinging me. Hmm, this is actually related to what we are working on SPARK-32943. We should not do it at physical plan level. We plan to tackle this kind of issue at optimizer. |
All right, I will try to make this resolved in Optimizer |
For this case, seem we still need to handle Physical plan level
|
if fields.forall(_.deterministic) && canPushThroughCondition(grandChild, condition) => | ||
case f @ Filter(condition, project @ Project(fields, grandChild)) | ||
if fields.forall(_.deterministic) && canPushThroughCondition(grandChild, condition) && | ||
fields.flatMap(extractExpensiveExprs(_)).isEmpty => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If not change this, even we write sql like
SELECT c1,
case
when a=1 then "a"
when a=2 then "b"
end as s_type
FROM (
SELECT c1, get_json_object(s,'$.a') as a
FROM t
) tmp
WHERE a in (1, 2)
will be resolved as
== Physical Plan ==
*(1) Project [c1#1, CASE WHEN (cast(get_json_object(s#2, $.a) as int) = 1) THEN a WHEN (cast(get_json_object(s#2, $.a) as int) = 2) THEN b END AS s_type#0]
+- *(1) Filter get_json_object(s#2, $.a) IN (1,2)
+- Scan hive default.t [c1#1, s#2], HiveTableRelation `default`.`t`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#1, s#2], [P1#3], Statistics(sizeInBytes=8.0 EiB)
Test build #128915 has finished for PR 29807 at commit
|
That is one issue our ongoing work SPARK-32943 wants to fix. The problem here involves not just one, but some issues. There are some complicated issue we need to address. Current approach to fix it in physical plan is too hacky, As I see it. |
What I show above is ScanOperator's issue, so change that code. |
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. |
What changes were proposed in this pull request?
will got plan as
we can see that get_json_object(s#2, $.a) will be computed tree times
Always there are expensive expressions are re-computed many times in such grammar。
This case is frequent in SQL in production environments and expr is complex and same.
So after judgement, we can compute these duplicated complex expression first with a projection.
The result plan like
The ProjectExec near Scan not contained by WholeStageCodegen since it not match this case now(won't occur in current code)
Why are the changes needed?
Does this PR introduce any user-facing change?
No
How was this patch tested?
Need add UT