
[WIP][SPARK-32939][SQL]Avoid re-compute expensive expression #29807

Closed

Conversation

AngersZhuuuu
Contributor

What changes were proposed in this pull request?

test("SPARK-32939: Expensive expr re-compute demo") {
  withTable("t") {
    withTempDir { loc =>
      sql(
        s"""CREATE TABLE t(c1 INT, s STRING) PARTITIONED BY(P1 STRING)
           | LOCATION '${loc.getAbsolutePath}'
           |""".stripMargin)
      sql(
        """
          |SELECT c1,
          |case
          |  when get_json_object(s,'$.a')=1 then "a"
          |  when get_json_object(s,'$.a')=2 then "b"
          |end as s_type
          |FROM t
          |WHERE get_json_object(s,'$.a') in (1, 2)
        """.stripMargin).explain(true)
    }
  }
}

This produces the following plan:

== Physical Plan ==
*(1) Project [c1#1, CASE WHEN (cast(get_json_object(s#2, $.a) as int) = 1) THEN a WHEN (cast(get_json_object(s#2, $.a) as int) = 2) THEN b END AS s_type#0]
+- *(1) Filter get_json_object(s#2, $.a) IN (1,2)
   +- Scan hive default.t [c1#1, s#2], HiveTableRelation `default`.`t`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#1, s#2], [P1#3], Statistics(sizeInBytes=8.0 EiB)

We can see that get_json_object(s#2, $.a) will be computed three times. Expensive expressions are often re-computed several times under this kind of SQL pattern, and the case is frequent in production queries, where the duplicated expressions are complex and identical. So, after detecting such duplicates, we can compute these expensive expressions once up front with a projection.
The resulting plan looks like:

== Physical Plan ==
*(1) Project [c1#1, CASE WHEN (cast(expensive_col_6#6 as int) = 1) THEN a WHEN (cast(expensive_col_6#6 as int) = 2) THEN b END AS s_type#0]
+- *(1) Filter expensive_col_6#6 IN (1,2)
   +- Project [c1#1, s#2, get_json_object(s#2, $.a) AS expensive_col_6#6]
      +- Scan hive default.t [c1#1, s#2], HiveTableRelation `default`.`t`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#1, s#2], [P1#3], Statistics(sizeInBytes=8.0 EiB)

Note that the ProjectExec near the Scan is not included in WholeStageCodegen, because that combination does not match any codegen case right now (it cannot occur in the current code).
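The extraction idea above can be sketched with a minimal, self-contained example. The toy Expr types, the Dedup object, and the alias name below are hypothetical stand-ins for the real Catalyst classes, not Spark's API:

```scala
// Toy expression tree (hypothetical; real code would use Catalyst's Expression).
sealed trait Expr
case class GetJsonObject(col: String, path: String) extends Expr
case class EqualTo(left: Expr, right: Expr) extends Expr
case class Literal(value: Int) extends Expr
case class AttrRef(name: String) extends Expr

object Dedup {
  // Collect every "expensive" subtree (here: only GetJsonObject).
  def collectExpensive(e: Expr): Seq[Expr] = e match {
    case g: GetJsonObject => Seq(g)
    case EqualTo(l, r)    => collectExpensive(l) ++ collectExpensive(r)
    case _                => Seq.empty
  }

  // Replace each expensive subtree with a reference to its pre-computed alias,
  // mimicking the extra Project inserted below the Filter in the plan above.
  def rewrite(e: Expr, aliases: Map[Expr, String]): Expr = e match {
    case g: GetJsonObject if aliases.contains(g) => AttrRef(aliases(g))
    case EqualTo(l, r) => EqualTo(rewrite(l, aliases), rewrite(r, aliases))
    case other => other
  }
}
```

With two predicates sharing the same GetJsonObject, `collectExpensive` plus `distinct` yields one expensive expression, and `rewrite` turns both occurrences into a single attribute reference, so the expression is evaluated once per row instead of once per occurrence.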

Why are the changes needed?

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unit tests still need to be added.

@SparkQA

SparkQA commented Sep 19, 2020

Test build #128885 has finished for PR 29807 at commit 73e94c3.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

private val EXPENSIVE_EXPR_PREFIX = "expensive_col_"

def extractExpensiveExprs(e: Expression): Seq[Expression] = e.collect {
  case gjo: GetJsonObject => gjo
}
Member


If you handle only GetJsonObject, please narrow down the PR title exactly.

Contributor Author


If you handle only GetJsonObject, please narrow down the PR title exactly.

Sorry, I forgot to add the [WIP] tag, since I am not familiar with the full set of expensive expressions and functions.
In our environment a lot of data is stored as JSON; we use get_json_object heavily, and expressions containing it are always slow.

So I made this PR hoping for some advice, and to see whether the approach is reasonable enough.

@dongjoon-hyun
Member

cc @viirya

@AngersZhuuuu AngersZhuuuu changed the title [SPARK-32939][SQL]Avoid re-compute expensive expression [WIP][SPARK-32939][SQL]Avoid re-compute expensive expression Sep 20, 2020
@viirya
Member

viirya commented Sep 20, 2020

@dongjoon-hyun Thanks for pinging me. Hmm, this is actually related to what we are working on SPARK-32943. We should not do it at physical plan level. We plan to tackle this kind of issue at optimizer.

@AngersZhuuuu
Contributor Author

@dongjoon-hyun Thanks for pinging me. Hmm, this is actually related to what we are working on SPARK-32943. We should not do it at physical plan level. We plan to tackle this kind of issue at optimizer.

All right, I will try to resolve this in the optimizer.

@AngersZhuuuu
Contributor Author

@dongjoon-hyun Thanks for pinging me. Hmm, this is actually related to what we are working on SPARK-32943. We should not do it at physical plan level. We plan to tackle this kind of issue at optimizer.

For this case, it seems we still need to handle it at the physical plan level:


== Optimized Logical Plan ==
Project [c1#1, CASE WHEN (cast(expensive_col_6#6 as int) = 1) THEN a WHEN (cast(expensive_col_6#6 as int) = 2) THEN b END AS s_type#0]
+- Filter expensive_col_6#6 IN (1,2)
   +- Project [c1#1, get_json_object(s#2, $.a) AS expensive_col_6#6]
      +- HiveTableRelation `default`.`t`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#1, s#2], [P1#3], Statistics(sizeInBytes=8.0 EiB)

== Physical Plan ==
Project [c1#1, CASE WHEN (cast(get_json_object(s#2, $.a) AS expensive_col_6#6 as int) = 1) THEN a WHEN (cast(get_json_object(s#2, $.a) AS expensive_col_6#6 as int) = 2) THEN b END AS s_type#0]
+- Filter get_json_object(s#2, $.a) AS expensive_col_6#6 IN (1,2)
   +- Scan hive default.t [c1#1, s#2], HiveTableRelation `default`.`t`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#1, s#2], [P1#3], Statistics(sizeInBytes=8.0 EiB)

case f @ Filter(condition, project @ Project(fields, grandChild))
-  if fields.forall(_.deterministic) && canPushThroughCondition(grandChild, condition) =>
+  if fields.forall(_.deterministic) && canPushThroughCondition(grandChild, condition) &&
+    fields.flatMap(extractExpensiveExprs(_)).isEmpty =>
Contributor Author


Without this change, even if we write the SQL like

SELECT c1,
case
  when a=1 then "a"
  when a=2 then "b"
end as s_type
FROM (
 SELECT c1, get_json_object(s,'$.a') as a
 FROM t
 ) tmp
WHERE a in (1, 2)

it will still be resolved as

== Physical Plan ==
*(1) Project [c1#1, CASE WHEN (cast(get_json_object(s#2, $.a) as int) = 1) THEN a WHEN (cast(get_json_object(s#2, $.a) as int) = 2) THEN b END AS s_type#0]
+- *(1) Filter get_json_object(s#2, $.a) IN (1,2)
   +- Scan hive default.t [c1#1, s#2], HiveTableRelation `default`.`t`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#1, s#2], [P1#3], Statistics(sizeInBytes=8.0 EiB)
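To illustrate why pushing the Filter below the Project re-inlines the expensive expression: predicate pushdown substitutes each attribute reference in the condition with the expression that defines it in the child Project. The text-based `substitute` helper below is a hypothetical simplification (real Catalyst rewrites the expression tree by attribute, not strings):

```scala
object PushdownDemo {
  // aliases maps an output attribute name to the (textual) expression
  // defining it in the child Project, e.g. "a" -> "get_json_object(s, '$.a')".
  // Pushing a Filter through the Project rewrites the predicate by
  // substituting each attribute with its defining expression.
  def substitute(predicate: String, aliases: Map[String, String]): String =
    aliases.foldLeft(predicate) { case (p, (name, expr)) => p.replace(name, expr) }
}
```

After substitution, the predicate that sits below the Project contains its own copy of the expensive expression, which is why the manually written subquery above still collapses back into repeated get_json_object calls.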

@SparkQA

SparkQA commented Sep 20, 2020

Test build #128915 has finished for PR 29807 at commit 05ae9ec.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member

viirya commented Sep 20, 2020

For this case, it seems we still need to handle it at the physical plan level:


== Optimized Logical Plan ==
Project [c1#1, CASE WHEN (cast(expensive_col_6#6 as int) = 1) THEN a WHEN (cast(expensive_col_6#6 as int) = 2) THEN b END AS s_type#0]
+- Filter expensive_col_6#6 IN (1,2)
   +- Project [c1#1, get_json_object(s#2, $.a) AS expensive_col_6#6]
      +- HiveTableRelation `default`.`t`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#1, s#2], [P1#3], Statistics(sizeInBytes=8.0 EiB)

== Physical Plan ==
Project [c1#1, CASE WHEN (cast(get_json_object(s#2, $.a) AS expensive_col_6#6 as int) = 1) THEN a WHEN (cast(get_json_object(s#2, $.a) AS expensive_col_6#6 as int) = 2) THEN b END AS s_type#0]
+- Filter get_json_object(s#2, $.a) AS expensive_col_6#6 IN (1,2)
   +- Scan hive default.t [c1#1, s#2], HiveTableRelation `default`.`t`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#1, s#2], [P1#3], Statistics(sizeInBytes=8.0 EiB)

That is one issue our ongoing work SPARK-32943 wants to fix.

The problem here involves not just one issue but several, and some of them are complicated to address. The current approach of fixing it in the physical plan is too hacky, as I see it.

@AngersZhuuuu
Contributor Author

The problem here involves not just one issue but several, and some of them are complicated to address. The current approach of fixing it in the physical plan is too hacky, as I see it.

What I showed above is the ScanOperator issue, so I changed that code.

@github-actions

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Dec 31, 2020
@github-actions github-actions bot closed this Jan 1, 2021