
[WIP][SPARK-32939][SQL]Avoid re-compute expensive expression #29807

Closed

Conversation

AngersZhuuuu
Contributor

What changes were proposed in this pull request?

test("SPARK-32939: Expensive expr re-compute demo") {
  withTable("t") {
    withTempDir { loc =>
      sql(
        s"""CREATE TABLE t(c1 INT, s STRING) PARTITIONED BY(P1 STRING)
           | LOCATION '${loc.getAbsolutePath}'
           |""".stripMargin)
      sql(
        """
          |SELECT c1,
          |case
          |  when get_json_object(s,'$.a')=1 then "a"
          |  when get_json_object(s,'$.a')=2 then "b"
          |end as s_type
          |FROM t
          |WHERE get_json_object(s,'$.a') in (1, 2)
        """.stripMargin).explain(true)
    }
  }
}

This produces the following plan:

== Physical Plan ==
*(1) Project [c1#1, CASE WHEN (cast(get_json_object(s#2, $.a) as int) = 1) THEN a WHEN (cast(get_json_object(s#2, $.a) as int) = 2) THEN b END AS s_type#0]
+- *(1) Filter get_json_object(s#2, $.a) IN (1,2)
   +- Scan hive default.t [c1#1, s#2], HiveTableRelation `default`.`t`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#1, s#2], [P1#3], Statistics(sizeInBytes=8.0 EiB)

We can see that get_json_object(s#2, $.a) will be computed three times. Expensive expressions are often re-computed several times under this kind of SQL pattern, and the case is frequent in production queries, where the duplicated expressions are complex and identical. So, after detecting such duplicates, we can compute these expensive expressions once up front with a projection.
The resulting plan looks like:

== Physical Plan ==
*(1) Project [c1#1, CASE WHEN (cast(expensive_col_6#6 as int) = 1) THEN a WHEN (cast(expensive_col_6#6 as int) = 2) THEN b END AS s_type#0]
+- *(1) Filter expensive_col_6#6 IN (1,2)
   +- Project [c1#1, s#2, get_json_object(s#2, $.a) AS expensive_col_6#6]
      +- Scan hive default.t [c1#1, s#2], HiveTableRelation `default`.`t`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#1, s#2], [P1#3], Statistics(sizeInBytes=8.0 EiB)

Note that the ProjectExec near the Scan is not included in WholeStageCodegen, because that combination does not match any codegen case right now (it cannot occur in the current code).
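The extraction idea above can be sketched with a minimal, self-contained example. The toy Expr types, the Dedup object, and the alias name below are hypothetical stand-ins for the real Catalyst classes, not Spark's API:

```scala
// Toy expression tree (hypothetical; real code would use Catalyst's Expression).
sealed trait Expr
case class GetJsonObject(col: String, path: String) extends Expr
case class EqualTo(left: Expr, right: Expr) extends Expr
case class Literal(value: Int) extends Expr
case class AttrRef(name: String) extends Expr

object Dedup {
  // Collect every "expensive" subtree (here: only GetJsonObject).
  def collectExpensive(e: Expr): Seq[Expr] = e match {
    case g: GetJsonObject => Seq(g)
    case EqualTo(l, r)    => collectExpensive(l) ++ collectExpensive(r)
    case _                => Seq.empty
  }

  // Replace each expensive subtree with a reference to its pre-computed alias,
  // mimicking the extra Project inserted below the Filter in the plan above.
  def rewrite(e: Expr, aliases: Map[Expr, String]): Expr = e match {
    case g: GetJsonObject if aliases.contains(g) => AttrRef(aliases(g))
    case EqualTo(l, r) => EqualTo(rewrite(l, aliases), rewrite(r, aliases))
    case other => other
  }
}
```

With two predicates sharing the same GetJsonObject, `collectExpensive` plus `distinct` yields one expensive expression, and `rewrite` turns both occurrences into a single attribute reference, so the expression is evaluated once per row instead of once per occurrence.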

Why are the changes needed?

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unit tests still need to be added.

@SparkQA

SparkQA commented Sep 19, 2020

Test build #128885 has finished for PR 29807 at commit 73e94c3.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

private val EXPENSIVE_EXPR_PREFIX = "expensive_col_"

def extractExpensiveExprs(e: Expression): Seq[Expression] = e.collect {
  case gjo: GetJsonObject => gjo
}
Member


If you handle only GetJsonObject, please narrow down the PR title exactly.

Contributor Author


If you handle only GetJsonObject, please narrow down the PR title exactly.

Sorry, I forgot to add the [WIP] tag, since I am not familiar with the full set of expensive expressions and functions.
In our environment a lot of data is stored as JSON; we use get_json_object heavily, and expressions containing it are always slow.

So I made this PR hoping for some advice, and to see whether the approach is reasonable enough.

@dongjoon-hyun
Member

cc @viirya

@AngersZhuuuu AngersZhuuuu changed the title [SPARK-32939][SQL]Avoid re-compute expensive expression [WIP][SPARK-32939][SQL]Avoid re-compute expensive expression Sep 20, 2020
@viirya
Member

viirya commented Sep 20, 2020

@dongjoon-hyun Thanks for pinging me. Hmm, this is actually related to what we are working on SPARK-32943. We should not do it at physical plan level. We plan to tackle this kind of issue at optimizer.

@AngersZhuuuu
Contributor Author

@dongjoon-hyun Thanks for pinging me. Hmm, this is actually related to what we are working on SPARK-32943. We should not do it at physical plan level. We plan to tackle this kind of issue at optimizer.

All right, I will try to resolve this in the optimizer.

@AngersZhuuuu
Contributor Author

@dongjoon-hyun Thanks for pinging me. Hmm, this is actually related to what we are working on SPARK-32943. We should not do it at physical plan level. We plan to tackle this kind of issue at optimizer.

For this case, it seems we still need to handle it at the physical plan level:


== Optimized Logical Plan ==
Project [c1#1, CASE WHEN (cast(expensive_col_6#6 as int) = 1) THEN a WHEN (cast(expensive_col_6#6 as int) = 2) THEN b END AS s_type#0]
+- Filter expensive_col_6#6 IN (1,2)
   +- Project [c1#1, get_json_object(s#2, $.a) AS expensive_col_6#6]
      +- HiveTableRelation `default`.`t`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#1, s#2], [P1#3], Statistics(sizeInBytes=8.0 EiB)

== Physical Plan ==
Project [c1#1, CASE WHEN (cast(get_json_object(s#2, $.a) AS expensive_col_6#6 as int) = 1) THEN a WHEN (cast(get_json_object(s#2, $.a) AS expensive_col_6#6 as int) = 2) THEN b END AS s_type#0]
+- Filter get_json_object(s#2, $.a) AS expensive_col_6#6 IN (1,2)
   +- Scan hive default.t [c1#1, s#2], HiveTableRelation `default`.`t`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#1, s#2], [P1#3], Statistics(sizeInBytes=8.0 EiB)

case f @ Filter(condition, project @ Project(fields, grandChild))
-  if fields.forall(_.deterministic) && canPushThroughCondition(grandChild, condition) =>
+  if fields.forall(_.deterministic) && canPushThroughCondition(grandChild, condition) &&
+    fields.flatMap(extractExpensiveExprs(_)).isEmpty =>
Contributor Author


Without this change, even if we write the SQL like

SELECT c1,
case
  when a=1 then "a"
  when a=2 then "b"
end as s_type
FROM (
 SELECT c1, get_json_object(s,'$.a') as a
 FROM t
 ) tmp
WHERE a in (1, 2)

it will still be resolved as

== Physical Plan ==
*(1) Project [c1#1, CASE WHEN (cast(get_json_object(s#2, $.a) as int) = 1) THEN a WHEN (cast(get_json_object(s#2, $.a) as int) = 2) THEN b END AS s_type#0]
+- *(1) Filter get_json_object(s#2, $.a) IN (1,2)
   +- Scan hive default.t [c1#1, s#2], HiveTableRelation `default`.`t`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#1, s#2], [P1#3], Statistics(sizeInBytes=8.0 EiB)
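To illustrate why pushing the Filter below the Project re-inlines the expensive expression: predicate pushdown substitutes each attribute reference in the condition with the expression that defines it in the child Project. The text-based `substitute` helper below is a hypothetical simplification (real Catalyst rewrites the expression tree by attribute, not strings):

```scala
object PushdownDemo {
  // aliases maps an output attribute name to the (textual) expression
  // defining it in the child Project, e.g. "a" -> "get_json_object(s, '$.a')".
  // Pushing a Filter through the Project rewrites the predicate by
  // substituting each attribute with its defining expression.
  def substitute(predicate: String, aliases: Map[String, String]): String =
    aliases.foldLeft(predicate) { case (p, (name, expr)) => p.replace(name, expr) }
}
```

After substitution, the predicate that sits below the Project contains its own copy of the expensive expression, which is why the manually written subquery above still collapses back into repeated get_json_object calls.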

@SparkQA

SparkQA commented Sep 20, 2020

Test build #128915 has finished for PR 29807 at commit 05ae9ec.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member

viirya commented Sep 20, 2020

For this case, it seems we still need to handle it at the physical plan level:


== Optimized Logical Plan ==
Project [c1#1, CASE WHEN (cast(expensive_col_6#6 as int) = 1) THEN a WHEN (cast(expensive_col_6#6 as int) = 2) THEN b END AS s_type#0]
+- Filter expensive_col_6#6 IN (1,2)
   +- Project [c1#1, get_json_object(s#2, $.a) AS expensive_col_6#6]
      +- HiveTableRelation `default`.`t`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#1, s#2], [P1#3], Statistics(sizeInBytes=8.0 EiB)

== Physical Plan ==
Project [c1#1, CASE WHEN (cast(get_json_object(s#2, $.a) AS expensive_col_6#6 as int) = 1) THEN a WHEN (cast(get_json_object(s#2, $.a) AS expensive_col_6#6 as int) = 2) THEN b END AS s_type#0]
+- Filter get_json_object(s#2, $.a) AS expensive_col_6#6 IN (1,2)
   +- Scan hive default.t [c1#1, s#2], HiveTableRelation `default`.`t`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#1, s#2], [P1#3], Statistics(sizeInBytes=8.0 EiB)

That is one issue our ongoing work SPARK-32943 wants to fix.

The problem here involves not just one issue but several, and some of them are complicated to address. The current approach of fixing it in the physical plan is too hacky, as I see it.

@AngersZhuuuu
Contributor Author

The problem here involves not just one issue but several, and some of them are complicated to address. The current approach of fixing it in the physical plan is too hacky, as I see it.

What I showed above is the ScanOperator issue, so I changed that code.

@github-actions

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Dec 31, 2020
@github-actions github-actions bot closed this Jan 1, 2021