[SPARK-41805][SQL] Reuse expressions in WindowSpecDefinition #39333

Closed

Conversation

wankunde
Contributor

@wankunde wankunde commented Jan 1, 2023

What changes were proposed in this pull request?

For complex expressions in a window spec definition, we extract them and replace them with an AttributeReference (with an internal column name, e.g. "_w0").
We can reuse the extracted expressions to avoid generating unnecessary Window nodes.

For example, query:

```
SELECT max(value) over(partition by (key) order by cast(ts as timestamp)) as w1,
       min(value) over(partition by (key) order by cast(ts as timestamp)) as w2
FROM t1
```

before this PR:

```
Project [w1#229L, w2#230L]
+- Project [value#232L, key#231L, _w2#236, _w3#237, w1#229L, w2#230L, w1#229L, w2#230L]
   +- Window [min(value#232L) windowspecdefinition(key#231L, _w3#237 ASC NULLS FIRST, specifiedwindowframe(RangeFrame, unboundedpreceding$(), currentrow$())) AS w2#230L], [key#231L], [_w3#237 ASC NULLS FIRST]
      +- Window [max(value#232L) windowspecdefinition(key#231L, _w2#236 ASC NULLS FIRST, specifiedwindowframe(RangeFrame, unboundedpreceding$(), currentrow$())) AS w1#229L], [key#231L], [_w2#236 ASC NULLS FIRST]
         +- Project [value#232L, key#231L, cast(ts#233L as timestamp) AS _w2#236, cast(ts#233L as timestamp) AS _w3#237]
            +- SubqueryAlias spark_catalog.default.t1
               +- Relation spark_catalog.default.t1[key#231L,value#232L,ts#233L] parquet
```

after this PR:

```
Project [w1#229L, w2#230L]
+- Project [value#232L, key#231L, _w2#236, w1#229L, w2#230L, w1#229L, w2#230L]
   +- Window [max(value#232L) windowspecdefinition(key#231L, _w2#236 ASC NULLS FIRST, specifiedwindowframe(RangeFrame, unboundedpreceding$(), currentrow$())) AS w1#229L, min(value#232L) windowspecdefinition(key#231L, _w2#236 ASC NULLS FIRST, specifiedwindowframe(RangeFrame, unboundedpreceding$(), currentrow$())) AS w2#230L], [key#231L], [_w2#236 ASC NULLS FIRST]
      +- Project [value#232L, key#231L, cast(ts#233L as timestamp) AS _w2#236]
         +- SubqueryAlias spark_catalog.default.t1
            +- Relation spark_catalog.default.t1[key#231L,value#232L,ts#233L] parquet
```
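
For reference, one way to reproduce the plans above from spark-shell; the table name, schema, and query are taken from the example (a scratch Parquet table standing in for a real one):

```
// Scratch table matching the example schema.
spark.sql("CREATE TABLE t1 (key BIGINT, value BIGINT, ts BIGINT) USING parquet")

// explain(true) prints the parsed/analyzed/optimized plans; after this PR the
// analyzed plan contains a single Window node over one extracted _w* column.
spark.sql("""
  SELECT max(value) OVER (PARTITION BY key ORDER BY CAST(ts AS TIMESTAMP)) AS w1,
         min(value) OVER (PARTITION BY key ORDER BY CAST(ts AS TIMESTAMP)) AS w2
  FROM t1
""").explain(true)
```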

Why are the changes needed?

Optimize window expressions: reusing the extracted expressions avoids redundant projected columns and lets the two window functions share a single Window node.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added UT

@github-actions github-actions bot added the SQL label Jan 1, 2023
@wankunde wankunde changed the title [SPARK-41805] Reuse expressions in WindowSpecDefinition [SPARK-41805][SQL] Reuse expressions in WindowSpecDefinition Jan 2, 2023
@AmplabJenkins

Can one of the admins verify this patch?

@wankunde
Contributor Author

wankunde commented Jan 3, 2023

Hi @wangyum @cloud-fan, could you help review this PR? Thanks!

@@ -3230,15 +3230,15 @@ class Analyzer(override val catalogManager: CatalogManager)
// For example, when we have col1 - Sum(col2 + col3) OVER (PARTITION BY col4 ORDER BY col5),
// we need to make sure that col1 to col5 are all projected from the child of the Window
// operator.
-val extractedExprBuffer = new ArrayBuffer[NamedExpression]()
+val extractedExprBufferMap = mutable.LinkedHashMap.empty[Expression, NamedExpression]
Contributor

Suggested change
-val extractedExprBufferMap = mutable.LinkedHashMap.empty[Expression, NamedExpression]
+val extractedExprMap = mutable.LinkedHashMap.empty[Expression, NamedExpression]

@@ -3230,15 +3230,15 @@ class Analyzer(override val catalogManager: CatalogManager)
// For example, when we have col1 - Sum(col2 + col3) OVER (PARTITION BY col4 ORDER BY col5),
// we need to make sure that col1 to col5 are all projected from the child of the Window
// operator.
-val extractedExprBuffer = new ArrayBuffer[NamedExpression]()
+val extractedExprBufferMap = mutable.LinkedHashMap.empty[Expression, NamedExpression]
def extractExpr(expr: Expression): Expression = expr match {
case ne: NamedExpression =>
// If a named expression is not in regularExpressions, add it to
// extractedExprBuffer and replace it with an AttributeReference.
Contributor

Suggested change
-// extractedExprBuffer and replace it with an AttributeReference.
+// extractedExprMap and replace it with an AttributeReference.

if (missingExpr.nonEmpty) {
-extractedExprBuffer += ne
+extractedExprBufferMap += ne -> ne
Contributor

Suggested change
-extractedExprBufferMap += ne -> ne
+extractedExprBufferMap += ne.canonicalized -> ne

Contributor

Two expressions may have the same semantics even if they look different, e.g. a + 1 and 1 + a. Let's make sure the map key is the canonicalized expression.
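
To illustrate, a minimal sketch of canonicalized equality using Catalyst's expression classes directly (the attribute and literal here are made up for the example):

```
import org.apache.spark.sql.catalyst.expressions.{Add, AttributeReference, Literal}
import org.apache.spark.sql.types.IntegerType

val a = AttributeReference("a", IntegerType)()
val e1 = Add(a, Literal(1)) // a + 1
val e2 = Add(Literal(1), a) // 1 + a

// Different trees, same semantics: canonicalization normalizes commutative
// operations, so the canonical forms (and semanticEquals) agree.
assert(e1.canonicalized == e2.canonicalized)
assert(e1.semanticEquals(e2))
```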

-val withName = Alias(e, s"_w${extractedExprBuffer.length}")()
-extractedExprBuffer += withName
-withName.toAttribute
+extractedExprBufferMap.getOrElseUpdate(e,
Contributor

Suggested change
-extractedExprBufferMap.getOrElseUpdate(e,
+extractedExprBufferMap.getOrElseUpdate(e.canonicalized,
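
Putting the two suggestions together, the extraction becomes a lookup keyed on the canonicalized form; a minimal sketch of the resulting shape (simplified from the actual Analyzer code, names as suggested above):

```
import scala.collection.mutable
import org.apache.spark.sql.catalyst.expressions.{Alias, Expression, NamedExpression}

val extractedExprMap = mutable.LinkedHashMap.empty[Expression, NamedExpression]

// Keying on the canonicalized form makes semantically equal expressions
// (e.g. the same cast in two window specs) extract to a single _w* alias.
def extractExpr(e: Expression): Expression =
  extractedExprMap.getOrElseUpdate(e.canonicalized, {
    Alias(e, s"_w${extractedExprMap.size}")() // internal names _w0, _w1, ...
  }).toAttribute
```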

InputAdapter
Exchange [_w1,_w2] #1
WholeStageCodegen (5)
-HashAggregate [i_category,i_class,spark_grouping_id,sum,sum] [sum(UnscaledValue(ss_net_profit)),sum(UnscaledValue(ss_ext_sales_price)),gross_margin,lochierarchy,_w1,_w2,_w3,sum,sum]
+HashAggregate [i_category,i_class,spark_grouping_id,sum,sum] [sum(UnscaledValue(ss_net_profit)),sum(UnscaledValue(ss_ext_sales_price)),gross_margin,lochierarchy,_w0,_w1,_w2,sum,sum]
Contributor

Do you know why the names become _w0, _w1, _w2?

Contributor Author

@wankunde wankunde Jan 7, 2023

  • For the window expression in q36:
```
rank()
  OVER (
    PARTITION BY grouping(i_category) + grouping(i_class),
      CASE WHEN grouping(i_class) = 0
        THEN i_category END
    ORDER BY sum(ss_net_profit) / sum(ss_ext_sales_price) ASC) AS rank_within_parent
```

will be parsed to

```
{WindowExpression@13158} rank() windowspecdefinition((cast((shiftright(spark_grouping_id#117L, 1) & 1) as tinyint) AS grouping(i_category)#122 + cast((shiftright(spark_grouping_id#117L, 0) & 1) as tinyint) AS grouping(i_class)#123), CASE WHEN (cast(cast((shiftright(spark_grouping_id#117L, 0) & 1) as tinyint) AS grouping(i_class)#124 as int) = 0) THEN i_category#118 END, (sum(ss_net_profit#26) / sum(ss_ext_sales_price#19)) ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$()))
 windowFunction = {Rank@9579} rank()
 windowSpec = {WindowSpecDefinition@13160} windowspecdefinition((cast((shiftright(spark_grouping_id#117L, 1) & 1) as tinyint) AS grouping(i_category)#122 + cast((shiftright(spark_grouping_id#117L, 0) & 1) as tinyint) AS grouping(i_class)#123), CASE WHEN (cast(cast((shiftright(spark_grouping_id#117L, 0) & 1) as tinyint) AS grouping(i_class)#124 as int) = 0) THEN i_category#118 END, (sum(ss_net_profit#26) / sum(ss_ext_sales_price#19)) ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$()))
```
  • Rule ResolveWindowOrder will add the orderSpec to the rank function:
```
  object ResolveWindowOrder extends Rule[LogicalPlan] {
    def apply(plan: LogicalPlan): LogicalPlan = plan.resolveExpressionsWithPruning(
      _.containsPattern(WINDOW_EXPRESSION), ruleId) {
      case WindowExpression(wf: WindowFunction, spec) if spec.orderSpec.isEmpty =>
        throw QueryCompilationErrors.windowFunctionWithWindowFrameNotOrderedError(wf)
      case WindowExpression(rank: RankLike, spec) if spec.resolved =>
        val order = spec.orderSpec.map(_.child)
        WindowExpression(rank.withOrder(order), spec)
    }
  }
```

so the window expression will be:

```
{WindowExpression@13219} rank((sum(ss_net_profit#26) / sum(ss_ext_sales_price#19))) windowspecdefinition((cast((shiftright(spark_grouping_id#117L, 1) & 1) as tinyint) AS grouping(i_category)#122 + cast((shiftright(spark_grouping_id#117L, 0) & 1) as tinyint) AS grouping(i_class)#123), CASE WHEN (cast(cast((shiftright(spark_grouping_id#117L, 0) & 1) as tinyint) AS grouping(i_class)#124 as int) = 0) THEN i_category#118 END, (sum(ss_net_profit#26) / sum(ss_ext_sales_price#19)) ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$()))
 windowFunction = {Rank@13049} rank((sum(ss_net_profit#26) / sum(ss_ext_sales_price#19)))
 windowSpec = {WindowSpecDefinition@13221} windowspecdefinition((cast((shiftright(spark_grouping_id#117L, 1) & 1) as tinyint) AS grouping(i_category)#122 + cast((shiftright(spark_grouping_id#117L, 0) & 1) as tinyint) AS grouping(i_class)#123), CASE WHEN (cast(cast((shiftright(spark_grouping_id#117L, 0) & 1) as tinyint) AS grouping(i_class)#124 as int) = 0) THEN i_category#118 END, (sum(ss_net_profit#26) / sum(ss_ext_sales_price#19)) ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$()))
```
  • Before this PR, rule ExtractWindowExpressions extracts the child expression (sum(ss_net_profit#26) / sum(ss_ext_sales_price#19)) of the window function rank() into _w0, and then extracts the same orderSpec expression in the WindowSpecDefinition into _w3 again.
  • After this PR, we reuse the extracted expression, so _w3 becomes _w0 in the q36 result.

I don't really understand why we need to add the orderSpec to the Rank function. Could we remove

```
case WindowExpression(rank: RankLike, spec) if spec.resolved =>
        val order = spec.orderSpec.map(_.child)
        WindowExpression(rank.withOrder(order), spec)
```

from ResolveWindowOrder in a follow-up PR?

@cloud-fan cloud-fan closed this in 6f31566 Jan 9, 2023
@cloud-fan
Contributor

thanks, merging to master!

wangyum pushed a commit that referenced this pull request May 26, 2023

Closes #39333 from wankunde/extract_window.

Authored-by: Kun Wan <wankun@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>