[SPARK-29013][SQL] Structurally equivalent subexpression elimination #25717
Conversation
I observed this problem when running the test in #25642. This is WIP for now; I would like to see whether all tests pass and hear opinions from others.
Ah, it looks like a cool idea. I'll check the code tonight.
thanks @viirya. I was actually thinking the same, so I like your proposal. Just one question: why are you limiting this to
For sub-expression elimination, I observed that many of the generated functions are similar and differ only in their input slots. That is why this works for BoundReference: we can parameterize the ordinals out of the generated functions to achieve this feature. I think it would be very hard to generalize this further.
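The parameterization described above can be sketched outside Spark as follows (editor's illustration with hypothetical names, not Spark's actual generated code):

```scala
// Without structural sharing, codegen emits one function per subexpression,
// identical except for the hard-coded input ordinals:
def subExpr1(row: Array[Int]): Int = row(1) + row(2)
def subExpr2(row: Array[Int]): Int = row(3) + row(4)

// With structural sharing, a single function takes the ordinals as parameters:
def subExprShared(row: Array[Int], i: Int, j: Int): Int = row(i) + row(j)

val row = Array(0, 10, 20, 30, 40)
println(subExprShared(row, 1, 2)) // 30
println(subExprShared(row, 3, 4)) // 70
```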
 * structurally equivalent expressions. Non-recursive.
 */
def addStructExpr(ctx: CodegenContext, expr: Expression): Unit = {
  if (expr.deterministic) {
Couldn't we still share a function in the non-deterministic case? e.g.,
int subExpr1 = input[0] + random();
int subExpr2 = input[1] + random();
=>
int subExpr1 = subExpr(input[0]);
int subExpr2 = subExpr(input[1]);
int subExpr(int v) { return v + random(); }
Non-deterministic expressions can't do sub-expression elimination.
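A tiny sketch (editor's illustration, not Spark code) of why sharing is unsafe for non-deterministic expressions:

```scala
import scala.util.Random

val rng = new Random()
// A non-deterministic "expression": each evaluation should draw a fresh value.
def nonDeterministic(x: Int): Int = x + rng.nextInt(1000)

// Sub-expression elimination would evaluate once and reuse the result,
// forcing the two occurrences to agree, which changes the semantics:
val sharedResult = nonDeterministic(1)
val occurrence1  = sharedResult
val occurrence2  = sharedResult
println(occurrence1 == occurrence2) // true, but independent draws should usually differ
```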
oh, I see.
btw, is this idea limited to common subexprs? For example, can it cover a case like:
select sum(a + b), sum(b + c), sum(c + d), sum(d + e) from values (1, 1, 1, 1, 1) t(a, b, c, d, e)
It probably can, if we want functions like sum(a + b) etc. to be called in split functions; their inputs can be parameterized.
 * but different slots of the input tuple, we replace `BoundReference` with this parameterized
 * version. The slot position is parameterized and is given at runtime.
 */
case class ParameterizedBoundReference(parameter: String, dataType: DataType, nullable: Boolean)
Since this is only used for codegen, how about moving this to org.apache.spark.sql.catalyst.expressions.codegen?
nit: parameter -> variableNameForOrdinal or paramNameForOrdinal?
Yeah, ok, I will.
private def parameterizedBoundReferences(ctx: CodegenContext, expr: Expression): Expression = {
  expr.transformUp {
    case b: BoundReference =>
      val param = ctx.freshName("boundInput")
nit: boundInput -> ordinal?
if (!skip && !addExpr(expr)) {
  childrenToRecurse.foreach(addExprTree)
if (!skip) {
Can't we do it like !skip && !addStructExpr(expr, exprMap), in the same way as addExprTree?
Well, this also recursively adds children even when the parent expr was added. I tested both with and without it while prototyping this; adding children saves more generated code text.
case class StructuralExpr(e: Expression) {
  def normalized(expr: Expression): Expression = {
    expr.transformUp {
      case b: ParameterizedBoundReference =>
To avoid unnecessary plan copies, can we check this equality based on BoundReference (by just copying it like b.copy(ordinal = 0 or -1?))? IIUC it's ok to replace BoundReference with ParameterizedBoundReference just when generating code in https://github.com/apache/spark/pull/25717/files#diff-8bcc5aea39c73d4bf38aef6f6951d42cR1117?
Isn't b.copy also copying the input expr?
Ah, I see. But if we write it like this (c1c5052), we don't need to pass CodegenContext into EquivalentExpressions?
Ok, looks good.
I just modified it this way, but it failed HashAggregationQuerySuite on Jenkins and locally.
I think I figured out why. Made another commit.
The current one looks super good to me. Thanks!
override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
  assert(ctx.currentVars == null && ctx.INPUT_ROW != null,
    "ParameterizedBoundReference can not be used in whole-stage codegen yet.")
Any barrier to supporting the whole-stage codegen case?
I think it can be used in whole-stage codegen. But since this is applied to sub-expression elimination, which is non-whole-stage only, I'd also like to keep the code change small in a single PR.
def addStructExpr(ctx: CodegenContext, expr: Expression): Unit = {
  if (expr.deterministic) {
    val refs = expr.collect {
      case b: BoundReference => b
nit: case b: BoundReference => Literal(0)?
// We calculate the function parameter length as the number of ints plus `INPUT_ROW` plus
// an int-typed result array index.
val parameterLength = CodeGenerator.calculateParamLength(refs.map(_ => Literal(0))) + 2
if (CodeGenerator.isValidParamLength(parameterLength)) {
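The parameter-length check can be sketched as follows. This is an editor's sketch that assumes the usual JVM rule (long/double parameters occupy two slots and a method is limited to 255 parameter slots); the names mimic, but are not, Spark's calculateParamLength/isValidParamLength:

```scala
// Assumed JVM parameter-slot accounting: long/double take two slots, others one.
def paramSlots(javaTypes: Seq[String]): Int =
  javaTypes.map { case "long" | "double" => 2; case _ => 1 }.sum

// The JVM caps a method at 255 parameter slots.
def isValidParamLength(n: Int): Boolean = n <= 255

// Three int ordinals, plus INPUT_ROW and an int result-array index:
val parameterLength = paramSlots(Seq.fill(3)("int")) + 2
println(isValidParamLength(parameterLength)) // true
```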
If the length goes over the limit, does the current logic give up eliminating common exprs? If so, can we fall back to the non-structural mode?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea.
Test build #110315 has finished for PR 25717 at commit
retest this please
Test build #110320 has finished for PR 25717 at commit
retest this please
Test build #110352 has finished for PR 25717 at commit
Test build #110379 has finished for PR 25717 at commit
Test build #110397 has finished for PR 25717 at commit
retest this please
Test build #110407 has finished for PR 25717 at commit
Test build #110594 has finished for PR 25717 at commit
Test build #110598 has finished for PR 25717 at commit
@cloud-fan @rednaxelafx @mgaido91 @kiszk Can anyone check this?
btw, this PR doesn't include end-to-end tests; do you think the queries in the existing tests are enough for end-to-end coverage of this PR? I'm not sure how much the existing tests exercise structurally equivalent exprs though... Also, are there any negative performance impacts on existing queries, e.g., TPCDS?
Regarding end-to-end tests, I am not sure what tests we need. If there are end-to-end tests of the current sub-expression elimination, I think I can write tests based on them.
I have not run a TPCDS benchmark with this. Just
Test build #110602 has finished for PR 25717 at commit
I will take a look on Monday or Tuesday.
/**
 * Adds each expression to this data structure, grouping them with existing equivalent
 * expressions. Non-recursive.
 * Returns true if there was already a matching expression.
 */
def addExpr(expr: Expression): Boolean = {
def addExpr(expr: Expression, exprMap: EquivalenceMap = this.equivalenceMap): Boolean = {
nit: Do we need = this.equivalenceMap? It seems that all of the callers pass two arguments.
addExpr is also used at PhysicalAggregation:
spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala, lines 222 to 229 in 2f3997f:
expr.collect {
  // addExpr() always returns false for non-deterministic expressions and do not add them.
  case agg: AggregateExpression
    if !equivalentAggregateExpressions.addExpr(agg) => agg
  case udf: PythonUDF
    if PythonUDF.isGroupedAggPandasUDF(udf) &&
      !equivalentAggregateExpressions.addExpr(udf) => udf
}
if (!skip && !addExpr(expr)) {
  childrenToRecurse.foreach(addExprTree)
if (!skip && addStructExpr(expr)) {
  childrenToRecurse(expr).foreach(addStructuralExprTree)
nit: do we want to add (_)?
addExprTree doesn't add (_) either; I just followed it. If there's no special reason, I will leave it as is.
I am neutral on this. I was just curious why that line adds (_) but this one does not.
At that line, if we don't add (_) when calling addExprTree, we see a compilation error:
[error] found   : (org.apache.spark.sql.catalyst.expressions.Expression, equivalentExpressions.EquivalenceMap) => Unit
[error]  (which expands to)  (org.apache.spark.sql.catalyst.expressions.Expression, scala.collection.mutable.HashMap[equivalentExpressions.Expr,scala.collection.mutable.ArrayBuffer[org.apache.spark.sql.catalyst.expressions.Expression]]) => Unit
[error] required: org.apache.spark.sql.catalyst.expressions.Expression => ?
[error]     expressions.foreach(equivalentExpressions.addExprTree)
[error]                                               ^
Because addExprTree actually takes two arguments, it doesn't match foreach's expected argument type.
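The error above can be reproduced in miniature (editor's sketch; addTree here is a stand-in for addExprTree):

```scala
import scala.collection.mutable.ArrayBuffer

val sink = ArrayBuffer.empty[Int]
// A two-parameter method with a default, like addExprTree(expr, exprMap = ...):
def addTree(e: Int, buf: ArrayBuffer[Int] = sink): Unit = buf += e

// List(1, 2, 3).foreach(addTree)   // does not compile: eta-expands to
//                                  // (Int, ArrayBuffer[Int]) => Unit, but Int => ? is required
List(1, 2, 3).foreach(addTree(_))   // compiles: the default second argument is used

println(sink.sum) // 6
```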
Test build #110665 has finished for PR 25717 at commit
We're closing this PR because it hasn't been updated in a while. If you'd like to revive this PR, please reopen it!
What changes were proposed in this pull request?
We perform semantically equivalent subexpression elimination in Spark SQL. However, for expressions that are not semantically equivalent but are structurally equivalent, the current subexpression elimination generates too many similar functions. These functions share the same computation structure and differ only in the input slots of the current processing row.
For example, expression a is input[1] + input[2] and expression b is input[3] + input[4]. They are not semantically equivalent in Spark SQL, but they perform the same computation on different input data.
For such expressions, we can generate just one function and pass in the input slots at runtime. This reduces the length of the generated code text and saves compilation time.
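The grouping of structurally equivalent expressions can be sketched with a toy expression tree (editor's illustration; Spark's real implementation works on Expression and BoundReference):

```scala
// Bucket expressions by their ordinal-erased ("structural") form; one
// function can then be generated per bucket instead of one per expression.
sealed trait Expr
case class Ref(ordinal: Int) extends Expr
case class Plus(left: Expr, right: Expr) extends Expr

def erase(e: Expr): Expr = e match {
  case Ref(_)     => Ref(0)                     // placeholder slot
  case Plus(l, r) => Plus(erase(l), erase(r))
}

// input[1] + input[2] and input[3] + input[4] land in the same bucket:
val exprs = Seq(Plus(Ref(1), Ref(2)), Plus(Ref(3), Ref(4)), Ref(5))
val buckets = exprs.groupBy(erase)
println(buckets.size) // 2
```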
Why are the changes needed?
For a complex query, the current sub-expression elimination can generate too many similar functions, which leads to long generated code text and increased compilation time.
For example, run the following query:
The longest compilation time observed for this query was 25816.203394 ms. After this patch, the same compilation is reduced to 9143.778397 ms.
Does this PR introduce any user-facing change?
This doesn't introduce user-facing change.
This feature is controlled by a SQL config spark.sql.structuralSubexpressionElimination.enabled.
How was this patch tested?
Added tests.