
[SPARK-10371] [SQL] Implement subexpr elimination for UnsafeProjections #9480

Closed
wants to merge 5 commits into apache:master from nongli:spark-10371

Conversation

nongli
Contributor

@nongli nongli commented Nov 5, 2015

This patch adds the building blocks for codegening subexpr elimination and implements
it end to end for UnsafeProjection. The building blocks can be used to do the same thing
for other operators.

It introduces some utilities to compute common sub expressions. Expressions can be added to
this data structure. The expr and its children will be recursively matched against existing
expressions (ones previously added) and grouped into common groups. This is built using
the existing semanticEquals. It does not understand things like commutative or associative
expressions. This can be done as future work.
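As a rough picture of that utility, here is a minimal, self-contained sketch of the grouping idea (an illustration only, not the EquivalentExpressions class the patch adds; Node and its semanticHash/semanticEquals are stand-ins):

    import scala.collection.mutable

    // Stand-in expression node with semantic hashing/equality (in Spark these
    // would ignore cosmetic differences such as attribute names).
    case class Node(op: String, children: Seq[Node] = Nil) {
      def semanticHash: Int = (op, children.map(_.semanticHash)).hashCode
      def semanticEquals(o: Node): Boolean =
        op == o.op && children.size == o.children.size &&
          children.zip(o.children).forall { case (a, b) => a.semanticEquals(b) }
    }

    class EquivalenceGroups {
      // hash bucket -> groups of semantically equal expressions
      private val map = mutable.HashMap.empty[Int, mutable.Buffer[mutable.Buffer[Node]]]

      /** Records one occurrence of `e`; returns true if an equivalent expr was already seen. */
      def addExpr(e: Node): Boolean = {
        val groups = map.getOrElseUpdate(e.semanticHash, mutable.Buffer.empty)
        groups.find(g => g.head.semanticEquals(e)) match {
          case Some(g) => g += e; true
          case None    => groups += mutable.Buffer(e); false
        }
      }

      /** Recursively records `e` and all of its children. */
      def addExprTree(e: Node): Unit = { addExpr(e); e.children.foreach(addExprTree) }

      /** Groups with at least two occurrences, i.e. candidate common subexpressions. */
      def common: Seq[Seq[Node]] = map.values.flatten.filter(_.size >= 2).map(_.toSeq).toSeq
    }

For example, adding both branches of (a + b) + (a + b) via addExprTree would place the two Add(a, b) occurrences into one group, making that subtree a candidate for elimination.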

After building this data structure, the codegen process takes advantage of it by:

  1. Generating a helper function in the generated class that computes the common
    subexpression. This is done for all common subexpressions that have at least
    two occurrences and the expression tree is sufficiently complex.
  2. When generating the apply() function, if the helper function exists, call that
    instead of regenerating the expression tree. Repeated calls to the helper function
    shortcircuit the evaluation logic.
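
To make the two steps above concrete, here is a hand-written Scala analogue of what the generated projection ends up doing, for a made-up projection such as SELECT (a + b) * 2, (a + b) * 3 (the real patch emits Java source for UnsafeProjection; the names and types below are illustrative only):

    class IllustrativeProjection {
      // State for the common subexpression (a + b); reset for every input row.
      private var subExpr0Evaluated = false
      private var subExpr0Value: Long = _

      // The "helper function" for the common subexpression.
      private def evalSubExpr0(a: Long, b: Long): Unit = {
        if (!subExpr0Evaluated) {          // repeated calls short-circuit here
          subExpr0Value = a + b            // compute (a + b) exactly once per row
          subExpr0Evaluated = true
        }
      }

      // apply() calls the helper instead of regenerating (a + b) twice.
      def apply(a: Long, b: Long): (Long, Long) = {
        subExpr0Evaluated = false
        evalSubExpr0(a, b)                 // first column: (a + b) * 2
        val col0 = subExpr0Value * 2
        evalSubExpr0(a, b)                 // second column reuses the cached value
        val col1 = subExpr0Value * 3
        (col0, col1)
      }
    }

For example, new IllustrativeProjection().apply(1L, 2L) returns (6L, 9L) while evaluating a + b only once.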

* Returns the hash for this expression. Expressions that compute the same result, even if
* they differ cosmetically should return the same hash.
*/
def semanticHash() : Int = {
Contributor

Since all of the expressions are case classes, we probably don't need our own way to compute the hash.

Contributor Author

I don't think so. Looking at the comments on semanticEquals, we want to ignore cosmetic differences.

Contributor

Identical hash values don't imply identical values. I am suggesting using hashCode plus semanticEquals to identify the common expressions; that's also the motivation of semanticEquals. See AttributeReference for details.

Contributor Author

I think that's what I do: I put the exprs in a hash set using semanticHash/semanticEquals.

Contributor

Sorry, I am still confused: why can't we use hashCode instead?
The motivation of semanticEquals is to ignore AttributeReference.name when comparing AttributeReferences, and AttributeReference.hashCode also ignores AttributeReference.name, so I don't think the cosmetic differences really exist for hashCode. Do they?

Contributor

More discussion of semanticEquals can be found at #6587. Actually, I don't think we even need semanticEquals if we change the implementation of AttributeReference.equals; using semanticEquals rather than equals or == makes a lot of code complicated, like this one.

Contributor Author

I took a quick look at #6587 and I agree with Michael about semanticEquals. I think once equivalence classes are added, semanticEquals diverges even further from equals.

That being said, this patch is not the place to decide that. If we decide to remove semanticEquals, this patch can be updated trivially to use equals.

Regarding hashCode vs. semanticHash: I think hashCode does differ, no? It looks to me like it hashes everything, including the cosmetic parts, but please correct me if I'm wrong. In general, I think it makes sense to implement a matching hash if you implement an equals.

Contributor

If you look at the equals, semanticEquals, and hashCode methods of AttributeReference, you will see that they don't match: equals takes the name into consideration, but the other two do not. That's why I think we could also use hashCode instead of adding the new semanticHash method.

Anyway, it's not an external API, so we can change it back anytime; 1.6 is almost at code freeze, and this is critical for people now.
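
For reference, the contract being debated here, sketched REPL-style with a stand-in case class rather than the real AttributeReference: whenever two expressions are semantically equal, their semantic hashes must match, so that hash-based lookups group them even when they differ cosmetically.

    case class Attr(name: String, exprId: Long) {
      // equals (from the case class) considers the name; the semantic variants ignore it.
      def semanticEquals(other: Attr): Boolean = exprId == other.exprId
      def semanticHash(): Int = exprId.hashCode
    }

    val a = Attr("price", exprId = 7L)
    val b = Attr("price_renamed", exprId = 7L)  // cosmetically different, semantically the same

    assert(!(a == b))                            // equals sees the name
    assert(a.semanticEquals(b) && a.semanticHash() == b.semanticHash())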

@chenghao-intel
Contributor

My 2 cents: common expression elimination is a very interesting and useful optimization, and I did the same thing for Shark quite a long time ago. In my understanding, we probably need to add additional expressions/operators for sequential/conditional execution, instead of doing it directly via codegen.

e.g.

(a+b) + (a+b) => Add(Add(a, b), Add(a, b)) => Sequential(Alias(a1, Add(a, b)), Add(a1,a1))

That way we eliminate the common operations at the intermediate representation (IR) level, which makes the codegen code cleaner and simpler.
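
A toy rendering of that rewrite, using hypothetical node classes (these are not existing Catalyst nodes; Sequential and the a1 alias follow the example above):

    sealed trait TExpr
    case class Ref(name: String) extends TExpr
    case class Add(left: TExpr, right: TExpr) extends TExpr
    case class Alias(name: String, child: TExpr) extends TExpr          // binds an intermediate result
    case class Sequential(bindings: Seq[Alias], output: TExpr) extends TExpr

    // (a + b) + (a + b)  =>  Add(Add(a, b), Add(a, b))
    val original = Add(Add(Ref("a"), Ref("b")), Add(Ref("a"), Ref("b")))

    // =>  Sequential(Alias(a1, Add(a, b)), Add(a1, a1))
    val rewritten = Sequential(
      Seq(Alias("a1", Add(Ref("a"), Ref("b")))),
      Add(Ref("a1"), Ref("a1")))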

@SparkQA

SparkQA commented Nov 5, 2015

Test build #45075 has finished for PR 9480 at commit 2feafbc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    * class EquivalentExpressions
    * case class Expr(e: Expression)
    * case class SubExprEliminationState(

@nongli
Contributor Author

nongli commented Nov 5, 2015

@chenghao-intel I'm not sure what you are suggesting. How do you suggest we execute the expressions? The transformation you have as an example makes sense but how does it get evaluated? What does Sequential do for the execution side? Does it add an additional materialization to evaluate the earlier stages? How does this work with a more complicated expression DAG?

@yhuai
Contributor

yhuai commented Nov 6, 2015

I took one pass. Overall pretty good! @davies Can you also take a look?

@chenghao-intel
Contributor

In general, I think we'd better add more IR (intermediate representation; here that would be Expression or logical plan nodes) for the common subexpression cases. We can then identify the common subexpressions and transform them into the new IR, which should keep the codegen part of the work cleaner.

Since generated code is runtime behavior that is difficult to debug and maintain, more IR will be helpful for this case; that's also how a modern compiler works.

I don't think the approach I described can be done right now; it would be a big change. Actually, we have lots of similar optimizations: "Common Table Expression" (CTE) or self-join can be optimized in the same way. It would be cool if we could figure out an approach for general common-xxx elimination.

@@ -203,6 +203,10 @@ case class AttributeReference(
    case _ => false
  }

  override def semanticHash(): Int = {
    this.exprId.hashCode()
Contributor

Should we have a test case for this?

Contributor Author

Is there an easy way to test this? The end-to-end tests exercise this.

Contributor

What happens if we do not have this?

Contributor Author

It works fine currently because by the time this has run, the expression is a bound attribute. We should consider running the logic of finding subexpressions earlier, as chenghao suggests. I added a unit test that exposes the problem of not defining this. As a general rule, I think any class that defines semanticEquals should define semanticHash, even if it is the default implementation.

Contributor

I see, thanks!

Contributor

Nit: I think the indent is off here.

@nongli
Contributor Author

nongli commented Nov 6, 2015

> In general, I think we'd better add more IR (intermediate representation; here that would be Expression or logical plan nodes) for the common subexpression cases. We can then identify the common subexpressions and transform them into the new IR, which should keep the codegen part of the work cleaner.
>
> Since generated code is runtime behavior that is difficult to debug and maintain, more IR will be helpful for this case; that's also how a modern compiler works.
>
> I don't think the approach I described can be done right now; it would be a big change. Actually, we have lots of similar optimizations: "Common Table Expression" (CTE) or self-join can be optimized in the same way. It would be cool if we could figure out an approach for general common-xxx elimination.

I'm still not sure how you are suggesting we execute the version you are proposing. I'd guess you would do that by adding an additional project that creates a new intermediate row and substituting the common subexpression with a read from that new intermediate row. I think this is bad for performance: one of the big benefits of codegen is removing those "unnecessary" intermediate rows.

I agree that we should expand this logic in planning and analysis but I think the utility class to compute equivalence is reusable in other parts.

@davies
Contributor

davies commented Nov 6, 2015

This looks pretty good to me, just a few minor comments. Should we have a benchmark to see whether there is any regression for tiny common subexpressions? If there isn't much, then we don't need a cost-based approach.

@nongli
Contributor Author

nongli commented Nov 6, 2015

I ran this benchmark:

    // Register a 20 * 1024 * 1024 row table, then time a baseline query (Q1)
    // against one with a repeated subexpression (Q2).
    sqlContext.range(20 * 1024 * 1024).toDF("v").registerTempTable("t")
    for (i <- 0 until 4) {
      val t1 = System.currentTimeMillis()
      val c1 = sql("select v FROM t").rdd.filter(_ => true).count()
      val t2 = System.currentTimeMillis()
      val c2 = sql("select (v + v), (v + v) from t").rdd.filter(_ => true).count()
      val t3 = System.currentTimeMillis()

      println(s"Iteration $i")
      println(s"  Q1($c1): ${t2 - t1} ms")
      println(s"  Q2($c2): ${t3 - t2} ms")
    }

With subexpression elimination enabled:

Iteration 0
  Q1(20971520): 2304 ms
  Q2(20971520): 1595 ms
Iteration 1
  Q1(20971520): 1298 ms
  Q2(20971520): 1460 ms
Iteration 2
  Q1(20971520): 1351 ms
  Q2(20971520): 1435 ms
Iteration 3
  Q1(20971520): 1259 ms
  Q2(20971520): 1497 ms

With it disabled:

Iteration 0
  Q1(20971520): 2091 ms
  Q2(20971520): 1618 ms
Iteration 1
  Q1(20971520): 1277 ms
  Q2(20971520): 1505 ms
Iteration 2
  Q1(20971520): 1222 ms
  Q2(20971520): 1468 ms
Iteration 3
  Q1(20971520): 1239 ms
  Q2(20971520): 1489 ms

The difference is small, but subexpression elimination appears to be faster even in this case, where the exprs are simple.

@davies
Contributor

davies commented Nov 6, 2015

Thanks for the numbers, that's cool.

@SparkQA

SparkQA commented Nov 6, 2015

Test build #45259 has finished for PR 9480 at commit e65def9.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    * class EquivalentExpressions
    * case class Expr(e: Expression)
    * case class SubExprEliminationState(

@davies
Contributor

davies commented Nov 6, 2015

LGTM

* Returns all the equivalent sets of expressions.
*/
def getAllEquivalentExprs: Seq[Seq[Expression]] = {
equivalenceMap.map { case(k, v) => {
Contributor

equivalenceMap.values?
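
What the reviewer is pointing out, sketched with stand-in types (a hypothetical Expr instead of Spark's Expression, and an assumed value type for equivalenceMap): when the key is unused, mapping over the values reads more directly.

    import scala.collection.mutable

    case class Expr(repr: String)
    val equivalenceMap = mutable.HashMap.empty[Int, mutable.ArrayBuffer[Expr]]

    // Shape in the patch: destructures (k, v) but never uses k.
    def getAllEquivalentExprsA: Seq[Seq[Expr]] =
      equivalenceMap.map { case (k, v) => v.toSeq }.toSeq

    // Suggested shape: iterate the values directly.
    def getAllEquivalentExprsB: Seq[Seq[Expr]] =
      equivalenceMap.values.map(_.toSeq).toSeq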

@marmbrus
Contributor

marmbrus commented Nov 7, 2015

This is awesome. LGTM.

@SparkQA

SparkQA commented Nov 7, 2015

Test build #45263 has finished for PR 9480 at commit 3a10f7d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    * class EquivalentExpressions
    * case class Expr(e: Expression)
    * case class SubExprEliminationState(

@SparkQA

SparkQA commented Nov 7, 2015

Test build #1999 has finished for PR 9480 at commit 3a10f7d.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    * class EquivalentExpressions
    * case class Expr(e: Expression)
    * case class SubExprEliminationState(

@marmbrus
Contributor

marmbrus commented Nov 7, 2015

I don't think the most recent failure is your fault.

@rxin, did your patch break the build?

File "/home/jenkins/workspace/NewSparkPullRequestBuilder/python/pyspark/sql/readwriter.py", line 205, in pyspark.sql.readwriter.DataFrameReader.text
Failed example:
    df.collect()
Differences (ndiff with -expected +actual):
    - [Row(text=u'hello'), Row(text=u'this')]
    ?      ^ --                ^ --
    + [Row(value=u'hello'), Row(value=u'this')]
    ?      ^^^^                 ^^^^
**********************************************************************
   1 of   2 in pyspark.sql.readwriter.DataFrameReader.text
***Test Failed*** 1 failures.

@chenghao-intel
Contributor

> I'm still not sure how you are suggesting we execute the version you are proposing. I'd guess you would do that by adding an additional project that creates a new intermediate row and substituting the common subexpression with a read from that new intermediate row. I think this is bad for performance: one of the big benefits of codegen is removing those "unnecessary" intermediate rows.

I don't think I would do it the way you described. Actually, we are working on common-xxx elimination right now; more information will probably be shared later, and I will ping you then.

nongli and others added 4 commits November 9, 2015 11:53
This patch adds the building blocks for codegening subexpr elimination and implements
it end to end for UnsafeProjection. The building blocks can be used to do the same thing
for other operators.

It introduces some utilities to compute common sub expressions. Expressions can be added to
this data structure. The expr and its children will be recursively matched against existing
expressions (ones previously added) and grouped into common groups. This is built using
the existing `semanticEquals`. It does not understand things like commutative or associative
expressions. This can be done as future work.

After building this data structure, the codegen process takes advantage of it by:
  1. Generating a helper function in the generated class that computes the common
     subexpression. This is done for all common subexpressions that have at least
     two occurrences and the expression tree is sufficiently complex.
  2. When generating the apply() function, if the helper function exists, call that
     instead of regenerating the expression tree. Repeated calls to the helper function
     shortcircuit the evaluation logic.
@SparkQA

SparkQA commented Nov 9, 2015

Test build #45384 has finished for PR 9480 at commit 5dbc047.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    * class EquivalentExpressions
    * case class Expr(e: Expression)
    * case class SubExprEliminationState(

@SparkQA

SparkQA commented Nov 10, 2015

Test build #45443 has finished for PR 9480 at commit 6cf0186.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    * class EquivalentExpressions
    * case class Expr(e: Expression)
    * case class SubExprEliminationState(

@marmbrus
Contributor

test this please

@SparkQA

SparkQA commented Nov 10, 2015

Test build #45468 has finished for PR 9480 at commit 6cf0186.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    * class EquivalentExpressions
    * case class Expr(e: Expression)
    * case class SubExprEliminationState(

@marmbrus
Contributor

I think I can resolve the conflicts manually. Merging to master and 1.6.

Thanks!

asfgit pushed a commit that referenced this pull request Nov 10, 2015
This patch adds the building blocks for codegening subexpr elimination and implements
it end to end for UnsafeProjection. The building blocks can be used to do the same thing
for other operators.

It introduces some utilities to compute common sub expressions. Expressions can be added to
this data structure. The expr and its children will be recursively matched against existing
expressions (ones previously added) and grouped into common groups. This is built using
the existing `semanticEquals`. It does not understand things like commutative or associative
expressions. This can be done as future work.

After building this data structure, the codegen process takes advantage of it by:
  1. Generating a helper function in the generated class that computes the common
     subexpression. This is done for all common subexpressions that have at least
     two occurrences and the expression tree is sufficiently complex.
  2. When generating the apply() function, if the helper function exists, call that
     instead of regenerating the expression tree. Repeated calls to the helper function
     shortcircuit the evaluation logic.

Author: Nong Li <nong@databricks.com>
Author: Nong Li <nongli@gmail.com>

This patch had conflicts when merged, resolved by
Committer: Michael Armbrust <michael@databricks.com>

Closes #9480 from nongli/spark-10371.

(cherry picked from commit 87aedc4)
Signed-off-by: Michael Armbrust <michael@databricks.com>
@asfgit asfgit closed this in 87aedc4 Nov 10, 2015
@nongli nongli deleted the spark-10371 branch November 19, 2015 21:05