
[SPARK-10371] [SQL] Implement subexpr elimination for UnsafeProjections #9480

Closed
wants to merge 5 commits into apache:master from nongli:spark-10371

Conversation

nongli
Contributor

@nongli nongli commented Nov 5, 2015

This patch adds the building blocks for codegening subexpr elimination and implements
it end to end for UnsafeProjection. The building blocks can be used to do the same thing
for other operators.

It introduces some utilities to compute common sub expressions. Expressions can be added to
this data structure. The expr and its children will be recursively matched against existing
expressions (ones previously added) and grouped into common groups. This is built using
the existing semanticEquals. It does not understand things like commutative or associative
expressions. This can be done as future work.
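As a rough picture of that utility, here is a minimal, self-contained sketch of the grouping idea (an illustration only, not the EquivalentExpressions class the patch adds; Node and its semanticHash/semanticEquals are stand-ins):

    import scala.collection.mutable

    // Stand-in expression node with semantic hashing/equality (in Spark these
    // would ignore cosmetic differences such as attribute names).
    case class Node(op: String, children: Seq[Node] = Nil) {
      def semanticHash: Int = (op, children.map(_.semanticHash)).hashCode
      def semanticEquals(o: Node): Boolean =
        op == o.op && children.size == o.children.size &&
          children.zip(o.children).forall { case (a, b) => a.semanticEquals(b) }
    }

    class EquivalenceGroups {
      // hash bucket -> groups of semantically equal expressions
      private val map = mutable.HashMap.empty[Int, mutable.Buffer[mutable.Buffer[Node]]]

      /** Records one occurrence of `e`; returns true if an equivalent expr was already seen. */
      def addExpr(e: Node): Boolean = {
        val groups = map.getOrElseUpdate(e.semanticHash, mutable.Buffer.empty)
        groups.find(g => g.head.semanticEquals(e)) match {
          case Some(g) => g += e; true
          case None    => groups += mutable.Buffer(e); false
        }
      }

      /** Recursively records `e` and all of its children. */
      def addExprTree(e: Node): Unit = { addExpr(e); e.children.foreach(addExprTree) }

      /** Groups with at least two occurrences, i.e. candidate common subexpressions. */
      def common: Seq[Seq[Node]] = map.values.flatten.filter(_.size >= 2).map(_.toSeq).toSeq
    }

For example, adding both branches of (a + b) + (a + b) via addExprTree would place the two Add(a, b) occurrences into one group, making that subtree a candidate for elimination.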

After building this data structure, the codegen process takes advantage of it by:

  1. Generating a helper function in the generated class that computes the common
    subexpression. This is done for all common subexpressions that have at least
    two occurrences and the expression tree is sufficiently complex.
  2. When generating the apply() function, if the helper function exists, call that
    instead of regenerating the expression tree. Repeated calls to the helper function
    shortcircuit the evaluation logic.
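
To make the two steps above concrete, here is a hand-written Scala analogue of what the generated projection ends up doing, for a made-up projection such as SELECT (a + b) * 2, (a + b) * 3 (the real patch emits Java source for UnsafeProjection; the names and types below are illustrative only):

    class IllustrativeProjection {
      // State for the common subexpression (a + b); reset for every input row.
      private var subExpr0Evaluated = false
      private var subExpr0Value: Long = _

      // The "helper function" for the common subexpression.
      private def evalSubExpr0(a: Long, b: Long): Unit = {
        if (!subExpr0Evaluated) {          // repeated calls short-circuit here
          subExpr0Value = a + b            // compute (a + b) exactly once per row
          subExpr0Evaluated = true
        }
      }

      // apply() calls the helper instead of regenerating (a + b) twice.
      def apply(a: Long, b: Long): (Long, Long) = {
        subExpr0Evaluated = false
        evalSubExpr0(a, b)                 // first column: (a + b) * 2
        val col0 = subExpr0Value * 2
        evalSubExpr0(a, b)                 // second column reuses the cached value
        val col1 = subExpr0Value * 3
        (col0, col1)
      }
    }

For example, new IllustrativeProjection().apply(1L, 2L) returns (6L, 9L) while evaluating a + b only once.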

* Returns the hash for this expression. Expressions that compute the same result, even if
* they differ cosmetically should return the same hash.
*/
def semanticHash() : Int = {
Contributor

Since all of the expressions are case classes, we probably don't need our own way to compute the hash.

Contributor Author

I don't think so. Looking at the comments on semanticEquals, we want to ignore cosmetic differences.

Contributor

Identical hash values don't imply identical values. I am suggesting using hashCode plus semanticEquals to identify the common expressions; that's also the motivation of semanticEquals. See AttributeReference for details.

Contributor Author

I think that's what I do: I put the exprs in a hash set using semanticHash/semanticEquals.

Contributor

Sorry, I am still confused: why can't we use hashCode instead?
The motivation of semanticEquals is to ignore AttributeReference.name when comparing AttributeReferences, and AttributeReference.hashCode also ignores AttributeReference.name, so I don't think the cosmetic differences really exist for hashCode. Do they?

Contributor

More discussion of semanticEquals can be found at #6587. Actually, I don't think we even need semanticEquals if we change the implementation of AttributeReference.equals; using semanticEquals rather than equals or == makes a lot of code complicated, like this one.

Contributor Author

I took a quick look at #6587 and I agree with Michael about semanticEquals. I think once equivalence classes are added, semanticEquals diverges even further from equals.

That being said, this patch is not the place to decide that. If we decide to remove semanticEquals, this patch can be updated trivially to use equals.

Regarding hashCode vs. semanticHash: I think hashCode does differ, no? It looks to me like it hashes everything, including the cosmetic parts, but please correct me if I'm wrong. In general, I think it makes sense to implement a matching hash if you implement an equals.

Contributor

If you look at the equals, semanticEquals, and hashCode methods of AttributeReference, you will see that they don't match: equals takes the name into consideration, but the other two do not. That's why I think we could also use hashCode instead of adding the new semanticHash method.

Anyway, it's not an external API, so we can change it back anytime; 1.6 is almost at code freeze, and this is critical for people now.
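
For reference, the contract being debated here, sketched REPL-style with a stand-in case class rather than the real AttributeReference: whenever two expressions are semantically equal, their semantic hashes must match, so that hash-based lookups group them even when they differ cosmetically.

    case class Attr(name: String, exprId: Long) {
      // equals (from the case class) considers the name; the semantic variants ignore it.
      def semanticEquals(other: Attr): Boolean = exprId == other.exprId
      def semanticHash(): Int = exprId.hashCode
    }

    val a = Attr("price", exprId = 7L)
    val b = Attr("price_renamed", exprId = 7L)  // cosmetically different, semantically the same

    assert(!(a == b))                            // equals sees the name
    assert(a.semanticEquals(b) && a.semanticHash() == b.semanticHash())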

@chenghao-intel
Contributor

My 2 cents: common expression elimination is a very interesting and useful optimization, and I did the same thing for Shark quite a long time ago. In my understanding, we probably need to add additional expressions/operators for sequential/conditional execution, instead of doing it directly via codegen.

e.g.

(a+b) + (a+b) => Add(Add(a, b), Add(a, b)) => Sequential(Alias(a1, Add(a, b)), Add(a1,a1))

That way we eliminate the common operations at the intermediate representation (IR) level, which makes the codegen code cleaner and simpler.
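
A toy rendering of that rewrite, using hypothetical node classes (these are not existing Catalyst nodes; Sequential and the a1 alias follow the example above):

    sealed trait TExpr
    case class Ref(name: String) extends TExpr
    case class Add(left: TExpr, right: TExpr) extends TExpr
    case class Alias(name: String, child: TExpr) extends TExpr          // binds an intermediate result
    case class Sequential(bindings: Seq[Alias], output: TExpr) extends TExpr

    // (a + b) + (a + b)  =>  Add(Add(a, b), Add(a, b))
    val original = Add(Add(Ref("a"), Ref("b")), Add(Ref("a"), Ref("b")))

    // =>  Sequential(Alias(a1, Add(a, b)), Add(a1, a1))
    val rewritten = Sequential(
      Seq(Alias("a1", Add(Ref("a"), Ref("b")))),
      Add(Ref("a1"), Ref("a1")))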

@SparkQA

SparkQA commented Nov 5, 2015

Test build #45075 has finished for PR 9480 at commit 2feafbc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    * class EquivalentExpressions
    * case class Expr(e: Expression)
    * case class SubExprEliminationState(

@nongli
Contributor Author

nongli commented Nov 5, 2015

@chenghao-intel I'm not sure what you are suggesting. How do you suggest we execute the expressions? The transformation you have as an example makes sense but how does it get evaluated? What does Sequential do for the execution side? Does it add an additional materialization to evaluate the earlier stages? How does this work with a more complicated expression DAG?

@yhuai
Contributor

yhuai commented Nov 6, 2015

I took one pass. Overall pretty good! @davies Can you also take a look?

@chenghao-intel
Contributor

In general, I think we'd better add more IR (intermediate representation; here that would be Expression or logical plan nodes) for the common subexpression cases. We can then identify the common subexpressions and transform them into the new IR, which should keep the codegen part of the work cleaner.

Since generated code is runtime behavior that is difficult to debug and maintain, more IR will be helpful for this case; that's also how a modern compiler works.

I don't think the approach I described can be done right now; it would be a big change. Actually, we have lots of similar optimizations: "Common Table Expression" (CTE) or self-join can be optimized in the same way. It would be cool if we could figure out an approach for general common-xxx elimination.

@@ -203,6 +203,10 @@ case class AttributeReference(
    case _ => false
  }

  override def semanticHash(): Int = {
    this.exprId.hashCode()
Contributor

Should we have a test case for this?

Contributor Author

Is there an easy way to test this? The end-to-end tests exercise this.

Contributor

What happens if we do not have this?

Contributor Author

It works fine currently because by the time this has run, the expression is a bound attribute. We should consider running the logic of finding subexpressions earlier, as chenghao suggests. I added a unit test that exposes the problem of not defining this. As a general rule, I think any class that defines semanticEquals should define semanticHash, even if it is the default implementation.

Contributor

I see, thanks!

Contributor

Nit: I think the indent is off here.

@nongli
Contributor Author

nongli commented Nov 6, 2015

> In general, I think we'd better add more IR (intermediate representation; here that would be Expression or logical plan nodes) for the common subexpression cases. We can then identify the common subexpressions and transform them into the new IR, which should keep the codegen part of the work cleaner.
>
> Since generated code is runtime behavior that is difficult to debug and maintain, more IR will be helpful for this case; that's also how a modern compiler works.
>
> I don't think the approach I described can be done right now; it would be a big change. Actually, we have lots of similar optimizations: "Common Table Expression" (CTE) or self-join can be optimized in the same way. It would be cool if we could figure out an approach for general common-xxx elimination.

I'm still not sure how you are suggesting we execute the version you are proposing. I'd guess you would do that by adding an additional project that creates a new intermediate row and substituting the common subexpression with a read from that new intermediate row. I think this is bad for performance: one of the big benefits of codegen is removing those "unnecessary" intermediate rows.

I agree that we should expand this logic in planning and analysis but I think the utility class to compute equivalence is reusable in other parts.

@davies
Contributor

davies commented Nov 6, 2015

This looks pretty good to me, just a few minor comments. Should we have a benchmark to see whether there is any regression for tiny common subexpressions? If there isn't much, then we don't need a cost-based approach.

@nongli
Contributor Author

nongli commented Nov 6, 2015

I ran this benchmark:

    // Register a 20 * 1024 * 1024 row table, then time a baseline query (Q1)
    // against one with a repeated subexpression (Q2).
    sqlContext.range(20 * 1024 * 1024).toDF("v").registerTempTable("t")
    for (i <- 0 until 4) {
      val t1 = System.currentTimeMillis()
      val c1 = sql("select v FROM t").rdd.filter(_ => true).count()
      val t2 = System.currentTimeMillis()
      val c2 = sql("select (v + v), (v + v) from t").rdd.filter(_ => true).count()
      val t3 = System.currentTimeMillis()

      println(s"Iteration $i")
      println(s"  Q1($c1): ${t2 - t1} ms")
      println(s"  Q2($c2): ${t3 - t2} ms")
    }

With subexpression elimination enabled:

Iteration 0
  Q1(20971520): 2304 ms
  Q2(20971520): 1595 ms
Iteration 1
  Q1(20971520): 1298 ms
  Q2(20971520): 1460 ms
Iteration 2
  Q1(20971520): 1351 ms
  Q2(20971520): 1435 ms
Iteration 3
  Q1(20971520): 1259 ms
  Q2(20971520): 1497 ms

With it disabled:

Iteration 0
  Q1(20971520): 2091 ms
  Q2(20971520): 1618 ms
Iteration 1
  Q1(20971520): 1277 ms
  Q2(20971520): 1505 ms
Iteration 2
  Q1(20971520): 1222 ms
  Q2(20971520): 1468 ms
Iteration 3
  Q1(20971520): 1239 ms
  Q2(20971520): 1489 ms

The difference is small, but subexpression elimination appears to be faster even in this case, where the exprs are simple.

@davies
Contributor

davies commented Nov 6, 2015

Thanks for the numbers, that's cool.

@SparkQA

SparkQA commented Nov 6, 2015

Test build #45259 has finished for PR 9480 at commit e65def9.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    * class EquivalentExpressions
    * case class Expr(e: Expression)
    * case class SubExprEliminationState(

@davies
Contributor

davies commented Nov 6, 2015

LGTM

* Returns all the equivalent sets of expressions.
*/
def getAllEquivalentExprs: Seq[Seq[Expression]] = {
equivalenceMap.map { case(k, v) => {
Contributor

equivalenceMap.values?
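
What the reviewer is pointing out, sketched with stand-in types (a hypothetical Expr instead of Spark's Expression, and an assumed value type for equivalenceMap): when the key is unused, mapping over the values reads more directly.

    import scala.collection.mutable

    case class Expr(repr: String)
    val equivalenceMap = mutable.HashMap.empty[Int, mutable.ArrayBuffer[Expr]]

    // Shape in the patch: destructures (k, v) but never uses k.
    def getAllEquivalentExprsA: Seq[Seq[Expr]] =
      equivalenceMap.map { case (k, v) => v.toSeq }.toSeq

    // Suggested shape: iterate the values directly.
    def getAllEquivalentExprsB: Seq[Seq[Expr]] =
      equivalenceMap.values.map(_.toSeq).toSeq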

@marmbrus
Contributor

marmbrus commented Nov 7, 2015

This is awesome. LGTM.

@SparkQA

SparkQA commented Nov 7, 2015

Test build #45263 has finished for PR 9480 at commit 3a10f7d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    * class EquivalentExpressions
    * case class Expr(e: Expression)
    * case class SubExprEliminationState(

@SparkQA

SparkQA commented Nov 7, 2015

Test build #1999 has finished for PR 9480 at commit 3a10f7d.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    * class EquivalentExpressions
    * case class Expr(e: Expression)
    * case class SubExprEliminationState(

@marmbrus
Contributor

marmbrus commented Nov 7, 2015

I don't think the most recent failure is your fault.

@rxin, did your patch break the build?

File "/home/jenkins/workspace/NewSparkPullRequestBuilder/python/pyspark/sql/readwriter.py", line 205, in pyspark.sql.readwriter.DataFrameReader.text
Failed example:
    df.collect()
Differences (ndiff with -expected +actual):
    - [Row(text=u'hello'), Row(text=u'this')]
    ?      ^ --                ^ --
    + [Row(value=u'hello'), Row(value=u'this')]
    ?      ^^^^                 ^^^^
**********************************************************************
   1 of   2 in pyspark.sql.readwriter.DataFrameReader.text
***Test Failed*** 1 failures.

@chenghao-intel
Contributor

> I'm still not sure how you are suggesting we execute the version you are proposing. I'd guess you would do that by adding an additional project that creates a new intermediate row and substituting the common subexpression with a read from that new intermediate row. I think this is bad for performance: one of the big benefits of codegen is removing those "unnecessary" intermediate rows.

I don't think I would do it the way you described. Actually, we are working on common-xxx elimination right now; more information will probably be shared later, and I will ping you then.

nongli and others added 4 commits November 9, 2015 11:53
This patch adds the building blocks for codegening subexpr elimination and implements
it end to end for UnsafeProjection. The building blocks can be used to do the same thing
for other operators.

It introduces some utilities to compute common sub expressions. Expressions can be added to
this data structure. The expr and its children will be recursively matched against existing
expressions (ones previously added) and grouped into common groups. This is built using
the existing `semanticEquals`. It does not understand things like commutative or associative
expressions. This can be done as future work.

After building this data structure, the codegen process takes advantage of it by:
  1. Generating a helper function in the generated class that computes the common
     subexpression. This is done for all common subexpressions that have at least
     two occurrences and the expression tree is sufficiently complex.
  2. When generating the apply() function, if the helper function exists, call that
     instead of regenerating the expression tree. Repeated calls to the helper function
     shortcircuit the evaluation logic.
@SparkQA

SparkQA commented Nov 9, 2015

Test build #45384 has finished for PR 9480 at commit 5dbc047.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    * class EquivalentExpressions
    * case class Expr(e: Expression)
    * case class SubExprEliminationState(

@SparkQA

SparkQA commented Nov 10, 2015

Test build #45443 has finished for PR 9480 at commit 6cf0186.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    * class EquivalentExpressions
    * case class Expr(e: Expression)
    * case class SubExprEliminationState(

@marmbrus
Contributor

test this please

@SparkQA

SparkQA commented Nov 10, 2015

Test build #45468 has finished for PR 9480 at commit 6cf0186.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    * class EquivalentExpressions
    * case class Expr(e: Expression)
    * case class SubExprEliminationState(

@marmbrus
Contributor

I think I can resolve the conflicts manually. Merging to master and 1.6.

Thanks!

asfgit pushed a commit that referenced this pull request Nov 10, 2015
This patch adds the building blocks for codegening subexpr elimination and implements
it end to end for UnsafeProjection. The building blocks can be used to do the same thing
for other operators.

It introduces some utilities to compute common sub expressions. Expressions can be added to
this data structure. The expr and its children will be recursively matched against existing
expressions (ones previously added) and grouped into common groups. This is built using
the existing `semanticEquals`. It does not understand things like commutative or associative
expressions. This can be done as future work.

After building this data structure, the codegen process takes advantage of it by:
  1. Generating a helper function in the generated class that computes the common
     subexpression. This is done for all common subexpressions that have at least
     two occurrences and the expression tree is sufficiently complex.
  2. When generating the apply() function, if the helper function exists, call that
     instead of regenerating the expression tree. Repeated calls to the helper function
     shortcircuit the evaluation logic.

Author: Nong Li <nong@databricks.com>
Author: Nong Li <nongli@gmail.com>

This patch had conflicts when merged, resolved by
Committer: Michael Armbrust <michael@databricks.com>

Closes #9480 from nongli/spark-10371.

(cherry picked from commit 87aedc4)
Signed-off-by: Michael Armbrust <michael@databricks.com>
@asfgit asfgit closed this in 87aedc4 Nov 10, 2015
@nongli nongli deleted the spark-10371 branch November 19, 2015 21:05