New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-10371] [SQL] Implement subexpr elimination for UnsafeProjections #9480
Conversation
* Returns the hash for this expression. Expressions that compute the same result, even if | ||
* they differ cosmetically should return the same hash. | ||
*/ | ||
def semanticHash() : Int = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since all of the expressions are the case class, probably we don't need to our own way to computeHash
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think so. Looking at thecomments on semanticEquals, we want to ignore cosmetic differences.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Identical hash values doesn't mean the identical values, I am suggesting to use the hashCode
plus semanticEquals
to identity the common expression, that's also the motivation of semanticEquals
. See AttributeReference
for details.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that's what I do. I put the exprs in a hash set using semantic hash/semantic equals.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I am still confusing, why we can't use the hashCode
instead.
The motivation of semanticEquals
is to ignore the AttributeReference.name
in comparison for AttributeReference
, and the AttributeReference.hashCode
also does ignore the AttributeReference.name
, I don't think the cosmetic differences
really exists for hashCode
. Isn't it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And there is more discussion on the semanticEquals
can be found at #6587, even, I don't think we need the semanticEquals
if we changed the implementation of AttributeReference.equals
, as it does make lots of code complicated like this one, by using the semanticEquals
, other than equals
or ==
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I took a quick look at 6587 and I agree with michael about semantic equals. I think when equivalence classes are added, semantic equals is even more different than equals.
That being said, this patch is not the motivation to do this. If we decide to remove semanticEquals, this patch can be updated trivially to use equals.
Regarding hashCode vs semanticHash code, I think it does no? It looks to me like the hash everything, including the cosmetic stuff but please correct me if I'm wrong. In general, i think it makes sense to implement hash if you implement equals.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When you look at the methods equals
, semanticEquals
and hashCode
of the AttributeReference
, you will see that they are not matched, as equals
will take consideration of the name
, but the other 2 are not, that's why I am thinking we can also use the hashCode
, instead of adding the new method semanticHash
.
Anyway, it's not an external API, we can change it back anytime, as 1.6 is almost code freeze, and this is critical for people now.
My 2 cents, Common Expression Elimination is a very interesting/useful optimization, and I did the same thing for Shark quite long time ago, for my understanding, we probably need to add additional expression/operator for Sequential/Conditional execution. instead of do it directly via code gen. e.g.
So we can eliminate the common operations for Intermediate Representation (IR), which will make the codegen part code more clean and simple. |
Test build #45075 has finished for PR 9480 at commit
|
@chenghao-intel I'm not sure what you are suggesting. How do you suggest we execute the expressions? The transformation you have as an example makes sense but how does it get evaluated? What does Sequential do for the execution side? Does it add an additional materialization to evaluate the earlier stages? How does this work with a more complicated expression DAG? |
I took one pass. Overall pretty good! @davies Can you also take a look? |
In general, I think we'd better add more IR (Intermediate Representation, should be the Expression or Logical Plan node here) for common sub expression cases, and we then we can identify the common sub expressions and transform to the new IRs, which should be more clean for code gen part of work. As the codegen code is a runtime behavior, which is difficult to debug and maintain, more IR will be helpful for this case, and that's also how a modern compiler works. And I don't think the approach I described can be done right now, it's will be a big change, actually, we have lots of similar optimization like "Common Table Expression" (CTE), or self join can be optimized in the same way, it will be cool if we can figure a approach for the general common xxx elimination. |
@@ -203,6 +203,10 @@ case class AttributeReference( | |||
case _ => false | |||
} | |||
|
|||
override def semanticHash(): Int = { | |||
this.exprId.hashCode() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we have a test case for this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there an easy way to test this? The end to end tests to exercise this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens if we do not have this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It works fine currently because by the time this has run, the expression is a bound attribute. We should consider running the logic of finding subexpressions earlier, as chenghao suggests. I added a unit test that exposes the problem of not defining this. As a general rule, I think any class that defines semanticEquals should define semanticHash, even if it is the default implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: I think the indent is off here.
I'm still not sure how you are suggesting we execute the version you are proposing. I'd guess you do that by adding an additional project that creates a new intermediate row and substitute the common subexpression to read from that new intermediate row. I think this is bad for performance. One of the big benefits of codegen is to remove those "unnecessary" intermediate rows. I agree that we should expand this logic in planning and analysis but I think the utility class to compute equivalence is reusable in other parts. |
This looks pretty good to me, just a few minor comments. Should we have a benchmark to see is there any regression for tiny common subexpression? If not much, then we don't need to have a cost based one. |
I ran this benchmark:
With subexpression elimination enabled:
The difference is small but it appears to be faster even in this case when the exprs are simple. |
Thanks for the numbers, that's cool. |
Test build #45259 has finished for PR 9480 at commit
|
LGTM |
* Returns all the equivalent sets of expressions. | ||
*/ | ||
def getAllEquivalentExprs: Seq[Seq[Expression]] = { | ||
equivalenceMap.map { case(k, v) => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
equivalenceMap.values
?
This is awesome. LGTM. |
Test build #45263 has finished for PR 9480 at commit
|
Test build #1999 has finished for PR 9480 at commit
|
I don't think the most recent failure is your fault. @rxin, did you patch break the build?
|
I don't think I will do that in the way you described, actually we are working on the common xxx elimination right now, probably more information will be shared later, will ping you then. |
This patch adds the building blocks for codegening subexpr elimination and implements it end to end for UnsafeProjection. The building blocks can be used to do the same thing for other operators. It introduces some utilities to compute common sub expressions. Expressions can be added to this data structure. The expr and its children will be recursively matched against existing expressions (ones previously added) and grouped into common groups. This is built using the existing `semanticEquals`. It does not understand things like commutative or associative expressions. This can be done as future work. After building this data structure, the codegen process takes advantage of it by: 1. Generating a helper function in the generated class that computes the common subexpression. This is done for all common subexpressions that have at least two occurrences and the expression tree is sufficiently complex. 2. When generating the apply() function, if the helper function exists, call that instead of regenerating the expression tree. Repeated calls to the helper function shortcircuit the evaluation logic.
Test build #45384 has finished for PR 9480 at commit
|
Test build #45443 has finished for PR 9480 at commit
|
test this please |
Test build #45468 has finished for PR 9480 at commit
|
I think I can resolve the conflicts manually. Merging to master and 1.6. Thanks! |
This patch adds the building blocks for codegening subexpr elimination and implements it end to end for UnsafeProjection. The building blocks can be used to do the same thing for other operators. It introduces some utilities to compute common sub expressions. Expressions can be added to this data structure. The expr and its children will be recursively matched against existing expressions (ones previously added) and grouped into common groups. This is built using the existing `semanticEquals`. It does not understand things like commutative or associative expressions. This can be done as future work. After building this data structure, the codegen process takes advantage of it by: 1. Generating a helper function in the generated class that computes the common subexpression. This is done for all common subexpressions that have at least two occurrences and the expression tree is sufficiently complex. 2. When generating the apply() function, if the helper function exists, call that instead of regenerating the expression tree. Repeated calls to the helper function shortcircuit the evaluation logic. Author: Nong Li <nong@databricks.com> Author: Nong Li <nongli@gmail.com> This patch had conflicts when merged, resolved by Committer: Michael Armbrust <michael@databricks.com> Closes #9480 from nongli/spark-10371. (cherry picked from commit 87aedc4) Signed-off-by: Michael Armbrust <michael@databricks.com>
This patch adds the building blocks for codegening subexpr elimination and implements
it end to end for UnsafeProjection. The building blocks can be used to do the same thing
for other operators.
It introduces some utilities to compute common sub expressions. Expressions can be added to
this data structure. The expr and its children will be recursively matched against existing
expressions (ones previously added) and grouped into common groups. This is built using
the existing
semanticEquals
. It does not understand things like commutative or associativeexpressions. This can be done as future work.
After building this data structure, the codegen process takes advantage of it by:
subexpression. This is done for all common subexpressions that have at least
two occurrences and the expression tree is sufficiently complex.
instead of regenerating the expression tree. Repeated calls to the helper function
shortcircuit the evaluation logic.