Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP][Spark-SQL] Optimize the Constant Folding for Expression #482

Closed

Conversation

chenghao-intel
Copy link
Contributor

Currently, expression does not support the "constant null" well in constant folding.
e.g. Sum(a, 0) actually always produces Literal(0, NumericType) in runtime.

For example:

explain select isnull(key+null)  from src;
== Logical Plan ==
Project [HiveGenericUdf#isnull((key#30 + CAST(null, IntegerType))) AS c_0#28]
 MetastoreRelation default, src, None

== Optimized Logical Plan ==
Project [true AS c_0#28]
 MetastoreRelation default, src, None

== Physical Plan ==
Project [true AS c_0#28]
 HiveTableScan [], (MetastoreRelation default, src, None), None

I've create a new Optimization rule called NullPropagation for such kind of constant folding.

@AmplabJenkins
Copy link

Can one of the admins verify this patch?

@AndreSchumacher
Copy link
Contributor

Jenkins, test this please.

@AmplabJenkins
Copy link

Merged build triggered.

@AmplabJenkins
Copy link

Merged build started.

@AmplabJenkins
Copy link

Merged build finished.

@AmplabJenkins
Copy link

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14385/

// always be null
object alwaysNull extends Nullability(-1)
// never be null
object neverNull extends Nullability(1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a naming issue, objects like this should be capitalized (PossibleNull, AlwaysNull, NeverNull). It might also be good to rename PossibleNull to PossiblyNull.

@marmbrus
Copy link
Contributor

I'm not sure if we want to overload the nullability interface in this way. This nullability bit currently is mostly intended for use when determining storage layout of tuples (i.e. do we need to keep a null bit around or can we operate on primitives). What is going on here is something like constraint propagation + null domination, which we might be able to accomplish more generally.

Also, I'm not sure if its a good idea to provide a default implementation of nullability as there are quite a few operators whose nullability isn't defined by their children (many aggregate operators for example).

Perhaps there is a simpler way to do this using an improved rule in the optimizer.

@chenghao-intel
Copy link
Contributor Author

@marmbrus I will review the nullability for aggregate operators etc. and it also may could be folded by statically analyze the nullability. (for example "sum(a)" but "a" could be an expression "Always Be Null").

I was thinking not to extend the original nullability of expression, however only the "Never Be Null" and "Always Be Null" will be helpful in constant folding rule, but the boolean value of nullability is only represented as "Never Be Null" or "Possibly Be Null".

I think we need to find some other way to show "Always Be Null" state for expression if we don't want to extend the nullability.

@chenghao-intel
Copy link
Contributor Author

I have 2 options for the changes:
Option 1) is add an new method in expression, which tells the optimize rule, what it produces if part of its arguments are "always be null" / "possibly be null" / "never be null". the rule may test every expressions.
Option 2) the rule itself enumerates all of the expressions, and understand what if part of the arguments is "always be null" etc., like the BooleanSimplification does.

But still seems the "constraint propagation" is the best. Not sure if you have better idea on this.

@marmbrus
Copy link
Contributor

I like the option of doing it similar to BooleanSimplification. That seems nice and explicit. I also think this rule will play nicely with more general constraint propagation when we add that in the future.

@chenghao-intel
Copy link
Contributor Author

@marmbrus , can you please review this again? I've rewritten the code as rule-based.

// Skip redundant folding of literals.
case l: Literal => l
// if it's foldable
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can omit this comment since the line beneath reads if e.foldable.

@marmbrus
Copy link
Contributor

Thanks for working on this! I think the rule based version is a lot clearer.

@chenghao-intel
Copy link
Contributor Author

Thank you @marmbrus , I've updated the code a little bit.

I've created a new rule "NullPropagation", it transforms the expression tree from bottom to top, and keep unchanged for the rule constant folding.

@marmbrus
Copy link
Contributor

test this please

@AmplabJenkins
Copy link

Merged build triggered.

@AmplabJenkins
Copy link

Merged build started.

@AmplabJenkins
Copy link

Merged build finished.

@AmplabJenkins
Copy link

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14534/

@marmbrus
Copy link
Contributor

Can you please fix the style issues.

@AmplabJenkins
Copy link

Merged build finished. All automated tests passed.

@AmplabJenkins
Copy link

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14637/

@@ -114,37 +114,37 @@ package object dsl {
def attr = analysis.UnresolvedAttribute(s)

/** Creates a new AttributeReference of type boolean */
def boolean = AttributeReference(s, BooleanType, nullable = false)()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the rationale for changing all of these?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think "def boolean" / "def string" should return an Attribute with nullability = true, that does not harm for the correctness, otherwise, it may bring a wrong hint for the optimization of rule NullPropagation.

For example:

val row = new GenericRow(Array[Any] ("a", null))
val c1 = 'a.string.at(0) 
val c2 = 'b.string.at(1) // nullable should be true, otherwise does't reflect the real situation.
assert(evaluate(IsNull(c2), row) == true)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point.

@chenghao-intel
Copy link
Contributor Author

@marmbrus I have updated the code as your comment, can you review/retest it?

@marmbrus
Copy link
Contributor

marmbrus commented May 5, 2014

test this please

@AmplabJenkins
Copy link

Merged build triggered.

@AmplabJenkins
Copy link

Merged build started.

@AmplabJenkins
Copy link

Merged build finished. All automated tests passed.

@AmplabJenkins
Copy link

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14652/

baseValue(o)
}
val value = child.eval(input)
if(value == null) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Space after if, below too.

@marmbrus
Copy link
Contributor

marmbrus commented May 5, 2014

Other than a very minor style issue, this LGTM.

@pwendell or @rxin please merge when ready.

@chenghao-intel
Copy link
Contributor Author

Fixed. Can you retest this please?

@marmbrus
Copy link
Contributor

marmbrus commented May 6, 2014

test this please.

@AmplabJenkins
Copy link

Merged build triggered.

@AmplabJenkins
Copy link

Merged build started.

@AmplabJenkins
Copy link

Merged build finished. All automated tests passed.

@AmplabJenkins
Copy link

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14692/

@marmbrus
Copy link
Contributor

marmbrus commented May 7, 2014

Thanks for doing this :)

@pwendell, can you please merge?

@rxin
Copy link
Contributor

rxin commented May 7, 2014

I've merged this. Thanks!

asfgit pushed a commit that referenced this pull request May 7, 2014
Currently, expression does not support the "constant null" well in constant folding.
e.g. Sum(a, 0) actually always produces Literal(0, NumericType) in runtime.

For example:
```
explain select isnull(key+null)  from src;
== Logical Plan ==
Project [HiveGenericUdf#isnull((key#30 + CAST(null, IntegerType))) AS c_0#28]
 MetastoreRelation default, src, None

== Optimized Logical Plan ==
Project [true AS c_0#28]
 MetastoreRelation default, src, None

== Physical Plan ==
Project [true AS c_0#28]
 HiveTableScan [], (MetastoreRelation default, src, None), None
```

I've create a new Optimization rule called NullPropagation for such kind of constant folding.

Author: Cheng Hao <hao.cheng@intel.com>
Author: Michael Armbrust <michael@databricks.com>

Closes #482 from chenghao-intel/optimize_constant_folding and squashes the following commits:

2f14b50 [Cheng Hao] Fix code style issues
68b9fad [Cheng Hao] Remove the Literal pattern matching for NullPropagation
29c8166 [Cheng Hao] Update the code for feedback of code review
50444cc [Cheng Hao] Remove the unnecessary null checking
80f9f18 [Cheng Hao] Update the UnitTest for aggregation constant folding
27ea3d7 [Cheng Hao] Fix Constant Folding Bugs & Add More Unittests
b28e03a [Cheng Hao] Merge pull request #1 from marmbrus/pr/482
9ccefdb [Michael Armbrust] Add tests for optimized expression evaluation.
543ef9d [Cheng Hao] fix code style issues
9cf0396 [Cheng Hao] update code according to the code review comment
536c005 [Cheng Hao] Add Exceptional case for constant folding
3c045c7 [Cheng Hao] Optimize the Constant Folding by adding more rules
2645d4f [Cheng Hao] Constant Folding(null propagation)

(cherry picked from commit 3eb53bd)
Signed-off-by: Reynold Xin <rxin@apache.org>
@asfgit asfgit closed this in 3eb53bd May 7, 2014
pwendell added a commit to pwendell/spark that referenced this pull request May 12, 2014
Added StreamingContext.awaitTermination to streaming examples

StreamingContext.start() currently starts a non-daemon thread which prevents termination of a Spark Streaming program even if main function has exited. Since the expected behavior of a streaming program is to run until explicitly killed, this was sort of fine when spark streaming applications are launched from the command line. However, when launched in Yarn-standalone mode, this did not work as the driver effectively got terminated when the main function exits. So SparkStreaming examples did not work on Yarn.

This addition to the examples ensures that the examples work on Yarn and also ensures that everyone learns that StreamingContext.awaitTermination() being necessary for SparkStreaming programs to wait.

The true bug-fix of making sure all threads by Spark Streaming are daemon threads is left for post-0.9.
@chenghao-intel chenghao-intel deleted the optimize_constant_folding branch May 28, 2014 02:36
pdeyhim pushed a commit to pdeyhim/spark-1 that referenced this pull request Jun 25, 2014
Currently, expression does not support the "constant null" well in constant folding.
e.g. Sum(a, 0) actually always produces Literal(0, NumericType) in runtime.

For example:
```
explain select isnull(key+null)  from src;
== Logical Plan ==
Project [HiveGenericUdf#isnull((key#30 + CAST(null, IntegerType))) AS c_0#28]
 MetastoreRelation default, src, None

== Optimized Logical Plan ==
Project [true AS c_0#28]
 MetastoreRelation default, src, None

== Physical Plan ==
Project [true AS c_0#28]
 HiveTableScan [], (MetastoreRelation default, src, None), None
```

I've create a new Optimization rule called NullPropagation for such kind of constant folding.

Author: Cheng Hao <hao.cheng@intel.com>
Author: Michael Armbrust <michael@databricks.com>

Closes apache#482 from chenghao-intel/optimize_constant_folding and squashes the following commits:

2f14b50 [Cheng Hao] Fix code style issues
68b9fad [Cheng Hao] Remove the Literal pattern matching for NullPropagation
29c8166 [Cheng Hao] Update the code for feedback of code review
50444cc [Cheng Hao] Remove the unnecessary null checking
80f9f18 [Cheng Hao] Update the UnitTest for aggregation constant folding
27ea3d7 [Cheng Hao] Fix Constant Folding Bugs & Add More Unittests
b28e03a [Cheng Hao] Merge pull request apache#1 from marmbrus/pr/482
9ccefdb [Michael Armbrust] Add tests for optimized expression evaluation.
543ef9d [Cheng Hao] fix code style issues
9cf0396 [Cheng Hao] update code according to the code review comment
536c005 [Cheng Hao] Add Exceptional case for constant folding
3c045c7 [Cheng Hao] Optimize the Constant Folding by adding more rules
2645d4f [Cheng Hao] Constant Folding(null propagation)
andrewor14 pushed a commit to andrewor14/spark that referenced this pull request Jan 8, 2015
Added StreamingContext.awaitTermination to streaming examples

StreamingContext.start() currently starts a non-daemon thread which prevents termination of a Spark Streaming program even if main function has exited. Since the expected behavior of a streaming program is to run until explicitly killed, this was sort of fine when spark streaming applications are launched from the command line. However, when launched in Yarn-standalone mode, this did not work as the driver effectively got terminated when the main function exits. So SparkStreaming examples did not work on Yarn.

This addition to the examples ensures that the examples work on Yarn and also ensures that everyone learns that StreamingContext.awaitTermination() being necessary for SparkStreaming programs to wait.

The true bug-fix of making sure all threads by Spark Streaming are daemon threads is left for post-0.9.
(cherry picked from commit 0367981)

Signed-off-by: Patrick Wendell <pwendell@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
6 participants