
[SPARK-2054][SQL] Code Generation for Expression Evaluation #993

Closed
wants to merge 36 commits into apache:master from marmbrus:newCodeGen

Conversation

marmbrus
Contributor

@marmbrus marmbrus commented Jun 6, 2014

Adds a new method for evaluating expressions using code that is generated through Scala reflection. This functionality is configured by the SQLConf option spark.sql.codegen and is currently turned off by default.

Evaluation can be done in several specialized ways:

  • Projection - Given an input row, produce a new row from a set of expressions that define each column in terms of the input row. This can either produce a new Row object or perform the projection in-place on an existing Row (MutableProjection).
  • Ordering - Compares two rows based on a list of SortOrder expressions
  • Condition - Returns true or false given an input row.
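
The three specialized shapes above can be sketched as plain Scala traits. This is an illustrative simplification using `Array[Any]` in place of Spark's `Row`; the names mirror but are not Spark's actual interfaces:

```scala
object EvaluatorShapes {
  type Row = Array[Any]

  // Projection: given an input row, produce a new output row
  trait Projection extends (Row => Row)

  // MutableProjection: writes into a reusable target row instead of allocating
  trait MutableProjection extends (Row => Row) {
    def target(row: Row): MutableProjection
    def currentValue: Row
  }

  // Ordering over rows, and Condition (a row predicate)
  trait RowOrdering extends Ordering[Row]
  trait Condition extends (Row => Boolean)

  // A trivial interpreted projection: output column 1, then column 0
  val swap: Projection = new Projection {
    def apply(input: Row): Row = Array(input(1), input(0))
  }
}
```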

For each of the above operations there is both a Generated and Interpreted version. When generation for a given expression type is undefined, the code generator falls back on calling the eval function of the expression class. Even without custom code, there is still a potential speed up, as loops are unrolled and code can still be inlined by JIT.
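
The fallback behavior can be sketched as follows; this is a simplified stand-in (not Spark's actual code generator) where a known expression type gets a specialized closure and everything else falls back to the interpreted eval:

```scala
object Fallback {
  type Row = Array[Any]

  trait Expr { def eval(input: Row): Any }
  case class Literal(v: Any) extends Expr { def eval(input: Row): Any = v }
  case class Opaque(f: Row => Any) extends Expr { def eval(input: Row): Any = f(input) }

  // "Generate" a row function: Literals get a specialized constant closure;
  // any expression without a generated form calls back into its eval method.
  def generate(e: Expr): Row => Any = e match {
    case Literal(v) => _ => v                       // specialized path
    case other      => (row: Row) => other.eval(row) // interpreted fallback
  }
}
```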

This PR also contains a new type of Aggregation operator, GeneratedAggregate, that performs aggregation by using generated Projection code. Currently the required expression rewriting only works for simple aggregations like SUM and COUNT. This functionality will be extended in a future PR.
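
The rewriting idea behind GeneratedAggregate can be illustrated like this: SUM and COUNT each become an initial buffer value plus an update expression applied per input row. The names below are illustrative, not Spark's actual rewrite:

```scala
object AggRewrite {
  // SUM(x): buffer starts at 0 and adds each non-null input
  def sum(rows: Seq[Option[Long]]): Long =
    rows.foldLeft(0L)((buffer, x) => buffer + x.getOrElse(0L))

  // COUNT(x): buffer starts at 0 and increments for each non-null input
  def count(rows: Seq[Option[Long]]): Long =
    rows.foldLeft(0L)((buffer, x) => if (x.isDefined) buffer + 1 else buffer)
}
```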

This PR also performs several clean ups that simplified the implementation:

  • The notion of Binding all expressions in a tree automatically before query execution has been removed. Instead, it is the responsibility of an operator to provide the input schema when creating one of the specialized evaluators defined above. In cases where the standard eval method is going to be called, binding can still be done manually using BindReferences. There are a few reasons for this change: First, there were many operators where it just didn't work before; for example, operators with more than one child, and operators like aggregation that do significant rewriting of the expression. Second, the semantics of equality with BoundReferences are broken. Specifically, we have had a few bugs where partitioning breaks because of the binding.
  • A copy of the current SQLContext is automatically propagated to all SparkPlan nodes by the query planner. Before this was done ad-hoc for the nodes that needed this. However, this required a lot of boilerplate as one had to always remember to make it @transient and also had to modify the otherCopyArgs.
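
The manual binding step can be sketched as resolving a named attribute to its ordinal in the operator's input schema. This is a simplified illustration of the idea behind BindReferences, not Spark's API:

```scala
object Binding {
  type Row = Array[Any]

  sealed trait Expr { def eval(input: Row): Any }

  // An unbound, named column reference; evaluating it is an error
  case class Attribute(name: String) extends Expr {
    def eval(input: Row): Any =
      sys.error(s"unbound attribute $name; bind against a schema first")
  }

  // A reference bound to a fixed ordinal in the input row
  case class BoundReference(ordinal: Int) extends Expr {
    def eval(input: Row): Any = input(ordinal)
  }

  // Replace named attributes with ordinal references using the input schema
  def bind(expr: Expr, schema: Seq[String]): Expr = expr match {
    case Attribute(name) => BoundReference(schema.indexOf(name))
    case bound           => bound
  }
}
```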

@AmplabJenkins

Merged build triggered.

@rxin
Contributor

rxin commented Jun 6, 2014

One more to-do is the Maven build ...

@concretevitamin
Contributor

Another TODO might be to beef up IN's code-gen semantics (recall "NULL in NULL" and similar cases).

def currentValue: Row = mutableRow

def target(row: MutableRow): MutableProjection = {
Contributor


maybe add some scaladoc to explain how this is used?
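
For illustration, a minimal sketch of how the target/currentValue pair might be used: repoint the projection at a reusable row, apply it per input row, and read results out of currentValue without allocating. This is a hypothetical simplification over `Array[Any]`, not the PR's actual class:

```scala
class SimpleMutableProjection(f: Array[Any] => Seq[Any]) {
  private var mutableRow: Array[Any] = new Array[Any](0)

  // The row that apply() last wrote into
  def currentValue: Array[Any] = mutableRow

  // Redirect output into a caller-supplied, reusable row
  def target(row: Array[Any]): this.type = { mutableRow = row; this }

  // Evaluate the projection in place on the current target row
  def apply(input: Array[Any]): Array[Any] = {
    val out = f(input)
    var i = 0
    while (i < out.length) { mutableRow(i) = out(i); i += 1 }
    mutableRow
  }
}
```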

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished.

@AmplabJenkins

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15499/

def dataType = StringType
def nullable = string.nullable

override def eval(input: Row) = ???
Contributor


should this be filled in?

@hsaputra
Contributor

hsaputra commented Jun 6, 2014

Hi @marmbrus, one general comment about the PR: could you kindly add an object or class header comment describing why each of them is needed and the context in which they are used?
It would be very useful for people trying to use the module, and for those helping to improve it and fix issues later.


object CodeGeneration

class CodeGenerator extends Logging {
Contributor


It would be helpful to add a class header comment describing how this class is used in the bigger context.

@AmplabJenkins

Build triggered.

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@SparkQA

SparkQA commented Jul 27, 2014

QA tests have started for PR 993. This patch DID NOT merge cleanly!
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17251/consoleFull

Conflicts:
	sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/commands.scala
@SparkQA

SparkQA commented Jul 27, 2014

QA tests have started for PR 993. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17252/consoleFull

@SparkQA

SparkQA commented Jul 27, 2014

QA results for PR 993:
- This patch PASSES unit tests.

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17251/consoleFull

* Defaults to false as this feature is currently experimental.
*/
private[spark] def codegenEnabled: Boolean =
  get("spark.sql.codegen", "false") == "true"
Contributor


I collected all Spark SQL configuration properties in the SQLConf object in the JDBC Thrift server PR. We can put this one there too.

Conflicts:
	sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/commands.scala
@SparkQA

SparkQA commented Jul 28, 2014

QA tests have started for PR 993. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17295/consoleFull

Conflicts:
	project/SparkBuild.scala
	sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/commands.scala
@SparkQA

SparkQA commented Jul 29, 2014

QA tests have started for PR 993. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17376/consoleFull

Conflicts:
	sql/core/src/main/scala/org/apache/spark/sql/SQLConf.scala
	sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala
	sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala
	sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala
	sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveQuerySuite.scala
Conflicts:
	sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveQuerySuite.scala
@marmbrus
Contributor Author

test this please

@SparkQA

SparkQA commented Jul 30, 2014

QA tests have started for PR 993. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17402/consoleFull

@marmbrus
Contributor Author

Thanks for looking at this everyone. I've merged it into master!

@asfgit asfgit closed this in 8446746 Jul 30, 2014
@ueshin
Member

ueshin commented Jul 30, 2014

Hi @marmbrus, thanks for the great work!
But it seems to break the build.

I got the following result when I ran sbt assembly or sbt publish-local:

[error] (catalyst/compile:doc) Scaladoc generation failed

and I found a lot of error messages in the build log saying value q is not a member of StringContext.
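
For context, on Scala 2.10 the q"..." quasiquote interpolator comes from the macro-paradise compiler plugin rather than the standard library, so a build that picks up the codegen sources without the plugin fails with exactly this error. A sketch of the sbt configuration involved (version numbers are assumptions for illustration; this PR's "Upgrade paradise" commit touches the real setting):

```scala
// build.sbt fragment (illustrative): quasiquotes on Scala 2.10 need the
// macro-paradise plugin plus the quasiquotes runtime dependency.
addCompilerPlugin("org.scalamacros" % "paradise" % "2.0.1" cross CrossVersion.full)
libraryDependencies += "org.scalamacros" %% "quasiquotes" % "2.0.1"
```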

@marmbrus
Contributor Author

@ueshin thanks for reporting. I've opened #1653 to fix this.

@marmbrus
Contributor Author

@ueshin this should be fixed in master. Please let me know if you have any other problems.

@marmbrus marmbrus deleted the newCodeGen branch August 27, 2014 20:46
xiliu82 pushed a commit to xiliu82/spark that referenced this pull request Sep 4, 2014

Author: Michael Armbrust <michael@databricks.com>

Closes apache#993 from marmbrus/newCodeGen and squashes the following commits:

96ef82c [Michael Armbrust] Merge remote-tracking branch 'apache/master' into newCodeGen
f34122d [Michael Armbrust] Merge remote-tracking branch 'apache/master' into newCodeGen
67b1c48 [Michael Armbrust] Use conf variable in SQLConf object
4bdc42c [Michael Armbrust] Merge remote-tracking branch 'origin/master' into newCodeGen
41a40c9 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into newCodeGen
de22aac [Michael Armbrust] Merge remote-tracking branch 'origin/master' into newCodeGen
fed3634 [Michael Armbrust] Inspectors are not serializable.
ef8d42b [Michael Armbrust] comments
533fdfd [Michael Armbrust] More logging of expression rewriting for GeneratedAggregate.
3cd773e [Michael Armbrust] Allow codegen for Generate.
64b2ee1 [Michael Armbrust] Implement copy
3587460 [Michael Armbrust] Drop unused string builder function.
9cce346 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into newCodeGen
1a61293 [Michael Armbrust] Address review comments.
0672e8a [Michael Armbrust] Address comments.
1ec2d6e [Michael Armbrust] Address comments
033abc6 [Michael Armbrust] off by default
4771fab [Michael Armbrust] Docs, more test coverage.
d30fee2 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into newCodeGen
d2ad5c5 [Michael Armbrust] Refactor putting SQLContext into SparkPlan. Fix ordering, other test cases.
be2cd6b [Michael Armbrust] WIP: Remove old method for reference binding, more work on configuration.
bc88ecd [Michael Armbrust] Style
6cc97ca [Michael Armbrust] Merge remote-tracking branch 'origin/master' into newCodeGen
4220f1e [Michael Armbrust] Better config, docs, etc.
ca6cc6b [Michael Armbrust] WIP
9d67d85 [Michael Armbrust] Fix hive planner
fc522d5 [Michael Armbrust] Hook generated aggregation in to the planner.
e742640 [Michael Armbrust] Remove unneeded changes and code.
675e679 [Michael Armbrust] Upgrade paradise.
0093376 [Michael Armbrust] Comment / indenting cleanup.
d81f998 [Michael Armbrust] include schema for binding.
0e889e8 [Michael Armbrust] Use typeOf instead tq
f623ffd [Michael Armbrust] Quiet logging from test suite.
efad14f [Michael Armbrust] Remove some half finished functions.
92e74a4 [Michael Armbrust] add overrides
a2b5408 [Michael Armbrust] WIP: Code generation with scala reflection.
wangyum pushed a commit that referenced this pull request May 26, 2023
…hole plan exchange and subquery reuse (#993)

* [CARMEL-6055] Backport [SPARK-29375][SPARK-35855][SPARK-28940][SPARK-32041][SQL] Whole plan exchange and subquery reuse

* fix ut

* fix ut

* fix ut

* fix ut

* fix ut
udaynpusa pushed a commit to mapr/spark that referenced this pull request Jan 30, 2024
…mory from config to overwrite the SPARK_DAEMON_MEMORY (apache#993)

Co-authored-by: Tetiana Fioshkina <tetiana.fioshkina@hpe.com>
9 participants