
[SPARK-2054][SQL] Code Generation for Expression Evaluation #993

Closed
wants to merge 36 commits into apache:master from marmbrus:newCodeGen

Conversation

marmbrus
Contributor

@marmbrus marmbrus commented Jun 6, 2014

Adds a new method for evaluating expressions using code that is generated through Scala reflection. This functionality is configured by the SQLConf option spark.sql.codegen and is currently turned off by default.

Evaluation can be done in several specialized ways:

  • Projection - Given an input row, produce a new row from a set of expressions that define each column in terms of the input row. This can either produce a new Row object or perform the projection in-place on an existing Row (MutableProjection).
  • Ordering - Compares two rows based on a list of SortOrder expressions
  • Condition - Returns true or false given an input row.
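
The three specialized shapes above can be sketched as plain Scala traits. This is an illustrative simplification using `Array[Any]` in place of Spark's `Row`; the names mirror but are not Spark's actual interfaces:

```scala
object EvaluatorShapes {
  type Row = Array[Any]

  // Projection: given an input row, produce a new output row
  trait Projection extends (Row => Row)

  // MutableProjection: writes into a reusable target row instead of allocating
  trait MutableProjection extends (Row => Row) {
    def target(row: Row): MutableProjection
    def currentValue: Row
  }

  // Ordering over rows, and Condition (a row predicate)
  trait RowOrdering extends Ordering[Row]
  trait Condition extends (Row => Boolean)

  // A trivial interpreted projection: output column 1, then column 0
  val swap: Projection = new Projection {
    def apply(input: Row): Row = Array(input(1), input(0))
  }
}
```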

For each of the above operations there is both a Generated and Interpreted version. When generation for a given expression type is undefined, the code generator falls back on calling the eval function of the expression class. Even without custom code, there is still a potential speed up, as loops are unrolled and code can still be inlined by JIT.
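
The fallback behavior can be sketched as follows; this is a simplified stand-in (not Spark's actual code generator) where a known expression type gets a specialized closure and everything else falls back to the interpreted eval:

```scala
object Fallback {
  type Row = Array[Any]

  trait Expr { def eval(input: Row): Any }
  case class Literal(v: Any) extends Expr { def eval(input: Row): Any = v }
  case class Opaque(f: Row => Any) extends Expr { def eval(input: Row): Any = f(input) }

  // "Generate" a row function: Literals get a specialized constant closure;
  // any expression without a generated form calls back into its eval method.
  def generate(e: Expr): Row => Any = e match {
    case Literal(v) => _ => v                       // specialized path
    case other      => (row: Row) => other.eval(row) // interpreted fallback
  }
}
```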

This PR also contains a new type of Aggregation operator, GeneratedAggregate, that performs aggregation by using generated Projection code. Currently the required expression rewriting only works for simple aggregations like SUM and COUNT. This functionality will be extended in a future PR.
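
The rewriting idea behind GeneratedAggregate can be illustrated like this: SUM and COUNT each become an initial buffer value plus an update expression applied per input row. The names below are illustrative, not Spark's actual rewrite:

```scala
object AggRewrite {
  // SUM(x): buffer starts at 0 and adds each non-null input
  def sum(rows: Seq[Option[Long]]): Long =
    rows.foldLeft(0L)((buffer, x) => buffer + x.getOrElse(0L))

  // COUNT(x): buffer starts at 0 and increments for each non-null input
  def count(rows: Seq[Option[Long]]): Long =
    rows.foldLeft(0L)((buffer, x) => if (x.isDefined) buffer + 1 else buffer)
}
```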

This PR also performs several clean ups that simplified the implementation:

  • The notion of Binding all expressions in a tree automatically before query execution has been removed. Instead, it is the responsibility of an operator to provide the input schema when creating one of the specialized evaluators defined above. In cases where the standard eval method is going to be called, binding can still be done manually using BindReferences. There are a few reasons for this change: First, there were many operators where it just didn't work before; for example, operators with more than one child, and operators like aggregation that do significant rewriting of the expression. Second, the semantics of equality with BoundReferences are broken. Specifically, we have had a few bugs where partitioning breaks because of the binding.
  • A copy of the current SQLContext is automatically propagated to all SparkPlan nodes by the query planner. Before this was done ad-hoc for the nodes that needed this. However, this required a lot of boilerplate as one had to always remember to make it @transient and also had to modify the otherCopyArgs.
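
The manual binding step can be sketched as resolving a named attribute to its ordinal in the operator's input schema. This is a simplified illustration of the idea behind BindReferences, not Spark's API:

```scala
object Binding {
  type Row = Array[Any]

  sealed trait Expr { def eval(input: Row): Any }

  // An unbound, named column reference; evaluating it is an error
  case class Attribute(name: String) extends Expr {
    def eval(input: Row): Any =
      sys.error(s"unbound attribute $name; bind against a schema first")
  }

  // A reference bound to a fixed ordinal in the input row
  case class BoundReference(ordinal: Int) extends Expr {
    def eval(input: Row): Any = input(ordinal)
  }

  // Replace named attributes with ordinal references using the input schema
  def bind(expr: Expr, schema: Seq[String]): Expr = expr match {
    case Attribute(name) => BoundReference(schema.indexOf(name))
    case bound           => bound
  }
}
```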

@AmplabJenkins

Merged build triggered.

@rxin
Contributor

rxin commented Jun 6, 2014

One more to-do is the Maven build ...

@concretevitamin
Contributor

Another TODO might be to beef up IN's code-gen semantics (recall "NULL in NULL" and similar cases).

def currentValue: Row = mutableRow

def target(row: MutableRow): MutableProjection = {
Contributor


maybe add some scaladoc to explain how this is used?
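
For illustration, a minimal sketch of how the target/currentValue pair might be used: repoint the projection at a reusable row, apply it per input row, and read results out of currentValue without allocating. This is a hypothetical simplification over `Array[Any]`, not the PR's actual class:

```scala
class SimpleMutableProjection(f: Array[Any] => Seq[Any]) {
  private var mutableRow: Array[Any] = new Array[Any](0)

  // The row that apply() last wrote into
  def currentValue: Array[Any] = mutableRow

  // Redirect output into a caller-supplied, reusable row
  def target(row: Array[Any]): this.type = { mutableRow = row; this }

  // Evaluate the projection in place on the current target row
  def apply(input: Array[Any]): Array[Any] = {
    val out = f(input)
    var i = 0
    while (i < out.length) { mutableRow(i) = out(i); i += 1 }
    mutableRow
  }
}
```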

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished.

@AmplabJenkins

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15499/

def dataType = StringType
def nullable = string.nullable

override def eval(input: Row) = ???
Contributor


should this be filled in?

@hsaputra
Contributor

hsaputra commented Jun 6, 2014

Hi @marmbrus, one general comment about the PR: could you kindly add an object or class header comment describing why each of them is needed and the context in which they are used?
It would be very useful for people trying to use the module, and for those helping to improve it and fix issues later.


object CodeGeneration

class CodeGenerator extends Logging {
Contributor


It would be helpful to add a class header comment describing how this class is used in the bigger context.

@AmplabJenkins

Build triggered.

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@SparkQA

SparkQA commented Jul 27, 2014

QA tests have started for PR 993. This patch DID NOT merge cleanly!
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17251/consoleFull

Conflicts:
	sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/commands.scala
@SparkQA

SparkQA commented Jul 27, 2014

QA tests have started for PR 993. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17252/consoleFull

@SparkQA

SparkQA commented Jul 27, 2014

QA results for PR 993:
- This patch PASSES unit tests.

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17251/consoleFull

* Defaults to false as this feature is currently experimental.
*/
private[spark] def codegenEnabled: Boolean =
  get("spark.sql.codegen", "false") == "true"
Contributor


I collected all Spark SQL configuration properties in the SQLConf object in the JDBC Thrift server PR. We can put this one there too.

Conflicts:
	sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/commands.scala
@SparkQA

SparkQA commented Jul 28, 2014

QA tests have started for PR 993. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17295/consoleFull

Conflicts:
	project/SparkBuild.scala
	sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/commands.scala
@SparkQA

SparkQA commented Jul 29, 2014

QA tests have started for PR 993. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17376/consoleFull

Conflicts:
	sql/core/src/main/scala/org/apache/spark/sql/SQLConf.scala
	sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala
	sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala
	sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala
	sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveQuerySuite.scala
Conflicts:
	sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveQuerySuite.scala
@marmbrus
Contributor Author

test this please

@SparkQA

SparkQA commented Jul 30, 2014

QA tests have started for PR 993. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17402/consoleFull

@marmbrus
Contributor Author

Thanks for looking at this everyone. I've merged it into master!

@asfgit asfgit closed this in 8446746 Jul 30, 2014
@ueshin
Member

ueshin commented Jul 30, 2014

Hi @marmbrus, thanks for the great work!
But it seems to break the build.

I got the following result when I ran sbt assembly or sbt publish-local:

[error] (catalyst/compile:doc) Scaladoc generation failed

and I found a lot of error messages in the build log saying value q is not a member of StringContext.
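
For context, on Scala 2.10 the q"..." quasiquote interpolator comes from the macro-paradise compiler plugin rather than the standard library, so a build that picks up the codegen sources without the plugin fails with exactly this error. A sketch of the sbt configuration involved (version numbers are assumptions for illustration; this PR's "Upgrade paradise" commit touches the real setting):

```scala
// build.sbt fragment (illustrative): quasiquotes on Scala 2.10 need the
// macro-paradise plugin plus the quasiquotes runtime dependency.
addCompilerPlugin("org.scalamacros" % "paradise" % "2.0.1" cross CrossVersion.full)
libraryDependencies += "org.scalamacros" %% "quasiquotes" % "2.0.1"
```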

@marmbrus
Contributor Author

@ueshin thanks for reporting. I've opened #1653 to fix this.

@marmbrus
Contributor Author

@ueshin this should be fixed in master. Please let me know if you have any other problems.

@marmbrus marmbrus deleted the newCodeGen branch August 27, 2014 20:46
xiliu82 pushed a commit to xiliu82/spark that referenced this pull request Sep 4, 2014

Author: Michael Armbrust <michael@databricks.com>

Closes apache#993 from marmbrus/newCodeGen and squashes the following commits:

96ef82c [Michael Armbrust] Merge remote-tracking branch 'apache/master' into newCodeGen
f34122d [Michael Armbrust] Merge remote-tracking branch 'apache/master' into newCodeGen
67b1c48 [Michael Armbrust] Use conf variable in SQLConf object
4bdc42c [Michael Armbrust] Merge remote-tracking branch 'origin/master' into newCodeGen
41a40c9 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into newCodeGen
de22aac [Michael Armbrust] Merge remote-tracking branch 'origin/master' into newCodeGen
fed3634 [Michael Armbrust] Inspectors are not serializable.
ef8d42b [Michael Armbrust] comments
533fdfd [Michael Armbrust] More logging of expression rewriting for GeneratedAggregate.
3cd773e [Michael Armbrust] Allow codegen for Generate.
64b2ee1 [Michael Armbrust] Implement copy
3587460 [Michael Armbrust] Drop unused string builder function.
9cce346 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into newCodeGen
1a61293 [Michael Armbrust] Address review comments.
0672e8a [Michael Armbrust] Address comments.
1ec2d6e [Michael Armbrust] Address comments
033abc6 [Michael Armbrust] off by default
4771fab [Michael Armbrust] Docs, more test coverage.
d30fee2 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into newCodeGen
d2ad5c5 [Michael Armbrust] Refactor putting SQLContext into SparkPlan. Fix ordering, other test cases.
be2cd6b [Michael Armbrust] WIP: Remove old method for reference binding, more work on configuration.
bc88ecd [Michael Armbrust] Style
6cc97ca [Michael Armbrust] Merge remote-tracking branch 'origin/master' into newCodeGen
4220f1e [Michael Armbrust] Better config, docs, etc.
ca6cc6b [Michael Armbrust] WIP
9d67d85 [Michael Armbrust] Fix hive planner
fc522d5 [Michael Armbrust] Hook generated aggregation in to the planner.
e742640 [Michael Armbrust] Remove unneeded changes and code.
675e679 [Michael Armbrust] Upgrade paradise.
0093376 [Michael Armbrust] Comment / indenting cleanup.
d81f998 [Michael Armbrust] include schema for binding.
0e889e8 [Michael Armbrust] Use typeOf instead tq
f623ffd [Michael Armbrust] Quiet logging from test suite.
efad14f [Michael Armbrust] Remove some half finished functions.
92e74a4 [Michael Armbrust] add overrides
a2b5408 [Michael Armbrust] WIP: Code generation with scala reflection.
wangyum pushed a commit that referenced this pull request May 26, 2023
…hole plan exchange and subquery reuse (#993)

* [CARMEL-6055] Backport [SPARK-29375][SPARK-35855][SPARK-28940][SPARK-32041][SQL] Whole plan exchange and subquery reuse

* fix ut

* fix ut

* fix ut

* fix ut

* fix ut
udaynpusa pushed a commit to mapr/spark that referenced this pull request Jan 30, 2024
…mory from config to overwrite the SPARK_DAEMON_MEMORY (apache#993)

Co-authored-by: Tetiana Fioshkina <tetiana.fioshkina@hpe.com>
9 participants