[SPARK-2054][SQL] Code Generation for Expression Evaluation #993
Merged build triggered.
One more to-do is the Maven build ...
Another TODO might be to beef up the code-gen semantics of IN (recall "NULL IN NULL" and similar cases).
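For readers unfamiliar with the "NULL IN NULL" cases mentioned in that comment, here is a minimal sketch of standard SQL three-valued `IN` logic. It is not the code generator from this PR: `InSemantics` and `sqlIn` are hypothetical names, SQL NULL is modelled as `None`, and the list is assumed to be non-empty.

```scala
object InSemantics {
  // SQL booleans are three-valued: Some(true), Some(false), or None for NULL.
  type SqlBool = Option[Boolean]

  /** Evaluates `value IN (list...)`, with None standing in for SQL NULL.
    * Assumes a non-empty list, as standard SQL requires. */
  def sqlIn[A](value: Option[A], list: Seq[Option[A]]): SqlBool = value match {
    // A NULL left-hand side can never be proven equal or unequal to anything.
    case None => None
    case Some(v) =>
      if (list.contains(Some(v))) Some(true) // a definite match wins
      else if (list.contains(None)) None     // no match, but a NULL might have matched
      else Some(false)                       // no match and no NULLs in the list
  }
}
```

The point of the sketch is that a generated `IN` has to return "unknown" rather than `false` whenever a NULL on either side prevents a definite answer.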
def currentValue: Row = mutableRow

def target(row: MutableRow): MutableProjection = {
maybe add some scaladoc to explain how this is used?
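To illustrate what such scaladoc might describe, the following is a rough sketch of how a mutable projection with `currentValue` and `target` could be used. Only those two method names come from the diff above. `ToyMutableRow`, `ToyMutableProjection`, and the function-based "expressions" are hypothetical stand-ins, not the classes in this PR.

```scala
object MutableProjectionSketch {
  // Hypothetical stand-in for a mutable row: a fixed-width array of column values.
  final class ToyMutableRow(size: Int) {
    private val values = new Array[Any](size)
    def update(i: Int, v: Any): Unit = { values(i) = v }
    def apply(i: Int): Any = values(i)
    def toSeq: Seq[Any] = values.toList
  }

  // Each "expression" here is just a function from the input row to one output column.
  final class ToyMutableProjection(exprs: Seq[Seq[Any] => Any]) {
    private var mutableRow = new ToyMutableRow(exprs.size)

    /** The row that calls to apply() currently write into. */
    def currentValue: ToyMutableRow = mutableRow

    /** Redirects output into a caller-supplied row; returns this for chaining. */
    def target(row: ToyMutableRow): ToyMutableProjection = {
      mutableRow = row
      this
    }

    /** Evaluates every expression and overwrites the target row in place. */
    def apply(input: Seq[Any]): ToyMutableRow = {
      var i = 0
      while (i < exprs.size) {
        mutableRow(i) = exprs(i)(input)
        i += 1
      }
      mutableRow
    }
  }
}
```

In this sketch, a caller that owns a buffer row would write `projection.target(buffer)(input)` and then read the result back from `buffer`, so no new row is allocated per input.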
Merged build started.
Merged build finished.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15499/
def dataType = StringType
def nullable = string.nullable

override def eval(input: Row) = ???
should this be filled in?
Hi @marmbrus, one general comment about the PR: could you add a header comment to each object or class, describing why it is needed and the context in which it is used?

object CodeGeneration

class CodeGenerator extends Logging {
It would be helpful to add a class header comment describing how this class is used in the bigger context.
Build triggered.
Merged build triggered.
Merged build started.
QA tests have started for PR 993. This patch DID NOT merge cleanly!
Conflicts: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/commands.scala
QA tests have started for PR 993. This patch merges cleanly.
QA results for PR 993:
 * Defaults to false as this feature is currently experimental.
 */
private[spark] def codegenEnabled: Boolean =
  if (get("spark.sql.codegen", "false") == "true") true else false
I collected all Spark SQL configuration properties in `object SQLConf` in the JDBC Thrift server PR. We can put this one there too.
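As a rough illustration of that suggestion, the key could live in one place as a named constant and the accessor could compare against it directly. This is only a sketch: `ConfSketch`, `ToyConf`, and `CodegenEnabledKey` are hypothetical names, and only the key string `spark.sql.codegen` and its `false` default come from the diff above.

```scala
import scala.collection.mutable

object ConfSketch {
  // Hypothetical constant name; only the key string itself appears in the PR.
  val CodegenEnabledKey = "spark.sql.codegen"

  // Toy configuration holder backed by a string-to-string map.
  final class ToyConf {
    private val settings = mutable.Map.empty[String, String]

    def set(key: String, value: String): Unit = { settings(key) = value }

    def get(key: String, default: String): String = settings.getOrElse(key, default)

    /** Off unless the key is set to exactly "true", as in the quoted accessor. */
    def codegenEnabled: Boolean = get(CodegenEnabledKey, "false") == "true"
  }
}
```

The comparison already yields a `Boolean`, so the `if (...) true else false` wrapper in the quoted code is not needed to get the same behaviour.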
Conflicts: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/commands.scala
QA tests have started for PR 993. This patch merges cleanly.
Conflicts: project/SparkBuild.scala sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/commands.scala
QA tests have started for PR 993. This patch merges cleanly.
Conflicts: sql/core/src/main/scala/org/apache/spark/sql/SQLConf.scala sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveQuerySuite.scala
Conflicts: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveQuerySuite.scala
test this please
QA tests have started for PR 993. This patch merges cleanly.
Thanks for looking at this, everyone. I've merged it into master!
Hi @marmbrus, thanks for the great work! I got the following result when I run
and I found a lot of error messages in the build log saying
@ueshin this should be fixed in master. Please let me know if you have any other problems.
Adds a new method for evaluating expressions using code that is generated through Scala reflection. This functionality is configured by the SQLConf option `spark.sql.codegen` and is currently turned off by default.

Evaluation can be done in several specialized ways:
- *Projection* - Given an input row, produce a new row from a set of expressions that define each column in terms of the input row. This can either produce a new Row object or perform the projection in-place on an existing Row (MutableProjection).
- *Ordering* - Compares two rows based on a list of `SortOrder` expressions.
- *Condition* - Returns `true` or `false` given an input row.

For each of the above operations there is both a Generated and an Interpreted version. When generation for a given expression type is undefined, the code generator falls back on calling the `eval` function of the expression class. Even without custom code, there is still a potential speedup, as loops are unrolled and code can still be inlined by the JIT.

This PR also contains a new type of aggregation operator, `GeneratedAggregate`, that performs aggregation by using generated `Projection` code. Currently the required expression rewriting only works for simple aggregations like `SUM` and `COUNT`. This functionality will be extended in a future PR.

This PR also performs several cleanups that simplified the implementation:
- The notion of `Binding` all expressions in a tree automatically before query execution has been removed. Instead, it is the responsibility of an operator to provide the input schema when creating one of the specialized evaluators defined above. In cases where the standard `eval` method is going to be called, binding can still be done manually using `BindReferences`. There are a few reasons for this change. First, there were many operators where it just didn't work before, for example operators with more than one child, and operators like aggregation that do significant rewriting of the expression. Second, the semantics of equality with `BoundReferences` are broken. Specifically, we have had a few bugs where partitioning breaks because of the binding.
- A copy of the current `SQLContext` is automatically propagated to all `SparkPlan` nodes by the query planner. Before, this was done ad hoc for the nodes that needed it. However, this required a lot of boilerplate, as one always had to remember to make it `@transient` and also had to modify the `otherCopyArgs`.

Author: Michael Armbrust <michael@databricks.com>

Closes apache#993 from marmbrus/newCodeGen and squashes the following commits:

96ef82c [Michael Armbrust] Merge remote-tracking branch 'apache/master' into newCodeGen
f34122d [Michael Armbrust] Merge remote-tracking branch 'apache/master' into newCodeGen
67b1c48 [Michael Armbrust] Use conf variable in SQLConf object
4bdc42c [Michael Armbrust] Merge remote-tracking branch 'origin/master' into newCodeGen
41a40c9 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into newCodeGen
de22aac [Michael Armbrust] Merge remote-tracking branch 'origin/master' into newCodeGen
fed3634 [Michael Armbrust] Inspectors are not serializable.
ef8d42b [Michael Armbrust] comments
533fdfd [Michael Armbrust] More logging of expression rewriting for GeneratedAggregate.
3cd773e [Michael Armbrust] Allow codegen for Generate.
64b2ee1 [Michael Armbrust] Implement copy
3587460 [Michael Armbrust] Drop unused string builder function.
9cce346 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into newCodeGen
1a61293 [Michael Armbrust] Address review comments.
0672e8a [Michael Armbrust] Address comments.
1ec2d6e [Michael Armbrust] Address comments
033abc6 [Michael Armbrust] off by default
4771fab [Michael Armbrust] Docs, more test coverage.
d30fee2 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into newCodeGen
d2ad5c5 [Michael Armbrust] Refactor putting SQLContext into SparkPlan. Fix ordering, other test cases.
be2cd6b [Michael Armbrust] WIP: Remove old method for reference binding, more work on configuration.
bc88ecd [Michael Armbrust] Style
6cc97ca [Michael Armbrust] Merge remote-tracking branch 'origin/master' into newCodeGen
4220f1e [Michael Armbrust] Better config, docs, etc.
ca6cc6b [Michael Armbrust] WIP
9d67d85 [Michael Armbrust] Fix hive planner
fc522d5 [Michael Armbrust] Hook generated aggregation in to the planner.
e742640 [Michael Armbrust] Remove unneeded changes and code.
675e679 [Michael Armbrust] Upgrade paradise.
0093376 [Michael Armbrust] Comment / indenting cleanup.
d81f998 [Michael Armbrust] include schema for binding.
0e889e8 [Michael Armbrust] Use typeOf instead tq
f623ffd [Michael Armbrust] Quiet logging from test suite.
efad14f [Michael Armbrust] Remove some half finished functions.
92e74a4 [Michael Armbrust] add overrides
a2b5408 [Michael Armbrust] WIP: Code generation with scala reflection.
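The three evaluator kinds named in the description (Projection, Ordering, Condition) can be pictured with a small interpreted sketch that simply calls `eval` on each expression, which is the fallback path the description mentions. Everything here is a hypothetical toy, not Catalyst's actual classes: rows are `IndexedSeq[Any]`, all values are assumed to be non-null `Int`s, and `Col`, `Lit`, `Add`, `GreaterThan`, and `SortKey` are illustrative names.

```scala
object EvaluatorSketch {
  type Row = IndexedSeq[Any]

  // A tiny expression tree with an interpreted eval method.
  sealed trait Expr { def eval(input: Row): Any }
  final case class Col(ordinal: Int) extends Expr {
    def eval(input: Row): Any = input(ordinal)
  }
  final case class Lit(value: Any) extends Expr {
    def eval(input: Row): Any = value
  }
  final case class Add(left: Expr, right: Expr) extends Expr {
    def eval(input: Row): Any =
      left.eval(input).asInstanceOf[Int] + right.eval(input).asInstanceOf[Int]
  }
  final case class GreaterThan(left: Expr, right: Expr) extends Expr {
    def eval(input: Row): Any =
      left.eval(input).asInstanceOf[Int] > right.eval(input).asInstanceOf[Int]
  }

  // Projection: builds a new row with one output column per expression.
  def projection(exprs: Seq[Expr]): Row => Row =
    row => exprs.map(_.eval(row)).toIndexedSeq

  // Condition: evaluates a boolean expression against a row.
  def condition(pred: Expr): Row => Boolean =
    row => pred.eval(row).asInstanceOf[Boolean]

  // Toy counterpart of a sort-order expression: a key plus a direction.
  final case class SortKey(child: Expr, ascending: Boolean)

  // Ordering: compares two rows key by key; the first non-zero comparison decides.
  def ordering(keys: Seq[SortKey]): Ordering[Row] = new Ordering[Row] {
    def compare(a: Row, b: Row): Int =
      keys.iterator.map { key =>
        val l = key.child.eval(a).asInstanceOf[Int]
        val r = key.child.eval(b).asInstanceOf[Int]
        val c = Integer.compare(l, r)
        if (key.ascending) c else -c
      }.find(_ != 0).getOrElse(0)
  }
}
```

A generated version would presumably replace the recursive `eval` calls with code specialized to the expression tree, while exposing the same three shapes (`Row => Row`, `Row => Boolean`, `Ordering[Row]`) to the operator.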