[SPARK-12796] [SQL] Whole stage codegen #10735
Conversation
Test build #49298 has finished for PR 10735 at commit
def testWholeStage(values: Int): Unit = {
  val benchmark = new Benchmark("Single Int Column Scan", values)

  benchmark.addCase("Without whole stage codegen") { iter =>
we should also consider switching the order and see if the results change.
Benchmark will run each case 5 times in a row, so I think the order should not matter; otherwise all the results we see should be re-visited. cc @nongli
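The repetition argument can be illustrated with a tiny stand-in harness (all names here are hypothetical, not the actual Spark `Benchmark` utility): because every case is run several times back to back and only the best time is kept, JIT warm-up is amortized and the registration order should not bias the results.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntConsumer;

// Minimal stand-in for a micro-benchmark harness that runs every case
// several times in a row, keeping the best (minimum) time, so the
// order in which cases are registered should not matter.
public class MiniBenchmark {
    private static final int RUNS_PER_CASE = 5;
    private final List<String> names = new ArrayList<>();
    private final List<IntConsumer> cases = new ArrayList<>();
    private final int values;

    public MiniBenchmark(int values) { this.values = values; }

    public void addCase(String name, IntConsumer body) {
        names.add(name);
        cases.add(body);
    }

    /** Returns the best (minimum) time per case, in nanoseconds. */
    public long[] run() {
        long[] best = new long[cases.size()];
        for (int i = 0; i < cases.size(); i++) {
            best[i] = Long.MAX_VALUE;
            for (int r = 0; r < RUNS_PER_CASE; r++) {
                long start = System.nanoTime();
                cases.get(i).accept(values);
                best[i] = Math.min(best[i], System.nanoTime() - start);
            }
        }
        return best;
    }

    public static void main(String[] args) {
        MiniBenchmark b = new MiniBenchmark(1_000_000);
        b.addCase("sum", n -> { long s = 0; for (int j = 0; j < n; j++) s += j; });
        long[] t = b.run();
        System.out.println(t.length + " case(s) ran");
    }
}
```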
I tried it and it didn't make any difference.
FYI, on my machine with multiple runs the speedup is around 2.5X. I have fewer cores but a slightly higher frequency.
Intel(R) Core(TM) i7-5557U CPU @ 3.10GHz
I also got 2.5X for the first run; once I increased the number of rows it became 3.0X, because there is some overhead for both queries (Catalyst and the Spark job).
I don't think the order should matter, or at least the benchmark should be written that way.
@davies can you include the generated code for the benchmark?
@@ -106,7 +106,27 @@ abstract class SparkPlan extends QueryPlan[SparkPlan] with Logging with Serializ
  final def execute(): RDD[InternalRow] = {
    RDDOperationScope.withScope(sparkContext, nodeName, false, true) {
      prepare()
      doExecute()

      if (sqlContext.conf.wholeStageEnabled && supportCodeGen
It might be nice to actually make this a first-class concept in the planner: perhaps a phase that collects valid nodes and replaces them with an explicit operator called WholeStageCodeGen that holds all the operators that are getting collapsed. I'm not sure how hard that would be, but it would make debugging this a lot easier if it was visible from explain.
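The suggested planner phase could look roughly like the sketch below, which folds each maximal chain of codegen-capable operators into an explicit wrapper node so it shows up when the plan is printed. All class and field names here are hypothetical stand-ins, not Spark's actual API.

```java
// Hypothetical sketch: wrap maximal chains of codegen-capable operators
// in an explicit WholeStageCodeGen node, visible from explain().
class Plan {
    final String name;
    final boolean supportCodegen;
    final Plan child; // single-child plans only, for brevity

    Plan(String name, boolean supportCodegen, Plan child) {
        this.name = name;
        this.supportCodegen = supportCodegen;
        this.child = child;
    }

    @Override public String toString() {
        return child == null ? name : name + " -> " + child;
    }
}

class WholeStageCodeGen extends Plan {
    WholeStageCodeGen(String collapsed, Plan child) {
        super("WholeStageCodeGen[" + collapsed + "]", false, child);
    }
}

public class CollapseCodegen {
    // Top-down rewrite: consecutive codegen-capable operators are
    // folded into a single WholeStageCodeGen wrapper.
    static Plan collapse(Plan p) {
        if (p == null) return null;
        if (!p.supportCodegen) {
            return new Plan(p.name, false, collapse(p.child));
        }
        StringBuilder chain = new StringBuilder(p.name);
        Plan cur = p.child;
        while (cur != null && cur.supportCodegen) {
            chain.append(", ").append(cur.name);
            cur = cur.child;
        }
        return new WholeStageCodeGen(chain.toString(), collapse(cur));
    }

    public static void main(String[] args) {
        Plan scan = new Plan("Scan", false, null);
        Plan filter = new Plan("Filter", true, scan);
        Plan project = new Plan("Project", true, filter);
        // prints: WholeStageCodeGen[Project, Filter] -> Scan
        System.out.println(collapse(project));
    }
}
```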
If WholeStageCodeGen holds a list/tree of operators, we still want to see the details of the tree.
Right now, we can have a special flag for the operators that support it.
I would push back here. The whole goal of this initial effort is to put some infrastructure in place so that we can iterate quickly moving forward. Right now it's really hard, given any query plan, to understand what is happening under the covers, because the control flow isn't explicitly visible anymore.
Here's some code I wrote a while ago that collapses operators by stages when I wanted to refactor the operators to have the local execution mode: https://github.com/apache/spark/compare/master...rxin:local-iter3?expand=1
It is not complete (e.g. doesn't collapse inputs, or handle joins completely), but it could be a good start.
Test build #49344 has finished for PR 10735 at commit
Test build #49348 has finished for PR 10735 at commit
}

/**
 * The code generated for this query (some comments and empty lines are removed):
I think @nongli probably means pasting it here in the pull request, not pasting it into the source code...
Test build #49375 has finished for PR 10735 at commit
Test build #2382 has finished for PR 10735 at commit
Test build #2381 has finished for PR 10735 at commit
Test build #49388 has finished for PR 10735 at commit
protected InternalRow currentRow;
protected Iterator<InternalRow> input;

public boolean hasNext() {
If we control this API, can we simplify the iterator API?
I think combining hasNext()/next() into a single next() that returns null when there are no more rows is simpler. The current setup is very annoying: for example, it's not valid to call next() without calling hasNext().
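The proposed simplification can be sketched as follows (RowIterator and fromIterator are hypothetical names for illustration, not Spark's API):

```java
import java.util.Arrays;
import java.util.Iterator;

// Sketch of a single-method iterator: next() returns null when the
// input is exhausted, replacing the hasNext()/next() pair.
interface RowIterator<T> {
    T next(); // returns null when there are no more rows
}

public class RowIterators {
    // Adapter from a java.util.Iterator (e.g. the one an RDD hands us),
    // which is where the hasNext()/next() coupling comes from.
    static <T> RowIterator<T> fromIterator(Iterator<T> it) {
        return () -> it.hasNext() ? it.next() : null;
    }

    public static void main(String[] args) {
        RowIterator<Integer> rows = fromIterator(Arrays.asList(1, 2, 3).iterator());
        int sum = 0;
        // Consumption is a single-call loop: no hasNext() protocol to violate.
        for (Integer row = rows.next(); row != null; row = rows.next()) {
            sum += row;
        }
        System.out.println(sum); // prints 6
    }
}
```

One caveat with this shape is that null can no longer be a legal row value, which is why a sentinel-free two-method protocol is sometimes kept instead.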
+1, we talked about using a single next() function when we were talking about local iterators, and we all agreed that was a simpler interface.
This is the adapter for the RDD interface; we can't change the API without changing RDD, I think.
We could do this refactoring later, once we have a clearer picture of how all the things work.
Yea it might be hard to change the iterator interface as part of this pr, if it is intertwined with the RDD iterator.
Test build #49463 has finished for PR 10735 at commit
Test build #49467 has finished for PR 10735 at commit
Test build #49471 has finished for PR 10735 at commit
force-pushed from 13a422d to be494fb
Test build #49487 has finished for PR 10735 at commit
Test build #49493 has finished for PR 10735 at commit
Test build #49518 has finished for PR 10735 at commit
Test build #49519 has finished for PR 10735 at commit
Conflicts:
  sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodegenFallback.scala
  sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateMutableProjection.scala
  sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateOrdering.scala
  sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GeneratePredicate.scala
  sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateSafeProjection.scala
  sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeProjection.scala
  sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/GenerateColumnAccessor.scala
Test build #2386 has finished for PR 10735 at commit
Test build #2387 has finished for PR 10735 at commit
Test build #49521 has finished for PR 10735 at commit
I'm going to merge this into master; any new comments will be addressed by a follow-up PR.
This is the initial work for whole stage codegen; it supports Projection/Filter/Range, and we will continue working on this to support more physical operators.
A micro benchmark shows that a query with range, filter and projection could be 3X faster than before.
It's turned on by default. For a tree that has at least two chained plans, a WholeStageCodegen will be inserted into it; for example, the following plan
will be translated into
Here is the call graph to generate Java source for A and B (A supports codegen, but B does not):
A SparkPlan that supports codegen needs to implement doProduce() and doConsume():
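The produce/consume contract described above can be sketched as follows: produce() bubbles down to the leaf, the leaf emits the driving loop, and each operator pushes per-row code upward by calling consume() on its parent. Everything below is a simplified, hypothetical stand-in (the real implementation generates Java against Spark's row and expression machinery, not plain strings), and the emitted snippet references free variables like start, end, and append() purely for illustration.

```java
// Simplified sketch of the doProduce()/doConsume() code-generation
// contract. Generation starts at the top operator, the leaf emits the
// loop, and each row is pushed up via consume() on the parent.
abstract class CodegenPlan {
    CodegenPlan parent;
    CodegenPlan child;

    String produce() { return doProduce(); }

    // Called by the child with the code fragment for the current row.
    String consume(String row) { return doConsume(row); }

    abstract String doProduce();
    abstract String doConsume(String row);
}

class RangePlan extends CodegenPlan { // leaf: generates the driving loop
    @Override String doProduce() {
        return "for (long i = start; i < end; i++) {\n"
             + parent.consume("i") + "}\n";
    }
    @Override String doConsume(String row) {
        throw new UnsupportedOperationException("leaf has no child");
    }
}

class FilterPlan extends CodegenPlan {
    @Override String doProduce() { return child.produce(); }
    @Override String doConsume(String row) {
        return "  if ((" + row + " % 2) == 0) {\n  "
             + parent.consume(row) + "  }\n";
    }
}

class ProjectPlan extends CodegenPlan { // top: emits the result row
    @Override String doProduce() { return child.produce(); }
    @Override String doConsume(String row) {
        return "  append(" + row + " + 1);\n";
    }
}

public class ProduceConsumeDemo {
    public static void main(String[] args) {
        RangePlan range = new RangePlan();
        FilterPlan filter = new FilterPlan();
        ProjectPlan project = new ProjectPlan();
        range.parent = filter;   filter.child = range;
        filter.parent = project; project.child = filter;
        // Prints one fused loop containing the filter and projection.
        System.out.println(project.produce());
    }
}
```

The point of the pattern is visible in the output: instead of three iterators pulling rows through virtual calls, the three operators collapse into one loop.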