[SPARK-2973] [SQL] Avoid spark job for take on all ExecutedCommands #5247
Conversation
Test build #29355 has started for PR 5247 at commit
Test build #29355 has finished for PR 5247 at commit
Test PASSed.
Can we use LocalRelation?
Yes, that's also what I was going to say. As described in the JIRA ticket title, using a
OK, got it.
Test build #29365 has started for PR 5247 at commit
Test build #29365 has finished for PR 5247 at commit
Test PASSed.
@rxin @liancheng is this OK?
@liancheng can you look at this? Thanks.
/cc @liancheng
Thanks for working on this! I have one issue with the current implementation: it is essentially doing query planning inside of [[DataFrame]]. If you look, this is very close to the logic that can be found here: spark/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala (line 263 in a95043b).
I'd rather not spread this logic out over several different places. Really, it seems to me like
What do you think?
Hi @marmbrus. Second, your suggestion is really useful; how about changing it as follows:
will transform LocalRelation to LocalTableScan
You are correct, thanks for clarifying. Query planning was not the right phrase; really my point was that ideally the logic in DataFrame would handle only ensuring commands are executed eagerly. Hopefully this can be as simple as matching on
I like your second point, but I'd make one change. Let's not put it in the optimizer, since that feels a little weird to me (it's not really an optimization). Instead, in the query planner let's go directly from
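A sketch of the planner rule being suggested here. This is illustrative only: `Row`, `RunnableCommand`, and `LocalTableScan` are simplified stand-ins for Spark's real classes so the idea runs on its own, and, as noted further down in the thread, applying this rule naively can execute a command twice.

```scala
object PlannerSketch {
  case class Row(values: Any*)

  trait LogicalPlan
  trait RunnableCommand extends LogicalPlan { def run(): Seq[Row] }

  // A physical node that already holds its rows locally; scanning it
  // never needs a Spark job.
  case class LocalTableScan(rows: Seq[Row])

  // The proposed rule: when the planner sees a RunnableCommand, it runs
  // the command immediately and plans its output as a LocalTableScan.
  def planCommand(plan: LogicalPlan): Option[LocalTableScan] = plan match {
    case cmd: RunnableCommand => Some(LocalTableScan(cmd.run()))
    case _                    => None
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical command standing in for e.g. "show tables".
    object ShowTables extends RunnableCommand {
      def run(): Seq[Row] = Seq(Row("t1"), Row("t2"))
    }
    val scan = planCommand(ShowTables).get
    assert(scan.rows.length == 2) // output is already local; no job needed
    println(scan.rows.length)
  }
}
```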
Yeah, it's good to go directly from RunnableCommand to LocalTableScan; I am updating this.
Test build #30216 has started for PR 5247 at commit
Test build #30216 has finished for PR 5247 at commit
Test FAILed.
Test build #30317 has started for PR 5247 at commit
Here is a problem that needs to be fixed: the DDL command will be executed twice.
Test build #30317 has finished for PR 5247 at commit
Test FAILed.
_: WriteToFile =>
case _: Command =>
  queryExecution.sparkPlan.executeCollect()
  queryExecution.analyzed
This will lead to executing the command twice when we perform an action operator on the DataFrame, such as:
sql(s"CREATE DATABASE xxx").count()
The first execution is the eager execution when constructing the DataFrame; the second is when count is executed, which also triggers execution.
So maybe we still need to construct a LocalRelation here?
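A self-contained sketch of the fix this comment points toward. The names mirror Spark's classes but the code is a toy model, not Spark's actual implementation: the command runs exactly once, eagerly, while the DataFrame is constructed, and its output is captured in a LocalRelation so a later count() does not trigger a second execution.

```scala
object EagerCommandSketch {
  case class Row(values: Any*)

  trait LogicalPlan
  trait Command extends LogicalPlan { def run(): Seq[Row] }
  case class LocalRelation(rows: Seq[Row]) extends LogicalPlan

  // Simplified DataFrame: a Command is run once at construction time and
  // replaced by a LocalRelation holding its output.
  class DataFrame(plan: LogicalPlan) {
    private val logical: LogicalPlan = plan match {
      case c: Command => LocalRelation(c.run())
      case other      => other
    }
    // Actions read the already-materialized rows; the command is not re-run.
    def count(): Long = logical match {
      case LocalRelation(rows) => rows.length.toLong
      case _                   => 0L
    }
  }

  def main(args: Array[String]): Unit = {
    var executions = 0
    // Hypothetical command standing in for CREATE DATABASE.
    val createDatabase = new Command {
      def run(): Seq[Row] = { executions += 1; Seq.empty }
    }
    val df = new DataFrame(createDatabase) // first (and only) execution
    df.count()                             // reuses the LocalRelation
    assert(executions == 1)
    println(executions)
  }
}
```

The design point is that eagerness lives in DataFrame construction, so any later action sees only local rows.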
@marmbrus I think we cannot go directly from RunnableCommand to LocalTableScan in the planner; that will lead to executing the command twice, as I described before. So here are two other ways to do this: your idea?
Test build #30711 has started for PR 5247 at commit
Test build #30711 has finished for PR 5247 at commit
Test FAILed.
Jenkins failed.
Jenkins, retest this please.
Test build #30715 has started for PR 5247 at commit
Test build #30715 has finished for PR 5247 at commit
Test PASSed.
@marmbrus any more comments?
Retest this please.
Merged build triggered.
Merged build started.
Test build #31679 has started for PR 5247 at commit
Test build #31679 has finished for PR 5247 at commit
Merged build finished. Test PASSed.
Test PASSed.
/cc @marmbrus
Merged build triggered.
Merged build started.
Test build #32289 has started for PR 5247 at commit
Test build #32289 has finished for PR 5247 at commit
Merged build finished. Test PASSed.
Test PASSed.
ping @marmbrus
This implementation is still making changes to the query plan in
sql("show tables").take(1) still starts a Spark job on the master branch.
If we can do that in the query planner, that sounds reasonable to me. It
Closing this; I will file a new PR when I have time to fix the test failure.
I have manually tested this with
It will not start a Spark job.